How to Select the Right LLM for Your Project: 8 Essential Criteria from the Microsoft Curriculum
To select the right LLM for your project, evaluate eight core criteria: task complexity, domain specificity, performance requirements, cost constraints, ecosystem availability, safety needs, regulatory compliance, and data sensitivity.
Choosing between GPT-4, Claude, Llama 3, or a fine-tuned open-source model requires more than benchmarking accuracy. The microsoft/generative-ai-for-beginners repository provides a comprehensive framework to help you select the right LLM for your project, starting with the fundamental question posed in 02-exploring-and-comparing-different-llms/README.md: "How to select the right model for your use case."
Core Criteria to Select the Right LLM for Your Project
Task Complexity and Model Capability
Complex reasoning or code generation often requires larger, instruction-tuned models such as GPT-4 or Claude. Simple classification or summarization tasks can be handled efficiently by smaller models with fewer parameters. According to the Enhanced Features Roadmap in docs/ENHANCED_FEATURES_ROADMAP.md, model selection based on task complexity is a primary architectural consideration.
Domain Specificity and Data Sensitivity
Domain-specific fine-tuning or retrieval-augmented generation (RAG) improves relevance when dealing with specialized content. Confidential data may require on-premise or open-source models to maintain data sovereignty. The fine-tuning lesson in 18-fine-tuning/README.md highlights when to adapt models versus using off-the-shelf solutions.
Performance, Latency, and Throughput
Real-time UI applications demand low latency, typically achieved with smaller models (7-30B parameters) or optimized endpoints. Batch processing pipelines can tolerate higher latency in exchange for greater accuracy from larger models. The LLMOps discussion in 14-the-generative-ai-application-lifecycle/README.md covers these operational trade-offs.
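To compare candidates on latency, it helps to time the same prompts against each one. The following is a minimal sketch, not part of the curriculum: `measure_latency` and `fake_model` are hypothetical names, and `fake_model` is a stand-in that sleeps briefly so the example runs without network access. Replace it with a function that calls your real endpoint.

```python
import statistics
import time
from typing import Callable, List


def measure_latency(call_model: Callable[[str], str], prompts: List[str], runs: int = 5) -> dict:
    """Time repeated calls and return the median and worst-case latency in seconds."""
    samples = []
    for prompt in prompts:
        for _ in range(runs):
            start = time.perf_counter()
            call_model(prompt)
            samples.append(time.perf_counter() - start)
    return {
        "calls": len(samples),
        "p50_s": statistics.median(samples),
        "max_s": max(samples),
    }


def fake_model(prompt: str) -> str:
    # Hypothetical stand-in for a real client call; it only simulates a delay.
    time.sleep(0.01)
    return prompt.upper()


stats = measure_latency(fake_model, ["hello", "summarize this"], runs=3)
print(stats)
```

Running the same helper against a hosted endpoint and a local model gives a like-for-like comparison, provided the prompts and output lengths are similar.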
Cost Per 1,000 Tokens
Budget constraints directly influence model choice between premium APIs like Azure OpenAI GPT-4 and cost-effective alternatives such as GPT-3.5-Turbo or open-source models. Self-hosting open-source models can reduce per-token costs but requires infrastructure investment. The open-source models lesson in 16-open-source-models/README.md lists cost-effective deployment options.
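A rough monthly estimate makes the cost trade-off concrete. The sketch below assumes separate per-1,000-token prices for input and output. The model names and prices are placeholders invented for illustration, not real rates, so substitute the figures from your provider's current pricing page.

```python
def estimate_monthly_cost(requests_per_day, input_tokens, output_tokens,
                          input_price_per_1k, output_price_per_1k, days=30):
    """Estimate monthly spend from average token counts and per-1,000-token prices."""
    per_request = (
        (input_tokens / 1000) * input_price_per_1k
        + (output_tokens / 1000) * output_price_per_1k
    )
    return per_request * requests_per_day * days


# Placeholder (input, output) prices per 1,000 tokens. These are NOT real rates.
PLACEHOLDER_PRICES = {
    "premium-model": (0.01, 0.03),
    "budget-model": (0.0005, 0.0015),
}

for name, (in_price, out_price) in PLACEHOLDER_PRICES.items():
    cost = estimate_monthly_cost(10_000, 500, 200, in_price, out_price)
    print(f"{name}: ${cost:,.2f} per month")
```

For self-hosted models, replace the per-token prices with an amortized infrastructure cost divided by expected token volume.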
Availability and Ecosystem Integration
Integration libraries such as the Azure SDK, OpenAI Python client, and LangChain affect developer velocity. Azure OpenAI, OpenAI, Hugging Face, and on-premise deployments each offer different ecosystem advantages. The function calling lesson in 11-integrating-with-function-calling/README.md demonstrates SDK usage patterns that inform platform selection.
Safety Guardrails and Content Filtering
Models with built-in safety features, such as Azure OpenAI's content moderation filters, simplify compliance with enterprise security standards. Hallucination mitigation techniques vary between model families and deployment options. The securing AI applications lesson in 13-securing-ai-applications/README.md outlines security concerns that should influence model selection.
Regulatory and Licensing Constraints
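Open-weight models differ in how permissive their licenses are. Mistral 7B is released under Apache 2.0, whereas Llama 3 ships under Meta's community license, which permits commercial use but attaches conditions, so review the terms before redistributing a model or deploying it in a restricted environment. Commercial APIs impose usage restrictions and data processing agreements that may conflict with regulatory requirements. The open-source models lesson discusses licensing implications for enterprise deployment.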
A 5-Step Decision Flow to Select the Right LLM
1. Define the primary task. Determine whether your use case requires text generation, summarization, code completion, conversational chat, or multimodal output. This initial classification immediately narrows the suitable model families.
2. Identify operational constraints. Document your latency requirements (real-time UI versus batch processing), budget limits, data privacy requirements, and regulatory compliance needs. These constraints eliminate models that cannot meet your operational reality.
3. Map constraints to model families. Match your requirements to specific deployment options:
   - High-quality, general-purpose: Azure OpenAI gpt-4o, OpenAI gpt-4
   - Cost-effective, fast: Azure OpenAI gpt-35-turbo, open-source Mistral-7B
   - Domain-specific: Fine-tuned base models via Azure OpenAI or Hugging Face
   - On-premise / air-gapped: Local deployment with Ollama or Hugging Face (Llama-3.1, Phi-3)
4. Prototype and evaluate. Deploy a candidate model and measure it against the LLM evaluation checklist: accuracy on your specific tasks, end-to-end latency, actual token costs, and safety compliance. Use the code patterns below to inspect available models programmatically.
5. Iterate based on metrics. If the prototype fails on any dimension, for example hallucination rates above your threshold or inference costs over budget, return to the criteria above and select the next-best model family.
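The constraint-mapping step in this flow can be sketched as a simple filter over candidate profiles. Everything here is illustrative: the `Candidate` and `Constraints` classes, the generic candidate names, and the latency and cost-tier numbers are assumptions made up for the example, not measurements of any real model.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    name: str
    typical_latency_ms: int   # placeholder value, not a benchmark
    cost_tier: int            # 1 = cheapest, 3 = most expensive (illustrative scale)
    runs_on_premise: bool


@dataclass
class Constraints:
    latency_budget_ms: int
    max_cost_tier: int
    requires_on_premise: bool


def shortlist(candidates: List[Candidate], constraints: Constraints) -> List[str]:
    """Drop candidates that violate any constraint, then rank the rest by cost and latency."""
    keep = [
        c for c in candidates
        if c.typical_latency_ms <= constraints.latency_budget_ms
        and c.cost_tier <= constraints.max_cost_tier
        and (c.runs_on_premise or not constraints.requires_on_premise)
    ]
    keep.sort(key=lambda c: (c.cost_tier, c.typical_latency_ms))
    return [c.name for c in keep]


# Hypothetical candidate profiles with made-up numbers.
CANDIDATES = [
    Candidate("large-hosted", 1500, 3, False),
    Candidate("small-hosted", 600, 2, False),
    Candidate("small-local", 400, 1, True),
    Candidate("medium-local", 800, 1, True),
]

print(shortlist(CANDIDATES, Constraints(1000, 2, requires_on_premise=False)))
print(shortlist(CANDIDATES, Constraints(1000, 2, requires_on_premise=True)))
```

The shortlist only narrows the field; the surviving candidates still need to be prototyped and measured on your own tasks.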
Inspecting LLM Options with Code
Before committing to a model, programmatically inspect available options to verify capabilities, token limits, and deployment status. The following examples from the microsoft/generative-ai-for-beginners curriculum demonstrate how to audit both cloud-hosted and local models.
Listing Azure OpenAI Models
Use the AzureOpenAI client from the OpenAI Python SDK to enumerate the models available to your Azure OpenAI resource. This helps confirm which model families you can work with before making API calls. Note that this call returns models, not your deployment names; deployments are managed through the Azure portal, the Azure CLI, or the management SDK.
import json
import os

import openai
from dotenv import load_dotenv

load_dotenv()  # loads variables from a local .env file into the environment

client = openai.AzureOpenAI(
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
    api_version="2024-02-01",
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
)

# List the models available to your Azure OpenAI resource.
# Deployment names are not returned by this call.
models = client.models.list()
print(json.dumps([m.id for m in models], indent=2))
The output is a list of model IDs (such as gpt-35-turbo or gpt-4o). Token limits and pricing tiers are not part of this response; look them up in the Azure OpenAI documentation when assessing the Performance and Cost criteria.
Auditing Local Models with Ollama
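For on-premise or offline requirements, inspect locally available models using Ollama. This script parses the output of the ollama list command to confirm which versions of Llama, Mistral, or Phi are ready for inference.
import subprocess

def list_ollama_models():
    """Return the names of the models installed in the local Ollama registry."""
    # check=True raises CalledProcessError if the ollama CLI exits with an error
    result = subprocess.run(
        ["ollama", "list"], capture_output=True, text=True, check=True
    )
    # Skip the header row; the first column of each remaining row is the model name
    rows = result.stdout.strip().splitlines()[1:]
    return [row.split()[0] for row in rows if row.strip()]

print("Available local models:", list_ollama_models())
Use this to confirm which local models you can benchmark against your Latency expectations (local inference avoids network round-trips) and your Regulatory requirements (prompts and outputs stay on the host machine).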
Evaluating RAG Integration with Azure AI Search (formerly Azure Cognitive Search)
Test how a candidate model performs with retrieval-augmented generation (RAG) to determine if a smaller model can meet your Domain Specificity requirements when augmented with external knowledge.
import os

from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient

# Assumes 'client' is initialized as shown in the Azure OpenAI example above
search_client = SearchClient(
    endpoint=os.getenv("AZURE_SEARCH_ENDPOINT"),
    index_name="genai-index",
    credential=AzureKeyCredential(os.getenv("AZURE_SEARCH_KEY")),
)

def retrieve_context(question):
    # Fetch the top 3 matching documents; assumes the index has a "content" field
    results = search_client.search(question, top=3)
    return "\n".join(r["content"] for r in results)

def ask_gpt(prompt):
    # AZURE_OPENAI_DEPLOYMENT holds the name of your chat model deployment
    response = client.chat.completions.create(
        model=os.getenv("AZURE_OPENAI_DEPLOYMENT"),
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt},
        ],
    )
    return response.choices[0].message.content

question = "What is the difference between fine-tuning and RAG?"
context = retrieve_context(question)
answer = ask_gpt(f"Context:\n{context}\n\nQuestion: {question}")
print(answer)
This pattern helps validate whether a smaller, cost-effective model can handle a complex task when paired with a retrieval system, potentially reducing inference costs compared with using a larger foundation model alone.
Key Curriculum Resources
The microsoft/generative-ai-for-beginners repository contains detailed lessons that expand on each selection criterion:
| File | Coverage |
|---|---|
| 02-exploring-and-comparing-different-llms/README.md | Introduces the core selection framework and comparison methodology. |
| 14-the-generative-ai-application-lifecycle/README.md | Covers LLMOps, latency optimization, and lifecycle management. |
| 16-open-source-models/README.md | Details open-source alternatives, licensing, and cost structures. |
| 18-fine-tuning/README.md | Explains when to fine-tune versus use prompt engineering or RAG. |
| 13-securing-ai-applications/README.md | Addresses safety guardrails and security requirements. |
| docs/ENHANCED_FEATURES_ROADMAP.md | References task-complexity based selection strategies. |
| 11-integrating-with-function-calling/README.md | Demonstrates SDK integration patterns for ecosystem evaluation. |
Summary
- Task complexity determines whether you need a large reasoning model (GPT-4) or a smaller efficient model (Mistral-7B).
- Data sensitivity and domain requirements dictate whether to use proprietary APIs, fine-tuning, or RAG with open-source models.
- Latency and throughput requirements filter models by size and deployment architecture (cloud API vs. local inference).
- Cost per token and infrastructure budget determine the trade-off between premium models and self-hosted alternatives.
- Safety, regulatory, and licensing constraints may mandate specific model families or deployment environments (Azure OpenAI vs. on-premise Llama 3).
Frequently Asked Questions
What is the difference between using a proprietary LLM and an open-source model?
Proprietary models like Azure OpenAI's GPT-4 offer managed infrastructure, built-in safety filters, and enterprise support, but require sending data to third-party APIs. Open-source models like Llama 3 or Mistral allow on-premise deployment, full data privacy, and customization via fine-tuning, but require you to manage infrastructure and security. The 16-open-source-models/README.md lesson details licensing and cost considerations for this decision.
How do I balance cost and performance when selecting an LLM?
Start with a smaller, cost-effective model (such as GPT-3.5-Turbo or Mistral-7B) and measure its performance on your specific tasks. If accuracy is insufficient, implement RAG to augment the smaller model with domain knowledge before upgrading to a larger, more expensive model. The 14-the-generative-ai-application-lifecycle/README.md discusses LLMOps strategies for optimizing this cost-performance trade-off.
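One way to apply this advice is to score candidates in order of cost and stop at the first that clears your accuracy bar. The sketch below is an assumption-laden illustration: the three model functions are hard-coded stubs rather than real LLM calls, the test cases are invented, and keyword matching is a deliberately crude stand-in for a proper evaluation metric.

```python
from typing import Callable, List, Optional, Tuple

TestCase = Tuple[str, str]  # (prompt, keyword expected somewhere in the answer)


def accuracy(model: Callable[[str], str], cases: List[TestCase]) -> float:
    """Fraction of cases whose answer contains the expected keyword (case-insensitive)."""
    hits = sum(1 for prompt, keyword in cases if keyword.lower() in model(prompt).lower())
    return hits / len(cases)


def pick_cheapest_passing(candidates, cases, threshold) -> Optional[str]:
    """Candidates must be ordered cheapest first; return the first one meeting the threshold."""
    for name, model in candidates:
        score = accuracy(model, cases)
        print(f"{name}: accuracy={score:.2f}")
        if score >= threshold:
            return name
    return None


# Hypothetical stubs standing in for real model calls.
def small_model(prompt: str) -> str:
    return "I am not sure."


def small_model_with_rag(prompt: str) -> str:
    knowledge = {"capital of france": "Paris", "refund window": "30 days"}
    for key, fact in knowledge.items():
        if key in prompt.lower():
            return f"According to the retrieved context: {fact}."
    return "I am not sure."


def large_model(prompt: str) -> str:
    return "Paris. Refunds are accepted within 30 days."


CASES = [
    ("What is the capital of France?", "Paris"),
    ("What is our refund window?", "30 days"),
]
CANDIDATES = [
    ("small", small_model),
    ("small + RAG", small_model_with_rag),
    ("large", large_model),
]

chosen = pick_cheapest_passing(CANDIDATES, CASES, threshold=0.9)
print("Selected:", chosen)
```

In practice the test set should come from your own workload, and the threshold should reflect the cost of a wrong answer in your application.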
When should I choose fine-tuning over retrieval-augmented generation (RAG)?
Choose fine-tuning when you need to permanently change the model's behavior, tone, or knowledge for a specific domain, and you have high-quality training data available. Choose RAG when you need to provide the model with access to frequently changing knowledge bases or private documents without retraining. The 18-fine-tuning/README.md lesson provides detailed guidance on making this architectural choice.
Can I switch LLMs after deploying my application?
Yes, but switching requires abstracting your LLM calls behind a consistent interface (such as LangChain or a custom adapter layer) to minimize code changes. Evaluate the new model against your original evaluation checklist—particularly around token limits, response format, and safety filters—before migrating production traffic. The 11-integrating-with-function-calling/README.md demonstrates SDK patterns that facilitate such migrations.
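A custom adapter layer can be as small as one shared interface that every backend implements. The following sketch uses a `typing.Protocol` for that purpose. The `ChatModel` interface and both backends are hypothetical and return canned text; a real backend would wrap an SDK call inside `complete`.

```python
from typing import Protocol


class ChatModel(Protocol):
    """Interface the application depends on, independent of any provider SDK."""

    def complete(self, system: str, user: str) -> str:
        ...


class EchoModel:
    # Hypothetical backend used only to make the example runnable.
    def complete(self, system: str, user: str) -> str:
        return f"[echo] {user}"


class UppercaseModel:
    # A second hypothetical backend with different behavior.
    def complete(self, system: str, user: str) -> str:
        return user.upper()


def answer_question(model: ChatModel, question: str) -> str:
    """Application code calls the interface, so swapping backends needs no changes here."""
    return model.complete("You are a helpful assistant.", question)


print(answer_question(EchoModel(), "hi"))
print(answer_question(UppercaseModel(), "hi"))
```

Because `answer_question` depends only on the interface, migrating to a new model means writing one new backend class and rerunning your evaluation checklist against it.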