Fine-Tuning LLMs vs Prompt Engineering: Key Differences and When to Use Each

Prompt engineering modifies the input text sent to a base model without changing its weights, while fine-tuning updates the model's parameters through training on curated datasets to create a persistent custom model.

The microsoft/generative-ai-for-beginners repository covers both approaches in lessons 04-prompt-engineering-fundamentals/README.md and 18-fine-tuning/README.md. Choosing between fine-tuning and prompt engineering affects your project's cost, latency, and maintenance requirements. The two techniques operate on different layers of the generative AI stack: one manipulates the model's input context, while the other permanently alters the model's internal representations.

What is Prompt Engineering?

Prompt engineering is the practice of designing and optimizing the text sent to a large language model to elicit desired outputs. As described in 04-prompt-engineering-fundamentals/README.md, this approach involves constructing prompts with system messages, few-shot examples, and contextual cues that guide the model's behavior without modifying any underlying parameters.

The only logic added exists in your prompt-construction code, which runs immediately before inference. Because the base model weights remain static, you can switch between different prompt strategies instantly by changing the input text.

Implementation Example: Few-Shot Prompting

The following example from the repository's prompt engineering lesson demonstrates how to teach a model to translate English to Spanish using in-context examples:

import os
import openai

# Note: openai.ChatCompletion is the legacy interface and requires openai<1.0.
openai.api_key = os.getenv("OPENAI_API_KEY")
model = "gpt-3.5-turbo"

# Few-shot prompt that teaches the model to translate English to Spanish
prompt = [
    {"role": "system", "content": "You are a helpful translation assistant."},
    {"role": "user", "content": "Translate to Spanish: Hello, how are you?"},
    {"role": "assistant", "content": "Hola, ¿cómo estás?"},
    {"role": "user", "content": "Translate to Spanish: The sky is blue."}
]

response = openai.ChatCompletion.create(
    model=model,
    messages=prompt,
    temperature=0.2,
)

print(response["choices"][0]["message"]["content"])
# Example output (may vary between runs): "El cielo es azul."

This approach requires no data collection beyond crafting the prompt text. The examples act as implicit guidance without changing the model itself, as detailed in the "Prompt Construction" and "Examples" sections of 04-prompt-engineering-fundamentals/README.md.
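The pattern in the example generalizes to any set of in-context examples. The following is a minimal sketch of a reusable prompt builder; `build_few_shot_messages` is a hypothetical helper name introduced here for illustration, not a function from the repository.

```python
from typing import Dict, List, Tuple

Message = Dict[str, str]


def build_few_shot_messages(
    system: str,
    examples: List[Tuple[str, str]],
    query: str,
) -> List[Message]:
    """Assemble a chat prompt: system message, worked examples, then the new query."""
    messages: List[Message] = [{"role": "system", "content": system}]
    for user_text, assistant_text in examples:
        # Each example is a user turn followed by the desired assistant turn.
        messages.append({"role": "user", "content": user_text})
        messages.append({"role": "assistant", "content": assistant_text})
    # The final user turn is the one the model is expected to complete.
    messages.append({"role": "user", "content": query})
    return messages


messages = build_few_shot_messages(
    system="You are a helpful translation assistant.",
    examples=[("Translate to Spanish: Hello, how are you?", "Hola, ¿cómo estás?")],
    query="Translate to Spanish: The sky is blue.",
)
print(len(messages))
```

Swapping prompt strategies then amounts to passing a different `system` string or `examples` list, with no change to the model.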

What is Fine-Tuning?

Fine-tuning creates a custom model by updating the base model's weights through additional training on a curated dataset of input-output pairs. As implemented in 18-fine-tuning/README.md, this process produces a new model artifact that persists the learned behavior across all subsequent inference calls.

The training loop lives entirely outside the request-response path. After fine-tuning, you deploy the resulting model as its own endpoint, allowing you to send much simpler prompts (often just the user query) while receiving specialized outputs.

Implementation Example: Using a Fine-Tuned Model

After creating a fine-tuned deployment named gpt-35-turbo-finetuned following the Azure tutorial in the fine-tuning lesson, you interact with it as a specialized endpoint:

import os
import openai

# Legacy SDK (openai<1.0): Azure requires api_type to be set explicitly
openai.api_type = "azure"
openai.api_key = os.getenv("AZURE_OPENAI_API_KEY")
openai.api_base = os.getenv("AZURE_OPENAI_ENDPOINT")
openai.api_version = "2023-05-15"

# The fine-tuned deployment behaves as a specialized "recipe assistant"
deployment = "gpt-35-turbo-finetuned"

response = openai.ChatCompletion.create(
    engine=deployment,               # Azure OpenAI uses the `engine` (deployment) name
    messages=[{"role": "user", "content": "Give me a quick vegan lasagna recipe."}],
    temperature=0.7,
)

print(response["choices"][0]["message"]["content"])

Because the model has already learned the mapping from user query to recipe format during training, the prompt can be concise. The "How can we fine-tune a pre-trained model?" section in 18-fine-tuning/README.md explains how this embeds domain knowledge directly into the model weights.

Core Differences Between Fine-Tuning and Prompt Engineering

Understanding the architectural implications helps determine which approach suits your use case.

What Changes:

  • Prompt engineering only alters the text sent to the model. The underlying weights stay exactly the same.
  • Fine-tuning updates the model's parameters, producing a new custom model artifact.

Pipeline Location:

  • Prompt engineering happens immediately before inference. The prompt is built, possibly enriched with examples or system messages, then sent to the inference endpoint of a base LLM.
  • Fine-tuning occurs during a separate training step. After completion, the new model is deployed and used for inference.

Cost Model:

  • Prompt engineering incurs no extra compute cost beyond normal request fees. The trade-off is higher token usage because more context and examples are added to each call.
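  • Fine-tuning requires a one-time compute cost for the training job (GPU/CPU hours) and storage for the fine-tuned model. Afterwards, inference can be cheaper because fewer tokens are needed per request, provided the provider's pricing for the fine-tuned model (per-token rates or hosting fees) does not offset the saving.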
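
The cost trade-off can be made concrete with a rough break-even estimate. The sketch below is an illustration under simplifying assumptions: it counts only prompt tokens, ignores completion tokens and hosting fees, and uses placeholder prices that are not real provider rates. `break_even_requests` is a hypothetical helper name.

```python
import math


def break_even_requests(
    training_cost: float,
    prompt_tokens_few_shot: int,
    prompt_tokens_fine_tuned: int,
    price_per_1k_base: float,
    price_per_1k_fine_tuned: float,
) -> float:
    """Return the number of requests after which fine-tuning becomes cheaper.

    Returns math.inf when a fine-tuned request is not cheaper per call,
    in which case the training cost is never recovered.
    """
    cost_few_shot = prompt_tokens_few_shot / 1000 * price_per_1k_base
    cost_fine_tuned = prompt_tokens_fine_tuned / 1000 * price_per_1k_fine_tuned
    saving_per_request = cost_few_shot - cost_fine_tuned
    if saving_per_request <= 0:
        return math.inf
    return training_cost / saving_per_request


# Placeholder numbers for illustration only, not real provider prices.
n = break_even_requests(
    training_cost=50.0,
    prompt_tokens_few_shot=1200,
    prompt_tokens_fine_tuned=200,
    price_per_1k_base=0.001,
    price_per_1k_fine_tuned=0.003,
)
print(f"Break-even after about {n:,.0f} requests")
```

If the fine-tuned model's per-token price is high enough relative to the tokens saved, the function returns infinity, meaning the shorter prompt alone never pays back the training cost.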

Data Requirements:

  • Prompt engineering requires no data collection; you simply craft better prompts.
  • Fine-tuning requires a training set of input-output pairs, often hundreds to thousands of examples, plus careful data cleaning as noted in 18-fine-tuning/README.md.
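
As a rough illustration of what preparing such a training set can involve, the sketch below cleans a list of input-output pairs and serializes them as JSON Lines in a chat-style layout. The layout (one object per line with a `messages` list) is an assumption modeled on common chat fine-tuning formats; check your provider's documentation for the exact schema. The helper names and sample data are hypothetical.

```python
import json
from typing import Iterable, List, Tuple


def clean_pairs(pairs: Iterable[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Strip whitespace, drop pairs with an empty side, and remove exact duplicates."""
    seen = set()
    cleaned = []
    for prompt, completion in pairs:
        pair = (prompt.strip(), completion.strip())
        if not pair[0] or not pair[1] or pair in seen:
            continue
        seen.add(pair)
        cleaned.append(pair)
    return cleaned


def to_chat_jsonl(pairs: Iterable[Tuple[str, str]]) -> str:
    """Serialize pairs as one JSON object per line in a chat-style layout."""
    lines = []
    for prompt, completion in pairs:
        record = {
            "messages": [
                {"role": "user", "content": prompt},
                {"role": "assistant", "content": completion},
            ]
        }
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines)


# Made-up sample data: one valid pair, one near-duplicate, one with an empty output.
raw = [
    ("Give me a quick vegan lasagna recipe.", "Layer noodles, marinara, and cashew ricotta."),
    ("Give me a quick vegan lasagna recipe. ", "Layer noodles, marinara, and cashew ricotta."),
    ("Suggest a gluten-free breakfast.", ""),
]
jsonl = to_chat_jsonl(clean_pairs(raw))
print(jsonl)
```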

Skill Level:

  • Prompt engineering is accessible to most developers and domain experts.
  • Fine-tuning requires ML-engineering expertise, familiarity with hyper-parameters, and access to a training environment (GPU, Azure OpenAI, or Hugging Face).

Portability:

  • Prompt engineering code can be moved between providers (OpenAI, Azure, Hugging Face) with minor adjustments.
  • Fine-tuned artifacts are provider-specific. An Azure OpenAI fine-tuned model differs from an OpenAI-hosted fine-tuned model.
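
One example of the "minor adjustments" involved: in the legacy SDK calls used in this article, OpenAI identifies the target with `model` while Azure OpenAI uses `engine` (the deployment name). A minimal sketch of isolating that difference follows; `chat_kwargs` is a hypothetical helper, and it covers only these two providers.

```python
from typing import Any, Dict, List


def chat_kwargs(
    provider: str,
    name: str,
    messages: List[Dict[str, str]],
    **options: Any,
) -> Dict[str, Any]:
    """Build keyword arguments for a chat-completion call in the style used in this article."""
    if provider == "openai":
        target = {"model": name}    # OpenAI: model ID
    elif provider == "azure":
        target = {"engine": name}   # Azure OpenAI: deployment name
    else:
        raise ValueError(f"unsupported provider: {provider}")
    return {**target, "messages": messages, **options}


messages = [{"role": "user", "content": "Write a haiku about sunrise."}]
print(chat_kwargs("openai", "gpt-3.5-turbo", messages, temperature=0.5))
print(chat_kwargs("azure", "gpt-35-turbo-finetuned", messages, temperature=0.5))
```

The prompt itself is identical in both calls; only the routing argument changes, which is why prompt-side logic ports more easily than a fine-tuned artifact.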

Architectural Pipeline Comparison

The microsoft/generative-ai-for-beginners repository illustrates two distinct architectural paths for shaping model behavior.

Prompt Engineering Path

  1. The user or application builds a prompt that may include system messages, few-shot examples, and cues.
  2. The constructed prompt is sent to the inference endpoint of a base LLM.
  3. The model returns a completion based on the provided context.

This path relies on shared/python/api_utils.py and shared/python/env_utils.py for reusable request logic and secure API key handling, as seen in the repository's code examples. Because the base model retains nothing between calls, the full prompt (instructions and examples) must be resent with every request.

Fine-Tuning Path

  1. Curate a dataset of input-to-desired-output pairs.
  2. Launch a fine-tuning job (via Azure OpenAI, OpenAI, or Hugging Face) that updates the base model weights.
  3. Deploy the newly minted fine-tuned model as its own endpoint.
  4. At inference time, send a simplified prompt (often just the user query) to the fine-tuned endpoint.

The training loop exists outside the request-response path. Because fine-tuning modifies the model itself, the resulting behavior is persistent across all calls.
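
Step 2 could be sketched as follows, assuming the OpenAI Python SDK version 1 or later, whose client interface differs from the legacy `openai.ChatCompletion` calls used elsewhere in this article. `launch_fine_tune` is a hypothetical helper name, and the file path and base model in the usage comment are placeholders; Azure OpenAI and Hugging Face use different workflows.

```python
def launch_fine_tune(client, training_path: str, base_model: str) -> str:
    """Upload a JSONL training file and start a fine-tuning job.

    `client` is assumed to be an `openai.OpenAI()` instance (SDK v1+).
    Returns the ID of the created job, which can be polled for status.
    """
    with open(training_path, "rb") as f:
        # Upload the curated dataset so the training job can reference it.
        uploaded = client.files.create(file=f, purpose="fine-tune")
    # Start training; this runs asynchronously on the provider's side.
    job = client.fine_tuning.jobs.create(
        training_file=uploaded.id,
        model=base_model,
    )
    return job.id


# Usage (requires network access and an API key, so it is left commented out):
# from openai import OpenAI
# job_id = launch_fine_tune(OpenAI(), "train.jsonl", "gpt-3.5-turbo")
```

Passing the client in as a parameter keeps the function free of global state and makes it straightforward to exercise with a stub client before spending money on a real job.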

Practical Side-by-Side Comparison

The following snippet from the repository demonstrates latency and output differences between approaches:

import os
import time
import openai

# Legacy SDK interface (openai<1.0)
openai.api_key = os.getenv("OPENAI_API_KEY")
model_base = "gpt-3.5-turbo"
model_finetuned = "ft:gpt-3.5-turbo-xxxx"   # placeholder ID from OpenAI fine-tuning

prompt = [{"role": "user", "content": "Write a haiku about sunrise."}]

def ask(model):
    """Return the model's reply and the wall-clock duration of the call in seconds."""
    start = time.perf_counter()
    resp = openai.ChatCompletion.create(
        model=model,
        messages=prompt,
        temperature=0.5,
    )
    elapsed = time.perf_counter() - start
    return resp["choices"][0]["message"]["content"], elapsed

base_ans, base_t = ask(model_base)
fine_ans, fine_t = ask(model_finetuned)

print(f"Base model ({base_t:.2f}s): {base_ans}")
print(f"Fine-tuned ({fine_t:.2f}s): {fine_ans}")

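The snippet sends the same short prompt to both models, so it compares only response time and output. Latency is typically similar, since the fine-tuned model shares its base model's architecture. The difference appears in the output: a model fine-tuned for a specific style can produce on-topic results from this bare prompt, whereas the base model would need added instructions or few-shot examples, and therefore more tokens, to match it. These are the practical trade-offs documented in 18-fine-tuning/README.md.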

When to Choose Which Approach

Select prompt engineering when you need:

  • Quick experiments and rapid iteration
  • Limited training data or budget constraints
  • Flexibility to test many variations without retraining
  • Portability across different LLM providers

Select fine-tuning when you require:

  • Stable, domain-specific behavior for production workloads
  • Repeatedly serving the same specialized task at scale
  • Reduced token consumption to lower per-request costs
  • Persistent model behavior without complex prompt templates
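
The two checklists can be read as a simple decision rule. The sketch below encodes one possible reading of them; the `Project` fields and the precedence among criteria are assumptions made for illustration, not guidance from the repository.

```python
from dataclasses import dataclass


@dataclass
class Project:
    has_curated_dataset: bool          # enough input-output pairs to train on
    needs_provider_portability: bool   # must move between LLM providers
    stable_high_volume_task: bool      # same specialized task served at scale
    still_iterating_on_behavior: bool  # requirements change frequently


def recommend(project: Project) -> str:
    """Return a starting recommendation based on the criteria listed above."""
    # Any of these conditions makes retraining impractical or premature.
    if (
        not project.has_curated_dataset
        or project.still_iterating_on_behavior
        or project.needs_provider_portability
    ):
        return "prompt engineering"
    # With data in hand and a stable workload, persistent behavior pays off.
    if project.stable_high_volume_task:
        return "fine-tuning"
    return "prompt engineering"


print(recommend(Project(True, False, True, False)))
print(recommend(Project(False, False, True, False)))
```

The rule defaults to prompt engineering because it is the cheaper option to try first and to abandon.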

Summary

  • Prompt engineering manipulates model inputs (system messages, few-shot examples) while keeping base weights static, making it ideal for rapid prototyping and provider-agnostic solutions.
  • Fine-tuning updates model parameters through training on curated datasets, creating persistent custom models that require simpler prompts but demand ML expertise and provider-specific deployment.
  • Cost structures differ significantly: prompt engineering increases per-request token costs, while fine-tuning incurs upfront training compute costs but can reduce long-term inference expenses.
  • Data requirements favor prompt engineering for low-data scenarios, whereas fine-tuning requires hundreds to thousands of high-quality input-output pairs.
  • Implementation files in the microsoft/generative-ai-for-beginners repository include 04-prompt-engineering-fundamentals/README.md for input optimization and 18-fine-tuning/README.md for model customization workflows.

Frequently Asked Questions

Can I use prompt engineering and fine-tuning together?

Yes. You can apply prompt engineering techniques to a fine-tuned model. In fact, the 18-fine-tuning/README.md lesson suggests that even after fine-tuning, you may still need to craft specific system messages or prompts to guide the custom model's behavior for edge cases.

Does fine-tuning reduce API costs compared to prompt engineering?

Often yes, but not always. Fine-tuning requires an upfront compute investment for the training job (GPU hours and storage). However, because the fine-tuned model understands the domain implicitly, you can send shorter prompts with fewer tokens per request, reducing ongoing inference costs at scale according to the cost analysis in the repository.

How much data do I need for fine-tuning versus prompt engineering?

Prompt engineering requires no training data; you only need a handful of example prompts to refine your approach. Fine-tuning typically requires hundreds to thousands of curated input-output pairs, plus validation data, as specified in the "What is fine-tuning for language models" section of 18-fine-tuning/README.md.

Is fine-tuning possible with any LLM provider?

No. Fine-tuning availability depends on the provider and specific model. The microsoft/generative-ai-for-beginners repository covers fine-tuning workflows for Azure OpenAI, OpenAI, and Hugging Face, but not all base models support fine-tuning, and the resulting artifacts are provider-specific and cannot be transferred between platforms.
