How to Optimize Costs When Using Different LLM Providers: A Complete Guide

February 26, 2026 microsoft/generative-ai-for-beginners ↗

You can optimize costs when using different LLM providers by selecting cheaper models for appropriate tasks, enforcing token budgets with max_tokens, implementing response caching, and leveraging Retrieval-Augmented Generation (RAG) to reduce prompt size.

The microsoft/generative-ai-for-beginners repository provides patterns for cost optimization across OpenAI, Azure OpenAI, and Hugging Face. Because all major providers charge per token, controlling token usage is the main lever for building financially sustainable AI applications.

Understanding Token-Based Pricing Across Providers

LLM providers bill by the token (roughly 4 characters of text), charging for both input (prompt) and output (completion) tokens. Pricing varies dramatically between providers and model tiers:

GPT-4 costs 5-10× more per 1k tokens than GPT-3.5-turbo
Azure OpenAI offers reserved capacity pricing distinct from OpenAI's pay-as-you-go model
Hugging Face provides free tiers for open models with specific rate limits

The provider comparison table in 00-course-setup/03-providers.md (lines 11-15) details these pricing structures and quota limits.
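Because input and output tokens are billed separately, a per-request estimate needs both counts. The sketch below shows the arithmetic only: the model names and prices in PRICES are hypothetical placeholders, not real provider rates, and rough_token_count applies the 4-characters-per-token rule of thumb rather than a real tokenizer.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Price:
    """Per-1k-token prices in USD. The values used below are placeholders."""
    input_per_1k: float
    output_per_1k: float


# Hypothetical price table for illustration only; look up real prices
# on each provider's pricing page before relying on any number here.
PRICES = {
    "small-model": Price(input_per_1k=0.001, output_per_1k=0.002),
    "large-model": Price(input_per_1k=0.010, output_per_1k=0.020),
}


def rough_token_count(text: str) -> int:
    """Approximate tokens with the ~4 characters per token rule of thumb."""
    return math.ceil(len(text) / 4)


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Price input and output tokens separately, then add them up."""
    price = PRICES[model]
    input_cost = (input_tokens / 1000) * price.input_per_1k
    output_cost = (output_tokens / 1000) * price.output_per_1k
    return input_cost + output_cost
```

With these placeholder prices, a request with 1,000 input tokens and 500 output tokens costs ten times more on "large-model" than on "small-model", which is why tier selection tends to dominate the bill.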

Architectural Strategies to Optimize LLM Costs

Model Selection and Tier Optimization

Use the smallest model capable of meeting your quality requirements. According to the model comparison in 02-exploring-and-comparing-different-llms/README.md (lines 39-40), GPT-3.5-turbo handles most conversational tasks at a fraction of GPT-4's cost. For embeddings, text-embedding-ada-002 remains the lowest-cost option across all supported providers.
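One way to apply "smallest viable model" at runtime is to try the cheaper tier first and escalate only when the answer fails a quality check. This is a minimal sketch, not a pattern taken from the repository: call_model and is_good_enough are hypothetical callables you would supply (for example, a wrapper around your API client and a format validator).

```python
from typing import Callable, Tuple

# Cheapest model first; the ladder itself is an assumption for illustration.
MODEL_LADDER = ["gpt-3.5-turbo", "gpt-4"]


def answer_with_escalation(
    prompt: str,
    call_model: Callable[[str, str], str],
    is_good_enough: Callable[[str], bool],
) -> Tuple[str, str]:
    """Try models cheapest-first; return (model, answer) for the first
    acceptable answer, or the last model's answer if none pass."""
    model, answer = MODEL_LADDER[0], ""
    for model in MODEL_LADDER:
        answer = call_model(model, prompt)
        if is_good_enough(answer):
            break
    return model, answer
```

Escalation pays for two calls whenever the cheap model fails, so it only saves money when most requests pass the check on the first attempt.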

Token Budgeting and Output Controls

Enforce hard limits using the max_tokens parameter. The text generation lesson in 06-text-generation-apps/README.md (lines 81-89) demonstrates how reserving tokens for the response prevents runaway costs from unexpectedly long completions.

Lower the temperature parameter (0.0-0.3) to produce more deterministic output, as noted in the same lesson (lines 91-96). Temperature controls randomness rather than length, so it makes completions more predictable and cacheable but does not cap their size; max_tokens remains the only hard limit on completion tokens.

Caching and Request Batching

Store deterministic completions to avoid paying for identical prompts. The shared utility shared/python/api_utils.py provides a thin wrapper around the OpenAI client where you can inject caching logic without modifying application code.

Group multiple user requests into single API calls where possible. Batching reduces per-request overhead and maximizes throughput within rate limits.
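Batching at the prompt level can be sketched as follows: several questions share one set of instructions, so the instruction tokens are paid once instead of once per question. The prompt wording and the "<number>. <answer>" reply format are assumptions made for this example, and real model output may not always follow the format, which is why the parser tolerates missing lines.

```python
import re


def build_batched_prompt(questions: list[str]) -> str:
    """Combine several questions into one numbered prompt."""
    numbered = "\n".join(f"{i}. {q}" for i, q in enumerate(questions, start=1))
    return (
        "Answer each question below. Reply with one line per question, "
        "formatted as '<number>. <answer>'.\n\n" + numbered
    )


def split_batched_response(text: str, expected: int) -> list[str]:
    """Parse '<number>. <answer>' lines; missing answers come back as ''."""
    answers = [""] * expected
    for line in text.splitlines():
        match = re.match(r"\s*(\d+)\.\s*(.*)", line)
        if match:
            index = int(match.group(1)) - 1
            if 0 <= index < expected:
                answers[index] = match.group(2).strip()
    return answers
```

Any question whose answer comes back empty can be retried individually, which keeps a single malformed reply from invalidating the whole batch.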

Retrieval-Augmented Generation (RAG)

Implement RAG to send only relevant context rather than full knowledge bases. Lesson 08 in 08-building-search-applications/README.md (lines 85-90) explains how retrieving specific document chunks reduces prompt token counts by 80-90% compared to including entire documents.
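The retrieval step can be illustrated without any external service. The sketch below deliberately swaps embedding similarity for plain keyword overlap so that it runs with the standard library alone; a real RAG pipeline would rank chunks by embedding distance instead. The function names and the prompt template are hypothetical.

```python
import re


def chunk_text(text: str, max_words: int = 50) -> list[str]:
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def _terms(text: str) -> set[str]:
    """Lowercased alphanumeric terms used for overlap scoring."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def top_k_chunks(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Rank chunks by terms shared with the query (a stand-in for embeddings)."""
    query_terms = _terms(query)
    ranked = sorted(chunks, key=lambda c: len(query_terms & _terms(c)), reverse=True)
    return ranked[:k]


def build_prompt(query: str, chunks: list[str]) -> str:
    """Assemble a prompt that carries only the selected chunks as context."""
    context = "\n---\n".join(chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"
```

Only the k selected chunks reach the prompt, so prompt size scales with k and the chunk size rather than with the size of the knowledge base.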

Implementing Cost Controls in Python

Measuring Tokens with `tiktoken`

Use OpenAI's tiktoken library to count prompt tokens, and from that estimate cost, before sending requests:

import os

import openai
import tiktoken
from dotenv import load_dotenv

load_dotenv()                     # reads .env (see .env.copy)
openai.api_key = os.getenv("OPENAI_API_KEY")

# Tokenizer matching the target model, so counts line up with billing
enc = tiktoken.encoding_for_model("gpt-3.5-turbo")

def token_count(text: str) -> int:
    """Return number of tokens for a given string."""
    return len(enc.encode(text))

prompt = "Explain the difference between RAG and fine-tuning."
print(f"Prompt tokens: {token_count(prompt)}")

This pattern uses the environment utilities referenced in shared/python/env_utils.py.

Enforcing Token Budgets

Implement hard stops to prevent budget overruns:

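MAX_BUDGET = 800               # max tokens per request (prompt + response)
RESPONSE_RESERVE = 200         # tokens held back for the response

def safe_completion(prompt: str) -> str:
    # token_count() ignores the few tokens of chat-message framing,
    # so treat the budget as approximate.
    prompt_tokens = token_count(prompt)
    # Reject prompts that would eat into the response reserve
    if prompt_tokens > MAX_BUDGET - RESPONSE_RESERVE:
        raise ValueError(
            f"Prompt exceeds budget ({prompt_tokens}>{MAX_BUDGET - RESPONSE_RESERVE})"
        )
    remaining = MAX_BUDGET - prompt_tokens   # always >= RESPONSE_RESERVE here
    # openai<1.0 interface; openai>=1.0 uses client.chat.completions.create
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=remaining,          # <-- cost-control knob
        temperature=0.2,               # low temperature, per the guidance above
    )
    return response.choices[0].message.content

print(safe_completion("List three ways to reduce token usage in LLM calls."))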

This implementation references the token-budget discussion in 06-text-generation-apps/README.md.

Implementing Response Caching

Make repeat queries free by memoizing completions with Python's functools.lru_cache:

from functools import lru_cache

@lru_cache(maxsize=1024)
def cached_completion(prompt: str) -> str:
    # Re-use the safe_completion defined above
    return safe_completion(prompt)

# First call hits the API, subsequent identical prompts are free
print(cached_completion("What is a token?"))
print(cached_completion("What is a token?"))   # cached result

Insert similar logic into shared/python/api_utils.py where the client is created.
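lru_cache keys on the prompt string alone and lives only as long as the process. A hash-keyed cache makes the key explicit and lets entries expire. The sketch below is one possible shape, not code from the repository: CompletionCache and its call argument are hypothetical, and the in-memory dict could be swapped for an external store.

```python
import hashlib
import json
import time
from typing import Callable, Dict, Optional, Tuple


class CompletionCache:
    """In-memory hash(request) -> completion cache with optional expiry."""

    def __init__(self, ttl_seconds: Optional[float] = None):
        self._ttl = ttl_seconds
        self._store: Dict[str, Tuple[float, str]] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(model: str, prompt: str, params: dict) -> str:
        # Include model and parameters in the key: the same prompt with a
        # different model or max_tokens is a different request.
        payload = json.dumps([model, prompt, params], sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get_or_call(
        self,
        model: str,
        prompt: str,
        params: dict,
        call: Callable[[str, str, dict], str],
    ) -> str:
        """Return a cached completion, or invoke call() and store its result."""
        key = self._key(model, prompt, params)
        now = time.monotonic()
        entry = self._store.get(key)
        if entry is not None and (self._ttl is None or now - entry[0] < self._ttl):
            self.hits += 1
            return entry[1]
        self.misses += 1
        result = call(model, prompt, params)
        self._store[key] = (now, result)
        return result
```

The hits and misses counters give a quick read on whether the cache is earning its keep; a hit rate near zero usually means prompts vary too much to benefit from exact-match caching.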

Switching Providers via Environment Variables

Manage multi-provider deployments without code changes:


# .env (example)
OPENAI_API_KEY=sk-...
AZURE_OPENAI_API_KEY=your-azure-key
AZURE_OPENAI_ENDPOINT=https://my-aoai.openai.azure.com/
AZURE_OPENAI_DEPLOYMENT=gpt-35-turbo
HUGGING_FACE_API_KEY=hf_...
LLM_PROVIDER=openai

# In code you can pick the provider:
import os

import openai

provider = os.getenv("LLM_PROVIDER", "openai")   # default to OpenAI

if provider == "azure":
    openai.api_type = "azure"
    openai.api_base = os.getenv("AZURE_OPENAI_ENDPOINT")
    openai.api_key = os.getenv("AZURE_OPENAI_API_KEY")
    # Azure also needs an API version; this variable name and default are assumptions
    openai.api_version = os.getenv("AZURE_OPENAI_API_VERSION", "2023-05-15")
    # On Azure, pass engine=deployment (not model=...) to ChatCompletion.create
    deployment = os.getenv("AZURE_OPENAI_DEPLOYMENT")
elif provider == "huggingface":
    # Use HF inference endpoint (example)
    pass
else:
    openai.api_key = os.getenv("OPENAI_API_KEY")

This provider-selection logic mirrors the Choosing & Configuring an LLM Provider guide in 00-course-setup/03-providers.md.
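A misconfigured provider is cheaper to catch at startup than on the first billed request. The sketch below validates the environment variables from the example above before any client is created; resolve_provider is a hypothetical helper, and the list of required variables per provider is an assumption based on that example.

```python
import os
from typing import Mapping, Optional

# Variables each provider needs, taken from the .env example above.
REQUIRED_SETTINGS = {
    "openai": ["OPENAI_API_KEY"],
    "azure": [
        "AZURE_OPENAI_API_KEY",
        "AZURE_OPENAI_ENDPOINT",
        "AZURE_OPENAI_DEPLOYMENT",
    ],
    "huggingface": ["HUGGING_FACE_API_KEY"],
}


def resolve_provider(env: Optional[Mapping[str, str]] = None) -> dict:
    """Return the settings for the provider named in LLM_PROVIDER,
    raising early if the provider is unknown or a setting is missing."""
    env = os.environ if env is None else env
    provider = env.get("LLM_PROVIDER", "openai").lower()

    if provider not in REQUIRED_SETTINGS:
        raise ValueError(f"Unknown LLM_PROVIDER: {provider!r}")

    names = REQUIRED_SETTINGS[provider]
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise KeyError(f"Missing settings for {provider}: {', '.join(missing)}")

    settings = {name: env[name] for name in names}
    settings["provider"] = provider
    return settings
```

Accepting the environment as a parameter keeps the function testable: pass a plain dict in tests and leave the default for production.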

Provider-Specific Cost Management Tools

Each provider offers native tools to monitor and control spend:

Azure OpenAI: Cost Management dashboards track deployment-level usage and support reserved capacity pricing for predictable workloads
OpenAI: Usage-based reports in the developer portal show token consumption by model and time period
Hugging Face: Free tier limits and inference endpoint pricing calculators help estimate costs for open models

Reference the provider comparison table in 00-course-setup/03-providers.md (lines 11-15) for direct links to each pricing calculator.
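Provider dashboards report spend after the fact; a small in-process ledger can complement them by flagging overruns as they happen. The sketch below assumes each API response exposes prompt_tokens and completion_tokens counts (as the usage field of OpenAI chat completion responses does). UsageLedger and the budget figure are hypothetical.

```python
from collections import defaultdict


class UsageLedger:
    """Accumulate token usage per model and flag when a budget is exceeded."""

    def __init__(self, monthly_token_budget: int):
        self.budget = monthly_token_budget
        self.by_model = defaultdict(
            lambda: {"prompt_tokens": 0, "completion_tokens": 0}
        )

    def record(self, model: str, usage: dict) -> None:
        """Add one response's usage counts to the running totals."""
        totals = self.by_model[model]
        totals["prompt_tokens"] += usage.get("prompt_tokens", 0)
        totals["completion_tokens"] += usage.get("completion_tokens", 0)

    @property
    def total_tokens(self) -> int:
        """Sum of prompt and completion tokens across all models."""
        return sum(
            m["prompt_tokens"] + m["completion_tokens"]
            for m in self.by_model.values()
        )

    def over_budget(self) -> bool:
        """True once recorded usage exceeds the configured budget."""
        return self.total_tokens > self.budget
```

Calling record after every request and checking over_budget before the next one turns the budget into a circuit breaker rather than a month-end surprise.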

Summary

Select the smallest viable model: GPT-3.5-turbo costs 5-10× less than GPT-4 and handles most conversational tasks adequately
Enforce token budgets: Use max_tokens parameters and pre-flight token counting with tiktoken to prevent runaway costs
Cache deterministic responses: Store hash(prompt) → completion mappings to eliminate redundant API calls for repeated queries
Implement RAG: Retrieve only relevant context chunks instead of sending full documents, reducing prompt tokens by 80-90%
Leverage provider tools: Use Azure Cost Management, OpenAI usage dashboards, and Hugging Face pricing calculators to monitor actual spend

Frequently Asked Questions

How do I calculate token costs before sending an API request?

Use the tiktoken library to encode your prompt and count tokens before calling the API. The microsoft/generative-ai-for-beginners repository includes tiktoken as a dependency, and you can implement a token_count() function using tiktoken.encoding_for_model() to get the exact prompt token count. Multiply that count by the provider's input price to estimate input cost; output cost is only known after the call, but max_tokens gives you its upper bound.

When should I switch from GPT-4 to GPT-3.5-turbo?

Switch to GPT-3.5-turbo for routine conversational tasks, data extraction, and simple classification where the output quality difference is negligible. According to the model comparison in 02-exploring-and-comparing-different-llms/README.md, GPT-3.5-turbo costs 5-10× less per token while maintaining sufficient performance for approximately 80% of common LLM use cases.

How can I prevent duplicate API calls from increasing my costs?

Implement response caching using Python's functools.lru_cache or a Redis backend. The shared utility file shared/python/api_utils.py provides a thin wrapper around the OpenAI client where you can inject caching logic. By storing hash(prompt) → completion mappings, subsequent identical queries return cached results at zero API cost.

What is the most effective way to reduce prompt token counts?

Implement Retrieval-Augmented Generation (RAG) to send only relevant context chunks rather than full documents. As explained in 08-building-search-applications/README.md, RAG retrieves specific document segments related to the query, reducing prompt token counts by 80-90% compared to including entire knowledge bases in the context window.
