Production LLM Deployment Best Practices: Caching, Rate Limiting, and Cost Optimization
Production LLM deployment requires a multi-layered architecture combining semantic caching to eliminate redundant token costs, strict per-user rate limiting to prevent abuse, and intelligent model fallback chains to optimize spend while maintaining high availability.
The ai-engineering-from-scratch repository provides a reference implementation for production LLM deployment in its capstone lesson, located at phases/11-llm-engineering/13-production-app (lesson 13 of phase 11). The lesson shows how to balance low latency, controlled spend, and safety when serving large language models to real users.
Architecture Overview
A production-grade LLM service must orchestrate eight distinct components to handle traffic safely and economically. According to the architecture diagram in phases/11-llm-engineering/13-production-app/docs/en.md, the request flow is:
- API Gateway – authenticates users and applies rate limits before forwarding.
- Input Guardrails – blocks prompt injection and redacts PII.
- Prompt Router – selects versioned templates to support A/B testing.
- Semantic Cache – performs embedding similarity lookup to skip expensive LLM calls.
- LLM Call – executes with exponential back-off and model fallback.
- Output Guardrails – filters unsafe content.
- Eval & Cost Tracker – logs latency, token usage, and USD cost per request.
- Response Streaming – delivers tokens via SSE to improve perceived latency.
All components are wired together in the ProductionLLMService class defined in phases/11-llm-engineering/13-production-app/code/production_app.py.
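As a rough illustration of the request flow listed above, the sketch below chains simplified stand-ins for three of the stages. Every name in it (Stage, input_guardrails, semantic_cache, llm_call, handle) is hypothetical and the logic is deliberately trivial. It is not the ProductionLLMService code, only a picture of how early stages can short-circuit a request before the LLM call.

```python
import re
from typing import Callable, Optional

# Hypothetical stage signature: return a response dict to stop the
# pipeline early, or None to let the next stage run.
Stage = Callable[[dict], Optional[dict]]

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def input_guardrails(ctx: dict) -> Optional[dict]:
    # Block one obvious injection phrase and redact e-mail addresses.
    if "ignore previous instructions" in ctx["query"].lower():
        return {"status": 400, "error": "blocked by input guardrails"}
    ctx["query"] = EMAIL_RE.sub("[REDACTED_EMAIL]", ctx["query"])
    return None


def semantic_cache(ctx: dict) -> Optional[dict]:
    # Exact-match lookup stands in for the embedding similarity search.
    cached = ctx["cache"].get(ctx["query"])
    if cached is not None:
        return {"status": 200, "response": cached, "cache_hit": True}
    return None


def llm_call(ctx: dict) -> Optional[dict]:
    # Placeholder for the real model call with retries and fallback.
    answer = f"echo: {ctx['query']}"
    ctx["cache"][ctx["query"]] = answer
    return {"status": 200, "response": answer, "cache_hit": False}


PIPELINE: list[Stage] = [input_guardrails, semantic_cache, llm_call]


def handle(query: str, cache: dict) -> dict:
    ctx = {"query": query, "cache": cache}
    for stage in PIPELINE:
        result = stage(ctx)
        if result is not None:
            return result
    return {"status": 500, "error": "no stage produced a response"}
```

Calling handle twice with the same query and the same cache dict returns cache_hit False the first time and True the second, which mirrors how the cache stage sits in front of the model call.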
Semantic Caching for Cost and Latency
Semantic caching eliminates duplicate spend by storing responses and reusing them for semantically similar queries, not only exact repeats. The reference implementation uses a 64-dimensional embedding (simple_embedding) to index cached entries.
When a request arrives, the system:
- Computes the query embedding.
- Scores cached entries using cosine similarity.
- Returns the cached response when similarity exceeds the configurable threshold of 0.92.
Cache statistics (hits, misses, hit_rate_pct) are exposed via the health endpoint. The curriculum projects that a 35% cache hit rate can reduce monthly LLM spend by approximately $4,000 for high-volume applications.
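The lookup steps and counters described above can be sketched as follows. This is a minimal sketch under stated assumptions: toy_embedding is a hypothetical hashed bag-of-words function standing in for the repository's simple_embedding (whose internals are not shown here), and the SemanticCache class and its method names are illustrative rather than the repository's API.

```python
import hashlib
import math

DIM = 64          # embedding size mentioned in the text
THRESHOLD = 0.92  # similarity threshold mentioned in the text


def toy_embedding(text: str) -> list[float]:
    # Assumption: hashed bag-of-words, L2-normalised.
    # This is NOT the repository's simple_embedding.
    vec = [0.0] * DIM
    for word in text.lower().split():
        bucket = int(hashlib.md5(word.encode()).hexdigest(), 16) % DIM
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def cosine(a: list[float], b: list[float]) -> float:
    # Both vectors are unit length, so the dot product is the cosine.
    return sum(x * y for x, y in zip(a, b))


class SemanticCache:
    def __init__(self, threshold: float = THRESHOLD):
        self.threshold = threshold
        self.entries = []  # list of (embedding, response) pairs
        self.hits = 0
        self.misses = 0

    def put(self, query: str, response: str) -> None:
        self.entries.append((toy_embedding(query), response))

    def get(self, query: str):
        q = toy_embedding(query)
        best_score, best_response = 0.0, None
        for emb, response in self.entries:
            score = cosine(q, emb)
            if score > best_score:
                best_score, best_response = score, response
        # Serve from cache only when similarity exceeds the threshold.
        if best_response is not None and best_score > self.threshold:
            self.hits += 1
            return {"response": best_response, "similarity": best_score}
        self.misses += 1
        return None

    def stats(self) -> dict:
        total = self.hits + self.misses
        rate = 100.0 * self.hits / total if total else 0.0
        return {"hits": self.hits, "misses": self.misses,
                "hit_rate_pct": round(rate, 1)}
```

A linear scan like this is fine for a few thousand entries. A larger cache would typically move the similarity search into a vector index, and a real embedding model would catch paraphrases that a bag-of-words stand-in misses.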
# Inside ProductionLLMService.handle_request(...)
cached = self.cache.get(effective_query)
if cached:
    # Cache hit – no LLM call, zero cost
    self.cost_tracker.total_cache_hits += 1
    return {
        "request_id": request_id,
        "response": cached["response"],
        "cache_hit": True,
        "similarity": cached["similarity"],
        "latency_ms": log.latency_ms,
        "cost_usd": 0.0,
    }
Rate Limiting Implementation
Place rate limiting at the API Gateway layer to protect both your infrastructure and provider quotas. The skill-production-checklist.md in the repository mandates specific tiers:
- Authenticated clients: 10–50 requests per minute.
- Anonymous clients: 5 requests per minute to prevent credential-free scraping.
- Burst allowance: 2-second windows to accommodate legitimate traffic spikes.
When limits are exceeded, the gateway returns HTTP 429. You can implement this with FastAPI middleware or with an LLM gateway such as LiteLLM, Portkey, or Kong AI Gateway.
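from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse
from redis.asyncio import Redis

app = FastAPI()
redis = Redis(host="redis", port=6379, db=0)

LIMIT = 10           # requests allowed per window
WINDOW_SECONDS = 60  # fixed one-minute window

@app.middleware("http")
async def rate_limit(request: Request, call_next):
    # Simple fixed-window counter in Redis, keyed by the X-User-Id header.
    # All requests without the header share the "anonymous" bucket.
    identifier = request.headers.get("X-User-Id", "anonymous")
    key = f"ratelimit:{identifier}"
    count = await redis.incr(key)
    if count == 1:
        # First request in this window: start the expiry clock.
        await redis.expire(key, WINDOW_SECONDS)
    if count > LIMIT:
        # Return the 429 directly; an HTTPException raised inside
        # middleware is not routed through FastAPI's exception handlers.
        return JSONResponse(status_code=429, content={"detail": "Rate limit exceeded"})
    return await call_next(request)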
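The tiers listed above can also be captured in a small, framework-independent limiter, which is easier to unit-test than middleware. The sketch below is an assumption-laden illustration: SlidingWindowLimiter and TIER_LIMITS are hypothetical names, the authenticated tier uses the low end (10) of the 10–50 range, and the 2-second burst allowance is not modelled.

```python
import time
from collections import defaultdict, deque

# Per-minute limits taken from the tiers in the text.
TIER_LIMITS = {"authenticated": 10, "anonymous": 5}
WINDOW_SECONDS = 60.0


class SlidingWindowLimiter:
    def __init__(self, clock=time.monotonic):
        # The clock is injectable so tests can control time.
        self.clock = clock
        # client_id -> timestamps of accepted requests
        self.events = defaultdict(deque)

    def allow(self, client_id: str, tier: str) -> bool:
        now = self.clock()
        window = self.events[client_id]
        # Drop timestamps that have aged out of the window.
        while window and now - window[0] >= WINDOW_SECONDS:
            window.popleft()
        if len(window) >= TIER_LIMITS[tier]:
            return False  # caller should answer with HTTP 429
        window.append(now)
        return True
```

Because the state lives in process memory, this version only works for a single worker. With several workers or hosts, the counters need to live in a shared store such as Redis.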
Cost Optimization Techniques
The ProductionLLMService implements four primary strategies to minimize spend without sacrificing reliability:
- Model Fallback Chain: The FALLBACK_CHAIN attempts claude-sonnet-4-20250514, then gpt-4o, then gpt-4o-mini. A request moves to the next model in the chain only when the previous one fails, which keeps the service available during a provider outage.
- Token Accounting: The calculate_cost function uses the MODEL_PRICING registry to compute per-request USD cost based on input and output token counts.
- Dynamic Throttling: The cost alerting exercise demonstrates downgrading heavy spenders to gpt-4o-mini automatically when budgets are exceeded.
- Streaming Responses: Server-Sent Events (SSE) deliver the first token within 200–500ms, improving perceived latency without increasing token count.
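The dynamic throttling idea above could be sketched as follows. This is not the repository's cost alerting exercise. BudgetThrottle, its method names, and the 1.00 USD budget are all hypothetical, chosen only to show the downgrade decision.

```python
from collections import defaultdict

DEFAULT_MODEL = "gpt-4o"
CHEAP_MODEL = "gpt-4o-mini"
BUDGET_USD = 1.00  # hypothetical per-user budget for one period


class BudgetThrottle:
    def __init__(self, budget_usd: float = BUDGET_USD):
        self.budget_usd = budget_usd
        # user_id -> USD spent in the current period
        self.spend = defaultdict(float)

    def record(self, user_id: str, cost_usd: float) -> None:
        self.spend[user_id] += cost_usd

    def pick_model(self, user_id: str, requested: str = DEFAULT_MODEL) -> str:
        # Downgrade users whose spend has exceeded the budget.
        if self.spend[user_id] > self.budget_usd:
            return CHEAP_MODEL
        return requested
```

In use, pick_model would run before each LLM call and record after it, with the cost taken from the per-request cost calculation. Resetting spend at the start of each billing period is left out of the sketch.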
async def call_with_fallback(prompt, preferred_model=None):
    chain = list(FALLBACK_CHAIN)
    if preferred_model and preferred_model in chain:
        # Move the preferred model to the front of the chain.
        chain.remove(preferred_model)
        chain.insert(0, preferred_model)
    for model in chain:
        try:
            return await call_llm_with_retry(prompt, model)
        except ConnectionError:
            # retry on next model in the chain
            continue
    # All models failed → graceful fallback response
    return {
        "text": "I apologize, but I am temporarily unable to process your request.",
        "model": "fallback",
        "input_tokens": estimate_tokens(prompt),
        "output_tokens": 20,
        "error": "All providers unavailable",
    }


def calculate_cost(model, input_tokens, output_tokens):
    # Prices in MODEL_PRICING are USD per million tokens; unknown models
    # fall back to GPT-4o pricing.
    pricing = MODEL_PRICING.get(model, MODEL_PRICING[ModelName.GPT_4O])
    input_cost = input_tokens / 1_000_000 * pricing["input"]
    output_cost = output_tokens / 1_000_000 * pricing["output"]
    return round(input_cost + output_cost, 8)
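To make the per-million-token arithmetic concrete, here is a self-contained worked example. The prices in EXAMPLE_PRICING are illustrative placeholders, not values quoted from the repository's MODEL_PRICING or from any provider price list, and example_cost is a hypothetical copy of the same formula using plain string keys.

```python
# Illustrative prices in USD per million tokens (placeholders only).
EXAMPLE_PRICING = {
    "gpt-4o": {"input": 2.50, "output": 10.00},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
}


def example_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    pricing = EXAMPLE_PRICING[model]
    input_cost = input_tokens / 1_000_000 * pricing["input"]
    output_cost = output_tokens / 1_000_000 * pricing["output"]
    return round(input_cost + output_cost, 8)


# One request with 1,200 input tokens and 400 output tokens.
large = example_cost("gpt-4o", 1_200, 400)
small = example_cost("gpt-4o-mini", 1_200, 400)
print(f"gpt-4o: ${large}  gpt-4o-mini: ${small}")
```

With these placeholder prices the request costs $0.007 on the larger model (0.003 for input plus 0.004 for output) and $0.00042 on the smaller one, which shows why downgrading heavy users can change unit economics by more than an order of magnitude.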
Observability and Monitoring
Every request generates a structured RequestLog entry containing the request ID, user ID, prompt template version, model used, input/output token counts, latency metrics, cache-hit flags, guardrail outcomes, and final cost. The CostTracker.summary() method aggregates these entries into unit economics such as cost per request, which can be checked against a target like cost_per_request < $0.01.
Combined with OpenTelemetry tracing (covered in exercise 5), you can visualize end-to-end latency to identify whether delays stem from cache misses or LLM latency, and set alerts on cost spikes.
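A stripped-down version of this logging and aggregation could look like the sketch below. The field names and the summary keys are assumptions made for illustration. The repository's RequestLog and CostTracker carry more fields (template version, guardrail outcomes) and may be shaped differently.

```python
from dataclasses import dataclass, field


@dataclass
class RequestLog:
    # Illustrative subset of the fields described above.
    request_id: str
    user_id: str
    model: str
    input_tokens: int
    output_tokens: int
    latency_ms: float
    cache_hit: bool
    cost_usd: float


@dataclass
class CostTracker:
    logs: list = field(default_factory=list)

    def record(self, log: RequestLog) -> None:
        self.logs.append(log)

    def summary(self) -> dict:
        n = len(self.logs)
        total = sum(entry.cost_usd for entry in self.logs)
        hits = sum(1 for entry in self.logs if entry.cache_hit)
        return {
            "requests": n,
            "total_cost_usd": round(total, 6),
            "cost_per_request": round(total / n, 6) if n else 0.0,
            "cache_hit_rate_pct": round(100.0 * hits / n, 1) if n else 0.0,
        }
```

An alert on cost spikes can then be as simple as comparing summary()["cost_per_request"] against the chosen target on a schedule.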
Summary
- Implement semantic caching with an embedding similarity threshold (default 0.92) to eliminate redundant LLM calls; the curriculum projects roughly $4,000 in monthly savings at a 35% cache hit rate for high-volume applications.
- Enforce rate limiting at the API Gateway (10–50 req/min for authenticated users) to prevent abuse and protect provider quotas.
- Configure model fallback chains (e.g., Claude → GPT-4o → GPT-4o-mini) to maintain availability when the preferred model or provider fails.
- Track per-request costs using token-based pricing calculators and monitor unit economics via CostTracker.summary().
- Use streaming responses (SSE) to improve perceived latency without additional token spend.
Frequently Asked Questions
What is semantic caching in LLM deployment?
Semantic caching stores responses indexed by query embeddings, allowing the system to return cached answers for semantically similar questions without invoking the LLM. According to the ai-engineering-from-scratch implementation, this uses 64-dimensional embeddings and a cosine similarity threshold (default 0.92) to match incoming queries against cached entries. On a cache hit, latency drops because no model call is made, and the token cost of the request is zero.
How do you implement rate limiting for LLM APIs?
Rate limiting should be implemented at the API Gateway layer using Redis-backed middleware or cloud-native gateways. The reference implementation recommends 10–50 requests per minute for authenticated clients and 5 requests per minute for anonymous traffic, with HTTP 429 responses when limits are exceeded. This prevents credential-free scraping and protects your LLM provider quotas from runaway clients.
What is the best fallback strategy for LLM model failures?
The reference implementation uses a cascading fallback chain defined in FALLBACK_CHAIN: attempt the primary model (e.g., claude-sonnet-4-20250514), then fall back to gpt-4o, and then to gpt-4o-mini when connection errors occur. Each attempt uses exponential back-off. If all models fail, the system returns a graceful error response that still includes estimated token counts, so cost tracking stays continuous.
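The per-model retry step can be sketched like this. It is a minimal illustration, not the repository's call_llm_with_retry: the name call_with_backoff, the default of three attempts, the 0.5-second base delay, and the 10% jitter are all assumptions.

```python
import asyncio
import random


async def call_with_backoff(func, *, max_attempts=3, base_delay=0.5,
                            sleep=asyncio.sleep):
    """Retry an async callable on ConnectionError with exponential back-off."""
    for attempt in range(max_attempts):
        try:
            return await func()
        except ConnectionError:
            if attempt == max_attempts - 1:
                # Out of attempts: let the caller try the next model.
                raise
            # Delay doubles each attempt (base, 2*base, 4*base, ...),
            # plus up to 10% random jitter to avoid synchronized retries.
            delay = base_delay * (2 ** attempt)
            await sleep(delay + random.uniform(0, delay * 0.1))
```

A fallback loop would wrap each model's call in call_with_backoff and catch the final ConnectionError to move on to the next model. The sleep parameter is injectable so tests can run without real waiting.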
How do you track and optimize LLM costs in production?
Track costs by implementing a calculate_cost function that multiplies input and output token counts by per-model pricing rates from a MODEL_PRICING registry. The CostTracker class aggregates these into summaries showing total spend, average cost per request, and cache hit rates. For optimization, use dynamic model throttling to downgrade heavy users to cheaper models (like gpt-4o-mini) when budgets are exceeded, and maintain cache hit rates above 30% to significantly reduce monthly spend.