# Production LLM Deployment Best Practices: Caching, Rate Limiting, and Cost Optimization

> Master production LLM deployment. Learn caching, rate limiting, and cost optimization strategies for efficient, high-availability AI systems. Optimize your token costs now.

- Repository: [Rohit Ghumare/ai-engineering-from-scratch](https://github.com/rohitg00/ai-engineering-from-scratch)
- Tags: best-practices
- Published: 2026-06-11

---

**Production LLM deployment requires a multi-layered architecture combining semantic caching to eliminate redundant token costs, strict per-user rate limiting to prevent abuse, and intelligent model fallback chains to optimize spend while maintaining high availability.**

The `ai-engineering-from-scratch` repository provides a reference implementation for production LLM deployment in its capstone lesson (phase 11, lesson 13: `13-production-app`). The lesson shows how to balance low latency, controlled spend, and safety when serving large language models to real users.

## Architecture Overview

A production-grade LLM service must orchestrate eight distinct components to handle traffic safely and economically. According to the architecture diagram in [`phases/11-llm-engineering/13-production-app/docs/en.md`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/phases/11-llm-engineering/13-production-app/docs/en.md), the request flow is:

1. **API Gateway** – authenticates users and applies rate limits before forwarding.
2. **Input Guardrails** – blocks prompt injection and redacts PII.
3. **Prompt Router** – selects versioned templates to support A/B testing.
4. **Semantic Cache** – performs embedding similarity lookup to skip expensive LLM calls.
5. **LLM Call** – executes with exponential back-off and model fallback.
6. **Output Guardrails** – filters unsafe content.
7. **Eval & Cost Tracker** – logs latency, token usage, and USD cost per request.
8. **Response Streaming** – delivers tokens via SSE to improve perceived latency.

All components are wired together in the `ProductionLLMService` class defined in [`phases/11-llm-engineering/13-production-app/code/production_app.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/phases/11-llm-engineering/13-production-app/code/production_app.py).
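To make the request flow concrete, here is a minimal sketch of how such stages might be chained. It is not the repository's `ProductionLLMService`: the `Context` class, the stage functions, and the exact-match dictionary standing in for the semantic cache are all hypothetical, and only four of the eight components are shown.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Context:
    """Hypothetical per-request state passed from stage to stage."""
    user_id: str
    query: str
    response: Optional[str] = None
    blocked: bool = False
    trace: List[str] = field(default_factory=list)  # stages that did work, in order


def input_guardrails(ctx: Context) -> Context:
    ctx.trace.append("input_guardrails")
    # Placeholder check; a real guardrail would detect injection and redact PII.
    if "ignore previous instructions" in ctx.query.lower():
        ctx.blocked = True
        ctx.response = "Request blocked."
    return ctx


# Exact-match dictionary used here as a stand-in for the semantic cache.
_CACHE = {"what is rag?": "Retrieval-augmented generation."}


def cache_lookup(ctx: Context) -> Context:
    ctx.trace.append("cache_lookup")
    ctx.response = _CACHE.get(ctx.query.lower())
    return ctx


def llm_call(ctx: Context) -> Context:
    if ctx.response is None:  # only reached on a cache miss
        ctx.trace.append("llm_call")
        ctx.response = f"(stub answer to: {ctx.query})"
    return ctx


def output_guardrails(ctx: Context) -> Context:
    ctx.trace.append("output_guardrails")
    return ctx


STAGES: List[Callable[[Context], Context]] = [
    input_guardrails, cache_lookup, llm_call, output_guardrails,
]


def handle(user_id: str, query: str) -> Context:
    ctx = Context(user_id, query)
    for stage in STAGES:
        ctx = stage(ctx)
        if ctx.blocked:  # a blocked request never reaches the cache or the LLM
            break
    return ctx
```

Keeping each stage as a plain function over a shared context makes it easy to see which steps a request actually touched, for example that a cache hit skips the LLM call entirely.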

## Semantic Caching for Cost and Latency

**Semantic caching** eliminates duplicate spend by storing responses and reusing them for semantically similar queries, not only exact repeats. The reference implementation uses a 64-dimensional embedding (`simple_embedding`) to index cached entries.

When a request arrives, the system:

- Computes the query embedding.
- Scores cached entries using cosine similarity.
- Returns the cached response when similarity exceeds the configurable threshold of **0.92**.

Cache statistics (`hits`, `misses`, `hit_rate_pct`) are exposed via the health endpoint. The curriculum projects that a **35%** cache hit rate can reduce monthly LLM spend by approximately **$4,000** for high-volume applications.

```python
# Inside ProductionLLMService.handle_request(...)
cached = self.cache.get(effective_query)
if cached:
    # Cache hit – no LLM call, zero cost
    self.cost_tracker.total_cache_hits += 1
    return {
        "request_id": request_id,
        "response": cached["response"],
        "cache_hit": True,
        "similarity": cached["similarity"],
        "latency_ms": log.latency_ms,
        "cost_usd": 0.0,
    }
```
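The lookup steps described in this section (embed, score by cosine similarity, compare against the threshold) can be sketched as follows. This is an illustration under stated assumptions, not the repository's cache: `toy_embedding` is a hypothetical hashed bag-of-words stand-in for `simple_embedding`, and the `SemanticCache` class and its method names are invented for the example. Only the 64 dimensions, the 0.92 threshold, and the statistic names come from the text above.

```python
import hashlib
import math
from typing import List, Optional

DIM = 64          # embedding size mentioned in the lesson
THRESHOLD = 0.92  # similarity threshold mentioned in the lesson


def toy_embedding(text: str) -> List[float]:
    """Hypothetical stand-in embedding: hash each token into one of DIM buckets."""
    vec = [0.0] * DIM
    for token in text.lower().split():
        bucket = int(hashlib.md5(token.encode()).hexdigest(), 16) % DIM
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def cosine(a: List[float], b: List[float]) -> float:
    # Both vectors are unit length, so the dot product is the cosine similarity.
    return sum(x * y for x, y in zip(a, b))


class SemanticCache:
    def __init__(self, threshold: float = THRESHOLD):
        self.threshold = threshold
        self.entries = []  # list of (embedding, response) pairs
        self.hits = 0
        self.misses = 0

    def put(self, query: str, response: str) -> None:
        self.entries.append((toy_embedding(query), response))

    def get(self, query: str) -> Optional[dict]:
        q = toy_embedding(query)
        best_sim, best_resp = 0.0, None
        for emb, resp in self.entries:
            sim = cosine(q, emb)
            if sim > best_sim:
                best_sim, best_resp = sim, resp
        if best_resp is not None and best_sim > self.threshold:
            self.hits += 1
            return {"response": best_resp, "similarity": best_sim}
        self.misses += 1
        return None

    def stats(self) -> dict:
        total = self.hits + self.misses
        rate = round(100 * self.hits / total, 1) if total else 0.0
        return {"hits": self.hits, "misses": self.misses, "hit_rate_pct": rate}
```

A linear scan is fine for a sketch; at production volume a vector index would likely replace the loop. A hashed bag-of-words vector also only matches near-identical wording, so a learned embedding model would be needed for real paraphrase matching.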

## Rate Limiting Implementation

Place rate limiting at the **API Gateway** layer to protect both your infrastructure and provider quotas. The [`skill-production-checklist.md`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/skill-production-checklist.md) in the repository mandates specific tiers:

- **Authenticated clients**: 10–50 requests per minute.
- **Anonymous clients**: 5 requests per minute to prevent credential-free scraping.
- **Burst allowance**: 2-second windows to accommodate legitimate traffic spikes.

When limits are exceeded, the gateway returns HTTP 429. You can implement this via FastAPI middleware or cloud-native solutions like LiteLLM, Portkey, or Kong AI Gateway.

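```python
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse
from redis.asyncio import Redis

app = FastAPI()
redis = Redis(host="redis", port=6379, db=0)

LIMIT = 10           # requests allowed per window
WINDOW_SECONDS = 60  # fixed one-minute window

@app.middleware("http")
async def rate_limit(request: Request, call_next):
    # Fixed-window counter keyed by the X-User-Id header (10 requests per minute).
    # Requests without the header all share the single "anonymous" bucket.
    identifier = request.headers.get("X-User-Id", "anonymous")
    key = f"ratelimit:{identifier}"
    count = await redis.incr(key)
    if count == 1:
        # First request in this window: start the expiry clock.
        await redis.expire(key, WINDOW_SECONDS)
    if count > LIMIT:
        # Return the response directly; an HTTPException raised inside
        # middleware is not handled by FastAPI's exception handlers.
        return JSONResponse(status_code=429, content={"detail": "Rate limit exceeded"})
    return await call_next(request)
```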
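The tiers listed in this section could also be enforced with a per-client sliding window. The sketch below is an in-memory illustration only: the `SlidingWindowLimiter` class and the `TIER_LIMITS` mapping are hypothetical names, the authenticated limit is set to the low end of the 10–50 range, and the burst allowance is left out. A multi-process deployment would need shared state such as Redis.

```python
import time
from collections import defaultdict, deque
from typing import Deque, Dict, Optional

# Requests per minute per tier (authenticated uses the low end of 10-50).
TIER_LIMITS: Dict[str, int] = {"authenticated": 10, "anonymous": 5}
WINDOW_SECONDS = 60.0


class SlidingWindowLimiter:
    def __init__(self, limits: Dict[str, int] = TIER_LIMITS,
                 window: float = WINDOW_SECONDS):
        self.limits = limits
        self.window = window
        self._hits: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, client_id: str, tier: str, now: Optional[float] = None) -> bool:
        """Return True if the request fits in the client's window, else False."""
        now = time.monotonic() if now is None else now
        hits = self._hits[client_id]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limits[tier]:
            return False  # caller should answer with HTTP 429
        hits.append(now)
        return True
```

Passing `now` explicitly is only there to make the limiter testable without sleeping; in a request handler the default monotonic clock would be used.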

## Cost Optimization Techniques

The `ProductionLLMService` implements four primary strategies to minimize spend without sacrificing reliability:

**Model Fallback Chain**: The `FALLBACK_CHAIN` attempts `claude-sonnet-4-20250514`, then `gpt-4o`, then `gpt-4o-mini`. This improves availability by falling back to the next model in the chain whenever the current one is unreachable, with the cheapest model as the last resort.

**Token Accounting**: The `calculate_cost` function uses the `MODEL_PRICING` registry to compute per-request USD cost based on input and output token counts.

**Dynamic Throttling**: The cost alerting exercise demonstrates downgrading heavy spenders to `gpt-4o-mini` automatically when budgets are exceeded.

**Streaming Responses**: Server-Sent Events (SSE) deliver the first token within 200–500ms, improving perceived latency without increasing token count.
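As a rough illustration of the streaming step, the sketch below frames tokens as Server-Sent Events. The `sse_events` helper and the JSON payload shape are assumptions for this example, not the repository's code, and the closing `[DONE]` sentinel is a common convention, not part of the SSE format itself.

```python
import json
from typing import Iterable, Iterator


def sse_events(tokens: Iterable[str]) -> Iterator[str]:
    """Wrap each token in an SSE frame: a 'data:' line followed by a blank line."""
    for token in tokens:
        # JSON-encode the payload so newlines inside a token cannot break framing.
        yield f"data: {json.dumps({'token': token})}\n\n"
    # Conventional end-of-stream marker so the client knows when to stop reading.
    yield "data: [DONE]\n\n"
```

In a FastAPI app, a generator like this would typically be returned through `StreamingResponse(..., media_type="text/event-stream")`, so the client can render tokens as they arrive.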

```python
async def call_with_fallback(prompt, preferred_model=None):
    chain = list(FALLBACK_CHAIN)
    if preferred_model and preferred_model in chain:
        chain.remove(preferred_model)
        chain.insert(0, preferred_model)

    for model in chain:
        try:
            return await call_llm_with_retry(prompt, model)
        except ConnectionError:
            # retry on next model in the chain
            continue

    # All models failed → graceful fallback response
    return {
        "text": "I apologize, but I am temporarily unable to process your request.",
        "model": "fallback",
        "input_tokens": estimate_tokens(prompt),
        "output_tokens": 20,
        "error": "All providers unavailable",
    }
```
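The fallback chain relies on `call_llm_with_retry`, whose body is not shown here. A minimal sketch of the exponential back-off it is described as using might look like the following; the `retry_with_backoff` name, its parameters, and the jitter amount are assumptions for illustration.

```python
import asyncio
import random


async def retry_with_backoff(call, max_attempts=3, base_delay=0.5,
                             max_delay=8.0, sleep=asyncio.sleep):
    """Await call(); on ConnectionError wait base_delay * 2**attempt and retry."""
    for attempt in range(max_attempts):
        try:
            return await call()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the fallback chain try the next model
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Up to 10% jitter keeps many clients from retrying in lockstep.
            await sleep(delay + random.uniform(0, delay * 0.1))
```

Re-raising after the final attempt matters here: it is what lets an outer loop such as `call_with_fallback` catch the `ConnectionError` and move on to the next model.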

```python
def calculate_cost(model, input_tokens, output_tokens):
    # MODEL_PRICING holds USD prices per one million tokens;
    # models missing from the registry are billed at GPT-4o rates.
    pricing = MODEL_PRICING.get(model, MODEL_PRICING[ModelName.GPT_4O])
    input_cost = input_tokens / 1_000_000 * pricing["input"]
    output_cost = output_tokens / 1_000_000 * pricing["output"]
    return round(input_cost + output_cost, 8)
```
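The dynamic throttling idea can be sketched in a few lines. The cost alerting exercise itself is not reproduced here; the `BudgetThrottle` class, the per-user daily budget, and the method names below are hypothetical, and only the downgrade target `gpt-4o-mini` comes from the text.

```python
from collections import defaultdict
from typing import DefaultDict


class BudgetThrottle:
    """Track per-user spend and downgrade users who reach their budget."""

    def __init__(self, daily_budget_usd: float, cheap_model: str = "gpt-4o-mini"):
        self.daily_budget_usd = daily_budget_usd
        self.cheap_model = cheap_model
        self.spend: DefaultDict[str, float] = defaultdict(float)

    def record(self, user_id: str, cost_usd: float) -> None:
        # Call after each request with the value returned by the cost calculator.
        self.spend[user_id] += cost_usd

    def choose_model(self, user_id: str, requested_model: str) -> str:
        # At or over budget: route to the cheap model instead of rejecting.
        if self.spend[user_id] >= self.daily_budget_usd:
            return self.cheap_model
        return requested_model
```

Downgrading instead of rejecting keeps heavy users served while capping their marginal cost. A real deployment would also need to reset or expire the counters at the budget boundary.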

## Observability and Monitoring

Every request generates a structured `RequestLog` entry containing the request ID, user ID, prompt template version, model used, input/output token counts, latency metrics, cache-hit flags, guardrail outcomes, and final cost. The `CostTracker.summary()` method aggregates these entries into unit economics such as cost per request, with a target of less than $0.01 per request.

Combined with OpenTelemetry tracing (covered in exercise 5), you can visualize end-to-end latency to identify whether delays stem from cache misses or LLM latency, and set alerts on cost spikes.
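The per-request log and aggregate summary described in this section might look roughly like the sketch below. The field names are a subset chosen for illustration, and `RequestLogSketch` and `summarize` are hypothetical names, not the repository's `RequestLog` or `CostTracker.summary()`.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RequestLogSketch:
    request_id: str
    user_id: str
    model: str
    input_tokens: int
    output_tokens: int
    latency_ms: float
    cache_hit: bool
    cost_usd: float


def summarize(logs: List[RequestLogSketch]) -> dict:
    """Aggregate request logs into simple unit-economics figures."""
    total = len(logs)
    if total == 0:
        return {"total_requests": 0, "total_cost_usd": 0.0,
                "cost_per_request": 0.0, "cache_hit_rate_pct": 0.0}
    total_cost = sum(entry.cost_usd for entry in logs)
    hits = sum(1 for entry in logs if entry.cache_hit)
    return {
        "total_requests": total,
        "total_cost_usd": round(total_cost, 6),
        "cost_per_request": round(total_cost / total, 6),
        "cache_hit_rate_pct": round(100 * hits / total, 1),
    }
```

An alert on `cost_per_request` crossing the $0.01 target is one simple way to act on these numbers.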

## Summary

- Implement **semantic caching** with embedding similarity thresholds (default 0.92) to eliminate redundant LLM calls. The curriculum projects roughly $4,000 in monthly savings at a 35% cache hit rate for high-volume applications.
- Enforce **rate limiting** at the API Gateway (10–50 req/min for authenticated users) to prevent abuse and protect provider quotas.
- Configure **model fallback chains** (e.g., Claude → GPT-4o → GPT-4o-mini) to maintain availability while minimizing expensive API calls.
- Track **per-request costs** using token-based pricing calculators and monitor unit economics via `CostTracker.summary()`.
- Use **streaming responses** (SSE) to improve perceived latency without additional token spend.

## Frequently Asked Questions

### What is semantic caching in LLM deployment?

Semantic caching stores responses indexed by query embeddings, allowing the system to return cached answers for semantically similar questions without invoking the LLM. According to the `ai-engineering-from-scratch` implementation, this uses 64-dimensional embeddings and cosine similarity thresholds (default 0.92) to match incoming queries against cached entries. A cache hit incurs zero token cost and avoids the latency of an LLM call.

### How do you implement rate limiting for LLM APIs?

Rate limiting should be implemented at the API Gateway layer using Redis-backed middleware or cloud-native gateways. The reference implementation recommends 10–50 requests per minute for authenticated clients and 5 requests per minute for anonymous traffic, with HTTP 429 responses when limits are exceeded. This prevents credential-free scraping and protects your LLM provider quotas from runaway clients.

### What is the best fallback strategy for LLM model failures?

The reference implementation uses a cascading fallback chain defined in `FALLBACK_CHAIN`: attempt the primary model (e.g., `claude-sonnet-4-20250514`), then fall back to `gpt-4o`, then to `gpt-4o-mini` if connection errors occur. Each attempt uses exponential back-off, and if all models fail, the system returns a graceful error response with estimated token counts so that cost tracking stays continuous.

### How do you track and optimize LLM costs in production?

Track costs by implementing a `calculate_cost` function that multiplies input and output token counts by per-model pricing rates from a `MODEL_PRICING` registry. The `CostTracker` class aggregates these into summaries showing total spend, average cost per request, and cache hit rates. For optimization, use dynamic model throttling to downgrade heavy users to cheaper models (like `gpt-4o-mini`) when budgets are exceeded, and maintain cache hit rates above 30% to significantly reduce monthly spend.