# Production Deployment Patterns for LLM Applications: 7 Production-Grade Architectures

> Explore 7 production deployment patterns for LLM applications: speculative decoding, hybrid RAG, multi-agent orchestration, TensorRT-LLM, safety, and observability architectures.

- Repository: [Rohit Ghumare/ai-engineering-from-scratch](https://github.com/rohitg00/ai-engineering-from-scratch)
- Tags: architecture
- Published: 2026-06-09

---

**The `rohitg00/ai-engineering-from-scratch` repository defines seven essential production deployment patterns for LLM applications, covering speculative decoding inference stacks, Model Context Protocol servers, hybrid RAG pipelines, multi-agent orchestration, TensorRT-LLM acceleration, layered safety moderation, and OpenTelemetry-based observability.**

Moving large language models from research environments to customer-facing services requires architectural patterns that solve throughput bottlenecks, security constraints, and failure modes unique to generative AI. The curriculum maps these patterns to specific implementation files and capstone projects, providing reference architectures used in high-traffic production systems. Each pattern below includes direct source references and runnable code templates you can adapt for your own inference infrastructure.

## Speculative Decoding Serving Stack

**Speculative decoding** deploys a fast draft model (e.g., Llama 3 8B) to generate candidate token sequences, which a larger target model (e.g., Llama 3 70B) verifies in parallel. This pattern achieves 2-3× throughput gains while maintaining output quality, making it essential for high-QPS services running on commodity GPUs.

According to [`phases/19-capstone-projects/14-speculative-decoding-server/docs/en.md`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/phases/19-capstone-projects/14-speculative-decoding-server/docs/en.md), production implementations use **vLLM 0.7** or **SGLang** runtimes with Kubernetes **Horizontal Pod Autoscaling (HPA)** configured on queue-wait time metrics. The draft-target pipeline is exposed via JSON-over-HTTP APIs, with autoscaling rules tuned to keep tail latency under strict SLAs.

```python

# Install vLLM (>=0.7) and torch

# pip install vllm torch

from vllm import LLM, SamplingParams

# Draft model (fast, quantized)

draft = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    tensor_parallel_size=1,
    dtype="float16"
)

# Target model (high-quality, larger)

target = LLM(
    model="meta-llama/Meta-Llama-3-70B-Instruct",
    tensor_parallel_size=4,
    dtype="bfloat16"
)

def speculative_generate(prompt: str, max_tokens: int = 128):
    draft_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=max_tokens)
    draft_output = draft.generate(prompt, draft_params)[0].outputs[0].text
    
    # Verify draft with target model

    target_params = SamplingParams(temperature=0.0, max_tokens=max_tokens)
    target_output = target.generate(draft_output, target_params)[0].outputs[0].text
    return target_output

print(speculative_generate("Write a concise summary of the latest AI safety research."))

```

*Source:* [`phases/19-capstone-projects/14-speculative-decoding-server/outputs/skill-inference-server.md`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/phases/19-capstone-projects/14-speculative-decoding-server/outputs/skill-inference-server.md)

## Model Context Protocol (MCP) Server Architecture

The **MCP server pattern** exposes LLMs through a stateless HTTP endpoint implementing the Model Context Protocol specification (StreamableHTTP transport). This architecture decouples client applications (IDE assistants, agents) from backend models while enforcing enterprise security through **OAuth 2.1** scopes and **Open Policy Agent (OPA)** policy gating.

As implemented in [`phases/19-capstone-projects/13-mcp-server-with-registry/docs/en.md`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/phases/19-capstone-projects/13-mcp-server-with-registry/docs/en.md), the server exposes a [`.well-known/mcp.json`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/.well-known/mcp.json) registry for service discovery and validates tokens against per-tenant scope restrictions. The OPA layer enforces cost limits and content policies before request execution.

```ts
import { createServer } from "http";
import { verifyOAuthToken } from "./oauth";
import { authorize } from "./opa";

const server = createServer(async (req, res) => {
  if (req.method !== "POST") {
    res.writeHead(405); res.end(); return;
  }

  // 1️⃣ OAuth 2.1 token check
  const auth = req.headers["authorization"];
  if (!auth || !(await verifyOAuthToken(auth))) {
    res.writeHead(401); res.end(); return;
  }

  // 2️⃣ OPA policy check (e.g., cost limits)
  const body = await new Promise<string>((r) => {
    let data = "";
    req.on("data", (chunk) => (data += chunk));
    req.on("end", () => r(data));
  });
  if (!(await authorize(body))) {
    res.writeHead(403); res.end(); return;
  }

  // 3️⃣ Forward to underlying LLM
  const llmResponse = await callYourLLM(JSON.parse(body));
  res.writeHead(200, { "Content-Type": "application/json" });
  res.end(JSON.stringify(llmResponse));
});

server.listen(8080, () => console.log("MCP server listening on :8080"));

```

*Source:* [`phases/19-capstone-projects/13-mcp-server-with-registry/docs/en.md`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/phases/19-capstone-projects/13-mcp-server-with-registry/docs/en.md)

## Production-Grade Retrieval-Augmented Generation (RAG)

**Production RAG** combines hybrid retrieval (dense vector + BM25 sparse), cross-encoder re-ranking, prompt caching, and multi-layer guardrails. This pattern handles document ingestion from heterogeneous sources (PDF, HTML, code) while maintaining sub-200ms latency through aggressive caching and optimized indexing.

The reference implementation in [`phases/19-capstone-projects/08-production-rag-chatbot/docs/en.md`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/phases/19-capstone-projects/08-production-rag-chatbot/docs/en.md) uses **bge-reranker-v2-gemma** for relevance scoring and **Claude Sonnet 4.7** with prompt caching (60-80% hit rates). The safety stack combines **Llama Guard 4** and **NeMo Guardrails** for input/output moderation, while **Langfuse** and **Phoenix** provide real-time drift detection.

```python
from llama_index import SimpleDirectoryReader, VectorStoreIndex, ServiceContext
from langchain.llms import OpenAI
from llama_index.query_engine import RetrieverQueryEngine
from llama_index.retrievers import BM25Retriever, VectaraRetriever, HybridRetriever
from llama_index.postprocessor import LLMRerankPostprocessor
from llama_index.prompts import PromptTemplate

# 1️⃣ Ingestion

documents = SimpleDirectoryReader("./data").load_data()
service_ctx = ServiceContext.from_defaults(llm=OpenAI(model="gpt-4o-mini"))
index = VectorStoreIndex.from_documents(documents, service_context=service_ctx)

# 2️⃣ Hybrid retriever (dense + BM25)

bm25 = BM25Retriever.from_defaults()
dense = VectaraRetriever.from_defaults()
retriever = HybridRetriever(bm25, dense)

# 3️⃣ Reranker (cross-encoder)

reranker = LLMRerankPostprocessor.from_defaults(
    llm=OpenAI(model="gpt-4o-mini"),
    top_n=3
)

# 4️⃣ Query engine with guardrails integration

prompt = PromptTemplate(
    "You are a helpful assistant. Answer using only the retrieved context.\n\n{context}\n\nQuestion: {query}"
)
query_engine = RetrieverQueryEngine(
    retriever,
    node_postprocessors=[reranker],
    response_synthesizer_kwargs={"prompt_template": prompt}
)

print(query_engine.query("What are the safety guarantees of speculative decoding?"))

```

*Source:* [`phases/19-capstone-projects/08-production-rag-chatbot/docs/en.md`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/phases/19-capstone-projects/08-production-rag-chatbot/docs/en.md)

## Multi-Agent Orchestration and Scaling

**Supervisor-based multi-agent systems** delegate complex tasks to specialized sub-agents (retrieval, summarization, guard) running in isolated containers. The orchestrator manages parallel execution, aggregates results, and handles failures through durable queues and checkpointing substrates.

Per [`phases/16-multi-agent-and-swarms/05-supervisor-orchestrator-pattern/docs/en.md`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/phases/16-multi-agent-and-swarms/05-supervisor-orchestrator-pattern/docs/en.md), production deployments use **Redis**, **SQS**, or **Kafka** for message durability, with checkpoint frequency tuned to wall-clock time rather than step count. The shared-memory blackboard pattern enables fast intra-node communication between co-located agents.

```python
import asyncio
from typing import List

class Agent:
    async def run(self, task: str) -> str:
        await asyncio.sleep(0.5)
        return f"{self.__class__.__name__} completed {task}"

class RetrieverAgent(Agent): pass
class SummarizerAgent(Agent): pass
class GuardAgent(Agent): pass

class Supervisor:
    def __init__(self, agents: List[Agent]):
        self.agents = agents

    async def orchestrate(self, user_query: str):
        # Dispatch sub-tasks in parallel

        tasks = [
            self.agents[0].run(f"retrieve for '{user_query}'"),
            self.agents[1].run(f"summarize retrieval results"),
        ]
        results = await asyncio.gather(*tasks)
        # Guard step runs after others

        guard = await self.agents[2].run(f"guard {results}")
        return {"retrieve": results[0], "summary": results[1], "guard": guard}

async def main():
    sup = Supervisor([RetrieverAgent(), SummarizerAgent(), GuardAgent()])
    print(await sup.orchestrate("Explain speculative decoding safety."))

asyncio.run(main())

```

*Source:* [`phases/16-multi-agent-and-swarms/22-production-scaling-queues-checkpoints/outputs/skill-scaling-advisor.md`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/phases/16-multi-agent-and-swarms/22-production-scaling-queues-checkpoints/outputs/skill-scaling-advisor.md)

## Low-Latency Inference Optimization

**TensorRT-LLM compilation** and **FP8/INT4 quantization** maximize throughput on NVIDIA H100 GPUs, targeting latency budgets below 200ms for chat endpoints. This pattern fuses attention, linear, and activation kernels into optimized CUDA graphs, integrated with **NVIDIA Triton** for production serving.

The curriculum in [`phases/10-llms-from-scratch/12-inference-optimization/docs/en.md`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/phases/10-llms-from-scratch/12-inference-optimization/docs/en.md) covers **vLLM** deployment for rapid prototyping and **GPTQ/AWQ/GGUF** utilities for edge quantization. Production services compile models to TensorRT engines with FP8 precision, achieving near-theoretical FLOPs utilization.

```bash

# Prereqs: Docker, NVIDIA Container Toolkit

git clone https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM
docker build -t trt-llm:latest .

# Compile Llama-3-70B with FP8 on H100

docker run --gpus all -v $PWD:/workspace trt-llm:latest \
  python3 examples/compile.py \
    --model_dir /models/Meta-Llama-3-70B-Instruct \
    --dtype fp8 \
    --output_dir /compiled/llama3-70b-fp8

# Serve with Triton

docker run --gpus all -p 8000:8000 \
  -v /compiled/llama3-70b-fp8:/model \
  nvcr.io/nvidia/tritonserver:23.09-py3 tritonserver \
    --model-repository=/model

```

*Source:* [`phases/10-llms-from-scratch/12-inference-optimization/docs/en.md`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/phases/10-llms-from-scratch/12-inference-optimization/docs/en.md)

## Production Safety and Red-Teaming

**Layered moderation** combines input classifiers, output classifiers, and domain-specific rules to prevent harmful generations and prompt injection attacks. Continuous validation through **Llama Guard 4**, **Garak**, and **PyRIT** creates a red-team pipeline that runs nightly regression scans and pre-release attack campaigns.

According to [`phases/18-ethics-safety-alignment/29-moderation-systems-openai-perspective-llamaguard/outputs/skill-moderation-stack.md`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/phases/18-ethics-safety-alignment/29-moderation-systems-openai-perspective-llamaguard/outputs/skill-moderation-stack.md), production systems implement a three-layer moderation stack: input filtering, output filtering, and business-logic rules. The indirect prompt injection audit specifically checks that untrusted retrieval data is labeled and sandboxed.

```python
from garak import Garak
from pyrit import PyRIT

def run_redteam(prompt: str):
    # 1️⃣ Llama Guard classifier (input)

    guard = Garak(tool="llama_guard")
    guard_score = guard.evaluate(prompt)

    # 2️⃣ Garak probe suite (nightly)

    garak = Garak()
    garak_results = garak.run(prompt)

    # 3️⃣ PyRIT campaign (pre-release)

    pyrit = PyRIT()
    pyrit_results = pyrit.run_attack(prompt)

    return {
        "guard": guard_score,
        "garak": garak_results,
        "pyrit": pyrit_results,
    }

print(run_redteam("Write a tutorial on how to exfiltrate corporate data via a language model."))

```

*Source:* [`phases/18-ethics-safety-alignment/16-red-team-tooling-garak-llamaguard-pyrit/docs/en.md`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/phases/18-ethics-safety-alignment/16-red-team-tooling-garak-llamaguard-pyrit/docs/en.md)

## GenAI Observability Implementation

**OpenTelemetry GenAI semantic conventions** provide unified telemetry across OpenAI, Anthropic, LangChain, and vLLM SDKs. This pattern stores high-cardinality span data in **ClickHouse**, metadata in **Postgres**, and exposes cost attribution dashboards through **Next.js** frontends.

As detailed in [`phases/19-capstone-projects/11-llm-observability-dashboard/docs/en.md`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/phases/19-capstone-projects/11-llm-observability-dashboard/docs/en.md), the instrumentation auto-injects trace IDs, token usage metrics, and latency measurements. Alerting rules detect drift, hallucination patterns, and PII leakage through periodic evaluation jobs using **DeepEval** and **RAGAS**.

```python
from opentelemetry import trace
from opentelemetry.instrumentation.openai import OpenAIInstrumentor
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Configure OTLP exporter to ClickHouse collector

provider = TracerProvider()
exporter = OTLPSpanExporter(endpoint="http://otel-collector:4317", insecure=True)
processor = BatchSpanProcessor(exporter)
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)

# Instrument OpenAI SDK (similar for Anthropic, LangChain, vLLM)

OpenAIInstrumentor().instrument()

```

*Source:* [`phases/19-capstone-projects/11-llm-observability-dashboard/docs/en.md`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/phases/19-capstone-projects/11-llm-observability-dashboard/docs/en.md)

## Summary

- **Speculative decoding** pairs draft and target models through vLLM/SGLang to achieve 2-3× throughput gains on Kubernetes with HPA autoscaling.
- **MCP servers** standardize LLM access via stateless HTTP with OAuth 2.1 and OPA policy enforcement, discoverable through `.well-known` registries.
- **Production RAG** requires hybrid retrieval (dense + BM25), cross-encoder reranking, prompt caching, and layered guardrails (Llama Guard + NeMo) for reliable document Q&A.
- **Multi-agent orchestration** uses supervisor patterns with durable Redis/SQS/Kafka queues and checkpointing to handle failures across distributed agent workers.
- **Inference optimization** relies on TensorRT-LLM FP8 compilation and vLLM batching to meet sub-200ms latency requirements on H100 GPUs.
- **Safety stacks** implement three-layer moderation plus continuous red-teaming via Garak and PyRIT to prevent CVE-style vulnerabilities like prompt injection.
- **Observability pipelines** export OpenTelemetry GenAI spans to ClickHouse for real-time cost attribution and drift detection across six major SDK families.

## Frequently Asked Questions

### What is the most cost-effective pattern for high-throughput LLM APIs?

**Speculative decoding** provides the best cost-throughput ratio for high-QPS services. By running a smaller draft model (e.g., 8B parameters) to generate candidate tokens verified by a larger target model (70B+), you achieve 2-3× higher throughput per GPU without sacrificing quality. Deploy this via vLLM 0.7 or SGLang on Kubernetes with HPA autoscaling based on queue-wait time, as documented in [`phases/19-capstone-projects/14-speculative-decoding-server/docs/en.md`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/phases/19-capstone-projects/14-speculative-decoding-server/docs/en.md).

### How does MCP server architecture improve enterprise LLM security?

The **Model Context Protocol (MCP)** server pattern enforces least-privilege access through OAuth 2.1 token validation and per-tenant scoping, while the **Open Policy Agent (OPA)** layer provides runtime enforcement of cost and safety policies. This decouples client applications from direct model access, allowing platform teams to audit and control all LLM interactions through a centralized gateway defined in [`phases/19-capstone-projects/13-mcp-server-with-registry/docs/en.md`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/phases/19-capstone-projects/13-mcp-server-with-registry/docs/en.md).

### Which RAG components are critical for production reliability?

Production RAG requires **hybrid retrieval** combining dense vectors with BM25 sparse indexing, **bge-reranker-v2-gemma** for relevance scoring, and **prompt caching** to achieve 60-80% cache hit rates. Equally critical is the safety stack combining **Llama Guard 4** for input/output moderation with **NeMo Guardrails** for domain-specific constraints, monitored via **Langfuse** or **Phoenix** for drift detection per [`phases/19-capstone-projects/08-production-rag-chatbot/docs/en.md`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/phases/19-capstone-projects/08-production-rag-chatbot/docs/en.md).

### What observability standards are recommended for LLM applications?

Adopt **OpenTelemetry GenAI semantic conventions** to instrument OpenAI, Anthropic, LangChain, and vLLM SDKs with auto-injected trace IDs and token usage metrics. Store high-cardinality span data in **ClickHouse** and relational metadata in **Postgres**, then visualize through **Next.js** dashboards for real-time cost attribution and SLO monitoring, as specified in [`phases/19-capstone-projects/11-llm-observability-dashboard/docs/en.md`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/phases/19-capstone-projects/11-llm-observability-dashboard/docs/en.md).