
How to Deploy AI Models from the AI Engineering From Scratch Repository

June 6, 2026 · rohitg00/ai-engineering-from-scratch

Deploy AI models from the rohitg00/ai-engineering-from-scratch repository using a three-layer architecture of Docker containerization, FP8/INT4 quantization, and Kubernetes orchestration with queue-based autoscaling.

The rohitg00/ai-engineering-from-scratch curriculum teaches you to build AI models from first principles and deploy them in production environments. Whether you are serving a small prototype or a 600B-parameter MoE model, the repository provides a repeatable pipeline that combines containerization, model optimization, and scalable serving infrastructure.

The Three-Layer Deployment Pattern

Every capstone project in the repository follows a consistent deployment strategy. This architecture ensures reproducible builds and efficient GPU utilization at scale.

Containerization with Docker

The foundation is a multi-stage Docker image defined in phases/00-setup-and-tooling/07-docker-for-ai/code/Dockerfile. This container packages the model weights, Python dependencies, and serving framework (typically vLLM 0.7 or SGLang 0.4) into a portable artifact that runs identically across development workstations and production clusters. The Dockerfile uses layer caching to speed up rebuilds when only application code changes.

Model Quantization for GPU Efficiency

Before deployment, models are converted to efficient numeric formats to reduce memory footprint. According to phases/19-capstone-projects/14-speculative-decoding-server/outputs/skill-inference-server.md, you should quantize to FP8 (Marlin) or INT4 (AWQ). This reduction enables 70B-scale dense models and Mixture-of-Experts (MoE) architectures to fit on single H100 GPUs while preserving inference accuracy. The quantization pipeline supports both dense models (Llama 3 8B) and speculative-decoding-enabled configurations.
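
The single-GPU claim above can be sanity-checked with back-of-the-envelope arithmetic. The sketch below is not from the repository: the function names are hypothetical, the 80 GB figure assumes an 80 GB H100 variant, and the estimate covers weights only (no KV cache, activations, or quantization scales).

```python
# Rough weight-memory estimate: parameter count x bytes per parameter.
# Assumption: ignores KV cache, activations, and quantization metadata.
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params: float, fmt: str) -> float:
    """Approximate weight size in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[fmt] / 1e9


def fits_on_gpu(num_params: float, fmt: str, gpu_memory_gb: float = 80.0) -> bool:
    """True if the weights alone fit; real deployments also need KV-cache headroom."""
    return weight_memory_gb(num_params, fmt) < gpu_memory_gb


if __name__ == "__main__":
    for fmt in ("fp16", "fp8", "int4"):
        size = weight_memory_gb(70e9, fmt)
        print(f"70B @ {fmt}: {size:.0f} GB, fits on 80 GB: {fits_on_gpu(70e9, fmt)}")
```

Under these assumptions a 70B model needs about 140 GB in FP16, 70 GB in FP8, and 35 GB in INT4, so FP8 leaves little room for the KV cache on one 80 GB card while INT4 leaves considerably more.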

Kubernetes Orchestration with Queue-Based Autoscaling

The production stack uses Kubernetes Deployments with Horizontal Pod Autoscalers (HPA). Unlike traditional CPU-based scaling, the repository configures HPA to scale on the queue_wait_ms metric. As documented in phases/17-infrastructure-and-production/28-self-hosted-serving-selection/docs/en.md, this approach prevents GPU saturation by scaling replica count based on request queue latency rather than processor utilization.
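
To make the metric concrete, here is one way a serving process might compute a queue-wait figure. This is a minimal sketch under stated assumptions: the class `QueueWaitTracker` and its methods are hypothetical, and the repository may define queue_wait_ms differently (for example, as a percentile rather than a rolling mean).

```python
from collections import deque


class QueueWaitTracker:
    """Rolling mean of how long requests wait before a worker starts them."""

    def __init__(self, window: int = 100) -> None:
        self._enqueued_at: dict[str, float] = {}
        # Only the most recent `window` waits contribute to the average.
        self._waits_ms: deque[float] = deque(maxlen=window)

    def enqueue(self, request_id: str, now_s: float) -> None:
        """Record the time (in seconds) at which a request entered the queue."""
        self._enqueued_at[request_id] = now_s

    def start(self, request_id: str, now_s: float) -> float:
        """Record that processing began; return this request's wait in ms."""
        wait_ms = (now_s - self._enqueued_at.pop(request_id)) * 1000.0
        self._waits_ms.append(wait_ms)
        return wait_ms

    def queue_wait_ms(self) -> float:
        """Value a metrics exporter could publish for the autoscaler."""
        if not self._waits_ms:
            return 0.0
        return sum(self._waits_ms) / len(self._waits_ms)
```

Timestamps are passed in explicitly so the tracker is easy to test; in a server you would typically supply `time.monotonic()`.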

Step-by-Step Deployment Workflow

1. Build the Container Image

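Start by building the Docker image using the provided Dockerfile. This creates a reproducible environment with the CUDA runtime libraries and Python dependencies locked. The NVIDIA GPU driver itself is not part of the image; it is supplied by the host at run time.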

docker build -t ai-model:latest \
    -f phases/00-setup-and-tooling/07-docker-for-ai/code/Dockerfile .

2. Quantize the Model Weights

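Convert your trained model to FP8 before serving. This step is essential for reducing GPU memory requirements. Treat the module name and flags below as placeholders to verify against your installed vLLM release: FP8 checkpoints are often produced with a separate quantization tool, and vLLM can also quantize to FP8 at load time when vllm serve is given --quantization fp8.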

python -m vllm.quantize \
    --model ./models/llama3_8b \
    --dtype fp8 \
    --output ./models/llama3_8b_fp8

3. Launch vLLM with Speculative Decoding

Start the inference server with EAGLE-3 speculative decoding enabled. This configuration provides continuous batching, paged attention, and draft-verification loops that boost throughput 2–3× with low tail latency.

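# Flag names differ across vLLM releases; confirm with `vllm serve --help`.
# Recent releases take speculative settings as JSON via --speculative-config;
# num_speculative_tokens=3 is an illustrative value, not a tuned one.
# Keep --max-model-len within the context window your checkpoint supports.
vllm serve ./models/llama3_8b_fp8 \
    --port 80 \
    --tensor-parallel-size 1 \
    --speculative-config '{"method": "eagle3", "model": "./models/eagle3_draft", "num_speculative_tokens": 3}' \
    --max-model-len 131072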
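
Once the server is up, a quick smoke test helps confirm it answers requests. The sketch below uses only the Python standard library and targets vLLM's OpenAI-compatible /v1/completions endpoint. The helper name, the localhost URL, and the max_tokens default are assumptions for illustration; the model name assumes vLLM's default of serving the model under the path passed to vllm serve.

```python
import json
import urllib.request


def build_completion_request(prompt: str,
                             base_url: str = "http://localhost:80",
                             model: str = "./models/llama3_8b_fp8",
                             max_tokens: int = 64) -> urllib.request.Request:
    """Build (but do not send) a POST to the OpenAI-compatible completions route."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # Requires the server from the step above to be running and reachable.
    req = build_completion_request("Explain paged attention in one sentence.")
    with urllib.request.urlopen(req, timeout=30) as resp:
        body = json.load(resp)
    print(body["choices"][0]["text"])
```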

4. Deploy to Kubernetes with HPA

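Apply the Kubernetes manifests to create a Deployment and HorizontalPodAutoscaler. The HPA configuration references the custom queue_wait_ms metric to ensure responsive scaling under load. Because queue_wait_ms is an External metric, the cluster needs a metrics adapter (for example Prometheus Adapter or KEDA) that exposes it to the HPA; without one, the autoscaler cannot read the value.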

apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-inference
spec:
  replicas: 1
  selector:
    matchLabels:
      app: llm
  template:
    metadata:
      labels:
        app: llm
    spec:
      containers:
      - name: vllm
        image: ai-model:latest
        ports:
        - containerPort: 80
        resources:
          limits:
            nvidia.com/gpu: 1  # assumes the NVIDIA device plugin is installed
        env:
        - name: VLLM_SPEC_DECODING
          value: "true"
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-inference
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: External
    external:
      metric:
        name: queue_wait_ms
      target:
        type: AverageValue
        averageValue: "50"
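
To see how the manifest's numbers translate into replica counts, the sketch below applies the scaling rule from the Kubernetes HPA documentation, desired = ceil(current x currentValue / targetValue), with the documented default tolerance of 0.1. The function name is hypothetical, the defaults mirror the manifest above (target 50, 1 to 10 replicas), and the real controller adds stabilization windows and scaling policies that are not modeled here.

```python
import math


def desired_replicas(current_replicas: int,
                     current_avg_value: float,
                     target_avg_value: float = 50.0,
                     min_replicas: int = 1,
                     max_replicas: int = 10,
                     tolerance: float = 0.1) -> int:
    """Approximate the HPA decision for a per-replica average metric."""
    ratio = current_avg_value / target_avg_value
    # Within tolerance of the target, the controller leaves the count unchanged.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the bounds declared in the HorizontalPodAutoscaler spec.
    return max(min_replicas, min(max_replicas, desired))
```

For example, two replicas averaging 120 ms of queue wait against the 50 ms target give ceil(2 x 2.4) = 5 replicas, while an average of 52 ms falls inside the tolerance band and triggers no change.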

Production Observability Setup

Instrument your deployment using OpenTelemetry to capture GenAI semantic conventions. The repository's observability blueprint in phases/19-capstone-projects/11-llm-observability-dashboard/outputs/skill-llm-observability.md recommends tracing all inference calls for end-to-end latency analysis and drift detection.

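from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter
from opentelemetry.instrumentation.openai import OpenAIInstrumentor

# Register a global tracer provider; swap ConsoleSpanExporter for an OTLP exporter in production.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

# Patch the OpenAI SDK so each call emits a GenAI span.
OpenAIInstrumentor().instrument()

# Calls made through the OpenAI SDK, including those sent to vLLM's OpenAI-compatible endpoint, now emit GenAI spans.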
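
For code paths the auto-instrumentation does not cover, span attributes can be set by hand. The helper below is a hypothetical sketch, not part of the repository: it builds a plain dictionary keyed by attribute names from the OpenTelemetry GenAI semantic conventions. Those conventions are still evolving, so check the names against the version you target.

```python
def genai_span_attributes(request_model: str,
                          input_tokens: int,
                          output_tokens: int,
                          operation: str = "chat") -> dict:
    """Return GenAI semantic-convention attributes for one inference call."""
    return {
        "gen_ai.operation.name": operation,
        "gen_ai.request.model": request_model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
    }
```

The resulting dictionary can be passed to a span's set_attributes method inside a manually created span.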

Summary

Containerize using the Dockerfile at phases/00-setup-and-tooling/07-docker-for-ai/code/Dockerfile to ensure reproducible environments across dev and production.
Quantize models to FP8 (Marlin) or INT4 (AWQ) before deployment to fit large models on single GPUs.
Serve using vLLM 0.7 with EAGLE-3 speculative decoding enabled for 2–3× throughput improvements.
Orchestrate with Kubernetes HPA configured to scale on queue_wait_ms rather than CPU to prevent GPU saturation.
Observe using OpenTelemetry with GenAI semantic conventions to trace requests and monitor model drift.

Frequently Asked Questions

Which inference engine does the repository use?

The repository primarily uses vLLM 0.7 for production serving, with SGLang 0.4 as an alternative. Both engines support continuous batching and paged attention, but vLLM is featured in the speculative decoding capstone due to its native EAGLE-3 draft model support. See phases/17-infrastructure-and-production/28-self-hosted-serving-selection/docs/en.md for a decision matrix comparing latency and throughput characteristics.

How does speculative decoding improve inference performance?

Speculative decoding uses a smaller draft model (EAGLE-3) to predict tokens that are then verified by the main model in parallel. According to phases/19-capstone-projects/14-speculative-decoding-server/outputs/skill-inference-server.md, this reduces per-step latency and increases throughput by 2–3× compared to standard autoregressive generation, particularly for long-form text generation tasks.
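
The draft-then-verify idea can be illustrated with a toy greedy version. This is a simplified sketch, not EAGLE-3 and not vLLM's implementation: the two "models" are hypothetical integer functions, verification is written as a loop where a real engine would use one batched forward pass, and the bonus token that real implementations sample after a fully accepted draft is omitted.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]


def greedy_decode(target: NextToken, prompt: List[int], num_new: int) -> List[int]:
    """Baseline: one target call per generated token."""
    tokens = list(prompt)
    for _ in range(num_new):
        tokens.append(target(tokens))
    return tokens


def speculative_decode(target: NextToken, draft: NextToken,
                       prompt: List[int], num_new: int, k: int = 4) -> List[int]:
    """Greedy speculative decoding: accept the draft's longest correct prefix."""
    tokens = list(prompt)
    goal = len(prompt) + num_new
    while len(tokens) < goal:
        # 1. The cheap draft model proposes up to k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(min(k, goal - len(tokens))):
            proposal.append(draft(tokens + proposal))
        # 2. The target checks each proposed position in order.
        accepted: List[int] = []
        for tok in proposal:
            expected = target(tokens + accepted)
            if tok != expected:
                accepted.append(expected)  # first mismatch: keep the target's token
                break
            accepted.append(tok)
        tokens.extend(accepted)
    return tokens


def toy_target(ctx: List[int]) -> int:
    """Stand-in for the large model: a deterministic rule on the last token."""
    return (ctx[-1] * 3 + 1) % 7


def toy_draft(ctx: List[int]) -> int:
    """Stand-in for the draft model: agrees with the target except at some positions."""
    return toy_target(ctx) if len(ctx) % 3 else 0
```

Because every emitted token is either confirmed or supplied by the target, the output matches plain greedy decoding exactly; the gain comes from verifying several draft tokens per target pass when the draft is usually right.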

Why scale on queue_wait_ms instead of CPU or GPU utilization?

Scaling on queue_wait_ms (request queue latency) prevents the Horizontal Pod Autoscaler from adding replicas too late. As implemented in the repository's Kubernetes manifests, this metric reflects actual user-perceived latency. CPU-based scaling fails for LLM inference because GPUs can be saturated while CPU usage remains low, leading to request backlog and timeout errors.

Is model quantization required before deployment?

While not strictly required, quantization to FP8 or INT4 is strongly recommended for production deployments of models larger than 7B parameters. The skill-inference-server.md file documents that FP8 Marlin quantization reduces GPU memory usage by 50% with minimal accuracy loss. That figure is consistent with storing each weight in one byte instead of the two bytes used by 16-bit formats, and it enables 70B-scale models to serve efficiently on single H100 instances rather than requiring multi-node setups.
