How to Deploy AI Models from the AI Engineering From Scratch Repository
Deploy AI models from the rohitg00/ai-engineering-from-scratch repository using a three-layer architecture of Docker containerization, FP8/INT4 quantization, and Kubernetes orchestration with queue-based autoscaling.
The rohitg00/ai-engineering-from-scratch curriculum teaches you to build AI models from first principles and deploy them in production environments. Whether you are serving a small prototype or a 600B-parameter MoE model, the repository provides a repeatable pipeline that combines containerization, model optimization, and scalable serving infrastructure.
The Three-Layer Deployment Pattern
Every capstone project in the repository follows a consistent deployment strategy. This architecture ensures reproducible builds and efficient GPU utilization at scale.
Containerization with Docker
The foundation is a multi-stage Docker image defined in phases/00-setup-and-tooling/07-docker-for-ai/code/Dockerfile. This container packages the model weights, Python dependencies, and serving framework—typically vLLM 0.7 or SGLang 0.4—into a portable artifact that runs identically across development workstations and production clusters. The Dockerfile uses layer caching to speed up rebuilds when only application code changes.
Model Quantization for GPU Efficiency
Before deployment, models are converted to efficient numeric formats to reduce memory footprint. According to phases/19-capstone-projects/14-speculative-decoding-server/outputs/skill-inference-server.md, you should quantize to FP8 (Marlin) or INT4 (AWQ). This reduction enables 70B-scale dense models and Mixture-of-Experts (MoE) architectures to fit on single H100 GPUs while preserving inference accuracy. The quantization pipeline supports both dense models (Llama 3 8B) and speculative-decoding-enabled configurations.
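The memory argument above can be checked with back-of-the-envelope arithmetic. The sketch below is not from the repository: the function names are hypothetical, the 80 GB figure assumes an H100 with 80 GB of memory, and it counts weights only, ignoring KV cache, activations, and quantization scales.

```python
# Approximate bytes needed to store one weight in each numeric format.
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


def fits_on_gpu(num_params: float, dtype: str, gpu_gb: float = 80.0) -> bool:
    """True if the weights alone are smaller than the GPU memory (assumed 80 GB)."""
    return weight_memory_gb(num_params, dtype) < gpu_gb


if __name__ == "__main__":
    for dtype in ("fp16", "fp8", "int4"):
        gb = weight_memory_gb(70e9, dtype)
        print(f"70B @ {dtype}: {gb:.0f} GB weights, fits: {fits_on_gpu(70e9, dtype)}")
```

For a 70B-parameter dense model this gives roughly 140 GB at 16-bit, 70 GB at FP8, and 35 GB at INT4, so only the quantized variants leave any headroom on a single 80 GB card. In practice the KV cache also needs room, which is why INT4 or a shorter context length may still be needed.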
Kubernetes Orchestration with Queue-Based Autoscaling
The production stack uses Kubernetes Deployments with Horizontal Pod Autoscalers (HPA). Unlike traditional CPU-based scaling, the repository configures HPA to scale on the queue_wait_ms metric. As documented in phases/17-infrastructure-and-production/28-self-hosted-serving-selection/docs/en.md, this approach prevents GPU saturation by scaling replica count based on request queue latency rather than processor utilization.
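To make the scaling behavior concrete, here is a minimal sketch of the replica calculation the Horizontal Pod Autoscaler documents: the desired count is the current count scaled by the ratio of observed to target metric, rounded up, with no change inside a tolerance band. The function name and the example numbers are illustrative assumptions; the real controller also applies stabilization windows and pod-readiness rules that are omitted here.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_metric: float,   # observed average queue_wait_ms per replica
    target_metric: float,    # target average queue_wait_ms (e.g. 50)
    min_replicas: int = 1,
    max_replicas: int = 10,
    tolerance: float = 0.1,  # Kubernetes' default tolerance band
) -> int:
    """Simplified HPA rule: ceil(current * observed / target), then clamp."""
    ratio = current_metric / target_metric
    # Inside the tolerance band the autoscaler leaves the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    # 2 replicas averaging 120 ms of queue wait against a 50 ms target.
    print(desired_replicas(2, 120.0, 50.0))
    # 4 replicas averaging 20 ms: the queue is short, so scale in.
    print(desired_replicas(4, 20.0, 50.0))
```

Because the input is queue latency rather than CPU, a backlog triggers a scale-out even when the pods' CPU usage looks idle.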
Step-by-Step Deployment Workflow
1. Build the Container Image
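Start by building the Docker image using the provided Dockerfile. This creates a reproducible environment with the CUDA runtime libraries and Python dependencies pinned. The NVIDIA driver itself is not part of the image; it is supplied by the host at run time.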
docker build -t ai-model:latest \
-f phases/00-setup-and-tooling/07-docker-for-ai/code/Dockerfile .
2. Quantize the Model Weights
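Convert your trained model to FP8 before serving to reduce GPU memory requirements. Treat the vllm.quantize entry point below as illustrative and confirm it against your installed vLLM version; vLLM also accepts --quantization fp8 on vllm serve to quantize at load time.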
python -m vllm.quantize \
--model ./models/llama3_8b \
--dtype fp8 \
--output ./models/llama3_8b_fp8
3. Launch vLLM with Speculative Decoding
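Start the inference server with EAGLE-3 speculative decoding enabled. This configuration provides continuous batching, paged attention, and draft-verification loops, which the repository reports as a 2–3× throughput gain with low tail latency. Speculative-decoding flag names have changed across vLLM releases, so check vllm serve --help for your version, and set --max-model-len to a context length the checkpoint actually supports.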
vllm serve ./models/llama3_8b_fp8 \
--port 80 \
--tensor-parallel-size 1 \
--enable-speculative-decoding \
--draft-model ./models/eagle3_draft \
--max-model-len 131072
4. Deploy to Kubernetes with HPA
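Apply the Kubernetes manifests to create a Deployment and HorizontalPodAutoscaler. The HPA references queue_wait_ms as an external metric, so the cluster needs a metrics adapter (for example, Prometheus Adapter or KEDA) that exposes it through the external metrics API. With that in place, the autoscaler adds replicas as soon as average queue wait exceeds the 50 ms target.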
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-inference
spec:
  replicas: 1
  selector:
    matchLabels:
      app: llm
  template:
    metadata:
      labels:
        app: llm
    spec:
      containers:
        - name: vllm
          image: ai-model:latest
          ports:
            - containerPort: 80
          env:
            - name: VLLM_SPEC_DECODING
              value: "true"
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-inference
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: External
      external:
        metric:
          name: queue_wait_ms
        target:
          type: AverageValue
          averageValue: "50"
Production Observability Setup
Instrument your deployment using OpenTelemetry to capture GenAI semantic conventions. The repository's observability blueprint in phases/19-capstone-projects/11-llm-observability-dashboard/outputs/skill-llm-observability.md recommends tracing all inference calls for end-to-end latency analysis and drift detection.
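from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.instrumentation.openai import OpenAIInstrumentor

# Register a global tracer provider; attach a span processor and exporter to ship spans.
trace.set_tracer_provider(TracerProvider())
# Patch the OpenAI SDK so each request emits a GenAI span.
OpenAIInstrumentor().instrument()
# Calls made through the OpenAI SDK, including to vLLM's OpenAI-compatible endpoint, now emit GenAI spans.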
Summary
- Containerize using the Dockerfile at phases/00-setup-and-tooling/07-docker-for-ai/code/Dockerfile to ensure reproducible environments across dev and production.
- Quantize models to FP8 (Marlin) or INT4 (AWQ) before deployment to fit large models on single GPUs.
- Serve using vLLM 0.7 with EAGLE-3 speculative decoding enabled for 2–3× throughput improvements.
- Orchestrate with Kubernetes HPA configured to scale on queue_wait_ms rather than CPU to prevent GPU saturation.
- Observe using OpenTelemetry with GenAI semantic conventions to trace requests and monitor model drift.
Frequently Asked Questions
What serving engine does the repository recommend for production?
The repository primarily uses vLLM 0.7 for production serving, with SGLang 0.4 as an alternative. Both engines support continuous batching and paged attention, but vLLM is featured in the speculative decoding capstone due to its native EAGLE-3 draft model support. See phases/17-infrastructure-and-production/28-self-hosted-serving-selection/docs/en.md for a decision matrix comparing latency and throughput characteristics.
How does speculative decoding improve inference performance?
Speculative decoding uses a smaller draft model (EAGLE-3) to propose several tokens ahead, which the main model then verifies in a single parallel pass. Every accepted draft token saves one sequential step of the main model. According to phases/19-capstone-projects/14-speculative-decoding-server/outputs/skill-inference-server.md, this reduces per-token latency and increases throughput by 2–3× compared to standard autoregressive generation, particularly for long-form text generation tasks.
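The draft-then-verify loop can be sketched with toy models. This is a simplified greedy variant, not EAGLE-3 and not vLLM's implementation: both "models" are hypothetical Python functions that map a token context to the next token, and verification is written as a loop where a real engine would use one batched forward pass.

```python
from typing import Callable, List, Tuple

Model = Callable[[List[int]], int]  # context -> next token (greedy)


def speculative_step(target: Model, draft: Model, context: List[int], k: int) -> List[int]:
    """One draft-then-verify step; returns the tokens to append."""
    # 1. The cheap draft model proposes k tokens autoregressively.
    proposed, ctx = [], list(context)
    for _ in range(k):
        tok = draft(ctx)
        proposed.append(tok)
        ctx.append(tok)
    # 2. The target checks each proposal (one batched pass in a real engine).
    accepted, ctx = [], list(context)
    for tok in proposed:
        expected = target(ctx)
        if tok != expected:
            accepted.append(expected)  # first mismatch: keep the target's token
            return accepted
        accepted.append(tok)
        ctx.append(tok)
    # 3. Every proposal matched, so the same pass yields one bonus token.
    accepted.append(target(ctx))
    return accepted


def generate(target: Model, draft: Model, prompt: List[int],
             max_new_tokens: int, k: int = 4) -> Tuple[List[int], int]:
    """Generate tokens; also return how many verify steps the target ran."""
    out, steps = list(prompt), 0
    while len(out) - len(prompt) < max_new_tokens:
        out.extend(speculative_step(target, draft, out, k))
        steps += 1
    return out[: len(prompt) + max_new_tokens], steps


def toy_target(ctx: List[int]) -> int:
    return (ctx[-1] + 1) % 10          # counts 0..9 and wraps


def toy_draft(ctx: List[int]) -> int:
    return 0 if ctx[-1] == 7 else (ctx[-1] + 1) % 10  # wrong after a 7


if __name__ == "__main__":
    tokens, steps = generate(toy_target, toy_draft, [0], max_new_tokens=12)
    print(tokens, steps)
```

In this toy run the output is identical to plain greedy decoding with the target, but the target runs 3 verify steps instead of 12 sequential ones. The speedup depends entirely on how often the draft agrees with the target, which is why gains vary by workload.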
Why scale on queue_wait_ms instead of CPU or GPU utilization?
Scaling on queue_wait_ms (request queue latency) prevents the Horizontal Pod Autoscaler from adding replicas too late. As implemented in the repository's Kubernetes manifests, this metric reflects actual user-perceived latency. CPU-based scaling fails for LLM inference because GPUs can be saturated while CPU usage remains low, leading to request backlog and timeout errors.
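As an illustration of what a queue_wait_ms signal measures, the sketch below records the gap between a request being enqueued and a worker starting it, then averages over a sliding window. The class and method names are hypothetical and this is not how vLLM or the repository export the metric; timestamps are passed in explicitly to keep the example deterministic.

```python
from collections import deque
from typing import Deque, Dict


class QueueWaitTracker:
    """Tracks how long requests sit in the queue before processing starts."""

    def __init__(self, window: int = 100) -> None:
        self._waits_ms: Deque[float] = deque(maxlen=window)  # recent waits only
        self._enqueued_at: Dict[str, float] = {}

    def enqueue(self, request_id: str, now_s: float) -> None:
        """Record the arrival time (seconds) of a request."""
        self._enqueued_at[request_id] = now_s

    def start(self, request_id: str, now_s: float) -> float:
        """Record that a worker picked up the request; returns its wait in ms."""
        wait_ms = (now_s - self._enqueued_at.pop(request_id)) * 1000.0
        self._waits_ms.append(wait_ms)
        return wait_ms

    def queue_wait_ms(self) -> float:
        """Average wait over the window; 0.0 when nothing has been observed."""
        if not self._waits_ms:
            return 0.0
        return sum(self._waits_ms) / len(self._waits_ms)


if __name__ == "__main__":
    tracker = QueueWaitTracker()
    tracker.enqueue("a", now_s=0.00)
    tracker.enqueue("b", now_s=0.00)
    tracker.start("a", now_s=0.02)   # waited about 20 ms
    tracker.start("b", now_s=0.10)   # waited about 100 ms behind "a"
    print(round(tracker.queue_wait_ms(), 1))
```

The value rises as soon as requests start stacking up behind a busy GPU, regardless of what CPU utilization reports, which is the property the autoscaler relies on.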
Is model quantization required before deployment?
While not strictly required, quantization to FP8 or INT4 is strongly recommended for production deployments of models larger than 7B parameters. The skill-inference-server.md file documents that FP8 Marlin quantization reduces GPU memory usage by 50% (relative to 16-bit weights) with minimal accuracy loss, enabling 70B-scale models to serve efficiently on a single H100 rather than requiring multi-node setups.