# How to Deploy AI Models from the AI Engineering From Scratch Repository

> Deploy AI models from rohitg00/ai-engineering-from-scratch using Docker, FP8/INT4 quantization, and Kubernetes with queue-based autoscaling.

- Repository: [Rohit Ghumare/ai-engineering-from-scratch](https://github.com/rohitg00/ai-engineering-from-scratch)
- Tags: how-to-guide
- Published: 2026-06-06

---

**Deploy AI models from the rohitg00/ai-engineering-from-scratch repository using a three-layer architecture of Docker containerization, FP8/INT4 quantization, and Kubernetes orchestration with queue-based autoscaling.**

The rohitg00/ai-engineering-from-scratch curriculum teaches you to build AI models from first principles and deploy them in production environments. Whether you are serving a small prototype or a 600B-parameter MoE model, the repository provides a repeatable pipeline that combines containerization, model optimization, and scalable serving infrastructure.

## The Three-Layer Deployment Pattern

Every capstone project in the repository follows a consistent deployment strategy. This architecture ensures reproducible builds and efficient GPU utilization at scale.

### Containerization with Docker

The foundation is a multi-stage Docker image defined in `phases/00-setup-and-tooling/07-docker-for-ai/code/Dockerfile`. This container packages the model weights, Python dependencies, and serving framework (typically **vLLM 0.7** or **SGLang 0.4**) into a portable artifact that runs identically across development workstations and production clusters. The Dockerfile uses layer caching to speed up rebuilds when only application code changes.

### Model Quantization for GPU Efficiency

Before deployment, models are converted to lower-precision numeric formats to reduce their memory footprint. According to [`phases/19-capstone-projects/14-speculative-decoding-server/outputs/skill-inference-server.md`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/phases/19-capstone-projects/14-speculative-decoding-server/outputs/skill-inference-server.md), you should quantize to **FP8 (Marlin)** or **INT4 (AWQ)**. This reduction lets 70B-scale dense models and Mixture-of-Experts (MoE) architectures fit on a single H100 GPU with minimal accuracy loss. The quantization pipeline supports both dense models (for example, Llama 3 8B) and speculative-decoding-enabled configurations.
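The memory argument can be checked with simple arithmetic. The sketch below counts weight bytes only and assumes the 80 GB H100 variant; it ignores the KV cache, activations, and the small overhead of quantization scales, so treat the numbers as lower bounds rather than a sizing guarantee.

```python
# Approximate bytes per parameter for each weight format (weights only).
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int4": 0.5}

H100_MEMORY_GB = 80  # assumption: the 80 GB H100 variant


def weight_memory_gb(num_params: float, fmt: str) -> float:
    """Return approximate weight memory in GB (10**9 bytes) for a dense model."""
    return num_params * BYTES_PER_PARAM[fmt] / 1e9


if __name__ == "__main__":
    for fmt in ("fp16", "fp8", "int4"):
        gb = weight_memory_gb(70e9, fmt)
        verdict = "fits" if gb < H100_MEMORY_GB else "does not fit"
        print(f"70B @ {fmt}: {gb:.0f} GB of weights, {verdict} on one H100")
```

For a 70B dense model this gives 140 GB at FP16, 70 GB at FP8, and 35 GB at INT4. FP8 leaves little headroom for the KV cache on a single 80 GB card, which is one reason INT4 is listed as an alternative.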

### Kubernetes Orchestration with Queue-Based Autoscaling

The production stack uses Kubernetes Deployments with Horizontal Pod Autoscalers (HPA). Unlike traditional CPU-based scaling, the repository configures HPA to scale on the `queue_wait_ms` metric. As documented in [`phases/17-infrastructure-and-production/28-self-hosted-serving-selection/docs/en.md`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/phases/17-infrastructure-and-production/28-self-hosted-serving-selection/docs/en.md), this approach prevents GPU saturation by scaling replica count based on request queue latency rather than processor utilization.
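To make the scaling behavior concrete, the following sketch reproduces the replica calculation that the Horizontal Pod Autoscaler documents: the desired count is the current count multiplied by the ratio of the observed metric to its target, rounded up, with no change while the ratio stays within a tolerance (0.1 by default). The function name and the default bounds are illustrative choices, not code from the repository.

```python
import math


def desired_replicas(current_replicas: int,
                     current_value: float,
                     target_value: float,
                     min_replicas: int = 1,
                     max_replicas: int = 10,
                     tolerance: float = 0.1) -> int:
    """Approximate the HPA rule: ceil(current_replicas * current / target)."""
    ratio = current_value / target_value
    # Within the tolerance band the HPA leaves the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the configured minReplicas / maxReplicas.
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    # Two replicas, 120 ms average queue wait, 50 ms target.
    print(desired_replicas(2, 120.0, 50.0))
```

With a 50 ms target, two replicas observing a 120 ms average queue wait scale to five, while a reading of 52 ms falls inside the tolerance band and changes nothing.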

## Step-by-Step Deployment Workflow

### 1. Build the Container Image

Start by building the Docker image using the provided Dockerfile. This creates a reproducible environment with the CUDA runtime libraries and Python dependencies pinned; the NVIDIA driver itself is provided by the host, not the image.

```bash
docker build -t ai-model:latest \
    -f phases/00-setup-and-tooling/07-docker-for-ai/code/Dockerfile .
```

### 2. Quantize the Model Weights

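Convert your trained model to FP8 before serving; this roughly halves weight memory relative to FP16. Treat the `vllm.quantize` entry point below as illustrative and confirm that it exists in your vLLM installation. vLLM can alternatively quantize to FP8 at load time with `--quantization fp8`.

```bash
# Illustrative invocation: verify that this module exists in your vLLM version.
python -m vllm.quantize \
    --model ./models/llama3_8b \
    --dtype fp8 \
    --output ./models/llama3_8b_fp8
```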

### 3. Launch vLLM with Speculative Decoding

Start the inference server with **EAGLE-3 speculative decoding** enabled. This configuration provides continuous batching, paged attention, and draft-verification loops that boost throughput 2–3× with low tail latency.

```bash
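# Speculative-decoding flags differ between vLLM releases; recent releases take a
# JSON --speculative-config. num_speculative_tokens=3 is an assumed starting value.
# A 131072-token context needs a long-context checkpoint; lower it for 8K models.
vllm serve ./models/llama3_8b_fp8 \
    --port 80 \
    --tensor-parallel-size 1 \
    --speculative-config '{"method": "eagle3", "model": "./models/eagle3_draft", "num_speculative_tokens": 3}' \
    --max-model-len 131072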
```
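Once the server is running, it can be exercised through vLLM's OpenAI-compatible HTTP API. The sketch below uses only the standard library and assumes the server is reachable at `http://localhost:80` and that the served model name is the path passed to `vllm serve`; the helper names are hypothetical.

```python
import json
import urllib.request


def build_completion_request(prompt: str,
                             model: str = "./models/llama3_8b_fp8",
                             base_url: str = "http://localhost:80",
                             max_tokens: int = 64) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible /v1/completions route."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt: str) -> str:
    """Send the request and return the generated text (needs a running server)."""
    with urllib.request.urlopen(build_completion_request(prompt), timeout=30) as resp:
        body = json.load(resp)
    return body["choices"][0]["text"]
```

Calling `complete("Explain paged attention in one sentence.")` against a live server returns the generated text. Separating request construction from sending keeps the payload easy to inspect without a GPU.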

### 4. Deploy to Kubernetes with HPA

Apply the Kubernetes manifests to create a Deployment and HorizontalPodAutoscaler. The HPA configuration references the custom `queue_wait_ms` metric to ensure responsive scaling under load.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-inference
spec:
  replicas: 1
  selector:
    matchLabels:
      app: llm
  template:
    metadata:
      labels:
        app: llm
    spec:
      containers:
      - name: vllm
        image: ai-model:latest
        ports:
        - containerPort: 80
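        env:
        # Check that the image's entrypoint reads this toggle; it is not a standard vLLM setting.
        - name: VLLM_SPEC_DECODING
          value: "true"
        resources:
          limits:
            nvidia.com/gpu: 1  # one GPU per replica; requires the NVIDIA device plugin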
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-inference
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: External
    external:
      metric:
        name: queue_wait_ms
      target:
        type: AverageValue
        averageValue: "50"
```

## Production Observability Setup

Instrument your deployment with OpenTelemetry so that inference calls emit spans following the GenAI semantic conventions. The repository's observability blueprint in [`phases/19-capstone-projects/11-llm-observability-dashboard/outputs/skill-llm-observability.md`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/phases/19-capstone-projects/11-llm-observability-dashboard/outputs/skill-llm-observability.md) recommends tracing all inference calls for end-to-end latency analysis and drift detection.

```python
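from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter
from opentelemetry.instrumentation.openai import OpenAIInstrumentor

# Register a global tracer provider; swap ConsoleSpanExporter for an OTLP exporter in production.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

# Patch the OpenAI SDK so that each call emits a span.
OpenAIInstrumentor().instrument()

# Calls made through the OpenAI SDK, including calls to vLLM's OpenAI-compatible endpoint, now emit GenAI spans.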
```

## Summary

- **Containerize** using the Dockerfile at `phases/00-setup-and-tooling/07-docker-for-ai/code/Dockerfile` to ensure reproducible environments across dev and production.
- **Quantize** models to FP8 (Marlin) or INT4 (AWQ) before deployment to fit large models on single GPUs.
- **Serve** using vLLM 0.7 with EAGLE-3 speculative decoding enabled for 2–3× throughput improvements.
- **Orchestrate** with Kubernetes HPA configured to scale on `queue_wait_ms` rather than CPU to prevent GPU saturation.
- **Observe** using OpenTelemetry with GenAI semantic conventions to trace requests and monitor model drift.

## Frequently Asked Questions

### What serving engine does the repository recommend for production?

The repository primarily uses **vLLM 0.7** for production serving, with **SGLang 0.4** as an alternative. Both engines support continuous batching and paged attention, but vLLM is featured in the speculative decoding capstone due to its native EAGLE-3 draft model support. See [`phases/17-infrastructure-and-production/28-self-hosted-serving-selection/docs/en.md`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/phases/17-infrastructure-and-production/28-self-hosted-serving-selection/docs/en.md) for a decision matrix comparing latency and throughput characteristics.

### How does speculative decoding improve inference performance?

Speculative decoding uses a smaller draft model (EAGLE-3) to propose several tokens, which the main model then verifies in a single parallel pass. According to [`phases/19-capstone-projects/14-speculative-decoding-server/outputs/skill-inference-server.md`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/phases/19-capstone-projects/14-speculative-decoding-server/outputs/skill-inference-server.md), this reduces per-token latency and increases throughput by 2–3× compared to standard autoregressive generation, particularly for long-form text generation tasks.
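The draft-and-verify loop can be illustrated with a toy greedy version. This is a simplified sketch, not EAGLE-3 itself: both "models" are small deterministic functions invented for the example, verification runs sequentially where a real engine scores all drafted positions in one batched forward pass, and sampling-based acceptance is omitted.

```python
from typing import Callable, List, Tuple

Model = Callable[[List[int]], int]  # maps a token context to the next token (greedy)


def target_model(ctx: List[int]) -> int:
    # Toy "large" model: next token is the context sum modulo 7.
    return sum(ctx) % 7


def draft_model(ctx: List[int]) -> int:
    # Toy "small" model: matches the target unless the context length is a multiple of 4.
    return (sum(ctx) + (1 if len(ctx) % 4 == 0 else 0)) % 7


def greedy_generate(prompt: List[int], target: Model, num_new: int) -> List[int]:
    """Baseline: one target call per generated token."""
    out = list(prompt)
    for _ in range(num_new):
        out.append(target(out))
    return out


def speculative_generate(prompt: List[int], target: Model, draft: Model,
                         num_new: int, k: int = 3) -> Tuple[List[int], int]:
    """Return the generated sequence and the number of verification passes."""
    out = list(prompt)
    passes = 0
    while len(out) - len(prompt) < num_new:
        # 1. The draft model proposes k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(out + proposal))
        # 2. The target verifies the proposal (one batched pass in a real engine).
        passes += 1
        accepted: List[int] = []
        for tok in proposal:
            expected = target(out + accepted)
            if tok != expected:
                accepted.append(expected)  # correct the first mismatch, drop the rest
                break
            accepted.append(tok)
        else:
            # Every drafted token matched, so the target contributes a bonus token.
            accepted.append(target(out + accepted))
        out.extend(accepted)
    return out[: len(prompt) + num_new], passes
```

Because every accepted token is checked against the target, the output is identical to plain greedy decoding; the gain comes from needing fewer target passes than generated tokens whenever the draft model agrees often.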

### Why scale on queue_wait_ms instead of CPU or GPU utilization?

Scaling on `queue_wait_ms` (request queue latency) prevents the Horizontal Pod Autoscaler from adding replicas too late. As implemented in the repository's Kubernetes manifests, this metric reflects actual user-perceived latency. CPU-based scaling fails for LLM inference because GPUs can be saturated while CPU usage remains low, leading to request backlog and timeout errors.
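One way such a metric could be produced is a rolling average of the time each request spends waiting before a worker starts it. The class below is a hypothetical sketch; how the repository actually computes and exports `queue_wait_ms` is not shown here, and timestamps are passed in explicitly so the logic stays easy to test.

```python
from collections import deque
from typing import Deque, Dict


class QueueWaitTracker:
    """Tracks how long requests wait in the queue before processing starts."""

    def __init__(self, window: int = 100) -> None:
        self._enqueued: Dict[str, float] = {}          # request id -> enqueue time (seconds)
        self._waits_ms: Deque[float] = deque(maxlen=window)  # most recent waits only

    def enqueue(self, request_id: str, now: float) -> None:
        """Record the moment a request enters the queue."""
        self._enqueued[request_id] = now

    def start(self, request_id: str, now: float) -> None:
        """Record the moment a worker picks the request up."""
        waited_s = now - self._enqueued.pop(request_id)
        self._waits_ms.append(waited_s * 1000.0)

    def queue_wait_ms(self) -> float:
        """Mean wait over the rolling window, in milliseconds."""
        if not self._waits_ms:
            return 0.0
        return sum(self._waits_ms) / len(self._waits_ms)
```

In a real server the timestamps would come from `time.monotonic()`, and the resulting value would be published through a metrics adapter so the HPA can read it as an external metric.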

### Is model quantization required before deployment?

While not strictly required, quantization to FP8 or INT4 is strongly recommended for production deployments of models larger than 7B parameters. The [`skill-inference-server.md`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/phases/19-capstone-projects/14-speculative-decoding-server/outputs/skill-inference-server.md) file documents that FP8 Marlin quantization reduces GPU memory usage by about 50% relative to FP16 with minimal accuracy loss, enabling 70B-scale models to be served on a single H100 rather than requiring a multi-node setup.