
How Model Training Is Handled in ai-engineering-from-scratch: From NumPy Loops to Distributed ZeRO

June 6, 2026 rohitg00/ai-engineering-from-scratch

Model training in ai-engineering-from-scratch is taught by building raw NumPy training loops first, then layering in PyTorch best practices like gradient clipping, mixed precision, and distributed ZeRO sharding.

The ai-engineering-from-scratch repository by rohitg00 teaches model training through a “Build It → Use It → Ship It” progression. Instead of calling high-level trainer APIs, learners handcraft forward passes, backward passes, and parameter updates in pure NumPy before graduating to production-grade PyTorch loops. This approach demystifies how gradients flow, how optimizers update weights, and how distributed training scales across GPUs.

Model Training Patterns in ai-engineering-from-scratch

Every lesson follows the same six-beat pattern. A tiny, self-contained model is built with only the standard library and NumPy.

Next, a handcrafted training loop computes forward passes, loss, gradients, and parameter updates. Optional enhancements such as gradient clipping, mixed-precision, and learning-rate schedules are introduced in later capstones.

Finally, the same architecture is instantiated with PyTorch to reveal the exact correspondence between manual math and library calls.

All loops share a common backbone:


# 1️⃣ Forward pass
logits = model(x)

# 2️⃣ Compute loss
loss = loss_fn(logits, y)

# 3️⃣ Back-prop
loss.backward()

# 4️⃣ Optimizer step
optimizer.step()
optimizer.zero_grad()

According to the ai-engineering-from-scratch source code, the curriculum demonstrates exactly what each line does by re-implementing operations like layernorm_backward and ffn_backward before showing the concise PyTorch equivalent.
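To make the correspondence concrete, here is a minimal sketch of the same four steps written by hand for a one-feature linear model. It uses plain Python rather than NumPy or PyTorch, and the names `train_linear`, `w`, and `b` are illustrative assumptions, not code from the repository.

```python
def train_linear(xs, ys, lr=0.05, num_steps=500):
    """Fit y = w*x + b with hand-derived gradients (illustrative, not repo code)."""
    w, b = 0.0, 0.0
    n = len(xs)
    losses = []
    for _ in range(num_steps):
        # 1. Forward pass
        preds = [w * x + b for x in xs]
        # 2. Mean squared error loss
        errors = [p - y for p, y in zip(preds, ys)]
        loss = sum(e * e for e in errors) / n
        losses.append(loss)
        # 3. Backward pass: dL/dw and dL/db via the chain rule
        grad_w = sum(2.0 * e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(2.0 * e for e in errors) / n
        # 4. Parameter update (plain SGD)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b, losses


xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # generated by y = 2x + 1
w, b, losses = train_linear(xs, ys)
print(round(w, 3), round(b, 3), losses[0], losses[-1])
```

Steps 3 and 4 are what `loss.backward()` and `optimizer.step()` stand in for in the PyTorch backbone; there is nothing to zero here because the gradients are recomputed from scratch on every iteration.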

Mini-GPT Pre-Training with Manual Backpropagation

In [phases/10-llms-from-scratch/04-pre-training-mini-gpt/code/main.py](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/phases/10-llms-from-scratch/04-pre-training-mini-gpt/code/main.py), the repository implements a complete transformer decoder using only NumPy. The train_mini_gpt function exposes every step of the optimization process:


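# phases/10-llms-from-scratch/04-pre-training-mini-gpt/code/main.py
# Excerpt: MiniGPT, cross_entropy, compute_grads, and update_parameters are not shown here.

import numpy as np
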
def train_mini_gpt(text, vocab_size=256, embed_dim=128,
                   num_heads=4, num_layers=4, seq_len=64,
                   num_steps=200, lr=3e-4):
    tokens = np.array(list(text.encode()))[:2048]
    model = MiniGPT(vocab_size, embed_dim, num_heads,
                    num_layers, max_seq_len=seq_len,
                    ff_dim=embed_dim * 4)

    for step in range(num_steps):
        start = np.random.randint(0, len(tokens) - seq_len - 1)
        x = tokens[start:start+seq_len].reshape(1, -1)
        y = tokens[start+1:start+seq_len+1].reshape(1, -1)

        logits = model.forward(x)
        loss = cross_entropy(logits, y)
        grads = compute_grads(loss, model)          # manual chain rule

        update_parameters(model, grads, lr)         # SGD step

This lesson defines layernorm_backward and ffn_backward explicitly so learners see the chain rule in action. There is no autograd; every gradient is derived by hand and applied through update_parameters.
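As a rough illustration of what a hand-derived layer-norm gradient involves, the sketch below implements the forward and backward pass for a single vector in plain Python and checks the result against finite differences. The function names (`layernorm_forward`, `layernorm_backward_sketch`, `numeric_dx`) and their signatures are assumptions made for this example; the repository's own `layernorm_backward` may be organized differently.

```python
import math


def layernorm_forward(x, gamma, beta, eps=1e-5):
    """Normalize one vector, then scale and shift. Returns output and a cache."""
    n = len(x)
    mu = sum(x) / n
    var = sum((v - mu) ** 2 for v in x) / n
    inv_std = 1.0 / math.sqrt(var + eps)
    xhat = [(v - mu) * inv_std for v in x]
    y = [g * h + b for g, h, b in zip(gamma, xhat, beta)]
    return y, (xhat, inv_std)


def layernorm_backward_sketch(dy, gamma, cache):
    """Hand-derived gradients w.r.t. the input, gamma, and beta (hypothetical helper)."""
    xhat, inv_std = cache
    n = len(dy)
    dxhat = [d * g for d, g in zip(dy, gamma)]
    sum_dxhat = sum(dxhat)
    sum_dxhat_xhat = sum(d * h for d, h in zip(dxhat, xhat))
    # Chain rule through the mean and the variance, folded into one expression
    dx = [inv_std / n * (n * d - sum_dxhat - h * sum_dxhat_xhat)
          for d, h in zip(dxhat, xhat)]
    dgamma = [d * h for d, h in zip(dy, xhat)]
    dbeta = list(dy)
    return dx, dgamma, dbeta


def numeric_dx(x, gamma, beta, dy, h=1e-6):
    """Central finite differences of sum(dy * y) with respect to each x[i]."""
    grads = []
    for i in range(len(x)):
        xp = list(x)
        xp[i] += h
        xm = list(x)
        xm[i] -= h
        yp, _ = layernorm_forward(xp, gamma, beta)
        ym, _ = layernorm_forward(xm, gamma, beta)
        fp = sum(d * v for d, v in zip(dy, yp))
        fm = sum(d * v for d, v in zip(dy, ym))
        grads.append((fp - fm) / (2 * h))
    return grads


x = [0.5, -1.2, 3.0, 0.1]
gamma = [1.0, 0.5, 2.0, 1.5]
beta = [0.0, 0.1, -0.2, 0.3]
dy = [0.3, -0.7, 1.1, 0.2]  # pretend upstream gradient
_, cache = layernorm_forward(x, gamma, beta)
dx, dgamma, dbeta = layernorm_backward_sketch(dy, gamma, cache)
max_err = max(abs(a - b) for a, b in zip(dx, numeric_dx(x, gamma, beta, dy)))
print(max_err)
```

A finite-difference check like `numeric_dx` is a common way to gain confidence in a manual gradient when no autograd is available to compare against.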

Gradient Clipping and Automatic Mixed Precision (AMP)

Once the fundamentals are solid, the curriculum moves to production-grade PyTorch training. In [phases/19-capstone-projects/45-gradient-clipping-amp/code/main.py](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/phases/19-capstone-projects/45-gradient-clipping-amp/code/main.py), the AmpTrainState class wraps an AdamW optimizer with automatic mixed precision and gradient clipping:


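# phases/19-capstone-projects/45-gradient-clipping-amp/code/main.py

import torch
import torch.nn.functional as F

class AmpTrainState:
    def __init__(self, model, lr=1e-3, fp16=False):
        self.model = model
        self.opt = torch.optim.AdamW(model.parameters(), lr=lr)
        # With enabled=False the scaler becomes a pass-through
        self.scaler = torch.cuda.amp.GradScaler(enabled=fp16)

    def run_step(self, x, y):
        # Forward pass runs in reduced precision only when scaling is enabled
        with torch.cuda.amp.autocast(enabled=self.scaler.is_enabled()):
            logits = self.model(x)
            loss = F.cross_entropy(logits, y)
        self.scaler.scale(loss).backward()
        # Unscale first so max_norm applies to the true gradient norm
        self.scaler.unscale_(self.opt)
        torch.nn.utils.clip_grad_norm_(self.model.parameters(), max_norm=1.0)
        self.scaler.step(self.opt)
        self.scaler.update()
        self.opt.zero_grad()
        return loss.item()

The run_step method sequences the entire mixed-precision step: run the forward pass under autocast, scale the loss, back-propagate, unscale the gradients, clip their norm, step the optimizer through the scaler, and update the scale factor. The unscale_ call matters because clip_grad_norm_ would otherwise measure gradients that still carry the loss-scale factor, making max_norm=1.0 far stricter than intended.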
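For intuition about the clipping step, the following is a minimal plain-Python sketch of global-norm clipping, the idea behind `torch.nn.utils.clip_grad_norm_`. The helper name `clip_grad_norm_sketch` and the list-of-lists gradient layout are assumptions for illustration only. When a GradScaler is in use, gradients carry the loss-scale factor until they are unscaled, so the norm should be measured on unscaled gradients.

```python
import math


def clip_grad_norm_sketch(grads, max_norm, eps=1e-6):
    """Rescale gradients in place so their combined L2 norm is at most max_norm.

    `grads` is a list of per-parameter gradient lists (hypothetical layout).
    Returns the norm measured before clipping.
    """
    total_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    # Never scale up: small gradients pass through unchanged
    scale = min(1.0, max_norm / (total_norm + eps))
    for grad in grads:
        for i in range(len(grad)):
            grad[i] *= scale
    return total_norm


grads = [[3.0, 0.0], [0.0, 4.0]]  # combined norm is 5
norm_before = clip_grad_norm_sketch(grads, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for grad in grads for g in grad))

small = [[0.1, 0.2]]  # already under the limit
clip_grad_norm_sketch(small, max_norm=1.0)
print(norm_before, norm_after, small)
```

Because every gradient is multiplied by the same factor, clipping changes the length of the update but not its direction.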

Learning Rate Scheduling and Warmup

Before scaling across GPUs, the repository covers learning-rate scheduling. phases/19-capstone-projects/44-cosine-lr-warmup/code/main.py implements a schedule that ramps the learning rate up linearly during a warmup phase and then decays it along a cosine curve. This pattern is reused in later distributed training scripts to stabilize early steps.
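A schedule of this shape can be sketched in a few lines. The function name `lr_at_step` and the default values for `base_lr`, `min_lr`, `warmup_steps`, and `total_steps` are assumptions chosen for the example, not values taken from the lesson.

```python
import math


def lr_at_step(step, base_lr=3e-4, min_lr=3e-5,
               warmup_steps=100, total_steps=1000):
    """Linear warmup to base_lr, then cosine decay to min_lr (illustrative)."""
    if step < warmup_steps:
        # Ramp from base_lr / warmup_steps up to base_lr
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)  # hold at min_lr past total_steps
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (base_lr - min_lr) * cosine


for step in (0, 50, 99, 100, 550, 1000):
    print(step, lr_at_step(step))
```

In a training loop, the returned value would be written into the optimizer's learning rate before each step; warmup keeps the first updates small while gradient statistics are still noisy.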

Distributed Training with ZeRO-1 Sharding

The most advanced training lesson lives in [phases/19-capstone-projects/81-end-to-end-distributed-train/code/main.py](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/phases/19-capstone-projects/81-end-to-end-distributed-train/code/main.py). It combines Distributed Data Parallel (DDP) with a custom ZeRO-1 optimizer to cut per-GPU memory use during training.

As implemented in rohitg00/ai-engineering-from-scratch, the ZeroOptimizer class gives each rank its own shard and all-reduces gradients across ranks before applying the update. (In standard ZeRO terminology, stage 1 partitions optimizer state; partitioning gradients and parameters corresponds to stages 2 and 3.)


# phases/19-capstone-projects/81-end-to-end-distributed-train/code/main.py

import torch.distributed as dist

class ZeroOptimizer:
    def __init__(self, model, world_size, rank, lr):
        self.model = model
        self.world_size = world_size
        self.rank = rank
        self.lr = lr
        # _init_shard is not shown in this excerpt
        self.shard = self._init_shard()

    def step(self):
        # Sum each gradient across ranks so every rank sees the same value
        for p in self.model.parameters():
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
        # Plain SGD update computed from the shard values
        for p, s in zip(self.model.parameters(), self.shard):
            p.data = s - self.lr * p.grad

Inside the training loop, only the rank’s parameter shard is updated after gradients are all-reduced. The repository also covers per-rank gradient accumulation and sharded checkpoint saving, all built from raw torch.distributed primitives.
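The sharding idea can be illustrated without GPUs. The sketch below simulates several ranks inside one Python process: gradients are averaged (standing in for an all-reduce), each rank keeps momentum state only for its own slice of a flat parameter vector, and the updated slices are written back into one vector (standing in for an all-gather). All names here (`shard_bounds`, `zero1_step`, `momentum_shards`) and the choice of momentum SGD are assumptions for illustration; this is not the repository's ZeroOptimizer.

```python
def shard_bounds(num_params, world_size, rank):
    """Contiguous [start, end) slice of the flat parameters owned by `rank`."""
    per_rank = (num_params + world_size - 1) // world_size
    start = min(rank * per_rank, num_params)
    return start, min(start + per_rank, num_params)


def zero1_step(params, per_rank_grads, momentum_shards, lr=0.1, beta=0.9):
    """One simulated ZeRO-1 style step with momentum SGD (single process)."""
    world_size = len(per_rank_grads)
    n = len(params)
    # Stand-in for all-reduce: every rank ends up with the mean gradient
    avg_grad = [sum(g[i] for g in per_rank_grads) / world_size for i in range(n)]
    new_params = list(params)
    for rank in range(world_size):
        start, end = shard_bounds(n, world_size, rank)
        state = momentum_shards[rank]  # covers only this rank's slice
        for j, i in enumerate(range(start, end)):
            state[j] = beta * state[j] + avg_grad[i]
            # Stand-in for all-gather: the owner publishes its updated slice
            new_params[i] = params[i] - lr * state[j]
    return new_params


params = [1.0, 2.0, 3.0, 4.0, 5.0]
per_rank_grads = [
    [0.1, 0.2, 0.3, 0.4, 0.5],   # gradient computed on rank 0's batch
    [0.3, 0.2, 0.1, 0.0, -0.1],  # gradient computed on rank 1's batch
]
world_size = len(per_rank_grads)
momentum_shards = []
for rank in range(world_size):
    start, end = shard_bounds(len(params), world_size, rank)
    momentum_shards.append([0.0] * (end - start))

updated = zero1_step(params, per_rank_grads, momentum_shards)
print(updated, [len(s) for s in momentum_shards])
```

The memory saving comes from `momentum_shards`: each rank stores optimizer state for roughly 1/world_size of the parameters, while the result matches what an unsharded optimizer would produce.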

Reinforcement Learning and Vision-Language Loops

Model training in ai-engineering-from-scratch extends beyond supervised transformers. Under phases/09-reinforcement-learning/, functions such as train_one_epoch and train_step implement policy-gradient and Q-learning loops that sample actions, compute returns, and update policies.
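The "compute returns" step of a policy-gradient loop is small enough to show in full. This is a generic REINFORCE-style sketch; `discounted_returns` and `reinforce_loss` are hypothetical helper names for this example, not functions from the repository.

```python
def discounted_returns(rewards, gamma=0.99):
    """Return G_t = r_t + gamma * G_{t+1} for every step of one episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each return reuses the one that follows it
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def reinforce_loss(log_probs, returns):
    """Mean of -log pi(a_t | s_t) * G_t; minimizing it raises the
    probability of actions that were followed by high returns."""
    return -sum(lp * g for lp, g in zip(log_probs, returns)) / len(returns)


returns = discounted_returns([1.0, 1.0, 1.0], gamma=0.5)
loss = reinforce_loss([-0.5, -1.0], [2.0, 1.0])
print(returns, loss)
```

The log-probabilities would come from the policy network at the sampled actions, so the gradient of this loss flows back into the policy parameters through the same four-step loop used elsewhere.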

For multimodal workloads, phases/12-multimodal-ai/62-vision-language-pretraining/code/main.py uses a contrastive InfoNCE loss to jointly train vision and language encoders. The train(cfg) function handles batching, loss scaling, and metric logging for cross-modal pre-training.
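A symmetric InfoNCE loss over a batch of paired embeddings can be sketched as follows. The code is plain Python for readability; the function names and the `temperature=0.07` default are assumptions for this example, not values read from the lesson.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over a list of scores."""
    m = max(row)
    lse = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - lse for v in row]


def normalize(v):
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def info_nce(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss: pair i in one modality should match pair i in the other."""
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    n = len(imgs)
    # Cosine similarity matrix, sharpened by the temperature
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
              for i in imgs]
    # Image -> text: row i should select column i
    i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text -> image: column j should select row j
    cols = [[logits[i][j] for i in range(n)] for j in range(n)]
    t2i = -sum(log_softmax(cols[j])[j] for j in range(n)) / n
    return 0.5 * (i2t + t2i)


aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
collapsed = info_nce([[1.0, 0.0], [1.0, 0.0]], [[1.0, 0.0], [1.0, 0.0]])
print(aligned, swapped, collapsed)
```

Correctly paired embeddings give a loss near zero, mismatched pairs give a large loss, and a collapsed encoder that maps everything to the same vector scores log(batch size), which is why the other items in the batch act as negatives.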

Fine-Tuning with LoRA

In phases/11-llm-engineering/08-fine-tuning-lora/code/lora.py, the repository demonstrates parameter-efficient training. The train_lora function freezes the base model weights and updates only low-rank adapter matrices, which greatly reduces the memory needed for gradients and optimizer state while aiming to preserve model quality.
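The adapter mechanism itself is compact. The sketch below wraps a frozen weight matrix with two small trainable matrices, using the common convention of a zero-initialized B so the adapted layer starts out identical to the base layer. The class name `LoRALinearSketch`, its `rank` and `alpha` defaults, and the list-of-lists matrices are assumptions for illustration, not the repository's implementation.

```python
import random


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


class LoRALinearSketch:
    """y = W x + (alpha / rank) * B (A x), with W frozen (illustrative only)."""

    def __init__(self, weight, rank=2, alpha=4.0, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight  # frozen base weights, never updated
        # A starts small and random, B starts at zero, so the initial delta is zero
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]
        self.scale = alpha / rank

    def forward(self, x):
        base = matvec(self.weight, x)
        delta = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * d for b, d in zip(base, delta)]

    def trainable_param_count(self):
        return sum(len(r) for r in self.A) + sum(len(r) for r in self.B)


weight = [[0.1 * (i + j) for j in range(6)] for i in range(4)]  # 4 x 6 base layer
layer = LoRALinearSketch(weight, rank=2)
x = [1.0, -1.0, 0.5, 2.0, 0.0, 1.5]
base_out = matvec(weight, x)
lora_out = layer.forward(x)
print(lora_out == base_out, layer.trainable_param_count(), 4 * 6)
```

For a d_out by d_in layer the adapter trains rank * (d_in + d_out) values instead of d_out * d_in; the gap is small in this toy layer and grows quickly for the wide layers found in language models.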

Summary

NumPy-first loops: phases/10-llms-from-scratch/04-pre-training-mini-gpt/code/main.py exposes manual back-propagation with layernorm_backward and ffn_backward.
Modern PyTorch steps: phases/19-capstone-projects/45-gradient-clipping-amp/code/main.py packages AMP, AdamW, and gradient clipping inside AmpTrainState.run_step.
Learning-rate control: Linear warmup followed by cosine decay in phases/19-capstone-projects/44-cosine-lr-warmup/code/main.py stabilizes the early steps of training.
Distributed scale: phases/19-capstone-projects/81-end-to-end-distributed-train/code/main.py implements a custom ZeroOptimizer for ZeRO-1 sharding across GPUs.
Domain-specific loops: RL agents, vision-language contrastive training, and LoRA fine-tuning each supply focused training functions that reuse the same four-step pattern.

Frequently Asked Questions

How does ai-engineering-from-scratch teach back-propagation?

Early lessons use no autograd at all, because the pure NumPy code has no automatic differentiation to rely on. In phases/10-llms-from-scratch/04-pre-training-mini-gpt/code/main.py, gradients are computed by explicitly calling layernorm_backward, ffn_backward, and compute_grads so learners trace every partial derivative before using PyTorch’s backward().

What mixed-precision training utilities are used?

The capstone project in phases/19-capstone-projects/45-gradient-clipping-amp/code/main.py uses torch.cuda.amp.autocast and GradScaler inside AmpTrainState.run_step. Gradient norms are clipped with torch.nn.utils.clip_grad_norm_ before the optimizer step.

Can the training code scale to multiple GPUs?

Yes. phases/19-capstone-projects/81-end-to-end-distributed-train/code/main.py demonstrates a ZeroOptimizer built on torch.distributed. It all-reduces gradients across ranks and has each rank apply the update from its own shard, which lowers per-GPU memory pressure as more ranks are added.

Where is distributed data parallel introduced?

Before the full ZeRO pipeline, phases/19-capstone-projects/77-data-parallel-ddp/code/main.py introduces DDP utilities that broadcast parameters and synchronize gradients with torch.distributed. This lesson lays the groundwork for the end-to-end distributed training capstone.
