# How Model Training Is Handled in ai-engineering-from-scratch: From NumPy Loops to Distributed ZeRO

> Learn how model training is handled in ai-engineering-from-scratch. Start with NumPy loops, then master PyTorch best practices like gradient clipping and distributed ZeRO sharding.

- Repository: [Rohit Ghumare/ai-engineering-from-scratch](https://github.com/rohitg00/ai-engineering-from-scratch)
- Tags: deep-dive
- Published: 2026-06-06

---

**Model training in ai-engineering-from-scratch is taught by building raw NumPy training loops first, then layering in PyTorch best practices like gradient clipping, mixed precision, and distributed ZeRO sharding.**

The `ai-engineering-from-scratch` repository by rohitg00 teaches model training through a “Build It → Use It → Ship It” progression. Instead of calling high-level trainer APIs, learners handcraft forward passes, backward passes, and parameter updates in pure NumPy before graduating to production-grade PyTorch loops. This approach demystifies how gradients flow, how optimizers update weights, and how distributed training scales across GPUs.

## Model Training Patterns in ai-engineering-from-scratch

Every lesson follows the same progression. First, a tiny, self-contained model is built with only the standard library and NumPy.

Next, a handcrafted training loop computes forward passes, loss, gradients, and parameter updates. Optional enhancements such as gradient clipping, mixed precision, and learning-rate schedules are introduced in later capstones.

Finally, the same architecture is instantiated with PyTorch to reveal the exact correspondence between manual math and library calls.

All loops share a common backbone:

```python
# 1️⃣ Forward pass
logits = model(x)

# 2️⃣ Compute loss
loss = loss_fn(logits, y)

# 3️⃣ Back-prop
loss.backward()

# 4️⃣ Optimizer step, then clear gradients for the next iteration
optimizer.step()
optimizer.zero_grad()
```

The curriculum shows what each of these lines does by re-implementing operations such as `layernorm_backward` and `ffn_backward` by hand before presenting the concise PyTorch equivalent.
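To make the correspondence concrete, the four steps can be written out without any framework. The following is a minimal sketch, not code from the repository: the toy data, the `train_step` helper, and the learning rate are all hypothetical, chosen so the loop fits a line `y = 2x + 1` using only the standard library.

```python
# Hypothetical toy example (not from the repository):
# fit y = 2x + 1 with a hand-written training loop.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [2.0 * x + 1.0 for x in xs]

w, b = 0.0, 0.0   # model parameters
lr = 0.05         # learning rate, picked by hand for this toy problem


def train_step(w, b):
    n = len(xs)
    # 1. Forward pass
    preds = [w * x + b for x in xs]
    # 2. Loss: mean squared error
    loss = sum((p - y) ** 2 for p, y in zip(preds, ys)) / n
    # 3. Backward pass: dL/dw and dL/db derived by hand
    grad_w = sum(2.0 * (p - y) * x for p, y, x in zip(preds, ys, xs)) / n
    grad_b = sum(2.0 * (p - y) for p, y in zip(preds, ys)) / n
    # 4. Parameter update: plain SGD
    return w - lr * grad_w, b - lr * grad_b, loss


losses = []
for _ in range(2000):
    w, b, loss = train_step(w, b)
    losses.append(loss)
```

Steps 3 and 4 are the part that `loss.backward()` and `optimizer.step()` hide in PyTorch; here the derivatives are spelled out term by term.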

## Mini-GPT Pre-Training with Manual Backpropagation

In [`phases/10-llms-from-scratch/04-pre-training-mini-gpt/code/main.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/phases/10-llms-from-scratch/04-pre-training-mini-gpt/code/main.py), the repository implements a complete transformer decoder using only NumPy. The `train_mini_gpt` function exposes every step of the optimization process:

```python
# phases/10-llms-from-scratch/04-pre-training-mini-gpt/code/main.py
# Abridged excerpt: MiniGPT, cross_entropy, compute_grads, and
# update_parameters are assumed to be defined elsewhere in the file.
import numpy as np

def train_mini_gpt(text, vocab_size=256, embed_dim=128,
                   num_heads=4, num_layers=4, seq_len=64,
                   num_steps=200, lr=3e-4):
    # Byte-level tokens: each UTF-8 byte is one token id in [0, 255]
    tokens = np.array(list(text.encode()))[:2048]
    model = MiniGPT(vocab_size, embed_dim, num_heads,
                    num_layers, max_seq_len=seq_len,
                    ff_dim=embed_dim * 4)

    for step in range(num_steps):
        # Sample a random window; the target is the input shifted by one token
        start = np.random.randint(0, len(tokens) - seq_len - 1)
        x = tokens[start:start+seq_len].reshape(1, -1)
        y = tokens[start+1:start+seq_len+1].reshape(1, -1)

        logits = model.forward(x)
        loss = cross_entropy(logits, y)
        grads = compute_grads(loss, model)          # manual chain rule

        update_parameters(model, grads, lr)         # SGD step
```

This lesson defines `layernorm_backward` and `ffn_backward` explicitly so learners see the chain rule in action. There is no autograd; every gradient is derived by hand and applied through `update_parameters`.
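A hand-derived backward pass is easiest to trust when it is checked against finite differences. The sketch below is an illustration under assumptions, not the repository's implementation: it reuses the name `layernorm_backward` for a single-vector version written in plain Python, and `numerical_dx` is a hypothetical helper that approximates the same gradient numerically.

```python
import math


def layernorm_forward(x, gamma, beta, eps=1e-5):
    """Normalize one vector, then apply the learned scale and shift."""
    n = len(x)
    mu = sum(x) / n
    var = sum((v - mu) ** 2 for v in x) / n
    std = math.sqrt(var + eps)
    xhat = [(v - mu) / std for v in x]
    y = [g * h + b for g, h, b in zip(gamma, xhat, beta)]
    return y, (xhat, std)


def layernorm_backward(dy, gamma, cache):
    """Chain rule through the normalization, given upstream gradient dy."""
    xhat, std = cache
    n = len(dy)
    dxhat = [d * g for d, g in zip(dy, gamma)]
    s1 = sum(dxhat)
    s2 = sum(d * h for d, h in zip(dxhat, xhat))
    dx = [(n * d - s1 - h * s2) / (n * std) for d, h in zip(dxhat, xhat)]
    dgamma = [d * h for d, h in zip(dy, xhat)]
    dbeta = list(dy)
    return dx, dgamma, dbeta


def numerical_dx(x, gamma, beta, dy, h=1e-5):
    """Central differences on the scalar L = sum(dy * y)."""
    grads = []
    for i in range(len(x)):
        xp = list(x)
        xp[i] += h
        xm = list(x)
        xm[i] -= h
        lp = sum(d * v for d, v in zip(dy, layernorm_forward(xp, gamma, beta)[0]))
        lm = sum(d * v for d, v in zip(dy, layernorm_forward(xm, gamma, beta)[0]))
        grads.append((lp - lm) / (2 * h))
    return grads


x = [0.5, -1.2, 3.0, 0.7]
gamma = [1.0, 0.5, 2.0, 1.5]
beta = [0.0, 0.1, -0.2, 0.3]
dy = [0.3, -0.7, 1.1, 0.2]

_, cache = layernorm_forward(x, gamma, beta)
dx, dgamma, dbeta = layernorm_backward(dy, gamma, cache)
dx_num = numerical_dx(x, gamma, beta, dy)
max_err = max(abs(a - b) for a, b in zip(dx, dx_num))
```

If the analytic and numerical gradients agree to within a small tolerance, the derivation is very likely correct; a large `max_err` usually points to a dropped term in the chain rule.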

## Gradient Clipping and Automatic Mixed Precision (AMP)

Once the fundamentals are solid, the curriculum moves to production-grade PyTorch training. In [`phases/19-capstone-projects/45-gradient-clipping-amp/code/main.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/phases/19-capstone-projects/45-gradient-clipping-amp/code/main.py), the `AmpTrainState` class wraps an **AdamW** optimizer with **automatic mixed precision** and **gradient clipping**:

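```python
# phases/19-capstone-projects/45-gradient-clipping-amp/code/main.py
import torch
import torch.nn.functional as F

class AmpTrainState:
    def __init__(self, model, lr=1e-3, fp16=False):
        self.model = model
        self.opt = torch.optim.AdamW(model.parameters(), lr=lr)
        # With enabled=False the scaler is a no-op, so one code path serves both modes
        self.scaler = torch.cuda.amp.GradScaler(enabled=fp16)

    def run_step(self, x, y):
        # Run the forward pass and loss in reduced precision when AMP is on
        with torch.cuda.amp.autocast(enabled=self.scaler.is_enabled()):
            logits = self.model(x)
            loss = F.cross_entropy(logits, y)
        # Scale the loss so small fp16 gradients do not underflow
        self.scaler.scale(loss).backward()
        # Unscale first so max_norm applies to the true gradient magnitudes
        self.scaler.unscale_(self.opt)
        torch.nn.utils.clip_grad_norm_(self.model.parameters(), max_norm=1.0)
        self.scaler.step(self.opt)   # skips the update if gradients contain inf/NaN
        self.scaler.update()
        self.opt.zero_grad()
        return loss.item()
```

The `run_step` method sequences the entire mixed-precision step: run the forward pass under `autocast`, scale the loss, back-propagate, unscale the gradients, clip their norm, step the optimizer through the scaler, and update the scale factor.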
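Clipping by global norm is simple enough to sketch in plain Python, which also shows why the order of clipping and loss scaling matters. This is an illustrative sketch rather than PyTorch's code: `clip_by_global_norm` is a hypothetical helper that mirrors the rescaling `clip_grad_norm_` applies, and the loss scale of 1024 is an arbitrary example value.

```python
import math


def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Rescale all gradients together so their L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    coef = min(1.0, max_norm / (total_norm + eps))
    return [g * coef for g in grads], total_norm


loss_scale = 1024.0                 # example value; real scalers adjust this dynamically
true_grads = [0.3, -0.4]            # L2 norm 0.5, already below max_norm
scaled_grads = [g * loss_scale for g in true_grads]

# Clipping while gradients are still scaled: the norm looks like 512,
# so the gradients are shrunk even though no clipping was needed.
clipped_scaled, _ = clip_by_global_norm(scaled_grads, max_norm=1.0)
wrong = [g / loss_scale for g in clipped_scaled]

# Unscaling first: the clip sees the true norm and leaves the gradients alone.
unscaled = [g / loss_scale for g in scaled_grads]
right, norm = clip_by_global_norm(unscaled, max_norm=1.0)
```

In this example `right` equals the true gradients, while `wrong` ends up roughly a thousand times too small, which would silently stall training.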

## Learning Rate Scheduling and Warmup

Before scaling across GPUs, the repository covers learning-rate scheduling. [`phases/19-capstone-projects/44-cosine-lr-warmup/code/main.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/phases/19-capstone-projects/44-cosine-lr-warmup/code/main.py) implements a **cosine schedule with linear warmup**: the learning rate ramps up linearly over the first steps, then decays along a cosine curve. This pattern is reused in later distributed training scripts to stabilize early steps.
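A schedule of this shape fits in a few lines. The function below is a minimal sketch under assumptions: the name `lr_at`, the default step counts, and the `min_lr` floor are hypothetical and may differ from what the capstone uses.

```python
import math


def lr_at(step, base_lr=3e-4, warmup_steps=100, total_steps=1000, min_lr=0.0):
    """Linear warmup to base_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        # Ramp from base_lr / warmup_steps up to base_lr
        return base_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1]
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


schedule = [lr_at(s) for s in range(1001)]
```

Calling `lr_at(step)` once per iteration and writing the result into the optimizer's learning rate is enough to apply the schedule in a hand-written loop.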

## Distributed Training with ZeRO-1 Sharding

The most advanced training lesson lives in [`phases/19-capstone-projects/81-end-to-end-distributed-train/code/main.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/phases/19-capstone-projects/81-end-to-end-distributed-train/code/main.py). It combines **Distributed Data Parallel (DDP)** with a custom **ZeRO-1** optimizer. ZeRO-1 partitions optimizer state across ranks, which lowers per-GPU memory use, while DDP keeps a full copy of the parameters on every rank.

As implemented in rohitg00/ai-engineering-from-scratch, the `ZeroOptimizer` class assigns each rank its own shard:

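```python
# phases/19-capstone-projects/81-end-to-end-distributed-train/code/main.py
# Abridged excerpt: _init_shard is omitted here.
import torch.distributed as dist

class ZeroOptimizer:
    def __init__(self, model, world_size, rank, lr):
        self.model = model
        self.world_size = world_size
        self.rank = rank
        self.lr = lr
        self.shard = self._init_shard()

    def step(self):
        # Sum each gradient across ranks so every rank holds the same value.
        # Note: SUM does not average; divide by world_size if a mean is intended.
        for p in self.model.parameters():
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
        # Plain SGD update computed from the shard's copy of each tensor
        for p, s in zip(self.model.parameters(), self.shard):
            p.data = s - self.lr * p.grad
```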

Inside the training loop, only the rank’s parameter shard is updated after gradients are all-reduced. The repository also covers per-rank gradient accumulation and sharded checkpoint saving, all built from raw `torch.distributed` primitives.
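The core idea of ZeRO-1 can be simulated in a single process, without `torch.distributed`. The sketch below is an illustration under assumptions, not the repository's `ZeroOptimizer`: `shard_bounds` and `momentum_update` are hypothetical helpers, ranks are simulated with a loop, and SGD with momentum stands in for whatever optimizer the capstone uses.

```python
def shard_bounds(n, world_size, rank):
    """Contiguous slice [lo, hi) of a flat parameter vector owned by `rank`."""
    per = (n + world_size - 1) // world_size
    lo = min(n, rank * per)
    return lo, min(n, lo + per)


def momentum_update(params, grads, velocity, lr=0.1, beta=0.9):
    """SGD with momentum, applied in place to the given lists."""
    for i in range(len(params)):
        velocity[i] = beta * velocity[i] + grads[i]
        params[i] -= lr * velocity[i]


world_size = 2
params0 = [1.0, -2.0, 0.5, 3.0, -1.5]
# Per-rank gradients, as if each rank had seen a different mini-batch
rank_grads = [[0.2, 0.4, -0.6, 0.8, 1.0],
              [0.6, 0.0, 0.2, -0.4, 0.2]]

# Simulated all-reduce: every rank ends up with the mean gradient
avg = [sum(g[i] for g in rank_grads) / world_size for i in range(len(params0))]

# ZeRO-1 style: each rank keeps optimizer state only for its own slice
gathered = list(params0)
state_sizes = []
for rank in range(world_size):
    lo, hi = shard_bounds(len(params0), world_size, rank)
    local_params = params0[lo:hi]
    local_velocity = [0.0] * (hi - lo)      # the only optimizer state this rank stores
    momentum_update(local_params, avg[lo:hi], local_velocity)
    gathered[lo:hi] = local_params          # simulated all-gather of updated slices
    state_sizes.append(len(local_velocity))

# Reference: a single unsharded optimizer holding state for every parameter
reference = list(params0)
momentum_update(reference, avg, [0.0] * len(params0))
```

The sharded result in `gathered` matches `reference`, while each simulated rank stores momentum for only its own slice. That saving grows with optimizers such as Adam, which keep two state tensors per parameter.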

## Reinforcement Learning and Vision-Language Loops

Model training in ai-engineering-from-scratch extends beyond supervised transformers. Under `phases/09-reinforcement-learning/`, functions such as `train_one_epoch` and `train_step` implement policy-gradient and Q-learning loops that sample actions, compute returns, and update policies.
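Two small pieces of arithmetic sit at the heart of these loops: the discounted return used by policy-gradient methods and the temporal-difference update used by Q-learning. The helpers below are a hedged sketch; the names `discounted_returns` and `q_update` and the example numbers are hypothetical, not taken from the repository.

```python
def discounted_returns(rewards, gamma=0.99):
    """Return G_t = r_t + gamma * G_{t+1} for every step of one episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each return reuses the one after it
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def q_update(q, reward, max_next_q, alpha=0.1, gamma=0.99):
    """One tabular Q-learning step toward the bootstrapped target."""
    target = reward + gamma * max_next_q
    return q + alpha * (target - q)


returns = discounted_returns([1.0, 0.0, 2.0], gamma=0.5)
new_q = q_update(0.0, reward=1.0, max_next_q=2.0, alpha=0.5, gamma=0.5)
```

In a policy-gradient loop the returns weight the log-probabilities of the sampled actions; in Q-learning the update is applied to the table entry for the visited state-action pair.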

For multimodal workloads, [`phases/12-multimodal-ai/62-vision-language-pretraining/code/main.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/phases/12-multimodal-ai/62-vision-language-pretraining/code/main.py) uses a contrastive **InfoNCE** loss to jointly train vision and language encoders. The `train(cfg)` function handles batching, loss scaling, and metric logging for cross-modal pre-training.
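A contrastive InfoNCE loss treats each matched image-text pair as the positive and every other pairing in the batch as a negative. The sketch below assumes the symmetric variant (image-to-text and text-to-image averaged) and a temperature of 0.07; the function names are hypothetical, and the repository's loss may differ in these details.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]


def log_softmax_at(row, idx):
    """Numerically stable log-softmax of row, evaluated at position idx."""
    m = max(row)
    lse = m + math.log(sum(math.exp(v - m) for v in row))
    return row[idx] - lse


def info_nce(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE: pair k of images matches pair k of texts."""
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    n = len(imgs)
    sim = [[dot(i, t) / temperature for t in txts] for i in imgs]
    # Image-to-text: each row should peak on its diagonal entry
    img_to_txt = -sum(log_softmax_at(sim[k], k) for k in range(n)) / n
    # Text-to-image: each column should peak on its diagonal entry
    txt_to_img = -sum(
        log_softmax_at([sim[r][k] for r in range(n)], k) for k in range(n)
    ) / n
    return 0.5 * (img_to_txt + txt_to_img)


aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

The loss is close to zero when matching pairs point in the same direction and large when the pairing is scrambled, which is the signal that pulls the two encoders into a shared embedding space.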

## Fine-Tuning with LoRA

In [`phases/11-llm-engineering/08-fine-tuning-lora/code/lora.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/phases/11-llm-engineering/08-fine-tuning-lora/code/lora.py), the repository demonstrates parameter-efficient training. The `train_lora` function freezes the base model weights and updates only low-rank adapter matrices, which shrinks the number of trainable parameters and the gradient and optimizer state that must be held in memory.
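The low-rank update itself is compact. The following sketch assumes the common LoRA formulation, an effective weight of `W + (alpha / r) * B @ A` with `B` initialized to zero; the helper names, matrix sizes, and `alpha` value are hypothetical and not taken from `lora.py`.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def effective_weight(W, A, B, alpha):
    """Frozen weight plus the scaled low-rank update (alpha / r) * B @ A."""
    r = len(A)
    scale = alpha / r
    BA = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, BA)]


d_out, d_in, r = 64, 64, 2
W = [[0.001 * (i + j) for j in range(d_in)] for i in range(d_out)]   # frozen base weight
A = [[0.01 * (j + 1) for j in range(d_in)] for _ in range(r)]        # trainable, r x d_in
B = [[0.0] * r for _ in range(d_out)]                                # trainable, d_out x r

# With B at zero the adapter is a no-op, so training starts from the base model
W_eff = effective_weight(W, A, B, alpha=4.0)

# Only A and B would receive gradients; W never changes
trainable = r * d_in + d_out * r
frozen = d_out * d_in

# Simulate one adapter update and recompute the effective weight
B[0][0] = 1.0
W_eff_after = effective_weight(W, A, B, alpha=4.0)
```

For this 64 by 64 layer the adapter trains 256 values instead of 4096, and the gap widens as the layer grows because the adapter size scales with `r * (d_in + d_out)` rather than `d_in * d_out`.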

## Summary

- **NumPy-first loops**: [`phases/10-llms-from-scratch/04-pre-training-mini-gpt/code/main.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/phases/10-llms-from-scratch/04-pre-training-mini-gpt/code/main.py) exposes manual back-propagation with `layernorm_backward` and `ffn_backward`.
- **Modern PyTorch steps**: [`phases/19-capstone-projects/45-gradient-clipping-amp/code/main.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/phases/19-capstone-projects/45-gradient-clipping-amp/code/main.py) packages AMP, AdamW, and gradient clipping inside `AmpTrainState.run_step`.
- **Learning-rate control**: The linear-warmup, cosine-decay schedule in [`phases/19-capstone-projects/44-cosine-lr-warmup/code/main.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/phases/19-capstone-projects/44-cosine-lr-warmup/code/main.py) stabilizes the early steps of training.
- **Distributed scale**: [`phases/19-capstone-projects/81-end-to-end-distributed-train/code/main.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/phases/19-capstone-projects/81-end-to-end-distributed-train/code/main.py) implements a custom `ZeroOptimizer` for ZeRO-1 sharding across GPUs.
- **Domain-specific loops**: RL agents, vision-language contrastive training, and LoRA fine-tuning each supply focused training functions that reuse the same four-step pattern.

## Frequently Asked Questions

### How does ai-engineering-from-scratch teach back-propagation?

Early lessons use no autograd at all. In [`phases/10-llms-from-scratch/04-pre-training-mini-gpt/code/main.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/phases/10-llms-from-scratch/04-pre-training-mini-gpt/code/main.py), gradients are computed by explicitly calling `layernorm_backward`, `ffn_backward`, and `compute_grads` so learners trace every partial derivative before using PyTorch’s `backward()`.

### What mixed-precision training utilities are used?

The capstone project in [`phases/19-capstone-projects/45-gradient-clipping-amp/code/main.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/phases/19-capstone-projects/45-gradient-clipping-amp/code/main.py) uses `torch.cuda.amp.autocast` and `GradScaler` inside `AmpTrainState.run_step`. Gradient norms are clipped with `torch.nn.utils.clip_grad_norm_` before the optimizer step.

### Can the training code scale to multiple GPUs?

Yes. [`phases/19-capstone-projects/81-end-to-end-distributed-train/code/main.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/phases/19-capstone-projects/81-end-to-end-distributed-train/code/main.py) demonstrates a `ZeroOptimizer` that shards optimizer work across ranks using `torch.distributed`. It all-reduces gradients and updates only local shards, which lowers the per-GPU memory needed for optimizer state.

### Where is distributed data parallel introduced?

Before the full ZeRO pipeline, [`phases/19-capstone-projects/77-data-parallel-ddp/code/main.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/phases/19-capstone-projects/77-data-parallel-ddp/code/main.py) introduces DDP utilities that broadcast parameters and synchronize gradients with `torch.distributed`. This lesson lays the groundwork for the end-to-end distributed training capstone.