How Model Training Is Handled in ai-engineering-from-scratch: From NumPy Loops to Distributed ZeRO
Model training in ai-engineering-from-scratch is taught by building raw NumPy training loops first, then layering in PyTorch best practices like gradient clipping, mixed precision, and distributed ZeRO sharding.
The ai-engineering-from-scratch repository by rohitg00 teaches model training through a “Build It → Use It → Ship It” progression. Instead of calling high-level trainer APIs, learners handcraft forward passes, backward passes, and parameter updates in pure NumPy before graduating to production-grade PyTorch loops. This approach demystifies how gradients flow, how optimizers update weights, and how distributed training scales across GPUs.
Model Training Patterns in ai-engineering-from-scratch
Every lesson follows the same six-beat pattern. First, a tiny, self-contained model is built with only the standard library and NumPy.
Next, a handcrafted training loop computes forward passes, loss, gradients, and parameter updates. Optional enhancements such as gradient clipping, mixed-precision, and learning-rate schedules are introduced in later capstones.
Finally, the same architecture is instantiated with PyTorch to reveal the exact correspondence between manual math and library calls.
All loops share a common backbone:
    # 1. Forward pass
    logits = model(x)
    # 2. Compute loss
    loss = loss_fn(logits, y)
    # 3. Back-prop
    loss.backward()
    # 4. Optimizer step
    optimizer.step()
    optimizer.zero_grad()
According to the ai-engineering-from-scratch source code, the curriculum demonstrates exactly what each line does by re-implementing operations like layernorm_backward and ffn_backward before showing the concise PyTorch equivalent.
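To make the correspondence concrete, here is a minimal sketch of the same four steps written by hand in NumPy for a one-layer linear model. The model, the synthetic data, and the name `train_linear` are illustrative assumptions, not code from the repository.

```python
import numpy as np

def train_linear(x, y, lr=0.1, num_steps=200):
    """Fit y ~ x @ w + b with hand-written gradients (illustrative only)."""
    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.1, size=(x.shape[1], 1))
    b = np.zeros((1,))
    losses = []
    for _ in range(num_steps):
        # 1. Forward pass
        pred = x @ w + b
        # 2. Compute loss (mean squared error)
        err = pred - y
        losses.append(float(np.mean(err ** 2)))
        # 3. Back-prop: the values loss.backward() would place in .grad
        grad_pred = 2.0 * err / err.size
        grad_w = x.T @ grad_pred
        grad_b = grad_pred.sum(axis=0)
        # 4. Optimizer step: what optimizer.step() does for plain SGD.
        #    Gradients are recomputed each iteration, so there is nothing to zero.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b, losses

# Synthetic data generated from known weights [2, -1, 0.5] and bias 0.3.
rng = np.random.default_rng(1)
x = rng.normal(size=(64, 3))
y = x @ np.array([[2.0], [-1.0], [0.5]]) + 0.3
w, b, losses = train_linear(x, y)
```

Because the data is noise-free, the learned `w` and `b` should approach the generating values, which makes the loop easy to sanity-check before swapping in a real model.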
Mini-GPT Pre-Training with Manual Backpropagation
In [phases/10-llms-from-scratch/04-pre-training-mini-gpt/code/main.py](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/phases/10-llms-from-scratch/04-pre-training-mini-gpt/code/main.py), the repository implements a complete transformer decoder using only NumPy. The train_mini_gpt function exposes every step of the optimization process:
    # phases/10-llms-from-scratch/04-pre-training-mini-gpt/code/main.py
    def train_mini_gpt(text, vocab_size=256, embed_dim=128,
                       num_heads=4, num_layers=4, seq_len=64,
                       num_steps=200, lr=3e-4):
        tokens = np.array(list(text.encode()))[:2048]
        model = MiniGPT(vocab_size, embed_dim, num_heads,
                        num_layers, max_seq_len=seq_len,
                        ff_dim=embed_dim * 4)
        for step in range(num_steps):
            start = np.random.randint(0, len(tokens) - seq_len - 1)
            x = tokens[start:start+seq_len].reshape(1, -1)
            y = tokens[start+1:start+seq_len+1].reshape(1, -1)
            logits = model.forward(x)
            loss = cross_entropy(logits, y)
            grads = compute_grads(loss, model)   # manual chain rule
            update_parameters(model, grads, lr)  # SGD step
This lesson defines layernorm_backward and ffn_backward explicitly so learners see the chain rule in action. There is no autograd; every gradient is derived by hand and applied through update_parameters.
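As an illustration of what a hand-derived backward pass involves, the sketch below implements layer normalization over the last axis of a 2-D input and checks the analytic gradient against finite differences. It borrows the name `layernorm_backward` from the lesson, but the signature, the cache layout, and the `numeric_grad` helper are assumptions made for this example and may differ from the repository's code.

```python
import numpy as np

def layernorm_forward(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    rstd = 1.0 / np.sqrt(var + eps)
    xhat = (x - mu) * rstd
    return gamma * xhat + beta, (xhat, rstd, gamma)

def layernorm_backward(dy, cache):
    """Chain rule through y = gamma * xhat + beta for x of shape (batch, features)."""
    xhat, rstd, gamma = cache
    dgamma = (dy * xhat).sum(axis=0)
    dbeta = dy.sum(axis=0)
    dxhat = dy * gamma
    # The mean and variance both depend on every feature, hence the two mean terms.
    dx = rstd * (dxhat
                 - dxhat.mean(axis=-1, keepdims=True)
                 - xhat * (dxhat * xhat).mean(axis=-1, keepdims=True))
    return dx, dgamma, dbeta

def numeric_grad(f, arr, h=1e-6):
    """Central-difference gradient of the scalar f() with respect to arr."""
    grad = np.zeros_like(arr)
    for idx in np.ndindex(*arr.shape):
        old = arr[idx]
        arr[idx] = old + h
        plus = f()
        arr[idx] = old - h
        minus = f()
        arr[idx] = old
        grad[idx] = (plus - minus) / (2 * h)
    return grad

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
gamma, beta = rng.normal(size=8), rng.normal(size=8)
upstream = rng.normal(size=(4, 8))  # stands in for the gradient arriving from later layers

def loss():
    return float(np.sum(layernorm_forward(x, gamma, beta)[0] * upstream))

_, cache = layernorm_forward(x, gamma, beta)
dx, dgamma, dbeta = layernorm_backward(upstream, cache)
dx_numeric = numeric_grad(loss, x)
dgamma_numeric = numeric_grad(loss, gamma)
```

A finite-difference check like this is a common way to catch sign or indexing mistakes in manual gradients before they silently degrade training.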
Gradient Clipping and Automatic Mixed Precision (AMP)
Once the fundamentals are solid, the curriculum moves to production-grade PyTorch training. In [phases/19-capstone-projects/45-gradient-clipping-amp/code/main.py](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/phases/19-capstone-projects/45-gradient-clipping-amp/code/main.py), the AmpTrainState class wraps an AdamW optimizer with automatic mixed precision and gradient clipping:
# phases/19-capstone-projects/45-gradient-clipping-amp/code/main.py
    class AmpTrainState:
        def __init__(self, model, lr=1e-3, fp16=False):
            self.model = model
            self.opt = torch.optim.AdamW(model.parameters(), lr=lr)
            self.scaler = torch.cuda.amp.GradScaler(enabled=fp16)

        def run_step(self, x, y):
            with torch.cuda.amp.autocast(enabled=self.scaler.is_enabled()):
                logits = self.model(x)
                loss = F.cross_entropy(logits, y)
            self.scaler.scale(loss).backward()
            torch.nn.utils.clip_grad_norm_(self.model.parameters(), max_norm=1.0)
            self.scaler.step(self.opt)
            self.scaler.update()
            self.opt.zero_grad()
            return loss.item()
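The run_step method sequences the entire mixed-precision step: run the forward pass and loss under autocast, scale the loss, back-propagate, clip gradient norms, step the optimizer through the scaler, update the scale factor, and zero the gradients.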
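One ordering detail is worth checking when fp16 scaling is enabled. In the snippet as quoted, clipping runs on gradients that are still multiplied by the loss scale. PyTorch's AMP documentation describes calling `GradScaler.unscale_(optimizer)` before `clip_grad_norm_` so that `max_norm` applies to the true gradients. The NumPy sketch below only simulates the arithmetic to show why the order matters; `clip_by_global_norm` and the loss scale of 1024 are illustrative assumptions, not repository code.

```python
import numpy as np

def global_norm(grads):
    return float(np.sqrt(sum(np.sum(g ** 2) for g in grads)))

def clip_by_global_norm(grads, max_norm):
    """Rescale grads so that their joint L2 norm is at most max_norm."""
    coef = min(1.0, max_norm / (global_norm(grads) + 1e-6))
    return [g * coef for g in grads]

true_grads = [np.array([0.3, 0.0]), np.array([0.4])]  # joint norm 0.5, below the threshold
loss_scale = 1024.0                                   # illustrative value
scaled = [g * loss_scale for g in true_grads]         # what backward() leaves when scaling is on

# Order A: clip the still-scaled gradients, then unscale.
order_a = [g / loss_scale for g in clip_by_global_norm(scaled, max_norm=1.0)]

# Order B: unscale first, then clip.
order_b = clip_by_global_norm([g / loss_scale for g in scaled], max_norm=1.0)

norm_a = global_norm(order_a)  # shrunk to roughly 1 / loss_scale
norm_b = global_norm(order_b)  # left at 0.5, since no clipping was needed
```

In order A the effective gradient norm collapses to about `1 / loss_scale`, so the optimizer takes far smaller steps than `max_norm=1.0` suggests. With `fp16=False` the scaler is disabled and both orders behave the same.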
Learning Rate Scheduling and Warmup
Before scaling across GPUs, the repository covers learning-rate scheduling. phases/19-capstone-projects/44-cosine-lr-warmup/code/main.py implements a cosine schedule with warmup: the learning rate ramps up linearly during the warmup phase, then decays along a cosine curve. This pattern is reused in later distributed training scripts to stabilize early steps.
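A schedule of this kind can be sketched in a few lines of standard-library Python. The function name `cosine_warmup_lr`, its parameters, and the choice to start warmup at `base_lr / warmup_steps` are assumptions for illustration; the lesson's own implementation may differ in these details.

```python
import math

def cosine_warmup_lr(step, base_lr, warmup_steps, total_steps, min_lr=0.0):
    """Learning rate at a 0-indexed step: linear warmup, then cosine decay."""
    if step < warmup_steps:
        # Ramp linearly so that the last warmup step reaches base_lr.
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)  # hold at min_lr if training runs past total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

schedule = [
    cosine_warmup_lr(s, base_lr=3e-4, warmup_steps=10, total_steps=100)
    for s in range(101)
]
```

The small initial learning rate keeps the first updates gentle while optimizer statistics and activations settle, which is the stabilizing effect the warmup phase is meant to provide.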
Distributed Training with ZeRO-1 Sharding
The most advanced training lesson lives in [phases/19-capstone-projects/81-end-to-end-distributed-train/code/main.py](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/phases/19-capstone-projects/81-end-to-end-distributed-train/code/main.py). It combines Distributed Data Parallel (DDP) with a custom ZeRO-1 optimizer so that models whose full training state would not fit on a single GPU can still be trained.
As implemented in rohitg00/ai-engineering-from-scratch, the ZeroOptimizer class all-reduces gradients across ranks and gives each rank its own shard to update:
# phases/19-capstone-projects/81-end-to-end-distributed-train/code/main.py
    class ZeroOptimizer:
        def __init__(self, model, world_size, rank, lr):
            self.model = model
            self.world_size = world_size
            self.rank = rank
            self.lr = lr
            self.shard = self._init_shard()

        def step(self):
            for p in self.model.parameters():
                dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            for p, s in zip(self.model.parameters(), self.shard):
                p.data = s - self.lr * p.grad
Inside the training loop, only the rank’s parameter shard is updated after gradients are all-reduced. The repository also covers per-rank gradient accumulation and sharded checkpoint saving, all built from raw torch.distributed primitives.
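The idea of all-reducing gradients and then letting each rank update only its own shard can be simulated in a single process. The sketch below is not the repository's ZeroOptimizer: it uses a flat parameter vector, SGD with momentum as the sharded optimizer state, gradient averaging in place of `dist.all_reduce`, and concatenation in place of an all-gather. The names `shard_bounds` and `zero1_step` are hypothetical.

```python
import numpy as np

def shard_bounds(n, world_size):
    """Split indices 0..n into world_size contiguous, nearly equal slices."""
    return np.linspace(0, n, world_size + 1).astype(int)

def zero1_step(params, per_rank_grads, momentum_shards, lr=0.1, beta=0.9):
    """One simulated ZeRO-1 step.

    params:          full parameter vector, replicated on every rank
    per_rank_grads:  one full-size gradient per rank
    momentum_shards: one momentum buffer per rank, covering only that rank's slice
    """
    world_size = len(per_rank_grads)
    grad = np.mean(per_rank_grads, axis=0)  # stand-in for all-reduce plus averaging
    bounds = shard_bounds(params.size, world_size)
    updated = []
    for rank in range(world_size):
        lo, hi = bounds[rank], bounds[rank + 1]
        # Each rank reads and writes optimizer state for its own slice only.
        momentum_shards[rank] = beta * momentum_shards[rank] + grad[lo:hi]
        updated.append(params[lo:hi] - lr * momentum_shards[rank])
    return np.concatenate(updated)  # stand-in for all-gather

rng = np.random.default_rng(0)
world_size, n = 4, 10
params = rng.normal(size=n)
bounds = shard_bounds(n, world_size)
shards = [np.zeros(bounds[r + 1] - bounds[r]) for r in range(world_size)]

# Reference: the same optimizer with one unsharded momentum buffer.
ref_params, ref_momentum = params.copy(), np.zeros(n)
for _ in range(3):
    grads = [rng.normal(size=n) for _ in range(world_size)]
    params = zero1_step(params, grads, shards)
    ref_momentum = 0.9 * ref_momentum + np.mean(grads, axis=0)
    ref_params = ref_params - 0.1 * ref_momentum
```

The sharded and unsharded runs produce the same parameters, while each simulated rank stores at most 3 of the 10 momentum entries. That is the memory saving ZeRO stage 1 targets.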
Reinforcement Learning and Vision-Language Loops
Model training in ai-engineering-from-scratch extends beyond supervised transformers. Under phases/09-reinforcement-learning/, functions such as train_one_epoch and train_step implement policy-gradient and Q-learning loops that sample actions, compute returns, and update policies.
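Two building blocks behind such loops are short enough to show directly: discounted returns, which policy-gradient methods use to weight log-probabilities, and the tabular Q-learning update. Both functions below are generic textbook sketches with hypothetical names, not excerpts from the repository.

```python
def discounted_returns(rewards, gamma=0.99):
    """Return G_t = r_t + gamma * G_{t+1} for every step of one episode."""
    returns = []
    running = 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    return returns

def q_learning_update(q, state, action, reward, next_state, alpha=0.1, gamma=0.99):
    """Move q[state][action] toward the one-step bootstrapped target, in place."""
    target = reward + gamma * max(q[next_state])
    q[state][action] += alpha * (target - q[state][action])

episode_returns = discounted_returns([1.0, 1.0, 1.0], gamma=0.5)

q_table = [[0.0, 0.0], [1.0, 2.0]]  # two states, two actions
q_learning_update(q_table, state=0, action=1, reward=1.0, next_state=1,
                  alpha=0.5, gamma=0.5)
```

With `gamma=0.5`, the three rewards of 1.0 yield returns of 1.75, 1.5, and 1.0, and the Q-learning step moves `q_table[0][1]` halfway from 0.0 to the target of 2.0.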
For multimodal workloads, phases/12-multimodal-ai/62-vision-language-pretraining/code/main.py uses a contrastive InfoNCE loss to jointly train vision and language encoders. The train(cfg) function handles batching, loss scaling, and metric logging for cross-modal pre-training.
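For readers unfamiliar with InfoNCE, the following NumPy sketch computes a symmetric version of the loss over a batch in which row i of the image matrix and row i of the text matrix form the matched pair. The symmetric form, the temperature of 0.07, and the function names are assumptions for illustration; the lesson's loss may be parameterized differently.

```python
import numpy as np

def log_softmax(logits):
    shifted = logits - logits.max(axis=1, keepdims=True)  # for numerical stability
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def info_nce(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE: each image must pick out its text, and vice versa."""
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # cosine similarities, sharpened by temperature
    diag = np.arange(logits.shape[0])   # matched pairs sit on the diagonal
    loss_i2t = -log_softmax(logits)[diag, diag].mean()
    loss_t2i = -log_softmax(logits.T)[diag, diag].mean()
    return 0.5 * (loss_i2t + loss_t2i)

aligned = np.eye(4)
loss_aligned = info_nce(aligned, aligned)                       # perfectly matched pairs
loss_shuffled = info_nce(aligned, np.roll(aligned, 1, axis=0))  # every pair mismatched
loss_uniform = info_nce(np.ones((4, 3)), np.ones((4, 3)))       # no signal: log(batch size)
```

The loss is near zero when matched pairs are the most similar entries in their row and column, large when they are not, and equal to the log of the batch size when the embeddings carry no information.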
Fine-Tuning with LoRA
In phases/11-llm-engineering/08-fine-tuning-lora/code/lora.py, the repository demonstrates parameter-efficient training. The train_lora function freezes the base model weights and updates only low-rank adapter matrices, which sharply reduces the memory needed for gradients and optimizer state while aiming to preserve model quality.
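The mechanics of a low-rank adapter fit in a small NumPy sketch. The `LoRALinear` class, its zero initialization of `B`, the `alpha / rank` scaling, and the synthetic regression task are assumptions chosen to mirror the usual LoRA formulation; they are not taken from lora.py.

```python
import numpy as np

class LoRALinear:
    """y = x @ W.T + (alpha / rank) * (x @ A.T) @ B.T, with W frozen."""

    def __init__(self, weight, rank=2, alpha=2.0, seed=0):
        rng = np.random.default_rng(seed)
        d_out, d_in = weight.shape
        self.weight = weight                      # frozen base weight
        self.A = rng.normal(scale=0.1, size=(rank, d_in))
        self.B = np.zeros((d_out, rank))          # zero init: the adapter starts as a no-op
        self.scale = alpha / rank

    def forward(self, x):
        self.x, self.h = x, x @ self.A.T          # cached for the backward pass
        return x @ self.weight.T + self.scale * self.h @ self.B.T

    def sgd_step(self, grad_y, lr):
        # Gradients flow only into A and B; self.weight is never modified.
        grad_B = self.scale * grad_y.T @ self.h
        grad_A = self.scale * (grad_y @ self.B).T @ self.x
        self.A -= lr * grad_A
        self.B -= lr * grad_B

rng = np.random.default_rng(1)
x = rng.normal(size=(32, 8))
base = 0.1 * rng.normal(size=(4, 8))
target = x @ (base + 0.1 * rng.normal(size=(4, 8))).T  # a slightly shifted task

layer = LoRALinear(base.copy())
losses = []
for _ in range(50):
    err = layer.forward(x) - target
    losses.append(float(np.mean(err ** 2)))
    layer.sgd_step(2.0 * err / err.size, lr=0.5)

trainable = layer.A.size + layer.B.size  # 24 adapter values versus 32 in the base weight
```

At this toy size the saving is small, but the adapter count grows with `rank * (d_in + d_out)` rather than `d_in * d_out`, so the gap widens quickly for large layers.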
Summary
- NumPy-first loops: phases/10-llms-from-scratch/04-pre-training-mini-gpt/code/main.py exposes manual back-propagation with layernorm_backward and ffn_backward.
- Modern PyTorch steps: phases/19-capstone-projects/45-gradient-clipping-amp/code/main.py packages AMP, AdamW, and gradient clipping inside AmpTrainState.run_step.
- Learning-rate control: Cosine warmup schedules in phases/19-capstone-projects/44-cosine-lr-warmup/code/main.py prepare models for large-batch training.
- Distributed scale: phases/19-capstone-projects/81-end-to-end-distributed-train/code/main.py implements a custom ZeroOptimizer for ZeRO-1 sharding across GPUs.
- Domain-specific loops: RL agents, vision-language contrastive training, and LoRA fine-tuning each supply focused training functions that reuse the same four-step pattern.
Frequently Asked Questions
How does ai-engineering-from-scratch teach back-propagation?
The early lessons are written in NumPy, which has no autograd. In phases/10-llms-from-scratch/04-pre-training-mini-gpt/code/main.py, gradients are computed by explicitly calling layernorm_backward, ffn_backward, and compute_grads so learners trace every partial derivative before using PyTorch’s backward().
What mixed-precision training utilities are used?
The capstone project in phases/19-capstone-projects/45-gradient-clipping-amp/code/main.py uses torch.cuda.amp.autocast and GradScaler inside AmpTrainState.run_step. Gradient norms are clipped with torch.nn.utils.clip_grad_norm_ before the optimizer step.
Can the training code scale to multiple GPUs?
Yes. phases/19-capstone-projects/81-end-to-end-distributed-train/code/main.py demonstrates a ZeroOptimizer built on torch.distributed. It all-reduces gradients across ranks and has each rank update only its local shard, which makes it possible to train models whose full training state would not fit on a single GPU.
Where is distributed data parallel introduced?
Before the full ZeRO pipeline, phases/19-capstone-projects/77-data-parallel-ddp/code/main.py introduces DDP utilities that broadcast parameters and synchronize gradients with torch.distributed. This lesson lays the groundwork for the end-to-end distributed training capstone.