How the Transformer Architecture Is Implemented From First Principles
The rohitg00/ai-engineering-from-scratch repository implements the core components of the Transformer architecture (multi-head attention, layer normalization, residual connections, and position-wise feed-forward networks) in plain PyTorch, without high-level abstractions, exposing the complete pipeline from tensor operations to a working GPT-style decoder.
This educational codebase reconstructs the "Attention Is All You Need" foundation and modern GPT variations through two progressive lessons: the first builds a single Transformer block with configurable pre/post layer normalization, and the second assembles a full decoder-only language model. Each implementation lives in a self-contained Python module that mirrors the theoretical architecture while remaining fully executable.
Core Components of the Transformer Block
The minimal reusable unit resides in phases/19-capstone-projects/34-transformer-block/code/main.py, where the TransformerBlock class combines four essential sub-layers. This file demonstrates how raw tensors flow through normalization, attention, and feed-forward transformations with explicit residual connections.
Layer Normalization
Layer Normalization stabilizes training by normalizing each token's embedding across the feature dimension. The repository implements this manually in the LayerNorm class (lines 37-55) to expose the scale and shift learnable parameters plus the exact epsilon handling used for numerical stability.
class LayerNorm(nn.Module):
    def __init__(self, d_model, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.scale = nn.Parameter(torch.ones(d_model))
        self.shift = nn.Parameter(torch.zeros(d_model))

    def forward(self, x):
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        x_norm = (x - mean) / torch.sqrt(var + self.eps)
        return self.scale * x_norm + self.shift
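One way to sanity-check a hand-written layer norm like the one above is to compare it against PyTorch's built-in `torch.nn.functional.layer_norm`. The sketch below repeats the class so it runs on its own; the tensor sizes and variable names are arbitrary choices for illustration and do not come from the repository.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LayerNorm(nn.Module):
    """Manual layer norm, same math as the class shown above."""

    def __init__(self, d_model, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.scale = nn.Parameter(torch.ones(d_model))
        self.shift = nn.Parameter(torch.zeros(d_model))

    def forward(self, x):
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)  # biased variance
        return self.scale * (x - mean) / torch.sqrt(var + self.eps) + self.shift


torch.manual_seed(0)
x = torch.randn(2, 5, 16)  # (batch, tokens, d_model); sizes are arbitrary
manual = LayerNorm(16)
out = manual(x)

# Reference: PyTorch's functional layer norm over the last dimension.
ref = F.layer_norm(x, (16,), weight=manual.scale, bias=manual.shift, eps=manual.eps)

max_diff = (out - ref).abs().max().item()
per_token_mean = out.mean(dim=-1)                # close to 0 for every token
per_token_var = out.var(dim=-1, unbiased=False)  # close to 1 for every token
print(f"max |manual - reference| = {max_diff:.2e}")
```

Using the biased variance (`unbiased=False`) is what makes the manual version line up with the built-in one.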
Multi-Head Causal Self-Attention
The MultiHeadAttention class (lines 65-78) projects inputs into queries, keys, and values using a fused QKV projection via nn.Linear(d_model, 3*d_model), so a single matrix multiplication produces all three tensors. It computes scaled dot-product attention, applies a causal mask to prevent future-token leakage, and merges heads before output projection.
A pre-computed causal mask is stored as a buffer using torch.triu (lines 81-86), so no position can attend to a later position during either training or inference.
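The description above can be sketched roughly as follows. This is an illustrative reconstruction, not the repository's code: the class name `CausalSelfAttention`, its constructor arguments, and the attribute names are assumptions, and dropout is omitted for brevity.

```python
import math
import torch
import torch.nn as nn


class CausalSelfAttention(nn.Module):
    """Illustrative fused-QKV causal attention; not the repository's exact code."""

    def __init__(self, d_model, num_heads, context_length):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must divide evenly across heads"
        self.num_heads = num_heads
        self.head_dim = d_model // num_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)  # one matmul yields Q, K and V
        self.out_proj = nn.Linear(d_model, d_model)
        # True above the diagonal marks the future positions to hide.
        mask = torch.triu(torch.ones(context_length, context_length), diagonal=1).bool()
        self.register_buffer("mask", mask)

    def forward(self, x):
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # (b, t, d) -> (b, heads, t, head_dim)
        q, k, v = (
            z.view(b, t, self.num_heads, self.head_dim).transpose(1, 2)
            for z in (q, k, v)
        )
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.head_dim)
        scores = scores.masked_fill(self.mask[:t, :t], float("-inf"))
        weights = scores.softmax(dim=-1)
        out = (weights @ v).transpose(1, 2).contiguous().view(b, t, d)
        return self.out_proj(out), weights


torch.manual_seed(0)
attn = CausalSelfAttention(d_model=32, num_heads=4, context_length=8)
x = torch.randn(1, 6, 32)
y, w = attn(x)

# Perturbing the last token must leave the outputs at earlier positions unchanged.
x2 = x.clone()
x2[:, -1] += 1.0
y2, _ = attn(x2)
print(torch.allclose(y[:, :-1], y2[:, :-1], atol=1e-5))
```

The perturbation check at the end is a practical way to confirm causality: if any earlier output moved when only the last token changed, the mask would be leaking future information.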
Position-Wise Feed-Forward Network
Each block contains a FeedForward network (lines 115-124) that applies two linear transformations with a GELU activation in between. The hidden dimension expands by a configurable mlp_expansion factor (typically 4× the model dimension), creating the capacity for non-linear feature extraction.
class FeedForward(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.fc1 = nn.Linear(cfg.d_model, cfg.mlp_expansion * cfg.d_model)
        self.fc2 = nn.Linear(cfg.mlp_expansion * cfg.d_model, cfg.d_model)
        self.gelu = nn.GELU()

    def forward(self, x):
        return self.fc2(self.gelu(self.fc1(x)))
Residual Connections and Pre/Post Layer Normalization
The TransformerBlock supports both architectural variants through the pre_ln boolean flag. Pre-Layer Normalization (default in modern GPT models) applies normalization before the attention and MLP sub-layers, improving gradient flow in deep stacks. Post-Layer Normalization (original Transformer design) normalizes after the residual addition.
class TransformerBlock(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.pre_ln = cfg.pre_ln
        self.ln1 = LayerNorm(cfg.d_model)
        self.attn = MultiHeadAttention(cfg)
        self.ln2 = LayerNorm(cfg.d_model)
        self.mlp = FeedForward(cfg)

    def forward(self, x):
        if self.pre_ln:
            x = x + self.attn(self.ln1(x))  # attention inside residual
            x = x + self.mlp(self.ln2(x))   # MLP inside residual
        else:                               # post-LN variant
            x = self.ln1(x + self.attn(x))
            x = self.ln2(x + self.mlp(x))
        return x
The demo() function in the same file constructs two six-layer stacks (pre-LN vs. post-LN) and prints gradient norms at the embedding layer, empirically demonstrating the training stability advantages of pre-LN (lines 52-56).
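A comparison of this kind can be approximated with a short, self-contained probe. The sketch below is not the repository's demo(): it substitutes `nn.MultiheadAttention` and `nn.LayerNorm` for the hand-written layers, measures the gradient at the stack input rather than at an embedding layer, and uses an arbitrary random-projection loss. The names `Block` and `input_grad_norm` are hypothetical, and the numbers it prints depend on the seed and sizes, so it shows the measurement technique rather than a definitive result.

```python
import torch
import torch.nn as nn


class Block(nn.Module):
    """Stand-in block built on nn.MultiheadAttention; hypothetical, for illustration."""

    def __init__(self, d_model, num_heads, pre_ln):
        super().__init__()
        self.pre_ln = pre_ln
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )

    def _attend(self, x):
        t = x.size(1)
        causal = torch.triu(torch.ones(t, t), diagonal=1).bool()  # True = blocked
        out, _ = self.attn(x, x, x, attn_mask=causal, need_weights=False)
        return out

    def forward(self, x):
        if self.pre_ln:
            x = x + self._attend(self.ln1(x))
            return x + self.mlp(self.ln2(x))
        x = self.ln1(x + self._attend(x))
        return self.ln2(x + self.mlp(x))


def input_grad_norm(pre_ln, num_layers=6, d_model=32, num_heads=4, seed=0):
    """Gradient norm at the stack input for an arbitrary scalar loss."""
    torch.manual_seed(seed)
    stack = nn.Sequential(*[Block(d_model, num_heads, pre_ln) for _ in range(num_layers)])
    x = torch.randn(2, 10, d_model, requires_grad=True)
    target = torch.randn(2, 10, d_model)  # fixed random direction for the loss
    (stack(x) * target).mean().backward()
    return x.grad.norm().item()


pre = input_grad_norm(pre_ln=True)
post = input_grad_norm(pre_ln=False)
print(f"pre-LN input grad norm:  {pre:.4e}")
print(f"post-LN input grad norm: {post:.4e}")
```

Varying `num_layers` in the helper is a simple way to see how each variant's input gradient changes with depth.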
Assembling the Full GPT-Style Decoder
The complete decoder implementation lives in phases/19-capstone-projects/35-gpt-model-assembly/code/main.py. The GPTModel class stacks twelve TransformerBlock instances and adds the embedding layers, final normalization, and language modeling head required for text generation.
Embedding Layers and Weight Tying
The model initializes token embeddings (tok_embed) and positional embeddings (pos_embed), summing them before applying dropout. Following modern best practices, the repository implements weight tying (lines 36-38), where the output projection lm_head shares parameters with the token embedding matrix:
self.tok_embed = nn.Embedding(cfg.vocab_size, cfg.d_model)
self.pos_embed = nn.Embedding(cfg.context_length, cfg.d_model)
self.embed_dropout = nn.Dropout(cfg.dropout)
# lm_head tied to token embedding for weight sharing
self.lm_head = nn.Linear(cfg.d_model, cfg.vocab_size, bias=False)
if cfg.weight_tying:
    self.lm_head.weight = self.tok_embed.weight
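The effect of tying can be checked by counting parameters with and without it. The following is a minimal sketch using a hypothetical `TinyHead` module that contains only the embedding and the output head; the vocabulary size and model dimension are example values.

```python
import torch
import torch.nn as nn


class TinyHead(nn.Module):
    """Embedding plus output head only; a hypothetical stand-in for GPTModel."""

    def __init__(self, vocab_size, d_model, weight_tying):
        super().__init__()
        self.tok_embed = nn.Embedding(vocab_size, d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
        if weight_tying:
            # Both layers store a (vocab_size, d_model) matrix, so it can be shared.
            self.lm_head.weight = self.tok_embed.weight


def count_params(module):
    # parameters() yields each shared tensor once, so tied weights are not double counted.
    return sum(p.numel() for p in module.parameters())


vocab_size, d_model = 512, 64
tied = TinyHead(vocab_size, d_model, weight_tying=True)
untied = TinyHead(vocab_size, d_model, weight_tying=False)

saved = count_params(untied) - count_params(tied)
print("untied:", count_params(untied))  # 2 * 512 * 64 = 65536
print("tied:  ", count_params(tied))    # 512 * 64 = 32768
print("saved: ", saved)                 # vocab_size * d_model = 32768
```

Tying works without a transpose because `nn.Linear(d_model, vocab_size)` stores its weight with shape `(vocab_size, d_model)`, the same shape as the embedding table.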
Stacking Transformer Blocks
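The model creates a ModuleList of identical pre-LN blocks and applies residual scaling via _scale_residual_projections (lines 53-58). This scales each block's projection matrices by 1/√(2·L), where L is the number of layers and the factor 2 reflects the two residual additions per block (attention and MLP). The aim is to keep the variance of the residual stream roughly constant as depth grows, rather than letting it accumulate with every added sub-layer.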
# cfg.num_layers identical pre-LN blocks (12 in the full model)
self.blocks = nn.ModuleList([TransformerBlock(cfg) for _ in range(cfg.num_layers)])
self.final_ln = LayerNorm(cfg.d_model)
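The reasoning behind the 1/√(2·L) factor can be illustrated with a toy simulation. It assumes each of the 2·L sub-layer outputs behaves like an independent unit-variance random number, which real sub-layer outputs do not, so treat it as intuition only. The function names are hypothetical and the code uses only the standard library.

```python
import math
import random
import statistics


def residual_scale(num_layers):
    """Factor 1/sqrt(2 * L): two residual additions (attention, MLP) per layer."""
    return 1.0 / math.sqrt(2 * num_layers)


def stream_variance(num_layers, scale, samples=5000, seed=0):
    """Empirical variance after summing 2 * L independent unit-variance updates.

    Toy model only: real sub-layer outputs are neither independent nor Gaussian.
    """
    rng = random.Random(seed)
    totals = []
    for _ in range(samples):
        totals.append(sum(scale * rng.gauss(0.0, 1.0) for _ in range(2 * num_layers)))
    return statistics.pvariance(totals)


L = 12
scale = residual_scale(L)
unscaled = stream_variance(L, 1.0)   # grows like 2 * L
scaled = stream_variance(L, scale)   # stays near 1
print(f"scale factor for L={L}: {scale:.4f}")
print(f"variance without scaling: {unscaled:.2f}")
print(f"variance with scaling:    {scaled:.2f}")
```

Under these assumptions the unscaled sum has variance close to 2·L = 24, while the scaled sum stays close to 1 regardless of depth.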
Dropout is applied to the initial embeddings and after each sub-layer (lines 11-12, 78-79), with a causal mask enforced in every attention layer to maintain autoregressive properties.
Generation Pipeline
The generate() function (lines 92-124) implements autoregressive text generation by repeatedly feeding the most recent context window (maximum cfg.context_length) into the model. It applies temperature scaling to logits, optional top-k filtering to restrict the vocabulary, and multinomial sampling to produce the next token.
def generate(model, prompt, max_new_tokens, temperature=1.0, top_k=None):
    model.eval()
    for _ in range(max_new_tokens):
        # Crop context to maximum length
        context = prompt[:, -model.cfg.context_length:]
        logits, _ = model(context)
        logits = logits[:, -1, :] / temperature
        if top_k is not None:
            v, _ = torch.topk(logits, top_k)
            logits[logits < v[:, [-1]]] = -float('Inf')
        probs = F.softmax(logits, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)
        prompt = torch.cat([prompt, next_token], dim=1)
    return prompt
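To see what temperature scaling and top-k filtering do to a single step, the sampling logic can be isolated from the model. The helper below, `filter_logits`, is a hypothetical name introduced for this sketch, and the logits are made-up values over a six-token vocabulary. It also clamps `top_k` to the vocabulary size, a guard the function above does not include.

```python
import torch
import torch.nn.functional as F


def filter_logits(logits, temperature=1.0, top_k=None):
    """Return next-token probabilities after temperature scaling and top-k filtering."""
    logits = logits / temperature        # new tensor, the caller's logits stay intact
    if top_k is not None:
        k = min(top_k, logits.size(-1))  # guard: topk fails if k exceeds the vocabulary
        kth_value = torch.topk(logits, k).values[:, [-1]]
        logits = logits.masked_fill(logits < kth_value, float("-inf"))
    return F.softmax(logits, dim=-1)


# One row of made-up logits over a six-token vocabulary.
logits = torch.tensor([[2.0, 1.0, 0.5, 0.0, -1.0, -3.0]])

probs = filter_logits(logits, temperature=0.8, top_k=3)
kept = (probs > 0).sum().item()  # only the three largest logits survive

# Lower temperature concentrates probability on the top token.
sharp = filter_logits(logits, temperature=0.5)
flat = filter_logits(logits, temperature=2.0)

torch.manual_seed(0)
next_token = torch.multinomial(probs, num_samples=1)  # shape (1, 1)
print("probabilities:", probs)
print("tokens kept:", kept)
print("sampled id:", next_token.item())
```

Because filtered positions receive a logit of negative infinity, their probability after softmax is exactly zero, so sampling can only return one of the surviving token IDs.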
Practical Code Examples
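Execute these snippets from the repository root to experiment with the first-principles implementation:
# The lesson directories contain hyphens and leading digits, so they cannot
# appear in a regular import statement; the files are loaded by path instead.
# load_lesson is a helper defined here, not part of the repository.
import importlib.util
import sys

import torch


def load_lesson(name, path):
    """Import a lesson's main.py from its file path."""
    spec = importlib.util.spec_from_file_location(name, path)
    module = importlib.util.module_from_spec(spec)
    sys.modules[name] = module  # register before executing the module body
    spec.loader.exec_module(module)
    return module


# -------------------------------------------------
# Example 1: Run the pre-LN vs. post-LN demo
# -------------------------------------------------
block = load_lesson(
    "transformer_block",
    "phases/19-capstone-projects/34-transformer-block/code/main.py",
)
block.demo()  # prints shapes, gradient norms and the ratio

# -------------------------------------------------
# Example 2: Build a tiny GPT model and generate text
# -------------------------------------------------
gpt = load_lesson(
    "gpt_model_assembly",
    "phases/19-capstone-projects/35-gpt-model-assembly/code/main.py",
)

# tiny configuration for quick testing
cfg = gpt.GPTConfig(
    vocab_size=512,  # small vocab for demo
    context_length=64,
    d_model=64,
    num_heads=4,
    num_layers=2,
    dropout=0.0,
    weight_tying=True,
)
model = gpt.GPTModel(cfg)

# a simple numeric prompt (token IDs)
prompt = torch.tensor([[1, 2, 3, 4, 5]], dtype=torch.long)

# generate() as shown above takes no seed argument, so seed the global RNG
torch.manual_seed(42)

# generate 12 new tokens, temperature 0.8, top-k=20
generated = gpt.generate(
    model,
    prompt,
    max_new_tokens=12,
    temperature=0.8,
    top_k=20,
)
print("Prompt   :", prompt.tolist()[0])
print("Generated:", generated.tolist()[0])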
Summary
- The Transformer block in phases/19-capstone-projects/34-transformer-block/code/main.py combines manual LayerNorm, fused QKV multi-head attention, GELU-based MLP, and configurable pre/post-LN residuals.
- Pre-Layer Normalization improves gradient flow compared to post-LN, demonstrated through empirical gradient norm measurements in the demo() function.
- The GPT model in phases/19-capstone-projects/35-gpt-model-assembly/code/main.py stacks twelve blocks with token/positional embeddings, weight tying, and residual scaling.
- Weight tying reduces parameters by sharing the embedding matrix with the output projection layer.
- Autoregressive generation is enforced through causal masking in attention and implemented via the generate() helper with temperature and top-k sampling.
Frequently Asked Questions
What is the difference between pre-LN and post-LN Transformers?
Pre-Layer Normalization applies normalization before the attention and MLP sub-layers (inside the residual path), while Post-Layer Normalization applies it after adding the residual. According to the implementation in phases/19-capstone-projects/34-transformer-block/code/main.py, pre-LN significantly improves gradient flow in deep networks, as demonstrated by the higher gradient norms at early layers compared to the post-LN variant.
Why is weight tying used in the GPT implementation?
Weight tying reuses the token embedding matrix for the final language modeling head projection, implemented in phases/19-capstone-projects/35-gpt-model-assembly/code/main.py at lines 36-38. This reduces the total parameter count by the vocabulary size times the model dimension. Because the input and output token representations are forced to share one matrix, tying also tends to act as a regularizer.
How does the causal mask work in the attention mechanism?
The causal mask prevents the model from attending to future tokens during training, ensuring autoregressive generation remains valid. In phases/19-capstone-projects/34-transformer-block/code/main.py, the mask is pre-computed using torch.triu and stored as a buffer (lines 81-86). During the forward pass, this triangular mask sets attention scores for future positions to negative infinity before the softmax operation.
What is residual scaling and why is it important?
Residual scaling multiplies the projection matrices in each Transformer block by 1/√(2·L), where L is the total number of layers. Implemented in the _scale_residual_projections method of GPTModel, this technique keeps activation magnitudes from growing with depth as each sub-layer adds its output to the residual stream, which helps keep forward passes and gradients stable during training.