How the Transformer Architecture Is Implemented From First Principles

The rohitg00/ai-engineering-from-scratch repository implements every mathematical component of the Transformer architecture—multi-head attention, layer normalization, residual connections, and position-wise feed-forward networks—using pure PyTorch without high-level abstractions, exposing the complete pipeline from tensor operations to a working GPT-style decoder.

This educational codebase reconstructs the "Attention is All You Need" foundation and modern GPT variations through two progressive lessons: first building a single Transformer block with configurable pre/post layer normalization, then assembling a full decoder-only language model. Each implementation lives in self-contained Python modules that mirror the theoretical architecture while remaining fully executable.

Core Components of the Transformer Block

The minimal reusable unit resides in phases/19-capstone-projects/34-transformer-block/code/main.py, where the TransformerBlock class combines four essential sub-layers. This file demonstrates how raw tensors flow through normalization, attention, and feed-forward transformations with explicit residual connections.

Layer Normalization

Layer Normalization stabilizes training by normalizing each token's embedding across the feature dimension. The repository implements this manually in the LayerNorm class (lines 37-55) to expose the scale and shift learnable parameters plus the exact epsilon handling used for numerical stability.

class LayerNorm(nn.Module):
    def __init__(self, d_model, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.scale = nn.Parameter(torch.ones(d_model))
        self.shift = nn.Parameter(torch.zeros(d_model))
    
    def forward(self, x):
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        x_norm = (x - mean) / torch.sqrt(var + self.eps)
        return self.scale * x_norm + self.shift
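
One way to sanity-check a hand-written layer norm is to compare it against PyTorch's built-in torch.nn.functional.layer_norm. The sketch below repeats the class shown above so that it runs on its own; the tensor sizes and seed are arbitrary choices for illustration, not values taken from the repository.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LayerNorm(nn.Module):
    """Same manual layer norm as above, repeated so this snippet is self-contained."""

    def __init__(self, d_model, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.scale = nn.Parameter(torch.ones(d_model))
        self.shift = nn.Parameter(torch.zeros(d_model))

    def forward(self, x):
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)  # biased variance, as in the built-in
        return self.scale * (x - mean) / torch.sqrt(var + self.eps) + self.shift


torch.manual_seed(0)
x = torch.randn(2, 5, 16)                    # (batch, tokens, d_model); sizes are arbitrary
manual = LayerNorm(16)(x)
builtin = F.layer_norm(x, (16,), eps=1e-5)   # no weight/bias passed, so scale=1 and shift=0

max_diff = (manual - builtin).abs().max().item()
print(f"max abs difference: {max_diff:.2e}")
```

With the default scale of ones and shift of zeros, the two outputs should agree to within floating-point rounding.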

Multi-Head Causal Self-Attention

The MultiHeadAttention class (lines 65-78) projects inputs into queries, keys, and values with a single fused nn.Linear(d_model, 3*d_model) layer, so one matrix multiplication replaces three separate projections. It then computes scaled dot-product attention, applies a causal mask to prevent future-token leakage, and merges the heads before the output projection.

A pre-computed causal mask is stored as a buffer using torch.triu (lines 81-86), ensuring autoregressive generation remains mathematically sound during both training and inference.
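
The article does not reproduce the attention code, so the following is a minimal sketch of the technique described above, not the repository's implementation. The class name CausalSelfAttention, its constructor arguments, and the attribute names qkv, out_proj, and mask are all assumptions chosen for illustration.

```python
import math
import torch
import torch.nn as nn


class CausalSelfAttention(nn.Module):
    """Illustrative sketch; names and signature are assumptions, not the repo's code."""

    def __init__(self, d_model, num_heads, context_length):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        self.num_heads = num_heads
        self.head_dim = d_model // num_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)      # fused Q, K, V projection
        self.out_proj = nn.Linear(d_model, d_model)
        # True strictly above the diagonal marks the future positions to hide.
        mask = torch.triu(torch.ones(context_length, context_length), diagonal=1).bool()
        self.register_buffer("mask", mask)

    def forward(self, x):
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # (b, t, d) -> (b, heads, t, head_dim)
        q, k, v = (z.reshape(b, t, self.num_heads, self.head_dim).transpose(1, 2)
                   for z in (q, k, v))
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.head_dim)
        scores = scores.masked_fill(self.mask[:t, :t], float("-inf"))
        weights = torch.softmax(scores, dim=-1)
        out = (weights @ v).transpose(1, 2).reshape(b, t, d)   # merge heads
        return self.out_proj(out)


torch.manual_seed(0)
attn = CausalSelfAttention(d_model=32, num_heads=4, context_length=8).eval()
x = torch.randn(1, 6, 32)
y = attn(x)

# Causality check: perturbing the last token must not change earlier outputs.
x_perturbed = x.clone()
x_perturbed[:, -1, :] += 1.0
y_perturbed = attn(x_perturbed)
earlier_unchanged = torch.allclose(y[:, :-1], y_perturbed[:, :-1], atol=1e-5)
print(y.shape, earlier_unchanged)
```

The perturbation test at the end is a convenient way to confirm that a mask really is causal: outputs at positions before the modified token should not move.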

Position-Wise Feed-Forward Network

Each block contains a FeedForward network (lines 115-124) that applies two linear transformations with a GELU activation in between. The hidden dimension expands by a configurable mlp_expansion factor (typically 4× the model dimension), creating the capacity for non-linear feature extraction.

class FeedForward(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.fc1 = nn.Linear(cfg.d_model, cfg.mlp_expansion * cfg.d_model)
        self.fc2 = nn.Linear(cfg.mlp_expansion * cfg.d_model, cfg.d_model)
        self.gelu = nn.GELU()
    
    def forward(self, x):
        return self.fc2(self.gelu(self.fc1(x)))
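
To get a feel for how much of a block's capacity sits in this MLP, the parameter count can be worked out by hand. The helper below is a hypothetical illustration (it is not part of the repository), and d_model=768 is an assumed width used only as an example.

```python
def feed_forward_param_count(d_model: int, mlp_expansion: int = 4) -> int:
    """Count weights and biases of the two linear layers in the feed-forward network."""
    hidden = mlp_expansion * d_model
    fc1 = d_model * hidden + hidden      # expansion layer: weights + biases
    fc2 = hidden * d_model + d_model     # contraction layer: weights + biases
    return fc1 + fc2


# d_model=768 is an assumed example width, not a value read from the repository.
total = feed_forward_param_count(768, 4)
print(total)  # 4722432
```

Ignoring biases, the count is 2 · mlp_expansion · d_model², so the MLP grows quadratically with the model dimension.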

Residual Connections and Pre/Post Layer Normalization

The TransformerBlock supports both architectural variants through the pre_ln boolean flag. Pre-Layer Normalization (default in modern GPT models) applies normalization before the attention and MLP sub-layers, improving gradient flow in deep stacks. Post-Layer Normalization (original Transformer design) normalizes after the residual addition.

class TransformerBlock(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.pre_ln = cfg.pre_ln
        self.ln1 = LayerNorm(cfg.d_model)
        self.attn = MultiHeadAttention(cfg)
        self.ln2 = LayerNorm(cfg.d_model)
        self.mlp = FeedForward(cfg)

    def forward(self, x):
        if self.pre_ln:
            x = x + self.attn(self.ln1(x))   # attention inside residual
            x = x + self.mlp(self.ln2(x))    # MLP inside residual
        else:                                 # post-LN variant
            x = self.ln1(x + self.attn(x))
            x = self.ln2(x + self.mlp(x))
        return x

The demo() function in the same file constructs two six-layer stacks (pre-LN vs. post-LN) and prints gradient norms at the embedding layer, empirically demonstrating the training stability advantages of pre-LN (lines 52-56).
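
The measurement idea can be sketched in a few lines. The code below is a simplified stand-in for that kind of experiment, not the repository's demo(): it omits attention, uses nn.LayerNorm, and measures the gradient at the input tensor. MLPBlock and input_grad_norm are hypothetical names, and the printed values depend on the seed, the sizes, and the loss, so no particular outcome is implied.

```python
import torch
import torch.nn as nn


class MLPBlock(nn.Module):
    """Residual block with only an MLP sub-layer (attention omitted to keep the sketch short)."""

    def __init__(self, d_model, pre_ln):
        super().__init__()
        self.pre_ln = pre_ln
        self.ln = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, x):
        if self.pre_ln:
            return x + self.mlp(self.ln(x))   # normalize inside the residual branch
        return self.ln(x + self.mlp(x))       # normalize after the residual addition


def input_grad_norm(pre_ln, num_layers=6, d_model=32, seed=0):
    """Build a stack, backpropagate a scalar loss, and return the gradient norm at the input."""
    torch.manual_seed(seed)
    stack = nn.Sequential(*[MLPBlock(d_model, pre_ln) for _ in range(num_layers)])
    x = torch.randn(4, 10, d_model, requires_grad=True)
    stack(x).pow(2).mean().backward()
    return x.grad.norm().item()


pre = input_grad_norm(pre_ln=True)
post = input_grad_norm(pre_ln=False)
print(f"pre-LN input grad norm:  {pre:.4f}")
print(f"post-LN input grad norm: {post:.4f}")
```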

Assembling the Full GPT-Style Decoder

The complete decoder implementation lives in phases/19-capstone-projects/35-gpt-model-assembly/code/main.py. The GPTModel class stacks twelve TransformerBlock instances and adds the embedding layers, final normalization, and language modeling head required for text generation.

Embedding Layers and Weight Tying

The model initializes token embeddings (tok_embed) and positional embeddings (pos_embed), summing them before applying dropout. Following modern best practices, the repository implements weight tying (lines 36-38), where the output projection lm_head shares parameters with the token embedding matrix:

self.tok_embed = nn.Embedding(cfg.vocab_size, cfg.d_model)
self.pos_embed = nn.Embedding(cfg.context_length, cfg.d_model)
self.embed_dropout = nn.Dropout(cfg.dropout)

# lm_head tied to token embedding for weight sharing
self.lm_head = nn.Linear(cfg.d_model, cfg.vocab_size, bias=False)
if cfg.weight_tying:
    self.lm_head.weight = self.tok_embed.weight
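
The effect of tying can be checked directly on standalone layers. This is a minimal sketch using the same layer types as the excerpt above; the sizes are illustrative, and the forward computation (token embedding plus positional embedding, then the head) is a simplified assumption that skips dropout and the Transformer blocks.

```python
import torch
import torch.nn as nn

vocab_size, context_length, d_model = 512, 64, 64    # illustrative sizes

tok_embed = nn.Embedding(vocab_size, d_model)
pos_embed = nn.Embedding(context_length, d_model)
lm_head = nn.Linear(d_model, vocab_size, bias=False)

untied = tok_embed.weight.numel() + lm_head.weight.numel()
lm_head.weight = tok_embed.weight                     # tie: both modules now hold one Parameter
tied = tok_embed.weight.numel()

idx = torch.tensor([[1, 2, 3, 4, 5]])                 # (batch=1, tokens=5)
positions = torch.arange(idx.shape[1])                # 0..4, broadcast over the batch
h = tok_embed(idx) + pos_embed(positions)             # (1, 5, d_model)
logits = lm_head(h)                                   # (1, 5, vocab_size)

shared = lm_head.weight is tok_embed.weight
saved = untied - tied                                 # equals vocab_size * d_model
print(shared, saved, tuple(logits.shape))
```

The shapes line up without a transpose because nn.Embedding stores its matrix as (vocab_size, d_model) and nn.Linear stores its weight as (out_features, in_features), which here is also (vocab_size, d_model).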

Stacking Transformer Blocks

The model creates a ModuleList of identical pre-LN blocks and applies residual scaling via _scale_residual_projections (lines 53-58). This scales each block's projection matrices by 1/√(2·L) where L is the number of layers, preventing activation explosion in deep networks.


# 12 identical pre-LN blocks
self.blocks = nn.ModuleList([TransformerBlock(cfg) for _ in range(cfg.num_layers)])
self.final_ln = LayerNorm(cfg.d_model)
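
The residual scaling described above can be sketched as follows. The function name scale_residual_projections and the choice of which layers to pass in are assumptions for illustration; the article does not show the body of _scale_residual_projections, so this is only one plausible reading of it.

```python
import math
import torch
import torch.nn as nn


def scale_residual_projections(projections, num_layers):
    """Multiply each given layer's weight in place by 1/sqrt(2 * num_layers)."""
    factor = 1.0 / math.sqrt(2 * num_layers)
    with torch.no_grad():                 # plain re-initialization, not a tracked operation
        for linear in projections:
            linear.weight.mul_(factor)
    return factor


torch.manual_seed(0)
num_layers = 12
proj = nn.Linear(64, 64)                  # stand-in for one block's output projection
std_before = proj.weight.std().item()
factor = scale_residual_projections([proj], num_layers)
std_after = proj.weight.std().item()

print(f"factor = {factor:.4f}")           # 1/sqrt(24)
print(f"std ratio = {std_after / std_before:.4f}")
```

For twelve layers the factor is 1/√24, roughly 0.204, so each scaled projection starts with about a fifth of its default weight spread.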

Dropout is applied to the initial embeddings and after each sub-layer (lines 11-12, 78-79), with a causal mask enforced in every attention layer to maintain autoregressive properties.

Generation Pipeline

The generate() function (lines 92-124) implements autoregressive text generation by repeatedly feeding the most recent context window (maximum cfg.context_length) into the model. It applies temperature scaling to logits, optional top-k filtering to restrict the vocabulary, and multinomial sampling to produce the next token.

def generate(model, prompt, max_new_tokens, temperature=1.0, top_k=None):
    model.eval()
    for _ in range(max_new_tokens):
        # Crop context to maximum length
        context = prompt[:, -model.cfg.context_length:]
        logits, _ = model(context)
        logits = logits[:, -1, :] / temperature
        
        if top_k is not None:
            v, _ = torch.topk(logits, top_k)
            logits[logits < v[:, [-1]]] = -float('Inf')
        
        probs = F.softmax(logits, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)
        prompt = torch.cat([prompt, next_token], dim=1)
    return prompt
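
A small worked example makes the temperature and top-k steps concrete. The logits below are made-up scores for a four-token vocabulary, and the snippet uses an out-of-place masked_fill instead of the indexed assignment shown above; both approaches mask the same entries.

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 1.0, 0.5, -1.0]])    # made-up scores for a 4-token vocabulary
temperature, top_k = 0.8, 2

scaled = logits / temperature                      # temperature < 1 sharpens the distribution
kth_largest = torch.topk(scaled, top_k).values[:, [-1]]   # smallest score that survives
filtered = scaled.masked_fill(scaled < kth_largest, float("-inf"))
probs = F.softmax(filtered, dim=-1)                # masked tokens get probability 0

num_candidates = int((probs > 0).sum())
print(probs)
print(num_candidates)  # 2
```

Only the two highest-scoring tokens keep non-zero probability, so torch.multinomial can never sample the masked ones.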

Practical Code Examples

Execute these snippets to experiment with the first-principles implementation:


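import importlib.util
import sys

import torch


def load_lesson(path, name):
    # The lesson directories contain hyphens and start with digits, so a regular
    # `import` statement cannot reach them. Load each file by path instead.
    # Paths are relative to the repository root.
    spec = importlib.util.spec_from_file_location(name, path)
    module = importlib.util.module_from_spec(spec)
    sys.modules[name] = module          # register first so dataclasses resolve correctly
    spec.loader.exec_module(module)
    return module


# -------------------------------------------------
# Example 1: Run the pre-LN vs. post-LN demo
# -------------------------------------------------
block = load_lesson(
    "phases/19-capstone-projects/34-transformer-block/code/main.py",
    "transformer_block",
)
block.demo()     # prints shapes, gradient norms and the ratio

# -------------------------------------------------
# Example 2: Build a tiny GPT model and generate text
# -------------------------------------------------
gpt = load_lesson(
    "phases/19-capstone-projects/35-gpt-model-assembly/code/main.py",
    "gpt_model",
)

# tiny configuration for quick testing
cfg = gpt.GPTConfig(
    vocab_size=512,      # small vocab for demo
    context_length=64,
    d_model=64,
    num_heads=4,
    num_layers=2,
    dropout=0.0,
    weight_tying=True,
)

model = gpt.GPTModel(cfg)

# a simple numeric prompt (token IDs)
prompt = torch.tensor([[1, 2, 3, 4, 5]], dtype=torch.long)

# generate() as shown above takes no seed argument, so seed the global RNG instead
torch.manual_seed(42)

# generate 12 new tokens, temperature 0.8, top-k=20
generated = gpt.generate(
    model,
    prompt,
    max_new_tokens=12,
    temperature=0.8,
    top_k=20,
)

print("Prompt :", prompt.tolist()[0])
print("Generated:", generated.tolist()[0])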

Summary

  • The Transformer block in phases/19-capstone-projects/34-transformer-block/code/main.py combines manual LayerNorm, fused QKV multi-head attention, GELU-based MLP, and configurable pre/post-LN residuals.
  • Pre-Layer Normalization improves gradient flow compared to post-LN, demonstrated through empirical gradient norm measurements in the demo() function.
  • The GPT model in phases/19-capstone-projects/35-gpt-model-assembly/code/main.py stacks twelve blocks with token/positional embeddings, weight tying, and residual scaling.
  • Weight tying reduces parameters by sharing the embedding matrix with the output projection layer.
  • Autoregressive generation is enforced through causal masking in attention and implemented via the generate() helper with temperature and top-k sampling.

Frequently Asked Questions

What is the difference between pre-LN and post-LN Transformers?

Pre-Layer Normalization applies normalization before the attention and MLP sub-layers (inside the residual branch), while Post-Layer Normalization applies it after the residual addition. The demo() function in phases/19-capstone-projects/34-transformer-block/code/main.py compares the two: pre-LN improves gradient flow in deep networks, which shows up as higher gradient norms at early layers than in the post-LN variant.

Why is weight tying used in the GPT implementation?

Weight tying reuses the token embedding matrix for the final language modeling head projection, implemented in phases/19-capstone-projects/35-gpt-model-assembly/code/main.py at lines 36-38. This reduces the total parameter count by the size of the vocabulary times the model dimension, acting as a regularization mechanism and improving training efficiency without sacrificing model capacity.

How does the causal mask work in the attention mechanism?

The causal mask prevents each position from attending to later tokens, during both training and inference, so that autoregressive generation remains valid. In phases/19-capstone-projects/34-transformer-block/code/main.py, the mask is pre-computed using torch.triu and stored as a buffer (lines 81-86). During the forward pass, this triangular mask sets attention scores for future positions to negative infinity before the softmax operation, which gives those positions a weight of exactly zero.

What is residual scaling and why is it important?

Residual scaling multiplies the projection matrices in each Transformer block by 1/√(2·L), where L is the total number of layers; the factor of 2 reflects the two residual additions (attention and MLP) in each block. Implemented in the _scale_residual_projections method of GPTModel, this technique keeps activation magnitudes from growing as contributions accumulate through deep stacks of residual connections, which keeps forward passes and gradients stable during training.
