# How the Transformer Architecture Is Implemented From First Principles

> Learn how the Transformer architecture is implemented from first principles using pure PyTorch. Explore every mathematical component of a working GPT-style decoder without high-level abstractions.

- Repository: [Rohit Ghumare/ai-engineering-from-scratch](https://github.com/rohitg00/ai-engineering-from-scratch)
- Tags: deep-dive
- Published: 2026-06-11

---

**The rohitg00/ai-engineering-from-scratch repository implements every mathematical component of the Transformer architecture (multi-head attention, layer normalization, residual connections, and position-wise feed-forward networks) in pure PyTorch without high-level abstractions, exposing the complete pipeline from tensor operations to a working GPT-style decoder.**

This educational codebase reconstructs the foundation laid by "Attention Is All You Need" and its modern GPT variations through two progressive lessons: the first builds a single Transformer block with configurable pre/post layer normalization, and the second assembles a full decoder-only language model. Each implementation lives in a self-contained Python module that mirrors the theoretical architecture while remaining fully executable.

## Core Components of the Transformer Block

The minimal reusable unit resides in [`phases/19-capstone-projects/34-transformer-block/code/main.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/phases/19-capstone-projects/34-transformer-block/code/main.py), where the `TransformerBlock` class combines four essential sub-layers. This file demonstrates how raw tensors flow through normalization, attention, and feed-forward transformations with explicit residual connections.

### Layer Normalization

**Layer Normalization** stabilizes training by normalizing each token's embedding across the feature dimension. The repository implements this manually in the `LayerNorm` class (lines 37-55) to expose the `scale` and `shift` learnable parameters plus the exact epsilon handling used for numerical stability.

```python
import torch
from torch import nn


class LayerNorm(nn.Module):
    def __init__(self, d_model, eps=1e-5):
        super().__init__()
        self.eps = eps  # keeps the division stable when the variance is near zero
        self.scale = nn.Parameter(torch.ones(d_model))   # learnable gain
        self.shift = nn.Parameter(torch.zeros(d_model))  # learnable bias

    def forward(self, x):
        # Statistics are computed per token, across the feature dimension.
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        x_norm = (x - mean) / torch.sqrt(var + self.eps)
        return self.scale * x_norm + self.shift
```
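
To make the arithmetic inside `forward` concrete, the following is a minimal pure-Python sketch of the same normalization applied to a single token vector. It is not the repository's code: the function name `layer_norm` and the sample vector are illustrative assumptions, and the standard library is used instead of PyTorch so that each step is visible.

```python
import math


def layer_norm(x, scale=None, shift=None, eps=1e-5):
    """Normalize one token vector across its feature dimension (illustrative helper)."""
    n = len(x)
    scale = scale if scale is not None else [1.0] * n  # mirrors torch.ones(d_model)
    shift = shift if shift is not None else [0.0] * n  # mirrors torch.zeros(d_model)
    mean = sum(x) / n
    # Biased variance (divide by n), matching unbiased=False in the class above.
    var = sum((v - mean) ** 2 for v in x) / n
    return [g * (v - mean) / math.sqrt(var + eps) + b
            for v, g, b in zip(x, scale, shift)]


token = [2.0, 4.0, 6.0, 8.0]  # hypothetical 4-dimensional embedding
out = layer_norm(token)
out_mean = sum(out) / len(out)
out_var = sum((v - out_mean) ** 2 for v in out) / len(out)
print([round(v, 4) for v in out])
print("mean:", round(out_mean, 6), "variance:", round(out_var, 4))
```

With the default `scale` and `shift`, the output has a mean of approximately zero and a variance just under one; the small gap from one comes from `eps` in the denominator.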

### Multi-Head Causal Self-Attention

The **MultiHeadAttention** class (lines 65-78) projects inputs into queries, keys, and values with a fused QKV projection, `nn.Linear(d_model, 3*d_model)`, so a single matrix multiplication produces all three. It then computes scaled dot-product attention per head, applies a causal mask to prevent future-token leakage, and merges the heads before the output projection.

A pre-computed causal mask is built with `torch.triu` and stored as a buffer (lines 81-86), so each position can attend only to itself and earlier positions, during training and inference alike.
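
The effect of the mask on the attention weights can be sketched for a single head in pure Python. This is a simplified illustration under stated assumptions, not the repository's `MultiHeadAttention`: it omits the fused QKV projection, the head split, and the output projection, and the names `softmax` and `causal_attention` are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention(q, k, v):
    """Single-head scaled dot-product attention with a causal mask.

    q, k and v are lists of T vectors, each of length d_head.
    """
    T, d_head = len(q), len(q[0])
    out, weights = [], []
    for i in range(T):
        scores = []
        for j in range(T):
            if j > i:
                # Future position: -inf becomes exactly 0 after the softmax.
                scores.append(float("-inf"))
            else:
                dot = sum(a * b for a, b in zip(q[i], k[j]))
                scores.append(dot / math.sqrt(d_head))
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of the value vectors.
        out.append([sum(w[j] * v[j][c] for j in range(T)) for c in range(d_head)])
    return out, weights


# Hypothetical toy input: three tokens, head dimension 2.
q = k = v = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, weights = causal_attention(q, k, v)
for row in weights:
    print([round(w, 3) for w in row])
```

Every printed row sums to one, and every entry above the diagonal is zero, which is the property the upper-triangular mask enforces.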

### Position-Wise Feed-Forward Network

Each block contains a **FeedForward** network (lines 115-124) that applies two linear transformations with a GELU activation in between. The hidden dimension expands by a configurable `mlp_expansion` factor (typically 4× the model dimension), creating the capacity for non-linear feature extraction.

```python
class FeedForward(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        # Expand to mlp_expansion * d_model, then project back to d_model.
        self.fc1 = nn.Linear(cfg.d_model, cfg.mlp_expansion * cfg.d_model)
        self.fc2 = nn.Linear(cfg.mlp_expansion * cfg.d_model, cfg.d_model)
        self.gelu = nn.GELU()

    def forward(self, x):
        return self.fc2(self.gelu(self.fc1(x)))
```
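
For reference, the activation and the size of this sub-layer can be worked out by hand. The sketch below assumes the exact (erf-based) GELU, which is the default behaviour of `nn.GELU()`, and counts weights plus biases for both linear layers; the helper names `gelu` and `ffn_param_count` are hypothetical and do not appear in the repository.

```python
import math


def gelu(x):
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def ffn_param_count(d_model, mlp_expansion=4):
    """Parameters of two linear layers with biases (illustrative helper)."""
    hidden = mlp_expansion * d_model
    fc1 = d_model * hidden + hidden    # weight + bias
    fc2 = hidden * d_model + d_model   # weight + bias
    return fc1 + fc2


for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(x, round(gelu(x), 4))

print(ffn_param_count(64))  # d_model=64 with the typical 4x expansion
```

GELU passes large positive inputs almost unchanged, suppresses large negative inputs towards zero, and is smooth in between. The parameter count grows quadratically with `d_model`.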

### Residual Connections and Pre/Post Layer Normalization

The `TransformerBlock` supports both architectural variants through the `pre_ln` boolean flag. **Pre-Layer Normalization** (default in modern GPT models) applies normalization before the attention and MLP sub-layers, improving gradient flow in deep stacks. **Post-Layer Normalization** (original Transformer design) normalizes after the residual addition.

```python
class TransformerBlock(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.pre_ln = cfg.pre_ln
        self.ln1 = LayerNorm(cfg.d_model)
        self.attn = MultiHeadAttention(cfg)
        self.ln2 = LayerNorm(cfg.d_model)
        self.mlp = FeedForward(cfg)

    def forward(self, x):
        if self.pre_ln:
            x = x + self.attn(self.ln1(x))   # attention inside residual
            x = x + self.mlp(self.ln2(x))    # MLP inside residual
        else:                                # post-LN variant
            x = self.ln1(x + self.attn(x))
            x = self.ln2(x + self.mlp(x))
        return x
```

The `demo()` function in the same file constructs two six-layer stacks (pre-LN and post-LN) and prints the gradient norm at the embedding layer for each, giving an empirical view of the training-stability advantage of pre-LN.
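
The structural difference between the two variants can also be seen on a single vector. The following pure-Python sketch is an illustration under simplifying assumptions, not a reproduction of `demo()`: the `sublayer` function is a hypothetical stand-in for attention or the MLP, and the learnable scale and shift of the normalization are left out.

```python
import math


def layer_norm(x, eps=1e-5):
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def sublayer(x):
    # Hypothetical stand-in for attention or the MLP: halve every feature.
    return [0.5 * v for v in x]


def pre_ln_step(x):
    # x + f(LN(x)): the input travels along the skip path untouched.
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]


def post_ln_step(x):
    # LN(x + f(x)): the skip path itself is renormalized.
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])


x = [10.0, 20.0, 30.0, 40.0]
pre, post = pre_ln_step(x), post_ln_step(x)
print("pre-LN :", [round(v, 3) for v in pre])
print("post-LN:", [round(v, 3) for v in post])
```

In the pre-LN output the original values of `x` are still present, shifted only by a small bounded update, whereas the post-LN output has been rescaled to zero mean and unit variance. That untouched identity path is the usual explanation for why gradients reach early layers more easily in pre-LN stacks.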

## Assembling the Full GPT-Style Decoder

The complete decoder implementation lives in [`phases/19-capstone-projects/35-gpt-model-assembly/code/main.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/phases/19-capstone-projects/35-gpt-model-assembly/code/main.py). The `GPTModel` class stacks `cfg.num_layers` `TransformerBlock` instances (twelve in the setup described here) and adds the embedding layers, final normalization, and language modeling head required for text generation.

### Embedding Layers and Weight Tying

The model initializes **token embeddings** (`tok_embed`) and **positional embeddings** (`pos_embed`), summing them before applying dropout. Following modern best practices, the repository implements **weight tying** (lines 36-38), where the output projection `lm_head` shares parameters with the token embedding matrix:

```python
self.tok_embed = nn.Embedding(cfg.vocab_size, cfg.d_model)
self.pos_embed = nn.Embedding(cfg.context_length, cfg.d_model)
self.embed_dropout = nn.Dropout(cfg.dropout)

# lm_head tied to token embedding for weight sharing
self.lm_head = nn.Linear(cfg.d_model, cfg.vocab_size, bias=False)
if cfg.weight_tying:
    self.lm_head.weight = self.tok_embed.weight
```
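
The saving from weight tying is simple arithmetic. The sketch below counts only the embedding and output-head parameters for the tiny configuration used later in this article (`vocab_size=512`, `d_model=64`, `context_length=64`); the helper name `embedding_and_head_params` is hypothetical.

```python
def embedding_and_head_params(vocab_size, d_model, context_length, weight_tying):
    """Count embedding and lm_head parameters (illustrative helper)."""
    tok_embed = vocab_size * d_model
    pos_embed = context_length * d_model
    # lm_head has no bias; when tied it reuses tok_embed and adds nothing.
    lm_head = 0 if weight_tying else vocab_size * d_model
    return tok_embed + pos_embed + lm_head


tied = embedding_and_head_params(512, 64, 64, weight_tying=True)
untied = embedding_and_head_params(512, 64, 64, weight_tying=False)
print(tied, untied, untied - tied)  # 36864 69632 32768
```

The difference is exactly `vocab_size * d_model`, so the saving grows with the size of the vocabulary.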

### Stacking Transformer Blocks

The model creates a `ModuleList` of identical pre-LN blocks and applies **residual scaling** via `_scale_residual_projections` (lines 53-58). This scales each block's projection matrices by `1/√(2·L)` where L is the number of layers, preventing activation explosion in deep networks.

```python
# cfg.num_layers identical pre-LN blocks, followed by a final LayerNorm
self.blocks = nn.ModuleList([TransformerBlock(cfg) for _ in range(cfg.num_layers)])
self.final_ln = LayerNorm(cfg.d_model)
```
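
The `1/√(2·L)` factor can be motivated with an idealized variance calculation. The sketch below assumes that each of the `2·L` residual branches (one attention and one MLP per block) adds an independent contribution of equal variance. That assumption is a simplification made here for illustration, and the helper names are hypothetical rather than taken from the repository.

```python
import math


def residual_scale(num_layers):
    """Factor 1/sqrt(2*L), where 2*L is the number of residual additions."""
    return 1.0 / math.sqrt(2 * num_layers)


def residual_stream_variance(num_layers, branch_var=1.0, scaled=True):
    """Variance after summing 2*L independent branch outputs (idealized)."""
    s = residual_scale(num_layers) if scaled else 1.0
    # Scaling a branch by s multiplies its variance by s**2.
    return 2 * num_layers * (s ** 2) * branch_var


L = 12
print(round(residual_scale(L), 4))
print(residual_stream_variance(L, scaled=False))
print(round(residual_stream_variance(L, scaled=True), 6))
```

Under these assumptions the unscaled sum has a variance equal to the number of residual additions, while the scaled sum stays at the variance of a single branch regardless of depth.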

Dropout is applied to the initial embeddings and after each sub-layer (lines 11-12, 78-79), with a causal mask enforced in every attention layer to maintain autoregressive properties.

### Generation Pipeline

The `generate()` function (lines 92-124) implements autoregressive text generation by repeatedly feeding the most recent context window (maximum `cfg.context_length`) into the model. It applies temperature scaling to logits, optional top-k filtering to restrict the vocabulary, and multinomial sampling to produce the next token.

```python
import torch
import torch.nn.functional as F


def generate(model, prompt, max_new_tokens, temperature=1.0, top_k=None):
    model.eval()  # disable dropout while sampling
    for _ in range(max_new_tokens):
        # Crop the context to the maximum length the model supports
        context = prompt[:, -model.cfg.context_length:]
        logits, _ = model(context)
        # Keep only the last position and apply temperature scaling
        logits = logits[:, -1, :] / temperature

        if top_k is not None:
            # Mask every logit below the k-th largest one
            v, _ = torch.topk(logits, top_k)
            logits[logits < v[:, [-1]]] = float("-inf")

        probs = F.softmax(logits, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)
        prompt = torch.cat([prompt, next_token], dim=1)
    return prompt
```
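
A single sampling step can be traced by hand on a small vocabulary. The following pure-Python sketch mirrors the temperature, top-k, and sampling steps above without PyTorch. The function name `sample_next_token`, the five logits, and the use of `random.Random` for reproducibility are assumptions made for this illustration.

```python
import math
import random


def sample_next_token(logits, temperature=1.0, top_k=None, rng=random):
    """One sampling step over a list of logits (illustrative helper)."""
    scaled = [l / temperature for l in logits]
    if top_k is not None:
        # Value of the k-th largest logit; everything below it is masked.
        kth = sorted(scaled, reverse=True)[min(top_k, len(scaled)) - 1]
        scaled = [l if l >= kth else float("-inf") for l in scaled]
    m = max(scaled)
    exps = [math.exp(l - m) for l in scaled]  # masked entries become 0.0
    total = sum(exps)
    probs = [e / total for e in exps]
    token = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return token, probs


logits = [2.0, 1.0, 0.5, -1.0, -3.0]  # hypothetical 5-token vocabulary
rng = random.Random(0)
token, probs = sample_next_token(logits, temperature=0.8, top_k=2, rng=rng)
print("probabilities:", [round(p, 4) for p in probs])
print("sampled token:", token)
```

With `top_k=2` only the two highest-scoring tokens keep a non-zero probability. Lowering the temperature concentrates more of that probability on the top token, and raising it flattens the distribution.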

## Practical Code Examples

Run these snippets from the repository root to experiment with the first-principles implementation:

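```python
# -------------------------------------------------
# Example 1: Run the pre-LN vs. post-LN demo
# -------------------------------------------------
import importlib.util
import sys

# The directory names contain hyphens and start with digits, so a plain
# `import` statement cannot reach them; load the file by path instead.
path = "phases/19-capstone-projects/34-transformer-block/code/main.py"
spec = importlib.util.spec_from_file_location("transformer_block", path)
block = importlib.util.module_from_spec(spec)
sys.modules[spec.name] = block
spec.loader.exec_module(block)

block.demo()     # prints shapes, gradient norms and the ratio
```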

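```python
# -------------------------------------------------
# Example 2: Build a tiny GPT model and generate text
# -------------------------------------------------
import importlib.util
import sys

import torch

# Load the lesson file by path; its directory names are not valid
# Python identifiers, so a plain `import` statement cannot reach it.
path = "phases/19-capstone-projects/35-gpt-model-assembly/code/main.py"
spec = importlib.util.spec_from_file_location("gpt_model_assembly", path)
gpt = importlib.util.module_from_spec(spec)
sys.modules[spec.name] = gpt
spec.loader.exec_module(gpt)

# tiny configuration for quick testing
cfg = gpt.GPTConfig(
    vocab_size=512,      # small vocab for demo
    context_length=64,
    d_model=64,
    num_heads=4,
    num_layers=2,
    dropout=0.0,
    weight_tying=True,
)

# seed first so that both the random initialization and the sampling repeat
torch.manual_seed(42)
model = gpt.GPTModel(cfg)

# a simple numeric prompt (token IDs)
prompt = torch.tensor([[1, 2, 3, 4, 5]], dtype=torch.long)

# generate 12 new tokens, temperature 0.8, top-k=20
generated = gpt.generate(
    model,
    prompt,
    max_new_tokens=12,
    temperature=0.8,
    top_k=20,
)

# the model is untrained, so the new token IDs are effectively random
print("Prompt :", prompt.tolist()[0])
print("Generated:", generated.tolist()[0])
```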

## Summary

- The **Transformer block** in [`phases/19-capstone-projects/34-transformer-block/code/main.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/phases/19-capstone-projects/34-transformer-block/code/main.py) combines manual LayerNorm, fused QKV multi-head attention, GELU-based MLP, and configurable pre/post-LN residuals.
- **Pre-Layer Normalization** improves gradient flow compared to post-LN, demonstrated through empirical gradient norm measurements in the `demo()` function.
- The **GPT model** in [`phases/19-capstone-projects/35-gpt-model-assembly/code/main.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/phases/19-capstone-projects/35-gpt-model-assembly/code/main.py) stacks `cfg.num_layers` blocks (twelve in the setup described here) with token/positional embeddings, weight tying, and residual scaling.
- **Weight tying** reduces parameters by sharing the embedding matrix with the output projection layer.
- **Autoregressive generation** is enforced through causal masking in attention and implemented via the `generate()` helper with temperature and top-k sampling.

## Frequently Asked Questions

### What is the difference between pre-LN and post-LN Transformers?

Pre-Layer Normalization applies normalization before the attention and MLP sub-layers (inside the residual path), while Post-Layer Normalization applies it after adding the residual. In the implementation in [`phases/19-capstone-projects/34-transformer-block/code/main.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/phases/19-capstone-projects/34-transformer-block/code/main.py), pre-LN improves gradient flow in deep networks: the `demo()` function reports higher gradient norms at the embedding layer for the pre-LN stack than for the post-LN stack.

### Why is weight tying used in the GPT implementation?

Weight tying reuses the token embedding matrix for the final language modeling head projection, implemented in [`phases/19-capstone-projects/35-gpt-model-assembly/code/main.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/phases/19-capstone-projects/35-gpt-model-assembly/code/main.py) at lines 36-38. This reduces the total parameter count by `vocab_size × d_model` (the size of one embedding matrix), acting as a regularization mechanism and improving training efficiency without sacrificing model capacity.

### How does the causal mask work in the attention mechanism?

The causal mask prevents the model from attending to future tokens during training, ensuring autoregressive generation remains valid. In [`phases/19-capstone-projects/34-transformer-block/code/main.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/phases/19-capstone-projects/34-transformer-block/code/main.py), the mask is pre-computed using `torch.triu` and stored as a buffer (lines 81-86). During the forward pass, this triangular mask sets attention scores for future positions to negative infinity before the softmax operation.

### What is residual scaling and why is it important?

Residual scaling multiplies the projection matrices in each Transformer block by `1/√(2·L)`, where L is the total number of layers; the factor 2 reflects the two residual additions per block (one for attention, one for the MLP). Implemented in the `_scale_residual_projections` method of `GPTModel`, this technique keeps activation magnitudes from growing as contributions accumulate across deep stacks of residual connections, maintaining stable forward passes and gradients during training.