# How to Implement DPO for LLM Alignment: A Complete Guide from the ai-engineering-from-scratch Repository

> Learn to implement Direct Preference Optimization (DPO) for LLM alignment with this comprehensive guide. Optimize language models directly using human preference pairs. No reward model needed.

- Repository: [Rohit Ghumare/ai-engineering-from-scratch](https://github.com/rohitg00/ai-engineering-from-scratch)
- Tags: how-to-guide
- Published: 2026-06-10

---

**Direct Preference Optimization (DPO) trains language models directly on human preference pairs without a reward model by optimizing the implicit reward margin between chosen and rejected completions.**

The `ai-engineering-from-scratch` repository contains a complete, educational implementation of DPO on a minimal transformer architecture. This self-contained approach avoids the main moving parts of a traditional RLHF pipeline, namely explicit reward model training and PPO policy updates, while still optimizing the policy toward preferred completions.

## Understanding DPO Core Concepts

The implementation in [`phases/19-capstone-projects/40-dpo-from-scratch/code/main.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/phases/19-capstone-projects/40-dpo-from-scratch/code/main.py) relies on five mathematical foundations derived in the lesson documentation at [`docs/en.md`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/docs/en.md).

**Bradley-Terry Preference Model** treats human preference as a probability: `sigmoid(r(x, y_w) - r(x, y_l))`, where `y_w` is the chosen completion and `y_l` is the rejected one. This formulation appears in lines 28-33 of the documentation.
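As a quick numeric illustration of the Bradley-Terry formula, the sketch below evaluates the preference probability for two pairs of rewards. The reward values and the function name `preference_probability` are made up for illustration and do not come from the repository.

```python
import math

def preference_probability(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry probability that the chosen completion is preferred."""
    return 1.0 / (1.0 + math.exp(-(reward_chosen - reward_rejected)))

# Hypothetical rewards, for illustration only.
p_equal = preference_probability(1.0, 1.0)   # equal rewards: no preference either way
p_better = preference_probability(2.0, 0.5)  # larger reward gap: probability moves toward 1
print(p_equal, p_better)
```

Only the reward difference matters: adding the same constant to both rewards leaves the probability unchanged.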

**Closed-Form Optimal Policy** establishes that the optimal policy `π*` is proportional to `π_ref · exp(r/β)`, where `β` controls how much the policy diverges from the reference. This derivation is shown in lines 40-44 of [`docs/en.md`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/docs/en.md).

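**Implicit Reward** eliminates the need for a separate reward model by defining `r(x,y) = β·(log πθ(y|x) - log π_ref(y|x))`, up to a prompt-dependent term that cancels whenever two completions for the same prompt are compared. The repository implements this calculation in the `dpo_loss()` function starting at line 36 of [`main.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/phases/19-capstone-projects/40-dpo-from-scratch/code/main.py).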
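The closed-form policy and the implicit reward can be checked on a toy problem. The sketch below assumes a single prompt with three candidate completions and made-up values for `pi_ref` and `rewards`; the helper `optimal_policy` is hypothetical and not part of the repository.

```python
import math

def optimal_policy(pi_ref, rewards, beta):
    """Return pi*(y) proportional to pi_ref(y) * exp(r(y) / beta)."""
    weights = [p * math.exp(r / beta) for p, r in zip(pi_ref, rewards)]
    z = sum(weights)  # normalizing constant for this prompt
    return [w / z for w in weights]

# Toy setup: three candidate completions for one prompt (made-up numbers).
pi_ref = [0.5, 0.3, 0.2]
rewards = [0.0, 0.1, 0.3]
beta = 0.2

pi_star = optimal_policy(pi_ref, rewards, beta)

# beta * (log pi* - log pi_ref) recovers the reward up to a constant shift,
# so reward differences between completions are preserved.
implicit = [beta * (math.log(p) - math.log(q)) for p, q in zip(pi_star, pi_ref)]
gap_true = rewards[2] - rewards[0]
gap_implicit = implicit[2] - implicit[0]
print(pi_star, gap_true, gap_implicit)
```

The highest-reward completion gains probability mass relative to the reference, and the two gaps agree to floating-point precision.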

**Reference Invariance** requires freezing the **supervised fine-tuned (SFT) model** (`π_ref`) throughout training. The code enforces this by setting `requires_grad=False` in `build_models()` (lines 64-76) and by using explicit `torch.no_grad()` contexts in the training loop.

**DPO Loss Function** optimizes the margin between chosen and rejected completions: `L_DPO = -log σ(β·[(log πθ_w - log π_ref_w) - (log πθ_l - log π_ref_l)])`. This is implemented in `dpo_loss()` lines 36-56.
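The five ideas above chain together in a short derivation. The sketch below uses `Z(x)` for the normalizing constant of the optimal policy, a symbol introduced here for illustration; it is the prompt-dependent term that drops out when two completions for the same prompt are compared.

```latex
\begin{aligned}
\pi^*(y \mid x) &= \frac{1}{Z(x)}\,\pi_{\mathrm{ref}}(y \mid x)\,\exp\!\big(r(x,y)/\beta\big) \\
\Rightarrow\quad r(x,y) &= \beta \log \frac{\pi^*(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x) \\
r(x,y_w) - r(x,y_l) &= \beta\left[\log \frac{\pi^*(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \log \frac{\pi^*(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right] \\
\mathcal{L}_{\mathrm{DPO}} &= -\log \sigma\!\left(\beta\left[\log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right]\right)
\end{aligned}
```

The last line substitutes the trainable policy `πθ` for `π*` and plugs the reward difference into the Bradley-Terry sigmoid, which gives the loss stated above.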

## Architecture Overview

The repository implements a complete DPO pipeline using minimal components:

- **InstructionTokenizer** (lines 40-50): Byte-level tokenizer with special `INST` and `RESP` tokens for formatting prompts.
- **TinyGPT** (lines 60-117): A causal decoder-only transformer with multi-head attention.
- **PreferenceDataset**: Wraps 12 hard-coded preference triples from `make_preferences()` (lines 24-88).
- **Log-Probability Computation**: `sequence_log_prob()` (lines 95-134) sums token-wise log-probabilities while masking the prompt portion.
- **Training Loop**: `train_dpo()` (lines 52-93) runs per-epoch updates keeping the reference frozen and logging reward margins via `evaluate_margins()` (line 30).
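To make the tokenizer component concrete, here is a minimal sketch of a byte-level tokenizer with instruction and response markers. It is not the repository's `InstructionTokenizer`: the class name, the token ids 256 and 257, and the `encode_pair` method are assumptions chosen for illustration.

```python
class ByteTokenizerSketch:
    """Hypothetical byte-level tokenizer; not the repository's InstructionTokenizer."""

    INST = 256  # assumed id for the instruction marker
    RESP = 257  # assumed id for the response marker
    vocab_size = 258  # 256 byte values plus the two special tokens

    def encode_pair(self, prompt: str, response: str):
        """Return (token_ids, prompt_length) for INST + prompt + RESP + response."""
        prompt_ids = [self.INST] + list(prompt.encode("utf-8")) + [self.RESP]
        response_ids = list(response.encode("utf-8"))
        return prompt_ids + response_ids, len(prompt_ids)

    def decode(self, ids):
        """Drop special tokens and decode the remaining bytes."""
        return bytes(i for i in ids if i < 256).decode("utf-8", errors="replace")

tok_sketch = ByteTokenizerSketch()
ids, prompt_len = tok_sketch.encode_pair("Say hi", "Hello!")
print(ids[:3], prompt_len, tok_sketch.decode(ids[prompt_len:]))
```

Returning the prompt length alongside the ids is one simple way to tell a log-probability routine which positions belong to the response.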

## Implementation Walkthrough

### Tokenization and Data Preparation

The pipeline begins with preference triples containing `(prompt, chosen, rejected)`:

```python
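import sys

# The directory names contain hyphens and leading digits, so they cannot be
# imported as a dotted package path. Add the lesson's code directory to
# sys.path instead (run from the repository root).
sys.path.insert(0, "phases/19-capstone-projects/40-dpo-from-scratch/code")

from main import InstructionTokenizer, make_preferences, sequence_log_prob

# Initialize tokenizer with special instruction tokens
tok = InstructionTokenizer()

# Load 12 preference examples
triples = make_preferences()  # Returns list of (prompt, chosen, rejected)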
```

The `sequence_log_prob()` function computes the log-probability of a completion under a given model. It masks the prompt tokens so that only response tokens contribute to the sum, and therefore to the gradient.
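The masking idea can be sketched in plain Python to keep the arithmetic visible. This is not the repository's `sequence_log_prob()`: the function name, the list-of-lists logits layout, and the `prompt_len` argument are assumptions, and a real implementation would operate on batched tensors.

```python
import math

def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - log_z for v in row]

def masked_sequence_log_prob(logits, token_ids, prompt_len):
    """Sum log p(token_t | tokens before t) over response positions only.

    logits[t] is the prediction for token_ids[t + 1] (causal LM convention),
    so the response token at index t is scored by logits[t - 1].
    Assumes prompt_len >= 1.
    """
    total = 0.0
    for t in range(prompt_len, len(token_ids)):
        total += log_softmax(logits[t - 1])[token_ids[t]]
    return total

# Toy example: vocabulary of 3 tokens, 2 prompt tokens, 2 response tokens.
token_ids = [0, 1, 2, 0]
uniform_logits = [[0.0, 0.0, 0.0] for _ in token_ids]
logp = masked_sequence_log_prob(uniform_logits, token_ids, prompt_len=2)
print(logp)  # two response tokens, each with probability 1/3
```

Changing the logits that predict prompt tokens leaves the result untouched, which is the behavior the mask is meant to guarantee.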

### The DPO Loss Function

The `dpo_loss()` function at line 36 implements the mathematical objective:

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_w_pol, logp_l_pol, logp_w_ref, logp_l_ref, beta=0.2):
    """
    logp_w_pol: log πθ(y_w|x) - policy log-prob for chosen
    logp_l_pol: log πθ(y_l|x) - policy log-prob for rejected  
    logp_w_ref: log π_ref(y_w|x) - reference log-prob for chosen
    logp_l_ref: log π_ref(y_l|x) - reference log-prob for rejected
    """
    # Log-ratios of policy to reference; beta times each is the implicit reward
    pi_w = logp_w_pol - logp_w_ref
    pi_l = logp_l_pol - logp_l_ref

    # Bradley-Terry objective on the beta-scaled margin
    logits = beta * (pi_w - pi_l)
    loss = -F.logsigmoid(logits).mean()

    # Return loss and the unscaled margin (no beta factor) for monitoring
    margin = (pi_w - pi_l).mean().item()
    return loss, margin

```
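A worked example with concrete numbers helps confirm the formula. The sketch below is a scalar, PyTorch-free restatement of the same loss for a single preference pair; the function name `dpo_loss_scalar` and all log-probability values are made up for illustration.

```python
import math

def dpo_loss_scalar(logp_w_pol, logp_l_pol, logp_w_ref, logp_l_ref, beta=0.2):
    """Scalar DPO loss for one preference pair, mirroring the formula above."""
    margin = (logp_w_pol - logp_w_ref) - (logp_l_pol - logp_l_ref)
    # -log(sigmoid(z)) is the same as log(1 + exp(-z))
    return math.log1p(math.exp(-beta * margin)), margin

# Policy identical to the reference: margin is 0 and the loss is log(2).
loss_init, margin_init = dpo_loss_scalar(-12.0, -15.0, -12.0, -15.0)

# Policy has raised the chosen log-prob by 2 and lowered the rejected one by 3.
loss_later, margin_later = dpo_loss_scalar(-10.0, -18.0, -12.0, -15.0)
print(loss_init, margin_init, loss_later, margin_later)
```

A loss near `log(2) ≈ 0.693` at the start of training is therefore expected whenever the policy is initialized from the reference.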

### Training Loop with Reference Model

The `train_dpo()` function orchestrates the optimization:

```python
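import sys

# Hyphens and leading digits rule out a dotted import path, so put the
# lesson's code directory on sys.path (run from the repository root).
sys.path.insert(0, "phases/19-capstone-projects/40-dpo-from-scratch/code")

from main import build_models, train_dpo, DPOConfig

# Configuration
cfg = DPOConfig(epochs=20, beta=0.2, lr=1e-3, warmup_epochs=5)

# Build frozen reference and trainable policy
ref_model, policy_model = build_models(cfg)

# Freeze reference parameters
for p in ref_model.parameters():
    p.requires_grad = False
ref_model.eval()

# Train DPO (tok and triples come from the data preparation step above)
report = train_dpo(policy_model, ref_model, tok, triples, cfg)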
```

The `build_models()` function instantiates two TinyGPT instances (lines 64-76), ensuring they share the same architecture but maintain separate parameters.

## Complete Training Example

The following script demonstrates the full workflow including warm-up SFT and DPO phases:

```python
import sys

# Hyphens and leading digits rule out a dotted import path, so put the
# lesson's code directory on sys.path (run from the repository root).
sys.path.insert(0, "phases/19-capstone-projects/40-dpo-from-scratch/code")

from main import (
    DPOConfig, build_models, make_preferences,
    warmup_pretrain, train_dpo, evaluate_margins,
    InstructionTokenizer,
)

def full_dpo_training():
    # Configuration
    cfg = DPOConfig(epochs=20, beta=0.2, lr=1e-3, warmup_epochs=5)
    tok = InstructionTokenizer()
    triples = make_preferences()

    # 1. Warm-up: Train reference model on chosen completions
    ref, _ = build_models(cfg)
    for p in ref.parameters():
        p.requires_grad = True
    ref.train()
    warmup_pretrain(ref, tok, triples, epochs=cfg.warmup_epochs, seed=cfg.seed)
    
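    # 2. Freeze reference and initialize the policy from the warmed-up weights
    for p in ref.parameters():
        p.requires_grad = False
    ref.eval()
    _, policy = build_models(cfg)  # the policy is the second return value
    # DPO starts the policy at π_ref, so the initial log-ratios are zero
    policy.load_state_dict(ref.state_dict())
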
    # 3. Run DPO training
    report = train_dpo(policy, ref, tok, triples, cfg)

    # 4. Validate alignment
    final_margin = evaluate_margins(policy, ref, tok, triples)
    print(f"Final chosen-rejected margin: {final_margin:.4f}")
    
    return policy, ref

if __name__ == "__main__":
    policy, reference = full_dpo_training()

```

This pattern—initializing the reference via SFT warm-up, freezing it, then optimizing the policy against it—is the standard DPO workflow as implemented in the repository's `run_demo()` function (lines 101-161).
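The freeze-then-optimize part of this workflow can be reproduced without PyTorch on a toy policy. The sketch below assumes a categorical policy over three candidate completions with made-up reference logits; `toy_dpo_train` is a hypothetical helper, not the repository's `train_dpo()`.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def toy_dpo_train(ref_logits, chosen, rejected, beta=0.2, lr=1.0, steps=50):
    """Gradient descent on the DPO loss for a categorical toy policy.

    log pi(y) = logits[y] - logsumexp(logits), and the logsumexp term cancels
    in the chosen-minus-rejected margin, so only two logits matter.
    """
    policy_logits = list(ref_logits)  # start the policy at the reference
    margins = []
    for _ in range(steps):
        margin = ((policy_logits[chosen] - policy_logits[rejected])
                  - (ref_logits[chosen] - ref_logits[rejected]))
        margins.append(margin)
        # d(loss)/d(margin) = -beta * sigmoid(-beta * margin)
        grad = -beta * sigmoid(-beta * margin)
        policy_logits[chosen] -= lr * grad    # margin rises with the chosen logit
        policy_logits[rejected] += lr * grad  # and falls with the rejected logit
    return policy_logits, margins

ref_logits = [0.0, 0.5, -0.2]  # frozen reference (made-up numbers)
policy_logits, margins = toy_dpo_train(ref_logits, chosen=0, rejected=1)
print(margins[0], margins[-1])
```

In this one-pair toy the margin grows at every step while the reference logits stay untouched. That steady growth is a property of the toy setup and should not be read as a guarantee for mini-batch training on a real model.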

## Testing and Validation

The repository includes unit tests in [`phases/19-capstone-projects/40-dpo-from-scratch/code/tests/test_main.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/phases/19-capstone-projects/40-dpo-from-scratch/code/tests/test_main.py) that verify:

- Loss computation matches the mathematical definition
- Gradients flow only to the policy parameters, not the reference
- Reward margins increase monotonically during training

Run these tests to ensure your implementation correctly aligns the **policy model** toward preferred completions while maintaining divergence constraints enforced by the **beta** parameter.
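One check in the spirit of the first test item is to compare the analytic gradient of the loss with a finite-difference estimate. The sketch below is independent of the repository's test suite; `pairwise_dpo_loss`, `numeric_grad`, and the log-probability values are assumptions made for illustration.

```python
import math

def pairwise_dpo_loss(lw_pol, ll_pol, lw_ref, ll_ref, beta):
    """DPO loss for one pair, written as log(1 + exp(-beta * margin))."""
    margin = (lw_pol - lw_ref) - (ll_pol - ll_ref)
    return math.log1p(math.exp(-beta * margin))

def numeric_grad(f, x, eps=1e-6):
    """Central finite difference of a scalar function."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

beta = 0.2
lw_pol, ll_pol, lw_ref, ll_ref = -10.0, -14.0, -11.0, -13.0  # made-up values
margin = (lw_pol - lw_ref) - (ll_pol - ll_ref)
weight = 1.0 / (1.0 + math.exp(beta * margin))  # sigmoid(-beta * margin)

# Analytic gradients with respect to the two policy log-probabilities.
grad_chosen = -beta * weight
grad_rejected = beta * weight

num_chosen = numeric_grad(
    lambda v: pairwise_dpo_loss(v, ll_pol, lw_ref, ll_ref, beta), lw_pol)
num_rejected = numeric_grad(
    lambda v: pairwise_dpo_loss(lw_pol, v, lw_ref, ll_ref, beta), ll_pol)
print(grad_chosen, num_chosen, grad_rejected, num_rejected)
```

The signs confirm the intended behavior: gradient descent raises the chosen log-probability and lowers the rejected one, with a step size that shrinks as the margin grows.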

## Summary

- **Direct Preference Optimization** eliminates the need for explicit reward models and PPO training by deriving a closed-form optimal policy from the Bradley-Terry preference model.
- The implementation uses a **frozen reference model** (`π_ref`) and a **trainable policy** (`πθ`), computing implicit rewards as `β·(log πθ - log π_ref)`.
- Key functions include `dpo_loss()` (lines 36-56), `sequence_log_prob()` (lines 95-134), and `train_dpo()` (lines 52-93) in [`main.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/phases/19-capstone-projects/40-dpo-from-scratch/code/main.py).
- The **beta parameter** (typically 0.1-0.5) controls the trade-off between alignment strength and KL divergence from the reference.
- The repository provides a complete, runnable example using TinyGPT that can be extended to larger Hugging Face models by replacing the model classes while keeping the DPO loss logic identical.

## Frequently Asked Questions

### What is the difference between DPO and RLHF?

**DPO trains directly on preference pairs without a reward model or reinforcement learning.** Traditional RLHF first trains a reward model on human preferences, then uses PPO to optimize the policy against that reward. DPO removes both steps by deriving the optimal policy in closed form, which reduces implementation complexity and training instability. In the `ai-engineering-from-scratch` implementation, DPO requires only preference triples and a frozen reference model, whereas RLHF requires separate reward model training and value function estimation.

### Why is the reference model frozen during DPO training?

The reference model represents the **supervised fine-tuned (SFT) baseline** that defines the implicit reward function. As derived in [`docs/en.md`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/docs/en.md) lines 46-55, the reward is defined relative to this reference: `r(x,y) = β·(log πθ(y|x) - log π_ref(y|x))`. If the reference were updated, the reward target would shift during training, causing instability and preventing convergence. The code enforces this via `torch.no_grad()` contexts and `requires_grad=False` in `build_models()`.

### How does the beta parameter affect DPO training?

**Beta controls the temperature of the optimal policy and the strength of alignment.** A higher beta (e.g., 0.5) increases the penalty for deviating from the reference model, resulting in conservative updates that stay close to the SFT policy. A lower beta (e.g., 0.1) allows aggressive optimization toward preferred completions but risks overfitting or mode collapse. The repository uses `beta=0.2` as a balanced default, configurable via `DPOConfig`.
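The effect of beta on divergence from the reference can be seen on a toy distribution. The sketch below computes the KL divergence between the closed-form optimal policy and the reference for three beta values; the distribution, the rewards, and the helper names are made up for illustration.

```python
import math

def tilted_policy(pi_ref, rewards, beta):
    """Closed-form optimal policy: pi_ref(y) * exp(r(y) / beta), normalized."""
    weights = [p * math.exp(r / beta) for p, r in zip(pi_ref, rewards)]
    z = sum(weights)
    return [w / z for w in weights]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions on the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

pi_ref = [0.5, 0.3, 0.2]   # made-up reference distribution
rewards = [0.0, 0.1, 0.3]  # made-up rewards

kl_by_beta = {
    beta: kl_divergence(tilted_policy(pi_ref, rewards, beta), pi_ref)
    for beta in (0.1, 0.2, 0.5)
}
for beta, kl in kl_by_beta.items():
    print(f"beta={beta}: KL(pi* || pi_ref) = {kl:.4f}")
```

The divergence shrinks as beta grows, which matches the description of higher beta as the more conservative setting.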

### Can this implementation be scaled to production LLMs?

The DPO logic carries over to production models, although the surrounding training code has to scale with them. While the repository demonstrates DPO on TinyGPT (64 hidden dimensions, 4 heads), the `dpo_loss()` function and `train_dpo()` loop are model-agnostic. To apply this to Llama, GPT, or other architectures, replace `TinyGPT` with a Hugging Face `AutoModelForCausalLM` while keeping the same pattern: freeze the reference, compute log-probabilities for chosen and rejected completions, and optimize the DPO objective. The mathematical implementation in [`main.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/phases/19-capstone-projects/40-dpo-from-scratch/code/main.py) lines 36-56 requires no changes for larger models; batching, padding, and the memory cost of holding two models in memory remain the responsibility of the training code around it.