How to Implement DPO for LLM Alignment: A Complete Guide from the ai-engineering-from-scratch Repository

Direct Preference Optimization (DPO) trains language models directly on human preference pairs without a reward model by optimizing the implicit reward margin between chosen and rejected completions.

The ai-engineering-from-scratch repository contains a complete, educational implementation of DPO on a minimal transformer architecture. This self-contained approach avoids the main sources of complexity in traditional RLHF pipelines (PPO policy updates and explicit reward model training) while still optimizing the policy toward preferred completions.

Understanding DPO Core Concepts

The implementation in phases/19-capstone-projects/40-dpo-from-scratch/code/main.py relies on five mathematical foundations derived in the lesson documentation at docs/en.md.

Bradley-Terry Preference Model treats human preference as a probability: sigmoid(r(x, y_w) - r(x, y_l)), where y_w is the chosen completion and y_l is the rejected one. This formulation appears in lines 28-33 of the documentation.
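As a quick numeric illustration of the Bradley-Terry formula, the sketch below turns two reward scores into a preference probability. The function name preference_probability and the reward values are invented for this example and do not come from the repository.

```python
import math

def preference_probability(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry probability that the chosen completion is preferred."""
    return 1.0 / (1.0 + math.exp(-(reward_chosen - reward_rejected)))

# Equal rewards give a coin flip; a larger reward gap pushes the probability toward 1.
p_equal = preference_probability(1.0, 1.0)
p_gap = preference_probability(2.0, 0.0)
print(p_equal, round(p_gap, 4))
```

Only the difference between the two rewards matters, which is why any term that depends on the prompt alone drops out of the preference probability.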

Closed-Form Optimal Policy establishes that the optimal policy π* is proportional to π_ref · exp(r/β), where β controls how much the policy diverges from the reference. This derivation is shown in lines 40-44 of docs/en.md.
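The effect of β in the closed-form policy is easy to see on a tiny discrete example. The following is a minimal sketch, assuming a made-up reference distribution over three completions and made-up rewards; optimal_policy is a hypothetical helper, not a repository function.

```python
import math

def optimal_policy(ref_probs, rewards, beta):
    """Return pi*(y) proportional to pi_ref(y) * exp(r(y) / beta), normalized over y."""
    weights = [p * math.exp(r / beta) for p, r in zip(ref_probs, rewards)]
    total = sum(weights)
    return [w / total for w in weights]

ref_probs = [0.5, 0.3, 0.2]   # hypothetical reference distribution over three completions
rewards = [0.0, 1.0, 0.0]     # hypothetical rewards: the second completion is preferred

tight = optimal_policy(ref_probs, rewards, beta=5.0)   # large beta: stays near the reference
loose = optimal_policy(ref_probs, rewards, beta=0.2)   # small beta: concentrates on high reward

print([round(p, 3) for p in tight])
print([round(p, 3) for p in loose])
```

With a large β the result barely moves away from π_ref, while a small β puts almost all probability mass on the highest-reward completion.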

Implicit Reward eliminates the need for a separate reward model by defining r(x,y) = β·(log πθ(y|x) - log π_ref(y|x)). The repository implements this calculation in the dpo_loss() function starting at line 36 of main.py.
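The step from the closed-form policy to the implicit reward can be written out as follows. Z(x) denotes the normalizing constant of the optimal policy, a symbol introduced here for the derivation; it depends only on the prompt, so it cancels when two completions of the same prompt are compared, which is why the definition above can omit it.

```latex
\pi^*(y \mid x) = \frac{1}{Z(x)}\,\pi_{\mathrm{ref}}(y \mid x)\,\exp\!\left(\frac{r(x,y)}{\beta}\right)
\quad\Longrightarrow\quad
r(x,y) = \beta \log \frac{\pi^*(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x)

r(x,y_w) - r(x,y_l)
= \beta \left[ \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
             - \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right]
```

The second line substitutes the trainable policy π_θ for π* and is exactly the quantity inside the sigmoid of the DPO loss.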

Reference Invariance requires freezing the supervised fine-tuned (SFT) model (π_ref) throughout training. The code enforces this by setting requires_grad=False in build_models() (lines 64-76) and by using explicit torch.no_grad() contexts in the training loop.
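The freezing pattern can be sketched in isolation. This example uses a plain nn.Linear as a hypothetical stand-in for the SFT model; it is not the repository's build_models() code, only an illustration of requires_grad=False combined with torch.no_grad().

```python
import copy
import torch
import torch.nn as nn

# Hypothetical stand-in for the SFT model; any nn.Module behaves the same way.
policy = nn.Linear(4, 3)
reference = copy.deepcopy(policy)   # same weights, separate parameters

# Freeze the reference: no gradients and no train-time behavior such as dropout.
for p in reference.parameters():
    p.requires_grad = False
reference.eval()

x = torch.randn(2, 4)
with torch.no_grad():               # the reference forward pass builds no autograd graph
    ref_out = reference(x)
pol_out = policy(x)

# A toy loss that depends on both models; only the policy receives gradients.
loss = (pol_out - ref_out).sum()
loss.backward()

policy_has_grads = all(p.grad is not None for p in policy.parameters())
reference_has_grads = any(p.grad is not None for p in reference.parameters())
print(policy_has_grads, reference_has_grads)
```

Running the reference under torch.no_grad() also saves memory, because no activations are stored for a backward pass that will never happen.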

DPO Loss Function optimizes the margin between chosen and rejected completions: L_DPO = -log σ(β·[(log πθ_w - log π_ref_w) - (log πθ_l - log π_ref_l)]). This is implemented in dpo_loss() lines 36-56.

Architecture Overview

The repository implements a complete DPO pipeline using minimal components:

  • InstructionTokenizer (lines 40-50): Byte-level tokenizer with special INST and RESP tokens for formatting prompts.
  • TinyGPT (lines 60-117): A causal decoder-only transformer with multi-head attention.
  • PreferenceDataset: Wraps 12 hard-coded preference triples from make_preferences() (lines 24-88).
  • Log-Probability Computation: sequence_log_prob() (lines 95-134) sums token-wise log-probabilities while masking the prompt portion.
  • Training Loop: train_dpo() (lines 52-93) runs per-epoch updates keeping the reference frozen and logging reward margins via evaluate_margins() (line 30).
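To make the first component concrete, here is a minimal sketch of a byte-level tokenizer with two marker tokens. The class name ByteInstructionTokenizer, the ids 256 and 257, and the encode/decode signatures are assumptions for illustration; the repository's InstructionTokenizer may be organized differently.

```python
class ByteInstructionTokenizer:
    """Hypothetical byte-level tokenizer: ids 0-255 are raw bytes, 256 and 257 are markers."""
    INST = 256
    RESP = 257
    vocab_size = 258

    def encode(self, prompt: str, response: str):
        prompt_ids = [self.INST] + list(prompt.encode("utf-8")) + [self.RESP]
        response_ids = list(response.encode("utf-8"))
        # Return the full sequence plus the prompt length, which a loss mask needs later.
        return prompt_ids + response_ids, len(prompt_ids)

    def decode(self, ids):
        # Marker tokens are skipped; only byte-valued ids are decoded.
        return bytes(i for i in ids if i < 256).decode("utf-8", errors="replace")

tok = ByteInstructionTokenizer()
ids, prompt_len = tok.encode("Hi", "Hello!")
print(ids, prompt_len, tok.decode(ids[prompt_len:]))
```

Returning the prompt length alongside the ids is one simple way to let the log-probability code know where the response begins.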

Implementation Walkthrough

Tokenization and Data Preparation

The pipeline begins with preference triples containing (prompt, chosen, rejected):

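import sys

# Run from the repository root. The directory names contain hyphens and leading
# digits, so they cannot be written as a dotted import; add the code directory
# to sys.path and import main directly instead.
sys.path.insert(0, "phases/19-capstone-projects/40-dpo-from-scratch/code")

from main import InstructionTokenizer, make_preferences, sequence_log_prob

# Initialize tokenizer with special instruction tokens
tok = InstructionTokenizer()

# Load 12 preference examples
triples = make_preferences()  # Returns list of (prompt, chosen, rejected)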

The sequence_log_prob() function computes the log-probability of a completion under a given model while masking the prompt tokens to ensure gradients flow only through the response portion.
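The masking idea can be sketched independently of the repository. The function below is a hypothetical masked_sequence_log_prob that takes precomputed logits rather than a model, so its signature differs from the repository's sequence_log_prob(); it assumes prompt_len is at least 1.

```python
import torch
import torch.nn.functional as F

def masked_sequence_log_prob(logits: torch.Tensor, ids: torch.Tensor, prompt_len: int) -> torch.Tensor:
    """Sum log p(token_t | tokens_<t) over response positions only.

    logits: (seq_len, vocab) from a causal model run on ids; ids: (seq_len,).
    """
    log_probs = F.log_softmax(logits[:-1], dim=-1)          # position t predicts token t+1
    targets = ids[1:]
    token_lp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    # Target index j corresponds to token j+1, so response tokens start at j = prompt_len - 1.
    mask = torch.arange(targets.numel()) >= (prompt_len - 1)
    return (token_lp * mask).sum()

# Uniform logits over a 4-token vocabulary: every token has log-prob log(1/4).
ids = torch.tensor([0, 1, 2, 3, 1])
logits = torch.zeros(5, 4)
lp = masked_sequence_log_prob(logits, ids, prompt_len=2)    # 3 response tokens
print(lp.item())
```

With a two-token prompt and three response tokens, the result is three times log(1/4); the prompt tokens contribute nothing to the sum or to the gradient.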

The DPO Loss Function

The dpo_loss() function at line 36 implements the mathematical objective:

import torch
import torch.nn.functional as F

def dpo_loss(logp_w_pol, logp_l_pol, logp_w_ref, logp_l_ref, beta=0.2):
    """
    logp_w_pol: log πθ(y_w|x) - policy log-prob for chosen
    logp_l_pol: log πθ(y_l|x) - policy log-prob for rejected
    logp_w_ref: log π_ref(y_w|x) - reference log-prob for chosen
    logp_l_ref: log π_ref(y_l|x) - reference log-prob for rejected
    """
    # Policy-to-reference log-ratios (the implicit rewards divided by beta)
    pi_w = logp_w_pol - logp_w_ref
    pi_l = logp_l_pol - logp_l_ref

    # Bradley-Terry objective
    logits = beta * (pi_w - pi_l)
    loss = -F.logsigmoid(logits).mean()

    # Return the loss and the unscaled log-ratio margin for monitoring
    margin = (pi_w - pi_l).mean().item()
    return loss, margin
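A useful sanity check on this function: when the policy equals the reference, every log-ratio is zero and the loss is log 2 regardless of β. The sketch below repeats the function so it runs on its own and uses made-up log-probabilities for a batch of two pairs.

```python
import math
import torch
import torch.nn.functional as F

def dpo_loss(logp_w_pol, logp_l_pol, logp_w_ref, logp_l_ref, beta=0.2):
    # Same computation as the function above, repeated so this snippet runs alone.
    pi_w = logp_w_pol - logp_w_ref
    pi_l = logp_l_pol - logp_l_ref
    loss = -F.logsigmoid(beta * (pi_w - pi_l)).mean()
    return loss, (pi_w - pi_l).mean().item()

# Made-up sequence log-probs for a batch of two preference pairs.
ref_w = torch.tensor([-12.0, -9.0])
ref_l = torch.tensor([-11.0, -10.0])

# At initialization the policy equals the reference, so every log-ratio is zero.
loss_init, margin_init = dpo_loss(ref_w, ref_l, ref_w, ref_l)

# Raising the chosen log-prob by 1 and lowering the rejected by 1 gives a margin of 2.
loss_better, margin_better = dpo_loss(ref_w + 1.0, ref_l - 1.0, ref_w, ref_l)

print(loss_init.item(), math.log(2))
print(loss_better.item(), margin_better)
```

A starting loss far from log 2 usually means the policy was not initialized from the reference, or that the log-probabilities of the two models are being computed over different token spans.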

Training Loop with Reference Model

The train_dpo() function orchestrates the optimization:

import sys

# Run from the repository root; the hyphenated, digit-prefixed directories
# cannot be imported with dotted module syntax.
sys.path.insert(0, "phases/19-capstone-projects/40-dpo-from-scratch/code")

from main import build_models, train_dpo, DPOConfig

# Configuration
cfg = DPOConfig(epochs=20, beta=0.2, lr=1e-3, warmup_epochs=5)

# Build frozen reference and trainable policy
ref_model, policy_model = build_models(cfg)

# Freeze reference parameters
for p in ref_model.parameters():
    p.requires_grad = False
ref_model.eval()

# Train DPO (tok and triples come from the tokenization snippet above)
report = train_dpo(policy_model, ref_model, tok, triples, cfg)

The build_models() function instantiates two TinyGPT instances (lines 64-76), ensuring they share the same architecture but maintain separate parameters.

Complete Training Example

The following script demonstrates the full workflow including warm-up SFT and DPO phases:

import sys

# Run from the repository root; the hyphenated, digit-prefixed directories
# cannot be imported with dotted module syntax.
sys.path.insert(0, "phases/19-capstone-projects/40-dpo-from-scratch/code")

from main import (
    DPOConfig, build_models, make_preferences,
    warmup_pretrain, train_dpo, evaluate_margins,
    InstructionTokenizer,
)

def full_dpo_training():
    # Configuration
    cfg = DPOConfig(epochs=20, beta=0.2, lr=1e-3, warmup_epochs=5)
    tok = InstructionTokenizer()
    triples = make_preferences()

    # 1. Warm-up: train the reference model on chosen completions
    ref, policy = build_models(cfg)
    for p in ref.parameters():
        p.requires_grad = True
    ref.train()
    warmup_pretrain(ref, tok, triples, epochs=cfg.warmup_epochs, seed=cfg.seed)

    # 2. Start the policy from the warmed-up reference weights, then freeze the
    #    reference. DPO assumes the policy begins at pi_ref; a freshly
    #    initialized policy would start with arbitrary log-ratios.
    policy.load_state_dict(ref.state_dict())
    for p in ref.parameters():
        p.requires_grad = False
    ref.eval()

    # 3. Run DPO training
    report = train_dpo(policy, ref, tok, triples, cfg)

    # 4. Validate alignment
    final_margin = evaluate_margins(policy, ref, tok, triples)
    print(f"Final chosen-rejected margin: {final_margin:.4f}")

    return policy, ref

if __name__ == "__main__":
    policy, reference = full_dpo_training()

This pattern of initializing the reference via SFT warm-up, freezing it, then optimizing the policy against it is the standard DPO workflow, as implemented in the repository's run_demo() function (lines 101-161).
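The same workflow can be reproduced on a toy problem that runs in well under a second. In this sketch the "models" are single logit vectors over four candidate completions for one prompt; the setup, the preference (completion 1 over completion 3), and the hyperparameters are all invented for illustration and are unrelated to the repository's TinyGPT.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Toy "models": one logit vector over 4 candidate completions for a single prompt.
ref_logits = torch.randn(4)                           # frozen reference (stands in for the SFT model)
policy_logits = ref_logits.clone().requires_grad_()   # policy starts from the reference weights
chosen, rejected = 1, 3                               # hypothetical preference pair
beta = 0.2
opt = torch.optim.SGD([policy_logits], lr=1.0)

margins = []
for step in range(50):
    logp_pol = F.log_softmax(policy_logits, dim=-1)
    with torch.no_grad():                             # reference is never differentiated
        logp_ref = F.log_softmax(ref_logits, dim=-1)
    margin = (logp_pol[chosen] - logp_ref[chosen]) - (logp_pol[rejected] - logp_ref[rejected])
    loss = -F.logsigmoid(beta * margin)
    opt.zero_grad()
    loss.backward()
    opt.step()
    margins.append(margin.item())

print(f"margin at step 0: {margins[0]:.3f}, at step 49: {margins[-1]:.3f}")
```

The margin starts at zero because the policy and reference are identical, then grows as the policy shifts probability from the rejected completion to the chosen one.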

Testing and Validation

The repository includes unit tests in phases/19-capstone-projects/40-dpo-from-scratch/code/tests/test_main.py that verify:

  • Loss computation matches the mathematical definition
  • Gradients flow only to the policy parameters, not the reference
  • Reward margins increase monotonically during training

Run these tests to ensure your implementation correctly aligns the policy model toward preferred completions while maintaining divergence constraints enforced by the beta parameter.

Summary

  • Direct Preference Optimization eliminates the need for explicit reward models and PPO training by deriving a closed-form optimal policy from the Bradley-Terry preference model.
  • The implementation uses a frozen reference model (π_ref) and a trainable policy (πθ), computing implicit rewards as β·(log πθ - log π_ref).
  • Key functions include dpo_loss() (lines 36-56), sequence_log_prob() (lines 95-134), and train_dpo() (lines 52-93) in main.py.
  • The beta parameter (typically 0.1-0.5) controls the trade-off between alignment strength and KL divergence from the reference.
  • The repository provides a complete, runnable example using TinyGPT that can be extended to larger Hugging Face models by replacing the model classes while keeping the DPO loss logic identical.

Frequently Asked Questions

What is the difference between DPO and RLHF?

DPO trains directly on preference pairs without a reward model or reinforcement learning. Traditional RLHF first trains a reward model on human preferences, then uses PPO to optimize the policy against that reward. DPO eliminates both steps by deriving the optimal policy in closed form, which reduces implementation complexity and training instability. According to the ai-engineering-from-scratch source code, DPO requires only preference triples and a frozen reference model, whereas RLHF requires separate reward model training and value function estimation.

Why is the reference model frozen during DPO training?

The reference model represents the supervised fine-tuned (SFT) baseline that defines the implicit reward function. As derived in docs/en.md lines 46-55, the reward is defined relative to this reference: r(x,y) = β·(log πθ(y|x) - log π_ref(y|x)). If the reference were updated, the reward target would shift during training, causing instability and preventing convergence. The code enforces this via torch.no_grad() contexts and requires_grad=False in build_models().

How does the beta parameter affect DPO training?

Beta controls the temperature of the optimal policy and the strength of alignment. A higher beta (e.g., 0.5) increases the penalty for deviating from the reference model, resulting in conservative updates that stay close to the SFT policy. A lower beta (e.g., 0.1) allows aggressive optimization toward preferred completions but risks overfitting or mode collapse. The repository uses beta=0.2 as a balanced default, configurable via DPOConfig.
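One way to quantify this trade-off is to ask how large the policy-to-reference log-ratio margin must be before the model assigns a given preference probability. The helper margin_for_confidence below is a hypothetical illustration, not part of the repository.

```python
import math

def margin_for_confidence(target_prob: float, beta: float) -> float:
    """Log-ratio margin m at which sigmoid(beta * m) equals target_prob."""
    return math.log(target_prob / (1.0 - target_prob)) / beta

for beta in (0.1, 0.2, 0.5):
    m = margin_for_confidence(0.9, beta)
    print(f"beta={beta}: margin needed for 90% preference = {m:.2f} nats")
```

The required margin scales as 1/β, so a small β lets the policy drift much further from the reference before the loss saturates, which matches the more aggressive behavior described above.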

Can this implementation be scaled to production LLMs?

The DPO logic itself carries over to larger models. While the repository demonstrates DPO on TinyGPT (64 hidden dimensions, 4 heads), the dpo_loss() function and train_dpo() loop are model-agnostic. To apply this to Llama, GPT, or other architectures, replace TinyGPT with a Hugging Face AutoModelForCausalLM while maintaining the same pattern: freeze the reference, compute log-probabilities for chosen and rejected completions, and optimize the DPO objective. The mathematical implementation in main.py lines 36-56 requires no changes for larger models.
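When moving to batched inputs from a larger model, the main new detail is that padding tokens must be masked along with the prompt. The sketch below assumes a hypothetical helper batch_completion_log_probs that operates on a logits tensor, such as the .logits field of a causal LM forward pass; dummy uniform logits stand in for real model output so the snippet runs without downloading a model.

```python
import torch
import torch.nn.functional as F

def batch_completion_log_probs(logits, input_ids, completion_mask):
    """Per-sequence sum of completion-token log-probs.

    logits: (batch, seq, vocab). input_ids: (batch, seq).
    completion_mask: (batch, seq), 1 on completion tokens, 0 on prompt and padding.
    """
    log_probs = F.log_softmax(logits[:, :-1], dim=-1)      # position t predicts token t+1
    targets = input_ids[:, 1:]
    token_lp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return (token_lp * completion_mask[:, 1:]).sum(dim=-1)

# Dummy batch standing in for real model output: uniform logits over 8 tokens.
input_ids = torch.tensor([[5, 1, 2, 3], [5, 4, 0, 0]])      # second row is right-padded
completion_mask = torch.tensor([[0, 0, 1, 1], [0, 1, 0, 0]])
logits = torch.zeros(2, 4, 8)
seq_lp = batch_completion_log_probs(logits, input_ids, completion_mask)
print(seq_lp)
```

The four values this produces for the chosen and rejected completions under the policy and the reference can be passed to dpo_loss() unchanged.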
