
DPO vs RLHF: How Direct Preference Optimization Compares to Reinforcement Learning from Human Feedback

May 21, 2026 rohitg00/ai-engineering-from-scratch

Direct Preference Optimization (DPO) aligns language models in a single supervised learning loop that needs only two models. RLHF instead runs a three-stage pipeline with a separate reward model and PPO optimization. DPO is therefore simpler and more stable, while RLHF is better suited to capturing complex, nuanced preferences.

Both Direct Preference Optimization (DPO) and Reinforcement Learning from Human Feedback (RLHF) are techniques for aligning large language models (LLMs) with human preferences. According to the rohitg00/ai-engineering-from-scratch repository, these methods differ fundamentally in their architectural complexity, training workflows, and computational requirements.

Architectural Comparison: Single Loop vs. Three-Stage Pipeline

Core Philosophy and Model Requirements

DPO eliminates the need for an explicit reward model by optimizing the policy directly against pairwise preference data. In phases/10-llms-from-scratch/08-dpo/code/main.py, the dpo_loss function (lines 94-106) implements this as a sigmoid loss based on log-probability ratios between the policy and a reference model.
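
The loss described above can be sketched on plain floats. This is a minimal illustration of the standard DPO objective, not a copy of the repository's dpo_loss; the function name dpo_loss_sketch and its argument names are assumptions made for this example, and each argument stands for the summed log-probability of a whole response.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss_sketch(
    policy_logp_preferred: float,
    policy_logp_rejected: float,
    ref_logp_preferred: float,
    ref_logp_rejected: float,
    beta: float = 0.1,
) -> tuple[float, float]:
    """Return (loss, margin) for one preference pair (hypothetical helper)."""
    # Log-probability ratio of policy vs. reference for each response.
    preferred_ratio = policy_logp_preferred - ref_logp_preferred
    rejected_ratio = policy_logp_rejected - ref_logp_rejected
    # The margin grows when the policy favours the preferred response
    # more strongly than the reference does.
    margin = beta * (preferred_ratio - rejected_ratio)
    # Sigmoid (logistic) loss on the margin.
    return -log_sigmoid(margin), margin


# When policy and reference agree, the margin is 0 and the loss is log(2).
loss_at_start, margin_at_start = dpo_loss_sketch(-5.0, -6.0, -5.0, -6.0)
print(f"loss={loss_at_start:.4f}, margin={margin_at_start:.4f}")
```

Because the policy starts as a copy of the reference, training begins at a loss of log(2) and decreases as the margin becomes positive.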

RLHF, conversely, follows a three-stage pipeline implemented in phases/10-llms-from-scratch/07-rlhf/code/main.py (lines 28-34): (1) Supervised Fine-Tuning (SFT), (2) reward model training, and (3) PPO fine-tuning. This requires maintaining four to five models simultaneously: the SFT checkpoint, the reward model, the policy model, the reference model, and optionally a value model for PPO.

Training Stability and Convergence

DPO operates as supervised learning with gradients derived from log-probability differences, which the repository notes are "usually more stable than RL gradients." With only two models in memory versus four to five for RLHF, DPO significantly reduces memory footprint and training complexity.
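
To make the two-versus-five comparison concrete, the sketch below estimates weight memory only. The 7-billion-parameter size and the 2-byte (16-bit) weight format are hypothetical values chosen for illustration; gradients, optimizer state, and activations for the models being trained are ignored, so real footprints are larger.

```python
def weight_memory_gib(num_params: int, num_models: int, bytes_per_param: int = 2) -> float:
    """Rough lower bound: memory for model weights alone, in GiB."""
    return num_params * num_models * bytes_per_param / 2**30


HYPOTHETICAL_PARAMS = 7_000_000_000  # assumed model size, not from the repository

dpo_gib = weight_memory_gib(HYPOTHETICAL_PARAMS, num_models=2)   # policy + reference
rlhf_gib = weight_memory_gib(HYPOTHETICAL_PARAMS, num_models=5)  # upper end of "four to five"

print(f"DPO weights:  {dpo_gib:.1f} GiB")
print(f"RLHF weights: {rlhf_gib:.1f} GiB")
print(f"Ratio: {rlhf_gib / dpo_gib:.1f}x")
```

The ratio depends only on the model counts, so it holds at any model size as long as all models share the same architecture.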

However, RLHF offers superior capability for multi-objective alignment and capturing fine-grained preferences. The separate reward model enables online learning where the model can generate, rate, and retrain continuously, though this comes at the cost of high-variance RL gradients that require careful PPO clipping and KL-penalty mechanisms to maintain stability.
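
The two stabilising mechanisms mentioned above can be sketched for a single sample. This follows the widely used PPO clipped surrogate and a per-sample KL penalty on the reward; it is an illustration under those assumptions and may differ from what ppo_training does in the repository. The function names and the clip_eps=0.2 default are assumptions; kl_coeff=0.02 mirrors the value used in this article's RLHF example.

```python
import math


def kl_penalized_reward(
    reward: float, policy_logp: float, ref_logp: float, kl_coeff: float = 0.02
) -> float:
    """Subtract a penalty when the policy drifts away from the reference."""
    # (policy_logp - ref_logp) is a single-sample estimate of the KL term.
    return reward - kl_coeff * (policy_logp - ref_logp)


def ppo_clipped_objective(
    new_logp: float, old_logp: float, advantage: float, clip_eps: float = 0.2
) -> float:
    """Clipped surrogate objective for one sample (to be maximised)."""
    ratio = math.exp(new_logp - old_logp)
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    # Taking the minimum removes the incentive to push the ratio far
    # beyond the clip range in the direction the advantage rewards.
    return min(ratio * advantage, clipped_ratio * advantage)


# A large policy update with positive advantage is capped at (1 + clip_eps).
print(ppo_clipped_objective(new_logp=-1.0, old_logp=-1.5, advantage=1.0))
print(kl_penalized_reward(reward=1.0, policy_logp=-2.0, ref_logp=-3.0))
```

Both knobs need tuning per task, which is part of the extra operational cost of RLHF relative to DPO.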

Implementation Details and Code Examples

DPO Training Loop

The DPO implementation requires only two models in memory: the policy and the reference (typically the SFT checkpoint). The dpo_train function (lines 141-200) handles the single training loop:


# Load the DPO script by file path (run from the repository root). The directory
# names contain hyphens and leading digits, so they cannot be used in a normal
# `import` statement.
import importlib.util

spec = importlib.util.spec_from_file_location(
    "dpo_main", "phases/10-llms-from-scratch/08-dpo/code/main.py"
)
dpo = importlib.util.module_from_spec(spec)
spec.loader.exec_module(dpo)

# Initialise policy and reference models
config = dict(vocab_size=256, embed_dim=128, num_heads=4, num_layers=4, max_seq_len=128, ff_dim=512)
policy = dpo.MiniGPT(**config)
reference = dpo.MiniGPT(**config)

# Copy SFT weights into both models (see copy_model_weights in the file).
# `sft_model` and `PREFERENCE_DATA` are assumed to already exist in scope:
# a trained SFT checkpoint and a list of {prompt, preferred, rejected} dicts.
policy = dpo.copy_model_weights(sft_model, policy)
reference = dpo.copy_model_weights(sft_model, reference)

# Train with DPO
policy, losses, margins = dpo.dpo_train(
    policy,
    reference,
    PREFERENCE_DATA,
    num_epochs=5,
    lr=5e-6,
    beta=0.1,
)

# Evaluate preference accuracy before and after DPO
pre_acc = dpo.evaluate_preference_accuracy(sft_model, reference, PREFERENCE_DATA)
post_acc = dpo.evaluate_preference_accuracy(policy, reference, PREFERENCE_DATA)
print(f"Accuracy before DPO: {pre_acc:.1%}, after DPO: {post_acc:.1%}")
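
One plausible reading of "preference accuracy" is the fraction of pairs for which the implicit DPO reward, beta times the policy-to-reference log-probability ratio, is higher for the preferred response than for the rejected one. The sketch below works on precomputed sequence log-probabilities; the function name and dictionary keys are hypothetical, and the repository's evaluate_preference_accuracy may compute the metric differently.

```python
def preference_accuracy_sketch(pairs: list[dict], beta: float = 0.1) -> float:
    """Fraction of pairs where the preferred response gets the higher implicit reward."""
    if not pairs:
        return 0.0
    correct = 0
    for pair in pairs:
        # Implicit reward: beta * (log pi(y|x) - log pi_ref(y|x)).
        reward_preferred = beta * (pair["policy_preferred"] - pair["ref_preferred"])
        reward_rejected = beta * (pair["policy_rejected"] - pair["ref_rejected"])
        if reward_preferred > reward_rejected:
            correct += 1
    return correct / len(pairs)


# Hypothetical log-probabilities for two preference pairs.
example_pairs = [
    {"policy_preferred": -4.0, "ref_preferred": -5.0, "policy_rejected": -6.0, "ref_rejected": -5.0},
    {"policy_preferred": -6.0, "ref_preferred": -5.0, "policy_rejected": -4.0, "ref_rejected": -5.0},
]
print(preference_accuracy_sketch(example_pairs))
```

Under this definition a policy identical to the reference scores 0 on every pair, because all implicit rewards are tied at zero; the metric only becomes informative once training has moved the policy.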

RLHF Training Pipeline

The RLHF approach requires significantly more memory and computational steps. The RewardModel.forward method (lines 65-76) learns to predict human preferences, followed by PPO optimization via ppo_training (lines 220-267):


# Load the RLHF script by file path (run from the repository root). The directory
# names contain hyphens and leading digits, so they cannot be used in a normal
# `import` statement.
import importlib.util

spec = importlib.util.spec_from_file_location(
    "rlhf_main", "phases/10-llms-from-scratch/07-rlhf/code/main.py"
)
rlhf = importlib.util.module_from_spec(spec)
spec.loader.exec_module(rlhf)

# Initialise SFT, reward, policy, and reference models.
# Stage 1 (SFT) is assumed to have produced trained weights for `sft`, and
# `PREFERENCE_DATA` is assumed to already exist in scope.
config = dict(vocab_size=256, embed_dim=128, num_heads=4, num_layers=4, max_seq_len=128, ff_dim=512)
sft = rlhf.MiniGPT(**config)
reward = rlhf.RewardModel(**config)
policy = rlhf.MiniGPT(**config)
reference = rlhf.MiniGPT(**config)

# Copy SFT weights into policy and reference
# (copy_model_weights is assumed to be available in this script as well)
policy = rlhf.copy_model_weights(sft, policy)
reference = rlhf.copy_model_weights(sft, reference)

# Stage 2: train the reward model on preference pairs
reward, rm_losses, rm_accs = rlhf.train_reward_model(reward, PREFERENCE_DATA, num_epochs=10, lr=1e-4)

# Stage 3: PPO fine-tuning using the learned reward
policy, rewards, kls = rlhf.ppo_training(
    policy,
    reference,
    reward,
    [p["prompt"] for p in PREFERENCE_DATA],
    num_episodes=20,
    lr=1.5e-5,
    kl_coeff=0.02,
)

print(f"Final KL divergence: {kls[-1]:.4f}")
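
The reward-model stage in the pipeline above is commonly trained with a pairwise Bradley-Terry loss, which pushes the score of the preferred response above the score of the rejected one. The sketch below shows that loss on scalar scores; it is an assumption about the general technique, and train_reward_model in the repository may use a different formulation. Both function names are hypothetical.

```python
import math


def reward_pair_loss(score_preferred: float, score_rejected: float) -> float:
    """Bradley-Terry loss: -log(sigmoid(score_preferred - score_rejected))."""
    diff = score_preferred - score_rejected
    # log(1 + exp(-diff)), written so that exp() never overflows.
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))


def reward_pair_accuracy(scored_pairs: list[tuple[float, float]]) -> float:
    """Fraction of (preferred, rejected) score pairs ranked correctly."""
    if not scored_pairs:
        return 0.0
    correct = sum(1 for preferred, rejected in scored_pairs if preferred > rejected)
    return correct / len(scored_pairs)


# Hypothetical reward scores for three preference pairs.
scores = [(2.0, 0.5), (0.1, 0.4), (1.0, -1.0)]
mean_loss = sum(reward_pair_loss(p, r) for p, r in scores) / len(scores)
print(f"mean loss={mean_loss:.4f}, accuracy={reward_pair_accuracy(scores):.2f}")
```

This loss has the same logistic form as the DPO loss; the difference is that RLHF fits it to an explicit reward model, whereas DPO applies it directly to the policy's log-probability ratios.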

When to Choose DPO vs RLHF

Choose DPO when you need rapid iteration with limited compute, have a small or straightforward preference dataset, or prefer a stable supervised learning workflow. As noted in lines 61-80 of the DPO script, DPO offers "1 training loop (vs 3 for RLHF)" and requires "no reward model to train or maintain."
Choose RLHF when deploying at massive scale (e.g., GPT-4- or Claude-class models), capturing complex, nuanced preferences, or requiring multi-objective alignment. The repository emphasizes that RLHF is "proven at largest scales" and supports "online learning: generate, rate, retrain."

One pragmatic strategy is to start with DPO for rapid alignment, then move to RLHF if the model plateaus or requires richer preference modeling.

Summary

DPO uses a single training loop with two models (policy and reference), treating preference optimization as supervised learning without an explicit reward model.
RLHF employs a three-stage pipeline (SFT, reward model training, PPO) requiring four to five models, enabling complex preference modeling through reinforcement learning.
DPO offers greater stability and lower memory requirements, making it ideal for prototyping and limited compute environments.
RLHF excels at scale and with nuanced, multi-objective preferences, though it requires significantly more resources and careful tuning of PPO hyperparameters.

Frequently Asked Questions

Is DPO more stable than RLHF?

Yes. According to the rohitg00/ai-engineering-from-scratch implementation, DPO operates as supervised learning with gradients derived from log-probability ratios, which are usually more stable than the high-variance RL gradients used in PPO. The DPO loss in phases/10-llms-from-scratch/08-dpo/code/main.py (lines 94-106) needs no PPO-style clipping or separate KL-penalty term; the reference model and the beta coefficient inside the loss keep the policy close to its starting point instead.

Can DPO replace RLHF entirely?

Not for all use cases. While DPO is simpler and cheaper, the repository notes that RLHF remains superior for "multi-objective alignment" and capturing "complex preferences." The two can also be combined: DPO can serve as a refinement step after initial RLHF alignment, or as a quick prototype before scaling to a full RLHF pipeline.

How many models are needed for DPO compared to RLHF?

DPO requires two models: the policy being trained and a reference model (typically the SFT checkpoint). RLHF requires four to five models: the SFT model, a separate reward model, the policy model, a reference model, and optionally a value model for PPO optimization. This difference significantly impacts memory requirements and training infrastructure costs.

Which method is better for large-scale deployment?

RLHF has a proven track record at the largest scales (GPT-4, Claude) and excels when preferences are highly nuanced. However, DPO can serve as an efficient refinement step even in large deployments. The repository suggests that DPO is "simpler and cheaper to run" while RLHF is "proven at massive scale," so the choice depends on whether you prioritize resource efficiency or maximum preference fidelity.
