# How RLHF Works for LLM Alignment and Preference Optimization: A Three-Stage Technical Guide

> Learn how RLHF aligns LLMs using SFT, reward modeling, and PPO. This technical guide details the three-stage process for preference optimization, drawing from the ai-engineering-from-scratch repository.

- Repository: [Rohit Ghumare/ai-engineering-from-scratch](https://github.com/rohitg00/ai-engineering-from-scratch)
- Tags: deep-dive
- Published: 2026-05-21

---

**TLDR:** RLHF transforms a base language model into an aligned assistant through a three-stage pipeline—Supervised Fine-Tuning (SFT), Reward Model training with Bradley-Terry loss, and Proximal Policy Optimization (PPO) with KL-divergence penalties—as demonstrated in the `rohitg00/ai-engineering-from-scratch` repository.

The `rohitg00/ai-engineering-from-scratch` curriculum provides a complete toy implementation of RLHF for LLM alignment and preference optimization in [`phases/10-llms-from-scratch/07-rlhf/code/main.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/phases/10-llms-from-scratch/07-rlhf/code/main.py). This executable reference demonstrates how **Reinforcement Learning from Human Feedback (RLHF)** converts a token-prediction model into a helpful assistant by training a scalar reward function on human preference pairs and optimizing the policy against that reward while controlling deviation from the original model.

## The Three-Stage RLHF Pipeline

The implementation follows the canonical three-stage architecture used in production systems like InstructGPT, with each stage serving a distinct objective in the alignment process.

### Stage 1: Supervised Fine-Tuning (SFT)

The pipeline begins with **Supervised Fine-Tuning (SFT)** to adapt a base language model to follow instructions. In [`main.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/main.py) (lines 33-44), the `MiniGPT` model is instantiated as the SFT foundation:

```python
# 1️⃣ SFT – instantiate a plain language model
sft_model = MiniGPT(vocab_size=256,
                    embed_dim=128,
                    num_heads=4,
                    num_layers=4,
                    max_seq_len=128,
                    ff_dim=512)
```

This stage trains the model on instruction-response datasets using standard cross-entropy loss, creating the baseline policy that subsequent stages refine.
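The cross-entropy objective mentioned above can be sketched in NumPy. This is a minimal illustration, not the repository's training code: the function name `next_token_cross_entropy` and the toy shapes are assumptions made for the example.

```python
import numpy as np

def next_token_cross_entropy(logits, target_ids):
    """Mean cross-entropy between per-position logits and target token ids.

    logits:     (seq_len, vocab_size) array of unnormalized scores
    target_ids: (seq_len,) array of the tokens the model should predict
    """
    # Subtract the row-wise max before exponentiating for numerical stability.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Pick the log-probability assigned to each correct token.
    picked = log_probs[np.arange(len(target_ids)), target_ids]
    return -picked.mean()

# Hypothetical 3-position example over a 4-token vocabulary.
logits = np.zeros((3, 4))            # uniform predictions
targets = np.array([0, 2, 1])
loss = next_token_cross_entropy(logits, targets)  # log(4) for uniform logits
```

With uniform logits the loss equals `log(vocab_size)`, which is a handy sanity check for an untrained model before SFT begins.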

### Stage 2: Training the Reward Model on Human Preferences

The second stage constructs a **Reward Model (RM)** that learns to predict human preferences. Implemented as a mini-transformer in the `RewardModel` class (lines 48-64), this model processes concatenated prompt-response tokens and outputs a scalar score via a linear head `self.reward_head`:

```python
# 2️⃣ Reward model definition (simplified)
class RewardModel:
    def __init__(self, vocab_size=256, embed_dim=128,
                 num_heads=4, num_layers=4, max_seq_len=128, ff_dim=512):
        self.embedding = Embedding(vocab_size, embed_dim, max_seq_len)
        self.blocks = [TransformerBlock(embed_dim, num_heads, ff_dim)
                       for _ in range(num_layers)]
        self.ln_f = LayerNorm(embed_dim)
        self.reward_head = np.random.randn(embed_dim) * 0.02   # ⟶ scalar reward
```
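The excerpt above stops at the constructor. One way a vector-shaped head can turn transformer outputs into a single score is sketched below. The pooling choice (using the last token's hidden state) and the helper name `scalar_reward` are assumptions for illustration; the repository's `forward` method may pool differently.

```python
import numpy as np

rng = np.random.default_rng(0)
embed_dim, seq_len = 128, 10

# Stand-in for the output of the transformer blocks and final layer norm.
hidden_states = rng.standard_normal((seq_len, embed_dim))
reward_head = rng.standard_normal(embed_dim) * 0.02

def scalar_reward(hidden_states, reward_head):
    # Assumption: score the whole sequence from its last token's hidden state.
    # A dot product of two (embed_dim,) vectors yields one scalar.
    return float(hidden_states[-1] @ reward_head)

reward = scalar_reward(hidden_states, reward_head)
```

Because the head is a single vector of length `embed_dim`, the reward model adds only `embed_dim` parameters on top of the transformer body.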

The RM trains on **pairwise human judgments** stored in `PREFERENCE_DATA` (lines 14-45). The **Bradley-Terry loss** function (lines 92-96) converts reward differences into a binary cross-entropy term, encouraging the model to assign higher scores to preferred responses:

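```python
import numpy as np

# Helper assumed to be defined elsewhere in main.py; shown so the snippet runs on its own
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# 2️⃣ Bradley‑Terry loss used to train the RM
def bradley_terry_loss(r_pref, r_rej):
    diff = r_pref - r_rej                 # margin between preferred and rejected scores
    loss = -np.log(sigmoid(diff) + 1e-8)  # epsilon guards against log(0)
    return loss
```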
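A few worked values make the behavior of this loss concrete. The sketch below uses an algebraically equivalent form, `log(1 + exp(-diff))`, computed with `np.logaddexp` so that no epsilon is needed. The function names `bt_loss_stable` and `bt_grad_wrt_diff` are hypothetical and do not appear in the repository.

```python
import numpy as np

def bt_loss_stable(r_pref, r_rej):
    # -log(sigmoid(d)) == log(1 + exp(-d)); logaddexp avoids overflow for large |d|.
    return np.logaddexp(0.0, -(r_pref - r_rej))

def bt_grad_wrt_diff(r_pref, r_rej):
    # d(loss)/d(diff) = sigmoid(diff) - 1 = -sigmoid(-diff), always in (-1, 0).
    diff = r_pref - r_rej
    return -np.exp(-np.logaddexp(0.0, diff))

loss_tie = bt_loss_stable(0.0, 0.0)      # log(2): the model cannot tell the pair apart
loss_right = bt_loss_stable(2.0, -1.0)   # small loss: preferred response scored higher
loss_wrong = bt_loss_stable(-1.0, 2.0)   # large loss: the ranking is inverted
```

The gradient is always negative with respect to the score difference, so each update pushes the preferred score up and the rejected score down, and the push weakens as the pair becomes well separated.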

### Stage 3: Proximal Policy Optimization with KL Control

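The final stage uses **Proximal Policy Optimization (PPO)** to update the policy so that it maximizes reward while staying close to the SFT baseline. The `ppo_training` loop (lines 20-28) sketches this iteration. As its own comment notes, the toy version replaces PPO's clipped gradient step with a reward-scaled random perturbation of the feed-forward weights: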

```python
# 3️⃣ PPO training loop (core of RLHF)
def ppo_training(policy_model, reference_model, reward_model,
                 prompts, num_episodes=20, lr=1.5e-5, kl_coeff=0.02):
    for episode in range(num_episodes):
        # Pick a prompt for this episode (cycling through `prompts` is an
        # assumption; the excerpt does not show how the prompt is chosen)
        prompt = prompts[episode % len(prompts)]

        # Sample a response from the current policy
        response_tokens = generate_response(policy_model,
                                            tokenize(prompt),
                                            max_new_tokens=20)

        # Score with the reward model
        reward = reward_model.forward(response_tokens)[0]

        # KL‑penalty against the SFT (reference) model
        kl = compute_kl_divergence(policy_model.forward(response_tokens),
                                   reference_model.forward(response_tokens))

        # Total reward used for the update
        total_reward = reward - kl_coeff * kl

        # Simple gradient‑free update (illustrative): a random perturbation
        # scaled by the penalized reward, not a true PPO gradient step
        for block in policy_model.blocks:
            block.ffn.W1 += lr * total_reward * np.random.randn(*block.ffn.W1.shape) * 0.01
            block.ffn.W2 += lr * total_reward * np.random.randn(*block.ffn.W2.shape) * 0.01
```
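The loop above uses an illustrative gradient-free update. For comparison, the clipped surrogate objective that gives PPO its name can be sketched as follows. This is a minimal sketch and not code from the repository: the function name `ppo_clipped_objective`, the `clip_eps=0.2` default, and the toy inputs are assumptions.

```python
import numpy as np

def ppo_clipped_objective(logp_new, logp_old, advantages, clip_eps=0.2):
    """Clipped surrogate objective (to be maximized), averaged over samples."""
    ratio = np.exp(logp_new - logp_old)   # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # The elementwise minimum removes any incentive to push the ratio
    # outside the clip range in the direction the advantage favors.
    return np.minimum(unclipped, clipped).mean()

logp_old = np.log(np.array([0.5, 0.5]))
logp_new = np.log(np.array([0.9, 0.1]))   # policy moved far from the old one
advantages = np.array([1.0, 1.0])
objective = ppo_clipped_objective(logp_new, logp_old, advantages)
```

In the first sample the ratio of 1.8 is clipped to 1.2, so raising that action's probability further earns no additional objective. This per-step clipping is separate from the KL penalty against the reference model.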

The **KL-divergence penalty**, computed by `compute_kl_divergence` (lines 63-76), measures drift between the current policy and the reference SFT model. The **KL coefficient** `kl_coeff` (β) is the critical hyperparameter: small values allow aggressive optimization toward higher reward, while large values keep the policy close to the reference and guard against catastrophic forgetting of language capabilities. The capability loss that alignment training can cause is often called the **alignment tax**.
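To make the penalty concrete, here is one way to compute a KL term from two sets of logits and fold it into the reward. The body of the repository's `compute_kl_divergence` is not shown in this article, so the helper `kl_from_logits` below is a hypothetical stand-in that assumes per-position categorical distributions and the direction KL(policy ‖ reference).

```python
import numpy as np

def log_softmax(logits):
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def kl_from_logits(policy_logits, reference_logits):
    """Mean over positions of KL(policy || reference) for per-token distributions."""
    logp = log_softmax(policy_logits)
    logq = log_softmax(reference_logits)
    return float((np.exp(logp) * (logp - logq)).sum(axis=-1).mean())

def penalized_reward(reward, kl, kl_coeff=0.02):
    # Same form as total_reward in the training loop: reward minus beta times KL.
    return reward - kl_coeff * kl

ref = np.array([[0.0, 0.0, 0.0]])
same = kl_from_logits(ref, ref)                              # identical distributions
drifted = kl_from_logits(np.array([[3.0, 0.0, 0.0]]), ref)   # policy has sharpened
```

Identical logits give a KL of zero, so the penalty only begins to bite once the policy's token distribution moves away from the reference.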

## Critical Implementation Details

The `ai-engineering-from-scratch` implementation highlights two technical nuances essential for stable RLHF training.

**Bradley-Terry Pairwise Comparison:** Unlike regression-based reward models, the Bradley-Terry formulation (lines 92-96) learns from pairwise human rankings rather than absolute scores. This tends to be more robust because annotators usually judge relative preferences (A is better than B) more reliably than they assign absolute scores (A is 8/10).

**Three-Model Architecture:** As noted in the curriculum catalog ([`site/data.js`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/site/data.js), lines 1756-1770), the pipeline maintains three distinct models simultaneously: the frozen SFT reference model, the actively updating policy model, and the frozen reward model. Freezing the reference gives the KL penalty a fixed anchor, and that penalty limits how far the policy can overfit to the reward model's idiosyncrasies, a failure mode known as reward hacking.
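A practical consequence of the three-model setup is that the policy and the reference must be independent copies of the SFT weights. The sketch below shows the idea with a hypothetical `TinyModel` stand-in for `MiniGPT`; whether the repository uses `copy.deepcopy` is an assumption.

```python
import copy
import numpy as np

class TinyModel:
    """Hypothetical stand-in for MiniGPT: a single weight matrix."""
    def __init__(self, seed=0):
        self.W = np.random.default_rng(seed).standard_normal((4, 4))

sft_model = TinyModel()
reference_model = copy.deepcopy(sft_model)  # frozen: never updated below
policy_model = copy.deepcopy(sft_model)     # trainable copy

# Simulate one policy update; the reference must stay unchanged.
policy_model.W += 0.01
reference_unchanged = np.array_equal(reference_model.W, sft_model.W)
policy_changed = not np.array_equal(policy_model.W, reference_model.W)
```

If the policy and reference were the same object, every update would move both, the KL term would stay at zero, and the penalty would have no effect.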

## Summary

- **RLHF for LLM alignment** requires three sequential stages: SFT to establish instruction-following capabilities, Reward Model training to capture human preferences via Bradley-Terry loss, and PPO to optimize the policy against learned rewards.
- The implementation in [`phases/10-llms-from-scratch/07-rlhf/code/main.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/phases/10-llms-from-scratch/07-rlhf/code/main.py) demonstrates that while production RLHF systems scale to billions of parameters, the core mechanics fit into a compact, runnable toy model.
- **KL-divergence penalties** prevent the optimized policy from drifting too far from the SFT baseline, balancing reward maximization against the preservation of general language competence.
- The **Bradley-Terry loss** enables learning from pairwise preference data, which reflects how human annotators actually provide feedback.

## Frequently Asked Questions

### What is the purpose of the KL penalty in RLHF?

The **KL penalty** measures the divergence between the current RL policy and the original SFT model, penalizing the optimization when the policy generates outputs that deviate significantly from the baseline distribution. According to the source code in [`main.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/main.py) (lines 63-76), this penalty is subtracted from the reward (weighted by coefficient β) to prevent catastrophic forgetting and reward hacking, ensuring the model maintains coherent language generation while improving on human preferences.

### How does the Bradley-Terry loss function work in the Reward Model?

The **Bradley-Terry loss** (lines 92-96) treats preference learning as a binary classification problem. It takes the scalar reward scores for a preferred response (`r_pref`) and a rejected response (`r_rej`), computes their difference, and applies a sigmoid function to yield the probability that the preferred response should rank higher. The negative log-likelihood of this probability drives the loss, effectively training the model to maximize the margin between preferred and rejected outputs without requiring absolute reward values.

### Why does RLHF require three separate models?

RLHF maintains three models simultaneously: the **SFT reference model** (frozen), the **active policy model** (being optimized), and the **Reward Model** (frozen). Each isolates a different function of the training process. The SFT model provides a stable baseline for KL-divergence calculations, the policy model explores the response space to maximize reward, and the Reward Model provides the preference signal. As cataloged in [`site/data.js`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/site/data.js) (lines 1756-1770), this separation, together with the KL penalty, limits the policy's ability to exploit the Reward Model's specific weaknesses (reward hacking) and helps preserve the general knowledge captured during pre-training.

### What happens if the KL coefficient is set too high or too low?

Setting the **KL coefficient** (β) too high forces the policy to stay extremely close to the SFT model, resulting in minimal alignment gains because the optimizer cannot explore high-reward regions of the output space. Conversely, setting β too low allows the policy to drift toward outputs that maximize the Reward Model's score but may become incoherent, repetitive, or otherwise misaligned with actual human intent, a phenomenon known as reward hacking. The reference implementation uses `kl_coeff=0.02` as its default for the toy model.
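A small worked example shows how β changes which behavior the penalized reward favors. The two candidate policies and their (reward, KL) values below are hypothetical numbers chosen for illustration, not measurements from the repository.

```python
def total_reward(reward, kl, kl_coeff):
    return reward - kl_coeff * kl

# Hypothetical candidates: (reward-model score, KL from the reference).
conservative = (1.0, 2.0)    # modest reward, stays near the SFT model
aggressive = (3.0, 40.0)     # high reward, drifts far from the SFT model

def preferred(kl_coeff):
    scores = {
        "conservative": total_reward(*conservative, kl_coeff),
        "aggressive": total_reward(*aggressive, kl_coeff),
    }
    return max(scores, key=scores.get)

low_beta_choice = preferred(0.02)   # aggressive: 3 - 0.8 = 2.2 vs 1 - 0.04 = 0.96
high_beta_choice = preferred(0.2)   # conservative: 1 - 0.4 = 0.6 vs 3 - 8 = -5.0
```

For these numbers the crossover sits near β ≈ 0.053 (where `3 - 40β = 1 - 2β`), so a tenfold change in β is enough to flip which policy the objective rewards.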