# How to Implement RLHF for LLM Alignment: A Complete Guide from Scratch

> Learn how to implement RLHF for LLM alignment with this comprehensive guide. Build a reward model and optimize your LLM using PPO from scratch.

- Repository: [Rohit Ghumare/ai-engineering-from-scratch](https://github.com/rohitg00/ai-engineering-from-scratch)
- Tags: how-to-guide
- Published: 2026-06-10

---

**Reinforcement Learning from Human Feedback (RLHF) aligns large language models by training a reward model on pairwise human preferences and then optimizing the policy with a PPO-style loop that penalizes divergence from a reference (SFT) model.**

The `rohitg00/ai-engineering-from-scratch` repository contains a miniature, self-contained implementation of this three-stage pipeline in its **Reward Modeling & RLHF** lesson (Phase 9). This guide examines the actual source code to demonstrate how to implement RLHF for LLM alignment using both synthetic demonstrations and production-grade tools.


## The Three-Stage RLHF Pipeline

The implementation in [`phases/09-reinforcement-learning/09-reward-modeling-rlhf/code/main.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/phases/09-reinforcement-learning/09-reward-modeling-rlhf/code/main.py) mirrors the standard alignment pipeline established in Christiano et al. (2017) and Ouyang et al. (2022).

### Stage 1: Preference Data Generation

RLHF begins with pairwise preference data. The reference code uses synthetic generation where `PROMPTS`, `GOOD`, and `BAD` tokens are sampled to form `(prompt, preferred, rejected)` triples.

In production environments, this data comes from human annotators or AI-generated preferences (RLAIF). The `sample_pair` function in [`code/main.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/code/main.py) demonstrates the structure: for each prompt, you sample two completions and label which one is preferred.
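As a rough illustration of that structure, here is a minimal sketch of a pair sampler. The token lists, the `length` and `noise` parameters, and the function signature are assumptions made for this example; the repository's own `sample_pair`, `PROMPTS`, `GOOD`, and `BAD` may differ.

```python
import random

# Hypothetical vocabularies; the repository defines its own PROMPTS, GOOD and BAD.
PROMPTS = ["explain recursion", "summarize the report"]
GOOD = ["clear", "accurate", "helpful"]
BAD = ["rude", "wrong", "vague"]


def sample_pair(rng, length=4, noise=0.25):
    """Return one (prompt, preferred, rejected) triple.

    The preferred completion draws mostly from GOOD and the rejected one
    mostly from BAD; `noise` is the chance a token comes from the other list.
    """
    def completion(primary, other):
        return [rng.choice(other if rng.random() < noise else primary)
                for _ in range(length)]

    prompt = rng.choice(PROMPTS)
    return prompt, completion(GOOD, BAD), completion(BAD, GOOD)


rng = random.Random(0)
prompt, preferred, rejected = sample_pair(rng)
print(prompt, preferred, rejected)
```

The label noise mimics annotator disagreement: a reward model trained on such pairs has to learn the overall tendency rather than memorize a clean split.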

### Stage 2: Bradley-Terry Reward Model Training

The reward model (RM) learns to score completions using the **Bradley-Terry pairwise logistic loss**:

\[
L = -\log\sigma\big(R(y_+) - R(y_-)\big)
\]

In the source code, `train_rm` implements a linear scorer `w·bag(y)` that learns weights giving positive mass to "good" tokens and negative mass to "bad" ones. The function iterates over preference pairs, computes the score difference through the `score` method, applies the `sigmoid` function, and minimizes the negative log-likelihood.

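```python
# Train the synthetic reward model.
# The lesson's directory names contain hyphens, so they cannot be imported as a
# package. This assumes the repository root is the working directory and puts
# the lesson's code/ directory on sys.path instead.
import random
import sys

sys.path.insert(0, "phases/09-reinforcement-learning/09-reward-modeling-rlhf/code")
from main import train_rm, rm_accuracy

rng = random.Random(42)
w = train_rm(n_pairs=600, rng=rng)
print("Reward model pairwise accuracy:", rm_accuracy(w))
```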
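To make the update rule concrete, the following is a minimal sketch of Bradley-Terry training for a linear bag-of-tokens scorer. It is not the repository's `train_rm`: the token lists, `train_bt`, the learning rate, and the epoch count are assumptions chosen for illustration.

```python
import math
import random

GOOD = ["clear", "accurate", "helpful"]  # hypothetical token lists
BAD = ["rude", "wrong", "vague"]
VOCAB = GOOD + BAD


def bag(tokens):
    """Bag-of-words count vector over VOCAB."""
    return [tokens.count(t) for t in VOCAB]


def score(w, tokens):
    """Linear reward R(y) = w . bag(y)."""
    return sum(wi * xi for wi, xi in zip(w, bag(tokens)))


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_bt(pairs, lr=0.1, epochs=20):
    """Plain SGD on L = -log sigmoid(R(y+) - R(y-))."""
    w = [0.0] * len(VOCAB)
    for _ in range(epochs):
        for preferred, rejected in pairs:
            p = sigmoid(score(w, preferred) - score(w, rejected))
            # dL/dw = -(1 - p) * (bag(y+) - bag(y-)), so descend along +diff.
            diff = [a - b for a, b in zip(bag(preferred), bag(rejected))]
            w = [wi + lr * (1.0 - p) * d for wi, d in zip(w, diff)]
    return w


rng = random.Random(0)
pairs = [([rng.choice(GOOD) for _ in range(4)],
          [rng.choice(BAD) for _ in range(4)]) for _ in range(200)]
w = train_bt(pairs)
accuracy = sum(score(w, a) > score(w, b) for a, b in pairs) / len(pairs)
print("weights:", [round(x, 2) for x in w])
print("pairwise accuracy:", accuracy)
```

The factor `(1 - p)` shrinks as the model grows confident about a pair, so correctly ranked pairs contribute progressively smaller updates.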

### Stage 3: PPO-Style Policy Optimization

The final stage optimizes a policy `π_θ` using a PPO-style objective with KL regularization. The `rlhf_loop` function in [`code/main.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/code/main.py) implements this core logic:

1. Sample a token from the current policy
2. Score it using the trained reward model
3. Subtract a KL penalty `beta * KL(π_θ‖π_ref)` to prevent divergence from the frozen reference policy `π_ref` (the original SFT model)
4. Update `θ` using the advantage computed from centered and scaled rewards (`adv = (r - mean) / sd`)

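```python
# Run the PPO-style RLHF loop for different beta values.
# Assumes the repository root is the working directory; `w` holds the
# reward-model weights returned by train_rm.
import random
import sys

sys.path.insert(0, "phases/09-reinforcement-learning/09-reward-modeling-rlhf/code")
from main import rlhf_loop, train_rm

w = train_rm(n_pairs=600, rng=random.Random(42))

for beta in (0.01, 0.1, 1.0):
    theta, history = rlhf_loop(w, updates=150, beta=beta, rng=random.Random(0))
    first, last = history[0], history[-1]
    print(f"beta={beta:<5}  RM start={first[1]:+.3f} KL start={first[2]:.3f}  "
          f"RM end={last[1]:+.3f} KL end={last[2]:.3f}")
```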
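The four steps can be sketched for a toy policy that emits a single token from a softmax over logits. This is an illustrative approximation, not the repository's `rlhf_loop`: it uses a plain policy-gradient update without PPO clipping, estimates the KL penalty per sample with the log-ratio `log π_θ(t) - log π_ref(t)`, and takes a uniform reference policy. The name `rlhf_sketch` and all default values are assumptions.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p, q):
    """KL(p || q) for two categorical distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def rlhf_sketch(token_reward, updates=150, batch=32, beta=0.1, lr=0.5, rng=None):
    rng = rng or random.Random(0)
    n = len(token_reward)
    theta = [0.0] * n            # policy logits, initialised at the reference
    pi_ref = softmax([0.0] * n)  # frozen reference policy (uniform here)
    history = []
    for step in range(updates):
        pi = softmax(theta)
        # 1. Sample tokens from the current policy.
        tokens = rng.choices(range(n), weights=pi, k=batch)
        # 2-3. Reward-model score minus a per-sample KL estimate.
        shaped = [token_reward[t] - beta * math.log(pi[t] / pi_ref[t])
                  for t in tokens]
        # 4. Centre and scale, then take a policy-gradient step.
        mean = sum(shaped) / batch
        sd = math.sqrt(sum((r - mean) ** 2 for r in shaped) / batch) or 1.0
        grad = [0.0] * n
        for t, r in zip(tokens, shaped):
            adv = (r - mean) / sd
            for j in range(n):
                # d log pi(t) / d theta_j = 1[j == t] - pi_j
                grad[j] += adv * ((1.0 if j == t else 0.0) - pi[j]) / batch
        theta = [th + lr * g for th, g in zip(theta, grad)]
        mean_rm = sum(token_reward[t] for t in tokens) / batch
        history.append((step, mean_rm, kl(pi, pi_ref)))
    return theta, history


# Two "good" tokens with reward +1 and two "bad" tokens with reward -1.
theta, history = rlhf_sketch([1.0, 1.0, -1.0, -1.0], beta=0.1)
print("start:", history[0])
print("end:  ", history[-1])
```

Using the per-sample log-ratio matters here: a single batch-wide KL value subtracted from every reward would cancel out once the rewards are centered.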


## Key Implementation Details from the Source Code

The repository emphasizes three critical knobs for stable RLHF training:

| Component | Implementation | Location |
|-----------|---------------|----------|
| **Reward Model** | Linear scoring of token bags; learns from pairwise preferences | `code/main.py::train_rm` |
| **KL Regularization** | `beta * KL(π_θ‖π_ref)` keeps the policy near the SFT baseline | `code/main.py::rlhf_loop` |
| **Advantage Normalization** | Centered and scaled rewards (`adv = (r - mean) / sd`) stabilize PPO updates | `code/main.py::rlhf_loop` |

The [`docs/en.md`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/docs/en.md) file expands on these implementations, explaining the mathematical rationale behind the KL penalty as a safeguard against reward hacking and detailing how advantage normalization prevents gradient instability during policy updates.
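One standard way to see why the KL term limits reward hacking is to look at the optimum of the regularized objective. This is a well-known result, shown here as a sketch; it is not necessarily the derivation used in `docs/en.md`:

\[
\max_{\pi_\theta}\; \mathbb{E}_{y\sim\pi_\theta}\big[R(y)\big] - \beta\,\mathrm{KL}\big(\pi_\theta \,\|\, \pi_{\text{ref}}\big)
\quad\Longrightarrow\quad
\pi^*(y) = \frac{\pi_{\text{ref}}(y)\,\exp\big(R(y)/\beta\big)}{\sum_{y'} \pi_{\text{ref}}(y')\,\exp\big(R(y')/\beta\big)}
\]

The optimal policy reweights the reference distribution instead of replacing it. As `beta` grows, `π*` approaches `π_ref`; as `beta` shrinks toward zero, the mass concentrates on whichever completions the reward model scores highest, including any it over-scores by mistake.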


## Production-Scale RLHF with Hugging Face TRL

For real-world LLM alignment, the repository provides a production recipe using Hugging Face TRL. This scales the miniature implementation to actual transformer models:

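```python
# Sketch against the legacy TRL PPO API (PPOTrainer.generate / PPOTrainer.step).
# Newer TRL releases changed this interface, so pin a matching version.
# Device placement and padding setup are omitted for brevity.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from trl import (AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer,
                 RewardConfig, RewardTrainer)

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
rm = AutoModelForSequenceClassification.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct", num_labels=1)

# Placeholder: supply your own preference dataset, e.g.
# preference_data = [{"prompt": ..., "chosen": ..., "rejected": ...}, ...]

reward_trainer = RewardTrainer(
    model=rm,
    tokenizer=tokenizer,
    train_dataset=preference_data,
    args=RewardConfig(output_dir="./rm", num_train_epochs=1, learning_rate=1e-5),
)
reward_trainer.train()

policy = AutoModelForCausalLMWithValueHead.from_pretrained("./sft-checkpoint")
ref    = AutoModelForCausalLMWithValueHead.from_pretrained("./sft-checkpoint")  # frozen

# NOTE: PPOConfig field names vary across TRL versions; check whether your
# installed version reads the adaptive KL target from `target` or `target_kl`.
ppo = PPOTrainer(
    config=PPOConfig(learning_rate=1.41e-5, batch_size=64,
                    init_kl_coef=0.05, target_kl=6.0, adap_kl_ctrl=True),
    model=policy, ref_model=ref, tokenizer=tokenizer,
)

# Placeholder: `dataloader` yields batches whose "query_ids" entry holds the
# tokenized prompts.
for batch in dataloader:
    queries = list(batch["query_ids"])  # legacy step() expects lists of 1-D tensors
    responses = ppo.generate(queries, return_prompt=False, max_new_tokens=128)
    with torch.no_grad():  # the reward model only scores here; it is not trained
        rewards = [rm(torch.cat([q, r]).unsqueeze(0)).logits[0, 0]
                   for q, r in zip(queries, responses)]
    stats = ppo.step(queries, responses, rewards)
    print(stats)  # logged statistics include KL, clip fraction, and value loss
```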

This production pipeline keeps the same three-stage structure: preference data feeds `RewardTrainer` to fit the reward model, and `PPOTrainer` then optimizes the policy against it, with the KL divergence constraint governed by `init_kl_coef` and adaptive KL control.


## Critical Hyperparameters for Stable Alignment

The `rohitg00/ai-engineering-from-scratch` implementation highlights specific parameters that control alignment stability:

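- **`beta` (KL coefficient)**: Controls the strength of the penalty for deviating from the reference policy. A small `beta` lets the policy chase reward model scores at the cost of larger KL drift, while a large `beta` keeps it close to the SFT baseline. The `rlhf_loop` sweep over `0.01`, `0.1`, and `1.0` lets you compare the resulting reward and KL, and values around `0.01` to `0.1` are typical starting points.
- **Advantage normalization**: Centering and scaling rewards (`(r - mean) / sd`) before computing policy gradients keeps update magnitudes comparable across batches, which limits the gradient variance that destabilizes early RLHF training.
- **Target KL**: In the TRL recipe, `adap_kl_ctrl=True` makes the trainer adjust the KL coefficient during training, starting from `init_kl_coef`, so that the measured KL tracks a target value (the recipe passes `target_kl=6.0`). Field names differ between TRL versions, so confirm which one your installed `PPOConfig` uses for the adaptive target.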
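Adaptive KL control is commonly implemented as a clipped proportional controller, in the style introduced by Ziegler et al. (2019). The sketch below illustrates that idea only; the class name, the `horizon` parameter, and the default values are assumptions for this example rather than the exact code of any TRL release.

```python
class AdaptiveKLController:
    """Nudges the KL coefficient so the measured KL drifts toward `target`."""

    def __init__(self, init_kl_coef=0.05, target=6.0, horizon=10000):
        self.value = init_kl_coef
        self.target = target
        self.horizon = horizon

    def update(self, current_kl, n_steps):
        # Relative error, clipped to +/-20% so one noisy batch cannot swing beta.
        error = max(-0.2, min(0.2, current_kl / self.target - 1.0))
        self.value *= 1.0 + error * n_steps / self.horizon
        return self.value


controller = AdaptiveKLController()
too_far = controller.update(current_kl=12.0, n_steps=64)   # KL above target
print(f"beta after overshoot:  {too_far:.6f}")
too_close = controller.update(current_kl=1.0, n_steps=64)  # KL below target
print(f"beta after undershoot: {too_close:.6f}")
```

When the measured KL exceeds the target, the coefficient rises and the penalty tightens; when the policy stays well inside the budget, the coefficient decays and the policy is allowed to move further.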


## Summary

- RLHF implementation requires three distinct components: pairwise preference data, a Bradley-Terry reward model trained on preference comparisons, and a PPO-style policy optimizer with KL regularization.
- The `ai-engineering-from-scratch` repository provides a complete, runnable implementation in [`phases/09-reinforcement-learning/09-reward-modeling-rlhf/code/main.py`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/phases/09-reinforcement-learning/09-reward-modeling-rlhf/code/main.py) using synthetic data to demonstrate the full pipeline.
- **KL divergence regularization** (`beta * KL`) is essential to prevent reward hacking and to preserve the capabilities of the SFT reference model.
- Production implementations use Hugging Face TRL's `RewardTrainer` and `PPOTrainer` to scale these concepts to billions of parameters, while preserving the same mathematical foundations.
- The [`docs/en.md`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/docs/en.md) documentation and [`outputs/skill-rlhf-architect.md`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/outputs/skill-rlhf-architect.md) artifact provide architectural guidance for extending this implementation to modern alternatives like DPO, GRPO, and RLAIF.


## Frequently Asked Questions

### What is the Bradley-Terry model in RLHF?

The **Bradley-Terry model** is a pairwise comparison framework that learns a scalar reward function `R(y)` such that the probability of preferring completion `y_+` over `y_-` equals the logistic sigmoid of their score difference: `P(y_+ > y_-) = σ(R(y_+) - R(y_-))`. In the source code, this is implemented in `train_rm` as a linear scorer over bag-of-tokens features, trained to minimize the negative log-likelihood of the observed preferences.
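A few numbers make the relationship between score gap, preference probability, and loss easier to read. The helper name `preference_probability` is introduced here for illustration only.

```python
import math


def preference_probability(r_pos, r_neg):
    """P(y+ preferred over y-) under Bradley-Terry."""
    return 1.0 / (1.0 + math.exp(-(r_pos - r_neg)))


for gap in (0.0, 1.0, 3.0):
    p = preference_probability(gap, 0.0)
    print(f"gap={gap:.1f}  P={p:.3f}  loss={-math.log(p):.3f}")
# gap=0.0  P=0.500  loss=0.693
# gap=1.0  P=0.731  loss=0.313
# gap=3.0  P=0.953  loss=0.049
```

Only the difference between the two scores matters, so adding a constant to every reward leaves both the probabilities and the loss unchanged.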

### Why is KL divergence regularization necessary in RLHF?

**KL divergence regularization** prevents the policy from exploiting the reward model's imperfections—a phenomenon known as reward hacking. By penalizing the policy `π_θ` for deviating too far from the reference policy `π_ref` (the original SFT model) via the term `beta * KL(π_θ‖π_ref)`, the optimization maintains linguistic coherence and factual grounding while still improving on the desired preferences. The `rlhf_loop` function demonstrates this by freezing `π_ref` and subtracting the KL penalty from the reward before computing advantages.

### How does RLHF differ from DPO (Direct Preference Optimization)?

**RLHF** with PPO requires training a separate reward model first, then using reinforcement learning to optimize the policy against that model. **DPO** (Direct Preference Optimization) bypasses the explicit reward model and RL loop by optimizing the policy directly on preference data with a classification-style loss, in which the policy's log-probability ratio against the reference model plays the role of the reward. The [`docs/en.md`](https://github.com/rohitg00/ai-engineering-from-scratch/blob/main/docs/en.md) file in the repository discusses DPO as a modern alternative that reduces memory requirements and training complexity, though PPO-based RLHF remains widely used.
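For comparison, the DPO objective for a single preference pair can be sketched in a few lines. The function name, argument names, and the example log-probabilities are assumptions for illustration; in practice each `logp` is the summed token log-probability of a completion under the policy or the frozen reference model.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected,
             beta=0.1):
    """-log sigmoid(beta * (chosen log-ratio - rejected log-ratio))."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(m)) = softplus(-m), written in a numerically stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Policy identical to the reference: zero margin, loss = log 2.
print(round(dpo_loss(-5.0, -7.0, -5.0, -7.0), 4))
# Policy has raised the chosen completion's log-probability: loss drops.
print(round(dpo_loss(-4.0, -7.0, -5.0, -7.0), 4))
```

The quantity `beta * (logp - ref_logp)` acts as an implicit reward, which is why the loss has the same Bradley-Terry shape as the reward model objective without needing a separate scorer.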

### Can I use synthetic data for the reward model?

Yes, synthetic data is valid for prototyping and educational implementations. The `ai-engineering-from-scratch` repository uses synthetic `(prompt, good, bad)` triples generated from known token lists to demonstrate the `train_rm` functionality. However, production systems require high-quality human preferences or carefully validated AI-generated preferences (RLAIF) to ensure the reward model captures nuanced human values and avoids bias amplification.