How to Implement RLHF for LLM Alignment: A Complete Guide from Scratch
Reinforcement Learning from Human Feedback (RLHF) aligns large language models by training a reward model on pairwise human preferences and then optimizing the policy with a PPO-style loop that penalizes divergence from a reference (SFT) model.
The rohitg00/ai-engineering-from-scratch repository contains a miniature, self-contained implementation of this three-stage pipeline in its Reward Modeling & RLHF lesson (Phase 9). This guide examines the actual source code to demonstrate how to implement RLHF for LLM alignment using both synthetic demonstrations and production-grade tools.
The Three-Stage RLHF Pipeline
The implementation in phases/09-reinforcement-learning/09-reward-modeling-rlhf/code/main.py mirrors the standard alignment pipeline established in Christiano et al. (2017) and Ouyang et al. (2022).
Stage 1: Preference Data Generation
RLHF begins with pairwise preference data. The reference code uses synthetic generation where PROMPTS, GOOD, and BAD tokens are sampled to form (prompt, preferred, rejected) triples.
In production environments, this data comes from human annotators or AI-generated preferences (RLAIF). The sample_pair function in code/main.py demonstrates the structure: for each prompt, you sample two completions and label which one is preferred.
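The shape of this data can be sketched as follows. This is a minimal illustration, not the repository's code: the token lists, the `sample_pair_sketch` name, and its signature are assumptions made for this example.

```python
import random

# Hypothetical vocabularies; the repository's PROMPTS, GOOD, and BAD lists may differ.
PROMPTS = ["explain", "summarize", "translate"]
GOOD = ["clear", "helpful", "accurate"]
BAD = ["rude", "vague", "wrong"]

def sample_pair_sketch(rng, length=3):
    """Return one (prompt, preferred, rejected) triple of token lists."""
    prompt = rng.choice(PROMPTS)
    preferred = [rng.choice(GOOD) for _ in range(length)]
    rejected = [rng.choice(BAD) for _ in range(length)]
    return prompt, preferred, rejected

rng = random.Random(0)
pairs = [sample_pair_sketch(rng) for _ in range(5)]
for prompt, preferred, rejected in pairs:
    print(prompt, preferred, rejected)
```

With human or AI annotators the label comes from a judgment rather than from the token list a completion was drawn from, but the triple format stays the same.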
Stage 2: Bradley-Terry Reward Model Training
The reward model (RM) learns to score completions using the Bradley-Terry pairwise logistic loss:
\[ L = -\log\sigma\big(R(y_+) - R(y_-)\big) \]
In the source code, train_rm implements a linear scorer w·bag(y) that learns weights giving positive mass to "good" tokens and negative mass to "bad" ones. The function iterates over preference pairs, computes the score difference through the score method, applies the sigmoid function, and minimizes the negative log-likelihood.
# Train the synthetic reward model
import random
from phases_09_reward_modeling_rlhf.code.main import train_rm, rm_accuracy
rng = random.Random(42)
w = train_rm(n_pairs=600, rng=rng)
print("Reward model pairwise accuracy:", rm_accuracy(w))
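For readers without the repository checked out, the training step can be approximated with a self-contained sketch. It assumes a linear scorer over bag-of-words counts and plain stochastic gradient descent on the Bradley-Terry loss; the vocabulary, the `train_rm_sketch` name, and the hyperparameters are hypothetical and may not match `train_rm`.

```python
import math
import random

GOOD = ["clear", "helpful", "accurate"]  # hypothetical "good" tokens
BAD = ["rude", "vague", "wrong"]         # hypothetical "bad" tokens
VOCAB = GOOD + BAD

def bag(tokens):
    """Bag-of-words count vector over VOCAB."""
    return [tokens.count(t) for t in VOCAB]

def score(w, tokens):
    """Linear reward R(y) = w . bag(y)."""
    return sum(wi * xi for wi, xi in zip(w, bag(tokens)))

def train_rm_sketch(pairs, lr=0.1, epochs=5):
    """Minimize -log sigmoid(R(y+) - R(y-)) with per-pair gradient steps."""
    w = [0.0] * len(VOCAB)
    for _ in range(epochs):
        for preferred, rejected in pairs:
            diff = score(w, preferred) - score(w, rejected)
            p = 1.0 / (1.0 + math.exp(-diff))  # P(preferred beats rejected)
            xp, xr = bag(preferred), bag(rejected)
            # dL/dw = -(1 - p) * (bag(y+) - bag(y-)), so step in the opposite direction
            for i in range(len(w)):
                w[i] += lr * (1.0 - p) * (xp[i] - xr[i])
    return w

rng = random.Random(0)
pairs = [([rng.choice(GOOD) for _ in range(3)],
          [rng.choice(BAD) for _ in range(3)]) for _ in range(200)]
w = train_rm_sketch(pairs)
accuracy = sum(score(w, a) > score(w, b) for a, b in pairs) / len(pairs)
print({t: round(x, 2) for t, x in zip(VOCAB, w)}, accuracy)
```

Because the synthetic pairs are perfectly separable, the learned weights end up positive for the "good" tokens and negative for the "bad" ones; real preference data is noisier and will not reach perfect pairwise accuracy.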
Stage 3: PPO-Style Policy Optimization
The final stage optimizes a policy π_θ using a PPO-style objective with KL regularization. The rlhf_loop function in code/main.py implements this core logic:
- Sample a token from the current policy
- Score it using the trained reward model
- Subtract a KL penalty beta * KL(π_θ‖π_ref) to prevent divergence from the frozen reference policy π_ref (the original SFT model)
- Update θ using the advantage computed from centered-scaled rewards (adv = (r - mean)/sd)
# Run the PPO-style RLHF loop for different beta values
from phases_09_reward_modeling_rlhf.code.main import rlhf_loop
for beta in (0.01, 0.1, 1.0):
    theta, history = rlhf_loop(w, updates=150, beta=beta, rng=random.Random(0))
    first, last = history[0], history[-1]
    print(f"beta={beta:<5} RM start={first[1]:+.3f} KL start={first[2]:.3f} "
          f"RM end={last[1]:+.3f} KL end={last[2]:.3f}")
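The four steps listed above can also be sketched without the repository. The version below makes several simplifying assumptions that may differ from `rlhf_loop`: the policy is a softmax over a four-token vocabulary, the reference policy is uniform, the per-token rewards are hypothetical, and the update is a plain REINFORCE-style policy gradient rather than PPO's clipped-ratio objective.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def kl(p, q):
    """KL(p || q) for two categorical distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def rlhf_loop_sketch(token_rewards, updates=150, beta=0.1, batch=32, lr=0.5, rng=None):
    rng = rng or random.Random(0)
    n = len(token_rewards)
    theta = [0.0] * n            # policy logits
    pi_ref = softmax([0.0] * n)  # frozen reference policy (uniform here)
    history = []
    for step in range(updates):
        pi = softmax(theta)
        tokens = rng.choices(range(n), weights=pi, k=batch)
        # RM score minus beta times the per-sample log-ratio (a KL estimate)
        shaped = [token_rewards[t] - beta * math.log(pi[t] / pi_ref[t]) for t in tokens]
        mean = sum(shaped) / batch
        sd = math.sqrt(sum((r - mean) ** 2 for r in shaped) / batch)
        adv = [(r - mean) / (sd + 1e-8) for r in shaped]  # centered-scaled advantage
        grad = [0.0] * n
        for t, a in zip(tokens, adv):
            for j in range(n):
                # gradient of log softmax(theta)[t] with respect to theta[j]
                grad[j] += a * ((1.0 if j == t else 0.0) - pi[j]) / batch
        theta = [th + lr * g for th, g in zip(theta, grad)]
        mean_rm = sum(token_rewards[t] for t in tokens) / batch
        history.append((step, mean_rm, kl(pi, pi_ref)))
    return theta, history

rewards = [1.0, 0.5, -0.5, -1.0]  # hypothetical per-token reward-model scores
for beta in (0.01, 1.0):
    theta, history = rlhf_loop_sketch(rewards, beta=beta, rng=random.Random(0))
    print(f"beta={beta:<5} RM end={history[-1][1]:+.3f} KL end={history[-1][2]:.3f}")
```

A small beta lets the policy collapse onto the highest-reward token and drift far from the reference, while a large beta keeps it close to the reference at the cost of a lower reward-model score.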
Key Implementation Details from the Source Code
The repository emphasizes three critical knobs for stable RLHF training:
| Component | Implementation | Location |
|---|---|---|
| Reward Model | Linear scoring of token bags; learns from pairwise preferences | code/main.py::train_rm |
| KL Regularization | beta * KL(π_θ‖π_ref) keeps the policy near the SFT baseline | code/main.py::rlhf_loop |
| Advantage Normalization | Centered-scaled rewards (adv = (r - mean)/sd) stabilize PPO updates | code/main.py::rlhf_loop |
The docs/en.md file expands on these implementations, explaining the mathematical rationale behind the KL penalty as a safeguard against reward hacking and detailing how advantage normalization prevents gradient instability during policy updates.
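As a reference point, the KL-regularized objective is commonly written as below. This is the standard textbook formulation, stated here as an assumption about what the repository's documentation covers rather than a quotation from docs/en.md.

```latex
\[
\max_{\theta}\; \mathbb{E}_{y \sim \pi_\theta}\big[R(y)\big]
  \;-\; \beta\,\mathrm{KL}\big(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big)
\]
% Over all distributions, the maximizer has the closed form
\[
\pi^{*}(y) \;\propto\; \pi_{\mathrm{ref}}(y)\,\exp\!\big(R(y)/\beta\big)
\]
% Advantage normalization over a batch of shaped rewards r_1, ..., r_n
\[
\mathrm{adv}_i \;=\; \frac{r_i - \mathrm{mean}(r)}{\mathrm{sd}(r)}
\]
```

The closed form shows why beta matters: as beta shrinks, the optimum concentrates on whatever the reward model scores highest, including its mistakes, and as beta grows the optimum approaches π_ref.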
Production-Scale RLHF with Hugging Face TRL
For real-world LLM alignment, the repository provides a production recipe using Hugging Face TRL. This scales the miniature implementation to actual transformer models:
# Targets the legacy TRL PPOTrainer API; argument names differ across TRL releases,
# so check the signatures in your installed version.
import torch
from trl import (RewardTrainer, PPOTrainer, RewardConfig, PPOConfig,
                 AutoModelForCausalLMWithValueHead)  # the value-head wrapper lives in trl
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
rm = AutoModelForSequenceClassification.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct", num_labels=1)

# preference_data = [{"prompt": ..., "chosen": ..., "rejected": ...}, ...]
reward_trainer = RewardTrainer(
    model=rm,
    tokenizer=tokenizer,
    train_dataset=preference_data,
    args=RewardConfig(output_dir="./rm", num_train_epochs=1, learning_rate=1e-5),
)
reward_trainer.train()

policy = AutoModelForCausalLMWithValueHead.from_pretrained("./sft-checkpoint")
ref = AutoModelForCausalLMWithValueHead.from_pretrained("./sft-checkpoint")  # frozen

ppo = PPOTrainer(
    config=PPOConfig(learning_rate=1.41e-5, batch_size=64,
                     init_kl_coef=0.05, target_kl=6.0, adap_kl_ctrl=True),
    model=policy, ref_model=ref, tokenizer=tokenizer,
)

# dataloader is assumed to yield batches that carry a "query_ids" tensor
for batch in dataloader:
    responses = ppo.generate(batch["query_ids"], max_new_tokens=128)
    rewards = rm(torch.cat([batch["query_ids"], responses], dim=-1)).logits[:, 0]
    stats = ppo.step(batch["query_ids"], responses, rewards)
    print(stats)  # contains mean_kl, clip_frac, value_loss
This production pipeline maintains the same three-stage structure: first training the reward model with RewardTrainer, then optimizing the policy with PPOTrainer while maintaining the KL divergence constraint through init_kl_coef and adaptive KL control.
Critical Hyperparameters for Stable Alignment
The rohitg00/ai-engineering-from-scratch implementation highlights specific parameters that control alignment stability:
- beta (KL coefficient): Controls the strength of the penalty for deviating from the reference policy. Values around 0.01 to 0.1 typically balance exploration with stability, as demonstrated in the rlhf_loop experiments.
- Advantage normalization: Centering and scaling rewards ((r - mean)/sd) before computing policy gradients prevents the variance explosion common in early RLHF training.
- Target KL: In production TRL setups, target_kl=6.0 with adap_kl_ctrl=True allows the algorithm to dynamically adjust the KL coefficient to maintain a specified divergence budget.
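To make the adaptive adjustment concrete, here is a minimal sketch of a proportional KL controller in the style described by Ziegler et al. (2019). The class name, the `horizon` value, and the clipping range of ±0.2 are assumptions for illustration; whether a given TRL release uses exactly this rule should be checked against its source.

```python
class AdaptiveKLControllerSketch:
    """Nudges the KL coefficient toward keeping the observed KL near a target."""

    def __init__(self, init_kl_coef, target, horizon):
        self.value = init_kl_coef  # current KL coefficient (beta)
        self.target = target       # desired KL between policy and reference
        self.horizon = horizon     # larger horizon means slower adjustment

    def update(self, current_kl, n_steps):
        # Relative error, clipped so one noisy batch cannot swing beta too far
        error = max(-0.2, min(0.2, current_kl / self.target - 1.0))
        self.value *= 1.0 + error * n_steps / self.horizon
        return self.value

ctl = AdaptiveKLControllerSketch(init_kl_coef=0.05, target=6.0, horizon=10000)
raised = ctl.update(current_kl=12.0, n_steps=64)   # KL above target: beta grows
lowered = ctl.update(current_kl=3.0, n_steps=64)   # KL below target: beta shrinks
print(raised, lowered)
```

The multiplicative update keeps beta positive and changes it by at most a fraction of a percent per batch at these settings, which is why the starting value still matters early in training.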
Summary
- RLHF implementation requires three distinct components: pairwise preference data, a Bradley-Terry reward model trained on preference comparisons, and a PPO-style policy optimizer with KL regularization.
- The ai-engineering-from-scratch repository provides a complete, runnable implementation in phases/09-reinforcement-learning/09-reward-modeling-rlhf/code/main.py using synthetic data to demonstrate the full pipeline.
- KL divergence regularization (beta * KL) is essential to prevent reward hacking and maintain alignment with the base model's capabilities.
- Production implementations use Hugging Face TRL's RewardTrainer and PPOTrainer to scale these concepts to billions of parameters, while preserving the same mathematical foundations.
- The docs/en.md documentation and outputs/skill-rlhf-architect.md artifact provide architectural guidance for extending this implementation to modern alternatives like DPO, GRPO, and RLAIF.
Frequently Asked Questions
What is the Bradley-Terry model in RLHF?
The Bradley-Terry model is a pairwise comparison framework that learns a scalar reward function R(y) such that the probability of preferring completion y_+ over y_- equals the logistic sigmoid of their score difference: P(y_+ > y_-) = σ(R(y_+) - R(y_-)). In the source code, this is implemented in train_rm as a linear scorer over bag-of-words token features, trained to minimize the negative log-likelihood of the observed preferences.
Why is KL divergence regularization necessary in RLHF?
KL divergence regularization prevents the policy from exploiting the reward model's imperfections—a phenomenon known as reward hacking. By penalizing the policy π_θ for deviating too far from the reference policy π_ref (the original SFT model) via the term beta * KL(π_θ‖π_ref), the optimization maintains linguistic coherence and factual grounding while still improving on the desired preferences. The rlhf_loop function demonstrates this by freezing π_ref and subtracting the KL penalty from the reward before computing advantages.
How does RLHF differ from DPO (Direct Preference Optimization)?
RLHF requires training a separate reward model first, then using reinforcement learning (PPO) to optimize the policy against that model. DPO (Direct Preference Optimization) bypasses the explicit reward model and RL loop by directly optimizing the policy on preference data using a closed-form loss that implicitly captures the reward. The docs/en.md file in the repository discusses DPO as a modern alternative that reduces memory requirements and training complexity, though RLHF remains the dominant approach for fine-grained control over model behavior.
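For comparison, the DPO loss for a single preference pair can be sketched in a few lines. The function name and the example log-probabilities are hypothetical; the formula follows the published DPO objective, where the inputs are summed token log-probabilities of each completion under the policy and the frozen reference model.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """-log sigmoid(beta * (chosen log-ratio - rejected log-ratio)) for one pair."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # Numerically stable softplus(-margin)
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Policy identical to the reference: the implicit reward margin is zero
print(dpo_loss(-5.0, -6.0, -5.0, -6.0))
# Policy has shifted probability toward the chosen completion: the loss drops
print(dpo_loss(-4.0, -7.0, -5.0, -6.0))
```

The quantity beta * (log π_θ(y) - log π_ref(y)) plays the role of the reward, so the expression has the same form as the Bradley-Terry loss used for the reward model, without a separately trained scorer.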
Can I use synthetic data for the reward model?
Yes, synthetic data is valid for prototyping and educational implementations. The ai-engineering-from-scratch repository uses synthetic (prompt, good, bad) triples generated from known token lists to demonstrate the train_rm functionality. However, production systems require high-quality human preferences or carefully validated AI-generated preferences (RLAIF) to ensure the reward model captures nuanced human values and avoids bias amplification.