How to Implement RLHF for LLM Alignment: A Complete Guide from Scratch

June 10, 2026 rohitg00/ai-engineering-from-scratch

Reinforcement Learning from Human Feedback (RLHF) aligns large language models by training a reward model on pairwise human preferences and then optimizing the policy with a PPO-style loop that penalizes divergence from a reference (SFT) model.

The rohitg00/ai-engineering-from-scratch repository contains a miniature, self-contained implementation of this three-stage pipeline in its Reward Modeling & RLHF lesson (Phase 9). This guide examines the actual source code to demonstrate how to implement RLHF for LLM alignment using both synthetic demonstrations and production-grade tools.

The Three-Stage RLHF Pipeline

The implementation in phases/09-reinforcement-learning/09-reward-modeling-rlhf/code/main.py mirrors the standard alignment pipeline established in Christiano et al. (2017) and Ouyang et al. (2022).

Stage 1: Preference Data Generation

RLHF begins with pairwise preference data. The reference code uses synthetic generation where PROMPTS, GOOD, and BAD tokens are sampled to form (prompt, preferred, rejected) triples.

In production environments, this data comes from human annotators or AI-generated preferences (RLAIF). The sample_pair function in code/main.py demonstrates the structure: for each prompt, you sample two completions and label which one is preferred.
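As a rough illustration of the triple structure described above, the sketch below builds a small synthetic preference dataset. The token lists, the `make_triple` and `make_dataset` helpers, and the dict keys are assumptions made for this example; they are stand-ins, not the repository's actual `PROMPTS`, `GOOD`, `BAD`, or `sample_pair` definitions.

```python
import random

# Hypothetical stand-ins for the lesson's PROMPTS, GOOD, and BAD lists.
PROMPTS = ["explain recursion", "summarize the article"]
GOOD = ["helpful", "clear", "accurate"]
BAD = ["rude", "vague", "wrong"]


def make_triple(rng, length=4):
    """Build one (prompt, preferred, rejected) record from the token lists."""
    prompt = rng.choice(PROMPTS)
    preferred = [rng.choice(GOOD) for _ in range(length)]
    rejected = [rng.choice(BAD) for _ in range(length)]
    return {"prompt": prompt, "chosen": preferred, "rejected": rejected}


def make_dataset(n, seed=0):
    """Return n synthetic preference records, reproducible for a given seed."""
    rng = random.Random(seed)
    return [make_triple(rng) for _ in range(n)]


if __name__ == "__main__":
    for record in make_dataset(3):
        print(record)
```

With human or AI annotators, only the labeling step changes: the two completions come from a model, and the `chosen`/`rejected` assignment comes from the annotator rather than from known token lists.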

Stage 2: Bradley-Terry Reward Model Training

The reward model (RM) learns to score completions using the Bradley-Terry pairwise logistic loss:

\[ L = -\log\sigma\big(R(y_+) - R(y_-)\big) \]

In the source code, train_rm implements a linear scorer w·bag(y) that learns positive weights for "good" tokens and negative weights for "bad" ones. The function iterates over preference pairs, computes the score difference through the score method, applies the sigmoid function, and minimizes the negative log-likelihood.
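For a linear scorer, the gradient of this loss has a simple closed form. The derivation below is a sketch under the assumption that R_w(y) = w·bag(y); the step size \eta and the shorthand \Delta are notation introduced here, and the repository's train_rm may organize the update differently.

```latex
\begin{aligned}
\Delta &= R_w(y_+) - R_w(y_-) = w \cdot \big(\mathrm{bag}(y_+) - \mathrm{bag}(y_-)\big) \\
L(w) &= -\log\sigma(\Delta) \\
\frac{\partial L}{\partial \Delta} &= -\big(1 - \sigma(\Delta)\big) \\
\nabla_w L &= -\big(1 - \sigma(\Delta)\big)\,\big(\mathrm{bag}(y_+) - \mathrm{bag}(y_-)\big) \\
w &\leftarrow w + \eta\,\big(1 - \sigma(\Delta)\big)\,\big(\mathrm{bag}(y_+) - \mathrm{bag}(y_-)\big)
\end{aligned}
```

The factor 1 - \sigma(\Delta) shrinks toward zero once the model already ranks the pair correctly with a large margin, so well-separated pairs contribute little to later updates.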


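# Train the synthetic reward model (run from the repository root)

import random
import sys

# The lesson directory name contains hyphens, so put it on the path and import main.py directly.
sys.path.insert(0, "phases/09-reinforcement-learning/09-reward-modeling-rlhf/code")
from main import train_rm, rm_accuracy

rng = random.Random(42)
w = train_rm(n_pairs=600, rng=rng)
print("Reward model pairwise accuracy:", rm_accuracy(w))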
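To make the reward-model stage concrete without the repository checked out, here is a minimal, self-contained sketch of Bradley-Terry training for a linear bag-of-tokens scorer. Everything in it (the token lists, `train_bt`, `accuracy`, the learning rate, and the way pairs are sampled) is an assumption for illustration and should not be read as the lesson's actual `train_rm` or `rm_accuracy`.

```python
import math
import random

GOOD = ["helpful", "clear", "accurate"]  # hypothetical token lists
BAD = ["rude", "vague", "wrong"]
VOCAB = GOOD + BAD


def bag(tokens):
    """Bag-of-tokens count vector over VOCAB."""
    return [tokens.count(t) for t in VOCAB]


def score(w, tokens):
    """Linear reward R(y) = w . bag(y)."""
    return sum(wi * xi for wi, xi in zip(w, bag(tokens)))


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sample_pair(rng):
    """Return (preferred, rejected); a random fourth token adds label noise."""
    preferred = [rng.choice(GOOD) for _ in range(3)] + [rng.choice(VOCAB)]
    rejected = [rng.choice(BAD) for _ in range(3)] + [rng.choice(VOCAB)]
    return preferred, rejected


def train_bt(n_pairs, rng, lr=0.1):
    """One SGD step per pair on L = -log sigmoid(R(y+) - R(y-))."""
    w = [0.0] * len(VOCAB)
    for _ in range(n_pairs):
        y_pos, y_neg = sample_pair(rng)
        diff = score(w, y_pos) - score(w, y_neg)
        g = 1.0 - sigmoid(diff)  # equals -dL/d(diff)
        x_pos, x_neg = bag(y_pos), bag(y_neg)
        w = [wi + lr * g * (p - n) for wi, p, n in zip(w, x_pos, x_neg)]
    return w


def accuracy(w, rng, n=200):
    """Fraction of fresh pairs where the preferred completion scores higher."""
    pairs = [sample_pair(rng) for _ in range(n)]
    return sum(score(w, p) > score(w, r) for p, r in pairs) / n


if __name__ == "__main__":
    weights = train_bt(500, random.Random(0))
    print(dict(zip(VOCAB, (round(x, 2) for x in weights))))
    print("pairwise accuracy:", accuracy(weights, random.Random(1)))
```

Pairwise accuracy on held-out pairs is the natural sanity check here, because the Bradley-Terry loss only constrains score differences, not absolute reward values.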

Stage 3: PPO-Style Policy Optimization

The final stage optimizes a policy π_θ using a PPO-style objective with KL regularization. The rlhf_loop function in code/main.py implements this core logic:

1. Sample a token from the current policy.
2. Score it using the trained reward model.
3. Subtract a KL penalty beta * KL(π_θ‖π_ref) to prevent divergence from the frozen reference policy π_ref (the original SFT model).
4. Update θ using the advantage computed from centered and scaled rewards (adv = (r - mean) / sd).
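The four steps above can be sketched for a toy softmax policy over a handful of tokens. This is a simplified policy-gradient loop written for illustration, not the repository's rlhf_loop: the function name `rlhf_sketch`, the uniform reference policy, the per-sample log-ratio estimate of the KL term, and all default hyperparameters are assumptions, and PPO's clipping is omitted.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def kl(p, q):
    """KL(p || q) for two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def rlhf_sketch(reward, updates=200, batch=32, beta=0.1, lr=0.5, seed=0):
    """reward[i] is the reward-model score of token i."""
    rng = random.Random(seed)
    n = len(reward)
    theta = [0.0] * n          # policy logits
    ref = softmax([0.0] * n)   # frozen reference policy (uniform here)
    for _ in range(updates):
        probs = softmax(theta)
        # Step 1: sample tokens from the current policy.
        actions = rng.choices(range(n), weights=probs, k=batch)
        # Steps 2-3: reward-model score minus a per-sample KL estimate.
        shaped = [reward[a] - beta * math.log(probs[a] / ref[a]) for a in actions]
        # Step 4: center and scale, then take a policy-gradient step.
        mean = sum(shaped) / batch
        sd = (sum((s - mean) ** 2 for s in shaped) / batch) ** 0.5
        grad = [0.0] * n
        for a, s in zip(actions, shaped):
            adv = (s - mean) / (sd + 1e-8)
            for j in range(n):
                # Gradient of log softmax: indicator(j == a) - probs[j].
                grad[j] += adv * ((1.0 if j == a else 0.0) - probs[j]) / batch
        theta = [t + lr * g for t, g in zip(theta, grad)]
    final = softmax(theta)
    return final, kl(final, ref)


if __name__ == "__main__":
    for beta in (0.01, 0.1, 1.0):
        probs, divergence = rlhf_sketch([1.0, 0.0, -1.0], beta=beta)
        print(beta, [round(p, 3) for p in probs], round(divergence, 3))
```

Larger beta values should leave the final policy closer to the uniform reference, which is the same trade-off the lesson's beta sweep is meant to expose.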


# Run the PPO-style RLHF loop for different beta values
# (run from the repository root; reuses w from the reward-model step)

import random
import sys

sys.path.insert(0, "phases/09-reinforcement-learning/09-reward-modeling-rlhf/code")
from main import rlhf_loop

for beta in (0.01, 0.1, 1.0):
    theta, history = rlhf_loop(w, updates=150, beta=beta, rng=random.Random(0))
    first, last = history[0], history[-1]
    print(f"beta={beta:<5}  RM start={first[1]:+.3f} KL start={first[2]:.3f}  "
          f"RM end={last[1]:+.3f} KL end={last[2]:.3f}")

Key Implementation Details from the Source Code

The repository emphasizes three critical knobs for stable RLHF training:

Component	Implementation	Location
Reward Model	Linear scoring of token bags; learns from pairwise preferences	`code/main.py::train_rm`
KL Regularization	`beta * KL(π_θ‖π_ref)` keeps the policy near the SFT baseline	`code/main.py::rlhf_loop`
Advantage Normalization	Centered and scaled rewards (`adv = (r - mean) / sd`) stabilize PPO updates	`code/main.py::rlhf_loop`

The docs/en.md file expands on these implementations, explaining the mathematical rationale behind the KL penalty as a safeguard against reward hacking and detailing how advantage normalization prevents gradient instability during policy updates.
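One standard way to see why the KL term guards against reward hacking is to write out the regularized objective and its optimum. The statement below is the textbook result for an unconstrained policy, given here as a sketch; Z is a normalizing constant introduced for this example, and docs/en.md may present the argument differently.

```latex
\begin{aligned}
J(\theta) &= \mathbb{E}_{y \sim \pi_\theta}\big[R(y)\big] \;-\; \beta\,\mathrm{KL}\big(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big) \\
\pi^{*}(y) &= \frac{1}{Z}\,\pi_{\mathrm{ref}}(y)\,\exp\!\big(R(y)/\beta\big),
\qquad Z = \sum_{y} \pi_{\mathrm{ref}}(y)\,\exp\!\big(R(y)/\beta\big)
\end{aligned}
```

As \beta grows, \pi^{*} approaches \pi_{\mathrm{ref}}; as \beta shrinks toward zero, \pi^{*} concentrates on the highest-reward completions, including any that score well only because of reward-model errors.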

Production-Scale RLHF with Hugging Face TRL

For real-world LLM alignment, the repository provides a production recipe using Hugging Face TRL. This scales the miniature implementation to actual transformer models:

# Targets the legacy TRL PPO API (roughly trl 0.11 and earlier); newer releases changed PPOTrainer.
import torch
from trl import (RewardTrainer, RewardConfig, PPOTrainer, PPOConfig,
                 AutoModelForCausalLMWithValueHead)  # the value-head wrapper ships with trl
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
rm = AutoModelForSequenceClassification.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct", num_labels=1)

# preference_data = [{"prompt": ..., "chosen": ..., "rejected": ...}, ...]

reward_trainer = RewardTrainer(
    model=rm,
    tokenizer=tokenizer,
    train_dataset=preference_data,
    args=RewardConfig(output_dir="./rm", num_train_epochs=1, learning_rate=1e-5),
)
reward_trainer.train()

policy = AutoModelForCausalLMWithValueHead.from_pretrained("./sft-checkpoint")
ref    = AutoModelForCausalLMWithValueHead.from_pretrained("./sft-checkpoint")  # frozen

ppo = PPOTrainer(
    config=PPOConfig(learning_rate=1.41e-5, batch_size=64,
                     init_kl_coef=0.05, target=6.0,  # target: KL goal for the adaptive controller
                     adap_kl_ctrl=True),
    model=policy, ref_model=ref, tokenizer=tokenizer,
)

# dataloader: yields batches whose "query_ids" is a list of 1-D LongTensors
for batch in dataloader:
    queries   = batch["query_ids"]
    responses = ppo.generate(queries, return_prompt=False, max_new_tokens=128)
    with torch.no_grad():  # score each prompt + response with the reward model
        rewards = [rm(torch.cat([q, r]).unsqueeze(0)).logits[0, 0]
                   for q, r in zip(queries, responses)]
    stats = ppo.step(queries, responses, rewards)  # expects lists of tensors
    print(stats)  # dict of KL, clip-fraction, and value-loss statistics

This production pipeline maintains the same three-stage structure: first training the reward model with RewardTrainer, then optimizing the policy with PPOTrainer while maintaining the KL divergence constraint through init_kl_coef and adaptive KL control.

Critical Hyperparameters for Stable Alignment

The rohitg00/ai-engineering-from-scratch implementation highlights specific parameters that control alignment stability:

beta (KL coefficient): Controls the strength of the penalty for deviating from the reference policy. Values around 0.01 to 0.1 typically balance exploration with stability, as demonstrated in the rlhf_loop experiments.
Advantage normalization: Centering and scaling rewards ((r - mean) / sd) before computing policy gradients keeps the update magnitude independent of the reward model's scale, which reduces the gradient variance common in early RLHF training.
Target KL: In the TRL recipe above, a target KL of 6.0 together with adap_kl_ctrl=True lets the algorithm raise or lower the KL coefficient during training so that the measured divergence stays near that budget.
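The adaptive adjustment mentioned above can be sketched as a small proportional controller in the style of Ziegler et al. (2019). The class below is an illustration written for this guide, not TRL's source; the `horizon` value and the 0.2 clipping bound are assumptions chosen to match commonly cited defaults.

```python
class AdaptiveKLController:
    """Nudges the KL coefficient so the observed KL tracks a target value."""

    def __init__(self, init_kl_coef=0.05, target=6.0, horizon=10000):
        self.value = init_kl_coef
        self.target = target
        self.horizon = horizon

    def update(self, current_kl, n_steps):
        # Relative error, clipped so a single batch cannot swing the coefficient.
        error = max(-0.2, min(0.2, current_kl / self.target - 1.0))
        self.value *= 1.0 + error * n_steps / self.horizon
        return self.value


if __name__ == "__main__":
    ctl = AdaptiveKLController()
    for observed_kl in (12.0, 9.0, 6.0, 3.0):
        print(observed_kl, round(ctl.update(observed_kl, n_steps=64), 6))
```

When the measured KL sits above the target the coefficient grows, tightening the penalty; when it sits below, the coefficient decays and the policy is allowed to move further from the reference.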

Summary

RLHF implementation requires three distinct components: pairwise preference data, a Bradley-Terry reward model trained on preference comparisons, and a PPO-style policy optimizer with KL regularization.
The ai-engineering-from-scratch repository provides a complete, runnable implementation in phases/09-reinforcement-learning/09-reward-modeling-rlhf/code/main.py using synthetic data to demonstrate the full pipeline.
KL divergence regularization (beta * KL) is essential to prevent reward hacking and maintain alignment with the base model's capabilities.
Production implementations use Hugging Face TRL's RewardTrainer and PPOTrainer to scale these concepts to billions of parameters, while preserving the same mathematical foundations.
The docs/en.md documentation and outputs/skill-rlhf-architect.md artifact provide architectural guidance for extending this implementation to modern alternatives like DPO, GRPO, and RLAIF.

Frequently Asked Questions

What is the Bradley-Terry model in RLHF?

The Bradley-Terry model is a pairwise comparison framework. In RLHF it is used to learn a scalar reward function R(y) such that the probability of preferring completion y_+ over y_- equals the logistic sigmoid of their score difference: P(y_+ > y_-) = σ(R(y_+) - R(y_-)). In the source code, this is implemented in train_rm as a linear scorer over bag-of-tokens features, trained to minimize the negative log-likelihood of the observed preferences.
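A short numeric check can make the formula tangible. The reward values below are arbitrary numbers picked for illustration, and the helper names are hypothetical.

```python
import math


def preference_probability(r_pos, r_neg):
    """P(y+ preferred over y-) under Bradley-Terry."""
    return 1.0 / (1.0 + math.exp(-(r_pos - r_neg)))


def bt_loss(r_pos, r_neg):
    """Negative log-likelihood of the observed preference."""
    return -math.log(preference_probability(r_pos, r_neg))


print(f"{preference_probability(1.0, -1.0):.3f}")  # 0.881
print(f"{bt_loss(1.0, -1.0):.3f}")                 # 0.127
print(f"{bt_loss(-1.0, 1.0):.3f}")                 # 2.127, the ranking is wrong
```

Only the difference R(y_+) - R(y_-) enters the loss, so adding a constant to every reward leaves both the probability and the loss unchanged.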

Why is KL divergence regularization necessary in RLHF?

KL divergence regularization keeps the policy from exploiting the reward model's imperfections, a failure mode known as reward hacking. The term beta * KL(π_θ‖π_ref) penalizes the policy π_θ for drifting far from the reference policy π_ref (the original SFT model), so optimization can improve on the desired preferences while staying close to the fluent, grounded behavior the SFT model already has. The rlhf_loop function demonstrates this by freezing π_ref and subtracting the KL penalty from the reward before computing advantages.

How does RLHF differ from DPO (Direct Preference Optimization)?

Classic PPO-based RLHF trains a separate reward model first, then uses reinforcement learning (PPO) to optimize the policy against that model. DPO (Direct Preference Optimization) skips the explicit reward model and the RL loop: it optimizes the policy directly on preference data with a closed-form loss in which the reward is implicit in the log-probability ratio between the policy and the reference model. The docs/en.md file in the repository discusses DPO as a modern alternative that reduces memory requirements and training complexity, while PPO-based RLHF remains an option when an explicit, reusable reward model is wanted.
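For comparison with the Bradley-Terry reward loss, here is a minimal sketch of the per-pair DPO loss from Rafailov et al. (2023), taking sequence log-probabilities as plain floats. The function name, argument names, and default beta are assumptions for this example; the repository's documentation may present DPO differently.

```python
import math


def dpo_loss(logp_pos, logp_neg, ref_logp_pos, ref_logp_neg, beta=0.1):
    """-log sigmoid(beta * margin), where the margin compares policy/reference log-ratios."""
    margin = (logp_pos - ref_logp_pos) - (logp_neg - ref_logp_neg)
    # log(1 + exp(-z)) equals -log(sigmoid(z)).
    return math.log1p(math.exp(-beta * margin))


if __name__ == "__main__":
    # Policy identical to the reference: the loss is log(2).
    print(dpo_loss(-5.0, -7.0, -5.0, -7.0))
    # Policy raised the preferred completion's log-probability: the loss drops.
    print(dpo_loss(-3.0, -7.0, -5.0, -7.0))
```

The structure mirrors the reward-model loss, with beta times the log-ratio log π_θ(y) - log π_ref(y) playing the role of R(y), which is why no separate reward model is trained.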

Can I use synthetic data for the reward model?

Yes, synthetic data is valid for prototyping and educational implementations. The ai-engineering-from-scratch repository uses synthetic (prompt, good, bad) triples generated from known token lists to demonstrate the train_rm functionality. However, production systems require high-quality human preferences or carefully validated AI-generated preferences (RLAIF) to ensure the reward model captures nuanced human values and avoids bias amplification.
