Definition
RLHF (Reinforcement Learning from Human Feedback) is a training technique that uses human preference judgments to guide LLM behavior. Instead of showing the model a single "correct" answer for each prompt, as supervised learning does, RLHF trains the model to produce outputs that humans rate as better, which captures nuanced human values that are hard to specify explicitly.
The Core Insight
Human preferences are easier to elicit than human-written ideal answers:
- Hard: "Write the perfect response to this question"
- Easy: "Which of these two responses is better: A or B?"
RLHF exploits this asymmetry.
Impact of RLHF
RLHF was the key technique behind:
- InstructGPT (2022): an early, widely cited demonstration of an RLHF-aligned LLM
- ChatGPT: made LLMs practically useful for consumers
- Claude (Anthropic): uses Constitutional AI + RLHF
- GPT-4: RLHF + undisclosed safety training
- LLaMA 2 Chat: open-weights RLHF demonstration
Related terms: Alignment, Fine-Tuning, Reward Model, DPO, PPO, SFT, Instruct Model, Sycophancy
The Three-Phase RLHF Pipeline
Phase 1: Supervised Fine-Tuning (SFT)
- Start with a pre-trained base model
- Fine-tune on a dataset of (prompt, human-written ideal response) pairs
- Creates a solid starting point for the RL phase
- Typically 10K–100K examples
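As a rough illustration of the SFT objective, the sketch below averages the negative log-likelihood over response tokens only. Excluding prompt tokens from the loss is a common convention assumed here, not something this entry specifies, and `sft_loss` with its inputs are hypothetical names; in practice the per-token log-probabilities come from the model being fine-tuned.

```python
def sft_loss(token_logprobs, is_response_token):
    """Mean negative log-likelihood over response tokens only.

    token_logprobs: log-probability the model assigned to each target token.
    is_response_token: True for tokens of the human-written response,
    False for prompt tokens (assumed to be excluded from the loss).
    """
    response_logprobs = [
        lp for lp, keep in zip(token_logprobs, is_response_token) if keep
    ]
    if not response_logprobs:
        raise ValueError("example contains no response tokens")
    return -sum(response_logprobs) / len(response_logprobs)


# Toy example: two prompt tokens followed by three response tokens.
logprobs = [-2.0, -1.5, -0.5, -0.25, -0.75]
mask = [False, False, True, True, True]
loss = sft_loss(logprobs, mask)
```

Only the last three values contribute, so a poorly predicted prompt does not affect the loss in this sketch.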
Phase 2: Reward Model Training
1. Sample multiple responses from the SFT model for the same prompt
2. Human raters rank/compare these responses (pairwise comparisons)
3. Train a separate reward model (RM) to predict human preference scores
4. The RM takes (prompt, response) → scalar reward score
5. Once trained, the RM is frozen and reused as the scoring function in Phase 3
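When raters rank several responses to one prompt, the ranking can be expanded into pairwise comparisons for reward model training. The sketch below assumes the list is ordered best to worst; `ranking_to_pairs` is a hypothetical helper, not part of any particular library.

```python
from itertools import combinations


def ranking_to_pairs(prompt, ranked_responses):
    """Expand a best-to-worst ranking into (prompt, chosen, rejected) triples."""
    # combinations() keeps the input order, so the first element of each
    # pair is always ranked above the second.
    return [
        (prompt, better, worse)
        for better, worse in combinations(ranked_responses, 2)
    ]


pairs = ranking_to_pairs("Explain RLHF.", ["resp_a", "resp_b", "resp_c"])
```

A ranking of K responses yields K × (K − 1) / 2 comparisons, so three ranked responses give three training pairs.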
Key formula (Bradley-Terry preference model):
```
P(response_A > response_B) = sigmoid(RM(A) - RM(B))
Loss = -log P(chosen > rejected)
```
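The formula above can be sketched in a few lines of plain Python. The reward values are stand-ins for reward model outputs, and the function names are illustrative; a real implementation would compute the same quantity on batched tensors.

```python
import math


def pairwise_loss(reward_chosen, reward_rejected):
    """Negative log-likelihood of the human's choice: -log sigmoid(margin).

    Written as softplus(-margin) and split by sign so that math.exp never
    receives a large positive argument.
    """
    margin = reward_chosen - reward_rejected
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def preference_probability(reward_a, reward_b):
    """Bradley-Terry probability that response A is preferred over B."""
    return math.exp(-pairwise_loss(reward_a, reward_b))


# The loss is small when the RM already ranks the chosen response higher,
# and large when it ranks the pair the wrong way round.
agrees = pairwise_loss(reward_chosen=2.0, reward_rejected=-1.0)
disagrees = pairwise_loss(reward_chosen=-1.0, reward_rejected=2.0)
```

Only the difference between the two scores matters, so the reward model's absolute scale is not pinned down by this loss.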
Phase 3: RL Policy Optimization (PPO)
1. Use the SFT model as the starting policy (π_SFT)
2. For each prompt:
   - Generate a response with the current policy (π_θ)
   - Score it with the frozen reward model → scalar reward
   - Compute PPO loss to maximize reward
3. Add a KL divergence penalty to prevent the model from drifting too far from SFT:
```
Loss = -E[RM(response)] + β × KL(π_θ || π_SFT)
```
4. Iteratively update the policy to maximize reward while staying close to the SFT model
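The KL-penalized objective can be illustrated with a single-sample estimate. In this sketch the log-probabilities are made-up numbers, `kl_penalized_reward` is a hypothetical name, and `beta=0.1` is an illustrative value rather than a recommended setting. Some implementations apply the penalty per token; this version works at the sequence level for brevity.

```python
def kl_penalized_reward(rm_score, logprob_policy, logprob_sft, beta=0.1):
    """Single-sample estimate of RM(response) - beta * KL(pi_theta || pi_SFT).

    logprob_policy and logprob_sft are the log-probabilities of the same
    sampled response under the current policy and the frozen SFT model.
    Their difference is a one-sample estimate of the KL term because the
    response was drawn from the current policy.
    """
    kl_estimate = logprob_policy - logprob_sft
    return rm_score - beta * kl_estimate


# Same reward-model score, different amounts of drift from the SFT model.
drifted = kl_penalized_reward(rm_score=2.0, logprob_policy=-5.0, logprob_sft=-25.0)
anchored = kl_penalized_reward(rm_score=2.0, logprob_policy=-5.0, logprob_sft=-5.0)
```

The response that the SFT model considered very unlikely loses its entire reward here, while the response both models rate equally keeps it.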
Why the KL Penalty?
Without it, the policy tends to reward-hack: it finds degenerate outputs that score highly under the reward model but are nonsensical or harmful from a human's point of view. The KL term anchors the policy to the reasonable language of the SFT model.
RLHF Architecture Diagram
```
Prompt → [Policy LLM π_θ] → Response
                               ↓
                      [Reward Model RM] → Score
                               ↓
                      [PPO Optimizer] → Update π_θ
                               ↑
                      [KL Penalty vs π_SFT]
```
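The loop in the diagram can be mimicked on a toy problem. The sketch below is not PPO: it assumes the "policy" is a softmax over three fixed candidate responses, so the KL-penalized objective can be maximized by plain gradient ascent on the exact expectation instead of sampled rollouts. All numbers and names are hypothetical.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_toy_policy(rewards, ref_probs, beta=0.5, lr=0.5, steps=5000):
    """Gradient ascent on E[reward] - beta * KL(policy || reference)."""
    logits = [math.log(p) for p in ref_probs]  # start at the reference policy
    for _ in range(steps):
        probs = softmax(logits)
        # KL-shaped reward of each candidate under the current policy.
        shaped = [
            r - beta * math.log(p / q)
            for r, p, q in zip(rewards, probs, ref_probs)
        ]
        baseline = sum(p * s for p, s in zip(probs, shaped))
        # Exact gradient of the objective with respect to each logit.
        grads = [p * (s - baseline) for p, s in zip(probs, shaped)]
        logits = [x + lr * g for x, g in zip(logits, grads)]
    return softmax(logits)


rewards = [1.0, 0.0, 2.0]    # hypothetical reward-model scores
ref_probs = [0.5, 0.3, 0.2]  # hypothetical SFT policy over the candidates
policy = train_toy_policy(rewards, ref_probs)
```

For this objective the optimum is known in closed form, proportional to `ref_probs[i] * exp(rewards[i] / beta)`, which gives a convenient check: the trained policy shifts probability toward the highest-reward candidate without abandoning the others entirely.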
Data Requirements
| Phase | Data Type | Typical Quantity |
|-------|-----------|---------|
| SFT | (prompt, ideal response) | 10K–100K |
| Reward Model | (prompt, response_A, response_B, preference) | 100K–1M comparisons |
| PPO | Prompts only (responses generated on-the-fly) | 10K–100K prompts |
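The three data types in the table could be represented with record types like the following. The class and field names are hypothetical and chosen only to mirror the table columns.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SFTExample:
    prompt: str
    ideal_response: str  # human-written demonstration


@dataclass(frozen=True)
class PreferenceComparison:
    prompt: str
    response_a: str
    response_b: str
    preferred: str  # "a" or "b", as chosen by the human rater

    def chosen_and_rejected(self):
        """Return (chosen, rejected) in the order the reward model loss expects."""
        if self.preferred == "a":
            return self.response_a, self.response_b
        if self.preferred == "b":
            return self.response_b, self.response_a
        raise ValueError(f"preferred must be 'a' or 'b', got {self.preferred!r}")


@dataclass(frozen=True)
class PPOPrompt:
    prompt: str  # responses are generated on the fly during RL


comparison = PreferenceComparison(
    prompt="Explain RLHF.",
    response_a="A short, accurate summary.",
    response_b="An off-topic reply.",
    preferred="a",
)
chosen, rejected = comparison.chosen_and_rejected()
```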
Limitations of RLHF
| Limitation | Description |
|------------|-------------|
| Expensive | Requires many human hours for preference labeling |
| Reward hacking | Policy finds loopholes in the reward model |
| Reward model overfitting | RM is fit to a finite set of comparisons and may not generalize to prompts or responses unlike its training data |
| PPO instability | RL training is sensitive to hyperparameters |
| Sycophancy | Model learns to flatter or agree with the user rather than be truthful |
| Annotator disagreement | Human raters often disagree on preferences |
Alternatives to RLHF
| Method | Key Difference | Advantage |
|--------|---------------|-----------|
| DPO (Direct Preference Optimization) | No reward model; optimizes directly on preference pairs | Simpler, more stable |
| ORPO (Odds Ratio Preference Optimization) | Combines SFT and preference optimization in one step | Single training stage |
| KTO (Kahneman-Tversky Optimization) | Uses binary (good/bad) labels instead of pairwise comparisons | Easier data collection |
| RLAIF (RL from AI Feedback) | Uses an AI model as the preference judge instead of humans | Scales without human bottleneck |
DPO vs. RLHF
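DPO (2023) reparameterizes the RLHF objective to work directly on the LLM without a separate reward model. Here π_ref is a frozen reference policy, typically the SFT model, and every probability is conditioned on the prompt:
```
DPO Loss = -log σ(β × log(π_θ(chosen | prompt)/π_ref(chosen | prompt)) - β × log(π_θ(rejected | prompt)/π_ref(rejected | prompt)))
```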
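Practically: same preference data, simpler training (no reward model and no sampling from the policy during training), and results often reported as comparable to PPO-based RLHF.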
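A minimal sketch of the DPO loss for one preference pair follows. It assumes each input is the summed log-probability of a full response given the prompt; the function name, the example numbers, and `beta=0.1` are illustrative assumptions.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (prompt, chosen, rejected) triple.

    Each argument is the log-probability of a full response given the
    prompt, under the trainable policy or the frozen reference model.
    """
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)), computed as a numerically stable softplus(-margin).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy identical to the reference: the margin is zero.
at_reference = dpo_loss(-12.0, -15.0, -12.0, -15.0)
# Policy has raised the chosen response and lowered the rejected one.
improved = dpo_loss(-10.0, -18.0, -12.0, -15.0)
```

The loss has the same form as the Bradley-Terry reward model loss, with β × log(π_θ/π_ref) playing the role of the reward score.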