RLHF (Reinforcement Learning from Human Feedback)

Definition

RLHF is a training technique that uses human preference judgments to guide LLM behavior. Instead of showing the model a single "correct" answer for each prompt (as in supervised learning), RLHF trains the model to produce outputs that humans rate as better, which captures nuanced human values that are hard to specify explicitly.

The Core Insight

Human preferences are easier to elicit than human-written ideal answers:

- Hard: "Write the perfect response to this question"
- Easy: "Which of these two responses is better: A or B?"

RLHF exploits this asymmetry.

The Three-Phase RLHF Pipeline

Phase 1: Supervised Fine-Tuning (SFT)

- Start with a pre-trained base model
- Fine-tune on a dataset of (prompt, human-written ideal response) pairs
- Creates a solid starting point for the RL phase
- Typically 10K–100K examples
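To make the SFT objective concrete, here is a minimal sketch. It assumes the common setup in which the loss is the average negative log-likelihood of the response tokens only, with prompt tokens masked out; not every SFT recipe does this. The function name `sft_loss` and the toy log-probabilities are illustrative and do not come from any particular library.

```python
def sft_loss(token_logprobs, is_response_token):
    """Average negative log-likelihood over response tokens only.

    token_logprobs: log-probability the model assigned to each target token.
    is_response_token: True for response tokens, False for prompt tokens.
    """
    response_logprobs = [
        lp for lp, keep in zip(token_logprobs, is_response_token) if keep
    ]
    if not response_logprobs:
        raise ValueError("sequence contains no response tokens")
    return -sum(response_logprobs) / len(response_logprobs)


# Toy sequence (made-up numbers): 2 prompt tokens, then 3 response tokens.
logprobs = [-2.0, -1.5, -0.5, -0.25, -0.75]
mask = [False, False, True, True, True]
loss = sft_loss(logprobs, mask)
print(loss)
```

Only the last three values contribute to the loss, so the model is not penalized for how well it predicts the prompt itself.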

Phase 2: Reward Model Training

1. Sample multiple responses from the SFT model for the same prompt

2. Human raters rank/compare these responses (pairwise comparisons)

3. Train a separate reward model (RM) to predict human preference scores

4. The RM takes (prompt, response) → scalar reward score

5. Once trained, the RM is frozen and used to score responses in Phase 3

Key formula (Bradley-Terry preference model), where RM(A) is shorthand for RM(prompt, A):

P(response_A > response_B) = sigmoid(RM(A) - RM(B))

Loss = -log P(chosen > rejected)
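The two formulas above can be sketched in a few lines of Python. This is an illustration under simple assumptions, not a training loop: the scores are hypothetical scalars standing in for RM(prompt, response), and the helper names are made up for this example.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def preference_probability(score_a, score_b):
    """P(A preferred over B) under the Bradley-Terry model."""
    return sigmoid(score_a - score_b)


def reward_model_loss(score_chosen, score_rejected):
    """Negative log-likelihood of the human-preferred response."""
    return -math.log(preference_probability(score_chosen, score_rejected))


# Hypothetical reward scores for one comparison.
p_equal = preference_probability(1.0, 1.0)   # identical scores
loss_correct = reward_model_loss(2.0, 0.0)   # RM agrees with the rater
loss_wrong = reward_model_loss(0.0, 2.0)     # RM disagrees with the rater
print(p_equal, loss_correct, loss_wrong)
```

Only the difference between the two scores matters, so the reward model's absolute scale is not pinned down by this loss. The loss is small when the RM ranks the chosen response higher and large when it ranks the pair the wrong way around.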

Phase 3: RL Policy Optimization (PPO)

1. Use the SFT model as the starting policy (π_SFT)

2. For each prompt:

- Generate a response with the current policy (π_θ)

- Score it with the frozen reward model → scalar reward

- Compute PPO loss to maximize reward

3. Add a KL divergence penalty to prevent the policy from drifting too far from the SFT model:

Loss = -E[RM(response)] + β × KL(π_θ || π_SFT)

4. Iteratively update the policy to maximize reward while staying close to the SFT model
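The steps above can be sketched for a single sampled response. The first function assumes per-token log-probabilities are available from both the current policy and the frozen SFT model, and estimates the KL term as the sum of log-ratios over the sampled tokens (a single-sample estimate). The second shows PPO's clipped surrogate for one action. All names are hypothetical, and `clip_eps=0.2` is a commonly used value assumed here rather than taken from the text.

```python
import math


def kl_penalized_reward(rm_score, policy_logprobs, sft_logprobs, beta):
    """RM score minus beta times a sampled estimate of KL(pi_theta || pi_SFT).

    The KL estimate is the sum over response tokens of
    log pi_theta(token) - log pi_SFT(token), evaluated on tokens that were
    sampled from pi_theta.
    """
    kl_estimate = sum(p - s for p, s in zip(policy_logprobs, sft_logprobs))
    return rm_score - beta * kl_estimate


def ppo_clipped_objective(new_logprob, old_logprob, advantage, clip_eps=0.2):
    """PPO clipped surrogate for one action (a quantity to maximize)."""
    ratio = math.exp(new_logprob - old_logprob)
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


# Toy numbers: the policy has drifted from the SFT model on two of three tokens.
policy_lp = [-0.5, -0.25, -0.25]
sft_lp = [-1.0, -0.75, -0.25]
shaped = kl_penalized_reward(rm_score=2.0, policy_logprobs=policy_lp,
                             sft_logprobs=sft_lp, beta=0.5)
print(shaped)

# A large probability ratio earns no extra credit beyond 1 + clip_eps.
capped = ppo_clipped_objective(new_logprob=0.0, old_logprob=-1.0, advantage=1.0)
print(capped)
```

The clipping is what keeps each policy update small. The KL term is a separate mechanism that keeps the policy close to the SFT model over the whole run.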

Why the KL Penalty?

Without it, the policy tends to reward hack: it finds degenerate outputs that score highly under the reward model but that humans would judge nonsensical or harmful. The KL term anchors the policy to the SFT model's distribution, which keeps its outputs close to reasonable language.
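A toy calculation can illustrate the effect of β. Over a finite set of candidate responses, the policy that maximizes E[reward] - β × KL(π || π_ref) has a known closed form: π*(y) is proportional to π_ref(y) × exp(reward(y) / β). The sketch below applies it to three made-up candidates, one of which is a degenerate output that the reward model overrates. The probabilities, rewards, and function name are all hypothetical.

```python
import math


def kl_regularized_optimum(ref_probs, rewards, beta):
    """Policy maximizing E[reward] - beta * KL(pi || pi_ref) over a finite set.

    Closed form: pi*(y) is proportional to pi_ref(y) * exp(reward(y) / beta).
    """
    max_r = max(rewards)  # subtract the max for numerical stability
    weights = [p * math.exp((r - max_r) / beta)
               for p, r in zip(ref_probs, rewards)]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical candidates: two sensible answers and one degenerate output
# that the reward model happens to overrate.
ref_probs = [0.60, 0.39, 0.01]
rewards = [1.0, 1.2, 5.0]

weak_anchor = kl_regularized_optimum(ref_probs, rewards, beta=0.1)
strong_anchor = kl_regularized_optimum(ref_probs, rewards, beta=10.0)
print([round(p, 3) for p in weak_anchor])
print([round(p, 3) for p in strong_anchor])
```

With a small β, nearly all probability mass moves to the overrated degenerate output. With a large β, the policy stays close to the reference distribution and the degenerate output remains rare.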

RLHF Architecture Diagram

    Prompt → [Policy LLM π_θ] → Response
                                   ↓
                          [Reward Model RM] → Score
                                   ↓
                          [PPO Optimizer] → Update π_θ
                                   ↑
                          [KL Penalty vs π_SFT]

Data Requirements

| Phase | Data Type | Typical Quantity |
|-------|-----------|------------------|
| SFT | (prompt, ideal response) | 10K–100K examples |
| Reward Model | (prompt, response_A, response_B, preference) | 100K–1M comparisons |
| PPO | Prompts only (responses generated on-the-fly) | 10K–100K prompts |
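The three data types in the table could be represented roughly as follows. This is a hypothetical schema for illustration only: the class and field names are assumptions, not a standard format.

```python
from dataclasses import dataclass


@dataclass
class SFTExample:
    prompt: str
    ideal_response: str  # written by a human demonstrator


@dataclass
class PreferenceComparison:
    prompt: str
    response_a: str
    response_b: str
    preferred: str  # "a" or "b", as chosen by a human rater

    def chosen_and_rejected(self):
        """Return (chosen, rejected), the order the reward-model loss expects."""
        if self.preferred == "a":
            return self.response_a, self.response_b
        if self.preferred == "b":
            return self.response_b, self.response_a
        raise ValueError("preferred must be 'a' or 'b'")


@dataclass
class PPOPrompt:
    prompt: str  # the response is sampled from the policy during training


comparison = PreferenceComparison(
    prompt="Explain RLHF in one sentence.",
    response_a="RLHF fine-tunes a model using human preference judgments.",
    response_b="RLHF is a kind of database.",
    preferred="a",
)
chosen, rejected = comparison.chosen_and_rejected()
print(chosen)
```

The PPO record carries no response at all, which reflects the table's note that responses are generated on-the-fly.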

Limitations of RLHF

| Limitation | Description |
|------------|-------------|
| Expensive | Requires many human hours for preference labeling |
| Reward hacking | Policy finds loopholes in the reward model |
| Reward model overfitting | RM may not generalize to all prompts |
| PPO instability | RL training is sensitive to hyperparameters |
| Sycophancy | Model learns to flatter/agree rather than be truthful |
| Annotator disagreement | Human raters often disagree on preferences |

Alternatives to RLHF

| Method | Key Difference | Advantage |
|--------|----------------|-----------|
| DPO (Direct Preference Optimization) | No separate reward model; optimizes directly on preference pairs | Simpler, more stable |
| ORPO | Combines SFT and preference optimization in one step | Single training stage |
| KTO | Uses binary (good/bad) labels instead of pairwise comparisons | Easier data collection |
| RLAIF | Uses an AI model as the preference judge instead of humans | Scales without human bottleneck |

DPO vs. RLHF

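DPO (2023) reparameterizes the RLHF objective so that the LLM is optimized directly on preference pairs, without a separate reward model. In the loss below, π_ref is a frozen reference model (typically the SFT model), π(chosen) is the probability of the chosen response given the prompt, and β plays the same role as the KL coefficient in Phase 3: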

DPO Loss = -log σ(β × log(π_θ(chosen)/π_ref(chosen)) - β × log(π_θ(rejected)/π_ref(rejected)))

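In practice, DPO trains on the same preference data that a reward model would use and avoids the RL loop entirely. It is often reported to give comparable results, although outcomes vary by task and setup.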
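The DPO loss above can be sketched for a single preference pair. The inputs are assumed to be sequence-level log-probabilities, that is, log π(response | prompt) summed over tokens, from the policy and from the frozen reference model. The function name and the default `beta=0.1` are assumptions made for this example.

```python
import math


def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """DPO loss for one preference pair, from sequence log-probabilities."""
    chosen_logratio = policy_chosen_lp - ref_chosen_lp
    rejected_logratio = policy_rejected_lp - ref_rejected_lp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) written as softplus(-margin) for stability.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Toy log-probabilities (made-up numbers).
# At initialization the policy equals the reference, so the margin is zero.
at_init = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# After some training: chosen became more likely, rejected less likely.
improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)
print(at_init, improved)
```

The loss depends only on how the policy's log-probabilities have moved relative to the reference model, which is how DPO keeps the policy anchored without an explicit KL term.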

Impact of RLHF

RLHF was a key technique behind:

- InstructGPT (2022): an early, widely cited demonstration of RLHF for instruction following in LLMs
- ChatGPT: made LLMs practically useful for consumers
- Claude (Anthropic): uses Constitutional AI + RLHF
- GPT-4: RLHF plus additional safety training, with limited public detail
- LLaMA 2 Chat: openly released model with a documented RLHF recipe

Related Concepts

Alignment, Fine-Tuning, Reward Model, DPO, PPO, SFT, Instruct Model, Sycophancy