RLHF (Reinforcement Learning from Human Feedback)

Definition

RLHF is a training technique that uses human preference judgments to guide LLM behavior. Instead of telling the model what the "correct" answer is (supervised learning), RLHF trains the model to produce outputs that humans rate as better — capturing nuanced human values that are hard to specify explicitly.

The Core Insight

Human preferences are easier to elicit than human-written ideal answers:

  • Hard: "Write the perfect response to this question"
  • Easy: "Which of these two responses is better: A or B?"

RLHF exploits this asymmetry.

The Three-Phase RLHF Pipeline

Phase 1: Supervised Fine-Tuning (SFT)

  • Start with a pre-trained base model
  • Fine-tune on a dataset of (prompt, human-written ideal response) pairs
  • Creates a solid starting point for the RL phase
  • Typically 10K–100K examples
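
As a rough illustration of the SFT objective, the sketch below computes the average negative log-likelihood of the human-written response tokens. It assumes the common practice of excluding prompt tokens from the loss. The function name `sft_loss` and the toy probabilities are hypothetical; in a real setup the per-token probabilities would come from the model's logits.

```python
import math

def sft_loss(token_probs, is_response_token):
    """Mean negative log-likelihood over response tokens only.

    token_probs: probability the model assigned to each target token.
    is_response_token: True for tokens of the human-written response,
        False for prompt tokens (assumed to be excluded from the loss).
    """
    losses = [
        -math.log(p)
        for p, keep in zip(token_probs, is_response_token)
        if keep
    ]
    if not losses:
        raise ValueError("no response tokens to train on")
    return sum(losses) / len(losses)

# Toy example: two prompt tokens followed by three response tokens.
probs = [0.9, 0.8, 0.5, 0.25, 0.5]
mask = [False, False, True, True, True]
loss = sft_loss(probs, mask)
```

Raising the probability of any response token lowers this loss, which is what fine-tuning on (prompt, ideal response) pairs optimizes.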

Phase 2: Reward Model Training

  1. Sample multiple responses from the SFT model for the same prompt
  2. Human raters rank or compare these responses (typically as pairwise comparisons)
  3. Train a separate reward model (RM) to predict which response humans prefer
  4. The RM maps (prompt, response) → scalar reward score
  5. Once trained, the RM is frozen and used as the scorer in Phase 3

Key formula (Bradley-Terry preference model):

```
P(response_A > response_B) = sigmoid(RM(A) - RM(B))
Loss = -log P(chosen > rejected)
```
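
The Bradley-Terry formula can be tried out on plain numbers. The following is a minimal sketch in which the reward model scores are hand-picked floats rather than outputs of a trained network; the function names are illustrative, not taken from any library.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def preference_probability(score_a, score_b):
    """P(response_A > response_B) under the Bradley-Terry model."""
    return sigmoid(score_a - score_b)

def reward_model_loss(score_chosen, score_rejected):
    """Negative log-likelihood of the human's choice."""
    return -math.log(preference_probability(score_chosen, score_rejected))

# Hypothetical RM scores for two responses to the same prompt.
p = preference_probability(2.0, 0.5)

# Low loss when the RM ranks the chosen response higher,
# high loss when it ranks the rejected response higher.
loss_correct = reward_model_loss(2.0, 0.5)
loss_wrong = reward_model_loss(0.5, 2.0)
```

Only the difference between the two scores matters, so the reward model's absolute scale is not pinned down by this loss.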

Phase 3: RL Policy Optimization (PPO)

  1. Use the SFT model as the starting policy (π_SFT)
  2. For each prompt:
     - Generate a response with the current policy (π_θ)
     - Score it with the frozen reward model → scalar reward
     - Compute PPO loss to maximize reward
  3. Add a KL divergence penalty to prevent the model from drifting too far from SFT:

```
Loss = -E[RM(response)] + β × KL(π_θ || π_SFT)
```

  4. Iteratively update the policy to maximize reward while staying close to the SFT model
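
To make the KL-penalized objective concrete, the sketch below estimates it from a small batch of sampled responses. It is not a PPO implementation: clipping, advantage estimation, and the value function are all omitted. The function name and the numbers are hypothetical, and the log-probabilities are assumed to be whole-sequence log-probabilities of each sampled response under the current policy and the SFT model.

```python
def kl_penalized_loss(rm_scores, policy_logprobs, sft_logprobs, beta):
    """Monte Carlo estimate of -E[RM(response)] + beta * KL(pi_theta || pi_SFT).

    Each list holds one entry per response sampled from the current policy.
    """
    n = len(rm_scores)
    mean_reward = sum(rm_scores) / n
    # For samples drawn from pi_theta, the average of
    # (log pi_theta - log pi_SFT) is an estimate of the KL divergence.
    kl_estimate = sum(
        lp - ls for lp, ls in zip(policy_logprobs, sft_logprobs)
    ) / n
    return -mean_reward + beta * kl_estimate

# Two hypothetical sampled responses.
rm_scores = [1.0, 3.0]
policy_logprobs = [-2.0, -1.0]
sft_logprobs = [-2.5, -3.0]

loss = kl_penalized_loss(rm_scores, policy_logprobs, sft_logprobs, beta=0.1)
```

A larger β makes the same drift from the SFT model cost more, so the policy needs a bigger reward gain to justify moving away from it.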

Why the KL Penalty?

Without it, the policy tends to reward hack: it finds degenerate outputs that score highly under the reward model but that humans would judge nonsensical or harmful. The KL term anchors the policy to the SFT model's distribution, which keeps its outputs close to reasonable language.
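
A toy calculation can show the anchoring effect. For a fixed set of candidate responses, the policy that maximizes reward minus β times the KL divergence from a reference policy is proportional to `π_ref(y) × exp(r(y) / β)`. The sketch below applies that closed form to three made-up candidates; the probabilities and rewards are invented for illustration only.

```python
import math

def kl_regularized_optimum(ref_probs, rewards, beta):
    """Policy maximizing E[reward] - beta * KL(policy || reference)
    over a small, fixed set of candidate responses."""
    weights = [p * math.exp(r / beta) for p, r in zip(ref_probs, rewards)]
    total = sum(weights)
    return [w / total for w in weights]

# Three hypothetical candidates. The last is a degenerate output that the
# SFT model almost never produces, but that an imperfect reward model
# happens to score highly.
ref_probs = [0.60, 0.399, 0.001]
rewards = [1.0, 1.2, 3.0]

strong_anchor = kl_regularized_optimum(ref_probs, rewards, beta=1.0)
weak_anchor = kl_regularized_optimum(ref_probs, rewards, beta=0.1)
```

With β = 1.0 the degenerate candidate stays rare, while with β = 0.1 nearly all probability mass moves onto it, which is the reward-hacking failure the penalty is meant to limit.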

RLHF Architecture Diagram

```
Prompt → [Policy LLM π_θ] → Response
Response → [Reward Model RM] → Score
Score → [PPO Optimizer] → Update π_θ
[KL Penalty vs π_SFT] → [PPO Optimizer]
```

Data Requirements

| Phase | Data Type | Typical Quantity |
|-------|-----------|------------------|
| SFT | (prompt, ideal response) | 10K–100K examples |
| Reward Model | (prompt, response_A, response_B, preference) | 100K–1M comparisons |
| PPO | Prompts only (responses generated on-the-fly) | 10K–100K prompts |
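
The three data shapes in the table could be represented as simple records. The class and field names below are hypothetical, chosen only to mirror the table's columns; they do not correspond to any particular dataset format.

```python
from dataclasses import dataclass

@dataclass
class SFTExample:
    prompt: str
    ideal_response: str  # written by a human

@dataclass
class PreferenceComparison:
    prompt: str
    response_a: str
    response_b: str
    preferred: str  # "a" or "b", as judged by a human rater

    def chosen_and_rejected(self):
        """Return (chosen, rejected), the form the reward model loss uses."""
        if self.preferred == "a":
            return self.response_a, self.response_b
        if self.preferred == "b":
            return self.response_b, self.response_a
        raise ValueError("preferred must be 'a' or 'b'")

@dataclass
class PPOPrompt:
    prompt: str  # responses are sampled from the policy during training

comparison = PreferenceComparison(
    prompt="Explain RLHF in one sentence.",
    response_a="RLHF tunes a model using human preference judgments.",
    response_b="RLHF is a thing.",
    preferred="a",
)
chosen, rejected = comparison.chosen_and_rejected()
```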

Limitations of RLHF

| Limitation | Description |
|------------|-------------|
| Expensive | Requires many human hours for preference labeling |
| Reward hacking | Policy finds loopholes in the reward model |
| Reward model overfitting | RM may not generalize to all prompts |
| PPO instability | RL training is sensitive to hyperparameters |
| Sycophancy | Model learns to flatter/agree rather than be truthful |
| Annotator disagreement | Human raters often disagree on preferences |

Alternatives to RLHF

| Method | Key Difference | Advantage |
|--------|----------------|-----------|
| DPO (Direct Preference Optimization) | No reward model; optimizes directly on preference pairs | Simpler, more stable |
| ORPO | Combines SFT and preference optimization in one step | Single training stage |
| KTO | Uses binary (good/bad) labels instead of pairwise | Easier data collection |
| RLAIF | Uses AI model as preference judge instead of humans | Scales without human bottleneck |

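DPO vs. RLHF

DPO (2023) reparameterizes the RLHF objective so that the preference loss is applied directly to the LLM, with no separate reward model. Here π_ref is a frozen reference model (typically the SFT model) and β controls how strongly the policy is held to it:

```
DPO Loss = -log σ(β × log(π_θ(chosen | prompt)/π_ref(chosen | prompt)) - β × log(π_θ(rejected | prompt)/π_ref(rejected | prompt)))
```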
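
The DPO loss can be evaluated directly from four log-probabilities. This is a minimal sketch for a single preference pair, assuming whole-sequence log-probabilities are already available; the function name, the default β, and the example numbers are illustrative choices rather than values from any particular implementation.

```python
import math

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) pair.

    Arguments are log-probabilities of each full response under the
    trainable policy and under the frozen reference model.
    """
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) is the same as log(1 + exp(-margin)).
    return math.log1p(math.exp(-margin))

# At initialization the policy equals the reference, so the margin is zero.
loss_at_init = dpo_loss(-10.0, -12.0, -10.0, -12.0)

# After the policy raises the chosen response's log-probability and lowers
# the rejected one's relative to the reference, the loss decreases.
loss_improved = dpo_loss(-9.0, -13.0, -10.0, -12.0)
```

The loss depends only on how the policy's log-probabilities differ from the reference's, which is how the reference model plays the anchoring role that the KL penalty plays in PPO-based RLHF.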

In practice, DPO trains on the same preference data as RLHF with a simpler, single-stage loss, and is often reported to reach comparable or better results.

Impact of RLHF

RLHF was a key technique behind:

  • InstructGPT (2022): an early, widely cited demonstration of an RLHF-aligned LLM
  • ChatGPT: made LLMs practically useful for consumers
  • Claude (Anthropic): uses Constitutional AI + RLHF
  • GPT-4: RLHF plus further safety training that is not fully disclosed
  • LLaMA 2 Chat: an openly released model trained with RLHF

Related Concepts

  • Alignment, Fine-Tuning, Reward Model, DPO, PPO, SFT, Instruct Model, Sycophancy
