DPO (Direct Preference Optimization)

Definition

DPO (Direct Preference Optimization) is a simpler alternative to RLHF for aligning LLMs with human preferences. It directly fine-tunes the model on (chosen, rejected) response pairs without needing a separate reward model or reinforcement learning, yet achieves comparable or better alignment quality.

The Problem with RLHF

RLHF layers several moving parts on top of supervised fine-tuning:

1. Train a reward model (requires comparison data and a separate training run)
2. Run PPO (an RL algorithm that is often unstable and needs careful tuning)
3. Manage KL penalties, reward hacking, and a frozen reference model throughout

DPO collapses this into a single fine-tuning step.

DPO Core Insight

The KL-regularized RLHF objective has a closed-form optimal policy. DPO inverts that relationship: the reward can be written in terms of the policy and the reference model, so the policy can be fit to preference data directly, with no separately trained reward model.
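The reasoning behind this insight can be sketched in a few lines. This follows the usual derivation and assumes preferences follow a Bradley-Terry model; the partition function Z(x) and the reward r(x, y) are symbols introduced here for illustration.

```latex
% KL-constrained RLHF objective
\max_{\pi}\; \mathbb{E}_{x,\; y \sim \pi(\cdot \mid x)}\big[ r(x, y) \big]
  - \beta\, \mathrm{KL}\big( \pi(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big)

% Its closed-form optimum
\pi^{*}(y \mid x) = \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y \mid x)\, \exp\!\big( r(x, y) / \beta \big)

% Solving for the reward
r(x, y) = \beta \log \frac{\pi^{*}(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x)

% Bradley-Terry preference probability: the Z(x) terms cancel
p(y_w \succ y_l \mid x) = \sigma\big( r(x, y_w) - r(x, y_l) \big)
  = \sigma\!\Big( \beta \log \frac{\pi^{*}(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
                - \beta \log \frac{\pi^{*}(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \Big)
```

Because Z(x) depends only on the prompt, it drops out of the difference, which is why no reward model or partition function has to be estimated. Maximizing the log-likelihood of this probability over the preference data, with a trainable π_θ in place of π*, yields the DPO objective.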

The DPO Objective

```
L_DPO(θ) = -E_{(x, y_w, y_l)} [ log σ( β · log(π_θ(y_w|x) / π_ref(y_w|x))
                                     - β · log(π_θ(y_l|x) / π_ref(y_l|x)) ) ]
```

In plain English:

  • Increase the probability of the chosen (preferred) response relative to a reference model
  • Decrease the probability of the rejected response relative to a reference model
  • β controls how far the policy may drift from the reference model (a larger β keeps it closer)
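As a minimal sketch of the objective for a single triplet, the loss can be computed from four sequence log-probabilities. The function names and the example numbers below are illustrative assumptions, not part of any library; in practice each log-probability is the sum of per-token log-probabilities of the response given the prompt.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    return -math.log1p(math.exp(-z)) if z >= 0 else z - math.log1p(math.exp(z))


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for one (prompt, chosen, rejected) triplet.

    Each argument is a sequence log-probability log pi(y|x).
    """
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    return -log_sigmoid(margin)


# Policy identical to the reference: the margin is 0, so the loss is log(2).
loss_at_init = dpo_loss(-12.0, -15.0, -12.0, -15.0)

# Policy that has moved probability toward the chosen response
# and away from the rejected one (made-up numbers).
loss_after = dpo_loss(-10.0, -18.0, -12.0, -15.0)

print(loss_at_init, loss_after)
```

The loss only depends on how the policy's log-ratios differ from the reference's, so it starts at log(2) and falls as the chosen response gains probability relative to the rejected one.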

Training Data Format

DPO uses the same preference data as RLHF reward model training:

```
{
  "prompt": "What is the capital of France?",
  "chosen": "The capital of France is Paris.",
  "rejected": "I think it might be Lyon? Or maybe Nice?"
}
```

Each example = one prompt + one better response + one worse response.
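Before training, it can help to check that every record really has this shape. The following is a small sketch that assumes the data is stored as JSON Lines (one record per line); the helper name `load_preference_records` is hypothetical.

```python
import json

REQUIRED_KEYS = ("prompt", "chosen", "rejected")


def load_preference_records(lines):
    """Parse JSON Lines text into validated preference records."""
    records = []
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        missing = [k for k in REQUIRED_KEYS if not record.get(k)]
        if missing:
            raise ValueError(f"line {line_no}: missing or empty {missing}")
        if record["chosen"] == record["rejected"]:
            raise ValueError(f"line {line_no}: chosen and rejected are identical")
        records.append({k: record[k] for k in REQUIRED_KEYS})
    return records


sample = [
    '{"prompt": "What is the capital of France?", '
    '"chosen": "The capital of France is Paris.", '
    '"rejected": "I think it might be Lyon? Or maybe Nice?"}',
]
records = load_preference_records(sample)
print(len(records), records[0]["chosen"])
```

Rejecting identical chosen/rejected pairs is a design choice in this sketch: such a pair carries no preference signal, since the two log-ratios in the loss cancel.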

DPO Training Process

1. Load a fine-tuned SFT model as the reference model (frozen)
2. Initialize the policy model (same weights as reference, but trainable)
3. For each (prompt, chosen, rejected) triplet:
   - Compute log probabilities of chosen/rejected from policy model
   - Compute log probabilities of chosen/rejected from reference model
   - Compute DPO loss
   - Backpropagate through policy model only
4. Result: policy model prefers chosen over rejected responses
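The steps above can be walked through on a toy problem. This sketch is an assumption-laden stand-in for a real model: the "policy" is just two logits over two candidate responses to a single prompt, and the gradient is written out by hand instead of using an autograd library.

```python
import math


def log_softmax(logits):
    """Log-probabilities from a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


beta, lr = 0.1, 1.0
CHOSEN, REJECTED = 0, 1

# Step 1: frozen reference "model" (it slightly prefers the rejected response).
ref_logits = [0.0, 0.5]
ref_logp = log_softmax(ref_logits)

# Step 2: the policy starts from the same weights but is trainable.
policy_logits = list(ref_logits)

losses = []
for step in range(200):
    # Step 3: log-probabilities from the policy and the reference.
    pol_logp = log_softmax(policy_logits)
    margin = beta * ((pol_logp[CHOSEN] - ref_logp[CHOSEN])
                     - (pol_logp[REJECTED] - ref_logp[REJECTED]))
    # DPO loss: -log(sigmoid(margin)); the margin is never negative here.
    losses.append(math.log1p(math.exp(-margin)))
    # Gradient step on the policy only. For this toy softmax policy,
    # d(margin)/d(logit) is +beta for chosen and -beta for rejected.
    grad_margin = sigmoid(margin) - 1.0  # d(loss)/d(margin)
    policy_logits[CHOSEN] -= lr * grad_margin * beta
    policy_logits[REJECTED] -= lr * grad_margin * (-beta)

# Step 4: the policy now prefers the chosen response.
final_logp = log_softmax(policy_logits)
print(losses[0], losses[-1], final_logp)
```

The reference logits are never updated, which mirrors "backpropagate through policy model only". In a real setup the same loop runs over batches of triplets with an optimizer such as AdamW.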

DPO vs. RLHF Comparison

| Aspect | RLHF (PPO) | DPO |
|--------|------------|-----|
| Separate reward model | Required | Not needed |
| RL algorithm (PPO) | Required | Not needed |
| Training complexity | High | Low (just fine-tuning) |
| Stability | Notoriously unstable | Stable, like SFT |
| Memory | 2-4 models in memory | 2 models (policy + reference) |
| Hyperparameter sensitivity | Very high | Low |
| Quality | Strong | Comparable or better |
| Speed | Slow | Fast |

DPO Variants

IPO (Identity Preference Optimization)

  • Modification that prevents overfitting to the dataset
  • DPO can collapse chosen/rejected probabilities to 0/1; IPO prevents this

KTO (Kahneman-Tversky Optimization)

  • Uses binary good/bad labels instead of pairwise comparisons
  • Based on prospect theory (humans evaluate relative to a reference point)
  • Easier data collection (no need to compare two responses)

ORPO (Odds Ratio Preference Optimization)

  • Combines SFT and preference optimization in a single training step
  • No reference model needed
  • Single loss function does both

SimPO (Simple Preference Optimization)

  • No reference model, length-normalized reward
  • Simpler implementation, competitive quality

DPO / RLHF Hybrid

  • Start with DPO to get a good initialization, then refine with PPO
  • Some labs use this combined approach
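To make two of these variants concrete, here is a hedged sketch of the per-example IPO and SimPO losses as they are commonly written. The function names and the default values for `tau`, `beta`, and `gamma` are illustrative assumptions, not recommended settings.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    return -math.log1p(math.exp(-z)) if z >= 0 else z - math.log1p(math.exp(z))


def ipo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, tau=0.1):
    """IPO: squared error pulling the log-ratio gap toward 1 / (2 * tau)."""
    gap = ((policy_chosen_logp - ref_chosen_logp)
           - (policy_rejected_logp - ref_rejected_logp))
    return (gap - 1.0 / (2.0 * tau)) ** 2


def simpo_loss(policy_chosen_logp, policy_rejected_logp,
               chosen_len, rejected_len, beta=2.0, gamma=0.5):
    """SimPO: length-normalized log-probabilities, a target margin gamma,
    and no reference model."""
    chosen_reward = beta * policy_chosen_logp / chosen_len
    rejected_reward = beta * policy_rejected_logp / rejected_len
    return -log_sigmoid(chosen_reward - rejected_reward - gamma)


# IPO has a finite optimum: the loss is zero when the gap equals 1 / (2 * tau)
# and grows again if the policy overshoots it.
print(ipo_loss(-10.0, -18.0, -12.0, -15.0))   # gap of 5
print(ipo_loss(-5.0, -20.0, -12.0, -15.0))    # gap of 12

# SimPO needs only policy log-probabilities and response lengths in tokens.
print(simpo_loss(-10.0, -18.0, chosen_len=5, rejected_len=6))
```

The contrast with DPO is visible in the shape of the losses: the DPO loss keeps decreasing as the margin grows without bound, while the IPO loss penalizes gaps beyond its target, which is the mechanism behind its resistance to overfitting.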

When to Use DPO vs. RLHF

Use DPO when:

  • You want simpler, more stable training
  • You have pairwise preference data (or can generate it)
  • You have a single-GPU or small-team setup
  • You need rapid iteration on alignment

Use RLHF (PPO) when:

  • You need very fine-grained reward shaping
  • You have a very large-scale training setup
  • You need to optimize for complex, multi-dimensional rewards
  • You're training a frontier model with significant resources

DPO in Practice

Data Requirements

  • Need (prompt, chosen, rejected) triplets
  • Quality over quantity: a smaller set of clean pairs (say 10K) typically beats a much larger noisy set (say 100K)
  • Sources: human preference labels, AI-generated pairs (RLAIF), distillation from a stronger model
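
HuggingFace TRL DPO Trainer

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

# Placeholders: substitute your own SFT checkpoint and preference file.
sft_checkpoint = "path/to/sft-model"
model = AutoModelForCausalLM.from_pretrained(sft_checkpoint)      # trainable policy
ref_model = AutoModelForCausalLM.from_pretrained(sft_checkpoint)  # frozen reference
tokenizer = AutoTokenizer.from_pretrained(sft_checkpoint)
dataset = load_dataset("json", data_files="preferences.jsonl", split="train")

training_args = DPOConfig(
    output_dir="dpo-output",
    beta=0.1,  # KL penalty coefficient
    learning_rate=5e-7,
    per_device_train_batch_size=4,
    num_train_epochs=3,
)

dpo_trainer = DPOTrainer(
    model=model,
    ref_model=ref_model,  # frozen reference
    args=training_args,
    train_dataset=dataset,  # must have prompt, chosen, rejected columns
    tokenizer=tokenizer,  # recent TRL releases rename this to processing_class
)

dpo_trainer.train()
```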

Adoption

DPO and its variants are widely used for aligning open-weight models:

  • Llama 3 Instruct: DPO as part of post-training, alongside SFT and rejection sampling
  • Zephyr (Mistral fine-tune): DPO
  • Tulu 3: DPO + online preference optimization
  • Gemma Instruct: DPO

Related Concepts

  • RLHF, Alignment, Fine-Tuning, SFT, Preference Data, Instruct Model, LoRA
