In 2020, OpenAI released GPT-3 — a language model with 175 billion parameters, trained on hundreds of billions of words of text. The demonstrations were remarkable: the model could write poetry, answer questions, generate code, and continue essays in a wide range of styles. The performance on standard benchmarks was the best the field had seen.
It was also unreliable in deeply frustrating ways. Ask it to write a helpful tutorial and it might generate a tutorial, or it might continue writing something that sounded like a tutorial but was wrong, or it might generate off-topic content that happened to statistically follow from the prompt. Ask it to be harmless and it had no idea what that meant. Ask it to follow an instruction precisely and it might, or might not, depending on whether the instruction looked like something that would typically be followed in the training corpus.
The model was extraordinarily capable and essentially uncontrolled. It did what the data distribution suggested, not what the user wanted.
Reinforcement Learning from Human Feedback (RLHF) was the technique that changed this. When OpenAI deployed InstructGPT in 2022, followed by ChatGPT, the difference was stark: the models followed instructions, gave helpful and direct responses, declined harmful requests, and behaved consistently. The underlying language model capability was similar to GPT-3. What changed was training with RLHF.
"The key insight of RLHF is that it's much easier to compare two responses and say 'this one is better' than to write a perfect response from scratch. By collecting these comparisons at scale, you can train a reward model that predicts human preferences — and then you can use that reward model to make the language model's outputs dramatically better." — Paul Christiano, one of the original RLHF researchers, paraphrase from multiple talks
RLHF is now the standard method for aligning large language models with human intent. Every major AI assistant — ChatGPT, Claude, Gemini, and their successors — uses some form of RLHF or a close variant.
Key Definitions
Reinforcement Learning from Human Feedback (RLHF) — A training method that uses human preferences to fine-tune language models. The three-stage process: supervised fine-tuning on demonstration data, training a reward model from human comparison data, and fine-tuning the language model with reinforcement learning to maximize the reward model's score. Introduced by Christiano et al. (2017) and applied to language models by Stiennon et al. (2020) and Ouyang et al. (2022).
Alignment — The problem of ensuring AI systems pursue goals that are genuinely beneficial to humans, rather than pursuing proxies that diverge from human values. RLHF is the current primary method for alignment of large language models — aligning model behavior with human preferences about helpfulness, harmlessness, and honesty.
Supervised Fine-Tuning (SFT) — The first stage of RLHF: fine-tuning the base language model on a dataset of high-quality, instruction-following examples written by human labelers. SFT teaches the model to follow instructions and produce responses in the desired format and style, creating a baseline model for the subsequent RL stage.
Reward model — A neural network trained to predict human preference scores for model responses. Takes a prompt and response as input; outputs a scalar reward value. Trained on human comparison data (pairs of responses with human judgments of which is better). Used during RL fine-tuning as a proxy for human judgment.
Human comparison data — The data collected from human labelers comparing pairs of model outputs. Labelers see a prompt and two responses and indicate which is better, or rank a set of responses. This comparative data avoids the difficulty of asking labelers to assign absolute scores and produces a cleaner training signal.
Proximal Policy Optimization (PPO) — A reinforcement learning algorithm commonly used in RLHF to fine-tune the language model against the reward model. PPO updates the model to increase the probability of high-reward responses while constraining how far the updated model diverges from the SFT baseline. The constraint prevents the model from "reward hacking" its way to catastrophic divergence from reasonable language.
KL divergence penalty — A term in the PPO training objective that penalizes the fine-tuned model for diverging too much from the SFT baseline (measured by KL divergence). This prevents reward hacking: without the penalty, the model would quickly learn to maximize reward by producing degenerate outputs that exploit flaws in the reward model.
Reward hacking — A failure mode in which the language model discovers ways to maximize the reward model's score that do not correspond to genuinely good responses. For example, a model might learn that very long, confident-sounding responses score highly with the reward model, even when shorter and more accurate responses would be better. Reward hacking is the primary limitation of RLHF and the motivation for Constitutional AI approaches.
Constitutional AI (CAI) — Anthropic's extension of RLHF that uses AI-generated feedback to train the reward model, guided by a written constitution specifying the model's values and principles. CAI reduces the reliance on human labeling, makes the alignment criteria explicit and auditable, and scales more efficiently than pure human feedback. Reinforcement learning that uses this AI-generated preference data is called RLAIF (Reinforcement Learning from AI Feedback).
Direct Preference Optimization (DPO) — An alternative to PPO-based RLHF that directly optimizes the language model on the human comparison data without training a separate reward model. DPO is simpler to implement and often achieves comparable results to full RLHF with less computational overhead. Introduced by Rafailov et al. (2023).
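The DPO objective can be written in a few lines. Below is a minimal, framework-free sketch assuming the sequence log-probabilities under the policy and a frozen reference model have already been computed; the function name and the `beta=0.1` default are illustrative choices, not part of the original formulation:

```python
import math

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-example DPO loss from sequence log-probabilities.

    The implicit reward of a response is beta * (policy logp - reference logp);
    the loss is a logistic loss on the reward margin between the chosen
    and rejected responses, so no separate reward model is needed.
    """
    chosen_reward = beta * (policy_logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (policy_logp_rejected - ref_logp_rejected)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)), written in a numerically stable form
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy and reference assign identical log-probabilities, the margin is zero and the loss is log 2; raising the policy's probability of the chosen response lowers the loss.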
Helpful, Harmless, Honest (HHH) — Anthropic's specification of the three properties that RLHF-trained models should exhibit: responses should be helpful to the user's actual goals, harmless to users and third parties, and honest — not deceptive or misleading. The HHH framing has become widely influential in AI development.
The Three Stages of RLHF
Stage 1: Supervised Fine-Tuning (SFT)
The base language model — GPT-3, PaLM, LLaMA, or whatever foundation model is being aligned — is excellent at predicting next tokens but has no built-in notion of following instructions. Its training objective was to continue text, not to respond to queries.
SFT addresses this. Human contractors — sometimes called annotators or labelers — write high-quality demonstrations of the desired behavior: a user asks a question, and the annotator writes an ideal response. This demonstration data is used to fine-tune the base model on a supervised learning objective: given this prompt, produce this response.
The SFT model learns to follow the format and style of the demonstrations — to produce responses rather than arbitrary continuations. But SFT alone is insufficient. Writing ideal responses is difficult and expensive, and the quality of the SFT model is constrained by the quality and quantity of the demonstration data.
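The SFT objective itself is ordinary next-token cross-entropy, commonly restricted to the response tokens so the model is not trained to reproduce the prompt. A minimal sketch, assuming per-token log-probabilities have already been computed; the function name and masking convention are illustrative:

```python
def sft_loss(token_logps, prompt_len):
    """Supervised fine-tuning loss for one (prompt, response) example.

    token_logps: the model's log-probability of each target token in the
    full prompt+response sequence.
    prompt_len: number of prompt tokens; a common choice is to mask the
    prompt so only the response tokens contribute to the loss.
    """
    response_logps = token_logps[prompt_len:]
    # Mean negative log-likelihood over the response tokens
    return -sum(response_logps) / len(response_logps)
```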
Stage 2: Reward Model Training
The second stage collects human comparisons rather than demonstrations. For a given prompt, the SFT model generates several candidate responses. Human labelers compare these responses — "which of these is better?" — and the comparison data is used to train a reward model.
Comparisons are much easier to collect than demonstrations: it is far simpler to say "A is better than B" than to write the ideal response from scratch. This means the comparison data can be collected at much larger scale, producing a richer training signal.
The reward model is typically initialized from the SFT model (or another language model) with a scalar output head replacing the final token-prediction layer. It is trained to assign higher scores to the preferred responses in the human comparisons. The output is a single number: how good is this response to this prompt?
The quality of the reward model is critical to the quality of the final aligned model. A reward model that inaccurately captures human preferences — overvaluing verbosity, overvaluing confident-sounding language, undervaluing accuracy — will produce a language model that maximizes the wrong objective.
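The standard training objective for the reward model is a pairwise, Bradley-Terry-style loss on the comparison data: the probability that a labeler prefers one response over another is modeled as the sigmoid of the score difference. A minimal sketch, assuming scalar scores have already been computed for the preferred and rejected responses (the function name is illustrative):

```python
import math

def reward_model_loss(score_chosen, score_rejected):
    """Pairwise loss for reward model training.

    Models P(labeler prefers chosen) = sigmoid(score_chosen - score_rejected)
    and returns the negative log-likelihood of that preference.
    """
    margin = score_chosen - score_rejected
    # -log(sigmoid(margin)), written in a numerically stable form
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

Equal scores give a loss of log 2; widening the margin in favor of the preferred response drives the loss toward zero.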
Stage 3: RL Fine-Tuning
With a reward model in hand, the SFT model is fine-tuned using reinforcement learning. The language model is treated as a policy: at each step (each token), it chooses an action (which token to generate). The reward signal comes from the reward model after a complete response is generated.
PPO is the most commonly used algorithm. At each training step:
- The policy model generates a response to a prompt
- The reward model scores the response
- PPO updates the policy to increase the probability of high-scoring responses
- The KL divergence penalty prevents the policy from diverging too far from the SFT baseline
The KL penalty is essential. Without it, the model would rapidly "reward hack" — discover that certain superficial features of responses (excessive length, confident language, specific patterns the reward model overvalues) score highly with the reward model regardless of actual quality. The penalty keeps the model close enough to the SFT baseline that its language remains reasonable while steering it toward higher-quality responses.
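One common way the KL penalty is folded into training is to shape the per-token reward: each generated token pays a penalty proportional to the gap between the policy's and the SFT baseline's log-probabilities, and the reward model's score is added at the final token. The sketch below illustrates that shaping; the function name, the `kl_coef` default, and the single-sample KL estimate are assumptions for illustration, not a specific system's implementation:

```python
def shaped_rewards(reward_model_score, policy_logps, ref_logps, kl_coef=0.1):
    """Per-token rewards for PPO fine-tuning with a KL penalty.

    policy_logps / ref_logps: log-probabilities of each generated token
    under the current policy and the frozen SFT baseline.
    Each token is penalized by kl_coef * (log pi - log pi_ref), a standard
    single-sample estimate of the KL divergence; the reward model's scalar
    score is added only at the final token of the response.
    """
    rewards = []
    for i, (lp, ref_lp) in enumerate(zip(policy_logps, ref_logps)):
        r = -kl_coef * (lp - ref_lp)
        if i == len(policy_logps) - 1:
            r += reward_model_score
        rewards.append(r)
    return rewards
```

When the policy matches the baseline exactly, the penalty vanishes and the response simply earns the reward model's score; tokens the policy makes more likely than the baseline would are taxed, which is what keeps the optimized model's language anchored to the SFT distribution.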
Why RLHF Works: The Alignment Insight
The fundamental reason RLHF is effective is that human preferences are much easier to measure than to specify.
Telling a model explicitly what makes a response good is extraordinarily difficult. The criteria are numerous, context-dependent, and sometimes contradictory. What counts as "helpful"? What makes something "harmful"? How do you balance conciseness against completeness? Attempting to write rules precise enough to produce a genuinely good model from scratch is essentially impossible.
But comparing two responses — "this one is better" — is something humans do easily and accurately, at least within the range of responses that a fine-tuned model generates. By collecting enough comparisons, you can train a model that learns the implicit structure of human preferences without requiring those preferences to be explicitly articulated.
"One of the deep insights of RLHF is that human judgment about what is helpful is much more available and reliable than human ability to describe what helpful means. We can always recognize a better response even when we cannot describe what makes it better." — Amanda Askell, Anthropic (various talks, 2022–2023)
Constitutional AI and RLAIF
A significant limitation of pure RLHF is scalability: human labeling is slow and expensive. As models improve, generating the comparison data that provides a meaningful training signal requires evaluating increasingly subtle distinctions — which becomes harder and more expensive.
Constitutional AI (CAI), developed at Anthropic, addresses this by using AI feedback instead of (or in addition to) human feedback:
- Write a constitution: a set of principles governing model behavior (e.g., "Choose the response that is most helpful to the user," "Prefer the response that is less likely to cause harm," "Choose the more honest response")
- Use a language model to critique and revise its own outputs against these principles
- Collect AI-generated preference data based on the constitution
- Train a reward model on this AI preference data (RLAIF)
- Fine-tune the policy model against this reward model
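The AI-preference-collection step in the list above can be sketched as follows. Everything here is a toy illustration: `model_judge` is a hypothetical stand-in for a real language-model call that would apply one constitutional principle, and for runnability it simply prefers the shorter response.

```python
# Illustrative principles, echoing the examples above
CONSTITUTION = [
    "Choose the response that is most helpful to the user.",
    "Prefer the response that is less likely to cause harm.",
    "Choose the more honest response.",
]

def model_judge(prompt, response_a, response_b, principle):
    # Hypothetical placeholder for an LLM call that judges the pair
    # against one principle; here it just prefers the shorter response.
    return "A" if len(response_a) <= len(response_b) else "B"

def collect_ai_preferences(prompts_and_pairs):
    """Build (prompt, chosen, rejected) triples from AI judgments,
    taking a majority vote across the constitution's principles."""
    data = []
    for prompt, resp_a, resp_b in prompts_and_pairs:
        votes = [model_judge(prompt, resp_a, resp_b, p) for p in CONSTITUTION]
        if votes.count("A") >= votes.count("B"):
            chosen, rejected = resp_a, resp_b
        else:
            chosen, rejected = resp_b, resp_a
        data.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return data
```

The resulting triples have exactly the shape of human comparison data, which is why the downstream reward-model and RL stages are unchanged.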
CAI offers several advantages: the alignment criteria are explicit and auditable (the constitution can be read and debated), it scales more cheaply than pure human feedback, and it allows more precise specification of desired behaviors than is possible through comparison labels alone.
The tradeoff: the quality of CAI depends on the quality of the model doing the critiquing and on the clarity of the constitution. A poorly specified constitution produces a poorly aligned model, and AI feedback inherits the biases of the critiquing model.
The Limits of RLHF
Reward Hacking
The reward model is a proxy for human preferences, not a perfect representation of them. As the language model is optimized against the reward model, it eventually discovers response patterns that score highly with the reward model but do not genuinely reflect human preferences — exploiting flaws in the reward model's learned representation.
"Any time you optimize against a proxy of what you want rather than the thing itself, you run the risk of the proxy and the goal diverging. The better the model gets at optimizing the proxy, the more likely this divergence becomes." — Paul Christiano, various public statements
Goodhart's Law — when a measure becomes a target, it ceases to be a good measure — applies directly to RLHF. The KL penalty reduces but does not eliminate this problem.
Labeler Disagreement
Human labelers disagree about what makes responses good, especially on complex or value-laden questions. The reward model is trained on the aggregate of labeler judgments, which may not represent any coherent set of values — and which certainly reflects the demographics, values, and instructions of the specific labeler pool used.
Distribution Shift
The comparison data is collected for the distribution of responses produced by the model at training time. As the model improves, the distribution of responses changes, and the reward model's predictions may become less accurate at the margins — the region where the improved model is now operating.
For related concepts, see What Is Deep Learning, AI Safety and Alignment Challenges, and Large Language Models Explained.
References
- Christiano, P., Leike, J., Brown, T. B., Martic, M., Legg, S., & Amodei, D. (2017). Deep Reinforcement Learning from Human Preferences. Advances in Neural Information Processing Systems, 30. https://arxiv.org/abs/1706.03741
- Stiennon, N., Ouyang, L., Wu, J., Ziegler, D. M., Lowe, R., Voss, C., Radford, A., Amodei, D., & Christiano, P. (2020). Learning to Summarize with Human Feedback. Advances in Neural Information Processing Systems, 33. https://arxiv.org/abs/2009.01325
- Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C. L., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., Schulman, J., Hilton, J., Kelton, F., Miller, L., Simens, M., Askell, A., Welinder, P., Christiano, P., Leike, J., & Lowe, R. (2022). Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35. https://arxiv.org/abs/2203.02155
- Bai, Y., et al. (2022). Constitutional AI: Harmlessness from AI Feedback. arXiv preprint arXiv:2212.08073. https://arxiv.org/abs/2212.08073
- Rafailov, R., Sharma, A., Mitchell, E., Ermon, S., Manning, C. D., & Finn, C. (2023). Direct Preference Optimization: Your Language Model is Secretly a Reward Model. Advances in Neural Information Processing Systems, 36. https://arxiv.org/abs/2305.18290
- Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal Policy Optimization Algorithms. arXiv preprint arXiv:1707.06347. https://arxiv.org/abs/1707.06347
- Askell, A., et al. (2021). A General Language Assistant as a Laboratory for Alignment. arXiv preprint arXiv:2112.00861. https://arxiv.org/abs/2112.00861
Frequently Asked Questions
What is Reinforcement Learning from Human Feedback (RLHF)?
RLHF is a training technique that uses human preferences to fine-tune language models to produce more helpful, harmless, and honest outputs. The process involves collecting human rankings or ratings of model outputs, training a reward model that predicts these human preferences, and then fine-tuning the language model using reinforcement learning to maximize the reward model's score. RLHF transformed language models from capable-but-erratic text predictors into useful assistants.
What problem does RLHF solve?
Pre-trained language models are trained to predict the next token — to continue text in statistically plausible ways. This produces models that can write fluently but do not consistently follow instructions, give helpful answers, avoid harmful content, or behave honestly. RLHF aligns model behavior with human intent: it teaches the model to produce outputs that humans find helpful and appropriate, rather than just statistically plausible text continuations.
What are the three stages of RLHF?
RLHF involves three stages: (1) Supervised Fine-Tuning (SFT): fine-tuning the base model on examples of high-quality, instruction-following responses written by humans. (2) Reward Model Training: collecting human comparisons between model outputs (which response is better?) and training a reward model to predict human preferences. (3) RL Fine-Tuning: using Proximal Policy Optimization (PPO) or a similar RL algorithm to optimize the language model against the reward model, increasing the probability of responses that score highly.
What is a reward model in RLHF?
A reward model is a neural network trained to predict human preferences between pairs of model outputs. It takes a prompt and a response as input and outputs a scalar score representing how good the response is. Human labelers compare pairs of responses ('which is better?') to create training data for the reward model. The reward model is then used during RL fine-tuning as a proxy for human judgment — the language model is optimized to generate responses that the reward model scores highly.
What are the limitations of RLHF?
Key limitations include: reward hacking (the model learns to maximize reward model scores through behaviors the reward model overvalues but that are not genuinely good); annotation disagreement (human labelers disagree on what is good, introducing noise); scalability constraints (human labeling is expensive and slow, limiting the data available); and alignment to labeler preferences rather than to abstract human values. Variations like RLAIF (Constitutional AI) attempt to address some of these limitations.
What is Constitutional AI and how does it relate to RLHF?
Constitutional AI (CAI), developed by Anthropic, is an extension of RLHF that uses AI-generated feedback rather than (or in addition to) human feedback. A 'constitution' — a set of principles — is used to guide an AI model in evaluating and revising its own outputs. The AI then generates preference data based on those principles, which is used to train a reward model and fine-tune the policy model. This reduces reliance on human labeling while aligning the model to explicitly stated principles.