In 2020, OpenAI released GPT-3 — a language model with 175 billion parameters, trained on hundreds of billions of words of text. The demonstrations were remarkable: the model could write poetry, answer questions, generate code, and continue essays in a wide range of styles. The performance on standard benchmarks was the best the field had seen.
It was also unreliable in deeply frustrating ways. Ask it to write a helpful tutorial and it might generate a tutorial, or it might continue writing something that sounded like a tutorial but was wrong, or it might generate off-topic content that happened to statistically follow from the prompt. Ask it to be harmless and it had no idea what that meant. Ask it to follow an instruction precisely and it might, or might not, depending on whether the instruction looked like something that would typically be followed in the training corpus.
The model was extraordinarily capable and essentially uncontrolled. It did what the data distribution suggested, not what the user wanted.
Reinforcement Learning from Human Feedback (RLHF) was the technique that changed this. When OpenAI deployed InstructGPT in 2022, followed by ChatGPT, the difference was stark: the models followed instructions, gave helpful and direct responses, declined harmful requests, and behaved consistently. The underlying language model capability was similar to GPT-3. What changed was training with RLHF.
"The key insight of RLHF is that it's much easier to compare two responses and say 'this one is better' than to write a perfect response from scratch. By collecting these comparisons at scale, you can train a reward model that predicts human preferences — and then you can use that reward model to make the language model's outputs dramatically better." — Paul Christiano, one of the original RLHF researchers, paraphrase from multiple talks
RLHF is now the standard method for aligning large language models with human intent. Every major AI assistant — ChatGPT, Claude, Gemini, and their successors — uses some form of RLHF or a close variant.
Key Definitions
Reinforcement Learning from Human Feedback (RLHF) — A training method that uses human preferences to fine-tune language models. The three-stage process: supervised fine-tuning on demonstration data, training a reward model from human comparison data, and fine-tuning the language model with reinforcement learning to maximize the reward model's score. Introduced by Christiano et al. (2017) and applied to language models by Stiennon et al. (2020) and Ouyang et al. (2022).
Alignment — The problem of ensuring AI systems pursue goals that are genuinely beneficial to humans, rather than pursuing proxies that diverge from human values. RLHF is the current primary method for alignment of large language models — aligning model behavior with human preferences about helpfulness, harmlessness, and honesty.
Supervised Fine-Tuning (SFT) — The first stage of RLHF: fine-tuning the base language model on a dataset of high-quality, instruction-following examples written by human labelers. SFT teaches the model to follow instructions and produce responses in the desired format and style, creating a baseline model for the subsequent RL stage.
Reward model — A neural network trained to predict human preference scores for model responses. Takes a prompt and response as input; outputs a scalar reward value. Trained on human comparison data (pairs of responses with human judgments of which is better). Used during RL fine-tuning as a proxy for human judgment.
Human comparison data — The data collected from human labelers comparing pairs of model outputs. Labelers see a prompt and two responses and indicate which is better, or rank a set of responses. This comparative data avoids the difficulty of asking labelers to assign absolute scores and produces a cleaner training signal.
Proximal Policy Optimization (PPO) — A reinforcement learning algorithm commonly used in RLHF to fine-tune the language model against the reward model. PPO updates the model to increase the probability of high-reward responses while constraining how far the updated model diverges from the SFT baseline. The constraint prevents the model from "reward hacking" its way to catastrophic divergence from reasonable language.
KL divergence penalty — A term in the PPO training objective that penalizes the fine-tuned model for diverging too much from the SFT baseline (measured by KL divergence). This prevents reward hacking: without the penalty, the model would quickly learn to maximize reward by producing degenerate outputs that exploit flaws in the reward model.
Reward hacking — A failure mode in which the language model discovers ways to maximize the reward model's score that do not correspond to genuinely good responses. For example, a model might learn that very long, confident-sounding responses score highly with the reward model, even when shorter and more accurate responses would be better. Reward hacking is the primary limitation of RLHF and the motivation for Constitutional AI approaches.
Constitutional AI (CAI) — Anthropic's extension of RLHF that uses AI-generated feedback to train the reward model, guided by a written constitution specifying the model's values and principles. CAI reduces the reliance on human labeling, makes the alignment criteria explicit and auditable, and scales more efficiently than pure human feedback. Reinforcement learning that uses this AI-generated preference data is called RLAIF (Reinforcement Learning from AI Feedback).
Direct Preference Optimization (DPO) — An alternative to PPO-based RLHF that directly optimizes the language model on the human comparison data without training a separate reward model. DPO is simpler to implement and often achieves comparable results to full RLHF with less computational overhead. Introduced by Rafailov et al. (2023).
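The DPO objective can be written in a few lines. Below is a minimal, framework-free sketch assuming the sequence log-probabilities under the policy and a frozen reference model have already been computed; the function name and the `beta=0.1` default are illustrative choices, not part of the original formulation:

```python
import math

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-example DPO loss from sequence log-probabilities.

    The implicit reward of a response is beta * (policy logp - reference logp);
    the loss is a logistic loss on the reward margin between the chosen
    and rejected responses, so no separate reward model is needed.
    """
    chosen_reward = beta * (policy_logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (policy_logp_rejected - ref_logp_rejected)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)), written in a numerically stable form
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy and reference assign identical log-probabilities, the margin is zero and the loss is log 2; raising the policy's probability of the chosen response lowers the loss.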
Helpful, Harmless, Honest (HHH) — Anthropic's specification of the three properties that RLHF-trained models should exhibit: responses should be helpful to the user's actual goals, harmless to users and third parties, and honest — not deceptive or misleading. The HHH framing has become widely influential in AI development.
The Three Stages of RLHF
Stage 1: Supervised Fine-Tuning (SFT)
The base language model — GPT-3, PaLM, LLaMA, or whatever foundation model is being aligned — is excellent at predicting next tokens but has no built-in notion of following instructions. Its training objective was to continue text, not to respond to queries.
SFT addresses this. Human contractors — sometimes called annotators or labelers — write high-quality demonstrations of the desired behavior: a user asks a question, and the annotator writes an ideal response. This demonstration data is used to fine-tune the base model on a supervised learning objective: given this prompt, produce this response.
The SFT model learns to follow the format and style of the demonstrations — to produce responses rather than arbitrary continuations. But SFT alone is insufficient. Writing ideal responses is difficult and expensive, and the quality of the SFT model is constrained by the quality and quantity of the demonstration data.
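The SFT objective itself is ordinary next-token cross-entropy, commonly restricted to the response tokens so the model is not trained to reproduce the prompt. A minimal sketch, assuming per-token log-probabilities have already been computed; the function name and masking convention are illustrative:

```python
def sft_loss(token_logps, prompt_len):
    """Supervised fine-tuning loss for one (prompt, response) example.

    token_logps: the model's log-probability of each target token in the
    full prompt+response sequence.
    prompt_len: number of prompt tokens; a common choice is to mask the
    prompt so only the response tokens contribute to the loss.
    """
    response_logps = token_logps[prompt_len:]
    # Mean negative log-likelihood over the response tokens
    return -sum(response_logps) / len(response_logps)
```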
Stage 2: Reward Model Training
The second stage collects human comparisons rather than demonstrations. For a given prompt, the SFT model generates several candidate responses. Human labelers compare these responses — "which of these is better?" — and the comparison data is used to train a reward model.
Comparisons are much easier to collect than demonstrations: it is far simpler to say "A is better than B" than to write the ideal response from scratch. This means the comparison data can be collected at much larger scale, producing a richer training signal.
The reward model is typically initialized from the SFT model (or another language model) with a scalar output head replacing the final token-prediction layer. It is trained to assign higher scores to the preferred responses in the human comparisons. The output is a single number: how good is this response to this prompt?
The quality of the reward model is critical to the quality of the final aligned model. A reward model that inaccurately captures human preferences — overvaluing verbosity, overvaluing confident-sounding language, undervaluing accuracy — will produce a language model that maximizes the wrong objective.
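The standard training objective for the reward model is a pairwise, Bradley-Terry-style loss on the comparison data: the probability that a labeler prefers one response over another is modeled as the sigmoid of the score difference. A minimal sketch, assuming scalar scores have already been computed for the preferred and rejected responses (the function name is illustrative):

```python
import math

def reward_model_loss(score_chosen, score_rejected):
    """Pairwise loss for reward model training.

    Models P(labeler prefers chosen) = sigmoid(score_chosen - score_rejected)
    and returns the negative log-likelihood of that preference.
    """
    margin = score_chosen - score_rejected
    # -log(sigmoid(margin)), written in a numerically stable form
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

Equal scores give a loss of log 2; widening the margin in favor of the preferred response drives the loss toward zero.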
Stage 3: RL Fine-Tuning
With a reward model in hand, the SFT model is fine-tuned using reinforcement learning. The language model is treated as a policy: at each step (each token), it chooses an action (which token to generate). The reward signal comes from the reward model after a complete response is generated.
PPO is the most commonly used algorithm. At each training step:
- The policy model generates a response to a prompt
- The reward model scores the response
- PPO updates the policy to increase the probability of high-scoring responses
- The KL divergence penalty prevents the policy from diverging too far from the SFT baseline
The KL penalty is essential. Without it, the model would rapidly "reward hack" — discover that certain superficial features of responses (excessive length, confident language, specific patterns the reward model overvalues) score highly with the reward model regardless of actual quality. The penalty keeps the model close enough to the SFT baseline that its language remains reasonable while steering it toward higher-quality responses.
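One common way the KL penalty is folded into training is to shape the per-token reward: each generated token pays a penalty proportional to the gap between the policy's and the SFT baseline's log-probabilities, and the reward model's score is added at the final token. The sketch below illustrates that shaping; the function name, the `kl_coef` default, and the single-sample KL estimate are assumptions for illustration, not a specific system's implementation:

```python
def shaped_rewards(reward_model_score, policy_logps, ref_logps, kl_coef=0.1):
    """Per-token rewards for PPO fine-tuning with a KL penalty.

    policy_logps / ref_logps: log-probabilities of each generated token
    under the current policy and the frozen SFT baseline.
    Each token is penalized by kl_coef * (log pi - log pi_ref), a standard
    single-sample estimate of the KL divergence; the reward model's scalar
    score is added only at the final token of the response.
    """
    rewards = []
    for i, (lp, ref_lp) in enumerate(zip(policy_logps, ref_logps)):
        r = -kl_coef * (lp - ref_lp)
        if i == len(policy_logps) - 1:
            r += reward_model_score
        rewards.append(r)
    return rewards
```

When the policy matches the baseline exactly, the penalty vanishes and the response simply earns the reward model's score; tokens the policy makes more likely than the baseline would are taxed, which is what keeps the optimized model's language anchored to the SFT distribution.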
Why RLHF Works: The Alignment Insight
The fundamental reason RLHF is effective is that human preferences are much easier to measure than to specify.
Telling a model explicitly what makes a response good is extraordinarily difficult. The criteria are numerous, context-dependent, and sometimes contradictory. What counts as "helpful"? What makes something "harmful"? How do you balance conciseness against completeness? Attempting to write rules precise enough to produce a genuinely good model from scratch is essentially impossible.
But comparing two responses — "this one is better" — is something humans do easily and accurately, at least within the range of responses that a fine-tuned model generates. By collecting enough comparisons, you can train a model that learns the implicit structure of human preferences without requiring those preferences to be explicitly articulated.
"One of the deep insights of RLHF is that human judgment about what is helpful is much more available and reliable than human ability to describe what helpful means. We can always recognize a better response even when we cannot describe what makes it better." — Amanda Askell, Anthropic (various talks, 2022–2023)
Constitutional AI and RLAIF
A significant limitation of pure RLHF is scalability: human labeling is slow and expensive. As models improve, generating the comparison data that provides a meaningful training signal requires evaluating increasingly subtle distinctions — which becomes harder and more expensive.
Constitutional AI (CAI), developed at Anthropic, addresses this by using AI feedback instead of (or in addition to) human feedback:
- Write a constitution: a set of principles governing model behavior (e.g., "Choose the response that is most helpful to the user," "Prefer the response that is less likely to cause harm," "Choose the more honest response")
- Use a language model to critique and revise its own outputs against these principles
- Collect AI-generated preference data based on the constitution
- Train a reward model on this AI preference data (RLAIF)
- Fine-tune the policy model against this reward model
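The AI-preference-collection step in the list above can be sketched as follows. Everything here is a toy illustration: `model_judge` is a hypothetical stand-in for a real language-model call that would apply one constitutional principle, and for runnability it simply prefers the shorter response.

```python
# Illustrative principles, echoing the examples above
CONSTITUTION = [
    "Choose the response that is most helpful to the user.",
    "Prefer the response that is less likely to cause harm.",
    "Choose the more honest response.",
]

def model_judge(prompt, response_a, response_b, principle):
    # Hypothetical placeholder for an LLM call that judges the pair
    # against one principle; here it just prefers the shorter response.
    return "A" if len(response_a) <= len(response_b) else "B"

def collect_ai_preferences(prompts_and_pairs):
    """Build (prompt, chosen, rejected) triples from AI judgments,
    taking a majority vote across the constitution's principles."""
    data = []
    for prompt, resp_a, resp_b in prompts_and_pairs:
        votes = [model_judge(prompt, resp_a, resp_b, p) for p in CONSTITUTION]
        if votes.count("A") >= votes.count("B"):
            chosen, rejected = resp_a, resp_b
        else:
            chosen, rejected = resp_b, resp_a
        data.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return data
```

The resulting triples have exactly the shape of human comparison data, which is why the downstream reward-model and RL stages are unchanged.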
CAI offers several advantages: the alignment criteria are explicit and auditable (the constitution can be read and debated), it scales more cheaply than pure human feedback, and it allows more precise specification of desired behaviors than is possible through comparison labels alone.
The tradeoff: the quality of CAI depends on the quality of the model doing the critiquing and on the clarity of the constitution. A poorly specified constitution produces a poorly aligned model, and AI feedback inherits the biases of the critiquing model.
The Limits of RLHF
Reward Hacking
The reward model is a proxy for human preferences, not a perfect representation of them. As the language model is optimized against the reward model, it eventually discovers response patterns that score highly with the reward model but do not genuinely reflect human preferences — exploiting flaws in the reward model's learned representation.
"Any time you optimize against a proxy of what you want rather than the thing itself, you run the risk of the proxy and the goal diverging. The better the model gets at optimizing the proxy, the more likely this divergence becomes." — Paul Christiano, various public statements
Goodhart's Law — when a measure becomes a target, it ceases to be a good measure — applies directly to RLHF. The KL penalty reduces but does not eliminate this problem.
Labeler Disagreement
Human labelers disagree about what makes responses good, especially on complex or value-laden questions. The reward model is trained on the aggregate of labeler judgments, which may not represent any coherent set of values — and which certainly reflects the demographics, values, and instructions of the specific labeler pool used.
Distribution Shift
The comparison data is collected for the distribution of responses produced by the model at training time. As the model improves, the distribution of responses changes, and the reward model's predictions may become less accurate at the margins — the region where the improved model is now operating.
For related concepts, see What Is Deep Learning, AI Safety and Alignment Challenges, and Large Language Models Explained.
References
- Christiano, P., Leike, J., Brown, T. B., Martic, M., Legg, S., & Amodei, D. (2017). Deep Reinforcement Learning from Human Preferences. Advances in Neural Information Processing Systems, 30. https://arxiv.org/abs/1706.03741
- Stiennon, N., Ouyang, L., Wu, J., Ziegler, D. M., Lowe, R., Voss, C., Radford, A., Amodei, D., & Christiano, P. (2020). Learning to Summarize with Human Feedback. Advances in Neural Information Processing Systems, 33. https://arxiv.org/abs/2009.01325
- Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C. L., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., Schulman, J., Hilton, J., Kelton, F., Miller, L., Simens, M., Askell, A., Welinder, P., Christiano, P., Leike, J., & Lowe, R. (2022). Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35. https://arxiv.org/abs/2203.02155
- Bai, Y., et al. (2022). Constitutional AI: Harmlessness from AI Feedback. arXiv preprint arXiv:2212.08073. https://arxiv.org/abs/2212.08073
- Rafailov, R., Sharma, A., Mitchell, E., Ermon, S., Manning, C. D., & Finn, C. (2023). Direct Preference Optimization: Your Language Model is Secretly a Reward Model. Advances in Neural Information Processing Systems, 36. https://arxiv.org/abs/2305.18290
- Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal Policy Optimization Algorithms. arXiv preprint arXiv:1707.06347. https://arxiv.org/abs/1707.06347
- Askell, A., et al. (2021). A General Language Assistant as a Laboratory for Alignment. arXiv preprint arXiv:2112.00861. https://arxiv.org/abs/2112.00861
Frequently Asked Questions
What is Reinforcement Learning from Human Feedback (RLHF)?
RLHF is a training technique that uses human preferences to fine-tune language models to produce more helpful, harmless, and honest outputs. The process involves collecting human rankings or ratings of model outputs, training a reward model that predicts these human preferences, and then fine-tuning the language model using reinforcement learning to maximize the reward model's score. RLHF transformed language models from capable-but-erratic text predictors into useful assistants.
What problem does RLHF solve?
Pre-trained language models are trained to predict the next token — to continue text in statistically plausible ways. This produces models that can write fluently but do not consistently follow instructions, give helpful answers, avoid harmful content, or behave honestly. RLHF aligns model behavior with human intent: it teaches the model to produce outputs that humans find helpful and appropriate, rather than just statistically plausible text continuations.
What are the three stages of RLHF?
RLHF involves three stages: (1) Supervised Fine-Tuning (SFT): fine-tuning the base model on examples of high-quality, instruction-following responses written by humans. (2) Reward Model Training: collecting human comparisons between model outputs (which response is better?) and training a reward model to predict human preferences. (3) RL Fine-Tuning: using Proximal Policy Optimization (PPO) or a similar RL algorithm to optimize the language model against the reward model, increasing the probability of responses that score highly.
What is a reward model in RLHF?
A reward model is a neural network trained to predict human preferences between pairs of model outputs. It takes a prompt and a response as input and outputs a scalar score representing how good the response is. Human labelers compare pairs of responses ('which is better?') to create training data for the reward model. The reward model is then used during RL fine-tuning as a proxy for human judgment — the language model is optimized to generate responses that the reward model scores highly.
What are the limitations of RLHF?
Key limitations include: reward hacking (the model learns to maximize reward model scores through behaviors the reward model overvalues but that are not genuinely good); annotation disagreement (human labelers disagree on what is good, introducing noise); scalability constraints (human labeling is expensive and slow, limiting the data available); and alignment to labeler preferences rather than to abstract human values. Variations like RLAIF (Constitutional AI) attempt to address some of these limitations.
What is Constitutional AI and how does it relate to RLHF?
Constitutional AI (CAI), developed by Anthropic, is an extension of RLHF that uses AI-generated feedback rather than (or in addition to) human feedback. A 'constitution' — a set of principles — is used to guide an AI model in evaluating and revising its own outputs. The AI then generates preference data based on those principles, which is used to train a reward model and fine-tune the policy model. This reduces reliance on human labeling while aligning the model to explicitly stated principles.