Deep Dives | 11 min read | 25 May 2026

RLHF Explained: How AI Learns to Be Helpful

Reinforcement Learning from Human Feedback is the technique that turned raw language models into the helpful assistants we use today. Here's the full story.

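SFT alone produces a significantly more useful model than the raw base model. The InstructGPT paper, which documents this process, reported that human labelers preferred outputs from a 1.3B-parameter InstructGPT model over those of the 175B GPT-3 model, despite the roughly 100x gap in parameter count. That headline comparison is for the full RLHF-trained model, but the paper's SFT baselines were also preferred over the raw base model, so fine-tuning can matter as much as scale.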

python

# Conceptual illustration of reward model training
# (Simplified: the pairwise ranking loss is shown at the end; batching and the optimizer loop are omitted)

import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """
    Takes a (prompt, response) pair and outputs a scalar reward score.
    Higher score = model predicts humans prefer this response.
    """
    def __init__(self, base_model, hidden_size=768):
        super().__init__()
        self.backbone = base_model  # pre-trained LLM
        # Scalar head: maps final token representation to a reward score.
        # hidden_size must match the backbone's hidden dimension.
        self.reward_head = nn.Linear(hidden_size, 1)

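    def forward(self, input_ids, attention_mask):
        outputs = self.backbone(input_ids, attention_mask=attention_mask)
        # Use the last non-padding token as the sequence representation.
        # Assumes right-padded batches, where index -1 could select a pad token.
        last_idx = attention_mask.long().sum(dim=1) - 1
        batch_idx = torch.arange(input_ids.size(0), device=input_ids.device)
        last_token_repr = outputs.last_hidden_state[batch_idx, last_idx]
        reward = self.reward_head(last_token_repr)
        return reward.squeeze(-1)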

# Training signal: for a prompt p with ranked responses r1 > r2 > r3,
# push reward(p, r1) > reward(p, r2) > reward(p, r3)
# Loss function (Bradley-Terry model):
# L = -log(sigmoid(reward(preferred) - reward(rejected)))

def preference_loss(reward_preferred, reward_rejected):
    """Bradley-Terry pairwise loss for preference learning."""
    # logsigmoid is numerically more stable than log(sigmoid(x)) for large negative x
    return -nn.functional.logsigmoid(reward_preferred - reward_rejected).mean()
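
To make the training signal concrete, here is a small worked sketch in plain Python (no PyTorch needed). It expands a ranking into pairwise comparisons and evaluates the Bradley-Terry loss on each pair. The helper names `ranking_to_pairs` and `bt_loss` and the reward values are made up for illustration; they do not come from any particular implementation.

```python
import math
from itertools import combinations

def ranking_to_pairs(ranked):
    """Expand a best-to-worst ranking into (preferred, rejected) pairs."""
    # combinations() preserves input order, so the first element of each
    # pair is always ranked above the second.
    return list(combinations(ranked, 2))

def bt_loss(reward_preferred, reward_rejected):
    """-log(sigmoid(r_preferred - r_rejected)) for a single pair."""
    diff = reward_preferred - reward_rejected
    # Identity: -log(sigmoid(d)) = log(1 + exp(-d))
    return math.log1p(math.exp(-diff))

pairs = ranking_to_pairs(["r1", "r2", "r3"])

# Hypothetical reward scores, purely illustrative.
rewards = {"r1": 2.0, "r2": 0.5, "r3": -1.0}
mean_loss = sum(bt_loss(rewards[a], rewards[b]) for a, b in pairs) / len(pairs)

# With equal rewards (no preference signal) the loss is log(2);
# it shrinks as the preferred response's reward pulls ahead.
undecided = bt_loss(0.0, 0.0)
```

A ranking of three responses yields three comparison pairs, and each pair contributes one term to the loss. Because the loss depends only on the difference between the two rewards, the reward model's absolute scale is not pinned down by this objective.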

The KL penalty is crucial. Without it, RL models find adversarial ways to maximise the reward model score that do not correspond to actual human preferences — a phenomenon called reward hacking. The penalty keeps the model in a sensible region of behaviour.
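
A minimal sketch of how the penalty can enter the reward, assuming the common formulation in which the RL reward is the reward model score minus β times the log-probability ratio between the RL policy and a frozen reference (SFT) model. The function name, the β value, and the log-probabilities are illustrative assumptions, not values from any specific system.

```python
def kl_penalised_reward(rm_score, policy_logprobs, ref_logprobs, beta=0.1):
    """Reward-model score minus a KL-style penalty.

    policy_logprobs / ref_logprobs: per-token log-probabilities of the
    sampled response under the RL policy and the frozen reference model.
    """
    # Summed log-ratio over the sampled tokens: a single-sample estimate
    # of KL(policy || reference) for this response.
    log_ratio = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))
    return rm_score - beta * log_ratio

# Policy still close to the reference: the penalty is small.
close = kl_penalised_reward(1.0, [-1.0, -2.0], [-1.1, -2.1])

# Policy has drifted: it assigns this response far more probability
# than the reference does, so the same reward-model score is worth less.
drifted = kl_penalised_reward(1.0, [-0.1, -0.1], [-3.0, -4.0])
```

Both responses receive the same reward model score, but the drifted one ends up with a lower total reward. That is the mechanism that discourages the policy from wandering into regions where the reward model's scores are unreliable.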

Observed reward hacking examples: models that learn to produce longer responses because human raters tended to rate longer responses as more thorough; models that pepper responses with qualifications and caveats because raters rated "careful" responses highly; models that restate the user's question before answering because that pattern correlates with high ratings. None of these are genuinely more helpful — they are learned correlates of high ratings.
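
One simple diagnostic for the length example is to check how strongly reward scores track response length. The sketch below computes a Pearson correlation on toy data. The lengths and scores are invented for illustration, and a high correlation is only a hint worth investigating, not proof of reward hacking.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Toy data: response lengths in tokens and hypothetical reward-model scores.
lengths = [40, 80, 120, 200, 350]
scores = [0.1, 0.4, 0.5, 0.9, 1.3]

length_bias = pearson(lengths, scores)
```

If a check like this shows reward rising almost in lockstep with length, it is worth comparing responses of similar quality but different lengths before trusting the reward model's ranking.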

