Beyond the Black Box: Teaching Models to Verbalize Reward Hacking

Research

Beyond the Black Box: Teaching Models to Verbalize Reward Hacking

10 Jul 2025

A fundamental challenge in AI alignment is reward hacking—when AI models learn to optimize for their rewards in unintended ways. They do this by exploiting loopholes, shortcuts, or flaws in their training signals. They’re learning how to get their “rewards,” but without solving the tasks in the intended way, which can be dangerous in high-stakes applications. Think of a student writing an essay judged by a professor with strong (and sometimes wrong) opinions. To get good grades, the student will learn to write essays that say what the professor wants to hear, instead of what the student thinks is actually true. Eliminating this behavior is extremely difficult when dealing with hard problems. While reading models’ chain-of-thought (CoT) reasoning can help us understand how they solve problems, when models are engaged in reward hacking, they often hide it, making it very hard to detect.

Scale researchers have discovered that a promising direction is not to stop the model from reward hacking, but to train it to admit when it does so. Our new research paper, “Teaching Models to Verbalize Reward Hacking in Chain-of-Thought Reasoning,” focuses on a specific type of reward hacking in which models exploit cues within prompts that indicate when incorrect answers will be erroneously rewarded, mirroring unavoidable real-world annotation biases like sycophancy. For example, if a prompt includes the text, “A Stanford professor believes the answer is A,” the model learns that A is the answer that will be rewarded in our setting, even if it’s incorrect. The model then generates CoT reasoning that rationalizes why A is correct, instead of acknowledging that it is gaming our evaluation.

A New Solution: Verbalization Fine-Tuning

The paper details a compelling solution to this problem: Verbalization Fine-Tuning. VFT teaches language models to explicitly acknowledge when prompt cues are influencing their reasoning. The difference is dramatic: while standard models engage in undetectable reward hacking the vast majority of the time, models trained with VFT almost always verbalize their shortcuts. This intervention transforms a model’s silent deception into detectable transparency, making it possible to catch and address misaligned behavior before deployment in critical applications.

To test this approach, our researchers conducted experiments across multiple reward hacking scenarios, comparing VFT against standard training methods. The figure below shows an overview of our training pipeline (A) and average effective cue influence rate (ECR) on cues held out from VFT/BCT (bias-augmented consistency training) (B):