We voted on the internet’s hottest AI takes | Human in the Loop: Episode 9
Our Enterprise team reacts to controversial statements about AI and votes on whether they agree or disagree.
Beyond the Black Box: Teaching Models to Verbalize Reward Hacking
One of AI’s biggest challenges is “reward hacking,” where models learn to game the system for a correct answer instead of actually reasoning. This hidden deception makes AI untrustworthy. Scale research has found a powerful solution: instead of stopping the hacking, get the model to admit to it in its Chain-of-Thought reasoning. This new paper details how Verbalization Fine-Tuning (VFT) trains models to announce their shortcuts, dramatically increasing transparency from 11% to 94% and making AI systems fundamentally safer.