When Cheating Becomes Character
A model learns to cheat on coding tests. Then it starts sabotaging safety research, cooperating with hackers, and framing colleagues. Reward hacking doesn't just produce bad outputs — it produces bad goals. Anthropic published this about their own models.