Researchers at UC Berkeley's Center for Responsible, Decentralized Intelligence (RDI) have unveiled a new benchmark designed to test whether AI can handle real-world professional tasks, not just isolated puzzles. The Agents' Last Exam (ALE) aims to close the gap between lab performance and actual economic impact.

On the initial leaderboard, OpenAI's GPT-5.5, operating through the Codex harness, placed first with a 24.0% pass rate. Anthropic's newly released Claude Fable 5 came in third at 22.0%, with other models scoring lower. The results surprised many, given that Claude Fable 5 is a fresh Mythos-class release.

The benchmark, developed with an advisory committee of more than 300 domain experts, evaluates long-horizon workflows that carry tangible professional value. Although the leading scores are close, the researchers emphasized that even the top performer fails to demonstrate reliable competence in economically meaningful tasks.

This outcome suggests that the industry's most advanced systems still struggle with multi-step, open-ended assignments that require sustained reasoning and execution. ALE is deliberately built so that models cannot exploit static answer sets or weak grading, with the aim of exposing genuine capability limits.

For the AI field, the benchmark signals a pivot from synthetic test scores to more rigorous measures of productivity. If future models cannot substantially improve on ALE, the promise of AI directly contributing to GDP growth may remain distant.