GPT-5.5 Tops ALE Leaderboard Amid AI Multi-Step Task Challenges
June 11, 2026
GPT-5.5 (via OpenAI Codex harness) leads the ALE leaderboard with a 24.0% pass rate, narrowly beating Claude Fable 5 at 22.0%, in a field designed to reflect real-world professional task performance.
ALE counters benchmark contamination by releasing only about 10% of tasks publicly and rotating private tasks into the public pool over time, creating a living evaluation set that resists memorization.
ALE emphasizes authentic, multi-step workflows drawn from 55 non-physical industry sub-domains aligned with the O*NET/SOC 2018 taxonomy, including tasks like 3D model creation in Siemens NX, Unreal Engine scene setup, neuroimaging analysis, and visual effects compositing.
Two scoring tiers are offered: the Full leaderboard allows licensed software or paid APIs, while the Unlicensed tier relies on freely available tools for fair comparisons.
The benchmark relies predominantly on deterministic, code-based evaluation for tasks such as 3D meshes and SEC filing parsing, with only 6.8% of workflows graded by an LLM judge, to minimize grading issues.
Agents’ Last Exam (ALE) is a new benchmark designed to measure AI agents’ ability to perform long-horizon, GDP-relevant professional workflows by operating in virtual environments and using multiple tools across five functional layers: Brain, Eyes, Body, Hands, and Feet.
Industry observers note that GPT-5.5’s lead is tempered by concerns over whether AI models can consistently follow multi-part instructions and complete entire workflows without skipping steps.
Even the strongest models struggle on the hardest Last-Exam tier, with Claude Opus 4.8 and Google’s Gemini CLI posting 0.0% pass rates, underscoring task difficulty.
The ALE project started with 1,490 task instances and plans to scale to 5,000 tasks to mirror real-world workflows and appeal to enterprise buyers seeking trustworthy AI benchmarks.
Summary based on 1 source
Source

VentureBeat • Jun 10, 2026
Surprise upset: GPT-5.5 beats Claude Fable 5 on brutal new Agents’ Last Exam benchmark