AI Benchmark 'Humanity's Last Exam' Reveals Gaps in Expert-Level Reasoning Among Top Models
February 3, 2026
A Vietnamese engineer, Phan Nguyen Hoang Long, co-led a Nature paper that introduces Humanity’s Last Exam (HLE), a rigorous benchmark designed to assess expert-level reasoning in AI.
As of early 2026, reported HLE scores include Gemini 3 Pro at roughly 37.5–38.3%, GPT-5.2 at about 35.4%, Claude 4 Opus at 25.2% (with higher scores from specialized variants), Zoom AI near 48.1%, and Grok 4 between 38.6% and 50.7% depending on configuration, with Grok 4 variants leading among published industry results.
The benchmark was developed globally by more than 1,000 professors and researchers from over 500 top universities and research institutions, including Stanford, Harvard, Princeton, MIT, and Oxford.
The project originated from an idea by Elon Musk and has been developed since 2024 by the Center for AI Safety (CAIS), led by Dan Hendrycks, together with Scale AI; Alexandr Wang, who later joined Meta's superintelligence lab, was involved as an advisor.
HLE is described as a critical reference point for AI policy discussions and governance, intended to ground debates on development trajectories, risks, and regulatory responses.
Nature, founded in 1869, is a prestigious multidisciplinary scientific journal that selects articles based on novelty, significance, rigor, and broad relevance.
HLE is a multimodal, closed-ended academic test with 2,500 questions spanning mathematics, humanities, and natural sciences, designed for automated grading with verifiable solutions that require reasoning beyond simple web retrieval.
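Because every question is closed-ended with a verifiable solution, model outputs can be scored automatically against an answer key. The Python sketch below illustrates the general idea of such exact-match grading; the question IDs, field names, and normalization rule are illustrative assumptions, not details of the paper's actual evaluation pipeline.

```python
# Minimal sketch of automated grading for a closed-ended benchmark.
# Illustrative only: the question IDs and normalization rule are
# assumptions, not the paper's actual evaluation pipeline.
import re

def normalize(text: str) -> str:
    """Lowercase, trim, and collapse whitespace so trivially different
    renderings of the same answer compare equal."""
    return re.sub(r"\s+", " ", text.strip().lower())

def grade(predictions: dict[str, str], answer_key: dict[str, str]) -> float:
    """Exact-match accuracy over question IDs present in both mappings."""
    shared = predictions.keys() & answer_key.keys()
    if not shared:
        return 0.0
    correct = sum(
        normalize(predictions[qid]) == normalize(answer_key[qid])
        for qid in shared
    )
    return correct / len(shared)

# Example: one correct, one incorrect prediction -> 50% accuracy.
key = {"q1": "42", "q2": "Riemann hypothesis"}
preds = {"q1": " 42 ", "q2": "P = NP"}
print(f"Accuracy: {grade(preds, key):.1%}")  # Accuracy: 50.0%
```

Exact-match grading of this kind is what makes leaderboard updates cheap and reproducible, though real pipelines typically need richer answer normalization (units, LaTeX, multiple-choice letters) than this sketch shows.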
Results show top human experts scoring around 90% on HLE, while leading AI models lag behind; performance is continually updated on public leaderboards from Scale AI and Artificial Analysis.
Grok 4 has demonstrated strong performance on HLE, signaling rapid progress but also highlighting remaining gaps between AI and human expert-level reasoning.
Source

VnExpress International • Feb 3, 2026
Vietnamese engineer co-leads Nature paper introducing Humanity's Last Exam for AI