A Benchmark of Expert-Level Academic Questions to Assess AI Capabilities

ADIA Lab Research Paper Series

Authors: Center for AI Safety, Scale AI, & HLE Contributors Consortium (including ADIA Lab's Shadab Khan)

Date Published: 28 January 2026

ADIA Lab contributed to a paper published in Nature, titled A Benchmark of Expert-Level Academic Questions to Assess AI Capabilities.

Shadab Khan, Health AI research lead at ADIA Lab, was part of the HLE Contributors Consortium, a global group of subject-matter experts who helped develop the benchmark.

Humanity's Last Exam (HLE) is a 2,500-question benchmark spanning mathematics, the humanities, and the natural sciences. It was created because widely used benchmarks have become too easy: frontier models now exceed 90% on popular tests such as MMLU, making it hard to measure real progress.

HLE questions are expert-level, closed-ended, and designed so answers cannot be found quickly through internet retrieval. State-of-the-art models showed low accuracy and poor calibration on HLE, revealing a significant gap between current AI capabilities and the expert human frontier.
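For context, calibration compares a model's stated confidence with how often it is actually correct: a well-calibrated model that claims 90% confidence should be right about 90% of the time. Below is a minimal Python sketch of one common calibration metric, binned expected calibration error (ECE). This is an illustrative assumption, not necessarily the exact metric used in the HLE paper, and the toy data are made up.

import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned expected calibration error (ECE).

    confidences: model-reported probability that each answer is correct, in [0, 1]
    correct:     1 if the answer was actually correct, else 0

    Illustrative sketch; not necessarily the HLE paper's exact calibration metric.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if not mask.any():
            continue
        avg_conf = confidences[mask].mean()  # how sure the model said it was
        avg_acc = correct[mask].mean()       # how often it was actually right
        ece += mask.mean() * abs(avg_conf - avg_acc)  # weight by bin size
    return ece

# Toy example: a model that is confidently wrong is poorly calibrated.
conf = [0.95, 0.90, 0.85, 0.80, 0.70]
hits = [0, 1, 0, 0, 0]
print(f"accuracy: {np.mean(hits):.2f}, ECE: {expected_calibration_error(conf, hits):.2f}")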

Why this matters: as models improve rapidly, the field needs benchmarks that can keep up. HLE provides a rigorous, publicly available measuring stick for the most capable AI systems. The benchmark is available at lastexam.ai.
