Hi Qpa,

We just launched the Labelbox Complex Reasoning Leaderboard, a first-of-its-kind evaluation of how today’s top AI models handle truly difficult problems across math, coding, physics, and more.

The big winner? Google’s Gemini 2.5 Pro sits at the top, outperforming other leading models.

An interesting finding: despite recent advances, no model reached our high-performance threshold of 75% accuracy, and several well-known models scored under 50%. That's a signal that AI reasoning is still in its early days and will continue to evolve.
What sets our leaderboard methodology apart:

  • Diverse, human-generated problem sets to ensure a rigorous assessment
  • Comprehensive metrics to measure consistency, efficiency, and quality
  • Simulated challenges that mirror the difficulties AI systems face in advanced STEM reasoning tasks

Our new blog dives into the Leaderboard findings, methodology, and what’s next.

Next up? Our upcoming leaderboard will explore how agentic AI systems perform in dynamic, real-world simulations. Stay tuned!