Hi Qpa,

We just launched the Labelbox Complex Reasoning Leaderboard, a first-of-its-kind evaluation of how today’s top AI models handle truly difficult problems across math, coding, physics, and more.

The big winner? Google’s Gemini 2.5 Pro sits at the top, outperforming other leading models.

An interesting finding: despite recent advances, no model reached our high-performance threshold of 75% accuracy, and several well-known models scored under 50%. That's a signal that AI reasoning is still in its early days and will continue to evolve.
What sets our leaderboard methodology apart:

  • Diverse, human-generated problem sets to ensure a rigorous assessment
  • Comprehensive metrics to measure consistency, efficiency, and quality
  • Simulated challenges that mirror the difficulties AI systems face in advanced STEM reasoning tasks

Our new blog dives into the Leaderboard findings, methodology, and what’s next.

Next up? Our upcoming leaderboard will explore how agentic AI systems perform in dynamic, real-world simulations. Stay tuned!