Hi Qpa,
We just launched the Labelbox Complex Reasoning Leaderboard, a first-of-its-kind evaluation of how today’s top AI models handle truly difficult problems across math, coding, physics, and more.
The big winner?
Google’s Gemini 2.5 Pro sits at the top, outperforming other leading models.
One interesting finding: despite recent advances, no model reached our high-performance threshold of 75% accuracy, and several well-known models scored under 50%. That’s a signal that AI reasoning is still in its early days and will continue to evolve.