Imagine sitting an exam so brutal that most humans would struggle to pass it, then handing it to a machine. That’s exactly what Humanity’s Last Exam (HLE) does. It’s not the sort of test you can bluff your way through with a cheeky grin and a good guess. No, this one dives deep, really deep, into the murky waters of expert-level reasoning, logic, and nuance. And guess what? Even our cleverest AI friends are flunking it.
Why We Needed a New Exam in the First Place
We’ve reached a strange point in AI where models like GPT-4, Claude, and Gemini breeze through most existing benchmarks with top marks. That’s not because they’ve hit enlightenment; it’s because the tests are simply too easy. When AI hits 90%+ on your quiz, you’re not measuring reasoning anymore; you’re measuring memorisation.
Enter HLE: an ultra-challenging benchmark built by nearly 1,000 global experts from over 500 institutions. It’s designed to challenge AI’s ability to reason, not just regurgitate. This exam spans over 100 academic fields, with maths making up 41% (which explains why your humanities-loving chatbot might start to sweat).
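To make that concrete, here’s a minimal, self-contained sketch of what grading a model against HLE-style questions could look like. Everything in it is illustrative: the two sample items are stand-ins I’ve made up (real HLE questions are expert-written and far nastier), `dummy_model` is a hypothetical placeholder for an actual model call, and the real benchmark uses stricter judging (including an LLM judge for free-form answers) rather than naive exact match.

```python
from typing import Callable

# Illustrative stand-ins only: real HLE items are expert-written and far harder.
SAMPLE_ITEMS = [
    {
        "subject": "mathematics",
        "question": "What is the smallest positive integer with exactly 6 divisors?",
        "answer": "12",
    },
    {
        "subject": "classics",
        "question": "In which century BCE did the Peloponnesian War begin?",
        "answer": "5th",
    },
]

def grade(items: list[dict], model_fn: Callable[[str], str]) -> float:
    """Return accuracy of model_fn over items, using naive exact-match grading.

    (The real benchmark grades more strictly, e.g. with an LLM judge for
    free-form answers, but exact match is enough to show the shape of the loop.)
    """
    correct = 0
    for item in items:
        prediction = model_fn(item["question"]).strip().lower()
        if prediction == item["answer"].strip().lower():
            correct += 1
    return correct / len(items)

def dummy_model(question: str) -> str:
    """Hypothetical model stub; swap in a real model call here."""
    return "12"  # guesses the same thing every time

if __name__ == "__main__":
    print(f"Accuracy: {grade(SAMPLE_ITEMS, dummy_model):.1%}")  # -> 50.0%
```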
AI’s Current Performance: Room for… Growth
If HLE were a pub quiz, AI would be the mate who turns up late, forgets his wallet, and gets 2 questions right out of 50. Most models are hovering below 10% accuracy, with a few overachievers like OpenAI’s Deep Research brushing against the lofty heights of 26.6%. That might not sound like much, but in this context, it’s a roaring success.
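For what it’s worth, those headline numbers are plain accuracy: questions answered correctly divided by questions asked. HLE also reports a calibration error, because a model that scores under 10% while claiming 90% confidence isn’t humbly ignorant, it’s confidently wrong. Below is a rough sketch of both metrics, assuming each answer comes with a self-reported confidence between 0 and 1; this is a generic binned, RMS-style calibration error rather than HLE’s exact formula, and the numbers are synthetic, not real leaderboard results.

```python
import numpy as np

def accuracy(correct: np.ndarray) -> float:
    """Fraction of questions answered correctly."""
    return float(correct.mean())

def calibration_error(confidence: np.ndarray, correct: np.ndarray, n_bins: int = 10) -> float:
    """Binned, RMS-style calibration error (a stand-in for HLE's metric).

    Bucket answers by stated confidence, then compare each bucket's average
    confidence with its actual accuracy. 0 means perfectly calibrated.
    """
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    err, total = 0.0, len(confidence)
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidence >= lo) & (confidence < hi) if hi < 1.0 else (confidence >= lo)
        if mask.sum() == 0:
            continue
        gap = confidence[mask].mean() - correct[mask].mean()
        err += (mask.sum() / total) * gap**2
    return float(np.sqrt(err))

# Synthetic numbers only (not real HLE results): a model that is wrong ~90% of
# the time but claims ~85% confidence is both inaccurate and miscalibrated.
rng = np.random.default_rng(0)
correct = (rng.random(500) < 0.10).astype(float)          # ~10% accuracy
confidence = np.clip(rng.normal(0.85, 0.05, 500), 0, 1)   # overconfident

print(f"accuracy: {accuracy(correct):.1%}")
print(f"calibration error: {calibration_error(confidence, correct):.2f}")
```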
But, and it’s a big but, many of these improvements look less like understanding the subject and more like studying for the exam. It’s Goodhart’s Law in action: when a measure becomes a target, it ceases to be a good measure, and benchmarks are no exception.
So, What Does a High Score Mean?
A decent HLE score doesn’t crown an AI as the new Einstein. It means the AI can work through tough, closed-ended, expert-level problems with verifiable answers. That’s important. But it’s not AGI. Real general intelligence needs more than book smarts: it needs common sense, creativity, adaptability, and a fair bit of emotional finesse. (Spoiler: AI still can’t tell a joke properly. Yet.)
The Catch: We’re Chasing Our Own Tail
The second AI gets good at HLE, we’ll need HLE 2.0. Then 3.0. It’s a never-ending arms race between human ingenuity and silicon cunning. Some say the name “Humanity’s Last Exam” is a bit dramatic—and maybe it is—but it raises a haunting prospect: what happens when we’re no longer the ones writing the tests?
The Takeaway: HLE Isn’t the End, It’s the Beginning
HLE is a brutal wake-up call. It shows just how far AI still has to go before it can truly understand, interpret, and create like a human. But it also lights the path forward, encouraging researchers to build AI that’s not just smart, but wise.
And here’s the kicker: we don’t even know how well humans would score on it. The questions are so tough that most college students wouldn’t even understand them. Specialists might ace bits of their own field, but no single human could realistically master the whole thing. In many ways, it’s as much a test of our own intellectual limits as it is of AI’s. If anything, HLE reveals we’re benchmarking AI not against average intelligence, but against the sharpest edges of expert knowledge. And that’s a high bar.
So next time someone says “AI is coming for our jobs,” ask them if it can pass Humanity’s Last Exam. If not, your job’s probably safe—for now.