LLM Benchmarks Explained: HumanEval, MMLU & More

You saw a leaderboard saying GPT-4o scored 87.8% on MMLU. Another tab showed it passing 90.2% of HumanEval problems. A third placed it near the top of Chatbot Arena. Three numbers, three sources. You have no idea if those are good scores, what they’re actually measuring, or whether any of them matter for what you’re building.

That’s the most common experience with LLM benchmarks. Not confusion about what AI can do, but confusion about what the numbers mean.

This post walks through the benchmarks you’ll see most often, what each one actually tests, and how to use the data to pick a model without getting tricked by a number that sounds impressive but doesn’t match your use case.

💡 Try this hands-on: Building your own evaluation for LLM outputs is covered in Lesson 28: LLM Evaluation & Observability → on TinkerLLM. The first 50 exercises are free, no card needed.

Why one number can’t measure everything

You can’t give an LLM a GPA. There’s no universal metric because there’s no universal task.

A model that writes excellent fiction might fail at arithmetic. A model that codes reliably might give terrible medical advice. A model that sounds helpful might be confidently wrong on obscure facts. So when someone says “this model is better,” you have to ask: better at what?

Benchmarks are standardized tests that answer that question for specific categories. Each one picks a domain, creates test cases with known correct answers, runs the model through them, and reports a percentage.

That percentage means: on this specific type of task, with these specific test cases, the model got X% right.

What it doesn’t mean: the model is generally X% good at AI things. That’s the misread that leads to bad model choices.

HumanEval: Does it write working code?

HumanEval was released by OpenAI in 2021 and has become the standard for measuring code generation. The test set contains 164 hand-written Python programming problems. Each one gives the model a function signature plus a docstring, and the model has to complete the function body.

The metric is called pass@k: the probability that at least one of k generated attempts passes the problem’s unit tests. At k=1, it measures whether the model’s first attempt passes. A pass@1 score of 90% on HumanEval means the model solved about 148 of the 164 problems on its first try.
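To make the metric concrete, here is a minimal sketch of the unbiased pass@k estimator described in the paper that introduced HumanEval: generate n samples for a problem, count the c that pass, and estimate the chance that at least one of k drawn samples passes. The function name and the sample counts are illustrative choices, not part of any official tooling.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of which passed.

    Computes 1 - C(n - c, k) / C(n, k): one minus the probability that
    k samples drawn without replacement from the n are all failures.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any k samples must include a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Illustrative numbers: 10 samples for one problem, 3 of them pass.
print(pass_at_k(n=10, c=3, k=1))  # about 0.3, the same as c / n when k = 1
print(pass_at_k(n=10, c=3, k=5))  # higher, because five attempts give more chances
```

The reported benchmark score is the average of this value across all problems, which is why a pass@10 figure always looks better than pass@1 for the same model.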

def count_words(text: str) -> int:
    """Return the number of words in the given text string."""
    # model completes this

The test runner executes the completed function against the problem’s unit tests, which the model doesn’t see in the prompt. A solution counts only if every test passes, so a mishandled edge case or a wrong return type is a failure.
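A simplified sketch of what such a check can look like is below. This is not the official HumanEval harness: the completion, the test cases, and the `passes_tests` helper are made up for illustration, and running model output through `exec` is only safe inside a sandbox.

```python
PROMPT = '''def count_words(text: str) -> int:
    """Return the number of words in the given text string."""
'''

# A hypothetical model completion: just the function body.
COMPLETION = "    return len(text.split())\n"

# Hypothetical test cases as (input, expected output) pairs.
TEST_CASES = [
    ("hello world", 2),
    ("", 0),
    ("  spaced   out  words ", 3),
]


def passes_tests(prompt: str, completion: str, cases) -> bool:
    """Return True if prompt + completion defines a count_words that passes every case."""
    namespace = {}
    try:
        # WARNING: exec runs arbitrary code. Real harnesses isolate this step.
        exec(prompt + completion, namespace)
        func = namespace["count_words"]
        return all(func(arg) == expected for arg, expected in cases)
    except Exception:
        # Syntax errors, a missing function, and runtime errors all count as failures.
        return False


print(passes_tests(PROMPT, COMPLETION, TEST_CASES))               # working completion
print(passes_tests(PROMPT, "    return len(text)\n", TEST_CASES))  # counts characters, fails
```

The all-or-nothing scoring is the important part: a completion that handles the common case but breaks on the empty string earns no credit.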

Three things to keep in mind here.

First, the problems are self-contained. Each fits in a few lines of Python. Real coding involves reading existing code, understanding multi-file systems, and debugging runtime errors across dependencies. HumanEval doesn’t test any of that.

Second, models have been trained extensively on HumanEval-style problems because the dataset is public. High scores partly reflect task familiarity, not generalized coding ability. You want to test on your own code samples when it matters.

Third, Python-specific performance doesn’t transfer cleanly to TypeScript, Rust, or Go. A model with 90% on HumanEval might score noticeably lower on equivalent problems in another language.

Still, HumanEval is a useful floor. If a model can’t reliably pass 75%+ of HumanEval problems, you probably don’t want it writing code for you.

MMLU: Breadth of knowledge across 57 subjects

MMLU stands for Massive Multitask Language Understanding. The benchmark contains 14,000+ multiple-choice questions spanning 57 subject areas: medicine, law, history, math, economics, computer science, philosophy, and more.

High MMLU scores mean the model has absorbed broad world knowledge from its training data. The test measures whether that knowledge is actually accessible.

An example question (from the medical subset):

Which of the following is the most common cause of fever in a hospitalized patient on day 4 of their admission? A) Drug reaction B) Pulmonary embolism C) Urinary tract infection D) Wound infection

Getting this right requires knowing how complications in hospitalized patients typically unfold over time. Most frontier models do.

But MMLU has a well-documented limitation: it’s entirely multiple-choice. Knowing enough to pick the right option from four choices is easier than generating the correct answer from scratch. And since the dataset is public, models have been trained specifically to do well on it.
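To see why the multiple-choice format raises the floor, here’s a small sketch of how such answers can be graded. The answer key, the model outputs, and the `extract_choice` helper are invented for illustration; real evaluation harnesses use their own prompt formats and answer parsing.

```python
import re
from typing import Optional

# Invented answer key and model outputs, not real MMLU items.
ANSWER_KEY = ["C", "A", "D", "B"]
MODEL_OUTPUTS = ["The answer is C.", "A", "I think (B) is correct", "B) because ..."]


def extract_choice(output: str) -> Optional[str]:
    """Return the first standalone capital letter A-D in the output, if any."""
    match = re.search(r"\b([ABCD])\b", output)
    return match.group(1) if match else None


def accuracy(outputs, key) -> float:
    """Fraction of outputs whose extracted letter matches the answer key."""
    correct = sum(extract_choice(out) == ans for out, ans in zip(outputs, key))
    return correct / len(key)


print(accuracy(MODEL_OUTPUTS, ANSWER_KEY))  # 3 of the 4 invented outputs match

# With four options, random guessing is right 1 time in 4 on average,
# so 25% is the floor to compare scores against, not 0%.
CHANCE_BASELINE = 1 / 4
print(CHANCE_BASELINE)
```

Note how crude the letter extraction is. Differences in answer parsing are one reason the same model can get slightly different MMLU scores from different evaluators.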

The bigger issue: MMLU correlates with knowledge breadth, not factual precision on specific claims. A model can score 90% on MMLU and still hallucinate a paper citation with complete confidence. It can score 88% and still get “how many r’s are in strawberry?” wrong because that’s a tokenization problem, not a knowledge problem.

Use MMLU to filter models for knowledge-intensive tasks like Q&A, research summarization, or tutoring. Don’t use it to judge coding ability, mathematical reasoning, or factual accuracy on specific claims.

Chatbot Arena: The human preference test

Chatbot Arena was started by the LMSYS research group. The system works like a tournament: two models get the same prompt, their responses appear side by side without labels, and a human picks the better one. The winner’s rating goes up and the loser’s goes down, Elo-style, with the size of the change depending on how the two models were rated beforehand. Over millions of comparisons, a ranking emerges.
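For intuition about how pairwise votes turn into a ranking, here is the textbook Elo update as a minimal sketch. It illustrates the general idea only, not Chatbot Arena’s actual pipeline; the K-factor of 32 and the starting ratings are assumed values.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def elo_update(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """Return the new (rating_a, rating_b) after one head-to-head vote."""
    exp_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    delta = k * (score_a - exp_a)
    # Whatever one model gains, the other loses.
    return rating_a + delta, rating_b - delta


# Evenly matched models: the winner gains half of K.
print(elo_update(1000, 1000, a_won=True))  # (1016.0, 984.0)

# An upset: the higher-rated model loses, so the swing is larger.
print(elo_update(1200, 1000, a_won=False))
```

Because the update depends on the expected outcome, beating a much stronger model moves a rating more than beating a weaker one.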

The human preference approach is the closest thing we have to “what’s actually more useful in practice.” There are no hidden test cases and no multiple-choice shortcuts. Real prompts, real responses, real human judgment.

This is why Chatbot Arena rankings often feel more accurate than MMLU or HumanEval scores. A model that sounds smart but is frustrating to use drops over time. A model that scores lower on academic benchmarks but is genuinely helpful tends to rise.

But it has its own caveats.

The prompt distribution skews toward conversational tasks and creative writing. Technical or domain-specific prompts are underrepresented, so arena rankings aren’t a great signal for specialized use cases.

Human raters also have preferences that don’t always align with accuracy. Longer, more detailed answers often win even when a shorter answer is factually better. And models can be tuned to generate outputs that feel good to read without being more accurate, a problem sometimes called preference optimization drift.

Still, if your use case is a general-purpose assistant or you’re evaluating conversational quality, Chatbot Arena rankings are more informative than any single academic benchmark.

Other benchmarks worth knowing

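MATH (Hendrycks et al., 2021): Problems from competitions like AMC and AIME, spanning subjects from prealgebra to precalculus, including number theory, geometry, and probability. Each problem has a single final answer rather than a written proof. Significantly harder than SAT math. Use this one if you’re evaluating models for any task that involves mathematical reasoning.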

BIG-Bench Hard (BBH): A subset of the BIG-Bench suite focused on tasks that challenged earlier frontier models. Covers logical reasoning, causal judgment, and deceptive context. A good stress test for edge-case reasoning.

TruthfulQA: Tests whether models give truthful answers to questions that humans commonly get wrong, including conspiracy theories, false premises, and areas where models have learned to sound confident without being accurate. Low scores here are a red flag for factual reliability. It’s the benchmark most directly tied to model honesty rather than knowledge breadth.

MT-Bench: Multi-turn conversation benchmark. Evaluates how well models handle follow-up questions, maintain consistency across a conversation, and remember what was said earlier. More relevant than MMLU for conversational use cases.

HellaSwag: Commonsense completion benchmark. Mostly useful as a filter for smaller or older models now since frontier models score 90%+, but still shows up in leaderboards.

Try It Yourself

The most useful benchmark for your project isn’t any of these. It’s the one you build yourself.

Pick 5-10 prompts that reflect your actual use case. If you’re building a code assistant, use real problems from your codebase. If you’re building a Q&A tool, use real questions from your domain. Run each candidate model through those exact prompts. Score the responses manually.
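That process can be sketched as a small loop. Everything here is a placeholder: `fake_model_a` and `fake_model_b` stand in for whatever API client you use, and the substring check stands in for the manual scoring described above.

```python
from typing import Callable

# Placeholder prompts, each paired with a string a good answer should contain.
TEST_SET = [
    {"prompt": "What is the capital of France?", "expected": "Paris"},
    {"prompt": "What is 12 * 12?", "expected": "144"},
]


def fake_model_a(prompt: str) -> str:
    """Stand-in for a real model call."""
    return "Paris" if "France" in prompt else "The answer is 144."


def fake_model_b(prompt: str) -> str:
    """Stand-in for a weaker model."""
    return "I'm not sure."


def run_eval(models: dict[str, Callable[[str], str]], test_set) -> dict[str, float]:
    """Run every model on every prompt and return the pass rate per model."""
    scores = {}
    for name, generate in models.items():
        passed = sum(case["expected"] in generate(case["prompt"]) for case in test_set)
        scores[name] = passed / len(test_set)
    return scores


print(run_eval({"model_a": fake_model_a, "model_b": fake_model_b}, TEST_SET))
```

Keeping the prompts in a plain list makes it cheap to rerun the same test set whenever a provider ships a new model version.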

You won’t have 14,000 data points, so the numbers won’t be statistically robust. But you’ll have something MMLU can’t give you: signal on how the model performs on your specific task with your specific inputs.
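To put a number on “not statistically robust,” one option is a confidence interval for the pass rate. The sketch below uses the Wilson score interval at roughly 95% confidence (z = 1.96); the counts are made-up examples.

```python
from math import sqrt


def wilson_interval(successes: int, total: int, z: float = 1.96):
    """Return the (low, high) Wilson score interval for a pass rate."""
    if total <= 0 or not 0 <= successes <= total:
        raise ValueError("need total > 0 and 0 <= successes <= total")
    p = successes / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half_width = z * sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half_width, center + half_width


# Same 80% pass rate, very different certainty.
print(wilson_interval(8, 10))          # wide: 10 hand-scored prompts
print(wilson_interval(11200, 14000))   # narrow: a benchmark-sized test set
```

With 8 of 10 correct, the interval runs from roughly 49% to 94%, so a small test set can separate a clearly good model from a clearly bad one but not two models that are close.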

The HuggingFace Open LLM Leaderboard aggregates scores across multiple benchmarks for open-source models. For closed models, check the official model cards from Anthropic, OpenAI, and Google, then use Chatbot Arena for head-to-head comparisons.

What LLM benchmarks don’t measure

This is the part that trips most people up when they try to use benchmark data.

They don’t measure your specific task. A model that scores 90% on HumanEval might struggle with your particular codebase’s patterns, your naming conventions, or your edge cases. The training distribution and test distribution are both fixed. Yours isn’t.

They don’t measure prompt sensitivity. The same model can give dramatically different responses depending on phrasing. Benchmark scores are measured on fixed, controlled prompts. Your prompts aren’t controlled. How LLMs respond to variation in prompts is explained in detail in How LLMs Actually Work.
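One way to check this on your own prompts is to ask the same question several ways and see how often the answers agree. The sketch below assumes a hypothetical `fake_model` that is deliberately inconsistent; swap in a real model call to use it.

```python
from collections import Counter

# Three phrasings of the same question.
PARAPHRASES = [
    "What is the boiling point of water at sea level in Celsius?",
    "At sea level, water boils at how many degrees Celsius?",
    "In Celsius, when does water boil at sea level?",
]


def fake_model(prompt: str) -> str:
    """Stand-in model that answers differently for one phrasing."""
    return "212" if prompt.startswith("In") else "100"


def consistency(generate, prompts) -> float:
    """Fraction of prompts whose answer matches the most common answer."""
    answers = [generate(p).strip().lower() for p in prompts]
    _, top_count = Counter(answers).most_common(1)[0]
    return top_count / len(answers)


print(consistency(fake_model, PARAPHRASES))  # 2 of the 3 phrasings agree
```

Exact string matching only works for short answers. For longer responses you would need a looser comparison, which is a judgment call this sketch leaves out.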

They don’t measure cost efficiency. A model with 3% lower benchmark scores might be 10× cheaper per million tokens. Depending on your task volume, that’s a more meaningful difference than the score gap.
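The arithmetic is worth running for your own volume. The prices and traffic numbers below are hypothetical, not any provider’s real rates, and the sketch uses a single blended price per million tokens instead of separate input and output pricing.

```python
def monthly_cost(requests_per_day: int, tokens_per_request: int,
                 price_per_million_tokens: float, days: int = 30) -> float:
    """Estimate monthly spend from request volume and a blended token price."""
    total_tokens = requests_per_day * tokens_per_request * days
    return total_tokens / 1_000_000 * price_per_million_tokens


# Hypothetical workload: 10,000 requests a day at 1,500 tokens each.
premium = monthly_cost(10_000, 1_500, price_per_million_tokens=10.0)
budget = monthly_cost(10_000, 1_500, price_per_million_tokens=1.0)

print(premium, budget)   # 4500.0 450.0
print(premium - budget)  # what the score gap has to be worth each month
```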

They don’t measure latency. Some high-scoring models take 20+ seconds to respond. For real-time applications, that’s a dealbreaker regardless of MMLU score.
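Latency is easy to measure yourself. This is a minimal sketch with a hypothetical `fake_model` that just sleeps; replace it with a real call and use your own prompts, since response time usually depends on output length.

```python
import time
from statistics import median


def fake_model(prompt: str) -> str:
    """Stand-in for a real model call; sleeps briefly to simulate work."""
    time.sleep(0.01)
    return "ok"


def measure_latency(generate, prompts) -> dict:
    """Time each call and return the median and worst-case latency in seconds."""
    timings = []
    for prompt in prompts:
        start = time.perf_counter()
        generate(prompt)
        timings.append(time.perf_counter() - start)
    return {"median_s": median(timings), "max_s": max(timings)}


stats = measure_latency(fake_model, ["first prompt", "second prompt", "third prompt"])
print(stats)
```

The worst case is reported alongside the median because, for real-time applications, the slow responses are the ones users notice.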

They don’t account for model updates. “GPT-4o” at its May 2024 launch is not necessarily the same weights as “GPT-4o” in late 2024. Providers can update the model behind a name without renaming it. Scores measured at one point may not reflect current behavior. Always check the evaluation date.

How to use benchmark data without getting misled

One pattern that works in practice:

Match benchmark to task. Code → HumanEval. Broad knowledge Q&A → MMLU. Conversational quality → Chatbot Arena. Math → MATH benchmark. Factual reliability → TruthfulQA.
Use benchmarks as a filter, not a selector. Models below a threshold score are probably wrong for your task. Models above the threshold all need testing on your actual data.
Build your own 5-10 case test set. Run it manually on the top 3 candidates. Pick based on that, not the leaderboard ranking.
Factor in cost and latency. A 3% score gap rarely matters more than a 5× cost difference or a 3-second latency improvement at your usage volume.
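The filter-then-select pattern above can be sketched in a few lines. All model names, scores, and prices here are hypothetical, and the 0.75 threshold is an assumed cutoff you would set for your own task.

```python
# Hypothetical candidates: published benchmark score, score on your own
# test set, and price per million tokens.
CANDIDATES = [
    {"name": "model_a", "benchmark": 0.91, "own_test": 0.70, "cost_per_m": 10.0},
    {"name": "model_b", "benchmark": 0.88, "own_test": 0.90, "cost_per_m": 2.0},
    {"name": "model_c", "benchmark": 0.62, "own_test": 0.40, "cost_per_m": 0.5},
]


def shortlist(candidates, min_benchmark: float):
    """Use the benchmark as a filter: drop models below the threshold."""
    return [c for c in candidates if c["benchmark"] >= min_benchmark]


def pick(candidates, min_benchmark: float = 0.75):
    """Select among the survivors by own-test score, breaking ties on lower cost."""
    passing = shortlist(candidates, min_benchmark)
    if not passing:
        return None
    return max(passing, key=lambda c: (c["own_test"], -c["cost_per_m"]))


print(pick(CANDIDATES)["name"])  # model_b
```

In this made-up data the model with the best leaderboard score loses to a cheaper one that does better on the task-specific test set, which is the outcome the pattern is designed to surface.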

Benchmarks are a starting point, not a conclusion. And they’re a much better starting point than “this one has more marketing.”

FAQ

Why do models score high on benchmarks but fail at simple tasks?

Because benchmarks test specific capabilities on specific test distributions. A model can score 90% on MMLU’s knowledge questions while failing at letter counting (a tokenization problem, not a knowledge problem) or at multi-step arithmetic (which requires reasoning chains, not pattern matching). Each capability is tested independently. High scores in one area don’t transfer automatically to others. This is why you can’t pick one benchmark and call it definitive.

Is Chatbot Arena more reliable than MMLU?

For conversational use cases, yes. Arena measures actual human preference on real prompts, not multiple-choice pattern recognition on a fixed dataset. But it has its own biases: the prompt distribution skews conversational, and raters can be influenced by response length and fluency even when a shorter answer is factually better. Neither benchmark is universally more reliable. They measure different things. Use MMLU to filter for knowledge breadth, Arena to assess conversational quality.

Can I trust the HuggingFace Open LLM Leaderboard?

For open-source models, it’s reliable. The leaderboard runs standardized benchmarks with fixed settings, so you can reproduce most runs yourself. The limitation is that you can’t directly compare against closed models like GPT-4o or Claude, because those aren’t evaluated through the same pipeline. For closed-model comparisons, use Chatbot Arena or the official model cards.

What benchmark should I use for a customer-facing chatbot?

Start with Chatbot Arena for overall conversational quality. Add MT-Bench if multi-turn consistency matters for your use case. If your chatbot needs to be factually accurate on specific claims, TruthfulQA is a useful filter. Then run your own test with real sample questions from your actual use case. That last step is the one that actually matters, and it’s the one most teams skip.

How much should I care about benchmark scores when models are this close?

Less than you probably think. When frontier models are within a few percentage points of each other on the major benchmarks, the differences that actually affect your use case are usually prompt sensitivity, cost, latency, and how the model handles your specific failure modes. Build a small test set from your real prompts, run it on the top candidates, and pick based on what you observe. That takes less time than you’d expect and produces more reliable signal than leaderboard comparisons.

Stop reading about LLM benchmarks. Build one for your own use case. TinkerLLM’s evaluation exercises teach you to measure model output quality in real tasks. The first 50 exercises are free, no card needed.

