LLM Evaluation: How to Know If Your Prompt Works

You wrote a prompt that returns a clean, well-formatted answer. You tweak a few words. The new version also looks good. You ship it. Three weeks later, a user runs an input you didn’t anticipate, and the output is completely wrong.

That’s the problem with eyeballing. It feels like evaluation. It isn’t.

LLM evaluation is the practice of systematically measuring whether your prompt or system does what you actually need it to do, not just whether a handful of outputs looked okay during development. Getting this right separates a prototype from a production system.

Here’s how to actually do it.

💡 Try this hands-on: Lesson 28: LLM Evaluation & Observability → on TinkerLLM walks through RAGAS, LLM-as-judge, and building an eval pipeline from scratch. The first 50 exercises are free, no card needed.

Why “It Looks Right” Is Not an Evaluation

You’re working with a system that is probabilistic by nature. Run the same prompt twice and you can get meaningfully different outputs. Run it on 50 different inputs and the variance grows. What looked right on the 5 examples you tested may be broken on the 45 you didn’t.

There’s also a subtler problem. Because you wrote the prompt, you already know what a good answer looks like. Your brain pattern-matches to “correct” before you assess whether the model got there via the right reasoning. You’re not evaluating. You’re confirming your own assumptions. I’ve watched teams iterate on prompts for two weeks and still not know whether version 8 was actually better than version 3.

And hallucinations make this worse. A model can produce a confident, fluently written answer that contains an invented fact. Nothing in the text signals uncertainty: the model states invented facts in the same tone as correct ones. The mechanics of why are in AI Hallucinations: When Models Lie Confidently. You need a reference to catch the difference.

Systematic LLM evaluation solves all of this. It runs your prompt against a fixed set of representative inputs and measures the outputs against a defined standard. You can’t unconsciously move the goalposts on a test set.

The Three Levels of LLM Evaluation

Not every project needs the same level of rigor. Here’s how to match the approach to what you’re actually building.

Level 1: Manual spot-checking

You review model outputs yourself, against a fixed checklist or rubric. This is the right starting point for any new prompt, and it’s the only feasible approach in the first few hours of development.

What makes this better than random eyeballing:

Fix a set of 15 to 20 test inputs before you start iterating. Don’t change the test set as you go.
Define what “correct” means before you look at any output. Is it exact match on a specific field? A score above 4 on a 1-to-5 rubric? The absence of certain phrases?
Log every output version alongside the prompt version that produced it. The diff between versions is where you learn what your changes actually did.
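The logging habit above can be sketched in a few lines. This is a minimal illustration, not a prescribed tool: `run_and_log`, the JSON Lines layout, and the `generate` callable (a stand-in for whatever function calls your model) are all assumptions made for the example.

```python
import json
import time
from pathlib import Path
from typing import Callable


def run_and_log(
    prompt_version: str,
    prompt_template: str,
    test_inputs: list[str],
    generate: Callable[[str], str],  # hypothetical wrapper around your model call
    log_path: Path,
) -> int:
    """Run every fixed test input and append one JSON line per output."""
    with log_path.open("a", encoding="utf-8") as log:
        for case_id, text in enumerate(test_inputs):
            output = generate(prompt_template.format(input=text))
            record = {
                "prompt_version": prompt_version,  # ties the output to the prompt that made it
                "case_id": case_id,                # stable index into the fixed test set
                "input": text,
                "output": output,
                "logged_at": time.time(),
            }
            log.write(json.dumps(record, ensure_ascii=False) + "\n")
    return len(test_inputs)
```

Because every record carries both `prompt_version` and `case_id`, you can later line up two versions case by case and read the diff.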

The limit of Level 1 is you. It doesn’t scale past 50 inputs, it’s slow, and it introduces your own biases. I’d treat manual spot-checking as your entry point, not your permanent approach.

Level 2: Automated metrics

For classification, extraction, and structured output tasks, you can automate the scoring. The model either extracted the right date or it didn’t. You don’t need a human in the loop for that.

Common metrics you’ll encounter:

Exact match (EM): Output matches the expected string exactly. Useful for extracting a specific field value from a document, where there’s a single right answer.
F1 / token overlap: Partial credit for partially correct outputs. Standard in QA benchmarks where acceptable paraphrases exist.
BLEU / ROUGE: Measures n-gram overlap between generated text and a reference. Common for translation and summarization, though both have well-known problems with paraphrasing. A factually accurate summary can score low if it uses different phrasing than your reference.

For many structured tasks, exact match on a JSON field over a 200-case test set tells you exactly where you stand. That’s faster and more honest than any BLEU score.
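Here is a rough sketch of the two simplest metrics, assuming the model returns a JSON object and that whitespace tokenization is good enough for your task. The function names are made up for this example, and the F1 here skips the punctuation and article normalization that some QA benchmarks apply.

```python
import json
from collections import Counter


def exact_match_on_field(raw_output: str, expected: str, field: str) -> bool:
    """True when the output is a JSON object whose `field` equals the expected value."""
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        return False  # malformed JSON counts as a miss
    return isinstance(parsed, dict) and parsed.get(field) == expected


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 on lowercased, whitespace-split tokens."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)  # both empty scores 1.0, one empty scores 0.0
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def accuracy(results: list[bool]) -> float:
    """Share of test cases that passed."""
    return sum(results) / len(results) if results else 0.0
```

Run `exact_match_on_field` over every case in the test set and pass the list of booleans to `accuracy` to get a single number you can compare across prompt versions.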

The limit of Level 2 is that it requires reference answers. You need a ground-truth dataset, which takes time to build. And string-based metrics can’t capture semantic correctness.

Level 3: LLM-as-judge

When you can’t define “correct” as a string match and you don’t have reference answers, you ask another LLM to evaluate the output. This is LLM-as-judge.

The setup: you write a judge prompt that takes the original question, the model’s output, and optionally a reference answer, then scores the output on one or more dimensions. You can score for accuracy, relevance, helpfulness, safety, or format compliance.

A basic judge prompt for an answer quality task:

You are evaluating the quality of an AI assistant's response.

Question: {question}
Response: {response}

Rate the response on accuracy from 1 to 5, where:
1 = Contains factual errors
3 = Partially accurate, minor omissions
5 = Accurate and complete

Return only a JSON object: {"score": <integer>}

Run this at scale across your test set and you get distribution statistics. That’s your evaluation. You can see how your score distribution shifts as you change the underlying prompt.
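To make “run this at scale” concrete, here is a minimal harness built around the judge prompt above. It is a sketch under assumptions: `call_judge` is a hypothetical function that sends a prompt string to whichever judge model you use and returns its raw text reply, and each test case is assumed to be a dict with `question` and `response` keys.

```python
import json
from collections import Counter
from statistics import mean
from typing import Callable, Optional

JUDGE_TEMPLATE = """You are evaluating the quality of an AI assistant's response.

Question: {question}
Response: {response}

Rate the response on accuracy from 1 to 5, where:
1 = Contains factual errors
3 = Partially accurate, minor omissions
5 = Accurate and complete

Return only a JSON object: {{"score": <integer>}}"""


def parse_score(raw: str) -> Optional[int]:
    """Return the integer score, or None when the judge reply is unusable."""
    try:
        score = json.loads(raw)["score"]
    except (json.JSONDecodeError, KeyError, TypeError):
        return None
    if isinstance(score, bool) or not isinstance(score, int):
        return None
    return score if 1 <= score <= 5 else None


def score_test_set(cases: list[dict], call_judge: Callable[[str], str]) -> dict:
    """Score every case and summarize the distribution of judge scores."""
    scores: list[int] = []
    unparseable = 0
    for case in cases:
        prompt = JUDGE_TEMPLATE.format(question=case["question"], response=case["response"])
        score = parse_score(call_judge(prompt))
        if score is None:
            unparseable += 1  # track these; a high count means the judge prompt needs work
        else:
            scores.append(score)
    return {
        "n_scored": len(scores),
        "n_unparseable": unparseable,
        "mean": mean(scores) if scores else None,
        "distribution": dict(sorted(Counter(scores).items())),
    }
```

Counting unparseable replies separately, rather than treating them as a low score, keeps judge formatting failures from masquerading as quality problems in the model you are testing.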

The limit of LLM-as-judge is that it introduces its own biases. Judge models can show position bias, preferring the first of two options regardless of quality. They can favor verbosity, rewarding longer answers even when shorter ones are better. And there’s self-preference bias: GPT-4 slightly favors GPT-4 outputs when judging. The Anthropic evaluation guide covers these biases in detail and how to counter them with swapped ordering and calibration checks.
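Swapped ordering is easy to sketch for pairwise comparisons. The version below is an illustration only: `judge_pair` is a hypothetical callable that shows the judge two answers and returns the string "first" or "second" for the one it prefers.

```python
from typing import Callable


def pairwise_with_swap(
    question: str,
    answer_a: str,
    answer_b: str,
    judge_pair: Callable[[str, str, str], str],  # hypothetical: returns "first" or "second"
) -> str:
    """Judge the pair twice with the order swapped.

    Returns "A" or "B" when both passes agree, "inconsistent" otherwise.
    """
    first_pass = judge_pair(question, answer_a, answer_b)   # A shown first
    second_pass = judge_pair(question, answer_b, answer_a)  # B shown first
    winner_first_pass = "A" if first_pass == "first" else "B"
    winner_second_pass = "B" if second_pass == "first" else "A"
    if winner_first_pass == winner_second_pass:
        return winner_first_pass
    return "inconsistent"
```

A judge that always picks whichever answer it sees first will return "inconsistent" on every pair, so the share of inconsistent verdicts is a rough gauge of how much position is driving the result.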

RAGAS for RAG Pipelines

If your prompt is part of a retrieval-augmented generation (RAG) system, standard LLM evaluation misses the retrieval failure modes. A model can write a fluent, confident answer that has nothing to do with the documents you retrieved. Or it can retrieve the right documents but then generate text that contradicts them. Both are failures. Neither shows up in a naive “does this look right” check.

RAGAS is a widely used evaluation framework for RAG systems. It measures four things:

Metric	What It Catches
Faithfulness	Is every claim in the answer supported by the retrieved context?
Answer relevance	Does the answer address the actual question asked?
Context recall	Did retrieval surface the documents needed to answer?
Context precision	Are the retrieved documents actually relevant to the question?

You can run RAGAS against any RAG pipeline with its Python library; the RAGAS documentation covers the setup. It requires a test set of question-answer pairs plus the retrieved context chunks for each. With 30 to 50 test cases, you get a reasonable signal on which layer of your pipeline is failing.

A faithfulness score below 0.6 usually means the generator is hallucinating against the retrieved context. Low context recall means your retriever is missing relevant documents. Those are different failures with different root causes. Without structured RAG evaluation, you’d spend days guessing which layer to fix.

I’ve seen teams rebuild their entire retriever when the real problem was the generator ignoring context. Structured eval catches that in an afternoon.
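The triage logic above fits in a small function. This is a sketch, not part of RAGAS: the dictionary keys are hypothetical names for scores you have already computed, the 0.6 faithfulness floor is the rule of thumb from this post, and the context recall floor is an assumed placeholder you should tune on your own data.

```python
def triage_rag_scores(
    scores: dict[str, float],
    faithfulness_floor: float = 0.6,  # rule of thumb from the text above
    recall_floor: float = 0.6,        # assumed placeholder, not a documented threshold
) -> list[str]:
    """Map aggregate scores to the pipeline layer most likely at fault."""
    findings = []
    if scores["context_recall"] < recall_floor:
        findings.append("retriever: relevant documents are not being surfaced")
    if scores["faithfulness"] < faithfulness_floor:
        findings.append("generator: answers are not grounded in the retrieved context")
    return findings
```

An empty list does not prove the pipeline is healthy. It only means neither score fell below the floor you chose.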

Building a Golden Dataset

Every evaluation approach above assumes you have test inputs to run against. Building that dataset is the most important and least glamorous part of LLM evaluation.

A golden dataset is a fixed set of inputs where you know what a good output looks like. It’s “golden” because you don’t change it mid-iteration. The discipline of keeping the test set fixed is what makes evaluation meaningful over time.

How to build one that’s actually useful:

Step 1: Collect representative inputs. Start with real inputs from production logs if you have them. If you don’t, write 30 to 50 inputs that cover the main use cases, the common variations, and at least a handful of edge cases you’d find embarrassing to fail on.

Step 2: Write expected outputs. For each input, write down what a correct output looks like. If you’re evaluating classification, record the correct label. If you’re evaluating summarization, write a reference summary. If you’re using LLM-as-judge, write the scoring criteria explicitly before you see any model outputs.

Step 3: Lock the dataset before you iterate. Once you start testing prompt variations, don’t change the dataset. If you find a case the dataset doesn’t cover, add it, but document that you added it. Don’t silently update the test set to match a version that passed.
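One lightweight way to enforce the lock is to fingerprint the dataset and check the fingerprint before every eval run. This is a sketch under the assumption that your cases are JSON-serializable dicts; the function names are invented for the example.

```python
import hashlib
import json


def dataset_fingerprint(cases: list[dict]) -> str:
    """SHA-256 over a canonical JSON dump, so key order does not change the hash."""
    canonical = json.dumps(cases, sort_keys=True, ensure_ascii=False, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def assert_unchanged(cases: list[dict], locked_fingerprint: str) -> None:
    """Raise if the dataset differs from the version that was locked."""
    current = dataset_fingerprint(cases)
    if current != locked_fingerprint:
        raise ValueError(
            f"Golden dataset changed: expected {locked_fingerprint[:12]}, got {current[:12]}. "
            "If the change is intentional, document it and record the new fingerprint."
        )
```

Store the fingerprint next to your results. When you add a case on purpose, the new fingerprint becomes the written record that the test set changed.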

30 to 50 cases is the minimum for catching meaningful regressions. With fewer than 30, a change that helps 2 cases and breaks 1 can look like a net win when it’s noise. With 100 cases, you can start to trust sub-group analysis.
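To see why “helps 2, breaks 1” is noise, you can run an exact sign test on the cases whose pass/fail result changed between two prompt versions. This is one simple option chosen for illustration, not the only valid test. It assumes each case is scored pass or fail and ignores cases where both versions gave the same result.

```python
from math import comb


def sign_test_p_value(improved: int, regressed: int) -> float:
    """Two-sided exact sign test on cases that flipped between two prompt versions.

    Under the null hypothesis a flipped case is equally likely to be an
    improvement or a regression, so the count follows Binomial(n, 0.5).
    """
    n = improved + regressed
    if n == 0:
        return 1.0  # nothing changed, so there is no evidence either way
    k = max(improved, regressed)
    one_sided_tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * one_sided_tail)


# 2 cases fixed, 1 broken: the p-value is 1.0, indistinguishable from chance.
print(sign_test_p_value(2, 1))
```

With 12 cases fixed and 2 broken, the same function returns roughly 0.013, which is the kind of gap a larger test set lets you detect.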

Try It Yourself

You can build and run a basic LLM evaluation pipeline without any external tooling. TinkerLLM Lesson 28 walks through the full pattern: writing a judge prompt, scoring against a small golden dataset, and reading RAGAS output on a sample RAG system.

TinkerLLM uses a BYOK model, your own Gemini API key from Google AI Studio. Your key stays in your browser. The first 50 exercises are free.

Open Lesson 28: LLM Evaluation & Observability →

When Formal Evaluation Is Overkill (and When It Isn’t)

Not every prompt needs a golden dataset and a judge pipeline. Here’s when the overhead isn’t worth it:

Prototyping: If you’re still figuring out whether an approach is viable, reviewing 10 to 15 outputs manually is fine. Don’t build an evaluation framework before you know whether the prompt architecture is going to work at all.

One-off generation tasks: If you’re using an LLM to generate a single document, a one-time analysis, or a template you’ll use once, there’s nothing to measure at scale. You read it, you judge it, you’re done.

Tasks where human judgment is the only valid ground truth: Some creative or highly nuanced outputs can’t be evaluated automatically in a useful way. For those, a structured rubric and human reviewers are the right answer.

Where you can’t skip formal evaluation:

Any prompt that runs in production on real user data
Any prompt that makes decisions with downstream consequences (routing, classification, scoring)
Any RAG system where factual accuracy matters
Any prompt you expect to revise more than twice

I’d put it this way: if you can’t answer “is v4 better or worse than v2?”, you don’t have an evaluation. You have a feeling.

FAQ

What’s the difference between LLM evaluation and LLM benchmarks?

Benchmarks like MMLU and HumanEval measure a model’s capabilities on standardized academic tasks. They’re designed to compare models against each other on a common test set. LLM evaluation in the context of this post is about measuring whether your specific prompt works for your specific task on your specific inputs. Your task performance is what matters, not a model’s rank on a leaderboard. The two inform each other, but they’re distinct activities. More on reading benchmark results in LLM Benchmarks Explained.

Do I need RAGAS specifically, or can I build my own RAG evaluation?

You can build your own. RAGAS is useful because the metrics are well-defined, the library handles the scoring infrastructure, and the results are reproducible. But if your needs are simpler, a faithfulness check (does the answer contradict the retrieved documents?) can be implemented as a single LLM-as-judge prompt in about 20 lines of Python. I’d start with the RAGAS documentation for the metric definitions even if you end up not using the library. The concepts transfer directly.
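A home-grown faithfulness check might look like the sketch below. It is not how the RAGAS library computes its metric. The prompt wording, the JSON reply shape, and the `call_judge` callable (a stand-in for your own model client) are all assumptions made for illustration.

```python
import json
from typing import Callable

FAITHFULNESS_TEMPLATE = """You are checking an answer against source documents.

Context:
{context}

Answer:
{answer}

Does the answer make any claim that contradicts the context or is not supported by it?
Return only a JSON object: {{"faithful": true or false, "unsupported_claims": [<strings>]}}"""


def check_faithfulness(
    answer: str,
    context_chunks: list[str],
    call_judge: Callable[[str], str],  # hypothetical: prompt in, raw reply text out
) -> dict:
    """Ask a judge model whether the answer is grounded in the retrieved chunks."""
    prompt = FAITHFULNESS_TEMPLATE.format(context="\n---\n".join(context_chunks), answer=answer)
    raw = call_judge(prompt)
    try:
        verdict = json.loads(raw)
    except json.JSONDecodeError:
        verdict = None
    if not isinstance(verdict, dict) or not isinstance(verdict.get("faithful"), bool):
        # Keep unusable replies visible instead of guessing a verdict.
        return {"faithful": None, "unsupported_claims": [], "error": "unusable judge reply"}
    claims = verdict.get("unsupported_claims")
    return {
        "faithful": verdict["faithful"],
        "unsupported_claims": claims if isinstance(claims, list) else [],
    }
```

Asking for the list of unsupported claims, not just a yes or no, gives you something to read when you review the failures by hand.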

How many test cases do I actually need?

30 to 50 is the practical minimum for catching meaningful regressions. With fewer than 30, normal statistical variance makes it hard to distinguish a real improvement from noise. With 100 cases, you can start to trust sub-group analysis (how does it perform on long inputs vs. short inputs?). For production systems handling significant volume, aim for 200 or more. The more consequential the task, the larger the dataset you need before trusting a prompt change.

Can I use GPT-4 to judge GPT-4 outputs?

Yes, but be aware of self-preference bias. Multiple research groups have found that GPT-4 tends to rate GPT-4 outputs slightly higher than equivalently good outputs from other models. The effect is modest and doesn’t invalidate GPT-4 as a judge, but it’s worth cross-checking against Claude or Gemini Pro as a secondary judge on the same test set if self-preference concerns you. Swapping the order of options in pairwise comparisons also reduces the positional bias that shows up in head-to-head evaluations. In my experience, the disagreements between two different judge models are more useful than either score alone.

What evaluation approach works best for open-ended generation tasks?

LLM-as-judge with a well-designed rubric is typically the most practical option. Break the rubric into dimensions: accuracy, relevance, format compliance, completeness. Score each dimension from 1 to 5. Run the same output through two different judge models and flag cases where they disagree by more than 2 points. Those disagreements are almost always the genuinely hard cases worth reviewing manually. Consistent judge disagreement on a specific input type is a signal that your rubric needs more precision on that dimension.
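Flagging the disagreements is a few lines once both judges have scored the same cases. The sketch below assumes each judge’s scores are stored as a hypothetical nested dict of the form `{case_id: {dimension: score}}`.

```python
def flag_disagreements(
    scores_a: dict[str, dict[str, int]],
    scores_b: dict[str, dict[str, int]],
    threshold: int = 2,
) -> list[tuple[str, str, int]]:
    """Return (case_id, dimension, gap) wherever two judges differ by more than `threshold`."""
    flagged = []
    for case_id in sorted(scores_a.keys() & scores_b.keys()):
        shared_dimensions = scores_a[case_id].keys() & scores_b[case_id].keys()
        for dimension in sorted(shared_dimensions):
            gap = abs(scores_a[case_id][dimension] - scores_b[case_id][dimension])
            if gap > threshold:
                flagged.append((case_id, dimension, gap))
    return flagged
```

Grouping the flagged rows by dimension or by input type shows whether the disagreement is scattered or concentrated, which is the signal that a rubric dimension needs tightening.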

Stop reading about LLM evaluation. Try running an actual judge prompt against a real test set. The first 50 exercises on TinkerLLM are free, no card needed.

Open the playground →
