LLM Observability: How to Debug What You Can't See
You can't read a stack trace when an LLM misbehaves. Here's how to trace prompts, log responses, and run evals so you actually know what broke.
TL;DR
- LLM bugs don't produce stack traces. Observability means capturing prompt/response pairs, token counts, and latency so you can replay failures.
- LangSmith and Langfuse are the two main tools. LangSmith is easier if you're on LangChain. Langfuse is self-hostable and provider-agnostic.
- Evaluation is the hardest part. LLM-as-judge is practical but needs calibration against human labels before you trust the scores.
- You can start with 30 lines of Python and a log file. You don't need a full observability platform on day one.
- The single most impactful step: log prompt plus response plus latency on every call before anything else.
You shipped the feature. It works in dev. Two days after launch, a user screenshots a response and posts it somewhere you wish they hadn’t. The LLM said something wrong. You have no idea which prompt triggered it, what context was in the window, what temperature you were running, or whether this happens every time or once in 50 calls. You’re debugging blind.
I’ve seen this exact scenario play out a handful of times, and every team had the same reaction: “we didn’t think we’d need observability this early.” Debugging blind is the problem LLM observability solves, and most teams only realize they need it after something breaks in public.
Why LLM bugs are harder to debug than regular bugs
In a regular app, failures leave artifacts. An HTTP 500 has a stack trace. A wrong database query has a query log. You replay the request, reproduce the failure, fix the bug.
LLM apps don’t work that way. The problem might be:
- A system prompt that behaves fine 90% of the time and breaks on specific inputs
- A retrieved chunk from your RAG pipeline that contains outdated information
- A temperature setting that’s too high for the task
- A model that’s confidently wrong 3 times out of 100
None of these show up in your app logs. Unless you capture what went into the model and what came out, there’s no trail to follow. I’ve spent more than one afternoon staring at “the model said something weird” with nothing to debug against because we weren’t logging.
LLM outputs also don’t fail deterministically. If you get a bad response at temperature 0.8 and didn’t log it, you probably can’t reproduce it: the same prompt may return a perfectly good answer on the next call. The output is gone, and you’re left with a user complaint and no evidence.
The three things you actually need to observe
You can build LLM observability in stages. These are the three levels, roughly ordered by impact:
1. Prompt and response logging. The baseline. Every model call captures: the exact prompt sent (including system prompt), the exact response received, the model name, token counts for input and output, latency, and a timestamp. This alone lets you replay failures. It’s also the raw material for everything else.
2. Traces. For applications with multiple LLM calls, a single user request might trigger 3-7 model calls. Tracing captures these as a linked set with parent/child relationships. You see which call in the chain produced the bad output, how long each step took, and what data passed between steps. Without tracing, you just see 7 separate log entries with no context.
3. Evaluations. The hardest part. Logging tells you what happened. Evaluation tells you whether it was any good. This is where you build test sets, run outputs through a scoring pipeline, and track quality metrics over time. You catch regressions when you change a prompt. You measure quality before deciding whether to ship a model upgrade.
Start with logging. Add tracing if you have multi-step pipelines. Add evals when you need to measure quality at scale, not just after incidents. I try to keep this order in mind because teams sometimes jump to evaluation frameworks before they have reliable logs, and that’s backwards.
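The parent/child linking described under traces can be sketched without any platform. This is a minimal illustration of the idea, not how LangSmith or Langfuse implement it; the `span` helper, its field names, and the in-memory `SPANS` list are all hypothetical.

```python
import time
import uuid
from contextlib import contextmanager
from contextvars import ContextVar

# Hypothetical in-memory store; a real setup would write to a log or tracing backend.
SPANS = []
_current_span = ContextVar("current_span", default=None)


@contextmanager
def span(name):
    """Record one step of a request, linked to whichever span is currently open."""
    parent = _current_span.get()
    record = {
        "span_id": uuid.uuid4().hex,
        # Child spans inherit the trace id; a span with no parent starts a new trace.
        "trace_id": parent["trace_id"] if parent else uuid.uuid4().hex,
        "parent_id": parent["span_id"] if parent else None,
        "name": name,
    }
    token = _current_span.set(record)
    start = time.perf_counter()
    try:
        yield record
    finally:
        record["latency_ms"] = round((time.perf_counter() - start) * 1000, 1)
        _current_span.reset(token)
        SPANS.append(record)


# One user request that triggers two steps (stand-ins for real model calls).
with span("answer_question") as root:
    with span("retrieve_context"):
        pass
    with span("generate_answer"):
        pass
```

Every record shares the root's `trace_id`, and each child carries the root's `span_id` as its `parent_id`. That shared id is what turns separate log entries into one linked request.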
LangSmith: where most teams start
LangSmith is LangChain’s observability product. If you’re already using LangChain, wiring it up is two environment variables:
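import os
# Set these before any LangChain code runs. Newer LangSmith docs spell them
# LANGSMITH_TRACING and LANGSMITH_API_KEY; check which your SDK version expects.
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "your-api-key"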
From that point, every LangChain call gets automatically traced. You see prompt inputs, model outputs, token counts, and latency in a web UI. You can filter by time, by chain type, by model, and by any metadata you attach.
If you’re not using LangChain, LangSmith has an SDK that works with raw API calls too:
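from google import genai
from langsmith import traceable

# Assumes the google-genai SDK; Client() reads the API key from the environment.
client = genai.Client()

@traceable
def call_llm(prompt: str) -> str:
    response = client.models.generate_content(
        model="gemini-3.5-flash",
        contents=prompt,
    )
    return response.text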
The @traceable decorator wraps your function and sends inputs, outputs, and timing to LangSmith. You don’t change your model calls. You add the decorator and the data flows automatically.
LangSmith’s evaluation tooling is solid. You can run a test suite against a saved dataset, have a judge model score each output, and track quality metrics across prompt versions. The free tier covers 5,000 traces per month. I find that’s enough to instrument a prototype and start learning what breaks before you need to decide on a paid tier.
Langfuse: the open-source option
Langfuse covers the same ground as LangSmith but is fully open-source, self-hostable, and not tied to any one framework. For teams with data residency requirements or those who want full control over their observability data, it’s worth evaluating.
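# Import path for Langfuse Python SDK v2; newer SDK versions expose it as `from langfuse import observe`.
from langfuse.decorators import observe

@observe()
def process_query(query: str) -> str:
    # Replace the body with your own model call; `client` is assumed to be
    # an already-configured model client (for example genai.Client()).
    response = client.models.generate_content(model="gemini-3.5-flash", contents=query)
    return response.text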
You can run Langfuse locally with Docker in under five minutes. The UI shows traces organized by session, prompt/response logs, latency graphs, and cost tracking. The core experience is similar to LangSmith.
One thing Langfuse does particularly well: prompt version management. You can version your system prompts in the UI, deploy different versions, and track which version is live in production. If you’re iterating on prompts frequently, that versioning matters for knowing what you actually shipped when a problem surfaces two weeks later.
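If you are not using a platform's prompt management, one lightweight way to answer "which version was live?" is to log a fingerprint of the system prompt with every call. This is a sketch of that idea, not Langfuse's mechanism; `prompt_version` is a hypothetical helper.

```python
import hashlib


def prompt_version(system_prompt: str) -> str:
    """Return a short, stable fingerprint of the prompt text."""
    return hashlib.sha256(system_prompt.encode("utf-8")).hexdigest()[:12]


v1 = prompt_version("You are a helpful support agent.")
v2 = prompt_version("You are a helpful support agent. Answer in one paragraph.")
```

Any change to the prompt text changes the fingerprint, so storing it next to each logged call lets you tie a bad response back to the exact prompt wording.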
My practical take: if you’re on LangChain and want the fastest setup, start with LangSmith. If you’re calling model APIs directly (Gemini, OpenAI, Anthropic) and want a self-hosted option, try Langfuse. I’ve used both and the day-to-day UX is roughly equivalent once you’re past initial setup.
DIY: 30 lines when you can’t use third-party tools
Some teams can’t send production data to an external service. Some are still prototyping and don’t want another dependency yet. This minimal version gives you something rather than nothing:
import json
import time
from datetime import datetime, timezone
from pathlib import Path

LOG_FILE = Path("llm_calls.jsonl")

def logged_generate(client, model, prompt, **kwargs):
    """Wraps a model call with basic observability logging."""
    start = time.perf_counter()
    response = client.models.generate_content(
        model=model,
        contents=prompt,
        **kwargs
    )
    latency_ms = (time.perf_counter() - start) * 1000
    log_entry = {
        # Timezone-aware UTC timestamp; datetime.utcnow() is deprecated since Python 3.12
        "ts": datetime.now(timezone.utc).isoformat(),
        "model": model,
        "prompt": prompt,
        "response": response.text,
        "input_tokens": response.usage_metadata.prompt_token_count,
        "output_tokens": response.usage_metadata.candidates_token_count,
        "latency_ms": round(latency_ms, 1),
    }
    # Append one JSON object per line
    with open(LOG_FILE, "a", encoding="utf-8") as f:
        f.write(json.dumps(log_entry) + "\n")
    return response
One JSON line per call in a .jsonl file. You can open it in any editor, query it with jq, load it into a spreadsheet, or dump it into a database later. It’s not a platform. But it captures the data you need to replay failures.
When something breaks, you have: the exact prompt the model saw, the exact response it gave, when it happened, how long it took, and token counts for cost tracking. That’s the baseline. I think of everything else in LLM observability as built on top of this foundation.
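Replaying a failure from this log comes down to filtering lines. A minimal sketch, assuming the field names written by the wrapper above; `find_calls` is a hypothetical helper and the sample entries are made up for illustration.

```python
import json
import tempfile
from pathlib import Path


def find_calls(log_path, text=None, min_latency_ms=0.0):
    """Return entries whose response contains `text` and whose latency is at least `min_latency_ms`."""
    matches = []
    with open(log_path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # tolerate blank lines
            entry = json.loads(line)
            if text is not None and text.lower() not in entry["response"].lower():
                continue
            if entry["latency_ms"] < min_latency_ms:
                continue
            matches.append(entry)
    return matches


# Made-up sample entries so the sketch runs without a real log.
sample = Path(tempfile.mkdtemp()) / "sample_llm_calls.jsonl"
rows = [
    {"prompt": "Refund policy?", "response": "Refunds take 5 days.", "latency_ms": 420.0},
    {"prompt": "Reset password?", "response": "Use the settings page.", "latency_ms": 3100.5},
]
sample.write_text("".join(json.dumps(r) + "\n" for r in rows), encoding="utf-8")

refund_calls = find_calls(sample, text="refund")
slow_calls = find_calls(sample, min_latency_ms=2000)
```

The same two filters cover the most common incident questions: which calls produced a particular phrase, and which calls were slow.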
Evaluation: the part most teams skip
Logging tells you what happened. Evaluation tells you whether it was good. Most teams ship without evals, then add them after a quality incident. I’ve seen this pattern enough times that I now recommend setting up at least a minimal golden dataset before launch, not after.
The practical approach for most applications is LLM-as-judge: write a scoring prompt, send it a (question, reference answer, LLM response) triple, and have a capable model score the response on a dimension you care about. Factual accuracy. Adherence to format. Helpfulness.
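import json

from google import genai
from google.genai import types

# Assumes the google-genai SDK; Client() reads the API key from the environment.
judge_client = genai.Client()

JUDGE_PROMPT = """
You are evaluating an AI assistant's response.
Question: {question}
Expected answer: {reference}
AI response: {response}
Rate the response for factual accuracy on a scale of 1-5.
Return only valid JSON: {{"score": N, "reason": "brief explanation"}}
"""

def judge_response(question, reference, response):
    prompt = JUDGE_PROMPT.format(
        question=question,
        reference=reference,
        response=response,
    )
    result = judge_client.models.generate_content(
        model="gemini-3.1-pro-preview",
        contents=prompt,
        # Ask for raw JSON so json.loads() does not trip on Markdown code fences
        config=types.GenerateContentConfig(response_mime_type="application/json"),
    )
    return json.loads(result.text)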
The RAGAS framework formalizes this for RAG applications. It has built-in metrics for faithfulness (does the response use only facts from the retrieved context?), answer relevance, and context recall. If you’re building a RAG app, RAGAS is worth an afternoon to set up.
Two calibration steps that matter before you trust LLM-as-judge scores: first, run the judge on 20-30 examples where you already have human labels and verify it agrees. If agreement is below 80%, your judge prompt needs revision. Second, maintain a golden dataset of test cases that should always pass. Run it before shipping any prompt change. That’s your regression test for LLM behavior, the same way unit tests protect regular code.
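Both calibration steps fit in a few lines. The sketch below is one way to do it, under stated assumptions: the scores are made up, `agreement_rate` and `failing_cases` are hypothetical helpers, and `fake_generate` and `fake_judge` are stand-ins for your real model call and a judge function that returns a dict with a `score` key.

```python
def agreement_rate(judge_scores, human_scores, tolerance=0):
    """Fraction of examples where the judge score is within `tolerance` of the human score."""
    if len(judge_scores) != len(human_scores) or not judge_scores:
        raise ValueError("need two non-empty score lists of equal length")
    hits = sum(abs(j - h) <= tolerance for j, h in zip(judge_scores, human_scores))
    return hits / len(judge_scores)


def failing_cases(golden_cases, generate, judge, min_score=4):
    """Run every golden case and return the ones scoring below `min_score`."""
    failures = []
    for case in golden_cases:
        response = generate(case["question"])
        result = judge(case["question"], case["reference"], response)
        if result["score"] < min_score:
            failures.append({**case, "response": response, **result})
    return failures


# Step 1: compare judge scores with human labels (made-up 1-5 scores).
human = [5, 4, 2, 5, 1, 3, 4, 5, 2, 4]
judged = [5, 4, 3, 5, 1, 3, 4, 4, 2, 4]
rate = agreement_rate(judged, human)
needs_revision = rate < 0.80  # the 80% threshold suggested above


# Step 2: golden-set regression check, with stand-ins for the model and the judge.
def fake_generate(question):
    return {"What is 2 + 2?": "4"}.get(question, "I am not sure.")


def fake_judge(question, reference, response):
    return {"score": 5 if response == reference else 1, "reason": "exact-match stand-in"}


golden = [
    {"question": "What is 2 + 2?", "reference": "4"},
    {"question": "Capital of France?", "reference": "Paris"},
]
failures = failing_cases(golden, fake_generate, fake_judge)
```

Exact-match agreement is the strictest reading of "agrees". Passing `tolerance=1` counts a judge score one point off as agreement, which may suit a 1-5 scale better.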
For more on how structured outputs make evals more reliable, How to Structure LLM Output covers JSON mode and schema validation. When the model returns parseable fields rather than free-form text, scoring becomes deterministic instead of approximate.
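Once the judge is asked for JSON, a small validator can reject anything that does not match the expected shape before the score reaches your metrics. This sketch assumes the `score` and `reason` fields from the judge prompt above; `parse_judge_output` is a hypothetical helper.

```python
import json


def parse_judge_output(raw: str) -> dict:
    """Parse the judge's JSON and check for an integer score in 1-5 and a string reason."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on malformed JSON
    if not isinstance(data, dict):
        raise ValueError("judge output must be a JSON object")
    score = data.get("score")
    # bool is a subclass of int in Python, so exclude it explicitly
    if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 5:
        raise ValueError(f"score must be an integer from 1 to 5, got {score!r}")
    reason = data.get("reason")
    if not isinstance(reason, str):
        raise ValueError("reason must be a string")
    return {"score": score, "reason": reason}


ok = parse_judge_output('{"score": 4, "reason": "matches the reference"}')
```

Counting how often the validator raises is a useful metric in itself: a rising rejection rate usually means the judge prompt or model has drifted.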
How LLM observability fits into a larger architecture
If you’re wondering how all these calls and chains fit together at the system level, How LLMs Actually Work covers the mental model in four steps. Observability is most useful when you understand what the model is doing internally, because then you know which layer of the stack to look at first when a trace shows a slow or wrong output.
Try It Yourself
LLM observability is one of those things that clicks when you actually see a trace, watch a latency spike, and follow it back to a specific step in the chain.
You’ve got the mental model. Now run the exercises. TinkerLLM Lesson 28 covers evaluation and observability in practice, including RAGAS metrics and LLM-as-judge scoring. Bring your own Gemini API key (free from Google AI Studio), and the exercises run live against real models from your browser.
Open Lesson 28: LLM Evaluation & Observability →
FAQ
Do I need LangSmith or Langfuse before I can get started?
Not immediately. The 30-line logging wrapper above captures the data that matters most. What LangSmith and Langfuse add is structure (linked traces across multiple calls), a UI for browsing logs without opening files, and evaluation tooling. If you’re building alone on a prototype, start with the log file. Add a platform when you have multiple people debugging the same issues or when you want to track quality metrics across prompt versions.
LangSmith vs Langfuse: which should I pick?
LangSmith if you’re using LangChain and want the fastest path to a working trace. Langfuse if you’re calling model APIs directly (Gemini, OpenAI, Anthropic), want open-source, or can’t send data to a third-party host. Both cover logging, tracing, and evaluation. The integration patterns differ more than the core capabilities.
What is LLM-as-judge and how reliable is it?
LLM-as-judge means using a capable model to evaluate another model’s response, rather than relying on exact-match string comparisons or expensive human labelers. It scales well and is practical. The reliability depends on your task: on questions with clear right or wrong answers, models like GPT-4o and Gemini Pro agree with human judges roughly 85-90% of the time in published benchmarks. On subjective quality, agreement drops to 70-80%. Calibrate against human labels before you trust absolute scores. Use it for tracking relative changes (did this prompt version perform better than the last?) rather than absolute quality claims you’d show to a stakeholder.
How do I observe LLM apps without adding latency to requests?
Write log entries asynchronously. The logging wrapper above uses synchronous file writes, which adds roughly 1-5ms per call. For high-throughput production, use asyncio and write to a queue (Redis, a message broker) rather than directly to a file or an HTTP endpoint. Most observability platforms including LangSmith, Langfuse, and Honeycomb accept asynchronous telemetry that doesn’t block the main request path.
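A minimal sketch of taking the write off the request path. Note the substitution: it uses a standard-library in-process queue and a worker thread rather than asyncio or an external broker such as Redis, and `BackgroundLogger` is a hypothetical class, not part of any platform's SDK.

```python
import json
import queue
import tempfile
import threading
from pathlib import Path


class BackgroundLogger:
    """Accepts log entries without blocking the caller; a worker thread writes them to disk."""

    def __init__(self, path):
        self._path = Path(path)
        self._queue = queue.Queue()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def log(self, entry: dict) -> None:
        # put() on an unbounded queue returns immediately
        self._queue.put(entry)

    def close(self) -> None:
        # None is the shutdown signal; join() waits until pending entries are written
        self._queue.put(None)
        self._worker.join()

    def _run(self) -> None:
        with open(self._path, "a", encoding="utf-8") as f:
            while True:
                entry = self._queue.get()
                if entry is None:
                    break
                f.write(json.dumps(entry) + "\n")
                f.flush()


log_path = Path(tempfile.mkdtemp()) / "llm_calls.jsonl"
logger = BackgroundLogger(log_path)
logger.log({"prompt": "hi", "response": "hello", "latency_ms": 12.3})
logger.close()  # call once at shutdown so queued entries are not lost
```

The trade-off is durability: entries still in the queue are lost if the process dies before the worker writes them. An external queue addresses that at the cost of another dependency.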
What metrics should I track once I have observability set up?
Start with four: p50 and p95 latency (not just the average), input token count per call, output token count per call, and error rate. Once those are in a dashboard, add quality metrics from your eval pipeline. The latency percentiles catch cases where one in 20 calls is 10x slower than average, which averages hide entirely. Token counts drive your cost forecast. Quality metrics catch prompt regressions before users report them.
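The latency and token metrics can be computed directly from logged entries. A minimal sketch, assuming the field names from the logging wrapper earlier in this article; `percentile` (nearest-rank method) and `summarize` are hypothetical helpers, and the sample data is made up.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(entries):
    """Latency percentiles plus token totals for a list of log entries."""
    latencies = [e["latency_ms"] for e in entries]
    return {
        "p50_ms": percentile(latencies, 50),
        "p95_ms": percentile(latencies, 95),
        "mean_ms": sum(latencies) / len(latencies),
        "input_tokens": sum(e["input_tokens"] for e in entries),
        "output_tokens": sum(e["output_tokens"] for e in entries),
    }


# Made-up data: 18 fast calls and 2 calls that are 10x slower.
fast = {"latency_ms": 400.0, "input_tokens": 120, "output_tokens": 60}
slow = {"latency_ms": 4000.0, "input_tokens": 120, "output_tokens": 60}
summary = summarize([fast] * 18 + [slow] * 2)
```

On this sample the mean is 760 ms and p50 is 400 ms, while p95 is 4000 ms. The slow tail shows up only in the high percentile.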
LLM observability is the fastest way to improve an app’s quality after launch. You can’t improve what you can’t measure. Start with logging, then add tracing, then add evals. TinkerLLM Lesson 28 covers all three through exercises you run against a live Gemini API.
Delivery lead at Kalvium Labs with a background in instructional design. Writes concept explainers and process posts. Thinks about how people actually learn before jumping to solutions.