Structured LLM Output: JSON Mode, Schemas, Guardrails

Your function expected {"name": "Priya", "score": 4}. The model returned Sure, here is the extracted data:\n\n{"name": "Priya", "score": 4}. Your JSON parser crashed on the “Sure,” part. You spent twenty minutes writing a regex to strip preambles. Two days later, the model returned the same data formatted as a Markdown table instead.

This is the structured output problem. Left to itself, an LLM produces text that reads well to a human but breaks everything downstream. JSON mode and schema-constrained generation exist to fix this. Neither is magic. But used together, they get you from “my pipeline crashes randomly” to “my pipeline handles output predictably.”

Here’s what each approach actually does, when to use it, and where each one still fails.

💡 Try this hands-on: This concept has a dedicated exercise in Unit 22: System Prompts & Structured Outputs → on TinkerLLM. Bring your own Gemini API key from Google AI Studio (it’s free).

Why LLMs return free-form text by default

LLMs are trained to produce text that looks good to a human reader. Full sentences, polite preambles, helpful caveats. That’s what got rewarded during RLHF training.

When you ask “extract the name and score from this review,” the training objective doesn’t care whether the output is valid JSON. It cares whether the response is helpful. And a helpful human response to that request is something like “The name is Priya and the score is 4.”

So that’s what you get. Until you change what “helpful” means by adding constraints.

Option 1: Prompting for format

The simplest approach is to tell the model what format you want, in plain text:

Extract the name and rating from this review.
Return your answer as JSON with exactly two fields: "name" (string) and
"rating" (number between 1 and 5).
Do not include any explanation or preamble. Return only the JSON object.

This works. About 80% of the time, with a capable model like Gemini Pro or GPT-4o.

The other 20% is the problem. The model adds a preamble when the review is long and the format instructions get pushed far back in the context window. It formats the rating as “4.5 out of 5” instead of 4.5. It wraps the JSON in triple backticks. It adds extra fields it thought were helpful.

And the failure rate isn’t random. It spikes with long prompts, because the model pays less attention to instructions near the beginning as the context grows. You wrote “only return JSON” at the top. By token 3,000, that instruction is effectively weaker than it was.

Prompting for format works for simple, short-context tasks. You need something harder for production.
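If you do stay with prompt-only formatting, the parser has to tolerate the failure modes described above. Here is a minimal sketch, assuming the reply contains one JSON object of interest; extract_first_json_object is a hypothetical helper name, not part of any SDK:

```python
import json


def extract_first_json_object(text: str) -> dict:
    """Return the first JSON object found in `text`, skipping preambles and code fences."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value beginning at `start` and ignores what follows it.
            value, _end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            value = None
        if isinstance(value, dict):
            return value
        # Not a parseable object here; try the next opening brace.
        start = text.find("{", start + 1)
    raise ValueError("no JSON object found in model output")


replies = [
    '{"name": "Priya", "rating": 4}',
    'Sure, here is the extracted data:\n\n{"name": "Priya", "rating": 4}',
    '```json\n{"name": "Priya", "rating": 4}\n```',
]
for reply in replies:
    print(extract_first_json_object(reply))  # the same dict each time
```

This only rescues syntax. It does nothing about a rating that comes back as “4.5 out of 5” or about extra fields, which is why the options below exist.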

Option 2: JSON mode

Most major LLM APIs now offer a “JSON mode” setting. In the Gemini API, it’s response_mime_type: "application/json". In the OpenAI API, it’s response_format: { type: "json_object" }.

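What JSON mode does: it steers the model’s output toward syntactically valid JSON, typically by constraining generation at the token level. No preambles. No Markdown tables. No text followed by JSON. One caveat: if generation stops early, for example because it hits the output token limit, the response can be cut off mid-object and fail to parse, so check the finish reason before you trust it.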

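What JSON mode doesn’t do: it doesn’t know what fields your JSON should contain. Enable JSON mode without describing the fields you want, and the model returns valid JSON containing whatever it thinks is relevant. Could be {"response": "acknowledged"}. Could be a deeply nested object with field names you didn’t ask for. Note that the OpenAI API requires the messages to mention JSON explicitly when this mode is on, so there you always need at least a minimal instruction.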

JSON mode eliminates the syntax error problem. It doesn’t eliminate the schema mismatch problem.

For simple use cases where you just need parseable output and can tolerate some variation in field names, JSON mode alone is enough. For anything that feeds a database or a typed data model downstream, you need the next step.
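To make the gap between “valid JSON” and “the JSON I asked for” concrete, here is a small sketch of a shape check you could run on JSON-mode output. The shape_problems helper and the EXPECTED_FIELDS table are illustrative names, and the field list is taken from the prompt in Option 1:

```python
import json

# Field name -> accepted Python type(s), mirroring the prompt in Option 1.
EXPECTED_FIELDS = {"name": str, "rating": (int, float)}


def shape_problems(raw: str, expected: dict) -> list[str]:
    """List the ways a syntactically valid JSON reply deviates from the expected fields."""
    payload = json.loads(raw)  # JSON mode makes this step reliable; the rest is on you
    if not isinstance(payload, dict):
        return [f"expected an object, got {type(payload).__name__}"]
    problems = []
    for field, types in expected.items():
        if field not in payload:
            problems.append(f"missing field: {field}")
        # bool is a subclass of int in Python, so reject it explicitly.
        elif isinstance(payload[field], bool) or not isinstance(payload[field], types):
            problems.append(f"wrong type for {field}: {type(payload[field]).__name__}")
    for field in sorted(payload.keys() - expected.keys()):
        problems.append(f"unexpected field: {field}")
    return problems


print(shape_problems('{"response": "acknowledged"}', EXPECTED_FIELDS))
# ['missing field: name', 'missing field: rating', 'unexpected field: response']
print(shape_problems('{"name": "Priya", "rating": "4.5 out of 5"}', EXPECTED_FIELDS))
# ['wrong type for rating: str']
```

Both replies parse without error, and both would break typed code downstream.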

Option 3: Schema-constrained output

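The Gemini API supports a response_schema parameter. You give the model a schema definition (Gemini accepts a subset of the OpenAPI 3.0 schema object, which covers common keywords such as type, properties, required, and enum), and it can only return output that matches that schema. The OpenAI API has the same capability via its structured outputs feature.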

Here’s a minimal example in Python with the Gemini API:

import json
import os

import google.generativeai as genai

# Assumption: the API key is stored in the GEMINI_API_KEY environment variable.
genai.configure(api_key=os.environ["GEMINI_API_KEY"])

# The shape every response must match.
schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "rating": {"type": "number"},
        "sentiment": {
            "type": "string",
            "enum": ["positive", "negative", "neutral"]
        }
    },
    "required": ["name", "rating", "sentiment"]
}

model = genai.GenerativeModel(
    "gemini-3.1-pro-preview",
    generation_config=genai.GenerationConfig(
        response_mime_type="application/json",  # JSON mode
        response_schema=schema,                 # schema constraint on top of it
    ),
)

result = model.generate_content(
    "Extract name, rating and sentiment from: 'Priya gave 5 stars and loved the experience.'"
)
data = json.loads(result.text)
# Example result: {"name": "Priya", "rating": 5.0, "sentiment": "positive"}

The model can’t return a field that isn’t in the schema. It can’t skip a required field. It can’t return a string where you asked for a number.

This is the right approach for any pipeline where the output feeds structured code.

How schema constraints work mechanically

Under the hood, schema-constrained generation modifies the token sampling process. At each generation step, any token that would make the output inconsistent with the schema is masked (its probability is set to zero) before sampling. The model can’t accidentally pick it.

The model is still doing language modeling, predicting the most plausible next token. But “most plausible token within valid-JSON matching your schema” is a much narrower target than “most plausible token overall.” And unlike instruction-following, which depends on how much attention the model pays to your prompt, schema enforcement is a hard constraint applied at inference time.
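The masking step can be sketched in a few lines. This is a toy illustration, not any provider's actual implementation: the five-token vocabulary, the logit values, and the constrained_sample name are all made up, and real systems derive the allowed set from a grammar or state machine compiled from the schema.

```python
import math
import random


def constrained_sample(logits: dict[str, float], allowed: set[str], rng: random.Random) -> str:
    """Sample one token after masking everything the schema does not allow at this step."""
    if not any(tok in allowed for tok in logits):
        raise ValueError("no allowed token in the vocabulary")
    # Masking: a disallowed token gets a logit of -inf, which becomes probability 0.
    masked = {tok: (score if tok in allowed else -math.inf) for tok, score in logits.items()}
    # Softmax weights over the survivors (subtract the max for numerical stability).
    top = max(masked.values())
    weights = {tok: math.exp(score - top) for tok, score in masked.items()}
    tokens = list(weights)
    return rng.choices(tokens, weights=[weights[t] for t in tokens], k=1)[0]


# Suppose the output so far is {"rating": and the schema says rating is a number.
logits = {"Sure": 3.0, '"': 2.0, "5": 1.5, "4": 1.0, "null": 0.5}
allowed = {"4", "5"}
rng = random.Random(0)
samples = {constrained_sample(logits, allowed, rng) for _ in range(1000)}
print(samples <= allowed)  # True: "Sure" has the highest logit but can never be picked
```

The model's preferences still matter inside the allowed set: “5” is sampled more often than “4” because its logit is higher. The mask only removes options; it does not choose among the ones that remain.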

If you want to understand how the token sampling process works in the first place, How LLMs Actually Work: A Mental Model in 4 Steps covers the prediction loop from logits through sampling. That’s the layer schema constraints operate on.

The practical difference: you can trust schema-constrained output in a way you can’t trust instructed format. Parse failures drop from occasional-and-unpredictable to nearly zero.

But nearly zero isn’t zero.

Where schema constraints still break

You’d expect schema enforcement to be absolute. It’s not, for two reasons.

Type coercion edge cases. If your schema says rating is a number and the review text says “five stars,” the model has to decide what number to output. Gemini Pro typically outputs 5 or 5.0. Older model versions sometimes output null when they can’t parse a numeric value confidently (possible only where the schema allows null for that field). And occasionally a model hallucinates a plausible-looking number rather than admitting uncertainty. None of these break the schema syntax. All of them can break your business logic.

Semantic drift with optional fields. If a field is optional (not listed in required), the model may include it with a wrong value rather than omitting it. If your schema has an optional date field and the review doesn’t mention a date, you might get today’s date, a null, or an invented one.

So you still need validation code on your side. Schema constraints reduce parse failures to near zero. They don’t reduce semantic errors to zero.
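As an illustration of what that validation might look like, here is a sketch of two business-rule checks for the review example. The rules themselves (a 1 to 5 rating range, taken from the prompt in Option 1, and a name that must appear in the source text) are assumptions about what “correct” means for this pipeline, and semantic_problems is a hypothetical helper:

```python
def semantic_problems(payload: dict, source_text: str) -> list[str]:
    """Business-rule checks that a schema pass does not cover."""
    problems = []

    rating = payload.get("rating")
    if rating is None:
        problems.append("rating is null")
    elif not isinstance(rating, (int, float)) or not 1 <= rating <= 5:
        problems.append(f"rating out of range: {rating}")

    name = payload.get("name") or ""
    if not name.strip():
        problems.append("name is empty")
    # A name that never appears in the source text is a likely hallucination.
    elif name.lower() not in source_text.lower():
        problems.append(f"name not found in source text: {name}")

    return problems


review = "Priya gave 5 stars and loved the experience."
print(semantic_problems({"name": "Priya", "rating": 5, "sentiment": "positive"}, review))
# []
print(semantic_problems({"name": "Alex", "rating": 50, "sentiment": "positive"}, review))
# ['rating out of range: 50', 'name not found in source text: Alex']
```

Both payloads satisfy the schema from Option 3. Only the first one is right.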

Try it yourself

The fastest way to see the difference between these three approaches is to run them back to back.

1. Write a prompt asking for JSON output with instructions only. No JSON mode, no schema.
2. Run it 5 times. Count how many times the format is exactly what you asked for.
3. Enable JSON mode. Run the same prompt 5 times. Count clean parses.
4. Add a schema. Run again.

The improvement from step 1 to step 4 is usually dramatic. Step 3 versus step 4 is what matters when field names and types need to be exact.
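A small harness makes the counting in the four steps above repeatable. This is a sketch: count_clean is a hypothetical helper, and the canned replies stand in for real API calls so the example runs offline. In practice you would pass a function that calls your model with the configuration under test.

```python
import json
from typing import Callable


def count_clean(generate: Callable[[], str], runs: int, required: set[str]) -> int:
    """Count runs whose raw output parses as a JSON object with exactly the required fields."""
    clean = 0
    for _ in range(runs):
        try:
            payload = json.loads(generate())
        except json.JSONDecodeError:
            continue  # preamble, code fence, or otherwise unparseable
        if isinstance(payload, dict) and set(payload) == required:
            clean += 1
    return clean


# Canned replies imitating prompt-only output (step 1 and 2 of the experiment).
canned = iter([
    '{"name": "Priya", "rating": 4}',
    'Sure, here is the data:\n{"name": "Priya", "rating": 4}',
    '```json\n{"name": "Priya", "rating": 4}\n```',
    '{"name": "Priya", "rating": 4, "note": "great"}',
    '{"name": "Priya", "rating": 4}',
])
print(count_clean(lambda: next(canned), runs=5, required={"name", "rating"}))  # 2
```

Five runs is enough to see the pattern but not to estimate a failure rate. Raise the run count if you want a number you can compare across configurations.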

Unit 22 on TinkerLLM → walks through this exact progression as guided exercises with validation. The first 50 exercises are free.

Guardrails: handling the failures that still happen

Even with schema-constrained output, you need a response layer in your code. Here’s a minimal retry pattern in Python:

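import json

import pydantic


class ReviewData(pydantic.BaseModel):
    name: str
    rating: float
    sentiment: str


MAX_RETRIES = 2  # total attempts, including the first call


def extract_with_retry(review_text: str) -> ReviewData:
    # `model` is the schema-configured GenerativeModel from the earlier example.
    for attempt in range(MAX_RETRIES):
        raw = model.generate_content(review_text)
        try:
            data = json.loads(raw.text)
            # Raises pydantic.ValidationError on missing fields or wrong types.
            return ReviewData(**data)
        # TypeError covers a top-level value that is not a JSON object.
        except (json.JSONDecodeError, pydantic.ValidationError, TypeError) as e:
            if attempt == MAX_RETRIES - 1:
                raise RuntimeError(
                    f"Extraction failed after {MAX_RETRIES} attempts: {e}"
                ) from e
    raise RuntimeError("unreachable")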

Three things this pattern does:

Validates against your data model, not just the JSON schema. A Pydantic ValidationError tells you that rating came back as "five" instead of 5.0, which is a different failure than a JSONDecodeError.

Retries on failure. The next attempt often returns clean output. With MAX_RETRIES = 2, the loop above makes two attempts in total (the first call plus one retry), which catches most transient issues without inflating your API costs.

Fails explicitly. Better to raise an exception and log it than to silently write malformed data to your database.

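For more complex schemas or multi-provider pipelines, two libraries are worth a look. instructor wraps the retry-and-validate pattern around Pydantic models and works with several providers, including OpenAI, Anthropic, and Gemini. outlines takes a different route: it constrains generation to a schema or grammar at decoding time and is aimed mainly at models you run yourself. Worth adopting if you’re doing structured extraction at any scale.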

How this connects to system instructions

System instructions and structured output work together, not in place of each other.

The schema handles the format: what fields, what types, what’s required. The system instruction handles the behavior inside that format. If you want the model to return null for missing fields rather than inventing a value, you say that in the system instruction. If you want sentiment to default to “neutral” when unclear rather than being omitted, you specify that.

Without a schema, the system instruction has to do all the format enforcement and frequently fails under long context. Without a system instruction, the schema enforces structure but the values inside may still be wrong in subtle ways.

The two work best together.
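One way to picture the division of labor is to write the two pieces side by side. This is a sketch under assumptions: the "nullable" keyword follows the OpenAPI-style schema dialect (plain JSON Schema would use "type": ["number", "null"] instead), the instruction wording is illustrative, and null_fields is a hypothetical helper.

```python
# Format: what the schema allows. Nullable fields give the model a legal way
# to say "not stated" instead of inventing a value.
schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string", "nullable": True},
        "rating": {"type": "number", "nullable": True},
        "sentiment": {"type": "string", "enum": ["positive", "negative", "neutral"]},
    },
    "required": ["name", "rating", "sentiment"],
}

# Behavior: what the model should do inside that format.
SYSTEM_INSTRUCTION = (
    "You extract review data. "
    "If the review does not state a name or a rating, return null for that field "
    "and never guess. "
    'If the sentiment is unclear, use "neutral".'
)


def null_fields(payload: dict) -> list[str]:
    """Report which fields came back null so the caller can decide how to handle them."""
    return [field for field, value in payload.items() if value is None]


print(null_fields({"name": None, "rating": 4, "sentiment": "neutral"}))  # ['name']
```

The schema makes null a permitted value; the instruction tells the model when to use it. Neither one does the other's job.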

Two checks to run before shipping

Before any pipeline that depends on structured LLM output goes to production, test these two things explicitly.

Check 1: behavior on missing data. Feed the model a review that clearly lacks one of your required fields. No name in the text. No rating mentioned. Does the model output null? Does it hallucinate a value? Does it fail schema validation? You need to know before your users trigger it.

Check 2: behavior at long context. Take an input document that’s 3,000+ tokens. Put your extraction task at the end of a long context. Run it 10 times. Compare the output consistency to a shorter version of the same task. Context length affects structured output reliability in ways that aren’t obvious from a simple 100-token test.
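For Check 2, “compare the output consistency” needs a number. One simple option, sketched here with a hypothetical consistency helper and canned outputs in place of real runs, is the share of runs that agree with the most common parsed result:

```python
import json
from collections import Counter


def consistency(outputs: list[str]) -> float:
    """Share of runs that agree with the most common parsed output (1.0 = fully consistent)."""
    if not outputs:
        raise ValueError("need at least one output")
    canonical = []
    for raw in outputs:
        try:
            # Re-serialize with sorted keys so field order does not count as a difference.
            canonical.append(json.dumps(json.loads(raw), sort_keys=True))
        except json.JSONDecodeError:
            pass  # unparseable runs never count as agreeing
    if not canonical:
        return 0.0
    most_common_count = Counter(canonical).most_common(1)[0][1]
    return most_common_count / len(outputs)


short_runs = ['{"name": "Priya", "rating": 5}'] * 10
long_runs = (
    ['{"rating": 5, "name": "Priya"}'] * 7      # same data, different key order
    + ['{"name": "Priya", "rating": 4}'] * 2    # a different value
    + ["Sure, here you go"]                     # not JSON at all
)
print(consistency(short_runs))  # 1.0
print(consistency(long_runs))   # 0.7
```

Agreement is not correctness: ten identical wrong answers also score 1.0. Pair this with a check against a known expected output for at least a few inputs.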

Both checks are in TinkerLLM’s Unit 22 exercises. Module 3 covers production RAG patterns (38 exercises across 7 units), and structured output for extraction feeds almost every RAG pipeline you’ll build.

FAQ

What’s the difference between JSON mode and structured output?

JSON mode enforces syntactically valid JSON: no preambles, no Markdown wrapping, no malformed brackets. Structured output with a schema enforces a specific JSON shape: field names, types, required fields, and allowed values (via enum). For production use, start with structured output and a schema. JSON mode alone is a good fallback when your provider or model doesn’t support schemas.

Does schema-constrained output work with all models?

Schema constraints require API-level support from the provider. Gemini Pro and Flash both support it via response_schema. GPT-4o and GPT-4o-mini support it via OpenAI’s structured outputs feature. Claude’s API as of mid-2026 supports a constrained JSON mode but doesn’t accept arbitrary JSON Schema. Always check the specific model version’s docs, since support varies across API versions.

Will schema constraints slow down my responses?

Very slightly. The constraint logic adds a small overhead to each sampling step because invalid tokens need to be filtered out, and some providers add one-time latency the first time they see a new schema while they process it. In practice the steady-state difference is under 50ms on a standard API call, not perceptible in user-facing latency. The reliability gain far outweighs the overhead.

Do I still need to validate if I’m using a schema?

Yes. Schemas enforce structure and types. They don’t prevent semantic errors: values that parse correctly but are logically wrong for your use case. Always validate with a typed model like Pydantic, and handle the ValidationError case explicitly. “The schema passed” and “the output is correct” are different things.

Can I use structured output for multi-step chains?

Yes, and it’s common. Each step in a chain can return structured output that feeds the next step. The design challenge is keeping individual schemas tight. A schema with 20 required fields tends to produce more errors than one with 5. When accuracy matters more than latency, break complex extractions into smaller focused calls.
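A two-step chain with one small schema per step might look like the sketch below. Everything here is illustrative: the Facts and Verdict classes, the prompts, and fake_model, which stands in for a schema-constrained API call. Dataclasses reject missing or unexpected fields but do not check value types, so a real pipeline would still validate with something like Pydantic.

```python
import json
from dataclasses import dataclass
from typing import Callable


@dataclass
class Facts:
    """Step 1 schema: what the review states."""
    name: str
    rating: float


@dataclass
class Verdict:
    """Step 2 schema: a judgment about the review."""
    sentiment: str


def run_chain(review: str, call_model: Callable[[str], str]) -> tuple[Facts, Verdict]:
    facts = Facts(**json.loads(call_model(f"Extract name and rating: {review}")))
    # Step 2 receives the structured result of step 1 alongside the raw text.
    prompt = f"Classify sentiment (rating was {facts.rating}): {review}"
    verdict = Verdict(**json.loads(call_model(prompt)))
    return facts, verdict


def fake_model(prompt: str) -> str:
    # Stand-in for a real model call; returns canned JSON per step.
    if prompt.startswith("Extract"):
        return '{"name": "Priya", "rating": 5}'
    return '{"sentiment": "positive"}'


facts, verdict = run_chain("Priya gave 5 stars and loved the experience.", fake_model)
print(facts, verdict)
```

Each call has two or fewer fields to get right, and a failure points at one specific step instead of at a 20-field object.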

Is this the same as function calling?

Related but not identical. Function calling (tool calling) is built on structured output: the model returns a structured object specifying which function to call and with what arguments. Under the hood, many providers implement function calling as schema-constrained JSON generation. But you can use structured output without the function-calling wrapper whenever you just need clean data extraction without tool dispatch.
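The “tool dispatch” part is the piece function calling adds on top of structured output. A minimal sketch, assuming a tool-call object with "name" and "arguments" keys (the exact shape differs by provider) and a hypothetical save_review function:

```python
import json


def save_review(name: str, rating: float) -> str:
    # Placeholder for a real side effect such as a database write.
    return f"saved {name} ({rating})"


TOOLS = {"save_review": save_review}  # registry of functions the model may call


def dispatch(tool_call_json: str) -> str:
    """Run the function named in a structured tool-call object."""
    call = json.loads(tool_call_json)
    func = TOOLS.get(call["name"])
    if func is None:
        raise ValueError(f"unknown tool: {call['name']}")
    return func(**call["arguments"])


print(dispatch('{"name": "save_review", "arguments": {"name": "Priya", "rating": 5}}'))
# saved Priya (5)
```

Plain structured extraction stops at json.loads; there is no registry and nothing gets executed.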

Stop reading about structured output. Try it. The first 50 exercises on TinkerLLM are free, no card needed.

