Structured Output

Also known as: JSON mode, constrained decoding, schema-guided generation, structured generation

TL;DR

Structured output is the practice of forcing an LLM to produce machine-parseable output — JSON, XML, or any schema-conforming format — instead of free-form text.

Structured output is what you need whenever an LLM is talking to a downstream program rather than a human. The downstream code wants a typed object (a JSON dict, an XML tree, a Pydantic model), not prose. Structured output is the family of techniques that make the model produce exactly that, reliably enough to put into production.

[Figure: Structured output. A grammar masks the logits to whatever stays valid. Pipeline: schema state (where in the JSON are we?) → legal token set (what would keep it valid?) → raw logits (unmodified model output) → masked logits (illegal tokens → −∞) → sample (provably schema-valid). Example at state OBJECT_START, where only `{` is legal at the root: the raw logits are `{` 4.2, `"name` 3.6, `42` 3.1, `Sure!` 2.4; after masking, `"name`, `42` and `Sure!` are set to −∞ and `{` is sampled. Renormalising over only legal tokens makes every step provably schema-valid by construction.]

Pure-prompt JSON is the kind of thing that works in eval and breaks at 3am during an incident. Constrained decoding makes the entire failure mode go away.

The three approaches, in order of reliability

1. Pure prompting. “Respond in JSON with keys X, Y, Z. Do not include any text outside the JSON.”

This works ~95% of the time on capable models. The other 5% breaks production: the model wraps the JSON in markdown fences (```json), adds a chatty preamble, omits a required field, or returns a string where you wanted a number. Acceptable for prototypes; not for any system you can’t manually inspect.
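The failure modes above are why pure prompting usually ends up with a repair layer in front of the parser. A minimal sketch of such a layer, assuming the reply contains at most one JSON object; the function name `extract_json` is made up for illustration, and this is the kind of band-aid the stricter approaches below make unnecessary:

```python
import json
import re

# Matches a markdown fence, with or without a "json" language tag.
FENCE_RE = re.compile(r"```(?:json)?\s*(.*?)```", re.DOTALL)


def extract_json(raw: str):
    """Best-effort recovery of a JSON object from a chatty model reply."""
    # Prefer the contents of a markdown fence if one is present.
    match = FENCE_RE.search(raw)
    candidate = match.group(1) if match else raw
    # Trim any preamble before the first "{" and any trailer after the last "}".
    start, end = candidate.find("{"), candidate.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found")
    # json.loads raises json.JSONDecodeError (a ValueError) on malformed JSON.
    return json.loads(candidate[start:end + 1])


reply = 'Sure! Here you go:\n```json\n{"x": 1, "y": "a"}\n```'
print(extract_json(reply))
```

Note that this only recovers syntax. A missing field or a string where a number was expected still passes straight through.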

2. Soft schema modes (JSON mode). OpenAI’s original response_format: json_object and similar setups from Anthropic. The provider post-trains the model to produce valid JSON and applies light decoding nudges. Output is syntactically valid JSON more than ~99% of the time, but schema validity (right keys, right types) is still a prompt-engineering problem. Better than pure prompting; not bulletproof.
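Because JSON mode only covers syntax, the keys and types still have to be checked by your own code. A minimal sketch using only the standard library; the `EXPECTED` shape and the `schema_errors` helper are hypothetical, and a real system would more likely lean on a schema validation library:

```python
import json

# Hypothetical expected shape: field name -> required Python type.
EXPECTED = {"invoice_id": str, "total": float, "paid": bool}


def schema_errors(raw: str, expected: dict) -> list:
    """Return a list of problems; an empty list means the payload fits."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]
    errors = []
    for key, typ in expected.items():
        if key not in data:
            errors.append(f"missing key: {key}")
        elif not isinstance(data[key], typ):
            errors.append(f"wrong type for {key}: expected {typ.__name__}")
    return errors


# Syntactically valid JSON, but "total" came back as a string.
print(schema_errors('{"invoice_id": "A1", "total": "12.50", "paid": true}', EXPECTED))
```

Anything this check rejects needs a retry or a human, which is exactly the layer constrained decoding is meant to remove.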

3. Constrained decoding (structured outputs). OpenAI’s response_format: json_schema, Anthropic’s tool-use enforcement, open-source libraries like outlines, llguidance, jsonformer. The inference engine restricts the next-token distribution to tokens that keep the partial output valid against the schema/grammar. Output is schema-valid by construction: as long as generation runs to completion rather than being cut off by a token limit, a parse error is not possible.

Constrained decoding is the only approach that’s safe to deploy without a parse-and-retry layer. Production systems should default to it where the provider exposes it.

When constrained decoding hurts quality

Forcing the model into a rigid shape too early can suppress its reasoning. If the schema is {"answer": int} and the task is hard, the model has no room to think — it commits to a number on the first token. Two patterns help:

  • Reasoning slot in the schema. Add a reasoning: string field as the first key; the model uses it as a scratchpad before producing the answer. Combines with structured output cleanly.
  • Two-stage extraction. First call: unconstrained reasoning prose. Second call: extract structured fields from the reasoning. More tokens, but quality recovers.
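The reasoning-slot pattern can be sketched as a small schema transform. This assumes a JSON Schema style object schema and an engine that emits keys in the order the schema lists them, which is not guaranteed everywhere; `with_reasoning_slot` is a hypothetical helper, not part of any library:

```python
import json


def with_reasoning_slot(schema: dict) -> dict:
    """Return a copy of an object schema with a leading 'reasoning' string field."""
    # Python dicts preserve insertion order, so 'reasoning' is listed first.
    properties = {"reasoning": {"type": "string"}}
    properties.update(schema.get("properties", {}))  # original fields follow
    required = ["reasoning"] + [
        key for key in schema.get("required", []) if key != "reasoning"
    ]
    return {**schema, "properties": properties, "required": required}


answer_schema = {
    "type": "object",
    "properties": {"answer": {"type": "integer"}},
    "required": ["answer"],
}
print(json.dumps(with_reasoning_slot(answer_schema), indent=2))
```

The ordering is the whole point: the scratchpad has to be generated before the answer for the model to benefit from it.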

The mechanic is straightforward but consequential. At each generation step, the inference engine compiles the partial output plus the target schema/grammar into the set of token IDs that could legally come next. The model produces logits over the full vocabulary as usual; before sampling, the logit for every token NOT in the legal set is set to negative infinity. Softmax over the masked logits gives a renormalized distribution, and sampling proceeds.
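The masking step can be sketched in a few lines of plain Python. The token strings and logit values below are taken from the figure; representing logits as a dict keyed by token text is a simplification for illustration, since real engines work on arrays indexed by token ID:

```python
import math


def masked_softmax(logits: dict, legal: set) -> dict:
    """Set illegal logits to -inf, then renormalise over what is left."""
    masked = {
        tok: (value if tok in legal else -math.inf)
        for tok, value in logits.items()
    }
    peak = max(masked.values())
    if peak == -math.inf:
        raise ValueError("no legal token in the vocabulary")
    # Subtract the peak for numerical stability; exp(-inf) is exactly 0.0.
    exps = {tok: math.exp(value - peak) for tok, value in masked.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


raw_logits = {"{": 4.2, '"name': 3.6, "42": 3.1, "Sure!": 2.4}
print(masked_softmax(raw_logits, legal={"{"}))
```

With only `{` legal, it receives probability 1.0 and every other token 0.0, so no sampling temperature can produce an invalid first token.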

For JSON-schema-constrained generation, the legal-token set is computed from a state machine that tracks “where in the schema are we?” — currently inside an object key, currently inside a string value of a particular regex-constrained field, currently expecting a number, etc. Libraries like outlines and llguidance compile the schema into a finite state automaton at request time so the per-token mask computation is fast.
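To make the "where in the schema are we?" idea concrete, here is a hand-written, character-level state function for the single schema {"answer": int}. It is a toy under loud assumptions: fixed whitespace, one key, and states written by hand, whereas libraries like outlines and llguidance derive the automaton from the schema automatically:

```python
import re

PREFIX = '{"answer": '           # the only legal opening for this schema
DIGITS = set("0123456789")
INT_RE = re.compile(r"-?(?:0|[1-9][0-9]*)")  # JSON integer syntax


def legal_next_chars(partial: str) -> set:
    """Characters that keep `partial` a valid prefix of {"answer": <int>}."""
    if len(partial) < len(PREFIX):
        # Still inside the fixed prefix: exactly one character is legal.
        if not PREFIX.startswith(partial):
            raise ValueError("partial output already violates the schema")
        return {PREFIX[len(partial)]}
    if not partial.startswith(PREFIX):
        raise ValueError("partial output already violates the schema")
    body = partial[len(PREFIX):]
    if body == "":
        return DIGITS | {"-"}            # expecting the start of a number
    if body == "-":
        return set(DIGITS)               # a sign must be followed by a digit
    if body.endswith("}"):
        if INT_RE.fullmatch(body[:-1]):
            return set()                 # object closed: nothing more is legal
        raise ValueError("partial output already violates the schema")
    if not INT_RE.fullmatch(body):
        raise ValueError("partial output already violates the schema")
    if body.lstrip("-") == "0":
        return {"}"}                     # JSON forbids leading zeros
    return DIGITS | {"}"}                # extend the number or close the object


print(sorted(legal_next_chars('{"answer": 4')))
```

A real engine evaluates the same kind of function over token IDs rather than characters, and precomputes it so the per-step lookup is cheap.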

The implementation detail that bites in production: tokenization. Schemas are typically expressed in characters, but the model generates tokens. A token like " name" (with a leading space) might be the only token that’s valid at one step, but if the model’s preferred token is "n" (one that splits the word mid-way), the constraint forces a different token, and the model’s probability mass over the rest of the sentence shifts in unpredictable ways. The “constrained decoding hurts quality” failure mode often traces to this: the constraint is correct at the character level but pushes the model into low-probability token sequences.
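The character-versus-token mismatch is easier to see with a toy vocabulary. The sketch below assumes the grammar requires one fixed string next, which is a simplification, and the vocabulary is invented for illustration rather than taken from any real tokenizer:

```python
def legal_tokens(vocab: list, required_text: str) -> list:
    """Tokens that are a non-empty prefix of the text the grammar requires next."""
    return [tok for tok in vocab if tok and required_text.startswith(tok)]


# Hypothetical vocabulary entries that overlap on the same characters.
vocab = ['"', '"n', '"name', '"name"', ' name', 'name', 'Sure']

# At the character level there is exactly one legal continuation...
required = '"name": '

# ...but at the token level several different tokens all satisfy it.
print(legal_tokens(vocab, required))
```

Four of the seven tokens pass the character-level check, yet they lead to different tokenizations of the same text. If the mask admits a split the model rarely saw in training, the output stays valid while the model's predictions for the following tokens get worse.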

Recent work (XGrammar, llguidance with token-aware automata) addresses this by compiling the grammar to operate over the model’s actual tokenizer, not abstract characters. The result is fewer “valid garbage” outputs and lower latency — XGrammar reports near-zero overhead vs unconstrained sampling for most schemas.

Function calling and structured output use the same constrained-decoding machinery, but providers post-train the model differently for each surface. Function-calling traffic during post-training is heavy and stylistically narrow — the model has seen millions of (description, valid call) pairs from production traffic and synthetic data, so it has a strong prior on “what a tool call should look like” before the constraint even fires.

Generic structured output, especially with arbitrary user-supplied schemas, sees less of this post-training reinforcement. The constraint masks invalid tokens but the underlying probability distribution is less aligned with what a “good” output looks like for the schema. The result: function calling frequently produces high-quality outputs even on borderline tasks, while the same task expressed as a generic schema can produce technically-valid but lower-quality JSON.

The practical implication: if you have a fixed shape that recurs frequently, registering it as a tool tends to outperform inline schema-constrained output, even when the underlying task is the same. The model’s post-training has effectively memorized “this is a tool call” as a separate behavioral mode.

Where structured output is non-negotiable
  • Tool calling — every modern agent framework (LangGraph, OpenAI Assistants, Anthropic tool use) is structured output under the hood.
  • Document extraction — invoice line items, contract clauses, medical record fields. Each field has a schema; failures must be machine-detectable.
  • Form filling — pre-populating a UI form from natural-language input. UI breaks if a field is the wrong shape.
  • Multi-agent handoff — agent A produces a payload that agent B consumes. Schema is the contract.
  • Database writes — model output is going into a typed schema (Postgres, BigQuery). One unparseable row is a pager event.

Tools and prompts compose

Modern function calling is essentially structured output with the additional semantic that the schema names a tool to invoke. The same constrained-decoding machinery powers both. From the model’s perspective there’s no fundamental difference between “produce a JSON object matching this schema” and “call this function with these arguments.”
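One way to see the overlap is to wrap the same schema both ways. The two wrapper shapes below are illustrative only and are not any provider's exact request format; `as_tool` and `as_response_schema` are hypothetical helpers:

```python
def as_tool(name: str, description: str, schema: dict) -> dict:
    """Illustrative tool definition: the schema describes the call arguments."""
    return {"name": name, "description": description, "parameters": schema}


def as_response_schema(name: str, schema: dict) -> dict:
    """Illustrative structured-output request: the schema describes the reply."""
    return {"name": name, "schema": schema}


schema = {
    "type": "object",
    "properties": {"city": {"type": "string"}, "days": {"type": "integer"}},
    "required": ["city", "days"],
}

tool = as_tool("get_forecast", "Look up a weather forecast.", schema)
reply_format = as_response_schema("forecast_request", schema)

# The constraint handed to the decoder is the same object in both cases.
print(tool["parameters"] == reply_format["schema"])
```

The only extra information in the tool form is the name and description, which tell the model what the payload is for.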

The production frame

Anytime an LLM output is consumed by code, structured output should be on. The question is just which of the three approaches — and the answer is the most constrained one your provider supports.

Go further

What's the difference between JSON mode and constrained decoding?

JSON mode (the original OpenAI offering) is a soft constraint — it nudges the model toward valid JSON via post-training but doesn't guarantee schema validity. Constrained decoding (structured outputs, guided generation, outlines, llguidance) modifies the sampling step itself: at each token, only tokens consistent with the grammar/schema are allowed. Constrained decoding produces 100% valid output by construction; JSON mode produces ~99%.

Does constrained decoding hurt quality?

Sometimes, in subtle ways. Forcing the model into a rigid schema can prevent it from 'thinking out loud' first, which matters for complex extraction. The pragmatic pattern is two-stage: first an unconstrained reasoning pass, then a constrained extraction pass that distills the reasoning into the schema. Many production systems also reserve a 'reasoning' field at the top of the schema for the model to use as scratchpad.

Why not just use function calling for everything?

You can — function calling is structured output with a fixed shape. The reason 'structured output' exists as a separate API is that you often want the schema to be data-driven (different schemas per request) rather than tied to a registered tool. Modern providers expose both; function calling for genuine tool selection, structured output for data-shaped responses.
