Constrained Decoding

Also known as: guided generation, schema-guided sampling, GBNF, JSON mode, structured outputs

TL;DR

Constrained decoding restricts an LLM's next-token distribution to only tokens that keep the partial output valid against a grammar or schema.

Constrained decoding is the inference-time technique that forces an LLM’s output to conform to a grammar, schema, or regex by masking the next-token distribution at every step. Tokens that would make the partial output invalid have their logits set to negative infinity before sampling, leaving only legal continuations. Because the constraint is enforced during generation rather than after, any output that runs to completion is guaranteed valid: no parse errors, no retries, no fallback parser. (A generation cut off by a token limit can still stop mid-structure.) It is the production-grade alternative to “ask nicely in the prompt and hope.”

How it works mechanically

At every decoding step, the model produces logits over its vocabulary (typically 32K-128K tokens). Without constraints, you sample from this distribution directly. With constraints:

1. Maintain a state machine derived from your constraint (a DFA for regex, a pushdown automaton for a context-free grammar, a JSON-schema walker for JSON).
2. Query the state machine: “given the partial output so far, which tokens can legally come next?”
3. Mask all other tokens by setting their logits to -inf, so that softmax assigns them probability zero.
4. Sample from the renormalized distribution. Update the state machine with the chosen token.
5. Repeat until the state machine reaches a terminal state, or the model emits the EOS token from a state where the output is already complete.
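The loop above can be sketched in a few lines of Python. This is a toy under stated assumptions, not any library's implementation: the four-token vocabulary, the `ALLOWED` table, and `fake_model_logits` are all hypothetical stand-ins for a real tokenizer, a compiled constraint, and a model forward pass.

```python
import math
import random

# Toy vocabulary; a real tokenizer has tens of thousands of entries.
VOCAB = ["yes", "no", "maybe", "<eos>"]

# Hypothetical constraint: the output must be "yes" or "no", followed by <eos>.
ALLOWED = {
    "start": {"yes", "no"},
    "done": {"<eos>"},
}

def next_state(state, token):
    # "start" -> "done" after the answer, "done" -> "end" after <eos>.
    return "done" if state == "start" else "end"

def masked_sample(logits, allowed, rng):
    # Disallowed tokens get logit -inf; softmax then gives them probability 0.
    masked = [l if VOCAB[i] in allowed else -math.inf for i, l in enumerate(logits)]
    peak = max(masked)
    weights = [math.exp(l - peak) for l in masked]  # exp(-inf) == 0.0
    total = sum(weights)
    probs = [w / total for w in weights]
    return rng.choices(range(len(VOCAB)), weights=probs, k=1)[0]

def fake_model_logits(prefix):
    # Stand-in for a forward pass. It prefers "maybe", which is never legal.
    return [1.0, 0.5, 3.0, 0.0]

def generate(seed=0):
    rng = random.Random(seed)
    state, output = "start", []
    while state != "end":
        idx = masked_sample(fake_model_logits(output), ALLOWED[state], rng)
        token = VOCAB[idx]
        output.append(token)
        state = next_state(state, token)
    return output
```

Even though the stand-in model puts most of its mass on "maybe", `generate()` can only ever return an answer token followed by `<eos>`, because "maybe" is masked out at every step.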

The model’s relative preferences over allowed tokens are preserved. If the model wanted to output "hello" (allowed) versus "world" (also allowed), the ratio of their probabilities is unchanged — only disallowed tokens get zeroed.
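The ratio claim follows directly from how softmax renormalizes. As a short derivation, using notation introduced here for illustration (z_t for the logit of token t, p for the unmasked softmax distribution, A for the set of allowed tokens):

```latex
\[
p'(t) =
\begin{cases}
\dfrac{p(t)}{\sum_{u \in A} p(u)} & t \in A \\[2ex]
0 & t \notin A
\end{cases}
\qquad\Longrightarrow\qquad
\frac{p'(a)}{p'(b)} = \frac{p(a)}{p(b)} = e^{z_a - z_b}
\quad \text{for } a, b \in A .
\]
```

The normalizing sum is the same for every allowed token, so it cancels in the ratio. Masking changes absolute probabilities but not the model's ranking among legal continuations.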

Occasionally the mask leaves no legal token at all. This shouldn’t happen with a well-formed grammar, but it can if your tokenizer’s tokens don’t cleanly align with grammar rules. For example, the model wants to output the single token "foo", but the grammar only allows "f" followed by "oo", and the model’s vocabulary doesn’t have "f" as a single token. Production decoders (Outlines, llguidance) preprocess the grammar against the tokenizer to find which tokens are valid prefixes of valid completions. When there’s a mismatch, the system either falls back to character-level generation (slow) or rejects the grammar at compile time.
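A minimal sketch of that preprocessing step, assuming a hand-written character-level DFA for the string "foo" and two hypothetical vocabularies. Real libraries do this at much larger scale and with more care; the function names here are illustrative only.

```python
# Character-level DFA accepting exactly "foo".
# TRANSITIONS[state][char] -> next state; a missing entry is a dead end.
TRANSITIONS = {
    0: {"f": 1},
    1: {"o": 2},
    2: {"o": 3},
    3: {},
}
ACCEPTING = {3}

def walk(state, text):
    """Return the state reached after consuming text, or None if the DFA dies."""
    for ch in text:
        state = TRANSITIONS[state].get(ch)
        if state is None:
            return None
    return state

def build_token_index(vocab):
    """For each DFA state, map every usable token to the state it leads to."""
    index = {}
    for state in TRANSITIONS:
        index[state] = {}
        for tok in vocab:
            nxt = walk(state, tok)
            if nxt is not None:
                index[state][tok] = nxt
    return index

def dead_ends(index):
    """Non-accepting states with no usable token (reachability is ignored here)."""
    return [s for s in index if s not in ACCEPTING and not index[s]]
```

With a vocabulary such as `["foo", "fo", "o", "bar"]`, every non-accepting state has at least one usable token. With `["oo", "bar"]`, state 0 has none, because nothing in the vocabulary can produce the leading "f". A compile-time check like `dead_ends` is one way to reject such a grammar before generation starts.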

What gets constrained

Regex. Any regular language. Phone numbers, ISO dates, semver strings, classification labels chosen from a fixed list. Regex constraints are the fastest — DFAs are linear-time to evaluate.
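For the fixed-label case, the legality check reduces to a prefix test. The sketch below assumes a hypothetical three-label classifier and a made-up subword vocabulary. A production decoder would compile the labels into a DFA or trie rather than rescanning them at each step.

```python
LABELS = ["positive", "negative", "neutral"]

def allowed_tokens(prefix, vocab):
    """Tokens that keep prefix + token a prefix of at least one label."""
    return [t for t in vocab if any(label.startswith(prefix + t) for label in LABELS)]

# Hypothetical subword vocabulary.
VOCAB = ["pos", "neg", "ne", "itive", "ative", "utral", "great"]
```

From the empty prefix, `allowed_tokens("", VOCAB)` keeps "pos", "neg", and "ne". After "ne" has been emitted, only "utral" survives in this vocabulary, so the model has no way to drift into an off-list label such as "great".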

JSON schema. The dominant production case. Constrain output to a JSON object with specific keys, types, enums, nesting. Schemas can express most structured data needs and are widely supported (OpenAI’s response_format: json_schema, Anthropic’s tool use, Outlines, llguidance).
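As an illustration of what such a schema pins down, here is a small hypothetical ticket-triage schema with a hand-written conformance check using only the standard library. The schema and the `conforms` helper are assumptions made for this example. How the schema is passed to a provider or library varies, so check the relevant documentation for the exact request shape.

```python
import json

# Hypothetical schema: two required keys, one enum, one integer, nothing else.
TICKET_SCHEMA = {
    "type": "object",
    "properties": {
        "category": {"type": "string", "enum": ["billing", "bug", "other"]},
        "priority": {"type": "integer"},
    },
    "required": ["category", "priority"],
    "additionalProperties": False,
}

def conforms(text):
    """Hand-rolled check for this one schema; not a general JSON Schema validator."""
    try:
        obj = json.loads(text)
    except json.JSONDecodeError:
        return False
    if not isinstance(obj, dict) or set(obj) != set(TICKET_SCHEMA["required"]):
        return False
    category_ok = obj["category"] in TICKET_SCHEMA["properties"]["category"]["enum"]
    # bool is a subclass of int in Python, so exclude it explicitly.
    priority = obj["priority"]
    priority_ok = isinstance(priority, int) and not isinstance(priority, bool)
    return category_ok and priority_ok
```

With schema-constrained decoding, every completed output would pass a check like `conforms` by construction. Without it, the check is something you run after the fact and handle failures for.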

Grammars (GBNF, EBNF). Full context-free grammars. Useful when you need to constrain to a real language — SQL queries, code in some DSL, structured prose with specific shape. Slightly more expensive to evaluate than regex but supports nesting.
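Nesting is what pushes a constraint beyond regex: the decoder has to remember how many structures are still open. A minimal sketch, assuming a tiny made-up grammar of nested digit lists (value := digit | "[" value ("," value)* "]") and working at character level for simplicity:

```python
DIGITS = set("0123456789")

def allowed_next(prefix):
    """Legal next characters after a valid prefix of the nested-list grammar.

    An empty set means the value is complete and only end-of-sequence is legal.
    """
    depth = 0            # open brackets; a counter works as the stack here
    expect_value = True  # True when the next thing must start a value
    for ch in prefix:
        if ch == "[":
            depth += 1
            expect_value = True
        elif ch in DIGITS:
            expect_value = False
        elif ch == ",":
            expect_value = True
        elif ch == "]":
            depth -= 1
            expect_value = False
    if expect_value:
        return DIGITS | {"["}
    if depth > 0:
        return {",", "]"}
    return set()
```

Because this grammar has only one bracket type, a depth counter is enough. A grammar that mixes bracket types (as JSON does with objects and arrays) needs a real stack, which is the extra bookkeeping a pushdown automaton provides over a DFA.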

Function calling. A specialization of JSON-schema constrained decoding where the schema names a tool to invoke. Same machinery, different framing.

Why it generally beats prompt-only “structured output”

Pure-prompt structured output (“respond in JSON with keys X, Y, Z”) works most of the time on capable models, but its failure modes are production-breaking. The model wraps the JSON in markdown fences, emits a chatty preamble, omits a required field, or returns a string where you wanted a number. Each failure is rare individually; at scale, parse-error rates of 1-5% are common.
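Each of those failure modes is easy to reproduce against a naive parser. The replies below are hand-written examples, not captured model output, and the single `score` field is a hypothetical schema chosen for brevity.

```python
import json

FENCE = "`" * 3  # built indirectly so this example contains no literal fence

REPLIES = {
    "clean": '{"score": 7}',
    "fenced": f'{FENCE}json\n{{"score": 7}}\n{FENCE}',
    "preamble": 'Sure! Here is the JSON: {"score": 7}',
    "missing_field": '{}',
    "wrong_type": '{"score": "7"}',
}

def check(text):
    """Classify a reply the way a strict downstream consumer would."""
    try:
        obj = json.loads(text)
    except json.JSONDecodeError:
        return "parse error"
    if not isinstance(obj, dict) or "score" not in obj:
        return "missing field"
    if not isinstance(obj["score"], int) or isinstance(obj["score"], bool):
        return "wrong type"
    return "ok"

results = {name: check(text) for name, text in REPLIES.items()}
```

Only the "clean" reply passes. The fenced and preamble replies fail at `json.loads`, and the other two parse but violate the expected shape, which is often the harder failure to notice.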

Constrained decoding makes those failures architecturally impossible. It’s not “more reliable”: the output is structurally valid by construction. What it does not guarantee is that the values inside the structure are right.

Cost and latency

Constrained decoding adds a small per-step overhead — typically a few percent for regex/JSON-schema, more for full grammars with deep nesting. The state-machine queries can be cached and parallelized. In practice, the latency difference between unconstrained and constrained decoding is rarely the bottleneck.
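One reason the overhead stays small is that the mask depends only on the state-machine state, not on the full prefix, so it can be computed once per state and reused. A minimal sketch using the standard library's `functools.lru_cache`; the three-state integer constraint and the tiny vocabulary are hypothetical.

```python
from functools import lru_cache

VOCAB = ("0", "1", "12", "-", "a", "<eos>")

def is_legal(state, token):
    """Hypothetical constraint: optional '-', then one or more digits, then <eos>."""
    digit_tok = token.isdigit()
    if state == "start":
        return token == "-" or digit_tok
    if state == "sign":
        return digit_tok
    if state == "digits":
        return digit_tok or token == "<eos>"
    return False

@lru_cache(maxsize=None)
def mask_for_state(state):
    # Scans the vocabulary once per distinct state; later calls hit the cache.
    return tuple(is_legal(state, tok) for tok in VOCAB)
```

A long number spends almost every step in the "digits" state, so the vocabulary scan runs once and every later step is a cache lookup. Grammars with deep nesting have many more distinct states, which is part of why they cost more.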

The bigger cost is implementation complexity. Building a robust constrained-decoding system that handles arbitrary tokenizers, complex grammars, and edge cases is non-trivial — which is why most production teams use established libraries (Outlines, llguidance, vLLM’s structured-output support) rather than rolling their own.

What’s settled

Constrained decoding is the default for LLM-to-program interfaces. OpenAI, Anthropic, Google, and the major open-source serving frameworks expose JSON-schema constraints natively. The earlier debate (“should this be prompt-engineering or inference-time constraint?”) has settled on the inference-time answer for any output that downstream code parses. The remaining art is choosing where not to constrain, leaving the model room to reason before the structured answer appears.

Go further

What's the actual mechanism — how does the model 'know' what's valid?

It doesn't. The decoder maintains a state machine (a regex compiled to DFA, a grammar compiled to a pushdown automaton, or a JSON-schema walker). At each step, the state machine returns the set of token IDs that can legally follow the current state. The decoder masks all other tokens to logit -infinity before sampling. The model's distribution over allowed tokens is preserved (renormalized after masking).

Why GBNF instead of just JSON schema?

GBNF (the GGML BNF format used by llama.cpp) lets you express any context-free grammar — not just JSON-shaped output. Regex handles regular languages but not nested structure; JSON schema handles JSON specifically. GBNF handles SQL queries, code, custom DSLs, anything you can write a CFG for. Outlines and llguidance support both regex/JSON-schema (fast) and full grammar (more general, slightly slower).

Does constrained decoding work with reasoning models?

Yes, but you typically constrain only the answer portion, not the chain-of-thought. The model produces unconstrained reasoning tokens (often inside a designated tag like <think>...</think>), then transitions into the constrained-output state machine for the final structured answer. OpenAI's o1-style models and Anthropic's extended-thinking work this way.
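The two-phase idea can be sketched as a mask that is wide open until the reasoning section closes. This illustrates the concept only: how any particular provider implements it internally is not something this sketch claims to reflect, and the `</think>` token, the yes/no answer constraint, and the function name are assumptions.

```python
THINK_CLOSE = "</think>"

def allowed_tokens(tokens_so_far, vocab):
    """Unconstrained while reasoning; constrained once the think block closes."""
    if THINK_CLOSE not in tokens_so_far:
        return set(vocab)  # reasoning phase: every token is legal
    # Everything after the closing tag belongs to the constrained answer.
    answer = tokens_so_far[tokens_so_far.index(THINK_CLOSE) + 1:]
    if not answer:
        return {"yes", "no"}  # hypothetical answer constraint
    return {"<eos>"}

# Hypothetical vocabulary mixing reasoning, control, and answer tokens.
VOCAB = ["hmm", "so", THINK_CLOSE, "yes", "no", "<eos>"]
```

The model decides when to emit the closing tag, so it keeps as much room to reason as it wants. The constraint only takes effect for the tokens that downstream code will actually parse.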
