Also known as: zero-shot, instruction-only prompting
TL;DR
Zero-shot prompting is asking a model to do a task with no examples — only a description of what you want. It works surprisingly well on common tasks because the model's training distribution already contains analogous patterns.
Zero-shot prompting is the simplest form of prompting: describe the task in natural language, give the model the input, and ask for the output — no examples, no demonstrations. The model is expected to do the task purely from its training-distribution knowledge.
When it works
Zero-shot is reliable on tasks that are well-represented in instruction-tuning data: summarization, translation, sentiment classification, simple Q&A, code generation in popular languages, common formatting tasks (Markdown, basic JSON). For these, instruction-tuned LLMs like GPT-5 and Claude give near-fine-tuned-model accuracy from zero-shot alone. The model has seen thousands of analogous tasks during post-training and treats yours as one more.
The signature of a good zero-shot task: a competent human reading only the prompt would know exactly what to produce, and the format is “obvious” from the description.
When it breaks
Where zero-shot fails
Idiosyncratic schemas. “Return JSON with these specific keys, nulls for missing fields, integers not strings” — every constraint is a roll of the dice without examples.
Domain jargon. Terms you use internally that the model has thinly seen (“R-tier accounts”, “phase-2 review”) get reinterpreted into nearby concepts.
Novel formats. Anything the model hasn’t seen during training (custom DSLs, your internal markup) needs demonstrations.
Long-tail edge cases. Zero-shot accuracy on the 90% case can be 95% while the 10% tail fails catastrophically — and you only learn this from evals.
Tasks with implicit assumptions. The instructions sound clear to you because you have context the model doesn’t.
Zero-shot accuracy is largely a function of how densely the post-training distribution covers your task. Instruction-tuning datasets are heavy on the obvious tasks: summarization, translation, classification, Q&A, simple code, and standard formats (JSON, Markdown, CSV). When your task overlaps with these, the model has effectively seen thousands of analogous (instruction, response) pairs and treats yours as one more.
For unusual tasks — proprietary schemas, niche domain reasoning, your internal jargon — the model has to extrapolate from less-similar training examples. The output is plausible-looking but unreliable, because the model is interpolating between patterns rather than matching one.
The diagnostic test: rephrase your task as something a recent textbook or popular library would do. If a clear analog exists (“this is essentially a sentiment classifier”, “this is a code refactor task”), zero-shot will likely work. If you need three sentences to explain what the task even is, few-shot or fine-tuning will outperform.
The escalation ladder
The pragmatic progression when zero-shot underperforms:
Sharpen the instructions. Specify format, constraints, and edge-case handling explicitly in the system prompt.
Add few-shot examples . Two to five demonstrations close most accuracy gaps that prose cannot.
Add chain-of-thought . For reasoning-heavy tasks, “think step by step” or worked-out CoT examples often dwarf format improvements.
Fine-tune . When you’ve collected enough labeled data and the task is stable, a fine-tuned small model beats zero-shot frontier prompting on every metric.
Why zero-shot is still the default starting point
Even when you suspect you’ll need few-shot or fine-tuning, start zero-shot. It’s the cheapest probe of the model’s natural prior: if zero-shot is already 80% accurate, the remaining 20% tells you exactly where the model’s prior diverges from your task. If it’s 30%, you know upfront that prompting alone won’t get you there.
The mistake is shipping zero-shot without an eval set. The failure mode — confidently wrong, with the same fluent register as when it’s correct — is the hallucination failure that’s expensive in real systems.
Go further
Why does zero-shot work at all?
Modern instruction-tuned LLMs were post-trained on millions of (instruction, response) pairs covering an enormous variety of tasks. When you write a new instruction, the model is doing approximate retrieval over that distribution — finding the closest task it was trained on and applying its template. Zero-shot reliability is essentially a function of how well your task overlaps with the post-training distribution.
Build an eval set of 50-100 representative inputs with gold outputs, run zero-shot, and measure. If accuracy is >95% on the metrics you care about, ship it. If it's 70-90%, few-shot will probably close the gap. Below 70%, you likely need fine-tuning or task decomposition. Don't skip the eval — zero-shot accuracy on real distributions surprises everyone, in both directions.
Does adding 'think step by step' count as zero-shot?
By convention, yes — that phrase alone is called zero-shot chain-of-thought, since it triggers reasoning without providing reasoning examples. The original CoT paper showed even this minimal nudge dramatically improves math and logic accuracy on capable models. Few-shot CoT (with worked examples) is stronger, but zero-shot CoT is the cheapest reasoning unlock.