Instruction Tuning

Also known as: instruction fine-tuning, instruct tuning, SFT for instructions

TL;DR

Instruction tuning fine-tunes a pre-trained language model on (instruction, response) pairs so that it learns to follow directions. It is the step that turns 'GPT-base' into 'GPT-instruct'.

A pre-trained base LLM, fresh from next-token prediction on web-scale text, doesn’t reliably follow instructions. Ask it “What is the capital of France?” and it might continue with “is a question often asked by tourists. The capital of France is…”, completing the prompt as text rather than answering it. Instruction tuning is the step that fixes this.

The recipe

Collect (or generate) a corpus of (instruction, response) pairs:

Instruction: Summarize this article in three bullet points.
[Article text]
Response:
- Bullet 1
- Bullet 2
- Bullet 3

Fine-tune the base model on this data with standard cross-entropy loss on the response tokens (the instruction part is typically masked out of the loss). Output: a model that, given an instruction, produces a response.
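The masking step can be sketched as follows. This is a minimal illustration, not a production pipeline: the whitespace tokenizer, the `build_example` helper, and the `Instruction:` / `Response:` prompt layout are assumptions made for the example. The `-100` label value is the default index that PyTorch's cross-entropy loss ignores, and is a common convention for masked positions.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's CrossEntropyLoss


def toy_tokenize(text, vocab):
    """Whitespace tokenizer that grows `vocab` on the fly (illustration only)."""
    return [vocab.setdefault(tok, len(vocab)) for tok in text.split()]


def build_example(instruction, response, vocab):
    """Return (input_ids, labels) for one (instruction, response) pair."""
    prompt_ids = toy_tokenize(f"Instruction: {instruction} Response:", vocab)
    response_ids = toy_tokenize(response + " <eos>", vocab)
    input_ids = prompt_ids + response_ids
    # Prompt positions get IGNORE_INDEX so only response tokens carry loss.
    # The one-position shift for next-token prediction is left to the trainer.
    labels = [IGNORE_INDEX] * len(prompt_ids) + response_ids
    return input_ids, labels


vocab = {}
input_ids, labels = build_example(
    "Summarize this article in three bullet points.",
    "- Bullet 1 - Bullet 2 - Bullet 3",
    vocab,
)
```

The model still sees the full sequence as input; only the labels differ between prompt and response positions.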

That’s the entire mechanic. The interesting work is in the data.

What good instruction data looks like

Diversity of task types. Summarization, classification, code, math, reasoning, creative writing, format conversion, factual Q&A. Narrow data narrows the resulting model’s capability.
Diversity of instruction phrasing. Same task, many phrasings. “Summarize”, “TLDR”, “give me the key points”, “what’s the main idea”. Robustness to phrasing comes from variety in training.
Output format variety. JSON, bullet points, prose, tables, code blocks. Models learn the format vocabulary they see.
Coverage of refusals. Examples of “I can’t do that because…” for harmful or out-of-scope requests. Without these, the model tends to comply with almost any request.
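As one small illustration of phrasing diversity, a single (text, response) pair can be expanded into several differently worded instructions. The templates and the `expand_phrasings` helper below are hypothetical; a real dataset would draw on far more variation than four hand-written templates.

```python
import random

# Hypothetical paraphrase templates for a single task (summarization).
SUMMARIZE_TEMPLATES = [
    "Summarize the following text:\n{text}",
    "TL;DR:\n{text}",
    "Give me the key points of this passage:\n{text}",
    "What's the main idea here?\n{text}",
]


def expand_phrasings(text, response, templates, k, seed=0):
    """Pair one response with k distinct phrasings of the same instruction."""
    rng = random.Random(seed)
    chosen = rng.sample(templates, k)  # sampling without replacement
    return [
        {"instruction": t.format(text=text), "response": response}
        for t in chosen
    ]


pairs = expand_phrasings(
    "Paris is the capital of France.",
    "Paris is France's capital.",
    SUMMARIZE_TEMPLATES,
    k=3,
)
```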

Canonical instruction-tuning datasets

FLAN (Google, 2021) and its FLAN-v2 successor (2022): supervised tasks collated from existing NLP benchmarks and reformatted as instructions. FLAN is the original recipe.
Super-NaturalInstructions: 1,600+ NLP tasks, each with human-written instructions, spanning diverse task types.
Self-Instruct and Alpaca: LLM-generated instruction sets of roughly 52K instructions each, which showed that synthetic data could get surprisingly close to hand-curated quality.
Tülu and OpenHermes: modern open-source mixes that combine human-curated subsets with heavy synthetic augmentation.
Magpie: prompts an already-aligned chat model with only the pre-query part of its chat template, so the model generates the instruction itself, with no seed prompts required.

Why synthetic data dominates now

Hand-writing 100K diverse instruction-response pairs is brutal. The 2022-2023 trick (Self-Instruct by Wang et al., then Alpaca, then Evol-Instruct) was to ask a frontier LLM to generate the instruction data, using seed examples and template prompts to drive diversity. Modern instruction sets (Tülu, OpenHermes, Magpie) are largely LLM-generated, and the same synthetic-supervision pattern drives most modern fine-tuning outside the frontier labs.
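The generate-then-filter loop behind these recipes can be sketched roughly as below. Several things here are assumptions: `fake_propose` is a stand-in for a real LLM call, Jaccard token overlap is swapped in for the ROUGE-L similarity filter that Self-Instruct uses (to keep the sketch dependency-free), and the 0.7 threshold is illustrative.

```python
def jaccard(a, b):
    """Token-set overlap between two instructions, from 0.0 to 1.0."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def grow_pool(seed_instructions, propose, rounds, max_similarity=0.7):
    """Self-Instruct-style loop: propose candidates, keep only novel ones."""
    pool = list(seed_instructions)
    for _ in range(rounds):
        for candidate in propose(pool):
            # Reject anything too close to an instruction already kept.
            if all(jaccard(candidate, kept) < max_similarity for kept in pool):
                pool.append(candidate)
    return pool


def fake_propose(pool):
    """Stand-in for an LLM call: one near-duplicate and one new task."""
    return [pool[0] + " please", f"Write a haiku about topic number {len(pool)}."]


seeds = [
    "Summarize the article in three bullet points.",
    "Translate the sentence into French.",
]
pool = grow_pool(seeds, fake_propose, rounds=2)
```

In this toy run the near-duplicates are filtered out, so the pool grows only when a proposal is different enough from everything already kept.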

The standard recipe computes cross-entropy loss only on the response tokens: the instruction tokens are presented as input, but their per-token loss is masked to zero. The reasoning is that you don’t want the model learning to generate instructions; you want it learning to respond to them. Without masking, part of the gradient goes toward modeling instruction text, which spends capacity on the wrong objective and can hurt the model’s handling of out-of-distribution instructions at inference time. Some recipes also mask special tokens such as chat-template separators, so the model treats them as fixed scaffolding rather than learned content.
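To make the effect of masking concrete, here is a dependency-free sketch of the loss computation itself. It assumes the logits are already aligned with the labels (the next-token shift has been applied) and uses `-100` as the masked-label marker; the function name and the toy numbers are made up for illustration.

```python
import math

IGNORE_INDEX = -100


def masked_cross_entropy(logits, labels):
    """Mean negative log-likelihood over positions with a real label.

    logits: list of per-position score lists (one score per vocabulary entry)
    labels: list of target token ids, IGNORE_INDEX at masked positions
    """
    total, count = 0.0, 0
    for scores, label in zip(logits, labels):
        if label == IGNORE_INDEX:
            continue  # instruction token: contributes no loss, no gradient
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        total += log_z - scores[label]
        count += 1
    return total / count if count else 0.0


# Position 0 is an instruction token (masked); positions 1 and 2 are response tokens.
logits = [[5.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
labels = [IGNORE_INDEX, 1, 1]
loss = masked_cross_entropy(logits, labels)
```

Changing the scores at the masked position leaves the loss unchanged, which is the point: the model is never rewarded or penalized for how well it predicts the instruction.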

Instruction tuning vs alignment

Instruction tuning is the first half of the post-pre-training stack. The second half is preference optimization — RLHF or DPO — which takes an instruction-tuned model and pushes it toward responses humans prefer among the many possible instruction-following ones. Instruction tuning is necessary but not sufficient; an instruction-tuned model without alignment is helpful but often not honest, harmless, or stylistically calibrated.
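The difference between the two stages shows up in the shape of the training data. The sketch below contrasts an instruction-tuning record with a preference record and includes the per-pair DPO loss for reference. The class names, field names, and `beta=0.1` are illustrative assumptions; the log-probabilities are taken to be summed over each response's tokens.

```python
import math
from dataclasses import dataclass


@dataclass
class SFTExample:
    """Instruction tuning: one instruction, one target response."""
    instruction: str
    response: str


@dataclass
class PreferencePair:
    """Preference optimization: one instruction, a better and a worse response."""
    instruction: str
    chosen: str
    rejected: str


def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """DPO loss for one pair: -log(sigmoid(beta * margin of log-ratio gaps))."""
    margin = beta * (
        (policy_chosen_lp - ref_chosen_lp)
        - (policy_rejected_lp - ref_rejected_lp)
    )
    # Numerically stable -log(sigmoid(margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


sft = SFTExample("Name the capital of France.", "Paris.")
pair = PreferencePair(
    "Name the capital of France.",
    chosen="The capital of France is Paris.",
    rejected="France is a country in Europe.",
)
```

Both responses in the preference pair are instruction-shaped; the second stage only teaches the model which one is preferred.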

Where it intersects with retrieval

Instruction-following rerankers (zerank-2 included) use the same supervision shape — instructions like “rerank these documents for relevance to the query, prioritizing recent docs” — turning the reranker itself into a small instruction-tuned LLM with a relevance-scoring head.

Go further

What's the difference between instruction tuning and RLHF?

Instruction tuning teaches the model the shape of helpful responses (follow the instruction, produce a coherent answer). RLHF/DPO further refines which responses humans prefer among many helpful-shaped ones. They're sequential: instruction tuning first to make the model usable, then preference optimization to make it preferred.

Related: RLHF, DPO

Where does the instruction-tuning data come from?

Originally human-written (FLAN, Super-NaturalInstructions). Now overwhelmingly synthetic: frontier LLMs generate (instruction, response) pairs at scale. Self-Instruct, Alpaca, and Evol-Instruct were among the first widely used recipes; today most open-weight instruct models lean heavily on synthetic data.

Related: Synthetic data generation, Supervised fine-tuning

Does instruction tuning hurt the base model's capabilities?

It can. Narrow instruction data can cause [catastrophic forgetting](/concepts/catastrophic-forgetting/) of pre-training knowledge. Two common mitigations are mixing pre-training-style data into the fine-tuning set and using parameter-efficient methods like LoRA, which limit how far the model moves from the base weights.
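The data-mixing mitigation can be sketched as a sampler that occasionally draws a pre-training-style example instead of an instruction example. The `mix_batches` helper and its `replay_fraction` knob are hypothetical, and the 0.25 value below is a placeholder, not a recommendation.

```python
import random


def mix_batches(instruction_data, pretrain_data, replay_fraction, n, seed=0):
    """Draw n training examples, a fraction of them from pre-training-style data."""
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        # With probability replay_fraction, replay a pre-training-style example.
        use_replay = rng.random() < replay_fraction
        source = pretrain_data if use_replay else instruction_data
        out.append(rng.choice(source))
    return out


mixed = mix_batches(
    instruction_data=["<instruction example>"],
    pretrain_data=["<raw text example>"],
    replay_fraction=0.25,
    n=10000,
)
```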

Related: Catastrophic forgetting, LoRA / PEFT
