Training Methodology
How modern retrieval models get their relevance signal.
The supervision signal a retrieval model is trained on determines what it can learn. The concepts below cover how modern rerankers and embeddings get their relevance targets — pairwise preferences from frontier-LLM ensembles, Thurstone fits that recover continuous Elo-style scores, and the distillation pipelines that compress a giant teacher into a fast specialized student. This is the methodology family behind zerank-1, zerank-2, and zembed-1, and the same shape generalizes to any narrow task where pairwise judgments are cheap and absolute scores are noisy.
- Catastrophic Forgetting
When fine-tuning a pre-trained model on a new task erases capabilities the base model originally had. A classic neural-network failure mode in fine-tuning practice, and a key reason practitioners reach for rehearsal, mixed-data training, and LoRA.
- Constitutional AI
Constitutional AI replaces human pairwise preference labels with a written constitution — a list of natural-language rules — and uses an LLM to critique and revise its own outputs against those rules.
- DPO (Direct Preference Optimization)
DPO is an alternative to PPO-based RLHF that optimizes the LLM directly on pairwise preferences through a closed-form loss, with no separate reward model and no reinforcement learning loop. Simpler, more stable, and the default alignment recipe in 2026.
- Elo Score
Elo is a continuous skill rating recovered from pairwise win/loss outcomes — originally for chess, now repurposed in retrieval to convert pairwise document preferences into pointwise relevance scores.
- Ensemble Learning
Combining the predictions of multiple models — bagging, boosting, stacking — to get a single output more accurate than any individual member.
- Entropy Regularization
Adding an entropy bonus to a training objective to keep the model's output distribution from collapsing prematurely onto a few outputs. Used in policy-gradient and actor-critic RL (PPO, SAC, A3C) to encourage exploration.
- Fine-Tuning
Fine-tuning is the process of further training a pre-trained model on task-specific or domain-specific data. It's how a generalist becomes a specialist.
- Information Bottleneck
The information bottleneck principle frames learning as a compression problem: find a representation T of input X that discards as much of X as possible while staying informative about the target Y. Formally, maximize I(T; Y) while minimizing I(X; T), usually written as minimizing I(X; T) − β·I(T; Y) for a trade-off weight β > 0.
- Instruction Tuning
Instruction tuning is fine-tuning a pre-trained language model on (instruction, response) pairs so it learns to follow directions. The step that turns 'GPT-base' into 'GPT-instruct'.
- Knowledge Distillation
Training a small (student) model to mimic the outputs of a larger (teacher) model — getting most of the teacher's quality at a fraction of the cost. The basis of essentially every production deployment of small specialized models.
- Learning-Rate Scheduler
A learning-rate scheduler is the function that changes the learning rate over training. Linear warmup followed by cosine decay is the modern default; WSD (warmup-stable-decay) is the 2024 successor. Picking the schedule is as load-bearing as picking the peak LR.
- LoRA and Parameter-Efficient Fine-Tuning (PEFT)
LoRA injects small low-rank adapter matrices into a frozen base model and trains only those, typically around 1% of the parameters or less. On many narrow tasks the results come close to full fine-tuning at a fraction of the memory and storage cost.
- Pairwise Preference
Pairwise preference is the supervision signal where, for a query and two candidate documents, an annotator (or LLM) picks which one is more relevant.
- PPO (Proximal Policy Optimization)
A policy-gradient algorithm that keeps each update close to the previous policy by clipping the importance-sampling ratio. The standard RL optimizer for RLHF (Schulman et al., 2017, OpenAI), used in the InstructGPT/GPT-3.5 line and, together with rejection sampling, in Llama 2.
- Process Reward Model
A process reward model (PRM) scores each intermediate step of a reasoning chain, not just the final answer. It's the supervision signal that powers post-o1 reasoning models — credit assignment along the trajectory, not only at the end.
- Reward Modeling
Training a model that predicts a scalar quality or preference score for an LLM's output. The backbone of RLHF — the reward model is what the LLM optimizes against.
- RLHF (Reinforcement Learning from Human Feedback)
RLHF is the classical alignment recipe: train a reward model from human pairwise preferences, then fine-tune the language model with PPO to maximize that reward.
- Supervised Fine-Tuning (SFT)
SFT is plain supervised learning applied to a pre-trained language model: given (input, target) pairs, train the model to produce the target. The umbrella term for any fine-tuning that's not preference-based — distinct from RLHF and DPO.
- Synthetic Data Generation
Using a frontier LLM to generate training data for a smaller specialized model. The dominant data-creation method in 2026 — every modern open-weight instruct model and most production-tuned rerankers train on synthetic data, including zerank-2.
- Thurstone Model
A statistical model from 1927 that converts pairwise comparisons into continuous quality scores. Foundational to chess Elo ratings, food preference studies, and modern reranker training via the zELO methodology.
- zELO
ZeroEntropy's training methodology for rerankers and embeddings. Frontier LLMs vote pairwise on document relevance; a Thurstone fit recovers continuous Elo-style scores; the scores become regression targets for a small specialized model.
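The pairwise-votes-to-scores step can be sketched in a few lines. What follows is a minimal illustration, not ZeroEntropy's implementation: it assumes a simplified Thurstone model in which the probability that document i is preferred over document j is Φ(s_i − s_j), and fits the scores by plain gradient ascent on the log-likelihood. The function name `fit_thurstone`, the vote counts, the small L2 penalty (added so that a document that wins every comparison still gets a finite score), and the step size are all assumptions made for the example.

```python
import math


def normal_cdf(x):
    """Standard normal CDF, written with math.erf."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def normal_pdf(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def fit_thurstone(n_docs, wins, steps=2000, lr=0.05, l2=0.01):
    """Fit one score per document from pairwise vote counts.

    wins maps (i, j) -> number of votes preferring document i over j.
    Assumed model: P(i preferred over j) = normal_cdf(s_i - s_j).
    """
    scores = [0.0] * n_docs
    for _ in range(steps):
        # Gradient of the L2 penalty (hypothetical regularizer).
        grad = [-l2 * s for s in scores]
        for (i, j), count in wins.items():
            d = scores[i] - scores[j]
            p = max(normal_cdf(d), 1e-12)  # guard against division by zero
            g = count * normal_pdf(d) / p  # d/dd of count * log(cdf(d))
            grad[i] += g
            grad[j] -= g
        scores = [s + lr * g for s, g in zip(scores, grad)]
        # Only score differences matter, so pin the mean at zero.
        mean = sum(scores) / n_docs
        scores = [s - mean for s in scores]
    return scores


# Hypothetical votes for one query and three candidate documents.
votes = {(0, 1): 4, (1, 0): 1, (0, 2): 5, (1, 2): 4, (2, 1): 1}
scores = fit_thurstone(3, votes)
print([round(s, 3) for s in scores])
```

The fitted scores are continuous and centered at zero. In a pipeline of the shape described above they would serve as the regression targets for the student model, possibly after rescaling to an Elo-like range.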
