Search & Retrieval
How systems find relevant documents in the first place.
First-pass retrieval is the wide-net stage of any production search system — the algorithms, indexes, and tradeoffs that surface a few hundred candidates per query out of millions. Classical lexical retrieval (BM25), modern dense embeddings, and the hybrids that combine them all live here. Get this stage wrong and the rest of the pipeline can't recover; get it right and a reranker downstream can finish the job. The concepts below cover what each technique actually does, where each one breaks, and why production systems almost always layer multiple methods rather than picking one.
- Approximate Nearest Neighbor (ANN)
ANN algorithms — HNSW, IVF, ScaNN — find the closest vectors to a query without scanning all of them. They give up a small slice of recall in exchange for orders-of-magnitude speedup.
- BM25
BM25 is a classical lexical retrieval algorithm that scores documents by how well their term frequencies match a query, with corrections for document length and rare-term importance.
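The scoring rule described above can be sketched in a few lines. This is a minimal illustration, not a production implementation: the function name `bm25_scores`, the whitespace tokenization, the defaults `k1=1.2` and `b=0.75`, and the non-negative (Lucene-style) IDF variant are all assumptions made for the example.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document against a tokenized query with BM25.

    k1 and b are assumed defaults; real systems tune them per corpus.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Rare terms get a larger IDF weight (non-negative variant).
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # k1 makes term frequency saturate; b scales the length correction.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = [
    "the cat sat on the mat".split(),
    "dogs chase cats in the park".split(),
    "the quick brown fox".split(),
]
scores = bm25_scores("cat mat".split(), docs)
```

Only the first document contains the exact tokens `cat` and `mat`, so it is the only one with a non-zero score. The second document contains `cats`, which purely lexical matching treats as a different term.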
- Chunking
Chunking is the process of splitting long documents into smaller passages that fit cleanly inside an embedding model's context window — and that align with semantic boundaries so each chunk is independently retrievable.
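As a rough illustration, the simplest chunker is a fixed-size sliding window over words with some overlap. It ignores semantic boundaries entirely, so treat it as a baseline rather than the boundary-aware chunking described above; the name `chunk_words` and the default sizes are assumptions for the sketch.

```python
def chunk_words(text, max_words=200, overlap=50):
    """Split text into word windows of at most max_words, sharing `overlap` words."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        # Stop once a window has reached the end of the document.
        if start + max_words >= len(words):
            break
    return chunks

sample = " ".join(f"w{i}" for i in range(10))
chunks = chunk_words(sample, max_words=4, overlap=1)
```

With ten words, a window of four, and an overlap of one, this yields three chunks, each sharing its first word with the last word of the previous chunk.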
- Dense Retrieval
Dense retrieval finds documents by comparing their embeddings to a query embedding via cosine similarity or dot product, typically served from an approximate-nearest-neighbor index.
- FAISS
FAISS (Facebook AI Similarity Search) is a C++ library, with Python bindings, for efficient similarity search and clustering of dense vectors. It implements exact (flat) search alongside the common ANN index types (IVF, HNSW, PQ, and combinations of them), with CPU and GPU backends.
- First-Pass Retrieval
First-pass retrieval is the initial wide-net stage of a production search pipeline that surfaces a few hundred candidate documents per query out of millions. It optimizes for recall and speed; precision-at-the-top is left to a reranker downstream.
- Grounded Generation
Grounded generation is the pattern of constraining an LLM's output to be derivable from a supplied set of retrieved sources, with citations attached. It is a standard defense against hallucination in RAG pipelines.
- HNSW
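HNSW (Hierarchical Navigable Small World) is the dominant graph-based ANN algorithm. It builds a multi-layer proximity graph and answers a query with a greedy walk at each layer, from the sparse top layer down to the dense bottom one, which gives roughly logarithmic search time in practice.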
- Hybrid Search
Hybrid search combines lexical retrieval (BM25) with dense retrieval (embeddings) into one ranked candidate set. Each method catches what the other misses, so the combined set typically has higher recall than either alone.
- Inverted Index
An inverted index maps each term to the list of documents (and positions) where it appears. It is the classical data structure behind keyword search: term lookups stay fast even over very large corpora, and it is the substrate that BM25 implementations build on.
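A toy version of the structure makes the mapping concrete. This sketch assumes lowercase whitespace tokenization and keeps everything in memory; the helper names `build_inverted_index` and `lookup_all` are hypothetical.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map term -> {doc_id: [positions]} for a list of raw text documents."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in enumerate(docs):
        for pos, term in enumerate(text.lower().split()):
            index[term][doc_id].append(pos)
    return index

def lookup_all(index, terms):
    """Return sorted ids of documents that contain every term (boolean AND)."""
    postings = [set(index[t]) if t in index else set() for t in terms]
    return sorted(set.intersection(*postings)) if postings else []

docs = ["the cat sat", "the dog sat", "a cat and a dog"]
index = build_inverted_index(docs)
both = lookup_all(index, ["cat", "dog"])
```

Here `both` contains only document 2, the single document whose posting lists for `cat` and `dog` intersect. Real engines store posting lists compressed and sorted so the intersection can skip ahead rather than materialize sets.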
- IVF Clustering
IVF (Inverted File Index) is the cluster-based ANN algorithm: k-means partitions the corpus into cells (often a few thousand), each query is matched to its nearest centroids, and only the vectors in those cells are searched exhaustively.
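The probe-then-scan idea can be sketched without a vector library. To keep the example short, the centroids are hand-picked rather than learned with k-means, and the names `build_ivf`, `ivf_search`, and `nprobe` are assumptions for illustration (`nprobe` mirrors the usual term for the number of cells visited).

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def build_ivf(vectors, centroids):
    """Assign each vector id to the cell of its nearest centroid."""
    cells = {i: [] for i in range(len(centroids))}
    for vid, v in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: sq_dist(v, centroids[c]))
        cells[nearest].append(vid)
    return cells

def ivf_search(query, vectors, centroids, cells, nprobe=1, top_k=1):
    """Scan only the nprobe cells whose centroids are closest to the query."""
    order = sorted(range(len(centroids)), key=lambda c: sq_dist(query, centroids[c]))
    candidates = [vid for c in order[:nprobe] for vid in cells[c]]
    return sorted(candidates, key=lambda vid: sq_dist(query, vectors[vid]))[:top_k]

vectors = [[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9]]
centroids = [[0.1, 0.05], [5.1, 5.0]]  # hand-picked stand-ins for k-means output
cells = build_ivf(vectors, centroids)
hits = ivf_search([4.9, 5.0], vectors, centroids, cells, nprobe=1, top_k=2)
```

With `nprobe=1` the query only ever sees the two vectors in the second cell. Raising `nprobe` trades speed for recall, because a true neighbor that landed in an adjacent cell is otherwise missed.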
- Parent-Document Retrieval
Parent-document retrieval splits the index granularity from the context granularity: embed and retrieve over small chunks for precision, but return the larger parent document to the LLM. This mitigates the chunk-boundary problem in RAG, where a small chunk matches the query but lacks the surrounding context needed to answer it.
- Product Quantization
Product quantization (PQ) compresses a vector by splitting it into M sub-vectors and quantizing each independently against a small codebook learned via k-means, so the vector is stored as M codebook indices instead of full floats.
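The encode and decode steps can be sketched as follows. The codebooks here are hand-written stand-ins for what k-means would learn, and `pq_encode` and `pq_decode` are hypothetical helper names; the sketch also assumes the vector length is divisible by M.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def split(vec, m):
    """Cut a vector into m equal-length sub-vectors."""
    size = len(vec) // m
    return [vec[i * size:(i + 1) * size] for i in range(m)]

def pq_encode(vec, codebooks):
    """Replace each sub-vector with the index of its nearest centroid."""
    subs = split(vec, len(codebooks))
    return [
        min(range(len(cb)), key=lambda j: sq_dist(sub, cb[j]))
        for sub, cb in zip(subs, codebooks)
    ]

def pq_decode(codes, codebooks):
    """Rebuild an approximate vector by concatenating the chosen centroids."""
    out = []
    for code, cb in zip(codes, codebooks):
        out.extend(cb[code])
    return out

# M = 2 sub-spaces with 2 centroids each (assumed, not learned).
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],  # centroids for dimensions 0-1
    [[0.0, 1.0], [1.0, 0.0]],  # centroids for dimensions 2-3
]
vec = [0.9, 1.1, 0.1, 0.8]
codes = pq_encode(vec, codebooks)
approx = pq_decode(codes, codebooks)
```

The four floats compress to the two small integers in `codes`, and `approx` is the lossy reconstruction. In practice each codebook usually holds 256 centroids so that every code fits in one byte.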
- Query Expansion
Query expansion augments the original query with synonyms, paraphrases, or hypothetical answers before retrieval. It is a classical IR technique that LLM-based methods such as HyDE have revived. Sometimes it is a clean win; sometimes it introduces query drift that hurts more than it helps.
- Query Rewriting
Query rewriting transforms a user's raw query into one or more reformulated versions tuned for retrieval — expanding abbreviations, decomposing multi-part questions, or fixing the syntax expected by an underlying search API.
- RAG (Retrieval-Augmented Generation)
RAG is the pattern of retrieving relevant documents and feeding them into an LLM as context, so the LLM can answer with grounded, citeable information instead of guessing from its training data.
- Reciprocal Rank Fusion
Reciprocal rank fusion (RRF) is the boring, nearly parameter-free way to merge multiple ranked lists into one. Score each document by summing $1/(k + \text{rank})$ over the lists it appears in, with $k = 60$ as the conventional default, then sort by that score. It is a common default fusion method in production hybrid-search stacks.
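The fusion rule is short enough to write out directly. This is a minimal sketch: the function name `reciprocal_rank_fusion`, the 1-based ranks, and the tie-break on document id are assumptions, and the two input lists are made-up examples.

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge ranked lists of doc ids by summing 1 / (k + rank) per document."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):  # ranks start at 1
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

bm25_hits = ["d1", "d2", "d3"]   # hypothetical lexical ranking
dense_hits = ["d3", "d1", "d4"]  # hypothetical dense ranking
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
```

Documents that appear in both lists (`d1`, `d3`) rise above those that appear in only one, and no score normalization between the two retrievers is needed because only ranks are used.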
- Semantic Search
Semantic search is the umbrella term for retrieval that goes beyond surface keyword matching to capture meaning — most often via dense embeddings, but also via learned-sparse models, query rewriting, and reranking.
- Sparse Retrieval
Sparse retrieval is the family of methods that represent queries and documents as high-dimensional sparse vectors over a vocabulary — including BM25 and modern learned-sparse models like SPLADE and uniCOIL.
- SPLADE
SPLADE (SParse Lexical AnD Expansion) is a learned sparse retrieval model: a transformer produces a sparse term-weight vector over the BERT vocabulary for each query and document, scored by dot product on an inverted index.
- TF-IDF
TF-IDF weighs a term by how often it appears in a document (term frequency) times how rare it is across the corpus (inverse document frequency).
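A small worked version shows how the two factors interact. This sketch assumes length-normalized term frequency and an unsmoothed natural-log IDF, which is only one of several common variants; the name `tf_idf` and the toy corpus are illustrative.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    weights = []
    for d in docs:
        tf = Counter(d)
        weights.append({
            # (share of the document) * log(corpus size / documents with the term)
            term: (count / len(d)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights

docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran"],
]
weights = tf_idf(docs)
```

Because `the` appears in every document, its IDF is log(1) = 0 and it contributes nothing, while `dog`, which appears in only one document, gets the largest weight in its document.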
