TF-IDF (term frequency × inverse document frequency) is the classical weighting that turns a bag-of-words document into a vector of term importances. Each term’s weight is the product of two factors:
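$\mathrm{tf}(t, d)$: how often term $t$ appears in document $d$. More repetitions → more important to this document.
$\mathrm{idf}(t) = \log\frac{N}{\mathrm{df}(t)}$, where $\mathrm{df}(t)$ is how many documents in the corpus of size $N$ contain $t$. Common terms have high $\mathrm{df}$, so $\mathrm{idf}$ is small. Rare terms have low $\mathrm{df}$, so the IDF factor is large.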
The intuition: a term that’s frequent in this document but rare across the corpus is highly discriminative — it’s probably what the document is about.
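As a minimal sketch of the weighting described above, the weights for a toy corpus can be computed directly. The whitespace tokenizer, the raw-count TF, and the unsmoothed natural-log IDF are assumptions made for illustration; real implementations differ in smoothing and normalization.

```python
import math
from collections import Counter

def tf_idf(corpus):
    """Return one {term: weight} dict per document: raw count * log(N / df)."""
    tokenized = [doc.lower().split() for doc in corpus]
    n_docs = len(tokenized)
    # df[t] = number of documents that contain term t at least once
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term counts for this document
        vectors.append({t: count * math.log(n_docs / df[t]) for t, count in tf.items()})
    return vectors

# Toy corpus (hypothetical data, chosen so "the" appears in every document)
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]
weights = tf_idf(corpus)
top_term = max(weights[0], key=weights[0].get)
```

In the first document, "the" occurs twice but appears in every document, so its weight is zero, while "mat" occurs once in the whole corpus and ends up as the top-weighted term.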
Why retrieval moved past it
For ranked retrieval, BM25 replaced TF-IDF in the 1990s and never gave the position back. Two specific fixes:
TF saturation. Plain TF grows linearly: a document mentioning “cat” 100 times scores 100x more than a document mentioning it once. That’s wrong: the second mention is informative, the hundredth is noise. BM25 saturates TF with $\frac{\mathrm{tf}\,(k_1 + 1)}{\mathrm{tf} + k_1}$ so additional repetitions yield diminishing returns.
Document-length normalization. TF-IDF leaves length handling to the implementation (sometimes none, sometimes ad-hoc). BM25 normalizes by document length relative to corpus average, with a tunable parameter $b$ (0 disables the normalization, 1 applies it fully).
BM25 also has a cleaner probabilistic derivation, but the practical wins are TF saturation and length normalization.
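Both fixes fit in a few lines. The sketch below uses the commonly cited BM25 per-term formula; the function name and the defaults `k1=1.2` and `b=0.75` are assumptions (typical values, not taken from this article), and the IDF is passed in rather than computed.

```python
def bm25_term_score(tf, doc_len, avg_doc_len, idf, k1=1.2, b=0.75):
    """Contribution of one query term to one document's score (common BM25 form)."""
    # Length normalization: b=0 ignores length, b=1 normalizes fully.
    norm = 1 - b + b * (doc_len / avg_doc_len)
    # Saturating TF: the score approaches idf * (k1 + 1) as tf grows.
    return idf * tf * (k1 + 1) / (tf + k1 * norm)

# Saturation: 1 vs 100 mentions in an average-length document.
one = bm25_term_score(tf=1, doc_len=100, avg_doc_len=100, idf=1.0)
hundred = bm25_term_score(tf=100, doc_len=100, avg_doc_len=100, idf=1.0)

# Length normalization: same single mention, but in a document twice as long.
long_doc = bm25_term_score(tf=1, doc_len=200, avg_doc_len=100, idf=1.0)
```

With these assumed defaults, a hundred mentions score a little over twice as much as one mention rather than a hundred times as much, and the same single mention counts for less in the longer document.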
The log isn’t cosmetic. Without it, IDF would be $N / \mathrm{df}$: for a corpus of a million documents and a term appearing in one, that’s a million-fold weight. One rare token would drown out every other signal in the score.
Taking the log compresses that to $\ln 10^6 \approx 13.8$ (or exactly 6 in base 10), large but tractable. Because IDF depends only on the ratio $N / \mathrm{df}$, it is also stable as the corpus grows: doubling the corpus while keeping document frequency proportional leaves IDF unchanged. The base of the log is conventional ($e$, $2$, or $10$) and doesn’t affect ranking, only the absolute score scale.
Most modern variants use $\log\frac{N - \mathrm{df} + 0.5}{\mathrm{df} + 0.5}$, the “Robertson-Spärck Jones” form, which handles edge cases (term in every document, term in no document) more gracefully and is what BM25 inherits.
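To make the numbers concrete, here is a small comparison of the raw ratio, the plain log IDF, and a 0.5-smoothed form of the kind usually attributed to Robertson and Spärck Jones. The function names are illustrative, and the smoothed formula is the commonly cited one, assumed here rather than taken from a specific library.

```python
import math

def raw_ratio(n_docs, df):
    # No log: a term in 1 of 1,000,000 documents gets a million-fold weight.
    return n_docs / df

def log_idf(n_docs, df):
    # Plain log IDF; undefined for df == 0 (division by zero).
    return math.log(n_docs / df)

def rsj_idf(n_docs, df):
    # 0.5-smoothed form: stays finite for df == 0 and df == n_docs.
    return math.log((n_docs - df + 0.5) / (df + 0.5))

n = 1_000_000
for df in (1, 1_000, 500_000):
    print(df, raw_ratio(n, df), round(log_idf(n, df), 2), round(rsj_idf(n, df), 2))
```

One design note: the smoothed form goes negative once a term appears in more than half the documents, which is why some implementations clamp it or add 1 inside the log.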
Where TF-IDF still earns its keep
TF-IDF survives outside ranked retrieval because it’s not just a scoring function — it’s a vectorization. A document under TF-IDF becomes a sparse vector indexed by vocabulary, which is exactly what scikit-learn’s TfidfVectorizer produces. With that vector you can:
Feed it into a logistic regression or SVM for classification.
Cluster with k-means or DBSCAN.
Spot near-duplicates by thresholding cosine.
Build a quick topic-model baseline before reaching for LDA or BERTopic.
For all of these tasks, BM25 isn’t a drop-in replacement — BM25 scores a (query, document) pair, not a document on its own. So the modern position is: BM25 for ranked retrieval, TF-IDF for document representation.
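As one illustration of TF-IDF used as a representation, here is a rough sketch of near-duplicate detection by thresholding cosine similarity between document vectors. It is pure Python with a whitespace tokenizer; the helper names, the toy documents, and the 0.7 threshold are all assumptions for illustration, and a real pipeline would more likely use scikit-learn's TfidfVectorizer.

```python
import math
from collections import Counter
from itertools import combinations

def tfidf_vectors(docs):
    """One sparse {term: weight} vector per document, built with no query involved."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    df = Counter(t for toks in tokenized for t in set(toks))
    return [
        {t: c * math.log(n / df[t]) for t, c in Counter(toks).items()}
        for toks in tokenized
    ]

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def near_duplicates(docs, threshold):
    """Index pairs (i, j), i < j, whose cosine similarity meets the threshold."""
    vecs = tfidf_vectors(docs)
    return [
        (i, j)
        for i, j in combinations(range(len(docs)), 2)
        if cosine(vecs[i], vecs[j]) >= threshold
    ]

# Hypothetical documents: the first two differ by a single word.
docs = [
    "how to reset your account password",
    "how to reset your account password quickly",
    "best hiking trails near seattle",
    "quarterly revenue grew in europe",
]
pairs = near_duplicates(docs, threshold=0.7)
```

Each document gets its vector independently of any query, which is the property BM25 lacks. The threshold is corpus-dependent and would need tuning on real data.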
TF-IDF vs dense embeddings
A TF-IDF vector for a typical corpus has hundreds of thousands of dimensions, mostly zeros; the dimensionality is the vocabulary size. A dense embedding has a few hundred to a few thousand dimensions, nearly all of them nonzero. The two complement each other: TF-IDF (or sparse retrieval more broadly) catches rare exact tokens; embeddings catch paraphrases. Hence hybrid search.
For a small project where you don’t want infrastructure, TF-IDF + cosine in one numpy call is still the fastest path from “I have documents” to “I can search them.”
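A rough sketch of that path, assuming numpy is available: build a row-normalized TF-IDF matrix once, then score a query against every document with a single matrix-vector product. The function names, the whitespace tokenizer, and the dense array are simplifications for illustration; a dense matrix is only reasonable for a tiny corpus, and a sparse representation would be needed beyond that.

```python
import numpy as np

def build_index(docs):
    """Return (vocab, idf, matrix) where matrix rows are L2-normalized TF-IDF vectors."""
    tokenized = [d.lower().split() for d in docs]
    vocab = {t: i for i, t in enumerate(sorted({t for toks in tokenized for t in toks}))}
    counts = np.zeros((len(docs), len(vocab)))
    for row, toks in enumerate(tokenized):
        for t in toks:
            counts[row, vocab[t]] += 1
    df = (counts > 0).sum(axis=0)      # documents containing each term
    idf = np.log(len(docs) / df)       # unsmoothed log IDF
    matrix = counts * idf
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    matrix = matrix / np.where(norms == 0, 1.0, norms)  # avoid dividing by zero
    return vocab, idf, matrix

def search(query, vocab, idf, matrix):
    """Return document indices ordered from best to worst match."""
    q = np.zeros(len(vocab))
    for t in query.lower().split():
        if t in vocab:                 # out-of-vocabulary terms are ignored
            q[vocab[t]] += 1
    scores = matrix @ (q * idf)        # cosine up to the constant query norm
    return np.argsort(-scores)

# Hypothetical documents for illustration
docs = [
    "the cat sat on the mat",
    "the dog chased the ball",
    "stock markets fell sharply today",
]
vocab, idf, matrix = build_index(docs)
ranking = search("cat mat", vocab, idf, matrix)
```

Dividing by the query norm is skipped because it scales every score equally and so cannot change the ranking.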
Go further
If BM25 is strictly better, why is TF-IDF still around?
It isn't strictly better — it's better for retrieval. TF-IDF as a feature representation (turning a document into a sparse vector for clustering, classification, or similarity) is still standard, because BM25 is a query-time scoring function, not a vectorization. Scikit-learn's TfidfVectorizer is the workhorse there.
BM25 fixes two things. TF in TF-IDF grows linearly with term repetition: a document mentioning “cat” 100 times scores 100x more than one mentioning it once, which is wrong. BM25 saturates TF so additional mentions yield diminishing returns. BM25 also adds proper document-length normalization; TF-IDF leaves length as a per-implementation footnote.
TF-IDF is still the right choice when you have a small corpus (under ~100k docs), no infrastructure, and a content-similarity task: finding near-duplicates, deduplication, simple topic clustering. The whole pipeline is one numpy call. Past that scale you want an inverted index plus BM25.