TF-IDF

Also known as: term frequency-inverse document frequency, tfidf

TL;DR

TF-IDF weighs a term by how often it appears in a document (term frequency) times how rare it is across the corpus (inverse document frequency).

TF-IDF (term frequency × inverse document frequency) is the classical weighting that turns a bag-of-words document into a vector of term importances. Each term’s weight is the product of two factors:

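$\mathrm{tf}(t, d)$: how often term $t$ appears in document $d$. More repetitions → more important to this document.
$\mathrm{idf}(t) = \log \frac{N}{\mathrm{df}(t)}$, where $\mathrm{df}(t)$ is how many documents in the corpus of size $N$ contain $t$. Common terms have high $\mathrm{df}(t)$, so $\mathrm{idf}(t)$ is small. Rare terms have low $\mathrm{df}(t)$, so the IDF factor is large.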

The intuition: a term that’s frequent in this document but rare across the corpus is highly discriminative — it’s probably what the document is about.
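As a minimal sketch of the product described above, the following computes weights with raw counts for TF and $\log(N / \mathrm{df})$ for IDF. The function name `tf_idf`, the whitespace tokenizer, and the toy corpus are assumptions made for illustration; real implementations differ in tokenization, smoothing, and normalization.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document, using raw counts and log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]  # naive whitespace tokenizer (assumption)
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term at least once.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term counts in this document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

# Toy corpus, invented for this example.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the stock market fell",
]
weights = tf_idf(docs)
```

In this toy corpus "the" appears in every document, so its IDF is $\log(3/3) = 0$ and its weight vanishes even though it is the most frequent token. "sat" appears in a single document and gets the largest IDF.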

Why retrieval moved past it

For ranked retrieval, BM25 replaced TF-IDF in the 1990s and never gave the position back. Two specific fixes:

TF saturation. Plain TF grows linearly: a document mentioning “cat” 100 times scores 100x higher than a document mentioning it once. That’s wrong: the second mention is informative, the hundredth is noise. BM25 saturates TF with $\frac{\mathrm{tf} \cdot (k_1 + 1)}{\mathrm{tf} + k_1}$ (shown here without the length term), so additional repetitions yield diminishing returns.
Document-length normalization. TF-IDF leaves length handling to the implementation (sometimes none, sometimes ad hoc). BM25 normalizes by document length relative to the corpus average, with a tunable parameter $b$.

BM25 also has a cleaner probabilistic derivation, but the practical wins are TF saturation and length normalization.
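The two fixes can be sketched together in one function. This assumes the commonly cited BM25 term-frequency component, $\frac{\mathrm{tf} \cdot (k_1 + 1)}{\mathrm{tf} + k_1 \cdot (1 - b + b \cdot |d| / \mathrm{avgdl})}$, with the IDF factor left out. The name `bm25_tf` and the defaults `k1=1.2`, `b=0.75` are illustrative assumptions (widely used values, not taken from this article).

```python
def bm25_tf(tf, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Saturating, length-normalized TF component of BM25 (IDF factor omitted)."""
    # b = 0 disables length normalization; b = 1 applies it fully.
    length_norm = 1.0 - b + b * (doc_len / avg_doc_len)
    # The ratio approaches k1 + 1 as tf grows, so repetitions saturate.
    return tf * (k1 + 1.0) / (tf + k1 * length_norm)

# For a document of average length, compare 1 mention with 100 mentions.
one_mention = bm25_tf(1, doc_len=100, avg_doc_len=100)
hundred_mentions = bm25_tf(100, doc_len=100, avg_doc_len=100)

# Same term count in a document twice the average length scores lower.
long_doc = bm25_tf(5, doc_len=200, avg_doc_len=100)
avg_doc = bm25_tf(5, doc_len=100, avg_doc_len=100)
```

With these defaults the hundred-mention document scores a little over twice the single-mention one, not 100 times, and the ceiling is $k_1 + 1$.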

The log isn’t cosmetic. Without it, IDF would be $N / \mathrm{df}(t)$: for a corpus of a million documents and a term appearing in one, that’s a million-fold weight. One rare token would drown out every other signal in the score.

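Taking the log compresses that to $\ln 10^6 \approx 13.8$: large but tractable. It also makes IDF additive in a useful way: the log of a product is a sum, so summing IDF over query terms corresponds to multiplying their rarity ratios. And because IDF depends only on the ratio $N / \mathrm{df}(t)$, doubling the corpus while keeping document frequency proportional leaves IDF unchanged. The base of the log is conventional ($2$, $e$, or $10$) and doesn’t affect ranking, only the absolute score scale.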

Most modern variants use $\log \frac{N - \mathrm{df}(t) + 0.5}{\mathrm{df}(t) + 0.5}$, the “Robertson-Spärck Jones” form, which handles edge cases (term in every document, term in no document) more gracefully and is what BM25 inherits.
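To make the comparison concrete, here is a small sketch that evaluates three forms of IDF side by side: the raw ratio $N / \mathrm{df}$, its natural log, and a smoothed form $\log \frac{N - \mathrm{df} + 0.5}{\mathrm{df} + 0.5}$. The function names are hypothetical, and the $+0.5$ smoothing is assumed here as the usual Robertson-Spärck Jones weight.

```python
import math

def idf_ratio(n_docs, df):
    """Raw ratio N / df, with no log."""
    return n_docs / df

def idf_log(n_docs, df):
    """Classic IDF: natural log of N / df."""
    return math.log(n_docs / df)

def idf_rsj(n_docs, df):
    """Smoothed form with +0.5 terms (assumed Robertson-Sparck Jones weight)."""
    return math.log((n_docs - df + 0.5) / (df + 0.5))

n_docs = 1_000_000
for df in (1, 1_000, 500_000):
    print(df, idf_ratio(n_docs, df), round(idf_log(n_docs, df), 2), round(idf_rsj(n_docs, df), 2))

# Edge cases: the smoothed form stays finite where N / df would divide by zero.
unseen_term = idf_rsj(n_docs, 0)
everywhere_term = idf_rsj(n_docs, n_docs)
```

One property worth knowing: the smoothed form turns negative once a term appears in more than half the documents, so some implementations clamp it or add 1 inside the log.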

Where TF-IDF still earns its keep

TF-IDF survives outside ranked retrieval because it’s not just a scoring function — it’s a vectorization. A document under TF-IDF becomes a sparse vector indexed by vocabulary, which is exactly what scikit-learn’s TfidfVectorizer produces. With that vector you can:

What TF-IDF vectors are good for

Compute cosine similarity between documents.
Feed it into a logistic regression or SVM for classification.
Cluster with k-means or DBSCAN.
Spot near-duplicates by thresholding cosine.
Build a quick topic-model baseline before reaching for LDA or BERTopic.

For all of these tasks, BM25 isn’t a drop-in replacement — BM25 scores a (query, document) pair, not a document on its own. So the modern position is: BM25 for ranked retrieval, TF-IDF for document representation.
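As an illustration of TF-IDF as a document representation, the sketch below vectorizes a toy corpus with scikit-learn's `TfidfVectorizer` (default settings) and flags near-duplicates by thresholding pairwise cosine similarity. The corpus and the `threshold` value of 0.8 are assumptions chosen for the example, not recommended settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus, invented for this example.
docs = [
    "the cat sat on the mat",
    "the cat sat on the mat today",
    "stock markets fell sharply on monday",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)  # sparse matrix, shape (n_docs, vocabulary size)
sims = cosine_similarity(X)         # dense array, shape (n_docs, n_docs)

threshold = 0.8  # hypothetical cutoff; tune on your own data
near_duplicates = [
    (i, j)
    for i in range(len(docs))
    for j in range(i + 1, len(docs))
    if sims[i, j] >= threshold
]
```

The same matrix `X` can be passed directly to a classifier or a clustering estimator, which is what makes it a representation and not a query-time score.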

TF-IDF vs dense embeddings

A TF-IDF vector for a typical corpus is hundreds of thousands of dimensions, mostly zeros, because the dimensionality is the vocabulary size. A dense embedding is a few hundred to a few thousand dense dimensions. The two complement each other: TF-IDF (or sparse retrieval more broadly) catches rare exact tokens; embeddings catch paraphrases. Hence hybrid search.

For a small project where you don’t want infrastructure, TF-IDF + cosine in one numpy call is still the fastest path from “I have documents” to “I can search them.”
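A rough sketch of that path, using only numpy: build L2-normalized TF-IDF rows once, then score a query with a single matrix-vector product, which equals cosine similarity for unit vectors. The helper names (`count_vector`, `tfidf_unit`), the whitespace tokenizer, and the toy corpus are assumptions for illustration. A dense matrix like this only makes sense for small vocabularies.

```python
import numpy as np

# Toy corpus, invented for this example.
docs = ["the cat sat on the mat", "the dog chased the cat", "the stock market fell"]
tokenized = [d.split() for d in docs]
vocab = sorted({t for tokens in tokenized for t in tokens})
index = {t: i for i, t in enumerate(vocab)}

def count_vector(tokens):
    """Raw term counts over the fixed vocabulary; unknown tokens are ignored."""
    v = np.zeros(len(vocab))
    for t in tokens:
        if t in index:
            v[index[t]] += 1.0
    return v

counts = np.stack([count_vector(t) for t in tokenized])  # shape (n_docs, n_vocab)
df = (counts > 0).sum(axis=0)                            # document frequency per term
idf = np.log(len(docs) / df)

def tfidf_unit(rows):
    """Apply IDF weights and scale each vector to unit length (zero vectors stay zero)."""
    weighted = rows * idf
    norms = np.linalg.norm(weighted, axis=-1, keepdims=True)
    return weighted / np.where(norms == 0, 1.0, norms)

X = tfidf_unit(counts)
q = tfidf_unit(count_vector("dog chased".split()))

scores = X @ q                 # the "one numpy call": cosine similarity per document
best = int(np.argmax(scores))  # index of the best-matching document
```

Here the query "dog chased" ranks the second document first, and the other two score zero because they share no query terms.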

Go further

If BM25 is strictly better, why is TF-IDF still around?

It isn't strictly better — it's better for retrieval. TF-IDF as a feature representation (turning a document into a sparse vector for clustering, classification, or similarity) is still standard, because BM25 is a query-time scoring function, not a vectorization. Scikit-learn's TfidfVectorizer is the workhorse there.


What does BM25 actually fix vs raw TF-IDF?

Two things. TF in TF-IDF grows linearly with term repetition — a document mentioning 'cat' 100 times scores 100x more than one mentioning it once, which is wrong. BM25 saturates TF so additional mentions yield diminishing returns. BM25 also adds proper document-length normalization; TF-IDF leaves length as a per-implementation footnote.


When is a TF-IDF cosine search the right tool?

Small corpora (under ~100k docs), no infrastructure, and a content-similarity task — finding near-duplicates, deduplication, simple topic clustering. The whole pipeline is one numpy call. Past that scale you want an inverted index plus BM25.
