TF-IDF

Also known as: term frequency-inverse document frequency, tfidf

TL;DR

TF-IDF weighs a term by how often it appears in a document (term frequency) times how rare it is across the corpus (inverse document frequency).

TF-IDF (term frequency × inverse document frequency) is the classical weighting that turns a bag-of-words document into a vector of term importances. Each term’s weight is the product of two factors:

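  • $\mathrm{tf}(t, d)$: how often term $t$ appears in document $d$. More repetitions → more important to this document.
  • $\mathrm{idf}(t) = \log\frac{N}{\mathrm{df}(t)}$, where $\mathrm{df}(t)$ is how many documents in the corpus of size $N$ contain $t$. Common terms have high $\mathrm{df}(t)$, so $\mathrm{idf}(t)$ is small. Rare terms have low $\mathrm{df}(t)$, so the IDF factor is large.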

The intuition: a term that’s frequent in this document but rare across the corpus is highly discriminative — it’s probably what the document is about.
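To make the two factors concrete, here is a minimal sketch that computes raw-count TF times $\log(N/\mathrm{df})$ for a toy corpus. The function name `tf_idf`, the whitespace tokenizer, and the three example sentences are illustrative assumptions; real libraries typically add smoothing and normalization on top of this.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document, using raw tf * log(N / df)."""
    # Naive tokenizer (an assumption for this sketch): lowercase + whitespace split.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # df[t] = number of documents that contain term t at least once.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw count of each term in this document
        weights.append(
            {t: count * math.log(n_docs / df[t]) for t, count in tf.items()}
        )
    return weights

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the stock market fell",
]
w = tf_idf(docs)
# In the first document, "the" occurs in every document so its weight is 0,
# while "mat" (one document) outweighs "cat" (two documents).
```

Even with this tiny corpus, the term that appears everywhere contributes nothing, and the rarest term in the document carries the most weight.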

Why retrieval moved past it

For ranked retrieval, BM25 replaced TF-IDF in the 1990s and never gave the position back. Two specific fixes:

  1. TF saturation. Plain TF grows linearly: a document mentioning “cat” 100 times scores 100x more than a document mentioning it once. That’s wrong: the second mention is informative, the hundredth is noise. BM25 saturates TF with a factor of the form $\frac{\mathrm{tf}}{\mathrm{tf} + k_1}$, so additional repetitions yield diminishing returns.
  2. Document-length normalization. TF-IDF leaves length handling to the implementation (sometimes none, sometimes ad hoc). BM25 normalizes by document length relative to the corpus average, with a tunable parameter $b$.

BM25 also has a cleaner probabilistic derivation, but the practical wins are TF saturation and length normalization.
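The two fixes can be sketched side by side. The snippet below uses the commonly cited BM25 term factor $\frac{\mathrm{tf}\,(k_1 + 1)}{\mathrm{tf} + k_1\,(1 - b + b \cdot \mathrm{len}/\mathrm{avglen})}$; the function names and the defaults `k1=1.2`, `b=0.75` are assumptions chosen for illustration (they are frequently used values, not something the text above specifies).

```python
def linear_tf(tf):
    """Plain TF: the contribution grows without bound as the count grows."""
    return float(tf)

def bm25_tf(tf, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """BM25-style term factor: saturating in tf, penalized for long documents.

    k1 controls how quickly repetitions saturate; b controls how strongly
    document length (relative to the corpus average) is normalized.
    """
    length_norm = 1.0 - b + b * (doc_len / avg_doc_len)
    return tf * (k1 + 1.0) / (tf + k1 * length_norm)

# Average-length document: one mention vs. one hundred mentions.
once = bm25_tf(1, doc_len=100, avg_doc_len=100)
hundred = bm25_tf(100, doc_len=100, avg_doc_len=100)

# Same count in a document twice the average length scores lower.
long_doc = bm25_tf(3, doc_len=200, avg_doc_len=100)
avg_doc = bm25_tf(3, doc_len=100, avg_doc_len=100)
```

Under these assumptions the saturated factor can never exceed `k1 + 1`, however many times the term repeats, whereas `linear_tf` keeps climbing.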

The log isn’t cosmetic. Without it, IDF would be $N / \mathrm{df}(t)$: for a corpus of a million documents and a term appearing in one, that’s a million-fold weight. One rare token would drown out every other signal in the score.

Taking the log compresses that to $\ln(10^6) \approx 13.8$, which is large but tractable. IDF is also stable as the corpus grows: it depends only on the ratio $N / \mathrm{df}(t)$, so doubling the corpus while keeping document frequency proportional leaves IDF unchanged. The base of the log is conventional ($2$, $e$, or $10$) and doesn’t affect ranking, only the absolute score scale.

Most modern variants use $\log\frac{N - \mathrm{df}(t) + 0.5}{\mathrm{df}(t) + 0.5}$, the “Robertson-Spärck Jones” form, which handles edge cases (term in every document, term in no document) more gracefully and is what BM25 inherits.
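A short numeric sketch of the three variants discussed here: the raw ratio, the plain log form, and the Robertson-Spärck Jones form with its 0.5 offsets. The function names are hypothetical, and the natural log is an arbitrary choice of base.

```python
import math

def raw_ratio(n_docs, df):
    """IDF without the log: N / df."""
    return n_docs / df

def log_idf(n_docs, df):
    """Classical IDF: log(N / df)."""
    return math.log(n_docs / df)

def rsj_idf(n_docs, df):
    """Robertson-Sparck Jones form: log((N - df + 0.5) / (df + 0.5))."""
    return math.log((n_docs - df + 0.5) / (df + 0.5))

# A term that appears in 1 of 1,000,000 documents.
unlogged = raw_ratio(1_000_000, 1)   # a million-fold weight
logged = log_idf(1_000_000, 1)       # compressed to roughly 13.8

# Doubling the corpus with proportional document frequency: same IDF.
before = log_idf(1_000_000, 1_000)
after = log_idf(2_000_000, 2_000)

# Edge cases: the RSJ form stays finite when df == 0, where N / df would
# divide by zero. For a term in every document it goes negative, which
# some implementations clamp or offset.
unseen = rsj_idf(1_000, 0)
everywhere = rsj_idf(1_000, 1_000)
```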

Where TF-IDF still earns its keep

TF-IDF survives outside ranked retrieval because it’s not just a scoring function — it’s a vectorization. A document under TF-IDF becomes a sparse vector indexed by vocabulary, which is exactly what scikit-learn’s TfidfVectorizer produces. With that vector you can:

  • Compute cosine similarity between documents.
  • Feed it into a logistic regression or SVM for classification.
  • Cluster with k-means or DBSCAN.
  • Spot near-duplicates by thresholding cosine.
  • Build a quick topic-model baseline before reaching for LDA or BERTopic.

For all of these tasks, BM25 isn’t a drop-in replacement — BM25 scores a (query, document) pair, not a document on its own. So the modern position is: BM25 for ranked retrieval, TF-IDF for document representation.
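As an illustration of TF-IDF as a document representation, the sketch below builds sparse vectors as plain dictionaries and flags near-duplicates by thresholding cosine similarity. It uses only the standard library rather than scikit-learn; the helper names, the toy corpus, and the `0.5` threshold are assumptions for the example, and a real threshold would need tuning on your own data.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Sparse TF-IDF vectors as {term: weight} dicts (raw tf * log(N / df))."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    df = Counter(t for tokens in tokenized for t in set(tokens))
    return [
        {t: c * math.log(n_docs / df[t]) for t, c in Counter(tokens).items()}
        for tokens in tokenized
    ]

def cosine(u, v):
    """Cosine similarity between two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def near_duplicates(vectors, threshold):
    """Return index pairs (i, j), i < j, whose cosine similarity meets the threshold."""
    return [
        (i, j)
        for i in range(len(vectors))
        for j in range(i + 1, len(vectors))
        if cosine(vectors[i], vectors[j]) >= threshold
    ]

docs = [
    "cheap flights to paris this summer",
    "cheap flights to paris this winter",
    "how to train a neural network",
    "best pasta recipes for dinner",
    "local weather forecast for tomorrow",
]
vecs = tfidf_vectors(docs)
pairs = near_duplicates(vecs, threshold=0.5)  # hypothetical threshold
```

Note that each vector exists independently of any query, which is what makes it usable as input to a classifier or a clustering algorithm. The pairwise loop is quadratic in the number of documents, so this shape only suits small corpora.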

TF-IDF vs dense embeddings

A TF-IDF vector for a typical corpus has hundreds of thousands of dimensions, mostly zeros, because the dimensionality is the vocabulary size. A dense embedding has a few hundred to a few thousand dimensions, all of them populated. The two complement each other: TF-IDF (or lexical retrieval more broadly) catches rare exact tokens; embeddings catch paraphrases. Hence hybrid search.

For a small project where you don’t want infrastructure, TF-IDF + cosine in one numpy call is still the fastest path from “I have documents” to “I can search them.”

Go further

If BM25 is strictly better, why is TF-IDF still around?

It isn't strictly better — it's better for retrieval. TF-IDF as a feature representation (turning a document into a sparse vector for clustering, classification, or similarity) is still standard, because BM25 is a query-time scoring function, not a vectorization. Scikit-learn's TfidfVectorizer is the workhorse there.

What does BM25 actually fix vs raw TF-IDF?

Two things. TF in TF-IDF grows linearly with term repetition — a document mentioning 'cat' 100 times scores 100x more than one mentioning it once, which is wrong. BM25 saturates TF so additional mentions yield diminishing returns. BM25 also adds proper document-length normalization; TF-IDF leaves length as a per-implementation footnote.

When is a TF-IDF cosine search the right tool?

Small corpora (under ~100k docs), no infrastructure, and a content-similarity task — finding near-duplicates, deduplication, simple topic clustering. The whole pipeline is one numpy call. Past that scale you want an inverted index plus BM25.
