Also known as: SParse Lexical AnD Expansion, learned sparse retrieval, neural sparse retrieval
TL;DR
SPLADE (SParse Lexical AnD Expansion) is a learned sparse retrieval model: a transformer produces a sparse term-weight vector over the BERT vocabulary for each query and document, scored by dot product on an inverted index.
SPLADE (SParse Lexical AnD Expansion) is a learned sparse retrieval model that produces a sparse term-weight vector for each query and document, scored by inner product on an inverted index. The output dimensionality is the BERT WordPiece vocabulary, around 30,000 dimensions, but each query or document activates only a few hundred of them with non-zero weights. The architecture sits between BM25 and dense retrieval: the weights are learned end-to-end like a neural model, but the representation is sparse and indexable like classical lexical search.
How the representation is built
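SPLADE uses a BERT encoder with the masked-language-modeling head intact. For an input sequence, at each token position the MLM head produces a logit for every vocabulary entry: a score for which word the model would predict there. Writing z_{ij} for the logit at token position i and vocabulary entry j, SPLADE turns that into a non-negative term-weight signal:
$$ w_{ij} = \log\bigl(1 + \mathrm{ReLU}(z_{ij})\bigr) $$
then aggregates across token positions by max-pooling:
$$ w_j = \max_{i} w_{ij} $$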
The result is one weight per vocabulary entry. The log(1+ReLU(.)) shape is critical: ReLU enforces non-negative weights (so they compose like classical TF-IDF terms), and the log saturates large values. Sparsity is enforced explicitly through a regularizer on the weights during training: an L1 penalty in the simplest form, and the FLOPS regularizer in the published SPLADE models. SPLADE-v2 produces vectors with roughly 50-200 non-zero entries per query or document.
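The pooling step above can be sketched in a few lines of NumPy. This is a minimal illustration, not the reference implementation: the function name `splade_vector`, the toy logits, and the 5-entry vocabulary are all assumptions made for the example, and a real model would take its logits from a BERT MLM head.

```python
import numpy as np

def splade_vector(logits, attention_mask):
    """Turn MLM logits into one weight per vocabulary entry.

    logits:         array of shape (seq_len, vocab_size)
    attention_mask: array of shape (seq_len,), 1 for real tokens, 0 for padding
    """
    # log(1 + ReLU(z)): non-negative, saturating term weights per position
    weights = np.log1p(np.maximum(logits, 0.0))
    # Zero out padding positions so they cannot win the max
    weights = weights * attention_mask[:, None]
    # Max-pool across token positions
    return weights.max(axis=0)

# Toy input: 3 positions (the last one is padding), hypothetical vocabulary of 5
logits = np.array([
    [ 2.0, -1.0,  0.0, -3.0, 0.5],
    [-0.5,  3.0, -2.0, -1.0, 0.1],
    [ 9.0,  9.0,  9.0,  9.0, 9.0],  # padding row, removed by the mask
])
mask = np.array([1, 1, 0])

vec = splade_vector(logits, mask)
# Keep only the non-zero dimensions, as an inverted index would
nonzero = {int(j): float(vec[j]) for j in np.flatnonzero(vec)}
print(nonzero)
```

Entries whose logits are negative at every real position end up exactly zero, which is what makes the vector storable as a short weighted term list.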
This is the ‘expansion’ in SParse Lexical AnD Expansion, and it’s the key advantage over BM25. Because the MLM head was trained to predict any word that fits a context, when the input mentions ‘feline’ the head produces high logits for related vocabulary entries like ‘cat’, ‘kitten’, ‘pet’. Those logits become non-zero weights in the SPLADE vector. A query for ‘cat’ can therefore match a document that only said ‘feline’, which BM25 would miss entirely.
The expansion happens implicitly through the MLM head’s prior. There’s no explicit synonym list. The model learns that ‘feline’ and ‘cat’ co-occur in similar contexts during pretraining, and the SPLADE training objective preserves that signal in the sparse output.
The trade-off: more non-zeros buys recall but costs index size and retrieval latency. SPLADE’s sparsity hyperparameter (the coefficient on the sparsity regularizer) is the lever for that curve.
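To make the lever concrete, here is a hedged sketch of how a sparsity penalty enters the training loss. It shows a plain L1 penalty and, for comparison, the FLOPS regularizer used in the published SPLADE models. The function names, the toy batch, and the coefficient value are assumptions for illustration; the ranking loss is a stand-in number, not a real contrastive loss.

```python
import numpy as np

def l1_penalty(batch_weights):
    """Mean over the batch of each vector's summed absolute weights."""
    return float(np.abs(batch_weights).sum(axis=1).mean())

def flops_penalty(batch_weights):
    """Sum over vocabulary entries of the squared mean activation in the batch.

    Penalizes terms that are active in many vectors, which keeps posting
    lists short as well as keeping individual vectors sparse.
    """
    return float((batch_weights.mean(axis=0) ** 2).sum())

def regularized_loss(ranking_loss, batch_weights, coeff, penalty=l1_penalty):
    """Total loss = ranking loss + coeff * sparsity penalty."""
    return ranking_loss + coeff * penalty(batch_weights)

# Toy batch: 2 SPLADE vectors over a hypothetical 3-entry vocabulary
batch = np.array([
    [1.0, 0.0, 2.0],
    [0.0, 0.0, 2.0],
])

loss_l1 = regularized_loss(1.0, batch, coeff=0.1)
loss_flops = regularized_loss(1.0, batch, coeff=0.1, penalty=flops_penalty)
```

Raising `coeff` pushes more weights to zero, trading recall for a smaller index and shorter posting lists.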
Why production teams care
The practical pitch is operational compatibility. A search team running Elasticsearch already has an inverted index serving BM25 in production with low, well-understood latency. SPLADE outputs are weighted term lists, exactly the shape an inverted index handles natively. You can deploy SPLADE by writing learned term-weight payloads into the existing index, querying with the same infrastructure, and getting neural-quality retrieval without spinning up a new vector database.
What infra it works with
Elasticsearch / OpenSearch: the rank_features field type stores SPLADE term weights as a map of term to weight. A query is built from rank_feature clauses, one per non-zero query term, whose weighted contributions are summed.
Vespa: tensor field with sparse semantics. First-class support for learned sparse models.
Lucene / Tantivy: direct support via custom term frequencies on a custom analyzer.
Pyserini / Anserini: research-friendly Lucene wrappers used in most SPLADE papers.
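For the Elasticsearch option, the request bodies might look like the following sketch, built here as plain Python dictionaries so nothing depends on a client library. The field name `splade_terms`, the helper names, and the example weights are hypothetical. The use of the `linear` function with a per-clause `boost` to get query weight times document weight is an assumption about how to express the dot product, so check it against the documentation for your Elasticsearch or OpenSearch version.

```python
import json

FIELD = "splade_terms"  # hypothetical field name, not a built-in

def build_mapping():
    """Index mapping with one rank_features field for the term weights."""
    return {"mappings": {"properties": {FIELD: {"type": "rank_features"}}}}

def build_document(text, term_weights):
    """Document body; rank_features accepts only positive values."""
    positive = {t: w for t, w in term_weights.items() if w > 0}
    return {"text": text, FIELD: positive}

def build_query(query_weights):
    """One rank_feature clause per non-zero query term, combined with should.

    Assumption: 'linear' scores by the stored value and 'boost' multiplies it,
    so the summed clauses approximate the sparse dot product.
    """
    clauses = [
        {"rank_feature": {"field": f"{FIELD}.{term}", "linear": {}, "boost": weight}}
        for term, weight in query_weights.items()
        if weight > 0
    ]
    return {"query": {"bool": {"should": clauses}}}

doc = build_document("Felines are popular pets.", {"feline": 1.2, "cat": 0.9, "dog": 0.0})
query = build_query({"cat": 1.5, "pet": 0.0})
print(json.dumps(query, indent=2))
```

The dictionaries can be passed to whichever HTTP or client call a team already uses for BM25 queries.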
How it scores
The retrieval score for a (query, document) pair is the dot product of their SPLADE vectors, executed efficiently on an inverted index: for each non-zero query term, look up the posting list, accumulate query_weight[term] * doc_weight[term] per document. Because both vectors are sparse, the total work is proportional to the number of non-zero query terms times the average posting list length — very similar to classic lexical retrieval.
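The posting-list accumulation described above can be sketched with plain dictionaries. This is a toy, in-memory illustration under assumed names (`build_index`, `score`) and made-up weights; a production engine adds compression, skipping, and top-k pruning on top of the same idea.

```python
from collections import defaultdict

def build_index(doc_vectors):
    """Invert {doc_id: {term: weight}} into {term: [(doc_id, weight), ...]}."""
    index = defaultdict(list)
    for doc_id, vector in doc_vectors.items():
        for term, weight in vector.items():
            if weight > 0:  # only non-zero dimensions get a posting
                index[term].append((doc_id, weight))
    return index

def score(index, query_vector):
    """Accumulate query_weight * doc_weight over each query term's postings."""
    scores = defaultdict(float)
    for term, query_weight in query_vector.items():
        for doc_id, doc_weight in index.get(term, ()):
            scores[doc_id] += query_weight * doc_weight
    # Highest dot product first
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical SPLADE vectors, already reduced to their non-zero terms
docs = {
    "d1": {"feline": 1.2, "cat": 0.9, "pet": 0.4},
    "d2": {"dog": 1.1, "pet": 0.7},
}
index = build_index(docs)
ranked = score(index, {"cat": 1.0, "pet": 0.5})
print(ranked)
```

Only the posting lists for "cat" and "pet" are touched, so the work scales with the number of non-zero query terms and the length of their lists, not with the vocabulary size.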
Scoring complexity at retrieval time is therefore the same shape as BM25. The cost moves to the encoding step: every query and every document must be encoded by the BERT model, which is millisecond-scale per query but a meaningful corpus-wide expense at index time.
Where SPLADE wins and where it doesn’t
Where it loses: in-domain accuracy against a top-tier dense bi-encoder. Modern dense models with hard-negative mining and instruction tuning typically edge out SPLADE on the dataset they were trained on. SPLADE also struggles with very short queries (limited expansion signal) and with concepts that don’t have clean vocabulary representations (cross-lingual, code).
SPLADE is the upgrade path from BM25 when you want neural-quality retrieval but are operationally committed to inverted-index infrastructure, and especially when out-of-domain robustness matters. In pipelines already running a vector database at scale, hybrid search of BM25 plus dense plus a reranker usually beats SPLADE alone.
Go further
How does SPLADE actually produce a sparse vector?
Run a BERT-style encoder. For each token position, take the masked-language-model head's logits over the vocabulary. Apply ReLU and log(1+x) to enforce non-negativity and saturation. Max-pool across token positions. The result is a |V|-dim vector (~30K dims) where most entries are zero and the non-zero ones are learned term weights — including weights on tokens not present in the input (the 'expansion' part).
Why is SPLADE compatible with existing search infrastructure?
Because the output is sparse term weights over the same vocabulary an inverted index already handles. Elasticsearch, Vespa, Lucene-based systems can serve SPLADE scores by treating each non-zero dimension as a term with a weight. No new index type required; the operational story is BM25's. This is the deciding advantage for teams already running lexical search.
Out-of-domain generalization is SPLADE's strongest pitch. On BEIR, SPLADE-v2 outperforms most dense models on domains it wasn't trained on (legal, biomedical, news) because lexical signals are robust where dense embeddings haven't seen the vocabulary. On in-domain benchmarks, well-trained dense bi-encoders typically still win narrowly.