- zembed-1 scored 0.946 NDCG@10 on MSMARCO — the highest of all 16 models evaluated
- Leads every single domain benchmark: finance, healthcare, legal, conversational, manufacturing, code, and STEM
- Averages 0.5561 across domains — +10% over voyage-4-nano and +17.6% over OpenAI
- 32,768-token context window eliminates chunking artifacts for long enterprise documents
- Flexible quantization from float32 to binary (<128 bytes per vector) for enterprise-scale corpora
zembed-1: The Foundation for Enterprise RAG
Retrieval-Augmented Generation has moved from research curiosity to enterprise standard in record time. Every major enterprise AI initiative now involves some form of RAG — the pattern of retrieving relevant documents from a knowledge base before generating an answer. And the dirty truth of RAG is that no amount of prompt engineering or model fine-tuning compensates for poor retrieval. If the embedding model doesn’t surface the right documents, the language model doesn’t have what it needs to answer well.
zembed-1 by ZeroEntropy is the embedding model that enterprise RAG practitioners have been waiting for. It has achieved the highest NDCG@10 score across all 16 models on the MSMARCO benchmark — the closest available proxy to real RAG workloads — and it leads every domain-specific benchmark tested.
The Retrieval Quality Problem in Enterprise RAG
Enterprise knowledge corpora are messy. They include documents written over many years, by many people, in many styles, covering many domains. A typical enterprise knowledge base might contain:
- HR policy documents written in formal legal-adjacent prose
- Engineering documentation with technical jargon and structured specifications
- Sales collateral in persuasive marketing language
- Customer communications in casual, conversational tone
- Financial reports with dense numerical content and regulatory language
- IT documentation mixing technical commands with explanatory prose
A single RAG system often needs to serve queries against all of these simultaneously. The embedding model must understand what makes a document relevant across all these different writing styles and content types — not just one specialty.
The Numbers: zembed-1 on MSMARCO and Domain Benchmarks
MSMARCO Benchmark (Standard IR and RAG Evaluation)
zembed-1 achieved 0.946 NDCG@10 on MSMARCO, the highest score across all 16 models evaluated. MSMARCO is specifically designed to replicate the diversity of real-world search and retrieval workloads — making it the gold-standard proxy for RAG retrieval quality.
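For context, NDCG@10 measures how well the top ten retrieved documents are ordered relative to an ideal ranking: 1.0 means every relevant document is ranked as high as possible. A minimal sketch of the metric (using linear gain; some evaluations use exponential gain instead):

```python
import math

def dcg_at_k(relevances, k=10):
    # Discounted cumulative gain: relevance discounted by log2 of rank
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k=10):
    # Normalize by the DCG of the ideal (descending-relevance) ordering
    ideal_dcg = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# A perfect ranking scores 1.0; a relevant doc pushed down scores less
print(ndcg_at_k([1, 1, 0, 0]))  # 1.0
print(ndcg_at_k([0, 1, 1, 0]))  # < 1.0
```

The metric rewards putting relevant documents near the top, which is exactly what matters for RAG: the language model only sees the top-k results.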
Domain-Specific Performance
| Domain | zembed-1 | voyage-4-nano | Cohere Embed v4 | OpenAI Large |
|---|---|---|---|---|
| Finance | 0.4476 | 0.4227 | 0.3670 | 0.3291 |
| Healthcare | 0.6260 | 0.5356 | 0.4750 | 0.5315 |
| Legal | 0.6723 | 0.5957 | 0.5894 | 0.5099 |
| Conversational | 0.5385 | 0.4045 | 0.4244 | 0.3988 |
| Manufacturing | 0.5556 | 0.4857 | 0.4919 | 0.4736 |
| Code | 0.6452 | 0.6415 | 0.6277 | 0.6155 |
| STEM & Math | 0.5283 | 0.5012 | 0.4698 | 0.3905 |
| Average | 0.5561 | 0.5050 | 0.4957 | 0.4727 |
zembed-1 leads every single domain. No cherry-picking, no trade-offs. It’s the first embedding model to achieve consistent best-in-class performance across all domains simultaneously — exactly what enterprise RAG deployments require.
What Makes zembed-1 the Right Foundation for Enterprise RAG
No Domain Compromises
Enterprise applications can’t afford a model that’s excellent at some content types and mediocre at others. An employee of a financial services firm might ask about HR policy one moment and a regulatory requirement the next. A healthcare company’s knowledge base spans clinical guidelines, compliance documentation, and IT procedures.
zembed-1’s universal domain leadership means you can build one RAG system with one embedding model that serves the full breadth of enterprise content.
The zELO Methodology: Training on True Relevance
zembed-1 is distilled from zerank-2 — ZeroEntropy’s state-of-the-art reranker — using the zELO methodology, which models relevance scores as Elo ratings from pairwise document competitions. This trains zembed-1 to understand genuine relevance rather than surface-level textual overlap. The result is retrieval that finds what the user needs, even when the vocabulary doesn’t exactly match.
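ZeroEntropy's training pipeline is not public in detail, so the following is only a toy illustration, with hypothetical starting ratings and the standard logistic Elo update rule, of how pairwise "document A beats document B" judgments can be turned into graded per-document relevance scores:

```python
def elo_update(rating_a, rating_b, a_wins, k=32.0):
    # Standard logistic Elo: expected score of A given the rating gap
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    score_a = 1.0 if a_wins else 0.0
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b

# Two documents start equal; repeated "A is more relevant" judgments
# separate their ratings into a graded relevance signal
ra, rb = 1000.0, 1000.0
for _ in range(10):
    ra, rb = elo_update(ra, rb, a_wins=True)
print(ra > rb)  # True
```

The appeal of Elo-style scores is that they encode strength of preference, not just binary relevance, giving the distilled embedding model a richer training signal.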
32k Token Context: Real Documents, Not Artificial Chunks
One of the most underappreciated problems in enterprise RAG is chunking. Long documents need to be broken into pieces for embedding, and most models’ context limits force very small chunks that lose document context and degrade retrieval quality.
zembed-1’s 32,768-token context window allows entire sections of policy documents, full chapters of technical manuals, or complete financial reports to be embedded as coherent units. This preserves the logical structure and cross-reference relationships within documents — and produces dramatically better retrieval for queries that require understanding document-level context.
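In practice this means a chunker can pack whole sections into one embedding unit instead of splitting mid-paragraph. A rough sketch of greedy section packing; the 1.3 tokens-per-word estimate is a crude heuristic, and a real pipeline would use the model's tokenizer:

```python
def pack_sections(sections, max_tokens=32768):
    # Greedily pack whole sections into embedding units that fit the
    # 32,768-token context window; oversized sections pass through alone.
    def est_tokens(text):
        return len(text.split()) * 1.3  # crude heuristic, not a real tokenizer

    chunks, current, used = [], [], 0.0
    for section in sections:
        t = est_tokens(section)
        if current and used + t > max_tokens:
            chunks.append("\n\n".join(current))
            current, used = [], 0.0
        current.append(section)
        used += t
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Each returned chunk is then embedded as one coherent unit, preserving the document-level context described above.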
Flexible Quantization for Enterprise Scale
Enterprise knowledge bases are large — often millions of documents spanning years of organizational history. zembed-1’s quantization flexibility makes this tractable:
| Quantization | Storage per vector | Compression | Accuracy impact |
|---|---|---|---|
| float32 | 8 KB | 1x | Baseline |
| int8 | 2 KB | 4x | Minimal |
| binary | <128 bytes | >32x | Controlled, predictable |
A corpus of 5 million documents at ~8 KB per float32 vector requires roughly 40 GB of vector storage. With binary quantization, that drops to under 640 MB, which fits comfortably on standard enterprise infrastructure.
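Binary vectors are also fast to search: similarity reduces to Hamming distance, computed with XOR and popcount. A minimal sketch, assuming a hypothetical reduced-dimension configuration of 640-bit binary vectors (80 bytes each):

```python
import numpy as np

def hamming_search(query_bits, corpus_bits, top_k=5):
    # query_bits: (num_bytes,) uint8; corpus_bits: (N, num_bytes) uint8
    # XOR then popcount yields the Hamming distance to every document
    xor = np.bitwise_xor(corpus_bits, query_bits)
    dists = np.unpackbits(xor, axis=1).sum(axis=1)
    return np.argsort(dists)[:top_k]

rng = np.random.default_rng(0)
corpus = rng.integers(0, 256, size=(1000, 80), dtype=np.uint8)  # 640-bit vectors
query = corpus[42].copy()
print(hamming_search(query, corpus)[0])  # 42: the identical vector wins
```

Because XOR and popcount are single machine instructions, binary search can scan millions of vectors per second on a CPU, a common pattern for a cheap first-pass filter before full-precision rescoring.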
Open Weight: Full Data Control
For enterprise deployments, data sovereignty matters. zembed-1 is available as an open-weight model on HuggingFace, allowing full on-premises deployment with no external API dependencies. Your documents never leave your infrastructure.
Architecting Enterprise RAG with zembed-1
Ingestion Pipeline
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer(
    "zeroentropy/zembed-1",
    trust_remote_code=True,
    model_kwargs={"torch_dtype": "bfloat16"},
)

# Embed your document corpus
documents = load_enterprise_documents()  # Your document loader
embeddings = model.encode_document(documents, batch_size=32, show_progress_bar=True)

# Store in your vector database (Pinecone, Weaviate, Qdrant, pgvector, etc.)
vector_store.upsert(documents, embeddings)
```
Query Pipeline
```python
def rag_retrieve(user_query: str, top_k: int = 5):
    query_embedding = model.encode_query(user_query)
    results = vector_store.search(query_embedding, top_k=top_k)
    return results

# Example cross-domain enterprise query
results = rag_retrieve(
    "What is the process for reporting a workplace safety incident "
    "and what are the regulatory notification requirements?"
)
# zembed-1 retrieves both the HR safety procedure AND the relevant
# OSHA notification requirements
```
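The `vector_store` object in these pipelines is a placeholder for whatever vector database you deploy. For local testing, a minimal in-memory stand-in with exact cosine-similarity search might look like this (assuming embeddings arrive as a 2-D NumPy array):

```python
import numpy as np

class InMemoryVectorStore:
    # Toy stand-in for a real vector database (Pinecone, Qdrant, pgvector, ...)
    def __init__(self):
        self.docs = []
        self.vectors = None

    def upsert(self, documents, embeddings):
        self.docs = list(documents)
        # Normalize rows so a dot product equals cosine similarity
        self.vectors = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)

    def search(self, query_embedding, top_k=5):
        q = query_embedding / np.linalg.norm(query_embedding)
        scores = self.vectors @ q
        top = np.argsort(-scores)[:top_k]
        return [(self.docs[i], float(scores[i])) for i in top]

store = InMemoryVectorStore()
store.upsert(["hr policy", "osha rules", "sales deck"],
             np.array([[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]))
print(store.search(np.array([1.0, 0.1]), top_k=1))  # [('hr policy', ...)]
```

A production deployment would swap this for an approximate-nearest-neighbor index, but the interface (`upsert`, `search`) stays the same.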
Quantization Pipeline
Shrink your vector index for cost-effective enterprise scale:
```python
import numpy as np
from sentence_transformers.quantization import quantize_embeddings

# Full precision float32
full_embeddings = model.encode_document(documents, batch_size=32)

# Option 1: int8 — 4x compression, negligible accuracy loss
int8_embeddings = quantize_embeddings(full_embeddings, precision="int8")

# Option 2: binary — 32x compression, controlled accuracy trade-off
binary_embeddings = quantize_embeddings(full_embeddings, precision="ubinary")

# Option 3: reduced dimensions (640 instead of 2560) + int8 = ~8x total
# compression. Keep the leading dimensions and re-normalize before
# quantizing (valid for Matryoshka-style embeddings, which the model's
# dimensions option implies).
small_embeddings = full_embeddings[:, :640]
small_embeddings = small_embeddings / np.linalg.norm(
    small_embeddings, axis=1, keepdims=True
)
small_int8 = quantize_embeddings(small_embeddings, precision="int8")

# Report actual index sizes for the embedded corpus
storage_summary = {
    "Full float32 (2560d)": f"{full_embeddings.nbytes / 1e9:.2f} GB",
    "int8 (2560d)": f"{int8_embeddings.nbytes / 1e9:.2f} GB",
    "binary (2560d)": f"{binary_embeddings.nbytes / 1e6:.0f} MB",
}
for label, size in storage_summary.items():
    print(f"  {label}: {size}")
```
What Enterprise AI Teams Are Saying
“We evaluated eight embedding models for our knowledge platform. zembed-1 was the clear winner on our internal benchmark — and the open-weight availability sealed it. Our data doesn’t leave our infrastructure.” — CTO, enterprise software company
zembed-1 in the Enterprise AI Stack
zembed-1 is available through multiple channels suited for enterprise deployment:
- HuggingFace (open-weight, CC-BY-NC-4.0): For non-commercial and research use, self-hosted deployments
- ZeroEntropy API: Managed API service, currently 50% off document embeddings until June 1st — ideal for evaluating at scale before committing to infrastructure
- AWS Marketplace: For AWS-native enterprise deployments
For commercial use of the open-weight model, contact ZeroEntropy at contact@zeroentropy.dev.
The Bottom Line for Enterprise AI Teams
If you’re choosing an embedding model for your RAG infrastructure, the decision framework is straightforward: zembed-1 leads every domain benchmark, leads the MSMARCO standard retrieval benchmark, supports the longest context window of any competitive model, offers the most flexible quantization options, and is available for self-hosted deployment.
There is no longer a trade-off between RAG retrieval quality and operational flexibility. zembed-1 delivers both.
Get Started
zembed-1 is available today through multiple deployment options:
```python
from zeroentropy import ZeroEntropy

zclient = ZeroEntropy()
response = zclient.models.embed(
    model="zembed-1",
    input_type="query",  # "query" or "document"
    input="What is retrieval augmented generation?",  # string or list[str]
    dimensions=2560,  # optional: must be one of [2560, 1280, 640, 320, 160, 80, 40]
    encoding_format="float",  # "float" or "base64"
    latency="fast",  # "fast" or "slow"; omit for auto
)
```
Documentation: docs.zeroentropy.dev
HuggingFace: huggingface.co/zeroentropy
Get in touch: Discord community or contact@zeroentropy.dev
Talk to us if you need a custom deployment, volume pricing, or want to see how zembed-1 + zerank-2 performs on your data.
