harrier-27b: Can 27B Parameters Beat zembed-1?

Apr 8, 2026
TL;DR
  • zembed-1 retains the #1 overall embedding model position, outperforming harrier-27b on average NDCG@10 (0.701 vs 0.699) and Recall@100 (0.750 vs 0.728)
  • On a per-dataset basis, zembed-1 wins 14 out of 24 datasets against harrier-27b on NDCG@10
  • voyage-4 and harrier-27b are neck-and-neck for the #2 spot — voyage-4 edges it out 12–11 on dataset wins
  • The Harrier family scales well internally (270M → 0.6B → 27B), but even the largest variant doesn’t close the gap to zembed-1
  • Explore the full interactive dashboard →

zembed-1 vs Harrier

A New Challenger, Evaluated Properly

Harrier is a recently released family of open-weight embedding models from Microsoft (finetuned Gemma and Qwen models), spanning three sizes: 270M, 0.6B, and 27B parameters. The largest variant — harrier-27b — has generated well-deserved attention. On binary MTEB, it ranked first among embedding models at the time of release.

But as we explored in Beyond Binary, MTEB has a discrimination problem: given its (overwhelmingly) binary annotations, it can’t tell the difference between a document which perfectly answers a query and one which may only tangentially address it. So we ran all three Harrier models through the same graded evaluation pipeline we use for our evals dashboard — 24 diverse datasets, three independent LLM judges, continuous relevance scores from 0 to 10.
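The difference between binary and graded labels matters because NDCG weights each result by its relevance. A minimal sketch of NDCG@10 over continuous 0-10 judge scores (illustrative only, not the dashboard's exact implementation):

```python
import math

def dcg_at_k(relevances, k=10):
    # Discounted cumulative gain: graded relevance discounted by log2(rank + 1).
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, k=10):
    # Normalize by the DCG of the ideal (descending) ordering.
    ideal_dcg = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Graded scores can penalize ranking a tangential document (4) above a
# perfect one (7); binary labels see both simply as "relevant" (1).
graded = [10, 4, 0, 7]   # judge scores for a retrieved ranking
binary = [1, 1, 0, 1]    # the same ranking under binary annotation
```

With graded labels, the imperfect ordering above scores below 1.0; collapse the scores to binary and the metric loses most of its ability to distinguish orderings.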

The question is not whether harrier-27b is a good model. It is. (And at 27 billion parameters and a whopping 5,376-dimensional output, we would certainly hope so.) But is it the best?

The Three-Model Problem: Graded Evals

Embed-only Recall@K, averaged across 24 datasets and 3 judges.

On the global average across all 24 evaluation datasets, there are three embedding models which markedly outperform the rest:

Model         NDCG@10   Recall@10   Recall@100
zembed-1      0.701     0.454       0.750
voyage-4      0.699     0.457       0.731
harrier-27b   0.699     0.456       0.728

Below those, qwen3-4b, cohere-embed-v4, jina-v5-text-small, and openai-v3-large (in that order) form a second-tier cluster. But if you need top-tier accuracy, your choice lies within that trio.

So what separates them? On NDCG@10, very little: less than 0.25% across the trio (though harrier-27b still comes out last). But NDCG@10 is not the whole story.

On Recall@100 — the metric that determines whether a relevant document even makes it to your reranker — zembed-1 leads by +1.9 points over voyage-4, and +2.2 over harrier-27b.

That is where the separation becomes real. A reranker or other downstream system can reorder or rework whatever the embedding model surfaces, but it cannot conjure up a document the embedder failed to retrieve. zembed-1’s recall advantage compounds downstream: fewer relevant documents lost at the first stage means a strictly better candidate set for everything that follows.
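This bound is easy to demonstrate: a reranker only permutes the candidate set, so end-to-end recall can never exceed first-stage Recall@100. A toy sketch with hypothetical document IDs:

```python
def recall_at_k(retrieved, relevant, k):
    # Fraction of relevant documents present in the top-k retrieved list.
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

# Four relevant documents; the embedder surfaces only two of them in its top 100.
relevant = {"d3", "d17", "d42", "d99"}
first_stage = ["d3", "d17"] + [f"x{i}" for i in range(98)]  # d42, d99 never retrieved

# Even a "perfect" reranker (relevant docs sorted to the front) cannot
# recover documents the embedder failed to retrieve.
reranked = sorted(first_stage, key=lambda d: d in relevant, reverse=True)

print(recall_at_k(first_stage, relevant, 100))  # 0.5
print(recall_at_k(reranked, relevant, 10))      # 0.5, capped by first-stage recall
```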

Head-to-Head: zembed-1 vs harrier-27b

Averages, of course, can obscure as much as they reveal. So let us go dataset by dataset. Our evals dashboard covers 24 datasets drawn from three MTEB task categories — retrieval, reranking, and instruction retrieval — spanning legal (AILAStatutes, LegalBench), medical (CovidRetrieval, TRECCOVID), multilingual (MIRACL, MLQA, Belebele, WikipediaRetrieval), and technical domains (StackOverflowQA, SCIDOCS), among others.

Across 24 evaluation datasets, zembed-1 outperforms harrier-27b on NDCG@10 on 14 of them.

The pattern of where each model wins is telling. zembed-1 dominates on instruction retrieval (Core17, News21, Robust04 — tasks which require parsing nuanced query intent, not merely matching keywords), medical and legal domains (CovidRetrieval, LegalBench, TRECCOVID), and technology (StackOverflowQA). harrier-27b, for its part, shows strength on multilingual reranking and a handful of niche benchmarks (RuBQReranking, Russian paragraph reranking; and TwitterHjerne, Danish Twitter retrieval).

Dataset                           zembed-1   harrier-27b   Delta
Core17InstructionRetrieval        0.899      0.837         +6.2
Robust04InstructionRetrieval      0.857      0.788         +6.9
TRECCOVID                         0.922      0.871         +5.1
News21InstructionRetrieval        0.919      0.910         +0.8
LEMBPasskeyRetrieval              0.891      0.825         +6.6
CovidRetrieval                    0.820      0.796         +2.3
AlloprofReranking                 0.851      0.832         +1.9
LegalBenchCorporateLobbying       0.875      0.860         +1.5
StackOverflowQA                   0.695      0.651         +4.4
T2Reranking                       0.804      0.794         +1.0
MIRACLRetrievalHardNegatives      0.531      0.526         +0.5
MLQARetrieval                     0.034      0.029         +0.5
WikipediaRetrievalMultilingual    0.778      0.774         +0.5
VoyageMMarcoReranking             0.732      0.739         -0.7
StatcanDialogueDatasetRetrieval   0.723      0.742         -1.9
TwitterHjerneRetrieval            0.694      0.775         -8.1
SCIDOCS                           0.540      0.623         -8.3
RuBQReranking                     0.736      0.801         -6.5
AILAStatutes                      0.700      0.740         -4.0
WikipediaRerankingMultilingual    0.596      0.626         -3.0
ArguAna                           0.564      0.566         -0.3
BelebeleRetrieval                 0.073      0.073         -0.0
HagridRetrieval                   0.897      0.899         -0.2

The Race for Second Place

As we established in our previous head-to-head, voyage-4 has been the reigning #2 embedding model. With harrier-27b now in the picture, that position is genuinely contested:

                              voyage-4   harrier-27b
Average NDCG@10               0.699      0.699
Average Recall@100            0.731      0.728
Dataset wins (head-to-head)   12         11

It is remarkably close. voyage-4 holds its edge by a single dataset win and a slight recall advantage. The two models trade blows across verticals, and depending on yours, either could be the better runner-up. (Neither, however, threatens first place.)

Harrier’s Scaling Story

One genuinely interesting aspect of the Harrier family is its range of sizes. The scaling is clean — and instructive:

Model          Params   Avg NDCG@10   Avg Recall@100
harrier-270m   270M     0.619         0.658
harrier-0.6b   600M     0.650         0.691
harrier-27b    27B      0.699         0.728

A +3.1-point NDCG jump from 270M to 0.6B, then +4.9 points from 0.6B to 27B. Returns to scale are not completely diminishing: the largest absolute improvement comes at the largest size. Credit where credit is due: this is a decently executed scaling curve, particularly for the 0.6B model, which performs on par with or better than Cohere's flagship embed-v4.
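The step sizes are simply differences of the table's averages, in points (score × 100). A quick check:

```python
# Average NDCG@10 per Harrier size, from the table above.
ndcg = {"harrier-270m": 0.619, "harrier-0.6b": 0.650, "harrier-27b": 0.699}

sizes = list(ndcg)
steps = {
    f"{a} -> {b}": round((ndcg[b] - ndcg[a]) * 100, 1)
    for a, b in zip(sizes, sizes[1:])
}
print(steps)  # {'harrier-270m -> harrier-0.6b': 3.1, 'harrier-0.6b -> harrier-27b': 4.9}
```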

What Does Each Size Buy You?
  • harrier-270m (0.619) outperforms bge-m3 (0.580) and openai-v3-small (0.588) — entirely respectable for a 270M-parameter model
  • harrier-0.6b (0.650) is competitive with cohere-embed-v4 (0.652)
  • harrier-27b (0.699) enters the top three — but requires 27 billion parameters and 5,376-dimensional output vectors to get there, compared to zembed-1’s 4 billion parameters and 2,560 dimensions

The contrast in size between harrier-27b and all other models bears emphasis: 27 billion parameters is absolutely massive for an embedding model, and that’s not a compliment.

zembed-1 achieves its #1 ranking with 4 billion parameters and a 2,560-dimensional output. harrier-27b needs nearly 7x the parameter count and more than 2x the vector dimensionality to land 0.2 points behind on NDCG@10. In a production setting — where embedding compute, storage costs, and index size are real constraints — the efficiency gap is hardly academic. Would you pay for a model with roughly 7x the inference cost, likely far higher latency, whose embeddings are twice as costly to store, just to get worse results?
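The storage side of that gap is easy to quantify: raw vector storage scales linearly with dimensionality. An illustrative calculation for a hypothetical 10M-document corpus at float32, ignoring index overhead and compression:

```python
def index_size_gb(num_docs, dims, bytes_per_float=4):
    # Raw float32 vector storage in GB (no index overhead, no quantization).
    return num_docs * dims * bytes_per_float / 1e9

corpus = 10_000_000  # hypothetical 10M-document corpus
print(index_size_gb(corpus, 2560))  # zembed-1:    ~102.4 GB
print(index_size_gb(corpus, 5376))  # harrier-27b: ~215.0 GB
```

Quantization and Matryoshka-style truncation narrow the absolute numbers, but the roughly 2.1x ratio between the two models persists at any precision.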

We wouldn’t.

What This Means

harrier-27b is a legitimate top-three embedding model — quite possibly the strongest new entrant we have seen since voyage-4. It is genuinely competitive, especially on multilingual reranking tasks, and we expect Microsoft will continue to iterate on the family.

But the leaderboard has not changed:

zembed-1 leads on average NDCG@10, wins 14 of 24 datasets head-to-head against harrier-27b, and holds the highest Recall@100 of any embedding model — at 1/7th the parameter count and half the vector dimensionality.

For the full interactive breakdown across all models, datasets, metrics, and reranker combinations, explore the evaluation dashboard.

Get Started

zembed-1 is available today through multiple deployment options:

```python
from zeroentropy import ZeroEntropy

zclient = ZeroEntropy()
response = zclient.models.embed(
    model="zembed-1",
    input_type="query",       # "query" or "document"
    input="What is retrieval augmented generation?",  # string or list[str]
    dimensions=2560,          # optional: must be one of [2560, 1280, 640, 320, 160, 80, 40]
    encoding_format="float",  # "float" or "base64"
    latency="fast",           # "fast" or "slow"; omit for auto
)
```
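Once you have query and document vectors, first-stage retrieval typically ranks documents by cosine similarity. A minimal pure-Python sketch (the vectors and document names here are toy values, not real zembed-1 outputs):

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

query_vec = [0.1, 0.3, 0.5]        # would come from input_type="query"
doc_vecs = {                       # would come from input_type="document"
    "rag_intro": [0.1, 0.3, 0.5],
    "unrelated": [0.9, -0.2, 0.1],
}
best = max(doc_vecs, key=lambda d: cosine(query_vec, doc_vecs[d]))
print(best)  # rag_intro
```

At production scale you would use a vector index rather than a linear scan, but the scoring function is the same.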

Documentation: docs.zeroentropy.dev

HuggingFace: huggingface.co/zeroentropy

Get in touch: Discord community or contact@zeroentropy.dev

Talk to us if you need a custom deployment, volume pricing, or want to see how zembed-1 + zerank-2 performs on your data.
