F1 Score

Also known as: F-measure, F1

TL;DR

The F1 score is the harmonic mean of precision and recall — a single number that punishes lopsided performance. Standard for classification, rare in retrieval, where ranked metrics like NDCG@K are usually the better choice.

The F1 score is the harmonic mean of precision (P) and recall (R):

$$F_1 = \frac{2 P R}{P + R}$$

The harmonic mean punishes lopsidedness. A system with precision 1.0 and recall 0.0 has F1 = 0; a balanced system with precision 0.7 and recall 0.7 has F1 = 0.7. The arithmetic mean would score the lopsided system at 0.5, rewarding it unfairly; F1 doesn’t.
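A minimal sketch of the comparison above, assuming the standard definition F1 = 2PR / (P + R). The function names `f1_score` and `arithmetic_mean` are illustrative, not from any particular library, and the zero-denominator convention (return 0.0) is a common choice rather than a universal rule.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 by convention when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def arithmetic_mean(precision: float, recall: float) -> float:
    """Plain average, shown only for contrast with the harmonic mean."""
    return (precision + recall) / 2


# The two systems from the text: one lopsided, one balanced.
for p, r in [(1.0, 0.0), (0.7, 0.7)]:
    print(f"P={p} R={r} F1={f1_score(p, r):.3f} mean={arithmetic_mean(p, r):.3f}")
```

The lopsided system gets an arithmetic mean of 0.5 but an F1 of 0, while the balanced system scores 0.7 on both.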

Why not simply take min(P, R), the smaller of precision P and recall R? Both penalize the lopsided case, but the harmonic mean is smooth and differentiable wherever P + R > 0, which matters when you want to use F1-shaped losses during training or when you want every improvement to show up in how systems rank by score. min(P, R) is continuous but not differentiable where P = R, and it ignores the larger of the two values entirely: a model improving its precision from 0.7 to 0.9 while recall stays at 0.5 produces zero gain in min(P, R), which masks real improvement. The harmonic mean rewards the better precision while still punishing the recall gap. There’s also the algebraic interpretation: $1/F_1 = \tfrac{1}{2}\,(1/P + 1/R)$, so F1 is the inverse of the average of the inverses, which is exactly the right shape when false positives and false negatives both consume your error budget proportionally.
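Working through the numbers in that example may help. This assumes the alternative being compared is min(P, R), with P for precision and R for recall:

$$F_1(0.7,\, 0.5) = \frac{2 \cdot 0.7 \cdot 0.5}{0.7 + 0.5} = \frac{0.70}{1.20} \approx 0.583, \qquad F_1(0.9,\, 0.5) = \frac{2 \cdot 0.9 \cdot 0.5}{0.9 + 0.5} = \frac{0.90}{1.40} \approx 0.643$$

$$\min(0.7,\, 0.5) = \min(0.9,\, 0.5) = 0.5$$

F1 moves by about 0.06 while the minimum does not move at all.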

The general F-beta family

F1 is a special case of the F-beta score, where P is precision and R is recall:

$$F_\beta = \frac{(1 + \beta^2)\, P R}{\beta^2 P + R}$$

F1 ($\beta = 1$): precision and recall weighted equally.
F2 ($\beta = 2$): recall treated as twice as important as precision, which gives it $\beta^2 = 4$ times the weight in the harmonic mean. Use when missing positives is the expensive failure (cancer screening, fraud detection’s recall tier, security alerting).
F0.5 ($\beta = 0.5$): precision treated as twice as important as recall, which gives it 4 times the weight. Use when false positives are expensive (spam filters with aggressive blocklists, content moderation).

Pick the beta to match the cost asymmetry of your problem. Reporting F1 by default is fine; reporting only F1 when your problem is asymmetric is misleading.
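To see how the choice of beta shifts the score, here is a small sketch using the standard definition F_beta = (1 + beta²)·P·R / (beta²·P + R). The function name `f_beta` and the example precision/recall values are illustrative assumptions, not measurements from a real system.

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta from precision and recall; beta > 1 favors recall, beta < 1 favors precision."""
    if beta <= 0:
        raise ValueError("beta must be positive")
    b2 = beta * beta
    denom = b2 * precision + recall
    if denom == 0:
        return 0.0  # convention when precision and recall are both zero
    return (1 + b2) * precision * recall / denom


# A hypothetical high-recall, low-precision system.
p, r = 0.5, 1.0
for beta in (0.5, 1.0, 2.0):
    print(f"beta={beta}: F={f_beta(p, r, beta):.3f}")
```

For this system the score rises from roughly 0.556 (F0.5) through 0.667 (F1) to 0.833 (F2): the more the metric favors recall, the better a high-recall system looks.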

Why F1 is rare in retrieval

Retrieval is a ranking problem, not a classification problem. The metrics you actually care about (NDCG@K, MRR, Recall@K) depend on where relevant documents land in the ranking, not just whether they were returned in some unordered set.

F1 throws position information away. A reranker that puts the right document at #1 and a reranker that puts it at #10 produce identical F1 if both keep the same set of top-K. NDCG separates them; F1 doesn’t.
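The point can be checked with a toy example. The sketch below assumes binary relevance, a single relevant document, and the common log2-discount form of NDCG; the helper names `set_f1` and `ndcg_at_k` and the document IDs are hypothetical.

```python
import math


def set_f1(retrieved: list[str], relevant: set[str]) -> float:
    """F1 over the unordered set of retrieved documents."""
    retrieved_set = set(retrieved)
    hits = len(retrieved_set & relevant)
    if hits == 0:
        return 0.0
    precision = hits / len(retrieved_set)
    recall = hits / len(relevant)
    return 2 * precision * recall / (precision + recall)


def ndcg_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """NDCG@K with binary gains and a log2 position discount."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc in enumerate(ranked[:k], start=1)
        if doc in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


relevant = {"doc_a"}
fillers = [f"doc_{i}" for i in range(9)]
right_first = ["doc_a"] + fillers  # relevant document at #1
right_last = fillers + ["doc_a"]   # relevant document at #10

for name, ranking in [("at #1", right_first), ("at #10", right_last)]:
    print(name, f"F1={set_f1(ranking, relevant):.3f}",
          f"NDCG@10={ndcg_at_k(ranking, relevant, 10):.3f}")
```

Both rankings return the same ten documents, so their F1 is identical; NDCG@10 is 1.0 for the first ranking and much lower for the second.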

So in retrieval evaluation you’ll see precision and recall reported separately (often as Precision@K and Recall@K curves over K), but F1 itself is uncommon as a primary metric.

F1 is the right answer for binary classification, the wrong answer for ranking. The clue is in the question being asked — “did we get this set right?” vs “is the right thing first?”.

Where F1 does show up in retrieval-adjacent work

Three places worth tracking:

Faithfulness checking. Given a claim and a context, is the claim entailed by the context? Binary task → F1. See faithfulness.
Citation extraction. Given a claim and a candidate span, is this the right citation? Binary at the span level → F1. See citation extraction.
Intent classification / routing. A query is routed to one of N tools. Per-class precision/recall/F1.

The underlying retrieval is judged by ranked metrics, but the binary “is this claim supported” or “did we cite the right span” check on top is a classification task where F1 is the right summary.

Production checklist

Default to F1 for classification, but always report precision and recall alongside.
Switch to F-beta when costs are asymmetric — and document which beta and why.
For retrieval, use NDCG/MRR/Recall@K as primary metrics; reach for F1 only on the binary subtasks.

When to reach for which F-beta

F0.5 (precision-weighted): spam blocklists, content moderation enforcement, search-autocomplete filters where false positives annoy users
F1 (balanced): default classification reporting, intent routing, balanced binary tasks
F2 (recall-weighted): cancer screening, fraud-detection recall stage, security alerting where missing a positive is the expensive failure
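F1 micro vs F1 macro: micro pools true-positive, false-positive, and false-negative counts across classes, so frequent classes dominate the score; macro averages per-class F1, so every class counts equally and failures on rare classes stay visible. Pick macro if class imbalance hides real failure modes.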
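The micro/macro gap is easiest to see on imbalanced data. The sketch below is a plain-Python illustration for single-label multi-class predictions; the function names and the routing labels ("search", "refund") are made up for the example. It relies on the identity F1 = 2·TP / (2·TP + FP + FN).

```python
from collections import Counter


def per_class_counts(y_true, y_pred):
    """Return per-class true-positive, false-positive, and false-negative counts."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    return tp, fp, fn


def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(y_true, y_pred):
    tp, fp, fn = per_class_counts(y_true, y_pred)
    classes = sorted(set(y_true) | set(y_pred))
    # Macro: average the per-class F1 scores, every class weighted equally.
    macro = sum(f1_from_counts(tp[c], fp[c], fn[c]) for c in classes) / len(classes)
    # Micro: pool the counts first, then compute a single F1.
    micro = f1_from_counts(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return micro, macro


# Hypothetical imbalanced routing data: nine "search" queries, one "refund"
# query, and a router that always predicts "search".
y_true = ["search"] * 9 + ["refund"]
y_pred = ["search"] * 10
micro, macro = micro_macro_f1(y_true, y_pred)
print(f"micro={micro:.3f} macro={macro:.3f}")
```

Here micro F1 is 0.9 while macro F1 is about 0.474, because the router never gets the minority class right and macro averaging refuses to hide that.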

Go further

Why is F1 rare in retrieval evaluation?

Retrieval cares about ranking, not just the set of returned items. F1 collapses precision and recall into one number but discards position information, just as Precision@K does within its cutoff. NDCG@K (or MRR, when only the first relevant result matters) captures order; F1 doesn't. F1 stays the standard for binary classification, where there's no ranking to measure.

What's the difference between F1, F0.5, and F2?

F-beta lets you weight recall against precision. F2 (β = 2) treats recall as twice as important as precision, which puts β² = 4 times the weight on it in the harmonic mean (use when missing relevant items is worse than including irrelevant ones, as in medical screening); F0.5 (β = 0.5) does the same for precision (use when false positives are more costly, as in spam filters). F1 is the symmetric default.

Where does F1 actually show up in a retrieval stack?

Most often inside the evaluation harness for classification subtasks — claim verification, intent detection, citation correctness. The retrieval ranking itself is judged by NDCG/Recall, but the binary 'did this claim get supported?' check on top is precision/recall and F1.

