Dataset Cards

Also known as: dataset documentation, datasheets for datasets, dataset metadata

TL;DR

Structured metadata describing a dataset's provenance, license, size, intended use, limitations, and ethical considerations. The HuggingFace dataset-card schema is the de facto standard, and every shipped dataset should have one.

A dataset card is a structured document that describes everything a downstream consumer needs to know about a dataset before training on it: provenance, license, size, splits, intended use, known limitations, ethical considerations, and contact information. The HuggingFace Hub renders cards automatically from a Markdown file at the dataset root, and the YAML frontmatter encodes machine-readable fields that let search and filtering work over the Hub. The format descends from the academic Datasheets for Datasets proposal (Gebru et al., 2018) but is shorter, machine-parseable, and now near-universal across open data releases. It serves as the contract between a curated corpus and the teams who consume it.

Why every dataset needs one

Most reproducibility failures in ML trace back to undocumented data choices. A paper reports a score; months later someone tries to reproduce it and finds different splits, different preprocessing, or license terms that prohibit the intended use. A complete card collapses each uncertainty source into a question with an answer.

Cards also matter legally. A team that fine-tunes a commercial model on a “non-commercial use only” dataset has a problem; a medical model trained on “consent for research use only” data has a bigger one. The card is the contract.

What goes on the card

Standard dataset-card sections

Summary. One paragraph: what it is, where it came from, what it’s for.
Languages and modalities. Which languages, scripts, modalities, at what fraction.
Provenance. Upstream sources with links, collection methodology, dates, preprocessing. Link parent dataset cards for derivations.
Size and splits. Rows per split, total tokens or bytes, distribution by source.
License. SPDX identifier plus full text. Note per-source heterogeneity (e.g., 80% CC-BY, 20% non-commercial).
Intended use. Authors’ target use cases. Pretraining? Evaluation? Fine-tuning?
Limitations and biases. Known gaps, demographic skews, contamination with public benchmarks, harm potential.
Ethical considerations. Consent, PII handling, sensitive content, attribution.
Citation and contact. BibTeX, maintainer email or issue tracker.
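The "Size and splits" numbers can usually be generated rather than typed by hand. The sketch below is one possible way to do it, assuming each record is a dict with hypothetical `split` and `source` keys (neither name comes from any particular schema):

```python
from collections import Counter

def size_and_source_stats(records):
    """Return (rows per split, share of rows per source).

    Assumes each record is a dict with 'split' and 'source' keys;
    both key names are illustrative, not part of any standard.
    """
    records = list(records)
    rows_per_split = Counter(r["split"] for r in records)
    rows_per_source = Counter(r["source"] for r in records)
    total = len(records)
    # Express each source as a fraction of all rows, e.g. 0.8 for 80%.
    source_share = (
        {src: n / total for src, n in rows_per_source.items()} if total else {}
    )
    return dict(rows_per_split), source_share

example = [
    {"split": "train", "source": "web"},
    {"split": "train", "source": "web"},
    {"split": "train", "source": "web"},
    {"split": "train", "source": "wiki"},
    {"split": "test", "source": "web"},
]
splits, shares = size_and_source_stats(example)
```

The same per-source shares are a natural starting point for the license heterogeneity note when each source carries its own license.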

The HuggingFace YAML frontmatter encodes a subset as machine-readable fields — language:, license:, task_categories:, size_categories:, tags: — letting the Hub’s filter UI and downstream tooling query at scale. The Markdown body holds the long-form content.
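As a rough illustration of that split between frontmatter and body, the sketch below embeds a small made-up card and reads its fields back. The field values and the card text are invented for the example, and the parser deliberately handles only flat `key: value` pairs and `- item` lists; a real pipeline would use a proper YAML library instead.

```python
CARD = """\
---
language:
- en
license: cc-by-4.0
task_categories:
- text-generation
size_categories:
- 1M<n<10M
tags:
- web
---
# Example Corpus

Summary, provenance, limitations, and other long-form sections go here.
"""

def split_card(text):
    """Split a card into (frontmatter text, Markdown body)."""
    lines = text.splitlines()
    # No opening or closing '---' delimiter: treat everything as body.
    if not lines or lines[0] != "---" or "---" not in lines[1:]:
        return "", text
    end = lines.index("---", 1)
    return "\n".join(lines[1:end]), "\n".join(lines[end + 1:])

def parse_flat_frontmatter(front):
    """Parse 'key: value' pairs and 'key:' followed by '- item' lines.

    This is a restricted subset for illustration, not general YAML.
    """
    meta = {}
    key = None
    for line in front.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        if stripped.startswith("- ") and isinstance(meta.get(key), list):
            meta[key].append(stripped[2:].strip())
        else:
            key, _, value = stripped.partition(":")
            key, value = key.strip(), value.strip()
            # An empty value means a list of '- item' lines follows.
            meta[key] = value if value else []
    return meta

front, body = split_card(CARD)
meta = parse_flat_frontmatter(front)
```

Once the metadata is a plain dict, filtering a collection of cards by `license` or `language` is an ordinary dictionary lookup.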

Contamination with public benchmarks, listed under limitations above, deserves particular care. The honest version of the disclosure specifies what was checked, how, and what was found: “Overlap with MMLU, GSM8K, HumanEval, MBPP was computed via 13-gram exact match between benchmark text and any document in the dataset. MMLU: 0.3% of benchmark examples have a 13-gram match. GSM8K: 0.0%. HumanEval: 1.2% of function names appear in the dataset’s code. We have not stripped these overlaps; consumers should decontaminate before reporting numbers.”

The bad version says “we made an effort to decontaminate” without specifying the procedure or the result. That phrasing is unfalsifiable, which is exactly why papers reach for it. A reviewer should be able to reproduce the check in an afternoon. The corollary: run contamination checks against the consumer’s benchmarks, not just the ones the authors expect.
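A check of this kind fits in a few lines. The sketch below is one way an n-gram exact-match overlap could be computed; the lowercasing and whitespace tokenization are assumptions, since the procedure quoted above does not specify a tokenizer, and the function names are illustrative.

```python
def ngrams(text, n=13):
    """Set of n-token windows, using lowercase whitespace tokens (an assumption)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contamination_rate(benchmark_examples, documents, n=13):
    """Fraction of benchmark examples sharing at least one n-gram with any document.

    Examples shorter than n tokens produce no n-grams and never count as matches.
    """
    corpus_grams = set()
    for doc in documents:
        corpus_grams |= ngrams(doc, n)
    if not benchmark_examples:
        return 0.0
    hits = sum(1 for ex in benchmark_examples if ngrams(ex, n) & corpus_grams)
    return hits / len(benchmark_examples)

# Tiny demonstration with n=5 so the toy strings are long enough to match.
benchmark = [
    "what is two plus two equal to",
    "name the capital of france",
]
docs = [
    "quiz: what is two plus two equal to four",
    "unrelated text about cooking pasta at home",
]
rate = contamination_rate(benchmark, docs, n=5)
```

Holding every document's n-grams in one in-memory set only works for small corpora; at scale the same idea would need hashing or an index, which this sketch does not attempt.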

A derived dataset needs to make its lineage explicit. The card should link the parent dataset card directly, describe the transformation precisely enough to reproduce the pipeline, and either inherit the parent’s license or document any change to the terms.

Constraints flow downstream. If the parent is CC-BY-NC, the derivation is also CC-BY-NC unless the transformation rises to a new copyrightable work — a judgment call for lawyers, not data engineers. Cards that elide the lineage create plausible deniability that ends in a takedown notice. FineWeb’s card does this right: it points back to Common Crawl, lists every filter applied, and inherits Common Crawl’s terms.
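The lineage requirements can be expressed as a small structural check. The sketch below uses a hypothetical `Lineage` record whose field names are invented for illustration; it only flags missing documentation and makes no legal judgment about whether a license change is permitted.

```python
from dataclasses import dataclass, field

@dataclass
class Lineage:
    """Hypothetical lineage record for a derived dataset card."""
    parent_card: str                 # link to the parent dataset card
    parent_license: str              # license identifier of the parent
    license: str                     # license identifier of the derivation
    transformations: list = field(default_factory=list)  # ordered pipeline steps
    license_change_note: str = ""    # explanation if the license differs

def lineage_problems(lineage):
    """List documentation gaps; an empty list means the lineage is fully stated."""
    problems = []
    if not lineage.parent_card.strip():
        problems.append("missing link to parent dataset card")
    if not lineage.transformations:
        problems.append("no transformations described")
    if (lineage.license != lineage.parent_license
            and not lineage.license_change_note.strip()):
        problems.append("license differs from parent without documentation")
    return problems

derived = Lineage(
    parent_card="",
    parent_license="cc-by-nc-4.0",
    license="cc-by-4.0",
)
issues = lineage_problems(derived)
```

Running a check like this in release tooling turns an elided lineage into a visible failure instead of a later surprise.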

What dataset cards aren’t

Cards are not a substitute for the data being inspectable. A 10 TB corpus with a thorough card but no sample-browse interface leaves consumers guessing. The best releases pair the card with a small browseable sample, per-source histograms, and a search interface, so that claims can be spot-checked in minutes rather than weeks.

Every shipped dataset, including internal ones consumed by sibling teams, should have a card. The discipline of writing the card surfaces ambiguities the authors hadn’t articulated and makes downstream labeling and training decisions auditable later.

Go further

What's the difference between a dataset card and a datasheet?

Datasheets for Datasets (Gebru et al., 2018) is the academic origin — a long-form questionnaire covering motivation, composition, collection, processing, uses, and distribution. Dataset cards are the practical sibling: a structured Markdown file in a known schema, typically rendered automatically by HuggingFace Hub, GitHub, or a model registry. Cards inherited the questions; they're shorter and machine-parseable.

What's the minimum a dataset card needs to be useful?

Provenance (where did the data come from, including upstream sources and any scraping or licensing details), size (rows, tokens, splits), license (full text and effective constraints), intended use (what tasks the authors built it for), known limitations (gaps, biases, failure modes the authors are aware of), and contact (who answers questions). Skipping any of these breaks downstream consumers.
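That minimum can be enforced mechanically before release. The sketch below assumes the card has already been loaded into a plain dict; the six key names mirror the list above but are otherwise an invented convention.

```python
# Key names are illustrative; they mirror the minimum fields listed above.
REQUIRED_FIELDS = (
    "provenance",
    "size",
    "license",
    "intended_use",
    "limitations",
    "contact",
)

def missing_fields(card):
    """Return the required fields that are absent or blank in the card dict."""
    missing = []
    for name in REQUIRED_FIELDS:
        value = card.get(name)
        if value is None or not str(value).strip():
            missing.append(name)
    return missing

draft = {
    "provenance": "Scraped from public forum archives, 2019 to 2022.",
    "size": "1.2M rows; train/validation/test",
    "license": "cc-by-4.0",
    "intended_use": "   ",
}
gaps = missing_fields(draft)
```

A check like this only proves the fields are present, not that what they say is accurate or complete.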

Why is contamination disclosure on the dataset card?

Because consumers of a dataset need to know whether it overlaps with public benchmarks before they report numbers. A dataset card that lists 'overlap with MMLU: 0.3% of MMLU questions appear verbatim' lets the consumer decontaminate or disclaim. A card that omits this lets people publish inflated benchmark scores in good faith and discover the contamination later.
