Parquet

Also known as: Apache Parquet, columnar storage

TL;DR

Parquet is a columnar on-disk format that has become the only reasonable way to store multi-TB datasets. Rows are grouped into chunks; each chunk stores columns separately, compressed, with statistics.

Parquet is the columnar file format that backs essentially every serious data lake, every analytical warehouse, and every modern pretraining-corpus pipeline. Once you cross a few hundred GB of data, JSON and CSV stop being viable — they’re slow to scan, they don’t compress well, and they have no concept of schema. Parquet exists because somebody had to fix that.

The format is conceptually simple: split the table into row groups, store each row group’s columns separately, compress each column independently, and write statistics (min, max, null count) at the column-chunk level for fast filtering. Everything else is bookkeeping.

The columnar idea

A row-oriented file (CSV, JSON Lines) lays values out as (row1.col1, row1.col2, ..., row2.col1, row2.col2, ...). To read one column, you have to scan the whole file.

A columnar file lays values out as (col1.row1, col1.row2, ..., col2.row1, col2.row2, ...). To read one column, you read exactly the bytes of that column and skip the rest.

For analytical workloads — SELECT mean(score) FROM table — this is a 10-100x I/O reduction. For ML workloads where you load specific features (text, label) out of a wide table (text, label, source_url, fetch_date, …), it’s the same story.

Read what you query, not what you stored. That’s the columnar bet.

The secondary win is compression. Adjacent values within a column are similar — same dtype, often correlated, often duplicated. A column of integers compresses far better than a row of mixed types. zstd on a column of timestamps routinely hits 10x; the same data row-oriented might hit 2x.

Row groups and pages

A Parquet file is structured hierarchically:

file
 ├── row group 0
 │    ├── column chunk: doc_id    → page, page, page, ...
 │    ├── column chunk: text      → page, page, page, ...
 │    └── column chunk: score     → page, page, page, ...
 ├── row group 1
 │    └── ...
 └── footer (schema + row-group metadata + statistics)

The row group is the unit of horizontal partitioning. Default size is ~128MB of uncompressed data, usually 100K-1M rows. Each row group is independently readable, which is what makes Parquet streamable and parallelizable — N readers can each take one row group.

Inside a row group, each column is a column chunk, broken into pages (~1MB each). Pages are the unit of compression and encoding. Each page header includes statistics that let readers skip pages that can’t contain rows matching a filter.

The number isn’t magic — it’s the rough size at which (a) a single reader can hold a row group in memory comfortably, (b) the per-row-group metadata overhead becomes negligible relative to the data, (c) parallel readers get enough work to amortize task startup, and (d) cloud-storage range reads (S3, GCS) hit their efficient regime — large enough that round-trip latency doesn’t dominate, small enough that a failed read isn’t catastrophic. 128MB matches the historical HDFS block size, which is where the convention came from. Modern systems tune up to 256MB-1GB for archive-heavy workloads and down to 32MB for streaming-heavy ones.

Predicate pushdown and statistics

Each column chunk’s footer entry stores min, max, null_count, and (optionally) a Bloom filter. Query engines use these to skip entire row groups without reading them.

Concretely, if you write WHERE fetch_date > '2024-06-01' and a row group’s fetch_date.max is '2024-05-15', the engine doesn’t read that row group at all. On well-sorted data, predicate pushdown can turn a TB-scale scan into a 10GB scan.

Why every serious data lake uses it

At the multi-TB scale where pretraining corpora live, the tradeoffs make every other format unviable:

Common Crawl ships WARC archives, but every downstream consumer ( FineWeb , RefinedWeb, Dolma) converts to Parquet immediately. WARC is fine for archival; it’s hopeless for filtering.
Deduplication pipelines (MinHash, exact-hash) need to scan one or two columns out of many — perfect for columnar.
Data curation workflows iterate: load text + url + score, apply filters, rewrite. Every iteration touches the same few columns; reading them columnarly is 10-50x faster than re-parsing JSONL.

Typical pretraining-pipeline storage savings (vs gzipped JSONL)

Snappy Parquet: ~1.5-2x smaller, ~5x faster scans on a single column
zstd-3 Parquet: ~2-3x smaller, similar scan speed
zstd-9 Parquet: ~3-4x smaller, ~30% slower writes, same reads

Pitfalls

Parquet is good. It’s not a silver bullet. The recurring issues:

The small-file problem. Parquet has nontrivial per-file overhead (footer, row-group metadata). 10,000 files of 10MB each are dramatically slower to read than 100 files of 1GB each. Compaction jobs that merge small files into larger ones are standard infrastructure on any mature data lake.
Schema drift. Parquet supports schema evolution, but the rules are subtle (adding nullable columns: fine; renaming columns: not portable; changing a column’s type: undefined behavior across engines). Pin schemas, version them, and validate at write time.
Type coercion across writers. Pandas, PyArrow, Spark, and DuckDB don’t all agree on edge cases — timestamp precision, decimal scale, list-of-struct nesting. The pyarrow writer is the closest thing to a reference implementation; use it when in doubt.
Nested data is a footgun. Parquet supports lists and structs (Dremel-style repetition/definition levels), and they work — but they’re slow to filter and most query engines have weaker pushdown on nested fields. Flatten when you can.
Writes are not cheap. Parquet is a write-once-read-many format. Streaming append, mutation, and small-update workloads are all anti-patterns. Use Iceberg or Delta on top of Parquet if you need transactional semantics.

Go further

Why columnar instead of row-oriented?

Most analytical and ML workloads read a few columns out of many. A columnar layout lets you read just those columns from disk — skipping the rest entirely. Compression also gets much better because adjacent values in a column tend to be similar (same dtype, often correlated), while adjacent values in a row are arbitrary.

Common Crawl FineWeb

What's a sensible row-group size?

The default in most writers is ~128MB of uncompressed data per row group, which usually shakes out to 100K-1M rows depending on schema width. Too small and per-group overhead dominates; too large and you can't parallelize reads or stream efficiently. 128MB is a defensible default; tune up for archive workloads and down for streaming-heavy ones.

Data curation Deduplication

Snappy or zstd for compression?

Snappy is the historical default — fast decompression, modest ratio. zstd at level 3-9 gets 20-40% better compression at comparable decompression speed and is now the right default for archive workloads. Snappy still wins for write-heavy pipelines where compression CPU is the bottleneck. For pretraining corpora that get written once and read thousands of times, zstd is the obvious choice.

FineWeb

← All concepts

The best AI teams build with ZeroEntropy models

Book Demo View docs