Data Formats

Also known as: serialization formats, file formats, data serialization, storage formats

TL;DR

Data formats are the contracts between memory and disk: how a structured record turns into bytes that can be read back later. The choice of format determines whether you can scan a TB in a second or an hour, and whether schema evolution breaks readers.

Every dataset eventually lives as bytes on disk. The format is the contract that says: here is how a row, a column, a schema, a null are written out and read back. The decision is almost never aesthetic. The format determines whether your training loop is I/O-bound or compute-bound, whether a schema change breaks every downstream consumer, and whether a 10 TB scan costs ten dollars or ten thousand.

The axes that actually matter

Four properties separate every format in practice:

The four format axes

Row vs column. Row-oriented formats store whole records one after another (row1.fields..., row2.fields..., …). Column-oriented formats store each column's values together (col1.values..., col2.values..., …). Row wins for streaming whole records (training data shards, logs). Column wins for scanning a few fields out of many (analytics, feature stores).
Text vs binary. Text is debuggable, line-oriented, edit-by-hand-able (CSV, JSON, JSONL, YAML). Binary is 5-50× smaller, faster to parse, schema-aware (Parquet, Avro, Protobuf, Arrow). Text at small scale; binary the moment parsing or storage starts mattering.
Schemaful vs schemaless. Schemaful (Parquet, Avro, Protobuf) declares field names and types up front: in file metadata for Parquet and Avro, in a separate .proto definition for Protobuf. Readers know what to expect. Schemaless (JSON, JSONL, CSV) leaves types to be inferred per row; the same dataset can have ten different shapes and you discover them at query time.
Append-only vs random-access. JSONL appends in O(1) — you can pipe writes from N concurrent producers. Parquet writes row groups in a compressed batch and is bad at single-row appends. The right format depends on whether you’re producing the data or consuming it.
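
As a rough illustration of the row-versus-column axis, the sketch below holds the same three records in both layouts using plain Python containers. The field names and values are made up for the example, and real formats add encoding, compression, and metadata on top of this idea.

```python
# Hypothetical records; the field names are illustrative only.
rows = [
    {"id": 1, "text": "first example", "label": 0},
    {"id": 2, "text": "second example", "label": 1},
    {"id": 3, "text": "third example", "label": 1},
]


def to_columns(records):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# Row layout: reading one field still means visiting every whole record.
labels_from_rows = [record["label"] for record in rows]

# Column layout: the field is already stored as one list.
labels_from_columns = columns["label"]
```

Both lookups return the same labels; the difference is how much unrelated data has to be walked to get them, which is what the row-versus-column choice trades on.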

The shortlist worth knowing

The catalog has dedicated entries for the formats you’ll touch most often:

Parquet — columnar, binary, schemaful. The default for any structured dataset above a few hundred GB. Predicate pushdown, page-level statistics, per-column compression. The format every analytical engine reads natively.
JSONL — line-delimited JSON. The ML-dataset and log-shipping default. Append-only, streamable, schemaless, brutally simple.
MDX — Markdown plus JSX. Not really a data format — a content format. Included here because it’s the format the catalog itself is written in, and because the line between content and data is blurry the moment your dataset is prose-plus-metadata.
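
To make the "append-only, streamable, schemaless" description of JSONL concrete, here is a minimal sketch using only the standard library. The file name, helper names, and records are hypothetical.

```python
import json
import os
import tempfile


def append_record(path, record):
    """Append one record as a single JSON line."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")


def stream_records(path):
    """Yield records one at a time without loading the whole file."""
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)


# Hypothetical file name and records, for illustration only.
path = os.path.join(tempfile.mkdtemp(), "events.jsonl")
append_record(path, {"id": 1, "text": "hello"})
append_record(path, {"id": 2, "text": "world", "lang": "en"})  # extra key is allowed

records = list(stream_records(path))
```

Nothing already written is touched by an append, and the second record carries a key the first one lacks. That is the schemaless property in miniature: convenient for the writer, and something every reader has to be prepared for.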

Where format choice silently hurts you

Almost always: you’re reading JSONL or CSV at a scale where you should be reading Parquet. A row-oriented text format with no statistics has to parse every byte of every record even if you only use two columns. A 10 TB JSONL scan can drop to 800 GB in Parquet (column projection + compression) and parse 20× faster because the bytes are typed integers instead of strings of digits. For data that fits in a pandas DataFrame, the fix is one df.to_parquet() call away.
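
The "typed integers instead of strings of digits" point can be seen in a small standard-library sketch. It compares the same made-up integer column written as JSON lines and as fixed-width 64-bit values. It does not reproduce Parquet's actual encoding or compression; it only shows where part of the size and parsing gap comes from.

```python
import json
from array import array

# Hypothetical column of 1,000 seven-digit integers.
values = list(range(1_000_000, 1_001_000))

# Text: one JSON object per line, digits stored as characters.
text_bytes = "".join(json.dumps({"id": v}) + "\n" for v in values).encode("utf-8")

# Binary: the same column as fixed-width 64-bit signed integers.
binary_bytes = array("q", values).tobytes()

# Reading the binary column back needs no digit-by-digit parsing.
decoded = array("q")
decoded.frombytes(binary_bytes)

text_size = len(text_bytes)
binary_size = len(binary_bytes)
```

Here the text form also repeats the key name on every line, which a columnar file stores once. Real files will differ in the exact ratio, so measure on your own data before relying on any particular number.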

The reverse — converting Parquet to JSONL because “Parquet is hard” — is almost always wrong at scale. The mental overhead of Parquet is one-time; the I/O overhead of JSONL is per-epoch.

The other failure shows up the first time someone in another team renames a column in a shared dataset and silently breaks your downstream model two weeks later. Schemaful formats (Parquet, Avro, Protobuf) keep column names and types in a declared schema that travels with the data or its definition; readers can detect breakage at file-open time rather than at query time. Schemaless formats (JSON, CSV) push the breakage detection into your code, which is to say, into production.

If multiple teams write to the same corpus, your format choice is a coordination mechanism. Parquet is the cheapest coordination tool you have.
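
The idea of catching a renamed column at open time can be sketched with a toy container that writes its schema on the first line. This is an illustration only and not how Parquet, Avro, or Protobuf lay out their schemas; the function names and the schema itself are hypothetical.

```python
import io
import json


def write_with_schema(stream, schema, rows):
    """Write a toy file: the schema on the first line, then one row per line."""
    stream.write(json.dumps({"schema": schema}) + "\n")
    for row in rows:
        stream.write(json.dumps([row[name] for name in schema]) + "\n")


def open_checked(stream, expected):
    """Read the schema first and fail before touching any rows if it differs."""
    found = json.loads(stream.readline())["schema"]
    if found != expected:
        raise ValueError(f"schema mismatch: expected {expected}, found {found}")
    return (dict(zip(found, json.loads(line))) for line in stream)


# Hypothetical schema; a writer on another team has renamed "label" to "target".
expected = {"text": "string", "label": "int64"}
renamed = {"text": "string", "target": "int64"}

buffer = io.StringIO()
write_with_schema(buffer, renamed, [{"text": "hello", "target": 1}])
buffer.seek(0)

try:
    open_checked(buffer, expected)
    error = None
except ValueError as exc:
    error = str(exc)
```

The reader fails on the first line, before any row is consumed. With plain JSONL the same rename would surface later as a missing key somewhere in downstream code.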

The honest take

Most teams overthink format choice and underthink access pattern. Start with the smallest format that meets your access pattern; only optimize when measured I/O or storage is hurting. JSONL for everything until you scan a TB; then Parquet for tables and JSONL stays for logs. That covers 90% of ML data engineering, and the remaining 10% is specific enough (vector stores, streaming, model-card metadata) that the format choice is already decided by the tool.

Go further

Row-oriented vs column-oriented — when does which one win?

Row-oriented (CSV, JSON, JSONL, Avro, Protobuf records) reads one whole record at a time and is the right shape when you process records end-to-end (training data, logs, message queues). Column-oriented (Parquet, Arrow, ORC) reads one column across many rows at a time and is the right shape when you scan or filter by a few fields out of many (analytics, feature stores, ML training where you only need the text + label out of a wide row).

Text vs binary — is binary always better?

Not always. Text formats (CSV, JSON, JSONL, YAML) are debuggable with head, edit-by-hand-able, and have universal tooling. Binary formats (Parquet, Avro, Protobuf, Arrow) are 5-50× smaller, faster to parse, and schema-aware. Use text when humans need to read it or when the dataset is small enough that I/O isn't the bottleneck; use binary at scale or anywhere the parse step is hot.

What does schema evolution actually mean in practice?

Adding a new column to last year's data without rewriting the file. Parquet, Avro, Protobuf, and Arrow all support this as a first-class property — reader code with a newer schema can read old files (filling defaults for missing columns) and reader code with an older schema can ignore new columns. JSON kind of supports it because you can always add a key, but the lack of declared schema means breakage often shows up at query time instead of at write time.
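
A minimal sketch of the two directions described above, assuming records are plain dicts and the reader's schema is a mapping from column name to default value. The schema, defaults, and helper name are hypothetical, and real formats implement this resolution in their own ways.

```python
# Hypothetical reader schema: column name -> default used when the column is absent.
READER_SCHEMA_V2 = {"id": None, "text": "", "lang": "unknown"}
READER_SCHEMA_V1 = {"id": None, "text": ""}


def read_with_schema(record, schema):
    """Project a stored record onto the reader's schema."""
    # Missing columns get the reader's default; unknown columns are dropped.
    return {name: record.get(name, default) for name, default in schema.items()}


old_record = {"id": 1, "text": "written last year"}  # predates the "lang" column
new_record = {"id": 2, "text": "written today", "lang": "en", "score": 0.9}

# Newer reader, older data: the missing column is filled with a default.
upgraded = read_with_schema(old_record, READER_SCHEMA_V2)

# Older reader, newer data: columns it does not know about are ignored.
trimmed = read_with_schema(new_record, READER_SCHEMA_V1)
```

Neither record had to be rewritten for either reader to use it, which is the practical meaning of schema evolution here.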
