JSONL

Q: How do OpenAI and Anthropic format chat datasets?

Both use JSONL. Each line is a JSON object with a messages array ({role: 'user'|'assistant'|'system', content: '...'}). OpenAI's fine-tuning format expects {messages: [...]} per line; Anthropic's batch API expects one request per line with custom_id for joining results. The ecosystem converged on JSONL because every loader, sharder, and shuffler already handled it.

Also known as: JSON Lines, NDJSON, newline-delimited JSON, ldjson

TL;DR

JSONL — JSON Lines, also called NDJSON — is one JSON object per line. Brutally simple, ubiquitous in ML datasets and log shipping, friendly to streaming and append-only writes.

JSONL is JSON, one object per line, with \n as the record separator. The whole spec fits in one sentence — and that’s the point. There is no schema header, no compression layer, no offset index, no metadata. Just {...}\n{...}\n{...}\n. Every Python script, every awk one-liner, every streaming consumer, every cloud-storage tool already handles it. It’s the format that won by being too dumb to lose.

{"id": 1, "text": "first record", "label": "a"}
{"id": 2, "text": "second record", "label": "b"}
{"id": 3, "text": "third record", "label": "c"}

What JSONL is good for

The shape of JSONL exactly matches “structured records produced one at a time and consumed in order.” That’s most of the ML data lifecycle:

A scraper produces one JSON object per page → appended to JSONL
A labeling job produces one JSON object per labeled example → appended to JSONL
A training loop reads one JSON object per step → consumed from JSONL
An inference batch produces one JSON object per request → appended to JSONL

The line boundary is load-bearing. It lets you wc -l to count records, head -n 1000 to sample, split -l to shard, cat *.jsonl to merge. The price is no built-in schema, no compression, no column projection. For most ML datasets that price is fine.

Where JSONL stops being the right tool

Three signals that you’ve outgrown it:

Training step is I/O-bound. You’re spending more time parsing JSON than running gradients. Modern GPUs eat data faster than a single-threaded JSON parser can produce it. Switch to Parquet (or Arrow IPC for in-memory streaming) and the same training step often runs 5-20× faster.
You only need 2-3 fields out of 50. JSONL has to parse every byte of every record even for columns you don’t use. Parquet’s column projection skips the bytes entirely. On a 1 TB dataset with 50 columns, projecting 3 of them in Parquet reads ~60 GB; the equivalent JSONL reads the full 1 TB.
Schema drift is silently breaking downstream. No declared schema means readers discover broken records at parse time, not at load time. Once multiple teams write to the same JSONL corpus, you’ll discover the wrong field names through training crashes.

Below ~50 GB or single-team ownership, stay on JSONL. The simplicity wins.

Production gotchas

Pretty-printed JSON breaks it. JSONL requires one object per line; if a producer emits multi-line indented JSON, parsers see syntax errors. Always emit json.dumps(obj) with no indent= argument.
\n inside string values is the silent killer. A naive producer that writes obj["text"] = "line1\nline2" will produce a JSONL line that splits into two when a downstream reader does .split("\n"). Always escape newlines (json.dumps does this by default; CSV-to-JSONL converters often don’t).
Compression is the user’s job. .jsonl.gz and .jsonl.zst are the conventions. pyarrow.json can stream-decompress them. Don’t store raw JSONL above ~10 GB if you have any control over the producer; the 5-10× compression ratio is free.
NDJSON, ldjson, jsonl are aliases. Same format. Different tribes named it. application/x-ndjson is the most-supported MIME type.

The honest comparison

Property	JSONL	Parquet	CSV
Schema-on-write	❌	✅	❌
Column projection	❌	✅	❌
Streaming-friendly	✅	partial	✅
Edit by hand	✅	❌	✅
Append at O(1)	✅	❌	✅
Storage efficiency	poor	excellent	poor
Universal tooling	✅	mostly	✅

JSONL wins on every column where simplicity helps. Parquet wins on every column where scale helps. Pick by where your dataset is in its lifecycle.

Go further

Why JSONL instead of a single big JSON array?

Because a JSON array forces every reader to parse the whole file before yielding the first element — which means you can't stream it, you can't append to it, and you can't process it line-by-line in a shell pipeline. JSONL flips all three. Each line is a self-contained JSON document; readers consume one line at a time; producers append at O(1); head -n 1000 works.

Parquet

When does JSONL stop being the right choice?

Once your dataset crosses ~50 GB AND you scan only a few fields per query, switch to Parquet. JSONL parses every byte of every record even when you only need two columns out of fifty; Parquet skips the rest of the row entirely. The transition point is when your training step becomes I/O-bound — typically a few hundred GB for ML datasets, sooner for analytics.

Parquet

How do OpenAI and Anthropic format chat datasets?

Both use JSONL. Each line is a JSON object with a messages array ({role: 'user'|'assistant'|'system', content: '...'}). OpenAI's fine-tuning format expects {messages: [...]} per line; Anthropic's batch API expects one request per line with custom_id for joining results. The ecosystem converged on JSONL because every loader, sharder, and shuffler already handled it.

← All concepts

The best AI teams build with ZeroEntropy models

Book Demo View docs