Streaming Datasets

Also known as: WebDataset, tar-sharded datasets, streaming data loading, iterable datasets

TL;DR

When your dataset doesn't fit on local disk, you stream it from object storage as tar or Parquet shards. Sequential reads of large objects are 10-100x faster than random access on S3, so streaming formats are built around large sequential reads.

A streaming dataset is one you read sequentially from object storage during training, never materializing the full corpus on a single machine’s disk. At scale, this is the only viable pattern: you can’t fit FineWeb’s 60TB on one node, and even if you could, hydrating a fresh copy onto every training worker is a 10x cost multiplier.

The interesting design constraint is that object stores hate random access. Streaming formats are an answer to that constraint, not a stylistic choice.

Sequential reads of 100MB chunks hit ~80% of an object store’s bandwidth. Random 1KB reads hit ~1%. Stream accordingly.

The asymmetry that forces the design

S3, GCS, and Azure Blob are all HTTP-style key-value stores with ~50-200ms of round-trip latency per request. The bandwidth is high (10-100 Gbps per worker, in aggregate) but the per-request overhead is brutal: every GET pays a full RTT before the first byte arrives.

The implication is that a single request for 100MB pays the same fixed latency as a single request for 1KB; only the transfer time differs. To use the bandwidth you have, you need request sizes large enough that the transfer time dominates the RTT, typically tens of MB and up.
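A back-of-the-envelope model makes the asymmetry concrete. The sketch below assumes a single serial connection with a 100 ms RTT and 1 Gbps of per-connection bandwidth. Both numbers are illustrative assumptions, not measurements, and real clients overlap many requests, so treat the output as a shape rather than a benchmark.

```python
def request_time_s(size_bytes: int, rtt_s: float, bandwidth_bps: float) -> float:
    """Wall-clock time for one GET: a fixed round trip plus the transfer time."""
    return rtt_s + (size_bytes * 8) / bandwidth_bps


def link_efficiency(size_bytes: int, rtt_s: float, bandwidth_bps: float) -> float:
    """Fraction of the request's wall-clock time spent actually moving bytes."""
    transfer_s = (size_bytes * 8) / bandwidth_bps
    return transfer_s / request_time_s(size_bytes, rtt_s, bandwidth_bps)


# Assumed, illustrative values (not measurements).
RTT_S = 0.100          # 100 ms round trip per request
BANDWIDTH_BPS = 1e9    # 1 Gbps on a single connection

if __name__ == "__main__":
    for label, size in [("1KB", 1_000), ("1MB", 1_000_000), ("100MB", 100_000_000)]:
        eff = link_efficiency(size, RTT_S, BANDWIDTH_BPS)
        print(f"{label:>6}: {eff:.4%} of the connection's bandwidth used")
```

Under these assumptions the 1KB request is almost entirely latency, while the 100MB request spends most of its time transferring data. Changing `RTT_S` or `BANDWIDTH_BPS` moves the crossover point but not the conclusion.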

Random-access formats (Parquet with row-group seeks, indexed databases) make a lot of small reads. On a single SSD this is fine — NVMe latency is ~50µs. On S3 it’s 1000x slower, and your training worker becomes I/O-bound. Streaming formats sidestep this by reading large contiguous blocks sequentially.

What “streaming” actually means

A streaming dataset has three properties:

The streaming dataset contract
  • Sharded. The corpus lives as N shards of roughly fixed size (typically 100MB-1GB each) on object storage. Each shard is independently readable.
  • Sequentially consumed. Within a shard, samples are read in storage order. There is no random access by index.
  • Lazily materialized. A worker holds at most one shard’s worth of data in memory at a time, plus a small shuffle buffer. The full dataset is never on a single machine.

You give up the ability to do dataset[42]. You gain the ability to train on a 60TB corpus with a worker that has 64GB of RAM and no local NVMe.

The three formats that matter

WebDataset (tar shards)

WebDataset is a convention more than a library: store samples as groups of sibling files inside a .tar archive. A single sample of (image, caption, metadata) lives as 0001234.jpg, 0001234.json, 0001234.txt inside the same tar. The reader iterates the tar entry-by-entry, groups siblings by basename, and yields a dict per sample.
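The grouping step can be sketched with nothing but the standard library. This is a minimal illustration of the convention, not the `webdataset` library's implementation: `iter_samples` and `build_shard` are hypothetical helper names, and the sketch assumes flat file names where everything before the first dot is the sample key.

```python
import io
import tarfile
from typing import BinaryIO, Dict, Iterator, Tuple


def iter_samples(fileobj: BinaryIO) -> Iterator[Tuple[str, Dict[str, bytes]]]:
    """Yield (key, {extension: bytes}) per sample, grouping adjacent sibling files."""
    current_key = None
    files: Dict[str, bytes] = {}
    # "r|*" reads the tar as a forward-only stream: no seeking, any compression.
    with tarfile.open(fileobj=fileobj, mode="r|*") as tar:
        for member in tar:
            if not member.isfile():
                continue
            # Assumption: flat names, key is everything before the first dot.
            key, _, ext = member.name.partition(".")
            if key != current_key:
                if files:
                    yield current_key, files
                current_key, files = key, {}
            files[ext] = tar.extractfile(member).read()
    if files:
        yield current_key, files


def build_shard(samples: Dict[str, Dict[str, bytes]]) -> bytes:
    """Write an in-memory tar whose entries follow the sibling-file convention."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for key, files in samples.items():
            for ext, data in files.items():
                info = tarfile.TarInfo(name=f"{key}.{ext}")
                info.size = len(data)
                tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


if __name__ == "__main__":
    shard = build_shard({
        "0001234": {"jpg": b"<jpeg bytes>", "json": b"{}", "txt": b"a caption"},
        "0001235": {"jpg": b"<jpeg bytes>", "txt": b"another caption"},
    })
    for key, files in iter_samples(io.BytesIO(shard)):
        print(key, sorted(files))
```

Because the reader only ever moves forward, the same function works on a file, a pipe, or an HTTP response body, which is what makes the format a good fit for object storage.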

Why this works at scale:

  • Tar is a 50-year-old POSIX format with byte-perfect tooling everywhere. No new format to debug.
  • Tar streams beautifully — it’s literally designed for sequential reads from tape. Modern object stores are tape, conceptually.
  • Multimodal samples with arbitrary file types (image, audio, video, JSON, text) group naturally; you don’t need a schema.
  • Compression happens at the file level (each .jpg is already compressed) or at the tar level (.tar.gz/.tar.zst).

LAION-5B, OpenCLIP training, and most large CV/ASR pretraining pipelines use WebDataset. The library webdataset (originally written by Thomas Breuel) is the canonical reader; the format itself works with any tar tooling.

Hugging Face IterableDataset over Parquet shards

IterableDataset is HF’s streaming counterpart to the in-memory Dataset. Backed by Parquet shards on the HF Hub or any URL-addressable storage, it reads row-group by row-group and yields rows lazily. The API mirrors PyTorch’s IterableDataset: for sample in dataset: ....

This is the path of least resistance for text data already in Parquet, such as RedPajama, The Stack, and OpenWebText. You don’t repack into tar shards; you just stream the existing Parquet directly.

When IterableDataset over Parquet wins
  • The data is already in Parquet (most modern datasets are).
  • Samples are tabular text + metadata, not multimodal blobs.
  • You want columnar projection — read only text and url, skip score, language, etc.
  • You want HF Hub integration and version pinning for free.

The Parquet-over-IterableDataset pattern handles the row-group reads as ~64-128MB sequential GETs, which is well-sized for object storage. It’s slower per sample than WebDataset (Parquet decompression has more overhead than reading raw JPEG bytes from tar), but for text it’s plenty fast.
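A minimal sketch of this pattern with the Hugging Face `datasets` library follows. The helper name `stream_columns` is hypothetical, the repository and config in the usage comment are assumed examples to be replaced with your own, and whether the column selection is pushed down to the Parquet reader may depend on the library version.

```python
from itertools import islice
from typing import Any, Dict, Iterator, List


def stream_columns(repo_id: str, columns: List[str], limit: int,
                   **load_kwargs: Any) -> Iterator[Dict[str, Any]]:
    """Lazily yield up to `limit` rows from a Hub dataset, keeping only `columns`."""
    # Imported lazily so this module can be imported without `datasets` installed.
    from datasets import load_dataset

    # streaming=True returns an IterableDataset; nothing is downloaded up front.
    ds = load_dataset(repo_id, split="train", streaming=True, **load_kwargs)
    ds = ds.select_columns(columns)
    return islice(iter(ds), limit)


# Usage (requires network access and the `datasets` package; names are assumed examples):
#
#   for row in stream_columns("HuggingFaceFW/fineweb", ["text", "url"],
#                             limit=3, name="sample-10BT"):
#       print(row["url"], len(row["text"]))
```

For training you would typically also call `ds.shuffle(seed=..., buffer_size=...)` on the streaming dataset before iterating, which applies a buffer-based shuffle rather than a global one.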

Mosaic StreamingDataset

mosaicml/streaming is the production answer for training-grade streaming, designed by the Mosaic / Databricks team for LLM pretraining at scale. It addresses three weaknesses of WebDataset and IterableDataset:

What StreamingDataset adds
  • Deterministic shuffling. A pinned seed produces the same sample order across runs and across restarts mid-epoch. Required for reproducible training and for resuming a job after a node failure.
  • Elastic resumption. A run that dies at step 47,283 of 100,000 resumes at exactly step 47,284 with the right sample, the right shuffle state, and the right shard cache. WebDataset can do this with effort; StreamingDataset bakes it in.
  • Fixed-size sample blobs. Samples are serialized into a custom binary format (MDS) with explicit offsets, enabling fixed-cost random access within a downloaded shard without sacrificing sequential reads from object storage.

The trade-off is that you have to convert your data into the MDS format — a one-time cost, but a real one. For teams running long-running, expensive training jobs, the elastic-resumption guarantee is what makes it worth it; a 70B-parameter run that loses an hour of work to a node failure has paid for the conversion many times over.
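To illustrate why explicit offsets give cheap random access inside a shard that was downloaded sequentially, here is a toy offset-indexed layout. It is not the actual MDS byte format; the layout, `pack_shard`, and `read_sample` are assumptions made up for this sketch.

```python
import struct
from typing import List


def pack_shard(samples: List[bytes]) -> bytes:
    """Toy layout: uint32 count, then count+1 uint32 byte offsets, then the sample bytes."""
    header_size = 4 + 4 * (len(samples) + 1)
    offsets = [header_size]
    for sample in samples:
        offsets.append(offsets[-1] + len(sample))
    header = struct.pack(f"<I{len(offsets)}I", len(samples), *offsets)
    return header + b"".join(samples)


def read_sample(shard: bytes, index: int) -> bytes:
    """Fetch one sample by index using two offset lookups, without scanning the shard."""
    (count,) = struct.unpack_from("<I", shard, 0)
    if not 0 <= index < count:
        raise IndexError(index)
    start, end = struct.unpack_from("<2I", shard, 4 + 4 * index)
    return shard[start:end]


if __name__ == "__main__":
    shard = pack_shard([b"first sample", b"second", b"third sample, longer"])
    print(read_sample(shard, 1))
```

The shard is still fetched from object storage as one large sequential read; the index only matters once the bytes are local, where seeking is cheap.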

The trade-offs you accept

Pre-computing the shuffle

Sequential reads mean you can’t shuffle on the fly the way you would with a list. The standard trick — two-level shuffling — is what makes streaming feasible for training:

Two-level streaming shuffle
  • Shard-level. Permute the order in which a worker reads its assigned shards. Cheap (a list of N shard URLs in random order). Provides coarse-grained shuffling.
  • Buffer-level. Each worker maintains an in-memory ring buffer of B samples (typically 1K-10K). On __next__, emit a random sample from the buffer and refill from the current shard. Provides fine-grained shuffling.

The combined effect approximates a global shuffle as long as B is much larger than the within-shard correlation — i.e., samples that started together in the same shard get separated by ~B steps in the output. Mosaic StreamingDataset adds shard-aware shuffling (rotate which subset of shards is “active” over time) to mix samples across shard boundaries more aggressively.

The crucial property: the shuffle is deterministic given a seed. Two runs with the same seed see the same sample order, even though no machine ever held the full permutation in memory.
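The two levels described above can be sketched in a few lines. This is a generic illustration rather than any particular library's implementation: shards are modeled as in-memory lists, and the way `seed` and `epoch` are combined into one RNG seed is an arbitrary choice made for the example.

```python
import random
from typing import Iterable, Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def buffer_shuffle(samples: Iterable[T], buffer_size: int, rng: random.Random) -> Iterator[T]:
    """Level 2: emit a random element from a bounded buffer, refilling from the stream."""
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")
    buf: List[T] = []
    for sample in samples:
        if len(buf) < buffer_size:
            buf.append(sample)  # still filling the buffer
            continue
        idx = rng.randrange(buffer_size)
        yield buf[idx]
        buf[idx] = sample       # the incoming sample takes the emitted slot
    rng.shuffle(buf)            # drain what is left at the end of the stream
    yield from buf


def stream_epoch(shards: Sequence[Sequence[T]], seed: int, epoch: int,
                 buffer_size: int) -> Iterator[T]:
    """Level 1 (shard permutation) feeding level 2 (buffer shuffle), seeded per epoch."""
    rng = random.Random(seed * 1_000_003 + epoch)  # arbitrary seed mixing for the sketch
    order = list(range(len(shards)))
    rng.shuffle(order)

    def read_in_order() -> Iterator[T]:
        for shard_id in order:
            yield from shards[shard_id]  # sequential within each shard

    yield from buffer_shuffle(read_in_order(), buffer_size, rng)


if __name__ == "__main__":
    shards = [[f"shard{i}-sample{j}" for j in range(4)] for i in range(3)]
    print(list(stream_epoch(shards, seed=0, epoch=0, buffer_size=5)))
```

Running it twice with the same `seed` and `epoch` yields the same order, and every sample appears exactly once per epoch. Memory use is bounded by `buffer_size`, not by the corpus.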

Resuming mid-epoch

The hard part is restarting a long-running training run from a pinned step (say, step 47,283 of 100,000) without re-reading the samples consumed by the first 47,283 steps and without losing determinism. StreamingDataset’s answer is a state object that captures (epoch, sample_index_in_epoch, shuffle_seed, per-shard download status) and can be serialized to disk every N steps.

On resume, the state is restored: the deterministic shuffle is replayed up to the resume point (cheap — it’s a permutation, not data), the previously-downloaded shards are reused from local cache, and reading continues from the correct offset within the correct shard.

The reason this is non-trivial: a streaming dataset isn’t a list of samples, it’s a stream produced by a multi-stage pipeline (download → decompress → buffer → shuffle → emit). Restoring the state of all of those stages — including in-flight downloads and partially-filled buffers — is the engineering work. Generic IterableDataset and WebDataset don’t guarantee this; you can build it on top with care, but it’s not the default.
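The core of the idea can be shown with a toy stream whose order is a pure function of (seed, epoch) and the shard sizes. This is a hypothetical sketch, not Mosaic's API: `ResumableStream` and `epoch_order` are invented names, the stream yields (shard_id, index_in_shard) pairs instead of data, and it assumes random access within a downloaded shard, which sidesteps the harder problem of restoring in-flight downloads and partially filled buffers.

```python
import random
from typing import Dict, Iterator, List, Tuple


def epoch_order(shard_sizes: List[int], seed: int, epoch: int) -> List[Tuple[int, int]]:
    """Deterministic (shard_id, index_in_shard) order, computed from sizes alone."""
    rng = random.Random(seed * 1_000_003 + epoch)  # arbitrary seed mixing for the sketch
    shard_ids = list(range(len(shard_sizes)))
    rng.shuffle(shard_ids)
    order: List[Tuple[int, int]] = []
    for shard_id in shard_ids:
        within = list(range(shard_sizes[shard_id]))
        rng.shuffle(within)
        order.extend((shard_id, i) for i in within)
    return order


class ResumableStream:
    """Iterates one epoch at a time and can checkpoint its position."""

    def __init__(self, shard_sizes: List[int], seed: int) -> None:
        self.shard_sizes = shard_sizes
        self.seed = seed
        self.epoch = 0
        self.sample_index = 0  # samples already emitted in the current epoch

    def state_dict(self) -> Dict[str, int]:
        return {"epoch": self.epoch, "sample_index": self.sample_index, "seed": self.seed}

    def load_state_dict(self, state: Dict[str, int]) -> None:
        self.seed = state["seed"]
        self.epoch = state["epoch"]
        self.sample_index = state["sample_index"]

    def __iter__(self) -> Iterator[Tuple[int, int]]:
        # Replaying the permutation is cheap: it touches indices, not data.
        order = epoch_order(self.shard_sizes, self.seed, self.epoch)
        for item in order[self.sample_index:]:
            self.sample_index += 1
            yield item
        self.epoch += 1
        self.sample_index = 0


if __name__ == "__main__":
    stream = ResumableStream([3, 5, 2], seed=7)
    it = iter(stream)
    print([next(it) for _ in range(4)], stream.state_dict())
```

Saving `state_dict()` alongside the model checkpoint and calling `load_state_dict()` on a fresh stream continues from the next sample, and only the shards that the remaining order touches need to be present locally.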

For a 70B-parameter pretraining run that takes weeks and has node failures every few days, mid-epoch resumption is the difference between losing 4 hours per failure (find nearest checkpoint, restart, re-stream) and losing 30 seconds (restore state, continue).

Shard size

WebDataset shards that are too small (say, 1MB tars of 10 samples each) hit the same problem as raw small files: every shard is a separate GET, the per-request overhead dominates, and aggregate throughput collapses. The fix is the same: target shard sizes of 100MB-1GB so each GET amortizes its RTT cost across enough data to saturate bandwidth.

LAION-5B uses ~1GB tars of ~10K samples. FineWeb’s Parquet shards are ~500MB-2GB of compressed text. Mosaic MDS shards default to 64MB and are tunable. In every case the floor is “big enough that the read takes much longer than the request setup,” which on modern object stores means tens of MB minimum.

The corollary: when you write a streaming dataset, batch many samples into each shard. Don’t create one shard per sample; don’t create one shard per source URL. Group, compact, write large.
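A size-targeted shard writer is one way to follow that advice. The sketch below is a hypothetical helper (`write_shards` is not a library function) that rolls over to a new tar once the payload reaches a target size, and never splits one sample's files across two shards. It builds shards in memory for brevity; a real writer would stream each shard to disk or object storage.

```python
import io
import tarfile
from typing import Dict, Iterable, Iterator, Tuple


def write_shards(samples: Iterable[Tuple[str, Dict[str, bytes]]],
                 target_bytes: int) -> Iterator[bytes]:
    """Pack (key, {ext: data}) samples into tar shards of roughly target_bytes payload."""
    buf = None
    tar = None
    payload = 0
    for key, files in samples:
        if tar is None:
            # Start a new shard.
            buf = io.BytesIO()
            tar = tarfile.open(fileobj=buf, mode="w")
            payload = 0
        for ext, data in files.items():
            info = tarfile.TarInfo(name=f"{key}.{ext}")
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
            payload += len(data)
        # Roll over only between samples so sibling files stay together.
        if payload >= target_bytes:
            tar.close()
            yield buf.getvalue()
            tar = None
    if tar is not None:
        tar.close()
        yield buf.getvalue()


if __name__ == "__main__":
    samples = ((f"{i:07d}", {"txt": b"x" * 100}) for i in range(10))
    for n, shard in enumerate(write_shards(samples, target_bytes=300)):
        print(f"shard-{n:05d}.tar: {len(shard)} bytes")
```

The demo uses a 300-byte target so the rollover is visible; in practice `target_bytes` would sit in the 100MB-1GB range discussed above.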

The bigger picture

Streaming datasets are how you decouple the size of the corpus from the size of any single training worker. The corpus lives on object storage; a worker pulls a small steady-state subset; the training job scales by adding workers, not by getting bigger machines. The architectural pattern is a familiar one: compute follows data, data is sharded for parallel reads, and the storage layer is the source of truth.

In 2026, every serious LLM and CV training pipeline streams. The single-machine “load the whole dataset and shuffle” pattern is a Kaggle-tutorial artifact that doesn’t survive contact with real data scale.

Go further

Why is random access on S3 so much slower than sequential reads?

S3 (and GCS, Azure Blob) are HTTP services with ~50-200ms of round-trip latency per GET. A random 1KB read costs you a full RTT plus a negligible transfer time; a sequential 100MB read costs you one RTT plus a transfer time long enough to amortize it. The fixed-cost RTT dominates for small reads, and on a 10Gbps link you can't pipeline reads fast enough to hide it without large requests. In practice, sequential reads of 10-100MB chunks hit ~80% of theoretical bandwidth; random 1KB reads hit ~1%. Streaming formats are designed around this asymmetry.

How do streaming datasets shuffle without random access?

Two-level shuffling. First, shuffle the order of shards a worker reads (cheap — just permute a list of file URLs). Second, fill an in-memory buffer of size B (typically 1K-10K examples) and emit a random element from the buffer per __next__, refilling from the current shard. The combination approximates a global shuffle as long as B is much larger than per-shard correlations. Mosaic StreamingDataset, WebDataset, and IterableDataset all use variants of this with per-epoch deterministic seeds.

WebDataset vs Mosaic StreamingDataset vs Parquet IterableDataset — which one?

WebDataset (tar shards of .jpg/.json/.txt siblings) is dominant in CV and audio — the tar format is dead simple, POSIX-compatible, and groups multimodal samples naturally. Mosaic's StreamingDataset is the production answer for LLM pretraining: explicit deterministic shuffling, elastic resumption mid-epoch, fixed-size sample blobs for predictable I/O. Hugging Face IterableDataset over Parquet is the easiest to author and integrates with the HF hub; it's what you reach for when the data is already in Parquet and you want to stream without rewriting it.
