What's the pipeline bubble and how big is it?
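With p pipeline stages and m micro-batches per step, each stage sits idle for p − 1 of the m + p − 1 time slots while the pipeline fills and drains, so the bubble fraction is (p − 1)/(m + p − 1). It shrinks as m grows relative to p.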
Also known as: PP, pipeline model parallelism, GPipe, 1F1B
Pipeline parallelism splits a model along its layer dimension, assigns groups of consecutive layers to separate GPU stages, and feeds micro-batches through the stages in an assembly-line schedule. GPipe (2018) introduced the basic idea; 1F1B (PipeDream, 2019) reduced its memory footprint.
Where tensor parallelism shards a single matmul across GPUs, pipeline parallelism shards entire layers — a much coarser cut, with much lighter communication (point-to-point activation sends between adjacent stages, not all-reduces). It scales beyond the NVLink domain comfortably and is the standard tool for spanning a model across nodes when tensor parallelism alone isn’t enough.
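A 96-layer model on 8 pipeline stages assigns 12 layers to each stage. A training step splits the global batch into m micro-batches and streams them through the stages: once stage 1 finishes the forward pass for the first micro-batch, it hands the activations to stage 2 and starts on the second, so several micro-batches are in flight at once.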
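The layer-to-stage assignment can be sketched in a few lines. This is a minimal illustration that assumes contiguous, near-equal groups; the function name `assign_layers` is hypothetical, and real frameworks may balance stages differently (for example, giving the first or last stage fewer layers to offset the embedding and output head).

```python
def assign_layers(num_layers: int, num_stages: int) -> list[range]:
    """Split layer indices into contiguous, near-equal groups, one per stage."""
    base, extra = divmod(num_layers, num_stages)
    stages = []
    start = 0
    for stage in range(num_stages):
        # The first `extra` stages take one additional layer when the split is uneven.
        size = base + (1 if stage < extra else 0)
        stages.append(range(start, start + size))
        start += size
    return stages


stages = assign_layers(96, 8)
print([len(r) for r in stages])  # [12, 12, 12, 12, 12, 12, 12, 12]
print(stages[1])                 # range(12, 24)
```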
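The communication between stages is one activation tensor send forward (and one gradient send back) per micro-batch boundary, typically a few hundred MB on cross-node InfiniBand at 50-100 GB/s. The volume is roughly b × s × h elements per send, where b is the micro-batch size, s the sequence length, and h the hidden dimension, at 2 bytes per element in bf16.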
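As a rough back-of-the-envelope check, the size of one inter-stage send can be estimated from the activation tensor's shape. The sketch below assumes a dense activation of micro-batch × sequence length × hidden size in bf16; the shape values are hypothetical, chosen only to land in the "few hundred MB" range, and the helper names are not from any library.

```python
def activation_bytes(micro_batch: int, seq_len: int, hidden: int,
                     bytes_per_elem: int = 2) -> int:
    """Bytes in one activation tensor sent between adjacent stages."""
    return micro_batch * seq_len * hidden * bytes_per_elem


def send_seconds(num_bytes: int, bandwidth_gb_per_s: float) -> float:
    """Idealized transfer time, ignoring latency and protocol overhead."""
    return num_bytes / (bandwidth_gb_per_s * 1e9)


# Hypothetical shape: 1 sequence of 8192 tokens, hidden size 16384, bf16.
size = activation_bytes(micro_batch=1, seq_len=8192, hidden=16384)
print(f"{size / 1e6:.0f} MB per send")
print(f"{send_seconds(size, 50) * 1e3:.2f} ms at 50 GB/s")
```

The same volume flows backward as a gradient, so a stage boundary carries about twice this per micro-batch.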
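The cost is pipeline filling and draining. With p stages and m micro-batches, a step takes m + p − 1 time slots per stage, and each stage is idle for p − 1 of them, so the bubble fraction is (p − 1)/(m + p − 1). At p = 8 and m = 32, for example, 7 of 39 slots are idle (about 18%); raising m to 128 cuts that to about 5%.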
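The bubble is easy to tabulate. The sketch below assumes the standard non-interleaved schedule, equal time per micro-batch on every stage, and uses `p` for the number of stages and `m` for the number of micro-batches (names chosen here for illustration).

```python
from fractions import Fraction


def bubble_fraction(p: int, m: int) -> Fraction:
    """Fraction of a step each stage spends idle: (p - 1) / (m + p - 1)."""
    if p < 1 or m < 1:
        raise ValueError("p and m must be positive")
    return Fraction(p - 1, m + p - 1)


for m in (8, 32, 128):
    f = bubble_fraction(8, m)
    print(f"p=8, m={m}: {f} = {float(f):.1%}")
# p=8, m=8: 7/15 = 46.7%
# p=8, m=32: 7/39 = 17.9%
# p=8, m=128: 7/135 = 5.2%
```

Keeping m several times larger than p is what makes the bubble tolerable; with m equal to p, nearly half of each step is idle.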
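GPipe runs the whole forward sweep over all m micro-batches before starting any backward pass, so every stage must keep the activations of all m micro-batches until the backward sweep reaches them.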
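1F1B (PipeDream, 2019) interleaves: as soon as a stage finishes the forward for micro-batch i and the matching gradient has arrived from the next stage, it runs the backward for i and frees those activations. In steady state each stage alternates one forward with one backward and holds at most p micro-batches of activations instead of all m. The bubble is the same as GPipe's.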
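The memory difference between the two schedules shows up when you list the order in which each stage issues its forward (F) and backward (B) passes. The sketch below is a simplified model, not a framework implementation: it records only per-stage issue order (no timing or communication), uses the non-interleaved 1F1B pattern of warm-up forwards, alternating steady state, and cool-down backwards, and all function names are hypothetical.

```python
Op = tuple[str, int]  # ("F" or "B", micro-batch index)


def gpipe_order(m: int) -> list[Op]:
    """All forwards, then all backwards (backward order simplified)."""
    return [("F", i) for i in range(m)] + [("B", i) for i in range(m)]


def one_f_one_b_order(stage: int, p: int, m: int) -> list[Op]:
    """Warm-up forwards, then alternate F/B, then drain remaining backwards."""
    warmup = min(p - stage - 1, m)
    ops: list[Op] = [("F", i) for i in range(warmup)]
    f, b = warmup, 0
    while f < m:  # steady state: one forward, one backward
        ops.append(("F", f))
        f += 1
        ops.append(("B", b))
        b += 1
    while b < m:  # cool-down
        ops.append(("B", b))
        b += 1
    return ops


def peak_in_flight(ops: list[Op]) -> int:
    """Largest number of micro-batches whose activations are held at once."""
    live = peak = 0
    for kind, _ in ops:
        live += 1 if kind == "F" else -1
        peak = max(peak, live)
    return peak


p, m = 4, 8
print([peak_in_flight(gpipe_order(m)) for _ in range(p)])                 # [8, 8, 8, 8]
print([peak_in_flight(one_f_one_b_order(s, p, m)) for s in range(p)])     # [4, 3, 2, 1]
```

Under this model GPipe's peak grows with m, while 1F1B's peak is bounded by the number of stages, which is why adding micro-batches to shrink the bubble does not inflate activation memory under 1F1B.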
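Modern frontier training composes three parallelism axes simultaneously, known as “3D parallelism”: tensor parallelism (TP) inside each NVLink-coupled node, pipeline parallelism (PP) across nodes, and data parallelism (DP) across replicas of the whole pipeline.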
For a 405B Llama-class model on 1024 H100s: TP=8 within node, PP=8 across nodes, DP=16 replicas. Each DP replica is one pipeline of 8 stages, each stage is 8 NVLink-coupled GPUs, each layer is split across those 8 via tensor parallelism. Megatron-LM, DeepSpeed, and NeMo all express this composition; the sharding plan is a 3D mapping from logical model parameters to physical (DP-rank, PP-stage, TP-rank) coordinates.
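One way to picture the 3D mapping is as a conversion from a flat GPU rank to (DP-rank, PP-stage, TP-rank) coordinates. The sketch below assumes a hypothetical layout in which TP varies fastest (so the 8 TP ranks of a stage share a node), then PP, then DP; it is an illustration only and does not reproduce the rank ordering of Megatron-LM, DeepSpeed, or NeMo.

```python
from typing import NamedTuple


class Coord(NamedTuple):
    dp: int  # data-parallel replica index
    pp: int  # pipeline stage index
    tp: int  # tensor-parallel rank within the stage


def rank_to_coord(rank: int, tp: int = 8, pp: int = 8, dp: int = 16) -> Coord:
    """Map a flat rank to coordinates under the assumed TP-fastest layout."""
    world = tp * pp * dp
    if not 0 <= rank < world:
        raise ValueError(f"rank must be in [0, {world})")
    return Coord(dp=rank // (tp * pp), pp=(rank // tp) % pp, tp=rank % tp)


print(rank_to_coord(0))     # Coord(dp=0, pp=0, tp=0)
print(rank_to_coord(9))     # Coord(dp=0, pp=1, tp=1)
print(rank_to_coord(1023))  # Coord(dp=15, pp=7, tp=7)
```

With TP=8, PP=8, and DP=16 the world size is 8 × 8 × 16 = 1024, matching the GPU count in the example above.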
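Pipeline parallelism’s bandwidth requirements are mild (point-to-point activation sends, not all-reduces), so it scales well across slow interconnect. Its memory savings are linear in the number of stages p: each stage stores only its own layers, roughly 1/p of the model's parameters, gradients, and optimizer state.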
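The linear saving can be illustrated with a rough weight-memory estimate. The sketch below counts bf16 weights only and assumes parameters divide evenly across pipeline stages and tensor-parallel ranks; gradients, optimizer state, and activations are deliberately left out, and the function name is hypothetical.

```python
def weight_gb_per_gpu(params: float, pp: int, tp: int = 1,
                      bytes_per_param: int = 2) -> float:
    """Approximate weight memory per GPU in GB under an even PP x TP split."""
    return params * bytes_per_param / (pp * tp) / 1e9


# 405B parameters, TP=8 within a node, varying the number of pipeline stages.
for pp in (1, 2, 4, 8):
    print(f"PP={pp}: {weight_gb_per_gpu(405e9, pp=pp, tp=8):.2f} GB")
# PP=1: 101.25 GB
# PP=2: 50.62 GB
# PP=4: 25.31 GB
# PP=8: 12.66 GB
```

Doubling the number of stages halves the per-GPU weight footprint, at the price of a larger bubble for a fixed number of micro-batches.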
GPipe schedules all forward passes for all micro-batches first, then all backward passes. That is simple, but each stage holds activations for every in-flight micro-batch in memory. 1F1B (one-forward-one-backward) interleaves them: a stage runs the backward for a micro-batch as soon as its gradient is available, then frees that micro-batch's activations immediately. The bubble is the same, but peak activation memory is much lower. 1F1B is the production default; interleaved 1F1B (Megatron's variant) further reduces the bubble at the cost of more communication.
Pipeline parallelism exists because the alternatives don't scale far enough. Tensor parallelism works inside an NVLink domain (~8-16 GPUs) before its all-reduce traffic dominates. Data parallelism replicates the entire model on every replica, which is useless if the model doesn't fit on one node. Pipeline parallelism fills the gap: shard layers across nodes with cheap point-to-point sends, eat a small bubble, and you can train models that span hundreds of GPUs. 3D parallelism stacks all three to reach frontier scale.