Pipeline Parallelism

Also known as: PP, pipeline model parallelism, GPipe, 1F1B

TL;DR

Pipeline parallelism splits a model's layers across GPU stages and feeds micro-batches through them in an assembly-line schedule. GPipe (2018) introduced the basic idea; 1F1B (PipeDream, 2019) reduced its memory footprint.

Pipeline parallelism splits a model along its layer dimension, assigns groups of consecutive layers to separate GPU stages, and feeds micro-batches through the stages in an assembly-line schedule.

Where tensor parallelism shards a single matmul across GPUs, pipeline parallelism shards entire layers — a much coarser cut, with much lighter communication (point-to-point activation sends between adjacent stages, not all-reduces). It scales beyond the NVLink domain comfortably and is the standard tool for spanning a model across nodes when tensor parallelism alone isn’t enough.

How an assembly line maps onto a transformer

A 96-layer model on 8 pipeline stages assigns 12 layers to each stage. A training step splits the global batch into micro-batches that flow through the stages.

Stage 0 runs micro-batch 1’s forward, sends activations to stage 1, then starts micro-batch 2’s forward.
Stage 1 runs micro-batch 1 once stage 0’s output arrives.
After the pipe fills (after $p - 1$ steps for $p$ stages, so 7 steps here), every stage works on a different micro-batch concurrently.
Backward passes run in reverse order; gradients flow stage-by-stage from the loss back to the inputs.
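The fill phase described above can be sketched as a small timetable. This is a minimal illustration, not a real scheduler: it assumes every stage takes exactly one time step per micro-batch forward, and the function name `forward_schedule` is hypothetical. Micro-batches are 0-indexed in the code.

```python
from typing import List, Optional


def forward_schedule(num_stages: int, num_microbatches: int) -> List[List[Optional[int]]]:
    """Return grid[t][s]: the micro-batch stage s runs at time step t, or None if idle.

    Assumption: each forward takes one time step on every stage.
    """
    total_steps = num_microbatches + num_stages - 1
    grid = []
    for t in range(total_steps):
        row = []
        for s in range(num_stages):
            mb = t - s  # stage s sees a micro-batch s steps after stage 0 did
            row.append(mb if 0 <= mb < num_microbatches else None)
        grid.append(row)
    return grid


if __name__ == "__main__":
    # Print one row per time step; "." marks an idle stage.
    for t, row in enumerate(forward_schedule(num_stages=4, num_microbatches=6)):
        print(t, " ".join("." if mb is None else str(mb) for mb in row))
```

With 4 stages, the first row in which no stage is idle is time step 3, which is the $p - 1$ fill delay.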

Communication between stages is one activation tensor sent forward (and one gradient tensor sent back) per micro-batch at each stage boundary, typically a few hundred MB over cross-node InfiniBand at 50-100 GB/s. That volume is a small fraction of what tensor parallelism’s all-reduces would move at the same scale, which is why pipeline parallelism is cross-node-friendly where TP isn’t.
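As a rough sanity check on the "few hundred MB" figure, the size of one boundary activation can be estimated from its shape. The shape used here (micro-batch of 1, 8192 tokens, hidden size 16384, 2-byte elements) is an illustrative assumption, not a number from any specific model, and the helper names are hypothetical.

```python
def activation_bytes(micro_batch_size: int, seq_len: int, hidden_size: int,
                     bytes_per_element: int = 2) -> int:
    """Size of one [micro_batch, seq_len, hidden] activation tensor in bytes."""
    return micro_batch_size * seq_len * hidden_size * bytes_per_element


def send_time_ms(num_bytes: int, bandwidth_gb_per_s: float) -> float:
    """Idealized transfer time in milliseconds; ignores latency and protocol overhead."""
    return num_bytes / (bandwidth_gb_per_s * 1e9) * 1e3


if __name__ == "__main__":
    size = activation_bytes(micro_batch_size=1, seq_len=8192, hidden_size=16384)
    print(f"{size / 1e6:.0f} MB per send")
    for bw in (50, 100):  # GB/s, the InfiniBand range quoted in the text
        print(f"{bw} GB/s: {send_time_ms(size, bw):.2f} ms")
```

Under these assumptions each send is about 268 MB and takes a few milliseconds, which is small next to the compute time of a 12-layer stage.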

The bubble

The cost is pipeline filling and draining. With $p$ stages and $m$ micro-batches per step, the bubble fraction is

$$\text{bubble} = \frac{p - 1}{m + p - 1}$$

At $p = 8$ and $m = 64$ that’s about 10 percent ($7/71$). At $p = 8$ and $m = 8$ it’s about 50 percent ($7/15$): your GPUs sit idle roughly half the time. The fix is more micro-batches (or fewer stages), which means a larger global batch size. This is why pipeline parallelism pairs with large batches and why it doesn’t compose with the small-batch regime. Small models trained on a single node should not be using pipeline parallelism.
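The bubble arithmetic can be checked with a short sketch. It assumes the usual idealized model, in which every slot takes equal time and the idle share of a step is $(p - 1)/(m + p - 1)$ for $p$ stages and $m$ micro-batches. The function names are hypothetical.

```python
def bubble_fraction(num_stages: int, num_microbatches: int) -> float:
    """Idle share of a training step under the equal-slot model."""
    return (num_stages - 1) / (num_microbatches + num_stages - 1)


def min_microbatches(num_stages: int, max_bubble: float) -> int:
    """Smallest micro-batch count whose bubble fraction is strictly below max_bubble."""
    m = 1
    while bubble_fraction(num_stages, m) >= max_bubble:
        m += 1
    return m


if __name__ == "__main__":
    print(f"p=8, m=64: {bubble_fraction(8, 64):.1%}")
    print(f"p=8, m=8:  {bubble_fraction(8, 8):.1%}")
    print("micro-batches needed for <10% bubble at 8 stages:",
          min_microbatches(8, 0.10))
```

Because the fraction depends only on the ratio of stages to micro-batches, doubling the stage count roughly doubles the micro-batches needed to hold the same bubble.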

GPipe runs the whole forward sweep over all micro-batches first, then the whole backward sweep. The schedule is simple, the bubble is what the formula says, but every stage has to keep activations for all in-flight forwards in memory until the corresponding backward arrives, so activation memory scales with $m$. For a stage holding 12 transformer layers and 70 in-flight micro-batches at long context, this hits OOM fast.

1F1B (PipeDream, 2019) interleaves: as soon as a stage finishes the forward for micro-batch $i$, it immediately starts the backward for some earlier micro-batch whose downstream forward is done. The same total work is done in the same wall time, so the bubble is unchanged: every stage is still idle for $p - 1$ slots at start and end. But each stage only holds activations for ~$p$ micro-batches in memory, not $m$. Activation memory drops from $O(m)$ to $O(p)$, which is the difference between fitting and not fitting. Megatron’s “interleaved 1F1B” schedule splits each stage into $v$ virtual stages for a smaller pipeline-bubble fraction, paying more point-to-point sends for less idle time.
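The memory difference between the two schedules can be made concrete by counting how many micro-batches' activations each stage holds at its peak. This sketch uses closed-form counts and assumes the standard non-interleaved 1F1B schedule, where stage $s$ (0-indexed) runs $p - s - 1$ warm-up forwards before alternating one forward with one backward. The function name is hypothetical.

```python
from typing import List


def peak_in_flight(schedule: str, num_stages: int, num_microbatches: int) -> List[int]:
    """Peak number of micro-batches whose activations each stage holds at once."""
    peaks = []
    for stage in range(num_stages):
        if schedule == "gpipe":
            # All forwards finish before any backward starts.
            peaks.append(num_microbatches)
        elif schedule == "1f1b":
            # Warm-up forwards plus one steady-state forward, capped by m.
            peaks.append(min(num_microbatches, num_stages - stage))
        else:
            raise ValueError(f"unknown schedule: {schedule!r}")
    return peaks


if __name__ == "__main__":
    print("GPipe:", peak_in_flight("gpipe", num_stages=8, num_microbatches=70))
    print("1F1B: ", peak_in_flight("1f1b", num_stages=8, num_microbatches=70))
```

For 8 stages and 70 micro-batches, the first stage holds 70 sets of activations under GPipe but only 8 under 1F1B, and later stages hold fewer still.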

Stacking with tensor and data parallelism

Modern frontier training composes three parallelism axes simultaneously — “3D parallelism”:

Tensor parallelism within a node (TP = 8 typical), splitting matmuls across NVLink-connected GPUs.
Pipeline parallelism across nodes (PP = 4-16), splitting layer groups across InfiniBand-connected nodes.
Data parallelism across the resulting “model replicas,” replicating the (TP x PP)-sharded model to scale batch size.

For a 405B Llama-class model on 1024 H100s: TP=8 within node, PP=8 across nodes, DP=16 replicas. Each DP replica is one pipeline of 8 stages, each stage is 8 NVLink-coupled GPUs, each layer is split across those 8 via tensor parallelism. Megatron-LM, DeepSpeed, and NeMo all express this composition; the sharding plan is a 3D mapping from logical model parameters to physical (DP-rank, PP-stage, TP-rank) coordinates.
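The 3D mapping can be sketched as plain index arithmetic. The rank ordering below (TP varies fastest so each group of 8 consecutive ranks is one node, then PP, then DP) is an assumption chosen for illustration; real frameworks define their own ordering, and the names `Coord` and `rank_to_coord` are hypothetical.

```python
from typing import NamedTuple


class Coord(NamedTuple):
    dp_rank: int
    pp_stage: int
    tp_rank: int


def rank_to_coord(global_rank: int, tp: int = 8, pp: int = 8, dp: int = 16) -> Coord:
    """Map a flat GPU rank to (DP-rank, PP-stage, TP-rank) coordinates.

    Assumed layout: TP fastest, then PP, then DP.
    """
    world_size = tp * pp * dp
    if not 0 <= global_rank < world_size:
        raise ValueError(f"rank {global_rank} outside world size {world_size}")
    tp_rank = global_rank % tp
    pp_stage = (global_rank // tp) % pp
    dp_rank = global_rank // (tp * pp)
    return Coord(dp_rank, pp_stage, tp_rank)


if __name__ == "__main__":
    for rank in (0, 7, 8, 64, 1023):
        print(rank, rank_to_coord(rank))
```

With the defaults, ranks 0-7 form the tensor-parallel group of pipeline stage 0 in replica 0, rank 8 starts stage 1, and rank 64 starts the second data-parallel replica.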

Where the limits are

Pipeline parallelism’s bandwidth requirements are mild (point-to-point activation sends, not all-reduces), so it scales well across slow interconnect. Its memory savings are linear in $p$ (each stage holds $1/p$ of the layers and $1/p$ of the optimizer state), which is what makes 1T+ parameter training tractable on commodity InfiniBand topologies. Its bubble is the binding overhead, and it shrinks the right way as global batch grows. MFU for well-tuned 3D parallelism (TP + PP + DP combined) sits in the 40-50 percent range on H100 clusters in 2026, with bubble accounting for 5-10 percent of the lost ceiling.

Go further

What's the pipeline bubble and how big is it?

With $p$ stages and $m$ micro-batches, naive sequential execution leaves $p - 1$ stages idle at the start (filling the pipe) and $p - 1$ idle at the end (draining it). Bubble fraction is $(p - 1)/(m + p - 1)$. To get bubble below 10 percent at 8 stages you need at least 64 micro-batches per training step ($7/71 \approx 9.9$ percent). This is why PP needs large global batches and why it doesn't compose well with very small batches.

GPipe vs 1F1B — what's the difference?

GPipe schedules all forward passes for all micro-batches first, then all backward passes. That is simple, but each stage holds activations for every in-flight micro-batch in memory. 1F1B (one-forward-one-backward) interleaves them: after a short warm-up, each stage alternates one forward with one backward for an earlier micro-batch, so activations are freed as soon as their backward runs. Same bubble, much lower peak activation memory. 1F1B is the production default; interleaved 1F1B (Megatron's variant) further reduces bubble at the cost of more communication.

Why use pipeline parallelism at all if it has a bubble?

Because the alternatives don't scale far enough. Tensor parallelism works inside an NVLink domain (~8-16 GPUs) before its all-reduce traffic dominates. Data parallelism replicates the entire model — useless if the model doesn't fit on one node. Pipeline parallelism fills the gap: shard layers across nodes with cheap point-to-point sends, eat a small bubble, and you can train models that span hundreds of GPUs. 3D parallelism stacks all three to reach frontier scale.
