Tensor Parallelism

Also known as: TP, Megatron-style parallelism, tensor model parallelism

TL;DR

Tensor parallelism splits each individual matrix multiplication across multiple GPUs — column-split or row-split with an all-reduce — so weights too large for one GPU's memory still produce one combined result.

Tensor parallelism splits an individual matrix multiplication across multiple GPUs and stitches the result back with a collective communication primitive. It’s the parallelism flavor that breaks up a single linear layer — not the model, not the batch, but each weight matrix — and the canonical way to fit and serve LLMs that don’t fit on one device. Introduced in Megatron-LM (Shoeybi et al., 2019), it dominates intra-node parallelism for any model above roughly 30B parameters in 2026.

The two splits

A linear layer is , with . There are two ways to slice across GPUs:

Column-parallel. Split along the output dimension into shards . Each GPU holds the full input and computes , producing a slice of the output. The output is left sharded; no communication required at this step.
Row-parallel. Split along the input dimension into . Each GPU receives a slice of the input and computes a partial . The full output requires an all-reduce: .

Used alone, each split adds an all-reduce per layer. The Megatron trick is pairing them.

The Megatron pattern

A transformer block has two communication-amenable sub-blocks. In each, the input is full-dim, then projects up, then projects back down. Pair the two projections — column-parallel up, row-parallel down — and the intermediate (the wide one) stays sharded across GPUs without communication. You only pay an all-reduce on the final output.

The FFN is , with and . Split column-parallel: each GPU produces a slice of the wide hidden activation, fully shard. The nonlinearity is elementwise — no communication. Split row-parallel: each GPU multiplies its slice of the hidden activation by its slice of , producing a partial output. Then one all-reduce sums the partials into the final output.

Total communication per FFN: one all-reduce of size floats. Without the pairing you’d pay two — once after and once after . The savings double in attention, where there are four projections (Q, K, V, O), grouped as column-parallel for Q/K/V and row-parallel for O.

Backward pass is symmetric — every all-reduce in the forward becomes an all-reduce in the backward. Total per transformer block: 4 all-reduces (2 in attention, 2 in FFN, counting forward and backward).

What it costs

Communication. For a 70B model with , batch 4, seqlen 2048, and TP=8: each all-reduce moves about 0.5 GB of activation data across the 8-GPU NVLink mesh. Multiply by 80 layers, multiply by 4 all-reduces per layer, multiply by 2 for forward+backward: ~256 GB per training step per worker, all of it through NVLink. At 900 GB/s NVLink bandwidth that’s a few hundred milliseconds of pure communication time per step.

This is why TP lives inside the NVLink domain. Cross-node InfiniBand at 50-100 GB/s would turn that overhead from “noticeable” to “dominant.” The conventional split: TP=8 within a node, pipeline parallelism or data parallelism across nodes for additional scale.

What TP does for memory and throughput

For a -way TP shard:

Weight memory drops by . A 70B fp16 model is 140 GB total; on TP=8 each GPU holds ~17.5 GB.
Activation memory mostly drops by (with sequence-parallelism extensions; without them, the layernorm/residual paths stay full).
Optimizer state drops by if the optimizer is also sharded (Megatron’s distributed optimizer or ZeRO-style sharding).
Throughput per GPU drops modestly — the all-reduce time is ~10-20 percent of step time on a well-tuned TP=8 stack. MFU typically drops 3-5 percentage points relative to a smaller model that fits on one GPU.

The point is that without TP the model doesn’t fit at all. Five points of MFU is a small price for being able to train a model that wouldn’t otherwise exist.

Where it sits in the parallelism stack

TP is one axis of “3D parallelism”: tensor parallelism within a node, pipeline parallelism across nodes, data parallelism replicating the whole sharded model. Modern training stacks (Megatron-LM, DeepSpeed, NeMo) compose all three. For inference, TP is the dominant strategy — vLLM, TensorRT-LLM, and SGLang all default to TP across the GPUs in a server, with pipeline parallelism reserved for cases where TP can’t shard small enough. The two parallelism dimensions answer different questions: TP solves “this model doesn’t fit on one GPU”; pipeline parallelism solves “this model doesn’t fit on one node.”

Go further

Column-parallel vs row-parallel — why do you need both?

Megatron's trick is to split consecutive linear layers in opposite directions so the activations between them stay sharded — no all-reduce in between. Column-parallel splits a weight matrix along its output dimension; each GPU produces a slice of the output. Row-parallel splits along the input dimension; each GPU sees a slice of the input and contributes a partial output that gets all-reduced. Pairing them (column then row in an FFN, for example) reduces communication to one all-reduce per FFN block instead of two.

Transformer GPU memory hierarchy

Why is TP intra-node and not cross-node?

TP requires an all-reduce on every transformer block — twice per block in attention, once per FFN. The volume is roughly hidden_dim x batch x seqlen bytes per all-reduce. NVLink within an 8-GPU node delivers ~900 GB/s; InfiniBand between nodes is closer to 50-100 GB/s. Cross-node TP makes the communication step the bottleneck of every forward pass. Standard practice: TP within an 8-GPU node, pipeline parallelism or data parallelism across nodes.

Pipeline parallelism MFU

How does sequence parallelism extend it?

TP shards weights and matmul-stage activations but leaves the layernorm, dropout, and residual paths replicated across all ranks — wasted memory. Sequence parallelism (Korthikanti et al. 2022) shards those normalized activations along the sequence dimension, replacing the all-reduce with an all-gather and a reduce-scatter. Same total communication, lower activation memory. Standard in modern Megatron stacks; required for very long context training.

Pipeline parallelism GPU memory hierarchy

← All concepts

The best AI teams build with ZeroEntropy models

Book Demo View docs