Noise2Signal

The five things you can shard

Saksham Consul — Mon, 25 May 2026 01:04:04 GMT

In the last post I argued that training at scale is a systems problem, and the system you’re working against is the bandwidth hierarchy of your physical cluster. Each parallelism paradigm is a bet about where in that hierarchy you can afford to communicate.

This post answers the next question: what bets can you actually make?

When you read the literature, the answer looks complicated. DDP, ZeRO-1, ZeRO-2, ZeRO-3, FSDP, HSDP, TP, SP, PP (GPipe, 1F1B, interleaved, zero-bubble), CP, EP, ring attention, expert choice, DeepEP - every framework has its own vocabulary, every paper introduces a new acronym, and a new practitioner could be forgiven for thinking there are dozens of distinct techniques to learn.

The reality is simpler. A training step has a small number of dimensions you can split work along, and every named paradigm is a strategy for splitting one of those dimensions: sometimes one alone, sometimes a few combined. Once you have the dimensions in your head, the zoo of paradigms organizes itself into a small map.

The table below is the map at a glance - what each axis splits, what communication it forces, which tier of fabric it lives on. If you’ve worked with a few of these and just need the layout, the table is the post. If you want to know why each row reads the way it does, the sections after it walk through each axis in turn.

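[Table not reproduced: one row per axis (batch, hidden dimension, layer depth, sequence, experts), giving what it splits, the communication it forces, and the fabric tier it lives on, followed by the modifiers, which are knobs within an axis rather than separate axes.]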

There are five dimensions worth knowing. Here they are, in the order I find easiest to remember.

1. Batch (B): data parallelism

What gets split: the batch of training examples. Each rank processes a different shard of the batch through the same model weights.

Communication primitive: AllReduce on gradients at the end of each step. Each rank computes a gradient on its data shard; the cluster sums them so every rank ends up with the same averaged gradient.
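To make the "same averaged gradient" claim concrete, here is a toy sketch. It assumes equal-size shards and a mean-reduced loss; the 1-D least-squares model and the `all_reduce_mean` helper are illustrative stand-ins, not a real collective.

```python
# Toy model: scalar weight w, loss = mean((w * x - y)^2) over the examples.
def grad(w, examples):
    """Gradient of the mean squared error with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in examples) / len(examples)


def all_reduce_mean(values):
    """Stand-in for an AllReduce: every rank ends up with the same average."""
    mean = sum(values) / len(values)
    return [mean] * len(values)


batch = [(1.0, 2.0), (2.0, 3.0), (3.0, 7.0), (4.0, 9.0)]
w = 0.5
num_ranks = 2
shard_size = len(batch) // num_ranks
shards = [batch[r * shard_size:(r + 1) * shard_size] for r in range(num_ranks)]

local_grads = [grad(w, shard) for shard in shards]  # one gradient per rank
synced_grads = all_reduce_mean(local_grads)         # identical on every rank
full_batch_grad = grad(w, batch)                    # single-GPU reference
```

With equal shards, `synced_grads[0]` matches `full_batch_grad`, which is why every rank can apply the same optimizer update without exchanging anything else.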

Paradigm names: DDP (the textbook version), and all the ZeRO/FSDP/HSDP variants which we’ll discuss separately because they’re really refinements of DP rather than a new axis.

Why this exists: it’s the trivial parallelism - you can always train on more data in parallel. The forcing function is that one GPU’s batch is too small to get statistically meaningful gradients in reasonable time.

Cost: one allreduce per step on the full gradient (~2P bytes per rank). This is the cheapest communication pattern in the menu because it’s infrequent (once per step, not once per layer) and can be overlapped with backward compute. It’s the bottom of the communication intensity hierarchy.

Where it sits: tolerable on the slowest fabric in your hierarchy (inter-rack IB). This is why DP is the outermost layer of most composition recipes - it can pay the slowest fabric without bleeding throughput.

2. Hidden dimension (H): tensor parallelism

What gets split: the matmul itself. Each linear layer’s weight matrix is sharded, column-wise for some layers, row-wise for others, and the matmul is distributed across ranks accordingly.

Communication primitive: AllReduce inside each layer. The classic Megatron-LM recipe gets you down to one allreduce per attention block and one per MLP block, so roughly two allreduces per layer per microbatch in the forward pass, and four per layer per microbatch once you count backward.
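A minimal sketch of where that single MLP allreduce comes from, assuming a two-matmul MLP with the first weight split column-wise and the second split row-wise. The matrices are made up, ReLU stands in for the real activation, and the helpers are plain-Python stand-ins rather than a framework API.

```python
def matmul(a, b):
    """Plain nested-list matrix multiply."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def relu(m):
    return [[max(0.0, v) for v in row] for row in m]


def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


X = [[1.0, -2.0], [0.5, 3.0]]                            # (tokens, hidden)
A = [[1.0, 0.0, -1.0, 2.0], [0.5, 1.0, 1.0, -1.0]]       # (hidden, ffn)
B = [[1.0, 2.0], [0.0, 1.0], [-1.0, 1.0], [2.0, 0.0]]    # (ffn, hidden)

reference = matmul(relu(matmul(X, A)), B)                # unsharded result

# Rank r owns columns [2r, 2r + 2) of A and the matching rows of B.
partials = []
for r in range(2):
    A_shard = [row[2 * r:2 * r + 2] for row in A]        # column-parallel
    B_shard = B[2 * r:2 * r + 2]                         # row-parallel
    hidden = relu(matmul(X, A_shard))                    # no communication yet
    partials.append(matmul(hidden, B_shard))             # partial output sum

tp_output = add(partials[0], partials[1])                # the one AllReduce
```

The elementwise activation commutes with the column split, so the only point where ranks must talk is the final sum of partial outputs.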

Paradigm names: TP (Megatron-style).

A note on naming before we go further. There’s a related technique called Sequence Parallelism (SP) which is the standard companion to TP. SP shards the activations of layernorm and dropout, the elementwise operations that sit between TP’s matmul regions, along the sequence dimension. It’s a free trick: AllReduce = ReduceScatter + AllGather, so SP turns TP’s region-boundary all-reduces into a reduce-scatter and all-gather pair that moves the same total bytes but leaves activations sharded between regions instead of replicated. The result is the same activation memory savings TP gives you for matmul tensors, now extended to the non-matmul stuff.
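The identity is easy to check on paper. Below is a sketch that simulates the three collectives over plain lists, one list per rank; the function names mirror the collectives but are local stand-ins, not a communication library.

```python
def all_reduce(rank_buffers):
    """Every rank ends with the elementwise sum of all ranks' buffers."""
    total = [sum(vals) for vals in zip(*rank_buffers)]
    return [list(total) for _ in rank_buffers]


def reduce_scatter(rank_buffers):
    """Rank r ends with only chunk r of the elementwise sum."""
    n = len(rank_buffers)
    chunk = len(rank_buffers[0]) // n
    total = [sum(vals) for vals in zip(*rank_buffers)]
    return [total[r * chunk:(r + 1) * chunk] for r in range(n)]


def all_gather(rank_chunks):
    """Every rank ends with the concatenation of all ranks' chunks."""
    full = [v for chunk in rank_chunks for v in chunk]
    return [list(full) for _ in rank_chunks]


buffers = [[1, 2, 3, 4], [10, 20, 30, 40]]   # 2 ranks, 4 elements each
sharded = reduce_scatter(buffers)            # [[11, 22], [33, 44]]
rebuilt = all_gather(sharded)                # same as all_reduce(buffers)
```

Between `reduce_scatter` and `all_gather` each rank holds only its own chunk, and that sharded window is where SP runs the layernorm and dropout work.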

The naming is unfortunate. “Sequence parallelism” sounds like the general technique for sharding the sequence dimension, but in current usage (matching Megatron-Core, Transformer Engine, and most frontier framework documentation), SP specifically means this TP companion - sharding only the elementwise activations between TP regions, not the attention computation itself. The general technique for sharding sequence across attention is a different thing called Context Parallelism (CP), which gets its own section below. Some papers from 2022-2023 use “sequence parallelism” loosely to mean either. This series uses SP narrowly (the TP companion) and CP for the broader attention-sharding story.

So: SP rides with TP everywhere TP appears. It’s not a separate axis; it’s how you should always run TP at scale.

Why this exists: when a single layer’s parameters or activations don’t fit on one GPU. Also useful when activation memory dominates at long context, even if the model would otherwise fit.

Cost: many allreduces per layer, on the critical path of compute (the next operation can’t start until the allreduce completes). With ~80 layers and 4 allreduces per layer, that’s 320 hard sync points per step, multiplied again by the number of microbatches when you accumulate gradients. Even when each one is small, this adds up - TP communication is often 10-30% of step time at typical TP=8.

Where it sits: the top of the bandwidth hierarchy. TP must stay within the fast intra-node domain (NVLink/NVSwitch: 8 GPUs on H100, 72 on GB200 NVL72). Push it across InfiniBand and your MFU collapses, because the chattiest workload now lives on the slowest fabric.

3. Layer depth (L): pipeline parallelism

What gets split: the model’s layers. Each rank (or rank group) owns a contiguous chunk of layers - the “pipeline stage.” Forward activations flow from stage 0 → 1 → 2 → ...; gradients flow back the other way.

Communication primitive: point-to-point send/recv between adjacent stages. One activation tensor goes forward; one gradient tensor comes back. Not collectives.

Paradigm names: PP, GPipe, PipeDream, 1F1B, interleaved 1F1B, zero-bubble PP. The progression in the literature is mostly about scheduling: how to fill the pipeline to minimize the idle “bubble” while keeping activation memory bounded.

Why this exists: when even TP + FSDP can’t fit the model on the fast fabric. PP lets you spread the model across many nodes by giving each node a stage instead of a sharded copy.

Cost: two related costs. First, the pipeline bubble - idle time on each stage while microbatches propagate. Bubble fraction is roughly (N-1)/(M+N-1) for N stages and M microbatches, so M has to be much larger than N to keep the bubble small. Second, stashed activations: each stage holds onto activations for in-flight microbatches until their backward pass arrives. The send/recv communication itself is cheap; the scheduling complexity is what makes PP hard.
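The bubble formula is worth plugging numbers into. This is a small sketch of the approximation above; the stage and microbatch counts are illustrative choices, not a recommendation.

```python
def bubble_fraction(num_stages, num_microbatches):
    """Approximate idle fraction: (N - 1) / (M + N - 1)."""
    return (num_stages - 1) / (num_microbatches + num_stages - 1)


for microbatches in (8, 32, 128):
    print(microbatches, round(bubble_fraction(8, microbatches), 3))
```

For 8 stages this gives roughly 47% idle at M=8, 18% at M=32, and 5% at M=128, which is the "M has to be much larger than N" point in numbers.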

Where it sits: tolerable on InfiniBand. The messages are large but infrequent (per stage boundary, not per layer), and PP doesn’t depend on a fully-connected fast domain. This is why PP is typically the “across nodes” layer in composition recipes.

4. Sequence - attention (S): context parallelism

What gets split: the sequence dimension across the attention computation itself. Each rank persistently holds 1/N of the sequence’s Q, K, V. To make attention work (every Q must see every K), K and V chunks rotate around the ranks while attention is computed in pieces.

Communication primitive: ring rotation of K/V shards between adjacent ranks (point-to-point send/recv, but in a ring pattern). This is paired with the online-softmax trick (the same algorithmic identity FlashAttention uses) to merge partial attention results without ever materializing the full attention matrix.
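Here is a sketch of the online-softmax accumulation for a single query vector, assuming non-causal attention, no score scaling, and one head. Each loop iteration stands in for one ring step; the data and function names are made up for illustration.

```python
import math


def full_attention(q, keys, values):
    """Reference: softmax(q . k) weighted sum of values."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    denom = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / denom for d in range(dim)]


def ring_attention(q, kv_chunks):
    """Visit one K/V chunk at a time, keeping a running max, denominator, numerator."""
    running_max = -math.inf
    denom = 0.0
    numer = [0.0] * len(kv_chunks[0][1][0])
    for keys, values in kv_chunks:                 # one iteration = one rotation
        scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
        new_max = max(running_max, max(scores))
        scale = math.exp(running_max - new_max)    # rescale earlier partial sums
        denom *= scale
        numer = [n * scale for n in numer]
        for s, v in zip(scores, values):
            w = math.exp(s - new_max)
            denom += w
            numer = [n + w * x for n, x in zip(numer, v)]
        running_max = new_max
    return [n / denom for n in numer]


q = [0.3, -0.2]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.5]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
chunks = [(keys[0:2], values[0:2]), (keys[2:4], values[2:4])]

out_full = full_attention(q, keys, values)
out_ring = ring_attention(q, chunks)
out_ring_reversed = ring_attention(q, chunks[::-1])  # arrival order does not matter
```

The per-query state is three small quantities regardless of sequence length, and the result does not depend on the order in which chunks arrive, which is what lets each rank start from whichever chunk it already holds.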

Paradigm names: CP, Ring Attention, Striped Attention (a load-balanced variant for causal masking).

Why this exists: at long context (typically 16K+ tokens), the attention activation memory becomes the dominant cost, and you can’t reduce it further with TP alone (head sharding caps out at the number of heads, and activations per head are quadratic in sequence).

Cost: N-1 ring rotations per attention layer, where N is the CP degree. The cost is high enough that CP only pays for itself past ~16K-32K sequence length on H100 - below that, the rotation communication costs more than the memory savings are worth. On GB200 NVL72 the crossover shifts down to maybe 4K-8K because the ring runs on much faster fabric.

Where it sits: middle of the hierarchy. CP wants fast fabric (the ring rotation is on the critical path of attention), but tolerates inter-node IB at long context where the per-chunk attention compute is large enough to hide the rotation. Typically composed alongside TP.

5. Experts (E): expert parallelism

What gets split: experts in a Mixture-of-Experts model. Each rank holds and computes for only its assigned subset of experts.

Communication primitive: all-to-all. Twice per MoE layer per forward pass - once to dispatch tokens to the ranks holding their assigned experts, once to combine the expert outputs back to the originating ranks.
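The dispatch and combine steps can be sketched with plain lists. This toy assumes top-1 routing and one expert per rank; the routing table and `expert_fn` are invented for illustration, and the nested loops stand in for the two all-to-alls.

```python
def expert_fn(expert_id, x):
    """Hypothetical expert: scales its input by (expert_id + 1)."""
    return x * (expert_id + 1)


num_ranks = 2                        # one expert per rank: expert id == rank id
tokens = [[1.0, 2.0], [3.0, 4.0]]    # tokens[r] = tokens that start on rank r
routes = [[1, 0], [1, 1]]            # routes[r][i] = expert chosen for that token

# Dispatch all-to-all: inbox[dst] collects (src_rank, src_index, value).
inbox = [[] for _ in range(num_ranks)]
for src in range(num_ranks):
    for i, (x, dst) in enumerate(zip(tokens[src], routes[src])):
        inbox[dst].append((src, i, x))

# Local expert compute, then combine all-to-all back to the originating slot.
outputs = [[None] * len(t) for t in tokens]
for expert_rank, received in enumerate(inbox):
    for src, i, x in received:
        outputs[src][i] = expert_fn(expert_rank, x)

tokens_per_expert = [len(received) for received in inbox]   # load per rank
```

Here `tokens_per_expert` comes out as `[1, 3]`: the router alone decides how lopsided the traffic and compute are, which is why load balance and all-to-all cost are hard to separate.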

Paradigm names: EP, expert sharding. The choice of routing algorithm (top-k, expert choice, auxiliary-loss-free) shapes the load distribution but doesn’t change the basic communication pattern.

Why this exists: MoE models decouple parameter count from compute per token. A 1T-parameter MoE might activate only 37B parameters per token. Those 1T parameters still have to live somewhere; replicating them across DP ranks wastes memory (every rank holds all experts but uses few). EP makes each rank specialize.

Cost: all-to-all is the densest communication pattern in this menu. Every rank sends data to every other rank, in both directions, simultaneously. The traffic is also uneven across steps - if the router sends 30% of tokens to one expert, that expert’s rank receives 30% of all traffic. Load imbalance directly compounds the communication cost. At frontier scale, EP all-to-all can be the single largest component of step time and motivates custom kernels (DeepSeek’s DeepEP).

Where it sits: wants fast moderate-distance fabric. Within a single fast domain is ideal (EP=64 within a GB200 NVL72 rack is a meaningful sweet spot). Across nodes via IB works but requires careful tuning of both the routing (to balance load) and the communication kernels (to handle the dense traffic).

The modifiers: SP and FSDP

Two more names you’ll encounter in the literature deserve a callout, because the field talks about them as if they’re peer techniques to TP/PP/CP/EP and they aren’t.

SP is the TP companion described in section 2. It’s not a separate axis: it’s how you should always run TP at scale. The reason it gets its own name is historical (Megatron published it as a distinct paper) and the reason it stuck is that you can run TP without SP, badly. But conceptually, SP is part of TP. Anywhere this series talks about TP, assume SP rides along.

FSDP is a refinement of DP. DDP replicates the entire model state on every DP rank. ZeRO-1 shards optimizer states across DP ranks. ZeRO-2 also shards gradients. ZeRO-3 (= FSDP) also shards parameters. Each level reclaims memory DDP was wasting on redundant copies, at additional communication cost.
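A rough sketch of what each level buys per rank. The byte split is an assumption (2 bytes of bf16 parameters, 2 bytes of bf16 gradients, 12 bytes of fp32 Adam state and master weights per parameter), as are the 70B model size and the DP degree of 64. Activations and temporary buffers are ignored.

```python
PARAM_BYTES, GRAD_BYTES, OPTIM_BYTES = 2, 2, 12   # per parameter; assumed mix


def state_bytes_per_rank(num_params, dp_degree, stage):
    """stage: 0 = DDP, 1 = ZeRO-1, 2 = ZeRO-2, 3 = ZeRO-3 / FSDP."""
    params = PARAM_BYTES / (dp_degree if stage >= 3 else 1)
    grads = GRAD_BYTES / (dp_degree if stage >= 2 else 1)
    optim = OPTIM_BYTES / (dp_degree if stage >= 1 else 1)
    return num_params * (params + grads + optim)


for stage in range(4):
    gb = state_bytes_per_rank(70e9, 64, stage) / 1e9
    print(f"stage {stage}: {gb:,.1f} GB of model state per rank")
```

Under these assumptions the four stages come out at roughly 1,120, 293, 155, and 17.5 GB per rank, so most of the saving arrives with the optimizer-state shard and the rest with gradients and parameters.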

The right way to think about it: DP is an axis (the batch dimension). FSDP is a choice about how aggressively to shard state along that same axis. HSDP is “FSDP within one node, DDP across nodes”: same axis, hierarchical sharding policy. The DDP-vs-FSDP-vs-HSDP choice is a within-axis knob, not a new axis. The dedicated DP post will go deep on which knob to pick when.

The map of “things you can adjust”: five axes, plus a sharding-aggressiveness knob within DP, plus the TP-with-or-without-SP knob (always: with).

The composition preview

When you hear “3D parallelism” or “5D parallelism,” that’s just picking multiple axes at once. A typical frontier dense recipe (Llama 3 style) uses three axes: TP × PP × DP. A long-context recipe might add CP for four. A frontier MoE recipe might use TP × EP × PP × DP for four, with EP replacing some of what TP normally does.

The geometry of the cluster sets the constraints: TP wants the fast fabric, so it goes intra-node. PP tolerates IB, so it goes across nodes. DP tolerates the slowest fabric, so it’s the outermost layer. CP and EP fit in the middle, with placement details that depend on the specific cluster topology.
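One way to picture the nesting is as a mapping from global rank to mesh coordinates, with TP varying fastest so that a TP group stays inside one node. This sketch assumes 8 GPUs per node, consecutive ranks sharing a node, and illustrative degrees of 16 × 8 × 8. Real launchers may order ranks differently.

```python
def mesh_coords(rank, tp, pp, dp):
    """Map a global rank to (dp_index, pp_index, tp_index), TP varying fastest."""
    assert 0 <= rank < tp * pp * dp
    tp_index = rank % tp                 # innermost: neighbours on the fast fabric
    pp_index = (rank // tp) % pp         # middle: adjacent stages on nearby nodes
    dp_index = rank // (tp * pp)         # outermost: replicas across the cluster
    return dp_index, pp_index, tp_index


TP, PP, DP = 8, 8, 16
world_size = TP * PP * DP                # 1024 GPUs

first_node = [mesh_coords(r, TP, PP, DP) for r in range(8)]
```

Every rank in `first_node` shares the same DP and PP index and differs only in its TP index, so the chattiest collective never leaves the node.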

The next several posts will take each axis seriously, one at a time. We’ll see how each one’s communication primitive maps to bandwidth and latency budgets, where it breaks under load, what knobs frameworks expose, and what surprises you when you scale it up. Then we’ll come back to composition - by that point, “4D parallelism” should feel less like jargon and more like an obvious consequence of picking the right axis for each tier of your machine.

Before the takeaway, it's worth pausing on what 5D composition actually looks like end-to-end. The diagram below zooms three times: from the cluster (DP across 16 replicas), into one replica (PP across 8 stages), into one stage (TP × CP × optional EP across 8 GPUs), all the way down to where a single QKV projection happens. It's the picture I keep coming back to, because it makes the nesting obvious in a way prose can't: each axis isn't a separate technique you bolt on, it's a level of zoom on the same physical cluster.

The takeaway

If I could put one thing in your head before the next post: There are five dimensions you can split work along, and the right composition is the one where each axis lives at the right level of zoom. Every paradigm name you've heard is a strategy for one of them, or a combination. Pick the dimensions by asking what your bottleneck is; pick the degree by asking what tier of your fabric you can afford to communicate on.

Once you have this map, the rest of the series is a tour of each axis in turn.

Next post: data parallelism in depth: what DDP actually does, why bucketing exists, where it breaks at scale, and how ZeRO-2 recovers most of the wasted memory essentially for free.


Training is not (just) a compute problem

Saksham Consul — Fri, 22 May 2026 02:31:30 GMT

When I first started working on distributed training, I had a mental model that turned out to be wrong in an interesting way. I thought training a big model was, fundamentally, a compute problem. You had FLOPs; you had a model that needed FLOPs to fit; the job was to get them efficiently from one to the other. Cluster procurement was a matter of counting H100s. Performance work was a matter of making kernels faster.

This is true at small scale. It stops being true around the time you start filling a rack.

A well-tuned frontier training run on 16,000 H100s achieves roughly 40% of the cluster’s peak FP16 throughput. The other 60% is spent waiting: on memory, on the network, on other GPUs, on slow disks, on stragglers, on bubbles, on barriers, on the half-hour after someone’s NIC silently corrupted its 14th packet of the day. We’ve learned to live with this number, but it should bother us more than it does.

That 60% is the substrate this series will spend a lot of time on. Some of it is a design problem: choosing the parallelism that fits the physics of your cluster. Some of it is an operations problem: keeping 10,000 GPUs healthy enough to talk to each other for 54 days straight. The literature treats the first as real engineering and the second as plumbing. I think both are real engineering. The frontier labs that ship models on schedule are the ones that take both seriously.

The thing I’ve come to believe, and what this series will spend a lot of time exploring, is that training at scale is better understood as a systems problem than as a compute problem. The specific system you’re working against is the bandwidth hierarchy of your physical cluster, and most of the interesting choices in training infrastructure are bets about where in that hierarchy you can afford to communicate.

Once you have this lens, a lot of the field organizes itself.

The hierarchy that actually runs your job

The picture I keep in my head looks something like this, sorted from fast and small to slow and large. The absolute numbers depend on what generation of hardware you’re on; the trend, each tier roughly 10× slower than the one above it, is remarkably stable.

HBM ↔ tensor cores: ~3.3 TB/s on H100, ~8 TB/s on B200, scoped to one GPU. This is what tensor cores actually feed on, and it’s the reason most kernels run at a small fraction of peak FLOPs - they exhaust memory bandwidth before they exhaust compute.
NVLink within the fast domain: 900 GB/s bidirectional per H100 GPU (NVLink 4), 1.8 TB/s bidirectional per B200 GPU (NVLink 5). The domain itself is 8 GPUs on H100 nodes or 72 GPUs on GB200 NVL72 racks - the size of the fast domain matters as much as its bandwidth, and the jump from 8 to 72 changes what parallelism schemes are viable in ways we’ll come back to.
InfiniBand within a pod: ~50 GB/s per NIC, ~400 GB/s aggregate per node, scoped to a few nodes before contention shows up. Roughly an order of magnitude slower than NVLink, and that gap shapes most architecture decisions.
InfiniBand across racks: same per-link bandwidth, but now sharing spine switches with hundreds of other ranks. Effective bandwidth drops further; latency variance gets ugly.
Storage: ~100 GB/s aggregate for a decent shared filesystem, shared with everything else doing I/O.

The thing I keep coming back to: each generational jump has improved the absolute numbers, but the ratios between tiers have stayed roughly constant. NVLink is faster than IB by about an order of magnitude on both H100 and Blackwell. HBM is faster than NVLink by a smaller factor, roughly 4× against the bidirectional figures above (3.3 vs 0.9 TB/s on H100, 8 vs 1.8 TB/s on B200), and that factor is also about the same on both. The cliff is the same shape; it’s just shifted up. Which means the architectural choices, what to put on which tier, stay broadly similar across generations, even as the per-tier capacity grows.

(Two notes on the numbers. First, NVLink bandwidth is quoted as bidirectional aggregate per GPU; in practice you’ll often see practitioners halve this when reasoning about single-direction effective throughput in real collectives. Second, the “10× cliff” is approximate - the real ratios depend on per-NIC counts, NVSwitch topology, and what your network can actually sustain under contention. The point isn’t the exact factor; it’s that there’s a very large gap between intra-node and inter-node fabric that determines a lot of downstream decisions.)
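For a feel of the gaps, here is a back-of-envelope sketch of how long one gigabyte takes at each tier's quoted peak. It uses the H100-generation figures listed above and ignores latency, contention, and protocol overhead, so treat the outputs as best cases.

```python
TIERS_GB_PER_S = {
    "HBM (H100)": 3300.0,
    "NVLink 4, bidirectional per GPU": 900.0,
    "InfiniBand, one NIC": 50.0,
}


def transfer_ms(gigabytes, gb_per_s):
    """Milliseconds to move `gigabytes` at a sustained `gb_per_s`."""
    return 1000.0 * gigabytes / gb_per_s


for tier, bandwidth in TIERS_GB_PER_S.items():
    print(f"{tier}: {transfer_ms(1.0, bandwidth):.2f} ms per GB")
```

That is roughly 0.3 ms, 1.1 ms, and 20 ms: the step from NVLink to a single NIC is the large one, and it is the step most placement decisions are made around.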

The way I’ve come to read the parallelism literature is: every paradigm you’ve heard of, DP, FSDP, TP, PP, CP, EP, is a strategy for placing different kinds of communication onto different tiers of this hierarchy. That’s the organizing principle. Once it clicks, the menu starts to feel less like a grab-bag and more like a set of obvious choices given the constraints:

Tensor parallelism lives at the top tier (NVLink only) because its communication sits on the critical path of every matmul. There’s no room to hide it.
Pipeline parallelism tolerates the middle tier because its communication is point-to-point and infrequent - once per stage boundary, not per layer.
Data parallelism tolerates the bottom tier because its allreduce happens once per step and can be overlapped with backward compute.
FSDP sits between TP and DP: it wants the fast tier when it can get it and falls back to “hybrid” (HSDP) when it can’t.

When I see someone running TP across nodes, my first thought is that they’ve put the chattiest workload on the wrong tier. They’re paying for tensor cores that are mostly waiting on InfiniBand. Sometimes there’s a good reason; usually there isn’t. The hierarchy is unforgiving that way.

The memory side of the same hierarchy

The bandwidth story has a partner in a memory story, and the two together set up most of what’s interesting in the field.

Take a 70B-parameter model trained with Adam in mixed precision. The state you have to hold per parameter, in bytes:

Parameters (bf16): 2
Gradients (bf16 or fp32): 2-4
Adam first moment (fp32): 4
Adam second moment (fp32): 4
Master weights kept in fp32 for the optimizer: 4

That’s 16 bytes per parameter, give or take. For 70B parameters, ~1.1 TB of model state, before you’ve stored a single activation. An H100 has 80 GB; a B200 has 180 GB. The model state alone vastly exceeds one GPU, and in fact exceeds an entire 8-GPU H100 node. This is the “16× rule,” and it’s the second forcing function I’ll come back to throughout the series: you can’t compute alone, and you can’t store alone either.
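The arithmetic, spelled out as a sketch. It takes the bf16 case for gradients and the 80 GB H100 figure, and counts model state only, so the GPU count is a floor rather than a plan.

```python
import math

BYTES_PER_PARAM = {
    "bf16 parameters": 2,
    "gradients (bf16 case)": 2,
    "Adam first moment (fp32)": 4,
    "Adam second moment (fp32)": 4,
    "fp32 master weights": 4,
}

num_params = 70e9
bytes_per_param = sum(BYTES_PER_PARAM.values())         # 16
total_state_tb = num_params * bytes_per_param / 1e12    # 1.12 TB
h100_node_tb = 8 * 80e9 / 1e12                          # 0.64 TB of HBM per node
min_h100s_for_state = math.ceil(num_params * bytes_per_param / 80e9)
```

Even with zero activations and perfect packing, the state needs at least `min_h100s_for_state` = 14 H100s, which is already more than one node.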

Every parallelism paradigm is, among other things, a strategy for slicing this state across GPUs so no one GPU holds more than its HBM allows. These two pressures, communication and storage, pull against each other in ways that drive most of the algorithm design in the field. Shard state more aggressively (FSDP, TP) and you create more communication, because what one rank needs is now somewhere else. Replicate state (DP) and you waste memory on N copies of the same thing. The interesting algorithmic work, in my reading, is in relaxing this tradeoff - finding ways to keep memory low without paying full price in communication.

ZeRO-2 is the most elegant example I know of. It splits one allreduce into a reduce-scatter and an allgather, keeps the same total bytes on the wire, and recovers (N-1)/N of optimizer-state and gradient memory essentially for free. That’s a real algorithmic insight; everything downstream is engineering. The field has maybe a half-dozen ideas of that caliber, and a lot of careful work around them.

The third pressure: synchronization

There’s one more pressure that doesn’t show up until you’re at scale, and I think it’s underappreciated relative to how much MFU it costs.

Synchronous training pays for the maximum, not the mean. A synchronous step finishes when the slowest rank finishes. With N ranks, the expected maximum of N step times grows with both N and the variance of individual step times. For a roughly Gaussian distribution of step times with a 5% per-rank coefficient of variation (which is good tuning), the slowest rank at N=16,000 lags the average by something like 22%. That’s MFU you pay just for the right to call your training synchronous.
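A figure of that size falls out of the standard large-N approximation for the expected maximum of independent Gaussians, mean + σ·sqrt(2 ln N). Using that form here is my assumption about the estimate; it is an upper-bound style approximation, so the exact expectation is somewhat lower.

```python
import math


def expected_straggler_lag(num_ranks, coeff_of_variation):
    """Approximate (E[max] - mean) / mean for i.i.d. Gaussian step times."""
    return coeff_of_variation * math.sqrt(2.0 * math.log(num_ranks))


lag_16k = expected_straggler_lag(16_000, 0.05)   # about 0.22
lag_1k = expected_straggler_lag(1_000, 0.05)     # smaller, but not by much
```

The sqrt(ln N) growth is slow, so going from 1K to 16K ranks adds only a few points of lag. Halving the coefficient of variation halves the whole term, which is why variance reduction pays off.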

This tax compounds with every level of synchronization in your stack. TP allreduces synchronize 8 ranks 320 times per step. DP allreduces synchronize all 16K ranks once per step. Each is its own straggler event, with its own tail.

One consequence: variance reduction is its own engineering discipline at scale. People pin CPUs, disable Turbo Boost, set uniform power caps below the thermal-throttling threshold, isolate NUMA, pre-warm kernels - not for the small mean improvement, but to compress the right tail of per-rank latency. The slowest rank sets your throughput, and the slowest rank is set by σ, and σ is set by a hundred small things. A lot of what makes “well-tuned” clusters well-tuned is in this category, and most of it isn’t glamorous.

What this series is for

I’m writing this as a practicing infra engineer, mostly for other practicing infra engineers, particularly those who are scaling up and discovering that their 1,000-GPU recipe has failure modes their 100-GPU recipe never showed. The frontier labs publish technical reports; the textbooks describe the abstractions; the framework docs tell you which knobs exist. What I haven’t seen much of is the connective tissue: why these abstractions exist, what breaks when you scale them, how to reason about a recipe before you commit a month of cluster time to running it, and what it actually takes to keep a cluster running long enough to finish that month.

On the design side, I plan to write about:

The parallelism paradigms, one at a time, with a consistent lens: what physics forces it, what its communication actually does, where it breaks, what it composes with.
Composition: 3D, 4D, 5D parallelism as the consequence of mapping each axis to the right tier of the hierarchy.
Hardware shifts that matter: NVL72, optical NVLink, what changes when the fast domain expands by an order of magnitude.
Close readings of public artifacts: the Llama 3 paper, the OPT logbook, the DeepSeek-V3 report. There’s a lot in these that becomes visible only when you read them with the operational lens.

On the operations side, I plan to write about:

Cluster health: what hardware actually fails, what to watch for, what to ignore.
Observability at 10K-GPU scale: what to measure when your metrics pipeline is itself a distributed system.
Stragglers, SDC, NCCL hangs - the failure modes that don’t fit the “crash and restart” model.
Checkpointing and recovery as a discipline, not a feature.
The boring operational work that makes the difference between 25% and 45% MFU at scale.

How I’m going to run this

One thing about how I’m publishing: I want the dialogue more than I want the byline. The frontier of this field moves fast, no one person sees all of it, and a lot of what I’ve learned has come from discussions in the office and Slack threads and pull request reviews. I’d rather have a corrected post than a perfect one.

So if you read something here that’s wrong, or missing context, or that contradicts your experience - please push back. In the comments, on X, in your own blog post that links here, whatever works. I’ll update posts when I’m corrected and credit the source. I’ll write follow-ups when readers raise things I hadn’t considered. The posts will be better for it, and I think the practice of openly-revisable technical writing is itself underappreciated in this corner of the field.

If the framing here resonates, the next post is on the five things you can actually shard in a training run, and why that’s the entire menu on the design side. Operations content starts a few posts after that, once we have shared vocabulary for the systems we’re operating.
