Training is not (just) a compute problem
Why your GPUs spend most of their time waiting, and what that tells you about the field.
When I first started working on distributed training, I had a mental model that turned out to be wrong in an interesting way. I thought training a big model was, fundamentally, a compute problem. You had FLOPs; you had a model that needed FLOPs to train; the job was to get them efficiently from one to the other. Cluster procurement was a matter of counting H100s. Performance work was a matter of making kernels faster.
This is true at small scale. It stops being true around the time you start filling a rack.
A well-tuned frontier training run on 16,000 H100s achieves roughly 40% of the cluster’s peak FP16 throughput. The other 60% is spent waiting: on memory, on the network, on other GPUs, on slow disks, on stragglers, on bubbles, on barriers, on the half-hour after someone’s NIC silently corrupted its 14th packet of the day. We’ve learned to live with this number, but it should bother us more than it does.
That 60% is the substrate this series will spend a lot of time on. Some of it is a design problem: choosing the parallelism that fits the physics of your cluster. Some of it is an operations problem: keeping 10,000 GPUs healthy enough to talk to each other for 54 days straight. The literature treats the first as real engineering and the second as plumbing. I think both are real engineering. The frontier labs that ship models on schedule are the ones that take both seriously.
The thing I’ve come to believe, and the thesis of this series, is that training at scale is better understood as a systems problem than as a compute problem. The specific system you’re working against is the bandwidth hierarchy of your physical cluster, and most of the interesting choices in training infrastructure are bets about where in that hierarchy you can afford to communicate.
Once you have this lens, a lot of the field organizes itself.
The hierarchy that actually runs your job
The picture I keep in my head looks something like this, sorted from fast and small to slow and large. The absolute numbers depend on what generation of hardware you’re on; the trend, each tier roughly 10× slower than the one above it, is remarkably stable.
HBM ↔ tensor cores: ~3.3 TB/s on H100, ~8 TB/s on B200, scoped to one GPU. This is what tensor cores actually feed on, and it’s the reason most kernels run at a small fraction of peak FLOPs - they exhaust memory bandwidth before they exhaust compute.
NVLink within the fast domain: 900 GB/s bidirectional per H100 GPU (NVLink 4), 1.8 TB/s bidirectional per B200 GPU (NVLink 5). The domain itself is 8 GPUs on H100 nodes or 72 GPUs on GB200 NVL72 racks - the size of the fast domain matters as much as its bandwidth, and the jump from 8 to 72 changes what parallelism schemes are viable in ways we’ll come back to.
InfiniBand within a pod: ~50 GB/s per NIC, ~400 GB/s aggregate per node, scoped to a few nodes before contention shows up. Roughly an order of magnitude slower than NVLink, and that gap shapes most architecture decisions.
InfiniBand across racks: same per-link bandwidth, but now sharing spine switches with hundreds of other ranks. Effective bandwidth drops further; latency variance gets ugly.
Storage: ~100 GB/s aggregate for a decent shared filesystem, shared with everything else doing I/O.
The thing I keep coming back to: each generational jump has improved the absolute numbers, but the ratios between tiers have stayed roughly constant. NVLink is faster than IB by about an order of magnitude on both H100 and Blackwell. HBM is faster than NVLink by a smaller factor, roughly 4×, on both (~3.3 TB/s against 900 GB/s, ~8 TB/s against 1.8 TB/s). The cliff is the same shape; it’s just shifted up. Which means the architectural choices, what to put on which tier, stay broadly similar across generations, even as the per-tier capacity grows.
(Two notes on the numbers. First, NVLink bandwidth is quoted as bidirectional aggregate per GPU; in practice you’ll often see practitioners halve this when reasoning about single-direction effective throughput in real collectives. Second, the “10× cliff” is approximate - the real ratios depend on per-NIC counts, NVSwitch topology, and what your network can actually sustain under contention. The point isn’t the exact factor; it’s that there’s a very large gap between intra-node and inter-node fabric that determines a lot of downstream decisions.)
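To make the tier gaps concrete, here is a minimal sketch that computes the ratio between adjacent tiers from the H100 figures quoted above. The function and variable names are mine, and the inputs are the same rough per-GPU numbers, so treat the output as a sanity check rather than a measurement. With these inputs the gap is uneven: smaller at the HBM to NVLink boundary, larger at the NVLink to InfiniBand boundary.

```python
# Approximate per-GPU bandwidths in GB/s, taken from the tiers listed above.
# These are headline figures, not sustained throughput under contention.
TIERS_H100 = [
    ("HBM", 3300.0),
    ("NVLink (bidirectional)", 900.0),
    ("InfiniBand (per NIC)", 50.0),
]


def tier_ratios(tiers):
    """Return (faster_name, slower_name, ratio) for each adjacent pair of tiers."""
    return [
        (fast_name, slow_name, fast_bw / slow_bw)
        for (fast_name, fast_bw), (slow_name, slow_bw) in zip(tiers, tiers[1:])
    ]


def seconds_to_move(gigabytes, bandwidth_gb_per_s):
    """Bandwidth-only transfer time; ignores latency and contention."""
    return gigabytes / bandwidth_gb_per_s


for fast, slow, ratio in tier_ratios(TIERS_H100):
    print(f"{fast} is ~{ratio:.1f}x faster than {slow}")

# Time to move a hypothetical 10 GB buffer across each tier.
for name, bw in TIERS_H100:
    print(f"10 GB over {name}: {seconds_to_move(10, bw) * 1e3:.1f} ms")
```

Halving the NVLink figure, as the first note suggests for single-direction traffic, narrows the NVLink to InfiniBand gap to about 9× and is a one-line change to the table.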
The way I’ve come to read the parallelism literature is: every paradigm you’ve heard of, DP, FSDP, TP, PP, CP, EP, is a strategy for placing different kinds of communication onto different tiers of this hierarchy. That’s the organizing principle. Once it clicks, the menu starts to feel less like a grab-bag and more like a set of obvious choices given the constraints:
Tensor parallelism lives on the fastest interconnect tier (NVLink only) because its communication sits on the critical path of every matmul. There’s no room to hide it.
Pipeline parallelism tolerates the middle tier because its communication is point-to-point and infrequent - once per stage boundary, not per layer.
Data parallelism tolerates the slowest network tier because its allreduce happens once per step and can be overlapped with backward compute.
FSDP sits between TP and DP: it wants the fast tier when it can get it and falls back to “hybrid” sharding (HSDP), sharding within a fast group and replicating across groups, when it can’t.
When I see someone running TP across nodes, my first thought is that they’ve put the chattiest workload on the wrong tier. They’re paying for tensor cores that are mostly waiting on InfiniBand. Sometimes there’s a good reason; usually there isn’t. The hierarchy is unforgiving that way.
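A rough way to see why TP across nodes hurts is a bandwidth-only cost model for a ring allreduce. This is a sketch under stated assumptions, not a model of what NCCL actually does on NVSwitch: the tensor shape is hypothetical, I use the halved one-way NVLink figure, I assume one InfiniBand NIC per GPU, and I ignore latency and overlap entirely.

```python
def ring_allreduce_seconds(message_bytes, num_ranks, bandwidth_bytes_per_s):
    """Bandwidth-only cost of a ring allreduce.

    Each rank sends (and receives) 2 * (N - 1) / N of the message.
    Latency, protocol overhead, and overlap are ignored, so this is a lower bound.
    """
    if num_ranks < 2:
        return 0.0
    traffic = 2 * (num_ranks - 1) / num_ranks * message_bytes
    return traffic / bandwidth_bytes_per_s


GB = 1e9

# Hypothetical activation tensor: 8,192 tokens x 8,192 hidden dim in bf16.
activation_bytes = 8192 * 8192 * 2

nvlink_one_way = 450 * GB  # half of the 900 GB/s bidirectional figure
infiniband_nic = 50 * GB   # assumes one NIC per GPU

t_nvlink = ring_allreduce_seconds(activation_bytes, 8, nvlink_one_way)
t_ib = ring_allreduce_seconds(activation_bytes, 8, infiniband_nic)

print(f"TP=8 allreduce over NVLink:     {t_nvlink * 1e3:.2f} ms")
print(f"TP=8 allreduce over InfiniBand: {t_ib * 1e3:.2f} ms")
print(f"slowdown: {t_ib / t_nvlink:.1f}x")
```

The per-call difference looks small in isolation. It stops looking small once it is multiplied by every allreduce on the critical path of a step, none of which can be hidden behind compute.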
The memory side of the same hierarchy
The bandwidth story has a partner in a memory story, and the two together set up most of what’s interesting in the field.
Take a 70B-parameter model trained with Adam in mixed precision. The state you have to hold per parameter, in bytes:
Parameters (bf16): 2
Gradients (bf16 or fp32): 2-4
Adam first moment (fp32): 4
Adam second moment (fp32): 4
Master weights kept in fp32 for the optimizer: 4
That’s 16 bytes per parameter, give or take. For 70B parameters, ~1.1 TB of model state, before you’ve stored a single activation. An H100 has 80 GB; a B200 has 180 GB. The model state alone vastly exceeds one GPU, and in fact exceeds an entire 8-GPU H100 node. This is the “16× rule,” and it’s the second forcing function I’ll come back to throughout the series: you can’t compute alone, and you can’t store alone either.
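The arithmetic is short enough to write down. The sketch below uses the low end of the gradient range (bf16 gradients, 16 bytes per parameter in total) and the HBM capacities quoted above; the helper names are mine, and activations, buffers, and fragmentation are deliberately left out.

```python
# Bytes of model state per parameter, from the list above (bf16 gradients).
BYTES_PER_PARAM = {
    "parameters (bf16)": 2,
    "gradients (bf16)": 2,
    "adam first moment (fp32)": 4,
    "adam second moment (fp32)": 4,
    "master weights (fp32)": 4,
}


def model_state_bytes(num_params, bytes_per_param=BYTES_PER_PARAM):
    """Total bytes of parameters, gradients, and optimizer state."""
    return num_params * sum(bytes_per_param.values())


def min_gpus_for_state(num_params, hbm_bytes):
    """Smallest GPU count whose combined HBM holds the model state alone.

    Ignores activations and workspace, so real jobs need more than this.
    """
    total = model_state_bytes(num_params)
    return -(-total // hbm_bytes)  # integer ceiling division


params_70b = 70_000_000_000
print(model_state_bytes(params_70b) / 1e12, "TB of model state")
print(min_gpus_for_state(params_70b, 80 * 10**9), "H100s (80 GB) minimum")
print(min_gpus_for_state(params_70b, 180 * 10**9), "B200s (180 GB) minimum")
```

Fourteen 80 GB GPUs just to hold the state is already more than one 8-GPU node, which is the point of the paragraph above. Switching gradients to fp32 moves the total to 18 bytes per parameter and the minimum up accordingly.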
Every parallelism paradigm is, among other things, a strategy for slicing this state across GPUs so no one GPU holds more than its HBM allows. These two pressures, communication and storage, pull against each other in ways that drive most of the algorithm design in the field. Shard state more aggressively (FSDP, TP) and you create more communication, because what one rank needs is now somewhere else. Replicate state (DP) and you waste memory on N copies of the same thing. The interesting algorithmic work, in my reading, is in relaxing this tradeoff - finding ways to keep memory low without paying full price in communication.
ZeRO-2 is the most elegant example I know of. It splits one allreduce into a reduce-scatter and an allgather, keeps the same total bytes on the wire, and recovers (N-1)/N of optimizer-state and gradient memory essentially for free. That’s a real algorithmic insight; everything downstream is engineering. The field has maybe a half-dozen ideas of that caliber, and a lot of careful work around them.
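The decomposition is easy to check on paper. Below is a toy, single-process simulation in which "ranks" are plain Python lists: it shows that a reduce-scatter followed by an allgather gives the same result as an allreduce, and that slotting the optimizer update between the two lets each rank hold only its own shard of optimizer state. It is an illustration of the idea, not DeepSpeed's or PyTorch's implementation; the momentum-SGD update stands in for Adam to keep it short, and the vector length is assumed to be divisible by the rank count.

```python
def reduce_scatter(grads_per_rank):
    """Rank r ends up holding the sum of shard r across all ranks."""
    n = len(grads_per_rank)
    shard = len(grads_per_rank[0]) // n
    return [
        [sum(g[r * shard + i] for g in grads_per_rank) for i in range(shard)]
        for r in range(n)
    ]


def all_gather(shards_per_rank):
    """Every rank ends up holding the concatenation of all shards."""
    full = [x for shard in shards_per_rank for x in shard]
    return [list(full) for _ in shards_per_rank]


def all_reduce(grads_per_rank):
    """An allreduce expressed as reduce-scatter followed by allgather."""
    return all_gather(reduce_scatter(grads_per_rank))


def sharded_step(params, grads_per_rank, momentum_shards, lr=0.1, beta=0.9):
    """One optimizer step where rank r owns only shard r of the momentum."""
    n = len(grads_per_rank)
    shard = len(params) // n
    grad_shards = reduce_scatter(grads_per_rank)
    new_param_shards = []
    for r in range(n):
        # Update only this rank's slice of optimizer state (in place).
        m = [beta * m_i + g_i for m_i, g_i in zip(momentum_shards[r], grad_shards[r])]
        momentum_shards[r] = m
        p = params[r * shard:(r + 1) * shard]
        new_param_shards.append([p_i - lr * m_i for p_i, m_i in zip(p, m)])
    # Allgather the updated parameter shards so every rank sees the full vector.
    return all_gather(new_param_shards)[0]


grads = [[1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]]  # two ranks
momentum = [[0.0, 0.0], [0.0, 0.0]]                        # 1/N of state per rank
print(all_reduce(grads)[0])
print(sharded_step([0.0] * 4, grads, momentum))
```

Each rank's momentum list has length 2 instead of 4, which is the 1/N share of state that remains after the (N-1)/N saving, and the two collectives together move the same data an allreduce would have.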
The third pressure: synchronization
There’s one more pressure that doesn’t show up until you’re at scale, and I think it’s underappreciated relative to how much MFU it costs.
Synchronous training pays for the maximum, not the mean. A synchronous step finishes when the slowest rank finishes. With N ranks, the expected maximum of N step times grows with both N and the variance of individual step times. For a roughly Gaussian distribution of step times with a 5% per-rank coefficient of variation (which is good tuning), the slowest rank at N=16,000 lags the average by something like 22%. That’s MFU you pay just for the right to call your training synchronous.
This tax compounds with every level of synchronization in your stack. TP allreduces synchronize 8 ranks 320 times per step. DP allreduces synchronize all 16K ranks once per step. Each is its own straggler event, with its own tail.
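The 22% figure is consistent with the common sqrt(2 ln N) approximation for the expected maximum of N Gaussian samples; I am assuming that is roughly where it comes from. The sketch below computes that estimate alongside Blom's approximation, which tends to land a little lower at this N. Both assume independent, identically distributed Gaussian step times, which real clusters only loosely satisfy, and the function names are mine.

```python
import math
from statistics import NormalDist


def straggler_lag_bound(num_ranks, cv):
    """Lag of the slowest rank over the mean, as a fraction of mean step time.

    Uses the sqrt(2 ln N) approximation for the expected maximum of
    N standard normal samples; it overestimates somewhat at finite N.
    """
    return cv * math.sqrt(2 * math.log(num_ranks))


def straggler_lag_blom(num_ranks, cv, alpha=0.375):
    """Same quantity using Blom's approximation to the largest order statistic."""
    p = (num_ranks - alpha) / (num_ranks - 2 * alpha + 1)
    return cv * NormalDist().inv_cdf(p)


for n in (8, 512, 16_000):
    bound = straggler_lag_bound(n, cv=0.05)
    blom = straggler_lag_blom(n, cv=0.05)
    print(f"N={n:>6}: sqrt(2 ln N) estimate {bound:.1%}, Blom estimate {blom:.1%}")
```

The useful thing about either formula is the shape: the lag scales linearly with the coefficient of variation but only with the square root of log N, so halving per-rank variance buys far more than halving the rank count would.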
One consequence: variance reduction is its own engineering discipline at scale. People pin CPUs, disable Turbo Boost, set uniform power caps below the thermal-throttling threshold, isolate NUMA, pre-warm kernels - not for the small mean improvement, but to compress the right tail of per-rank latency. The slowest rank sets your throughput, and the slowest rank is set by σ, and σ is set by a hundred small things. A lot of what makes “well-tuned” clusters well-tuned is in this category, and most of it isn’t glamorous.
What this series is for
I’m writing this as a practicing infra engineer, mostly for other practicing infra engineers, particularly those who are scaling up and discovering that their 1,000-GPU recipe has failure modes their 100-GPU recipe never showed. The frontier labs publish technical reports; the textbooks describe the abstractions; the framework docs tell you which knobs exist. What I haven’t seen much of is the connective tissue: why these abstractions exist, what breaks when you scale them, how to reason about a recipe before you commit a month of cluster time to running it, and what it actually takes to keep a cluster running long enough to finish that month.
On the design side, I plan to write about:
The parallelism paradigms, one at a time, with a consistent lens: what physics forces it, what its communication actually does, where it breaks, what it composes with.
Composition: 3D, 4D, 5D parallelism as the consequence of mapping each axis to the right tier of the hierarchy.
Hardware shifts that matter: NVL72, optical NVLink, what changes when the fast domain expands by an order of magnitude.
Close readings of public artifacts: the Llama 3 paper, the OPT logbook, the DeepSeek-V3 report. There’s a lot in these that becomes visible only when you read them with the operational lens.
On the operations side, I plan to write about:
Cluster health: what hardware actually fails, what to watch for, what to ignore.
Observability at 10K-GPU scale: what to measure when your metrics pipeline is itself a distributed system.
Stragglers, SDC, NCCL hangs - the failure modes that don’t fit the “crash and restart” model.
Checkpointing and recovery as a discipline, not a feature.
The boring operational work that makes the difference between 25% and 45% MFU at scale.
How I’m going to run this
One thing about how I’m publishing: I want the dialogue more than I want the byline. The frontier of this field moves fast, no one person sees all of it, and a lot of what I’ve learned has come from discussions in the office and Slack threads and pull request reviews. I’d rather have a corrected post than a perfect one.
So if you read something here that’s wrong, or missing context, or that contradicts your experience - please push back. In the comments, on X, in your own blog post that links here, whatever works. I’ll update posts when I’m corrected and credit the source. I’ll write follow-ups when readers raise things I hadn’t considered. The posts will be better for it, and I think the practice of openly-revisable technical writing is itself underappreciated in this corner of the field.
If the framing here resonates, the next post is on the five things you can actually shard in a training run, and why that’s the entire menu on the design side. Operations content starts a few posts after that, once we have shared vocabulary for the systems we’re operating.


