
Transcript
Maya: Picture the dashboard of a big training run. Sixty-four GPUs, and the utilization graph looks like a comb — busy, idle, busy, idle, teeth all the way across the screen.
Leo: A comb.
Maya: Every gap between the teeth is a machine doing nothing while the meter runs. That comb is what this whole topic is about.
Leo: The meter is rude, too. A cluster that size bills by the hour whether the chips are computing or just waiting on a colleague to hand something over.
Maya: Last topic, training looked clean. Scaling laws handed us a budget — the Chinchilla lesson was, split your compute between parameters and data in the right ratio, and the loss curve behaves.
Leo: On the whiteboard. Today that tidy budget turns into cables, memory pressure, synchronization barriers, and a clock that punishes every idle second.
Maya: So here's the team we'll carry through the whole topic. They want to train a hundred-billion-parameter model, and they have a fixed cluster — sixty-four GPUs, not one more.
Leo: Their first wall isn't speed, it's existence. That model does not fit on any single device. Not the weights, not the gradients, definitely not the optimizer state.
Maya: Which means the question is never just "can we compute this?" It's "can we store it, move it, and synchronize it before the GPUs go idle?"
Leo: Three different questions.
Maya: And distributed training is the art of answering all three at once. Deciding what to copy, what to split, what to move, and what to recompute.
Leo: That sentence is the whole topic. Everything after it is detail — expensive, fascinating detail.
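A back-of-the-envelope sketch of why the first wall is existence rather than speed. The byte counts are assumptions, not figures from the episode: they follow the common mixed-precision Adam accounting (2 bytes of fp16 weights, 2 bytes of fp16 gradients, and 12 bytes of fp32 optimizer state per parameter). The 80 GB device size is a hypothetical example, and activations are not counted at all.

```python
# Rough estimate of model state for mixed-precision Adam training.
# Assumed bytes per parameter (an accounting convention, not from the episode):
WEIGHT_BYTES = 2       # fp16 weights
GRAD_BYTES = 2         # fp16 gradients
OPTIMIZER_BYTES = 12   # fp32 master weights + Adam momentum + Adam variance

GB = 10**9


def model_state_gb(num_params):
    """Return (weights, gradients, optimizer, total) in gigabytes."""
    weights = num_params * WEIGHT_BYTES / GB
    grads = num_params * GRAD_BYTES / GB
    optimizer = num_params * OPTIMIZER_BYTES / GB
    return weights, grads, optimizer, weights + grads + optimizer


if __name__ == "__main__":
    params = 100 * 10**9          # the team's hundred-billion-parameter model
    device_gb = 80                # hypothetical per-GPU memory
    weights, grads, optimizer, total = model_state_gb(params)
    print(f"weights   {weights:7.0f} GB")
    print(f"gradients {grads:7.0f} GB")
    print(f"optimizer {optimizer:7.0f} GB")
    print(f"total     {total:7.0f} GB, or {total / device_gb:.0f}x one {device_gb} GB device")
```

Under these assumptions the optimizer state is the largest share by a wide margin, which matches the "definitely not the optimizer state" remark above.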
Maya: So let's walk the tour. Five tolls stand between our team and a healthy run, and every paper in this topic is an attack on one of them. It opens at the State Shelf — the memory holding the model itself: weights, gradients, optimizer state.
Leo: The stuff that exists whether or not you're processing a single example.
Maya: From there, the Scratchpad — activation memory. Every intermediate result the forward pass produces and the backward pass will need later, piling up in the meantime.
Leo: Right.
Maya: Then come two tolls about waiting: the Cable, which is GPU-to-GPU communication, and the Bubble — pipeline idle time, machines standing around between stages. And the tour ends at the sneakiest toll of all. The Last Inch.
Leo: Which is?
Maya: Data movement inside a single GPU — between its big high-bandwidth memory and the tiny, fast memory sitting right next to the compute units. A chip can look busy on the dashboard while it's mostly hauling data across that inch.
Leo: [chuckle] So our team can fail five different ways before the math itself is ever the problem.
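One way to picture the Last Inch is the block-by-block arithmetic behind the FlashAttention papers in the source list. The sketch below is not that kernel. It is a pure-Python illustration, for a single query, of the running-softmax trick that lets attention be computed one block of keys at a time without ever storing the full row of scores. The function names and the block size are hypothetical, and on a real GPU each block would sit in the small on-chip memory while it is processed.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def naive_attention(q, keys, values):
    """Reference version: materialize every score, then softmax, then mix."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [dot(q, k) * scale for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(dim)]


def tiled_attention(q, keys, values, block=2):
    """Same result, visiting keys/values one block at a time.

    Only a running max, a running normalizer, and a running weighted sum
    are kept between blocks.
    """
    scale = 1.0 / math.sqrt(len(q))
    dim = len(values[0])
    running_max = float("-inf")
    normalizer = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), block):
        block_keys = keys[start:start + block]
        block_values = values[start:start + block]
        scores = [dot(q, k) * scale for k in block_keys]
        new_max = max(running_max, max(scores))
        # Rescale what was accumulated under the old max.
        correction = math.exp(running_max - new_max)
        normalizer *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, block_values):
            w = math.exp(s - new_max)
            normalizer += w
            for j in range(dim):
                acc[j] += w * v[j]
        running_max = new_max
    return [a / normalizer for a in acc]
```

The two functions agree up to floating-point rounding. The difference is how much intermediate data has to exist at once, which is the quantity the Last Inch charges for.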
Maya: And the episodes map straight onto the tour. GPipe slices the model layer-wise into a pipeline. Megatron-LM splits the operations inside each Transformer layer. ZeRO — spelled Z-E-R-O, the Zero Redundancy Optimizer — and FSDP, which means Fully Sharded Data Parallel, both shard the State Shelf. FlashAttention and FlashAttention Two attack the Last Inch.
Leo: Plus a survey episode on how real architectures compose all of that.
Maya: And the two outliers.
Leo: The two episodes that stretch the word "efficiency" past the cluster entirely — SSD asks whether a model can improve by training on its own generated code, and the test-time-scaling paper asks whether your training plan changes once you count inference sampling cost.
Maya: Hold those two for later — they bend Topic Two's budget question in a way I love. First, the levers. The whole field really has four.
Leo: Name them.
Maya: Copy, Slice, Split, Shard.
Leo: Go on.
Maya: Copy is data parallelism — every GPU carries the full model and you divide the examples among them. Slice is pipeline parallelism — divide the layer stack. Split is tensor parallelism — divide one big operation across devices. And Shard means divide the stored state, so no worker carries the whole burden.
Leo: They're not ranked beginner to advanced, either. They answer different questions. Copy asks, how do I chew through more data? Slice asks, how do I fit a deep stack? Split asks, how do I fit one enormous operation? Shard asks, why is everyone carrying the same backpack?
Maya: [laugh] That backpack line is going to do a lot of work this topic.
Leo: It's an honest question! In plain data parallelism, all sixty-four of our team's GPUs hold identical copies of the weights, the gradients, the optimizer state. Sixty-four identical backpacks.
Maya: Which sets up the real fight inside these papers — two camps, two answers to the same memory wall. And I'll argue the first one properly: if the model is too big, partition the model. Slice it, split it, place each piece of compute right next to the memory it needs.
Leo: Then I get the other side, because the sharding camp has the cleaner story. Most of that memory crisis is pure redundancy. Shard the optimizer state, the gradients, even the parameters across the cluster, and each device's burden collapses — while training still behaves like plain data parallelism. No model surgery. No custom partition every time the architecture changes.
Maya: Until one layer's computation is itself too big — and at hundred-billion scale, it can be. Sharding the shelf doesn't shrink the work, Leo. Partitioning does. Inside one server with a fast interconnect, tensor parallelism is very hard to beat.
Leo: Fine — the capacity point survives. When a single operation outgrows a device, you have to split the operation. Sharding alone won't save you there.
Maya: And I'll concede the simplicity point. Sharding generalizes across architectures in a way hand partitioning never has, and it spends no engineer-months on surgery.
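The "sixty-four identical backpacks" argument can be put in rough numbers. This sketch assumes the same mixed-precision Adam accounting as the ZeRO paper (2 + 2 + 12 bytes per parameter), which is an assumption rather than a figure from the episode. It ignores activations and the temporary buffers needed when sharded parameters are gathered for compute. The function and argument names are hypothetical.

```python
def per_gpu_state_gb(num_params, num_gpus,
                     shard_optimizer=False,
                     shard_grads=False,
                     shard_weights=False):
    """Approximate per-GPU model state in GB under data parallelism.

    Each kind of state is either fully replicated or split evenly
    across all GPUs. Byte counts per parameter are assumed values.
    """
    def share(bytes_per_param, sharded):
        return bytes_per_param / num_gpus if sharded else bytes_per_param

    bytes_per_param = (share(2, shard_weights)       # fp16 weights
                       + share(2, shard_grads)       # fp16 gradients
                       + share(12, shard_optimizer)) # fp32 Adam state
    return num_params * bytes_per_param / 1e9


if __name__ == "__main__":
    params, gpus = 100 * 10**9, 64
    rows = [
        ("replicate everything", {}),
        ("shard optimizer", {"shard_optimizer": True}),
        ("shard optimizer + grads", {"shard_optimizer": True, "shard_grads": True}),
        ("shard all three", {"shard_optimizer": True, "shard_grads": True,
                             "shard_weights": True}),
    ]
    for label, flags in rows:
        print(f"{label:26s} {per_gpu_state_gb(params, gpus, **flags):9.2f} GB per GPU")
```

The work per training step is unchanged in every row. Only the stored burden per device shrinks, which is the point both speakers end up agreeing on.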
Leo: So where does this actually land?
Maya: On a distinction the whole topic runs on: capacity versus throughput. Capacity asks, does the run fit at all? Throughput asks, how fast do useful training tokens flow once it fits? Sharding is a capacity move that pays a communication tax. Partitioning is a throughput move that pays an engineering tax.
Leo: Composition, then — not victory. Real stacks split tensors inside a server, pipeline across servers, shard state everywhere, checkpoint activations, fuse kernels, run mixed precision—
Maya: —and the art is arranging all of that so the communication patterns don't collide. You don't pick a winner. You pick which bottleneck you can afford to pay.
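A small sketch of what "composition" means as a search space. It lists every way to factor a cluster into tensor, pipeline, and data-parallel degrees, keeping tensor parallelism inside one server as described above. The eight-GPUs-per-server figure is a hypothetical assumption, and the function name is illustrative. Choosing among the candidates still depends on model shape and interconnect, which this sketch does not model.

```python
def candidate_layouts(total_gpus, gpus_per_server):
    """Return (tensor, pipeline, data) degrees whose product is total_gpus.

    The tensor degree must divide gpus_per_server so that each
    tensor-parallel group stays inside a single server.
    """
    layouts = []
    for tensor in range(1, gpus_per_server + 1):
        if gpus_per_server % tensor or total_gpus % tensor:
            continue
        remaining = total_gpus // tensor
        for pipeline in range(1, remaining + 1):
            if remaining % pipeline == 0:
                layouts.append((tensor, pipeline, remaining // pipeline))
    return layouts


if __name__ == "__main__":
    # 64 GPUs as in the episode; 8 per server is an assumed topology.
    for tensor, pipeline, data in candidate_layouts(64, 8):
        print(f"tensor={tensor:2d}  pipeline={pipeline:2d}  data={data:2d}")
```

Even this toy enumeration yields a couple of dozen layouts for sixty-four GPUs, before sharding, checkpointing, or precision choices are layered on top.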
Leo: Okay, but that raises a second fight, and this one is live in every team I know.
Maya: Name it.
Leo: If the answer is "compose seven techniques," should a framework just choose the layout for you? I've watched hand-tuned configs rot. Automation doesn't make typos at three a.m., and it makes giant models accessible to teams without a systems group.
Maya: And yet the hand-tuners keep winning, because the best layout depends on the architecture, the batch size, the sequence length, the interconnect, the optimizer, even the training objective. A framework that's right on average is wrong on your cluster.
Leo: So the answer isn't ideology, it's circumstances. Model shape, hardware topology, and how much engineering time the team actually has.
Maya: Agreed there. Which brings up the mistake every beginner makes right about now—
Leo: —"more GPUs means faster training." [sigh] I have sat in that meeting.
Maya: More GPUs can also mean more communication, more synchronization, more failure modes, and more time spent waiting. The comb gets wider, not fuller.
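One concrete reason more GPUs do not automatically mean faster training is the cost of synchronizing gradients. The sketch below assumes a ring all-reduce, a common choice that the episode does not name. In a ring, each GPU sends roughly 2(N-1)/N times the payload and takes 2(N-1) sequential steps. The payload size is an arbitrary example, and the function names are hypothetical.

```python
def ring_allreduce_bytes_per_gpu(payload_bytes, num_gpus):
    """Bytes each GPU sends during one ring all-reduce of payload_bytes."""
    if num_gpus < 2:
        return 0.0
    return 2 * (num_gpus - 1) / num_gpus * payload_bytes


def ring_allreduce_steps(num_gpus):
    """Sequential communication steps: reduce-scatter plus all-gather."""
    return 2 * (num_gpus - 1) if num_gpus > 1 else 0


if __name__ == "__main__":
    payload = 2 * 10**9  # example: 1B fp16 gradient values (assumed size)
    for n in (2, 8, 64):
        sent_gb = ring_allreduce_bytes_per_gpu(payload, n) / 1e9
        print(f"{n:3d} GPUs: {sent_gb:5.2f} GB sent per GPU, "
              f"{ring_allreduce_steps(n):3d} sequential steps")
```

Per-GPU traffic does not shrink as the ring grows. It approaches twice the payload, while the number of sequential steps keeps rising, so every added GPU adds latency to each synchronization.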
Leo: So here's one for the listener, with our team in mind. Their hundred-billion-parameter run is finally crawling along on those sixty-four GPUs. Before you reach for any technique in this topic — what do you measure first?
Maya: My list: utilization over time, communication traces, dataloader stalls, memory peaks, checkpoint timing. Averages hide too much. The shape of the stalls tells you what the system is really doing.
Leo: That's the diagnostic habit this topic teaches. Find the bottleneck first, then choose the technique. The reverse order is dangerous, because every method on the tour brings its own overhead.
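A minimal sketch of why "averages hide too much." The two traces below are made up for illustration. They have identical average utilization but very different stall shapes: one is the comb from the opening, the other is a single long stall such as a slow checkpoint write. The trace format and function names are hypothetical.

```python
def utilization(trace):
    """Fraction of total time spent busy. trace is [(state, millis), ...]."""
    busy = sum(ms for state, ms in trace if state == "busy")
    total = sum(ms for _, ms in trace)
    return busy / total


def stall_shape(trace):
    """Return (number of stalls, longest stall in ms, mean stall in ms)."""
    stalls = [ms for state, ms in trace if state == "idle"]
    if not stalls:
        return 0, 0, 0.0
    return len(stalls), max(stalls), sum(stalls) / len(stalls)


# Illustrative traces, not measurements.
comb = [("busy", 800), ("idle", 200)] * 10   # many short gaps
cliff = [("busy", 8000), ("idle", 2000)]     # one long gap

if __name__ == "__main__":
    for name, trace in (("comb", comb), ("cliff", cliff)):
        count, longest, mean = stall_shape(trace)
        print(f"{name:5s} utilization={utilization(trace):.2f} "
              f"stalls={count} longest={longest} ms mean={mean:.0f} ms")
```

Both traces report the same utilization, yet the first pattern suggests a per-step wait such as synchronization, and the second suggests a rare blocking event. They point to different fixes.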
Maya: The crisis even moves over a project's life. Early on, fitting the model is the emergency. Then throughput is. Near deployment, cost per useful answer takes over — which is exactly where those last two episodes live.
Leo: Which is why I'd defend this topic against the "bag of tricks" accusation. Each paper moves a hard limit: fits versus doesn't fit, stalls versus flows, theoretical speed versus wall-clock speed. And cheaper experiments change which research anyone can afford to run at all.
Maya: Though moving a bottleneck is not eliminating it. Make attention faster, and communication may start to dominate. Shard the memory, and the gather timing suddenly matters. Fit the model, and the checkpoints and the dataloader become visible.
Leo: Whack-a-mole.
Maya: Whack-a-mole with a billing meter. Before we close — the topic's working vocabulary, quick and plain.
Leo: Data parallelism means every device holds a full copy of the model and the training examples are divided among them.
Maya: Pipeline parallelism means the model's layers are divided into stages, and work flows through them like an assembly line.
Leo: Tensor parallelism means one large operation inside a layer is divided across several devices that each compute a piece.
Maya: Sharding means splitting stored training state — weights, gradients, optimizer state — across workers instead of copying it to every one.
Leo: Activation checkpointing means discarding intermediate results during the forward pass and recomputing them during the backward pass to save memory.
Maya: A pipeline bubble means the idle time a pipeline stage spends waiting because its neighbor isn't finished yet.
Leo: Utilization means the fraction of time the hardware spends doing useful work instead of waiting.
Maya: And high-bandwidth memory means the large main memory on a GPU — fast, but still far slower than the small on-chip memory beside the compute units.
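The pipeline-bubble definition above can be made concrete with a standard estimate. It assumes a simple fill-and-drain schedule of the kind GPipe describes, with equal time per stage and per micro-batch. Under those assumptions the idle share of the schedule is (stages - 1) / (micro-batches + stages - 1). The stage and micro-batch counts below are illustrative, and the function name is hypothetical.

```python
def bubble_fraction(num_stages, num_microbatches):
    """Idle share of a fill-and-drain pipeline schedule.

    Assumes every stage takes the same time per micro-batch.
    """
    if num_stages < 1 or num_microbatches < 1:
        raise ValueError("need at least one stage and one micro-batch")
    return (num_stages - 1) / (num_microbatches + num_stages - 1)


if __name__ == "__main__":
    stages = 8  # illustrative pipeline depth
    for microbatches in (1, 4, 8, 32, 128):
        frac = bubble_fraction(stages, microbatches)
        print(f"{stages} stages, {microbatches:3d} micro-batches: "
              f"{frac:.1%} of the schedule is idle")
```

Adding micro-batches shrinks the bubble but never removes it, and each extra micro-batch in flight also holds activations, so the Bubble trades off against the Scratchpad.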
Leo: Formal links for all nine sources are in the episode notes — and this is genuinely a topic where the diagrams and implementation notes help.
Maya: So the compressed takeaway: distributed training is not "use more GPUs." It's deciding what to copy, what to split, what to move, and what to recompute — and knowing which bottleneck you're choosing to pay.
Leo: Then here's the question to carry into the deep dives. The next time someone announces a giant model — which bottleneck did they remove, and which new one did they just create?
Source material
- GPipe: Efficient Training of Giant Neural Networks Using Pipeline Parallelism
- Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
- ZeRO: Memory Optimizations Toward Training Trillion Parameter Models
- Fully Sharded Data Parallel: Faster AI Training with Fewer GPUs
- Research on Distributed Training Architecture for Large Scale Models for Natural Language Processing
- FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
- FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
- Embarrassingly Simple Self-Distillation Improves Code Generation (SSD)
- Test-Time Scaling Makes Overtraining Compute-Optimal