T3E3 · Topic 3: Advanced Distributed Training — Overcoming Bottlenecks · 00:11:04

ZeRO: Removing Memory Redundancy from Data Parallel Training

A deep dive into ZeRO’s memory model: optimizer states, gradients, and parameters are not sacred copies; they can be partitioned across workers.

Transcript

Generated: 2026-05-10 03:14 UTC

---

MayaBefore we jump in, here's a quick setup for this episode on t3e3_zero_memory sharding. You'll hear Maya and Leo work through the topic together.

MayaSometimes the biggest waste in a training run is not the model. It is every G P U keeping the same giant notebook of model state.

LeoThe story starts with a practical annoyance: the bottleneck is not one thing. It is a stack of limits, and this episode picks one layer of that stack.

MayaMegatron split computations inside layers. ZeRO attacks a different problem: data parallel training quietly duplicates enormous amounts of state on every worker.

LeoSo the listener should not picture a neat whiteboard equation only. They should picture a training job as a living system: chips, memory, network links, data loaders, kernels, and a clock that punishes every idle moment.

MayaExactly. ZeRO asks: if every G P U is part of one team, why should every G P U carry a full copy of the optimizer states, gradients, and parameters all the time?

LeoGive me the everyday version before we go technical.

MayaWith Adam, the optimizer does not store only parameters. It also stores momentum-like statistics for each parameter. If every G P U keeps all parameters, all gradients, and all optimizer states, the memory bill multiplies fast. ZeRO says: shard those states across the group and gather what is needed when it is needed.

LeoNice. That makes the paper feel less abstract. But what is the specific move?

MayaFirst, zeRO stands for Zero Redundancy Optimizer, and it targets memory redundancy in data-parallel and model-parallel training.

LeoLet’s translate that out of paper language. In other words, the system is not only asking, “Can we compute this?” It is also asking, “Can we store it, move it, and synchronize it before the GPUs go idle?”

MayaSecond, the paper analyzes how sharding optimizer states, gradients, and parameters can make model size scale roughly with the number of devices while retaining data-parallel style computation.

LeoThat is the part many people miss. The win is not just academic elegance. The win is making a training run cross a hard boundary: fits versus does not fit, stalls versus keeps flowing, theoretical speed versus actual wall-clock speed.

MayaThird, it reports training models over 100 billion parameters on 400 GPUs and argues that the approach can scale beyond one trillion parameters with contemporary hardware.

LeoSo we should treat this as a design pattern, not a one-off trick. The pattern is: find the hidden resource that is being wasted, then redesign the training loop around it.

MayaZeRO is especially useful because it gives names to different memory categories. Parameters are the model weights. Gradients are the updates computed during backpropagation. Optimizer states are the extra buffers, like Adam’s moments, used to decide the update.

LeoAnd the surprise for newcomers is that optimizer states can be larger than the parameters themselves.

MayaExactly. If each parameter has multiple optimizer values associated with it, full replication across every G P U becomes painfully expensive.

LeoSo ZeRO stages are like progressively more aggressive ways to stop duplicating state.

MayaYes. Shard optimizer states first, then gradients, then parameters. Each stage changes the memory-communication trade-off.

LeoThat progression is a great teaching tool: do not ask, “Do we shard?” Ask, “How much state are we willing to shard, and what communication are we willing to pay?”

LeoHere is the expert disagreement I hear underneath this episode: Is sharding better than model parallelism?

MayaThe strongest arguments are both reasonable. Sharding is easier to adopt because it preserves much of the data-parallel training style. Model parallelism can be faster for some architectures because it splits computation directly. In practice, large systems often combine both.

LeoAnd the practical answer is usually not ideological. It depends on the model shape, sequence length, hardware topology, training objective, and how much engineering time the team has.

MayaZeRO saves memory, but it can increase communication complexity. You are trading local storage for carefully timed gathers and reduces.

LeoLet’s add a concrete listener check. If you were debugging this in a real training run, what would you measure first?

MayaI would measure peak memory by category: parameters, gradients, optimizer state, activations, temporary buffers, and communication buffers. Then I would check whether sharding just moved the pain into network time.

LeoThat makes this episode useful beyond the paper. It gives the listener a diagnostic habit. Do not memorize the technique first. Find the bottleneck first.

MayaZeRO also changes how we think about data parallelism. Classic data parallelism feels simple because every worker has the full model. ZeRO keeps the high-level data-parallel flavor but removes the assumption that every worker must store every piece of state.

LeoThat is important for usability. A researcher can often keep a familiar training loop while the system handles the more complicated memory layout underneath.

MayaBut the price is timing. If a parameter shard is not local when a layer needs it, the system must fetch it. If a gradient shard needs to be reduced, the system must coordinate it. The difference between elegant and slow can be overlap.

LeoIn plain language: ZeRO saves your backpack by making the team share supplies, but now the team has to pass supplies around at exactly the right time.

MayaHere is a simple memory budget. For each parameter, you may store the parameter value, the gradient, and two Adam statistics. In mixed precision training, there may also be master weights or other buffers. Multiply that by billions of parameters and then by every data-parallel worker.

LeoThat multiplication is brutal. A model that sounds like it should fit based on parameter count may fail because the training states are several times larger than the weights alone.

MayaZeRO’s stages can be heard as a sequence of questions. Do all workers need every optimizer state locally? Do all workers need every gradient locally? Do all workers need every parameter locally at every moment?

LeoAnd the answer becomes: maybe not. They need access at the right time, not permanent ownership.

MayaThat distinction is powerful. It turns memory from a static possession problem into a scheduling problem. The system can gather, compute, reduce, and release.

LeoBut that also means the network is part of the optimizer story. ZeRO is a memory paper, but its success depends on communication being disciplined enough not to erase the gain.

MayaLet’s make the redundancy vivid. In ordinary data parallelism with eight workers, each worker may store the same optimizer state. If that state consumes hundreds of gigabytes in total, you are paying that bill eight times. ZeRO asks why the team is carrying eight identical sets of tools.

LeoAnd the answer used to be convenience. Full copies make each worker self-contained. ZeRO trades that convenience for coordination.

MayaThe coordination can be worth it because model state grows with parameter count. As models scale, memory can become the hard wall before arithmetic throughput is the limiting factor.

LeoThat makes ZeRO feel almost like a distributed database for training state. The data exists somewhere in the group, not everywhere all at once.

MayaBefore we close, let’s turn this into an implementation review. If a team brought you a ZeRO-style training setup, the first thing to ask is not whether the method sounds modern. It is whether the measurements show the right problem being solved.

LeoThe review checklist would include optimizer-state memory, gradient sharding, parameter gathering, and communication overlap. That list sounds detailed, but it prevents one-dimensional thinking. A training run can improve one metric and quietly damage another.

MayaAnother review habit is to compare against the simplest baseline. If the fancy method beats a weak baseline but loses to a tuned simple setup, the story is not finished. Distributed training papers are strongest when they cross a hard boundary and still preserve efficiency.

LeoAnd the hardest question is: Are we saving memory at the cost of synchronized waiting that dominates step time? That question keeps the listener grounded in engineering reality instead of buzzwords.

MayaI also like asking what the method makes easier for future work. Some techniques are valuable because they train one model faster. Others are valuable because they let many teams explore model sizes, sequence lengths, or training recipes that were previously out of reach.

LeoThat is the broader theme of Topic 3. Distributed training is not just about bragging rights. It changes the experiments researchers can afford to run, and those experiments shape the models everyone else later uses.

MayaThe practical summary is: ZeRO asks: if every G P U is part of one team, why should every G P U carry a full copy of the optimizer states, gradients, and parameters all the time?

LeoAnd for anyone who wants to go deeper, we’ll include the primary material and extra reading notes in the episode metadata. This is one of those topics where diagrams, source links, and implementation notes really help.

MayaIf memory redundancy is the silent tax of scale, where else in the training loop are we paying for copies we do not actually need?

LeoHold onto that question. In the next episode, we keep following the same theme: find the bottleneck, then decide whether to split, shard, move, recompute, or rethink the training plan.

CreditsThanks for listening. The producer is William Liu. Join us for the next episode.

Source material

https://arxiv.org/pdf/1910.02054

← Back to Mastering Language Models: From Architecture to Optimization