T7E0 · Jun 19, 2026 · 00:14:37

Mixture-of-Experts Models and Handling Massive Models: Sparsity, Data, and Optimization

Maya and Leo open Topic 7 by going underneath massive models to the machinery that makes them possible. Using one small lab training a large sparse model on a fixed ninety-day compute budget, they explain sparsity and Mixture-of-Experts, the router and load balancing, why diverse training data is a design choice, how AutoML automates tuning, and what it takes for any of it to converge. Along the way they stage the field's real debates as a live argument and preview five deep dives: the sparsely-gated MoE layer, PDE pre-training, the Pile, AutoML, and optimization.

Transcript

Maya: Picture a small lab with one fixed compute budget — a wall of rented machines for ninety days and not an hour more. They want a model with the capacity of something ten times bigger than they can afford. So they build it strangely: hundreds of little specialist networks side by side, and a tiny traffic cop in front that, for each word, wakes up only two of them and leaves the rest asleep.

Leo: So the model is huge on paper, but most of it is dark at any moment.

Maya: Most of it is dark. The parameter count says four hundred billion. The work per word says seventeen billion. That gap — that's the whole topic.

Leo: Huh.

Maya: That's the move underneath every massive model today. "Big" stopped meaning "expensive per token" a while ago.

Leo: But hold on — we just spent all of Topic Six on open models. Llama, DeepSeek, which bench you can own. How is this not a repeat?

Maya: Good — that's the orientation. Topic Six was about *choosing* a model. This topic goes underneath the bench, to the machinery. Why a sparse model can even exist, what you feed it, and what makes any of it converge.

Leo: So less "which model," more "how is the thing even possible."

Maya: Right. And the central idea is one word we should ground now. Sparsity.

Leo: Plain version, because it sounds like jargon. Sparse means: most of the thing is switched off most of the time. In a dense model, every parameter fires for every word — every token pays for the entire model.

Maya: And sparse breaks that contract. The parameter count grows enormous, but each word touches only a small slice. Capacity up; cost per word barely moves.

Leo: Which sounds like a free lunch, and I get suspicious around free lunches.

Maya: [chuckle] You should. We'll find the bill. But first the mechanism — the piece that makes it work is a Mixture of Experts — M-o-E.

Leo: Define "expert," because it's a loaded word.

Maya: An expert is just one of those little sub-networks. Nothing mystical — a small feed-forward block. You stack a lot of them in a layer. Then you add the component that runs the show — the router.

Leo: The traffic cop from the opening.

Maya: The router. For each word, it decides — these two experts handle this one, the rest stay asleep. That decision, fresh for every token, is the heart of it. The idea goes back to a two thousand seventeen paper with a wonderful title — "Outrageously Large Neural Networks." Shazeer's team showed you could grow capacity by a factor of a thousand and barely pay for it in compute.
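
To make the per-token routing decision concrete, here is a minimal pure-Python sketch. It assumes a toy setup that is not from the episode: each expert is a hypothetical function that scales its input, the router is a plain matrix of scores, and the gate weights come from a softmax over only the selected experts. The names `route_top_k` and `moe_layer` are invented for this example, and real layers add details (such as noise and batching) that are left out here.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, k=2):
    """Pick the k highest-scoring experts and normalize their gate weights."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:k]
    gates = softmax([router_logits[i] for i in chosen])
    return list(zip(chosen, gates))

def moe_layer(token, experts, router_weights, k=2):
    """Run one token through only the k experts the router selects."""
    # One router score per expert: dot product of the token with that expert's row.
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    output = [0.0] * len(token)
    for expert_index, gate in route_top_k(logits, k):
        expert_out = experts[expert_index](token)  # the other experts never run
        output = [o + gate * v for o, v in zip(output, expert_out)]
    return output

# Hypothetical toy experts: expert i multiplies every feature by (i + 1).
experts = [lambda v, scale=i + 1: [scale * x for x in v] for i in range(4)]
# Hypothetical router matrix, one row of scores per expert.
router_weights = [[0.0, 0.0], [3.0, 0.0], [1.0, 0.0], [2.0, 0.0]]

print(route_top_k([0.0, 3.0, 1.0, 2.0]))
print(moe_layer([1.0, 1.0], experts, router_weights))
```

Only two of the four toy experts are ever called for the token, which is the "rest stay asleep" behaviour in miniature.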

Leo: A thousand times the capacity. That's not a tweak, that's a different cost curve.

Maya: A different cost curve. And that's our first deep dive — the origin. The sparsely-gated layer that started all of this.

Leo: Let me make it current, though. Take an open model people run right now — one of the GPT-OSS releases. Roughly a hundred and seventeen billion parameters total. Active per word? About five billion.

Maya: So you carry a hundred-and-seventeen-billion-parameter model in memory—

Leo: —but pay like a five-billion model on every token. That's the trade in one number. It's why nearly every serious open model shipped in the last year is sparse. It stopped being exotic.

Maya: That trade is the first thing every practitioner in this area internalizes. They separate two things beginners glue together: total capacity and active compute. Parameters are what you *know*; compute is what you *spend thinking* about this word. Sparsity lets you know a lot while thinking cheaply.
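
The split between total capacity and active compute can be checked with simple arithmetic. The sketch below assumes a simplified accounting that is not from the episode: a block of shared parameters that every token uses, plus identical experts of which only a few run per token. The function names and the 64-expert configuration are hypothetical; the 400/17 and 117/5 figures are the ones quoted in the conversation, in billions.

```python
def moe_param_counts(shared, per_expert, num_experts, active_experts):
    """Return (total, active) parameter counts under a simplified MoE accounting."""
    total = shared + per_expert * num_experts      # what you store
    active = shared + per_expert * active_experts  # what one token touches
    return total, active

def active_fraction(total_params, active_params):
    """Share of the model that does work for a single token."""
    return active_params / total_params

# Hypothetical configuration: 1B shared, 64 experts of 0.5B each, 2 active per token.
total, active = moe_param_counts(1_000_000_000, 500_000_000, 64, 2)
print(total, active)

# Figures quoted in the episode, in billions of parameters.
print(round(active_fraction(400, 17), 4))  # the lab's model
print(round(active_fraction(117, 5), 4))   # the GPT-OSS example
```

In both quoted cases under this accounting, only a few percent of the stored parameters do work on any one token.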

Leo: Right.

Maya: Now — back to the lab and their ninety-day wall of machines. They picked sparse. Where does the bill come due?

Leo: Load balancing. Here's where I'd push. The router is free to play favorites. If it learns experts one and two are reliable, it sends everything to one and two. The rest never train. You paid for a huge model and got a small one wearing a big coat.

Maya: That's the real tax, and it's not optional. So you add a load-balancing pressure — a gentle penalty nudging the router to spread traffic across all the experts. Load balancing is the discipline that stops the router from overusing a few experts and starving the rest.

Leo: And it fights you. Push it too hard and you force words to the wrong experts. Too soft and you get the lazy-router collapse.
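
One way to picture the "gentle penalty" is a term that grows when a few experts receive most of the gate weight. The sketch below is an assumption for illustration, not the episode's method: it uses the squared coefficient of variation of per-expert importance (the summed gate weight each expert receives over a batch), computed with population variance for simplicity. The `weight` argument is the hypothetical dial Leo describes: too large and it overrides the router's preferences, too small and collapse goes unpunished.

```python
def importance(batch_gates, num_experts):
    """Sum the gate weight each expert receives across a batch of tokens."""
    totals = [0.0] * num_experts
    for token_gates in batch_gates:            # one list of (expert, gate) per token
        for expert_index, gate in token_gates:
            totals[expert_index] += gate
    return totals

def cv_squared(values):
    """Squared coefficient of variation: variance divided by mean squared."""
    mean = sum(values) / len(values)
    if mean == 0:
        return 0.0
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return variance / (mean ** 2)

def load_balance_penalty(batch_gates, num_experts, weight=0.01):
    """Auxiliary loss term: zero when traffic is even, larger when it is skewed."""
    return weight * cv_squared(importance(batch_gates, num_experts))

# Balanced batch: two tokens that together use all four experts equally.
balanced = [[(0, 0.5), (1, 0.5)], [(2, 0.5), (3, 0.5)]]
# Collapsed batch: every token goes to experts 0 and 1, the others starve.
collapsed = [[(0, 0.5), (1, 0.5)] for _ in range(4)]

print(load_balance_penalty(balanced, num_experts=4))
print(load_balance_penalty(collapsed, num_experts=4))
```

The balanced batch incurs no penalty, while the collapsed batch does, which is the pressure that pushes the router to keep all experts in use.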

Maya: And that tension is the first real disagreement in the field — and it's live. Let me take the sparse-maximalist side: keep scaling experts, because conditional computation is the only way the capacity curve stays affordable. The win is structural — more model for the same inference dollar, and that compounds.

Leo: And I'll take the dense-and-boring side, because I deploy these things. The router is a discrete decision — jumpy, hard to train smoothly. Experts under-utilize. And serving it? You hold all four hundred billion parameters in memory even though you use a sliver per token. My memory bill didn't drop at all.

Maya: Fine — the memory point survives. That's the honest cost. You compute like a small model and *store* like a huge one.

Leo: That's not a footnote, that's the whole serving budget. A dense model half the size can be cheaper to actually run.
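
The "compute like a small model, store like a huge one" point can be put in rough numbers. This sketch rests on assumptions that are not from the episode: weights stored at 2 bytes per parameter, the common rule of thumb of about 2 floating-point operations per active parameter for one token's forward pass, and a hypothetical dense model of 200 billion parameters as the "half the size" comparison. It ignores activations, caches, and any routing overhead.

```python
def weight_memory_gb(total_params, bytes_per_param=2):
    """Memory needed just to hold the weights, in gigabytes (1 GB = 1e9 bytes)."""
    return total_params * bytes_per_param / 1e9

def forward_flops_per_token(active_params):
    """Rule-of-thumb forward-pass cost: about 2 FLOPs per active parameter."""
    return 2 * active_params

# Sparse model from the episode: 400B stored, 17B active per token.
sparse_memory = weight_memory_gb(400_000_000_000)
sparse_flops = forward_flops_per_token(17_000_000_000)

# Hypothetical dense model half the size: every parameter is active.
dense_memory = weight_memory_gb(200_000_000_000)
dense_flops = forward_flops_per_token(200_000_000_000)

print(sparse_memory, dense_memory)  # storage side of the bill
print(sparse_flops, dense_flops)    # per-token compute side of the bill
```

Under these assumptions the sparse model needs twice the weight memory of the dense one while doing far less arithmetic per token, which is why the two speakers can both be right about different parts of the bill.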

Maya: I'll concede the serving complexity. What I won't concede is the ceiling. Past a point, dense can't get there — the capacity to be good across many domains gets too expensive to make every token pay for. Past that line, sparse isn't a preference, it's the only road.

Leo: Granted — at the frontier you're right. Down where most teams live, I'd still bet dense more often than the hype says. So we agree on the shape, disagree on where the line sits.

Maya: That's the honest place to land. The line is empirical — it moves with your hardware and your workload. Hold that, because it returns the moment we ask what you *feed* this thing.

Leo: Go on.

Maya: A model only knows what its training data taught it. So if our lab builds this big sparse brain full of specialists — what do they specialize *on*?

Leo: Whatever's in the corpus. [chuckle] Garbage in, garbage experts.

Maya: That question is exactly where the Pile comes in — one of the sources waiting for us. A roughly eight-hundred-gigabyte corpus deliberately built to be *diverse* — stitched from twenty-two sources. Not just scraped webpages. Books, code, medical papers, physics, math, philosophy, legal text.

Leo: And the bet is specific, right? Not "more data." It's "more *different* data."

Maya: Here's the next mental model worth carrying. Diversity is a design choice, not an accident. A model forced to handle many disparate domains — books *and* code *and* medical text — generalizes better than one fed a giant pile of similar webpages.

Leo: I buy the intuition. Push me on it, though — is diversity always good? I've seen the other camp.

Maya: You've put your finger on the next disagreement. The Pile's camp says maximize diversity, cover the world, build broad cross-domain understanding. Their strongest argument: narrow data gives you a brittle model that falls over the moment a user asks something off-distribution.

Leo: And the curation camp pushes back hard, and I'm sympathetic. Diversity for its own sake just imports noise. A smaller, ruthlessly cleaned corpus can beat a bigger messy one token for token. Their strongest point: every gigabyte of junk is compute spent learning junk. Quality is a multiplier; quantity past a point is just cost.

Maya: Both have been right in different years, honestly. It depends on what's scarce. When data's the bottleneck you stretch diversity; when compute's the bottleneck you curate.

Leo: And it lands right back on our lab. They can't waste a week of that ninety-day wall learning duplicate forum posts.

Maya: Right — and notice their data mix is already an optimization problem, before training even starts. And that word — optimization — is where the topic gets practical.
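
The data mix can be written down as a set of sampling weights, which is what makes it something to optimize. The sketch below is a hypothetical illustration: the source names, sizes, and repeat counts are invented, not taken from the Pile or the episode. It assumes each source's share of training is proportional to its size times the number of passes chosen for it, so a small, high-value source can be upweighted by repeating it.

```python
import random

def mixture_probabilities(sizes_gb, passes):
    """Turn per-source sizes and repeat counts into sampling probabilities."""
    effective = {name: sizes_gb[name] * passes[name] for name in sizes_gb}
    total = sum(effective.values())
    return {name: amount / total for name, amount in effective.items()}

def sample_sources(probabilities, count, seed=0):
    """Draw which source each of the next `count` training documents comes from."""
    rng = random.Random(seed)  # seeded so the draw is repeatable
    names = list(probabilities)
    weights = [probabilities[name] for name in names]
    return rng.choices(names, weights=weights, k=count)

# Hypothetical corpus: web text is largest, code and papers are repeated twice.
sizes_gb = {"web": 100, "code": 50, "papers": 50}
passes = {"web": 1, "code": 2, "papers": 2}

probabilities = mixture_probabilities(sizes_gb, passes)
print(probabilities)
print(sample_sources(probabilities, count=10))
```

Changing the `passes` values is the dial: it trades breadth against repetition of the sources the team trusts most.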

Leo: Because somebody has to tune all this. Expert count, routing penalty, data mix, learning rate. A brutal number of dials.

Maya: That's where AutoML enters — another of the sources ahead. Say it as "auto" plus "M-L" — automated machine learning. Instead of a human turning every dial by hand, an automated search turns the dials and measures.

Leo: The evolution being — we used to have one tired graduate student babysitting hyperparameters at three in the morning—

Maya: [laugh] —and now a search procedure does the babysitting. AutoML is the arc from hand-tuned pipelines to systems that automatically pick architectures, settings, and data recipes.

Leo: There's a real split here too. The AutoML camp says automate the search — humans are bad at high-dimensional tuning, machines aren't. Strongest argument: a careful search finds configurations no human would think to try, and never tires.

Maya: The expert-craft camp answers back: automated search is wildly expensive — for our ninety-day lab it can burn the whole budget just *looking* for a setting instead of training. Their strongest point: an experienced practitioner's prior is worth a thousand blind trials. You start near the answer.

Leo: So it's compute spent searching versus compute spent training. Same scarce wall.
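
The search-versus-training trade can be sketched with the simplest automated tuner, random search. Everything specific here is an assumption for illustration: `toy_validation_loss` is a made-up stand-in for a real training run, the search space and the cost of one day per trial are hypothetical, and real AutoML systems use more sophisticated strategies than blind sampling.

```python
import math
import random

def toy_validation_loss(config):
    """Hypothetical stand-in for training a model and measuring its error."""
    lr_term = (config["log10_lr"] + 3.0) ** 2
    expert_term = 0.1 * abs(math.log2(config["num_experts"]) - 5)
    return lr_term + expert_term

def random_search(objective, space, trials, seed=0):
    """Sample `trials` random configurations and keep the best one seen."""
    rng = random.Random(seed)
    best_config, best_loss = None, float("inf")
    for _ in range(trials):
        config = {name: sampler(rng) for name, sampler in space.items()}
        loss = objective(config)
        if loss < best_loss:
            best_config, best_loss = config, loss
    return best_config, best_loss

def days_left_for_training(total_days, trials, days_per_trial):
    """Whatever the search spends comes out of the same fixed budget."""
    return total_days - trials * days_per_trial

# Hypothetical search space: learning-rate exponent and number of experts.
space = {
    "log10_lr": lambda rng: rng.uniform(-5.0, -1.0),
    "num_experts": lambda rng: rng.choice([8, 16, 32, 64]),
}

for trials in (5, 50):
    config, loss = random_search(toy_validation_loss, space, trials)
    print(trials, round(loss, 4), days_left_for_training(90, trials, 1))
```

More trials can only match or improve the best loss found, but each one subtracts from the ninety days, which is the same tension the two camps argue over.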

Maya: That's the through-line of the whole topic. Every decision — sparse or dense, diverse or curated, automated or hand-tuned — is the same lab arguing over the same fixed budget.

Leo: Right. One wall, many bills.

Maya: The final source decides whether any of it even works: optimization itself. The methods that make a giant model *converge* instead of wobbling forever or blowing up.

Leo: Define converge, plainly.

Maya: The training settles — the model's error stops bouncing and steadily comes down to something stable and good. The optimizer is the procedure deciding how to nudge the parameters each step to get there.
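
Settling, wobbling forever, and blowing up can all be seen on a one-parameter toy problem. The sketch below assumes plain gradient descent on the hypothetical loss (w - 3)^2, which is far simpler than anything used to train a large model; the only thing that changes between runs is the step size.

```python
def gradient_descent(learning_rate, steps, start=0.0, target=3.0):
    """Minimize (w - target)^2 with plain gradient descent; return every w visited."""
    w = start
    history = [w]
    for _ in range(steps):
        gradient = 2.0 * (w - target)   # derivative of (w - target)^2
        w = w - learning_rate * gradient
        history.append(w)
    return history

settles = gradient_descent(learning_rate=0.1, steps=100)  # error shrinks each step
wobbles = gradient_descent(learning_rate=1.0, steps=100)  # bounces between two values
blows_up = gradient_descent(learning_rate=1.1, steps=50)  # error grows each step

print(settles[-1])
print(wobbles[-3:])
print(abs(blows_up[-1] - 3.0))
```

On this toy loss each step multiplies the distance from the target by (1 - 2 × learning rate), so a factor below one in size converges, exactly minus one oscillates, and anything larger in size diverges.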

Leo: And at this scale that's not solved. A sparse model with a jumpy router and a wild data mix can refuse to converge. The optimizer is load-bearing.

Maya: That closing source is a systematic review of modern optimization methods — the unglamorous engine room. And here's the twist that closes the loop: this sparse idea isn't even just about language.

Leo: Right, the physics one.

Maya: The second deep dive points the same Mixture-of-Experts machinery at physics — at the equations describing how fluids flow and heat spreads. They're called P-D-Es, partial differential equations. Same router, same sleeping experts, different domain.

Leo: Which proves the idea is structural, not a language trick. If sparse routing helps a physics model the way it helps a text model, that's something fundamental about building big systems.

Maya: That's the cleanest summary. Sparsity, data, and optimization aren't three subjects — they're three views of one question: how do you build enormous capacity a small lab can afford to train and run?

Leo: Quick vocabulary lap before we close — half these words sound like marketing until you pin them.

Maya: Mixture of Experts means a layer of many small specialist networks where only a few wake up for each word.

Leo: Sparse means most of the model is switched off for any given token, so capacity and cost stop being the same thing.

Maya: The router is the small component that decides, per word, which experts to activate.

Leo: An expert is one of those small specialist sub-networks inside the layer.

Maya: Load balancing means the pressure that keeps the router from overusing a few experts and starving the rest.

Leo: Capacity means how much the model can hold and represent — roughly, its total parameter count.

Maya: AutoML means automated machine learning — a search procedure picks architectures, settings, and data recipes instead of a human tuning by hand.

Leo: And the optimizer is the procedure that nudges the parameters each step so training converges — settles into a stable, low-error state.

Maya: So that's the map. The origin of sparse scaling, the same idea carried into physics, the diverse corpus that feeds it, the automation that tunes it, the optimization that makes it converge — all circling one small lab and one fixed wall of machines.

Leo: And here's what we'd leave you sitting with: if you had that ninety-day budget, would you spend it on a bigger sparse model, cleaner data, or a smarter search for the settings — and what does that choice say about which problem you actually believe is hardest?

