T2E0 · Apr 26, 2026 · 00:14:16

T2E0 · Scaling and Training Large Models Efficiently

Topic 2 opens with the question the Transformer made urgent: once you can build big, how should one fixed training budget be split between model size, training tokens, and data quality? Maya and Leo stage the scale-first versus compute-optimal argument in its strongest forms, introduce the smooth-curve predictability of scaling laws, the four interacting knobs of scale, and the two-bills view of training versus inference cost — then map the three deep-dives: Kaplan's scaling laws, Chinchilla's budget correction, and the data-constrained regime where fresh text runs short.

Transcript

Maya: There's a whiteboard I keep picturing. A planning room, one number written in the corner — the total compute this team gets to spend, once. And two engineers at the board sketching opposite recipes. One draws the biggest model that number can buy. The other draws a model half the size, fed twice the text.

Leo: Same budget for both. Same chips, same weeks, same electricity bill.

Maya: Same everything. The only difference is how they split it. And one of those recipes produces the better model — by a margin that embarrassed a lot of smart people.

Leo: Which one?

Maya: Not in the first minute. That argument is this entire topic.

Leo: [chuckle] Fine, I'll wait. So — last topic we sat inside the Transformer. Attention, the architecture, and why its appetite for parallel hardware made giant models buildable at all.

Maya: Architecture made scale possible. Topic 2 asks the uncomfortable follow-up: scale *how*? Once you can build big, how big should you build, how much data should the model see, and how long should you train?

Leo: Two papers anchor the fight. Scaling Laws for Neural Language Models — the Kaplan paper, from twenty twenty. And Training Compute-Optimal Large Language Models, from twenty twenty-two — everyone just calls it Chinchilla.

Maya: Plus a third that haunts them both: Scaling Data-Constrained Language Models, on what happens when the supply of fresh text starts to feel finite. All three are in the show notes.

Leo: And underneath all three sits one question: resource allocation. Fixed training budget — do you spend it on more parameters, more tokens, or more training steps? Although — before we trade anything off, pin those down. What are we actually allocating?

Maya: Parameters are the model's learned numbers — its capacity. Tokens are the pieces of text it trains on — its experience. And compute is the calculation spent during training, which is roughly tied to model size times the number of tokens you push through it.

Leo: So under a fixed budget, those last two pull against each other. Bigger model, fewer tokens. Smaller model, more tokens.
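
The tension Leo describes can be put into arithmetic. The sketch below assumes training compute is roughly 6 × parameters × tokens in FLOPs; the factor of 6 is a commonly used rule of thumb and is not stated in the episode, and the model sizes are hypothetical. It shows that the two whiteboard recipes (one model, or a model half the size fed twice the text) cost the same.

```python
# Assumption: training compute ~ 6 * params * tokens FLOPs.
# The factor 6 is a common rule of thumb, not a figure from the episode.
FLOPS_PER_PARAM_TOKEN = 6.0


def training_flops(params: float, tokens: float) -> float:
    """Approximate training compute for a model that sees `tokens` tokens."""
    return FLOPS_PER_PARAM_TOKEN * params * tokens


def tokens_for_budget(budget_flops: float, params: float) -> float:
    """Tokens a model of size `params` can see under a fixed compute budget."""
    return budget_flops / (FLOPS_PER_PARAM_TOKEN * params)


# Two hypothetical recipes: same budget, opposite splits.
big_recipe = training_flops(params=2e9, tokens=20e9)
small_recipe = training_flops(params=1e9, tokens=40e9)

print(f"big model:   {big_recipe:.2e} FLOPs")
print(f"small model: {small_recipe:.2e} FLOPs")
print(f"tokens left for a 4e9-param model: {tokens_for_budget(big_recipe, 4e9):.2e}")
```

Under this approximation, doubling the model always halves the tokens the same budget can pay for, which is why the split has to be chosen rather than maximized on both sides.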

Maya: That's the tension. My loose picture is building a student. Parameters are memory and reasoning capacity. Tokens are the study material. Compute is study time. A huge brain with a thin textbook underlearns —

Leo: — and a tiny brain in an endless library hits a ceiling. Sure. I'd flag where it creaks, though: a student chooses what to reread. A training run mostly doesn't.

Maya: Fair creak. But as a budget picture it holds, and it sets up the running example we'll reuse across this whole topic.

Leo: Go on, set it up.

Maya: Call them the one-run team: a small group with funding for exactly one pretraining run of their assistant model. They will not get a second run. And whatever they train, they then have to serve to users every single day.

Leo: I like that last clause, because it's the part teams forget. Training is a one-time bill. Serving is forever.

Maya: Hold that thought — it comes back with a vengeance.

Leo: Noted.

Maya: First, the mental models that serious scaling people actually share, whatever camp they're in. The one I'd start with is the smooth curve. Loss tends to improve predictably as model size, data, and compute grow — not in lurches, in smooth power-law trends, at least over broad ranges.

Leo: And that's the genuinely strange empirical fact under this whole topic. You can fit a curve on small cheap runs and forecast what a run a thousand times larger will do. It turned scaling from gambling into something closer to engineering.
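
Fitting a curve on small runs and forecasting a much larger one can be sketched in a few lines. The example assumes a pure power law, loss ≈ a · x^(-b), with no irreducible-loss term, and it uses synthetic data generated from made-up constants; none of the numbers come from the papers discussed. A power law is a straight line in log-log space, so an ordinary least-squares fit on the logs recovers it.

```python
import math


def fit_power_law(xs, losses):
    """Fit loss ~ a * x**(-b) by least squares on (log x, log loss)."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in losses]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    sxx = sum((u - mean_x) ** 2 for u in log_x)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(log_x, log_y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope  # (a, b)


def predict_loss(a: float, b: float, x: float) -> float:
    """Evaluate the fitted power law at scale x."""
    return a * x ** (-b)


# Synthetic "small cheap runs": hypothetical constants, noise-free for clarity.
TRUE_A, TRUE_B = 20.0, 0.08
small_runs = [1e6, 3e6, 1e7, 3e7, 1e8]
observed = [TRUE_A * x ** (-TRUE_B) for x in small_runs]

a, b = fit_power_law(small_runs, observed)
forecast = predict_loss(a, b, 1e11)  # a run 1000x larger than the biggest fit point
print(f"fitted a={a:.3f}, b={b:.4f}, forecast loss at 1e11: {forecast:.3f}")
```

Real measurements are noisy and the trend only holds over a limited range, so a forecast like this is a planning estimate and not a guarantee.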

Maya: Next comes the four knobs. Bigger is not one knob — you can grow the model, grow the dataset, train longer, or raise data quality. And they interact. Turning one without the others is how budgets get wasted.

Leo: Mm-hm.

Maya: Then there's the waste audit. Efficiency here mostly means avoiding waste — a model can be too small for its data, too large for its data, trained too briefly, or fed the same text too many times.

Leo: And the last one is mine: data is not just volume. Quality, diversity, deduplication, domain mix, contamination. A trillion tokens can be gold or sludge depending on where they came from.

Maya: And that's where the camps split. So let's actually have the argument instead of describing it. I'll take the scale-first side — the early reading of the Kaplan results — and I'll defend it properly, because it was never naive.

Leo: And I'll take the side the field landed on after Chinchilla. Go.

Maya: The strongest progress this field has ever produced came from one move: more compute, bigger models, guided by observed laws. Not clever tricks — scale. And the Kaplan results said model size was the dominant term. Big models are more sample-efficient: they learn more from every token they see. So if you must err, err big, even if the model stops short of full convergence. History kept rewarding exactly that bet.

Leo: Until someone reran the budget math. The Chinchilla work went back, fixed how the trade-off was measured, and found that a whole generation of giant models was undertrained — huge capacity, starved of tokens. Same compute, smaller model, way more data: better model. That's not a vibe, that's the headline result.

Maya: I'll grant the bookkeeping — for a fixed training budget, the early recipe overweighted size. But you're quietly dropping my strongest card: capability. Large models can unlock behaviors smaller ones simply don't show, and a loss curve doesn't tell you which abilities arrive at which scale.

Leo: Fine — the sample-efficiency claim survives, and the capability question is real. But here's the card you're dropping: the second bill. Your giant model wins the training-paper comparison and then bleeds the one-run team dry at inference, every query, forever. A smaller, better-trained model is cheaper to serve and easier to deploy. For a product, that's not a footnote —

Maya: — that's the business. I know. [sigh] And honestly, that's where the steam goes out of my side. Scale-first answers "what's the best model we can train?" The one-run team is asking "what's the best model we can afford to *run*?"

Leo: So here's where the evidence actually lands. Under a fixed training budget, parameters and tokens should grow much more evenly than the early reading said. That part Chinchilla settled, and we'll spend a full episode on how.

Maya: And the open part?

Leo: Whether sheer size unlocks capabilities that no amount of extra data buys back. That one's still genuinely open.

Maya: Agreed on both. And notice the phrase doing the heavy lifting there — compute-optimal. It does not mean best model possible. It means best use of a *fixed training budget*, under a set of assumptions.
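
One way to picture "parameters and tokens grow evenly" is to hold the tokens-per-parameter ratio fixed and solve for both from the budget. The sketch assumes compute ≈ 6 × parameters × tokens and a ratio of 20 tokens per parameter; that ratio is a widely quoted rule of thumb associated with the Chinchilla results, the episode itself gives no number, and it should be treated as an adjustable assumption. The budgets are hypothetical.

```python
import math

FLOPS_PER_PARAM_TOKEN = 6.0  # assumption: compute ~ 6 * N * D


def compute_optimal_split(budget_flops: float, tokens_per_param: float = 20.0):
    """Split a budget so that tokens = tokens_per_param * params.

    Solving 6 * N * (r * N) = C gives N = sqrt(C / (6 * r)).
    The default ratio r = 20 is a rule-of-thumb assumption.
    """
    params = math.sqrt(budget_flops / (FLOPS_PER_PARAM_TOKEN * tokens_per_param))
    tokens = tokens_per_param * params
    return params, tokens


for budget in (1e21, 4e21):  # hypothetical budgets
    n, d = compute_optimal_split(budget)
    print(f"budget {budget:.0e}: params ~ {n:.2e}, tokens ~ {d:.2e}")
```

With a fixed ratio, both quantities scale with the square root of compute, so quadrupling the budget doubles the model and doubles the data. This is "compute-optimal" only in Maya's narrow sense: best use of a training budget under these assumptions, with nothing said about serving cost.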

Leo: Assumptions like: training compute is the bill that matters. Which brings back my favorite ledger. Pretraining is a capital expense — you pay once. Inference is an operating expense — you pay per question, and when millions of people ask questions every day, the operating bill dominates.

Maya: Two bills. A model can look brilliant in a training paper and be an awkward, expensive thing to actually run. Compute-optimal for a research lab is not automatically optimal for a product team with latency costs.

Leo: Right.
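
The two-bills ledger can be sketched as a back-of-the-envelope calculation. It assumes roughly 6 × parameters FLOPs per training token and roughly 2 × parameters FLOPs per token served at inference; both factors are common approximations and do not come from the episode, and the model sizes and traffic volume are hypothetical.

```python
TRAIN_FLOPS_PER_PARAM_TOKEN = 6.0  # assumption: forward + backward pass
INFER_FLOPS_PER_PARAM_TOKEN = 2.0  # assumption: forward pass only


def two_bills(params: float, train_tokens: float, served_tokens: float):
    """Return (one-time training FLOPs, cumulative serving FLOPs)."""
    training = TRAIN_FLOPS_PER_PARAM_TOKEN * params * train_tokens
    serving = INFER_FLOPS_PER_PARAM_TOKEN * params * served_tokens
    return training, serving


SERVED_TOKENS = 1e12  # hypothetical lifetime traffic

# Same training budget, opposite splits (hypothetical sizes).
big_train, big_serve = two_bills(2e9, 20e9, SERVED_TOKENS)
small_train, small_serve = two_bills(1e9, 40e9, SERVED_TOKENS)

print(f"big:   train {big_train:.1e}, serve {big_serve:.1e}")
print(f"small: train {small_train:.1e}, serve {small_serve:.1e}")
```

In this toy setup the training bills match, the serving bill is far larger than either, and the half-size model pays half as much of it. The sketch counts compute only and ignores latency, memory, and hardware utilization.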

Maya: There's a second disagreement hiding under the first one, though, and this time you defend the boring side. Loss. Is it enough?

Leo: I'll defend it happily, because it's the most stable number we have. Loss is concrete, comparable, and it forecasts. Lower loss usually means better prediction and, more often than people admit, better downstream behavior. Without a shared number, progress becomes storytelling.

Maya: And yet — a model can improve its average prediction and still fail at reasoning, truthfulness, safety, tool use. The average hides the failures you care about. And the moment a benchmark becomes the target, everyone trains for the test and it stops measuring general ability.

Leo: That holds, except it's an argument for more measurement, not less. The people doing this seriously triangulate: language loss, curated benchmarks, adversarial evaluations, human studies, real deployment feedback.

Maya: Triangulation is the honest answer, yes.

Leo: It usually is.

Maya: One more boundary before we map the episodes — almost everything in this topic is about *pretraining* scale. Pretraining teaches broad pattern prediction. Post-training shapes behavior: following instructions, refusing unsafe requests, matching a product's style.

Leo: So when two models differ, ask where the difference came from. Base scale? Data mixture? Fine-tuning? Preference optimization? People compare end products and credit the wrong layer constantly.

Maya: And that's also why "better" starts so many fights. Better on a benchmark? Better per dollar? Better on a small device? People optimize different objectives while using the same word.

Leo: The frontier picture helps there. No single model is best, full stop — there's a trade-off surface across loss, training cost, data volume, inference cost, latency. You pick a point on it. The topic gives you the map; it doesn't pick your point.
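
The trade-off surface Leo mentions is a Pareto frontier: the set of candidates that no other candidate beats on every axis at once. The sketch below uses only two axes, loss and serving cost, and entirely hypothetical candidates with made-up numbers; a real comparison would add the other axes he lists.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    name: str
    loss: float          # lower is better
    serving_cost: float  # lower is better (arbitrary units)


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if `a` is at least as good as `b` on both axes and better on one."""
    no_worse = a.loss <= b.loss and a.serving_cost <= b.serving_cost
    strictly_better = a.loss < b.loss or a.serving_cost < b.serving_cost
    return no_worse and strictly_better


def pareto_front(candidates):
    """Keep only the candidates that no other candidate dominates."""
    return [c for c in candidates if not any(dominates(o, c) for o in candidates)]


# Hypothetical models; the numbers are illustrative only.
models = [
    Candidate("large", loss=2.10, serving_cost=8.0),
    Candidate("medium", loss=2.20, serving_cost=4.0),
    Candidate("wasteful", loss=2.25, serving_cost=5.0),  # worse than "medium" on both axes
    Candidate("small", loss=2.40, serving_cost=1.0),
]

front = pareto_front(models)
print([c.name for c in front])
```

The frontier removes the options that are strictly wasteful and leaves a set of defensible choices. Picking among those is a product decision the curves cannot make.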

Maya: So here's the map for the next three episodes. First, the Kaplan scaling laws — the predictability paper. Loss follows power-law trends in size, data, and compute, which meant progress could be *planned* with curves instead of discovered by accident.

Leo: Then Chinchilla, which keeps the curves and corrects the recipe. If you're compute-constrained, many famous models were too big for the tokens they saw. Smaller model, more data, same budget — better results. It's the rare paper that politely tells the entire field it's been spending wrong.

Maya: [chuckle] Politely. And third, the data-constrained paper — when the whiteboard number isn't compute anymore, it's fresh text. If unique high-quality data stops growing, can you repeat what you have?

Leo: [sigh] And the honest answer is: a little. Repeating data a few times still helps; repeat it too often and the returns decay toward empty calories. The paper tries to quantify exactly where that bend is.

Maya: Which quietly reopens the settled question. Chinchilla says more tokens — but what if the tokens run out? That's where the data-first camp gets its strongest footing: the right mixture, not just the biggest mixture.
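
The "helps a little, then decays" shape of data repetition can be illustrated with a saturating curve. The functional form and the decay constant below are illustrative assumptions chosen to show diminishing returns; they are not the paper's fitted result, and the dataset size is hypothetical.

```python
import math


def effective_tokens(unique_tokens: float, epochs: int, decay: float = 15.0) -> float:
    """Toy model of how much 'fresh-data value' repeated epochs provide.

    The first epoch counts in full; each repeat adds less than the one
    before, saturating at unique_tokens * (1 + decay).
    `decay` is a hypothetical constant, not a measured value.
    """
    repeats = epochs - 1
    return unique_tokens * (1.0 + decay * (1.0 - math.exp(-repeats / decay)))


UNIQUE = 1e9  # hypothetical pool of unique tokens

for epochs in (1, 2, 4, 16, 64):
    seen = UNIQUE * epochs
    worth = effective_tokens(UNIQUE, epochs)
    print(f"{epochs:>2} epochs: {seen:.1e} tokens seen, worth ~{worth:.2e} fresh tokens")
```

In this toy curve a few repeats are worth almost as much as fresh text, while dozens of repeats add very little, which is the bend the paper tries to locate empirically.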

Leo: This topic also loads the next two. Once you commit to a huge run, you have to make thousands of devices cooperate — that's distributed training, Topic 3. And because pretraining is so costly, you'd rather not redo it per task — adapting cheaply is Topic 4.

Maya: Before we close, the working vocabulary — short and plain. Parameters means the learned numbers inside the model: its capacity.

Leo: Tokens means the chunks of text the model trains on: its experience.

Maya: Compute means the total calculation spent during training, roughly model size times tokens processed.

Leo: A scaling law means a fitted curve that predicts how loss falls as size, data, or compute grows.

Maya: Compute-optimal means the best split of a fixed training budget between parameters and tokens — under stated assumptions.

Leo: Undertrained means a model too large for the tokens it saw — capacity its data never filled.

Maya: Inference cost means what you pay every time the trained model answers a query.

Leo: And data repetition means training on the same tokens more than once — useful in small doses, diminishing fast in large ones.

Maya: So the one-run team walks out of that planning room with a vocabulary instead of vibes. What regime are we in? Where's the bottleneck — capacity, tokens, quality, or the serving bill?

Leo: And with some humility. These laws are fitted from experiments. They guide judgment; they don't replace it.

Maya: So here's the question to carry into the Kaplan episode. You get one fixed training budget — one run. Would you rather buy a bigger brain, a longer education, or a cleaner textbook?
