SE0 · Apr 26, 2026 · 00:11:21

Series Overview — Mastering Language Models: From Architecture to Optimization

Maya and Leo open the series with the map: seven stops from the Transformer blueprint to the machinery under massive models, anchored by a three-person startup building an insurance-claims assistant on eight GPUs. They lay out the mental models every LLM expert shares — trust curves, find the bottleneck, separate capability from behavior — then stage the field's cleanest fight on air: bigger models versus more data, from OpenAI's 2020 scaling curves to Chinchilla's flip to the serving-cost era that ran past both camps. Plus trailers for the live attention debate and the alignment fight to come.

Transcript

Maya: Picture a three-person startup in a rented office. They want to build an assistant that reads insurance claims — thousands of pages of them — and flags the ones that don't add up. They've got one machine with eight GPUs, a budget that has to last the year, and a very nervous founder.

Leo: I've met that founder.

Maya: [chuckle] Everyone has. And between that team and a working assistant sit about six big decisions. Which architecture. How big to train. How to split the work across those GPUs. How to specialize the model. How to make it behave. And whether to build on open weights or rent a closed model through an API.

Leo: And every one of those decisions has a wrong answer that quietly costs real money. That's this series, isn't it — the forks in that road.

Maya: That's the series. Mastering Language Models — from architecture to optimization. We walk the whole stack the way the field built it, and at each stop we do three things. Give you the mental model experts actually share. Show you where they genuinely fight. And let each side land its best punch.

Leo: No strawmen.

Maya: No strawmen. If a camp sounds dumb, we've summarized it badly.

Leo: So, the map. The way I hold it, the journey runs from blueprint to behavior. It opens at the Blueprint — the Transformer, the 2017 paper "Attention Is All You Need," which threw out recurrence and bet everything on a mechanism called attention.

Maya: And sitting right next to it, the modern challengers — architectures like Kimi Linear arguing that the bet needs revising. That pairing is our first topic.

Leo: After the blueprint comes the Growth Question: when you get more compute, do you build a bigger brain or feed it more books? Scaling laws. That one produced the cleanest fight in the field's recent history—

Maya: —which we're staging in about two minutes, so hold that thought.

Leo: [chuckle] Holding it.

Maya: Then the Machine Shop. Pipeline parallelism, Megatron, ZeRO, FlashAttention — the unglamorous engineering that lets a model too big for one chip train across hundreds of them without the wiring becoming the bottleneck.

Leo: Then the Tailor's Bench, because in practice you almost never train from scratch — you adapt something pre-trained. Low-rank adaptation — LoRA — and its cousins fine-tune a giant model by training a thin sliver of it.

Maya: Sometimes under one percent of the weights.
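As a rough illustration of the "under one percent" figure, here is a minimal sketch that assumes a single weight matrix adapted with LoRA. The 4096-wide layer and rank 8 are hypothetical values chosen for the arithmetic, not numbers from the episode, and `lora_trainable_fraction` is an illustrative helper rather than any library's API.

```python
def lora_trainable_fraction(d_in: int, d_out: int, rank: int) -> float:
    """Trainable LoRA parameters as a fraction of one frozen weight matrix.

    LoRA keeps the original d_out x d_in matrix frozen and learns two thin
    matrices, B (d_out x rank) and A (rank x d_in), whose product is the update.
    """
    frozen = d_in * d_out
    trainable = rank * (d_in + d_out)
    return trainable / frozen


# Hypothetical layer: a 4096 x 4096 projection adapted at rank 8.
fraction = lora_trainable_fraction(4096, 4096, 8)
print(f"{fraction:.2%}")  # about 0.39% of that matrix
```

Across a whole model the share is usually smaller still, because only some of the matrices receive adapters.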

Leo: Then the Feedback Loop — where a model that merely predicts text gets shaped into something that actually helps people. Reinforcement learning from human feedback — RLHF — plus newer shortcuts that claim you can skip the reinforcement machinery entirely.

Maya: And the final stretch covers the modern era in two halves: the open-weight families — Llama, DeepSeek — and then the machinery under the biggest systems. Mixture-of-experts, the giant training datasets, the optimizers that make any of it converge.

Leo: The promise by the end is simple. You read a model release like an engineer, not like a press release.

Maya: Before any forks, though — the shared ground. Because for all the public arguing, serious people in this field think alike in a few very specific ways. Here's the first: nobody trusts a single result. They trust curves.

Leo: Meaning what, concretely?

Maya: You never ask "is this model good." You ask what happens to performance when I double the data, double the parameters, double the compute. A claim that sits on a curve is knowledge. A claim that doesn't is an anecdote.
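One way to picture a claim that "sits on a curve" is a power-law fit in log-log space. The sketch below assumes loss falls as a power of model size. The data points are synthetic, generated from a made-up law, and `fit_power_law` is a hypothetical helper, so nothing here is a measured result.

```python
import math


def fit_power_law(xs, ys):
    """Least-squares fit of y = a * x**slope in log-log space; returns (a, slope)."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(xs)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    covariance = sum((u - mean_x) * (v - mean_y) for u, v in zip(log_x, log_y))
    variance = sum((u - mean_x) ** 2 for u in log_x)
    slope = covariance / variance
    a = math.exp(mean_y - slope * mean_x)
    return a, slope


# Synthetic losses drawn from loss = 5.0 * N**-0.1 (an invented law).
sizes = [1e7, 1e8, 1e9, 1e10]
losses = [5.0 * n ** -0.1 for n in sizes]

a, slope = fit_power_law(sizes, losses)
# With a fitted curve, "what if I double the parameters?" has an answer.
predicted_at_double = a * (2 * sizes[-1]) ** slope
print(f"slope={slope:.3f}, predicted loss at 2e10 params={predicted_at_double:.3f}")
```

A single benchmark score gives none of this: no slope, and no prediction to be proven wrong.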

Leo: That one I'd defend with my life. The second is bottleneck thinking. At any given moment, exactly one thing constrains your system — compute, memory, data, or dollars — and only spending against that constraint moves anything. Our claims team's first bottleneck isn't intelligence at all. It's that eight GPUs can't even hold the model they want to start from.
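A back-of-the-envelope version of that memory bottleneck, under loudly stated assumptions: mixed-precision training with Adam at roughly 16 bytes per parameter (2 for fp16 weights, 2 for fp16 gradients, 12 for fp32 master weights and optimizer moments), activations ignored, a hypothetical 70-billion-parameter starting model, and hypothetical 80 GB GPUs. The episode gives none of these figures.

```python
# Assumed bytes per parameter: fp16 weights + fp16 gradients + fp32 optimizer state.
BYTES_PER_PARAM_TRAINING = 2 + 2 + 12


def training_state_gb(n_params: float) -> float:
    """Memory for weights, gradients, and optimizer state, in gigabytes."""
    return n_params * BYTES_PER_PARAM_TRAINING / 1e9


def fits(n_params: float, n_gpus: int, gb_per_gpu: float) -> bool:
    """True if the training state fits in the combined GPU memory."""
    return training_state_gb(n_params) <= n_gpus * gb_per_gpu


needed_gb = training_state_gb(70e9)  # hypothetical 70B-parameter model
available_gb = 8 * 80                # hypothetical eight 80 GB GPUs
print(needed_gb, available_gb, fits(70e9, 8, 80))
```

Under these assumptions the training state alone exceeds the machine's memory before a single activation is stored.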

Maya: Which is why half this series is secretly about memory, when everyone expects it to be about intelligence.

Leo: Mm. Painfully true.

Maya: And the third one took the field years to say out loud: capability and behavior are different things, made in different factories. Pre-training decides what a model can do. Post-training — the tailoring, the feedback — decides what it does do.

Leo: And mixing those up is how you get a brilliant model that's useless in front of a customer. Or a polite one that's confidently wrong.

Maya: Both of which our claims founder will meet personally. Okay — the fight.

Leo: Finally. Because the scaling story sounds settled when people tell it fast, and it absolutely was not.

Maya: Then I'll argue it the way 2020 argued it. OpenAI's scaling-laws paper measured how loss falls as you scale size, data, and compute — and the curves said model size was the dominant lever. Given ten times the compute, grow the model a lot and the data only a little.

Leo: And the world acted on it.

Maya: Fully. That's the GPT-3 era — bigger was the strategy. And the strongest form of the position is about capacity: a small model cannot represent what it has no room to learn. No amount of reading fixes a brain that's too small for the idea.

Leo: Then I get DeepMind's side, because two years later the Chinchilla paper broke that consensus with a better-run experiment. They redid the measurement with the learning-rate schedule properly matched to each training run. Sounds like a footnote — it flipped the answer. Scale model and data together, in roughly equal proportion.
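To make "a lot and only a little" versus "roughly equal proportion" concrete, here is a small sketch of the two allocation rules. It assumes compute is proportional to parameters times tokens, and it uses approximate exponents for optimal model size (about 0.73 in the 2020 fit, about 0.5 under Chinchilla). Those exponents are rounded values from memory of the published papers and are not quoted in the episode.

```python
def split_compute(compute_multiplier: float, model_exponent: float):
    """Split extra compute into model growth and data growth.

    Assumes compute ~ parameters * tokens, so whatever growth the model
    does not absorb goes to data.
    """
    model_growth = compute_multiplier ** model_exponent
    data_growth = compute_multiplier / model_growth
    return model_growth, data_growth


kaplan_model, kaplan_data = split_compute(10, 0.73)         # assumed 2020 exponent
chinchilla_model, chinchilla_data = split_compute(10, 0.5)  # assumed 2022 exponent

print(f"2020 rule:  model x{kaplan_model:.1f}, data x{kaplan_data:.1f}")
print(f"Chinchilla: model x{chinchilla_model:.1f}, data x{chinchilla_data:.1f}")
```

Same tenfold budget, very different shopping lists: one rule puts most of the growth into parameters, the other splits it evenly.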

Maya: That's a re-measurement, not a refutation—

Leo: Then explain the demonstration. Chinchilla, seventy billion parameters, beat Gopher at two hundred eighty billion — same compute budget. Four times smaller, better model. My side's strongest form isn't "small models win." It's that the field spent two years starving its models of data because one early curve under-weighted it. Bad measurement, billions misallocated.
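The "same compute budget" claim can be sanity-checked with the common rule of thumb that training costs about 6 floating-point operations per parameter per token. The parameter counts come from the episode. The token counts (roughly 300 billion for Gopher, 1.4 trillion for Chinchilla) are the commonly cited figures and are an assumption here, since the episode does not state them.

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute using the 6 * N * D rule of thumb."""
    return 6 * n_params * n_tokens


GOPHER = (280e9, 300e9)       # parameters from the episode, tokens assumed
CHINCHILLA = (70e9, 1.4e12)   # parameters from the episode, tokens assumed

gopher_flops = training_flops(*GOPHER)
chinchilla_flops = training_flops(*CHINCHILLA)
compute_ratio = chinchilla_flops / gopher_flops

# Tokens seen per parameter: the "starving" in numeric form.
gopher_tokens_per_param = GOPHER[1] / GOPHER[0]
chinchilla_tokens_per_param = CHINCHILLA[1] / CHINCHILLA[0]

print(f"compute ratio: {compute_ratio:.2f}")
print(f"tokens per parameter: {gopher_tokens_per_param:.1f} vs {chinchilla_tokens_per_param:.1f}")
```

Under these assumptions the two budgets land within about 20 percent of each other, while the data seen per parameter differs by more than an order of magnitude.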

Maya: Brutal result, I'll grant.

Leo: Grant more than that.

Maya: [laugh] Fine — I concede the experiment. Chinchilla's measurement was simply better, and the field was right to move. But notice what I don't concede: the frontier kept getting bigger anyway, because capacity still buys capabilities that more data alone doesn't reach. The fight didn't end with "data wins." It ended with "stop starving the model" — which is a different sentence.

Leo: Fair. And then practice ran straight past both camps — because Chinchilla optimized for the cheapest model to train, and nobody trains a model just to admire it. You serve it, millions of times a day. So the Llama-style move is to deliberately overtrain a smaller model far past Chinchilla's rule, because the smaller model is cheaper every single day it's deployed.

Maya: And that's the resolution worth keeping: the answer changed when the question changed — from cheapest to train, to cheapest to own. Our claims startup doesn't care about training elegance. They care about the cost per claim, forever.
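A sketch of "cheapest to own", using two rule-of-thumb assumptions: about 6 operations per parameter per training token, and about 2 per parameter per served token. Both model configurations are hypothetical, and the comparison assumes, purely for illustration, that the two models are equally acceptable in quality, which the arithmetic cannot show.

```python
def lifetime_flops(n_params: float, train_tokens: float, served_tokens: float) -> float:
    """Training plus serving compute: 6*N*D to train, 2*N per served token."""
    return 6 * n_params * train_tokens + 2 * n_params * served_tokens


def break_even_served_tokens(big, small) -> float:
    """Served tokens after which the smaller, longer-trained model costs less overall."""
    (n_big, d_big), (n_small, d_small) = big, small
    extra_training = 6 * (n_small * d_small - n_big * d_big)
    saving_per_token = 2 * (n_big - n_small)
    return extra_training / saving_per_token


BIG = (70e9, 1.4e12)   # hypothetical: trained near the Chinchilla ratio
SMALL = (8e9, 15e12)   # hypothetical: deliberately overtrained

break_even = break_even_served_tokens(BIG, SMALL)
print(f"break-even after about {break_even:.2e} served tokens")
```

With these invented numbers the smaller model costs more to train but pulls ahead after roughly a trillion served tokens, and every token after that widens the gap.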

Leo: Cost per claim. That's the frame.

Maya: There's a second fight, and it's live right now — it's where this series begins. Full attention versus linear attention.

Leo: This one we fight properly in the Transformer episodes, so consider this the trailer.

Maya: The Transformer's core move is letting every word look at every other word. That's exactly what makes it powerful — and exactly what makes it expensive. Double the document, quadruple the work. And while it's answering, the model drags along a memory of everything it has read — the key-value cache, or KV cache — like a suitcase that gets heavier with every page.

Leo: Heavier and slower.

Maya: But if an architecture could keep that suitcase a fixed size, the serving cost—

Leo: —stops growing with the document. Which is the whole prize. So one camp says keep attention exact and make it faster — the FlashAttention school, and exact attention is still the only recipe proven at frontier scale. The other camp says quadratic cost is a dead end at million-token contexts, and points at results like Kimi Linear, which claims to match full attention while cutting that suitcase by three quarters.
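The two costs in this exchange can be written down in a few lines. The sketch below counts token pairs for full attention and estimates KV-cache size for a hypothetical configuration (32 layers, 8 key-value heads, head dimension 128, 2 bytes per value). Those settings are invented for illustration. The final line takes the "three quarters" reduction at face value as a claim, not as a verified measurement.

```python
def attention_score_pairs(seq_len: int) -> int:
    """Token pairs scored by full attention (causal masking roughly halves this, still quadratic)."""
    return seq_len * seq_len


def kv_cache_bytes(seq_len: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Keys and values stored for every token, layer, and key-value head."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


CONFIG = dict(n_layers=32, n_kv_heads=8, head_dim=128)  # hypothetical model

per_token = kv_cache_bytes(1, **CONFIG)
short_doc = kv_cache_bytes(1_000, **CONFIG)
long_doc = kv_cache_bytes(2_000, **CONFIG)

print(f"cache per token: {per_token} bytes")
print(f"pairs ratio when doubling: {attention_score_pairs(2_000) / attention_score_pairs(1_000)}")
print(f"cache ratio when doubling: {long_doc / short_doc}")

# A three-quarters cut, if the claim holds, leaves a quarter of the cache.
reduced_long_doc = long_doc // 4
```

The work grows with the square of the document while the cache grows linearly, which is why the two camps argue about different remedies.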

Maya: Whether those claims survive contact with the frontier is precisely what we'll argue when we get there. I'm not spoiling my side.

Leo: [chuckle] You'll take the romantic side. You always take the romantic side.

Maya: Someone has to keep you honest about what the boring side misses.

Leo: And a third fight waits at the Feedback Loop — whether shaping a model's behavior needs the full reinforcement-learning machinery with humans grading outputs, or whether a model can learn preferences directly. Even from another model's feedback.

Maya: We'll put that one on the table when the evidence is in front of us.

Leo: Agreed. Worth saying what all three fights have in common, though.

Maya: That underneath, nobody is arguing taste. They're arguing about which measurement to trust and which cost to count. Once you hear that shape, the papers stop being intimidating — they become positions in an argument you can actually follow.

Leo: So here's the promise, builder's version. By the end of this series, when that claims team hits its forks — which architecture, how big, how distributed, how specialized, how aligned, open or closed — you'll know not just what the field chose, but why. And what evidence would change the answer.

Maya: Next time, straight to the blueprint: the paper that threw away recurrence, bet everything on attention — and accidentally created the cost problem the rest of the series keeps trying to pay down.

Leo: Until then, something to chew on.

Maya: If someone handed your team ten times your compute budget tomorrow — would you spend it on a bigger model, more data, longer context, or better feedback, and what evidence do you actually have for that answer?
