William Liu · Podcasts

T2E0 · Topic 2 · 00:11:12

Scaling and Training Large Models Efficiently

A topic-level overview of efficient large-model scaling, introducing parameters, tokens, compute, data quality, compute optimality, and the core disagreements around scale.

Transcript

Generated: 2026-05-01 01:44 UTC

---

MayaBefore we jump in, here's a quick setup for this episode on scaling and training large models efficiently. You'll hear Maya and Leo work through the topic together.

MayaTopic 1 gave us the Transformer. Topic 2 asks the uncomfortable follow-up: once you have the architecture, how big should you make it?

LeoAnd just as important: how much data should it see, how long should you train it, and what happens when the supply of fresh high-quality text starts to feel limited?

MayaThis topic covers three papers: Scaling Laws for Neural Language Models; Training Compute-Optimal Large Language Models, usually known as the Chinchilla paper; and Scaling Data-Constrained Language Models. All three are in the show notes for extra reading.

LeoThe shared question is resource allocation. If you have a fixed training budget, do you spend it on more parameters, more tokens, or more training steps?

MayaLet’s define the pieces. Parameters are the model’s learned numbers. Tokens are the pieces of text it trains on. Compute is the amount of calculation used during training, commonly estimated as roughly six times the parameter count times the number of training tokens, counted in floating-point operations.
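
To make the compute definition concrete, here is a minimal sketch of the widely used rule of thumb that training compute is about 6 × parameters × tokens. The function name and the example model size are hypothetical, and the factor of 6 is an approximation that ignores attention and other overheads.

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute in FLOPs with the C ~ 6 * N * D rule of thumb."""
    return 6.0 * n_params * n_tokens


# Hypothetical example: a 1-billion-parameter model trained on 20 billion tokens.
flops = training_flops(1e9, 20e9)
print(f"{flops:.2e} FLOPs")  # 1.20e+20 FLOPs
```

Doubling either the model size or the token count doubles the estimate, which is why the two compete for the same budget.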

LeoA simple analogy: building a student. Parameters are like memory and reasoning capacity. Tokens are like study material. Compute is study time plus effort. A huge brain with too little study material underlearns. Endless study material with a tiny brain also hits a ceiling.

MayaThat analogy is not perfect, but it captures the trade-off. Topic 2 is about balancing capacity, experience, and budget.

LeoHere are the core mental models experts share. First, scaling is often predictable. Loss tends to improve smoothly as model size, data size, and compute increase, at least over broad ranges.

MayaSecond, “bigger” is not one knob. You can make the model bigger, the dataset bigger, the training run longer, or the data higher quality. Those choices interact.

LeoThird, training efficiency is about avoiding waste. A model can be too small for the data, too large for the data, trained too briefly, trained on repeated data too often, or fed low-value tokens.

MayaFourth, data is not just volume. Quality, diversity, deduplication, filtering, domain mix, and contamination all matter. A trillion tokens can be gold or sludge depending on where they come from.

LeoNow the expert disagreements. The first disagreement is “scale first” versus “data first.” The scale-first camp says the most reliable progress has come from increasing compute and model size according to observed laws. Their strongest argument is empirical: scaling has delivered broad improvements again and again.

MayaThe data-first camp says blind scaling wastes compute if the data is repetitive, low quality, or poorly matched to the target use. Their strongest argument is that training data shapes behavior deeply, and the best model is often the one trained on the right mixture, not just the biggest mixture.

LeoSecond disagreement: should we train bigger models for fewer tokens or smaller models for more tokens? Early scaling-law interpretations leaned toward very large models trained before full convergence. Chinchilla shifted the discussion by arguing that many large models were undertrained and that training tokens should grow roughly in proportion to parameters as the compute budget grows.

MayaStrongest argument for huge models: large models can be sample-efficient and may unlock behaviors smaller models do not. Strongest argument for more tokens per parameter: inference cost matters. A smaller, better-trained model can be cheaper to run and easier to deploy.

LeoThird disagreement: how close are scaling laws to real product usefulness? One side says loss is a powerful universal signal. Lower loss usually means better predictions and often better downstream behavior.

MayaThe other side says loss is not the whole story. A model can improve average prediction and still fail on reasoning, truthfulness, safety, or tool-use tasks. They want targeted evaluation, not just smoother loss curves.

LeoFourth disagreement: what to do when high-quality data runs out. Some researchers argue repeated data can still be useful up to a point. Others emphasize synthetic data, domain-specific data, retrieval, or new training objectives.

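MayaThe data-constrained paper is important here because it tries to quantify repetition. It reports that up to about four epochs of repeated data changes loss very little compared with unique data, but with much more repetition the value of extra compute eventually decays toward zero.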
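
One way to picture diminishing returns from repetition is a saturating curve for "effective" tokens, in the spirit of the data-constrained paper. The sketch below is an illustration under assumptions: the function name is hypothetical, and the constant `r_star=15.0` is an illustrative value chosen for this example, not a figure quoted in the episode.

```python
import math


def effective_tokens(unique_tokens: float, epochs: float, r_star: float = 15.0) -> float:
    """Effective unique-token equivalent of training `epochs` passes over the data.

    Each repeat adds less than the previous one; the total saturates at
    unique_tokens * (1 + r_star). r_star is an assumed, illustrative constant.
    """
    repeats = max(epochs - 1.0, 0.0)  # the first pass is not a repeat
    return unique_tokens * (1.0 + r_star * (1.0 - math.exp(-repeats / r_star)))


unique = 100e9  # hypothetical pool of 100 billion unique tokens
for epochs in (1, 4, 16, 64):
    seen = unique * epochs
    worth = effective_tokens(unique, epochs)
    print(f"{epochs:>2} epochs: saw {seen:.2e} tokens, worth about {worth:.2e}")
```

With these assumptions, four epochs are worth nearly four epochs of fresh data, while sixty-four epochs are worth far fewer than sixty-four.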

LeoLet’s preview the three deep-dives. The first, Scaling Laws for Neural Language Models, is about predictability. It says loss follows power-law trends with model size, dataset size, and compute. That made scaling feel like engineering rather than guesswork.
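
A power-law trend is a straight line on log-log axes, which is what makes it easy to fit and extrapolate. The sketch below fits loss ≈ a · x^(−alpha) by ordinary least squares on the logs. The data points are synthetic, generated from made-up constants, and the fit ignores any irreducible loss floor, so treat it as an illustration rather than the paper's procedure.

```python
import math


def fit_power_law(xs, losses):
    """Fit loss ~ a * x**(-alpha) by linear regression in log-log space."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in losses]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((u - mean_x) ** 2 for u in lx)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(lx, ly))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope  # (a, alpha)


# Synthetic "model size vs. loss" points from assumed constants a=10, alpha=0.08.
sizes = [1e6, 1e7, 1e8, 1e9]
losses = [10.0 * s ** -0.08 for s in sizes]

a, alpha = fit_power_law(sizes, losses)
predicted = a * 1e10 ** -alpha  # extrapolate one order of magnitude further
print(f"a={a:.3f}, alpha={alpha:.3f}, predicted loss at 1e10: {predicted:.3f}")
```

Real measurements are noisy and eventually bend away from the line, so an extrapolation like the last step is a planning aid, not a guarantee.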

MayaThe second, Chinchilla, changes the budget recipe. It says if you are compute-constrained, many big models were too large for the number of tokens they saw. Train a smaller model on more data and you may get better performance for the same compute.
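
The Chinchilla-style budget recipe can be sketched with two assumptions that are not stated in the episode: training compute is about 6 × parameters × tokens, and the compute-optimal ratio is roughly 20 tokens per parameter. Both are rules of thumb, and the function name is hypothetical.

```python
import math


def compute_optimal_split(flops_budget: float, tokens_per_param: float = 20.0):
    """Split a FLOPs budget into (parameters, tokens).

    Assumes C = 6 * N * D and D = tokens_per_param * N,
    which gives N = sqrt(C / (6 * tokens_per_param)).
    """
    n_params = math.sqrt(flops_budget / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


# A budget of about 5.88e23 FLOPs corresponds to 6 * 70e9 * 1.4e12,
# i.e. a 70-billion-parameter model trained on 1.4 trillion tokens.
n, d = compute_optimal_split(5.88e23)
print(f"params ~ {n:.2e}, tokens ~ {d:.2e}")
```

Under these assumptions, multiplying the budget by 100 multiplies both the model size and the token count by 10, rather than pouring everything into parameters.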

LeoThe third asks what happens when fresh data is constrained. If you cannot keep scaling unique text forever, how should you train? When does repetition help, and when does it become empty calories?

MayaThis topic also sets up distributed training. Once you decide to train a huge model on huge data, you need to make the hardware cooperate. That becomes Topic 3.

LeoAnd it sets up fine-tuning. If pretraining is so costly, you rarely want to redo it for every task. Topic 4 asks how to adapt a large pretrained model efficiently.

MayaLet’s bring this down to an engineering meeting. Imagine a team has a budget for one training run. Someone says, “Let’s build the biggest model we can.” Another person says, “No, we should train a smaller model longer.” Another says, “Our data is not good enough.” Another says, “The model will be too expensive at inference.”

LeoTopic 2 gives that team a vocabulary. Not just vibes. It gives a way to ask: what is the compute-optimal frontier? What data regime are we in? Is the bottleneck capacity, tokens, data quality, or deployment cost?

MayaAnd it teaches humility. Scaling laws are helpful, but they are fitted from experiments. They can guide decisions, not remove judgment.

LeoA key phrase here is “compute optimal.” It does not mean “best model possible.” It means best use of a fixed compute budget under a set of assumptions.

MayaThat distinction matters. A compute-optimal training plan for a research lab may not be optimal for a product team with strict latency costs. A model that is cheap to train but expensive to serve may be a bad business decision.

LeoExactly. Training compute is only one bill. Inference compute can dominate when millions of users ask questions every day.

MayaAnother mental model: pretraining is a capital expense. Inference is operating expense. A model can look great in a training paper and still be awkward to run at scale.
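
The capital-versus-operating framing can be put into rough numbers. The sketch below assumes about 6 FLOPs per parameter per training token and about 2 FLOPs per parameter per token served at inference; both are common approximations, and the model sizes and traffic figures are hypothetical.

```python
def lifetime_flops(n_params: float, train_tokens: float, served_tokens: float):
    """Return (training FLOPs, inference FLOPs) under the 6ND / 2ND approximations."""
    training = 6.0 * n_params * train_tokens
    inference = 2.0 * n_params * served_tokens
    return training, inference


def breakeven_served_tokens(train_tokens: float) -> float:
    """Served tokens at which inference compute equals training compute (2NS = 6ND)."""
    return 3.0 * train_tokens


# Hypothetical deployment: 7B parameters, 1T training tokens, 10T tokens served.
train, serve = lifetime_flops(7e9, 1e12, 10e12)
print(f"training {train:.2e} FLOPs, inference {serve:.2e} FLOPs")
print(f"break-even after {breakeven_served_tokens(1e12):.1e} served tokens")
```

Under these assumptions inference overtakes training once the model has served about three times as many tokens as it was trained on, regardless of model size.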

LeoSo experts in this topic think in frontiers: loss versus compute, accuracy versus data, context versus memory, training cost versus inference cost.

MayaLet’s close with the big lesson. Topic 1 says architecture made scale possible. Topic 2 says scale needs discipline. More parameters, more tokens, and more compute are not automatically wise. The question is how they combine.

LeoNext episode, we start with Kaplan and colleagues’ scaling laws. It is the paper that made many people believe language model progress could be planned with curves, not just discovered by trial and error.

MayaOne thing experts also track is the difference between pretraining and post-training. Pretraining teaches broad pattern prediction. Post-training shapes behavior: following instructions, refusing unsafe requests, using tools, or matching a product style.

LeoThat matters because Topic 2 focuses mostly on pretraining scale. A model can be compute-optimal in pretraining and still need careful post-training to become pleasant, safe, or useful.

MayaSo when people compare models, you have to ask whether the difference came from base model scale, data mixture, supervised fine-tuning, preference optimization, tool access, or evaluation selection.

LeoAnother shared mental model is the Pareto frontier. A system is a Pareto improvement if it gets better on at least one dimension without getting worse on any other. In this topic, the dimensions might be loss, training compute, data volume, inference cost, and latency.
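
The Pareto idea is easy to check mechanically. This minimal sketch keeps every candidate that no other candidate dominates, treating all metrics as lower-is-better. The model names and numbers are invented for illustration.

```python
def dominates(a, b):
    """True if a is at least as good as b on every metric and strictly better on one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_frontier(candidates):
    """Return the names of candidates that no other candidate dominates."""
    return [
        name
        for name, metrics in candidates.items()
        if not any(
            dominates(other_metrics, metrics)
            for other, other_metrics in candidates.items()
            if other != name
        )
    ]


# Hypothetical models: (validation loss, relative inference cost), lower is better.
models = {
    "small": (2.9, 1.0),
    "medium": (2.6, 3.0),
    "large": (2.4, 9.0),
    "bloated": (2.7, 10.0),
}
print(pareto_frontier(models))  # ['small', 'medium', 'large']
```

Here "bloated" drops out because "medium" beats it on both loss and cost, while the other three each represent a different trade-off that a team could reasonably choose.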

MayaNo single model is “best” outside a use case. A chatbot for millions of users may value low inference cost. A research assistant for hard science may value maximum reasoning quality. A privacy-sensitive product may care most about data governance.

LeoThat is why scaling conversations can get heated. People are often optimizing different objectives while using the same word: better.

MayaBetter for a benchmark? Better per dollar? Better for a small device? Better for a regulated domain? Topic 2 gives tools, but the objective still has to be chosen.

MayaOne more disagreement belongs in the overview: benchmark-driven scaling versus capability-driven scaling. Benchmark-driven teams optimize for measured scores because benchmarks are concrete and comparable. Capability-driven teams worry that benchmarks become stale or gameable.

LeoStrongest benchmark argument: without shared tests, progress becomes storytelling. Strongest anti-benchmark argument: once everyone trains for the test, the test stops measuring general ability.

MayaSo experts increasingly triangulate: language loss, curated benchmarks, adversarial evaluations, human studies, and real deployment feedback.

MayaFinal question: if you had one fixed training budget, would you rather buy a bigger model, a longer education, or a cleaner textbook?

CreditsThanks for listening. The producer is William Liu. Join us for the next episode.
