William Liu · Podcasts

T2E3 · Topic 2 · 00:11:06

Scaling Data-Constrained Language Models: When Fresh Text Runs Out

A deep dive into data-constrained scaling, explaining repeated data, effective tokens, diminishing returns, and the training-data bottleneck.

Transcript

Generated: 2026-05-01 01:45 UTC

---

Leo: Before we jump in, here's a quick setup for this episode, Scaling Data-Constrained Language Models: When Fresh Text Runs Out. You'll hear Leo and Maya work through the topic together.

Leo: Chinchilla told the field: train models on more tokens. This episode asks the next question: what if you do not have enough fresh high-quality tokens?

Maya: The paper is Scaling Data-Constrained Language Models. We’ll put it in the show notes as extra reading, along with the project materials when available, because the experiments are broad.

Leo: The setup is simple but uncomfortable. Modern scaling recipes want more data as models grow. But the internet is not infinite, and not all text is worth training on.

Maya: Plus, some text is duplicated, low quality, private, copyrighted, toxic, misleading, or too close to evaluation benchmarks. So “just get more data” is not a clean answer.

Leo: The paper studies what happens when data is constrained and you repeat it. In other words: if you only have a certain amount of unique text, can the model benefit from seeing it multiple times?

Maya: The authors run many experiments across model sizes, compute budgets, and repetition levels. One headline result is that repeating data for a small number of epochs, up to roughly four in the paper’s experiments, changes the loss very little compared with training on the same amount of unique data. Beyond that, the value of repeated tokens decays.

Leo: Let’s translate “epoch.” One epoch means the model has seen the dataset once. Four epochs means it has gone through the same dataset four times.

Maya: In school terms, rereading a textbook once or twice can help. Rereading the same chapter a hundred times probably stops teaching new concepts.

Leo: The paper’s important contribution is not just “repetition bad.” It is more nuanced: repeated data has value, but less value than fresh data after a point.

Maya: That nuance matters because some training recipes inevitably repeat data. Small domains, low-resource languages, specialized science corpora, medical text, and legal text can all be data-constrained.

Leo: Let’s name the core mental model: tokens have marginal value. The first time a model sees a useful example, it may learn a lot. The second time, it may consolidate. The tenth time, maybe less. The thousandth time, almost nothing new.

Maya: And if repetition is excessive, the model may overfit. It can become too tailored to the repeated examples instead of learning patterns that generalize.

Leo: The paper proposes scaling laws that account for the decreasing value of repeated tokens and excess parameters. That is a direct extension of the Topic 2 story.
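
For readers following along in text, here is a sketch of the functional form behind "decreasing value of repeated tokens." The symbols reflect our reading of the paper and should be checked against it; the fitted constants are deliberately left out, and the notation (U_D, R_D, D') is not used elsewhere in this episode.

```latex
% U_D : unique tokens available
% D   : total tokens seen during training (unique tokens times epochs)
% R_D : number of repetitions beyond the first pass
% R_D^{*} : fitted constant controlling how fast repeated tokens lose value
R_D = \frac{D}{U_D} - 1, \qquad
D' = U_D + U_D \, R_D^{*} \left( 1 - e^{-R_D / R_D^{*}} \right)

% The same discount shape is applied to parameters, giving an effective N'.
% Both effective quantities then enter a Chinchilla-style loss:
L(N, D) = E + \frac{A}{(N')^{\alpha}} + \frac{B}{(D')^{\beta}}
```

For small R_D the exponential is nearly linear, so D' is close to D and the first few epochs count almost fully. As R_D grows, D' levels off at U_D (1 + R_D^{*}), which is the "almost nothing new" regime.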

Maya: So far, Topic 2 has moved from “scale is predictable” to “balance parameters and tokens” to “but unique data may be limited.”

Leo: Which forces a practical question: what do you do when data is the bottleneck?

Maya: The paper explores mitigation strategies like adding code data or relaxing common filters. More broadly, the field considers several approaches: better data selection, domain mixing, synthetic data, retrieval, curriculum learning, and using smaller models where data is scarce.

Leo: Let’s unpack synthetic data carefully. It means model-generated training data. The optimistic view is that strong models can help create explanations, problem variants, or clean examples that improve training.

Maya: The skeptical view is that synthetic data can collapse diversity or amplify mistakes. If models train too heavily on model-generated text, they may inherit weird patterns or become less grounded.

Leo: That is a major expert disagreement. One camp says synthetic data is essential because high-quality human text is limited. The other says synthetic data is useful only when filtered, verified, and grounded in real tasks.

Maya: Another disagreement: should we remove filters to get more tokens? The pro side says aggressive filtering can throw away useful diversity, especially for code, informal text, multilingual text, or rare domains.

Leo: The caution side says lower-quality data can teach bad behavior, waste compute, or create safety risks. More tokens can be worse if they are noisy enough.

Maya: A third disagreement is whether data scarcity is universal. For English web text, maybe the frontier feels constrained. For multimodal data, interaction data, domain-specific logs, and tool-use traces, there may be other kinds of experience to collect.

Leo: Good point. “Data” does not only mean static internet text. Future models may learn from environments, simulations, tasks, user feedback, code execution, or retrieved knowledge.

Maya: Let’s create a practical example. Suppose you are building a model for a low-resource language with only 40 billion good tokens. A strict Chinchilla-style recipe might want more. Do you repeat the data? Add related languages? Use translation? Generate synthetic examples? Train a smaller model?
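
To put rough numbers on the 40-billion-token example, here is a back-of-the-envelope sketch. It assumes the commonly quoted Chinchilla-style heuristic of about 20 training tokens per parameter, which is not a figure from this episode, and the model sizes are hypothetical.

```python
# Back-of-the-envelope planning for a 40B-token corpus.
# ASSUMPTION: ~20 training tokens per parameter, a commonly quoted
# Chinchilla-style rule of thumb (not stated in the episode).
TOKENS_PER_PARAM = 20


def tokens_wanted(n_params: float) -> float:
    """Training tokens the rule of thumb would ask for at this model size."""
    return TOKENS_PER_PARAM * n_params


def epochs_needed(n_params: float, unique_tokens: float) -> float:
    """Passes over the unique data needed to reach the wanted token count."""
    return tokens_wanted(n_params) / unique_tokens


def largest_single_epoch_model(unique_tokens: float) -> float:
    """Largest model the rule of thumb supports without repeating any data."""
    return unique_tokens / TOKENS_PER_PARAM


UNIQUE_TOKENS = 40e9  # the 40 billion "good" tokens from the example

if __name__ == "__main__":
    for n_params in (1e9, 2e9, 7e9, 13e9):  # hypothetical model sizes
        epochs = epochs_needed(n_params, UNIQUE_TOKENS)
        print(f"{n_params / 1e9:>4.0f}B params -> {epochs:.1f} epochs")
```

Under that assumption, a model of about 2 billion parameters fits in a single pass, while a 7-billion-parameter model would need about 3.5 passes over the same text. That is the range where the repeat-or-shrink decision becomes a real question.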

Leo: The answer probably mixes all of those. But this paper gives a way to reason about repeated data instead of treating it as automatically useless.

Maya: Another example: company-internal support tickets. They are valuable because they match the product domain. But there may not be many of them. Repetition can help the model learn the domain, but too much repetition may make it memorize quirks.

Leo: That brings in privacy and compliance too. Data-constrained training is not only a technical issue. It is also a governance issue: what are you allowed to train on, and what should you avoid?

Maya: Let’s connect to overfitting. Overfitting means the model gets very good at the training examples but worse at new examples. In language modeling, repeated data can increase that risk.
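
One simple way to watch for that pattern is to compare training and validation loss across epochs. The sketch below is a generic early-stopping-style check, not a method from the paper; the function name, the `patience` parameter, and the loss values in the usage example are all made up for illustration.

```python
from typing import Optional, Sequence


def first_overfit_epoch(train_loss: Sequence[float],
                        val_loss: Sequence[float],
                        patience: int = 2) -> Optional[int]:
    """Return the 1-based epoch with the best validation loss if validation
    then failed to improve for `patience` consecutive epochs while training
    loss kept falling. Return None if that pattern never appears."""
    best_val = float("inf")
    best_epoch: Optional[int] = None
    stale = 0
    for epoch, (tr, va) in enumerate(zip(train_loss, val_loss), start=1):
        if va < best_val:
            # Validation improved: record it and reset the counter.
            best_val, best_epoch, stale = va, epoch, 0
            continue
        # Validation did not improve. Only count it as an overfitting signal
        # if training loss is still going down (index epoch - 2 is the
        # previous epoch because `epoch` is 1-based).
        train_still_falling = tr < train_loss[epoch - 2]
        stale = stale + 1 if train_still_falling else 0
        if stale >= patience:
            return best_epoch
    return None


if __name__ == "__main__":
    # Made-up per-epoch losses over repeated passes on the same data.
    train = [3.0, 2.5, 2.1, 1.8, 1.5, 1.2]
    val = [3.1, 2.7, 2.6, 2.65, 2.7, 2.8]
    print(first_overfit_epoch(train, val))
```

In the made-up run, validation loss bottoms out at epoch 3 and then rises while training loss keeps dropping, so the check points back to epoch 3 as the last useful pass.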

Leo: But repeated data is not always the villain. Humans reread important material. Engineers fine-tune on small datasets. The question is how much repetition and under what conditions.

Maya: The paper’s message is close to: repetition buys you some learning, but repeated tokens are discounted. Fresh, diverse, high-quality tokens remain valuable.

Leo: And that changes scaling planning. If your data is constrained, the compute-optimal model may be smaller than the model you would choose with unlimited fresh data.

Maya: Because extra parameters without enough effective data can become excess capacity.

Leo: Exactly. The model has room to learn, but not enough new experience to fill that room.

Maya: This episode also foreshadows later topics. Fine-tuning is often data-constrained. Reinforcement Learning from Human Feedback is data-constrained because high-quality preference labels are expensive. Harness engineering can be seen as an inference-time response: instead of storing everything in weights, retrieve or present the right information when needed.

Leo: That is a nice connection. If training data is limited, maybe the model should not memorize every fact. Maybe the system should retrieve facts at inference time.

Maya: So the boundary between model training and system design becomes blurry.

Leo: Let’s summarize. Scaling Data-Constrained Language Models studies training when unique data is limited. It finds that moderate repetition can be useful, heavy repetition has diminishing returns, and compute-optimal planning should account for the decreasing value of repeated tokens.

Maya: The larger lesson: scaling is no longer only about model size and compute. It is about data economics.

Leo: And data economics includes availability, quality, legality, diversity, and how many times the model has already seen similar examples.

Maya: Let’s make the “effective data” idea concrete. Suppose you have one million unique examples. If you repeat them four times, you have four million training examples in a counting sense, but not four million independent lessons.
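
A minimal numeric sketch of that example, assuming an exponentially saturating discount of the kind the paper's scaling law uses. The constant `r_star = 15.0` is an assumption in the neighborhood of what the paper reports for data and should be treated as a placeholder; the function name is hypothetical.

```python
import math


def effective_examples(unique: float, epochs: float, r_star: float = 15.0) -> float:
    """Discounted count of examples after `epochs` passes over `unique` items.

    ASSUMPTION: repeated passes lose value along an exponential saturation
    curve, and r_star = 15.0 is a placeholder constant, not a verified fit.
    """
    if epochs <= 1.0:
        # Less than one full pass: every example seen so far is fresh.
        return unique * epochs
    repeats = epochs - 1.0  # passes beyond the first
    return unique * (1.0 + r_star * (1.0 - math.exp(-repeats / r_star)))


if __name__ == "__main__":
    unique = 1_000_000
    for epochs in (1, 4, 40):
        counted = unique * epochs
        effective = effective_examples(unique, epochs)
        print(f"{epochs:>3} epochs: counted {counted:,}  effective {effective:,.0f}")
```

With these assumptions, four passes over one million examples are worth roughly 3.7 million fresh examples, so a few repeats cost little. At forty passes the counted total is 40 million but the effective total is under 15 million, and it can never exceed 16 million no matter how long training runs.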

Leo: The model gets practice, not novelty. Practice has value. Novelty has different value. A data-constrained scaling law tries to model that difference.

Maya: This matters for specialized domains. In medicine, law, chemistry, or low-resource languages, the most relevant corpus may be much smaller than general web text. You cannot always buy your way out with more crawl data.

Leo: And repeating specialized data can create memorization concerns. If the data contains sensitive details, repetition may increase the chance the model reproduces them.

Maya: So data scarcity intersects with privacy, copyright, and evaluation contamination. The question is not merely “how many tokens exist?” but “which tokens can we responsibly use?”

Leo: Another practical implication is that retrieval can complement training. If a fact is rare, changing the model weights to memorize it may be less efficient than retrieving it when needed.

Maya: That ties back to the series arc. As training data becomes constrained, system design becomes more important. The model does not have to carry every piece of knowledge internally if the surrounding system can fetch reliable context.

Leo: In other words, data limits push us from “make the model know everything” toward “make the system know how to find and use what matters.”

Maya: The uncomfortable ending is that data scarcity may make evaluation more important, not less. If teams use synthetic data, repeated data, or relaxed filters, they need stronger tests to catch memorization, degradation, and brittle behavior.
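
As one small illustration of what a "stronger test" might look like, here is a sketch of an n-gram overlap check between an evaluation item and the training documents. It is a generic contamination heuristic, not something described in the paper; the function names are hypothetical, and whitespace tokenization is a simplification.

```python
from typing import Iterable, Set, Tuple


def ngrams(text: str, n: int = 8) -> Set[Tuple[str, ...]]:
    """Set of lowercase word n-grams (whitespace tokenization, a simplification)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(eval_text: str, train_docs: Iterable[str], n: int = 8) -> float:
    """Fraction of the eval item's n-grams that also appear in training docs.

    Values near 1.0 suggest the item may have been seen during training.
    Returns 0.0 when the eval text is shorter than n tokens.
    """
    eval_grams = ngrams(eval_text, n)
    if not eval_grams:
        return 0.0
    train_grams: Set[Tuple[str, ...]] = set()
    for doc in train_docs:
        train_grams |= ngrams(doc, n)
    return len(eval_grams & train_grams) / len(eval_grams)


if __name__ == "__main__":
    # Toy strings, with n=3 so the short examples produce n-grams.
    train_docs = ["the cat sat on the mat today", "models learn from text"]
    print(overlap_fraction("the cat sat on the mat", train_docs, n=3))
    print(overlap_fraction("a completely different sentence here", train_docs, n=3))
```

A check like this only flags verbatim overlap. Paraphrased leakage and gradual quality degradation need other tests, such as held-out evaluations built after the training data cutoff.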

Leo: So the data-constrained world is not just about squeezing more from less. It is about proving that the squeeze did not distort the model in ways users will notice later, even months afterward.

Maya: Final question for listeners: when high-quality human text becomes scarce, should we teach models by repeating, generating, retrieving, or interacting?

Credits: Thanks for listening. The producer is William Liu. Join us for the next episode.

