William Liu · Podcasts
T7E4 · Jun 23, 2026 · 00:13:44

AutoML from Basics to State-of-the-Art Techniques

Maya and Leo trace AutoML from grid and random search through Bayesian optimization, neural architecture search, and automated pipeline construction (TPOT, AutoGluon, Auto-Keras), then ask where automated search beats expert hand-tuning when both draw from the same fixed ninety-day compute budget. The episode's spine: the unit of automation keeps growing, the real enemy is the cost of one evaluation, and at large-model scale AutoML stops being a way to escape the budget and becomes a way to spend it deliberately — inside a search box a human still has to draw.

Transcript

Maya: Picture the ninety-day lab again: the wall of rented machines, the clock already running. And two whiteboards. On the left, an engineer who's trained a dozen of these, writing down expert count, routing penalty, data mix, learning rate, from gut and scar tissue. On the right, no human at all, just a search program that will *try* those settings, one after another, scoring each.

Leo: And here's the catch that makes today tense: that search runs on the same machines. Every configuration it tries burns compute the actual model didn't get to train on.

Maya: That's the move. Today's source is a survey on Automated Machine Learning, AutoML, and the problem underneath is brutally simple. The search for good settings competes with training for the *same* fixed budget.

Leo: Right, the same wall.

Maya: The same wall. Last time we looked at the Pile: how diverse training data, eight hundred gigabytes from twenty-two domains, is a deliberate design choice, and how that breadth buys generalization in tension with aggressive curation for quality.

Leo: That was about *what* goes into the model. Today flips to *who decides*: whether a human or an algorithm should be turning those dials at all.

Maya: So let me name it plainly first. AutoML is the attempt to automate the parts of machine learning that, for years, were craft. Choosing the architecture. Setting the hyperparameters. Building the data pipeline. Things a senior practitioner did by feel.

Leo: Hyperparameters meaning the knobs you set *before* training, not the weights the model learns *during* it.

Maya: That's the distinction. Learning rate, how many experts, how hard the routing penalty pushes: the model never learns those for you. Someone, or something, picks them up front.

Leo: And historically that someone was a person with taste and a lot of failed runs behind them. Which always bugged me, because it doesn't scale and it doesn't transfer. Your intuition for image models doesn't tell you the learning rate for a sparse text model.

Maya: For most of machine learning's history the loop was: guess a configuration, train, look, adjust, train again. So AutoML asks: can we make the *search itself* a system? I'll take it in three rooms.

Leo: Name them.

Maya: The Knob Room: tuning hyperparameters. The Blueprint Room: searching over the architecture itself. And the Assembly Room: automating the whole pipeline, soup to nuts.

Leo: Start in the Knob Room. That's where everybody starts.

Maya: It is. You've got a handful of knobs and you want the best combination. The oldest, most honest method is grid search: lay out every knob, pick a few values each, try *every* combination. Exhaustive. No cleverness.

Leo: And catastrophic the moment you have more than a couple of knobs. Four knobs, five values each: that's five to the fourth power, six hundred and twenty-five runs, before you've learned anything.

Maya: That's the curse of dimensionality in the worst possible place. Every knob multiplies the grid. For the ninety-day lab, gridding expert count, routing penalty, data mix, and learning rate would eat the whole budget on settings and leave nothing to train on.
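The arithmetic Leo cites can be checked with a minimal sketch. The four knob names come from the episode; the five candidate values per knob and the name `code_fraction` (standing in for "data mix") are assumptions made up purely for illustration.

```python
from itertools import product

# Hypothetical candidate values: only the knob names come from the episode.
grid = {
    "expert_count": [4, 8, 16, 32, 64],
    "routing_penalty": [0.001, 0.003, 0.01, 0.03, 0.1],
    "code_fraction": [0.1, 0.2, 0.3, 0.4, 0.5],
    "learning_rate": [1e-5, 3e-5, 1e-4, 3e-4, 1e-3],
}

def grid_configs(space):
    """Yield every combination of values, one dict per configuration."""
    names = list(space)
    for values in product(*(space[name] for name in names)):
        yield dict(zip(names, values))

configs = list(grid_configs(grid))
print(len(configs))  # 5 ** 4 = 625
```

Adding a fifth knob with five values would multiply the count by five again, which is the multiplication Maya describes.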

Leo: So nobody serious grids a big model.

Maya: Nobody serious grids a big model. The first real improvement is almost insultingly simple: random search. Instead of a neat grid, you throw darts. Sample random combinations of the knobs.

Leo: Wait, random *beats* the grid?

Maya: For a beautifully specific reason, one of my favorite results in the field. Usually only a few knobs matter much. Learning rate matters enormously; some other knob barely moves the needle. A grid spends most of its tries revisiting the same few values of the important knob while it steps through the unimportant one. Random search, by scattering, samples *more distinct values* of the knob that matters.

Leo: Oh, that inverts my intuition. So the disorder is the feature. You cover the important dimension better precisely because you stopped being tidy.

Maya: That's the whole insight. For a fixed number of trials, which, remember, is your budget, random search tends to match or beat a grid of the same size when only a few knobs really matter.
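The coverage argument can be made concrete with a small sketch. It assumes two knobs scaled to the range 0 to 1 and a budget of 25 trials; the function names are hypothetical. It only counts how many distinct values of one knob each strategy tries, and makes no claim about the resulting model quality.

```python
import random

def grid_trials(values_per_knob):
    """A 2-knob grid: every pairing of evenly spaced values in [0, 1]."""
    axis = [i / (values_per_knob - 1) for i in range(values_per_knob)]
    return [(a, b) for a in axis for b in axis]

def random_trials(n_trials, seed=0):
    """The same number of trials, each knob sampled uniformly at random."""
    rng = random.Random(seed)
    return [(rng.random(), rng.random()) for _ in range(n_trials)]

def distinct_values(trials, knob_index):
    """How many different settings of one knob were actually tried."""
    return len({trial[knob_index] for trial in trials})

grid = grid_trials(5)      # 5 x 5 = 25 trials
rand = random_trials(25)   # 25 trials

print(distinct_values(grid, 0))  # 5: the grid repeats each value five times
print(distinct_values(rand, 0))  # 25: every trial probes a new value
```

If knob 0 is the one that matters, the grid has effectively run five experiments on it and the random sampler twenty-five, for the same budget.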

Leo: Okay. But random's still blind. Trial fifty knows nothing trial one didn't.

Maya: And that's the doorway to the method that runs at scale: Bayesian optimization. The shift: instead of treating each trial as independent, you build a little model of the search itself, one that predicts where the next *promising* settings are, given everything you've tried.

Leo: So it's a model whose job is to guess where the good model lives.

Maya: A model of a model, yes. After each expensive run it updates its belief and balances two urges. Exploit: go near the best thing found. Explore: go somewhere uncertain, because the best might be hiding there.

Leo: That explore-exploit tension shows up everywhere: slot machines, drug trials. What makes it bite here is each "pull" costs a chunk of the ninety days.

Maya: Which is why you'd pay for the cleverness. When a single trial is a multi-day run, you can't afford blind darts; you want each next trial to be the most informative one. That's what *sample-efficient* means: fewer trials to a good answer. Every trial saved is days back on the wall.
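One common way to build that "model of the search" is a Gaussian process with an upper-confidence-bound rule for choosing the next trial. The sketch below is a minimal version under stated assumptions, not the survey's method: a single knob scaled to 0 to 1, a fixed RBF kernel, and `expensive_run`, a hypothetical stand-in for a multi-day training run whose score peaks at 0.7.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=0.2):
    """Squared-exponential kernel between two 1-D arrays of points."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

def gp_posterior(x_seen, y_seen, x_query, noise=1e-6):
    """Posterior mean and std of a zero-mean GP at x_query."""
    k_ss = rbf_kernel(x_seen, x_seen) + noise * np.eye(len(x_seen))
    k_sq = rbf_kernel(x_seen, x_query)
    solve = np.linalg.solve(k_ss, k_sq)        # K^-1 k(x_seen, x_query)
    mean = solve.T @ y_seen
    var = 1.0 - np.sum(k_sq * solve, axis=0)   # prior variance is 1 for this kernel
    return mean, np.sqrt(np.clip(var, 0.0, None))

def expensive_run(x):
    """Hypothetical stand-in for a training run: score to maximize, peak at x = 0.7."""
    return -(x - 0.7) ** 2

def bayes_opt(n_trials=8, kappa=2.0):
    """Run n_trials evaluations, choosing each one by upper confidence bound."""
    candidates = np.linspace(0.0, 1.0, 101)
    x_seen = np.array([0.0, 1.0])              # two initial trials at the edges
    y_seen = expensive_run(x_seen)
    for _ in range(n_trials - len(x_seen)):
        mean, std = gp_posterior(x_seen, y_seen, candidates)
        ucb = mean + kappa * std               # exploit (mean) plus explore (std)
        x_next = candidates[np.argmax(ucb)]
        x_seen = np.append(x_seen, x_next)
        y_seen = np.append(y_seen, expensive_run(x_next))
    best = int(np.argmax(y_seen))
    return x_seen[best], y_seen[best], x_seen

best_x, best_y, history = bayes_opt()
print(best_x, best_y)
```

The `kappa` parameter is the explore-exploit dial from the conversation: a larger value favors uncertain regions, a smaller one stays near the best setting found so far.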

Leo: All right, Knob Room makes sense. But none of that touches the *shape* of the model. We're still assuming the architecture is fixed.

Maya: Which is the door to the Blueprint Room: neural architecture search, NAS. The audacious idea: don't just tune the model, *design* it automatically. Let an algorithm search over the architecture itself: how many layers, how they connect, what each block does.

Leo: Huh. So the search space stops being numbers and becomes structures.

Maya: It becomes the building itself. And the early results were startling: searched architectures matching or beating ones human researchers spent years refining.

Leo: And now I get to be the wet blanket, because I remember what those results *cost*. The famous reinforcement-learning version of architecture search ran to thousands of GPU-days for a single search, one run priced like training hundreds of ordinary models.

Maya: The central tension of the field, and you're right to lead with cost. Look at what architecture search *is*: training models to figure out which model to train. Compute spent on the meta-question instead of the object question.

Leo: And for the ninety-day lab that's disqualifying. If naive search costs more than the run it's trying to improve, you've inverted your own budget. You'd spend the wall searching and have nothing left to build.

Maya: So I'll concede the naive version outright. Full from-scratch architecture search, the early kind: no small or mid-sized lab can afford it. Full stop.

Leo: Thank you.

Maya: But, and this is where the field went, the survey traces how the cost came down. Weight-sharing was the big one. Instead of training every candidate from scratch, you train one big over-parameterized supernetwork that holds all the candidates as sub-paths, and let them *share* weights. Now evaluating a candidate is reading a path through something you already trained.

Leo: So the trick is to stop re-paying the training cost on every trial.

Maya: That's the whole arc, in both rooms. Bayesian optimization makes the knob search sample-efficient; weight-sharing makes the architecture search amortized. Same enemy, the cost of evaluating one candidate, attacked two ways.
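The bookkeeping behind weight-sharing can be sketched in a few lines. This is a toy under loud assumptions: three layers with four candidate operations each, every operation reduced to one weight matrix, and the supernetwork training step omitted entirely (the shared weights here are random placeholders). The names `shared`, `forward`, and `score` are hypothetical. What the sketch does show is that each candidate is only a path of indices into weights stored once.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
N_LAYERS, N_OPS, WIDTH = 3, 4, 8

# One weight matrix per (layer, op), stored once and shared by every candidate.
# In a real supernetwork these would be trained; here they are random placeholders.
shared = {(layer, op): rng.normal(scale=0.3, size=(WIDTH, WIDTH))
          for layer in range(N_LAYERS) for op in range(N_OPS)}

def forward(path, x):
    """Run x through the sub-path `path`: a tuple with one op index per layer."""
    for layer, op in enumerate(path):
        x = np.tanh(x @ shared[(layer, op)])
    return x

def score(path, x, target):
    """Hypothetical proxy score: negative mean squared error with the shared weights."""
    return -float(np.mean((forward(path, x) - target) ** 2))

x = rng.normal(size=(16, WIDTH))
target = rng.normal(size=(16, WIDTH))

# Every candidate architecture is scored without any per-candidate training.
candidates = list(itertools.product(range(N_OPS), repeat=N_LAYERS))
ranked = sorted(candidates, key=lambda p: score(p, x, target), reverse=True)

print(len(candidates), "candidate architectures")  # 4 ** 3 = 64
print(len(shared), "shared weight matrices")       # 3 * 4 = 12
```

The count of candidates grows exponentially with depth while the count of stored weights grows linearly, which is where the amortization comes from. It is also why rankings can be noisy: every candidate is judged with weights that were never trained for it alone.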

Leo: Okay, that I'll buy. Weight-sharing has its own gremlins (shared weights can make a bad architecture look good, the ranking gets noisy), but the order of magnitude is sane now. From national-lab budget to something a serious team could attempt.

Maya: The survey doesn't hide the ranking noise, but that's the shift. Which leaves the third room.

Leo: The Assembly Room.

Maya: Automated pipeline construction. The model is the *middle* of the work, not the whole of it. Before it: data cleaning, feature engineering, encoding. After it: calibration, thresholds. AutoML at its most ambitious automates the *whole assembly line*.

Leo: This touches our running example most. The data mix in the ninety-day lab (how much code, how much prose, how many lab notes) is a pipeline decision, not a model decision.

Maya: It is, and the survey names concrete systems. A tool called TPOT treats the whole pipeline as something to *evolve*: genetic programming, breeding and mutating candidate pipelines like organisms, keeping the fit ones.

Leo: Evolving pipelines. Okay.

Maya: And AutoGluon and Auto-Keras lean more on smart defaults and ensembling, stacking several decent models rather than hunting for one perfect one. The survey's point isn't that one tool wins. It's that the *unit of automation* keeps growing. Knobs, then architecture, then the whole pipeline, each step automating more of what used to be human craft.
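The "breed, mutate, keep the fit ones" loop Maya attributes to TPOT can be illustrated with a toy. This sketch does not use TPOT or its API, and it is a simpler fixed-length genetic algorithm rather than genetic programming over pipeline trees. The stage options and the `fitness` score table are hypothetical stand-ins for real preprocessing steps and cross-validated accuracy.

```python
import random

# Hypothetical pipeline stages and options, for illustration only.
STEPS = {
    "scaler": ["none", "standardize", "minmax"],
    "features": ["raw", "poly2", "pca"],
    "model": ["linear", "tree", "knn"],
}

def random_pipeline(rng):
    return {stage: rng.choice(options) for stage, options in STEPS.items()}

def mutate(pipeline, rng):
    """Re-draw the choice at one randomly selected stage."""
    child = dict(pipeline)
    stage = rng.choice(list(STEPS))
    child[stage] = rng.choice(STEPS[stage])
    return child

def crossover(a, b, rng):
    """Take each stage from one parent or the other."""
    return {stage: rng.choice([a[stage], b[stage]]) for stage in STEPS}

def fitness(pipeline):
    """Hypothetical score table standing in for cross-validated accuracy."""
    points = {"standardize": 2, "minmax": 1, "poly2": 2, "pca": 1, "tree": 2, "knn": 1}
    return sum(points.get(choice, 0) for choice in pipeline.values())

def evolve(generations=10, population_size=8, seed=0):
    rng = random.Random(seed)
    population = [random_pipeline(rng) for _ in range(population_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        survivors = population[: population_size // 2]   # keep the fit half
        children = [
            mutate(crossover(rng.choice(survivors), rng.choice(survivors), rng), rng)
            for _ in range(population_size - len(survivors))
        ]
        population = survivors + children
    return max(population, key=fitness)

best = evolve()
print(best, fitness(best))
```

In a real system each call to `fitness` would mean fitting and validating a full pipeline, so the population size and generation count are themselves budget decisions.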

Leo: So let me drag this where the title points: large models, LLM scale. Everything you've described sounds great for a model you can train in an afternoon. The ninety-day lab can't run a hundred trials. It can barely run *one*.

Maya: And that's the honest limit, and the survey is careful here. The classic loop assumes you can afford many trials. At frontier scale that breaks: when a single training run *is* the budget, you can't do a search *over* training runs.

Leo: So does AutoML just... not apply up there?

Maya: It applies, but it changes shape, in two ways. One, you search small and transfer up. Tune a tiny proxy model, find settings that hold as you scale, carry them to the big run. The game becomes finding what's *predictable* across scale.

Leo: Strong assumption, though: that the best learning rate on a toy model is the best one on the real one. Sometimes it transfers, sometimes it absolutely doesn't.

Maya: The live research question, and the survey flags it rather than pretending it's solved. The other way is narrower automation: automate the cheap parts, the data-recipe sweeps, the small ablations, and leave the irreversible big-run choices to humans.

Leo: And that's where I plant the flag, the limit the whole episode's been circling. There's a human prior in all of this that automation doesn't remove; it just *relocates* it. Somebody chose the search space. Somebody decided the learning rate lives between *these* two numbers. The algorithm only explores the box a human drew. AutoML automates the search *inside* the box. It does not automate drawing the box.

Maya: And that's the deepest honest finding in the survey. The human didn't disappear; they moved up a level, from picking settings to designing the space the search runs in. The expert on the left whiteboard didn't lose to the algorithm on the right. The expert drew the room the algorithm searches.

Leo: So the real ninety-day decision isn't "automated versus hand-tuned." It's "where do I spend human judgment, and where do I spend compute."

Maya: That's the synthesis. Hand-tuning spends human judgment to *save* compute: scar tissue skips a hundred bad trials. Automated search spends compute to *save* human judgment, but only inside a box the human still has to draw well. Both draw from the same wall.

Leo: So AutoML isn't a way to *escape* the budget.

Maya: It's a way to *spend* it more deliberately, and only if the search is sample-efficient enough that the trials it buys are worth more than the training they cost.
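The trade-off in that last exchange comes down to simple accounting. In the sketch below only the ninety-day budget comes from the episode. The per-trial costs and trial counts are hypothetical, and the function assumes trials run one after another on the same machines that would otherwise be training.

```python
def days_left_for_training(budget_days, trial_days, n_trials):
    """Days remaining for the real run after n_trials sequential search trials."""
    search_days = trial_days * n_trials
    if search_days > budget_days:
        raise ValueError("search alone exceeds the budget")
    return budget_days - search_days

BUDGET_DAYS = 90  # from the episode's running example

# Hypothetical trial costs, for illustration only.
print(days_left_for_training(BUDGET_DAYS, trial_days=0.5, n_trials=20))  # 80.0
print(days_left_for_training(BUDGET_DAYS, trial_days=3, n_trials=20))    # 30
```

Twenty half-day proxy trials leave most of the wall for training. Twenty three-day trials consume two thirds of it, so the search would have to find a much better configuration to pay for itself.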

Leo: Which, for a small lab, is a genuinely hard call. The clever search might find a better model. It might also burn three weeks proving your hand-tuned guess was already fine.

Maya: And you only know which afterward. So here's where I'll leave it for you. If you had one fixed budget and one shot, would you spend a slice of it letting an algorithm search for the right settings, or trust an expert's hand to set them and pour every machine into the run itself?
