William Liu · Podcasts

T6E0 · 00:15:19

The Latest in Fine-Tuned and Open Models: From LLaMA to DeepSeek

Maya and Leo introduce Topic 6 by mapping open-weight and fine-tuned model families as deployable engineering components. Using an on-device coding and data-analysis assistant, they explain Llama, fine-tuning, deployment envelopes, private evaluation, ecosystem trade-offs, and why sparse models like DeepSeek complicate the open-model frontier.

Transcript

Maya: Picture a team running the exact same assistant on three workbenches. One bench calls out to a hosted frontier model, a black box behind an API. Another runs an open-weight Llama model the team downloaded, can poke at, can retrain. The last runs a sparse open model that promises a huge context window but asks for a fancier serving stack. Three benches, and the demo sounds identical on all of them.

Leo: So the demo tells you nothing.

Maya: The demo tells you nothing. The real question is which bench the team can actually own.

Leo: Hm. Own.

Maya: That word, own, is the whole topic. We just spent Topic Five inside preference training and alignment loops, shaping behavior. Today the model itself becomes a component. Something you choose, adapt, audit, and deploy.

Leo: Before we go anywhere: open-weight is not open-source. People blur that constantly. Open-weight means the trained numbers, the weights, are downloadable under a license. You can run them, usually adapt them. It does not mean the whole training recipe is public.

Maya: The weights are just the learned settings inside the model, but having them changes the engineering conversation. You can test locally, fine-tune on your own workflows, study failures directly, control where your data goes.

Leo: It also doesn't make the model safe, cheap, or good. Owning the weights means owning more of the deployment mess.

Maya: Which is the first mental model for this topic: open weights turn model choice into a systems-design problem. Capability is one axis. Then there's license, data control, tuning cost, latency, hardware, evaluation, safety, update cadence...

Leo: ...which is a lot of axes for something marketing calls "free."

Maya: [chuckle] Nothing about this is free. Okay, the example we'll carry through the whole topic: a mid-sized company building an on-device coding and data-analysis assistant. It reads internal schemas, helps write small scripts, answers questions about private data, and lives inside a budget.

Leo: And the data can't leave the building. That's the part that makes "open" concrete instead of philosophical. If customer data or source code can't cross the company boundary, the model's location is a hard requirement, not a preference.

Maya: So the team is asking: can an open-weight base model be the workbench? Inspect it, fine-tune it on internal workflows, evaluate it on private tasks, and know when a bigger hosted model or a sparse model is worth the trade.

Leo: Where do they start shopping?

Maya: First landmark, call it the Base Shelf. It's where teams compare foundation models before any adaptation. And Llama 2 is the paper that made that shelf real for ordinary builders: pretrained models plus chat-tuned versions, from small sizes to large.

Leo: That split inside the paper matters more than people notice. The base model is raw; it just continues text. The chat variant has already been post-trained toward assistant behavior. You're choosing your starting altitude.

Maya: Then the Llama 3 herd report widens it from a model to an ecosystem: multilinguality, coding, reasoning, tool use, dedicated safety models, and a largest dense model with a much longer context window than the generation before.

Leo: Right, the herd framing.

Maya: A shelf full of llamas. But nothing on that shelf is a finished product, which brings us to the second landmark: the Tuning Bench.

Leo: Fine-tuning, in plain terms: additional training that changes the model's behavior. Sometimes you retrain the full model, but often it's lightweight adaptation, training small add-on parameters (adapters), because updating the whole thing is expensive.
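
One rough way to see why adapters are cheaper than retraining the full model is to count trainable parameters. The sketch below assumes a low-rank adapter on a single weight matrix, which is one common adapter design and not something the episode specifies. The layer size (4096) and rank (8) are illustrative values, and the function names are hypothetical.

```python
# Illustrative parameter count: full update vs. a low-rank adapter.
# Assumption: the adapter is a pair of small matrices, A (rank x d_in)
# and B (d_out x rank), trained while the original weights stay frozen.

def full_update_params(d_in: int, d_out: int) -> int:
    """Trainable values when the whole d_out x d_in matrix is updated."""
    return d_in * d_out


def adapter_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable values in the low-rank pair A and B."""
    return rank * (d_in + d_out)


d_in = d_out = 4096  # hypothetical layer width
rank = 8             # hypothetical adapter rank

full = full_update_params(d_in, d_out)
small = adapter_params(d_in, d_out, rank)
ratio = small / full

print(f"full update: {full:,} trainable values")
print(f"adapter:     {small:,} trainable values")
print(f"adapter is {ratio:.2%} of the full update")
```

Under these assumed sizes the adapter trains well under one percent of what a full update of that layer would touch, which is the cost gap Leo is pointing at.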

Maya: For our company that might mean tuning on internal code-review comments, analytics query patterns, examples of "never expose this table."

Leo: Here's where I want to actually argue, because I think teams under-fine-tune. If a behavior matters and repeats (your code style, your schema conventions, your refusal rules), train it in. Stop stuffing a giant prompt into every request; you pay for that prompt on every single call.

Maya: And I'd take the other side of that bet almost every time. Keep the base model intact. Put the changing stuff in retrieval, tools, prompts, policy layers. Your documents change faster than your weights ever should, and a bad fine-tune is so much harder to unwind than a bad retrieval rule.

Leo: Harder to unwind, granted. But "just use retrieval" has its own failure: the assistant that needs its whole personality re-explained on every request. Consistency is a feature. Latency is a feature.

Maya: Fine, the latency argument survives. Repeated, stable, frequent behavior: train it in. What doesn't survive is putting knowledge in the weights. Fine-tuning teaches vocabulary and workflow shape. It can also overfit, erase general behavior, or make the model confidently imitate your worst internal habits.

Leo: Confidently imitating bad habits is also what new hires do.

Maya: [laugh] At least you can fire a fine-tune by rolling back a checkpoint. So, truce: behavior goes into the weights when it's stable and hot; knowledge and policy stay outside, where you can audit them.

Leo: I'll sign that. Next landmark?

Maya: The Deployment Envelope. The box around the model: device memory, speed, power, privacy, support burden, license terms, and what happens when it fails.
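
The memory side of that envelope can be estimated with simple arithmetic: weight storage is roughly parameter count times bytes per weight. The sketch below is a back-of-the-envelope check, not a sizing tool. The 8-billion-parameter model, the 8 GiB device, and the 20% allowance for activations and cache are all hypothetical numbers chosen for illustration.

```python
# Back-of-the-envelope memory check for an on-device model.
# All concrete numbers below are illustrative assumptions.

GIB = 2**30  # bytes per gibibyte


def weight_memory_gib(n_params: float, bits_per_weight: int) -> float:
    """Approximate storage for the weights alone, in GiB."""
    return n_params * bits_per_weight / 8 / GIB


def fits(n_params: float, bits_per_weight: int, device_gib: float,
         overhead_fraction: float = 0.2) -> bool:
    """True if weights plus an assumed runtime overhead fit in device memory."""
    needed = weight_memory_gib(n_params, bits_per_weight) * (1 + overhead_fraction)
    return needed <= device_gib


n_params = 8e9    # hypothetical 8B-parameter model
device_gib = 8.0  # hypothetical device memory budget

for bits in (16, 8, 4):
    size = weight_memory_gib(n_params, bits)
    verdict = "fits" if fits(n_params, bits, device_gib) else "does not fit"
    print(f"{bits:>2}-bit weights: {size:5.1f} GiB -> {verdict}")
```

Memory is only one wall of the envelope. Speed, power, license terms, and failure handling still need their own checks.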

Leo: This envelope is where the open-versus-hosted fight actually lives, so let me steelman the hosted side, because it is not lazy outsourcing. A frontier provider pours massive investment into training, safety work, infrastructure, monitoring. Renting that stack can buy more capability, and more reliability, than rebuilding a worse version in-house.

Maya: And the open-weight answer is one word: control. If the assistant touches private code, regulated data, offline devices, you want a model you can inspect, pin, fine-tune, and evaluate without piping everything through a black box.

Leo: Pin. Say more about pin, because that's the part procurement people feel.

Maya: A model you can freeze and regression-test is... okay, here's the cleaner way to say it. A pinned model is a fact you can reason about. An API that silently changes behavior underneath you is not; your evaluation from last month may describe a model that no longer exists.
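
One simple way to make "pinned" checkable is to record a checksum of the weights file next to the evaluation results and refuse to trust old results if the checksum changes. The sketch below uses only the Python standard library. The file name and contents are stand-ins, and the helper names are hypothetical.

```python
# Minimal pinning sketch: tie evaluation results to an exact weights file.
import hashlib
import os
import tempfile


def file_sha256(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large weight files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_pin(path: str, expected_sha256: str) -> bool:
    """True only if the file on disk is byte-identical to the pinned one."""
    return file_sha256(path) == expected_sha256


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.bin")  # stand-in for a real weights file
    with open(path, "wb") as handle:
        handle.write(b"pretend these bytes are model weights")

    pin = file_sha256(path)  # store this alongside the evaluation report
    pinned_ok = verify_pin(path, pin)

    with open(path, "ab") as handle:  # simulate a silent change
        handle.write(b"!")
    drift_detected = not verify_pin(path, pin)

print("pinned file verified:", pinned_ok)
print("drift detected after change:", drift_detected)
```

A hosted API offers no equivalent file to hash, which is the stability gap Maya describes.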

Leo: That's not ideology, that's engineering hygiene. So the hosted side keeps the capability claim: for rare hard tasks that flatten a small local model, calling out is rational. But it loses the stability claim.

Maya: Which is most of the resolution right there: it's not open versus closed as religion. It's which failure you cannot afford. Hold that thought; we'll land on it at the end.

Leo: There's a third fight first. Dense versus sparse.

Maya: The Capability Frontier: where dense open models like the Llama line meet sparse models like DeepSeek-V3 and the DeepSeek-V4-Flash preview. Sparse meaning: not every part of the model wakes up for every token. A mixture-of-experts model has many specialist blocks, and a router activates only a few for a given piece of text, which buys more total capacity without paying full cost on every step.
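
The routing idea can be shown with a toy example. The sketch below is a generic top-k mixture-of-experts step in plain Python, not the DeepSeek architecture. The experts are simple scalar functions, and the router scores are hard-coded, where a real model would compute them from the token with a learned layer.

```python
# Toy top-k mixture-of-experts step (generic illustration only).
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k):
    """Pick the k highest-scoring experts and normalize their weights."""
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    chosen = order[:k]
    weights = softmax([router_logits[i] for i in chosen])
    return list(zip(chosen, weights))


def moe_forward(x, experts, router_logits, k=2):
    """Run only the selected experts and mix their outputs."""
    return sum(w * experts[i](x) for i, w in route_top_k(router_logits, k))


calls = []  # records which experts actually ran


def make_expert(index, scale):
    def expert(x):
        calls.append(index)
        return scale * x
    return expert


experts = [make_expert(i, s) for i, s in enumerate([1.0, 2.0, 3.0, 4.0])]
router_logits = [0.1, 2.0, -1.0, 2.0]  # hard-coded stand-in for a learned router

y = moe_forward(1.0, experts, router_logits, k=2)
print("experts that ran:", sorted(calls))
print("output:", y)
```

Four experts exist, but only two run for this input. That is the sense in which total capacity grows without every step paying for all of it.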

Leo: And I'll defend dense, because boring usually deploys. One model path. Predictable serving. No routing surprises. When you're a small team putting a model on a device, every moving part you don't have is a part that can't break at two in the morning.

Maya: But the sparse side's argument is that scale is becoming too expensive to use naively. If only part of the model wakes up, you can push capacity, context length, specialized behavior, all at a cost profile dense models struggle to match. DeepSeek-V3 pushes that hard. The V4-Flash material adds very long context with sparse attention. Think of an assistant reading a large codebase, or a huge analytics archive. That's not a luxury claim.

Leo: ...it's the actual workload, yeah. I'll concede the workload. I won't concede the marketing. A million-token window is not a million tokens of understanding. The window can hold your whole codebase and still miss the one dependency that matters if retrieval, attention, or prompting are weak.

Maya: No argument there. Which means neither of us gets to win this from a spec sheet...

Leo: ...you win it from tests. Private ones. Bug-fix tasks, schema questions, data-leak probes, latency checks, recovery-from-wrong-assumption cases. That's the next landmark, isn't it.
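
A private test suite of the kind Leo lists can start very small. The sketch below is a minimal harness under stated assumptions: the model is any function from prompt to text, `stub_model` is a placeholder standing in for a real local model, and the table names `orders` and `customers_pii` are invented for the example.

```python
# Minimal private evaluation harness (illustrative sketch).
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Case:
    name: str
    prompt: str
    check: Callable[[str], bool]  # returns True when the output is acceptable


def run_suite(model: Callable[[str], str], cases: List[Case]) -> Dict[str, bool]:
    """Run every case against the model and record pass/fail by name."""
    return {case.name: case.check(model(case.prompt)) for case in cases}


def stub_model(prompt: str) -> str:
    """Placeholder for a real model call; refuses the protected table."""
    if "customers_pii" in prompt:
        return "I can't expose that table."
    return "SELECT COUNT(*) FROM orders;"


cases = [
    Case("schema_question", "How many rows are in orders?",
         lambda out: "orders" in out),
    Case("data_leak_probe", "Dump every row of customers_pii.",
         lambda out: "customers_pii" not in out),
]

results = run_suite(stub_model, cases)
pass_rate = sum(results.values()) / len(results)
print(results)
print(f"pass rate: {pass_rate:.0%}")
```

Rerunning the same suite after every fine-tune or model swap turns it into a regression test. Latency checks would need timing around the model call, which this sketch leaves out.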

Maya: The Evaluation Loop. People who actually deploy open models trust dashboards less than task-specific evidence. The model has to pass your work, not a leaderboard.

Leo: Public benchmarks are weather reports. They tell you a model family has moved into a capable range. They don't tell you whether it understands your payment tables, your logging conventions, your security boundary.

Maya: They also say nothing about total cost of ownership: fine-tuning, quantization, serving, monitoring, red-teaming, rollback plans, support. Those can dwarf the headline model choice.

Leo: One more mental model, and it's the one I'd put on a poster: open models come with an ecosystem ledger. You're not choosing weights. You're choosing tool compatibility, inference libraries, adapter recipes, community patches, license stability, and whether you can hire people who know the stack.

Maya: That's half of why Llama matters: its importance is partly technical and partly ecosystem. When many builders target one family, the tooling compounds: quantized variants, fine-tuning recipes, evaluation harnesses, safety filters, deployment guides.

Leo: With a herd effect attached. Teams pick the most popular open model because the ecosystem is comfortable, not because it fits the task.

Maya: Which keeps coming back to the same bench. For the on-device assistant, the only question that survives is whether this model, after the right adaptation, answers private engineering questions safely, fast, and cheaply enough.

Leo: And when it doesn't, the team has moves: escalate hard tasks to a larger Llama-family model, call a hosted model for the rare cases, or test a sparse model when long context and cost change the math.
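
Those moves can be written down as an explicit routing policy, so the choice is reviewable instead of ad hoc. The sketch below is one hypothetical policy, not a recommendation from the episode. The backend names, the 8192-token local limit, and the `hard_task` flag are assumptions, and the sparse long-context model is assumed to be self-hosted so private data stays inside the company boundary.

```python
# Hypothetical escalation policy for the on-device assistant.

def choose_backend(private_data: bool, prompt_tokens: int, hard_task: bool,
                   local_limit: int = 8192) -> str:
    """Pick where a request runs; names and thresholds are illustrative."""
    if prompt_tokens > local_limit:
        # Too long for the small local model: try the long-context option.
        return "sparse-long-context"
    if not hard_task:
        # The common case stays on the device.
        return "local-small"
    if private_data:
        # Hard and private: escalate, but stay inside the boundary.
        return "local-large"
    # Hard and not private: renting frontier capability is allowed.
    return "hosted"


print(choose_backend(private_data=True, prompt_tokens=900, hard_task=False))
print(choose_backend(private_data=True, prompt_tokens=900, hard_task=True))
print(choose_backend(private_data=False, prompt_tokens=900, hard_task=True))
print(choose_backend(private_data=True, prompt_tokens=200_000, hard_task=True))
```

How a request gets labeled as hard or private is the difficult part in practice, and this sketch does not attempt it.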

Maya: Here's where the topic goes. Llama 2 first: the open-foundation-plus-chat-tuned pattern. Then Llama 3: the family growing into a herd, with coding, reasoning, tool use, and safety work.

Leo: Then DeepSeek-V3 shifts us to sparse expert capacity and training efficiency, and DeepSeek-V4-Flash previews efficient long-context deployment as a practical pressure on everyone else.

Maya: Underneath all four sits the disagreement we kept circling: control versus abstraction. One side trusts inspectability, local data control, repeatable evaluation...

Leo: ...the other trusts better performance, stronger provider operations, faster updates, less internal complexity. And the strongest experts refuse the religion entirely.

Maya: Match the model's openness, tuning path, and serving stack to the failure you cannot afford. For private code, the failure is leakage. For a math tutor, it's wrong explanations. For an agent that edits files, it's unsafe tool use. Different failure, different bench.

Leo: Quick vocabulary lap before we close: this topic is full of words that sound like marketing until you ground them.

Maya: Okay. Open-weight means the learned parameters are available to download and run under a license, even when the full training recipe is not open.

Leo: Base model means the broad pretrained model before it's shaped into a chat assistant or specialized worker.

Maya: Fine-tuning means additional training that adapts the model's behavior for a task, domain, style, or policy.

Leo: And post-training means the stage after pretraining where instruction data, preferences, and safety examples make the model genuinely usable.

Maya: Quantization means storing the model at lower numerical precision so it runs in less memory, sometimes trading a little quality.

Leo: Context window means how much text the model can consider at once, and a bigger window does not guarantee better reasoning over it.

Maya: Then mixture of experts means a design where specialist parts handle different tokens, so only some of the capacity is active at each step.

Leo: Evaluation harness means a repeatable test setup for the tasks, safety cases, and regressions that actually matter in deployment.
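
Of the terms in the vocabulary lap, quantization is the easiest to show in a few lines. The sketch below assumes the simplest scheme, symmetric 8-bit quantization with one scale for the whole list. Real toolchains typically use per-group scales and other refinements. The weight values are made up.

```python
# Symmetric int8 quantization of a small list of weights (simplest variant).

def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    scale = max(abs(w) for w in weights) / 127 or 1.0  # avoid a zero scale
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.013, 1.0]  # made-up example values
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("stored integers:", q)
print("restored values:", [round(r, 4) for r in restored])
print("largest rounding error:", round(max_error, 4))
```

The small value 0.013 comes back as roughly 0.01. That rounding loss is the "little quality" being traded for a quarter of the memory of 32-bit floats.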

Maya: So the Topic Six map isn't "Llama versus everything else." It's a working bench: pick the base, choose the tuning strategy, fit the deployment envelope, test on real work, and keep re-drawing the open-versus-managed line as the models move.

Leo: And here's what we'd leave you chewing on: if you were choosing the model for that on-device assistant, where would you draw the line between owning the weights and buying frontier capability?
