William Liu · Podcasts
Two engineers at an open-model workbench compress private code and data documents into memory cards feeding a compact Flash-style model engine, with local assistant devices and tuning controls nearby.

T6E4 · Jun 18, 2026 · 00:13:57

DeepSeek-V4-Flash

Maya and Leo unpack DeepSeek-V4-Flash as an efficiency stack for million-token context: hybrid compressed attention, sparse expert activation, low-precision serving, and specialist distillation. They stage the field's real arguments — how hard to compress long-context memory, and when a local Flash-class model should escalate to a Pro-class or hosted one — and land on a managed-memory mental model for private coding and data-analysis assistants.

Subscribe

Transcript

MayaThere's a room in old newspaper buildings called the morgue — every issue the paper ever printed, going back a century. And here's the thing about a good morgue: this week's papers sit on the reading desk, full size. Everything older lives on microfiche — whole years shrunk onto a reel — with a card index telling you which reel to pull.

LeoOkay…

MayaNobody rereads the century. You keep the near past sharp, you compress the deep past, and you only pull the two reels that matter for today's question.

LeoAnd the punchline is that this is an attention mechanism.

MayaIt's today's attention mechanism. Last episode, DeepSeek-V3 showed how to stop paying for a whole giant model on every token — expert routing, a fraction of the network awake at a time. Today's report keeps that trick and attacks a different bill: what it costs to remember a million tokens of context.

LeoThe source is DeepSeek's V4 technical report. The family aims at million-token context, and Flash is the cost-oriented member — which is the one I care about, because cost is where long context usually dies.

MayaSay more, because I think listeners hear "million-token window" as pure good news.

LeoA context window is a promise that the model can look back. The looking is the expensive part. Every token the model has read leaves behind notes — keys and values — in what's called the key-value cache, the K-V cache. Naively, a million tokens of notes means every new token drags a warehouse behind it. The window is free to advertise and brutal to serve.

MayaAnd that's the squeeze on our running company — the mid-sized team building the on-device coding and data-analysis assistant. It has to hold a repository, a data schema, a year of tickets, a private incident report. Privacy says local. Cost says no giant hosted call. And the bug they're hunting might be one line in month nine of the logs.

LeoSo how does Flash shrink the warehouse?

MayaTwo compression schemes plus a desk. First, the Microfiche Room. The report's compressed sparse attention — C-S-A — takes stretches of older context and squeezes them into compact entries, then runs an indexer: for each new token, it retrieves only the handful of compressed entries that look relevant. That's the card index pulling two reels.

LeoAnd the second scheme?

MayaHeavily compressed attention — H-C-A. Same instinct, more aggressive. It compresses the context so hard that the model can afford to look at everything that's left, densely. No selection step — the reels themselves got small enough to scan end to end.

LeoDifferent layers get different schemes?

MayaThat's the design — not every layer needs the same memory behavior. And then the Reading Desk: sliding-window attention keeps a strip of the most recent context at full detail. Compressed memory for the far past, full resolution for the line beside the cursor.

LeoMm — the variable name you're about to misspell.

MayaNow the claim that makes this a headline. At a million tokens of context, the report estimates Flash uses about ten percent of the per-token compute and roughly seven percent of the K-V cache, compared with DeepSeek V3.2.

LeoEstimates being the operative word. Those are DeepSeek's numbers, under DeepSeek's setup. I'll happily believe the direction. The magnitude, your serving stack has to earn for itself.

MayaFair, the report doesn't pretend otherwise. But I'll defend the aggressive bet outright, because there's a real split in the field here. One camp — this report squarely in it — says compress hard, retrieve sparsely, shrink the cache, whatever it takes. Without compression, a million-token window is ceremonial: it exists on the spec sheet and nobody can afford to use it.

LeoThen I'll take the other camp, because they're not wrong either: compression loses things. Small things. And the workloads people actually want long context for — code, contracts, security logs — are the ones where the small thing is the whole answer. One overridden config flag in a compressed summary of month nine. The indexer doesn't pull that reel, and your assistant confidently ships the bug.

MayaThe desk catches the nearby case—

Leo—the desk catches this week. The flag was set in March. Dense attention is expensive precisely because it refuses to pre-decide what matters. The moment you compress, somebody decided March was summarizable.

MayaTake the point on what compression risks — the report's own evaluations back you up: long-context performance is strongest in the shorter bands and declines at the largest ranges on some tasks. The compressed million is not a dense million. But here's what I won't give you: the alternative. Dense attention at a million tokens isn't a quality choice the buyer gets to make — at that scale it's not on the menu at any price most teams can pay.

LeoSo the honest framing is compressed long context versus no long context. Not compressed versus dense.

MayaThat's where the evidence actually lands. And what settles the rest is boring and specific — the company runs its own needle-in-the-logs tests on its own incident reports, and finds out where the decline starts for its tasks.

Leo[chuckle] Every argument in this topic ends with "measure it on your own work."

MayaBecause every paper in this topic grades its own homework. Down the hall now — the Skeleton Crew.

LeoThis is the part Flash inherits from V3.

MayaInherits and tightens. Flash is a mixture-of-experts model: two hundred eighty-four billion total parameters, but only about thirteen billion awake for any one token. One shared expert that always works, plus a large pool of routed specialists, with a small slice picked per token.

LeoA big building, a skeleton crew on any given night. And the crew is deliberately smaller than the flagship's — the report is upfront that V4-Pro carries more total and more active parameters, and it holds more knowledge.

MayaWhich makes Flash the more interesting design question, honestly. Pro asks how good the family can get. Flash asks how much capability survives when the active budget gets tight and the memory gets huge.

LeoAnd the report's answer, for what an internal answer is worth: Flash-Base beats V3.2-Base across many of their base-model benchmarks with fewer total and fewer active parameters — gains concentrated in knowledge and long-context tasks. Same caveat as before. Their harness, their tables.

MayaNoted twice now. None of this matters, though, if the training run dies. Call this stretch the Steady Run. The models pre-train on more than thirty-two trillion tokens — long documents, code, math, multilingual text, agentic data — and at that scale, stability stops being a side quest.

LeoEasy to undersell, so let's not: expert routing, two kinds of compressed attention, a million-token target, a months-long run. Any one of those can spike a loss curve.

MayaSo the report stacks stabilizers. Muon, the optimizer — in plain terms, it shapes the large matrix updates so they behave more cleanly, converging faster with fewer tantrums.

LeoAnd the second?

MayaManifold-constrained hyper-connections — m-H-C — an upgrade to the ordinary residual stream: richer paths for carrying information forward through the network, but constrained so the signal can't run chaotic.

LeoAnd even with all of that, they hit instability anyway.

MayaThey did, and they say so. The fixes were anticipatory routing, which decouples the routing decisions from the very latest weights, and clamping extreme activation values in the feed-forward blocks — SwiGLU clamping, in the paper's terms. And then the sentence I respect most in the report: the deeper theory of why giant sparse systems spike like this is still not settled.

LeoHuh.

MayaEmpirically tamed, theoretically open.

LeoThat kind of honesty is worth more than another benchmark table.

MayaAfter pre-training comes the Apprenticeship — and it's not the post-training you'd guess. They first train specialists: separate expert models for coding, math, agent behavior, instruction following. Then on-policy distillation. The student model generates from its own behavior—

Leo—and gets graded against the specialist teachers' output distributions on those same trajectories. On-policy is the load-bearing word. The student isn't memorizing the teacher's homework — it's getting corrected on its own attempts.

MayaAnd they distill over the full vocabulary to keep it stable, which is an enormous engineering burden — teacher scheduling, hidden-state caching, custom kernels. A whole machine shop behind one training objective.

LeoThere's a product lesson buried in that for our company. Instead of shipping a base model wearing a bag of separate fine-tuned adapters, you train the specialists, then press them into one deployable student.

MayaOne model ships; many teachers stand behind it. One page left in the report tour: the Fine Print — the deployment page.

LeoMy favorite page. The model card says Flash and Flash-Base are downloadable, the context target is the full million, and the instruct model stores its expert weights in four-bit floating point — FP4.

MayaFour bits.

LeoFewer bits per weight, less memory traffic — and memory traffic is where serving cost actually hides.

MayaWith the standard caution: low precision is not free compression. It changes model behavior unless the model trained for it — which is why the pipeline includes quantization-aware training during post-training.

LeoAnd then the risk nobody prints in bold: coupling. C-S-A, H-C-A, the sliding window, compressed cache layouts, FP4 paths, expert routing, custom kernels — that's one tightly interlocked machine. You don't get the headline economics by downloading the weights. You get them by reproducing the whole machine in your serving stack.

MayaWhich sets up the last real argument. Take the build-on-Flash side — I'll press escalation.

LeoGladly, because the economics favor me. Most of what our company's assistant does all day is private long-context triage — read the repo, read the logs, find the thing, draft the fix. The binding constraints there are privacy, context cost, and serving memory.

MayaMm-hm.

LeoAnd Flash's entire design is aimed at those three constraints. You don't rent a frontier model to search your own logs.

MayaUntil the task stops being triage. The report's own family comparisons show Flash trailing Pro on knowledge-heavy and agentic work, and the gap widens as tasks get complex. The day your assistant has to plan a migration across four services instead of finding a flag, the skeleton crew comes up short — and that's DeepSeek's data, not mine.

LeoThe hard ceiling is real — the tables show it, point to you. But the default posture is mine: the expensive model earns its way into the loop task by task. It doesn't sit there as a security blanket.

MayaThen we agree on the shape and differ on the resting position: a local Flash-class model for the private, long-memory, everyday work; an escalation path to a Pro-class or hosted model for the rare task that defeats it; and your own evaluations deciding where that line sits.

LeoRoute by bottleneck, not by fear.

MayaSo here's the mental model to walk away with. A million tokens of context is not infinite memory — it's managed memory. Microfiche for the deep past, a reading desk for the near past, an index that knows which reel to pull, and a skeleton crew that keeps the lights affordable.

LeoIf your team could buy only one kind of efficiency for a local assistant — longer memory, cheaper tokens, or stronger specialist behavior — which bottleneck would actually change the product?

Source material

← Back to Mastering Language Models: From Architecture to Optimization