T7E2 · Jun 21, 2026 · 00:12:50

Mixture-of-Experts Operator Transformer for Large-Scale PDE Pre-Training

Maya and Leo follow the sparsely-gated Mixture-of-Experts idea out of language and into scientific machine learning. The paper, MoE-POT, is an operator transformer pre-trained across many families of partial differential equations — fluid flow, heat, waves — where a layer-wise router sends each chunk of a physical field to a few specialized experts plus a couple of always-on shared experts. They explain operator learning and frequency-domain layers in plain terms, perform the real tension between one specialist model per PDE family and one sparse generalist, and dig into the standout result: because physics has ground truth, the routing fingerprints can be checked against the equation family — and they line up, even on unseen physics. They close on the honest limitation: it still rests on a hand-tuned load-balancing dial and on the families that happen to live in public benchmarks.

Transcript

MayaPicture our ninety-day lab again — the wall of rented machines, one fixed budget. Only this time they're not training a model that reads. They're training a model that predicts physics: how a fluid swirls, how heat spreads through a plate, how a wave runs down a channel. And they hit a wall the language people never see — there's barely any training data, because every example has to be generated by an expensive simulation.

LeoSo the scarce resource flips. In text, data is everywhere and compute is the constraint. Here, simulated physics is the rare thing.

MayaRight — and the temptation is to give up on one model entirely. Train a little solver for fluids, a separate one for heat, a third for waves. Today's paper refuses that. It carries the sparse mixture-of-experts idea straight out of language and drops it into scientific computing — one model, pre-trained across many families of equations, where different experts quietly specialize in different physics.

LeoHuh.

MayaIt's called MoE-POT — Mixture-of-Experts Pre-training Operator Transformer — out of a group at the University of Science and Technology of China, presented at this past year's NeurIPS.

Figure 1: Left. An illustration of pre-training a PDE foundation model using extensive data fromSource: Mixture-of-Experts Operator Transformer for Large-Scale PDE Pre-Training

LeoAnd to see why that's a real move and not just a remix, we should say what we proved last time.

MayaGo ahead.

LeoLast episode was the sparsely-gated mixture-of-experts layer — Shazeer, twenty-seventeen. The trick: a per-token router that wakes only a few of many experts, so a model's parameter capacity can grow enormously without the compute-per-word growing with it. The catch was a fragile router that needs load balancing or it collapses onto a few favorites.

MayaSame machinery here. The difference is the input. Last time the experts woke up for a token of text. Today they wake up for a chunk of a physical field — a patch of a velocity grid, a slice of a temperature map — and the bet is that specialization will line up with physics instead of with words.

LeoWhich is exactly the bet I want to interrogate. So where do we start?

MayaStart with what these models even are, because "operator" is the word doing the work. A normal network learns a function — number in, number out. An operator network learns a mapping between whole functions. Give it the state of a fluid right now, the entire field, and it returns the state a moment later. It's learning the rule the equation enforces, not one trajectory.

LeoOkay.

MayaThe body is a transformer — but with a twist from a line of work called neural operators. Some layers do their mixing in the frequency domain: take the field, transform it into waves of different scales, reshape it there, transform back.

LeoWhy the detour through frequencies?

MayaBecause physics lives there. Smooth, large-scale structure is low frequency; sharp fronts and turbulence are high frequency. That space lets one layer touch the whole field at once instead of creeping pixel by pixel.

LeoSo you reshape the physics, not the pixels.

MayaRight. The lineage here is a model called DPOT — a denoising operator transformer — and MoE-POT takes that backbone and makes the feed-forward parts sparse.

LeoSo the skeleton is "operator transformer," and the new organ is the mixture-of-experts block.

MayaExactly the swap. Most frequency-and-attention layers stay dense, but the feed-forward blocks become an expert pool. For any patch of physics flowing through, a router-gating network looks and picks a handful of experts to actually run.

LeoGive me the real numbers, not a hand-wave.

MayaSixteen routed experts in the pool. For any given input the router lights up four of them. And then — this is the piece that maps onto last week — there are also two shared experts that run every single time, no matter what.

LeoTwo always-on, four chosen from sixteen. I've seen that shape before. The shared expert is the generalist desk in the lobby; the routed four are the specialists upstairs.

MayaThat's the design intent, almost word for word. The shared experts capture what all partial differential equations have in common — the universal grammar of how fields evolve. The routed experts are free to specialize, because the common stuff is already handled.

LeoNice.

MayaA division of labor. Without the shared experts, every routed expert wastes capacity re-learning the basics. With them, the routed four can afford to be narrow.

LeoHold on, though — narrow on what? "Specialize" is a word we love to throw at mixture-of-experts and then never check. Last time the experts supposedly specialized on tokens and the truth was murkier than the marketing. So when you tell me an expert specializes in "heat" — does it? Or is it splitting the work by some statistical accident that has nothing to do with physics?

MayaThat's the best part of the paper, and they actually looked. They probed the router's decisions and asked a blunt question: from the routing pattern alone — just which experts got chosen — can you tell which equation family the input came from?

LeoAnd?

MayaThey could classify the source dataset from the routing fingerprint with high accuracy — well above chance. The router isn't smearing the work randomly. It sends fluid-flow inputs to one signature set of experts and heat-diffusion inputs to another, reliably enough that the choice itself reveals the physics.

LeoOkay, that's a stronger result than I expected. The router learned to sort by regime without ever being told the labels.

MayaAnd it held up on equation families it hadn't been trained on — out-of-distribution physics still got routed coherently.

LeoWait, unseen families too?

MayaUnseen families too. So the specialization isn't memorizing six datasets. It's carving the space of physics into regions and assigning experts to them.

LeoThat's the line that would've made me skeptical and now doesn't. In language, "this expert handles punctuation-ish tokens" is a shrug. Here, "this expert handles diffusive regimes" is a claim you can check against the math — and it checks out.

MayaWhich is the deep reason this transfer is more than a gimmick. The heterogeneity that makes one-model-for-all-physics hard — fluids and heat and waves behaving by genuinely different rules — is the exact thing sparse routing is built to exploit. Different rules, different experts.

LeoSo let me put the tension on the table sharply, because this is where I'd push a researcher. The conservative move is one specialized model per equation family. Clean. Each one only has to be good at its own physics. Why is the sparse generalist better than that?

MayaBecause the per-family approach drowns in the data problem we opened with. Each little specialist starts from scratch and only ever sees its own scarce, expensive simulations. It can never borrow the structure that fluids and heat actually share.

LeoBut the dense generalist — one model jamming all the physics into shared weights — has the opposite disease. I'd expect negative transfer. You add a new, weird equation family to the training mix and the model gets worse at the ones it already knew, because everything fights over the same parameters.

MayaAnd that's precisely the failure they measured against. For a dense model, piling on more heterogeneous physics does drag performance down — the families interfere. For the sparse one—

Leo—the interference has somewhere to go.

MayaIt has somewhere to go. Add more diverse physics and the sparse model holds steady or improves, because new regimes can claim their own experts instead of trampling the old ones. You get the generalist's shared knowledge and the specialist's protected capacity, in one model.

LeoMm. That's the resolution, then.

MayaThat's the resolution. It's not "sparse is magic." It's that separating capacity from compute lets one model carry many physics without the families having to share every weight.

LeoSo spend our lab's budget on that and what do we get back? Give me the headline the paper actually stands behind.

MayaA model with about ninety million activated parameters matched or beat a strong dense baseline activating roughly a hundred twenty million — and on zero-shot error, the tasks it never trained on, the reduction reached up to forty percent.

LeoSay what "activated" is doing in that sentence, because it's the whole point.

MayaThe same separation as last week. The full model is far larger than ninety million — they scaled the total pool toward the half-billion range. But only the shared experts plus the chosen four run for any input. So the compute you pay is set by activated parameters, not total. Less active compute, lower error.

LeoSmaller bill, better generalization. On the same wall of machines, the lab gets a model that transfers to physics it never saw. That's not a small thing in a field where every new training example costs a simulation.

MayaAnd it pre-trained across six public collections of physics data — drawn from the standard operator-learning and fluid-dynamics benchmarks — so the breadth is real, not one tidy dataset.

LeoNow I get to do my job. Where's the catch? Last week's router was fragile — does this one inherit the fragility?

MayaIt inherits exactly that risk, and they treat it as a first-class problem. If you don't restrain the router, four experts hog everything and the other twelve go dark — you paid for sixteen and trained four. So they add a balancing penalty built around how evenly the experts get used, without forcing it so hard that specialization gets crushed.

LeoThe same knife-edge.

MayaThe same knife-edge as the language models. Balance too little, the router collapses. Balance too hard, you flatten the specialization that the whole physics story depends on. They land in the middle, but it's a tuned compromise, not a solved problem.

LeoAnd that's the honest limitation. The interpretability result is gorgeous — but it rests on a balancing dial someone had to set by hand, and that dial would have to be re-found for a new mix of physics. It's persuasive evidence, not a closed case.

MayaFair. And a quieter caveat: "many families of physics" still means the families in public benchmarks. Real engineering throws coupled, messy systems that look nothing like the curated datasets. Whether the routing keeps carving physics so cleanly out there is genuinely open.

LeoSo the takeaway isn't "physics is solved." It's that the sparse-routing idea survived the jump from language to scientific computing — and not only survived, it got a check language never offered.

MayaThat's the through-line for me. Last week the mixture-of-experts layer let a text model hold more capacity than it spends per word. This week the same layer let one operator model hold many physics at once — and because physics has ground truth, we could watch the experts sort themselves by regime and confirm the specialization was real.

LeoOne router, one budget, many equations — same machinery as the language models, new domain, and a rare moment where you can verify the experts learned what you hoped.

MayaSo here's what I'd leave you turning over. If a single sparse router can carve fluids from heat from waves on its own, where else does your problem secretly split into regimes — and would one routed generalist beat the pile of specialists you're maintaining today?

Source material

← Back to Mastering Language Models: From Architecture to Optimization