We propose a framework that uses modern machine learning to construct stochastic closure models. Our approach combines conditional diffusion models with the Fourier Neural Operator (FNO) to capture the missing dynamics in complex systems governed by partial differential equations (PDEs).
Complex dynamical systems, from turbulent atmospheric flows to ocean circulation and materials under stress, span an enormous range of spatial and temporal scales. Directly resolving every scale in numerical simulation is computationally prohibitive, and will remain so for the foreseeable future. Classical closure models address this challenge by parameterizing the effects of unresolved (subgrid) scales on the resolved ones. However, traditional closure models are typically deterministic and local in character; these assumptions break down precisely in the most challenging regimes, where resolved and unresolved scales interact in a highly nonlinear, non-local, and non-Markovian manner.
This work introduces a data-driven framework for constructing stochastic and non-local closure models by marrying two powerful and complementary advances in machine learning: score-based conditional diffusion models and Fourier Neural Operators (FNOs). The resulting framework can learn the full conditional probability distribution of the unknown closure term, naturally capturing uncertainty, while remaining resolution-invariant and capable of modeling long-range spatial interactions. To make the approach practical in numerical simulations, we also demonstrate an accelerated sampling strategy that reduces the inference cost by up to two orders of magnitude.
Consider a dynamical system whose general governing equation is:
\[\frac{\partial v}{\partial t} = M(v),\]where $v$ represents the full system state and $M$ encapsulates the dynamics derived from first principles, potentially involving highly nonlinear mappings, differential operators, and integral operators. In practice, numerically resolving every detail of $v$ is infeasible, motivating the use of a reduced-order system:
\[\frac{\partial V}{\partial t} = \widetilde{M}(V),\]where $V := \mathcal{K}(v)$ is a reduced-order representation of $v$, and $\widetilde{M}(V)$ approximates the reduced dynamics $\mathcal{K} \circ M(v)$. The inevitable mismatch between $\widetilde{M}(V)$ and the true dynamics gives rise to a closure term $U$, so that
\[\frac{\partial V}{\partial t} = \widetilde{M}(V) + U,\]where $U$ corrects for the influence of unresolved scales. When the system lacks a clear scale separation, $U$ can depend on the current state $V$ in a nonlinear, non-local fashion and may exhibit memory effects due to non-Markovian dynamics. These features render classical deterministic, local closures insufficient, and motivate a stochastic, data-driven approach.
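As a concrete (and deliberately simplified) illustration, the reduction operator $\mathcal{K}$ can be taken as spectral truncation onto the lowest Fourier modes. The function name, grid sizes, and random stand-in state below are all hypothetical, not the paper's setup:

```python
import numpy as np

def spectral_truncate(v, n_coarse):
    """A hypothetical reduction operator K: keep only the lowest
    n_coarse x n_coarse Fourier modes of a periodic 2-D field."""
    n = v.shape[0]
    vh = np.fft.fftshift(np.fft.fft2(v))
    c, h = n // 2, n_coarse // 2
    vh_c = vh[c - h:c + h, c - h:c + h]
    # rescale so function values are preserved on the coarse grid
    return np.real(np.fft.ifft2(np.fft.ifftshift(vh_c))) * (n_coarse / n) ** 2

rng = np.random.default_rng(0)
v = rng.standard_normal((256, 256))   # stand-in for a fine-grid state
V = spectral_truncate(v, 64)          # reduced-order representation V = K(v)
# Given solvers M and M_tilde, closure training data would be assembled as
# U = spectral_truncate(M(v), 64) - M_tilde(V).
```

With ground-truth trajectories of $v$, evaluating this residual along the trajectory produces the dataset of closure samples $U$ that the diffusion model is trained on.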
At the heart of this framework lies the variance-exploding stochastic differential equation (VE-SDE)
\[dU = \sqrt{\frac{d[\sigma^2(t)]}{dt}}\, dW,\]
where $\sigma(t)$ is a noise schedule that grows from nearly zero to a large value, and $W$ is a standard Wiener process. After running this process until time $T$, the data is fully corrupted into near-Gaussian noise with variance $\sigma^2(T) \gg 1$.
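Because the forward process is Gaussian, corrupting a sample needs no SDE integration at all. A minimal sketch, assuming a geometric noise schedule (one common VE choice; the cutoffs `sigma_min` and `sigma_max` are illustrative):

```python
import numpy as np

sigma_min, sigma_max = 0.01, 50.0   # illustrative cutoffs, not the paper's values

def sigma(t):
    """Geometric noise schedule on t in [0, 1]."""
    return sigma_min * (sigma_max / sigma_min) ** t

def perturb(U0, t, rng):
    """Sample from the transition kernel p_0t(. | U_0) = N(U_0, sigma(t)^2 I)."""
    return U0 + sigma(t) * rng.standard_normal(U0.shape)

rng = np.random.default_rng(0)
U0 = rng.standard_normal((64, 64))   # stand-in for a clean closure field
UT = perturb(U0, 1.0, rng)           # at t = T the signal is swamped by noise
```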
Generation then proceeds by reversing this diffusion. The exact reverse SDE is
\[dU = -\frac{d[\sigma^2(t)]}{dt}\, \nabla_U \log p_t(U)\, dt + \sqrt{\frac{d[\sigma^2(t)]}{dt}}\, d\bar{W},\]where $\nabla_U \log p_t(U)$ is the score function — the gradient of the log density at noise level $t$ — and $\bar{W}$ is a reverse-time Wiener process. Starting from Gaussian noise and integrating this reverse SDE produces samples from $p_\text{data}$.
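The reverse SDE can be integrated with a simple Euler-Maruyama scheme. The sketch below substitutes a toy Gaussian target, whose score is known in closed form, for the learned network $s_\theta$; the geometric schedule and all constants are assumptions for illustration only:

```python
import numpy as np

sigma_min, sigma_max = 0.01, 50.0
s_data = 1.0  # std of the toy Gaussian "data" distribution

def sigma(t):
    return sigma_min * (sigma_max / sigma_min) ** t

def g2(t):
    """d[sigma^2(t)]/dt for the geometric schedule."""
    return 2.0 * np.log(sigma_max / sigma_min) * sigma(t) ** 2

def score(U, t):
    # Exact score of p_t when p_data = N(0, s_data^2 I); stands in for s_theta.
    return -U / (s_data ** 2 + sigma(t) ** 2)

def reverse_sde_sample(shape, n_steps=500, rng=None):
    """Euler-Maruyama integration of the reverse VE-SDE from t = 1 down to t = 0."""
    rng = rng or np.random.default_rng(0)
    ts = np.linspace(1.0, 0.0, n_steps + 1)
    U = sigma(1.0) * rng.standard_normal(shape)   # start from wide Gaussian noise
    for t0, t1 in zip(ts[:-1], ts[1:]):
        dt = t1 - t0                              # negative: time runs backward
        U = (U - g2(t0) * score(U, t0) * dt
             + np.sqrt(g2(t0) * abs(dt)) * rng.standard_normal(shape))
    return U
```

With the exact score, the final samples recover the unit-variance target, which is the behavior the learned sampler is trained to reproduce for $p(U)$.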
The score function is approximated by a neural network $s_\theta(U, t)$ trained with the denoising score matching objective:
\[\mathcal{L}(\theta) = \mathbb{E}_{t, U_0, \epsilon}\!\left[\lambda(t)\,\bigl\lVert s_\theta(U_t, t) - \nabla_{U_t}\log p_{0t}(U_t \mid U_0)\bigr\rVert^2\right],\]where $\epsilon \sim \mathcal{N}(0, I)$, $U_t = U_0 + \sigma(t)\epsilon$, and $\lambda(t)$ is a weighting function. Because the transition kernel gradient simplifies to $\nabla_{U_t}\log p_{0t}(U_t\mid U_0) = -(U_t - U_0)/\sigma^2(t)$, training reduces to predicting the noise direction from a corrupted sample.
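The simplification of the transition-kernel gradient can be checked numerically; `dsm_loss` below is a hypothetical single-sample version of the objective, not the paper's training code:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma_t = 2.5
U0 = rng.standard_normal((8, 8))
eps = rng.standard_normal((8, 8))
Ut = U0 + sigma_t * eps

# The Gaussian transition kernel's score reduces to the scaled noise direction:
target = -(Ut - U0) / sigma_t ** 2

def dsm_loss(s_pred, Ut, U0, sigma_t, lam=1.0):
    """Monte-Carlo estimate of the weighted DSM objective for one sample."""
    return lam * np.mean((s_pred + (Ut - U0) / sigma_t ** 2) ** 2)
```

A perfect score prediction, i.e. `s_pred = -eps / sigma_t`, drives this loss to zero.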
Closure terms do not live in a vacuum: the closure $U$ at a given simulation time typically depends on the resolved state $V$ and its recent history. We collect all such side information into a conditioning vector $\mathrm{y}$ (e.g., the current vorticity field, or sparse observations of $U$) and train a conditional score network $s_\theta(U_t, t, \mathrm{y})$.
The conditional training objective is
\[\mathcal{L}(\theta) = \mathbb{E}_{t, U_0, \epsilon}\!\left[\lambda(t)\,\bigl\lVert s_\theta(U_t, t, \mathrm{y}) + \frac{\epsilon}{\sigma(t)}\bigr\rVert^2\right].\]An important implementation insight is that the diffusion only needs to run up to a finite terminal time $T < \infty$ such that $\sigma(T) \gg \sigma_\text{data}$, rather than $T \to \infty$. Smaller $T$ shortens the reverse trajectory and thus speeds up sampling, which is essential for online use inside a simulation loop.
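To make the speed-up intuition concrete, here is a back-of-the-envelope count of reverse steps, under the illustrative assumption that each step reduces the noise level by a fixed ratio (all constants are hypothetical):

```python
import numpy as np

def n_steps_geometric(sigma_max, sigma_min=0.01, ratio_per_step=2.0):
    """Reverse steps needed if each step halves the noise level (illustrative)."""
    return int(np.ceil(np.log(sigma_max / sigma_min) / np.log(ratio_per_step)))

# A low-variance closure distribution needs a much smaller sigma(T), and hence
# a logarithmically shorter reverse trajectory, than image-scale data:
n_image = n_steps_geometric(sigma_max=80.0)
n_closure = n_steps_geometric(sigma_max=1.0)
```

The trajectory length shrinks only logarithmically in $\sigma_\text{max}$, but combined with an adaptive step schedule this is exactly what enables few-step sampling inside a simulation loop.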
The score network $s_\theta$ must map continuous spatiotemporal fields, not fixed-size images, to score fields of the same form. This motivates replacing conventional CNNs with a Fourier Neural Operator (FNO), whose layers take the form
\[v_{l+1}(x) = \sigma\!\left(W v_l(x) + \mathcal{F}^{-1}\!\bigl(R_\phi \cdot \mathcal{F}(v_l)\bigr)(x)\right),\]where $W$ is a pointwise linear operator, $R_\phi$ are learnable complex weight matrices acting on the low-frequency Fourier modes, and $\sigma$ is a pointwise nonlinearity (overloading the noise-schedule symbol). The key advantages for closure modeling are:
Resolution invariance. Because FNO operates in frequency space, a model trained on a $64\times64$ grid can be directly evaluated at $128\times128$ or $256\times256$ without retraining — critical for multi-resolution production simulations.
Non-local receptive field. A single Fourier layer already has a global receptive field, capturing long-range correlations that local convolutional filters miss. This is essential when closure depends on the entire vorticity pattern rather than a small patch.
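A minimal one-dimensional sketch of a Fourier layer, using random stand-in weights, also demonstrates resolution invariance: the same spectral weights apply unchanged at 64 and 128 grid points. (The real architecture is 2-D with channel mixing; everything here is illustrative.)

```python
import numpy as np

def fourier_layer(v, R, W, modes=8):
    """One toy FNO layer on a periodic 1-D field v of shape (n_grid,):
    spectral convolution on low modes + pointwise linear path + nonlinearity."""
    n = v.shape[0]
    vh = np.fft.rfft(v)
    out_h = np.zeros_like(vh)
    out_h[:modes] = R * vh[:modes]            # weights act on low modes only
    spectral = np.fft.irfft(out_h, n=n)
    return np.maximum(W * v + spectral, 0.0)  # ReLU nonlinearity

rng = np.random.default_rng(0)
R = rng.standard_normal(8) + 1j * rng.standard_normal(8)  # complex spectral weights
W = 0.5                                                   # pointwise linear weight

# Resolution invariance: the SAME weights evaluate at any grid size, and the
# outputs agree at coincident grid points for a band-limited input.
x64 = np.linspace(0, 2 * np.pi, 64, endpoint=False)
x128 = np.linspace(0, 2 * np.pi, 128, endpoint=False)
y64 = fourier_layer(np.sin(x64), R, W)
y128 = fourier_layer(np.sin(x128), R, W)
```

Because `rfft`/`irfft` normalization cancels, the spectral path computes the same continuum function regardless of grid size, which is the mechanism behind the train-at-$64$, test-at-$256$ results later in the document.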
In practice, the architecture uses three parallel input pipelines: (i) the noisy closure sample $U_t$, (ii) the conditioning vorticity field $V$, and (iii) sparse observations of the true closure term. Each is passed through a separate FNO encoder before being fused with the embedding of the noise level $t$.
Vanilla diffusion samplers require hundreds of steps, making them too slow for per-timestep use inside a simulation loop. This work introduces two complementary accelerations:
Adaptive time-step schedule. Rather than uniform steps, the reverse SDE uses the noise-level schedule
\[\sigma_i = \left(\sigma_\text{max}^{1/\rho} + \frac{i}{N-1}\left(\sigma_\text{min}^{1/\rho} - \sigma_\text{max}^{1/\rho}\right)\right)^\rho, \quad i = 0, 1, \ldots, N-1,\]with $\rho = 7$. This geometric-like spacing concentrates steps near $\sigma_\text{min}$, where the score changes most rapidly, yielding better accuracy per step.
Reduced terminal time $T$. Because the closure distribution $p(U)$ has a much smaller variance than a standard image dataset, the forward process reaches “pure noise” much sooner. Exploiting this with a smaller $T$ (and correspondingly smaller $\sigma_\text{max}$) reduces the number of reverse steps needed by roughly an order of magnitude, enabling sub-10-step sampling without quality loss.
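The adaptive noise-level schedule translates directly into code. The default cutoffs below are common for this schedule family and are an assumption, not necessarily the paper's values:

```python
import numpy as np

def noise_schedule(n_steps, sigma_min=0.002, sigma_max=80.0, rho=7.0):
    """Noise levels sigma_i per the interpolation formula above
    (i = 0 is the largest level; defaults are illustrative)."""
    i = np.arange(n_steps)
    return (sigma_max ** (1 / rho)
            + i / (n_steps - 1) * (sigma_min ** (1 / rho) - sigma_max ** (1 / rho))
            ) ** rho

sig = noise_schedule(10)
# the sequence decreases from sigma_max to sigma_min, with levels bunching
# together at the small-sigma end where the score varies fastest
```

Shrinking `sigma_max` to match a low-variance closure distribution then shortens this ladder without changing its shape.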
[Interactive animation: a slider sweeps the reverse diffusion from $t = T$, where the field is pure Gaussian noise, to $t = 0$, where structured vortex patterns reminiscent of a 2-D turbulent flow gradually emerge.]
The framework is validated on the 2-D Navier–Stokes equations in vorticity form:
\[\frac{\partial \omega}{\partial t} + u \cdot \nabla\omega = \nu\Delta\omega + f,\]where $\omega = \nabla \times u$ is the vorticity, $\nu = 10^{-3}$ is the kinematic viscosity, and $f$ is a fixed external forcing. The computational domain is $[0, 2\pi]^2$ with periodic boundary conditions. A fine-grid ($256\times256$) pseudo-spectral solver provides ground-truth trajectories; coarse-grid ($64\times64$) models must compensate via stochastic closure terms.
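As a sanity check of the spectral machinery, the vorticity $\omega = \nabla \times u$ can be computed from a velocity field by FFT differentiation on the periodic domain. This is a toy verification, not the paper's pseudo-spectral solver; the Taylor-Green-like test field is illustrative:

```python
import numpy as np

def vorticity(u, v):
    """omega = dv/dx - du/dy on a periodic [0, 2*pi]^2 grid, via FFT."""
    n = u.shape[0]
    k = np.fft.fftfreq(n, d=1.0 / n)   # integer wavenumbers for a 2*pi domain
    kx = k[None, :]                    # x varies along axis 1
    ky = k[:, None]                    # y varies along axis 0
    dv_dx = np.real(np.fft.ifft2(1j * kx * np.fft.fft2(v)))
    du_dy = np.real(np.fft.ifft2(1j * ky * np.fft.fft2(u)))
    return dv_dx - du_dy

# Taylor-Green-like test field: u = cos(x) sin(y), v = -sin(x) cos(y)
n = 64
x = np.linspace(0, 2 * np.pi, n, endpoint=False)
X, Y = np.meshgrid(x, x)               # X varies along axis 1, Y along axis 0
u = np.cos(X) * np.sin(Y)
v = -np.sin(X) * np.cos(Y)
omega = vorticity(u, v)                # analytically: -2 cos(x) cos(y)
```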
Two physically distinct closure scenarios are studied:
Scenario G (stochastic diffusion). The closure term $G$ represents a spatially varying, stochastic viscous-diffusion correction. Because it is linear in $\omega$ and spatially local, a relatively simple conditioning strategy works well: conditioning on the vorticity field alone yields a Frobenius-norm relative error of $\sim 30\%$, while adding sparse observations of $G$ (just $5^2 = 25$ grid points) reduces this to 8.2%.
Scenario H (nonlinear convection). The convection closure $H$ is nonlinear and non-local: it depends on global vorticity patterns, which makes it significantly harder to model. The best conditional model (vorticity + sparse $H$ observations) achieves a relative error of 12.7%, compared to $\sim 45\%$ without any closure. The FNO's global receptive field is critical here: purely local architectures fail to capture the long-range nature of $H$.
A defining feature of FNO-based models is that they can be queried at any resolution after training at $64\times64$. The table below summarizes the Frobenius-norm relative errors $D_\text{Fro}$ across resolutions for Scenario G:
| Training res. | Test res. | $D_\text{Fro}$ |
|---|---|---|
| $64\times64$ | $64\times64$ | 8.2% |
| $64\times64$ | $128\times128$ | 9.1% |
| $64\times64$ | $256\times256$ | 9.8% |
Performance degrades gracefully — less than 2 percentage points across a 4× increase in linear resolution — confirming that no retraining is needed when the simulation grid is refined.
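Taking $D_\text{Fro}$ to be the standard relative Frobenius-norm error (an assumption based on its name), the metric can be sketched as:

```python
import numpy as np

def d_fro(U_gen, U_true):
    """Assumed metric: ||U_gen - U_true||_F / ||U_true||_F."""
    return np.linalg.norm(U_gen - U_true) / np.linalg.norm(U_true)

rng = np.random.default_rng(0)
U_true = rng.standard_normal((64, 64))
U_gen = U_true + 0.1 * rng.standard_normal((64, 64))
err = d_fro(U_gen, U_true)   # roughly 0.1 for 10% additive noise
```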
The energy spectra of generated closure fields likewise match the ground truth across all tested resolutions: the expected $k^{-3}$ enstrophy cascade is preserved, and no spurious energy pile-up appears at the grid cutoff frequency.
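A radially binned energy spectrum, as used for such diagnostics, can be sketched as follows; the shell-binning convention is an assumption, and the white-noise input is only a placeholder for a vorticity field:

```python
import numpy as np

def energy_spectrum(field):
    """Radially binned power spectrum E(k) of a periodic 2-D field."""
    n = field.shape[0]
    fh = np.fft.fft2(field) / n ** 2          # normalized Fourier coefficients
    power = np.abs(fh) ** 2
    k = np.fft.fftfreq(n, d=1.0 / n)          # integer wavenumbers
    kmag = np.sqrt(k[None, :] ** 2 + k[:, None] ** 2)
    kbins = np.arange(0.5, n // 2, 1.0)       # shells [k - 0.5, k + 0.5)
    E = np.array([power[(kmag >= lo) & (kmag < lo + 1.0)].sum() for lo in kbins])
    return kbins + 0.5, E

rng = np.random.default_rng(0)
k, E = energy_spectrum(rng.standard_normal((64, 64)))
```

Plotting $E(k)$ on log-log axes for generated versus ground-truth closures is how slope preservation (e.g. the $k^{-3}$ range) and spurious pile-up at the grid cutoff would be checked.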