Estimate the score: $s_\theta(x) \approx \nabla_x \log p(x)$.
MLE needs the intractable normalizing constant, but the score does not. So minimize the score error directly:
$$ \mathbb{E}_{p(x)} \left[ \| s_\theta(x) - \nabla_x \log p(x) \|^2 \right] $$
But we don't know the true score $\nabla_x \log p(x)$. Integration by parts turns this into an objective that involves only the model:
$$ \mathcal{L}_{\text{SM}}(\theta) = \mathbb{E}_{p(x)} \left[ \operatorname{Tr}(\nabla_x s_\theta(x)) + \frac{1}{2} \| s_\theta(x) \|^2 \right] $$
Key Insight:
We estimate the score without ever needing the true score $\nabla_x \log p(x)$ or the normalizing constant.
Limitation: the objective is unstable in high dimensions, and the data distribution may be too sharp for the score to be learned reliably.
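As a concrete illustration, here is a minimal PyTorch sketch of this objective, assuming a generic score network `score_net` mapping `[B, D] -> [B, D]`; the exact Jacobian trace is replaced by a Hutchinson estimator, since computing it exactly needs $D$ backward passes:

```python
import torch

def sm_loss(score_net, x, n_probes=1):
    """Hyvarinen score matching with a Hutchinson estimate of Tr(grad_x s) (sketch)."""
    x = x.detach().requires_grad_(True)
    s = score_net(x)                                        # [B, D]
    trace = x.new_zeros(x.size(0))
    for _ in range(n_probes):
        # Rademacher probe v: Tr(J) is estimated as E_v[v^T J v]
        v = torch.randint_like(x, low=0, high=2) * 2 - 1
        (jv,) = torch.autograd.grad((s * v).sum(), x, create_graph=True)
        trace = trace + (jv * v).sum(dim=1)
    trace = trace / n_probes
    return (trace + 0.5 * (s ** 2).sum(dim=1)).mean()
```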
Add Gaussian noise: $\tilde{x} = x + \epsilon$, $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$.
Define the noisy (smoothed) distribution: $q_\sigma(\tilde{x}) = \int p(x)\, \mathcal{N}(\tilde{x};\, x, \sigma^2 I)\, dx$.
Train $s_\theta$ against the conditional score $\nabla_{\tilde{x}} \log q_\sigma(\tilde{x} \mid x) = \frac{x - \tilde{x}}{\sigma^2}$:
$$ \mathcal{L}_{\text{DSM}} = \mathbb{E}_{x, \tilde{x}} \left[ \left\| s_\theta(\tilde{x}) - \frac{x - \tilde{x}}{\sigma^2} \right\|^2 \right] $$
Why It Works:
Minimizing this makes $s_\theta(\tilde{x}) \approx \nabla_{\tilde{x}} \log q_\sigma(\tilde{x})$, the score of the smoothed distribution; the denoising identity (Vincent, 2011) guarantees both objectives share the same minimizer.
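A minimal sketch of this loss, assuming a generic `score_net` and a fixed scalar `sigma`:

```python
import torch

def dsm_loss(score_net, x, sigma):
    """Denoising score matching at a single, fixed noise level (sketch)."""
    eps = torch.randn_like(x) * sigma
    x_tilde = x + eps
    target = (x - x_tilde) / sigma ** 2   # = -eps / sigma^2, the DSM target
    s = score_net(x_tilde)                # [B, D]
    return ((s - target) ** 2).sum(dim=1).mean()
```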
A fixed $\sigma$ captures only one level of smoothing. Train on many noise levels $\{\sigma_i\}$ with a noise-conditioned score network $s_\theta(\tilde{x}, \sigma)$:
$$ \mathcal{L}_{\text{NCSN}} = \mathbb{E}_{x, \sigma, \tilde{x}} \left[ \lambda(\sigma) \left\| s_\theta(\tilde{x}, \sigma) - \frac{x - \tilde{x}}{\sigma^2} \right\|^2 \right] $$
Why It Works:
Same principle as DSM, just applied across all noise levels $\sigma$.
Allows learning the multi-scale structure of the data.
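A sketch of the multi-level objective, assuming a noise-conditioned network `score_net(x_tilde, sigma)` and a 1-D tensor `sigmas` of noise levels; the weighting $\lambda(\sigma) = \sigma^2$ follows the equation above:

```python
import torch

def ncsn_loss(score_net, x, sigmas):
    """Multi-scale DSM: one noise level sampled per example (sketch)."""
    sigmas = sigmas.to(x.device)
    idx = torch.randint(0, len(sigmas), (x.size(0),), device=x.device)
    sigma = sigmas[idx].unsqueeze(1)              # [B, 1]
    eps = torch.randn_like(x) * sigma
    x_tilde = x + eps
    target = (x - x_tilde) / sigma ** 2           # DSM target at this noise level
    s = score_net(x_tilde, sigma)                 # sigma-conditioned score, [B, D]
    weight = sigma.squeeze(1) ** 2                # lambda(sigma) = sigma^2
    return (weight * ((s - target) ** 2).sum(dim=1)).mean()
```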
Instead of learning just the score, learn a scalar (negative) log-density:
$$
f_\theta(\tilde{x}, \sigma) \approx -\log q_\sigma(\tilde{x})
$$
If $f_\theta \approx -\log q_\sigma$, then $\nabla_{\tilde{x}} f_\theta(\tilde{x}, \sigma) \approx -\nabla_{\tilde{x}} \log q_\sigma(\tilde{x})$, i.e., the gradient of $f_\theta$ is the negated score. So use DSM-like score supervision on that gradient:
$$ \mathcal{L}_{\text{MULDE}} = \mathbb{E} \left[ \left\| \nabla_{\tilde{x}} f_\theta(\tilde{x}, \sigma) + \frac{x - \tilde{x}}{\sigma^2} \right\|^2 + \beta f_\theta(x, \sigma)^2 \right] $$
Why It Works:
- First term: makes the gradient of $f_\theta$ match the DSM target, so $f_\theta$ integrates the score field into a scalar potential.
- Second term: regularizes $f_\theta$ to behave like a log-density, keeping its values comparable across $\sigma$ scales.
Method | Learns | Loss Derived From | Core Equation |
---|---|---|---|
SM | $s_\theta(x) \approx \nabla_x \log p(x)$ | Score error + integration by parts | $\mathcal{L}_{\text{SM}}$ |
DSM | $s_\theta(\tilde{x}) \approx \nabla_{\tilde{x}} \log q_\sigma(\tilde{x})$ | Denoising identity (Vincent) | $\mathcal{L}_{\text{DSM}}$ |
NCSN | $s_\theta(\tilde{x}, \sigma)$ for all $\sigma$ | Multi-scale DSM | Same as DSM but over all $\sigma$ |
MULDE | $f_\theta(\tilde{x}, \sigma) \approx -\log q_\sigma(\tilde{x})$ | Integrate score into scalar | $\mathcal{L}_{\text{MULDE}}$ |
Here is a technical breakdown of what you need to implement and modify MULDE in PyTorch, grounded in its mathematical core. This guide focuses on the core architecture and training loop; feature extraction and the dataset are modular.
We want to approximate the (negative) log-density of the noise-smoothed feature distribution with a scalar network, $f_\theta(\tilde{x}, \sigma) \approx -\log q_\sigma(\tilde{x})$.
We achieve this by:
- Adding Gaussian noise to clean data $x$
- Training $f_\theta$ such that $\nabla_{\tilde{x}} f_\theta(\tilde{x}, \sigma) \approx -\frac{x - \tilde{x}}{\sigma^2}$
- Adding a regularization term $f_\theta(x, \sigma)^2$ to align predictions across scales
The full MULDE loss (Eq. 6 in the paper):
$$ \mathcal{L}(\theta) = \mathbb{E}_{x, \tilde{x}, \sigma} \left[ \lambda(\sigma) \left\| \nabla_{\tilde{x}} f_\theta(\tilde{x}, \sigma) + \frac{x - \tilde{x}}{\sigma^2} \right\|^2 + \beta f_\theta(x, \sigma)^2 \right] $$
Where:
- $x \sim p(x)$: clean features
- $\tilde{x} = x + \epsilon,\ \epsilon \sim \mathcal{N}(0, \sigma^2 I)$: noisy features
- $\sigma \sim \text{log-uniform}[\sigma_\text{low}, \sigma_\text{high}]$: sampled noise level
- $\lambda(\sigma) = \sigma^2$: scale-dependent weighting
- $f_\theta$: scalar-output neural network
- $\nabla_{\tilde{x}} f_\theta$: gradient w.r.t. the input $\tilde{x}$
import torch
import torch.nn as nn
import torch.nn.functional as F
class MULDE(nn.Module):
def __init__(self, input_dim, hidden_dim=4096):
super().__init__()
self.fc1 = nn.Linear(input_dim + 1, hidden_dim) # +1 for sigma
self.fc2 = nn.Linear(hidden_dim, hidden_dim)
self.fc3 = nn.Linear(hidden_dim, 1) # output: scalar f_theta
def forward(self, x, sigma):
# x: [B, D], sigma: [B, 1]
sigma_log = sigma.log() # log-sigma conditioning
h = torch.cat([x, sigma_log], dim=1)
h = F.gelu(self.fc1(h))
h = F.gelu(self.fc2(h))
f = self.fc3(h) # [B, 1]
return f
def mulde_loss(model, x, sigma, beta):
"""
x: [B, D] - clean features
sigma: [B, 1] - sampled noise levels
"""
# 1. Add Gaussian noise
eps = torch.randn_like(x) * sigma
x_tilde = x + eps
    x_tilde.requires_grad_(True)  # only the noisy input needs a gradient
# 2. Forward pass for noisy and clean
f_noisy = model(x_tilde, sigma) # [B, 1]
f_clean = model(x, sigma) # [B, 1]
# 3. Gradient w.r.t. noisy input
grad_outputs = torch.ones_like(f_noisy)
grads = torch.autograd.grad(
outputs=f_noisy, inputs=x_tilde,
grad_outputs=grad_outputs,
create_graph=True, retain_graph=True
)[0] # [B, D]
    # 4. Gradient target from Eq. 6: since f_theta ≈ -log q_sigma, its gradient
    #    must match -(x - x_tilde) / sigma^2 = +eps / sigma^2
    target = eps / (sigma ** 2)  # [B, D]
# 5. Loss terms
score_loss = ((grads - target) ** 2).sum(dim=1) # [B]
reg_loss = beta * (f_clean ** 2).squeeze(1) # [B]
# 6. Noise weighting
weight = sigma.squeeze(1) ** 2
loss = (weight * score_loss + reg_loss).mean()
return loss
model = MULDE(input_dim=512).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)
beta = 0.1
sigma_low, sigma_high = 1e-3, 1.0
for epoch in range(num_epochs):
for x in dataloader: # x: [B, D]
x = x.to(device)
B = x.size(0)
# Sample log-uniform σ
u = torch.rand(B, 1).to(device)
sigma = sigma_low * (sigma_high / sigma_low) ** u # log-uniform
loss = mulde_loss(model, x, sigma, beta)
optimizer.zero_grad()
loss.backward()
optimizer.step()
After training:
- Choose a list of noise levels $\{\sigma_i\}_{i=1}^L$ (e.g., 16 values between $\sigma_\text{low}$ and $\sigma_\text{high}$).
- For a test input $x$, compute
  $$ \mathbf{v}_x = \left[ f_\theta(x, \sigma_1), \dots, f_\theta(x, \sigma_L) \right] \in \mathbb{R}^L $$
- Fit a GMM on the vectors $\mathbf{v}_x$ of normal training data.
- Compute the anomaly score as the negative log-likelihood under the GMM (sketched below).
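A sketch of this inference stage, assuming `train_feats` and `test_feats` are feature tensors on the model's device, and using scikit-learn's `GaussianMixture`; the GMM component count here is an illustrative choice:

```python
import numpy as np
import torch
from sklearn.mixture import GaussianMixture

@torch.no_grad()
def log_density_vectors(model, x, sigmas):
    """Stack f_theta(x, sigma_i) over the L noise levels -> [B, L]."""
    cols = []
    for s in sigmas:
        sigma = torch.full((x.size(0), 1), s, device=x.device)
        cols.append(model(x, sigma))
    return torch.cat(cols, dim=1)

sigmas = np.geomspace(sigma_low, sigma_high, 16).tolist()       # L = 16 levels
v_train = log_density_vectors(model, train_feats, sigmas).cpu().numpy()
gmm = GaussianMixture(n_components=4).fit(v_train)              # fit on normal data only

v_test = log_density_vectors(model, test_feats, sigmas).cpu().numpy()
anomaly_score = -gmm.score_samples(v_test)                      # higher = more anomalous
```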
- To use other noise types: replace the Gaussian sampling step.
- To estimate the gradient only (like standard DSM): directly learn $s_\theta(\tilde{x}, \sigma)$ instead of a scalar $f_\theta$.
- For image/video inputs: add a convolutional backbone before the final MLP.
- To control the smoothness of $f_\theta$: replace GELU with another smooth activation; ReLU is not allowed because the loss requires a twice-differentiable network (see the sketch below).
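For example, a minimal sketch of that activation swap, reusing the `MULDE` class above with `Softplus` as one possible twice-differentiable alternative (the subclass name is just for illustration):

```python
import torch
import torch.nn.functional as F

class MULDESoftplus(MULDE):
    """MULDE variant with Softplus activations (one possible smooth choice)."""
    def forward(self, x, sigma):
        h = torch.cat([x, sigma.log()], dim=1)
        h = F.softplus(self.fc1(h))
        h = F.softplus(self.fc2(h))
        return self.fc3(h)
```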
Aspect | MULDE Design |
---|---|
Output | Scalar log-density estimate |
Conditioning | On log(σ), concatenated with input |
Required property | Twice-differentiable network (smooth activations) |
Loss supervision | From DSM identity (Vincent, 2011) |
Regularization | Aligns log-densities across σ-scales |
Score field property | Conservative by construction: the score is the gradient $\nabla f$ of a scalar potential, not a free-form vector field |