Estimate the score: $\nabla_x \log p(x)$.
MLE needs the normalization constant $Z_\theta$ of the model density, which is generally intractable,
but the score $\nabla_x \log p_\theta(x)$ does not depend on $Z_\theta$.
Minimize the score error directly: $\mathbb{E}_{p(x)}\big[\lVert s_\theta(x) - \nabla_x \log p(x) \rVert^2\big]$.
But we don't know $\nabla_x \log p(x)$.
Key Insight: integration by parts (Hyvärinen, 2005) rewrites the objective so that it involves only $s_\theta$ and its derivatives.
We estimate the score without ever needing $\nabla_x \log p(x)$ or $Z_\theta$.
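For reference, the objective implied here, in its explicit form and after Hyvärinen's integration by parts; both are standard score-matching results, restated for completeness rather than taken from the MULDE paper:

$$
J(\theta) = \tfrac{1}{2}\,\mathbb{E}_{p(x)}\!\left[\big\lVert s_\theta(x) - \nabla_x \log p(x) \big\rVert^2\right]
= \mathbb{E}_{p(x)}\!\left[\operatorname{tr}\!\big(\nabla_x s_\theta(x)\big) + \tfrac{1}{2}\big\lVert s_\theta(x) \big\rVert^2\right] + \text{const.}
$$

The right-hand side contains no unknown quantities, which is what makes score matching tractable.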
Unstable in high-dim; data distribution may be too sharp.
Add Gaussian noise: $\tilde{x} = x + \epsilon,\ \epsilon \sim \mathcal{N}(0, \sigma^2 I)$.
Define the noisy (smoothed) distribution: $q_\sigma(\tilde{x}) = \int p(x)\, \mathcal{N}(\tilde{x};\, x, \sigma^2 I)\, dx$.
Train $s_\theta(\tilde{x})$ to match the score of the smoothed distribution, $\nabla_{\tilde{x}} \log q_\sigma(\tilde{x})$, using only noisy samples.
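A standard way to write this training objective is Vincent's denoising score-matching loss, which uses the closed-form score of the Gaussian noising kernel; it is restated here because the MULDE loss below builds directly on it:

$$
\mathcal{L}_{\mathrm{DSM}}(\theta; \sigma)
= \mathbb{E}_{x \sim p,\ \tilde{x} \sim \mathcal{N}(x,\, \sigma^2 I)}
\!\left[\Big\lVert s_\theta(\tilde{x}) - \frac{x - \tilde{x}}{\sigma^2} \Big\rVert^2\right],
\qquad
\nabla_{\tilde{x}} \log \mathcal{N}(\tilde{x};\, x, \sigma^2 I) = \frac{x - \tilde{x}}{\sigma^2}
$$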
Why It Works:
Minimizing this makes $s_\theta(\tilde{x}) \approx \nabla_{\tilde{x}} \log q_\sigma(\tilde{x})$ without ever evaluating $q_\sigma$ itself.
Limitation: a fixed $\sigma$ captures the data structure at only one smoothing scale.
Train on many noise levels $\sigma_1 > \sigma_2 > \dots > \sigma_L$, with a single score network $s_\theta(\tilde{x}, \sigma)$ conditioned on the noise level.
Why It Works:
Same principle as DSM, just applied across all noise levels $\sigma_i$.
Allows learning multi-scale structure of data.
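Written out, the combined multi-scale objective is a weighted sum of per-scale DSM losses; the weighting $\lambda(\sigma_i) = \sigma_i^2$ shown here is the common choice (and the one MULDE uses below) that keeps the terms at comparable magnitude:

$$
\mathcal{L}_{\mathrm{NCSN}}(\theta)
= \frac{1}{L} \sum_{i=1}^{L} \lambda(\sigma_i)\,
\mathbb{E}_{x \sim p,\ \tilde{x} \sim \mathcal{N}(x,\, \sigma_i^2 I)}
\!\left[\Big\lVert s_\theta(\tilde{x}, \sigma_i) - \frac{x - \tilde{x}}{\sigma_i^2} \Big\rVert^2\right]
$$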
Instead of learning just the score $s_\theta(\tilde{x}, \sigma)$, MULDE learns a scalar function that approximates the negative log-density:
$$
f_\theta(\tilde{x}, \sigma) \approx -\log q_\sigma(\tilde{x})
$$
If $f_\theta(\tilde{x}, \sigma) \approx -\log q_\sigma(\tilde{x})$, then $-\nabla_{\tilde{x}} f_\theta(\tilde{x}, \sigma) \approx \nabla_{\tilde{x}} \log q_\sigma(\tilde{x})$: the score is recovered as the negated gradient of a scalar.
So use DSM-like score supervision on the gradient of $f_\theta$ (made explicit below; the full loss is Eq. 6 further down).
Why It Works:
- First term: makes the gradient $\nabla_{\tilde{x}} f_\theta$ match the DSM target (the substitution is spelled out just below)
- Second term: regularizes $f_\theta$ to behave like a log-density; the gradient term alone determines $f_\theta$ only up to an additive constant per $\sigma$, so penalizing $f_\theta(x, \sigma)^2$ on clean data pins that constant down and aligns values across scales
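To make the substitution explicit (a one-line identity, spelled out for clarity): plugging $s_\theta := -\nabla_{\tilde{x}} f_\theta$ into the DSM loss above gives

$$
\Big\lVert -\nabla_{\tilde{x}} f_\theta(\tilde{x}, \sigma) - \frac{x - \tilde{x}}{\sigma^2} \Big\rVert^2
= \Big\lVert \nabla_{\tilde{x}} f_\theta(\tilde{x}, \sigma) - \frac{\tilde{x} - x}{\sigma^2} \Big\rVert^2
$$

so the score-matching term supervises $\nabla_{\tilde{x}} f_\theta(\tilde{x}, \sigma)$ toward $\frac{\tilde{x} - x}{\sigma^2} = \frac{\epsilon}{\sigma^2}$, exactly the target used in the PyTorch code below.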
| Method | Learns | Loss Derived From | Core Equation |
|---|---|---|---|
| SM | $s_\theta(x) \approx \nabla_x \log p(x)$ | Score error + integration by parts | $\mathbb{E}_p\big[\operatorname{tr}(\nabla_x s_\theta) + \tfrac{1}{2}\lVert s_\theta \rVert^2\big]$ |
| DSM | $s_\theta(\tilde{x}) \approx \nabla_{\tilde{x}} \log q_\sigma(\tilde{x})$ | Denoising identity (Vincent) | $\mathbb{E}\big[\lVert s_\theta(\tilde{x}) - \tfrac{x - \tilde{x}}{\sigma^2} \rVert^2\big]$ |
| NCSN | $s_\theta(\tilde{x}, \sigma)$ at multiple scales | Multi-scale DSM | Same as DSM, summed over all $\sigma_i$ |
| MULDE | Scalar $f_\theta(\tilde{x}, \sigma) \approx -\log q_\sigma(\tilde{x})$ | Integrate score into scalar (DSM on $\nabla f_\theta$) | $\mathbb{E}\big[\lambda(\sigma)\lVert \nabla_{\tilde{x}} f_\theta - \tfrac{\tilde{x} - x}{\sigma^2} \rVert^2 + \beta f_\theta(x, \sigma)^2\big]$ |
Below is a technical breakdown of what is needed to implement and modify MULDE in PyTorch, grounded in its mathematical core. The guide focuses on the core architecture and training loop; feature extraction and the dataset are treated as modular components.
We want to approximate the log-density of the noised feature distribution at multiple noise scales, via a single scalar network $f_\theta(\tilde{x}, \sigma) \approx -\log q_\sigma(\tilde{x})$.
We achieve this by:
- Adding Gaussian noise to clean data $x$
- Training $f_\theta$ such that $\nabla_{\tilde{x}} f_\theta(\tilde{x}, \sigma) \approx -\frac{x - \tilde{x}}{\sigma^2}$
- Adding a regularization term $\beta\, f_\theta(x, \sigma)^2$ to align predictions across scales

The full MULDE loss (Eq. 6 in the paper):

$$
\mathcal{L}(\theta) = \mathbb{E}_{x \sim p,\ \sigma,\ \tilde{x} \sim \mathcal{N}(x,\, \sigma^2 I)}
\!\left[\lambda(\sigma)\, \Big\lVert \nabla_{\tilde{x}} f_\theta(\tilde{x}, \sigma) - \frac{\tilde{x} - x}{\sigma^2} \Big\rVert^2 + \beta\, f_\theta(x, \sigma)^2 \right]
$$
Where:
- $x \sim p(x)$: clean features
- $\tilde{x} = x + \epsilon,\ \epsilon \sim \mathcal{N}(0, \sigma^2 I)$: noisy features
- $\sigma \sim \text{log-uniform}[\sigma_\text{low}, \sigma_\text{high}]$: sampled noise level
- $\lambda(\sigma) = \sigma^2$: scale-dependent weighting
- $f_\theta$: scalar-output neural network
- $\nabla_{\tilde{x}} f_\theta$: gradient w.r.t. the input $\tilde{x}$
```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MULDE(nn.Module):
    def __init__(self, input_dim, hidden_dim=4096):
        super().__init__()
        self.fc1 = nn.Linear(input_dim + 1, hidden_dim)  # +1 for the sigma channel
        self.fc2 = nn.Linear(hidden_dim, hidden_dim)
        self.fc3 = nn.Linear(hidden_dim, 1)  # output: scalar f_theta

    def forward(self, x, sigma):
        # x: [B, D], sigma: [B, 1]
        sigma_log = sigma.log()  # log-sigma conditioning
        h = torch.cat([x, sigma_log], dim=1)
        h = F.gelu(self.fc1(h))  # smooth activation: f_theta must be twice differentiable
        h = F.gelu(self.fc2(h))
        f = self.fc3(h)  # [B, 1]
        return f
```

```python
def mulde_loss(model, x, sigma, beta):
    """
    x: [B, D] - clean features
    sigma: [B, 1] - sampled noise levels
    """
    # 1. Add Gaussian noise (per-sample std sigma, broadcast over feature dims)
    eps = torch.randn_like(x) * sigma
    x_tilde = x + eps
    x_tilde.requires_grad_(True)  # needed for the input gradient below
    # 2. Forward pass for noisy and clean
    f_noisy = model(x_tilde, sigma)  # [B, 1]
    f_clean = model(x, sigma)        # [B, 1]
    # 3. Gradient w.r.t. noisy input
    grad_outputs = torch.ones_like(f_noisy)
    grads = torch.autograd.grad(
        outputs=f_noisy, inputs=x_tilde,
        grad_outputs=grad_outputs,
        create_graph=True, retain_graph=True
    )[0]  # [B, D]
    # 4. Score matching target for grad f_theta: since f_theta ~ -log q_sigma,
    #    grad_xtilde f_theta should match (x_tilde - x) / sigma^2 = eps / sigma^2
    target = eps / (sigma ** 2)  # [B, D]
    # 5. Loss terms
    score_loss = ((grads - target) ** 2).sum(dim=1)  # [B]
    reg_loss = beta * (f_clean ** 2).squeeze(1)      # [B]
    # 6. Noise weighting
    weight = sigma.squeeze(1) ** 2
    loss = (weight * score_loss + reg_loss).mean()
    return loss
```

```python
# Training setup (a dataloader yielding feature batches x: [B, D] is assumed)
device = "cuda" if torch.cuda.is_available() else "cpu"
num_epochs = 100  # illustrative value

model = MULDE(input_dim=512).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)
beta = 0.1
sigma_low, sigma_high = 1e-3, 1.0
for epoch in range(num_epochs):
    for x in dataloader:  # x: [B, D]
        x = x.to(device)
        B = x.size(0)
        # Sample log-uniform σ
        u = torch.rand(B, 1).to(device)
        sigma = sigma_low * (sigma_high / sigma_low) ** u  # log-uniform
        loss = mulde_loss(model, x, sigma, beta)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

After training:
- Choose a list of noise levels $\{\sigma_i\}_{i=1}^{L}$ (e.g., 16 values between $\sigma_\text{low}$ and $\sigma_\text{high}$)
- For a test input $x$, compute the vector of log-density estimates $\mathbf{v}_x = \big(f_\theta(x, \sigma_1), \dots, f_\theta(x, \sigma_L)\big)$
- Fit a GMM on the vectors $\mathbf{v}_x$ of normal training data
- Compute the anomaly score as the negative log-likelihood under the GMM (a sketch follows below)
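A minimal sketch of this inference stage, assuming the trained `MULDE` model and `device` from the code above, placeholder feature tensors `x_train` (normal data) and `x_test`, and scikit-learn for the GMM; the 16 noise levels and 4 mixture components are illustrative choices, not values from the paper:

```python
import torch
from sklearn.mixture import GaussianMixture

@torch.no_grad()
def log_density_vector(model, x, sigmas):
    """Stack f_theta(x, sigma_i) over all noise levels: returns v_x with shape [B, L]."""
    feats = []
    for s in sigmas:
        sigma = torch.full((x.size(0), 1), s, device=x.device)
        feats.append(model(x, sigma))  # [B, 1]
    return torch.cat(feats, dim=1)     # [B, L]

model.eval()
# L = 16 log-spaced noise levels between sigma_low = 1e-3 and sigma_high = 1.0
sigmas = torch.logspace(-3, 0, steps=16).tolist()

# Fit the GMM on vectors from normal training features (x_train: [N, D])
v_train = log_density_vector(model, x_train.to(device), sigmas).cpu().numpy()
gmm = GaussianMixture(n_components=4, covariance_type="full").fit(v_train)

# Anomaly score for test features (x_test: [M, D]): negative log-likelihood under the GMM
v_test = log_density_vector(model, x_test.to(device), sigmas).cpu().numpy()
anomaly_score = -gmm.score_samples(v_test)  # higher = more anomalous
```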
 
- To use other noise types: replace the Gaussian noise sampling (and the corresponding score target)
- To estimate the gradient only (like standard DSM): directly learn a vector-valued $s_\theta(\tilde{x}, \sigma)$ instead of $\nabla_{\tilde{x}} f_\theta$ (see the sketch after this list)
- For image/video inputs: add a convolutional backbone before the final MLP
- To control the smoothness of $f_\theta$: replace GELU with other smooth activations (ReLU is ruled out because the loss differentiates through $\nabla_{\tilde{x}} f_\theta$, so the network must be twice differentiable)
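For the "gradient only" variant, a rough sketch of a plain DSM setup is given below; `ScoreNet` and `dsm_loss` are illustrative names, not from the MULDE paper, and the network now outputs a D-dimensional score directly, so no `autograd.grad` through the model is needed:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScoreNet(nn.Module):
    """Vector-valued score model s_theta(x_tilde, sigma), as in standard DSM/NCSN."""
    def __init__(self, input_dim, hidden_dim=4096):
        super().__init__()
        self.fc1 = nn.Linear(input_dim + 1, hidden_dim)  # +1 for log-sigma
        self.fc2 = nn.Linear(hidden_dim, hidden_dim)
        self.fc3 = nn.Linear(hidden_dim, input_dim)      # one output per input dimension

    def forward(self, x, sigma):
        h = torch.cat([x, sigma.log()], dim=1)
        h = F.gelu(self.fc1(h))
        h = F.gelu(self.fc2(h))
        return self.fc3(h)  # [B, D]

def dsm_loss(score_model, x, sigma):
    """Denoising score matching: match s_theta to the score of q_sigma."""
    eps = torch.randn_like(x) * sigma
    x_tilde = x + eps
    target = -eps / (sigma ** 2)          # (x - x_tilde) / sigma^2
    pred = score_model(x_tilde, sigma)
    return ((sigma ** 2) * (pred - target) ** 2).sum(dim=1).mean()
```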
 
| Aspect | MULDE Design | 
|---|---|
| Output | Scalar log-density estimate | 
| Conditioning | On log(σ), concatenated with input | 
| Required property | Twice-differentiable network (smooth activations) | 
| Loss supervision | From DSM identity (Vincent, 2011) | 
| Regularization | Aligns log-densities across σ-scales | 
| Score field property | Conservative by construction: the score is the gradient $\nabla_{\tilde{x}} f_\theta$ of a scalar, not a free-form vector field |