Estimate the score: $s_\theta(x) \approx \nabla_x \log p(x)$.
MLE needs the intractable normalizing constant, but the score does not. So minimize the score error directly:
$$ \mathbb{E}_{p(x)} \left[ \| s_\theta(x) - \nabla_x \log p(x) \|^2 \right] $$
But we don't know the true score $\nabla_x \log p(x)$. Integration by parts turns this into an objective that involves only the model:
$$ \mathcal{L}_{\text{SM}}(\theta) = \mathbb{E}_{p(x)} \left[ \operatorname{Tr}(\nabla_x s_\theta(x)) + \frac{1}{2} \| s_\theta(x) \|^2 \right] $$
Key Insight:
We estimate the score without ever needing the true score $\nabla_x \log p(x)$ or the normalizing constant.
Limitation: the objective is unstable in high dimensions, and the data distribution may be too sharp for the score to be learned reliably.
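As a concrete illustration, here is a minimal PyTorch sketch of this objective, assuming a generic score network `score_net` mapping `[B, D] -> [B, D]`; the exact Jacobian trace is replaced by a Hutchinson estimator, since computing it exactly needs $D$ backward passes:

```python
import torch

def sm_loss(score_net, x, n_probes=1):
    """Hyvarinen score matching with a Hutchinson estimate of Tr(grad_x s) (sketch)."""
    x = x.detach().requires_grad_(True)
    s = score_net(x)                                        # [B, D]
    trace = x.new_zeros(x.size(0))
    for _ in range(n_probes):
        # Rademacher probe v: Tr(J) is estimated as E_v[v^T J v]
        v = torch.randint_like(x, low=0, high=2) * 2 - 1
        (jv,) = torch.autograd.grad((s * v).sum(), x, create_graph=True)
        trace = trace + (jv * v).sum(dim=1)
    trace = trace / n_probes
    return (trace + 0.5 * (s ** 2).sum(dim=1)).mean()
```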
Add Gaussian noise: $\tilde{x} = x + \epsilon$, $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$.
Define the noisy (smoothed) distribution: $q_\sigma(\tilde{x}) = \int p(x)\, \mathcal{N}(\tilde{x};\, x, \sigma^2 I)\, dx$.
Train $s_\theta$ against the conditional score $\nabla_{\tilde{x}} \log q_\sigma(\tilde{x} \mid x) = \frac{x - \tilde{x}}{\sigma^2}$:
$$ \mathcal{L}_{\text{DSM}} = \mathbb{E}_{x, \tilde{x}} \left[ \left\| s_\theta(\tilde{x}) - \frac{x - \tilde{x}}{\sigma^2} \right\|^2 \right] $$
Why It Works:
Minimizing this makes $s_\theta(\tilde{x}) \approx \nabla_{\tilde{x}} \log q_\sigma(\tilde{x})$, the score of the smoothed distribution; the denoising identity (Vincent, 2011) guarantees both objectives share the same minimizer.
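A minimal sketch of this loss, assuming a generic `score_net` and a fixed scalar `sigma`:

```python
import torch

def dsm_loss(score_net, x, sigma):
    """Denoising score matching at a single, fixed noise level (sketch)."""
    eps = torch.randn_like(x) * sigma
    x_tilde = x + eps
    target = (x - x_tilde) / sigma ** 2   # = -eps / sigma^2, the DSM target
    s = score_net(x_tilde)                # [B, D]
    return ((s - target) ** 2).sum(dim=1).mean()
```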
A fixed $\sigma$ captures only one level of smoothing. Train on many noise levels $\{\sigma_i\}$ with a noise-conditioned score network $s_\theta(\tilde{x}, \sigma)$:
$$ \mathcal{L}_{\text{NCSN}} = \mathbb{E}_{x, \sigma, \tilde{x}} \left[ \lambda(\sigma) \left\| s_\theta(\tilde{x}, \sigma) - \frac{x - \tilde{x}}{\sigma^2} \right\|^2 \right] $$
Why It Works:
Same principle as DSM, just applied across all noise levels $\sigma$.
Allows learning the multi-scale structure of the data.
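A sketch of the multi-level objective, assuming a noise-conditioned network `score_net(x_tilde, sigma)` and a 1-D tensor `sigmas` of noise levels; the weighting $\lambda(\sigma) = \sigma^2$ follows the equation above:

```python
import torch

def ncsn_loss(score_net, x, sigmas):
    """Multi-scale DSM: one noise level sampled per example (sketch)."""
    sigmas = sigmas.to(x.device)
    idx = torch.randint(0, len(sigmas), (x.size(0),), device=x.device)
    sigma = sigmas[idx].unsqueeze(1)              # [B, 1]
    eps = torch.randn_like(x) * sigma
    x_tilde = x + eps
    target = (x - x_tilde) / sigma ** 2           # DSM target at this noise level
    s = score_net(x_tilde, sigma)                 # sigma-conditioned score, [B, D]
    weight = sigma.squeeze(1) ** 2                # lambda(sigma) = sigma^2
    return (weight * ((s - target) ** 2).sum(dim=1)).mean()
```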
Instead of learning just the score, learn a scalar (negative) log-density:
$$
f_\theta(\tilde{x}, \sigma) \approx -\log q_\sigma(\tilde{x})
$$
If $f_\theta \approx -\log q_\sigma$, then $\nabla_{\tilde{x}} f_\theta(\tilde{x}, \sigma) \approx -\nabla_{\tilde{x}} \log q_\sigma(\tilde{x})$, i.e., the gradient of $f_\theta$ is the negated score. So use DSM-like score supervision on that gradient:
$$ \mathcal{L}_{\text{MULDE}} = \mathbb{E} \left[ \left\| \nabla_{\tilde{x}} f_\theta(\tilde{x}, \sigma) + \frac{x - \tilde{x}}{\sigma^2} \right\|^2 + \beta f_\theta(x, \sigma)^2 \right] $$
Why It Works:
- First term: makes the gradient of $f_\theta$ match the DSM target, so $f_\theta$ integrates the score field into a scalar potential.
- Second term: regularizes $f_\theta$ to behave like a log-density, keeping its values comparable across $\sigma$ scales.
Method | Learns | Loss Derived From | Core Equation |
---|---|---|---|
SM | $s_\theta(x) \approx \nabla_x \log p(x)$ | Score error + integration by parts | $\mathcal{L}_{\text{SM}}$ |
DSM | $s_\theta(\tilde{x}) \approx \nabla_{\tilde{x}} \log q_\sigma(\tilde{x})$ | Denoising identity (Vincent) | $\mathcal{L}_{\text{DSM}}$ |
NCSN | $s_\theta(\tilde{x}, \sigma)$ for all $\sigma$ | Multi-scale DSM | Same as DSM but over all $\sigma$ |
MULDE | $f_\theta(\tilde{x}, \sigma) \approx -\log q_\sigma(\tilde{x})$ | Integrate score into scalar | $\mathcal{L}_{\text{MULDE}}$ |
Here is a technical breakdown of what you need to implement and modify MULDE in PyTorch, grounded in its mathematical core. This guide focuses on the core architecture and training loop; feature extraction and the dataset are modular.
We want to approximate the (negative) log-density of the noise-smoothed feature distribution with a scalar network, $f_\theta(\tilde{x}, \sigma) \approx -\log q_\sigma(\tilde{x})$.
We achieve this by:
- Adding Gaussian noise to clean data $x$
- Training $f_\theta$ such that $\nabla_{\tilde{x}} f_\theta(\tilde{x}, \sigma) \approx -\frac{x - \tilde{x}}{\sigma^2}$
- Adding a regularization term $f_\theta(x, \sigma)^2$ to align predictions across scales
The full MULDE loss (Eq. 6 in the paper):
$$ \mathcal{L}(\theta) = \mathbb{E}_{x, \tilde{x}, \sigma} \left[ \lambda(\sigma) \left\| \nabla_{\tilde{x}} f_\theta(\tilde{x}, \sigma) + \frac{x - \tilde{x}}{\sigma^2} \right\|^2 + \beta f_\theta(x, \sigma)^2 \right] $$
Where:
- $x \sim p(x)$: clean features
- $\tilde{x} = x + \epsilon,\ \epsilon \sim \mathcal{N}(0, \sigma^2 I)$: noisy features
- $\sigma \sim \text{log-uniform}[\sigma_\text{low}, \sigma_\text{high}]$: sampled noise level
- $\lambda(\sigma) = \sigma^2$: scale-dependent weighting
- $f_\theta$: scalar-output neural network
- $\nabla_{\tilde{x}} f_\theta$: gradient w.r.t. the input $\tilde{x}$
import torch
import torch.nn as nn
import torch.nn.functional as F
class MULDE(nn.Module):
def __init__(self, input_dim, hidden_dim=4096):
super().__init__()
self.fc1 = nn.Linear(input_dim + 1, hidden_dim) # +1 for sigma
self.fc2 = nn.Linear(hidden_dim, hidden_dim)
self.fc3 = nn.Linear(hidden_dim, 1) # output: scalar f_theta
def forward(self, x, sigma):
# x: [B, D], sigma: [B, 1]
sigma_log = sigma.log() # log-sigma conditioning
h = torch.cat([x, sigma_log], dim=1)
h = F.gelu(self.fc1(h))
h = F.gelu(self.fc2(h))
f = self.fc3(h) # [B, 1]
return f
def mulde_loss(model, x, sigma, beta):
"""
x: [B, D] - clean features
sigma: [B, 1] - sampled noise levels
"""
# 1. Add Gaussian noise
eps = torch.randn_like(x) * sigma
x_tilde = x + eps
    x_tilde.requires_grad_(True)  # only the noisy input needs a gradient
# 2. Forward pass for noisy and clean
f_noisy = model(x_tilde, sigma) # [B, 1]
f_clean = model(x, sigma) # [B, 1]
# 3. Gradient w.r.t. noisy input
grad_outputs = torch.ones_like(f_noisy)
grads = torch.autograd.grad(
outputs=f_noisy, inputs=x_tilde,
grad_outputs=grad_outputs,
create_graph=True, retain_graph=True
)[0] # [B, D]
    # 4. Gradient target from Eq. 6: since f_theta ≈ -log q_sigma, its gradient
    #    must match -(x - x_tilde) / sigma^2 = +eps / sigma^2
    target = eps / (sigma ** 2)  # [B, D]
# 5. Loss terms
score_loss = ((grads - target) ** 2).sum(dim=1) # [B]
reg_loss = beta * (f_clean ** 2).squeeze(1) # [B]
# 6. Noise weighting
weight = sigma.squeeze(1) ** 2
loss = (weight * score_loss + reg_loss).mean()
return loss
model = MULDE(input_dim=512).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)
beta = 0.1
sigma_low, sigma_high = 1e-3, 1.0
for epoch in range(num_epochs):
for x in dataloader: # x: [B, D]
x = x.to(device)
B = x.size(0)
# Sample log-uniform σ
u = torch.rand(B, 1).to(device)
sigma = sigma_low * (sigma_high / sigma_low) ** u # log-uniform
loss = mulde_loss(model, x, sigma, beta)
optimizer.zero_grad()
loss.backward()
optimizer.step()
After training:
- Choose a list of noise levels $\{\sigma_i\}_{i=1}^L$ (e.g., 16 values between $\sigma_\text{low}$ and $\sigma_\text{high}$).
- For a test input $x$, compute
  $$ \mathbf{v}_x = \left[ f_\theta(x, \sigma_1), \dots, f_\theta(x, \sigma_L) \right] \in \mathbb{R}^L $$
- Fit a GMM on the vectors $\mathbf{v}_x$ of normal training data.
- Compute the anomaly score as the negative log-likelihood under the GMM (sketched below).
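A sketch of this inference stage, assuming `train_feats` and `test_feats` are feature tensors on the model's device, and using scikit-learn's `GaussianMixture`; the GMM component count here is an illustrative choice:

```python
import numpy as np
import torch
from sklearn.mixture import GaussianMixture

@torch.no_grad()
def log_density_vectors(model, x, sigmas):
    """Stack f_theta(x, sigma_i) over the L noise levels -> [B, L]."""
    cols = []
    for s in sigmas:
        sigma = torch.full((x.size(0), 1), s, device=x.device)
        cols.append(model(x, sigma))
    return torch.cat(cols, dim=1)

sigmas = np.geomspace(sigma_low, sigma_high, 16).tolist()       # L = 16 levels
v_train = log_density_vectors(model, train_feats, sigmas).cpu().numpy()
gmm = GaussianMixture(n_components=4).fit(v_train)              # fit on normal data only

v_test = log_density_vectors(model, test_feats, sigmas).cpu().numpy()
anomaly_score = -gmm.score_samples(v_test)                      # higher = more anomalous
```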
- To use other noise types: replace the Gaussian sampling step.
- To estimate the gradient only (like standard DSM): directly learn $s_\theta(\tilde{x}, \sigma)$ instead of a scalar $f_\theta$.
- For image/video inputs: add a convolutional backbone before the final MLP.
- To control the smoothness of $f_\theta$: replace GELU with another smooth activation; ReLU is not allowed because the loss requires a twice-differentiable network (see the sketch below).
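For example, a minimal sketch of that activation swap, reusing the `MULDE` class above with `Softplus` as one possible twice-differentiable alternative (the subclass name is just for illustration):

```python
import torch
import torch.nn.functional as F

class MULDESoftplus(MULDE):
    """MULDE variant with Softplus activations (one possible smooth choice)."""
    def forward(self, x, sigma):
        h = torch.cat([x, sigma.log()], dim=1)
        h = F.softplus(self.fc1(h))
        h = F.softplus(self.fc2(h))
        return self.fc3(h)
```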
Aspect | MULDE Design |
---|---|
Output | Scalar log-density estimate |
Conditioning | On log(σ), concatenated with input |
Required property | Twice-differentiable network (smooth activations) |
Loss supervision | From DSM identity (Vincent, 2011) |
Regularization | Aligns log-densities across σ-scales |
Score field property | Conservative by construction: the score is the gradient $\nabla f$ of a scalar potential, not a free-form vector field |