
1. Score Matching (SM)

Goal

Estimate the score: $s(x) = \nabla_x \log p(x)$

Why Not MLE?

MLE needs normalization: $\log p(x) = \log \tilde{p}(x) - \log Z$,
but $Z = \int \tilde{p}(x) dx$ is intractable.
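
The score avoids this entirely, because the normalizer is a constant in $x$ and drops out under the gradient:

$$ \nabla_x \log p(x) = \nabla_x \log \tilde{p}(x) - \nabla_x \log Z = \nabla_x \log \tilde{p}(x) $$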

Trick

Minimize score error directly:

$$ \mathbb{E}_{p(x)} \left[ \| s_\theta(x) - \nabla_x \log p(x) \|^2 \right] $$

But we don’t know $\nabla \log p(x)$, so Hyvärinen (2005) rewrites this (via integration by parts):

$$ \mathcal{L}_{\text{SM}}(\theta) = \mathbb{E}_{p(x)} \left[ \text{Tr}(\nabla_x s_\theta(x)) + \frac{1}{2} \| s_\theta(x) \|^2 \right] $$

Key Insight:
We estimate the score without ever needing $p(x)$.
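
A minimal PyTorch sketch of this objective (not from the gist; `score_model` is a hypothetical module mapping a [B, D] batch to a [B, D] score). The explicit loop over dimensions shows why the trace term is costly in high dimensions:

import torch

def sm_loss(score_model, x):
    # Hyvarinen (2005): E[ Tr(d s / d x) + 0.5 * ||s(x)||^2 ]
    x = x.requires_grad_(True)
    s = score_model(x)                       # [B, D]
    sq_norm = 0.5 * (s ** 2).sum(dim=1)      # 0.5 * ||s(x)||^2, shape [B]

    trace = torch.zeros(x.size(0), device=x.device)
    for i in range(x.size(1)):               # one backward pass per dimension
        grad_i = torch.autograd.grad(
            s[:, i].sum(), x, create_graph=True, retain_graph=True
        )[0][:, i]                           # diagonal entry d s_i / d x_i
        trace = trace + grad_i

    return (trace + sq_norm).mean()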


2. Denoising Score Matching (DSM)

Problem with SM

The trace term $\text{Tr}(\nabla_x s_\theta(x))$ is expensive in high dimensions, and the data distribution may be too sharp (concentrated near a low-dimensional manifold), which makes the estimate unstable.

Fix

Add Gaussian noise: $x̃ = x + \epsilon,\ \epsilon \sim \mathcal{N}(0, \sigma^2 I)$

Define noisy (smoothed) distribution:

$$ q(x̃) = \int p(x)\, \mathcal{N}(x̃ \mid x, \sigma^2 I)\, dx $$

Key Identity (Vincent 2011)

$$ \nabla_{x̃} \log q(x̃) = \mathbb{E}_{x \mid x̃} \left[ \frac{x - x̃}{\sigma^2} \right] $$
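
This follows in one line from $\nabla_{x̃} \mathcal{N}(x̃ \mid x, \sigma^2 I) = \mathcal{N}(x̃ \mid x, \sigma^2 I)\, \frac{x - x̃}{\sigma^2}$ and differentiating under the integral:

$$ \nabla_{x̃} \log q(x̃) = \frac{\nabla_{x̃} q(x̃)}{q(x̃)} = \int \frac{p(x)\, \mathcal{N}(x̃ \mid x, \sigma^2 I)}{q(x̃)}\, \frac{x - x̃}{\sigma^2}\, dx = \mathbb{E}_{x \mid x̃} \left[ \frac{x - x̃}{\sigma^2} \right] $$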

Loss

Train $s_\theta(x̃)$ to match this:

$$ \mathcal{L}_{\text{DSM}} = \mathbb{E}_{x, x̃} \left[ \left\| s_\theta(x̃) - \frac{x - x̃}{\sigma^2} \right\|^2 \right] $$

Why It Works:
Minimizing this makes $s_\theta(x̃) \to \nabla_{x̃} \log q(x̃)$ — the score of the smoothed distribution.
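
As a minimal sketch (assuming a hypothetical `score_model` that maps x̃ to a [B, D] score), the DSM loss at a fixed σ is just a regression onto the scaled noise:

import torch

def dsm_loss(score_model, x, sigma=0.1):
    # Target is (x - x_tilde) / sigma^2 = -eps / sigma^2
    eps = torch.randn_like(x) * sigma
    x_tilde = x + eps
    target = -eps / sigma ** 2
    s = score_model(x_tilde)                          # [B, D]
    return ((s - target) ** 2).sum(dim=1).mean()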


3. NCSN (Noise Conditional Score Networks)

Problem with DSM

Fixed $\sigma$: trade-off between detail (low noise) and stability (high noise).

Solution

Train on many noise levels $\sigma$, condition the model on noise scale:

$$ s_\theta(x̃, \sigma) \approx \nabla_{x̃} \log q_\sigma(x̃) $$

Loss

$$ \mathcal{L}_{\text{NCSN}} = \mathbb{E}_{x, \sigma, x̃} \left[ \lambda(\sigma) \left\| s_\theta(x̃, \sigma) - \frac{x - x̃}{\sigma^2} \right\|^2 \right] $$

Why It Works:
Same principle as DSM — just across all $\sigma$.
Allows learning multi-scale structure of data.
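
A minimal sketch (hypothetical `score_model(x_tilde, sigma)`, per-example σ, and the common choice λ(σ) = σ², which matches the weighting used later in this gist):

import torch

def ncsn_loss(score_model, x, sigma):
    # sigma: [B, 1], one noise level per example
    eps = torch.randn_like(x) * sigma
    x_tilde = x + eps
    target = -eps / sigma ** 2                        # (x - x_tilde) / sigma^2
    s = score_model(x_tilde, sigma)                   # [B, D]
    per_example = ((s - target) ** 2).sum(dim=1)      # [B]
    return (sigma.squeeze(1) ** 2 * per_example).mean()  # lambda(sigma) = sigma^2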


4. MULDE

New Goal

Instead of learning just the score $\nabla \log q$, learn the negative log-density itself:
$$ f_\theta(x̃, \sigma) \approx -\log q_\sigma(x̃) $$

Key Gradient Link

If $f_\theta(x̃, \sigma) \approx -\log q_\sigma(x̃)$, then:

$$ \nabla_{x̃} f_\theta(x̃, \sigma) \approx - \nabla_{x̃} \log q_\sigma(x̃) $$

So use DSM-like score supervision:

$$ \nabla_{x̃} f_\theta(x̃, \sigma) \approx -\frac{x - x̃}{\sigma^2} $$

Loss

$$ \mathcal{L}_{\text{MULDE}} = \mathbb{E} \left[ \left\| \nabla_{x̃} f_\theta(x̃, \sigma) + \frac{x - x̃}{\sigma^2} \right\|^2 + \beta f_\theta(x, \sigma)^2 \right] $$

Why It Works:

  • First term: makes gradient match DSM target
  • Second term: regularizes $f_\theta$ to behave like log-density

Final Summary (Core Math for Each)

| Method | Learns | Loss Derived From | Core Equation |
| --- | --- | --- | --- |
| SM | $\nabla \log p(x)$ | Score error + integration by parts | $\text{Tr}(\nabla_x s_\theta) + \frac{1}{2} \lVert s_\theta \rVert^2$ |
| DSM | $\nabla \log q(x̃)$ | Denoising identity (Vincent) | $\nabla_{x̃} \log q(x̃) = \mathbb{E}_{x \mid x̃}\left[\frac{x - x̃}{\sigma^2}\right]$ |
| NCSN | $\nabla \log q_\sigma(x̃)$ | Multi-scale DSM | Same as DSM, but over all $\sigma$ |
| MULDE | $-\log q_\sigma(x̃)$ | Integrating the score into a scalar | $\nabla_{x̃} f_\theta \approx -\frac{x - x̃}{\sigma^2}$ |

Below is a technical breakdown of what is needed to implement and modify MULDE in PyTorch, grounded in the math above. The focus is on the core architecture and training loop; feature extraction and the dataset are treated as modular components.


1. Core Idea Recap

We want to approximate the negative log-density $-\log q_\sigma(x̃)$ with a neural network $f_\theta(x̃, \sigma)$.

We achieve this by:

  • Adding Gaussian noise to clean data $x$: $x̃ = x + \epsilon$, $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$
  • Training $f_\theta$ such that $\nabla_{x̃} f_\theta(x̃, \sigma) \approx -\frac{x - x̃}{\sigma^2}$
  • Adding a regularization term $f_\theta(x, \sigma)^2$ to align predictions across scales

2. Mathematical Objective

The full MULDE loss (Eq. 6 in the paper):

$$ \mathcal{L}(\theta) = \mathbb{E}_{x, x̃, \sigma} \left[ \lambda(\sigma) \left\| \nabla_{x̃} f_\theta(x̃, \sigma) + \frac{x - x̃}{\sigma^2} \right\|^2 + \beta f_\theta(x, \sigma)^2 \right] $$

Where:

  • $x \sim p(x)$: clean features
  • $x̃ = x + \epsilon,\ \epsilon \sim \mathcal{N}(0, \sigma^2 I)$
  • $\sigma \sim \text{log-uniform}[\sigma_\text{low}, \sigma_\text{high}]$
  • $\lambda(\sigma) = \sigma^2$: scale-dependent weighting
  • $f_\theta$: scalar-output neural network
  • $\nabla_{x̃} f_\theta$: gradient w.r.t. the input $x̃$

3. Model Design

import torch
import torch.nn as nn
import torch.nn.functional as F

class MULDE(nn.Module):
    def __init__(self, input_dim, hidden_dim=4096):
        super().__init__()
        self.fc1 = nn.Linear(input_dim + 1, hidden_dim)  # +1 for sigma
        self.fc2 = nn.Linear(hidden_dim, hidden_dim)
        self.fc3 = nn.Linear(hidden_dim, 1)  # output: scalar f_theta

    def forward(self, x, sigma):
        # x: [B, D], sigma: [B, 1]
        sigma_log = sigma.log()  # log-sigma conditioning
        h = torch.cat([x, sigma_log], dim=1)
        h = F.gelu(self.fc1(h))
        h = F.gelu(self.fc2(h))
        f = self.fc3(h)  # [B, 1]
        return f
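
A quick shape check (illustrative only; the small hidden size is just to keep it light):

model = MULDE(input_dim=512, hidden_dim=256)
x = torch.randn(8, 512)            # batch of feature vectors
sigma = torch.full((8, 1), 0.1)    # one noise level per example
print(model(x, sigma).shape)       # torch.Size([8, 1])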

4. Training Step (Loss Computation)

def mulde_loss(model, x, sigma, beta):
    """
    x: [B, D] - clean features
    sigma: [B, 1] - sampled noise levels
    """
    # 1. Add Gaussian noise
    eps = torch.randn_like(x) * sigma
    x_tilde = x + eps

    # Only the noisy input needs gradient tracking (for autograd.grad below)
    x_tilde.requires_grad_(True)

    # 2. Forward pass for noisy and clean
    f_noisy = model(x_tilde, sigma)  # [B, 1]
    f_clean = model(x, sigma)        # [B, 1]

    # 3. Gradient w.r.t. noisy input
    grad_outputs = torch.ones_like(f_noisy)
    grads = torch.autograd.grad(
        outputs=f_noisy, inputs=x_tilde,
        grad_outputs=grad_outputs,
        create_graph=True, retain_graph=True
    )[0]  # [B, D]

    # 4. Score matching target
    target = -eps / (sigma ** 2)  # [B, D]

    # 5. Loss terms
    score_loss = ((grads - target) ** 2).sum(dim=1)  # [B]
    reg_loss = beta * (f_clean ** 2).squeeze(1)      # [B]

    # 6. Noise weighting
    weight = sigma.squeeze(1) ** 2
    loss = (weight * score_loss + reg_loss).mean()

    return loss
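
A quick sanity check on random tensors (illustrative; `backward()` goes through the double-backward graph created by `create_graph=True`):

model = MULDE(input_dim=512, hidden_dim=256)
x = torch.randn(8, 512)
sigma = torch.rand(8, 1) * 0.5 + 1e-3   # arbitrary positive noise levels
loss = mulde_loss(model, x, sigma, beta=0.1)
loss.backward()
print(loss.item())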

5. Training Loop (Simplified)

# `dataloader` and `num_epochs` are assumed to be defined elsewhere
device = "cuda" if torch.cuda.is_available() else "cpu"
model = MULDE(input_dim=512).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)
beta = 0.1
sigma_low, sigma_high = 1e-3, 1.0

for epoch in range(num_epochs):
    for x in dataloader:  # x: [B, D]
        x = x.to(device)
        B = x.size(0)

        # Sample log-uniform σ
        u = torch.rand(B, 1).to(device)
        sigma = sigma_low * (sigma_high / sigma_low) ** u  # log-uniform

        loss = mulde_loss(model, x, sigma, beta)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

6. Inference (Anomaly Scoring)

After training:

  1. Choose a list of noise levels $\{\sigma_i\}_{i=1}^{L}$ (e.g., 16 values between $\sigma_\text{low}$ and $\sigma_\text{high}$).

  2. For a test input $x$, compute:

$$ \mathbf{v}_x = \left[ f_\theta(x, \sigma_1), \dots, f_\theta(x, \sigma_L) \right] \in \mathbb{R}^L $$

  3. Fit a GMM on the vectors $\mathbf{v}_x$ computed from normal training data.

  4. Compute the anomaly score as the negative log-likelihood under the GMM.
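
A possible implementation of these four steps (a sketch, assuming scikit-learn's GaussianMixture, a trained `model`, and feature tensors `x_train` / `x_test` of shape [N, D] on the model's device; the number of GMM components is an arbitrary choice):

import numpy as np
import torch
from sklearn.mixture import GaussianMixture

@torch.no_grad()
def density_features(model, x, sigmas):
    # Stack f_theta(x, sigma_i) over the L noise levels -> [N, L]
    cols = []
    for s in sigmas:
        sig = torch.full((x.size(0), 1), float(s), device=x.device)
        cols.append(model(x, sig))                    # [N, 1]
    return torch.cat(cols, dim=1).cpu().numpy()

sigmas = np.geomspace(1e-3, 1.0, num=16)              # 16 log-spaced noise levels

v_train = density_features(model, x_train, sigmas)    # normal training data only
gmm = GaussianMixture(n_components=4).fit(v_train)

v_test = density_features(model, x_test, sigmas)
anomaly_score = -gmm.score_samples(v_test)            # negative log-likelihood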


7. Customization / Modifications

  • To use other noise types: replace Gaussian sampling
  • To estimate the gradient only (as in standard DSM): directly learn $s_\theta(x̃, \sigma)$
  • For image/video inputs: add convolutional backbone before final MLP
  • To control the smoothness of $f_\theta$: replace GELU with another smooth activation (ReLU is excluded because the network must be twice differentiable)

8. Design Considerations

| Aspect | MULDE Design |
| --- | --- |
| Output | Scalar log-density estimate |
| Conditioning | On $\log \sigma$, concatenated with the input features |
| Required property | Twice-differentiable network (smooth activations) |
| Loss supervision | From the DSM identity (Vincent, 2011) |
| Regularization | Aligns log-densities across $\sigma$-scales |
| Score field property | Conservative by construction: the implied score is $\nabla_{x̃} f_\theta$, not a free-form vector field |
