1. Architecture
Compared with DeepSeek-V3.1-Terminus, the final release of DeepSeek-V3.1, the only architectural modification in DeepSeek-V3.2-Exp is the introduction of DeepSeek Sparse Attention (DSA), which is incorporated through continued training.
Prototype of DSA. The prototype of DSA primarily consists of two components: a lightning indexer and a fine-grained token selection mechanism.
The lightning indexer computes an index score $I_{t,s}$ between the query token $\mathbf{h}_t\in\mathbb{R}^d$
and each preceding token $\mathbf{h}_s\in\mathbb{R}^d$, which determines which tokens are selected for the query token:
$$
I_{t,s}=\sum_{j=1}^{H^{I}} w^{I}_{t,j}\cdot\mathrm{ReLU}\left(\mathbf{q}^{I}_{t,j}\cdot \mathbf{k}^{I}_{s}\right),
\tag{1}
$$
where $H^{I}$ denotes the number of indexer heads; $\mathbf{q}^{I}_{t,j}\in\mathbb{R}^{d^{I}}$ and $w^{I}_{t,j}\in\mathbb{R}$
are derived from the query token $\mathbf{h}_t$; and $\mathbf{k}^{I}_{s}\in\mathbb{R}^{d^{I}}$ is derived from the
preceding token $\mathbf{h}_s$. We choose ReLU as the activation function for throughput considerations. Because the lightning indexer has a small number of heads and can be implemented in FP8, its computational cost is very low.
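To make Equation (1) concrete, the following is a minimal PyTorch-style sketch of the index-score computation for a single sequence. The tensor names, shapes, and the absence of FP8 and batching are our own assumptions for illustration, not the reference implementation.

```python
import torch
import torch.nn.functional as F

def lightning_index_scores(q_idx, w_idx, k_idx):
    """Compute the index scores I[t, s] of Eq. (1) for one sequence.

    q_idx: [T, H_I, d_I]  indexer queries q^I_{t,j}, derived from h_t
    w_idx: [T, H_I]       per-head scalar weights w^I_{t,j}, derived from h_t
    k_idx: [S, d_I]       indexer keys k^I_s, derived from preceding tokens h_s
    returns: [T, S] index scores
    """
    # ReLU(q^I_{t,j} . k^I_s) for every (t, j, s)
    dots = torch.einsum("thd,sd->ths", q_idx, k_idx)
    activated = F.relu(dots)
    # Weighted sum over the H_I indexer heads j
    return torch.einsum("th,ths->ts", w_idx, activated)
```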
Given the index scores $\{I_{t,s}\}$ for each query token $\mathbf{h}_t$, our fine-grained token selection mechanism
retrieves only the key-value entries $\{\mathbf{c}_s\}$ corresponding to the top-$k$ index scores. Then, the attention output
$\mathbf{u}_t$ is computed by applying the attention mechanism between the query token $\mathbf{h}_t$ and the sparsely
selected key-value entries $\{\mathbf{c}_s\}$:
$$
\mathbf{u}_t=\mathrm{Attn}\left(\mathbf{h}_t,\{\mathbf{c}_s\mid I_{t,s}\in \mathrm{Top}\text{-}k(I_{t,:})\}\right).
\tag{2}
$$
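The selection step of Equation (2) reduces, per query token, to a top-$k$ over the index scores followed by a gather of the chosen key-value entries; any standard attention kernel can then be applied over the selected entries only. Below is a minimal sketch of that selection under assumed names and a dense cache layout; it is not the reference implementation.

```python
import torch

def select_topk_kv(index_scores, kv_entries, top_k):
    """Fine-grained token selection for Eq. (2).

    index_scores: [T, S]   scores I[t, s] from the lightning indexer
    kv_entries:   [S, d_c] one key-value entry c_s per token
    returns:      selected entries [T, k, d_c] and their indices [T, k]
    """
    # In practice, scores of non-preceding tokens would be masked
    # (e.g., set to -inf) before the top-k so selection stays causal.
    k = min(top_k, index_scores.shape[-1])
    top = torch.topk(index_scores, k, dim=-1)   # per-query top-k scores
    selected = kv_entries[top.indices]          # gather -> [T, k, d_c]
    return selected, top.indices
```

The attention output $\mathbf{u}_t$ is then obtained by attending from $\mathbf{h}_t$ over the returned entries, rather than over the full cache.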