1. Architecture
Compared with DeepSeek-V3.1-Terminus, the final release of DeepSeek-V3.1, the only architectural modification in DeepSeek-V3.2-Exp is the introduction of DeepSeek Sparse Attention (DSA), which is incorporated through continued training.
Prototype of DSA. The prototype of DSA primarily consists of two components: a lightning indexer and a fine-grained token selection mechanism.
The lightning indexer computes an index score $I_{t,s}$ between the query token $\mathbf{h}_t\in\mathbb{R}^d$
and each preceding token $\mathbf{h}_s\in\mathbb{R}^d$, which determines which tokens are selected for the query token:
$$
I_{t,s}=\sum_{j=1}^{H^{I}} w^{I}_{t,j}\cdot\mathrm{ReLU}\left(\mathbf{q}^{I}_{t,j}\cdot \mathbf{k}^{I}_{s}\right),
\tag{1}
$$
where $H^{I}$ denotes the number of indexer heads; $\mathbf{q}^{I}_{t,j}\in\mathbb{R}^{d^{I}}$ and $w^{I}_{t,j}\in\mathbb{R}$
are derived from the query token $\mathbf{h}_t$; and $\mathbf{k}^{I}_{s}\in\mathbb{R}^{d^{I}}$ is derived from the
preceding token $\mathbf{h}_s$. We choose ReLU as the activation function for throughput considerations. Because the lightning indexer has a small number of heads and can be implemented in FP8, its computational cost is very low.
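To make Equation (1) concrete, the following is a minimal PyTorch-style sketch of the index-score computation for a single sequence. The tensor names, shapes, and the absence of FP8 and batching are our own assumptions for illustration, not the reference implementation.

```python
import torch
import torch.nn.functional as F

def lightning_index_scores(q_idx, w_idx, k_idx):
    """Compute the index scores I[t, s] of Eq. (1) for one sequence.

    q_idx: [T, H_I, d_I]  indexer queries q^I_{t,j}, derived from h_t
    w_idx: [T, H_I]       per-head scalar weights w^I_{t,j}, derived from h_t
    k_idx: [S, d_I]       indexer keys k^I_s, derived from preceding tokens h_s
    returns: [T, S] index scores
    """
    # ReLU(q^I_{t,j} . k^I_s) for every (t, j, s)
    dots = torch.einsum("thd,sd->ths", q_idx, k_idx)
    activated = F.relu(dots)
    # Weighted sum over the H_I indexer heads j
    return torch.einsum("th,ths->ts", w_idx, activated)
```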
Given the index scores $\{I_{t,s}\}$ for each query token $\mathbf{h}_t$, our fine-grained token selection mechanism
retrieves only the key-value entries $\{\mathbf{c}_s\}$ corresponding to the top-$k$ index scores. Then, the attention output
$\mathbf{u}_t$ is computed by applying the attention mechanism between the query token $\mathbf{h}_t$ and the sparsely
selected key-value entries $\{\mathbf{c}_s\}$:
$$
\mathbf{u}_t=\mathrm{Attn}\left(\mathbf{h}_t,\{\mathbf{c}_s\mid I_{t,s}\in \mathrm{Top}\text{-}k(I_{t,:})\}\right).
\tag{2}
$$
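The selection step of Equation (2) reduces, per query token, to a top-$k$ over the index scores followed by a gather of the chosen key-value entries; any standard attention kernel can then be applied over the selected entries only. Below is a minimal sketch of that selection under assumed names and a dense cache layout; it is not the reference implementation.

```python
import torch

def select_topk_kv(index_scores, kv_entries, top_k):
    """Fine-grained token selection for Eq. (2).

    index_scores: [T, S]   scores I[t, s] from the lightning indexer
    kv_entries:   [S, d_c] one key-value entry c_s per token
    returns:      selected entries [T, k, d_c] and their indices [T, k]
    """
    # In practice, scores of non-preceding tokens would be masked
    # (e.g., set to -inf) before the top-k so selection stays causal.
    k = min(top_k, index_scores.shape[-1])
    top = torch.topk(index_scores, k, dim=-1)   # per-query top-k scores
    selected = kv_entries[top.indices]          # gather -> [T, k, d_c]
    return selected, top.indices
```

The attention output $\mathbf{u}_t$ is then obtained by attending from $\mathbf{h}_t$ over the returned entries, rather than over the full cache.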