Jesse createthis

DeepSeek-V3.2-Exp: Boosting Long-Context Efficiency with DeepSeek Sparse Attention

DeepSeek-AI

Abstract

We introduce DeepSeek-V3.2-Exp, an experimental sparse-attention model, which equips DeepSeek-V3.1-Terminus with DeepSeek Sparse Attention (DSA) through continued training. With DSA, a fine-grained sparse attention mechanism powered by a lightning indexer, DeepSeek-V3.2-Exp achieves significant efficiency improvements in both training and inference, especially in long-context scenarios. The model checkpoints are available at https://huggingface.co/deepseek-ai/DeepSeek-V3.2-Exp.

This code implements a high-performance Top-K selection algorithm using TileLang for GPU acceleration. I'll explain it line by line, focusing on the radix-based selection approach.
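To make the radix-based idea concrete before the line-by-line walkthrough, here is a minimal CPU sketch in plain Python. It is not the TileLang kernel: the function names (`float_to_sortable_key`, `radix_topk_indices`), the 8-bit digit width, and the float32 key encoding are all assumptions chosen for illustration. The technique is to histogram the most significant digit of each key, find the bucket that contains the k-th largest element, accept everything in higher buckets, and repeat on the next digit using only the elements in that threshold bucket.

```python
import struct

def float_to_sortable_key(x: float) -> int:
    """Map a float32 to a uint32 whose unsigned order matches the float order."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Negative floats: flip every bit. Non-negative floats: set the sign bit.
    return bits ^ 0xFFFFFFFF if bits & 0x80000000 else bits | 0x80000000

def radix_topk_indices(values, k, bits_per_pass=8):
    """Return indices of the k largest values (order not guaranteed)."""
    if k <= 0:
        return []
    candidates = list(range(len(values)))
    if k >= len(candidates):
        return candidates
    keys = [float_to_sortable_key(v) for v in values]
    mask = (1 << bits_per_pass) - 1
    selected, need = [], k
    # Process digits from most significant to least significant.
    for shift in range(32 - bits_per_pass, -1, -bits_per_pass):
        hist = [0] * (mask + 1)
        for i in candidates:
            hist[(keys[i] >> shift) & mask] += 1
        # Walk buckets from the largest digit down to find the threshold bucket.
        running, threshold = 0, 0
        for digit in range(mask, -1, -1):
            if running + hist[digit] >= need:
                threshold = digit
                break
            running += hist[digit]
        # Everything strictly above the threshold bucket is in the top-k.
        selected.extend(i for i in candidates
                        if ((keys[i] >> shift) & mask) > threshold)
        need -= running
        # Only the threshold bucket remains undecided.
        candidates = [i for i in candidates
                      if ((keys[i] >> shift) & mask) == threshold]
        if len(candidates) == need:
            break
    # Remaining candidates tie on all processed digits; take as many as needed.
    selected.extend(candidates[:need])
    return selected

values = [0.5, -1.0, 3.0, 2.0, 3.0, -7.5, 0.0]
top3 = sorted(radix_topk_indices(values, 3))
```

A GPU version would typically build the histogram cooperatively across threads in shared memory; this sketch only shows the selection logic, and it does not handle NaN inputs.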

1. Imports and Configuration

import torch
import tilelang
import tilelang.language as T
pass_configs = {
 tilelang.PassConfigKey.TL_DISABLE_THREAD_STORAGE_SYNC: True,

This code implements the DeepSeek Sparse Attention (DSA) lightning indexer, which computes index scores for efficient attention using FP8 precision. I'll explain it line by line, breaking it into logical sections. The code uses TileLang (a DSL for GPU kernels) and PyTorch for high-performance computation.

1. Imports and Utility Functions

# ruff: noqa
import itertools
import tilelang
from tilelang import language as T
import torch

1. Architecture

Compared with DeepSeek-V3.1-Terminus, the last version of DeepSeek-V3.1, the only architectural modification of DeepSeek-V3.2-Exp is the introduction of DeepSeek Sparse Attention (DSA) through continued training.

Prototype of DSA. The prototype of DSA primarily consists of two components: a lightning indexer and a fine-grained token selection mechanism.
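As a rough illustration of how the two components fit together, the following plain-Python sketch scores every preceding token against a query and then keeps only the top-k positions. The helper names (`index_scores`, `select_tokens`) are hypothetical, and the ReLU-of-dot-product scoring is an illustrative assumption rather than the indexer's actual formula.

```python
def index_scores(query, keys):
    """Score each preceding token against the query (illustrative scoring only)."""
    # Assumption: ReLU of a plain dot product stands in for the real indexer.
    return [max(0.0, sum(q * k for q, k in zip(query, key))) for key in keys]

def select_tokens(scores, k):
    """Return the positions of the k highest-scoring tokens, in original order."""
    ranked = sorted(range(len(scores)), key=lambda s: scores[s], reverse=True)
    return sorted(ranked[:k])

query = [1.0, 0.0, -1.0]
keys = [
    [0.5, 0.2, 0.1],
    [-1.0, 0.3, 0.0],   # negative dot product, clipped to 0 by the ReLU
    [2.0, 0.0, 0.5],
    [0.0, 1.0, -3.0],
]
scores = index_scores(query, keys)
selected = select_tokens(scores, k=2)  # positions the query would attend to
```

Attention would then be computed only over the key-value entries at the `selected` positions, which is where the long-context savings come from.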

The lightning indexer computes an index score $I_{t,s}$ between the query token $\mathbf{h}_t\in\mathbb{R}^d$ and a preceding token $\mathbf{h}_s\in\mathbb{R}^d$, which determines the tokens that the query token selects:

$$

	#include <tl_templates/cuda/cuda_fp8.h>
	#include <tl_templates/cuda/gemm.h>
	#include <tl_templates/cuda/copy.h>
	#include <tl_templates/cuda/reduce.h>
	#include <tl_templates/cuda/ldsm.h>
	#include <tl_templates/cuda/threadblock_swizzle.h>
	#include <tl_templates/cuda/debug.h>
	#ifdef ENABLE_BF16
	#include <tl_templates/cuda/cuda_bf16_fallbacks.cuh>
	#endif

	#!/usr/bin/env python3
	import argparse
	import torch
	import os
	import sys
	from typing import Optional

	# Optional TVM runtime import to dump CUDA/PTX sources
	import tilelang
	from tilelang import tvm

	#!/usr/bin/env python3
	import argparse
	import time
	import torch

	# TileLang example kernels
	from examples.deepseek_v32.topk_selector import tl_topk, tl_topk_impl

	def bench_tl_topk(seq_len: int, topk: int = 256, batch: int = 1, iters: int = 50, warmup: int = 5):
	    torch.cuda.synchronize()

	#!/usr/bin/env python3
	import argparse
	import torch

	# Prefer local examples path resolution if running from repo root
	try:
	from examples.deepseek_v32.utils import per_custom_dims_cast_to_fp8 as _to_fp8
	def to_fp8(x):
	# Cast along last dim to FP8 E4M3 to match kernel expectations
	# Handle both (x, dims, use_ue8m0) and (x, dims) signatures and return the scaled tensor only.

	{% if not add_generation_prompt is defined %}
	{% set add_generation_prompt = false %}
	{% endif %}
	{% if not thinking is defined %}
	{% set thinking = false %}
	{% endif %}
	{% set ns = namespace(is_first=false, is_tool=false, system_prompt='', is_first_sp=true, is_last_user=false, is_only_sys=false, is_prefix=false) %}
	{%- for message in messages %}
	{%- if message['role'] == 'system' %}
	{%- if ns.is_first_sp %}