alexarmbr / taylor_seer.py
Created April 23, 2025 17:21
A decorator that implements the algorithm explained in "From Reusing to Forecasting: Accelerating Diffusion Models with TaylorSeers"
import functools
import torch
import math

def taylor_seer_approximation(WARMUP_STEPS=1, SKIP_INTERVAL_STEPS=1, compute_step_map=None, n_derivatives=2):
    """
    A decorator that approximates the forward pass of an nn.Module to reduce computation.

    Args:
        WARMUP_STEPS: Number of steps to compute the actual forward pass before starting approximation
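The forecasting idea behind the decorator can be sketched roughly as follows: cache the module's output on fully computed steps, estimate its derivatives with respect to the step index via backward finite differences (assuming, for simplicity, the cached outputs come from consecutive steps), and on skipped steps stand in for the forward pass with a Taylor expansion. The class and method names here (TaylorForecaster, update, predict) are illustrative, not from the gist.

import math

class TaylorForecaster:
    """Illustrative sketch: forecast a module's output on skipped steps from cached ones."""
    def __init__(self, n_derivatives=2):
        self.n_derivatives = n_derivatives
        self.history = []      # most recent fully computed outputs, newest first
        self.last_step = None  # step index at which the newest output was computed

    def update(self, output, step):
        # called on fully computed steps: cache the output for later extrapolation
        self.history.insert(0, output.detach())
        self.history = self.history[: self.n_derivatives + 1]
        self.last_step = step

    def predict(self, step):
        # called on skipped steps: Taylor-expand around the last fully computed step,
        # using backward finite differences of the cached outputs as derivative estimates
        dt = step - self.last_step
        diffs = list(self.history)
        derivatives = [diffs[0]]
        for _ in range(len(self.history) - 1):
            diffs = [diffs[i] - diffs[i + 1] for i in range(len(diffs) - 1)]
            derivatives.append(diffs[0])
        # f(t + dt) ~ sum_k f^(k)(t) * dt^k / k!
        return sum(d * (dt ** k) / math.factorial(k) for k, d in enumerate(derivatives))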
alexarmbr / profiling.md
Last active April 3, 2025 18:27
GPU profiling

Nsight Systems

Add torch.cuda.cudart().cudaProfilerStart() and torch.cuda.cudart().cudaProfilerStop() where profiling should start and stop. Launch the profiler with:

CUDA_VISIBLE_DEVICES=0,1,2,3 \
nsys profile \
-w true \
-t cuda,nvtx,osrt,cudnn,cublas \
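Roughly where those calls might sit in a training loop, with torch.cuda.nvtx ranges so individual steps show up as named spans on the timeline. The loop, model, and step counts below are placeholders; note that nsys only gates recording on the start/stop calls when launched with a capture range (e.g. --capture-range=cudaProfilerApi).

import torch

def train(model, optimizer, data_loader, warmup_steps=10, profiled_steps=5):
    for step, batch in enumerate(data_loader):
        if step == warmup_steps:
            torch.cuda.cudart().cudaProfilerStart()   # nsys begins recording here
        torch.cuda.nvtx.range_push(f"step {step}")    # named span on the nsys timeline
        torch.cuda.nvtx.range_push("forward")
        loss = model(batch).sum()
        torch.cuda.nvtx.range_pop()
        torch.cuda.nvtx.range_push("backward")
        loss.backward()
        torch.cuda.nvtx.range_pop()
        optimizer.step()
        optimizer.zero_grad()
        torch.cuda.nvtx.range_pop()
        if step == warmup_steps + profiled_steps:
            torch.cuda.cudart().cudaProfilerStop()    # nsys stops recording here
            break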
alexarmbr / ring_attn.py
Last active March 15, 2025 05:59
Ring-Flash Attention
"""
test performance and correctness of ring attention vs. single gpu attention
torchrun --nproc-per-node 4 ring_attn.py
using 4 H100s I get:
Rank 0 single gpu attention: 261.78 ms
Rank 0 ring attention: 73.34 ms
"""
import os
import math
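A minimal single-process sketch of the blockwise accumulation that ring attention performs across GPUs: the sequence is split into chunks (one per "rank"), each chunk's queries see the K/V blocks one at a time as they travel around the ring, and the partial results are merged with an online softmax. This is a numerical illustration only, not the gist's distributed implementation, which overlaps passing K/V around the ring with computation.

import torch

def ring_attention_reference(q, k, v, n_chunks=4):
    # q, k, v: (seq_len, head_dim); split the sequence as if sharded across n_chunks ranks
    scale = q.shape[-1] ** -0.5
    q_blocks = q.chunk(n_chunks)
    kv_blocks = list(zip(k.chunk(n_chunks), v.chunk(n_chunks)))
    outputs = []
    for q_blk in q_blocks:  # each "rank" owns one query chunk
        acc = torch.zeros_like(q_blk)                        # running numerator
        denom = torch.zeros(q_blk.shape[0], 1)               # running softmax denominator
        row_max = q_blk.new_full((q_blk.shape[0], 1), float("-inf"))
        for k_blk, v_blk in kv_blocks:                       # K/V blocks arriving around the ring
            scores = q_blk @ k_blk.T * scale
            new_max = torch.maximum(row_max, scores.max(dim=-1, keepdim=True).values)
            correction = torch.exp(row_max - new_max)        # rescale earlier partial results
            p = torch.exp(scores - new_max)
            acc = acc * correction + p @ v_blk
            denom = denom * correction + p.sum(dim=-1, keepdim=True)
            row_max = new_max
        outputs.append(acc / denom)
    return torch.cat(outputs)

if __name__ == "__main__":
    torch.manual_seed(0)
    q, k, v = (torch.randn(128, 64) for _ in range(3))
    expected = torch.softmax(q @ k.T * 64 ** -0.5, dim=-1) @ v
    assert torch.allclose(ring_attention_reference(q, k, v), expected, atol=1e-4)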
alexarmbr / benchmark_para_attn.py
Last active April 23, 2025 17:18
A minimal correctness test and benchmark of Ulysses-style parallel attention from ParaAttention
"""
test performance and correctness of ulysses parallel attention vs single gpu attention
torchrun --nproc-per-node 2 benchmark_para_attn.py
using two H100s I get:
Rank 0 single gpu attention: 1698.14 ms
Rank 0 ulysses attention: 912.84 ms
running pip install para-attn should install everything needed
"""