Add torch.cuda.cudart().cudaProfilerStart()
and torch.cuda.cudart().cudaProfilerStop()
where profiling should start and stop.
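A minimal sketch of the placement, assuming a toy model and illustrative step counts; emit_nvtx() additionally labels each op with an NVTX range so kernels are attributable in the timeline:

import torch

model = torch.nn.Linear(4096, 4096).cuda()
x = torch.randn(64, 4096, device="cuda")

with torch.autograd.profiler.emit_nvtx():   # label ops with NVTX ranges
    for step in range(20):
        if step == 5:                       # skip warmup iterations
            torch.cuda.cudart().cudaProfilerStart()
        model(x).sum().backward()
        if step == 10:                      # a few steady-state steps suffice
            torch.cuda.cudart().cudaProfilerStop()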
Launch the profiler with:

CUDA_VISIBLE_DEVICES=0,1,2,3 \
nsys profile \
-w true \
-t cuda,nvtx,osrt,cudnn,cublas \
--capture-range=cudaProfilerApi \
--capture-range-end=stop \
-o <report_name> \
python <script>.py

(--capture-range=cudaProfilerApi is what makes nsys honor the cudaProfilerStart/Stop calls above; substitute your own report name and script.)
import functools
import torch
import math

def taylor_seer_approximation(WARMUP_STEPS=1, SKIP_INTERVAL_STEPS=1, compute_step_map=None, n_derivatives=2):
    """
    A decorator that approximates the forward pass of an nn.Module to reduce computation.

    Args:
        WARMUP_STEPS: number of steps to compute the actual forward pass before starting approximation
        SKIP_INTERVAL_STEPS: after warmup, recompute the actual forward pass every this many steps and approximate the rest
        compute_step_map: optional per-step override; truthy entries force a real forward pass at that step
        n_derivatives: order of the Taylor expansion used for the approximation
    """
"""
test performance and correctness of ring attention vs. single gpu attention
torchrun --nproc-per-node 4 ring_attn.py
using 4 H100s I get:
Rank 0 single gpu attention: 261.78 ms
Rank 0 ring attention: 73.34 ms
"""
import os
import math
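
# A minimal sketch of the ring attention pass being benchmarked (assumptions:
# torch.distributed is already initialized with NCCL via torchrun, each rank
# holds its sequence shard q, k, v of shape [B, H, S_local, D], and there is
# no causal mask; accumulation stays in the input dtype for brevity):
import torch
import torch.distributed as dist

def ring_attention(q, k, v):
    world, rank = dist.get_world_size(), dist.get_rank()
    scale = q.shape[-1] ** -0.5
    # running max / denominator / numerator of an online (log-sum-exp)
    # softmax, so partial results from each K/V block combine exactly
    m = torch.full(q.shape[:-1], float("-inf"), device=q.device, dtype=q.dtype)
    l = torch.zeros_like(m)
    acc = torch.zeros_like(q)

    for hop in range(world):
        s = q @ k.transpose(-2, -1) * scale            # [B, H, S_q, S_kv]
        m_blk = s.amax(dim=-1)
        p = torch.exp(s - m_blk.unsqueeze(-1))
        m_new = torch.maximum(m, m_blk)
        keep, add = torch.exp(m - m_new), torch.exp(m_blk - m_new)
        l = l * keep + p.sum(dim=-1) * add
        acc = acc * keep.unsqueeze(-1) + (p @ v) * add.unsqueeze(-1)
        m = m_new
        if hop < world - 1:
            # rotate K/V one hop around the ring; Q never moves
            k_next, v_next = torch.empty_like(k), torch.empty_like(v)
            ops = [dist.P2POp(dist.isend, k, (rank + 1) % world),
                   dist.P2POp(dist.isend, v, (rank + 1) % world),
                   dist.P2POp(dist.irecv, k_next, (rank - 1) % world),
                   dist.P2POp(dist.irecv, v_next, (rank - 1) % world)]
            for req in dist.batch_isend_irecv(ops):
                req.wait()
            k, v = k_next, v_next

    return acc / l.unsqueeze(-1)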
""" | |
test performance and correctness of ulysses parallel attention vs single gpu attention | |
torchrun --nproc-per-node 2 benchmark_attn.py | |
using two H100s I get: | |
Rank 0 single gpu attention: 1698.14 ms | |
Rank 0 ulysses attention: 912.84 ms | |
running pip install para-attn should install everything needed | |
""" |