vadimkantorov / logsumexp.py
Last active August 6, 2025 14:09
Example of two-level aggregation for LogSumExp in Triton-lang (only forward pass), created for investigation of https://github.com/volcengine/verl/issues/2899
# Extracted and simplified the two-level aggregation approach (first, parallel aggregation in blocks, then final sequential aggregation) from https://github.com/volcengine/verl/blob/main/verl/utils/kernel/kernels.py
# Examples of single-level sequential, online aggregation approaches:
# - https://github.com/linkedin/Liger-Kernel/blob/main/src/liger_kernel/ops/cross_entropy.py
# - https://github.com/Dao-AILab/flash-attention/blob/main/flash_attn/ops/triton/cross_entropy.py
# logsumexp_torch has some eager pseudocode in PyTorch which emulates what Triton does, except that BLOCK_SIZE_M equals M
# tl.program_id(axis=0).to(tl.int64) is used for https://arxiv.org/abs/2410.10989 and https://github.com/linkedin/Liger-Kernel/blob/05b43a14913ced3776aa3fc50020089b8c0d63c1/src/liger_kernel/ops/cross_entropy.py#L77-L79
# sample_verl.pt is derived from the inputs (logits = torch.matmul(hidden, weights) uploaded by @WindowsXP-Beta in https://github.com/volcengine/verl/issues/2656#issuecomment-3131136498 )
# created for investigation of https://github.com/volcengine/verl/issues/2899
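# below: a minimal eager-PyTorch sketch of the two-level scheme described above (per-block
# max and sum-of-exp first, then a final combine under the global max); the function name,
# the BLOCK_SIZE default, and blocking over the last (reduction) dimension are illustrative
# assumptions, not the gist's actual logsumexp_torch / Triton kernel
import torch

def logsumexp_two_level(x, BLOCK_SIZE = 4096):
    # level 1: per-block statistics over chunks of the last dimension
    blocks = x.split(BLOCK_SIZE, dim = -1)
    block_max = torch.stack([b.amax(dim = -1) for b in blocks], dim = -1)
    block_sumexp = torch.stack([(b - m.unsqueeze(-1)).exp().sum(dim = -1) for b, m in zip(blocks, block_max.unbind(-1))], dim = -1)
    # level 2: merge block statistics under the global max for numerical stability
    global_max = block_max.amax(dim = -1, keepdim = True)
    return global_max.squeeze(-1) + ((block_max - global_max).exp() * block_sumexp).sum(dim = -1).log()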
vadimkantorov / nvidia-pids.sh
Last active July 28, 2025 14:31
Print all CUDA-using PIDs
nvidia-smi -q -x | grep "</pid>" | tr -d "</pid>\t"
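# a hedged Python equivalent of the one-liner above, parsing the XML output properly instead of stripping tag characters with tr:
# python -c 'import subprocess, xml.etree.ElementTree as ET; print(*[p.text for p in ET.fromstring(subprocess.check_output(["nvidia-smi", "-q", "-x"], text=True)).iter("pid")], sep="\n")'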
vadimkantorov / to_.py
Last active July 26, 2025 09:31
Inplace downcasting in PyTorch
# https://github.com/pytorch/pytorch/issues/158710
# https://github.com/pytorch/pytorch/issues/158698
# https://github.com/pytorch/pytorch/issues/69431
import torch
def to_(tensor1d, dtype, *, chunks = 0, split_size = 0):
    # TODO: instead of clone() maybe could copy_ into a buffer, clone() does not allow using a buffer
    # TODO: unclear if this code can support autograd, and if so, whether it will remember too much in saved_for_backward
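    # below: a hedged sketch of the in-place idea (the gist's actual body may differ):
    # reinterpret the original storage as the smaller dtype and fill it chunk by chunk,
    # so writes never overtake the not-yet-read source bytes and the temporary is one chunk
    assert tensor1d.dim() == 1 and tensor1d.is_contiguous() and dtype.itemsize <= tensor1d.dtype.itemsize
    out = tensor1d.view(torch.uint8).view(dtype)[:tensor1d.numel()]
    offset = 0
    for chunk in (tensor1d.chunk(chunks) if chunks else tensor1d.split(split_size or tensor1d.numel())):
        out[offset : offset + chunk.numel()].copy_(chunk.to(dtype)) # chunk.to(...) materializes a small temporary first
        offset += chunk.numel()
    return out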
vadimkantorov / base64_torch.py
Last active August 10, 2025 18:55
Base64 decoding in PyTorch
# https://en.wikipedia.org/wiki/Base64
# 00123456 00ABCDEF 00abcdef 00uvwxyz
# 123456AB CDEFabcd efuvwxyz
# this code does not support batches; adapting to e.g. a concatenated varlen format is possible, but the varlen offsets and paddings would need to be preserved/handled in some way
import torch
def base64_encode_padded(input_as_uint8_tensor):
    base64_alphabet, base64_pad = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/', '='
    device = input_as_uint8_tensor.device
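    # hedged completion of the truncated preview (the gist's actual body may differ):
    # zero-pad the input length to a multiple of 3, pack each 3-byte group into a 24-bit
    # integer, split it into four 6-bit alphabet indices, then write the '=' padding
    npad = (-input_as_uint8_tensor.numel()) % 3
    padded = torch.cat([input_as_uint8_tensor, torch.zeros(npad, dtype = torch.uint8, device = device)]).long().reshape(-1, 3)
    packed = (padded[:, 0] << 16) | (padded[:, 1] << 8) | padded[:, 2]
    sextets = torch.stack([(packed >> 18) & 63, (packed >> 12) & 63, (packed >> 6) & 63, packed & 63], dim = -1).reshape(-1)
    out = torch.tensor(list(base64_alphabet.encode()), dtype = torch.uint8, device = device)[sextets]
    if npad:
        out[-npad:] = ord(base64_pad)
    return out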
vadimkantorov / .nebius.s3cfg
Created July 16, 2025 13:24
s3cmd config .s3cfg for Nebius s3
# Nebius s3
# sudo apt-get install s3cmd # export the vars below or configure access_key / secret_key
# export AWS_ACCESS_KEY_ID=...
# export AWS_SECRET_ACCESS_KEY=...
# https://s3tools.org/usage
# s3cmd -c ~/.nebius.s3cfg ls
[default]
# access_key=...
# secret_key=...
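# a working config for a non-AWS provider typically also needs the endpoint set via the
# standard s3cmd keys below (endpoint values elided here on purpose):
# host_base = ...
# host_bucket = ...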
vadimkantorov / sitecustomize.py
Created July 10, 2025 18:03
Install an OOM hook in PyTorch
# PYTHONPATH=. python ...
import os
import torch
def cuda_oom_hook(device, alloc, device_alloc, device_free, info = dict(oom_counter = 0), snapshot_dump_file_pattern = './memory_snapshot_{pid}_{oom_counter}.pt'):
    memory_summary = torch.cuda.memory_summary(device = device)
    memory_snapshot = torch.cuda.memory._snapshot(device = device)
    pid = os.getpid()
    print('device:', device, 'oom#:', info['oom_counter'], 'pid:', pid, 'alloc:', alloc, 'device_alloc:', device_alloc, 'device_free:', device_free)
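    print(memory_summary)
    # hedged continuation of the truncated preview: dump the snapshot and bump the counter
    info['oom_counter'] += 1
    torch.save(memory_snapshot, snapshot_dump_file_pattern.format(pid = pid, oom_counter = info['oom_counter']))

# attach the hook via PyTorch's private (subject-to-change) OOM observer API, which
# calls back with exactly these (device, alloc, device_alloc, device_free) arguments
torch.cuda.memory._record_memory_history() # record allocation stacks so the snapshot is informative
torch._C._cuda_attach_out_of_memory_observer(cuda_oom_hook)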
vadimkantorov / multiprocessing_pool_batched.py
Last active May 27, 2025 09:45
Example of using multiprocessing with explicitly batched inputs
import multiprocessing
import itertools
inputs = list(range(111))
batchsize = 10
num_workers = 4
batches = itertools.batched(inputs, batchsize)
def reducer(xs):
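    # hedged completion of the truncated preview; the per-batch body and the pool usage
    # below are illustrative (note: itertools.batched requires Python >= 3.12)
    return sum(xs)

if __name__ == '__main__':
    with multiprocessing.Pool(num_workers) as pool:
        print(pool.map(reducer, batches)) # one task per pre-made batch instead of one per item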
vadimkantorov / cache_hf_model.sh
Created May 23, 2025 17:26
Downloads a HuggingFace model and test-imports it onto the meta device (and thus does not use any GPU/CPU memory for the weights)
# Usage: bash cache_hf_model.sh Qwen/Qwen3-8B
# export HF_HOME=/my/cache/HF_HOME
python -c 'import sys, transformers; transformers.AutoModel.from_pretrained(sys.argv[-1], trust_remote_code=True, device_map="meta")' $@
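# a hedged multi-line equivalent of the one-liner above (passing device_map generally
# requires the accelerate package to be installed):
#   import sys, transformers
#   model = transformers.AutoModel.from_pretrained(sys.argv[-1], trust_remote_code = True, device_map = 'meta')
#   print(model) # weights are cached on disk under HF_HOME but materialized only on the meta device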
vadimkantorov / minidotenv.py
Created May 22, 2025 18:40
toml can be abused to read basic key-value .env files as well
def load_dotenv(dotenv_path = '.env'):
    # https://snarky.ca/use-toml-for-env-files/
    # https://github.com/theskumar/python-dotenv
    '''
    # such simple key-value files are a toml subset and can be read via tomllib without external packages or hacks
    a="b"
    c="d"
    '''
    import os, tomllib
    os.environ.update(tomllib.load(open(dotenv_path, 'rb')))
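# hedged usage example (assumes a ./.env file like the docstring above; tomllib requires Python >= 3.11):
# load_dotenv()
# import os; print(os.environ['a']) # -> b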
vadimkantorov / catfsspec.py
Last active June 12, 2025 11:50
Basic example of using fsspec, explaining some quirks when porting from regular Python I/O
import sys, fsspec
with fsspec.open(sys.argv[1], 'rt') as f: # must pass 'rt' explicitly, as in fsspec the default mode is 'rb'
    print(f.read()) # must use a context manager, as in fsspec the result of fsspec.open(...) does not have a read() method
# echo world > hello.txt
# python catfsspec.py hello.txt
# python catfsspec.py file://hello.txt
# python catfsspec.py s3://mybucket/hello.txt