torch.nn.attention.varlen.varlen_attn is the clear winner on GPU in this benchmark
GPU: NVIDIA GeForce RTX 4090 Laptop GPU
PyTorch version: 2.10.0
JAX version: 0.7.2
Config: batch_size=8, num_heads=32, head_dim=128, seq_min=128, seq_max=2048, dtype=bfloat16, is_causal=True, warmup=2, iters=30, seed=42
Seq lens: [299, 1614, 1385, 971, 959, 1777, 293, 1467]
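Variable-length attention kernels consume a packed (ragged) batch described by cumulative sequence offsets rather than a padded [batch, seq] tensor. A minimal sketch of deriving those offsets from the benchmark's sequence lengths (plain Python; the cu_seqlens naming follows the common varlen-attention convention and is an assumption about the exact argument name):

```python
import itertools

# Sequence lengths from the benchmark config above.
seq_lens = [299, 1614, 1385, 971, 959, 1777, 293, 1467]

# Cumulative offsets: cu_seqlens[i] is where sequence i starts in the
# packed token dimension; cu_seqlens[-1] is the total token count.
cu_seqlens = [0] + list(itertools.accumulate(seq_lens))

total_tokens = cu_seqlens[-1]
print(cu_seqlens)   # [0, 299, 1913, 3298, 4269, 5228, 7005, 7298, 8765]
print(total_tokens) # 8765
```

In the packed layout, q/k/v are shaped [total_tokens, num_heads, head_dim] (here [8765, 32, 128]), and the offsets tell the kernel where each of the 8 sequences begins and ends, so no compute is wasted on padding.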