How to quantize 70B model so it will fit on 2x4090 GPUs
I tried EXL2, AutoAWQ, and SqueezeLLM, and they all failed for different reasons (issues opened).
HQQ worked:
I rented a 4x GPU, 1TB RAM instance ($19/hr) on RunPod with a 1024GB container and 1024GB workspace disk.
I think you only need 2x GPUs with 80GB VRAM and 512GB+ of system RAM, so I probably overpaid.
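For a rough sense of why a 4-bit 70B model fits on 2x 24GB 4090s: the 4-bit weights are about 35GB, plus roughly 2GB of group-wise scale/zero metadata (which `offload_meta=True` keeps in CPU RAM anyway), leaving headroom for activations and KV cache within the 48GB total. A quick back-of-the-envelope check; the parameter count and per-group metadata size below are approximate assumptions, not measured numbers:
```python
# Back-of-the-envelope VRAM estimate for a 4-bit HQQ 70B model (approximate figures).
params = 70.6e9              # rough Llama-3-70B parameter count (assumption)
weight_bits = 4              # nbits=4
group_size = 64              # one scale + one zero point per group of 64 weights
meta_bits_per_group = 2 * 8  # assume scale and zero point are each stored as ~8 bits

weights_gb = params * weight_bits / 8 / 1e9
meta_gb = (params / group_size) * meta_bits_per_group / 8 / 1e9

print(f'4-bit weights:       {weights_gb:5.1f} GB')  # ~35 GB
print(f'scale/zero metadata: {meta_gb:5.1f} GB')     # ~2 GB (kept in CPU RAM with offload_meta=True)
print(f'2x RTX 4090 VRAM:    {2 * 24:5.1f} GB')
```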
Note that you need to fill in the access request form to get the gated 70B Meta weights.
You can copy/paste this into the console and it will set everything up automatically:
```bash
# Basic tools
apt update
apt install git-lfs vim -y

# Miniconda + a fresh environment for HQQ
mkdir -p ~/miniconda3
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda3/miniconda.sh
bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
~/miniconda3/bin/conda init bash
source ~/.bashrc
conda create -n hqq python=3.10 -y && conda activate hqq

# Install HQQ from source
git lfs install
git clone https://github.com/mobiusml/hqq.git
cd hqq
pip install torch
pip install .

# Faster Hugging Face downloads + login for the gated Llama-3 repo
pip install "huggingface_hub[hf_transfer]"
export HF_HUB_ENABLE_HF_TRANSFER=1
huggingface-cli login
```
Create the `quantize.py` file by copy/pasting this into the console:
```bash
echo "
import torch
from hqq.core.quantize import *
from hqq.engine.hf import HQQModelForCausalLM
from hqq.models.hf.base import AutoHQQHFModel

model_id = 'meta-llama/Meta-Llama-3-70B-Instruct'
save_dir = 'cat-llama-3-70b-hqq'
compute_dtype = torch.bfloat16

# 4-bit weights in groups of 64, with scale/zero metadata offloaded to CPU RAM
quant_config = BaseQuantizeConfig(nbits=4, group_size=64, offload_meta=True)

# Quantize the scale/zero metadata itself in groups of 128
zero_scale_group_size = 128
quant_config['scale_quant_params']['group_size'] = zero_scale_group_size
quant_config['zero_quant_params']['group_size'] = zero_scale_group_size

# Load the full-precision model, quantize it, and save the result
model = HQQModelForCausalLM.from_pretrained(model_id)
AutoHQQHFModel.quantize_model(model, quant_config=quant_config,
                              compute_dtype=compute_dtype)
AutoHQQHFModel.save_quantized(model, save_dir)

# Reload the quantized checkpoint as a sanity check
model = AutoHQQHFModel.from_quantized(save_dir)
model.eval()
" > quantize.py
```
Run the script:
```bash
python quantize.py
```
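Once it finishes, you can sanity-check the quantized checkpoint with a short generation test. This is a minimal sketch, not part of the original gist: it assumes the model returned by `AutoHQQHFModel.from_quantized` behaves like a standard transformers causal LM (i.e. supports `generate`) and that the tokenizer is loaded from the original gated repo:
```python
import torch
from transformers import AutoTokenizer
from hqq.models.hf.base import AutoHQQHFModel

model_id = 'meta-llama/Meta-Llama-3-70B-Instruct'
save_dir = 'cat-llama-3-70b-hqq'

# Load the quantized checkpoint produced by quantize.py
model = AutoHQQHFModel.from_quantized(save_dir)
model.eval()

# Tokenizer still comes from the original (unquantized) repo
tokenizer = AutoTokenizer.from_pretrained(model_id)

prompt = 'Explain HQQ quantization in one sentence.'
inputs = tokenizer(prompt, return_tensors='pt').to(model.device)

# Assumes the HQQ-wrapped model exposes the usual transformers generate() API
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```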