How to quantize 70B model so it will fit on 2x4090 GPUs
I tried EXL2, AutoAWQ, and SqueezeLLM, and they all failed for different reasons (issues opened).
HQQ worked:
I rented a 4x GPU, 1TB RAM instance ($19/hr) on RunPod with a 1024GB container and 1024GB workspace disk.
I think you only need 2x GPUs with 80GB VRAM and 512GB+ of system RAM, so I probably overpaid.
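For a rough sense of why a 4-bit 70B model fits on 2x 24GB 4090s: the 4-bit weights are about 35GB, plus roughly 2GB of group-wise scale/zero metadata (which `offload_meta=True` keeps in CPU RAM anyway), leaving headroom for activations and KV cache within the 48GB total. A quick back-of-the-envelope check; the parameter count and per-group metadata size below are approximate assumptions, not measured numbers:
```python
# Back-of-the-envelope VRAM estimate for a 4-bit HQQ 70B model (approximate figures).
params = 70.6e9              # rough Llama-3-70B parameter count (assumption)
weight_bits = 4              # nbits=4
group_size = 64              # one scale + one zero point per group of 64 weights
meta_bits_per_group = 2 * 8  # assume scale and zero point are each stored as ~8 bits

weights_gb = params * weight_bits / 8 / 1e9
meta_gb = (params / group_size) * meta_bits_per_group / 8 / 1e9

print(f'4-bit weights:       {weights_gb:5.1f} GB')  # ~35 GB
print(f'scale/zero metadata: {meta_gb:5.1f} GB')     # ~2 GB (kept in CPU RAM with offload_meta=True)
print(f'2x RTX 4090 VRAM:    {2 * 24:5.1f} GB')
```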
Note that you need to fill in the access request form to get the gated 70B Meta weights.
You can copy/paste this into the console and it will set everything up automatically:
```bash
# Basic tools
apt update
apt install git-lfs vim -y

# Miniconda + a fresh environment for HQQ
mkdir -p ~/miniconda3
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda3/miniconda.sh
bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
~/miniconda3/bin/conda init bash
source ~/.bashrc
conda create -n hqq python=3.10 -y && conda activate hqq

# Install HQQ from source
git lfs install
git clone https://github.com/mobiusml/hqq.git
cd hqq
pip install torch
pip install .

# Faster Hugging Face downloads + login for the gated Llama-3 repo
pip install "huggingface_hub[hf_transfer]"
export HF_HUB_ENABLE_HF_TRANSFER=1
huggingface-cli login
```
Create the `quantize.py` file by copy/pasting this into the console:
```bash
echo "
import torch
from hqq.core.quantize import *
from hqq.engine.hf import HQQModelForCausalLM
from hqq.models.hf.base import AutoHQQHFModel

model_id = 'meta-llama/Meta-Llama-3-70B-Instruct'
save_dir = 'cat-llama-3-70b-hqq'
compute_dtype = torch.bfloat16

# 4-bit weights in groups of 64, with scale/zero metadata offloaded to CPU RAM
quant_config = BaseQuantizeConfig(nbits=4, group_size=64, offload_meta=True)

# Quantize the scale/zero metadata itself in groups of 128
zero_scale_group_size = 128
quant_config['scale_quant_params']['group_size'] = zero_scale_group_size
quant_config['zero_quant_params']['group_size'] = zero_scale_group_size

# Load the full-precision model, quantize it, and save the result
model = HQQModelForCausalLM.from_pretrained(model_id)
AutoHQQHFModel.quantize_model(model, quant_config=quant_config,
                              compute_dtype=compute_dtype)
AutoHQQHFModel.save_quantized(model, save_dir)

# Reload the quantized checkpoint as a sanity check
model = AutoHQQHFModel.from_quantized(save_dir)
model.eval()
" > quantize.py
```
Run the script:
```bash
python quantize.py
```
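Once it finishes, you can sanity-check the quantized checkpoint with a short generation test. This is a minimal sketch, not part of the original gist: it assumes the model returned by `AutoHQQHFModel.from_quantized` behaves like a standard transformers causal LM (i.e. supports `generate`) and that the tokenizer is loaded from the original gated repo:
```python
import torch
from transformers import AutoTokenizer
from hqq.models.hf.base import AutoHQQHFModel

model_id = 'meta-llama/Meta-Llama-3-70B-Instruct'
save_dir = 'cat-llama-3-70b-hqq'

# Load the quantized checkpoint produced by quantize.py
model = AutoHQQHFModel.from_quantized(save_dir)
model.eval()

# Tokenizer still comes from the original (unquantized) repo
tokenizer = AutoTokenizer.from_pretrained(model_id)

prompt = 'Explain HQQ quantization in one sentence.'
inputs = tokenizer(prompt, return_tensors='pt').to(model.device)

# Assumes the HQQ-wrapped model exposes the usual transformers generate() API
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```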