@whjms
Last active April 7, 2023 16:35

Running KoboldAI in 8-bit mode

tl;dr: use Linux, install bitsandbytes (either globally or in KAI's conda env), and add load_in_8bit=True, device_map="auto" to the model pipeline creation calls.

Many people are unable to load models because of their GPU's limited VRAM. These models contain billions of parameters (model weights and biases), each of which is a 32-bit (or 16-bit) float. Thanks to the hard work of some researchers [1], it's possible to run these models with 8-bit weights, which halves the VRAM required compared to running in half-precision. For example, a model that needs 16GB of VRAM in 16-bit mode needs only about 8GB with 8-bit inference.
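
To see where the halving comes from, here is a rough back-of-the-envelope estimate of the weight memory (a sketch only; the 6B parameter count is just an illustrative assumption, and activations/runtime overhead are not counted):

# Rough VRAM estimate for the model weights alone.
params = 6e9  # example: a ~6B-parameter model (illustrative assumption)

for name, bytes_per_param in [("fp32", 4), ("fp16", 2), ("int8", 1)]:
    gib = params * bytes_per_param / 1024**3
    print(f"{name}: ~{gib:.1f} GiB")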

This guide was written for KoboldAI 1.19.1 and tested on Ubuntu 20.04. These instructions are based on work by Gmin in KoboldAI's Discord server and on Hugging Face's efficient LM inference guide.

Requirements

  • KoboldAI (KAI) must be running on Linux
  • Must use an NVIDIA GPU that supports 8-bit tensor cores (Turing, Ampere, or newer architectures - e.g. T4, RTX 20/30 series, A40, A100); see the check sketched after this list
  • CPU RAM must be large enough to load the entire model in memory (KAI has some optimizations to incrementally load the model, but 8-bit mode seems to break this)
  • GPU must have at least ~1/2 of the model's recommended VRAM requirement; the model cannot be split between GPU and CPU.
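
If you are unsure whether your GPU's architecture qualifies, one quick check (a sketch using PyTorch, which KAI's environment already includes; Turing corresponds to compute capability 7.5) is:

import torch

# Turing is compute capability 7.5, Ampere is 8.x; anything >= (7, 5) should
# have the int8 tensor cores that bitsandbytes relies on.
major, minor = torch.cuda.get_device_capability(0)
print(f"Compute capability: {major}.{minor}")
print("int8 tensor cores available" if (major, minor) >= (7, 5)
      else "GPU architecture is older than Turing")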

Getting started

Installing bitsandbytes

bitsandbytes is a Python library that manages low-level 8-bit operations for model inference. Add bitsandbytes to the environments/huggingface.yml file, under the pip section. Your file should look something like this:

name: koboldai
channels:
  # ...
dependencies:
  # ...
  - pip:
    - bitsandbytes  # <---- add this
    # ...

Next, install bitsandbytes in KoboldAI's environment with bin/micromamba install -f environments/huggingface.yml -r runtime -n koboldai. The output should look something like this:

...
Requirement already satisfied: MarkupSafe>=2.0 in /home/...
Installing collected packages: bitsandbytes
Successfully installed bitsandbytes-0.35.1
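
To confirm the install is usable from KAI's environment, a minimal smoke test (just a sketch; run it with the Python interpreter from that environment) is:

# A clean import (no CUDA setup errors printed) means the install is usable.
import bitsandbytes
print("bitsandbytes imported OK")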

Code changes

Make the following changes to aiserver.py:

  1. Under class vars:, set lazy_load to False:

    class vars:
      # ...
      debug       = False # If set to true, will send debug information to the client for display
      lazy_load   = False # <--- change this
      # ...
  2. Under reset_model_settings(), set vars.lazy_load to False also:

    def reset_model_settings():
      # ...
      vars.lazy_load = False # <--- change this
  3. Edit the model-loading line in aiserver.py (the AutoModelForCausalLM.from_pretrained(...) call shown below) to add load_in_8bit=True and device_map="auto":

                                                                                                     # vvvvvvvvvvv add these vvvvvvvvvvvvvvv #
    model     = AutoModelForCausalLM.from_pretrained("models/{}".format(vars.model.replace('/', '_')), load_in_8bit=True, device_map="auto", revision=vars.revision, cache_dir="cache", **lowmem)
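
For context, this is what those two arguments do in a minimal standalone Transformers call (a sketch outside of KoboldAI; the model id facebook/opt-6.7b is only an example, not something this guide requires):

# Standalone sketch outside KoboldAI; "facebook/opt-6.7b" is only an example id.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "facebook/opt-6.7b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    load_in_8bit=True,   # quantize the weights to int8 via bitsandbytes
    device_map="auto",   # let accelerate place layers on the available GPU(s)
)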

Go!

Start KoboldAI normally. Set all model layers to GPU, as we cannot split the model between CPU and GPU.

The changes we made do not apply to GPT-2 models, nor to models loaded from custom directories (but you can enable 8-bit loading for custom directories by adding the load_in_8bit/device_map parameters to the appropriate AutoModelForCausalLM.from_pretrained() calls, as sketched below).
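
A rough sketch of what that might look like for a custom directory (the variable name vars.custmodpth is an assumption here and may differ between KAI versions; keep whatever path argument the existing call already uses):

# Hypothetical custom-directory variant; vars.custmodpth is an assumed name for
# the path KoboldAI passes at that call site - keep the existing first argument.
model = AutoModelForCausalLM.from_pretrained(
    vars.custmodpth,
    load_in_8bit=True,    # same two additions as in step 3
    device_map="auto",
    revision=vars.revision,
    cache_dir="cache",
)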


1: LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale (Tim Dettmers, Mike Lewis, Younes Belkada, Luke Zettlemoyer)


Ph0rk0z commented Mar 9, 2023

Threshold of 1+ allows me to use this and keep generating.
