@whjms
Last active April 7, 2023 16:35

Running KoboldAI in 8-bit mode

tl;dr: use Linux, install bitsandbytes (either globally or in KAI's conda env), and add load_in_8bit=True, device_map="auto" to the model pipeline creation calls.

Many people are unable to load models because of their GPU's limited VRAM. These models contain billions of parameters (model weights and biases), each of which is a 32-bit (or 16-bit) float. Thanks to the hard work of some researchers [1], it's possible to run these models with 8-bit weights, which halves the VRAM required compared to running in half-precision. For example, a model that needs 16GB of VRAM in 16-bit mode needs only about 8GB with 8-bit inference.
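
To see where the halving comes from, here is a rough back-of-the-envelope estimate of the weight memory (a sketch only; the 6B parameter count is just an illustrative assumption, and activations/runtime overhead are not counted):

# Rough VRAM estimate for the model weights alone.
params = 6e9  # example: a ~6B-parameter model (illustrative assumption)

for name, bytes_per_param in [("fp32", 4), ("fp16", 2), ("int8", 1)]:
    gib = params * bytes_per_param / 1024**3
    print(f"{name}: ~{gib:.1f} GiB")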

This guide was written for KoboldAI 1.19.1 and tested on Ubuntu 20.04. These instructions are based on work by Gmin in KoboldAI's Discord server and on Hugging Face's efficient LM inference guide.

Requirements

  • KoboldAI (KAI) must be running on Linux
  • Must use an NVIDIA GPU that supports 8-bit tensor cores (Turing, Ampere, or newer architectures - e.g. T4, RTX 20/30 series, A40, A100); see the check sketched after this list
  • CPU RAM must be large enough to load the entire model in memory (KAI has some optimizations to incrementally load the model, but 8-bit mode seems to break this)
  • GPU must have at least ~1/2 of the model's recommended VRAM requirement; the model cannot be split between GPU and CPU.
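
If you are unsure whether your GPU's architecture qualifies, one quick check (a sketch using PyTorch, which KAI's environment already includes; Turing corresponds to compute capability 7.5) is:

import torch

# Turing is compute capability 7.5, Ampere is 8.x; anything >= (7, 5) should
# have the int8 tensor cores that bitsandbytes relies on.
major, minor = torch.cuda.get_device_capability(0)
print(f"Compute capability: {major}.{minor}")
print("int8 tensor cores available" if (major, minor) >= (7, 5)
      else "GPU architecture is older than Turing")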

Getting started

Installing bitsandbytes

bitsandbytes is a Python library that manages low-level 8-bit operations for model inference. Add bitsandbytes to the environments/huggingface.yml file, under the pip section. Your file should look something like this:

name: koboldai
channels:
  # ...
dependencies:
  # ...
  - pip:
    - bitsandbytes  # <---- add this
    # ...

Next, install bitsandbytes in KoboldAI's environment with bin/micromamba install -f environments/huggingface.yml -r runtime -n koboldai. The output should look something like this:

...
Requirement already satisfied: MarkupSafe>=2.0 in /home/...
Installing collected packages: bitsandbytes
Successfully installed bitsandbytes-0.35.1
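
To confirm the install is usable from KAI's environment, a minimal smoke test (just a sketch; run it with the Python interpreter from that environment) is:

# A clean import (no CUDA setup errors printed) means the install is usable.
import bitsandbytes
print("bitsandbytes imported OK")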

Code changes

Make the following changes to aiserver.py:

  1. Under class vars:, set lazy_load to False:

    class vars:
      # ...
      debug       = False # If set to true, will send debug information to the client for display
      lazy_load   = False # <--- change this
      # ...
  2. Under reset_model_settings(), set vars.lazy_load to False also:

    def reset_model_settings():
      # ...
      vars.lazy_load = False # <--- change this
  3. Edit the model-loading line in aiserver.py (the AutoModelForCausalLM.from_pretrained(...) call shown below) to add load_in_8bit=True and device_map="auto":

                                                                                                     # vvvvvvvvvvv add these vvvvvvvvvvvvvvv #
    model     = AutoModelForCausalLM.from_pretrained("models/{}".format(vars.model.replace('/', '_')), load_in_8bit=True, device_map="auto", revision=vars.revision, cache_dir="cache", **lowmem)
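
For context, this is what those two arguments do in a minimal standalone Transformers call (a sketch outside of KoboldAI; the model id facebook/opt-6.7b is only an example, not something this guide requires):

# Standalone sketch outside KoboldAI; "facebook/opt-6.7b" is only an example id.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "facebook/opt-6.7b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    load_in_8bit=True,   # quantize the weights to int8 via bitsandbytes
    device_map="auto",   # let accelerate place layers on the available GPU(s)
)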

Go!

Start KoboldAI normally. Set all model layers to GPU, as we cannot split the model between CPU and GPU.

The changes we made do not apply to GPT-2 models, nor to models loaded from custom directories (but you can enable 8-bit loading for custom directories by adding the load_in_8bit/device_map parameters to the appropriate AutoModelForCausalLM.from_pretrained() calls, as sketched below).
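
A rough sketch of what that might look like for a custom directory (the variable name vars.custmodpth is an assumption here and may differ between KAI versions; keep whatever path argument the existing call already uses):

# Hypothetical custom-directory variant; vars.custmodpth is an assumed name for
# the path KoboldAI passes at that call site - keep the existing first argument.
model = AutoModelForCausalLM.from_pretrained(
    vars.custmodpth,
    load_in_8bit=True,    # same two additions as in step 3
    device_map="auto",
    revision=vars.revision,
    cache_dir="cache",
)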


1: LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale (Tim Dettmers, Mike Lewis, Younes Belkada, Luke Zettlemoyer)


Ph0rk0z commented Mar 9, 2023

Threshold of 1+ allows me to use this and keep generating.
