Quantization reduces the precision of the model's weights from 32-bit floating point to 8-bit (or even 4-bit) representations, shrinking memory use at only a slight cost in accuracy. Ensure the `bitsandbytes` library is installed.
```bash
pip install bitsandbytes
```
Use the `BitsAndBytesConfig` class of the [[HuggingFace]] `transformers` library.
```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 8-bit quantization
quant_config = BitsAndBytesConfig(load_in_8bit=True)

# 4-bit quantization (overrides the 8-bit config above; use one or the other)
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,   # quantize the quantization constants too
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4",        # 4-bit NormalFloat data type
)

base_model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL,  # model ID or path, defined elsewhere
    quantization_config=quant_config,
    device_map="auto",  # use GPU if available
)
```
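To see why the accuracy cost is small, here is a minimal sketch of the idea behind 8-bit quantization (plain absmax scaling, not the actual `bitsandbytes` internals): each float is mapped to one of 256 integer levels, and only a scale factor is kept to recover approximate floats later.

```python
# Illustrative sketch of absmax int8 quantization -- NOT the bitsandbytes
# implementation, just the core round-trip idea.

def quantize_int8(weights):
    """Map floats to int8 levels using absmax scaling."""
    scale = max(abs(w) for w in weights) / 127  # largest magnitude -> 127
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize_int8(q, scale):
    """Recover approximate floats from int8 values and the scale."""
    return [v * scale for v in q]

weights = [0.12, -0.5, 0.33, 0.9, -0.07]
q, scale = quantize_int8(weights)
recovered = dequantize_int8(q, scale)
# The round trip loses at most half a quantization step per weight --
# the "slight" accuracy reduction mentioned above.
```

The maximum round-trip error is half a step (`scale / 2`), which is why quantized models stay close to full-precision quality.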