Quantization reduces the precision of the model's weights, typically from 32-bit floating-point numbers down to 8-bit (or even 4-bit) values. Accuracy drops, but usually only slightly.

Ensure the `bitsandbytes` library is installed:

```bash
pip install bitsandbytes
```

Use the `BitsAndBytesConfig` class from the [[HuggingFace]] `transformers` library:

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

# 8-bit quantization
quant_config = BitsAndBytesConfig(load_in_8bit=True)

# 4-bit quantization (note: this overwrites the 8-bit config above;
# pick one or the other)
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,    # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4"          # 4-bit NormalFloat data type
)

base_model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL,                        # model name or path, defined elsewhere
    quantization_config=quant_config,
    device_map="auto"                  # place layers on GPU if available
)
```
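The core idea can be illustrated with a toy absmax (symmetric) quantizer: scale the floats into the signed 8-bit range `[-127, 127]`, round, and rescale on the way back. This is a simplified sketch of the principle only; `bitsandbytes` uses more sophisticated block-wise schemes, and all names below are illustrative.

```python
def quantize_absmax(weights):
    """Map float weights to int8 values plus a per-tensor scale factor."""
    scale = max(abs(w) for w in weights) / 127
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the int8 values."""
    return [x * scale for x in q]

weights = [0.5, -1.2, 0.03, 0.9]
q, scale = quantize_absmax(weights)
approx = dequantize(q, scale)

# Each recovered weight is close to, but not exactly, the original --
# rounding to 8 bits is where the small accuracy loss comes from.
errors = [abs(a - w) for a, w in zip(approx, weights)]
```

The worst-case error per weight is half the scale step, which is why the accuracy loss is usually small relative to the 4x memory saving over 32-bit floats.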