Mistral inference on local GPU hits OOM with 13B model
Last week I tried running the 13B Mistral model on a single RTX 3060 (12 GB). python main.py
crashed instantly with:
RuntimeError: CUDA out of memory. Tried to allocate 10.2 GiB
torch.cuda.mem_get_info() showed only 11.7 GB free.
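For anyone reproducing this, the free/total numbers come from a call like the following (a minimal check, assuming device 0 is the 3060):

import torch

free, total = torch.cuda.mem_get_info()  # (free, total) in bytes for the current device
print(f"free {free / 2**30:.1f} GiB / total {total / 2**30:.1f} GiB")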
Fix:
- loaded the model with 4-bit quantization (bitsandbytes):
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit weights, fp16 compute for the forward pass
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-13B-Instruct-v0.2", quantization_config=bnb, device_map="auto")
- set torch.cuda.set_per_process_memory_fraction(0.92, 0) to leave some headroom for the tokenizer and other overhead
- exported PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:64 to reduce allocator fragmentation (both shown together in the sketch after this list)
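Here is how those two settings fit together in one script, a minimal sketch assuming a single Python process; the allocator config has to be in place before the first CUDA allocation, so set it before touching torch (or export it in the shell as above):

import os
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:64")  # must be set before the allocator initializes

import torch
torch.cuda.set_per_process_memory_fraction(0.92, 0)  # cap this process at ~92% of device 0's VRAM
# ...then build the BitsAndBytesConfig and call from_pretrained as in the snippet above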
The script stayed under 9.4 GB and completed inference at 18 tokens/s.
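For context, the tokens/s figure is easy to measure yourself; a rough timing snippet, assuming the model loaded above and a throwaway prompt, looks like this:

import time
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("mistralai/Mistral-13B-Instruct-v0.2")
inputs = tok("Explain KV caching in one paragraph.", return_tensors="pt").to(model.device)
start = time.time()
out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
new_tokens = out.shape[1] - inputs["input_ids"].shape[1]
print(f"{new_tokens / (time.time() - start):.1f} tokens/s")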
If memory still blows up, try use_safetensors=True
on load; loading the pickled .bin weights spikes host RAM before the tensors ever reach the GPU.
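A sketch of that fallback, reusing the bnb config from above and assuming the repo actually publishes .safetensors shards:

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-13B-Instruct-v0.2",
    quantization_config=bnb,
    device_map="auto",
    use_safetensors=True,    # skip the pickled .bin files entirely
    low_cpu_mem_usage=True,  # stream shards instead of materializing all weights in host RAM
)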