Mistral inference on local GPU hits OOM with 13B model
Last week I tried running the 13B Mistral model on a single RTX 3060 (12 GB). python main.py
crashed instantly with:
RuntimeError: CUDA out of memory. Tried to allocate 10.2 GiB
torch.cuda.mem_get_info() showed only 11.7 GB free.
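For anyone reproducing this, the free/total numbers come from a call like the following (a minimal check, assuming device 0 is the 3060):

import torch

free, total = torch.cuda.mem_get_info()  # (free, total) in bytes for the current device
print(f"free {free / 2**30:.1f} GiB / total {total / 2**30:.1f} GiB")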
Fix:
- loaded the model with 4-bit quantization (bitsandbytes):
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit weights, fp16 compute for the forward pass
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-13B-Instruct-v0.2", quantization_config=bnb, device_map="auto")
- set torch.cuda.set_per_process_memory_fraction(0.92, 0) to leave some headroom for the tokenizer and other overhead
- exported PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:64 to reduce allocator fragmentation (both shown together in the sketch after this list)
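Here is how those two settings fit together in one script, a minimal sketch assuming a single Python process; the allocator config has to be in place before the first CUDA allocation, so set it before touching torch (or export it in the shell as above):

import os
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:64")  # must be set before the allocator initializes

import torch
torch.cuda.set_per_process_memory_fraction(0.92, 0)  # cap this process at ~92% of device 0's VRAM
# ...then build the BitsAndBytesConfig and call from_pretrained as in the snippet above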
The script stayed under 9.4 GB and completed inference at 18 tokens/s.
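For context, the tokens/s figure is easy to measure yourself; a rough timing snippet, assuming the model loaded above and a throwaway prompt, looks like this:

import time
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("mistralai/Mistral-13B-Instruct-v0.2")
inputs = tok("Explain KV caching in one paragraph.", return_tensors="pt").to(model.device)
start = time.time()
out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
new_tokens = out.shape[1] - inputs["input_ids"].shape[1]
print(f"{new_tokens / (time.time() - start):.1f} tokens/s")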
If memory still blows up, try use_safetensors=True
on load; loading the pickled .bin weights spikes host RAM before the tensors ever reach the GPU.
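A sketch of that fallback, reusing the bnb config from above and assuming the repo actually publishes .safetensors shards:

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-13B-Instruct-v0.2",
    quantization_config=bnb,
    device_map="auto",
    use_safetensors=True,    # skip the pickled .bin files entirely
    low_cpu_mem_usage=True,  # stream shards instead of materializing all weights in host RAM
)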