Gradio app stalls on LLaMA 2 inference with 8‑bit quant
Spun up a quick Gradio demo around LLaMA‑2‑7B‑chat‑hf using 8‑bit quantisation. Prompt box froze after I hit “Submit”; no tokens returned.
```python
from transformers import AutoModelForCausalLM

# Original loading code: 8-bit quantisation via bitsandbytes, layers auto-placed on the GPU
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",
    load_in_8bit=True,
    device_map="auto",
)
```
Environment: RTX 3090, CUDA 11.8, torch 2.0.1+cu118, bitsandbytes 0.41.0, transformers 4.31.0, Gradio 3.34.0, Linux Mint 21.2.
`nvidia-smi` showed memory allocated but zero GPU utilisation, and there was no error in the console.
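A bare `generate()` call outside Gradio is a quick way to tell whether the stall is in the model or in the UI layer. A minimal sketch, reusing `model` from the snippet above (the prompt and token count are arbitrary):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(model.device)

# If this returns promptly, the model itself is fine and the hang is on the Gradio side.
output_ids = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```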
Solved by:
- upgrading bitsandbytes to 0.41.1
- setting `bnb_4bit_compute_dtype=torch.float16` in the `quantization_config`
- calling `model.eval()` before launching the interface (combined snippet below)
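Putting those steps together, a rough sketch of the loading code that worked for me; `BitsAndBytesConfig` comes from transformers, the rest matches the snippet above:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 8-bit load with an explicit compute dtype, as listed in the fix above
quant_config = BitsAndBytesConfig(
    load_in_8bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",
    quantization_config=quant_config,
    device_map="auto",
)
model.eval()  # switch to inference mode before wiring up the Gradio interface
```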
Throughput jumped to 32 tokens/s and Gradio responded instantly.
If it still hangs after the first request, set `queue=False` when calling `gr.Interface` to bypass the event queue.
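For completeness, a rough sketch of the Gradio wiring, assuming the `model` and `tokenizer` from the snippets above and a hypothetical `generate_reply` wrapper; whether `gr.Interface` accepts `queue` directly depends on the Gradio version, so it is left as a comment:

```python
import gradio as gr

def generate_reply(prompt: str) -> str:
    # Wraps the quantised model from above: tokenize, generate, decode.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=256)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

demo = gr.Interface(
    fn=generate_reply,
    inputs=gr.Textbox(label="Prompt"),
    outputs=gr.Textbox(label="Response"),
    # queue=False,  # per the tip above; version-dependent whether Interface takes this kwarg
)
demo.launch()
```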