Gradio app stalls on LLaMA 2 inference with 8‑bit quant

Spun up a quick Gradio demo around LLaMA‑2‑7B‑chat‑hf with 8‑bit quantisation. The prompt box froze after I hit “Submit” and no tokens ever came back. The model was loaded like this:

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",
    load_in_8bit=True,   # 8-bit quantisation via bitsandbytes
    device_map="auto"    # place layers on the GPU automatically
)

Environment: RTX 3090, CUDA 11.8, torch 2.0.1+cu118, bitsandbytes 0.41.0, transformers 4.31.0, Gradio 3.34.0, Linux Mint 21.2.

nvidia‑smi showed memory allocated but zero utilisation. No error in the console.

Solved by (a sketch of the revised loading code follows the list):

  • upgrading bitsandbytes from 0.41.0 to 0.41.1
  • setting bnb_4bit_compute_dtype=torch.float16 in the quantization_config
  • calling model.eval() before launching the interface
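
Roughly what the loading code looks like after those changes, once bitsandbytes is upgraded with pip install -U bitsandbytes==0.41.1. The exact BitsAndBytesConfig wiring below is a sketch rather than a verbatim copy of my script (bnb_4bit_compute_dtype nominally targets 4-bit loading, but setting it in the config is part of what unstuck things here):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# quantisation settings from the bullets above (sketch, not verbatim)
bnb_config = BitsAndBytesConfig(
    load_in_8bit=True,
    bnb_4bit_compute_dtype=torch.float16
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",
    quantization_config=bnb_config,
    device_map="auto"
)
model.eval()   # switch to inference mode before wiring up Gradio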

Throughput jumped to 32 tokens/s and Gradio responded instantly.

If it hangs again after the first request, set queue=False when calling gr.Interface to bypass the event queue.
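
For reference, a minimal sketch of how the model gets wired into the interface; the answer() helper, max_new_tokens, and the plain text components are placeholders, not the exact demo code:

import gradio as gr
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")

def answer(prompt):
    # tokenise the prompt and generate on the same device as the model
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output = model.generate(**inputs, max_new_tokens=256)
    return tokenizer.decode(output[0], skip_special_tokens=True)

demo = gr.Interface(fn=answer, inputs="text", outputs="text")
demo.launch()   # if it still hangs after the first request, disable the queue as above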
