Gradio app stalls on LLaMA 2 inference with 8‑bit quant

Spun up a quick Gradio demo around LLaMA‑2‑7B‑chat‑hf with 8‑bit quantisation. The prompt box froze after I hit “Submit” and no tokens ever came back. The model was loaded like this:

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",
    load_in_8bit=True,   # 8-bit quantisation via bitsandbytes
    device_map="auto"    # place layers on the GPU automatically
)

Environment: RTX 3090, CUDA 11.8, torch 2.0.1+cu118, bitsandbytes 0.41.0, transformers 4.31.0, Gradio 3.34.0, Linux Mint 21.2.

nvidia‑smi showed memory allocated but zero utilisation. No error in the console.

Solved by (a sketch of the revised loading code follows the list):

  • upgrading bitsandbytes from 0.41.0 to 0.41.1
  • setting bnb_4bit_compute_dtype=torch.float16 in the quantization_config
  • calling model.eval() before launching the interface
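
Roughly what the loading code looks like after those changes, once bitsandbytes is upgraded with pip install -U bitsandbytes==0.41.1. The exact BitsAndBytesConfig wiring below is a sketch rather than a verbatim copy of my script (bnb_4bit_compute_dtype nominally targets 4-bit loading, but setting it in the config is part of what unstuck things here):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# quantisation settings from the bullets above (sketch, not verbatim)
bnb_config = BitsAndBytesConfig(
    load_in_8bit=True,
    bnb_4bit_compute_dtype=torch.float16
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",
    quantization_config=bnb_config,
    device_map="auto"
)
model.eval()   # switch to inference mode before wiring up Gradio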

Throughput jumped to 32 tokens/s and Gradio responded instantly.

If it hangs again after the first request, set queue=False when calling gr.Interface to bypass the event queue.
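
For reference, a minimal sketch of how the model gets wired into the interface; the answer() helper, max_new_tokens, and the plain text components are placeholders, not the exact demo code:

import gradio as gr
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")

def answer(prompt):
    # tokenise the prompt and generate on the same device as the model
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output = model.generate(**inputs, max_new_tokens=256)
    return tokenizer.decode(output[0], skip_special_tokens=True)

demo = gr.Interface(fn=answer, inputs="text", outputs="text")
demo.launch()   # if it still hangs after the first request, disable the queue as above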
