Llama2
- Gradio app stalls on LLaMA 2 inference with 8‑bit quant
Spun up a quick Gradio demo around LLaMA‑2‑7B‑chat‑hf using 8‑bit quantisation. The prompt box froze after I hit “Submit” and no tokens came back.

    from transformers import AutoModelForCausalLM

    # Load the chat model in 8-bit via bitsandbytes; let accelerate place the layers
    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-2-7b-chat-hf",
        load_in_8bit=True,
        device_map="auto",
    )

Environment: GPU RTX 3090, CUDA 11.8, torch 2.0.1+cu118, bitsandbytes 0.41.0, transformers 4.31.0, Gradio 3.34.0, Linux Mint 21.2.
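One way to narrow this down might be to time the generation call on its own, outside Gradio, to see whether the stall is in the 8‑bit model or in the UI layer. Below is a rough sketch of a timeout wrapper for that. `timed_call` is a hypothetical helper name, not part of any library, and the commented usage assumes a `tokenizer` and `inputs` that are not shown in the post.

```python
import concurrent.futures
import time


def timed_call(fn, timeout_s=60.0):
    """Run fn() in a worker thread and report (finished, result, seconds).

    Hypothetical diagnostic helper: if fn() has not returned after
    timeout_s seconds, finished is False and result is None.
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    start = time.perf_counter()
    future = pool.submit(fn)
    try:
        result = future.result(timeout=timeout_s)
        finished = True
    except concurrent.futures.TimeoutError:
        result = None
        finished = False
    elapsed = time.perf_counter() - start
    # A stuck worker thread cannot be killed; it keeps running in the
    # background, so we only stop waiting for it here.
    pool.shutdown(wait=False)
    return finished, result, elapsed


# Assumed usage with the model from the post (tokenizer/inputs are assumptions):
#   inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
#   finished, out, secs = timed_call(
#       lambda: model.generate(**inputs, max_new_tokens=16), timeout_s=120
#   )
#   print(finished, round(secs, 1))
```

If the bare `generate` call finishes in reasonable time here, the problem is more likely on the Gradio side; if it times out, the 8‑bit load or generation itself is the place to look.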