Hugging Face Transformers crawling on WSL2 with CUDA
Ran a simple AutoModelForCausalLM generate on a T4 but got one token every five seconds.
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch, time

# falls back to cpu silently when torch is a CPU-only build
device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").to(device)

t0 = time.time()
model.generate(**tokenizer("hello", return_tensors="pt").to(device))
print(time.time() - t0)  # ~5 seconds for a single token
GPU showed zero utilisation.
Setup: Ubuntu 22.04 on WSL2, NVIDIA driver 531.79 installed on Windows, and nvidia-smi inside WSL worked. The culprit: torch was 1.13.1+cpu, a CPU-only wheel installed by accident.
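A CPU-only build announces itself in the version string, so a one-liner catches it early (torch.version.cuda comes back None on CPU wheels):

python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"

On the broken install this prints 1.13.1+cpu None False.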
Fixed it by:
- installing the WSL‑specific CUDA driver from NVIDIA (it lives on the Windows side; the WSL guest gets no separate driver)
- replacing the accidental CPU wheel with a CUDA build:
  pip3 install --upgrade --index-url https://download.pytorch.org/whl/cu117 torch==2.0.1+cu117 torchvision torchaudio
- verifying with
  python -c "import torch; print(torch.cuda.is_available())"
  which now prints True
- re‑running: generation time dropped to ~0.03 seconds per token (timing sketch below)
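For the before/after numbers, a rough timing sketch; the torch.cuda.synchronize() calls matter because CUDA launches are asynchronous and the wall clock would otherwise stop too early. Model and prompt are just the ones from the repro:

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch, time

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").to("cuda")
inputs = tokenizer("hello", return_tensors="pt").to("cuda")

n_new = 32
torch.cuda.synchronize()  # flush pending kernels before starting the clock
t0 = time.time()
model.generate(**inputs, max_new_tokens=n_new)
torch.cuda.synchronize()  # wait for generation to actually finish on the GPU
print(f"{(time.time() - t0) / n_new:.3f} s/token")  # ~0.03 on the T4 after the fix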
If it still crawls, check that the model itself is pinned to the GPU with model.to("cuda"); moving only the inputs over isn't enough. And watch the version string: pip hands out CPU-only wheels easily, and the only giveaway is the +cpu suffix.
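A quick way to confirm where the weights actually live (assuming the model object from the repro above):

print(next(model.parameters()).device)  # cuda:0 when pinned correctly, cpu otherwise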