vLLM tokenizer mismatch on finetuned Mistral model
Spent the afternoon benchmarking a Mistral‑7B finetune with vLLM. First prompt returned gibberish tokens.
from vllm import LLM

# finetuned checkpoint paired with the base tokenizer -- this is the mismatch
llm = LLM(model="./mistral-ft", tokenizer="mistralai/Mistral-7B-Instruct-v0.2")
print(llm.generate("Hello")[0].outputs[0].text)
Output contained repeating � characters.
Ubuntu 22.04, CUDA 12.1, vllm 0.4.3, transformers 4.40.0, Python 3.11. Checkpoint trained with --trust-remote-code.
Issue: the finetune ships a merged tokenizer JSON, but vLLM kept using the cached default base tokenizer vocabulary. The mismatched vocabulary breaks the byte-pair merges, which is what produces the � output.
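A quick way to confirm the diagnosis is to tokenize the same string with the base tokenizer and with the one shipped in the finetune; differing IDs or vocab sizes mean vLLM must not fall back to the cached base vocabulary. A minimal sketch, assuming the checkpoint lives in ./mistral-ft:

from transformers import AutoTokenizer

# Base tokenizer from the Hub vs. the merged tokenizer shipped with the finetune
# ("./mistral-ft" is a placeholder for the local checkpoint directory).
base_tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
ft_tok = AutoTokenizer.from_pretrained("./mistral-ft")

text = "Hello"
base_ids = base_tok.encode(text)
ft_ids = ft_tok.encode(text)
print("base IDs:    ", base_ids)
print("finetune IDs:", ft_ids)
print("vocab sizes: ", len(base_tok), "vs", len(ft_tok))

if base_ids != ft_ids or len(base_tok) != len(ft_tok):
    print("tokenizer mismatch -- decoding with the wrong vocab will produce garbage")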
Fix:
- delete the cached tokenizer folder:
~/.cache/huggingface/hub/models--mistralai--Mistral-7B-Instruct-v0.2
- run with an explicit local tokenizer path:
llm = LLM(model="./mistral-ft", tokenizer="./mistral-ft")
- add --tokenizer-mode slow when launching python -m vllm.entrypoints.openai.api_server (a Python equivalent is sketched below)
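The same settings can be applied from Python instead of the CLI; a minimal sketch, assuming the finetuned checkpoint sits in ./mistral-ft:

from vllm import LLM, SamplingParams

# Point both model and tokenizer at the finetuned checkpoint and force the
# slow (Python) tokenizer so the merged tokenizer JSON is honored.
llm = LLM(model="./mistral-ft", tokenizer="./mistral-ft", tokenizer_mode="slow")

params = SamplingParams(temperature=0.0, max_tokens=32)
outputs = llm.generate(["Hello"], params)
print(outputs[0].outputs[0].text)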
Tokens now match the reference output and perplexity dropped back into the expected range.
If the mismatch persists, verify that tokenizer_config.json has "model_max_length": 32768; older configs ship 2048 and cause truncation.
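For a quick check of that field, a small sketch that reads the config from the (assumed) local checkpoint directory:

import json
from pathlib import Path

# "./mistral-ft" is a placeholder for the actual finetune directory.
cfg = json.loads(Path("./mistral-ft/tokenizer_config.json").read_text())
max_len = cfg.get("model_max_length")
print("model_max_length:", max_len)
if max_len is None or max_len < 32768:
    print("warning: prompts longer than", max_len, "tokens will be truncated; expected 32768")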