vLLM tokenizer mismatch on finetuned Mistral model
Spent the afternoon benchmarking a Mistral‑7B finetune with vLLM. First prompt returned gibberish tokens.
from vllm import LLM

# finetuned checkpoint paired with the base tokenizer -- this is the mismatch
llm = LLM(model="./mistral-ft", tokenizer="mistralai/Mistral-7B-Instruct-v0.2")
print(llm.generate("Hello")[0].outputs[0].text)
Output contained repeating � characters.
Ubuntu 22.04, CUDA 12.1, vllm 0.4.3, transformers 4.40.0, Python 3.11. Checkpoint trained with --trust-remote-code.
Issue: the finetune ships a merged tokenizer JSON, but vLLM kept using the cached default base tokenizer vocabulary. The mismatched vocabulary breaks the byte-pair merges, which is what produces the � output.
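A quick way to confirm the diagnosis is to tokenize the same string with the base tokenizer and with the one shipped in the finetune; differing IDs or vocab sizes mean vLLM must not fall back to the cached base vocabulary. A minimal sketch, assuming the checkpoint lives in ./mistral-ft:

from transformers import AutoTokenizer

# Base tokenizer from the Hub vs. the merged tokenizer shipped with the finetune
# ("./mistral-ft" is a placeholder for the local checkpoint directory).
base_tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
ft_tok = AutoTokenizer.from_pretrained("./mistral-ft")

text = "Hello"
base_ids = base_tok.encode(text)
ft_ids = ft_tok.encode(text)
print("base IDs:    ", base_ids)
print("finetune IDs:", ft_ids)
print("vocab sizes: ", len(base_tok), "vs", len(ft_tok))

if base_ids != ft_ids or len(base_tok) != len(ft_tok):
    print("tokenizer mismatch -- decoding with the wrong vocab will produce garbage")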
Fix:
- delete the cached tokenizer folder:
~/.cache/huggingface/hub/models--mistralai--Mistral-7B-Instruct-v0.2
- run with an explicit local tokenizer path:
llm = LLM(model="./mistral-ft", tokenizer="./mistral-ft")
- add --tokenizer-mode slow when launching python -m vllm.entrypoints.openai.api_server (a Python equivalent is sketched below)
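The same settings can be applied from Python instead of the CLI; a minimal sketch, assuming the finetuned checkpoint sits in ./mistral-ft:

from vllm import LLM, SamplingParams

# Point both model and tokenizer at the finetuned checkpoint and force the
# slow (Python) tokenizer so the merged tokenizer JSON is honored.
llm = LLM(model="./mistral-ft", tokenizer="./mistral-ft", tokenizer_mode="slow")

params = SamplingParams(temperature=0.0, max_tokens=32)
outputs = llm.generate(["Hello"], params)
print(outputs[0].outputs[0].text)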
Tokens now match the reference output and perplexity dropped back into the expected range.
If the mismatch persists, verify that tokenizer_config.json has "model_max_length": 32768; older configs ship 2048 and cause truncation.
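For a quick check of that field, a small sketch that reads the config from the (assumed) local checkpoint directory:

import json
from pathlib import Path

# "./mistral-ft" is a placeholder for the actual finetune directory.
cfg = json.loads(Path("./mistral-ft/tokenizer_config.json").read_text())
max_len = cfg.get("model_max_length")
print("model_max_length:", max_len)
if max_len is None or max_len < 32768:
    print("warning: prompts longer than", max_len, "tokens will be truncated; expected 32768")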