Mistral
- vLLM tokenizer mismatch on finetuned Mistral model
Spent the afternoon benchmarking a Mistral-7B finetune with vLLM. The first prompt returned gibberish tokens.

    from vllm import LLM

    llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2", tokenizer="mistralai/Mistral-7B-Instruct-v0.2")
    print(llm.generate("Hello"))

The output contained repeating � characters.

Environment: Ubuntu 22.04, CUDA 12.1, vllm 0.4.3, transformers 4.40.0, Python 3.11. The checkpoint was trained with --trust-remote-code.
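One way to narrow down a report like this is to compare the tokenizer against the checkpoint's configured vocabulary size and try a round trip on a short string. The sketch below is only a diagnostic aid, not a confirmed cause of the � output: `check_tokenizer` is a hypothetical helper, and it assumes a Hugging Face style tokenizer object that supports `len()`, `encode(text, add_special_tokens=False)` and `decode(ids)`.

```python
def check_tokenizer(tokenizer, model_vocab_size, sample="Hello"):
    """Return a list of problems found; an empty list means none were detected.

    Hypothetical helper for illustration, not part of vLLM or transformers.
    """
    problems = []

    # A tokenizer larger than the model's embedding table can emit ids
    # the model was never trained on (for example, tokens added during finetuning).
    if len(tokenizer) > model_vocab_size:
        problems.append(
            f"tokenizer has {len(tokenizer)} entries, model expects {model_vocab_size}"
        )

    ids = tokenizer.encode(sample, add_special_tokens=False)

    # Any id outside the embedding table is a direct mismatch.
    out_of_range = [i for i in ids if i >= model_vocab_size]
    if out_of_range:
        problems.append(f"token ids out of range: {out_of_range}")

    # Encoding then decoding a plain ASCII string should give the string back.
    decoded = tokenizer.decode(ids)
    if decoded.strip() != sample.strip():
        problems.append(f"round trip changed text: {sample!r} -> {decoded!r}")

    return problems


# Possible usage with transformers (assumed setup; the path is a placeholder):
#   from transformers import AutoConfig, AutoTokenizer
#   path = "path/to/finetuned-checkpoint"
#   tok = AutoTokenizer.from_pretrained(path)
#   cfg = AutoConfig.from_pretrained(path)
#   print(check_tokenizer(tok, cfg.vocab_size))
```

Running the check against both the finetuned checkpoint directory and the base model name would show whether the two tokenizers actually differ.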
- Mistral inference on local GPU hits OOM with 13B model
Last week I tried the 13B Mistral model on a single RTX 3060 (12 GB). Running python main.py crashed instantly:

    RuntimeError: CUDA out of memory. Tried to allocate 10.2 GiB

torch.cuda.mem_get_info() showed only 11.7 GB free.
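A rough back-of-the-envelope check shows why a 13B-parameter model is a tight fit on a 12 GB card. The sketch below estimates weight memory only (parameters times bytes per parameter) and ignores the KV cache, activations and framework overhead, so real usage will be higher. The helper names are hypothetical, and the 11.7 figure from the post is treated as an approximate amount of free memory.

```python
GIB = 2**30  # bytes per GiB


def weight_gib(n_params, bytes_per_param):
    """Memory needed for the model weights alone, in GiB."""
    return n_params * bytes_per_param / GIB


def weights_fit(n_params, bytes_per_param, free_gib):
    """True if the weights alone fit in the given free memory.

    This is a necessary condition, not a sufficient one: the KV cache
    and activations need additional room.
    """
    return weight_gib(n_params, bytes_per_param) <= free_gib


if __name__ == "__main__":
    n_params = 13e9   # 13B parameters, as stated in the post
    free_gib = 11.7   # approximate free memory reported in the post

    for label, bytes_per_param in [("fp16", 2.0), ("int8", 1.0), ("4-bit", 0.5)]:
        size = weight_gib(n_params, bytes_per_param)
        verdict = "fits" if weights_fit(n_params, bytes_per_param, free_gib) else "does not fit"
        print(f"{label}: {size:.1f} GiB of weights, {verdict} in {free_gib} GiB")
```

Under these assumptions, half-precision weights alone are about twice the card's capacity, so quantization or CPU offload would be needed before the model could load at all.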