Ollama – v0.20.0: tokenizer: add byte fallback for SentencePiece BPE encoding (#15232)
🚨 Ollama v0.20.0 is live! 🚨
This release ships a big tokenizer upgrade, a must-have for accuracy and reliability, especially when dealing with non-ASCII or rare characters. Here’s what’s new:
🔹 Byte fallback for SentencePiece BPE
→ When a character can’t be tokenized via standard BPE merges, Ollama now falls back to encoding each UTF-8 byte as a `<0xHH>` token (e.g., `€` → `<0xE2><0x82><0xAC>`).
→ No more silent character loss! 🛡️
🔹 Decoding updated too
→ The decoder now correctly reconstructs the original bytes from `<0xHH>` tokens—ensuring perfect round-trip fidelity. ✅
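The encode/decode behavior above can be sketched in a few lines. This is an illustrative toy, not Ollama's actual implementation: the vocabulary, function names, and character-level (rather than merge-based) lookup are all simplifying assumptions; it only shows how per-byte `<0xHH>` fallback tokens preserve round-trip fidelity.

```python
def encode_with_byte_fallback(text, vocab):
    """Toy tokenizer: characters in vocab pass through; anything else
    falls back to one <0xHH> token per UTF-8 byte (illustrative only)."""
    tokens = []
    for ch in text:
        if ch in vocab:
            tokens.append(ch)
        else:
            # Byte fallback: emit one token per UTF-8 byte of the character.
            tokens.extend(f"<0x{b:02X}>" for b in ch.encode("utf-8"))
    return tokens

def decode_with_byte_fallback(tokens):
    """Reassemble text, mapping <0xHH> tokens back to raw bytes."""
    out = bytearray()
    for tok in tokens:
        if tok.startswith("<0x") and tok.endswith(">"):
            out.append(int(tok[3:-1], 16))  # parse the hex byte value
        else:
            out.extend(tok.encode("utf-8"))
    return out.decode("utf-8")

vocab = {"a", "b", "c"}  # toy vocabulary with no entry for "€"
toks = encode_with_byte_fallback("ab€", vocab)
print(toks)  # ['a', 'b', '<0xE2>', '<0x82>', '<0xAC>']
print(decode_with_byte_fallback(toks))  # ab€
```

Note how `€` (U+20AC) becomes its three UTF-8 bytes rather than being dropped, and the decoder recovers it exactly.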
🔧 Fixes:
- #15229 (dropped chars on encode)
- #15231 (decoder crashes with unknown tokens)
This upgrade boosts robustness across multilingual, technical, or edge-case text—making local LLM inference even more reliable. 🧠⚡
Upgrade now and keep your tokens tight! 📦✨
