Ollama – v0.20.0: tokenizer: add byte fallback for SentencePiece BPE encoding (#15232)

🚨 Ollama v0.20.0 is live! 🚨

This release ships a big tokenizer upgrade, a must-have for accuracy and reliability, especially when dealing with non-ASCII or rare characters. Here’s what’s new:

🔹 Byte fallback for SentencePiece BPE

→ When a character can’t be tokenized via standard BPE merges, Ollama now falls back to encoding each UTF-8 byte as a `<0xHH>` token (e.g., `€` → `<0xE2><0x82><0xAC>`).

→ No more silent character loss! 🛡️
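The fallback described above can be sketched in a few lines. This is an illustrative example, not Ollama's actual implementation; the function name is hypothetical.

```go
package main

import "fmt"

// byteFallback encodes each UTF-8 byte of s as a "<0xHH>" token,
// the same scheme the release notes describe for characters that
// have no BPE merge. Hypothetical name, not Ollama's real API.
func byteFallback(s string) []string {
	tokens := make([]string, 0, len(s))
	for _, b := range []byte(s) {
		tokens = append(tokens, fmt.Sprintf("<0x%02X>", b))
	}
	return tokens
}

func main() {
	// € is encoded as the three UTF-8 bytes 0xE2 0x82 0xAC.
	fmt.Println(byteFallback("€")) // [<0xE2> <0x82> <0xAC>]
}
```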

🔹 Decoding updated too

→ The decoder now correctly reconstructs the original bytes from `<0xHH>` tokens—ensuring perfect round-trip fidelity. ✅
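The decoding side is the mirror image: spot `<0xHH>` tokens, convert them back to raw bytes, and pass everything else through. A minimal sketch, again with a hypothetical function name rather than Ollama's real API:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// decodeByteTokens rebuilds the original string from a token stream,
// turning "<0xHH>" tokens back into raw bytes so byte-fallback text
// round-trips exactly. Illustrative sketch only.
func decodeByteTokens(tokens []string) string {
	var buf []byte
	for _, t := range tokens {
		// A byte token looks like "<0xE2>": 6 chars, hex in positions 3-4.
		if len(t) == 6 && strings.HasPrefix(t, "<0x") && strings.HasSuffix(t, ">") {
			if v, err := strconv.ParseUint(t[3:5], 16, 8); err == nil {
				buf = append(buf, byte(v))
				continue
			}
		}
		// Ordinary tokens contribute their text unchanged.
		buf = append(buf, t...)
	}
	return string(buf)
}

func main() {
	fmt.Println(decodeByteTokens([]string{"<0xE2>", "<0x82>", "<0xAC>"})) // €
}
```

Note that the decoder accumulates raw bytes and only converts to a string at the end; decoding each `<0xHH>` token to a string individually would corrupt multi-byte characters.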

🔧 Fixes:

  • #15229 (dropped chars on encode)
  • #15231 (decoder crashes with unknown tokens)

This upgrade boosts robustness on multilingual, technical, and edge-case text, making local LLM inference even more reliable. 🧠⚡

Upgrade now and keep your tokens tight! 📦✨

🔗 View Release