Ollama – v0.23.1: mlx: Gemma4 MTP speculative decoding (#15980)

Ollama v0.23.1 is officially live, and it’s bringing some serious speed boosts for Apple Silicon fans! 🚀 If you’ve been looking to squeeze more tokens per second out of your local LLMs, this update is a massive win for performance.

The star of the show is support for MTP (Multi-Token Prediction) speculative decoding for the Gemma 4 model family on the MLX backend. In speculative decoding, a lightweight draft model proposes several tokens ahead and the main model verifies them in a single forward pass, so accepted tokens arrive in batches instead of costing one full decoding step each. The upshot: much faster inference on Mac hardware!

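If you want to gauge the speedup for yourself, `ollama run --verbose` prints eval-rate stats, which makes before/after comparisons easy. The model tag below is illustrative; substitute whichever Gemma 4 model you've pulled or created:

```bash
# Hypothetical model tag, for illustration only.
# --verbose prints prompt/eval rates, so you can compare tokens/sec
# with and without an MTP draft model attached.
ollama run gemma4 --verbose "Summarize speculative decoding in one paragraph."
```
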
Here’s the breakdown of what’s new:

  • Gemma 4 Optimization: Full support for MTP speculative decoding is now active, significantly boosting generation speed.
  • New `DRAFT` Command: You can now use a new `DRAFT` instruction in your `Modelfile` to specify exactly which draft model to use for speculation.
  • Streamlined Model Creation: It’s now easier than ever to import `safetensors`-based Gemma 4 draft models directly via the `ollama create` command.
  • New Quantization Flag: The `ollama create` command now includes a `--quantize-draft` flag, making it simple to manage lightweight draft models (see the combined example after this list).
  • Under-the-Hood Upgrades: Includes updated rotating cache support to handle MTP correctly and enhanced sampling support for better draft model token prediction.

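To put the new pieces together, here's a minimal sketch of a `Modelfile` using the new `DRAFT` instruction alongside an `ollama create` invocation. The paths and model names are hypothetical, the exact `DRAFT` syntax is an assumption based on the description above, and the `q4_K_M` argument assumes `--quantize-draft` mirrors the existing `--quantize` flag:

```bash
# Hypothetical paths and tags, for illustration only.
cat > Modelfile <<'EOF'
# Import a safetensors-based Gemma 4 model (directory of weights)
FROM ./gemma4-27b
# Assumed syntax for the new instruction: DRAFT <path-or-model>
DRAFT ./gemma4-draft
EOF

# Create the model and quantize the draft on import. The q4_K_M argument is an
# assumption: it presumes --quantize-draft mirrors the existing --quantize flag.
ollama create my-gemma4 -f Modelfile --quantize-draft q4_K_M
```
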
If you’re running on a Mac, definitely grab this update and start experimenting with those lightning-fast generations! 🛠️✨
