Ollama – v0.23.1: mlx: Gemma4 MTP speculative decoding (#15980)
Ollama v0.23.1 is officially live, and it’s bringing some serious speed boosts for Apple Silicon fans! 🚀 If you’ve been looking to squeeze more tokens per second out of your local LLMs, this update is a massive win for performance.
The star of the show is support for MTP (Multi-Token Prediction) speculative decoding specifically for the Gemma 4 model family using MLX. This means much faster inference speeds on Mac hardware!
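To see why this speeds things up, here's a minimal Python sketch of the draft-and-verify loop behind speculative decoding. This is illustrative only, not Ollama's actual implementation: `target` and `draft` stand in for the big and small models, and a real MTP head predicts several tokens in one forward pass rather than looping.

```python
def speculative_decode(target, draft, prompt, k=4, steps=8):
    """Sketch of speculative decoding with a cheap draft model.

    `target` and `draft` are deterministic callables mapping a token
    sequence to the next token (stand-ins for real models). With greedy
    decoding, the output matches running `target` alone -- the draft
    model only changes speed, never the result.
    """
    tokens = list(prompt)
    for _ in range(steps):
        # 1. The cheap draft model proposes k tokens.
        proposed = []
        for _ in range(k):
            proposed.append(draft(tokens + proposed))
        # 2. The target model verifies the proposals left to right,
        #    accepting the longest prefix it agrees with.
        accepted = []
        for tok in proposed:
            if target(tokens + accepted) == tok:
                accepted.append(tok)
            else:
                # 3. On the first mismatch, take the target's own token.
                accepted.append(target(tokens + accepted))
                break
        else:
            # All k drafts accepted: the target contributes a bonus token.
            accepted.append(target(tokens + accepted))
        tokens.extend(accepted)
    return tokens
```

The win comes from step 2: the target model can verify several drafted tokens in a single batched pass, so when the draft model guesses well you get multiple tokens for roughly the cost of one target forward pass.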
Here’s the breakdown of what’s new:
- Gemma 4 Optimization: Full support for MTP speculative decoding is now active, significantly boosting generation speed.
- New `DRAFT` Command: You can now use a new `DRAFT` instruction in your `Modelfile` to specify exactly which draft model to use for speculation.
- Streamlined Model Creation: It’s now easier than ever to import `safetensors`-based Gemma 4 draft models directly via the `ollama create` command.
- New Quantization Flag: The `ollama create` command now includes a `--quantize-draft` flag, making it simple to manage lightweight draft models.
- Under-the-Hood Upgrades: Includes updated rotating cache support to handle MTP correctly and enhanced sampling support for better draft model token prediction.
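Putting the new pieces together, a workflow might look like the following sketch. The model names, draft model, and quantization level here are assumptions for illustration; only the `DRAFT` instruction and `--quantize-draft` flag come from the release notes, so check the Ollama docs for the exact syntax.

```
# Modelfile (hypothetical model names)
FROM gemma4
DRAFT ./gemma4-draft    # safetensors-based Gemma 4 draft model
```

```shell
# Import the draft model and quantize it in one step
# (quantization level is illustrative)
ollama create my-fast-gemma4 -f Modelfile --quantize-draft q4_K_M
ollama run my-fast-gemma4
```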
If you’re running on a Mac, definitely grab this update and start experimenting with those lightning-fast generations! 🛠️✨
