Ollama – v0.17.8-rc2: mlx: perf improvements (#14768)

πŸš€ Ollama v0.17.8-rc2 is here β€” and it brings major MLX performance boosts for Apple Silicon users!

This release is all about speed and efficiency on M1/M2/M3 chips, thanks to smarter use of Apple’s MLX framework. Here’s what’s new:

πŸ”Ή Layer Norm Got a Power-Up

β†’ Ditched the 6-step manual layer norm (mean β†’ subtract β†’ variance β†’ rsqrt β†’ multiply β†’ add)

β†’ Now uses `mlx_fast_layer_norm` β€” a native, optimized kernel. Way faster and cleaner!
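For context, here is a NumPy sketch of the six manual steps the fused kernel replaces β€” an illustration of the math only, not Ollama's actual code or the MLX API:

```python
import numpy as np

def manual_layer_norm(x, weight, bias, eps=1e-5):
    # The six explicit steps fused away by the native kernel:
    mean = x.mean(axis=-1, keepdims=True)               # 1. mean
    centered = x - mean                                 # 2. subtract
    var = (centered ** 2).mean(axis=-1, keepdims=True)  # 3. variance
    inv_std = 1.0 / np.sqrt(var + eps)                  # 4. rsqrt
    scaled = centered * inv_std * weight                # 5. multiply (scale)
    return scaled + bias                                # 6. add (shift)
```

A fused kernel such as `mlx_fast_layer_norm` computes the same result in a single pass, avoiding the intermediate tensors each of these six steps would otherwise allocate.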

πŸ”Ή GQA Just Got Smarter

β†’ Removed custom `RepeatKV` tiling logic for Grouped-Query Attention (GQA)

β†’ Now leverages `scaled_dot_product_attention`, which natively supports GQA β€” as long as `n_q_heads % n_kv_heads == 0`.
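To see why the explicit tiling is unnecessary, here's a hedged NumPy sketch (not Ollama's code): when `n_q_heads % n_kv_heads == 0`, query heads can simply be grouped so that each KV head serves its share of query heads, giving the same result as materializing a `RepeatKV`-style copy of K and V:

```python
import numpy as np

def gqa_attention(q, k, v):
    # q: (n_q_heads, seq, d); k, v: (n_kv_heads, seq, d)
    n_q, n_kv = q.shape[0], k.shape[0]
    assert n_q % n_kv == 0, "GQA requires n_q_heads % n_kv_heads == 0"
    group = n_q // n_kv

    # Group query heads instead of tiling K/V with an explicit RepeatKV:
    qg = q.reshape(n_kv, group, q.shape[1], q.shape[2])

    scale = 1.0 / np.sqrt(q.shape[-1])
    scores = np.einsum('hgqd,hkd->hgqk', qg, k) * scale

    # Numerically stable softmax over the key axis
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)

    out = np.einsum('hgqk,hkd->hgqd', weights, v)
    return out.reshape(n_q, *q.shape[1:])
```

Because K and V are never duplicated, memory use drops with the ratio `n_q_heads / n_kv_heads` β€” which is exactly what a GQA-aware `scaled_dot_product_attention` kernel exploits.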

βœ… Result?

⚑ Faster inference

🧠 Lower memory usage

✨ Cleaner, more maintainable code

Perfect for devs and tinkerers pushing their Macs to the limit! πŸπŸ’»

Let us know if you’d like a deep dive into how GQA + native attention works under the hood! 🧠⚑

πŸ”— View Release