Ollama – v0.17.8-rc2: mlx: perf improvements (#14768)

πŸš€ Ollama v0.17.8-rc2 is here β€” and it brings major MLX performance boosts for Apple Silicon users!

This release is all about speed and efficiency on M1/M2/M3 chips, thanks to smarter use of Apple’s MLX framework. Here’s what’s new:

πŸ”Ή Layer Norm Got a Power-Up

β†’ Ditched the 6-step manual layer norm (mean β†’ subtract β†’ variance β†’ rsqrt β†’ multiply β†’ add)

β†’ Now uses `mlx_fast_layer_norm` β€” a native, optimized kernel. Way faster and cleaner!
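For context, here is a NumPy sketch of the six manual steps the fused kernel replaces β€” an illustration of the math only, not Ollama's actual code or the MLX API:

```python
import numpy as np

def manual_layer_norm(x, weight, bias, eps=1e-5):
    # The six explicit steps fused away by the native kernel:
    mean = x.mean(axis=-1, keepdims=True)               # 1. mean
    centered = x - mean                                 # 2. subtract
    var = (centered ** 2).mean(axis=-1, keepdims=True)  # 3. variance
    inv_std = 1.0 / np.sqrt(var + eps)                  # 4. rsqrt
    scaled = centered * inv_std * weight                # 5. multiply (scale)
    return scaled + bias                                # 6. add (shift)
```

A fused kernel such as `mlx_fast_layer_norm` computes the same result in a single pass, avoiding the intermediate tensors each of these six steps would otherwise allocate.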

πŸ”Ή GQA Just Got Smarter

β†’ Removed custom `RepeatKV` tiling logic for Grouped-Query Attention (GQA)

β†’ Now leverages `scaled_dot_product_attention`, which natively supports GQA β€” as long as `n_q_heads % n_kv_heads == 0`.
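To see why the explicit tiling is unnecessary, here's a hedged NumPy sketch (not Ollama's code): when `n_q_heads % n_kv_heads == 0`, query heads can simply be grouped so that each KV head serves its share of query heads, giving the same result as materializing a `RepeatKV`-style copy of K and V:

```python
import numpy as np

def gqa_attention(q, k, v):
    # q: (n_q_heads, seq, d); k, v: (n_kv_heads, seq, d)
    n_q, n_kv = q.shape[0], k.shape[0]
    assert n_q % n_kv == 0, "GQA requires n_q_heads % n_kv_heads == 0"
    group = n_q // n_kv

    # Group query heads instead of tiling K/V with an explicit RepeatKV:
    qg = q.reshape(n_kv, group, q.shape[1], q.shape[2])

    scale = 1.0 / np.sqrt(q.shape[-1])
    scores = np.einsum('hgqd,hkd->hgqk', qg, k) * scale

    # Numerically stable softmax over the key axis
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)

    out = np.einsum('hgqk,hkd->hgqd', weights, v)
    return out.reshape(n_q, *q.shape[1:])
```

Because K and V are never duplicated, memory use drops with the ratio `n_q_heads / n_kv_heads` β€” which is exactly what a GQA-aware `scaled_dot_product_attention` kernel exploits.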

βœ… Result?

⚑ Faster inference

🧠 Lower memory usage

✨ Cleaner, more maintainable code

Perfect for devs and tinkerers pushing their Macs to the limit! πŸπŸ’»

Let us know if you’d like a deep dive into how GQA + native attention works under the hood! 🧠⚑

πŸ”— View Release