Ollama – v0.17.8-rc2: mlx: perf improvements (#14768)
Ollama v0.17.8-rc2 is here, and it brings major MLX performance boosts for Apple Silicon users!
This release is all about speed and efficiency on M1/M2/M3 chips, thanks to smarter use of Apple's MLX framework. Here's what's new:
Layer Norm Got a Power-Up
- Ditched the 6-step manual layer norm (mean → subtract → variance → rsqrt → multiply → add)
- Now uses `mlx_fast_layer_norm`, a native, optimized kernel. Way faster and cleaner!
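For intuition, here is a minimal NumPy sketch of the 6-step manual pipeline that the fused kernel replaces (NumPy is used only for illustration; the actual change swaps these steps for a single native MLX kernel call):

```python
import numpy as np

def manual_layer_norm(x, weight, bias, eps=1e-5):
    """The 6-step manual layer norm the release removes:
    mean -> subtract -> variance -> rsqrt -> multiply -> add."""
    mean = x.mean(axis=-1, keepdims=True)               # 1. mean
    centered = x - mean                                 # 2. subtract
    var = (centered ** 2).mean(axis=-1, keepdims=True)  # 3. variance
    inv_std = 1.0 / np.sqrt(var + eps)                  # 4. rsqrt
    normed = centered * inv_std                         # 5. multiply by inv std
    return normed * weight + bias                       # 6. scale by weight, add bias

x = np.random.randn(2, 8).astype(np.float32)
w = np.ones(8, dtype=np.float32)
b = np.zeros(8, dtype=np.float32)
out = manual_layer_norm(x, w, b)
# Each output row has (approximately) zero mean and unit variance.
```

A fused kernel computes the same result in one pass over the data, avoiding five intermediate tensors and their memory traffic, which is where the speedup comes from.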
GQA Just Got Smarter
- Removed custom `RepeatKV` tiling logic for Grouped-Query Attention (GQA)
- Now leverages `scaled_dot_product_attention`, which natively supports GQA as long as `n_q_heads % n_kv_heads == 0`
The result?
- Faster inference
- Lower memory usage
- Cleaner, more maintainable code
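The GQA change above can be sketched in NumPy (illustrative only; the real code calls MLX's native attention kernel). The point is that materializing repeated K/V tensors, as the old `RepeatKV` path did, produces exactly the same output as grouping query heads and broadcasting K/V, so the copies can be dropped:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v, scale):
    # q, k, v: (heads, seq_len, head_dim)
    scores = softmax(q @ k.transpose(0, 2, 1) * scale, axis=-1)
    return scores @ v

n_q_heads, n_kv_heads, seq_len, head_dim = 8, 2, 4, 16
rng = np.random.default_rng(0)
q = rng.standard_normal((n_q_heads, seq_len, head_dim))
k = rng.standard_normal((n_kv_heads, seq_len, head_dim))
v = rng.standard_normal((n_kv_heads, seq_len, head_dim))
scale = head_dim ** -0.5
rep = n_q_heads // n_kv_heads  # valid because n_q_heads % n_kv_heads == 0

# Old path: tile K/V up to n_q_heads (the RepeatKV copies), then plain MHA.
out_repeat = attention(q, np.repeat(k, rep, axis=0),
                          np.repeat(v, rep, axis=0), scale)

# GQA path: group query heads so each KV head serves `rep` of them,
# broadcasting K/V instead of copying them.
qg = q.reshape(n_kv_heads, rep, seq_len, head_dim)
scores = softmax(qg @ k[:, None].transpose(0, 1, 3, 2) * scale, axis=-1)
out_group = (scores @ v[:, None]).reshape(n_q_heads, seq_len, head_dim)

assert np.allclose(out_repeat, out_group)
```

Since the outputs match, a kernel that supports grouped heads natively can skip the K/V copies entirely, which explains both the speed and the memory wins listed above.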
Perfect for devs and tinkerers pushing their Macs to the limit!
Let us know if you'd like a deep dive into how GQA + native attention works under the hood!
