Ollama – v0.17.8-rc2: mlx: perf improvements (#14768)
Ollama v0.17.8-rc2 is here, and it's bringing major MLX performance boosts for Apple Silicon users!
This release is all about speed and efficiency on M1/M2/M3 chips, thanks to smarter use of Apple's MLX framework. Here's what's new:
Layer Norm Got a Power-Up
- Ditched the six-step manual layer norm (mean → subtract → variance → rsqrt → multiply → add)
- Now uses `mlx_fast_layer_norm`, a native, optimized kernel. Way faster and cleaner!
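For reference, here is what those six manual steps compute. This is a minimal pure-Python sketch of standard layer norm, not Ollama's actual code; the fused kernel produces the same result in a single call:

```python
import math

def layer_norm(x, weight, bias, eps=1e-5):
    """Reference layer norm: the six manual steps that the release
    replaces with one fused native-kernel call."""
    n = len(x)
    mean = sum(x) / n                                   # 1. mean
    centered = [v - mean for v in x]                    # 2. subtract
    var = sum(v * v for v in centered) / n              # 3. variance
    inv_std = 1.0 / math.sqrt(var + eps)                # 4. rsqrt
    scaled = [v * inv_std * w for v, w in zip(centered, weight)]  # 5. multiply
    return [s + b for s, b in zip(scaled, bias)]        # 6. add

out = layer_norm([1.0, 2.0, 3.0, 4.0], [1.0] * 4, [0.0] * 4)
```

A fused kernel avoids materializing the intermediate arrays between steps, which is where the speedup comes from.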
GQA Just Got Smarter
- Removed custom `RepeatKV` tiling logic for Grouped-Query Attention (GQA)
- Now leverages `scaled_dot_product_attention`, which natively supports GQA as long as `n_q_heads % n_kv_heads == 0`
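That divisibility condition is what lets each KV head be shared by a fixed-size group of query heads, so the attention kernel can broadcast KV heads instead of tiling repeated copies. A small illustrative sketch (the helper name is hypothetical, not from Ollama's source):

```python
def kv_head_for(q_head, n_q_heads, n_kv_heads):
    """Map a query head to the KV head it shares under GQA.
    Valid only when n_q_heads % n_kv_heads == 0, the same
    condition the native attention path requires."""
    assert n_q_heads % n_kv_heads == 0
    group_size = n_q_heads // n_kv_heads
    return q_head // group_size

# 8 query heads sharing 2 KV heads:
# heads 0-3 read KV head 0, heads 4-7 read KV head 1
mapping = [kv_head_for(q, n_q_heads=8, n_kv_heads=2) for q in range(8)]
```

Because the mapping is a pure index computation, no duplicated K/V tensors need to exist in memory, which is where the memory savings come from.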
Result?
- Faster inference
- Lower memory usage
- Cleaner, more maintainable code
Perfect for devs and tinkerers pushing their Macs to the limit!
Let us know if you'd like a deep dive into how GQA + native attention works under the hood!
