MLX-LM – v0.30.3

MLX LM v0.30.3 just dropped and it's a beast 🚀

  • AWQ & GPTQ quantization now fully supported: load quantized models like it's nothing.
  • New models: IQuest Coder V1 Loop (code gen on steroids) + GLM4 MoE Lite (lightweight but mighty).
  • Nemotron Super 49B v1.5 and Falcon H1 with tied embeddings & muP scaling, optimized for peak performance.
  • Batching got a massive overhaul: sliding window + cache handling fixed, `CacheList`/`ArraysCache` now batchable, empty caches? Handled.
  • First-ever server benchmark for continuous batching: real-world serving numbers, not just synthetic ones.
  • LongCat Flash now sharded + extended context: generate longer texts without choking.
  • Minitensor sharding (Minimax) + GPT-OSS sharding: scale your models smarter, not harder.
  • SwiGLU fixed, tokenizer errors now use `warnings`, MLX updated to latest: all the polish you didn't know you needed.
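To kick the tires on the new quantized-model support, a minimal sketch using the `mlx_lm.generate` CLI. The model repo name below is a placeholder for illustration, not a specific checkpoint from this release; substitute any AWQ/GPTQ-quantized model you want to run (requires an Apple Silicon Mac).

```shell
# Upgrade to the latest mlx-lm release
pip install -U mlx-lm

# Load a pre-quantized model and generate.
# NOTE: the --model value is a hypothetical placeholder; point it at a
# real AWQ/GPTQ checkpoint on the Hugging Face Hub or a local path.
mlx_lm.generate \
  --model mlx-community/SomeModel-4bit-AWQ \
  --prompt "Write a haiku about quantization." \
  --max-tokens 100
```

The same model path works with the Python API (`from mlx_lm import load, generate`) if you'd rather script it.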

Massive thanks to @ericcurtin, @nikhilmitrax, @tibbes, @solarpunkin, @AndrewTan517, and @Evanev7 for the wins!

Update. Run. Build something wild. 🤖💻

🔗 View Release