MLX-LM – v0.30.3
MLX LM v0.30.3 just dropped and it's a beast!
- AWQ & GPTQ quantization now fully supported: load quantized models like it's nothing (quick load sketch after the list).
- New models: IQuest Coder V1 Loop (code gen on steroids) + GLM4 MoE Lite (lightweight but mighty).
- Nemotron Super 49B v1.5 and Falcon H1 with tied embeddings & muP scaling, optimized for peak performance.
- Batching got a massive overhaul: sliding window + cache handling fixed, `CacheList`/`ArraysCache` now batchable, empty caches? Handled.
- First-ever server benchmark for continuous batching: real-world throughput numbers from the running server, not synthetic microbenchmarks (request sketch after the list).
- LongCat Flash now sharded + extended context: generate longer texts without choking.
- Minitensor sharding (Minimax) + GPT-OSS sharding: scale your models smarter, not harder.
- SwiGLU fixed, tokenizer errors now use `warnings`, MLX updated to latest: all the polish you didn't know you needed.
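
Want to kick the tires on the quantization support? Here's a minimal sketch using the standard `mlx_lm` load/generate API. The repo id below is a placeholder, not a real checkpoint; swap in any AWQ or GPTQ model you actually have:

```python
# Minimal sketch: load a quantized checkpoint and generate.
from mlx_lm import load, generate

# Placeholder repo id: substitute a real AWQ/GPTQ checkpoint.
model, tokenizer = load("some-org/some-model-AWQ")

prompt = "Write a haiku about quantization."
text = generate(model, tokenizer, prompt=prompt, max_tokens=100)
print(text)
```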
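
And to poke at continuous batching yourself, here's a small sketch that fires concurrent requests at the local server's OpenAI-style chat endpoint. It assumes the default port and endpoint path; the model in the launch command and the prompts are illustrative:

```python
# Minimal sketch: concurrent chat requests against the local server,
# to exercise continuous batching. Start the server first, e.g.:
#   mlx_lm.server --model <your-model> --port 8080
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

URL = "http://localhost:8080/v1/chat/completions"

def ask(prompt: str) -> str:
    # Build an OpenAI-style chat completion request.
    payload = json.dumps({
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 64,
    }).encode()
    req = urllib.request.Request(
        URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# Fire several requests at once so the server can batch them.
prompts = [f"Count to {n} in words." for n in (3, 5, 7, 9)]
with ThreadPoolExecutor(max_workers=4) as pool:
    for answer in pool.map(ask, prompts):
        print(answer)
```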
Massive thanks to @ericcurtin, @nikhilmitrax, @tibbes, @solarpunkin, @AndrewTan517, and @Evanev7 for the wins!
Update. Run. Build something wild.
