MLX-LM – v0.30.3

MLX LM v0.30.3 just dropped and it's a beast 🚀

  • AWQ & GPTQ quantization now fully supported: load quantized models like it's nothing.
  • New models: IQuest Coder V1 Loop (code gen on steroids) + GLM4 MoE Lite (lightweight but mighty).
  • Nemotron Super 49B v1.5 and Falcon H1 with tied embeddings & muP scaling, optimized for peak performance.
  • Batching got a massive overhaul: sliding window + cache handling fixed, `CacheList`/`ArraysCache` now batchable, empty caches? Handled.
  • First-ever server benchmark for continuous batching: real-world serving numbers, not just synthetic ones.
  • LongCat Flash now sharded + extended context: generate longer texts without choking.
  • Minitensor sharding (Minimax) + GPT-OSS sharding: scale your models smarter, not harder.
  • SwiGLU fixed, tokenizer errors now use `warnings`, MLX updated to latest: all the polish you didn't know you needed.
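To kick the tires on the new quantized-model support, a minimal sketch using the `mlx_lm.generate` CLI. The model repo name below is a placeholder for illustration, not a specific checkpoint from this release; substitute any AWQ/GPTQ-quantized model you want to run (requires an Apple Silicon Mac).

```shell
# Upgrade to the latest mlx-lm release
pip install -U mlx-lm

# Load a pre-quantized model and generate.
# NOTE: the --model value is a hypothetical placeholder; point it at a
# real AWQ/GPTQ checkpoint on the Hugging Face Hub or a local path.
mlx_lm.generate \
  --model mlx-community/SomeModel-4bit-AWQ \
  --prompt "Write a haiku about quantization." \
  --max-tokens 100
```

The same model path works with the Python API (`from mlx_lm import load, generate`) if you'd rather script it.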

Massive thanks to @ericcurtin, @nikhilmitrax, @tibbes, @solarpunkin, @AndrewTan517, and @Evanev7 for the wins!

Update. Run. Build something wild. 🤖💻

🔗 View Release