MLX-LM – v0.30.6
MLX‑LM v0.30.6 just dropped – fresh on Apple silicon! 🍏✨
What it does:
Generate text and fine‑tune massive LLMs right on your M‑series Mac using the MLX framework. Plug into Hugging Face, run quantized models, handle long prompts, and scale with distributed inference.
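If you're new to the library, the basic Python flow is just load-and-generate. Here's a minimal sketch – the repo name is only an example checkpoint, swap in whichever model you like:

```python
from mlx_lm import load, generate

# Load a quantized community model and its tokenizer from the Hugging Face Hub.
# The repo name is just an example; any supported checkpoint works.
model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")

# Format the prompt with the model's chat template.
messages = [{"role": "user", "content": "Write a haiku about Apple silicon."}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

# Generate a completion (verbose=True streams tokens and prints timing stats).
text = generate(model, tokenizer, prompt=prompt, max_tokens=256, verbose=True)
print(text)
```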
What’s new in this release:
- LongCat Flash parser & LongCat Flash Lite – new parser plus support for the Lite variant, with fast token streaming (shoutout @kernelpool).
- Kimi‑K2.5 support – tool‑call handling fixed; Kimi models work out of the box.
- MLX bump – upgraded backend for smoother, faster Apple silicon performance.
- Nemotron H config fix – aligns with the Hugging Face format → hassle‑free loading.
- MultiLinear quant bug – restored the missing `mode` argument; no more crashes during quantization (a conversion/quantization sketch follows this list).
- CLI finally live – a real command‑line interface (thanks @awni), plus quick bug fixes.
- Distributed inference – server can now spread work across multiple nodes (big thanks @angeloskath).
- Custom model loading – drop a custom model into the folder; the server auto‑detects it.
- BatchRotatingKVCache default – smarter cache handling in batch mode for faster generation.
- Step 3.5 Flash & conversion fix – support for the Step 3.5 Flash model, plus a corrected model conversion pipeline.
- Chat template kwargs + top_logprobs – pass extra keyword arguments through chat templates, and request per‑token log probabilities (see the server request sketch after this list).
- Stability upgrades – GLM 4.7 fallback handling, DeepSeek V3.2 tweaks, and fixes for batched Mamba and sliding‑window masks.
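Speaking of quantization: converting and quantizing a Hugging Face checkpoint is a one‑liner from Python. A minimal sketch, assuming the standard `convert` entry point – the repo name and output path are just examples, and the bit‑width/group‑size/mode options have their own keyword arguments:

```python
from mlx_lm import convert

# Download a Hugging Face checkpoint, convert it to MLX format, and quantize it.
# Repo name and output directory are example values.
convert(
    "mistralai/Mistral-7B-Instruct-v0.3",
    mlx_path="mlx_model",
    quantize=True,
)
```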
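And for the new `top_logprobs` and chat‑template kwargs on the server side, here's a rough request sketch against a locally running `mlx_lm.server` (default port 8080). The `logprobs`/`top_logprobs` fields follow the OpenAI‑style chat completions schema, and the `chat_template_kwargs` entry with `enable_thinking` is an assumed example – check the server docs for the exact field names:

```python
import json
import urllib.request

# OpenAI-style chat completion request against a local mlx_lm.server instance.
payload = {
    "model": "mlx-community/Mistral-7B-Instruct-v0.3-4bit",  # example model name
    "messages": [{"role": "user", "content": "Name three MLX features."}],
    "max_tokens": 64,
    "logprobs": True,
    "top_logprobs": 5,  # per-token alternatives (assumed OpenAI-style field name)
    "chat_template_kwargs": {"enable_thinking": False},  # assumed example kwarg
}

req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

with urllib.request.urlopen(req) as response:
    print(json.dumps(json.load(response), indent=2))
```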
🚀 New contributor alert: @jalehman landed their first PR – welcome aboard!
More speed, more flexibility, fewer crashes. Happy tinkering! 🎉
