MLX-LM – v0.31.3


MLX LM (Run LLMs with MLX) just dropped v0.31.3! 🚀

If you’re obsessed with running local LLMs on Apple Silicon, this patch release is a massive win for your inference workflows. The star of the show is a new thread-local generation stream, which works alongside MLX v0.31.2 to make streaming responses way more reliable when you’re working in multi-threaded environments.
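To get a feel for why thread-local streams help, here's a minimal stdlib sketch of the pattern (this is illustrative only, not MLX-LM's actual implementation — the real thing wraps MLX compute streams): each thread lazily creates its own stream object via `threading.local`, so concurrent generations never share mutable stream state.

```python
import threading

# Illustrative stand-in for a per-thread generation stream; the real
# MLX-LM stream wraps an MLX compute stream. Here we only record the
# owning thread to demonstrate the isolation property.
class GenerationStream:
    def __init__(self):
        self.owner = threading.get_ident()

_local = threading.local()

def get_generation_stream():
    # Lazily create exactly one stream per thread; repeated calls from
    # the same thread return the same object.
    if not hasattr(_local, "stream"):
        _local.stream = GenerationStream()
    return _local.stream

results = {}

def worker(name):
    # Call twice to show the stream is stable within a thread.
    s1 = get_generation_stream()
    s2 = get_generation_stream()
    results[name] = (s1, s2)

threads = [threading.Thread(target=worker, args=(i,)) for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

assert results[0][0] is results[0][1]      # stable within a thread
assert results[0][0] is not results[1][0]  # distinct across threads
```

Because each thread owns its stream, two simultaneous streaming requests can't stomp on each other's generation state — which is exactly the failure mode this release targets.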

Here’s the breakdown of what’s new:

  • Streaming & Concurrency: New thread-local support means much smoother, more stable streaming even when handling multiple tasks at once.
  • Tool Calling Overhaul: A huge cleanup for function calling! This includes better parallel tool call handling in the server, specific patches for MiniMax M2, and improved parser support (like handling hyphenated names and braces) for Gemma 4.
  • Stability Fixes: Squashed those pesky bugs related to batch dimension mismatches in `BatchKVCache`, `BatchRotatingKVCache`, and `ArraysCache`. No more unexpected dimension errors!
  • Model-Specific Tweaks: Fixed issues with Gemma 4 KV-shared layers and resolved embedding issues for `Apertus`.
  • General Polish: Better handling of “think” tokens in tokenizer wrappers, improved `safetensors` directory checks, and fixed missing imports in cache modules.
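As a rough illustration of the hyphenated-name parser fix (a toy parser under assumed formats — the tag and JSON shape here are made up, not MLX-LM's actual wire format): a name pattern that only matches word characters would silently reject tool calls like `get-weather`, so the character class has to admit hyphens explicitly.

```python
import json
import re

# Toy tool-call extractor: pulls {"name": ..., "arguments": ...} payloads
# out of <tool_call>...</tool_call> spans. Hypothetical format for
# illustration only.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

# A stricter pattern like r"^\w+$" would drop hyphenated tool names;
# adding "-" to the class accepts them.
NAME_RE = re.compile(r"^[\w-]+$")

def parse_tool_calls(text):
    calls = []
    for match in TOOL_CALL_RE.finditer(text):
        payload = json.loads(match.group(1))
        if NAME_RE.match(payload.get("name", "")):
            calls.append(payload)
    return calls

out = parse_tool_calls(
    '<tool_call>{"name": "get-weather", "arguments": {"city": "Oslo"}}</tool_call>'
)
assert out[0]["name"] == "get-weather"
```

The same idea extends to the brace handling mentioned above: the extractor has to tolerate nested `{}` inside the arguments object rather than cutting the match at the first closing brace.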

This is a super practical update if you’ve been running into dimension errors or tricky tool-calling behavior with the latest models. Time to update and get those local models humming! 🛠️

🔗 View Release