Ollama – v0.18.3-rc0: mlx: add mxfp4/mxfp8/nvfp4 importing (#15015)

🚨 Ollama v0.18.3-rc0 is here — and it’s a quantization powerhouse! 🚨

The latest release adds supercharged import support for MLX (Apple’s machine-learning framework for Apple Silicon) and new low-bit floating-point formats — meaning you can now run even more efficient, low-bit models locally. Here’s the breakdown:

🔹 New Quantization Imports

✅ Import BF16 models → convert on-the-fly to:

  • `mxfp4` (OCP Microscaling 4-bit floating point)
  • `mxfp8` (OCP Microscaling 8-bit floating point)
  • `nvfp4` (NVIDIA’s 4-bit floating-point format)

✅ Import FP8 models → convert directly to `mxfp8`
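For the curious: `mxfp4` stores weights in 32-element blocks, each block sharing a single power-of-two scale, with individual values encoded as 4-bit E2M1 floats. Here’s an illustrative NumPy sketch of that block layout — a simplified round-to-nearest version, not Ollama’s actual converter:

```python
import numpy as np

# E2M1 (4-bit float) representable magnitudes, per the OCP Microscaling spec.
E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def mxfp4_block(x):
    """Quantize one 32-element block: shared power-of-two scale + E2M1 values."""
    amax = np.abs(x).max()
    # Shared exponent: floor(log2(amax)) minus E2M1's max exponent (2),
    # so the largest element lands near the top of the E2M1 range.
    exp = int(np.floor(np.log2(amax))) - 2 if amax > 0 else 0
    scale = 2.0 ** exp
    scaled = x / scale
    # Round each magnitude to the nearest E2M1 value; sign kept separately.
    idx = np.abs(np.abs(scaled)[:, None] - E2M1[None, :]).argmin(axis=1)
    return scale, np.sign(scaled) * E2M1[idx]

def mxfp4_dequant(scale, q):
    return scale * q

x = np.random.default_rng(0).normal(size=32).astype(np.float32)
scale, q = mxfp4_block(x)
x_hat = mxfp4_dequant(scale, q)
```

The shared power-of-two scale is what keeps the per-weight cost down: 4 bits per element plus one 8-bit exponent amortized over 32 elements.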

🎯 Why this rocks:

  • 🍏 Apple Silicon users (M1/M2/M3): Run ultra-efficient MLX-native models with minimal memory footprint.
  • 🎮 NVIDIA fans: Get early access to NVFP4 — a promising new format for faster, smaller inference.
  • ⚡ Smaller models + less VRAM = more models on your laptop, fewer cloud trips.
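How much smaller? Back-of-envelope arithmetic for a 7B-parameter model, assuming the mxfp4 block layout above (4-bit elements plus one 8-bit scale per 32-element block, i.e. 4.25 effective bits per weight):

```python
# Rough memory footprint of a 7B-parameter model (decimal GB).
params = 7e9
bf16_gb = params * 16 / 8 / 1e9          # 16 bits per weight -> 14.0 GB
mxfp4_bits = 4 + 8 / 32                  # 4.25 effective bits per weight
mxfp4_gb = params * mxfp4_bits / 8 / 1e9  # -> ~3.72 GB
print(f"BF16:  {bf16_gb:.1f} GB")
print(f"mxfp4: {mxfp4_gb:.2f} GB")
```

That’s roughly a 3.8× reduction in weight memory — before counting activations and KV cache, which are sized separately.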

This is a big leap toward truly portable, hardware-agnostic LLM inference — all from your desktop. 🧠💻

Curious how `mxfp4` stacks up against `nvfp4`? Let us know — happy to deep dive! 🧵

🔗 View Release