Ollama – v0.18.3-rc0: mlx: add mxfp4/mxfp8/nvfp4 importing (#15015)
🚨 Ollama v0.18.3-rc0 is here — and it’s a quantization powerhouse! 🚨
The latest release adds supercharged import support for MLX (Apple’s machine-learning framework for Apple Silicon), with conversion into Microscaling (MX) and NVIDIA low-bit floating-point formats — meaning you can now run even more efficient, low-bit models locally. Here’s the breakdown:
🔹 New Quantization Imports
✅ Import BF16 models → convert on-the-fly to:
- `mxfp4` (OCP Microscaling 4-bit block-scaled floating point)
- `mxfp8` (OCP Microscaling 8-bit block-scaled floating point)
- `nvfp4` (NVIDIA’s 4-bit block-scaled floating-point format)
✅ Import FP8 models → convert directly to `mxfp8`
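To make the conversion concrete, here’s a minimal Python sketch of MXFP4-style block quantization — each block of 32 values shares one power-of-two (E8M0) scale, and each element is rounded to the nearest FP4 E2M1 value. This follows the published MX format description, not Ollama’s actual importer code; the function name and structure are illustrative.

```python
import math

# Representable magnitudes of FP4 E2M1 (sign is stored separately).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_mxfp4_block(block):
    """Quantize-dequantize one 32-element block: shared 2^k scale + E2M1 codes."""
    assert len(block) == 32
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return 1.0, [0.0] * 32
    # Shared scale exponent per the MX spec: floor(log2(amax)) minus the
    # element format's max exponent (2 for E2M1, since its max value is 6 = 1.5*2^2).
    k = math.floor(math.log2(amax)) - 2
    scale = 2.0 ** k
    dequantized = []
    for v in block:
        mag = abs(v) / scale
        # Round to the nearest representable E2M1 magnitude (clamps at 6.0).
        q = min(E2M1_GRID, key=lambda g: abs(g - mag))
        dequantized.append(math.copysign(q, v) * scale)
    return scale, dequantized
```

The shared power-of-two scale is what keeps the storage cost so low: one extra byte per 32 weights, with the hardware-friendly property that scaling is just an exponent shift.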
🎯 Why this rocks:
- 🍏 Apple Silicon users (M1/M2/M3): Run ultra-efficient MLX-native models with minimal memory footprint.
- 🎮 NVIDIA fans: NVFP4 models can now be imported — a promising new block-scaled 4-bit format for faster, smaller inference.
- ⚡ Smaller models + less VRAM = more models on your laptop, fewer cloud trips.
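The VRAM savings above are easy to estimate. Here’s a back-of-envelope calculation for an 8B-parameter model; the block sizes and scale widths are assumptions taken from the published MX and NVFP4 format descriptions (E8M0 scale per 32 elements for MXFP4, E4M3 scale per 16 elements for NVFP4), not Ollama internals.

```python
def weight_gib(params, elem_bits, scale_bits=0, block=1):
    """Approximate weight storage in GiB: element bits plus amortized scale bits."""
    bits_per_weight = elem_bits + scale_bits / block
    return params * bits_per_weight / 8 / 2**30

PARAMS = 8e9
bf16  = weight_gib(PARAMS, 16)                           # no per-block scales
mxfp4 = weight_gib(PARAMS, 4, scale_bits=8, block=32)    # 4.25 bits/weight
nvfp4 = weight_gib(PARAMS, 4, scale_bits=8, block=16)    # 4.5 bits/weight
print(f"BF16 ≈ {bf16:.1f} GiB, mxfp4 ≈ {mxfp4:.1f} GiB, nvfp4 ≈ {nvfp4:.1f} GiB")
```

Roughly a 3.7× reduction from BF16 to mxfp4 — the difference between needing a workstation GPU and fitting comfortably on a laptop.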
This is a big leap toward truly portable, hardware-agnostic LLM inference — all from your desktop. 🧠💻
Curious how `mxfp4` stacks up against `nvfp4`? Let us know — happy to deep dive! 🧵
