Ollama – v0.20.4-rc2: gemma4: Disable FA on older GPUs where it doesn’t work (#15403)
Ollama – v0.20.4-rc2
Ollama remains a go-to toolkit for running large language models locally, letting you experiment with privacy and speed on your own hardware.
This release focuses on improving stability for users running the gemma4 model:
- Flash Attention (FA) Compatibility Fix: To prevent crashes, Flash Attention is now automatically disabled on older GPU hardware.
- Hardware Awareness: Specifically, if your GPU's CUDA compute capability is below 7.5, the runtime bypasses FA, since that hardware lacks the kernel support the gemma4 model needs (see the sketch after this list).
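
To make the gating concrete, here is a minimal sketch of the kind of check described above, written in Go since that is Ollama's implementation language. This is illustrative only, not Ollama's actual code: the `gpuInfo` struct and `supportsFlashAttention` function are hypothetical names, and the only fact carried over from the release note is the compute capability 7.5 floor.

```go
package main

import "fmt"

// gpuInfo is a hypothetical stand-in for whatever hardware metadata
// the runtime collects when it probes available GPUs at startup.
type gpuInfo struct {
	Name                   string
	ComputeCapabilityMajor int
	ComputeCapabilityMinor int
}

// supportsFlashAttention reports whether the GPU meets the compute
// capability 7.5 floor described in this release note. GPUs below
// that floor get flash attention disabled automatically.
func supportsFlashAttention(g gpuInfo) bool {
	cc := float64(g.ComputeCapabilityMajor) + float64(g.ComputeCapabilityMinor)/10.0
	return cc >= 7.5
}

func main() {
	gpus := []gpuInfo{
		{Name: "Tesla V100", ComputeCapabilityMajor: 7, ComputeCapabilityMinor: 0},
		{Name: "RTX 2080", ComputeCapabilityMajor: 7, ComputeCapabilityMinor: 5},
	}
	for _, g := range gpus {
		if supportsFlashAttention(g) {
			fmt.Printf("%s: flash attention enabled\n", g.Name)
		} else {
			fmt.Printf("%s: flash attention disabled (compute capability below 7.5)\n", g.Name)
		}
	}
}
```

The point of the automatic check is that you no longer have to toggle this yourself; the runtime makes the call per GPU based on the hardware it detects.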
This is a great win for those of us working with slightly older gear: you can now run these cutting-edge models without worrying about unexpected errors or stability issues!
