Ollama – v0.20.4-rc2: gemma4: Disable FA on older GPUs where it doesn’t work (#15403)

Ollama – v0.20.4-rc2 🚀

Ollama remains a go-to toolkit for running large language models locally, letting you experiment on your own hardware with both privacy and speed.

This release focuses on improving stability for users running the gemma4 model:

  • Flash Attention (FA) Compatibility Fix: To prevent crashes, Flash Attention is now automatically disabled on older GPU hardware.
  • Hardware Awareness: Specifically, if your GPU's CUDA compute capability is lower than 7.5, the runtime bypasses FA, since that hardware lacks the support the gemma4 model needs.
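The check described above can be sketched in Go (Ollama's implementation language). This is a minimal illustration, not the project's actual code: the `gpuInfo` struct and `flashAttentionSupported` function are hypothetical names, and the 7.5 threshold comes from the release note.

```go
package main

import "fmt"

// gpuInfo is a hypothetical struct holding a GPU's CUDA compute
// capability, expressed as major.minor (e.g. 7.5 for Turing cards).
type gpuInfo struct {
	ComputeMajor int
	ComputeMinor int
}

// flashAttentionSupported reports whether Flash Attention should be
// enabled, assuming (per this release note) that compute capability
// 7.5 is the minimum required.
func flashAttentionSupported(g gpuInfo) bool {
	if g.ComputeMajor != 7 {
		return g.ComputeMajor > 7
	}
	return g.ComputeMinor >= 5
}

func main() {
	// Sample capabilities: Pascal (6.1), Volta (7.0), Turing (7.5), Ampere (8.6).
	for _, g := range []gpuInfo{{6, 1}, {7, 0}, {7, 5}, {8, 6}} {
		fmt.Printf("compute capability %d.%d -> FA enabled: %v\n",
			g.ComputeMajor, g.ComputeMinor, flashAttentionSupported(g))
	}
}
```

With this kind of gate, older GPUs silently fall back to the standard attention path instead of crashing when gemma4 requests Flash Attention.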

This is a great win for those of us working with slightly older gear: you can now run these cutting-edge models without worrying about unexpected errors or stability issues! 🛠️

🔗 View Release