Ollama – v0.20.4-rc2: gemma4: Disable FA on older GPUs where it doesn’t work (#15403)
Ollama – v0.20.4-rc2
Ollama remains a go-to toolkit for running large language models locally, letting you experiment with privacy and speed on your own hardware.
This release focuses on improving stability for users running the gemma4 model:
- Flash Attention (FA) Compatibility Fix: To prevent crashes, Flash Attention is now automatically disabled on older GPU hardware.
- Hardware Awareness: Specifically, if your GPU's CUDA compute capability is below 7.5, the runtime bypasses FA, since that hardware lacks the kernel support the gemma4 model needs (see the sketch after this list).
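
To make the gating concrete, here is a minimal sketch of the kind of check described above, written in Go since that is Ollama's implementation language. This is illustrative only, not Ollama's actual code: the `gpuInfo` struct and `supportsFlashAttention` function are hypothetical names, and the only fact carried over from the release note is the compute capability 7.5 floor.

```go
package main

import "fmt"

// gpuInfo is a hypothetical stand-in for whatever hardware metadata
// the runtime collects when it probes available GPUs at startup.
type gpuInfo struct {
	Name                   string
	ComputeCapabilityMajor int
	ComputeCapabilityMinor int
}

// supportsFlashAttention reports whether the GPU meets the compute
// capability 7.5 floor described in this release note. GPUs below
// that floor get flash attention disabled automatically.
func supportsFlashAttention(g gpuInfo) bool {
	cc := float64(g.ComputeCapabilityMajor) + float64(g.ComputeCapabilityMinor)/10.0
	return cc >= 7.5
}

func main() {
	gpus := []gpuInfo{
		{Name: "Tesla V100", ComputeCapabilityMajor: 7, ComputeCapabilityMinor: 0},
		{Name: "RTX 2080", ComputeCapabilityMajor: 7, ComputeCapabilityMinor: 5},
	}
	for _, g := range gpus {
		if supportsFlashAttention(g) {
			fmt.Printf("%s: flash attention enabled\n", g.Name)
		} else {
			fmt.Printf("%s: flash attention disabled (compute capability below 7.5)\n", g.Name)
		}
	}
}
```

The point of the automatic check is that you no longer have to toggle this yourself; the runtime makes the call per GPU based on the hardware it detects.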
This is a great win for those of us working with slightly older gear: you can now run these cutting-edge models without worrying about unexpected errors or stability issues!
