Ollama – v0.20.1: Revert “enable flash attention for gemma4 (#15296)” (#15311)

Ollama v0.20.1 is officially live! 🚀

If you aren’t using Ollama yet, you are missing out on one of the best ways to run powerful Large Language Models (LLMs) like Llama 3, DeepSeek-R1, and Mistral locally on your own hardware. It’s a total game-changer for privacy-conscious tinkerers and devs who want to experiment with AI without relying on cloud APIs.

This latest release is a targeted maintenance update focused on stability:

  • Flash Attention Reversion: The team has reverted the “enable flash attention for gemma4” feature. 🔄

Why does this matter?

While Flash Attention is an awesome optimization for speed, the developers have decided to pull it back for now, likely to iron out some unexpected behavior or stability issues specific to Gemma 4 models.
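
For context, flash attention in Ollama is controlled server-side via the OLLAMA_FLASH_ATTENTION environment variable. As a minimal sketch (assuming "0" or unset keeps the feature off, which matches Ollama’s documented convention of setting it to "1" to enable), here is how you could launch the server with it explicitly disabled:

```python
import os
import subprocess

# Launch the Ollama server with flash attention explicitly disabled.
# OLLAMA_FLASH_ATTENTION is Ollama's documented toggle; "0" (or leaving
# it unset) is assumed here to keep flash attention off.
env = dict(os.environ, OLLAMA_FLASH_ATTENTION="0")
subprocess.run(["ollama", "serve"], env=env)
```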

If you’ve been experiencing weirdness or crashes while running Gemma 4 with flash attention enabled, updating to v0.20.1 should get your local environment back into a much more predictable and stable state! 🛠️
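
Once you’ve updated, a quick smoke test is an easy way to confirm your setup is stable again. Here’s a short sketch using the official ollama Python client (pip install ollama); the model tag is just an example, so swap in whichever Gemma build you actually have pulled locally:

```python
# pip install ollama
import ollama

# "gemma3" is an example tag; substitute the Gemma model you run locally.
response = ollama.chat(
    model="gemma3",
    messages=[{"role": "user", "content": "Reply with one short sentence."}],
)
print(response["message"]["content"])
```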

🔗 View Release