• Ollama – v0.20.2

    Ollama v0.20.2 is officially live! 🚀

    If you’re looking to run powerful large language models like Llama 3, DeepSeek-R1, or Mistral locally on your own hardware, Ollama remains the gold standard for making that process seamless and easy. It handles all the heavy lifting of model management so you can focus on tinkering and building.

    This latest release focuses on smoothing out your user experience:

    • Improved App Flow: The default home view has been updated to direct you straight into a new chat session rather than just launching the application interface. This small change helps you jump right into the conversation without extra clicks! 💬

    Keep those local environments running!

    🔗 View Release

  • Ollama – v0.20.1: Revert “enable flash attention for gemma4 (#15296)” (#15311)

    Ollama v0.20.1 is officially live! 🚀

    If you aren’t using Ollama yet, you are missing out on one of the best ways to run powerful Large Language Models (LLMs) like Llama 3, DeepSeek-R1, and Mistral locally on your own hardware. It’s a total game-changer for privacy-conscious tinkerers and devs who want to experiment with AI without relying on cloud APIs.

    This latest release is a targeted maintenance update focused on stability:

    • Flash Attention Reversion: The team has reverted the “enable flash attention for gemma4” feature. 🔄

    Why does this matter?

    While Flash Attention is an awesome optimization for speed, it looks like the developers decided to pull it back for now, likely to iron out some unexpected behavior or stability issues specifically with Gemma 4 models.

    If you’ve been experiencing weirdness or crashes while running Gemma 4 with flash attention enabled, updating to v0.20.1 should get your local environment back into a much more predictable and stable state! 🛠️

    🔗 View Release

  • Text Generation Webui – v4.3.3 – Gemma 4 support!

    text-generation-webui just dropped a massive update! If you’re looking for the “AUTOMATIC1111” experience for local LLMs, this Gradio-based powerhouse is now even more capable and snappy. 🚀

    Here is the breakdown of what’s new in this release:

    🧠 New Model & Backend Support

    • Gemma 4 Integration: Full support is officially live! You can now run Gemma 4 with full tool-calling capabilities via both the UI and the API.
    • ik_llama.cpp Backend: A brand new backend option has arrived, offering much more accurate KV cache quantization (via Hadamard rotation) and specialized optimizations for MoE models and CPU inference.
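    The Hadamard idea can be illustrated with a toy: because H/√n is orthogonal (and, in the Sylvester construction, symmetric), rotating a vector is lossless and spreads an outlier's energy across all coordinates before quantization. A minimal pure-Python sketch — illustrative only, not ik_llama.cpp's actual kernel:

```python
import math

def hadamard(n):
    # Sylvester construction; n must be a power of two.
    H = [[1]]
    while len(H) < n:
        H = [row + row for row in H] + \
            [row + [-v for v in row] for row in H]
    return H

def rotate(H, x):
    # H / sqrt(n) is orthogonal and symmetric, so applying `rotate`
    # twice returns the original vector: the rotation itself is lossless.
    n = len(H)
    s = 1.0 / math.sqrt(n)
    return [s * sum(H[i][j] * x[j] for j in range(n)) for i in range(n)]

x = [8.0, 0.1, -0.2, 0.05]   # one outlier dominates the vector
H = hadamard(4)
y = rotate(H, x)             # outlier energy is spread over all coordinates
back = rotate(H, y)          # exact round trip (up to float error)
```

    Quantizing the rotated vector and rotating back after dequantization is the gist of "Hadamard rotation" for KV cache quantization; the backend's real kernels are of course far more involved.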

    πŸ› οΈ API & Transformer Enhancements

    • Enhanced Completions: The `/v1/completions` endpoint now supports `echo` and `logprobs`, giving you deep visibility into token-level probabilities.
    • Smarter Model Loading: The system now auto-detects `torch_dtype` from model configs, providing way more flexibility than the previous forced half-precision method.
    • Metadata-Driven Templates: Instruction templates are now intelligently detected via model metadata instead of relying on filename patterns.
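    To give a feel for the new parameters, here is a minimal sketch of a `/v1/completions` request body (field names follow the OpenAI completions schema the endpoint emulates; the prompt and values are just examples):

```python
import json

def build_completion_request(prompt, max_tokens=16, logprobs=5, echo=True):
    # `echo` asks the server to return the prompt tokens alongside the
    # completion; `logprobs` requests the top-N log-probabilities per token.
    return {
        "prompt": prompt,
        "max_tokens": max_tokens,
        "echo": echo,
        "logprobs": logprobs,
    }

body = build_completion_request("The capital of France is")
payload = json.dumps(body)  # POST this to the OpenAI-compatible endpoint
```

    Inspect `choices[0].logprobs` in the response to see the token-level probabilities the release notes describe.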

    ⚡ Performance & UI Polish

    • Snappier Interface: A custom Gradio fork has been tuned to save up to 50ms per UI event, making the whole experience feel much more responsive.
    • Critical Bug Fixes: Resolved several issues including dropdown crashes, API parsing errors for non-dict JSON tool calls, and `llama.cpp` template parsing bugs.

    πŸ›‘οΈ Security & Stability

    • Hardened Protections: Implemented ACL/SSRF fixes for extensions, patched path-matching bypasses on Windows/macOS, and added filename sanitization to prevent manipulation during prompt file operations.

    📦 Portable Build Upgrades

    New self-contained packages are available for NVIDIA, AMD, Intel, Apple Silicon, and CPU users! Pro tip: You can now move your `user_data` folder one level up to easily share settings across multiple version installs. 🛠️

    🔗 View Release

  • Ollama – v0.20.1-rc2: model/parsers: rework gemma4 tool call handling (#15306)

    Ollama v0.20.1-rc2 is officially here, and it’s bringing some serious precision to how your local engine handles model interactions! 🛠️

    If you’ve been using Ollama to run LLMs like Llama 3, Mistral, or Gemma locally, you know it’s the backbone for building private AI applications. This latest release focuses heavily on refining the way specific models communicate with your system.

    What’s new in this release:

    • Gemma4 Tool Call Overhaul: The developers have completely reworked how Gemma4 handles tool calls. By replacing the old custom argument normalizer with a much stricter reference-style conversion, model interactions are now significantly more reliable.
    • Improved Data Integrity: This update is a win for stability! It ensures that quoted strings remain strings, bare keys get properly quoted, and unquoted values maintain their correct types during the JSON unmarshalling process.
    • Enhanced Error Handling: New test coverage has been added to catch malformed raw-quoted inputs. This ensures Ollama behaves exactly like the official reference implementation, reducing those pesky unexpected errors.
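    To make those invariants concrete, here is a toy normalizer in Python (purely illustrative; Ollama's actual parser is written in Go and is stricter): bare keys are quoted so the payload becomes valid JSON, while value types pass through untouched.

```python
import json
import re

def normalize_tool_args(raw):
    # Quote bare identifiers used as keys ({days: ...} -> {"days": ...});
    # keys that are already quoted are left untouched. Toy regex: it does
    # not handle colons inside string values.
    fixed = re.sub(r'([{,]\s*)([A-Za-z_]\w*)\s*:', r'\1"\2":', raw)
    return json.loads(fixed)

# Quoted strings stay strings, bare keys get quoted,
# and unquoted values keep their types.
args = normalize_tool_args('{location: "Paris", "unit": "celsius", days: 3, metric: true}')
```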

    If you are currently experimenting with Gemma4 for agentic workflows or complex tool use, this update is a must-have to make your model interactions more predictable and robust! 🚀

    🔗 View Release

  • Ollama – v0.20.1-rc1: ggml: fix ROCm build for cublasGemmBatchedEx reserve wrapper

    Ollama v0.20.1-rc1 is officially live, bringing some much-needed stability for the AMD crowd! 🚀

    If you’ve been trying to leverage your AMD GPU to run local LLMs like Llama 3 or DeepSeek-R1, this release is a critical one. It focuses heavily on refining the ROCm build, ensuring that hardware acceleration is smoother and more reliable for those of us not using NVIDIA.

    What’s new in this release:

    • Fixed ROCm Build: Resolved specific issues within the `ggml` library to prevent crashes and improve stability when running on AMD GPUs.
    • Improved Type Mapping: Added missing mappings between `cublasGemmAlgo_t` and `hipblasGemmAlgo_t`, which helps with smoother communication between software layers.
    • Wrapper Optimization: Fixed a bug in the `cublasGemmBatchedEx` reserve wrapper by correcting how const qualifiers are handled, ensuring compatibility with `hipblasGemmBatchedEx`.

    This is a great update for anyone building a local AI workstation around AMD hardware. Grab the update and get those models running! 🛠️

    🔗 View Release

  • Ollama – v0.20.1-rc0

    Ollama v0.20.1-rc0 is officially hitting the scene! 🚀

    If you’re looking to run powerful LLMs like Llama 3, DeepSeek-R1, or Mistral locally without relying on expensive cloud subscriptions, Ollama remains the gold standard for your local dev environment. It handles all the heavy lifting of downloading and managing models across macOS, Windows, and Linux.

    This latest release candidate is all about squeezing more performance out of your hardware:

    • Flash Attention Support for Gemma: This update brings Flash Attention specifically to the Gemma model family. 🧠
    • The Impact: By utilizing this clever algorithm, you’ll see significantly faster inference times and much lower memory consumption when running Gemma models on your machine.

    For those of us tinkering with local workflows, these optimizations mean smoother interactions and more efficient processing power! 🛠️

    🔗 View Release

  • Text Generation Webui – v4.3.2

    text-generation-webui v4.3.2 is officially live! 🚀 This Gradio-based powerhouse is the go-to interface for running LLMs locally, and this update brings some serious heavy-hitting performance boosts and expanded model support for all you tinkerers out there.

    Here is the breakdown of what’s new in this release:

    Core Model & Backend Upgrades

    • Gemma 4 Support: You can now run Gemma 4 with full tool-calling capabilities enabled in both the API and the UI. 🆕
    • New `ik_llama.cpp` Backend: A massive addition for performance enthusiasts! This backend offers superior KV cache quantization using Hadamard rotation, better optimizations for MoE models, and improved CPU inference.
    • Transformers Enhancements: The engine now auto-detects `torch_dtype` from model configs rather than forcing half-precision, making the model loading process much smarter.

    API & UI Improvements

    • Enhanced Completions API: The `/v1/completions` endpoint now supports `echo` and `logprobs`, allowing you to see token-level probabilities and IDs. 📊
    • Snappier Interface: A custom Gradio fork has been optimized to save up to 50ms per UI event, making button clicks and transitions feel much smoother.
    • Smarter Templates: Instruction templates are now detected via model metadata instead of relying on old filename patterns.

    Security & Stability Fixes

    • Hardened Security: Fixed an ACL bypass in the Gradio fork for Windows/macOS and added server-side validation for various input groups like Dropdowns and Radio buttons. 🛡️
    • SSRF Protection: Added URL validation to `superbooga` extensions to block requests to private or internal networks.
    • Bug Squashing: Resolved several critical issues, including crashes related to Gemma 4 templates in llama.cpp and loading failures for Qwen3.5 MoE models.

    Portable Builds & Updates

    New self-contained packages are available for Windows, Linux, Mac, and various GPU architectures (NVIDIA CUDA, AMD Vulkan/ROCm, and Intel). If you’re using the portable version, updating is easier than ever: you can now use a shared `user_data` folder across multiple installs! 📂

    🔗 View Release

  • ComfyUI – v0.18.5

    ComfyUI v0.18.5 is officially live! 🚀

    For those of you building complex, node-based generative AI pipelines, ComfyUI continues to be the powerhouse engine for granular control over Stable Diffusion and beyond. Whether you’re orchestrating intricate image upscaling or multi-step video generation, this tool remains the gold standard for modularity and efficiency.

    This latest minor version update focuses on keeping your creative workflows smooth and reliable. Here is what’s new in v0.18.5:

    • Enhanced Stability: This patch includes refinements to existing code, specifically aimed at ensuring smoother operation when executing heavy or complex node sequences.
    • Core Maintenance: Routine upkeep from Comfy-Org that keeps the core engine current with its fast-moving upstream dependencies.

    If you’ve been pushing your hardware through massive workflows and want to ensure peak performance and stability, now is a great time to pull this update! 🛠️

    🔗 View Release

  • Text Generation Webui – v4.3.1

    text-generation-webui v4.3.1 is officially live, and it’s a massive one for anyone looking to push the boundaries of local LLM inference! 🚀

    This Gradio-based web UI is essentially the “AUTOMATIC1111” equivalent for text generation, providing a comprehensive interface to run Large Language Models locally with support for multiple backends like llama.cpp, Transformers, and ExLlama.

    Here’s what’s new in this release:

    • Model & Inference Upgrades:
    • 🆕 Gemma 4 Support: Full integration including tool-calling capabilities in both the API and UI.
    • ik_llama.cpp Backend: New support via portable builds (or the `--ik` flag for full installs) offering specialized optimizations for MoE models, improved CPU inference, and highly accurate KV cache quantization.
    • Transformers Optimization: The UI now auto-detects `torch_dtype` from model configs instead of forcing bf16/f16.
    • ExLlamaV3 Fixes: Resolved issues with Qwen3.5 MoE loading and fixed `ban_eos_token` functionality.
    • API Enhancements:
    • The `/v1/completions` endpoint now supports `echo` and `logprobs` parameters, returning token-level probabilities and new `top_logprobs_ids`.
    • Performance & UI Tweaks:
    • Snappier Interface: A custom Gradio fork has been optimized to save up to 50ms per UI event (like button clicks).
    • Smarter Templates: Instruction templates are now detected from model metadata rather than relying on filename patterns.
    • Security & Stability:
    • Fixed a critical ACL bypass in the Gradio fork for Windows/macOS.
    • Added server-side validation for input components (Dropdown, Radio, etc.).
    • Patched an SSRF vulnerability in superbooga extensions by validating fetched URLs against private networks.
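    If you want to poke at the Gemma 4 tool-calling path through the OpenAI-compatible API, the request carries a `tools` array in the standard OpenAI function-calling shape. A sketch, where `get_weather`, its parameters, and the model id are all invented for illustration:

```python
import json

# Hypothetical tool definition; `get_weather` and its single `city`
# parameter are made up for this example.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

request_body = {
    "model": "gemma-4",  # placeholder model id
    "messages": [{"role": "user", "content": "Weather in Paris?"}],
    "tools": tools,
}
payload = json.dumps(request_body)
```

    If the model decides to call the tool, the response's message carries a `tool_calls` entry instead of plain text.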

    πŸ› οΈ Pro-tip for updating: If you’re using a portable install, just download the latest version and replace your `user_data` folder. Since version 4.0, you can actually keep `user_data` one level up (next to your install folder) to make future updates even smoother!

    🔗 View Release

  • Text Generation Webui – v4.3

    🚨 Text-Generation-WebUI v4.3 is live! 🚨

    Hey AI tinkerers & devs: fresh update dropped, and it’s packed with performance wins, new backends, and security upgrades. Here’s the lowdown:

    🔹 🔥 Brand-new backend: `ik_llama.cpp`

    A high-octane fork by the imatrix creator, now baked into TGWUI:

    • ✅ New quant formats (Q4_K_M, Q6_K, etc.)
    • 🧠 Hadamard-based KV cache quantization: way more accurate, on by default
    • ⚡ Built for MoE models & CPU inference (yes, really fast)

    → Grab it via `textgen-portable-ik` or the `--ik` flag!

    🔹 🧠 API upgrades (OpenAI-compatible!)

    The `/v1/completions` endpoint now supports:

    • `echo`: Returns prompt + completion in one go
    • `logprobs`: Token-level log probabilities (prompt & generated)
    • `top_logprobs_ids`: Top token IDs per position, perfect for probing model confidence 🎯
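    Because `logprobs` values are natural-log probabilities, `exp()` turns them back into per-token probabilities. A sketch using a hand-made response fragment (the tokens and values below are invented; a real API response supplies them):

```python
import math

# Hand-made fragment in the classic OpenAI completions `logprobs` shape:
# parallel lists of tokens and their log-probabilities.
logprobs = {
    "tokens": [" Paris", " is", " the"],
    "token_logprobs": [-0.12, -0.48, -1.90],
}

# exp(logprob) recovers the probability the model assigned to each token.
confidence = {
    tok: math.exp(lp)
    for tok, lp in zip(logprobs["tokens"], logprobs["token_logprobs"])
}
```

    Low per-token probabilities flag the positions where the model was least certain, which is exactly what confidence probing is after.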

    🔹 🎨 Gradio UX + Security Boost

    • 🐌 Custom Gradio fork = ~50ms faster UI interactions
    • 🔒 Fixed ACL bypass (Windows/macOS path quirks)
    • ✅ Server-side validation for Dropdown/Radio/CheckboxGroup
    • 🛡️ SSRF fix in superbooga: blocks internal/private IPs

    🔹 🔧 Bug fixes & polish

    • `--idle-timeout` now works for encode/decode + parallel generations ✅
    • Stopping strings fixed (e.g., `<|return|>` vs `<|result|>`)
    • Qwen3.5 MoE loads cleanly via ExLlamaV3_HF
    • `ban_eos_token` finally works (EOS suppression at logit level)

    🔹 📦 Dependency upgrades

    • 🦙 `llama.cpp` → latest (`a1cfb64`) + Gemma-4 support
    • 🔄 `ExLlamaV3` → v0.0.28
    • 📦 `transformers` → 5.5
    • ✨ Auto-detects `torch_dtype` from model config (override with `--bf16`)
    • 🗑️ Removed obsolete `models/config.yaml`; templates are pulled from model metadata now

    🔹 📌 Terminology update

    “Truncation length” is now “context length” in logs (more accurate, less confusing!)

    🔹 📦 Portable builds: GGUF-ready & zero-install

    | Platform | Build to Use |
    |----------|--------------|
    | NVIDIA (old driver) | `cuda12.4` |
    | NVIDIA (new driver, CUDA >13) | `cuda13.1` |
    | AMD/Intel GPU | `vulkan` |
    | AMD (ROCm) | `rocm` |
    | CPU-only | `cpu` |
    | Apple Silicon | `macos-arm64` |
    | Intel Mac | `macos-x86_64` |

    πŸ” Updating? Just swap the folder β€” keep `user_data/`, and now you can even move it one level up for shared use across versions πŸŽ‰

    🔗 View Release