Author: Tater Totterson

  • Ollama – v0.20.3-rc0: model/parsers: add gemma4 tool call repair (#15374)

    Ollama v0.20.3-rc0 is officially live! 🚀

    If you are running local LLMs, you know that “agentic” workflows depend entirely on how well a model can call tools and functions. Even a tiny syntax error from the model can crash your entire pipeline. This release is a massive quality-of-life update specifically designed to bridge that gap.

    What’s new in this release:

    • Gemma 4 Tool Call Repair: Instead of letting a malformed tool call break your code, Ollama now features a “repair” layer. It uses a candidate pipeline to catch and fix syntax mistakes on the fly.
    • Smart Error Correction: The repair logic is fine-tuned to handle common model hiccups, such as:
      • Missing Gemma string delimiters.
      • Single-quoted string values or dangling delimiters.
      • Raw terminal strings that need proper formatting per the tool schema.
      • Missing object closing braces.
    • Enhanced Stability: This update includes new regression coverage and unit tests to ensure these repair helpers work reliably across various scenarios, preventing old bugs from resurfacing.
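    The release notes don’t include the repair logic itself (Ollama’s parser is written in Go), but the listed fixes can be approximated in a few lines of Python. This is a rough sketch under assumed input shapes, not the actual implementation:

```python
import json
import re

def repair_tool_call(raw: str) -> dict:
    """Best-effort repair of a malformed model tool call (sketch).

    Covers the error classes from the release notes: single-quoted
    string values, dangling trailing commas, and missing closing
    braces. Naive: does not handle apostrophes inside strings.
    """
    s = raw.strip()
    # Convert single-quoted values to double-quoted JSON strings.
    s = re.sub(r"'([^']*)'", r'"\1"', s)
    # Drop a dangling comma before a closing brace.
    s = re.sub(r",\s*}", "}", s)
    # Balance any missing closing braces.
    s += "}" * (s.count("{") - s.count("}"))
    return json.loads(s)
```

    The point is that a single malformed call no longer has to abort the whole agent loop; the parser gets a chance to recover first.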

    This is a huge win for anyone building autonomous agents or using Gemma 4 for function calling: it makes your local development much more robust and less prone to frustrating crashes! 🛠️

    🔗 View Release

  • Text Generation Webui – v4.4 – MCP server support!

    text-generation-webui (v4.4) 🚀

    This powerhouse Gradio web UI is essentially the “AUTOMATIC1111” for Large Language Models, providing a comprehensive local interface to run LLMs via backends like llama.cpp and Transformers. It’s the go-to tool for anyone wanting a private, offline, and highly customizable way to interact with models.

    The latest update is a massive one, focusing heavily on extensibility and UI polish! Here is what’s new:

    • Remote MCP Server Support: This is a game-changer! You can now connect to remote Model Context Protocol (MCP) servers directly from the Chat tab. The webui will automatically discover and use those tools alongside your local ones, massively expanding what your models can actually do.
    • Modernized UI: The interface has been polished with better contrast, improved scrollbars, and tighter spacing to make your chat experience feel more professional and less cluttered.
    • Gemma 4 Support: Thanks to an updated `ik_llama.cpp` dependency, you can now jump straight into running Gemma 4!
    • Enhanced Image Metadata: For those using the API for image generation, PNG files now include embedded metadata (seed, model, steps, etc.) so your settings are always baked right into the file.
    • Expanded Platform Support: New portable builds are available for Windows users running AMD hardware via ROCm.
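    Under the hood, MCP is JSON-RPC 2.0, so the tool discovery the webui performs comes down to messages like the one below. This is a minimal sketch of the request shape only, leaving out transport and the MCP initialization handshake:

```python
import json

def mcp_list_tools_request(request_id: int = 1) -> str:
    """Build an MCP 'tools/list' request (JSON-RPC 2.0) as a string.

    Message shape only; a real client first performs the MCP
    initialization handshake over stdio or HTTP before listing tools.
    """
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/list",
    })
```

    The server replies with a list of tool names and JSON schemas, which the webui can then offer to the model alongside its local tools.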

    Technical & Developer Notes:

    • API Refinements: Added `instruction_template` parameters to the model load endpoint and cleaned up deprecated settings.
    • Bug Fixes: Resolved critical issues including LaTeX rendering protection, crashes during prompt truncation, and server restart errors.
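    The release notes don’t show the new parameter in context, so the payload below is only a guess at how `instruction_template` might be passed when loading a model; the surrounding field names are assumptions, check the webui’s API documentation for the real shape:

```python
import json

# Hypothetical model-load payload; only "instruction_template" is
# mentioned in the release notes, the rest is illustrative.
load_request = {
    "model_name": "my-local-model",  # placeholder model folder name
    "settings": {"instruction_template": "Alpaca"},
}
payload = json.dumps(load_request)
```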

    πŸ› οΈ Pro-Tip for Tinkerers: If you use a portable installation, you can now move your `user_data` folder one level up (next to the install folder). This allows multiple versions of the webui to share the same models and settings, making updates a total breeze!

    🔗 View Release

  • Lemonade – v10.1.0

    The lemonade-sdk/lemonade library has just bumped up to version v10.1.0! 🍋

    If you’re looking to run Large Language Models (LLMs) locally with high performance, Lemonade is your go-to toolkit. It optimizes inference engines to leverage both GPUs and NPUs (like the AMD Ryzen AI series), making local LLM experiences faster and more responsive. Plus, it offers OpenAI API compatibility, so you can swap cloud services for your own hardware without breaking your workflow.
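    Because the server speaks the OpenAI API, pointing existing client code at it is mostly a matter of changing the base URL. A minimal sketch with the standard library, assuming a local server on port 8000 (the path and model name here are assumptions; check your Lemonade server’s settings):

```python
import json
from urllib import request

BASE_URL = "http://localhost:8000/api/v1"  # assumed local endpoint

# Build an OpenAI-style chat completion request; nothing is sent here.
body = json.dumps({
    "model": "local-model",  # placeholder model id
    "messages": [{"role": "user", "content": "Hello from my own hardware!"}],
}).encode("utf-8")

req = request.Request(
    f"{BASE_URL}/chat/completions",
    data=body,
    headers={"Content-Type": "application/json"},
)
# with request.urlopen(req) as resp:  # uncomment with a running server
#     print(json.load(resp)["choices"][0]["message"]["content"])
```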

    What’s new in this release:

    • Version Bump: The project has officially transitioned to version 10.1.0.
    • Maintenance Update: This release focuses on updating the core project versioning to ensure compatibility and streamlined dependency management for all you tinkerers out there.

    Whether you are using the Python SDK or the CLI, this update helps keep your local environment stable and ready for heavy lifting. Keep those builds running fast! 🚀

    🔗 View Release

  • Ollama – v0.20.2

    Ollama v0.20.2 is officially live! 🚀

    If you’re looking to run powerful large language models like Llama 3, DeepSeek-R1, or Mistral locally on your own hardware, Ollama remains the gold standard for making that process seamless and easy. It handles all the heavy lifting of model management so you can focus on tinkering and building.

    This latest release focuses on smoothing out your user experience:

    • Improved App Flow: The default home view has been updated to direct you straight into a new chat session rather than just launching the application interface. This small change helps you jump right into the conversation without extra clicks! 💬

    Keep those local environments running!

    🔗 View Release

  • Ollama – v0.20.1: Revert “enable flash attention for gemma4 (#15296)” (#15311)

    Ollama v0.20.1 is officially live! 🚀

    If you aren’t using Ollama yet, you are missing out on one of the best ways to run powerful Large Language Models (LLMs) like Llama 3, DeepSeek-R1, and Mistral locally on your own hardware. It’s a total game-changer for privacy-conscious tinkerers and devs who want to experiment with AI without relying on cloud APIs.

    This latest release is a targeted maintenance update focused on stability:

    • Flash Attention Reversion: The team has reverted the “enable flash attention for gemma4” feature. 🔄

    Why does this matter?

    While Flash Attention is an awesome optimization for speed, it looks like the developers decided to pull it back for now, likely to iron out some unexpected behavior or stability issues specifically with Gemma 4 models.

    If you’ve been experiencing weirdness or crashes while running Gemma 4 with flash attention enabled, updating to v0.20.1 should get your local environment back into a much more predictable and stable state! 🛠️

    🔗 View Release

  • Text Generation Webui – v4.3.3 – Gemma 4 support!

    text-generation-webui just dropped a massive update! If you’re looking for the “AUTOMATIC1111” experience for local LLMs, this Gradio-based powerhouse is now even more capable and snappy. 🚀

    Here is the breakdown of what’s new in this release:

    🧠 New Model & Backend Support

    • Gemma 4 Integration: Full support is officially live! You can now run Gemma 4 with full tool-calling capabilities via both the UI and the API.
    • ik_llama.cpp Backend: A brand new backend option has arrived, offering much more accurate KV cache quantization (via Hadamard rotation) and specialized optimizations for MoE models and CPU inference.
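    Tool calling goes through the standard OpenAI-style `tools` field on the chat endpoint. A sketch of what such a request body could look like (the `get_current_time` function is a made-up example, and the model depends on what you have loaded):

```python
import json

# OpenAI-style tool definition; get_current_time is hypothetical.
tools = [{
    "type": "function",
    "function": {
        "name": "get_current_time",
        "description": "Return the current time in a given timezone.",
        "parameters": {
            "type": "object",
            "properties": {"timezone": {"type": "string"}},
            "required": ["timezone"],
        },
    },
}]

chat_request = json.dumps({
    "messages": [{"role": "user", "content": "What time is it in Tokyo?"}],
    "tools": tools,
})
```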

    πŸ› οΈ API & Transformer Enhancements

    • Enhanced Completions: The `/v1/completions` endpoint now supports `echo` and `logprobs`, giving you deep visibility into token-level probabilities.
    • Smarter Model Loading: The system now auto-detects `torch_dtype` from model configs, providing way more flexibility than the previous forced half-precision method.
    • Metadata-Driven Templates: Instruction templates are now intelligently detected via model metadata instead of relying on filename patterns.
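    In practice this means a completion request can ask the server to echo the prompt and return per-token probabilities. A sketch of the request fields and of pulling logprobs out of a response (the response dict below is illustrative, not real server output):

```python
# Request fields for the OpenAI-style /v1/completions endpoint:
completion_request = {
    "prompt": "The capital of France is",
    "max_tokens": 1,
    "echo": True,    # return the prompt tokens in the output
    "logprobs": 5,   # include top-5 log-probabilities per token
}

# Illustrative response fragment showing where logprobs live:
sample_response = {
    "choices": [{
        "text": "The capital of France is Paris",
        "logprobs": {
            "tokens": [" Paris"],
            "token_logprobs": [-0.12],
        },
    }],
}

logprobs = sample_response["choices"][0]["logprobs"]
```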

    ⚡ Performance & UI Polish

    • Snappier Interface: A custom Gradio fork has been tuned to save up to 50ms per UI event, making the whole experience feel much more responsive.
    • Critical Bug Fixes: Resolved several issues including dropdown crashes, API parsing errors for non-dict JSON tool calls, and `llama.cpp` template parsing bugs.

    πŸ›‘οΈ Security & Stability

    • Hardened Protections: Implemented ACL/SSRF fixes for extensions, patched path-matching bypasses on Windows/macOS, and added filename sanitization to prevent manipulation during prompt file operations.

    📦 Portable Build Upgrades

    New self-contained packages are available for NVIDIA, AMD, Intel, Apple Silicon, and CPU users! Pro tip: You can now move your `user_data` folder one level up to easily share settings across multiple version installs. 🛠️

    🔗 View Release

  • Ollama – v0.20.1-rc2: model/parsers: rework gemma4 tool call handling (#15306)

    Ollama v0.20.1-rc2 is officially here, and it’s bringing some serious precision to how your local engine handles model interactions! 🛠️

    If you’ve been using Ollama to run LLMs like Llama 3, Mistral, or Gemma locally, you know it’s the backbone for building private AI applications. This latest release focuses heavily on refining the way specific models communicate with your system.

    What’s new in this release:

    • Gemma4 Tool Call Overhaul: The developers have completely reworked how Gemma4 handles tool calls. By replacing the old custom argument normalizer with a much stricter reference-style conversion, model interactions are now significantly more reliable.
    • Improved Data Integrity: This update is a win for stability! It ensures that quoted strings remain strings, bare keys get properly quoted, and unquoted values maintain their correct types during the JSON unmarshalling process.
    • Enhanced Error Handling: New test coverage has been added to catch malformed raw-quoted inputs. This ensures Ollama behaves exactly like the official reference implementation, reducing those pesky unexpected errors.
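    The release notes describe the behavior rather than the code (which lives in Ollama’s Go parser), but the idea can be sketched in Python: quote bare keys so relaxed model output parses as JSON, while already-quoted strings and unquoted literals keep their types. A naive approximation:

```python
import json
import re

def normalize_args(raw: str) -> dict:
    """Quote bare object keys in relaxed tool-call arguments.

    Quoted strings stay strings, bare keys get quoted, and unquoted
    values (numbers, booleans) keep their types after json.loads.
    Naive sketch: does not handle braces or colons inside strings.
    """
    fixed = re.sub(r'([{,]\s*)([A-Za-z_]\w*)\s*:', r'\1"\2":', raw)
    return json.loads(fixed)
```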

    If you are currently experimenting with Gemma4 for agentic workflows or complex tool use, this update is a must-have to make your model interactions more predictable and robust! 🚀

    🔗 View Release

  • Ollama – v0.20.1-rc1: ggml: fix ROCm build for cublasGemmBatchedEx reserve wrapper

    Ollama v0.20.1-rc1 is officially live, bringing some much-needed stability for the AMD crowd! 🚀

    If you’ve been trying to leverage your AMD GPU to run local LLMs like Llama 3 or DeepSeek-R1, this release is a critical one. It focuses heavily on refining the ROCm build, ensuring that hardware acceleration is smoother and more reliable for those of us not using NVIDIA.

    What’s new in this release:

    • Fixed ROCm Build: Resolved specific issues within the `ggml` library to prevent crashes and improve stability when running on AMD GPUs.
    • Improved Type Mapping: Added missing mappings between `cublasGemmAlgo_t` and `hipblasGemmAlgo_t`, which helps with smoother communication between software layers.
    • Wrapper Optimization: Fixed a bug in the `cublasGemmBatchedEx` reserve wrapper by correcting how const qualifiers are handled, ensuring compatibility with `hipblasGemmBatchedEx`.

    This is a great update for anyone building a local AI workstation around AMD hardware. Grab the update and get those models running! 🛠️

    🔗 View Release

  • Ollama – v0.20.1-rc0

    Ollama v0.20.1-rc0 is officially hitting the scene! 🚀

    If you’re looking to run powerful LLMs like Llama 3, DeepSeek-R1, or Mistral locally without relying on expensive cloud subscriptions, Ollama remains the gold standard for your local dev environment. It handles all the heavy lifting of downloading and managing models across macOS, Windows, and Linux.

    This latest release candidate is all about squeezing more performance out of your hardware:

    • Flash Attention Support for Gemma: This update brings Flash Attention specifically to the Gemma model family. 🧠
    • The Impact: By utilizing this clever algorithm, you’ll see significantly faster inference times and much lower memory consumption when running Gemma models on your machine.
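    The release notes don’t cover configuration, but flash attention in Ollama is normally toggled server-side with the `OLLAMA_FLASH_ATTENTION` environment variable, set before starting `ollama serve`. A tiny sketch of preparing such an environment, assuming that toggle applies to your setup:

```python
import os

# Copy the current environment and enable flash attention for the server.
serve_env = {**os.environ, "OLLAMA_FLASH_ATTENTION": "1"}
# subprocess.run(["ollama", "serve"], env=serve_env)  # with Ollama installed
```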

    For those of us tinkering with local workflows, these optimizations mean smoother interactions and more efficient processing power! 🛠️

    🔗 View Release

  • Text Generation Webui – v4.3.2

    text-generation-webui v4.3.2 is officially live! 🚀 This Gradio-based powerhouse is the go-to interface for running LLMs locally, and this update brings some serious heavy-hitting performance boosts and expanded model support for all you tinkerers out there.

    Here is the breakdown of what’s new in this release:

    Core Model & Backend Upgrades

    • Gemma 4 Support: You can now run Gemma 4 with full tool-calling capabilities enabled in both the API and the UI. 🆕
    • New `ik_llama.cpp` Backend: A massive addition for performance enthusiasts! This backend offers superior KV cache quantization using Hadamard rotation, better optimizations for MoE models, and improved CPU inference.
    • Transformers Enhancements: The engine now auto-detects `torch_dtype` from model configs rather than forcing half-precision, making the model loading process much smarter.

    API & UI Improvements

    • Enhanced Completions API: The `/v1/completions` endpoint now supports `echo` and `logprobs`, allowing you to see token-level probabilities and IDs. 📊
    • Snappier Interface: A custom Gradio fork has been optimized to save up to 50ms per UI event, making button clicks and transitions feel much smoother.
    • Smarter Templates: Instruction templates are now detected via model metadata instead of relying on old filename patterns.

    Security & Stability Fixes

    • Hardened Security: Fixed an ACL bypass in the Gradio fork for Windows/macOS and added server-side validation for various input groups like Dropdowns and Radio buttons. 🛡️
    • SSRF Protection: Added URL validation to `superbooga` extensions to block requests to private or internal networks.
    • Bug Squashing: Resolved several critical issues, including crashes related to Gemma 4 templates in llama.cpp and loading failures for Qwen3.5 MoE models.

    Portable Builds & Updates

    New self-contained packages are available for Windows, Linux, Mac, and various GPU architectures (NVIDIA CUDA, AMD Vulkan/ROCm, and Intel). If you’re using the portable version, updating is easier than ever: you can now use a shared `user_data` folder across multiple installs! 📂

    🔗 View Release