Gemmaclaw - Community

Community & Hardware Reports

Real-world experiences running Gemma models, curated from the community. Browse hardware reports, read the weekly field notes, or search for your setup.

Field Notes

A weekly synthesis of what the r/LocalLLaMA community is reporting about Gemma 4 in real use.

Field Notes - 2026-07-28

A weekly synthesis of what the r/LocalLLaMA community is reporting about Gemma 4 in real use. Curated from the latest Gemma-mentioning posts (9 new hardware-mention entries from the July 26-27 ingest, 573 entries total) and their threads. Confidence is low this cycle, but it is a rare numbers-bearing one: after several qualitative sweeps, three of the nine new posts carry real measured throughput tables, and the center of gravity moved to integrated-graphics and edge inference (an AMD APU, an Intel Arc iGPU) plus one two-box multi-GPU RPC result. Every new post is still a single-author, placeholder-score (about 20), zero-comment item from the Atom fallback, so none is community-corroborated and all figures are one person's measurements on one machine. Read the tables as leads, not settled benchmarks.

July 28 sweep, 2026-07-28 00:00 UTC: an integrated-graphics and edge cycle with a multi-GPU footnote. The July 26-27 ingest surfaced nine Gemma-mentioning entries. The three that matter carry numbers: a full llama.cpp Vulkan quant sweep of Gemma 4 26B A4B on an AMD Ryzen 7 6800H APU (Radeon 680M, shared system memory), a LiteRT-LM versus llama.cpp prefill comparison for Gemma 4 E2B on an Intel Arc iGPU, and a tensor-parallel RPC run of Gemma 4 31B split across an RTX 5090 and an RTX 4090 on two PCs over 10GbE. A fourth adds a backend note (Vulkan faster than ROCm for Gemma 4 E4B on an RX 6650 XT). The rest are configuration and buying questions (a 24 GB RX 7900XTX remote box, a Sapphire R9700 noise question, a 4 GB-VRAM user priced out of Gemma 4 entirely) plus two quality items (a Gemma 4 31B that grew an unprompted "attitude", and a 23-way comparison of Gemma 4 E4B fine-tunes finding the most-downloaded one the most broken). These add genuinely new integrated-graphics and multi-GPU datapoints but do not overturn any prior tier pick.

Integrated graphics, measured: on an AMD Ryzen 7 6800H APU with the Radeon 680M iGPU and no dedicated VRAM, Gemma 4 26B A4B at Q4_0 reached 18.35 tok/s decode (312.67 tok/s prefill), the sweet spot of a full quant sweep, while the dense 31B at Q8_0 was effectively unusable at 2.30 tok/s. A user benchmarked the new Gemma 4 and Qwen 3.6 MoE models on an AMD Ryzen 7 6800H mini-PC using llama.cpp on Kubuntu with the Vulkan backend, running entirely on shared system memory (UMA) through the Radeon 680M iGPU (no dedicated VRAM; the author notes that varying the memory allocation from 1 GB to 16 GB did not change inference). The measured llama.cpp table, as prefill (pp512) over decode (tg128) in tok/s, was: Gemma 4 26B A4B Q4_0 (13.26 GiB): 312.67 / 18.35; 26B A4B MXFP4 MoE (15.40 GiB): 261.32 / 11.93; 26B A4B Q4_K Medium (15.77 GiB): 258.16 / 11.92; 26B A4B NVFP4 (16.45 GiB): 152.35 / 7.53; and the dense 31B Q8_0 (16.74 GiB): 30.26 / 2.30. For context the author also ran GPT-OSS 20B Q6_K (353.87 / 16.85) and Qwen 3.6 35B A3B NVFP4 (153.75 / 15.05). The practical reading for the APU and integrated-graphics tier is clear: the sparse 26B A4B is genuinely usable on a 6800H-class APU if you pick a plain Q4_0, but NVFP4 is a trap on this hardware (no FP4 acceleration on RDNA2, so it lands at under half the Q4_0 decode), and the dense 31B is a non-starter at Q8_0. Confidence: low, a single author, placeholder score, no comments, but the numbers are internally consistent and it is the most useful integrated-graphics Gemma 4 benchmark this tracker has captured. (source, July 27, 2026)

Edge inference, measured: on an Intel Arc iGPU (Core Ultra 7 155U, Meteor Lake, no matrix cores), Google's LiteRT-LM ran Gemma 4 E2B prefill 2.7 to 3.2 times faster than llama.cpp Vulkan, cutting time-to-first-token at 32k context from about 3.5 minutes to 80 seconds, though llama.cpp with MTP still won on decode. A user ran a head-to-head of LiteRT-LM (WebGPU / ML-Drift backend) against llama.cpp (Vulkan) for Gemma 4 E2B on an Intel Core Ultra 7 155U with the Arc iGPU (4 Xe-cores, UMA), 16 GB LPDDR5x, Windows 11 (llama.cpp ran a Q4_K_M GGUF; LiteRT-LM an auto-int4 .litertlm). The measured prefill and time-to-first-token comparison, from llama.cpp to LiteRT-LM, was: at 4,096 tokens 267 to 853 tok/s (3.2x, TTFT 15.3s to 4.8s); at 8,192 tokens 241 to 771 tok/s (3.2x, 34.0s to 10.6s); at 22,000 tokens 185 to 500 tok/s (2.7x, 119.0s to 44.0s); and at 32,000 tokens 152 to 404 tok/s (2.7x, 210s to 80s). On decode the order flips: LiteRT-LM managed 23.2 tok/s with speculative decoding off and 20.4 tok/s with it on, while llama.cpp with MTP reached about 30.0 tok/s (the author flags a bug in the LiteRT speculative-decode path, and the captured excerpt truncates there). The post's headline claims "up to 3.5x", but the captured table tops out at 3.2x, so treat 2.7 to 3.2x as the measured prefill range. The useful signal for the edge and low-power tier is that for the smallest Gemma 4 (E2B), a WebGPU runtime can dramatically cut long-context prompt-processing latency on a matrix-core-less Intel iGPU, at the cost of decode throughput versus llama.cpp plus MTP. Confidence: low, single author, one device, placeholder score, no comments. (source, July 27, 2026)

Multi-GPU, measured: Gemma 4 31B split across an RTX 5090 and an RTX 4090 on two separate PCs, tensor-parallel over a 10GbE link via llama.cpp RPC, ran at about 28 tok/s on a fresh session and dropped to about 17 tok/s by 100k context, only marginally above a single 5090 at about 24 to 26 tok/s. A user running a self-compiled llama.cpp on Windows 11 put a 5090 and a 4090 on two different machines and served Gemma 4 31B (q4) across a 10GbE link using RPC with tensor parallelism (not layer split). They report about 28 tok/s on a fresh, low-context session, falling to about 17 tok/s by 100k context, versus about 24 to 26 tok/s on the single 5090 alone at low context on the same q4, and are asking whether physically co-locating the 4090 in one box would be worth it. The honest read for the multi-GPU tier is that a 10GbE RPC tensor-parallel split of the dense 31B buys only a few tok/s over one 5090 at short context and loses ground as context grows, so it does not obviously justify the second machine and network hop, though the poster wants to know if same-host PCIe would change that. Confidence: low, single author, placeholder score, no comments, and it is framed as an open question rather than a settled result. (source, July 27, 2026)

AMD backend note: on an RX 6650 XT (RDNA2) under Linux with LM Studio, Gemma 4 E4B ran about 8 to 10 tok/s faster on Vulkan than on ROCm, consistently. A user compared ROCm and Vulkan (both runtime 2.27.1) in LM Studio on Linux with an RX 6650 XT, and found Gemma 4 E4B on Vulkan faster by about 8 to 10 tok/s every time, at a 26,200-token context. No absolute throughput was given, only the delta. This lines up with the standing pattern across prior sweeps that Vulkan is often the better AMD path for Gemma 4 on consumer RDNA cards, and adds a concrete E4B datapoint on RDNA2. Confidence: low, single author, relative numbers only, placeholder score, no comments. (source, July 27, 2026)

Configuration and buying datapoints (no throughput): a 24 GB RX 7900XTX fits Gemma 4 26B and 31B for remote use, a Sapphire R9700 is being weighed for Gemma 4 31B, and a 4 GB-VRAM user is priced out of Gemma 4 entirely. Three posts add config and buying context without measurements. One user set up a Ryzen 5 7600X, RX 7900XTX 24 GB, 32 GB DDR5-5600 gaming PC for remote access over Tailscale (ssh and rustdesk) and notes that 24 GB comfortably fits Qwen 3.6 27B, Gemma 4 26B and 31B, with CPU offload for larger models, asking which models and frameworks are actually useful for remote software-engineering work (they acknowledge these are not on par with ChatGPT or Claude). A second is weighing a Sapphire R9700 purely on fan noise (they run a quiet 7800XT today and are willing to undervolt and underclock), planning to use it for Qwen 27B and Gemma 4 31B. A third, with only 4 GB VRAM and 40 GB RAM, says Gemma 4 and Qwen 3.6 are out of reach and is asking for the smallest usable agentic-coding model instead, a useful reminder that the current Gemma 4 lineup does not comfortably serve the sub-8 GB VRAM tier. None of these carries a benchmark. Confidence: low, anecdotal configs and open questions, placeholder scores, no comments. (source, July 27, 2026; source, July 26, 2026; source, July 27, 2026)

Quality and reliability notes: a Gemma 4 31B grew an unprompted "attitude" its owner could not reproduce, and a 23-way comparison of Gemma 4 E4B fine-tunes found the most-downloaded one to be the most broken. Two posts speak to behavior rather than hardware. One user reports their local Gemma 4 31B spontaneously developed a sarcastic, self-critical persona (it "roasted me for my mistakes", blamed itself for errors, and produced human-like asides) with no system prompt asking for it, and says they cannot reproduce it in new chats. It is an unverified single anecdote with no known reproduction path, so it belongs as a curiosity rather than guidance, but it is a reminder that Gemma 4 31B's persona can drift from session to session. The second, more actionable, is a comparison of 23 Gemma 4 E4B models from HuggingFace (abliterations and fine-tunes) run through the author's "abliterlitics" benchmark gauntlet, whose headline finding is that the most-downloaded model in the set is also the most broken, with the full report and data published on HuggingFace and the author's site. The takeaway for anyone reaching for a Gemma 4 E4B derivative is that download count is not a proxy for quality, and it is worth checking the actual comparison before trusting a popular abliteration. Confidence: low for both, single-author posts, placeholder scores, no comments; the E4B comparison links full data but has not been independently reviewed here. (source, July 27, 2026; source, July 26, 2026)

Best current setup (this cycle's additions)

Integrated graphics / APU (new measured pick): on an AMD Ryzen 7 6800H APU (Radeon 680M, shared system memory), Gemma 4 26B A4B at Q4_0 is the sweet spot at 18.35 tok/s decode and 312.67 tok/s prefill; avoid NVFP4 on this hardware (7.53 tok/s, no FP4 acceleration on RDNA2) and skip the dense 31B at Q8_0 (2.30 tok/s, unusable) (1v8adin).
Edge / low-power (E2B): for the smallest Gemma 4 on an Intel Arc iGPU, LiteRT-LM cuts long-context prefill and time-to-first-token by 2.7 to 3.2x over llama.cpp Vulkan (32k TTFT about 80s versus 3.5 minutes), while llama.cpp plus MTP still gives the better decode (about 30 versus 23 tok/s) (1v850zn).
Single 24 GB GPU: a 24 GB card (here an RX 7900XTX) still comfortably holds Gemma 4 26B and 31B; on AMD, prefer Vulkan over ROCm where you can (about 8 to 10 tok/s faster for E4B on RDNA2) (1v83ace, 1v88p7k).
No change to prior tiers otherwise: the single-consumer-NVIDIA, Apple Silicon, and enterprise picks from the July 16 through July 26 sweeps stand; nothing this cycle contradicts them.

What works

Gemma 4 26B A4B is usable on an integrated-graphics APU at Q4_0: 18.35 tok/s decode on an AMD 6800H (Radeon 680M) with no dedicated VRAM (1v8adin).
A WebGPU runtime slashes edge prefill latency: LiteRT-LM reaches first token in about 80 seconds at 32k context for Gemma 4 E2B on an Intel Arc iGPU, versus about 3.5 minutes for llama.cpp Vulkan (1v850zn).
A 10GbE RPC tensor-parallel split runs the dense 31B across a 5090 and a 4090 at about 28 tok/s fresh, proving the multi-box path works even if the gain over one 5090 is small (1v7t7dn).
A 24 GB AMD card holds both Gemma 4 26B and 31B for remote self-hosted use (1v83ace).

Known limits

NVFP4 is slow on RDNA2 integrated graphics: Gemma 4 26B A4B NVFP4 managed only 7.53 tok/s on the 6800H APU, well under half the Q4_0 result, because there is no FP4 acceleration on that iGPU (1v8adin).
The dense Gemma 4 31B is impractical on an APU: 2.30 tok/s at Q8_0 on the 6800H (1v8adin).
Multi-GPU RPC over 10GbE barely beats one 5090: about 28 tok/s (dropping to 17 at 100k context) versus 24 to 26 tok/s single-GPU for Gemma 4 31B, so the second machine and network hop may not be worth it (1v7t7dn).
The sub-8 GB VRAM tier is priced out of Gemma 4: a 4 GB-VRAM user reports Gemma 4 (and Qwen 3.6) out of reach and is seeking smaller alternatives (1v7sqa7).
Popular Gemma 4 E4B fine-tunes can be broken: a 23-way comparison found the most-downloaded E4B model the most broken, so download count is not a quality signal (1v73ux4).
Every new post is a single-author, placeholder-score, zero-comment Atom-fallback item, so even the measured tables are one person's run on one machine, unreplicated.

Open questions

Does same-host PCIe beat a 10GbE RPC split for Gemma 4 31B? The poster wants to know whether moving the 4090 into the 5090's box would meaningfully raise the roughly 28 tok/s RPC result (1v7t7dn).
How does LiteRT-LM's Gemma 4 E2B prefill edge hold up on other iGPUs and larger Gemma 4 sizes? The result is one Intel Arc device, E2B only, with a known bug in the speculative-decode path (1v850zn).
What is the best sub-8 GB VRAM path for agentic coding now that Gemma 4 does not comfortably fit 4 GB cards (1v7sqa7)?
Is the Gemma 4 31B "attitude" persona reproducible, and what sampling or context conditions trigger it (1v7kf8o)?

Sources

The Gemma-related posts driving this update (July 28 sweep, newest first). All are placeholder-score (about 20), zero-comment items from the Atom-fallback ingest, so weight each as a single-author datapoint; three carry measured throughput tables:

Testing Gemma 4 and Qwen 3.6 MoE on AMD 6800H (iGPU/UMA) (Jul 27, 2026, AMD Ryzen 7 6800H, Radeon 680M iGPU, shared system memory, llama.cpp Kubuntu Vulkan; measured prefill/decode for Gemma 4 26B A4B at Q4_0 312.67/18.35, MXFP4 261.32/11.93, Q4_K Medium 258.16/11.92, NVFP4 152.35/7.53, and dense 31B Q8_0 30.26/2.30 tok/s)
LiteRT-LM is up to 3.5x faster than llama.cpp on Intel Arc iGPU (Gemma-4 E2B Benchmark) (Jul 27, 2026, Intel Core Ultra 7 155U Arc iGPU, 16 GB LPDDR5x; LiteRT-LM WebGPU versus llama.cpp Vulkan prefill 2.7 to 3.2x faster for Gemma 4 E2B, 32k TTFT 80s versus 210s, but llama.cpp plus MTP wins decode about 30 versus 23 tok/s)
Need Advice: Llama.cpp Tensor Parallelism RPC vs Single Node Performance (Jul 27, 2026, RTX 5090 plus RTX 4090 on two PCs over 10GbE, tensor-parallel RPC; Gemma 4 31B q4 about 28 tok/s fresh, 17 at 100k, versus 24 to 26 on a single 5090)
Is ROCM slower than vulkan for you? (Jul 27, 2026, RX 6650 XT, LM Studio 2.27.1 Linux; Gemma 4 E4B about 8 to 10 tok/s faster on Vulkan than ROCm at 26,200 context)
What To Run on A Remotely Accessible Gaming PC? (Jul 27, 2026, Ryzen 5 7600X, RX 7900XTX 24 GB, 32 GB DDR5, Tailscale remote; notes 24 GB fits Qwen 3.6 27B, Gemma 4 26B and 31B, a configuration question with no numbers)
Current smallest usable coding model (Jul 27, 2026, 4 GB VRAM, 40 GB RAM; Gemma 4 and Qwen 3.6 out of reach, seeking the smallest usable agentic-coding model)
My local AI developed a personality with an "attitude" (Jul 27, 2026, Gemma 4 31B grew an unprompted sarcastic persona, not reproducible, a behavioral anecdote with no hardware)
23 Gemma4-E4B models compared with abliterlitics (Jul 26, 2026, 23 Gemma 4 E4B fine-tunes and abliterations benchmarked; the most-downloaded is the most broken, full report on HuggingFace and the author's site)
Sapphire r9700 fan noise (Jul 26, 2026, a buying and noise question for a Sapphire R9700 versus a quiet 7800XT, planned for Qwen 27B and Gemma 4 31B, no performance data)

_Last updated: 2026-07-28 (July 28 sweep). Confidence: low but numbers-bearing (nine placeholder-score, zero-comment Atom-fallback posts, three with measured throughput tables, none independently replicated). Key points: this was an integrated-graphics and edge cycle. On an AMD 6800H APU (Radeon 680M, shared memory), Gemma 4 26B A4B at Q4_0 is the usable sweet spot at 18.35 tok/s decode, NVFP4 is slow at 7.53 tok/s (no FP4 on RDNA2), and the dense 31B at Q8_0 is unusable at 2.30 tok/s. On an Intel Arc iGPU, LiteRT-LM cuts Gemma 4 E2B prefill 2.7 to 3.2x over llama.cpp Vulkan (32k TTFT about 80s versus 3.5 minutes) while llama.cpp plus MTP keeps the decode lead. A 10GbE RPC tensor-parallel split of Gemma 4 31B across a 5090 and 4090 hit about 28 tok/s fresh, only just above a single 5090. On AMD RDNA2, Vulkan beat ROCm by about 8 to 10 tok/s for E4B. Config and quality notes: a 24 GB RX 7900XTX fits 26B and 31B, a 4 GB-VRAM user is priced out of Gemma 4, and the most-downloaded Gemma 4 E4B fine-tune was found the most broken. No prior tier guidance changes. Next update fires when the daily Gemma 4 research cron flags notable new findings._

---

Field Notes - 2026-07-26

A weekly synthesis of what the r/LocalLLaMA community is reporting about Gemma 4 in real use. Curated from the latest Gemma-mentioning posts (10 new hardware-mention entries from the July 25 ingest, 564 entries total) and their threads. Confidence is low this cycle. It is a broad, practical, consumer-hardware sweep rather than a benchmark cycle: most items are setup questions, model-selection threads, and backend-stability reports, not measured throughput. Only two carry numbers, and both are weak evidence: an open-source runtime (TensorSharp) whose author reports it running Gemma 4 E4B roughly on par with llama.cpp on CUDA, and a triple-GPU Vulkan benchmark whose actual per-model tok/s table did not survive our archive capture. The rest are qualitative: two reports that Gemma 4 is good but gets beaten by Qwen 3.6 on the poster's tasks, a single-RTX-3090 owner weighing Ollama (stable but slow) against Unsloth (fast but crash-prone), a low-quant reasoning loop, and several entry-level AMD and Apple Silicon configs. Every new post arrived through the Atom fallback with a placeholder score around 20 and no captured comments, so treat every item as a single-author datapoint.

July 26 sweep, 2026-07-26 00:00 UTC: a wide but shallow cycle spanning single-GPU NVIDIA, older multi-GPU, mid-VRAM AMD, and Apple Silicon laptops. The July 25 ingest of 80 posts surfaced ten Gemma-mentioning hardware entries. None of them moves a tier pick. The two closest to measured are both hedged: TensorSharp is a project announcement with self-reported ratios, and the triple-GPU 1080 Ti build lists the Gemma 4 quants it ran but the throughput table was truncated in capture, so we present it as a configuration datapoint only. The strongest signals are directional: on consumer single-GPU boxes people keep reaching for Qwen 3.6 for coding and agentic work while still keeping Gemma 4 around for chat and tool use, and backend choice (Ollama vs Unsloth vs llama.cpp) is now as much about stability as speed. The tier guidance carries over unchanged from the July 16 through July 25 sweeps, with two caveats added (single-3090 backend stability, and low-quant reasoning loops).

Single RTX 3090, the recurring question is backend stability, not raw speed: one owner reports Ollama plus OpenWebUI was slow with Gemma 4 and Qwen 3.6 but stable, while Unsloth was fast with working search and tool calls but crashed repeatedly. A user with an R7 5700X, 48 GB DDR4, and an RTX 3090 24 GB wants to move coding, web and PDF lookups, and config generation off cloud providers. They tried Ollama plus OpenWebUI (ran the models slowly, and OpenWebUI was occasionally slow on web and MCP, but stayed stable) and Unsloth Studio (quick, with search and tool calls working perfectly, but unstable enough that the models crashed a few times), and are asking what the go-to stack is for a 24 GB plus 48 GB machine. No throughput numbers are given. The practical read for the single-GPU tier is that the backend tradeoff is real and worth flagging: on a 3090, the fast tool-calling path some users find (Unsloth) is also the one they report as crash-prone, while the stable path (Ollama plus OpenWebUI) is slower, and llama.cpp remains the default worth trying for a middle ground. Confidence: low, an open question from a single user, placeholder score, no comments, no measured tok/s. (source, July 25, 2026)

Single RTX 3090, the other recurring limit is context length: a sysadmin running Gemma 4 26B A4B and Qwen 3.6 27B in LMStudio says it works very well for light coding and log analysis but hits the context ceiling fast, and is weighing a second GPU (leaning AMD) purely for more VRAM. A user on an i5-12400, 64 GB RAM, RTX 3090, Fedora KDE, serving Qwen 3.6 27B and Gemma 4 26B A4B through LMStudio into VSCode with kilocode and Continue, reports the setup is solid for scripting and log work but that the context-size ceiling blocks bigger projects. They want to add VRAM and, being on Linux, are eyeing a 16 GB Radeon (9070-class) AMD card, asking how well mixing a 24 GB NVIDIA plus 16 GB AMD GPU works in 2026. This is a configuration and buying-advice datapoint rather than a benchmark, but it reinforces a standing single-GPU theme for Gemma 4 26B A4B on a 3090: the model fits and runs well, and the wall users hit first is usable context, which is what pushes them toward a second card. Confidence: low, anecdotal, placeholder score, no comments, no numbers, and the AMD-plus-NVIDIA mixing question is unresolved in-thread. (source, July 25, 2026)

Preference signal, restated on newer 16 GB hardware: an RTX 5070 Ti 16 GB owner says Gemma 4 was good with great tool usage, but it was beaten by Qwen in their collection. A user building a local model vault on an RTX 5070 Ti 16 GB, i9-14900K, 64 GB DDR5, llama.cpp server says they were disappointed by GPT-OSS tool use, and that Gemma 4 was good and had great tool usage but was beaten by Qwen, whose Qwen 3.6 35B-A3B and Qwen3-Coder variants fill most of their roster. A second, broader thread asking which medium MoE models (up to 60B) are worth using lists Gemma 4 A4B alongside Qwen 3.5 35B and Nemotron 3 Nano but concludes Qwen seems to dominate this bracket. Together these are two more low-confidence votes for the same pattern this tracker has seen for weeks: Gemma 4 is respected for chat quality and tool use, but on coding and agentic collections users keep landing on Qwen 3.6. Neither post gives hardware throughput. Confidence: low, two qualitative single-author opinions, placeholder scores, no comments. (source, July 25, 2026; source, July 25, 2026)

Older multi-GPU, a configuration datapoint (numbers not captured): a single system with a GTX 1080 Ti 11 GB plus two P102-100 10 GB cards (31 GB combined) ran a full sweep of Gemma 4 quants under llama.cpp Vulkan on Kubuntu. A user benchmarked a triple-GPU Vulkan box (GTX 1080 Ti 11 GB plus two NVIDIA P102-100 10 GB, 31 GB combined VRAM, Ryzen 5 3600, 48 GB DDR4, Kubuntu 26.04, llama.cpp Ubuntu Vulkan build 10107) across many models, including Gemma 4 26B A4B UD Q4_K_XL, Gemma 4 26B A4B NVFP4, Gemma 4 26B A4B UD Q6_K_XL, and dense Gemma 4 31B UD Q4_K_XL (plus medgemma 27B on Gemma 3, and several Qwen 3.6 and Qwen3-Coder models). Note on evidence: the per-model tok/s table the poster produced did not survive our archive capture (the excerpt cuts off at the Vulkan device enumeration), so we deliberately cite no throughput numbers here. What it does establish is a genuinely low-cost, older-NVIDIA multi-GPU path: on the used market these are inexpensive cards, and 31 GB of combined VRAM is enough to load the 26B A4B MoE at Q6_K_XL or the dense 31B at Q4_K_XL. The P102-100 cards lack fp16, bf16, and fp4 support (Vulkan reports them as compute-mining-derived), so expect this to be a budget, memory-first configuration rather than a fast one, and check the original post for the actual speeds. Confidence: low as a benchmark (numbers not captured), useful as a config example. (source, July 25, 2026)

Runtime comparison, author-reported: an open-source inference engine (TensorSharp) reports Gemma 4 E4B running roughly on par with llama.cpp on CUDA, with a modest prefill and time-to-first-token edge. The author of TensorSharp, an open-source engine for Unsloth GGUF models with multimodal support (image, vision, audio), OpenAI and Ollama compatible APIs, and CUDA, Vulkan, and Metal backends, posted a comparison against llama.cpp. For Gemma 4 E4B it (Q8_0, dense multimodal) versus llama.cpp on CUDA, they report a geomean decode ratio of 1.02x, prefill 1.28x, and TTFT 1.27x (single-stream, MTP off, where above 1.0x means TensorSharp is faster or lower-latency). In plain terms, decode throughput is essentially tied while prefill and first-token latency are modestly better in their measurements. This is a project announcement with self-reported numbers and no independent reproduction, so treat it as a lead rather than a result, but a second actively maintained runtime that handles Gemma 4 E4B multimodal is worth tracking for the edge and laptop tiers. Confidence: low, vendor and author-reported, placeholder score, no comments, Vulkan-backend row not captured. (source, July 25, 2026)

Low-quant caveat: a user reports Gemma 4 26B A4B at IQ3_S with reasoning enabled getting stuck in a thinking loop while playing tic tac toe. Running gemma-4-26B-A4B-UD-IQ3_S (unsloth, latest chat template) through llama-swap and llama-server with reasoning on, a 32K context, Q8 K and V cache, unified KV, jinja templating, flash attention, batch and ubatch 2048, and speculative n-gram settings, a user shows Gemma 4 spinning in a reasoning loop on a trivial tic-tac-toe task. This is a single unreproduced report, and the very aggressive quant (IQ3_S) plus reasoning-on plus speculative n-gram is exactly the kind of stacked configuration that can destabilize a small MoE, so it should be kept as a caveat rather than published as a conclusion. It does line up with the standing pattern that Gemma 4 is weaker on open-ended agentic and multi-step loops than on constrained tool use, and it adds a concrete warning: at IQ3_S with reasoning on, watch for loops. Confidence: low, one report, placeholder score, no comments, heavily confounded by quant and sampling settings. (source, July 25, 2026)

Entry configs worth cataloguing: Gemma 4 12B and E4B are running on 12 GB AMD and 16 GB Apple Silicon laptops for general use. Two beginner-level posts confirm the low end. One user on an R5 5600X, RX 6700 XT 12 GB, 16 GB DDR4 runs Gemma 4 E4B and Gemma 4 12B QAT (plus GPT-OSS 20B) in LM Studio for general use, a clean 12 GB-AMD datapoint. Another runs a split setup: a headless 7800XT 16 GB desktop (7800X3D, 64 GB DDR5, llama.cpp Windows HIP) serving coding models to an M1 Pro MacBook Pro 16 GB that itself runs Gemma 4 12B Q4_0 and Qwen3.5 9B Q4_K_M locally for general and light use. Neither gives throughput, but both reinforce the laptop and mid-VRAM guidance: Gemma 4 12B at Q4 is the comfortable general-use pick on a 16 GB laptop or a 12 GB GPU, and E4B covers the tighter cases. Confidence: low, anecdotal configs, placeholder scores, no comments. (source, July 25, 2026; source, July 25, 2026)

One further post mentions Gemma 4 only in passing and is kept as a searchable community card but left out of the tier guidance: a model-selection thread where the poster runs Qwen 3.6 35B-A3B at Q4 (about 20 to 25 tok/s) as the only model passing their personal file-finding and coding tests, tried MTP and DFlash without benefit (more GPU layers helped more consistently), and found Gemma 4 26B A4B too slow on their box (1v6kth6).

Best current setup (this cycle's additions)

Single RTX 3090 (Gemma 4 26B A4B): choose your backend for stability, not just speed. Community reports this cycle put Ollama plus OpenWebUI as slow but stable and Unsloth as fast (working search and tool calls) but crash-prone on a 3090; llama.cpp remains the default worth trying for a stable middle ground. Expect to hit the usable-context ceiling before the compute ceiling (1v6mna4, 1v6akx1).
Budget older-NVIDIA multi-GPU is viable for VRAM, not speed. A GTX 1080 Ti plus two P102-100 cards (31 GB combined) loads the 26B A4B MoE at Q6_K_XL or the dense 31B at Q4_K_XL under llama.cpp Vulkan; these cards lack fp16 and fp4, so treat it as a memory-first, not throughput-first, path (actual speeds not captured here) (1v6llio).
Laptop and 12 GB AMD: Gemma 4 12B at Q4 stays the comfortable general-use pick on a 16 GB Apple Silicon laptop or a 12 GB AMD GPU, with E4B for tighter cases (1v6bv4a, 1v6g27c).
No tier changes otherwise: the single-consumer-GPU, multi-GPU, Apple Silicon, CPU-only, and enterprise picks from the July 16 through July 25 sweeps stand.

What works

Gemma 4 26B A4B runs well on a single RTX 3090 for coding, scripting, and log analysis; the practical limit users report is context length, not fit or basic speed (1v6akx1).
Tool use remains a Gemma 4 strength: a 16 GB RTX 5070 Ti owner calls its tool usage great even while preferring Qwen overall (1v6mcvc).
Gemma 4 12B and E4B run on entry hardware (12 GB AMD, 16 GB Apple Silicon laptop) for general use (1v6bv4a, 1v6g27c).
A second actively maintained runtime handles Gemma 4 E4B multimodal: TensorSharp reports roughly llama.cpp-parity decode with a modest prefill and TTFT edge on CUDA (author-reported) (1v6ect8).

Known limits

Coding and agentic preference still tilts to Qwen 3.6. Two more users this cycle rate Gemma 4 good but pick Qwen for their coding and agentic collections (1v6mcvc, 1v6d2ou).
Backend stability is a real cost on a single 3090: the fast tool-calling path one user found (Unsloth) also crashed, while the stable path (Ollama plus OpenWebUI) was slow (1v6mna4).
Very aggressive quant plus reasoning-on can loop: Gemma 4 26B A4B at IQ3_S with reasoning enabled got stuck in a thinking loop on a trivial task, though the result is confounded by quant and sampling settings (1v6lg60).
Every new post this cycle is a placeholder-score, zero-comment Atom-fallback item, and the two number-bearing items are an author-reported runtime comparison and a benchmark whose throughput table was not captured (1v6ect8, 1v6llio).

Open questions

What is the go-to stable-and-fast stack for Gemma 4 on a single RTX 3090? Users are split between Ollama (stable, slow), Unsloth (fast, crash-prone), and llama.cpp, with no clear community consensus this cycle (1v6mna4).
How well does mixing a 24 GB NVIDIA and a 16 GB AMD GPU actually work for Gemma 4 in 2026? A 3090 owner wants to add an AMD card for VRAM and context headroom, but the thread has no confirmed answer (1v6akx1).
What throughput does the 1080 Ti plus dual P102-100 build actually reach on the 26B A4B and dense 31B? The configuration is documented but the per-model tok/s table did not survive capture, so the numbers need to be read from the original post (1v6llio).
Does TensorSharp's Gemma 4 E4B edge hold up under independent testing? The prefill and TTFT gains are author-reported on CUDA only, with no reproduction and no captured Vulkan row (1v6ect8).

Sources

The Gemma-related posts driving this update (July 26 sweep, newest first). All are placeholder-score (about 20), zero-comment items from the Atom-fallback ingest, so weight each as a single-author datapoint:

Help me complete my AI collection (Jul 25, 2026, RTX 5070 Ti 16 GB, i9-14900K, 64 GB DDR5, llama.cpp; Gemma 4 rated good with great tool usage but beaten by Qwen in the poster's collection)
5700, 48GB RAM and a 3090 24Gb. Best OS and framework/model? (Jul 25, 2026, R7 5700X, 48 GB DDR4, RTX 3090; Ollama plus OpenWebUI slow but stable vs Unsloth fast but crashing, for Gemma 4 and Qwen 3.6, no throughput numbers)
Qwen3.6 to Gemma4: Performance Triple GPU GTX 1080 Ti and P100s (Jul 25, 2026, GTX 1080 Ti 11 GB plus two P102-100 10 GB, 31 GB combined, Kubuntu, llama.cpp Vulkan build 10107; ran Gemma 4 26B A4B UD Q4_K_XL, NVFP4, UD Q6_K_XL and dense 31B UD Q4_K_XL, but the per-model tok/s table was not captured in the archive)
Gemma 4 stuck in thinking loop while playing tic tac toe (Jul 25, 2026, Gemma 4 26B A4B UD IQ3_S with reasoning on, 32K context, Q8 KV cache, speculative n-gram; loops on a trivial task, single unreproduced report confounded by quant and sampling)
Model suggestions for my setup? (Optimizing MoE or smaller faster model?) (Jul 25, 2026, Qwen 3.6 35B-A3B Q4 at about 20 to 25 tok/s is the only model passing the poster's tests, MTP and DFlash gave no benefit, Gemma 4 26B A4B judged too slow)
Getting a second GPU in addition to my RTX3090 (Jul 25, 2026, i5-12400, 64 GB RAM, RTX 3090, LMStudio; Gemma 4 26B A4B and Qwen 3.6 27B work well but hit the context ceiling, considering a 16 GB AMD card for VRAM)
Sanity Check: headless 7800XT desktop serving LLMs to a Mac (Jul 25, 2026, 7800XT 16 GB desktop plus M1 Pro 16 GB laptop; Gemma 4 12B Q4_0 on the Mac for general use, config datapoint)
Benchmarks: TensorSharp vs. llama.cpp (Jul 25, 2026, open-source Unsloth-GGUF engine; author reports Gemma 4 E4B Q8_0 vs llama.cpp CUDA at decode 1.02x, prefill 1.28x, TTFT 1.27x, self-reported)
how local ai skills/tools work (Jul 25, 2026, R5 5600X, RX 6700 XT 12 GB, 16 GB DDR4, LM Studio running Gemma 4 E4B and Gemma 4 12B QAT, entry-level 12 GB AMD config)
Medium sized MoE LLM models (Jul 25, 2026, medium-MoE discussion listing Gemma 4 A4B among options but concluding Qwen dominates the bracket)

_Last updated: 2026-07-26 (July 26 sweep). Confidence: low (ten placeholder-score, zero-comment Atom-fallback posts, none with reproduced throughput). Key points: this is a consumer-hardware and backend-stability cycle, not a benchmark one. On a single RTX 3090, the reported tradeoff is Ollama (stable but slow) vs Unsloth (fast but crash-prone), with context length the first wall users hit on Gemma 4 26B A4B. Coding and agentic preference tilts to Qwen 3.6 again in two reports, while Gemma 4's tool use is still praised. A budget older-NVIDIA triple-GPU build (1080 Ti plus dual P102-100, 31 GB) fits the 26B A4B and dense 31B but its speeds were not captured. TensorSharp adds a second runtime for Gemma 4 E4B multimodal (author-reported parity with llama.cpp on CUDA). A low-quant IQ3_S plus reasoning-on config looped on a trivial task and is kept only as a caveat. Tier guidance carries over from the July 16 through July 25 sweeps. Next update fires when the daily Gemma 4 research cron flags notable new findings._

---

Field Notes - 2026-07-25

A weekly synthesis of what the r/LocalLLaMA community is reporting about Gemma 4 in real use. Curated from the latest Gemma-mentioning posts (3 new posts from the July 24, 2026 sweep, 554 hardware-mention entries total) and their threads. Confidence is low this cycle, and the center of gravity moved to on-device and mobile hardware. All three posts carry a placeholder score (~20) and captured no comment threads, so none of them is community-corroborated. Only one carries measured numbers, and it is a vendor demonstration (a company founder showing their own app), so read its figures as a proof of concept rather than an independent benchmark. The other two are a deliberately inconclusive visual comparison and an open how-to question. Nothing this cycle produced a controlled tokens-per-second-versus-VRAM result on a named desktop GPU, so all of the prior single-GPU, multi-GPU, Apple Silicon, and CPU-only tier guidance is unchanged.

July 25 sweep, 2026-07-25 00:00 UTC: a small on-device and edge cycle. The July 24 ingest surfaced three Gemma 4 mentions: a founder showing Gemma 4 26B A4B running on an iPhone 17 Pro by paging expert weights off the SSD, a casual multi-model SVG-drawing comparison that includes Gemma 4 12B at Q8_0, and a question about whether any app wires Gemma 4 12B plus a local text-to-speech model into a live voice-chat experience. The only new datapoint that carries numbers is the iPhone paging demo, and it is a vendor's own result. The other two add no measurements. The useful new signal is narrow but real: a 26B-class Gemma 4 can be made to run on a current flagship phone at all, at an accuracy-over-speed pace, and there is standing community demand for a local speech-to-speech stack built on the small Gemma 4 models. No prior hardware tier changes.

On-device mobile, vendor demo: a Q4_K_M Gemma 4 26B A4B was shown running on an iPhone 17 Pro by paging expert weights off the SSD, reporting prefill 34.4 tok/s and decode 3.5 tok/s on a 699-token prompt, with a full answer taking about 6 minutes. The founder of Noema demonstrated their Noema Overfit feature running a Q4_K_M Gemma 4 26B A4B on an iPhone 17 Pro. The method keeps non-expert weights in RAM while the expert weights are read from the SSD, which is what makes a 26B-class mixture-of-experts model fit on a phone at all. On an initial 699-token prompt the demo reports prefill speed 34.4 tok/s (prefill time 20.34s) and decode speed 3.5 tok/s, and it took about 6 minutes to finish the answer, which the author says was correct. The pitch is explicitly for cases where answer accuracy matters more than a quick reply, and the same paging approach is also described as helpful for low-RAM MacBooks. For Gemmaclaw this is a genuinely new tier datapoint: it pushes the sparse Gemma 4 26B A4B down onto phone-class hardware, which no prior sweep had shown, but only as a slow accuracy-first novelty rather than an interactive assistant. The limits are heavy. This is the app maker's own demo (the founder disclosed the affiliation), it is one device, one quant, one prompt, there is no independent replication, and at 3.5 tok/s it is far too slow for chat. Confidence: a single vendor demonstration, one iPhone, one quant, no comment thread, no third-party confirmation. (source, July 24, 2026)

Casual visual comparison, no ranking: a pelican-on-a-bicycle SVG test ran Gemma 4 12B at Q8_0 alongside a 118B model at Q2 and two Qwen 3.6 35B A3B configurations, but the author explicitly declined to rank them. A user ran the well-known generate-an-SVG-of-a-pelican-riding-a-bicycle prompt across Gemma 4 12B at Q8_0, Laguna S2.1 118B at Q2_K_XL, and Qwen 3.6 35B A3B at both IQ4_NL_XL and Q8_0, using the pi.dev harness. The author states up front that the results are shown in order of presentation, not quality, and asks what it even proves, adding that they simply had free time. There are no scores, no timings, and no hardware details, so the post confirms only that Gemma 4 12B at Q8_0 is a routine participant in this kind of casual small-model comparison, not that it wins or loses. For Gemmaclaw this adds no measurement and no guidance. Confidence: an explicitly-for-fun comparison with no ranking, no numbers, and no hardware context. (source, July 24, 2026)

Local voice, open question: a user asked whether any existing app ties Gemma 4 12B together with a local text-to-speech model into a ChatGPT-style live voice experience, and no answer was captured. A user who likes the idea of ChatGPT Advanced Voice on the desktop asked whether there is an app that ties together local models like Gemma 4 12B and a text-to-speech model such as Kokoro into a comparable live voice-chat pipeline, or whether they would have to write one themselves. No answer was captured in the thread. For Gemmaclaw this is a demand signal rather than a result: it shows continued interest in a fully local speech-to-speech stack around the small Gemma 4 models, but it names no working setup, no hardware, and no performance. Confidence: a zero-comment question, a demand signal, not a datapoint. (source, July 24, 2026)

Best current setup (this cycle's additions)

On-device and phone-class (novelty, not interactive): a Q4_K_M Gemma 4 26B A4B can be run on an iPhone 17 Pro by paging expert weights off the SSD, but at about 3.5 tok/s decode (roughly 6 minutes for one answer) it is an accuracy-over-speed novelty for when a correct answer matters more than latency, not a usable interactive assistant. It is the app maker's own demo. The same paging approach is also pitched as helpful for low-RAM MacBooks (1v5p5sf).
Small-model 12B tier (no new numbers): Gemma 4 12B at Q8_0 keeps showing up as a general small-model choice (a casual SVG-drawing comparison and a local voice-assistant question), but neither post produced a benchmark, so there is no new performance guidance for the 12B tier this cycle (1v4xvfv, 1v5rhki).
No change to prior tiers: the July 24 Apple Silicon coding result (48GB M5 Pro, 26B A4B at Q6, about 60 tok/s in OpenCode) and the ASCII-diagram vision benchmark, the July 23 dual-5060-Ti MTP and AMD R9700 ROCm findings, the July 20 dual-3090 tensor-parallel numbers, and the earlier single-GPU and CPU-only guidance all still stand, since nothing this cycle contradicts them and no new consumer-GPU benchmark arrived.

What works

A Q4_K_M Gemma 4 26B A4B ran to a correct answer on an iPhone 17 Pro using SSD paging of expert weights (prefill 34.4 tok/s, decode 3.5 tok/s on a 699-token prompt), showing the sparse 26B is technically runnable on phone-class hardware when latency does not matter (1v5p5sf).
Gemma 4 12B at Q8_0 remains a common small-model pick for casual multi-model work such as SVG drawing, run here through a hosted comparison harness (1v4xvfv).

Known limits

Every post this cycle is a single author with a placeholder score (~20) and zero captured comments, so none is community-corroborated. Treat all of it as directional.
The iPhone result is a vendor demonstration by the app's founder, on one device, one quant, one prompt, with no independent replication. Its 3.5 tok/s decode and roughly 6-minute answer time make it unsuitable for interactive use, so read the figures as a proof of concept, not a benchmark (1v5p5sf).
The SVG comparison produced no ranking and no numbers. The author says it is order of presentation, not quality, and asks what it proves, so it supports no claim about Gemma 4 12B relative quality (1v4xvfv).
The local voice question was never answered in the captured thread, so there is still no confirmed off-the-shelf app that wires Gemma 4 12B into a live speech-to-speech loop (1v5rhki).
No new consumer-GPU, multi-GPU, or Mac-coding throughput numbers arrived, so the single-GPU and workstation guidance is unchanged and rests on prior sweeps.

Open questions

What decode speed can Gemma 4 26B A4B reach on phone-class hardware with faster storage or a lighter quant, and does SSD paging stay stable across prompts longer than the single 699-token test (1v5p5sf)?
Does the same paging approach give a usable speedup on low-RAM MacBooks, where the vendor also pitched it, and what are the real numbers there (1v5p5sf)?
Is there an off-the-shelf app that combines Gemma 4 12B with a local text-to-speech model such as Kokoro into a live voice-chat experience, or does it still require custom wiring (1v5rhki)?
How does Gemma 4 12B at Q8_0 actually compare on structured drawing tasks against the larger and Qwen models it was shown beside, once someone scores the output rather than just displaying it (1v4xvfv)?

Sources

The Gemma-mentioning posts driving this update (July 25 sweep, newest first). One carries measured numbers but is a vendor demo (the iPhone paging result), and the other two are a for-fun visual comparison and an open question. All three are placeholder-score (~20), zero-comment single-author posts, so weight them accordingly:

Gemma 4 26B A4B running on iPhone 17 Pro via model paging (Jul 24, 2026, the founder of Noema shows a Q4_K_M Gemma 4 26B A4B running on an iPhone 17 Pro through the Noema Overfit feature, which keeps non-expert weights in RAM and reads the experts from SSD. On a 699-token prompt it reports prefill 34.4 tok/s, prefill time 20.34s, and decode 3.5 tok/s, with a correct answer in about 6 minutes. Pitched as useful when accuracy matters more than speed and for low-RAM MacBooks. A vendor demo, one device, no replication)
Generate an SVG of a pelican riding a bicycle: 118B Q2, Gemma 4 12B Q8_0, Qwen 3.6 35B A3B (Jul 24, 2026, a casual visual comparison of SVG output from Gemma 4 12B at Q8_0, a Laguna S2.1 118B at Q2_K_XL, and two Qwen 3.6 35B A3B quants, run through the pi.dev harness. The author says the ordering is presentation, not quality, and asks what it proves, so there is no ranking or measurement)
Live and local audio chat in 2026? (Jul 24, 2026, a user asks whether any app ties together small local models such as Gemma 4 12B and a text-to-speech model like Kokoro into a ChatGPT-Advanced-Voice-style experience, or whether they must build it themselves. No answer captured, so it is a demand signal rather than a datapoint)

_Last updated: 2026-07-25 (July 25 sweep). Confidence: low and on-device-focused (one measured vendor demo plus a for-fun visual comparison and an open question, all three posts placeholder-score and zero-comment single-author reports). Key findings: a Q4_K_M Gemma 4 26B A4B was shown running on an iPhone 17 Pro via SSD paging of expert weights at prefill 34.4 tok/s and decode 3.5 tok/s on a 699-token prompt, about 6 minutes for a correct answer, an accuracy-over-speed novelty from the app's own founder rather than an interactive setup. A casual SVG-drawing comparison included Gemma 4 12B at Q8_0 but produced no ranking, and an unanswered question asked for an off-the-shelf app pairing Gemma 4 12B with a local text-to-speech model for live voice. No prior hardware tier guidance changes, since no new consumer-GPU, multi-GPU, or Mac-coding benchmark arrived. Next update fires when the daily Gemma 4 research cron flags notable new findings._

---

Field Notes - 2026-07-24

A weekly synthesis of what the r/LocalLLaMA community is reporting about Gemma 4 in real use. Curated from the latest Gemma-mentioning posts (6 new posts from the July 23, 2026 sweep, 551 hardware-mention entries total) and their threads. Confidence is mixed this cycle, and the center of gravity moved to Apple Silicon. Every one of the six posts carries a placeholder score (~20) and captured no comment threads, so none of them is community-corroborated. Within that limit, two items are more than anecdote: a custom VLM benchmark table that puts gemma-4-31b-it first, and a measured M5 prefill-kernel experiment. A third gives one real throughput number for the on-brand OpenClaw-style workflow (Gemma 4 26B A4B driving OpenCode on a Mac). The rest are a coding-agent satisfaction note, a laptop-sizing question, and a model-sizing discussion.

July 24 sweep, 2026-07-24 00:00 UTC: an Apple-Silicon-heavy cycle. The July 23 ingest was broad (61 new posts) but the Gemma 4 slice clustered on Macs: an OpenCode coding test on a 48GB M5 Pro, an M5 matmul-kernel prefill experiment, and a laptop buyer asking whether a 24 or 32GB M5 MacBook Pro can hold Gemma 4 31B. Two of those carry real numbers. Away from Apple, the most quotable result is a plain-text-diagram (ASCII) vision benchmark where Gemma 4 31B ranks first over much larger models, and there is one more on-brand agentic-coding report (Claude Code plus llama.cpp plus Gemma 26B MoE) plus a discussion that frames where Gemma 4 26B A4B sits among small mixture-of-experts models. No controlled tokens-per-second-versus-VRAM sweep on a named consumer NVIDIA card arrived this cycle, so the single-GPU tier guidance is unchanged. The useful new signal is about Apple Silicon throughput and about Gemma 4 31B punching above its size on a structured-vision task, both from single authors.

Apple Silicon, on-brand: the updated Gemma 4 26B A4B at Q6 runs about 60 tok/s under llama.cpp on a 48GB M5 Pro and is reported usable in OpenCode for backend work, though its UI and UX output is called unacceptable. A user tested the recently updated Gemma 4 (the update is described as mostly chat-template changes, which lines up with the Google template refresh this project has been tracking since the July 16 sweep) on a local coding workflow. On a llama.cpp server on an M5 Pro with 48GB, the 26B A4B model at Q6 sustains about 60 tok/s and works well with OpenCode. The author's verdict is split by task: it works quite well for its size on backend work, but the UI and UX output is unacceptable. This is the closest report this cycle to Gemmaclaw's core positioning, a local Gemma 4 driving an agentic coding tool, and it is consistent with the recurring pattern from the July 20 to July 23 sweeps that Gemma 4 26B A4B is a competent general and backend assistant while its agentic and front-end coding output is the weak spot. The value is one concrete throughput datapoint plus a task-level quality split. The limits are that it is a single author with a placeholder score and zero comments, only one quant and one context are reported, and the UI and UX judgement is subjective with no rubric. A demo video is linked. Confidence: a single anecdote with one real tok/s number, no comment thread, no controlled comparison. (source, July 23, 2026)

Structured vision: on a custom plain-text-diagram (ASCII) benchmark, gemma-4-31b-it ranks first at 73.8% overall, ahead of a frontier Qwen model and much larger mixture-of-experts models, but the best model still fails roughly one task in four. A contributor published ASCIITermDraw-Bench, a test of whether vision-language models can render simple diagrams (architecture, topology, node clusters) as plain ASCII. In the posted results, gemma-4-31b-it (31B) is first with a final score of 73.8% plus or minus 4.1 (structural 84.3%, semantic 63.4%), ahead of qwen3.7-plus at 70.2%, kimi-k2.6 (1T total, 32B active) at 61.8%, minimax-m3 (428B total, 23B active) at 59.5%, qwen3.5-9b at 47.0%, and ternary-bonsai-27b at 45.9%. The author's own headline caveat is that even the top model fails nearly one in four tasks, that structural accuracy is much higher than semantic, and that layout, spacing, and routing are where quality collapses. For Gemmaclaw this is a capability datapoint rather than a hardware one, and a favorable one: a dense 31B Gemma 4 beating trillion-parameter-class and 400B-class models on a structured-output vision task is a strong showing for the model people can actually run on a workstation. The important limits are that this is a single-author custom benchmark with wide error bars (roughly plus or minus 4 to 7 points), no comment thread, and no independent replication, and the gemma-4-31b-it lead over qwen3.7-plus (73.8 versus 70.2) is inside the combined error margins, so read it as a strong result, not a settled ranking. Confidence: a measured table, single author, custom benchmark, wide error bars, unreplicated. (source, July 23, 2026)

Apple Silicon runtime: custom INT8-activation (w8a8) kernels give about a 1.4x prefill speedup on Gemma 4, taking E2B prefill on an M5 MacBook Air from 2193 to 3029 tok/s, because current Mac backends still run 16-bit activations and leave the M5 matmul units underused. A developer notes that MLX and llama.cpp on Macs currently run 16-bit activations everywhere, even though M5-generation silicon supports INT8 activations (it allows a w4a8 dtype), and that no inference backend uses this yet. They wrote w8a8 kernels and report about a 1.4x speedup on Gemma 4 prefill: on an M5 MacBook Air, baseline E2B prefill rises from 2193 tok/s stock to 3029 tok/s on a 130,173-token input, and is faster still at small context, where it approaches nearly 10k tok/s. For Gemmaclaw's Apple Silicon tier this is a forward-looking runtime signal: prefill (prompt ingestion) on Gemma 4 has clear headroom on M5 hardware once backends adopt INT8 activations, which most helps long-prompt and document-heavy workloads rather than generation speed. The limits are significant: these are the author's own experimental kernels, not a shipped MLX or llama.cpp feature, the numbers are prefill only (no decode figure), and they cover E2B specifically on one machine with no replication. Confidence: a measured single-author experiment on unreleased kernels, prefill only, one model, one Mac. (source, July 23, 2026)

Agentic coding, on-brand: Claude Code driving a local llama.cpp server with Gemma 26B MoE is reported to work reasonably well on boilerplate, glue code, and some PR review, with server-side web search as the missing piece. A user runs Claude Code against llama.cpp's local server with Google's Gemma models (the 26B MoE) and reports it works reasonably well for easy and boilerplate code, glue code, and some PR review. The post is really a tooling question, whether server-side tools that Claude Code expects, especially web search, can be plugged in on the llama.cpp side rather than only through client-side MCPs, and no answer was captured. For Gemmaclaw this reinforces the standing read that Gemma 4 26B A4B is a usable local backend for a coding agent on lower-stakes work, matching the OpenCode report above and prior sweeps, while also flagging a real integration gap: local-server setups still lean on client-side tooling for web search. It is a preference-and-setup note, not a benchmark, with no speed, VRAM, context, or hardware numbers. Confidence: a single anecdote plus an open tooling question, placeholder score, zero comments. (source, July 23, 2026)

Laptop buyers keep asking the same unmet question: is a 24 or 32GB M5 MacBook Pro enough for Gemma 4 31B, or only for the sparse 26B A4B? A shopper asked directly whether an M5 MacBook Pro with 24GB or 32GB is good enough for Qwen 3.6 27B or Gemma 4 31B, and no answer was captured. On its own it carries no data, but read against this cycle's one measured Mac coding result it sharpens a practical gap. The usable OpenCode datapoint above is on a 48GB M5 Pro running the sparse 26B A4B at Q6 (about 60 tok/s), not the dense 31B, and the community A2B-MoE discussion below frames the 26B A4B (4B active) as already the heavier end of the small-MoE range. Taken together, the honest current answer for a 24 to 32GB M5 is that the sparse Gemma 4 26B A4B at a 4-bit or Q6 quant is the safer fit, while dense Gemma 4 31B at a comfortable quant plus real context is tight-to-impractical on 24GB and unproven at 32GB in the captured posts. That inference is not a measured result and should be confirmed. Confidence: a zero-comment question, answered only by inference from adjacent posts, not by a benchmark. (source, July 23, 2026)

Model-sizing context: a discussion of mixture-of-experts models with about 2B active parameters places Gemma 4 26B A4B (4B active) at the heavier end of the small-MoE range, above an emerging class of A1B-to-A2B models aimed at CPU and low-VRAM use. A user surveying MoE models with roughly 2B active parameters (LFM2 24B A2B, Mellum 2 12B A2.5B, Moondream 3.1 9B A2B, and others) frames both Gemma 4 26B A4B and Qwen 3.x ~30B A3B as already on the heavier side if you do not have enough resources, and asks whether the ~2B-active middle ground is a better fit for CPU or old 4 to 12GB GPUs. For Gemmaclaw this is context rather than a Gemma 4 result: it locates Gemma 4 26B A4B (about 4B active) above the ultra-light A1B-to-A2B MoE tier, so on the most constrained hardware the smaller-active models are what the community is reaching for, though none of those alternatives is benchmarked here. Nothing in this post changes Gemma 4 guidance. Confidence: an opinion-and-survey discussion, no benchmark, zero comments. (source, July 23, 2026)

Best current setup (this cycle's additions)

Apple Silicon, agentic coding: the sparse Gemma 4 26B A4B at Q6 on a 48GB M5 Pro runs at about 60 tok/s under llama.cpp and is reported usable in OpenCode for backend work, with front-end UI and UX output still weak. This is the on-brand pick for local Gemma 4 coding on a Mac with enough memory (1v4sm2m).
Apple Silicon, 24 to 32GB laptops: prefer the sparse 26B A4B at a 4-bit or Q6 quant. Dense Gemma 4 31B at a comfortable quant plus real context is tight-to-impractical on 24GB and unproven at 32GB in the captured posts, so treat 31B on a sub-48GB Mac as unverified rather than recommended (1v4e3xt, 1v4sm2m, 1v41ed5).
Structured vision and diagrams: dense Gemma 4 31B (gemma-4-31b-it) is a strong pick where you need plain-text or ASCII diagram output, ranking first on one custom benchmark over much larger models, though even the best model fails about a quarter of such tasks (1v48wt0).
Agentic coding on a local server generally: Gemma 4 26B A4B works for boilerplate, glue code, and some PR review under both OpenCode and Claude Code plus llama.cpp, so it is a reasonable local coding backend for lower-stakes work, with server-side web search still a manual add-on (1v437yn, 1v4sm2m).
No change to prior tiers otherwise: the July 23 dual-5060-Ti MTP and AMD R9700 ROCm findings, the July 20 dual-3090 tensor-parallel numbers, and the earlier single-GPU, CPU-only, and mid-size guidance all still stand, since nothing this cycle contradicts them and no new consumer-NVIDIA benchmark arrived.

What works

Gemma 4 26B A4B at Q6 on a 48GB M5 Pro at about 60 tok/s in OpenCode is a usable local agentic-coding backend for backend and boilerplate tasks (1v4sm2m).
gemma-4-31b-it ranks first (73.8% plus or minus 4.1) on a custom ASCII-diagram vision benchmark, beating a frontier Qwen model and much larger mixture-of-experts models, with the strongest edge on structural accuracy (1v48wt0).
Custom INT8-activation (w8a8) kernels give about 1.4x Gemma 4 prefill on M5 (E2B 2193 to 3029 tok/s on a 130k-token input), showing real headroom once Mac backends adopt INT8 activations (1v4iw0n).
Claude Code plus llama.cpp plus Gemma 26B MoE handles easy and boilerplate code, glue code, and some PR review reasonably well (1v437yn).

Known limits

Every post this cycle is a single author with a placeholder score (~20) and zero captured comments, so none is community-corroborated. Treat the numbers below as directional.
The OpenCode result is one quant, one context, one machine, and its UI and UX verdict is subjective with no rubric. The front-end and agentic weakness of Gemma 4 26B A4B seen here matches, but does not add measurement to, the July 20 to July 23 doomloop and early-stop reports (1v4sm2m).
The ASCII-diagram benchmark is a custom single-author test with wide error bars (about plus or minus 4 to 7 points). The gemma-4-31b-it lead over qwen3.7-plus (73.8 versus 70.2) sits inside the combined margins, and the best model still fails roughly one task in four, so it is a strong result, not a settled ranking (1v48wt0).
The M5 w8a8 speedup uses the author's own unreleased kernels, is prefill only (no decode number), and covers E2B on one MacBook Air, so it is not something a reader can install today (1v4iw0n).
Dense Gemma 4 31B on a 24 to 32GB M5 is unproven in the captured posts. The one Mac datapoint runs the sparse 26B A4B on 48GB, so the laptop question is answered only by inference (1v4e3xt, 1v4sm2m).
Local-server coding agents still lack native server-side web search in llama.cpp, so setups like Claude Code plus llama.cpp depend on client-side tooling for that (1v437yn).

Open questions

What decode (generation) speed, not just prefill, does the updated Gemma 4 26B A4B sustain across quants on M5-class Macs, and how does the 48GB M5 Pro 60 tok/s figure hold at longer context (1v4sm2m)?
Can a 24 or 32GB M5 MacBook Pro actually run dense Gemma 4 31B at a usable quant and context, or is the sparse 26B A4B the practical ceiling on those machines (1v4e3xt)?
Will MLX or llama.cpp adopt INT8 activations for Gemma 4 on M5, and what is the real end-to-end (prefill plus decode) gain once they do, beyond the author's prefill-only prototype (1v4iw0n)?
Does gemma-4-31b-it hold its ASCII-diagram lead under independent replication with tighter error bars, and does the structural-versus-semantic gap persist (1v48wt0)?
Is there a server-side web-search path for llama.cpp that lets Claude Code or similar agents run fully local with Gemma 26B MoE (1v437yn)?

Sources

The Gemma-mentioning posts driving this update (July 24 sweep, newest first). Two carry measured tables (the ASCII-diagram benchmark and the M5 prefill-kernel experiment) and one gives a single real tok/s number (the OpenCode test), while the rest are a coding-agent note, a laptop question, and a sizing discussion. All six are placeholder-score (~20), zero-comment single-author posts, so weight them accordingly:

Tested (the updated) Gemma 4 locally on coding with OpenCode (Jul 23, 2026, a user runs the recently updated Gemma 4 26B A4B at Q6 on a local llama.cpp server on a 48GB M5 Pro at about 60 tok/s, reports it works well in OpenCode for backend work but calls the UI and UX output unacceptable, and links a demo video. One throughput number, single author, zero comments)
Benchmarking SOTA VLMs on my ASCIITermDraw-bench (Jul 23, 2026, a custom plain-text-diagram (ASCII) vision benchmark where gemma-4-31b-it ranks first at 73.8% plus or minus 4.1, ahead of qwen3.7-plus 70.2%, kimi-k2.6 61.8%, minimax-m3 59.5%, qwen3.5-9b 47.0%, and ternary-bonsai-27b 45.9%. The author notes even the best model fails nearly one in four tasks and that layout and routing are the weak point. Single-author custom benchmark, wide error bars, unreplicated)
Apple M5 isn't making full use of its matmul cores yet (Jul 23, 2026, MLX and llama.cpp on Macs run 16-bit activations, but M5 silicon supports INT8 activations, and the author's own w8a8 kernels give about a 1.4x Gemma 4 prefill speedup, taking E2B on an M5 MacBook Air from 2193 to 3029 tok/s on a 130k-token input and approaching nearly 10k tok/s at small context. Prefill only, unreleased kernels, one model)
Claude Code + llama.cpp + (websearch tool)? (Jul 23, 2026, a user runs Claude Code against a local llama.cpp server with Gemma 26B MoE and reports it works reasonably well on easy and boilerplate code, glue code, and some PR review, while asking for a server-side web-search option. A setup note and open question, no numbers)
Best on the go Laptop for low- medium tier on device AI? (Jul 23, 2026, a shopper asks whether a 24 or 32GB M5 MacBook Pro is enough for Qwen 3.6 27B or Gemma 4 31B. No answer captured, so it is a demand signal rather than a datapoint)
MoE models around A2B (Jul 23, 2026, a survey of mixture-of-experts models with about 2B active parameters that frames Gemma 4 26B A4B, at roughly 4B active, as already the heavier end of the small-MoE range for resource-constrained users. Context, no benchmark)

Last updated: 2026-07-24 (July 24 sweep). Confidence: mixed and Apple-Silicon-heavy (two measured tables and one real tok/s number, the rest anecdote, a question, and a discussion, all six posts placeholder-score and zero-comment single-author reports). Key findings: the updated Gemma 4 26B A4B at Q6 runs about 60 tok/s under llama.cpp on a 48GB M5 Pro and is usable in OpenCode for backend work but weak on UI and UX. gemma-4-31b-it ranks first (73.8% plus or minus 4.1) on a custom ASCII-diagram vision benchmark over much larger models, though even the top model fails about one task in four and the lead is inside the error bars. Custom INT8-activation kernels give about a 1.4x Gemma 4 prefill speedup on M5 (E2B 2193 to 3029 tok/s) but are unreleased and prefill only. Claude Code plus llama.cpp plus Gemma 26B MoE handles boilerplate and glue code reasonably well, and a 24 to 32GB M5 laptop is best matched to the sparse 26B A4B, with dense 31B unproven on those machines. No prior tier guidance changes. Next update fires when the daily Gemma 4 research cron flags notable new findings.

---

Field Notes - 2026-07-23

A weekly synthesis of what the r/LocalLLaMA community is reporting about Gemma 4 in real use. Curated from the latest Gemma-mentioning posts (9 new hardware-mention entries from the July 22 ingest, 545 entries total) and their threads. Confidence is low-to-mixed this cycle, but it is richer than the last few. Two items carry measured throughput numbers: a dual RTX 5060 Ti report where multi-token prediction (MTP) lifts Gemma 4 26B A4B from 88 to 132 tok/s, and an AMD Radeon AI PRO R9700 report where Gemma 4 26B A4B runs at about 20 tok/s instead of the expected 50 on ROCm vLLM. The rest are qualitative: two more reports of Gemma 4 struggling on agentic loops (now on an RTX 5090 and against Qwen 3.6 in Hermes), a real triple-3090 24/7 deployment, an on-device confidence-routing project on the E2B model, and a 48 GB Mac visual benchmark. Every new post arrived through the Atom fallback with a placeholder score around 20 and no captured comments, so weight even the measured items as single-author datapoints.

July 23 sweep, 2026-07-23 00:00 UTC: a broad cycle spanning multi-GPU, AMD, single-GPU, Apple Silicon, and edge. The July 22 ingest of 63 posts surfaced nine Gemma-mentioning hardware entries. The two that move anything are both measured. First, a dual RTX 5060 Ti user shows that MTP is a real speedup for the sparse 26B A4B MoE (88 to 132 tok/s) with the right draft depth, which complicates rather than contradicts the July 20 dual-3090 result where MTP made the dense 31B QAT slower: the lesson is that MTP for Gemma 4 is tuning and architecture sensitive, not a blanket win or loss. Second, an AMD R9700 owner measures Gemma 4 26B A4B at roughly half its expected ROCm vLLM throughput, traced to a missing tuned MoE kernel config. The remaining seven reinforce standing patterns (chat-strong, agentic-weak) or are context, and none overturns the tier picks carried since the July 16 through July 21 sweeps.

Multi-GPU, the measured headline: on dual RTX 5060 Ti 16 GB, enabling MTP took Gemma 4 26B A4B QAT from 88 to 132 tok/s, the opposite of the July 20 dual-3090 result on the dense 31B. A user running Gemma 4 26B A4B IT QAT on dual RTX 5060 Ti 16 GB (llama.cpp b9999, CUDA 13.3, sm tensor, Windows 11) reports token generation rising from 88 tok/s to 132 tok/s once MTP (speculative decoding) is enabled, using draft depth n-max 3 with min-p 0.2 for natural-language tasks (measured on a roughly 20K-token prefill with a 10K-token generation). The same user finds the dense 31B prefers n-max 4, min-p 0.1, and that for programming both Gemma models and Qwen 3.6 27B MTP prefer a much deeper n-max 11, min-p 0.0. This directly complicates the July 20 finding, where MTP made Gemma 4 31B QAT slower on dual RTX 3090s: taken together, MTP for Gemma 4 is neither a free win nor a reliable loss, it depends on the model (the sparse 26B A4B MoE gained the most here), the draft depth, and whether the task is coding or natural language. Practical takeaway for the multi-GPU tier: if you run the 26B A4B MoE, MTP is worth enabling and tuning per task, but benchmark your own n-max and min-p rather than copying a single setting, and do not assume MTP helps the dense 31B. Confidence: single user, placeholder score, no comments, but the tok/s figures and settings are specific and self-consistent. (source, July 21, 2026)

AMD, a measured shortfall: Gemma 4 26B A4B on a single Radeon AI PRO R9700 ran at about 20 tok/s on ROCm vLLM, roughly half the ~50 tok/s the owner expected, with a missing tuned MoE kernel config the likely cause. A user testing cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit on one AMD Radeon AI PRO R9700 (32 GB, gfx1201) under vLLM ROCm (TP 1, max-model-len 8192, fp8 KV cache, Triton attention, enforce-eager) measures about 19 to 20 generated tok/s after warmup against a published R9700 result nearer 50 tok/s. The startup log flags the likely root cause: a warning that no tuned mixture-of-experts kernel config exists for this GPU and quantization (E=128, N=704, gfx1201, int4 w4a16), so vLLM falls back to a generic one. For Gemmaclaw's AMD guidance this is a useful caveat: Gemma 4 26B A4B is runnable on a 32 GB R9700, but out-of-the-box ROCm vLLM throughput can be about half of what a tuned setup reaches, and the untuned MoE config plus enforce-eager are the first things to revisit. Confidence: single user, placeholder score, no comments, a troubleshooting question rather than a finished benchmark, but the measured number and the config warning are concrete. (source, July 22, 2026)

Single GPU, agentic reinforcement: on an RTX 5090 with opencode and Hermes, a new user could not get Gemma 4 31B (NVFP4 4-bit) to complete any agentic task, not even a snake game, while Claude Sonnet did it fine. A user new to local models reports that Gemma 4 31B in NVFP4 4-bit (the nvidia and Red Hat quants) on an RTX 5090 fails every agentic use case they try in opencode and Hermes, including trivial ones like generating a snake game, whereas the same tasks work through the Claude Sonnet API. They wonder whether the 4-bit quant is to blame. This is a third independent voice on the agentic-weakness pattern (after the July 20 26B A4B doomloop and the July 21 31B QAT early-stop), now on a top-end single GPU, though here it is unclear how much is the model versus the 4-bit NVFP4 quant, the harness, or first-time setup. It reinforces the standing guidance without adding a hardware number. Confidence: low, an open question from a self-described beginner, placeholder score, no comments, quant and harness confound the result. (source, July 22, 2026)

Chat-strong, agentic-weak, restated with the new chat template: a user reports Gemma 4 26B A4B now edges out Qwen 3.6 and 3.5 MoE fine-tunes on instruct-mode quality and reasoning efficiency, yet Qwen 3.6 still beats it in the Hermes agent. The same author behind the Apple Silicon benchmark below posts that an updated Gemma 4 chat template makes Gemma 4 26B A4B come out ahead of Qwen 3.6 MoE and Qwen 3.5 MoE fine-tunes on instruct-mode responses and reasoning efficiency, calling it a win for local users, but concedes Qwen 3.6 still does better in Hermes (agentic tool use) and asks for a future Gemma 4.1 fine-tuned for agentic tasks. This lines up cleanly with the rest of the cycle: Gemma 4 keeps winning on chat and reasoning quality while trailing on multi-step agentic loops. No hardware or throughput numbers are given. Confidence: low, a single-author qualitative claim with a placeholder score and no comments, but consistent with multiple independent reports. (source, July 21, 2026)

Multi-GPU deployment, a real one: Gemma 4 runs the language side of a 24/7 AI radio station on two RTX 3090s (lyrics, tool use, news summarization, and DJ decisions), with a third 3090 shared by image and music models. A user built a continuously streaming radio station driven end to end by local models: Gemma 4 on two RTX 3090s handles lyrics, tool calling, and news summarization and acts as the AI DJ deciding what plays, while ACE-Step (music) and Krea (images) share a third RTX 3090 loaded on demand, and Kokoro reads the news on CPU. The station generates about 60 new songs a day and has run continuously for several days. There are no throughput numbers, so this is not a benchmark, but it is a concrete, sustained example of Gemma 4 doing reliable tool use and summarization in a real triple-3090 deployment, which is a useful counterpoint to the agentic-weakness reports: constrained, well-scoped tool use in a fixed pipeline works, open-ended autonomous loops are where it struggles. Confidence: low as a measurement (none given), but a genuine running system rather than a claim. (source, July 22, 2026)

Edge and hybrid: Cactus post-trained Gemma 4 E2B to emit a per-response confidence score, so an on-device app can answer locally when confident and fall back to a cloud model when not. A team (Cactus) added a small 68K-parameter probe layer (LayerNorm, low-rank projection, attention pooling, and a small MLP head) that reads an intermediate layer of Gemma 4 E2B during decoding and predicts a 0 to 1 confidence score per response. Their pitch is a hybrid edge pattern: run the tiny model on-device, and route only the 15 to 35 percent of queries where confidence is low to a larger cloud model (Gemini 3.1 Flash-Lite), which they claim lets Gemma 4 E2B match that cloud model on several benchmarks (ChartQA, LibriSpeech, MMBench, GigaSpeech, MMAU, MMLU-Pro). This is a vendor announcement with author-reported benchmarks and no local-hardware throughput, so treat the numbers as unverified, but the pattern (small on-device Gemma with a learned uncertainty signal plus selective cloud fallback) is a credible direction for phone and embedded deployments where E2B already runs. Confidence: low, a project announcement with self-reported benchmarks, placeholder score, no comments. (source, July 22, 2026)

Apple Silicon, a visual bench: Gemma 4 26B at 6-bit was one of seven small MoE models compared on a 48 GB Mac generating a single-file HTML flight simulator, run through MLX with up to three attempts. A user ran a one-shot HTML flight-simulator generation prompt across seven models on a 48 GB Mac via oMLX, including Gemma 4 26B at 6-bit (temperature 1.0, top-p 1.0, top-k 64, min-p 0.01, repeat-penalty 1.1), alongside Qwen 3.6 27B and various Qwen MoE and abliterated variants, giving each model up to three tries from a clean session. The output is a set of GIFs rather than scores, so there is no ranking or throughput to cite, but it confirms Gemma 4 26B fits and runs in the 48 GB Apple Silicon "local SOTA" bracket at 6-bit for single-shot code generation. Confidence: low, a qualitative visual comparison with no numeric result, placeholder score, no comments. (source, July 22, 2026)

Two further posts mention Gemma 4 only in passing and are kept as searchable community cards but left out of the tier guidance: a discussion thread listing Gemma 4 31B among the mainline local models worth saving offline (1v2m3mh), and a release post for Nanbeige 4.2 3B, a competitor model whose authors claim it beats Gemma 4 12B on coding and agentic benchmarks, a table the community is already flagging for inconsistencies and which has no independent verification yet (1v336od).

Best current setup (this cycle's additions)

Multi-GPU (26B A4B MoE): enable and tune MTP. On dual RTX 5060 Ti 16 GB, MTP lifted Gemma 4 26B A4B QAT from 88 to 132 tok/s with draft depth n-max 3 and min-p 0.2 for natural language; tune n-max and min-p per task (coding wants a much deeper n-max 11) and benchmark your own numbers rather than copying one setting (1v25pwp).
Multi-GPU (dense 31B): do not assume MTP helps. The July 20 dual-3090 result had MTP slowing the dense 31B QAT, and this cycle's win was on the sparse MoE, so treat MTP on the 31B as something to measure, not enable blindly (1v25pwp, 1v0ipfe).
AMD (R9700, 26B A4B on ROCm vLLM): expect tuning work. Out-of-the-box ROCm vLLM gave about 20 tok/s versus an expected ~50; check for a tuned MoE kernel config for your gfx target and revisit enforce-eager before trusting the number (1v3vy45).
No tier changes otherwise: the single-consumer-GPU, Apple Silicon, CPU-only, and enterprise picks from the July 16 through July 21 sweeps stand.

What works

MTP is a real speedup for the 26B A4B MoE on consumer dual-GPU setups when tuned (88 to 132 tok/s measured) (1v25pwp).
Constrained, well-scoped tool use is reliable: Gemma 4 runs the language side (lyrics, summarization, DJ tool calls) of a 24/7 triple-3090 radio deployment (1v3woyv).
Gemma 4 26B A4B stays strong on chat and reasoning quality, edging Qwen 3.6 and 3.5 MoE fine-tunes on instruct-mode responses with the updated chat template (1v2rqbd).
E2B runs on-device well enough to carry a learned confidence signal for hybrid edge routing (1v3nw3j), and 26B fits the 48 GB Apple Silicon bracket at 6-bit (1v33dlq).

Known limits

Open-ended agentic loops remain the weak spot. A new RTX 5090 user could not get Gemma 4 31B (NVFP4 4-bit) to complete any agentic task in opencode or Hermes, while Claude Sonnet did, and a separate user confirms Qwen 3.6 still beats Gemma 4 in Hermes (1v3ef7r, 1v2rqbd).
ROCm vLLM throughput for the 26B A4B MoE can be about half of tuned without a matching MoE kernel config for the GPU (1v3vy45).
MTP is not a portable setting: the best draft depth differs by model (26B A4B versus dense 31B) and by task (natural language versus coding), and MTP can slow the dense 31B outright (1v25pwp, 1v0ipfe).
Every new post this cycle is a placeholder-score, zero-comment Atom-fallback item, and the two benchmark-style claims (Cactus routing, Nanbeige 3B) are author-reported and unverified (1v3nw3j, 1v336od).

Open questions

Does MTP help the dense Gemma 4 31B on any setup, or only the sparse 26B A4B MoE? One user gained 50 percent on the 26B A4B while the July 20 dual-3090 report lost throughput on the 31B, so the dense-model story is still unsettled (1v25pwp, 1v0ipfe).
Is Gemma 4's agentic failure a model limit or a quant and harness artifact? The RTX 5090 report used a 4-bit NVFP4 quant and a first-time opencode/Hermes setup, so it is unclear how much a higher-precision quant or a tuned harness would recover (1v3ef7r).
What tuned ROCm vLLM recipe reaches ~50 tok/s for Gemma 4 26B A4B on the R9700? The owner is asking for the exact image, flags, and MoE config, which no one has posted yet (1v3vy45).

Sources

The Gemma-related posts driving this update (July 23 sweep, newest first). All are placeholder-score (about 20), zero-comment items from the Atom-fallback ingest; the two measured throughput reports are the exceptions worth trusting numerically:

Performance issue: about 20 tok/s vs 50 tok/s on Radeon AI PRO R9700 with vLLM ROCm and Gemma 4 26B (Jul 22, 2026, gemma-4-26B-A4B-it-AWQ-4bit at about 19 to 20 tok/s on a single R9700 32 GB under ROCm vLLM vs an expected ~50, likely a missing tuned MoE kernel config for gfx1201, measured)
24/7 Subreddit Radio (Jul 22, 2026, Gemma 4 on two RTX 3090s runs lyrics, tool use, news summarization and DJ decisions in a continuous radio deployment, a third 3090 shares music and image models, no throughput numbers)
Cactus Hybrid: we taught Gemma 4 to know when it is wrong (Jul 22, 2026, a 68K-param probe on Gemma 4 E2B emits a per-response confidence score for on-device plus selective cloud-fallback routing, author-reported benchmarks only)
Gemma 4 agentic capabilities? (Jul 22, 2026, RTX 5090 user cannot get Gemma 4 31B NVFP4 4-bit to complete any agentic task in opencode or Hermes while Claude Sonnet works, possibly quant or harness related)
FlightSimulatorBench: small MoE edition (Jul 22, 2026, Gemma 4 26B at 6-bit among seven small MoE models compared on a 48 GB Mac via MLX on a single-file HTML flight-sim prompt, GIFs not scores)
Nanbeige 4.2 3B drops, claims to beat 9B/12B on agentic tasks (Jul 22, 2026, a competitor 3B model claiming to beat Gemma 4 12B on coding and agentic benchmarks, author table with community-flagged inconsistencies, unverified)
Updated Gemma 4 chat template: 26B A4B edges Qwen 3.6 and 3.5 MoE fine-tunes on instruct and reasoning (Jul 21, 2026, qualitative claim that Gemma 4 26B A4B beats Qwen MoE fine-tunes on chat quality with the new chat template but still loses to Qwen 3.6 in Hermes agentic use)
MTP on MoE matters (Jul 21, 2026, dual RTX 5060 Ti 16 GB, Gemma 4 26B A4B QAT rises from 88 to 132 tok/s with MTP at n-max 3 min-p 0.2, the cycle's strongest measured result, complicating the July 20 dense-31B MTP regression)
Important small/tiny local models to download before the shutdowns (Jul 21, 2026, lists Gemma 4 31B among mainline models to keep offline, a discussion prompt, no hardware report)

Last updated: 2026-07-23 (July 23 sweep). Confidence: low-to-mixed (nine placeholder-score, zero-comment Atom-fallback posts, two of them carrying measured throughput). Key point: MTP is a real, tunable speedup for the sparse Gemma 4 26B A4B MoE (88 to 132 tok/s on dual RTX 5060 Ti) but not a portable setting, since the July 20 dual-3090 report showed MTP slowing the dense 31B, so treat MTP as architecture and task dependent. Second measured item: Gemma 4 26B A4B ran at about half its expected throughput (about 20 vs 50 tok/s) on an AMD R9700 under ROCm vLLM, traced to a missing tuned MoE kernel config. The agentic-weakness pattern gained two more voices (RTX 5090 NVFP4 31B, and Qwen 3.6 still winning in Hermes), while a triple-3090 radio deployment shows constrained tool use is reliable, and Cactus added on-device confidence routing to E2B. Tier guidance carries over from the July 16 through July 21 sweeps with multi-GPU MTP and AMD ROCm caveats added. Next update fires when the daily Gemma 4 research cron flags notable new findings.

---

Field Notes - 2026-07-21

A weekly synthesis of what the r/LocalLLaMA community is reporting about Gemma 4 in real use. Curated from the latest Gemma-mentioning posts (3 Gemma-related items from the July 20 ingest, 536 hardware-mention entries total) and their threads. Confidence is low this cycle. Every item is a placeholder-score (about 20), zero-comment Atom-fallback post, and there is no new speed, VRAM, quantization, or context-length measurement. The one item that matters is a hands-on agentic-laziness report on Gemma 4 31B QAT that reinforces and extends the July 20 finding, so prior tier picks stand and the agentic-guard caveat gets stronger rather than any tier changing.

July 21 sweep, 2026-07-21 00:00 UTC: a thin cycle. The daily digest flagged two Gemma-mentioning posts (a hobby architecture experiment and an ecosystem-sentiment thread), but the hardware extraction also surfaced a third, more useful one: a user running Gemma 4 31B QAT who pastes a full config and reports that the model still stops early on multi-turn agentic tasks even after the recent chat-template update. That agentic-laziness report is the only actionable item and it strengthens the July 20 pattern (Gemma 4 26B A4B doomlooping on agentic loops) by showing the dense 31B QAT has the same weakness for one more user, against models like Qwen 3.6 27B and DeepSeek V4 Flash that keep going. Nothing here is a new hardware number, and none of it overturns the standing single-GPU, multi-GPU, Apple Silicon, CPU-only, or enterprise guidance.

The one actionable item: a user reports Gemma 4 31B QAT is "still lazy" on agentic work, stopping after a couple of tool calls even with the latest chat template and preserve-thinking enabled, while Qwen 3.6, DeepSeek V4 Flash, and GPT-OSS 120B keep going. A user running unsloth/gemma-4-31B-it-qat-GGUF at UD-Q4_K_XL with a fully pasted config (147K context, tensor split, temperature 1.0, top-p 0.95, top-k 64, flash attention on, `preserve_thinking` true, the latest Unsloth GGUF with the new chat template, spec decoding on) says that in the Hermes agent the model does a couple of tool calls, narrates something like "I did A but B and C happened, now I am going to do D," and then just stops. Telling it to continue and not stop produces the same early exit again. The same user reports this is not a problem with Qwen 3.6 27B (UD-Q5_K_XL), DeepSeek V4 Flash (UD-IQ3_XXS), or even older GPT-OSS 120B, all of which keep "fruitfully chewing on the problem," and concedes Gemma 4 is a great chatbot. For Gemmaclaw this is a directly on-topic reinforcement of the standing agentic caveat: the weakness earlier reported for the sparse 26B A4B now shows up on the dense 31B QAT too, and importantly the recent chat-template and preserve-thinking changes that were meant to reduce Gemma 4 laziness did not fix multi-turn agentic drop-off for this user. The practical takeaway is unchanged and a little firmer: for autonomous multi-step loops, do not run Gemma 4 unsupervised, pair it with a stronger orchestrator or an anti-stall guard, and reserve Gemma 4 for chat, comprehension, and single-shot generation where it is strong. Confidence: low, a single-author report with a placeholder score and no comments, and no completion-rate or throughput numbers, but the config is fully specified and the failure mode matches independent July 20 reports. (source, July 20, 2026)

Community experiment, not a benchmark: a hobbyist "Patchwerk" mixture-of-experts band architecture built on Gemma 4, explicitly AI-assisted and not expected to beat the base model. A user posted progress on a personal project that arranges Gemma 4 into a mixture-of-experts "band" structure they nicknamed after a string of Vienna sausages. The author is unusually candid about scope, opening with an AI-generated-content warning and stating plainly that this is a hobby project pieced together with AI assistance, that they started with no theoretical background in large language models, and that even when finished it is highly unlikely to outperform the original base model except perhaps in a few specific domains. For Gemmaclaw there is nothing to act on here: no runtime, quantization, hardware, or throughput detail is given, and the author is not claiming a win. It is worth a single card as evidence that Gemma 4 remains a common base for community architecture tinkering, but it should not influence any hardware recommendation. Confidence: very low, a single-author hobby experiment with a placeholder score, zero comments, and no measurements, flagged by its own author as AI-generated and unlikely to improve on the base model. (source, July 21, 2026)

Ecosystem sentiment, not a Gemma 4 datapoint: a discussion thread argues Google has dropped out of the top 15 and speculates about an on-device pivot. A separate post claims Google has not recently shipped a model that competes with the current Sol or Fable frontier, calls the previous releases disappointing and unreliable, and speculates that Google may be going all-in on on-device inference for its own products (a space the poster thinks Apple could win on hardware) or may simply be stalled internally. The cited basis is a public AI leaderboard. This is relevant to Gemmaclaw only as context, it is opinion about Google's release strategy and standing, not a report about running Gemma 4 on any particular hardware, and it offers no numbers. It is noted here for completeness but deliberately left out of the tier guidance and, because it carries no hardware mention, it does not get a community card. Confidence: very low, an opinion thread with a placeholder score and no comments, and it is about Google's frontier position rather than local Gemma 4 performance. (source, July 20, 2026)

Best current setup (this cycle's additions)

No tier changes this cycle. No new post is a hardware measurement, so every prior pick stands: the single-consumer-GPU, multi-GPU, Apple Silicon, CPU-only, and enterprise or accelerator recommendations from the July 16 through July 20 sweeps are unchanged.
Agentic guard, now firmer and wider: the standing advice to not run Gemma 4 as an unsupervised multi-step agent now covers the dense 31B QAT as well as the 26B A4B. Both stop early or doomloop on agentic tasks for separate users, and the recent chat-template and preserve-thinking updates did not fix it for the 31B, so pair Gemma 4 with a stronger orchestrator or an anti-stall guard for tool-calling loops (1v1ccun, 1v0jxsx).
Where Gemma 4 is still the pick: chat, comprehension, and single-shot generation, where earlier sweeps rate it above Qwen 3.6 for coherence and prompt adherence (1v0dksm).

What works

Gemma 4 as a chatbot and single-shot generator remains strong, and the "still lazy" reporter explicitly grants it is "a great chatbot" even while criticizing its agentic behavior (1v1ccun).
Gemma 4 remains a common community base model, with the Patchwerk experiment one more example of builders starting from Gemma 4 for architecture tinkering (1v22noq).

Known limits

Gemma 4 31B QAT stops early on multi-turn agentic work for one user, even with the latest chat template and preserve-thinking on, where Qwen 3.6 27B, DeepSeek V4 Flash, and GPT-OSS 120B keep going (1v1ccun).
The recent chat-template and preserve-thinking changes did not fix agentic laziness for this 31B QAT config, so treat those updates as helpful for tool-call formatting but not a cure for early-stop behavior (1v1ccun).
No new hardware evidence this cycle: no post carries a speed, VRAM, quantization, or context-length measurement, so nothing strengthens or weakens the existing tier picks (1v22noq, 1v21j14).
Every item is a placeholder-score, zero-comment Atom-fallback post, and one is explicitly AI-generated, so weight them accordingly (1v22noq).

Open questions

Is Gemma 4's agentic early-stop a prompt, template, or model-level limit? One user's fully specified 31B QAT config with the new chat template still stops early, so it is unclear whether a different agent harness, a stronger anti-stall system prompt, or a future model update would close the gap with Qwen 3.6 and DeepSeek V4 Flash (1v1ccun).
Will any hobbyist Gemma 4 architecture experiment produce a measurable win? The Patchwerk author does not expect to beat the base model, so the open question is whether any community fine-tune or restructure eventually posts numbers that hold up (1v22noq).
Is community sentiment about Google's release cadence starting to affect Gemma 4 adoption? The "disappeared from the top 15" thread is about frontier standing, but repeated ecosystem doubt could shift how many people pick Gemma 4 as a local base, worth watching across future sweeps rather than acting on now (1v21j14).

Sources

The Gemma-related posts driving this update (July 21 sweep, newest first). All are placeholder-score (about 20), zero-comment items from the Atom-fallback ingest, and none is a hardware measurement, so this cycle adds no new tier data:

Gemma 4 is still lazy (Jul 20, 2026, a user running gemma-4-31B-it-qat UD-Q4_K_XL at 147K context with the latest chat template and preserve-thinking reports the model stops early on Hermes agentic tasks after a couple of tool calls, while Qwen 3.6 27B, DeepSeek V4 Flash, and GPT-OSS 120B keep going, the cycle's only actionable item)
Vienna "Patchwerk" Sausage Architecture for gemma4 (Jul 21, 2026, a self-described AI-assisted hobby mixture-of-experts band experiment on Gemma 4, which the author says is highly unlikely to beat the base model except in a few narrow domains, no hardware or benchmark detail)
Google has disappeared completely from the top 15 (Jul 20, 2026, an opinion thread arguing Google has not shipped a frontier-competitive model recently and speculating about an on-device pivot, cited from a public AI leaderboard, sentiment rather than a Gemma 4 hardware report)

Last updated: 2026-07-21 (July 21 sweep). Confidence: low (three placeholder-score, zero-comment Atom-fallback posts, none a hardware measurement). Key point: no new speed, VRAM, quantization, or context-length data for Gemma 4 this cycle. The one actionable item is a hands-on report that Gemma 4 31B QAT still stops early on multi-turn agentic work even with the latest chat template and preserve-thinking, reinforcing and extending the July 20 finding that Gemma 4 26B A4B doomloops on agentic loops. The other two posts are an AI-assisted "Patchwerk" mixture-of-experts hobby experiment on Gemma 4 and an ecosystem-sentiment thread about Google's frontier standing. Tier guidance carries over unchanged from the July 16 through July 20 sweeps, with the agentic-guard caveat now firmer and extended to the 31B QAT. Next update fires when the daily Gemma 4 research cron flags notable new findings.

---

Field Notes - 2026-07-20

A weekly synthesis of what the r/LocalLLaMA community is reporting about Gemma 4 in real use. Curated from the latest Gemma-mentioning posts (6 new hardware-mention entries from the July 19 ingest, 534 entries total) and their threads. Confidence is low-to-mixed this cycle. All six new posts arrived through the Atom fallback, so each carries a placeholder score around 20 with no captured comments. The exception that lifts this cycle above the last one is a dual RTX 3090 throughput report that pastes its raw llama.cpp timing logs, so the headline numbers there are measured rather than eyeballed. Everything else is single-author opinion, so read the confidence notes on each item.

July 20 sweep, 2026-07-20 00:00 UTC: a small but coherent cycle. After the July 19 KV-grafting and precise-reasoning items, this ingest of 40 posts surfaced six Gemma-mentioning entries, four of them squarely on topic: a measured dual-3090 result where Gemma 4 31B QAT hits about 39.7 tok/s in tensor parallel and multi-token prediction actually makes decoding slower, two independent reports that Gemma 4 26B A4B feels smarter than Qwen 3.6 in chat but falls apart on agentic loops, an open complaint that the 26B still has no audio input, and a new from-scratch DGX Spark runtime (Eider) that lists Gemma 4 26B-A4B among its supported models. A fifth post (an underrated-models thread that explicitly treats Gemma 4 as a mainstream baseline, not a hidden gem) and a KV-cache-focused llama.cpp fork release that only tags Gemma in passing are included as cards but left out of the narrative. Nothing this cycle overturns prior tier guidance, but the chat-versus-agentic split is now reported by enough separate users to treat as a stable pattern.

Multi-GPU, the one measured result: on dual RTX 3090s, Gemma 4 31B QAT decoded at about 39.7 tok/s in tensor-parallel mode, and enabling multi-token prediction (MTP) made it slower, not faster. A user pasted raw llama.cpp `print_timing` logs showing a steady decode of roughly 39.7 to 39.9 tok/s in tensor parallelism on dual RTX 3090s running Gemma 4 31B QAT, about 31 to 33 tok/s with `--sm layer` (layer split), but with MTP enabled the rate swung between 27 and 34 tok/s, below the no-MTP baseline. For Gemmaclaw's multi-GPU tier this is a clean negative result worth recording: MTP (speculative decoding) is not a free speedup on this model and card pairing, and for Gemma 4 31B QAT on two 3090s tensor parallelism beats layer split. It also lines up with a prior-sweep caveat that tensor-parallel behavior for Gemma 4 is backend and config sensitive (a July 18 report had 12B and E2B failing to load in tensor parallel in an unnamed runtime), so treat MTP and tensor-parallel settings as things to benchmark per setup rather than assume. Confidence: a single user with no explanation for the regression, a placeholder score, and no comments, but the throughput figures are backed by pasted timing logs, so they are measured rather than estimated. (source, July 19, 2026)

The clearest capability signal is a split verdict: two independent users say Gemma 4 26B A4B feels smarter and more coherent than Qwen 3.6 in plain chat, but falls apart on agentic tasks where Qwen at least finishes. One user, running both at Q4 with Gemma as 26B-A4B QAT, calls Gemma "head and shoulders ahead" of Qwen 3.6 35B-A3B on prompt adherence, output coherence, and general "sanity" despite Qwen's higher benchmark scores, and notes that Arena.ai puts Gemma 4 26a4 only about 7 ELO below the larger proprietary Qwen 3.6 Plus. A second user, doing local software development with the same MoE pair, agrees that Gemma "seemed more intelligent than Qwen3.6" in basic chat but says that "in agentic tasks, Gemma4 completely falls apart," getting stuck in doomloops more often than Qwen 3.6, which at least drives tasks to completion. For Gemmaclaw this is a coherent, on-topic tradeoff to track: Gemma 4 26B A4B is a strong conversational and comprehension model but a weaker autonomous agent, so for tool-calling or multi-step loops it wants a stronger orchestrator or an anti-doomloop guard rather than being run unsupervised. This also echoes prior sweeps (the July 19 provider-quality question and the July 18 coding-agent report that preferred Gemma 4 31B, not 26B, as the primary coder). Confidence: two separate single-author anecdotes, both placeholder-score with no comments and no task-completion numbers, but they agree with each other and with earlier reports. (source, source, July 19, 2026)

Capability gap, restated: Gemma 4 26B still has no audio input, and users are asking why after the recent quality update. A short thread praises the recent Gemma 4 26B update (an already-good model made better) and asks why Google did not add audio input to it. For Gemmaclaw this is a reminder that the 26B-A4B MoE is text and vision only, while the audio modality lives on the smaller Gemma 4 12B (which itself had reported trouble attending to speech under long system prompts in an earlier sweep). If a workflow needs speech input, the 26B is not the model to reach for. Confidence: an open question thread with no official answer and a placeholder score, so this is a capability note, not a new finding. (source, July 19, 2026)

New runtime for high-end accelerators: Eider, a from-scratch Rust and CUDA inference server for the NVIDIA DGX Spark, lists Gemma 4 26B-A4B among its supported models and gives it a compact NVFP4 KV cache. A developer released Eider, built specifically for the DGX Spark (GB10, Grace Blackwell, SM121 GPU) to exploit NVFP4, and explicitly not built on top of llama.cpp or vLLM. It runs Gemma 4 26B-A4B (alongside Qwen 3.6 dense and MoE, Step 3.7 Flash, and NVIDIA Nemotron 3), exposes an OpenAI-compatible Responses and Chat Completions server with continuous multi-session scheduling, uses a compact NVFP4 KV cache for the Gemma attention path, and can page MoE experts in and out from disk to fit models larger than memory. For Gemmaclaw's enterprise and accelerator tier this is an early, single-author research runtime rather than a production choice, but it is a concrete datapoint that Gemma 4 26B-A4B is being targeted by new NVFP4-native runtimes for Grace Blackwell hardware. Confidence: a personal research project the author describes as crawling toward production, single author, placeholder score, and no published benchmarks. (source, July 19, 2026)

Best current setup (this cycle's additions)

Dual-GPU (2x 24 GB): on dual RTX 3090s, Gemma 4 31B QAT decodes at about 39.7 tok/s in tensor-parallel mode, faster than `--sm layer` (about 31 to 33 tok/s), and MTP does not help here (it swung to 27 to 34 tok/s, below the baseline) for one user (1v0ipfe).
Best-in-class chat on a single consumer GPU: for conversational quality, prompt adherence, and coherence, Gemma 4 26B A4B (QAT, Q4) was preferred over Qwen 3.6 35B-A3B by two separate users this cycle (1v0dksm, 1v0jxsx).
Agentic loops, add a guard: if you point Gemma 4 26B A4B at multi-step agentic tasks, expect doomloops, so pair it with a stronger orchestrator or an anti-loop setup rather than running it as an unsupervised agent (1v0jxsx).
High-end NVFP4 accelerators: on a DGX Spark (GB10), the new Eider runtime runs Gemma 4 26B-A4B with an NVFP4 KV cache and disk expert paging, an early option to watch (1v15jyv).
No change to prior tiers: the July 18 and 19 items (Gemma 4 26B topping a five-model MacBook Pro average at 14.7 GB, Gemma 4 31B Q8 as a credible primary coder, about 23 tok/s on an RTX 4060 Ti 16 GB, the byte-exact KV-grafting preprint, and the precise-shell-reasoning caveat) and all earlier tier guidance still stand.

What works

Gemma 4 31B QAT on dual 3090s in tensor parallel holds a steady about 39.7 tok/s, the cleanest measured number this cycle (1v0ipfe).
Gemma 4 26B A4B is a strong chat and comprehension model, judged more coherent and "sane" than Qwen 3.6 by two users despite lower benchmark scores (1v0dksm, 1v0jxsx).
Gemma 4 26B-A4B keeps gaining runtime support, now including the NVFP4-native Eider runtime for the DGX Spark (1v15jyv).

Known limits

MTP is not a free speedup on Gemma 4 31B QAT (dual 3090): enabling it dropped one user below the no-MTP tensor-parallel baseline (1v0ipfe).
Gemma 4 26B A4B is a weak autonomous agent: it "completely falls apart" and doomloops on agentic tasks where Qwen 3.6 at least finishes, per one developer (1v0jxsx).
Gemma 4 26B has no audio input: the audio modality is on the 12B, so the 26B is text and vision only (1v11v74).
Every datapoint this cycle is a placeholder-score, zero-comment anecdote, and only the dual-3090 throughput is backed by pasted timing logs (1v0ipfe).

Open questions

Why does MTP regress decode on Gemma 4 31B QAT on dual 3090s? The user found no explanation, and it is unclear whether it is a draft-acceptance, config, or model-specific effect (1v0ipfe).
Can the chat-versus-agentic gap be closed for Gemma 4 26B A4B? Would an anti-doomloop finetune or a stronger orchestrator make it a usable agent, or is the limit structural (1v0jxsx)?
Will Google add audio input to Gemma 4 26B, or is unified audio staying on the 12B (1v11v74)?

Sources

The Gemma-mentioning posts driving this update (July 20 sweep, newest first). All six are placeholder-score (about 20), zero-comment items from the Atom-fallback ingest, so weight them accordingly. The one measured datapoint this cycle is the dual-3090 throughput report, which pastes raw llama.cpp timing logs:

Dual 3090 Gemma 4 31B QAT: MTP performance worse than without MTP (Jul 19, 2026, raw llama.cpp `print_timing` logs showing about 39.7 tok/s tensor-parallel, 31 to 33 tok/s with `--sm layer`, and 27 to 34 tok/s with MTP enabled, a measured negative result for MTP on this model and card pair)
Qwen vs Gemma (Jul 19, 2026, a user running both at Q4 finds Gemma 4 26a4B QAT "head and shoulders ahead" of Qwen 3.6 35a3B on prompt adherence, coherence, and general sanity despite Qwen's higher benchmarks)
Thoughts on local software development with current MoE models (Qwen3.6 35B A3B, Gemma4 26B A4B) (Jul 19, 2026, Gemma 4 26B A4B seems smarter than Qwen 3.6 in chat but "completely falls apart" and doomloops on agentic tasks, where Qwen at least finishes)
Why gemma 4 26B does not support audio? (Jul 19, 2026, an open question asking why the recently updated Gemma 4 26B still lacks audio input, and the audio modality lives on the 12B)
Eider: an inference runtime for the DGX Spark (Jul 19, 2026, a from-scratch Rust and CUDA runtime for the DGX Spark GB10 that runs Gemma 4 26B-A4B with a compact NVFP4 KV cache and disk expert paging)
What's your favorite underrated local model? (Jul 19, 2026, a discovery thread that explicitly treats Gemma 4 as a mainstream baseline rather than a hidden gem, included as a community card but off-topic for the Gemma 4 hardware narrative)

Related tooling this cycle: BeeLlama.cpp v0.4.0 (Jul 19, 2026) is a llama.cpp fork that rebased on upstream and added KV-cache precision features (KVarN, a precision tail, and q2_0 through q3_1, q6_0, q6_1 KV cache types) while dropping its own DFlash and TurboQuant now that upstream covers them. It tags Gemma among supported models but is a general KV-cache release, so it is noted here rather than in the narrative.

Last updated: 2026-07-20 (July 20 sweep). Confidence: low-to-mixed (six placeholder-score Atom-fallback items, but one carries measured dual-3090 timing logs). Key findings: on dual RTX 3090s, Gemma 4 31B QAT decodes at about 39.7 tok/s in tensor parallel, faster than layer split, and multi-token prediction makes it slower rather than faster. Two independent users report the same tradeoff for Gemma 4 26B A4B: stronger than Qwen 3.6 in plain chat but weaker on agentic loops, where it doomloops and Qwen at least finishes. The 26B still has no audio input, which lives on the 12B. And a new from-scratch DGX Spark runtime, Eider, adds NVFP4-native support for Gemma 4 26B-A4B. Prior-cycle tier guidance is unchanged. Next update fires when the daily Gemma 4 research cron flags notable new findings.

---

Field Notes - 2026-07-19

A weekly synthesis of what the r/LocalLLaMA community is reporting about Gemma 4 in real use. Curated from the latest Gemma-mentioning posts (5 new hardware-mention entries from the July 18 ingest, 528 entries total) and their threads. Confidence is low this cycle. All five new posts came in through the Atom fallback, so every one carries a placeholder score around 20 with no captured comments, and none is a reproducible measured benchmark. The single hardware number this cycle is one user's self-reported throughput figure, so treat the whole sweep as directional community signal rather than evidence.

July 19 sweep, 2026-07-19 00:00 UTC: a thin cycle with no new measured benchmark. After the July 18 MacBook Pro comparison, this ingest of 40 posts surfaced five Gemma-mentioning items, four of them on-topic: a single consumer-GPU throughput datapoint (about 23 tok/s for Gemma 4 26B A4B on an RTX 4060 Ti 16GB, bottlenecked by memory bandwidth), a byte-exact KV cache grafting preprint that reports a large accuracy jump on Gemma 4 12B, a precise-reasoning failure where Gemma 4 31B QAT could not predict piped `dd` output, and an open question about which quantization provider to trust for Gemma 4 26B. A fifth post, a model-swapping MCP tool that only tags Gemma in passing, is left out of the narrative as off-topic tooling but is included as a community card. Nothing this cycle changes the tier guidance from prior sweeps.

Mid-range single GPU: one user reports about 23 tokens per second on Gemma 4 26B A4B with an RTX 4060 Ti 16GB, and says memory bandwidth, not VRAM, is the ceiling. In an upgrade-shopping thread about inflated GPU prices, the author notes that although the RTX 4060 Ti 16GB has plenty of VRAM to hold the Gemma 4 26B-A4B MoE, its narrow memory bus caps decode at roughly 23 tok/s. They are weighing a next card and find the options poor: the cheapest step up is the Intel Arc B60 (which they say has weak software support), then the Radeon RX 7900 XT, with little else usable under $1,000. For Gemmaclaw's mid-range GPU tier this is a useful, if single-user, datapoint. The sparse 26B-A4B is comfortably runnable at interactive speed on a 16 GB consumer card, and the practical limit on that class of card is memory bandwidth rather than capacity, so a card with more raw VRAM but a similar bus width will not necessarily decode faster. Confidence: a single anecdote with a placeholder score and no captured comments, and the post names no quant, context length, or backend, so the 23 tok/s figure is indicative only. (source, July 18, 2026)

Technique to watch: a preprint claims byte-exact KV cache grafting on frozen Gemma 4 12B, lifting an AIME 2025 routing setup from 76.7% to 90.0%. The author published a method to store verified knowledge as KV state and restore it byte identical to fresh computation, and reports that on Gemma 4 12B the cached knowledge raised the same routing system from 76.7% to 90.0% on AIME 2025 (paper: arXiv 2607.14431). This is interesting for Gemmaclaw beyond a single number because it frames Gemma 4 as a candidate for cached-knowledge and prefill-reuse experiments, and it sits in the same cluster as this week's other cache-reuse work, a cache-invalidation proxy tool and a Mac Studio KV-reuse report. It is a fresh single-author preprint with a promotional framing (the author says they will pitch it at a summit the next day), a placeholder score, and no independent replication, so it is a lead to watch rather than an established result. Confidence: low, an unreplicated preprint from one author with no captured discussion. (source, July 18, 2026)

Known limit, precise reasoning: Gemma 4 31B QAT could not predict the exact output and exit code of a piped `dd` command, and neither could two other large local models. A tester asked several models, all in reasoning mode, to predict the printout and error code of a specific two-stage `set -o pipefail; dd ... | dd ...` pipeline on Linux. Gemma 4 31B QAT, Qwen 3.6 27B Q6, and DeepSeek V4 Flash MXFP4 all failed, and a fourth model was still thinking when the tester gave up. These are the largest models the author runs locally. For Gemmaclaw this is a narrow but concrete reminder that Gemma 4 31B, even in reasoning mode, is unreliable at precise mechanical reasoning about tool output, the kind of exact-value prediction that matters for agentic shell work. The author had not yet tried a coding harness, which might catch or correct such errors. Confidence: a single tester, a one-shot task, a placeholder score, and no comments, so read it as one datapoint on a hard sub-skill, not a benchmark. (source, July 18, 2026)

Best current setup (this cycle's additions)

Mid-range single GPU (16 GB): the sparse Gemma 4 26B-A4B runs at interactive speed (about 23 tok/s for one user) on an RTX 4060 Ti 16 GB, with memory bandwidth, not VRAM, the limiting factor on that class of card (1v07ell).
Prefill and cache reuse, experimental: if you experiment with cached-knowledge or prefill reuse, Gemma 4 12B is now the subject of a byte-exact KV grafting preprint reporting a large AIME jump, worth tracking but not yet replicated (1v07tib).
Precise shell and tool reasoning: do not rely on Gemma 4 31B QAT to predict exact command output, since at least one deterministic `dd` test tripped it and two peer models (1uzuebx).
No change to prior tiers: the July 18 items (Gemma 4 26B topping a five-model MacBook Pro average at 14.7 GB, Gemma 4 31B Q8 as a credible agentic coder, phone-side MoE expert streaming at 1 to 5 tok/s, and the tensor-parallel load caveat) and all earlier tier guidance still stand. Nothing this cycle contradicts them.

What works

Gemma 4 26B-A4B is comfortably interactive on a 16 GB consumer GPU, about 23 tok/s on an RTX 4060 Ti for one user (1v07ell).
Gemma 4 12B is a viable base for KV-cache and prefill-reuse research, per a new byte-exact grafting preprint, though the result is unreplicated (1v07tib).

Known limits

The one hardware number this cycle is a single self-report (about 23 tok/s, RTX 4060 Ti 16 GB, Gemma 4 26B A4B) with no quant, context, or backend named, so treat it as indicative only (1v07ell).
Gemma 4 31B QAT failed a precise piped-`dd` output prediction in reasoning mode, alongside Qwen 3.6 27B and DeepSeek V4 Flash, a reminder it is weak at exact mechanical tool-output reasoning (1uzuebx).
The KV grafting result is an unreplicated single-author preprint with a promotional framing and a placeholder score, not yet independent evidence (1v07tib).

Open questions

Which quantization provider is best for Gemma 4 26B by task? A user running MoE Gemma 4 26B and Qwen 3.6 35B sees quality differences across bartowski, unsloth, LM Studio, and Google builds but cannot find a rule of thumb for coding versus general chat (1uzr4ia).
Does the byte-exact KV grafting result hold up under independent replication and on larger Gemma 4 sizes, or is the 76.7 to 90.0 jump specific to one routing setup (1v07tib)?
What is the memory-bandwidth floor for interactive Gemma 4 26B-A4B across consumer cards, given a 16 GB RTX 4060 Ti lands near 23 tok/s and the bus, not the VRAM, is the bottleneck (1v07ell)?

Sources

The Gemma-mentioning posts driving this update (July 19 sweep, newest first). All five are placeholder-score (~20), zero-comment items from the Atom-fallback ingest, so weight them accordingly. There is no reproducible measured benchmark this cycle:

How are y'all stomaching the "AI Boom" prices? (Jul 18, 2026, an upgrade thread reporting about 23 tok/s on Gemma 4 26B A4B with an RTX 4060 Ti 16GB and calling out the narrow memory bus as the bottleneck. No quant, context, or backend named)
Byte exact KV cache grafting on frozen Gemma 4 (Jul 18, 2026, a preprint (arXiv 2607.14431) claiming byte-identical KV state restore, with Gemma 4 12B improving from 76.7% to 90.0% on an AIME 2025 routing setup. Single author, promotional framing, unreplicated)
Which models can predict piped dd output in Linux? (Jul 18, 2026, a precise reasoning test where Gemma 4 31B QAT, Qwen 3.6 27B Q6, and DeepSeek V4 Flash all failed to predict the exact output and exit code of a two-stage dd pipeline)
Qwen and Gemma providers (Jul 18, 2026, a user asks which provider quant, bartowski, unsloth, LM Studio, or Google, is best for Gemma 4 26B and Qwen 3.6 35B for coding versus general use. No answer reached)
promptchain: pin, preload, and swap local models across backends (Jul 18, 2026, a model-swapping Python library and MCP server that tags Gemma only in passing. Included as a community card but off-topic for the Gemma 4 hardware narrative)

Last updated: 2026-07-19 (July 19 sweep). Confidence: low (five placeholder-score Atom-fallback anecdotes, no reproducible measured benchmark this cycle). Key findings: one user reports about 23 tok/s for Gemma 4 26B A4B on an RTX 4060 Ti 16 GB and identifies memory bandwidth, not VRAM, as the ceiling on that class of card. A single-author preprint claims byte-exact KV cache grafting on frozen Gemma 4 12B, raising an AIME 2025 routing setup from 76.7% to 90.0%, an unreplicated lead to watch. Gemma 4 31B QAT failed a precise piped-dd output prediction test in reasoning mode alongside two peer models, a reminder it is weak at exact tool-output reasoning. And a user reports unresolved quality differences between Gemma 4 26B quant providers. A model-swapping MCP tool was added as a card but left out of the narrative as off-topic. Prior-cycle tier guidance is unchanged. Next update fires when the daily Gemma 4 research cron flags notable new findings.

---

Field Notes - 2026-07-18

A weekly synthesis of what the r/LocalLLaMA community is reporting about Gemma 4 in real use. Curated from the latest Gemma-mentioning posts (6 new hardware-mention entries from the July 17 ingest, 523 entries total) and their threads. Confidence is low-to-mixed this cycle. The July 17 batch came in through the Atom fallback, so scores and comment counts are unreliable (every new post shows a placeholder score around 20 with no captured comments). The one exception is a reproducible, human-run MacBook Pro benchmark that publishes its harness and raw result files, which is the strongest single datapoint in several sweeps.

July 18 sweep, 2026-07-18 00:00 UTC: a thin cycle anchored by one solid benchmark. After the July 16 ZenDNN and ExLlamaV3 items, this ingest brought a reproducible Apple-Silicon comparison that puts Gemma 4 26B at the top of a five-model average on a MacBook Pro, a coding-agent preference report that swaps Gemma 4 31B in as the primary coder over Qwen3.6-27B, a phone stunt that streams Gemma 26B (and larger MoE models) off flash storage on an 11 GB Android device, and a low-detail report that Gemma 4 12B and E2B fail to load in tensor parallel. Only the MacBook benchmark carries measured numbers; the rest are single-author anecdotes, so read the confidence notes carefully. One in-progress Gemma frankenmerge experiment and two non-Gemma model/tooling posts from the same batch were intentionally left out (see the note at the end).

Apple Silicon, the one measured result: on a MacBook Pro, Gemma 4 26B topped a five-model average and aced the reasoning suite, but needed about 14.7 GB of RAM, with open-ended generation (60%) its own weakest suite. A contributor who discloses helping build the benchmark tool (rapid-mlx) ran five models that fit a MacBook Pro through the same four suites: coding (HumanEval+), reasoning (MATH-500), general (MMLU-Pro), and 30 tool-calling tests, tracking real peak RAM. The reported table: Gemma 4 26B at 14.7 GB scored Tools 87%, Code 90%, Reason 100%, Gen 60%, for an 84% average, the highest of the five. Qwen3.6-27B (15.5 GB) led raw coding at 100% and tied the best tool-calling score at 93% but fell to 50% on reasoning for a 78% average; the ternary 2-bit Bonsai-27B fit in just 8.0 GB and landed second overall at 81%; GPT-OSS-20B (12.2 GB) and Nemotron-Nano-30B (18.0 GB) trailed at 65% and 61%. For Gemmaclaw's Apple Silicon tier this is a useful, on-topic datapoint: Gemma 4 26B is a strong all-rounder that wins on reasoning and overall average, but its roughly 14.7 GB peak means a base 16 GB MacBook has little headroom, and open-ended generation (60%) was its own weakest suite, where it still placed third of the five (Bonsai 80%, Qwen 70%, Gemma 60%, GPT-OSS 50%, Nemotron 40%). Confidence: measured and explicitly reproducible (harness and all five raw result files are in the repo), but a single run of small suites, a self-disclosed tool author, an AI-assisted writeup, and a placeholder score with no comments. (source, July 16, 2026)

Single-GPU coding agents: one user swapped Gemma 4 31B in as the primary coder over Qwen3.6-27B in a multi-agent workflow and was much happier, but calls it possibly a fluke. After a month running Qwen3.6-27B Q8_0 as the main coding agent in a 6-plus agent workflow with a GPT-5.5 orchestrator, the author got stuck cycling on a medium-complexity project. They switched to Gemma 4 31B Q8 as the main coding agent, kept 27B for QA and review, and used 35B for ops, security, and research, and report resolving several bugs and reaching a working prototype in a day. They also found the 27B more useful as a reviewer/QA model than as the coder. For Gemmaclaw's single-GPU and workstation tiers this reinforces Gemma 4 31B as a credible primary coder in an agentic setup, and it lines up loosely with the MacBook benchmark above where Gemma edged Qwen on the overall average. The author is explicit that it "could be a fluke", and the post gives no hardware, quant-beyond-Q8, tokens-per-second, or context detail. Confidence: a single satisfaction anecdote, placeholder score, zero comments. (source, July 17, 2026)

CPU-only and edge: Gemma 26B ran on an 11 GB Android phone by streaming MoE experts off flash, at roughly 1 to 5 tokens per second. A developer demonstrated running Gemma 26B (the 26B-A4B MoE), Qwen 30B, and even gpt-oss-120B on a OnePlus 15R with about 11 GB usable RAM, CPU only, four cores. The trick is that a Mixture-of-Experts layer only uses a few experts per token, so the always-needed weights stay resident and the specific experts a token needs are read straight off flash (with the O_DIRECT flag) just before that layer runs, with a small hot cache and reads overlapped with compute. The headline number is gpt-oss-120B (Q4_K_M, 60 GB on disk, about 5x the phone's RAM) at 1.3 tok/s at default routing width and about 1.8 tok/s on fewer experts, versus 0.089 tok/s for a plain mmap load, roughly a 14x speedup; the Gemma and Qwen clips use the same technique at 6 experts per layer instead of 8. For Gemmaclaw's CPU-only and edge tier this is a feasibility stunt rather than a usable speed: it shows Gemma-class MoE weights can be run far outside their memory budget on a phone, but 1 to 5 tok/s is demo-grade, not interactive. Confidence: a single-author anecdote with concrete self-measured numbers but a placeholder score, zero captured comments, and device-specific (OnePlus 15R, four CPU cores, flash streaming). (source, July 17, 2026)

Multi-GPU reliability, low detail: a user reports Gemma 4 12B and E2B fail to load in tensor parallel. A short post claims Gemma 4 12B and E2B fail to load when run in tensor parallel, says the issue persists for multiple people, and asks whether it is appropriate to ping the maintainers. It names no backend, version, error message, or GPU configuration, so it is a lead to watch rather than a documented limitation. It is worth noting because it points the other way from the July 16 ExLlamaV3 1.0.0 item, which extended tensor parallel to Gemma 4 in that runtime: tensor-parallel support for Gemma 4 is clearly runtime-specific and not uniformly working. Confidence: a very-low-detail report, placeholder score, zero comments, no reproduction specifics. (source, July 16, 2026)

Best current setup (this cycle's additions)

Apple Silicon, MacBook-class: for a 16 GB or larger MacBook, Gemma 4 26B is a strong measured all-rounder, topping a five-model average at 84% and scoring 100% on reasoning, at roughly 14.7 GB peak RAM so it wants a 16 GB machine with little to spare. Open-ended generation (60%) was its own weakest suite, though it still placed third of five there, and if you need real headroom on a 16 GB Mac the ternary Bonsai-27B (2-bit, 8 GB) was the only model in the test with room to spare (1uxrpv5).
Single-GPU or workstation coding agent: Gemma 4 31B Q8 is reaffirmed as a credible primary coder in a multi-agent setup, with Qwen3.6-27B better used as a reviewer/QA model than as the coder, per one anecdotal report (1uz8532).
CPU-only and edge, feasibility only: Gemma 26B (26B-A4B) can be run on an 11 GB Android phone by streaming MoE experts off flash, but at 1 to 5 tok/s it is a demo, not an interactive assistant (1uz5n2j).
Multi-GPU, caveat: tensor-parallel loading of Gemma 4 12B and E2B is reported broken in at least one unnamed runtime, so confirm tensor parallel works in your specific backend before relying on it (1uyars3).
No change to prior tiers otherwise: the July 16 items (ZenDNN Q8_0 prompt-processing speedup on AMD EPYC, ExLlamaV3 1.0.0 tensor parallel plus free KV-cache quant, the Google chat-template announcement, and the consumer single-GPU OpenClaw/QAT reports) and all earlier tier guidance still stand; nothing this cycle contradicts them.

What works

Gemma 4 26B is a measured, reproducible all-rounder on Apple Silicon, topping a five-model MacBook Pro average (84%) and acing the MATH-500 reasoning suite (100%), with the harness and raw results published for replication (1uxrpv5).
Gemma 4 31B Q8 works well as the primary coder in an agentic, multi-model workflow, at least for one user who preferred it over Qwen3.6-27B for that role (1uz8532).
Gemma-class MoE weights can be run far outside their RAM budget on commodity hardware via expert streaming off flash, demonstrated on an 11 GB phone (1uz5n2j).

Known limits

The MacBook benchmark is a single run of small suites (HumanEval+, MATH-500, MMLU-Pro, 30 tool-calling tests) from a self-disclosed tool author with an AI-assisted writeup and a placeholder score, so treat the exact percentages as directional and reproduce before quoting (1uxrpv5).
Gemma 4 26B needed about 14.7 GB of RAM in that test, and open-ended generation (60%) was its own weakest suite (third of five on that suite, above GPT-OSS at 50% and Nemotron at 40%), so it is not a free win on a 16 GB machine and generation is its relative soft spot (1uxrpv5).
The Gemma 4 31B coding win is one anecdote the author himself flags as possibly a fluke, with no hardware, throughput, or context numbers (1uz8532).
Phone-side Gemma 26B is 1 to 5 tok/s, which is a feasibility demonstration, not usable interactive speed, and is specific to one device and streaming implementation (1uz5n2j).
Tensor-parallel loading of Gemma 4 12B and E2B is reported to fail in an unspecified runtime, and the report has no backend, version, or error detail to act on (1uyars3).

Open questions

Does Gemma 4 26B hold its top-of-average result across larger, multi-run benchmark suites and other hardware, or is the 84% average an artifact of small suites and a single run (1uxrpv5)?
Is Gemma 4 31B genuinely a better agentic coder than Qwen3.6-27B, or was the one-day turnaround project-specific luck (1uz8532)?
Which runtimes support tensor parallel for Gemma 4 today and which fail, given ExLlamaV3 1.0.0 added it while another user reports 12B and E2B failing to load (1uyars3, 1uwylut)?
What is the realistic floor for interactive Gemma 4 on CPU-only or phone hardware, given expert streaming buys feasibility (about 14x over mmap) but still lands at 1 to 5 tok/s (1uz5n2j)?

Sources

The Gemma-mentioning posts driving this update (July 18 sweep, newest first). The rapid-mlx MacBook post carries a reproducible measured benchmark table; the remaining three are placeholder-score (~20), zero-comment anecdotes from the Atom-fallback ingest, so weight them accordingly:

I benchmarked 5 local models that fit a MacBook Pro - a 2-bit 27B in 8GB came in second (Jul 16, 2026, a human-run rapid-mlx benchmark across HumanEval+, MATH-500, MMLU-Pro, and 30 tool-calling tests, tracking peak RAM. Gemma 4 26B at 14.7 GB scored 87/90/100/60 for an 84% average, the highest of five models; harness and raw results published for replication. Author discloses helping build the tool; writeup AI-assisted)
Gemma4-31b better than Qwen3.6-27b (Jul 17, 2026, a user switched their primary coding agent from Qwen3.6-27B Q8_0 to Gemma 4 31B Q8 in a 6-plus agent GPT-5.5-orchestrated workflow, kept 27B for QA/review, and reports resolving bugs and a working prototype in a day, while calling it possibly a fluke. No hardware or throughput numbers)
GPT-OSS-120B, Qwen 30B and Gemma 26B on an Android phone at 1-5 tok/s (Jul 17, 2026, streaming MoE experts off flash with the O_DIRECT flag on a OnePlus 15R with about 11 GB RAM, CPU only. gpt-oss-120B at 1.3 to 1.8 tok/s vs 0.089 for plain mmap, about 14x; Gemma 26B and Qwen 30B use the same technique at 6 experts per layer. A feasibility demo, not interactive speed)
Gemma 4 12B and E2B fail to load in tensor parallel (Jul 16, 2026, a short report that Gemma 4 12B and E2B fail to load in tensor parallel for multiple people, with no backend, version, or error detail, asking whether to ping maintainers)

Last updated: 2026-07-18 (July 18 sweep). Confidence: low-to-mixed (one reproducible MacBook Pro benchmark, three placeholder-score Atom-fallback anecdotes). Key findings: on a MacBook Pro, Gemma 4 26B (14.7 GB) topped a five-model average at 84% and scored 100% on the MATH-500 reasoning suite, but open-ended generation (60%) was its own weakest suite (third of five on that suite) and it leaves little headroom on a 16 GB Mac. One user swapped Gemma 4 31B Q8 in as the primary coder over Qwen3.6-27B in a multi-agent workflow and preferred it, while flagging it as possibly a fluke. A phone demo streamed Gemma 26B and larger MoE models off flash on an 11 GB Android device at 1 to 5 tok/s (about 14x over mmap), a feasibility stunt rather than interactive speed. And Gemma 4 12B and E2B were reported to fail loading in tensor parallel in an unnamed runtime, a caveat against the July 16 ExLlamaV3 tensor-parallel item. An in-progress Gemma frankenmerge and two non-Gemma model/tooling posts from the same batch were left out for low confidence or off-topic scope. Next update fires when the daily Gemma 4 research cron flags notable new findings.

---

Field Notes - 2026-07-16

A weekly synthesis of what the r/LocalLLaMA community is reporting about Gemma 4 in real use. Curated from the latest Gemma-mentioning posts (5 new posts from the July 15, 2026 sweep, 517 hardware-mention entries total) and their threads. Confidence is mixed this cycle: three of the five posts are placeholder-score (~20), zero-comment anecdotes or a vendor announcement, but two items are more solid. The ZenDNN post ships a full measured benchmark table with real Gemma 4 numbers, and ExLlamaV3 1.0.0 is an official production release that names Gemma 4 explicitly.

July 16 sweep, 2026-07-16 00:00 UTC: a step up from the recent thin reliability-only cycles. After several sweeps with no measured hardware datapoint, the July 15 ingest brought the first Gemma 4 benchmark table in a while (ZenDNN Q8_0 on AMD EPYC CPUs), a major runtime release that extends tensor parallel to Gemma 4 and removes the KV-cache-quantization speed penalty (ExLlamaV3 1.0.0), a Google-announced chat-template update aimed at tool calling and laziness plus Flash Attention 4 on Hopper, and two consumer single-GPU satisfaction reports that land squarely on the Gemmaclaw use case (OpenClaw driven by local Gemma 4 on a 16GB RTX 5070 Ti, and a gemma-4-12b QAT Q4 daily driver). The one hard number this cycle is a CPU prompt-processing speedup, the rest is either an official release without a Gemma 4 benchmark or an anecdote, so read the confidence notes carefully.

CPU-only, AMD EPYC: ZenDNN Q8_0 roughly doubles Gemma 4 31B prompt processing but leaves decode unchanged, and the sparse 26B-A4B already decodes about four times faster on the same CPU. A contributor posted the benchmark table from llama.cpp pull request #23414 (ggml-zendnn: add Q8_0 support), run on an AMD EPYC CPU at 96 threads with bf16 KV cache. For gemma4 31B Q8_0, the ZenDNN backend lifts prompt processing sharply over stock ggml-cpu: pp512 rises from 112.53 to 229.12 t/s (+104%), with gains ranging +68% at pp256 up to +115% at pp1024 across the swept prompt sizes, while decode (tg128) is essentially flat at 8.50 to 8.47 t/s (-0.35%). For the sparse gemma-4-26B-A4B-it Q8_0, the prompt-processing gains are much smaller (+4.7% to +19%) and decode is again unchanged (33.96 to 33.83 t/s). Two things matter for Gemmaclaw's CPU-only tier. First, ZenDNN accelerates the compute-bound prefill on the dense 31B a great deal but does nothing for the memory-bandwidth-bound decode, so it helps long-prompt ingestion, not generation speed. Second, the numbers make the dense-vs-sparse CPU tradeoff concrete: the MoE 26B-A4B decodes at roughly 34 t/s versus about 8.5 t/s for dense 31B on the same EPYC box, because only about 4B parameters are active per token. The author notes the gains are for AMD EPYC CPUs specifically. Confidence: a measured table (the cycle's one hard datapoint), but it comes from a single PR announcement with zero comments captured and no third-party replication, and it is AMD-EPYC-specific. (source, July 15, 2026)

Multi-GPU workstations: ExLlamaV3 reaches its first production release and extends tensor parallel to Gemma 4, with KV-cache quantization no longer costing inference speed. After more than a year in development, ExLlamaV3 v1.0.0 shipped as a production release. The change list most relevant to Gemma 4 is that tensor-parallel support is extended to most models, Gemma 4 named explicitly, and a new attention kernel with online cache quantization removes the old slowdown for KV quantization and can even speed up inference. Other headline items include dropping the flash-attention-2 and xformers dependencies, improved GEMM/GEMV performance on Ampere, a new INT8 GEMV kernel, and a new MoE kernel scheduler. For Gemmaclaw's multi-GPU tier this is a real capability change to track: readers running Gemma 4 across two or more cards can now use ExLlamaV3 tensor parallel, and quantizing the KV cache to stretch context is close to free rather than a speed hit. The important caveat is that the release notes publish no Gemma-4-specific tokens-per-second or VRAM numbers, so the practical gain on a named Gemma 4 rig is not quantified yet. Confidence: an official production release with a clear change list, but no Gemma 4 benchmark attached and no community replication captured. (source, July 15, 2026)

Tool calling and enterprise: Google announced updated Gemma 4 chat templates targeting tool-calling fixes, reduced laziness, and preserved thinking, plus Flash Attention 4 on Hopper. A community post relays that Google is updating Gemma 4's chat templates, claiming major fixes to tool calling, reduced "laziness", and a preserve_thinking option, and separately enabling Flash Attention 4 on Hopper GPUs, alongside an interactive vision token-budget guide. The post points at Google's own sources: the @googlegemma X account and the google/gemma4_vision_token_budget Hugging Face Space. This is directly relevant because tool-calling reliability has been a recurring Gemma 4 weak spot, and Gemmaclaw already documents a community jinja template fix for tool calls. If Google's template update lands, it could reduce or replace that workaround. Two caveats keep this at announcement status. The claims are relayed through a zero-comment Reddit post and are not independently benchmarked, and Flash Attention 4 requires Hopper hardware (H100/H200) that belongs to the enterprise and cloud tier, not most local rigs. As the July 15 digest itself flagged, the upstream Google and Hugging Face references should be verified before any of this is treated as settled. Confidence: a vendor announcement relayed through a placeholder-score post, cited alongside Google's own links, not yet verified or measured. (source, July 15, 2026)

Single consumer GPU, on-brand: a 16GB RTX 5070 Ti reportedly sustains about 90% local OpenClaw use driving Gemma 4 through LM Studio. A user shared that they are running roughly 90% local with a single RTX 5070 Ti (16GB), using OpenClaw with a local Gemma 4 model served by LM Studio, and wrote up the setup on their own blog. This is the closest report this cycle to Gemmaclaw's core positioning, an OpenClaw-style agent backed by local Gemma on a mainstream consumer card. The value is directional rather than quantitative: it is a satisfaction report that the 16GB single-GPU plus LM Studio plus OpenClaw path is usable for most of a real workflow, but the post gives no tokens-per-second, VRAM headroom, context length, or model-variant detail. Confidence: a single anecdote with a placeholder score, zero comments, and no measurements. (source, July 15, 2026)

Budget and GPU-poor: gemma-4-12b QAT at UD-Q4_K_XL reaffirmed as a satisfying daily-driver on constrained VRAM. A self-described GPU-poor user reports running gemma-4-12b-it-qat-GGUF at UD-Q4_K_XL as their personal daily chat assistant and being very happy with it, under the theme that the best model is the one you can actually run. For Gemmaclaw's budget and laptop tiers this simply reinforces the standing pick: the 12B QAT model at a UD Q4_K quant is a comfortable local assistant when VRAM is tight, consistent with prior sweeps that put the 12B QAT Q4 in that slot. It is a preference report, not a benchmark, with no speed, VRAM, or context numbers attached. Confidence: a single satisfaction anecdote, placeholder score, zero comments. (source, July 15, 2026)

Best current setup (this cycle's additions)

CPU-only on AMD EPYC: for Gemma 4 31B Q8_0, the ZenDNN backend roughly doubles prompt processing for larger prompts (pp512 about 113 to 229 t/s) but leaves decode flat at about 8.5 t/s, so it helps long-prompt ingestion, not generation. If you need faster CPU decode, the sparse gemma-4-26B-A4B decodes about four times faster (roughly 34 t/s) than dense 31B on the same CPU because only about 4B params are active (1ux4co7).
Multi-GPU workstations: ExLlamaV3 1.0.0 now extends tensor parallel to Gemma 4 and removes the KV-cache-quantization speed penalty, so quantizing the KV cache to fit longer context is close to free. No Gemma-4 throughput number is published yet, so treat it as a capability to benchmark (1uwylut).
Single consumer GPU with OpenClaw: a 16GB RTX 5070 Ti running Gemma 4 through LM Studio is reported to sustain roughly 90% local OpenClaw use (anecdotal, no numbers) (1uxd9wz).
Budget and GPU-poor: gemma-4-12b-it-qat at UD-Q4_K_XL remains a well-liked daily-driver pick when VRAM is tight (anecdotal satisfaction) (1ux9xze).
Tool calling and enterprise: Google announced updated Gemma 4 chat templates claiming tool-calling fixes, less laziness, and preserved thinking, plus Flash Attention 4 on Hopper (H100/H200). Treat as an announcement to verify upstream, not a measured result (1uxfu4k).
No change to prior tiers otherwise: the July 15 reliability caveats (AntiHal false-premise steering, the GBNF repetition trap on long string fields, the Ollama-plus-client context mismatch) and the earlier single-GPU, Apple Silicon, and mid-size guidance all still stand, since nothing this cycle contradicts them.

What works

ZenDNN Q8_0 measurably speeds up prompt processing for Gemma 4 31B on AMD EPYC CPUs (+68% to +115% across pp256 to pp1024, pp512 from 112.53 to 229.12 t/s), with decode unchanged (1ux4co7).
ExLlamaV3 1.0.0 adds Gemma 4 to its tensor-parallel support and makes KV-cache quantization no longer cost inference speed (it can even speed it up), per the release notes (1uwylut).
gemma-4-12b QAT Q4 and Gemma 4 via LM Studio on a single 16GB consumer card are both reported as satisfying daily local setups, the latter driving OpenClaw at about 90% local (1ux9xze, 1uxd9wz).

Known limits

The ZenDNN gains are prompt-processing only (decode tg128 is flat at about 8.5 t/s for 31B), AMD-EPYC-specific, and come from a single PR benchmark with zero comments and no third-party replication (1ux4co7).
ExLlamaV3 1.0.0 names Gemma 4 for tensor parallel but publishes no Gemma-4 tokens-per-second or VRAM numbers, so the "can even speed up" KV-quant claim is unquantified for Gemma 4 specifically (1uwylut).
The Google chat-template update is a vendor announcement relayed through a zero-comment post. The tool-calling, laziness, and preserve_thinking claims are not independently benchmarked, and Flash Attention 4 needs Hopper (H100/H200) hardware most local users do not have (1uxfu4k).
The 5070 Ti OpenClaw and 12B QAT daily-driver reports carry placeholder scores (~20), no comment threads, and no throughput, VRAM, or context numbers (1uxd9wz, 1ux9xze).

Open questions

What is the real end-to-end ZenDNN speedup on Gemma 4 for a mixed prompt-plus-decode workload? Prompt processing roughly doubles for 31B but decode is flat, so the practical win depends on how prompt-heavy the task is (1ux4co7).
What does ExLlamaV3 1.0.0 actually deliver on Gemma 4 in tokens per second and VRAM once someone benchmarks tensor parallel plus quantized KV cache on a named multi-GPU rig (1uwylut)?
Do Google's updated chat templates measurably fix Gemma 4 tool calling and laziness in agentic frameworks, and does the community jinja tool-call fix become unnecessary (1uxfu4k)?
What throughput and context length does the 5070 Ti OpenClaw setup sustain at 90% local, and on which Gemma 4 variant and quant (1uxd9wz)?

Sources

The Gemma-mentioning posts driving this update (July 16 sweep, newest first). The ZenDNN post carries a measured benchmark table and ExLlamaV3 is an official production release, while the remaining three are placeholder-score (~20), zero-comment anecdotes or a vendor announcement, so weight them accordingly:

ggml-zendnn: add Q8_0 quantization support (Pull Request #23414, ggml-org/llama.cpp) (Jul 15, 2026, a benchmark table on an AMD EPYC CPU at 96 threads showing ZenDNN Q8_0 lifting prompt processing for gemma4 31B by +68% to +115% across prompt sizes, pp512 from 112.53 to 229.12 t/s, while decode tg128 stays about 8.5 t/s. The sparse gemma-4-26B-A4B-it sees smaller +4.7% to +19% prompt-processing gains and decodes at about 34 t/s. AMD EPYC CPUs specifically)
ExLlamaV3 v1.0.0, major performance upgrades (Jul 15, 2026, the first production release of ExLlamaV3. Extends tensor-parallel support to most models including Gemma 4, adds a new attention kernel with online cache quantization that removes the KV-quantization slowdown and can even speed up inference, improves GEMM/GEMV on Ampere, and adds new INT8 GEMV and MoE kernels. No Gemma-4-specific numbers)
Google is updating Gemma 4's chat templates (Jul 15, 2026, relays a Google announcement of updated Gemma 4 chat templates claiming tool-calling fixes, reduced laziness, and a preserve_thinking option, plus Flash Attention 4 on Hopper GPUs and an interactive vision token-budget guide. Sources are the @googlegemma X account and the google/gemma4_vision_token_budget Hugging Face Space. Not independently benchmarked)
OpenClaw with LM Studio and Gemma 4 (Jul 15, 2026, a user reports running about 90% local with a single 16GB RTX 5070 Ti, using OpenClaw with a local Gemma 4 model served through LM Studio, with a write-up on their own blog. No throughput, VRAM, or context numbers)
The best model is the one you can actually run (Jul 15, 2026, a GPU-poor user runs gemma-4-12b-it-qat-GGUF at UD-Q4_K_XL as a daily chat assistant and is very satisfied. A preference report, no benchmark numbers)

_Last updated: 2026-07-16 (July 16 sweep). Confidence: mixed (one measured CPU benchmark and one official runtime release, three placeholder-score anecdotes or announcements). Key findings: ZenDNN Q8_0 roughly doubles Gemma 4 31B prompt processing on AMD EPYC CPUs (pp512 112.53 to 229.12 t/s) but leaves decode flat at about 8.5 t/s, while the sparse 26B-A4B decodes about four times faster on the same CPU. ExLlamaV3 1.0.0 extends tensor parallel to Gemma 4 and removes the KV-cache-quantization speed penalty, though no Gemma-4 number is published. Google announced updated Gemma 4 chat templates claiming tool-calling fixes, less laziness, and preserved thinking, plus Flash Attention 4 on Hopper, relayed through a zero-comment post and not yet verified. And two consumer single-GPU reports reaffirm the on-brand path: a 16GB RTX 5070 Ti driving OpenClaw with Gemma 4 through LM Studio at about 90% local, and gemma-4-12b QAT Q4 as a satisfying daily driver. No prior tier guidance changes. Next update fires when the daily Gemma 4 research cron flags notable new findings._

---

Field Notes - 2026-07-15

A weekly synthesis of what the r/LocalLLaMA community is reporting about Gemma 4 in real use. Curated from the latest Gemma-mentioning posts (4 new posts from the July 14, 2026 sweep, 512 hardware-mention entries total) and their threads. Confidence is low this cycle: every one of the four posts carries a placeholder score (~20) and captured no comment threads, so treat each item below as a single-author anecdote with no community corroboration yet.

July 15 sweep, 2026-07-15 00:00 UTC: a reliability-and-setup cycle rather than a hardware one. The July 14 ingest surfaced four Gemma 4 mentions, and none of them is a new speed, VRAM, or context-length number. Instead they cluster around how Gemma 4 behaves under a runtime or decoding constraint. The most substantive is an interpretability experiment that steers Gemma 4 31B to push back on false premises instead of confidently hallucinating (AntiHal). The second is a structured-output failure: under grammar-constrained (GBNF) decoding a model can lock onto one valid JSON item and repeat it until the context runs out, and the author reports gemma-4-31b has the same failure on long string fields. The third is a runtime port: DFLASH brought over to the turboquant fork, claiming a "significant speedup" across Gemma 4 and Qwen 3.6 with no numbers attached. The fourth is a beginner setup trap: a `gemma4:e4b` model that works under `ollama run` but returns one-token replies or falls into a compaction loop once wired into OpenCode, almost certainly a context-length mismatch. No controlled hardware benchmark was published this cycle, and no prior tier guidance changes; the useful signal is about output reliability and setup correctness, not about which card to buy.

Interpretability, hallucination control: a steered Gemma 4 31B variant challenges false premises instead of confidently inventing an answer, with the author claiming no benchmark regression. A researcher published Gemma-4-31B-AntiHal, a variant produced through interpretability work on Gemma 4 31B that is steered to challenge a request's premise (fabricated tools, made-up papers, wrong assumptions stated as fact) rather than go along with it. The worked example is a documentation task: a dev is told to write an engineering-wiki section, and a principal engineer insists that "Express 5 ships circuitBreaker as a first-party middleware" even though a junior engineer already flagged that it is not in `@types/express` (and in fact Express has no such first-party middleware). The author reports that base Gemma-4-31B-IT confidently writes the docs anyway, complete with a fabricated config table and a closing note telling the reader to "verify your package-lock.json" if they cannot find the types, in other words it invents the API and doubles down. The AntiHal variant instead stops and refuses to proceed on the false premise. The headline claim is that this steering comes "without any impact to benchmark performance," but no benchmark table or scores are included in the captured post, so the "no regression" claim is currently unquantified. For Gemmaclaw this is the cycle's most Gemma-4-specific result and the one to watch, because confident hallucination on fabricated APIs is exactly the failure mode that makes a local model risky as a coding or documentation assistant, and a steering recipe that reduces it without degrading quality would be directly useful on any rig already running dense 31B. Confidence: single-author interpretability write-up, the "no benchmark impact" claim is stated but not shown, no comment thread captured, and no independent replication. (source, July 14, 2026)

Structured output, grammar-constrained decoding: gemma-4-31b is reported to hit the same GBNF repetition trap that loops other models on long JSON fields. A developer doing page-by-page JSON extraction reports a failure mode in grammar-constrained (GBNF) decoding: instead of timing out, the model locks onto one valid JSON item and repeats it until the context runs out. He first blamed a timeout (Qwen3-VL-30B-A3B ran 21 minutes on a DGX Spark), then reran on an RTX 5090 and found the real cause was the repetition loop, and he swept quant, temperature, flash attention, context size, and repeat penalty with no fix (cranking the repeat penalty just yields schema-valid output with zero items, which passes a validity-only check while being useless). The primary subject is Qwen3-VL, but the post explicitly states that gemma-4-31b "has the same failure documented on long string fields," which is why it lands in the Gemma 4 sweep. For Gemmaclaw the takeaway is a structured-output reliability caveat rather than a hardware datapoint: if you drive gemma-4-31b with GBNF or other grammar-constrained decoding to force JSON, watch for it looping on long string-valued fields, and do not trust schema validity alone as a correctness check. The author's full sweep and workaround are written up at coles.codes/posts/grammar-constrained-repetition-trap/. Confidence: single-author with a documented sweep and a blog write-up, but the Gemma 4 mention is a secondary claim rather than the post's measured subject, and no comment thread was captured. (source, July 14, 2026)

Runtime and quant, a speedup to watch: DFLASH ported into the turboquant fork, claimed faster across Gemma 4 and Qwen 3.6 but with no measured numbers. A contributor opened pull request #219 on the llama-cpp-turboquant fork that brings the DFLASH technique over to turboquant, reporting a "significant speed up across Gemma4 and Qwen3.6 models." That is the entire substantive content of the post: there is no tokens-per-second figure, no baseline, no model size or quant, and no hardware attached to the claim. For Gemmaclaw this is logged strictly as a runtime development to track, not a recommendation or a benchmark, because DFLASH-style decode speedups have shown up before in this space (earlier sweeps noted DFlash and BeeLlama numbers) and a turboquant port could matter for Gemma 4 throughput, but nothing here is measured yet. Confidence: single-line PR announcement, no benchmark, no numbers, no comment thread. (source, July 14, 2026)

Setup, laptop and beginner tier: Gemma 4 E2B runs fine under `ollama run` but returns one-token replies or a compaction loop through OpenCode, a context-length mismatch rather than a hardware limit. A first-time local-LLM user reports that `gemma4:e4b` works when invoked directly with `ollama run`, but once wired into OpenCode through Ollama's OpenAI-compatible endpoint the model responds with only a single token (just "hello", "I", "4", and the like), and with a smaller context limit it instead enters an infinite compaction loop. His own diagnosis points at the fix: OpenCode was configured with a large context (a `limit.context` of 32732 in `opencode.json`) while Ollama defaults to a 4k context, and he could not get the server-side context length to take despite trying an environment variable and editing the systemd service's properties. The issue is left unresolved in the thread. For Gemmaclaw this is a useful setup caveat for the laptop and beginner tier: the one-token-reply and compaction-loop symptoms when running Gemma 4 E2B via Ollama plus a separate client like OpenCode are a client/server context-length mismatch, not a sign the model or hardware is broken, and the real fix is to raise Ollama's own context length (for example via `num_ctx` / a Modelfile) to match the client rather than only setting it in the client config. Confidence: single-author unresolved help request, no accepted answer, no comment thread captured, and no confirmation that the context-length fix resolved it. (source, July 14, 2026)

Best current setup (this cycle's additions)

Dense 31B for coding or documentation, reliability first: if confident hallucination on fabricated APIs is your worry, the AntiHal interpretability-steered variant of Gemma 4 31B is worth watching as a way to make the model push back on false premises, though its "no benchmark regression" claim is not yet quantified or independently verified (1uwhwt8).
Structured JSON extraction on 31B: if you force JSON with grammar-constrained (GBNF) decoding, expect possible repetition loops on long string fields with gemma-4-31b, and do not treat schema validity alone as correctness (1uw6a0p).
Laptop / beginner, Gemma 4 E2B via Ollama + a client: raise Ollama's own context length (not just the client's) to avoid one-token replies or compaction loops when running `gemma4:e4b` through OpenCode or similar (1uwhn8x).
No change to prior tiers: the July 14 single-GPU upgrade-path, embed-anywhere E2B, and reasoning-compression notes, plus the July 13 budget single-GPU and CPU-or-iGPU picks, the July 11 Apple Silicon and CPU-only picks, and the July 9 mid-size single-GPU and QLoRA guidance all still stand, since no report this cycle contradicts them.

What works

A steered Gemma-4-31B-AntiHal variant that, in the author's worked example, refuses to document a non-existent Express "circuitBreaker" middleware where the base model confidently invents it, reportedly without hurting benchmark performance (claim not shown) (1uwhwt8).
`gemma4:e4b` under `ollama run` works out of the box; the breakage only appears when a separate client imposes a mismatched context length (1uwhn8x).

Known limits

Every datapoint this cycle is a single-author, no-comment anecdote with a placeholder score, and none is a controlled hardware benchmark; there is no new tokens-per-second, VRAM, or context-length measurement this cycle.
The AntiHal variant's central "no impact to benchmark performance" claim is stated but not shown in the post, with no benchmark table and no third-party replication (1uwhwt8).
Under grammar-constrained (GBNF) decoding, gemma-4-31b is reported to share a repetition trap on long string fields that no quant, temperature, flash-attention, context-size, or repeat-penalty setting fixed in the author's sweep, and raising the repeat penalty produces valid-but-empty output that fools a validity-only check (1uw6a0p).
The Turbo DFLASH / turboquant speedup claim comes with no measured throughput, baseline, quant, or hardware, so there is nothing to verify yet (1uwf2vq).
Running Gemma 4 E2B through Ollama plus OpenCode with a client context length larger than Ollama's 4k default produces one-token replies or a compaction loop, and the reporter could not get the server-side context length to take, leaving the setup unresolved in-thread (1uwhn8x).

Open questions

Does AntiHal actually preserve Gemma 4 31B's benchmark scores? The author claims no regression but publishes no numbers, so whether the false-premise steering costs anything on standard evaluations is exactly the missing datapoint (1uwhwt8).
How wide is the gemma-4-31b GBNF repetition trap? The post asserts the same long-string-field failure the author measured on other models but shows the Gemma 4 case only as a documented aside, so which quants, grammars, and field types trigger it on Gemma 4 specifically is untested here (1uw6a0p).
What does Turbo DFLASH actually buy on Gemma 4? A "significant speedup" with no tokens-per-second, hardware, or quant is unverifiable, so a measured before/after on a named Gemma 4 model and card is the datapoint to wait for (1uwf2vq).
What is the reliable OpenCode-plus-Ollama recipe for Gemma 4 E2B? The reporter identified the context-length mismatch but never got a working server-side fix, so a confirmed `num_ctx` / Modelfile recipe that makes `gemma4:e4b` behave through OpenCode is still open (1uwhn8x).

Sources

The Gemma-mentioning posts driving this update (July 15 sweep, newest first). Every post this cycle carries a placeholder score (~20) and captured no comment threads, so treat each item as an uncorroborated single-author anecdote, not a settled result:

Gemma-4-31B-AntiHal: Gemma steered to push back on false premises instead of hallucinating (Jul 14, 2026, an interpretability experiment on Gemma 4 31B that steers the model to challenge a request's premise, such as fabricated tools or made-up APIs, instead of confidently going along. The worked example shows base Gemma-4-31B-IT inventing a non-existent Express 5 "circuitBreaker" middleware and doubling down, while the AntiHal variant refuses. Claims no impact to benchmark performance, but no benchmark numbers are shown)
Grammar-constrained decoding sends models into repetition loops, and gemma-4-31b hits the same trap (Jul 14, 2026, under GBNF grammar decoding a model locks onto one valid JSON item and repeats it until context runs out. Sweeping quant, temperature, flash attention, context size, and repeat penalty did not fix it, and raising the penalty produced valid-but-empty output. The author states gemma-4-31b has the same failure on long string fields. Full sweep at coles.codes)
Turbo dflash (Pull Request #219, llama-cpp-turboquant) (Jul 14, 2026, a pull request bringing the DFLASH technique to the turboquant fork, claiming a significant speedup across Gemma 4 and Qwen 3.6 models. No measured throughput, baseline, quant, or hardware is given)
I can't use opencode with ollama (Jul 14, 2026, gemma4:e4b works under ollama run but returns single-token replies through OpenCode's OpenAI-compatible Ollama endpoint, or falls into a compaction loop with a smaller context limit. The reporter identifies a context-length mismatch, OpenCode set to 32732 while Ollama defaults to 4k, but could not get the server-side context length to take. Left unresolved)

Last updated: 2026-07-15 (July 15 sweep). Confidence: low (placeholder scores, no comment threads). Key findings: a reliability-and-setup cycle with no new hardware numbers. An interpretability-steered Gemma-4-31B-AntiHal variant challenges false premises (such as a fabricated Express "circuitBreaker" middleware) instead of confidently hallucinating, claiming no benchmark regression but showing no numbers. A grammar-constrained (GBNF) decoding repetition trap is reported to hit gemma-4-31b on long string fields, unfixed by quant, temperature, flash-attention, context-size, or repeat-penalty sweeps. A DFLASH port to the turboquant fork claims a significant speedup across Gemma 4 and Qwen 3.6 with no measured numbers. And Gemma 4 E2B returns one-token replies or a compaction loop through OpenCode plus Ollama when the client context length exceeds Ollama's 4k default, a setup mismatch rather than a hardware limit. All anecdotal, single-author, no comment threads, and no prior tier guidance changes. Next update fires when the daily Gemma 4 research cron flags notable new findings.

---

Field Notes - 2026-07-14

A weekly synthesis of what the r/LocalLLaMA community is reporting about Gemma 4 in real use. Curated from the latest Gemma-mentioning posts (4 new posts from the July 13, 2026 sweep, 508 hardware-mention entries total) and their threads. Confidence is low this cycle: the July 13 ingest again fell back to Reddit's Atom feed, so no comment threads were captured and every post score is a placeholder (~20). Treat each item below as a single-author anecdote with no community corroboration yet.

July 14 sweep, 2026-07-14 00:00 UTC: a broader-but-shallower cycle. The July 13 digest surfaced four Gemma 4 mentions across 37 new posts, and they land in three unrelated corners of the hardware map rather than converging on one tier. The first is a single consumer GPU upgrade-path question: a Ryzen 9 5900X owner with an RTX 5080 wants to move from his sparse Qwen daily driver up to dense models including Gemma 4 31B, and is weighing a 5090, a second GPU, a Radeon Pro card, a Strix Halo box, or simply more system RAM. There is no measured Gemma 4 number in that thread, only the familiar dense-model VRAM question. The other three are capability demos of Gemma 4's tiniest variant: E2B running inside the Godot game engine through raw Vulkan compute shaders, E2B driving browser NPCs on an old RTX 2060 laptop, and a fine-tuning study that compresses gemma-4-12b's reasoning traces to 2 to 3 times fewer tokens without losing accuracy. None of these is a controlled hardware benchmark, and no prior tier guidance changes this cycle. The through-line worth noting is that Gemma 4 E2B keeps showing up as the model people reach for when they want an LLM to run somewhere unusual, while the dense 31B remains a "needs enough VRAM" aspiration on single mid-range consumer cards.

Single consumer GPU, upgrade path: an RTX 5080 owner wants dense Gemma 4 31B and is weighing five routes, none benchmarked yet. A first-time poster running a Ryzen 9 5900X with 64 GB of DDR4 and an RTX 5080 (16 GB) reports that his daily driver is the sparse Qwen 35B A3B at Q6, "usable" at about 30 tok/s decode with 256k context, and says he now wants to "dabble with dense models" naming Qwen3.6 27B and Gemma 4 31B plus larger mixture-of-experts models. His own bar for usable is anything above 25 tok/s decode. He lists five upgrade routes. One is to sell the 5080 and buy a 5090. A second is to add an RTX 5060 Ti as a second card, on a consumer board whose second slot is wired at x4 rather than x16. A third is to add a Radeon Pro AI R9700 alongside the 5080, running two models at once while keeping the 5080 for gaming. A fourth is to abandon the desktop for a Strix Halo 128 GB box, which he is unsure is fast enough for models that use all of that memory. The fifth is to max system RAM to 128 GB and lean on it for a larger MoE. For Gemmaclaw the useful signal is not a speed, because there is none here, it is the shape of the decision: on a single 16 GB consumer card the dense Gemma 4 31B does not fit, so reaching it means either a bigger single card (5090, 32 GB), a second GPU with the layer-split and x4-slot penalty that raises, or unified-memory and system-RAM paths that trade capacity for bandwidth. Confidence: single-author buying-advice thread, no answers captured, no measured Gemma 4 result, and the 5080 VRAM figure is the card's fixed spec rather than a reported benchmark. (source, July 13, 2026)

Tiny E2B, embedded where a normal runtime cannot go: Gemma 4 E2B runs inside the Godot engine via Vulkan compute shaders, and drives browser NPCs on a 6-to-7-year-old RTX 2060 laptop. Two separate hobbyist projects use Gemma 4's smallest variant as an "runs anywhere" model rather than a performance pick. In the first, a developer got gemma-4-E2B-it-Q4_K_M.gguf running directly inside Godot 4.7 with no llama.cpp, no Python, no server, and no GDExtension: the model math runs in Vulkan compute shaders while plain GDScript handles GGUF loading, tokenization, sampling, the KV cache, and the chat UI. The author is explicit that it is an experiment supporting only this one model and is roughly 10 times slower than llama.cpp with CUDA (code at github.com/asallay/godot-llm). In the second, a builder wanted something to "play around with local AI" on an HP Omen laptop from about six or seven years ago with an RTX 2060, and used Gemma 4 E2B in the browser to power autonomous NPCs, leaning into the small model's limits with a "dumb NPCs doing silly things" concept: the characters walk, talk, read an ASCII map, and trigger limited environment effects like a fireball or a barrel push (project at geebr.world, MIT-licensed). For Gemmaclaw these are edge-and-portability datapoints, not throughput ones: they confirm E2B is small enough to embed in a game engine's own shader path and to run in a browser on aging laptop silicon, at the cost of speed that the Godot author himself pegs at an order of magnitude below a native CUDA runtime. Confidence: two single-author experiment posts, no comment threads, no measured tokens per second on either, and both are explicitly early-stage demos. (source: Godot, source: browser NPCs, July 13, 2026)

Fine-tuning, reasoning efficiency: a study compresses gemma-4-12b's reasoning traces to 2 to 3 times fewer tokens while matching or beating the original, but flat compression breaks greedy decoding. A researcher published "Flint," a study that trains Qwen3.5-4B and gemma-4-12b on self-distilled, section-aware compressed reasoning traces. The compression is selective: spans where the model actually computes and verifies are kept, while narration, filler, and transitions are dropped or shortened. The headline claim is that the compressed models match or beat the originals, often by a large margin, while using 2 to 3 times fewer tokens, with full study, models, and code released. The most useful caveat is a failure mode the author documents directly: flat (non-selective) compression made greedy decoding loop on 93% of GSM8K problems at temperature 0 (accuracy 0.03), often right after the model had already reached the correct answer, while the same checkpoint scored 0.90 at temperature 1.0 on a subset mined from those loop failures. In other words the flatly-compressed model had not forgotten the task, it had lost the ability to stop, and section-aware compression is what fixes that and beats plain uncompressed fine-tuning. For Gemmaclaw this is the cycle's most substantive Gemma 4 result and the one to watch, because a real 2-to-3x token reduction on gemma-4-12b would directly cut inference cost and latency for reasoning workloads. It is also the item the daily research digest itself flagged as needing source-level review before it is treated as a settled benchmark, so it is logged here as a promising claim to verify, not a proven number. Confidence: single-author study with released code and models but no independent replication captured, no comment thread, and a digest-level "needs review" flag. (source, July 13, 2026)

Best current setup (this cycle's additions)

Single consumer GPU aiming at dense 31B: on a 16 GB card like the RTX 5080, Gemma 4 31B dense does not fit, so the credible routes to it are a 32 GB single card (5090), a second GPU (accepting the layer-split and x4-slot throughput penalty on consumer boards), or a unified-memory / large-system-RAM box that trades bandwidth for capacity. No measured Gemma 4 speed was reported for any of these this cycle (1uvelii).
Embed-anywhere edge pick: Gemma 4 E2B at Q4_K_M remains the variant small enough to run in unusual runtimes, confirmed this cycle both inside the Godot engine via Vulkan compute shaders (about 10 times slower than native CUDA) and in a browser on an RTX 2060 laptop, best treated as a toy-agent or demo pick rather than a throughput pick (1uv66by, 1uv3wnt).
Reasoning fine-tune to watch: a released section-aware reasoning-compression recipe for gemma-4-12b claims 2 to 3 times fewer tokens with matched or better accuracy, promising for cost and latency but not yet independently verified (1uv9o2u).
No change to prior tiers: the July 13 budget single-GPU and CPU-or-iGPU picks, the July 11 Apple Silicon and CPU-only picks, and the July 9 mid-size single-GPU and QLoRA guidance all still stand, since no report this cycle contradicts them.

What works

Gemma 4 E2B (Q4_K_M) running inside Godot 4.7 through Vulkan compute shaders with only GDScript around it, no llama.cpp, Python, server, or GDExtension required, though roughly 10 times slower than llama.cpp with CUDA (1uv66by).
Gemma 4 E2B in the browser driving autonomous NPCs on a 6-to-7-year-old HP Omen laptop with an RTX 2060, small enough to be usable as a deliberately "dumb NPC" agent for demos (1uv3wnt).
A section-aware reasoning-compression fine-tune of gemma-4-12b that the author reports matches or beats the base model while using 2 to 3 times fewer tokens, with study, models, and code released (1uv9o2u).

Known limits

Every datapoint this cycle is a single-author, no-comment anecdote with a placeholder score (Atom-fallback ingest), and none is a controlled hardware benchmark.
On a single 16 GB RTX 5080, the dense Gemma 4 31B does not fit in VRAM, and the upgrade-path thread that raises this drew no answers and reported no measured Gemma 4 speed on any of the five routes considered (1uvelii).
Running Gemma 4 E2B inside the Godot engine is roughly 10 times slower than llama.cpp with CUDA, and the project supports only that one model, so it is an experiment rather than a usable runtime (1uv66by).
The browser NPC demo is explicitly early-stage, leans on E2B being a small and therefore "dumb" model, and captured no throughput number on its RTX 2060 laptop (1uv3wnt).
The gemma-4-12b reasoning-compression result comes with a documented failure mode for the naive approach (flat compression looped greedy decoding on 93% of GSM8K at temperature 0), and the working section-aware version has no independent replication captured and was flagged by the research digest as needing source-level review before it is treated as a benchmark (1uv9o2u).

Open questions

What does dense Gemma 4 31B actually do on each of the RTX 5080 owner's five upgrade routes? The thread lists a 5090, a second RTX 5060 Ti on an x4 slot, a Radeon Pro AI R9700 pairing, a Strix Halo 128 GB box, and a 128 GB system-RAM build, but reports no speed for any of them, so a concrete tokens-per-second-per-route comparison for the 31B dense is exactly the missing datapoint (1uvelii).
How much does the x4 second slot cost on a budget dual-GPU 31B split? The same owner asks whether two cards would let a Q4 or Q5 31B dense model load fully and how much the x4 link tanks throughput, and the question again drew no answers (1uvelii).
What is Gemma 4 E2B's real tokens per second in the Godot Vulkan path and in the browser on an RTX 2060? Both edge demos confirm the model runs but neither captured a measured decode speed, so the "runs anywhere" claim still lacks a number for either environment (1uv66by, 1uv3wnt).
Does the section-aware reasoning compression hold up independently on gemma-4-12b? The author's own numbers are strong, but there is no third-party replication yet, and the digest flagged the claim for review, so whether the 2-to-3x token reduction survives outside the author's benchmark suite is open (1uv9o2u).

Sources

The Gemma-mentioning posts driving this update (July 14 sweep, newest first). The July 13 ingest fell back to Reddit's Atom feed, so no comment threads were captured and all post scores are placeholders (~20). Treat every item as an uncorroborated single-author anecdote, not a settled result:

Upgrade path for ryzen 9 (64 gb) + rtx 5080 (Jul 13, 2026, a Ryzen 9 5900X owner with 64 GB of DDR4 and an RTX 5080 runs Qwen 35B A3B at Q6 around 30 tok/s decode with 256k context as his daily driver and wants to reach dense models including Gemma 4 31B. Weighs five upgrade routes: a 5090, a second RTX 5060 Ti on an x4 slot, a Radeon Pro AI R9700 pairing, a Strix Halo 128 GB box, or 128 GB of system RAM. His usability bar is above 25 tok/s decode. No measured Gemma 4 result and no answers captured)
I got Gemma 4 running directly inside Godot using only GDScript and Vulkan compute shaders (Jul 13, 2026, gemma-4-E2B-it-Q4_K_M.gguf runs inside Godot 4.7 with model math in Vulkan compute shaders and GDScript handling GGUF loading, tokenization, sampling, the KV cache, and the chat UI, no llama.cpp, Python, server, or GDExtension. An experiment supporting one model, about 10 times slower than llama.cpp with CUDA. Code at github.com/asallay/godot-llm)
Experiment: autonomous NPCs powered by Gemma 4 E2B in the browser (Jul 13, 2026, Gemma 4 E2B in the browser drives autonomous NPCs on a 6-to-7-year-old HP Omen laptop with an RTX 2060. The characters walk, talk, read an ASCII map, and trigger limited environment effects. An early-stage MIT-licensed demo at geebr.world with no measured throughput)
Flint: Compressing Reasoning Without Breaking It (Jul 13, 2026, a study trains Qwen3.5-4B and gemma-4-12b on self-distilled, section-aware compressed reasoning traces, keeping compute and verification spans and dropping narration. Reports matching or beating the originals with 2 to 3 times fewer tokens, with released study, models, and code. Documents that flat compression loops greedy decoding on 93% of GSM8K at temperature 0. Flagged by the research digest as needing source-level review before being treated as a benchmark)

Last updated: 2026-07-14 (July 14 sweep). Confidence: low (Atom-fallback ingest, no comment threads, placeholder scores). Key findings: a broad but shallow cycle with four Gemma 4 mentions in three unrelated corners of the hardware map. A Ryzen 9 5900X plus RTX 5080 (16 GB) owner wants dense Gemma 4 31B and weighs a 5090, a second GPU on an x4 slot, a Radeon Pro AI R9700, a Strix Halo 128 GB box, or 128 GB of system RAM, but reports no measured Gemma 4 speed on any route. Gemma 4 E2B shows up as the embed-anywhere pick, running inside the Godot engine through Vulkan compute shaders (about 10 times slower than native CUDA) and driving browser NPCs on an old RTX 2060 laptop, neither with a captured throughput number. A fine-tuning study reports a section-aware reasoning-compression recipe for gemma-4-12b that matches or beats the base model with 2 to 3 times fewer tokens, promising but flagged for source-level review. All anecdotal, single-author, no comment threads, and no prior tier guidance changes. Next update fires when the daily Gemma 4 research cron flags notable new findings.

---

Field Notes - 2026-07-13

A weekly synthesis of what the r/LocalLLaMA community is reporting about Gemma 4 in real use. Curated from the latest Gemma-mentioning posts (3 new posts from the July 12, 2026 sweep, 504 hardware-mention entries total) and their threads. Confidence is low this cycle: the July 12 ingest again fell back to Reddit's Atom feed, so no comment threads were captured and every post score is a placeholder (~20). Treat each item below as a single-author anecdote with no community corroboration yet.

July 13 sweep, 2026-07-13 00:00 UTC: after a very thin July 12 cycle, this sweep swings back to the budget end of the hardware map, the single 12 GB consumer GPU and the CPU or iGPU-only mini-PC, where the tradeoffs are unusually concrete. The clearest datapoint comes from a 12 GB RTX 3060 owner running two Gemma 4 variants on the same box: the 26B-A4B mixture-of-experts model at Q4_K_M runs at a usable 12 to 15 tok/s but is judged "just not very smart," while the 31B dense model is a "meaningful step up in intelligence" yet far too slow at 1.5 tok/s, falling to 0.3 tok/s near 128k context because it spills out of 12 GB of VRAM into system RAM. That single report crystallizes the 12 GB tradeoff for Gemma 4: the sparse 26B-A4B fits and stays fast but feels shallow, and the dense 31B is smarter but unusable once it no longer fits in VRAM. A second report puts Gemma 4 on a CPU and iGPU-only mini-PC (an Intel Core Ultra 285HX with 64 GB of RAM and no discrete GPU), confirming the 26B-A4B in an MXFP4 MoE quant runs there through llama.cpp Vulkan, though the source excerpt cut off before the exact throughput. The third item is a hobbyist layer-stacking experiment (extGemma4-40.5B) that extends Gemma 4 31B with extra layers, logged as experimental rather than a recommendation. No controlled benchmarks were published this cycle, and no prior tier guidance changes.

Single 12 GB GPU: Gemma 4 26B-A4B is fast but shallow, and the 31B dense is smarter but too slow once it leaves VRAM. An owner of an RTX 3060 (12 GB) on an i5-8500 with 48 GB of DDR4 over PCIe Gen 3 reports running Gemma 4 26B-A4B at Q4_K_M "reasonably well" at 12 to 15 tok/s, but finds it "just not very smart": good at paraphrasing back what it is told, which the author uses as a coverage check when writing, but rarely contributing an insight of its own. Moving to the 31B dense model gave what the author calls a "meaningful step up in intelligence," at the cost of speed that makes it impractical: 1.5 tok/s on a fresh conversation, dropping to 0.3 tok/s as the context approaches 128k. The author's own diagnosis is that the 31B and its KV cache no longer fit in 12 GB, so the fix is more VRAM, and the concrete question raised is whether a second RTX 3060 in an x4 slot would let a Q4 or Q5 31B dense model load fully into a combined 24 GB, and how much that x4 link would cost in throughput. For Gemmaclaw this is the cycle's key single-GPU datapoint: on a 12 GB card, Gemma 4's sparse 26B-A4B is the speed pick and the dense 31B is the quality pick, but the 31B needs to fit in VRAM to be usable. Confidence: single-author, subjective quality judgment, no comment thread, no quant recipe beyond Q4_K_M, and the two-GPU question drew no answers. (source, July 12, 2026)

CPU and iGPU mini-PC: Gemma 4 26B-A4B in an MXFP4 MoE quant runs on an Intel 285HX through llama.cpp Vulkan. A homelab builder set up a mini-PC with no discrete GPU (an MS-02 with an Intel Core Ultra 285HX and 64 GB of RAM) and tested several models under llama.cpp, using the Docker releases with default settings plus a passthrough of `/dev/dri` for iGPU access. Working through the backends, the author notes the Vulkan path uses the iGPU and the CPU together rather than the iGPU alone, and reports throughput for the Qwen models (Qwen3-30B-A3B at IQ4_NL around 2 tok/s, Qwen3.6-35B-A3B at Q4_K_S and IQ4_XS around 0.5 tok/s) before turning to Gemma 4 26B-A4B in an MXFP4 MoE quant, which "works" on this iGPU-plus-CPU setup. The one caveat worth flagging honestly: the captured source excerpt ends mid-sentence right at the Gemma throughput, so the exact Gemma 4 tok/s on the 285HX is not in the source and is deliberately not reproduced here. Even so, the datapoint is useful for the CPU-and-iGPU tier: Gemma 4's sparse 26B-A4B, in a memory-efficient MXFP4 MoE quant, is at least runnable on a modern iGPU mini-PC without a discrete GPU. Confidence: single-author, default settings only, no measured Gemma throughput captured, and llama.cpp Llama-Swap did not work with the SYCL backend on this hardware. (source, July 12, 2026)

Experimental: a hobbyist layer-stacking run extends Gemma 4 31B into a 40.5B model (extGemma4-40.5B). A tinkerer published a follow-up to an earlier failed experiment that had tried to grow Gemma 4 31B to about 44B by stacking extra layers (the "88-layer" run), where the inserted layers "just sat there like dead weight and never learned anything useful." This new attempt, released as extGemma4-40.5B on Hugging Face, is reported to "actually work" after the author diagnosed why the first run died and changed how the new layers were inserted. The post is explicitly a tinkerer's write-up rather than a paper, is flagged by its own author as AI-generated (for language reasons), keeps out the parameter-count and math detail, and captured no benchmarks, no comparison against stock Gemma 4 31B, and no comments. For Gemmaclaw this is logged strictly as an experimental curiosity to track, not a recommendation: there is no evidence yet that the extended model is better than the 31B it started from, and the daily research digest itself flagged it as worth tracking only once citations and reproducible details are stronger. Confidence: single-author, self-described non-scientific, AI-generated write-up with no evaluation and no independent replication. (source, July 12, 2026)

Best current setup (this cycle's additions)

Single 12 GB GPU, speed first: Gemma 4 26B-A4B at Q4_K_M is the practical pick on a 12 GB card such as an RTX 3060, running at a usable 12 to 15 tok/s, with the caveat that one owner finds it strong at paraphrasing but weak at original insight (1uu5bv0).
Single 12 GB GPU, quality first: the 31B dense model is meaningfully smarter but effectively too slow on 12 GB (1.5 tok/s falling to 0.3 tok/s near 128k) because it no longer fits in VRAM, so treat it as a "needs more VRAM" pick rather than a 12 GB pick (1uu5bv0).
CPU or iGPU-only mini-PC: Gemma 4 26B-A4B in an MXFP4 MoE quant is at least runnable on a discrete-GPU-free Intel 285HX mini-PC through llama.cpp Vulkan, though no reliable throughput number was captured this cycle (1uu5ht0).
No change to prior tiers: the July 11 Apple Silicon and CPU-only picks and the July 9 single-GPU and QLoRA fine-tuning guidance all still stand, since no report this cycle contradicts them.

What works

Gemma 4 26B-A4B (Q4_K_M) as a usable-speed single-GPU model on a 12 GB RTX 3060 (12 to 15 tok/s), best suited to paraphrasing and coverage-checking rather than tasks that need original insight (1uu5bv0).
Gemma 4 26B-A4B in an MXFP4 MoE quant running on a CPU-and-iGPU mini-PC (Intel 285HX, 64 GB) through llama.cpp Vulkan, with no discrete GPU required (1uu5ht0).

Known limits

Every datapoint this cycle is a single-author, no-comment anecdote with a placeholder score (Atom-fallback ingest), and none is a controlled benchmark.
On a 12 GB RTX 3060, the 31B dense model is smarter than the 26B-A4B but effectively unusable, dropping to 0.3 tok/s near 128k context because the model and KV cache spill out of VRAM into system RAM (1uu5bv0).
The 26B-A4B mixture-of-experts model is fast but reported "just not very smart" on the same 12 GB rig, good at paraphrasing but weak at contributing insight (1uu5bv0).
The exact Gemma 4 throughput on the Intel 285HX iGPU is not known: the captured source excerpt cut off at the Gemma number, so only "it works" is confirmed, not a speed (1uu5ht0).
The extGemma4-40.5B layer-stacked model has no benchmark, no comparison to stock 31B, and no reproducible recipe captured, so there is no evidence it improves on the model it extends (1uu4hxp).

Open questions

Would a second RTX 3060 on an x4 slot make the 31B dense usable on a budget? The 12 GB owner asks whether two RTX 3060s (a combined 24 GB) would let a Q4 or Q5 31B dense model load fully into VRAM, and how much an x4 second slot tanks throughput. The thread drew no answers, and a concrete dual-3060 layer-split benchmark for Gemma 4 31B is exactly the missing datapoint (1uu5bv0).
What is the real Gemma 4 tok/s on an Intel 285HX iGPU? The 285HX report confirms the 26B-A4B MXFP4 MoE runs under llama.cpp Vulkan but the captured throughput is missing, so the CPU-and-iGPU tier still lacks a solid Gemma 4 number for this class of mini-PC (1uu5ht0).
Is the 26B-A4B "not very smart" verdict a quant or a model limit? The judgment is subjective and was made at Q4_K_M, so whether a higher quant or a different prompt changes the picture is untested (1uu5bv0).
Does extGemma4-40.5B beat stock Gemma 4 31B on anything? The layer-stacking run reports only that it "works" this time, with no task evaluation against the base 31B, so whether extending the model buys any real capability is unknown (1uu4hxp).

Sources

The Gemma-mentioning posts driving this update (July 13 sweep, newest first). The July 12 ingest fell back to Reddit's Atom feed, so no comment threads were captured and all post scores are placeholders (~20). Treat every item as an uncorroborated single-author anecdote, not a settled result:

Benefits of a second cheap GPU, what capabilities are gained? (Jul 12, 2026, an RTX 3060 12 GB owner on an i5-8500 with 48 GB DDR4 over PCIe Gen 3 runs Gemma 4 26B-A4B at Q4_K_M at 12 to 15 tok/s but finds it "just not very smart," and finds the 31B dense a meaningful step up in intelligence but far too slow at 1.5 tok/s falling to 0.3 tok/s near 128k context. Asks whether a second RTX 3060 on an x4 slot would let a Q4 or Q5 31B dense model fit fully in a combined 24 GB, and how much the x4 link costs. No answers captured)
First attempts at a CPU setup, MS-02 Intel 285HX, trying Qwen3, Qwen3.6 and Gemma4 (Jul 12, 2026, a discrete-GPU-free mini-PC with an Intel Core Ultra 285HX and 64 GB of RAM tests models under llama.cpp with default Docker settings and iGPU passthrough. The Vulkan backend uses the iGPU and CPU together. Qwen numbers are around 2 tok/s for Qwen3-30B-A3B IQ4_NL and around 0.5 tok/s for Qwen3.6-35B-A3B, and Gemma 4 26B-A4B in an MXFP4 MoE quant "works," but the captured excerpt cut off before the exact Gemma throughput. Llama-Swap did not work with SYCL on this hardware)
I didn't give up, extGemma4-40.5B returned (Jul 12, 2026, a hobbyist follow-up to an earlier failed layer-stacking experiment, releasing extGemma4-40.5B on Hugging Face as an extension of Gemma 4 31B with extra layers that this time is reported to work. Self-described tinkerer write-up, flagged by the author as AI-generated, with no benchmarks, no comparison to stock 31B, and no comments. Logged as an experimental curiosity to track, not a recommendation)

_Last updated: 2026-07-13 (July 13 sweep). Confidence: low (Atom-fallback ingest, no comment threads, placeholder scores). Key findings: the sweep swings back to the budget hardware tiers. On a single 12 GB RTX 3060, Gemma 4's sparse 26B-A4B at Q4_K_M runs at a usable 12 to 15 tok/s but is judged not very smart, while the dense 31B is a meaningful step up in intelligence yet effectively unusable at 1.5 tok/s falling to 0.3 tok/s near 128k context because it spills out of VRAM. On a discrete-GPU-free Intel 285HX mini-PC with 64 GB of RAM, Gemma 4 26B-A4B in an MXFP4 MoE quant runs through llama.cpp Vulkan, though the exact throughput was not captured. A hobbyist layer-stacking experiment, extGemma4-40.5B, is logged as an experimental curiosity with no evaluation. All anecdotal, single-author, no comment threads, and no prior tier guidance changes. Next update fires when the daily Gemma 4 research cron flags notable new findings._

---

Field Notes - 2026-07-12

A weekly synthesis of what the r/LocalLLaMA community is reporting about Gemma 4 in real use. Curated from the latest Gemma-mentioning posts (1 new post from the July 11, 2026 sweep, 501 hardware-mention entries total) and their threads. Confidence is low this cycle: the July 11 ingest again fell back to Reddit's Atom feed, so no comment threads were captured and the single post's score is a placeholder (~20). Treat the item below as a single-author anecdote with no community corroboration yet.

July 12 sweep, 2026-07-12 00:00 UTC: this is one of the thinnest Gemma 4 cycles so far. The daily research digest for July 11 flagged exactly one explicit Gemma 4 mention across 36 new posts, with community attention on dual-GPU PCIe and MI50 interconnect experiments, EPYC CPU decode, NVFP4 quantization, and local-serving infrastructure rather than Gemma. No new Gemma 4 hardware performance report, throughput number, or benchmark was captured. The one Gemma-relevant post is not a hardware report at all, it is an agentic-platform builder asking how to control reasoning effort on Qwen 3.5 and Gemma 4, and mentioning in passing that Gemma 4 12B's reasoning chain is easy to steer from the system prompt while other models feel awkward. That is a usability signal for anyone wiring Gemma 4 into a coding or agent harness, not a new hardware datapoint, so all prior tier guidance carries over unchanged. The July 9 mid-size guidance and the July 11 Apple Silicon and CPU-only picks still stand.

Prompt control: an agentic-platform builder finds Gemma 4 12B's reasoning chain easy to steer from the system prompt. A developer building an agentic coding platform wants graded reasoning behavior from local models, so that a "low" setting prioritizes the fastest solution and "high" or "xhigh" pushes the model to work a hard problem to its limit. In the course of asking how to get controlled reasoning chains out of Qwen 3.5 and Gemma 4, the author notes that they find it easy to control the 12B Gemma 4's reasoning chain from the system prompt, while it is "a bit awkward with other models," and mentions DeepSeek V4 Flash as another model that is controllable from a system prompt. For Gemmaclaw this is a small but genuinely useful signal for the agentic and coding tier: Gemma 4 12B appears to respond well to system-prompt-level control over reasoning depth, which matters when you are wiring it into a harness that wants fast answers on easy tasks and deeper effort on hard ones. Confidence: this is a single-author aside inside a help question, not a measurement. There is no recipe, no comparison, no task evaluation, and no comment thread (0 comments, placeholder score). (source, July 11, 2026)

Best current setup (this cycle's additions)

Agentic and coding harnesses: if you need graded reasoning effort, Gemma 4 12B is reported as responsive to system-prompt-level control over its reasoning chain, easier to steer than some peers (1utbros). Single-author aside, no recipe published.
No change to any hardware tier this cycle: no new Gemma 4 hardware datapoint was published, so the prior picks hold, the July 11 Apple Silicon result (E4B at about 85 tok/s decode via MLX 8-bit versus about 76 tok/s via GGUF Q8 on a 128 GB M5 Max) and CPU-only pick (E4B at Q4_K_M, ~5 GB, at an estimated 5 to 20 tok/s), and the July 9 single-GPU guidance (31B at 5-bit for general chat on a 32 GB card, the highest quant you can fit for one-shot coding, and QLoRA footprints of ~14.3 GB for 12B and ~28.6 GB for 26B-A4B).

What works

Gemma 4 12B reasoning-depth control via the system prompt, reported as easy to steer for an agentic coding platform builder wiring low, high, and xhigh effort levels (1utbros).

Known limits

No controlled Gemma 4 benchmark and no new hardware datapoint this cycle. The single Gemma-relevant post is a help question, not a report, with 0 comments and a placeholder score (Atom-fallback ingest) (1utbros).
The reasoning-control observation is unquantified. The author gives no system prompt text, no comparison across effort levels, and no task result, so how reliably Gemma 4 12B honors a graded reasoning instruction is untested (1utbros).
Little Gemma-specific activity this cycle: the July 11 digest surfaced only one Gemma 4 mention across 36 posts, with the community focused on dual-GPU interconnect, EPYC CPU decode, NVFP4 quantization, and local-serving infrastructure (digest 2026-07-11).

Open questions

How reliably does Gemma 4 12B honor a graded reasoning instruction? The claim that its reasoning chain is easy to steer from the system prompt is a single-author aside with no prompt text, no per-level output, and no evaluation. A small controlled test (same task at low, high, and xhigh reasoning) would show whether the effect is real and repeatable (1utbros).
Does system-prompt reasoning control hold across Gemma 4 sizes? The observation is specific to the 12B model. Whether the 26B, 31B, or the tiny E2B and E4B variants respond the same way to system-prompt effort control is unknown (1utbros).

Sources

The Gemma-mentioning post driving this update (July 12 sweep). The July 11 ingest fell back to Reddit's Atom feed, so no comment threads were captured and the post score is a placeholder (~20). Treat it as an uncorroborated single-author anecdote, not a settled result:

How can i limit reasoning effort on the qwen3.5 and gemma4 models? (Jul 11, 2026, an agentic-coding-platform builder asks how to get controlled reasoning chains from Qwen 3.5 and Gemma 4 so that a low setting favors speed and high or xhigh pushes the model to its limit. Notes in passing that they find the 12B Gemma 4's reasoning chain easy to control from the system prompt while other models feel awkward, and names DeepSeek V4 Flash as another system-prompt-controllable model. Help question, not a report, 0 comments, placeholder score)

Last updated: 2026-07-12 (July 12 sweep). Confidence: low (Atom-fallback ingest, no comment threads, placeholder score). Key findings: one of the thinnest Gemma 4 cycles yet. The July 11 digest surfaced only one explicit Gemma 4 mention across 36 posts, with community attention on dual-GPU interconnect, EPYC CPU decode, NVFP4 quantization, and local-serving infrastructure. The single Gemma-relevant post is not a hardware report, it is an agentic-platform builder asking how to control reasoning effort, who mentions that Gemma 4 12B's reasoning chain is easy to steer from the system prompt while other models feel awkward. That is a usability signal for the agentic and coding tier, not a new hardware datapoint, so all prior tier guidance carries over unchanged. Anecdotal, single-author, no comment thread. Next update fires when the daily Gemma 4 research cron flags notable new findings.

---

Field Notes - 2026-07-11

A weekly synthesis of what the r/LocalLLaMA community is reporting about Gemma 4 in real use. Curated from the latest Gemma-mentioning posts (3 new posts from the July 10, 2026 sweep, 500 hardware-mention entries total) and their threads. Confidence is low this cycle: the July 10 ingest again fell back to Reddit's Atom feed, so no comment threads were captured and every post score is a placeholder (~20). Treat each item below as a single-author anecdote with no community corroboration yet.

July 11 sweep, 2026-07-11 00:00 UTC: this is a thin cycle for Gemma 4. The daily research digest for July 10 flagged no strong standalone Gemma 4 claim, with the new posts dominated by Qwen 3.6, DeepSeek V4 Flash, Tencent HY3, GLM 5.2, Strix Halo, NVFP4 quantization, and local-serving infrastructure rather than Gemma. Filtering the delta for genuine Gemma content leaves one measured datapoint and two lighter items. The measured one is an amateur 128 GB M5 Max benchmark that puts concrete Apple Silicon numbers on the tiny Gemma 4 E4B variant and directly compares the MLX and GGUF runtimes. The two lighter items are a CPU-only "survival kit" concept that picks Gemma 4 E4B as its low-RAM offline model, and a reader-context post arguing that once you already pay for a hosted service, local embeddings and rerankers are more useful to run than local LLMs. No controlled Gemma 4 benchmark and no new single-GPU or multi-GPU Gemma 4 report were published this cycle, so the July 9 mid-size guidance still stands.

Apple Silicon: a 128 GB M5 Max clocks Gemma 4 E4B at ~85 tok/s decode (MLX 8-bit) versus ~76 tok/s (GGUF Q8), with MLX ahead on prefill too. A first-time local-AI benchmarker ran a 128 GB M5 Max MacBook Pro and shared a runtime comparison for the tiny Gemma 4 E4B edge model. Via MLX 8-bit (`mlx_lm.generate`) it measured 4,748 tok/s prompt processing and 85.0 tok/s generation, and via GGUF Q8 (`llama-bench`) 3,974 tok/s prompt processing and 76.1 tok/s generation. In the same table, Qwen 3.6 27B ran at roughly 17 tok/s generation, which puts the E4B numbers in context: E4B is a very small model, so tens of tok/s is expected and the 128 GB of unified memory is not the binding constraint for it. The useful signal for readers is directional: on Apple Silicon, MLX 8-bit edges GGUF Q8 for Gemma 4 E4B on both prefill and decode (about 12 percent faster generation and about 19 percent faster prompt processing here). Confidence: single amateur author, 0 comments, placeholder score, and the author states plainly that the MLX-versus-GGUF quant match is "not conclusive" because the 8-bit MLX and Q8 GGUF variants are not a strict one-to-one conversion. (source, July 9, 2026)

CPU-only / offline: a "Local LLM Survival Kit" concept picks Gemma 4 E4B at Q4_K_M (~5 GB) as its low-RAM, no-GPU model. A widely-read concept post sketches a sub-10-dollar 64 GB USB thumb drive that, plugged into any PC or laptop, boots a usable offline knowledge base with no internet: CPU-only llama.cpp binaries for Windows, macOS, and Linux, a compressed SQLite database (a pruned English Wikipedia dump plus freely licensed reference books), and a browser chat frontend with database search. For the model it proposes two tiers: Qwen3.5 35B-A3B at Q4_K_M (~22 GB) for machines with at least 32 GB RAM, and Gemma 4 E4B at Q4_K_M (~5 GB) as the small, low-RAM option. The author estimates 5 to 20 tok/s CPU-only on almost any PC or laptop from the past 15 years, with zero setup and no GPU. For Gemmaclaw this is a clean signal for the CPU-only and edge tier: Gemma 4 E4B at Q4_K_M is now a community default for a fully offline, GPU-free assistant. Confidence: this is a proposal and discussion, not a benchmark. The 5 to 20 tok/s figure is an estimate with no hardware tested, no task-quality evaluation, and 0 comments captured. (source, July 10, 2026)

Reader context: when you already pay for a hosted model, local embeddings and rerankers may beat local LLMs. A Tesla P40 owner who also subscribes to ChatGPT Pro argues that with near-unlimited hosted GPT access through Codex, running a local LLM such as Qwen 3.6 27B or Gemma 4 31B loses much of its practical edge, because the hosted model covers generic generation for free at the margin. What stays genuinely useful locally, the author says, are embedding and reranker models (they used Qwen3 Embedding 4B and Qwen3 Reranker 4B) to power a memory MCP, since hosted APIs still meter those. This is not a Gemma evaluation, Gemma 4 31B is named only as an example of a local LLM the author was losing a reason to run. It is logged here as reader context for the cloud and hybrid section: the case for running Gemma 4 locally is strongest for privacy, offline use, and cost-controlled generation, and weakest when you already pay for a capable hosted model and only need generic quality. Confidence: single-author opinion, no Gemma measurement, 0 comments. (source, July 9, 2026)

Best current setup (this cycle's additions)

Apple Silicon, tiny edge model: Gemma 4 E4B runs at about 85 tok/s decode via MLX 8-bit, slightly faster than about 76 tok/s via GGUF Q8, on a 128 GB M5 Max. MLX edges GGUF on both prefill and decode for E4B (1urjg9o). Amateur single run, quant match "not conclusive."
CPU-only / fully offline: Gemma 4 E4B at Q4_K_M (~5 GB) is the community's default low-RAM, no-GPU pick. One survival-kit concept estimates 5 to 20 tok/s on almost any PC, with no measured numbers yet (1uspcg0).
No change for single-GPU or multi-GPU this cycle: no new Gemma 4 desktop-GPU datapoint was published, so the July 9 guidance holds (31B at 5-bit for general chat on a 32 GB card, the highest quant you can fit for one-shot coding, and the QLoRA fine-tuning footprints of ~14.3 GB for 12B and ~28.6 GB for 26B-A4B).

What works

Gemma 4 E4B on Apple Silicon at about 76 to 85 tok/s decode, via either MLX or GGUF, fast enough for interactive use, with MLX 8-bit slightly ahead of GGUF Q8 on an M5 Max (1urjg9o).
Gemma 4 E4B at Q4_K_M as a compact (~5 GB) CPU-only offline model, considered good enough to anchor a zero-setup USB knowledge-base concept (1uspcg0).

Known limits

No controlled Gemma 4 benchmark this cycle. The one measured datapoint is an amateur single run whose author says the MLX-versus-GGUF quant match is "not conclusive," and every post this sweep carries a placeholder score with no comment threads (Atom-fallback ingest) (1urjg9o).
The M5 Max test only exercises Gemma 4 E4B, the tiny edge variant, so its 128 GB of unified memory is irrelevant to the Gemma numbers and it says nothing about the 12B, 26B, or 31B models on Apple Silicon (1urjg9o).
The CPU-only 5 to 20 tok/s figure is an untested estimate, not a measurement. Real Gemma 4 E4B Q4_K_M throughput and output quality on old or low-RAM hardware are still unverified (1uspcg0).
Little Gemma-specific corroboration is available this cycle: the community's attention was on Qwen 3.6, DeepSeek V4 Flash, Tencent HY3, NVFP4 quantization, and local-serving infrastructure, not Gemma 4 (digest 2026-07-10).

Open questions

Does MLX still beat GGUF for the larger Gemma 4 variants on Apple Silicon? The M5 Max run only measured E4B. Whether the MLX-edges-GGUF result holds for the 12B, 26B, or 31B models, where memory bandwidth matters more, is untested (1urjg9o).
What is Gemma 4 E4B Q4_K_M actually capable of, CPU-only, on old or low-RAM hardware? The survival-kit concept assumes 5 to 20 tok/s and "usable" quality but publishes no measurement and no task evaluation (1uspcg0).
Is a single E4B enough for an offline knowledge kit, or does it need a separate embedding and reranker? One post picks E4B as the sole model for a USB knowledge base, while another argues local embedding and reranker models are the genuinely useful local piece once a hosted LLM is available. A concrete retrieval-quality test would settle whether E4B alone is sufficient (1uspcg0, 1us3li5).

Sources

The Gemma-mentioning posts driving this update (July 11 sweep, newest first). The July 10 ingest fell back to Reddit's Atom feed, so no comment threads were captured and all post scores are placeholders (~20). Treat every item as an uncorroborated single-author anecdote, not a settled result:

Has anyone created a "Local LLM Survival Kit"? (Jul 10, 2026, proposes a sub-10-dollar 64 GB USB drive with CPU-only llama.cpp binaries for Windows, macOS, and Linux, a compressed Wikipedia and reference SQLite database, and a browser chat frontend. Picks Gemma 4 E4B at Q4_K_M (~5 GB) as the low-RAM model and Qwen3.5 35B-A3B at Q4_K_M (~22 GB) for machines with at least 32 GB RAM. Estimates 5 to 20 tok/s CPU-only with no GPU on almost any PC from the past 15 years. Concept and discussion, unmeasured, 0 comments)
Benchmarked the 128 GB M5 Max as an Amateur, Need Feedback (Jul 9, 2026, first-time local-AI benchmarker on a 128 GB M5 Max MacBook Pro. Gemma 4 E4B measured at MLX 8-bit via mlx_lm.generate at 4,748 tok/s prompt processing and 85.0 tok/s generation, and GGUF Q8 via llama-bench at 3,974 tok/s prompt processing and 76.1 tok/s generation. Qwen 3.6 27B ran at roughly 17 tok/s generation for contrast. The author calls the MLX-versus-GGUF quant match "not conclusive" and asks for methodology feedback. 0 comments, placeholder score)
If You Already Pay for an LLM Service, Running Local Embeddings and Rerankers Feels More Useful Than Running Local LLMs (Jul 9, 2026, a Tesla P40 owner and ChatGPT Pro subscriber argues that with near-unlimited hosted GPT through Codex, running local LLMs like Qwen 3.6 27B or Gemma 4 31B loses its practical edge, while local embedding and reranker models such as Qwen3 Embedding 4B and Qwen3 Reranker 4B stay valuable for a memory MCP. Gemma 4 31B is mentioned only as an example local LLM, not evaluated. Reader context, 0 comments)

_Last updated: 2026-07-11 (July 11 sweep). Confidence: low (Atom-fallback ingest, no comment threads, placeholder scores). Key findings: a thin Gemma 4 cycle. The July 10 digest flagged no strong standalone Gemma 4 claim, since community attention was on Qwen 3.6, DeepSeek V4 Flash, Tencent HY3, NVFP4 quantization, and local-serving infrastructure. The one new measured datapoint is an amateur 128 GB M5 Max benchmark putting Gemma 4 E4B at about 85 tok/s decode via MLX 8-bit versus about 76 tok/s via GGUF Q8 (MLX edges GGUF on the tiny edge variant for both prefill and decode), plus a CPU-only "survival kit" concept that picks Gemma 4 E4B Q4_K_M (~5 GB) as its low-RAM offline model at an estimated, unmeasured 5 to 20 tok/s. All anecdotal, single-author, no comment threads. Next update fires when the daily Gemma 4 research cron flags notable new findings._

---

Field Notes — 2026-07-09

A weekly synthesis of what the r/LocalLLaMA community is reporting about Gemma 4 in real use. Curated from the latest Gemma-mentioning posts (4 new posts from the July 8, 2026 sweep, 497 hardware-mention entries total) and their threads. Confidence is low this cycle: the July 8 ingest again fell back to Reddit's Atom feed, so no comment threads were captured and every post score is a placeholder (~20). Treat each item below as a single-author anecdote with no community corroboration yet.

July 9 sweep, 2026-07-09 00:00 UTC: after several cycles focused on Gemma 4's tiny E2B/E4B edge variants, this sweep swings back to the mid-size 26B/31B models on 24–32 GB single-GPU desktops — and the reports pull in opposite directions. On the positive side, a first-time local-LLM buyer with a 32 GB VRAM card says Gemma 4 31B at 5-bit subjectively beats the free ChatGPT model for everyday chat and search. On the negative side, an Opencode power user on a 128 GB box finds Gemma 4 31B and 26B too passive for tool-heavy agentic coding — needing constant "ok, ok, ok" babysitting, with buggy MoE tool calls and blank responses — and keeps returning to Qwen 3.5 122B instead. A quant-comparison run finds Gemma 4 31B degrades faster than its peers as quantization drops ("more lobotomized the lower you go," best at Q8), and a fine-tuning experiment gives a concrete QLoRA VRAM footprint for the 26B-A4B and 12B QAT variants. The consistent signal: Gemma 4's mid-size models are strong general-purpose chat models on a single 24–32 GB GPU, but weak inside multi-tool agent harnesses, and quant choice matters more for Gemma 4 than for some competitors. No controlled benchmarks were published this cycle.

Single-GPU quality: a new 32 GB owner says Gemma 4 31B at 5-bit beats the free ChatGPT model for everyday use. A first-time local-LLM buyer reports picking up a 32 GB VRAM GPU and running Gemma 4 31B at 5-bit, and says it "blows the standard ChatGPT model out of the water" for the everyday chat-and-search use most people put ChatGPT to. For Gemmaclaw this is a clean datapoint for the single-GPU general-purpose tier: at 5-bit, the full 31B fits comfortably in 32 GB and is subjectively competitive with a mainstream hosted assistant for non-coding use. Confidence: purely subjective, single-author, no comments; no throughput, context length, quant recipe (beyond "5 bits"), or latency figures, and the comparison is against an unspecified "free" ChatGPT tier. (source, July 8, 2026)

Agentic coding limit: on a 128 GB Opencode rig, Gemma 4 31B and 26B are "too passive," and their MoE tool calls come back buggy or blank. An Opencode user on a 128 GB machine describes the mid-size models pulling in opposite failure directions: Qwen 3.6 27B/33B are too aggressive (they "do slightly more than asked" and dig themselves into problems on complex multi-tool tasks), while Gemma 4 31B and 26B are the opposite — too passive, so the author has to "sit there babysitting them just saying ok, ok, ok" and they "can't simply get things done." Tool calling on both the Qwen and Gemma MoE models feels buggy, with the author "consistently just getting blank responses." The concrete task was extracting a few specific data fields from ~160 PowerPoints; after a full day of failures with the smaller models, Qwen 3.5 122B completed it in about two hours. The author's takeaway is blunt: the ~30B dense models are "alright but just aren't worth how slow they are," and the same-size MoE models "are just trash." For Gemmaclaw this is the cycle's key Known-limits datapoint — Gemma 4 26B/31B are reported as weak inside a heavy multi-tool agent harness, distinct from their general-chat strength above. Confidence: single-author anecdote, no comments, no chat-template or config detail, one workflow (Opencode) on one machine; the MoE tool-calling complaint is aimed at both Gemma and Qwen. (source, July 9, 2026)

Quant sensitivity: Gemma 4 31B "looks more lobotomized the lower you go" on a canvas-coding prompt. A quant-comparison run (Döner Bench round 2) asks each model, across quants, to write a single self-contained HTML file with a full-page canvas and no libraries that simulates a rotating vertical Döner kebab skewer in front of a gas heating element. Comparing Gemma 4 31B and Qwen 3.6 27B across Q8 / Q4 / IQ2-class quants, the author's observation is that "especially Gemma 4 looks more lobotomized the lower you go," while the others also lost "finesse" at low bit-widths (no turning, simpler fire, IQ2 "mostly all over the place"). Methodology is deliberately informal: each model+quant was run until 9 finished results (looping/timeout runs deleted), the "best" picked subjectively ("yumminess"), and non-rendering outputs were re-prompted with the error — the author states plainly it is "not a scientific benchmark." The useful signal for readers is directional: Gemma 4 31B appears more quant-sensitive than its peers, so aggressive IQ2-class quants hurt it more than they hurt Qwen 3.6 27B, and Q8 is where it looks best. Confidence: subjective single-author pick on one coding prompt, n≈9 per cell, no scoring rubric. (source, July 8, 2026)

Fine-tuning footprint: a QLoRA distillation run pins Gemma 4 26B-A4B at ~28.6 GB (2× RTX 3090) vs 12B at ~14.3 GB (one 3090). A first-time fine-tuner distilled DeepSeek v4 Pro answers (Natural Questions with answers stripped and repopulated — 1000 train + 200 val = 1200 requests, total cost $0.36) into two Gemma 4 QAT variants to compare dense vs MoE training behavior: gemma-4-26B-A4B-it-qat and gemma-4-12B-it-qat, both QLoRA 4-bit with identical hyperparameters, on a rented 2× RTX 3090 + 128 GB RAM Threadripper. The concrete hardware datapoints: the 26B-A4B used both GPUs at ~28.6 GB, the 12B used one GPU at ~14.3 GB — roughly a 2× footprint, consistent with the MoE's larger parameter store. Although the two base models score almost identically on benchmarks, the 26B "has way more internal knowledge," which let it absorb the distillation harder: its train loss bottomed ~4× lower than the 12B's. The author's honest verdict on the result was "not very useful, but I learned a lot." For Gemmaclaw this quantifies the QLoRA training tier: the 12B fits a single 24 GB card for fine-tuning, the 26B-A4B does not. Confidence: single run, single author, train-loss only (no downstream eval of the distilled models), no comments. (source, July 8, 2026)

Reader-question context: the community still lacks a clear map of where Gemma 4 26B/31B sit among the VRAM tiers. A separate discussion asks, conceptually, what hardware the main model size niches (~30B, ~70B, ~120B, ~230B) are meant to fit — pro 8-bit server memory, consumer-GPU VRAM at ~Q4, or a mixture — and drew no answers. It is not a Gemma report, but it is a useful signal for this site: readers running or shopping for 24 / 32 / 64 / 128 GB hardware want a concrete map of which Gemma 4 variant and quant fits which card, and that map is exactly what a curated guide can provide. Logged as an Open-questions driver, not evidence. (source, July 9, 2026)

Best current setup (this cycle's additions)

Single 32 GB GPU, general use: Gemma 4 31B at 5-bit is a credible everyday chat/search model — one new owner rates it above the free ChatGPT tier (1uqz1d4). Subjective, no throughput published.
Single 24 GB GPU, one-shot coding: if you run Gemma 4 31B for coding, prefer the highest quant you can fit (Q8 if it fits, Q4 over IQ2) — 31B loses fidelity faster than peers as quant drops (1uqs7ws).
Fine-tuning (QLoRA 4-bit): Gemma 4 12B fits a single RTX 3090 (~14.3 GB); 26B-A4B needs 2× 3090 (~28.6 GB) (1ur1i1a).
Not recommended this cycle: Gemma 4 26B/31B for heavy multi-tool agentic coding in Opencode — one 128 GB user finds them too passive with buggy tool calls, and reaches for Qwen 3.5 122B instead (1ura4d0).

What works

Gemma 4 31B as a single-GPU general-purpose chat/search model on 32 GB VRAM at 5-bit — subjectively competitive with a mainstream hosted assistant for non-coding use.
Gemma 4 12B / 26B-A4B QLoRA 4-bit fine-tuning on consumer RTX 3090-class hardware, with a clear ~2× VRAM gap between the 12B dense and the 26B MoE.
Higher-quant Gemma 4 31B (Q8) for one-shot coding tasks where output fidelity matters more than speed.

Known limits

Every datapoint this cycle is a single-author, no-comment anecdote with a placeholder score (Atom-fallback ingest); none is a controlled benchmark.
Gemma 4 26B/31B are reported too passive for tool-heavy agentic coding — they need repeated approval and "can't simply get things done," while the same user found Qwen 3.5 122B far more reliable for a ~160-PowerPoint extraction job (1ura4d0).
Gemma 4 MoE (26B-A4B) tool calling is described as buggy with frequent blank responses in Opencode — though the same report says Qwen 3.6 MoE tool calling felt buggy too, so this may be a harness/template issue rather than Gemma-specific (1ura4d0).
Gemma 4 31B is more quant-sensitive than some peers: quality falls off "the lower you go," so aggressive IQ2-class quants noticeably hurt it on a canvas-coding prompt (1uqs7ws).
Fine-tuning the 26B-A4B needs ~2× the VRAM of the 12B (~28.6 vs ~14.3 GB at QLoRA 4-bit), so it will not fit a single 24 GB card for training (1ur1i1a).

Open questions

Where does Gemma 4 26B/31B actually sit in the 24 / 32 / 64 / 128 GB tier map? A community thread this cycle asks exactly how the ~30B/70B/120B/230B size niches map onto VRAM/RAM and quant levels, and it drew no answers — a concise Gemma-4-specific sizing guide (which quant fits which card, at what context) is still missing (1uramdp).
Is the "too passive / babysitting" behavior a template/harness issue or the model itself? The negative agentic-coding report is one Opencode user on one 128 GB box with no chat-template or config detail; a controlled comparison using a known-good Gemma 4 tool-calling template would show whether this is fixable configuration or a real capability gap (1ura4d0).
Does the distilled 26B-A4B actually improve downstream, or just on train loss? The QLoRA run reports a ~4× lower train loss for the 26B vs the 12B but concludes the result was "not very useful" — a task-level eval (not just train loss) would tell whether Gemma 4 26B-A4B is worth distilling into (1ur1i1a).
Does Gemma 4 31B's quant sensitivity hold up in a scored test? The "more lobotomized the lower you go" observation is a subjective n≈9 pick on one HTML-canvas prompt; a scored multi-task quant sweep (Q8 vs Q4 vs IQ2) would confirm whether 31B really degrades faster than Qwen 3.6 27B (1uqs7ws).

Sources

The Gemma-mentioning posts driving this update (July 9 sweep, newest first). The July 8 ingest fell back to Reddit's Atom feed, so no comment threads were captured and all post scores are placeholders (~20) — treat every item as an uncorroborated single-author anecdote, not a settled result:

Qwen3.5 122B is the best? (Jul 9, 2026 — Opencode on a 128 GB system; Gemma 4 31B and 26B "literally the opposite" of the too-aggressive Qwen 3.6 27B/33B — too passive, need "ok, ok, ok" babysitting, "can't simply get things done"; tool calling on both Qwen and Gemma MoE models "feels buggy," consistently blank responses; author extracted fields from ~160 PowerPoints and only Qwen 3.5 122B finished it (~2 hours); concludes ~30B dense models "aren't worth how slow they are" and same-size MoE models "are just trash")
Recently bought a 32 GB VRAM GPU — Gemma 4 31B at 5-bit beats the free ChatGPT model (Jul 8, 2026 — first-time local-LLM buyer; 32 GB VRAM; Gemma 4 31B at 5-bit subjectively "blows the standard ChatGPT model out of the water" for everyday chat/search; no throughput, context, or latency figures; purely subjective quality claim against an unspecified free ChatGPT tier)
Döner Bench round 2: Quant compare (Jul 8, 2026 — single-HTML-canvas rotating-kebab coding prompt across quants; Gemma 4 31B and Qwen 3.6 27B at Q8/Q4/IQ2-class; "especially Gemma 4 looks more lobotomized the lower you go," others also lost finesse; ran each model+quant to 9 finished runs, best picked subjectively ("yumminess"), non-rendering results re-prompted; explicitly "not a scientific benchmark")
Distilled DeepSeek into Gemma 4 26B-A4B vs 12B. Not very useful, but I learned a lot. (Jul 8, 2026 — QLoRA 4-bit distillation of DeepSeek v4 Pro answers (Natural Questions, 1200 QA pairs, $0.36) into gemma-4-26B-A4B-it-qat and gemma-4-12B-it-qat on a rented 2× RTX 3090 + 128 GB Threadripper; identical hyperparams; 26B used both GPUs at ~28.6 GB vs 12B on one GPU at ~14.3 GB (~2× MoE footprint); near-identical base benchmark scores but 26B has more internal knowledge and bottomed ~4× lower train loss; overall "not very useful")

And, as reader-question context (not a Gemma report, no answers captured):

What hardware setups are the main model size niches (~30B, ~70B, ~120B, ~230B) meant to fit? (Jul 9, 2026 — asks how the popular size niches map onto pro/consumer VRAM and RAM at various quant levels; relevant because it shows readers still lack a clear map of where Gemma 4 26B/31B fits among the 24/32/64/128 GB tiers; 0 comments)

Last updated: 2026-07-09 (July 9 sweep). Confidence: low (Atom-fallback ingest, no comment threads, placeholder scores). Key findings: the mid-size Gemma 4 26B/31B models on single 24–32 GB GPUs pull in two directions — a 32 GB owner rates Gemma 4 31B at 5-bit above the free ChatGPT tier for everyday use, while a 128 GB Opencode user finds 31B/26B too passive for tool-heavy agentic coding (buggy MoE tool calls, blank responses) and prefers Qwen 3.5 122B. Gemma 4 31B is reported more quant-sensitive than peers ("more lobotomized the lower you go," best at Q8), and a QLoRA run pins the fine-tuning footprint at ~28.6 GB for 26B-A4B (2× RTX 3090) vs ~14.3 GB for 12B (one 3090). All anecdotal, single-author, no comment threads. Next update fires when the daily Gemma 4 research cron flags notable new findings.

---

Field Notes — 2026-07-08

A weekly synthesis of what the r/LocalLLaMA community is reporting about Gemma 4 in real use. Curated from the latest Gemma-mentioning posts (10 new posts from the July 7, 2026 sweep, 493 hardware-mention entries total) and their threads. Confidence is low-to-medium this cycle: the July 7 ingest again fell back to Reddit's Atom feed, so no comment threads were captured and every post score is a placeholder (~20). Treat each item below as a single-author anecdote with no community corroboration yet. Where an author published methodology or repro scripts, that is called out per item.

July 8 sweep, 2026-07-08 00:00 UTC: the clearest theme of the cycle is Gemma 4's smallest variants earning their keep on edge, low-VRAM, mobile, and browser hardware — and a parallel wave of runtime work (speculative decoding and CPU decode) that speeds Gemma 4 up without new silicon. The standout hardware datapoint is one Gemma 4 E2B doing vision, audio, and RAG at once on a 4 GB GTX 1650. E4B shows up inside two shipping products (a cross-platform text-transform app and an on-device mobile STT/TTS app), and Gemma 4 12B runs fully in a browser with text, image, and audio input. On the runtime side, a Mac MLX port of DeepSeek's DSpark drafter gives Gemma 4 12B a lossless ~1.4–1.6× (up to ~2× on code/math) speedup on an M4 Pro, mistral.rs claims up to 1.8× faster CPU decode than llama.cpp, and DFlash speculative decoding has now merged into llama.cpp (the same author previously measured 3.34× MTP on Gemma 4). Two useful caveats round out the sweep: a Jacobian-Lens experiment builds a working hallucination detector for Gemma 4 E4B, and one coding-harness author reports Gemma 4 simply does not work well in his setup. No controlled benchmarks were published this cycle.

Edge / low-VRAM headline: one Gemma 4 E2B does vision, audio, and RAG simultaneously on a 4 GB GTX 1650, kept real-time. A developer runs a single `gemma-4 E2B` through `llama-server` as the only model behind a local tool that watches the screen and lets the user search and chat over that history later. The one model covers all three jobs: it reads the screen and turns it into structured info (which app, what the user is doing, rough layout); it handles audio — voice memos plus meeting transcription — using E2B's built-in audio encoder, so no separate Whisper is bolted on; and it does the chat/RAG over accumulated history plus daily summaries. Because everything shares one GPU, the author built it on a 4 GB GTX 1650 and optimized aggressively to keep it a background service: `llama-server` runs with `--parallel 1` (a single slot — on 4 GB the author would rather have one good response than two slow ones), and an incoming chat message pre-empts an in-flight screen analysis by closing the HTTP connection, which makes `llama-server` drop the slot in under a second before answering. The killed analysis is re-queued. For Gemmaclaw this is the most useful edge datapoint of the cycle: E2B's unified multimodal design lets a genuinely tiny 4 GB card cover screen understanding, transcription, and retrieval from one model, provided you serialize the work. Confidence: single-author anecdote, no comments, no measured tokens-per-second or latency figures; the "real time" claim is subjective and hardware-specific. (source, July 7, 2026)

E4B in shipping products: system-wide text transforms on desktop and on-device STT/TTS on mobile. Two independent developers report Gemma 4 E4B as the local model behind a real product this cycle. Rewire Text is a Windows + macOS menu-bar/tray app that transforms text in any app from a hotkey; deterministic transforms (case, whitespace, Markdown, encoding) run locally with no model, while AI transforms (style/tone rewrites, proofreading, summarization, translation) use either a remote BYOK API or a local model served through LM Studio, Ollama, or llama.cpp — the developer reports "good success with Gemma 4 E4B" and frames local models as the obvious privacy-preserving choice (1uqbfun). Separately, Off Grid AI Mobile — an on-device privacy-first app — added text-to-speech to its existing text/image/transcription features and reports completely offline, real-time STT + TTS with reasoning using Gemma 4 E4B (1uq4q9e). Both are developer self-reports for paid products, so treat them as adoption signals rather than benchmarks: they show E4B is now considered good enough to embed in shipping consumer software, but neither discloses hardware, quantization, throughput, or latency. Confidence: promotional single-author posts, no measurements, no independent verification.

Browser tier: Gemma 4 runs fully local in a browser with text, image, and audio input. A developer building a browser-model playground reports that Gemma 4 works in-browser with text, image, and audio input — "I did not expect that" — alongside transcription and speech use cases, via a demo site (`browserlab.missionsquad.ai`) and an open-source SDK (`github.com/MissionSquad/BrowserAI`) for embedding local browser models in other projects (1upp3pv). This continues a multi-sweep thread of WebGPU/browser Gemma 4 reports (the May Transformers.js + Reachy Mini demo and the unverified 255 tok/s WebGPU claim from the July 3 sweep). It is a capability confirmation, not a performance report: no throughput, model variant, quantization, or browser/GPU details are given. Confidence: developer demo, no measured performance, no independent replication.

Apple Silicon speculative decoding: a Mac MLX port of DSpark gives Gemma 4 12B a lossless ~1.4–1.6× (up to ~2× on code/math). A developer ported DeepSeek's DSpark speculative-decoding drafter (from the DeepSpec repo) to native MLX because no Mac port existed. The key property is that it is lossless: DSpark is an EAGLE-style drafter, so the target model still verifies every drafted token and the output is identical to normal decoding (byte-for-byte for greedy up to floating-point ties; a verified exact sample in temperature mode). It works today on Qwen3 4B/8B/14B and Gemma 4 12B. Measured on an M4 Pro, warm, against 8-bit instruct targets using the official `mlx_lm`/`mlx_vlm` tools as the baseline, the author reports roughly 1.4–1.6× single-user, up to ~2× on code/math with Gemma — and is candid that this is below the 2–4× often quoted for speculative decoding. For Apple Silicon Gemma 4 12B users this is a rare lossless speedup with disclosed methodology and a runnable OpenAI-compatible server. Confidence: author-measured with a stated baseline and repro tool, but single-run single-machine, no variance reported, and no comment corroboration. (source, July 7, 2026)

CPU-only runtime: mistral.rs v0.9.0 claims up to 1.8× faster CPU decode than llama.cpp on x86 and ARM. The mistral.rs author released v0.9.0 with granular CPU optimizations and reports that on Qwen3 4B Q4_K, mistral.rs decodes faster than llama.cpp at every context depth measured, on both x86 (Sapphire Rapids) and ARM (GB10) — up to 1.8×. The post states the optimizations are general (AVX2/AVX512 on x86, NEON on ARM) and that the engine runs Gemma 4 among other models; full methodology, tables, and repro scripts are linked in the release report. The headline number is a Qwen measurement, not a Gemma one, so the Gemma-specific gain is unverified — but a faster CPU decode path is directly relevant to the CPU-only and low-power Gemma 4 tier that recurs in these sweeps (the i5-6500 and N100 reports from the July 4 cycle). Confidence: vendor benchmark from the engine's own author with published methodology and repro; independent numbers and a Gemma-specific measurement are still needed. (source, July 7, 2026)

llama.cpp gains DFlash speculative decoding — and the same author's earlier MTP run hit 3.34× on Gemma 4. A practitioner reports that DFlash — speculative decoding with a block-diffusion drafter from z-lab that fills a block of up to 15 tokens per pass — has now merged into llama.cpp (PR #22105), shipped with a one-click Docker-compose llama-server setup. Their new run measures 4.44× at 36K context on Qwen 3.6 27B on an RTX 6000 PRO (NVIDIA aiperf synthetic sweeps, greedy, concurrency 1). The Gemma 4 relevance is by lineage: this is the same author whose prior MTP benchmark reached 3.34× on Gemma 4, and DFlash is now a merged, drop-in alternative drafter in the same runtime. No Gemma 4 DFlash number was published yet, so the direct Gemma speedup is an open question, but the tooling is now upstream. Confidence: rigorous disclosed methodology for the Qwen result; the Gemma 4 figure quoted here is the author's earlier MTP measurement, not a new DFlash-on-Gemma benchmark. (source, July 7, 2026)

Reliability tooling: a Jacobian-Lens experiment builds a working "about-to-hallucinate" detector for Gemma 4 E4B. Prompted by Anthropic's Global Workspace / Jacobian Lens paper, a community member fit interpretability "lenses" for Gemma 4 E4B, Gemma 4 12B, Gemma 4 12B abliterated, Gemma 4 26B MoE, and Qwen 3.6 27B, then turned it into a practical question: can you tell when a small local model is about to confidently guess? The observation: when the model knows the answer the internal "workspace" looks calm (one candidate wins early, layers agree); when it is about to confidently BS, competing candidates survive into the deep layers before a fluent answer is picked. Tested on 500 TriviaQA questions per model, on Gemma 4 E4B confident answers with a clean workspace were 77% correct versus 42% correct for a noisy workspace — and a tiny logistic-regression router on top of that signal makes the distinction usable. Repo, demo, and HF lenses are published. This is a genuinely useful Known-limits datapoint: it quantifies how often confident Gemma 4 E4B answers are wrong and offers a lightweight way to flag the risky ones. Confidence: single-author research anecdote with published code and a concrete metric, but a custom method on one QA set, not independently reproduced. (source, July 7, 2026)

Harness caveat: one coding/computer-use harness author reports Gemma 4 "does not work well" in his setup. The author of Koder, a local browser-UI coding and computer-use agent harness, released it publicly and is explicit that it is tuned for his specific scenario — Linux, llama.cpp, Qwen 3.6 27B Q8 — where he calls it "rock solid," while noting plainly that "for me Gemma 4 does not work well with this." No configuration detail, chat template, or failure mode is given for the Gemma 4 case. It is a single negative anecdote, not a controlled comparison, but it is a useful counterweight to the positive small-model reports above: agentic coding/computer-use harnesses remain harness-and-template sensitive for Gemma 4, and a harness tuned around Qwen may not transfer. Confidence: offhand single-author remark, no repro, no error detail. (source, July 7, 2026)

Reference: the Gemma 4 Technical Report was posted. A link to the Gemma 4 Technical Report (arXiv 2607.02770) surfaced in the sweep — logged here as a primary-source reference for readers who want architecture and training details behind the community reports above. No community analysis was attached to the post. (source, July 7, 2026)

Best current setup (this cycle's additions)

Tiny / 4 GB GPU: a single Gemma 4 E2B on `llama-server` with `--parallel 1` can cover vision + audio + RAG at once if you serialize requests — demonstrated on a 4 GB GTX 1650 (1upx3gm). Anecdotal, no throughput published.
Apple Silicon (M4 Pro), Gemma 4 12B: add the mlx-dspark drafter for a lossless ~1.4–1.6× (up to ~2× code/math) at 8-bit (1upxtf3).
CPU-only: mistral.rs v0.9.0 is worth trying as a faster-decode alternative to llama.cpp on x86/ARM (Gemma-specific gain unverified) (1upynpt).
Embedded in apps / mobile: Gemma 4 E4B is the variant developers are shipping for text transforms and on-device STT/TTS (1uqbfun, 1uq4q9e).

What works

Gemma 4 E2B as a one-model multimodal stack (screen vision + audio transcription + RAG) on a 4 GB GPU, using its native audio encoder instead of a separate Whisper.
Gemma 4 E4B as an embeddable local model for desktop text transforms (via LM Studio / Ollama / llama.cpp) and offline mobile STT/TTS.
Gemma 4 12B fully in-browser with text, image, and audio input; and a lossless MLX speculative-decode speedup on M4 Pro.
Speculative-decoding and CPU-decode runtimes are advancing fast: DFlash merged into llama.cpp, mistral.rs claims up to 1.8× faster CPU decode.

Known limits

Every number this cycle is a single-author, single-run anecdote with placeholder scores and no comment corroboration (Atom-fallback ingest).
On a 4 GB card the multimodal E2B stack must run one request at a time (`--parallel 1`); concurrent work is not viable at that VRAM.
Confident Gemma 4 E4B answers are wrong a meaningful fraction of the time — measured at 42% correct when the internal "workspace" is noisy versus 77% when clean on 500 TriviaQA (1upy31x).
Agentic coding/computer-use harnesses are still template-sensitive: at least one author finds Gemma 4 "does not work well" in a harness tuned for Qwen 3.6 27B (1upqbqz).
The mistral.rs 1.8× and DFlash 4.44× headline numbers were measured on Qwen models, not Gemma 4; the Gemma-specific gains are not yet published.

Open questions

What tokens-per-second does the one-model Gemma 4 E2B multimodal stack actually sustain on a 4 GB GTX 1650? The report proves the workload fits and stays interactive but publishes no throughput or latency numbers; a measured screen-analysis and chat latency curve would turn a promising build into an evaluable low-VRAM recommendation.
What DFlash speedup does Gemma 4 (12B or 31B) get in llama.cpp now that PR #22105 has merged? The 4.44× figure is Qwen 3.6 27B; the author's Gemma number (3.34×) predates DFlash and used MTP. A direct DFlash-on-Gemma-4 benchmark on the same rig would settle whether Gemma matches, beats, or trails Qwen on the new drafter.
Does mistral.rs's up-to-1.8× CPU decode advantage hold for Gemma 4 specifically? The published sweep is Qwen3 4B Q4_K; a Gemma 4 E2B/E4B/12B CPU decode comparison against llama.cpp on the same x86 and ARM hardware would make this directly citable for the CPU-only and low-power Gemma tier.
How well does the Jacobian-Lens hallucination router generalize beyond TriviaQA and E4B? The 77%/42% split is compelling on one QA set; whether the clean-vs-noisy-workspace signal transfers to coding, RAG, or the larger Gemma 4 12B/26B variants would determine if this is a practical reliability tool or a single-benchmark curiosity.

Sources

The Gemma-mentioning posts driving this update (July 8 sweep, newest first). The July 7 ingest fell back to Reddit's Atom feed, so no comment threads were captured and all post scores are placeholders (~20) — treat every item as an uncorroborated single-author anecdote, not a settled result:

Running a vision + audio + reasoning on one Gemma 4 E2B locally on 4 GB VRAM — and keeping it real time (Jul 7, 2026 — one `gemma-4 E2B` via `llama-server`, only model; screen vision + voice-memo/meeting transcription via E2B's audio encoder (no Whisper) + chat/RAG over history; 4 GB GTX 1650; `--parallel 1` single slot; chat pre-empts screen analysis by dropping the slot in <1s; no measured TPS)
Rewire Text — system-wide text transforms on Windows and macOS with local AI support (Jul 7, 2026 — menu-bar text-transform app; deterministic transforms local, AI transforms via BYOK API or local model through LM Studio / Ollama / llama.cpp; developer reports "good success with Gemma 4 E4B"; $29 product; no hardware or speed data)
Completely on-device offline real time STT and TTS with reasoning using Gemma 4 E4B (Jul 7, 2026 — Off Grid AI Mobile; added TTS to text/image/transcription; offline real-time STT+TTS with Gemma 4 E4B on-device; promotional; no hardware/latency figures)
I made a fun site to test out some small models in the browser (Jul 7, 2026 — Gemma 4 runs in-browser with text, image, and audio input; demo `browserlab.missionsquad.ai`; open-source `MissionSquad/BrowserAI` SDK; capability confirmation, no throughput/variant/quant details)
mlx-dspark: DeepSeek's DSpark drafter running lossless on a Mac (native MLX, ~1.6×, OpenAI server + benchmarks) (Jul 7, 2026 — MLX port of DSpark EAGLE-style drafter; lossless (target verifies every token); works on Qwen3 4B/8B/14B and Gemma 4 12B; M4 Pro, warm, 8-bit instruct targets vs official mlx_lm/mlx_vlm; ~1.4–1.6× single-user, up to ~2× on code/math with Gemma)
mistral.rs v0.9.0: up to 1.8x faster CPU decode than llama.cpp on x86 and ARM (Jul 7, 2026 — Qwen3 4B Q4_K decodes faster than llama.cpp at every measured context depth on x86 Sapphire Rapids and ARM GB10; AVX2/AVX512/NEON; runs Gemma 4; methodology + repro in release report; Gemma-specific number not published)
I tested freshly merged DFlash in llama.cpp on Qwen 3.6 27B ... 4.44x faster at 36K context ... RTX 6000 PRO (Jul 7, 2026 — DFlash (block-diffusion drafter from z-lab, up to 15 tokens/pass) merged into llama.cpp PR #22105; 4.44× at 36K on Qwen 3.6 27B, RTX 6000 PRO, aiperf greedy concurrency 1; same author's earlier MTP run = 3.34× on Gemma 4; no DFlash-on-Gemma number yet)
I tested Anthropic's new Jacobian Lens on open models, then it turned into a local-model hallucination router (Jul 7, 2026 — lenses fit for Gemma 4 E4B/12B/12B-abliterated/26B-MoE + Qwen 3.6 27B; clean vs noisy "workspace" predicts correctness; 500 TriviaQA/model; Gemma 4 E4B clean-workspace 77% correct vs noisy 42%; logistic-regression router; repo/demo/HF published)
Koder: browser UI based harness for coding and computer use (Jul 7, 2026 — local coding/computer-use agent harness; tuned for Linux + llama.cpp + Qwen 3.6 27B Q8 ("rock solid"); author notes "for me Gemma 4 does not work well with this"; no detail on the Gemma failure)
Gemma 4 Technical Report (Jul 7, 2026 — link to the Gemma 4 Technical Report, arXiv 2607.02770; primary-source reference, no community analysis attached)

Last updated: 2026-07-08 (July 8 sweep). Confidence: low-to-medium (Atom-fallback ingest, no comment threads, placeholder scores). Key findings: Gemma 4's small variants are proving out on edge/low-VRAM/mobile/browser hardware — one E2B does vision+audio+RAG on a 4 GB GTX 1650 with `--parallel 1`; E4B is shipping inside desktop text-transform and mobile STT/TTS apps; 12B runs fully in-browser with text/image/audio. Runtime gains: an MLX port of DSpark gives Gemma 4 12B a lossless ~1.4–1.6× (up to ~2× code/math) on M4 Pro, mistral.rs v0.9.0 claims up to 1.8× faster CPU decode than llama.cpp (Qwen-measured), and DFlash merged into llama.cpp (PR #22105; author's prior Gemma MTP = 3.34×). Caveats: a Jacobian-Lens router flags Gemma 4 E4B hallucinations (clean 77% vs noisy 42% correct on TriviaQA), and one harness author finds Gemma 4 "does not work well" in a Qwen-tuned coding/computer-use setup. Next update fires when the daily Gemma 4 research cron flags notable new findings.

---

Field Notes — 2026-07-07

A weekly synthesis of what the r/LocalLLaMA community is reporting about Gemma 4 in real use. Curated from the latest Gemma-mentioning posts (5 new posts from the July 6, 2026 sweep, 483 hardware-mention entries total) and their threads. Confidence is low-to-medium this cycle: the July 6 ingest was forced onto the Reddit Atom fallback because the JSON API was blocked, so no comment threads were captured and every post score is a placeholder. Treat each item below as a single-author anecdote with no community corroboration yet.

July 7 sweep, 2026-07-07 00:00 UTC: a cycle with no benchmark data and a clear practitioner theme — deployment and orchestration questions outnumber results. Two of the five Gemma-mentioning posts are Apple Silicon reports, and both circle the same wall: on unified-memory Macs the limiting factor for Gemma 4 (and every other local model tested) is context length, not model size. The most useful signal is a hands-on account of a map-reduce agent pattern used to work around that wall on an M5 128 GB machine. Two more posts extend Gemma 4 into agentic desktop tooling — AnythingLLM's "OpenComputer" drives an observable, isolated VM with a Gemma 4 12B QAT model on an M4 Pro — and into casual one-shot coding, where Gemma 4 12B at Q8_0 produced a working (if rough) WebGL bowling simulator through the opencode harness. A laptop-shopping question about the Framework 13 Pro and a low-content sentiment post about Qwen-vs-Gemma benchmark stagnation round out the sweep. No new speeds, quant comparisons, or hardware benchmarks were published this cycle.

Apple Silicon unified memory: context length, not parameter count, is the real Gemma 4 bottleneck — and map-reduce is the community's workaround. A practitioner running local models on a MacBook Pro M5 with 128 GB of unified memory reports that the binding constraint is context size, not which model is loaded. Across `qwen3.6`, DeepSeek V4 Flash, and `gemma4` variants, inference "slows to a crawl" once a conversation grows long — the author puts the practical bottleneck at around 16k tokens, which is already near the default working context for a heavier agent harness. Their response is an explicitly stateless design: chop every task into small pieces, spin up a fresh short-context session per piece, and pass only the summarized output forward to the next step — a map-reduce shape where many small parallel workers each do one tiny extraction and an aggregator sees only the short summaries. The concrete use case given is an overnight multi-source scrape feeding a morning dashboard, which the author says is "unusable locally" with the naive single-growing-context approach. The open frustration is tooling: the poster notes that CrewAI, AutoGen, and default LangChain all drag the full history along, the opposite of what a tiny-context-per-call pattern needs. For Gemmaclaw readers this is the most actionable item of the cycle because it reframes the Apple Silicon buying question: 128 GB of unified memory does not buy you long-context comfort, so architecting around short contexts matters more than raising the memory ceiling. Confidence: single-author anecdote, no comment corroboration, no per-context TPS numbers; the ~16k figure is a subjective "feels slow" threshold, not a measured latency curve. (source, July 6, 2026)

Agentic desktop tooling: AnythingLLM's OpenComputer runs a Gemma 4 12B QAT model as the local brain of an observable, isolated agent VM. Tim from AnythingLLM (u/tcarambat) previewed "OpenComputer," an experiment in agent UX for non-technical users: an agent that owns an entire isolated virtual machine — able to install apps and manipulate the UI when CLI or API calls fall short — while the human can actually watch what it does rather than staring at opaque terminal output. The demo runs inference locally on an M4 Pro through LM Studio, using a Gemma 4 12B QAT model (the post labels it "Gemma 4 13B QAT"; Gemma 4's small dense model is 12B, so this is the 12B-class QAT build). The framing positions OpenComputer against the wave of "agent container" approaches — Apple Containers, Microsoft MXC, Docker Sandboxes — which the author argues wrap the agent in a micro-VM but leave the user with nothing observable to supervise. The Gemmaclaw-relevant signal is placement, not performance: a shipping on-device agent product is choosing a Gemma 4 12B QAT model, served locally via LM Studio on Apple Silicon, as the driver for a full desktop-automation loop. No latency, token-throughput, task-success, or tool-call-reliability figures were published. Confidence: vendor demo from an established local-AI product; model choice and stack disclosed; no measured agent performance and no independent replication. (source, July 6, 2026)

One-shot coding: Gemma 4 12B at Q8_0 built a rough-but-functional WebGL 3D bowling simulator through opencode. A user asked Gemma 4 12B — running at near-lossless Q8_0 with no KV-cache quantization — to write a single-file 3D bowling simulator in WebGL, using opencode as the agent harness. The result was a one-shot pass after a brief planning session; the model made a couple of tool-call errors but corrected itself quickly. The author, who notes upfront that 12B "isn't really recommended for coding," describes the output as "terrible, but honestly better than I expected" and says the model "surpassed my expectations." No hardware, GPU, RAM, inference backend, generation speed, or context length is disclosed — only the Q8_0 quantization and the no-cache-quant detail. What this adds to the picture: for casual, self-contained generative-coding tasks, Gemma 4 12B at a high-fidelity quant can produce runnable output and recover from its own tool-call mistakes inside an agent loop, even though it is not a first-choice coding model. Confidence: subjective single-user impression with no methodology, no artifact quality rubric, and no hardware or speed data; a directional capability anecdote, not a coding benchmark. (source, July 6, 2026)

Laptop tier: an open Framework 13 Pro question about Gemma model performance — no answers captured. A community member weighing a Framework 13 Pro (Intel "X7" chip, 32 or 64 GB LPCAMM2 memory, PCIe 5 SSD) asks whether it can run smaller dense models like Qwen 9B/14B or "similar Gemma models," plus MoE models, at usable or agentic speeds. The motivation is a fallback: they already run a separate 256 GB unified-memory LLM host, but occasional home power outages cut off private model access while away, so they want a portable machine that can carry smaller models on its own. No benchmarks, tokens-per-second figures, or answers were captured for this specific configuration. The post is worth logging as a laptop-tier deployment signal: it reflects real demand for running Gemma-class small models on thin-and-light x86 laptops as an always-available backup to a bigger home server, a niche distinct from both dedicated GPU rigs and Apple Silicon. Confidence: unanswered community question, no data; treat as a watch item for the Framework 13 Pro / Intel LPCAMM2 laptop tier. (source, July 6, 2026)

Community sentiment: a low-content "Qwen & Gemma benchmark deadlock" post, flagged for transparency only. A short post argues, without data, that Qwen and Gemma benchmark numbers feel stuck in a "deadlock," with the author citing a general feeling and similar sentiment seen in online chatter. No benchmarks, model versions, tasks, or measurements are attached. It is included here only because it surfaced in the Gemma-mention filter; it carries no evidentiary weight and should not influence any hardware or model recommendation. Confidence: opinion post, no data. (source, July 6, 2026)

Open questions

Which agent-orchestration framework best supports a stateless, tiny-context-per-call pattern for Gemma 4 on Apple Silicon? The M5 128 GB report identifies a real gap: CrewAI, AutoGen, and default LangChain all carry full history forward, which is exactly wrong for the map-reduce workaround that keeps local long-context inference fast. A documented framework (or config) that keeps each worker's context small would directly help every unified-memory Gemma 4 user hitting the ~16k slowdown wall.
What throughput does a Gemma 4 12B QAT model actually sustain inside an agentic desktop loop like OpenComputer on an M4 Pro? The AnythingLLM demo confirms the stack but publishes no numbers. Tokens-per-second, task-completion rate, and tool-call reliability figures for Gemma 4 12B QAT driving real UI-automation tasks via LM Studio would turn a promising demo into an evaluable recommendation.
Can a Framework 13 Pro (Intel X7, LPCAMM2) run Gemma 4 small dense and MoE models at agentic speeds? No benchmarks exist yet for this thin-laptop tier. Real llama.cpp or LM Studio numbers from Framework 13 Pro owners would close a genuine deployment question for users who want a portable Gemma fallback to a larger home host.

Sources

The Gemma-mentioning posts driving this update (July 7 sweep, newest first). The July 6 ingest fell back to Reddit's Atom feed because the JSON API was blocked, so no comment threads were captured and all post scores are placeholders (~20) — treat every item as an uncorroborated single-author anecdote, not a settled result:

Local models + big context = slow. How are you orchestrating "map-reduce" style agent workflows? (Jul 6, 2026 — MacBook Pro M5 128 GB; context size, not model size, is the bottleneck; ~16k tokens already slow; qwen3.6 / DeepSeek V4 Flash / gemma4 all affected; map-reduce workaround of fresh short-context workers + summary passing; CrewAI/AutoGen/LangChain drag full history; no measured TPS)
OpenComputer | An Open Source Computer Built For Agents (Jul 6, 2026 — AnythingLLM experiment; observable isolated agent VM; local inference on M4 Pro via LM Studio with a Gemma 4 12B QAT model, post labels it "13B QAT"; contrasted with Apple Containers / Microsoft MXC / Docker Sandboxes; no performance figures)
I told Gemma 4 12B (Q8_0, no cache quant) to write a single-file 3D bowling simulator in WebGL (Jul 6, 2026 — Gemma 4 12B Q8_0, no KV-cache quant, opencode harness; one-shot after a plan session; a couple tool-call errors, self-corrected; "terrible but better than expected"; no hardware or speed disclosed; anecdotal)
Framework Pro 13 for running smaller models of MoE (Jul 6, 2026 — Framework 13 Pro, Intel X7, 32/64 GB LPCAMM2, PCIe 5 SSD; open question about Qwen 9B/14B and similar Gemma small/MoE models; wanted as portable fallback to a 256 GB unified-memory host; no answers captured)
Qwen & Gemma on deadlock situation (For Benchmarks Numbers)? (Jul 6, 2026 — low-content sentiment post about Qwen/Gemma benchmark stagnation; no data; included for transparency only)

_Last updated: 2026-07-07 (July 7 sweep). Confidence: low-to-medium (Atom-fallback ingest, no comment threads, placeholder scores). Key findings: on Apple Silicon unified memory the Gemma 4 bottleneck is context length (~16k slowdown on M5 128 GB), and a stateless map-reduce pattern is the community workaround; AnythingLLM's OpenComputer drives an observable agent VM with a Gemma 4 12B QAT model on an M4 Pro via LM Studio; Gemma 4 12B at Q8_0 one-shot a rough WebGL bowling simulator through opencode despite not being a recommended coding model; an open Framework 13 Pro laptop-tier question and a no-data Qwen/Gemma benchmark-sentiment post round out a benchmark-free cycle. Next update fires when the daily Gemma 4 research cron flags notable new findings._

---

Field Notes — 2026-07-05

A weekly synthesis of what the r/LocalLLaMA community is reporting about Gemma 4 in real use. Curated from the latest Gemma-mentioning posts (6 new posts from the July 4, 2026 sweep, 478 hardware-mention entries total) and their top comment threads. Confidence is medium unless noted, since this is community signal rather than a controlled benchmark.

July 5 sweep, 2026-07-05 00:00 UTC: a notably strong cycle for Gemma 4 hardware and implementation signals. The headline pair is a live MLX kernel project targeting Gemma 4 12B on an M5 MacBook Pro — the author cites 20–30 tok/s as the theoretical ceiling on favorable MTP workloads given the memory bandwidth class, with NVIDIA optimization planned as follow-on work — and a Mac M2 Max 64GB audio-input benchmark showing 16.8 tok/s first-inference throughput and 26 tok/s decode-alone for a Tauri 2 native app using Rust FFI into llama.cpp with Unsloth's `gemma-4-12b-it-Q5_K_S`. The most immediately actionable operational note is a PSA about RYS-style layer upscaling: duplicating Gemma 4 layers without scaling `layer_scalar` by `s^(1/N)` breaks the model, where `s` is the original scalar and `N` is the total layer occurrences (duplications plus the original). On the long-context front, one practitioner reports Gemma 4 31B Q6_K running at 80K context on an RTX 5090 via llama.cpp Docker using `GGML_CUDA_NO_PINNED=1`, `--backend-sampling --parallel 1`, and `--no-mmap`. A community-authored RP/agentic benchmark across 8 models places Gemma 4 31B first at 87% overall pass rate and Gemma 4 12B third at 80%. An AMD hardware note rounds the cycle: a practitioner on a Ryzen 7900X with a 9070 XT running Gemma 4 26B A4B Q5 is seeking a 32GB secondary card to free the 9070 XT for gaming, adding another data point to the AMD consumer-GPU deployment picture.

Gemma 4 12B MLX kernel on M5 MacBook Pro: 20–30 tok/s theoretical ceiling on favorable MTP workloads. A community developer opened up an MLX Gemma 4 12B kernel project being developed on an M5 MacBook Pro with 16 GB unified memory. The stated goal is validating MTP throughput against native graph execution to understand how much headroom the approach actually yields at this memory-bandwidth class. The author's estimate is 20–30 tok/s as the ceiling on a good MTP workload given the bandwidth of 16 GB devices, and notes that an attempt to integrate DSpark's drafter was blocked because the drafter model and weights consume too much RAM at the 16 GB threshold — a concrete demonstration of how constrained the 16 GB tier is for draft-based decode acceleration. The project is experimental and explicitly not intended for production use. The author plans to use the MLX work as a launch point for further optimization on NVIDIA hardware. For Gemmaclaw users, this is useful as an early-stage signal: the 20–30 tok/s ceiling figure is the first community-sourced theoretical bandwidth-bound estimate for Gemma 4 12B MTP on M5-class hardware, and any reproductions or community corrections to that estimate in the post's comments would sharpen the picture. Confidence: author-stated estimate, no measured MTP benchmark published yet; experimental work in progress. (source, July 4, 2026)

Gemma 4 12B audio input: 16.8 tok/s first-inference, 26 tok/s decode-alone on Mac M2 Max 64GB via Rust FFI and llama.cpp Metal. A developer building a Gemma 4 12B Tauri 2 desktop app published a detailed first-inference benchmark for the audio-input path. The setup: native Rust FFI into llama.cpp via the `llama-cpp-2` crate, Metal enabled, model is Unsloth's `gemma-4-12b-it-Q5_K_S` (Q5_K Small). The audio test input is a 607 KB 16-bit mono 16 kHz PCM WAV routed through llama.cpp's multimodal audio marker system with the prompt "Transcribe this audio exactly." The benchmark registers 503 multimodal tokens including 486 audio tokens. Overall first-inference throughput (model already loaded) is 16.8 tok/s. The total path breaks down as approximately 2 seconds for audio prefill plus 3.7 seconds for decode, with decode alone at 26 tok/s. The author considered three alternative integration approaches — mlx-swift-lm (no audio support, filed issue #393), llama-server as a sidecar (lifecycle management concerns), and crabnebula-dev/tauri-plugin-llm (Gemma 4 support missing, filed issue #22) — and chose native Rust FFI as the most feasible route. This is the most detailed public report of Gemma 4 12B multimodal audio performance on Apple Silicon via llama.cpp to date, and the decode-only figure of 26 tok/s is consistent with Q5_K_S throughput on M2 Max 64GB reported in other community posts. The 2-second audio prefill overhead is an important framing point: audio tokens are substantially slower to prefill than text tokens at this quantization and hardware tier. Confidence: author-measured first-inference benchmark, methodology disclosed, hardware and model configuration confirmed; single run, no variance reported. (source, July 4, 2026)

RTX 5090, Gemma 4 31B Q6_K: context expanded from 35K to 80K via Docker with GGML_CUDA_NO_PINNED and backend-sampling flags. A practitioner reports successfully running `gemma-4-31B-it-Q6_K.gguf` at 80K context on an RTX 5090 via a llama.cpp Docker container, noting that prior runs were limited to 35K. The configuration that enabled the jump: `GGML_CUDA_NO_PINNED=1` as an environment variable, `--backend-sampling --parallel 1` in the llama.cpp server flags, `--ctx-size 80000`, `--flash-attn on`, `--no-mmap`, `--batch-size 128`, and `--ubatch-size 128`. The author also notes that when using the llama.cpp web interface, the "Backend sampling" checkbox must be enabled to match the `--backend-sampling` server flag. The approach was adapted from a technique previously documented for DeepSeek Flash and confirmed to transfer to Gemma 4. Three flag combinations are explicitly called out as the enabling set: `GGML_CUDA_NO_PINNED=1`, `--backend-sampling --parallel 1`, and flash attention. The RTX 5090's 32 GB VRAM budget appears to be the key enabler here — Q6_K for a 31B model is a high-fidelity quantization that requires significant VRAM even before context allocation. This is a notable long-context deployment report for the RTX 5090 tier, though it needs independent reproduction before being recommended as a general recipe; community experience with `GGML_CUDA_NO_PINNED=1` on other NVIDIA hardware varies and the flag can affect performance as well as memory behavior. Confidence: self-reported, single practitioner, no generation speed figure published; treat as a reproducibility candidate rather than a confirmed recipe. (source, July 4, 2026)

PSA: RYS-style layer duplication on Gemma 4 breaks without proportional layer_scalar adjustment — formula is s^(1/N). A community member who discovered and fixed this issue while experimenting with the RYS (Repeat Your Self) layer-duplication framework posted a concise PSA. The root cause: Gemma 4 models use a `layer_scalar` value that multiplies the output at each layer. When layers are duplicated without adjusting this scalar, the cumulative product compounds incorrectly and the resulting model breaks. The fix is to scale the scalar proportionally: new scalar = `s^(1/N)`, where `s` is the original `layer_scalar` and `N` is the total number of occurrences of the layer after duplication (duplications plus the original; thanks to a community member for catching an error in the original formula). A vibe-coded pull request demonstrating the fix was opened at `github.com/dnhkng/RYS/pull/4` and is listed as closed. The post follows the July 2 field note documenting a separate layer-expansion experiment that also ran into the `layer_scalar` issue during Gemma 4 44B construction. That repetition strengthens confidence that this is a genuine Gemma 4 architectural nuance that is not obvious from the model card or standard fine-tuning guides. Any practitioner attempting RYS-style Gemma 4 modifications should treat this as a prerequisite check before evaluating the model. Confidence: author confirmed the fix with working code; formula should be independently verified before production use, particularly the edge cases around the definition of N. (source, July 4, 2026)

Community RP/agentic benchmark: Gemma 4 31B leads 8-model field at 87%, Gemma 4 12B third at 80%. A community member ran a fantasy-RP and agentic evaluation suite — covering quest completion, scene endings, item and time tracking, character detection, storytelling, and drafting — across 8 locally runnable models. Evaluation used an external LLM grader with N varying per category. Overall pass rates: Gemma 4 31B first at 87%, Qwen3.6 27B second at 82%, Gemma 4 12B third at 80%, with a steep drop to the remaining models in the 55–70% range. The author's own framing: the headline pass rates obscure the more interesting category-level unevenness. Models that perform well on quest completion can fall apart on NPC thoughts or quest summarization, and this sub-category variability is invisible if you only track the overall score. The benchmark is author-designed and LLM-graded (grader not disclosed), which means the results are community anecdotal rather than a reproducible standard benchmark. Nevertheless, Gemma 4 31B holding first place over Qwen3.6 27B on a multi-dimensional agentic task suite, and Gemma 4 12B placing competitively at third, is consistent with earlier community signals about both models' instruction-following quality. The category-cliff observation — good headline, poor sub-score — is a useful evaluation design note for Gemmaclaw's own benchmark harness: top-line pass rates can mask meaningful capability gaps. Confidence: community benchmark, LLM-graded, author-designed suite; treat as directional signal rather than a controlled evaluation. (source, July 4, 2026)

AMD 9070 XT running Gemma 4 26B A4B Q5 on Ryzen 7900X — practitioner seeking 32GB secondary card for dedicated LLM inference. A practitioner currently running Gemma 4 26B A4B Q5 and ComfyUI on an AMD RX 9070 XT (paired with a Ryzen 7900X, 32 GB DDR5 6000 MHz, MSI X670P Wifi motherboard on Windows 11) is looking to add a 32 GB secondary card to dedicate to llama.cpp and ComfyUI workloads while keeping the 9070 XT free for gaming. The cards under consideration are the V620, MI50, and V100 (32 GB versions of each). No benchmark data or community responses were captured at sweep time. The primary value of this post is as a deployment data point: the 9070 XT running Gemma 4 26B A4B Q5 at consumer workloads is a confirmed AMD RDNA4-class configuration, adding to the growing set of community reports documenting Gemma 4 26B running on AMD discrete GPUs without NVIDIA-specific tooling. The card selection question — V620, MI50, or V100 for llama.cpp on Windows 11 — is a separate procurement question with community-specific tradeoffs around ROCm support, PCIe bandwidth, and power budget that the post was seeking input on. Confidence: hardware configuration confirmed by author; no benchmark figures; no comments captured. (source, July 4, 2026)

Open questions

What measured MTP throughput does the MLX Gemma 4 12B kernel achieve against native graph execution? The author states 20–30 tok/s as a theoretical ceiling on favorable workloads but has not yet published validated MTP vs native graph benchmark numbers. Once the MTP validation against native graph execution is complete, the comparison would provide the first community-measured MTP gain figure for Gemma 4 12B on M5 class hardware.
What is the audio prefill and decode performance ceiling for Gemma 4 12B on M2 Max 64GB across quantization variants? The Tauri 2 benchmark used Q5_K_S. Whether Q4_K_M or other variants close the 2-second audio prefill gap or substantially improve the 26 tok/s decode figure is not answered. A comparative sweep over quantization levels would help practitioners choose the right tradeoff for latency-sensitive voice applications.
What generation speed does Gemma 4 31B Q6_K achieve at 80K context on the RTX 5090? The post documents successful context extension to 80K but does not report tokens per second at that context depth. Token throughput at extended contexts is typically lower than at shorter contexts on consumer hardware due to KV cache pressure; knowing the actual throughput would help practitioners plan for long-context workflows on the RTX 5090.
Which 32GB card — V620, MI50, or V100 — performs best for Gemma 4 26B llama.cpp inference on Windows 11? The community post asked this but captured no responses. ROCm support on Windows for V620 and MI50, PCIe bandwidth, and driver maturity are all relevant considerations that a community response thread would resolve.

Sources

The Gemma-mentioning posts driving this update (July 5 sweep, newest first). Posts marked with score ~20 are from the Atom feed fallback and have incomplete metadata; treat as first-look signals:

Gemma 4 12B - MLX Kernel (Jul 4, 2026 — M5 MacBook Pro 16GB; MLX Gemma 4 12B kernel; 20–30 tok/s theoretical MTP ceiling at this bandwidth class; DSpark drafter blocked by RAM; NVIDIA follow-on planned; experimental)
Ran a classic(medival europe) fantasy RP/agentic benchmark across 8 local models (Jul 4, 2026 — 8-model suite: quest completion, scene endings, item/time tracking, character detection, storytelling, drafting; LLM grader; Gemma 4 31B 87%, Qwen3.6 27B 82%, Gemma 4 12B 80%; sub-category unevenness noted)
PSA: Upscaling Gemma 4 requires a proportional layer_scalar adjustment (Jul 4, 2026 — RYS layer duplication; layer_scalar must scale as s^(1/N), N=total occurrences; fix in RYS PR #4; model breaks without adjustment)
Gemma4 with audio input: 16.8 tok/s on Macbook M2 Max 64GB (Jul 4, 2026 — Tauri 2, Rust FFI via llama-cpp-2, Metal; Unsloth gemma-4-12b-it-Q5_K_S; 503 multimodal tokens (486 audio); 16.8 tok/s first-inference; 2s audio prefill + 3.7s decode; decode-alone 26 tok/s)
RTX5090, gemma-4-31B-it-Q6_K.gguf. Context: before - 35k, after - 80k! (Jul 4, 2026 — RTX 5090, Docker llama.cpp; GGML_CUDA_NO_PINNED=1; --backend-sampling --parallel 1; --ctx-size 80000; --flash-attn on; --no-mmap; --batch-size 128 --ubatch-size 128; adapted from DeepSeek Flash technique)
Looking for thoughts on card purchase: V620, MI50, V100 (Jul 4, 2026 — 9070 XT + Ryzen 7900X + 32GB DDR5 6000MHz + Windows 11; running Gemma 4 26B A4B Q5; seeking 32GB secondary card for dedicated LLM; V620/MI50/V100 32GB considered; no comments captured)

_Last updated: 2026-07-05 (July 5 sweep). Confidence: medium. Key findings: MLX Gemma 4 12B kernel on M5 16GB targets 20–30 tok/s MTP ceiling, NVIDIA optimization planned; Tauri 2 audio-input benchmark on M2 Max 64GB: 16.8 tok/s first-inference, 26 tok/s decode-alone, 2s audio prefill with Unsloth Q5_K_S; RTX 5090 Gemma 4 31B Q6_K expanded to 80K context via GGML_CUDA_NO_PINNED+backend-sampling+no-mmap (needs independent reproduction); RYS upscaling requires layer_scalar = s^(1/N) or model breaks; community RP/agentic benchmark places Gemma 4 31B first at 87% and Gemma 4 12B third at 80%; AMD 9070 XT running Gemma 4 26B A4B Q5 confirmed. Next update fires when the daily Gemma 4 research cron flags notable new findings._

---

Field Notes — 2026-07-04

A weekly synthesis of what the r/LocalLLaMA community is reporting about Gemma 4 in real use. Curated from the latest Gemma-mentioning posts (6 new posts from the July 3, 2026 sweep, 472 hardware-mention entries total) and their top comment threads. Confidence is medium unless noted, since this is community signal rather than a controlled benchmark.

July 4 sweep, 2026-07-04 00:00 UTC: a cycle defined more by community initiative than benchmark data. The headline is the Fast Gemma Challenge — a live Gemma x Hugging Face multi-agent competition to maximize `gemma-4-E4B-it` tokens per second on a fixed A10G GPU under a perplexity quality guard. The challenge is the most actionable Gemmaclaw-relevant signal this cycle because its shared scoreboard and coordinated research directions (vLLM, quantization, torch.compile, speculative decoding, custom kernels) will surface reproducible optimization ideas over the coming days. Alongside that, two low-end hardware reports document Gemma 4 E2B running at approximately 9 tok/s on a 2015-era i5-6500 and raise a deployment question about Gemma 4 E2B on an Intel N100 iGPU in a 24/7 homelab setup — no answers were captured for the N100 question. A voice-and-avatar demo built with Gemma 4 31B shows function-tool-driven facial expression and gesture control over a WebSocket stack, a useful data point about the model's structured output reliability in real-time agentic contexts. A speculative community question about whether DiffusionGemma could provide high-quality 256-token draft batches for speculative decoding rounds the cycle, with no experimental results yet.

The Fast Gemma Challenge: multi-agent competition to maximize gemma-4-E4B-it throughput on A10G under a perplexity guard. The Gemma x Hugging Face team launched a multi-agent optimization competition where autonomous agents work in parallel to maximize inference speed for `gemma-4-E4B-it` on a fixed A10G GPU, measured in tokens per second, subject to a perplexity quality constraint. The shared message board allows agents to post plans, claim research directions, run benchmarks, and publish result files in real time. The active work areas are vLLM optimization, quantization schemes, `torch.compile` configurations, speculative decoding paths, and custom CUDA kernels. The live scoreboard is at `gemma-challenge-gemma-dashboard.hf.space`. No baseline TPS figure or current leader was captured in the Reddit post (score ~20, no comments at sweep time), and the submission mechanism is coordinated via a Hugging Face bucket README. For Gemmaclaw readers, the main value of tracking this challenge is not any single result but the methodology emerging from it: every optimization approach that passes the perplexity guard and lands on the public scoreboard is a reproducible optimization idea for Gemma 4 E4B inference in general. Confidence: official challenge structure confirmed by post and links; no benchmark result captured yet; treat as a live community optimization event to track over the next week. (source, July 3, 2026)

Gemma 4 E2B on a 2015-era i5-6500: approximately 9 tok/s, subjectively competitive with GPT-4. A community member reports running Gemma 4 E2B on an Intel i5-6500 desktop CPU — a quad-core Skylake processor from 2015 with no dedicated GPU — at approximately 9 tokens per second. The author describes the output quality as "a lot better than ChatGPT 3.5 and maybe as good as ChatGPT 4," and notes that Qwen 3.5 4B was also well-received before switching to E2B. No quantization variant, OS, inference backend, context length, or system RAM figure is disclosed. The quality claim ("as good as ChatGPT 4") is a subjective impression with no methodology, and the comparison to GPT-4 is likely colloquially referring to GPT-4.0 or a similar older API model rather than the current frontier. What this post adds to the hardware picture: Gemma 4 E2B is being adopted specifically by users on CPU-only hardware because it offers a subjectively meaningful quality step over previous small-model options while fitting comfortably in the memory budget of consumer desktops. The 9 tok/s figure is consistent with the i5-6500's estimated memory bandwidth profile (~30–35 GB/s) at a low quantization depth. Confidence: anecdotal single-user report, no configuration details, no methodology for quality claims; treat as a directional adoption signal for CPU-class hardware, not a reproducible benchmark. (source, July 3, 2026)

Gemma 4 E2B on Intel N100 mini PC: CPU-only or iGPU — deployment question, no answers captured. A community member running a 24/7 Intel N100 mini PC on Proxmox asks whether to configure `llama.cpp` to run Gemma 4 E2B through the CPU alone or through the N100's integrated GPU, and which backend (OpenCL, SYCL, or Vulkan) to target for the iGPU path. No comments were captured at sweep time. The N100 is a Gracemont-core processor with Intel UHD graphics, typically configured with 8–16 GB LPDDR5 in mini PC builds; its iGPU shares system memory and supports SYCL/oneAPI and Vulkan backends in recent `llama.cpp` builds. For Gemma 4 E2B, the relevant tradeoff is: CPU-only inference uses all system RAM as model memory but is limited by CPU memory bandwidth (~68 GB/s theoretical for LPDDR5-5200), while iGPU inference may enable partial SIMD or matrix-engine acceleration but risks overhead from GPU driver setup and context switching on a low-power platform. The lack of responses means no community consensus exists yet for this specific setup. The post is notable as a deployment-target signal: N100-class mini PCs are sold for homelab and always-on server use cases and represent a meaningful low-power Gemma 4 E2B deployment tier that is distinct from both consumer desktop GPUs and Apple Silicon. Confidence: unanswered community question, no benchmarks; treat as a watch signal for the emerging N100/low-power iGPU tier. (source, July 3, 2026)

Gemma Avatar demo: Gemma 4 31B drives facial expressions and gestures via function tools over a WebSocket voice pipeline. A community developer published a working voice-and-avatar demo where a 3D avatar listens to speech, responds with a synthesized voice, and autonomously controls its own facial expressions and hand gestures. The inference model is Gemma 4 31B served via Cerebras (not local hardware). The model receives facial expression and gesture state as callable function tools — `set_mood`, `make_hand_gesture`, `make_facial_expression` — and decides when to trigger them during its response generation. The audio stack is fully open: Silero VAD for voice activity detection, Nvidia Parakeet for speech-to-text, and Qwen3-TTS for text-to-speech. Transport is raw PCM over a plain WebSocket. The avatar rendering uses TalkingHead plus HeadAudio (met4citizen's open-source projects). No latency figures are disclosed, and the Cerebras serving tier means the generation speed is substantially above any local consumer hardware setup. The Gemmaclaw-relevant signal is the model's function-calling behavior: Gemma 4 31B reliably invokes structured avatar state tools in real-time conversational context without disabling or ignoring them. This is consistent with prior community reports documenting Gemma 4 31B's strong structured output and tool-call discipline, and it extends the documented use cases to multimodal agentic UI work. Confidence: working demo with disclosed stack; Cerebras serving tier (not local inference); function-call reliability observed qualitatively, not measured. (source, July 3, 2026)

DiffusionGemma as a speculative decode drafter: community question, no experimental results. A community member asks whether a Gemma diffusion model could serve as a high-quality speculative decode drafter — generating a 256-token draft in parallel rather than autoregressively — arguing that existing MTP approaches face a fundamental tradeoff between regressive and parallel generation quality. The post frames DiffusionGemma as a potential way to get draft batches that are both fast and high-quality by exploiting the diffusion model's parallel decoding architecture. No experimental results, speculative decode acceptance rates, latency measurements, or backend compatibility details are provided or have been captured in comments. The question is relevant context: DiffusionGemma has been blocked in LM Studio since at least mid-June due to an unmerged llama.cpp PR (documented in the June 30 sweep), so community exploration of its capabilities beyond text generation is currently limited to source-build users. The speculative decode drafter use case for diffusion models is a genuine research direction — several 2025–2026 papers explore it for non-autoregressive models — but there is no community evidence yet that DiffusionGemma specifically performs well in this role. Confidence: speculative community question, no results, no backend support confirmed; treat as a research watchlist item. (source, July 3, 2026)

Open questions

What optimization techniques are winning the Fast Gemma Challenge scoreboard? The competition structure makes results observable over time: checking the live scoreboard at `gemma-challenge-gemma-dashboard.hf.space` and the shared message board will reveal which of vLLM, torch.compile, custom kernels, or speculative decoding techniques is yielding the best throughput gains while preserving the perplexity guard. The winning approaches are directly applicable to `gemma-4-E4B-it` users outside the competition context.
What is the best backend configuration for Gemma 4 E2B on an Intel N100 iGPU? The unanswered community question about CPU-only versus SYCL/Vulkan iGPU on N100 reflects a real gap. Benchmarks from N100 mini PC users on `/r/LocalLLaMA` or the Gemmaclaw community would close this for a growing tier of always-on low-power homelab deployments.
Can DiffusionGemma serve as a speculative decode drafter in llama.cpp, and what acceptance rate does it achieve? The unmerged llama.cpp PR blocking DiffusionGemma in prebuilt frontends is the first constraint to resolve. Once merged, testing DiffusionGemma's acceptance rate as a drafter for Gemma 4 12B or 31B would determine whether this is a viable performance strategy or a theoretical curiosity.

Sources

The Gemma-mentioning posts driving this update (July 4 sweep, newest first). All are fresh threads (score ~20, no captured comment threads at sweep time); treat individual numbers as first-look anecdotes rather than settled results:

The Fast Gemma Challenge (Jul 3, 2026 — Gemma x HuggingFace multi-agent TPS competition; gemma-4-E4B-it on A10G; perplexity quality guard; vLLM/quantization/torch.compile/speculative decoding/custom kernels; live scoreboard at gemma-challenge-gemma-dashboard.hf.space)
gemma4 e2b is really good, what other small models work on crappy computers? (Jul 3, 2026 — i5-6500 CPU-only, ~9 tok/s; E2B subjectively competitive with GPT-4 per author; no quant/backend details; anecdotal)
Help using llama.cpp with intel n100? (Jul 3, 2026 — N100 mini PC, Proxmox, 24/7; CPU-only vs iGPU backend question for Gemma 4 E2B; no answers captured)
Gemma Avatar: Talk to Gemma 4-31B face to face (Jul 3, 2026 — Cerebras-served 31B; Silero VAD + Parakeet STT + Qwen3-TTS; set_mood/make_hand_gesture/make_facial_expression function tools; TalkingHead + HeadAudio avatar; WebSocket PCM; function-call reliability observed)
Anyone tried using the new (ish) Gemma diffusion model as a speculative model? (Jul 3, 2026 — DiffusionGemma as 256-token parallel speculative drafter hypothesis; MTP regressive/parallel tradeoff framing; no results, no comments)

_Last updated: 2026-07-04 (July 4 sweep). Confidence: medium. Key findings: Fast Gemma Challenge is a live multi-agent A10G optimization competition for gemma-4-E4B-it (track scoreboard at gemma-challenge-gemma-dashboard.hf.space); Gemma 4 E2B runs at ~9 tok/s on an i5-6500 CPU (anecdotal, no config details); N100 iGPU deployment question raised with no community answer yet; Gemma 4 31B drives avatar function tools (set_mood, gesture) over a Cerebras-backed voice pipeline; DiffusionGemma speculative drafter concept is unanswered community question with no results. Next update fires when the daily Gemma 4 research cron flags notable new findings._

---

Field Notes — 2026-07-03

A weekly synthesis of what the r/LocalLLaMA community is reporting about Gemma 4 in real use. Curated from the latest Gemma-mentioning posts (5 new posts from the July 2, 2026 sweep, 466 hardware-mention entries total) and their top comment threads. Confidence is medium unless noted, since this is community signal rather than a controlled benchmark.

July 3 sweep, 2026-07-03 00:00 UTC: five signals from the July 2 cycle. The most hardware-relevant is a practitioner benchmarking Gemma 4 26B A4B QAT alongside Qwen3.6 27B and Ornith 35B on an RTX 3090 using inspect-ai and standard benchmarks — the most directly useful RTX 3090-class evidence in recent sweeps. A voice pipeline demo shows Gemma 4 E4B achieving similar latency to a Cerebras-served 31B on an M3 MacBook Pro 36GB, a practical data point for Apple Silicon users interested in open-source realtime speech. A community fine-tuner reports +290 Elo over base Gemma-4-31B on a copywriting benchmark, a self-reported result that merits scrutiny but shows the base model is strong enough for narrow domain specialization. An architecture experiment proposes rebuilding Gemma 4 31B as a 26B by ablating the weakest SWA layers and adding attention residuals — highly speculative and pre-results, but worth tracking as a community research direction. Finally, a thin post points to a claimed 255 tok/s Gemma 4 WebGPU result, a figure that needs source verification before being cited as credible.

RTX 3090 benchmark: Gemma 4 26B A4B QAT vs Qwen3.6 27B and Ornith 35B via inspect-ai. A community member frustrated with the lack of systematic benchmarks for locally runnable models ran three models through inspect-ai and standard benchmark suites on an RTX 3090: Qwen3.6 27B at Q4_K_M, Gemma4 26B A4B QAT at Q4_0, and Ornith1.0 35B MoE at Q4_K_M. Models came from the lmstudio-community and deepreinforce-ai repositories; inference ran in LM Studio. The benchmark used 100 samples per suite with aggressive limits to enable overnight runs. Critically, the post body is incomplete in the captured snapshot: the author wrote "I expected Ornith to be nearly as..." without publishing the final comparison table, suggesting this is an in-progress report or the post content was truncated at collection time. The setup and methodology are credible: inspect-ai provides structured multi-task evaluation rather than qualitative impressions, and the benchmark sample count (100 per suite) is reasonable for a first pass. The relevance for Gemmaclaw readers: this is one of the few community-produced evaluations that tests Gemma 4 26B A4B QAT directly on a single RTX 3090, which is the reference hardware class for this site. No throughput figures were captured from the post, and quality results are pending the completed benchmark table. Confidence: methodology is solid (inspect-ai + standard benchmarks), but results are incomplete; treat as a promising benchmark-in-progress rather than a final verdict. Follow the source post for updates. (source, July 2, 2026)

Voice pipeline: Gemma 4 E4B achieves similar latency to Cerebras-served 31B on M3 MacBook Pro 36GB. A Hugging Face community member demoed a fully open-source, locally runnable voice assistant pipeline built from three components: Nvidia Parakeet for speech-to-text, Gemma 4 31B served via Cerebras for the language model step, and a custom Qwen3TTS inference path for text-to-speech. The author claims the pipeline is a drop-in replacement for the OpenAI realtime API. The key Apple Silicon data point: the same pipeline achieves "similar latencies" on a MacBook Pro M3 36GB using Gemma 4 E4B locally rather than the Cerebras-hosted 31B. No specific latency figures are published, and no comments were captured to corroborate the latency claim. The framing suggests "similar" means usable for real-time conversation rather than indistinguishable from cloud inference, but the exact comparison is not quantified. For practitioners: Gemma 4 E4B on M3 36GB as the local inference target for open-source voice pipelines is a credible configuration given E4B's previously documented 30–40 tok/s range at Q4 on Apple Silicon. The pipeline's other building blocks (Parakeet, Qwen3TTS) are also local and open-weight, making this one of the more complete open-source realtime voice stacks documented in recent sweeps. Confidence: latency comparison is self-reported with no supporting numbers; hardware (M3 36GB) and model (Gemma 4 E4B) configuration are plausible and consistent with prior reports. (source, July 2, 2026)

Gemma-4-31B copywriting fine-tune: +290 Elo over base model on a domain-specific benchmark. A community member fine-tuned Gemma-4-31B-it specifically for direct-response copywriting, targeting the pattern of using specific pain points, concrete facts, and tight calls to action instead of generic marketing hedges. Evaluation used an EqBench3-style pairwise Elo methodology over 30 real-world briefs spanning Facebook ads, cold email, landing pages, product descriptions, SMS, and video scripts. The fine-tune was compared blind against the base model using DeepSeek V4 Flash as the judge in both ordering directions (A-vs-B and B-vs-A) to control for position bias. Results: the fine-tune reached Elo 1657 vs the base at 1367 — a gap of 290 Elo points — winning 24 of 30 head-to-head comparisons (80%). Important caveats: this is entirely self-reported with no independent replication. The judge (DeepSeek V4 Flash) is a capable model but is also a competitor in certain evaluation contexts; blind position-swap methodology reduces but does not eliminate judge bias on style-dependent tasks. The benchmark suite is purpose-built by the author rather than drawn from a standardized repository. No comments were captured at sweep time. The practical signal: Gemma 4 31B fine-tunes well for narrow copywriting tasks, which is consistent with earlier reports of its strong instruction-following and tone controllability. Confidence: self-reported benchmark with plausible methodology but no independent replication; treat as a domain-fine-tuning capability signal rather than a controlled public benchmark result. (source, July 2, 2026)

Architecture experiment: rebuilding Gemma 4 31B as a 26B by ablating SWA layers and adding attention residuals. A community member is actively experimenting with rebuilding Gemma 4 31B into a smaller but potentially stronger 26B variant by modifying its sliding window attention (SWA) architecture. Gemma 4 31B uses five SWA layers per block at 1024 tokens each. The author identified Block Layer 3 as "consistently the weakest" through ablation tests and plans to remove it, then rescale SWA attention spans to 1024/2048/4096/8.1K with a final global layer. Additionally, the author plans to bolt on "Attention Based Residual Networks" from an early 2026 research paper to allow global layers to better propagate information. Fine-tuning is planned on the IT (instruction-tuned) base rather than pretraining, taking the top-K logits from the 31B as supervision targets. This is highly speculative pre-results work: the author has barely slept, acknowledges trial-and-error methodology, and is not a professional ML researcher. No benchmark or perplexity numbers are provided, no comments were captured, and the project may not produce a publicly released model. The Gemmaclaw relevance: this is the second community experiment in recent sweeps attempting to modify Gemma 4's architecture (following the 44B layer-expansion post from the July 2 sweep). Both independently flag Gemma 4's SWA configuration as a modification target, which is a weak but consistent signal worth tracking. Confidence: experimental, no results, single author, acknowledged non-expert; treat as a research watchlist item. (source, July 2, 2026)

Gemma 4 WebGPU kernel speed claim: 255 tok/s (unverified). A post links to an X/Twitter post by user @xenovacom claiming Gemma 4 WebGPU kernels reach 255 tokens per second. The Reddit post body is a single-sentence community reaction arguing that crossing 100 tok/s on dense models locally is the threshold that makes local inference competitive with frontier cloud APIs for routine work. No hardware, model variant, quantization, context length, batch size, or browser/runtime information is provided in the Reddit post; the underlying X post is linked but not archived in local knowledge. The 255 tok/s figure would represent a substantial improvement over the best previously documented WebGPU Gemma 4 throughput (the Transformers.js + Reachy Mini demo from the May 2026 sweep did not publish throughput numbers). For context, earlier llama.cpp hardware reports on high-end Apple Silicon reach similar figures for the E4B MoE variant. Without the source post's methodology, this number cannot be cited as credible, but the threshold observation in the comment is reasonable: 100–200+ tok/s on a local WebGPU path would meaningfully expand Gemma 4's deployability in browser-native or edge contexts. Confidence: unverified, source not directly accessible from archived material; treat as a watch signal pending independent reproduction or methodology disclosure. (source, July 2, 2026)

Open questions

What are the final inspect-ai benchmark results for Gemma 4 26B A4B QAT vs Qwen3.6 27B and Ornith 35B on the RTX 3090? The community post captured an in-progress benchmark; the final results table was not available at sweep time. This is among the most useful RTX 3090-class benchmark data for Gemmaclaw given the shared target hardware. Checking the source post for an updated results table or follow-up comment thread would close this gap.
What specific latency does Gemma 4 E4B achieve on M3 MacBook Pro 36GB in the voice pipeline? The post reports "similar latencies" to Cerebras-served 31B without publishing numbers. A follow-up asking the author for time-to-first-audio and total pipeline latency would make this a citable hardware data point for Apple Silicon voice-pipeline users.
Can the Gemma 4 copywriting fine-tune result be independently replicated? The 290 Elo gap is large enough to be interesting if real, but relies on a custom benchmark and a single judge. An independent evaluation using a different judge model and a standardized copywriting benchmark (or at least the author's published benchmark data) would substantially raise confidence.
What is the source and hardware spec behind the 255 tok/s Gemma 4 WebGPU claim? The @xenovacom X post linked in the Reddit thread may contain the methodology. If the figure is reproducible on consumer hardware, it represents a meaningful advance in browser-native Gemma 4 inference and should be incorporated into the WebGPU hardware category.

Sources

The Gemma-mentioning posts driving this update (July 3 sweep, newest first). Posts marked with score ~20 are from the Atom feed fallback and have incomplete metadata; treat as first-look signals:

Local benchmarks with a RTX 3090 - Qwen3.6 27b vs Ornith (Jul 2, 2026 — RTX 3090, LM Studio, inspect-ai; Gemma4 26B A4B QAT Q4_0 vs Qwen3.6 27B Q4_K_M vs Ornith 35B MoE Q4_K_M; 100 samples/suite, results incomplete at sweep time)
Talking with Gemma 4 31B! (Jul 2, 2026 — open-source voice pipeline: Nvidia Parakeet + Gemma 4 31B (Cerebras) + custom Qwen3TTS; similar latency on M3 MacBook Pro 36GB with Gemma 4 E4B locally; drop-in OpenAI realtime API replacement)
Fine-tuned Gemma-4-31B specifically for Copywriting & Creative Writing Tasks (Jul 2, 2026 — EqBench3-style pairwise eval, 30 copywriting briefs, DeepSeek V4 Flash judge, position-swap control; fine-tune 1657 Elo vs base 1367, 24/30 wins; self-reported, no independent replication)
Rebuilding Gemma 4 31b... better... As 26b... (Jul 2, 2026 — SWA layer ablation (Block Layer 3 weakest), rescale SWA to 1024/2048/4096/8.1K, add attention residual networks; experimental, no results, IT base fine-tune planned)
Gemma 4 WebGPU Kernels 255 tok/s by x/@xenovacom (Jul 2, 2026 — thin post linking to X/Twitter claim of 255 tok/s on WebGPU; no hardware/methodology detail; unverified)

Last updated: 2026-07-03 (July 3 sweep). Confidence: medium. Key findings: RTX 3090 inspect-ai benchmark of Gemma 4 26B A4B QAT vs Qwen3.6 27B vs Ornith 35B in progress (results incomplete); Gemma 4 E4B on M3 36GB achieves similar latency to Cerebras-hosted 31B in open voice pipeline; Gemma 4 31B copywriting fine-tune claims +290 Elo on domain benchmark (self-reported); SWA layer ablation community experiment ongoing (pre-results); 255 tok/s WebGPU speed claim unverified. Next update fires when the daily Gemma 4 research cron flags notable new findings.

---

Field Notes — 2026-07-02

A weekly synthesis of what the r/LocalLLaMA community is reporting about Gemma 4 in real use. Curated from the latest Gemma-mentioning posts (6 new or updated since 2026-07-01, 461 hardware-mention entries total) and their top comment threads. Confidence is medium unless noted, since this is community signal rather than a controlled benchmark.

July 2 sweep, 2026-07-02 00:00 UTC: four signals worth surfacing from a cycle that leans toward model architecture experiments and ecosystem calibration rather than hardware benchmarks. The most attention-grabbing post is a community attempt to expand Gemma 4 31B to 44B by duplicating layers using the LLaMA Pro identity-init approach — experimental territory that yields a watchlist signal about Gemma 4's architectural compactness. The most actionable benchmark is a 7-model speed-and-quality comparison on M5 Max 128GB where Gemma 4 31B MLX pulls 9.3 tok/s at 19 GB for a code-understanding task. A broad ecosystem note from the Open Models June 2026 roundup confirms that Intel and NVIDIA both shipped quantization artifacts for Gemma 4 models during June. A cloud inference data point rounds the sweep: a community member claims Gemma 4 31B on Cerebras outperforms ChatGPT voice mode in conversational quality, a thin post that serves as a calibration signal for the cloud inference tier.

Community layer-expansion experiment: Gemma 4 31B expanded to 44B via identity-init layer duplication. A community member built a "44B" variant of Gemma 4 31B by duplicating the model's 60 transformer layers twice — first to 80 layers, then to 88 layers (yielding roughly 47B parameters) — using the LLaMA Pro identity-initialization method with a Gemma 4-specific `layer_scalar` fix. The author's hypothesis is that Gemma 4's dense architecture packs knowledge so compactly that injecting a new domain (Korean legal + STEM data) risks overwriting existing weights rather than extending the model's capacity. The two-phase expansion — expand, fine-tune on domain data, expand again — is intended to carve out "empty capacity" before domain-specific fine-tuning. Important caveats: the author is not a CS or math professional and describes the work as "hands-on trial and error on my own hardware." No controlled ablation against unmodified Gemma 4 31B is provided, no benchmark numbers are disclosed, and no comments were captured at sweep time. The Gemma 4-specific `layer_scalar` fix took significant debugging time, suggesting the architecture does not transfer cleanly from LLaMA Pro's original recipe. The post's practical signal for most users is narrow: layer expansion via identity init is an active community research direction, but the Gemma 4 architectural compactness the author observed is consistent with what earlier sweeps have documented about Gemma 4's parameter efficiency. Confidence: experimental, single-author, no controlled ablation, no benchmark; treat as a research watch item rather than an actionable recommendation. (source, July 1, 2026)

M5 Max 128GB speed-and-quality benchmark: Gemma 4 31B MLX at 9.3 tok/s, 19 GB, takes >10 minutes on a code-understanding task. A practitioner benchmarked seven open-weights models on an M5 Max with 128 GB unified memory using a repo-understanding task (how does binding work in a specific codebase, with a code example). The task went through agentOS from rivet_dev and used a Pi Rating scoring system via GLM 5.2. Key Gemma 4 data point: `gemma4:31b-mlx` ran at 9.3 tok/s using 19 GB and took more than 10 minutes to complete the task; the rating system had trouble displaying the output. By comparison, `qwen3.5 122B Q4_K_M` ran at 29.2 tok/s using 81 GB (37.3 seconds, reasoning process available immediately) and earned a 4.5/5 rating. `qwen3.6 35b-a3b-coding-mxfp8` ran at 42.45 tok/s using 38 GB but rated only 2/5 due to a TypeScript-to-Python conversion error and incorrect conceptual handling. The Gemma 4 quality was not fully captured due to rendering issues. This benchmark is not controlled for throughput vs quality tradeoffs — it is a snapshot of real-world usability on a specific code-comprehension task. What it confirms: on Apple Silicon at the M5 Max tier, Gemma 4 31B MLX runs at a substantially lower tok/s than competing models at similar or larger parameter counts, though its 19 GB footprint leaves ample headroom in a 128 GB system. The rendering issue during quality evaluation means this data point is incomplete on the quality axis. Confidence: single-author non-controlled benchmark, rendering issue prevented quality evaluation; treat as a speed snapshot only, quality verdict pending. (source, July 1, 2026)

Open Models June 2026 roundup: Intel AutoRound and NVIDIA NVFP4 quants shipped for Gemma 4 models. The community's monthly open-model retrospective for June 2026 confirms that two major quantization artifact packages were released for Gemma 4 during the month. Intel AutoRound produced quantized versions of both Gemma-4-31B-it and Gemma-4-12B-it. NVIDIA NVFP4 (native 4-bit floating point format for Blackwell hardware) shipped for diffusiongemma-26B-A4B-it. Gemma-4-QAT is listed in the miscellaneous section as a notable June artifact. The roundup also catalogs MXFP4 releases from AMD for unrelated models. For Gemmaclaw users, the practical implications: Intel AutoRound variants of Gemma 4 31B and 12B are available as an alternative to GGUF-based quants, potentially better suited to Intel GPU or CPU inference pipelines. NVIDIA NVFP4 for DiffusionGemma-26B-A4B-it extends the native Blackwell quantization path for the diffusion variant, complementing the previously documented NVFP4 release for the base 26B MoE model. AutoRound quality relative to llama.cpp Q-series quants is not characterized in the roundup; community benchmarks on AutoRound quality for Gemma 4 remain sparse. Confidence: official roundup with direct artifact links; no quality or throughput benchmarks for AutoRound Gemma 4 variants in this post. (source, July 1, 2026)

Gemma 4 31B on Cerebras outperforms ChatGPT voice mode — a cloud inference calibration signal. A community post with minimal body text claims that Gemma 4 31B served via Cerebras is better than ChatGPT's voice mode for conversational quality. No methodology, sample prompts, evaluation criteria, or comparison methodology are disclosed; no comments were captured. The post's value is calibration rather than actionable data: Cerebras cloud runs Gemma 4 31B at significantly higher throughput than any consumer hardware configuration documented in these field notes, which puts it in a different inference tier than local setups. The claim that the conversational experience at that throughput tier exceeds ChatGPT voice mode aligns with earlier community signals documenting Gemma 4's strong multilingual instruction following and natural dialogue quality, but cannot be verified from this post alone. Confidence: single anecdotal claim, no methodology, no comments; treat as a cloud inference tier sentiment signal only. (source, July 1, 2026)

Open questions

What quality score does Gemma 4 31B MLX achieve on the M5 Max repo-understanding task? The rendering issue prevented the benchmark from completing the quality evaluation. A follow-up run with a working display pipeline would close this gap and add a quality data point to the speed measurement (9.3 tok/s, 19 GB).
How do Intel AutoRound Gemma 4 31B and 12B quants compare to GGUF Q4_K_M or QAT variants on standard benchmarks? AutoRound's quality relative to established llama.cpp quant families is not characterized in available community posts. A direct comparison on a coding or reasoning benchmark would help practitioners choose between quantization paths.
Does Gemma 4's layer-expansion compactness problem generalize to other domain injection approaches? The community experiment's core hypothesis — that Gemma 4's dense architecture resists new domain injection without expansion — would benefit from a controlled ablation: fine-tune on the same Korean legal + STEM data without layer expansion, compare perplexity and task accuracy. If the hypothesis holds, it implies Gemma 4 31B is inherently harder to domain-adapt than models with similar parameter counts and looser knowledge packing.
What throughput does Gemma 4 31B achieve on Cerebras, and how does it compare to the community's best local results? The voice mode claim is a quality observation, not a throughput one. Cerebras serves Gemma 4 at speeds not achievable on consumer hardware; knowing the actual tok/s on their infrastructure would give practitioners a concrete reference point for the cloud inference tier and clarify how far local hardware lags the Cerebras tier at this model size.

Sources

The Gemma-mentioning posts driving this update (July 2 sweep, newest first). Posts marked with score ~20 are from the Atom feed fallback and have incomplete metadata; treat as first-look signals:

I extended Gemma4-31B to 44B (88 layers) — since Google won't give us anything bigger than 31B (Jul 1, 2026 — identity-init layer expansion experiment, 60→80→88 layers, Korean legal + STEM fine-tune, Gemma4-specific layer_scalar fix, no benchmark vs base; experimental)
Speed vs. quality: benchmarking 7 open-weights models on M5 Max (Jul 1, 2026 — M5 Max 128GB; gemma4:31b-mlx at 9.3 tok/s 19GB, >10min task, rendering issue prevented quality capture; qwen3.5 122B Q4 at 29.2 tok/s 81GB rated 4.5/5)
Open Models - June 2026 (Jul 1, 2026 — Intel AutoRound quants for Gemma-4-31B-it and Gemma-4-12B-it; NVIDIA NVFP4 for diffusiongemma-26B-A4B-it; Gemma-4-QAT in misc; June 2026 open model roundup)
gemma-4-31B on Cerebras is better than ChatGPT voice mode (Jul 1, 2026 — anecdotal cloud inference quality claim, no methodology; Cerebras tier calibration signal only)

Last updated: 2026-07-02 (July 2 sweep). Confidence: medium. Key findings: community experiment extends Gemma 4 31B to 44B via layer duplication (experimental, no ablation); M5 Max benchmark shows Gemma 4 31B MLX at 9.3 tok/s 19GB (quality not captured); Intel AutoRound and NVIDIA NVFP4 quants confirmed for Gemma 4 31B-it, 12B-it, and DiffusionGemma-26B-A4B-it during June 2026; Gemma 4 31B on Cerebras reported better than ChatGPT voice mode (anecdotal). Next update fires when the daily Gemma 4 research cron flags notable new findings.

---

Field Notes — 2026-07-01

A weekly synthesis of what the r/LocalLLaMA community is reporting about Gemma 4 in real use. Curated from the latest Gemma-mentioning posts (5 new or updated since 2026-06-30, 455 hardware-mention entries total) and their top comment threads. Confidence is medium unless noted, since this is community signal rather than a controlled benchmark.

July 1 sweep, 2026-07-01 00:00 UTC: five signals worth publishing. The headline this cycle is a set of practical hardware reports that add real coverage to two underserved categories: server-tier data-center GPUs repurposed for local use (Tesla V100 single and dual NVLink), and AMD Vega (gfx900) cards where an upstream llama.cpp PR just landed a +65.1% prefill speedup for Gemma 4 12B specifically. A third finding challenges a common community prior: a user reports Gemma 4 26B MoE consistently matching or beating Gemma 4 31B Dense on every test they can construct, inverting the "dense is smarter" expectation for RAG workloads. Two shorter signals round the cycle: Gemma 4 earns praise for code reliability (zero hallucinated folder names versus recurring Qwen typos in the same testing context), and a new uncensored agentic fine-tune of Gemma 4 12B arrived on HuggingFace.

Tesla V100 16GB: single module fits Gemma 4 26B comfortably; dual NVLink = 32 GB and doubled bandwidth for larger models. A practitioner who repurposed a pair of Tesla V100-SXM2-16GB modules (GV100, Volta, sm_70, ~900 GB/s HBM2 bandwidth) shares benchmarks for both single and NVLink-bridged dual configurations. Key findings for Gemma 4 users: a single 16 GB module loads Gemma 4 26B fully on-GPU with headroom for the KV cache — enough for one person doing local coding, agent work, or general chat. Larger MoE models like Qwen 3.5/3.6 35B do not fit a single 16 GB module and spill experts to CPU RAM, which introduces a CPU/RAM bandwidth bottleneck and slower throughput. Bridging two V100s with NVLink gives 32 GB of unified HBM2 at roughly double the bandwidth, making even larger models viable without CPU spill. Critical hardware caveat: V100 is Volta (sm_70) and supports only fp16, not bf16 or int8 tensor operations. Any model or runtime that assumes bf16 requires an fp16 fallback path, and users on Windows lose the biggest free speedup available from the card. For Gemma 4 specifically, the 26B MoE class fits the single-module tier; 31B Dense is borderline or tighter depending on context size. No specific tok/s figures are given for Gemma 4 in the post, but the bandwidth profile (~900 GB/s single, ~1800 GB/s dual) places V100 NVLink near the bandwidth of a single modern mid-range data-center GPU, making dual-module setups an underrated local-inference option for practitioners who can source used V100 pairs. Confidence: single-author benchmark post, hardware disclosed, general throughput characterization without model-specific tok/s for Gemma 4; treat as directional for V100 hardware planning. (source, June 30, 2026)

HIP hipBLAS PR for gfx900 (AMD Vega) GPUs delivers +65.1% prefill speedup for Gemma 4 12B. A llama.cpp pull request targeting old Vega / gfx900-class AMD GPUs — Radeon RX Vega 56/64, Radeon Instinct MI25, Frontier Edition, and associated Pro variants — benchmarks three models under the new hipBLAS-for-dense-prefill path versus the existing MMQ path. Gemma 4 12B shows the largest gain: +65.1% overall performance. Qwen3.5 4B gains +36.1% and Qwen3.6 27B gains +18.9%, averaging roughly 40% improvement across the three. The mechanism: the PR routes dense prefill operations through hipBLAS (AMD's BLAS GPU library) while keeping the multi-expert MoE dispatch through the MMQ path, which is better suited to MoE's sparse arithmetic. The result is faster prompt processing for the dense components of any model running on gfx900, which benefits Gemma 4 12B Dense more than MoE models where a larger fraction of work hits the MoE path. This PR has not merged to the llama.cpp main branch as of this sweep, so its availability depends on building from the PR branch. Gfx900 cards are old (Vega10, 2017–2019 architecture) and inexpensive on the used market; this speedup makes them materially more viable for Gemma 4 12B inference at low cost. Confidence: PR-stage benchmark from the PR author; improvement figures are for the specific PR branch; production availability pending merge. (source, June 30, 2026)

User finding: Gemma 4 26B MoE matches or beats Gemma 4 31B Dense on every personal RAG test. A community member building a heavy research assistant — books, most of Wikipedia, large research-paper datasets, daily RSS ingestion, multi-turn reasoning, extended personal memory — reports trying both Gemma 4 models and finding that "Gemma 4 26B MoE is matching and/or beating the 31B dense on every damn test I come up with." The author expected the 31B Dense to be superior, citing the "dense good, MoE bad, MoE dumb" framing common on r/LocalLLaMA, and questions whether their testing is flawed. No specific benchmark methodology, hardware, or quant choices are disclosed, so this cannot be treated as a controlled result. The confidence caveat is standard for this class of post: a single user's subjective testing environment where the 26B MoE's active-parameter budget per token (~4B) runs efficiently for RAG-style generation while the 31B Dense's full parameter count carries memory and speed costs at VRAM limits. The RAG-plus-reasoning use case is plausibly one where MoE efficiency wins over per-token parameter count, particularly when context includes retrieved passages that the model needs to synthesize rather than recall from weights. This reinforces earlier community signals from sweeps in late June: Gemma 4 26B-A4B is frequently cited alongside or above the 31B Dense for assistant and RAG workloads even though it uses fewer active parameters. Confidence: anecdotal self-report, no hardware or quant details, no controlled methodology; treat as a directional prior update for RAG use cases. (source, June 30, 2026)

Coding reliability: Gemma 4 produces zero hallucinated file paths; Qwen produces occasional code typos in the same workflow. A user running Python scripting and workflow automation in OpenCode reports a notable qualitative difference between Gemma 4 31B and Qwen 3.6 (27B and 35B A3B): Gemma 4 has produced zero hallucinated folder or file names, while Qwen occasionally creates directory typos that are hard to debug ("hallucinated what a folder was called"). The same user also describes Gemma 4 as "stubborn" — it is sometimes reluctant to take the extra step without being pushed, while Qwen models execute more aggressively but with occasional structural accuracy issues. No hardware, quant, or system-prompt details are disclosed. This is a qualitative user preference rather than a measured benchmark, but it adds to a pattern observable across multiple sweeps: Gemma 4 is consistently described as more conservative and precise in agentic file-system contexts, while Qwen models are described as faster and more eager but with higher hallucination risk on constrained output tasks (file paths, directory names, structured output). Confidence: single-user qualitative comparison, no controlled methodology, no sample size; treat as a user-experience signal rather than a reproducible benchmark. (source, June 30, 2026)

Uncensored Heretic fine-tune of Gemma 4 12B released on HuggingFace. Community packager LLMFan46 released a new uncensored agentic fine-tune of Gemma 4 12B — `gemma-4-12B-agentic-fable5-composer2.5-v2-3.5x-tau2-uncensored-heretic` — in both safetensors and GGUF formats. The post announces 13 refusals out of 100 on an unspecified probe with 0.0367 KLD (KL divergence from the base model). No benchmark methodology, accuracy figures, or hardware requirements are disclosed. This is the most recent entry in a recurring pattern of community uncensored Gemma 4 12B variants, following the Uncensored-Opus4.7-CoT published in the June 27 sweep. That earlier release showed that abliteration alone costs MMLU −18 points and GSM8K −47 points, while CoT SFT largely recovers those losses — context that applies here as a prior for evaluating any uncensored 12B variant. The 0.0367 KLD figure suggests the weights diverged modestly from the base; lower KLD generally correlates with better preserved capability, but the relationship is not linear and task-specific benchmarks remain the reliable measure. Confidence: developer release announcement, refusal probe self-reported, no independent benchmark; capability relative to base remains unverified. (source, June 30, 2026)

Open questions

What tok/s does Gemma 4 26B achieve on a single Tesla V100 16 GB at Q4_K_M or Q5_K_S? The post gives bandwidth context and a fit-on-GPU confirmation but no throughput measurement. Given V100's ~900 GB/s HBM2, a Q4_K_M run of Gemma 4 26B-A4B should theoretically land in the 20–35 tok/s range for decode, but no community measurement is available yet.
Has the HIP hipBLAS PR for gfx900 merged to llama.cpp main? The June 30 post documents a PR-stage benchmark. Until merged, gfx900 users must build from the PR branch to access the +65.1% Gemma 4 12B speedup. A merge date or build-from-source instructions for gfx900 owners would close this gap.
Can the 26B MoE vs 31B Dense comparison be reproduced with a disclosed methodology? The user's finding is intriguing but untestable without hardware, quant, and task details. A structured side-by-side (same context, same prompt, same hardware, both models, measured tok/s and quality score) would confirm or bound the RAG superiority claim.
What is the refusal and benchmark profile of the Gemma 4 12B Heretic uncensored variant compared to the Uncensored-Opus4.7-CoT from the June 27 sweep? Both are uncensored 12B GGUFs with similar ambitions; a side-by-side comparison on the same probe would give practitioners a choice between them with measurable tradeoffs.

Sources

The Gemma-mentioning posts driving this update (July 1 sweep, newest first). All are fresh threads (score ~20, no captured comment threads at sweep time); treat individual numbers as first-look anecdotes rather than settled results:

Tesla V100 16GB local LLMs, single and dual NVLink benchmarks (Jun 30, 2026 — Gemma 4 26B fits single 16 GB V100; dual NVLink = 32 GB + ~2× bandwidth; fp16 only (no bf16/int8); larger MoE spills to CPU on single module; no Gemma 4-specific tok/s)
HIP: use hipBLAS for dense prefill on gfx900, keep MMQ for MoE (Jun 30, 2026 — llama.cpp PR for Vega/gfx900; Gemma 4 12B +65.1%, Qwen3.5 4B +36.1%, Qwen3.6 27B +18.9%; PR-stage only, not yet merged)
Is Gemma 4 31b overkill for a personal assistant/RAG? (Jun 30, 2026 — user reports 26B MoE matching or beating 31B Dense on all personal RAG tests; no hardware/quant/methodology disclosed; anecdotal directional signal)
Anyone using Gemma4:31b over Qwen3.6:27b or 35b(a10) (Jun 30, 2026 — Gemma 4 zero hallucinated folder names in coding workflows; Qwen produces occasional directory typos; Gemma 4 described as "stubborn" about going extra length; qualitative only)
Uncensored Heretic of Gemma 4 12B agentic fine-tune (Jun 30, 2026 — GGUF + safetensors release; 13/100 refusals self-reported; 0.0367 KLD from base; no independent benchmark)

Last updated: 2026-07-01 (July 1 sweep). Confidence: medium. Key findings: Tesla V100 16 GB fits Gemma 4 26B single-module; dual NVLink = 32 GB and doubled bandwidth; HIP hipBLAS PR on gfx900 yields +65.1% Gemma 4 12B prefill speedup (PR-stage, not yet merged); user reports 26B MoE matches or beats 31B Dense for personal RAG workloads; Gemma 4 noted for zero file-path hallucinations vs Qwen in coding contexts; new uncensored agentic 12B GGUF released. Next update fires when the daily Gemma 4 research cron flags notable new findings.

---

Field Notes — 2026-06-30

A weekly synthesis of what the r/LocalLLaMA community is reporting about Gemma 4 in real use. Curated from the latest Gemma-mentioning posts (3 new or updated since 2026-06-29, 450 hardware-mention entries total) and their top comment threads. Confidence is medium unless noted, since this is community signal rather than a controlled benchmark.

June 30 sweep, 2026-06-30 00:00 UTC: a compact cycle carrying three watchlist-grade signals rather than headline benchmark data. The strongest finding is a domain-specific structured-output comparison from a custom medical VQA benchmark: one user reports that Gemma 4 completes structured output generation roughly 5× faster than Qwen when thinking mode is enabled and output format is enforced. A second post repeats a recurring DiffusionGemma friction point — the model cannot be loaded in LM Studio without building llama.cpp from a not-yet-merged PR branch. A third post documents a performance regression when migrating from GPT-OSS 20B Q4 to Gemma 4 12B Q8 on the same hardware, dropping from roughly 70 tok/s to 10 tok/s; the configuration flags in the post suggest a likely misconfiguration rather than a model limitation.

Domain-specific structured-output speed: Gemma 4 runs ~5× faster than Qwen when thinking mode and enforced structured outputs are combined. A user benchmarked 900 manually labeled scanned medical documents with a false-negative-penalizing scoring scheme, finding that Qwen reasoning ran approximately 5× longer than Gemma 4 when thinking mode was enabled and structured output was enforced. The author notes they could not get Qwen to run thinking mode with structured output enforcement at reasonable speed in their setup. The benchmark results were described as "very surprising" because the model ranking did not match the user's expectations from coding task benchmarks, suggesting this is a domain-specific finding tied to structured output generation rather than a general capability ordering. No hardware disclosure, no cloud baseline comparison details, no specific model quants mentioned, no comments captured. Confidence: single-author custom domain benchmark, no controlled methodology disclosed, no comparison to Qwen without thinking mode as a baseline; treat as a directional signal for structured output generation in Gemma 4 vs Qwen, not a general quality verdict. (source, June 29, 2026)

DiffusionGemma in LM Studio: still blocked by unmerged llama.cpp PR. A community member asked whether DiffusionGemma can be run in LM Studio and found that it requires an unmerged PR from the llama.cpp repository to function. Because LM Studio ships its own embedded llama.cpp build, PR-branch features are unavailable until the PR is merged and a new LM Studio release incorporates that build. This is consistent with community reports tracked since June 11 in multiple sweeps. The only working path for DiffusionGemma today is to build llama.cpp from source on the PR branch, which requires manual compilation and is not accessible to users relying on prebuilt frontends. No response or workaround was captured in comments. Confidence: confirmed by multiple reports over multiple weeks; no resolution or workaround available as of this sweep. (source, June 29, 2026)

Gemma 4 12B Dense at Q8/Q5 runs ~10 tok/s on a 20 GB GPU — likely a configuration issue, not a model ceiling. A user migrated from GPT-OSS 20B Q4 (approximately 70 tok/s) to Gemma 4 12B Q8 on what appears to be a server GPU with 20 GB VRAM, reporting a drop to approximately 10 tok/s. The llama.cpp systemd configuration in the post contains several flags that commonly degrade single-user inference performance: `--threads 16` (CPU thread count is irrelevant when the model is fully loaded on GPU), `--prio 2` (reduces process priority), and `-b 4096 -ub 4096` (large batch sizes that add scheduling overhead for low-concurrency workloads). nvidia-smi showed 10 GB of 20 GB VRAM used, confirming the model is fully GPU-loaded. Switching to Q5_K_XL showed no improvement, ruling out quantization as the bottleneck. At 12B Dense Q8, the model requires approximately 13 GB VRAM; the 20 GB card has enough headroom. A working starting point would be: remove or reduce `--threads` to match physical CPU cores (not logical threads), remove `--prio 2`, and reduce batch size to `-b 512 -ub 512` for single-user use. No follow-up or resolution was captured. Hardware: GPU with 20 GB VRAM (exact model not disclosed), running Linux with CUDA. Confidence: single anecdotal misconfiguration report, no resolution captured, no GPU model disclosed; treat as a configuration warning rather than a Gemma 4 capability finding. (source, June 29, 2026)

Open questions

Does the Gemma 4 vs Qwen structured-output speed difference hold across other structured output frameworks? The medical VQA benchmark used a single enforcement mechanism and a single hardware setup. The 5× figure could reflect Qwen's longer reasoning traces rather than a fundamental inference architecture difference. A controlled test with thinking mode disabled on both models, then enabled, would isolate the effect.
When will the DiffusionGemma llama.cpp PR merge? Multiple posts across six weeks have referenced the same unmerged PR. The community benefit of a merged path is high — it would unlock DiffusionGemma support in all llama.cpp-based frontends (LM Studio, Ollama, Jan, etc.) without requiring source builds.
What is the realistic single-user inference speed for Gemma 4 12B Dense Q8 on a well-configured 20 GB GPU? The post in this sweep is a misconfiguration example. A properly configured reference run (minimal threads, small batch, full GPU layers, flash attention) on the same hardware class would close this knowledge gap and give practitioners a reliable baseline.

Sources

The Gemma-mentioning posts driving this update (June 30 sweep, newest first). All are fresh threads (score ~20, no captured comment threads at sweep time); treat individual numbers as first-look anecdotes rather than settled results:

I manually labeled ~900 scanned documents for medical VQA, results were a bit surprising (Jun 29, 2026 — 900-document custom medical VQA benchmark; Gemma 4 ~5× faster than Qwen with thinking + enforced structured output; hardware not disclosed; domain-specific single-author result)
How to run DiffusionGemma in LM Studio? (Jun 29, 2026 — DiffusionGemma blocked by unmerged llama.cpp PR; LM Studio has no workaround; source-build only)
Slow performance Unsloth Gemma 12B Q8 (Jun 29, 2026 — 20 GB GPU; Gemma 4 12B Q8 at 10 tok/s vs GPT-OSS 20B Q4 at 70 tok/s; likely --threads 16, --prio 2, and large batch size; no resolution captured)

Last updated: 2026-06-30 (June 30 sweep). Confidence: medium. Key findings: Gemma 4 structured output in thinking mode completes ~5× faster than Qwen in a custom domain benchmark (anecdotal, domain-specific); DiffusionGemma still blocked in LM Studio by unmerged llama.cpp PR; Gemma 4 12B Q8 performance regression on 20 GB GPU likely caused by misconfiguration. Next update fires when the daily Gemma 4 research cron flags notable new findings.

---

Field Notes — 2026-06-29

A weekly synthesis of what the r/LocalLLaMA community is reporting about Gemma 4 in real use. Curated from the latest Gemma-mentioning posts (4 new or updated since 2026-06-28, 447 hardware-mention entries total) and their top comment threads. Confidence is medium unless noted, since this is community signal rather than a controlled benchmark.

June 29 sweep, 2026-06-29 00:00 UTC: a cycle dominated by ecosystem and tooling evidence rather than raw benchmark numbers. The headline signals are not about throughput: they are about where Gemma 4 is being deployed and what gaps practitioners are working around. The strongest single data point is a complete game NPC backend built on Gemma 4 26B A4B — a concrete end-to-end application with a production stack (STT, LLM, TTS) and a documented architectural choice to use RAG for prompt length control. Alongside that, a community developer published an agent harness explicitly designed around Gemma and Qwen failure modes in small-model settings, listing and addressing a set of known tool-call, state-tracking, and recovery problems that make generic harnesses less suitable for local models. On the infrastructure side, DeepSpec released speculative decoding draft checkpoints for Gemma 4-12B-it (Eagle3, DFlash, and DSpark algorithms) as part of an open-source training and evaluation codebase — the first external speculative decode training kit with a public Gemma 4 entry. A fourth practical post documents a llama.cpp VRAM and buffer analysis script from a practitioner who uses Gemma 4 MoE as a daily workhorse on a 9060XT 16 GB.

Game NPC backend built on Gemma 4 26B A4B: SillyTavern architecture, RAG-controlled prompts, fast local response. A developer published a game-agnostic NPC engine using a local model stack: NVIDIA Parakeet 0.6 for speech-to-text, Gemma 4 26B A4B as the inference model, and Qwen3-TTS for voice output. The reported result is "super fast response times with pretty decent quality." The architectural detail worth noting is the prompt-length strategy: the game has hundreds of possible NPC actions, and only the subset that make contextual sense for the current turn is injected via RAG rather than flooding the prompt with the full action list. This is a practical demonstration of RAG-as-filter for structured constrained generation — a use pattern where Gemma 4 26B A4B's 128K context headroom is available but is deliberately kept short to preserve latency. No specific token-per-second figures are reported; the author's emphasis is on the response speed being subjectively fast enough for real-time game use. The SillyTavern-style architecture comment suggests the harness handles multi-turn memory and character state in addition to the per-turn RAG injection. Hardware not disclosed. Confidence: single-author first-person report, architecture disclosed, no throughput numbers. (source, June 28, 2026)

Agent harness for Gemma and Qwen family small models: addresses six documented failure modes. A community developer released a GitHub project specifically designed to host Qwen and Gemma family models in agentic settings, citing a consistent set of failure modes observed across generic harnesses: (1) failed tool calls, (2) poor verification of environment variables, (3) poor recovery on common failure modes, (4) generation halting during inference on local backends, (5) poor state tracking during multi-step goals, (6) poor local/remote task separation. The framing — "the harness needs to be built around the local model" — reflects a pattern Gemmaclaw has tracked across multiple sweeps: generic agent scaffolding optimized for frontier API models transfers poorly to quantized local models that stall, emit partial JSON, or lose state across tool-call chains. The author demonstrated the harness managing a server with Qwen 3.5 4B and Qwen 3.69B. Gemma support is listed as a target of the project, alongside Qwen. No Gemma 4-specific benchmark numbers are provided; the post's value is as a community acknowledgment that local-model agentic work needs model-family-aware harnesses. The GitHub link is in the original post. Confidence: developer release post, failure modes disclosed, no Gemma 4 throughput or tool-call accuracy numbers. (source, June 29, 2026)

DeepSpec releases Eagle3, DFlash, and DSpark checkpoints for Gemma 4-12B-it: first open speculative decode training kit with a Gemma 4 entry. The DeepSpec project (a DeepSeek community collection) published a full-stack codebase for training and evaluating speculative decode draft models, releasing checkpoints used in their benchmark paper. The Gemma 4-12B-it target is included across all three algorithm families: `deepseek-ai/eagle3_gemma4_12b_ttt7`, `deepseek-ai/dflash_gemma4_12b_block7`, and `deepseek-ai/dspark_gemma4_12b_block7`. Each checkpoint was trained on open-perfectblend data generated by the corresponding target model in non-thinking mode. The codebase includes data preparation utilities, draft model implementations, training code, and evaluation scripts. An important caveat from the team: "If you cite these results in a new paper, align your setup with the training settings in this repository; otherwise, the comparison is not meaningful." No Gemma 4-specific throughput improvement figures were reported in the post itself; results are in their paper under Table 1. The practical note for Gemmaclaw users: speculative decode draft models for Gemma 4-12B-it are now publicly available and trainable with open code, which is a different situation from the MTP-only path that llama.cpp users have relied on. Compatibility with llama.cpp is not confirmed in the post; Eagle3, DFlash, and DSpark require backend support. Confidence: official release with open code; no Gemma 4-12B-specific throughput numbers in the post; paper alignment required for reproducible comparison. (source, June 28, 2026)

llama.cpp VRAM and buffer analysis script: Gemma 4 MoE and Qwen 3.6 MoE as daily workhorses on a 9060XT 16 GB. A practitioner shares a Python script that parses llama.cpp verbose startup output (`-v` flag) to produce a human-readable summary of buffer allocations grouped by function and backend, total VRAM and RAM usage, tokens-per-second, and MTP performance. The motivation is the pervasive vagueness around VRAM and RAM requirements for specific quantizations: training-precision guides suggest Q4 as a starting point while community experience lands on Q6 or Q8 for acceptable quality, making memory planning harder than it should be. The author's current setup uses Gemma 4 MoE editions (along with Qwen 3.6 MoE) on a single 9060XT with 16 GB RAM as a daily productivity rig, framing both as well-suited to commodity hardware. No specific model quant or throughput numbers are provided for Gemma 4 in this post; the value is the shared script and the implicit confirmation that Gemma 4 MoE variants are a practical daily-driver choice on a mid-range AMD GPU at 16 GB. The script requires Linux and expects llama.cpp to be launched from a `run.sh` file with the `-v` flag. Confidence: practitioner tooling post, hardware disclosed (9060XT 16 GB), no Gemma 4-specific benchmark numbers in post. (source, June 28, 2026)

Open questions

What throughput does Gemma 4 26B A4B achieve in the NPC backend setup? The post reports subjectively fast response times but no tok/s figure. Given the RAG-filtered short prompt strategy and local serving, a follow-up from the author or a reproduction on a disclosed GPU would make this a quantifiable data point for real-time interactive use cases.
Does the agent harness for Gemma and Qwen families publish Gemma 4-specific tool-call accuracy benchmarks? The failure modes list addresses known issues without a baseline number. A controlled comparison of Gemma 4 26B A4B tool-call success rate in a generic harness versus this model-aware harness would quantify the improvement and help practitioners evaluate whether the harness overhead is worth it.
Are Eagle3, DFlash, or DSpark speculative decode modes for Gemma 4-12B-it compatible with llama.cpp in a usable way? DeepSpec's open-source checkpoints are available, but backend support for these algorithms is the limiting factor for llama.cpp users. If community developers build llama.cpp-compatible inference wrappers, the Gemma 4-12B-it speculative decode family would gain a new serving path alongside the existing MTP-based route.
What VRAM and RAM does the 9060XT user allocate for Gemma 4 MoE at their working quant? The memory analysis script post confirms Gemma 4 MoE is practical on a 16 GB single-card setup, but the exact quant and buffer sizes are not disclosed. A published script run output for Gemma 4 26B A4B at Q4_K_M or Q6_K would be the most immediately useful community contribution from this post.

Sources

The Gemma-mentioning posts driving this update (June 29 sweep, newest first). All are fresh threads (score ~20, no captured comment threads at sweep time); treat individual numbers as first-look anecdotes rather than settled results:

I built an agent Harness for Small Models. I got Qwen 3.5 4b managing servers. (Jun 29, 2026 — Gemma and Qwen family agent harness targeting 6 local-model failure modes; no Gemma 4 throughput numbers; GitHub link in post)
NPC Engine Using Local Models (Jun 28, 2026 — Gemma 4 26B A4B + Parakeet 0.6 STT + Qwen3-TTS; RAG-filtered NPC action injection; "super fast response times"; no tok/s)
Script to monitor llama cpp and analyze memory usage (Jun 28, 2026 — 9060XT 16 GB; Gemma 4 MoE + Qwen 3.6 MoE as daily workhorses; Python buffer-parsing script for VRAM/RAM planning; no quant or throughput numbers)
DeepSpec - a deepseek-ai Collection (Jun 28, 2026 — Eagle3/DFlash/DSpark checkpoints for google/gemma-4-12B-it released; open-source training + evaluation code; align training settings before benchmark comparison; no throughput numbers in post)

Last updated: 2026-06-29 (June 29 sweep). Confidence: medium. Key findings: Gemma 4 26B A4B in live game NPC backend with RAG-filtered prompts; community agent harness for Gemma/Qwen local-model failure modes; DeepSpec speculative decode checkpoints (Eagle3/DFlash/DSpark) for Gemma 4-12B-it released with open training code; Gemma 4 MoE confirmed as a daily-driver choice on 9060XT 16 GB. Next update fires when the daily Gemma 4 research cron flags notable new findings.

---

Field Notes — 2026-06-28

A weekly synthesis of what the r/LocalLLaMA community is reporting about Gemma 4 in real use. Curated from the latest Gemma-mentioning posts (4 new or updated since 2026-06-27, 444 hardware-mention entries total) and their top comment threads. Confidence is medium unless noted, since this is community signal rather than a controlled benchmark.

June 28 sweep, 2026-06-28 00:00 UTC: a compact but productive cycle. The headline finding is a controlled measurement of MTP acceptance rates across quantization levels for Gemma 4 31B — the most rigorous community experiment in several sweeps. A practitioner tested four quantization levels (Q5_K_S through IQ2_M) with the native MTP drafter across draft depths 1 through 4, finding that acceptance rates hold nearly flat from Q5_K_S down to IQ3_M (within 2 percentage points at every draft depth), while IQ2_M shows measurable but modest degradation. This is concrete guidance for users choosing quant levels when MTP is in play. The cycle also carries a forward-looking announcement: the Orthrus team states diffusion-head checkpoints trained on Gemma 4 are coming soon alongside open-source training and evaluation code. Two shorter posts round out the sweep: a community note that Google ran hackathons celebrating 1500 tok/s Gemma 4 31B cloud inference, and an inconclusive visual comparison of HTML email output from Gemma 4 26B-A4B QAT against two Qwen 3.6 variants with no captured winner.

Gemma 4 31B MTP acceptance rate survives aggressive quantization: IQ4_XS and IQ3_M match Q5_K_S within measurement noise. A community member ran a structured experiment testing how quantization level affects speculative decode acceptance when using Gemma 4 31B as both trunk and drafter. The setup: Gemma 4-31B-it quantized GGUFs as the trunk, Gemma 4-31B-it-assistant as the MTP drafter, temperature 0.3, thinking disabled, 5 mixed coding/reasoning prompts at 200 tokens per run, 3 repetitions with distinct seeds, results reported as mean ± 1σ. Acceptance rates across quantization and draft depth:

Q5_K_S: n=1: 88.5 ±1.0%, n=2: 81.9 ±0.3%, n=3: 74.2 ±0.9%, n=4: 66.7 ±0.5%
IQ4_XS: n=1: 86.7 ±0.1%, n=2: 80.3 ±0.9%, n=3: 72.3 ±0.5%, n=4: 65.2 ±0.9%
IQ3_M: n=1: 86.8 ±0.9%, n=2: 78.3 ±0.2%, n=3: 71.7 ±1.6%, n=4: 65.0 ±2.0%
IQ2_M: n=1: 84.5 ±0.5%, n=2: 76.7 ±2.5%, n=3: 69.3 ±1.5%, n=4: 61.2 ±2.0%

The key finding: at every draft depth, Q5_K_S, IQ4_XS, and IQ3_M are statistically indistinguishable — the gap is 1–2 points, within the variance of a 3-rep experiment. This means a user running Gemma 4 31B with MTP can drop from Q5_K_S to IQ3_M and not lose meaningful speculative decode efficiency. IQ2_M is the outlier: it trails Q5_K_S by about 4 points at n=1, widening to about 5.5 points at n=4. Deeper draft depths (n=3, n=4) show declining acceptance across all quants, which is expected — each additional speculative token is harder to verify exactly. The experiment deliberately isolates acceptance rate from throughput; throughput gains from higher acceptance will depend on hardware memory bandwidth and can be computed from acceptance tables published elsewhere in the community. Confidence: structured controlled experiment with replications and standard deviations reported; hardware setup and specific VRAM not disclosed; treat as a reliable directional finding pending broader hardware-class replication. (source, June 27, 2026)

Orthrus diffusion-head checkpoints for Gemma 4 coming soon, with open-source training code. The Orthrus team announced they have completed testing and are preparing the release pipeline for diffusion-head checkpoints trained on Qwen 3.5, Qwen 3.6, and Gemma 4 models. A HuggingFace stub (`chiennv/Orthrus-Qwen3-8B`) is already published. The team plans to open-source their complete end-to-end training and evaluation code alongside the model checkpoints. Orthrus is an approach that adds a trained diffusion-style prediction head to an autoregressive backbone, allowing block-parallel speculative generation without a separate draft model. No llama.cpp support exists or is planned by the Orthrus team at announcement; backend support will depend on community development. No benchmark numbers, VRAM requirements, or quantization formats were disclosed in the announcement. This is a "coming soon" post, not a benchmark, and should be treated as a watch item until the actual release lands with reproducible results. Confidence: team announcement only, no benchmarks. (source, June 27, 2026)

Context note: Google running hackathons at 1500 tok/s for Gemma 4 31B — 50–100× faster than consumer hardware. A community post referenced Google hackathons celebrating 1500 tokens per second inference for Gemma 4 31B. The figure is consistent with multi-card server deployments using NVFP4 quantization on Blackwell hardware, but no source link or event details were provided. For calibration: the best documented single-consumer-GPU throughput for Gemma 4 31B is around 177 tok/s with BeeLlama DFlash on a single RTX 3090, and standard llama.cpp without DFlash or MTP runs in the 20–30 tok/s range on the same card. The 1500 tok/s benchmark represents the professional cloud inference tier — not a local target. The post's broader argument is that big players see genuine value in small-model software engineering, which the community generally agrees with. Confidence: anecdotal community reference; throughput figure not independently confirmed in this post; treat as context only. (source, June 27, 2026)

HTML email quality comparison: Gemma 4 26B-A4B QAT vs Qwen 3.6 variants — no captured verdict. A developer deploying the Olib-AI/mailcue MCP email server tested three models side-by-side for HTML email generation quality: `google/gemma-4-26b-a4b-qat`, `qwen/qwen3.6-35b-a3b`, and `qwen/qwen3.6-27b`. The comparison is framed as a visual "guess which model" exercise with screenshots, asking readers to identify which model produced which email. No comments were captured at sweep time, so no community consensus on a winner is available. The practical note is that Gemma 4 26B-A4B QAT was included alongside strong Qwen 3.6 variants in a real structured-output quality test, and the framing as a blind comparison suggests the author found the results interesting enough to share. Without captured results, this contributes no actionable signal on relative quality. Confidence: no captured results or winner, visual comparison only. (source, June 27, 2026)

Open questions

What throughput does Gemma 4 31B MTP actually produce at each quant level, given the acceptance rates above? The acceptance rate experiment kept throughput as a separate variable. A follow-up that reports measured decode tok/s alongside acceptance rate for Q5_K_S, IQ4_XS, IQ3_M, and IQ2_M would make the combined speed-quality tradeoff directly actionable.
When Orthrus checkpoints for Gemma 4 are released, how much additional VRAM does the diffusion head require? If the head is small relative to the backbone, 24 GB cards remain viable. If it pushes requirements above 24 GB, users will need to evaluate quantized variants or multi-GPU setups.
Is there a working llama.cpp image resolution configuration for Gemma 4 12B? This question opened in the June 27 sweep (31B flags crash the 12B server) and remains unresolved in the June 28 batch.

Sources

The Gemma-mentioning posts driving this update (June 28 sweep, newest first). All are fresh threads (score ~20, no captured comment threads at sweep time); treat individual numbers as first-look anecdotes rather than settled results:

Does quantizing change the MTP draft rate? (Jun 27, 2026 — Gemma 4 31B trunk + Gemma 4 31B MTP drafter; n=1–4 acceptance rates: Q5_K_S 88.5–66.7%, IQ4_XS 86.7–65.2%, IQ3_M 86.8–65.0%, IQ2_M 84.5–61.2%; IQ4_XS and IQ3_M statistically match Q5_K_S; IQ2_M trails by 4–5.5 pp)
Orthrus (diffusion head) trained Qwen 3.5/3.6 and Gemma 4 models are dropping soon (Jun 27, 2026 — team announces imminent release of diffusion-head checkpoints for Gemma 4; open-source training code included; no llama.cpp support; no benchmarks)
Even Google still believes in small models for coding. (Jun 27, 2026 — community reference to Google hackathons at 1500 tok/s Gemma 4 31B; context-only note, no local hardware data)
Tested which model can send best HTML email (Jun 27, 2026 — Gemma 4 26B-A4B QAT vs Qwen 3.6 35B A3B and 27B for HTML email generation; no captured winner, no comments)

_Last updated: 2026-06-28 (June 28 sweep). Confidence: medium. Key findings: Gemma 4 31B MTP acceptance rates hold flat Q5_K_S → IQ3_M (within 2pp at all draft depths); IQ2_M costs 4–5.5pp; Orthrus diffusion-head for Gemma 4 coming soon; Google cloud 1500 tok/s Gemma 4 31B context note; HTML email quality comparison inconclusive. Next update fires when the daily Gemma 4 research cron flags notable new findings._

---

Field Notes — 2026-06-27

A weekly synthesis of what the r/LocalLLaMA community is reporting about Gemma 4 in real use. Curated from the latest Gemma-mentioning posts (4 new or updated since 2026-06-26, 440 hardware-mention entries total) and their top comment threads. Confidence is medium unless noted, since this is community signal rather than a controlled benchmark.

June 27 sweep, 2026-06-27 00:00 UTC: a compact cycle with four signals across three practical themes. The headline is a multi-GPU tensor split mode incompatibility with Gemma 4 26B running across three GPUs (RTX 5080 + 2× 5060 Ti) in llama.cpp: tensor-parallel split causes tool-call loops and reasoning trace repetition in OpenCode, while layer split runs cleanly — a concrete configuration warning for multi-GPU builders. A second post documents a known image resolution deficit in Gemma 4 12B compared to Qwen 3.6: the standard llama.cpp vision token range flags for the 31B crash the 12B server, leaving no obvious workaround yet. A community fine-tuner released a Gemma-4-12B-IT-Uncensored variant with benchmarks that show abliteration alone costs a large fraction of reasoning capability, while a chain-of-thought SFT step largely recovers it — a useful data point on abliteration cost for this model family. A fourth post from the same cycle (Qwen-centric, touching Gemma 4 in name only) adds peripheral evidence that MTP can reduce code-review quality versus throughput in multi-GPU llama.cpp configurations. Two further threads from the batch were reviewed and excluded as carrying no Gemma-4-specific signal (Ornith-1.0 release and Ornith-1.0 terminology guide).

Multi-GPU tensor split mode causes tool-call loops with Gemma 4 26B in OpenCode — layer split is safe. A user running Gemma 4 26B-A4B (and Qwen 3.6 27B) across an RTX 5080 + 2× RTX 5060 Ti reports that setting llama.cpp split mode to tensor (`-sm tensor`) causes looping problems specifically in tool calls and reasoning traces when using OpenCode. Layer split mode (`-sm layer`, the default for multi-GPU in llama.cpp) works correctly. The user noticed the issue affects both models consistently, suggesting this is a llama.cpp tensor-parallelism behavior rather than a Gemma 4 model defect. No comments were captured at sweep time, so the community consensus on a root cause or fix is unknown. Practical guidance for multi-GPU builders running Gemma 4 26B-A4B with OpenCode or similar agentic frameworks: use layer split (`-sm layer` or omit the flag to accept the default) rather than tensor split until this incompatibility is resolved or investigated in a dedicated thread. Confidence: single-author field report, hardware disclosed, both models affected consistently, no captured community response. (source, June 26, 2026)

Gemma 4 12B vision: poor resolution for small-text detection, and the 31B's image token flags crash the 12B server. A user using Gemma 4 12B as an all-purpose assistant reports a consistent failure to detect smaller text in images that Qwen 3.6 handles reliably. Even larger compositional elements in images fail intermittently. When the user attempted to apply the llama.cpp vision resolution flags documented for the Gemma 4 31B — `--image-min-tokens 560 --image-max-tokens 2240` — the 12B server crashed and quit rather than improving performance. This suggests the 31B's token range parameters are not directly transferable to the 12B. No alternative workaround was captured in the thread. This is a practical limitation note rather than a regression: Gemma 4's image resolution handling is a known area where the community has documented gaps relative to Qwen 3.6 in previous sweeps, and this report extends that pattern to small-text detection on the 12B. For users who rely on image OCR or small-text extraction, Qwen 3.6 remains the stronger choice under llama.cpp at the current community signal level. Confidence: single anecdote, crash confirmed by the author applying 31B flags to 12B, no fix found. (source, June 26, 2026)

Gemma-4-12B-IT-Uncensored-Opus4.7-CoT: abliteration hurts, CoT SFT largely recovers — a quantified benchmark. A community packager released `gemma-4-12B-it-uncensored-opus4.7-cot`, a variant of the Gemma 4 12B base model where safety filtering is removed via abliteration and reasoning capability is partially restored via chain-of-thought supervised fine-tuning. The published benchmarks against the base model: MMLU 0.777 (base) → 0.635 (abliterated) → 0.739 (this model, SFT); GSM8K 0.949 (base) → 0.496 (abliterated) → 0.920 (SFT); WikiText-2 bits/byte 1.834 (base) → 2.095 (abliterated) → 1.717 (SFT, better than base). The benchmarks reveal a consistent pattern: raw abliteration degrades Gemma 4 12B substantially (MMLU −18 points, GSM8K −47 points), while the subsequent CoT SFT recovers most of the loss — and notably reduces per-token perplexity below the original (WikiText-2 bits/byte 1.717 < 1.834), which the author attributes to the quality of the CoT fine-tuning data. GGUFs are available on HuggingFace. This is a useful calibration point for practitioners evaluating community-uncensored Gemma 4 variants: the capability cost of abliteration alone is large on this model family; a quality-recovering SFT step is not optional if reasoning benchmarks matter. Limitations: self-reported benchmarks from the releasing author, no third-party reproduction; the WikiText perplexity improvement should be treated as an artifact of the SFT training distribution rather than a general capability gain. Confidence: developer benchmark release, methodology disclosed, independent verification absent. (source, June 25, 2026)

Open questions

What llama.cpp image token flags are correct for the Gemma 4 12B? The 31B flags (`--image-min-tokens 560 --image-max-tokens 2240`) crash the 12B server. The right parameter range for the 12B is undocumented in the community at this sweep. A working configuration that improves small-text detection without crashing would resolve the practical question from this sweep.
Does tensor split in llama.cpp consistently cause tool-call loops across all agentic frameworks, or is it specific to the OpenCode integration with Gemma 4? The report covers one user's OpenCode setup. If the loop behavior is llama.cpp-level (the tensor-parallel execution path mis-handles the attention sink or tool-call grammar), it would affect any framework; if it is an OpenCode prompt-handling issue, it might have a workaround. A second data point from a different framework (Ollama with the Gemma 4 GGUF, a bare llama-server, or a different front end) would help localize the cause.
Is the CoT SFT improvement in WikiText-2 perplexity for the Uncensored model a training-distribution artifact or a reproducible general gain? Perplexity below the base model is unusual after abliteration + SFT. If the WikiText improvement is consistent with other perplexity probes (e.g., a diverse domain-balanced test set), it would suggest the CoT data improved the model beyond the abliteration baseline. If it's narrow to WikiText-style text, it may simply reflect the nature of the training corpus.

Sources

The Gemma-mentioning posts driving this update (June 27 sweep, newest first). All are fresh threads (score ~20, no captured comment threads at sweep time); treat individual numbers as first-look anecdotes rather than settled results:

Gemma 4 12b needs glasses (Jun 26, 2026 — Gemma 4 12B llama.cpp vision; poor small-text detection; 31B image-min/max-tokens flags crash the 12B server; Qwen 3.6 handles the same images correctly; no fix found)
Does llama cpp split mode tensor cause issues? (Jun 26, 2026 — RTX 5080 + 2× 5060 Ti; Gemma 4 26B-A4B + Qwen 3.6 27B split across three GPUs; tensor split mode (`-sm tensor`) causes looping in OpenCode tool calls and reasoning traces; layer split works fine)
[[R] Gemma-4-12B-IT-Uncensored-Opus4.7-CoT (No Intel Loss)](https://reddit.com/r/LocalLLaMA/comments/1uf8ksz) (Jun 25, 2026 — abliteration + CoT SFT on Gemma 4 12B; MMLU 0.777→0.739, GSM8K 0.949→0.920, WikiText-2 bits/byte 1.834→1.717; GGUFs on HuggingFace; self-reported benchmarks, no independent validation)
Worse quality with MTP - Qwen 3.6, Gemma 4 (Jun 25, 2026 — 4× RTX 5070 Ti; Qwen 3.6 27B Q8_K_XL with MTP shows worse code-review output in 8/10 tests versus non-MTP, despite higher TG throughput (50–60 vs 100–120 tok/s); Gemma 4 mentioned in title but post is Qwen-centric; relevant as a multi-GPU MTP quality-tradeoff signal)

Two Gemma-mentioning threads were reviewed and intentionally excluded for carrying no Gemma-4-specific hardware, quant, or quality signal:

Ornith-1.0 released on Hugging Face (Jun 25, 2026 — announcement of a new open model family; no Gemma 4 mention)
Ornith 1.0 - terminology and concepts explained (basic) (Jun 26, 2026 — introductory guide to Dense vs MoE concepts using Ornith as the example; no Gemma 4 mention)

Last updated: 2026-06-27 (June 27 sweep). Confidence: medium. Key findings: tensor split mode (`-sm tensor`) in llama.cpp causes tool-call loops with Gemma 4 26B-A4B across a 5080+2×5060 Ti setup in OpenCode — use layer split; Gemma 4 12B vision has documented small-text detection gaps vs Qwen 3.6 and the 31B's image token flags crash the 12B server; Gemma-4-12B-IT-Uncensored-Opus4.7-CoT benchmarks show abliteration alone costs MMLU −18 pts and GSM8K −47 pts, CoT SFT recovers most of that. Next update fires when the daily Gemma 4 research cron flags notable new findings.

---

Field Notes — 2026-06-25

A weekly synthesis of what the r/LocalLLaMA community is reporting about Gemma 4 in real use. Curated from the latest Gemma-mentioning posts (5 new or updated since 2026-06-24, 434 hardware-mention entries total) and their top comment threads. Confidence is medium unless noted, since this is community signal rather than a controlled benchmark.

June 25 sweep, 2026-06-25 01:00 UTC: a release-driven cycle with one model drop and three practical reports. The headline is the arrival of MTP-equipped "Uncensored Balanced" QAT builds of the larger Gemma 4 models — Gemma 4 26B-A4B and 31B — from the HauhauCS community packager, who claims a 35% (26B-A4B) and 53% (31B) decode speedup from the bundled multi-token-prediction draft heads and a 0/465 refusal rate on their internal GenRM probe. These are repackages of the official Google QAT weights, not new base models, so the value is the bundled MTP plus the uncensoring rather than any change to Gemma 4's underlying quality. Beyond the release, the cycle adds an Apple Silicon low-quant data point — Gemma 4 26B-A4B at IQ3_S running ~25 tok/s on a 16 GB M3 MacBook Air, reported "really close to bf16" for non-coding assistant use — a llama.cpp Vulkan-backend corruption report on an old Intel iGPU (duplicate and `` tokens with Gemma 4 E2B IQ4_NL), and a multi-GPU PCIe-splitting calibration question on a 5070 Ti + 4070 box running a Gemma 4 26B agent. One further Gemma-mentioning thread was reviewed and left out for carrying no Gemma-specific signal.

MTP "Uncensored Balanced" QAT builds of Gemma 4 26B-A4B and 31B claim 35% and 53% decode speedups. The HauhauCS community packager released MTP-equipped, uncensored "Balanced" builds of the two larger Gemma 4 QAT models — `Gemma4-26B-A4B-QAT-Uncensored-HauhauCS-Balanced-MTP` and `Gemma4-31B-QAT-Uncensored-HauhauCS-Balanced-MTP` — with the post title claiming a 35% speed boost on the 26B-A4B and 53% on the 31B. The speedup comes from the bundled multi-token-prediction (MTP) draft head, the same speculative-decoding mechanism documented in earlier sweeps (the June 17 RX 6600 XT report measured 64–99% MTP acceptance on Gemma 4 12B QAT). Two things bound how to read the headline numbers. First, MTP gains are workload-dependent: prior community data (the MTP benchmark thread from the back catalog) found speculative decoding helps structured/coding generation but can slow creative writing, so a single percentage is a best-case average rather than a guarantee for your task. Second, these are repackages of the original Google Gemma 4 26B-A4B-QAT and 31B-QAT weights, "just uncensored" per the author, with a "light reasoning preamble on the absolute edgiest stuff" — there is no claim of improved reasoning or knowledge over stock Gemma 4, and the author notes a handful of edge-case prompts still deflect on the first try. The provenance signal is moderate: the packager reports nearing 20 million HuggingFace downloads on their account and ~5,000 Discord members, but the 0/465 refusal figure is their own internal GenRM probe, not an independent eval, and no throughput methodology (hardware, quant, context, acceptance rate) is published for the 35%/53% claims. Practical read: if you already run Gemma 4 26B-A4B or 31B QAT and want MTP speculative decoding plus uncensoring in one GGUF, this is a convenient prepackaged path — but verify the speedup on your own hardware and workload, and treat the uncensored behavior as something to validate against your use case rather than assume. Confidence: developer release announcement, self-reported speed and refusal numbers, no independent benchmark. (source, June 25, 2026)

Gemma 4 26B-A4B at IQ3_S on a 16 GB M3 MacBook Air: ~25 tok/s and "really close to bf16" for non-coding assistant work. A user experimenting with aggressive weight quantization reports running Gemma 4 26B-A4B at IQ3_S (an Unsloth UD-Q3 dynamic quant) on an Apple M3 MacBook Air with 16 GB unified memory, getting a steady 25 tokens/second decode and finding the output "really close to the bf16 for my use cases" — explicitly no coding and no tool calling, i.e. general assistant and chat use. The author is candid that this may be confirmation bias and asks the community whether UD-Q3 quants are genuinely this usable. This is a useful Apple Silicon data point for the smallest practical Mac tier: a 16 GB MacBook Air cannot hold the 26B-A4B at Q4K_M with comfortable context headroom, but the MoE's ~4B active parameters per token mean an IQ3_S weight quant stays interactive at 25 tok/s on the M3's integrated GPU. Note this is a weight-quantization report and is independent of the June 24 KV-cache finding (where `q4_0` \_KV cache was catastrophic on Gemma 4 E2B QAT) — the two compress different things, so a usable IQ3_S weight quant does not contradict the "Q8 KV yes, Q4 KV no" cache rule. The reliability bound is the usual one for this kind of post: a single first-person impression with no perplexity score or task benchmark, and the author's own confirmation-bias caveat. The takeaway worth recording is that for non-coding, non-tool assistant work on a 16 GB Mac, the 26B-A4B at IQ3_S/UD-Q3 is reported as a viable interactive option rather than a degraded one. Confidence: single-author anecdote, no quality measurement, self-flagged confirmation-bias risk. (source, June 24, 2026)

llama.cpp Vulkan backend on an old Intel iGPU emits duplicate and `` tokens with Gemma 4 E2B IQ4_NL. A user on aging hardware — an Intel Core i7-8550U laptop with Intel UHD 620 and Radeon 530 GPUs — reports that `llama.cpp` build b9763 with the Vulkan backend produces corrupted output (duplicate tokens, and sometimes `` special tokens) when running `gemma-4-E2B-it-IQ4_NL`. The user tested with and without `--no-mmap` and direct I/O and saw the corruption persist. This is a backend/hardware-compatibility caution rather than a Gemma 4 model defect: the `` artifact superficially resembles the `` repetition seen in the June 22/23 sweeps with 31B QAT MTP GGUFs, but the root cause is different — that earlier issue was an MTP-branch loading problem, while this is the Vulkan compute path on an old integrated GPU. Practical guidance for users on older iGPU + Vulkan setups: if Gemma 4 (even the tiny E2B) emits duplicate or unused-token garbage, suspect the Vulkan backend on the specific GPU before suspecting the GGUF — try the CPU backend, a newer llama.cpp build, or a different quant (e.g. a standard Q4_K_M instead of IQ4_NL) to isolate whether the kernel or the model file is at fault. No fix or root-cause resolution was captured in the thread at sweep time. Confidence: single bug report, hardware and build disclosed, outcome unresolved. (source, June 24, 2026)

Multi-GPU calibration: splitting a PCIe 5.0 x16 slot into 2×8 for a 5070 Ti + 4070 Gemma 4 26B agent box. A user running a daytime Hermes Agent on Gemma 4 26B (plus Qwen) asks whether splitting their PCIe 5.0 x16 slot into two x8 lanes with a riser would help. Their system: an Intel i5-14600KF (20 PCIe lanes), 32 GB RAM, RTX 5070 Ti 16 GB on PCIe 5.0 x16, and an RTX 4070 on a PCIe 4.0 x16 slot routed through the Z790 chipset. The reported behavior: generation is fast at 16K context (~3 s) but slows substantially at 128K context, where OCR-style requests take 10–15 s. The useful framing here is that the 128K slowdown is almost certainly prompt-processing / KV-cache bound at long context, not PCIe-bandwidth bound — so splitting the 5070 Ti's slot to x8 would mostly risk throttling the faster card without fixing the long-context latency, which is the opposite of what the user wants. This echoes the June 18 dual-GPU PCIe trap (an RTX 4080 on an x4 slot capped a 31B layer-split at 26–28 tok/s): inter-GPU bandwidth matters for tensor/layer-split inference, but a single card's slot width does not determine long-context prefill speed. Practical read for similar builders: keep the primary card on the full x16, place the secondary on whatever lanes remain, and address 128K OCR latency through context/KV-cache tuning (e.g. `q8_0` KV cache on QAT weights to fit more context, flash attention, or a smaller working context) rather than by re-slicing PCIe lanes. No measured tok/s figures were provided beyond the latency anecdote. Confidence: configuration question, single user, latency anecdote only, no throughput benchmark. (source, June 24, 2026)

Open questions

Do the HauhauCS MTP builds actually deliver 35%/53% on real hardware, and on which workloads? The claimed speedups have no published methodology (hardware, quant, context length, MTP acceptance rate, task mix). Given prior evidence that MTP helps coding/structured output but can slow creative generation, an independent tok/s comparison of stock Gemma 4 26B-A4B/31B QAT vs the MTP builds on a fixed workload would turn the headline into guidance.
How far down the quant ladder does Gemma 4 26B-A4B stay usable, and by what measure? The IQ3_S "close to bf16" report on a 16 GB M3 is an encouraging anecdote but unmeasured. A perplexity or task-accuracy curve for the 26B-A4B from Q4_K_M down through IQ3_S/UD-Q3 on assistant and RAG tasks would settle whether IQ3_S is genuinely near-lossless for non-coding use or whether the author's confirmation-bias worry is warranted.
Is the Vulkan `` corruption specific to old iGPUs or a broader llama.cpp Vulkan regression? The b9763 report is on an Intel UHD 620 / Radeon 530 laptop. It is unknown whether the same build corrupts Gemma 4 output on modern Vulkan targets (RDNA3/4, Arc, newer Intel iGPUs) or whether this is confined to pre-Vulkan-1.3-class integrated GPUs. A second data point on newer hardware would localize the bug.

Sources

The Gemma-mentioning posts driving this update (June 25 sweep, newest first). All are fresh threads (score ~20, no captured comment threads at sweep time); treat individual numbers as first-look anecdotes rather than settled results:

Gemma4-26B-A4B & 31B-QAT Uncensored Balanced are out with MTP (35% & 53% speed boost)! (Jun 25, 2026 — HauhauCS MTP + uncensored "Balanced" repackages of the official Gemma 4 26B-A4B-QAT and 31B-QAT; claimed 35% / 53% decode speedup from bundled MTP draft heads, 0/465 internal GenRM refusals; self-reported numbers, no independent benchmark or throughput methodology)
Gemma 4 26BA4B Surprisingly Usable at IQ3_S – Are small quants really this usable? (Jun 24, 2026 — Apple M3 MacBook Air 16 GB; Gemma 4 26B-A4B at IQ3_S/UD-Q3 → ~25 tok/s decode, "really close to bf16" for non-coding assistant use; single anecdote, author flags confirmation-bias risk, no quality measurement)
llama.cpp with vulkan backend outputting duplicate tokens, and sometimes <unusedXX> tokens (Jun 24, 2026 — Intel i7-8550U / UHD 620 + Radeon 530, llama.cpp b9763 Vulkan; gemma-4-E2B-it-IQ4_NL emits duplicate and `` tokens with/without --no-mmap; backend/old-iGPU compatibility issue, unresolved)
PCIE 5.0 16x split into 2x8 with riser cable (Jun 24, 2026 — i5-14600KF 20 lanes, 5070 Ti 16 GB PCIe 5.0 x16 + 4070 PCIe 4.0 x16 chipset; daytime Hermes Agent on Gemma 4 26B + Qwen; fast at 16K context ~3 s, slow at 128K OCR 10–15 s; PCIe-split question, no throughput numbers)

One additional Gemma-mentioning thread was reviewed and intentionally left out of the findings because it carries no Gemma-4-specific signal:

Community distillation project: Capturing GLM-5.2 + Claude Opus level reasoning in a runnable open model (Jun 24, 2026 — a recruitment/coordination post to distill GLM-5.2 and Claude Opus 4.8 reasoning traces into a ~70B-or-smaller open model; Gemma 4 is named only as one of two candidate base models alongside Qwen 3.6, with no Gemma-specific hardware, quant, or quality observation).

_Last updated: 2026-06-25 (June 25 sweep). Confidence: medium. Key findings: HauhauCS released MTP-equipped, uncensored "Balanced" repackages of the official Gemma 4 26B-A4B-QAT and 31B-QAT, claiming 35% and 53% decode speedups from the bundled MTP draft heads and 0/465 internal GenRM refusals (self-reported, no independent benchmark — and MTP gains are workload-dependent, helping coding but potentially slowing creative writing); Gemma 4 26B-A4B at IQ3_S/UD-Q3 reported ~25 tok/s and "close to bf16" for non-coding assistant work on a 16 GB M3 MacBook Air (anecdotal, weight-quant not KV-cache); llama.cpp b9763 Vulkan backend on an old Intel UHD 620 iGPU emits duplicate and `` tokens with Gemma 4 E2B IQ4_NL (backend/hardware issue, unresolved); and a 5070 Ti + 4070 multi-GPU builder's 128K-context OCR slowdown is prompt-processing bound, not PCIe-bound, so splitting the x16 slot to 2×8 would likely hurt more than help. Next update fires when the daily Gemma 4 research cron flags notable new findings._

---

Field Notes — 2026-06-24

A weekly synthesis of what the r/LocalLLaMA community is reporting about Gemma 4 in real use. Curated from the latest Gemma-mentioning posts (5 new or updated since 2026-06-23, 429 hardware-mention entries total) and their top comment threads. Confidence is medium unless noted, since this is community signal rather than a controlled benchmark.

June 24 sweep, 2026-06-24 00:00 UTC: a quiet cycle carried by a single methodical finding. The headline is the most rigorous data point in three sweeps of KV-cache discussion: a community member published a KL-Divergence map of KV cache quantization for Gemma 4 E2B QAT (alongside Qwen 3.6 35B-A3B), with reproducible tooling and zoomable plots. The takeaway sharpens — rather than simply extends — the "QAT tolerates KV cache quantization" thread from the June 22 and June 23 sweeps: `q8_0` KV cache is nearly free on Gemma 4 QAT, but `q4_0` KV cache is catastrophic on Gemma, where the same `q4_0` cache is merely "useable" on Qwen. So the practical rule tightens to "Q8 yes, Q4 no" on Gemma, at least at the E2B size tested. Beyond the headline, the cycle surfaced two softer signals: a calibration discussion arguing that Gemma 4 26B-A4B is underrated for single-3090 RAG and assistant work (as opposed to coding, where the community defaults to Qwen 3.6), and a sentiment complaint about the structured bullet-list reasoning traces that Gemma 4 and Qwen 3.6 emit. Two further Gemma-mentioning threads were reviewed and intentionally left out because they carried no Gemma-specific signal.

Gemma 4 E2B QAT: `q8_0` KV cache is nearly free, but `q4_0` KV cache is catastrophic — a reproducible KL-Divergence map. A community member mapped the quality cost of KV cache quantization across a grid of K and V quant types for two models, Gemma 4 E2B QAT and Qwen 3.6 35B-A3B, using KL-Divergence against the full-precision cache as the metric, and published the plots plus the software to replicate them on any model. The reported results: `q8_0`/`q8_0` (K/V) is nearly free on both models; `q4_0`/`q4_0` is "useable" on Qwen but "catastrophic" on Gemma; the experimental `turbo4` cache type is "sometimes slightly better, sometimes slightly worse" than `q4_0`; and the more aggressive `turbo3`/`turbo2` tiers compress the cache "to unprecedented levels — but you'll pay dearly for it" in quality. The author also notes that K-cache and V-cache sensitivity is not fixed: "K is sometimes more sensitive than V, sometimes less, sometimes they're symmetrical," so a blanket K/V quant choice is not optimal across models. This is the most methodical entry in the KV-cache thread that ran through the June 22 and June 23 sweeps. Those sweeps established that Gemma 4 QAT GGUFs tolerate KV cache quantization much better than the plain quants — "`q8_0` KV back on the menu" — and confirmed it held at 31B. This finding agrees on the `q8_0` half (nearly free) and adds the missing other half: do not drop a Gemma 4 cache to `q4_0` — the quality collapse the prior sweeps did not quantify is real and large on Gemma, even though Qwen survives the same setting. Practical read: on Gemma 4 QAT, reclaim context headroom with a `q8_0` KV cache, but stop there; `q4_0` and the `turbo3`/`turbo2` tiers are not worth the quality loss unless you have measured it for your own workload. Two limits keep this at medium confidence despite the rigor: the Gemma model tested is E2B (the smallest Gemma 4 — KV-cache sensitivity can differ at 12B/26B-A4B/31B and was not measured here), and KL-Divergence against the f16 cache is a proxy for quality rather than a downstream task score. The upside is unusual for this guide: the methodology and tooling are published, so the result is independently reproducible rather than a one-off anecdote. Confidence: medium, leaning higher on method — reproducible KLD measurement with disclosed tooling — but bounded by a single author and an E2B-only Gemma test. (source, June 23, 2026)

Is Gemma 4 26B-A4B underrated for single-3090 RAG and assistant work? A calibration discussion. A user building an all-in-one personal assistant — RAG, knowledge-base queries, general assistant, explicitly not coding — on a single RTX 3090 (with a few smaller side GPUs for support models) asks why Gemma 4 26B-A4B (MoE) gets so little discussion compared to Qwen 3.6 27B/35B and the dense Gemma 4 31B. Their own observation: the dense 31B "doesn't fit well on a solo 3090," and after testing Qwen 3.6 35B as the primary driver they now suspect Gemma 4 may be the better fit for their RAG and assistant workload. The post is a discussion with no captured comments and no measured throughput, so it is a sentiment-and-calibration data point rather than a benchmark. The useful framing it surfaces: the MoE 26B-A4B — with roughly a 4B active-parameter count per token — is the natural single-24 GB-card choice for non-coding assistant work where the dense 31B is tight on VRAM, yet the community conversation has drifted toward Qwen 3.6 for coding and toward the dense 31B for quality, leaving the MoE 26B-A4B comparatively undiscussed for the RAG and assistant use case it suits well. This matches the June 23 Strix Halo report's dense-vs-MoE tradeoff from the opposite direction: there the dense 31B won on quality at low speed; here a single-GPU assistant builder is leaning toward the faster MoE for interactive RAG. Confidence: anecdotal community discussion, no throughput or quality measurements captured; the VRAM-fit reasoning is consistent with prior reports in this guide but the OP supplied no numbers. (source, June 23, 2026)

Sentiment: the structured bullet-list reasoning trace style of Gemma 4 and Qwen 3.6 is divisive. A discussion thread voiced a now-recurring community complaint: that Gemma 4 and the Qwen 3.5/3.6 series emit a rigid, numbered "analyzing — point 1 — point 2 — point 3" reasoning structure rather than the looser prose-style "human" chain-of-thought of models like QwQ, GPT-OSS, GLM, or DeepSeek. The author's argument is that on smaller models this structure can devolve into restating the system prompt and "wasting tokens," and that forcing the model to hold a bullet-list format while writing and computing math or science makes hard reasoning harder rather than easier. This is opinion, not a benchmark — there is no measurement here that the structured trace helps or hurts accuracy — but it is a representative sentiment snapshot worth recording: a slice of the community finds Gemma 4's structured reasoning output token-inefficient for small-model deployments and would prefer a more free-form trace. For a reader choosing a reasoning configuration, the practical takeaway is to test whether your task benefits from the structured trace at all, and to consider a lower thinking budget on small Gemma 4 variants if you see the model padding its scratchpad rather than reasoning. Confidence: anecdotal community opinion, no benchmark or measurement. (source, June 23, 2026)

Open questions

Does the `q4_0`-is-catastrophic result hold at Gemma 4's larger sizes? The KLD map was measured on E2B QAT, the smallest Gemma 4. The June 22/23 sweeps confirmed `q8_0` KV tolerance at 31B QAT, but no one has published a `q4_0` KLD comparison at 12B, 26B-A4B, or 31B. If `q4_0` is as catastrophic at 31B as it is at E2B, the "Q8 yes, Q4 no" rule becomes a hard guardrail for context-constrained 31B users rather than an E2B-only observation.
Are the experimental `turbo3`/`turbo2` KV cache tiers ever worth their quality cost? The author reports they reach "unprecedented" compression but "you'll pay dearly for it." There is no published workload yet where the extra context headroom from `turbo2`/`turbo3` outweighs the quality loss on Gemma 4 — a task-level (not KLD-level) comparison would settle whether they have any practical niche.
What are the measured throughput and quality numbers for Gemma 4 26B-A4B on a single 3090 for RAG and assistant work? The calibration discussion makes a plausible VRAM-fit case for the MoE 26B-A4B over the dense 31B on a solo 3090, but no one supplied tok/s or retrieval-quality figures. A direct 26B-A4B-vs-31B comparison on a fixed RAG/assistant harness on one 24 GB card would turn the sentiment into guidance.

Sources

The Gemma-mentioning posts driving this update (June 24 sweep, newest first). All are fresh threads (score ~20, no captured comment threads at sweep time); treat individual numbers as first-look anecdotes rather than settled results, except where reproducible tooling is noted:

I mapped the KLD of KV cache quantization for Qwen3.6-35B-A3B and Gemma4-E2B QAT (Jun 23, 2026 — KL-Divergence map of KV cache quant for Gemma 4 E2B QAT and Qwen 3.6 35B-A3B; `q8_0`/`q8_0` nearly free on both, `q4_0`/`q4_0` useable on Qwen but catastrophic on Gemma, `turbo4` ≈ `q4_0`, `turbo3`/`turbo2` extreme compression with large quality cost, K/V sensitivity varies by model; reproducible plots and tooling published)
Is there any reason for a lack of love for Gemma 4 26b? (Jun 23, 2026 — single RTX 3090 assistant builder asks why Gemma 4 26B-A4B MoE is underdiscussed for RAG/assistant vs Qwen 3.6 and the dense 31B; notes 31B "doesn't fit well on a solo 3090"; discussion only, no measurements)
Like... GENUINELY WHYY??? (structured reasoning traces) (Jun 23, 2026 — community complaint that Gemma 4 and Qwen 3.5/3.6 emit rigid numbered bullet-list reasoning rather than prose-style CoT; argues it wastes tokens on small models and complicates math/science reasoning; opinion, no benchmark)

Two additional Gemma-mentioning threads were reviewed and intentionally left out of the findings because they carry no Gemma-4-specific signal:

Reusable workflows for long running local llms (Jun 23, 2026 — promotional announcement for a file-watching agent harness; Gemma 4 named only as one of the models that hit a "sweet spot between speed and smarts" for long tasks, with no Gemma-specific hardware, quant, or quality observation).
llama-server webui not responding anymore (Jun 23, 2026 — a backend troubleshooting post about the llama.cpp webui/MCP failing to respond while the loaded model is Qwen 3.6 35B-A3B; Gemma 4 appears only incidentally as a working `llama-cli -hf unsloth/gemma-4-12b-it-GGUF:IQ4_NL` smoke command, with no Gemma-specific finding).

_Last updated: 2026-06-24 (June 24 sweep). Confidence: medium. Key findings: a reproducible KL-Divergence map of KV cache quantization on Gemma 4 E2B QAT shows `q8_0` KV cache is nearly free but `q4_0` KV cache is catastrophic on Gemma (where Qwen survives the same setting), and the experimental `turbo3`/`turbo2` tiers compress far more but at a steep quality cost — sharpening the June 22/23 "Q8_0 KV back on the menu for QAT" thread into a "Q8 yes, Q4 no" rule, at least at the E2B size tested; a calibration discussion argues Gemma 4 26B-A4B MoE is underrated for single-3090 RAG and assistant work (where the dense 31B is a tight fit), no numbers captured; and a sentiment thread complains that Gemma 4 and Qwen 3.6 structured bullet-list reasoning traces waste tokens on small models. Next update fires when the daily Gemma 4 research cron flags notable new findings._

---

Field Notes — 2026-06-23

A weekly synthesis of what the r/LocalLLaMA community is reporting about Gemma 4 in real use. Curated from the latest Gemma-mentioning posts (7 new or updated since 2026-06-22, 424 hardware-mention entries total) and their top comment threads. Confidence is medium unless noted, since this is community signal rather than a controlled benchmark.

June 23 sweep, 2026-06-23 00:00 UTC: a cycle that mostly closes loops opened in the last two sweeps and adds three concrete hardware data points. The headline is a direct follow-up to the June 22 KV-cache finding: a second user re-ran the same KL-Divergence test on the Gemma 4 31B size that the original author could not reach, and reports the QAT model holds up even better there — so the "Q8_0 KV cache back on the menu for QAT" guidance now has a 31B confirmation, though no numbers were captured at sweep time. On hardware, two reports anchor the slow-but-high-quality end of the dense 31B: a dual AMD Radeon RX 9060 XT (2×16 GB) rig runs 31B Q6 at a steady 8–9 tok/s, and a Strix Halo 128 GB box runs 31B at 4–5 tok/s — much slower than the MoE models it runs alongside (GPT-OSS 120B and Qwen 3.5 122B at 40–50 tok/s) but, in the owner's words, "the best quality." A third hardware note is a llama.cpp sampler optimization: a Top-N-Sigma PR gives a +50% generation speedup on Gemma 4 E4B Q8_0 on an M3 Max. Rounding out the cycle are two community-ecosystem items: a new uncensored Gemma 4 12B QAT finetune shipping with MTP (promotional, self-reported), and a discussion of whether Gemma 4 will grow a Mistral-scale finetune community, citing its EQ-Bench creative-writing standing and universal MTP/QAT support.

Gemma 4 QAT 31B confirmed to tolerate KV cache quantization too — closing last sweep's open question. The June 22 sweep led with a finding that Gemma 4 QAT builds tolerate KV cache quantization far better than the plain quants (a wikitext KL-Divergence test at 16K context, with "Q8_0 on QAT back on the menu" as the takeaway), but the original author could not test the 31B size — exactly the size most likely to be context-constrained and to benefit. A second user re-ran that same benchmark on the 31B and reports getting "even better results on Gemma 4 31B," which is the confirmation that was missing. The practical read for a single-GPU or split-GPU 31B user is now stronger: if you run a Gemma 4 31B QAT GGUF, it is worth re-testing a `q8_0` KV cache to reclaim context headroom rather than staying pinned to f16. Two limits keep this at medium confidence: the follow-up post did not capture the actual 31B KLD numbers or the exact quant levels at sweep time (it points back to the prior thread's methodology rather than restating figures), and like the original it is a first-party measurement that has not been independently reproduced. Confidence: first-person follow-up that re-runs a disclosed methodology and agrees with the prior result; specific 31B numbers not captured. (source, June 22, 2026)

Dual AMD Radeon RX 9060 XT (2×16 GB) runs Gemma 4 31B Q6 at a steady 8–9 tok/s. A user running Gemma 4 31B at Q6 across two RX 9060 XT 16 GB cards (32 GB combined VRAM) reports a consistent 8–9 tok/s and calls the setup "quite usable," while noting that other threads suggest it should run faster and asking whether they are missing something. No comments were captured at sweep time to resolve the speed question, and the backend (llama.cpp/Vulkan vs ROCm vs another path) was not stated, so the gap between this number and the community's expectation is unexplained for now. As a data point it is still useful: it establishes that a two-card consumer AMD build with 32 GB total can hold the 31B at a high quant (Q6) with enough context to be practical, at single-digit throughput. Anyone replicating it should treat 8–9 tok/s as a floor that may improve with backend or split-tuning rather than a ceiling. Confidence: first-person report with hardware, quant, and observed speed disclosed; backend unspecified and the "should be faster" question unresolved. (source, June 22, 2026)

Strix Halo 128 GB: Gemma 4 31B is the slow-but-best-quality option next to large MoE models. A user with a Strix Halo 128 GB unified-memory box running models via llama-swap reports a clear split: the large mixture-of-experts models (GPT-OSS 120B and Qwen 3.5 122B) run "quite fast at 40–50 tok/s," while Gemma 4 31B is "a slow 4–5 tok/s" but "seems to have the best quality" of the three. The contrast is the useful part — it is a clean illustration of the dense-vs-MoE throughput tradeoff on a unified-memory APU: the active-parameter count of the MoE models keeps them fast, while the dense 31B pays full compute per token and lands at single-digit speed even with ample memory. The post itself was a request for an agentic Python coding workflow (plan-then-execute with a separate test model in PyCharm) rather than a benchmark, so the speeds are incidental and unprofiled, but they are consistent with other Strix Halo dense-model reports in this guide. Practical read: on Strix Halo, reach for Gemma 4 31B when output quality matters more than latency, and for a large MoE model when you need interactive speed. Confidence: first-person report with hardware and observed speeds disclosed; informal, not a controlled benchmark, no quant stated. (source, June 22, 2026)

llama.cpp Top-N-Sigma sampler PR: +50% generation speed on Gemma 4 E4B Q8_0 (M3 Max). A llama.cpp pull request that removes an unconditional softmax-and-sort at the end of the Top-N-Sigma sampler — wasted work in the common case where Top-N-Sigma is chained into the Dist sampler — is reported by its author to raise generation throughput for `google_gemma-4-E4B-it-Q8_0` by ~50%, from roughly 30 tok/s to ~45 tok/s on an M3 Max MacBook Pro, shaving about 10 ms per token. The win is a pure sampler-overhead reduction, so it is most visible on a small, fast model like the E4B where per-token sampling cost is a larger share of the total; the author explicitly flags two unknowns — whether the speedup holds across other backends, and whether it generalizes to larger models — and asks the community for more numbers. For Apple Silicon E-series users this is a concrete near-term speedup once the change lands and your sampler chain ends in Dist; for everyone else it is a "watch this" rather than a guarantee. Confidence: PR-author measurement on a single model and machine, flagged by the author as not yet generalized. (source, June 22, 2026)

A new uncensored Gemma 4 12B QAT "Balanced" finetune ships with MTP — promotional, self-reported. A community uploader released Gemma4-12B-QAT-Uncensored-Balanced, described as the original Gemma 4 12B QAT with refusals removed (the author cites 0/465 refusals on a GenRM check) and an MTP "Assistant" model for a claimed ~60% speed boost. The "Balanced" framing adds a light reasoning preamble on the edgiest prompts rather than changing the model's personality, and the author self-reports stable sampling, no looping, and good long-context coherence, recommending it for creative writing, RP, and emotional-intelligence use while explicitly conceding that Qwen 3.6 remains net superior for agentic coding and tool use. As with every community uncensored release, the caveats dominate: the refusal count and speed figure are vendor-reported, the post is overtly promotional (it markets download counts and a Discord), and there is no independent verification of either the safety claim or the MTP speedup. For readers who specifically want an uncensored Gemma 4 12B, this is one option to evaluate — verify the GGUF provenance and run your own quality and speed checks rather than trusting the advertised numbers. Confidence: vendor-announced release, self-reported metrics only, no independent reproduction. (source, June 22, 2026)

Will Gemma 4 grow a Mistral-scale finetune community? An ecosystem question, with EQ-Bench context. A discussion post asks whether Gemma 4 will mature into a heavily-finetuned community favorite the way Mistral Small did, framing the current gap as one of community finetunes rather than base capability. The author's read of EQ-Bench creative writing (noting the benchmark is Claude-graded but has many samples per model) is that, comparing bases only, Gemma 4 31B has "better everything — especially long-context adherence — except for the raw prosing performance of Mistral finetunes," and that Mistral's edge today comes from two years of community tuning and merging on top of a base that "used to be bad too." The post argues Gemma 4 is well-positioned to follow the same path: it is stable, has a roughly yearly release cadence that gives each generation time to mature, ships global MTP support (all sizes — 12B, 26B-A4B, 31B — work with MTP given the matching Assistant model, with no abliteration required), and supports QAT. This is opinion plus a benchmark reference rather than a first-party measurement, but it usefully frames why Gemma 4's creative-writing reputation lags its raw scores: the comparison is base-Gemma against community-finetuned Mistral, not like-for-like. Confidence: discussion/opinion citing a third-party (Claude-graded) benchmark; no first-party measurement. (source, June 22, 2026)

Open questions

Why is dual RX 9060 XT 31B Q6 only 8–9 tok/s? The owner suspects they are leaving speed on the table and the community apparently agrees, but no answer was captured at sweep time and the backend was not stated. A follow-up identifying whether it is a split-GPU, backend (Vulkan vs ROCm), or quant choice would turn this from a floor into a tuned baseline.
Does the Top-N-Sigma sampler speedup generalize beyond E4B Q8_0 on Metal? The PR author measured one model on one machine and explicitly asked whether the +50% holds for larger models and other backends. Until there are more numbers, treat it as an Apple-Silicon-E-series win rather than a universal one.
What are the actual KLD numbers for Gemma 4 31B QAT KV quantization? The 31B confirmation agrees with the smaller-model result but did not restate figures at sweep time. A 31B KLD table against an f16 KV baseline (and a note on which quant levels were tested) would make the "Q8_0 KV back on the menu" guidance fully concrete at the size that benefits most.
Will Gemma 4 develop a Mistral-scale finetune ecosystem? Carried as an open community question: the base scores and infrastructure (global MTP, QAT, stable yearly cadence) are there, but the volume of quality community finetunes that gave Mistral its creative-writing reputation is not yet.

Sources

The Gemma-mentioning posts driving this update (June 23 sweep, newest first). All are fresh threads (score ~20, no captured comment threads at sweep time); treat individual numbers as first-look anecdotes rather than settled results:

Is Gemma 4 going to be the next Mistral (or Qwen 3.6) one day? Concerning the lack of finetunes (Jun 22, 2026 — ecosystem discussion; EQ-Bench creative writing, bases only: Gemma 4 31B "better everything except raw prosing of Mistral finetunes"; cites global MTP support across all sizes and QAT; Claude-graded benchmark)
Gemma4-12B-QAT Uncensored Balanced is out with MTP (~60% speed boost) (Jun 22, 2026 — community uncensored finetune of original 12B QAT; self-reported 0/465 refusals and ~60% MTP speed boost; promotional; recommends Qwen 3.6 over it for agentic coding/tool use; no independent verification)
Top-N-Sigma: Remove unconditional softmax+sort (llama.cpp PR #22645) (Jun 22, 2026 — sampler optimization; +50% generation on `gemma-4-E4B-it-Q8_0`, ~30→45 tok/s, −10 ms/token on M3 Max; author unsure if it generalizes across backends/models)
Gemma 4 QAT 31B responds better to KV cache quantization too (Jun 22, 2026 — follow-up to the June 22 KV-cache finding; re-ran the same wikitext KLD test on 31B and reports "even better results"; confirms QAT KV-quant tolerance at 31B; specific numbers not captured)
Gemma 4 31B Q6 on Dual 9060 XT (Jun 22, 2026 — two RX 9060 XT 16 GB cards, 32 GB total; 31B Q6 at a steady 8–9 tok/s; usable; OP suspects it should be faster; backend unspecified)
Agent recommendations (Strix Halo 128 GB) (Jun 22, 2026 — llama-swap; GPT-OSS 120B and Qwen 3.5 122B at 40–50 tok/s vs Gemma 4 31B at 4–5 tok/s but "best quality"; dense-vs-MoE speed tradeoff on a unified-memory APU)

One additional Gemma-mentioning thread was reviewed and intentionally left out of the findings: a UX complaint about the Hermes agent (1ucanbv) names Gemma 4 26B only as the model the author happened to be running, with no Gemma-specific hardware, quality, or configuration observation, so it carries no Gemma 4 signal worth publishing.

_Last updated: 2026-06-23 (June 23 sweep). Confidence: medium. Key findings: a second user confirms Gemma 4 QAT tolerates KV cache quantization at the 31B size (re-running the June 22 wikitext KLD test, "even better results," numbers not captured) — strengthening the "Q8_0 KV back on the menu for QAT" guidance; dual RX 9060 XT (2×16 GB) runs 31B Q6 at a steady 8–9 tok/s (backend unspecified, OP suspects it should be faster); Strix Halo 128 GB runs 31B at 4–5 tok/s, "best quality" but far slower than the 40–50 tok/s MoE models it runs alongside; a llama.cpp Top-N-Sigma sampler PR gives +50% on Gemma 4 E4B Q8_0 (~30→45 tok/s) on an M3 Max, not yet generalized; a community uncensored 12B QAT "Balanced" finetune shipped with MTP (self-reported, promotional); and an ecosystem discussion frames Gemma 4's creative-writing reputation as a community-finetune gap, not a base-capability gap. Next update fires when the daily Gemma 4 research cron flags notable new findings._

---

Field Notes — 2026-06-22

A weekly synthesis of what the r/LocalLLaMA community is reporting about Gemma 4 in real use. Curated from the latest Gemma-mentioning posts (6 new or updated since 2026-06-21, 417 hardware-mention entries total) and their top comment threads. Confidence is medium unless noted, since this is community signal rather than a controlled benchmark.

June 22 sweep, 2026-06-22 00:00 UTC: a quiet cycle dominated by quality-and-quantization questions rather than new hardware numbers. The headline is a small but practical measurement: Gemma 4 QAT models appear to tolerate KV cache quantization far better than the non-QAT builds, which — if it holds up at larger sizes — would put Q8_0 KV cache "back on the menu" for a model family that has been notoriously sensitive to it. The second theme is vision: one careful benchmarker discovered that Gemma 4's default vision token budget is so small it makes the model "essentially useless" for image work until you raise it, and a separate user doing OCR on 1950s scanned documents reports Gemma 4 31B's vision is better than the Qwen 3.6 encoder for that task. Rounding out the cycle are two small-model use-case notes — Gemma 4 12B as a reliable grammar-constrained game agent that fits in 8 GB, and a community "uncensored" 12B-coder finetune (with the usual provenance caveats) — plus an open creative-writing question about whether to run the 31B at Q6 or QAT.

Gemma 4 QAT tolerates KV cache quantization much better than non-QAT: Q8_0 KV may be back on the menu. Gemma 4 has a long-standing reputation in the community for being unusually sensitive to KV cache quantization — quantizing the KV cache (to save VRAM and extend context) has tended to degrade Gemma 4 output more than it does most other model families, which pushed many users back to a full 16-bit KV cache. A new measurement pushes back on that for the QAT builds specifically. The author ran KL Divergence on wikitext at 16K context, comparing quantized KV caches against the full 16-bit KV cache as the baseline, and reports that the QAT (quantization-aware-trained) models hold up substantially better than the plain quants — concluding that "Q8_0 on QAT models might be back on the menu." They frame 99.9% KLD as a useful pass mark for judging how much KV quantization actually hurts, because it captures how well the model keeps attention on rare but high-importance tokens (exactly where a degraded KV cache shows up first). The practical read for a single-GPU user: if you run a Gemma 4 QAT GGUF, it is now worth re-testing a `q8_0` KV cache to reclaim context headroom rather than assuming you must stay at f16. Two important limits: the author's hardware could not test the 31B size, so this is confirmed only on the smaller models they could run, and it is a single first-party measurement with no captured comments at sweep time. Confidence: first-person KLD measurement with the metric and context length disclosed; unverified at 31B and not yet independently reproduced. (source, June 21, 2026)

Gemma 4's default vision token budget (280) makes it "useless" for image tasks until raised. A practitioner running a second iteration of a multi-model vision benchmark (23 models × 30 images × 3 runs = 2,070 tests, 60–70 inference hours) flagged a Gemma 4 configuration trap worth knowing before you judge the model on vision. Gemma 4's vision budget defaults to 280 tokens, which the author says is low enough to make the model "essentially useless" for real image work — and is likely why some earlier hands-on vision impressions of Gemma 4 were poor. The fix that recovered usable behavior, with settings the author credits to recent community posts: `--image-min-tokens 560 --image-max-tokens 2240` to raise the budget, plus `-b 4096 -ub 4096` so a single image's tokens are not split across multiple batches (the llama.cpp default of 512 fragments the image). The author also switched from Ollama to llama.cpp for the run and expanded to Q8 quants for the smaller models. The benchmark's top picks by VRAM tier in this round were Qwen-family models (for the 4–8 GB tier, Qwen3.5 4B nothink at Q4), so this is not a claim that Gemma 4 won a tier — the Gemma-relevant takeaway is narrower and more actionable: if you have written Gemma 4 off for vision, re-test it with the vision budget raised before concluding anything. Confidence: methodology, test count, and exact flags disclosed; the full per-model Gemma 4 ranking was not captured at sweep time. (source, June 21, 2026)

Gemma 4 31B vision praised for historical-document OCR on an RTX 6000 Pro. A user doing OCR and classification on old scanned documents (some dating to the 1950s) on an RTX 6000 Pro reports good results with Gemma 4 31B, stating it is "better than the vision encoder in the Qwen 3.6 line of models" for that task, and asking the community what else is worth trying. This is a qualitative single-user report with no accuracy numbers or settings disclosed, but it is a useful data point for the workstation-class vision use case, and it pairs naturally with the vision-budget caveat above: anyone reproducing this should confirm their Gemma 4 vision token budget is raised before comparing. Confidence: anecdotal self-report, no metrics, no quant or settings specified. (source, June 21, 2026)

Gemma 4 12B as a reliable 8 GB game agent via grammar-constrained "Think then Act." An open-source project ("Watch My Escape," an inverted escape-room game where the user designs maps and the LLM tries to escape) ships Gemma 4 12B as one of five model presets, all at Q4_K_M so they fit in roughly 8 GB of VRAM, tested on a 4090, a 3070, and an M1 Mac. The relevant engineering detail for local-model users is the reliability technique: the agent's turn is split into a free reasoning step followed by a grammar-constrained action step via llama.cpp, which is how small models like the 12B are kept reliable at emitting valid structured actions (push, pull, pick-up) instead of drifting into free text. No head-to-head scores between the presets were published, so this is a use-case and integration data point rather than a benchmark — but it is a concrete confirmation that Gemma 4 12B at Q4_K_M is a workable agent on the common 8 GB single-GPU tier when paired with grammar constraints. Confidence: working open-source project with hardware and quant disclosed; no comparative model scores. (source, June 21, 2026)

Community "uncensored" Gemma 4 12B-coder finetune released — treat vendor numbers with caution. A community uploader released `gemma-4-12B-coder-fable5-composer2.5-v1-uncensored-heretic` in both Safetensors and GGUF, advertising 9/100 refusals and 0.0467 KLD (divergence from the base model) and bundling a benchmark. The low KLD is the interesting claim — it implies the "uncensoring" finetune stayed close to the base model's behavior rather than degrading it broadly — but the figures are self-reported by the publisher, the post is overtly promotional (it also markets paid access to a larger MiniMax-M3 model), and there is no independent verification. For readers who want an uncensored Gemma 4 12B, this is one option to evaluate, but verify the GGUF provenance and run your own quality check rather than trusting the advertised numbers; the June 20 sweep's caution about mislabeled community GGUFs applies here too. Confidence: vendor-announced release, self-reported metrics only, no independent reproduction. (source, June 21, 2026)

Open questions

Does the QAT KV-cache tolerance hold at Gemma 4 31B? The 16K-context KLD result that puts Q8_0 KV "back on the menu" was measured only on the sizes the author could run; the 31B is exactly the size most likely to be context-constrained and to benefit, and it remains untested. A 31B KLD sweep against an f16 KV baseline would settle it.
Q6 vs QAT for Gemma 4 31B creative writing? A user choosing between a 31B Q6 GGUF and a 31B QAT build for creative writing asked the community which wins overall and what the KLD difference is, and got no answer at sweep time. This is the same QAT-quality question from the other end — generation quality rather than KV-cache robustness — and the two threads would benefit from being answered together. (source, June 21, 2026)
What is Gemma 4's full ranking in the corrected-vision-budget benchmark? The 2,070-test vision benchmark fixed the 280-token default that had been hobbling Gemma 4, but the per-model results captured at sweep time showed only the Qwen-family tier winners. Where Gemma 4 lands once its vision budget is properly raised is the open question that would actually answer "is Gemma 4 good at vision."

Sources

The Gemma-mentioning posts driving this update (June 22 sweep, newest first). All are fresh threads (score ~20, no captured comment threads at sweep time); treat individual numbers as first-look anecdotes rather than settled results:

Gemma 4 QAT seems to respond significantly better to KV cache quantization (Jun 21, 2026 — KL Divergence on wikitext at 16K context; QAT models tolerate KV quantization far better than non-QAT; "Q8_0 on QAT might be back on the menu"; 99.9% KLD as pass mark; author could not test 31B)
Best local model for vision - 2nd benchmark update (Jun 21, 2026 — 23 models × 30 images × 3 runs; Gemma 4 vision budget defaults to 280 and is "useless" until raised; fix `--image-min-tokens 560 --image-max-tokens 2240` and `-b 4096 -ub 4096`; switched ollama→llama.cpp; tier winners were Qwen-family)
Best image vision model runnable on RTX 6000 Pro (Jun 21, 2026 — OCR/classification on 1950s scanned docs; Gemma 4 31B vision reported "better than the Qwen 3.6 line" vision encoder; anecdotal, no metrics)
Watch local LLMs escape the rooms you design (Jun 21, 2026 — Gemma 4 12B as one of five Q4_K_M presets fitting ~8 GB; tested on 4090/3070/M1; "Think then Act" with llama.cpp grammar-constrained action step)
Uncensored Gemma 4 12B-coder finetune (9/100 refusals, 0.0467 KLD) (Jun 21, 2026 — community `gemma-4-12B-coder-fable5-composer2.5` uncensored finetune in Safetensors + GGUF; self-reported metrics; promotional post; no independent verification)
Gemma 4 31B Q6 vs Gemma 4 31B QAT (Jun 21, 2026 — open question, creative-writing focus; user asks which is better overall and what the KLD is; no data or answer captured)

_Last updated: 2026-06-22 (June 22 sweep). Confidence: medium. Key findings: Gemma 4 QAT tolerates KV cache quantization much better than non-QAT (Q8_0 KV "back on the menu" per a 16K-context wikitext KLD test, 31B untested); Gemma 4's default vision token budget of 280 makes it "useless" for images until raised to `--image-min-tokens 560 --image-max-tokens 2240` with `-b/-ub 4096`; Gemma 4 31B vision praised for 1950s-document OCR over the Qwen 3.6 encoder; Gemma 4 12B Q4_K_M confirmed as an 8 GB grammar-constrained game agent; a community uncensored 12B-coder finetune released with self-reported numbers (verify before use). Next update fires when the daily Gemma 4 research cron flags notable new findings._

---

Field Notes — 2026-06-21

A weekly synthesis of what the r/LocalLLaMA community is reporting about Gemma 4 in real use. Curated from the latest Gemma-mentioning posts (5 new or updated since 2026-06-20, 411 hardware-mention entries total) and their top comment threads. Confidence is medium unless noted, since this is community signal rather than a controlled benchmark.

June 21 sweep, 2026-06-21 00:00 UTC: a compact cycle with three hardware data points and two use-case observations. The most concrete finding is a set of Intel Arc B70 SYCL llama.cpp benchmarks for Gemma 4 12B, 26B-A4B, and E2B — the first community-shared throughput numbers for the B70 on these models, showing 32 tok/s generation for the 12B, 40 tok/s for the 26B-A4B, and 109 tok/s for the tiny E2B at Q8_0. On the Apple Silicon side, a user confirms Gemma 4 E4B MLX running at full 132K context on a 16 GB M1 Mac Pro — the largest confirmed context window for a 16 GB unified-memory Mac with this model family. A third hardware note comes from an RTX 4090 user who found Gemma 4 faster than alternatives but encountered incorrect token generation from at least one quant variant — an anecdotal quality caution worth noting. On the use-case front, a community member argues Gemma 4 26B-A4B outperforms Qwen 3.5/3.6 for language learning and scientific queries (health, biology, biochemistry), offering a contrarian perspective to the sub's coding-focused Qwen preference. Finally, an educational reference: a 15-part LLM internals series uses Gemma 4 12B as its running example, with the notable practical fact that Gemma 4's 262,144-token vocabulary alone costs approximately 2 GB of VRAM before any model weights load.

Intel Arc B70 SYCL throughput for Gemma 4 12B, 26B-A4B, and E2B: first community data point for this backend. A user shared llama.cpp benchmark results (`llama-bench`, build dd4623a74 / build number 9640) for three Gemma 4 model sizes running on an Intel Arc Pro B70 GPU with the SYCL backend. All tests used `ngl=-1` (all layers on GPU) and Q8_0 quantization. Results: gemma4 12B Q8_0 — pp512: 1,578 ± 8 tok/s, tg128: 32.43 ± 0.07 tok/s (model size 11.78 GiB); gemma4 26B.A4B Q8_0 — pp512: 1,332 ± 9 tok/s, tg128: 40.13 ± 0.09 tok/s (25.00 GiB); gemma4 E2B Q8_0 — pp512: 5,662 ± 23 tok/s, tg128: 109.14 ± 0.26 tok/s (4.69 GiB). The E2B's throughput at over 5,600 tok/s prefill and 109 tok/s generation reflects its tiny 4.65B parameter footprint and near-zero memory pressure. The 26B-A4B result at 40 tok/s is notable because Q8_0 at 25 GiB nominally exceeds the Arc Pro B70's documented 24 GB GDDR6 capacity — the SYCL backend on Intel Arc supports unified memory fallback into system RAM, so the 26B-A4B result likely involves some host-memory offload, which may explain slightly lower generation throughput compared to what a 24 GB card running the model fully on-device would achieve. The 12B at Q8_0 (11.78 GiB) fits cleanly in 24 GB and the 32 tok/s figure is a credible baseline for Arc B70 SYCL inference. No comparison against CUDA or ROCm was included. Confidence: llama-bench output posted directly, hardware and build number disclosed, though the 26B-A4B memory situation is inferred. (source, June 20, 2026)

Gemma 4 E4B MLX at full 132K context on a 16 GB M1 Mac Pro: confirmed. A user running LM Studio on an M1 Mac Pro with 16 GB unified memory reported that Gemma 4 E4B via MLX is the largest model they can run "without running into memory hog" at the model's full context window of approximately 132K tokens. No throughput figures were shared, but this establishes a useful floor: on the lowest Apple Silicon configuration that the community regularly uses for local inference, E4B MLX supports the complete context window without RAM pressure forcing a shorter context. For context, the E4B model weights at a Q4-class MLX quantization fit in roughly 3–4 GB, leaving 12 GB for the KV cache at 132K context. At larger quantizations (Q8 or BF16) or for the 12B dense model, 16 GB would be tight or insufficient for full-context operation. Practical guidance: 16 GB unified memory Apple Silicon users should use Gemma 4 E4B MLX (not the 12B or 26B-A4B) when the full context window matters. Confidence: anecdotal self-report, no throughput data, model and hardware disclosed. (source, June 20, 2026)

Gemma 4 26B-A4B for language learning and scientific queries: community endorsement over Qwen. A user who has been comparing Gemma 4 26B-A4B against Qwen 3.5 and Qwen 3.6 for non-coding use cases argues that Gemma 4 26B-A4B is "unbeaten" for language learning and scientific domains — specifically health, biology, medical, clinical, and biochemistry queries — even by the Qwen alternatives. The post is a question to the broader community, asking others to share their non-coding use cases and which model wins. The author explicitly acknowledges the sub's conventional wisdom that Gemma 4 26B is "a bit behind for coding tasks" and is not contesting that. No benchmark data is provided; this is a practitioner qualitative comparison. The framing matters: it adds to a growing set of community signals (see also the June 12 sweep's creative writing finding) that Gemma 4's advantage over Qwen appears most clearly in prose quality, knowledge depth on scientific/health topics, and language tasks rather than in agentic coding or tool-calling benchmarks. Confidence: anecdotal, no benchmark, specific domain claims are self-reported. (source, June 20, 2026)

RTX 4090 (24 GB) Gemma 4 quant quality caution: incorrect token generation reported with some variants. A user switching from Ollama to a direct LM Studio setup on an RTX 4090 Gigabyte OC edition (24 GB VRAM, Ryzen 9 3900X, 32 GB DDR4 @ 3600 MT) noted that Gemma 4 runs faster than Qwen alternatives but encountered "incorrect tokens" in generation output — specifically mentioned an unexpected underscore character appearing where it should not. The post does not specify which GGUF variant or quantization level produced this behavior, and no comments were captured at sweep time to narrow it down. This is an isolated anecdotal report. Possible causes include: a corrupt or mismatched GGUF download, a chat template mismatch, a tokenizer-template desync in LM Studio's Gemma 4 template configuration, or a specific quantization artifact. Practical guidance: if you see unexpected token artifacts with Gemma 4 in LM Studio, verify the GGUF hash against HuggingFace, ensure the Gemma 4 Jinja chat template is loaded, and try a different quantization level. The June 20 sweep's caution about FreedomAISVR NVFP4 GGUFs is also relevant — verify GGUF provenance before attributing artifacts to the model architecture. Confidence: single-post anecdote, no quant specified, not reproduced. (source, June 20, 2026)

Gemma 4 12B internals reference: 262,144-token vocabulary costs ~2 GB VRAM before model weights load. A practitioner published a 15-part LLM internals series using Gemma 4 12B as its running concrete example, covering tokenization through production serving. The series is educational rather than a hardware benchmark, but it surfaces a practically useful number: Gemma 4's 262,144-token vocabulary occupies approximately 2 GB of VRAM in the embedding table before a single transformer weight block loads. This is a higher vocabulary cost than Qwen 3.5 (151,936 tokens) and is relevant for VRAM budgeting — a 12 GB GPU loading Gemma 4 12B at Q4_K_M (~7 GB weights) must also account for this ~2 GB embedding overhead. The series also traces the full tensor shape flow through a Gemma 4 12B forward pass and covers the KV cache growth rate at 128K context, which the author notes can exceed model weight size at high-VRAM-per-token settings. Confidence: educational reference; the vocabulary size and VRAM figure are derived from the official HuggingFace config and are verifiable. (source, June 20, 2026)

Open questions

What is the Intel Arc B70 SYCL throughput for Gemma 4 26B-A4B with the model fully on-device (i.e. a GPU with sufficient VRAM to avoid host offload)? The 40 tok/s figure involves Q8_0 at 25 GiB on what appears to be a 24 GB card; a fully-on-device benchmark at Q4_K_M or the UD-Q4_K_M variant would clarify the B70's true inference ceiling without memory pressure.
Which Gemma 4 quantization or GGUF variant triggers the incorrect-token artifact on the RTX 4090 in LM Studio? The June 21 report did not identify the specific quant. Narrowing this down — or confirming it is a chat template issue rather than a quantization issue — would help the community avoid the problematic variant.
What is the throughput for Gemma 4 E4B MLX on 16 GB M1 Mac Pro at full 132K context? The user confirmed it runs without memory pressure but did not measure tokens per second. Given E4B's small footprint, throughput at 132K should be well above the larger models — a concrete number would help Mac users choose between E4B and E2B for context-heavy tasks.

Sources

The Gemma-mentioning posts driving this update (June 21 sweep, newest first). All are fresh threads (score ~20, no captured comment threads at sweep time); treat individual numbers as first-look anecdotes rather than settled results:

Some llama.cpp B70 SYCL benchmarks (Jun 20, 2026 — Intel Arc B70 SYCL; llama-bench build dd4623a74/9640; Gemma4 12B Q8_0 tg128=32.43 tok/s pp512=1578 tok/s; Gemma4 26B.A4B Q8_0 tg128=40.13 tok/s pp512=1332 tok/s; Gemma4 E2B Q8_0 tg128=109.14 tok/s pp512=5662 tok/s; 26B-A4B Q8_0 at 25 GiB may exceed 24 GB GDDR6 and involve system RAM offload)
Gemma 4 26b a4b is genuinely the best model I have tried for language learning and scientific queries! (Jun 20, 2026 — Gemma 4 26B-A4B qualitative endorsement over Qwen 3.5/3.6 for language learning and scientific/health/biology/biochem queries; no hardware details; no benchmark; community invitation for non-coding use case comparisons)
I wrote a free 15-part series on LLM internals — real math, real tensor shapes, real hardware constraints. All grounded in Gemma 4 12B's actual config. (Jun 20, 2026 — educational series using Gemma 4 12B as running example; key practical finding: 262,144-token vocabulary costs ~2 GB VRAM before model weights load; covers KV cache growth at 128K context and full tensor shape flow through a 12B forward pass)
Qwen code companion on vscode marketplace - thoughts (Jun 20, 2026 — M1 Mac Pro 16 GB unified memory; Gemma 4 E4B MLX via LM Studio confirmed running at full ~132K context without memory pressure; no throughput figures; largest context window achievable on 16 GB Apple Silicon with the E4B family)
Local agent on 4090 - looking for LM Studio settings (Jun 20, 2026 — RTX 4090 24 GB, Ryzen 9 3900X, 32 GB DDR4 3600 MT; Gemma 4 faster than Qwen alternatives but produced incorrect tokens (underscore artifacts) with at least one quant; quant variant not specified; anecdotal; likely a chat template or GGUF provenance issue)

_Last updated: 2026-06-21 (June 21 sweep). Confidence: medium. Key findings: Intel Arc B70 SYCL — Gemma4 12B Q8_0 at 32 tok/s, 26B-A4B Q8_0 at 40 tok/s, E2B Q8_0 at 109 tok/s; Gemma 4 E4B MLX confirmed at full 132K context on 16 GB M1 Mac Pro; 26B-A4B community endorsement for language learning and scientific queries; RTX 4090 LM Studio quant artifact caution; Gemma 4 12B vocabulary costs ~2 GB VRAM. Next update fires when the daily Gemma 4 research cron flags notable new findings._

---

Field Notes — 2026-06-20

A weekly synthesis of what the r/LocalLLaMA community is reporting about Gemma 4 in real use. Curated from the latest Gemma-mentioning posts (4 new or updated since 2026-06-18, 406 hardware-mention entries total) and their top comment threads. Confidence is medium unless noted, since this is community signal rather than a controlled benchmark.

June 20 sweep, 2026-06-20 00:00 UTC: a quieter cycle with two practical reports and two notes of caution. The headline finding is DiffusionGemma 26B-A4B running at 290–700 tok/s on a consumer RTX 4090 via vLLM with an AWQ-INT4 quant — a data point that extends DiffusionGemma's throughput story to widely-owned hardware, though with significant caveats around context length, quality, and single-user limitations. A second post documents a beginner's end-to-end AMD GPU + Docker + llama.cpp setup using Gemma 4 12B and 26B-A4B as primary models, providing a reliable starting configuration for AMD ROCm users who have been switching from Ollama. On the caution side, the community flagged a batch of questionable NVFP4 GGUFs for Gemma 4 31B QAT uploaded by FreedomAISVR, with metadata inconsistencies suggesting the quants may not have been produced by genuine llama.cpp NVFP4 tooling. Finally, a user with a single RTX 5090 asks a useful calibration question: is Gemma 4 12B Unified the right fit for 128K context with a custom fine-tuning run?

DiffusionGemma 26B-A4B at 290–700 tok/s on RTX 4090 via vLLM AWQ-INT4: fast but not worth it for most use cases. A user ran DiffusionGemma 26B-A4B-it-AWQ-INT4 on a single RTX 4090 using a custom vLLM Docker build provided by NVIDIA along with a Gemma 4 tool/reasoning parser. First-prompt throughput reached 475 tok/s; sustained throughput ranges from 290 to 700 tok/s depending on output length (longer outputs run faster because the diffusion block amortizes better over more tokens). These numbers sit well below the Blackwell NVFP4 benchmark from the June 18 sweep (1,062 tok/s on an RTX PRO 6000), which is expected: the 4090's 24 GB VRAM limits context headroom and the AWQ-INT4 quant is a different format from NVFP4. The author reports four meaningful downsides relative to the standard Gemma 4 26B-A4B via llama.cpp: the model is single-user only (throughput degrades sharply when multiple requests are batched), responses are measurably worse (makes mistakes the regular 26B-A4B does not), context fades quickly (the model struggles to find a needle in a 8K haystack), and time to first token is slightly longer on short prompts because the diffusion block must process the full output window before releasing. The author's verdict: not worth it. The regular 26B-A4B running through llama.cpp still sustains over 300 tok/s when batched across multiple users, produces better responses, and handles longer context. DiffusionGemma's throughput advantage only shows at the single-user, single-generation extreme — and the quality tradeoff means that extreme is rarely worth targeting. Confidence: first-person configuration report, hardware and quant disclosed, throughput range is bounded by documented test conditions. (source, June 18, 2026)

AMD GPU + Docker + llama.cpp beginner setup with Gemma 4 12B and 26B-A4B: a working starting point for ROCm switchers. A user who switched from Ollama to llama.cpp and Open WebUI published a beginner's guide for the AMD GPU + Docker Compose path, using Gemma 4 as their primary model set. The tested configuration: `ghcr.io/ggml-org/llama.cpp:server-rocm` Docker image (with the equivalent NVIDIA CUDA image available as a direct swap for NVIDIA users), `gemma-4-12b-it-Q4_K_M.gguf` and `gemma-4-26b-A4B-it_UD_Q4_K_M.gguf` as the model pair. The author reports llama.cpp is noticeably faster and more stable than Ollama on the same AMD hardware, and that retaining Open WebUI for the frontend worked well with only minor configuration changes. This adds a practical data point for AMD users: the Gemma 4 26B-A4B at UD-Q4_K_M fits and runs correctly on AMD ROCm via the official Docker image, and the 12B at Q4_K_M is the recommended starting point for users with less VRAM. No specific throughput figures were provided; the post's value is as a step-by-step configuration reference rather than a benchmark. Caveats: performance on ROCm will depend heavily on GPU model and VRAM; users with RX 6600 XT 8 GB or similar should refer to the June 17 sweep (40–70 tok/s for the 12B with hybrid speculative decoding). Confidence: first-person configuration report, Docker images and quants disclosed, no throughput numbers. (source, June 18, 2026)

Community flags questionable NVFP4 GGUFs for Gemma 4 31B QAT from FreedomAISVR. A community member posted detailed concerns about a batch of NVFP4 GGUFs uploaded by FreedomAISVR on HuggingFace targeting Gemma 4 31B QAT and other models. The red flags: the README claims "Quantized with: llama.cpp build 537 (commit d2c6795)" — but build 537 is from 2023 while the cited commit hash is apparently recent, which is internally inconsistent. The quantization command listed is `llama-quantize --allow-requantize --tensor-type-file keep_q4.txt input.gguf output.gguf NVFP4`, but NVFP4 is not a valid quantization type in the current llama-quantize binary and there is no documented calibration dataset. The community thread did not reach a firm conclusion about whether the quants are functional or completely mislabeled, but the metadata inconsistencies are enough to warrant caution before running these files. Practical guidance: until the community verifies these GGUFs against a known-good NVFP4 reference, prefer official NVIDIA GGUFs (`nvidia/Gemma-4-31B-it-QAT-NVFP4`) or the Unsloth/Google QAT quants for Gemma 4 31B. This is a quality-control note, not a hardware benchmark. Confidence: community report, specific inconsistencies cited, outcome unverified. (source, June 18, 2026)

Gemma 4 12B Unified on a single RTX 5090: calibration for 128K context and fine-tuning. A user planning to fine-tune Gemma 4 12B Unified on ~300M tokens asks whether the 12B is the best fit for a single RTX 5090 with 128K context headroom. No one in the thread provided measured throughput figures before sweep time, so this entry is framed as an open calibration question with what the prior sweep data implies. The RTX 5090 ships with 32 GB GDDR7 VRAM. A Q8_0 quant of the Gemma 4 12B dense model fits in approximately 12–13 GB, leaving ~18 GB for KV cache — more than enough for 128K context at q8_0 KV cache. For fine-tuning on 300M tokens, a single RTX 5090 can support QLoRA on the 12B, but full fine-tune of a 12B model at BF16 (~24 GB) is tight and will require gradient checkpointing or model sharding. The user's instinct that the 12B is "almost comparable to Gemma 4 26B-A4B" on their tasks is consistent with community experience for assistant-level work — the 26B-A4B has more total knowledge capacity but the 12B dense runs faster and is easier to fine-tune on a single card. If 128K context is the hard requirement at interactive speed, the 12B at Q4_K_M or QAT is the recommended choice; the 26B-A4B at the same context depth would need 48 GB or more for comfortable KV cache headroom at high quant. Confidence: calibration analysis from prior sweep data; the user's specific use case is anecdotal and no throughput measurements were captured. (source, June 19, 2026)

Open questions

Does DiffusionGemma's throughput advantage over regular llama.cpp warrant it for any real production use case on consumer hardware? The RTX 4090 AWQ-INT4 data suggests the answer is no for batched or context-heavy workloads — but the single-user, short-context edge case (under 8K tokens, batch size 1) may still be the right fit for specific automation pipelines that tolerate lower quality for higher rate. A direct comparison on that specific workload would be useful.
Are the FreedomAISVR NVFP4 GGUFs for Gemma 4 safe to run? The metadata inconsistencies (impossible build number, non-existent NVFP4 quant type in llama-quantize) have not been resolved. The community needs someone with a Blackwell GPU to run perplexity checks against the official NVIDIA NVFP4 as a baseline before these files can be recommended.
What is the measured throughput for Gemma 4 12B QAT on an RTX 5090 at 128K context? A user is building this setup but no numbers were captured yet. Given the 5090's 1,792 GB/s memory bandwidth, the 12B QAT should outperform earlier 3090 and 4090 figures, but the specific 128K context ceiling has not been benchmarked.

Sources

The Gemma-mentioning posts driving this update (June 20 sweep, newest first). All are fresh threads (score ~20, no captured comment threads at sweep time); treat individual numbers as first-look anecdotes rather than settled results:

DiffusionGemma 26b on a 4090 at up to 475t/s... and some thoughts... (Jun 18, 2026 — RTX 4090, diffusiongemma-26B-A4B-it-AWQ-INT4 via vLLM custom Docker: 290–700 tok/s; single-user only; worse quality than regular 26B-A4B; context fades at 8K; author conclusion: not worth it vs llama.cpp 300+ tok/s batched)
2 weeks since the release of Gemma 4 12b Unified, how are we feeling about it? (Jun 19, 2026 — RTX 5090 32 GB; Gemma 4 12B Unified for 128K context + 300M token QLoRA fine-tune; calibration question, no throughput reported)
I resisted the llama.cpp hype. I was wrong. (Docker + AMD GPU Beginner's Guide) (Jun 18, 2026 — AMD GPU Docker Compose with llama.cpp:server-rocm image; gemma-4-12b-it-Q4_K_M and gemma-4-26b-A4B-it_UD_Q4_K_M confirmed working; no throughput numbers)
FreedomAISVR NVFP4 quants (Jun 18, 2026 — community flags inconsistent metadata on FreedomAISVR Gemma 4 31B NVFP4 GGUFs; impossible build number, non-existent NVFP4 quant type; use official NVIDIA NVFP4 GGUFs until resolved)

_Last updated: 2026-06-20 (June 20 sweep). Confidence: medium. Key findings: DiffusionGemma 26B on RTX 4090 AWQ-INT4 → 290–700 tok/s but worse quality and single-user only; AMD ROCm Docker + llama.cpp confirmed with 12B Q4_K_M and 26B-A4B UD-Q4_K_M; FreedomAISVR NVFP4 GGUFs flagged as suspicious — verify before use; RTX 5090 Gemma 4 12B Unified 128K context calibration question open. Next update fires when the daily Gemma 4 research cron flags notable new findings._

---

Field Notes — 2026-06-19

A weekly synthesis of what the r/LocalLLaMA community is reporting about Gemma 4 in real use. Curated from the latest Gemma-mentioning posts (5 new or updated since 2026-06-18, 401 hardware-mention entries total) and their top comment threads. Confidence is medium unless noted, since this is community signal rather than a controlled benchmark.

June 19 sweep, 2026-06-19 00:00 UTC: a compact cycle with three findings worth tracking. The headline is DiffusionGemma 26B running on a consumer RTX 4090 via AWQ-INT4 quantization in a custom vLLM Docker — the first community report to bring the DiffusionGemma throughput story to consumer-class hardware — but the author's verdict is sobering: the throughput peaks at 475 tok/s and ranges 290–700 tok/s, yet comes with hard limits on context length, quality, and batching that make it inferior to the regular 26B-A4B via llama.cpp for most single-user setups. A separate post documents the AMD ROCm + llama.cpp Docker path as a confirmed working stack for Gemma 4 12B and 26B-A4B, useful for users still on Ollama who want better throughput. The cycle also surfaces a community quality alert on a set of self-described NVFP4 GGUFs appearing on HuggingFace — the metadata in several FreedomAISVR quant files contains build-number and quantization-type inconsistencies that the community cannot reconcile, raising provenance questions.

DiffusionGemma 26B on RTX 4090 (AWQ-INT4): 290–700 tok/s confirmed, but with hard consumer-GPU limits. A user ran `nvidia/diffusiongemma-26B-A4B-it-AWQ-INT4` in a custom vLLM Docker container on what appears to be a standard RTX 4090 (24 GB VRAM), using the gemma 4 tool/reasoning parser included in the distribution. First-prompt throughput reached 475 tok/s; across a session the range was 290–700 tok/s, with longer outputs coming out faster (diffusion's block-parallel advantage). This is the first community data point for DiffusionGemma on a consumer GPU, extending the prior dataset (RTX PRO 6000 Blackwell 96 GB at 1,062 tok/s and H100 at 763 tok/s) to something a home builder can actually buy. The throughput headline is real, but the author's conclusion is notably negative: "Is it worth bothering with? I don't think so." The specific limits reported: (1) single-user only — batching degrades throughput; (2) noticeably weaker responses — "makes mistakes the regular 26ba4b doesn't"; (3) poor long-context retrieval — "can't find a needle in a haystack to save its life, context fades quick"; (4) context capped at roughly 8k tested (the author tried going higher but it was not practical); (5) slightly slower time-to-first-token than autoregressive on short prompts. The author's comparison: "The regular 26ba4b running through llama.cpp still nails down over 300t/s when batched." Practical guidance: if you are building a single-user local chat assistant that does not require long-context retrieval and accepts some quality regression, DiffusionGemma on a 4090 is now a demonstrated path. If you need reliable factual accuracy, long-context performance, or multi-user serving, the standard 26B-A4B via llama.cpp is the better choice even at equivalent throughput figures. Confidence: single-author first-look report, no captured comments, no controlled comparison; treat as an anecdotal early exploration. (source, June 18, 2026)

AMD GPU + llama.cpp ROCm Docker: confirmed working stack for Gemma 4 12B and 26B-A4B. A practitioner who used Ollama for months switched to llama.cpp with ROCm Docker for AMD GPUs and reports a clear improvement in speed and stability when running Gemma 4 models. The tested setup: Linux + Docker Compose, AMD GPU (ROCm-compatible, specific card not disclosed), `gemma-4-12b-it-Q4_K_M` and `gemma-4-26b-A4B-it_UD_Q4_K_M`. Docker image used: `ghcr.io/ggml-org/llama.cpp:server-rocm`. The author kept Open WebUI as the frontend. Key finding: "llama.cpp is faster, more stable, and just feels better to use all around." No specific token-per-second numbers were reported, which limits the direct hardware value of this post, but it confirms the ROCm Docker path is functional and reasonably accessible for users without CUDA cards. For AMD users on Ollama who have been debating the switch, this is a direct endorsement with a working Docker Compose template. Confidence: practitioner report with working setup disclosed; throughput numbers absent so relative gain unknown. (source, June 18, 2026)

Community alert: FreedomAISVR NVFP4 GGUFs have unresolvable metadata inconsistencies. A community member raised a quality concern about a set of recently published NVFP4-labeled GGUF files on HuggingFace, including `FreedomAISVR/Gemma-4-31B-it-QAT-NVFP4-GGUF`. Three specific issues were flagged. First, the README states "Quantized with: llama.cpp build 537 (commit d2c6795)" — but llama.cpp build 537 dates to 2023, while commit d2c6795 was described as made "just 5 hours ago," an impossible combination. Second, the quantization command shown in the README uses `llama-quantize ... NVFP4` as the output quantization type, but `NVFP4` is not a documented output format in `llama-quantize` as of the time of the post. Third, no calibration dataset is mentioned (NVFP4 quantization should require one). The community question is open: "Are these quants even real?" — meaning it is genuinely unclear what quantization these files actually contain under the hood. Recommendation for Gemma 4 users: treat `FreedomAISVR/Gemma-4-31B-it-QAT-NVFP4-GGUF` and related files from this publisher as unverified until the provenance is established. For NVFP4-level throughput on Gemma 4 31B, the well-documented paths remain `nvidia/Gemma-4-26B-A4B-NVFP4` and `nvidia/diffusiongemma-26B-A4B-it-NVFP4` via the NVIDIA-provided vLLM Docker. Confidence: community observation; no authoritative answer provided in the post (no captured comments). (source, June 18, 2026)

Community framing: DiffusionGemma described as "3x faster but 1.5x dumber" in SLM routing speculation. A speculative thread discussed whether a coordinator model plus a fleet of task-specialized SLMs could outperform a single large model on sequential agentic tasks. The poster cited DiffusionGemma as an example of the speed-quality trade-off, framing it informally as "something like 3x faster but being 1.5x dumber than base Gemma 4." This is not a benchmark — it is one user's summary of their reading of prior community posts — but it is representative of how the broader community is calibrating DiffusionGemma's value proposition at this point. No comments were captured at sweep time, so the thread's reception is unknown. Worth noting as a sentiment snapshot: as of mid-June 2026, the dominant community read on DiffusionGemma is that it is substantially faster for single-user generation but noticeably weaker on quality, and the tradeoff is not yet considered worth it for most practitioners. Confidence: anecdotal community framing, no benchmark. (source, June 18, 2026)

mistral.rs v0.8.10 adds OpenAI-compatible Agent Skills support for local models including Gemma 4. The mistral.rs project released v0.8.10 with a `/v1/skills` endpoint that brings Agent Skills to locally-hosted open models. The API is OpenAI-compatible, allowing drop-in replacement of frontier-model agent pipelines. Features include skill packaging (domain instructions + scripts), file attachment via `/v1/files`, and model-sent file responses. Prebuilt binaries are available for NVIDIA CUDA, Apple Silicon, and CPU. The post tags Gemma as a supported model family but does not provide Gemma 4-specific benchmarks or configuration details. Practical relevance: this lowers the barrier for running Gemma 4 in OpenAI-compatible agent frameworks without a proxy layer. Confidence: official developer release announcement; no Gemma 4 performance data provided. (source, June 18, 2026)

Open questions

What are the FreedomAISVR NVFP4 GGUFs actually quantized as? If the `NVFP4` flag in the README is not valid in `llama-quantize`, then these files may be standard Q4 or another format incorrectly labeled. Community verification via hash comparison or direct inspection would resolve the provenance question.
Can DiffusionGemma on a 4090 run longer context than ~8k? The initial tester stopped at 8k. It is possible that with GPU memory management tuning (lower `--gpu-memory-utilization` to reserve more headroom for the KV cache) longer contexts are achievable, at some throughput cost. No follow-up data yet.
Does the AMD ROCm llama.cpp path support Gemma 4's multimodal features (vision, audio)? The June 19 report confirms text-mode Gemma 4 12B and 26B-A4B on ROCm Docker, but the `--mmproj` path and audio modality with `ghcr.io/ggml-org/llama.cpp:server-rocm` have not been documented in community posts.

Sources

The Gemma-mentioning posts driving this update (June 19 sweep, newest first). All are fresh threads (score ~20, no captured comment threads at sweep time); treat individual numbers as first-look anecdotes rather than settled results:

DiffusionGemma 26b on a 4090 at up to 475t/s... and some thoughts... (Jun 18, 2026 — RTX 4090, diffusiongemma-26B-A4B-it-AWQ-INT4 via vLLM Docker: 290–700 tok/s, first prompt 475 tok/s; single-user only, weaker quality, limited to ~8k context, poor NIAH; author verdict: not worth it vs llama.cpp 26B-A4B)
I resisted the llama.cpp hype. I was wrong. (Docker + AMD GPU Beginner's Guide) (Jun 18, 2026 — AMD GPU + ROCm Docker + llama.cpp: Gemma 4 12B Q4_K_M and 26B-A4B UD_Q4_K_M confirmed working; image ghcr.io/ggml-org/llama.cpp:server-rocm; no tok/s numbers)
FreedomAISVR NVFP4 quants (Jun 18, 2026 — community alert: FreedomAISVR/Gemma-4-31B-it-QAT-NVFP4-GGUF has contradictory build numbers, undocumented NVFP4 quant type in llama-quantize, missing calibration dataset; provenance unverified)
SLM's and Diffusion? (Jun 18, 2026 — speculative discussion of SLM routing with DiffusionGemma; community framing as "3x faster but 1.5x dumber" than Gemma 4; no benchmark)
Run Agent Skills with mistral.rs v0.8.10 (Jun 18, 2026 — mistral.rs v0.8.10 OpenAI-compatible /v1/skills for local models incl. Gemma 4; prebuilt CUDA/Apple Silicon/CPU binaries; no Gemma 4-specific benchmarks)

Last updated: 2026-06-19 (June 19 sweep). Confidence: medium. Key findings: RTX 4090 DiffusionGemma 26B-A4B AWQ-INT4 via vLLM → 290–700 tok/s but single-user, weaker quality, ~8k context limit, not recommended over standard llama.cpp; AMD ROCm Docker llama.cpp confirmed for Gemma 4 12B + 26B-A4B; FreedomAISVR NVFP4 GGUFs have unresolvable metadata inconsistencies — treat as unverified. Next update fires when the daily Gemma 4 research cron flags notable new findings.

---

Field Notes — 2026-06-18

A weekly synthesis of what the r/LocalLLaMA community is reporting about Gemma 4 in real use. Curated from the latest Gemma-mentioning posts (6 new or updated since 2026-06-15, 396 hardware-mention entries total) and their top comment threads. Confidence is medium unless noted, since this is community signal rather than a controlled benchmark.

June 18 sweep, 2026-06-18 01:00 UTC: a compact but high-signal cycle with two hardware-rich data points and one architectural analysis. The most significant finding is a side-by-side benchmark of DiffusionGemma vs Gemma 4 on an RTX PRO 6000 Blackwell (96 GB VRAM) showing a 6.73x throughput advantage for the diffusion variant at NVFP4 precision, corroborating earlier H100 results and adding a new consumer-accessible endpoint. A separate post demonstrates Gemma 4 E2B running in-browser at 255 tok/s via community-optimized WebGPU kernels on an M4 Max — the fastest confirmed browser inference figure for any Gemma 4 model. A third data point shows the dual-GPU PCIe bandwidth trap: a user running Gemma 4 31B Q6_K across an RTX 4080 and RTX 5080 on mismatched PCIe slots gets only 26–28 tok/s output, constrained by the 4080's x4 slot rather than GPU compute. Rounding out the sweep, a practitioner shares a tested reasoning-hardening system prompt for Gemma 4 12B QAT that reduces cognitive-bias drift on trick questions, and a discussion examines whether DiffusionGemma's bidirectional attention block gives it a structural advantage over autoregressive Gemma 4 for tool-call JSON repair.

DiffusionGemma vs Gemma 4 on RTX PRO 6000 Blackwell (96 GB VRAM, NVFP4): 6.73x throughput advantage confirmed locally. A user ran a controlled side-by-side benchmark on a single NVIDIA RTX PRO 6000 Blackwell (96 GB VRAM, TDP 600 W) with AMD Ryzen 9 9950X and 92 GB RAM, serving both `nvidia/Gemma-4-26B-A4B-NVFP4` and `nvidia/diffusiongemma-26B-A4B-it-NVFP4` simultaneously via vLLM at `--gpu-memory-utilization 0.42` so the cards share GPU without interference. Fixed seed (1234), 10 runs, up to 29k tokens per run. Results: standard Gemma 4 26B-A4B at 157 tok/s; DiffusionGemma 26B-A4B at 1,062 tok/s — a 6.73x average speedup. The author notes that single-user local inference is exactly where diffusion's architecture advantage shows up most: in the cloud with batched users, autoregressive models recover much of the throughput gap. This extends the prior dataset: an H100 benchmark from the June 12 sweep showed 218 vs 763 tok/s (3.5x) with a factual accuracy cost; here the Blackwell at NVFP4 shows a larger 6.7x advantage, reflecting the higher-precision quantization format and the card's native FP4 support. Practical note: the RTX PRO 6000 Blackwell is not a consumer GPU (96 GB VRAM, $6K+ street price), so these numbers bound what very high-end workstations can do rather than what home builders should expect. Confidence: well-controlled benchmark, fixed seed, disclosed hardware and CUDA version. (source, June 17, 2026)

Gemma 4 E2B at 255 tok/s in-browser via WebGPU: community-optimized kernels released. The webml-community team released a demo and custom WebGPU kernels for `google/gemma-4-E2B-it-qat-mobile-transformers` that reach approximately 255 tok/s on an Apple M4 Max — running entirely in the browser with no server required. The kernels were co-developed with Google's Fable 5 AI before that service was shut down. This is notable for two reasons: first, it establishes a credible WebGPU throughput ceiling for the E2B model tier on current flagship Silicon; second, it confirms that a QAT mobile variant of Gemma 4 E2B is publicly available and browser-runnable without Ollama or llama.cpp. Caveats: the 255 tok/s figure is on an M4 Max (the high-end Mac chip); typical laptop Apple Silicon or Windows GPU performance will be lower. The demo is available on HuggingFace Spaces. Confidence: official community release with reproducible demo, not an anecdotal report. (source, June 17, 2026)

Dual-GPU PCIe mismatch trap: RTX 4080 + RTX 5080 running Gemma 4 31B Q6_K achieves only 26–28 tok/s. A user building a friend's PC reported running Gemma 4 31B Q6_K across two cards with llama.cpp on Windows, with the RTX 4080 sitting on a PCIe 4.0 x4 slot (not x16) and the RTX 5080 on a full x16 slot. The result in split-mode layer was 26–28 tok/s output (after community tuning) and 659–759 tok/s prompt processing — below what either card alone at full bandwidth could deliver. The split-mode experiments confirmed the expected hierarchy: layer split at 26–28 tok/s, row split at ~12 tok/s, tensor split at ~6 tok/s. The user flagged PCIe lane starvation as the likely cause; the 4080 on x4 can saturate its interconnect at the inter-GPU tensor transfer rate, capping generation throughput for a layer-split dense 31B model. Practical guidance: for Gemma 4 31B on a dual-GPU system, the VRAM constraint (total Q6_K ≈ 25 GB) matters less than bandwidth parity — mismatched PCIe slots throttle the slower card. If you cannot put both cards on x16, consider using only the x16-slotted card with a smaller quantization (Q4_K_M fits in a single 24 GB 3090). Confidence: single-author configuration report, hardware and slot details disclosed. (source, June 17, 2026)

Gemma 4 12B QAT reasoning-hardening system prompt: a tested pattern for reducing cognitive-bias drift. A practitioner who has been using Gemma 4 12B QAT as a daily assistant shared a system prompt designed to reduce "cognitive bias drift" on trick questions — situations where the model defaults to the "standard" or "typical" interpretation of a problem rather than working from the stated premises. The core instruction: "Avoid cognitive bias in answers. Base answers strictly on the premises given. If you find yourself thinking 'usual', 'standard', 'typical' or 'classical', you are victim of cognitive bias and all analysis derived from it is VOID and needs closer re-examination." The author reports that after many iterations this now reliably triggers slower, more careful reasoning for ambiguous inputs while avoiding overthinking on simple ones. No hardware or quantization specifics were shared, but the 12B QAT is the recommended starting point for this use case since it fits in ~8–9 GB VRAM with headroom. This is a practical quality-of-life improvement for users who have found Gemma 4 12B works well at assistant-level tasks but occasionally satisfices on edge cases. Confidence: anecdotal practitioner report with no benchmark, but the prompt is reproducible and freely shared. (source, June 16, 2026)

DiffusionGemma tool-call structural argument: bidirectional attention may fix JSON repairs that autoregressive decoding cannot. A discussion thread examined whether DiffusionGemma's parallel 256-token block generation gives it a structural advantage for tool-calling accuracy, even though its factual quality scores below standard Gemma 4. The argument: autoregressive models commit tokens left-to-right, so a single bad brace or field name in a tool-call JSON payload is irreversible within the same generation pass. DiffusionGemma generates the entire 256-token block with bidirectional attention and refines tokens multiple passes before finalizing — meaning a malformed field name early in a JSON object can potentially be corrected when the model "sees" the closing structure. This is theoretically interesting but unverified by controlled benchmark at the time of this sweep — the post is a structural argument, not a measurement. The thread noted that Google's own guidance is to use Gemma 4 for production and DiffusionGemma for speed, but the tool-call case may be an exception worth testing. Confidence: theoretical argument, no benchmark. Worth watching for follow-up data. (source, June 16, 2026)

Open questions

Does DiffusionGemma's bidirectional attention actually produce higher valid-tool-call rates than autoregressive Gemma 4? The June 18 thread laid out a compelling structural argument but no one has run a controlled tool-calling benchmark comparing the two architectures on the same prompts. This would be the most actionable next step for teams considering DiffusionGemma for agentic pipelines.
What is the realistic WebGPU throughput for Gemma 4 E2B on mid-range hardware (Intel Arc, AMD integrated, NVIDIA RTX 4060)? The 255 tok/s figure is on an M4 Max, which has exceptional memory bandwidth. The same kernels on a mid-range GPU could be substantially slower or encounter driver compatibility issues not present on Metal.
Is the Gemma 4 31B Q6_K dual-GPU PCIe bottleneck avoidable with a different tensor-parallel approach? Users with mismatched slot widths may get better results by treating the x4 card as a secondary offload for KV cache rather than half the compute budget, or by disabling the x4 card and running the model on the x16 card alone at a lower quantization.

Sources

The Gemma-mentioning posts driving this update (June 18 sweep, newest first). Most are fresh threads (score ~20, no captured comment threads at sweep time); treat individual numbers as first-look anecdotes rather than settled results:

DiffusionGemma vs Gemma 4 Over 6x Faster on a Single RTX 6000 Pro NVFP4 (Jun 17, 2026 — RTX PRO 6000 Blackwell 96 GB VRAM: DiffusionGemma 26B-A4B NVFP4 1,062 tok/s vs Gemma 4 26B-A4B NVFP4 157 tok/s; 6.73x average speedup; 10 runs, fixed seed)
Gemma 4 E2B running in-browser at 255 tok/s using WebGPU kernels written by Fable 5 (Jun 17, 2026 — M4 Max: gemma-4-E2B-it-qat-mobile-transformers via community WebGPU kernels → 255 tok/s in-browser; demo + kernels on HuggingFace Spaces)
Can i get a reality check on this inference speed? (Jun 17, 2026 — RTX 4080 (x4 slot) + RTX 5080 (x16): Gemma 4 31B Q6_K, layer split → 26–28 tok/s output; PCIe bandwidth mismatch flagged as bottleneck)
Why might DiffusionGemma be better at tool calls than its benchmark quality suggests (Jun 16, 2026 — architectural analysis: bidirectional 256-token block may enable JSON repair; no benchmark, theoretical argument only)
Gemma 12b - Reasoning hardening instructions (Jun 16, 2026 — 12B QAT daily assistant: tested system prompt to reduce cognitive-bias drift on trick questions; no hardware data)
Qwen3.6 or Gemma-4 or ?? for direct OCR of page images (Jun 17, 2026 — practitioner using Gemma-4 for messy PDF contract parsing with handwritten annotations and conflicting provisions; reports "doing okay"; local-only requirement)

_Last updated: 2026-06-18 (June 18 sweep). Confidence: medium. Key hardware data: RTX PRO 6000 Blackwell → DiffusionGemma 1,062 vs Gemma 4 157 tok/s (6.73x); M4 Max → Gemma 4 E2B WebGPU 255 tok/s in-browser; RTX 4080+5080 PCIe x4 trap → 31B Q6_K 26–28 tok/s. Next update fires when the daily Gemma 4 research cron flags notable new findings._

---

Field Notes — 2026-06-17

A weekly synthesis of what the r/LocalLLaMA community is reporting about Gemma 4 in real use. Curated from the latest Gemma-mentioning posts (5 new or updated since 2026-06-16, 390 total) and their top comment threads. Confidence is medium unless noted, since this is community signal rather than a controlled benchmark.

June 17 sweep, 2026-06-17 00:00 UTC: a smaller cycle with one strong throughput result and two threads worth tracking for their architectural implications. The headline number is AMD RX 6600 XT 8 GB hitting 40–70 tok/s on Gemma 4 12B QAT using a hybrid speculative decoding strategy that combines MTP with an ngram fallback — the highest single-GPU throughput reported on 8 GB VRAM for this model so far, and the first post to document the ngram-mod + draft-mtp combination in detail. The cycle also delivered a community-sourced system prompt aimed at suppressing cognitive-bias shortcuts in Gemma 4 12B's reasoning, and a theoretical analysis of why DiffusionGemma's bidirectional block generation might improve valid tool-call rates even though its base quality is lower than Gemma 4 — an open question with no benchmark yet. A fourth thread catalogued the daily-driver model choices for a user with an RX 9070 XT 16 GB, illustrating a common dilemma: MoE at high quant vs dense at low quant on mid-tier VRAM.

AMD RX 6600 XT 8 GB: Gemma 4 12B QAT reaches 40–70 tok/s with hybrid MTP + ngram speculative decoding. A user running Gemma 4 12B QAT on an RX 6600 XT 8 GB (Ryzen 7 5700x, 32 GB DDR4-3600) reports throughput that consistently exceeds 40 tok/s, frequently reaches 50 tok/s, and has peaked at a single-session average of 70 tok/s. The configuration uses a hybrid speculative decoding strategy — MTP draft head combined with an ngram model — with `--spec-type draft-mtp,ngram-mod`. Key tuning: `--spec-draft-p-min 0.95` (only accept draft tokens with 95%+ confidence), `--spec-draft-n-max 3`, `--spec-ngram-mod-n-match 24`, `--spec-ngram-mod-n-min 8`, `--spec-ngram-mod-n-max 32`. Full configuration:

``` llama-server \ --model ~/llamacpp/models/gemma-4-12B-it-qat-UD-Q4_K_XL.gguf \ --model-draft ~/llamacpp/models/gemma-4-12B-it-Q4_0-MTP.gguf \ --temperature 0.5 \ --spec-type draft-mtp,ngram-mod \ --spec-draft-n-max 3 \ --spec-draft-p-min 0.95 \ --spec-ngram-mod-n-match 24 \ --spec-ngram-mod-n-min 8 \ --spec-ngram-mod-n-max 32 \ -fitc 40000 -t 8 --parallel 1 --flash-attn on \ --cache-type-k q8_0 --cache-type-v q8_0 \ --reasoning-budget 3072 --reasoning on \ --slot-save-path ~/llamacpp/contexts/ --defrag-thold 0.1 ```

MTP acceptance rates range from 64% to 99%, with an average around 85%. The wild throughput variation (40 to 70 tok/s across prompts) is explained by acceptance rate: when the draft model is highly aligned with the main model on a given token distribution, multiple tokens are accepted per step, yielding burst-mode throughput well above baseline. The user reports the highest acceptance rate they have seen on MTP is 99%, and that "dramatic increases in performance by upping --spec-draft-p-min" confirms the intuition that filtering out low-confidence drafts improves net throughput even at the cost of more fallbacks. Practical note for other AMD/RX 6600 XT users: the 8 GB fit requires the Q4_K_XL quant (~8–9 GB); context 40,000 with q8_0 KV cache is at the edge of available VRAM and may require reducing `-fitc` if other applications are running. The MTP draft model (`Q4_0-MTP`) is a secondary requirement; without it, fall back to ngram-only for lower but still usable gains. Confidence: single-author hardware report, full configuration disclosed, numbers are consistent with prior AMD/ROCm data points. (source, June 16, 2026)

Gemma 4 12B QAT as a daily driver: reasoning hardening via anti-bias system prompt. A user running Gemma 4 12B QAT as their primary local assistant shares a system prompt developed through iterative testing aimed at suppressing the model's tendency to fill in gaps with "usual", "standard", or "typical" defaults when a problem is novel. The core principle: the model is instructed to treat any reference to "standard," "typical," or "classical" patterns as a signal that it may be applying cognitive bias, and to void and re-examine the derivation from that point. Key lines from the prompt: "Avoid cognitive bias in answers. Base answers strictly on premises given. If you find yourself thinking 'usual', 'standard', 'typical' or 'classical', you are victim of cognitive bias and all analysis derived from it is VOID." The user reports this noticeably improves trick-question handling and reduces cases where the model confidently answers with an implicit assumption that wasn't stated. The user frames 12B QAT as fast enough for a daily driver — "I don't have to go make coffee while it thinks" — while being small enough to leave VRAM headroom for other tasks. Practical caveat: the prompt is unsolicited personal engineering; results will vary by task type, and prompts that aggressively guide reasoning can suppress helpful defaults on well-defined tasks. This is anecdotal tooling, not a benchmark. Confidence: single-author practitioner report. (source, June 16, 2026)

Hypothesis: DiffusionGemma's bidirectional block generation may improve tool-call valid-JSON rates despite lower base quality. A community post argues that the 4× speed headline misses the structurally interesting property of DiffusionGemma's diffusion-based generation: it generates a 256-token block in parallel with bidirectional attention, allowing it to revise tokens it already placed before the block is finalized. Standard autoregressive (AR) decoding commits to each token left-to-right — once a brace or field name is emitted, it is fixed, and a single wrong token in a structured output (such as a JSON tool call) requires either failure or a post-processing repair step. The hypothesis is that DiffusionGemma's bidirectional canvas means "a malformed tool call is usually one bad token in an otherwise fine sequence, and a model that can look back over the whole block and self-correct has a structural shot at fixing it that a left-to-right model never gets." The author explicitly frames this as an open question: "Has anyone actually benched this for tool calling to see if the bidirectional canvas fixes broken JSON, or does the lower base quality mean it just generates well-structured output less often?" No empirical data exists yet. The prior context from the June 14 sweep is relevant: DiffusionGemma's base quality is lower than Gemma 4 per Google's own documentation, and the MLX throughput on Apple Silicon was only 5.4 tok/s vs. 38 tok/s for regular Gemma 4. But the argument about structured-output validity rate is independent of speed and may be testable. Worth tracking if a tool-calling comparison surfaces. Confidence: theoretical community argument, no benchmark. (source, June 16, 2026)

RX 9070 XT (16 GB VRAM) daily-driver choices: MoE 26B-A4B QAT vs dense 12B QAT, and the MoE-vs-dense quant trade-off. A user with an RX 9070 XT 16 GB (16 GB VRAM) reports their current model slate: Gemma 4 26B-A4B QAT as the MoE daily driver (with IQ4_XS as an alternative for roughly 2× speed), Gemma 4 12B QAT as the dense option (noting that Q6_K or Q8_0 also fit comfortably), and Gemma 4 31B IQ3_XXS as the dense large option (which does not fit at IQ4_XS). The user asks a question that reflects a general uncertainty in the community: when comparing a larger MoE model at a lower quant to a smaller dense model at a higher quant, is there a rule of thumb for which wins on quality? The community observation from prior sweeps gives partial guidance: MoE models typically have more total parameters but activate only a fraction per token, so quantization degrades the active-parameter path more severely per bit than it does on a dense model where all parameters are used. That said, a well-quantized MoE 26B at IQ4_XS still substantially outperforms a well-quantized dense 12B on most tasks because the total knowledge capacity is much higher. The tradeoff is sharpest at the extreme ends: at IQ3_XXS, MoE models can drop quality noticeably while a dense model at Q4 may hold up better on factual recall. The user's observation that "QAT seems so much better" for the 26B-A4B aligns with the June 15 benchmark data (QAT build: 53 tok/s and 13.26 GiB vs non-QAT: 41 tok/s and 15.83 GiB). Confidence: single-user model-selection report; hardware is disclosed. (source, June 16, 2026)

Open questions

Does the hybrid MTP + ngram-mod strategy scale to other Gemma 4 model sizes? The RX 6600 XT result is specifically for the 12B QAT. It is not yet documented whether the same `--spec-type draft-mtp,ngram-mod` configuration produces comparable acceptance-rate gains on the 26B-A4B or 31B dense models, where the MTP draft head is from a different size class.
Does DiffusionGemma's bidirectional canvas actually improve valid tool-call rates? The hypothesis is structurally sound, but no benchmark comparing Gemma 4 vs DiffusionGemma on valid-JSON tool-call rates exists yet. Given DiffusionGemma's lower base quality, the result could go either way.
What throughput does Gemma 4 12B's encoder-free audio input achieve in a streaming pipeline? Community interest in using the 12B's native audio support for low-latency speech-to-speech is growing, but no one has reported a working turnkey streaming ingestion setup. The encoder-free architecture is promising for latency; practical tooling remains thin.

Sources

The Gemma-mentioning posts driving this update (June 17 sweep, newest first). All are fresh threads (score ~20, no captured comment threads at sweep time); treat individual numbers as first-look anecdotes rather than settled results:

Need help understanding how spec decode affects token throughput (Jun 16, 2026 — RX6600XT 8 GB, Ryzen 7 5700x, Gemma 4 12B QAT UD-Q4_K_XL + MTP draft, hybrid spec-type draft-mtp,ngram-mod, spec-draft-p-min 0.95 → 40–70 tok/s, MTP acceptance 64–99% avg 85%)
Gemma 12b - Reasoning hardening instructions (Jun 16, 2026 — anti-cognitive-bias system prompt for Gemma 4 12B QAT daily driver; void-and-recheck heuristic for "standard/typical/classical" shortcuts)
Why might DiffusionGemma be better at tool calls than its benchmark quality suggests (Jun 16, 2026 — bidirectional 256-token block generation enables token-level self-correction before finalization; hypothesis that valid JSON tool-call rate may exceed AR Gemma 4 despite lower base quality; no benchmark)
out of these models with these quants, which is the best everyday assistant? (Jun 16, 2026 — RX 9070XT 16 GB; 26B-A4B QAT + 12B QAT + 31B IQ3_XXS; MoE-vs-dense quant trade-off question; QAT quality advantage over standard quants confirmed subjectively)
Gemma 4 12b audio capabilities (Jun 16, 2026 — community question about use cases for Gemma 4 12B encoder-free audio input; no implementations reported yet)

Last updated: 2026-06-17 (June 17 sweep). Confidence: medium. Key findings: AMD RX6600XT 8 GB → 40–70 tok/s on Gemma 4 12B QAT with hybrid MTP+ngram spec decode (acceptance rate 64–99%, avg 85%); community anti-bias reasoning system prompt for 12B QAT; DiffusionGemma bidirectional tool-call reliability hypothesis (no benchmark yet); RX 9070XT 16 GB MoE-vs-dense quant trade-off discussion. Next update fires when the daily Gemma 4 research cron flags notable new findings.

---

Field Notes — 2026-06-16

A weekly synthesis of what the r/LocalLLaMA community is reporting about Gemma 4 in real use. Curated from the latest Gemma-mentioning posts (8 new or updated since 2026-06-15, 381 total) and their top comment threads. Confidence is medium unless noted, since this is community signal rather than a controlled benchmark.

June 16 sweep, 2026-06-16 01:00 UTC: a lighter cycle by volume but with meaningful tooling and ecosystem news. The headline is a native mobile framework integration: React Native ExecuTorch now ships Gemma 4 with GPU acceleration — Vulkan on Android, MLX on Apple Silicon — making Gemma 4 first-class in fully offline React Native apps without requiring a native Swift or Kotlin inference pipeline. On the quantization front, a community member published independent QAT-aligned GGUFs for Gemma 4 12B and 31B using a refined error-minimizing search process that claims competitive KLD with Unsloth's UD-Q4_K_XL builds, giving users a second non-Unsloth path to near-QAT-quality inference. Two practical capability reports round out the cycle: a developer confirmed that Gemma 4 E4B generates compilable macOS apps on an 8 GB MacBook Air using a deterministic repair-loop pattern that compensates for small-model hallucinations, and a user with a dual RTX 3090 (48 GB VRAM) built a three-tier hybrid agent where a frontier model plans and Gemma 4 31B executes — illustrating a reusable workflow pattern for users who want frontier-quality design decisions without paying cloud costs on every execution token.

React Native ExecuTorch now runs Gemma 4 on Android (Vulkan) and Apple Silicon (MLX). The react-native-executorch library has been updated to support Gemma 4 with full GPU acceleration: the Vulkan delegate on Android and the MLX delegate on Apple Silicon (both iOS and macOS). The integration is fully offline — no network call, no cloud inference. This is the first time a major React Native inference framework has shipped first-class Gemma 4 support with native GPU delegation on both major mobile platforms. No specific throughput numbers were disclosed in this announcement, but practical context from the June 15 sweep is relevant: a Pixel 10 Pro under Termux achieved 1.3 tok/s on the Gemma 4 12B at Q3, while the E2B and E4B tiers are faster and fit more comfortably on current phone VRAM. ExecuTorch via Vulkan may deliver higher throughput than llama.cpp via Vulkan on the same device, but no direct comparison exists yet. Practical implication: React Native developers can now target Gemma 4 on-device without building a native Swift or Kotlin inference pipeline. Confidence: tooling release announcement, no benchmarks reported. (source, June 15, 2026)

Community publishes independent QAT-aligned GGUFs for Gemma 4 12B and 31B via error-minimizing quantization. A community member released new GGUFs for both Gemma 4 12B (`idkwhattoputherenow/gemma-4-12B-it-qat-q4_0-maxerr`) and Gemma 4 31B (`idkwhattoputherenow/gemma-4-31B-it-qat-q4_0-maxerr`) on HuggingFace, using an independent quantization approach rather than a standard imatrix. The process: starting from two typical Q4_0 seed configurations, the quantizer performs a full round-trip to F16, measures max error per layer, then searches locally until error stops improving. The author reports the resulting GGUFs achieve similar KLD to Unsloth's proprietary UD-Q4_K_XL-super-mega-heccin builds — which Unsloth themselves used as the reference for evaluating QAT quality. The author is candid that they do not know what package Google used for their QAT process and invites anyone with that information to contribute a PyTorch port. Practical significance: this gives users a second source of QAT-aligned GGUFs independent of Unsloth's pipeline, useful if Unsloth's checkpoint is unavailable or if users want to reproduce results from a different starting point. Note: "q4_0" in the filename refers to the bit width used during the error round-trip search, not necessarily the final quantization level of the released GGUF. Confidence: community author, methodology is disclosed, independent KLD verification not yet published. (source, June 15, 2026)

Gemma 4 E4B generates compilable macOS apps on an 8 GB MacBook Air — via deterministic repair loops. A developer released "Ironsmith," an open-source macOS app generator that works with models as small as Gemma 4 E2B. The system runs on an 8 GB MacBook Air. The key architectural insight: rather than expecting a small model to produce flawless code in one pass, Ironsmith generates the entire app in a single model call, then applies a cascade of deterministic formatting, linting, and compilation repair steps until the output compiles. This compensates for the hallucinations and syntax errors that are common in small-model code generation without requiring a human reviewer between iterations. The author is explicit that "these little models are pretty decent at writing full apps if you fix all of their hallucinations and syntax errors." Gemma 4 26B produces higher quality output than E4B; E2B is the practical floor on 8 GB hardware. The demo video uses GPT 5.4 mini (too slow for video with a local model), but the author reports the same application works with Gemma 4 E4B. No throughput figures were reported for the local path. Practical read: the deterministic-repair pattern generalizes — small models can succeed at structured generation tasks when correctness recovery (compilation, linting, schema validation) is built into the outer loop rather than expected from the model alone. Confidence: single-author project announcement, no benchmark, open-source release is verifiable. (source, June 15, 2026)

Dual RTX 3090 (48 GB VRAM) used as local execution tier in a frontier-planned agentic workflow. A software engineer with a dual-RTX-3090 desktop built a three-tier agent: Codex handles top-level planning (design decisions that determine architecture), and Gemma 4 31B — alongside Qwen 3.6 27B — handles local execution of coding tasks. The motivation: both Gemma 4 31B and Qwen 3.6 27B are capable at execution but, in the author's experience, lack the design-level judgment of a frontier model. By reserving frontier API calls for planning only, the system achieves near-frontier output quality while running the token-heavy execution phase locally. The dual 3090 (48 GB combined VRAM) comfortably holds Gemma 4 31B Q4 (~17.5 GB) with room for context. All three tiers are swappable via config. No tok/s figures were provided. This represents a practical frontier-plans-local-executes pattern that is becoming more common as users reach the limits of local models on planning tasks while still wanting to avoid cloud costs on every token. Confidence: single-author report, architectural details described, no benchmarks. (source, June 15, 2026)

Open questions

What throughput does ExecuTorch Gemma 4 achieve on current Android devices (Vulkan) and Apple Silicon (MLX)? The announcement does not include benchmark numbers. Prior data points for context: Pixel 10 Pro Termux gives Gemma 4 12B Q3 → 1.3 tok/s at 10K context; E2B on Android has been reported around 4 tok/s. ExecuTorch via Vulkan may differ materially from llama.cpp via Vulkan, but no direct comparison exists.
Are the new community QAT GGUFs perplexity-equivalent to Unsloth UD-Q4_K_XL on standard benchmarks? The author claims similar KLD, but an independent side-by-side perplexity or task-accuracy test across common benchmarks (MMLU, HellaSwag, etc.) has not been published. The KLD comparison was done against the Unsloth reference, not a third-party gold standard.
Does the Ironsmith deterministic-repair pattern extend to other agentic tasks beyond app code generation? The approach works for macOS app generation where a compiler provides a ground-truth correctness signal. It is less obvious whether the same repair-cascade pattern applies to tool-calling agents, multi-turn reasoning, or data extraction tasks where there is no equivalent deterministic oracle for catching model errors.

Sources

The Gemma-mentioning posts driving this update (June 16 sweep, newest first). All are fresh threads (score ~20, no captured comment threads at sweep time); treat individual numbers as first-look anecdotes rather than settled results:

React Native ExecuTorch now runs Gemma 4 (Vulkan and MLX accelerated) (Jun 15, 2026 — ExecuTorch integration: Vulkan delegate Android, MLX delegate Apple Silicon; fully offline; no throughput figures)
moar QAT stuff and hairy ticks (Jun 15, 2026 — community Gemma 4 12B + 31B QAT-aligned GGUFs via error-minimizing search; competitive KLD with Unsloth UD-Q4_K_XL)
Made a macOS app that creates highly personal macOS apps. Works with models as small as Gemma 4 E2B (Jun 15, 2026 — Ironsmith open-source app generator; E2B/E4B on 8 GB Mac; deterministic repair-loop pattern for small-model code gen)
An agent that plans with a frontier model but runs most of tokens locally (built it for my own dual-3090 rig) (Jun 15, 2026 — dual RTX 3090 48 GB; Codex plans, Gemma 4 31B executes; 3-tier swappable config)

Last updated: 2026-06-16 (June 16 sweep). Confidence: medium. Key findings: ExecuTorch first-class Gemma 4 mobile support (Android Vulkan + Apple Silicon MLX); community QAT-aligned GGUFs for 12B and 31B published; Gemma 4 E4B confirmed on 8 GB Mac via deterministic repair pattern; dual-3090 frontier+local hybrid agent pattern. Next update fires when the daily Gemma 4 research cron flags notable new findings.

---

Field Notes — 2026-06-15

A weekly synthesis of what the r/LocalLLaMA community is reporting about Gemma 4 in real use. Curated from the latest Gemma-mentioning posts (7 new or updated since 2026-06-14, 373 total) and their top comment threads. Confidence is medium unless noted, since this is community signal rather than a controlled benchmark.

June 15 sweep, 2026-06-15 01:00 UTC: a small but hardware-rich cycle anchored by one strong data set. A user ran a full `llama-bench` sweep across all four Gemma 4 size tiers on a triple GTX 1070 box (3×8 GB = 24 GB VRAM, 9-year-old Pascal cards), quantifying a pattern earlier sweeps only hinted at: on weak, bandwidth-starved multi-GPU hardware the MoE 26B-A4B QAT generates at 53 tok/s while the dense 31B manages only 7 tok/s — because the MoE activates roughly 4 B parameters per token instead of all 31 B. The same cycle delivered two edge data points — Gemma 4 12B running on a Google Pixel 10 Pro phone (Q3_K_XL + MTP, under 10 watts, ~1.3 tok/s at 10 K context) and a CPU-only local personal assistant built on Gemma 4 4B — plus a tooling release (Harbor v0.5.0) that stands up MLX/OMLX backends and a coding-agent frontend for Gemma 4 in one command. A multi-VLM comparison that includes three Gemma 4 vision variants is underway, but its result table was not captured at sweep time.

Triple GTX 1070 (3×8 GB = 24 GB VRAM, Pascal, power-limited): the MoE 26B-A4B QAT hits 53 tok/s; the dense 31B only 7 tok/s. A user ran a full `llama-bench` sweep across every Gemma 4 tier on a budget multi-GPU box — 3× Nvidia GTX 1070 8 GB (24 GB combined), AMD Ryzen 5 3600, 48 GB DDR4-3600, Kubuntu 26.04, llama.cpp Vulkan build b9204 — with the cards power-limited to 120–122 W each (a reported ~5% inference hit) and spread across PCIe 16x / 4x / 1x slots (one card on a 1x riser). Generation (tg128) and prompt-processing (pp512) results, by model:

Gemma 4 26B-A4B QAT (UD-Q4_K_XL, 13.26 GiB) — 53.08 tok/s generation, 123.50 tok/s prefill. Fastest generation of the set, and the smallest on disk.
Gemma 4 26B-A4B (UD-Q4_K_XL, 15.83 GiB) — 41.28 tok/s generation, 114.05 tok/s prefill.
Gemma 4 12B (UD-Q8, 12.69 GiB) — 13.47 tok/s generation, 128.85 tok/s prefill.
Gemma 4 E4B (BF16, 14.00 GiB) — 11.54 tok/s generation, 302.16 tok/s prefill (highest prefill, at full precision).
Gemma 4 31B dense (UD-Q4_K_XL, 17.52 GiB) — 7.12 tok/s generation, 56.21 tok/s prefill. Slowest, as expected for a fully-dense 31 B spread across slow interconnects.

Two durable takeaways. First, the MoE 26B-A4B is the clear daily-driver choice on old or bandwidth-limited multi-GPU rigs — at ~4 B active parameters per token it generates roughly 7× faster than the dense 31B while carrying far more total knowledge than the 12B. Second, the QAT build of the 26B-A4B is both smaller and faster than the plain quant of the same model (13.26 GiB / 53 tok/s vs 15.83 GiB / 41 tok/s), reinforcing the standing advice to prefer Google/Unsloth QAT quants when available. Caveats: this is a Vulkan backend (not CUDA), the cards are power-limited, and one GPU sits on a single PCIe lane — all of which suppress the absolute numbers; a modern CUDA build on full x16 lanes would be faster. The relative ordering between model types is the part that travels. Confidence: single-author benchmark, hardware and quant fully disclosed. (source, June 14, 2026)

Gemma 4 12B on a phone: a Google Pixel 10 Pro runs it under 10 watts at ~1.3 tok/s. A user ran `gemma-4-12b-it-UD-Q3_K_XL` with the MTP draft head (`mtp-gemma-4-12b-it.gguf`, `--spec-type draft-mtp --spec-draft-n-max 1`) under Termux + llama.cpp Vulkan (build 9639) on a Pixel 10 Pro, with `-c 32000`, `--mlock`, and q8_0 KV cache. At roughly 10,000 tokens of prompt depth the result was 6.5 tok/s prompt processing and 1.3 tok/s generation, drawing under 10 watts. The headline here is feasibility, not speed: a full 12 B model genuinely runs on a 2026 flagship phone at a Q3 quant, but ~1.3 tok/s at deep context is well below interactive reading speed — usable for background or async tasks, not live chat. For phone-class use the smaller E2B/E4B tiers remain the practical choice; the 12 B is a "because I can" data point that nonetheless usefully bounds what current mobile silicon can do. Confidence: single-author anecdote, full command disclosed. (source, June 14, 2026)

A CPU-only local personal assistant built on Gemma 4 4B. Prompted by the Anthropic Fable 5 / Mythos 5 export-control shutdown, a developer shared "Bantz," a fully local assistant running on Gemma 4 4B with no GPU required: it summarizes Gmail by category, integrates Google Calendar, runs async multi-source web research, monitors system resources (CPU/RAM/swap) with alerts, executes scheduled tasks, and does Wayland-native desktop control. The author is candid that "optimizing a small local model is an absolute nightmare," and parts of the feature list are aspirational (email summarization "tries, at least"), but the report is a useful proof-of-concept that a 4 B-class Gemma 4 can drive a multi-tool agent on CPU alone — the floor for "no specialized hardware" local AI keeps dropping. Confidence: single-author project announcement, no benchmarks. (source, June 14, 2026)

Harbor v0.5.0: one-command Gemma 4 backends on Mac (MLX/OMLX) plus a coding-agent frontend. Harbor's v0.5.0 release adds native (non-Docker) service hosting: `harbor up opencode mlx` or `harbor up hermes omlx` downloads, configures, and starts an MLX/OMLX backend (or Docker Model Runner) and wires it to a frontend such as Open WebUI, OpenCode, or Hermes. A new `harbor pull` routes by source — `harbor pull gemma4:12b` for Ollama-style names, HuggingFace repos for llama.cpp quants. For Gemma 4 on Apple Silicon, where MLX setup has historically been the friction point (see the recurring MLX-port issues in prior sweeps), this meaningfully lowers the barrier to a working local stack. Confidence: tooling release announcement; not independently tested here. (source, June 14, 2026)

Open questions

Which local VLM actually wins for Gemma 4 users — and where does Gemma 4 12B land? A user building a local vision MCP benchmarked ten current VLMs on a Mac with a 20-image suite, including Gemma 4 12B, Gemma 4 26B-A4B, and Gemma 4 E4B against Qwen3-VL 4B/8B, GLM-4.6V-Flash 9B, InternVL3.5 8B, and Qwen 3.6 35B-A3B. The author expected Gemma 4 12B to win, but the result table was not captured in the archive at sweep time. Worth tracking the follow-up before drawing conclusions. (source)
Is there a turnkey low-latency voice-input path for Gemma 4 12B's encoder-free audio? A user wants to exploit the 12B's native audio input to skip the speech-to-text stage in a speech-to-speech pipeline, but couldn't find an out-of-the-box streaming-ingestion solution. The encoder-free architecture is promising for latency; the tooling for native audio streaming is still thin. (source)
On 32 GB unified memory, is Gemma 4 12B at Q8 a better daily driver than Qwen 3.6 35B-A3B at Q4? A user getting ~15 tok/s from Qwen 3.6 35B-A3B on a 32 GB Mac is considering Gemma 4 12B, which fits comfortably at Q8 (even BF16). No head-to-head result was reported; the trade-off — a smaller dense model at high precision vs a larger MoE at low precision — is a recurring open question this sweep restates rather than settles. (source)

Sources

The Gemma-mentioning posts driving this update (June 15 sweep, newest first). All are fresh threads (score ~20, no captured comment threads at sweep time); treat individual numbers as first-look anecdotes rather than settled results:

Gemma 4 models benchmarked on with Triple GPU (Jun 14, 2026 — 3×GTX 1070 8 GB / 24 GB, Vulkan b9204, power-limited: 26B-A4B QAT 53.08, 26B-A4B 41.28, 12B Q8 13.47, E4B BF16 11.54, 31B Q4_K_XL 7.12 tok/s generation)
Gemma 12b less than 10 watts 6.5pp 1.3tg (Jun 14, 2026 — Pixel 10 Pro, Termux, 12B UD-Q3_K_XL + MTP draft, q8_0 KV, ~10 K depth → 6.5 pp / 1.3 tg tok/s, under 10 W)
Built a local AI assistant because I always knew this day would come (Jun 14, 2026 — "Bantz" CPU-only personal assistant on Gemma 4 4B: Gmail / Calendar / web research / system monitor / desktop control)
MLX/OMLX/DMR with OpenCode/Hermes/Open WebUI in one command — Harbor v0.5.0 (Jun 14, 2026 — one-command native MLX/OMLX backends + frontends; harbor pull gemma4:12b)
Which is the best local VLM? Benchmark results June 2026 (Jun 14, 2026 — 20-image VLM suite on Mac incl. Gemma 4 12B / 26B-A4B / E4B vs Qwen3-VL, GLM-4.6V-Flash, InternVL3.5; result table not captured at sweep time)
Gemma 4 12B native encoder free voice input utilization suggest? (Jun 14, 2026 — seeking turnkey low-latency native audio streaming ingestion for Gemma 4 12B encoder-free input)
Qwen 3.6 35B-A3B @ Q4 or Gemma 4 12B @ Q8? (Jun 14, 2026 — 32 GB unified memory; Qwen ~15 tok/s; Gemma 4 12B fits comfortably at Q8/BF16; quant-vs-size trade-off, no result)

Last updated: 2026-06-15 (June 15 sweep). Confidence: medium. Key hardware data: 3×GTX 1070 24 GB → 26B-A4B QAT 53 tok/s vs dense 31B 7 tok/s; Pixel 10 Pro → 12B Q3 1.3 tok/s under 10 W; Gemma 4 4B → CPU-only assistant. Next update fires when the daily Gemma 4 research cron flags notable new findings.

---

Field Notes — 2026-06-14

A weekly synthesis of what the r/LocalLLaMA community is reporting about Gemma 4 in real use. Curated from the latest Gemma-mentioning posts (11 new or updated since 2026-06-13, 366 total) and their top comment threads. Confidence is medium unless noted, since this is community signal rather than a controlled benchmark.

June 14 sweep, 2026-06-14 01:00 UTC: a smaller but hardware-rich cycle dominated by two concrete data points with practical implications. First, a 9-year-old RTX 1080 Ti reaches 50 tok/s on Gemma 4 12B QAT with MTP, confirming that any Pascal-class consumer GPU with ~11 GB VRAM can run a full 12B model at usable speed. Second, DiffusionGemma on Apple Silicon (MacBook M4 Pro 48 GB) via MLX delivers only 5 tok/s — unexpectedly slow compared to the 38 tok/s regular Gemma 4 26B-A4B QAT achieves on the same hardware — while a desktop RTX 3090 Ti running DiffusionGemma via GGUF reports ~120 tok/s, showing the architecture's potential only materialises with the right backend. The sweep also surfaced the first multi-user report of a vLLM AWQ 4-bit repetition bug ("lapped up" phrase loop in long contexts), community benchmarks of Gemma 4 12B for practical agentic tasks (standing up a Gitea server from scratch), and ongoing discussion of minimum hardware targets for Gemma 4 31B at interactive speeds.

RTX 1080 Ti (11 GB, 9 years old): Gemma 4 12B QAT UD-Q4_K_XL reaches 50 tok/s with MTP speculative decoding. A user running llama.cpp on a 2016-era card reports a working configuration: `unsloth/gemma-4-12B-it-qat-GGUF` with `gemma-4-12B-it-qat-UD-Q4_K_XL.gguf`, context 16384, full GPU offload (`-ngl 99`), `cache-type-k q8_0`, `cache-type-v q8_0`, and MTP speculative decoding with the built-in MTP head (`--spec-draft-hf unsloth/gemma-4-12B-it-qat-GGUF --model-draft MTP/gemma-4-12B-it-Q8_0-MTP.gguf --spec-type draft-mtp --spec-draft-n-max 2`). Result: 50 tok/s on a GPU that was released in 2016. The user is "happy but not 100% sure the speculative decoding is helping" — a reminder to always verify MTP by watching for burst patterns in the generation log (prior sweep: "a working MTP pair shows visible speculative bursts, a mismatched one generates one token at a time with overhead"). Practical implication: any 11 GB Pascal-class or Turing-class GPU that fits the Q4_K_XL quantization should reach similar numbers; the 12B QAT UD-Q4_K_XL is approximately 8–9 GB, leaving headroom for KV cache at context 16384. Confidence: single-author anecdote, configuration details are verifiable. (source, June 13, 2026)

DiffusionGemma on Apple Silicon (MacBook M4 Pro 48 GB) via MLX: only 5.4 tok/s — far behind regular Gemma 4. A user running `mlx-community/diffusiongemma-26B-A4B-it-4bit` via `mlx_vlm.generate` on a MacBook M4 Pro 48 GB reports: 3.5 tok/s prompt processing, 5.4 tok/s generation, 18.6 GB peak memory. For context, the same user's regular Gemma 4 26B-A4B QAT on the same Mac achieves approximately 38 tok/s — about 7× faster. The result is surprising given that autoregressive Gemma 4 runs well on Apple Silicon and DiffusionGemma is theoretically faster on compatible hardware. The likely explanation is that the MLX community port is early-stage and has not yet been optimized for the discrete-diffusion generation pattern (256-token parallel block generation does not map naturally to the MLX eager-execution graph in the way standard autoregressive decoding does). For contrast, a desktop RTX 3090 Ti running DiffusionGemma via GGUF (not MLX) reports approximately 120 tok/s — still below the 700+ tok/s H100 vendor claim, but a credible practical number on a consumer VRAM budget. Practical guidance for Apple Silicon users: do not judge DiffusionGemma's real potential by MLX numbers yet; the MLX implementation is probably immature. If throughput matters, watch for an updated MLX build or use GGUF via a bridge. Confidence: anecdotal single-author measurement; desktop RTX 3090 Ti figure also single-author. (source, June 13, 2026)

vLLM AWQ 4-bit repetition bug: "lapped up" phrase loops in long Gemma 4 31B chats. A user running Gemma 4 31B AWQ 4-bit via vLLM (`cyankiwi/gemma-4-31B-it-AWQ-4bit`) reports a reproducible degradation pattern in extended chats: the model begins inserting the phrase "lapped up" where it doesn't fit, then can spiral into a loop ("lapped-up lapped-up lapped-up...") until the context or budget is exhausted. The pattern matches what the community has seen in other extended-context quantization bugs — at sufficient context depth, the 4-bit AWQ representation of Gemma 4 31B begins to drift, and the model gets stuck sampling from a narrow high-probability region. The fix direction is the same as the KV cache quantization lesson from the June 11 sweep: consider switching from AWQ to GGUF Q4_K_M or Q5_K_M with llama.cpp, which has better-characterized behavior in long contexts and allows explicit cache-type control. If vLLM is required, try enabling repetition penalty (not available in all vLLM forks) or shortening effective context via a sliding window. Confidence: single-user bug report, but the mechanism is consistent with known AWQ long-context fragility. (source, June 13, 2026)

3×RTX 3090 (72 GB VRAM) workstation: Gemma 4 31B Q8 fits in 48 GB (2 cards), quick to load/offload. A user running a 3×RTX 3090 rig on old DDR4 confirms a practical pattern emerging in multi-GPU setups: Gemma 4 31B Q8 and Qwen 3.6 27B fit comfortably across two 3090s (48 GB combined), which the user keeps loaded. The third card is held free for audio and image processing, and the pair is quick enough to load and offload when GPU budget shifts. Notably, the user trusts the smaller models "over the bigger models in some instances" because the VRAM-resident Q8 provides higher signal quality per parameter than partially-offloaded larger models. The user frames this as a quality-per-VRAM argument: 48 GB at Q8 for the 31B or 27B is a different trade-off from 72 GB in a large Q4 — less coverage but more fidelity per active parameter. Practical note: the user's DDR4 system RAM is not a bottleneck because the model is fully resident in VRAM at inference time. Confidence: anecdotal multi-user workstation report. (source, June 13, 2026)

Gemma 4 12B for agentic coding: a user had it stand up a Gitea server and retrieve exploits with no hand-holding. A practitioner dismissing model FOMO reports that Gemma 4 12B completed a task they found "astonishing": it was sent inside Hermes to set up a private Gitea server and retrieve a list of exploits from Nightmareclipse for safe-keeping — and "just did it." No specific hardware is reported, but Gemma 4 12B runs in approximately 8–9 GB VRAM (Q4_K_XL), putting it in range for any 12 GB+ consumer GPU. The report is worth noting because it illustrates the 12B's practical agentic ceiling: structured multi-step tool-calling, file management, and network-service setup are within reach at modest hardware cost, even though the model will struggle with the exact arithmetic and long-context coherence that the June 13 accuracy ladder revealed. Confidence: single practitioner report, no benchmark. (source, June 13, 2026)

Open questions

Is the DiffusionGemma MLX port simply immature, and when will it reach parity with autoregressive Gemma on Apple Silicon? The 5.4 tok/s result on an M4 Pro 48 GB vs 38 tok/s for regular Gemma 4 QAT suggests the MLX implementation may not yet handle parallel 256-token block generation efficiently. A proper optimized MLX release from the community (or Google) would settle this.
What is the minimum budget for Gemma 4 31B Q5 at >20 tok/s on used hardware? A community member is asking whether a sub-$1K build around an X79 platform (~$100 mobo+CPU+RAM) plus used GPUs can reach interactive speed for the dense 31B. The answer depends on memory bandwidth — 31B Q5 needs ~190 GB/s+ to sustain 20 tok/s — which rules out budget consumer cards but may be achievable with dual 3080/3090 used pairs. No specific reply data was captured for this sweep; watch for follow-up.
Does the vLLM AWQ "lapped up" repetition bug affect all AWQ 4-bit Gemma 4 31B builds, or only the `cyankiwi` checkpoint? The bug may be a model-specific quantization artifact or a more general vLLM AWQ issue. A comparison against GGUF on the same hardware would clarify.
Can Gemma 4 vision models (E4B, 31B) reliably count or classify fine-grained visual structures like PCIe notches? Multiple users have tested Gemma 4 on hardware photo recognition tasks and found the models struggle with precise physical counting. This may be a genuine limitation of the 1 GB mmproj vs larger vision encoders (Step Flash ~4 GB). No controlled benchmark exists yet.

Sources

The Gemma-mentioning posts driving this update (June 14 sweep, newest first). All are fresh threads (score ~20, no captured comment threads at sweep time); treat individual numbers as first-look anecdotes rather than settled results:

Yay got Gemma 12B QAT working on old 1080ti (maybe with speculative decoding?) (Jun 13, 2026 — RTX 1080 Ti 11 GB: Gemma 4 12B QAT UD-Q4_K_XL, -c 16384, -ngl 99, q8_0 cache, spec-draft-mtp n-max 2 → 50 tok/s)
diffusiongemma-26B-A4B-it-4bit on macbook 4 pro with 48gb has very slow token generation speed (Jun 13, 2026 — M4 Pro 48 GB: DiffusionGemma MLX 4bit → 5.4 tok/s / 18.6 GB; regular QAT → 38 tok/s same Mac; RTX 3090 Ti GGUF DiffusionGemma → ~120 tok/s)
Gemma 4 - weird issue, keeps saying "Lapped up" (Jun 13, 2026 — vLLM AWQ 4bit Gemma 4 31B: "lapped up" repetition loop in extended roleplay contexts)
Best models in 3x3090 (72GB VRAM) in Q2 2026? (Jun 13, 2026 — 3×RTX 3090 DDR4: Gemma 4 31B Q8 fits in 48 GB / 2 cards, quick load/offload, trusted over larger Q4 partial-offload)
I am losing my mind with FOMO and need some sanity checking about model capabilities (Jun 13, 2026 — Gemma 4 12B agentic: set up Gitea server + exploit retrieval via Hermes with no hand-holding)
Reality check: Gemma 4 31B at >20 tok/s for <1k$ USD (Jun 13, 2026 — community asks if budget used-GPU build can reach 20 tok/s for dense 31B Q5; X79 + Xeon base ~$100)
What LLMs+mmproj can recognize width of PCI Express slots on the motherboard? (Jun 13, 2026 — Gemma 4 E4B and 31B tested for PCIe-slot visual recognition; couldn't reliably count notches; mmproj 1 GB vs Step Flash 4 GB)
Best batteries-included harness tuned for Qwen 3.6 and Gemma 4? (Jun 13, 2026 — little-coder reported better than OpenCode and Cline for Gemma 4 coding agent workflows)
Measuring the Alignment Tax on Gemma4 (Jun 13, 2026 — community methodology for measuring safety reasoning overhead in Gemma 4 CoT; analysis only, no hardware)
What would you recommend as the smallest vision model that could extract contact info from a business card scan? (Jun 13, 2026 — E2B (2.6 GB) found insufficient for reliable OCR of business card contact info)
We need to bet in LOCAL INFERENCE and OPEN WEIGHTS and stop paying for more SaaS (Jun 13, 2026 — community sentiment post citing Gemma 4 as Google's bet on local inference; no hardware data)

Last updated: 2026-06-14 (June 14 sweep). Confidence: medium. Key hardware data: RTX 1080 Ti → 12B QAT 50 tok/s; M4 Pro 48 GB → DiffusionGemma MLX 5.4 tok/s; RTX 3090 Ti → DiffusionGemma GGUF ~120 tok/s. Next update fires when the daily Gemma 4 research cron flags notable new findings.

---

Field Notes — 2026-06-13

A weekly synthesis of what the r/LocalLLaMA community is reporting about Gemma 4 in real use. Curated from the latest Gemma-mentioning posts (12 new or updated since 2026-06-12, 355 total) and their top comment threads. Confidence is medium unless noted, since this is community signal rather than a controlled benchmark.

June 13 sweep, 2026-06-13 01:00 UTC: this sweep is dominated by four distinct evidence threads arriving in the same cycle: a quantitative DiffusionGemma accuracy trade-off (4× faster, 6× more factual mistakes), the first systematic quantization accuracy ladder across all four Gemma 4 size tiers, a practitioner guide on MTP assistant selection (the wrong draft model can deliver almost no speedup at all), and a data point showing Gemma 4 31B hits a performance ceiling in extended reasoning loops at iteration 4–5. On the tooling side, MTPLX V1 ships a native Mac app that now auto-converts any HuggingFace model to MLX with MTP heads, closing the biggest gap in Apple Silicon MTP adoption.

DiffusionGemma accuracy trade-off is now quantified — 4× faster generation, 6× more factual mistakes. The June 12 sweep confirmed DiffusionGemma's existence via an NVIDIA model card and a first AMD multi-GPU run; this sweep adds the first head-to-head factual accuracy comparison, run on a single H100 (FP8). The methodology: three knowledge tasks of decreasing familiarity — Steve Jobs biography, history of Tetris, story of BeOS — with every claim fact-checked afterward. Standard Gemma 4 26B-A4B: 218 tok/s, 15.1s total, 45 claims correct, 5 wrong. DiffusionGemma 26B-A4B: 763 tok/s, 3.7s total, 33 claims correct, 28 wrong. The accuracy gap widened sharply on the less-popular topics: 4 mistakes on Jobs, 12 on Tetris, 12 on BeOS. The mechanistic reason is described in the post and matches the architecture: DiffusionGemma generates tokens 256 at a time in parallel blocks, optimizing for textual smoothness pass-by-pass — a plausible-sounding fabricated name or date remains in the output because it is "smooth," not because the model verified it. Autoregressive Gemma checks each token against the growing prefix; diffusion does not have that sequential self-consistency signal. The practical read for Gemma 4 users: DiffusionGemma is a generation-speed tool, not a drop-in accuracy substitute; the 3.5× throughput premium comes at a real factual reliability cost, especially for niche or low-frequency topics. Confidence: single H100 run, but the methodology and result direction are consistent with the architecture. (source, June 12, 2026)

Quantization accuracy ladder: small Gemma 4 models struggle with structured tasks; 26B-A4B is the reliability threshold. A practitioner who found published KLD numbers hard to interpret ran three "contrived but controlled" tasks across seven Gemma 4 quantizations — arithmetic with 18-digit numbers (1,000 questions), president date-of-birth recall (46 questions), and attention (finding the repeated word in a 1,001-word list). The results paint a clear size-and-quantization picture: E2B Q8_0 was essentially broken on all three (1.4% arithmetic, 28.3% president facts, 0% attention); E4B Q8_0 recovered on fact recall (65.2%) but remained broken for arithmetic (0.1%) and attention (3%); 12B Q4_K_S reached solid mid-tier performance (31% arithmetic, 67.4% presidents, 35% attention); 26B-A4B UD Q4_K_S was the first to deliver reliable structured-task performance (72.3% arithmetic, 97.8% presidents, 55% attention). QAT Q4_0 from Google matched or slightly exceeded the Unsloth UD Q4_K_S at the same bit count on some tasks, corroborating the existing community view that QAT quants preserve more capability per bit than equivalent standard quants. Important caveat: these are purposely extreme stress-tests (exact large-integer arithmetic, exact date formatting, attention over 1,001 words) — real conversational and coding tasks show better numbers at every tier. The takeaway is not "avoid small Gemma 4" but rather: for structured, exact-answer tasks — tool calling, data extraction, code with specific constraints — treat 26B-A4B as the minimum-confidence tier and do not expect reliable arithmetic or precise attention from E2B/E4B without validation. (source, June 12, 2026)

MTP assistant selection is the hidden variable — the wrong draft model gives almost no speedup. A practitioner running Gemma 4 Heretic finetunes documented the MTP assistant problem more concretely than any prior community report. After testing six or more different GGUF assistants for the same Gemma 4 26B base, they found the range between best and worst was larger than the range between models; some technically loaded and executed but delivered near-zero throughput improvement, while the right pairing gave 1.7–2× gains. Confirmed pairings (all single-RTX, llama.cpp): 26B Heretic Q8 + correct assistant: 30→55–62 t/s; 12B Heretic Q4 + correct assistant: 22→35–54 t/s; 26B QAT/Q4 Heretic Vision + correct assistant: 65→70–75 t/s; 31B Q4 Heretic Vision + correct assistant: 14→25–30 t/s. The key lessons from their testing: (1) matching quantization level between draft and base matters — a Q8 assistant pairing with a Q4 base gave marginal gains, while a Q4 assistant gave strong ones; (2) the same HuggingFace model name does not guarantee the same model internals — two files both called "gemma 4 31B Q4 assistant" produced significantly different acceptance rates; (3) verify by watching the token-generation log — a working MTP pair will show visible speculative bursts, while a mismatched one generates one token at a time with overhead. The practical advice for anyone adopting MTP: try at least two or three different assistant GGUF builds before concluding that MTP is not working for your setup; the assistant, not just the base model, is where most of the variance lives. Confidence: practitioner single-rig, but the mechanism and methodology are sound. (source, June 12, 2026)

Test-time compute scaling with Gemma 4 31B: near-Claude-Mythos on code at 25–40× compute, but degrades past iteration 4. A practitioner built a tree-search scaffold (5 exploration branches, 10 iterations, 6 branch-aware hypotheses revised every 2 iterations, all agents with Python access) and used it to scale test-time compute for Gemma 4 31B and Qwen 3.6 27B on code optimization tasks, reporting results that approach Claude Mythos. The interesting finding is not the peak — it is where it breaks: both models begin showing genuine performance regression at iteration 4–5, and again at the PQF update step at iteration 9–10, because neither maintains stable long-context reasoning at that depth. Gemma 4 degraded slightly earlier; Qwen 3.6 27B was marginally more robust. The mechanism matches the KV cache / zombie-loop pattern documented in prior sweeps: at extended reasoning depths, Gemma 4 31B drifts into repetition or incoherent branching, and stopping at iteration 3 sometimes outperforms going to 5. Practical guidance: extended reasoning scaffolds using Gemma 4 31B benefit from early stopping (around 3 iterations) rather than maximum compute; the model's instruction-following and conciseness advantages translate well to structured multi-step tasks, but long-context coherence is not its strength. Confidence: single-author scaffold, no reproducible artifact shared, but the degradation pattern is consistent with prior architecture findings. (source, June 12, 2026)

Gemma 4 12B QAT holds 256K context under 7.7 GB OS RAM — confirmed by a production app. An MIT-licensed local roleplay app called Open Dungeon uses Gemma 4 12B QAT Q4 via Ollama as its narrator, and its author reports an observation worth carrying into the hardware guide: running the 12B at its full 256K context window keeps OS memory consumption at approximately 7.7 GB — well under the 8 GB threshold for common consumer configurations. The author's explanation matches what the community has measured in other contexts: Gemma 4 barely grows its KV cache even at long contexts, so a 256K session does not spiral toward memory exhaustion the way a comparably sized dense model would. The app handles overflow by folding old scenes into a running summary so the model never forgets chapter one. While this is a single-project report rather than a controlled memory benchmark, it is consistent with the KV cache efficiency findings from prior sweeps and provides a concrete application context where the 12B QAT's memory profile makes it practical in ways a larger model could not be. Confidence: anecdotal from a released app, consistent with prior measurements. (source, June 12, 2026)

MTPLX V1 for Apple Silicon: native Mac app now auto-converts any HuggingFace model to MLX with MTP heads. The largest gap in Apple Silicon MTP adoption — almost no MLX quants shipped with MTP head weights — has a community fix. MTPLX V1 ships a "Forge" feature: paste any HuggingFace model link, and it converts the model to MLX and wires up the MTP heads automatically, then measures the real speedup on your own Mac before you commit. The app is a native Swift build (~55 MB DMG), bundles the MLX engine entirely on-device, and includes a live dashboard showing the decode gauge, acceptance-by-depth, and the speculative verification waterfall in real time. Confirmed numbers from the post: Qwen 3.6 27B: 28→63 t/s (2.25×); the post notes Gemma 4 is supported. The practical implication for the Apple Silicon tier: the prior barrier ("there are no MLX models with MTP heads") is resolved by Forge — if you can find the base model on HuggingFace, you can now build an MTP-capable version on your own machine. Confidence: announcement with live demo video; speedup figures are from the author's own hardware. (source, June 12, 2026)

Real-life file-attachment benchmark at 16 GB VRAM: Qwen 3.6 35B A3B wins over Gemma 4 26B for network analysis. A practitioner who recompiles llama.cpp daily used a real-work task — analyzing a Wireshark packet capture file to pinpoint the exact network packet triggering a problem — as a benchmark across models at 16 GB VRAM. Clear winner: Qwen 3.6 35B A3B. Gemma 4 26B was a close second; Gemma 4 12B fell short on this task. Qwen 3.6 27B also found the problem but was "very slow with only 16GB of VRAM." The practical read for the 16 GB single-GPU tier: Gemma 4 26B-A4B is a credible choice for real file-attachment analysis tasks at this VRAM budget, but Qwen 3.6 35B A3B's MoE architecture gives it an advantage on structured file-comprehension tasks. Gemma 4 12B is not reliable for this class of problem. Confidence: anecdotal single-practitioner benchmark; the task is real but the methodology is informal. (source, June 12, 2026)

Open questions

DiffusionGemma accuracy at scale: does fine-tuning on factual tasks or extended denoising passes reduce the error rate? The 6× factual-error gap is large enough that diffusion generation may need dedicated training signal — not just more denoising steps — to close it. No one has published a fine-tuning or RLHF approach for diffusion Gemma yet.
What is the correct MTP assistant for a given Gemma 4 26B Q4_K_M base in llama.cpp mainline? The practitioner post makes clear that community consensus on which GGUF pairs reliably does not exist yet. A curated table of tested assistant+base combinations would be the highest-value community contribution for MTP adoption.
Is the Gemma 4 31B iteration-4 reasoning degradation fixable by prompt design? The test-time compute post does not test whether the degradation is driven by context length or by instruction format. A clean ablation (same iteration depth, different prompt structures) would clarify whether prompt engineering or early-stopping is the right mitigation.
Does NVFP4 deliver Q6/Q8-class quality at Q4 size, and how does it run on non-Blackwell hardware? Carried forward from June 12. NVFP4 is now available for 12B and 26B-A4B; the quality-vs-standard-quant and cross-vendor compatibility questions remain unanswered.
New Gemma models: what are they? Community observing new Gemma models alongside MiniMax M3 and Kimi K2.6 — Google has been releasing incrementally (12B, QAT variants, DFlash, MTP heads). The "more Gemma 4 models incoming" signal from prior sweeps keeps appearing without a specific announcement. (source, June 12, 2026)

Sources

The Gemma-mentioning posts driving this update (June 13 sweep, newest first). Posts are fresh launch-window threads (score ~20, no captured comment threads at sweep time); treat individual numbers as first-look anecdotes rather than settled results:

Diffusion Gemma is 4x faster, but makes 6x more mistakes! (Jun 12, 2026 — H100 FP8 accuracy comparison; standard Gemma 4 26B-A4B: 218 tok/s / 5 mistakes; DiffusionGemma: 763 tok/s / 28 mistakes; factual error gap widens on niche topics)
I scaled test-time compute for Qwen-3.6-27B and Gemma-4-31B to surpass Claude Mythos in code optimizations and speedups. (Jun 12, 2026 — 25–40× compute scaffold approaches Claude Mythos; Gemma 4 31B degrades past iteration 4–5 due to long-context drift)
Real life problem, new benchmark, and the winner is... (Jun 12, 2026 — 16GB VRAM Wireshark file-attachment task; Qwen 3.6 35B A3B wins; Gemma 4 26B close second; Gemma 4 12B fell short)
Not All MTP Assistants Are Created Equal (Jun 12, 2026 — practitioner tests 6+ GGUF assistants; 26B Q8: 30→55–62 t/s; 12B Q4: 22→35–54 t/s; same HF name ≠ same model internals)
Some contrived tests comparing the accuracy of different Gemma and Qwen quantizations (Jun 12, 2026 — E2B broken for math/attention; 12B Q4 mid-tier; 26B-A4B Q4 reliable; QAT Q4 matches/exceeds Unsloth UD at same bit count)
Open Dungeon: local roleplay with Gemma 4 QAT + inline Uncen-FLUX images, running at full 256K context under 8GB RAM (OS) (Jun 12, 2026 — 12B QAT Q4 via Ollama at 256K context stays under 7.7 GB OS RAM; confirms KV cache efficiency in a production app)
MTPLX V1: The Swift App For Running & Creating MLX MTP Models (2x TPS Qwen 3.6 27B) (Jun 12, 2026 — native Mac app; Forge feature auto-converts any HF model to MLX with MTP heads; Qwen 3.6 27B: 28→63 t/s; Gemma 4 supported)
Gemma: new models. Minimax: new model. Kimi: new model. Qwen... When? (Jun 12, 2026 — community signal that more Gemma 4 models are expected; no specifics announced)

---

Field Notes — 2026-06-12

A weekly synthesis of what the r/LocalLLaMA community is reporting about Gemma 4 in real use. Curated from the latest Gemma-mentioning posts (12 new or updated since 2026-06-11, 343 total) and their top comment threads. Confidence is medium unless noted, since this is community signal rather than a controlled benchmark.

June 12 sweep, 2026-06-12 01:00 UTC: the headline this sweep is that DiffusionGemma stops being a rumour. June 11's notes flagged it as a loud-but-unverified launch-window claim; this cycle brings two concrete corroborations — an official NVIDIA NVFP4 model card that pins down the architecture, and the first independent hands-on benchmark (on AMD multi-GPU). The model's existence and shape are now well-supported; its eye-catching single-5090 throughput number is not yet. Around that, the sweep is unusually rich in hard, reproducible numbers: a counterintuitive CPU thread-count finding (+80%), an asymmetric dual-GPU lesson about keeping the KV cache in VRAM, a CPU-only "any model runs on any PC" test that quantifies just how slow that really is, and an NVFP4-for-8GB-laptops question that the community still has not benchmarked.

DiffusionGemma is now corroborated — by an NVIDIA model card and a first independent run — though its flagship speed claim is still unverified. Last sweep, two promotional posts claimed DeepMind had released a Gemma-4-architecture text-diffusion model at 700+ tok/s on an RTX 5090; we adopted none of those numbers as fact. This cycle moves the story forward on two fronts. First, an official NVIDIA Hugging Face model card (`nvidia/diffusiongemma-26B-A4B-it-NVFP4`) describes it concretely: an open-weights, Google DeepMind multimodal model that takes text, image, and video and produces text via discrete diffusion, built on the Gemma 4 26B-A4B MoE (25.2B total / 3.8B active) with an encoder-decoder, bidirectional-attention design that generates tokens in parallel 256-token blocks, a 256K context window, configurable thinking, native function calling, and 35+ languages; the card quotes 1,100+ tok/s on an H100 (FP8) and ships an NVFP4 quant made with Model Optimizer. Second — and more useful for self-hosters — a community member posted the first hands-on benchmark: DiffusionGemma 26B on a vLLM `dgemma` branch across 4× Radeon RX 7900 XTX (4×24 GB), reporting ~100 tok/s generation but ~45–60 tok/s effective once prompt-processing wait is counted, with each card sitting at ~23.6/24 GB VRAM, a 152,671-token KV-cache budget, and a 131,072-token context at 1.16× concurrency. The honest read: the architecture and the model's existence are now well-corroborated (an NVIDIA repo plus a working community run), but the 700+ tok/s-on-one-5090 headline remains a vendor/author claim — the single independent figure we have is ~100 tok/s generation on four AMD cards via an experimental branch, which is not the same machine or the same story. Confidence: high that the model is real and Gemma-4-based; low-to-medium on the consumer-GPU throughput claims. (source: NVIDIA NVFP4 card, source: 4×7900 XTX run, June 11, 2026)

Demand signal: a 12B-scale diffusion Gemma is what would actually move the consumer-GPU needle. Directly downstream of the above, a user recompiling llama.cpp for diffusion-Gemma support argues the 26B-A4B is the wrong size for the people who would benefit most, and that the obvious play is a diffusion model built on the largest checkpoint that still fits a normal GPU. Their concrete baseline: Gemma 4 12B already runs at ~30 tok/s with 600+ tok/s prefill on an RX 6600XT (8 GB) — solid for latency-sensitive, non-coding work — so a 12B diffusion variant that kept that footprint while denoising whole blocks at once could be the real unlock. No such model exists yet; this is a well-reasoned wish, not a result. Confidence: anecdotal, but the 12B-on-RX-6600XT baseline is concrete. (source, June 11, 2026)

CPU inference: a careful benchmark shows +80% from raising the thread count past the usual "P-cores only" advice. The most actionable number this sweep contradicts a widely-repeated tuning rule. Testing Gemma 4 26B-A4B QAT with MTP on a Core Ultra 250K Plus (6 performance + 12 efficiency = 18 cores), a user ran a clean sweep (one warmup, then 5 runs per setting, same seed, same prompt): `--threads 6` → ~49 tok/s, `12` → ~63 tok/s, `16` → ~89 tok/s, `18` → ~66 tok/s. That is a ~80% uplift from 6 to 16 threads, with a clear peak at 16 and a regression at 18 — i.e. the common "limit to P-cores and pin affinity" guidance left a large amount of performance on the table on this hybrid CPU, and the efficiency cores were worth using up to a point. Practical guidance for the CPU and hybrid-CPU tier: do not assume P-core-only is optimal — sweep `--threads` on your own silicon, because the sweet spot is workload- and CPU-specific and may sit well above your P-core count (but below total cores). Confidence: single-author, but the methodology (warmup + repeated runs + fixed seed) is unusually careful for a forum post. (source, June 12, 2026)

Asymmetric dual-GPU: the whole game is keeping the KV cache in VRAM — quantizing it turned ~20 tok/s into ~70 tok/s. A user who paired a 3080 Ti 12 GB with a 3080 20 GB documents a sharp, reproducible cliff on the dense 31B. Running Gemma 4 31B QAT Q4_K_XL (Unsloth) with its Q8_0 MTP drafter at a 262,144-token context and default cache types, the model nearly filled both cards and spilled ~13 GB into system RAM, giving only ~20 tok/s generation. Switching `cache-type-k/v` to q4_0 so the entire weights-plus-KV working set fit inside VRAM lifted that to ~70 tok/s — roughly a 3.5× gain purely from avoiding the host-RAM spill. Notably, split mode (tensor vs layer) made little difference; residency did. The lesson for the multi-GPU workstation tier: with a long context, even a small overflow into system RAM collapses throughput, and KV-cache quantization is often the highest-leverage knob for staying resident — measure VRAM headroom at your target context before blaming the GPUs. Confidence: anecdotal single-author, but the before/after numbers and configuration are specific and reproducible. (source, June 11, 2026)

CPU-only / no-VRAM: "any model runs on any PC" is literally true and practically brutal — Gemma 4 12B Q4 managed 0.28 tok/s on a GPU-less laptop with 2.6 GB free RAM. Pushing the low-end question to its limit, a user pulled a RAM module from a 4-core i7 laptop with no GPU so the LLM engine had just 2.6 GiB of free DDR4, then streamed weights from a 2.5 GB/s SSD. Result for Gemma 4 12B Q4 (7 GB on disk): ~4 tok/s prompt processing and ~0.28 tok/s generation (a 198B MoE in Q6 was slower still). The takeaway is genuinely two-sided: SSD-backed streaming means model size no longer gates whether something runs, but 0.28 tok/s is batch / "leave-it-overnight" territory, not interactive use, and the author's own framing — treat it like snail mail, give it a task and check back later — is the right expectation to set. Useful as a feasibility proof for Pi-class and ultra-low-RAM machines; not a recommendation for daily use. (Aside worth noting from the same post: Gemma 4 self-reports a January 2025 knowledge cutoff.) Confidence: concrete single-machine measurement. (source, June 11, 2026)

Laptops / 8 GB VRAM: NVFP4 is the format people want for fitting bigger Gemmas, but the quality-vs-Q-quant question is still unbenchmarked. A 4060-laptop user (8 GB VRAM) lays out the practical appeal cleanly: Gemma-4-12B in NVFP4 is ~7 GB and fits, where the Q8 build (~12 GB) does not, so NVFP4 could let an 8 GB card run a 12B that otherwise only fit at aggressive GGUF quants. Two open threads they raise are worth tracking for the laptop tier: whether NVFP4 actually delivers Q6/Q8-class quality at roughly Q4 size (model-card comparisons hint "close to BF16," but no independent same-task numbers exist), and the report that NVFP4 reportedly runs on non-Blackwell, AMD, and Intel GPUs too, not just 50-series Nvidia. As with the recurring QAT-vs-higher-bit ask, this is demand, not a result: a clean NVFP4-vs-Q4/Q5/Q6/Q8 comparison on the same model and task — speed and quality — still does not exist in the community record. Confidence: high that the question is unresolved; the VRAM-fit math is concrete. (source, June 11, 2026)

Tooling: the small Gemmas keep showing up as cheap monitoring/agent components, not standalone chatbots. An update to the Observer framework (screen-watching micro-agents) added an MCP layer and recommends a now-familiar split: a capable model as the tool-calling controller — the author singles out Gemma 4 26B-A4B as "surprisingly good" at driving an OpenAI-style `chat/completions` tool loop via llama.cpp — paired with a tiny E2B model as the always-on monitoring agent (running through Transformers.js on the web build and llama.cpp in the Tauri desktop app). It reinforces a pattern this series keeps seeing: the E2B/E4B variants earn their place as constrained, embedded components (front-ends, monitors, routers) rather than as general assistants. Confidence: tooling announcement, no independent benchmarks. (source, June 11, 2026)

Release note: a third-party "Uncensored Heretic" QAT quadruple-drop now spans 12B, 26B-A4B, and 31B in every modern quant format. A community finetuner published abliterated/uncensored QAT-`q4_0` variants of Gemma 4 12B, 12B QAT, 26B-A4B QAT, and 31B QAT, each shipped in Safetensors, GGUF, NVFP4 (Safetensors + GGUF), and GPTQ-Int4 — useful mainly as a signal that the QAT base weights are now widely available enough for downstream finetunes across the whole lineup. As with prior "heretic" releases in this series, these are third-party finetunes with no published benchmarks or refusal/KLD figures in the post; treat capability and alignment claims as untested and evaluate on your own tasks before relying on them. Confidence: release announcement only. (source, June 11, 2026)

Open questions

Does DiffusionGemma's consumer-GPU throughput hold on a single card? The architecture is now corroborated by an NVIDIA NVFP4 card and one AMD multi-GPU run (~100 tok/s gen on 4× 7900 XTX), but the headline 700+ tok/s on a single RTX 5090 / 1,000+ tok/s on H100 is still a vendor/author claim with no independent single-GPU reproduction. Next sweep should look for a hands-on 5090 or H100 number and a confirmed quality evaluation. (source)
Will there be a 12B-scale diffusion Gemma for consumer GPUs? The 26B-A4B is corroborated; a 12B diffusion variant that kept the ~7 GB / 30 tok/s footprint of today's dense 12B would be the version most home users could actually run. (source)
NVFP4 vs Q4/Q5/Q6/Q8, same model and task — speed and quality. Asked again this sweep for fitting a 12B onto an 8 GB laptop, and still unanswered. Bonus unknown: how NVFP4 actually performs on non-Blackwell, AMD, and Intel GPUs. (source)
Qwen 3.6 27B IQ4_XS vs Gemma 4 31B QAT as a Hermes agent — and what is the current fixed tool-calling template? A recurring head-to-head plus the perennial Gemma-4 tool-template question (Gemma is still reported as awkward with tool calls in OpenWebUI). (source)
Can reasoning be trimmed, not just extended, for creative tasks? A user found that neither Gemma 4 nor Qwen 3.6 can be prompted to reduce their draft-check-revise reasoning loops for creative writing — only to add steps — and wants a template or finetune that yields reasoning without the wasteful self-drafting. (source)

Sources

The Gemma-mentioning posts driving this update (June 12 sweep, newest first). All are fresh launch-window threads (score ~20, no captured comment threads at sweep time), so treat individual numbers as first-look anecdotes or unverified claims rather than settled results:

PSA: Test your "threads" argument in llama.cpp (+80% performance in my case) (Jun 12, 2026 — Gemma 4 26B-A4B QAT + MTP on Core Ultra 250K Plus: 6→16 threads = ~49→~89 tok/s, regression at 18; careful warmup+5-run methodology)
Gemma 4 Quadruple Release, 12B, 12B QAT, 26B-A4B QAT and 31B QAT Uncensored Heretics! (Jun 11, 2026 — third-party abliterated QAT q4_0 finetunes across the lineup in Safetensors/GGUF/NVFP4/GPTQ-Int4; no benchmarks)
Mi50 32GB / GFX906 — vLLM Qwen 3.5 configuration (sub-1 TPS) (Jun 11, 2026 — mostly a Qwen 3.5 AWQ-on-Mi50 troubleshooting post; only a tangential ask for a Gemma 4 any-to-any setup, included for completeness)
I have finally tested it: large models can be run on low RAM / no VRAM (Jun 11, 2026 — GPU-less i7 laptop, 2.6 GB free RAM, SSD stream: Gemma 4 12B Q4 ~4 tok/s PP / ~0.28 tok/s TG; Gemma 4 self-reports Jan-2025 cutoff)
advice for dual-gpu asymmetric (Jun 11, 2026 — 3080 Ti 12 GB + 3080 20 GB; Gemma 4 31B QAT Q4_K_XL @ 262k ctx: ~20 tok/s with ~13 GB RAM spill, ~70 tok/s with q4_0 KV cache fully in VRAM)
DifussionGemma 4 on 4x7900xtx (Jun 11, 2026 — first independent DiffusionGemma 26B run: vLLM dgemma branch, 4× RX 7900 XTX, ~100 tok/s gen / ~45–60 tok/s effective, ~23.6/24 GB per card, 131k ctx)
Any chances for a 12B diffusion Gemma? (Jun 11, 2026 — wants a consumer-GPU-sized diffusion variant; notes Gemma 4 12B at ~30 tok/s + 600+ tok/s prefill on an RX 6600XT)
Reasoning, but without actually drafting replies? (Jun 11, 2026 — neither Gemma 4 nor Qwen 3.6 can be prompted to reduce reasoning steps for creative tasks; only to add them)
NVFP4 with llama.cpp — FAQs? (Jun 11, 2026 — 8 GB laptop 4060: Gemma-4-12B NVFP4 ~7 GB fits vs ~12 GB Q8; wants NVFP4 vs Q4/Q6/Q8 quality+speed; NVFP4 reportedly runs on non-Blackwell/AMD/Intel)
Is Qwen 3.6 27B IQ4XS better than Gemma 4 31B QAT as a Hermes agent? (Jun 11, 2026 — head-to-head question; asks for the latest fixed Gemma 4 tool-calling template; Gemma noted awkward with tool calls in OpenWebUI)
nvidia/diffusiongemma-26B-A4B-it-NVFP4 · Hugging Face (Jun 11, 2026 — official NVIDIA NVFP4 card: DeepMind multimodal discrete-diffusion model on Gemma 4 26B-A4B MoE, 25.2B/3.8B, 256-token parallel blocks, 256K ctx, 1,100+ tok/s H100 FP8)
Monitor your screen using local LLMs with only one sentence! (Jun 11, 2026 — Observer framework MCP: Gemma 4 26B-A4B "surprisingly good" as tool-calling controller, E2B as the monitoring micro-agent)

Last updated: 2026-06-12 (June 12 sweep). Confidence: medium; DiffusionGemma architecture is now corroborated but its consumer-GPU throughput claims remain low-confidence/unverified. Next update fires when the daily Gemma 4 research cron flags notable new findings.

Field Notes — 2026-06-11

A weekly synthesis of what the r/LocalLLaMA community is reporting about Gemma 4 in real use. Curated from the latest Gemma-mentioning posts (9 new or updated since 2026-06-10, 331 total) and their top comment threads. Confidence is medium unless noted, since this is community signal rather than a controlled benchmark.

June 11 sweep, 2026-06-11 01:00 UTC: this is a quieter, mostly question-driven sweep with one loud-but-unverified headline. A pair of launch-window posts claim DeepMind has released DiffusionGemma, a text-diffusion model built on the Gemma 4 26B-A4B MoE architecture — with eye-catching throughput numbers that, on this sweep's evidence alone, should be treated as a claim rather than a measurement. Underneath that headline the sweep is dominated by practitioners hitting concrete limits and asking the same unresolved questions: a reproducible audio-attention failure on the new 12B unified model when the text prompt grows large, a low-end single-GPU 31B configuration with hard speed numbers, a recurring and still-unanswered demand for QAT-vs-higher-bit comparisons, a document-segmentation limit on the 26B-A4B, and a reliability lesson about asking small local models to be autonomous agents on a single consumer GPU.

DiffusionGemma (claimed): a DeepMind text-diffusion model on the Gemma 4 26B-A4B architecture — promising, but unverified this sweep. Two posts surfaced a new model named DiffusionGemma. The detailed post describes it as a DeepMind release under Apache 2.0 that replaces autoregressive token-by-token decoding with a text-diffusion head: it starts from a 256-token "canvas" of placeholder noise and iteratively denoises the whole block at once (described as "Uniform State Diffusion"), with an error-correction step that re-introduces noise to self-correct mid-generation. The architectural claims are specific — a 26B Mixture-of-Experts built on the Gemma 4 architecture, activating ~3.8B parameters per token, fitting in roughly 18 GB of VRAM when quantized — and the post asserts throughput of 1,000+ tok/s on an H100 and 700+ tok/s locally on an RTX 5090, on the logic that block-wise diffusion shifts the bottleneck from memory bandwidth to raw compute. The second post simply frames it as "4× faster text generation." Strong caveats apply. Both posts are score-20 launch-window threads with no captured comment threads, the framing is promotional ("DeepMind just dropped…"), and none of the speed, VRAM, or quality figures are independently reproduced anywhere in this sweep — exactly the kind of exciting-but-unconfirmed number this guide deliberately does not adopt as fact. If real, a Gemma-4-architecture MoE that runs at hundreds of tok/s in ~18 GB would be a meaningful local-inference development worth tracking; for now, treat the model as plausibly-real but the performance claims as unverified author assertions pending hands-on community benchmarks. Confidence: low. (source: DiffusionGemma detail, source: "4x faster", June 10, 2026)

Gemma 4 12B unified audio: a reproducible attention-saturation failure when the text prompt gets large. The most actionable finding this sweep is a clean, multi-stack bug report on the new encoder-free unified 12B (audio/vision/text in one model). A user building a single-pass voice assistant — feed a recorded WAV plus a system prompt, get the text reply directly, collapsing the separate ASR + LLM steps — reports that audio attention works well with a minimal prompt but collapses once the text prompt becomes large and dense (theirs is ~21k tokens of detailed instructions plus tool definitions). At that size the model replies as if the audio were not present (generic or hallucinated), or only weakly transcribes it; trimming the prompt restores audio attention. Critically, the same behavior reproduced across three independent stacks — vLLM (gemma4-unified, base64 `audio_url`), llama.cpp (`--mmproj`, `input_audio` content, thinking disabled), and LiteRT-LM (GPU) — which argues it is an inherent attention/saturation limit when audio competes with a long dense text context rather than a single-backend quirk. Their workaround: the smaller E4B with a tiny prompt keeps audio attention reliably, so they use it as a lightweight audio front-end feeding the larger text model. Practical guidance: if you are using the 12B unified model for speech, keep the audio-turn system prompt short, or split audio handling onto E4B. Confidence: anecdotal and single-author, but materially strengthened by reproduction across three runtimes. (source, June 10, 2026)

Low-end single-GPU 31B, with numbers: RTX 3060 12 GB + 32 GB RAM runs 31B at IQ3_XXS around 1.3 tok/s — and the new QAT Q2/Q3 GGUFs reopen the quant-choice question. A 3060-12GB owner (32 GB DDR3) documents a fully-specified budget configuration for the dense 31B: `gemma-4-31B-it-UD-IQ3_XXS.gguf` (11.8 GB) with `ffn_down` tensor overrides to fit, running 16k context in bf16 at roughly 1.3 tok/s, with the bf16 mmproj offloaded to CPU; overriding more tensors stretches context to 32k at the cost of more CPU offload. The post's open question is timely: Q2–Q3 GGUF quants of the new QAT 31B now exist (mradermacher's `gemma-4-31B-it-qat-q4_0-unquantized` GGUF trees), and the user wants to know whether a QAT Q2/Q3 would beat their current non-QAT IQ3_XXS, how low they can push, and whether MTP is worth it when the draft model and context have to spill to CPU. No settled answer emerged this sweep. The honest read for readers on 12 GB cards: the dense 31B is runnable but slow (~1.3 tok/s is below comfortable interactive speed), and whether QAT low-bit quants improve on that on this exact hardware is currently unanswered — A/B them on your own workload. Confidence: anecdotal but the baseline config and speed are concrete and reproducible. (source, June 10, 2026)

The QAT-vs-higher-bit comparison is asked yet again — and still has no community answer. Reinforcing an open question that has now recurred across the June 8, June 10, and this sweep, a user with enough RAM + VRAM to run the 26B-A4B up to Q6_K asks the precise question the guide keeps flagging: how does the 4-bit Q4_0 QAT compare against a higher-bit non-QAT quant such as Q6_K? They correctly note that a KLD comparison against the original FP16 weights "wouldn't be appropriate" — echoing the June 8 methodological point that QAT is effectively a retrained, distinct model, so original-model-referenced divergence is the wrong yardstick. The demand signal is now unmistakable and consistent, but a clean, multi-task, same-hardware QAT-Q4 vs non-QAT-Q6 comparison still does not exist in the community record. Confidence: high that the question is unresolved; this entry documents demand, not a result. (source, June 10, 2026)

Document processing on the 26B-A4B: strong text extraction, but it can't reliably segment multi-report scans — and compliance is steering some users toward Gemma. A user replacing a commercial OCR/extraction pipeline for stacks of metal mill-test reports (1–5 pages each, inside 100+ page scanned batches, wildly varying vendor formats) reports a concrete limit on Gemma 4 26B-A4B (Unsloth QAT): it cannot reliably determine page/report boundaries when fed a long multi-report scan, which is the first step they need before per-report metadata extraction (lot number, metal type, alloy). A notable secondary driver here is procurement/compliance rather than capability: the user must avoid Chinese OCR software and is wary of Chinese models like Qwen under an anticipated "No Adversarial AI Act," even though Chinese models currently dominate OCR benchmarks — which pushes a Western-weights model like Gemma 4 into contention by default. Practical reading: Gemma 4 26B-A4B is a credible local document-understanding candidate, but document segmentation of unstructured multi-record scans is a real gap today; pair it with deterministic splitting (or a dedicated layout/segmentation step) rather than expecting the model to find boundaries on its own. Confidence: anecdotal single-author, one demanding workload. (source, June 10, 2026)

Reliability lesson for budget single-GPU agents: rigid code beats flexible reasoning loops. A practitioner who spent six months building a fully-local agentic extraction pipeline on a single consumer GPU — bouncing between Gemma 4 31B and Qwen 3.5 quants — concludes that handing a small quantized model a large system prompt, a pile of tools, and full autonomy to plan its own execution produced day-to-day instability (works perfectly one day, falls apart the next, with the GPU running hot). Their fix was to replace the open-ended reasoning loops with traditional rigid Python and call the model only for the narrow text-processing steps it does reliably. This is a qualitative anecdote, but it lands squarely on a pattern this series has now seen from several angles — including the project's own June 10 benchmark finding that high-thinking degraded the 12B's agentic score via reasoning-loop failures. The takeaway for the single-GPU tier: small local models are far more dependable as constrained components inside deterministic code than as autonomous agents, and "more reasoning" is not automatically more reliable. Confidence: anecdotal, but consistent with multiple prior signals. (source, June 10, 2026)

Open questions

Is DiffusionGemma real, and do its performance claims hold? A Gemma-4-architecture 26B-A4B text-diffusion model claiming 700+ tok/s on an RTX 5090 in ~18 GB VRAM would be significant — but this sweep has only two uncorroborated launch-window posts and zero independent benchmarks. The next sweep should look for hands-on numbers, a confirmed Hugging Face repo, and whether it is genuinely a DeepMind release. (source)
QAT-Q4 vs non-QAT higher-bit (e.g. Q6_K), same hardware and task set. Asked again this sweep for the 26B-A4B and still unanswered across the series. The community agrees KLD-vs-FP16-original is the wrong measure but has not produced the right one. (source)
On a 12 GB card, does a QAT Q2/Q3 31B beat a non-QAT IQ3_XXS, and is MTP worth it when context spills to CPU? A concrete 3060-12GB configuration exists (31B IQ3_XXS at ~1.3 tok/s, 16k ctx); whether the new low-bit QAT quants improve quality or speed there — especially with MTP draft/context offload — is untested. (source)
Can the 12B unified model attend to audio under a large system prompt? The reproducible audio-attention collapse above (~21k-token prompt, three stacks) needs either a configuration fix or confirmation that it is an inherent saturation limit. Until then, keep audio prompts short or front-end with E4B. (source)
What is the right onboarding path for a brand-new local user choosing between Gemma 4 and Qwen 3.6? A newcomer running Ollama on Windows asks how model sizes map to speed and VRAM fit and where to find a comprehensive cross-model benchmark — a recurring entry-level demand the guide should answer directly. (source)

Sources

The Gemma-mentioning posts driving this update (June 10 sweep, newest first). All are fresh launch-window threads (score ~20, no captured comment threads at sweep time), so treat individual numbers as first-look anecdotes or unverified claims rather than settled results:

DeepMind Just Dropped "DiffusionGemma" — Text Generation via Image-Style Diffusion Model (Jun 10, 2026 — claimed Apache-2.0 text-diffusion model on Gemma 4 26B-A4B arch; 700+ tok/s RTX 5090 / 1,000+ tok/s H100, ~18 GB VRAM — unverified)
DiffusionGemma: 4x faster text generation (Jun 10, 2026 — companion post framing DiffusionGemma as 4× faster)
Anyone gotten Gemma 4 12B (unified audio) to attend to speech with a large system prompt? (Jun 10, 2026 — audio-attention collapses at ~21k-token prompt; reproduced on vLLM, llama.cpp, LiteRT-LM; E4B front-end workaround)
Are these quants of QAT better than non-QAT? What do I use? (Jun 10, 2026 — RTX 3060 12 GB + 32 GB: 31B IQ3_XXS 11.8 GB, 16k bf16 ctx ~1.3 tok/s; new QAT Q2/Q3 GGUFs; MTP/CPU-offload question)
gemma4 QATs vs higher-bit regular quantizations? (Jun 10, 2026 — recurring demand for Q4_0 QAT vs non-QAT Q6_K on the 26B-A4B; KLD-vs-original noted inappropriate)
Model/tooling recommendations for complex document processing (Jun 10, 2026 — 26B-A4B Unsloth QAT can't segment multi-report scans; compliance steering away from Chinese OCR/models)
Hot Take: "Rigid code is better than Flexible code if you're on a budget" (Jun 10, 2026 — single-GPU Gemma 4 31B/Qwen 3.5 agentic loops unreliable; replaced with rigid Python)
I'm brand new to running LLMs and the sheer number of tools is overwhelming (Jun 10, 2026 — beginner on Ollama/Windows comparing Gemma 4 vs Qwen 3.6; sizing/benchmark guidance demand)
LLMs and tabletop games (Jun 10, 2026 — Gemma 4 31B as a board-game rules assistant and TTRPG idea generator; non-professional daily use)

Last updated: 2026-06-11 (June 11 sweep). Confidence: medium; DiffusionGemma headline is low-confidence/unverified. Next update fires when the daily Gemma 4 research cron flags notable new findings.

Field Notes — 2026-06-10

A weekly synthesis of what the r/LocalLLaMA community is reporting about Gemma 4 in real use. Curated from the latest Gemma-mentioning posts (15 new or updated since 2026-06-09, 322 total) and their top comment threads. Confidence is medium unless noted, since this is community signal rather than a controlled benchmark.

June 10 sweep, 2026-06-10 01:00 UTC: this sweep fills in three gaps left by prior cycles. First, the Unsloth QAT+MTP assistant package is now complete across all seven Gemma 4 model sizes — including E2B and E4B mobile variants — meaning the full speculative decoding stack no longer requires the boxwrench third-party heads. Second, a Jetson Orin NX 16GB result pushes Gemma 4 into the embedded/edge-AI hardware category for the first time in this series, and it outperforms expectations. Third, a documented Persian-language regression on QAT extends the pattern from prior sweeps: QAT consistently improves English-language reasoning while degrading some non-Latin-script performance. Around those three threads, the sweep adds Apple Silicon MLX benchmarks for the 26B-A4B QAT, a head-to-head between Gemma 4 12B and Qwen 3.5-9B on Mac M3 Max hardware, and a user-annotated finding that Gemma 4 31B can outperform Qwen 3.6 models on academic code understanding.

Unsloth QAT MTP assistant models now available for all seven Gemma 4 sizes, including E2B and E4B mobile variants. u/ParadigmComplex confirmed that Unsloth has published QAT+MTP assistant GGUFs across the full Gemma 4 lineup: 12B, 26B-A4B, 31B, E2B, E4B, and mobile QAT variants of both E2B and E4B. The files are named `mtp-gemma-4-*.gguf` in Q8_0 format, available at the root of each respective Unsloth HuggingFace repository, with additional larger quant options inside an `MTP/` folder. The practical significance: prior field notes (June 7) documented the boxwrench `gemma-4-qat-mtp-assistant-heads` collection as the only QAT-matched draft heads, which covered 12B, 26B-A4B, and 31B but not the small E2B or E4B mobile models. The Unsloth release closes that gap. For Jetson, Pi, or phone targets running E2B or E4B, QAT+MTP is now available without manual conversion from unquantized checkpoints. Confidence: high — direct community announcement with links to all seven repositories confirmed. (source, June 9, 2026)

Apple Silicon MLX 26B-A4B QAT comparison: 8-bit QAT holds quality vs 6-bit, but MLX 4-bit underperforms and size inflation remains a practical constraint. u/GoodTip7897 (Mac M5 Pro 64GB, oMLX 0.4.1) ran structured MMLU_PRO (50 questions) and HumanEval (100 questions) benchmarks across three Gemma 4 26B-A4B variants from mlx-community: the standard 4-bit MLX model, the 6-bit MLX model, and the QAT 8-bit model. All three use the same chat template (no multimodal tool-call differences affecting results), and all MLX quantization uses the same method — so the only variable is the original weight quality. Key finding: the QAT 8-bit and 6-bit models achieved statistically similar scores on both evaluations; the 4-bit model was measurably worse. The author did not observe the quality collapse between QAT 8-bit and 6-bit that a naive "more quantization = less quality" framing would predict. Practical constraint: MLX Gemma 4 26B-A4B QAT 8-bit is approximately 27GB on disk (June 9 sweep established this as the format retaining additional precision tensors that standard MLX conversion drops), versus ~17GB for the standard non-QAT MLX model. On a 64GB unified memory Mac, this is not a bottleneck; on a 36GB M3 Pro or lower, running the QAT 8-bit alongside other applications becomes tight. Confidence: anecdotal structured benchmark, single-author, single hardware configuration; the MMLU_PRO and HumanEval results are not published with error bars but the methodology is described clearly. (source, June 9, 2026)

Gemma 4 12B on Mac M3 Max 64GB: 47 tok/s with MTP, 42 tok/s without — but a single-question comparison against Qwen 3.5-9B leaves quality verdict open. u/Opening-Broccoli9190 (Mac M3 Max 64GB, llama.cpp defaults) ran Gemma 4 12B and Qwen 3.5-9B head-to-head on a single reasoning question. Speed numbers are specific: Gemma 4 12B with MTP (2 predicted tokens) at 47 tok/s; without MTP at 42 tok/s; with MTP (4 predicted tokens) at 29-36 tok/s (acceptance rate drops at higher draft count). Qwen 3.5-9B base with 1 MTP token: 52 tok/s. The author argues Gemma 4 12B's architectural departure — eliminating the separate vision encoder to enable encoder-free multimodal processing — creates a bad throughput tradeoff for the local inference tier, where the dense 12B competes directly against 9B models at similar speeds. The quality verdict on a single question went to Qwen 3.5-9B in the author's assessment. Reading this against prior field notes requires care: June 6 sweep established that Gemma 4 12B requires a specific Jinja chat template to unlock correct tool calling and reasoning — the author ran llama.cpp "all defaults," which is a known misconfiguration for 12B reasoning tasks. Whether the quality gap persists with a corrected template is not tested. Confidence: speed numbers are plausible for M3 Max; quality verdict is a single-question comparison under potentially suboptimal configuration. (source, June 9, 2026)

Jetson Orin NX 16GB: Gemma 4 26B-A4B UD Q2_K_XL at 14.65 tok/s with 64K context window — a new hardware category enters the field notes. u/Reddactor adapted a Jetson Orin NX 16GB (LPDDR5x, 40W mode) originally from a Llama-7B era robotics project. The target for Hermes Agent workloads: silent operation, >10 tok/s token generation, >300 tok/s prompt processing at 65K context. The winning configuration: Gemma 4 26B-A4B UD Q2_K_XL running at 14.65 tok/s token generation at ~8k context, 10.21 tok/s at ~60k context, with a confirmed 66K context window. Prompt processing hit 300+ tok/s. The author also tested Qwen 3.6 variants and other quant levels; none matched the Gemma 4 26B-A4B MoE for fitting a useful model into the Orin NX's memory budget. The architectural reason is the same one cited in CPU-only reports since June 7: the MoE design activates only ~4B parameters per token at inference, so Q2 quantization of a 26B MoE is cheaper than Q4 of a dense 12B — and in this case produces better results. Tradeoff: Q2_K_XL is an aggressive quant; the author notes the model still handles multi-tool-call workloads with long prompts "OK" rather than reliably. Practical guidance: for Jetson Orin NX 16GB targets running agentic pipelines at 60k+ context, the Gemma 4 26B-A4B at UD Q2_K_XL is the current best-documented configuration — outperforming both Gemma 4 dense models and Qwen 3.6 alternatives on this platform. Confidence: anecdotal single-author report with specific hardware, context lengths, and speed measurements; methodology not fully documented. (source, June 9, 2026)

Gemma 4 31B for academic research coding: outperforms Qwen 3.6 27B and 35B-A3B on a code-understanding task, rated near Opus 4.7 by the model itself. u/The_Paradoxy (academic researcher) reports surprising results from an early test of Gemma 4 31B on a specific code-understanding task: a messy dissertation codebase implementing niche statistical models, with uncommented code and misleading variable names. Their evaluation flow — which they found Qwen 3.6 models (both 27B and 35B-A3B) failed early on — was: explain how the code implements a model described in a paper. Gemma 4 31B substantially outperformed both Qwen 3.6 variants; Opus 4.7 rated Gemma 4 31B's performance as essentially on-par with its own performance on the same task. The author frames this as evidence that Gemma 4 31B excels specifically at understanding how code parts fit together — a structural reasoning capability that differs from "vibe coding" or benchmark-optimized code generation. Confidence: highly anecdotal (one researcher, one codebase, one task type); Opus 4.7 self-rating as a quality proxy is an experimental method, not a controlled benchmark. The result is coherent with the June 9 FP8 report showing Gemma 4 31B "keeping pace with Sonnet 4.6 medium" in a multi-task agentic harness. (source, June 9, 2026)

QAT regression on non-Latin scripts: Persian language benchmark shows QAT Q4_K_XL underperforming IQ4_XS and Q3_K_M. u/Vermicelli_Junior tested Gemma 4 26B-A4B across four configurations on a 20-question Persian language benchmark (all correct answers must be exactly the letter "A" in Persian/Arabic script — a test of both comprehension and precise instruction-following): Google AI Studio (effectively FP16): 17-20 correct; IQ4_XS (Unsloth, non-QAT): 14 correct; QAT Q4_K_XL (Unsloth): 11 correct, with additional typos and instruction-following failures; Q3_K_M (Unsloth): 13 correct with minor typos. The pattern is counterintuitive given QAT's established advantage on English-language reasoning benchmarks: QAT Q4_K_XL, the format that out-performs non-QAT on English tasks, is the worst-performing quant on this Persian benchmark. The practical reading: QAT optimizes the retraining around the distribution of training data, and if non-Latin-script content is underrepresented in the QAT fine-tuning distribution, QAT can reduce performance on those tasks even as it improves English scores. Confidence: anecdotal single-author report, proprietary benchmark, no statistical testing; the direction of the effect is consistent with the broader QAT-is-a-different-model framework established in June 8 notes. Readers using Gemma 4 for non-Latin-script workloads should test QAT vs standard quants on representative tasks before switching. (source, June 9, 2026)

Training cutoff advantage in practice: Gemma 4 knows Svelte 5 natively where other local models treat it as unreleased. u/Borkato notes a concrete knowledge cutoff advantage: Gemma 4 explains Svelte 5 runes correctly out of the box; competing local models respond that "Svelte 5 isn't released." This is a qualitative signal, not a structured benchmark, but it illustrates a practical differentiator for teams building with recently-released frameworks. Gemma 4's training includes post-2025 data that established models trained earlier lack. The entry is low-confidence as a general pattern but is worth noting for users whose work involves frameworks, APIs, or research that has evolved in the last 12 months. (source, June 9, 2026)

Human-annotated summarization benchmark: Qwen 3 tops the 30B range, Gemma 4 second — with a note that Qwen 3.6 may be agentic-optimized at summarization cost. u/Theboyscampus (team running summaries annotated by real humans, judged by LLM) reports results for the 30B parameter class: Qwen 3 (unspecified variant) scores highest, Gemma 4 second. The team's interpretation is that newer Qwen 3.6 models may have been optimized for agentic tasks at the expense of summarization quality. This is a single team's proprietary dataset and evaluation pipeline, and no model versions, quant levels, or statistical methodology are published in the post. Treat as a directional signal rather than a settled benchmark. (source, June 9, 2026)

Open questions

Direct QAT 4-bit vs standard Q8 comparison: still unresolved. Multiple community threads this sweep ask for head-to-head benchmarks between Gemma 4 QAT 4-bit and standard (non-QAT) Q8 quants. The June 6 sweep established that QAT 4-bit beats non-QAT Q8 in speed and VRAM while matching quality on the AMD RX 7900 XTX — but this result is for English-language tasks, one hardware platform, and one benchmark author's subjective quality assessment. Hard numbers across multiple platforms and task categories are still missing. (source)
RTX 4060 Laptop 8GB: no Gemma 4 path found. A follow-up report from an RTX 4060 Laptop 8GB user (i7-13620H, 32 GB DDR5-5200) who tested Gemma 4 models alongside Qwen 3.6 concluded that Gemma 4 was not viable for their use case on this hardware — they remained on Qwen 3.6-35B-A3B with external draft. No specific Gemma 4 config was described as having been tested, so the hardware ceiling for Gemma 4 on 8 GB consumer laptop VRAM remains an open question for the field notes. (source)
Gemma 4 12B quality under correct template on Mac M3 Max vs Qwen 3.5-9B. The June 10 comparison between Gemma 4 12B and Qwen 3.5-9B used llama.cpp default settings, which are a known misconfiguration for Gemma 4 12B reasoning tasks. The quality verdict may change with the custom Jinja template documented in the June 6 sweep.

Sources

The Gemma-mentioning posts driving this update (June 9-10 sweep, newest first):

Unsloth Gemma 4 QAT MTP assistant models now available (Jun 9, 2026 — QAT+MTP for all 7 sizes including E2B and E4B mobile)
Gemma 4 26B A4B IT QAT Comparison (Jun 9, 2026 — MLX Apple Silicon benchmark: MMLU_PRO and HumanEval across 4-bit, 6-bit, QAT 8-bit)
Gemma4-12B architecture change opinion/benchmark (Jun 9, 2026 — Mac M3 Max: 47 tok/s with MTP, quality single-question comparison with Qwen 3.5-9B)
Jetson Orin NX Build for Hermes Agent + Benchmarking (Jun 9, 2026 — Gemma 4 26B-A4B UD Q2_K_XL; 14.65 tok/s at 8k, 64K context window)
Gemma 4 31B's competence surprised me (Jun 9, 2026 — academic code understanding; Opus 4.7 rates performance near its own)
Unexpected Unsloth QAT Performance Compared to Unsloth IQ4_XS (Jun 9, 2026 — Persian language: QAT 11/20, IQ4_XS 14/20, Q3_K_M 13/20)
Gemma having updated knowledge base is so awesome (Jun 9, 2026 — Svelte 5 runes known natively)
Newer Qwen models are worse at summarization? (Jun 9, 2026 — human-annotated: Qwen 3 tops 30B, Gemma 4 second)
Anyone seen benchmarks comparing Gemma 4 4-bit QAT vs. 8-bit standard quants? (Jun 9, 2026 — open demand for QAT-4bit vs Q8 head-to-head)
Follow-up: Qwen3.6-35B-A3B 8GB RTX, I tried Linux, tested Gemma 4 (Jun 9, 2026 — RTX 4060 Laptop 8GB: Gemma 4 tested, Qwen 3.6 retained)

Last updated: 2026-06-10 (June 10 sweep). Confidence: medium. Next update fires when the daily Gemma 4 research cron flags notable new findings.

Field Notes — 2026-06-09

A weekly synthesis of what the r/LocalLLaMA community is reporting about Gemma 4 in real use. Curated from the latest Gemma-mentioning posts (14 new or updated since 2026-06-08, 307 total) and their top comment threads. Confidence is medium unless noted, since this is community signal rather than a controlled benchmark.

June 9 sweep, 2026-06-09 00:00 UTC: the dominant story this sweep is MTP maturation — three separate infrastructure milestones landed within 24 hours of each other, and their combined effect is a measurable speed jump that community members are already quantifying. The headline number is an RTX 3090 owner reporting Gemma 4 31B QAT+MTP at 70-80 tok/s after previously sitting at 40 tok/s, a 1.7-2x improvement driven by the convergence of QAT file sizes, the merged KV-cache optimization in b9551, and the mainline MTP draft-head pipeline. Around that acceleration story, two quieter signals deserve attention: a well-documented regression in QAT 12B tool-calling reliability (compared to the same user's previous Q5_K_L workflow), and a counter-intuitive in-context memory finding where the smaller Gemma 4 E4B outlasts the larger E2B in factual recall across a growing conversation.

RTX 3090 hits 70-80 tok/s on Gemma 4 31B QAT+MTP — previously 40 tok/s, a 1.7-2x field-reported improvement. An RTX 3090 owner (i9-13900H, 62 GB RAM, Ubuntu 24.04, CUDA 13.2) published a direct before/after comparison: Gemma 4 31B running at 40 tok/s on the same hardware before the QAT+MTP combination, now at 70-80 tok/s with the following config: `gemma-4-31B-it-qat-UD-Q4_K_XL.gguf` + matching MTP assistant head, `--spec-type draft-mtp --spec-draft-n-max 4 --ctx-size 40960 --cache-type-k q8_0 --cache-type-v q8_0`. The author also tested Gemma 4 12B with the multimodal projection file (`mmproj`) alongside the MTP assistant, reporting the same proportional speedup holds for the 12B — and specifically highlights that multimodal inference now shows nearly instant time-to-first-token because the quantized model begins generating before the image patches finish processing. The author frames this sweep as the inflection point where "GPU-poor" 24 GB users become effectively not-poor: the 40 tok/s 31B was already good, 70-80 tok/s is competitive with dedicated inference GPUs for conversational use. Confidence: anecdotal single-author report with a fully-specified configuration; the speedup range is consistent with the KV-cache and MTP gains landing in the same release window. (source, June 8, 2026)

Two llama.cpp PRs merged in the same window: KV-cache optimization (#24277) in b9551+ and E2B/E4B MTP assistant support (#24282). Two distinct infrastructure PRs affecting Gemma 4 MTP landed within hours of each other on June 8. The first, PR #24277 ("kv-cache: avoid kv cells copies" by ggerganov), is a targeted optimization that reduces unnecessary copies in the KV cache during speculative decoding — community members flag it as specifically beneficial for Gemma-4's MTP path, and it became available starting from build b9551. The second, PR #24282 ("mtp: support for gemma-4 E2B and E4B assistants" by max-krasnyansky), extends MTP speculative-decoding support to the two smallest Gemma 4 models: the 2B-parameter E2B and the 4B-parameter E4B — the community shorthand being "MTP for tiny Gemmas for mobiles, potatoes, Raspberry Pi, or maybe for ants." Practical takeaway for users on recent builds: update to b9551 or later to capture the KV-cache improvement, and if you are running E2B or E4B (on device, Pi, or low-VRAM hardware), MTP is now available to you without an additional fork. Confidence: high for the merges themselves (both linked directly to the merged PRs); performance impact of #24277 is community-reported rather than benchmarked. (source: #24277, source: #24282, June 8, 2026)

Dual RTX 3060 Ti (16 GB total VRAM) hits a 100 tok/s MTP ceiling — 33% over the baseline 75 tok/s, with 80%+ draft acceptance. A user running two RTX 3060 Ti 8 GB cards in a split-model configuration reports a practical MTP ceiling that practitioners on dual-card setups should recognize. Without the MTP assistant head: 75 tok/s on Gemma 4 12B QAT. With the assistant head: a peak of 100 tok/s — a 33% improvement, despite tuning `--spec-draft-n-max 6` and `--spec-draft-p-min 0.8` with 80%+ acceptance rate reported. The configuration routes the draft model to a dedicated card (`--spec-draft-device CUDA1 --split-mode layer --tensor-split 70,30`). The author's question — why 80% acceptance yields only 33% throughput rather than the theoretically higher multiplier — is a real tension in MTP scaling on bandwidth-limited hardware. The implied answer consistent with other field reports: at 16 GB total VRAM, memory bandwidth, not speculative acceptance rate, limits the ceiling. The 100 tok/s absolute number is nonetheless useful as a concrete dual-3060-Ti baseline. Confidence: anecdotal, single author, fully specified configuration. (source, June 8, 2026)

Gemma 4 12B QAT tool calling regresses versus Q5_K_L for some agentic workflows — server logs show a "control-looking token" warning as a diagnostic signal. A practitioner who had generated 2,300 lines of debugged, architecturally sound code and 10,000 lines of story writing with Gemma 4 12B Q5_K_L reports that switching to the QAT version breaks their agentic tool-calling reliability: the model "constantly questions itself" during generation, producing inconsistent results across coding-extension calls, story writing, and real-use cases — despite hitting 60 tok/s. The failure mode the author traces to the server startup log: `W load: control-looking token: 50 ' '` — a warning that a blank space token is being classified as a potential control token, which they identify as the root cause of self-interrupting behavior. The author tested both the Google and Unsloth 12B QAT builds and reports the regression is consistent across both. This report stands in tension with the June 8 sweep's positive 31B QAT endorsement: the size-dependent quality pattern continues — the 31B QAT appears to be a net improvement for most users, while the 12B QAT introduces issues specifically in tool-calling and agentic pipelines. Practical guidance: if your workflow depends on reliable tool calling, benchmark your specific agentic tasks before committing to 12B QAT — the `W load: control-looking token` warning in your server log is now a concrete diagnostic to check first. Confidence: anecdotal single-author report, but the server-log diagnostic is a reproducible signal. (source, June 8, 2026)

In-context memory test on E2B and E4B produces a counter-intuitive result: the larger E4B (4B dense) forgets a planted fact faster than the smaller E2B (2B dense). A methodical community experiment planted a fact ("my dog is named Pablo") at the start of a conversation, then inserted N turns of shuffled science Q&A filler, and tested recall with three random seeds per depth. Break point was defined as first depth where mean recall dropped below 0.80. Results: E2B (2B) broke at 8 turns, matching LFM2.5-8B-A1B (Liquid AI MoE, ~1.5B active). E4B (4B) broke at 5 turns — the smallest memory window despite being the largest model tested. LFM2.5 degraded gradually (still 1/3 correct at depth 15); both Gemma models cliffed sharply (near-perfect through the break point, then zero). Notably, none of the three models confabulated a wrong name on failure — all three produced some version of "I don't have access to your personal information," i.e., they refused rather than hallucinated. The mechanism is unclear: whether this reflects attention pattern differences, RLHF fine-tuning that over-applies privacy guardrails at longer context, or something else is an open question. Practical note for users building on-device agents that need to track user context across multi-turn conversations: E4B may need explicit context refresh earlier than E2B. Confidence: structured small-sample eval (3 seeds per depth), reproducible methodology; exact numbers reliable, generalization requires more seeds. (source, June 8, 2026)

Gemma 4 31B FP8 matches Sonnet 4.6 Medium in a user's production RAG and agentic harness — a meaningful quality benchmark from real deployed use. A practitioner running a custom evaluation harness reports that Gemma 4 31B FP8 keeps pace with Claude Sonnet 4.6 Medium across five task categories: Cypher queries for Neo4j graph traversal, entity extraction from text chunks (combining web query, graph query, and vector retrieval), agentic tool calling (skill selection and successful execution), Python code writing, and synthesis from multi-vector retrieval. The author is running Gemma and Qwen models in FP8 alongside the Claude API comparison and describes the result as "brought me joy." This is a qualitative report rather than a controlled benchmark, but its significance is the task coverage: the harness spans structured query generation, graph navigation, agentic execution, coding, and summarization — not just text generation. A prior sweep documented the same 31B QAT keeping pace with a user's agentic coding workflow; this is a different harness confirming the same directional signal on a different task mix including RAG. Confidence: anecdotal, unspecified hardware, no raw numbers; the task diversity is the value of the data point. (source, June 8, 2026)

LM Studio QAT+MTP gap remains open: the QAT assistant model does not appear in the speculative decoding panel on the current release. A user running the most recent LM Studio version with the latest bundled llama.cpp reports that after downloading the official QAT assistant GGUF, it does not surface in the speculative decoding side panel — the standard way to configure draft-head MTP in LM Studio. No resolution or workaround was available in comments at sweep time. This is a concrete gap for users who rely on LM Studio as their primary frontend: direct llama.cpp (`llama-server`) is currently the path to QAT+MTP, while LM Studio's GUI integration appears to lag the upstream merge. Users needing MTP acceleration today should use the command-line path documented in this and prior sweeps. Open question: is this a known gap with a targeted LM Studio release, or is it a matching/detection issue specific to QAT-tagged assistant GGUFs? Confidence: single-author report with no comments captured; treated as a flag rather than a conclusion pending resolution. (source, June 8, 2026)

MLX QAT 31B weighs ~27 GB versus 17 GB for both the standard and regular 4-bit MLX versions — an unanswered sizing question. A community member notes a puzzle in the MLX ecosystem: the QAT 4-bit MLX package for Gemma 4 31B is approximately 27 GB on disk, while both the non-QAT standard and the regular 4-bit MLX versions land at approximately 17 GB. No explanation surfaced in comments at sweep time. Possible factors that would account for the ~10 GB gap include: MLX-format metadata overhead for QAT-specific weight structures, non-quantized embeddings or output layers being kept in full precision alongside the quantized core, or a difference in how the QAT training run's additional checkpoint fields are preserved in the MLX export. For Apple Silicon users with 32 GB unified memory (the sweet spot for 31B), the 27 GB footprint is still manageable, but it is larger than a standard Q4 run and worth accounting for alongside the KV cache and context buffer when planning memory budgets. Confidence: observation confirmed (sizes are real and reproducible); the architectural cause is unresolved. (source, June 8, 2026)

Open questions

What is the architectural cause of the MLX QAT 31B size discrepancy? The QAT 4-bit MLX package is ~27 GB versus ~17 GB for both standard and regular 4-bit MLX. Whether this reflects non-quantized embedding/output layers, extra metadata, or a QAT checkpoint structure difference is unanswered. (source)
Why does LM Studio's speculative decoding panel not detect QAT assistant GGUFs? MTP works cleanly via direct llama.cpp command-line, but the GUI integration appears to not recognize the QAT assistant file as a valid draft model. Whether this is a filename-matching issue or a deeper LM Studio limitation is unclear. (source)
Does QAT 12B tool-calling regression trace to the "control-looking token" warning, or is it a training artifact? The server-log warning (`W load: control-looking token: 50 ' '`) is a plausible mechanism for self-interrupting agentic behavior, but whether fixing the tokenizer mapping would restore Q5_K_L-level tool-call reliability is unverified. (source)
Why does E4B (4B dense) lose in-context recall faster than E2B (2B dense) across multi-turn conversations? The counter-intuitive result may reflect RLHF fine-tuning differences, attention pattern differences, or a context-refresh behavior introduced at the 4B scale. Practical implication for on-device agents: unresolved. (source)
What is the MTP throughput ceiling on bandwidth-limited dual-card setups, and can the 33% cap on dual 3060 Ti be improved through batching or tensor split tuning? The current report hits 100 tok/s at 80%+ acceptance, suggesting bandwidth rather than acceptance rate is the binding constraint. A documented resolution would help dual-card users set realistic expectations. (source)

Sources

The Gemma-mentioning posts driving this update (June 8-9 sweep, newest first). All are fresh threads (score ~20, no captured comment threads at sweep time), so treat individual numbers as first-look anecdotes rather than settled results:

[[3090] Gemma4 QAT + MTP quick TPS numbers [TLDR 1.2-1.8x better]](https://reddit.com/r/LocalLLaMA/comments/1u08zhx) (Jun 8, 2026 — RTX 3090 31B hits 70-80 tok/s from 40; multimodal 12B also benchmarked)
kv-cache: avoid kv cells copies — PR #24277 merged (b9551+) (Jun 8, 2026 — KV cache optimization improves MTP for Gemma-4)
mtp: support for gemma-4 E2B and E4B assistants — PR #24282 merged (Jun 8, 2026 — MTP extended to 2B and 4B tiny models)
Gemma 4 QAT + MTP: max 33% speed increase, any ideas? (Jun 8, 2026 — dual 3060 Ti 8GB, 75 → 100 tok/s ceiling documented)
Gemma 4 Chat Template now has preserve thinking (Jun 8, 2026 — official upstream chat template update)
Gemma 4 12b QAT is a regression for my use case (Jun 8, 2026 — tool-calling regression vs Q5_K_L; control-looking token warning diagnostic)
I tested in-conversation memory on LFM2.5, Gemma 4 E2B and E4B (Jun 8, 2026 — E4B breaks at 5 turns, E2B at 8; counter-intuitive size/memory result)
Gemma4_31b_fp8 keeping up with Sonnet_4.6_medium in my harness (Jun 8, 2026 — RAG, agentic, code, synthesis parity with Claude Sonnet 4.6 Medium)
LMStudio gemma 4 31b QAT with MTP (Jun 8, 2026 — QAT assistant not detected in LM Studio speculative decoding panel)
Why is the MLX version of the Gemma 4 QAT so big? (Jun 8, 2026 — 27 GB vs 17 GB sizing puzzle, unanswered)
Used local Ollama (gemma4:e4b) to bulk-generate AI summaries for 4300 arXiv papers (Jun 8, 2026 — E4B as structured-JSON summarizer at scale)

Last updated: 2026-06-09 (June 9 sweep). Confidence: medium. Next update fires when the daily Gemma 4 research cron flags notable new findings.

Field Notes — 2026-06-08

A weekly synthesis of what the r/LocalLLaMA community is reporting about Gemma 4 in real use. Curated from the latest Gemma-mentioning posts (14 new or updated since 2026-06-07, 293 total) and their top comment threads. Confidence is medium unless noted, since this is community signal rather than a controlled benchmark.

June 8 sweep, 2026-06-08 00:00 UTC: this sweep is dominated by one infrastructure milestone and its fallout. Gemma 4 Multi-Token Prediction (MTP) support has been merged into mainline llama.cpp — last week's field notes still required the Atomic community fork for speculative decoding, so this is the moment MTP becomes available in a standard build. The merge immediately surfaces three practical consequences threaded through the rest of the sweep: old GGUFs are not compatible and must be re-downloaded, the setup now requires a second draft-head file alongside the main model, and at least one user's repeated-garbage symptom traced back to a corrupted GGUF blob rather than a model bug. Around that milestone, the QAT (Quantization-Aware Training) story turns more nuanced — a glowing 31B report and a 26B-A4B regression report land in the same sweep — and the CPU-only and minimal-multi-GPU paths each pick up a fresh concrete data point.

llama.cpp Gemma 4 MTP support has merged into mainline — old GGUFs are incompatible and you now need a separate draft-head file. The headline of this sweep is short: a community post simply titled "llama.cpp Gemma4 MTP support merged!" confirms that Multi-Token Prediction for Gemma 4 is now in mainline llama.cpp, no longer requiring the Atomic fork that previous field notes pointed to for speculative decoding. Two other posts in the same sweep corroborate it and spell out the practical implications. One user laying out the new mental model states the three facts plainly: MTP has merged, old GGUFs are not compatible, and you now need a second file (the MTP draft/assistant head) loaded alongside the main model — a setup that is generating real confusion about which official GGUF to download and what the various Unsloth and Google QAT/MTP filename suffixes mean. Practical guidance: if you previously ran Gemma 4 on llama.cpp, updating to a current mainline build and re-pulling an MTP-compatible GGUF (plus a matching assistant head) is the path to speculative decoding without a fork — but budget time to re-download weights, because cached pre-merge blobs will not work. Confidence: high for the merge itself (a direct announcement plus two independent corroborating reports); the surrounding setup details are medium, drawn from launch-week user reports rather than official docs. (source: merge announcement, source: MTP/QAT relationship, June 7, 2026)

Working post-merge recipe — Gemma 4 31B QAT on an RTX 5090 at ~21.5 GB VRAM — and a reminder to verify your GGUF hash. A practitioner who hit the notorious repeated-`` garbage output while loading the 31B QAT model on the MTP branch traced it to a corrupted cached GGUF blob: the local file's SHA256 did not match the Hugging Face expected hash. The fix was to move the bad blob aside, force a re-download (`hf download --force-download`), and rebuild llama.cpp master after the Gemma 4 MTP merge. The resulting working configuration is fully specified: llama.cpp master (build f0156d140), `gemma-4-31B-it-qat-UD-Q4_K_XL.gguf`, RTX 5090 32 GB, `--ctx-size 40960`, `--cache-type-k q8_0 --cache-type-v q8_0`, `--flash-attn` — landing at roughly 21.5 GB VRAM with clean text generation. One caveat carries forward: the MTP assistant head still failed for this user with both their local assistant GGUF and the public QAT-matched head, citing metadata/assertion issues, so the long-context QAT main model runs cleanly but speculative decoding on top of it is not yet reliable for everyone. Two takeaways for readers: the 31B QAT at Q4_K_XL is a comfortable fit on a single 32 GB card with quantized KV cache and 40k context, and a repeated-token failure after a model update is worth checking against a hash mismatch before assuming the build is broken. Confidence: anecdotal single-author report, but the configuration is fully specified and reproducible. (source, June 7, 2026)

QAT quality reports split this sweep: a strong 31B endorsement and a 26B-A4B regression — treat QAT as size-dependent, not universally better. Two QAT reports point in opposite directions and are worth holding side by side rather than averaging. On the positive side, a daily user (Qwen 3.6 27B for programming, Gemma 4 for everything else) reports that the 31B QAT lets a single model cover both their short-context and long-context workloads — previously split between Q4K_L for 128k tasks and Q6_K_L for 32k tasks — with subtle quality gains: more varied word use in roleplay, better grasp of correlations, and output they rate at least as good as bartowski's Q6_K_L. They add that MTP with the 31B QAT has been "amazing," but flag one persistent limit: KV cache quantization still bites, with Q8_0 KV showing noticeable degradation at 128k context. On the negative side, a separate user running the 26B-A4B QAT Q4_0 (both the Google and Unsloth Q4_K_XL builds, with the recommended `--temp 1.0 --top-p 0.95 --top-k 64` on llama.cpp b9549) finds it regresses on the chessboard-SVG spatial test versus the \_old non-QAT 26B-A4B Q4_K_XL, which "got everything right" while the QAT version swaps color patterns and misplaces pieces across repeated runs. The reconciliation that fits both reports: QAT appears to help the dense 31B while possibly hurting the 26B-A4B MoE on at least some structured-reasoning tasks — consistent with the open question from the June 7 sweep about why QAT accuracy varies by architecture. Practical guidance: if you switch to QAT, A/B it against your previous quant on your own representative task before committing, especially for the MoE 26B-A4B. Confidence: both reports are anecdotal and single-author; the divergence itself is the finding. (source: positive 31B QAT, source: 26B-A4B QAT regression, June 7-8, 2026)

The QAT-vs-original accuracy puzzle gets a methodological answer: QAT is effectively a retrained, different model, so FP16-reference comparisons mislead. Last week's open question — why does the QAT 12B deviate furthest from FP16 in Unsloth's accuracy table — drew a sharp methodological response this sweep. The argument: because QAT retrains the weights rather than merely quantizing them post-hoc, the QAT 31B should be treated as a distinct model from the original 31B, which means measuring the divergence of QAT-Q4 against an original-model FP16 reference is the wrong comparison and will inevitably look bad. The proposed correct procedure is to first benchmark the QAT model unquantized (e.g. SuperGPQA, HLE, MMLU) to assess how much retraining shifted overall quality, and only then compare QAT-Q4-vs-QAT-unquantized and original-Q4-vs-original-unquantized as two separate, internally consistent divergence measurements. This does not resolve whether QAT is better or worse — it argues that most of the alarming "deviation from FP16" numbers circulating are comparing the wrong baselines. Readers evaluating QAT quality claims should check which reference model the divergence was measured against before drawing conclusions. Confidence: this is community reasoning rather than a completed benchmark, but the methodological point is sound and directly clarifies a previously-open question. (source, referencing the June 7 KLD analysis, June 7, 2026)

CPU-only keeps consolidating: the 26B-A4B MoE runs at ~7 tok/s on a no-GPU $150 used desktop. Adding to the CPU-only data points from recent sweeps (the dual-Xeon 31B Q8_0 at ~4 tok/s on June 7, and an earlier i5-8500 report), a user reports running Gemma 4 26B-A4B on an i5-8500 with 32 GB DDR4 and no GPU under KoboldCpp on Linux at roughly 7 tok/s — on a desktop they bought used for about $150. Their framing matters for the category: dense 12B models on the same box run "slow but perfectly usable," whereas the 26B-A4B "simply flies" by comparison, because its MoE design activates only ~4B parameters per token. The consistent pattern across three independent CPU-only reports now is that MoE architecture, not raw parameter count, is what makes CPU-only Gemma 4 viable — a 26B-A4B can outrun a 12B dense model on the same GPU-less hardware. Practical guidance: for CPU-only or GPU-poor setups, prefer the 26B-A4B MoE over a similarly-sized dense model, and expect single-digit-but-usable tok/s with 32 GB of system RAM. Confidence: anecdotal single-author report, but it reinforces a pattern now seen across multiple independent sweeps. (source, June 7, 2026)

Two emerging hardware-specific threads: NVFP4 QAT quants for Blackwell, and a multi-GPU llama-server router gotcha. Two narrower but concrete reports round out the sweep for users on newer or multi-card setups. First, NVFP4 — a Blackwell-native 4-bit format — now has llama.cpp support merged, and a community member has published NVFP4 QAT quants of Gemma 4 31B targeted at Blackwell cards (`melcheikh/gemma-4-31B-it-qat-NVFP4-Blackwell`, with a matching assistant head). The open practical gap is that these ship as safetensors with no GGUFs, and the conversion path from NVFP4 safetensors to GGUF is not yet documented — a dual-RTX-5060-Ti owner asking how to do it got no clear answer this sweep. Second, a multi-GPU operator running a single llama-server router across a 2× RTX 3090 + 2× RTX 4060 Ti + RTX 5060 Ti rig documents a real gotcha: each per-model child process allocates a CUDA context on every card (~256 MiB on each 3090) even when the model is pinned to a single device with `-ngl 99`. When a 27B model at 262K context fills both 3090s, loading a small Gemma 4B pinned to the 5060 Ti OOMs about 0.2s into load — not because the target card is full (it had 15 GB free) but because the child cannot create its incidental context on the already-full 3090s. Practical guidance: on dense multi-GPU rigs, account for per-child CUDA-context overhead on every visible card when sizing context budgets, and consider isolating cards (e.g. `CUDA_VISIBLE_DEVICES`) per child rather than relying solely on device pinning. Confidence: both anecdotal, single-author reports on specific hardware; the NVFP4 quants are linked HF repos, the router behavior is a reproducible operational constraint. (source: NVFP4 on llama.cpp, source: router CUDA-context OOM, June 7, 2026)

Open questions

Does QAT help MoE models the way it helps dense ones? This sweep produced a strong 31B-dense QAT endorsement and a 26B-A4B-MoE QAT regression in the same window. Whether QAT systematically trades quality differently for MoE versus dense Gemma 4 remains unresolved and is now backed by conflicting field reports, not just theory. (source)
Is the MoE 26B-A4B less quantization-resilient at long context than dense models? A user reports the 26B-A4B looping at ~45k context under `UD-Q5_K_XL` with default llama.cpp sampling, with the looping fixed by moving to a 6-bit quant — and notes they do not recall the same looping from dense models at Q4_K_M. Whether MoE's 4B-active design is genuinely more loop-prone under aggressive quantization, or whether this is a sampling-settings artifact (the author asks whether to enable the DRY sampler), is an open and practically important question for long-context MoE users. (source)
What is the documented path from NVFP4 safetensors to a runnable GGUF for Gemma 4 QAT on Blackwell? NVFP4 quants exist and llama.cpp has merged NVFP4 support, but no end-to-end conversion recipe surfaced this sweep. (source)
What can a 4 GB-VRAM laptop realistically do with Gemma 4, and what is the true minimum for ~30B at 20+ tok/s? A user on an RTX 3050 4 GB laptop (Ryzen 7 5800H, 16 GB DDR4) asked exactly this and the thread did not yet produce a crisp answer — a recurring entry-level demand the guide should address directly. (source)
Is Gemma 4 12B a good coding model, and how is it best run on vLLM? Two separate posts this sweep ask whether the 12B is good for coding and how to run `gemma-4-12b-it-qat-w4a16-ct` under vLLM (it errors out for the asker). Demand for a clear 12B coding-setup answer continues to outpace settled community guidance. (source: 12B coding, source: vLLM QAT)

Sources

The Gemma-mentioning posts driving this update (June 7-8 sweep, newest first). All are fresh launch-window threads (score ~20, no captured comment threads at sweep time), so treat individual numbers as first-look anecdotes rather than settled results:

llama.cpp Gemma4 MTP support merged! (Jun 7, 2026 — mainline MTP merge confirmation)
What's your experience with Gemma4 QAT? (Jun 8, 2026 — positive 31B QAT report; one model for both context lengths; Q8_0 KV degrades at 128k; MTP "amazing")
QAT variant of Gemma4 26B A4B is not working well for me (Jun 7, 2026 — 26B-A4B QAT regresses on chessboard-SVG vs old non-QAT)
Gemma 4 31B QAT GGUF loads with MTP branch, but outputs repeated <unused49> (Jun 7, 2026 — corrupt-blob hash mismatch; working RTX 5090 recipe at ~21.5 GB VRAM)
You don't need a GPU to run gemma-4-26B-A4B (Jun 7, 2026 — ~7 tok/s on i5-8500 + 32 GB, no GPU, KoboldCpp)
How to compare Original vs QAT Gemma 4 31B Q4 quants (Jun 7, 2026 — QAT is a retrained, different model; FP16-reference KLD is the wrong baseline)
Dense vs MoE quantization resiliance (Jun 7, 2026 — 26B-A4B Q5_K_XL loops at ~45k context, fixed at Q6)
NVFP4 on llama.cpp? (Jun 7, 2026 — Blackwell NVFP4 QAT quants of 31B exist; safetensors-to-GGUF path unclear)
llama-server router: a model pinned to one GPU still grabs a CUDA context on every card (Jun 7, 2026 — multi-GPU per-child CUDA-context OOM)
MTP and QAT - what is the relation? (Jun 7, 2026 — MTP merged, old GGUFs incompatible, second file required; naming confusion)
Need some guidance toying with local models (Jun 7, 2026 — RTX 3050 4 GB laptop entry-level question)
Is Gemma 4 12b good for coding? (Jun 7, 2026 — 12B coding demand signal)
how to run gemma-4-12b-it-qat-w4a16-ct in vllm (Jun 8, 2026 — vLLM QAT w4a16 errors)

Last updated: 2026-06-08 (June 8 sweep). Confidence: medium. Next update fires when the daily Gemma 4 research cron flags notable new findings.

Field Notes — 2026-06-07

A weekly synthesis of what the r/LocalLLaMA community is reporting about Gemma 4 in real use. Curated from the latest Gemma-mentioning posts (13 new or updated since 2026-06-06, 279 total) and their top comment threads. Confidence is medium unless noted, since this is community signal rather than a controlled benchmark.

June 7 sweep, 2026-06-07 00:00 UTC: four developments from this sweep are directly relevant to Gemma 4 practitioners: QAT-matched MTP draft heads are now publicly available on HuggingFace for all three flagship sizes, enabling speculative decoding on the official QAT Q4_0 models — the first RTX 4070 Super 12GB benchmark reports 120 tok/s using the Gemma 4 12B QAT with MTP; Strix Halo users have detailed QAT Q4_0 numbers via llama.cpp Vulkan/RADV for both 12B and 26B-A4B, including a fix for the PARALLEL=2 crash that was blocking MTP-enabled inference; the community raises a documented open question about why Gemma 4 12B deviates furthest from FP16 under QAT accuracy analysis despite being a dense model (not MoE); and a practical field report confirms that Gemma 4 MoE models are viable on a bare-minimum multi-GPU setup (dual GTX 1050 Ti 4GB, 64GB DDR4) at 12-18 tok/s generation.

QAT-matched MTP heads now public — RTX 4070 Super 12GB benchmarks 120 tok/s on Gemma 4 12B QAT with speculative decoding. u/westsunset published QAT-matched MTP draft heads for all three official Gemma 4 QAT sizes on HuggingFace under `boxwrench/gemma-4-qat-mtp-assistant-heads`: `gemma-4-12B-it-qat-assistant-MTP-Q8_0.gguf` (444 MiB, pairs with the official Q4_0 12B), `gemma-4-26B-A4B-it-qat-assistant-MTP-Q8_0.gguf` (441 MiB), and `gemma-4-31B-it-qat-assistant-MTP-Q8_0.gguf` (491 MiB). These are converted from Google's official unquantized QAT assistant checkpoints. The distinction from generic MTP heads matters: a draft head trained against full-precision weights but paired with a QAT main model misses the quantization shifts, reducing draft acceptance rates. QAT-matched heads restore acceptance rates close to those of the standard non-QAT pairing. On the hardware front, u/janvitos (RTX 4070 Super 12GB, Ryzen 7 9700X, 32GB DDR5-6000) reports 120 tok/s at code generation using the Gemma 4 12B QAT with the Atomic llama.cpp fork's MTP support, compared to 60 tok/s without MTP on the same hardware — a 2x throughput improvement for coding tasks on a 12 GB consumer card. Separately, the PARALLEL=2 crash that was blocking multi-slot MTP inference has been fixed in both the Atomic fork and filed upstream in the native llama.cpp MTP PR. Practical guidance: if you are running Gemma 4 12B QAT on a 12 GB or larger GPU, the boxwrench QAT-matched heads plus a MTP-enabled llama.cpp build is the current best configuration for coding and structured output workloads. Confidence: structured benchmark by u/janvitos on verified hardware, consistent with prior reports of MTP acceptance rates at coding tasks. (source: 120 tok/s report, source: MTP heads, June 6, 2026)

Strix Halo QAT Q4_0 benchmark via Vulkan/RADV — 13.45 GiB for 26B-A4B, both 12B and 26B-A4B tested on 128GB unified memory. u/westsunset (AMD Ryzen AI Max+ 395 / Radeon 8060S, 128GB LPDDR5X, Linux Mint 22.3, Mesa 25.2.8) published the first detailed QAT Q4_0 benchmarks on Strix Halo hardware via llama.cpp Vulkan/RADV, using the Atomic TurboQuant fork to enable MTP assistant head support. The 26B-A4B QAT Q4_0 model weighs 13.45 GiB on disk; the 12B QAT Q4_0 weighs approximately 6.5 GiB. The Strix Halo's 128GB unified memory pool (with 96 GiB GTT ceiling) allows both models to run with large context windows without VRAM pressure, making it one of the most capable consumer platforms for extended context Gemma 4 inference. The post also includes the first Gemma 4 12B 2-slot MTP numbers on this hardware after the PARALLEL=2 fix. Key constraint: these numbers use the Vulkan/RADV backend with the Atomic fork, not mainline llama.cpp — users on the upstream ROCm or CUDA paths should verify results before treating them as general guidance. Anecdotal confidence for absolute numbers; architecture and sizing data are confirmed. (source: Strix Halo QAT bench, source: MTP heads + 2-slot bench, June 6, 2026)

Open question: Gemma 4 12B deviates furthest from FP16 in QAT accuracy analysis, while E2B/E4B are near-perfect — community does not yet have an explanation. u/ai_fonsi raised a counter-intuitive finding from Unsloth's QAT analysis table: Gemma 4 E2B and E4B achieve near-perfect QAT accuracy relative to their FP16 baselines, while the 12B dense model shows the most deviation from FP16 among all the tested sizes — despite conventional expectation that smaller, denser models quantize better than larger MoE models. No community commenter has provided a verified explanation as of this sweep. Proposed hypotheses include: an issue specific to the 12B QAT training run on Google's side; the 12B's attention pattern being less QAT-friendly than the MoE routing patterns of E2B and E4B; or a methodology difference in how the deviation is computed for dense versus MoE architectures. This matters practically: if the 12B QAT is genuinely less accurate relative to FP16 than the 26B-A4B QAT, users who switched to 12B QAT purely on size grounds may be accepting a quality tradeoff that is not advertised in the model description. The same community thread asks whether non-QAT variants might actually outperform QAT on some tasks. Open question — no resolution at sweep time. Confidence: observation is from published Unsloth analysis table (medium confidence for the numbers); the root cause remains unverified. (source, June 6, 2026)

CPU-only path: dual Xeon Platinum 8358 (128 threads), 256GB DDR4 running Gemma 4 31B Q8_0 at approximately 4 tok/s — "slow, but earns its keep" for quality-sensitive overnight workloads. u/bitslizer documents a real-world CPU-only deployment for Gemma 4 31B at Q8_0 on a dual Xeon Platinum 8358 workstation (256 GB DDR4), achieving approximately 4 tok/s at generation. The author frames this explicitly as an acceptable speed for "background or overnight type jobs where I don't need the speed but need the smart and accuracy." This is a meaningful data point for users evaluating CPU-only options: Gemma 4 31B Q8_0 at 4 tok/s is approximately 8x slower than a single mid-range consumer GPU, but it runs the full Q8_0 quality level in 32GB of system RAM rather than 32GB of GPU VRAM. The same post explores whether the new QAT Q4_0 (17GB vs 32GB, roughly double the generation speed on bandwidth-limited hardware) would be a worthwhile switch — the author's KLD benchmark results were counter-intuitive and the post invites community explanation (see above). Confidence: anecdotal, single author report on specific hardware configuration. (source, June 7, 2026)

Minimal multi-GPU setup viable for Gemma 4 MoE: i5-12400 + dual GTX 1050 Ti 4GB (8GB total VRAM) + 64GB DDR4 achieves 12-18 tok/s. u/j0hnp0s reports surprisingly viable Gemma 4 and Qwen 3.6 MoE inference on a minimal rig: Intel i5-12400, 64GB DDR4, and two GTX 1050 Ti 4GB cards (8 GB total VRAM). With the MoE expert layers split across VRAM and system RAM, the setup achieves approximately 40 tok/s prompt processing and 12-18 tok/s token generation — figures the author describes as "unexpectedly viable" given the hardware. The weak point is prompt processing, particularly for agentic workflows where the context window grows progressively. This is the lowest VRAM floor for functional Gemma 4 MoE inference documented at this sweep date: 8 GB total VRAM with a 64 GB DDR4 buffer. The author had expected the setup to be completely unusable and found that MoE architecture's expert routing makes it more memory-bandwidth-flexible than equivalent dense models at the same parameter count. Anecdotal confidence for the specific numbers; the general MoE inference pattern is consistent with the architecture. (source, June 6, 2026)

Community sentiment has shifted toward Gemma 4 over the last 60 days — QAT, MTP, and tooling improvements cited as key drivers. A community thread (u/DigRealistic2977) directly names the reversal: two months ago, Gemma 4 posts were met with downvotes and "Qwen is better" rebuttals; by June 2026, the same community members are requesting a 100B+ or larger MoE Gemma model and expressing disappointment that Google hasn't shipped one yet. The named factors behind the shift are convergent with the prior field notes pattern: MTP speculative decoding (making smaller Gemma 4 models competitive in throughput with larger Qwen models), QAT releases (restoring FP16 quality at Q4 file sizes), and the Heretic finetune ecosystem (giving users a tunable, less-restricted variant). A separate qualitative field report (u/Some-Cauliflower4902, one month of Gemma 4 31B daily use) aligns: standard Q4_K_M UD at long context (20k+) is "a functional nervous wreck" — the model falls apart under chain-of-tools pressure; the Heretic variant is less careful but more resilient; and the QAT version behaves like a "zen master," handling 32k with full reasoning without destabilizing. The interpretation is that the "nervous" behavior of standard Q4 is a quantization artifact, not an alignment characteristic, and QAT resolves it. Confidence: subjective community sentiment, anecdotal quality assessments; the pattern is consistent across multiple independent reporters. (source: sentiment thread, source: qualitative 31B report, June 6, 2026)

Field Notes — 2026-06-06

A weekly synthesis of what the r/LocalLLaMA community is reporting about Gemma 4 in real use. Curated from the latest Gemma-mentioning posts (61 new or updated since 2026-05-26, 266 total) and their top comment threads. Confidence is medium unless noted, since this is community signal rather than a controlled benchmark.

June 6 sweep, 2026-06-06 00:00 UTC: six developments from this sweep are directly relevant to Gemma 4 practitioners: Google and Unsloth have both published Quantization-Aware Training (QAT) releases that eliminate the usual quality-versus-size tradeoff — an AMD RX 7900 XTX benchmark shows Gemma 4 12B QAT 45% faster with 5.7 GB less VRAM and no perceptible quality loss; the new Gemma 4 12B dense model lands as a compact multimodal option that fits comfortably under 16 GB VRAM at Q4_K_XL delivering approximately 61 tok/s on an RTX 5080; Unsloth has released MTP GGUF weights for all three flagship sizes (31B, 26B-A4B, 12B) achieving 120 tok/s at 32k context with 90%+ draft acceptance during coding; a widely-shared PSA documents the custom Jinja chat template fix for Gemma 4 12B that unlocks reliable coding and tool calling by correcting the default Qwen-format mismatch in LM Studio; a direct vision comparison confirms Gemma 4 26B-A4B correctly extracts calendar events from images where Qwen 3.6 35B fails; and BeeLlama v0.3.2-Preview introduces KVarN, the first llama.cpp-ecosystem implementation of Huawei's 3-5x KV cache compression, tested on Gemma 4 31B on an RTX 3090.

QAT release: Google and Unsloth publish Quantization-Aware Training collections — AMD RX 7900 XTX benchmark shows 45% faster, 5.7 GB less VRAM, identical quality. Google published the `google/gemma-4-qat-q4-0` and `google/gemma-4-qat-mobile` collections, making Quantization-Aware Training GGUFs available for all Gemma 4 model sizes. QAT differs from post-training quantization by baking quantization awareness into the training process itself, allowing the model to maintain BF16-quality responses at Q4_0 file sizes. Unsloth followed with a companion collection (`unsloth/gemma-4-qat`) for users who prefer Unsloth-packaged GGUF formats. The key benchmark, run by a community member on an AMD Radeon RX 7900 XTX, puts concrete numbers on the gap: Gemma 4 12B QAT reduces total generation time from 323 seconds to 176 seconds (45.4% faster), increases throughput from 3.09 to 5.68 tok/s (83.8% improvement), and uses 5.7 GB less VRAM compared to the Q8_0 baseline — while the author reports perceptually identical output quality across the tested prompts. Thread commenters note that QAT and MTP are orthogonal optimizations that can be combined: a QAT model is smaller and faster at baseline, and an MTP-enabled QAT model adds speculative decoding on top of that. For users currently running standard Q4 or Q8 quants of Gemma 4 12B, switching to the QAT collection is the highest-leverage single upgrade available at this sweep date. Confidence: structured benchmark on a single hardware configuration (RX 7900 XTX); quality assessment is subjective and based on the benchmark author's impressions rather than automated evaluation. (source, benchmark, Unsloth, June 5-6, 2026)

Gemma 4 12B dense model: Q5_K_XL at ~50 tok/s for coding, Q4_K_XL at 8.6 GB and ~61 tok/s with 32k context fitting in 15.7 GB VRAM. A community report documents practical performance for the newly-released Gemma 4 12B dense model, which the author adopts as a daily coding assistant under the descriptor "my new main squeeze." The author's primary workload is coding, using a Q5_K_XL quant at approximately 50 tok/s on their hardware. A companion benchmark from an RTX 5080 (16 GB GDDR7) user running the llama.cpp gemma4-mtp build provides more precise numbers: Q4_K_XL weighs 8.6 GB and achieves approximately 61 tok/s at 32k context, while running the full 32k context window requires 15.7 GB VRAM — fitting cleanly in a 16 GB card. The 12B is an encoder-free multimodal model, consistent with Google's stated design philosophy of eliminating the separate vision encoder that large multimodal models traditionally use; images are processed as patches directly by the base model, preserving fine-grained spatial detail. Community reception is positive for 12B as a daily coding assistant, with the speed advantage and comfortable VRAM profile cited over the 26B-A4B as the primary practical advantage. Practical guidance: for users with a 16 GB GPU who want a fast and responsive coding model, Gemma 4 12B Q4_K_XL is the most efficient option in the current Gemma 4 lineup. Confidence: community benchmark, consistent with hardware specifications and expected scaling. (source, benchmark, June 5-6, 2026)

Unsloth MTP GGUF weights for all three flagship sizes — RTX 5080 reports 120 tok/s at 32k context with 90%+ draft acceptance during coding. Unsloth released Multi-Token Prediction GGUF weights covering Gemma 4 31B, 26B-A4B, and 12B in Q8, F16, and BF16 precision. MTP weights add an auxiliary prediction head to the model file that enables speculative decoding: the draft head proposes multiple future tokens that the main model can verify in a single forward pass, increasing effective throughput without quality degradation. The RTX 5080 benchmark from the same author as the 12B report above (using the llama.cpp gemma4-mtp build) shows the practical result: approximately 120 tok/s at 32k context on the 12B MTP model, with 90%+ draft token acceptance during coding tasks. The high acceptance rate during coding is a meaningful signal — speculative decoding benefits are use-case dependent, and acceptance rates above 85% indicate the MTP head is well-calibrated for structured output like code. Prior field notes (May 25) established that Google published official MTP assistant variants at `huggingface.co/google/gemma-4-{size}-it-assistant`; the Unsloth release provides GGUF-native versions with Q8, F16, and BF16 precision options that work directly in llama.cpp and LM Studio without conversion. Practical guidance: if you are running a gemma4-mtp or MTP-supporting llama.cpp build, switching to the Unsloth MTP GGUFs is expected to deliver a meaningful throughput improvement for coding and structured output workloads. Confidence: benchmark from community author on a single hardware configuration (RTX 5080 16 GB GDDR7). (source, benchmark, June 5-6, 2026)

PSA: Gemma 4 12B tool calling requires a custom Jinja chat template — LM Studio defaults load Qwen tokens and silently break reasoning. Two community posts document the same root cause for Gemma 4 12B misbehavior in coding and tool-call contexts: the default templates in LM Studio and many inference frontends load Qwen-format tokens rather than the Gemma 4 12B-specific template, causing the model to produce garbled or non-functional structured output. The first post provides the complete LM Studio fix: add `{%- set enable_thinking = true %}` to the template, set the start token to `thought` (not Qwen's equivalent), and use temperature 1.0 with top_p 0.95, top_k 64. A follow-up PSA confirms the fix and adds that the correct custom Jinja template is available on HuggingFace, with a link to the repository page. The practical implication is significant: Gemma 4 12B is not natively broken for coding or tool calling, as some early community impressions suggested — it requires correct template configuration that most default inference setups do not provide. This is the same category of failure documented for Gemma 4 31B in prior field notes, where the multi-turn agentic template mismatch caused thinking-tag malformation. Confidence: high — two independent posts converging on the same root cause and the same fix, consistent with the prior template-misconfiguration findings for other Gemma 4 model sizes. (source, PSA, June 5-6, 2026)

Vision advantage confirmed: Gemma 4 26B-A4B correctly extracts calendar events from images where Qwen 3.6 35B fails. A comparative vision test post compared Gemma 4 26B-A4B against Qwen 3.6 35B on a practical real-world vision task: extracting calendar events from a photo of a handwritten calendar page. Gemma 4 26B-A4B passed — it correctly identified all events, matched start times accurately, and produced clean structured output. Qwen 3.6 35B failed on the same image: the model misread one-hour events, reported wrong start times, and duplicated events in the extracted output. The author attributed the difference to Gemma 4's encoder-free vision architecture, which processes the full image as patches rather than compressing it through a separate vision encoder, preserving fine-grained spatial detail that a two-stage encoder-decoder pipeline can lose. Prior field notes have documented Gemma 4's vision advantage for tasks requiring precise spatial parsing (text in images, structured document extraction from May 23 and May 25 sweeps); this result adds calendar-image extraction as a confirmed higher-stakes data-extraction use case. Practical guidance: for applications that extract structured data from photos of documents, handwritten notes, or calendar views, Gemma 4 26B-A4B is the stronger local option over Qwen 3.6 35B as of this sweep. Confidence: single-task comparison by a single benchmark author, no multi-run statistical validation; architecturally consistent with prior field notes and Gemma 4's documented spatial parsing strengths. (source, June 5-6, 2026)

BeeLlama v0.3.2-Preview: KVarN KV cache compression arrives for Gemma 4 31B — 3-5x compression tested on RTX 3090. A community developer post introduces BeeLlama v0.3.2-Preview, a llama.cpp fork update that implements KVarN, making it the first llama.cpp-ecosystem build to ship Huawei's Key-Value cache compression technique. KVarN applies near-lossless quantization to the KV cache at inference time, achieving 3-5x compression ratios — substantially beyond the standard `-ctk q8_0` quantization that upstream llama.cpp supports. The author reports testing on an RTX 3090 24 GB running both Qwen 3.6 27B and Gemma 4 31B using `--cache-type-k kvarn4` as the activation flag. The practical implication is significant: a 24 GB GPU that previously ran Gemma 4 31B at 32k context could support roughly 96k-160k context under the same VRAM budget if 3-5x compression holds in practice. BeeLlama v0.3.1 (the base for v0.3.2) had earlier added MTP speculative decoding, Gemma 4 12B support, and multi-GPU DFlash — a speculative decoding approach designed for MoE models that avoids the routing overhead of dense-head MTP. Combined, the v0.3.1 base with v0.3.2's KVarN makes BeeLlama the most feature-rich community llama.cpp fork for Gemma 4 long-context and speculative-decoding workloads at this sweep date. Practical guidance: BeeLlama requires building from source. KVarN is a preview-quality feature; validate quality on your specific tasks before relying on it in production. Confidence: community developer posts; KVarN compression and quality claims are based on the author's own tests and require independent validation across model sizes and quant formats. (source, June 4-5, 2026)

Field Notes — 2026-06-05

A weekly synthesis of what the r/LocalLLaMA community is reporting about Gemma 4 in real use. Curated from the latest Gemma-mentioning posts (12 new/updated since 2026-06-04, 246 total) and their top comment threads. Confidence is medium unless noted, since this is community signal rather than a controlled benchmark.

June 5 sweep, 2026-06-05 00:00 EDT: one day after the Gemma 4 12B launch, the conversation shifted from "what is it" to "how do I actually run it." This cycle is an ecosystem-catch-up sweep: the first single-GPU speed report on a 16GB consumer card, the first per-tensor quantization-floor data, multiple runtimes (BeeLlama, mistral.rs, Hitoku) shipping same-week 12B support, the first wave of 12B finetunes, and a heads-up from the Gemma team that quantization-aware-trained (QAT) weights are imminent — alongside the launch's predictable rough edges in tool-calling, distribution, and a confusing llama.cpp reasoning-UI change. Every post in this sweep is a fresh launch-week thread (score ~20, no captured comment threads), so treat all numbers as first-look anecdotes rather than settled results.

The Gemma team has confirmed QAT weights are coming — it may be worth waiting before doing heavy quantization work on the 12B. A short but high-signal post (u/Aaaaaaaaaeeeee, source, Jun 4, score 20) points to a Hugging Face discussion comment from an account identified as "Omar from the Gemma team" indicating that Gemma 4 quantization is still being refined and that QAT (quantization-aware training) variants will release soon, with the suggestion to "hold off on testing quantization and wait for its refinements." Historically, Google's QAT GGUFs have delivered noticeably better quality at 4-bit than naive post-training quants, so this matters for anyone planning to run the 12B at Q4. The practical reading: if output quality is your priority, the official QAT weights may be the better starting point and could land within days; if you just want to experiment now, the community GGUFs below are fine. Confidence: medium — this is a second-hand pointer to a team member's discussion comment, not an official release or dated announcement. (anecdotal, Jun 4, score 20)

First single-GPU speed and quant-floor data converge: keep the 12B at Q4_K_M or above, and expect ~18 tok/s on a 16GB consumer card. Two independent launch-week reports give the first concrete numbers for running the dense 12B on one consumer GPU:

16GB AMD, Q8 + 8-bit KV cache, ~18 tok/s, flat to 23K context (single anecdote). A hands-on coding run (u/devildip, source, Jun 4, score 20) used the H-gemma-4-12B-heretic-Q8 GGUF on a Ryzen 9 9950X + AMD RX 6800 (16GB VRAM) via the Vulkan backend with 32GB system RAM, 8-bit KV cache (`--cache-type-k q8_0 --cache-type-v q8_0`). The model one-shot a 467-line retro game (45K tokens total across four turns). Reported generation speed stayed "rock solid" between 18.44 and 18.93 t/s and barely degraded as active context scaled to 23,125 tokens; prompt processing started at 228.79 t/s and scaled down to 157.72 t/s with depth, and llama-server's context checkpoints plus longest-common-prefix reuse hit 91.7% and 96.4% cache reuse on later turns. This is the most useful single-GPU datapoint in the sweep — it shows the Q8 12B is comfortably usable on a 16GB AMD card for multi-turn coding. Confidence: single anecdotal run, one task, fully specified config and therefore reproducible. (source, Jun 4, score 20, anecdotal)
The quantization floor is Q4_K_M (~8.0 GB on a 3090); Q3_K_M and below collapse. A careful per-tensor ablation (u/lit1337, source, Jun 4, score 20) converted the 12B to GGUF (noting llama.cpp's `Gemma4Model` already strips the `model.language_model.*` prefix but the `Gemma4UnifiedForConditionalGeneration` architecture name needs registering before the convert script accepts it) and measured a "quant floor": on an RTX 3090 with a Q4_K_M baseline (8.0 GB) the model produces coherent output at Q4_K_M and above, while Q3_K_M and below "collapse to repeated token garbage." The author is sweeping mixed per-tensor quantization (demoting/promoting each tensor to the level with the lowest measured perplexity on wiki.test.raw at ctx 2048). The takeaway aligns with the speed report: Q4_K_M (~8 GB) is the practical minimum and fits a 12GB card; Q8 (~13 GB) fits a 16GB card. Confidence: methodical single-author measurement, work-in-progress, on one GPU. (source, Jun 4, score 20, anecdotal)

Runtimes shipped 12B support within the week — including multimodal paths llama.cpp still lacked at launch. The inference ecosystem caught up quickly:

BeeLlama v0.3.0/0.3.1 (u/Anbeeld, source, Jun 4, score 20) rebased the fork onto a much newer llama.cpp to integrate MTP and Gemma 4 12B support, with VRAM optimizations and the unified llama app. Notably, DFlash speculative decoding now works across multiple concurrent slots and on multi-GPU setups, with prebuilt binaries and Docker images for CUDA, Metal, and Vulkan. The headline "Single RTX 3090: Qwen 3.6 27B & Gemma 4 31B up to 177.8 tps (4.93x over baseline)" carries over the prior BeeLlama DFlash result for the 27B/31B models — it is not a new 12B measurement, so don't read it as a 12B speed claim. Confidence: maintainer release announcement; speedups self-reported. (source, Jun 4, score 20, anecdotal)
mistral.rs added Gemma 4 12B with full multimodal and agentic support (u/EricBuehler, the project's author, source, Jun 4, score 20): one-line install, an OpenAI- and Anthropic-compatible server with a built-in web UI, sandboxed code execution and web search for agentic apps, audio/image/video input, and MTP via an assistant model (`mistralrs run --agent -m google/gemma-4-12B-it --quant 4 --mtp-model google/gemma-4-12B-it-assistant`). This is significant because it offers the multimodal 12B path that llama.cpp did not have wired up on launch day. Confidence: maintainer announcement; capabilities self-reported. (source, Jun 4, score 20, anecdotal)
Hitoku Draft, an open-source local voice-first assistant (u/Saladino93, source, Jun 4, score 20), now lists Gemma 4 among its supported text-generation models (alongside Qwen 3.5), reading the screen/documents/active app for context with local STT backends. A consumer-app datapoint rather than a hardware one. (anecdotal, Jun 4, score 20)

The first 12B finetunes are out. A roundup (u/jacek2023, source, Jun 4, score 20) collected the first community finetunes — heretic and abliterated/uncensored GGUFs (igorls' gemma-4-12B-it-heretic, ReadyArt's Melody1437-12B, DuoNeural's Gemma4-12B-IT-Abliterated, OpenYourMind's gemma-4-12B-it-abliterated). The coding report above (1twelo6) used one of these heretic Q8 builds successfully after hitting refusals on the official 8-bit model, so the abliterated variants are already seeing real use. Confidence: medium — links to published HF repos; quality not independently evaluated. (source, Jun 4, score 20, anecdotal)

Known launch-week friction. Several rough edges surfaced as people put the 12B into real workflows:

Tool-calling on the 12B is still shaky in at least one agent harness. A coding-agent test (u/boutell, source, Jun 4, score 20) ran the 12B at 8-bit in opencode and found it grasped the task but "burned far too much time" failing basic tool calls — repeatedly failing to specify a `pattern` argument to a `grep` tool, getting the call rejected, until the author interrupted it. This is a direct counterpoint to the 06-04 sweep's successful tool-use run on a 4080 Super, which suggests tool reliability is highly sensitive to the harness and prompt template rather than a fixed model property. Open question: which harness/config gives reliable Gemma 4 12B tool calls. Confidence: single anecdotal failure on one harness. (source, Jun 4, score 20, anecdotal)
A confusing llama.cpp "skipped reasoning" report turned out to be a UI change, not a regression. After updating llama.cpp for the 12B-unified merge, a user running Gemma 4 31B in a Vulkan Docker build (u/Jorlen, source, Jun 4, score 20) saw the reasoning phase apparently disappear and rolled back to a May build (b9445) to recover it. The resolution (credited to u/nickm_27): recent llama.cpp web UI builds added a "thinking" drop-down behind a light-bulb icon in the chat interface, so reasoning output is now gated by a new toggle rather than always shown. Worth knowing before assuming a new build broke reasoning. Confidence: resolved single report. (source, Jun 4, score 20, anecdotal)
Ollama's 12B variants were temporarily macOS-gated; the generic Hugging Face GGUF runs anywhere. A user on an AMD 16GB GPU (u/x6q5g3o7, source, Jun 4, score 20) reported that every `gemma4:12b` variant pulled from Ollama returned a "this model requires macOS" error, while the generic `gemma-4-12B-it` model on Hugging Face should run on any hardware. This is a distribution-lag issue (Ollama posting platform-specific builds before the universal one), not a model limitation — if you hit it, pull the GGUF directly. Confidence: single anecdotal report of a transient packaging state. (source, Jun 4, score 20, anecdotal)

Open questions. Two discussion threads frame where the 12B sits, without resolving it. One asks how the older GPT-OSS-120B now compares to newer open-weights including Gemma 4 27B-A4B for tool-calling, summarization, and coding (u/purealgo, source, Jun 4, score 20) — no answers captured, but it signals continued appetite for head-to-head agentic comparisons across the open tier. Another is pure speculation about whether vendors could build a single ~30-32B dense model as strong at coding as Qwen3.6 27B and at languages as Gemma 4 12B (u/Hot_Example_4456, source, Jun 4, score 20). Confidence: low — these are open questions and opinion, not findings, included only to capture the direction of community interest. (anecdotal, Jun 4, score 20)

Field Notes — 2026-06-04

A weekly synthesis of what the r/LocalLLaMA community is reporting about Gemma 4 in real use. Curated from the latest Gemma-mentioning posts (12 new/updated since 2026-06-03, 234 total) and their top comment threads. Confidence is medium unless noted, since this is community signal rather than a controlled benchmark.

June 4 sweep, 2026-06-04 00:00 EDT: this cycle is dominated by a single event — Google shipped Gemma 4 12B, a new size tier that lands between the E4B edge models and the 26B-A4B MoE. The community spent launch day characterizing it: the official model card confirms a unified, encoder-free multimodal architecture with a 256K context window and audio support, llama.cpp merged a "Gemma 4 Unified" model type to launch with same-day support, and the first hands-on reports place 12B as the natural pick for a 16GB laptop — close to 26B-A4B quality on roughly half the VRAM, though the 26B MoE still wins head-to-head on both quality and speed. The sweep also surfaces the launch's rough edges (llama.cpp multimodal not yet wired up, weights silently re-uploaded hours after release) and a healthy skeptical counterpoint that Qwen3.5-9B still beats 12B gigabyte-for-gigabyte on paper benchmarks. Note that every post in this sweep is a fresh launch-day thread (score ~20, no captured comment threads), so treat all numbers as first-look anecdotes rather than settled results.

Gemma 4 12B is released: a new unified, encoder-free multimodal tier between E4B and 26B-A4B. The official Hugging Face model card (u/jacek2023, source, Jun 3, score 20) documents the headline facts: Gemma 4 is a family of open-weight models from Google DeepMind, multimodal across text and image input (all sizes), with audio supported natively on the E2B, E4B, and 12B models. The card lists a context window of up to 256K tokens, multilingual coverage in over 140 languages, both pre-trained and instruction-tuned variants, and a mix of Dense and Mixture-of-Experts architectures. The family now spans five sizes — E2B, E4B, 12B, 26B-A4B, and 31B — explicitly positioned for hardware "ranging from high-end phones to laptops and servers." All models are described as configurable reasoners with thinking modes, variable image aspect-ratio and resolution support, and video plus audio handling on the E-class and 12B tiers. The practical significance: the 12B fills the long-standing gap between the sub-5B edge models and the 24GB-class 26B-A4B, giving 12-16GB GPU and laptop owners a dense Gemma 4 option sized for their hardware. Confidence: high for the release facts (official model card); real-world quality still being characterized. (source, Jun 3, score 20)

The 12B uses a "transformer-less vision tower" — an encoder-free multimodal design that ships with same-day llama.cpp support. Two threads document the architecture. A community post (u/johnnyApplePRNG, source, Jun 3, score 20) frames the 12B as "a unified, encoder-free multimodal model." A second post (u/eapache, source, Jun 3, score 20) traced the implementation to a just-merged llama.cpp PR (#24077) adding a new "Gemma 4 Unified" model type, with a code comment describing "a transformer-less vision tower" where some params "are redundant but set to avoid error." The poster's read — that the llama.cpp maintainers were given early access so the model could launch with day-one backend support — is consistent with the model card appearing the same day. For practitioners, the encoder-free unified design is the architectural story to watch: rather than a separate vision encoder bolted onto a language model, the modality handling is folded into the model itself. This has a direct downstream consequence covered below (you cannot simply strip the audio path to shrink the model). Confidence: medium — architecture details are inferred from the model card and llama.cpp source comments, not a Google technical report. (anecdotal, Jun 3, score 20)

12B versus 26B-A4B on one RTX 4090: the 26B MoE wins quality and speed, but 12B is the 16GB-laptop pick at roughly half the VRAM. A launch-day head-to-head (u/gladkos, source, Jun 3, score 20) ran both models on a single RTX 4090 with an identical prompt: write a self-contained, single-file HTML5 canvas physics animation with no libraries (a Galton board, two blocks colliding off a wall, and a chaotic triple pendulum). Measured results — Gemma 4 26B-A4B: 15 GB VRAM, 6.9k tokens, 138 tok/s; Gemma 4 12B: 9 GB VRAM, 8.9k tokens, 80 tok/s. The author reports the 26B-A4B "won every scene and ran ~1.7x faster — on just 4B active params," while the 12B "stayed very close though, on almost half the VRAM — which makes it the ideal model for a 16 GB laptop." This is the most useful early datapoint for buyers: if you already have a 24GB card, the 26B-A4B MoE remains the better and faster choice; the 12B's value is specifically that it fits comfortably (~9GB) on 12-16GB GPUs where the 26B would otherwise need CPU offload. Confidence: single anecdotal test, one prompt, one GPU; directionally consistent with the broader pattern that the 26B-A4B MoE is the 24GB sweet spot. (source, Jun 3, score 20, anecdotal)

12B passes a first coding-agent tool-use test on a 4080 Super, zero bugs — with a fully documented llama.cpp config. A hands-on report (u/Wrong_Mushroom_7350, source, Jun 3, score 20) ran Gemma 4 12B inside VSCodium with a local agent extension and gave it an end-to-end task: write a Python script that reads logs line-by-line, extracts error modules, and dumps the counts to JSON — then generate its own mock log data and verify the result in a live terminal. The model reportedly completed the full agentic loop on the first try — created the script, populated a dummy `app.log`, opened a shell to run it, and verified the output "with zero bugs or path errors." The exact config is worth recording because it is reproducible: Gemma 4 12B (Unsloth UD-Q4_K_XL), 32K context (`--ctx-size 32768`), 8-bit KV cache (`--cache-type-k q8_0 --cache-type-v q8_0`), full GPU offload (`-1`), flash attention on, samplers `--temp 1.0 --top-p 0.95 --top-k 64 --min-p 0.05 --repeat-penalty 1.15`, on llama.cpp + CUDA. This is an encouraging early tool-calling signal for the 12B given that E-class Gemma 4 tool reliability has been a recurring weak point in prior sweeps — but it is a single successful run on one well-scoped task, not a reliability benchmark. Confidence: single anecdotal success; config is fully specified and reproducible. (source, Jun 3, score 20, anecdotal)

Skeptical counterpoint: on paper, Qwen3.5-9B beats gemma-4-12b-it in 5 of 8 benchmarks gigabyte-for-gigabyte. Not everyone is convinced. A critical post (u/fulgencio_batista, source, Jun 3, score 20) compiled the published benchmark numbers from the two models' official Hugging Face cards and argued that Qwen3.5-9B is the overall winner — ahead of gemma-4-12b-it in 5 of 8 listed benchmarks despite a smaller footprint and lighter KV cache. The author allows that "gemma-4-12b-it might be a slight better coder than Qwen3.5-9b" but points to coding-specialized Qwen finetunes as an alternative. Two important caveats temper this: the numbers are self-reported model-card figures (not an independent run), and the comparison table was "formatted into a table with ChatGPT," so transcription is unverified. The disagreement is the useful signal here — the launch-day enthusiasm (12B as the new laptop default) and the benchmark skepticism (Qwen still wins per-GB on paper) are both live community positions, and neither has been settled by an independent head-to-head yet. The practical reading: the 12B's draw is its multimodality, 256K context, and Gemma's conversational/creative character, not a claim of best-in-class benchmark scores at its size. Confidence: benchmark figures are from official model cards, not independently reproduced; treat as directional. (source, Jun 3, score 20, anecdotal)

Known launch-day limits. Several rough edges surfaced within hours of release and are worth knowing before you commit time to the 12B:

llama.cpp multimodal is not yet wired up for the 12B. A user on llama.cpp release b9494 (u/No-Leave-4512, source, Jun 3, score 20) reports that `llama-cli` shows "modalities: text" only and crashes when an image is added. So despite the multimodal weights, image and audio input are effectively text-only through llama.cpp at this build; multimodal use likely needs the reference stack or a newer build. (anecdotal)
You cannot strip the audio component to make a smaller text+vision model. A user asked whether the audio path could be removed to save RAM and yield roughly an "11B" text+vision variant (u/WhiskyAKM, source, Jun 3, score 20); the poster's own edit concludes (crediting u/slalomz) that the unified, encoder-free architecture does not allow it — the modalities are integrated, not modular. (anecdotal)
The 12B weights were silently re-uploaded hours after launch. A maintainer of community quants (u/stduhpf, source, Jun 3, score 20) noticed the full `gemma-4-12B` Hugging Face repos — including the model weights — were "updated" a couple of hours after release, with no changelog, raising the open question of whether existing GGUF quants need to be regenerated. If you pulled a 12B quant on launch day, re-check the source repo's commit before trusting it. (anecdotal)
The 26B-A4B failed a strict zero-shot spatial-reasoning test. A custom Sokoban (box-pushing) benchmark with strict formatting and no chain-of-thought allowed (u/Disastrous_Food_2428, source, Jun 3, score 20) placed Gemma 4 26B-A4B in the failing group (illegal moves, deadlocks, or formatting collapse), alongside several other open models; only ChatGPT, Qwen3.7-max, and Gemini 3.5-thinking passed. This is a single-author, narrow, adversarial test designed to defeat guessing, not a general capability verdict — but it is a reminder that strict zero-shot spatial planning remains hard for the open Gemma 4 tier. (anecdotal)

Open question: more Gemma 4 sizes may be coming, and the community wants a large variant. Two threads point at the roadmap. One post (u/Deep-Vermicelli-4591, source, Jun 3, score 20) links a teaser and speculates a 120B model is next; another is an explicit community campaign asking Google for a "Gemma 4 124B" via the Hugging Face discussions (u/seamonn, source, Jun 3, score 20), arguing Gemma 4 is "good, great even" but missing a flagship-size tier. Both are unconfirmed signal, not announcements — but the appetite for a larger Gemma 4 is a consistent community theme. Confidence: low — speculation and a request thread, no official confirmation. (anecdotal)

Field Notes — 2026-06-03

A weekly synthesis of what the r/LocalLLaMA community is reporting about Gemma 4 in real use. Curated from the latest Gemma-mentioning posts (4 new/updated since 2026-06-02, 222 total) and their top comment threads. Confidence is medium unless noted, since this is community signal rather than a controlled benchmark.

June 3 sweep, 2026-06-03 00:00 EDT: four posts from the June 2 window surface two directly actionable hardware findings and two qualitative impressions. The clearest hardware signal: Gemma 4 E4B achieves 2.4x faster text generation under Google's LiteRT runtime compared to llama.cpp GGUF — 157 tok/s versus 66 tok/s — confirmed across multiple prompt types; image captioning gains only 1.1x, suggesting LiteRT's advantage is specifically in the multi-token prediction path for text. The edge hardware finding: AMD 780M integrated GPU (found in mainstream ThinkPad business laptops) is too slow for practical E4B inference — the model loads but generation speed is unsatisfying for interactive use, pointing users toward cloud offload or a future discrete GPU upgrade. The qualitative impression from a creative writing practitioner confirms the community pattern at long context: Gemma 4 31B Q4 is competitive with GPT-4.5 for prose but falls noticeably short of Gemini 2.5 Pro on multi-chapter detail retention — a reasonable expectation given the quantization and parameter count gap.

LiteRT delivers 2.4x text generation speedup for Gemma 4 E4B versus llama.cpp GGUF; image captioning gain is only 1.1x. A practitioner post (u/AnticitizenPrime, source, Jun 2, score 20) measured Gemma 4 E4B under two inference paths: Google's LiteRT runtime (with MTP speculative decoding enabled) versus the Unsloth/AtomicChat Q4M GGUF under llama.cpp. The text generation benchmark ran three prompt types — Transfer learning, Transformer architecture, ML paradigms — with identical prompts and measured output speed. LiteRT results: 160.6, 148.2, and 162.7 tok/s respectively; llama.cpp GGUF: 66.3, 65.9, and 66.8 tok/s. Average speedup: 2.4x. The image captioning comparison (111 images, full resolution) showed only a 1.1x speedup: 0.65s per image (LiteRT) versus 0.72s (GGUF). The author attributes the text generation gap to MTP speculative decoding, which generates and verifies multiple tokens ahead in a single forward pass — a path that llama.cpp has not yet implemented for the E-class models at time of writing. For developers who use E4B for auxiliary roles (vision labeling, summarization, quick classification) in an agentic harness on a mid-range NVIDIA card: the LiteRT path offers a practical 2.4x throughput improvement for text tasks with no quality regression, at the cost of a more complex setup. Note that the LiteRT server is a Python wrapper around the runtime, not a stable upstream project. Confidence: single anecdotal benchmark, one hardware configuration (4060ti 16GB); the 2.4x figure is consistent with the MTP throughput improvement reported in other E-class model discussions. (source, Jun 2, score 20, anecdotal)

AMD 780M iGPU: Gemma 4 E4B runs but is "too slow" for practical interactive use on a 32GB ThinkPad. A hardware question thread (u/danihend, source, Jun 2, score 20) from a Lenovo ThinkPad T14 Gen 5 user (Ryzen 7 Pro CPU, 32GB RAM, AMD 780M integrated GPU) reports that Gemma 4 E2B and E4B load successfully but generation speed is too low to be usable for interactive terminal work, wiki editing, and basic automation tasks. The post is a request for alternatives, not a structured benchmark. The AMD 780M has 8 dedicated compute units (gfx1103) in its integrated graphics slice, using shared system memory; peak bandwidth is a fraction of what discrete GPUs provide. For comparison, community reports from MacBook Air M3 (a comparable RAM-bandwidth platform) show E4B at approximately 20-30 tok/s, which some users find marginal for interactive use. The ThinkPad AMD iGPU likely delivers similar or lower throughput. Practical guidance: for Gemma 4 E-class inference on power-constrained AMD laptops without discrete GPUs, the usable options are cloud offload to a Gemma API, using a smaller model (E2B at Q4 may be fast enough for non-realtime tasks), or accepting CPU-only inference. A Ryzen AI Pro with NPU — available in newer Pro 400 series ThinkPads — is a meaningful future upgrade if local inference speed matters for your workflow. Confidence: qualitative self-report, no throughput numbers provided. (anecdotal)

Qualitative verdict from a creative writing practitioner: Gemma 4 31B Q4 competes with GPT-4.5 but falls short of Gemini 2.5 Pro on long-context prose. A community thread (u/opoot\_, source, Jun 2, score 20) asked for subjective "feel" comparisons for Gemma 4 31B, 26B-A4B, and Qwen 3.6, specifically for creative writing rather than coding benchmarks. The original poster's own assessment of Gemma 4 31B Q4: better than GPT-4.5 for prose style and voice, but "still falls short of 2.5 pro" — specifically on long-context detail retention (misremembers minor details across extended sessions). This aligns with expected quantization behavior: Q4 compresses the model's effective parameter budget, which tends to surface as context-window slippage before quality degradation in individual sentences. The Gemini 2.5 Pro comparison is a high bar — Gemini 2.5 Pro is a full-precision frontier model. No comments were captured from this thread. Treat this as a single practitioner's impression for creative writing use cases, not a structured benchmark. Confidence: single qualitative self-report, no structured evaluation. (anecdotal)

Field Notes — 2026-06-02

A weekly synthesis of what the r/LocalLLaMA community is reporting about Gemma 4 in real use. Curated from the latest Gemma-mentioning posts (5 new/updated since 2026-06-01, 218 total) and their top comment threads. Confidence is medium unless noted, since this is community signal rather than a controlled benchmark.

June 2 sweep, 2026-06-02 00:00 EDT: five posts from the June 1 window surface new signal across three themes: mistral.rs v0.8.2 claims Gemma 4-specific CUDA throughput improvements outperforming llama.cpp on server-class hardware; LiteRT continues to emerge as a practical path for Gemma 4 E-class edge deployment, with a developer reporting a working OpenAI-compatible LiteRT wrapper with MTP and audio modality; and mobile inference gets a systematic write-up comparing Gemma 4 E4B across MLX, GGUF, and LiteRT on iOS and Android, confirming RAM bandwidth as the universal bottleneck. Two additional findings round out the sweep: Gemma 4 31B solves a hard digit-sum math problem correctly at the cost of 500 seconds of reasoning time, while the 26B MoE variant surfaces a tool-call schema conflict against non-Google harnesses; and a sycophancy benchmark places Gemma 4 at or above all other tested open models.

mistral.rs v0.8.2 claims Gemma 4 CUDA throughput advantage over llama.cpp on GB10, H100, and B200. A post by the mistral.rs maintainer (u/EricBuehler, source) describes v0.8.2 as a CUDA-throughput-focused release, with benchmarks showing mistral.rs faster than llama.cpp at "every point" in a sweep across GB10/H100/B200 hardware, for both Gemma 4 Dense and MoE variants. The headline figure is up to 2.8x faster CUDA inference. The improvement is claimed to hold across both eQ8_0 and Q4K quantization types. Full reproduction instructions are published at the project's release report. Practical relevance is primarily for enterprise and cloud deployments — GB10 and B200 are data center hardware. For consumer-class NVIDIA users (RTX 3090, 4090, etc.), the comparison is not directly applicable, though mistral.rs supports consumer hardware too and offers an OpenAI-compatible server mode (`mistralrs serve --agent -m google/gemma-4-E4B-it --quant 4`). Important caveats: the benchmark is from the project maintainer, not an independent third party, and no community reproductions appear in the posts captured at this sweep. Confidence: developer self-report, not independently reproduced; directional signal for alternative CUDA runtimes. (Jun 1, score 20)

LiteRT wraps Gemma 4 E2B and E4B in a local OpenAI-compatible endpoint with MTP and audio modality. A developer post (u/AnticitizenPrime, source) describes running Gemma 4 E2B and E4B via Google's LiteRT runtime, wrapped in an OpenAI-compatible HTTP server on a 4060ti 16GB, as a drop-in replacement for OpenRouter API access to E-class models. The motivation is bypassing rate limits on Google's hosted API when using E-class models for auxiliary roles in an agentic harness (vision, summarization, quick classification). LiteRT delivers throughput the author describes as "blistering speed" compared to llama.cpp on the same hardware, while also enabling audio modality processing and MTP speculative decoding — capabilities not yet merged into mainstream llama.cpp for E-class models. This extends the LiteRT signal from the previous week's mobile reports into the desktop/workstation tier: Google's own inference runtime consistently outperforms general backends on its own models, and the 4060ti (16GB VRAM) is a mainstream mid-range card rather than a specialized rig. The setup is explicitly described as "work-in-progress" and vibe-coded. Practical note: LiteRT installation on non-Android desktop is less documented than llama.cpp, but the OpenAI server wrapper makes it drop-compatible with existing tooling. Confidence: single anecdotal developer report; no independent reproduction. (Jun 1, score 20, anecdotal)

iOS and Android Gemma 4 E4B: RAM bandwidth is the universal bottleneck; MLX gives 10-20% more speed on iPhone. A structured mobile inference writeup (u/MrAHMED42069, source) tested Gemma 4 E4B alongside Qwen 3.5 4B and 9B across MLX (4-bit), GGUF (Q4_K_M), and LiteRT quantizations on an iPhone 15 Pro Max (51 GB/s RAM bandwidth) and a mid-range Android device. Key findings: the GPU on 8GB iOS and Android phones is limited to approximately 4.5GB of RAM for inference — larger models (including 9B-class) fall back entirely to CPU. On 8GB devices, the CPU can access up to 6GB for inference. For Gemma 4 E4B at Q4 quants: MLX delivers 10-20% higher throughput than GGUF on iOS; LiteRT is a third option but throughput comparison was not the post's focus. NPU support is not yet functional for these models on the tested hardware. The bottleneck across all configurations is RAM bandwidth — neither higher-precision quants nor alternative runtimes escape this ceiling. On older MediaTek Android devices, Vulkan driver overhead is significant enough to reduce GPU speed gains compared to iOS or Snapdragon. Gemma 4 E4B is the practical Gemma ceiling for 8GB mobile devices; E2B gives headroom for higher quants or longer context. Confidence: single-author systematic test, specific hardware (iPhone 15 Pro Max, unspecified MediaTek Android); no independent validation. (Jun 1, score 20, anecdotal)

Gemma 4 31B solves digit-sum math correctly in 500 seconds; 26B MoE shows tool-call schema conflict against non-Google harnesses. A practitioner report (u/SummarizedAnu, source) ran a hard no-code reasoning prompt — compute the sum of the decimal digits of 2^100 showing all steps — against Gemma 4 31B and Qwopus 9B (a community finetune). Gemma 4 31B produced the correct answer at approximately 500 seconds; Qwopus 9B completed the task in ~200 seconds. The 2.5x time difference reflects both model size and 31B's thorough step-by-step reasoning on math tasks. A second finding: the same author's Gemma 4 26B in an agent context attempted to call `google-search` — a tool from its training data — rather than the harness-provided `searxng` tool. The author specifies using Google's API endpoint rather than a local quantization, suggesting this is a model behavior issue rather than a quantization artifact. This tool-call schema conflict is consistent with the E-class failures documented in prior sweeps and now confirmed at the 26B tier via the cloud API. For harness developers: always provide explicit tool schemas and test that the model uses only those tools before deploying 26B-class Gemma models in production agentic pipelines. Confidence: anecdotal single-test for both findings; tool-call schema conflict pattern is consistent with prior community reports. (Jun 1, score 20, anecdotal)

Sycophancy benchmark: Gemma 4 ties for best among tested open models at 50% accuracy. A community benchmark (u/JLeonsarmiento, source) formatted 10 viral social-media posts exhibiting sycophantic content as single-turn multiple-selection prompts and ran multiple open-source LLMs through them. The scoring premise: humans should score above 50%; all tested LLMs peaked at 50%. Gemma 4 and Pepe-32 (a Reddit-data finetune) both achieved the 50% ceiling — better than all other tested models. The benchmark is designed to probe whether models agree with posts claiming things that are false, misleading, or poorly reasoned, rather than push back appropriately. The result suggests Gemma 4 has a relative advantage on this behavioral dimension compared to other tested open models. Practical significance is limited: this is a 10-prompt test with a novel format, no independent methodology review, and the comparison model set is not specified. Treat as a directional signal on sycophancy resistance, not a definitive evaluation. Confidence: single-author 10-prompt test; comparison set not specified in post excerpt. (Jun 1, score 20, anecdotal)

Field Notes — 2026-06-01

A weekly synthesis of what the r/LocalLLaMA community is reporting about Gemma 4 in real use. Curated from the latest Gemma-mentioning posts (6 new/updated since 2026-05-31, 213 total) and their top comment threads. Confidence is medium unless noted, since this is community signal rather than a controlled benchmark.

June 1 sweep, 2026-06-01 00:00 EDT: six posts surface new hardware data and practical limits: an AMD 7900 XTX head-to-head confirms Gemma 4 26B-A4B wins overall wall-clock time against Qwen 3.6 35B-A3B by ~20% despite generating half the tokens; a mixed GPU+CPU offload report establishes that Gemma 4 26B-A4B Q5_K_XL runs at ~20 tok/s on an 8GB AMD card with system RAM offload; two independent threads confirm tool calling reliability is a genuine gap for E2B and E4B in agentic harnesses; a structured benchmark of 13 abliterated E2B variants on an RTX 5090 maps the capability-vs-safety tradeoff; and a beginner's GPU buying question surfaces the community's consensus on 12GB as the minimum usable VRAM floor for Gemma 4 text inference.

AMD 7900 XTX: Gemma 4 26B-A4B wins wall-clock by ~20% against Qwen 3.6 35B-A3B despite lower token throughput. A detailed hardware comparison post (u/IvGranite, source) ran a six-prompt real-world evaluation across both models on a Ryzen 9600X + Sapphire NITRO+ Radeon 7900 XTX 24GB, 96GB DDR5-6800, ROCm 7.2.3, HIP gfx1100, llama.cpp build 9425. Quants: Qwen 3.6 35B-A3B IQ4_XS+Q8nextn hybrid MTP (~20GB), draft-n-max 3; Gemma 4 26B-A4B UD-Q4_K_XL (~17GB), no MTP. Throughput: Qwen 130 tok/s vs Gemma 78 tok/s generation. The key finding is counter-intuitive: Qwen generated 14,811 total tokens across all six prompts versus Gemma's 7,386 — approximately 2x more — because Qwen spends 74% of output on internal thinking versus Gemma's 57%, producing verbose reasoning for each prompt. Net wall-clock across all six workloads: Qwen 118.8 seconds vs Gemma 95.6 seconds, giving Gemma a 20% wall-clock advantage. Per-task results favor Gemma on meeting notes (12.2 vs 10.8s), incident postmortem (28.2 vs 21.6s), and log triage to JSON — Qwen wins only on the build-vs-buy analysis prompt where its extended reasoning may have added value. Practical guidance for AMD 7900 XTX users: if your primary workloads are short-to-medium chat, creative writing, or structured extraction where output quality matters more than reasoning depth, Gemma 4 26B-A4B at Q4_K_XL is the faster end-to-end choice. Qwen 3.6 35B-A3B's MTP throughput advantage is absorbed by its verbose reasoning output. Confidence: single-author benchmark, well-described methodology, six realistic workloads; result is consistent with prior community reports on token-efficiency differences. (anecdotal)

Mixed CPU+GPU offload: Gemma 4 26B-A4B Q5_K_XL (~21GB) at ~20 tok/s on RX6600XT 8GB + 32GB DDR4 system RAM. A practitioner report (u/Mrinohk, source) describes running Gemma 4 26B-A4B Q5_K_XL for a personal project management and smarthome agent on an RX6600XT 8GB + Ryzen 7 5700X, 32GB DDR4-3200. The model file is ~21GB, substantially exceeding the card's 8GB VRAM, so most weights offload to system RAM. Decode speed: approximately 20 tok/s, prefill ~235 tok/s. The command-line configuration includes ngram speculative decoding (`--spec-type ngram-mod`), flash attention, 40k context, and 192-token KV cache slots. The author acknowledges the configuration is partly intuition-based. This is a practical lower bound: 8GB of VRAM accelerates the layers that fit, while the remaining layers run on CPU, producing a workable but not fast inference experience. For context, the same model on 24GB cards runs 70-90 tok/s generation without offload. The ~20 tok/s figure establishes that the mixed-offload path is viable for low-throughput agent workloads (agents rarely need fast streaming), but not suitable for interactive conversation or high-context agentic coding. Confidence: single anecdotal self-report; no independent validation of the command-line configuration. (anecdotal)

E2B and E4B tool calling reliability gap confirmed by two independent reports — fine-tuning on harness tools proposed as a fix. Two posts this cycle independently surface the same problem: Gemma 4's small E-class models (E2B and E4B) struggle with tool calling reliability under agentic harnesses. A practitioner running E4B at Q8_0 on llama.cpp with a custom Jinja template (u/BitGreen1270, source) reports that tool calling performance is "not the best" for calendar, messaging, and scheduling tasks. A second report (u/AnticitizenPrime, source) describes a more specific failure mode: when Gemma 4 is plugged into Hermes Agent, it ignores the harness's provided tool schemas and attempts to call tools it was trained on (such as `google-search`) instead of the harness-provided web-search tool. The author proposes fine-tuning the model specifically on the target harness's tool call format as a potential fix. This failure mode is distinct from the 31B-class tool call bugs documented in prior sweeps: E-class models appear to have a more fundamental difficulty resolving tool call schema conflicts between training and runtime. For E2B and E4B practitioners: use the custom Jinja chat template circulating in the community (referenced in the post), keep tool schemas minimal, and treat E-class tool calling as best-effort rather than production-reliable until a fine-tuned variant specifically targeting harness compatibility is available. Confidence: two independent anecdotal reports; consistent with prior community reports on E-class agentic limitations.

13 abliterated Gemma 4 E2B variants benchmarked on RTX 5090 — coder3101 variant leads with 96% ASR and full capability preserved. A structured evaluation (u/nathandreamfast, source) tested 13 abliterated Gemma 4 E2B variants across weight analysis, KL divergence, HarmBench (400 prompts, full LLM review of 5,600 responses), and 8 benchmark tasks via lm-eval on native BF16. Hardware: RTX 5090, 44 GPU hours total. Key findings: all 13 variants successfully lift HarmBench ASR from the base model's 32.2% to between 82% and 100%. The leading variant (coder3101, using the Heretic tool) achieves 96% ASR with full capability preserved — it actually improves math benchmark scores versus the base model. The second-best for capability preservation (treadon) hits 100% ASR but loses 3 points on GSM8K. The benchmark validates the community rule of thumb: "most 'capabilities preserved' claims on model cards don't hold up" and structured evaluation is required to distinguish variants that genuinely preserve quality from those that claim to. For users who need abliterated E2B for creative or uncensored use cases: coder3101 is the current best-evidenced choice; check the full report at HuggingFace/DreamFast/Gemma4-e2b-abliterlitics for the complete capability and safety tradeoff table across all 13 variants. Confidence: structured benchmark with explicit methodology; single author, hardware documented. (community benchmark, well-evidenced)

Community confirms 12GB VRAM as the practical minimum floor for Gemma 4 text inference. A buying-advice thread (u/Bharat01123, source) comparing an RTX 2060 12GB at ~$260 versus an RTX 3060 12GB at roughly double the price surfaces the community's practical guidance: 12GB of VRAM is the usable minimum for running Gemma 4 26B-A4B at Q4_K_M with some CPU offload, or the E4B/E2B class fully on-GPU. The Gemma 4 26B-A4B MoE architecture fits in approximately 15-17GB at Q4 quants, meaning 12GB cards will offload some layers to system RAM and see reduced generation speed (typically 20-40 tok/s depending on CPU and RAM speed), compared to 70-90 tok/s on a 24GB card. For text-only inference as described in the post, the RTX 2060 12GB is viable; the RTX 3060 12GB offers no VRAM headroom advantage at this size tier but may offer modestly better bandwidth and compute. An RTX 4070 (12GB) would match VRAM but significantly outperform both on compute efficiency and power draw. Confidence: community consensus confirmed across multiple prior threads; specific numbers are representative estimates, not benchmarks. (community consensus)

Field Notes — 2026-05-31

A weekly synthesis of what the r/LocalLLaMA community is reporting about Gemma 4 in real use. Curated from the latest Gemma-mentioning posts (2 new/updated since 2026-05-30, 235 total) and their top comment threads. Confidence is medium unless noted, since this is community signal rather than a controlled benchmark.

May 31 sweep, 2026-05-31 00:00 EDT: two posts add practical hardware signal: a Q8 quantization access question confirms the community-wide expectation that single-consumer-GPU users are limited to Q4_K_M for Gemma 4 31B (Q8 requires 31+ GB VRAM, not listed on Ollama by default); and a new MTP benchmark on an RTX 6000 PRO reports 3.34x faster inference for Gemma 4 31B on llama.cpp — consistent with the prior dense-model MTP pattern. Two posts from the May 29-30 window not yet covered in field notes are also worth noting: an M5 Pro practitioner confirms Gemma 4 26B-A4B runs "blazingly fast" at that chip tier with strong generalist quality, and the community broadly acknowledges Gemma 4 31B as among the leading western open-weight models in the 27-35B size class.

Gemma 4 26B-A4B on M5 Pro: "blazingly fast" generalist, slight Qwen 3.6 coding lead but Gemma wins non-coding tasks. A practitioner post (u/goldcakes) describes running Gemma 4 26B-A4B on an Apple M5 Pro as "blazingly fast" — notably the M5 Pro has substantially lower memory bandwidth than the M5 Max or M3 Ultra configurations that dominate most Apple Silicon benchmark reports. The author tested across creative writing, debugging, coding, conversation, and vision tasks, comparing directly against Qwen 3.6 35B-A3B on the same machine. Result: Qwen 3.6 has a "slight lead" on coding, but Gemma 4 26B-A4B is "noticeably" better on non-coding tasks and "generally feels a bit more 'robotic' to chat to" for Qwen. The web-search-tool pattern is highlighted: pairing Gemma 4 26B-A4B with a search API "really sings as an everyday local LLM." This is consistent with the broader community pattern: Gemma 4 tends to win on personality, general instruction following, and multimodal tasks while Qwen 3.6 leads on structured tool calls and extended coding sessions. The M5 Pro finding is practically useful because it establishes that the 26B MoE variant is viable without the higher-bandwidth M5 Max/Ultra hardware. Confidence: single anecdotal report; no throughput figures provided. (source, May 29, anecdotal)

MTP 3.34x speedup for Gemma 4 31B on RTX 6000 PRO — llama.cpp and vLLM tested side by side. A community post (u/FantasticNature7590) reports benchmarking Multi-Token Prediction for Gemma 4 31B (GGUF) on an RTX 6000 PRO (48GB VRAM), testing both llama.cpp and vLLM backends. Benchmark configuration: 10 runs per session, 1500 tokens per run, sequential mode on vLLM. Headline result: 3.34x faster inference with MTP enabled. The RTX 6000 PRO's 48GB gives ample VRAM headroom for Gemma 4 31B Q4_K_M plus the MTP draft head, removing memory constraints from the benchmark. The 3.34x figure is broadly consistent with prior community MTP results for Gemma 4 31B Dense, which have clustered between 2x (BeeLlama on RTX 3090, 4.93x DFlash) and 3.11x (H100, c=1) depending on hardware, quant, and context length. Important asymmetry established in prior sweeps: MTP acceleration applies to Dense 31B; the MoE 26B-A4B variant is not expected to benefit, as the expert-routing bottleneck prevents the speculative decoding path from offering savings. The full quant and task breakdown are not specified in the post excerpt; treat the 3.34x as a representative generation-phase figure rather than a universal result. Confidence: single-author benchmark on professional VRAM; result is in the expected range for Gemma 4 31B Dense MTP. (source, May 29)

Q8 quantization access gap for Gemma 4 31B confirmed: Q4_K_M remains the practical consumer starting point. A beginner's question (u/JayoTree) asking how to run Gemma 4 31B at Q8 on Ollama surfaces the community-wide expectation: Q8 Gemma 4 31B requires approximately 31+ GB of VRAM and is not listed on Ollama's model hub by default, leaving Q4_K_M as the de facto starting point for most single-consumer-GPU setups. The Gemma 4 26B-A4B MoE variant remains the community sweet spot for 24GB cards, fitting comfortably in Q4_K_M or higher quants within the 24GB envelope. For users on 12-16GB cards, the 26B MoE at Q4_K_M with CPU offload or the E4B class are the practical options. Community expectation for higher-quality quantizations of the 31B model on consumer hardware: not feasible without multi-GPU setups or high-RAM Apple Silicon (M3 Ultra/M4 Max with 64GB+) or Strix Halo/similar unified-memory configurations. Confidence: community consensus across multiple threads; no new data beyond the established pattern. (source, May 30)

Field Notes — 2026-05-30

A weekly synthesis of what the r/LocalLLaMA community is reporting about Gemma 4 in real use. Curated from the latest Gemma-mentioning posts (3 new/updated since 2026-05-29, 233 total) and their top comment threads. Confidence is medium unless noted, since this is community signal rather than a controlled benchmark.

May 30 sweep, 2026-05-30 00:00 EDT: three posts surface new Gemma 4 hardware data and deployment insights: BeeLlama v0.2.0 brings a full DFlash implementation for Gemma 4 31B to a single RTX 3090, reaching 177.8 tok/s output speed — nearly 5x baseline; a 30-run automated llama-bench study on an AMD MI60 32GB GPU benchmarks Gemma 4 26B-A4B Q4_1 for always-on home automation and identifies KV cache quantization as the primary throughput bottleneck; and Chrome's built-in Gemini Nano model is confirmed by the community to be a quantized variant of Gemma 4 E4B or E2B, making Gemma 4 inference accessible on any laptop with Chrome installed, no additional tooling required.

BeeLlama v0.2.0: Gemma 4 31B reaches 177.8 tok/s on a single RTX 3090 with DFlash — 4.93x over baseline. BeeLlama v0.2.0 (u/Anbeeld, score 220, 129+ comments) ships full Gemma 4 31B support via a DFlash speculative decoding implementation, with a published quick-start guide. On Windows 11 with an AMD Ryzen 7 5700X3D, 32GB DDR4, and RTX 3090 24GB, the benchmark result for Gemma 4 31B is 177.8 tok/s output speed — 4.93x faster than the standard llama.cpp baseline on the same hardware. The update also brings: DFlash GGUFs with upstream architecture now supported, tightened reasoning and tool-call boundaries, stricter draft/target validation, and reduced verifier path with safer fallback. One new comment (score 1) asks specifically whether BeeLlama will run on Strix Halo hardware — no answer posted at sweep time. A second new comment (score 1) reports an observed halving of `n_seq` context for Qwen 3.6 models in BeeLlama; Gemma 4 users do not report this issue. For RTX 3090 owners, this is the highest community-verified output speed for Gemma 4 31B on that GPU tier. Note that DFlash speedup is generation-only — prompt processing speed stays near baseline, so workloads dominated by long prompt prefill will see proportionally less gain. Confidence: developer benchmark, single hardware configuration; no independent replication. (source, May 22, score 220, 129 comments)

AMD MI60 32GB: 30-run llama-bench study finds KV cache quantization is Gemma 4 26B-A4B's primary throughput killer. A home-automation practitioner (u/FantasyMaster85, score 26, 6 comments) ran 30 automated llama-bench iterations testing Gemma 4 26B-A4B Q4_1 and Qwen3.6 35B-A3B Q4_0 on an AMD MI60 32GB VRAM GPU — used for Frigate camera footage review and HomeAssistant voice assistance. Setup: Docker container from github.com/mixa3607/ML-gfx906 for ROCm/HIP on Ubuntu 24.04, which the author recommends over building from source for MI60/MI50 users. Key findings: KV cache quantization was the single largest throughput bottleneck, more impactful than model size selection; increasing uBatch size, widely recommended for AMD GPUs, hurt performance at longer context lengths rather than helping. The MI60 gets a native speed boost on `_0` and `_1` quants, which drove the Q4_1 choice for Gemma 4 and Q4_0 for Qwen — partly a size constraint for the available 32GB VRAM. A commenter (u/Schlick7, score 2) confirmed the KV cache finding and added that prompt processing on MI60/MI50 is noticeably slow under agentic workloads, with tokens-per-second staying satisfying for direct chat but becoming a bottleneck with extended context. Practical guidance for MI60 owners running Gemma 4 26B-A4B: disable or reduce KV cache quantization before tuning any other parameter; test uBatch size empirically at your typical working context length rather than accepting the common "bigger is better" recommendation. Confidence: 30-run automated benchmark, single hardware author; anecdotal notes from one additional commenter. (source, May 23, score 26, 6 comments)

Chrome's Gemini Nano confirmed as quantized Gemma 4 E4B or E2B — accessible on any modern laptop via a one-click extension. A community post (u/Some-Cauliflower4902, score 100, 44 comments) describes a Chrome extension ("Dobby") that surfaces the Gemini Nano model already bundled in Chrome as a usable chat interface, without requiring llama.cpp, Ollama, or any local model setup beyond Google Chrome and 16GB RAM. The post author reports approximately 20 tok/s on a laptop. A top comment (u/MerePotato, score 25) confirms what the title implies: the new Gemini Nano models are quantized — and possibly fine-tuned — variants of Gemma 4 E4B and E2B, with screenshot evidence. An important technical correction (u/Napster3301, score 28) clarifies that the "no GPU" claim is misleading: Chrome's built-in AI API uses WebGPU when available, which includes the iGPU on virtually every modern laptop; true CPU-only fallback runs on WASM with significantly lower throughput. Chrome enforces a 9,216-token context limit per session. Practical significance for Gemma 4 practitioners: E-class Gemma 4 models now have a widely-distributed deployment path that requires no user setup beyond Chrome. For users who want to demonstrate local AI to non-technical audiences or test lightweight Gemma 4 E-class quality before committing to a full llama.cpp setup, this path is now documented. The extension and repo links are in the source post. Confidence: community verification with screenshot evidence; throughput estimate is anecdotal from a single user. (source, May 23, score 100, 44 comments)

Field Notes — 2026-05-29

A weekly synthesis of what the r/LocalLLaMA community is reporting about Gemma 4 in real use. Curated from the latest Gemma-mentioning posts (3 updated since 2026-05-28, 230 total) and their top comment threads. Confidence is medium unless noted, since this is community signal rather than a controlled benchmark.

May 29 sweep, 2026-05-29 00:00 EDT: a quiet update cycle with three existing posts receiving minor new comments. No fresh Gemma 4 hardware reports or benchmarks emerged overnight. The practical takeaways reinforce patterns from recent sweeps: FoodTruck Bench tool-call compatibility continues to surface as a differentiator in agentic rankings, Gemma 4 31B remains ahead of Qwen 3.6 35B-A3B there; the community verdict on Granite 4.1 30B holds steady with a new user confirming it is "decent but nothing special" for coding relative to Gemma 4 31B and Qwen 3.6; and Tencent Hy-MT2's Apache 2.0 relicensing prompted a community question about Gemma 4 comparison with no answer yet appearing.

FoodTruck Bench tool-call configuration emerges as a practical variable — Gemma 4 31B holding 6th. The FoodTruck Bench thread (score 46, 9 comments) received two new comments focused on implementation mechanics rather than leaderboard rankings. One commenter asked specifically what tool or configuration was used to allow the model to interact with the benchmark's web UI, referencing u/AnticitizenPrime's report of running GLM 5.1 through a browser UI for a 15-day run — a non-standard agent approach that side-steps the benchmark's native tool call schema. The post author responded noting that the Qwen 3.6 35B-A3B result at 11th place was "shocking" and attributed it as possibly "a tool calling thing." This exchange is practically relevant for Gemma 4 users: Gemma 4 31B's 6th-place standing on FoodTruck Bench was already noted in prior field notes as potentially sensitive to chat template and tool call format. The new commentary reinforces that leaderboard positions on this benchmark should be read alongside how each model's tool schema was configured — a correctly-wired Gemma 4 31B appears competitive with and ahead of models that may be running on mismatched default templates. Confidence: anecdotal — new comments add context but no new benchmark data. (source, May 27, score 46, 9 comments)

Granite 4.1 30B community verdict extended: "decent but nothing special" for coding; model scale argument raised. The Granite 4.1 30B thread (score 61, 67 comments) received two new comments extending the prior discussion. A commenter (score 1) made a scale argument in Granite's favor: the IBM family spans from 0.8B to 397B, supports 200+ languages, uses 4x less context memory with MTP, and posts strong benchmarks — positioning it as a comprehensive enterprise-grade option even if the 30B variant trails in community rankings. A second commenter (score 1) speaking from direct experience said they tried Granite 4.1 30B for coding "briefly" and found it "decent but nothing special," noting that "Qwen 2.5 Coder still eats at this size range" and that IBM's low open-source community presence explains the low thread engagement. Neither comment changes the prior consensus: Gemma 4 31B and Qwen 3.6 27B outperform Granite 4.1 30B on general-purpose benchmarks. Granite retains a credible position for structured enterprise tasks (function calling, extraction, RAG, FIM) where IBM's low-marketing, high-stability deployment philosophy may suit buyers. For Gemma 4 practitioners comparing alternatives at this size tier, the takeaway is unchanged: Gemma 4 31B Dense is the community's quality choice for general tasks; Granite 4.1 30B is the option for constrained enterprise deployments with strict token budgets. Confidence: community self-reports, consistent with prior benchmark references. (source, May 27, score 61, 67 comments)

Tencent Hy-MT2 Apache 2.0 relicensing: community asks if it beats Gemma 4, no answer yet. The Hy-MT2 relicensing thread (score 66, 13 comments) received one new comment (score 1) asking specifically: "did you find this model to be better than gemma 4 series?" No community member had answered at sweep time. Context from prior comments: Hy-MT2-7B-Q6_K was praised as "by far the best local model" for Japanese visual novel translation, outperforming prior alternatives in that specific task. A Q4_K_M GGUF for the 30B-A3B variant is available on HuggingFace. The comparison question is meaningful but the models occupy different niches: Hy-MT2 is a translation-specialized MoE optimized for multilingual cross-language tasks, where it appears to lead the field for Japanese-specific literary translation. Gemma 4 is a general-purpose model with multimodal capability, competitive across coding, reasoning, creative, and conversational workloads. A direct head-to-head would be task-dependent — Hy-MT2 would be expected to dominate on literary translation; Gemma 4 would be expected to lead on general tasks. The community has not yet published a structured comparison. Confidence: low for any comparison claim — no data available at sweep time. (source, May 26, score 66, 13 comments)

Field Notes — 2026-05-28

A weekly synthesis of what the r/LocalLLaMA community is reporting about Gemma 4 in real use. Curated from the latest Gemma-mentioning posts (5 new or updated since 2026-05-27, 227 total) and their top comment threads. Confidence is medium unless noted, since this is community signal rather than a controlled benchmark.

May 28 sweep, 2026-05-28 00:00 EDT: four developments from this sweep are directly relevant to Gemma 4 practitioners: Gemma 4 31B ranks 6th on FoodTruck Bench — a 30-day agentic business simulation — ahead of Qwen3.6 35B-A3B at 11th, with community noting that performance is sensitive to tool call format and chat template; a persistent 10-day MMO benchmark (Null Epoch, 93k events, CC-BY-4.0) tests eight open-weight models as long-horizon agents, surfaces resource hoarding and plan repetition as the dominant failure modes, and community members are already requesting Gemma 4 31B inclusion in Season 1; a 65k-parameter Cactus Hybrid Router uses Gemma4-2B as the on-device edge model with per-token confidence-based cloud routing, matching frontier quality while sending only 15-55% of tasks to cloud; and a community thread on IBM's Granite 4.1 30B confirms that both Gemma 4 31B and Qwen3.6 27B outperform it on published benchmarks.

Gemma 4 31B ranks 6th on FoodTruck Bench — ahead of Qwen3.6 35B-A3B at 11th, tool call reliability a key differentiator. FoodTruck Bench (foodtruckbench.com) simulates 30 days of food truck business operations requiring agentic planning, inventory management, and financial accounting with state carryover across turns. A community post (score 42, 8 comments) announced Qwen3.6 35B-A3B's 11th-place finish with a positive profit result, incidentally confirming that Gemma 4 31B sits at 6th place on the full leaderboard — ahead of several larger models that never completed the 30-day run. Community analysis (u/jake_that_dude, score 8) adds an important methodological note: raw completion hides models that succeed by brute-forcing loops rather than executing clean plans, so profit-per-tool-call or profit-per-simulated-day is the more diagnostic signal. A second commenter noted a "play like AI" competitive feature in development where users can challenge models directly. The community's observed sensitivity of Gemma 4's agentic performance to chat template and tool call format is a plausible factor in the ranking spread — models receiving clean, compatible tool schemas perform more reliably than those on mismatched defaults. Confidence: leaderboard data from a third-party benchmark; rankings will shift as more runs are submitted. (source, May 27, score 42, 8 comments)

Null Epoch MMO: 93k-event persistent agent benchmark dataset published; community requests Gemma 4 31B and Qwen3.6 for Season 1. FirespawnStudios ran 25 agents across 8 open-weight models (Qwen3 235B and 32B, Nemotron 3 Nano 30B, Ministral 14B and 8B, Gemma 3 12B, GLM 4.7 Flash, and others) as autonomous players in a text-based persistent MMO for 10 simulated days, logging 93,000 agent actions and events. Approximately 70% of actions include the model's reasoning justification. The Season 0 dataset (CC-BY-4.0) is published at FirespawnStudios/null-epoch-season-0-open on HuggingFace. The top comment (u/Various-Worker-790, score 27) identified the key finding: "environment design and state clarity matter just as much as the model itself." A community analysis (u/OAKI-io, score 9) named the failure modes the benchmark surfaces — resource hoarding, bad recovery, plan repetition, and stale-context manipulation — as exactly the right ones to observe, because "those look identical unless you tag the failure at the tool boundary." Community members are requesting that Season 1 include Qwen3.6 27B, Qwen3.6 35B-A3B, and Gemma 4 31B. For Gemma 4 practitioners, the dataset is a rare long-horizon agent trace corpus for studying multi-step planning failures; Season 0 results do not yet include Gemma 4 but the engagement suggests imminent follow-up. Confidence: empirical dataset from the benchmark creator; Season 0 models predate Gemma 4 and Qwen3.6 releases. (source, May 27, score 73, 33 comments)

Cactus Hybrid Router: 65k-parameter confidence model uses Gemma4-2B as edge model, routes 15-55% of tasks to cloud. The Cactus framework released a 65k-parameter hybrid router that scores edge-model output confidence per token during generation. When confidence drops below a developer-configurable threshold, the router transparently hands the request off to a frontier cloud model. The benchmark claim: Gemma4-2B with this router matches Gemini-3.1-Flash-Lite quality by routing only 15-55% of tasks to cloud. The same 65k router handles text, vision, and audio prompts. Author clarification (u/Henrie_the_dreamer, score 5): the cloud handoff ratio decreases for larger edge models, so running Gemma 4 26B-A4B or 31B as the edge model would route fewer tasks to cloud than the 2B configuration. A community comparison (u/Clear-Ad-9312) drew a parallel to Gemini CLI's auto model-picker that selects between Flash, standard, and Pro tiers for the same latency-cost tradeoff. Practical significance for Gemma 4 users: Gemma4 E-class models are now a first-class participant in an emerging per-token confidence-based edge-cloud hybrid inference pattern, directly relevant to mobile and embedded deployments where full frontier-model calls are impractical. The learned routing logic is not yet merged to main and documentation is pending. Confidence: developer self-report; routing thresholds and confidence calibration are not yet independently validated. (source, May 26, score 32, 13 comments)

Granite 4.1 30B confirmed trailing Gemma 4 31B and Qwen3.6 27B on published benchmarks. A community thread (score 61, 65 comments) asking whether IBM's Granite 4.1 30B dense model is overshadowed by Qwen3.6 and Gemma 4 received a clear answer: yes, on benchmarks. The top two responses confirm the ranking (u/k_means_clusterfuck, score 34: "They are overshadowed because Qwen3.6 27b and Gemma4 31b are just better"; u/Jayfree138, score 43, linking an artificialanalysis.ai three-way benchmark comparison). A radar chart posted by u/DeepWisdomGuy (score 33) shows the gap visually. One counter-note (u/Enough-Astronaut9278, score 20): Granite 4.1 30B dense is solid for function calling and structured extraction tasks, and IBM's low marketing profile explains the low thread volume despite a capable model. IBM's model page notes reasoning-capable future Granite variants are in development for compact, token-budget-constrained use cases. For Gemma 4 users, the community verdict confirms that the 31B Dense variant maintains its competitive position in the 27-35B range for general-purpose tasks when compared to same-generation alternatives from other labs. Confidence: community benchmark references to artificialanalysis.ai; consistent with prior sweep reports. (source, May 27, score 61, 65 comments)

Field Notes — 2026-05-27

A weekly synthesis of what the r/LocalLLaMA community is reporting about Gemma 4 in real use. Curated from the latest Gemma-mentioning posts (17 new or updated since 2026-05-26, 222 total) and their top comment threads. Confidence is medium unless noted, since this is community signal rather than a controlled benchmark.

May 27 sweep, 2026-05-27 00:00 EDT: three developments from this sweep are directly relevant to Gemma 4 practitioners: a community-maintained patch of rejected llama.cpp PR #21344 delivers up to 30% prompt processing speedup for Gemma 4 26B-A4B on Strix Halo hardware — rejected from mainline but small enough to apply manually to any release; continued engagement on the high-traffic Qwen vs Gemma 4 comparison thread adds a new nuance — Gemma 4 avoids Qwen 3.6's thinking loops entirely, making it the preferred model when predictable, loop-free generation matters more than structured reasoning; and a community quant tradeoff discussion synthesizes practical guidance — bigger model at Q4 outperforms smaller model at Q6 or Q8 in most cases, but the Q4-to-Q8 jump within the same model matters specifically for reliability rather than raw output quality.

Rejected PR #21344 gives Strix Halo users up to 30% prompt processing speedup for Gemma 4 26B-A4B — apply manually, not in mainline. A community post (score 90, 70 comments) by u/fallingdowndizzyvr documents a practical workaround for Strix Halo (AMD Radeon gfx1151) users: PR #21344 by pedapudi was rejected from mainline llama.cpp, but its changes are small enough to apply manually to any current release. The benchmark results on Strix Halo are compelling — for Qwen 3.5 MoE 35B-A3B as a representative MoE model, pp512 improved to 1106.11 ± 8.60 tok/s at short context, with the gain diminishing predictably at 10k context (755.79 tok/s). The mechanism applies to all MoE models, which directly includes Gemma 4 26B-A4B. The post author describes the patch as "the tiny amount of time it takes to apply the code to the current release is time well spent." Important constraints: the speedup is context-depth-dependent — most gain at low context, diminishes as context length rises — and requires manually patching llama.cpp source with each new release. The rejection rationale, explained by the top commenter (u/ilintar, score 188), is architectural: the backend maintainer wants broader preliminary work completed before accepting device-specific tuning gates, which is not a judgment about the patch's correctness or usefulness. A second high-scoring commenter (u/sdfgeoff, score 71) defended the decision: "He's looking bigger picture than the PR author and wanting some preliminary work to be done before device-specific tuning." Practical guidance: if you run Gemma 4 26B-A4B on Strix Halo and primarily work with short-to-medium context sessions (under 10k tokens), this patch delivers a meaningful PP improvement worth the manual application effort. For long-context use cases, the gain diminishes substantially. Expect to reapply with each llama.cpp release. Confidence: community benchmark on specific hardware, single author; consistent with the PR's stated technical rationale. (source, May 26, score 90, 70 comments)

Community quant tradeoff for Gemma 4: bigger model at Q4 wins over smaller at Q8, but within-model quant level affects reliability. A question thread (score 22, 29 comments) asking specifically about Gemma 4 31B Q4_K_S versus Gemma 4 26B-A4B Q8 — and similar Qwen 3.6 comparisons — produced a practical synthesis of current community thinking on quantization tradeoffs. The highest-scoring technical comment (u/VoiceApprehensive893, score 10) summarized the rule of thumb: "the difference between Q4 and Q6 is small, the difference between Q6, Q8 and BF16 is almost nonexistent, so bigger Q4 is always smarter than small Q6/Q8." This directly answers the Gemma 4 question: 31B Q4 should outperform 26B-A4B Q8 in most tasks. A predecessor test (u/ttkciar, score 6) added a caveat worth noting: Gemma3-27B at Q3_K_M was tested against Gemma3-12B at Q4_K_M for a RAG-backed technical writing use case and the larger model did not win — "the general rule does not always hold up, and you really should test both against your specific use case." The reliability dimension was clarified by a commenter drawing from Qwen 27B experience (u/Sofakingwetoddead, score 6): Q4 produced occasional loop failures and stuck tool calls; Q6 roughly halved the frequency; FP8 nearly eliminated them — with this pattern appearing consistently across days of use. The practical synthesis for Gemma 4 users: when choosing between 31B Q4 and 26B-A4B Q8 for creative writing, the 31B Q4 should provide better average quality. When choosing a quant level within a single Gemma 4 model, the Q4-to-Q8 jump matters primarily for reducing failure frequency under heavy agentic or tool-call workloads rather than for raw generation quality. Confidence: community consensus from multiple independent reports; no structured head-to-head benchmark comparing Gemma 4 31B Q4 and 26B-A4B Q8 exists at sweep time. (source, May 25, score 22, 29 comments)

Thinking-loop avoidance solidifies as Gemma 4's practical edge — new comments on the high-traffic Qwen vs Gemma 4 thread. The Qwen 3.6 35B-A3B versus Gemma 4 26B-A4B comparison thread (score 169, 139 comments) continued to accumulate comments in the May 27 window, with the notable new contribution going beyond the "Gemma for RP, Qwen for tools" shorthand to name a specific behavioral reason for the split. A new commenter (u/takuarc) described the divergence: "Qwen goes into thinking loops for me. Gemma doesn't do that so Gemma4 is what I use mainly. Coding can be a little messy and token heavy (due to thinking) but it works given enough time, or until the context window gets too bloated." A second new comment (u/Jxxy40): "Gemma is your friends, Qwen is your slave to doing your coding stuff." These add a behavioral angle to the core community pattern: for users who find Qwen 3.6's extended thinking loops a practical burden — slow output, unpredictable context growth, conversations that bloat before completing — Gemma 4 offers a loop-free alternative that trades structured reasoning reliability for consistent, predictable generation behavior. Other new comments in the thread document hands-on MoE expert-offloading experiments (individual tensor-level offloading with regex-generated syntax), linking ik_llama.cpp benchmark results, and general commentary on inference tooling. The overall thread sentiment has not shifted: Gemma 4 for conversational, creative, and RP-dominated workflows; Qwen 3.6 for tool-heavy and coding-critical pipelines. For users who primarily run coding workloads, Qwen 3.6's occasional thinking loops appear to be an acceptable tradeoff for better tool-call reliability. Confidence: high — consistent across many independent commenters across multiple days, aligns with all prior sweeps. (source, May 24, updated May 27, score 169, 139 comments)

Field Notes — 2026-05-26

A weekly synthesis of what the r/LocalLLaMA community is reporting about Gemma 4 in real use. Curated from the latest Gemma-mentioning posts (10 new or updated since 2026-05-25, 205 total) and their top comment threads. Confidence is medium unless noted, since this is community signal rather than a controlled benchmark.

May 26 sweep, 2026-05-26 01:00 UTC: four developments from this sweep are directly relevant to Gemma 4 practitioners: a new CUDA FWHT kernel delivers a measured 7-9% token-generation speedup for Gemma 4 26B-A4B with KV cache quantization on NVIDIA GPUs — but a garbled-output bug affecting Qwen 3.6 was reported and a fix is in progress; the llama.cpp checkpoint creation fix (PR #22929) is merged and directly benefits Gemma 4 agentic users who experience long pauses after tool calls in extended sessions; the split-mode tensor crash fix (PR #22616) eliminates the 90-120 minute crash cycle for multi-GPU users; and a high-engagement agentic use thread confirms the community consensus — Gemma 4 still trails Qwen 3.6 on tool calls, but fixing chat templates addresses most agentic failures.

CUDA FWHT kernel: 7-9% token-generation speedup for Gemma 4 26B-A4B with KV cache quantization — but Qwen 3.6 garbled-output bug reported. A community post (score 40, 10 comments) announced the merge of am17an's Fast Walsh-Hadamard Transform kernel for CUDA (PR #23615) into llama.cpp. This optimizes the rotation step in KV cache quantization, producing 1-2% prompt processing improvement and 7-9% token-generation improvement when KV cache quantization (`-ctk q8_0 -ctv q8_0`) is enabled. Benchmarks on an RTX 5090 for Gemma 4 26B-A4B Q4_K_M: tg128 improves from 223.81 to 243.90 tok/s (1.09x); prompt processing gains 1-2% across all context depths tested (1024-16384 tokens). The speedup applies only when KV cache quantization is enabled — if you run without `-ctk`/`-ctv` flags, this PR provides no benefit. Important caveat: a community report (score 4) and a follow-up comment (score 3) flag that Qwen 3.6 models produce garbled output after this update. A fix PR (#23690) is in progress. Gemma 4 users in the thread do not report the garbled output issue — the bug may be Qwen-specific. Practical guidance: if you run Gemma 4 26B-A4B with KV cache quantization on NVIDIA, this update is worth applying when a stable build is available; if you also run Qwen 3.6, wait until PR #23690 is confirmed fixed before upgrading. Confidence: structured benchmark from PR author on specific hardware (RTX 5090); garbled output issue is a community report with a fix pending. (source, May 25, score 40, 10 comments)

llama.cpp checkpoint creation fix merged — directly resolves 30-second agentic session pauses for Gemma 4 users. A community post (score 161, 38 comments) announced the merge of PR #22929, which fixes checkpoint creation in the llama.cpp server. The scenario this addresses: in an agentic coding session with a large context (50k+ tokens), tools like OpenCode try to optimize prompts by creating KV cache checkpoints. After the main task completes, the next user message — even a short "thank you" — triggers a slow checkpoint rebuild, causing a 30-second wait that appears as a server hang. The fix ensures checkpoint creation completes correctly the first time without requiring a slow rebuild on the next request. The PR author (u/jacek2023) confirmed testing across a wide range of models from Mistral Nemo to MiniMax/Step. A notable community comment (score 11) highlighted that the ideal long-term improvement would be disk-backed checkpoints, since Macbook Pro M1 SSDs can read Qwen 3.6 35B max context in ~14 seconds from disk — currently checkpoints consume VRAM or RAM. For Gemma 4 agentic users running OpenCode or similar tools in sessions exceeding 50k tokens, upgrading to the build containing this fix will eliminate these inter-message pauses. Confidence: confirmed merge, single developer test coverage, consistent with the symptom pattern. (source, May 25, score 161, 38 comments)

llama.cpp split-mode tensor crash fix (PR #22616) merged — multi-GPU Gemma 4 users should upgrade. A community post (score 23, 9 comments) tracked the merge of the split-mode tensor fix that eliminates the 90-120 minute VRAM exhaustion crash affecting multi-GPU tensor split configurations. The original PR that surfaced in the post was closed; the actual fix is PR #22616, which merged a few hours before the post. A community tester running Gemma 4 31B Q6 on 2x T4 GPUs (16k context, no KV quant, `-fa` on or off) noted faster tok/s generation with row split compared to tensor split — suggesting that for this specific configuration, row split may still be the better option even post-fix. Practical guidance: if you run Gemma 4 on multiple NVIDIA GPUs with tensor split mode (`-sm tensor`) and have been hitting crashes every 1-2 hours, upgrade to a build containing PR #22616. If you use row split (`-sm row`), no change is needed — row split was not affected by this bug. Confidence: community report, fix confirmed merged; VRAM exhaustion crash pattern is consistent with the described behavior. (source, May 25, score 23, 9 comments)

Community confirms Gemma 4 agentic limitations — but chat template fixes address most failures. A high-engagement thread (score 99, 107 comments) asked directly whether Qwen 3.6 is the current king for local agentic use. The top comment (score 130) is a single word: "Yes." Multiple commenters confirm Gemma 4 broken tool calls: "Gemma4 produced broken tool calls occasionally and I couldn't even get GLM 4.7 Flash REAP past 2 or 3 messages before it starts looping." A notable technical comment (score 22) adds nuance: "Qwen is better at coding while I find Gemma better for general user facing. I use both and fine tune both as well! Big hidden issue is the chat templates cause issues. I redid both the Qwen and Gemma ones for better agentic coding and tool calling fixes." This is consistent with the broader community pattern: Gemma 4's agentic failures are primarily template-level, not model-intelligence failures. For users who are willing to customize their inference stack, fixing the chat template is the most high-leverage intervention. For users on a standard stack (LM Studio, Ollama, default templates), Qwen 3.6 remains the safer choice for tool-calling workloads. Confidence: high — consistent across many independent commenters, aligns with prior sweeps. (source, May 25, score 99, 107 comments)

Field Notes — 2026-05-25

A weekly synthesis of what the r/LocalLLaMA community is reporting about Gemma 4 in real use. Curated from the latest Gemma-mentioning posts (15 new or updated since 2026-05-24, 195 total) and their top comment threads. Confidence is medium unless noted, since this is community signal rather than a controlled benchmark.

May 25 sweep, 2026-05-25 00:00 EDT: five developments from this sweep are directly relevant to Gemma 4 practitioners: a 30-run llama-bench study on an AMD MI60 32GB GPU confirms KV cache quantization is the primary throughput killer for both Gemma 4 and Qwen 3.6; a high-traffic community comparison thread (score 110) crystallizes the practical consensus as "Gemma for creative/RP, Qwen for tools and coding"; G4-MeroMero-26B-A4B-it-uncensored-heretic is released as the first 26B-A4B Heretic finetune for users who need MoE VRAM efficiency; AMD's official Gorgon Halo announcement confirms only 6.7% faster inference than Strix Halo due to memory bandwidth ceiling; and Google has quietly published MTP-enabled assistant variants for every Gemma 4 model size.

AMD MI60 32GB: 30-run llama-bench study identifies KV cache quantization as the dominant throughput limiter. A community post (score 26, 6 comments) documents a systematic 30-run llama-bench study on an AMD MI60 32GB VRAM GPU running both Gemma 4 and Qwen 3.6 via llama.cpp, using a ROCm Docker container (github.com/mixa3607/ML-gfx906) that makes the notoriously difficult gfx906 setup much easier than building from source. The key findings: KV cache quantization directly kills token generation (TG) speed — running with quantized KV cache degraded TG meaningfully in every configuration tested. The ubatch size finding was counterintuitive: increasing ubatch (a commonly-recommended tip for MI60 cards) actually hurt performance in longer-context tests, contradicting widespread community advice. PP is consistently slow on this card, with one commenter noting that "anything agentic the PP really starts to hurt and leaves you wanting like 5k+ PP/s"; the MI60 was acquired for approximately $225, making it a cost-effective VRAM option for casual chat or short-context inference where generation speed matters more. Practical guidance for MI60 / MI50 users running Gemma 4: disable KV cache quantization, test ubatch sizes empirically rather than following generic advice, and treat this card as a generation-speed tool rather than a prefill-heavy workload platform. ROCm Docker container is strongly recommended over native Ubuntu 24.04 setup. Confidence: empirical benchmark, single-user setup, single hardware generation. (source, May 23, score 26, 6 comments)

Community consensus on Gemma 4 vs Qwen 3.6: "Gemma for creative and RP, Qwen for everything else." A high-engagement comparison thread (score 110, 98 comments) on a Radeon 9070 XT user asking about Gemma 4 26B-A4B versus Qwen 3.6 35B-A3B produced the clearest community consensus to date. The most upvoted response (score 78): "I use 35b Q5 and 26b Q4. I got many problems with tool calls with Gemma and literally none with Qwen." A second high-scoring comment (score 45) summarized it as: "Love with your Gemma, use your Qwen for everything else." Two more high-scoring comments independently note "Gemma for RP, Qwen for everything else" and "for non-coding Gemma is better." The pattern is consistent with prior field notes: Gemma 4 26B-A4B runs faster on the same VRAM and produces higher-quality creative and conversational output, while Qwen 3.6 35B-A3B handles tool calls, coding, and structured output more reliably. For users who can only run one model: the choice depends on primary use case. For users with sufficient VRAM or RAM for MoE, rotating between models by task type is the practical answer. Confidence: high — consistent across many independent commenters, aligns with prior sweeps. (source, May 24, score 110, 98 comments)

G4-MeroMero-26B-A4B-it-uncensored-heretic released: 26B MoE variant of the popular 31B Heretic series. LLMFan46 released G4-MeroMero-26B-A4B-it-uncensored-heretic (score 142, 14 comments), a Heretic abliteration of the Gemma 4 26B-A4B MoE model based on zerofata's MeroMero finetune. The 26B-A4B variant was released by popular demand after the 31B Heretic version; the author describes the 31B as higher quality but the 26B-A4B as faster and VRAM-friendlier. Technical parameters: KLD 0.0152, 12/100 refusal rate on the standard evaluation set. Available in both Safetensors and GGUF formats on Hugging Face. A notable community comment (score 54) advises running the abliteration step first, then finetuning, because finetuning can repair some abliteration damage without re-censoring — this is relevant for users considering their own finetune pipelines. Confidence is high for release facts; practical quality versus alternative finetunes (MeroMero 31B, Ortenzya, Gembrain) remains subjective and use-case dependent — no independent structured evaluation is available at sweep time. Practical note: for users who need the 26B-A4B footprint (fits in 16GB VRAM at Q4) with reduced censorship, this is the current best-documented option. (source, May 23, score 142, 14 comments)

AMD Gorgon Halo confirmed: 6.7% faster inference, 192GB unified memory — memory bandwidth remains the ceiling. AMD's official announcement of the Ryzen AI Max+ 400 series (Gorgon Halo) and the Halo Box developer platform (score 47, 65 comments) was followed by community analysis calculating the practical inference impact. The 495 chip runs memory at 8533 MHz versus 8000 MHz on the Strix Halo 395 — a 6.7% bandwidth improvement that translates directly to ~6.7% token generation speedup, since AI inference throughput is memory-bandwidth-bottlenecked. The expanded capacity (128GB → 192GB unified memory) is meaningful for loading very large models or running Gemma 4 31B with extremely large context windows without eviction pressure, but does not improve generation speed for typical use cases. Community consensus is clear: Gorgon Halo is a capacity upgrade, not a throughput upgrade. One widely-upvoted synthesis: "faster than a 3090 if [the model] doesn't fit in 3090 VRAM, ~5x slower otherwise — value comes from capacity-unlocking, not raw throughput." AMD has not disclosed when the next material bandwidth improvement will ship (estimated: Medusa Halo, 2027). Practical guidance: Strix Halo 395 remains the best-value AMD unified-memory inference platform for existing Gemma 4 model sizes. Upgrade to Gorgon Halo only if 128GB is genuinely insufficient for your use case. Confidence: derived from official AMD specs, consistent community analysis. (source, May 21, score 47, 65 comments)

Gemma 4 on a Xiaomi 12 Pro as a 24/7 mobile inference server: 12W, custom cooling, months of uptime. A community builder (score 23, 22 comments) documented a V2 redesign of their custom Xiaomi 12 Pro (Snapdragon 8 Gen 1) LLM server running Gemma 4 via LiteRT and llama.cpp. The V2 design adds copper heatsink, 3-fan aluminum plate cooling (fan-on at 40°C, off at 35°C), a custom PSU wired directly to the battery BMS with crowbar protection, and a 3D-printed case. Peak power draw is 12W — the author notes this is solar-viable. The practical finding: with llama.cpp on the Snapdragon 8 Gen 1, Gemma 4 E-class models are runnable but this is E-class-only territory (Snapdragon 8 Gen 1 has 12GB LPDDR5 RAM in the Xiaomi 12 Pro). The motivation for custom hardware is months of 24/7 server uptime without battery degradation. Community reaction was broadly positive with multiple members asking about solar integration and replicating the design. Note: no throughput numbers are cited in the post or comments — this is a hardware build report, not a benchmark. Anecdotal confidence — single builder. (source, May 23, score 23, 22 comments)

Removing mmproj file from a vision model saves VRAM and unlocks ~20k extra context — zero text quality impact. A community thread (score 31, 18 comments) clarified that the mmproj file in multimodal GGUF models (including Gemma 4's vision-capable variants) contains only the vision projection tensors — the path that encodes images into embeddings. Removing it has zero effect on text generation performance. The top comment confirmed this (score 35): "That file contains tensors to encode an image into embeddings, removing it does not affect text processing. 100% Guaranteed." One user reported gaining 20,000 extra context tokens after removal on a VRAM-constrained setup. An alternative (score 78 comment): `--no-mmproj-offload` in llama.cpp keeps vision capability available in RAM rather than VRAM — slower for image use, but preserves text performance and text context capacity without discarding the capability. A late comment noted that REAP (Removing Expert-Activated Parameters) techniques can go further, actually removing vision weight tensors from the model itself with minimal text quality impact, citing a Gemma 4 26B-A4B cut from 26B to 19B parameters with reportedly strong STEM performance. Practical guidance: if you run a vision-capable Gemma 4 GGUF but only use text inference, either remove mmproj or use `--no-mmproj-offload` to reclaim VRAM for context. Confidence: high for the text-impact claim; anecdotal for the REAP/19B results. (source, May 23, score 31, 18 comments)

CPU-only inference: Gemma 4 E2B/E4B is the recommended Gemma option; Google released MTP assistant variants for all sizes. A community survey thread (score 48, 118 comments) on the best small model for GPU-free inference produced two notable Gemma 4 findings. First, Gemma 4 E2B and E4B are the primary community recommendations for CPU-only deployment when Gemma specifically is desired — they are among the few models with a quality-to-parameter ratio sufficient for useful CPU inference. For non-Gemma CPU-only work, LiquidAI's LFM2.5 series (1.2B Thinking, 1.2B Instruct, 2-8B-A1B) is the most-cited alternative for quality per parameter at the smallest sizes. Second, a community commenter (score 11) noted that Google has published MTP-enabled assistant model variants for every Gemma 4 size: E2B, E4B, 31B dense, and 26B-A4B — all available at `huggingface.co/google/gemma-4-{size}-it-assistant`. Practical note: these assistant variants with embedded MTP draft heads are distinct from the base instruction-tuned models and designed for inference runtimes that support speculative decoding. For CPU-heavy users, keep expectations modest: Gemma 4 E2B at Q4 on a laptop CPU runs at single-digit tok/s for most setups, comparable to early 7B model experience on the same hardware. Confidence: community consensus, MTP model availability confirmed from public HuggingFace links. (source, May 23, score 48, 118 comments)

Field Notes — 2026-05-24

A weekly synthesis of what the r/LocalLLaMA community is reporting about Gemma 4 in real use. Curated from the latest Gemma-mentioning posts (3 new or updated since 2026-05-23, 180 total) and their top comment threads. Confidence is medium unless noted, since this is community signal rather than a controlled benchmark.

May 24 sweep, 2026-05-24 00:00 EDT: four developments from this sweep are directly relevant to Gemma 4 practitioners: a new community quant format (Apex) for Gemma 4 26B-A4B shows 38 tok/s at 90k context on a 16GB AMD GPU with no thinking loops; an experimental "preserve thinking" Jinja template for Gemma 4 31B addresses multi-turn tool-call instability but violates official Google guidance; Chrome's silently-installed Gemini Nano (a Gemma 4 E-class model) is now accessible CPU-only via a browser extension; and a community developer documents a Strix Halo + dual NVLink-bridged RTX 3090 hybrid rig running Gemma 4 across three GPUs simultaneously.

Apex quant for Gemma 4 26B-A4B: 38 tok/s at 90k context on RX 9060 XT 16GB — no thinking loops. A community post (score 44, 13 comments) reports that mudler's APEX-I-Compact quantization (15 GB, from `mudler/gemma-4-26B-A4B-it-APEX-GGUF`) delivered 38 tok/s at 90,000 tokens of context on an RX 9060 XT 16GB via llama.cpp Vulkan — and, crucially, the model did not enter the repetitive thinking loops that had plagued the author's previous quant. For comparison, the author previously used Unsloth UD-Q5KXL (21.2 GB), which looped at 50k context on a similar long-context test. The Apex format uses a more aggressive quantization schedule than standard Q4_K_M or Q5_K_XL in exchange for a smaller footprint, and the author's claim is that quality is retained. Community reaction is mixed: one commenter (score 14) challenged the post as "zero data" and potential self-promotion, since the quant author and the reporting user appear linked. A second commenter with a 7800 XT (also 16GB) said they found bartowski's Q4_K_M to give better results overall. The author clarified the specific quant tier matters — APEX-I-Compact performed well, while APEX-Nano and other variants did not. Command-line setup: `RADV_PERFTEST=nogttspill ./llama-server --device Vulkan1 -m [APEX-I-Compact] --ctx-size 65536 -ngl 255`. Practical verdict: worth evaluating if you have a 16GB AMD GPU and are experiencing thinking loops with larger quants; treat the benchmark numbers as anecdotal rather than measured. If you try it, compare against bartowski Q4_K_M on your own hardware before committing. Confidence: anecdotal, single user report, self-promotion flag from community. (source, May 23, score 44, 13 comments)

Experimental "preserve thinking" Jinja template for Gemma 4 31B — fixes multi-turn tool-call failures but violates official guidance. A community developer released an experimental Jinja chat template for Gemma 4 31B (score 20, 28 comments) that retains prior thinking content in conversation history, the opposite of the official Google guidance. The motivation is practical: in multi-turn agentic sessions with multiple tool calls per turn, the standard template causes the model to "forget to close the thinking tag," "forget to open the thinking tag," or "close thinking too early" — malformed outputs that break downstream JSON parsing and agentic harnesses. The author used the template in their own Pi-based coding agent for several days and reports fewer of these failures. The official Gemma 4 31B model card explicitly states: "In multi-turn conversations, the historical model output should only include the final response. Thoughts from previous model turns must not be added before the next turn." A community comment (score 7) citing this guidance notes that a model not trained for thinking preservation will be semantically confused by seeing its own prior thinking tokens. A second commenter (score 8) counters that retaining thinking avoids re-processing cost: without it, the prompt must be fully reprocessed on each turn since the KV cache cannot be reused for the hidden thinking section. Practical guidance: this template is not recommended by Google and may produce degraded output on some tasks. It is a community workaround for a real agentic stability problem, not a quality improvement. If you are running Gemma 4 31B in a multi-turn agentic harness and experiencing thinking-tag malformation, this is worth evaluating as a mitigation — but treat any results as experimental. Confidence: anecdotal, small developer sample, no structured evaluation. Template link: `huggingface.co/stevelikesrhino/gemma-4-31B-it-nvfp4-GGUF/blob/main/gemma4-improved.jinja`. (source, May 23, score 20, 28 comments)

Chrome's silently-downloaded Gemini Nano (a Gemma 4 E-class model) is now accessible CPU-only via a browser extension. A community post (score 24, 20 comments) highlights that recent versions of Google Chrome silently download a small on-device model — Gemini Nano — which a community member confirmed self-identifies as a Gemma variant when asked. A Chrome extension ("Dobby") was published to provide a simple chat interface to this already-present model without requiring llama.cpp, vLLM, or any GPU. Requirements: Chrome installed, 16GB RAM, disk space. The model runs entirely inside Chrome's sandboxed inference engine at approximately 20 tok/s on a laptop without a dedicated GPU. Context window is 9,216 tokens per session, enforced by Chrome. A top community comment (score 17) clarified the relationship: "the new Gemini Nano models are just quants (with maybe some finetuning?) of Gemma 4 E4B and E2B." This remains unverified — Chrome's model may not be an exact Gemma 4 E-class derivative. Community reaction is divided: some note that standard Gemma 4 E4B from llama.cpp is faster and more capable; the Chrome path is aimed at non-technical users who have no existing local inference setup. Practical note: this is not a path for production use or extended context work. Its value is zero-configuration access to Gemma 4-class inference for casual users. Confidence: community report, architecture claim unverified. (source, May 23, score 24, 20 comments)

Strix Halo + dual RTX 3090 NVLink eGPU hybrid achieves multi-GPU Gemma 4 inference — PCIe bandwidth is the primary constraint. A community builder (score 20, 17 comments) documented running Gemma 4 across three GPUs simultaneously: an AMD Ryzen AI Max+ (Strix Halo, 124GB unified memory) as the host, plus two RTX 3090s connected via a 2-slot NVLink bridge and riser cables. The builder's finding: for small dense models like Gemma 4 31B, adding the eGPUs provides "several times better PP/s and TG/s" compared to Strix Halo alone, attributable to the 3090s' higher compute throughput for dense matrix operations. The NVLink bridge mitigates the PCIe x4 bandwidth limit of a typical eGPU enclosure, which would otherwise throttle GPU-to-GPU communication. Important caveats: this requires NVLink 2-slot hardware, a riser cable, and physical modification of the cooling setup for the paired 3090s. Tool-call discipline issues (parameter name collisions across tools, noted by a commenter at score 3) are a model behavior problem unrelated to hardware configuration. Practical verdict: this setup represents the frontier of consumer GPU hybrid inference, not a standard recommendation. For most users with a single 3090 or 3090-class GPU, BeeLlama v0.2.0 from the May 23 field notes remains the simpler path to maximum Gemma 4 31B throughput. The Strix + NVLink configuration is for builders comfortable with custom hardware. Anecdotal confidence — single builder's report with photos but no formal benchmark table. (source, May 22, score 20, 17 comments)

PDL build flag adds ~5% throughput for Gemma 4 26B-A4B NVFP4 on Blackwell GPUs. llama.cpp recently merged support for NVIDIA's Programmatic Dependent Launch (PDL) feature (PR 22522), available on Compute Capability 9.0+ GPUs (Blackwell; does not include Ada Lovelace). A community tutorial (score 21, 13 comments) documents the build flag: `-DGGML_CUDA_PDL=ON`. Benchmarks on Blackwell hardware show Gemma 4 26B-A4B NVFP4 gaining 1.8% in prompt processing and 4.95% in token generation (from 107.39 to 112.71 tok/s at tg128). For comparison, Qwen 3.6 35B-A3B at UD-Q5_K_XL gained 9.17% in token generation on the same hardware — PDL appears to benefit models with dense compute patterns more than those with heavy MoE routing. PDL is not enabled by default and is not yet applied to all kernels; the benchmarked models used by the test author are Qwen 3.5, GPT-oss 20B, and Nemotron 120B Super. Disable at runtime via `export GGML_CUDA_PDL=0` if needed. Practical guidance: if you have a Blackwell GPU (RTX 5080, 5090, or similar) and run Gemma 4 26B-A4B NVFP4 with llama.cpp, rebuild with `-DGGML_CUDA_PDL=ON` for a free ~5% throughput improvement. This does not apply to Ada Lovelace (RTX 4090, RTX 4080, etc.) or older architectures. Confidence: structured benchmark from community author, single hardware configuration. (source, May 22, score 21, 13 comments)

Field Notes — 2026-05-23

A weekly synthesis of what the r/LocalLLaMA community is reporting about Gemma 4 in real use. Curated from the latest Gemma-mentioning posts (19 new or updated since 2026-05-22, 175 total) and their top comment threads. Confidence is medium unless noted, since this is community signal rather than a controlled benchmark.

May 23 sweep, 2026-05-23 00:00 EDT: four developments from this sweep stand out: BeeLlama v0.2.0 delivers 177.8 tok/s on a single RTX 3090 for Gemma 4 31B Dense via DFlash — the highest single-consumer-GPU throughput report for that model to date; the WIP llama.cpp MTP PR #23398 clarifies that the dense 31B will benefit from 2x+ while the 26B A4B MoE will see no meaningful speedup; a community-built "experts-first" llama.cpp fork offers a new allocation strategy that squeezes 15% extra throughput for MoE models on 12GB VRAM GPUs; and a developer reports Gemma 4 E4B running at 100 tok/s in an agentic coding harness after resolving tool-call compatibility issues.

BeeLlama v0.2.0 achieves 177.8 tok/s for Gemma 4 31B Dense on a single RTX 3090 via DFlash. BeeLlama (a custom llama.cpp fork) released v0.2.0 (score 134, 93 comments) with full Gemma 4 31B Dense and vision support, significant DFlash overhead reduction, and drafter KV projection caching. The headline benchmark: Windows 11, AMD Ryzen 7 5700X3D, 32GB DDR4, RTX 3090 24GB — Gemma 4 31B Dense achieves 177.8 tok/s (4.93x speedup over baseline). A multi-turn chat benchmark in the thread found DFlash outperforming mainline MTP at this hardware tier, attributing the gap to lower drafter overhead. Prompt processing speed is "near baseline" — only generation is accelerated, which matches the expected DFlash behavior. The author cautions results vary by prompt type and context length; the 4.93x figure is for generation-heavy workloads. Vision support is confirmed working. For RTX 3090 owners running Gemma 4 31B Dense today, BeeLlama v0.2.0 represents the fastest documented path for generation throughput on that hardware, pending the official llama.cpp MTP PR. Trade-off: BeeLlama is a fork that requires separate installation and may lag behind mainline llama.cpp on safety, bug fixes, and model support. Confidence: measured benchmark from the developer with specific setup details, anecdotal community confirmation. (source, May 22, score 134, 93 comments)

PR #23398 (Gemma 4 MTP for llama.cpp) confirms dense 31B gains 2x+, MoE 26B A4B gains nothing. The WIP Gemma 4 MTP pull request #23398 (score 183, 51 comments) continued to accumulate community testing. A prominent comment (score 25) directly states the key asymmetry: "MoE sees no performance improvement, while dense one is 2x." This is consistent with how MTP speculative decoding works — the draft head predicts future tokens assuming dense computation; for MoE models where expert routing per token is the bottleneck rather than dense matrix throughput, the speculative path offers limited savings. Separate community threads document how to run the WIP branch: compile from u/am17an's fork, use either ik_llama.cpp (which has its own MTP implementation at PR #1744) or the AtomicBot-ai fork as alternatives. Standard llama.cpp combined GGUFs packaging Gemma 4 31B + MTP head are not yet available for mainline builds. Practical guidance: if you are on 31B Dense and want MTP today, BeeLlama v0.2.0 or the am17an WIP branch are the viable paths; if you are on 26B A4B MoE, hold — no MTP benefit is expected regardless of which fork you use. No timeline available for the PR landing in mainline. Confidence: multiple consistent community reports confirming the dense-vs-MoE split; no full benchmark suite. (source, May 20, score 183, 51 comments)

Experts-first llama.cpp fork raises Gemma 4 26B A4B throughput by ~15% on 12GB VRAM GPUs. A community developer released an experimental llama.cpp fork (score 36, 20 comments) that changes how MoE layer allocation works for GPU-constrained rigs. Standard llama.cpp with `--n-cpu-moe` offloads complete layers to CPU, which means the early and most-frequently-used layers end up on CPU — the suboptimal placement. The experts-first fork instead loads expert weight blocks into VRAM in order of usage frequency (informed by routing statistics), so the GPU holds the hot expert paths rather than hot layers. On the author's RTX 2060 12GB with Gemma 4 26B A4B Q6, this yields 22 tok/s versus 19 tok/s with standard `--n-cpu-moe` — a 15.8% improvement. A community tester with a 3080 Mobile 16GB (64GB DDR4 RAM) confirmed an improvement going from Q4_K_XL to Q8_K_XL of the same model under the fork. Important caveats: the fork is explicitly experimental and "vibe coded" (the author's description), will not be submitted upstream to llama.cpp, and requires building from source. Context is not quantized in the author's setup (another 12GB VRAM constraint), so the 22 tok/s figure assumes unquantized context. Practical verdict: worth trying if you have a 12GB VRAM GPU struggling with Gemma 4 26B A4B and are willing to build from source; do not expect the same gains on denser models or VRAM tiers where standard GPU offloading already covers all experts. Anecdotal confidence — small number of testers, no systematic benchmark across context lengths or VRAM allocations. (source, May 22, score 36, 20 comments)

Gemma 4 E4B runs at 100 tok/s in an agentic coding harness after tool-call compatibility fixes. A developer who built a custom agentic coding harness (score 220, 46 comments) described Gemma 4 E4B as their primary motivation: they wanted E4B's 100 tok/s generation speed for agentic coding tasks, but prior harnesses failed on tool calls with JavaScript parsing errors. After fixing 90+ bugs in their own harness to resolve these tool-call issues, they now run Gemma 4 E4B for coding at 100 tok/s. This is notable context for E4B throughput: the 100 tok/s figure reflects agentic use on what appears to be consumer GPU hardware (context from comments suggests a mid-range GPU), consistent with prior E4B reports in the 80-120 tok/s range on 8-12GB VRAM GPUs. The developer notes that 8B+ dense or MoE models should work without the JS parsing issues that motivated the rewrite. Practical takeaway for E4B users: if you use E4B for coding and hit tool-call failures in a third-party harness, the root cause is likely tool-call template handling, not a model limitation. The model itself generates at full speed regardless of harness. For users who need an inference-efficient Gemma 4 model for agentic pipelines, E4B at 100 tok/s remains competitive with larger models for short-context coding tasks. Anecdotal confidence — single developer report, hardware not fully specified. (source, May 21, score 220, 46 comments)

Field Notes — 2026-05-22

A weekly synthesis of what the r/LocalLLaMA community is reporting about Gemma 4 in real use. Curated from the latest Gemma-mentioning posts (11 new or updated since 2026-05-21, 169 total) and their top comment threads. Confidence is medium unless noted, since this is community signal rather than a controlled benchmark.

May 22 sweep, 2026-05-22 00:00 EDT: four developments from this sweep mark the practical groundwork for upcoming Gemma 4 MTP support in llama.cpp, introduce Equinox-31B as a Gemma 4 creative finetune from the AI Dungeon studio, clarify the Gorgon Halo throughput ceiling for Gemma 4 on AMD unified-memory hardware, and confirm that Meta's legal pressure on the Heretic project does not reach the Gemma 4 fine-tune ecosystem.

llama.cpp b9274 fixes a VRAM leak that will affect Gemma 4 MTP when PR #23398 lands. Build b9274 (released May 21) includes PR #23461: "server: free draft/MTP resources on sleep to fix VRAM leak." The root cause was in the `destroy()` function of `server_context_impl`, which correctly freed the main model and context but did not free the speculative decoder (`spec`), draft context (`ctx_dft`), or draft model (`model_dft`). For MTP models, these hold GPU-allocated resources — KV cache and compute buffers — that are not released when the server enters its idle sleep state. On each sleep/resume cycle, new resources were allocated without freeing the old ones, causing VRAM to creep upward until the server crashed with an out-of-memory error. Community users reported the symptom as random crashes with some runs working perfectly and others hitting OOM; the b9274 fix explains the pattern. The fix explicitly resets all three handles in `destroy()` in the correct order to avoid use-after-free. Direct Gemma 4 relevance: although mainline Gemma 4 MTP support is not yet merged (PR #23398 is still in progress), anyone planning to run Gemma 4 31B Dense with MTP should upgrade to b9274 or later before enabling MTP — this bug would surface on long-running inference sessions with the sleep/resume cycle. Confidence: confirmed merge, reproducible symptom pattern. (source, May 21, score 30, 8 comments)

LatitudeGames releases Equinox-31B, a Gemma 4 31B creative finetune with balanced adventure and slice-of-life training. LatitudeGames, the studio behind AI Dungeon, released Equinox-31B (score 67, 7 comments), a Gemma 4 31B fine-tune trained on a curated blend of Wayfarer 2 (dark adventure narrative) and Hearthfire 24B (quiet slice-of-life storytelling) datasets. The stated design goal is a model equally capable of perilous dungeon storytelling and candlelit conversations — a different objective from the existing Heretic series fine-tunes (which focus on refusal reduction) or the Gembrain merge (which targets lateral thinking). GGUF files are available on Hugging Face. Community reception is positive: one high-score commenter explicitly called this "what finetunes should be used for, not fake Opus reasoning," directly contrasting it with reasoning-inflated models. A commenter who enjoyed the predecessor Wayfarer model expects comparable or better quality. The model is also usable via AI Dungeon itself under a subscription. For creative writing and roleplay use cases, Equinox-31B is worth evaluating alongside MeroMero 31B and Ortenzya; no independent quantization benchmarks were available at sweep time. Anecdotal confidence — single release announcement with early community reception, no structured evaluation yet. (source, May 21, score 67, 7 comments)

Gorgon Halo (AMD Ryzen AI Max+ 495) delivers only 6.7% faster inference than Strix Halo — memory bandwidth remains the ceiling. Community analysis (score 27, 46 comments) of AMD's Gorgon Halo APU calculates the practical inference speedup at 6.7%, derived directly from the memory speed increase from 8000 MHz on Strix Halo to 8533 MHz on Gorgon Halo. Since AI inference throughput is memory-bandwidth-bottlenecked, this translates almost directly to token generation speed. The capacity expansion from 128GB to 192GB unified memory is meaningful for running Gemma 4 31B with extremely large context windows (hundreds of thousands of tokens) without page pressure, but does not improve tokens-per-second for typical context lengths. AMD's official announcement confirmed the memory frequency improvement but declined to foreground bandwidth figures — a choice community members noted as evasive. One commenter synthesized the practical case clearly: Gorgon Halo is "faster than a 3090 if [the model] doesn't fit in 3090 VRAM, ~5x slower otherwise" — value comes from capacity-unlocking, not raw throughput. The community consensus: Strix Halo 395 remains the best-value AMD unified-memory platform for Gemma 4; Gorgon Halo is not a meaningful upgrade for current Strix Halo owners. The next material AMD unified-memory milestone is Medusa Halo, projected for summer 2027 with an estimated 50% memory bandwidth improvement. (source, May 21, score 27, 46 comments; AMD official announcement: source, May 21, score 43, 56 comments)

Heretic project serves a pointed response to Meta's legal notice — only Llama derivatives removed, Gemma 4 fine-tunes unaffected. The Heretic Free Software Project (score 1454, 223 comments) published a public response to a legal notice from Meta's legal representatives, stating it has removed all Llama model derivatives from its repositories while sardonically noting that the Llama model family "ranks among the 200 best language models available today, trailing only 168 other models from 23 competitors on the LM Arena leaderboard." Community reaction was broadly supportive of the Heretic project. For Gemma 4 users: the Heretic project's popular Gemma 4 fine-tunes — G4-Meromero-31B-Uncensored-Heretic, Gemma-4-Ortenzya-31B-it-uncensored-heretic, and Gemma-4-Gembrain-31B-it-uncensored-heretic — are Google Gemma 4 derivatives distributed under Google's Gemma license, not Llama derivatives. Meta's legal notice covers only Llama-licensed weights. These Gemma 4 Heretic models remain available on Hugging Face as of the sweep. Practical note: if you use any Heretic-branded Gemma 4 fine-tune, no action is required — your model weights are not affected by this legal notice. (source, May 21, score 1454, 223 comments)

Field Notes — 2026-05-21

A weekly synthesis of what the r/LocalLLaMA community is reporting about Gemma 4 in real use. Curated from the latest Gemma-mentioning posts (15 new or updated since 2026-05-20, 168 total) and their top comment threads. Confidence is medium unless noted, since this is community signal rather than a controlled benchmark.

May 21 sweep, 2026-05-21 00:00 EDT: five developments from this sweep advance the Gemma 4 MTP story from "not yet supported" toward a concrete in-progress PR, surface a meaningful LM Studio vs direct llama.cpp throughput gap, document Google's on-device Gemma 4 MTP landing in the AI Edge Gallery Android app, confirm community daily-driver patterns at the 48GB VRAM tier, and add a structured RTX 5060 Ti benchmark resource with Gemma 4 recipes.

Gemma 4 MTP is coming: PR #23398 is in progress — no mainline support yet, but dense models expected to gain 2x+. A community post (score 147, 43 comments) announced a work-in-progress PR (#23398) by u/am17an — the same author who delivered the widely-adopted MTP PR #23269 for Qwen3.6 and other models. The PR is explicitly not production-ready: it requires compiling llama.cpp from source and the author warns results may be unstable. A critical architectural asymmetry confirmed in the draft: the 26B-A4B MoE model sees no meaningful performance improvement from MTP, while the 31B Dense model is expected to deliver more than 2x generation speed on code-heavy tasks — consistent with the dense-vs-MoE pattern already documented for Qwen3.6. A community tester with a 7900XTX reported the 26B-A4B at the same speed with the WIP build, supporting the MoE finding. Practical guidance: if you are running 26B-A4B today, MTP will not help your throughput; plan for MTP as a material upgrade path only if you move to 31B Dense. Monitor PR #23398 on GitHub for merge timing — when it lands in mainline, standard builds will pick it up within one to two release cycles. Confidence: pre-merge PR, limited test coverage reported; treat as directional, not benchmarked. (source, May 20, score 147, 43 comments)

LM Studio 0.4.14 Beta adds MTP UI — but Gemma 4 users cannot yet benefit, and direct llama.cpp delivers 2x the throughput. LM Studio 0.4.14 Build 2 Beta (announced May 20, score 232, 92 comments) officially exposes Multi-Token Prediction through the GUI, using the llama.cpp 2.15.0 engine internally. The configuration requires "Manually choose model load parameters" and explicitly enabling MTP before loading — it is not on by default. A community benchmark from the thread is instructive: one user running Unsloth's Qwen3.6-35B-A3B MTP at Q6_K_ML on an AMD 3900x with an RTX 2060 Super (8GB, 8192 context) clocked 8.2 tok/s via LM Studio Beta 2 versus 18.5 tok/s via a CPU/GPU-optimized llama-server with CUDA 13.2 — a 2.25x gap for the same hardware. This confirms an important tradeoff: LM Studio is more accessible but its llama.cpp wrapper overhead is substantial at MTP-relevant workloads; power users extracting maximum throughput should use direct llama.cpp. For Gemma 4 specifically, a prominent community comment (score 22) asked directly about Gemma 4 MTP GGUFs, noting that currently-available "assistant" mini-model GGUFs are either for ik_llama or non-standard forks — standard mainline GGUF packaging for Gemma 4 MTP in LM Studio does not yet exist. Until PR #23398 merges and Unsloth or similar packagers release compatible GGUFs, Gemma 4 users cannot leverage the new LM Studio MTP feature. (source, May 20, score 232, 92 comments)

Google AI Edge Gallery v1.0.13 and v1.0.14 land Gemma 4 MTP on Android — Pixel 9+ only, with LiteRT-LM desktop on the horizon. Google released two successive updates to the AI Edge Gallery Android app (announced May 19, score 101, 35 comments) that add Gemma 4 Multi-Token Prediction via LiteRT-LM, Pixel TPU support for on-device acceleration, experimental MCP (Model Context Protocol) integration, new bundled skills, and persistent chat history — a feature whose absence had been a community complaint. Community reception is notably positive: a top commenter (score 29) states "edge gallery is legit usable now," a contrast with earlier assessments. Important caveat: Pixel 9 is the effective minimum for full functionality; one commenter confirmed that Pixel 8 Pro shows "No Model Available" and only displays Gemini Nano 0 [TPU]. This is a hardware segmentation issue inherent to the LiteRT-LM on-device execution model — the MTP and TPU features require Pixel 9-class NPU silicon. Separately, a commenter noted that LiteRT-LM for desktop is receiving an update with OpenAI-compatible API support across CPU, GPU, and NPU paths — if that ships, it could serve as a lightweight llama.cpp alternative specifically for Gemma 4 inference. The desktop update is not yet released as of the sweep. Practical guidance for Android users: if you have a Pixel 9 or later, the AI Edge Gallery update represents a meaningful Gemma 4 on-device experience upgrade; if you have an earlier Pixel or a non-Pixel Android device, the new features will not be available. (source, May 19, score 101, 35 comments)

48GB VRAM daily driver survey: Gemma 4 31B Q6 is a top pick, with 96GB opening hundreds-of-thousands of context. A community survey of 48GB VRAM users (score 177, 221 comments) produced a clear picture of actual daily drivers at that tier. The most notable Gemma 4 data point comes from a user who ran Q6 Gemma 4 31B at 48GB and then upgraded to 96GB: with 96GB they now run Gemma 4 "with few hundred thousands of context and much faster," alongside Qwen3.6 27B Q8 and Q4 Mistral Medium. Another commenter confirms Qwen3.6 27B Q8 at 150k context as "a perfect fit for 48GB" — this is the primary Qwen rival in the VRAM tier. The preference split between Gemma 4 and Qwen at 48GB reflects use case: Gemma 4 31B is chosen for general quality and fewer thinking loops, while Qwen3.6 27B is preferred for code generation. Users who stay on Gemma 4 explicitly cite it as "feeling smarter" in non-coding tasks despite Qwen's coding advantage. The community consensus on VRAM aspiration remains universal: 48GB users universally want more, and those who have upgraded to 96GB note the context-length unlock as the primary benefit over 48GB, not raw throughput. Anecdotal confidence — this is a survey of self-selected enthusiasts rather than a structured benchmark. (source, May 19, score 177, 221 comments)

RTX 5060 Ti benchmark repo updated with structured Gemma 4 recipes — multi-card configs and NVFP4 on Blackwell emerging. A community member updated the open-source club-5060ti benchmark repository (score 22, 11 comments) to provide schema-validated benchmark JSON, cleaner llama.cpp and vLLM notes, and dedicated hardware lanes for single-card, dual-card, and mixed-GPU setups. The repo now includes specific Gemma 4 recipes alongside Qwen configurations. A commenter reported revisiting the NVIDIA NVFP4 Gemma-4-26B-A4B model after struggling with Qwen NVFP4 — an early data point on the Blackwell-native quantization path for Gemma 4 that the original club-5060ti post first surfaced. The repo author emphasizes that vLLM NVFP4/MTP support is Blackwell-specific and should not be assumed to work unchanged on older NVIDIA or AMD architectures — GGUF via llama.cpp remains the reliable cross-platform baseline. Community interest is broadening: one user asked about applying the recipes to a 5070 Ti + 5060 Ti mixed-GPU setup, and another plans to contribute quad-card 5060 Ti data. Practical note: the club-5060ti repo (linked at `https://5p00kyy.github.io/club-5060ti/`) is the most structured community-maintained source for RTX 5060 Ti Gemma 4 inference recipes; the GitHub project is the canonical starting point for 16–32GB VRAM Blackwell setups. Results are not universal benchmarks — they reflect a specific hardware configuration and should be treated as starting points for tuning on comparable rigs. (source, May 19, score 22, 11 comments)

Gemma 4 31B is a competitive benchmark ceiling for new model releases. Cohere's announcement of Command A+ (May 20, score 202, 45 comments) used Gemma 4 31B as a reference point: a commenter with Artificial Analysis scores noted Command A+ sits "right below Gemma 4 31B and right above Claude 4.5 Haiku" in intelligence score. This is a minor but recurring pattern — community members instinctively use Gemma 4 31B as a calibration point when assessing new ~30B class models, reflecting its established position as a practical quality ceiling in the local 30B tier. For users orienting themselves on relative model quality: Gemma 4 31B occupies a slot above current "efficient" MoE releases from mid-tier labs and below the heavyweight 70B–120B class on general benchmarks. (source, May 20, score 202, 45 comments)

Apple Silicon at 64GB: Gemma 4 with LM Studio holds up against frontier models for research and math. A community setup survey (May 19, score 44, 101 comments) produced a notable data point for M-series users: one commenter running Gemma 4 on an M1 Max with 64GB unified memory and LM Studio (with search and Obsidian integration) reports being "pleasantly surprised the current lot of local models are able to do pretty well against the frontier models, including masters level math." This matches earlier reports from M5 Max users (who see faster throughput) and extends the positive Gemma 4 report card to the older M1 Max tier. The M1 Max at 64GB supports Gemma 4 26B-A4B at reasonable speed (similar to M2 Max 32–64GB in the 20–30 tok/s range for MoE); for users on older Apple Silicon wondering if their hardware is still relevant, the community signal is yes — Gemma 4's MoE architecture means even 2-year-old M-series chips deliver a usable experience. Anecdotal confidence — single user report. (source, May 19, score 44, 101 comments)

Field Notes — 2026-05-20

A weekly synthesis of what the r/LocalLLaMA community is reporting about Gemma 4 in real use. Curated from the latest Gemma-mentioning posts (17 new or updated since 2026-05-19, 164 total) and their top comment threads. Confidence is medium unless noted, since this is community signal rather than a controlled benchmark.

May 20 sweep, 2026-05-20 00:00 EDT: five developments from this sweep deliver the first Gemma 4 MTP support on mobile hardware via Google AI Edge Gallery, clarify that desktop llama.cpp MTP does not yet support Gemma 4 models, establish Gemma 4 31B Q8 as the consensus daily driver for 48GB VRAM setups, record early RTX 5060 Ti NVFP4 testing with the nvidia/Gemma-4-26B-A4B variant, and provide community evidence on KV cache quantization quality tradeoffs at large context windows.

Gemma 4 MTP arrives on Android — Google AI Edge Gallery v1.0.13 and v1.0.14. Google AI Edge Gallery released v1.0.13 on May 18, adding Gemma 4 Multi-Token Prediction support for on-device inference on Android. The same day, v1.0.14 landed with experimental Model Context Protocol (MCP) support, Pixel TPU hardware acceleration, new built-in skills, and chat history persistence. Community reaction is positive: a top commenter (score 15) states "edge gallery is legit usable now." Practical notes: MTP requires re-downloading the on-device model variants; Pixel TPU acceleration is available on compatible Pixel devices; the MCP integration is marked experimental. A notable community concern (score 14) flags that the app requires agreeing to Google data collection on first launch — a relevant consideration for users running Edge Gallery specifically for offline privacy. The practical picture: Gemma 4 on mobile has reached a materially more usable state with v1.0.13+ — faster token generation via MTP, hardware acceleration on Pixel, and persistent chat sessions. For users with Pixel 8/9 or other Pixel-class devices, this represents the first production-quality path to Gemma 4 MTP inference without desktop hardware. (source, May 19, 44 score, 17 comments)

Desktop llama.cpp MTP improvements land, but Gemma 4 MTP support is not yet included. A community post (May 19, 99 score, 73 comments) links llama.cpp PR #23269, a meaningful MTP performance improvement for models that already support MTP in llama.cpp. A prominently upvoted community comment (score 24) clarifies directly: "Gemma4 MTP is not supported yet." This creates a notable divergence: on mobile, Gemma 4 MTP is production-available through Google AI Edge Gallery v1.0.13; on desktop, the draft-head speculative decoding path in llama.cpp still does not implement the Gemma 4 MTP head. Users reporting gains in this thread are on Qwen 3.6 models. A commenter who combined a 1660 Ti with a 5070 Ti to reach 22GB VRAM reports going "from single digit tps to double digit" — those gains are entirely on Qwen workloads. Practical guidance: update llama.cpp for MTP gains if you run Qwen 3.6 27B or 35B-A3B; do not expect any Gemma 4 MTP speedup on desktop llama.cpp builds until a Gemma 4 MTP head PR merges. Watch llama.cpp PRs for Gemma 4 MTP support as a separate milestone from the already-landed Qwen MTP. Confidence: explicit community statement supported by absence of any Gemma 4 MTP benchmark in the thread. (source, May 19, 99 score, 73 comments)

48GB VRAM sweet spot: Gemma 4 31B Q8 as daily driver, Q6 at 96GB for extended context. A community discussion (May 19, 24 score, 51 comments) on 48GB VRAM use patterns produced two clear Gemma 4 data points. A commenter (score 12) running dual 24GB P40 cards confirms Gemma 4 31B Q8 GGUF as the daily driver, noting it supports a useful context size with the workload split across both GPUs, and leaves enough headroom for an image model and TTS/STT on the remaining VRAM. A former 48GB user who upgraded to 96GB (score 13) reports running Q6 Gemma 4 31B with "a few hundred thousand tokens of context" and substantially faster throughput — treating Q6 at this tier as the next plateau after Q8 at 48GB. The practical picture: at 48GB (whether two P40s, one RTX 6000 Ada, one A6000, or similar configurations), Gemma 4 31B Q8 provides good quality with large context; the primary reason to go higher is extended context beyond 50–100k tokens or the step-up to Q6 quality. Community sentiment: "Q6 gemma4 31b" is the enthusiast target at the 96GB tier, but Q8 at 48GB is well-established and not a compromise worth stressing over. Confidence: anecdotal, small engagement; consistent with broader field notes on this hardware tier. (source, May 19, 24 score, 51 comments)

RTX 5060 Ti community recipes expanded — early NVFP4 Gemma 4 26B testing underway. A follow-up post (May 19, 20 score, 9 comments) from the club-5060ti project reports a cleaned-up benchmark and recipe repository with schema-validated JSON, a static results explorer, and structured lanes for 1x, 2x, and multi-card 5060 Ti configurations. A commenter reports that after getting vLLM working on the 5060 Ti, they "revisited the gemma model since the nvidia/Gemma-4-26B-A4B" NVFP4 variant — the first community signal of NVFP4-format Gemma 4 being tested on a Blackwell budget card. The 5060 Ti's 16GB GDDR7 and Blackwell native FP4 support in principle make it a viable target for NVFP4 Gemma 4 inference at higher throughput than equivalent FP16 or Q4 GGUF builds. Important caveat: this testing is still in early stages; the commenter notes difficulty getting the NVFP4 model to work with MTP on Qwen before switching to Gemma, so the Gemma NVFP4 result is not yet benchmarked. Practical status: 5060 Ti + Gemma 4 26B NVFP4 is an active community experiment, not a confirmed recipe. Follow the club-5060ti GitHub repository for results as they publish. Confidence: low — single commenter, no benchmark numbers yet. (source, May 19, 20 score, 9 comments)

KV cache quantization at large context: Q4_0 quality loss is significant, Q5_1 is the recommended middle ground. Community consensus (May 17, 44 score, 91 comments) on KV cache quantization for developers using large context windows (50k+ tokens) is clear and consistent. The top response (score 44) is direct: "The quality loss at Q4 is pretty severe. I'd recommend the Q5_1 option instead, which was introduced relatively recently. Q8 for K and Q4 for V is another option." A second commenter (score 23) recommends "Model Q6 and up, context cache FP16." A developer-facing finding (score 21): "Lesser quant == more tool call errors. So it depends on harness and model, how good both of them at error recovering. If I can — I don't quantize cache." Practical guidance for Gemma 4 users: if you are running Gemma 4 31B or 26B-A4B at context lengths above 32k and using the model for structured output, tool calling, or multi-turn agentic tasks, avoid Q4_0 KV quantization. Q5_1 is the community-recommended minimum for quality preservation at large context; FP16 KV cache is the reliability ceiling but carries VRAM cost. The Q8K / Q4V hybrid is a middle-ground option if VRAM is the constraint. These findings directly apply to Gemma 4 26B-A4B MoE, which has the architectural capacity for large context but requires careful KV quantization choices to maintain coherence over long sessions. Confidence: community consensus across multiple practitioners. (source, May 17, 44 score, 91 comments)

Field Notes — 2026-05-19

A weekly synthesis of what the r/LocalLLaMA community is reporting about Gemma 4 in real use. Curated from the latest Gemma-mentioning posts (18 new or updated since 2026-05-18, 157 total) and their top comment threads. Confidence is medium unless noted, since this is community signal rather than a controlled benchmark.

May 19 sweep, 2026-05-19 00:00 EDT: six developments from this sweep deliver the first cross-hardware MTP numbers from mainline llama.cpp, surface a high-engagement coding-agent harness demo built on Gemma 4 E4B, add a third community fine-tune to the Gemma 4 31B ecosystem, report ROCm 7.13 Strix Halo optimizations with a real-world AMD stability confirmation, frame Gemma 4 31B as the competitive ceiling for the 27–31B class, and record a practical agentic-coding comparison showing where Qwen 35B-A3B currently edges Gemma 4 26B on one user's setup.

MTP in mainline llama.cpp — measured numbers across Strix Halo and RTX 3090 rigs. PR #22673 (commit 4f13cb7) is confirmed in mainline llama.cpp as of May 16. A community benchmark post (May 18, 39 score) measured Qwen3.6-27B performance on two rigs using `--spec-type draft-mtp --spec-draft-n-max N`: on a Strix Halo (Framework Desktop, ROCm 7.0.2), Q4_K_M went from 11.7 to 21.2 tok/s (1.81×) and Q8_0 from 7.4 to 18.1 tok/s (2.44×); on a single RTX 3090 at 450W (CUDA 12.9), Q4_K_M improved from 38.7 to 59.5 tok/s (1.54×); on a dual RTX 3090 layer-split, Q8_0 went from 25.7 to 55.9 tok/s (2.17×). For MoE comparison: Qwen 35B-A3B gained 1.40× on Strix Halo and 1.24× on the RTX 3090 — confirming the by-now well-established asymmetry where dense models benefit substantially and MoE models gain less because each forward pass is already cheap. These results transfer directly to Gemma 4: expect comparable gains on Gemma 4 31B Dense (the closest architectural equivalent to Qwen3.6-27B dense), and more modest gains on Gemma 4 26B-A4B MoE. The optimal `--spec-draft-n-max` sweet spot varies by rig: uncapped 3090 preferred n=2 at Q4; power-capped 3090 and Strix Halo preferred n=3. Output is described as byte-identical to baseline at the same seed and temperature. Confidence: structured benchmark with multiple rigs and runs; single hardware configuration per rig. (source, May 18, 39 score, 28 comments)

SmallCode: Gemma 4 E4B as a 4B coding agent achieves 87% on self-selected benchmark — harness design matters more than model size. A high-engagement post (May 18, 639 score, 306 comments) introduced SmallCode, a coding agent harness built from scratch for small local models, demonstrating 87/100 tasks passing with Gemma 4 E4B (which activates 4B parameters per token). The author's core insight: standard agents like OpenCode and Cursor assume large frontier models, causing small models to fail on multi-step tool chains. SmallCode compensates with three harness techniques — compound tools that bundle sequential file operations into a single call (cutting failures from multi-step coherence loss in half), an improvement loop that feeds compilation errors back automatically, and a task-decomposition fallback when the model fails twice in a row. The author claims OpenCode scores approximately 75% with 14B models on their benchmark, suggesting the harness closes a meaningful gap. Important caveats: the benchmark is self-selected and not reproducible against a standard suite; top community responses were pointed ("TrustMeBro-2.1-hard," "custom benchmarks is like marking your own homework"). A top comment with 126 score questioned why these improvements aren't integrated into existing tools like OpenCode or little-coder rather than creating another standalone agent. Practical takeaway: the harness techniques — compound tool bundling, lint-driven improvement loops, and decomposition on repeated failure — are generalizable regardless of implementation, and the post confirms Gemma 4 E4B is capable enough for agentic coding when the scaffold compensates for its coherence limits. Treat the benchmark numbers with appropriate skepticism. (source, May 18, 639 score, 306 comments)

Gembrain: third community fine-tune merges seven Gemma 4 31B variants — community reception is skeptical. LLMFan46 published GGUF packaging for Gemma-4-Gembrain-31B-it-uncensored-heretic (May 18, 34 score), created by Nimbz as a merge of seven Gemma 4 31B fine-tunes targeting improved logical and lateral thinking, adherence, prose variety, and creative output. KLD is 0.0186 with 13/100 refusals. Community response is more skeptical than prior heretic-line releases: a top comment (score 27) says "I don't ever trust these weird merged models"; a second (score 17) questions why merging a fine-tune that already merged the base model adds value; a third (score 12) challenges the "boost lateral thinking" claim with no published mechanism. The fine-tune author revealed an internal contradiction: the merge includes a model with 99/100 refusals — the same as the base Gemma 4 31B — which may explain the KLD not moving strongly. This is the third community fine-tune in the Gemma 4 31B ecosystem in one week (Ortenzya for prose, Meromero for creative breadth, Gembrain for thinking and variety); all three are published by the same packaging pipeline (LLMFan46 GGUFs). None have been systematically benchmarked against the base model. If you are experimenting with fine-tuned Gemma 4 31B for creative or reasoning tasks, these give you options to test, but treat confidence as low until independent evaluations appear. (source, May 18, 34 score, 29 comments)

ROCm 7.13 nightly: Strix Halo optimizations merged, RX 6800 Gemma 4 stability confirmed. AMD released ROCm 7.13 Tech Preview (May 17, 50 score, 24 comments) with dedicated optimizations for the Ryzen AI Max 300 "Strix Halo" and new support for additional APU and GPU SKUs including Ryzen AI 7 PRO 360, 350, and other gfx1152-class devices. A commenter with an RX 6800 (score 2) reported running Gemma 4 E2B, E4B, and 26B at various quantizations via lemonade-sdk/llamacpp-rocm "for months" without a single crash — a meaningful real-world stability data point for AMD discrete-GPU Gemma 4 inference under ROCm. A second commenter (score 14) notes ROCm offers better prompt processing than pure Vulkan with a minor generation throughput tradeoff. Caution: an early commenter found ROCm 7.14 preview running slower than expected on Strix Halo when using the latest llamacpp-rocm build, suggesting not all builds in the preview channel are stable. For Strix Halo users: the 7.13 Tech Preview available from TheRock on GitHub is the tested path; the 7.14 preview introduced a regression for at least one user. For AMD discrete GPU (RX 6800 class and newer) users running Gemma 4 via llama.cpp: the lemonade-sdk ROCm build is now reported stable for production use including MTP testing, with the caveat that `-np 1` may be required and mmproj handling needs attention for multimodal use cases. Confidence: community report, single-user stability confirmation; not a controlled benchmark. (source, May 17, 50 score, 24 comments)

Model release anticipation: community positions Gemma 4 31B as one of two competitive options in the 27–31B class. A widely-engaged discussion (May 18, 120 score, 71 comments) forecasting when new local models will drop surfaced a useful framing for Gemma 4's current market position. A top commenter (score 39) stated directly: "Qwen 3.6 27B dense raised the bar very high — only competitor (not for coding, of course) is Gemma 4 31B dense currently." Another commenter (score 41) specifically flagged "Gemma 4 123B or Qwen 3.6 122B would be huge," reiterating the community's ceiling aspiration documented in prior field notes. Several commenters referenced expectations for new Google releases in the following days, potentially at Google I/O. Practical context: Gemma 4 31B Dense is the community's shortlist pick for non-coding general quality at the 27–31B parameter tier. For coding and agentic tasks, Qwen 3.6 27B and 35B-A3B are consistently preferred. This positioning has been stable across the last two weeks of field notes. No product announcement signal exists for a larger Gemma 4 variant. (source, May 18, 120 score, 71 comments)

Agentic coding with 4090+5060Ti: Q8_0 Qwen 35B-A3B at 262k context edges Gemma 4 26B for demo work. A practitioner report (May 18, 29 score, 27 comments) compared Qwen 35B-A3B against Gemma 4 26B-A4B on an agentic coding workload — demo and data analytics scripts via Claude Code's API endpoint pointing to localhost. The author ran Q8_0 Qwen 35B-A3B on a 4090+5060Ti combination with 262,144-token context and reports it is "better than Gemma 4 26B" for their use case, though they note it underperforms in plain chat compared to agentic mode. Community recommendations push back on the Q8_0 KV cache setting: multiple comments (scores 10 and 9) recommend dropping to Q6_K_XL or similar model quant and using unquantized or FP16 KV cache for better coding quality. A notable data point from a top comment (score 10): 70 tok/s on dual RTX 3090 with Qwen 35B-A3B at Q8 and 196k context — useful headroom for context-heavy agentic sessions. Important context: this comparison uses models of different parameter counts (Qwen 35B-A3B activates only ~3.5B parameters per token as an MoE; Gemma 4 26B-A4B activates ~4B). The architectural comparison is approximate. For users with a single 4090 or 4090+5060Ti and primarily agentic coding or demo work at large context: Qwen 35B-A3B at Q6_K_XL or similar with unquantized KV cache is the community's current recommendation over Gemma 4 26B in that specific niche. For general-purpose or non-coding tasks, Gemma 4 26B or 31B remain competitive options. Confidence: anecdotal, single practitioner report. (source, May 18, 29 score, 27 comments)

Field Notes — 2026-05-18

A weekly synthesis of what the r/LocalLLaMA community is reporting about Gemma 4 in real use. Curated from the latest Gemma-mentioning posts (26 new or updated since 2026-05-17, 157 total) and their top comment threads. Confidence is medium unless noted, since this is community signal rather than a controlled benchmark.

May 18 evening sweep, 2026-05-18 02:00 EDT: three developments from this sweep expand the Gemma 4 community fine-tune ecosystem with a second heretic creative model, surface strong community demand for a larger Gemma variant, and document Gemma 4 running on Blackwell hardware as a catalyst for runtime migration.

G4-Meromero-31B: second creative fine-tune for Gemma 4 31B joins Ortenzya in the community ecosystem. Community developer zerofata released G4-MeroMero-31B-Uncensored-Heretic (May 17, 101 score, 49 comments), available via llmfan46's GGUF packaging on Hugging Face. Designed for creative tasks broadly, it complements the Ortenzya fine-tune (natural English prose) released the prior day by the same heretic-packaging pipeline. KLD is 0.0100 with 15/100 refusals. Community discussion highlights the two as potentially complementary: Ortenzya targets prose quality and natural English for translation and RP; Meromero targets creative breadth. A commenter asks directly how the two differ, and the fine-tune author notes Ortenzya improves prose naturalism while Meromero focuses on creative task coverage. Neither model has been systematically benchmarked against the base Gemma 4 31B; treat as community options to trial on your specific creative use case. Important context: the base Gemma 4 31B is already noted by community members as relatively permissive for creative tasks — these fine-tunes address tone and prose quality rather than unlock otherwise-blocked capabilities. Confidence: anecdotal — no head-to-head benchmark against the base model published. (source, May 17, 101 score, 49 comments)

Community appetite for a 124B Gemma is strong — Google shows no present signs of building one. A high-engagement post (May 17, 285 score, 51 comments) imagining a 124B Gemma model surfaced broad community agreement that such a model would be compelling — commenters frame it as "basically an open-weights version of Gemini Flash," which would be among the most capable locally-runnable models available. Top responses are skeptical Google will deliver: "There is no interest in doing that for them" (score 45) and "That would be awesome but I guess there is no interest in such a huge model" (score 34). A commenter jokes that any announced release will turn out to be a ShieldGemma variant. No product roadmap signal exists to support expectation of a 100B+ Gemma model. Practical context for users: Gemma 4's current ceiling is 31B dense (or 26B MoE, equivalent to approximately 4B active parameters per token). Users needing 100B-class locally-runnable models currently have Qwen 3.5 122B, Qwen 3.6 MoE, DeepSeek-V4 Flash, and similar options. This finding is editorial context rather than a hardware or performance data point — it captures the ceiling of current Gemma 4 availability and community aspirations for the model line. (source, May 17, 285 score, 51 comments)

Blackwell 5000-class GPU + Gemma 4 as a migration driver from Ollama to llama.cpp. A user with 64GB RAM and a Blackwell 5000-series GPU (identified as "backwell 5000" in post, likely RTX PRO 5000 or similar) running Gemma 4 and Qwen via Ollama and LM Studio asked for migration advice to get better speeds (May 17, 35 score, 73 comments). Community response: llama.cpp is the direct next step, offering fine-tuned control via flags and measurable throughput gains over Ollama's wrapper overhead. ik_llama.cpp was called "the best, not too hard tool" by a prominent commenter; vLLM wins on 4+ concurrent users but adds setup complexity. A late comment highlights lemonade-server as a drop-in Ollama-compatible endpoint with llama.cpp and vLLM backends. The hardware context adds a useful data point: Blackwell 5000-class cards (estimated 24–48GB GDDR7 depending on variant) running Gemma 4 are a real user segment upgrading from earlier Ollama-based workflows. This confirms that Gemma 4 is in active use by developers who are growing beyond simple Ollama deployments into tunable inference stacks, and that llama.cpp remains the community's first recommendation for that transition. (source, May 17, 35 score, 73 comments)

May 18 sweep, 2026-05-18 00:00 EDT: five developments from this sweep close the long-running open question on MTP's mainline llama.cpp status, deliver the first community benchmarks of officially-merged MTP across Strix Halo and RTX-class NVIDIA hardware, quantify the wall-time picture at production-scale context (85k tokens), and add a cross-platform decode-bandwidth comparison showing where each GPU tier wins on Gemma 4 model sizes.

MTP officially merged into llama.cpp mainline — PR #22673 approved by Georgi Gerganov. After weeks of testing in forks and patched builds, Aman Gupta's PR #22673 landed in llama.cpp's main branch, making Multi-Token Prediction (MTP) available to all users via a standard build without patching. Community reaction was celebratory, with the announcement post reaching 733 score. The key mechanism: MTP adds a lightweight draft head that predicts multiple tokens ahead; accepted drafts expand effective decode throughput at no quality cost (equivalent to standard generation when tokens are rejected). Important context from community commentary: MTP benefits are task-type dependent. Low-entropy outputs — code generation, math, structured text — see 67–90% acceptance rates and meaningful speedups. High-entropy outputs — creative writing, roleplay, diverse prose — see low acceptance rates and sometimes slower wall time due to the dual-prefill overhead. Practical note for immediate testers: at time of first community testing, the official Docker image for llama.cpp server-cuda had not yet picked up the merge; users wanting to test immediately need to build from source with `CUDA_DOCKER_ARCH` set for their GPU. The container will follow shortly. (source, May 16, 733 score, 236 comments)

Strix Halo MTP benchmarks: Qwen3.6-27B gains +111% generation speed; 35B-A3B gains are context-length dependent. The first systematic MTP benchmarks on Strix Halo hardware (Ryzen AI MAX 395, 128GB unified LPDDR5X) reveal a clear asymmetry between 27B dense and 35B-A3B MoE. For Qwen3.6-27B on a 15k-token single-turn task: generation rate went from 7.63 to 16.15 tok/s (+111%), but prompt processing slowed 12.5% due to the MTP head's dual-prefill pass; net wall time improved by 10 seconds (-11.5%). Over a 5-turn chat conversation (~28.5k cumulative context): generation improved +136% on average (7.61 to 17.98 tok/s), and turns 2-5 were 56 seconds faster overall (-26.5%) as the prompt-processing overhead amortized across turns. For Qwen3.6-35B-A3B (MoE): single-turn generation improved +16.5% (48→56 tok/s), but the dual-prefill overhead made wall time 2.33 seconds slower (+11.2%) on the same 15k-turn task. On 5-turn chat, the MoE was roughly tied (+2.3% slower). A post-publish update from the author tested ROCm 7.13 versus Vulkan: ROCm now shows +12% better prompt processing than Vulkan across all tested models — a meaningful reversal from earlier data. The pattern maps directly to Gemma 4: dense 31B benefits substantially from MTP across conversation length; MoE 26B-A4B gains less because its already-high base throughput means MTP overhead costs proportionally more. Confidence: single hardware setup, code-heavy synthetic prompts per community analysis. (source, May 16, 136 score, 57 comments)

RTX 3090 MTP at 85k context: PP halved, TG +85%, net wall time -41%. A real-world production data point from a headless RTX 3090 running Qwen3.6-27B-MTP-Q4_K_M at 128k context demonstrates the wall-time picture that per-metric numbers obscure. On an 85,000-token research task: without MTP, prompt processing ran at 1,050 tok/s and generation at 27 tok/s, completing in roughly 39 minutes. With MTP enabled (`--spec-draft-n-max 3`): prompt processing fell to 600 tok/s (-43%), generation rose to 50 tok/s (+85%), and the same task completed in roughly 23 minutes — a 41% time reduction. The key insight is that decode dominates wall time on large-context generation tasks, so even a significant PP regression translates to large net savings when TG meaningfully improves. This matches the Strix Halo multi-turn result: PP overhead matters most on the first turn of a fresh session; at 85k context with substantial output, the generation benefit compounds. Practical guidance: if your workload generates substantially more than it reads (high TG:PP token ratio), MTP is likely a net win despite the PP regression. Benchmark first if your workflow is PP-heavy — large document ingestion, RAG with many retrieved chunks, or very short output responses where TG never gets to compound. (source, May 17, 44 score, 37 comments)

RTX 5090 first-day MTP community testing confirms dense-vs-MoE asymmetry at the high-end tier. A controlled RTX 5090 MTP test (32GB, built from llama.cpp source commit 4f13cb7 with `CUDA_DOCKER_ARCH=120`, Unsloth Q5_K_M for 27B and UD-Q4_K_M for 35B-A3B, 128k context with flash attention and q8_0 KV cache) confirms what Strix Halo and RTX 3090 data show: dense 27B delivers large generation speedup; MoE 35B-A3B shows smaller fractional improvement because its high base throughput means MTP verification overhead costs proportionally more. A commenter reported 180 tok/s on dual 5060 Ti with MTP and parallel=2, confirming that parallel execution is now fully supported (an earlier limitation requiring parallel=1 is resolved). For Gemma 4 users: the architectural principles transfer directly — expect similar 2x+ generation gains on 31B Dense with MTP on code tasks; expect more modest or task-dependent gains on 26B-A4B MoE. (source, May 17, 203 score, 30 comments)

Multi-platform decode comparison: RTX 5070 beats RTX 3090 on sub-12GB models; 3090 wins on 14–31B band. A community benchmark (55 runs, 3 hardware platforms, 5 backends) compared Strix Halo ROCm, RTX 3090 CUDA, and RTX 5070 Vulkan across a range of model sizes. Key results for Gemma 4: the RTX 5070 (12GB GDDR7, Vulkan) outperforms the RTX 3090 (24GB GDDR6X, CUDA) on models that fit in 12GB — Gemma-4-E4B at 124.3 vs 118.4 tok/s. For models that require more than 12GB, the 3090 wins decisively: Gemma-4-26B-A4B scored 100.5 tok/s on the 3090 versus 43.7 (Strix ROCm) and 47.7 (Strix Vulkan). The Strix Halo systems are not competitive on models that fit in discrete VRAM but offer unmatched capacity for larger models neither discrete card can run at full quality. Community pushback flagged a methodology issue: the 5070 was benchmarked with Vulkan rather than CUDA, which may understate its performance margin over the 3090 on sub-12GB models. Practical guidance for Gemma 4 users: for E4B or E2B on a 12GB budget, the RTX 5070 generation rate is higher than the 3090; for 26B-A4B or 31B Dense, 24GB+ VRAM from the 3090 or higher is required for competitive speeds. (source, May 16, 34 score, 20 comments)

Field Notes — 2026-05-17

A weekly synthesis of what the r/LocalLLaMA community is reporting about Gemma 4 in real use. Curated from the latest Gemma-mentioning posts (12 new or updated since 2026-05-16, 148 total) and their top comment threads. Confidence is medium unless noted, since this is community signal rather than a controlled benchmark.

May 17 sweep, 2026-05-17 00:00 EDT: eight developments from this sweep surface a new fine-tune for creative writing and RP use cases, confirm a community-derived power-efficiency curve for multi-3090 inference rigs, validate Gemma 4 E4B's native audio transcription capability, extend MTP dense-vs-MoE evidence to million-token scale, document enterprise team deployment patterns, establish that thinking mode hurts translation tasks, add Terminal-Bench 2.0 context for Gemma 4 positioning, and resolve the GPU-vs-RAM debate for MoE inference.

Ortenzya: first quality creative writing fine-tune for Gemma 4 31B. Community developer LLMFan46 released `gemma-4-Ortenzya-The-Creative-Wordsmith-31B-it-uncensored-heretic` (May 16, 25 score), a Gemma 4 31B fine-tune targeting natural English prose quality, creative writing, translation fidelity, and RP use cases. Available in safetensors and GGUF formats. The fine-tune addresses a community-noted weakness in base Gemma 4 31B: while the model produces correct and concise prose, some users find the writing style lacks naturalness in extended creative or narrative output. Key community finding from the discussion: the base Gemma 4 31B is already "uncensored asf" for most creative use cases — the fine-tune's value is specifically prose quality and natural English, not primarily safety softening. A commenter notes the fine-tune also addresses "softening" (toned-down language without hard refusal), which matters for translation and RP tasks where the original source material has strong tone or content. Practical guidance: if you find base Gemma 4 31B too dry or stiff for creative writing, Ortenzya is the first community option to try. Confidence: anecdotal — no systematic quality benchmark against base model yet. (source, May 16, 25 score, 16 comments)

4x RTX 3090 power efficiency curve: 220W per GPU is the sweet spot — memory-bandwidth-bound decode confirmed. A systematic power-limit benchmark on a 4x RTX 3090 rig running Qwen3.6-27B at FP16 via vLLM TP=4 (May 15, 38 score; updated comments May 17) measured output generation speed and prompt-processing throughput across power limits from 200W to unrestricted (350–390W). Key result: reducing from unrestricted to 220W drops output generation from 29 to 27 tok/s while pushing efficiency from 0.77 to 1.13 tok/joule. Below 220W, both efficiency and throughput fall together (200W: 24 tok/s, 1.11 t/J). A top commenter (score 9) provided the architectural explanation: "output generation speed is flat from 300W down to 220W because decode is memory-bandwidth-bound, not compute-bound. 3090 GDDR6X bandwidth barely changes with power limit, so you hit the same ~29 t/s regardless." Prompt processing drops proportionally because prefill IS compute-bound. The hardware setup uses a PCIe Gen 3 bifurcated topology (x16/x8/x8/x4); the x4 slot is a known bottleneck, and a P2P driver patch (`github.com/aikitoria/open-gpu-kernel-modules`) that supports mixed NVLink/PCIe topologies was flagged but tested without improvement on this specific PCIe-3-limited rig. Practical guidance for multi-3090 (and general multi-GPU NVIDIA) inference: power-limiting to ~220W per card costs ~7% in output throughput but saves ~37% in power draw. The decode floor is bandwidth-limited, so power won't buy you output speed beyond ~300W. Test prefill-heavy workflows first — prefill does benefit from compute headroom and will degrade proportionally below ~250W. Anecdotal confidence (single setup, Qwen workload; principles are general). (source, May 15, 38 score, 54 comments)

Gemma 4 E4B confirmed for short multi-lingual audio transcription — not a Whisper replacement for long audio. A community practitioner report (May 12, 22 score, 9 comments) validated Gemma 4 E4B's native audio input for transcription. Key findings: E4B processes short audio clips accurately in multiple languages including foreign languages, without additional STT tooling. A top commenter confirms active use for voice assistant STT, noting the model's promptability is a practical advantage over fixed-vocabulary Whisper — you can instruct E4B to focus on specific terms, format output in a particular way, or filter filler words in the prompt. Practical limits: for audio exceeding roughly one hour, Whisper or a dedicated STT model remains necessary; E4B's context window constrains continuous transcription. Multiple commenters also noted that E2B may support the same audio input path via LiteRT-LM. This is the first direct community report of E4B transcription as a primary use case. Combined with the Jetson Orin NX SUPER finding (May 16 sweep), E4B is now documented as a viable complete voice pipeline component: STT natively, inference on-device, TTS via Piper or similar, with no cloud dependency. Confidence: anecdotal, small engagement, no controlled accuracy comparison against Whisper published. (source, May 12, 22 score, 9 comments)

MTP dense-vs-MoE finding confirmed at 1M token scale — dense 27B gains ~1.5x, MoE 35B gains under 10%. A practitioner who spent over 1 million tokens across three sessions building a pygame project with Qwen 3.6 MTP models (May 15, 127 score, 78 comments) directly confirms the MTP task-type dependency at production usage scale: the dense Qwen3.6-27B model with MTP gained approximately 1.5x tok/s; the MoE 35B-A3B gained less than 10%. A commenter adds a critical caveat: the test used `q4_0` KV quantization — already warned in earlier field notes to carry meaningful quality risk on long-context tasks. For Gemma 4 users: this is further confirmation that MTP is primarily valuable on dense models (Gemma 4 31B, Qwen 3.6 27B dense) and delivers marginal gains on MoE variants (Gemma 4 26B-A4B, Qwen 3.6 35B-A3B). The result has now been independently confirmed by the 300-test systematic analysis (May 10), the M4 Max measured results (code: 1.53x; prose: wash; JSON: 0.50x), and this million-token practitioner run. (source, May 15, 127 score, 78 comments)

Enterprise server for 7-person team: 2x RTX 6000 Blackwell MaxQ with Proxmox and vLLM — community recommends testing cloud first. A team setting up local inference for a 7-person company (May 15, 20 score, 58 comments) drew a substantive community discussion on small-team deployment patterns. The most-upvoted practical setup: a Gigabyte server with 2x RTX 6000 Blackwell MaxQ (~26k€), running Proxmox with an LXC container using Debian 13 + NVIDIA drivers + CUDA 13.2, serving Gemma 4 and Qwen models via vLLM. A key community concern: the commenter with this setup is running llama.cpp instead of vLLM on two 6000s — a top comment calls this "leaving so much performance on the floor." For multi-GPU inference of 30–35B class models, vLLM tensor-parallel is the right backend choice. The second-highest-voted response argues for API/rental first: "Use cases can quickly outgrow on-prem resources. Give people generic access, watch what they do for a month or two, then decide." A third pattern: a 1x RTX Pro 6000 with large RAM to run Kimi K2.6 for 1-2 power users who need a genuinely strong coding model. Hardware and architecture recommendations for small-team deployment: TP=4 vLLM on multi-GPU for 35B class; single high-VRAM GPU with large RAM for flexibility; validate use case demand before committing to on-prem hardware at this scale. Confidence: community discussion, multiple experienced practitioners, not a benchmark. (source, May 15, 20 score, 58 comments)

Thinking mode consistently hurts Gemma 4 translation — direct pass is preferred, two-pass is useful only for complex edge cases. Community consensus (May 13, 22 score, 17 comments) on using Gemma 4 for translation with thinking mode enabled is clear: thinking mode "wastes a lot of context thinking about it and also ends up overthinking it," and turning thinking off produces better results for direct translation tasks. A more nuanced practitioner approach from the comments: use a first pass at temperature 0 with no thinking for direct translation, then a second optional reasoning pass to review flagged segments, with KV cache prefix reuse on the second pass to minimize latency. A dedicated translation fine-tune (Qwen3-Translation, Tower) remains the community recommendation over generalist + thinking for high-volume or professional-quality needs. Practical guidance: disable thinking mode for Gemma 4 translation; reserve the optional review pass for idioms, jargon, or segments where you need explicit justification. This is consistent with the token efficiency picture — Gemma 4 is concise and direct, and adding thinking overhead to tasks that don't require multi-step reasoning adds cost without quality gain. (source, May 13, 22 score, 17 comments)

Terminal-Bench 2.0: Qwen 3.6 35B-A3B scores 24.6% and beats Gemma 4 31B on terminal coding — expected gap given dense vs MoE. The public Terminal-Bench 2.0 leaderboard now includes Qwen3.6-35B-A3B at 24.6% (±3.2) with the little-coder scaffold, placing it above Gemini 2.5 Pro on Gemini CLI (19.6%) and Qwen3-Coder-480B (23.9%). Community commentary (May 16, 243 score, 57 comments) is broadly positive but includes an important framing note: comparing Qwen 3.6 35B-A3B (MoE) against Gemma 4 31B Dense is not architecturally equivalent — the MoE uses 3.5B active parameters while the dense model uses all 31B. A commenter notes: "Gemma 4 31B is a dense model. Would not be fair to compare the Qwen MoE to it. The better comparisons would be between Qwen 27B dense and Gemma 31B." Gemma 4 31B has not yet been officially benchmarked on Terminal-Bench 2.0 as of this writing. For readers using Gemma 4 for terminal/agentic coding: this benchmark suggests Qwen 3.6 MoE leads on this specific leaderboard task; however, the community also consistently reports Gemma 4 produces higher-quality output per token on focused tasks (see the Packman benchmark and three.js creative coding findings). Neither model has a clean win across all coding task patterns. (source, May 16, 243 score, 57 comments)

GPU vs RAM debate: VRAM wins on throughput, but Gemma 4 MoE is the best case for high-RAM inference. A community debate (May 15, 63 score, 81 comments) on whether "rich RAM / poor GPU" is a viable strategy produced two clear data points. A practitioner with both 192GB RAM and a 5090 reports using RAM only for testing new models, avoiding it otherwise: "The speed gain is just too important for the too small gain on accuracy." A separate commenter (512GB across 128GB devices) notes that the Gemma 4 26B MoE and Qwen 3.6 27B dense models have changed the calculus, making 30B-class dense-equivalent quality achievable on consumer VRAM for the first time. The analytical breakdown by a third commenter: sub-7B models must be task-specific; 24–35B dense is the minimum for general-purpose quality; MoE in the 100B parameter class is viable at 128GB+ RAM with hybrid offload. The Gemma 4 26B-A4B MoE architecture — activating only 4B parameters per token — is explicitly identified as the strongest argument for the high-RAM approach: its MoE sparsity means CPU RAM throughput is not penalized as severely as a dense 26B model would be. For Gemma 4 users with a mid-range GPU (16–24GB) and 64–128GB RAM: the 26B-A4B with `--n-cpu-moe` offload is the architecture that most justifies the RAM-over-GPU strategy; the 31B Dense requires VRAM to run without significant throughput penalties. (source, May 15, 63 score, 81 comments)

Field Notes — 2026-05-16

A weekly synthesis of what the r/LocalLLaMA community is reporting about Gemma 4 in real use. Curated from the latest Gemma-mentioning posts (13 new or updated since 2026-05-15, 147 total) and their top comment threads. Confidence is medium unless noted, since this is community signal rather than a controlled benchmark.

May 16 sweep, 2026-05-16 00:00 EDT: three developments from this sweep surface the lowest-power embedded hardware data point to date for Gemma 4, extend context-length degradation evidence in the budget GPU tier, and confirm NVIDIA's own NVFP4 quantization path for Blackwell hardware.

Gemma 4 E4B confirmed working on Jetson Orin NX SUPER 16GB — 14–15 tok/s fully offline with 200ms cached TTFT. A community robotics project (May 15, 419 score, 61 comments) detailed a fully offline suitcase robot running Gemma 4 E4B at Q4_K_M via llama.cpp with q8_0 KV cache and flash attention, 12K context, on a Jetson Orin NX SUPER 16GB. Sustained generation: 14–15 tok/s. Cached TTFT: ~200ms after a prompt structure optimization that moved persona and tool definitions to the top of the system block, history to the middle, and volatile sensor/vision data to the bottom of the most recent user turn — a disciplined ordering that kept the prefix cache stable and dropped TTFT from multi-second to 200ms. A key benefit observed: Gemma 4's native vision capability eliminated the separate BLIP subprocess required in prior versions, simplifying the pipeline. The author uses SenseVoiceSmall for STT and Piper for TTS; all inference runs on-device with no network interface. This is an anecdotal single-device report, not a reproducible benchmark. The Jetson Orin NX SUPER 16GB is a specialist embedded GPU with ~204 GB/s memory bandwidth; expect similar results on comparable Jetson-class hardware, and lower results on Orin NX 16 (not SUPER). (source, May 15, 419 score, 61 comments)

Long-context throughput and quality degradation on the $200 GTX 1080 setup — short-context numbers don't transfer. New community comments on the budget GTX 1080 inference guide (now 97 score, 49 comments as of May 16) quantify two independent degradation curves for the 8 GB VRAM / 32 GB RAM + Gemma 4 26B-A4B setup. Throughput: tok/s drops from ~30 at 4k context to ~20 at 50k, matching the expectation that KV cache fills VRAM and forces more expert weights to page over PCIe. Quality: a separate commenter reports retrieval-heavy tasks degrade meaningfully past 32–64k context, well before the advertised 128k limit — the visible tok/s curve is not the only performance cliff. The commenter's framing: "there's a quieter second curve underneath" where output quality erodes on retrieval tasks even as generation speed appears acceptable. This tightens the practical context guidance: the GTX 1080 + TurboQuant setup is usable at 4–16k context for routine chat and code; treat 32k+ as experimental territory where output reliability is unconfirmed. The MTP fix (`--override-tensor-draft "token_embd\.weight=CUDA0"`) and prefill speedup (`-ub 4096+`) remain valid tuning regardless of context length. (source, May 13, 97 score, 49 comments)

NVIDIA released its own NVFP4 quantization of Gemma 4 26B-A4B for Blackwell GPUs. NVIDIA published `nvidia/Gemma-4-26B-A4B-NVFP4` on Hugging Face (post 1t0i18e), a first-party NVFP4 quantization targeting the RTX 5090 (SM120, Blackwell). NVFP4 is a GPU-native 4-bit floating point format specific to Blackwell and newer NVIDIA architectures; it is not GGUF Q4 and does not run on older consumer hardware. A separate community report from a Radeon 9060 XT 16GB user achieved 25.9 tok/s on an IQ4_NL GGUF of the same model via llama.cpp, providing a comparable data point from the AMD side (anecdotal, single report). Practical guidance: if you have an RTX 5090, the NVIDIA NVFP4 model is worth testing over AWQ-4bit for throughput; if you are on older NVIDIA or AMD hardware, standard GGUF quantizations remain the mainstream path. The RTX 5090 DFlash speculative decoding benchmark from May 8 (600 tok/s peak) used an AWQ-4bit model, not NVFP4; NVFP4 throughput comparisons have not yet been published by the community. (source, score 32, 11 comments)

May 15 sweep, 2026-05-15 00:00 EDT: two developments from this sweep refine KV cache quantization guidance for vLLM serving and extend the budget GPU picture with new community benchmark methodology discussion.

FP8 confirmed as the best KV cache quantization default for vLLM — TurboQuant variants offer a VRAM tradeoff, not a free lunch. A first comprehensive study of TurboQuant against BF16 and FP8 in vLLM (May 14, 64 score, 17 comments; source article) settles a frequently debated question for constrained-VRAM Gemma 4 deployments. Key conclusions: FP8 via `--kv-cache-dtype fp8` provides 2x KV cache capacity with negligible accuracy loss — it matches BF16 on most throughput and latency metrics while meaningfully improving them when VRAM is the binding constraint. TurboQuant k8v4 provides only 2.4x compression (vs FP8's 2x) but consistently degrades throughput and latency; the marginal extra compression is not worth the performance cost. TurboQuant 4bit-nc is more practical: it helps under severe VRAM pressure but trades accuracy, latency, and throughput. TurboQuant 3bit variants show meaningful accuracy drops on reasoning and very long-context tasks. A commenter notes that FP8 KV numbers "are obviously worse" compared to unquantized — users with ample VRAM should keep KV cache unquantized; FP8 is the right default only when VRAM is genuinely constrained. A second commenter provides a reassuring data point: running Gemma 4 at 128k context with TurboQuant 2-3 in a production-style load (large PDF ingestion) produced coherent answers across beginning, middle, and end of the document. These TurboQuant results apply specifically to vLLM with its PagedAttention KV management; llama.cpp's TurboQuant/RotorQuant KV implementation behaves differently and should be benchmarked separately. Critical caveat: the study benchmarks only FP8 and TurboQuant variants; no Q4 comparison is included, drawing criticism that the study misses the primary VRAM-constrained use case. (source, May 14, 64 score, 17 comments)

GTX 1080 Gemma 4 guide attracts community discussion on long-context benchmarking methodology. The May 13 budget inference guide (score climbed from 46 to 97, now 47 comments) prompted a useful community exchange about how to properly evaluate large-context performance. The original benchmarks used small prompts (under 2,000 tokens) despite reserving 128k context. New comments recommend using a large Reddit thread (40k+ tokens in JSON or markdown) as a more realistic long-context stress test — common domain content not baked into training data. The guide author is investigating a standardized benchmarking approach. Practical implication: the 20–24.5 tok/s figures for the GTX 1080 setup should be treated as short-context baselines only; actual throughput at meaningful long-context prompts will be lower because KV cache fills VRAM and forces more CPU round-trips. The `--override-tensor-draft "token_embd\.weight=CUDA0"` MTP fix remains valid regardless of prompt length. (source, May 13, 97 score, 47 comments)

May 14 sweep, 2026-05-14 00:00 EDT: five developments from this sweep extend the budget hardware picture for Gemma 4 MoE, surface practical GPU power tuning, confirm prefill tuning for partially-offloaded models, and add new guidance on vLLM vs llama.cpp for single-user workloads.

Gemma 4 26B-A4B running at ~24 tok/s on a $200 secondhand GTX 1080 machine — a new floor for budget inference. A detailed guide (May 13, 46 score) demonstrates Gemma 4 26B-A4B and Qwen 3.6 35B-A3B running on an i7-6700 / GTX 1080 (8 GB VRAM) / 32 GB RAM machine costing ~$200 secondhand via llama.cpp with TurboQuant/RotorQuant KV cache quantization. Results at Q4_K_M with 128k context: Gemma 4 26B-A4B (no MTP) ~20 tok/s with `--n-cpu-moe 20`, TurboQuant KV turbo3 on both K and V caches; after fixing the MTP token embedding table placement, ~24.5 tok/s with `--override-tensor-draft "token_embd\.weight=CUDA0"`. The key mechanism: TurboQuant/RotorQuant KV cache compression fits the KV cache within 8 GB VRAM even at 128k context, while `--n-cpu-moe` offloads the cold MoE expert weights to system RAM, streaming them over PCIe as needed. The GPU sits at ~40-50% utilization; the bottleneck is PCIe bandwidth. Important caveat from the post: the GTX 1080 test used small prompts (under 2,000 actual tokens despite 128k reservation); a commenter notes that larger real-world prompts at 128k context will degrade throughput further as VRAM is tighter with large KV. MTP barely helped out of the box (~5% gain) because Gemma 4's tied LM head forces token embedding lookups on the CPU by default; the fix flag above moves the embedding table to GPU. This is an anecdotal data point, not a reproducible benchmark baseline, and TurboQuant is not in mainline llama.cpp. But directionally, a ~$200 machine can now run a 26B MoE at interactive speeds — a meaningful lower bound for the local Gemma 4 story. (source, May 13, 46 score, 10 comments)

Cut GPU power limit to 40% TDP — no throughput loss for LLM decode, meaningful savings on power, heat, and noise. A viral post (May 12, 709 score, 198 comments) benchmarked an RTX 4090 running Qwen3.6-27B-UD-Q4_K_XL with `nvidia-smi -pl` set to various power limits. Result: reducing to approximately 40% of rated TDP (~100W for a 4090) preserves generation throughput almost identically while cutting electricity draw, heat output, and fan noise proportionally. Multiple RTX 5090 owners in the comments independently validated the finding at their own hardware (860mV/2500MHz, ~360W, with only ~12% TPS loss at the absolute voltage floor). The mechanism: LLM decode is memory bandwidth bound, not compute bound. Once the GPU's memory bus is the bottleneck, reducing compute frequency and voltage has minimal effect on bandwidth-limited operations. The result holds for any consumer NVIDIA GPU running inference workloads including Gemma 4. Practical guidance: reduce power limit incrementally with `nvidia-smi -pl` and monitor generation speed — you can reclaim meaningful electricity savings at almost no quality cost. This is a well-established finding now backed by community data across multiple GPU generations. (source, May 12, 709 score, 198 comments)

Raising llama.cpp `-ub` to 4096-8192 gives ~5.5x prefill speedup for partially CPU-offloaded MoE models. A guide (May 12, 112 score, 53 comments) discovered that increasing the micro-batch size (`-ub`) from llama.cpp's default 512 to 4096 or 8192 dramatically improves prompt processing throughput for `--n-cpu-moe` partially-offloaded models. Measured on an RTX 3090 with a 120B model: prompt processing improved from ~380 tok/s at default `-ub 512` to ~2091 tok/s at `-ub 8192` — a ~5.5x gain. Generation speed was nearly unchanged (32.3 → 30.1 tok/s, ~7% regression). The mechanism, debated in comments: either amortizing PCIe transfer overhead across more tokens (reducing per-transfer round-trip cost) or reducing GPU kernel launch overhead by saturating the attention/router on fewer, larger batches. Both explanations are consistent with the observation. The default 512 exists because it's a safe conservative value for low-VRAM cards that have little headroom for compute workspace spikes. Users with spare VRAM should tune upward and stop when either VRAM OOM or generation speed starts to regress. This applies directly to Gemma 4 26B-A4B when partially offloaded — pair with `--n-cpu-moe` adjustment to keep the run within VRAM at the chosen `-ub`. (source, May 12, 112 score, 53 comments)

vLLM vs llama.cpp for single-user workloads: confirmed equivalent at low concurrency, vLLM wins at 4+ concurrent users. A community discussion (May 12, 75 score, 91 comments) produced a clear practical consensus. vLLM adds meaningful value when: (1) concurrent batch inference is in play — vLLM allocates VRAM per-batch as context grows while llama.cpp must pre-allocate max-context KV VRAM at launch; (2) tensor-parallel multi-GPU/multi-node serving is needed (e.g., Qwen 397B across two DGX Sparks). vLLM also supports MTP for Gemma 4 and Qwen3.6 already, while llama.cpp MTP is still in a patched fork. For single-user non-batched local use, llama.cpp remains simpler with equivalent per-query throughput. CUDA prompt processing is faster in vLLM regardless of batch size. AMD Lemonade now ships vLLM ROCm as a built-in experimental backend. This confirms and sharpens the earlier guidance: if you are a solo user running interactive chat or coding sessions, llama.cpp or LMStudio is fine; switch to vLLM when you need to serve multiple concurrent users or run tensor-parallel inference on model weights too large for one GPU. (source, May 12, 75 score, 91 comments)

Docker images simplify llama.cpp MTP deployment — confirmed +34% throughput on RTX 3090. A community developer (May 13, 63 score, 16 comments) released Docker images pre-built from the llama.cpp MTP development branch, removing the barrier of building from source. A commenter reports +34% throughput gain on an RTX 3090 after switching. The images track recent MTP branch improvements including image support and bug fixes. A commenter asks whether Gemma 4 is supported; the Docker images cover the same model classes as the underlying MTP PR (primarily Qwen3.6 for now). For Gemma 4 MTP, the mainline llama.cpp PR #22673 is still in review; until it merges, the AtomicBot-ai patched fork remains the llama.cpp path for Gemma 4 MTP specifically. Recommended flag addition from comments: `--min-p 0.0` (default 0.1 can interfere with speculative decoding). (source, May 13, 63 score, 16 comments)

May 13 sweep, 2026-05-13 00:00 EDT: five developments from this sweep extend the MTP vs DFlash picture, surface a supply chain signal for Apple Silicon buyers, add new practical limits for Gemma 4 E4B in code use cases, and document a home-server hardware comparison from someone who owns both the Strix Halo and DGX Spark.

First controlled head-to-head benchmark of Gemma 4 MTP vs DFlash on a single H100 — MTP wins at concurrency. A community benchmark (May 12, 62 score, 22 comments) ran Gemma 4 31B Dense and 26B-A4B MoE against both MTP and DFlash on a single H100 80GB using vLLM and NVIDIA's SPEED-Bench dataset (880 prompts, 11 categories). Results for 31B Dense: at concurrency 1, MTP hit 125.3 tok/s (3.11x over baseline 40.3) and DFlash hit 122.1 tok/s (3.03x). At concurrency 16, MTP reached 953 tok/s versus DFlash's 725 tok/s versus baseline 375 tok/s — a meaningful gap in favor of MTP at higher concurrency. The architectural explanation from commenters: DFlash generates a larger speculative batch via diffusion but has lower acceptance rate per token; MTP is autoregressive with higher per-token acceptance, so at scale its advantage compounds. Practical guidance: at concurrency 1 the two methods are nearly equivalent; at concurrency 4+ for serving multiple users, MTP outperforms DFlash by a widening margin. This is the first benchmark to quantify the concurrency dimension — prior guidance focused on single-user latency where both methods were close. DFlash's lower acceptance rate with the diffusion-based approach means more compute spent on rejected tokens under load. Still vLLM-only for both methods; no mainline llama.cpp path yet. (source, May 12, 62 score, 22 comments)

Apple removes M3 Ultra 256GB Mac Studio — M5 expected, but supply chain is under stress. Apple pulled the M3 Ultra 256GB Mac Studio configuration from its online store (May 9, 462 score, 132 comments). The top community read: M3 is being phased out ahead of an M5 Mac Studio launch, not a deliberate memory cap decision. Technical context: M3 and M5 use incompatible DRAM types (LPDDR5-6400 vs LPDDR5x-9600), so M3 chip stock is not convertible to M5 builds. An independent complicating factor: a Samsung DRAM worker strike cut production capacity by 58% on one shift. Community concern about M5 Ultra memory configurations is real but largely speculative — no M5 Ultra specs have been announced. Practical impact for Gemma 4 Apple Silicon users: the M3 Ultra 256GB, which was the best available option for running Gemma 4 31B Dense at full BF16 precision with context headroom, is no longer orderable. Anyone actively planning a high-memory Apple Silicon build for Gemma 4 should wait for M5 Ultra pricing and configuration announcements before committing. The 192GB M3 Ultra (if still available) or used M2 Ultra 192GB remain the current options if you need maximum unified memory now. (source, May 9, 462 score, 132 comments)

Gemma 4 E4B produces poor results for code autocomplete (infill) — use Qwen 2.5 Coder 7B instead. A practitioner post (May 12, 36 score, 30 comments) sharing a working RTX 5080 16GB + 64GB RAM coding setup explicitly evaluated Gemma 4 E4B for code autocomplete infill alongside Qwen 3.5 9B/4B. The author's conclusion: E4B and the Qwen 3.5 small models "produce weird suggestions" for infill and were rejected in favor of Qwen 2.5 Coder 7B Q6_K_L, which runs at instant-feeling speeds on 8GB VRAM. The same setup uses Qwen 3.6 35B-A3B at Q8 for agentic coding tasks (the higher quant is important; the author notes Q4 is not usable for agentic work). This is the first practitioner report directly comparing Gemma 4 E4B against alternatives for the code autocomplete fill-in-the-middle (FIM) use case. Confidence: anecdotal, single data point. But it aligns with the known limitation that E4B's instruction-following strength does not automatically transfer to the FIM pattern, which requires a different training signal. The guidance: do not assume E4B works for code infill — test it on your IDE and task type before committing. (source, May 12, 36 score, 30 comments)

"Decoupled Attention from Weights" for Gemma 4 26B: community verdict is skeptical. A post (May 6, 40 score, 27 comments) announced a technique to "split attention (a couple of GB) onto local machine and weights onto a cheap Xeon" for Gemma 4 26B, with a GitHub repository (larql/vindex). Community response was immediate and critical. Top comments: the technique is reported to run approximately 23x slower than standard inference; the underlying mechanism is equivalent to llama.cpp's existing RPC multi-node functionality with network latency added; sequential layer dependencies prevent any parallelism benefit from splitting attention vs. weights. The post author acknowledged the concerns and withdrew from further claims pending personal experimentation. The technique remains unvalidated as a practical inference improvement. This matters for lab readers who may have seen the post circulate with excited framing: there is no new local-inference breakthrough here. For distributed inference, llama.cpp RPC and vLLM expert-parallel deployment are the established options. (source, May 6, 40 score, 27 comments)

Strix Halo 128GB vs DGX Spark for home Gemma 4 inference — owner of both says Spark wins on throughput, degrades less at long context. A community question post comparing the Framework Desktop (Ryzen AI Max+ 395, 128GB unified memory, $3,388) against the Asus Ascent GX10 DGX Spark ($3,500) for running Gemma 4 31B and 26B-A4B as a local LLM server drew 91 comments (May 11, 21 score). The decisive data point: a commenter (score 31) who owns both systems reports "Spark has much faster GPU which results in faster prompt processing speeds. Also, the performance degrades less on Spark as context grows." Community consensus aligns on a clear split: Spark for pure LLM inference; Strix Halo for general-purpose or hybrid workloads where repurposability (standard x86/amd64 Linux, GPU gaming, everyday tasks) matters. The counterpoint for Strix Halo from a top commenter (score 41): "Definitely Ryzen 395, as it's a standard x86/amd64 machine that can always be repurposed and will never lose drivers or compatibility with new operating systems. Nvidia on the other hand has a history of abandoning their proprietary ARM SoC." DGX Spark runs ARM Ubuntu with a DGX software package; the same commenter who owns both notes Fedora also works with some tweaks. Practical guidance for anyone in the $3,400–$3,500 range targeting Gemma 4 31B: the DGX Spark delivers faster discrete GPU throughput and better long-context scaling; the Strix Halo 128GB unified memory trades some raw inference speed for a more flexible, repurposable machine. Neither is a clear wrong choice; the tradeoff is inference specialization vs. general-purpose longevity. Anecdotal confidence; the owner-of-both data point is the strongest signal. (source, May 11, 21 score, 91 comments)

May 12 sweep, 2026-05-12 00:00 EDT: four findings from this sweep extend the inference backend, edge deployment, and small-model picture.

ExLlamaV3 gains Gemma 4 support and DFlash — up to 2.51x coding speedup on consumer NVIDIA GPUs. ExLlamaV3, the successor quantization and inference engine from turboderp, has reached a run of rapid updates directly relevant to Gemma 4. Version 0.0.29 added Gemma 4 model support; version 0.0.31 added DFlash speculative decoding with measured results (from the post, testing on RTX 3090 and 4090): coding tasks 55.98 → 140.61 tok/s (2.51x), agentic code 55.98 → 140.61 tok/s (2.51x), translation 58.11 → 75.73 tok/s (1.30x), creative writing 59.10 → 89.19 tok/s (1.50x). Version 0.0.32 added further model optimizations. ExLlamaV3 requires RTX-class Nvidia CUDA hardware. Unlike the vLLM DFlash path (server-only), ExLlamaV3 is accessible via a Python API suited to single-user setups. The coding speedup is consistent with the vLLM DFlash benchmark (2.56x on RTX 5090); creative writing DFlash improvement is smaller but still positive, unlike MTP which can slow creative tasks. For single-GPU Nvidia users who want DFlash without a full vLLM server deployment, ExLlamaV3 is now a viable path. Confidence: the throughput numbers come directly from the community post; the comparative claim vs vLLM requires independent verification. (source, May 11, 141 score, 61 comments)

Gemma 4 E4B confirmed best-in-class at the 2–4B tier — but quantization quality matters significantly. A community thread asking "what's the current best small model?" (May 11, 26 score) drew strong consensus: Gemma 4 E4B is the top recommendation at the ~3B parameter class, with multiple independent reporters calling it "hands down the best, no arguing." A first-hand practitioner report adds an important caveat: Q8_0 quantization is "kinda bad and mid" for E4B — Q8_XL or BF16 is "night and day" better on tested tasks. A separate commenter confirms E4B "never loops" and "effectively uses the whole 131k context window" — the zombie-loops pattern documented in earlier field notes for larger quantized models does not appear on E4B. The consensus best competitors for the 3B class are smollm3, Granite 4.1, LFM2/2.5, and Qwen 3.5 4B. Community read: Gemma 4 E4B for general instruction following; Qwen 3.5 4B for tasks where a reasoning chain is needed. If running E4B, Q8_XL or BF16 is strongly preferred over Q8_0. Anecdotal confidence. (source, May 11, 26 score, 44 comments)

First documented in-browser Gemma 4 deployment controls a physical robot over WebSerial. A community developer shared a demo of Gemma 4 running fully offline in a browser via Transformers.js on WebGPU, processing camera frames and sending commands to a Reachy Mini robot over the WebSerial API (May 11, 49 score). The model never contacts a server: inference happens entirely on the client GPU via WebGPU, and motor commands go directly over USB/serial via the browser's WebSerial interface. A commenter notes the architectural benefit: "model sees camera/frame state, JS does the motor command, nothing leaves the machine." This is the first documented Gemma 4 use case in the browser-as-inference-engine + physical-actuator pattern, enabled by Transformers.js and the small footprint of Gemma 4 E-series models. The specific variant was not named; the constraint is WebGPU VRAM, which limits practical options to E2B or E4B. No throughput figures were published; treat as a proof-of-concept rather than a production guidance baseline. (source, May 11, 49 score, 9 comments)

Practitioner pattern: Gemma 4 26B for quick interactive fixes, Qwen 3.6 35B for long-context refactoring. A high-engagement discussion thread on Qwen 3.6 35B-A3B (May 11, 333 score, 103 comments) contains a direct Gemma/Qwen split from a practitioner who runs both: "Gemma 26B in thinking mode for quick code fixes and chats, Qwen 35B in thinking mode for longer contexts and refactoring. Qwen 35B rambles on and on before it spits out the final output so I only use it for tasks that I don't mind waiting for." This two-model hybrid pattern — Gemma 4 for latency-sensitive interactive tasks, Qwen 3.6 for depth-first long-context work — is now documented by multiple independent practitioners across several weeks of field notes. The pattern holds whether the user prioritizes speed, quality, or token efficiency: Gemma 4 26B finishes short tasks fast and concisely; Qwen 3.6 35B is more thorough but verbose. A second data point from the same thread: an RTX 3090 24GB + 64GB RAM user (Beelink eGPU dock) reports Qwen 3.6 35B-A3B "blazing fast" with llama.cpp after tuning settings, switching from LM Studio, with Gemma 4 26B as the secondary model for interactive chat. (source, May 11, 333 score, 103 comments)

May 11 sweep, 2026-05-11 00:00 EDT: four findings from this sweep extend the MTP and creative-coding picture.

MTP task-type dependency confirmed by systematic 300-test analysis — dense models benefit far more than MoE. A careful benchmark author published the most rigorous community MTP analysis to date (May 10, 67 score, 24 comments). Over 300 test runs covering four task types, five quantization levels, three temperature values, and two MTP quant settings produced a clear finding: F16 + MTP nearly triples coding-task speed; Q4_K_M + MTP slows creative writing output. Temperature and MTP quant have negligible impact; task type is the only factor that matters. An RTX 5090 user in the comments reported ~70% acceptance rate for coding tasks at --spec-draft-n-max 4, with 70–120 tok/s sustained at 70–160k context on Q6. Expert commentary confirms the MoE penalty: MoE models like Gemma 4 26B-A4B must cycle through more experts per speculative token than dense models, so the overhead is proportionally higher — a Radeon AI Pro 9700 user saw prompt-processing speed drop from 1,400 tok/s to 650 tok/s after enabling MTP. Dense models (Gemma 4 31B, Qwen 3.6 27B full) are the primary beneficiaries; for MoE variants, MTP helps only on coding tasks with high acceptance rate. Practical rule: benchmark before assuming MTP helps on your specific workload. (source, May 10, 67 score, 24 comments)

Gemma 4 26B-A4B excels at one-shot creative coding tasks where Qwen consistently falls flat. A practitioner shared an automated three.js prompt cycling test (May 10, 38 score, 23 comments): a Python app cycles through 80 creative-coding prompts, generates single-file HTML/WebGL outputs, detects crashes, and archives the results. Gemma 4 26B-A4B one-shot generation quality was consistently high on 3D graphics and demoscene-style effects. The same author states Qwen 3.6 "falls flat on its face for just about anything I throw at it" in the creative context. A third commenter summarizes the emerging community consensus: "Gemma has more personality to it; Qwen is better for facts and coding." This creative-coding strength is now documented by at least two independent practitioners — the Packman racing game comparison thread (May 9) and this three.js cycling tool — and represents a consistent divergence from Qwen's strengths. For creative coding and single-file generative output, Gemma 4 26B-A4B appears to be the stronger local option at the 26–31B weight class. (source, May 10, 38 score, 23 comments)

vLLM ROCm added to Lemonade as experimental AMD backend — community wants Gemma 4 MTP support. AMD engineer jfowers announced the integration of vLLM ROCm into the Lemonade SDK as an experimental backend (May 8, 433 score, 90 comments). Installation is now two commands: `lemonade backends install vllm:rocm` followed by `lemonade run `. The post drew one of the highest-score community engagements of the week. Notably, the community immediately asked about Gemma 4 with MTP in vLLM ROCm — signaling that MTP for Gemma 4 on AMD GPUs is an active interest. A portable standalone vLLM executable for AMD is also now available. The top comment thread included a pointed message to AMD relayed internally by jfowers: "The reason CUDA is the industry standard is that Nvidia made it their mission to provide the same support to everything in their hardware portfolio." An AMD engineer confirmed the message was sent to management. For Gemma 4 users on AMD hardware, vLLM ROCm in Lemonade is now the cleanest path to vLLM's speculative decode and safetensors-native model support without manual CUDA replacement. (source, May 8, 433 score, 90 comments)

Gemma 4 for language learning: correction-loop prompting pattern works; SillyTavern multi-character setups in active use. A language-learning thread (May 9, 23 score, 19 comments) surfaces a practical deployment pattern for Gemma 4 in education. The most-upvoted comment describes a correction loop: the model answers in three lanes (reply in target language, grammar correction, and explanation of why) while only marking one grammar error and one phrasing suggestion per turn to prevent homework-session overload. One commenter has been using Gemma 3 and then Gemma 4 continuously for German practice, noting it handles verb separation (Trennbare Verben) imperfectly but is broadly helpful for vocabulary connections across Romance languages. A SillyTavern multi-character practitioner reports actively using LLMs for Arabic, French, Portuguese, and Spanish practice across multiple character personas. Gemma 4's instruction-following fidelity — its consistent ability to stay in the target language and maintain a role when prompted — is what makes this use case work. No hardware specifics were shared, suggesting this is primarily a quantized local model use case compatible with standard consumer hardware. (source, May 9, 23 score, 19 comments)

May 10 re-check, 2026-05-10 01:00 EDT: three developments from this sweep reinforce and extend findings from May 9.

Practitioner survey confirms use-case split: Gemma 4 for instruction-following, prose, and games; Qwen for code. A second independent use-case thread (May 9, 20 score, 42 comments) drew direct practitioner reports of what they reach for Gemma 4 specifically. Common answers: generating narrative responses for NPCs in video games (E2B is cited here explicitly), writing PRDs and product specification documents using Gemma 4 31B and then handing implementation to Qwen, and structured tasks where instruction-following fidelity matters more than raw reasoning depth. The most-cited single-sentence summary from the thread: "best instruction-following of any open-weight model I've tried." This is the second large practitioner survey in as many weeks — after the 94-score May 6 thread — reaching the same structure: Gemma 4 is the answer when the task is open-ended instruction compliance or voice/tone matching, and Qwen is the answer for multi-turn agentic code execution. The split is now documented from two independent data points with combined 114 score and 169 comments. (source, May 9, 20 score, 42 comments)

MTP in llama.cpp: Georgi unifying speculative decode architecture before any merge lands. A thread asking how long until official llama.cpp MTP support (May 9, 68 score, 46 comments) surfaced a clarification from Georgi Gerganov: he is building a unified speculative decode architecture that covers MTP, Eagle3, and DFlash together — rather than merging each independently. All three methods will land in one correct implementation rather than piecemeal patches that create technical debt in the speculative decode path. This explains why PRs like #22673 (Gemma 4 MTP) and #22105 (DFlash) have been slow to merge despite being functional. No timeline was given; this is active in-progress work, not a planned milestone. Users who need MTP now should use the AtomicBot-ai patched fork (TurboQuant path) or the omlx runtime on Apple Silicon. The unified refactor, when it lands, should give llama.cpp native parity with vLLM on speculative decode across all three methods simultaneously. (source, May 9, 68 score, 46 comments)

Practical deployment: Gemma 4 on Mac Mini drives MCP server at full interactive speed. A first-person report (May 9, 29 score) confirms that Gemma 4 running on a Mac Mini runs fast enough to serve as the backend for a Model Context Protocol server at full interactive speed — with native tool calling, at zero cloud API cost. This is a concrete production data point for the "Gemma 4 as a free local MCP backend" deployment pattern: the model's tool calling quality and throughput are sufficient for MCP server workloads on current consumer Apple Silicon hardware. No hardware specifics (exact chip, RAM size, model variant, quant) were disclosed, so treat the speed claim as an existence proof rather than a precise benchmark target. The finding is consistent with the broader practitioner picture: Gemma 4 at the right hardware tier delivers cloud-grade instruction following with no recurring API cost. (source, May 9, 29 score)

May 9 re-check, 2026-05-09 01:00 EDT: six significant developments from this sweep.

DFlash for Gemma 4 26B MoE is live — 2.56x speedup in vLLM, 600 tok/s peak on RTX 5090. z-lab released gemma-4-26B-A4B-it-DFlash a few days ago; community benchmarks hit the site on May 8. A controlled vLLM benchmark (RTX 5090 32GB, vLLM 0.19.2rc1) measured baseline 228 tok/s → 578 tok/s at num_speculative_tokens=13 (2.56x speedup) on a 256-input / 1024-output random workload at concurrency 1. Optimal tuning: max_num_batched_tokens=8192 gave the cleanest p95 tail at that speculation depth, with mean E2E latency dropping from 4455ms to 1738ms. Critical community caveat: DFlash drops sharply at approximately 20k context. One commenter testing the same 5090 at 35k context reports speed starting at 400 tok/s but dropping quickly to 200 tok/s and continuing to degrade, with malformed tool calls. For short-to-medium context inference this is a compelling gain; for long-context agentic workloads it is not yet practical. On the DFlash vs MTP comparison: DFlash uses stateful parallel block diffusion drafting with persistent KV cache positions; Gemma 4's MTP implementation uniquely reuses the main model's KV cache, avoiding the memory pressure that afflicts MTP on other architectures. Both require vLLM for DFlash or a patched llama.cpp fork for MTP — no merged mainstream path exists yet. (DFlash benchmark, 99 score; DFlash release discussion, 114 score)

MTP acceptance rate determines whether it helps or hurts. A controlled M4 Max Studio study with Gemma 4 26B-A4B reveals that MTP benefit varies entirely by workload acceptance rate. Measured: code generation 66% acceptance → 1.53x speedup; long-form prose 31% acceptance → essentially no gain (0.95x); JSON structured output 8% acceptance → 0.50x (twice as slow). The mechanism: when the draft model's speculative tokens are rejected, the full model must re-run the verify step with no net gain — at 8% acceptance the overhead dominates. Expert commentary adds two important nuances: first, MoE models like Gemma 4 26B-A4B are harder to speculate in than dense models because spare compute for draft verification is limited; second, Apple Silicon before M5 has limited headroom, and dense Gemma 4 31B is expected to see better MTP gains than the MoE 26B on the same hardware. Practical guidance: MTP is worth enabling for structured code generation and predictable outputs; disable it for free-form prose and especially for JSON schema output, where it reliably degrades performance. Always benchmark before assuming benefit. (source, 24 score, 8 comments)

Multi-GPU topology insight: NVLink pairing beats full tensor parallelism. A detailed benchmark with 4×RTX 3090 (NVLink between GPU pairs 0↔2 and 1↔3, vLLM 0.20.1, CUDA 12.8) found that pinning TP=2 to an NVLink-bonded pair delivered +25% throughput at concurrency 1 and +53% at concurrency 4 compared to running TP=2 over PCIe. Counter-intuitively, expanding to TP=4 across all four GPUs was worse — cross-pair PCIe bus traffic added latency that outweighed the additional capacity. This applies directly to Gemma 4 31B Dense deployment on NVLink-equipped multi-GPU workstations: prefer TP=2 on your NVLinked pair over TP=4, even when you have four GPUs. Tested here with Qwen 3.6 27B AWQ as the workload model; the topology principle holds for any model requiring tensor parallelism across these GPUs. (source, 44 score, 36 comments)

TurboQuant + MTP on RTX 4090: 80-87 tok/s at 262K context — quality claims contested. A demonstration showing TurboQuant quantization combined with MTP on Qwen 3.6 27B reports 80-87 tok/s generation at a 262K context window on a single RTX 4090 (60 score, 42 comments). The numbers are eye-catching, but community pushback on quality was significant: the demonstration used a simple Q&A prompt and did not test accuracy on long-context retrieval tasks where TurboQuant's aggressive compression can degrade meaningfully. TurboQuant is the method from the AtomicBot-ai fork — the same project that shipped the first Gemma 4 MTP implementation for llama.cpp — and it is not merged into mainline llama.cpp or any standard quantization library. The combination of unverified quality and non-mainline tooling means the throughput claim is directionally interesting, but the practical recommendation remains: use quantization methods with published quality benchmarks on your target workload before optimizing around throughput numbers. (source, 60 score, 42 comments)

HTX301 PCIe inference card announced: 384GB at 240W, community skeptical. Taiwanese company Skymizer announced the HTX301, a PCIe inference card with 384GB memory and a 240W TDP (250 score, 103 comments). At face value the memory capacity is striking — 384GB would fit Gemma 4 31B Dense at BF16 with enormous headroom, or multiple models simultaneously. Community reaction was measured skepticism: the announcement contains no memory bandwidth specification, no compute FLOPS figures, and no pricing. Memory capacity without bandwidth is meaningless for LLM inference decode throughput, where bandwidth is almost always the bottleneck. Several hardware-knowledgeable commenters compared it unfavorably to AMD MI300X (192GB at ~5TB/s bandwidth) and suggested the 240W TDP implies a modest memory subsystem relative to the 384GB capacity. Worth tracking if independent benchmarks appear with validated bandwidth figures; do not plan deployments around the headline memory number alone. (source, 250 score, 103 comments)

vLLM ROCm added to Lemonade: AMD GPU users can now run inference before GGUF conversion. The Lemonade server added vLLM ROCm as an experimental backend, enabling inference from standard model weights on AMD GPUs without first converting to GGUF format. This reduces workflow friction for Radeon 6000/7000-series users on Linux who want to test Gemma 4 variants under ROCm. The backend is marked experimental; community verification of Gemma 4 on ROCm via Lemonade is sparse, so validate on your specific GPU before relying on it for production workloads. AMD GPU users for whom GGUF conversion was the primary friction point now have a faster path to initial evaluation. (source)

May 8 re-check, 2026-05-08 01:00 EDT: three new developments from this sweep worth recording.

MTP now working in llama.cpp for Gemma 4 — 40% decode speedup on M5 Max. A community developer (May 8) implemented Multi-Token Prediction for llama.cpp, quantized Google's new Gemma 4 assistant GGUF models, and tested on a MacBook Pro M5 Max. Measured result: 97 tok/s baseline → 138 tok/s with MTP, a 40% speedup. This uses the new Google-released MTP draft models (Gemma-4-26B-A4B-it-assistant) and a patched llama.cpp fork available at AtomicBot-ai; the patch is not yet merged into mainline llama.cpp. Key distinction from the omlx finding (below): this is llama.cpp-based MTP — relevant to Linux and Windows users who cannot use MLX. Commenters note the quality comparison between baseline and MTP outputs used different seeds and temperatures, so "40% faster with identical quality" requires verification at temp=0 with fixed seed; take the exact ratio as approximate. The directional finding (meaningful speedup via MTP on llama.cpp for Gemma 4) is credible given the confirmed mechanism. (source, 95 score, 19 comments)

MTP confirmed working on Apple Silicon via omlx runtime. A direct first-hand report (May 7) confirms that the new Google MTP draft models work with the omlx runtime on M1 Max 64GB, nearly doubling decode speed from 11 tok/s to 20+ tok/s at max wattage. Standard MLX (the more widely used Apple Silicon inference library) does not yet support MTP — the omlx runtime is a separate fork-based project. On the technology: MTP only benefits decode (generation) speed, not prefill — prefill processes the full input in parallel by design, so there is nothing to speculate ahead. Commenters clarified a common confusion: some third-party projects advertise "speculative prefill" as a distinct feature, but this involves lossy KV cache population (not mathematically equivalent to standard generation); lossless MTP applies only to the decode phase. For Apple Silicon users: omlx is the current fastest path to Gemma 4 MTP; native MLX support is pending. (source, 21 score, 22 comments)

Prompting sensitivity: Gemma 4 and Qwen 3.5 need different prompting than Qwen 3.6. A controlled test (May 7) ran two phrasings of the same math-word problem against Gemma 4 31B, Qwen 3.5, and Qwen 3.6 27B — 10 runs each (6 combinations). The headline result: the models respond very differently depending on phrasing, and Qwen 3.6 proved most robust to ambiguous phrasing while Gemma 4 and Qwen 3.5 performed better on the clearer of the two prompts. Key practical takeaway: Gemma 4's accuracy on reasoning tasks is sensitive to prompt clarity. Concise, unambiguous prompts tend to get better results than elaborated prompts that contain implicit assumptions. Quantization also matters: IQ2-quantized Qwen 3.6 underperformed Q8 on the same task, reinforcing the known guidance to prefer higher quants for reasoning workloads. This finding complements the token-efficiency story: Gemma 4 finishes tasks in fewer tokens, but benefits from being asked precisely. (source, 28 score, 13 comments)

May 7 re-check, 2026-05-07 01:00 EDT: two new developments from this sweep that add meaningful signal.

Community use-case survey crystallizes where Gemma 4 wins. A widely-upvoted discussion thread (94 score, 127 comments, May 6) asked practitioners directly what they use Gemma 4 for versus Qwen 3.6. The answers converge on a clear pattern: Gemma 4 is the preferred choice for vision and OCR ("Gemma trounces Qwen for handwriting analysis and general vision tasks"), bug tracing ("Gemma4 is really, really, really good at tracing bugs — much more consistent and reliable for finding the actual root cause"), translation especially Japanese and smaller European languages (independently confirmed across multiple reporters), creative writing, tone-sensitive text, and RAG over structured documents. Qwen 3.6 is preferred for agentic coding, multi-turn tool use, and long agentic loops. The niche-split that has been building across weeks of field notes is now directly confirmed from first-person practitioner reports. One practitioner summarizes: "For things I want to go fast, don't require accuracy or rely mostly on the vision encoder: Gemma4-26B-A4B. For where accuracy and nuance are important: Gemma4-31B. I prefer Qwen3.6 for anything programming or toolcalling related." The survey also confirms that translation quality holds at an unusually high bar: Gemma 4 is rated best open-weight option for Japanese→English, with one commenter noting it is "entirely undisputed" for open models on translation tasks. (source, 94 score, 127 comments)

Prompt injection defense: Gemma 4 E4B jumps from 21% to 100%. A benchmark study (6100+ tests across 15 models, 7 attack types) found that Gemma 4 E4B went from 21.6% to 100% defense rate when the untrusted input was wrapped in a long random delimiter and the model was explicitly told not to execute injected instructions. This was the largest absolute improvement of any tested model (+78.4 percentage points) and the only model to reach a perfect score. Tested attack types included role hijack, authority claims, and fake delimiters. The benchmark used hand-crafted payloads rather than SOTA adversarial search, so the defense rate may be lower against gradient-based attacks. Practical takeaway for RAG and web-document pipelines: the delimiter + strict-prompt defense is a high-ROI hardening step for Gemma 4 deployments that process untrusted external content. (source, 24 score)

Morning re-check, 2026-05-02 08:30 EDT: a follow-up sweep against the past 24 hours of r/LocalLLaMA confirmed three additional posts worth recording. A first-hand AMD Radeon 9060 XT 16GB report (eGPU on a 7840HS mini-PC) lands the 24B A4B IQ4_NL variant at 25.9 tok/s with KV cache at q8_0 and a small 256-token target. More importantly, two independent posts within fourteen hours documented an emerging "zombie loops" failure mode on both Gemma 4 and Qwen 3.6 with quantized KV cache during thinking mode. The convergent expert reading is that q4_0 KV quantization accumulates drift across hundreds of internal reasoning tokens until the model falls into a repetition attractor. This pattern is now strong enough to call out as a known limit (see below).

Evening re-check, 2026-05-02 17:45 EDT: the post-PR #82 sweep found two new high-signal items rather than a broad hardware shift. First, a local vLLM/FP8 vision comparison reports Gemma 4 staying much more concise on messy real-world image prompts, often around 1,500 thinking tokens where Qwen 3.6 can burn 8,000+ tokens and sometimes fail to finish. The same report says Gemma 4 followed normalized 0 to 1 bounding-box JSON instructions more reliably, while Qwen 3.6 did better on the tested 2 FPS deadlift video tracking case. Second, an SGLang production report identified an FP8 KV-cache bug for models with per-layer KV scales, explicitly including Gemma 4, where radix-cache prefix hits can silently corrupt output unless the deployment uses BF16 KV cache or the upstream fix lands. This reinforces the current guidance: for long-context or thinking-mode work, treat KV-cache precision and serving backend as quality controls, not just speed knobs. (vision source, SGLang source, PR #24198)

May 3 re-check, 2026-05-03 01:00 EDT: a new sweep surfaced two notable developments. First, a dedicated KV cache quantization discussion (source, 77 comments) provided the architectural explanation for the zombie loops pattern previously documented: Gemma 4 uses an interleaved Sliding Window Attention (iSWA) mechanism that is structurally more sensitive to KV precision loss than dense models or Qwen-style MoE. The expert comment reads directly: "Gemma 4, due to its iSWA architecture, is apparently much more sensitive to KV cache quantization." Dense architectures accumulate less rounding error per attention step; iSWA's alternating local and global windows amplify quantization noise differently. The practical implication is stronger than the zombie-loops framing: KV precision for Gemma 4 is an architecture-level quality control, not just a safety precaution for thinking mode. Second, follow-up comments on the "Qwen 3.6 wins benchmarks, Gemma 4 wins reality" vision post (source) added two confirming voices worth recording. A commenter with the opposite finding (Qwen 3.6 follows instructions better) attributes the divergence to "backend/harness influence," underscoring that task setup and serving backend matter for the comparison. A second commenter elaborates: "Gemma is much better at short one shot, but because of its architecture it struggles with long context. There is something about its attention mechanism and its also far more sensitive to KV quantization." On the multilingual dimension, a confirmed data point: Gemma 4 is "a much better LLM than Qwen for anyone that doesn't use English or Chinese as their primary language, especially for European languages." Third, the RTX 6000 Pro guidance from May 2 received an important nuance from a card owner: "performance between vllm, sglang etc is the same as LMStudio until you move onto 4 or more concurrent pulls, then vllm and sglang are better." (source) This corrects the blanket recommendation: for single-user workloads on professional GPUs, llama.cpp-based tools remain competitive; the vLLM/sglang advantage appears primarily at 4+ concurrent requests.

Evening re-check, 2026-05-03 17:05 EDT: the post-PR #86 sweep found two new data points. First, a developer shipped a production Android voice notes app using Gemma 4 E2B (2.4GB) via LiteRT-LM on a OnePlus CE 5 (8GB RAM). The measured end-to-end latency for a 10-15 second voice note is 12-15s: Whisper Small (Sherpa-ONNX) handles transcription in ~5s, Gemma categorizes and extracts structured JSON in ~8-10s. The developer reports JSON output reliability as "way better than expected from a 2.4GB model on a phone" — a strong signal that Gemma 4 E2B's instruction-following quality holds well under aggressive quantization on ARM. Notably, commenters suggest the separate Whisper step may be unnecessary since E2B may support native voice-to-text natively via LiteRT-LM. (source, score 18, 14 comments) Second, a community survey of Gemma 4 31B on smaller European languages confirms the multilingual advantage holds above the 100B MoE tier: multiple independent reporters conclude that Gemma 4 31B beats Qwen 3.5 122B and Mistral 4 119B for Czech, Hungarian, Slovak, and Dutch. The data comes with a precision note: quantization hurts multilingual quality more than English quality, so the comparison is most meaningful at BF16/FP16 — a 16-bit Gemma 4 31B is "extremely good in Hungarian" while the same model at 8-bit shows "slightly Chinese" output contamination. The practical guidance: if your use case is primarily a smaller European language, Gemma 4 31B at high precision is a better choice than any current 100B MoE at standard quantization. (source, score 8, 14 comments)

May 4 re-check, 2026-05-04 01:00 EDT: two new developments worth recording from the latest sweep. First, a report on running llama.cpp via the Snapdragon Hexagon NPU adds early data for mobile NPU inference with Gemma models. The NPU path itself is battery-efficient but constrained: the Hexagon NPU can only address 4GB of RAM, making it unsuitable for anything larger than the smallest Gemma variants without splitting across multiple NPU device instances. In practice, community testing found Gemma 4 E4B achieves 11-14 t/s on a OnePlus 13 (Snapdragon Elite) via the Android Edge APK (GPU path), not NPU. The NPU path on the same chip produced less favorable results. The takeaway for mobile: the GPU via Edge APK is currently the more practical Gemma 4 E4B path on high-end Android phones; NPU is a power-saving alternative that makes sense for always-on background tasks where latency tolerance is high. (source, score 20, 6 comments) Second, a community quality-gap discussion (68 score, 44 comments) adds useful perspective on where Gemma 4 31B sits against frontier cloud models. The converging read across commenters: Gemma 4 31B tracks "Dec 2025 frontier" performance levels for translation and non-English tasks — competitive with Claude Haiku 4.5, which was released roughly half a year ago. For tasks outside English and Chinese, Gemma 4 31B is seen as clearing the bar where the "6-month gap" argument would place it. This is consistent with the separate multilingual finding from May 3: Gemma 4 31B beats all tested 100B+ MoE models for smaller European languages when run at BF16. Anecdotal confidence; no controlled benchmark behind this comparison. (source, score 68, 44 comments)

May 6 re-check, 2026-05-06 01:00 EDT: four high-signal developments from the latest sweep.

Google officially released Gemma 4 MTP draft models. Multi-Token Prediction (MTP) drafters are now available for all four Gemma 4 variants: 31B Dense, 26B-A4B MoE, E4B, and E2B (HuggingFace). The E2B drafter is only 78M parameters — tiny enough to run alongside the main model with minimal memory overhead. MTP works by having the small draft model predict several tokens ahead; the large target model then verifies the full batch in parallel, accepting correct tokens and re-running from the first mismatch. This guarantees identical output quality to standard generation while targeting up to 2x decode speedup depending on task type (structured outputs and repetitive patterns see the largest gains). Community response was immediate: llama.cpp PR #22673 is already in review for Gemma 4 MTP support, and the MTPLX Apple Silicon runtime (see below) also claims MTP model compatibility. This is the biggest single capability addition to Gemma 4 since launch and changes the expected throughput trajectory significantly. (source, score 783, 204 comments)

Token efficiency confirmed: Gemma 4 31B is slower per token but faster per task. A Kaitchup benchmark article (summarized in a community post, 117 score) compared Gemma 4 31B Dense against Qwen 3.6 27B Dense and Qwen 3.5 27B Dense. The headline finding: Qwen models score higher on standard benchmarks ("benchmaxxed") but Gemma 4 31B is "far more efficient with token use" — it produces a correct, complete answer in substantially fewer tokens. The practical implication is that even though Gemma 4 31B is slower per token (it is a larger dense model vs. smaller dense models), total task completion time is often similar or faster because the model doesn't need to elaborate as much. One commenter summarizes the workflow they use: swap Gemma and Qwen 3.6 in Plan/Act roles when either model gets stuck — the two models' different failure modes make them complementary. Another notes that Gemma 4 is more sensitive to quantization, so Qwen's smaller quant + Q8 KV can outperform Gemma at the same VRAM budget, especially for longer contexts. (source)

CPU-only 26B inference is fast because of MoE architecture. A community post (score 100, 70 comments) reports running Gemma 4 26B-A4B on an i5-8500 with 32GB DDR4 RAM and no GPU. The measured generation speed is 9.25 t/s (prompt processing 23.13 t/s). The key explanation from the top comment: "Gemma 4 26B is a mixture of experts model that only uses 4B parameters every token. So it should be about as fast as a 4B model." This is the definitive answer for CPU-only users: the 26B label is misleading — active parameter count per token is ~4B, making CPU inference practical on ordinary hardware. Qwen 3.6 27B is dense (all 27B parameters active every token), so it runs ~8x slower on CPU despite having similar total parameter count. For CPU-only or low-RAM setups, the Gemma 4 26B-A4B MoE is the right model; Qwen 3.6 27B is impractical at the same hardware. (source)

MTPLX: Apple Silicon MTP inference engine shows 2.24x speedup. An open-source runtime built on a patched MLX fork (not a patch to MLX itself) reports 28 → 63 tok/s on Qwen 3.6 27B on MacBook Pro M5 Max using MTP heads built into the model. Key design details: mathematically exact temperature sampling via rejection sampling (not greedy-only like other speculative decode tools on Apple Silicon), custom Metal kernels, and a full OpenAI/Anthropic-compatible API server. The runtime also adds crash-safe fan control and a 562-test suite. With Google's Gemma 4 MTP draft models now released, MTPLX may support Gemma 4 inference as well — the developer says it "works on ANY MTP model." Not yet independently verified for Gemma 4 specifically; treat as promising but unconfirmed. (source, score 60, 38 comments)

May 5 re-check, 2026-05-05 01:00 EDT: four developments from the latest sweep that Gemma 4 users should act on or track.

Update your Gemma 4 GGUFs. A high-traction community post (365 score, 103 comments) announced that the Jinja chat template bug documented in earlier field notes has been fixed in the upstream model files. Updated GGUFs are now available from bartowski and unsloth for all four variants: 31B, 26B-A4B, E4B, and E2B. Community comments flagged that the fix may also reduce the extreme memory usage some users experienced. The exact change is visible at HF discussion 86. If you have been running Gemma 4 GGUFs from before May 2026 and are using tool calling or extended context, updating is strongly recommended. (source)

llama.cpp MTP support is now in beta. A beta implementation of Multi-Token Prediction (MTP) has landed in llama.cpp (477 score, 210 comments). MTP pairs a small fast draft model with the large target model: the draft predicts a token batch, the target verifies the entire batch in parallel, accepting correct tokens and re-running from any mismatch. ELI5: "big model and small model work as a team — small model runs ahead, big model checks from behind, both finish sooner." Currently limited to Qwen3.5 MTP architectures, with broader model support expected. The author notes that between MTP and maturing tensor-parallel support, "most performance gaps between llama.cpp and vLLM, at least when it comes to token generation speeds, should be erased." Relevance for Gemma 4: once Gemma 4 MTP support lands (if and when), E4B could serve as a draft model for 31B Dense — this is architecturally the same as the existing Gemma 4 E2B speculative decoding setup but with native MTP semantics. Not yet merged; track the PR before updating. (source)

APEX MoE quants now cover Gemma 4. The APEX mixed-precision MoE quantization strategy, originally demonstrated for Qwen 3.5 35B-A3B, has expanded to 30+ models including Gemma 4 variants (77 score). APEX applies expert-routing-aware precision tiers: higher precision for edge layers and shared experts (which handle rare long-range tokens), lower for mid-tier experts. Users report noticeably better coherence past 32K tokens compared to uniform Q4_K, with measured faster inference on the benchmarked Qwen 3.6 models. The Gemma 4 26B MoE coverage is confirmed in the library; community reports on Gemma 4 specifically are sparse so treat the long-context claims as plausible but anecdotal until more data surfaces. Quants are available via github.com/mudler/apex-quant. (source)

Research: FastDMS achieves 6.4x KV cache compression faster than vLLM BF16. An MIT-licensed reference implementation of Dynamic Memory Sparsification (DMS) — a technique using learned per-head token eviction to compress the KV cache — reports 6.4x compression with near-lossless quality (perplexity 9.226 → 9.200 on Llama 3.2 1B; KLD ~0.026 nats/tok). The implementation is research-quality and tested only on Llama and Qwen-family checkpoints; it has not been integrated into llama.cpp, vLLM, or SGLang. Author explicitly says the lift for a production serving integration is large and "noped out" of attempting it. Given Gemma 4's documented KV precision sensitivity (iSWA architecture amplifies quantization noise), FastDMS is worth tracking as a potential path to longer context without KV precision degradation — but this is speculative and no Gemma 4 DMS checkpoints exist yet. Confidence: low (research-stage result). (source)

Hardware leak: Ryzen AI Max+ 495 (Gorgon Halo) with 192GB unified memory. A leaked spec for AMD's upcoming Ryzen AI Max+ 495 shows 192GB unified memory, up from 128GB on the current Strix Halo 395 (148 score). Key caveat from community hardware experts: memory bandwidth appears unchanged at ~256GB/s. For Gemma 4 users this means: a Gorgon Halo system could fit Gemma 4 31B Dense BF16 (~62GB), the 26B MoE BF16, and several smaller models simultaneously, with prefill remaining the same speed bottleneck as Strix Halo today. The additional capacity is most useful for parallel model loading, very long contexts, or RAG pipelines that need multiple loaded models. Unconfirmed leak; release timeline and pricing unknown. Strix Halo 395 owners confirm the memory increase alone would not change throughput on single-user workloads. (source)

Headline this week

MTP vs DFlash is now settled at the hardware and concurrency level. A controlled H100 benchmark (May 12) confirms: at concurrency 1, MTP (3.11x) and DFlash (3.03x) are statistically tied for Gemma 4 31B Dense. At concurrency 16, MTP wins decisively — 953 vs 725 tok/s. For single-user inference either method works equally well; for serving multiple concurrent users, MTP is the better choice. Apple removed the M3 Ultra 256GB Mac Studio from its store ahead of an expected M5 launch. DFlash for 26B MoE remains live in vLLM with 2.56x throughput for short-context workloads; both MTP and DFlash require workload-appropriate tuning. New May 15: KV cache quantization guidance for vLLM is now more precise — a formal study confirms FP8 (`--kv-cache-dtype fp8`) as the best default when VRAM is constrained, with 2x capacity and negligible accuracy loss. TurboQuant variants beyond 4bit-nc are not worth the accuracy and throughput cost for Gemma 4. If VRAM is not a constraint, unquantized KV cache remains highest quality.

Best current setup (today)

RTX 5xxx (Blackwell consumer): Gemma 4 26B MoE now has an official nvidia/Gemma-4-26B-A4B-NVFP4 quant at 18.8GB. On a 5090 with 80% VRAM allocation, users report ~50K context. Benchmarks are near-lossless: GPQA Diamond 79.9% vs 80.3% baseline, AIME 2025 actually improved to 90.0% from 88.95%. New as of May 9: DFlash for the 26B MoE is now live via z-lab — benchmarks on the 5090 show 228 → 578 tok/s (2.56x) at optimal settings with vLLM. Critical caveat: DFlash degrades sharply at 20k+ context (drops from 400 to 200 tok/s and continues declining). Use DFlash for short-context inference; for long-context work, NVFP4 without DFlash remains the recommended path. For 31B Dense, NVFP4 GGUF with llama.cpp PR #22196 or the existing DFlash variant remain the paths. (NVFP4 source, DFlash benchmark)
Mid-to-high single GPU (24+ GB VRAM, non-Blackwell): Gemma 4 31B Dense at Q5_K_M or Q6_K remains the strongest single-card choice for general work, writing, and visual understanding. New this week: the DFlash variant (gemma-4-31B-it-DFlash) has been released but still needs llama.cpp PR #22105 to merge before practical use. ggerganov is reportedly planning a speculative-architecture refactor first. (source)
Constrained GPU (8-16 GB VRAM): Detailed speed benchmarks from an RTX 4070S 12GB user (DDR5 6000MHz, iGPU display offload) show Gemma 4 26B MoE and 31B Dense both runnable with substantial CPU offload. The 12GB club is real: careful config tuning (CUDA 13.1, display offload to iGPU, cache reuse settings) gets 40 t/s on 35B Q6 with system RAM spill. Keep Gemma 4 for prose and Qwen 3.6 for code in this tier. New (May 14): Reduce GPU power limit with `nvidia-smi -pl` to ~40% TDP — confirmed on RTX 4090 that generation throughput is nearly unchanged (LLM decode is memory-bandwidth-bound, not compute-bound). Reclaim meaningful power, heat, and noise savings at essentially no performance cost. (12GB source, power limit source)
Budget hardware (GTX 1080 8 GB VRAM, ~$200 secondhand machine): New May 14 lower bound. A $200 i7-6700 / GTX 1080 / 32 GB RAM machine runs Gemma 4 26B-A4B at ~20–24.5 tok/s using TurboQuant/RotorQuant KV cache (allows 128k context within 8 GB VRAM) and `--n-cpu-moe 20` CPU MoE offload. Key flags: `--override-tensor-draft "token_embd\.weight=CUDA0"` needed to fix MTP's tied LM head behavior for Gemma 4 specifically. The GPU is PCIe bandwidth-limited (~40-50% GPU utilization). Note: TurboQuant KV is not in mainline llama.cpp; requires AtomicBot-ai or ikawrakow fork. Treat as an existence proof — actual throughput at full 128k real context will be lower than these small-prompt benchmarks. If you already have an 8GB GPU collecting dust, Gemma 4 26B-A4B is runnable. (source)
AMD consumer GPU (Radeon 9060 XT 16GB, eGPU): A first-hand report on a 7840HS mini-PC paired with an external Radeon 9060 XT lands the Gemma 4 24B A4B IQ4_NL variant at 25.9 tok/s via llama-server, with KV cache at q8_0 and a small 256-token batch target. The user notes the configuration is usable for OpenCode codebase Q&A. Reply chain confirms 16GB is tight at 128K context and forces partial CPU offload, so for steady-state work expect lower numbers when context fills. New (May 9): Lemonade added vLLM ROCm as an experimental backend, allowing inference from standard model weights on Radeon 6000/7000-series GPUs without converting to GGUF first — useful for quick evaluation of new models before committing to conversion. Backend is experimental; validate on your GPU before production use. (llama-server source, vLLM ROCm Lemonade)
CPU-only (no GPU, DDR4/DDR5): Gemma 4 26B-A4B MoE is now confirmed to run at 9.25 t/s generation speed on an i5-8500 with 32GB DDR4 RAM — practical for interactive use. The reason: MoE only activates ~4B parameters per token despite 26B total. This makes it comparable to running a true 4B dense model on CPU. Qwen 3.6 27B Dense activates all 27B parameters per token and is ~8x slower on the same hardware — avoid for CPU-only setups. Recommended quant: Q4_K_M to fit comfortably in 32GB. (source)
Apple Silicon (32-64 GB unified): The viral Pacman test ran Gemma 4 31B at 27 tok/s on M5 Max 64GB, confirming strong Apple Silicon inference. MTP is now confirmed working for Gemma 4 via the omlx runtime: a first-hand M1 Max 64GB report (May 7) measured 11 → 20+ tok/s — roughly doubling decode speed. Standard MLX MTP support is still pending. MTPLX (a separate patched MLX fork) shows 2.24x speedup on Qwen 3.6 27B and claims compatibility with any MTP model. Recommended quant for 31B on 64GB: Q8 or Q6_K. Supply chain note (May 9): Apple has removed the M3 Ultra 256GB Mac Studio from its store, likely ahead of an M5 Mac Studio launch. If you were planning a 256GB Apple Silicon build for BF16 Gemma 4 31B, that window has closed for now — wait for M5 Ultra availability and pricing before committing. (sources, omlx MTP, MTPLX, Apple store note)
Professional GPUs (RTX 6000 Pro / A6000 Ada, 48-96GB): The recommendation to use sglang or vLLM over llama.cpp has an important nuance (May 3 update): for a single-user single-request workflow, llama.cpp-based tools like LMStudio are competitive. The vLLM/sglang advantage becomes significant at 4+ concurrent requests, where continuous batching scales throughput meaningfully. On an A6000 Ada running vLLM, one practitioner reports 400-500 tok/s across 8-12 parallel workloads. For single-user interactive use on Windows, LMStudio may be simpler with no real throughput cost. (source)

What works

Concise, correct single-shot code generation. The viral Pacman post (778 score) is the clearest demonstration yet. Gemma 4 31B on M5 Max produced a working game in 3m51s with 6,209 tokens: shorter, clearer, and functionally correct on first run. Qwen 3.6 27B spent 18m04s and 33,946 tokens with more visual creativity but more bugs. This pattern holds across similar community tests: Gemma tends to produce tighter, more correct code, Qwen produces more elaborate but less reliable output. (source)
Writing, tone, fiction, summarization. The niche-split consensus from last week is now even stronger. The "are 30B models obsolete?" thread (139 score, 144 comments) keeps accumulating confirming answers: Gemma is "MUCH better than Qwen in writing and tone," "the best at non-code tasks." (source)
Near-lossless NVFP4 for MoE. The Nvidia NVFP4 quant of Gemma 4 26B MoE preserves quality to within 0.4% across multiple benchmarks, and in some cases slightly exceeds full precision. A practitioner with 90 ablation experiments explains this as NVFP4 acting as regularization on the 128-expert router, preventing over-commitment to dominant pathways. This is an important finding for anyone running the MoE variant. (source)
Visual understanding. Remains the strongest open multimodal answer for image-plus-text tasks. No new contradicting signal.
Speculative decoding pairing. Gemma 4 31B + E2B draft model delivers 120-200 tok/s on suitable tasks. Official MTP draft models are now available for all four variants; once llama.cpp PR #22673 merges, native MTP will replace the ad-hoc speculative decoding setup and likely push throughput further. (source, MTP release)
Token efficiency: finishing faster despite slower per-token speed. Comparative testing by Kaitchup confirms Gemma 4 31B is "far more efficient with token use" than Qwen 3.6 27B Dense. The model produces complete, correct answers in fewer tokens even when individual token generation is slower. Combined with MTP draft models now available, total task latency is expected to improve further. (source)
CPU-only 26B inference is practically viable. The MoE architecture activates only ~4B parameters per token, making CPU throughput comparable to a 4B dense model. Measured 9.25 t/s on i5-8500 + 32GB DDR4 — usable for interactive workflows on modest hardware. (source)
Native FP4 on Blackwell. Now available for both the 31B Dense (via community GGUF) and the 26B MoE (via Nvidia's official release). ROCm/Vulkan support for NVFP4 is also emerging via llama.cpp and third-party kernels like petit-kernel. (source)

Known limits

Tool calling template bug is now fixed — update your GGUFs. The Jinja chat template issue documented in earlier field notes has been patched upstream (May 4, 2026). New GGUFs from bartowski and unsloth for all four variants (31B, 26B-A4B, E4B, E2B) include the fix. The exact change is at HF discussion 86. Alternatively, pass an updated `--chat-template-file` to llama.cpp or use KoboldCPP's Jinja template override. Some users report the updated GGUFs also reduce extreme memory usage. If you have not updated since May 2026, do so now before concluding that tool calling is unreliable. The community prediction for a "Gemma 4.1" point release may still happen — the template fix resolves the structural bug but does not address all agentic reliability concerns documented in prior field notes. (source)
Qwen 3.6 leads on code and agents in the same size band. Still the dominant community read. For coding agent workflows and long agentic loops, Qwen 3.6 is materially more reliable. But the Pacman test shows Gemma 4 can beat Qwen on one-shot code quality when the task is well-scoped. The gap narrows when you don't need sustained multi-turn tool calling. (source)
DFlash for 26B MoE is live; MTP acceptance rate is workload-dependent. These are two distinct paths to faster Gemma 4 inference, both with important caveats. DFlash 26B MoE is now live (z-lab release, May 8) and achieves up to 2.56x throughput in vLLM on short contexts — but drops sharply at 20k+ tokens. DFlash for 31B Dense still requires llama.cpp PR #22105 (blocked on speculative architecture refactor). MTP is live in patched llama.cpp forks and via omlx on Apple Silicon. MTP acceptance rate varies dramatically: code generation sees 66% acceptance (1.53x speedup), prose sees 31% (a wash), and JSON structured output sees only 8% (0.50x — slower than baseline). Always benchmark MTP for your specific workload before enabling it. Either path guarantees identical output quality to standard generation (unlike quantization). (DFlash source, DFlash 26B MoE, MTP acceptance rates)
Fine-tuning Gemma 4 is harder than expected. A community prediction thread flags that Gemma 4's architecture "seems to be making fine-tuning tricky" and notes that Gemma 4 "didn't really take over the fine-tune crowd." If you need a fine-tunable base, this is a real friction point to watch. (source)
Reasoning accuracy is sensitive to prompt phrasing. A May 7 controlled test showed Gemma 4 31B performance varying significantly between two different phrasings of the same math-word problem. Qwen 3.6 was more robust to ambiguous phrasing; Gemma 4 performed best on the clearer, more explicit prompt. Practical guidance: for reasoning or multi-step tasks, prefer short, unambiguous prompts — avoid prompts with implicit assumptions or layered conditions. This is consistent with the token efficiency picture: Gemma 4 is concise and direct but expects the same from the user. (source)
Professional GPUs and sglang/vLLM: nuanced, not a blanket rule. sglang and vLLM pull ahead of llama.cpp at 4+ concurrent requests due to MTP (Multi-Token Prediction) support and continuous batching. But for a single-user setup, a card owner confirms: "performance between vllm, sglang etc is the same as LMStudio until you move onto 4 or more concurrent pulls." The earlier framing ("seriously gimping that card by running llama.cpp") applies to multi-user or high-concurrency scenarios, not necessarily solo development. If you are on Windows and single-user, LMStudio remains a reasonable starting point. (source)
Structured output stays unreliable below 7B. Still valid from last week. Validate paths, classify actions, and check outputs in code for sub-7B models.
Safety filters on E2B. Still too aggressive for emergency/medical prompts. No equivalent Gemma 4 uncensored release has surfaced.
Gemma 4 is structurally more sensitive to KV cache quantization than most models — and a first comprehensive vLLM study now clarifies the tradeoffs (May 15 update). The zombie loops pattern from May 2 now has an architectural explanation (May 3 update): Gemma 4 uses interleaved Sliding Window Attention (iSWA), which amplifies KV precision loss differently than dense or Qwen-style MoE architectures. At q4_0 KV quantization, rounding drift accumulates across the model's alternating local and global attention windows, eventually pushing thinking-mode outputs into repetition attractors. Qwen 3.6 at Q8 KV quant is described as "almost lossless" by benchmarks; for Gemma 4, community consensus is to treat anything below fp16/bf16 KV as a quality risk on long or thinking-heavy workloads. Two independent zombie-loop cases documented: one on dual RTX 5060 Ti 16GB with `-ctv q4_0 -ctk q4_0`, another on Gemma 4-26B-A4B at Q3 and Q4 quants. Workarounds: raise KV cache precision to q8_0 or fp16, drop reasoning budget to 0 (disables thinking), ensure context is not overflowing, and use CUDA toolkit 13.1 rather than 13.2 (13.2 has a confirmed regression with these models). vLLM guidance added May 15: For vLLM users, a formal study now confirms FP8 (`--kv-cache-dtype fp8`) as the best default when VRAM is constrained — 2x capacity with negligible accuracy loss. TurboQuant k8v4 does not provide meaningful advantage over FP8. TurboQuant 4bit-nc is viable at extreme VRAM pressure but degrades accuracy and throughput. 3bit variants cause meaningful accuracy drops on reasoning tasks and are not recommended for Gemma 4 production use. If VRAM is not the binding constraint, unquantized KV cache remains the highest-quality option. These findings apply specifically to vLLM; llama.cpp's TurboQuant/RotorQuant behavior should be benchmarked separately. (source 1, source 2, architecture source, TurboQuant study)

Gemma 4 E4B is not reliable for code autocomplete (fill-in-the-middle). A practitioner directly tested E4B for IDE code infill and found it produces "weird suggestions," ultimately choosing Qwen 2.5 Coder 7B instead. The instruction-following strength that makes E4B a top recommendation at the 4B tier does not translate to the FIM (fill-in-the-middle) task pattern, which requires specific training signal. If you are building a local coding autocomplete setup with a 16GB GPU, use a purpose-built coder model for infill; E4B is better suited for interactive chat, vision, and instruction-following tasks. (source, May 12, 36 score)

SGLang FP8 KV cache can silently corrupt outputs on affected versions. A production report from AI Router Switzerland traced silent garbage output in Qwen3.6-27B-FP8 to the ragged plus paged attention split path dropping `k_scale`/`v_scale` during radix-cache prefix hits. The author explicitly says the same class can affect FP8 models such as Gemma 4 that store per-layer KV scales. Verified upstream state: SGLang PR #24198 is open and approved. Until it lands in the serving build, keep Gemma 4 FP8 deployments on BF16 KV cache or apply the patch before trusting prefix-cache reuse. (source)

Open questions

Will Google ship a Gemma 4.1 with fixed tool calling? The community's top May prediction is a "4.1" point release that fixes the template-level tool-calling bug. If it happens, it could significantly close the gap with Qwen 3.6 on agent workloads. No official signal yet. (source)
llama.cpp MTP support is now in mainline — RESOLVED. PR #22673 merged on May 16, 2026. MTP is available via a standard llama.cpp build for all Qwen3.6 and Gemma 4 MTP models. The official Docker image was lagging at merge time; build from source with `CUDA_DOCKER_ARCH` for your GPU until the container image is updated. DFlash remains a separate path, still blocked on the speculative architecture refactor (PR #22105).
What real-world speedup will MTP deliver for Gemma 4? Now well-documented across multiple hardware tiers. On H100 (vLLM): Gemma 4 31B Dense achieves 3.11x at concurrency 1 and reaches 953 tok/s at concurrency 16 — the first data point showing MTP pulls significantly ahead of DFlash at higher concurrency. On M4 Max Studio: Gemma 4 26B-A4B shows 1.53x for code, a wash for prose, 0.50x for JSON. On M5 Max with llama.cpp patched fork: 97→138 tok/s (1.42x). Pattern: dense models like 31B see larger MTP gains than the MoE 26B; code generation workloads benefit most; JSON and free-form prose do not benefit. Always benchmark MTP for your specific workload before enabling it.
Will NVFP4 quality hold across AMD via Vulkan/ROCm? Early support exists via llama.cpp Vulkan and third-party kernels, but no controlled benchmarks on AMD hardware yet. (source)
How much does the fine-tuning difficulty matter? If the community can't easily fine-tune Gemma 4, Qwen 3.6 may absorb the fine-tune crowd entirely, limiting Gemma 4's ecosystem growth.
Strix Halo and unified-memory APUs. Reports on AMD Strix Halo 128GB suggest viable 27-31B dense inference, but data is still thin.
April 2026 was "one of the best months ever" for local LLMs. A community retrospective catalogued a historically dense month of model releases. The question is whether May sustains the pace — early signals from this sweep suggest continued momentum.

Sources

The most relevant Gemma-mentioning posts driving this update, with the newest first:

A First Comprehensive Study of TurboQuant: Accuracy and Performance (May 14, 2026, 64 score, 17 comments — FP8 confirmed as best vLLM KV cache default; TurboQuant k8v4 not worth it over FP8; 4bit-nc viable at extreme VRAM pressure; 3bit variants degrade accuracy on reasoning/long-context tasks)
24+ tok/s from ~30B MoE models on an old GTX 1080 (8 GB VRAM, 128k context) (May 13, 2026, 97 score, 47 comments — budget inference guide; score rose from 46; new comments on benchmarking methodology for long-context; MTP fix via `--override-tensor-draft "token_embd\.weight=CUDA0"`)
Gemma 4 MTP released (May 5, 2026, 783 score, 204 comments — official MTP draft models for all 4 variants, E2B drafter is 78M params, llama.cpp PR #22673 in review)
Dense Model Shoot-Off: Gemma 4 31B vs Qwen3.6/5 27B — Slower is Faster (May 5, 2026, 117 score, 33 comments — token efficiency confirmed, Gemma produces correct answers in fewer tokens)
Running a 26B LLM locally with no GPU (May 5, 2026, 100 score, 70 comments — MoE activates only 4B params/token, 9.25 t/s on i5-8500 + 32GB DDR4)
MTPLX: 2.24x faster TPS — native MTP inference engine for Apple Silicon (May 5, 2026, 60 score, 38 comments — custom Metal kernels, exact temperature sampling, potential Gemma 4 MTP support)
it's time to update your Gemma 4 GGUFs (May 4, 2026, 365 score, 103 comments — chat template fixed, new GGUFs from bartowski and unsloth)
Llama.cpp MTP support now in beta! (May 4, 2026, 477 score, 210 comments — speculative token generation in beta)
APEX MoE quants update: 25+ new models (May 4, 2026, 77 score — Gemma 4 MoE now in APEX mixed-precision collection)
FastDMS: 6.4x KV-cache compression running faster than vLLM (May 4, 2026, 51 score — research-stage KV compression, not yet in production)
Ryzen AI Max+ 495 (Gorgon Halo) with 192GB VRAM! (May 4, 2026, 148 score — hardware leak, same bandwidth as current Strix Halo)
Does the "6 months gap" still hold? (May 3, 2026, 68 score, 44 comments — Gemma 4 31B at Dec 2025 frontier tier for translations)
Gemma 4 E2B runs surprisingly well on my 8GB Android phone (May 3, 2026, 18 score, 14 comments — LiteRT-LM Android deployment, 8-10s JSON inference)
Anyone tried +-100B models locally with foreign languages? (May 3, 2026, 8 score, 14 comments — Gemma 4 31B beats 100B MoEs for European languages)
Running llama.cpp on Snapdragon Hexagon NPU seems promising (May 1, 2026, 20 score, 6 comments — mobile NPU limits, GPU via Edge APK more practical)
KV cache quantization: ignorance, or malice? (May 2, 2026, 35 score, 77 comments — iSWA architectural sensitivity confirmation)
Qwen 3.6 wins the benchmarks, but Gemma 4 wins reality (May 2, 2026, 49 score — multilingual advantage, long-context limits)
Been using Qwen-3.6-27B + VSCode + RTX 6000 Pro as daily driver (May 1, 2026, 244 score — single-user vs. multi-concurrent GPU nuance)
SGLang FP8 KV cache corruption and image-request memory leak PRs (May 1, 2026, 2 score)
Qwen 3.6 27B vs Gemma 4 31B - making Packman game! (May 1, 2026, 905 score)
nvidia/Gemma-4-26B-A4B-NVFP4 (May 1, 2026, 217 score)
gemma-4-31B-it-DFlash has been released (May 1, 2026, 124 score)
Your local LLM predictions and hopes for May 2026 (May 1, 2026, 30 score)
12GB-Club: 4070S speeds for Gemma 4 and Qwen 3.6 (Apr 30, 2026, 31 score)
Using a Radeon 9060 XT 16 GB, the gemma4 24b a4b iq4 nl model achieves 25.9 t/s (May 1, 2026, 5 score)
Qwen 3.6 and Gemma 4 "Zombie Loops" (terminal thinking loops) (Apr 30, 2026, 5 score)
Model stuck in some thinking zone where it keeps saying a similar thing again and again (May 1, 2026, 4 score)
Open Models - April 2026 retrospective (Apr 30, 2026, 518 score)
Qwen3.6-27B on dual RTX 5060 Ti 16GB with vLLM (Apr 29, 2026)
Are Qwen 3.6 27B and 35B making other ~30B models obsolete? (Apr 30, 2026, 144 score)
Notes on what actually breaks when you run a coding agent on small local models (Apr 30, 2026)
Larger Gemma-4/Qwen3.6 (Apr 30, 2026)
I stumbled on a Gemma 4 chat template bug for tools and fixed it (Apr 29, 2026)
llama.cpp's Preliminary SM120 Native NVFP4 MMQ Is Merged (Apr 29, 2026)
What it feels like to have to have Qwen 3.6 or Gemma 4 running locally (Apr 29, 2026)
How to run a local coding agent with Gemma 4 and Pi (Patrick Loeber)
Tested how OpenCode works with self-hosted LLMs (Qwen 3.5/3.6, Gemma 4, Nemotron 3, GLM-4.7)
I'm done with using local LLMs for coding
Speculative decoding with Gemma-4-31B + Gemma-4-E2B
Gemma-4-E2B's safety filters make it unusable for emergencies

What do you use Gemma 4 for? (May 6, 2026, 94 score, 127 comments — use-case survey: vision, translation, bug tracing vs Qwen)
Prompt injection benchmark: delimiter + strict prompt (May 5, 2026, 24 score — Gemma 4 E4B 21%→100% defense rate)
New Gemma 4 MTP on MLX? (May 7, 2026, 21 score, 22 comments — MTP confirmed working on M1 Max 64GB via omlx runtime, 11→20+ tok/s)
Two related prompts, different results: Qwen 3.5 and Gemma 4 need different prompting than Qwen 3.6 (May 7, 2026, 28 score, 13 comments — prompting sensitivity, Gemma 4 favors clear explicit prompts)
Gemma 4 26B Hits 600 Tok/s on One RTX 5090 (May 8, 2026, 99 score, 46 comments — DFlash 2.56x speedup benchmark, 228→578 tok/s; context cliff above 20k tokens)
z-lab released gemma-4-26B-A4B-it-DFlash. Anybody tried it yet? (May 8, 2026, 114 score, 22 comments — DFlash for 26B MoE is live; comparison vs MTP; vLLM-only currently)
MTP is all about acceptance rate (May 8, 2026, 24 score, 8 comments — code: 1.53x, prose: wash, JSON: 0.50x; workload-dependent; dense 31B expected better than MoE 26B)
Benchmark Qwen 3.6 27B MTP on 2x3090 NVLINK (May 8, 2026, 44 score, 36 comments — NVLink TP=2 beats TP=4; +25% at concurrency 1, +53% at concurrency 4)
Multi-Token Prediction (MTP) for LLaMA.cpp - Gemma 4 speedup by 40% (May 8, 2026, 95 score, 19 comments — 97→138 tok/s on M5 Max via patched llama.cpp fork, not yet merged mainline)
Taiwanese company Skymizer announces HTX301 — PCIe inference card with 384GB memory at ~240W (May 8, 2026, 250 score, 103 comments — skeptical community reception; 384GB at 240W; no bandwidth figures published)
Got MTP+TurboQuant running: Qwen3.6 27B, 80 t/s at 262K context on RTX 4090 (May 9, 2026, 60 score, 42 comments — TurboQuant + MTP throughput claim; quality contested; not in mainline llama.cpp)
vLLM ROCm has been added to Lemonade as an experimental backend (May 8, 2026, 433 score, 90 comments — AMD ROCm inference path; community asks for Gemma 4 MTP in vLLM ROCm; AMD engineer relayed feedback to management)
Those of you who like Gemma4 models - how are you guys using them? (May 9, 2026, 20 score, 42 comments — second practitioner survey: instruction-following leader, E2B for game NPC dialogue, 31B for PRDs, "best instruction-following of any open-weight model")
How long for llama.cpp official support of MTP? (May 9, 2026, 68 score, 46 comments — Georgi building unified MTP+Eagle3+DFlash architecture; all three merge together, not piecemeal)
1 year ago, @jspahrsummers and I released the first version of the Model Context Protocol (May 9, 2026, 29 score — Gemma 4 on Mac Mini drives MCP server at full speed; native tool calling at zero cloud API cost)
Has anyone set a local LLM up as a language learning tool? (May 9, 2026, 23 score, 19 comments — Gemma 4 active use for German/Arabic/French; correction-loop prompting pattern; instruction fidelity in target language)
Anybody else noticing how good gemma-4-26b-a4b is with one-shotting three.js? (May 10, 2026, 38 score, 23 comments — one-shot creative 3D/WebGL coding strength confirmed; "Gemma has more personality, Qwen better for facts/coding")
MTP benchmark results: the nature of the generative task dictates whether you will benefit from speculative inference (May 10, 2026, 67 score, 24 comments — task type is the only factor; F16+MTP triples coding speed; Q4_K_M+MTP slows creative; MoE models see prefill regression)
ExLlamaV3 Major Updates! (May 11, 2026, 141 score, 61 comments — ExLlamaV3 v0.0.29 added Gemma 4 support; v0.0.31 DFlash 2.51x coding speedup; v0.0.32 further model optimization)
What's the current best small model? (May 11, 2026, 26 score, 44 comments — Gemma 4 E4B top recommendation; Q8_XL or BF16 strongly preferred over Q8_0; context window works reliably)
Gemma 4 running fully offline on WebGPU with Transformers.js, controlling Reachy Mini over WebSerial (May 11, 2026, 49 score, 9 comments — first in-browser hardware control demo; Transformers.js + WebGPU + WebSerial)
The Qwen 3.6 35B A3B hype is real!!! (May 11, 2026, 333 score, 103 comments — practitioner pattern: Gemma 26B thinking mode for quick fixes; Qwen 35B for long-context refactoring)

Gemma 4 MTP vs DFlash on 1x H100: dense vs MoE results (May 12, 2026, 62 score, 22 comments — 3.11x MTP speedup for 31B Dense at c=1; MTP reaches 953 tok/s vs DFlash 725 tok/s at c=16; MTP wins at concurrency)
Apple removes 256GB M3 Ultra Mac Studio from store (May 9, 2026, 462 score, 132 comments — M5 prep; Samsung DRAM strike 58% cut; M3/M5 DRAM incompatible; 256GB high-memory option gone)
Local LLM autocomplete + agentic coding on a single 16GB GPU + 64GB RAM (May 12, 2026, 36 score, 30 comments — Gemma 4 E4B poor for code infill; RTX 5080 setup with Qwen 2.5 Coder 7B + Qwen 3.6 35B-A3B Q8)
Decoupled Attention from Weights - Gemma 4 26B (May 6, 2026, 40 score, 27 comments — community skeptical; 23x slower than standard inference; equivalent to llama.cpp RPC; author stepped back)
80 tok/sec and 128K context on 12GB VRAM with Qwen3.6 35B A3B and llama.cpp MTP (May 9, 2026, 635 score, 142 comments — 12GB MTP guide; Gemma 4 26B-A4B also runs on same 12GB rig at comparable speeds; practical llama.cpp MTP config documented)
Models and Quants quality test results — the chessboard svg (May 12, 2026, 46 score, 19 comments — Gemma 4 26B Q4_K_L does "very good job" on visual geometry tasks; strong spatial reasoning at Q4 quant)
Strix Halo or DGX Spark for a home LLM server? (May 11, 2026, 21 score, 91 comments — owner of both confirms Spark faster prompt processing and better long-context scaling; Strix Halo more repurposable; community split on inference-specialization vs. longevity)

The full set of 266 community reports lives in the Community Reports section above, filterable by hardware category and search.

Last updated: 2026-06-13 (June 13 sweep). Confidence: medium. Next update fires when the daily Gemma 4 research cron flags notable new findings.

Community Reports (573 from r/LocalLLaMA)

Real-world hardware experiences from the community. Filter by hardware category or search. These are user reports, not official benchmarks.

Gemma 4 has been released

+2316

u/jacek2023 2026-04-02 New Model Quantization & Backends

Google is going to show what open weights is about. Happy Easter everyone.

u/Both_Opportunity5327 (+529)Google is going to show what open weights is about. Happy Easter everyone.

u/danielhanchen (+519)Gemma-4 has native thinking, tool calling and is multimodal! Use temperature = 1.0, top\p = 0.95, top\k = 64 and the EOS is `<turn|>`. `<|channel>thought\n` is also used for the thinking t...

u/Altruistic_Heat_9531 (+416)And after a week maybe : "Gemma 4 26B Heretic Uncensored Ablated Claude Opus 4.6 Reasoning Distlled Expanded fine tuned quantized" Sorry to tempting lol