Community & Hardware Reports

Real-world experiences running Gemma models, curated from the community. Browse hardware reports, read the weekly field notes, or search for your setup.

Field Notes

A weekly synthesis of what the r/LocalLLaMA community is reporting about Gemma 4 in real use.

Field Notes — 2026-05-20

A weekly synthesis of what the r/LocalLLaMA community is reporting about Gemma 4 in real use. Curated from the latest Gemma-mentioning posts (17 new or updated since 2026-05-19, 164 total) and their top comment threads. Confidence is medium unless noted, since this is community signal rather than a controlled benchmark.

May 20 sweep, 2026-05-20 00:00 EDT: five developments from this sweep deliver the first Gemma 4 MTP support on mobile hardware via Google AI Edge Gallery, clarify that desktop llama.cpp MTP does not yet support Gemma 4 models, establish Gemma 4 31B Q8 as the consensus daily driver for 48GB VRAM setups, record early RTX 5060 Ti NVFP4 testing with the nvidia/Gemma-4-26B-A4B variant, and provide community evidence on KV cache quantization quality tradeoffs at large context windows.

Gemma 4 MTP arrives on Android — Google AI Edge Gallery v1.0.13 and v1.0.14. Google AI Edge Gallery released v1.0.13 on May 18, adding Gemma 4 Multi-Token Prediction support for on-device inference on Android. The same day, v1.0.14 landed with experimental Model Context Protocol (MCP) support, Pixel TPU hardware acceleration, new built-in skills, and chat history persistence. Community reaction is positive: a top commenter (score 15) states "edge gallery is legit usable now." Practical notes: MTP requires re-downloading the on-device model variants; Pixel TPU acceleration is available on compatible Pixel devices; the MCP integration is marked experimental. A notable community concern (score 14) flags that the app requires agreeing to Google data collection on first launch — a relevant consideration for users running Edge Gallery specifically for offline privacy. The practical picture: Gemma 4 on mobile has reached a materially more usable state with v1.0.13+ — faster token generation via MTP, hardware acceleration on Pixel, and persistent chat sessions. For users with Pixel 8/9 or other Pixel-class devices, this represents the first production-quality path to Gemma 4 MTP inference without desktop hardware. (source, May 19, 44 score, 17 comments)

Desktop llama.cpp MTP improvements land, but Gemma 4 MTP support is not yet included. A community post (May 19, 99 score, 73 comments) links llama.cpp PR #23269, a meaningful MTP performance improvement for models that already support MTP in llama.cpp. A prominently upvoted community comment (score 24) clarifies directly: "Gemma4 MTP is not supported yet." This creates a notable divergence: on mobile, Gemma 4 MTP is production-available through Google AI Edge Gallery v1.0.13; on desktop, the draft-head speculative decoding path in llama.cpp still does not implement the Gemma 4 MTP head. Users reporting gains in this thread are on Qwen 3.6 models. A commenter who combined a 1660 Ti with a 5070 Ti to reach 22GB VRAM reports going "from single digit tps to double digit" — those gains are entirely on Qwen workloads. Practical guidance: update llama.cpp for MTP gains if you run Qwen 3.6 27B or 35B-A3B; do not expect any Gemma 4 MTP speedup on desktop llama.cpp builds until a Gemma 4 MTP head PR merges. Watch llama.cpp PRs for Gemma 4 MTP support as a separate milestone from the already-landed Qwen MTP. Confidence: explicit community statement supported by absence of any Gemma 4 MTP benchmark in the thread. (source, May 19, 99 score, 73 comments)

48GB VRAM sweet spot: Gemma 4 31B Q8 as daily driver, Q6 at 96GB for extended context. A community discussion (May 19, 24 score, 51 comments) on 48GB VRAM use patterns produced two clear Gemma 4 data points. A commenter (score 12) running dual 24GB P40 cards confirms Gemma 4 31B Q8 GGUF as the daily driver, noting it supports a useful context size with the workload split across both GPUs, and leaves enough headroom for an image model and TTS/STT on the remaining VRAM. A former 48GB user who upgraded to 96GB (score 13) reports running Q6 Gemma 4 31B with "a few hundred thousand tokens of context" and substantially faster throughput — treating Q6 at this tier as the next plateau after Q8 at 48GB. The practical picture: at 48GB (whether two P40s, one RTX 6000 Ada, one A6000, or similar configurations), Gemma 4 31B Q8 provides good quality with large context; the primary reason to go higher is extended context beyond 50–100k tokens or the step-up to Q6 quality. Community sentiment: "Q6 gemma4 31b" is the enthusiast target at the 96GB tier, but Q8 at 48GB is well-established and not a compromise worth stressing over. Confidence: anecdotal, small engagement; consistent with broader field notes on this hardware tier. (source, May 19, 24 score, 51 comments)

RTX 5060 Ti community recipes expanded — early NVFP4 Gemma 4 26B testing underway. A follow-up post (May 19, 20 score, 9 comments) from the club-5060ti project reports a cleaned-up benchmark and recipe repository with schema-validated JSON, a static results explorer, and structured lanes for 1x, 2x, and multi-card 5060 Ti configurations. A commenter reports that after getting vLLM working on the 5060 Ti, they "revisited the gemma model since the nvidia/Gemma-4-26B-A4B" NVFP4 variant — the first community signal of NVFP4-format Gemma 4 being tested on a Blackwell budget card. The 5060 Ti's 16GB GDDR7 and Blackwell native FP4 support in principle make it a viable target for NVFP4 Gemma 4 inference at higher throughput than equivalent FP16 or Q4 GGUF builds. Important caveat: this testing is still in early stages; the commenter notes difficulty getting the NVFP4 model to work with MTP on Qwen before switching to Gemma, so the Gemma NVFP4 result is not yet benchmarked. Practical status: 5060 Ti + Gemma 4 26B NVFP4 is an active community experiment, not a confirmed recipe. Follow the club-5060ti GitHub repository for results as they publish. Confidence: low — single commenter, no benchmark numbers yet. (source, May 19, 20 score, 9 comments)

KV cache quantization at large context: Q4_0 quality loss is significant, Q5_1 is the recommended middle ground. Community consensus (May 17, 44 score, 91 comments) on KV cache quantization for developers using large context windows (50k+ tokens) is clear and consistent. The top response (score 44) is direct: "The quality loss at Q4 is pretty severe. I'd recommend the Q5_1 option instead, which was introduced relatively recently. Q8 for K and Q4 for V is another option." A second commenter (score 23) recommends "Model Q6 and up, context cache FP16." A developer-facing finding (score 21): "Lesser quant == more tool call errors. So it depends on harness and model, how good both of them at error recovering. If I can — I don't quantize cache." Practical guidance for Gemma 4 users: if you are running Gemma 4 31B or 26B-A4B at context lengths above 32k and using the model for structured output, tool calling, or multi-turn agentic tasks, avoid Q4_0 KV quantization. Q5_1 is the community-recommended minimum for quality preservation at large context; FP16 KV cache is the reliability ceiling but carries VRAM cost. The Q8K / Q4V hybrid is a middle-ground option if VRAM is the constraint. These findings directly apply to Gemma 4 26B-A4B MoE, which has the architectural capacity for large context but requires careful KV quantization choices to maintain coherence over long sessions. Confidence: community consensus across multiple practitioners. (source, May 17, 44 score, 91 comments)

Field Notes — 2026-05-19

A weekly synthesis of what the r/LocalLLaMA community is reporting about Gemma 4 in real use. Curated from the latest Gemma-mentioning posts (18 new or updated since 2026-05-18, 157 total) and their top comment threads. Confidence is medium unless noted, since this is community signal rather than a controlled benchmark.

May 19 sweep, 2026-05-19 00:00 EDT: six developments from this sweep deliver the first cross-hardware MTP numbers from mainline llama.cpp, surface a high-engagement coding-agent harness demo built on Gemma 4 E4B, add a third community fine-tune to the Gemma 4 31B ecosystem, report ROCm 7.13 Strix Halo optimizations with a real-world AMD stability confirmation, frame Gemma 4 31B as the competitive ceiling for the 27–31B class, and record a practical agentic-coding comparison showing where Qwen 35B-A3B currently edges Gemma 4 26B on one user's setup.

MTP in mainline llama.cpp — measured numbers across Strix Halo and RTX 3090 rigs. PR #22673 (commit 4f13cb7) is confirmed in mainline llama.cpp as of May 16. A community benchmark post (May 18, 39 score) measured Qwen3.6-27B performance on two rigs using `--spec-type draft-mtp --spec-draft-n-max N`: on a Strix Halo (Framework Desktop, ROCm 7.0.2), Q4_K_M went from 11.7 to 21.2 tok/s (1.81×) and Q8_0 from 7.4 to 18.1 tok/s (2.44×); on a single RTX 3090 at 450W (CUDA 12.9), Q4_K_M improved from 38.7 to 59.5 tok/s (1.54×); on a dual RTX 3090 layer-split, Q8_0 went from 25.7 to 55.9 tok/s (2.17×). For MoE comparison: Qwen 35B-A3B gained 1.40× on Strix Halo and 1.24× on the RTX 3090 — confirming the by-now well-established asymmetry where dense models benefit substantially and MoE models gain less because each forward pass is already cheap. These results transfer directly to Gemma 4: expect comparable gains on Gemma 4 31B Dense (the closest architectural equivalent to Qwen3.6-27B dense), and more modest gains on Gemma 4 26B-A4B MoE. The optimal `--spec-draft-n-max` sweet spot varies by rig: uncapped 3090 preferred n=2 at Q4; power-capped 3090 and Strix Halo preferred n=3. Output is described as byte-identical to baseline at the same seed and temperature. Confidence: structured benchmark with multiple rigs and runs; single hardware configuration per rig. (source, May 18, 39 score, 28 comments)

SmallCode: Gemma 4 E4B as a 4B coding agent achieves 87% on self-selected benchmark — harness design matters more than model size. A high-engagement post (May 18, 639 score, 306 comments) introduced SmallCode, a coding agent harness built from scratch for small local models, demonstrating 87/100 tasks passing with Gemma 4 E4B (which activates 4B parameters per token). The author's core insight: standard agents like OpenCode and Cursor assume large frontier models, causing small models to fail on multi-step tool chains. SmallCode compensates with three harness techniques — compound tools that bundle sequential file operations into a single call (cutting failures from multi-step coherence loss in half), an improvement loop that feeds compilation errors back automatically, and a task-decomposition fallback when the model fails twice in a row. The author claims OpenCode scores approximately 75% with 14B models on their benchmark, suggesting the harness closes a meaningful gap. Important caveats: the benchmark is self-selected and not reproducible against a standard suite; top community responses were pointed ("TrustMeBro-2.1-hard," "custom benchmarks is like marking your own homework"). A top comment with 126 score questioned why these improvements aren't integrated into existing tools like OpenCode or little-coder rather than creating another standalone agent. Practical takeaway: the harness techniques — compound tool bundling, lint-driven improvement loops, and decomposition on repeated failure — are generalizable regardless of implementation, and the post confirms Gemma 4 E4B is capable enough for agentic coding when the scaffold compensates for its coherence limits. Treat the benchmark numbers with appropriate skepticism. (source, May 18, 639 score, 306 comments)

Gembrain: third community fine-tune merges seven Gemma 4 31B variants — community reception is skeptical. LLMFan46 published GGUF packaging for Gemma-4-Gembrain-31B-it-uncensored-heretic (May 18, 34 score), created by Nimbz as a merge of seven Gemma 4 31B fine-tunes targeting improved logical and lateral thinking, adherence, prose variety, and creative output. KLD is 0.0186 with 13/100 refusals. Community response is more skeptical than prior heretic-line releases: a top comment (score 27) says "I don't ever trust these weird merged models"; a second (score 17) questions why merging a fine-tune that already merged the base model adds value; a third (score 12) challenges the "boost lateral thinking" claim with no published mechanism. The fine-tune author revealed an internal contradiction: the merge includes a model with 99/100 refusals — the same as the base Gemma 4 31B — which may explain the KLD not moving strongly. This is the third community fine-tune in the Gemma 4 31B ecosystem in one week (Ortenzya for prose, Meromero for creative breadth, Gembrain for thinking and variety); all three are published by the same packaging pipeline (LLMFan46 GGUFs). None have been systematically benchmarked against the base model. If you are experimenting with fine-tuned Gemma 4 31B for creative or reasoning tasks, these give you options to test, but treat confidence as low until independent evaluations appear. (source, May 18, 34 score, 29 comments)

ROCm 7.13 nightly: Strix Halo optimizations merged, RX 6800 Gemma 4 stability confirmed. AMD released ROCm 7.13 Tech Preview (May 17, 50 score, 24 comments) with dedicated optimizations for the Ryzen AI Max 300 "Strix Halo" and new support for additional APU and GPU SKUs including Ryzen AI 7 PRO 360, 350, and other gfx1152-class devices. A commenter with an RX 6800 (score 2) reported running Gemma 4 E2B, E4B, and 26B at various quantizations via lemonade-sdk/llamacpp-rocm "for months" without a single crash — a meaningful real-world stability data point for AMD discrete-GPU Gemma 4 inference under ROCm. A second commenter (score 14) notes ROCm offers better prompt processing than pure Vulkan with a minor generation throughput tradeoff. Caution: an early commenter found ROCm 7.14 preview running slower than expected on Strix Halo when using the latest llamacpp-rocm build, suggesting not all builds in the preview channel are stable. For Strix Halo users: the 7.13 Tech Preview available from TheRock on GitHub is the tested path; the 7.14 preview introduced a regression for at least one user. For AMD discrete GPU (RX 6800 class and newer) users running Gemma 4 via llama.cpp: the lemonade-sdk ROCm build is now reported stable for production use including MTP testing, with the caveat that `-np 1` may be required and mmproj handling needs attention for multimodal use cases. Confidence: community report, single-user stability confirmation; not a controlled benchmark. (source, May 17, 50 score, 24 comments)

Model release anticipation: community positions Gemma 4 31B as one of two competitive options in the 27–31B class. A widely-engaged discussion (May 18, 120 score, 71 comments) forecasting when new local models will drop surfaced a useful framing for Gemma 4's current market position. A top commenter (score 39) stated directly: "Qwen 3.6 27B dense raised the bar very high — only competitor (not for coding, of course) is Gemma 4 31B dense currently." Another commenter (score 41) specifically flagged "Gemma 4 123B or Qwen 3.6 122B would be huge," reiterating the community's ceiling aspiration documented in prior field notes. Several commenters referenced expectations for new Google releases in the following days, potentially at Google I/O. Practical context: Gemma 4 31B Dense is the community's shortlist pick for non-coding general quality at the 27–31B parameter tier. For coding and agentic tasks, Qwen 3.6 27B and 35B-A3B are consistently preferred. This positioning has been stable across the last two weeks of field notes. No product announcement signal exists for a larger Gemma 4 variant. (source, May 18, 120 score, 71 comments)

Agentic coding with 4090+5060Ti: Q8_0 Qwen 35B-A3B at 262k context edges Gemma 4 26B for demo work. A practitioner report (May 18, 29 score, 27 comments) compared Qwen 35B-A3B against Gemma 4 26B-A4B on an agentic coding workload — demo and data analytics scripts via Claude Code's API endpoint pointing to localhost. The author ran Q8_0 Qwen 35B-A3B on a 4090+5060Ti combination with 262,144-token context and reports it is "better than Gemma 4 26B" for their use case, though they note it underperforms in plain chat compared to agentic mode. Community recommendations push back on the Q8_0 KV cache setting: multiple comments (scores 10 and 9) recommend dropping to Q6_K_XL or similar model quant and using unquantized or FP16 KV cache for better coding quality. A notable data point from a top comment (score 10): 70 tok/s on dual RTX 3090 with Qwen 35B-A3B at Q8 and 196k context — useful headroom for context-heavy agentic sessions. Important context: this comparison uses models of different parameter counts (Qwen 35B-A3B activates only ~3.5B parameters per token as an MoE; Gemma 4 26B-A4B activates ~4B). The architectural comparison is approximate. For users with a single 4090 or 4090+5060Ti and primarily agentic coding or demo work at large context: Qwen 35B-A3B at Q6_K_XL or similar with unquantized KV cache is the community's current recommendation over Gemma 4 26B in that specific niche. For general-purpose or non-coding tasks, Gemma 4 26B or 31B remain competitive options. Confidence: anecdotal, single practitioner report. (source, May 18, 29 score, 27 comments)

Field Notes — 2026-05-18

A weekly synthesis of what the r/LocalLLaMA community is reporting about Gemma 4 in real use. Curated from the latest Gemma-mentioning posts (26 new or updated since 2026-05-17, 157 total) and their top comment threads. Confidence is medium unless noted, since this is community signal rather than a controlled benchmark.

May 18 evening sweep, 2026-05-18 02:00 EDT: three developments from this sweep expand the Gemma 4 community fine-tune ecosystem with a second heretic creative model, surface strong community demand for a larger Gemma variant, and document Gemma 4 running on Blackwell hardware as a catalyst for runtime migration.

G4-Meromero-31B: second creative fine-tune for Gemma 4 31B joins Ortenzya in the community ecosystem. Community developer zerofata released G4-MeroMero-31B-Uncensored-Heretic (May 17, 101 score, 49 comments), available via llmfan46's GGUF packaging on Hugging Face. Designed for creative tasks broadly, it complements the Ortenzya fine-tune (natural English prose) released the prior day by the same heretic-packaging pipeline. KLD is 0.0100 with 15/100 refusals. Community discussion highlights the two as potentially complementary: Ortenzya targets prose quality and natural English for translation and RP; Meromero targets creative breadth. A commenter asks directly how the two differ, and the fine-tune author notes Ortenzya improves prose naturalism while Meromero focuses on creative task coverage. Neither model has been systematically benchmarked against the base Gemma 4 31B; treat as community options to trial on your specific creative use case. Important context: the base Gemma 4 31B is already noted by community members as relatively permissive for creative tasks — these fine-tunes address tone and prose quality rather than unlock otherwise-blocked capabilities. Confidence: anecdotal — no head-to-head benchmark against the base model published. (source, May 17, 101 score, 49 comments)

Community appetite for a 124B Gemma is strong — Google shows no present signs of building one. A high-engagement post (May 17, 285 score, 51 comments) imagining a 124B Gemma model surfaced broad community agreement that such a model would be compelling — commenters frame it as "basically an open-weights version of Gemini Flash," which would be among the most capable locally-runnable models available. Top responses are skeptical Google will deliver: "There is no interest in doing that for them" (score 45) and "That would be awesome but I guess there is no interest in such a huge model" (score 34). A commenter jokes that any announced release will turn out to be a ShieldGemma variant. No product roadmap signal exists to support expectation of a 100B+ Gemma model. Practical context for users: Gemma 4's current ceiling is 31B dense (or 26B MoE, equivalent to approximately 4B active parameters per token). Users needing 100B-class locally-runnable models currently have Qwen 3.5 122B, Qwen 3.6 MoE, DeepSeek-V4 Flash, and similar options. This finding is editorial context rather than a hardware or performance data point — it captures the ceiling of current Gemma 4 availability and community aspirations for the model line. (source, May 17, 285 score, 51 comments)

Blackwell 5000-class GPU + Gemma 4 as a migration driver from Ollama to llama.cpp. A user with 64GB RAM and a Blackwell 5000-series GPU (identified as "backwell 5000" in post, likely RTX PRO 5000 or similar) running Gemma 4 and Qwen via Ollama and LM Studio asked for migration advice to get better speeds (May 17, 35 score, 73 comments). Community response: llama.cpp is the direct next step, offering fine-tuned control via flags and measurable throughput gains over Ollama's wrapper overhead. ik_llama.cpp was called "the best, not too hard tool" by a prominent commenter; vLLM wins on 4+ concurrent users but adds setup complexity. A late comment highlights lemonade-server as a drop-in Ollama-compatible endpoint with llama.cpp and vLLM backends. The hardware context adds a useful data point: Blackwell 5000-class cards (estimated 24–48GB GDDR7 depending on variant) running Gemma 4 are a real user segment upgrading from earlier Ollama-based workflows. This confirms that Gemma 4 is in active use by developers who are growing beyond simple Ollama deployments into tunable inference stacks, and that llama.cpp remains the community's first recommendation for that transition. (source, May 17, 35 score, 73 comments)

May 18 sweep, 2026-05-18 00:00 EDT: five developments from this sweep close the long-running open question on MTP's mainline llama.cpp status, deliver the first community benchmarks of officially-merged MTP across Strix Halo and RTX-class NVIDIA hardware, quantify the wall-time picture at production-scale context (85k tokens), and add a cross-platform decode-bandwidth comparison showing where each GPU tier wins on Gemma 4 model sizes.

MTP officially merged into llama.cpp mainline — PR #22673 approved by Georgi Gerganov. After weeks of testing in forks and patched builds, Aman Gupta's PR #22673 landed in llama.cpp's main branch, making Multi-Token Prediction (MTP) available to all users via a standard build without patching. Community reaction was celebratory, with the announcement post reaching 733 score. The key mechanism: MTP adds a lightweight draft head that predicts multiple tokens ahead; accepted drafts expand effective decode throughput at no quality cost (equivalent to standard generation when tokens are rejected). Important context from community commentary: MTP benefits are task-type dependent. Low-entropy outputs — code generation, math, structured text — see 67–90% acceptance rates and meaningful speedups. High-entropy outputs — creative writing, roleplay, diverse prose — see low acceptance rates and sometimes slower wall time due to the dual-prefill overhead. Practical note for immediate testers: at time of first community testing, the official Docker image for llama.cpp server-cuda had not yet picked up the merge; users wanting to test immediately need to build from source with `CUDA_DOCKER_ARCH` set for their GPU. The container will follow shortly. (source, May 16, 733 score, 236 comments)

Strix Halo MTP benchmarks: Qwen3.6-27B gains +111% generation speed; 35B-A3B gains are context-length dependent. The first systematic MTP benchmarks on Strix Halo hardware (Ryzen AI MAX 395, 128GB unified LPDDR5X) reveal a clear asymmetry between 27B dense and 35B-A3B MoE. For Qwen3.6-27B on a 15k-token single-turn task: generation rate went from 7.63 to 16.15 tok/s (+111%), but prompt processing slowed 12.5% due to the MTP head's dual-prefill pass; net wall time improved by 10 seconds (-11.5%). Over a 5-turn chat conversation (~28.5k cumulative context): generation improved +136% on average (7.61 to 17.98 tok/s), and turns 2-5 were 56 seconds faster overall (-26.5%) as the prompt-processing overhead amortized across turns. For Qwen3.6-35B-A3B (MoE): single-turn generation improved +16.5% (48→56 tok/s), but the dual-prefill overhead made wall time 2.33 seconds slower (+11.2%) on the same 15k-turn task. On 5-turn chat, the MoE was roughly tied (+2.3% slower). A post-publish update from the author tested ROCm 7.13 versus Vulkan: ROCm now shows +12% better prompt processing than Vulkan across all tested models — a meaningful reversal from earlier data. The pattern maps directly to Gemma 4: dense 31B benefits substantially from MTP across conversation length; MoE 26B-A4B gains less because its already-high base throughput means MTP overhead costs proportionally more. Confidence: single hardware setup, code-heavy synthetic prompts per community analysis. (source, May 16, 136 score, 57 comments)

RTX 3090 MTP at 85k context: PP halved, TG +85%, net wall time -41%. A real-world production data point from a headless RTX 3090 running Qwen3.6-27B-MTP-Q4_K_M at 128k context demonstrates the wall-time picture that per-metric numbers obscure. On an 85,000-token research task: without MTP, prompt processing ran at 1,050 tok/s and generation at 27 tok/s, completing in roughly 39 minutes. With MTP enabled (`--spec-draft-n-max 3`): prompt processing fell to 600 tok/s (-43%), generation rose to 50 tok/s (+85%), and the same task completed in roughly 23 minutes — a 41% time reduction. The key insight is that decode dominates wall time on large-context generation tasks, so even a significant PP regression translates to large net savings when TG meaningfully improves. This matches the Strix Halo multi-turn result: PP overhead matters most on the first turn of a fresh session; at 85k context with substantial output, the generation benefit compounds. Practical guidance: if your workload generates substantially more than it reads (high TG:PP token ratio), MTP is likely a net win despite the PP regression. Benchmark first if your workflow is PP-heavy — large document ingestion, RAG with many retrieved chunks, or very short output responses where TG never gets to compound. (source, May 17, 44 score, 37 comments)

RTX 5090 first-day MTP community testing confirms dense-vs-MoE asymmetry at the high-end tier. A controlled RTX 5090 MTP test (32GB, built from llama.cpp source commit 4f13cb7 with `CUDA_DOCKER_ARCH=120`, Unsloth Q5_K_M for 27B and UD-Q4_K_M for 35B-A3B, 128k context with flash attention and q8_0 KV cache) confirms what Strix Halo and RTX 3090 data show: dense 27B delivers large generation speedup; MoE 35B-A3B shows smaller fractional improvement because its high base throughput means MTP verification overhead costs proportionally more. A commenter reported 180 tok/s on dual 5060 Ti with MTP and parallel=2, confirming that parallel execution is now fully supported (an earlier limitation requiring parallel=1 is resolved). For Gemma 4 users: the architectural principles transfer directly — expect similar 2x+ generation gains on 31B Dense with MTP on code tasks; expect more modest or task-dependent gains on 26B-A4B MoE. (source, May 17, 203 score, 30 comments)

Multi-platform decode comparison: RTX 5070 beats RTX 3090 on sub-12GB models; 3090 wins on 14–31B band. A community benchmark (55 runs, 3 hardware platforms, 5 backends) compared Strix Halo ROCm, RTX 3090 CUDA, and RTX 5070 Vulkan across a range of model sizes. Key results for Gemma 4: the RTX 5070 (12GB GDDR7, Vulkan) outperforms the RTX 3090 (24GB GDDR6X, CUDA) on models that fit in 12GB — Gemma-4-E4B at 124.3 vs 118.4 tok/s. For models that require more than 12GB, the 3090 wins decisively: Gemma-4-26B-A4B scored 100.5 tok/s on the 3090 versus 43.7 (Strix ROCm) and 47.7 (Strix Vulkan). The Strix Halo systems are not competitive on models that fit in discrete VRAM but offer unmatched capacity for larger models neither discrete card can run at full quality. Community pushback flagged a methodology issue: the 5070 was benchmarked with Vulkan rather than CUDA, which may understate its performance margin over the 3090 on sub-12GB models. Practical guidance for Gemma 4 users: for E4B or E2B on a 12GB budget, the RTX 5070 generation rate is higher than the 3090; for 26B-A4B or 31B Dense, 24GB+ VRAM from the 3090 or higher is required for competitive speeds. (source, May 16, 34 score, 20 comments)

Field Notes — 2026-05-17

A weekly synthesis of what the r/LocalLLaMA community is reporting about Gemma 4 in real use. Curated from the latest Gemma-mentioning posts (12 new or updated since 2026-05-16, 148 total) and their top comment threads. Confidence is medium unless noted, since this is community signal rather than a controlled benchmark.

May 17 sweep, 2026-05-17 00:00 EDT: eight developments from this sweep surface a new fine-tune for creative writing and RP use cases, confirm a community-derived power-efficiency curve for multi-3090 inference rigs, validate Gemma 4 E4B's native audio transcription capability, extend MTP dense-vs-MoE evidence to million-token scale, document enterprise team deployment patterns, establish that thinking mode hurts translation tasks, add Terminal-Bench 2.0 context for Gemma 4 positioning, and resolve the GPU-vs-RAM debate for MoE inference.

Ortenzya: first quality creative writing fine-tune for Gemma 4 31B. Community developer LLMFan46 released `gemma-4-Ortenzya-The-Creative-Wordsmith-31B-it-uncensored-heretic` (May 16, 25 score), a Gemma 4 31B fine-tune targeting natural English prose quality, creative writing, translation fidelity, and RP use cases. Available in safetensors and GGUF formats. The fine-tune addresses a community-noted weakness in base Gemma 4 31B: while the model produces correct and concise prose, some users find the writing style lacks naturalness in extended creative or narrative output. Key community finding from the discussion: the base Gemma 4 31B is already "uncensored asf" for most creative use cases — the fine-tune's value is specifically prose quality and natural English, not primarily safety softening. A commenter notes the fine-tune also addresses "softening" (toned-down language without hard refusal), which matters for translation and RP tasks where the original source material has strong tone or content. Practical guidance: if you find base Gemma 4 31B too dry or stiff for creative writing, Ortenzya is the first community option to try. Confidence: anecdotal — no systematic quality benchmark against base model yet. (source, May 16, 25 score, 16 comments)

4x RTX 3090 power efficiency curve: 220W per GPU is the sweet spot — memory-bandwidth-bound decode confirmed. A systematic power-limit benchmark on a 4x RTX 3090 rig running Qwen3.6-27B at FP16 via vLLM TP=4 (May 15, 38 score; updated comments May 17) measured output generation speed and prompt-processing throughput across power limits from 200W to unrestricted (350–390W). Key result: reducing from unrestricted to 220W drops output generation from 29 to 27 tok/s while pushing efficiency from 0.77 to 1.13 tok/joule. Below 220W, both efficiency and throughput fall together (200W: 24 tok/s, 1.11 t/J). A top commenter (score 9) provided the architectural explanation: "output generation speed is flat from 300W down to 220W because decode is memory-bandwidth-bound, not compute-bound. 3090 GDDR6X bandwidth barely changes with power limit, so you hit the same ~29 t/s regardless." Prompt processing drops proportionally because prefill IS compute-bound. The hardware setup uses a PCIe Gen 3 bifurcated topology (x16/x8/x8/x4); the x4 slot is a known bottleneck, and a P2P driver patch (`github.com/aikitoria/open-gpu-kernel-modules`) that supports mixed NVLink/PCIe topologies was flagged but tested without improvement on this specific PCIe-3-limited rig. Practical guidance for multi-3090 (and general multi-GPU NVIDIA) inference: power-limiting to ~220W per card costs ~7% in output throughput but saves ~37% in power draw. The decode floor is bandwidth-limited, so power won't buy you output speed beyond ~300W. Test prefill-heavy workflows first — prefill does benefit from compute headroom and will degrade proportionally below ~250W. Anecdotal confidence (single setup, Qwen workload; principles are general). (source, May 15, 38 score, 54 comments)

Gemma 4 E4B confirmed for short multi-lingual audio transcription — not a Whisper replacement for long audio. A community practitioner report (May 12, 22 score, 9 comments) validated Gemma 4 E4B's native audio input for transcription. Key findings: E4B processes short audio clips accurately in multiple languages including foreign languages, without additional STT tooling. A top commenter confirms active use for voice assistant STT, noting the model's promptability is a practical advantage over fixed-vocabulary Whisper — you can instruct E4B to focus on specific terms, format output in a particular way, or filter filler words in the prompt. Practical limits: for audio exceeding roughly one hour, Whisper or a dedicated STT model remains necessary; E4B's context window constrains continuous transcription. Multiple commenters also noted that E2B may support the same audio input path via LiteRT-LM. This is the first direct community report of E4B transcription as a primary use case. Combined with the Jetson Orin NX SUPER finding (May 16 sweep), E4B is now documented as a viable complete voice pipeline component: STT natively, inference on-device, TTS via Piper or similar, with no cloud dependency. Confidence: anecdotal, small engagement, no controlled accuracy comparison against Whisper published. (source, May 12, 22 score, 9 comments)

MTP dense-vs-MoE finding confirmed at 1M token scale — dense 27B gains ~1.5x, MoE 35B gains under 10%. A practitioner who spent over 1 million tokens across three sessions building a pygame project with Qwen 3.6 MTP models (May 15, 127 score, 78 comments) directly confirms the MTP task-type dependency at production usage scale: the dense Qwen3.6-27B model with MTP gained approximately 1.5x tok/s; the MoE 35B-A3B gained less than 10%. A commenter adds a critical caveat: the test used `q4_0` KV quantization — already warned in earlier field notes to carry meaningful quality risk on long-context tasks. For Gemma 4 users: this is further confirmation that MTP is primarily valuable on dense models (Gemma 4 31B, Qwen 3.6 27B dense) and delivers marginal gains on MoE variants (Gemma 4 26B-A4B, Qwen 3.6 35B-A3B). The result has now been independently confirmed by the 300-test systematic analysis (May 10), the M4 Max measured results (code: 1.53x; prose: wash; JSON: 0.50x), and this million-token practitioner run. (source, May 15, 127 score, 78 comments)

Enterprise server for 7-person team: 2x RTX 6000 Blackwell MaxQ with Proxmox and vLLM — community recommends testing cloud first. A team setting up local inference for a 7-person company (May 15, 20 score, 58 comments) drew a substantive community discussion on small-team deployment patterns. The most-upvoted practical setup: a Gigabyte server with 2x RTX 6000 Blackwell MaxQ (~26k€), running Proxmox with an LXC container using Debian 13 + NVIDIA drivers + CUDA 13.2, serving Gemma 4 and Qwen models via vLLM. A key community concern: the commenter with this setup is running llama.cpp instead of vLLM on two 6000s — a top comment calls this "leaving so much performance on the floor." For multi-GPU inference of 30–35B class models, vLLM tensor-parallel is the right backend choice. The second-highest-voted response argues for API/rental first: "Use cases can quickly outgrow on-prem resources. Give people generic access, watch what they do for a month or two, then decide." A third pattern: a 1x RTX Pro 6000 with large RAM to run Kimi K2.6 for 1-2 power users who need a genuinely strong coding model. Hardware and architecture recommendations for small-team deployment: TP=4 vLLM on multi-GPU for 35B class; single high-VRAM GPU with large RAM for flexibility; validate use case demand before committing to on-prem hardware at this scale. Confidence: community discussion, multiple experienced practitioners, not a benchmark. (source, May 15, 20 score, 58 comments)

Thinking mode consistently hurts Gemma 4 translation — direct pass is preferred, two-pass is useful only for complex edge cases. Community consensus (May 13, 22 score, 17 comments) on using Gemma 4 for translation with thinking mode enabled is clear: thinking mode "wastes a lot of context thinking about it and also ends up overthinking it," and turning thinking off produces better results for direct translation tasks. A more nuanced practitioner approach from the comments: use a first pass at temperature 0 with no thinking for direct translation, then a second optional reasoning pass to review flagged segments, with KV cache prefix reuse on the second pass to minimize latency. A dedicated translation fine-tune (Qwen3-Translation, Tower) remains the community recommendation over generalist + thinking for high-volume or professional-quality needs. Practical guidance: disable thinking mode for Gemma 4 translation; reserve the optional review pass for idioms, jargon, or segments where you need explicit justification. This is consistent with the token efficiency picture — Gemma 4 is concise and direct, and adding thinking overhead to tasks that don't require multi-step reasoning adds cost without quality gain. (source, May 13, 22 score, 17 comments)

Terminal-Bench 2.0: Qwen 3.6 35B-A3B scores 24.6% and beats Gemma 4 31B on terminal coding — expected gap given dense vs MoE. The public Terminal-Bench 2.0 leaderboard now includes Qwen3.6-35B-A3B at 24.6% (±3.2) with the little-coder scaffold, placing it above Gemini 2.5 Pro on Gemini CLI (19.6%) and Qwen3-Coder-480B (23.9%). Community commentary (May 16, 243 score, 57 comments) is broadly positive but includes an important framing note: comparing Qwen 3.6 35B-A3B (MoE) against Gemma 4 31B Dense is not architecturally equivalent — the MoE uses 3.5B active parameters while the dense model uses all 31B. A commenter notes: "Gemma 4 31B is a dense model. Would not be fair to compare the Qwen MoE to it. The better comparisons would be between Qwen 27B dense and Gemma 31B." Gemma 4 31B has not yet been officially benchmarked on Terminal-Bench 2.0 as of this writing. For readers using Gemma 4 for terminal/agentic coding: this benchmark suggests Qwen 3.6 MoE leads on this specific leaderboard task; however, the community also consistently reports Gemma 4 produces higher-quality output per token on focused tasks (see the Packman benchmark and three.js creative coding findings). Neither model has a clean win across all coding task patterns. (source, May 16, 243 score, 57 comments)

GPU vs RAM debate: VRAM wins on throughput, but Gemma 4 MoE is the best case for high-RAM inference. A community debate (May 15, 63 score, 81 comments) on whether "rich RAM / poor GPU" is a viable strategy produced two clear data points. A practitioner with both 192GB RAM and a 5090 reports using RAM only for testing new models, avoiding it otherwise: "The speed gain is just too important for the too small gain on accuracy." A separate commenter (512GB across 128GB devices) notes that the Gemma 4 26B MoE and Qwen 3.6 27B dense models have changed the calculus, making 30B-class dense-equivalent quality achievable on consumer VRAM for the first time. The analytical breakdown by a third commenter: sub-7B models must be task-specific; 24–35B dense is the minimum for general-purpose quality; MoE in the 100B parameter class is viable at 128GB+ RAM with hybrid offload. The Gemma 4 26B-A4B MoE architecture — activating only 4B parameters per token — is explicitly identified as the strongest argument for the high-RAM approach: its MoE sparsity means CPU RAM throughput is not penalized as severely as a dense 26B model would be. For Gemma 4 users with a mid-range GPU (16–24GB) and 64–128GB RAM: the 26B-A4B with `--n-cpu-moe` offload is the architecture that most justifies the RAM-over-GPU strategy; the 31B Dense requires VRAM to run without significant throughput penalties. (source, May 15, 63 score, 81 comments)

Field Notes — 2026-05-16

A weekly synthesis of what the r/LocalLLaMA community is reporting about Gemma 4 in real use. Curated from the latest Gemma-mentioning posts (13 new or updated since 2026-05-15, 147 total) and their top comment threads. Confidence is medium unless noted, since this is community signal rather than a controlled benchmark.

May 16 sweep, 2026-05-16 00:00 EDT: three developments from this sweep surface the lowest-power embedded hardware data point to date for Gemma 4, extend context-length degradation evidence in the budget GPU tier, and confirm NVIDIA's own NVFP4 quantization path for Blackwell hardware.

Gemma 4 E4B confirmed working on Jetson Orin NX SUPER 16GB — 14–15 tok/s fully offline with 200ms cached TTFT. A community robotics project (May 15, 419 score, 61 comments) detailed a fully offline suitcase robot running Gemma 4 E4B at Q4_K_M via llama.cpp with q8_0 KV cache and flash attention, 12K context, on a Jetson Orin NX SUPER 16GB. Sustained generation: 14–15 tok/s. Cached TTFT: ~200ms after a prompt structure optimization that moved persona and tool definitions to the top of the system block, history to the middle, and volatile sensor/vision data to the bottom of the most recent user turn — a disciplined ordering that kept the prefix cache stable and dropped TTFT from multi-second to 200ms. A key benefit observed: Gemma 4's native vision capability eliminated the separate BLIP subprocess required in prior versions, simplifying the pipeline. The author uses SenseVoiceSmall for STT and Piper for TTS; all inference runs on-device with no network interface. This is an anecdotal single-device report, not a reproducible benchmark. The Jetson Orin NX SUPER 16GB is a specialist embedded GPU with ~204 GB/s memory bandwidth; expect similar results on comparable Jetson-class hardware, and lower results on Orin NX 16 (not SUPER). (source, May 15, 419 score, 61 comments)

Long-context throughput and quality degradation on the $200 GTX 1080 setup — short-context numbers don't transfer. New community comments on the budget GTX 1080 inference guide (now 97 score, 49 comments as of May 16) quantify two independent degradation curves for the 8 GB VRAM / 32 GB RAM + Gemma 4 26B-A4B setup. Throughput: tok/s drops from ~30 at 4k context to ~20 at 50k, matching the expectation that KV cache fills VRAM and forces more expert weights to page over PCIe. Quality: a separate commenter reports retrieval-heavy tasks degrade meaningfully past 32–64k context, well before the advertised 128k limit — the visible tok/s curve is not the only performance cliff. The commenter's framing: "there's a quieter second curve underneath" where output quality erodes on retrieval tasks even as generation speed appears acceptable. This tightens the practical context guidance: the GTX 1080 + TurboQuant setup is usable at 4–16k context for routine chat and code; treat 32k+ as experimental territory where output reliability is unconfirmed. The MTP fix (`--override-tensor-draft "token_embd\.weight=CUDA0"`) and prefill speedup (`-ub 4096+`) remain valid tuning regardless of context length. (source, May 13, 97 score, 49 comments)

NVIDIA released its own NVFP4 quantization of Gemma 4 26B-A4B for Blackwell GPUs. NVIDIA published `nvidia/Gemma-4-26B-A4B-NVFP4` on Hugging Face (post 1t0i18e), a first-party NVFP4 quantization targeting the RTX 5090 (SM120, Blackwell). NVFP4 is a GPU-native 4-bit floating point format specific to Blackwell and newer NVIDIA architectures; it is not GGUF Q4 and does not run on older consumer hardware. A separate community report from a Radeon 9060 XT 16GB user achieved 25.9 tok/s on an IQ4_NL GGUF of the same model via llama.cpp, providing a comparable data point from the AMD side (anecdotal, single report). Practical guidance: if you have an RTX 5090, the NVIDIA NVFP4 model is worth testing over AWQ-4bit for throughput; if you are on older NVIDIA or AMD hardware, standard GGUF quantizations remain the mainstream path. The RTX 5090 DFlash speculative decoding benchmark from May 8 (600 tok/s peak) used an AWQ-4bit model, not NVFP4; NVFP4 throughput comparisons have not yet been published by the community. (source, score 32, 11 comments)

May 15 sweep, 2026-05-15 00:00 EDT: two developments from this sweep refine KV cache quantization guidance for vLLM serving and extend the budget GPU picture with new community benchmark methodology discussion.

FP8 confirmed as the best KV cache quantization default for vLLM — TurboQuant variants offer a VRAM tradeoff, not a free lunch. A first comprehensive study of TurboQuant against BF16 and FP8 in vLLM (May 14, 64 score, 17 comments; source article) settles a frequently debated question for constrained-VRAM Gemma 4 deployments. Key conclusions: FP8 via `--kv-cache-dtype fp8` provides 2x KV cache capacity with negligible accuracy loss — it matches BF16 on most throughput and latency metrics while meaningfully improving them when VRAM is the binding constraint. TurboQuant k8v4 provides only 2.4x compression (vs FP8's 2x) but consistently degrades throughput and latency; the marginal extra compression is not worth the performance cost. TurboQuant 4bit-nc is more practical: it helps under severe VRAM pressure but trades accuracy, latency, and throughput. TurboQuant 3bit variants show meaningful accuracy drops on reasoning and very long-context tasks. A commenter notes that FP8 KV numbers "are obviously worse" compared to unquantized — users with ample VRAM should keep KV cache unquantized; FP8 is the right default only when VRAM is genuinely constrained. A second commenter provides a reassuring data point: running Gemma 4 at 128k context with TurboQuant 2-3 in a production-style load (large PDF ingestion) produced coherent answers across beginning, middle, and end of the document. These TurboQuant results apply specifically to vLLM with its PagedAttention KV management; llama.cpp's TurboQuant/RotorQuant KV implementation behaves differently and should be benchmarked separately. Critical caveat: the study benchmarks only FP8 and TurboQuant variants; no Q4 comparison is included, drawing criticism that the study misses the primary VRAM-constrained use case. (source, May 14, 64 score, 17 comments)

GTX 1080 Gemma 4 guide attracts community discussion on long-context benchmarking methodology. The May 13 budget inference guide (score climbed from 46 to 97, now 47 comments) prompted a useful community exchange about how to properly evaluate large-context performance. The original benchmarks used small prompts (under 2,000 tokens) despite reserving 128k context. New comments recommend using a large Reddit thread (40k+ tokens in JSON or markdown) as a more realistic long-context stress test — common domain content not baked into training data. The guide author is investigating a standardized benchmarking approach. Practical implication: the 20–24.5 tok/s figures for the GTX 1080 setup should be treated as short-context baselines only; actual throughput at meaningful long-context prompts will be lower because KV cache fills VRAM and forces more CPU round-trips. The `--override-tensor-draft "token_embd\.weight=CUDA0"` MTP fix remains valid regardless of prompt length. (source, May 13, 97 score, 47 comments)

May 14 sweep, 2026-05-14 00:00 EDT: five developments from this sweep extend the budget hardware picture for Gemma 4 MoE, surface practical GPU power tuning, confirm prefill tuning for partially-offloaded models, and add new guidance on vLLM vs llama.cpp for single-user workloads.

Gemma 4 26B-A4B running at ~24 tok/s on a $200 secondhand GTX 1080 machine — a new floor for budget inference. A detailed guide (May 13, 46 score) demonstrates Gemma 4 26B-A4B and Qwen 3.6 35B-A3B running on an i7-6700 / GTX 1080 (8 GB VRAM) / 32 GB RAM machine costing ~$200 secondhand via llama.cpp with TurboQuant/RotorQuant KV cache quantization. Results at Q4_K_M with 128k context: Gemma 4 26B-A4B (no MTP) ~20 tok/s with `--n-cpu-moe 20`, TurboQuant KV turbo3 on both K and V caches; after fixing the MTP token embedding table placement, ~24.5 tok/s with `--override-tensor-draft "token_embd\.weight=CUDA0"`. The key mechanism: TurboQuant/RotorQuant KV cache compression fits the KV cache within 8 GB VRAM even at 128k context, while `--n-cpu-moe` offloads the cold MoE expert weights to system RAM, streaming them over PCIe as needed. The GPU sits at ~40-50% utilization; the bottleneck is PCIe bandwidth. Important caveat from the post: the GTX 1080 test used small prompts (under 2,000 actual tokens despite 128k reservation); a commenter notes that larger real-world prompts at 128k context will degrade throughput further as VRAM is tighter with large KV. MTP barely helped out of the box (~5% gain) because Gemma 4's tied LM head forces token embedding lookups on the CPU by default; the fix flag above moves the embedding table to GPU. This is an anecdotal data point, not a reproducible benchmark baseline, and TurboQuant is not in mainline llama.cpp. But directionally, a ~$200 machine can now run a 26B MoE at interactive speeds — a meaningful lower bound for the local Gemma 4 story. (source, May 13, 46 score, 10 comments)

Cut GPU power limit to 40% TDP — no throughput loss for LLM decode, meaningful savings on power, heat, and noise. A viral post (May 12, 709 score, 198 comments) benchmarked an RTX 4090 running Qwen3.6-27B-UD-Q4_K_XL with `nvidia-smi -pl` set to various power limits. Result: reducing to approximately 40% of rated TDP (~100W for a 4090) preserves generation throughput almost identically while cutting electricity draw, heat output, and fan noise proportionally. Multiple RTX 5090 owners in the comments independently validated the finding at their own hardware (860mV/2500MHz, ~360W, with only ~12% TPS loss at the absolute voltage floor). The mechanism: LLM decode is memory bandwidth bound, not compute bound. Once the GPU's memory bus is the bottleneck, reducing compute frequency and voltage has minimal effect on bandwidth-limited operations. The result holds for any consumer NVIDIA GPU running inference workloads including Gemma 4. Practical guidance: reduce power limit incrementally with `nvidia-smi -pl` and monitor generation speed — you can reclaim meaningful electricity savings at almost no quality cost. This is a well-established finding now backed by community data across multiple GPU generations. (source, May 12, 709 score, 198 comments)

Raising llama.cpp `-ub` to 4096-8192 gives ~5.5x prefill speedup for partially CPU-offloaded MoE models. A guide (May 12, 112 score, 53 comments) discovered that increasing the micro-batch size (`-ub`) from llama.cpp's default 512 to 4096 or 8192 dramatically improves prompt processing throughput for `--n-cpu-moe` partially-offloaded models. Measured on an RTX 3090 with a 120B model: prompt processing improved from ~380 tok/s at default `-ub 512` to ~2091 tok/s at `-ub 8192` — a ~5.5x gain. Generation speed was nearly unchanged (32.3 → 30.1 tok/s, ~7% regression). The mechanism, debated in comments: either amortizing PCIe transfer overhead across more tokens (reducing per-transfer round-trip cost) or reducing GPU kernel launch overhead by saturating the attention/router on fewer, larger batches. Both explanations are consistent with the observation. The default 512 exists because it's a safe conservative value for low-VRAM cards that have little headroom for compute workspace spikes. Users with spare VRAM should tune upward and stop when either VRAM OOM or generation speed starts to regress. This applies directly to Gemma 4 26B-A4B when partially offloaded — pair with `--n-cpu-moe` adjustment to keep the run within VRAM at the chosen `-ub`. (source, May 12, 112 score, 53 comments)

vLLM vs llama.cpp for single-user workloads: confirmed equivalent at low concurrency, vLLM wins at 4+ concurrent users. A community discussion (May 12, 75 score, 91 comments) produced a clear practical consensus. vLLM adds meaningful value when: (1) concurrent batch inference is in play — vLLM allocates VRAM per-batch as context grows while llama.cpp must pre-allocate max-context KV VRAM at launch; (2) tensor-parallel multi-GPU/multi-node serving is needed (e.g., Qwen 397B across two DGX Sparks). vLLM also supports MTP for Gemma 4 and Qwen3.6 already, while llama.cpp MTP is still in a patched fork. For single-user non-batched local use, llama.cpp remains simpler with equivalent per-query throughput. CUDA prompt processing is faster in vLLM regardless of batch size. AMD Lemonade now ships vLLM ROCm as a built-in experimental backend. This confirms and sharpens the earlier guidance: if you are a solo user running interactive chat or coding sessions, llama.cpp or LMStudio is fine; switch to vLLM when you need to serve multiple concurrent users or run tensor-parallel inference on model weights too large for one GPU. (source, May 12, 75 score, 91 comments)

Docker images simplify llama.cpp MTP deployment — confirmed +34% throughput on RTX 3090. A community developer (May 13, 63 score, 16 comments) released Docker images pre-built from the llama.cpp MTP development branch, removing the barrier of building from source. A commenter reports +34% throughput gain on an RTX 3090 after switching. The images track recent MTP branch improvements including image support and bug fixes. A commenter asks whether Gemma 4 is supported; the Docker images cover the same model classes as the underlying MTP PR (primarily Qwen3.6 for now). For Gemma 4 MTP, the mainline llama.cpp PR #22673 is still in review; until it merges, the AtomicBot-ai patched fork remains the llama.cpp path for Gemma 4 MTP specifically. Recommended flag addition from comments: `--min-p 0.0` (default 0.1 can interfere with speculative decoding). (source, May 13, 63 score, 16 comments)

May 13 sweep, 2026-05-13 00:00 EDT: five developments from this sweep extend the MTP vs DFlash picture, surface a supply chain signal for Apple Silicon buyers, add new practical limits for Gemma 4 E4B in code use cases, and document a home-server hardware comparison from someone who owns both the Strix Halo and DGX Spark.

First controlled head-to-head benchmark of Gemma 4 MTP vs DFlash on a single H100 — MTP wins at concurrency. A community benchmark (May 12, 62 score, 22 comments) ran Gemma 4 31B Dense and 26B-A4B MoE against both MTP and DFlash on a single H100 80GB using vLLM and NVIDIA's SPEED-Bench dataset (880 prompts, 11 categories). Results for 31B Dense: at concurrency 1, MTP hit 125.3 tok/s (3.11x over baseline 40.3) and DFlash hit 122.1 tok/s (3.03x). At concurrency 16, MTP reached 953 tok/s versus DFlash's 725 tok/s versus baseline 375 tok/s — a meaningful gap in favor of MTP at higher concurrency. The architectural explanation from commenters: DFlash generates a larger speculative batch via diffusion but has lower acceptance rate per token; MTP is autoregressive with higher per-token acceptance, so at scale its advantage compounds. Practical guidance: at concurrency 1 the two methods are nearly equivalent; at concurrency 4+ for serving multiple users, MTP outperforms DFlash by a widening margin. This is the first benchmark to quantify the concurrency dimension — prior guidance focused on single-user latency where both methods were close. DFlash's lower acceptance rate with the diffusion-based approach means more compute spent on rejected tokens under load. Still vLLM-only for both methods; no mainline llama.cpp path yet. (source, May 12, 62 score, 22 comments)

Apple removes M3 Ultra 256GB Mac Studio — M5 expected, but supply chain is under stress. Apple pulled the M3 Ultra 256GB Mac Studio configuration from its online store (May 9, 462 score, 132 comments). The top community read: M3 is being phased out ahead of an M5 Mac Studio launch, not a deliberate memory cap decision. Technical context: M3 and M5 use incompatible DRAM types (LPDDR5-6400 vs LPDDR5x-9600), so M3 chip stock is not convertible to M5 builds. An independent complicating factor: a Samsung DRAM worker strike cut production capacity by 58% on one shift. Community concern about M5 Ultra memory configurations is real but largely speculative — no M5 Ultra specs have been announced. Practical impact for Gemma 4 Apple Silicon users: the M3 Ultra 256GB, which was the best available option for running Gemma 4 31B Dense at full BF16 precision with context headroom, is no longer orderable. Anyone actively planning a high-memory Apple Silicon build for Gemma 4 should wait for M5 Ultra pricing and configuration announcements before committing. The 192GB M3 Ultra (if still available) or used M2 Ultra 192GB remain the current options if you need maximum unified memory now. (source, May 9, 462 score, 132 comments)

Gemma 4 E4B produces poor results for code autocomplete (infill) — use Qwen 2.5 Coder 7B instead. A practitioner post (May 12, 36 score, 30 comments) sharing a working RTX 5080 16GB + 64GB RAM coding setup explicitly evaluated Gemma 4 E4B for code autocomplete infill alongside Qwen 3.5 9B/4B. The author's conclusion: E4B and the Qwen 3.5 small models "produce weird suggestions" for infill and were rejected in favor of Qwen 2.5 Coder 7B Q6_K_L, which runs at instant-feeling speeds on 8GB VRAM. The same setup uses Qwen 3.6 35B-A3B at Q8 for agentic coding tasks (the higher quant is important; the author notes Q4 is not usable for agentic work). This is the first practitioner report directly comparing Gemma 4 E4B against alternatives for the code autocomplete fill-in-the-middle (FIM) use case. Confidence: anecdotal, single data point. But it aligns with the known limitation that E4B's instruction-following strength does not automatically transfer to the FIM pattern, which requires a different training signal. The guidance: do not assume E4B works for code infill — test it on your IDE and task type before committing. (source, May 12, 36 score, 30 comments)

"Decoupled Attention from Weights" for Gemma 4 26B: community verdict is skeptical. A post (May 6, 40 score, 27 comments) announced a technique to "split attention (a couple of GB) onto local machine and weights onto a cheap Xeon" for Gemma 4 26B, with a GitHub repository (larql/vindex). Community response was immediate and critical. Top comments: the technique is reported to run approximately 23x slower than standard inference; the underlying mechanism is equivalent to llama.cpp's existing RPC multi-node functionality with network latency added; sequential layer dependencies prevent any parallelism benefit from splitting attention vs. weights. The post author acknowledged the concerns and withdrew from further claims pending personal experimentation. The technique remains unvalidated as a practical inference improvement. This matters for lab readers who may have seen the post circulate with excited framing: there is no new local-inference breakthrough here. For distributed inference, llama.cpp RPC and vLLM expert-parallel deployment are the established options. (source, May 6, 40 score, 27 comments)

Strix Halo 128GB vs DGX Spark for home Gemma 4 inference — owner of both says Spark wins on throughput, degrades less at long context. A community question post comparing the Framework Desktop (Ryzen AI Max+ 395, 128GB unified memory, $3,388) against the Asus Ascent GX10 DGX Spark ($3,500) for running Gemma 4 31B and 26B-A4B as a local LLM server drew 91 comments (May 11, 21 score). The decisive data point: a commenter (score 31) who owns both systems reports "Spark has much faster GPU which results in faster prompt processing speeds. Also, the performance degrades less on Spark as context grows." Community consensus aligns on a clear split: Spark for pure LLM inference; Strix Halo for general-purpose or hybrid workloads where repurposability (standard x86/amd64 Linux, GPU gaming, everyday tasks) matters. The counterpoint for Strix Halo from a top commenter (score 41): "Definitely Ryzen 395, as it's a standard x86/amd64 machine that can always be repurposed and will never lose drivers or compatibility with new operating systems. Nvidia on the other hand has a history of abandoning their proprietary ARM SoC." DGX Spark runs ARM Ubuntu with a DGX software package; the same commenter who owns both notes Fedora also works with some tweaks. Practical guidance for anyone in the $3,400–$3,500 range targeting Gemma 4 31B: the DGX Spark delivers faster discrete GPU throughput and better long-context scaling; the Strix Halo 128GB unified memory trades some raw inference speed for a more flexible, repurposable machine. Neither is a clear wrong choice; the tradeoff is inference specialization vs. general-purpose longevity. Anecdotal confidence; the owner-of-both data point is the strongest signal. (source, May 11, 21 score, 91 comments)

May 12 sweep, 2026-05-12 00:00 EDT: four findings from this sweep extend the inference backend, edge deployment, and small-model picture.

ExLlamaV3 gains Gemma 4 support and DFlash — up to 2.51x coding speedup on consumer NVIDIA GPUs. ExLlamaV3, the successor quantization and inference engine from turboderp, has reached a run of rapid updates directly relevant to Gemma 4. Version 0.0.29 added Gemma 4 model support; version 0.0.31 added DFlash speculative decoding with measured results (from the post, testing on RTX 3090 and 4090): coding tasks 55.98 → 140.61 tok/s (2.51x), agentic code 55.98 → 140.61 tok/s (2.51x), translation 58.11 → 75.73 tok/s (1.30x), creative writing 59.10 → 89.19 tok/s (1.50x). Version 0.0.32 added further model optimizations. ExLlamaV3 requires RTX-class Nvidia CUDA hardware. Unlike the vLLM DFlash path (server-only), ExLlamaV3 is accessible via a Python API suited to single-user setups. The coding speedup is consistent with the vLLM DFlash benchmark (2.56x on RTX 5090); creative writing DFlash improvement is smaller but still positive, unlike MTP which can slow creative tasks. For single-GPU Nvidia users who want DFlash without a full vLLM server deployment, ExLlamaV3 is now a viable path. Confidence: the throughput numbers come directly from the community post; the comparative claim vs vLLM requires independent verification. (source, May 11, 141 score, 61 comments)

Gemma 4 E4B confirmed best-in-class at the 2–4B tier — but quantization quality matters significantly. A community thread asking "what's the current best small model?" (May 11, 26 score) drew strong consensus: Gemma 4 E4B is the top recommendation at the ~3B parameter class, with multiple independent reporters calling it "hands down the best, no arguing." A first-hand practitioner report adds an important caveat: Q8_0 quantization is "kinda bad and mid" for E4B — Q8_XL or BF16 is "night and day" better on tested tasks. A separate commenter confirms E4B "never loops" and "effectively uses the whole 131k context window" — the zombie-loops pattern documented in earlier field notes for larger quantized models does not appear on E4B. The consensus best competitors for the 3B class are smollm3, Granite 4.1, LFM2/2.5, and Qwen 3.5 4B. Community read: Gemma 4 E4B for general instruction following; Qwen 3.5 4B for tasks where a reasoning chain is needed. If running E4B, Q8_XL or BF16 is strongly preferred over Q8_0. Anecdotal confidence. (source, May 11, 26 score, 44 comments)

First documented in-browser Gemma 4 deployment controls a physical robot over WebSerial. A community developer shared a demo of Gemma 4 running fully offline in a browser via Transformers.js on WebGPU, processing camera frames and sending commands to a Reachy Mini robot over the WebSerial API (May 11, 49 score). The model never contacts a server: inference happens entirely on the client GPU via WebGPU, and motor commands go directly over USB/serial via the browser's WebSerial interface. A commenter notes the architectural benefit: "model sees camera/frame state, JS does the motor command, nothing leaves the machine." This is the first documented Gemma 4 use case in the browser-as-inference-engine + physical-actuator pattern, enabled by Transformers.js and the small footprint of Gemma 4 E-series models. The specific variant was not named; the constraint is WebGPU VRAM, which limits practical options to E2B or E4B. No throughput figures were published; treat as a proof-of-concept rather than a production guidance baseline. (source, May 11, 49 score, 9 comments)

Practitioner pattern: Gemma 4 26B for quick interactive fixes, Qwen 3.6 35B for long-context refactoring. A high-engagement discussion thread on Qwen 3.6 35B-A3B (May 11, 333 score, 103 comments) contains a direct Gemma/Qwen split from a practitioner who runs both: "Gemma 26B in thinking mode for quick code fixes and chats, Qwen 35B in thinking mode for longer contexts and refactoring. Qwen 35B rambles on and on before it spits out the final output so I only use it for tasks that I don't mind waiting for." This two-model hybrid pattern — Gemma 4 for latency-sensitive interactive tasks, Qwen 3.6 for depth-first long-context work — is now documented by multiple independent practitioners across several weeks of field notes. The pattern holds whether the user prioritizes speed, quality, or token efficiency: Gemma 4 26B finishes short tasks fast and concisely; Qwen 3.6 35B is more thorough but verbose. A second data point from the same thread: an RTX 3090 24GB + 64GB RAM user (Beelink eGPU dock) reports Qwen 3.6 35B-A3B "blazing fast" with llama.cpp after tuning settings, switching from LM Studio, with Gemma 4 26B as the secondary model for interactive chat. (source, May 11, 333 score, 103 comments)

May 11 sweep, 2026-05-11 00:00 EDT: four findings from this sweep extend the MTP and creative-coding picture.

MTP task-type dependency confirmed by systematic 300-test analysis — dense models benefit far more than MoE. A careful benchmark author published the most rigorous community MTP analysis to date (May 10, 67 score, 24 comments). Over 300 test runs covering four task types, five quantization levels, three temperature values, and two MTP quant settings produced a clear finding: F16 + MTP nearly triples coding-task speed; Q4_K_M + MTP slows creative writing output. Temperature and MTP quant have negligible impact; task type is the only factor that matters. An RTX 5090 user in the comments reported ~70% acceptance rate for coding tasks at --spec-draft-n-max 4, with 70–120 tok/s sustained at 70–160k context on Q6. Expert commentary confirms the MoE penalty: MoE models like Gemma 4 26B-A4B must cycle through more experts per speculative token than dense models, so the overhead is proportionally higher — a Radeon AI Pro 9700 user saw prompt-processing speed drop from 1,400 tok/s to 650 tok/s after enabling MTP. Dense models (Gemma 4 31B, Qwen 3.6 27B full) are the primary beneficiaries; for MoE variants, MTP helps only on coding tasks with high acceptance rate. Practical rule: benchmark before assuming MTP helps on your specific workload. (source, May 10, 67 score, 24 comments)

Gemma 4 26B-A4B excels at one-shot creative coding tasks where Qwen consistently falls flat. A practitioner shared an automated three.js prompt cycling test (May 10, 38 score, 23 comments): a Python app cycles through 80 creative-coding prompts, generates single-file HTML/WebGL outputs, detects crashes, and archives the results. Gemma 4 26B-A4B one-shot generation quality was consistently high on 3D graphics and demoscene-style effects. The same author states Qwen 3.6 "falls flat on its face for just about anything I throw at it" in the creative context. A third commenter summarizes the emerging community consensus: "Gemma has more personality to it; Qwen is better for facts and coding." This creative-coding strength is now documented by at least two independent practitioners — the Packman racing game comparison thread (May 9) and this three.js cycling tool — and represents a consistent divergence from Qwen's strengths. For creative coding and single-file generative output, Gemma 4 26B-A4B appears to be the stronger local option at the 26–31B weight class. (source, May 10, 38 score, 23 comments)

vLLM ROCm added to Lemonade as experimental AMD backend — community wants Gemma 4 MTP support. AMD engineer jfowers announced the integration of vLLM ROCm into the Lemonade SDK as an experimental backend (May 8, 433 score, 90 comments). Installation is now two commands: `lemonade backends install vllm:rocm` followed by `lemonade run `. The post drew one of the highest-score community engagements of the week. Notably, the community immediately asked about Gemma 4 with MTP in vLLM ROCm — signaling that MTP for Gemma 4 on AMD GPUs is an active interest. A portable standalone vLLM executable for AMD is also now available. The top comment thread included a pointed message to AMD relayed internally by jfowers: "The reason CUDA is the industry standard is that Nvidia made it their mission to provide the same support to everything in their hardware portfolio." An AMD engineer confirmed the message was sent to management. For Gemma 4 users on AMD hardware, vLLM ROCm in Lemonade is now the cleanest path to vLLM's speculative decode and safetensors-native model support without manual CUDA replacement. (source, May 8, 433 score, 90 comments)

Gemma 4 for language learning: correction-loop prompting pattern works; SillyTavern multi-character setups in active use. A language-learning thread (May 9, 23 score, 19 comments) surfaces a practical deployment pattern for Gemma 4 in education. The most-upvoted comment describes a correction loop: the model answers in three lanes (reply in target language, grammar correction, and explanation of why) while only marking one grammar error and one phrasing suggestion per turn to prevent homework-session overload. One commenter has been using Gemma 3 and then Gemma 4 continuously for German practice, noting it handles verb separation (Trennbare Verben) imperfectly but is broadly helpful for vocabulary connections across Romance languages. A SillyTavern multi-character practitioner reports actively using LLMs for Arabic, French, Portuguese, and Spanish practice across multiple character personas. Gemma 4's instruction-following fidelity — its consistent ability to stay in the target language and maintain a role when prompted — is what makes this use case work. No hardware specifics were shared, suggesting this is primarily a quantized local model use case compatible with standard consumer hardware. (source, May 9, 23 score, 19 comments)

May 10 re-check, 2026-05-10 01:00 EDT: three developments from this sweep reinforce and extend findings from May 9.

Practitioner survey confirms use-case split: Gemma 4 for instruction-following, prose, and games; Qwen for code. A second independent use-case thread (May 9, 20 score, 42 comments) drew direct practitioner reports of what they reach for Gemma 4 specifically. Common answers: generating narrative responses for NPCs in video games (E2B is cited here explicitly), writing PRDs and product specification documents using Gemma 4 31B and then handing implementation to Qwen, and structured tasks where instruction-following fidelity matters more than raw reasoning depth. The most-cited single-sentence summary from the thread: "best instruction-following of any open-weight model I've tried." This is the second large practitioner survey in as many weeks — after the 94-score May 6 thread — reaching the same structure: Gemma 4 is the answer when the task is open-ended instruction compliance or voice/tone matching, and Qwen is the answer for multi-turn agentic code execution. The split is now documented from two independent data points with combined 114 score and 169 comments. (source, May 9, 20 score, 42 comments)

MTP in llama.cpp: Georgi unifying speculative decode architecture before any merge lands. A thread asking how long until official llama.cpp MTP support (May 9, 68 score, 46 comments) surfaced a clarification from Georgi Gerganov: he is building a unified speculative decode architecture that covers MTP, Eagle3, and DFlash together — rather than merging each independently. All three methods will land in one correct implementation rather than piecemeal patches that create technical debt in the speculative decode path. This explains why PRs like #22673 (Gemma 4 MTP) and #22105 (DFlash) have been slow to merge despite being functional. No timeline was given; this is active in-progress work, not a planned milestone. Users who need MTP now should use the AtomicBot-ai patched fork (TurboQuant path) or the omlx runtime on Apple Silicon. The unified refactor, when it lands, should give llama.cpp native parity with vLLM on speculative decode across all three methods simultaneously. (source, May 9, 68 score, 46 comments)

Practical deployment: Gemma 4 on Mac Mini drives MCP server at full interactive speed. A first-person report (May 9, 29 score) confirms that Gemma 4 running on a Mac Mini runs fast enough to serve as the backend for a Model Context Protocol server at full interactive speed — with native tool calling, at zero cloud API cost. This is a concrete production data point for the "Gemma 4 as a free local MCP backend" deployment pattern: the model's tool calling quality and throughput are sufficient for MCP server workloads on current consumer Apple Silicon hardware. No hardware specifics (exact chip, RAM size, model variant, quant) were disclosed, so treat the speed claim as an existence proof rather than a precise benchmark target. The finding is consistent with the broader practitioner picture: Gemma 4 at the right hardware tier delivers cloud-grade instruction following with no recurring API cost. (source, May 9, 29 score)

May 9 re-check, 2026-05-09 01:00 EDT: six significant developments from this sweep.

DFlash for Gemma 4 26B MoE is live — 2.56x speedup in vLLM, 600 tok/s peak on RTX 5090. z-lab released gemma-4-26B-A4B-it-DFlash a few days ago; community benchmarks hit the site on May 8. A controlled vLLM benchmark (RTX 5090 32GB, vLLM 0.19.2rc1) measured baseline 228 tok/s → 578 tok/s at num_speculative_tokens=13 (2.56x speedup) on a 256-input / 1024-output random workload at concurrency 1. Optimal tuning: max_num_batched_tokens=8192 gave the cleanest p95 tail at that speculation depth, with mean E2E latency dropping from 4455ms to 1738ms. Critical community caveat: DFlash drops sharply at approximately 20k context. One commenter testing the same 5090 at 35k context reports speed starting at 400 tok/s but dropping quickly to 200 tok/s and continuing to degrade, with malformed tool calls. For short-to-medium context inference this is a compelling gain; for long-context agentic workloads it is not yet practical. On the DFlash vs MTP comparison: DFlash uses stateful parallel block diffusion drafting with persistent KV cache positions; Gemma 4's MTP implementation uniquely reuses the main model's KV cache, avoiding the memory pressure that afflicts MTP on other architectures. Both require vLLM for DFlash or a patched llama.cpp fork for MTP — no merged mainstream path exists yet. (DFlash benchmark, 99 score; DFlash release discussion, 114 score)

MTP acceptance rate determines whether it helps or hurts. A controlled M4 Max Studio study with Gemma 4 26B-A4B reveals that MTP benefit varies entirely by workload acceptance rate. Measured: code generation 66% acceptance → 1.53x speedup; long-form prose 31% acceptance → essentially no gain (0.95x); JSON structured output 8% acceptance → 0.50x (twice as slow). The mechanism: when the draft model's speculative tokens are rejected, the full model must re-run the verify step with no net gain — at 8% acceptance the overhead dominates. Expert commentary adds two important nuances: first, MoE models like Gemma 4 26B-A4B are harder to speculate in than dense models because spare compute for draft verification is limited; second, Apple Silicon before M5 has limited headroom, and dense Gemma 4 31B is expected to see better MTP gains than the MoE 26B on the same hardware. Practical guidance: MTP is worth enabling for structured code generation and predictable outputs; disable it for free-form prose and especially for JSON schema output, where it reliably degrades performance. Always benchmark before assuming benefit. (source, 24 score, 8 comments)

Multi-GPU topology insight: NVLink pairing beats full tensor parallelism. A detailed benchmark with 4×RTX 3090 (NVLink between GPU pairs 0↔2 and 1↔3, vLLM 0.20.1, CUDA 12.8) found that pinning TP=2 to an NVLink-bonded pair delivered +25% throughput at concurrency 1 and +53% at concurrency 4 compared to running TP=2 over PCIe. Counter-intuitively, expanding to TP=4 across all four GPUs was worse — cross-pair PCIe bus traffic added latency that outweighed the additional capacity. This applies directly to Gemma 4 31B Dense deployment on NVLink-equipped multi-GPU workstations: prefer TP=2 on your NVLinked pair over TP=4, even when you have four GPUs. Tested here with Qwen 3.6 27B AWQ as the workload model; the topology principle holds for any model requiring tensor parallelism across these GPUs. (source, 44 score, 36 comments)

TurboQuant + MTP on RTX 4090: 80-87 tok/s at 262K context — quality claims contested. A demonstration showing TurboQuant quantization combined with MTP on Qwen 3.6 27B reports 80-87 tok/s generation at a 262K context window on a single RTX 4090 (60 score, 42 comments). The numbers are eye-catching, but community pushback on quality was significant: the demonstration used a simple Q&A prompt and did not test accuracy on long-context retrieval tasks where TurboQuant's aggressive compression can degrade meaningfully. TurboQuant is the method from the AtomicBot-ai fork — the same project that shipped the first Gemma 4 MTP implementation for llama.cpp — and it is not merged into mainline llama.cpp or any standard quantization library. The combination of unverified quality and non-mainline tooling means the throughput claim is directionally interesting, but the practical recommendation remains: use quantization methods with published quality benchmarks on your target workload before optimizing around throughput numbers. (source, 60 score, 42 comments)

HTX301 PCIe inference card announced: 384GB at 240W, community skeptical. Taiwanese company Skymizer announced the HTX301, a PCIe inference card with 384GB memory and a 240W TDP (250 score, 103 comments). At face value the memory capacity is striking — 384GB would fit Gemma 4 31B Dense at BF16 with enormous headroom, or multiple models simultaneously. Community reaction was measured skepticism: the announcement contains no memory bandwidth specification, no compute FLOPS figures, and no pricing. Memory capacity without bandwidth is meaningless for LLM inference decode throughput, where bandwidth is almost always the bottleneck. Several hardware-knowledgeable commenters compared it unfavorably to AMD MI300X (192GB at ~5TB/s bandwidth) and suggested the 240W TDP implies a modest memory subsystem relative to the 384GB capacity. Worth tracking if independent benchmarks appear with validated bandwidth figures; do not plan deployments around the headline memory number alone. (source, 250 score, 103 comments)

vLLM ROCm added to Lemonade: AMD GPU users can now run inference before GGUF conversion. The Lemonade server added vLLM ROCm as an experimental backend, enabling inference from standard model weights on AMD GPUs without first converting to GGUF format. This reduces workflow friction for Radeon 6000/7000-series users on Linux who want to test Gemma 4 variants under ROCm. The backend is marked experimental; community verification of Gemma 4 on ROCm via Lemonade is sparse, so validate on your specific GPU before relying on it for production workloads. AMD GPU users for whom GGUF conversion was the primary friction point now have a faster path to initial evaluation. (source)

May 8 re-check, 2026-05-08 01:00 EDT: three new developments from this sweep worth recording.

MTP now working in llama.cpp for Gemma 4 — 40% decode speedup on M5 Max. A community developer (May 8) implemented Multi-Token Prediction for llama.cpp, quantized Google's new Gemma 4 assistant GGUF models, and tested on a MacBook Pro M5 Max. Measured result: 97 tok/s baseline → 138 tok/s with MTP, a 40% speedup. This uses the new Google-released MTP draft models (Gemma-4-26B-A4B-it-assistant) and a patched llama.cpp fork available at AtomicBot-ai; the patch is not yet merged into mainline llama.cpp. Key distinction from the omlx finding (below): this is llama.cpp-based MTP — relevant to Linux and Windows users who cannot use MLX. Commenters note the quality comparison between baseline and MTP outputs used different seeds and temperatures, so "40% faster with identical quality" requires verification at temp=0 with fixed seed; take the exact ratio as approximate. The directional finding (meaningful speedup via MTP on llama.cpp for Gemma 4) is credible given the confirmed mechanism. (source, 95 score, 19 comments)

MTP confirmed working on Apple Silicon via omlx runtime. A direct first-hand report (May 7) confirms that the new Google MTP draft models work with the omlx runtime on M1 Max 64GB, nearly doubling decode speed from 11 tok/s to 20+ tok/s at max wattage. Standard MLX (the more widely used Apple Silicon inference library) does not yet support MTP — the omlx runtime is a separate fork-based project. On the technology: MTP only benefits decode (generation) speed, not prefill — prefill processes the full input in parallel by design, so there is nothing to speculate ahead. Commenters clarified a common confusion: some third-party projects advertise "speculative prefill" as a distinct feature, but this involves lossy KV cache population (not mathematically equivalent to standard generation); lossless MTP applies only to the decode phase. For Apple Silicon users: omlx is the current fastest path to Gemma 4 MTP; native MLX support is pending. (source, 21 score, 22 comments)

Prompting sensitivity: Gemma 4 and Qwen 3.5 need different prompting than Qwen 3.6. A controlled test (May 7) ran two phrasings of the same math-word problem against Gemma 4 31B, Qwen 3.5, and Qwen 3.6 27B — 10 runs each (6 combinations). The headline result: the models respond very differently depending on phrasing, and Qwen 3.6 proved most robust to ambiguous phrasing while Gemma 4 and Qwen 3.5 performed better on the clearer of the two prompts. Key practical takeaway: Gemma 4's accuracy on reasoning tasks is sensitive to prompt clarity. Concise, unambiguous prompts tend to get better results than elaborated prompts that contain implicit assumptions. Quantization also matters: IQ2-quantized Qwen 3.6 underperformed Q8 on the same task, reinforcing the known guidance to prefer higher quants for reasoning workloads. This finding complements the token-efficiency story: Gemma 4 finishes tasks in fewer tokens, but benefits from being asked precisely. (source, 28 score, 13 comments)

May 7 re-check, 2026-05-07 01:00 EDT: two new developments from this sweep that add meaningful signal.

Community use-case survey crystallizes where Gemma 4 wins. A widely-upvoted discussion thread (94 score, 127 comments, May 6) asked practitioners directly what they use Gemma 4 for versus Qwen 3.6. The answers converge on a clear pattern: Gemma 4 is the preferred choice for vision and OCR ("Gemma trounces Qwen for handwriting analysis and general vision tasks"), bug tracing ("Gemma4 is really, really, really good at tracing bugs — much more consistent and reliable for finding the actual root cause"), translation especially Japanese and smaller European languages (independently confirmed across multiple reporters), creative writing, tone-sensitive text, and RAG over structured documents. Qwen 3.6 is preferred for agentic coding, multi-turn tool use, and long agentic loops. The niche-split that has been building across weeks of field notes is now directly confirmed from first-person practitioner reports. One practitioner summarizes: "For things I want to go fast, don't require accuracy or rely mostly on the vision encoder: Gemma4-26B-A4B. For where accuracy and nuance are important: Gemma4-31B. I prefer Qwen3.6 for anything programming or toolcalling related." The survey also confirms that translation quality holds at an unusually high bar: Gemma 4 is rated best open-weight option for Japanese→English, with one commenter noting it is "entirely undisputed" for open models on translation tasks. (source, 94 score, 127 comments)

Prompt injection defense: Gemma 4 E4B jumps from 21% to 100%. A benchmark study (6100+ tests across 15 models, 7 attack types) found that Gemma 4 E4B went from 21.6% to 100% defense rate when the untrusted input was wrapped in a long random delimiter and the model was explicitly told not to execute injected instructions. This was the largest absolute improvement of any tested model (+78.4 percentage points) and the only model to reach a perfect score. Tested attack types included role hijack, authority claims, and fake delimiters. The benchmark used hand-crafted payloads rather than SOTA adversarial search, so the defense rate may be lower against gradient-based attacks. Practical takeaway for RAG and web-document pipelines: the delimiter + strict-prompt defense is a high-ROI hardening step for Gemma 4 deployments that process untrusted external content. (source, 24 score)

Morning re-check, 2026-05-02 08:30 EDT: a follow-up sweep against the past 24 hours of r/LocalLLaMA confirmed three additional posts worth recording. A first-hand AMD Radeon 9060 XT 16GB report (eGPU on a 7840HS mini-PC) lands the 24B A4B IQ4_NL variant at 25.9 tok/s with KV cache at q8_0 and a small 256-token target. More importantly, two independent posts within fourteen hours documented an emerging "zombie loops" failure mode on both Gemma 4 and Qwen 3.6 with quantized KV cache during thinking mode. The convergent expert reading is that q4_0 KV quantization accumulates drift across hundreds of internal reasoning tokens until the model falls into a repetition attractor. This pattern is now strong enough to call out as a known limit (see below).

Evening re-check, 2026-05-02 17:45 EDT: the post-PR #82 sweep found two new high-signal items rather than a broad hardware shift. First, a local vLLM/FP8 vision comparison reports Gemma 4 staying much more concise on messy real-world image prompts, often around 1,500 thinking tokens where Qwen 3.6 can burn 8,000+ tokens and sometimes fail to finish. The same report says Gemma 4 followed normalized 0 to 1 bounding-box JSON instructions more reliably, while Qwen 3.6 did better on the tested 2 FPS deadlift video tracking case. Second, an SGLang production report identified an FP8 KV-cache bug for models with per-layer KV scales, explicitly including Gemma 4, where radix-cache prefix hits can silently corrupt output unless the deployment uses BF16 KV cache or the upstream fix lands. This reinforces the current guidance: for long-context or thinking-mode work, treat KV-cache precision and serving backend as quality controls, not just speed knobs. (vision source, SGLang source, PR #24198)

May 3 re-check, 2026-05-03 01:00 EDT: a new sweep surfaced two notable developments. First, a dedicated KV cache quantization discussion (source, 77 comments) provided the architectural explanation for the zombie loops pattern previously documented: Gemma 4 uses an interleaved Sliding Window Attention (iSWA) mechanism that is structurally more sensitive to KV precision loss than dense models or Qwen-style MoE. The expert comment reads directly: "Gemma 4, due to its iSWA architecture, is apparently much more sensitive to KV cache quantization." Dense architectures accumulate less rounding error per attention step; iSWA's alternating local and global windows amplify quantization noise differently. The practical implication is stronger than the zombie-loops framing: KV precision for Gemma 4 is an architecture-level quality control, not just a safety precaution for thinking mode. Second, follow-up comments on the "Qwen 3.6 wins benchmarks, Gemma 4 wins reality" vision post (source) added two confirming voices worth recording. A commenter with the opposite finding (Qwen 3.6 follows instructions better) attributes the divergence to "backend/harness influence," underscoring that task setup and serving backend matter for the comparison. A second commenter elaborates: "Gemma is much better at short one shot, but because of its architecture it struggles with long context. There is something about its attention mechanism and its also far more sensitive to KV quantization." On the multilingual dimension, a confirmed data point: Gemma 4 is "a much better LLM than Qwen for anyone that doesn't use English or Chinese as their primary language, especially for European languages." Third, the RTX 6000 Pro guidance from May 2 received an important nuance from a card owner: "performance between vllm, sglang etc is the same as LMStudio until you move onto 4 or more concurrent pulls, then vllm and sglang are better." (source) This corrects the blanket recommendation: for single-user workloads on professional GPUs, llama.cpp-based tools remain competitive; the vLLM/sglang advantage appears primarily at 4+ concurrent requests.

Evening re-check, 2026-05-03 17:05 EDT: the post-PR #86 sweep found two new data points. First, a developer shipped a production Android voice notes app using Gemma 4 E2B (2.4GB) via LiteRT-LM on a OnePlus CE 5 (8GB RAM). The measured end-to-end latency for a 10-15 second voice note is 12-15s: Whisper Small (Sherpa-ONNX) handles transcription in ~5s, Gemma categorizes and extracts structured JSON in ~8-10s. The developer reports JSON output reliability as "way better than expected from a 2.4GB model on a phone" — a strong signal that Gemma 4 E2B's instruction-following quality holds well under aggressive quantization on ARM. Notably, commenters suggest the separate Whisper step may be unnecessary since E2B may support native voice-to-text natively via LiteRT-LM. (source, score 18, 14 comments) Second, a community survey of Gemma 4 31B on smaller European languages confirms the multilingual advantage holds above the 100B MoE tier: multiple independent reporters conclude that Gemma 4 31B beats Qwen 3.5 122B and Mistral 4 119B for Czech, Hungarian, Slovak, and Dutch. The data comes with a precision note: quantization hurts multilingual quality more than English quality, so the comparison is most meaningful at BF16/FP16 — a 16-bit Gemma 4 31B is "extremely good in Hungarian" while the same model at 8-bit shows "slightly Chinese" output contamination. The practical guidance: if your use case is primarily a smaller European language, Gemma 4 31B at high precision is a better choice than any current 100B MoE at standard quantization. (source, score 8, 14 comments)

May 4 re-check, 2026-05-04 01:00 EDT: two new developments worth recording from the latest sweep. First, a report on running llama.cpp via the Snapdragon Hexagon NPU adds early data for mobile NPU inference with Gemma models. The NPU path itself is battery-efficient but constrained: the Hexagon NPU can only address 4GB of RAM, making it unsuitable for anything larger than the smallest Gemma variants without splitting across multiple NPU device instances. In practice, community testing found Gemma 4 E4B achieves 11-14 t/s on a OnePlus 13 (Snapdragon Elite) via the Android Edge APK (GPU path), not NPU. The NPU path on the same chip produced less favorable results. The takeaway for mobile: the GPU via Edge APK is currently the more practical Gemma 4 E4B path on high-end Android phones; NPU is a power-saving alternative that makes sense for always-on background tasks where latency tolerance is high. (source, score 20, 6 comments) Second, a community quality-gap discussion (68 score, 44 comments) adds useful perspective on where Gemma 4 31B sits against frontier cloud models. The converging read across commenters: Gemma 4 31B tracks "Dec 2025 frontier" performance levels for translation and non-English tasks — competitive with Claude Haiku 4.5, which was released roughly half a year ago. For tasks outside English and Chinese, Gemma 4 31B is seen as clearing the bar where the "6-month gap" argument would place it. This is consistent with the separate multilingual finding from May 3: Gemma 4 31B beats all tested 100B+ MoE models for smaller European languages when run at BF16. Anecdotal confidence; no controlled benchmark behind this comparison. (source, score 68, 44 comments)

May 6 re-check, 2026-05-06 01:00 EDT: four high-signal developments from the latest sweep.

Google officially released Gemma 4 MTP draft models. Multi-Token Prediction (MTP) drafters are now available for all four Gemma 4 variants: 31B Dense, 26B-A4B MoE, E4B, and E2B (HuggingFace). The E2B drafter is only 78M parameters — tiny enough to run alongside the main model with minimal memory overhead. MTP works by having the small draft model predict several tokens ahead; the large target model then verifies the full batch in parallel, accepting correct tokens and re-running from the first mismatch. This guarantees identical output quality to standard generation while targeting up to 2x decode speedup depending on task type (structured outputs and repetitive patterns see the largest gains). Community response was immediate: llama.cpp PR #22673 is already in review for Gemma 4 MTP support, and the MTPLX Apple Silicon runtime (see below) also claims MTP model compatibility. This is the biggest single capability addition to Gemma 4 since launch and changes the expected throughput trajectory significantly. (source, score 783, 204 comments)

Token efficiency confirmed: Gemma 4 31B is slower per token but faster per task. A Kaitchup benchmark article (summarized in a community post, 117 score) compared Gemma 4 31B Dense against Qwen 3.6 27B Dense and Qwen 3.5 27B Dense. The headline finding: Qwen models score higher on standard benchmarks ("benchmaxxed") but Gemma 4 31B is "far more efficient with token use" — it produces a correct, complete answer in substantially fewer tokens. The practical implication is that even though Gemma 4 31B is slower per token (it is a larger dense model vs. smaller dense models), total task completion time is often similar or faster because the model doesn't need to elaborate as much. One commenter summarizes the workflow they use: swap Gemma and Qwen 3.6 in Plan/Act roles when either model gets stuck — the two models' different failure modes make them complementary. Another notes that Gemma 4 is more sensitive to quantization, so Qwen's smaller quant + Q8 KV can outperform Gemma at the same VRAM budget, especially for longer contexts. (source)

CPU-only 26B inference is fast because of MoE architecture. A community post (score 100, 70 comments) reports running Gemma 4 26B-A4B on an i5-8500 with 32GB DDR4 RAM and no GPU. The measured generation speed is 9.25 t/s (prompt processing 23.13 t/s). The key explanation from the top comment: "Gemma 4 26B is a mixture of experts model that only uses 4B parameters every token. So it should be about as fast as a 4B model." This is the definitive answer for CPU-only users: the 26B label is misleading — active parameter count per token is ~4B, making CPU inference practical on ordinary hardware. Qwen 3.6 27B is dense (all 27B parameters active every token), so it runs ~8x slower on CPU despite having similar total parameter count. For CPU-only or low-RAM setups, the Gemma 4 26B-A4B MoE is the right model; Qwen 3.6 27B is impractical at the same hardware. (source)

MTPLX: Apple Silicon MTP inference engine shows 2.24x speedup. An open-source runtime built on a patched MLX fork (not a patch to MLX itself) reports 28 → 63 tok/s on Qwen 3.6 27B on MacBook Pro M5 Max using MTP heads built into the model. Key design details: mathematically exact temperature sampling via rejection sampling (not greedy-only like other speculative decode tools on Apple Silicon), custom Metal kernels, and a full OpenAI/Anthropic-compatible API server. The runtime also adds crash-safe fan control and a 562-test suite. With Google's Gemma 4 MTP draft models now released, MTPLX may support Gemma 4 inference as well — the developer says it "works on ANY MTP model." Not yet independently verified for Gemma 4 specifically; treat as promising but unconfirmed. (source, score 60, 38 comments)

May 5 re-check, 2026-05-05 01:00 EDT: four developments from the latest sweep that Gemma 4 users should act on or track.

Update your Gemma 4 GGUFs. A high-traction community post (365 score, 103 comments) announced that the Jinja chat template bug documented in earlier field notes has been fixed in the upstream model files. Updated GGUFs are now available from bartowski and unsloth for all four variants: 31B, 26B-A4B, E4B, and E2B. Community comments flagged that the fix may also reduce the extreme memory usage some users experienced. The exact change is visible at HF discussion 86. If you have been running Gemma 4 GGUFs from before May 2026 and are using tool calling or extended context, updating is strongly recommended. (source)

llama.cpp MTP support is now in beta. A beta implementation of Multi-Token Prediction (MTP) has landed in llama.cpp (477 score, 210 comments). MTP pairs a small fast draft model with the large target model: the draft predicts a token batch, the target verifies the entire batch in parallel, accepting correct tokens and re-running from any mismatch. ELI5: "big model and small model work as a team — small model runs ahead, big model checks from behind, both finish sooner." Currently limited to Qwen3.5 MTP architectures, with broader model support expected. The author notes that between MTP and maturing tensor-parallel support, "most performance gaps between llama.cpp and vLLM, at least when it comes to token generation speeds, should be erased." Relevance for Gemma 4: once Gemma 4 MTP support lands (if and when), E4B could serve as a draft model for 31B Dense — this is architecturally the same as the existing Gemma 4 E2B speculative decoding setup but with native MTP semantics. Not yet merged; track the PR before updating. (source)

APEX MoE quants now cover Gemma 4. The APEX mixed-precision MoE quantization strategy, originally demonstrated for Qwen 3.5 35B-A3B, has expanded to 30+ models including Gemma 4 variants (77 score). APEX applies expert-routing-aware precision tiers: higher precision for edge layers and shared experts (which handle rare long-range tokens), lower for mid-tier experts. Users report noticeably better coherence past 32K tokens compared to uniform Q4_K, with measured faster inference on the benchmarked Qwen 3.6 models. The Gemma 4 26B MoE coverage is confirmed in the library; community reports on Gemma 4 specifically are sparse so treat the long-context claims as plausible but anecdotal until more data surfaces. Quants are available via github.com/mudler/apex-quant. (source)

Research: FastDMS achieves 6.4x KV cache compression faster than vLLM BF16. An MIT-licensed reference implementation of Dynamic Memory Sparsification (DMS) — a technique using learned per-head token eviction to compress the KV cache — reports 6.4x compression with near-lossless quality (perplexity 9.226 → 9.200 on Llama 3.2 1B; KLD ~0.026 nats/tok). The implementation is research-quality and tested only on Llama and Qwen-family checkpoints; it has not been integrated into llama.cpp, vLLM, or SGLang. Author explicitly says the lift for a production serving integration is large and "noped out" of attempting it. Given Gemma 4's documented KV precision sensitivity (iSWA architecture amplifies quantization noise), FastDMS is worth tracking as a potential path to longer context without KV precision degradation — but this is speculative and no Gemma 4 DMS checkpoints exist yet. Confidence: low (research-stage result). (source)

Hardware leak: Ryzen AI Max+ 495 (Gorgon Halo) with 192GB unified memory. A leaked spec for AMD's upcoming Ryzen AI Max+ 495 shows 192GB unified memory, up from 128GB on the current Strix Halo 395 (148 score). Key caveat from community hardware experts: memory bandwidth appears unchanged at ~256GB/s. For Gemma 4 users this means: a Gorgon Halo system could fit Gemma 4 31B Dense BF16 (~62GB), the 26B MoE BF16, and several smaller models simultaneously, with prefill remaining the same speed bottleneck as Strix Halo today. The additional capacity is most useful for parallel model loading, very long contexts, or RAG pipelines that need multiple loaded models. Unconfirmed leak; release timeline and pricing unknown. Strix Halo 395 owners confirm the memory increase alone would not change throughput on single-user workloads. (source)

Headline this week

MTP vs DFlash is now settled at the hardware and concurrency level. A controlled H100 benchmark (May 12) confirms: at concurrency 1, MTP (3.11x) and DFlash (3.03x) are statistically tied for Gemma 4 31B Dense. At concurrency 16, MTP wins decisively — 953 vs 725 tok/s. For single-user inference either method works equally well; for serving multiple concurrent users, MTP is the better choice. Apple removed the M3 Ultra 256GB Mac Studio from its store ahead of an expected M5 launch. DFlash for 26B MoE remains live in vLLM with 2.56x throughput for short-context workloads; both MTP and DFlash require workload-appropriate tuning. New May 15: KV cache quantization guidance for vLLM is now more precise — a formal study confirms FP8 (`--kv-cache-dtype fp8`) as the best default when VRAM is constrained, with 2x capacity and negligible accuracy loss. TurboQuant variants beyond 4bit-nc are not worth the accuracy and throughput cost for Gemma 4. If VRAM is not a constraint, unquantized KV cache remains highest quality.

Best current setup (today)

  • RTX 5xxx (Blackwell consumer): Gemma 4 26B MoE now has an official nvidia/Gemma-4-26B-A4B-NVFP4 quant at 18.8GB. On a 5090 with 80% VRAM allocation, users report ~50K context. Benchmarks are near-lossless: GPQA Diamond 79.9% vs 80.3% baseline, AIME 2025 actually improved to 90.0% from 88.95%. New as of May 9: DFlash for the 26B MoE is now live via z-lab — benchmarks on the 5090 show 228 → 578 tok/s (2.56x) at optimal settings with vLLM. Critical caveat: DFlash degrades sharply at 20k+ context (drops from 400 to 200 tok/s and continues declining). Use DFlash for short-context inference; for long-context work, NVFP4 without DFlash remains the recommended path. For 31B Dense, NVFP4 GGUF with llama.cpp PR #22196 or the existing DFlash variant remain the paths. (NVFP4 source, DFlash benchmark)
  • Mid-to-high single GPU (24+ GB VRAM, non-Blackwell): Gemma 4 31B Dense at Q5_K_M or Q6_K remains the strongest single-card choice for general work, writing, and visual understanding. New this week: the DFlash variant (gemma-4-31B-it-DFlash) has been released but still needs llama.cpp PR #22105 to merge before practical use. ggerganov is reportedly planning a speculative-architecture refactor first. (source)
  • Constrained GPU (8-16 GB VRAM): Detailed speed benchmarks from an RTX 4070S 12GB user (DDR5 6000MHz, iGPU display offload) show Gemma 4 26B MoE and 31B Dense both runnable with substantial CPU offload. The 12GB club is real: careful config tuning (CUDA 13.1, display offload to iGPU, cache reuse settings) gets 40 t/s on 35B Q6 with system RAM spill. Keep Gemma 4 for prose and Qwen 3.6 for code in this tier. New (May 14): Reduce GPU power limit with `nvidia-smi -pl` to ~40% TDP — confirmed on RTX 4090 that generation throughput is nearly unchanged (LLM decode is memory-bandwidth-bound, not compute-bound). Reclaim meaningful power, heat, and noise savings at essentially no performance cost. (12GB source, power limit source)
  • Budget hardware (GTX 1080 8 GB VRAM, ~$200 secondhand machine): New May 14 lower bound. A $200 i7-6700 / GTX 1080 / 32 GB RAM machine runs Gemma 4 26B-A4B at ~20–24.5 tok/s using TurboQuant/RotorQuant KV cache (allows 128k context within 8 GB VRAM) and `--n-cpu-moe 20` CPU MoE offload. Key flags: `--override-tensor-draft "token_embd\.weight=CUDA0"` needed to fix MTP's tied LM head behavior for Gemma 4 specifically. The GPU is PCIe bandwidth-limited (~40-50% GPU utilization). Note: TurboQuant KV is not in mainline llama.cpp; requires AtomicBot-ai or ikawrakow fork. Treat as an existence proof — actual throughput at full 128k real context will be lower than these small-prompt benchmarks. If you already have an 8GB GPU collecting dust, Gemma 4 26B-A4B is runnable. (source)
  • AMD consumer GPU (Radeon 9060 XT 16GB, eGPU): A first-hand report on a 7840HS mini-PC paired with an external Radeon 9060 XT lands the Gemma 4 24B A4B IQ4_NL variant at 25.9 tok/s via llama-server, with KV cache at q8_0 and a small 256-token batch target. The user notes the configuration is usable for OpenCode codebase Q&A. Reply chain confirms 16GB is tight at 128K context and forces partial CPU offload, so for steady-state work expect lower numbers when context fills. New (May 9): Lemonade added vLLM ROCm as an experimental backend, allowing inference from standard model weights on Radeon 6000/7000-series GPUs without converting to GGUF first — useful for quick evaluation of new models before committing to conversion. Backend is experimental; validate on your GPU before production use. (llama-server source, vLLM ROCm Lemonade)
  • CPU-only (no GPU, DDR4/DDR5): Gemma 4 26B-A4B MoE is now confirmed to run at 9.25 t/s generation speed on an i5-8500 with 32GB DDR4 RAM — practical for interactive use. The reason: MoE only activates ~4B parameters per token despite 26B total. This makes it comparable to running a true 4B dense model on CPU. Qwen 3.6 27B Dense activates all 27B parameters per token and is ~8x slower on the same hardware — avoid for CPU-only setups. Recommended quant: Q4_K_M to fit comfortably in 32GB. (source)
  • Apple Silicon (32-64 GB unified): The viral Pacman test ran Gemma 4 31B at 27 tok/s on M5 Max 64GB, confirming strong Apple Silicon inference. MTP is now confirmed working for Gemma 4 via the omlx runtime: a first-hand M1 Max 64GB report (May 7) measured 11 → 20+ tok/s — roughly doubling decode speed. Standard MLX MTP support is still pending. MTPLX (a separate patched MLX fork) shows 2.24x speedup on Qwen 3.6 27B and claims compatibility with any MTP model. Recommended quant for 31B on 64GB: Q8 or Q6_K. Supply chain note (May 9): Apple has removed the M3 Ultra 256GB Mac Studio from its store, likely ahead of an M5 Mac Studio launch. If you were planning a 256GB Apple Silicon build for BF16 Gemma 4 31B, that window has closed for now — wait for M5 Ultra availability and pricing before committing. (sources, omlx MTP, MTPLX, Apple store note)
  • Professional GPUs (RTX 6000 Pro / A6000 Ada, 48-96GB): The recommendation to use sglang or vLLM over llama.cpp has an important nuance (May 3 update): for a single-user single-request workflow, llama.cpp-based tools like LMStudio are competitive. The vLLM/sglang advantage becomes significant at 4+ concurrent requests, where continuous batching scales throughput meaningfully. On an A6000 Ada running vLLM, one practitioner reports 400-500 tok/s across 8-12 parallel workloads. For single-user interactive use on Windows, LMStudio may be simpler with no real throughput cost. (source)

What works

  • Concise, correct single-shot code generation. The viral Pacman post (778 score) is the clearest demonstration yet. Gemma 4 31B on M5 Max produced a working game in 3m51s with 6,209 tokens: shorter, clearer, and functionally correct on first run. Qwen 3.6 27B spent 18m04s and 33,946 tokens with more visual creativity but more bugs. This pattern holds across similar community tests: Gemma tends to produce tighter, more correct code, Qwen produces more elaborate but less reliable output. (source)
  • Writing, tone, fiction, summarization. The niche-split consensus from last week is now even stronger. The "are 30B models obsolete?" thread (139 score, 144 comments) keeps accumulating confirming answers: Gemma is "MUCH better than Qwen in writing and tone," "the best at non-code tasks." (source)
  • Near-lossless NVFP4 for MoE. The Nvidia NVFP4 quant of Gemma 4 26B MoE preserves quality to within 0.4% across multiple benchmarks, and in some cases slightly exceeds full precision. A practitioner with 90 ablation experiments explains this as NVFP4 acting as regularization on the 128-expert router, preventing over-commitment to dominant pathways. This is an important finding for anyone running the MoE variant. (source)
  • Visual understanding. Remains the strongest open multimodal answer for image-plus-text tasks. No new contradicting signal.
  • Speculative decoding pairing. Gemma 4 31B + E2B draft model delivers 120-200 tok/s on suitable tasks. Official MTP draft models are now available for all four variants; once llama.cpp PR #22673 merges, native MTP will replace the ad-hoc speculative decoding setup and likely push throughput further. (source, MTP release)
  • Token efficiency: finishing faster despite slower per-token speed. Comparative testing by Kaitchup confirms Gemma 4 31B is "far more efficient with token use" than Qwen 3.6 27B Dense. The model produces complete, correct answers in fewer tokens even when individual token generation is slower. Combined with MTP draft models now available, total task latency is expected to improve further. (source)
  • CPU-only 26B inference is practically viable. The MoE architecture activates only ~4B parameters per token, making CPU throughput comparable to a 4B dense model. Measured 9.25 t/s on i5-8500 + 32GB DDR4 — usable for interactive workflows on modest hardware. (source)
  • Native FP4 on Blackwell. Now available for both the 31B Dense (via community GGUF) and the 26B MoE (via Nvidia's official release). ROCm/Vulkan support for NVFP4 is also emerging via llama.cpp and third-party kernels like petit-kernel. (source)

Known limits

  • Tool calling template bug is now fixed — update your GGUFs. The Jinja chat template issue documented in earlier field notes has been patched upstream (May 4, 2026). New GGUFs from bartowski and unsloth for all four variants (31B, 26B-A4B, E4B, E2B) include the fix. The exact change is at HF discussion 86. Alternatively, pass an updated `--chat-template-file` to llama.cpp or use KoboldCPP's Jinja template override. Some users report the updated GGUFs also reduce extreme memory usage. If you have not updated since May 2026, do so now before concluding that tool calling is unreliable. The community prediction for a "Gemma 4.1" point release may still happen — the template fix resolves the structural bug but does not address all agentic reliability concerns documented in prior field notes. (source)
  • Qwen 3.6 leads on code and agents in the same size band. Still the dominant community read. For coding agent workflows and long agentic loops, Qwen 3.6 is materially more reliable. But the Pacman test shows Gemma 4 can beat Qwen on one-shot code quality when the task is well-scoped. The gap narrows when you don't need sustained multi-turn tool calling. (source)
  • DFlash for 26B MoE is live; MTP acceptance rate is workload-dependent. These are two distinct paths to faster Gemma 4 inference, both with important caveats. DFlash 26B MoE is now live (z-lab release, May 8) and achieves up to 2.56x throughput in vLLM on short contexts — but drops sharply at 20k+ tokens. DFlash for 31B Dense still requires llama.cpp PR #22105 (blocked on speculative architecture refactor). MTP is live in patched llama.cpp forks and via omlx on Apple Silicon. MTP acceptance rate varies dramatically: code generation sees 66% acceptance (1.53x speedup), prose sees 31% (a wash), and JSON structured output sees only 8% (0.50x — slower than baseline). Always benchmark MTP for your specific workload before enabling it. Either path guarantees identical output quality to standard generation (unlike quantization). (DFlash source, DFlash 26B MoE, MTP acceptance rates)
  • Fine-tuning Gemma 4 is harder than expected. A community prediction thread flags that Gemma 4's architecture "seems to be making fine-tuning tricky" and notes that Gemma 4 "didn't really take over the fine-tune crowd." If you need a fine-tunable base, this is a real friction point to watch. (source)
  • Reasoning accuracy is sensitive to prompt phrasing. A May 7 controlled test showed Gemma 4 31B performance varying significantly between two different phrasings of the same math-word problem. Qwen 3.6 was more robust to ambiguous phrasing; Gemma 4 performed best on the clearer, more explicit prompt. Practical guidance: for reasoning or multi-step tasks, prefer short, unambiguous prompts — avoid prompts with implicit assumptions or layered conditions. This is consistent with the token efficiency picture: Gemma 4 is concise and direct but expects the same from the user. (source)
  • Professional GPUs and sglang/vLLM: nuanced, not a blanket rule. sglang and vLLM pull ahead of llama.cpp at 4+ concurrent requests due to MTP (Multi-Token Prediction) support and continuous batching. But for a single-user setup, a card owner confirms: "performance between vllm, sglang etc is the same as LMStudio until you move onto 4 or more concurrent pulls." The earlier framing ("seriously gimping that card by running llama.cpp") applies to multi-user or high-concurrency scenarios, not necessarily solo development. If you are on Windows and single-user, LMStudio remains a reasonable starting point. (source)
  • Structured output stays unreliable below 7B. Still valid from last week. Validate paths, classify actions, and check outputs in code for sub-7B models.
  • Safety filters on E2B. Still too aggressive for emergency/medical prompts. No equivalent Gemma 4 uncensored release has surfaced.
  • Gemma 4 is structurally more sensitive to KV cache quantization than most models — and a first comprehensive vLLM study now clarifies the tradeoffs (May 15 update). The zombie loops pattern from May 2 now has an architectural explanation (May 3 update): Gemma 4 uses interleaved Sliding Window Attention (iSWA), which amplifies KV precision loss differently than dense or Qwen-style MoE architectures. At q4_0 KV quantization, rounding drift accumulates across the model's alternating local and global attention windows, eventually pushing thinking-mode outputs into repetition attractors. Qwen 3.6 at Q8 KV quant is described as "almost lossless" by benchmarks; for Gemma 4, community consensus is to treat anything below fp16/bf16 KV as a quality risk on long or thinking-heavy workloads. Two independent zombie-loop cases documented: one on dual RTX 5060 Ti 16GB with `-ctv q4_0 -ctk q4_0`, another on Gemma 4-26B-A4B at Q3 and Q4 quants. Workarounds: raise KV cache precision to q8_0 or fp16, drop reasoning budget to 0 (disables thinking), ensure context is not overflowing, and use CUDA toolkit 13.1 rather than 13.2 (13.2 has a confirmed regression with these models). vLLM guidance added May 15: For vLLM users, a formal study now confirms FP8 (`--kv-cache-dtype fp8`) as the best default when VRAM is constrained — 2x capacity with negligible accuracy loss. TurboQuant k8v4 does not provide meaningful advantage over FP8. TurboQuant 4bit-nc is viable at extreme VRAM pressure but degrades accuracy and throughput. 3bit variants cause meaningful accuracy drops on reasoning tasks and are not recommended for Gemma 4 production use. If VRAM is not the binding constraint, unquantized KV cache remains the highest-quality option. These findings apply specifically to vLLM; llama.cpp's TurboQuant/RotorQuant behavior should be benchmarked separately. (source 1, source 2, architecture source, TurboQuant study)
  • Gemma 4 E4B is not reliable for code autocomplete (fill-in-the-middle). A practitioner directly tested E4B for IDE code infill and found it produces "weird suggestions," ultimately choosing Qwen 2.5 Coder 7B instead. The instruction-following strength that makes E4B a top recommendation at the 4B tier does not translate to the FIM (fill-in-the-middle) task pattern, which requires specific training signal. If you are building a local coding autocomplete setup with a 16GB GPU, use a purpose-built coder model for infill; E4B is better suited for interactive chat, vision, and instruction-following tasks. (source, May 12, 36 score)
  • SGLang FP8 KV cache can silently corrupt outputs on affected versions. A production report from AI Router Switzerland traced silent garbage output in Qwen3.6-27B-FP8 to the ragged plus paged attention split path dropping `k_scale`/`v_scale` during radix-cache prefix hits. The author explicitly says the same class can affect FP8 models such as Gemma 4 that store per-layer KV scales. Verified upstream state: SGLang PR #24198 is open and approved. Until it lands in the serving build, keep Gemma 4 FP8 deployments on BF16 KV cache or apply the patch before trusting prefix-cache reuse. (source)

Open questions

  • Will Google ship a Gemma 4.1 with fixed tool calling? The community's top May prediction is a "4.1" point release that fixes the template-level tool-calling bug. If it happens, it could significantly close the gap with Qwen 3.6 on agent workloads. No official signal yet. (source)
  • llama.cpp MTP support is now in mainline — RESOLVED. PR #22673 merged on May 16, 2026. MTP is available via a standard llama.cpp build for all Qwen3.6 and Gemma 4 MTP models. The official Docker image was lagging at merge time; build from source with `CUDA_DOCKER_ARCH` for your GPU until the container image is updated. DFlash remains a separate path, still blocked on the speculative architecture refactor (PR #22105).
  • What real-world speedup will MTP deliver for Gemma 4? Now well-documented across multiple hardware tiers. On H100 (vLLM): Gemma 4 31B Dense achieves 3.11x at concurrency 1 and reaches 953 tok/s at concurrency 16 — the first data point showing MTP pulls significantly ahead of DFlash at higher concurrency. On M4 Max Studio: Gemma 4 26B-A4B shows 1.53x for code, a wash for prose, 0.50x for JSON. On M5 Max with llama.cpp patched fork: 97→138 tok/s (1.42x). Pattern: dense models like 31B see larger MTP gains than the MoE 26B; code generation workloads benefit most; JSON and free-form prose do not benefit. Always benchmark MTP for your specific workload before enabling it.
  • Will NVFP4 quality hold across AMD via Vulkan/ROCm? Early support exists via llama.cpp Vulkan and third-party kernels, but no controlled benchmarks on AMD hardware yet. (source)
  • How much does the fine-tuning difficulty matter? If the community can't easily fine-tune Gemma 4, Qwen 3.6 may absorb the fine-tune crowd entirely, limiting Gemma 4's ecosystem growth.
  • Strix Halo and unified-memory APUs. Reports on AMD Strix Halo 128GB suggest viable 27-31B dense inference, but data is still thin.
  • April 2026 was "one of the best months ever" for local LLMs. A community retrospective catalogued a historically dense month of model releases. The question is whether May sustains the pace — early signals from this sweep suggest continued momentum.

Sources

The most relevant Gemma-mentioning posts driving this update, with the newest first:

The full set of 153 community reports lives in the Community Reports section above, filterable by hardware category and search.

Last updated: 2026-05-18 (May 18 sweep). Confidence: medium. Next update fires when the daily Gemma 4 research cron flags notable new findings.

Community Reports (164 from r/LocalLLaMA)

Real-world hardware experiences from the community. Filter by hardware category or search. These are user reports, not official benchmarks.

u/jacek2023 2026-04-02 New Model Quantization & Backends

Google is going to show what open weights is about. Happy Easter everyone.

u/Both_Opportunity5327 (+529)Google is going to show what open weights is about. Happy Easter everyone.
u/danielhanchen (+519)Gemma-4 has native thinking, tool calling and is multimodal! Use temperature = 1.0, top\p = 0.95, top\k = 64 and the EOS is `<turn|>`. `<|channel>thought\n` is also used for the thinking t...
u/Altruistic_Heat_9531 (+416)And after a week maybe : "Gemma 4 26B Heretic Uncensored Ablated Claude Opus 4.6 Reasoning Distlled Expanded fine tuned quantized" Sorry to tempting lol
View full discussion on r/LocalLLaMA
u/ex-arman68 2026-05-06 Resources High-end GPU (24+ GB)Quantization & Backends

> 2026-05-07 edit: I have updated the hardware based recommendations with more focus on quality. I do not recommend q40 KV cache anymore beyond 64k context. After multiple rounds of testing with the different size quants, it appears 3 is the optim...

u/ResidentPositive4122 (+249)Legend. Man, these past 6 months have brought us more than the last 2 years combined. On the one hand we've seen really powerful open models (glms, kimis, deepseeks, minimaxs, mimos, etc) and more imp...
u/SmartCustard9944 (+64)1. Better inference and intelligence 2. -> Better and faster contributions 3. Go to 1 This loop gets better and better with time. This feels like a self improving intelligence, just that it’s hybri...
u/jacek2023 (+47)When was turbo3/turbo4 merged? Or is this part of MTP PR?
View full discussion on r/LocalLLaMA
u/FullChampionship7564 2026-04-21 Funny General

Link post. The discussion is mostly in the comments. Target:

u/StupidScaredSquirrel (+277)I still wanna glaze gemma just cause I'm too scared qwen will stop delivering at some point and gemma is very close in terms of performance and I dont want google to stop releasing
u/MexInAbu (+208)Gemma 4 is superior for creative writing and there's no contest.
u/markole (+83)Coding? Sure. Translating? Nah, qwen sucks for translating.
View full discussion on r/LocalLLaMA
u/rerri 2026-05-05 New Model General

Blog post: MTP draft models:

u/MaartenGr (+262)For those interested in how they work, I updated my visual guide with some snippets here and there:
u/Craftkorb (+247)The E2B model has a 78M draft model - Cuuute!
u/hackerllama (+141)Enjoy!
View full discussion on r/LocalLLaMA
u/dtdisapointingresult 2026-04-28 Discussion General

I think gave it a fair shot over the past few weeks, forcing myself to use local models for non-work tech asks. I use Claude Code at my job so that's what I'm comparing to. I used Qwen 27B and Gemma 4 31B, these are considered the best local models u...

u/PeerlessYeeter (+534)op's experience somewhat matches mine, I keep assuming I'm doing something wrong but I think this subreddit gave me some unrealistic expectations
u/onethousandmonkey (+320)Purely from the performance point of view, there are a number of settings to tweak to make Claude Code jive with local models. For example: Before I did that, I was banging my head against the wall at...
u/Hans-Wermhatt (+191)The people here overhype Qwen 3.6 for sure, but I don't know what to tell the people who were expecting to just flip over from Opus 4.7 xhigh 4 Trillion to Qwen 27B and expect the same performance. Yo...
View full discussion on r/LocalLLaMA
u/gladkos 2026-05-01 Generation Apple Silicon

Gemma just crushed Qwen in a local LLM gamedev contest! Device: MacBook Pro M5 Max, 64GB RAM Qwen 3.6 27B: 32 tokens/sec · 18m 04s · 33,946 tokens. Gemma 4 31B: 27 tokens/sec · 3m 51s · 6,209 tokens. So what is more important: tokens per second, or t...

u/OneSlash137 (+290)Keep performance stable and no bugs are pretty hilarious additions to the prompt.
u/The-Pork-Piston (+234)I’ve shared it before but my secret source when prompting is: IMPORTANT: You are an expert coder turned professor, however you have been accused of a sexual crime and the only way to exonerate yoursel...
u/zigzag3600 (+103)Have you tried: You are a senior-level software developer: do good, don't do bad.
View full discussion on r/LocalLLaMA
u/GodComplecs 2026-04-29 Funny Quantization & Backends

Well or pretty close to it, they are excellent work horses. I run them in real work scenarios doing some of the work I used to do myself as an skilled expert in my field, billing 200$ an hour. Ofc the key is building a system around their weaknesses,...

u/RetroPeel2025 (+131)Gemma4 is great for translation and creative writing. Qwen3.6 outputs great games. I don't know what black magic they did to make the smaller models that capable in making cool games for the browser. ...
u/slvrsmth (+51)Careful with Gemma4 translating to languages you don't understand, especially smaller ones. I ran a small test with my native latvian. 31B understands the inputs well, but the outputs are 2010-google-...
u/phenotype001 (+40)I left an agent with Qwen 3.6 working overnight. I wake up, it still works. No looping on bullshit, no dumb decisions. It's a dream come true.
View full discussion on r/LocalLLaMA

Sparky runs entirely on the Jetson. Gemma 4 E4B at Q4\K\M via llama.cpp with q8_0 KV cache and flash attention. 12K context, native system role, sampler defaults from the model card. Cached TTFT around 200ms, sustained 14-15 tok/s. SenseVoiceSmall fo...

u/Recoil42 (+97)Really cool hardware design, OP.
u/rog1121 (+46)
u/teachersecret (+37)Definitely not taking that thing on a plane... lol
View full discussion on r/LocalLLaMA
u/Glittering_Focus1538 2026-05-18 Resources General

I was frustrated that every coding agent (OpenCode, Cursor, Claude Code) assumes you're running GPT-5.4 or Claude Opus. If you try them with a local model like Gemma or Qwen they fall apart. I find that often tool calls fail, context overflows, multi...

u/rinaldo23 (+222)Interesting. I think there is a trend towards using smaller, more focused models for specific tasks. Extraordinary claims require extraordinary evidence though.
u/OsmanthusBloom (+136)Interesting tricks. Though I wish these could be integrated with existing tools like Pi or OpenCode instead of creating Yet Another Coding Agent. See for example little-coder which is nowadays a set o...
u/Orolol (+67)> OpenCode scores ~75% with 14B models. Which Model ? Which Benchmark ? If you want to be taken seriously,you have to be precise enough so people are able to reproduce your results.
View full discussion on r/LocalLLaMA
u/gvij 2026-04-28 Discussion Quantization & Backends

Evaluated Qwen 3.6 27B across BF16, Q4\K\M, and Q8_0 GGUF quant variants with llama-cpp-python using Neo AI Engineer. Benchmarks used: HumanEval: code generation HellaSwag: commonsense reasoning BFCL: function calling Total samples: HumanEval: 164 He...

u/PassengerPigeon343 (+350)Really like seeing this kind of comparison across quants, I feel like we need more of that kind of analysis on here. Thanks for doing this!
u/doc-acula (+72)Gemma4 would be nice.
u/audioen (+71)No error bars in these measurements. We know that Q4\K\M is not likely to better than Q8_0 and the fact benchmark ordered them in this order at least once raises the question of how much this is just ...
View full discussion on r/LocalLLaMA
u/technaturalism 2026-04-20 Funny General

sycophancy: deleted efficiency per token:+1000% friendship: just beginning edit: “sup” got cut off at top

u/RoomyRoots (+414)Sounds like an average talk in TInder but with better vocabulary.
u/Own-Potential-2308 (+223)AutismGPT
u/No_Swimming6548 (+118)You finally built her
View full discussion on r/LocalLLaMA
u/Course_Latter 2026-05-02 Other General

I built hfviewer.com, a small tool for visually exploring Hugging Face model architectures. You can paste a Hugging Face URL and get an interactive visualization of the architecture, which can make it easier to understand how different models are str...

u/CheatCodesOfLife (+47)If you have two chrome tabs open with these links: then click back and forth between them, the diagram jumps up and down. I'm not a frontend dev but I'm guessing it's because of the "p" in -pt hanging...
u/Hurricane31337 (+10)Nice idea! Looks really polished!
u/overand (+9)A simple workaround for this (without requiring layout changes) might be to switch to a monospaced font. (It might be a problem if anything requires word wrap, but, it should help.) You could give it ...
View full discussion on r/LocalLLaMA
u/Medical_Lengthiness6 2026-04-19 Discussion Apple SiliconHigh-end GPU (24+ GB)Quantization & Backends

of course this is just a trust me bro post but I've been testing various local models (a couple gemma4s, qwen3 coder next, nemotron) and I noticed the new qwen3.6 show up on LM Studio so I hooked it up. VERY impressed. It's super fast to respond, han...

u/cosmicnag (+154)Its the best local model so far IMO. On a 5090, the friggin speed gives an overall unmatched experience to any cloud model. The speed is insane. Havent even tried a NVFP4 yet lol.
u/H_DANILO (+78)you can easily go 256k, context is VERY CHEAP on Qwen, and this model is REALLY good with context
u/john0201 (+77)The latency and general experience is just better. I bought the perplexity search api credits and I just straight up prefer it for many things. Opus 4.7 is still much better for coding. I sometimes us...
View full discussion on r/LocalLLaMA
u/gladkos 2026-05-08 Tutorial | Guide Apple SiliconQuantization & Backends

Implemented Multi-Token Prediction for LLaMA.cpp. Quantized Gemma 4 assistant models into GGUF format. Ran tests on a MacBook Pro M5Max. Gemma 26B with MTP drafts tokens 40% faster. Prompt: Write a Python program to find the nth Fibonacci number usin...

u/grumd (+128)Would be interesting to see the same comparison but with the same seed and with temp 0.0, supposedly the output would be the exact same, proving MTP isn't degrading quality
u/k4ch0w (+72)Set the seed to the same like 42 and then set temp to 0
u/[deleted] (+50)[removed]
View full discussion on r/LocalLLaMA
u/danielhanchen 2026-04-17 Resources Quantization & Backends

Hey guys, we ran Qwen3.6-35B-A3B GGUF KLD performance benchmarks to help you choose the best quant. Unsloth quants have the best KLD vs disk space 21/22 times on the pareto frontier. GGUFs: We also want to clear up a few misunderstandings around our ...

u/danielhanchen (+73)For more a more HQ and cleaner graph, see: The CUDA 13.2 issue (ie all 4bit quants getting gibberish (not just ours - everyone's) will be fixed in CUDA 13.3 as…
u/PiratesOfTheArctic (+45)These graphs are fantastic for idiots like me who can't work out what is what, thankyou so much
u/tavirabon (+31)Interesting how you use % of models affected when where it makes you look better, but leave it out where it makes you look worse, all the while the issue isn't prevalent in larger-sized quants where y...
View full discussion on r/LocalLLaMA
u/guiopen 2026-04-25 Discussion General

other companies are slowly going away from open weight, not releasing base models, delaying open weight distribution, not releasing top models (this one I think is fair, but still), and I also noticed they stopped publishing research (old Gemma and q...

u/Daemontatox (+260)deepseek's contribution isnt just the models , alot of people forget the kernels and repos they open source which are insanely helpful
u/KeikakuAccelerator (+135)They straight up open sourced a new file system to squeeze more training. They are efficiency goats
u/ttkciar (+87)I think it's fine, because we have some excellent smaller models from other labs (most recently Qwen3.6 from Alibaba and Gemma4 from Google), some of which do have base models (Gemma, Olmo, K2-V2). Wh...
View full discussion on r/LocalLLaMA
u/The_Paradoxy 2026-05-11 Discussion Quantization & Backends

My personal test for small local LLM intelligence is to check whether a model has any ability to understand the code that I write for my own academic research. My research is on some pretty niche topics and I doubt that anything like it is substantiv...

u/autisticit (+189)Where can I download that "intelligent human" ?
u/Altruistic-Dust-2565 (+61)It's currently under preview. Only enterprises and institutions can have access. Stay tuned.
u/Real_Ebb_7417 (+36)It’s to dangerous to release publicly, only a limited set of companies has access to it for now to assure security.
View full discussion on r/LocalLLaMA
u/pmttyji 2026-04-28 News General

Model(s) or Tool upgrade/New Tool? Source Tweet :

u/RepulsiveRaisin7 (+125)New devstral? Current model is pretty meh, hope they manage to catch up to the industry
u/szansky (+72)Okay I need something similar to Qwen 3.6 27B !
u/CYTR_ (+37)Mistral understood very well that the race for SOTA is pointless for becoming profitable. They send fleets of engineers to their clients, manufacture tools, and fine-tune the models. That's the main p...
View full discussion on r/LocalLLaMA
u/Unfounded_898 2026-04-20 Discussion General

I’ve been testing Google’s Gemma-4-E2B-it as a local, offline resource for emergency preparedness. The idea was to have a lightweight model that could provide basic technical or medical info if the internet goes down. As the screenshots show, the saf...

u/Klutzy-Snow8016 (+282)Frankly, it should be designed to refuse this. It's a small model that doesn't have a lot of world knowledge, and you're basically asking it to hallucinate you into an early grave. I'm imagining someo...
u/LadyPopsickle (+265)That is why people make uncensored versions of such models.
u/iliark (+257)To be fair, Gemma's answer for the first one is actually correct. You should not remove the shrapnel on your own even with perfect instructions from an LLM, you should leave it in place.
View full discussion on r/LocalLLaMA
u/jacek2023 2026-05-04 News Quantization & Backends

Chat Template was fixed a few days ago choose your fav dealer:

u/interAathma (+96)Can anyone tell, what was broken and what was improved in this new gguf?
u/Silver-Champion-4846 (+87)What did this fix exactly?
u/dampflokfreund (+66)Or just use the current model with the updated chat template. In llama.cpp use --chat-template-file "path to your updated jinja", in koboldcpp there is also a feature that allows this now (under loade...
View full discussion on r/LocalLLaMA
u/Epicguru 2026-04-17 Discussion Quantization & Backends

I spent some time yesterday after work trying out the new qwen3.6-35b-a3b model, and at least for me it's the first time that I actually felt that a local model wasn't more of a pain to use than it was worth. I've been using LLMs in my personal/throw...

u/Better-Struggle9958 (+406)every release same posts
u/EuphoricPenguin22 (+71)Yeah, but if this thing is better than the dense Gemma 4 31B, like the benchmarks I've seen suggest, this is killer. Gemma 4 is the first model for me to pass this threshold, so doing that but way fas...
u/Epicguru (+71)I guess that's what happens when every new release is better than the last...
View full discussion on r/LocalLLaMA
u/cgs019283 2026-05-17 Funny General

Link post. The discussion is mostly in the comments. Target:

u/VoiceApprehensive893 (+98)id get some clown makeup just in case
u/thrownawaymane (+73)> "The moment you've all been waiting for... today we're happy to announce ShieldGemma 4"
u/stoppableDissolution (+71)There is no interest in doing that for them*
View full discussion on r/LocalLLaMA
u/mayocream39 2026-04-22 New Model Quantization & Backends

Hi LocalLLaMA, I created a post a few weeks ago, but this time this project has become more reliable and easier to use. This is a manga translator that can also be used to translate any image. It uses a combination of object detection, visual LLM-bas...

u/Mayion (+86)Can't wait for it to have a browser extension to translate in real-time the "manga" I am reading
u/mayocream39 (+42)We have a GitHub issue for this; we wanna integrate with the ComicReadScript project. We are currently waiting for the author's response to integrate. <3
u/mayocream39 (+30)There are so many features I didn't mention in this main thread. If you are interested, please take a look at the GitHub README. I spent almost one year polishing this project, and while there may be ...
View full discussion on r/LocalLLaMA
u/oobabooga4 2026-04-24 Resources Quantization & Backends

Link post. The discussion is mostly in the comments. Target:

u/dinerburgeryum (+64)Great writeup, thank you. I speculate Gemma's degradation is actually related to the decision to continue to quantize the SWA cache. The team had initially made the decision to keep SWA in 16-bit alwa...
u/seamonn (+25)So Gemma starts getting Brain Damage on cache quantization
u/keyboardhack (+19)The attention rotation that llama.cpp has implemented was not inspired by turboquant.the inspiration is from here Long before turbo quant even existed. GG links to it here.
View full discussion on r/LocalLLaMA
u/sdfgeoff 2026-04-23 Discussion High-end GPU (24+ GB)Quantization & Backends

Launched claude code, pointed it at my running Qwen, and, well, it vibe codes perfectly fine. I started a project with Qwen3.6-35B-A3B (Q4) yesterday, and then this morning switched to 27B (Q8), and both worked fine! Running on a dual 3090 rig with 2...

u/Canchito (+118)Qwen 3.6 is not only really usable for coding, but also writing, as well as other applications. I thought I was done being pleasantly surprised for the month after Qwen 3.5 and Gemma 4, but damn... Th...
u/RealestNagaEver (+40)What kind of generation speed do you get with 2x3090 and 27b model?
u/mxmumtuna (+32)Qwen would love to tell you a story about Elara and her detecting the smell of ozone.
View full discussion on r/LocalLLaMA
u/abkibaarnsit 2026-04-29 New Model General

Link post. The discussion is mostly in the comments. Target:

u/sine120 (+80)Interesting in a few use cases. Glad there's still some competition, hopefully they continue improving.
u/abkibaarnsit (+71)Also released : Claiming to beat frontier models on Benchmarks
u/Middle_Bullfrog_6173 (+38)Very weak on benchmarks FWIW. 30B scored 15 on AA index. Equal to the non-reasoning scores of Gemma 4 E4B and Qwen 3.5 2B.
View full discussion on r/LocalLLaMA
u/LLMFan46 2026-05-07 New Model Mid-range GPU (8-16 GB)Quantization & Backends

llmfan46/Qwen3.6-27B-uncensored-heretic-v2-Native-MTP-Preserved: llmfan46/Qwen3.6-27B-uncensored-heretic-v2-Native-MTP-Preserved-GGUF:

u/inrea1time (+33)Good effort! Would love to try it, can you add a Q4\K\XS to run on 16GB with enough context? Does the MTP work with TurboQuant compressed kv?
u/-p-e-w- (+31)Heretic optimizes for minimal KL divergence on non-refusal prompts, so the acceptance rate for those should remain roughly the same. For prompts that were previously refused and thus had their behavio...
u/hideo_kuze_ (+24)> We choose to run cutting edge AI models in 16GB VRAM GPUs in this year and do the other things, not because they are easy, but because they are hard!
View full discussion on r/LocalLLaMA
u/CountlessFlies 2026-04-17 Discussion High-end GPU (24+ GB)Mid-range GPU (8-16 GB)Quantization & Backends

I've tried a few different local models in the past (gemma 4 being the latest), but none of them felt as good as this. (Or maybe I just didn't give them a proper chance, you guys let me know). But this genuinely feels like a model I could daily drive...

u/ailee43 (+82)every day i regret more the 16GB of VRAM on my 5070ti.... should have gone 3090
u/Uncle___Marty (+61)Saw someone making a reply to another post about qwen 3.6 saying roughly "so many qwen 3.6 posts are getting boring". I TOTALLY disagree. I'm literally swimming in posts with peoples experiences right...
u/grumd (+38)I've got a 5080 (also 16GB) and the only model I can't run is Qwen 27B and Gemma 31B. We're good, mate. Just use llama.cpp and offload MoE experts to RAM. I'm running Qwen 3.6 35B-A3B with FULL 262k c...
View full discussion on r/LocalLLaMA
u/seamonn 2026-04-21 Tutorial | Guide CPU / Raspberry PiQuantization & Backends

A lot of people in the Gemma 4 Model Request Thread were asking for better vision capabilities in the next Gemma Model. This tells me that people are not configuring Gemma 4's vision budget. Gemma 4 ships with [Variable Image Resolution](

u/seamonn (+32)cmd: | llama-server --port ${PORT} --model '/models/Gemma4/gemma-4-31B-it-Q8_0.gguf' --mmproj '/models/Gemma4/mmproj-F32.gguf' --jinja --chat-template-file '/models/Gemma4/google-gemma-4-31B-it-interl...
u/segmond (+30)Thanks for sharing.
u/Temporary-Mix8022 (+23)Thanks for writing it.. and thanks for the typos that I believe that only a human could have made. (Genuinely, zero sarcasm) Literally just so happy to read something that isn't slop. Also, I was doin...
View full discussion on r/LocalLLaMA

Tested DeepSeek V4 Pro on FoodTruck Bench — our 30-day agentic benchmark where models run a food truck via 34 tools (locations, pricing, inventory, staff, weather, events) with persistent memory and daily reflection. First Chinese model to land in th...

u/Total_Activity_7550 (+68)Good for DeepSeek, but Claude Opus 4.6 doing 1.7x profit over next group of models (and that's not even Mythos) rings a bell that they're leaving competitors behind...
u/Disastrous_Theme5906 (+41)Yeah, agreed — Opus is in a league of its own right now. Worth noting xAI and Google's flagships are also lagging on this, not just the Chinese tier.
u/Disastrous_Theme5906 (+22)Wanna test it but not ready to drop $300+ on a full benchmark run rn. Tried 5.3 and 5.4 before that and they kept going into infinite loops in their replies. Sometimes a single request hit like $1 in ...
View full discussion on r/LocalLLaMA
u/Lowkey_LokiSN 2026-04-17 Discussion Quantization & Backends

I have a personal eval harness: A repo with around 30k lines of code that has 37 intentional issues for LLMs to debug and address through an agentic setup (I use OpenCode) A subset of the harness also has the LLM extract key information from reasonab...

u/R_Duncan (+64)Please add your configuration for Qwen, and quantization used.
u/dampflokfreund (+34)There's still a lot of bugs left in Gemma to squash. For example there is one where it will tell you it's going to do X now but then fails to call the tool in its thought process. Or it is going to te...
u/Lowkey_LokiSN (+30)Nothing specialized. Just went with model card recommendations. Qwen config: For agentic coding: temperature=0.6, topp=0.95, topk=20, minp=0.0, presencepenalty=0.0, repetitionpenalty=1.0 For PDF resea...
View full discussion on r/LocalLLaMA
u/techlatest_net 2026-04-25 New Model CPU / Raspberry Pi

DeepSeek V4 Update

u/datbackup (+164)For the pill question, i agree with V4’s assumption. 60 minutes is the more logical answer, even if 90 minutes is perhaps the technically “correct” answer. The problem is that the user failed to adequ...
u/tm604 (+91)"This is a classic puzzle/riddle" implies that both of these are in training data, so it's perhaps not the most exciting result of the year.
u/Materva (+64)The real question here is what medication is requiring you to take it every 30 minutes? This sounds like the pharmaceutical version of a power hour.
View full discussion on r/LocalLLaMA

llmfan46/Qwen3.6-35B-A3B-uncensored-heretic-Native-MTP-Preserved: llmfan46/Qwen3.6-35B-A3B-uncensored-heretic-Native-MTP-Preserved-GGUF:

u/MmmmMorphine (+26)Sorry if this is a dumb question Are these MTPs in any way modified to match the Heretic model, so otherwise refused tokens are generated correctly. Or does that not really matter since once past a ch...
u/LLMFan46 (+23)>Please do Gemma 4 heretic llmfan46/gemma-4-31B-it-uncensored-heretic-GGUF: llmfan46/gemma-4-26B-A4B-it-uncensored-heretic-GGUF:
u/craftogrammer (+20)Things are so fast that all I can do is click the three dots and save the post to read later. Thank you so much!
View full discussion on r/LocalLLaMA
u/Creative-Regular6799 2026-05-16 Discussion Mid-range GPU (8-16 GB)

Qwen3.6-35B-A3B and 9B are officially on the public Terminal-Bench 2.0 leaderboard! little-coder × Qwen3.6-35B-A3B hit 24.6% (±3.2), and now land above Gemini 2.5 Pro on Gemini CLI (19.6%) and Qwen3-Coder-480B on Terminus 2 (23.9%). I didn’t expect t...

u/MichaelDaza (+43)Been running qwen 3.5 9b on 2x3060s and i dont feel like switching anytime soon. Reads images quickly, and does well on long chained conversations. Smaller models are getting pretty impressive.
u/Agitated_Space_672 (+23)are those 12GB vram? Would Qwen3.6-35B-A3B not be better? I have this running on one 3060 at about 15-20tps
u/Jealous_Crow1346 (+16)The scaffold-model gap holding on Terminal-Bench 2.0 is genuinely surprising, great to see 35B punching above its weight against 480B. Rooting for the open source push to the top!
View full discussion on r/LocalLLaMA

A bit of an interesting story of model degradation and censorship. So, one of my use cases for AI has been translating and reading an Chinese novel as it appears, chapter by chapter. Due to the way some characters have secret identities plot points, ...

u/Uncle___Marty (+182)Google did something with the language abilities of Gemma 4 which really puts it in a class of its own. I've seen SO many posts praising gemma 4 for this. I was a little disappointed with Gemma at fir...
u/Potential-Gold5298 (+71)You are not the only one who came to these conclusions: The RP community has been stuck with Mistral Nemo and Mistral Small (both two years old) simply because there was not…
u/Salt-Willingness-513 (+36)Agreed. Never saw such a tiny model (and actually 80% of the SOTA models are not capable in that) being so capable in writing and transcribing swiss german.
View full discussion on r/LocalLLaMA
u/LocalAI_Amateur 2026-04-26 Discussion Mid-range GPU (8-16 GB)Quantization & Backends

A bit of context. I was coding up a little html tower defense game where you can alter the path by placing additional waypoints. My setup: 32gb ram with 16gb vram 5070 ti. Using AesSedai/Qwen3.6-35B-A3B-GGUF IQ4_XS on LM Studio with OpenCode. I've gr...

u/Pyros-SD-Models (+62)It looks like the one game every LLM on earth somehow wants to implement if you ask it for a small puzzle game: laser-refractor-puzzles :D but yes, dense qwen best qwen
u/ridablellama (+58)Its really impressive and a huge relief to know that if worse comes to worse it will be the baseline. that can never be taken away from anyone who has 16-24 GB vram. no matter how expensive monthly cl...
u/SkyFeistyLlama8 (+28)In just two years of local LLMs, we've gone from stochastic parrots like Llama to Qwen 3.6 and Gemma 4 in roughly the same amount of RAM. You're right, this is the worst it will ever be. I can't remem...
View full discussion on r/LocalLLaMA

I've been building Abliterlitics, an open-source abliteration forensics toolkit. The idea is straightforward: take the same base model, compare the different abliteration techniques others have applied, then measure what actually changed using benchm...

u/tempedbyfate (+36)Thanks for doing the leg work on this. Appreciate it!
u/nathandreamfast (+32)Sure going forward I'd be happy to add something like that, Best picks: Heretic and Huihui. Both remove safety completely, both preserve capabilities within 1% of the original, and both do it with cle...
u/nathandreamfast (+16)Thanks! I hope that this makes people have a better choice when choosing what models to try out
View full discussion on r/LocalLLaMA
u/Demonicated 2026-05-01 Discussion Quantization & Backends

So in response to the Great Token Reconning of 2026, I decided to try out Qwen 3.6 as a daily driver, and although it's only been about a day, I have to say I'm thoroughly impressed. I had to download the VSCode insiders edition and set up the local ...

u/mxmumtuna (+125)You need to be using sglang or vLLM with that 6000. It’s significantly faster due to MTP support and significantly better with large context. Rather than the NVFP4 in the…
u/mxmumtuna (+28)You’re just seriously gimping that card by running llama.cpp and friends. Join the discord mentioned below. We can help you work it out, though most of us run Linux.
u/redditrasberry (+25)I think you touch on one of the reasons there is so much disagreement on how useful local models are. If you really need your hand held then that is where full scale hosted models are very different. ...
View full discussion on r/LocalLLaMA
u/LocalAI_Amateur 2026-04-20 Discussion Quantization & Backends

Gemma 4 26b-a4b-it is basically a solid B student that gets the job done. Qwen3.6-35b-a3b is an A+ student that has plenty of energy after finishing the assignment to add flairs. On a my 16vram video card. Both models runs comparable speed. On Window...

u/ambient_temp_xeno (+51)You can just ask models to make what you want. If you just say "tetris pls" it might give you a basic becky one.
u/Sadman782 (+37)A custom fineutne or a system prompt can make gemma's default frotned style much better than what it is now (It is capable, but lazy like gemini). It doesn't make Qwen better in coding. for example \\...
u/Sadman782 (+28)exactly. See the difference, just changed the system prompt (Gemma 4 26B MoE)
View full discussion on r/LocalLLaMA
u/danielhanchen 2026-04-20 Resources Quantization & Backends

Hey r/LocalLLaMA we conducted KL Divergence benchmarks for Gemma 4 26B-A4B GGUFs across providers to help you pick the best quant. Mean KL Divergence puts nearly all Unsloth GGUFs on the Pareto frontier KLD shows how well a quantized model matches th...

u/Educational_Rent1059 (+29)Awesome work and good insight, thanks for your efforts
u/qfox337 (+20)Would it make sense to include inference speed benchmarks (I realize there's a big question of "on which hardware"), or is there usually little difference / performance impact of kernels for different...
u/Far-Low-4705 (+19)UD-IQ2\XXS is a better quant at 9Gb than Q4\K_M from ggml-org at 16Gb This is crazy stuff, i remember the days where a Q3 or even some Q4 quants would produce completely garbled outputs.
View full discussion on r/LocalLLaMA
u/reto-wyss 2026-05-01 New Model High-end GPU (24+ GB)Quantization & Backends

Can confirm it works on a 5090, with 80% allocation (of 32gb) I got around 50k context. - It's 18.8GB | Benchmark | Baseline (Full Precision) | NVFP4 | | --- | --- | --- | | GPQA Diamond | 80.30% | 79.90% | | AIME 2025 | 88.95% | 90.00% | | MMLU Pro ...

u/ubrtnk (+143)
u/annodomini (+40)Anyone tried the petit kernels to run NVFP4 on ROCm? These NVFP4 results looks really good, wondering how well they'll run on AMD without native support. edit: oh, it looks like there's Vulkan support...
u/Its-all-redditive (+36)Evaluation results seem odd. NVFP4 outscoring full precision? These must not be an average score over lots of runs.
View full discussion on r/LocalLLaMA
u/jacek2023 2026-04-21 Discussion General

tell the Gemma team:

u/ResidentPositive4122 (+109)The small models are already good. Let's see what 124B was all about. We'll find hardware to run it :)
u/DelKarasique (+89)Midrange one. Like 70b. I think that's a sweet and empty spot right now.
u/DeepOrangeSky (+88)70b dense 124b MoE
View full discussion on r/LocalLLaMA
u/FederalAnalysis420 2026-04-27 Discussion Apple Silicon

Bench 2 from my 18GB M3 Pro. Last week was specialists vs generalists at 7-8B (which I hosed by giving thinking models a 128-token budget, so half the post was an apology). This week: the 4B class of 2026, every model released or actively-current at ...

u/Pristine-Woodpecker (+96)I am confused. If you are punishing models that want to think, why not just disable thinking to begin with?
u/Dabber43 (+57)Isn't gemma 4b double the size of qwen 4b
u/Sufficient-Bid3874 (+48)Yeah it's only 4b active not total, not sure why it was included. E2B would be more relevant
View full discussion on r/LocalLLaMA
u/_maverick98 2026-05-04 Discussion General

I just wanted to share my experience. At work we have Cursor with the Enterprise tier. Today I burned 10$ with 2 prompts, one on gpt-5.5 and one on claude-opus-4.6-thinking. Last month I burned 80$ in one week with claude-opus-4.7 even with the 50% o...

u/jacek2023 (+146)Prices will go up at least 10x. People on this sub are delusional, they think they are being "smart" by using cloud models. There will be more and more crying about prices and limits.
u/misanthrophiccunt (+64)agreed Venture capital modus operandi = cheap first to take the whole market, then hike the price.
u/wurst_katastrophe (+49)Unless they stop open sourcing them in the future.
View full discussion on r/LocalLLaMA
u/MiaBchDave 2026-05-05 Resources Quantization & Backends

Not affiliated with Kaitchup, but a fan of their testing. I was looking forward to this article... and it did not disappoint. Lots of free info in the link. The juicy part is behind a paywall. I'll respect that, but the short of it is: It's showing t...

u/LORD_CMDR_INTERNET (+75)Anecdotally, for coding, I find Qwen3.6 27B and Gemma4 31B trade blows. I will swap Plan/Act roles if either gets stuck and that seems to work quite well.
u/slower-is-faster (+51)I knew it
u/ResidentPositive4122 (+49)> I will swap Funny enough this was one finding ~last year from hf: swapping randomly between opus and gpt5 in the same session led to better results than any of them separately.
View full discussion on r/LocalLLaMA
u/WeGoToMars7 2026-04-17 Discussion Quantization & Backends

I'm using the fork for Bonsai, regular llama.cpp for Gemma. Without embedding parameters: Gemma 4 has 2.3B at 4.8 bpw (Q4\K\M) = 1104 MB Bonsai-8B has 6.95B at 1.125 bpw (Q1_0) = 782 MB (only 29% smaller) I could've gone with a smaller quant of Gemma...

u/KaroYadgar (+89)Indeed, it should be noted that Bonsai was built on Qwen3, not Qwen3.5, so its issues may stem from the fact that it's built on the previous generation rather than purely its quantization impacting it...
u/charlesrwest0 (+66)If we are being fair, Google has way more resources than the bonsai team. It's a cool proof of their concept, but I'm not sure it's really super production ready.
u/WeGoToMars7 (+54)Qwen3-8B-Q4\K\M has no issue with all three questions, so it's PrismMLs quant process that lobotomised it! With almost 3x the RAM used, it's not a fair comparison, but PrismML was the one to claim tha...
View full discussion on r/LocalLLaMA
u/pmttyji 2026-04-22 Discussion General

I created this chart with recent open models from last 6 months. Few might be older than that possibly. Included only latest versions(Ex: Only Kimi-K2.6, no Kimi-K2.5 & Kimi-K2. Also only GLM-5.1 & GLM-4.7, no GLM-4.6 & GLM-4.5). I couldn...

u/nextlevelhollerith (+52)I remember the times when we were amazed by GPT-4. GPT-4o (Nov 24) as a Intelligence Index of 17. Now Qwen 3.6 35 has 43. With some decent hardware you can run that locally. Its remarkable what open m...
u/jacek2023 (+48)"Possibly best 6 months for Local LLMs?!?" and they are constantly complaining that local LLMs are dead :)
u/200206487 (+33)Qwen3.6 27 dense just dropped
View full discussion on r/LocalLLaMA
u/dreamai87 2026-04-17 Discussion Mid-range GPU (8-16 GB)LaptopsQuantization & Backends

Hi guys, Back again. I have tested the Qwen 3.6 UD 2 K_XL Unsloth model on the same paper to web app task. The model is performing very well. It handled all tool calls properly and also managed large context using llama.cpp on a 16GB VRAM on laptop. ...

u/youcloudsofdoom (+38)FYI I'm getting 30 t/s generation on 8GB VRAM/192k context with the Q4 KXL model, if the 2bit quant starts getting you down.. .
u/sToeTer (+29)I asked it to create a book-pdf suite where i can process my books for printing. Extremely detailed prompt. It doesn't care and decides it wants to create a gambling website... :D Prompt: Answer: I do...
u/winless (+25)A "bookmaker" is a person or org who handles bets on sports and stuff. My guess is that it didn't have a big enough context window to keep track of the instructions, and just kept going off the folder...
View full discussion on r/LocalLLaMA
u/Unstable_Llama 2026-05-11 News General

Turboderp has a been on an absolute tear recently, in the endless battle to cram new llamas into smaller, faster boxes. We started off last month with the release of gemma 4 support, and continued with [improved caching efficiency](

u/Such_Advantage_6949 (+21)Dflash with qwen3.6 27B is so fast
u/Unstable_Llama (+15)Correct, it does not.
u/OXKSA1 (+14)is exllama has no cpu offload?
View full discussion on r/LocalLLaMA
u/Lowkey_LokiSN 2026-04-22 Discussion Quantization & Backends

This is a follow-up update to my previous post comparing Qwen 3.6 35B vs Gemma 4 26B. I wanted to particularly follow-up with the following: 1. Gemma 4 26B could've suffered the quantization tax…

u/Lowkey_LokiSN (+22)1 MI50 32GB Xeon 6148 128GB ECC DDR4 2666hz
u/FusionCow (+17)well sure, but then when the 27b dense comes out and beats the moe what will you say then
u/Reactor-Licker (+15)What hardware are you using?
View full discussion on r/LocalLLaMA
u/PaceZealousideal6091 2026-05-08 Discussion High-end GPU (24+ GB)Quantization & Backends

Past few days, its all been about MTPs. Somehow people missed out the fact that Z lab released the Dflash for Gemma4 26B a couple of days ago. As far as my understanding goes, Dflash should be a better alternative than MTP because of faster parallel ...

u/coder543 (+52)yes, someone posted this about 5 minutes before you:
u/coder543 (+40)> MTP should technically degrade faster because the kv cache will start balooning faster. For Gemma 4, none of this is true. MTP in Gemma 4 reuses the model's KV cache. I am very excited for DFlash...
u/coder543 (+9)Gemma 4’s MTP implementation is unique, as far as I’m aware
View full discussion on r/LocalLLaMA
u/Mayion 2026-04-18 Discussion Quantization & Backends

I don't know if it's something I am doing horribly wrong or what, but running Open WebUI w/ Terminal on Docker with the models on LM Studio and I am starting to think the community keeps praising the tool calling feature just to cope lol Qwen3.5 27B,...

u/jacek2023 (+105)It works for sure with opencode
u/SNThrailkill (+95)I find openweb UI to not be a great harness. However like others have said, I'm having much more success with it on opencode which is awesome for coding but not so much for personal tasks. Looking for...
u/HopePupal (+34)you didn't mention which quants you're using. running an aggressive quant can be an issue, especially with small models. and by aggressive i mean under Q6 or maybe Q5 if the model's very quantization ...
View full discussion on r/LocalLLaMA
u/nikhilprasanth 2026-04-30 Discussion General

Have Qwen 3.6 27B and Qwen 3.6 35B basically made most of the older \~30B models irrelevant? They seem to beat stuff like Qwen coder 30B, GPT OSS 20B, Gemma models, especially for coding and agent workflows. At this point I’m not really finding a rea...

u/StupidScaredSquirrel (+203)Nemotron is fast af for long context. Gpt oss 20b is very small. But generally speaking yes newer models take the place of older ones, this isn't sensational or surprising
u/dionysio211 (+87)I think they are each carving out niches that play to their strengths, speaking to fresh models in this size range. Anything older than 6 months is fighting an unfair fight. Gemma is MUCH better than ...
u/simon_zzz (+52)For writing and summarization, I lean towards the Gemma models.
View full discussion on r/LocalLLaMA
u/LegacyRemaster 2026-05-18 Discussion General

After the recent releases, there's almost a sense of emptiness. When do you think new models will be released? Looking at the chart, it's between the end of May and the beginning of June, but... I don't know why, it seems like something's changing ab...

u/L0ren_B (+67)I'm refreshing LocalLLaMA everyday for a new Qwen 27B model! (wishful thinking!). But, somehow I an sceptic that something as good as GPT5.2 (which would be amazing if local) would ever become availab...
u/Healthy-Nebula-3603 (+44)Yep ..qwen 3.6 27b dense raised the bar very high :) Only competitor ( not for coding of course) is Gemma 4 31b dense currently.
u/durden111111 (+43)Gemma 4 123B or Qwen 3.6 122B would be huge
View full discussion on r/LocalLLaMA
u/JackStrawWitchita 2026-05-05 Discussion CPU / Raspberry Pi

This is crazy. I've been running local LLMs on CPU only for awhile now and have great results with 12B models running on an i5-8500 and only 32GB of RAM with no GPU. But I've got a version of Gemma4 26B running really fast on the same machine which i...

u/GoodTip7897 (+113)That's because Gemma 4 26B is a mixture of experts model that only uses 4B parameters every token. So it should be about as fast as a 4B model. Even though Qwen 3.6 27B has just 1B more total paramete...
u/CooperDK (+46)Then he should get Qwen3.6-35B-A3B. One billion less parameters active. Should be 25% faster. It isn't though.
u/LetsGoBrandon4256 (+28)> [14:39:05] CtxLimit:150/8192, Init:0.00s, Processed:30 in 1.30s (23.13T/s), Generated:45/1024 in 4.86s (9.25T/s), Total:6.16s 23.13T/s is the prompt processing speed. The actual generation speed ...
View full discussion on r/LocalLLaMA
u/antirez 2026-05-08 News Apple SiliconQuantization & Backends

Link post. The discussion is mostly in the comments. Target:

u/foldl-li (+31)This section is really great. ollama shall learn something from this.
u/goat_on_boat (+18)This is unreal. Performance is insane and the model seems to be a cut above Qwen3.6/Gemma4 i've been playing with. 2 bit quant, running M5 Max 128gb getting ~35tk/s generation at 300tk/s prefill. Cont...
u/antirez (+13)The only llama.cpp DS4 implementation I'm aware of that works reliably is the one I published on a fork. DS4 is faster. When the official llama.cpp implementation will be released there is to benchmar...
View full discussion on r/LocalLLaMA
u/Jorlen 2026-05-15 Discussion Mid-range GPU (8-16 GB)Quantization & Backends

In my opinion, MTP models are 100% game changer for local LLMs. In terms of speed, I was getting around 1.5x the tok/sec of previous tests but only with the dense full 27b Qwen 3.6 model. The MoE 35B version gained less than 10% with the MTP version....

u/Southern_Sun_2106 (+31)I just ran qwen 3.6 35B in LM Studio on a Mac to full 265K context. The model itself is just amazing - no sign of slowing down, no mistakes when calling tools, if I didn't know the number, I would thi...
u/Jorlen (+12)I have tested 40+ models and none work better than it, I agree. With VScodium and roo, it's just fucking perfect. It's doing all the tool calls and knows exactly when to call APIs and everything else....
u/redblood252 (+11)I hope this will work for my iq3 dense 27b on my 16gb vram
View full discussion on r/LocalLLaMA

I recently published MTP quants of Qwen 3.6 27B and I was suprised by the reports here on reddit, and on HF, of users who were experiencing worst speed with speculative inference than without. This did not match what I was seeing, but when I tried to...

u/Chromix_ (+32)Keep in mind that the impact on a MoE model will be worse, especially if partially offloaded, as it needs to cycle through more experts to speculate, instead of just going through the same tensors lik...
u/Look_0ver_There (+15)I downloaded your models and tested them. One thing I immediately noticed though, and to be fair this seems to be caused by the MTP implementation itself and not your models, is that PP speeds were li...
u/ex-arman68 (+7)Definitely. From what I understand, MTP is not suitable for MoE with small models. Dense models are the ones that benefits the most from it, and the bigger the model, the bigger the benefits.
View full discussion on r/LocalLLaMA
u/cafedude 2026-05-11 Discussion General

I'm still hoping we see a Qwen3.6-122B or a Qwen3.6-coder, but my hopes are dimming. Seems like we would have seen/heard something by now, even if just tantalizing hints from the Qwen folks.

u/NNN_Throwaway2 (+128)As I keep trying to point out, the 3.6 27b blog post implied there will be no more 3.6 model releases. Of course, nobody wants to believe it, so I’m not going to waste time arguing. We’re going on thr...
u/a_beautiful_rhind (+56)They had a major restructure so it doesn't look good.
u/cyber_burr (+52)The fact that they are hesitant to publish small models is nothing short of a tragedy for GPU-poor people. With every major Qwen release, you could get small models ranging from <1b, 4b, 8b, etc. I...
View full discussion on r/LocalLLaMA
u/Opening-Broccoli9190 2026-05-14 News Quantization & Backends

>The NVIDIA Kimi-K2.6-NVFP4 model is the quantized version of the Moonshot AI's Kimi-K2.6 model, which is an auto-regressive language model that uses an optimized transformer architecture. For more information, please check here. The NVIDIA Kimi-K...

u/TheCTRL (+93)"Model Limitations: The base model was trained on data that contains toxic language and societal biases originally crawled from the internet." So they crawled all linux dev email threads! Good to know...
u/2Norn (+38)that just means the model is not censored enough for western standards lol
u/FullOf_Bad_Ideas (+21)it's copy-paste boilerplate also present here:
View full discussion on r/LocalLLaMA
u/EffectiveMedium2683 2026-05-11 Other High-end GPU (24+ GB)Quantization & Backends

Originally I was a diehard fan of Gemma4 26b-a4b because it really is a remarkably intelligent llm. Ran qwen3.6 via ollama and found it impressive but still favored Gemma. Ollama did it a disservice at least on my pc. Ran it straight through llama.cp...

u/our_sole (+36)I am just stunned how well qwen3.6-35b-A3B MOE is working for me. I have an rtx 3090 24GB VRAM, 64GB RAM on a beelink gti14 Ultra 9185H CPU and the beelink eGPU dock. I switched from LM Studio to llam...
u/bighead96 (+19)There's a common belief that Gemma4 is very smart, its not, its actually very dumb. It's very good at confidentally telling you its fixed things and here are the issues and how it resolved them. If yo...
u/cmndr_spanky (+17)You almost certainly have param settings / context window settings wrong with Qwen. It does think for a while but not like that.
View full discussion on r/LocalLLaMA
u/chain-77 2026-05-08 Discussion High-end GPU (24+ GB)Quantization & Backends

I ran a benchmark to see how much DFlash speculative decoding actually helps in vLLM. Setup: GPU: RTX 5090, 32GB VRAM vLLM: 0.19.2rc1 Main model: cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit Draft model: z-lab/gemma-4-26B-A4B-it-DFlash Workload: random datas...

u/JLeonsarmiento (+39)Well, that pretty much kill it for agentic, which is where you want this kind of speed ups to make a difference, innit?
u/coder543 (+39)What performance can you get with Qwen3.6-27B (dense) using DFlash? Or Gemma-4-31B using DFlash, if you can fit that into memory.
u/ATK_DEC_SUS_REL (+38)This is great throughput, but unfortunately DFlash drops off a cliff at high context lengths. By “high” I mean \~20k context or more. Edit: I found better performance by utilizing prefix caching.
View full discussion on r/LocalLLaMA
u/Total-Resort-3120 2026-05-01 News Quantization & Backends

I guess we'll have to wait until this PR is merged before we can test it.

u/jacek2023 (+26)unfortunately PR is still a draft
u/farkinga (+19)I seem to recall a comment by ggerganov on another PR about his intention to refactor the speculative codebase, ultimately to unify the various speculative methods within a more general architecture. ...
u/oxygen_addiction (+16)They were bought be huggingface
View full discussion on r/LocalLLaMA
u/segmond 2026-05-05 Other Apple SiliconQuantization & Backends

DeepSeekv3 OG DeepSeekv3.2/4 Qwen3.5+ GLM4.5+ ~~MiniMax2.5+~~ Step3.5Flash Mimo v2+ Until we get mtp weights, you need to download HF weights and convert to gguf. I think I'm going to try either qwen3.5-122b or glm4.5-air first.

u/GrungeWerX (+54)Doesn't Qwen 3.6 support it as well?
u/Ok_Warning2146 (+48)Well, this beta is only for Qwen3.5/6. Each architecture has their own MTP implementation. So it is not an once for all thing.
u/One-Replacement-37 (+35)It does.
View full discussion on r/LocalLLaMA
u/Nunki08 2026-05-11 News Quantization & Backends

From clem on 𝕏: From Victor M on 𝕏:

u/StupidScaredSquirrel (+68)Eh, most of them are vibe-tuned slop "opus-distill" of qwen and the likes. I'm happy of all the new models but these numbers are kinda meaningless
u/DistanceSolar1449 (+33)It’s all Qwen 3.5/3.6 finetunes lol.
u/ParaboloidalCrest (+14)We'd get by with less than tenth of those. Seriously HF needs a "No finetune" and "Group by Quant type" filters because model discoverability has gotten real shitty. They all look like uuids now.
View full discussion on r/LocalLLaMA
u/zakadit 2026-04-22 Question | Help General

Hello, I’ve been scrolling through a lot of posts, reading personal experiences, setup advice, and replies to beginner questions from people like me. LLMs really seem like a revolution. But at the same time in every post there is issues : they’re exp...

u/jannycideforever (+150)It will virtually ALWAYS be cheaper per token to run Kimi in a giant warehouse running constantly at 90% capacity than it is to run a local version that will be idle 90% of the time. It's just economi...
u/Red_Redditor_Reddit (+53)Dude if you're going to go local, dip your toes in and start small. You don't need some monster machine to get basic llm's to work. You're not going to get the same results as some 5T parameter model....
u/jannycideforever (+45)Cope, unironically. If opus 4.7 went open weight everyone would be losing their minds like it was the second coming because it's better than everything else. I use open weight models all the time, esp...
View full discussion on r/LocalLLaMA

Link post. The discussion is mostly in the comments. Target:

u/Last_Bad_2687 (+12)Super cool, but wish you did more to show it actually processing the images
u/xenovatech (+9)This demo is running Gemma 4 E2B, since it works pretty well in-browser. You can try it out using this demo I posted a couple weeks ago:
u/thetaFAANG (+7)ohh wow I thought this was 3D rendering in the browser whoops would have been impressive as well, but we already know they can do 3D models
View full discussion on r/LocalLLaMA

Provided in both Safetensors and GGUFs. Safetensors: llmfan46/G4-MeroMero-31B-uncensored-heretic: GGUFs: llmfan46/G4-MeroMero-31B-uncensored-heretic-GGUF:

u/Connect_Ad791 (+28)Uhhh, yeah? What else would they be doing with it? Now I have a question though, how are a finetunings refusal percentages calculated? Now wouldn’t you be able to just recursively see what the model r...
u/BawbbySmith (+27)The real reason for local LLM
u/LLMFan46 (+24)G4-Meromero-31B-Uncensored-Heretic is a model made by zerofata, I did not make this model I just Hereticated it and released it, if you have any questions about this model you should ask zerofata. gem...
View full discussion on r/LocalLLaMA
u/Exciting-Camera3226 2026-04-28 Resources Quantization & Backends

edits to call out some information: \- All local model uses \`Q4\K\M\` quantization with \`llama.cpp\` engine \- Main factor contribute to difference with Qwen's official post (59% vs 38%) is probably benchmark task timeout used, then quantization, h...

u/false79 (+29)People who know what they are doing with local models have been doing real work for a while now. I'm talking about the devs who know what aspects of their role is manually repetative and automating it...
u/alrojo (+17)How aggresive are your timeouts? At 1.9 tokens per second that is very slow generation.
u/cygn (+13)so the gap between Qwen's official post (59.3) and what you measured (38.2) for 27b is purely because of the timeout? I still wonder if they have benchmaxxed terminal bench 2.0. Would love to see some...
View full discussion on r/LocalLLaMA
u/PromptInjection_ 2026-05-14 Discussion Quantization & Backends

Many local models have a problem (that raised due to excessive RHLF training): They mostly think that everything that is beyond their knowledge cutoff date would be "fictional" or "satirical". To be fair: Even the Gemini API without web access can ha...

u/CYTR_ (+74)Honestly, if someone had told me last year that the US would launch Operation "Epic Fury" (EPIC FURY, bruuuh) to invade Iran... I would have had a hard time believing it.
u/CatTwoYes (+47)I've hit this on Qwen, Gemma, and Llama models. It gets worse the more RLHF was applied — base models tend to just process the information without the "this is fictional" reflex. Best band-aid I've fo...
u/DeliberatelySus (+36)It's not gay, the balls didn't touch!
View full discussion on r/LocalLLaMA
u/fakezeta 2026-05-05 Resources Quantization & Backends

Hi, recently froggeric and allanchan339 released enhanced/fixed template for Qwen3.6 each one addressing different topics. I didn't know which one to use so I merged both with the help of Claude Opus to have the best of both. I've uploaded it to this...

u/noclip1 (+41)Would love for someone to explain to me how a chat template can be community modified to possibly out perform (or fix bugs) in the intended chat template the Qwen team released and would've been using...
u/ambient_temp_xeno (+24)If it wasn't for the 'used claude opus' part lowering the odds, there would be a 50/50 chance of this being an improvement, historically speaking. The explanation vaguely summed up: 'life on the front...
u/ex-arman68 (+22)Thanks for your work. I have checked your merged template and allanchan339. Here are my thoughts: 1. Long strict tool rules: allanchan339 uses a much longer version (300 tokens). It can be useful for ...
View full discussion on r/LocalLLaMA
u/HornyGooner4402 2026-05-06 Discussion General

Both Gemma 4 and Qwen 3.6 seems to be the hottest local models right now. Looking at the benchmarks and reviews, it seems like it's better in every way: coding, benchmarks, agentic tasks. So is Qwen outright better? In what case would you pick Gemma ...

u/FoxiPanda (+82)Gemma trounces Qwen for my handwriting analysis and general vision tasks at the very least. I also appreciate Gemma in chat significantly more than Qwen (qwen is cold and calculating even with system ...
u/ggonavyy (+77)Gemma4 is really, really, really good at tracing bugs. When you feed a ticket to Qwen3.6 27B it'll fill the context with all available info available and maybe probably find the root cause, and someti...
u/Pro-Row-335 (+36)The only things I use llms for are coding and JP->EN translation, for agentic coding its a nobrainer, Qwen is much better than gemma with tools, for JP->EN translation its also a nobrainer, Gemm...
View full discussion on r/LocalLLaMA
u/TruckUseful4423 2026-05-08 Question | Help Quantization & Backends

Hi everyone, I saw an article saying Chrome silently downloads a \~4GB AI model (likely "Gemini Nano") to your computer for features like text summarization. Two questions: 1. What is the exact name/version of this model? 2. Is there a GGUF file avai...

u/Baldur-Norddahl (+73)Not really an answer to the question, but you can use it locally by writing javascript in the browser, like so: ``` const session = await LanguageModel.create({ outputLanguage: "en" }); const reply = ...
u/Altruistic_Heat_9531 (+31)Nano is E2/4B model, i dont know the quant since it is stored as .bin file instead of safetensors or pth. It is 4gb model Here the result """ \- I am LLM from DeepMind \- I was created by Gem…
u/dryadofelysium (+20)Gemini Nano is based on a custom variant of one of the smaller Gemma 4 models. It was mentioned in some Google/DeepMind blog post when Gemma 4 came out.
View full discussion on r/LocalLLaMA
u/TKGaming_11 2026-05-07 New Model General

Link post. The discussion is mostly in the comments. Target:

u/FoxiPanda (+28)Looks like the pulled it down from HF for some reason? - It was here: - But it's not listed here that I can see: Looks like they also updated their 8B model like an hour ago...so maybe it's just not q...
u/Eyelbee (+12)Okay this is interesting. Now complete that rl pipeline and actually deliver the model, if it then beats the qwen 3.6 27b for real, it would be hugely impressive. That's an extremely high bar though, ...
u/grumd (+9)Aaaaand it's gone. The model card gives me a 404 :(
View full discussion on r/LocalLLaMA
u/NoConcert8847 2026-04-22 Tutorial | Guide Apple SiliconLaptopsQuantization & Backends

Hardware |Component|Details| |:-|:-| |Machine|MacBook Pro (Mac14,6)| |Chip|Apple M2 Max — 12-core CPU (8P + 4E)| |Memory|64 GB unified memory| |Storage|512 GB SSD| |OS|macOS 15.7 (Sequoia)| # AI Agent Setup I'm using the pi coding agent as my primary...

u/nicksterling (+20)I’m happy to see pi getting more love. The extension system is incredible and being able to customize my harness is great. I added Claude Code plugin support via extensions so I’m not losing any compa...
u/hailnobra (+9)Sure thing. Qwen 3.6 is running on a Strix Halo system with 96GB of RAM (75GB allocated to GTT). Host OS is running on CachyOS and llama server currently running on the amd-strix-halo-toolboxes:rocm-7...
u/sine120 (+8)Qwen + Pi has been working really well for me for coding. I just need to get a better search setup and I think I can start phasing out gemini day to day.
View full discussion on r/LocalLLaMA
u/PixelatedCaffeine 2026-05-19 Resources Mid-range GPU (8-16 GB)Quantization & Backends

MTP is amazing. I genuinely thought it would be a nothingburger

u/Borkato (+44)MTP is amazing. I genuinely thought it would be a nothingburger
u/GreenPastures2845 (+24)Gemma4 MTP is not supported yet
u/DR4G0NH3ART (+21)I am going from single digit tps to double digit. Never have been happier. Slapped my old 1660 ti to sit with my 5070 ti today, now I am at 22 gb having fun with qwen. Huge thanks to the community.
View full discussion on r/LocalLLaMA
u/mdda 2026-05-13 Tutorial | Guide Quantization & Backends

I got Qwen 3.6 35B-A3B and Gemma 4 26B-A4B running on a $200 secondhand machine (i7-6700 / GTX 1080 / 32 GB RAM) using llama.cpp (the TurboQuant/RotorQuant KV cache quantisation allows 128k context within the 8 GB VRAM). Results (Q4\K\M models, 128k ...

u/Client_Hello (+27)Your tests are all with small context, usually under 2000 total tokens. While you reserved 128k, you didn't actually use it. Reserving the larger context reserved VRAM, causing more layers to offload ...
u/OldEffective9726 (+12)That's a steal for only $200!!!
u/rm_rf_all_files (+7)32gb ram machine, crazy cheap
View full discussion on r/LocalLLaMA

Hey guys, A couple of weeks ago, I asked this sub for the hardest Vision use cases you were dealing with to test the newly dropped Qwen 3.6 against Gemma 4. I finally finished running the gauntlet side-by-side locally on vLLM (FP8 quants) using my cu...

u/pedronasser_ (+39)It may be the backend/harness influence, but I have the opposite of your findings. Qwen3.6 follows instructions better than Gemma4 for me. And I don't care at all about the visual capabilities.
u/LetsGoBrandon4256 (+33)> side-by-side > noticeably better on simple prompts—it stops earlier > 0–1000 Quite impressive that you used all three variants of dashes in one post.
u/chimpera (+28)my sense is that gemma is much better at short one shot, but that because of it architecture it struggles with long context. There is something about its attention mechanism and its also far more sens...
View full discussion on r/LocalLLaMA
u/Temporary-Mix8022 2026-04-19 Discussion Apple SiliconQuantization & Backends

Going to flag this up front - I know that there are some properly smart people on this sub, please can you correct my noob user errors or misunderstandings and educate my ass. Model: google/gemma-4-26b-a4b Versions: * MLX:

u/EvolvingSoftware (+52)GGUF has come a long way recently.
u/cm8t (+50)GGUF/llama.cpp has really caught up to MLX over the past few months by leaning into Metal.
u/Temporary-Mix8022 (+41)That my friend, is the downside of writing my own posts.. yeah, user error over here. Edited it. I was running both on the same model. It was just when googling for those 4-ish bit quants to share the...
View full discussion on r/LocalLLaMA
u/Gazorpazorp1 2026-04-23 Discussion High-end GPU (24+ GB)Quantization & Backends

I ran a pretty simple but revealing local-LLM test. At first I was only going to post about the two Qwens and Gemma4 and go to bed, and what do you know, I go on reddit and see a post that Qwen 3.6-27B dropped. Oh well... Models tested: Gemma4 `cyank...

u/LocoMod (+149)>Context: I’m working on fairly complex tool that takes noisy evidence and turns it into a structured “truth report.” Your harness is irrelevant here as your entire post doesnt read like someone wh...
u/drwebb (+18)I'm pretty sure any serious dev could easily get work done with either qwen 3.5, qwen 3.6, and Gemma-4 31. Like you say, one might be better than the other, but just creating some artificial task made...
u/Southern_Sun_2106 (+16)"I am building a Mystery Tool, and this is how the models did - according to me and to chatgpt" - ok, thanks for sharing, but this tells close to nothing about anything.
View full discussion on r/LocalLLaMA
u/ihatebeinganonymous 2026-05-03 Question | Help High-end GPU (24+ GB)Quantization & Backends

Hi. It is quite a consensus that the "jump" in quality of agentic development happened sometime in December 2025, transforming from "nice to have", to actually performing. It was also long discussed that open source models lag the state of the art by...

u/HumanDrone8721 (+73)Yes, but the hardware requirements for local stuff are getting heavier and the prices still remain high and even increasing. I struggle to run MiniMax2.7 at a reasonable quantization level to give me ...
u/DeepOrangeSky (+40)What I'm more curious about is how the gap between the ~30b models that normal people can actually run on their home setups compare to the SOTA models now compared to the same type of comparison from ...
u/Kahvana (+30)As with everything, it depends. Qwen3.6-35B-A3B is slightly better than Claude Haiku 4.5, released roughly half a year ago. Gemma4-31B can be there with the frontier for translations depending on the ...
View full discussion on r/LocalLLaMA
u/havenoammo 2026-05-13 Resources High-end GPU (24+ GB)Quantization & Backends

This is follow up from previous post: There have been many improvements to the MTP pull request and the llama.cpp main branch, such as image support and various bug fixes. I recently made a new build for my local machine, but keeping guides up to dat...

u/metmelo (+17)The hero we don't deserve.
u/grumd (+16)Thanks havenoammo! You've done a lot to push MTP with Qwen recently! I'd recommend adding --min-p 0.0 to the command, default is 0.1
u/Prudence-0 (+8)Il grand merci, j'ai gagné +34% de perf sur ma RTX 3090
View full discussion on r/LocalLLaMA

You can play them here: This started out as a simple test for Qwen3 Coder Next vs Qwen3.5 4B because they have similar benchmark numbers and then I just kept trying other models and decided I might as well share it even if I'm not that happy with how...

u/AngeloKappos (+14)qwen3 coder next losing to the 4b at actual game logic is the most demoralizing benchmark result i've seen this week, playwright mcp doing the heavy lifting probably explains a lot of the variance her...
u/libregrape (+11)Crazy how 35B and 26B moes with just 4-3B active totally annihilated 122B, and even dense 27B.
u/yami_no_ko (+9)>LLMs are advancing so quickly! Imagine this just a few years ago in 2023. We'd have shat our pants seeing how far local LLMs advanced. Even a 4b Model feels better than what we had just 3 years ag...
View full discussion on r/LocalLLaMA
u/LayerHot 2026-05-12 Tutorial | Guide High-end GPU (24+ GB)Quantization & Backends

Benchmarked Gemma 4 MTP and z-lab's DFlash on a single H100 80GB using vLLM and NVIDIA's SPEED-Bench qualitative dataset. # Setup: Hardware: 1x H100 80GB Runtime: vLLM Dataset: SPEED-Bench qualitative …

u/danish334 (+12)Nice. I too noticed the acceptance rate of dflash wasn't as good as mtp but zlab do mention lossless inference. You should benchmark their claim.
u/MrLlamaGnome (+9)Nice writeup! As a GTX 1050 3GB potato enjoyer, I wonder how this comparison would change on more constrained hardware... Is one method more compute or I/O dependent in practice than the other?
u/coder543 (+8)DFlash uses diffusion to generate a larger batch of tokens. MTP is autoregressive, one token at a time, and it usually isn't worth generating more than 2 or 3 tokens with MTP. For the same cost, you c...
View full discussion on r/LocalLLaMA
u/ayylmaonade 2026-05-12 Question | Help Quantization & Backends

So, as most of us here are, I'm a llama.cpp loyalist. Easy to understand, great configuration, relatively stable, etc. But I’ve been increasingly tempted by vLLM, especially since AMD just added it as a built-in inference engine to Lemonade, and I ha...

u/ttkciar (+69)If you do any batched inference at all, vLLM is going to scale better and more easily, since it allocates VRAM per-batch as needed as context grows and as more concurrent queries arrive, while llama.c...
u/Farmadupe (+26)It's totally ironic that llama.cpp is built for people without VRAM but (practically) needs to pre-allocate all possible KV VRAM at launch time. If llama.cpp ever gets a mature paging/preempting KV ca...
u/Klutzy-Snow8016 (+20)It usually gets new models before llama.cpp. The same goes for features. For example, MTP is already supported for Qwen3.6 and Gemma 4. vLLM can be much faster if you can use tensor parallelism. It do...
View full discussion on r/LocalLLaMA
u/GodComplecs 2026-04-19 Question | Help Quantization & Backends

Im using these settings in llama.cpp: --spec-type ngram-map-k --spec-ngram-size-n 24 --draft-min 12 --draft-max 48 Whats the real reason for lets say the prompt is for "minor changes in code", whats differing between models: Gemma 4 31b: Doubles in t...

u/Fresh_Finance9065 (+24)Speculative decoding works for simple questions but doesn't really speed up difficult questions where the small and big model would give different answers Edit: Idk how I got upvoted so high with a wr...
u/DinoAmino (+14)Meanwhile, OP isn't using a draft model at all. Using ngrams here.
u/Sadman782 (+13)It's not black magic, basically it is just search based speculative decoding. It means it only actually works for coding or where the model repeatedly answers the same thing with a little change. Let'...
View full discussion on r/LocalLLaMA

UPDATE: Vulkan benches arew now included. And yes, I used AI to help me write this post. As a life-long Windows user (don't hate me, I was exposed to it at a young age) I was wondering how much (if any) performance I'm leaving on the table. So I did ...

u/ambient_temp_xeno (+45)It's not so much that there's an inherent issue with windows (necessarily anyway), it's that the cuda dev guy doesn't care about the windows performance. The difference used to be a lot bigger on my m...
u/mstahh (+23)It's funny that the world rests on the "cuda dev guy"s shoulders
u/simracerman (+13)Who’s he? Can’t we buy him coffee/beer in bulk?
View full discussion on r/LocalLLaMA
u/evoura 2026-04-20 Other Apple SiliconQuantization & Backends

There are plenty of "bro trust me, this model is better for coding" discussions out there. I wanted to replace the vibes with actual data: which model writes correct code and how fast does it run on real hardware, tested under identical conditions so...

u/ttkciar (+46)Regarding the low Gemma 4 scores: This might be hitting the Gemma 4 tool-calling problem, where inference stops prematurely just before a tool-call. Both Google and llama.cpp have issued bug-fixes for...
u/UnifiedFlow (+30)Its pretty annoying how everyone thinks they know how to run testing. I come from a nuclear reactor testing background and I cringe at the test setups people are using and then confidently reporting r...
u/ambient_temp_xeno (+24)I can't help but think you're doing science wrong.
View full discussion on r/LocalLLaMA

Provided in Safetensors, GGUFs and NVFP4 formats. Safetensors: llmfan46/gemma-4-Ortenzya-The-Creative-Wordsmith-31B-it-uncensored-heretic: GGUFs: lmfan46/gemma-4-Ortenzya-The-Creative-Wordsmith-31B-it…

u/HonestoJago (+18)Let me start by saying I'll definitely try it, but what are you people doing to get refusals from the base model? I know it's part of the tests/scripts you run, but in actual practice, are you getting...
u/LLMFan46 (+16)>Because it seems uncensored asf to me. Vanilla. There's plenty, sexy creative writings, RolePlaying just ask the people over at r/SillyTavernAI definitly need an uncensored model for whatever they...
u/LLMFan46 (+10)Also I explain on the MiniMax-M2.7 thread about softening, let me copy and paste it here: >What could happen is something called "softening", meaning it's not a hard refusal but instead the languag...
View full discussion on r/LocalLLaMA
u/crowtain 2026-05-15 Discussion High-end GPU (24+ GB)Quantization & Backends

Hello Guys, I know everyone has his definition of local models, but for me i see 2 "reasonable" type of frontier local models. a dense one that barely fit in a 32GB ou 24GB of gpu for the most "reasonable" GPU wealthy guys and a MOE in the 100B param...

u/TokenRingAI (+93)You are making a lot of assumptions about the car I drive
u/Ledeste (+40)Do not want to make you sad, but as someone with both 192GB of ram (is not the new max 256 since DDR5?) and a 5090, I'm only using ram to test the new models, but will avoid getting out of vram as muc...
u/LizardViceroy (+36)I have 512GB worth of 128GB devices and I've been feeling worse about my choices since Qwen3.6 27B and Gemma4 dropped... In the GPT-OSS-120b days we looked like the smart ones. These things come and g...
View full discussion on r/LocalLLaMA
u/HyPyke 2026-04-24 Question | Help Apple Silicon

EDIT: OKOKOK. Blackwell all the way. NEW, at MC or NewEgg or where ever and more tokens than my face can handle. Thanks guys. I was close to pulling that Apple.com trigger. You saved me. EDIT AGAIN: I think it's the max-q for me. Central Computers ha...

u/Look_0ver_There (+133)MicroCenter is selling the RTX 6000 Pros with 96GB for $9300 brand new. Why are you considering buying a second hand one for $10,000?
u/btdeviant (+83)2x blackwell owner here, I get my money from my wifes boyfriend
u/different_tom (+71)Where are you ppl getting the money for this shit
View full discussion on r/LocalLLaMA
u/cant-find-user-name 2026-04-24 Discussion Quantization & Backends

We have a chat system which we use haiku for because it is mostly about tool calling and summarisation of them. But we have many tools with pretty complex input schemas, and stuff like gemma didn't cut it, so we went with haiku. Haiku is pretty good....

u/TheRealMasonMac (+31)GLM-5.1 being cheaper than Haiku is still hilarious to me.
u/SnooPaintings8639 (+26)I would be very disappointed with D4 Flash if it wasnt MUCH better than haiku. I don't know how, but I have quite high expectations for this model
u/Caffdy (+16)>But we have many tools with pretty complex input schemas can you gives an example, because the jump from Haiku (which probably is in the same range as Qwen 27/35B or Gemma4 in terms of size) to D4...
View full discussion on r/LocalLLaMA
u/tovidagaming 2026-04-23 Discussion High-end GPU (24+ GB)Quantization & Backends

Just sharing the results from experimenting with the B70 on my setup.... These results compare three `llama.cpp` execution paths on the same machine: RTX 3090 (Vulkan) on NixOS host, using main llama.cpp repo (compiled on 4/21/2026) Arc Pro B70 (Vulk...

u/PassengerPigeon343 (+22)3090 still the top value play, incredible
u/AbsoluteHedonn (+20)NixOS mentioned
u/RemarkableGuidance44 (+8)Intel Drivers are still new and they are updating them weekly. I got 4 x B70's they are great for larger models a bit slower of course but software is still new. Intel are also now going for the AI Da...
View full discussion on r/LocalLLaMA
u/Striking-Swim6702 2026-04-18 Resources Apple SiliconLaptopsQuantization & Backends

I benchmarked Qwen 3.6, Qwen 3.5, and 5 other models across 5 agent frameworks on Apple Silicon — here's the full compatibility matrix Hardware: Apple M3 Ultra, 256GB unified memory Frameworks tested: Hermes Agent (64K stars), PydanticAI, LangChain, ...

u/Evening_Ad6637 (+8)The whole thing is kind of… weird… or a little chaotic. I mean, the results are very interesting, and you can clearly see a lot of effort went into this work, so of course, kudos to @OP It’s just a bi...
u/mr_Owner (+7)Unfair comparisons of mixed quants tbh
u/Only-Fisherman5788 (+6)compatibility matrix is useful for "will this even run" but doesn't answer the production question: which of these framework+model combos actually produces correct outputs on the agentic task you care...
View full discussion on r/LocalLLaMA
u/EntertainmentBroad43 2026-04-29 Resources Quantization & Backends

TLDR: tool parameters using the common JSON Schema pattern \`anyOf: [$ref, null]\` are rendered into the prompt as empty \`type\` fields. This strips the useful schema information before the model sees it. \-- Long, rambling version: Gemma 4 was havi...

u/onil_gova (+8)is this why gemma 4 is struggling to make proper tool calls unlike qwen3.6?
u/EntertainmentBroad43 (+6)yeah i saw this post when I was looking for people with similar incidences; if the tools that this person made includes composition, references, unions, or constraints that are not expressed as a dire...
u/EntertainmentBroad43 (+5)I've given some more thought into this. This is my conclusion thus far. Gemma 4 supports tool calling, but its Jinja/template protocol is not a faithful JSON schema renderer. It projects tools into a ...
View full discussion on r/LocalLLaMA
u/jacobpederson 2026-05-10 Generation General

I wrote up this little python app to cycle through a bunch of prompts like this: |Single HTML file using three.js from CDN. A central rotating MeshNormalMaterial torus knot. Place a bright Sprite (AdditiveBlending, soft circular canvas texture) at a ...

u/ambient_temp_xeno (+13)I wonder if gemma knows these effects so well because of youtube videos of them.
u/DifficultDog8435 (+8)That’s sick honestly. I like the idea of treating the prompts like a little generative demo reel instead of manually cherry-picking one good output.
u/itsappleseason (+7)of course not.
View full discussion on r/LocalLLaMA
u/abkibaarnsit 2026-04-28 New Model Apple SiliconQuantization & Backends

Link post. The discussion is mostly in the comments. Target:

u/pmttyji (+17)Good to see one more 30B range MOE model. That seems honest benchmark(Check the graph)
u/abkibaarnsit (+13)XS.2 is currently out. 33BA3B ( M.1 still not open. 225BA23B Free on OpenRouter M.1 , XS.2
u/coder543 (+11)Always cool to see more open weight models!
View full discussion on r/LocalLLaMA

This isn't an advertisement, and it's very much local and open - I already don't have enough time to keep up with the existing pull requests and issues... just a fond look back on how much this space has grown and matured in the past year. Shit was t...

u/taylorwilsdon (+9)Do share! I find at this point the main thing I need in my chat ui is email, calendar and todo list - I've honestly stopped using context7 and its ilk for code because most of what I'm working on thes...
u/srigi (+6)MCPs are not dead. They just seems like, because they’re at the bottom pit of the hype curve. But they’re essential in some workflows where you cannot use skills or native tool calls. For example the ...
u/PreferenceAsleep8093 (+5)Would you be able to add Google Search Console to this, or you just purely focused on the core productivity suite?
View full discussion on r/LocalLLaMA
u/jacek2023 2026-04-27 Tutorial | Guide Quantization & Backends

Tutorial from the Google guy, I use very similar setup (llama.cpp instead of lmstudio)

u/valdev (+23)Gemma 4 was the best, for about 10 minutes after its release. Still the best at visual understanding But Qwen 3.6 27b is... literally a few generations beyond it. Frankly it shocks me how good it is.
u/geldonyetich (+11)Maybe if you're looking at Gemma4:26b. 31b is still doing as well as Qwen 3.6 and even better in some applications. Honestly there's so much Qwen hype here I think Alibaba has hired an influencing fir...
u/o0genesis0o (+5)Funny how something that should feel like second nature to people in this sub (downloading lmstudio and model, setup context, attach to cli agent) is still treated like expert tutorial for general aud...
View full discussion on r/LocalLLaMA
u/Tryshea 2026-04-18 Discussion Quantization & Backends

I had the idea of splitting the cross-entropy difference into two sums (positive and negative; or the PPL into two ratios >1 and <1) while doing PPL evals of uncensored GGUFs. The inspiration came from looking at the area under the PPL ratio co...

u/WhoRoger (+8)Lol I love this but I have no idea what most of that means lol. Td;dr? (Too dumb; didn't read)
u/Embarrassed_Soup_279 (+5)hauhaucs models "felt" better compared to other uncensored models despite the degradation shown in the graphs... ive not tested the gemma models but with qwen3.5 it was pretty good. still don't know h...
u/Tryshea (+4)Thanks, An uncensored model should succeed at predicting the text more often than the base model (X axis) but never less often (Y axis). That's because we should just be "unlocking knowledge" without ...
View full discussion on r/LocalLLaMA
u/Perfect-Flounder7856 2026-04-27 Discussion Apple SiliconLaptopsQuantization & Backends

Running my own models. I was having some trouble getting vLLM going so dropped down to LM Studio which I've used on my 24GB MacBook Air. I now have LM Link across both laptops into the AI Workstation RTX Pro 6000 Blackwell. And my phone on LM Mini. I...

u/jacek2023 (+24)Good luck. These old models are not best. Explore Huggingface to have some fun
u/Guilty_Rooster_6708 (+13)Try out the gemma4 models!
u/rm-rf-rm (+12)Please see the Best LLMs megathread linked in the sidebar (and stickied currently)
View full discussion on r/LocalLLaMA
u/ggonavyy 2026-04-29 Resources Apple SiliconMid-range GPU (8-16 GB)Quantization & Backends

And somehow we already got some GGUFs for it!

u/Bulky-Priority6824 (+13)nvfp4 speaks the gpus native language. The blackwell tensor cores have FP4 math built directly into the silicon so the model weights go in as is and the multiplication happens without any translation ...
u/ggonavyy (+13)You have to use it with nvfp4 ggufs, q4 is still using regular mmq and mma
u/ggonavyy (+9)Yes, this PR just makes 50 series users with nvfp4 models prefill a LOT faster
View full discussion on r/LocalLLaMA
u/Shoddy-Tutor9563 2026-05-05 Resources General

I was thinking, that some folks in this community will be interested to see what current options are on local deep research field. So I spent some time to collect everything I could find together. Enjoy. TLDR: the most healthiest and local-friendly p...

u/Shoddy-Tutor9563 (+5)I cannot rate them yet. I was composing a list of projects to try. So these two are my top candidates :) will be able to address your question in a couple of days when I run them both side by side
u/DeltaSqueezer (+3)How would you rate GPT Researcher vs LDR? Do either support a big model for planning and synthesis but a smaller faster model for retrieval and exploration?
u/MustBeSomethingThere (+3)Answer to OP's challenge. I used my own agent harness with Gemma 4 26B. I had to add clarifications for "best" (number of contributors) and for "recent" (last 6 months). Dates and numbers…
View full discussion on r/LocalLLaMA
u/Hydroskeletal 2026-05-08 Discussion Apple Silicon

So I was very excited about the MTP stuff especially since Gemma4 has become my "daily driver" for some stuff. I grabbed the latest mlx-vlm and did some tests and found it disappointing. | Workload | MTP off | MTP on | Result | Draft accept rate | |-...

u/coder543 (+46)It’s not just a matter of acceptance rate, it’s a matter of having computation to burn (which Macs famously don’t have much to spare before the M5 series added Neural Accelerators), and gains are hard...
u/Anbeeld (+13)Which is why adaptive draft is a must.
u/Hydroskeletal (+11)Personally I am not using for coding or AI written slop. I find it much more interesting to use local LLMs in programs.
View full discussion on r/LocalLLaMA
u/PaceZealousideal6091 2026-05-15 Discussion Apple Silicon

Everyone has been taking about Luce DFlash and PFlash. I just came across their megakernal and it seems it was released along with Dflash and PFlash. It seems it's giving them 1.8x greater speed with much more power efficiency on nvidia gpu comparabl...

u/Ok-Measurement-1575 (+74)I was very excited until I read this: Single model, single architecture. The kernel is hand-written for Qwen 3.5-0.8B's specific layer pattern (18 DeltaNet + 6 Attention). It does not generalize to ot...
u/dinerburgeryum (+24)The post goes onto say “Megakernel fusion benefits shrink as model size grows and compute begins to dominate over launch overhead.” Sounds like diminishing returns.
u/JumpyAbies (+22)I think that if, for example, it has support for qwen3.6-27b or gemma-4, it becomes a very attractive option for those who use those models. It would be a solution focused on a smaller scope of models...
View full discussion on r/LocalLLaMA
u/pmttyji 2026-04-29 New Model General

Ling-2.6-1T: A Trillion-Parameter Comprehensive Flagship Model for Complex Tasks Today, we are thrilled to open-source Ling–2.6–1T from the Ling family. Tailored for real–world, complex scenarios, this trillion–parameter model introduces targeted opt...

u/Hodler-mane (+20)its fockin raining models!
u/unbannedfornothing (+11)Damn, do they know any other numbers than 1 trillion?
u/pmttyji (+11)A day ago, they released 100B sized model Last year, they released 17B sized model called Ling-Mini. Unfortunately they skipped it during last version & also now I think
View full discussion on r/LocalLLaMA
u/Terminator857 2026-05-17 News Laptops

Quote: ...new optimizations for Ryzen AI Max 300 "Strix Halo" and the ROCprof Trace Decoder is now open-source...<snip>... Those rolling from source can grab the ROCm 7.13 Tech Preview via TheRock on GitHub.

u/Terminator857 (+33)ROCm platform is generally preferred over pure Vulkan for better compatibility with frameworks like PyTorch and TensorFlow, for the 3 of us that experiment with training.
u/pmttyji (+15)>Expanded AMD GPU support ROCm 7.13.0 adds support for the following AMD GPUs and APUs: AMD Instinct MI350P (gfx950) AMD Radeon PRO W6800 (gfx1030) AMD Radeon PRO V620 (gfx1030) AMD Ryzen AI 7 PRO ...
u/JamesEvoAI (+14)The PP is better on ROCm for most models, and the TG only takes a minor hit in exchange. Depending on the use case that is a valuable tradeoff
View full discussion on r/LocalLLaMA
u/boutell 2026-05-17 News Apple SiliconQuantization & Backends

I am not the author. My two cents: I'm not suggesting we don't all know local AI is expensive, at least for now. The math gets interesting if OpenRouter providers are burning investor cash and it runs out, or we take into account hardware we use for ...

u/opezdol (+72)You can still sell your mac after 3-5y.
u/Ok_Technology_5962 (+53)I see that the analysis is a bit wrong. It doesnt take into account agentic tasks. When you run an agent the bottleneck is not output speed but how many back and forth toolcalls you do, thus reusing t...
u/Fit-Produce420 (+41)Also providers will sub in quantized models if they think they can.
View full discussion on r/LocalLLaMA
u/gvij 2026-04-25 Discussion High-end GPU (24+ GB)Quantization & Backends

I wanted to figure out which of the newer small and mid-size models are actually worth running on a single H100, so I put 8 of them through a proper vLLM benchmark and recorded what came out. The setup was simple. One H100 80GB, vLLM 0.19.1, the buil...

u/FatheredPuma81 (+7)>MoE benefits so much more because those models are bottlenecked on moving expert weights through memory, and FP8 cuts that traffic in half. Interesting if you're on consumer hardware running quant...
u/gvij (+3)Right, the bottleneck depends on which phase and which hardware. On H100, decode at low to moderate batch size is is bandwidth-bound (it is on basically any modern GPU because each weight gets reused ...
u/Thagor (+1)For concurrency did you use continuous batching?
View full discussion on r/LocalLLaMA
u/Non-Technical 2026-04-30 Question | Help General

Qwen3.5-122B-A10B at Q6_K is really good. Do you think we will see a larger MoE Gemma-4 or Qwen3.6 at some point?

u/billy_booboo (+45)Yeah, I think Qwen3.6 122B would be an extreme sweet spot for me in terms of not relying on claude as much
u/ttkciar (+34)I think a Qwen3.6-122B-A10B release is likely, and am a bit surprised they haven't released it already. Google teased us with a 120B during their beta-testing, but I don't know that we will ever see i...
u/onil_gova (+31)
View full discussion on r/LocalLLaMA

[UPDATE - April 2026] Several people asked about missing models (Qwen 3.5, Gemma 4, the SillyTavern finetune series) and raised valid questions about the methodology. I ran an expanded 37-model sweep with a 5-judge ensemble and documented the selecti...

u/jwpbe (+60)Wisdom Check DC 15: Identify Slop I roll with advantage because the post contains "Elara", you used LLM-as-a-judge, and didn't use any roleplay / drummer finetunes
u/an0nym0usgamer (+56)Using an LLM as a judge for fiction/writing quality is honestly just the funniest thing to me. Like, I don't know how someone can actually set that up and actually take the results seriously.
u/cr0wburn (+13)Where are Gemma 4, Qwen 3.5, and Qwen 3.6 they are all really good for their size.
View full discussion on r/LocalLLaMA
u/facethef 2026-05-04 New Model General

Talkie-1930-13b-it and Gemma 4 31b in the same chat. Talkie is a 13B vintage language model from 1930. Hosted version if you can't run them both locally

u/Eyelbee (+28)Check this one out
u/facethef (+10)Big Chungus is definitely Chinese if you ask Talkie
u/TheRealMasonMac (+9)Transcribed for accessibility: USER: Who is Big Chungus? ASSISTANT: Big Chungus is the name given to a Chinese giant, said to have been 120 feet high, whose bones were discovered in 1661, in…
View full discussion on r/LocalLLaMA
u/Beamsters 2026-05-12 Resources Apple SiliconQuantization & Backends

According to this. I run several more tests to cover more models and quants. [Qwen3.6 35B-A3B MLX oQ4. 2 extra pawns. (oMLX - local)](

u/dampflokfreund (+11)Gemma 4 26b q4\K\L by Bartowski does a very good job here, especially for its size:
u/nixudos (+10)Thanks! I really like these tests as a supplement to agentic coding tests. I have been wondering if the Qwen 27b Q4K_M is a better choice than a Qwen3.6 27B/35B-A3B at Q6? I can run both of them at a ...
u/Charming-Author4877 (+6)I am not sure what this is really testing, multiple things and some of them have been heavily trained (chess is certainly a training benchmark) but I like it. The big flaw here is that this is a singl...
View full discussion on r/LocalLLaMA
u/Zyj 2026-05-14 News General

Confidence is persuasive. In AI systems, it is often misleading. Today's most capable reasoning models share a trait with the loudest voice in the room: They deliver every answer with the same unshakable certainty, whether they're right or guessing. ...

u/_wsgeorge (+13)This paper came out last year. Have any major models (open, proprietary, frontier etc) tried this technique?
u/TheRealMasonMac (+10)Dunno if they use this specific technique, but Gemma 4 31B is pretty good about it.
u/oxygen_addiction (+9)Anthropic obviously did something like this for Opus 4.5+ That was the first model that didn't do the "You are absolutely right".
View full discussion on r/LocalLLaMA

I have run two tests on each LLM with OpenCode to check their basic readiness and convenience: \- Create IndexNow CLI in Golang (Easy Task) and \- Create Migration Map for a website following SiteStructure Strategy. (Complex Task) Tested Qwen 3.5, &a...

u/pulse77 (+15)For complex coding tasks where precision matters the 3-bit quantization is "gambling"...
u/InternationalNebula7 (+13)Please run with Qwen3.6:27B when unsloth releases the quants. Look forward to seeing the results!
u/Designer_Reaction551 (+7)Qwen3 Coder Next beating the larger 3.5/3.6 general models tracks with what I've seen on agentic coding tasks on similar hardware. Task-tuned models hold structure better at 25-50k context than bigger...
View full discussion on r/LocalLLaMA

Provided in both Safetensors and GGUFs. Safetensors: llmfan46/Gemma-4-Gembrain-31B-it-uncensored-heretic: GGUFs: llmfan46/Gemma-4-Gembrain-31B-it-uncensored-heretic-GGUF:

u/pigeon57434 (+33)i dont ever trust these weird merged models
u/eidrag (+20)I don't see the point of it, when the merge already done with other finetune, which itself already done with Gemma-4-31B-it-heretic-ara
u/Hydroskeletal (+16)"Boost Lateral Thinking" -> How? The only time I've seen this kind of claim was the model trained on 4chan
View full discussion on r/LocalLLaMA
u/Jorlen 2026-05-17 Discussion Quantization & Backends

I'd love to hear from developers who use big context windows if they notice a difference? Obviously I would love to cut the KV cache VRAM requirement in half, but I'm worried about quality especially when we enter into 50k+ context territory. I don't...

u/Stepfunction (+44)The quality loss at Q4 is pretty severe. I'd recommend the Q5_1 option instead, which was introduced relatively recently. Q8 for K and Q4 for V is another option.
u/hurdurdur7 (+23)Model q6 and up, context cache fp16
u/diffore (+21)Lesser quant == more tool call errors. So it depends on harness and model, how good both of them at error recovering. If I can - I don't quantize cache.
View full discussion on r/LocalLLaMA

Link post. The discussion is mostly in the comments. Target:

u/VoiceApprehensive893 (+15)basically edge gallery is legit usable now
u/Quantum_Pigeon (+14)I love how when you install the app you're immediately forced to agree to Google collecting data from the app. Doesn't this defeat the entire purpose of using local models?
u/Chupa-Skrull (+12)The purpose of edge-runnable models, from a business perspective, is to consume the user's hardware, battery cycles, etc. in order to save companies money serving inference for low-intelligence worklo...
View full discussion on r/LocalLLaMA
u/StudentDifficult8240 2026-04-21 Discussion Apple SiliconQuantization & Backends

I gave 9 local models the same flight combat sim prompt. The results broke a few of my assumptions about quant providers and parameter count. *All 8-bit MLX, M3 Max 128GB, served via omlx, prompted through Claude Code. Same prompt every time — single...

u/Alternative_You3585 (+6)Thanks, I see more and more confirmation that opus distillation seems to be effective. I'll always download such models if available now
u/Long_comment_san (+5)"Qwen3.6 is a real step up over 3.5. The .1 increment packs more than usual" That's what I fucking said in another post. It should have been called 4.0. 3.6 is stupid and confusing. I don't know what ...
u/BingpotStudio (+4)I don’t consider adding features you don’t ask for a positive. This is scope creep and it leads to a model that isn’t doing as it’s told and builds features your program wasn’t designed to handle else...
View full discussion on r/LocalLLaMA

Pocket LLM v1.5.0🚀 New in this release: \- 🎙️ Voice input \- 🖼️ Image input with OCR, Gemma vision, and FastVLM support \- 📷 Camera capture with retake, crop, and photo review \- 🗂️ Previous chats side panel \- 💾 Downloaded model deletion to save sto...

u/100daggers_ (+17)Thanks for the honest part of the feedback. I agree the chat deletion icon can be improved, and I’ll clean that up in the next UI iteration. But calling it “AI-generated slop” is unfair. I’ve been wor...
u/MalabaristaEnFuego (+6)What percentage of code does a person need to refactor before the code no longer becomes AI slop?
u/KaroYadgar (+6)slop of Theseus jokes aside, I was mainly referring to how the UI reminded me of AI-generated frontends back in the day.
View full discussion on r/LocalLLaMA
u/yes_i_tried_google 2026-05-07 Tutorial | Guide High-end GPU (24+ GB)Quantization & Backends

I was asked for this guide, so here it is. Some overlap with someone else’s post from yesterday. YMMV! Too busy with work to write myself, so I asked Opus to write for me (I have validated the content!). I’m sure there will be debate over using q4 bl...

u/am17an (+21)Just use my fork lol, all the fixes are going to land there
u/yes_i_tried_google (+4)Same limitations I’m afraid. I’ve not modified anything in the actual PRs beyond the bare essentials.
u/GCoderDCoder (+3)This is great thanks! Speculative decoding works on other quants too if anyone complains about q4. I'm wrestling with some things right now in speculative decoding with gemma 4 31b where formatting se...
View full discussion on r/LocalLLaMA
u/wombweed 2026-05-02 Discussion Apple SiliconHigh-end GPU (24+ GB)Quantization & Backends

I run Qwen-3.6 27B with the FP8 safetensors on vllm for long-horizon agentic coding harness workloads with high context window and concurrent sub-agents. On two 3090s that aren’t used for anything else, it seems reasonable to expect a good balance be...

u/Gesha24 (+84)I am convinced majority of the people are not running local AI for any kind of serious work, it's mostly for fun. So accuracy is irrelevant for them. Once you realize accuracy matters, you have to set...
u/ilintar (+32)On llama.cpp Qwen3.6 Q8 KV quant is almost lossless, as shown by multiple benchmarks (Gemma 4, by comparison, due to its iSWA architecture, is apparently much more sensitive to KV cache quantization).
u/ikkiho (+27)A few things in this thread are getting blurred together and I think that explains the conflicting results. First, fp8 in vllm and Q8 in current llama.cpp are not the same operation. fp8 (E4M3) is a p...
View full discussion on r/LocalLLaMA

Been running Gemma 4 E2B locally on my OnePlus CE 5 (8GB RAM) for a few months. Chat quality is fine for the size. What surprised me was JSON output. Short input, give it a structured prompt, you get clean parse able JSON back. Way better than I expe...

u/wbulot (+8)I did something a bit different. I used Qwen 3.6 27B to code an Android keyboard tailored for me. I integrated NVIDIA’s Parakeet voice model into it, which runs directly on the phone. It then sends th...
u/mhl47 (+7)Sounds great. Did you try to use the model directly for voice input instead of adding whisper?
u/Effective-Drawer9152 (+6)Tried it early on, couldn't get it working cleanly so I went with Whisper. I am goona try again and see how it goes.
View full discussion on r/LocalLLaMA
u/boutell 2026-04-27 Question | Help LaptopsQuantization & Backends

I'd jump on runpod and ssh in to test my workloads, but they don't have it. Would love to know how well this runs, particularly as context approaches a full 256K. Thanks!

u/Willing-Toe1942 (+38)I didn't like the performance comparing it to 35B MoE for the desnse 27B numbers are: 10 - 11 T/S for generation near 300 for prompt processing strixhalo laptop tdp maximum to 80watt
u/tecneeq (+19)The results are great, but it's slow as heck. I use 35b-a3b for that reason.
u/Sixstringsickness (+15)Very slow depending on quant, 7-8tps. Not convinced the improvement in intelligence is worth the trade off in performance over 35b 3a. Gemma4 using draft model and spec decoding is 13-20tps using q4 k...
View full discussion on r/LocalLLaMA
u/Conscious_Nobody9571 2026-05-11 Question | Help Apple SiliconQuantization & Backends

Around 3B please thank you

u/ML-Future (+49)Qwen 3.5 4b or Gemma 4 2b has best benchmarks results.
u/SendMeGapePics (+42)Qwen3.5 uses a lot more tokens per correct answer, and has been overfitted on a bunch of test-sets. Gemma 4 is the go-to
u/wesmo1 (+34)At such a small parameter size it's important you experiment for your specific use case and learn the limitations of such a small parameter size. Look into Gemma 4 e2b, smollm3, granite 4.1, nanbiege ...
View full discussion on r/LocalLLaMA
u/yeah-ok 2026-05-06 News Quantization & Backends

Absolutely unbelievably exciting work, split attention (i.e. a couple of GB) onto local machine and the weights onto another local machine (say a cheap Xeon) to basically bypass the scale issue with local LLMs completely!! Repo with functional code: ...

u/retireb435 (+29)Inside the github it shows the method is running 23 times slower. I don’t see any improvement comparing to our nowadays offloading method? Seems like a clickbait
u/TokenRingAI (+26)So he figured out slow inference across a network? Cool
u/jacek2023 (+14)how it is different than RPC?
View full discussion on r/LocalLLaMA
u/siegevjorn 2026-05-18 Discussion High-end GPU (24+ GB)Mid-range GPU (8-16 GB)Quantization & Backends

Just wanted to share that I'm pretty happy about Qwen 35b a3b agentic coding performance. I'm running the model in q80 quant, kv cache both q8_0 as well, with 262144 in 4090 + 5060 ti, via llama.cpp backend with claude code pointing to localhost. For...

u/NotARedditUser3 (+27)I daily drive it in a pretty large codebase, I'm thoroughly impressed. Have moved off of cursor and using 35b exclusively.
u/Fast-Satisfaction482 (+13)In my tests, it performed vastly better and even faster when using fp16 for kv cache, I use the model in q4 quant on two 4090 with full 260k context completely in VRAM. And with ngram speculation, tha...
u/Daniel_H212 (+11)Agreed, OP should drop down to say, Q6KXL or something, and keep cache unquantified for coding. It will be faster too.
View full discussion on r/LocalLLaMA
u/jacek2023 2026-04-20 Generation Quantization & Backends

I was testing OpenCode and Roo Code with Gemma 26B on llama.cpp yesterday for about 10 hours. I was able to make progress on my project, both solutions work. But: OpenCode is kind of fucked up at the moment, because of that there is often long prompt...

u/sine120 (+12)If you don't have a supercomputer, use Pi ( The system prompt is a lot smaller, saving you context and PP time. I haven't used it with gemma much yet, but with Qwen3.6 it's been great.
u/Haiku-575 (+7)I've been testing with Qwen3.6 as well. It is, in fact, great. is the same thing if you want a less shitty URL.
u/Ill-Fishing-1451 (+4)Opencode prunes context sometimes, which causes reprocessing the whole cache. This is annoying for llama.cpp backend.
View full discussion on r/LocalLLaMA

So for my project I was using up until now either Gemini 3 / 2.5 Flash or Flash-lite. All my use cases are not agentic, simply LLM workflows for atomic tasks like extracting references from the law, classifying, adjusting titles to nominative case an...

u/ai_without_borders (+9)speculative decoding hits different for constrained outputs. if you are asking the model to extract references or classify into a fixed schema, the draft model can basically predict the next token acc...
u/caetydid (+8)what was your reason to choose bartowski over unsloth? Also, context is very small, did you test to scale it to 64k or above?
u/Parzival_3110 (+8)This is exactly the kind of local LLM win I love. Atomic workflows, tight context, structured output, and suddenly local is not just cheaper, it feels faster to iterate on. Curious if the quality hold...
View full discussion on r/LocalLLaMA
u/EuphoricPenguin22 2026-04-25 Discussion Quantization & Backends

When Qwen3.6-35B-A3B was released a week or so ago, I sort of expected an iterative improvement on the previous Qwen3.5 models. After all, those models were pretty decent as compared with the previous local models I had tried, and Qwen3.5 did well on...

u/EuphoricPenguin22 (+3)That's actually what it does itself: it indexes online materials into a local semantically-tagged copy with the embedding model that is then queried with the same embedding model. It's basically textb...
u/mr_Owner (+1)Hmm what if you point the grounded mcp server to a local folder with docs and manuals containing all the utils needed? That would be really offline and sounds tbh very very interesting of possible. Wh...
u/mr_Owner (+1)Dayum
View full discussion on r/LocalLLaMA
u/letsbefrds 2026-05-17 Discussion Quantization & Backends

Hello, I'm currently using Ollama / lm studio for things like code inference and proof reading emails, etc. Definitely not experienced in this space but looking to grow. It's been working great but it's a bit slow at times. I use Gemma 4 / Qwen, I al...

u/jojotdfb (+50)Llama.cpp is your next step. Spend some time learning the flags and you can fine tune to your heart's content. Llama-server will give you a basic chat web page as well as an openai endpoint.
u/CooperDK (+22)Ollama is probably the slowest tool you could use. LM Studio is really good if it should be easy. The best, not too hard tool is ik_llama.cpp And the winner is vLLM, but it's not that easy to set up.
u/ComplexType568 (+13)LM studio has such a high opportunity to be an amazing piece of software but the fact that you can't use your own custom runtimes or specificy custom launch params per model REALLY drags it down. Alon...
View full discussion on r/LocalLLaMA
u/C_Coffie 2026-05-16 Resources Apple SiliconHigh-end GPU (24+ GB)Mid-range GPU (8-16 GB)LaptopsQuantization & Backends

I kept seeing inference-speed claims for these models and wanting an apples-to-apples comparison on the hardware I actually have. So I built a harness and a public page that dumps every run as YAML. The dataset: 55 runs, three rigs, five backends (ro...

u/fallingdowndizzyvr (+15)> Memory bandwidth runs the show for decode. The RTX 5070 (12 GiB GDDR7, Vulkan) actually beats the RTX 3090 (24 GiB GDDR6X, CUDA) on every model that fits in 12 GiB: Ah.... the 3090 has faster mem...
u/tmvr (+10)I don't know what you did there exactly, but you messed it up. Go run both cards with CUDA and see what the results are. There is no way a 5070 is faster at tg/decode than a 3090.
u/Edenar (+6)Try with MTP too, i now get 20tok/s on 27B(Q8_0) for simple chat on strix halo. It'll also help the 3090. And i finally managed to get sub 5s replies with 3.6 35B A3B since i now get 70+tok/s (to use ...
View full discussion on r/LocalLLaMA
u/opoot_ 2026-05-17 Question | Help Quantization & Backends

I have a build with 2 x MI50 32GBs and 64 gigs of DDR4 (bought before rampocolypse for \~630 USD total, I’m not rich) and I’m not gonna upgrade it for a long while. Are there any good MOE models that are around 60B in parameters so I can make use of ...

u/pmttyji (+34)Q4/Q5/Q6/Q8 of Qwen3.6-35B-A3B, Qwen3.6-27B, Gemma-4-26B-A4B MXFP4 of GPT-OSS-120B Q4/Q5/Q6 of Qwen3-Coder-Next Q4 of Qwen3.5-122B-A10B, Nemotron-3-Super-120B-A12B, Mistral-Small-4-119B-2603, GLM-4.5-...
u/munkiemagik (+8)With qwen3.6-27b performing as well as it does for its size, what kind of tasks do you find yourself preferring to use GPT-OSS-120B? Im interested in your (and u/kevin_1994) as I used to run OSS-120b ...
u/pmttyji (+5)Air is 100B model(Still many fans there for this) & Flash is 30B model. Both Qwen3.6 & Gemma-4 models replaced previous 30B models so didn't include Flash.
View full discussion on r/LocalLLaMA
u/DeepOrangeSky 2026-05-01 Discussion Quantization & Backends

Which of these do you think we'll get in May? Also, feel free to pick/rank which ones you'd want the most badly: - more Gemma4 models (124b?) (other sizes?) - more Qwen3.6 models (9b? 122b? 397b?) - new Qwen Coder model (80b Even Nexter?) (~397b/400b...

u/ttkciar (+24)Some thoughts: * I do not expect any new Gemma releases, but will be pleasantly surprised if they make a "4.1" release of the 26B and 31B models which fix their lingering tool-calling problems. If the...
u/snapo84 (+22)I am happy with Qwen 3.6 ... only thing i hope is llama.cpp implements MTP and dflash and ddtree ... then life would are already be perfect with current models
u/Kodrackyas (+17)3.6 9b and we are balling, the frontier models are sweating now, AI bubble burst detector: 50%
View full discussion on r/LocalLLaMA
u/Excellent_Jelly2788 2026-05-07 Generation Quantization & Backends

With every new model release there's the "better than Opus 6.13" guys vs the "this is so bad, why did they even release it" camp and I'm always wondering which one is using it wrong. So I did a little test with 2 related prompts, 3 models and ran eac...

u/ExplanationAway672 (+12)Try 27B... first try below with the short prompt Qwen3.6 27B UD Q4\K\XL (and got it right twice more in new sessions). Biggest hang up in reasoning was whether he was one of 6 siblings or had 6 siblin...
u/Qwoctopussy (+7)native speaker here and yes, you are right. this prompt has so much ambiguity. it’s less a test of “can the model reason logically?” and more a test of “can it read my mind when i don’t say what i mea...
u/averymerryunbirthday (+5)> He has $25 and bought 5 boxes of apples for his organic apples business. I'm not a native speaker, but doesn't that imply he still has 25$ and we actually have no idea how much he spent buying th...
View full discussion on r/LocalLLaMA
u/mr_Owner 2026-04-30 Other Mid-range GPU (8-16 GB)Quantization & Backends

Longtime lurker here, thought i should post my speeeeds... I have a RTX 4070S 12 GB Vram (+10% OC), AMD 9800x3D with 4x16 Gb DDR5 6000Mhz CL30. EDIT: I offload my display to my igpu btw to save some vram on the rtx dgpu. Otherwise drop 10% or so on p...

u/Party-Log-1084 (+10)40 t/s on a 35B Q6 with just 12GB of VRAM is honestly wild. That 9800x3D and DDR5 combo is definitely carrying hard since you're obviously spilling a ton over into system RAM. Appreciate you dropping ...
u/Farther_father (+2)As part of the 12GB VRAM club (although with i9/128GB RAM on the CPU side), I appreciate this post! I wonder how my daily driver (gpt-oss-120) would sit in this comparison, but I’ll steal/test your se...
u/mr_Owner (+1)Enjoy! Curious if it's any better then what you had before 😁
View full discussion on r/LocalLLaMA

We had a customer support RAG bot. Standard setup: ChromaDB, system prompt, an LLM doing generation. Nobody had actually measured the response quality. In the name of evaluation, I only had a keyword matching script producing numbers that looked like...

u/pmttyji (+26)Why no recent Qwen3.6 models & also Granite-4.1-30B? Would be nice to see those too
u/cr0wburn (+17)What a weird mix of models, there are some top performers missing like the qwen 3.6 series, also gemma 4 31b?
u/pmttyji (+13)Thanks. Both Qwen-3.6-27B & Qwen-3.6-35B. Also please add Qwen3.5-9B & Gemma4-E4B as well.
View full discussion on r/LocalLLaMA
u/OrdoRidiculous 2026-05-09 Question | Help General

I've been learning German recently, and it occurred to me that I could point some of my AI horsepower at having a German speaking LLM to practice with. I'm not too concerned with the speech to text side of things or getting it to talk back, but googl...

u/SevereTilt (+8)I have been testing AI for language learning for 1-2 years. I have a setup in SillyTavern where I created several characters to practice my target languages (I tried a teacher character for Arabic tha...
u/FullstackSensei (+7)Been using Gemma 3 and gpt-oss-120b since they came out to help me learn German. Now using Gemma 4 for the same. They can mix up Trennbare verben but generally they're really helpful in understanding ...
u/MalabaristaEnFuego (+5)I'm talking about that too. It's not just a translation tool, it's an LLM with more training weights around language. I literally recommended it for precisely the purposes you mentioned.
View full discussion on r/LocalLLaMA
u/purealgo 2026-05-07 Discussion Apple SiliconQuantization & Backends

In case you haven't heard, Google just released Multi Token Prediction drafters for Gemma 4, a speculative decoding approach that pairs the main model with a lightweight drafter. It can predict several tokens ahead and then verify them in parallel, s...

u/WillyTheWoo (+14)Works on omlx. At max wattage, doubles my generation speed from 11tk/s to 20+ tk/s. Little effect for spec prefill though. M1Max 64GB RAM
u/Dany0 (+14)it functionally cannot affect prefill
u/pantalooniedoon (+9)That doesnt makes any sense to me. You already have the entire prompt to process in parallel. The point of spec decoding it to get something you can parallelize.
View full discussion on r/LocalLLaMA
u/BestSeaworthiness283 2026-04-30 Tutorial | Guide General

I've spent the last few weeks running real multi-file coding tasks through small local models and small cloud models on free tiers. Wanted to share the failure points that came up consistently, since some of them surprised me and i wanted to share wi...

u/synw_ (+8)About structured output for small models I recommend using xml over json: it's easier to manage for the model, with less formatting rules. Using shots help the small models a lot to stay on tracks
u/synw_ (+6)A shot is a user/assistant history turn. You provide several history turns with examples of well formed outputs, and the model will follow this pattern, making it easier for it to get it right. About ...
u/IrfanZahoor_950 (+6)This is a good reminder that prompts aren’t the contract, the orchestration layer is. small models can be useful, but only if you validate paths, classify actions, check outputs, and never let the mod...
View full discussion on r/LocalLLaMA
u/snowieslilpikachu69 2026-05-15 Question | Help LaptopsQuantization & Backends

Okay so i've been stalking this sub for some time and i run the occasional small 2-8b model on my laptop (not the best) for fun but say my role at a company is to set up a local LLM since we obviously don't want confidential data going to other compa...

u/tecneeq (+60)Same situation here, confidential data moves in the company, so we decided to build a local stack. Bought a Gigabyte Server with two 6000 Blackwell MaxQ, with the option to add two more, 26k€ Installe...
u/1beb (+20)I strongly recommend using rentals/API before making a purchase decision. Use cases can quickly outgrow on prem resources. Give people generic access, watch what they do for a month or two, then decid...
u/Enki_40 (+16)It’s amazing how many people who ostensibly want to support a business don’t understand that the enterprise plans out there like ChatGPT and Anthropic not only keep data private / don’t train on it bu...
View full discussion on r/LocalLLaMA

When dealing with untrusted outside input, I think you should handle it based on the situation. If you're processing structured data files, it's better to use tools to isolate and handle them. I made DataGate for that. But if it's web documents that ...

u/Mickenfox (+6)What I don't understand is why models don't just do this. Create two special tokens like <|untrusteddata|> <|endofuntrusteddata|>, then train the model to never ever follow any instruction...
u/User_Deprecated (+3)Some of them basically do, and it works. Not every provider seems to care about it yet though. Feels more like a prioritization thing than a real limitation.
u/lakySK (+3)That seems to be working way better than I’d expect! Would you say the data you tested on is actually “state-of-the-art” of prompt injections? Or did you hold back for now? Where did you get the promp...
View full discussion on r/LocalLLaMA
u/Felix_455-788 2026-04-18 Question | Help Quantization & Backends

yea i know the title looks so stupid, yes i done searches, i searched google, huggingface, youtube, i even tested some via LM Studio, but due to my low-end VRAM (GTX 1050 4G Vram) i cant fit more than 4B or 1B into it, i have about 20G RAM + 15G Page...

u/netherreddit (+23)Qwen 3.5 has 2B, 4B, 9b, it's the best for most tasks
u/netherreddit (+9)Could also try Gemma 4 E2b and E4b
u/netherreddit (+8)Yeah still Qwen 3.5, even just for coding
View full discussion on r/LocalLLaMA
u/Borkato 2026-05-19 Discussion High-end GPU (24+ GB)Mid-range GPU (8-16 GB)Quantization & Backends

I’m upgrading from 32 to 48 soon and am excited but I’m curious what y’all run!

u/-dysangel- (+78)>Do you wish you had more VRAM? Who is ever going to say "no" to this?
u/National_Meeting_749 (+17)This. I'd have about 1TB of VRAM if that didn't cost as much as some houses.
u/stoppableDissolution (+13)Former 48gb user. Used to mainly run either q4 of various llama3 70b tunes or full-precision mistral small, then q6 gemma4 31b. Got 96gb now and still running almost exclusively gemma but now with few...
View full discussion on r/LocalLLaMA

Found this interesting and thought i'd share. A big problem i've had with Qwen 3 MoE is how bad at instruction following it was, and also, it's 'dumb point' in the context window was really low. I was so turned off by it that i never tried Qwen 3.5 a...

u/Express_Quail_1493 (+17)i love this guys videos. he does real test on projects the LLM would stumble on to intentially feel out the models without relying heavily on benchmarks. Most youtubers are lazy zero-shot single file ...
u/vulcan4d (+8)We need more of these and different quant testing to validate the information that we are basically sold to. Everything on paper looks good but everyone seems focussed on testing extremely large model...
u/FastHotEmu (+7)not reverse engineer, just recall
View full discussion on r/LocalLLaMA
u/Chromix_ 2026-04-19 Discussion Quantization & Backends

Nothing extensive to see here, just a quick qualitative and performance comparison for a single programming use-case: Making an ancient website that uses Flash for everything work with modern browsers. I let all 3 models tackle exactly the same issue...

u/grumd (+11)Should've used Q6 or Q8 for 35B, you have the speed and RAM for it if you could run a Q4 80b model. Otherwise a great post, imo real testing like this is most valuable, you're actually seeing how mode...
u/Chromix_ (+10)It's Gemma 31B (dense) in my posting, not "Gema 26B-A4B dense" as you wrote (and that would actually be a MoE) "Old Qwen Dense" would refer to Qwen 3 Coder Next then? That's not a dense model and ther...
u/Chromix_ (+5)Yes, I could easily run Q8 for it. Yet Q5 (XL) seemed like a good speed / quality trade-off to make. There was some talk that higher quantization disproportionally affects the long-context performance...
View full discussion on r/LocalLLaMA
u/MrMrsPotts 2026-05-08 Discussion General

We had deepseek v4 preview recently but it wasn't much better than v3.2. What is the next SOTA local/open model you are excited about?

u/johnfkngzoidberg (+57)Qwen3.6-65B-A7B. Won’t happen, but I can dream.
u/LoveMind_AI (+29)This question caught me by surprise a bit because I think this is the first time in a year when I can honestly say… nothing? Something Qwen 3.6 27B/Gemma 4 31B sized but with audio reasoning capabilit...
u/my_name_isnt_clever (+27)This is probably cheating, but the SOTA for my hardware would probably be Qwen 3.6 122b. Please Qwen, release it 🙏
View full discussion on r/LocalLLaMA
u/PromptInjection_ 2026-05-12 Discussion General

Yes, for material that is an hour long, there is no getting around tools like Whisper - or something even better. However, for transcribing short snippets, Gemma works very quickly and reliably- even in foreign languages. Do you use it as well?

u/PromptInjection_ (+8)True, but sadly it can't process audio files directly. Google only supports that with the E2B and E4B.
u/dev_dan_2 (+7)tbh., I think Gemma 4 E4B reached the "good enough" stage for me - not entirely sure yet, but in my usage so far, it looks quite like that! Of course, I still hope for 6-12 months more of even better ...
u/monrow_io (+4)Yeah I’ve seen people do that split setup. Whisper (or similar) for long, noisy audio, and smaller models like Gemma for quick short clips where latency matters more. I don’t really use tools directly...
View full discussion on r/LocalLLaMA

now you can talk about videos

u/iChrist (+5)Correct me if I'm wrong but the new Gemma4-E variants support video, right?
u/robertpro01 (+3)Now I need a model that understands video input
u/ComplexType568 (+3)So does Qwen3 VL and above too I think
View full discussion on r/LocalLLaMA
u/Sostrene_Blue 2026-05-13 Question | Help General

Between a solid model from Qwen or Gemma 4, when translating a text, does "thinking mode" significantly boost the quality of the translation, or is the difference negligible?

u/UnWiseSageVibe (+17)I have been using Gemma 4 for translation processing on some personal projects and I found that having it off is better. It wastes a lot of context thinking about it and also ends up overthinking it.
u/Temporary-Mix8022 (+10)I have a custom harness and I've generally done: 1. Pass one, no thinking. Direct translation. 2. Second pass, consider whether translation are appropriate, flag why/why not. This has a much larger co...
u/MindPsychological140 (+5)Thinking pays off for idioms, jargon, and long-form consistency. For straight prose it's mostly latency overhead. A dedicated translation tune (Qwen3-Translation, Tower) usually beats generalist + thi...
View full discussion on r/LocalLLaMA
u/Gesha24 2026-05-09 Question | Help Quantization & Backends

I have been using local LLM for coding quite a lot as well as some other tasks (like data extraction from images) and I had quite a good success with Qwen3.6 models. It's obviously not Sonnet/Opus, but I am able to get quite a lot of work done. Latel...

u/swagonflyyyy (+26)Gemma4-26b and up seems to be good as an everything but code model. Like, its so goddamn intuitive and good at chatting too, but its fails hard at code.
u/[deleted] (+16)[removed]
u/false79 (+9)I dont use pi. I use cline --tui on windows and it gets the tool calls 100% of the time But qwen 3.6 27B gets out better answers faster. No issues with calls But I strictly use it for coding. Sometime...
View full discussion on r/LocalLLaMA
u/Funny-Trash-4286 2026-04-18 Question | Help Quantization & Backends

Which LLM with under 10B params has the best ability to do web searches Is there any benchmark for this where i could see how certain models perform I've checked out gemma e4b it, is it any good for web searching compared to other alternatives at the...

u/OleCuvee (+14)Any decent model will do solid web search if you provide it with the right tools. I gave my researchers searXNG, so that I don’t have to rely on Firefly, Brave and Google tokens. Works great, search o...
u/totonn87 (+7)If I remember correctly you have to use openwebui + searxng to use Gemma e4b with web search.
u/DinoAmino (+4)> Openweb UI just isn't a great harness for any kind of research. Because OWUI is not a harness for research. But OP only asked about basic web search, and OWUI is completely capable of that. Diffi...
View full discussion on r/LocalLLaMA
u/reto-wyss 2026-04-23 Other High-end GPU (24+ GB)Quantization & Backends

Qwen3.5-27b (BF16) on 2x Pro 6k and Gemma-4-E4B (BF16) on RTX 5090 - Took about 8 minutes total (40k tokens total - but like 10k is opencode prompt) - One prompt for planning (I answered a few follow ups) - One shot 1000 lines of code - Fixed only bu...

u/StrikeOner (+4)great, now try that 5 more times, add gemma-4 and qwen 3-6 35b to the list, measure the time it took for each run and post your results!
u/Long_comment_san (+1)What is this "chat interface"? how it works? It connects to your backend for gemma?
View full discussion on r/LocalLLaMA
u/maxwell321 2026-04-23 Discussion Apple SiliconQuantization & Backends

Hi all! I recently made a post about how Gemma 4 managed to replace Qwen 3.5 for me, for semantic routing and a lot of coding stuff and ultimately it was my new daily driver. The next day, Qwen 3.6 released and I've been using it a lot this week. Her...

u/PhilippeEiffel (+23)Multiple times in your message you say "Qwen 3.6 30B" Do you mean 27B or 35B?
u/mp3m4k3r (+13)Or maybe just average them and call it 31B /s
u/Truth-Does-Not-Exist (+10)Pi > Opencode
View full discussion on r/LocalLLaMA
u/Reactor-Licker 2026-05-11 Question | Help CPU / Raspberry PiLaptopsQuantization & Backends

I’m currently stuck deciding between AMD Strix Halo (128 GB AMD Ryzen AI Max+ 395 Framework Desktop) and an Nvidia DGX Spark (Asus Ascent GX10) for a home LLM server that can be accessed over the local network with a ChatGPT like interface in a web b...

u/abnormal_human (+51)If you're just doing LLM inference on it, Spark all the way. If you also want a gaming PC (or whatever), Strix Halo.
u/tomekrs (+41)Definitely Ryzen 395, as it's a standard x86/amd64 machine that can always be repurposed and will never lose drivers or compatibility with new operating systems. Nvidia on the other hand has a history...
u/Eugr (+33)Spark has much faster GPU which results in faster prompt processing speeds. Also, the performance degrades less on Spark as context grows (I have both).
View full discussion on r/LocalLLaMA
u/cafedude 2026-05-06 Question | Help LaptopsQuantization & Backends

I've got a 128GB Strix Halo box. Yesterday I wanted to try out Step-3.5-flash. It's a model that barely fits in my system as is - I found a bartowski Q4_XS that's 105GB. With about 150K context it takes to about 108GB. That leaves about 20GB minus wh...

u/Anbeeld (+22)Context checkpoints?
u/coder543 (+15)It's not a memory leak, but yes, there are things that aren't allocated in advance, seemingly because llama.cpp assumes that the host memory is separate from the GPU memory, and that you can just allo...
u/AnonLlamaThrowaway (+5)It's context checkpoints. I noticed this only with the release of Gemma 4. --ctx-checkpoints 4 fixes it for me. I figure setting it to 1 or 2 is probably too little. I haven't noticed any adverse effe...
View full discussion on r/LocalLLaMA
u/pmttyji 2026-05-13 Discussion Mid-range GPU (8-16 GB)Quantization & Backends

We'll be getting those features(check bottom link) on mainline soon or later anyway. But for now this fork could be useful to see the full potential of our poor GPUs(and also big, large GPUs). Any 8GB VRAM(and 32GB RAM) folks already doing Agentic co...

u/FatheredPuma81 (+18)
u/R_Duncan (+16)Qwen3.6-35B-A3B is the only choice with 8GB VRAM: gemma has huge kv cache (can't fit 128k+ context) and 27B is way slow.
u/FatheredPuma81 (+7)Yes... I dropped running my own AIME2025 benchmark because I didn't want to spend 12+ hours per KV quant running it and chose to say that I would recommend against using a mostly AI coded fork built o...
View full discussion on r/LocalLLaMA
u/Ok_Warning2146 2026-05-01 Discussion Quantization & Backends

I have an Oneplus 12 with Snapdragon 8 Gen 3. I followed the above README to cross-compile llama.cpp on Ubuntu and then copy to the Termux directory on the phone. It seems like llama.cpp's Hexagon backend is highly supported by…

u/Ok_Warning2146 (+2)The point of NPU is power saving. If you don't want llm to drain your battery, it is better to use npu than gpu.
u/pdycnbl (+1)i have sd elite (i guess its gen4) oneplus 13 and my experience of running it was not good. i was mostly interested in qwen3.5 9B Q4 model its approx 6gb. on paper it looks nice approx 3-4TFlop of gpu...
u/Ok_Warning2146 (+1)Thanks for your input. What kind of number do you get for gemma3-4b-qat-q4_0?
View full discussion on r/LocalLLaMA
u/do_u_think_im_spooky 2026-05-19 Discussion High-end GPU (24+ GB)Mid-range GPU (8-16 GB)Quantization & Backends

I posted earlier about RTX 5060 Ti local LLM testing, and I have cleaned the repo up quite a bit since then. The project is now a more structured benchmark/recipe repo rather than scattered notes. It has a static results explorer, schema-validated be...

u/see_spot_ruminate (+5)Can I get in on this with my magnum (quad) setup?
u/do_u_think_im_spooky (+3)Absolutely, quad 5060 Ti data would be really useful. I’m trying to structure the repo around hardware lanes now, so 1x, 2x, 4x/multi-card, and mixed/non-5060 Ti CUDA results can all be useful without...
u/do_u_think_im_spooky (+3)Still a work in progress. Just saw the club-3090 repo and thought the community might like something for these GPUs. With everything I've experimented and got working, I figured I could make a useful ...
View full discussion on r/LocalLLaMA
u/Choice_Sympathy9652 2026-05-03 Question | Help High-end GPU (24+ GB)Quantization & Backends

Community discussion comparing Gemma 4 31B, Qwen 3.6 27B, and GLM 4.7 30B on non-English (primarily European) languages. Original poster reports Gemma 4 31B as the best at Czech, noting it "blows their mind" at 18GB. Key community finding: Gemma 4 31...

View full discussion on r/LocalLLaMA
u/sid351 2026-04-30 Question | Help Quantization & Backends

I've got to the point where I need some help. I'm trying to run Qwen 3.6, and it will eventually fall into a loop where it's just outputting "/" symbols when it's "thinking". It just loops through spitting out / until the max tokens is hit so you see...

u/TokenRingAI (+4)Why do you have your KV cache quantized so heavily?
u/chimph (+5)running the same model at q6 in opencode and have no issues. Works beautifully.. tho I did when I first set it up. Since then I have this in my agents.md file.. maybe try it out yourself but of course...
u/phidauex (+3)Dumb question, are you using CUDA toolkit 13.1 or 13.2? There is a known issue with these models and 13.2.
View full discussion on r/LocalLLaMA
u/CrowKing63 2026-05-01 Discussion LaptopsQuantization & Backends

I'm testing running local LLMs on a gaming mini PC (AMD 7840HS, 32 GB RAM) paired with an eGPU (Radeon 9060XT with 16 GB VRAM). Since I'm not very familiar with using llama.cpp, I kept getting unsatisfactory results, but with the recent Gemma4 24B A4...

u/Solary_Kryptic (+4)Same card, how did Qwen 3.6 35B perform for you?
u/hurdurdur7 (+2)That context size + that model doesn't fit in your vram. You are suffering because you are offloading to cpu and regular ram.
u/maxpayne07 (+1)Lmstudio won't let you use also the igpu and share some layers to IGPU 780M of Ryzen? I love to kow if possible, i want to buy a laptop with amd Ryzen igpu and also with Nvidia gpu mobile
View full discussion on r/LocalLLaMA
u/BitGreen1270 2026-05-01 Question | Help Quantization & Backends

I experienced this with Q4 and Q3 versions of Qwen3.6-35B-A3B and Gemma-4-26B-A4B. It starts saying things which sound similar in thinking mode: I must do .... I have to do ... I need to do ... Is this a known issue with lower quantization ? I usuall...

u/lit1337 (+3)Yeah, from what I've seen this happens with quantized MoE models. Both Gemma 4 and Qwen 3.6 do this at Q3/Q4, I've hit it on my own quants too. I don't think its a sampling thing. I think what's going...
u/Confident_Ideal_5385 (+2)Qwen 3.6-A3B absolutely requires a presence penalty of 1.5 or so if using it with CoT enabled (which you absolutely want to do, since otherwise it's really just 10 3B models in a trenchcoat). Qwen men...
u/into_devoid (+1)Are you using Google’s recommended sampling settings? Is your context filling?
View full discussion on r/LocalLLaMA

SGLang backend compatibility report from AI Router Switzerland. The author reports FP8 KV cache corruption with radix-cache prefix hits on Qwen3.6-27B-FP8, and explicitly says the bug seems to affect FP8 models such as DeepSeek-V4, Gemma 4, and Qwen3...

View full discussion on r/LocalLLaMA