Real-world experiences running Gemma models, curated from the community. Browse hardware reports, read the weekly field notes, or search for your setup.
A weekly synthesis of what the r/LocalLLaMA community is reporting about Gemma 4 in real use.
A weekly synthesis of what the r/LocalLLaMA community is reporting about Gemma 4 in real use. Curated from the latest Gemma-mentioning posts (17 new or updated since 2026-05-19, 164 total) and their top comment threads. Confidence is medium unless noted, since this is community signal rather than a controlled benchmark.
May 20 sweep, 2026-05-20 00:00 EDT: five developments from this sweep deliver the first Gemma 4 MTP support on mobile hardware via Google AI Edge Gallery, clarify that desktop llama.cpp MTP does not yet support Gemma 4 models, establish Gemma 4 31B Q8 as the consensus daily driver for 48GB VRAM setups, record early RTX 5060 Ti NVFP4 testing with the nvidia/Gemma-4-26B-A4B variant, and provide community evidence on KV cache quantization quality tradeoffs at large context windows.
Gemma 4 MTP arrives on Android — Google AI Edge Gallery v1.0.13 and v1.0.14. Google AI Edge Gallery released v1.0.13 on May 18, adding Gemma 4 Multi-Token Prediction support for on-device inference on Android. The same day, v1.0.14 landed with experimental Model Context Protocol (MCP) support, Pixel TPU hardware acceleration, new built-in skills, and chat history persistence. Community reaction is positive: a top commenter (score 15) states "edge gallery is legit usable now." Practical notes: MTP requires re-downloading the on-device model variants; Pixel TPU acceleration is available on compatible Pixel devices; the MCP integration is marked experimental. A notable community concern (score 14) flags that the app requires agreeing to Google data collection on first launch — a relevant consideration for users running Edge Gallery specifically for offline privacy. The practical picture: Gemma 4 on mobile has reached a materially more usable state with v1.0.13+ — faster token generation via MTP, hardware acceleration on Pixel, and persistent chat sessions. For users with Pixel 8/9 or other Pixel-class devices, this represents the first production-quality path to Gemma 4 MTP inference without desktop hardware. (source, May 19, 44 score, 17 comments)
Desktop llama.cpp MTP improvements land, but Gemma 4 MTP support is not yet included. A community post (May 19, 99 score, 73 comments) links llama.cpp PR #23269, a meaningful MTP performance improvement for models that already support MTP in llama.cpp. A prominently upvoted community comment (score 24) clarifies directly: "Gemma4 MTP is not supported yet." This creates a notable divergence: on mobile, Gemma 4 MTP is production-available through Google AI Edge Gallery v1.0.13; on desktop, the draft-head speculative decoding path in llama.cpp still does not implement the Gemma 4 MTP head. Users reporting gains in this thread are on Qwen 3.6 models. A commenter who combined a 1660 Ti with a 5070 Ti to reach 22GB VRAM reports going "from single digit tps to double digit" — those gains are entirely on Qwen workloads. Practical guidance: update llama.cpp for MTP gains if you run Qwen 3.6 27B or 35B-A3B; do not expect any Gemma 4 MTP speedup on desktop llama.cpp builds until a Gemma 4 MTP head PR merges. Watch llama.cpp PRs for Gemma 4 MTP support as a separate milestone from the already-landed Qwen MTP. Confidence: explicit community statement supported by absence of any Gemma 4 MTP benchmark in the thread. (source, May 19, 99 score, 73 comments)
48GB VRAM sweet spot: Gemma 4 31B Q8 as daily driver, Q6 at 96GB for extended context. A community discussion (May 19, 24 score, 51 comments) on 48GB VRAM use patterns produced two clear Gemma 4 data points. A commenter (score 12) running dual 24GB P40 cards confirms Gemma 4 31B Q8 GGUF as the daily driver, noting it supports a useful context size with the workload split across both GPUs, and leaves enough headroom for an image model and TTS/STT on the remaining VRAM. A former 48GB user who upgraded to 96GB (score 13) reports running Q6 Gemma 4 31B with "a few hundred thousand tokens of context" and substantially faster throughput — treating Q6 at this tier as the next plateau after Q8 at 48GB. The practical picture: at 48GB (whether two P40s, one RTX 6000 Ada, one A6000, or similar configurations), Gemma 4 31B Q8 provides good quality with large context; the primary reason to go higher is extended context beyond 50–100k tokens or the step-up to Q6 quality. Community sentiment: "Q6 gemma4 31b" is the enthusiast target at the 96GB tier, but Q8 at 48GB is well-established and not a compromise worth stressing over. Confidence: anecdotal, small engagement; consistent with broader field notes on this hardware tier. (source, May 19, 24 score, 51 comments)
RTX 5060 Ti community recipes expanded — early NVFP4 Gemma 4 26B testing underway. A follow-up post (May 19, 20 score, 9 comments) from the club-5060ti project reports a cleaned-up benchmark and recipe repository with schema-validated JSON, a static results explorer, and structured lanes for 1x, 2x, and multi-card 5060 Ti configurations. A commenter reports that after getting vLLM working on the 5060 Ti, they "revisited the gemma model since the nvidia/Gemma-4-26B-A4B" NVFP4 variant — the first community signal of NVFP4-format Gemma 4 being tested on a Blackwell budget card. The 5060 Ti's 16GB GDDR7 and Blackwell native FP4 support in principle make it a viable target for NVFP4 Gemma 4 inference at higher throughput than equivalent FP16 or Q4 GGUF builds. Important caveat: this testing is still in early stages; the commenter notes difficulty getting the NVFP4 model to work with MTP on Qwen before switching to Gemma, so the Gemma NVFP4 result is not yet benchmarked. Practical status: 5060 Ti + Gemma 4 26B NVFP4 is an active community experiment, not a confirmed recipe. Follow the club-5060ti GitHub repository for results as they publish. Confidence: low — single commenter, no benchmark numbers yet. (source, May 19, 20 score, 9 comments)
KV cache quantization at large context: Q4_0 quality loss is significant, Q5_1 is the recommended middle ground. Community consensus (May 17, 44 score, 91 comments) on KV cache quantization for developers using large context windows (50k+ tokens) is clear and consistent. The top response (score 44) is direct: "The quality loss at Q4 is pretty severe. I'd recommend the Q5_1 option instead, which was introduced relatively recently. Q8 for K and Q4 for V is another option." A second commenter (score 23) recommends "Model Q6 and up, context cache FP16." A developer-facing finding (score 21): "Lesser quant == more tool call errors. So it depends on harness and model, how good both of them at error recovering. If I can — I don't quantize cache." Practical guidance for Gemma 4 users: if you are running Gemma 4 31B or 26B-A4B at context lengths above 32k and using the model for structured output, tool calling, or multi-turn agentic tasks, avoid Q4_0 KV quantization. Q5_1 is the community-recommended minimum for quality preservation at large context; FP16 KV cache is the reliability ceiling but carries VRAM cost. The Q8K / Q4V hybrid is a middle-ground option if VRAM is the constraint. These findings directly apply to Gemma 4 26B-A4B MoE, which has the architectural capacity for large context but requires careful KV quantization choices to maintain coherence over long sessions. Confidence: community consensus across multiple practitioners. (source, May 17, 44 score, 91 comments)
A weekly synthesis of what the r/LocalLLaMA community is reporting about Gemma 4 in real use. Curated from the latest Gemma-mentioning posts (18 new or updated since 2026-05-18, 157 total) and their top comment threads. Confidence is medium unless noted, since this is community signal rather than a controlled benchmark.
May 19 sweep, 2026-05-19 00:00 EDT: six developments from this sweep deliver the first cross-hardware MTP numbers from mainline llama.cpp, surface a high-engagement coding-agent harness demo built on Gemma 4 E4B, add a third community fine-tune to the Gemma 4 31B ecosystem, report ROCm 7.13 Strix Halo optimizations with a real-world AMD stability confirmation, frame Gemma 4 31B as the competitive ceiling for the 27–31B class, and record a practical agentic-coding comparison showing where Qwen 35B-A3B currently edges Gemma 4 26B on one user's setup.
MTP in mainline llama.cpp — measured numbers across Strix Halo and RTX 3090 rigs. PR #22673 (commit 4f13cb7) is confirmed in mainline llama.cpp as of May 16. A community benchmark post (May 18, 39 score) measured Qwen3.6-27B performance on two rigs using `--spec-type draft-mtp --spec-draft-n-max N`: on a Strix Halo (Framework Desktop, ROCm 7.0.2), Q4_K_M went from 11.7 to 21.2 tok/s (1.81×) and Q8_0 from 7.4 to 18.1 tok/s (2.44×); on a single RTX 3090 at 450W (CUDA 12.9), Q4_K_M improved from 38.7 to 59.5 tok/s (1.54×); on a dual RTX 3090 layer-split, Q8_0 went from 25.7 to 55.9 tok/s (2.17×). For MoE comparison: Qwen 35B-A3B gained 1.40× on Strix Halo and 1.24× on the RTX 3090 — confirming the by-now well-established asymmetry where dense models benefit substantially and MoE models gain less because each forward pass is already cheap. These results transfer directly to Gemma 4: expect comparable gains on Gemma 4 31B Dense (the closest architectural equivalent to Qwen3.6-27B dense), and more modest gains on Gemma 4 26B-A4B MoE. The optimal `--spec-draft-n-max` sweet spot varies by rig: uncapped 3090 preferred n=2 at Q4; power-capped 3090 and Strix Halo preferred n=3. Output is described as byte-identical to baseline at the same seed and temperature. Confidence: structured benchmark with multiple rigs and runs; single hardware configuration per rig. (source, May 18, 39 score, 28 comments)
SmallCode: Gemma 4 E4B as a 4B coding agent achieves 87% on self-selected benchmark — harness design matters more than model size. A high-engagement post (May 18, 639 score, 306 comments) introduced SmallCode, a coding agent harness built from scratch for small local models, demonstrating 87/100 tasks passing with Gemma 4 E4B (which activates 4B parameters per token). The author's core insight: standard agents like OpenCode and Cursor assume large frontier models, causing small models to fail on multi-step tool chains. SmallCode compensates with three harness techniques — compound tools that bundle sequential file operations into a single call (cutting failures from multi-step coherence loss in half), an improvement loop that feeds compilation errors back automatically, and a task-decomposition fallback when the model fails twice in a row. The author claims OpenCode scores approximately 75% with 14B models on their benchmark, suggesting the harness closes a meaningful gap. Important caveats: the benchmark is self-selected and not reproducible against a standard suite; top community responses were pointed ("TrustMeBro-2.1-hard," "custom benchmarks is like marking your own homework"). A top comment with 126 score questioned why these improvements aren't integrated into existing tools like OpenCode or little-coder rather than creating another standalone agent. Practical takeaway: the harness techniques — compound tool bundling, lint-driven improvement loops, and decomposition on repeated failure — are generalizable regardless of implementation, and the post confirms Gemma 4 E4B is capable enough for agentic coding when the scaffold compensates for its coherence limits. Treat the benchmark numbers with appropriate skepticism. (source, May 18, 639 score, 306 comments)
Gembrain: third community fine-tune merges seven Gemma 4 31B variants — community reception is skeptical. LLMFan46 published GGUF packaging for Gemma-4-Gembrain-31B-it-uncensored-heretic (May 18, 34 score), created by Nimbz as a merge of seven Gemma 4 31B fine-tunes targeting improved logical and lateral thinking, adherence, prose variety, and creative output. KLD is 0.0186 with 13/100 refusals. Community response is more skeptical than prior heretic-line releases: a top comment (score 27) says "I don't ever trust these weird merged models"; a second (score 17) questions why merging a fine-tune that already merged the base model adds value; a third (score 12) challenges the "boost lateral thinking" claim with no published mechanism. The fine-tune author revealed an internal contradiction: the merge includes a model with 99/100 refusals — the same as the base Gemma 4 31B — which may explain the KLD not moving strongly. This is the third community fine-tune in the Gemma 4 31B ecosystem in one week (Ortenzya for prose, Meromero for creative breadth, Gembrain for thinking and variety); all three are published by the same packaging pipeline (LLMFan46 GGUFs). None have been systematically benchmarked against the base model. If you are experimenting with fine-tuned Gemma 4 31B for creative or reasoning tasks, these give you options to test, but treat confidence as low until independent evaluations appear. (source, May 18, 34 score, 29 comments)
ROCm 7.13 nightly: Strix Halo optimizations merged, RX 6800 Gemma 4 stability confirmed. AMD released ROCm 7.13 Tech Preview (May 17, 50 score, 24 comments) with dedicated optimizations for the Ryzen AI Max 300 "Strix Halo" and new support for additional APU and GPU SKUs including Ryzen AI 7 PRO 360, 350, and other gfx1152-class devices. A commenter with an RX 6800 (score 2) reported running Gemma 4 E2B, E4B, and 26B at various quantizations via lemonade-sdk/llamacpp-rocm "for months" without a single crash — a meaningful real-world stability data point for AMD discrete-GPU Gemma 4 inference under ROCm. A second commenter (score 14) notes ROCm offers better prompt processing than pure Vulkan with a minor generation throughput tradeoff. Caution: an early commenter found ROCm 7.14 preview running slower than expected on Strix Halo when using the latest llamacpp-rocm build, suggesting not all builds in the preview channel are stable. For Strix Halo users: the 7.13 Tech Preview available from TheRock on GitHub is the tested path; the 7.14 preview introduced a regression for at least one user. For AMD discrete GPU (RX 6800 class and newer) users running Gemma 4 via llama.cpp: the lemonade-sdk ROCm build is now reported stable for production use including MTP testing, with the caveat that `-np 1` may be required and mmproj handling needs attention for multimodal use cases. Confidence: community report, single-user stability confirmation; not a controlled benchmark. (source, May 17, 50 score, 24 comments)
Model release anticipation: community positions Gemma 4 31B as one of two competitive options in the 27–31B class. A widely-engaged discussion (May 18, 120 score, 71 comments) forecasting when new local models will drop surfaced a useful framing for Gemma 4's current market position. A top commenter (score 39) stated directly: "Qwen 3.6 27B dense raised the bar very high — only competitor (not for coding, of course) is Gemma 4 31B dense currently." Another commenter (score 41) specifically flagged "Gemma 4 123B or Qwen 3.6 122B would be huge," reiterating the community's ceiling aspiration documented in prior field notes. Several commenters referenced expectations for new Google releases in the following days, potentially at Google I/O. Practical context: Gemma 4 31B Dense is the community's shortlist pick for non-coding general quality at the 27–31B parameter tier. For coding and agentic tasks, Qwen 3.6 27B and 35B-A3B are consistently preferred. This positioning has been stable across the last two weeks of field notes. No product announcement signal exists for a larger Gemma 4 variant. (source, May 18, 120 score, 71 comments)
Agentic coding with 4090+5060Ti: Q8_0 Qwen 35B-A3B at 262k context edges Gemma 4 26B for demo work. A practitioner report (May 18, 29 score, 27 comments) compared Qwen 35B-A3B against Gemma 4 26B-A4B on an agentic coding workload — demo and data analytics scripts via Claude Code's API endpoint pointing to localhost. The author ran Q8_0 Qwen 35B-A3B on a 4090+5060Ti combination with 262,144-token context and reports it is "better than Gemma 4 26B" for their use case, though they note it underperforms in plain chat compared to agentic mode. Community recommendations push back on the Q8_0 KV cache setting: multiple comments (scores 10 and 9) recommend dropping to Q6_K_XL or similar model quant and using unquantized or FP16 KV cache for better coding quality. A notable data point from a top comment (score 10): 70 tok/s on dual RTX 3090 with Qwen 35B-A3B at Q8 and 196k context — useful headroom for context-heavy agentic sessions. Important context: this comparison uses models of different parameter counts (Qwen 35B-A3B activates only ~3.5B parameters per token as an MoE; Gemma 4 26B-A4B activates ~4B). The architectural comparison is approximate. For users with a single 4090 or 4090+5060Ti and primarily agentic coding or demo work at large context: Qwen 35B-A3B at Q6_K_XL or similar with unquantized KV cache is the community's current recommendation over Gemma 4 26B in that specific niche. For general-purpose or non-coding tasks, Gemma 4 26B or 31B remain competitive options. Confidence: anecdotal, single practitioner report. (source, May 18, 29 score, 27 comments)
A weekly synthesis of what the r/LocalLLaMA community is reporting about Gemma 4 in real use. Curated from the latest Gemma-mentioning posts (26 new or updated since 2026-05-17, 157 total) and their top comment threads. Confidence is medium unless noted, since this is community signal rather than a controlled benchmark.
May 18 evening sweep, 2026-05-18 02:00 EDT: three developments from this sweep expand the Gemma 4 community fine-tune ecosystem with a second heretic creative model, surface strong community demand for a larger Gemma variant, and document Gemma 4 running on Blackwell hardware as a catalyst for runtime migration.
G4-Meromero-31B: second creative fine-tune for Gemma 4 31B joins Ortenzya in the community ecosystem. Community developer zerofata released G4-MeroMero-31B-Uncensored-Heretic (May 17, 101 score, 49 comments), available via llmfan46's GGUF packaging on Hugging Face. Designed for creative tasks broadly, it complements the Ortenzya fine-tune (natural English prose) released the prior day by the same heretic-packaging pipeline. KLD is 0.0100 with 15/100 refusals. Community discussion highlights the two as potentially complementary: Ortenzya targets prose quality and natural English for translation and RP; Meromero targets creative breadth. A commenter asks directly how the two differ, and the fine-tune author notes Ortenzya improves prose naturalism while Meromero focuses on creative task coverage. Neither model has been systematically benchmarked against the base Gemma 4 31B; treat as community options to trial on your specific creative use case. Important context: the base Gemma 4 31B is already noted by community members as relatively permissive for creative tasks — these fine-tunes address tone and prose quality rather than unlock otherwise-blocked capabilities. Confidence: anecdotal — no head-to-head benchmark against the base model published. (source, May 17, 101 score, 49 comments)
Community appetite for a 124B Gemma is strong — Google shows no present signs of building one. A high-engagement post (May 17, 285 score, 51 comments) imagining a 124B Gemma model surfaced broad community agreement that such a model would be compelling — commenters frame it as "basically an open-weights version of Gemini Flash," which would be among the most capable locally-runnable models available. Top responses are skeptical Google will deliver: "There is no interest in doing that for them" (score 45) and "That would be awesome but I guess there is no interest in such a huge model" (score 34). A commenter jokes that any announced release will turn out to be a ShieldGemma variant. No product roadmap signal exists to support expectation of a 100B+ Gemma model. Practical context for users: Gemma 4's current ceiling is 31B dense (or 26B MoE, equivalent to approximately 4B active parameters per token). Users needing 100B-class locally-runnable models currently have Qwen 3.5 122B, Qwen 3.6 MoE, DeepSeek-V4 Flash, and similar options. This finding is editorial context rather than a hardware or performance data point — it captures the ceiling of current Gemma 4 availability and community aspirations for the model line. (source, May 17, 285 score, 51 comments)
Blackwell 5000-class GPU + Gemma 4 as a migration driver from Ollama to llama.cpp. A user with 64GB RAM and a Blackwell 5000-series GPU (identified as "backwell 5000" in post, likely RTX PRO 5000 or similar) running Gemma 4 and Qwen via Ollama and LM Studio asked for migration advice to get better speeds (May 17, 35 score, 73 comments). Community response: llama.cpp is the direct next step, offering fine-tuned control via flags and measurable throughput gains over Ollama's wrapper overhead. ik_llama.cpp was called "the best, not too hard tool" by a prominent commenter; vLLM wins on 4+ concurrent users but adds setup complexity. A late comment highlights lemonade-server as a drop-in Ollama-compatible endpoint with llama.cpp and vLLM backends. The hardware context adds a useful data point: Blackwell 5000-class cards (estimated 24–48GB GDDR7 depending on variant) running Gemma 4 are a real user segment upgrading from earlier Ollama-based workflows. This confirms that Gemma 4 is in active use by developers who are growing beyond simple Ollama deployments into tunable inference stacks, and that llama.cpp remains the community's first recommendation for that transition. (source, May 17, 35 score, 73 comments)
May 18 sweep, 2026-05-18 00:00 EDT: five developments from this sweep close the long-running open question on MTP's mainline llama.cpp status, deliver the first community benchmarks of officially-merged MTP across Strix Halo and RTX-class NVIDIA hardware, quantify the wall-time picture at production-scale context (85k tokens), and add a cross-platform decode-bandwidth comparison showing where each GPU tier wins on Gemma 4 model sizes.
MTP officially merged into llama.cpp mainline — PR #22673 approved by Georgi Gerganov. After weeks of testing in forks and patched builds, Aman Gupta's PR #22673 landed in llama.cpp's main branch, making Multi-Token Prediction (MTP) available to all users via a standard build without patching. Community reaction was celebratory, with the announcement post reaching 733 score. The key mechanism: MTP adds a lightweight draft head that predicts multiple tokens ahead; accepted drafts expand effective decode throughput at no quality cost (equivalent to standard generation when tokens are rejected). Important context from community commentary: MTP benefits are task-type dependent. Low-entropy outputs — code generation, math, structured text — see 67–90% acceptance rates and meaningful speedups. High-entropy outputs — creative writing, roleplay, diverse prose — see low acceptance rates and sometimes slower wall time due to the dual-prefill overhead. Practical note for immediate testers: at time of first community testing, the official Docker image for llama.cpp server-cuda had not yet picked up the merge; users wanting to test immediately need to build from source with `CUDA_DOCKER_ARCH` set for their GPU. The container will follow shortly. (source, May 16, 733 score, 236 comments)
Strix Halo MTP benchmarks: Qwen3.6-27B gains +111% generation speed; 35B-A3B gains are context-length dependent. The first systematic MTP benchmarks on Strix Halo hardware (Ryzen AI MAX 395, 128GB unified LPDDR5X) reveal a clear asymmetry between 27B dense and 35B-A3B MoE. For Qwen3.6-27B on a 15k-token single-turn task: generation rate went from 7.63 to 16.15 tok/s (+111%), but prompt processing slowed 12.5% due to the MTP head's dual-prefill pass; net wall time improved by 10 seconds (-11.5%). Over a 5-turn chat conversation (~28.5k cumulative context): generation improved +136% on average (7.61 to 17.98 tok/s), and turns 2-5 were 56 seconds faster overall (-26.5%) as the prompt-processing overhead amortized across turns. For Qwen3.6-35B-A3B (MoE): single-turn generation improved +16.5% (48→56 tok/s), but the dual-prefill overhead made wall time 2.33 seconds slower (+11.2%) on the same 15k-turn task. On 5-turn chat, the MoE was roughly tied (+2.3% slower). A post-publish update from the author tested ROCm 7.13 versus Vulkan: ROCm now shows +12% better prompt processing than Vulkan across all tested models — a meaningful reversal from earlier data. The pattern maps directly to Gemma 4: dense 31B benefits substantially from MTP across conversation length; MoE 26B-A4B gains less because its already-high base throughput means MTP overhead costs proportionally more. Confidence: single hardware setup, code-heavy synthetic prompts per community analysis. (source, May 16, 136 score, 57 comments)
RTX 3090 MTP at 85k context: PP halved, TG +85%, net wall time -41%. A real-world production data point from a headless RTX 3090 running Qwen3.6-27B-MTP-Q4_K_M at 128k context demonstrates the wall-time picture that per-metric numbers obscure. On an 85,000-token research task: without MTP, prompt processing ran at 1,050 tok/s and generation at 27 tok/s, completing in roughly 39 minutes. With MTP enabled (`--spec-draft-n-max 3`): prompt processing fell to 600 tok/s (-43%), generation rose to 50 tok/s (+85%), and the same task completed in roughly 23 minutes — a 41% time reduction. The key insight is that decode dominates wall time on large-context generation tasks, so even a significant PP regression translates to large net savings when TG meaningfully improves. This matches the Strix Halo multi-turn result: PP overhead matters most on the first turn of a fresh session; at 85k context with substantial output, the generation benefit compounds. Practical guidance: if your workload generates substantially more than it reads (high TG:PP token ratio), MTP is likely a net win despite the PP regression. Benchmark first if your workflow is PP-heavy — large document ingestion, RAG with many retrieved chunks, or very short output responses where TG never gets to compound. (source, May 17, 44 score, 37 comments)
RTX 5090 first-day MTP community testing confirms dense-vs-MoE asymmetry at the high-end tier. A controlled RTX 5090 MTP test (32GB, built from llama.cpp source commit 4f13cb7 with `CUDA_DOCKER_ARCH=120`, Unsloth Q5_K_M for 27B and UD-Q4_K_M for 35B-A3B, 128k context with flash attention and q8_0 KV cache) confirms what Strix Halo and RTX 3090 data show: dense 27B delivers large generation speedup; MoE 35B-A3B shows smaller fractional improvement because its high base throughput means MTP verification overhead costs proportionally more. A commenter reported 180 tok/s on dual 5060 Ti with MTP and parallel=2, confirming that parallel execution is now fully supported (an earlier limitation requiring parallel=1 is resolved). For Gemma 4 users: the architectural principles transfer directly — expect similar 2x+ generation gains on 31B Dense with MTP on code tasks; expect more modest or task-dependent gains on 26B-A4B MoE. (source, May 17, 203 score, 30 comments)
Multi-platform decode comparison: RTX 5070 beats RTX 3090 on sub-12GB models; 3090 wins on 14–31B band. A community benchmark (55 runs, 3 hardware platforms, 5 backends) compared Strix Halo ROCm, RTX 3090 CUDA, and RTX 5070 Vulkan across a range of model sizes. Key results for Gemma 4: the RTX 5070 (12GB GDDR7, Vulkan) outperforms the RTX 3090 (24GB GDDR6X, CUDA) on models that fit in 12GB — Gemma-4-E4B at 124.3 vs 118.4 tok/s. For models that require more than 12GB, the 3090 wins decisively: Gemma-4-26B-A4B scored 100.5 tok/s on the 3090 versus 43.7 (Strix ROCm) and 47.7 (Strix Vulkan). The Strix Halo systems are not competitive on models that fit in discrete VRAM but offer unmatched capacity for larger models neither discrete card can run at full quality. Community pushback flagged a methodology issue: the 5070 was benchmarked with Vulkan rather than CUDA, which may understate its performance margin over the 3090 on sub-12GB models. Practical guidance for Gemma 4 users: for E4B or E2B on a 12GB budget, the RTX 5070 generation rate is higher than the 3090; for 26B-A4B or 31B Dense, 24GB+ VRAM from the 3090 or higher is required for competitive speeds. (source, May 16, 34 score, 20 comments)
A weekly synthesis of what the r/LocalLLaMA community is reporting about Gemma 4 in real use. Curated from the latest Gemma-mentioning posts (12 new or updated since 2026-05-16, 148 total) and their top comment threads. Confidence is medium unless noted, since this is community signal rather than a controlled benchmark.
May 17 sweep, 2026-05-17 00:00 EDT: eight developments from this sweep surface a new fine-tune for creative writing and RP use cases, confirm a community-derived power-efficiency curve for multi-3090 inference rigs, validate Gemma 4 E4B's native audio transcription capability, extend MTP dense-vs-MoE evidence to million-token scale, document enterprise team deployment patterns, establish that thinking mode hurts translation tasks, add Terminal-Bench 2.0 context for Gemma 4 positioning, and resolve the GPU-vs-RAM debate for MoE inference.
Ortenzya: first quality creative writing fine-tune for Gemma 4 31B. Community developer LLMFan46 released `gemma-4-Ortenzya-The-Creative-Wordsmith-31B-it-uncensored-heretic` (May 16, 25 score), a Gemma 4 31B fine-tune targeting natural English prose quality, creative writing, translation fidelity, and RP use cases. Available in safetensors and GGUF formats. The fine-tune addresses a community-noted weakness in base Gemma 4 31B: while the model produces correct and concise prose, some users find the writing style lacks naturalness in extended creative or narrative output. Key community finding from the discussion: the base Gemma 4 31B is already "uncensored asf" for most creative use cases — the fine-tune's value is specifically prose quality and natural English, not primarily safety softening. A commenter notes the fine-tune also addresses "softening" (toned-down language without hard refusal), which matters for translation and RP tasks where the original source material has strong tone or content. Practical guidance: if you find base Gemma 4 31B too dry or stiff for creative writing, Ortenzya is the first community option to try. Confidence: anecdotal — no systematic quality benchmark against base model yet. (source, May 16, 25 score, 16 comments)
4x RTX 3090 power efficiency curve: 220W per GPU is the sweet spot — memory-bandwidth-bound decode confirmed. A systematic power-limit benchmark on a 4x RTX 3090 rig running Qwen3.6-27B at FP16 via vLLM TP=4 (May 15, 38 score; updated comments May 17) measured output generation speed and prompt-processing throughput across power limits from 200W to unrestricted (350–390W). Key result: reducing from unrestricted to 220W drops output generation from 29 to 27 tok/s while pushing efficiency from 0.77 to 1.13 tok/joule. Below 220W, both efficiency and throughput fall together (200W: 24 tok/s, 1.11 t/J). A top commenter (score 9) provided the architectural explanation: "output generation speed is flat from 300W down to 220W because decode is memory-bandwidth-bound, not compute-bound. 3090 GDDR6X bandwidth barely changes with power limit, so you hit the same ~29 t/s regardless." Prompt processing drops proportionally because prefill IS compute-bound. The hardware setup uses a PCIe Gen 3 bifurcated topology (x16/x8/x8/x4); the x4 slot is a known bottleneck, and a P2P driver patch (`github.com/aikitoria/open-gpu-kernel-modules`) that supports mixed NVLink/PCIe topologies was flagged but tested without improvement on this specific PCIe-3-limited rig. Practical guidance for multi-3090 (and general multi-GPU NVIDIA) inference: power-limiting to ~220W per card costs ~7% in output throughput but saves ~37% in power draw. The decode floor is bandwidth-limited, so power won't buy you output speed beyond ~300W. Test prefill-heavy workflows first — prefill does benefit from compute headroom and will degrade proportionally below ~250W. Anecdotal confidence (single setup, Qwen workload; principles are general). (source, May 15, 38 score, 54 comments)
Gemma 4 E4B confirmed for short multi-lingual audio transcription — not a Whisper replacement for long audio. A community practitioner report (May 12, 22 score, 9 comments) validated Gemma 4 E4B's native audio input for transcription. Key findings: E4B processes short audio clips accurately in multiple languages including foreign languages, without additional STT tooling. A top commenter confirms active use for voice assistant STT, noting the model's promptability is a practical advantage over fixed-vocabulary Whisper — you can instruct E4B to focus on specific terms, format output in a particular way, or filter filler words in the prompt. Practical limits: for audio exceeding roughly one hour, Whisper or a dedicated STT model remains necessary; E4B's context window constrains continuous transcription. Multiple commenters also noted that E2B may support the same audio input path via LiteRT-LM. This is the first direct community report of E4B transcription as a primary use case. Combined with the Jetson Orin NX SUPER finding (May 16 sweep), E4B is now documented as a viable complete voice pipeline component: STT natively, inference on-device, TTS via Piper or similar, with no cloud dependency. Confidence: anecdotal, small engagement, no controlled accuracy comparison against Whisper published. (source, May 12, 22 score, 9 comments)
MTP dense-vs-MoE finding confirmed at 1M token scale — dense 27B gains ~1.5x, MoE 35B gains under 10%. A practitioner who spent over 1 million tokens across three sessions building a pygame project with Qwen 3.6 MTP models (May 15, 127 score, 78 comments) directly confirms the MTP task-type dependency at production usage scale: the dense Qwen3.6-27B model with MTP gained approximately 1.5x tok/s; the MoE 35B-A3B gained less than 10%. A commenter adds a critical caveat: the test used `q4_0` KV quantization — already warned in earlier field notes to carry meaningful quality risk on long-context tasks. For Gemma 4 users: this is further confirmation that MTP is primarily valuable on dense models (Gemma 4 31B, Qwen 3.6 27B dense) and delivers marginal gains on MoE variants (Gemma 4 26B-A4B, Qwen 3.6 35B-A3B). The result has now been independently confirmed by the 300-test systematic analysis (May 10), the M4 Max measured results (code: 1.53x; prose: wash; JSON: 0.50x), and this million-token practitioner run. (source, May 15, 127 score, 78 comments)
Enterprise server for 7-person team: 2x RTX 6000 Blackwell MaxQ with Proxmox and vLLM — community recommends testing cloud first. A team setting up local inference for a 7-person company (May 15, 20 score, 58 comments) drew a substantive community discussion on small-team deployment patterns. The most-upvoted practical setup: a Gigabyte server with 2x RTX 6000 Blackwell MaxQ (~26k€), running Proxmox with an LXC container using Debian 13 + NVIDIA drivers + CUDA 13.2, serving Gemma 4 and Qwen models via vLLM. A key community concern: the commenter with this setup is running llama.cpp instead of vLLM on two 6000s — a top comment calls this "leaving so much performance on the floor." For multi-GPU inference of 30–35B class models, vLLM tensor-parallel is the right backend choice. The second-highest-voted response argues for API/rental first: "Use cases can quickly outgrow on-prem resources. Give people generic access, watch what they do for a month or two, then decide." A third pattern: a 1x RTX Pro 6000 with large RAM to run Kimi K2.6 for 1-2 power users who need a genuinely strong coding model. Hardware and architecture recommendations for small-team deployment: TP=4 vLLM on multi-GPU for 35B class; single high-VRAM GPU with large RAM for flexibility; validate use case demand before committing to on-prem hardware at this scale. Confidence: community discussion, multiple experienced practitioners, not a benchmark. (source, May 15, 20 score, 58 comments)
Thinking mode consistently hurts Gemma 4 translation — direct pass is preferred, two-pass is useful only for complex edge cases. Community consensus (May 13, 22 score, 17 comments) on using Gemma 4 for translation with thinking mode enabled is clear: thinking mode "wastes a lot of context thinking about it and also ends up overthinking it," and turning thinking off produces better results for direct translation tasks. A more nuanced practitioner approach from the comments: use a first pass at temperature 0 with no thinking for direct translation, then a second optional reasoning pass to review flagged segments, with KV cache prefix reuse on the second pass to minimize latency. A dedicated translation fine-tune (Qwen3-Translation, Tower) remains the community recommendation over generalist + thinking for high-volume or professional-quality needs. Practical guidance: disable thinking mode for Gemma 4 translation; reserve the optional review pass for idioms, jargon, or segments where you need explicit justification. This is consistent with the token efficiency picture — Gemma 4 is concise and direct, and adding thinking overhead to tasks that don't require multi-step reasoning adds cost without quality gain. (source, May 13, 22 score, 17 comments)
Terminal-Bench 2.0: Qwen 3.6 35B-A3B scores 24.6% and beats Gemma 4 31B on terminal coding — expected gap given dense vs MoE. The public Terminal-Bench 2.0 leaderboard now includes Qwen3.6-35B-A3B at 24.6% (±3.2) with the little-coder scaffold, placing it above Gemini 2.5 Pro on Gemini CLI (19.6%) and Qwen3-Coder-480B (23.9%). Community commentary (May 16, 243 score, 57 comments) is broadly positive but includes an important framing note: comparing Qwen 3.6 35B-A3B (MoE) against Gemma 4 31B Dense is not architecturally equivalent — the MoE uses 3.5B active parameters while the dense model uses all 31B. A commenter notes: "Gemma 4 31B is a dense model. Would not be fair to compare the Qwen MoE to it. The better comparisons would be between Qwen 27B dense and Gemma 31B." Gemma 4 31B has not yet been officially benchmarked on Terminal-Bench 2.0 as of this writing. For readers using Gemma 4 for terminal/agentic coding: this benchmark suggests Qwen 3.6 MoE leads on this specific leaderboard task; however, the community also consistently reports Gemma 4 produces higher-quality output per token on focused tasks (see the Packman benchmark and three.js creative coding findings). Neither model has a clean win across all coding task patterns. (source, May 16, 243 score, 57 comments)
GPU vs RAM debate: VRAM wins on throughput, but Gemma 4 MoE is the best case for high-RAM inference. A community debate (May 15, 63 score, 81 comments) on whether "rich RAM / poor GPU" is a viable strategy produced two clear data points. A practitioner with both 192GB RAM and a 5090 reports using RAM only for testing new models, avoiding it otherwise: "The speed gain is just too important for the too small gain on accuracy." A separate commenter (512GB across 128GB devices) notes that the Gemma 4 26B MoE and Qwen 3.6 27B dense models have changed the calculus, making 30B-class dense-equivalent quality achievable on consumer VRAM for the first time. The analytical breakdown by a third commenter: sub-7B models must be task-specific; 24–35B dense is the minimum for general-purpose quality; MoE in the 100B parameter class is viable at 128GB+ RAM with hybrid offload. The Gemma 4 26B-A4B MoE architecture — activating only 4B parameters per token — is explicitly identified as the strongest argument for the high-RAM approach: its MoE sparsity means CPU RAM throughput is not penalized as severely as a dense 26B model would be. For Gemma 4 users with a mid-range GPU (16–24GB) and 64–128GB RAM: the 26B-A4B with `--n-cpu-moe` offload is the architecture that most justifies the RAM-over-GPU strategy; the 31B Dense requires VRAM to run without significant throughput penalties. (source, May 15, 63 score, 81 comments)
A weekly synthesis of what the r/LocalLLaMA community is reporting about Gemma 4 in real use. Curated from the latest Gemma-mentioning posts (13 new or updated since 2026-05-15, 147 total) and their top comment threads. Confidence is medium unless noted, since this is community signal rather than a controlled benchmark.
May 16 sweep, 2026-05-16 00:00 EDT: three developments from this sweep surface the lowest-power embedded hardware data point to date for Gemma 4, extend context-length degradation evidence in the budget GPU tier, and confirm NVIDIA's own NVFP4 quantization path for Blackwell hardware.
Gemma 4 E4B confirmed working on Jetson Orin NX SUPER 16GB — 14–15 tok/s fully offline with 200ms cached TTFT. A community robotics project (May 15, 419 score, 61 comments) detailed a fully offline suitcase robot running Gemma 4 E4B at Q4_K_M via llama.cpp with q8_0 KV cache and flash attention, 12K context, on a Jetson Orin NX SUPER 16GB. Sustained generation: 14–15 tok/s. Cached TTFT: ~200ms after a prompt structure optimization that moved persona and tool definitions to the top of the system block, history to the middle, and volatile sensor/vision data to the bottom of the most recent user turn — a disciplined ordering that kept the prefix cache stable and dropped TTFT from multi-second to 200ms. A key benefit observed: Gemma 4's native vision capability eliminated the separate BLIP subprocess required in prior versions, simplifying the pipeline. The author uses SenseVoiceSmall for STT and Piper for TTS; all inference runs on-device with no network interface. This is an anecdotal single-device report, not a reproducible benchmark. The Jetson Orin NX SUPER 16GB is a specialist embedded GPU with ~204 GB/s memory bandwidth; expect similar results on comparable Jetson-class hardware, and lower results on Orin NX 16 (not SUPER). (source, May 15, 419 score, 61 comments)
Long-context throughput and quality degradation on the $200 GTX 1080 setup — short-context numbers don't transfer. New community comments on the budget GTX 1080 inference guide (now 97 score, 49 comments as of May 16) quantify two independent degradation curves for the 8 GB VRAM / 32 GB RAM + Gemma 4 26B-A4B setup. Throughput: tok/s drops from ~30 at 4k context to ~20 at 50k, matching the expectation that KV cache fills VRAM and forces more expert weights to page over PCIe. Quality: a separate commenter reports retrieval-heavy tasks degrade meaningfully past 32–64k context, well before the advertised 128k limit — the visible tok/s curve is not the only performance cliff. The commenter's framing: "there's a quieter second curve underneath" where output quality erodes on retrieval tasks even as generation speed appears acceptable. This tightens the practical context guidance: the GTX 1080 + TurboQuant setup is usable at 4–16k context for routine chat and code; treat 32k+ as experimental territory where output reliability is unconfirmed. The MTP fix (`--override-tensor-draft "token_embd\.weight=CUDA0"`) and prefill speedup (`-ub 4096+`) remain valid tuning regardless of context length. (source, May 13, 97 score, 49 comments)
NVIDIA released its own NVFP4 quantization of Gemma 4 26B-A4B for Blackwell GPUs. NVIDIA published `nvidia/Gemma-4-26B-A4B-NVFP4` on Hugging Face (post 1t0i18e), a first-party NVFP4 quantization targeting the RTX 5090 (SM120, Blackwell). NVFP4 is a GPU-native 4-bit floating point format specific to Blackwell and newer NVIDIA architectures; it is not GGUF Q4 and does not run on older consumer hardware. A separate community report from a Radeon 9060 XT 16GB user achieved 25.9 tok/s on an IQ4_NL GGUF of the same model via llama.cpp, providing a comparable data point from the AMD side (anecdotal, single report). Practical guidance: if you have an RTX 5090, the NVIDIA NVFP4 model is worth testing over AWQ-4bit for throughput; if you are on older NVIDIA or AMD hardware, standard GGUF quantizations remain the mainstream path. The RTX 5090 DFlash speculative decoding benchmark from May 8 (600 tok/s peak) used an AWQ-4bit model, not NVFP4; NVFP4 throughput comparisons have not yet been published by the community. (source, score 32, 11 comments)
May 15 sweep, 2026-05-15 00:00 EDT: two developments from this sweep refine KV cache quantization guidance for vLLM serving and extend the budget GPU picture with new community benchmark methodology discussion.
FP8 confirmed as the best KV cache quantization default for vLLM — TurboQuant variants offer a VRAM tradeoff, not a free lunch. A first comprehensive study of TurboQuant against BF16 and FP8 in vLLM (May 14, 64 score, 17 comments; source article) settles a frequently debated question for constrained-VRAM Gemma 4 deployments. Key conclusions: FP8 via `--kv-cache-dtype fp8` provides 2x KV cache capacity with negligible accuracy loss — it matches BF16 on most throughput and latency metrics while meaningfully improving them when VRAM is the binding constraint. TurboQuant k8v4 provides only 2.4x compression (vs FP8's 2x) but consistently degrades throughput and latency; the marginal extra compression is not worth the performance cost. TurboQuant 4bit-nc is more practical: it helps under severe VRAM pressure but trades accuracy, latency, and throughput. TurboQuant 3bit variants show meaningful accuracy drops on reasoning and very long-context tasks. A commenter notes that FP8 KV numbers "are obviously worse" compared to unquantized — users with ample VRAM should keep KV cache unquantized; FP8 is the right default only when VRAM is genuinely constrained. A second commenter provides a reassuring data point: running Gemma 4 at 128k context with TurboQuant 2-3 in a production-style load (large PDF ingestion) produced coherent answers across beginning, middle, and end of the document. These TurboQuant results apply specifically to vLLM with its PagedAttention KV management; llama.cpp's TurboQuant/RotorQuant KV implementation behaves differently and should be benchmarked separately. Critical caveat: the study benchmarks only FP8 and TurboQuant variants; no Q4 comparison is included, drawing criticism that the study misses the primary VRAM-constrained use case. (source, May 14, 64 score, 17 comments)
GTX 1080 Gemma 4 guide attracts community discussion on long-context benchmarking methodology. The May 13 budget inference guide (score climbed from 46 to 97, now 47 comments) prompted a useful community exchange about how to properly evaluate large-context performance. The original benchmarks used small prompts (under 2,000 tokens) despite reserving 128k context. New comments recommend using a large Reddit thread (40k+ tokens in JSON or markdown) as a more realistic long-context stress test — common domain content not baked into training data. The guide author is investigating a standardized benchmarking approach. Practical implication: the 20–24.5 tok/s figures for the GTX 1080 setup should be treated as short-context baselines only; actual throughput at meaningful long-context prompts will be lower because KV cache fills VRAM and forces more CPU round-trips. The `--override-tensor-draft "token_embd\.weight=CUDA0"` MTP fix remains valid regardless of prompt length. (source, May 13, 97 score, 47 comments)
May 14 sweep, 2026-05-14 00:00 EDT: five developments from this sweep extend the budget hardware picture for Gemma 4 MoE, surface practical GPU power tuning, confirm prefill tuning for partially-offloaded models, and add new guidance on vLLM vs llama.cpp for single-user workloads.
Gemma 4 26B-A4B running at ~24 tok/s on a $200 secondhand GTX 1080 machine — a new floor for budget inference. A detailed guide (May 13, 46 score) demonstrates Gemma 4 26B-A4B and Qwen 3.6 35B-A3B running on an i7-6700 / GTX 1080 (8 GB VRAM) / 32 GB RAM machine costing ~$200 secondhand via llama.cpp with TurboQuant/RotorQuant KV cache quantization. Results at Q4_K_M with 128k context: Gemma 4 26B-A4B (no MTP) ~20 tok/s with `--n-cpu-moe 20`, TurboQuant KV turbo3 on both K and V caches; after fixing the MTP token embedding table placement, ~24.5 tok/s with `--override-tensor-draft "token_embd\.weight=CUDA0"`. The key mechanism: TurboQuant/RotorQuant KV cache compression fits the KV cache within 8 GB VRAM even at 128k context, while `--n-cpu-moe` offloads the cold MoE expert weights to system RAM, streaming them over PCIe as needed. The GPU sits at ~40-50% utilization; the bottleneck is PCIe bandwidth. Important caveat from the post: the GTX 1080 test used small prompts (under 2,000 actual tokens despite 128k reservation); a commenter notes that larger real-world prompts at 128k context will degrade throughput further as VRAM is tighter with large KV. MTP barely helped out of the box (~5% gain) because Gemma 4's tied LM head forces token embedding lookups on the CPU by default; the fix flag above moves the embedding table to GPU. This is an anecdotal data point, not a reproducible benchmark baseline, and TurboQuant is not in mainline llama.cpp. But directionally, a ~$200 machine can now run a 26B MoE at interactive speeds — a meaningful lower bound for the local Gemma 4 story. (source, May 13, 46 score, 10 comments)
Cut GPU power limit to 40% TDP — no throughput loss for LLM decode, meaningful savings on power, heat, and noise. A viral post (May 12, 709 score, 198 comments) benchmarked an RTX 4090 running Qwen3.6-27B-UD-Q4_K_XL with `nvidia-smi -pl` set to various power limits. Result: reducing to approximately 40% of rated TDP (~100W for a 4090) preserves generation throughput almost identically while cutting electricity draw, heat output, and fan noise proportionally. Multiple RTX 5090 owners in the comments independently validated the finding at their own hardware (860mV/2500MHz, ~360W, with only ~12% TPS loss at the absolute voltage floor). The mechanism: LLM decode is memory bandwidth bound, not compute bound. Once the GPU's memory bus is the bottleneck, reducing compute frequency and voltage has minimal effect on bandwidth-limited operations. The result holds for any consumer NVIDIA GPU running inference workloads including Gemma 4. Practical guidance: reduce power limit incrementally with `nvidia-smi -pl` and monitor generation speed — you can reclaim meaningful electricity savings at almost no quality cost. This is a well-established finding now backed by community data across multiple GPU generations. (source, May 12, 709 score, 198 comments)
Raising llama.cpp `-ub` to 4096-8192 gives ~5.5x prefill speedup for partially CPU-offloaded MoE models. A guide (May 12, 112 score, 53 comments) discovered that increasing the micro-batch size (`-ub`) from llama.cpp's default 512 to 4096 or 8192 dramatically improves prompt processing throughput for `--n-cpu-moe` partially-offloaded models. Measured on an RTX 3090 with a 120B model: prompt processing improved from ~380 tok/s at default `-ub 512` to ~2091 tok/s at `-ub 8192` — a ~5.5x gain. Generation speed was nearly unchanged (32.3 → 30.1 tok/s, ~7% regression). The mechanism, debated in comments: either amortizing PCIe transfer overhead across more tokens (reducing per-transfer round-trip cost) or reducing GPU kernel launch overhead by saturating the attention/router on fewer, larger batches. Both explanations are consistent with the observation. The default 512 exists because it's a safe conservative value for low-VRAM cards that have little headroom for compute workspace spikes. Users with spare VRAM should tune upward and stop when either VRAM OOM or generation speed starts to regress. This applies directly to Gemma 4 26B-A4B when partially offloaded — pair with `--n-cpu-moe` adjustment to keep the run within VRAM at the chosen `-ub`. (source, May 12, 112 score, 53 comments)
vLLM vs llama.cpp for single-user workloads: confirmed equivalent at low concurrency, vLLM wins at 4+ concurrent users. A community discussion (May 12, 75 score, 91 comments) produced a clear practical consensus. vLLM adds meaningful value when: (1) concurrent batch inference is in play — vLLM allocates VRAM per-batch as context grows while llama.cpp must pre-allocate max-context KV VRAM at launch; (2) tensor-parallel multi-GPU/multi-node serving is needed (e.g., Qwen 397B across two DGX Sparks). vLLM also supports MTP for Gemma 4 and Qwen3.6 already, while llama.cpp MTP is still in a patched fork. For single-user non-batched local use, llama.cpp remains simpler with equivalent per-query throughput. CUDA prompt processing is faster in vLLM regardless of batch size. AMD Lemonade now ships vLLM ROCm as a built-in experimental backend. This confirms and sharpens the earlier guidance: if you are a solo user running interactive chat or coding sessions, llama.cpp or LMStudio is fine; switch to vLLM when you need to serve multiple concurrent users or run tensor-parallel inference on model weights too large for one GPU. (source, May 12, 75 score, 91 comments)
Docker images simplify llama.cpp MTP deployment — confirmed +34% throughput on RTX 3090. A community developer (May 13, 63 score, 16 comments) released Docker images pre-built from the llama.cpp MTP development branch, removing the barrier of building from source. A commenter reports +34% throughput gain on an RTX 3090 after switching. The images track recent MTP branch improvements including image support and bug fixes. A commenter asks whether Gemma 4 is supported; the Docker images cover the same model classes as the underlying MTP PR (primarily Qwen3.6 for now). For Gemma 4 MTP, the mainline llama.cpp PR #22673 is still in review; until it merges, the AtomicBot-ai patched fork remains the llama.cpp path for Gemma 4 MTP specifically. Recommended flag addition from comments: `--min-p 0.0` (default 0.1 can interfere with speculative decoding). (source, May 13, 63 score, 16 comments)
May 13 sweep, 2026-05-13 00:00 EDT: five developments from this sweep extend the MTP vs DFlash picture, surface a supply chain signal for Apple Silicon buyers, add new practical limits for Gemma 4 E4B in code use cases, and document a home-server hardware comparison from someone who owns both the Strix Halo and DGX Spark.
First controlled head-to-head benchmark of Gemma 4 MTP vs DFlash on a single H100 — MTP wins at concurrency. A community benchmark (May 12, 62 score, 22 comments) ran Gemma 4 31B Dense and 26B-A4B MoE against both MTP and DFlash on a single H100 80GB using vLLM and NVIDIA's SPEED-Bench dataset (880 prompts, 11 categories). Results for 31B Dense: at concurrency 1, MTP hit 125.3 tok/s (3.11x over baseline 40.3) and DFlash hit 122.1 tok/s (3.03x). At concurrency 16, MTP reached 953 tok/s versus DFlash's 725 tok/s versus baseline 375 tok/s — a meaningful gap in favor of MTP at higher concurrency. The architectural explanation from commenters: DFlash generates a larger speculative batch via diffusion but has lower acceptance rate per token; MTP is autoregressive with higher per-token acceptance, so at scale its advantage compounds. Practical guidance: at concurrency 1 the two methods are nearly equivalent; at concurrency 4+ for serving multiple users, MTP outperforms DFlash by a widening margin. This is the first benchmark to quantify the concurrency dimension — prior guidance focused on single-user latency where both methods were close. DFlash's lower acceptance rate with the diffusion-based approach means more compute spent on rejected tokens under load. Still vLLM-only for both methods; no mainline llama.cpp path yet. (source, May 12, 62 score, 22 comments)
Apple removes M3 Ultra 256GB Mac Studio — M5 expected, but supply chain is under stress. Apple pulled the M3 Ultra 256GB Mac Studio configuration from its online store (May 9, 462 score, 132 comments). The top community read: M3 is being phased out ahead of an M5 Mac Studio launch, not a deliberate memory cap decision. Technical context: M3 and M5 use incompatible DRAM types (LPDDR5-6400 vs LPDDR5x-9600), so M3 chip stock is not convertible to M5 builds. An independent complicating factor: a Samsung DRAM worker strike cut production capacity by 58% on one shift. Community concern about M5 Ultra memory configurations is real but largely speculative — no M5 Ultra specs have been announced. Practical impact for Gemma 4 Apple Silicon users: the M3 Ultra 256GB, which was the best available option for running Gemma 4 31B Dense at full BF16 precision with context headroom, is no longer orderable. Anyone actively planning a high-memory Apple Silicon build for Gemma 4 should wait for M5 Ultra pricing and configuration announcements before committing. The 192GB M3 Ultra (if still available) or used M2 Ultra 192GB remain the current options if you need maximum unified memory now. (source, May 9, 462 score, 132 comments)
Gemma 4 E4B produces poor results for code autocomplete (infill) — use Qwen 2.5 Coder 7B instead. A practitioner post (May 12, 36 score, 30 comments) sharing a working RTX 5080 16GB + 64GB RAM coding setup explicitly evaluated Gemma 4 E4B for code autocomplete infill alongside Qwen 3.5 9B/4B. The author's conclusion: E4B and the Qwen 3.5 small models "produce weird suggestions" for infill and were rejected in favor of Qwen 2.5 Coder 7B Q6_K_L, which runs at instant-feeling speeds on 8GB VRAM. The same setup uses Qwen 3.6 35B-A3B at Q8 for agentic coding tasks (the higher quant is important; the author notes Q4 is not usable for agentic work). This is the first practitioner report directly comparing Gemma 4 E4B against alternatives for the code autocomplete fill-in-the-middle (FIM) use case. Confidence: anecdotal, single data point. But it aligns with the known limitation that E4B's instruction-following strength does not automatically transfer to the FIM pattern, which requires a different training signal. The guidance: do not assume E4B works for code infill — test it on your IDE and task type before committing. (source, May 12, 36 score, 30 comments)
"Decoupled Attention from Weights" for Gemma 4 26B: community verdict is skeptical. A post (May 6, 40 score, 27 comments) announced a technique to "split attention (a couple of GB) onto local machine and weights onto a cheap Xeon" for Gemma 4 26B, with a GitHub repository (larql/vindex). Community response was immediate and critical. Top comments: the technique is reported to run approximately 23x slower than standard inference; the underlying mechanism is equivalent to llama.cpp's existing RPC multi-node functionality with network latency added; sequential layer dependencies prevent any parallelism benefit from splitting attention vs. weights. The post author acknowledged the concerns and withdrew from further claims pending personal experimentation. The technique remains unvalidated as a practical inference improvement. This matters for lab readers who may have seen the post circulate with excited framing: there is no new local-inference breakthrough here. For distributed inference, llama.cpp RPC and vLLM expert-parallel deployment are the established options. (source, May 6, 40 score, 27 comments)
Strix Halo 128GB vs DGX Spark for home Gemma 4 inference — owner of both says Spark wins on throughput, degrades less at long context. A community question post comparing the Framework Desktop (Ryzen AI Max+ 395, 128GB unified memory, $3,388) against the Asus Ascent GX10 DGX Spark ($3,500) for running Gemma 4 31B and 26B-A4B as a local LLM server drew 91 comments (May 11, 21 score). The decisive data point: a commenter (score 31) who owns both systems reports "Spark has much faster GPU which results in faster prompt processing speeds. Also, the performance degrades less on Spark as context grows." Community consensus aligns on a clear split: Spark for pure LLM inference; Strix Halo for general-purpose or hybrid workloads where repurposability (standard x86/amd64 Linux, GPU gaming, everyday tasks) matters. The counterpoint for Strix Halo from a top commenter (score 41): "Definitely Ryzen 395, as it's a standard x86/amd64 machine that can always be repurposed and will never lose drivers or compatibility with new operating systems. Nvidia on the other hand has a history of abandoning their proprietary ARM SoC." DGX Spark runs ARM Ubuntu with a DGX software package; the same commenter who owns both notes Fedora also works with some tweaks. Practical guidance for anyone in the $3,400–$3,500 range targeting Gemma 4 31B: the DGX Spark delivers faster discrete GPU throughput and better long-context scaling; the Strix Halo 128GB unified memory trades some raw inference speed for a more flexible, repurposable machine. Neither is a clear wrong choice; the tradeoff is inference specialization vs. general-purpose longevity. Anecdotal confidence; the owner-of-both data point is the strongest signal. (source, May 11, 21 score, 91 comments)
May 12 sweep, 2026-05-12 00:00 EDT: four findings from this sweep extend the inference backend, edge deployment, and small-model picture.
ExLlamaV3 gains Gemma 4 support and DFlash — up to 2.51x coding speedup on consumer NVIDIA GPUs. ExLlamaV3, the successor quantization and inference engine from turboderp, has reached a run of rapid updates directly relevant to Gemma 4. Version 0.0.29 added Gemma 4 model support; version 0.0.31 added DFlash speculative decoding with measured results (from the post, testing on RTX 3090 and 4090): coding tasks 55.98 → 140.61 tok/s (2.51x), agentic code 55.98 → 140.61 tok/s (2.51x), translation 58.11 → 75.73 tok/s (1.30x), creative writing 59.10 → 89.19 tok/s (1.50x). Version 0.0.32 added further model optimizations. ExLlamaV3 requires RTX-class Nvidia CUDA hardware. Unlike the vLLM DFlash path (server-only), ExLlamaV3 is accessible via a Python API suited to single-user setups. The coding speedup is consistent with the vLLM DFlash benchmark (2.56x on RTX 5090); creative writing DFlash improvement is smaller but still positive, unlike MTP which can slow creative tasks. For single-GPU Nvidia users who want DFlash without a full vLLM server deployment, ExLlamaV3 is now a viable path. Confidence: the throughput numbers come directly from the community post; the comparative claim vs vLLM requires independent verification. (source, May 11, 141 score, 61 comments)
Gemma 4 E4B confirmed best-in-class at the 2–4B tier — but quantization quality matters significantly. A community thread asking "what's the current best small model?" (May 11, 26 score) drew strong consensus: Gemma 4 E4B is the top recommendation at the ~3B parameter class, with multiple independent reporters calling it "hands down the best, no arguing." A first-hand practitioner report adds an important caveat: Q8_0 quantization is "kinda bad and mid" for E4B — Q8_XL or BF16 is "night and day" better on tested tasks. A separate commenter confirms E4B "never loops" and "effectively uses the whole 131k context window" — the zombie-loops pattern documented in earlier field notes for larger quantized models does not appear on E4B. The consensus best competitors for the 3B class are smollm3, Granite 4.1, LFM2/2.5, and Qwen 3.5 4B. Community read: Gemma 4 E4B for general instruction following; Qwen 3.5 4B for tasks where a reasoning chain is needed. If running E4B, Q8_XL or BF16 is strongly preferred over Q8_0. Anecdotal confidence. (source, May 11, 26 score, 44 comments)
First documented in-browser Gemma 4 deployment controls a physical robot over WebSerial. A community developer shared a demo of Gemma 4 running fully offline in a browser via Transformers.js on WebGPU, processing camera frames and sending commands to a Reachy Mini robot over the WebSerial API (May 11, 49 score). The model never contacts a server: inference happens entirely on the client GPU via WebGPU, and motor commands go directly over USB/serial via the browser's WebSerial interface. A commenter notes the architectural benefit: "model sees camera/frame state, JS does the motor command, nothing leaves the machine." This is the first documented Gemma 4 use case in the browser-as-inference-engine + physical-actuator pattern, enabled by Transformers.js and the small footprint of Gemma 4 E-series models. The specific variant was not named; the constraint is WebGPU VRAM, which limits practical options to E2B or E4B. No throughput figures were published; treat as a proof-of-concept rather than a production guidance baseline. (source, May 11, 49 score, 9 comments)
Practitioner pattern: Gemma 4 26B for quick interactive fixes, Qwen 3.6 35B for long-context refactoring. A high-engagement discussion thread on Qwen 3.6 35B-A3B (May 11, 333 score, 103 comments) contains a direct Gemma/Qwen split from a practitioner who runs both: "Gemma 26B in thinking mode for quick code fixes and chats, Qwen 35B in thinking mode for longer contexts and refactoring. Qwen 35B rambles on and on before it spits out the final output so I only use it for tasks that I don't mind waiting for." This two-model hybrid pattern — Gemma 4 for latency-sensitive interactive tasks, Qwen 3.6 for depth-first long-context work — is now documented by multiple independent practitioners across several weeks of field notes. The pattern holds whether the user prioritizes speed, quality, or token efficiency: Gemma 4 26B finishes short tasks fast and concisely; Qwen 3.6 35B is more thorough but verbose. A second data point from the same thread: an RTX 3090 24GB + 64GB RAM user (Beelink eGPU dock) reports Qwen 3.6 35B-A3B "blazing fast" with llama.cpp after tuning settings, switching from LM Studio, with Gemma 4 26B as the secondary model for interactive chat. (source, May 11, 333 score, 103 comments)
May 11 sweep, 2026-05-11 00:00 EDT: four findings from this sweep extend the MTP and creative-coding picture.
MTP task-type dependency confirmed by systematic 300-test analysis — dense models benefit far more than MoE. A careful benchmark author published the most rigorous community MTP analysis to date (May 10, 67 score, 24 comments). Over 300 test runs covering four task types, five quantization levels, three temperature values, and two MTP quant settings produced a clear finding: F16 + MTP nearly triples coding-task speed; Q4_K_M + MTP slows creative writing output. Temperature and MTP quant have negligible impact; task type is the only factor that matters. An RTX 5090 user in the comments reported ~70% acceptance rate for coding tasks at --spec-draft-n-max 4, with 70–120 tok/s sustained at 70–160k context on Q6. Expert commentary confirms the MoE penalty: MoE models like Gemma 4 26B-A4B must cycle through more experts per speculative token than dense models, so the overhead is proportionally higher — a Radeon AI Pro 9700 user saw prompt-processing speed drop from 1,400 tok/s to 650 tok/s after enabling MTP. Dense models (Gemma 4 31B, Qwen 3.6 27B full) are the primary beneficiaries; for MoE variants, MTP helps only on coding tasks with high acceptance rate. Practical rule: benchmark before assuming MTP helps on your specific workload. (source, May 10, 67 score, 24 comments)
Gemma 4 26B-A4B excels at one-shot creative coding tasks where Qwen consistently falls flat. A practitioner shared an automated three.js prompt cycling test (May 10, 38 score, 23 comments): a Python app cycles through 80 creative-coding prompts, generates single-file HTML/WebGL outputs, detects crashes, and archives the results. Gemma 4 26B-A4B one-shot generation quality was consistently high on 3D graphics and demoscene-style effects. The same author states Qwen 3.6 "falls flat on its face for just about anything I throw at it" in the creative context. A third commenter summarizes the emerging community consensus: "Gemma has more personality to it; Qwen is better for facts and coding." This creative-coding strength is now documented by at least two independent practitioners — the Packman racing game comparison thread (May 9) and this three.js cycling tool — and represents a consistent divergence from Qwen's strengths. For creative coding and single-file generative output, Gemma 4 26B-A4B appears to be the stronger local option at the 26–31B weight class. (source, May 10, 38 score, 23 comments)
vLLM ROCm added to Lemonade as experimental AMD backend — community wants Gemma 4 MTP support. AMD engineer jfowers announced the integration of vLLM ROCm into the Lemonade SDK as an experimental backend (May 8, 433 score, 90 comments). Installation is now two commands: `lemonade backends install vllm:rocm` followed by `lemonade run
Gemma 4 for language learning: correction-loop prompting pattern works; SillyTavern multi-character setups in active use. A language-learning thread (May 9, 23 score, 19 comments) surfaces a practical deployment pattern for Gemma 4 in education. The most-upvoted comment describes a correction loop: the model answers in three lanes (reply in target language, grammar correction, and explanation of why) while only marking one grammar error and one phrasing suggestion per turn to prevent homework-session overload. One commenter has been using Gemma 3 and then Gemma 4 continuously for German practice, noting it handles verb separation (Trennbare Verben) imperfectly but is broadly helpful for vocabulary connections across Romance languages. A SillyTavern multi-character practitioner reports actively using LLMs for Arabic, French, Portuguese, and Spanish practice across multiple character personas. Gemma 4's instruction-following fidelity — its consistent ability to stay in the target language and maintain a role when prompted — is what makes this use case work. No hardware specifics were shared, suggesting this is primarily a quantized local model use case compatible with standard consumer hardware. (source, May 9, 23 score, 19 comments)
May 10 re-check, 2026-05-10 01:00 EDT: three developments from this sweep reinforce and extend findings from May 9.
Practitioner survey confirms use-case split: Gemma 4 for instruction-following, prose, and games; Qwen for code. A second independent use-case thread (May 9, 20 score, 42 comments) drew direct practitioner reports of what they reach for Gemma 4 specifically. Common answers: generating narrative responses for NPCs in video games (E2B is cited here explicitly), writing PRDs and product specification documents using Gemma 4 31B and then handing implementation to Qwen, and structured tasks where instruction-following fidelity matters more than raw reasoning depth. The most-cited single-sentence summary from the thread: "best instruction-following of any open-weight model I've tried." This is the second large practitioner survey in as many weeks — after the 94-score May 6 thread — reaching the same structure: Gemma 4 is the answer when the task is open-ended instruction compliance or voice/tone matching, and Qwen is the answer for multi-turn agentic code execution. The split is now documented from two independent data points with combined 114 score and 169 comments. (source, May 9, 20 score, 42 comments)
MTP in llama.cpp: Georgi unifying speculative decode architecture before any merge lands. A thread asking how long until official llama.cpp MTP support (May 9, 68 score, 46 comments) surfaced a clarification from Georgi Gerganov: he is building a unified speculative decode architecture that covers MTP, Eagle3, and DFlash together — rather than merging each independently. All three methods will land in one correct implementation rather than piecemeal patches that create technical debt in the speculative decode path. This explains why PRs like #22673 (Gemma 4 MTP) and #22105 (DFlash) have been slow to merge despite being functional. No timeline was given; this is active in-progress work, not a planned milestone. Users who need MTP now should use the AtomicBot-ai patched fork (TurboQuant path) or the omlx runtime on Apple Silicon. The unified refactor, when it lands, should give llama.cpp native parity with vLLM on speculative decode across all three methods simultaneously. (source, May 9, 68 score, 46 comments)
Practical deployment: Gemma 4 on Mac Mini drives MCP server at full interactive speed. A first-person report (May 9, 29 score) confirms that Gemma 4 running on a Mac Mini runs fast enough to serve as the backend for a Model Context Protocol server at full interactive speed — with native tool calling, at zero cloud API cost. This is a concrete production data point for the "Gemma 4 as a free local MCP backend" deployment pattern: the model's tool calling quality and throughput are sufficient for MCP server workloads on current consumer Apple Silicon hardware. No hardware specifics (exact chip, RAM size, model variant, quant) were disclosed, so treat the speed claim as an existence proof rather than a precise benchmark target. The finding is consistent with the broader practitioner picture: Gemma 4 at the right hardware tier delivers cloud-grade instruction following with no recurring API cost. (source, May 9, 29 score)
May 9 re-check, 2026-05-09 01:00 EDT: six significant developments from this sweep.
DFlash for Gemma 4 26B MoE is live — 2.56x speedup in vLLM, 600 tok/s peak on RTX 5090. z-lab released gemma-4-26B-A4B-it-DFlash a few days ago; community benchmarks hit the site on May 8. A controlled vLLM benchmark (RTX 5090 32GB, vLLM 0.19.2rc1) measured baseline 228 tok/s → 578 tok/s at num_speculative_tokens=13 (2.56x speedup) on a 256-input / 1024-output random workload at concurrency 1. Optimal tuning: max_num_batched_tokens=8192 gave the cleanest p95 tail at that speculation depth, with mean E2E latency dropping from 4455ms to 1738ms. Critical community caveat: DFlash drops sharply at approximately 20k context. One commenter testing the same 5090 at 35k context reports speed starting at 400 tok/s but dropping quickly to 200 tok/s and continuing to degrade, with malformed tool calls. For short-to-medium context inference this is a compelling gain; for long-context agentic workloads it is not yet practical. On the DFlash vs MTP comparison: DFlash uses stateful parallel block diffusion drafting with persistent KV cache positions; Gemma 4's MTP implementation uniquely reuses the main model's KV cache, avoiding the memory pressure that afflicts MTP on other architectures. Both require vLLM for DFlash or a patched llama.cpp fork for MTP — no merged mainstream path exists yet. (DFlash benchmark, 99 score; DFlash release discussion, 114 score)
MTP acceptance rate determines whether it helps or hurts. A controlled M4 Max Studio study with Gemma 4 26B-A4B reveals that MTP benefit varies entirely by workload acceptance rate. Measured: code generation 66% acceptance → 1.53x speedup; long-form prose 31% acceptance → essentially no gain (0.95x); JSON structured output 8% acceptance → 0.50x (twice as slow). The mechanism: when the draft model's speculative tokens are rejected, the full model must re-run the verify step with no net gain — at 8% acceptance the overhead dominates. Expert commentary adds two important nuances: first, MoE models like Gemma 4 26B-A4B are harder to speculate in than dense models because spare compute for draft verification is limited; second, Apple Silicon before M5 has limited headroom, and dense Gemma 4 31B is expected to see better MTP gains than the MoE 26B on the same hardware. Practical guidance: MTP is worth enabling for structured code generation and predictable outputs; disable it for free-form prose and especially for JSON schema output, where it reliably degrades performance. Always benchmark before assuming benefit. (source, 24 score, 8 comments)
Multi-GPU topology insight: NVLink pairing beats full tensor parallelism. A detailed benchmark with 4×RTX 3090 (NVLink between GPU pairs 0↔2 and 1↔3, vLLM 0.20.1, CUDA 12.8) found that pinning TP=2 to an NVLink-bonded pair delivered +25% throughput at concurrency 1 and +53% at concurrency 4 compared to running TP=2 over PCIe. Counter-intuitively, expanding to TP=4 across all four GPUs was worse — cross-pair PCIe bus traffic added latency that outweighed the additional capacity. This applies directly to Gemma 4 31B Dense deployment on NVLink-equipped multi-GPU workstations: prefer TP=2 on your NVLinked pair over TP=4, even when you have four GPUs. Tested here with Qwen 3.6 27B AWQ as the workload model; the topology principle holds for any model requiring tensor parallelism across these GPUs. (source, 44 score, 36 comments)
TurboQuant + MTP on RTX 4090: 80-87 tok/s at 262K context — quality claims contested. A demonstration showing TurboQuant quantization combined with MTP on Qwen 3.6 27B reports 80-87 tok/s generation at a 262K context window on a single RTX 4090 (60 score, 42 comments). The numbers are eye-catching, but community pushback on quality was significant: the demonstration used a simple Q&A prompt and did not test accuracy on long-context retrieval tasks where TurboQuant's aggressive compression can degrade meaningfully. TurboQuant is the method from the AtomicBot-ai fork — the same project that shipped the first Gemma 4 MTP implementation for llama.cpp — and it is not merged into mainline llama.cpp or any standard quantization library. The combination of unverified quality and non-mainline tooling means the throughput claim is directionally interesting, but the practical recommendation remains: use quantization methods with published quality benchmarks on your target workload before optimizing around throughput numbers. (source, 60 score, 42 comments)
HTX301 PCIe inference card announced: 384GB at 240W, community skeptical. Taiwanese company Skymizer announced the HTX301, a PCIe inference card with 384GB memory and a 240W TDP (250 score, 103 comments). At face value the memory capacity is striking — 384GB would fit Gemma 4 31B Dense at BF16 with enormous headroom, or multiple models simultaneously. Community reaction was measured skepticism: the announcement contains no memory bandwidth specification, no compute FLOPS figures, and no pricing. Memory capacity without bandwidth is meaningless for LLM inference decode throughput, where bandwidth is almost always the bottleneck. Several hardware-knowledgeable commenters compared it unfavorably to AMD MI300X (192GB at ~5TB/s bandwidth) and suggested the 240W TDP implies a modest memory subsystem relative to the 384GB capacity. Worth tracking if independent benchmarks appear with validated bandwidth figures; do not plan deployments around the headline memory number alone. (source, 250 score, 103 comments)
vLLM ROCm added to Lemonade: AMD GPU users can now run inference before GGUF conversion. The Lemonade server added vLLM ROCm as an experimental backend, enabling inference from standard model weights on AMD GPUs without first converting to GGUF format. This reduces workflow friction for Radeon 6000/7000-series users on Linux who want to test Gemma 4 variants under ROCm. The backend is marked experimental; community verification of Gemma 4 on ROCm via Lemonade is sparse, so validate on your specific GPU before relying on it for production workloads. AMD GPU users for whom GGUF conversion was the primary friction point now have a faster path to initial evaluation. (source)
May 8 re-check, 2026-05-08 01:00 EDT: three new developments from this sweep worth recording.
MTP now working in llama.cpp for Gemma 4 — 40% decode speedup on M5 Max. A community developer (May 8) implemented Multi-Token Prediction for llama.cpp, quantized Google's new Gemma 4 assistant GGUF models, and tested on a MacBook Pro M5 Max. Measured result: 97 tok/s baseline → 138 tok/s with MTP, a 40% speedup. This uses the new Google-released MTP draft models (Gemma-4-26B-A4B-it-assistant) and a patched llama.cpp fork available at AtomicBot-ai; the patch is not yet merged into mainline llama.cpp. Key distinction from the omlx finding (below): this is llama.cpp-based MTP — relevant to Linux and Windows users who cannot use MLX. Commenters note the quality comparison between baseline and MTP outputs used different seeds and temperatures, so "40% faster with identical quality" requires verification at temp=0 with fixed seed; take the exact ratio as approximate. The directional finding (meaningful speedup via MTP on llama.cpp for Gemma 4) is credible given the confirmed mechanism. (source, 95 score, 19 comments)
MTP confirmed working on Apple Silicon via omlx runtime. A direct first-hand report (May 7) confirms that the new Google MTP draft models work with the omlx runtime on M1 Max 64GB, nearly doubling decode speed from 11 tok/s to 20+ tok/s at max wattage. Standard MLX (the more widely used Apple Silicon inference library) does not yet support MTP — the omlx runtime is a separate fork-based project. On the technology: MTP only benefits decode (generation) speed, not prefill — prefill processes the full input in parallel by design, so there is nothing to speculate ahead. Commenters clarified a common confusion: some third-party projects advertise "speculative prefill" as a distinct feature, but this involves lossy KV cache population (not mathematically equivalent to standard generation); lossless MTP applies only to the decode phase. For Apple Silicon users: omlx is the current fastest path to Gemma 4 MTP; native MLX support is pending. (source, 21 score, 22 comments)
Prompting sensitivity: Gemma 4 and Qwen 3.5 need different prompting than Qwen 3.6. A controlled test (May 7) ran two phrasings of the same math-word problem against Gemma 4 31B, Qwen 3.5, and Qwen 3.6 27B — 10 runs each (6 combinations). The headline result: the models respond very differently depending on phrasing, and Qwen 3.6 proved most robust to ambiguous phrasing while Gemma 4 and Qwen 3.5 performed better on the clearer of the two prompts. Key practical takeaway: Gemma 4's accuracy on reasoning tasks is sensitive to prompt clarity. Concise, unambiguous prompts tend to get better results than elaborated prompts that contain implicit assumptions. Quantization also matters: IQ2-quantized Qwen 3.6 underperformed Q8 on the same task, reinforcing the known guidance to prefer higher quants for reasoning workloads. This finding complements the token-efficiency story: Gemma 4 finishes tasks in fewer tokens, but benefits from being asked precisely. (source, 28 score, 13 comments)
May 7 re-check, 2026-05-07 01:00 EDT: two new developments from this sweep that add meaningful signal.
Community use-case survey crystallizes where Gemma 4 wins. A widely-upvoted discussion thread (94 score, 127 comments, May 6) asked practitioners directly what they use Gemma 4 for versus Qwen 3.6. The answers converge on a clear pattern: Gemma 4 is the preferred choice for vision and OCR ("Gemma trounces Qwen for handwriting analysis and general vision tasks"), bug tracing ("Gemma4 is really, really, really good at tracing bugs — much more consistent and reliable for finding the actual root cause"), translation especially Japanese and smaller European languages (independently confirmed across multiple reporters), creative writing, tone-sensitive text, and RAG over structured documents. Qwen 3.6 is preferred for agentic coding, multi-turn tool use, and long agentic loops. The niche-split that has been building across weeks of field notes is now directly confirmed from first-person practitioner reports. One practitioner summarizes: "For things I want to go fast, don't require accuracy or rely mostly on the vision encoder: Gemma4-26B-A4B. For where accuracy and nuance are important: Gemma4-31B. I prefer Qwen3.6 for anything programming or toolcalling related." The survey also confirms that translation quality holds at an unusually high bar: Gemma 4 is rated best open-weight option for Japanese→English, with one commenter noting it is "entirely undisputed" for open models on translation tasks. (source, 94 score, 127 comments)
Prompt injection defense: Gemma 4 E4B jumps from 21% to 100%. A benchmark study (6100+ tests across 15 models, 7 attack types) found that Gemma 4 E4B went from 21.6% to 100% defense rate when the untrusted input was wrapped in a long random delimiter and the model was explicitly told not to execute injected instructions. This was the largest absolute improvement of any tested model (+78.4 percentage points) and the only model to reach a perfect score. Tested attack types included role hijack, authority claims, and fake delimiters. The benchmark used hand-crafted payloads rather than SOTA adversarial search, so the defense rate may be lower against gradient-based attacks. Practical takeaway for RAG and web-document pipelines: the delimiter + strict-prompt defense is a high-ROI hardening step for Gemma 4 deployments that process untrusted external content. (source, 24 score)
Morning re-check, 2026-05-02 08:30 EDT: a follow-up sweep against the past 24 hours of r/LocalLLaMA confirmed three additional posts worth recording. A first-hand AMD Radeon 9060 XT 16GB report (eGPU on a 7840HS mini-PC) lands the 24B A4B IQ4_NL variant at 25.9 tok/s with KV cache at q8_0 and a small 256-token target. More importantly, two independent posts within fourteen hours documented an emerging "zombie loops" failure mode on both Gemma 4 and Qwen 3.6 with quantized KV cache during thinking mode. The convergent expert reading is that q4_0 KV quantization accumulates drift across hundreds of internal reasoning tokens until the model falls into a repetition attractor. This pattern is now strong enough to call out as a known limit (see below).
Evening re-check, 2026-05-02 17:45 EDT: the post-PR #82 sweep found two new high-signal items rather than a broad hardware shift. First, a local vLLM/FP8 vision comparison reports Gemma 4 staying much more concise on messy real-world image prompts, often around 1,500 thinking tokens where Qwen 3.6 can burn 8,000+ tokens and sometimes fail to finish. The same report says Gemma 4 followed normalized 0 to 1 bounding-box JSON instructions more reliably, while Qwen 3.6 did better on the tested 2 FPS deadlift video tracking case. Second, an SGLang production report identified an FP8 KV-cache bug for models with per-layer KV scales, explicitly including Gemma 4, where radix-cache prefix hits can silently corrupt output unless the deployment uses BF16 KV cache or the upstream fix lands. This reinforces the current guidance: for long-context or thinking-mode work, treat KV-cache precision and serving backend as quality controls, not just speed knobs. (vision source, SGLang source, PR #24198)
May 3 re-check, 2026-05-03 01:00 EDT: a new sweep surfaced two notable developments. First, a dedicated KV cache quantization discussion (source, 77 comments) provided the architectural explanation for the zombie loops pattern previously documented: Gemma 4 uses an interleaved Sliding Window Attention (iSWA) mechanism that is structurally more sensitive to KV precision loss than dense models or Qwen-style MoE. The expert comment reads directly: "Gemma 4, due to its iSWA architecture, is apparently much more sensitive to KV cache quantization." Dense architectures accumulate less rounding error per attention step; iSWA's alternating local and global windows amplify quantization noise differently. The practical implication is stronger than the zombie-loops framing: KV precision for Gemma 4 is an architecture-level quality control, not just a safety precaution for thinking mode. Second, follow-up comments on the "Qwen 3.6 wins benchmarks, Gemma 4 wins reality" vision post (source) added two confirming voices worth recording. A commenter with the opposite finding (Qwen 3.6 follows instructions better) attributes the divergence to "backend/harness influence," underscoring that task setup and serving backend matter for the comparison. A second commenter elaborates: "Gemma is much better at short one shot, but because of its architecture it struggles with long context. There is something about its attention mechanism and its also far more sensitive to KV quantization." On the multilingual dimension, a confirmed data point: Gemma 4 is "a much better LLM than Qwen for anyone that doesn't use English or Chinese as their primary language, especially for European languages." Third, the RTX 6000 Pro guidance from May 2 received an important nuance from a card owner: "performance between vllm, sglang etc is the same as LMStudio until you move onto 4 or more concurrent pulls, then vllm and sglang are better." (source) This corrects the blanket recommendation: for single-user workloads on professional GPUs, llama.cpp-based tools remain competitive; the vLLM/sglang advantage appears primarily at 4+ concurrent requests.
Evening re-check, 2026-05-03 17:05 EDT: the post-PR #86 sweep found two new data points. First, a developer shipped a production Android voice notes app using Gemma 4 E2B (2.4GB) via LiteRT-LM on a OnePlus CE 5 (8GB RAM). The measured end-to-end latency for a 10-15 second voice note is 12-15s: Whisper Small (Sherpa-ONNX) handles transcription in ~5s, Gemma categorizes and extracts structured JSON in ~8-10s. The developer reports JSON output reliability as "way better than expected from a 2.4GB model on a phone" — a strong signal that Gemma 4 E2B's instruction-following quality holds well under aggressive quantization on ARM. Notably, commenters suggest the separate Whisper step may be unnecessary since E2B may support native voice-to-text natively via LiteRT-LM. (source, score 18, 14 comments) Second, a community survey of Gemma 4 31B on smaller European languages confirms the multilingual advantage holds above the 100B MoE tier: multiple independent reporters conclude that Gemma 4 31B beats Qwen 3.5 122B and Mistral 4 119B for Czech, Hungarian, Slovak, and Dutch. The data comes with a precision note: quantization hurts multilingual quality more than English quality, so the comparison is most meaningful at BF16/FP16 — a 16-bit Gemma 4 31B is "extremely good in Hungarian" while the same model at 8-bit shows "slightly Chinese" output contamination. The practical guidance: if your use case is primarily a smaller European language, Gemma 4 31B at high precision is a better choice than any current 100B MoE at standard quantization. (source, score 8, 14 comments)
May 4 re-check, 2026-05-04 01:00 EDT: two new developments worth recording from the latest sweep. First, a report on running llama.cpp via the Snapdragon Hexagon NPU adds early data for mobile NPU inference with Gemma models. The NPU path itself is battery-efficient but constrained: the Hexagon NPU can only address 4GB of RAM, making it unsuitable for anything larger than the smallest Gemma variants without splitting across multiple NPU device instances. In practice, community testing found Gemma 4 E4B achieves 11-14 t/s on a OnePlus 13 (Snapdragon Elite) via the Android Edge APK (GPU path), not NPU. The NPU path on the same chip produced less favorable results. The takeaway for mobile: the GPU via Edge APK is currently the more practical Gemma 4 E4B path on high-end Android phones; NPU is a power-saving alternative that makes sense for always-on background tasks where latency tolerance is high. (source, score 20, 6 comments) Second, a community quality-gap discussion (68 score, 44 comments) adds useful perspective on where Gemma 4 31B sits against frontier cloud models. The converging read across commenters: Gemma 4 31B tracks "Dec 2025 frontier" performance levels for translation and non-English tasks — competitive with Claude Haiku 4.5, which was released roughly half a year ago. For tasks outside English and Chinese, Gemma 4 31B is seen as clearing the bar where the "6-month gap" argument would place it. This is consistent with the separate multilingual finding from May 3: Gemma 4 31B beats all tested 100B+ MoE models for smaller European languages when run at BF16. Anecdotal confidence; no controlled benchmark behind this comparison. (source, score 68, 44 comments)
May 6 re-check, 2026-05-06 01:00 EDT: four high-signal developments from the latest sweep.
Google officially released Gemma 4 MTP draft models. Multi-Token Prediction (MTP) drafters are now available for all four Gemma 4 variants: 31B Dense, 26B-A4B MoE, E4B, and E2B (HuggingFace). The E2B drafter is only 78M parameters — tiny enough to run alongside the main model with minimal memory overhead. MTP works by having the small draft model predict several tokens ahead; the large target model then verifies the full batch in parallel, accepting correct tokens and re-running from the first mismatch. This guarantees identical output quality to standard generation while targeting up to 2x decode speedup depending on task type (structured outputs and repetitive patterns see the largest gains). Community response was immediate: llama.cpp PR #22673 is already in review for Gemma 4 MTP support, and the MTPLX Apple Silicon runtime (see below) also claims MTP model compatibility. This is the biggest single capability addition to Gemma 4 since launch and changes the expected throughput trajectory significantly. (source, score 783, 204 comments)
Token efficiency confirmed: Gemma 4 31B is slower per token but faster per task. A Kaitchup benchmark article (summarized in a community post, 117 score) compared Gemma 4 31B Dense against Qwen 3.6 27B Dense and Qwen 3.5 27B Dense. The headline finding: Qwen models score higher on standard benchmarks ("benchmaxxed") but Gemma 4 31B is "far more efficient with token use" — it produces a correct, complete answer in substantially fewer tokens. The practical implication is that even though Gemma 4 31B is slower per token (it is a larger dense model vs. smaller dense models), total task completion time is often similar or faster because the model doesn't need to elaborate as much. One commenter summarizes the workflow they use: swap Gemma and Qwen 3.6 in Plan/Act roles when either model gets stuck — the two models' different failure modes make them complementary. Another notes that Gemma 4 is more sensitive to quantization, so Qwen's smaller quant + Q8 KV can outperform Gemma at the same VRAM budget, especially for longer contexts. (source)
CPU-only 26B inference is fast because of MoE architecture. A community post (score 100, 70 comments) reports running Gemma 4 26B-A4B on an i5-8500 with 32GB DDR4 RAM and no GPU. The measured generation speed is 9.25 t/s (prompt processing 23.13 t/s). The key explanation from the top comment: "Gemma 4 26B is a mixture of experts model that only uses 4B parameters every token. So it should be about as fast as a 4B model." This is the definitive answer for CPU-only users: the 26B label is misleading — active parameter count per token is ~4B, making CPU inference practical on ordinary hardware. Qwen 3.6 27B is dense (all 27B parameters active every token), so it runs ~8x slower on CPU despite having similar total parameter count. For CPU-only or low-RAM setups, the Gemma 4 26B-A4B MoE is the right model; Qwen 3.6 27B is impractical at the same hardware. (source)
MTPLX: Apple Silicon MTP inference engine shows 2.24x speedup. An open-source runtime built on a patched MLX fork (not a patch to MLX itself) reports 28 → 63 tok/s on Qwen 3.6 27B on MacBook Pro M5 Max using MTP heads built into the model. Key design details: mathematically exact temperature sampling via rejection sampling (not greedy-only like other speculative decode tools on Apple Silicon), custom Metal kernels, and a full OpenAI/Anthropic-compatible API server. The runtime also adds crash-safe fan control and a 562-test suite. With Google's Gemma 4 MTP draft models now released, MTPLX may support Gemma 4 inference as well — the developer says it "works on ANY MTP model." Not yet independently verified for Gemma 4 specifically; treat as promising but unconfirmed. (source, score 60, 38 comments)
May 5 re-check, 2026-05-05 01:00 EDT: four developments from the latest sweep that Gemma 4 users should act on or track.
Update your Gemma 4 GGUFs. A high-traction community post (365 score, 103 comments) announced that the Jinja chat template bug documented in earlier field notes has been fixed in the upstream model files. Updated GGUFs are now available from bartowski and unsloth for all four variants: 31B, 26B-A4B, E4B, and E2B. Community comments flagged that the fix may also reduce the extreme memory usage some users experienced. The exact change is visible at HF discussion 86. If you have been running Gemma 4 GGUFs from before May 2026 and are using tool calling or extended context, updating is strongly recommended. (source)
llama.cpp MTP support is now in beta. A beta implementation of Multi-Token Prediction (MTP) has landed in llama.cpp (477 score, 210 comments). MTP pairs a small fast draft model with the large target model: the draft predicts a token batch, the target verifies the entire batch in parallel, accepting correct tokens and re-running from any mismatch. ELI5: "big model and small model work as a team — small model runs ahead, big model checks from behind, both finish sooner." Currently limited to Qwen3.5 MTP architectures, with broader model support expected. The author notes that between MTP and maturing tensor-parallel support, "most performance gaps between llama.cpp and vLLM, at least when it comes to token generation speeds, should be erased." Relevance for Gemma 4: once Gemma 4 MTP support lands (if and when), E4B could serve as a draft model for 31B Dense — this is architecturally the same as the existing Gemma 4 E2B speculative decoding setup but with native MTP semantics. Not yet merged; track the PR before updating. (source)
APEX MoE quants now cover Gemma 4. The APEX mixed-precision MoE quantization strategy, originally demonstrated for Qwen 3.5 35B-A3B, has expanded to 30+ models including Gemma 4 variants (77 score). APEX applies expert-routing-aware precision tiers: higher precision for edge layers and shared experts (which handle rare long-range tokens), lower for mid-tier experts. Users report noticeably better coherence past 32K tokens compared to uniform Q4_K, with measured faster inference on the benchmarked Qwen 3.6 models. The Gemma 4 26B MoE coverage is confirmed in the library; community reports on Gemma 4 specifically are sparse so treat the long-context claims as plausible but anecdotal until more data surfaces. Quants are available via github.com/mudler/apex-quant. (source)
Research: FastDMS achieves 6.4x KV cache compression faster than vLLM BF16. An MIT-licensed reference implementation of Dynamic Memory Sparsification (DMS) — a technique using learned per-head token eviction to compress the KV cache — reports 6.4x compression with near-lossless quality (perplexity 9.226 → 9.200 on Llama 3.2 1B; KLD ~0.026 nats/tok). The implementation is research-quality and tested only on Llama and Qwen-family checkpoints; it has not been integrated into llama.cpp, vLLM, or SGLang. Author explicitly says the lift for a production serving integration is large and "noped out" of attempting it. Given Gemma 4's documented KV precision sensitivity (iSWA architecture amplifies quantization noise), FastDMS is worth tracking as a potential path to longer context without KV precision degradation — but this is speculative and no Gemma 4 DMS checkpoints exist yet. Confidence: low (research-stage result). (source)
Hardware leak: Ryzen AI Max+ 495 (Gorgon Halo) with 192GB unified memory. A leaked spec for AMD's upcoming Ryzen AI Max+ 495 shows 192GB unified memory, up from 128GB on the current Strix Halo 395 (148 score). Key caveat from community hardware experts: memory bandwidth appears unchanged at ~256GB/s. For Gemma 4 users this means: a Gorgon Halo system could fit Gemma 4 31B Dense BF16 (~62GB), the 26B MoE BF16, and several smaller models simultaneously, with prefill remaining the same speed bottleneck as Strix Halo today. The additional capacity is most useful for parallel model loading, very long contexts, or RAG pipelines that need multiple loaded models. Unconfirmed leak; release timeline and pricing unknown. Strix Halo 395 owners confirm the memory increase alone would not change throughput on single-user workloads. (source)
MTP vs DFlash is now settled at the hardware and concurrency level. A controlled H100 benchmark (May 12) confirms: at concurrency 1, MTP (3.11x) and DFlash (3.03x) are statistically tied for Gemma 4 31B Dense. At concurrency 16, MTP wins decisively — 953 vs 725 tok/s. For single-user inference either method works equally well; for serving multiple concurrent users, MTP is the better choice. Apple removed the M3 Ultra 256GB Mac Studio from its store ahead of an expected M5 launch. DFlash for 26B MoE remains live in vLLM with 2.56x throughput for short-context workloads; both MTP and DFlash require workload-appropriate tuning. New May 15: KV cache quantization guidance for vLLM is now more precise — a formal study confirms FP8 (`--kv-cache-dtype fp8`) as the best default when VRAM is constrained, with 2x capacity and negligible accuracy loss. TurboQuant variants beyond 4bit-nc are not worth the accuracy and throughput cost for Gemma 4. If VRAM is not a constraint, unquantized KV cache remains highest quality.
The most relevant Gemma-mentioning posts driving this update, with the newest first:
The full set of 153 community reports lives in the Community Reports section above, filterable by hardware category and search.
Last updated: 2026-05-18 (May 18 sweep). Confidence: medium. Next update fires when the daily Gemma 4 research cron flags notable new findings.
Real-world hardware experiences from the community. Filter by hardware category or search. These are user reports, not official benchmarks.
Google is going to show what open weights is about. Happy Easter everyone.
> 2026-05-07 edit: I have updated the hardware based recommendations with more focus on quality. I do not recommend q40 KV cache anymore beyond 64k context. After multiple rounds of testing with the different size quants, it appears 3 is the optim...
Link post. The discussion is mostly in the comments. Target:
Blog post: MTP draft models:
I think gave it a fair shot over the past few weeks, forcing myself to use local models for non-work tech asks. I use Claude Code at my job so that's what I'm comparing to. I used Qwen 27B and Gemma 4 31B, these are considered the best local models u...
Gemma just crushed Qwen in a local LLM gamedev contest! Device: MacBook Pro M5 Max, 64GB RAM Qwen 3.6 27B: 32 tokens/sec · 18m 04s · 33,946 tokens. Gemma 4 31B: 27 tokens/sec · 3m 51s · 6,209 tokens. So what is more important: tokens per second, or t...
Well or pretty close to it, they are excellent work horses. I run them in real work scenarios doing some of the work I used to do myself as an skilled expert in my field, billing 200$ an hour. Ofc the key is building a system around their weaknesses,...
Sparky runs entirely on the Jetson. Gemma 4 E4B at Q4\K\M via llama.cpp with q8_0 KV cache and flash attention. 12K context, native system role, sampler defaults from the model card. Cached TTFT around 200ms, sustained 14-15 tok/s. SenseVoiceSmall fo...
I was frustrated that every coding agent (OpenCode, Cursor, Claude Code) assumes you're running GPT-5.4 or Claude Opus. If you try them with a local model like Gemma or Qwen they fall apart. I find that often tool calls fail, context overflows, multi...
Evaluated Qwen 3.6 27B across BF16, Q4\K\M, and Q8_0 GGUF quant variants with llama-cpp-python using Neo AI Engineer. Benchmarks used: HumanEval: code generation HellaSwag: commonsense reasoning BFCL: function calling Total samples: HumanEval: 164 He...
sycophancy: deleted efficiency per token:+1000% friendship: just beginning edit: “sup” got cut off at top
I built hfviewer.com, a small tool for visually exploring Hugging Face model architectures. You can paste a Hugging Face URL and get an interactive visualization of the architecture, which can make it easier to understand how different models are str...
of course this is just a trust me bro post but I've been testing various local models (a couple gemma4s, qwen3 coder next, nemotron) and I noticed the new qwen3.6 show up on LM Studio so I hooked it up. VERY impressed. It's super fast to respond, han...
Implemented Multi-Token Prediction for LLaMA.cpp. Quantized Gemma 4 assistant models into GGUF format. Ran tests on a MacBook Pro M5Max. Gemma 26B with MTP drafts tokens 40% faster. Prompt: Write a Python program to find the nth Fibonacci number usin...
Hey guys, we ran Qwen3.6-35B-A3B GGUF KLD performance benchmarks to help you choose the best quant. Unsloth quants have the best KLD vs disk space 21/22 times on the pareto frontier. GGUFs: We also want to clear up a few misunderstandings around our ...
other companies are slowly going away from open weight, not releasing base models, delaying open weight distribution, not releasing top models (this one I think is fair, but still), and I also noticed they stopped publishing research (old Gemma and q...
My personal test for small local LLM intelligence is to check whether a model has any ability to understand the code that I write for my own academic research. My research is on some pretty niche topics and I doubt that anything like it is substantiv...
Model(s) or Tool upgrade/New Tool? Source Tweet :
I’ve been testing Google’s Gemma-4-E2B-it as a local, offline resource for emergency preparedness. The idea was to have a lightweight model that could provide basic technical or medical info if the internet goes down. As the screenshots show, the saf...
Chat Template was fixed a few days ago choose your fav dealer:
I spent some time yesterday after work trying out the new qwen3.6-35b-a3b model, and at least for me it's the first time that I actually felt that a local model wasn't more of a pain to use than it was worth. I've been using LLMs in my personal/throw...
Link post. The discussion is mostly in the comments. Target:
Hi LocalLLaMA, I created a post a few weeks ago, but this time this project has become more reliable and easier to use. This is a manga translator that can also be used to translate any image. It uses a combination of object detection, visual LLM-bas...
Link post. The discussion is mostly in the comments. Target:
Launched claude code, pointed it at my running Qwen, and, well, it vibe codes perfectly fine. I started a project with Qwen3.6-35B-A3B (Q4) yesterday, and then this morning switched to 27B (Q8), and both worked fine! Running on a dual 3090 rig with 2...
Link post. The discussion is mostly in the comments. Target:
llmfan46/Qwen3.6-27B-uncensored-heretic-v2-Native-MTP-Preserved: llmfan46/Qwen3.6-27B-uncensored-heretic-v2-Native-MTP-Preserved-GGUF:
I've tried a few different local models in the past (gemma 4 being the latest), but none of them felt as good as this. (Or maybe I just didn't give them a proper chance, you guys let me know). But this genuinely feels like a model I could daily drive...
A lot of people in the Gemma 4 Model Request Thread were asking for better vision capabilities in the next Gemma Model. This tells me that people are not configuring Gemma 4's vision budget. Gemma 4 ships with [Variable Image Resolution](
Tested DeepSeek V4 Pro on FoodTruck Bench — our 30-day agentic benchmark where models run a food truck via 34 tools (locations, pricing, inventory, staff, weather, events) with persistent memory and daily reflection. First Chinese model to land in th...
I have a personal eval harness: A repo with around 30k lines of code that has 37 intentional issues for LLMs to debug and address through an agentic setup (I use OpenCode) A subset of the harness also has the LLM extract key information from reasonab...
DeepSeek V4 Update
llmfan46/Qwen3.6-35B-A3B-uncensored-heretic-Native-MTP-Preserved: llmfan46/Qwen3.6-35B-A3B-uncensored-heretic-Native-MTP-Preserved-GGUF:
Qwen3.6-35B-A3B and 9B are officially on the public Terminal-Bench 2.0 leaderboard! little-coder × Qwen3.6-35B-A3B hit 24.6% (±3.2), and now land above Gemini 2.5 Pro on Gemini CLI (19.6%) and Qwen3-Coder-480B on Terminus 2 (23.9%). I didn’t expect t...
A bit of an interesting story of model degradation and censorship. So, one of my use cases for AI has been translating and reading an Chinese novel as it appears, chapter by chapter. Due to the way some characters have secret identities plot points, ...
A bit of context. I was coding up a little html tower defense game where you can alter the path by placing additional waypoints. My setup: 32gb ram with 16gb vram 5070 ti. Using AesSedai/Qwen3.6-35B-A3B-GGUF IQ4_XS on LM Studio with OpenCode. I've gr...
I've been building Abliterlitics, an open-source abliteration forensics toolkit. The idea is straightforward: take the same base model, compare the different abliteration techniques others have applied, then measure what actually changed using benchm...
So in response to the Great Token Reconning of 2026, I decided to try out Qwen 3.6 as a daily driver, and although it's only been about a day, I have to say I'm thoroughly impressed. I had to download the VSCode insiders edition and set up the local ...
Gemma 4 26b-a4b-it is basically a solid B student that gets the job done. Qwen3.6-35b-a3b is an A+ student that has plenty of energy after finishing the assignment to add flairs. On a my 16vram video card. Both models runs comparable speed. On Window...
Hey r/LocalLLaMA we conducted KL Divergence benchmarks for Gemma 4 26B-A4B GGUFs across providers to help you pick the best quant. Mean KL Divergence puts nearly all Unsloth GGUFs on the Pareto frontier KLD shows how well a quantized model matches th...
Can confirm it works on a 5090, with 80% allocation (of 32gb) I got around 50k context. - It's 18.8GB | Benchmark | Baseline (Full Precision) | NVFP4 | | --- | --- | --- | | GPQA Diamond | 80.30% | 79.90% | | AIME 2025 | 88.95% | 90.00% | | MMLU Pro ...
tell the Gemma team:
Bench 2 from my 18GB M3 Pro. Last week was specialists vs generalists at 7-8B (which I hosed by giving thinking models a 128-token budget, so half the post was an apology). This week: the 4B class of 2026, every model released or actively-current at ...
I just wanted to share my experience. At work we have Cursor with the Enterprise tier. Today I burned 10$ with 2 prompts, one on gpt-5.5 and one on claude-opus-4.6-thinking. Last month I burned 80$ in one week with claude-opus-4.7 even with the 50% o...
Not affiliated with Kaitchup, but a fan of their testing. I was looking forward to this article... and it did not disappoint. Lots of free info in the link. The juicy part is behind a paywall. I'll respect that, but the short of it is: It's showing t...
I'm using the fork for Bonsai, regular llama.cpp for Gemma. Without embedding parameters: Gemma 4 has 2.3B at 4.8 bpw (Q4\K\M) = 1104 MB Bonsai-8B has 6.95B at 1.125 bpw (Q1_0) = 782 MB (only 29% smaller) I could've gone with a smaller quant of Gemma...
I created this chart with recent open models from last 6 months. Few might be older than that possibly. Included only latest versions(Ex: Only Kimi-K2.6, no Kimi-K2.5 & Kimi-K2. Also only GLM-5.1 & GLM-4.7, no GLM-4.6 & GLM-4.5). I couldn...
Hi guys, Back again. I have tested the Qwen 3.6 UD 2 K_XL Unsloth model on the same paper to web app task. The model is performing very well. It handled all tool calls properly and also managed large context using llama.cpp on a 16GB VRAM on laptop. ...
Turboderp has a been on an absolute tear recently, in the endless battle to cram new llamas into smaller, faster boxes. We started off last month with the release of gemma 4 support, and continued with [improved caching efficiency](
This is a follow-up update to my previous post comparing Qwen 3.6 35B vs Gemma 4 26B. I wanted to particularly follow-up with the following: 1. Gemma 4 26B could've suffered the quantization tax…
Past few days, its all been about MTPs. Somehow people missed out the fact that Z lab released the Dflash for Gemma4 26B a couple of days ago. As far as my understanding goes, Dflash should be a better alternative than MTP because of faster parallel ...
I don't know if it's something I am doing horribly wrong or what, but running Open WebUI w/ Terminal on Docker with the models on LM Studio and I am starting to think the community keeps praising the tool calling feature just to cope lol Qwen3.5 27B,...
Have Qwen 3.6 27B and Qwen 3.6 35B basically made most of the older \~30B models irrelevant? They seem to beat stuff like Qwen coder 30B, GPT OSS 20B, Gemma models, especially for coding and agent workflows. At this point I’m not really finding a rea...
After the recent releases, there's almost a sense of emptiness. When do you think new models will be released? Looking at the chart, it's between the end of May and the beginning of June, but... I don't know why, it seems like something's changing ab...
This is crazy. I've been running local LLMs on CPU only for awhile now and have great results with 12B models running on an i5-8500 and only 32GB of RAM with no GPU. But I've got a version of Gemma4 26B running really fast on the same machine which i...
Link post. The discussion is mostly in the comments. Target:
In my opinion, MTP models are 100% game changer for local LLMs. In terms of speed, I was getting around 1.5x the tok/sec of previous tests but only with the dense full 27b Qwen 3.6 model. The MoE 35B version gained less than 10% with the MTP version....
I recently published MTP quants of Qwen 3.6 27B and I was suprised by the reports here on reddit, and on HF, of users who were experiencing worst speed with speculative inference than without. This did not match what I was seeing, but when I tried to...
I'm still hoping we see a Qwen3.6-122B or a Qwen3.6-coder, but my hopes are dimming. Seems like we would have seen/heard something by now, even if just tantalizing hints from the Qwen folks.
>The NVIDIA Kimi-K2.6-NVFP4 model is the quantized version of the Moonshot AI's Kimi-K2.6 model, which is an auto-regressive language model that uses an optimized transformer architecture. For more information, please check here. The NVIDIA Kimi-K...
Originally I was a diehard fan of Gemma4 26b-a4b because it really is a remarkably intelligent llm. Ran qwen3.6 via ollama and found it impressive but still favored Gemma. Ollama did it a disservice at least on my pc. Ran it straight through llama.cp...
I ran a benchmark to see how much DFlash speculative decoding actually helps in vLLM. Setup: GPU: RTX 5090, 32GB VRAM vLLM: 0.19.2rc1 Main model: cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit Draft model: z-lab/gemma-4-26B-A4B-it-DFlash Workload: random datas...
I guess we'll have to wait until this PR is merged before we can test it.
DeepSeekv3 OG DeepSeekv3.2/4 Qwen3.5+ GLM4.5+ ~~MiniMax2.5+~~ Step3.5Flash Mimo v2+ Until we get mtp weights, you need to download HF weights and convert to gguf. I think I'm going to try either qwen3.5-122b or glm4.5-air first.
From clem on 𝕏: From Victor M on 𝕏:
Hello, I’ve been scrolling through a lot of posts, reading personal experiences, setup advice, and replies to beginner questions from people like me. LLMs really seem like a revolution. But at the same time in every post there is issues : they’re exp...
Link post. The discussion is mostly in the comments. Target:
Provided in both Safetensors and GGUFs. Safetensors: llmfan46/G4-MeroMero-31B-uncensored-heretic: GGUFs: llmfan46/G4-MeroMero-31B-uncensored-heretic-GGUF:
edits to call out some information: \- All local model uses \`Q4\K\M\` quantization with \`llama.cpp\` engine \- Main factor contribute to difference with Qwen's official post (59% vs 38%) is probably benchmark task timeout used, then quantization, h...
Many local models have a problem (that raised due to excessive RHLF training): They mostly think that everything that is beyond their knowledge cutoff date would be "fictional" or "satirical". To be fair: Even the Gemini API without web access can ha...
Hi, recently froggeric and allanchan339 released enhanced/fixed template for Qwen3.6 each one addressing different topics. I didn't know which one to use so I merged both with the help of Claude Opus to have the best of both. I've uploaded it to this...
Both Gemma 4 and Qwen 3.6 seems to be the hottest local models right now. Looking at the benchmarks and reviews, it seems like it's better in every way: coding, benchmarks, agentic tasks. So is Qwen outright better? In what case would you pick Gemma ...
Hi everyone, I saw an article saying Chrome silently downloads a \~4GB AI model (likely "Gemini Nano") to your computer for features like text summarization. Two questions: 1. What is the exact name/version of this model? 2. Is there a GGUF file avai...
Link post. The discussion is mostly in the comments. Target:
Hardware |Component|Details| |:-|:-| |Machine|MacBook Pro (Mac14,6)| |Chip|Apple M2 Max — 12-core CPU (8P + 4E)| |Memory|64 GB unified memory| |Storage|512 GB SSD| |OS|macOS 15.7 (Sequoia)| # AI Agent Setup I'm using the pi coding agent as my primary...
MTP is amazing. I genuinely thought it would be a nothingburger
I got Qwen 3.6 35B-A3B and Gemma 4 26B-A4B running on a $200 secondhand machine (i7-6700 / GTX 1080 / 32 GB RAM) using llama.cpp (the TurboQuant/RotorQuant KV cache quantisation allows 128k context within the 8 GB VRAM). Results (Q4\K\M models, 128k ...
Hey guys, A couple of weeks ago, I asked this sub for the hardest Vision use cases you were dealing with to test the newly dropped Qwen 3.6 against Gemma 4. I finally finished running the gauntlet side-by-side locally on vLLM (FP8 quants) using my cu...
Going to flag this up front - I know that there are some properly smart people on this sub, please can you correct my noob user errors or misunderstandings and educate my ass. Model: google/gemma-4-26b-a4b Versions: * MLX:
I ran a pretty simple but revealing local-LLM test. At first I was only going to post about the two Qwens and Gemma4 and go to bed, and what do you know, I go on reddit and see a post that Qwen 3.6-27B dropped. Oh well... Models tested: Gemma4 `cyank...
Hi. It is quite a consensus that the "jump" in quality of agentic development happened sometime in December 2025, transforming from "nice to have", to actually performing. It was also long discussed that open source models lag the state of the art by...
This is follow up from previous post: There have been many improvements to the MTP pull request and the llama.cpp main branch, such as image support and various bug fixes. I recently made a new build for my local machine, but keeping guides up to dat...
You can play them here: This started out as a simple test for Qwen3 Coder Next vs Qwen3.5 4B because they have similar benchmark numbers and then I just kept trying other models and decided I might as well share it even if I'm not that happy with how...
Benchmarked Gemma 4 MTP and z-lab's DFlash on a single H100 80GB using vLLM and NVIDIA's SPEED-Bench qualitative dataset. # Setup: Hardware: 1x H100 80GB Runtime: vLLM Dataset: SPEED-Bench qualitative …
So, as most of us here are, I'm a llama.cpp loyalist. Easy to understand, great configuration, relatively stable, etc. But I’ve been increasingly tempted by vLLM, especially since AMD just added it as a built-in inference engine to Lemonade, and I ha...
Im using these settings in llama.cpp: --spec-type ngram-map-k --spec-ngram-size-n 24 --draft-min 12 --draft-max 48 Whats the real reason for lets say the prompt is for "minor changes in code", whats differing between models: Gemma 4 31b: Doubles in t...
UPDATE: Vulkan benches arew now included. And yes, I used AI to help me write this post. As a life-long Windows user (don't hate me, I was exposed to it at a young age) I was wondering how much (if any) performance I'm leaving on the table. So I did ...
There are plenty of "bro trust me, this model is better for coding" discussions out there. I wanted to replace the vibes with actual data: which model writes correct code and how fast does it run on real hardware, tested under identical conditions so...
Provided in Safetensors, GGUFs and NVFP4 formats. Safetensors: llmfan46/gemma-4-Ortenzya-The-Creative-Wordsmith-31B-it-uncensored-heretic: GGUFs: lmfan46/gemma-4-Ortenzya-The-Creative-Wordsmith-31B-it…
Hello Guys, I know everyone has his definition of local models, but for me i see 2 "reasonable" type of frontier local models. a dense one that barely fit in a 32GB ou 24GB of gpu for the most "reasonable" GPU wealthy guys and a MOE in the 100B param...
EDIT: OKOKOK. Blackwell all the way. NEW, at MC or NewEgg or where ever and more tokens than my face can handle. Thanks guys. I was close to pulling that Apple.com trigger. You saved me. EDIT AGAIN: I think it's the max-q for me. Central Computers ha...
We have a chat system which we use haiku for because it is mostly about tool calling and summarisation of them. But we have many tools with pretty complex input schemas, and stuff like gemma didn't cut it, so we went with haiku. Haiku is pretty good....
Just sharing the results from experimenting with the B70 on my setup.... These results compare three `llama.cpp` execution paths on the same machine: RTX 3090 (Vulkan) on NixOS host, using main llama.cpp repo (compiled on 4/21/2026) Arc Pro B70 (Vulk...
I benchmarked Qwen 3.6, Qwen 3.5, and 5 other models across 5 agent frameworks on Apple Silicon — here's the full compatibility matrix Hardware: Apple M3 Ultra, 256GB unified memory Frameworks tested: Hermes Agent (64K stars), PydanticAI, LangChain, ...
TLDR: tool parameters using the common JSON Schema pattern \`anyOf: [$ref, null]\` are rendered into the prompt as empty \`type\` fields. This strips the useful schema information before the model sees it. \-- Long, rambling version: Gemma 4 was havi...
I wrote up this little python app to cycle through a bunch of prompts like this: |Single HTML file using three.js from CDN. A central rotating MeshNormalMaterial torus knot. Place a bright Sprite (AdditiveBlending, soft circular canvas texture) at a ...
Link post. The discussion is mostly in the comments. Target:
This isn't an advertisement, and it's very much local and open - I already don't have enough time to keep up with the existing pull requests and issues... just a fond look back on how much this space has grown and matured in the past year. Shit was t...
Tutorial from the Google guy, I use very similar setup (llama.cpp instead of lmstudio)
I had the idea of splitting the cross-entropy difference into two sums (positive and negative; or the PPL into two ratios >1 and <1) while doing PPL evals of uncensored GGUFs. The inspiration came from looking at the area under the PPL ratio co...
Running my own models. I was having some trouble getting vLLM going so dropped down to LM Studio which I've used on my 24GB MacBook Air. I now have LM Link across both laptops into the AI Workstation RTX Pro 6000 Blackwell. And my phone on LM Mini. I...
And somehow we already got some GGUFs for it!
I was thinking, that some folks in this community will be interested to see what current options are on local deep research field. So I spent some time to collect everything I could find together. Enjoy. TLDR: the most healthiest and local-friendly p...
So I was very excited about the MTP stuff especially since Gemma4 has become my "daily driver" for some stuff. I grabbed the latest mlx-vlm and did some tests and found it disappointing. | Workload | MTP off | MTP on | Result | Draft accept rate | |-...
Everyone has been taking about Luce DFlash and PFlash. I just came across their megakernal and it seems it was released along with Dflash and PFlash. It seems it's giving them 1.8x greater speed with much more power efficiency on nvidia gpu comparabl...
Ling-2.6-1T: A Trillion-Parameter Comprehensive Flagship Model for Complex Tasks Today, we are thrilled to open-source Ling–2.6–1T from the Ling family. Tailored for real–world, complex scenarios, this trillion–parameter model introduces targeted opt...
Quote: ...new optimizations for Ryzen AI Max 300 "Strix Halo" and the ROCprof Trace Decoder is now open-source...<snip>... Those rolling from source can grab the ROCm 7.13 Tech Preview via TheRock on GitHub.
I am not the author. My two cents: I'm not suggesting we don't all know local AI is expensive, at least for now. The math gets interesting if OpenRouter providers are burning investor cash and it runs out, or we take into account hardware we use for ...
I wanted to figure out which of the newer small and mid-size models are actually worth running on a single H100, so I put 8 of them through a proper vLLM benchmark and recorded what came out. The setup was simple. One H100 80GB, vLLM 0.19.1, the buil...
Qwen3.5-122B-A10B at Q6_K is really good. Do you think we will see a larger MoE Gemma-4 or Qwen3.6 at some point?
[UPDATE - April 2026] Several people asked about missing models (Qwen 3.5, Gemma 4, the SillyTavern finetune series) and raised valid questions about the methodology. I ran an expanded 37-model sweep with a 5-judge ensemble and documented the selecti...
Talkie-1930-13b-it and Gemma 4 31b in the same chat. Talkie is a 13B vintage language model from 1930. Hosted version if you can't run them both locally
According to this. I run several more tests to cover more models and quants. [Qwen3.6 35B-A3B MLX oQ4. 2 extra pawns. (oMLX - local)](
Confidence is persuasive. In AI systems, it is often misleading. Today's most capable reasoning models share a trait with the loudest voice in the room: They deliver every answer with the same unshakable certainty, whether they're right or guessing. ...
I have run two tests on each LLM with OpenCode to check their basic readiness and convenience: \- Create IndexNow CLI in Golang (Easy Task) and \- Create Migration Map for a website following SiteStructure Strategy. (Complex Task) Tested Qwen 3.5, &a...
Provided in both Safetensors and GGUFs. Safetensors: llmfan46/Gemma-4-Gembrain-31B-it-uncensored-heretic: GGUFs: llmfan46/Gemma-4-Gembrain-31B-it-uncensored-heretic-GGUF:
I'd love to hear from developers who use big context windows if they notice a difference? Obviously I would love to cut the KV cache VRAM requirement in half, but I'm worried about quality especially when we enter into 50k+ context territory. I don't...
Link post. The discussion is mostly in the comments. Target:
I gave 9 local models the same flight combat sim prompt. The results broke a few of my assumptions about quant providers and parameter count. *All 8-bit MLX, M3 Max 128GB, served via omlx, prompted through Claude Code. Same prompt every time — single...
Pocket LLM v1.5.0🚀 New in this release: \- 🎙️ Voice input \- 🖼️ Image input with OCR, Gemma vision, and FastVLM support \- 📷 Camera capture with retake, crop, and photo review \- 🗂️ Previous chats side panel \- 💾 Downloaded model deletion to save sto...
I was asked for this guide, so here it is. Some overlap with someone else’s post from yesterday. YMMV! Too busy with work to write myself, so I asked Opus to write for me (I have validated the content!). I’m sure there will be debate over using q4 bl...
I run Qwen-3.6 27B with the FP8 safetensors on vllm for long-horizon agentic coding harness workloads with high context window and concurrent sub-agents. On two 3090s that aren’t used for anything else, it seems reasonable to expect a good balance be...
Been running Gemma 4 E2B locally on my OnePlus CE 5 (8GB RAM) for a few months. Chat quality is fine for the size. What surprised me was JSON output. Short input, give it a structured prompt, you get clean parse able JSON back. Way better than I expe...
I'd jump on runpod and ssh in to test my workloads, but they don't have it. Would love to know how well this runs, particularly as context approaches a full 256K. Thanks!
Around 3B please thank you
Absolutely unbelievably exciting work, split attention (i.e. a couple of GB) onto local machine and the weights onto another local machine (say a cheap Xeon) to basically bypass the scale issue with local LLMs completely!! Repo with functional code: ...
Just wanted to share that I'm pretty happy about Qwen 35b a3b agentic coding performance. I'm running the model in q80 quant, kv cache both q8_0 as well, with 262144 in 4090 + 5060 ti, via llama.cpp backend with claude code pointing to localhost. For...
I was testing OpenCode and Roo Code with Gemma 26B on llama.cpp yesterday for about 10 hours. I was able to make progress on my project, both solutions work. But: OpenCode is kind of fucked up at the moment, because of that there is often long prompt...
So for my project I was using up until now either Gemini 3 / 2.5 Flash or Flash-lite. All my use cases are not agentic, simply LLM workflows for atomic tasks like extracting references from the law, classifying, adjusting titles to nominative case an...
When Qwen3.6-35B-A3B was released a week or so ago, I sort of expected an iterative improvement on the previous Qwen3.5 models. After all, those models were pretty decent as compared with the previous local models I had tried, and Qwen3.5 did well on...
Hello, I'm currently using Ollama / lm studio for things like code inference and proof reading emails, etc. Definitely not experienced in this space but looking to grow. It's been working great but it's a bit slow at times. I use Gemma 4 / Qwen, I al...
I kept seeing inference-speed claims for these models and wanting an apples-to-apples comparison on the hardware I actually have. So I built a harness and a public page that dumps every run as YAML. The dataset: 55 runs, three rigs, five backends (ro...
I have a build with 2 x MI50 32GBs and 64 gigs of DDR4 (bought before rampocolypse for \~630 USD total, I’m not rich) and I’m not gonna upgrade it for a long while. Are there any good MOE models that are around 60B in parameters so I can make use of ...
Which of these do you think we'll get in May? Also, feel free to pick/rank which ones you'd want the most badly: - more Gemma4 models (124b?) (other sizes?) - more Qwen3.6 models (9b? 122b? 397b?) - new Qwen Coder model (80b Even Nexter?) (~397b/400b...
With every new model release there's the "better than Opus 6.13" guys vs the "this is so bad, why did they even release it" camp and I'm always wondering which one is using it wrong. So I did a little test with 2 related prompts, 3 models and ran eac...
Longtime lurker here, thought i should post my speeeeds... I have a RTX 4070S 12 GB Vram (+10% OC), AMD 9800x3D with 4x16 Gb DDR5 6000Mhz CL30. EDIT: I offload my display to my igpu btw to save some vram on the rtx dgpu. Otherwise drop 10% or so on p...
We had a customer support RAG bot. Standard setup: ChromaDB, system prompt, an LLM doing generation. Nobody had actually measured the response quality. In the name of evaluation, I only had a keyword matching script producing numbers that looked like...
I've been learning German recently, and it occurred to me that I could point some of my AI horsepower at having a German speaking LLM to practice with. I'm not too concerned with the speech to text side of things or getting it to talk back, but googl...
In case you haven't heard, Google just released Multi Token Prediction drafters for Gemma 4, a speculative decoding approach that pairs the main model with a lightweight drafter. It can predict several tokens ahead and then verify them in parallel, s...
I've spent the last few weeks running real multi-file coding tasks through small local models and small cloud models on free tiers. Wanted to share the failure points that came up consistently, since some of them surprised me and i wanted to share wi...
Okay so i've been stalking this sub for some time and i run the occasional small 2-8b model on my laptop (not the best) for fun but say my role at a company is to set up a local LLM since we obviously don't want confidential data going to other compa...
When dealing with untrusted outside input, I think you should handle it based on the situation. If you're processing structured data files, it's better to use tools to isolate and handle them. I made DataGate for that. But if it's web documents that ...
yea i know the title looks so stupid, yes i done searches, i searched google, huggingface, youtube, i even tested some via LM Studio, but due to my low-end VRAM (GTX 1050 4G Vram) i cant fit more than 4B or 1B into it, i have about 20G RAM + 15G Page...
I’m upgrading from 32 to 48 soon and am excited but I’m curious what y’all run!
Found this interesting and thought i'd share. A big problem i've had with Qwen 3 MoE is how bad at instruction following it was, and also, it's 'dumb point' in the context window was really low. I was so turned off by it that i never tried Qwen 3.5 a...
Nothing extensive to see here, just a quick qualitative and performance comparison for a single programming use-case: Making an ancient website that uses Flash for everything work with modern browsers. I let all 3 models tackle exactly the same issue...
We had deepseek v4 preview recently but it wasn't much better than v3.2. What is the next SOTA local/open model you are excited about?
Yes, for material that is an hour long, there is no getting around tools like Whisper - or something even better. However, for transcribing short snippets, Gemma works very quickly and reliably- even in foreign languages. Do you use it as well?
now you can talk about videos
Between a solid model from Qwen or Gemma 4, when translating a text, does "thinking mode" significantly boost the quality of the translation, or is the difference negligible?
I have been using local LLM for coding quite a lot as well as some other tasks (like data extraction from images) and I had quite a good success with Qwen3.6 models. It's obviously not Sonnet/Opus, but I am able to get quite a lot of work done. Latel...
Which LLM with under 10B params has the best ability to do web searches Is there any benchmark for this where i could see how certain models perform I've checked out gemma e4b it, is it any good for web searching compared to other alternatives at the...
Qwen3.5-27b (BF16) on 2x Pro 6k and Gemma-4-E4B (BF16) on RTX 5090 - Took about 8 minutes total (40k tokens total - but like 10k is opencode prompt) - One prompt for planning (I answered a few follow ups) - One shot 1000 lines of code - Fixed only bu...
Hi all! I recently made a post about how Gemma 4 managed to replace Qwen 3.5 for me, for semantic routing and a lot of coding stuff and ultimately it was my new daily driver. The next day, Qwen 3.6 released and I've been using it a lot this week. Her...
I’m currently stuck deciding between AMD Strix Halo (128 GB AMD Ryzen AI Max+ 395 Framework Desktop) and an Nvidia DGX Spark (Asus Ascent GX10) for a home LLM server that can be accessed over the local network with a ChatGPT like interface in a web b...
I've got a 128GB Strix Halo box. Yesterday I wanted to try out Step-3.5-flash. It's a model that barely fits in my system as is - I found a bartowski Q4_XS that's 105GB. With about 150K context it takes to about 108GB. That leaves about 20GB minus wh...
We'll be getting those features(check bottom link) on mainline soon or later anyway. But for now this fork could be useful to see the full potential of our poor GPUs(and also big, large GPUs). Any 8GB VRAM(and 32GB RAM) folks already doing Agentic co...
I have an Oneplus 12 with Snapdragon 8 Gen 3. I followed the above README to cross-compile llama.cpp on Ubuntu and then copy to the Termux directory on the phone. It seems like llama.cpp's Hexagon backend is highly supported by…
I posted earlier about RTX 5060 Ti local LLM testing, and I have cleaned the repo up quite a bit since then. The project is now a more structured benchmark/recipe repo rather than scattered notes. It has a static results explorer, schema-validated be...
Community discussion comparing Gemma 4 31B, Qwen 3.6 27B, and GLM 4.7 30B on non-English (primarily European) languages. Original poster reports Gemma 4 31B as the best at Czech, noting it "blows their mind" at 18GB. Key community finding: Gemma 4 31...
View full discussion on r/LocalLLaMAI've got to the point where I need some help. I'm trying to run Qwen 3.6, and it will eventually fall into a loop where it's just outputting "/" symbols when it's "thinking". It just loops through spitting out / until the max tokens is hit so you see...
I'm testing running local LLMs on a gaming mini PC (AMD 7840HS, 32 GB RAM) paired with an eGPU (Radeon 9060XT with 16 GB VRAM). Since I'm not very familiar with using llama.cpp, I kept getting unsatisfactory results, but with the recent Gemma4 24B A4...
I experienced this with Q4 and Q3 versions of Qwen3.6-35B-A3B and Gemma-4-26B-A4B. It starts saying things which sound similar in thinking mode: I must do .... I have to do ... I need to do ... Is this a known issue with lower quantization ? I usuall...
SGLang backend compatibility report from AI Router Switzerland. The author reports FP8 KV cache corruption with radix-cache prefix hits on Qwen3.6-27B-FP8, and explicitly says the bug seems to affect FP8 models such as DeepSeek-V4, Gemma 4, and Qwen3...
View full discussion on r/LocalLLaMA