Gemma4 Self-Hosting Guide

Find the best Gemma configuration for your hardware. Search by GPU, CPU, or RAM to see what works, how fast, and what quality to expect.

CPU: AMD Ryzen 9 5900X 12-Core Processor
RAM: 121GB
GPU: GPU (via Ollama host)
Best: gemma4:31b (88% at 1.2 tok/s output-est)
functiongemma-270m:latest Params268.10M QuantQ4_K_M Backendollama Score0% Speed20.1 tok/s output-est
CPU: AMD Ryzen 9 5900X 12-Core Processor
RAM: 121GB
GPU: CPU only
Best: gemma-4-26B-A4B-it-Q4_K_M (73% at 2.9 tok/s output-est)
CPU: AMD Ryzen 9 5900X 12-Core Processor
RAM: 121GB
GPU: NVIDIA GeForce RTX 3090
Best: gemma4:e4b (5% at 6.0 tok/s output-est)

Backend Comparison

BackendBest ForGPU SupportNotes
OllamaMost users, GPU setupsCUDA, Metal, ROCmEasiest setup, automatic model management
llama.cppFlexible quantizationCUDA, Metal, VulkanMore quant options, manual model files
gemma.cppCPU-first setupsCPU only (for now)Google-native, Gemma 2/3 only currently

Hardware Tiers

  • High-end GPU (24+ GB VRAM): Run Gemma 4 31B Dense or 26B MoE at full precision. RTX 3090/4090, A100, etc.
  • Mid-range GPU (8-16 GB VRAM): Gemma 4 26B MoE with quantization, or Gemma 4 E4B unquantized.
  • Apple Silicon (32+ GB unified): Gemma 4 26B MoE via Ollama Metal. 48+ GB can try 31B Dense.
  • CPU only (16+ GB RAM): Gemma 4 E4B or Gemma 3 4B via Ollama. Viable for interactive use at 140+ tok/s.
  • CPU only (8-16 GB RAM): Gemma 3 4B or Gemma 2 via gemma.cpp. Smaller but functional.