Gemmaclaw includes a built-in E2E agentic benchmark harness that evaluates Gemma models as AI agents with real tool use. The harness dispatches the full agent task suite, captures full conversations including tool calls, and saves structured results ready for PR submission.
Each task runs in an isolated environment with mock tools (email, calendar, tasks, contacts). Results are saved after every task, so an interrupted run can resume without losing completed tests.
# 1. Set up gemmaclaw (auto-detects hardware, installs backend, pulls model)
gemmaclaw setup
# 2. List all benchmark tasks
pnpm benchmark agent list
pnpm benchmark agent list --suite expanded
# 3. Run the full agentic task suite (model auto-selected from your hardware)
pnpm benchmark agent
pnpm benchmark agent --suite expanded
# 4. Run with a specific model
pnpm benchmark agent --model gemma4:31b --quant Q4_K_M --thinking high
# 5. Resume a run, rerun one task, or rerun failed tasks only
pnpm benchmark agent --run-id q4-rtx3090-v1
pnpm benchmark agent --run-id q4-rtx3090-v1 --task email_triage --rerun
pnpm benchmark agent --run-id q4-rtx3090-v1 --rerun-failed
# 6. Rebuild aggregate results from saved per-task artifacts
pnpm benchmark agent --run-id q4-rtx3090-v1 --assemble
# 7. Mock mode: test the harness without a real model (instant)
pnpm benchmark agent --mock --run-id smoke
# 8. Sample the 200-variation template suite before a full sweep
pnpm benchmark agent --suite variants --sample-per-template 2 --sample-seed gemini-flash-smoke-20260513 --backend google-gemini-cli --model gemini-3-flash-preview --run-id variants-gemini-flash-sample --idle-timeout 10 --no-activity-timeout 120 --hard-cap 600
The benchmark supports two inference backends:
Ollama: managed model server. Supports all Gemma 4 models, including multimodal ones. Installed automatically by gemmaclaw setup.
pnpm benchmark agent --model gemma4:e4b
pnpm benchmark agent --model gemma4:31b --ollama-url http://192.168.1.50:11434
llama.cpp: OpenAI-compatible API via llama-server. Lower overhead, useful for CPU-only systems and custom quantizations.
# Start llama-server (requires a GGUF model file)
llama-server -m /path/to/model.gguf --port 8080 --n-gpu-layers 99
# Run benchmark against it
pnpm benchmark agent --model gemma3:1b --backend llama-cpp --llama-cpp-url http://127.0.0.1:8080
Use --suite to choose which task family to run. The default suite is the published comparison baseline. Expanded suites broaden coverage but need reference validation before their model results are published.
| Suite | Tasks | Use | Command |
|---|---|---|---|
| default | 47 | Published Gemmaclaw model comparisons | pnpm benchmark agent --suite default |
| expanded | 147 | Gemmaclaw expanded productivity, research, writing, coding, analysis, log, meeting, memory, skill, and integration tasks | pnpm benchmark agent --suite expanded |
| variants | 29400 | 147 Gemmaclaw-owned templates with 200 controlled variations each | pnpm benchmark agent --suite variants |
| all | 29594 | Development sweeps across every registered task family | pnpm benchmark agent --suite all |
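The task counts in the suite table are related by simple arithmetic. The sketch below recomputes them, assuming that `all` is the plain sum of the other three suites (an inference that matches the listed numbers, not a documented rule). The variable names are illustrative only.

```shell
# Task counts as listed in the suite table.
default_tasks=47
expanded_tasks=147
variations_per_template=200

# Each expanded task becomes one template with 200 controlled variations.
variants_tasks=$((expanded_tasks * variations_per_template))

# Assumption: "all" is the sum of the default, expanded, and variants suites.
all_tasks=$((default_tasks + expanded_tasks + variants_tasks))

echo "variants=$variants_tasks all=$all_tasks"
# prints: variants=29400 all=29594
```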
The benchmark now includes 29400 generated tests by turning every expanded Gemmaclaw task into a reusable template with 200 controlled variants underneath it. A template defines the skill being measured, fixture schema, expected behavior, and grading rubric. Variants alter role, context, distractors, wording, output framing, and artifact requirements while preserving the same core capability target.
For harness validation, use a deterministic sample before running the full 29400-case suite. The standard smoke path is 2 variants per template, which gives 294 tasks across all 147 templates and still exercises every template family.
pnpm benchmark agent list --suite variants --sample-per-template 2 --sample-seed gemini-flash-smoke-20260513
pnpm benchmark agent --suite variants --sample-per-template 2 --sample-seed gemini-flash-smoke-20260513 --backend google-gemini-cli --model gemini-3-flash-preview --run-id variants-gemini-flash-sample --idle-timeout 10 --no-activity-timeout 120 --hard-cap 600
| Template Family | Variants | Capability |
|---|---|---|
| Expanded productivity | 200 per task template | Calendar, inbox, task, and assistant workflow coverage |
| Expanded research and writing | 200 per task template | Source synthesis, long-form reports, editing, and transformation |
| Expanded coding and skills | 200 per task template | Code review, debugging, skill composition, and implementation planning |
| Expanded analysis and logs | 200 per task template | CSV analysis, log triage, meeting extraction, and structured decisions |
| Expanded integrations | 200 per task template | Safe simulated browser, calendar, email, and external-service workflows |
Generated variants are not publishable by default. They must pass reference e2e validation, harness-bug review, clean model runs, judge evaluation, and site QA before they appear as comparable results.
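To make the idea of a deterministic, seeded sample concrete, here is a minimal sketch of one way to pick a stable subset of variants per template. It is not the harness's actual sampling code: the function name `sample_variants`, the template id `template_a`, and the hash-ranking scheme are all hypothetical, chosen only to show why the same seed yields the same sample on every run.

```shell
# Illustrative only: rank each variant index by a checksum of
# "seed:template:index" and keep the n lowest-ranked indices.
# The harness may sample differently; this scheme is an assumption.
sample_variants() {
  seed=$1
  template=$2
  total=$3
  n=$4
  i=0
  while [ "$i" -lt "$total" ]; do
    # cksum is deterministic, so the ranking depends only on the inputs.
    digest=$(printf '%s:%s:%s' "$seed" "$template" "$i" | cksum | cut -d' ' -f1)
    printf '%s %s\n' "$digest" "$i"
    i=$((i + 1))
  done | sort -n | head -n "$n" | cut -d' ' -f2
}

# Same seed and template give the same two variant indices on every call.
first=$(sample_variants "smoke-seed" "template_a" 200 2)
second=$(sample_variants "smoke-seed" "template_a" 200 2)
[ "$first" = "$second" ] && echo "stable sample:" $first

# Two variants from each of the 147 templates.
echo "sampled tasks: $((147 * 2))"
# prints: sampled tasks: 294
```

Ranking by a hash of the seed rather than drawing from a random generator keeps the selection independent of iteration order, which is one common way to keep samples stable across repeated runs.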
| Flag | Default | Description |
|---|---|---|
| --model <name> | (auto from hardware) | Model to test (e.g. gemma4:e4b, gemma4:31b) |
| --backend <type> | ollama | Backend: ollama or llama-cpp |
| --suite <name> | default | Task suite: default, expanded, variants, or all |
| --sample-per-template <n> | (off) | For generated variation suites, run a deterministic sample of n variants from each template |
| --sample-seed <text> | default | Seed used to pick stable sampled variants across repeated runs |
| --quant <level> | (auto-detected) | Quantization to record (Q4_K_M, Q8_0, FP16) |
| --thinking <level> | default | Thinking level (off, low, medium, high) |
| --filter <text> | (all tasks) | Run tasks matching text (id or name) |
| --task <id> | (all tasks) | Run a single task by exact id |
| --run-id <id> | model + timestamp | Stable result directory for resume and targeted reruns |
| --rerun | off | Rerun selected tasks even if matching per-task artifacts exist |
| --rerun-failed | off | Rerun only tasks whose saved status is timeout or error |
| --assemble | off | Rebuild results.json, RESULTS.md, and evaluation stubs from saved task artifacts |
| --ollama-url <url> | http://127.0.0.1:11434 | Ollama API URL |
| --llama-cpp-url <url> | http://127.0.0.1:8080 | llama.cpp server URL |
| --task-timeout <sec> | 600 | Max seconds per task (0 = unlimited) |
| --idle-timeout <sec> | 30 | Idle seconds before task considered done |
| --context-length <n> | (model default) | Context window size |
| --output-dir <dir> | benchmark-results | Output directory |
| --mock | off | Mock mode: no model, instant pass |
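The `--rerun-failed` row says only tasks with a saved status of timeout or error are rerun. The sketch below shows that selection rule against throwaway fixtures. It assumes each per-task `result.json` holds a `"status"` string; that field name, the `pass` status, the `contact_lookup` task id, and the `failed_tasks` helper are hypothetical, and the real artifact schema may differ.

```shell
# Hypothetical artifact shape: tasks/<task-id>/result.json contains a
# "status" string. The field name and values are assumptions.
failed_tasks() {
  dir=$1
  for artifact in "$dir"/tasks/*/result.json; do
    [ -f "$artifact" ] || continue
    # Select only the two statuses the flag table names: timeout and error.
    if grep -Eq '"status"[[:space:]]*:[[:space:]]*"(timeout|error)"' "$artifact"; then
      basename "$(dirname "$artifact")"
    fi
  done
}

# Demo with temporary fixtures: one passing task and two failures.
run_dir=$(mktemp -d)
for pair in email_triage:pass calendar_create:timeout contact_lookup:error; do
  id=${pair%%:*}
  status=${pair#*:}
  mkdir -p "$run_dir/tasks/$id"
  printf '{"status": "%s"}\n' "$status" > "$run_dir/tasks/$id/result.json"
done

failed_tasks "$run_dir"
# prints:
# calendar_create
# contact_lookup
```

Matching JSON with grep is fragile and only suitable for a sketch; a real implementation would parse the file.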
Tasks evaluate Gemma models as AI agents. Each task sends a natural language request, the agent decides which tools to call, interprets results, and takes follow-up actions. The full conversation is captured for review.
| Difficulty | What It Covers | Representative Categories |
|---|---|---|
| Easy | Local smoke tests and basic tool intent | Structured output, tool intent |
| Medium | Single-workflow office tasks with concrete side effects | Email, calendar, task management, memory |
| Hard | Multi-step scheduling, coordination, and reconciliation | Email triage, meeting scheduling, client logistics, event coordination |
| Very Hard | Security, recovery, prompt-injection resistance, benchmark operations, durable guidance updates, and cross-source reconciliation | Security, error recovery, data analysis, coordination, ambiguous requests, OpenClaw operations |
Run gemmaclaw setup to auto-select the best model for your hardware. Override with --model if desired.
Each task runs through gemmaclaw agent --local. The agent reads emails, checks calendars, creates tasks, and sends emails using mock tools.
Each task writes tasks/<task-id>/result.json, a transcript, and copied session/trajectory logs when available.
Reusing a --run-id reuses matching task artifacts. Add --rerun for a selected task or --rerun-failed for only failures.
benchmark-results/
  runs/<model>__<quant>__<timestamp>/
    manifest.json            # Run id, task list, config hash, metadata
    metadata.json            # Hardware, model, quant, config, git SHA
    results.json             # Per-task conversations, tool calls, stats
    tasks/
      <task-id>/
        result.json          # Atomic per-task artifact used for resume
        transcript.txt       # Human-readable transcript for this task
        session.jsonl        # Raw OpenClaw session log, when available
        trajectory.jsonl     # Raw OpenClaw trajectory log, when available
    transcripts/             # Human-readable per-task transcripts
    RESULTS.md               # Markdown summary
  evaluations/<model>__<quant>__<timestamp>/
    <task-id>.json           # Grading criteria + evaluation scores
The harness treats each task as an independent artifact. If a process dies halfway through a suite run, keep the same --run-id and run the command again. Completed task artifacts with the same config hash are skipped, while missing tasks continue. If a task has a harness error or suspicious transcript, rerun just that task with --task <id> --rerun. If several tasks failed, use --rerun-failed.
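The skip rule described above can be sketched as a small planning function: a task is skipped only when its artifact exists and carries the current config hash. This is a rough illustration under stated assumptions, not the harness's code. The `"config_hash"` field name, the `plan_run` helper, the hash values, and the `meeting_notes` task id are hypothetical, and treating a mismatched hash as "run again" is an inference from the text.

```shell
# Hypothetical artifact shape: result.json carries a "config_hash" string.
# The field name is an assumption; the real schema may differ.
plan_run() {
  dir=$1
  current_hash=$2
  shift 2
  for task_id in "$@"; do
    artifact="$dir/tasks/$task_id/result.json"
    # Skip only when the artifact exists AND was produced by the same config.
    if [ -f "$artifact" ] && grep -Fq "\"config_hash\": \"$current_hash\"" "$artifact"; then
      echo "skip $task_id"
    else
      echo "run $task_id"
    fi
  done
}

# Demo: one matching artifact, one stale artifact, one missing task.
demo=$(mktemp -d)
mkdir -p "$demo/tasks/email_triage" "$demo/tasks/calendar_create"
printf '{"config_hash": "abc123"}\n' > "$demo/tasks/email_triage/result.json"
printf '{"config_hash": "old999"}\n' > "$demo/tasks/calendar_create/result.json"

plan_run "$demo" abc123 email_triage calendar_create meeting_notes
# prints:
# skip email_triage
# run calendar_create
# run meeting_notes
```

Keying the skip decision on the config hash rather than on file existence alone means a changed model, quantization, or thinking level would not silently reuse stale results.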
# First full Q4 run
pnpm benchmark agent --model gemma4:31b --quant Q4_K_M --thinking high --run-id q4-rtx3090-v1
# Resume after crash
pnpm benchmark agent --model gemma4:31b --quant Q4_K_M --thinking high --run-id q4-rtx3090-v1
# Rerun a suspicious task only
pnpm benchmark agent --model gemma4:31b --quant Q4_K_M --thinking high --run-id q4-rtx3090-v1 --task calendar_create --rerun
# Rebuild aggregate outputs
pnpm benchmark agent --run-id q4-rtx3090-v1 --assemble
Publish only post-template results that have been inspected and evaluated. The benchmark page supports a model-level view, clickable task rows, and a full transcript viewer. Tool calls, tool results, and thinking blocks are shown inline with the conversation and collapsed by default so readers can inspect the evidence without losing the dialogue flow.
benchmark-results/runs/
After making changes to the benchmark harness, run the smoke test to verify everything works:
# Mock only (instant, no model needed)
bash scripts/benchmark/smoke-test.sh
# Full test: mock + Ollama + llama.cpp
bash scripts/benchmark/smoke-test.sh --real
The original prompt-response benchmark is still available. It sends isolated prompts to the backend and checks text output (no agent loop, no tool calling).
pnpm benchmark --local --model gemma4:31b
pnpm benchmark --mock