Running Gemmaclaw Benchmarks

Gemmaclaw includes a built-in E2E agentic benchmark harness that evaluates Gemma models as AI agents with real tool use. The harness dispatches the full agent task suite, captures full conversations including tool calls, and saves structured results ready for PR submission.

Each task runs in an isolated environment with mock tools (email, calendar, tasks, contacts). Results are saved after every task, so an interrupted run can resume without losing completed tasks.

Quick Start

# 1. Set up gemmaclaw (auto-detects hardware, installs backend, pulls model)
gemmaclaw setup

# 2. List all benchmark tasks
pnpm benchmark agent list
pnpm benchmark agent list --suite expanded

# 3. Run the full agentic task suite (model auto-selected from your hardware)
pnpm benchmark agent
pnpm benchmark agent --suite expanded

# 4. Run with a specific model
pnpm benchmark agent --model gemma4:31b --quant Q4_K_M --thinking high

# 5. Resume a run, rerun one task, or rerun failed tasks only
pnpm benchmark agent --run-id q4-rtx3090-v1
pnpm benchmark agent --run-id q4-rtx3090-v1 --task email_triage --rerun
pnpm benchmark agent --run-id q4-rtx3090-v1 --rerun-failed

# 6. Rebuild aggregate results from saved per-task artifacts
pnpm benchmark agent --run-id q4-rtx3090-v1 --assemble

# 7. Mock mode: test the harness without a real model (instant)
pnpm benchmark agent --mock --run-id smoke

# 8. Sample the 200-variation template suite before a full sweep
pnpm benchmark agent --suite variants --sample-per-template 2 --sample-seed gemini-flash-smoke-20260513 --backend google-gemini-cli --model gemini-3-flash-preview --run-id variants-gemini-flash-sample --idle-timeout 10 --no-activity-timeout 120 --hard-cap 600

Backends

Two inference backends are documented here:

Ollama (default)

Managed model server. Supports all Gemma 4 models including multimodal. Installed automatically by gemmaclaw setup.

pnpm benchmark agent --model gemma4:e4b
pnpm benchmark agent --model gemma4:31b --ollama-url http://192.168.1.50:11434

llama.cpp

OpenAI-compatible API via llama-server. Lower overhead, useful for CPU-only systems and custom quantizations.

# Start llama-server (requires a GGUF model file)
llama-server -m /path/to/model.gguf --port 8080 --n-gpu-layers 99

# Run benchmark against it
pnpm benchmark agent --model gemma3:1b --backend llama-cpp --llama-cpp-url http://127.0.0.1:8080
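
Before a long run it can help to confirm that the chosen backend is reachable. The following is a minimal sketch rather than part of the harness: it assumes Ollama's model-listing route (/api/tags) and llama-server's health route (/health) are suitable readiness probes, and the helper names readiness_url and backend_is_up are hypothetical.

```python
import urllib.error
import urllib.request

# Assumed readiness routes: /api/tags lists models on Ollama and /health is
# served by llama-server. The keys mirror the documented --backend values.
READINESS_PATHS = {"ollama": "/api/tags", "llama-cpp": "/health"}


def readiness_url(backend: str, base_url: str) -> str:
    """Build the URL to probe for a given backend (hypothetical helper)."""
    if backend not in READINESS_PATHS:
        raise ValueError(f"unknown backend: {backend}")
    return base_url.rstrip("/") + READINESS_PATHS[backend]


def backend_is_up(backend: str, base_url: str, timeout: float = 3.0) -> bool:
    """Return True if the backend answers its readiness route with HTTP 200."""
    try:
        with urllib.request.urlopen(readiness_url(backend, base_url), timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeouts, and non-2xx responses all count as "not ready".
        return False
```

For example, backend_is_up("llama-cpp", "http://127.0.0.1:8080") could be called right after starting llama-server, since a large model may take a while to load before the server responds.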

Benchmark Test Suites

Use --suite to choose which task family to run. The default suite is the published comparison baseline. Expanded suites broaden coverage but need reference validation before their model results are published.

| Suite | Tasks | Use | Command |
| --- | --- | --- | --- |
| default | 47 | Published Gemmaclaw model comparisons | pnpm benchmark agent --suite default |
| expanded | 147 | Gemmaclaw expanded productivity, research, writing, coding, analysis, log, meeting, memory, skill, and integration tasks | pnpm benchmark agent --suite expanded |
| variants | 29400 | 147 Gemmaclaw-owned templates with 200 controlled variations each | pnpm benchmark agent --suite variants |
| all | 29594 | Development sweeps across every registered task family | pnpm benchmark agent --suite all |
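
The task counts in the table are consistent with each other, which is a quick sanity check when a list command reports an unexpected total. The arithmetic below is only a sketch; it assumes that the all suite is the sum of the other three, which matches the published numbers but is not stated explicitly.

```python
# Task counts taken from the suite table.
DEFAULT_TASKS = 47
EXPANDED_TASKS = 147
VARIATIONS_PER_TEMPLATE = 200

# Each expanded task becomes one template with 200 variations.
variants_tasks = EXPANDED_TASKS * VARIATIONS_PER_TEMPLATE

# Assumption: "all" is default + expanded + variants.
all_tasks = DEFAULT_TASKS + EXPANDED_TASKS + variants_tasks

print(variants_tasks)  # 29400
print(all_tasks)       # 29594
```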

Template Variation Suite

The benchmark now includes 29400 generated tests by turning every expanded Gemmaclaw task into a reusable template with 200 controlled variants underneath it. A template defines the skill being measured, fixture schema, expected behavior, and grading rubric. Variants alter role, context, distractors, wording, output framing, and artifact requirements while preserving the same core capability target.

For harness validation, use a deterministic sample before running the full 29400-case suite. The standard smoke path is 2 variants per template, which gives 294 tasks across all 147 templates and still exercises every template family.

pnpm benchmark agent list --suite variants --sample-per-template 2 --sample-seed gemini-flash-smoke-20260513
pnpm benchmark agent --suite variants --sample-per-template 2 --sample-seed gemini-flash-smoke-20260513 --backend google-gemini-cli --model gemini-3-flash-preview --run-id variants-gemini-flash-sample --idle-timeout 10 --no-activity-timeout 120 --hard-cap 600

| Template Family | Variants | Capability |
| --- | --- | --- |
| Expanded productivity | 200 per task template | Calendar, inbox, task, and assistant workflow coverage |
| Expanded research and writing | 200 per task template | Source synthesis, long-form reports, editing, and transformation |
| Expanded coding and skills | 200 per task template | Code review, debugging, skill composition, and implementation planning |
| Expanded analysis and logs | 200 per task template | CSV analysis, log triage, meeting extraction, and structured decisions |
| Expanded integrations | 200 per task template | Safe simulated browser, calendar, email, and external-service workflows |

Generated variants are not publishable by default. They must pass reference e2e validation, harness-bug review, clean model runs, judge evaluation, and site QA before they appear as comparable results.
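
To illustrate what a deterministic sample per template means, here is one possible approach, sketched under assumptions: it ranks variant indices by a hash of the seed, template id, and index, then keeps the first n. The harness may select variants differently; sample_variants and the template ids below are hypothetical.

```python
import hashlib


def sample_variants(template_id: str, total: int, n: int, seed: str) -> list[int]:
    """Pick n of `total` variant indices, stable for a given seed and template."""

    def rank(index: int) -> str:
        key = f"{seed}:{template_id}:{index}".encode("utf-8")
        return hashlib.sha256(key).hexdigest()

    # Sorting by hash gives a pseudo-random but repeatable order.
    chosen = sorted(range(total), key=rank)[:n]
    return sorted(chosen)


# Hypothetical template ids standing in for the 147 expanded templates.
templates = [f"template_{i:03d}" for i in range(147)]
sample = {
    t: sample_variants(t, total=200, n=2, seed="gemini-flash-smoke-20260513")
    for t in templates
}

print(sum(len(v) for v in sample.values()))  # 294
```

Because the selection depends only on the seed and the template id, repeating the call with the same seed yields the same variants, which is the property that makes a sampled run resumable and comparable.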

Configuration Options

| Flag | Default | Description |
| --- | --- | --- |
| --model <name> | (auto from hardware) | Model to test (e.g. gemma4:e4b, gemma4:31b) |
| --backend <type> | ollama | Backend: ollama or llama-cpp |
| --suite <name> | default | Task suite: default, expanded, variants, or all |
| --sample-per-template <n> | (off) | For generated variation suites, run a deterministic sample of n variants from each template |
| --sample-seed <text> | default | Seed used to pick stable sampled variants across repeated runs |
| --quant <level> | (auto-detected) | Quantization to record (Q4_K_M, Q8_0, FP16) |
| --thinking <level> | default | Thinking level (off, low, medium, high) |
| --filter <text> | (all tasks) | Run tasks matching text (id or name) |
| --task <id> | (all tasks) | Run a single task by exact id |
| --run-id <id> | model + timestamp | Stable result directory for resume and targeted reruns |
| --rerun | off | Rerun selected tasks even if matching per-task artifacts exist |
| --rerun-failed | off | Rerun only tasks whose saved status is timeout or error |
| --assemble | off | Rebuild results.json, RESULTS.md, and evaluation stubs from saved task artifacts |
| --ollama-url <url> | http://127.0.0.1:11434 | Ollama API URL |
| --llama-cpp-url <url> | http://127.0.0.1:8080 | llama.cpp server URL |
| --task-timeout <sec> | 600 | Max seconds per task (0 = unlimited) |
| --idle-timeout <sec> | 30 | Idle seconds before task considered done |
| --context-length <n> | (model default) | Context window size |
| --output-dir <dir> | benchmark-results | Output directory |
| --mock | off | Mock mode: no model, instant pass |
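
When runs are scripted (for example, sweeping several quantizations), it may be convenient to assemble the command line from a dictionary of options. The sketch below is not part of Gemmaclaw; build_agent_command is a hypothetical helper that only maps keyword names onto the flags listed in the table and does not validate them.

```python
import shlex


def build_agent_command(**options) -> list[str]:
    """Turn keyword options into argv for `pnpm benchmark agent` (hypothetical helper).

    Keys use underscores (run_id -> --run-id). True emits a bare flag;
    False or None omits the flag entirely.
    """
    argv = ["pnpm", "benchmark", "agent"]
    for key, value in options.items():
        flag = "--" + key.replace("_", "-")
        if value is True:
            argv.append(flag)
        elif value is not None and value is not False:
            argv.extend([flag, str(value)])
    return argv


cmd = build_agent_command(
    model="gemma4:31b",
    quant="Q4_K_M",
    thinking="high",
    run_id="q4-rtx3090-v1",
    rerun_failed=True,
)
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True). Building an argument list instead of a single string avoids shell quoting problems with values such as URLs.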

The Agent Task Suite

Tasks evaluate Gemma models as AI agents. Each task sends a natural language request; the agent then decides which tools to call, interprets the results, and takes follow-up actions. The full conversation is captured for review.

| Difficulty | What It Covers | Representative Categories |
| --- | --- | --- |
| Easy | Local smoke tests and basic tool intent | Structured output, tool intent |
| Medium | Single-workflow office tasks with concrete side effects | Email, calendar, task management, memory |
| Hard | Multi-step scheduling, coordination, and reconciliation | Email triage, meeting scheduling, client logistics, event coordination |
| Very Hard | Security, recovery, prompt-injection resistance, benchmark operations, durable guidance updates, and cross-source reconciliation | Security, error recovery, data analysis, coordination, ambiguous requests, OpenClaw operations |

How It Works

  1. Hardware detection: The harness uses the same model catalog as gemmaclaw setup to auto-select the best model for your hardware. Override with --model if desired.
  2. Seed mock tools: Before each task, a realistic workspace is created with emails, calendar events, contacts, and tasks. The seeded data is professional and workplace themed.
  3. Isolated environment: Each task runs in a fresh gemmaclaw home directory. No state leaks between tasks.
  4. Dispatch task: The task prompt is sent via gemmaclaw agent --local. The agent reads emails, checks calendars, creates tasks, and sends emails using the mock tools.
  5. Capture conversation: The full agent loop is recorded: every tool call, tool result, thinking block, and follow-up action.
  6. Save per-task results: After each task, the harness writes tasks/<task-id>/result.json, a transcript, and copied session/trajectory logs when available.
  7. Resume or rerun: A later command with the same --run-id reuses matching task artifacts. Add --rerun for a selected task or --rerun-failed for only failures.
  8. Evaluation (separate step): Results are reviewed against grading criteria. Scores are added to the evaluation files and published to the site.
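
The isolate, dispatch, and save steps above can be pictured as a simple loop. This is a simplified sketch, not the harness source: the agent call is replaced by a stand-in function, a temporary directory stands in for the fresh gemmaclaw home, and the result fields (task, status, tool_calls) are assumed names.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Callable


def write_json_atomic(path: Path, payload: dict) -> None:
    """Write to a temp file, then rename, so readers never see a partial file."""
    path.parent.mkdir(parents=True, exist_ok=True)
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_text(json.dumps(payload, indent=2), encoding="utf-8")
    os.replace(tmp, path)


def run_suite(task_ids: list[str], run_dir: Path, run_task: Callable[[str, Path], dict]) -> None:
    """Run each task in a fresh home directory and save its result immediately."""
    for task_id in task_ids:
        # A throwaway directory per task stands in for the isolated home.
        with tempfile.TemporaryDirectory(prefix=f"{task_id}-") as home:
            try:
                result = run_task(task_id, Path(home))
            except Exception as exc:  # record the failure and move on to the next task
                result = {"status": "error", "error": str(exc)}
        write_json_atomic(run_dir / "tasks" / task_id / "result.json", {"task": task_id, **result})


def fake_agent(task_id: str, home: Path) -> dict:
    """Stand-in for dispatching the prompt to the agent (hypothetical)."""
    (home / "inbox.json").write_text("[]", encoding="utf-8")  # seed a mock tool fixture
    return {"status": "completed", "tool_calls": []}


with tempfile.TemporaryDirectory() as out:
    run_suite(["email_triage", "calendar_create"], Path(out), fake_agent)
    saved = sorted(p.parent.name for p in Path(out).glob("tasks/*/result.json"))

print(saved)  # ['calendar_create', 'email_triage']
```

Writing each result as soon as its task finishes is what makes an interrupted run recoverable: at worst, only the task that was in flight is lost.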

Results Directory

benchmark-results/
  runs/<model>__<quant>__<timestamp>/
    manifest.json        # Run id, task list, config hash, metadata
    metadata.json        # Hardware, model, quant, config, git SHA
    results.json         # Per-task conversations, tool calls, stats
    tasks/
      <task-id>/
        result.json      # Atomic per-task artifact used for resume
        transcript.txt   # Human-readable transcript for this task
        session.jsonl    # Raw OpenClaw session log, when available
        trajectory.jsonl # Raw OpenClaw trajectory log, when available
    transcripts/         # Human-readable per-task transcripts
    RESULTS.md           # Markdown summary
  evaluations/<model>__<quant>__<timestamp>/
    <task-id>.json       # Grading criteria + evaluation scores
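
Given this layout, a short script can summarize a run before a PR is opened. The sketch below assumes each result.json is a JSON object with a status field; that field name is a guess based on the --rerun-failed description, so adjust it to the real artifact schema.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def summarize_run(run_dir: Path) -> Counter:
    """Count saved task statuses under <run_dir>/tasks/<task-id>/result.json."""
    counts: Counter = Counter()
    for result_path in sorted(run_dir.glob("tasks/*/result.json")):
        try:
            data = json.loads(result_path.read_text(encoding="utf-8"))
        except json.JSONDecodeError:
            counts["unreadable"] += 1
            continue
        # "status" is an assumed field name, not a confirmed part of the schema.
        status = data.get("status", "unknown") if isinstance(data, dict) else "unknown"
        counts[str(status)] += 1
    return counts


# Demo against a throwaway directory that mimics the documented layout.
with tempfile.TemporaryDirectory() as tmp:
    demo_run = Path(tmp)
    for task_id, status in [("email_triage", "completed"), ("calendar_create", "timeout")]:
        task_dir = demo_run / "tasks" / task_id
        task_dir.mkdir(parents=True)
        (task_dir / "result.json").write_text(json.dumps({"status": status}), encoding="utf-8")
    summary = summarize_run(demo_run)

print(dict(summary))
```

Pointing summarize_run at a real run directory under benchmark-results/runs/ gives a quick view of how many tasks would be picked up by --rerun-failed.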

Crash Recovery and Targeted Reruns

The harness treats each task as an independent artifact. If a process dies halfway through a suite run, keep the same --run-id and run the command again. Completed task artifacts with the same config hash are skipped, and the remaining tasks run as usual. If a task has a harness error or a suspicious transcript, rerun just that task with --task <id> --rerun. If several tasks failed, use --rerun-failed.

# First full Q4 run
pnpm benchmark agent --model gemma4:31b --quant Q4_K_M --thinking high --run-id q4-rtx3090-v1

# Resume after crash
pnpm benchmark agent --model gemma4:31b --quant Q4_K_M --thinking high --run-id q4-rtx3090-v1

# Rerun a suspicious task only
pnpm benchmark agent --model gemma4:31b --quant Q4_K_M --thinking high --run-id q4-rtx3090-v1 --task calendar_create --rerun

# Rebuild aggregate outputs
pnpm benchmark agent --run-id q4-rtx3090-v1 --assemble
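
The skip-or-rerun decision described in this section can be sketched as a small function. This is an illustration under assumptions, not the harness implementation: the config_hash and status field names, the SHA-256 hashing scheme, and the choice to run tasks with no saved artifact even under --rerun-failed are all hypothetical.

```python
import hashlib
import json
from typing import Optional

# Statuses that --rerun-failed targets, per the options table.
FAILED_STATUSES = {"timeout", "error"}


def config_hash(config: dict) -> str:
    """Hash a config dict in canonical form so key order does not matter."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def should_run(
    artifact: Optional[dict],
    current_hash: str,
    rerun: bool = False,
    rerun_failed: bool = False,
) -> bool:
    """Decide whether a selected task needs to run (again)."""
    if artifact is None:
        return True  # nothing saved yet (assumed to run in every mode)
    if artifact.get("config_hash") != current_hash:
        return True  # saved under a different configuration
    if rerun:
        return True  # explicit rerun of the selected task
    if rerun_failed:
        return artifact.get("status") in FAILED_STATUSES
    return False  # matching artifact exists, so skip


cfg = {"model": "gemma4:31b", "quant": "Q4_K_M", "thinking": "high"}
h = config_hash(cfg)

print(should_run(None, h))                                                        # True
print(should_run({"config_hash": h, "status": "completed"}, h))                   # False
print(should_run({"config_hash": h, "status": "timeout"}, h, rerun_failed=True))  # True
```

Comparing a hash of the configuration rather than only the run id is what would prevent results produced with a different model or thinking level from being silently reused.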

Publishing Requirements

Publish only post-template results that have been inspected and evaluated. The benchmark page supports a model-level view, clickable task rows, and a full transcript viewer. Tool calls, tool results, and thinking blocks are shown inline with the conversation and collapsed by default so readers can inspect the evidence without losing the dialogue flow.

Metadata Captured

Each run records the hardware, model, quantization, configuration, and git SHA in metadata.json.

Submitting Results

  1. Run the benchmark on your hardware
  2. Check the results in benchmark-results/runs/
  3. Open a PR adding your results directory to the gemmaclaw repo
  4. Results will be reviewed, evaluated, and published to the site

Smoke Test

After making changes to the benchmark harness, run the smoke test to verify everything works:

# Mock only (instant, no model needed)
bash scripts/benchmark/smoke-test.sh

# Full test: mock + Ollama + llama.cpp
bash scripts/benchmark/smoke-test.sh --real

Prompt-Response Mode (Legacy)

The original prompt-response benchmark is still available. It sends isolated prompts to the backend and checks text output (no agent loop, no tool calling).

pnpm benchmark --local --model gemma4:31b
pnpm benchmark --mock