Gemmaclaw - Run Benchmarks

Running Gemmaclaw Benchmarks

Gemmaclaw includes a built-in E2E agentic benchmark harness that evaluates Gemma models as AI agents with real tool use. The harness dispatches the full agent task suite, captures full conversations including tool calls, and saves structured results ready for PR submission.

Each task runs in an isolated environment with mock tools (email, calendar, tasks, contacts). Results are saved after every task, so an interrupted run can resume without losing completed tests.

Quick Start

# 1. Set up gemmaclaw (auto-detects hardware, installs backend, pulls model)
gemmaclaw setup

# 2. List all benchmark tasks
pnpm benchmark agent list
pnpm benchmark agent list --suite expanded

# 3. Run the full agentic task suite (model auto-selected from your hardware)
pnpm benchmark agent
pnpm benchmark agent --suite expanded

# 4. Run with a specific model
pnpm benchmark agent --model gemma4:31b --quant Q4_K_M --thinking high

# 5. Resume a run, rerun one task, or rerun failed tasks only
pnpm benchmark agent --run-id q4-rtx3090-v1
pnpm benchmark agent --run-id q4-rtx3090-v1 --task email_triage --rerun
pnpm benchmark agent --run-id q4-rtx3090-v1 --rerun-failed

# 6. Rebuild aggregate results from saved per-task artifacts
pnpm benchmark agent --run-id q4-rtx3090-v1 --assemble

# 7. Mock mode: test the harness without a real model (instant)
pnpm benchmark agent --mock --run-id smoke

# 8. Sample the 200-variation template suite before a full sweep
pnpm benchmark agent --suite variants --sample-per-template 2 --sample-seed gemini-flash-smoke-20260513 --backend google-gemini-cli --model gemini-3-flash-preview --run-id variants-gemini-flash-sample --idle-timeout 10 --no-activity-timeout 120 --hard-cap 600

Backends

The benchmark supports two inference backends:

Ollama (default)

Managed model server. Supports all Gemma 4 models including multimodal. Installed automatically by gemmaclaw setup.

pnpm benchmark agent --model gemma4:e4b
pnpm benchmark agent --model gemma4:31b --ollama-url http://192.168.1.50:11434

llama.cpp

OpenAI-compatible API via llama-server. Lower overhead, useful for CPU-only systems and custom quantizations.

# Start llama-server (requires a GGUF model file)
llama-server -m /path/to/model.gguf --port 8080 --n-gpu-layers 99

# Run benchmark against it
pnpm benchmark agent --model gemma3:1b --backend llama-cpp --llama-cpp-url http://127.0.0.1:8080

Benchmark Test Suites

Use --suite to choose which task family to run. The default suite is the published comparison baseline. Expanded suites broaden coverage but need reference validation before their model results are published.

Suite	Tasks	Use	Command
`default`	47	Published Gemmaclaw model comparisons	`pnpm benchmark agent --suite default`
`expanded`	147	Gemmaclaw expanded productivity, research, writing, coding, analysis, log, meeting, memory, skill, and integration tasks	`pnpm benchmark agent --suite expanded`
`variants`	29400	147 Gemmaclaw-owned templates with 200 controlled variations each	`pnpm benchmark agent --suite variants`
`all`	29594	Development sweeps across every registered task family	`pnpm benchmark agent --suite all`

Template Variation Suite

The benchmark now includes 29400 generated tests by turning every expanded Gemmaclaw task into a reusable template with 200 controlled variants underneath it. A template defines the skill being measured, fixture schema, expected behavior, and grading rubric. Variants alter role, context, distractors, wording, output framing, and artifact requirements while preserving the same core capability target.

For harness validation, use a deterministic sample before running the full 29400-case suite. The standard smoke path is 2 variants per template, which gives 294 tasks across all 147 templates and still exercises every template family.

pnpm benchmark agent list --suite variants --sample-per-template 2 --sample-seed gemini-flash-smoke-20260513
pnpm benchmark agent --suite variants --sample-per-template 2 --sample-seed gemini-flash-smoke-20260513 --backend google-gemini-cli --model gemini-3-flash-preview --run-id variants-gemini-flash-sample --idle-timeout 10 --no-activity-timeout 120 --hard-cap 600

Template Family	Variants	Capability
Expanded productivity	200 per task template	Calendar, inbox, task, and assistant workflow coverage
Expanded research and writing	200 per task template	Source synthesis, long-form reports, editing, and transformation
Expanded coding and skills	200 per task template	Code review, debugging, skill composition, and implementation planning
Expanded analysis and logs	200 per task template	CSV analysis, log triage, meeting extraction, and structured decisions
Expanded integrations	200 per task template	Safe simulated browser, calendar, email, and external-service workflows

Generated variants are not publishable by default. They must pass reference e2e validation, harness-bug review, clean model runs, judge evaluation, and site QA before they appear as comparable results.

Configuration Options

Flag	Default	Description
`--model <name>`	(auto from hardware)	Model to test (e.g. gemma4:e4b, gemma4:31b)
`--backend <type>`	ollama	Backend: ollama or llama-cpp
`--suite <name>`	default	Task suite: default, expanded, variants, or all
`--sample-per-template <n>`	(off)	For generated variation suites, run a deterministic sample of n variants from each template
`--sample-seed <text>`	default	Seed used to pick stable sampled variants across repeated runs
`--quant <level>`	(auto-detected)	Quantization to record (Q4_K_M, Q8_0, FP16)
`--thinking <level>`	default	Thinking level (off, low, medium, high)
`--filter <text>`	(all tasks)	Run tasks matching text (id or name)
`--task <id>`	(all tasks)	Run a single task by exact id
`--run-id <id>`	model + timestamp	Stable result directory for resume and targeted reruns
`--rerun`	off	Rerun selected tasks even if matching per-task artifacts exist
`--rerun-failed`	off	Rerun only tasks whose saved status is timeout or error
`--assemble`	off	Rebuild `results.json`, `RESULTS.md`, and evaluation stubs from saved task artifacts
`--ollama-url <url>`	http://127.0.0.1:11434	Ollama API URL
`--llama-cpp-url <url>`	http://127.0.0.1:8080	llama.cpp server URL
`--task-timeout <sec>`	600	Max seconds per task (0 = unlimited)
`--idle-timeout <sec>`	30	Idle seconds before task considered done
`--context-length <n>`	(model default)	Context window size
`--output-dir <dir>`	benchmark-results	Output directory
`--mock`	off	Mock mode: no model, instant pass

The Agent Task Suite

Tasks evaluate Gemma models as AI agents. Each task sends a natural language request, the agent decides which tools to call, interprets results, and takes follow-up actions. The full conversation is captured for review.

Difficulty	What It Covers	Representative Categories
Easy	Local smoke tests and basic tool intent	Structured output, tool intent
Medium	Single-workflow office tasks with concrete side effects	Email, calendar, task management, memory
Hard	Multi-step scheduling, coordination, and reconciliation	Email triage, meeting scheduling, client logistics, event coordination
Very Hard	Security, recovery, prompt-injection resistance, benchmark operations, durable guidance updates, and cross-source reconciliation	Security, error recovery, data analysis, coordination, ambiguous requests, OpenClaw operations

How It Works

Hardware detection: The harness uses the same model catalog as gemmaclaw setup to auto-select the best model for your hardware. Override with --model if desired.
Seed mock tools: Before each task, a realistic workspace is created with emails, calendar events, contacts, and tasks. Professional/workplace themed.
Isolated environment: Each task runs in a fresh gemmaclaw home directory. No state leaks between tasks.
Dispatch task: The task prompt is sent via gemmaclaw agent --local. The agent reads emails, checks calendars, creates tasks, sends emails using mock tools.
Capture conversation: The full agent loop is recorded: every tool call, tool result, thinking block, and follow-up action.
Save per-task results: After each task, the harness writes tasks/<task-id>/result.json, a transcript, and copied session/trajectory logs when available.
Resume or rerun: A later command with the same --run-id reuses matching task artifacts. Add --rerun for a selected task or --rerun-failed for only failures.
Evaluation (separate step): Results are reviewed against grading criteria. Scores are added to the evaluation files and published to the site.

Results Directory

benchmark-results/
  runs/<model>__<quant>__<timestamp>/
    manifest.json        # Run id, task list, config hash, metadata
    metadata.json        # Hardware, model, quant, config, git SHA
    results.json         # Per-task conversations, tool calls, stats
    tasks/
      <task-id>/
        result.json      # Atomic per-task artifact used for resume
        transcript.txt   # Human-readable transcript for this task
        session.jsonl    # Raw OpenClaw session log, when available
        trajectory.jsonl # Raw OpenClaw trajectory log, when available
    transcripts/         # Human-readable per-task transcripts
    RESULTS.md           # Markdown summary
  evaluations/<model>__<quant>__<timestamp>/
    <task-id>.json       # Grading criteria + evaluation scores

Crash Recovery and Targeted Reruns

The harness treats each task as an independent artifact. If a process dies halfway through a suite run, keep the same --run-id and run the command again. Completed task artifacts with the same config hash are skipped, while missing tasks continue. If a task has a harness error or suspicious transcript, rerun just that task with --task <id> --rerun. If several tasks failed, use --rerun-failed.

# First full Q4 run
pnpm benchmark agent --model gemma4:31b --quant Q4_K_M --thinking high --run-id q4-rtx3090-v1

# Resume after crash
pnpm benchmark agent --model gemma4:31b --quant Q4_K_M --thinking high --run-id q4-rtx3090-v1

# Rerun a suspicious task only
pnpm benchmark agent --model gemma4:31b --quant Q4_K_M --thinking high --run-id q4-rtx3090-v1 --task calendar_create --rerun

# Rebuild aggregate outputs
pnpm benchmark agent --run-id q4-rtx3090-v1 --assemble

Publishing Requirements

Publish only post-template results that have been inspected and evaluated. The benchmark page supports a model-level view, clickable task rows, and a full transcript viewer. Tool calls, tool results, and thinking blocks are shown inline with the conversation and collapsed by default so readers can inspect the evidence without losing the dialogue flow.

Metadata Captured

Hardware: GPU name, VRAM, CPU model, core count, total RAM
Model: Name, parameter count, quantization level, format (from Ollama API)
Config: Backend type, thinking level, context length, URLs
Environment: Git SHA, Node.js version, OS/platform, timestamps

Submitting Results

Run the benchmark on your hardware
Check the results in benchmark-results/runs/
Open a PR adding your results directory to the gemmaclaw repo
Results will be reviewed, evaluated, and published to the site

Smoke Test

After making changes to the benchmark harness, run the smoke test to verify everything works:

# Mock only (instant, no model needed)
bash scripts/benchmark/smoke-test.sh

# Full test: mock + Ollama + llama.cpp
bash scripts/benchmark/smoke-test.sh --real

Prompt-Response Mode (Legacy)

The original prompt-response benchmark is still available. It sends isolated prompts to the backend and checks text output (no agent loop, no tool calling).

pnpm benchmark --local --model gemma4:31b
pnpm benchmark --mock