functiongemma-270m:latest (ollama)

Tiny (270M Function)268.10MFunction-callingQ4_K_MThinking: high2026-05-13

0% 0/2 tasks passed

Model class: Tiny (270M Function) Parameters: 268.10M Architecture: Function-calling Quantization: Q4_K_M Thinking: high Backend: ollama GPU: GPU (via Ollama host) CPU: AMD Ryzen 9 5900X 12-Core Processor RAM: 121GB Generation speed: Pending measurement Total time: 9.1m Failure modes: None

Task Results

Click any task row to expand the full prompt, conversation transcript, and judge evaluation.

Task	Category	Score	Speed	Time
▸ ✗ Easy JSON Fact Extraction	structured_output	0/5	N/A	23.6s
Difficulty: easy Scoring: LLM judge Tool calls: 0 Time: 23.6s Tests whether a small on-device model can extract explicit facts from one short workplace message into a fixed JSON schema without extra text. PROMPT Extract the facts from this message. Return ONLY valid JSON with exactly these keys: person, date, time, action, priority. Use ISO date YYYY-MM-DD and 24-hour time HH:MM. Message: Maya Chen needs me to send the revised launch checklist on 2026-05-08 at 15:00. She marked it high priority. FULL TRANSCRIPT Turn 1 User Extract the facts from this message. Return ONLY valid JSON with exactly these keys: person, date, time, action, priority. Use ISO date YYYY-MM-DD and 24-hour time HH:MM. Message: Maya Chen needs me to send the revised launch checklist on 2026-05-08 at 15:00. She marked it high priority. Turn 2 Assistant The original report indicates the report was submitted to the department. Could you provide the revised launch checklist on 24-hour time HH:MM. Maya Chen needs me to send it. JUDGE EVALUATION (0/5) by cc-acp FunctionGemma 270M completely failed this JSON extraction task. The model produced confused prose that partially restates the prompt, invents unrelated content about a submitted report, and asks a clarifying question instead of extracting the facts. This is a fundamental model capability failure: the 270M parameter model is trained specifically for native function-calling format (developer-role system prompts with <start_function_call> token outputs) and cannot generalize to arbitrary JSON extraction from a chat-style prompt. Score 0/5. ✗ Must return valid JSON and no surrounding explanation (0 pts) — Model produced garbled prose text, not JSON. The response invents context ('The original report indicates the report was submitted to the department') unrelated to the prompt and asks for clarification rather than extracting facts. ✗ Must include exactly the keys person, date, time, action, priority (0 pts) — No JSON output was produced, so none of the required keys are present. ✗ person must be Maya Chen (0 pts) — No JSON output produced. Maya Chen is mentioned in the garbled prose but not as a structured field. ✗ date must be 2026-05-08 and time must be 15:00 (0 pts) — No JSON output produced. No date or time values extracted. ✗ action must preserve the revised launch checklist request (0 pts) — No JSON output produced. No action field present. ✗ priority must be high (0 pts) — No JSON output produced. No priority field present.
▸ ✗ Easy Single-Step Tool Intent	tool_intent	0/5	N/A	23.6s
Difficulty: easy Scoring: LLM judge Tool calls: 0 Time: 23.6s Tests whether a small on-device model can choose one safe tool-like intent and produce valid arguments without performing side effects. PROMPT Choose exactly one action from this schema and return ONLY JSON. Do not call any real tools. Allowed actions: - create_task arguments: title, due, priority - create_calendar_event arguments: title, date, time, duration_minutes - send_email arguments: to, subject, body Request: Please remind me to send the Q2 forecast to Priya by Friday at 3 PM. Mark it high priority. Return shape: {"action":"...","arguments":{...}} FULL TRANSCRIPT Turn 1 User Choose exactly one action from this schema and return ONLY JSON. Do not call any real tools. Allowed actions: - create_task arguments: title, due, priority - create_calendar_event arguments: title, date, time, duration_minutes - send_email arguments: to, subject, body Request: Please remind me to send the Q2 forecast to Priya by Friday at 3 PM. Mark it high priority. Return shape: {"action":"...","arguments":{...}} JUDGE EVALUATION (0/5) by cc-acp FunctionGemma 270M produced no output for this tool-intent task. The assistant message was empty (content=[], 1 output token = stop token only). This is a complete model failure: the model received a clear JSON-schema selection task but generated no response beyond a stop token. This behavior is consistent with the model being fine-tuned for native function-calling format with <start_function_call> tokens — when run in plain chat mode without tool definitions, the model immediately stops without generating content. Score 0/5. ✗ Must return valid JSON and no surrounding explanation (0 pts) — Model produced no output at all. The assistant message has empty content array (0 content blocks, 1 output token indicating a stop token only). No JSON was returned. ✗ Must choose exactly one top-level action (0 pts) — No response produced — no action selected. ✗ Action must be create_task, not send_email or create_calendar_event (0 pts) — No response produced. ✗ Arguments must include title, due, and priority (0 pts) — No response produced — no arguments present. ✗ Title must mention sending the Q2 forecast to Priya (0 pts) — No response produced. ✗ Due must mention Friday at 3 PM (0 pts) — No response produced. ✗ Priority must be high (0 pts) — No response produced.

Methodology

Each task is scored by an LLM judge against the task rubric after the run is inspected for harness errors. A task counts as a pass when it scores at least 60%. Speed is measured in tokens per second when available. Hardware is auto-detected including WSL2 GPU detection.