| Task | Category | Score | Speed | Time | Failure |
|---|---|---|---|---|---|
| ✗ Easy JSON Fact Extraction | structured_output | 0/5 | 20.1 tok/s output-est | 23.6s | |
|
Difficulty: easy
Scoring: LLM judge
Tool calls: 0
Time: 23.6s
Tests whether a small on-device model can extract explicit facts from one short workplace message into a fixed JSON schema without extra text.
PROMPT
Extract the facts from this message. Return ONLY valid JSON with exactly these keys: person, date, time, action, priority. Use ISO date YYYY-MM-DD and 24-hour time HH:MM. Message: Maya Chen needs me to send the revised launch checklist on 2026-05-08 at 15:00. She marked it high priority.
FULL TRANSCRIPT
Turn 1 User
Extract the facts from this message. Return ONLY valid JSON with exactly these keys: person, date, time, action, priority. Use ISO date YYYY-MM-DD and 24-hour time HH:MM. Message: Maya Chen needs me to send the revised launch checklist on 2026-05-08 at 15:00. She marked it high priority.
Turn 2 Assistant
The original report indicates the report was submitted to the department. Could you provide the revised launch checklist on 24-hour time HH:MM. Maya Chen needs me to send it.
JUDGE EVALUATION (0/5)
FunctionGemma 270M completely failed this JSON extraction task. The model produced confused prose that partially restates the prompt, invents unrelated content about a submitted report, and asks a clarifying question instead of extracting the facts. This is a fundamental model capability failure: the 270M parameter model is trained specifically for native function-calling format (developer-role system prompts with <start_function_call> token outputs) and cannot generalize to arbitrary JSON extraction from a chat-style prompt. Score 0/5.
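The format rules stated in the prompt above (exactly five keys, an ISO date, a 24-hour time) can also be checked mechanically. The following is a minimal sketch under that assumption, not the harness's scorer, which this report describes as an LLM judge. The names `check_extraction` and `REQUIRED_KEYS` are hypothetical and do not come from the benchmark.

```python
import json
import re
from datetime import datetime

# Keys listed in the task prompt; the reply must contain exactly these.
REQUIRED_KEYS = {"person", "date", "time", "action", "priority"}


def check_extraction(raw: str) -> list[str]:
    """Return a list of format problems; an empty list means the rules are met."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return ["reply is not valid JSON"]
    if not isinstance(data, dict):
        return ["top-level value is not a JSON object"]

    problems = []
    if set(data) != REQUIRED_KEYS:
        problems.append("keys differ from the required set")

    # Require zero-padded YYYY-MM-DD, then confirm it is a real calendar date.
    date_text = str(data.get("date", ""))
    if re.fullmatch(r"\d{4}-\d{2}-\d{2}", date_text):
        try:
            datetime.strptime(date_text, "%Y-%m-%d")
        except ValueError:
            problems.append("date is not a real calendar date")
    else:
        problems.append("date is not YYYY-MM-DD")

    # 24-hour HH:MM, from 00:00 to 23:59.
    if not re.fullmatch(r"([01]\d|2[0-3]):[0-5]\d", str(data.get("time", ""))):
        problems.append("time is not 24-hour HH:MM")
    return problems


# A reply built only from facts stated in the message.
good_reply = (
    '{"person": "Maya Chen", "date": "2026-05-08", "time": "15:00", '
    '"action": "send the revised launch checklist", "priority": "high"}'
)
# The assistant turn recorded in the transcript.
model_reply = (
    "The original report indicates the report was submitted to the department. "
    "Could you provide the revised launch checklist on 24-hour time HH:MM. "
    "Maya Chen needs me to send it."
)
```

Under these assumptions, `check_extraction(good_reply)` returns an empty list, while the recorded assistant turn fails at the first step because it is prose rather than JSON. A check like this covers only the format; whether the extracted values are right is still left to the judge.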
|
|||||
| ✗ Easy Single-Step Tool Intent | tool_intent | 0/5 | N/A | 23.6s | |
|
Difficulty: easy
Scoring: LLM judge
Tool calls: 0
Time: 23.6s
Tests whether a small on-device model can choose one safe tool-like intent and produce valid arguments without performing side effects.
PROMPT
Choose exactly one action from this schema and return ONLY JSON. Do not call any real tools.
Allowed actions:
- create_task arguments: title, due, priority
- create_calendar_event arguments: title, date, time, duration_minutes
- send_email arguments: to, subject, body
Request: Please remind me to send the Q2 forecast to Priya by Friday at 3 PM. Mark it high priority.
Return shape: {"action":"...","arguments":{...}}
FULL TRANSCRIPT
Turn 1 User
Choose exactly one action from this schema and return ONLY JSON. Do not call any real tools.
Allowed actions:
- create_task arguments: title, due, priority
- create_calendar_event arguments: title, date, time, duration_minutes
- send_email arguments: to, subject, body
Request: Please remind me to send the Q2 forecast to Priya by Friday at 3 PM. Mark it high priority.
Return shape: {"action":"...","arguments":{...}}
JUDGE EVALUATION (0/5)
FunctionGemma 270M produced no output for this tool-intent task. The assistant message was empty (content=[], 1 output token = stop token only). This is a complete model failure: the model received a clear JSON-schema selection task but generated no response beyond a stop token. This behavior is consistent with the model being fine-tuned for native function-calling format with <start_function_call> tokens — when run in plain chat mode without tool definitions, the model immediately stops without generating content. Score 0/5.
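The return shape and allowed actions in the prompt above lend themselves to a structural check. The sketch below is illustrative only: `check_tool_intent` and `ALLOWED_ACTIONS` are hypothetical names, the action table is copied from the prompt, and the sketch assumes that unknown argument names are an error while missing ones are not, since the prompt does not say which arguments are required.

```python
import json

# Allowed actions and their argument names, copied from the task prompt.
ALLOWED_ACTIONS = {
    "create_task": {"title", "due", "priority"},
    "create_calendar_event": {"title", "date", "time", "duration_minutes"},
    "send_email": {"to", "subject", "body"},
}


def check_tool_intent(raw: str) -> list[str]:
    """Return a list of shape problems; an empty list means the shape is valid."""
    if not raw.strip():
        return ["reply is empty"]
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return ["reply is not valid JSON"]
    if not isinstance(data, dict) or set(data) != {"action", "arguments"}:
        return ["reply does not match the return shape"]

    action, arguments = data["action"], data["arguments"]
    if not isinstance(action, str) or action not in ALLOWED_ACTIONS:
        return ["action is not one of the allowed actions"]
    if not isinstance(arguments, dict):
        return ["arguments is not a JSON object"]

    # Assumption: only argument names outside the schema are flagged.
    unknown = sorted(set(arguments) - ALLOWED_ACTIONS[action])
    return [f"unknown argument: {name}" for name in unknown]


# Hypothetical reply that fits the shape; the rubric's expected answer
# is not shown in this report.
example_reply = (
    '{"action": "create_task", "arguments": {"title": "Send the Q2 forecast '
    'to Priya", "due": "Friday 15:00", "priority": "high"}}'
)
```

With this sketch, the empty assistant message described in the judge evaluation yields `["reply is empty"]`, and `example_reply` passes the structural check. It does not decide whether the chosen action is the right one for the request.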
|
|||||
Each task is scored by an LLM judge against the task rubric after the run is inspected for harness errors. A task counts as a pass when it scores at least 60% of the available points (3 of 5 for the tasks above). Speed is measured in tokens per second when available. Hardware is auto-detected, including GPUs under WSL2.
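As a worked example of the 60% pass rule, the sketch below applies it to the five-point scores shown in the table. The helper name `is_pass` and the constant `PASS_PERCENT` are hypothetical, and the sketch assumes that a score exactly at the threshold counts as a pass, as "at least 60%" suggests.

```python
# Pass threshold from the scoring note, as a whole-number percentage.
PASS_PERCENT = 60


def is_pass(score: int, max_score: int = 5) -> bool:
    """Return True when score is at least 60% of max_score.

    Integer arithmetic avoids floating-point rounding at the boundary.
    """
    return score * 100 >= PASS_PERCENT * max_score


# Both tasks in this report scored 0/5, so neither passes.
results = {
    "Easy JSON Fact Extraction": is_pass(0),
    "Easy Single-Step Tool Intent": is_pass(0),
}
```

On a five-point scale this makes 3/5 the lowest passing score and 2/5 a fail.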