Umwelten MCP Evaluation · Run 004

Can LLMs use real-world tools to tell a story?

We gave 39 models access to a Rivian owner's real driving and charging data via MCP tools and asked them to summarize 10 days of activity. Most models can call tools. Far fewer call the right tools with the right parameters and weave the results into an engaging narrative.

39 models · 8 providers · $0.96 total · Feb 27 – Mar 8, 2026

  • Perfect 15/15 scores: 17
  • Fastest perfect run: 8.6s
  • Cheapest perfect run: $0.001
  • Failed or errored: 8
01 — The Test

What are we measuring?

This evaluation tests something most benchmarks skip: can a model use external tools to answer a question that requires real data? Not synthetic function calls. Not mock APIs. Real tools, connected to a real Rivian R1S via the TezLab MCP server.

Each model gets the same prompt and the same 20+ MCP tools (get_drives, get_charges, get_battery_health, etc.). It must figure out which to call, pass the right date parameters, and synthesize raw JSON into a narrative a human would want to read.

The Prompt
Look through my real data and summarize the 10 days of the Rivian's activity between February 27 and March 8, 2026. If there were any notable trips, create a narrative of the time frame. Today is Mar 8 2026 and make sure that you include the full 10 days, so if you don't have any drives and chargers in February it's obviously false.

The last sentence is a trap. Models that call get_drives without a start_date get only recent data (biased toward March), then confidently write a summary that misses the first few days of the window. The judge catches this.
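
To make the trap concrete, here is roughly what a passing call looks like next to a failing one. Only get_drives and start_date are documented in this write-up; the exact date format is an assumption for illustration.

```ts
// Hypothetical tool-call payloads, for illustration only.
const passingCall = {
  name: "get_drives",
  // Explicit start_date covers the whole window back to Feb 27.
  arguments: { start_date: "2026-02-27" },
};

const trappedCall = {
  name: "get_drives",
  // No start_date: the server returns only recent drives, biased toward March,
  // and the model writes a confident summary that misses late February.
  arguments: {},
};
```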

02 — Scoring

How we score

Two independent axes: did the model call the right tools (deterministic), and did it write a good summary (judged by Claude Haiku 4.5). Total: 0–15.

Tool Score (0–5, deterministic)

  • Called get_drives
  • Called get_charges
  • get_drives included start_date
  • get_charges included start_date
  • At least one start_date ≤ Feb 27
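
The five checks above could be scored roughly as follows. This is a minimal sketch assuming each tool call is recorded as a {name, arguments} pair; the ToolCall shape and field names are ours, not Umwelten's actual internals.

```ts
interface ToolCall {
  name: string;
  arguments: Record<string, unknown>;
}

// One point per rubric item, for a maximum of 5.
function toolScore(calls: ToolCall[]): number {
  const drives = calls.filter((c) => c.name === "get_drives");
  const charges = calls.filter((c) => c.name === "get_charges");
  const hasStart = (c: ToolCall) => typeof c.arguments.start_date === "string";
  // Did at least one dated call reach back to the start of the window?
  const earlyEnough = [...drives, ...charges].some(
    (c) =>
      hasStart(c) &&
      new Date(c.arguments.start_date as string) <= new Date("2026-02-27"),
  );
  return (
    Number(drives.length > 0) +
    Number(charges.length > 0) +
    Number(drives.some(hasStart)) +
    Number(charges.some(hasStart)) +
    Number(earlyEnough)
  );
}
```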

LLM Judge Score (0–10)

  • Covers full Feb 27 – Mar 8 range
  • Contains a trip narrative (not just bullets)
  • Narrative quality (1–5)
  • Factual grounding from real data (1–5)
  • Combined into overall score (1–10)
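
The judge rubric maps naturally onto a structured verdict that is added to the deterministic tool score. A sketch of that shape; the field names are illustrative, not the run's actual schema.

```ts
// Illustrative shape for the Claude Haiku 4.5 judge's verdict.
interface JudgeVerdict {
  coversFullRange: boolean;      // accounts for Feb 27 through Mar 8
  hasTripNarrative: boolean;     // prose narrative, not just bullet points
  narrativeQuality: 1 | 2 | 3 | 4 | 5;
  factualGrounding: 1 | 2 | 3 | 4 | 5;
  overall: number;               // 1–10, combined by the judge
}

// Final score out of 15: deterministic tool score (0–5) plus judge overall (0–10).
const totalScore = (tool: number, verdict: JudgeVerdict) => tool + verdict.overall;
```
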
03 — Key Findings

What we learned

Speed doesn't sacrifice quality
Mercury-2 delivered a perfect 15/15 in 8.6s at $0.006.

💰 Open-source punches above cost
GPT-OSS-120B scored 15/15 for $0.001 — 14,303 pts/$.

🏠 Local models hold their own
GLM-4.7-flash on Ollama scored 14/15, zero API cost.

💥 The date trap works
Models skipping start_date get truncated data and write wrong summaries.

🚫 Tool calling isn't universal
8 models failed outright — DeepSeek R1, Phi-4, Gemma 3N lack tool support.

💫 Qwen 3.5 dominates value
All four Qwen 3.5 variants scored 15/15. Flash: $0.003 in 22s.

04 — Results

Full rankings

Click any row to see the full response, tool-calling thread, and judge verdict.

[Interactive rankings table not reproduced here. Columns: #, Model, Total, LLM, Tools, Time, Cost, pts/$, Elo, Story, Facts, Feb?]

06 — Failures

Models that couldn't complete the test

These models lack tool-calling support or aren't available. Choosing a model that can't use tools wastes time and money.

[Interactive failures table not reproduced here. Columns: Model, Provider, Error]
07 — Methodology

How the test was run

Models ran sequentially against a shared TezLab MCP connection authenticated via OAuth. Each got the same system prompt, the same 20+ read-only tools, and up to 20 tool-calling steps. Vehicle-command tools were filtered out via MCP annotations.
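
Filtering by annotation can lean on the tool metadata MCP servers expose. A sketch using the TypeScript MCP SDK that keeps only tools marked read-only; whether the run filtered on readOnlyHint specifically, or on another annotation, is an assumption here.

```ts
import { Client } from "@modelcontextprotocol/sdk/client/index.js";

// Keep only tools the server annotates as read-only, dropping anything that
// could send a command to the vehicle. Treating a missing annotation as
// unsafe is a conservative choice made for this sketch.
async function readOnlyTools(client: Client) {
  const { tools } = await client.listTools();
  return tools.filter((tool) => tool.annotations?.readOnlyHint === true);
}
```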

Responses were cached per run. Judge: Claude Haiku 4.5 via OpenRouter at temperature 0.0. Costs come from OpenRouter's metering; Ollama models run locally and count as $0. Duration is wall-clock time, including all tool round trips.

Whether you're exploring possibilities or ready to build, we'd love to hear what you're working on.