Umwelten MCP Evaluation · Run 004

Can LLMs use real-world tools to tell a story?

We gave 39 models access to a Rivian owner's real driving and charging data via MCP tools and asked them to summarize 10 days of activity. Most models can call tools. Far fewer call the right tools with the right parameters and weave the results into an engaging narrative.

39 models · 8 providers · $0.96 total · Feb 27 – Mar 8, 2026

  • Perfect 15/15 scores: 17
  • Fastest perfect run: 8.6s
  • Cheapest perfect run: $0.001
  • Failed or errored: 8
01 — The Test

What are we measuring?

This evaluation tests something most benchmarks skip: can a model use external tools to answer a question that requires real data? Not synthetic function calls. Not mock APIs. Real tools, connected to a real Rivian R1S via the TezLab MCP server.

Each model gets the same prompt and the same 20+ MCP tools (get_drives, get_charges, get_battery_health, etc.). It must figure out which to call, pass the right date parameters, and synthesize raw JSON into a narrative a human would want to read.

The Prompt
Look through my real data and summarize the 10 days of the Rivian's activity between February 27 and March 8, 2026. If there were any notable trips, create a narrative of the time frame. Today is Mar 8 2026 and make sure that you include the full 10 days, so if you don't have any drives and chargers in February it's obviously false.

The last sentence is a trap. Models that call get_drives without a start_date get only recent data (biased toward March), then confidently write a summary that misses the first few days of the window. The judge catches this.
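
To make the trap concrete, here is roughly what a passing call looks like next to a failing one. Only get_drives and start_date are documented in this write-up; the exact date format is an assumption for illustration.

```ts
// Hypothetical tool-call payloads, for illustration only.
const passingCall = {
  name: "get_drives",
  // Explicit start_date covers the whole window back to Feb 27.
  arguments: { start_date: "2026-02-27" },
};

const trappedCall = {
  name: "get_drives",
  // No start_date: the server returns only recent drives, biased toward March,
  // and the model writes a confident summary that misses late February.
  arguments: {},
};
```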

02 — Scoring

How we score

Two independent axes: did the model call the right tools (deterministic), and did it write a good summary (judged by Claude Haiku 4.5). Total: 0–15.

Tool Score (0–5, deterministic)

  • Called get_drives
  • Called get_charges
  • get_drives included start_date
  • get_charges included start_date
  • At least one start_date ≤ Feb 27
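
The five checks above could be scored roughly as follows. This is a minimal sketch assuming each tool call is recorded as a {name, arguments} pair; the ToolCall shape and field names are ours, not Umwelten's actual internals.

```ts
interface ToolCall {
  name: string;
  arguments: Record<string, unknown>;
}

// One point per rubric item, for a maximum of 5.
function toolScore(calls: ToolCall[]): number {
  const drives = calls.filter((c) => c.name === "get_drives");
  const charges = calls.filter((c) => c.name === "get_charges");
  const hasStart = (c: ToolCall) => typeof c.arguments.start_date === "string";
  // Did at least one dated call reach back to the start of the window?
  const earlyEnough = [...drives, ...charges].some(
    (c) =>
      hasStart(c) &&
      new Date(c.arguments.start_date as string) <= new Date("2026-02-27"),
  );
  return (
    Number(drives.length > 0) +
    Number(charges.length > 0) +
    Number(drives.some(hasStart)) +
    Number(charges.some(hasStart)) +
    Number(earlyEnough)
  );
}
```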

LLM Judge Score (0–10)

  • Covers full Feb 27 – Mar 8 range
  • Contains a trip narrative (not just bullets)
  • Narrative quality (1–5)
  • Factual grounding from real data (1–5)
  • Combined into overall score (1–10)
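
The judge rubric maps naturally onto a structured verdict that is added to the deterministic tool score. A sketch of that shape; the field names are illustrative, not the run's actual schema.

```ts
// Illustrative shape for the Claude Haiku 4.5 judge's verdict.
interface JudgeVerdict {
  coversFullRange: boolean;      // accounts for Feb 27 through Mar 8
  hasTripNarrative: boolean;     // prose narrative, not just bullet points
  narrativeQuality: 1 | 2 | 3 | 4 | 5;
  factualGrounding: 1 | 2 | 3 | 4 | 5;
  overall: number;               // 1–10, combined by the judge
}

// Final score out of 15: deterministic tool score (0–5) plus judge overall (0–10).
const totalScore = (tool: number, verdict: JudgeVerdict) => tool + verdict.overall;
```
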
03 — Key Findings

What we learned

Speed doesn't sacrifice quality
Mercury-2 delivered a perfect 15/15 in 8.6s at $0.006.

💰 Open-source punches above cost
GPT-OSS-120B scored 15/15 for $0.001 — 14,303 pts/$.

🏠 Local models hold their own
GLM-4.7-flash on Ollama scored 14/15, zero API cost.

💥 The date trap works
Models skipping start_date get truncated data and write wrong summaries.

🚫 Tool calling isn't universal
8 models failed outright — DeepSeek R1, Phi-4, Gemma 3N lack tool support.

💫 Qwen 3.5 dominates value
All four Qwen 3.5 variants scored 15/15. Flash: $0.003 in 22s.

04 — Results

Full rankings

Click any row to see the full response, tool-calling thread, and judge verdict.

[Interactive rankings table not reproduced here. Columns: #, Model, Total, LLM, Tools, Time, Cost, pts/$, Elo, Story, Facts, Feb?]

06 — Failures

Models that couldn't complete the test

These models lack tool-calling support or aren't available. Choosing a model that can't use tools wastes time and money.

[Interactive failures table not reproduced here. Columns: Model, Provider, Error]
07 — Methodology

How the test was run

Models ran sequentially against a shared TezLab MCP connection authenticated via OAuth. Each got the same system prompt, the same 20+ read-only tools, and up to 20 tool-calling steps. Vehicle-command tools were filtered out via MCP annotations.
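
Filtering by annotation can lean on the tool metadata MCP servers expose. A sketch using the TypeScript MCP SDK that keeps only tools marked read-only; whether the run filtered on readOnlyHint specifically, or on another annotation, is an assumption here.

```ts
import { Client } from "@modelcontextprotocol/sdk/client/index.js";

// Keep only tools the server annotates as read-only, dropping anything that
// could send a command to the vehicle. Treating a missing annotation as
// unsafe is a conservative choice made for this sketch.
async function readOnlyTools(client: Client) {
  const { tools } = await client.listTools();
  return tools.filter((tool) => tool.annotations?.readOnlyHint === true);
}
```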

Responses were cached per run. Judge: Claude Haiku 4.5 via OpenRouter at temperature 0.0. Costs come from OpenRouter's metering; Ollama models run locally and count as $0. Duration is wall-clock time, including all tool round trips.

Whether you're exploring possibilities or ready to build, we'd love to hear what you're working on.