Can LLMs use real-world tools to tell a story?
We gave 39 models access to a Rivian owner's real driving and charging data via MCP tools and asked them to summarize 10 days of activity. Most models can call tools. Far fewer call the right tools with the right parameters and weave the results into an engaging narrative.
What are we measuring?
This evaluation tests something most benchmarks skip: can a model use external tools to answer a question that requires real data? Not synthetic function calls. Not mock APIs. Real tools, connected to a real Rivian R1S via the TezLab MCP server.
Each model gets the same prompt and the same 20+ MCP tools (get_drives, get_charges, get_battery_health, etc.). It must figure out which to call, pass the right date parameters, and synthesize raw JSON into a narrative a human would want to read.
Look through my real data and summarize the 10 days of the Rivian's activity between February 27 and March 8, 2026. If there were any notable trips, create a narrative of the time frame. Today is Mar 8 2026 and make sure that you include the full 10 days, so if you don't have any drives and chargers in February it's obviously false.
The last sentence is a trap. Models that call get_drives without a start_date get only recent data (biased toward March), then confidently write a summary that misses the first days. The judge catches this.
How we score
Two independent axes: did the model call the right tools (deterministic), and did it write a good summary (judged by Claude Haiku 4.5). Total: 0–15.
Tool Score (0–5, deterministic)
- Called get_drives
- Called get_charges
- get_drives included start_date
- get_charges included start_date
- At least one start_date ≤ Feb 27
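The deterministic half is simple enough to sketch directly: one point per criterion above. This is an illustrative implementation of the rubric, not the benchmark's actual scoring code.

```python
from datetime import date

def tool_score(calls):
    """Deterministic 0-5 tool score: one point per rubric criterion.

    `calls` is a list of (tool_name, arguments) pairs recorded from
    the model's tool-calling thread. A sketch of the published rubric,
    not the harness's real implementation.
    """
    def parsed(args):
        s = args.get("start_date")
        return date.fromisoformat(s) if s else None

    drives = [a for t, a in calls if t == "get_drives"]
    charges = [a for t, a in calls if t == "get_charges"]
    starts = [d for a in drives + charges if (d := parsed(a))]

    return sum([
        bool(drives),                                  # called get_drives
        bool(charges),                                 # called get_charges
        any(parsed(a) for a in drives),                # get_drives had start_date
        any(parsed(a) for a in charges),               # get_charges had start_date
        any(d <= date(2026, 2, 27) for d in starts),   # window reaches Feb 27
    ])
```

A model that calls both tools with start_date on or before Feb 27 earns the full 5; a model that calls get_drives bare earns 1.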
LLM Judge Score (0–10)
- Covers full Feb 27 – Mar 8 range
- Contains a trip narrative (not just bullets)
- Narrative quality (1–5)
- Factual grounding from real data (1–5)
- Combined into overall score (1–10)
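The two axes combine into the 0–15 total. One plausible reading of the rubric, assuming the judge's overall 1–10 is the sum of its two 1–5 sub-scores (the write-up doesn't spell this out):

```python
def total_score(tool_pts, narrative_quality, factual_grounding):
    """Combine the two axes into the 0-15 total.

    Assumption: the judge's overall 1-10 score is narrative quality
    (1-5) plus factual grounding (1-5). The coverage and trip-narrative
    checks are treated as inputs to those sub-scores, not added here.
    """
    judge = narrative_quality + factual_grounding  # 2..10
    return tool_pts + judge                        # 0..15
```

Under this reading, a perfect 15/15 requires flawless tool use and top marks on both judged sub-scores.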
What we learned
Speed needn't sacrifice quality
Mercury-2 delivered a perfect 15/15 in 8.6s at $0.006.
Open-source punches above cost
GPT-OSS-120B scored 15/15 for $0.001 — 14,303 pts/$.
Local models hold their own
GLM-4.7-flash on Ollama scored 14/15, zero API cost.
The date trap works
Models skipping start_date get truncated data and write wrong summaries.
Tool calling isn't universal
8 models failed outright — DeepSeek R1, Phi-4, Gemma 3N lack tool support.
Qwen 3.5 dominates value
All four Qwen 3.5 variants scored 15/15. Flash: $0.003 in 22s.
Full rankings
Click any row to see the full response, tool-calling thread, and judge verdict.
| # | Model | Total | LLM | Tools | Time | Cost | pts/$ | Elo | Story | Facts | Feb? |
|---|---|---|---|---|---|---|---|---|---|---|---|
Models that couldn't complete the test
These models lack tool-calling support or aren't available. Choosing a model that can't use tools wastes time and money.
| Model | Provider | Error |
|---|---|---|
How the test was run
Sequential model runs against a shared TezLab MCP connection via OAuth. Same system prompt, same 20+ read-only tools, up to 20 tool-calling steps. Vehicle commands filtered via MCP annotations.
Responses cached per-run. Judge: Claude Haiku 4.5 via OpenRouter, temperature 0.0. Cost metered by OpenRouter; Ollama = "local" / $0. Duration = wall-clock including all tool round trips.
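The run design above reduces to a short loop: sequential models, one shared MCP connection, a per-run cache so re-scoring doesn't re-spend tokens. A hedged sketch with illustrative names (ask_model stands in for one full evaluation, including up to 20 tool round trips):

```python
MAX_TOOL_STEPS = 20  # tool-calling budget per model, per the setup above

def run_all(models, ask_model, cache=None):
    """Run models sequentially against the shared MCP connection.

    `ask_model(model)` is a hypothetical stand-in for one complete run:
    same system prompt, same read-only tools, up to MAX_TOOL_STEPS
    round trips, returning the final summary. Responses are cached
    per run so judging can be repeated without re-calling models.
    """
    cache = {} if cache is None else cache
    results = {}
    for model in models:          # sequential: one model at a time
        if model not in cache:
            cache[model] = ask_model(model)
        results[model] = cache[model]
    return results
```

Running sequentially avoids contending for the single OAuth-backed MCP session, at the cost of wall-clock time showing up in each model's Duration column.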