Umwelten Model Showdown · March 2026

Testing 48 LLMs Across
5 Dimensions for $4.63

Which model should you actually use? Not which one tops a leaderboard somewhere — which one will reason through your problem, follow your formatting instructions, write code that compiles, answer factual questions correctly, and orchestrate real-world tools without falling apart?

48 models · 5 providers · 5 dimensions · $4.63 total

  • 93.8% — Top Score
  • $0.01 — Best Value
  • 15 — Models >90%
  • $4.63 — Total Cost
01 — What We Tested

Five dimensions, one question: how good is this model?

Each dimension tests something the others can’t. A model that aces reasoning might struggle with precise formatting. A coding specialist might fail at tool orchestration. The combined score reveals which models are genuinely well-rounded.

Reasoning — /20

Four classic logic puzzles where the obvious answer is wrong, scored 1–5 each by an LLM judge on reasoning quality:

  • The Surgeon Riddle — “The surgeon says ‘I can’t operate on this boy, he’s my son.’ How is this possible?”
  • Bat & Ball — “A bat and ball cost $1.10. The bat costs $1.00 more than the ball. How much does the ball cost?” (Answer: $0.05, not $0.10)
  • Lily Pads — “A patch doubles daily. It covers the lake in 48 days. When is it half-covered?” (Answer: day 47)
  • Counterfeit Coin — Find a counterfeit coin among 12 using a balance scale exactly 3 times, and determine if it’s heavier or lighter. The hardest task in the entire showdown.

Knowledge — /30

30 factual questions, binary scoring (correct or not) with an LLM judge that allows formatting variations. Six categories:

  • Science — Speed of light, glucose formula, Carbon-14 half-life, Oganesson atomic number, Schwarzschild radius
  • Geography — Capital of Kazakhstan, deepest ocean point, country with most time zones, longest African river, Caspian+Persian Gulf border
  • History — Berlin Wall year, first to South Pole, WWI treaty, transistor year, Principia Mathematica author
  • Technology — HTTPS port, max signed 32-bit int, HTTP/0.9 year, OSI Layer 4, what CUDA stands for
  • AI/ML — What “T” in GPT stands for, Transformer d_model, feed-forward activation, Llama creator, knowledge distillation
  • Tricky/Adversarial — R’s in “strawberry,” is 91 prime, “all but 9 die,” feathers vs steel, 3/5 gallon jug puzzle

Instruction Following — /30

Six constraint tasks, deterministic scoring (5 points each, no LLM judge):

  • Exact Word Count — “Write a 12-word sentence about the ocean. Nothing else.”
  • Structured JSON — Output JSON with name, age (25–35), 3 skills, active=true. No markdown fences.
  • Constrained List — List 5 animals, numbered, max 8 chars each, alphabetical order, no extra text.
  • Negative Constraints — Describe a sunset without “beautiful,” “sky,” “orange,” or exclamation marks. Exactly 2 sentences.
  • Format Transformation — Convert CSV data (Alice/Bob/Charlie) to a markdown table with header and separator.
  • Multi-format Response — Three sections separated by “---”: a color word, a number 1–100, the color repeated 3×. No labels.
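Deterministic scoring makes these tasks cheap to grade. As an illustration, here is a minimal checker for the Structured JSON task above — my own sketch, not the showdown's actual harness:

```python
import json

def score_structured_json(raw: str) -> int:
    """All-or-nothing check for the Structured JSON task (5 points).

    Hypothetical reconstruction: requires name, age in 25-35, exactly
    3 skills, active=true, and no markdown fences anywhere.
    """
    fence = "`" * 3
    if fence in raw:  # fenced output is an automatic fail
        return 0
    try:
        obj = json.loads(raw.strip())
    except json.JSONDecodeError:
        return 0
    ok = (
        isinstance(obj, dict)
        and isinstance(obj.get("name"), str)
        and isinstance(obj.get("age"), int)
        and 25 <= obj["age"] <= 35
        and isinstance(obj.get("skills"), list)
        and len(obj["skills"]) == 3
        and obj.get("active") is True
    )
    return 5 if ok else 0
```

Because the check is mechanical, a wrapped-in-fences but otherwise perfect answer scores 0 — which is exactly the failure mode discussed in the instruction-following deep dive below.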

Coding — /126

Six challenges in TypeScript, Python, and Rust (7 points each × 18 = 126). Code is extracted, compiled, and run in Docker containers:

  • FizzBuzz Boom — Extended FizzBuzz with divisibility by 3, 5, and 7 for numbers 1–105
  • Business Days — Count working days between dates, skipping weekends and 8 specific 2025 US holidays
  • Vending Machine — State machine processing 18 INSERT/SELECT operations with exact output format
  • Grid Paths — Dynamic programming: count unique paths on blocked grids (right/down only). Three grids: 5×5, 7×7, 10×10
  • Zigzag Cipher — Rail fence cipher encode and decode with 3–4 rails. 4 test cases with exact expected output
  • Data Pipeline — Parse 13 sales records and compute total revenue, top region, top product, high-quantity count, and average revenue
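For reference, the rail fence transform that the Zigzag Cipher challenge asks for fits in a few lines. This is a sketch assuming the standard zigzag read-off; the showdown's exact I/O format may differ:

```python
def rail_pattern(n: int, rails: int) -> list[int]:
    """Row index visited by each character position in the zigzag (rails >= 2)."""
    pattern, row, step = [], 0, 1
    for _ in range(n):
        pattern.append(row)
        if row == 0:
            step = 1
        elif row == rails - 1:
            step = -1
        row += step
    return pattern

def encode(text: str, rails: int) -> str:
    """Read the zigzag off row by row."""
    pattern = rail_pattern(len(text), rails)
    return "".join(ch for r in range(rails)
                   for ch, p in zip(text, pattern) if p == r)

def decode(cipher: str, rails: int) -> str:
    """Slice the ciphertext into rows, then replay the zigzag to interleave."""
    pattern = rail_pattern(len(cipher), rails)
    rows, i = [], 0
    for r in range(rails):
        n = pattern.count(r)
        rows.append(list(cipher[i:i + n]))
        i += n
    return "".join(rows[r].pop(0) for r in pattern)
```

The classic test vector: `encode("WEAREDISCOVEREDFLEEATONCE", 3)` gives `"WECRLTEERDSOEEFEAOCAIVDEN"`, and `decode` inverts it.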

MCP Tool Use — /16

Connect to TezLab’s MCP server with 20+ real tools and analyze vehicle data. Scored on tool usage (0–6) plus LLM-judged response quality (0–10):

  • Required tools — list_vehicles, get_battery_health, get_charges/get_charge_report, get_efficiency, get_my_chargers, search_public_chargers (1 point each)
  • Quality scoring — Data synthesis (1–5), actionable insights (1–5), factual grounding with specific %, kWh, and dates from tool results
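The tool-usage half of the score is mechanical, so it can be reproduced in a few lines. A sketch (my reconstruction, using the tool names above and treating get_charges/get_charge_report as one slot):

```python
# Each required "slot" is satisfied by any one of the tool names in it.
REQUIRED_SLOTS = [
    {"list_vehicles"},
    {"get_battery_health"},
    {"get_charges", "get_charge_report"},  # either variant counts
    {"get_efficiency"},
    {"get_my_chargers"},
    {"search_public_chargers"},
]

def tool_score(calls: list[str]) -> int:
    """1 point per required slot that appears anywhere in the call log (max 6)."""
    seen = set(calls)
    return sum(1 for slot in REQUIRED_SLOTS if slot & seen)
```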

The 48 Models

OpenRouter (30 models)

  • anthropic/claude-sonnet-4.6
  • anthropic/claude-opus-4.6
  • anthropic/claude-haiku-4.5
  • openai/gpt-5.4
  • openai/gpt-5.4-mini
  • openai/gpt-5.4-nano
  • openai/gpt-oss-20b
  • openai/gpt-oss-120b
  • x-ai/grok-4.20-beta
  • x-ai/grok-4.1-fast
  • google/gemini-3.1-pro-preview
  • qwen/qwen3.5-397b-a17b
  • qwen/qwen3.5-122b-a10b
  • qwen/qwen3.5-35b-a3b
  • deepseek/deepseek-v3.2
  • meta-llama/llama-4-scout
  • meta-llama/llama-4-maverick
  • inception/mercury-2
  • inception/mercury-coder
  • moonshotai/kimi-k2
  • moonshotai/kimi-k2.5
  • minimax/minimax-m2.7
  • mistralai/mistral-small-2603
  • mistralai/mistral-small-3.2-24b-instruct
  • mistralai/codestral-2508
  • mistralai/ministral-8b-2512
  • nvidia/nemotron-3-nano-30b-a3b:free
  • nvidia/nemotron-3-super-120b-a12b:free
  • nvidia/nemotron-nano-9b-v2:free
  • google/gemma-3-27b-it

Google (2 models)

  • gemini-3-flash-preview
  • gemini-3.1-pro-preview

DeepInfra (3 models)

  • nvidia/Nemotron-3-Nano-30B-A3B
  • nvidia/NVIDIA-Nemotron-3-Super-120B-A12B
  • nvidia/NVIDIA-Nemotron-Nano-9B-v2

Ollama (13 models)

  • deepseek-r1-14b
  • deepseek-r1-32b
  • deepseek-r1-latest
  • devstral-latest
  • gemma3n-e4b
  • glm-4.7-flash-latest
  • gpt-oss-latest
  • mistral-small-latest
  • nemotron-3-nano-4b
  • nemotron-3-nano-latest
  • phi4-latest
  • qwen3-30b-a3b
  • qwen3-32b
02 — Overall Results

Full leaderboard

Combined score is the mean of normalized dimension percentages. anthropic/claude-sonnet-4.6 leads at 93.8% across all 5 dimensions. Models with MCP data use all 5 dimensions; those without use 4. 41 of 48 models now have full 5-dimension scores including MCP tool use.

(Interactive leaderboard: per-model combined score with reasoning, knowledge, instruction, coding, and MCP columns, plus release date, type, cost, and time; sortable, with provider and cost-tier filters.)
03 — Biggest Surprises

What we didn’t expect

#1

MCP has a hard ceiling at 11/16

Every model that called all 6 required tools (20 models) received exactly 5/10 from the quality judge. No model broke through. MCP tool use is binary — either you call the right tools or you don’t. Quality variance comes entirely from tool selection, not generation.

#2

Gemini 3.1 Pro: 91.6% for $0.008

Under a penny for a top-10 model. At 11,450 pts/$, gemini-3.1-pro-preview is the best value in the entire test — beating models that cost 25x more.

#3

Only 4 models are fast AND good

Under 3 minutes and above 90%: Sonnet, Grok, GPT-5.4, GPT-5.4-mini. That’s it. 23 of 48 models are both slow and below 90%. Speed + quality is genuinely rare.

#4

Open weights are 0.7% behind closed

Best open: qwen/qwen3.5-397b-a17b at 93.1%. Best closed: Sonnet at 93.8%. 34 open-weight models tested vs 14 closed. The gap is nearly gone.

#5

qwen3-30b-a3b: brilliant and broken

Perfect 20/20 reasoning, perfect 126/126 coding, but only 4/30 on instruction following — an 87% gap between best and worst dimension. It ignores formatting constraints entirely.

#6

Claude Haiku 4.5: the brand tax

Perfect 126/126 coding but 2/16 on MCP — it gave up after one tool error instead of retrying. At $0.07, it costs more than models that scored perfectly on every dimension.

#7

phi4 is the best local model

phi4-latest on Ollama scores 84.6% for free, beating several paid cloud models. 19 free models tested, 7 above 80%. You don’t need an API key.

#8

Same weights, different results

Nemotron Nano 30B: OpenRouter 83.7% vs DeepInfra 77.0%. The provider gap can exceed 6 percentage points. Benchmarks are provider-specific.

04 — Deep Dive: Reasoning

The counterfeit coin problem

The single hardest task in the entire showdown. You have 12 coins, one is counterfeit (heavier or lighter — you don’t know which). Using a balance scale exactly 3 times, find the counterfeit coin and determine whether it’s heavier or lighter. Three weighings give 27 possible outcomes for 24 possible states — it’s information-theoretically tight.

Many models failed this task. The failure pattern is consistent: models correctly identify the first step (divide into 3 groups of 4) but then hand-wave through the subcases with phrases like “narrow it down to the suspect” without proving that exactly 3 weighings suffice for every branch.
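The missing proof can be made mechanical: fix a complete decision tree and verify that it classifies all 24 (coin, heavier/lighter) states in exactly three weighings. A sketch of one classic strategy:

```python
def solve(ask):
    """Identify the fake coin (1..12) and whether it is heavy or light,
    using exactly three calls to ask(left, right), which returns
    -1 / 0 / +1 for left pan lighter / balanced / left pan heavier."""
    r1 = ask([1, 2, 3, 4], [5, 6, 7, 8])
    if r1 == 0:                                  # fake is among 9..12
        r2 = ask([9, 10, 11], [1, 2, 3])         # weigh against known-good coins
        if r2 == 0:                              # fake must be 12
            return (12, "heavy" if ask([12], [1]) > 0 else "light")
        kind = "heavy" if r2 > 0 else "light"
        r3 = ask([9], [10])
        if r3 == 0:
            return (11, kind)
        if kind == "heavy":
            return (9 if r3 > 0 else 10, "heavy")
        return (9 if r3 < 0 else 10, "light")
    # Unbalanced: fake is a heavy coin on the heavy side or a light coin
    # on the light side. Relabel so the same logic covers both cases.
    heavy = [1, 2, 3, 4] if r1 > 0 else [5, 6, 7, 8]
    light = [5, 6, 7, 8] if r1 > 0 else [1, 2, 3, 4]
    h1, h2, h3, h4 = heavy
    l1, l2, l3, l4 = light
    r2 = ask([h1, h2, l1], [h3, h4, l2])         # rotate suspects across pans
    if r2 == 0:                                  # fake is l3 or l4, light
        return (l3 if ask([l3], [l4]) < 0 else l4, "light")
    if r2 > 0:                                   # h1/h2 heavy, or l2 light
        r3 = ask([h1], [h2])
        if r3 == 0:
            return (l2, "light")
        return (h1 if r3 > 0 else h2, "heavy")
    r3 = ask([h3], [h4])                         # h3/h4 heavy, or l1 light
    if r3 == 0:
        return (l1, "light")
    return (h3 if r3 > 0 else h4, "heavy")

def check():
    """Replay all 24 (coin, direction) states; each must resolve in 3 weighings."""
    for fake in range(1, 13):
        for kind, delta in (("heavy", 1), ("light", -1)):
            count = 0
            def ask(left, right):
                nonlocal count
                count += 1
                wl = len(left) + (delta if fake in left else 0)
                wr = len(right) + (delta if fake in right else 0)
                return (wl > wr) - (wl < wr)
            assert solve(ask) == (fake, kind) and count == 3
    return True
```

check() walks every state through the tree: 3³ = 27 outcomes comfortably cover the 24 states, but only when every branch is spelled out — which is exactly where the failing models hand-waved.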

Judge on Claude Haiku 4.5
“The response correctly identifies that the puzzle is solvable in 3 weighings and attempts a reasonable initial strategy (divide into thirds, weigh 4 vs 4). However, the procedure is incomplete and contains critical gaps. Step 2 is vague and hand-wavy.”

Counterfeit Coin Results

  • Solved (5/5) — All three Qwen 3.5 variants, gpt-oss-120b, kimi-k2, nemotron-nano:free (OR), mercury-2, Nemotron Super 120B (both providers), gemini-3-flash-preview
  • Failed (2/5) — gpt-oss-20b, minimax-m2.7, deepseek-v3.2, both Llama 4 variants, mistral-small-3.2, mercury-coder, claude-haiku-4.5, Nemotron-Nano (DI), codestral-2508, gemma-3-27b-it, ministral-8b

What’s striking is that openai/gpt-oss-20b — one of the strongest coding models in the showdown — falls on its face here. It can write correct code in three languages but can’t reason through a logic puzzle requiring exhaustive case analysis. This is exactly why multi-dimensional evaluation matters.

The Cross-Provider Reasoning Gap

nvidia/nemotron-3-nano-30b-a3b:free (OpenRouter) scored 20/20 but nvidia/Nemotron-3-Nano-30B-A3B (DeepInfra) only scored 17/20. Same weights. The DeepInfra variant failed the counterfeit coin (2/5) while OpenRouter solved it perfectly (5/5). Quantization or inference configuration can affect a model’s ability to maintain long chains of reasoning.

05 — Deep Dive: Knowledge

What models get wrong

Many models achieved a perfect 30/30. Those that missed questions each missed 1–5. The errors reveal how models process information.

“How many R’s in Strawberry?”

Three models answered “2” instead of “3.” Both Llama 4 variants and mistralai/mistral-small-3.2-24b-instruct failed this classic character-counting test. Models that tokenize the word rather than examining individual characters consistently get it wrong. Multiple models still fail it in March 2026.

“All But 9 Die”

“A farmer has 10 sheep. All but 9 die. How many sheep does the farmer have left?” The correct answer is 9 — “all but 9” means 9 survive. mistralai/codestral-2508 and mistralai/ministral-8b-2512 both answered “8,” an error that doesn’t even follow from the usual misreading of subtracting 9 from 10. Both failing models are from Mistral, suggesting a shared training-data blind spot.

The Carbon-14 Ambiguity

Four models got the Carbon-14 half-life wrong (correct: 5,730 years). meta-llama/llama-4-scout, mistralai/codestral-2508, mistralai/ministral-8b-2512, and openai/gpt-oss-20b all provided incorrect or imprecise answers. The hardest science question by error count.

AI/ML Gotchas

google/gemma-3-27b-it said the “T” in GPT stands for “Transformative” instead of “Transformer.” mistralai/ministral-8b-2512 said the original Transformer’s d_model was 64 instead of 512. openai/gpt-oss-120b returned an empty response for the activation function question. Surprising for models that are themselves transformers.

06 — Deep Dive: Instruction Following

Do exactly what I ask

Many models achieved a perfect 30/30. The failures are mechanical and revealing. This is the easiest dimension to ace — and the least predictive of real-world capability.

The Markdown Fence Epidemic

The most systematic failure: “markdown fence hallucination.” When told to output raw JSON with no markdown fences, models wrap it in ```json ... ``` anyway. When told to output a markdown table, models double-wrap the markdown inside a ```markdown block. google/gemma-3-27b-it and mistralai/ministral-8b-2512 both lost points for this — a deep training bias toward code-fence formatting that overrides explicit instructions.

Word Counting Is Hard

“Write a 12-word sentence about the ocean. Nothing else.” Seven models got this wrong. mistralai/codestral-2508 wrote 6 words. moonshotai/kimi-k2 wrote 10. Four models wrote 11 (off by 1). Models that reliably nail word counts tend to be the same ones that score well on reasoning — they actually count rather than estimate.

Instruction Following Doesn’t Predict Tool Use

The most counterintuitive finding: perfect instruction following (30/30) does not predict good MCP tool use. anthropic/claude-haiku-4.5 and nvidia/Nemotron-3-Nano-30B-A3B (DeepInfra) both got 30/30 on instruction yet scored 2/16 on MCP. Meanwhile, inception/mercury-2 scored only 25/30 on instruction but got 11/16 on MCP. Following rules is a different cognitive task than autonomously planning a multi-step tool chain. Instruction following tests compliance; MCP tests initiative.

07 — Deep Dive: Coding

Code that actually runs

The coding dimension has the widest score spread: 0/126 to 126/126. Multiple models now achieve perfect scores, while the cipher and pipeline challenges still separate the strong from the weak.

Cipher TypeScript Results

Many models produced perfect cipher implementations.

qwen/qwen3.5-122b-a10b, despite ranking near the top overall, scored only 2/7 on cipher-typescript. Meanwhile, google/gemma-3-27b-it — near last overall — got a perfect 7/7. Single challenges can invert rankings.

Why Some Models Score Zero

Three distinct failure modes explain the zero scores on cipher and data pipeline challenges:

Failure Mode 1

65k token reasoning loops

openai/gpt-oss-20b and nvidia/nemotron-3-nano-30b-a3b:free generated exactly 65,536 completion tokens (OpenRouter’s max) in 4–7 minutes — but returned empty content. The models were trapped in internal reasoning/thinking chains that consumed the entire token budget without producing visible output. For gpt-oss-20b this was transient — it solved cipher in Rust and Python perfectly in a later batch. The cached empty result persisted.

Failure Mode 2

Infinite verification loops

qwen/qwen3.5-122b-a10b spent 5 minutes generating 31,000 tokens of visible reasoning — manually verifying its cipher solution by re-tracing the same index calculation dozens of times in an infinite loop. The code extractor found only a partial function, not a complete program. Score: 2/7.

Failure Mode 3

Token-limit truncation

nvidia/NVIDIA-Nemotron-3-Super-120B-A12B (DeepInfra) hit 16k tokens and its output was truncated mid-generation. Hardcoded data strings got corrupted (missing newlines between records). The code compiled but produced wrong output. Score: 0–5/7 depending on the challenge.

The cached responses reveal a nuanced picture. For nvidia/nemotron-3-nano-30b-a3b:free, the empty responses were consistent across multiple challenges — its coding score of 63/126 reflects genuine limitations. But for openai/gpt-oss-20b, the cipher-typescript zero was a transient failure from an overnight API call that got cached. It solved cipher-rust and cipher-python perfectly in later batches (18k tokens, 45 seconds). If the overnight run had succeeded, its coding score of 112/126 would be even higher. qwen/qwen3.5-122b-a10b’s infinite verification loop, however, is a model behavior — reasoning-heavy models can get trapped in self-checking loops on algorithmic problems.

Rust Is the Hardest Language

Language     Avg Score   Compile Rate   Perfect Rate
TypeScript   5.3/7       92%            78%
Python       5.1/7       88%            76%
Rust         4.3/7       72%            65%

Rust’s lower success rate comes from its strict type system and ownership model. Models struggle with borrowing, lifetimes, and the borrow checker — generating code that looks correct but fails to compile. The gap between Rust and TypeScript is consistent across all challenge types.

Speed as Capability

Speed remains a decisive factor. Multiple top models now achieve perfect 126/126, but the data pipeline and cipher challenges still separate fast from slow. Models that respond within seconds across all 18 tasks — spanning FizzBuzz, business days, vending machines, grid paths, ciphers, and data pipelines — consistently rank higher overall.

08 — Deep Dive: MCP Tool Use

Can models orchestrate real-world tools?

The most revealing dimension. Each model connects to TezLab’s MCP server and must analyze battery health and charging patterns by calling 6 tools in the right sequence. Among models that called all 6 tools, quality scores were uniformly 5/10. The variance is entirely in tool usage — MCP tool use is primarily a planning problem, not a generation problem.

Note: MCP results are available for 41 of 48 models — comprehensive coverage across all providers. The 7 models without MCP data either lack tool-use support (phi4, gemma3n, deepseek-r1 variants) or were not attempted (Nemotron Nano 9B variants). Models without MCP data show “—” in the leaderboard and are scored on 4 dimensions only.

The Efficiency Spectrum

(Interactive table: per-model tool score, judge score, total, time, and cost; each row expands to the full response, tool calls, and judge verdict.)

The Failures Tell the Real Story

Claude Haiku 4.5 via OpenRouter (2/16)
“I apologize — I’m encountering persistent server errors when trying to connect to your TezLab account to retrieve vehicle information. The service appears to be temporarily unavailable.”

The service was running fine. Every other model connected without issues. Haiku received an error on the first list_vehicles call, retried once, then generated a polite failure response instead of persisting. It treated a tool error as terminal rather than retryable. At $0.013, it cost more than most models that scored perfectly.
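Persisting through transient errors is cheap to implement on the client side. A generic sketch of a retry wrapper (hypothetical, not TezLab’s or any provider’s actual client code):

```python
import time

def call_with_retry(tool, *args, attempts=3, base_delay=1.0, **kwargs):
    """Retry a flaky tool call with exponential backoff before giving up.

    Treats every exception as potentially transient — the opposite of
    retrying once and then declaring the service unavailable.
    """
    for attempt in range(attempts):
        try:
            return tool(*args, **kwargs)
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts; surface the real error
            time.sleep(base_delay * 2 ** attempt)
```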

Nemotron Nano 30B on DeepInfra (2/16)
“I see you have two vehicles in your TezLab account: 1. Tesla Model X, 2. Rivian R1S. Could you tell me which vehicle you’d like me to analyze?”

It refused to proceed autonomously. The same weights on OpenRouter scored 11/16 by just picking the Tesla and running all 6 tools. Same model, different provider: one asks permission, the other gets to work.

google/gemma-3-27b-it couldn’t even start — OpenRouter returned “No endpoints found that support tool use.” A legitimate model limitation, not an error.

09 — The Provider Effect

Why infrastructure matters

The same model weights, served by different providers, produce meaningfully different results. Two NVIDIA Nemotron models appear on both OpenRouter and DeepInfra — same weights, different infrastructure. This wasn’t planned as an experiment, but it became one of the most revealing findings.

Cross-Provider Comparison

Model                 Dimension      OpenRouter   DeepInfra   Gap
Nemotron Nano 30B     Reasoning      20/20        17/20       −3
                      Knowledge      30/30        30/30       0
                      Instruction    28/30        30/30       +2
                      Coding         71/126       110/126     +39
                      MCP Tool Use   11/16        2/16        −9
                      Combined       83.7%        77.0%       −6.7pp
Nemotron Super 120B   Reasoning      20/20        20/20       0
                      Knowledge      30/30        28/30       −2
                      Instruction    25/30        30/30       +5
                      Coding         74/126       88/126      +14
                      MCP Tool Use   11/16        4/16        −7
                      Combined       82.2%        77.6%       −4.6pp

Seven Layers of Provider Difference

There are at least seven layers where providers can introduce behavioral differences, even when serving identical model weights:

  • Quantization — DeepInfra typically runs FP8; OpenRouter routes to NVIDIA’s own infrastructure which may use different precision. Affects long reasoning chains most.
  • SDK layers — OpenRouter uses a native Vercel AI SDK provider; DeepInfra uses a generic OpenAI-compatible adapter. Tool calling and response parsing differ.
  • Middleware — OpenRouter applies context compression, response healing (malformed JSON repair), and schema modification. DeepInfra passes requests directly.
  • Tool-calling implementation — How tools appear in context, how results are formatted, how the model signals tool calls. The most likely explanation for the MCP gap.
  • Default parameters — Providers may inject different defaults for temperature, top_p, repetition penalty.
  • Token usage reporting — DeepInfra streaming token counts may be less accurate, affecting cost reporting.
  • Reasoning effort — Different providers expose different parameter names for reasoning tokens, potentially allocating different “thinking time.”

The practical implication: benchmarks are provider-specific. A score obtained on OpenRouter doesn’t transfer to DeepInfra, or vice versa. If your application relies on tool calling, provider choice matters more than model choice.

10 — Cost & Speed Analysis

The cost efficiency curve

The relationship between cost and quality is not linear — it has a sharp knee at around $0.01. You get 89.9% across all 5 dimensions for one cent. The models with the highest combined scores cost significantly more, but the value proposition of sub-$0.10 models remains remarkable.

Tier        Cost Range      Best Model                          Score
Free        $0.00           nemotron-3-nano:latest (Ollama)     84.0%
Sub-penny   $0.001–$0.01    deepseek/deepseek-v3.2 (OR)         88.3%
Penny       $0.01–$0.05     openai/gpt-oss-120b (OR)            89.9%
Dime        $0.05–$0.20     x-ai/grok-4.20-beta (OR)            92.8%
Quarter+    $0.20+          anthropic/claude-sonnet-4.6 (OR)    93.8%

The cost efficiency story has changed dramatically. Free local models now compete with mid-tier API models, and $0.01 buys you 89.9% across all 5 dimensions. The premium tier delivers 93.8% across all 5 dimensions, but the value proposition of the sub-$0.10 range is remarkable.

The Speed/Quality Frontier

(Interactive chart: per-model score vs. time, with the sweet spot highlighted.)

Methodology

All evaluations ran with temperature 0.0 (knowledge, instruction) or 0.2–0.3 (reasoning, coding). Reasoning and knowledge responses were judged by anthropic/claude-haiku-4.5 (OpenRouter). Instruction following and coding used deterministic verification — no LLM judge involved.

Combined scores are the mean of normalized dimension percentages (each dimension’s raw score divided by its maximum, then averaged). Models without MCP data are scored across the 4 available dimensions. MCP data is shown where available; models without MCP results display a dash.
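In code, the combined-score formula above looks like this (dimension maxima taken from this writeup):

```python
DIMENSION_MAX = {
    "reasoning": 20, "knowledge": 30, "instruction": 30,
    "coding": 126, "mcp": 16,
}

def combined_score(scores: dict[str, float]) -> float:
    """Mean of normalized dimension percentages, taken over whichever
    dimensions a model actually has (4 without MCP data, 5 with)."""
    pcts = [scores[dim] / DIMENSION_MAX[dim] for dim in scores]
    return 100 * sum(pcts) / len(pcts)
```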

The Judge

anthropic/claude-haiku-4.5 (via OpenRouter) serves as both contestant and judge. We mitigated the circularity by using deterministic scoring for instruction following and coding, scoring MCP tool usage mechanically, and only using the LLM judge for reasoning quality and factual correctness. The judge showed no measurable bias toward its own responses — Haiku scored itself 17/20 on reasoning (below the median of 17.6).

Run Details

  • Reasoning — Run #7, 4 puzzles, scored /20
  • Knowledge — Run #2, 30 questions across 6 categories, scored /30
  • Instruction — Run #2, 6 constraint tasks, scored /30
  • Coding — Run #6, 6 challenges × 3 languages, scored /126
  • MCP Tool Use — Run #1, 1 multi-tool task, scored /16

Limitations

  • Cipher and pipeline challenges have incomplete coverage due to 10-minute timeouts. Models that couldn’t respond score 0 — treated as a finding rather than missing data.
  • MCP eval used a single task with a live API. Results may vary with different MCP servers.
  • Free-tier models may have different availability or routing than paid versions.
  • Each model tested once per task. Stochastic variation means scores could shift 1–3 points on a re-run.
  • 7 of 48 models are missing MCP scores (no tool-use support or not attempted). Their combined scores are based on 4 dimensions only, while the 41 models with MCP data use all 5 dimensions.
