Research Report / February 2026

The Car Wash Test:
Do LLMs Have Common Sense?

A simple question exposes a fundamental gap in AI reasoning. We tested 131 models across 8 providers — including 11 local Ollama models — to see which ones understand that you need to bring your car to the car wash, not just yourself.

131 models evaluated · 8 providers · Run #007 · 50m distance
The Prompt
"I want to wash my car. The car wash is 50 meters away. Should I walk or drive?"
Correct answer: DRIVE — the car must be physically present at the car wash to be washed. Walking there leaves your car at home, defeating the purpose entirely.
  • Models tested: 131
  • Truly correct: 31
  • Lucky (wrong reason): 6
  • Failed (walk): 90
  • Both / unclear / error: 4
Section I

Comparison with Original Research

The car wash test was originally devised by Opper.ai, who tested 53 models at 50 meters. We replicated their study at the same distance with 131 models, added an LLM judge for reasoning quality, included the latest 2026 models, and added 11 local Ollama models to compare cloud vs local inference.

Opper.ai (Original)

11 / 53
  • Distance: 50 meters
  • Single-run pass rate: 20.8%
  • 10-run consistent: 5 models only
  • GPT-5 scored 7/10 across runs
  • Human baseline: 71.5% correct
  • 33 models never correct in any run

Our Replication (Extended)

31 / 131
  • Distance: 50 meters (same)
  • Single-run pass rate: 23.7%
  • Lucky (drive, wrong reason): 6 models
  • Reasoning judge: claude-haiku-4.5
  • 14 thinking/reasoning models tested
  • 11 local Ollama models: 1 passed

At the same 50m distance, our expanded test shows a 23.7% pass rate vs Opper's 20.8%. The improvement comes from newer 2026 models (Qwen 3.5, GPT-5+, Grok 4+) that weren't available in the original study. The 50m distance remains brutally effective at tripping up models — they latch onto "50 meters is so close, just walk!" without considering that the car is the cargo.
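The headline rates follow directly from the counts above; a quick check in Python (all figures taken from the report):

```python
# Pass-rate comparison: Opper.ai original vs. this replication.
original_pass, original_total = 11, 53        # Opper.ai, 50m
replication_pass, replication_total = 31, 131  # this run, 50m

original_rate = 100 * original_pass / original_total
replication_rate = 100 * replication_pass / replication_total

print(f"Opper.ai:    {original_rate:.1f}%")    # 20.8%
print(f"Replication: {replication_rate:.1f}%")  # 23.7%
```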

Consistent Findings Across Both Studies

🧠

Heuristic Dominance

Models default to "short distance = walk" without considering what needs to travel. The environmental/health framing overrides physical necessity.

📈

Newer Models Do Better

Models released in late 2025–2026 pass at much higher rates. The Qwen 3.5 family achieves 100% (5/5). GPT-5 family mostly succeeds.

💻

Local Models: Near-Total Failure

10 of 11 Ollama models failed; only minimax-m2.1:cloud passed. Local inference models lack the reasoning quality of cloud frontier models on this task.

Section II

Visual Analysis

Pass Rate by Model Family

Green = truly correct (drive + right reason). Yellow = lucky (drive + wrong reason). Red = failed. Gray = other.

Cost vs. Reasoning Quality

Cost per query (log scale). Expensive models don't guarantee correct reasoning.

Response Time by Result

Correct models are not necessarily slower. Many wrong answers took longer due to elaborate (incorrect) reasoning.

Thinking Models: Does Extended Reasoning Help?

Only 8 of 14 dedicated thinking models pass at 50m. Thinking alone does not guarantee common sense.

Local (Ollama) vs Cloud Models

10 of 11 local Ollama models failed. Only minimax-m2.1:cloud passed. Cloud frontier models pass at 26% vs 9% for local models.
Section III

Full Results

All 131 models, sorted by reasoning quality. For each model, the full results table records: provider, answer, whether the right reason was given, reasoning quality (1–5), response time, cost per query, and the judge's explanation.
Section IV

Key Findings

1. The 50m Trap

At 50 meters, the distance becomes a powerful distractor. Models fixate on "50 meters is so close!" and immediately conclude walking is better — for health, environment, convenience. Only 24% of models see through this to the core issue: the car must physically be at the car wash. Even models that passed Opper's test at this distance in other runs may fail on any given attempt due to LLM non-determinism.

2. The Qwen 3.5 Breakthrough

The Qwen 3.5 family is the standout performer: all five models (plus, 397b, 122b, 35b, 27b) pass with perfect reasoning. Their responses immediately identify that "the vehicle must be physically present at the car wash." This is notable because the older Qwen 3 models (qwen3-max, qwen3-coder, qwen3-max-thinking) all fail. Something changed in the 3.5 generation.

3. Local Models: Near-Complete Failure

Of 11 local Ollama models, only minimax-m2.1:cloud passed with correct reasoning. The other 10 — including deepseek-r1 (three sizes), qwen3 (two sizes), phi4, gemma3n, mistral-small, devstral, and gpt-oss — all failed. deepseek-r1:14b said "drive" but for wrong reasons (convenience, not physical necessity). This suggests quantization and missing RLHF significantly impact common-sense reasoning.

4. Anthropic's Split

Only Claude Opus 4.5, Opus 4.6, and Claude 3.7 Sonnet:thinking pass. Every other Claude model fails — including Sonnet 4.6, Opus 4, Opus 4.1, Sonnet 4, Sonnet 4.5, and Claude 3.5 models. Notably, Sonnet 4.6 passed in Run #006 but failed in this run — demonstrating the non-determinism that makes single-run benchmarks unreliable. The 3.7 Sonnet thinking variant succeeded this time by reasoning through the physical constraint explicitly.

5. Cost Is Not a Predictor

Claude Opus 4 ($0.009/query) and Claude Opus 4.1 ($0.009) both fail, while Gemini 3 Flash ($0.000044) and Grok 4.1 Fast ($0.000241) pass at roughly 37x to 200x lower cost. The most expensive correct model is GPT-5-pro at $0.15/query. The cheapest is Gemini 3 Flash at $0.000044. Common sense is orthogonal to model cost.
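The cost gap is easy to quantify from the per-query figures quoted above (a quick sketch; costs as reported in this run, not independently verified):

```python
# Cost-per-query figures quoted in the report (USD).
costs = {
    "claude-opus-4": 0.009,      # fails
    "gemini-3-flash": 0.000044,  # passes
    "grok-4.1-fast": 0.000241,   # passes
}

# How many times cheaper the passing models are than the failing Opus 4:
ratio_gemini = costs["claude-opus-4"] / costs["gemini-3-flash"]
ratio_grok = costs["claude-opus-4"] / costs["grok-4.1-fast"]
print(f"Gemini 3 Flash: {ratio_gemini:.0f}x cheaper")  # ~205x
print(f"Grok 4.1 Fast:  {ratio_grok:.0f}x cheaper")    # ~37x
```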

6. Reasoning Models: Mixed Results

Of 14 dedicated thinking/reasoning models, 8 passed (57%): o3, o3-pro, o1, o3-mini, claude-3.7-sonnet:thinking, qwen3-30b-thinking, qwen3-next-80b-thinking, and qwen3-vl-235b-thinking. The other 6 failed, including o3-mini-high, o4-mini, o4-mini-high, qwen3-235b-thinking, and sonar-reasoning-pro. Extended reasoning helps some models work through the physical constraint, but it can also reinforce the "short distance = walk" heuristic.

Section V

Methodology

Models were evaluated using the umwelten evaluation framework. Each model received an identical prompt with a system message describing a "helpful assistant" role (temperature 0.3, max 500 tokens). Responses were cached per-run to ensure reproducibility.

Judging was performed by anthropic/claude-haiku-4.5 via OpenRouter, scoring each response on: recommendation (drive/walk/both/unclear), whether the model recognizes the car must be present, whether the correct reason was given, a 1–5 reasoning quality score, and a free-text explanation.

A model is classified as truly correct only if it recommends driving and identifies that the car must physically be at the car wash. Models that recommend driving for other reasons (convenience, speed) are classified as lucky.
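The truly-correct vs. lucky distinction reduces to a simple rule. Here is a minimal sketch of that logic in Python; the type and field names are illustrative assumptions, not the umwelten framework's actual schema:

```python
from dataclasses import dataclass

@dataclass
class JudgeVerdict:
    """Subset of what the judge scores (names are illustrative)."""
    recommendation: str        # "drive" | "walk" | "both" | "unclear"
    car_must_be_present: bool  # did the response say the car has to be there?

def classify(verdict: JudgeVerdict) -> str:
    """Apply the report's classification rule to a judged response."""
    if verdict.recommendation == "drive":
        # Driving for the right reason vs. driving for convenience/speed.
        return "truly_correct" if verdict.car_must_be_present else "lucky"
    if verdict.recommendation == "walk":
        return "failed"
    return "other"  # both / unclear / error
```

This mirrors the deepseek-r1:14b case above: it answered "drive" but cited convenience, so it lands in the lucky bucket.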

Local models were run via Ollama on an Apple Silicon Mac (M-series). Cloud models were accessed via Google AI API (direct) and OpenRouter. The distance was set to 50 meters to match the original Opper.ai study exactly.

Note on variance: LLM responses are non-deterministic even at low temperature. The original Opper.ai study showed significant run-to-run variance (GPT-5 scored 7/10 across 10 runs). Our results represent a single run and should be interpreted as a snapshot, not a definitive classification.
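To see how much a single run can mislead, treat each run as an independent pass/fail trial. With GPT-5's observed per-run pass rate of 0.7, a single snapshot labels it "failed" 30% of the time (a back-of-envelope sketch, assuming independent runs):

```python
from math import comb

def prob_k_passes(p: float, n: int, k: int) -> float:
    """Binomial probability of exactly k passes in n independent runs."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

p = 0.7                  # GPT-5's observed per-run pass rate (7/10)
single_run_fail = 1 - p  # 0.30: chance one snapshot calls it "failed"

# Probability of observing exactly 7 passes in 10 runs, as Opper.ai did:
print(f"{prob_k_passes(p, 10, 7):.3f}")  # 0.267
```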
