The Car Wash Test:
Do LLMs Have Common Sense?
A simple question exposes a fundamental gap in AI reasoning. We tested 131 models across 8 providers — including 11 local Ollama models — to see which ones understand that you need to bring your car to the car wash, not just yourself.
"I want to wash my car. The car wash is 50 meters away. Should I walk or drive?"
Comparison with Original Research
The car wash test was originally devised by Opper.ai, which tested 53 models at 50 meters. We replicated the study at the same distance with 131 models, added an LLM judge for reasoning quality, included the latest 2026 models, and added 11 local Ollama models to compare cloud and local inference.
Opper.ai (Original)
- Distance: 50 meters
- Single-run pass rate: 20.8%
- 10-run consistent: 5 models only
- GPT-5 scored 7/10 across runs
- Human baseline: 71.5% correct
- 33 models never correct in any run
Our Replication (Extended)
- Distance: 50 meters (same)
- Single-run pass rate: 23.7%
- Lucky (drive, wrong reason): 6 models
- Reasoning judge: claude-haiku-4.5
- 14 thinking/reasoning models tested
- 11 local Ollama models: 1 passed
At the same 50m distance, our expanded test shows a 23.7% pass rate vs Opper's 20.8%. The improvement comes from newer 2026 models (Qwen 3.5, GPT-5+, Grok 4+) that weren't available in the original study. The 50m distance remains brutally effective at tripping up models — they latch onto "50 meters is so close, just walk!" without considering that the car is the cargo.
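The headline percentages imply whole-number pass counts. A quick sanity check, assuming pass counts of 11/53 and 31/131 (the only whole-model counts consistent with the reported rates; neither study publishes the raw counts in this section):

```python
# Sanity-check the reported pass rates against inferred whole-model pass counts.
# The counts (11/53, 31/131) are back-calculated assumptions, not published figures.
opper = {"passed": 11, "total": 53}   # original Opper.ai study
ours = {"passed": 31, "total": 131}   # this replication

opper_rate = opper["passed"] / opper["total"]
our_rate = ours["passed"] / ours["total"]

print(f"Opper.ai: {opper_rate:.1%}")  # ~20.8%
print(f"Ours:     {our_rate:.1%}")    # ~23.7%
```

Both inferred counts reproduce the reported rates to one decimal place.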
Consistent Findings Across Both Studies
Heuristic Dominance
Models default to "short distance = walk" without considering what needs to travel. The environmental/health framing overrides physical necessity.
Newer Models Do Better
Models released in late 2025–2026 pass at much higher rates. The Qwen 3.5 family achieves 100% (5/5). GPT-5 family mostly succeeds.
Local Models: 0% Pass Rate
All 12 Ollama models failed. Local inference models lack the reasoning quality of cloud frontier models on this task.
Visual Analysis
Pass Rate by Model Family
Cost vs. Reasoning Quality
Response Time by Result
Thinking Models: Does Extended Reasoning Help?
Local (Ollama) vs Cloud Models
Full Results
All 131 models, sorted by reasoning quality.
| Model | Provider | Answer | Right Reason | Quality | Time | Cost | Reasoning |
|---|---|---|---|---|---|---|---|
Key Findings
1. The 50m Trap
At 50 meters, the distance becomes a powerful distractor. Models fixate on "50 meters is so close!" and immediately conclude walking is better — for health, environment, convenience. Only 24% of models see through this to the core issue: the car must physically be at the car wash. Even models that passed Opper's test at this distance in other runs may fail on any given attempt due to LLM non-determinism.
2. The Qwen 3.5 Breakthrough
The Qwen 3.5 family is the standout performer: all five models (plus, 397b, 122b, 35b, 27b) pass with perfect reasoning. Their responses immediately identify that "the vehicle must be physically present at the car wash." This is notable because the older Qwen 3 models (qwen3-max, qwen3-coder, qwen3-max-thinking) all fail. Something changed in the 3.5 generation.
3. Local Models: Near-Complete Failure
Of 11 local Ollama models, only minimax-m2.1:cloud passed with correct reasoning. The other 10 — including deepseek-r1 (three sizes), qwen3 (two sizes), phi4, gemma3n, mistral-small, devstral, and gpt-oss — all failed. deepseek-r1:14b said "drive" but for wrong reasons (convenience, not physical necessity). This suggests quantization and missing RLHF significantly impact common-sense reasoning.
4. Anthropic's Split
Only Claude Opus 4.5, Opus 4.6, and Claude 3.7 Sonnet:thinking pass. Every other Claude model fails — including Sonnet 4.6, Opus 4, Opus 4.1, Sonnet 4, Sonnet 4.5, and Claude 3.5 models. Notably, Sonnet 4.6 passed in Run #006 but failed in this run — demonstrating the non-determinism that makes single-run benchmarks unreliable. The 3.7 Sonnet thinking variant succeeded this time by reasoning through the physical constraint explicitly.
5. Cost Is Not a Predictor
Claude Opus 4 ($0.009/query) and Claude Opus 4.1 ($0.009) both fail, while Gemini 3 Flash ($0.000044) and Grok 4.1 Fast ($0.000241) pass — at roughly 37x to 205x lower cost. The most expensive correct model is GPT-5-pro at $0.15/query. The cheapest is Gemini 3 Flash at $0.000044. Common sense is orthogonal to model cost.
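The cost ratios follow directly from the per-query figures quoted above:

```python
# Per-query costs quoted in the text (USD).
costs = {
    "claude-opus-4": 0.009,
    "gemini-3-flash": 0.000044,
    "grok-4.1-fast": 0.000241,
}

# How much cheaper the passing models are than the failing Opus 4:
print(round(costs["claude-opus-4"] / costs["gemini-3-flash"]))  # ≈ 205x
print(round(costs["claude-opus-4"] / costs["grok-4.1-fast"]))   # ≈ 37x
```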
6. Reasoning Models: Mixed Results
Of 14 dedicated thinking/reasoning models, 8 passed: o3, o3-pro, o1, o3-mini, claude-3.7-sonnet:thinking, qwen3-30b-thinking, qwen3-next-80b-thinking, and qwen3-vl-235b-thinking. The other 6 failed — including o3-mini-high, o4-mini, o4-mini-high, qwen3-235b-thinking, and sonar-reasoning-pro. Extended reasoning helps some models work through the physical constraint, but it can also reinforce the "short distance = walk" heuristic.
Methodology
Models were evaluated using the umwelten evaluation framework. Each model received an identical prompt with a system message describing a "helpful assistant" role (temperature 0.3, max 500 tokens). Responses were cached per-run to ensure reproducibility.
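The per-model request can be sketched as an OpenAI-style chat payload. The exact system message used by the umwelten framework isn't reproduced here, so its wording below is an assumption; the temperature and token limit match the methodology:

```python
# Sketch of one evaluation request, assuming an OpenAI-style chat payload.
# The system-message wording is illustrative, not the study's exact text.
PROMPT = "I want to wash my car. The car wash is 50 meters away. Should I walk or drive?"

def build_request(model: str) -> dict:
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": PROMPT},
        ],
        "temperature": 0.3,  # matches the methodology
        "max_tokens": 500,   # matches the methodology
    }

req = build_request("openai/gpt-5")
```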
Judging was performed by anthropic/claude-haiku-4.5 via OpenRouter, scoring each response on: recommendation (drive/walk/both/unclear), whether the model recognizes the car must be present, whether the correct reason was given, a 1–5 reasoning quality score, and a free-text explanation.
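The judge's rubric maps naturally onto a structured record. The field names below are illustrative (the framework's actual schema isn't shown here), but each field corresponds to one of the criteria listed above:

```python
from dataclasses import dataclass

# Judge verdict schema mirroring the rubric in the text.
# Field names are assumptions, not the framework's actual schema.
@dataclass
class Verdict:
    recommendation: str          # "drive" | "walk" | "both" | "unclear"
    recognizes_car_needed: bool  # does the response note the car must be present?
    correct_reason: bool         # was driving justified by that physical constraint?
    quality: int                 # 1-5 reasoning quality score
    explanation: str             # free-text justification from the judge

v = Verdict("drive", True, True, 5, "Identifies the car as the object being washed.")
```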
A model is classified as truly correct only if it recommends driving and identifies that the car must physically be at the car wash. Models that recommend driving for other reasons (convenience, speed) are classified as lucky.
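That classification rule is a simple pure function — "truly correct" requires both the drive recommendation and the physical-presence reason; driving for any other reason is merely "lucky":

```python
# Classification rule from the methodology above.
def classify(recommendation: str, correct_reason: bool) -> str:
    if recommendation == "drive" and correct_reason:
        return "truly_correct"
    if recommendation == "drive":
        return "lucky"   # right answer, wrong reason (e.g. convenience, speed)
    return "fail"

print(classify("drive", True))   # truly_correct
print(classify("drive", False))  # lucky
print(classify("walk", False))   # fail
```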
Local models were run via Ollama on an Apple Silicon Mac (M-series). Cloud models were accessed via Google AI API (direct) and OpenRouter. The distance was set to 50 meters to match the original Opper.ai study exactly.
Note on variance: LLM responses are non-deterministic even at low temperature. The original Opper.ai study showed significant run-to-run variance (GPT-5 scored 7/10 across 10 runs). Our results represent a single run and should be interpreted as a snapshot, not a definitive classification.
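To see how large that variance is, model each run as an independent Bernoulli trial (a simplifying assumption). Taking GPT-5's observed 7/10 as the per-run pass probability, a 10-run replication is very unlikely to be perfectly consistent:

```python
# Treat each run as an independent Bernoulli trial (a simplifying assumption)
# with per-run pass probability estimated from GPT-5's observed 7/10.
p = 0.7
all_pass = p ** 10
all_fail = (1 - p) ** 10
consistent = all_pass + all_fail  # probability all 10 runs agree

print(f"P(all 10 runs agree) = {consistent:.4f}")
```

Under this assumption, fewer than 3% of 10-run replications would show a perfectly consistent verdict, which is why single-run results should be read as snapshots.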