THEFOCUS.AI LABS
PROJECT: Umwelten Local Providers Matrix
DOC. NO: LP-2026-05
DATE: May 2026

Local Providers
Full Assessment

A comprehensive evaluation of running open-weights models locally: which model, which runtime, with thinking on or off. Plus an experiment in giving small models compile-error feedback to see whether they can self-correct.

Score vs. Wall-Clock Time efficiency frontier — gemma-4-26b-a4b leads at 95% in 4 minutes

What We Tested

8 models from 5 families across 3 runtimes, producing 20 (model, runtime, thinking-mode) cells; not every model ran in every combination. All models run on a single 64 GB Apple Silicon machine.

The Models

Model | Params | Released | Quant | On disk | Developer | Type
Gemma 4 26B-A4B | 26B (4B active) | Apr 2025 | Q4_K_M | 16 GB | Google | MoE
Gemma 4 31B | 31B | Apr 2025 | Q4_K_M | 19 GB | Google | Dense
Gemma 4 E4B | ~4B | Apr 2025 | Q4_K_M | 3 GB | Google | Small
Gemma 4 E2B | ~2B | Apr 2025 | Q4_K_M | 1.5 GB | Google | Small
GPT-OSS 20B | 20B | Mar 2025 | MXFP4 (native) | 12 GB | OpenAI | Dense
GLM 4.7 Flash | ~7B | Feb 2025 | Q4_K_M | 17 GB | Zhipu AI | Dense
Qwen 3.6 27B | 27B | Apr 2026 | Q4_K_M | 16 GB | Alibaba | Dense
Nemotron 3 Nano 4B | 4B | Mar 2025 | Q4_K_M | 2.9 GB | NVIDIA | Small

The Runtimes

Runtime | Engine | Thinking | Notes
Ollama | ollama daemon | Default (model decides) | Q4_K_M for all; Go-template chat approximations
llamaswap | llama-server + llama-swap | On (default) | --ctx-size 0 --jinja; native chat template
llamaswap-nothink | llama-server + llama-swap | Off (enable_thinking=false) | Same binary, thinking suppressed per-request
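
The nothink rows don't swap weights or binaries; thinking is suppressed per request. A minimal sketch of what such a request can look like against llama.cpp's OpenAI-compatible endpoint behind llama-swap (the localhost:8080 base URL is an assumption, and the chat_template_kwargs passthrough depends on running llama-server with --jinja):

    // Sketch: toggling thinking per-request against llama-server (--jinja)
    // proxied by llama-swap. Same binary and weights for both modes; only
    // this request field differs between llamaswap and llamaswap-nothink.
    async function ask(prompt: string, thinking: boolean): Promise<string> {
      const res = await fetch("http://localhost:8080/v1/chat/completions", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({
          model: "gemma-4-26b-a4b",
          messages: [{ role: "user", content: prompt }],
          // Forwarded into the model's Jinja chat template by llama-server.
          chat_template_kwargs: { enable_thinking: thinking },
        }),
      });
      const data = await res.json();
      return data.choices[0].message.content;
    }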

The 5 Evaluation Dimensions

Dimension | Scoring | Max | Method
Instruction Following | Deterministic constraint checking | /30 | 6 tasks, 5 pts each
Reasoning | LLM judge (Claude Haiku 4.5) | /20 | 4 puzzles, 5 pts each
Coding (Write) | Compile + run + verify output | /126 | 6 problems × 3 languages × 7 pts
Coding (Fix) | Hidden test harness | /25 | 5 buggy JS functions, 5 pts each
Tool Math | Correctness + tool-call counting | /25 | 5 multi-step arithmetic tasks

The Surprising Conclusions

95.0% best combined · 4m13s wall clock · 20 cells tested · 5 dimensions
  1. The runtime barely matters. Ollama and llamaswap score within a few points of each other on every model when you control for quantization and thinking mode.
  2. Thinking mode usually hurts on short-horizon tasks — except for tool-calling, where it’s required.
  3. gpt-oss-20b on llamaswap (thinking on) is the best small open model for tool-calling. 23/25.
  4. gemma-4-26b-a4b is the best value model overall. 95.0% combined in 4 minutes.
  5. Tool-calling is where the matrix falls apart. Three cells scored 0/25, eight scored ≤5/25.
  6. A 4B model cannot fix its own broken code from the compile error. 0 improvements across 8 retries.
  7. Thinking mode can turn 4-minute runs into 1.5-hour runs. Pure cost, negative benefit on short tasks.
  8. gemma-4-31b is paradoxically worse than gemma-4-26b-a4b. The MoE model with 4B active params outperforms the larger dense variant on tool-calling.
  9. The ollama vs llamaswap question is decided by configurability, not performance.
Runtime choice is mostly a wash. What actually moves the needle is which model you load and whether you turn thinking on.

Overall Results

Combined Score — All 20 Cells
gemma-4-26b-a4b (llamaswap-nothink): 95.0%
gpt-oss-20b (llamaswap): 94.6%
gpt-oss-latest (ollama): 89.5%
gemma-4-26b-a4b (llamaswap, thinking on): 86.1%
gemma-4-31b (both runtimes): 84.0%
gemma-4-e2b (llamaswap-nothink): 82.6%
gemma4-26b (ollama): 79.0%
glm-4.7-flash (ollama): 78.7%
qwen3.6-27b (llamaswap-nothink): 78.4%
6 more cells: 73–78%
4 more cells: 61–69%
Score vs. Wall-Clock Time — The Efficiency Frontier
[Scatter: combined score vs. wall-clock time; the ideal zone is top-left. gemma-4-26b-a4b nothink: 4m, 95.0% · gpt-oss-20b think: 16m, 94.6% · gpt-oss ollama: 24m, 89.5% · gemma-26b think: 80m, 86.1% · gemma-31b: 84.0% · gemma-e2b: 3m, 82.6% · nemotron think: 62m, 73.4% · nemotron nothink: 3m, 61.0%]
Each dot is one (model, runtime, thinking-mode) cell. Blue = Gemma family. Red = GPT-OSS family. The ideal zone is top-left: high score, fast.
Per-Dimension Breakdown — Top 5 Cells
[Grouped bars: per-dimension scores (Instruction, Reasoning, Coding Write, Coding Fix, Tool Math) for the top five cells: gemma-26b-a4b nothink, gpt-oss-20b llamaswap, gpt-oss ollama, gemma-31b ollama, gemma-e2b nothink.]
The red bar (Tool Math) tells the story: gemma-31b aces everything except tools; gpt-oss is the only model where tool-calling nearly matches the other dimensions.
Thinking On vs. Thinking Off — Same Model, Same Weights
Model | Thinking Off | Thinking On
gemma-4-26b-a4b | 95.0% · 4m | 86.1% · 80m
gpt-oss-20b | 74.6% · 20m | 94.6% · 16m ← thinking ON wins here (tool-calling)
gemma-4-e4b | 73.9% · 7m | 68.2% · 12m
nemotron-nano-4b | 61.0% · 3m | 73.4% · 62m
Blue (thinking off) is faster and usually scores higher — except gpt-oss-20b, where thinking-on unlocks tool-calling and is actually faster than nothink (which gets stuck in empty retries).

How “Best” Was Decided

Model | Instruction | Reasoning | Code (Write) | Code (Fix) | Tool Math | Combined % | Time
gemma-4-26b-a4b (llamaswap-nothink) | 30/30 | 20/20 | 120/126 | 25/25 | 20/25 | 95.0% | 4m13s
gpt-oss-20b (llamaswap) | 30/30 | 17/20 | 121/126 | 25/25 | 23/25 | 94.6% | 16m6s

The “Combined %” is the unweighted mean of the per-dimension percentages, not a weighted score. Both models tie on Instruction (100%) and Coding (Fix) (100%). Gemma wins Reasoning by 3 points; gpt-oss wins Tool Math by 3 points and Coding (Write) by 1. The differences nearly cancel, leaving a 0.4-point gap.
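
Concretely, for the gemma cell (a two-line check, numbers straight from the table above):

    // Combined % = unweighted mean of the five per-dimension percentages.
    const dims = [30 / 30, 20 / 20, 120 / 126, 25 / 25, 20 / 25];
    const combined = (dims.reduce((sum, d) => sum + d, 0) / dims.length) * 100;
    console.log(combined.toFixed(1)); // "95.0"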

“Best raw scores on tool-calling” → gpt-oss-20b llamaswap (23/25)
“Best general-purpose at reasonable cost” → gemma-4-26b-a4b llamaswap-nothink
“Best out-of-the-box” → gemma4:31b ollama (84.0%, no thinking-mode tuning required)


Methodology

Run Mechanics

The matrix runner iterates model-major: load one model, run all five evaluation suites, evict it before loading the next. Eviction is required on a 64 GB machine because the larger models are 17–19 GB resident. A 20-minute per-cell watchdog and a 5-minute per-task watchdog bound runaway generations; partial responses are scored as-is.
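
In outline, the runner looks like the sketch below (loadModel, evictModel, and runSuite are hypothetical stand-ins for the runner's internals, not its actual API):

    // Model-major iteration: one resident model at a time on a 64 GB machine.
    const SUITES = ["instruction", "reasoning", "code-write", "code-fix", "tool-math"];

    function withTimeout<T>(p: Promise<T>, ms: number): Promise<T | "timeout"> {
      return Promise.race([
        p,
        new Promise<"timeout">(resolve => setTimeout(() => resolve("timeout"), ms)),
      ]);
    }

    async function runMatrix(models: string[]): Promise<void> {
      for (const model of models) {
        await loadModel(model);                  // load once per model
        for (const suite of SUITES) {
          // 20-minute watchdog per cell; each task inside the suite gets its
          // own 5-minute watchdog, and partial output is scored as-is.
          await withTimeout(runSuite(model, suite), 20 * 60 * 1000);
        }
        await evictModel(model);                 // free memory before the next load
      }
    }

    // Hypothetical stubs so the sketch stands alone; the real runner talks
    // to ollama / llama-swap here.
    async function loadModel(_model: string): Promise<void> {}
    async function evictModel(_model: string): Promise<void> {}
    async function runSuite(_model: string, _suite: string): Promise<void> {}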

All quantizations are Q4_K_M across runtimes (apples-to-apples), except gpt-oss which ships natively as MXFP4. Total matrix wall-clock time: 316 minutes (~5.3 hours).


Instruction Following DETERMINISTIC · /30

Can the model follow precise format instructions without overshooting? No reasoning needed; just literal compliance.

The Exact Prompts

1.1 Exact Word Count (5 pts)

Write a sentence about the ocean that contains EXACTLY 12 words. Do not include any other text, explanation, or commentary.

Scoring: 5 if exactly 12 words; 3 if off by 1; 1 if off by 2–3; 0 otherwise.
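
The check is pure string mechanics; a minimal sketch (splitting on whitespace is an assumption about how the harness counts words):

    // Deterministic scorer for task 1.1 (exact word count).
    function scoreWordCount(response: string, target = 12): number {
      const words = response.trim().split(/\s+/).filter(Boolean);
      const off = Math.abs(words.length - target);
      if (off === 0) return 5;   // exactly 12 words
      if (off === 1) return 3;   // off by 1
      if (off <= 3) return 1;    // off by 2-3
      return 0;
    }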

1.2 Structured JSON Output (5 pts) — 4 typed fields, no markdown fences.

1.3 Constrained List (5 pts) — 5 animals, numbered, alphabetical, ≤8 chars.

1.4 Negative Constraints (5 pts) — sunset description avoiding 3 words.

1.5 Format Transformation (5 pts) — CSV to markdown table.

1.6 Multi-format Response (5 pts) — 3 sections separated by ---.
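
The remaining checks are equally mechanical. As one more example, a sketch of a task 1.3 validator (the pass/fail granularity and the accepted numbering formats are assumptions):

    // Deterministic check for task 1.3: five animals, numbered 1-5,
    // alphabetical order, each name 8 characters or fewer.
    function checkConstrainedList(response: string): boolean {
      const lines = response.trim().split("\n").map(l => l.trim()).filter(Boolean);
      if (lines.length !== 5) return false;
      const animals: string[] = [];
      for (let i = 0; i < 5; i++) {
        const m = lines[i].match(/^(\d+)[.)]\s+(.+)$/);  // "1. cat" or "1) cat"
        if (!m || Number(m[1]) !== i + 1) return false;  // numbered in order
        animals.push(m[2].toLowerCase());
      }
      const sorted = [...animals].sort();
      return animals.every((a, i) => a === sorted[i] && a.length <= 8);
    }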

What We Saw

Eleven of 20 cells scored a perfect 30/30. The most discriminating task was exact-word-count:

  • Thinking mode hurt word counting. gemma-4-26b-a4b llamaswap (thinking on) scored 0/5. Same weights, thinking off: 5/5.
  • glm-4.7-flash scored 1/5 — writes nicely but cannot constrain length.

Reasoning LLM JUDGE · /20

Classic reasoning puzzles where the obvious answer is wrong. Claude Haiku 4.5 as judge.

2.1 The Surgeon Riddle · 2.2 Bat and Ball · 2.3 Lily Pad · 2.4 Counterfeit Coin

Twelve cells scored 17–20/20. gemma-4-31b scored 20/20 on both runtimes. The hardest puzzle was counterfeit-coin.

Signal: On short reasoning puzzles, model size matters more than thinking-mode budget.


Coding — Write COMPILE + RUN + CHECK · /126

6 problems × 3 languages (TypeScript, Python, Go), each scored 0–7. Top scorers: gemma-4-31b (126/126 on both runtimes).

The dominant failure mode in Go: Of 35 non-perfect Go scores, 67% were compile errors — “imported and not used” or “declared and not used.”
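
A minimal sketch of the compile-and-verify step for a Go submission (temp paths and the partial-credit split are assumptions; the real rubric awards 0–7):

    import { execFileSync } from "node:child_process";
    import { writeFileSync } from "node:fs";

    // Compile + run + check one Go solution. Go hard-fails on unused
    // imports/variables, which is exactly where most points were lost.
    function scoreGo(source: string, expectedOutput: string): number {
      writeFileSync("/tmp/main.go", source);
      try {
        execFileSync("go", ["build", "-o", "/tmp/prog", "/tmp/main.go"]);
      } catch {
        return 0; // e.g. `imported and not used: "fmt"` -- the dominant failure
      }
      const out = execFileSync("/tmp/prog").toString().trim();
      return out === expectedOutput.trim() ? 7 : 3; // partial credit assumed
    }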


Coding — Fix HIDDEN TEST HARNESS · /25

5 buggy JavaScript functions. Sixteen of 20 cells scored a perfect 25/25. The four that lost points: all three nemotron variants and gemma4-e2b ollama. This dimension has the least signal — too easy for most models.
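
"Hidden test harness" means the model only sees the buggy function and a description; scoring runs tests it never sees. A minimal sketch with a hypothetical sumArray task (function name, tests, and point split are all illustrative):

    // The model returns fixed source; hidden tests decide the score.
    const hiddenTests: Array<[number[], number]> = [
      [[1, 2, 3], 6],
      [[], 0],
      [[-1, 1], 0],
    ];

    function scoreFix(fixedSource: string): number {
      // A real harness would sandbox this; eval-by-Function keeps the sketch short.
      const fn = new Function(`${fixedSource}; return sumArray;`)() as
        (xs: number[]) => number;
      let passed = 0;
      for (const [input, want] of hiddenTests) {
        try { if (fn(input) === want) passed++; } catch { /* crash = fail */ }
      }
      return passed === hiddenTests.length ? 5 : passed; // split assumed
    }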


Tool Math CORRECTNESS + TOOL-CALL COUNTING · /25

Can the model use provided tools (calculator, statistics) to chain multi-step arithmetic?

Model | Tool Math
gpt-oss-20b llamaswap (thinking on) | 23/25
gemma-4-26b-a4b llamaswap (thinking on) | 20/25
gemma-4-26b-a4b llamaswap-nothink | 20/25
gpt-oss-latest ollama | 20/25
gemma-4-31b llamaswap (thinking on) | 20/25
glm-4.7-flash-latest ollama | 14/25
gemma-4-e2b llamaswap-nothink | 12/25
gemma4-26b ollama | 0/25
gpt-oss-20b llamaswap-nothink | 0/25

Why Tool-Calling Is So Weak

  1. Models try to one-shot the answer instead of chaining tools.
  2. Thinking-off models get stuck. gpt-oss-20b nothink: 0/25. Same model thinking-on: 23/25.
  3. Tool-calling format errors. Wrong argument shapes exhaust the retry budget.
  4. Long chains exceed the watchdog.
  5. Reasoning loops on Stats+arithmetic. Models re-derive instead of trusting tool output.

Root cause: Tool-calling requires the model to maintain state across multiple turns and trust intermediate results. Smaller and thinking-off models fail this.
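
For context, here is the shape of the loop a model has to survive; a minimal sketch of an OpenAI-compatible tool-calling round trip against the same llamaswap endpoint (tool shape, endpoint, and the turn budget are assumptions):

    // Multi-turn tool loop: the model must emit a calculator call, wait for
    // the result, and chain it into the next step instead of guessing.
    const tools = [{
      type: "function",
      function: {
        name: "calculator",
        description: "Evaluate one arithmetic operation",
        parameters: {
          type: "object",
          properties: {
            op: { type: "string", enum: ["add", "sub", "mul", "div"] },
            a: { type: "number" },
            b: { type: "number" },
          },
          required: ["op", "a", "b"],
        },
      },
    }];

    async function solveWithTools(question: string): Promise<string> {
      const messages: any[] = [{ role: "user", content: question }];
      for (let turn = 0; turn < 8; turn++) {              // bounded retry budget
        const res = await fetch("http://localhost:8080/v1/chat/completions", {
          method: "POST",
          headers: { "Content-Type": "application/json" },
          body: JSON.stringify({ model: "gpt-oss-20b", messages, tools }),
        }).then(r => r.json());
        const msg = res.choices[0].message;
        messages.push(msg);
        if (!msg.tool_calls?.length) return msg.content;  // final answer
        for (const call of msg.tool_calls) {              // execute each call
          const { op, a, b } = JSON.parse(call.function.arguments);
          const value = op === "add" ? a + b : op === "sub" ? a - b
                      : op === "mul" ? a * b : a / b;
          messages.push({ role: "tool", tool_call_id: call.id, content: String(value) });
        }
      }
      return "(turn budget exhausted)";
    }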


Coding 2-Pass — The Experiment · /126

If a model fails in round 1, does showing it the actual compile/runtime error help?

Variant | Round 1 | Round 2 | Δ
A (scaffolded error) | 93/126 (73.8%) | 79/126 (62.7%) | −14 pts
B (raw error) | 93/126 (73.8%) | 80/126 (63.5%) | −13 pts

Same outcome. Zero improvements. The 4B model correctly identified the fix for one error and then introduced a new error elsewhere.

The 4B-parameter model genuinely lacks the capacity to make a localized fix without disturbing surrounding code. It’s not context length, not prompt format — it’s model capability.
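
The two variants differ only in how the error is presented back to the model; a minimal sketch (exact wording is an assumption):

    // Round-2 prompt construction. Variant B appends the raw compiler/runtime
    // output verbatim; Variant A wraps it in localizing scaffolding.
    function buildRetryPrompt(
      task: string, previousCode: string, error: string, variant: "A" | "B",
    ): string {
      if (variant === "B") {
        return `${task}\n\nYour previous attempt:\n${previousCode}\n\n${error}`;
      }
      return [
        task,
        "Your previous attempt failed. Here is the exact error:",
        error,
        "Fix only what the error points at; change nothing else.",
        "Previous code:",
        previousCode,
      ].join("\n\n");
    }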


What This Means in Practice

Use Case | Pick
Best general overall, low cost | gemma-4-26b-a4b on llamaswap-nothink (95.0%, 4 minutes)
Best for tool-use / agent loops | gpt-oss-20b on llamaswap, thinking on (94.6%, 23/25 on tool math)
Best out-of-the-box | gemma4:31b on ollama (84.0%, no config needed)
Smallest still usable | gemma-4-e2b on llamaswap-nothink (82.6%, 3 minutes)
What to avoid | Thinking-on by default for non-agent tasks; the wall-clock cost is huge
If you’re building agentic workflows that need tools: use thinking-on. The penalty is wall-clock time but you can’t get tool-calling right without it.

If you’re iterating on agent loops that need the model to use compile/test feedback to fix its own code: don’t trust models below ~20B parameters to do this reliably.
