What We Tested
7 model families across 3 runtimes, producing 20 (model, runtime, thinking-mode) cells. All models run on a single 64 GB Apple Silicon machine.
The Models
The Runtimes
| Runtime | Engine | Thinking | Notes |
|---|---|---|---|
| Ollama | ollama daemon | Default (model decides) | Q4_K_M for all; Go-template chat approximations |
| llamaswap | llama-server + llama-swap | On (default) | --ctx-size 0 --jinja — native chat template |
| llamaswap-nothink | llama-server + llama-swap | Off (enable_thinking=false) | Same binary, thinking suppressed per-request |
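The per-request suppression is worth seeing concretely. A minimal sketch of how a nothink cell can ask llama-server for a completion with thinking disabled, assuming the server's OpenAI-compatible endpoint forwards a chat_template_kwargs field to the Jinja template (the exact field and template switch vary by llama.cpp build and by model, so treat this as illustrative):

```python
# Sketch: suppress thinking per request against a llama-swap-fronted llama-server.
# Assumption: the endpoint accepts chat_template_kwargs and the model's chat
# template exposes an enable_thinking switch; neither is guaranteed for every build.
import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "gemma-4-26b-a4b",  # model name as configured in llama-swap
        "messages": [{"role": "user", "content": "Write a 12-word sentence about the ocean."}],
        "chat_template_kwargs": {"enable_thinking": False},  # assumption: template honors this
        "temperature": 0.0,
    },
    timeout=300,
)
print(resp.json()["choices"][0]["message"]["content"])
```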
The 5 Evaluation Dimensions
| Dimension | Scoring | Max | Method |
|---|---|---|---|
| Instruction Following | Deterministic constraint checking | /30 | 6 tasks, 5 pts each |
| Reasoning | LLM judge (Claude Haiku 4.5) | /20 | 4 puzzles, 5 pts each |
| Coding (Write) | Compile + run + verify output | /126 | 6 problems × 3 languages × 7 pts |
| Coding (Fix) | Hidden test harness | /25 | 5 buggy JS functions, 5 pts each |
| Tool Math | Correctness + tool-call counting | /25 | 5 multi-step arithmetic tasks |
The Surprising Conclusions
- The runtime barely matters. Ollama and llamaswap score within a few points of each other on every model when you control for quantization and thinking mode.
- Thinking mode usually hurts on short-horizon tasks — except for tool-calling, where it’s required.
- gpt-oss-20b on llamaswap (thinking on) is the best small open model for tool-calling. 23/25.
- gemma-4-26b-a4b is the best value model overall. 95.0% combined in 4 minutes.
- Tool-calling is where the matrix falls apart. Three cells scored 0/25, eight scored ≤5/25.
- A 4B model cannot fix its own broken code from the compile error. 0 improvements across 8 retries.
- Thinking mode can turn 4-minute runs into 1.5-hour runs. Pure cost, negative benefit on short tasks.
- gemma-4-31b is paradoxically worse than gemma-4-26b-a4b. The MoE model with 4B active params outperforms the larger dense variant on tool-calling.
- The ollama vs llamaswap question is decided by configurability, not performance.
Overall Results
How “Best” Was Decided
| Model | Instruction | Reasoning | Code (Write) | Code (Fix) | Tool Math | Combined % | Time |
|---|---|---|---|---|---|---|---|
| gemma-4-26b-a4b (llamaswap-nothink) | 30/30 | 20/20 | 120/126 | 25/25 | 20/25 | 95.0% | 4m13s |
| gpt-oss-20b (llamaswap) | 30/30 | 17/20 | 121/126 | 25/25 | 23/25 | 94.6% | 16m6s |
The “Combined %” is the unweighted mean of the five per-dimension percentages, not a weighted score. Both models tie on Instruction (100%) and Coding (Fix) (100%). Gemma wins Reasoning by 3 points; gpt-oss wins Tool Math by 3 points and Coding (Write) by 1. The differences nearly cancel, leaving a 0.4% gap.
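Concretely, the combined score is just a plain mean over the five dimension percentages, using the maxima from the dimensions table:

```python
# Unweighted mean of per-dimension percentages (maxima: /30, /20, /126, /25, /25).
def combined(scores, maxima=(30, 20, 126, 25, 25)):
    return 100 * sum(s / m for s, m in zip(scores, maxima)) / len(maxima)

print(round(combined((30, 20, 120, 25, 20)), 1))  # gemma-4-26b-a4b nothink -> 95.0
print(round(combined((30, 17, 121, 25, 23)), 1))  # gpt-oss-20b llamaswap   -> 94.6
```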
- “Best raw scores on tool-calling” → gpt-oss-20b llamaswap (23/25)
- “Best general-purpose at reasonable cost” → gemma-4-26b-a4b llamaswap-nothink
- “Best out-of-the-box” → gemma4:31b ollama (84.0%, no thinking-mode tuning required)
Methodology
Run Mechanics
The matrix runner iterates model-major: load one model, run all five evaluation suites, then evict it before loading the next. This is required on a 64 GB machine because most models are 17–19 GB resident. Each cell runs under a 20-minute watchdog and each task under a 5-minute watchdog; partial responses are scored as-is.
All quantizations are Q4_K_M across runtimes (apples-to-apples), except gpt-oss which ships natively as MXFP4. Total matrix wall-clock time: 316 minutes (~5.3 hours).
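A sketch of that model-major loop, with load_model, run_suite, and evict as hypothetical stand-ins for whatever the real runner calls to talk to ollama or llama-swap:

```python
# Model-major matrix loop: load once, run all five suites, evict, move on.
import time

SUITES = ["instruction", "reasoning", "code_write", "code_fix", "tool_math"]
CELL_TIMEOUT_S = 20 * 60   # 20-minute per-cell watchdog
TASK_TIMEOUT_S = 5 * 60    # 5-minute per-task watchdog

def run_cell(runtime, model, load_model, run_suite, evict):
    """Run every suite for one (runtime, model) cell.

    load_model / run_suite / evict are injected callables standing in for the
    real runner's helpers; only the shape of the loop is shown here.
    """
    load_model(runtime, model)                      # warm the weights (~17-19 GB resident)
    deadline = time.monotonic() + CELL_TIMEOUT_S
    results = {}
    try:
        for suite in SUITES:
            if time.monotonic() > deadline:
                break                               # cell watchdog: score whatever finished
            results[suite] = run_suite(runtime, model, suite, TASK_TIMEOUT_S)
    finally:
        evict(runtime, model)                       # free memory before the next model loads
    return results
```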
Instruction Following · Deterministic · /30
Can the model follow precise format instructions without overshooting? No reasoning needed; just literal compliance.
The Exact Prompts
1.1 Exact Word Count (5 pts)
Write a sentence about the ocean that contains EXACTLY 12 words. Do not include any other text, explanation, or commentary.
Scoring: 5 if exactly 12 words; 3 if off by 1; 1 if off by 2–3; 0 otherwise.
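The checker for this task is trivially reproducible. A minimal sketch of the word-count scorer (the real harness may normalise whitespace or punctuation differently):

```python
def score_exact_word_count(response: str, target: int = 12) -> int:
    # Whitespace-split word count, scored per the rubric above.
    words = response.strip().split()
    diff = abs(len(words) - target)
    if diff == 0:
        return 5
    if diff == 1:
        return 3
    if diff <= 3:
        return 1
    return 0

# 12 words -> 5
print(score_exact_word_count(
    "The ocean hides quiet storms beneath its calm and shining surface today."))
```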
- 1.2 Structured JSON Output (5 pts): 4 typed fields, no markdown fences.
- 1.3 Constrained List (5 pts): 5 animals, numbered, alphabetical, ≤8 chars.
- 1.4 Negative Constraints (5 pts): sunset description avoiding 3 words.
- 1.5 Format Transformation (5 pts): CSV to markdown table.
- 1.6 Multi-format Response (5 pts): 3 sections separated by ---.
What We Saw
Eleven of 20 cells scored a perfect 30/30. The most discriminating task was exact-word-count:
- Thinking mode hurt word counting. gemma-4-26b-a4b llamaswap (thinking on) scored 0/5. Same weights, thinking off: 5/5.
- glm-4.7-flash scored 1/5 — writes nicely but cannot constrain length.
Reasoning · LLM Judge · /20
Classic reasoning puzzles where the obvious answer is wrong. Claude Haiku 4.5 as judge.
2.1 The Surgeon Riddle · 2.2 Bat and Ball · 2.3 Lily Pad · 2.4 Counterfeit Coin
Twelve cells scored 17–20/20. gemma-4-31b scored 20/20 on both runtimes. The hardest puzzle was counterfeit-coin.
Signal: On short reasoning puzzles, model size matters more than thinking-mode budget.
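Judging is one scoring call per puzzle. A sketch using the Anthropic SDK; the model id and rubric wording here are illustrative stand-ins, not the harness's exact prompt:

```python
# Sketch of the LLM-judge scoring call (model id and rubric are assumptions).
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def judge(puzzle: str, expected: str, answer: str) -> int:
    msg = client.messages.create(
        model="claude-haiku-4-5",   # assumption: whichever Haiku 4.5 id the harness uses
        max_tokens=16,
        messages=[{
            "role": "user",
            "content": (
                f"Puzzle: {puzzle}\nReference answer: {expected}\n"
                f"Model answer: {answer}\n"
                "Score the model answer 0-5 for correctness. Reply with the integer only."
            ),
        }],
    )
    return int(msg.content[0].text.strip())
```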
Coding (Write) · Compile + Run + Check · /126
6 problems × 3 languages (TypeScript, Python, Go), each scored 0–7. Top scorer: gemma-4-31b, with 126/126 on both runtimes.
The dominant failure mode in Go: Of 35 non-perfect Go scores, 67% were compile errors — “imported and not used” or “declared and not used.”
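The compile-run-verify step for Go is easy to approximate with go run, which is exactly where unused-import and unused-variable errors surface. A minimal sketch (the real 0–7 rubric is more granular than the three hinges shown here):

```python
# Sketch of the compile/run/verify step for one Go submission.
import subprocess, tempfile, pathlib

def check_go(source: str, expected_stdout: str, timeout_s: int = 30) -> dict:
    with tempfile.TemporaryDirectory() as d:
        path = pathlib.Path(d) / "main.go"
        path.write_text(source)
        # "go run" both compiles and executes; "imported and not used" and
        # "declared and not used" fail at this step, the dominant failure mode above.
        proc = subprocess.run(["go", "run", str(path)],
                              capture_output=True, text=True, timeout=timeout_s)
    return {
        "compiled_and_ran": proc.returncode == 0,
        "output_ok": proc.stdout.strip() == expected_stdout.strip(),
        "stderr": proc.stderr,
    }
```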
Coding (Fix) · Hidden Test Harness · /25
5 buggy JavaScript functions. Sixteen of 20 cells scored a perfect 25/25. The four that lost points: all three nemotron variants and gemma4-e2b ollama. This dimension has the least signal — too easy for most models.
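The hidden-test idea: the model sees only the buggy function, never the tests. A sketch with a hypothetical function and cases, scored all-or-nothing here purely for illustration:

```python
# Sketch of the hidden-test harness for the fix dimension. The function name,
# the cases, and the all-or-nothing scoring are hypothetical placeholders.
import json, subprocess

def run_hidden_tests(fixed_js: str, fn_name: str, cases: list) -> int:
    harness = f"""
{fixed_js}
const cases = {json.dumps(cases)};
let passed = 0;
for (const [args, want] of cases) {{
  try {{ if (JSON.stringify({fn_name}(...args)) === JSON.stringify(want)) passed++; }} catch (e) {{}}
}}
console.log(passed);
"""
    out = subprocess.run(["node", "-e", harness], capture_output=True, text=True, timeout=30)
    return int(out.stdout.strip() or 0)

cases = [([[1, 2, 3]], 6), ([[]], 0), ([[-1, 1]], 0)]
passed = run_hidden_tests("function sumArray(a){return a.reduce((s,x)=>s+x,0);}", "sumArray", cases)
print(5 if passed == len(cases) else 0)
```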
Tool Math · Correctness + Tool-Call Counting · /25
Can the model use provided tools (calculator, statistics) to chain multi-step arithmetic?
| Model | Tool Math |
|---|---|
| gpt-oss-20b llamaswap (thinking-on) | 23/25 |
| gemma-4-26b-a4b llamaswap (thinking-on) | 20/25 |
| gemma-4-26b-a4b llamaswap-nothink | 20/25 |
| gpt-oss-latest ollama | 20/25 |
| gemma-4-31b llamaswap (thinking-on) | 20/25 |
| glm-4.7-flash-latest ollama | 14/25 |
| gemma-4-e2b llamaswap-nothink | 12/25 |
| … | … |
| gemma4-26b ollama | 0/25 |
| gpt-oss-20b llamaswap-nothink | 0/25 |
Why Tool-Calling Is So Weak
- Models try to one-shot the answer instead of chaining tools.
- Thinking-off models get stuck. gpt-oss-20b nothink: 0/25. Same model thinking-on: 23/25.
- Tool-calling format errors. Wrong argument shapes exhaust the retry budget.
- Long chains exceed the watchdog.
- Reasoning loops on statistics + arithmetic tasks. Models re-derive values by hand instead of trusting tool output.
Root cause: Tool-calling requires the model to maintain state across multiple turns and trust intermediate results. Smaller and thinking-off models fail this.
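For reference, this is the shape of the loop these models are failing at. A sketch with an OpenAI-style calculator tool and a call counter; the tool name, schema, and turn budget are assumptions, not the harness's exact setup:

```python
# Sketch of the tool-math loop: tool schema, local calculator, call counter.
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="unused")

TOOLS = [{
    "type": "function",
    "function": {
        "name": "calculator",
        "description": "Evaluate one arithmetic step.",
        "parameters": {
            "type": "object",
            "properties": {
                "op": {"type": "string", "enum": ["add", "sub", "mul", "div"]},
                "a": {"type": "number"},
                "b": {"type": "number"},
            },
            "required": ["op", "a", "b"],
        },
    },
}]

def calculator(op, a, b):
    return {"add": a + b, "sub": a - b, "mul": a * b, "div": a / b}[op]

def solve(task: str, model: str, max_turns: int = 10):
    messages = [{"role": "user", "content": task}]
    tool_calls_made = 0
    for _ in range(max_turns):
        msg = client.chat.completions.create(
            model=model, messages=messages, tools=TOOLS).choices[0].message
        if not msg.tool_calls:
            return msg.content, tool_calls_made   # final answer plus how many steps it chained
        messages.append(msg)
        for call in msg.tool_calls:
            tool_calls_made += 1
            args = json.loads(call.function.arguments)
            messages.append({"role": "tool", "tool_call_id": call.id,
                             "content": json.dumps(calculator(**args))})
    return None, tool_calls_made                  # exceeded the turn budget
```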
Coding 2-Pass: The Experiment · /126
If a model fails in round 1, does showing it the actual compile/runtime error help?
| Variant | Round 1 | Round 2 | Δ |
|---|---|---|---|
| Variant A (scaffolded) | 93/126 (73.8%) | 79/126 (62.7%) | −14 pts |
| Variant B (raw error) | 93/126 (73.8%) | 80/126 (63.5%) | −13 pts |
Same outcome. Zero improvements. The 4B model correctly identified the fix for one error and then introduced a new error elsewhere.
The 4B-parameter model genuinely lacks the capacity to make a localized fix without disturbing surrounding code. It’s not context length, not prompt format — it’s model capability.
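The 2-pass protocol itself is tiny. A sketch of the raw-error variant (variant B), with generate and check as hypothetical stand-ins for the model call and the compile-run-verify step sketched earlier:

```python
# Sketch of the 2-pass protocol, variant B: feed the raw compile/runtime error
# back and ask for a corrected full program. generate() and check() are
# hypothetical stand-ins, not the harness's actual helpers.
def two_pass(problem: str, generate, check):
    code1 = generate(problem)
    r1 = check(code1)
    if r1["output_ok"]:
        return code1, r1, None                    # round 2 only runs on failures
    retry_prompt = (
        f"{problem}\n\nYour previous solution failed with this error:\n"
        f"{r1['stderr'] or 'wrong output'}\n\n"
        "Return the complete corrected program."
    )
    code2 = generate(retry_prompt)
    return code2, r1, check(code2)
```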
What This Means in Practice
| Use Case | Pick |
|---|---|
| Best general overall, low cost | gemma-4-26b-a4b on llamaswap-nothink (95.0%, 4 minutes) |
| Best for tool-use / agent loops | gpt-oss-20b on llamaswap (thinking on) (94.6%, 23/25 on tool math) |
| Best out-of-the-box | gemma4:31b on ollama (84.0%, no config needed) |
| Smallest still usable | gemma-4-e2b on llamaswap-nothink (82.6%, 3 minutes) |
| What to avoid | Thinking-on by default for non-agent tasks. The wall-clock cost is huge. |
If you’re iterating on agent loops that need the model to use compile/test feedback to fix its own code: don’t trust models below ~20B parameters to do this reliably.