models posts

Gemma 4 on Your Machine: How Google’s New Open Weights Stack Up (Model Showdown)

Will Schenk April 4, 2026

We benchmarked Gemma 4 (e2b, default, 26B MoE, 31B dense) through Ollama against 50+ hosted and local models on reasoning, knowledge, instruction, coding, and TezLab MCP tool use—same Umwelten harness as our other showdowns. Here’s where the new line shines, where frontier models still pull ahead, and how the biggest Gemma handles real EV data tools.

Same Weights, Different Results

Will Schenk March 24, 2026

We ran the same Nemotron model on four providers and got wildly different results. MCP tool use ranged from 1/6 to 6/6. Speed varied 16x. The weights are identical. The results are not.

Can LLMs Use Real-World Tools? Mercury-2, ELO, and the Umwelten Setup

Will Schenk March 15, 2026

We ran 39 models against real Rivian driving data via MCP tools. Inception's Mercury-2 delivered a perfect 15/15 in 8.6 seconds. Here's the standout model, the ELO narrative rankings, and how the same Umwelten setup powers both chat and evals.

Sraffa's Gesture, the Crack in the Crystal, and Why the Stochastic Parrot Still Bites

Will Schenk March 12, 2026

Wittgenstein's shift from the Tractatus to the Philosophical Investigations—triggered by Sraffa's Neapolitan gesture—reframes the stochastic parrot debate. The danger isn't that LLMs lack secret understanding. It's our willingness to treat fluent simulation as the real thing.

The Car Wash Test: Learning from Model Evals

Will Schenk March 1, 2026

We asked 131 AI models a simple question — should I walk or drive to the car wash? 76% got it wrong. Simple gotcha questions reveal more about model reasoning than any benchmark leaderboard.

gpt5 is smarter than you are

Models

Will Schenk September 4, 2025

gpt5 can choose to be so smart that it's almost impossible to judge. Let's see how it does on some unanswerable questions and whether it can totally replace Google.

Code Generation with Local Models

Will Schenk August 20, 2025

Small, local AI models deliver surprisingly effective results for everyday tasks. Also, llama3.2 is surprisingly fast and gpt-oss is surprisingly good.

gpt-5 and gpt-oss

Models

Will Schenk August 13, 2025

OpenAI’s GPT-5 launch stole headlines, but GPT-OSS quietly made local AI a lot more practical. This post covers what’s new, how to run it with Ollama or LM Studio, and why context size can change your results.

How I classify models

Will Schenk January 21, 2025

Small models are smart yet limited in knowledge; foundation models possess both deep understanding and extensive knowledge but lack structured problem-solving approaches. Educated models like DeepResearch excel by combining learned reasoning processes with large memory capacities, enabling them to adapt effectively to complex tasks while handling vast information instantaneously.

AI for research: DeepResearch a clear winner

Will Schenk January 12, 2025

Asking the tough questions: DeepResearch excels in depth and comprehensiveness, while o1, Sonnet 3.5, and DeepSeek with DeepThought provide comparable results for complex inquiries. Smaller models like phi4 and llama3.2 are deemed inadequate for intricate topics.

Learning on the go with NotebookLM

Will Schenk January 9, 2025

By using NotebookLM, an AI model capable of generating audio summaries and interactive conversations, you can create customized podcasts on the go. You can also join the conversation.
