The variance test that flipped my local LLM ranking

I've been picking models for a small Ollama pool that handles delegated coding chores from my main agent. Parsers, recursive transformers, throwaway test scaffolds — the kind of thing where round-tripping to a paid API is overkill. Six candidates, three strict prompts, an automated verifier that runs each model's output against valid and invalid inputs. Single-shot ranking, done in an evening.

Then I ran the same prompt three times on every model, and the ranking changed.

The single-shot lie

The "winner" was qwen3.5:9b. 0.985 out of 1.0 on the combined score. The "loser" — well, fifth out of six — was gemma4:latest because it failed an unrelated test-scaffolding prompt that needed Python module-level reasoning.

Then I picked the most discriminating prompt (parsing ISO 8601 durations like PT5M, with a ValueError raised on malformed input) and ran it three independent times on every model at temperature=0.2. Three runs. Same prompt. Same parameters.
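A minimal sketch of that repeat loop, assuming Ollama's non-streaming /api/generate endpoint on the address from my setup. The names ollama_generate, repeat_runs and score are hypothetical, and this is not the Bash wrapper I actually ran, just the same idea in Python.

```python
import json
import urllib.request

# Assumption: default local Ollama address, as in the setup note.
OLLAMA_URL = "http://127.0.0.1:11434/api/generate"


def ollama_generate(model, prompt, temperature=0.2):
    """Send one non-streaming request and return the model's response text."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"temperature": temperature},
    }
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=1800) as resp:
        return json.loads(resp.read())["response"]


def repeat_runs(models, prompt, score, runs=3, generate=ollama_generate):
    """Run the same prompt `runs` times per model.

    `score` is a hypothetical callable that turns raw model output into a
    number (for example, verifier cases passed). Returns {model: [scores]}.
    """
    return {
        model: [score(generate(model, prompt)) for _ in range(runs)]
        for model in models
    }
```

Passing generate as a parameter keeps the loop testable without a GPU: swap in a stub that returns canned text and the rest behaves the same.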

Model	Run 1	Run 2	Run 3
gemma4:latest	22/22	22/22	22/22
qwen2.5-coder:14b	22/22	20/22	20/22
qwen3.5:9b	9/22	9/22	21/22
qwen3.5:4b	4/22	19/22	16/22

gemma4 was byte-stable perfect across three runs. The "fifth-place" model. qwen3.5:9b produced the same byte-identical 725-byte buggy regex twice in a row, then got it right on the third run. The 21/22 score that put it #1 in single-shot came from the less common sampling path.

That's the part that cost me a few hours of chasing my tail. A model can be reliably wrong and occasionally right, and a single run will tell you it's right.
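To make "reliably wrong, occasionally right" visible at a glance, here is a small sketch that summarizes the table above by worst case and spread instead of by any single run. The scores are copied from the table; summarize and rank_by_worst_case are hypothetical helper names.

```python
from collections import Counter

# Passed cases out of 22, three runs per model (from the table above).
RUNS = {
    "gemma4:latest": [22, 22, 22],
    "qwen2.5-coder:14b": [22, 20, 20],
    "qwen3.5:9b": [9, 9, 21],
    "qwen3.5:4b": [4, 19, 16],
}


def summarize(scores):
    """Return min, max, spread, and the repeated score (None if no repeat)."""
    value, count = Counter(scores).most_common(1)[0]
    return {
        "min": min(scores),
        "max": max(scores),
        "spread": max(scores) - min(scores),
        "mode": value if count > 1 else None,
    }


def rank_by_worst_case(runs):
    """Order models by their worst run, best first; ties broken by name."""
    return sorted(runs, key=lambda model: (-min(runs[model]), model))


for model in rank_by_worst_case(RUNS):
    print(model, summarize(RUNS[model]))
```

For qwen3.5:9b this gives a min of 9, a max of 21 and a mode of 9: the typical run is the bad one, and the good run is the outlier.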

The other lesson, learned the hard way

First pass on the Qwen3 family: qwen3:14b returned 1 byte (\n) after 1174 seconds of GPU time. Twenty minutes for a single newline.

Ollama's /api/generate returns two fields for thinking models: response and thinking. My script logged response. When I dumped the raw JSON, the model's thinking field was 21 KB of "Wait, I need to check if I can use src... yes... so I will use src... wait, I need to check if I can use src..." repeating until context filled. done_reason: "stop" on a 21,000-character thinking trace with no committed answer.

The fix was one parameter: "think": false in the request body. With it, all three Qwen3 sizes responded in 8-11 seconds and produced clean code.

If you're benchmarking thinking-capable models against constrained-output prompts, smoke-test with think:false first and log both fields. The default-on thinking is a trap when the prompt forbids preamble and explanation — the model spends its budget arguing with itself and never writes the answer.
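A sketch of that smoke test, under the assumption that the request and reply look like what I saw from /api/generate: a "think" flag in the body, and response, thinking and done_reason in the returned JSON. build_request and inspect_reply are hypothetical helper names.

```python
import json


def build_request(model, prompt, think=False):
    """Request body for /api/generate with thinking switched off by default."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "think": think,
        "options": {"temperature": 0.2},
    }


def inspect_reply(raw_json):
    """Log-friendly summary of BOTH output fields of one reply."""
    data = json.loads(raw_json)
    response = data.get("response") or ""
    thinking = data.get("thinking") or ""
    return {
        "response_bytes": len(response.encode("utf-8")),
        "thinking_bytes": len(thinking.encode("utf-8")),
        "done_reason": data.get("done_reason"),
        # True when the model reasoned but never committed an answer.
        "answer_missing": not response.strip() and bool(thinking.strip()),
    }
```

Flagging answer_missing in the log would have turned my twenty-minute newline into an obvious failure instead of a mystery.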

What I actually use now

After the variance pass, my local routing table is short:

Parsers, regex, recursive transformers → gemma4:latest. Stable across six runs of two prompts at temp 0.2.
Tests, fixtures, anything needing Python runtime semantics → qwen2.5-coder:14b. Tight cluster of 20-22, only model that handled the test-scaffolding trap.
Skip → qwen3:14b (stably mediocre), deepseek-coder-v2:16b (stably wrong on valid inputs, same regex bug 3/3).
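Written down as data, the table above could look something like this. The task-kind keys and the pick_model helper are hypothetical, and returning None for an unknown kind (so the caller decides what to do instead of delegating locally) is my assumption, not something the benchmark settled.

```python
# Task kind -> local model, following the routing table above.
ROUTES = {
    "parser": "gemma4:latest",
    "regex": "gemma4:latest",
    "recursive_transformer": "gemma4:latest",
    "tests": "qwen2.5-coder:14b",
    "fixtures": "qwen2.5-coder:14b",
    "python_runtime": "qwen2.5-coder:14b",
}

# Models that stay out of the pool after the variance pass.
SKIP = {"qwen3:14b", "deepseek-coder-v2:16b"}


def pick_model(task_kind):
    """Return the local model for a task kind, or None if there is no route."""
    model = ROUTES.get(task_kind)
    if model in SKIP:
        return None  # guard against a skipped model sneaking back in
    return model
```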

The most useful single observation, the one I keep coming back to: a general-purpose model (gemma4) beat the dedicated coder (qwen2.5-coder:14b) on every prompt that didn't require Python runtime reasoning. The "coder" label means trained on code, not best at every code task.

So what

Two things, mostly:

Single-shot benchmarks lie in both directions. They flatter unstable models that happen to roll well, and they punish stable models that fail one unrelated test. If you're picking a model to delegate real work to, three runs minimum on the prompt that matters most. Five if you can afford it. The cost is rounding error compared to debugging a bimodal model in production.

Read the fields you don't think you need. I'd already shipped a benchmark wrapper that ignored thinking because none of my older models returned it. The first time it mattered, I lost an evening to a model that was generating output, just not where I was looking.

Both lessons feel obvious in hindsight. They were not obvious when I was looking at a clean ranking that confidently put the wrong model on top.

Setup: single workstation, 16 GB VRAM, Ollama on 127.0.0.1:11434. Bash wrapper, Python verifier that strips markdown fences and exec()s output against a battery of valid + invalid inputs. I'll publish the full numbers and prompts when I clean up the harness.
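Until the harness is published, here is a rough sketch of the verifier idea described above: strip a markdown fence if there is one, exec() the code, then count valid inputs that return the expected value and invalid inputs that raise ValueError. The names strip_fences and verify, and the func_name argument, are hypothetical; the real verifier may differ in every detail.

```python
import re

FENCE = "`" * 3  # built this way to avoid a literal fence inside this block
FENCE_RE = re.compile(FENCE + r"[\w+-]*\n(.*?)" + FENCE, re.DOTALL)


def strip_fences(text):
    """Return the body of the first fenced block, or the text unchanged."""
    match = FENCE_RE.search(text)
    return match.group(1) if match else text


def verify(model_output, func_name, valid_cases, invalid_inputs):
    """Return (passed, total) for one model output.

    valid_cases is a list of (argument, expected_result) pairs;
    invalid_inputs is a list of arguments that must raise ValueError.
    Warning: exec() runs untrusted code, so only use this in a sandbox.
    """
    total = len(valid_cases) + len(invalid_inputs)
    namespace = {}
    try:
        exec(strip_fences(model_output), namespace)
        func = namespace[func_name]
    except Exception:
        return 0, total  # code did not load or never defined the function

    passed = 0
    for arg, expected in valid_cases:
        try:
            if func(arg) == expected:
                passed += 1
        except Exception:
            pass  # crashing on a valid input counts as a failure
    for arg in invalid_inputs:
        try:
            func(arg)
        except ValueError:
            passed += 1  # the prompt demands ValueError on malformed input
        except Exception:
            pass  # any other exception type counts as a failure
    return passed, total
```

Returning a (passed, total) pair is what makes scores like 21/22 comparable across runs, and comparing the raw outputs byte for byte is a separate check on top of it.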
