The variance test that flipped my local LLM ranking
I've been picking models for a small Ollama pool that handles delegated coding chores from my main agent. Parsers, recursive transformers, throwaway test scaffolds — the kind of thing where round-tripping to a paid API is overkill. Six candidates, three strict prompts, an automated verifier that runs each model's output against valid and invalid inputs. Single-shot ranking, done in an evening.
Then I ran the same prompt three times on every model, and the ranking changed.
The single-shot lie
The "winner" was qwen3.5:9b. 0.985 out of 1.0 on the combined score. The "loser" — well, fifth out of six — was gemma4:latest because it failed an unrelated test-scaffolding prompt that needed Python module-level reasoning.
Then I picked the most discriminating prompt (parsing ISO 8601 durations like PT5M, with a ValueError raised on malformed input) and ran it three independent times on every model at temperature=0.2. Three runs. Same prompt. Same parameters.
| Model | Run 1 | Run 2 | Run 3 |
|---|---|---|---|
| gemma4:latest | 22/22 | 22/22 | 22/22 |
| qwen2.5-coder:14b | 22/22 | 20/22 | 20/22 |
| qwen3.5:9b | 9/22 | 9/22 | 21/22 |
| qwen3.5:4b | 4/22 | 19/22 | 16/22 |
gemma4 was byte-stable perfect across three runs. The "fifth-place" model. qwen3.5:9b produced the same 725-byte buggy regex, byte for byte, twice in a row, then got it right on the third run. The 21/22 score that put it #1 in single-shot came from the less common sampling path.
That's the part that cost me a few hours of chasing my tail. A model can be reliably wrong and occasionally right, and a single run will tell you it's right.
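One way to make that concrete is to judge each model by its worst run instead of whichever run you happened to see. The sketch below uses only the scores from the table above; the function names are illustrative and not taken from the original harness.

```python
from statistics import median

# Scores out of 22, copied from the three-run table above.
SCORES = {
    "gemma4:latest": [22, 22, 22],
    "qwen2.5-coder:14b": [22, 20, 20],
    "qwen3.5:9b": [9, 9, 21],
    "qwen3.5:4b": [4, 19, 16],
}


def summarize_scores(scores):
    """Min, median, max and spread for one model's repeated runs."""
    return {
        "min": min(scores),
        "median": median(scores),
        "max": max(scores),
        "spread": max(scores) - min(scores),
    }


def rank_by_worst_case(table):
    """Order models by their worst run, breaking ties on the median."""
    return sorted(table, key=lambda m: (min(table[m]), median(table[m])), reverse=True)


def rank_by_single_run(table, run_index):
    """Order models by one run only, which is what a single-shot benchmark sees."""
    return sorted(table, key=lambda m: table[m][run_index], reverse=True)


if __name__ == "__main__":
    for model in rank_by_worst_case(SCORES):
        print(model, summarize_scores(SCORES[model]))
```

Ranked by worst case, the order is gemma4:latest, qwen2.5-coder:14b, qwen3.5:9b, qwen3.5:4b. Ranked on run 3 alone, qwen3.5:9b jumps ahead of qwen2.5-coder:14b, which is the single-shot effect in miniature.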
The other lesson, learned the hard way
First pass on the Qwen3 family: qwen3:14b returned 1 byte (\n) after 1174 seconds of GPU time. Nearly twenty minutes for a single newline.
Ollama's /api/generate returns two fields for thinking models: response and thinking. My script logged only response. When I dumped the raw JSON, the model's thinking field was 21 KB of "Wait, I need to check if I can use src... yes... so I will use src... wait, I need to check if I can use src..." repeating until context filled. The request finished with done_reason: "stop" on a 21,000-character thinking trace with no committed answer.
The fix was one parameter: "think": false in the request body. With it, all three Qwen3 sizes responded in 8-11 seconds and produced clean code.
If you're benchmarking thinking-capable models against constrained-output prompts, smoke-test with "think": false first and log both fields. Default-on thinking is a trap when the prompt forbids preamble and explanation: the model spends its budget arguing with itself and never writes the answer.
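A minimal sketch of that smoke test, assuming the request and response fields described above (think, response, thinking, done_reason) and the local address from the setup note. The helper names and the "suspicious" flag are my own illustration, and how a given Ollama version treats the think field for models without thinking support is worth checking before relying on it.

```python
import json
import urllib.request

OLLAMA_URL = "http://127.0.0.1:11434/api/generate"  # address from the setup note


def build_payload(model, prompt, think=False, temperature=0.2):
    """Request body for a single non-streaming generation."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "think": think,  # False turns off the thinking phase
        "options": {"temperature": temperature},
    }


def summarize(reply):
    """Log both output fields so an empty answer cannot hide a long thinking trace."""
    response = reply.get("response") or ""
    thinking = reply.get("thinking") or ""
    return {
        "response_chars": len(response),
        "thinking_chars": len(thinking),
        "done_reason": reply.get("done_reason"),
        # Hypothetical flag: the model thought but committed no visible answer.
        "suspicious": not response.strip() and len(thinking) > 0,
    }


def generate(model, prompt, think=False, timeout=1800):
    """POST to a running Ollama server and return the parsed JSON reply."""
    body = json.dumps(build_payload(model, prompt, think)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)
```

With a server running, `summarize(generate("qwen3:14b", prompt))` gives one log line per call. A reply shaped like the failure above (a lone newline in response, 21,000 characters in thinking) comes back with suspicious set to True.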
What I actually use now
After the variance pass, my local routing table is short:
Parsers, regex, recursive transformers → gemma4:latest. Stable across six runs of two prompts at temp 0.2.
Tests, fixtures, anything needing Python runtime semantics → qwen2.5-coder:14b. Tight cluster of 20-22, only model that handled the test-scaffolding trap.
Skip → qwen3:14b (stably mediocre), deepseek-coder-v2:16b (stably wrong on valid inputs, same regex bug 3/3).
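Written down as data, the table above might look like the sketch below. The task-kind keys and the pick_model helper are hypothetical names chosen for illustration; only the model tags and their groupings come from the table.

```python
from typing import Optional

# Task kind -> model tag, following the routing table above.
ROUTES = {
    "parser": "gemma4:latest",
    "regex": "gemma4:latest",
    "recursive_transformer": "gemma4:latest",
    "tests": "qwen2.5-coder:14b",
    "fixtures": "qwen2.5-coder:14b",
}

# Models excluded after the variance pass.
SKIP = {"qwen3:14b", "deepseek-coder-v2:16b"}


def pick_model(task_kind: str) -> Optional[str]:
    """Return the local model for a task kind, or None if nothing is routed."""
    model = ROUTES.get(task_kind)
    if model in SKIP:
        # Guard against a skipped model being routed by mistake.
        return None
    return model
```

A None result means no local model has earned that task kind yet; what the caller does next is outside this sketch.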
The most useful single observation, the one I keep coming back to: a general-purpose model (gemma4) beat the dedicated coder (qwen2.5-coder:14b) on every prompt that didn't require Python runtime reasoning. The "coder" label means trained on code, not best at every code task.
So what
Two things, mostly:
Single-shot benchmarks lie in both directions. They flatter unstable models that happen to roll well, and they punish stable models that fail one unrelated test. If you're picking a model to delegate real work to, three runs minimum on the prompt that matters most. Five if you can afford it. The cost is rounding error compared to debugging a bimodal model in production.
Read the fields you don't think you need. I'd already shipped a benchmark wrapper that ignored thinking because none of my older models returned it. The first time it mattered, I lost an evening to a model that was generating output, just not where I was looking.
Both lessons feel obvious in hindsight. They were not obvious when I was looking at a clean ranking that confidently put the wrong model on top.
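For the three-runs-minimum rule, a repeat loop that hashes every output is enough to show both score variance and byte-identical repeats. This is a sketch under assumptions: generate and score are placeholder callables standing in for a real model call and a real verifier, and none of these names come from the original wrapper.

```python
import hashlib
from typing import Callable


def repeat_runs(generate: Callable[[], str], score: Callable[[str], int], n: int = 3):
    """Call the model n times with identical inputs and record each outcome."""
    runs = []
    for i in range(n):
        output = generate()
        raw = output.encode("utf-8")
        runs.append({
            "run": i + 1,
            "bytes": len(raw),
            "sha256": hashlib.sha256(raw).hexdigest(),
            "score": score(output),
        })
    return runs


def distinct_outputs(runs) -> int:
    """Number of different outputs seen; 1 means every run was byte-identical."""
    return len({r["sha256"] for r in runs})


if __name__ == "__main__":
    # Stand-in for a model: two identical bad answers, then a good one.
    canned = iter(["bad regex", "bad regex", "good regex"])
    runs = repeat_runs(
        lambda: next(canned),
        score=lambda out: 21 if out.startswith("good") else 9,
    )
    print([r["score"] for r in runs], distinct_outputs(runs))  # [9, 9, 21] 2
```

Matching hashes with a low score are the "reliably wrong" signature; a single different hash with a high score is the lucky roll.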
Setup: single workstation, 16 GB VRAM, Ollama on 127.0.0.1:11434. Bash wrapper, Python verifier that strips markdown fences and exec()s output against a battery of valid + invalid inputs. I'll publish the full numbers and prompts when I clean up the harness.
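Until the harness is published, here is a rough sketch of what a verifier of that shape could look like: strip the markdown fence, exec() the code, then count how many valid inputs return the expected value and how many invalid inputs raise ValueError. Everything specific here is an assumption for illustration: the function name parse_duration, the choice of seconds as the return value, and the tiny test battery.

```python
import re

FENCE = "`" * 3  # built this way so the sketch itself contains no literal fence
FENCE_RE = re.compile(FENCE + r"[\w+-]*\n(.*?)" + FENCE, re.DOTALL)


def strip_fences(text):
    """Return the first fenced block's contents, or the text unchanged if unfenced."""
    match = FENCE_RE.search(text)
    return match.group(1) if match else text


def verify(model_output, func_name, valid_cases, invalid_inputs):
    """Return (passed, total) for one model output."""
    total = len(valid_cases) + len(invalid_inputs)
    namespace = {}
    try:
        exec(strip_fences(model_output), namespace)  # runs untrusted code
        func = namespace[func_name]
    except Exception:
        return 0, total  # did not compile, or never defined the function
    passed = 0
    for arg, expected in valid_cases:
        try:
            passed += func(arg) == expected
        except Exception:
            pass  # crashing on a valid input counts as a failure
    for arg in invalid_inputs:
        try:
            func(arg)
        except ValueError:
            passed += 1  # the required error type
        except Exception:
            pass  # wrong exception type counts as a failure
    return passed, total


# Hypothetical model output: fenced, and it only understands minutes.
SAMPLE = FENCE + "python\n" + (
    "import re\n"
    "def parse_duration(s):\n"
    "    m = re.fullmatch(r'PT(\\d+)M', s)\n"
    "    if not m:\n"
    "        raise ValueError(s)\n"
    "    return int(m.group(1)) * 60\n"
) + FENCE

if __name__ == "__main__":
    print(verify(SAMPLE, "parse_duration", [("PT5M", 300), ("PT10M", 600)], ["", "5M", "PTM"]))  # (5, 5)
```

Because exec() runs whatever the model wrote, a harness like this belongs in a throwaway process or container, and a per-case timeout is worth adding since nothing here stops an infinite loop.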

