The variance test that flipped my local LLM ranking

Senior Technical Product Manager | AI-Native Transformation & Cloud-Native Network Functions 20+ years leading product strategy in telecom, now focused on the convergence of AI/ML and cloud-native 5G network infrastructure. Defining AI-native transformation strategies for complex Kubernetes-based network functions — embedding ML-driven lifecycle management, AIOps, and intelligent automation into telecom products. Lifelong AI/ML engagement since my university thesis on Hopfield Neural Networks (1998). I maintain a personal lab and build projects to stay hands-on with the technologies shaping the products I manage.

I've been picking models for a small Ollama pool that handles delegated coding chores from my main agent. Parsers, recursive transformers, throwaway test scaffolds — the kind of thing where round-tripping to a paid API is overkill. Six candidates, three strict prompts, an automated verifier that runs each model's output against valid and invalid inputs. Single-shot ranking, done in an evening.

Then I ran the same prompt three times on every model, and the ranking changed.

The single-shot lie

The "winner" was qwen3.5:9b. 0.985 out of 1.0 on the combined score. The "loser" — well, fifth out of six — was gemma4:latest because it failed an unrelated test-scaffolding prompt that needed Python module-level reasoning.

Then I picked the most discriminating prompt (parsing ISO 8601 durations like PT5M, with a ValueError raised on malformed input) and ran it three independent times on every model at temperature=0.2. Three runs. Same prompt. Same parameters.

| Model | Run 1 | Run 2 | Run 3 |
|---|---|---|---|
| gemma4:latest | 22/22 | 22/22 | 22/22 |
| qwen2.5-coder:14b | 22/22 | 20/22 | 20/22 |
| qwen3.5:9b | 9/22 | 9/22 | 21/22 |
| qwen3.5:4b | 4/22 | 19/22 | 16/22 |

gemma4 was perfect and byte-stable across all three runs. The "fifth-place" model. qwen3.5:9b produced the same byte-identical, 725-byte buggy regex twice in a row, then got it right on the third run. The 21/22 score that put it #1 in single-shot was the less common sampling path.

That's the part that cost me a few hours of chasing my tail. A model can be reliably wrong and occasionally right, and a single run will tell you it's right.
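A minimal sketch of that variance pass, not the actual harness. `generate` and `score` are hypothetical callables supplied by the caller: one returns a model's raw output for a prompt, the other turns that output into a number of passed tests. Hashing the outputs is one way to tell "same answer every time" apart from "different answer, same score".

```python
import hashlib
from collections import Counter


def variance_pass(generate, score, model, prompt, runs=3):
    """Run the same prompt `runs` times and summarise the spread.

    generate(model, prompt) -> str   (hypothetical: e.g. a wrapper around the model API)
    score(output) -> int             (hypothetical: e.g. tests passed out of 22)
    """
    outputs = [generate(model, prompt) for _ in range(runs)]
    scores = [score(out) for out in outputs]

    # Identical digests mean byte-identical outputs.
    digests = Counter(hashlib.sha256(out.encode()).hexdigest() for out in outputs)

    return {
        "scores": scores,
        "min": min(scores),
        "max": max(scores),
        "distinct_outputs": len(digests),
        "most_common_count": digests.most_common(1)[0][1],
    }
```

A model that is reliably wrong and occasionally right shows up here as a low `min`, a high `max`, and a `most_common_count` that belongs to the low score.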

The other lesson, learned the hard way

First pass on the Qwen3 family: qwen3:14b returned 1 byte (\n) after 1174 seconds of GPU time. Twenty minutes for a single newline.

Ollama's /api/generate returns two fields for thinking models: response and thinking. My script logged only response. When I dumped the raw JSON, the model's thinking field was 21 KB of "Wait, I need to check if I can use src... yes... so I will use src... wait, I need to check if I can use src..." repeating until context filled. done_reason: "stop" on a 21,000-character thinking trace with no committed answer.

The fix was one parameter: "think": false in the request body. With it, all three Qwen3 sizes responded in 8-11 seconds and produced clean code.

If you're benchmarking thinking-capable models against constrained-output prompts, smoke-test with "think": false first and log both fields. Default-on thinking is a trap when the prompt forbids preamble and explanation: the model spends its budget arguing with itself and never writes the answer.
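A hedged sketch of such a smoke test, using only the standard library. The endpoint address comes from the setup note. The helper names (`build_request`, `summarise`, `smoke_test`) are made up for illustration, and the code assumes an Ollama version recent enough to accept the `think` field.

```python
import json
import urllib.request

OLLAMA_URL = "http://127.0.0.1:11434/api/generate"


def build_request(model, prompt, think=False, temperature=0.2):
    """Request body for a single non-streaming generation."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "think": think,
        "options": {"temperature": temperature},
    }


def summarise(reply):
    """Report the size of both text fields so an empty answer cannot hide."""
    response = reply.get("response") or ""
    thinking = reply.get("thinking") or ""
    return {
        "response_bytes": len(response.encode()),
        "thinking_bytes": len(thinking.encode()),
        "done_reason": reply.get("done_reason"),
        "answer_missing": not response.strip(),
    }


def smoke_test(model, prompt, think=False):
    """Send one request to the local Ollama server and summarise the reply."""
    body = json.dumps(build_request(model, prompt, think)).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return summarise(json.loads(resp.read()))
```

With a summary like this, the one-newline reply would have shown up as `response_bytes` of 1 next to a `thinking_bytes` in the tens of thousands, with `answer_missing` set.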

What I actually use now

After the variance pass, my local routing table is short:

  • Parsers, regex, recursive transformers: gemma4:latest. Stable across six runs of two prompts at temp 0.2.

  • Tests, fixtures, anything needing Python runtime semantics: qwen2.5-coder:14b. Tight cluster of 20-22, only model that handled the test-scaffolding trap.

  • Skip: qwen3:14b (stably mediocre), deepseek-coder-v2:16b (stably wrong on valid inputs, same regex bug 3/3).
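Written down as data, the table could look something like this. The task-kind keys and the `pick_model` helper are illustrative names, and returning `None` for anything unlisted is an assumption about how an unknown task would be handled.

```python
# Task kind -> local model, taken from the list above.
ROUTES = {
    "parser": "gemma4:latest",
    "regex": "gemma4:latest",
    "recursive_transformer": "gemma4:latest",
    "tests": "qwen2.5-coder:14b",
    "fixtures": "qwen2.5-coder:14b",
}

# Models dropped after the variance pass.
SKIP = {"qwen3:14b", "deepseek-coder-v2:16b"}


def pick_model(task_kind):
    """Return the local model for a task kind, or None if nothing is routed locally."""
    model = ROUTES.get(task_kind)
    return None if model in SKIP else model


print(pick_model("regex"))  # gemma4:latest
```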

The most useful single observation, the one I keep coming back to: a general-purpose model (gemma4) beat the dedicated coder (qwen2.5-coder:14b) on every prompt that didn't require Python runtime reasoning. The "coder" label means trained on code, not best at every code task.

So what

Two things, mostly:

Single-shot benchmarks lie in both directions. They flatter unstable models that happen to roll well, and they punish stable models that fail one unrelated test. If you're picking a model to delegate real work to, three runs minimum on the prompt that matters most. Five if you can afford it. The cost is rounding error compared to debugging a bimodal model in production.

Read the fields you don't think you need. I'd already shipped a benchmark wrapper that ignored thinking because none of my older models returned it. The first time it mattered, I lost an evening to a model that was generating output, just not where I was looking.

Both lessons feel obvious in hindsight. They were not obvious when I was looking at a clean ranking that confidently put the wrong model on top.


Setup: single workstation, 16 GB VRAM, Ollama on 127.0.0.1:11434. Bash wrapper, Python verifier that strips markdown fences and exec()s output against a battery of valid + invalid inputs. I'll publish the full numbers and prompts when I clean up the harness.
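Until the harness is published, here is a rough sketch of what a verifier of that shape might look like. It is an approximation built from the description above, not the real code: `strip_fences` and `verify` are hypothetical names, and the test battery is passed in by the caller.

```python
import re

TICKS = "`" * 3  # built at runtime so this listing contains no literal fence
FENCE = re.compile(TICKS + r"[\w+-]*[ \t]*\n(.*?)" + TICKS, re.DOTALL)


def strip_fences(text):
    """Return the body of the first fenced block, or the text itself if unfenced."""
    match = FENCE.search(text)
    return (match.group(1) if match else text).strip()


def verify(output, func_name, valid_cases, invalid_inputs):
    """Score one model output; returns (passed, total).

    valid_cases: list of (input, expected) pairs.
    invalid_inputs: inputs for which the function must raise ValueError.
    """
    total = len(valid_cases) + len(invalid_inputs)
    namespace = {}
    try:
        # Model output is untrusted code: run it in a sandbox in practice.
        exec(strip_fences(output), namespace)
        func = namespace[func_name]
    except Exception:
        return 0, total  # did not compile, or did not define the function

    passed = 0
    for arg, expected in valid_cases:
        try:
            if func(arg) == expected:
                passed += 1
        except Exception:
            pass  # a crash on valid input counts as a failure
    for arg in invalid_inputs:
        try:
            func(arg)
        except ValueError:
            passed += 1  # the only accepted outcome for malformed input
        except Exception:
            pass
    return passed, total
```

This sketch has no timeout, so an output that loops forever would hang the run. Running each candidate in a subprocess with a time limit would be the safer design.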
