The Evening I Compared Claude Sonnet and Gemini Flash on the Same Bug
The Evening I Compared Claude Sonnet and Gemini Flash on the Same Bug
It was around 9pm and my Tistory pipeline had failed for the third time that day. The error was the same one I had been ignoring for a week: a Playwright click on the editor’s category dropdown that worked locally on my MacBook but kept timing out when the cron job ran on my Mac mini.
I was tired enough to stop being loyal to any one model. So I opened two terminals, pointed both at my claude-max-api-proxy on localhost:3456, and pasted the exact same broken function into Claude Sonnet 4.6 and Gemini 2.5 Flash. Same prompt, same logs, same screenshots. I just wanted to see what would happen.
The Bug I Was Actually Stuck On
The function clicked a category button, waited for the dropdown to render, then clicked the target category by text. Locally it was fine. In cron at 16:00 it threw ElementHandleError: element is not attached to the DOM roughly half the time.
I had spent two evenings adding wait_for_selector calls in random places. None of them helped. I assumed it was a network speed issue on the mini, which is the kind of lazy diagnosis you reach for when you are too tired to actually read the trace.
What Claude Sonnet Said
Claude went first because that is my default reflex now. The reply was clean and confident. It suggested switching from page.click(selector) to page.locator(selector).click(), adding an explicit wait_for_load_state("networkidle"), and wrapping the dropdown interaction in a small retry loop.
It was good advice. It was also the same advice I had already half-implemented. The retry loop in particular felt like papering over the symptom. I copied the diff into a scratch file but did not run it yet. Something about the answer felt like Claude was being polite to my framing of the problem instead of questioning it.
What Gemini 2.5 Flash Said
Then I pasted the same thing into Flash. The response was shorter and a little blunter. It pointed at one specific line and said the dropdown was probably re-rendering between the moment I queried the option element and the moment I clicked it. Classic stale element. It suggested I stop holding a reference to the option and instead chain the locator so Playwright re-resolves it on click.
That was the bug. I tested it. The race went away. The cron has been green for four runs in a row as I write this.
Eben’s note: The cheaper model that I had been treating as a backup beat my default model on a real bug from my own pipeline, and I have to actually sit with that.
Why This Bothered Me a Little
I pay attention to model rankings. I read the threads. But running a head-to-head on something I actually cared about felt different from skimming a benchmark chart. Claude was not wrong. It was just answering a slightly different question than the one I needed answered. Gemini happened to read the trace more literally and caught the timing pattern.
This is not the first time a smaller or cheaper model has surprised me. I had a similar moment with DeepSeek V3 writing Python cleaner than I do, and that experience already nudged me toward being less brand-loyal. Tonight pushed me further.
How I Am Splitting the Pipeline Now
I am not switching everything to Flash. Claude Sonnet still wins on the work I trust it for: writing the actual blog drafts, refactoring messy modules, and anything where I want it to push back on my plan. Claude Code in my editor is still the daily driver and I am not changing that.
But for stack trace triage and “look at this log and tell me what is happening,” I am routing to Gemini 2.5 Flash through OpenRouter first. It is fast, it is cheap, and on this kind of narrow forensic question it was simply better tonight. I added a tiny router in my proxy that picks the model based on the prompt prefix. Two lines of config. That is it.
What I am taking from this
Run your own head-to-heads on bugs you actually have. A toy benchmark cannot tell you which model fits which slot in your pipeline, because your pipeline has its own weird shape. Mine has a Mac mini, a flaky Tistory editor, and a cron schedule, and the right model for each task is not obvious until you put two of them next to each other on the same trace.
The practical takeaway: next time a model gives you confident, polished advice that does not solve the problem, paste the exact same prompt into a cheaper one before you keep iterating. It costs almost nothing and it might catch the thing you stopped seeing three hours ago.
Related Posts
- The Morning I Realized DeepSeek V3 Writes Python Cleaner Than I Do
- The Week I Let a Cron Agent Email Me Daily Summaries of My Own Code
- The Tuesday I Watched Qwen Run Locally on My Mac Mini With No Internet
Tags: #AIagents #ClaudeCode #OpenClaw #MacMini #OpenRouter #buildinginpublic #Eben