We Compared 5 AI Coding Assistants on Real Work. The Rankings Surprised Us.
Benchmarks lie. So we ran five leading AI coding assistants through two weeks of actual development work — features, bug hunts, refactors, and reviews — and ranked what matters: shipped code.
By The Daily Query · 3 min read
Every AI coding assistant tops some benchmark. Every launch post shows the demo where it one-shots a web app. None of that tells you what the tool is like on day nine of a real project, when the codebase is messy, the requirements are political, and the bug only reproduces on Tuesdays.
So we did the boring version of a comparison: five leading assistants, two weeks, the same real backlog of work — feature tickets, bug reports, a gnarly refactor, and code review duty. We're not naming the tools (the rankings would be stale in a month anyway); we're naming the patterns, because those are durable. Here's what two weeks of shipped work taught us.
What we measured
Four kinds of real work, scored on outcomes, not vibes:
- Feature tickets — small-to-medium features in an existing codebase with established patterns
- Bug hunts — real bugs, including two we didn't know the cause of ourselves
- A refactor — splitting a tangled module without changing behavior
- Code review — finding planted and unplanted issues in pull requests
Finding 1: Codebase context beats model intelligence
The single biggest separator wasn't raw model quality — it was how well the tool ingested and used the existing codebase. The assistants that automatically studied our conventions wrote code that looked like ours: same error-handling idioms, same test structure, same naming. The ones that didn't produced technically correct code that read like a stranger's, and every diff cost extra review time to normalize.
Over two weeks, "writes code that matches the codebase" saved more total hours than "writes slightly smarter code." It's not close.
Finding 2: Bug hunting separates the tiers brutally
On feature work, all five tools clustered surprisingly close. On debugging, they split into tiers instantly.
The top tier treated debugging as an investigation: they formed hypotheses, asked to run commands, read logs, narrowed the search, and, critically, said "I need to see X before I can say." The bottom tier pattern-matched the symptom to a generic cause and confidently patched the wrong thing. One assistant "fixed" the same bug three times, three different ways, and none of the patches touched the actual cause.
If you only run one evaluation before choosing a tool, make it a debugging session on your own codebase. It's the most honest signal available.
Finding 3: The refactor revealed who can hold a plan
The refactor required touching fourteen files in a coherent order without breaking tests in between. This is where long-horizon discipline showed. The best performers worked incrementally — small steps, tests green after each — and kept a running account of what remained. The weakest made a heroic all-at-once attempt, broke the build, and then struggled to dig out of their own hole, each fix creating the next problem.
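The incremental pattern is easy to sketch. What follows is a minimal illustration, not the harness we ran: `refactor_incrementally`, `make_step`, and the step names are hypothetical, and the in-memory `files` list stands in for a real working tree. In practice `apply` would be an edit plus commit, `undo` a revert, and `tests_pass` a call to your test runner.

```python
from typing import Callable, List, Tuple

# A step is (name, apply, undo). All three are hypothetical stand-ins.
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def refactor_incrementally(
    steps: List[Step], tests_pass: Callable[[], bool]
) -> Tuple[List[str], List[str]]:
    """Apply steps one at a time, checking the tests after each.

    Returns (done, remaining): names of steps that landed, and names
    still to do. Stops at the first step that breaks the tests, after
    undoing it, so the tree is always left in a green state.
    """
    done: List[str] = []
    for index, (name, apply, undo) in enumerate(steps):
        apply()
        if not tests_pass():
            undo()  # roll back to the last green state
            remaining = [step_name for step_name, _, _ in steps[index:]]
            return done, remaining
        done.append(name)
    return done, []


# Toy stand-in for a codebase: each applied step is recorded here,
# along with a flag saying whether it breaks the build.
files: List[Tuple[str, bool]] = []


def make_step(name: str, breaks: bool = False) -> Step:
    def apply() -> None:
        files.append((name, breaks))

    def undo() -> None:
        files.pop()

    return name, apply, undo


def tests_pass() -> bool:
    return not any(broken for _, broken in files)


steps = [
    make_step("extract parser"),
    make_step("move validators"),
    make_step("inline legacy shim", breaks=True),
    make_step("delete old module"),
]
done, remaining = refactor_incrementally(steps, tests_pass)
print("landed:", done)
print("still to do:", remaining)
```

The useful part is the return value: a list of what landed and what remains is the "running account" in code form, and the rollback means a bad step costs one undo instead of a broken build.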
Ask not "can it write code?" but "can it stay oriented across an hour of dependent steps?" That capability gap is wider than any benchmark suggests.
Finding 4: Review is the sleeper feature
Code review was the task we expected the least from and got the most out of. Every tool caught the planted bugs at decent rates, but the better ones also flagged things we hadn't planted: a subtle race condition and an off-by-one in pagination that had survived two human reviews.
The economics here are absurd in a good way: review takes the tool minutes, it runs on every PR, and a false positive costs one dismissed comment. If your team adopts AI for nothing else, adopt it as a second reviewer.
Finding 5: Speed is a trap metric
The fastest assistant produced first drafts quickest — and finished the two weeks dead last in shipped work, because its drafts generated the most rework. The assistant that "felt slow" — asking clarifying questions, reading more files before writing — required the fewest revision cycles and quietly won the overall ranking.
Measure time-to-merged, not time-to-first-token. They rank the tools in nearly opposite order.
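If you want to track this yourself, here is one way to do it, sketched under assumptions. The `TicketRun` fields, the `rank_by` helper, and every number below are hypothetical and chosen only to show the mechanics; they are not measurements from this comparison. Time-to-merged is modeled as draft time plus rework plus human review time.

```python
from dataclasses import dataclass
from statistics import median
from typing import Callable, Dict, List


@dataclass
class TicketRun:
    tool: str                    # anonymized label
    first_draft_minutes: float   # prompt to first complete draft
    revision_cycles: int         # rounds of rework before merge
    minutes_per_revision: float  # average cost of one rework round
    review_minutes: float        # human review time across all rounds

    @property
    def time_to_merged(self) -> float:
        rework = self.revision_cycles * self.minutes_per_revision
        return self.first_draft_minutes + rework + self.review_minutes


def rank_by(runs: List[TicketRun], key: Callable[[TicketRun], float]) -> List[str]:
    """Order tool labels from best (lowest median) to worst."""
    by_tool: Dict[str, List[float]] = {}
    for run in runs:
        by_tool.setdefault(run.tool, []).append(key(run))
    return sorted(by_tool, key=lambda tool: median(by_tool[tool]))


# Illustrative numbers only.
runs = [
    TicketRun("fast", 4, 5, 20, 45),
    TicketRun("fast", 6, 4, 25, 40),
    TicketRun("careful", 15, 1, 20, 15),
    TicketRun("careful", 20, 1, 15, 20),
]

by_draft = rank_by(runs, lambda run: run.first_draft_minutes)
by_merge = rank_by(runs, lambda run: run.time_to_merged)
print("by first draft:", by_draft)
print("by time-to-merged:", by_merge)
```

With these made-up inputs the two orderings flip, which is the pattern to look for in your own logs. Medians keep one disastrous ticket from dominating the ranking.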
The takeaway
After two weeks and a fully shipped backlog, the durable advice isn't a product name. It's a selection method:
- Test on your codebase, never a toy repo
- Weight debugging and refactoring over greenfield demos
- Measure time-to-merged, including your review burden
- Turn on AI code review regardless of which assistant you pick — it's the best ROI in the category
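The checklist above can be turned into a rough scorecard. This is a sketch, not the rubric we used: the `WEIGHTS` values, the `weighted_score` helper, and the sample scores are all hypothetical. The only thing taken from our findings is the direction of the weighting, with debugging and refactoring counting for more than feature work.

```python
from typing import Dict

# Hypothetical weights; tune them to your own backlog. They sum to 1.
WEIGHTS: Dict[str, float] = {
    "debugging": 0.35,
    "refactor": 0.30,
    "review": 0.20,
    "feature": 0.15,
}


def weighted_score(scores: Dict[str, float], weights: Dict[str, float] = WEIGHTS) -> float:
    """Combine per-task scores (each 0.0 to 1.0) into one weighted number."""
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    return sum(weights[task] * scores[task] for task in weights)


# Illustrative scores from trials on your own codebase, not a toy repo.
demo_star = {"feature": 0.9, "debugging": 0.4, "refactor": 0.5, "review": 0.7}
investigator = {"feature": 0.8, "debugging": 0.8, "refactor": 0.8, "review": 0.8}

for label, scores in [("demo_star", demo_star), ("investigator", investigator)]:
    print(label, round(weighted_score(scores), 3))
```

A tool that shines on features but stumbles on debugging scores lower here than a consistent one, which is the bias the checklist argues for.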
The tools will keep leapfrogging each other monthly. The evaluation method is what's worth keeping.