We Compared 5 AI Coding Assistants on Real Work. The Rankings Surprised Us.

Every AI coding assistant tops some benchmark. Every launch post shows the demo where it one-shots a web app. None of that tells you what the tool is like on day nine of a real project, when the codebase is messy, the requirements are political, and the bug only reproduces on Tuesdays.

So we did the boring version of a comparison: five leading assistants, two weeks, the same real backlog of work — feature tickets, bug reports, a gnarly refactor, and code review duty. We're not naming the tools (the rankings would be stale in a month anyway); we're naming the patterns, because those are durable. Here's what two weeks of shipped work taught us.

What we measured

Four kinds of real work, scored on outcomes, not vibes:

Feature tickets — small-to-medium features in an existing codebase with established patterns
Bug hunts — real bugs, including two we didn't know the cause of ourselves
A refactor — splitting a tangled module without changing behavior
Code review — finding planted and unplanted issues in pull requests

Finding 1: Codebase context beats model intelligence

The single biggest separator wasn't raw model quality — it was how well the tool ingested and used the existing codebase. The assistants that automatically studied our conventions wrote code that looked like ours: same error-handling idioms, same test structure, same naming. The ones that didn't produced technically correct code that read like a stranger's, and every diff cost extra review time to normalize.

Over two weeks, "writes code that matches the codebase" saved more total hours than "writes slightly smarter code." It's not close.

Finding 2: Bug hunting separates the tiers brutally

On feature work, all five tools clustered surprisingly close. On debugging, they split into tiers instantly.

The top tier treated debugging as an investigation: formed hypotheses, asked to run commands, read logs, narrowed the search, and — critically — said "I need to see X before I can say." The bottom tier pattern-matched the symptom to a generic cause and confidently patched the wrong thing. One assistant "fixed" the same bug three times, three different ways, none of them the bug.

If you only run one evaluation before choosing a tool, make it a debugging session on your own codebase. It's the most honest signal available.

Finding 3: The refactor revealed who can hold a plan

The refactor required touching fourteen files in a coherent order without breaking tests in between. This is where long-horizon discipline showed. The best performers worked incrementally — small steps, tests green after each — and kept a running account of what remained. The weakest made a heroic all-at-once attempt, broke the build, and then struggled to dig out of their own hole, each fix creating the next problem.

Ask not "can it write code?" but "can it stay oriented across an hour of dependent steps?" That capability gap is wider than any benchmark suggests.
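The incremental pattern the best performers followed can be sketched as a small loop. This is a minimal illustration, not the harness we used: `run_incrementally`, the step names, and the `tests_pass` callback are all hypothetical. In a real repo, `tests_pass` would presumably wrap whatever test command your project already runs.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], None]]


def run_incrementally(
    steps: List[Step],
    tests_pass: Callable[[], bool],
) -> Tuple[List[str], List[str]]:
    """Apply refactor steps one at a time, checking tests after each.

    Returns (completed, remaining). The loop stops at the first step
    that leaves the tests red, so the failure is isolated to a single
    small change instead of an all-at-once rewrite.
    """
    completed: List[str] = []
    names = [name for name, _ in steps]
    for index, (name, apply_step) in enumerate(steps):
        apply_step()
        if not tests_pass():
            # The failing step is reported as the first "remaining" item.
            return completed, names[index:]
        completed.append(name)
    return completed, []
```

The second return value is the "running account of what remained": at any point you know which steps are done and which are still pending, which is the orientation the weakest tools lost.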

Finding 4: Review is the sleeper feature

Code review was the task we expected the least from and got the most out of. Every tool caught the planted bugs at decent rates, but the better ones also flagged things we hadn't planted: a subtle race condition and an off-by-one in pagination that had survived two human reviews.

The economics here are absurd in a good way: review takes the tool minutes, runs on every PR, and costs only one dismissed comment per false positive. If your team adopts AI for nothing else, adopt it as a second reviewer.

Finding 5: Speed is a trap metric

The assistant that produced first drafts quickest finished the two weeks dead last in shipped work, because its drafts generated the most rework. The assistant that "felt slow" (it asked clarifying questions and read more files before writing) required the fewest revision cycles and quietly won the overall ranking.

Measure time-to-merged, not time-to-first-token. They rank the tools in nearly opposite order.
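Here is one way the two metrics could be computed side by side. It is a sketch under assumptions: the `Ticket` fields and the tool labels are hypothetical, and the numbers in the usage note are invented placeholders, not measurements from our comparison.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Ticket:
    tool: str
    draft_minutes: float   # time until the first draft appeared
    review_minutes: float  # human review time across all rounds
    rework_minutes: float  # time spent revising after the first draft

    @property
    def time_to_merged(self) -> float:
        # Everything between picking up the ticket and merging it.
        return self.draft_minutes + self.review_minutes + self.rework_minutes


def mean_by_tool(
    tickets: List[Ticket], metric: Callable[[Ticket], float]
) -> Dict[str, float]:
    """Average a per-ticket metric for each tool."""
    grouped: Dict[str, List[float]] = {}
    for ticket in tickets:
        grouped.setdefault(ticket.tool, []).append(metric(ticket))
    return {tool: sum(values) / len(values) for tool, values in grouped.items()}


def rank(scores: Dict[str, float]) -> List[str]:
    """Order tools from lowest (best) to highest average time."""
    return sorted(scores, key=lambda tool: scores[tool])


# Placeholder data for illustration only.
tickets = [
    Ticket("fast", draft_minutes=5, review_minutes=40, rework_minutes=60),
    Ticket("fast", draft_minutes=4, review_minutes=35, rework_minutes=50),
    Ticket("careful", draft_minutes=20, review_minutes=15, rework_minutes=10),
    Ticket("careful", draft_minutes=25, review_minutes=10, rework_minutes=5),
]

by_draft = rank(mean_by_tool(tickets, lambda t: t.draft_minutes))
by_merged = rank(mean_by_tool(tickets, lambda t: t.time_to_merged))
```

With these placeholder numbers, `by_draft` puts "fast" first and `by_merged` puts "careful" first, which is the inversion described above. The point of the design is that review and rework time sit in the same sum as drafting time, so a quick draft cannot hide the cost it creates downstream.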

The takeaway

After two weeks and a fully shipped backlog, the durable advice isn't a product name. It's a selection method:

Test on your codebase, never a toy repo
Weight debugging and refactoring over greenfield demos
Measure time-to-merged, including your review burden
Turn on AI code review regardless of which assistant you pick — it's the best ROI in the category
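If you want to turn that checklist into a single comparable number, a weighted scorecard is one option. This is a rough sketch: the function name, the 0-to-1 score scale, and the specific weights are our assumptions for illustration. The only thing the findings support is that debugging and refactoring should count for more than feature work.

```python
from typing import Dict

# Hypothetical weights: debugging and refactoring count double.
WEIGHTS: Dict[str, float] = {
    "feature": 1.0,
    "debugging": 2.0,
    "refactor": 2.0,
    "review": 1.0,
}


def weighted_score(
    scores: Dict[str, float], weights: Dict[str, float] = WEIGHTS
) -> float:
    """Weighted mean of per-task scores, each on a 0-to-1 scale."""
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    total_weight = sum(weights.values())
    return sum(scores[task] * weights[task] for task in weights) / total_weight


# Invented example scores, not results from our comparison.
demo_star = {"feature": 0.9, "debugging": 0.4, "refactor": 0.5, "review": 0.7}
steady = {"feature": 0.7, "debugging": 0.8, "refactor": 0.8, "review": 0.7}
```

Under these weights, the tool that shines on feature demos but stumbles on debugging scores lower overall than the steadier one. Adjust the weights to match how your own backlog is actually split.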

The tools will keep leapfrogging each other monthly. The evaluation method is what's worth keeping.
