AI Coding Agents Write Faster Than You Can Read. So Why Are You Still the Bottleneck?

A pull request landed in my queue last Tuesday at 9:14 a.m. The agent had written it in about ninety seconds: a refactor of a payment-retry path, four files touched, 230 lines changed, a commit message so clean and confident it read like it had been written by someone who'd never been wrong in their life. I approved it at 10:47.

That gap — ninety seconds to write, ninety-three minutes to trust — is the whole story. If you ship code with AI coding agents and you feel slower than the demos promised, you are not imagining it. The generation got cheap. The reading did not. And reading is now your job in a way it never quite was before.

So here is the question I want to actually answer, because I have asked it of myself on bad afternoons: how do I stop being the human rate-limiter on my own team's output, without becoming the person who rubber-stamps a bug into production?

The short version of the answer: you stop trusting the agent and start trusting a verification system you've measured. The long version is the rest of this piece, including the parts where the honest answer is "it depends," and I'll tell you which parts those are.

One disclosure up front, because you'd find it anyway: City of Punk builds a creative tool that competes in the AI tooling space. This article is about developer workflow, not our product, and I'd write every sentence of it the same way if we sold nothing. The day a review publication starts pulling punches to protect a sibling product is the day it becomes a brochure. Hold me to that.

Why agent code reads slower than human code

Start with the thing nobody says out loud: reviewing code an agent wrote is genuinely harder than reviewing code a colleague wrote, line for line. Not because the agent is dumber. Because of what's missing around the code.

When your teammate Dana sends you a PR, you are not reading 200 lines cold. You are reading 200 lines plus everything you know about Dana. You know she's careful with null states and sloppy with logging. You know she pinged you Monday saying she wasn't sure about the retry semantics, so you read the retry logic twice. You know the shape of her mistakes. Human review has always been a trust model built on top of a person, and the code is only part of the evidence.

The agent erases all of that and replaces it with nothing. There is no track record. There is no "I wasn't sure about this part." There is no body language in the diff. Every PR arrives as a stranger who is also an extremely fluent liar.

That fluency is the second problem. Agent code fails differently than human code. A junior dev writes code that's obviously wrong — it doesn't compile, the variable is named thign, the logic is tangled in a way that signals confusion. You can see the uncertainty. Agent code fails by being plausible-wrong: syntactically immaculate, idiomatic, well-named, and quietly incorrect in a way that looks exactly like correct. The off-by-one is in clean code. The race condition has a tidy comment above it explaining the wrong mental model with total conviction. You can't skim for smell anymore, because there is no smell. There's just polish over a hole.

Third: volume. One engineer with a capable agent can open more PRs in a day than they used to open in a week. The review queue doesn't scale with a clever prompt. It scales with human attention, which is the one input that did not get cheaper. So the team's throughput quietly recenters on whoever does the reviewing, and that person — maybe you — becomes the constraint that the whole "we move fast now" story crashes into.

None of this is an argument against agents. I use them every day and I'm not giving them back. It's an argument that the work moved. The interesting part of your job is no longer producing the candidate solution. It's deciding, efficiently and at scale, whether the candidate solution is true. That's a different skill, and most teams are trying to do it with the same tool they always used: one tired senior engineer reading diffs.

Verification is a budget, and you're overspending on the wrong layer

Here's the reframe that changed how I work. Reviewing isn't a single act. It's a stack of three layers, and they cost wildly different amounts. The mistake almost everyone makes is paying for everything with the most expensive layer — human attention — when the cheaper layers would have caught most of it before it ever reached a person.

Go cheapest first.

Layer one: deterministic gates

These are the checks that give the same answer every time, cost almost nothing per run, and never get tired at 4:30 on a Friday. Compilers. Type checkers. Linters. The test suite. And the underused ones: property-based tests and contract checks that assert invariants instead of specific cases.
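To make the difference between a specific-case test and an invariant concrete, here is a minimal sketch using only Python's standard library. The `PaymentLedger` class and its `charge` method are hypothetical stand-ins, not code from any real system. The property being pinned is that replaying a charge with the same idempotency key never moves money twice. A dedicated property-testing library would generate and shrink inputs more cleverly; the hand-rolled loop below only illustrates the shape.

```python
import random
from dataclasses import dataclass, field


@dataclass
class PaymentLedger:
    """Hypothetical in-memory ledger, used only for illustration."""
    charged: dict[str, int] = field(default_factory=dict)

    def charge(self, idempotency_key: str, amount_cents: int) -> int:
        # A replayed key returns the original charge instead of charging again.
        if idempotency_key not in self.charged:
            self.charged[idempotency_key] = amount_cents
        return self.charged[idempotency_key]

    def total_cents(self) -> int:
        return sum(self.charged.values())


def check_retries_never_double_charge(trials: int = 200, seed: int = 0) -> bool:
    """Property: for any sequence of charges with arbitrary replays,
    the ledger total equals the sum of one charge per distinct key."""
    rng = random.Random(seed)  # fixed seed keeps the gate deterministic
    for _ in range(trials):
        ledger = PaymentLedger()
        expected: dict[str, int] = {}
        for _ in range(rng.randint(1, 20)):
            key = f"key-{rng.randint(0, 5)}"  # small key space forces replays
            amount = expected.setdefault(key, rng.randint(1, 10_000))
            ledger.charge(key, amount)
        if ledger.total_cents() != sum(expected.values()):
            return False
    return True
```

A conventional unit test would check one key and one amount. The property version checks that no replay pattern the generator produces breaks the invariant, which is the kind of claim an agent's confident commit message cannot argue with.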

For agent-generated code specifically, deterministic gates are worth more than they were when humans wrote everything, because the agent's failure mode is plausible-wrong, and a machine doesn't care how plausible something looks. A type checker has no sense of how confident the commit message was. A test either passes or it doesn't.

The move here is to treat your gates as the thing the agent has to satisfy, not the thing you check afterward. Make the agent run the suite before it hands you anything. Make it write the property test that pins the invariant it's about to change. If you have a flaky test suite, fix that first, because flaky gates train both the agent and the human to ignore red — and a gate everyone ignores is worse than no gate, because it's pure theater that still costs CI minutes.
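One way to make "the agent satisfies the gates first" enforceable is a small runner that the agent's workflow, or CI, has to pass before a diff is shown to a person. This is a sketch under assumptions: the commands in `GATES` are placeholders for whatever type checker, linter, and test runner a given repository actually uses, and `run_gates` and `ready_for_human` are hypothetical helper names.

```python
import subprocess
import sys
from typing import NamedTuple, Sequence


class GateResult(NamedTuple):
    name: str
    passed: bool
    output: str


# Placeholder commands: swap in the repository's real tools.
# These stand-ins only invoke the current Python interpreter.
GATES: list[tuple[str, list[str]]] = [
    ("types", [sys.executable, "-c", "print('type check stand-in')"]),
    ("lint", [sys.executable, "-c", "print('lint stand-in')"]),
    ("tests", [sys.executable, "-c", "print('test suite stand-in')"]),
]


def run_gates(gates: Sequence[tuple[str, Sequence[str]]]) -> list[GateResult]:
    """Run each gate in order and stop at the first failure.

    A gate passes only when its command exits with status 0.
    """
    results: list[GateResult] = []
    for name, command in gates:
        proc = subprocess.run(command, capture_output=True, text=True)
        passed = proc.returncode == 0
        results.append(GateResult(name, passed, proc.stdout + proc.stderr))
        if not passed:
            break  # later gates are skipped; red means stop
    return results


def ready_for_human(results: list[GateResult], expected: int) -> bool:
    """A diff reaches a reviewer only if every gate ran and every gate passed."""
    return len(results) == expected and all(r.passed for r in results)
```

Checking the count as well as the pass flags is a deliberate choice: a run that stopped early must not look green just because nothing after the failure had a chance to go red.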

What deterministic gates can't do: they can't tell you whether the code does the right thing, only whether it does a consistent thing. They'll confirm the function returns a sorted list. They won't tell you the list shouldn't have been sorted at all. That's the next layer's problem.

Layer two: machine reviewers

This is the layer people get weird about, so let me be precise about what it is and isn't. You can point a second agent — ideally a different model, or at least a different prompt with an adversarial framing — at the diff and ask it to review. Not to rewrite. To find problems. "Here is a diff. Here is the ticket it claims to address. List every way this could be wrong, every edge case it ignores, every place the implementation diverges from the stated intent."

Done well, this catches a real slice of the plausible-wrong failures, because a model reading critically with a specific rubric behaves differently than a model generating optimistically. The reviewer isn't trying to be helpful and finish the task. It's trying to find holes, and it's fast and indefatigable in a way a human isn't.

Now the honest part, because this is where the hype gets loud and wrong. A machine reviewer has a structural conflict of interest the moment it shares failure modes with the generator. If the same model that confidently wrote the wrong race-condition comment is the one reviewing the race condition, it may well bless its own mistake, because it has the same blind spot. This is why I push for a different model on the review pass, and why I treat the machine reviewer as a filter that raises the floor, never as a gate that certifies the ceiling. It catches the obvious-once-pointed-out. It does not catch the thing that requires knowing your business, your users, and the incident you had in March.

Use it to make the human's job smaller, not to remove the human. A machine reviewer that turns a 200-line diff into "here are the four spots that actually need a person's eyes" has earned its keep. A machine reviewer you've configured to auto-approve has just built you a faster path to an outage.
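A minimal sketch of that review pass, with the model call injected as a plain function so no particular vendor API is implied. `ReviewFn`, `build_review_prompt`, `parse_findings`, and `triage` are hypothetical names, and the `<file>:<line>: <concern>` reply format is an assumption made for the example, not something any model guarantees.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical signature: takes a prompt, returns the reviewer's raw text.
ReviewFn = Callable[[str], str]


@dataclass(frozen=True)
class Finding:
    location: str
    concern: str


def build_review_prompt(diff: str, ticket: str) -> str:
    """Adversarial framing: ask for problems, never for a rewrite or a verdict."""
    return (
        "You are reviewing a diff written by someone else. Do not rewrite it.\n"
        "List every way it could be wrong, every edge case it ignores, and every\n"
        "place the implementation diverges from the stated intent.\n"
        "Reply with one finding per line as: <file>:<line>: <concern>\n\n"
        f"TICKET:\n{ticket}\n\nDIFF:\n{diff}\n"
    )


def parse_findings(raw: str) -> list[Finding]:
    """Keep only lines that match the assumed 'location: concern' shape."""
    findings = []
    for line in raw.splitlines():
        parts = line.strip().split(": ", 1)
        if len(parts) == 2 and parts[0] and parts[1]:
            findings.append(Finding(location=parts[0], concern=parts[1]))
    return findings


def triage(diff: str, ticket: str, generator_model: str,
           reviewer_model: str, review: ReviewFn) -> list[Finding]:
    """Return spots for a human to read. There is deliberately no approve path."""
    if reviewer_model == generator_model:
        raise ValueError(
            "reviewer shares the generator's blind spots; use a different model"
        )
    return parse_findings(review(build_review_prompt(diff, ticket)))
```

The function returns a list of places to look and nothing else. An empty list means the reviewer found nothing, which is weaker than "the change is correct", and the return type is meant to keep that distinction visible.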

Layer three: human judgment, rationed

The human is the most expensive, slowest, most valuable, and least scalable reviewer you have. So you spend that resource the way you'd spend a limited budget on anything: on the decisions where being wrong is costly and being right requires taste.

The skill that's becoming rare and valuable isn't reviewing everything. It's knowing what to read closely and what to wave through after the machine layers clear it. A logging-format change that's covered by tests and reviewed by a second model? Skim it, ship it. A change to how you compute account balances, or who can access what, or anything touching money, auth, or data deletion? That gets the full ninety minutes, and it should, and you should stop feeling guilty about the ninety minutes.

The point of layers one and two is to protect layer three's attention so it lands on the things that actually need it.

How I'd decide where to aim the expensive reviewer

"Read the important stuff closely" is true and useless without a way to decide what's important. Here's the rubric I actually run, roughly in priority order.

Blast radius. How many systems, users, or downstream services does this touch if it's wrong? A change to a shared library that forty services import is not the same as a change to one cron job. Wide blast radius buys a human read regardless of how clean the diff looks.

Reversibility. If this ships broken, how fast and how cheaply can you undo it? A feature flag you can flip in ten seconds is forgiving. A database migration that drops a column is not — there's no undo button on data you deleted. The less reversible the change, the more it earns human scrutiny up front, because the cost moved from "fix it later" to "you can't."

Novelty. Is this a pattern the codebase has seen a hundred times, or is the agent inventing a new architectural approach? Agents are strongest on well-trodden patterns and weakest on the genuinely novel, where there's less to imitate and more to get subtly wrong. New patterns get read by a human who can hold the whole design in their head. Routine CRUD does not.

Cost of being wrong. Combine the above into the real question: if this is subtly broken and ships, what does it cost? Money, trust, a 2 a.m. page, a compliance problem, a customer who quietly leaves? High cost of wrongness overrides everything else. Low cost of wrongness — an internal tool with three users who'll Slack you when it breaks — is a fine place to let the cheaper layers do the work and move on.

Notice none of these criteria are "how long is the diff" or "how confident does the agent seem." Length and confidence are exactly the signals that have stopped meaning anything. A two-line change to a permissions check deserves more of your attention than a 400-line generated test file.
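The rubric can be written down as a small function, which at least forces a team to argue about thresholds in the open. Everything here is an assumption for illustration: the field names, the 1 to 5 scales, and the cutoffs are hypothetical and would need calibrating against a team's own incidents.

```python
from dataclasses import dataclass
from enum import Enum


class Depth(Enum):
    SKIM = "skim after the machine layers clear it"
    TARGETED = "human reads the flagged spots"
    FULL = "full human read"


@dataclass(frozen=True)
class Change:
    # Each score runs from 1 (low) to 5 (high); the scales are illustrative.
    blast_radius: int     # how much breaks if this is wrong
    irreversibility: int  # 5 = cannot be undone, e.g. a dropped column
    novelty: int          # 5 = a pattern the codebase has never seen
    cost_of_wrong: int    # money, trust, pages, compliance
    # Deliberately absent: diff length and how confident the agent sounded.


def review_depth(change: Change) -> Depth:
    # High cost of being wrong overrides everything else.
    if change.cost_of_wrong >= 4:
        return Depth.FULL
    # Wide blast radius, no undo, or a new pattern buys a full read
    # regardless of how clean the diff looks.
    if max(change.blast_radius, change.irreversibility, change.novelty) >= 4:
        return Depth.FULL
    # Middling scores get a human on the flagged spots only.
    if max(change.blast_radius, change.irreversibility,
           change.novelty, change.cost_of_wrong) >= 3:
        return Depth.TARGETED
    return Depth.SKIM
```

Under these made-up cutoffs, a two-line permissions change scored as `Change(blast_radius=2, irreversibility=2, novelty=1, cost_of_wrong=5)` gets a full read, while a long generated test file scored `Change(1, 1, 1, 1)` gets skimmed.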

The places where the honest answer is "it depends"

I told you I'd flag where this framework gets thin. Here's where.

Security and adversarial correctness. Deterministic gates and machine reviewers both struggle with code that's wrong only against an attacker, because the failure isn't visible in the normal path. A SQL injection or an auth bypass can pass the tests, pass the linter, and slip past a model reviewer that's reasoning about functionality rather than threat models. I don't have a clean automated answer here. Security review still wants a human who thinks like an adversary, and possibly tooling specific to that domain. If your team ships anything sensitive, do not let the three-layer story lull you into thinking the machine layers cover this. They mostly don't.

Genuinely novel architecture. When the agent proposes a new way to structure something — a new caching strategy, a new concurrency model — the machine reviewer is weakest exactly where the risk is highest, because there's no established pattern for it to measure against. These decisions are where senior human judgment is least replaceable, and trying to automate the review of them is how you end up with an elegant, well-tested, deeply wrong foundation.

The grader-grading-the-generator trap. I keep returning to this because it's the subtle one. The more you lean on a model to verify a model's work, the more you need to know they don't share blind spots — and that's genuinely hard to know. You can mitigate with different models and adversarial prompts. You cannot eliminate it. Treat every claim that "the AI checks the AI's work so it's fine" with the suspicion it deserves.

Measuring whether any of this works. The discipline that makes the whole thing real is building a small set of evaluations: a held-out collection of past PRs where you already know the right verdict, run through your verification stack to see what it catches and what it waves through. This practice is early, honestly. The tooling for evaluating your own review pipeline is less mature than the tooling for generating code, and most teams are flying without instruments. I'd rather tell you that than pretend there's a tidy dashboard. Build the eval set anyway, even a crude one of twenty PRs, because a verification system you haven't measured is a feeling, not a system.
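A crude version of that measurement fits in a few lines. The sketch below assumes each past PR has been labeled by hand with a known verdict, and that the verification stack can be wrapped as a function returning whether it flagged the PR for a human. `LabeledPR`, `EvalReport`, `evaluate_stack`, and the field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class LabeledPR:
    pr_id: str
    diff: str
    known_bad: bool  # verdict established after the fact, by a human


@dataclass(frozen=True)
class EvalReport:
    caught: int        # known-bad PRs the stack flagged
    missed: int        # known-bad PRs the stack waved through
    false_alarms: int  # known-good PRs the stack flagged
    cleared: int       # known-good PRs the stack waved through

    @property
    def catch_rate(self) -> float:
        bad = self.caught + self.missed
        return self.caught / bad if bad else float("nan")

    @property
    def false_alarm_rate(self) -> float:
        good = self.false_alarms + self.cleared
        return self.false_alarms / good if good else float("nan")


def evaluate_stack(prs: Iterable[LabeledPR],
                   stack_flags: Callable[[str], bool]) -> EvalReport:
    """Run every labeled PR through the stack and tally the four outcomes."""
    caught = missed = false_alarms = cleared = 0
    for pr in prs:
        flagged = stack_flags(pr.diff)
        if pr.known_bad and flagged:
            caught += 1
        elif pr.known_bad:
            missed += 1
        elif flagged:
            false_alarms += 1
        else:
            cleared += 1
    return EvalReport(caught, missed, false_alarms, cleared)
```

With only twenty PRs the rates are noisy, so the list of missed PRs is usually more informative than the percentage. The number still matters as a baseline: it is what tells you whether a rubric change helped or only felt like it did.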

What to do this week

Not a manifesto. Four concrete moves, in order of how much they'll buy you per hour spent.

1. Audit where your review time actually goes. For one week, note which PRs ate your attention and which were trivial. You'll find most of your time went to a small fraction of changes, and that some of the "important" reads were routine code you read out of anxiety, not need. That data tells you where the cheaper layers should take over.
2. Make the agent satisfy your gates before you see anything. No diff reaches your eyes until the suite is green and the types check. If that's not enforceable today, that's your first infrastructure task, and it's worth more than any prompt tweak.
3. Stand up one machine-review pass with a different model and an adversarial prompt. Have it summarize the diff into "the three spots that need human eyes." Measure whether those spots match what you'd have flagged. If they don't, tune the rubric, not your trust.
4. Build a twenty-PR eval set. Past PRs, known verdicts, run through your stack. It will be embarrassingly revealing. That's the point.
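The review-time audit needs nothing more than a week of rough notes. Here is a small sketch that answers the question the audit is after: what share of PRs accounts for most of the minutes. The log format, the PR names, and the `share_of_prs_for` helper are assumptions for illustration.

```python
import math


def share_of_prs_for(minutes_by_pr: dict[str, float],
                     time_share: float = 0.8) -> float:
    """Smallest fraction of PRs, taken from most to least time-consuming,
    whose review minutes add up to at least `time_share` of the total."""
    if not minutes_by_pr:
        raise ValueError("no review log entries")
    if not 0 < time_share <= 1:
        raise ValueError("time_share must be in (0, 1]")
    target = sum(minutes_by_pr.values()) * time_share
    running = 0.0
    ranked = sorted(minutes_by_pr.values(), reverse=True)
    for count, minutes in enumerate(ranked, start=1):
        running += minutes
        # isclose guards against float rounding right at the threshold.
        if running >= target or math.isclose(running, target):
            return count / len(minutes_by_pr)
    return 1.0


# Hypothetical week of notes: PR label -> minutes spent reviewing.
week = {
    "retry-refactor": 93,
    "log-format": 4,
    "readme": 2,
    "dep-bump": 3,
    "test-gen": 8,
}
print(share_of_prs_for(week))  # 0.2: one PR in five took over 80% of the time
```

If the result is small, most of the week's attention went to a few changes, and the rest of the queue is where the cheaper layers can take over.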

The career framing, briefly, because it's real: the engineer who owns verification — who can say with evidence "here's what we check automatically, here's what gets human eyes and why, here's our catch rate" — is becoming more valuable than the engineer who's fastest at producing code. Generation is a commodity now. Judgment about correctness, at scale, is not.

Who this is for, and who should skip it

This is for you if you're already shipping with agents and feeling the slowdown: the engineering manager watching review become the bottleneck, or the senior dev who's quietly become the team's human rate-limiter and resents it. The framework is built for exactly that frustration.

Skip most of this if you're on a tiny team where you write all the code yourself and review is a non-issue — the overhead of three layers won't pay off until volume forces it. And skip the "ration the human" advice entirely if you work in a domain where everything is high blast radius and low reversibility: medical devices, aerospace, anything where the cost of wrong is a person. There, the human reads everything, and that's not inefficiency, that's the job.

For everyone in the messy middle — which is most of us — the clear answer is: stop spending human attention as your first line of defense. Make it your last and most precise one.

One thing from my own week

Here's what this actually looks like lived out, not as advice but as evidence. That payment-retry PR I approved at 10:47 last Tuesday? I didn't spend ninety-three minutes reading all 230 lines. I spent eighty of them on the eleven lines that touched the retry budget — money path, low reversibility, high cost of wrong — and I trusted the gates and a second model's review for the rest. I found one bug in those eleven lines. It would have double-charged on a specific timeout. The agent's commit message said, with total confidence, that the change was idempotent. It was not. I'm keeping that diff pinned in a folder of examples for my eval set, because a verification system is only as honest as the times it caught the thing that looked exactly like nothing.