Why Your Team Should Code Review AI-Generated PRs Differently

The first time a team I work with starts shipping meaningful AI-generated code, the bugs they catch in code review are different from the bugs they were catching the month before. Not worse, not more frequent — different. The human-authored code had human-shaped mistakes. The AI-authored code has AI-shaped mistakes. The review practices the team built for humans catch fewer of them, because they were built for different failure modes.

This article is the short list of AI-specific review heuristics I teach teams once the AI-generated share of their codebase gets above 20%. These are the things I specifically look for when I am reviewing a PR that an AI tool (Claude Code, Cursor agent, Copilot agent mode, a background agent) produced. If your team is already shipping AI-authored code at volume, this is the update to your review checklist.

The premise: AI writes different mistakes

Human engineers make human mistakes. They forget the edge case they did not think of. They write code that matches their mental model, not the actual system. They take shortcuts when they are tired. The code review practice most teams have developed over the years is optimized for these failure modes.

AI makes different mistakes. It tends not to forget edge cases it was told about. It does not get tired. It has no mental model of "the actual system" at all — it only has what was in its context window at the moment of generation. The mistakes it makes flow from those differences:

It confidently produces code that calls functions that do not exist.
It matches the surface pattern of what was in context, but misses the semantic meaning.
It handles the case it saw in the example, not the case that actually matters.
It fixes the symptom the user reported, not the underlying bug.
It writes tests that pass but do not actually test the claimed behavior.
It duplicates logic that already exists elsewhere in the codebase because it did not see the existing version.

The review practices that catch these are different from the ones that catch human mistakes.

The AI-specific review checklist

Ten things I look for in every AI-authored PR, in rough order of how often they fire:

1. Does this function actually exist?

AI sometimes calls functions, imports modules, or uses APIs that look plausible but do not exist in the codebase or the library. The symptom is a PR that reads well but fails at runtime. The fix is to check every import and function call against the real API. A linter or type checker will catch some of this; careful code review catches the rest.
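
For a Python codebase, part of this check can be automated. The following is a minimal sketch, not a replacement for a type checker: it parses a file, tries each import, and reports the names that do not resolve. The function name find_unresolved_imports is my own invention for illustration, and the sketch assumes it runs in an environment where the project's dependencies are installed.

```python
import ast
import importlib


def find_unresolved_imports(source: str) -> list[str]:
    """Return dotted names from import statements that cannot be resolved.

    Hypothetical helper for illustration. Note that it really imports the
    modules it checks, so run it only where importing them is safe.
    """
    unresolved = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                try:
                    importlib.import_module(alias.name)
                except ImportError:
                    unresolved.append(alias.name)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            try:
                module = importlib.import_module(node.module)
            except ImportError:
                unresolved.append(node.module)
                continue
            for alias in node.names:
                if alias.name == "*" or hasattr(module, alias.name):
                    continue
                # The name may be a submodule that is not loaded yet.
                try:
                    importlib.import_module(f"{node.module}.{alias.name}")
                except ImportError:
                    unresolved.append(f"{node.module}.{alias.name}")
    return sorted(unresolved)


if __name__ == "__main__":
    pr_source = "import json\nfrom os.path import join, joinpath\n"
    print(find_unresolved_imports(pr_source))
```

This only covers imports. Calls to methods that do not exist on a real object still need a type checker or a human reading the library documentation.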

2. Does this duplicate logic we already have?

AI has a strong tendency to reinvent things. It saw the task, wrote a plausible implementation, and had no way to know that you already have a utility for exactly this. In the review, scan the PR for any function that looks like a general-purpose utility, then grep the codebase to confirm it does not already exist. If it does, the PR needs to use the existing version.
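
Grep works when you can guess the name. When you cannot, a crude structural comparison sometimes helps. This sketch, assuming a Python codebase, fingerprints function bodies and reports new functions whose body matches an existing one exactly. The helper names are hypothetical, and it only catches copies that use the same variable names, so treat it as a first pass rather than a duplicate detector.

```python
import ast
import hashlib


def body_fingerprint(func: ast.FunctionDef) -> str:
    """Hash a function body, ignoring the function's own name and line numbers."""
    dumped = "".join(ast.dump(stmt) for stmt in func.body)
    return hashlib.sha256(dumped.encode()).hexdigest()


def find_duplicates(new_source: str, existing_source: str) -> list[tuple[str, str]]:
    """Return (new_name, existing_name) pairs whose bodies are structurally identical."""
    existing = {}
    for node in ast.walk(ast.parse(existing_source)):
        if isinstance(node, ast.FunctionDef):
            existing[body_fingerprint(node)] = node.name

    pairs = []
    for node in ast.walk(ast.parse(new_source)):
        if isinstance(node, ast.FunctionDef):
            match = existing.get(body_fingerprint(node))
            if match is not None:
                pairs.append((node.name, match))
    return pairs


if __name__ == "__main__":
    existing_code = "def slugify(text):\n    return text.strip().lower().replace(' ', '-')\n"
    pr_code = "def make_slug(text):\n    return text.strip().lower().replace(' ', '-')\n"
    print(find_duplicates(pr_code, existing_code))
```

A renamed parameter or an added docstring defeats the fingerprint, which is why the manual scan in review still matters.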

3. Is the fix addressing the symptom or the cause?

AI is better at fixing symptoms than at fixing causes. If the bug was "the dashboard crashes when a user has no email," the AI's fix is often "add a check for missing email at the crash site." The right fix might be "ensure users cannot get into this state" higher up in the stack. In review, always ask: is this the fix, or is it the workaround?
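
To make the distinction concrete, here is a small sketch of the missing-email example. The User type and both functions are hypothetical; whether the boundary check is the right fix depends on whether your system really requires an email.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class User:
    name: str
    email: Optional[str]


def render_contact(user: User) -> str:
    """Symptom-level fix: guard at the crash site so the dashboard stops crashing.

    The invalid state (a user with no email) still exists everywhere else.
    """
    if user.email is None:
        return f"{user.name} <unknown>"
    return f"{user.name} <{user.email.lower()}>"


def create_user(name: str, email: Optional[str]) -> User:
    """Cause-level fix: refuse to create a user in the invalid state at all.

    Assumes the product rule is that every user must have an email.
    """
    if not email:
        raise ValueError("email is required")
    return User(name=name, email=email)


if __name__ == "__main__":
    print(render_contact(User("Ada", None)))
    print(render_contact(create_user("Ada", "Ada@Example.com")))
```

Both changes can be legitimate in the same PR. The review question is whether the author considered the second one or only shipped the first.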

4. Do the tests actually test what they claim to test?

AI loves to write tests. The tests frequently pass. They are also frequently not testing the thing the PR is supposed to do: they test a reworded version of the thing, or they mock the key behavior away, or they assert on an intermediate state instead of the final one. Read every test critically. Ask: "if the underlying code were broken, would this test catch it?"
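
One way to answer that question is to run the test against a deliberately broken implementation. The sketch below uses hypothetical names throughout: a weak test that mocks away the code under test and a strong test that asserts on the real result. The weak one passes even when the discount logic is broken.

```python
from unittest import mock


def apply_discount(price: float, percent: float) -> float:
    """Correct implementation: reduce the price by the given percentage."""
    return round(price * (1 - percent / 100), 2)


def broken_apply_discount(price: float, percent: float) -> float:
    """Deliberately broken: ignores the discount entirely."""
    return price


def checkout_total(prices, percent, discount_fn) -> float:
    return sum(discount_fn(p, percent) for p in prices)


def weak_test(discount_fn) -> bool:
    """Replaces the discount function with a stub, so discount_fn is never called."""
    stub = mock.Mock(return_value=90.0)
    return checkout_total([100.0], 10, stub) == 90.0


def strong_test(discount_fn) -> bool:
    """Exercises the real function and asserts on the final value."""
    return checkout_total([100.0], 10, discount_fn) == 90.0


if __name__ == "__main__":
    for name, test in (("weak", weak_test), ("strong", strong_test)):
        print(name, "passes on correct code:", test(apply_discount))
        print(name, "passes on broken code:", test(broken_apply_discount))
```

A test that passes on the broken code is not protecting anything, however green the CI run looks.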

5. Are the error cases handled with real intent?

AI often adds try/catch blocks that swallow errors silently or log them without surfacing them. Human engineers do this too, but they usually have a reason. AI does it reflexively because it pattern-matches "errors should be handled" without thinking about what handling means. Every new catch block deserves a question: what should actually happen when this fires?
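
The same pattern shows up in Python as try/except. This sketch contrasts the reflexive version with one that answers the question. The function names and the ConfigError type are hypothetical, and raising is only one reasonable answer; a documented default can be another.

```python
import json


class ConfigError(Exception):
    """Raised when settings cannot be loaded (hypothetical error type)."""


def load_settings_swallowing(raw: str) -> dict:
    """Reflexive handling: any failure silently becomes an empty config."""
    try:
        return json.loads(raw)
    except Exception:
        return {}


def load_settings_with_intent(raw: str) -> dict:
    """Intentional handling: catch the one expected failure and surface it."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError as exc:
        # Decision made explicitly: a malformed config should stop startup.
        raise ConfigError(f"settings are not valid JSON: {exc.msg}") from exc


if __name__ == "__main__":
    print(load_settings_swallowing("{not json"))
    try:
        load_settings_with_intent("{not json")
    except ConfigError as err:
        print("surfaced:", err)
```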

6. Does the PR stop at the right boundary?

AI often does either too little (one surface-level change) or too much (refactoring three unrelated files because they were in context). Scope creep in AI PRs is common, and reviewers should push back on it. A PR should do one thing. If the PR touches five files and only one is necessary, the other four are a review problem.
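
A quick way to surface this is to compare the files the PR changed against the paths the task was supposed to touch. This is a sketch under assumptions: the list of changed files might come from something like git diff --name-only, and the expected patterns are whatever the author states in the PR description. The function name is hypothetical.

```python
from fnmatch import fnmatchcase


def out_of_scope(changed_files: list[str], expected_patterns: list[str]) -> list[str]:
    """Return changed files that match none of the expected glob patterns."""
    return [
        path
        for path in changed_files
        if not any(fnmatchcase(path, pattern) for pattern in expected_patterns)
    ]


if __name__ == "__main__":
    changed = ["src/billing/invoice.py", "src/auth/session.py", "README.md"]
    expected = ["src/billing/*", "tests/billing/*"]
    # Files listed here need a justification in review.
    print(out_of_scope(changed, expected))
```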

7. Are the imports and dependencies right?

AI sometimes adds dependencies that the project does not need, or uses existing dependencies in ways that conflict with how the rest of the project uses them. Review the imports. If the PR added a new package to package.json, ask whether it was necessary.
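
New packages are easy to miss in a large diff. A minimal sketch of a check that lists them, assuming you have the package.json contents from the base branch and from the PR branch as strings (the function name is hypothetical):

```python
import json


def added_dependencies(base_json: str, head_json: str) -> dict[str, list[str]]:
    """Return package names present in the PR's package.json but not in the base."""
    base = json.loads(base_json)
    head = json.loads(head_json)
    added = {}
    for section in ("dependencies", "devDependencies"):
        before = set(base.get(section, {}))
        after = set(head.get(section, {}))
        new_names = sorted(after - before)
        if new_names:
            added[section] = new_names
    return added


if __name__ == "__main__":
    base = '{"dependencies": {"react": "^18.2.0"}}'
    head = '{"dependencies": {"react": "^18.2.0", "lodash": "^4.17.21"}}'
    print(added_dependencies(base, head))
```

The script only tells you that a package was added. Whether it was necessary is still the reviewer's call.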

8. Does the change make sense in context?

Read the changed files in context, not just the diff. AI generates diffs that look correct in isolation and break assumptions held elsewhere in the file or the module. The reviewer needs to see the whole file, not just the patch, to catch this.

9. Is the output format what it should be?

For changes that affect output (API responses, log messages, error messages, UI text), AI often produces outputs that are almost right but differ subtly from the existing style. Tone, pluralization, capitalization, field names. These differences are small individually and wreck consistency collectively.

10. Did the AI cite its sources?

This is the most useful meta-heuristic. Good AI tools produce PRs that show you what they read, what they considered, and why they made the choice they did. If the PR arrived with no explanation — just a diff — that is a red flag. Ask the AI (or the developer who drove it) to explain the decisions. If the explanation does not hold up, the PR does not hold up.

The team practice shift

Beyond the specific checklist items, there is a broader shift in how I ask teams to think about review.

The PR author is responsible, even if the AI wrote the code. The engineer who submitted the PR is the person on the hook for its correctness. They cannot blame the AI. "The AI did it" does not resolve a review comment. The author reads the AI's output, understands it, validates it, and stands behind it.

Review time does not shrink proportionally to generation time. AI can write a 500-line PR in 90 seconds. The review still takes 20 minutes. Teams that assume review time scales down with generation time end up rubber-stamping bad PRs. The review is a bottleneck, and that is a feature of the process, not a bug.

Review comments should explain the AI-specific failure mode when it applies. If you catch the AI duplicating logic, the comment should say "this is a common AI pattern — it does not know about the existing utility; use the existing one." The team learns to anticipate the patterns and catch them faster.

Automated review tools help but do not replace human review. AI code review tools that run automatically on every PR catch a meaningful percentage of issues. They also miss the ones that matter most — the semantic mistakes, the architectural violations, the "this is not what the user asked for" mistakes. Use the automation as a first pass; keep the human review.

The calibration session

The thing I install at most engagements when a team starts shipping significant AI-authored code is a weekly 30-minute "review calibration" session. Two reviewers pick three recent AI-authored PRs, review them together, and compare notes on what they caught and what they missed. The outputs of this session are the additions to the team's shared review checklist.

Over a few months, the team develops a shared sense of what AI-generated code looks like and where it fails. That shared sense is more valuable than any specific checklist because it generalizes to new tools and new failure modes as the AI landscape changes.

What not to do

Do not ban AI-generated code. That train has left the station. Banning it just drives the usage underground and robs you of the ability to shape the practice.

Do not add more process on top of existing process. Teams that respond to AI by adding five new review gates end up shipping nothing. Update the existing review practice; do not layer on top of it.

Do not treat AI PRs as "less careful" work. A common failure mode is "this is just AI-generated, so we review it faster." The inverse is closer to right: AI PRs deserve more careful review because the failure modes are less familiar.

Do not assume velocity gains will show up everywhere. Some teams will see massive velocity gains from AI tools; others will see modest ones; a few will see losses. The difference is usually the quality of the review practice. Teams that invest in review catch more, ship fewer bugs, and compound their gains. Teams that skip review pay for it later.

Counterpoint: do not over-index on AI-specific review

A warning. All the review heuristics I have described are good practices for human-authored code too. A review that catches "does this function exist" and "does this duplicate existing logic" is a good review regardless of who wrote the code. Do not build an elaborate "AI review" practice that is separate from your normal review practice. Merge the practices. The goal is better review overall, not review tooling that only applies to AI.

Your next step

This week, pick three AI-authored PRs that have shipped in the last month. Review them again with the checklist above in mind. Write down anything you find that made it through the original review. If the list is long, that is the signal that your review practice needs an update. If it is short, you are in better shape than most teams.

Where I come in

Installing review practices that specifically handle AI-generated code is one of the things I do at engagements where the team has ramped up AI tool usage. Usually a week of pairing on reviews, updating the team's checklist, and establishing the calibration ritual. Book a call if your team is shipping AI code at volume and the review practice has not caught up.