How to Build a Code Review Workflow That Catches AI-Generated Bugs

A practical setup guide for teams where AI tools are generating a significant share of production code

Code review is the primary quality gate between what developers write and what reaches production. That has always been true. What is changing is that developers are no longer writing everything themselves. AI coding assistants are generating a growing percentage of the code in production repositories, and the standard review process was not designed with that volume or that failure mode in mind.

The bugs that slip through review of AI-assisted code are different from the ones that slip through in hand-written code. They are rarely syntax errors or obvious logic failures - those get caught immediately. What slips through is contextually wrong code: functions that are technically valid but violate conventions your team has established, patterns that introduce security edge cases your current test suite does not cover, and dependencies on deprecated APIs that the AI does not know are deprecated.

This guide walks through a concrete workflow setup that addresses these failure modes without requiring a separate review track for AI-generated code.

Step 1: Configure a Pull Request Template That Makes AI Usage Visible

Before reviewing anything, reviewers need to know how much of the code in a PR was AI-generated. Without that context, they will apply the same review heuristics they use for hand-written code and miss the specific things worth checking.

GitHub's pull request template is the right place to capture this. A checkbox asking whether AI-generated code is present - and a field for which parts - costs authors ten seconds to fill out and gives reviewers the context they need.

Create .github/PULL_REQUEST_TEMPLATE.md in your repository:

## Changes

<!-- Brief description of what this PR does -->

## AI Assistance

- [ ] This PR contains AI-generated code
  - If checked, which sections? <!-- e.g., "utility functions in src/utils/format.ts", "test cases in tests/api.test.ts" -->

## Checklist

- [ ] Tests added or updated for new behavior
- [ ] No new secrets or credentials in code
- [ ] Follows existing module patterns
- [ ] Security-sensitive changes have a second reviewer assigned

This is not about creating friction. It is about making the review calibrated to the actual content of the PR.
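
If the disclosure is going to inform review, it helps to confirm the section was actually filled in. The following is a minimal sketch, assuming the PR description is available as a string (for example from the pull request event payload in CI). The function names `parse_ai_disclosure` and `disclosure_problems` are hypothetical, and the patterns match the exact wording of the template above.

```python
import re

# Matches the checkbox line from the template above, ticked or not.
AI_CHECKBOX = re.compile(
    r"^- \[(?P<mark>[ xX])\] This PR contains AI-generated code\s*$",
    re.MULTILINE,
)
# Matches the follow-up line and captures whatever the author typed after it.
SECTIONS_LINE = re.compile(
    r"^\s*- If checked, which sections\?(?P<rest>.*)$",
    re.MULTILINE,
)
HTML_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)


def parse_ai_disclosure(pr_body: str) -> dict:
    """Return whether the AI checkbox is ticked and which sections were named."""
    checkbox = AI_CHECKBOX.search(pr_body)
    if checkbox is None:
        # The template section was deleted or edited beyond recognition.
        return {"present": False, "ai_generated": False, "sections": ""}

    sections = ""
    sections_match = SECTIONS_LINE.search(pr_body)
    if sections_match:
        # Drop the template's placeholder comment, keep what the author typed.
        sections = HTML_COMMENT.sub("", sections_match.group("rest")).strip()
    return {
        "present": True,
        "ai_generated": checkbox.group("mark").lower() == "x",
        "sections": sections,
    }


def disclosure_problems(pr_body: str) -> list[str]:
    """List human-readable problems with the disclosure (empty list = fine)."""
    info = parse_ai_disclosure(pr_body)
    if not info["present"]:
        return ["AI Assistance checkbox is missing from the PR description"]
    if info["ai_generated"] and not info["sections"]:
        return ["AI-generated code is checked but no sections are listed"]
    return []
```

A CI step could print the returned problems as a warning rather than failing the build, which keeps the check informative without adding friction.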

Step 2: Add Automated Checks That Run Before Human Review

Human reviewers should not be doing the work that tools can do. Static analysis, linting, and security scanning should run automatically on every PR and block merge if they fail - leaving human reviewers to focus on semantic and architectural questions.

A GitHub Actions workflow that covers the main bases:

name: Code Quality

on:
  pull_request:
    branches: [main, develop]

jobs:
  lint-and-analyze:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Node.js
        uses: actions/setup-node@v4
        with:
          node-version: '20'

      - name: Install dependencies
        run: npm ci

      - name: Run ESLint
        run: npx eslint src/ --max-warnings 0

      - name: Run TypeScript type check
        run: npx tsc --noEmit

      - name: Run tests with coverage
        run: npm test -- --coverage --coverageThreshold='{"global":{"lines":80}}'

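  security-scan:
    runs-on: ubuntu-latest
    # The semgrep-action wrapper is deprecated; run the CLI from the official image instead.
    container:
      image: semgrep/semgrep
    steps:
      - uses: actions/checkout@v4

      - name: Run Semgrep
        # --error makes the step exit non-zero when there are findings, so it can block merge.
        run: semgrep scan --config auto --error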

For Python projects, replace ESLint with Pylint and add Bandit for security scanning. For Go, add golangci-lint. The structure is the same across languages.

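The key settings here are --max-warnings 0 for ESLint and the coverage threshold (the --coverageThreshold flag assumes npm test runs Jest). With --max-warnings 0, any lint warning fails the workflow, not just errors. With the threshold, the test step fails whenever global line coverage is below 80%. Note that the threshold is an absolute floor: it does not flag a drop that stays above 80%. Strict gates like these suit codebases where AI assistance is adding code at volume.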
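
A fixed coverage floor and a check for coverage dropping relative to the base branch are two different gates. The sketch below shows one way to combine them, assuming the test runner is configured with the istanbul-style json-summary reporter, which writes a document containing total.lines.pct. The function names and the max_drop tolerance of 0.5 points are assumptions for illustration, not settings from the workflow above.

```python
import json


def line_coverage_pct(summary_json: str) -> float:
    """Read total line coverage from an istanbul-style json-summary document."""
    return float(json.loads(summary_json)["total"]["lines"]["pct"])


def coverage_gate(base_pct: float, head_pct: float,
                  floor: float = 80.0, max_drop: float = 0.5) -> list[str]:
    """Return the reasons a PR fails the coverage gate (empty list = pass)."""
    failures = []
    # Gate 1: absolute floor, equivalent to a coverage threshold.
    if head_pct < floor:
        failures.append(
            f"line coverage {head_pct:.1f}% is below the {floor:.1f}% floor"
        )
    # Gate 2: relative drop compared with the base branch.
    drop = base_pct - head_pct
    if drop > max_drop:
        failures.append(
            f"line coverage dropped {drop:.1f} points versus the base branch"
        )
    return failures
```

Running the gate requires a coverage summary for both the base branch and the PR head; how those are produced and stored depends on the CI setup.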

Step 3: Define What Reviewers Should Check That Automation Cannot

Once automated checks pass, human reviewers focus on what tools cannot evaluate: semantic correctness, architectural fit, and behavioral edge cases.

For AI-generated code specifically, the review checklist should include:

Does this code follow the conventions this module was already using? AI tools default to common patterns from their training data. Your module may use a specific approach to error handling, dependency injection, or state management that the AI does not replicate. Check whether the new code is consistent with the existing code in the same file, not just syntactically valid.

Does this code introduce dependencies the team has agreed to phase out? If your team is migrating away from a library, an ORM, or an internal utility, that context is not available to the AI. It will suggest using whatever is most common in the training data for this kind of task.

Does this code handle all the documented error states for the external services it calls? AI tools are good at handling the happy path. They frequently omit handling for rate limiting, service unavailability, malformed responses, and partial success states. Check the API documentation for the services being called and verify the error paths are covered.

Are there new network calls, database queries, or file operations that were not present before? These are often the source of performance regressions that do not appear in tests.
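
Part of this checklist can be given a mechanical first pass. The sketch below scans the added lines of a unified diff for imports of packages the team is phasing out. The denylist entries, the function name `flag_phased_out_imports`, and the import pattern (which covers common JavaScript/TypeScript import and require forms) are all illustrative assumptions; a reviewer still has to judge the findings.

```python
import re

# Hypothetical denylist: package name -> note for the reviewer.
PHASED_OUT = {
    "moment": "team is migrating to the shared date utility",
    "request": "team is migrating to the shared HTTP client",
}

# Matches: import x from 'pkg' / import 'pkg' / import('pkg') / require('pkg')
IMPORT_PATTERN = re.compile(
    r"""(?:from\s+|import\s+|import\(\s*|require\(\s*)['"](?P<module>[^'"]+)['"]"""
)


def package_name(module: str) -> str:
    """Reduce 'pkg/sub/path' to 'pkg' and '@scope/pkg/sub' to '@scope/pkg'."""
    parts = module.split("/")
    if module.startswith("@") and len(parts) >= 2:
        return "/".join(parts[:2])
    return parts[0]


def flag_phased_out_imports(diff_text: str, denylist: dict[str, str]) -> list[str]:
    """Report added lines in a unified diff that import a denylisted package."""
    findings = []
    current_file = "<unknown>"
    for line in diff_text.splitlines():
        if line.startswith("+++ "):
            # "+++ b/src/file.ts" names the file the following hunks belong to.
            current_file = line[4:].removeprefix("b/")
            continue
        if not line.startswith("+"):
            continue  # only added lines are inspected
        for match in IMPORT_PATTERN.finditer(line[1:]):
            module = match.group("module")
            package = package_name(module)
            if package in denylist:
                findings.append(
                    f"{current_file}: imports '{module}' ({denylist[package]})"
                )
    return findings
```

The input could be the output of git diff against the base branch. Because only added lines are inspected, existing uses of a phased-out package do not produce noise on unrelated PRs.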

Google's engineering practices guide has a useful framing for what a code review is trying to accomplish - not just catching bugs, but ensuring the code makes the overall system better over time. That framing is especially important when the volume of code under review is higher.

Step 4: Set Up Pre-commit Hooks for Fast Local Feedback

Catching issues in CI works, but the feedback is slow. The pre-commit framework runs checks locally before each commit, giving developers fast feedback on the specific things most likely to cause CI failures or review comments.

# .pre-commit-config.yaml
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.5.0
    hooks:
      - id: detect-private-key
      - id: check-merge-conflict
      - id: check-json
      - id: check-yaml
      - id: trailing-whitespace

  - repo: https://github.com/Yelp/detect-secrets
    rev: v1.4.0
    hooks:
      - id: detect-secrets

  - repo: https://github.com/psf/black
    rev: 23.12.1
    hooks:
      - id: black
        language_version: python3

The detect-private-key and detect-secrets hooks are particularly important in AI-assisted workflows. AI tools occasionally emit realistic-looking credentials inline, usually because the examples or surrounding context they were generating from contained them. These hooks block the commit before such values reach the repository.
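
To give a feel for why such hooks can catch credentials that no fixed pattern describes, here is a minimal sketch of one common heuristic: flagging long quoted strings whose characters look statistically random. This is an illustration of the idea, not the implementation used by detect-secrets. The 16-character minimum and the threshold of 4.0 bits per character are assumptions chosen for the example.

```python
import math
import re
from collections import Counter

# Quoted string literals of 16+ characters drawn from a base64-like alphabet.
CANDIDATE = re.compile(r"""['"]([A-Za-z0-9+/=_\-]{16,})['"]""")


def shannon_entropy(text: str) -> float:
    """Bits of entropy per character, based on character frequencies."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def find_suspect_strings(source: str, threshold: float = 4.0) -> list[str]:
    """Return quoted literals whose entropy meets or exceeds the threshold."""
    return [
        literal
        for literal in CANDIDATE.findall(source)
        if shannon_entropy(literal) >= threshold
    ]
```

A heuristic like this produces false positives (hashes, long identifiers) and false negatives (short or low-entropy passwords), which is why dedicated tools combine it with known credential formats and an allowlist mechanism.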

Install pre-commit and activate the hooks with:

pip install pre-commit
pre-commit install

Step 5: Track Quality Trends Over Time

A single PR review catches point-in-time issues. Trend tracking catches systemic drift.

Code Climate and SonarQube both provide trend dashboards that show how code complexity, coverage, and technical-debt scores change over time. After AI tools are introduced to a team's workflow, these trends are worth checking monthly. Complexity that grows faster than the feature count is an early signal that accepted suggestions are not being reviewed rigorously enough.

GitClear publishes research on how AI assistance affects code churn specifically - code that is merged and then modified again shortly after. Tracking your own churn rate over time, before and after AI tool adoption, gives you objective data on whether the review process is catching enough issues before merge.
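
As a rough illustration of what tracking churn can mean in practice, the sketch below computes the fraction of merged lines that were modified or deleted again within a fixed window. The `LineChange` record, the `churn_rate` function, and the 14-day default window are assumptions for this example. Extracting the records from git history (for instance with blame or log data) is left out.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class LineChange:
    authored: date             # when the line was merged
    rewritten: Optional[date]  # when it was next modified or deleted, if ever


def churn_rate(changes: list[LineChange], window_days: int = 14) -> float:
    """Fraction of merged lines that were rewritten within the window."""
    if not changes:
        return 0.0
    window = timedelta(days=window_days)
    churned = sum(
        1
        for change in changes
        if change.rewritten is not None
        and change.rewritten - change.authored <= window
    )
    return churned / len(changes)
```

Computing this per month, for periods before and after AI tool adoption, gives a comparable series. The absolute number matters less than whether it moves.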

The GitHub code review documentation covers the mechanics of GitHub's review tooling in detail - required reviewers, review dismissal policies, and status checks that must pass before merge are all configurable at the repository or organization level.

Connecting the Workflow to Accountability

The workflow steps above work because they create visibility at each stage: developers can see automated checks before pushing, reviewers have context about AI usage, and metrics surface systemic trends. What most teams still lack is an explicit ownership norm: who owns AI-generated code once it merges?

The answer is simple but worth stating explicitly in team documentation: the developer who accepted the suggestion owns it. It is their code in the git history, their responsibility to ensure it was correctly reviewed, and their name in the blame if it causes problems later. Conventional Commits is a useful standard for commit messages that makes the history more readable when reviewing which changes introduced a problem.
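
A commit-message convention is only useful if it is applied consistently, so some teams check it automatically. The sketch below validates the header line of a commit message against the basic Conventional Commits shape (type, optional scope, optional "!" marker, colon, space, description). The allowed type list is a team choice and an assumption here, as is the function name `parse_commit_header`. Message bodies and footers are not checked.

```python
import re
from typing import Optional

# Types beyond feat and fix are a team decision; this set is an assumption.
ALLOWED_TYPES = {"feat", "fix", "docs", "refactor", "perf", "test", "build", "ci", "chore"}

HEADER = re.compile(
    r"^(?P<type>[a-z]+)"             # type, e.g. feat or fix
    r"(?:\((?P<scope>[^()\s]+)\))?"  # optional scope in parentheses
    r"(?P<breaking>!)?"              # optional breaking-change marker
    r": (?P<description>\S.*)$"      # colon, space, non-empty description
)


def parse_commit_header(header: str) -> Optional[dict]:
    """Parse a commit header, or return None if it does not follow the convention."""
    match = HEADER.match(header)
    if match is None or match.group("type") not in ALLOWED_TYPES:
        return None
    return {
        "type": match.group("type"),
        "scope": match.group("scope"),
        "breaking": match.group("breaking") is not None,
        "description": match.group("description"),
    }
```

For example, parse_commit_header("feat(api): add retry on 429 responses") yields a dictionary with type "feat" and scope "api", while parse_commit_header("Updated stuff") returns None.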

For teams building complex applications where this kind of governance matters from the start, 137Foundry specializes in establishing these development practices as part of building production-grade software - integrating the tooling, review processes, and accountability norms into the workflow rather than adding them retroactively. For a broader look at the process considerations around AI coding tools in production, the writeup on integrating AI coding tools without technical debt covers the governance policies and measurement approaches in more depth.
