BUILD · Jun 1, 2026

Building an AI code review agent: lessons from production

I built a code review agent that reviews PRs on GitHub. Here's the architecture, prompts, and the surprising failure modes I hit in production.

TL;DR: My code review agent posted 47 comments on its first PR. 38 were wrong. The architecture was the easy part. The failure modes took 11 weeks to fix. Here is what I learned building a GitHub PR review agent that actually helps.

The Claude Code documentation describes how code review fits into the agentic coding workflow.

I spent three months building an AI code review agent for GitHub PRs. The first prototype took a weekend. Making it reliable enough for production, where false positives erode trust and missed bugs defeat the purpose, took the remaining eleven weeks.

Here’s what I learned.

Key takeaways:

  • The architecture is straightforward: webhook → classify → fetch diff → review with LLM → post comments
  • The system prompt matters more than the model: “senior engineer reviewing a junior’s PR” works best
  • Chunking large diffs is essential but introduces its own problems (duplicate comments, lost cross-file context)
  • False positives are the #1 reliability challenge: confidence thresholds and severity classification help
  • The most surprising failure mode: the AI arguing with human reviewers in comment threads

Confession

I nearly scrapped this project three times. Each time, a breakthrough in prompt design or architecture kept me going. The current version runs on about 20 repos and posts reviews that are useful, not perfect, but useful, on about 70% of PRs.

What is the architecture of an AI code review agent?

The agent runs as a simple web service with five stages:

GitHub Webhook → Event Classifier → Diff Fetcher → LLM Reviewer → Comment Poster
                        ↓                ↓              ↓
                 Skip if not PR   Chunk if large  Aggregate results

1. Webhook receiver

import asyncio

from fastapi import FastAPI, Request

app = FastAPI()

# Keep references to in-flight reviews so the tasks are not garbage-collected
background_tasks: set[asyncio.Task] = set()

@app.post("/webhook")
async def handle_webhook(request: Request):
    payload = await request.json()
    event = request.headers.get("x-github-event")

    if event == "pull_request" and payload["action"] in ["opened", "synchronize"]:
        pr = PRContext(
            owner=payload["repository"]["owner"]["login"],
            repo=payload["repository"]["name"],
            number=payload["pull_request"]["number"],
            title=payload["pull_request"]["title"],
            base_sha=payload["pull_request"]["base"]["sha"],
            head_sha=payload["pull_request"]["head"]["sha"],
            is_draft=payload["pull_request"].get("draft", False),
        )
        # Run the review in the background so the webhook returns immediately
        task = asyncio.create_task(review_pull_request(pr))
        background_tasks.add(task)
        task.add_done_callback(background_tasks.discard)

    return {"status": "ok"}

The key insight: process reviews asynchronously. GitHub’s webhook timeout is 10 seconds. A full review takes 30-90 seconds. Kick off a background task and return immediately.
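
The handler relies on a PRContext type that the post does not show. A minimal sketch of what it could look like, assuming a frozen dataclass and a hypothetical from_payload helper (neither is from the original code); the payload keys are the ones the handler already reads, plus the pull request's draft flag:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PRContext:
    """Everything the reviewer needs to know about one pull request."""
    owner: str
    repo: str
    number: int
    title: str
    base_sha: str
    head_sha: str
    is_draft: bool = False  # assumption: filled from the payload's "draft" flag

    @classmethod
    def from_payload(cls, payload: dict) -> "PRContext":
        """Hypothetical helper: build a PRContext from a pull_request event."""
        pull = payload["pull_request"]
        return cls(
            owner=payload["repository"]["owner"]["login"],
            repo=payload["repository"]["name"],
            number=pull["number"],
            title=pull["title"],
            base_sha=pull["base"]["sha"],
            head_sha=pull["head"]["sha"],
            is_draft=pull.get("draft", False),
        )
```

Keeping the payload parsing in one place means the webhook handler and any tests build the context the same way.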

2. Event classifier

Not every PR needs reviewing. I classify events to save cost and reduce noise:

async def should_review(pr: PRContext) -> bool:
    """Skip drafts, docs-only changes, and trivial PRs."""
    # Skip draft PRs
    if pr.is_draft:
        return False

    # Get changed files
    files = await github.get_pr_files(pr.owner, pr.repo, pr.number)

    # Skip docs-only PRs
    if all(f.endswith(".md") for f in files):
        return False

    # Skip trivial changes (fewer than about 10 lines changed)
    diff = await github.get_pr_diff(pr.owner, pr.repo, pr.number)
    if len(diff) < 200:  # Rough: 200 chars ≈ 10 lines
        return False

    return True

This filters out about 30% of webhook events and saves about ₹650 ($8) per month in unnecessary LLM calls.
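
The 200-character cutoff is a rough proxy for "fewer than 10 lines changed". A sketch of a more direct check, counting added and removed lines inside hunks of a git-style unified diff. The function names count_changed_lines and is_trivial are hypothetical, not from the original code:

```python
def count_changed_lines(diff: str) -> int:
    """Count added and removed lines inside the hunks of a git-style diff."""
    changed = 0
    in_hunk = False
    for line in diff.splitlines():
        if line.startswith("diff --git"):
            in_hunk = False  # new file: header lines follow until the next @@
        elif line.startswith("@@"):
            in_hunk = True
        elif in_hunk and line.startswith(("+", "-")):
            changed += 1
    return changed


def is_trivial(diff: str, min_changed_lines: int = 10) -> bool:
    """True when the diff changes fewer lines than the threshold."""
    return count_changed_lines(diff) < min_changed_lines


SAMPLE_DIFF = """diff --git a/app.py b/app.py
index 111..222 100644
--- a/app.py
+++ b/app.py
@@ -1,3 +1,3 @@
 import os
-x = 1
+x = 2
 print(x)
"""

print(count_changed_lines(SAMPLE_DIFF))  # prints 2
```

Tracking whether the parser is inside a hunk keeps the `---` and `+++` file headers out of the count without miscounting removed lines that happen to start with dashes.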

3. Diff fetching and chunking

Large diffs are the hardest problem. A 2000-line diff won’t fit in a single review context without losing quality.

async def chunk_diff(diff: str, max_file_changes: int = 500) -> list[DiffChunk]:
    """Split a large diff into reviewable chunks."""
    files = parse_diff_files(diff)
    chunks = []

    for file_path, file_diff in files:
        # max_file_changes is in lines, so count lines rather than characters
        # (assumes file_diff is a string)
        if len(file_diff.splitlines()) <= max_file_changes:
            chunks.append(DiffChunk(file=file_path, diff=file_diff))
        else:
            # Split by function boundaries within the file
            function_diffs = split_by_functions(file_diff)
            for i, func_diff in enumerate(function_diffs):
                chunks.append(
                    DiffChunk(
                        file=f"{file_path}#function-{i}",
                        diff=func_diff,
                    )
                )

    return chunks
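
parse_diff_files is not shown in the post. One possible sketch, assuming the diff is git-style (each file section starts with a `diff --git a/... b/...` header) and that the function returns (path, per-file diff) pairs as strings, which is how chunk_diff consumes it:

```python
import re

# Matches the per-file header line git writes at the start of each file section.
# Assumption: paths do not themselves contain the sequence " b/".
_FILE_HEADER = re.compile(r"^diff --git a/(.+?) b/(.+)$")


def parse_diff_files(diff: str) -> list[tuple[str, str]]:
    """Split a git-style unified diff into (path, per-file diff) pairs."""
    files: list[tuple[str, str]] = []
    current_path = None
    current_lines: list[str] = []

    for line in diff.splitlines():
        match = _FILE_HEADER.match(line)
        if match:
            # A new file section starts: flush the previous one, if any
            if current_path is not None:
                files.append((current_path, "\n".join(current_lines)))
            current_path = match.group(2)
            current_lines = [line]
        elif current_path is not None:
            current_lines.append(line)

    if current_path is not None:
        files.append((current_path, "\n".join(current_lines)))
    return files
```

Lines inside hunks always begin with a space, `+`, or `-`, so file content cannot be mistaken for a header.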

Each chunk gets an independent review. Then I aggregate the results:

async def review_pull_request(pr: PRContext):
    # Run the classifier first so skipped PRs never reach the LLM
    if not await should_review(pr):
        return

    diff = await github.get_pr_diff(pr.owner, pr.repo, pr.number)
    chunks = await chunk_diff(diff)

    # Review each chunk independently
    all_findings = []
    for chunk in chunks:
        findings = await review_chunk(pr, chunk)
        all_findings.extend(findings)

    # Deduplicate and aggregate
    consolidated = consolidate_findings(all_findings)

    # Post only findings above the confidence threshold
    for finding in consolidated:
        if finding.confidence > 0.7:
            await post_comment(pr, finding)

4. The system prompt

This was the most iterated part. Here’s what works:

REVIEW_PROMPT = """You are a senior engineer reviewing a junior developer's pull request.

Focus ONLY on genuine issues:

CRITICAL issues (block merge):
- Logic errors that would cause incorrect behavior
- Security vulnerabilities (injection, auth bypass, data leaks)
- Performance problems in hot paths
- Race conditions or concurrency bugs

WARNINGS (should fix):
- Error handling gaps (uncaught exceptions, silent failures)
- Resource leaks (unclosed connections, file handles)
- Testing gaps that would miss real bugs

Do NOT comment on:
- Code style preferences (use the project's formatter)
- Missing docstrings on private methods
- Variable naming unless genuinely confusing
- Patterns that are unconventional but correct

For each issue found, provide:
1. File and line number
2. Severity (CRITICAL or WARNING)
3. Clear explanation of the problem
4. Specific code suggestion (exact diff if possible)
5. Confidence (0.0 to 1.0)

If you're unsure about something, skip it. False positives erode trust."""

The key decisions:

  • “Senior engineer reviewing a junior” sets the right tone: helpful, not pedantic
  • Explicit “do not comment on” list reduces noise by about 40%
  • Confidence scoring lets me filter low-confidence findings
  • “If unsure, skip” is the most important instruction: it directly fights false positives
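
The five fields the prompt asks for map naturally onto a small record type. A sketch of parsing and filtering the model's output, assuming the model is additionally told to reply with a JSON array of objects (the prompt above does not specify an output format, so that part is an assumption, as are the names Finding, parse_findings, and postable):

```python
import json
from dataclasses import dataclass


@dataclass
class Finding:
    file: str
    line: int
    severity: str  # "CRITICAL" or "WARNING"
    explanation: str
    suggestion: str
    confidence: float  # 0.0 to 1.0


def parse_findings(raw: str) -> list[Finding]:
    """Parse a JSON array of findings, silently dropping malformed entries."""
    try:
        items = json.loads(raw)
    except json.JSONDecodeError:
        return []
    if not isinstance(items, list):
        return []

    findings = []
    for item in items:
        try:
            finding = Finding(
                file=str(item["file"]),
                line=int(item["line"]),
                severity=str(item["severity"]).upper(),
                explanation=str(item["explanation"]),
                suggestion=str(item.get("suggestion", "")),
                confidence=float(item["confidence"]),
            )
        except (KeyError, TypeError, ValueError, AttributeError):
            continue  # missing field or wrong type: skip rather than guess
        if finding.severity in {"CRITICAL", "WARNING"} and 0.0 <= finding.confidence <= 1.0:
            findings.append(finding)
    return findings


def postable(findings: list[Finding], threshold: float = 0.7) -> list[Finding]:
    """Keep only findings above the confidence threshold."""
    return [f for f in findings if f.confidence > threshold]
```

Dropping malformed entries instead of repairing them follows the same principle as "if unsure, skip": a lost finding costs less than a wrong comment.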

What failure modes surprised me in production?

1. Hallucinated bugs

The AI would find bugs that don’t exist. Here’s a real example:

The PR changed a CSS class name from btn-primary to btn-main. The AI commented: “This class is missing the :hover state: users won’t see any visual feedback on mouseover.” But the hover state was defined in a parent class in a different file that the AI couldn’t see.

Fix: Added cross-file context by including the most relevant files (imports, parent classes) in the review context. I also lowered expectations: the agent now appends a disclaimer to each comment, “Based on the diff alone; verify against full codebase.”

2. Missing real bugs in large files

The agent did great on files under 300 lines. Above that, quality dropped sharply. In a 600-line file, it missed a null pointer dereference that a human reviewer caught immediately.

Fix: Chunking at function boundaries, not file boundaries. Each function gets its own review pass. But this introduced the next problem.

3. Duplicate comments

When the same issue appears in two chunks, or when two chunks touch adjacent code, the AI would flag the same problem twice: once in chunk A and once in chunk B. The two comments were sometimes worded slightly differently, which made deduplication non-trivial.

Fix: A deduplication pass that compares findings by location and semantic similarity. The simplified version below groups by location only:

def consolidate_findings(findings: list[Finding]) -> list[Finding]:
    """Merge duplicate findings from different chunks."""
    grouped: dict[tuple[str, int], Finding] = {}

    for f in findings:
        # Bucket by file and 10-line window (lines 0-9, 10-19, ...)
        key = (f.file, f.line // 10)
        if key not in grouped:
            grouped[key] = f
        else:
            # Keep the first finding unless the new one is more severe
            existing = grouped[key]
            if f.severity == "CRITICAL" and existing.severity == "WARNING":
                grouped[key] = f

    return list(grouped.values())
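
Fixed line buckets miss duplicates that straddle a bucket boundary (lines 9 and 10, for example) and ignore wording entirely. A sketch of a pairwise check that combines line distance with text similarity. It uses difflib.SequenceMatcher from the standard library, which measures lexical overlap and is only a stand-in for true semantic similarity. The names and thresholds are assumptions, and Finding is redefined here only to keep the example self-contained:

```python
from dataclasses import dataclass
from difflib import SequenceMatcher

SEVERITY_RANK = {"WARNING": 0, "CRITICAL": 1}


@dataclass
class Finding:
    file: str
    line: int
    severity: str
    explanation: str
    confidence: float


def is_duplicate(a: Finding, b: Finding,
                 max_line_gap: int = 5, min_similarity: float = 0.6) -> bool:
    """Same file, nearby lines, and similarly worded explanations."""
    if a.file != b.file or abs(a.line - b.line) > max_line_gap:
        return False
    ratio = SequenceMatcher(None, a.explanation.lower(), b.explanation.lower()).ratio()
    return ratio >= min_similarity


def strength(f: Finding) -> tuple[int, float]:
    """Order findings by severity first, then confidence."""
    return (SEVERITY_RANK.get(f.severity, 0), f.confidence)


def dedupe(findings: list[Finding]) -> list[Finding]:
    """Keep one finding per duplicate group, preferring the stronger one."""
    kept: list[Finding] = []
    for candidate in findings:
        for i, existing in enumerate(kept):
            if is_duplicate(candidate, existing):
                if strength(candidate) > strength(existing):
                    kept[i] = candidate
                break
        else:
            kept.append(candidate)  # no duplicate found among kept findings
    return kept
```

The pairwise loop is quadratic in the number of findings, which is acceptable for the handful of comments a single PR produces.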

4. Arguing with human reviewers

This was the most bizarre failure. A human would comment “Actually, this pattern is intentional for performance reasons.” The AI agent, in a follow-up review of the same PR, would post a rebuttal: “While performance is a concern, the correctness issue outweighs it...”

The AI didn’t know it was arguing with a human. It saw new context (the human’s comment) and treated it as new code to review.

Fix: The agent reads existing review comments before posting and skips topics that have been discussed:

async def has_been_discussed(pr: PRContext, finding: Finding) -> bool:
    existing_comments = await github.get_pr_comments(
        pr.owner, pr.repo, pr.number
    )

    for comment in existing_comments:
        # Outdated comments can have no line number, so guard against None
        if comment.path == finding.file and comment.line is not None:
            # Simple overlap check: within 5 lines of an existing comment
            if abs(comment.line - finding.line) < 5:
                return True
    return False

5. Cost variation

Early estimates were way off. Here’s actual cost data from 100 PRs:

PR Size                   Average Cost      p90 Cost          Average Time
Small (<100 lines)        ₹65 ($0.80)       ₹100 ($1.20)      20s
Medium (100-500 lines)    ₹165 ($2.00)      ₹250 ($3.00)      45s
Large (500-2000 lines)    ₹500 ($6.00)      ₹820 ($10.00)     90s
Monstrous (>2000 lines)   ₹1,000 ($12.00)   ₹1,650 ($20.00)   180s

Total monthly cost at 50 PRs/month (my current volume): approximately ₹4,100 ($50).
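
To see how a monthly total follows from the per-size averages, here is a small worked calculation. The average costs come from the table above; the mix of PR sizes is a made-up illustration, not measured data:

```python
# Average cost per PR in US cents, taken from the table above
AVERAGE_COST_CENTS = {"small": 80, "medium": 200, "large": 600, "monstrous": 1200}


def monthly_cost_cents(pr_counts: dict[str, int]) -> int:
    """Expected monthly cost for a given number of PRs in each size class."""
    return sum(AVERAGE_COST_CENTS[size] * count for size, count in pr_counts.items())


# Hypothetical mix of 50 PRs (an assumption for illustration only)
mix = {"small": 42, "medium": 8}
total = monthly_cost_cents(mix)
print(f"${total / 100:.2f}")  # prints $49.60
```

A total near $50 for 50 PRs works out to about $1 per PR, which is only possible if most PRs fall in the small bucket. A few large PRs per month would move the bill noticeably.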

When NOT to use AI code review

I learned this the hard way. These scenarios will waste your money and frustrate your team:

  1. Codebases you can’t send to external APIs: if your company bans sending code to third-party LLMs, skip it
  2. Generated code: the AI will flag patterns it doesn’t recognize as odd (and they are odd, because they were generated)
  3. As a replacement for humans: AI catches style issues and simple bugs, while humans catch architectural problems and design inconsistencies
  4. For greenfield projects: the first few PRs on a new project are mostly scaffolding. AI reviews add little value
  5. When you can’t handle false positives: if your team will start ignoring all AI comments after 3 bad ones, don’t start

What I’d do differently

If I were starting over:

  1. Start with a prompt-only approach. I jumped to complex chunking and aggregation too early. A good prompt on the full diff catches 70% of issues
  2. Invest in deduplication earlier: duplicate comments erode trust faster than missed bugs
  3. Track per-developer false positive rate: some developers get more false positives because their code style triggers the AI more. Adjust sensitivity per developer
  4. Add human feedback loop: let developers thumbs-up/thumbs-down comments. Use that to fine-tune
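
Points 3 and 4 fit together: thumbs-up/thumbs-down reactions are enough to compute a per-developer false positive rate. A minimal sketch, assuming each reaction is stored as a record with the PR author and a helpful flag, and that a thumbs-down counts as a false positive. The Feedback type and the min_votes cutoff are hypothetical:

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Feedback:
    pr_author: str
    helpful: bool  # True for thumbs-up, False for thumbs-down


def false_positive_rates(events: list[Feedback], min_votes: int = 5) -> dict[str, float]:
    """Share of thumbs-down reactions per PR author, skipping small samples."""
    votes: dict[str, list[int]] = defaultdict(lambda: [0, 0])  # author -> [down, total]
    for event in events:
        votes[event.pr_author][1] += 1
        if not event.helpful:
            votes[event.pr_author][0] += 1
    return {
        author: down / total
        for author, (down, total) in votes.items()
        if total >= min_votes
    }
```

The min_votes cutoff avoids adjusting sensitivity for a developer based on one or two reactions.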

The agent runs daily on my repos now. It catches about 3-5 real issues per week, misses about 1-2 that humans catch, and posts about 2-3 false positives. Not perfect. But it makes the team a little better, and that’s enough.


Related: How to build your first AI agent in 2026: a step-by-step tutorial from scratch. Also see LangGraph tutorial for beginners for building agent workflows. If you’re shipping tools for agents, Microsoft’s deep dive on how AI coding agents use your SDK is essential reading.

FAQ

How much does AI code review cost per PR? For PRs under 500 lines changed, the average cost is about ₹65-₹165 ($0.80-$2.00) using Claude Sonnet. Larger PRs average ₹500-₹1,000 ($6-$12), with PRs over 2000 lines at the top of that range. About 70% of the cost goes to diff analysis and 30% to comment generation.

How do you handle AI code review for very large PRs? I use a chunking strategy: split the diff into files, review each file independently, then aggregate the results. Files over 500 lines changed get further split into function-level chunks. Each chunk gets an independent LLM call. This keeps context manageable but increases cost proportionally.

How do you reduce false positives in AI code review? Three techniques: confidence thresholds (only post comments above 70% confidence), severity classification (warnings vs critical), and deduplication (same issue mentioned in multiple chunks gets merged). Even with these, expect 20-30% of comments to be noise.

When should you NOT use AI code review? Don’t use AI code review for: sensitive codebases where you can’t send code to external APIs, PRs with mostly generated code (the AI will complain about patterns it doesn’t recognize), or as a replacement for human code review. AI is a first pass, not a final gate.


This article was published on Agentic Up (https://agenticup.dev): practical guides for developers and founders building with AI agents. Reach me at hello@agenticup.dev.
