In 1976, software engineer Michael Fagan at IBM published a paper introducing what he called "software inspections": structured, formal examinations of code by groups of engineers before it was tested or deployed. Fagan's data showed that inspections caught more defects per hour of effort than testing, and caught a category of errors — design problems, logic flaws, incorrect assumptions — that testing often missed entirely.
Nearly fifty years later, code review is a standard practice at virtually every professional software organization, conducted in a form that Fagan would recognize: one developer's code is examined by at least one other before being accepted into the shared codebase. The tools have changed from printed listings to pull requests on GitHub. The findings have not: code review remains one of the most effective defect-detection and knowledge-sharing mechanisms available to software teams.
This article explains what code review is, what the research says about its effectiveness, what good reviews look like, what common antipatterns undermine them, and when alternatives like pair programming may be more appropriate.
What Code Review Is and What It Accomplishes
The basic process
Code review is the practice of having at least one engineer, other than the author, examine source code changes before those changes are merged into the shared codebase. The reviewer is looking for:
- Correctness: Does the code do what it is supposed to do? Are there edge cases it fails to handle?
- Logic errors: Is the reasoning flawed in ways that might not produce immediate test failures but will cause problems later?
- Security vulnerabilities: Are there injection risks, authentication flaws, exposed credentials, insecure data handling?
- Performance: Are there obvious inefficiencies — N+1 queries, unnecessary allocations, blocking operations?
- Readability: Is the code clear enough that a future maintainer (who may be the author themselves six months hence) can understand it without the context that exists in the author's head today?
- Design and architecture: Is this the right approach, or is there a simpler or more robust solution?
- Standards compliance: Does the code follow the team's established conventions, naming patterns, test requirements?
The outcome of a review is typically one of three: approval and merge; approval with minor, non-blocking suggestions; or a request for changes before merging.
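As a concrete illustration of the N+1 query problem a reviewer might flag, the sketch below uses a hypothetical two-table schema (authors and posts, purely illustrative) in an in-memory SQLite database. The first function issues one query per row; the reviewed alternative fetches the same data with a single JOIN:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE posts (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
    INSERT INTO authors VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO posts VALUES (1, 1, 'First'), (2, 1, 'Second'), (3, 2, 'Third');
""")

def titles_with_authors_n_plus_one():
    """N+1 pattern: one query for the posts, then one extra query per post."""
    result = []
    for author_id, title in conn.execute("SELECT author_id, title FROM posts"):
        (name,) = conn.execute(
            "SELECT name FROM authors WHERE id = ?", (author_id,)).fetchone()
        result.append((title, name))
    return result

def titles_with_authors_join():
    """Reviewed alternative: a single JOIN fetches the same data in one query."""
    return list(conn.execute(
        "SELECT p.title, a.name FROM posts p JOIN authors a ON a.id = p.author_id"))

# Both return the same rows; only the number of database round trips differs.
assert sorted(titles_with_authors_n_plus_one()) == sorted(titles_with_authors_join())
```

With three posts the difference is trivial; with thousands of rows behind a network round trip per query, it is exactly the kind of inefficiency a reviewer is positioned to catch before it ships.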
What makes code review different from testing
Automated testing and code review are complementary, not competing. Tests verify that code behaves correctly against defined assertions. Code review examines whether the code is correct in a broader sense — whether the design makes sense, whether the assumptions are right, whether the readable logic is actually what was intended.
Tests catch bugs that violate the test's expectations. Code review catches bugs that the author didn't think to test for. Empirically, defect type distributions differ: testing is better at catching regression bugs and boundary conditions; review is better at catching logic errors, security problems, and design issues.
A widely cited figure from Capers Jones and others in software quality research suggests that the combination of code review and testing catches more than 95 percent of defects before release, compared to roughly 75-85 percent for either method alone.
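The combined figure follows from simple arithmetic if one assumes, as a rough simplification, that review and testing catch defects independently: the defects that escape are those missed by both. A back-of-envelope check, using an illustrative 80 percent catch rate for each method:

```python
# Illustrative per-method catch rates drawn from the 75-85 percent range above;
# independence between the two methods is an assumption, not a research finding.
review_catch = 0.80
testing_catch = 0.80

escape_rate = (1 - review_catch) * (1 - testing_catch)  # missed by both methods
combined_catch = 1 - escape_rate

assert round(combined_catch, 2) == 0.96  # consistent with "more than 95 percent"
```

In practice the two methods are not fully independent (they overlap on some defect classes and complement each other on others), but the calculation shows why layering them is so much stronger than either alone.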
What Research Says About Code Review Effectiveness
Google's findings
Google has published extensively on its engineering practices, including code review. In its 2020 book Software Engineering at Google (Winters, Manshreck, and Wright), the company reports that code review is required for all changes to its codebase before submission, with few exceptions.
Google's data, gathered from an internal survey of more than 900 engineers, found that code review serves purposes beyond defect detection:
- 83 percent of respondents cited reviewing for "correctness" as a primary motivation
- 73 percent cited improving code readability
- 59 percent cited knowledge transfer to other team members
- 58 percent cited ensuring code maintainability
This multi-purpose framing is important: code review is not only a defect-detection mechanism but a shared ownership and knowledge distribution mechanism. Engineers who review code become familiar with parts of the codebase they didn't write, which reduces the single points of failure created when only one person understands a given component.
Microsoft Research
A significant body of code review research has come from Microsoft Research. A 2013 study by Bacchelli and Bird, "Expectations, Outcomes, and Challenges of Modern Code Review," surveyed 165 developers across multiple Microsoft teams and analyzed 450 code reviews.
Their key finding: developers' primary goal in code review is not defect finding (as Fagan's formal inspection model assumed) but understanding the change: ensuring that reviewers comprehend what the code does and why, which serves knowledge transfer as much as quality assurance.
A study by Rigby and Bird ("Convergent Contemporary Software Peer Review Practices," 2013) synthesized data from multiple companies and found that changes reviewed by more than one reviewer show significantly lower post-commit defect rates than single-reviewer changes, with diminishing returns above two reviewers.
IBM and the original inspection data
Fagan's original 1976 IBM data and subsequent replications established that formal inspections could remove 60-80 percent of defects before testing. Later studies by Boehm and Basili at NASA found that the cost to fix a defect in code review is roughly 1/10 the cost to fix it after release, providing the classic economic argument for early defect detection.
The implication: an hour spent in code review typically prevents substantially more than an hour of debugging, rework, and incident response later. The investment compounds because code review also prevents the category of defects that cascade — a wrong assumption in a core abstraction that propagates through every dependent system.
What Good Code Review Looks Like
The scope problem: not too big, not too small
The single most important variable in code review quality is the size of the change under review. Industrial research, most prominently a large study of reviews at Cisco published by SmartBear, found that review quality degrades significantly for changes over roughly 400 lines: reviewers become less thorough, approval times increase, and the defect detection rate falls. Google's own review data (Sadowski et al., 2018) is consistent with this: the typical change reviewed at Google is small, on the order of a few dozen lines.
This finding has a practical implication: large, sweeping changes should be broken into smaller, reviewable units. This is sometimes uncomfortable — it requires more discipline from authors and more workflow management overhead — but it produces substantially better outcomes.
The same finding implies a preference for frequent small reviews over infrequent large ones. Teams that merge changes multiple times per day with small, reviewable diffs outperform teams that accumulate large changes for periodic review, both in quality and in cycle time.
Author behavior
Before requesting review, the author should:
Write a clear, informative description. The review description should explain what the change does, why it is needed, and any non-obvious implementation choices. A reviewer who understands the intent can provide better feedback than one who must infer it from the code.
Review your own code first. Authors who conduct a self-review before requesting external review catch a significant fraction of their own issues. Reading code in the review interface (rather than the editor) often reveals issues that weren't visible in development.
Annotate complex sections. If a specific implementation is unavoidably complex or makes a subtle tradeoff, a brief comment explaining it in the review request reduces reviewer confusion and defensiveness.
Reviewer behavior
Good review feedback is:
Specific and actionable. "This function is confusing" is less useful than "I found it difficult to follow the control flow here — would extracting the retry logic into a helper function make it clearer?"
Explaining reasoning. A suggestion without rationale can feel arbitrary. "Consider using a Set here instead of a List — lookup will be O(1) rather than O(n), which matters when the collection is large" gives the author the information to decide whether the suggestion applies.
Distinguishing blocking from non-blocking concerns. Not every suggestion should block the review. Teams benefit from conventions — a "nit:" prefix for style suggestions, a "blocker:" prefix for required changes — that communicate the reviewer's intent clearly.
Focused on the code, not the person. "We should add error handling here" rather than "you forgot error handling." The code is a shared artifact; the review is about improving it.
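The Set-versus-List suggestion quoted above can be sketched concretely. The names below are hypothetical; the point is that both versions return the same answer, but membership tests on a set are average O(1) hash lookups while tests on a list are O(n) scans:

```python
# Hypothetical "banned user" lookup, purely illustrative of the review comment.
banned_ids_list = list(range(100_000))
banned_ids_set = set(banned_ids_list)

def is_banned_via_list(user_id):
    return user_id in banned_ids_list   # linear scan: O(n) per lookup

def is_banned_via_set(user_id):
    return user_id in banned_ids_set    # hash lookup: average O(1) per lookup

# Identical behavior; only the cost differs as the collection grows.
assert is_banned_via_list(99_999) and is_banned_via_set(99_999)
assert not is_banned_via_list(-1) and not is_banned_via_set(-1)
```

Because the reviewer explained the reasoning rather than just issuing a directive, the author can judge for themselves whether the collection will ever be large enough for the difference to matter.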
"The goal of code review is not to catch every imperfection but to ensure the code is better after the review than before. Treating review as an examination the author must pass rather than a collaboration to improve the code produces adversarial dynamics that harm both quality and team culture." — Common framing in engineering culture literature
What to look for: a practical checklist
| Category | Key Questions |
|---|---|
| Correctness | Does it handle null/empty inputs? Are off-by-one errors possible? Are all code paths covered? |
| Error handling | Are errors caught and handled appropriately? Do failures degrade gracefully? |
| Security | Is user input sanitized? Are credentials or secrets hardcoded? Are permissions checked? |
| Tests | Do tests cover the new behavior? Do they test the right things? Are they brittle? |
| Readability | Can a new team member understand this in 10 minutes? Are names descriptive? |
| Performance | Are there obvious bottlenecks? Are database queries efficient? |
| Architecture | Does this fit the existing design? Does it introduce technical debt? |
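To make the "Correctness" row concrete, here is a minimal hypothetical example of the kind of edge case a reviewer looks for. A naive `sum(values) / len(values)` raises `ZeroDivisionError` on an empty list and `TypeError` on `None`; the reviewed version handles both explicitly (returning 0.0 for "no data" is itself an assumption a reviewer might question):

```python
def average(values):
    """Mean of a list of numbers; returns 0.0 for None or an empty list."""
    if not values:                  # covers both None and the empty list
        return 0.0
    return sum(values) / len(values)

assert average([2, 4, 6]) == 4.0
assert average([]) == 0.0
assert average(None) == 0.0
```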
Common Antipatterns
Rubber stamping
Rubber stamping is approving code without genuinely reviewing it. It occurs when reviewers face time pressure, lack familiarity with the relevant part of the codebase, find it socially uncomfortable to challenge colleagues, or are in a culture that values speed over quality.
Rubber stamping is worse than no review in one specific sense: it provides the organizational comfort of having conducted code review while allowing the defects that review was meant to catch to pass through. It is the process theater of engineering quality.
Signs of rubber stamping: reviews completed in under two minutes for substantial changes; consistently no comments; approvals consisting of nothing but "LGTM" (looks good to me) with no specifics.
Nitpicking
The opposite extreme is nitpicking: spending review time on cosmetic issues — inconsistent indentation, naming style preferences, punctuation in comments — while providing insufficient attention to correctness and design.
Nitpicking is demoralizing for authors, particularly newer engineers who receive feedback that feels arbitrary and stylistic rather than substantive. It also misallocates reviewer attention away from the issues that actually matter.
The solution is automation: style and formatting concerns should be handled by linters and formatters (ESLint, Prettier, Black, gofmt) that are enforced as part of the build pipeline. This removes stylistic disagreements from human code review entirely and frees reviewer attention for substantive questions.
Asynchronous review as a delay mechanism
In some organizations, code review becomes a bottleneck because reviewers do not respond promptly and authors spend days or weeks waiting for feedback. Google's internal data shows that reviews lasting more than 24 hours correlate with substantially lower developer satisfaction and slower feature delivery.
Teams can address this with service level expectations: reviewers commit to providing initial feedback within four or eight business hours. Most review issues can be addressed quickly; only the complex ones require extended consideration. Prompt initial engagement, even if only to say "I'll look at this more carefully tomorrow," prevents the demoralizing silence that degrades review culture.
Author defensiveness
When authors interpret review feedback as personal criticism rather than improvement suggestions, reviews become conflict-prone and reviewers learn to soften or omit feedback to avoid friction. This dynamic degrades review quality without anyone explicitly choosing to do so.
Engineering cultures that frame code as a shared organizational asset rather than personal property of the author reduce this friction. Practices like mob/ensemble programming, where code is written collaboratively from the start, produce a similar shift in ownership framing.
Synchronous vs Asynchronous Review
The dominant model for most teams is asynchronous code review via pull requests or merge requests: the author submits code, reviewers examine it on their own schedule, leave comments, and the author responds. This model scales well: reviewers and authors don't need to coordinate schedules, and written comments create a record of design decisions.
Synchronous code review — where reviewer and author sit together (or are on a call) and review code in real time — is faster for complex discussions but requires coordination overhead. It is often used for particularly complex or sensitive changes, or for onboarding situations where real-time explanation is valuable.
Pair programming — two engineers working on the same code simultaneously — is a form of continuous review that avoids the async model's batch-and-wait cycle entirely. One engineer writes (the "driver") while the other observes, questions, and suggests (the "navigator"), switching roles periodically.
Research on pair programming (Cockburn and Williams, 2001) found that pairing produces code with roughly 15 percent fewer defects while taking roughly 15 percent longer, making it approximately cost-neutral on direct productivity measures, with a long-term advantage once the downstream cost of the defects prevented is counted. It also distributes knowledge more consistently than serial code review.
Pair programming is particularly effective for:
- Novel or complex problems where two perspectives reduce design errors
- Onboarding situations where knowledge transfer is the primary goal
- High-stakes changes where the cost of defects is elevated
It is less practical for routine tasks, geographically distributed teams, and individual work styles that find real-time collaboration exhausting.
Building a Code Review Culture
Technical process alone does not produce good code review. The cultural and structural environment matters as much as the mechanics.
Psychological safety is foundational. Engineers who fear that critical feedback will damage their reputation or standing will not provide honest reviews, and engineers who receive honest feedback in an unsupportive environment will find code review demoralizing rather than helpful. Amy Edmondson's research on psychological safety applies directly: teams where it is safe to say "I don't understand this" or "I think this is wrong" have more useful reviews.
Seniority dynamics require deliberate management. Junior engineers often hesitate to question senior engineers' code, even when they have valid concerns. Explicit norms that treat code review as egalitarian — any engineer can question any other's code, and junior engineers are expected to ask about anything they don't understand — produce better reviews and accelerate learning.
Rotation and cross-team review prevent the formation of review silos in which only the author's immediate teammates ever see their code. Occasional cross-team or cross-functional reviews spread knowledge and surface the unexamined assumptions that teams accumulate collectively over time.
Code review done well is one of the most valuable engineering practices available. It catches defects cheaply, distributes knowledge, maintains shared standards, and creates the cultural artifacts — documented design decisions, visible technical debates, accessible history — that allow codebases to remain comprehensible over time. Done poorly — rubber-stamped, nitpicky, or delayed — it wastes time and creates the illusion of quality without its substance. The difference lies less in the tooling than in the culture and habits that surround it.
AI-Assisted Code Review
The emergence of AI code review tools — GitHub Copilot, Amazon CodeGuru, DeepCode/Snyk, and others — adds a new dimension to the code review landscape. These tools use machine learning to detect common bug patterns, security vulnerabilities, and style issues automatically, before human review begins.
AI-assisted review tools are most effective at:
- Security vulnerability detection: Identifying common vulnerability classes (SQL injection, insecure dependencies, exposed credentials) that can be pattern-matched against known vulnerability signatures
- Style and convention enforcement: Supplementing or replacing manual style review for teams without comprehensive automated linting
- Obvious bug patterns: Off-by-one errors, null dereferences, and resource leak patterns that appear frequently in training data
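The first of these strength areas can be illustrated with the canonical injection pattern such tools flag. The schema and data below are hypothetical; the contrast is between user input interpolated directly into SQL (flaggable by pattern matching) and a parameterized query, where the driver treats the input as a literal value:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, secret TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 's3cret')")

def lookup_unsafe(name):
    # Flaggable pattern: user input concatenated into the SQL string.
    return conn.execute(
        f"SELECT secret FROM users WHERE name = '{name}'").fetchall()

def lookup_safe(name):
    # Parameterized query: the value is bound, never parsed as SQL.
    return conn.execute(
        "SELECT secret FROM users WHERE name = ?", (name,)).fetchall()

payload = "' OR '1'='1"
assert lookup_unsafe(payload) == [("s3cret",)]  # injection returns every row
assert lookup_safe(payload) == []               # payload treated as a literal name
```

This is exactly the kind of defect where pattern matching excels: the vulnerable shape is recognizable without any understanding of what the surrounding application is for.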
They are weakest at:
- Design and architecture review: Whether this approach is the right one for the problem requires understanding context that AI tools lack
- Requirements correctness: Whether the code correctly implements the intended behavior requires understanding the requirements
- Team-specific conventions and standards: Context about this specific codebase, team history, and organizational constraints
The practical implication: AI code review tools should be treated as a first pass that reduces the burden on human reviewers by handling automatable concerns, not as a replacement for human review of design, correctness, and knowledge transfer. The most valuable aspects of code review — the judgment, the questions, the knowledge sharing — remain human work.
Frequently Asked Questions
What is code review in software development?
Code review is the systematic examination of source code by someone other than its author before it is merged into the main codebase. The reviewer checks for bugs, logic errors, security vulnerabilities, adherence to team standards, readability, and whether the implementation is the right approach for the problem. It is a standard practice in professional software development and one of the most effective known methods for improving software quality.
What does Google's research say about code review effectiveness?
Google requires code review for nearly all changes to its codebase and, in an internal survey of more than 900 engineers, found that review serves several purposes at once: verifying correctness, improving readability and maintainability, and transferring knowledge across the team. This is consistent with the broader research record: Fagan's original inspection data showed formal review removing 60-80 percent of defects before testing, and Microsoft Research studies find that review is particularly effective against logic errors, security problems, and design issues that automated tools and unit tests tend to miss.
What is the difference between synchronous and asynchronous code review?
Synchronous code review — such as pair programming or live review sessions — happens in real time with reviewer and author present together. Asynchronous code review, the dominant model for most teams, uses pull requests or code review tools where comments are left for the author to address at a different time. Asynchronous review scales better and allows more considered feedback; synchronous review enables faster iteration and richer communication for complex changes.
What is rubber stamping in code review?
Rubber stamping is the antipattern where a reviewer approves code without genuinely examining it, typically due to time pressure, social awkwardness about giving critical feedback, or a culture where challenge is discouraged. Rubber stamping provides the compliance appearance of code review without its substance, giving teams false confidence while allowing defects to pass into the codebase.
When is pair programming preferable to code review?
Pair programming is preferable to asynchronous code review when the work is particularly novel or complex, when the two people have significantly different knowledge and pairing enables real-time knowledge transfer, when rapid iteration is needed and the back-and-forth of async review would slow development, or when the team is small and communication overhead is low. Pair programming has higher upfront time cost but can produce cleaner code that requires less revision.