
Where does the rigor go?

Engineering rigor doesn't disappear when AI writes code — it moves


Here's a question that keeps surfacing in practitioner conversations about AI-assisted development, but doesn’t yet have a clear answer: if we stop caring about code, what do we start caring about instead?

 

The question isn't hypothetical. Engineers at major tech companies and financial institutions are already watching their code review processes buckle under AI-generated changesets. Pull requests are getting merged faster, metrics dashboards look great, and a growing number of senior practitioners are quietly asking whether those faster merge times are measuring confidence or just measuring capitulation.

 

One senior engineer put it bluntly in a recent closed-door discussion: larger AI-generated changesets are more likely to get a quick "looks good to me" approval — those fast approvals are then cited as proof that AI is creating value. The circularity of that logic should concern us.

 

The real insight isn't that code review is broken. It's that the rigor code review was supposed to provide doesn't vanish when you stop doing code review. It has to go somewhere; the question is where.

 

The functions we're actually losing

 

Before we can redirect rigor, we need to be honest about what code review actually does. In conversations with practitioners across multiple organizations, a consistent taxonomy emerges. Code review serves at least four distinct functions.

 

Mentorship is the one practitioners mention first and care about most. Code review is how junior engineers learn the codebase, absorb patterns and develop judgment. It's apprenticeship dressed up as a quality gate.

 

Consistency matters for maintainability. Somebody needs to ensure the codebase doesn't drift into a patchwork of competing styles, especially now that AI agents have their own opinions about how code should be structured.

 

Correctness is the function most people assume code review provides and the one practitioners are most willing to hand off to machines. As one engineer at a major tech company said plainly: "I do not want a human being to determine correctness." Automated checks, tests, and static analysis are better at this than tired humans scanning diffs at 4pm on a Friday.

 

Trust is the function nobody talks about explicitly but everyone recognizes. Code review is a social contract. It's how one engineer says to another: I've looked at your work, I understand what it does and I'm comfortable with it entering our shared codebase. It's territorial, it's political and it's deeply human.

 

The practitioners navigating this well aren't trying to preserve code review in its current form. They're decomposing it into these functions and finding new homes for each one.

 

Five places the rigor is migrating

 

No single destination is sufficient. Most teams will need some combination.

 

Upstream: spec review and plan review

 

The most common migration pattern is shifting attention from the code to whatever produced the code. One practitioner described their approach: "I focus more on pre-reviewing the plans and post-reviewing the engineering. I don't focus as much on the code itself."

 

This sounds obvious until you try it. The specification artifacts that AI coding tools generate are often harder to review than the code they produce. One practitioner went so far as to say they'd rather review code than review specs, because at least code reflects what the system actually does.

 

The truth is that spec review requires us to be good at writing specs, and we've spent thirty years getting away with being bad at it. Moving the rigor upstream means confronting that gap head-on. The most promising approach comes from teams that treat the spec as a living constraint rather than a document. Instead of reviewing a specification, they review the tests, the invariants, and the boundaries that will constrain what the AI generates.

 

Into the test suite: the safety net as first-class artifact

 

If you can't review 10,000 lines of generated code, review the 200 lines of tests that define what that code is supposed to do. This requires treating the test suite as the specification of the system, not a check on the implementation.

 

The most striking finding from practitioners experimenting with AI-assisted development is that test-driven development produces dramatically better results from AI coding agents. One practitioner reported better outcomes from TDD with agents than from any other approach, specifically because TDD prevents a failure mode that plagues agent-generated code: the agent writes a test that verifies its own broken behavior. When the human writes the test first, the agent has a fixed target to hit rather than a moving one it can redefine to match whatever it produced.
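The fixed-target idea can be sketched concretely. In the snippet below, the test cases are the human-authored artifact, committed before any implementation exists; the `slugify` function and its expected outputs are hypothetical examples, not from any particular team.

```typescript
// Human-written test cases, committed before the agent generates slugify.
// The cases are the fixed target; the agent cannot redefine "correct"
// to match whatever it happens to produce.
// (slugify and these expectations are illustrative, not a real spec.)

function slugify(title: string): string {
  // Agent-generated implementation would go here; this version
  // happens to satisfy the pre-written cases below.
  return title
    .toLowerCase()
    .trim()
    .replace(/[^a-z0-9]+/g, "-")
    .replace(/^-+|-+$/g, "");
}

// Written first, by a human:
const cases: Array<[string, string]> = [
  ["Hello, World!", "hello-world"],
  ["  Already--slugged  ", "already-slugged"],
  ["2024 Year In Review", "2024-year-in-review"],
];

for (const [input, expected] of cases) {
  const actual = slugify(input);
  if (actual !== expected) {
    throw new Error(`slugify(${JSON.stringify(input)}) = ${actual}, expected ${expected}`);
  }
}
```

Reviewing those few lines of expectations is tractable in a way that reviewing thousands of lines of generated implementation is not.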

 

That insight generalizes. Determinism is an input to agent capability, not an obstacle to it. Every constraint you give an agent makes its output more trustworthy. TDD works because it gives the agent a deterministic validation target. The same principle applies to property-based testing, formal verification and type systems.
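A property-based check makes the same point with less hand-written specification: the human fixes an invariant, and any generated implementation must satisfy it across many inputs. The sketch below hand-rolls the idea with a seeded generator rather than using a library; `clamp` and both properties are illustrative assumptions.

```typescript
// A deterministic validation target: the properties are fixed by a
// human, and a generated implementation must hold them for many
// reproducible random inputs. (clamp is a hypothetical function
// under test; a real project might use a library like fast-check.)

function clamp(x: number, lo: number, hi: number): number {
  return Math.min(hi, Math.max(lo, x));
}

// Seeded PRNG so any failure is reproducible — determinism as an
// input to trust, not an obstacle to it.
function mulberry32(seed: number): () => number {
  return () => {
    seed = (seed + 0x6d2b79f5) | 0;
    let t = Math.imul(seed ^ (seed >>> 15), 1 | seed);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

const rand = mulberry32(42);
for (let i = 0; i < 1000; i++) {
  const lo = rand() * 100 - 50;
  const hi = lo + rand() * 100; // guarantee lo <= hi
  const x = rand() * 300 - 150;
  const y = clamp(x, lo, hi);
  // Property 1: the result stays inside [lo, hi].
  if (y < lo || y > hi) throw new Error(`out of range: clamp(${x}, ${lo}, ${hi}) = ${y}`);
  // Property 2: clamping is idempotent.
  if (clamp(y, lo, hi) !== y) throw new Error("clamp is not idempotent");
}
```

The properties survive any rewrite of the implementation, which is exactly what you want when an agent may regenerate the function wholesale.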

 

Into the type system: constraining what agents can produce

 

Agents make fewer dangerous mistakes in languages that constrain them more. TypeScript has become the default language for AI-assisted development not because agents are better at TypeScript, but because developers feel safer when the type system catches errors that would otherwise slip through. The type checker is doing work that used to fall to code review.
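One way the type checker absorbs review work is by making invalid states unrepresentable. In the hypothetical `Payment` type below, an agent cannot produce a "settled" payment without a settlement timestamp, and cannot add a new status without handling it everywhere — the compiler rejects the diff before any human reads it.

```typescript
// A sketch of the "type checker as reviewer" idea.
// (Payment and its statuses are hypothetical.)

type Payment =
  | { status: "pending"; amount: number }
  | { status: "settled"; amount: number; settledAt: Date }
  | { status: "failed"; amount: number; reason: string };

function describe(p: Payment): string {
  switch (p.status) {
    case "pending":
      return `pending: ${p.amount}`;
    case "settled":
      return `settled at ${p.settledAt.toISOString()}`;
    case "failed":
      return `failed: ${p.reason}`;
    default: {
      // If an agent adds a new status and forgets to handle it here,
      // this assignment becomes a compile error, not a production surprise.
      const unreachable: never = p;
      return unreachable;
    }
  }
}
```

The discriminated union plus the `never` exhaustiveness check are doing, mechanically and on every edit, a job that a human reviewer used to do by reading the diff.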

 

Several practitioners reported noticeably better results from agents working in Rust. One put it in parental terms: "It's a language where I feel safe that they cannot hurt themselves." Rust's compiler is aggressive about preventing entire categories of bugs, and that aggressiveness turns out to be exactly what you want when an AI is producing the code.

 

Into risk maps: knowing where to look

 

One of the most provocative reframings to emerge from these conversations is that engineering is becoming risk management rather than code production. "I don't care that much about producing code anymore," said one senior practitioner. "I care about what are we going to do, and then how are we going to determine which code really matters."

 

Several practitioners arrived independently at an 80/20 model: roughly 80% of software gets generated with automated verification, and roughly 20% gets deep human attention because it's high-risk or high-value. The split isn't new. Organizations have always had tiers of criticality. What's new is making the tiers explicit and building tooling around them. Practitioners described wanting heat maps that overlay risk profiles with change frequency. Where is the code changing? How critical is that area? Those tools exist in primitive form. They need to become standard engineering infrastructure.
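A primitive version of that heat map is easy to sketch: overlay a human-assigned criticality tier on mechanical change frequency, and route only the hot intersection to deep human review. The paths, tiers, churn numbers, and threshold below are illustrative assumptions, not a standard.

```typescript
// Heat = criticality x churn. Criticality is the organizational input
// (senior engineers mapping business risk onto the codebase); churn is
// mechanical (e.g. commits per file from `git log` over a quarter).
// All values here are hypothetical.

type Tier = 1 | 2 | 3; // 3 = processes money, 1 = plumbing

const criticality: Record<string, Tier> = {
  "payments/ledger.ts": 3,
  "payments/retry.ts": 3,
  "ui/theme.ts": 1,
  "reports/export.ts": 2,
};

const churn: Record<string, number> = {
  "payments/ledger.ts": 41,
  "payments/retry.ts": 3,
  "ui/theme.ts": 58,
  "reports/export.ts": 12,
};

const heat = Object.keys(criticality)
  .map((path) => ({ path, score: criticality[path] * (churn[path] ?? 0) }))
  .sort((a, b) => b.score - a.score);

// The top slice gets deep human attention; automated verification
// covers the rest. The threshold is a tuning knob, not a constant.
const needsHumanReview = heat.filter((h) => h.score >= 60).map((h) => h.path);
// Note: ui/theme.ts has the highest churn (58 commits) but scores 58
// as tier-1 plumbing, while payments/ledger.ts scores 123 — churn
// alone would send reviewers to the wrong place.
```

The scoring is trivial; the hard, organizational part is the `criticality` map itself, which is exactly the point the practitioners were making.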

 

The hard part isn't technical. Software all looks the same from the outside. You can't tell by looking at the code which parts are plumbing and which parts process billions of dollars in transactions. Risk tiering requires senior engineers to map business criticality onto the codebase, and that's organizational work as much as it is technical work.

 

Into continuous comprehension: ensemble practices and architectural retrospectives

 

The final migration point is the most counterintuitive: instead of reviewing code at a gate, understand the code continuously.

 

Teams that practice ensemble programming report that they've never done code reviews and don't need to start now. If understanding the system matters, you do it all the time. You don't do it in little phases where you have your code review gate. These teams advocate for weekly architectural retrospectives where the entire team stops, looks at the codebase and asks: are we going the right direction?

 

Whether the specific practice scales matters less than the principle underneath it: humans need ongoing, continuous understanding of what's being built. You can't defer that understanding to a gate at the end and expect it to work. Not when the rate of change is accelerating beneath you.

 

The rigor has to go somewhere

 

In the early days of compilers, programmers reviewed the binary output to verify that the compiler accurately reflected their intent. That sounds absurd now, but it took decades to build enough confidence in compilers to stop checking their work. We're in the same position with AI code generation.

 

As one practitioner framed it: "When you check English into the repository and there is no code in your repository, then you do not have to look at code anymore." We're not there yet. Until we are, the rigor that used to live in code review has to be consciously, deliberately redistributed into specs, tests, type systems, risk maps and continuous comprehension practices.

 

The mistake would be to let it dissipate. Faster merge times feel like progress. Bigger changesets feel like productivity. But if the rigor isn't going somewhere specific and intentional, it's just evaporating. Make sure you know where yours went.

Disclaimer: The statements and opinions expressed in this article are those of the author(s) and do not necessarily reflect the positions of Thoughtworks.
