
Can vibe coding produce production-grade software?

We've been discussing the concept of 'vibe coding' a lot at Thoughtworks recently. But can it actually be used to write software that we put out into the world? Prem Chandrasekaran did three experiments to see what would work and what wouldn't.

The idea of letting an AI write production-grade code can stir both fascination and doubt. Some see the promise of near-instant productivity — code at the click of a button — while others worry about unleashing legions of barely readable and unmaintainable scripts into our codebases. As practitioners who have spent countless hours refining standards of “good” code, we approached this debate with a mix of curiosity and caution.

 

We set out to test a simple hypothesis: Can an AI build a non-trivial application from scratch — without any code written by us — and still produce something that humans can maintain? We ran three experiments to explore this question. In each case, we interacted with the AI to steer the build, but the expectations we set were markedly different. In one experiment, we embraced a freeform, improvisational style — what AI researchers like Andrej Karpathy have called vibe coding — where we focused purely on functionality and expressed little about how the system should be structured. In the others, we were far more deliberate — prescribing design heuristics, setting expectations around modularity and testability and reinforcing quality through continuous feedback.

 

To ground our experiments, we chose to build an application we called the System Update Planner — a tool for managing software updates and patch deployments across a fleet of devices. It allows users to define which packages need updating, plan staged rollouts in batches and track execution status down to individual devices. The domain spans software package versioning, device state management, batch sequencing and real-time monitoring. In short, it’s neither a “Hello World” exercise nor an overwhelmingly massive enterprise system — just right for running a realistic test of AI-generated code quality.
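
To make that domain concrete, here is a minimal sketch in Python of how the core entities might be modeled. The names and shapes are ours, chosen purely for illustration; they are not taken from any of the codebases the AI generated.

```python
# Hypothetical domain sketch of the System Update Planner (illustration only).
from dataclasses import dataclass, field
from enum import Enum


class DeviceUpdateStatus(Enum):
    PENDING = "pending"
    IN_PROGRESS = "in_progress"
    SUCCEEDED = "succeeded"
    FAILED = "failed"


@dataclass(frozen=True)
class PackageUpdate:
    """A software package that should be brought to a target version."""
    name: str
    target_version: str


@dataclass
class Device:
    """A managed device and the package versions it currently reports."""
    device_id: str
    installed_versions: dict[str, str]  # package name -> reported version


@dataclass
class RolloutBatch:
    """One staged slice of the fleet, tracked down to individual devices."""
    sequence: int
    devices: list[Device]
    status_by_device: dict[str, DeviceUpdateStatus] = field(default_factory=dict)


@dataclass
class UpdatePlan:
    """Which packages to update, rolled out in ordered batches."""
    packages: list[PackageUpdate]
    batches: list[RolloutBatch]
```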

 

The contrast in approach gave us a chance to examine the role of human intent and engineering discipline in shaping AI-authored systems. In the sections that follow, we walk through these experiments, the decisions we made and what they reveal about the possibilities — and limits — of using AI to build software that meets production-grade standards.

What do we mean by "production-grade" software?

 

Before diving into the experiments, it’s worth acknowledging that production-grade is not a universally defined term. If there were a single, widely accepted definition, many real-world codebases might struggle to meet it.

 

In practice, the term reflects a combination of engineering judgment, organizational context and lived experience. For the purposes of this article, we don’t use production-grade to imply perfection, but rather to describe software that's robust, maintainable and ready to be deployed and evolved responsibly in real-world conditions.

 

To keep things grounded, we settled on a rough mix of qualitative indicators and quantitative signals that experienced developers often associate with production-worthy code. These aren’t hard rules, but directional heuristics — enough to assess whether the software feels trustworthy, evolvable and operationally sound.

For each characteristic, we list a short description along with indicative qualitative and quantitative measures:

  • Correct: It behaves as intended, with key workflows verified, preferably through fast-running automated tests. Qualitative: edge cases are handled; no regressions during basic use. Quantitative: test pass rate near 100%; mutation score > 80%.

  • Testable: Its design supports meaningful unit, integration and end-to-end testing. Qualitative: tests are fast, focused and isolated; naming is consistent and purposeful. Quantitative: unit test coverage > 90%; no test flakiness.

  • Maintainable: The code is readable, modular and consistent enough for others to safely understand and change. Qualitative: idiomatic structure; easy onboarding for new contributors. Quantitative: low cognitive complexity; stable change rate in core modules.

  • Scalable: It’s designed with non-functional concerns like performance, security and operational robustness in mind. Qualitative: design anticipates growth; avoids excessive coupling. Quantitative: baseline performance benchmarks; graceful degradation patterns.

  • Diagnosable: It provides enough instrumentation and structural clarity to support effective troubleshooting. Qualitative: logs are meaningful and context-rich; failures are traceable. Quantitative: presence of structured logs; alert coverage for key failure paths.

  • Disciplined: It follows sound engineering practices — version control, CI, static analysis and so on. Qualitative: frequent commits with clear messages; workflow is CI-compliant. Quantitative: commits gated by CI; clean lint runs; no critical SAST issues.
To be clear: this isn’t a checklist for perfection. When we ask “can vibe coding produce production-grade software?” — this is roughly the benchmark we have in mind.

Tools and setup

 

Across all three experiments, we primarily used the Cursor IDE in agent mode, which empowers the AI to autonomously perform a wide range of tasks beyond simple code editing — such as executing command-line programs, interacting with Git repositories and leveraging additional tools we exposed through MCP servers. Agent mode is particularly suited to complex development workflows, enabling the AI to explore the codebase, read documentation, browse the web, run terminal commands and reason across sessions.

 

We also experimented briefly with JetBrains’ Junie and Windsurf (as an IntelliJ IDEA plugin). At the time of writing, both tools offered interfaces and workflows that closely resembled Cursor’s agentic approach. However, for consistency and to minimize tool-switching overhead, we conducted the bulk of our work using Cursor.

 

Initially, we relied on Cursor’s default AI model selection. As our experiments progressed, we transitioned to Claude 3.7 Sonnet for its enhanced reasoning capabilities. Eventually, we adopted Google Gemini 2.5 Pro almost exclusively, impressed by its proficiency in understanding functional requirements and engaging in critical architectural discussions.

 

To augment the AI’s capabilities, we experimented with integrating various Model Context Protocol (MCP) servers, including Sequential Thinking, Git, GitHub and Memory. These tools were intended to maintain coherent reasoning across sessions, enable direct execution of Git commands, facilitate repository interactions and provide persistent context.
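
For readers unfamiliar with the setup: Cursor registers MCP servers through an mcp.json configuration file. The snippet below is a representative configuration rather than our exact one, and the reference server packages named here may have changed since the time of writing.

```json
{
  "mcpServers": {
    "sequential-thinking": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-sequential-thinking"]
    },
    "memory": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-memory"]
    },
    "github": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-github"],
      "env": { "GITHUB_PERSONAL_ACCESS_TOKEN": "<personal access token>" }
    },
    "git": {
      "command": "uvx",
      "args": ["mcp-server-git"]
    }
  }
}
```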

 

To simulate brownfield application semantics, we started new chat sessions even though the codebase was fully indexed. This led the AI to “forget” that it had previously created the codebase, and using the Memory MCP server did not notably mitigate the problem.

 

In all experiments, we began by providing the AI with a comprehensive functional overview of the System Update Planner, ensuring it understood the system’s purpose before delving into architectural and implementation details.

 

We conducted three distinct experiments:

 

  1. Experiment 1: We provided the AI with a functional overview and let it build the application autonomously. The resulting solution was primarily written in JavaScript.

  2. Experiment 2: In addition to the functional requirements, we imposed specific implementation rules, such as adhering to test-driven development practices. This led the AI to choose TypeScript, with Prisma as the ORM.

  3. Experiment 3: We disabled all MCP servers and relied solely on Google Gemini 2.5 Pro, engaging in a conversational style akin to human collaboration that encouraged the AI to ask clarifying questions and critically assess our inputs. The final implementation was in Python.

 

Despite the variations in setup and guidance, our primary objective remained consistent: to evaluate the AI’s capability to produce production-grade software under different conditions and constraints.

Experiment one: Letting the AI take the wheel

 

In our initial experiment, we adopted a hands-off approach, providing the AI with a high-level functional description of the desired system. This is what might be considered a pure vibe coding approach. The AI impressively generated a near-working application in a single pass, showcasing its potential to rapidly translate specifications into code.

 

However, as we progressed, the limitations of this approach became evident. Subsequent modifications — such as tweaking UI output formats, introducing an interactive menu-based mode, integrating a database migration tool (Knex.js) and adding tests — proved challenging. The AI often struggled with these incremental changes, leading to regressions and necessitating manual interventions.

 

This experience aligns with observations from industry experts. For instance, in Exploring Generative AI, Birgitta Böckeler notes that while AI can generate code swiftly, it often falters when dealing with complex, evolving systems. She emphasizes the importance of well-structured, modular code to facilitate effective AI collaboration.

 

Further reinforcing this perspective, the LeadDev article How AI-generated code accelerates technical debt highlights how AI-generated code can compound technical debt if not properly managed. It underscores the necessity of human oversight to ensure code quality and maintainability.

 

Despite these limitations, it’s essential to recognize how AI lowers the barrier to entry in software development. Less technical team members can now prototype ideas more effectively, and technical teams can expedite proof-of-concept development. While the quality may vary, the ability to produce working software swiftly is a significant advancement.

 

Recognizing the need for better code quality and maintainability, we proceeded to our second experiment, where we provided the AI with additional architectural guidance and non-functional requirements to assess its performance under more structured constraints.

Experiment two: Shaping the AI with discipline 

 

In contrast to our first experiment — where the AI was given functional goals and minimal supervision — our second effort focused on instilling engineering discipline from the outset. We laid out clear expectations: follow a test-driven development (TDD) approach, make small, incremental changes, commit regularly and maintain modularity. We also asked for type safety, prompting the AI to choose TypeScript, with Prisma as its ORM — both sensible choices that aligned with our goals.

 

The early results were promising. The AI created domain models, scaffolded unit tests and committed to a trunk-based workflow with reasonable consistency. We introduced coverage thresholds to prevent regressions and added Stryker for mutation testing to validate whether our tests were meaningful. Some parts of the codebase reached near 100% mutation coverage — an encouraging sign.
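
To give a sense of what that enforcement looked like in practice, a Stryker configuration along these lines turns the mutation score into a hard gate: if the score drops below the break threshold, the run exits with an error and the build fails. The values shown are representative, not our exact settings.

```json
{
  "testRunner": "jest",
  "mutate": ["src/**/*.ts", "!src/**/*.test.ts"],
  "reporters": ["clear-text", "progress"],
  "coverageAnalysis": "perTest",
  "thresholds": { "high": 90, "low": 80, "break": 80 }
}
```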

 

But the journey wasn’t without friction.

 

Despite repeated guidance, the AI occasionally reverted to old habits — writing production code first and retrofitting tests later. This behavior appeared frequently enough that we suspect it reflects a training bias: high-quality, test-first codebases are still underrepresented in the data AI models learn from. The imbalance showed up in our commit history, where growth in test code lagged behind production code until we explicitly reinforced coverage expectations and enforced mutation checks.

 

One incident stood out. After a clean test run, we asked the AI to remove some lingering console.log statements from a functional test. A small change, in theory. But what followed was a cascade of breaking edits, test regressions and, finally, an unprompted downgrade of the Prisma dependency — from version 6.5 to 5.6. That the AI could trace the issue back to a version mismatch was impressive. That it introduced the change without cause or warning was concerning.

 

These moments are a reminder: the AI is fast and often insightful — but also occasionally erratic. The more complex the change, the more prone it is to unraveling. Guardrails like static analysis, test enforcement, mutation thresholds and human review remain essential to achieving reliability.

 

Still, this second experiment gave us hope. With enough structure and feedback, AI-assisted coding can produce systems that are not just functional, but maintainable. It’s not yet self-correcting, but it’s already good enough to augment human developers — and improving fast.

Experiment three: Conversational collaboration

 

For our third experiment, we shifted strategies again: disabling MCP servers, using only Google Gemini 2.5 Pro and engaging the AI in richer architectural conversations — just as we might with a human collaborator.

 

A particularly impressive moment came when discussing the design of device snapshots. Rather than simply implementing a snapshot API, the AI proactively raised thoughtful questions:

 

  • Should snapshots capture the full installed package state at a given point in time?

  • How should discrepancies between reported and stored states be handled?

  • Should submitting a snapshot overwrite the current state or record a historical version for auditing?

 

The AI proposed API designs, debated trade-offs (eager vs. lazy loading) and surfaced gaps in our original assumptions. These conversations demonstrated not just rote generation, but genuine architectural foresight.

 

Although occasional inconsistencies remained, the code produced — a Python/FastAPI system — was notably cleaner, more modular and more aligned with sound RESTful design principles than in earlier experiments. You can find an excerpt of this conversation here.
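
To make that concrete, here is a hypothetical Python/FastAPI sketch showing one way the snapshot questions above could be resolved, with each submission recorded as a historical version for auditing rather than overwriting the device’s current state. It is our own simplified illustration, not the code the AI generated.

```python
# Hypothetical snapshot endpoint (illustration only, not the generated code).
from datetime import datetime, timezone
from uuid import UUID, uuid4

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI()


class PackageState(BaseModel):
    name: str
    version: str


class SnapshotSubmission(BaseModel):
    # Full installed-package state reported by the device at a point in time.
    packages: list[PackageState]


class DeviceSnapshot(BaseModel):
    id: UUID
    device_id: UUID
    taken_at: datetime
    packages: list[PackageState]


# In-memory store standing in for the real persistence layer.
_snapshots: dict[UUID, list[DeviceSnapshot]] = {}


@app.post("/devices/{device_id}/snapshots", response_model=DeviceSnapshot, status_code=201)
def submit_snapshot(device_id: UUID, submission: SnapshotSubmission) -> DeviceSnapshot:
    snapshot = DeviceSnapshot(
        id=uuid4(),
        device_id=device_id,
        taken_at=datetime.now(timezone.utc),
        packages=submission.packages,
    )
    # Append a new historical version instead of overwriting the current state.
    _snapshots.setdefault(device_id, []).append(snapshot)
    return snapshot


@app.get("/devices/{device_id}/snapshots", response_model=list[DeviceSnapshot])
def list_snapshots(device_id: UUID) -> list[DeviceSnapshot]:
    if device_id not in _snapshots:
        raise HTTPException(status_code=404, detail="No snapshots recorded for this device")
    return _snapshots[device_id]
```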

Key takeaways from our vibe coding experiments

 

Across all three experiments, our overarching goal remained the same: to evaluate the AI’s ability to produce production-grade software under varying conditions and constraints. The following insights emerged:

 

  • Tool selection matters. The choice of AI model and development environment had a significant influence. Claude 3.7 Sonnet excelled at general reasoning, while Google Gemini 2.5 Pro showed superior understanding of functional requirements and critical architectural thinking.

  • Contextual awareness is limited. Even with tools like MCP servers, the AI often lacked persistent awareness across sessions. Simulating brownfield development — by starting new chats — often erased prior context, requiring re-explanation.

  • Human-AI collaboration enhances outcomes. When treated as a junior partner — asking clarifying questions, debating trade-offs and encouraging critical thinking — the AI produced significantly better-structured systems.

  • Production-grade quality requires deliberate oversight. Factors like testability, maintainability, scalability and diagnosability improved markedly when reinforced through prompts, testing strategies and reviews.

  • Experimentation is key. Different setups and prompting styles led to meaningful differences. Flexibility, iteration and critical evaluation were crucial for effective AI-assisted workflows.

     

These lessons underscore the potential of AI as a collaborative partner — but also the continued necessity of human intent and discipline to guide it toward higher-quality outcomes.

What should teams do next?

 

We’re clearly entering a new era — but navigating it will require intentional adaptation. Here’s what we recommend:

 

  • For developers: Start experimenting. Learn how to guide AI tools thoughtfully. Practice framing prompts with architectural patterns, testing strategies and coding principles. Think of the AI as a fast but inexperienced pair programmer.

     

  • For tech leads and architects: Define and enforce the guardrails. Create starter templates, reference repositories and governance policies for AI-assisted development.

  • For testers and security analysts: Use AI tools to rapidly explore test scenarios, edge cases and attack surfaces. Even if you don’t use them to generate production code, they can be powerful for prototyping and validating assumptions.

  • For product managers and business analysts: AI can help prototype business workflows, draft acceptance criteria and validate logic — all through natural language conversations. Use it to tighten the feedback loop between ideation and validation.

  • For IT leaders: Prepare your organization. AI coding tools introduce new capabilities, new risks and new costs. Support safe experimentation, update measurement frameworks and build AI fluency across engineering teams.

     

And stay mindful of real-world costs, too. During our second experiment, we exhausted our “fast premium” request quota in just a few hours. Cursor (our IDE) charges roughly $20 per 500 fast premium requests — a cost that can add up quickly for medium or large teams: at that rate, ten developers each making 200 fast requests a day would burn through around $80 a day.

A glimpse into the future of software development

 

AI tools are no longer just glorified auto-completers or snippet generators. They are increasingly capable of building coherent, end-to-end solutions — APIs, databases, tests, everything. Yes, they still make mistakes. Yes, they still require human supervision. But the bar has been raised.

 

What once took a team days can now be scaffolded by a single developer in hours — with an AI partner moving at breathtaking speed. This no longer feels like a novelty — it’s a paradigm shift in how software is built.

 

So, can AI-assisted coding produce production-grade software? Not consistently — not yet.

 

But the gap is closing. With the right architectural intent, oversight and feedback loops, AI is inching closer to becoming a reliable teammate.

 

At the same time, it’s important to acknowledge: today’s AI models do not inherently optimize for self-verifiable code — code that asserts its correctness through automated tests, assertions, or contracts. This likely reflects both the scarcity of rigorously engineered examples in training data and the early stage of toolchain evolution. Closing this gap will require deliberate effort from developers and engineering leaders.

 

Looking ahead, we may also see a shift in how we think about codebases. As AI models grow more powerful, there may be growing advantages to maintaining smaller, modular systems — ones that fit entirely into a model’s context window. Rather than endlessly patching legacy codebases, teams may prefer regenerating clean modules. Software could become less of a permanent structure, and more of a renewable asset — rebuilt easily when needed.

 

In such a world, maintainability won’t just mean writing code that lasts. It may increasingly mean writing code that’s easy to replace.

 

The future of software development may not be fully automated — but it will almost certainly be AI-assisted. The organizations that invest now in understanding how to guide, govern and integrate these tools will be tomorrow’s leaders.

 

It’s time to stop treating AI coding tools as side experiments — and start treating them as the co-creators they are quickly becoming.

Disclaimer: The statements and opinions expressed in this article are those of the author(s) and do not necessarily reflect the positions of Thoughtworks.
