
Can we use generative AI to generate test cases from user stories?


Creating manual test cases is an important yet often time-consuming part of a QA’s job. However, could generative AI help us? We’ve seen over the last year or so that AI can be immensely valuable in many different facets of software engineering — perhaps, then, it could have an impact on this critical QA task.

 

To find out, we conducted an experiment. In this blog post, we’ll explain what we did and demonstrate how effective AI can be in manual testing — as well as some caveats.

 

What brought us to this experiment?

 

As quality analyst teams face increasing pressure to deliver comprehensive test coverage within tight deadlines, manual test case creation has become a significant challenge. Traditional approaches, while thorough, are time-consuming and often struggle to ensure consistency across team members. Our hypothesis was that using AI to generate test cases would reduce test case generation time, improve test coverage (especially for exceptions and complex scenarios) and improve consistency in test case structure compared to manual creation. Success would be measured by a reduction in the time it takes to write test cases, increased test coverage and fewer edits required to the generated tests.

 

The design of our experiment

 

To systematically evaluate AI's effectiveness in test case generation, we designed a structured experiment that would allow us to measure improvements and refine our approach iteratively. Our experiment followed a controlled methodology where we compared AI-generated test cases against manually created ones using the same user stories as input. We established baseline metrics for time efficiency, test coverage, and quality consistency to ensure objective evaluation.

| Metric | Description | Measurement method |
|---|---|---|
| Correctness (%) | Percentage of test cases correctly matching requirements | QA review and scoring: (Correct test cases / Total manual test cases) × 100 |
| Acceptance criteria coverage (%) | Coverage of acceptance criteria by LLM test cases | Requirement traceability matrix: (Covered AC / Total AC) × 100 |
| Duplication rate (%) | Percentage of redundant test cases | Manual review: (Duplicate test cases / Total LLM test cases) × 100 |
| Incorrect test cases (%) | Percentage of incomplete or irrelevant test cases | Review flagging: (Incorrect test cases / Total LLM test cases) × 100 |
| Ambiguity score (%) | Percentage of test cases with unclear steps | QA ambiguity review: (Ambiguous test cases / Total LLM test cases) × 100 |
| Consistency score (%) | Adherence to structured format templates | Structure analysis: (Well-formatted test cases / Total LLM test cases) × 100 |
| Prompt optimization impact (%) | Improvement after prompt refinement | Comparative scoring: [(Improved score - Initial score) / Initial score] × 100 |
| Time efficiency (%) | Time reduction versus manual creation | Time tracking: [(Manual time - LLM time) / Manual time] × 100 |
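
To make the measurement formulas concrete, here is a minimal Python sketch that computes several of the metrics above from review counts. The dataclass fields and the example counts are hypothetical, purely for illustration; they are not our tooling or our results.

```python
from dataclasses import dataclass

@dataclass
class ReviewCounts:
    """Hypothetical tallies collected during QA review of one user story."""
    total_llm_cases: int        # test cases produced by the LLM
    total_manual_cases: int     # baseline test cases written by hand
    correct_cases: int          # LLM cases judged correct against requirements
    covered_ac: int             # acceptance criteria covered by LLM cases
    total_ac: int               # acceptance criteria in the user story
    duplicate_cases: int        # redundant LLM cases
    ambiguous_cases: int        # LLM cases with unclear steps
    well_formatted_cases: int   # LLM cases matching the structure template
    manual_minutes: float       # time to write the manual suite
    llm_minutes: float          # time to generate and review the LLM suite

def pct(numerator: float, denominator: float) -> float:
    return round(100.0 * numerator / denominator, 2) if denominator else 0.0

def metrics(c: ReviewCounts) -> dict[str, float]:
    return {
        "correctness": pct(c.correct_cases, c.total_manual_cases),
        "ac_coverage": pct(c.covered_ac, c.total_ac),
        "duplication_rate": pct(c.duplicate_cases, c.total_llm_cases),
        "ambiguity_score": pct(c.ambiguous_cases, c.total_llm_cases),
        "consistency_score": pct(c.well_formatted_cases, c.total_llm_cases),
        "time_efficiency": pct(c.manual_minutes - c.llm_minutes, c.manual_minutes),
    }

# Made-up counts for one story, only to show the calculation.
print(metrics(ReviewCounts(30, 25, 22, 14, 15, 2, 6, 29, 180.0, 35.0)))
```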

Rather than limiting our experiment to a single AI tool, we evaluated multiple GenAI platforms, including ChatGPT, GitHub Copilot, Glean, and Claude, to understand how different tools approach test case generation.

 

Each platform or tool exhibited distinctive strengths, shaped by its underlying model architecture and training data sources.

  • ChatGPT, in our experience, excelled at generating comprehensive test cases and showed superior performance in handling complex, ambiguous requirements, although it sometimes generated outputs that weren’t relevant to the current context.

  • GitHub Copilot demonstrated strong consistency and clarity but was limited when it came to understanding complex business logic. The test cases it generated tended to be quite general and abstract, with few details or specific examples.

  • Glean provided strong real-time optimization capabilities and maintained a low duplication rate. However, it sometimes struggled to identify exceptions and edge cases.

  • Claude surprised us with its efficiency in creating test cases: the generated test cases were highly detailed, with a clear, well-structured format, and accompanied by specific, relevant test data.

 

This multi-tool evaluation ultimately provides valuable insights into the current capabilities of GenAI and informs our recommendations for optimal prompt instructions and general guidelines to help create effective test cases.

 

The iterative refinement process

 

In our work with AI to generate test cases, we found that generating test cases from user stories is not a one-off task: it’s a cycle. Each iteration of creation, review and improvement makes the output better and, most importantly, delivers more real-world value. Here’s how we built and refined our approach.

 

Initial generation: Start with a basic prompt

 

To begin, we use a simple prompt to ask the AI to generate test cases from our prepared user stories. The goal of this step isn’t to produce a perfect result but to establish a foundation for understanding how the AI “thinks” and responds to the very first request.

 

Example of an initial prompt:

"Create test cases for the following user story: [user story]."

This approach provided us with insight into how the AI interprets and processes information to generate test cases.
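
As a minimal sketch, this is roughly how the basic prompt can be sent to a model programmatically. Here we assume the OpenAI Python SDK and a placeholder user story; any of the tools we evaluated can be driven in a similar way through its own interface.

```python
from openai import OpenAI  # assumes the OpenAI Python SDK is installed and OPENAI_API_KEY is set

client = OpenAI()

user_story = """
As a registered user, I want to reset my password via email
so that I can regain access to my account.
Acceptance criteria:
1. A reset link is emailed to the registered address.
2. The link expires after 24 hours.
"""  # placeholder story, purely for illustration

# The basic, unrefined prompt from the first iteration.
response = client.chat.completions.create(
    model="gpt-4o",  # assumed model name; any capable chat model works here
    messages=[
        {"role": "user",
         "content": f"Create test cases for the following user story: {user_story}"},
    ],
)

print(response.choices[0].message.content)
```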

 

Comprehensive assessment: Look beyond content and evaluate structure

 

Next, we assessed each generated output. We were not only interested in “Did the AI write it correctly?” but also explored deeper questions about the content and structure:

  • Are special cases or boundary conditions missed?

  • Does the AI follow the desired format (e.g., Gherkin, a table or a clean list)?

  • Are the test cases usable, reproducible and maintainable?

 

This evaluation often required collaboration across the QA team, using the specific metrics above to determine the effectiveness of the generated test cases.
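
Part of this structural review can be automated. Below is a minimal sketch, assuming the team asked for Gherkin output, that flags generated test cases missing a Given/When/Then skeleton; the parsing rules are deliberately simplistic and only illustrate the idea.

```python
import re

# Each generated test case is treated as one block of text.
GHERKIN_KEYWORDS = ("Given", "When", "Then")

def is_well_structured(test_case: str) -> bool:
    """Very rough structure check: every Gherkin keyword starts at least one line."""
    return all(re.search(rf"^\s*{kw}\b", test_case, re.MULTILINE) for kw in GHERKIN_KEYWORDS)

def consistency_score(test_cases: list[str]) -> float:
    """Share of test cases matching the expected Given/When/Then structure."""
    if not test_cases:
        return 0.0
    well_formatted = sum(is_well_structured(tc) for tc in test_cases)
    return round(100.0 * well_formatted / len(test_cases), 2)

# Two hypothetical outputs: one structured, one not.
cases = [
    "Given a registered user\nWhen they request a password reset\nThen a reset email is sent",
    "Check that password reset works",
]
print(consistency_score(cases))  # 50.0
```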

 

Rapid improvement: Adjusting prompts to “teach” the AI

 

After identifying the strengths and weaknesses of the first round of generation, we begin to refine the prompt. Each small change to the prompt can bring a noticeable improvement across the whole output.

 

Here are some examples of prompt adjustments we’ve used:

“Add the test data to the test cases.”

→ This helps the AI generate test cases with concrete test data, making them more detailed.

“Write in Gherkin (Given/When/Then).”

→ This makes integrating with frameworks like Cucumber or Behave easier (see the sketch after this list).

“Think like a senior quality analyst with 10 years of experience.”

→ This helps the AI focus on deeper analysis angles, rather than just listing simple steps.

“Check for boundary conditions and exceptions.”

→ This ensures the test cases created cover a wide range of boundary conditions and exceptions.
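
To illustrate the Gherkin point above: once the AI returns scenarios in Given/When/Then form, they can be wired into a BDD framework with ordinary step definitions. The feature text and step implementations below are a hypothetical sketch using Behave, not output from our experiment.

```python
# features/steps/password_reset_steps.py
# Hypothetical Behave step definitions for an AI-generated Gherkin scenario such as:
#
#   Scenario: Reset link is emailed
#     Given a registered user with email "user@example.com"
#     When the user requests a password reset
#     Then a reset link is sent to "user@example.com"

from behave import given, when, then

@given('a registered user with email "{email}"')
def step_registered_user(context, email):
    context.user = {"email": email}      # stand-in for real test setup

@when("the user requests a password reset")
def step_request_reset(context):
    # Stand-in for calling the application under test.
    context.sent_to = context.user["email"]

@then('a reset link is sent to "{email}"')
def step_reset_link_sent(context, email):
    assert context.sent_to == email
```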

 

Iterate and reassess: Optimizing in a cycle

 

We don’t stop at a single improvement. Every iteration starts with applying the newly refined prompt, collecting the output and evaluating its quality again. We keep refining the prompt until we reach a quality standard the QA team can accept.
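
In pseudocode terms, the cycle looks roughly like the sketch below. The three helpers are trivial stubs standing in for the GenAI call, the QA review against our metrics and the prompt adjustment; only the shape of the loop matters.

```python
def generate_test_cases(prompt: str, user_story: str) -> list[str]:
    # Stub: pretend that richer prompts yield more test cases.
    instructions = [s for s in prompt.split(".") if s.strip()]
    return [f"Test case {i + 1} for: {user_story}" for i in range(len(instructions))]

def evaluate(test_cases: list[str], user_story: str) -> float:
    # Stub for the QA review using the metrics defined earlier.
    return min(100.0, 60.0 + 10.0 * len(test_cases))

def refine_prompt(prompt: str, test_cases: list[str], score: float) -> str:
    # Stub: in practice this is where we add instructions such as test data or Gherkin format.
    return prompt + " Check for boundary conditions and exceptions."

def refine_until_acceptable(user_story: str, prompt: str,
                            quality_threshold: float = 90.0,
                            max_iterations: int = 5) -> list[str]:
    test_cases: list[str] = []
    for _ in range(max_iterations):
        test_cases = generate_test_cases(prompt, user_story)   # call the GenAI tool
        score = evaluate(test_cases, user_story)               # review against the metrics
        if score >= quality_threshold:
            break                                              # acceptable to the QA team
        prompt = refine_prompt(prompt, test_cases, score)      # adjust instructions and retry
    return test_cases

print(refine_until_acceptable("Password reset via email", "Create test cases for this story."))
```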

 

This process can take significant time in the early stages; gradually, though, the AI’s output becomes more accurate, saving effort and shortening test case development time.

 

The iterative approach not only helped us build high-quality test cases, but it was also a great lesson in how to work with AI. Because we didn’t expect AI to be perfect from the start, we learned how to guide and shape its output through continuous feedback and experimentation.

 

What insights did we gather from the experiment?

 

After conducting our comprehensive experiment on using AI for manual test case generation, we gathered valuable insights that paint a picture of both the potential and limitations of AI in software testing. Our findings reveal a promising future in applying AI-generated test cases while highlighting gaps that still require human expertise.

 

The speed advantage

 

Perhaps the most striking discovery from our experiment was the dramatic time savings that AI-generated test cases can provide. In our trials, AI created initial test cases dramatically faster than manual writing, with an average time efficiency of 80.07%: teams generated comprehensive test suites in minutes rather than hours or days. This advantage is particularly valuable in agile development, where rapid iteration is essential, because it frees up time for test refinement, execution and analysis.

 

Consistency as a hidden strength

 

In addition to its speed, we found GenAI excels at maintaining consistent formatting and structure across test cases. Our consistency score averaged 96.11%, significantly higher than what we typically saw with manual approaches. This consistency proved more valuable than initially anticipated, as it greatly improves test case readability and management. When test cases follow a uniform structure, teams can more easily review, maintain, and scale their testing efforts.

 

Hitting the mark for simple functional coverage

 

For straightforward user stories with clear functional requirements, our experiment showed that GenAI performs impressively in covering the main functional areas. The AI achieved 98.67% acceptance criteria coverage and maintained a duplication rate of only 4.22%. It demonstrated a solid understanding of basic functional testing needs and rarely produced redundant or incorrect test cases when working with well-defined user stories and simple requirements.

 

The limitations

 

However, our experiment also revealed several significant limitations teams must carefully consider when implementing AI-generated test cases.

 

Input quality dependency

 

The most significant limitation we discovered was the AI's heavy reliance on input quality. GenAI operates as a sophisticated text processor, generating test cases based primarily on what's explicitly written rather than understanding the underlying business logic or system dependencies. When requirements are ambiguous, incomplete, or lack business context, the AI will miss key testing scenarios that a human tester would naturally consider.

 

Limited application in advanced testing techniques

 

Our experiment revealed that GenAI encounters significant challenges when dealing with complex business logic or advanced testing scenarios. Even when explicitly instructed to apply sophisticated testing techniques such as boundary value analysis, decision tables, or state transition testing, the AI is often confused about how to implement these methods correctly or comprehensively.

 

High-level test steps and missing details

 

Another finding was the tendency for AI-generated test cases to provide test steps that were too general or unclear. While the AI was able to identify what needed to be tested, it often struggled to provide the detailed, step-by-step instructions testers need in order to execute the tests. Our ambiguity score averaged 27.22%, indicating that roughly one-quarter of generated test cases required clarification or additional detail. Key inputs, expected outputs and specific validation criteria were often missing or not fully specified.

 

Non-functional testing gap

 

Our experiment confirmed that GenAI has a strong bias toward functional testing, typically overlooking non-functional requirements such as performance, security, usability and reliability testing. This creates a significant gap in test coverage that must be addressed through human intervention or specialized prompting.

 

The irreplaceable human element

 

Despite the impressive capabilities we observed, our experiment reinforced that manual review and enhancement remain absolutely essential. Even the fastest AI-generated test cases require human oversight to ensure accuracy, completeness, and alignment with business objectives.

 

The human role will likely evolve from test case creation to test case curation, requiring testers to develop new skills in AI prompt engineering, output evaluation, and strategic test planning.

 

Strategic implications for testing teams

 

These insights suggest the most effective approach to AI-assisted test case generation is to treat GenAI as a powerful productivity tool rather than a complete replacement for human expertise. Teams should leverage AI for rapid initial test case generation while maintaining robust human oversight for refinement, validation and strategic testing decisions.

 

The experiment demonstrates that organizations can achieve significant efficiency gains by combining AI speed with human insight, creating a hybrid approach that maximizes the strengths of both artificial and human intelligence in software testing.

 

Disclaimer: This experiment represents our team's specific experience with particular AI tools and use cases. Results may vary significantly based on tool versions, domain complexity, team expertise, and implementation approach.

 

Disclaimer: The statements and opinions expressed in this article are those of the author(s) and do not necessarily reflect the positions of Thoughtworks.
