Fuzz testing is a software testing technique that’s been around for some time. But, despite being nearly forty years old, the technique hasn’t been widely adopted by software development teams. While it’s commonly used in specialist fields like penetration testing, it’s often viewed as somewhat marginal by the industry mainstream.
However, in an age of increasing AI adoption, fuzz testing is perhaps more relevant than it's ever been. Its ability to introduce unpredictability into systems and applications is well-suited to a time when non-deterministic software has become all the rage. In this blog post we'll take a look at what it is and why it might be more valuable today than ever before as generative AI continues to eat the software landscape.
What is fuzz testing?
Fuzz testing — or fuzzing as it’s sometimes known — is an automated software testing technique where unexpected or invalid inputs are used as a way to uncover bugs or vulnerabilities. It’s usually done at scale, too, with many hundreds of permutations of unexpected inputs. It’s a way of stress testing an application to see if it may behave in unexpected ways when faced with data it wasn’t designed to encounter and process.
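At its simplest, fuzzing is just a loop: generate an input the developer never anticipated, throw it at the target, and watch for unexpected failures. The sketch below is a minimal, illustrative example — `parse_age` is a made-up target with a deliberately planted bug of the kind fuzzing tends to find (it assumes its input is non-empty):

```python
import random
import string

random.seed(1234)  # fixed seed so the run is reproducible

def parse_age(raw: str) -> int:
    # Toy target standing in for real application code. It assumes
    # well-formed input -- exactly the assumption fuzzing violates.
    raw = raw.strip()
    if raw[0] == "+":  # bug: IndexError when the input strips to ""
        raw = raw[1:]
    value = int(raw)
    if value < 0:
        raise ValueError("age cannot be negative")
    return value

def random_input(max_len: int = 20) -> str:
    # Draw from printable characters plus a couple of control characters.
    alphabet = string.printable + "\x00\x1b"
    return "".join(random.choice(alphabet) for _ in range(random.randint(0, max_len)))

def fuzz(target, iterations: int = 1000):
    # Hammer the target with random inputs; record unexpected exceptions.
    failures = []
    for _ in range(iterations):
        data = random_input()
        try:
            target(data)
        except ValueError:
            pass  # expected rejection of malformed input
        except Exception as exc:
            failures.append((data, type(exc).__name__))
    return failures

crashes = fuzz(parse_age)
```

Even this naive loop reliably surfaces the empty-input crash within a thousand iterations — the "at scale" part of fuzzing is what turns improbable inputs into inevitable ones.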
Fuzz testing can offer development teams a level of insight into application behavior and security beyond more static approaches like static application security testing (SAST) or software composition analysis (SCA). While SAST and SCA address bugs and risks in source code, fuzzing tests input fields or parameters in deployed web applications. Even the most rigorous SAST scans can't tell you how your application will actually behave when it meets real-world inputs.
Fuzz testing does share some similarities with dynamic application security testing (DAST), but there are also some subtle differences. DAST typically involves simulating attacks to identify known vulnerabilities, whereas fuzz testing excels at surfacing unknown issues that scripted tests are unlikely to uncover.
It’s important to note that fuzz testing isn’t an alternative to other testing techniques. Really, it should be used alongside other approaches when and where relevant. If edge cases are a particular concern for the resilience and security of software, fuzzing can be a really valuable addition to a development team’s security practices.
Different types of fuzz testing
It's also worth noting that fuzz testing isn't a single technique but instead encompasses a number of related approaches. They can be distinguished by:
Existing knowledge of the software you’re testing.
Black box fuzzing, where tests are done without any knowledge of how the target software is working (this is probably the most common type of fuzzing in the industry today, usually done by pentesters or bug bounty hunters).
White box fuzzing, where you’ll have a much deeper and detailed knowledge of the software.
Gray box fuzzing, which sits somewhere in between.
How the fuzzing data is generated.
Mutation (or “dumb”) fuzzing, where input data is changed (i.e. mutated) indiscriminately (for example, iterating over a static list like the Big List of Naughty Strings).
Generation (or “smart”) fuzzing, where input data is more considered — based on the specifics of the format or protocol. Resources like FuzzDB catalog fuzzable inputs by use case for exactly this purpose.
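The distinction between mutation and generation fuzzing is easy to see in code. The sketch below is purely illustrative: the mutation function flips random bytes in a valid sample with no understanding of the format, while the (hypothetical) generation function uses knowledge of HTTP header syntax to build inputs that are structurally valid but deliberately hostile:

```python
import random

random.seed(7)  # reproducible for illustration; real fuzzers vary the seed

def mutate(seed: bytes, flips: int = 3) -> bytes:
    # "Dumb" mutation fuzzing: overwrite a few random bytes in a
    # known-good sample, with no knowledge of what the bytes mean.
    data = bytearray(seed)
    for _ in range(flips):
        pos = random.randrange(len(data))
        data[pos] = random.randrange(256)
    return bytes(data)

def generate_http_header() -> bytes:
    # "Smart" generation fuzzing: build inputs from knowledge of the
    # format -- here, syntactically valid but adversarial header lines.
    name = random.choice([b"Host", b"Content-Length", b"X-" + b"A" * 5000])
    value = random.choice([b"-1", b"9" * 30, b"\r\nInjected: yes", b""])
    return name + b": " + value + b"\r\n"

seed = b"GET /index.html HTTP/1.1\r\n"
mutated = mutate(seed)
generated = generate_http_header()
```

Mutation is cheap and format-agnostic; generation requires more up-front work but tends to reach deeper into parsing logic because its inputs get past the first validation layer.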
Ultimately, the right approach to use will be context-specific. It will depend on the type of software you’re testing, what you’re trying to find out and what risks you’re trying to mitigate — not to mention delivery constraints and commercial demands.
Why is fuzz testing still a marginal technique?
Fuzz testing is marginal in software development for the simple reason that it's ultimately chaotic and, used in the wrong contexts, dangerous: pointed at a live environment, it can crash services and corrupt data you can't afford to lose. It's understandable, then, that many teams just don't see it as a technique worth pursuing.
Indeed, there are a number of challenges and drawbacks to fuzzing:
It can be complex and time-consuming to set up fuzzing environments.
Its complexity and resource requirements also don't scale well, making it challenging to run fuzz tests on particularly large applications and systems.
It doesn’t necessarily deliver consistent, conclusive or comprehensive results.
Although automated in implementation, it nevertheless needs human oversight and expertise to analyze and assess results.
It's not relevant for every piece of software: fuzz testing is most useful for uncovering vulnerabilities in input validation and error handling. If your software does little of either, fuzzing will find little — and even where it does apply, you still need other tests to find other kinds of bugs and vulnerabilities (in the code itself, for instance).
These issues are why, as we’ve already noted, it has very much been a specialist technique used by practitioners in fields like pentesting, where pushing systems and applications to their limits is a fundamental goal.
However, there’s a good argument that in today's software landscape — dominated by generative AI and LLMs — innovative and idiosyncratic approaches to testing need to be explored by more than just specialists.
Maybe it's time for software developers to rediscover the benefits and power of fuzzing.
How can AI power effective fuzz testing?
To some extent, AI — and generative AI specifically — can help make fuzz testing more accessible to teams that might not usually consider it in their testing toolkit. Generating novel combinations of words or other sequences at scale is, after all, the foundation of fuzzing, and we know generative AI is exceptionally good at that.
This work is being done now: in November 2024 a team at Google wrote a blog post outlining how its open source fuzz testing platform OSS-Fuzz found a number of significant vulnerabilities in open source projects by leveraging generative AI. The piece explains that the team has been experimenting with what it calls “AI-powered fuzzing”, where the generative capabilities of LLMs are used to create fuzzing cases (in other words, the random and unexpected inputs) to improve test coverage.
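Conceptually, the LLM's role in such a workflow is to seed the fuzzer with inputs a random generator would be unlikely to stumble on. The sketch below is a loose illustration of that idea, not Google's actual implementation: `llm_generate_cases` is a hypothetical stand-in for a real LLM call, here returning a canned list of awkward integer-like strings so the example stays runnable, and the campaign mixes those seeds with random mutations:

```python
import random

random.seed(5)

def llm_generate_cases(target_description: str) -> list:
    # Hypothetical stand-in for an LLM call. A real implementation would
    # prompt a model with the fuzz target's source code and ask for
    # inputs likely to reach unusual code paths.
    return ["", "0", "-1", "999999999999999999999", "NaN", "١٢٣", "1e308"]

def run_campaign(target, seeds, iterations: int = 200) -> dict:
    # Mix the model-proposed seeds with random single-character mutations.
    results = {"ok": 0, "rejected": 0, "crashed": 0}
    for _ in range(iterations):
        case = random.choice(seeds)
        if case and random.random() < 0.5:
            i = random.randrange(len(case))
            case = case[:i] + chr(random.randrange(32, 127)) + case[i + 1:]
        try:
            target(case)
            results["ok"] += 1
        except ValueError:
            results["rejected"] += 1
        except Exception:
            results["crashed"] += 1
    return results

# Fuzz Python's built-in int() parser with the seeded corpus.
stats = run_campaign(int, llm_generate_cases("parse an integer"))
```

The value of the seeds is coverage: strings like Arabic-Indic digits or huge numerals exercise parsing branches that uniformly random bytes almost never reach.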
The benefits of such an approach are about more than just test generation and coverage. The team notes that LLMs can also be useful in the triaging and analysis steps to identify actual vulnerabilities. They write that “an LLM can be prompted with the relevant context (stacktraces, fuzz target source code, relevant project source code) to perform this triage.”
Although generative AI can clearly play a valuable role in fuzz testing — as the team at Google demonstrated — other AI techniques and approaches can be helpful too.
In 2024, for instance, a team from Northwestern University won first place at an international fuzz testing conference with their AI-backed tool BandFuzz. BandFuzz doesn’t use generative AI but instead reinforcement learning to select the most effective fuzzing strategy for a given situation.
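To give a flavor of the reinforcement learning idea (this is a generic illustration, not BandFuzz's actual algorithm), strategy selection can be framed as a multi-armed bandit problem: each fuzzing strategy is an "arm", its reward is whether a run found something new, and an epsilon-greedy policy gradually concentrates effort on whichever strategy pays off. All the names and reward numbers below are invented for the sketch:

```python
import random

random.seed(1)

STRATEGIES = ["bitflip", "dictionary", "splice", "havoc"]

# Hidden "true" effectiveness of each strategy on an imaginary target.
# A real tool measures this reward, e.g. as new coverage per run.
TRUE_REWARD = {"bitflip": 0.10, "dictionary": 0.40, "splice": 0.20, "havoc": 0.30}

def run_strategy(name: str) -> float:
    # Simulated fuzzing run: 1.0 means it discovered new coverage.
    return 1.0 if random.random() < TRUE_REWARD[name] else 0.0

counts = {s: 0 for s in STRATEGIES}
totals = {s: 0.0 for s in STRATEGIES}

def choose(epsilon: float = 0.1) -> str:
    # Epsilon-greedy bandit: mostly exploit the best-known strategy,
    # occasionally explore the others.
    if random.random() < epsilon or not any(counts.values()):
        return random.choice(STRATEGIES)
    return max(STRATEGIES, key=lambda s: totals[s] / counts[s] if counts[s] else 0.0)

for _ in range(2000):
    s = choose()
    counts[s] += 1
    totals[s] += run_strategy(s)
```

Over a few thousand iterations the policy learns to spend most of its budget on the most productive strategy — which is exactly the resource-allocation problem that makes fuzzing large systems expensive.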
There's no doubt we're going to continue to see AI integrated into fuzz testing. Given the long-standing challenges of doing it effectively, particularly in terms of time, using AI both to generate test cases and to triage issues should make fuzz testing more accessible and attractive to software development and security teams.
How can fuzz testing stress test AI applications and systems?
Generative AI is, at one level at least, a technology of inputs and outputs. Prompting, after all, has become a discipline of its own in a matter of months. But there’s of course much more at work: AI systems consist of a range of different components interacting with one another, from varied sources of data to algorithmic models.
These interlocking parts, and the inherent unpredictability of these systems, make testing challenging yet critical. This is where fuzz testing may help: because the technique uses randomness and unpredictability as its modus operandi, it can uncover weaknesses and vulnerabilities inside similarly unpredictable and non-deterministic technology.
In fact, this isn’t hypothetical; fuzz testing is being used in the context of AI testing today. For instance, one company is using it to identify vulnerabilities in image recognition models by injecting corrupted or adversarial images. Researchers have also written about the value of fuzzing AI systems; in this paper, the authors discuss using fuzz testing in a Python project. “By applying fuzzing to ML frameworks,” they write, “it is possible to identify security vulnerabilities that may not be found through traditional testing methods.”
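The corrupted-image idea can be illustrated with a deliberately tiny sketch. Everything here is a toy: the "model" is just a brightness threshold over an 8x8 grid, standing in for a real classifier, and the corruption injects out-of-range pixel values the model was never designed to see. The interesting question the fuzz loop asks is how often a small corruption flips the model's answer:

```python
import random

random.seed(42)

def toy_classifier(image) -> str:
    # Stand-in for a real image model: labels an 8x8 grayscale "image"
    # by average brightness. Implicitly assumes pixel values in 0..255.
    total = sum(sum(row) for row in image)
    avg = total / (len(image) * len(image[0]))
    return "light" if avg > 127 else "dark"

def corrupt(image, pixels: int = 5):
    # Inject corruption: negative, out-of-range and extreme pixel values.
    noisy = [row[:] for row in image]
    for _ in range(pixels):
        r = random.randrange(len(noisy))
        c = random.randrange(len(noisy[0]))
        noisy[r][c] = random.choice([-1, 256, 10**9, -(10**9)])
    return noisy

base = [[128] * 8 for _ in range(8)]
baseline = toy_classifier(base)

# Count how often corrupting just 5 of 64 pixels flips the label.
flips = sum(
    1 for _ in range(100) if toy_classifier(corrupt(base)) != baseline
)
```

Even this trivial model flips its label regularly under corruption; a real model with real decision boundaries can fail in far less graceful ways, which is precisely what this style of fuzzing is trying to expose.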
However, the paper also highlights that fuzz testing needs to be part of a wider workflow that includes evaluation and analysis steps such as “crash triaging” and “severity estimation”. This is, of course, important for fuzzing to be done effectively in any context, but it’s particularly critical when dealing with complex and opaque machine learning or AI systems.
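Crash triaging, in its simplest form, means collapsing a flood of raw crashes into a short list of distinct bugs. A common technique — sketched below with an invented toy target containing two planted bugs — is to bucket crashes by a signature derived from the exception type and the crashing location:

```python
import hashlib
import traceback

def buggy_target(data: str) -> None:
    # Toy target with two distinct planted bugs for triage to separate.
    if "{" in data:
        raise KeyError("unbalanced brace")  # bug #1
    if data.isdigit():
        _ = [0, 1, 2][int(data)]            # bug #2: IndexError for >= 3

def triage(target, cases):
    # Group crashes by a signature built from the exception type and
    # the frame where it was raised, so many raw crashes collapse into
    # a handful of unique bugs worth a human's attention.
    buckets = {}
    for case in cases:
        try:
            target(case)
        except Exception as exc:
            frame = traceback.extract_tb(exc.__traceback__)[-1]
            sig_src = f"{type(exc).__name__}:{frame.name}:{frame.lineno}"
            sig = hashlib.sha1(sig_src.encode()).hexdigest()[:8]
            buckets.setdefault(sig, []).append(case)
    return buckets

cases = ["{oops", "{{", "7", "42", "999", "fine", "2"]
buckets = triage(buggy_target, cases)
```

Here five crashing inputs reduce to two buckets, one per underlying bug — the "crash triaging" step the paper describes, before any severity estimation happens.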
How to get started with fuzz testing
With the increasing integration of AI, fuzzing should become easier and faster. It might be obvious, but it's essential to remember that fuzzing nevertheless requires human oversight. Rushing into AI-powered fuzzing without a good understanding of how the practice should be done is extremely dangerous. To reiterate the point made earlier: don't fuzz test against any environment containing anything you can't afford to lose. Limited, local experimentation is a good place to begin.
On a similar note, it's essential to emphasize (again) that fuzz testing is just one part of a team's security toolkit. Other techniques and methods, such as static code analysis, unit testing and manual testing, are just as important. Being effective means using them together at the right time. This is also where good security practices come in — shifting security left and ensuring security concerns are a priority in the development lifecycle are essential foundations without which fuzz testing becomes not just difficult but maybe even ineffective.
Bringing the past and future together
While fuzz testing's failure to become a mainstream software development testing technique is partly a symptom of its chaotic and somewhat risky nature, it also reflects the priorities and pressures placed on software development teams. However, its use in specialist testing and security contexts demonstrates that it offers something other established techniques can't.
And, in an age of highly complex and often opaque systems, fuzz testing's ability to properly stress test applications arguably gives it relevance beyond the specialist domains in which it has largely been confined.
In 2025, it might well make sense to explore how fuzz testing can play a part in your existing testing toolkit.
Disclaimer: The statements and opinions expressed in this article are those of the author(s) and do not necessarily reflect the positions of Thoughtworks.