Brief summary
Mutation testing has long been a proven method for driving software quality in a way unit testing can't. But it can be a long, expensive and computationally intensive process. Our podcasters explore effective strategies for mutation testing and how to establish when it's right for your projects.
Full transcript
Prem Chandrasekaran: Hello, and welcome everyone to the ThoughtWorks Technology Podcast. I'm Prem, one of the newest hosts. I've got Ken Mugrage, my co-host today. Good morning, Ken.
Ken Mugrage: Good morning. My name's Ken Mugrage. I'm also one of the new co-hosts on the Technology Podcast. Welcome.
Prem: Thank you, Ken. Today, we are going to be also joined by two of our colleagues, Bill Codding and Brian Oxley to talk about mutation testing. Welcome, Bill and Brian. Do you want to quickly introduce yourselves?
Bill Codding: Sure. Thanks. I'm Bill Codding. I'm a technical principal with Thoughtworks. I've been with Thoughtworks for about seven years, working out of the San Francisco office. I've been using mutation testing for years on projects and advocating adoption of the practice as a sensible default for testing. Brian.
Brian Oxley: I'm Brian Oxley. I have a very similar story to Bill. I've been at Thoughtworks about eight years. I am a principal consultant. I love technology, and mutation testing is, I feel, one of the key tools for giving you quality.
Prem: Wonderful. Thanks, Bill, and Brian for that introduction. Let's get started. Today, our topic, as I said, is about mutation testing. We'll talk about what it is, why software teams should be looking to adopt this practice, and everything else there is to learn about it. First things first, Bill, can you quickly tell us what mutation testing exactly is in plain English?
Bill: Sure. First, I'd like to just set the scene quickly. Imagine you have this nice code base, and you're using TDD for the most part, and maybe the code base has been extended and modified for some number of months or years, and now you've built up this nice suite of tests alongside the code and it passes happily every time you commit. Your test coverage is high, and you have a great deal of confidence in that test suite.
One of the questions is, how do you know if those tests you've written are any good? Remember that traditional test coverage tells you the percentage of code that is executed during the test, but not whether the tests actually detect faults in that code. Enter mutation testing. Mutation testing is a white-box testing method that works by automatically inserting small bugs, called mutations, into your code, and then running the existing test suite.
The existing test suite you have is just re-run for each mutant, or some combination of mutants. Briefly, the idea is that if you alter or mutate your code, your test suite should now fail if that test suite is complete. If your tests fail, then the mutant that was put into the code is said to have been killed. If the test suite still passes, even though your code has been mutated, we say that the mutant survived.
Now, the higher the number of mutants killed and the fewer mutants that escape, the more effective we say your tests are. In short, it's a method to find out how much trust you can place in your existing test suite, and it's a way of asking, how do you test your tests for effectiveness?
Prem: Wow. That's quite a lot to unpack there. You mentioned automatically inserting small bugs in code called mutants. Can you give us an example of what a mutant is?
Bill: Sure. There's a lot of different types of mutants, and we can get into some of those later. There's all kinds of mutants, mutations of logic, there's mutations of raw code, a whole bunch of different kinds.
Here's an example of a mutation. Let's say you have a simple Boolean condition where you're checking that the driver must be 18 years old or above and therefore able to drive. You write a simple conditional: driver age is greater than or equal to 18. The mutation testing tool will now go in and change your code, mutating it to say driver age is less than 18. Maybe it'll change it to be equal to 18.
Maybe it's greater than 18, or maybe it will be driver age is greater than 1, or maybe it'll just return true or return false, any of those kinds of things. That is the actual mutation that is written into your source code base.
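For readers who want to see this concretely, here is a minimal Java sketch of Bill's driver-age example, with the kinds of mutants a mutation testing tool might generate shown as comments; the class and method names are illustrative, not taken from a real project.

    // Illustrative sketch of the driver-age check and typical mutants.
    public class DriverEligibility {

        // Original production code
        public boolean canDrive(int driverAge) {
            return driverAge >= 18;
        }

        // Mutants the tool writes into a copy of the code, one at a time:
        //   return driverAge < 18;   // relational operator negated
        //   return driverAge == 18;  // relational operator replaced
        //   return driverAge > 18;   // boundary shifted
        //   return driverAge > 1;    // constant replaced
        //   return true;             // condition removed
        //   return false;            // condition removed
    }

    // Tests that kill the boundary mutants must pin the edge exactly, e.g.:
    //   assertTrue(new DriverEligibility().canDrive(18));   // kills "driverAge > 18"
    //   assertFalse(new DriverEligibility().canDrive(17));  // kills "return true"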
Prem: Oh, okay. Great. Are there other types of mutations that can be applied as well? Brian, you want to answer that?
Brian: Yes. There are a lot of interesting things that go on with mutation testing. Essentially, it's taking the code that you've built and breaking it on purpose. Just like Bill's talking about with introducing bugs. The example he gave of a conditional branch based on whether something's bigger or smaller or the same, that's very classic.
Another common situation might be simple equality. If I'm doing red, blue, green colors to display on screen, what happens if I change one of the colors to orange? I hopefully have tests that break.
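As a rough sketch of Brian's point, assuming a simple status-to-color mapping (the names here are made up for illustration):

    public class StatusDisplay {

        public String colorFor(String status) {
            switch (status) {
                case "error":   return "red";
                case "ok":      return "green";
                case "pending": return "blue";
                default:        return "gray";
            }
        }
        // A mutation tool might swap one of these constants, for example:
        //   case "ok": return "orange";
        // If no test asserts on the exact value returned for "ok", that
        // mutant survives, telling you the behavior is effectively unverified.
    }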
Prem: Great. What we are saying here is that by making these changes, or creating what you're calling mutants, and rerunning our tests, we find out whether our tests catch unexpected code changes. If at least one test fails, then the mutant is killed, and that means that we may have better tests.
Now because of that we can rely on these tests more confidently and that's really, really helpful and impressive. Then, wait, lots of teams use code coverage as a means to establish the quality of their tests. Tell us why that's not good enough, Brian.
Brian: You're hitting on something that I love about mutation testing. I work hard to write small, short, simple code. I have colleagues looking at my code. We're always trying to find ways to make it easier to read and shorter and faster. Unit tests generally just do not help you answer questions like that.
They're answering questions like yes, no. Does it work? Does it fail? They don't help you understand, what does it behave like? The great thing with mutation testing: I will go into a code base that I'm working on, I will crank my mutation testing threshold up a little bit higher, and I get some failures and I'm scratching my head. It turns out that when I go look at the source code, what has happened is that I wrote it much more complicated than I needed to.
The mutation testing is changing some of those complicated parts of my code. My unit tests still pass even with the inserted bugs, so those mutants survive and the mutation run fails, and I have to think, I have to look at it. I realize, wow, I could have done this in a few lines instead of a whole paragraph. That's one of my joys of mutation testing. It really helps me improve my discipline in writing clean, easy-to-read code.
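A hedged illustration of what Brian describes (not his actual code): redundant logic produces mutants no test can kill, and that surviving mutant is the nudge to simplify.

    public class Licensing {

        // Overcomplicated version: once age >= 18 the inner check can never
        // be false, so mutating it (say, to "age > 100") changes nothing
        // observable and the mutant survives every test.
        public boolean canDriveVerbose(int age) {
            if (age >= 18) {
                if (age > 0) {
                    return true;
                }
                return false;
            }
            return false;
        }

        // After the surviving mutant prompts a second look, the whole
        // paragraph collapses to one line with no unkillable mutants in it.
        public boolean canDrive(int age) {
            return age >= 18;
        }
    }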
Prem: Then my question really is about code coverage. A lot of teams just use code coverage and they do seem to be doing just fine. Are you saying that code coverage is not enough?
Brian: Well, there's something that we are all assuming but we should talk about explicitly. Mutation testing is not going to help you if you do not already have high coverage. The assumption of mutation testing is there is this excellent test suite you can run over your code base. That's the first thing.
The next thing is there are many ways to decide what the word quality means. The very first goal of a programmer is to make it work. Unit tests are there to help you understand making it work. Your next goal is to make it right. There is no easy definition of right. It has a lot to do with what good code looks like, what your ecosystem is like, whether you're following best practices.
If you ask me, I'm going to want to use a combination of tools: unit testing, mutation testing, but I'm also going to want code linting, static code analysis, security checking, those kinds of things.
Ken: If I could, Bill, you had talked about creating mutants and running all the tests and so forth. I deal a lot with continuous delivery pipelines and everybody wants faster, faster, faster. It feels like if you're creating a mutant, you're running all the tests whether the mutants survive or not, it just feels very, very intensive. It might take a lot of time, a lot of processing power, et cetera. Can you talk about that?
Bill: Sure. Mutation testing is very computationally intensive. The package is going in and it's creating a number of mutants, as Brian described some of them. Actually, some of the newer mutation testing packages can do other really creative mutations to your code, such as deleting lines of code, injecting exceptions to be thrown, all kinds of creative mutants.
Then it injects these mutants, singly or in some combination, and it reruns the entire test suite. Just like you're saying, Ken, it has to rerun that test suite many, many times. If you've got a large code base, that means there's a lot of mutants that have been added and a lot of reruns of the test suite. It does get expensive.
Think about line by line in your logic, the number of decision points and logic points, and the cyclomatic complexity of your code. If every one of those constants, operators, decisions, if-then-elses, et cetera, is being altered and even lines of code are being deleted and exceptions are being injected, think about the combinatorial explosion of the number of different mutations that could be put in. Yes, as your code base grows, and if you have a lot of slow-running tests, this is going to get lengthy.
Ken: We use the term a lot that a project or something has a smell, an indicator that something is wrong. If these tests are getting so intensive, or so long running, or so what have you that it's an impact, is that a smell or a signal that maybe the code base itself is too large and we maybe want to look at smaller more focused components? Should we be looking at refactoring the architecture of the system itself?
Brian: Ken, that's exactly where I wanted to go when talking about cleaner code. You put that just right. What I find is when I have flabby, fat, unorganized code, I'm going to fail lots of mutations, and one thing this helps me do is go back to my domain model, my requirements, my framework, with my colleagues, and chop out a lot. I have found after using mutation testing, I can cut my code base 10% or 20% sometimes.
Bill: Now, that's an awesome point, Brian. I think regardless, though, just the mechanics of mutation testing are going to make the mutation testing process long and computationally intensive. Just like Brian is saying, the simpler your code is and the lower the cyclomatic complexity, the less likely you are to have those long runs. Probably also the less likely you are to have test cases that you missed that the mutation is going to catch.
Nonetheless, this is computationally intensive and there are some mitigation strategies that you can use for this. The obvious one is, when this gets lengthy, run it in a branch of your pipeline, as opposed to in the pipeline itself as a critical stage that will pass or fail the pipeline. Of course, running it in a branch of your pipeline does come with some caveats that you're going to have to be aware of.
You're going to have to now check on that branch when it completes, or you're going to have to alert on it or something like that. You're going to have to have the discipline to go back and find out when the mutation testing is completed and what it has shown. Now that being said, I think that there is a case for running mutation testing, even periodically. I think even periodically is better than none because it's a way of checking the tests that you do have in your test suite. The mutation testing report may not change drastically for small changes in the code. It could be that you do this periodically.
Ken: A lot of times-- I don't want to put words in your mouth, but when I look at pipelines, because that's most of my background over the last several years, there's lots of different times when you might want to run certain tests in parallel, even not on separate branches. There's no reason that two test suites can't run in parallel on two different build agents, that kind of thing.
That would still give you the ability, would it not, to get the results, but not go further if one of them fails, without slowing down the pipeline. Have you done that at all? Again, I know I'm leading the witness and I apologize for that. Would parallel running in the same pipeline accomplish the same goal?
Prem: Another question which comes up in this context is test-driven development, in which case you're writing a test before you write your production code, which means that now you've got really valuable tests by definition. In that case, is there a place for mutation testing? Bill?
Bill: Yes, absolutely. Because mutation testing can show that we may be missing test cases, there's absolutely still a case for running it. It's very easy to miss boundary conditions. As a matter of fact, mutation testing is really great at determining the completeness of your test suite and the possible missing cases, the incompleteness of your test suite, but it's also great at finding places where assertion boundaries are too loose.
The combination of both of those is absolutely one of the greatest values of mutation testing. It will find those for you. Then you can augment your test suite to tighten those loose assertions and to include the test cases you missed. It's an absolutely effective way to augment an already solid practice.
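Here is a small JUnit-style sketch of the "assertion boundaries too loose" case Bill mentions; the discount logic and names are hypothetical.

    import static org.junit.jupiter.api.Assertions.assertEquals;
    import static org.junit.jupiter.api.Assertions.assertTrue;

    import org.junit.jupiter.api.Test;

    class DiscountTest {

        // Stand-in production logic for the sketch.
        int discountPercent(boolean loyalCustomer) {
            return loyalCustomer ? 10 : 0;
        }

        // Too loose: a mutant that returns 99 for loyal customers still passes.
        @Test
        void looseAssertion() {
            assertTrue(discountPercent(true) > 0);
        }

        // Tight enough to kill that mutant: the exact expected values are pinned.
        @Test
        void tightAssertion() {
            assertEquals(10, discountPercent(true));
            assertEquals(0, discountPercent(false));
        }
    }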
One other point about that is I think that there are cases that mutation testing will catch where some really blatant errors have been made, and I've seen this happen in real life. I've seen test cases with no assertions, which will always pass no matter how the code is mutated. I've seen test suites that have tests that are just an assert true with a FIXME after it.
Even things like this, mutation testing can catch and will alert you to. Of course, I think maybe it's worth saying here that if you're following the traditional TDD red-green-refactor cycle, maybe you won't hit quite so many of those missing cases and no-op test cases as you otherwise might.
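For the no-assertion and placeholder tests Bill just described, here is a minimal sketch of what they tend to look like, again with hypothetical names; mutation testing reports every mutant under code exercised this way as surviving.

    import static org.junit.jupiter.api.Assertions.assertTrue;

    import org.junit.jupiter.api.Test;

    class GamedCoverageTest {

        static class Invoice {
            int calculateTotal() {
                return 42; // stand-in implementation for the sketch
            }
        }

        // Executes the code, so line coverage goes up, but asserts nothing:
        // every mutant inside calculateTotal survives.
        @Test
        void runsButChecksNothing() {
            new Invoice().calculateTotal();
        }

        // The "assert true with a FIXME" variant: always green, never useful.
        @Test
        void placeholderAssertion() {
            assertTrue(true); // FIXME: write a real assertion
        }
    }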
Ken: Putting on my big, bad manager hat: can mutation testing help find test coverage gaming? We see that, right? When an organization puts out a metric that says you must have this much test coverage, we've seen it in organizations where people game that metric. It's like, yes, it got executed. Not that I think this is a compliance tool or anything like that, but I think Brian's waving his hand at me. I think he might have a response for this. Go ahead, Brian.
Brian: Yes, I really hate having to validate what you just said. I've seen this too. What will happen might be a large organization with developers under massive deadlines and they just need to get coverage above X%. You do everything you can to get to X%, but those tests are not valuable tests.
They were not written ahead of time to drive development of the production code. They're not tests that line up with the goals of the work. They don't line up with business goals, the domain model, or things like that. It's awful, but mutation testing gets everyone to the table so that you could [crosstalk]
On one of my favorite large projects, seven separate development teams working together, the client had an independent testing team that relied heavily on mutation testing against our code. I did not see a lot of problems, actually. I was very pleased with ourselves because we had not gamed the system, but I would've felt pretty bad to be a developer who wrote tests like that. That's an interesting conversation to have.
Bill: Yes, I totally agree. It's a fantastic point. I think that we've found over the years, that if you report on something, then it will be measured. If you measure something, it will be gamed. That's a cynical view, but yes, that happens all the time. I think that a very simplistic way of trying to establish code quality bounds or boundaries or levels is to say you want a certain percentage of code coverage.
If a team, maybe a less experienced team or a team that's trying to game the system, sees that, then yes, absolutely, you can write all kinds of tests which just bring code coverage up without being effective. Again, as mutation testing is about testing the quality of the test suite, this will absolutely show you where the test suite is not effective or possibly being gamed.
Prem: Great. We've seen that this can be a pretty solid practice to adopt. Are there any tools you can recommend that make it easier to run mutation testing well? Bill?
Bill: Yes, there are open-source tools out there for nearly every modern language. Probably the best known is PItest for Java, but we do know for sure that open-source mutation testing packages are available for Java, JavaScript, .NET, C, C++, C#, Swift, Python, Rust, PHP, and many others. I've seen a whole bunch of them in use.
By the way, it also might be worth saying that mutation testing as a practice has appeared on the Thoughtworks Tech Radar since 2016 and has moved from assess back in 2016 to trial in 2020. There are some of us, myself included, that think that maybe it should even be beyond a trial recommendation into an adopt recommendation, which would be tools, techniques, et cetera, that we think you should be using now as a sensible default.
Prem: Absolutely. I definitely concur with a lot of what you're saying in terms of it being one of the starters or one of the first things that you do on a project. It looks like this isn't a new practice either, right? This has been around for a fairly long time as well. Can you folks just quickly touch on the history and tell us some real implications of having or adopting this practice?
Bill: Sure. Mutation testing has been around since, I believe, 1971, when a paper was written on it. The popularity and the use of it has increased greatly, though, in the recent years due to improvements in the algorithms and the computational power that's necessary for it. Also, it's based on something called the coupling effect, which essentially states if your tests are sensitive enough to detect small faults in software, then they're likely sensitive enough to be able to detect the major faults.
When we're talking about what we've talked about so far with mutation testing, the minor faults are the ones that are injected into the code automatically by the mutation testing tool. The major faults are possibly ones comprised of those minor faults. The mutation testing package will really help you find places in your code where your tests are not able to detect the minor faults that it's thrown in. This coupling effect has been studied both theoretically and empirically and has been very well supported by those studies.
Brian: Yes. I will add to that, this is outstanding academic work that has come over into the practical professional commercial space. We're able to leverage decades of outstanding research. The idea behind this came from fault injection. That goes back a century or more in engineering where people would intentionally create faults in systems and make sure that they still recovered or behaved well.
Prem: Wonderful. Thank you, Bill and Brian, for this really invigorating conversation. I'll end on this. If you are having doubts about wanting to adopt mutation testing in your own practice, think about this: the DeFi crypto platform Compound lost $162 million recently, about six months ago, because a greater-than sign in the code should actually have been a greater-than-or-equal-to. If they had just found that out before their users found it, they could have saved all of that money. On that note, thanks a lot to both our experts. Thank you, Ken. See you next time.
Bill: Thanks.
Ken: Bye to everyone.