Mutation testing has long been a proven method for driving software quality in a way unit testing can't. But it can be a long, expensive and computationally intensive process. Our podcasters explore effective strategies for mutation testing and how to establish when it's right for your projects.
Prem Chandrasekaran: Hello, and welcome everyone to the ThoughtWorks Technology Podcast. I'm Prem, one of the newest hosts. I've got Ken Mugrage, my co-host today. Good morning, Ken.
Ken Mugrage: Good morning. My name's Ken Mugrage. I'm also one of the new co-hosts on the Technology Podcast. Welcome.
Prem: Thank you, Ken. Today, we are also going to be joined by two of our colleagues, Bill Codding and Brian Oxley, to talk about mutation testing. Welcome, Bill and Brian. Do you want to quickly introduce yourselves?
Bill Codding: Sure. Thanks. I'm Bill Codding. I'm a technical principal with Thoughtworks. I've been with Thoughtworks for about seven years, working out of the San Francisco office. I've been using mutation testing for years on projects and advocating adoption of the practice as a sensible default for testing. Brian.
Brian Oxley: I'm Brian Oxley. I have a very similar story to Bill. I've been at Thoughtworks about eight years. I am a principal consultant. I love technology, and mutation testing is what I feel is one of the key tools for giving you quality.
Prem: Wonderful. Thanks, Bill, and Brian for that introduction. Let's get started. Today, our topic, as I said, is about mutation testing. We'll talk about what it is, why software teams should be looking to adopt this practice, and everything else there is to learn about it. First things first, Bill, can you quickly tell us what mutation testing exactly is in plain English?
Bill: Sure. First, I'd like to just set the scene quickly. Imagine you have this nice code base, and you're using TDD for the most part, and maybe the code base has been extended and modified for some number of months or years, and now you've built up this nice suite of tests alongside the code, and it passes happily every time you commit. Your test coverage is high, and you have a great deal of confidence in that test suite.
One of the questions is, how do you know if those tests you've written are any good? Remember that traditional test coverage tells you the percentage of code that is executed during the test, but not whether the tests actually detect faults in that code. Enter mutation testing. Mutation testing is a white-box testing method that works by automatically inserting small bugs, called mutations, into your code and then running the existing test suite.
The existing test suite you have is just re-run for each mutant, or some combination of mutants. Briefly, the idea is that if you alter or mutate your code, your test suite should now fail if that test suite is complete. If your tests fail, then the mutant that was put into the code is said to have been killed. If the test suite still passes, even though your code has been mutated, we say that the mutant survived.
Now, the more mutants killed and the fewer mutants that escape, the more effective we say your tests are. In short, it's a method to find out how much trust you can place in your existing test suite, and it's a way of asking, how do you test your tests for effectiveness?
Prem: Wow. That's quite a lot to unpack there. You mentioned automatically inserting small bugs in code called mutants. Can you give us an example of what a mutant is?
Bill: Sure. There are a lot of different types of mutants, and we can get into some of those later. There are all kinds: mutations of logic, mutations of raw code, a whole bunch of different kinds.
Here's an example of a mutation. Let's say you have a simple Boolean condition where you're checking that the driver must be 18 years old or above and therefore able to drive. You write a simple conditional: driver age is greater than or equal to 18. The mutation testing tool will now change that expression in your code and mutate it to say driver age is less than 18. Maybe it'll change it to be equal to 18.
Maybe it's greater than 18, or maybe it will be driver age is greater than 1, or maybe it'll just return true or return false, any of those kinds of things. That is the actual mutation that is written into your source code base.
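Bill's driver-age example can be sketched in a few lines of Python. The mutants below are written out by hand purely for illustration; a real mutation testing tool generates and applies them automatically:

```python
# A minimal sketch of the driver-age example. Mutant list is hand-written
# for illustration; real tools generate mutants automatically.

def can_drive(age):
    return age >= 18  # original condition

# Mutants a tool might produce by altering that one condition:
mutants = [
    lambda age: age < 18,   # operator flipped
    lambda age: age > 18,   # boundary moved: >= becomes >
    lambda age: age == 18,  # >= becomes ==
    lambda age: True,       # condition replaced outright
    lambda age: False,
]

def full_suite(fn):
    """A suite with a boundary test at exactly 18. True means all tests pass."""
    return fn(18) is True and fn(17) is False and fn(40) is True

def weak_suite(fn):
    """The same suite minus the boundary test at 18."""
    return fn(17) is False and fn(40) is True

assert full_suite(can_drive)  # the original code passes
killed = sum(1 for m in mutants if not full_suite(m))
survivors = [i for i, m in enumerate(mutants) if weak_suite(m)]
print(killed)     # 5: the boundary test kills every mutant
print(survivors)  # [1]: without it, the '>' mutant survives undetected
```

The suite that asserts on the boundary value 18 kills every mutant, while the suite missing that one test lets the greater-than mutant survive, which is exactly the gap mutation testing is meant to expose.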
Prem: Oh, okay. Great. Are there other types of mutations that can be applied as well? Brian, you want to answer that?
Brian: Yes. There are a lot of interesting things that go on with mutation testing. Essentially, it's taking the code that you've built and breaking it on purpose. Just like Bill's talking about with introducing bugs. The example he gave of a conditional branch based on whether something's bigger or smaller or the same, that's very classic.
Another common situation might be simple equality. If I'm doing red, blue, green colors to display on screen, what happens if I change one of the colors to orange? I hopefully have tests that break.
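Brian's color example might look like this; the function names and values here are hypothetical, just to show a value mutation being killed by a per-value check:

```python
# Hypothetical sketch of a value mutation: the tool swaps one constant
# ("green" -> "orange") and checks whether any test notices.

def status_color(level):            # illustrative function, not from a real tool
    return {0: "red", 1: "blue", 2: "green"}[level]

def mutant_status_color(level):     # what the tool effectively runs
    return {0: "red", 1: "blue", 2: "orange"}[level]

def suite(fn):
    # Asserting on each specific value is what kills this kind of mutant.
    return fn(0) == "red" and fn(1) == "blue" and fn(2) == "green"

assert suite(status_color)          # original passes
assert not suite(mutant_status_color)  # the swapped constant is detected
```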
Prem: Great. What we are saying here is that by making these changes, creating what you're calling mutants, and rerunning our tests, we find out how the test suite responds to unexpected code changes. If at least one test fails, then the mutant is killed, and that means we may have better tests.
Because of that, we can rely on these tests more confidently, and that's really helpful and impressive. Then, wait, lots of teams use code coverage as a means to establish the quality of their tests. Tell us why that's not good enough, Brian.
Brian: You're hitting on something that I love about mutation testing. I work hard to write small, short, simple code. I have colleagues looking at my code. We're always trying to find ways to make it easier to read, and shorter, and faster. Unit tests generally just do not help you answer questions like that.
They're answering questions like yes, no. Does it work? Does it fail? They don't help you understand what it behaves like. The great thing with mutation testing is, I will go into a code base that I'm working on, I will crank my mutation testing threshold up a little higher, and I get some failures and I'm scratching my head. It turns out that when I go look at the source code, what has happened is that I wrote it much more complicated than I needed to.
The mutation testing is changing some of those complicated parts of my code. My unit tests still pass but with the inserted bugs, they start failing and I have to think, I have to look at it. I realize, wow, I could have done this in a few lines instead of a whole paragraph. That's one of my joys of mutation testing. It really helps me improve my discipline in writing clean, easy to read code.
Prem: Then my question really is about code coverage. A lot of teams just use code coverage, and they do seem to be doing just fine. Are you saying that code coverage is not enough?
Brian: Well, there's something that we are all assuming but we should talk about explicitly. Mutation testing is not going to help you if you do not already have high coverage. The assumption of mutation testing is there is this excellent test suite you can run over your code base. That's the first thing.
The next thing is there are many ways to decide what the word quality means. The very first goal of a programmer is to make it work. Unit tests are there to help you with making it work. Your next goal is to make it right. There is no easy definition of right. It has a lot to do with what good code looks like, what your ecosystem is like, whether you're following best practices.
If you ask me, I'm going to want to use a combination of tools: unit testing, mutation testing, but I'm also going to want code linting, static code analysis, security checking, those kinds of things.
Ken: If I could, Bill, you had talked about creating mutants and running all the tests and so forth. I deal a lot with continuous delivery pipelines, and everybody wants faster, faster, faster. If you're creating a mutant, you're running all the tests whether the mutants survive or not; it just feels very, very intensive. It might take a lot of time, a lot of processing power, et cetera. Can you talk about that?
Bill: Sure. Mutation testing is very computationally intensive. The package is going in and creating a number of mutants, as Brian described some of them. Some of the newer mutation testing packages can do other really creative mutations to your code, such as deleting lines of code entirely or injecting exceptions to be thrown, all kinds of creative mutants.
Then it injects these mutants, singly or in some combination, and reruns the entire test suite. Just like you're saying, Ken, it has to rerun that test suite many, many times. If you've got a large code base, that means there are a lot of mutants being added and a lot of reruns of the test suite. It does get expensive.
Think about, line by line in your logic, the number of decision points and the cyclomatic complexity of your code. If every one of those constants, operators, decisions, if-then-elses, et cetera, is being altered, and even lines of code are being deleted and exceptions injected, think about the combinatorial explosion in the number of different mutations that could be put in. Yes, as your code base grows, and if you have a lot of slow-running tests, this is going to get lengthy.
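To make that cost concrete, here is a back-of-envelope model; every number in it is an assumption chosen for illustration, not a measurement from any real tool:

```python
# Back-of-envelope cost model with assumed numbers, just to show the scaling:
# total wall time ~= number of mutants x time to run the tests for each mutant.
lines_of_code = 50_000
mutants_per_line = 0.05   # assumed density: roughly one mutant per 20 lines
suite_seconds = 120       # assumed full-suite runtime: two minutes

mutants = int(lines_of_code * mutants_per_line)
naive_hours = mutants * suite_seconds / 3600  # rerun everything per mutant
print(f"{mutants} mutants -> {naive_hours:.0f} hours rerunning the full suite")

# Tools mitigate this by running only the tests that cover the mutated line
# and skipping mutants on uncovered code; assume 3 seconds per mutant:
targeted_seconds = 3
print(f"~{mutants * targeted_seconds / 3600:.1f} hours with per-mutant targeting")
```

Even with generous assumptions, the naive approach is measured in days, which is why the mitigation strategies discussed below matter.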
Ken: We use the term a lot that a project or something has a smell, an indicator that something is wrong. If these tests are getting so intensive, or so long running, or so what have you that it's an impact, is that a smell or a signal that maybe the code base itself is too large and we maybe want to look at smaller more focused components? Should we be looking at refactoring the architecture of the system itself?
Brian: Yes, Ken, that's exactly where I wanted to go when talking about cleaner code. You put that just right. What I find is that when I have flabby, fat, unorganized code, a lot of mutants are going to survive, and something this helps me do is go back to my domain model, my requirements, my framework, with my colleagues, and chop out a lot. I have found that after using mutation testing, I can cut my code base 10% or 20% sometimes.
Bill: Now, that's an awesome point, Brian. I think regardless, though, just the mechanics of mutation testing are going to make the process long and computationally intensive. Just like Brian is saying, the simpler your code is and the lower the cyclomatic complexity, the less likely you are to have those long runs. Probably also the less likely you are to have test cases that you missed that the mutation is going to catch.
Nonetheless, this is computationally intensive, and there are some mitigation strategies that you can use. The obvious one, when this gets lengthy, is to run it in a branch of your pipeline, as opposed to in the pipeline itself as a critical stage that will pass or fail the pipeline. Of course, running it in a branch of your pipeline does come with some caveats that you're going to have to be aware of.
You're going to have to check on that branch when it completes, or you're going to have to alert on it, or something like that. You're going to have to have the discipline to go back and find out when the mutation testing has completed and what it has shown. That being said, I think there is a case for running mutation testing even just periodically. Periodically is better than never, because it's a way of checking the tests that you do have in your test suite, and the mutation testing report may not change drastically for small changes in the code.
Ken: A lot of times, and I don't want to put words in your mouth, but when I look at pipelines, because that's most of my background over the last several years, there are lots of different times when you might want to run certain tests in parallel. Even not on separate branches, there's no reason that two test suites can't run in parallel on two different build agents, that kind of thing.
That would still give you the ability, would it not, to get the results but not go further if one of them fails, without slowing down the pipeline. Have you done that at all? Again, I know I'm leading the witness, and I apologize for that. Would parallel running in the same pipeline accomplish the same goal?
Prem: Another question which comes up in this context is test-driven development, where you're writing a test before you write your production code, which means that now you've got really valuable tests by definition. In that case, is there still a place for mutation testing? Bill?
Bill: Yes, absolutely. Because mutation testing can show that we may be missing test cases, there's absolutely still a case for running it. It's very easy to miss boundary conditions. As a matter of fact, mutation testing is really great at determining the completeness of your test suite and surfacing the possible missing cases, but it's also great at finding places where assertion boundaries are too loose.
The combination of both of those is absolutely one of the greatest values of mutation testing. It will find those for you. Then you can augment your test suite to tighten those loose assertions and to include the test cases you missed. It's a very effective way to augment an already solid practice.
One other point about that is I think there are cases that mutation testing will catch where some really blatant errors were made, and I've seen this happen in real life. I've seen test cases with no assertions, which will always pass no matter how the code is mutated. I've seen test suites that have tests with just an assert true and a FIXME after it.
Even things like this, mutation testing can catch and will alert you to. Of course, maybe it's worth saying here that if you're following the traditional TDD red-green-refactor cycle, maybe you won't hit quite so many of those missing cases and no-op test cases as you otherwise might.
Ken: Big, bad manager hat on: can mutation testing help find test coverage gaming? We see that, right? When an organization puts out a metric that says you must have this much test coverage, we've seen it in organizations where people game that metric. It's like, yes, it got executed. Not that I think this is a compliance tool or anything like that, but I think Brian's waving his hand at me. I think he might have a response for this. Go ahead, Brian.
Brian: Yes, I really hate having to validate what you just said, but I've seen this too. What will happen, maybe in a large organization with developers under massive deadlines, is they just need to get coverage above X%. You do everything you can to get to X%, but those tests are not valuable tests.
They were not written ahead of time to drive development of the production code. They're not tests that line up with the goals of the work. They don't line up with business goals, the domain model, or things like that. It's awful, but mutation testing gets everyone to the table so that you can [crosstalk]
On one of my favorite large projects, seven separate development teams working together, the client had an independent testing team that relied heavily on mutation testing against our code. I did not see a lot of problems, actually. I was very pleased with ourselves because we had not gamed the system, but I would've felt pretty bad to be a developer who wrote tests like that. That's an interesting conversation to have.
Bill: Yes, I totally agree. It's a fantastic point. I think we've found over the years that if you report on something, it will be measured, and if you measure something, it will be gamed. That's a cynical view, but yes, it happens all the time. I think a very simplistic way of trying to establish code quality levels is to say you want a certain percentage of code coverage.
If a team, maybe a less experienced team or a team that's trying to game the system, sees that, then yes, absolutely, you can write all kinds of tests which just bring code coverage up without being effective. Again, as mutation testing is about testing the quality of the test suite, it will absolutely show you where a test suite is not effective or is possibly being gamed.
Prem: Great. We've seen that this can be a pretty solid practice to adopt. Are there any tools that you can recommend that help make this easier to run mutation testing well? Bill?
Bill: By the way, it also might be worth saying that mutation testing as a practice has appeared on the Thoughtworks Tech Radar since 2016 and has moved from Assess back in 2016 to Trial in 2020. There are some of us, myself included, who think that maybe it should even move beyond a Trial recommendation into Adopt, which covers tools, techniques, et cetera, that we think you should be using now as a sensible default.
Prem: Absolutely. I definitely concur with a lot of what you're saying in terms of it being one of the first things that you do on a project. It looks like this isn't a new practice either, right? This has been around for a fairly long time. Can you folks just quickly touch on the history and tell us some real-world implications of adopting this practice?
Bill: Sure. Mutation testing has been around since, I believe, 1971, when a paper was written on it. The popularity and use of it have increased greatly in recent years, though, due to improvements in the algorithms and in the availability of the computational power it needs. It's based on something called the coupling effect, which essentially states that if your tests are sensitive enough to detect small faults in software, then they're likely sensitive enough to detect major faults.
In what we've talked about so far with mutation testing, the minor faults are the ones that are injected into the code automatically by the mutation testing tool. The major faults are possibly ones composed of those minor faults. The mutation testing package will really help you find places in your code where your tests are not able to detect the minor faults it has thrown in. This coupling effect has been studied both theoretically and empirically and has been very well supported by those studies.
Brian: Yes. I will add to that, this is outstanding academic work that has come over into the practical professional commercial space. We're able to leverage decades of outstanding research. The idea behind this came from fault injection. That goes back a century or more in engineering where people would intentionally create faults in systems and make sure that they still recovered or behaved well.
Prem: Wonderful. Thank you, Bill and Brian, for this really invigorating conversation. I'll end on this. If you are having doubts about adopting mutation testing in your own practice, think about this: the DeFi crypto platform Compound lost $162 million recently, about six months ago, because a greater-than sign in the code should actually have been a greater-than-or-equal-to. If they had just found that out before their users did, they could have saved all of that money. On that note, thanks a lot to both our experts. Thank you, Ken. See you next time.
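The kind of one-character comparator slip Prem describes, and the boundary test that would expose it, can be sketched like this. This is illustrative Python with a made-up threshold, not the actual Compound code:

```python
# Illustrative only, not the actual Compound code: a '>' where '>=' was meant,
# which is exactly the operator swap mutation testing tools exercise.

THRESHOLD = 100  # hypothetical threshold

def intended(amount):
    return amount >= THRESHOLD  # what the spec called for

def shipped(amount):
    return amount > THRESHOLD   # the one-character bug

# Away from the boundary the two agree, so coarse tests pass both versions:
assert intended(150) == shipped(150)
assert intended(50) == shipped(50)

# Only a test at exactly the threshold, the kind a surviving comparator
# mutant points you toward, tells them apart:
assert intended(THRESHOLD) is True
assert shipped(THRESHOLD) is False
```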
Ken: Bye to everyone.