I jokingly say, “If you do not have a flaky functional test build, you are not doing anything real.” I’ve spent a good part of my professional career writing functional tests, and I have talked to many teams at Thoughtworks to understand their functional testing issues. The most common issue across all these teams is the infamous flaky functional test build.
On the Go team, we had a severe problem with a flaky functional test build about 3 years back. It was so bad that we had completely lost trust in our build. A red build did not mean anything any more. Before a release, we would spend about 3 days looking at all the failures and fixing them. This completely defeats the purpose of having a CI build that runs these tests. Ideally, we would run all the tests on every code commit and make sure everything is green.
In every single release the functional tests caught bugs, so they were useful. Unfortunately, we were not paying attention to them. We knew we had to fix the flakiness so that we could isolate the real problems, and that the only way to fix it was to go back to basics. We brought about some changes in the team:
- Stop calling your build flaky – Random Success
- Acceptance test builds can never be red – Quarantine
- Budget time in the release plan to fix tests – Plan
- Refactor to make sure that no duplication is tolerated – Engineer
- Understand the nature of flakiness – Learning
1. Stop calling your build flaky
Have you ever released to production when, say, your search functionality “sometimes works”? Do you think your customer will be happy when she cannot reliably book a movie ticket using your website?
When you do not tolerate flakiness in production code, how can you tolerate it in functional tests?
The first thing we wanted to address was not the implementation issue at all. Instead, it was fixing the mindset of the people on the ground. After all, it was code that was written by them. How can they not be sure if it was failing for the right reasons or not?
To deal with this, we came up with the concept of Random Success. Most teams say that a test “failed randomly”. We decided that this was the wrong thing to say. If a test ever fails and then passes without any code change, then it “Passed Randomly”. We do not trust such tests. Treat any test as flaky, even one that is currently passing, if it has failed and then passed without a reason.
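The Random Success rule can be checked mechanically against build history. The following is a minimal sketch, not the tooling the team used: `RunRecord` and `find_random_successes` are hypothetical names, and it assumes you can export a chronological list of test results together with the revision each build ran against.

```python
from typing import Iterable, NamedTuple, Set


class RunRecord(NamedTuple):
    """One result of one test in one build (hypothetical record shape)."""
    test_name: str
    revision: str  # the commit the build ran against
    passed: bool


def find_random_successes(runs: Iterable[RunRecord]) -> Set[str]:
    """Return tests that failed and later passed on the same revision.

    `runs` must be in chronological order. A pass that follows a failure
    on an unchanged revision is a "random success", so the test is suspect
    even though its latest result is green.
    """
    failed_on = set()  # (test_name, revision) pairs that have failed
    suspects = set()
    for run in runs:
        key = (run.test_name, run.revision)
        if not run.passed:
            failed_on.add(key)
        elif key in failed_on:
            suspects.add(run.test_name)
    return suspects
```

A test that fails on one revision and passes on a later one is not flagged, because a code change may legitimately explain the difference.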
2. Acceptance test builds can never be red
Let’s say you have identified a flaky test. In fact, you might have identified, say, 23 of them. Obviously, you cannot fix all of them at once. That would take a lot of time.
In our case, we didn’t want to delete them, but at the same time, we did not want to infect our build with the disease of redness. A good middle ground was to Quarantine the identified tests.
Quarantined tests do not get run as a part of your builds. They are there so that they can be fixed later.
Over a period of time, our focus was to mercilessly identify and quarantine flaky tests. Let’s say we ended up with just 40% of the original suite. That would be 40% that we could trust. We knew that if one of these tests failed, we needed to stop the line and fix the issue.
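How a quarantine is implemented depends on your test runner. As an illustration only, here is a sketch that assumes tests are identified by name and that the quarantine is a plain set of names; `split_suite` and the test names are hypothetical and are not how the Go team did it.

```python
from typing import Iterable, List, Set, Tuple


def split_suite(all_tests: Iterable[str],
                quarantined: Set[str]) -> Tuple[List[str], List[str]]:
    """Split a suite into tests the build runs and tests parked for repair.

    Quarantined tests are kept (not deleted) so they can be fixed later,
    but they are excluded from the list the CI build executes.
    """
    trusted: List[str] = []
    parked: List[str] = []
    for name in all_tests:
        if name in quarantined:
            parked.append(name)
        else:
            trusted.append(name)
    return trusted, parked


suite = ["login", "search", "checkout", "reports", "admin"]
trusted, parked = split_suite(suite, quarantined={"search", "reports", "admin"})
trusted_share = len(trusted) / len(suite)  # fraction of the suite we can trust
```

Keeping the parked list explicit makes the size of the quarantine visible, which helps when planning the time to empty it.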
3. Budget time in the release plan to fix tests
A quarantine that is, say, 6 months old is completely useless. Just like quarantined patients need extra care, these tests need engineering love. Remember, real work was put into them. You do not want it to go to waste.
So we planned and set expectations with our stakeholders about the capacity that was required to fix our quarantined automation. The key here was that we told them how strategic it was for us to have reliable automation. That was our prerequisite to take our release cycle from months to weeks.
We prioritized so that nothing our QA felt was a must-have stayed in quarantine. We call these must-have tests “Bread and Butter tests”: they make sure that we earn our bread and butter!
Next, we went after the tests whose problems would not take long to figure out. Low-hanging fruit, if you want to call them that.
4. Refactor to make sure that no duplication is tolerated
You will be surprised when you see how many functional test issues boil down to bad discipline: random waits, duplication, massive inheritance hierarchies, weird object mothers, fixture objects that are highly complicated to construct, and so on.
In fact, we found out that a lot of our issues were because of a mix of bugs in our test code and sloppy duplication.
A lot of our bugs came from the heavily asynchronous nature of our application: background processes, batch processing and a lot of Ajax. We ended up developing a WaitUtils library to make sure we never have blind sleeps and instead always have targeted waits, that is, waits that poll for a specific condition rather than pausing for a fixed time.
The team started having dev huddles, discussions and design reviews. When a developer was about to write a new page object, she would talk to the others on the team about things like the usability of its API.
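To make the difference between a blind sleep and a targeted wait concrete, here is a minimal sketch. It is not the actual WaitUtils API: `wait_until` and its parameters are hypothetical, chosen only to show the idea of polling for a condition with a timeout.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def wait_until(condition: Callable[[], T],
               timeout: float = 10.0,
               interval: float = 0.1,
               description: str = "condition") -> T:
    """Poll `condition` until it returns a truthy value, then return it.

    Unlike a blind time.sleep(), this returns as soon as the condition
    holds and fails loudly, with a description, if it never does.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"Timed out after {timeout}s waiting for {description}")
        time.sleep(interval)
```

A test would then wait for something it actually cares about, for example `wait_until(lambda: job.is_complete(), description="batch job to finish")`, where `job` stands in for whatever object your test polls. The timeout message points at the condition that was never met, which makes failures easier to diagnose than a sleep that was simply too short.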
All of this ensured that we would never repeat the same mistake twice since we followed DRY very strictly.
5. Understand the nature of flakiness
Most flakiness is due to our own mistakes. Some are easy to find, others are not, just like any bugs you would find in your production code. Think about it: in spite of having a strict QA process for production code, some bugs may still be lurking around. Why would there not be bugs in your functional test code?
Though the whole process took about 6 months, over a period of time our team got very good at identifying root causes and fixing them. In the last 2 years, we have had a rock solid acceptance suite. We have automation that takes about 9 hours to run sequentially and yet we know that when it fails, it fails for a reason.
Now, it only takes about 40 minutes to run a 9 hour long test suite! How we do that is a topic for its own blog, I guess.
* Martin Fowler has a bliki entry on flaky tests
* Jez Humble and Badri did a talk on creating maintainable acceptance tests at Agile 2012: slides | video