Four chaos engineering mistakes to avoid

According to principlesofchaos.org:

 

“Chaos engineering is the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production.”

 

These experiments take the form of simulating real-world failures (“chaos variables”), such as hardware faults or network outages, and examining the impact on the running system. Popularized by Netflix and credited with helping them successfully operate a complex system at massive scale, chaos engineering has been gaining attention and acceptance in the software engineering community.

 

As more teams start using chaos engineering, we see a few common mistakes that are worth discussing.

 

Mistake 1: Starting with the tools

 

Developed by Netflix and almost synonymous with chaos engineering, Chaos Monkey picks servers at random and disables them, simulating real-world failures. While regular, automated chaos experiments are an important way to realize the full value of chaos engineering, they probably aren’t the best place to start.

 

A chaos variable can be as simple as logging into your cloud console and disabling a server, and that simple act will usually teach you something much faster than taking the time to evaluate and deploy the many chaos engineering tools available. Regular, automated chaos experiments are an important goal, but manual experiments are usually a faster way to start.
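To make that concrete, here is a minimal sketch of such a manual chaos variable, assuming an AWS environment and using boto3 to stop and later restart a single instance. The instance ID and region are placeholders, and the target should of course be agreed with your team before you run anything like this.

```python
# A minimal manual chaos variable: disable one server, observe, then restore it.
# Assumes AWS credentials are configured; the instance ID below is a placeholder.
import boto3

INSTANCE_ID = "i-0123456789abcdef0"  # hypothetical target, chosen with your team

ec2 = boto3.client("ec2", region_name="us-east-1")

# Inject the failure: stop the instance and wait until it is actually down.
ec2.stop_instances(InstanceIds=[INSTANCE_ID])
ec2.get_waiter("instance_stopped").wait(InstanceIds=[INSTANCE_ID])
print(f"{INSTANCE_ID} is down -- observe how the system behaves.")

input("Press Enter to roll back the experiment...")

# Roll back: bring the instance back and confirm it is running again.
ec2.start_instances(InstanceIds=[INSTANCE_ID])
ec2.get_waiter("instance_running").wait(InstanceIds=[INSTANCE_ID])
print(f"{INSTANCE_ID} is running again.")
```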

 

Mistake 2: Not limiting your blast radius

 

Netflix’s approach to chaos engineering, continuously performing experiments in their production environment, has become widely known. But, like Chaos Monkey, it represents a state of maturity, not a starting point.

 

There are many ways to limit the blast radius of your chaos experiments:

  • Run the experiment against a non-production environment

  • Only target a subset of your services

  • Run the experiment for a specific, limited time

  • Run the experiment during a period of lower usage

 

Each of these approaches comes with costs but limits the potential blast radius of your experiments. For example, experimenting in a non-production environment can never teach you as much as working with production, but it cannot impact real users either.
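As an illustration of the time-boxing option, the sketch below wraps a fault injection in a fixed window and always rolls it back, even if something goes wrong mid-experiment. The inject and revert callables are hypothetical stand-ins for whatever failure you choose to simulate.

```python
# Sketch of a time-limited chaos experiment: the fault is always reverted,
# even if monitoring or the experiment itself raises an exception.
import time
from typing import Callable

def run_time_limited_experiment(inject: Callable[[], None],
                                revert: Callable[[], None],
                                duration_seconds: int = 300) -> None:
    inject()
    try:
        # Keep the blast radius bounded in time: the failure exists only
        # for this window, then we restore normal operation.
        time.sleep(duration_seconds)
    finally:
        revert()

# Example usage with hypothetical fault-injection helpers:
# run_time_limited_experiment(
#     inject=lambda: block_traffic_to("recommendations"),
#     revert=lambda: restore_traffic_to("recommendations"),
#     duration_seconds=600,
# )
```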

 

Running chaos experiments in production is a fantastic way to learn about your system and an important goal for any serious investment in chaos engineering. But starting small, in a non-production environment, is a great way to build confidence and experience without making too many enemies by bringing down production.

 

Mistake 3: Not going in with a hypothesis

 

A chaos experiment represents a significant investment from the business: on top of the inherent risk to the system, time is needed to plan chaos variables, plan rollbacks, monitor the system, and respond to any incidents. Given the cost, one of our jobs as chaos engineering practitioners is to ensure a return on that investment, and testing hypotheses is an important way to do that.

 

Creating your first hypothesis could be as simple as asking your team “what are we concerned about?” or pointing at an architecture diagram and asking “what happens if this service fails?”. Try to focus your questioning on areas where uncertainty is greatest and potential impact is highest.

 

When you have questions without answers, simply add your expectations:

 

“We hypothesise that if the recommendations service becomes unavailable, customers will still be able to complete their purchases.”

 

“We hypothesise that if one database in our cluster becomes unavailable, customers will still experience reasonable performance (95% of requests complete in <200 ms).”

 

Testing these hypotheses will uncover important information about your system and identify areas for improvement. This is a great starting point for your first chaos experiment.
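One way to keep yourself honest is to write the hypothesis down as an executable check before you inject any failure. The sketch below does this for the first hypothesis above; the checkout URL and payload are hypothetical and stand in for your own purchase flow.

```python
# Sketch: the recommendations-outage hypothesis as a concrete check that can
# run while the experiment is in progress. URL and payload are hypothetical.
import requests

CHECKOUT_URL = "https://shop.example.com/api/checkout"  # placeholder endpoint

def purchase_still_works() -> bool:
    # With the recommendations service disabled, completing a purchase
    # should still succeed.
    response = requests.post(CHECKOUT_URL, json={"cart_id": "test-cart-123"})
    return response.status_code == 200

print("Hypothesis holds:", purchase_still_works())
```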

 

Mistake 4: Not investing in observability

 

A hypothesis isn’t very useful if we can’t validate it. To validate the hypothesis “if one database in our cluster becomes unavailable, customers will still experience reasonable performance (95% of requests complete in <200 ms)”, we need visibility into all requests to our system, their response times and their success rates.
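As a concrete illustration, if request latencies are already exported as a Prometheus histogram, that hypothesis maps onto a single query. The Prometheus address and the metric name http_request_duration_seconds below are assumptions about your setup, not a given.

```python
# Sketch: validate the latency hypothesis against Prometheus during the experiment.
# Assumes a reachable Prometheus server and a histogram metric named
# http_request_duration_seconds -- both are assumptions about your setup.
import requests

PROMETHEUS_URL = "http://prometheus.internal:9090"  # hypothetical address
QUERY = (
    "histogram_quantile(0.95, "
    "sum(rate(http_request_duration_seconds_bucket[5m])) by (le))"
)

response = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": QUERY})
response.raise_for_status()
result = response.json()["data"]["result"]

if not result:
    raise SystemExit("No data returned -- check the metric name and time range.")

p95_seconds = float(result[0]["value"][1])
print(f"p95 latency: {p95_seconds * 1000:.0f} ms")
print("Hypothesis holds:", p95_seconds < 0.200)
```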

 

While a simple dashboard might help us answer some of these questions, observability is what allows us to dig into the data and answer the important questions that come from chaos experiments, like “why did response time increase during the experiment?”.

 

Practicing chaos engineering forces you to take the observability of your system seriously if you want to see any return on your chaos experiments. Likewise, investments in observability enable more interesting and expansive chaos experiments, so these two practices naturally co-evolve.

 

Conclusion

 

Chaos engineering is not just about breaking things; we want to learn from the experiments we carry out. As John Allspaw puts it:

 

“Incidents are unplanned investments; their costs have already been incurred. Your org’s challenge is to get ROI on those events.”

 

With chaos engineering, we have a unique opportunity to plan some of those investments. Follow the advice here to make the most of that opportunity.

