Chaos engineering is moving into the mainstream as an approach to improving and assuring the resilience of distributed systems, such as those running in the cloud.
What is it?
Chaos engineering has its roots in Chaos Monkey, a tool developed by Netflix to test how well a running system coped with outages in production by randomly disabling instances and measuring the results. Today, organizations typically use chaos engineering in testing environments, rather than production.
Chaos engineering recognizes that failures are going to happen in modern distributed cloud applications. By simulating failures at various points in the application, you can measure the likely impact and plan more effectively for failures in production.
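As a rough illustration of simulating failures and measuring their impact, the Python sketch below wraps a dependency so that a configurable fraction of calls fail, then counts how many requests succeed. All names here (`with_faults`, `lookup_price`, `run_experiment`) are hypothetical and exist only for this example. Real chaos tooling typically injects faults at the infrastructure level rather than inside application code.

```python
import random


class InjectedFailure(Exception):
    """Raised in place of a real outage during an experiment."""


def with_faults(func, failure_rate, rng):
    """Wrap func so that roughly failure_rate of its calls raise InjectedFailure."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFailure(f"injected failure in {func.__name__}")
        return func(*args, **kwargs)
    return wrapper


def lookup_price(item_id):
    # Hypothetical stand-in for a call to a remote pricing service.
    return {"item_id": item_id, "price": 9.99}


def run_experiment(call, requests):
    """Send requests through call and report how many succeeded and failed."""
    succeeded = failed = 0
    for item_id in range(requests):
        try:
            call(item_id)
            succeeded += 1
        except InjectedFailure:
            failed += 1
    return {
        "succeeded": succeeded,
        "failed": failed,
        "success_rate": succeeded / requests,
    }


# A seeded generator keeps the experiment repeatable between runs.
rng = random.Random(42)
flaky_lookup = with_faults(lookup_price, failure_rate=0.2, rng=rng)
report = run_experiment(flaky_lookup, requests=1000)
print(report)
```

Raising the failure rate, or wrapping a different dependency, gives a simple way to compare how much each point of failure affects the overall success rate.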
And as enterprises routinely deploy distributed applications and rely on public cloud and openly available web services, there’s a growing need to test how those systems behave when something fails.
What’s in it for you?
Chaos engineering will improve your customers’ experience.
By testing how your applications respond to failures in the distributed system, you get a better understanding of the failure modes. For instance, if part of your online retail recommendation engine fails, your customers might not be given eye-catching offers, but they might still be able to buy what they need.
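The retail example can be sketched in Python as follows. This is a minimal illustration rather than a real storefront: `fetch_recommendations`, `render_product_page` and the `engine_up` switch are hypothetical names, with the switch standing in for an outage injected during an experiment.

```python
class RecommendationOutage(Exception):
    """Signals that the (hypothetical) recommendation engine is unavailable."""


def fetch_recommendations(customer_id, engine_up=True):
    # engine_up is an assumed switch that simulates the injected outage.
    if not engine_up:
        raise RecommendationOutage("recommendation engine unavailable")
    return ["offer-a", "offer-b"]


def render_product_page(customer_id, engine_up=True):
    """Build the page, falling back to no offers if recommendations fail."""
    try:
        offers = fetch_recommendations(customer_id, engine_up)
    except RecommendationOutage:
        offers = []  # degrade gracefully: hide offers, keep the page working
    return {"customer_id": customer_id, "offers": offers, "can_checkout": True}


healthy = render_product_page("customer-1")
degraded = render_product_page("customer-1", engine_up=False)
print(healthy)
print(degraded)
```

In the degraded case the page carries no offers, but checkout remains available. A chaos experiment is one way to confirm that a fallback like this is actually in place.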
What are the trade-offs?
Chaos engineering is another layer of testing, so you can expect some additional expense. For some low-priority applications, you may decide the extra testing isn’t necessary.
And introducing failure on its own is not enough: you also need to invest time and money in fixing things you discover.
Finally, even a well-implemented chaos engineering policy isn’t foolproof. Complex distributed computer systems will still fail.