Decoder

Chaos engineering

A testing system that deliberately introduces failures in parts of an application to evaluate how it responds.

Chaos engineering is entering the mainstream as an approach to improving and assuring the resilience of distributed systems, such as cloud applications.

What is it?

A technique for testing the resilience of distributed cloud applications by injecting failures into parts of the system and monitoring the result.

What’s in it for you?

You can improve your customers’ experience by making your cloud applications more resilient.

What are the trade-offs?

Chaos engineering isn’t for everyone. And to be effective and safe, it requires organizational support at scale.

How is it being used?

Chaos engineering is being used to test and improve the resilience of large-scale distributed applications.

What is it?

Chaos engineering has its roots in Chaos Monkey, a practice developed by Netflix to test how a running system coped with outages in production by randomly disabling instances and measuring the results. Today, organizations typically use chaos engineering in testing environments rather than in production.

Chaos engineering recognizes that failures are going to happen in modern distributed cloud applications. By simulating failures at various points in the application, you can measure the likely impact and plan more effectively for failures in production.
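To make the idea of simulating failures and measuring their impact concrete, here is a minimal sketch in Python. It is an illustration only, not how any particular chaos engineering tool works: `with_chaos`, `lookup_price`, `run_experiment` and the 20% failure rate are all hypothetical names and values chosen for this example.

```python
import random


class InjectedFailure(Exception):
    """Raised in place of a real outage during an experiment."""


def with_chaos(func, failure_rate, rng):
    """Wrap func so that roughly failure_rate of its calls fail on purpose."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFailure(f"injected failure in {func.__name__}")
        return func(*args, **kwargs)
    return wrapper


def lookup_price(item_id):
    # Hypothetical dependency standing in for a remote service call.
    return {"item_id": item_id, "price": 9.99}


def run_experiment(call, attempts):
    """Call the dependency repeatedly and count how many calls failed."""
    failures = 0
    for item_id in range(attempts):
        try:
            call(item_id)
        except InjectedFailure:
            failures += 1
    return {
        "attempts": attempts,
        "failures": failures,
        "failure_ratio": failures / attempts,
    }


# A seeded generator keeps the experiment repeatable between runs.
rng = random.Random(42)
flaky_lookup = with_chaos(lookup_price, failure_rate=0.2, rng=rng)
report = run_experiment(flaky_lookup, attempts=1000)
print(report)
```

In a real experiment the measurement would usually be a business or service-level metric, such as error rate or latency seen by users, rather than a simple count of raised exceptions.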

And because enterprises routinely deploy distributed applications and rely on public cloud and openly available web services, there’s a growing need to test how these systems cope with failure.

What’s in it for you?

Chaos engineering can improve your customers’ experience.

By testing how your applications respond to failures in the distributed system, you get a better understanding of the failure modes. For instance, if part of your online retail recommendation engine fails, your customers might not be given eye-catching offers, but they might still be able to buy what they need.
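The retail example can be sketched as a small test of graceful degradation. This is a hedged illustration under stated assumptions: `fetch_recommendations`, `render_product_page`, the `inject_failure` flag and the sample offers are hypothetical, and a real system would inject the failure at the network or infrastructure level rather than through a function argument.

```python
class RecommendationOutage(Exception):
    """Stands in for the recommendation engine being unavailable."""


def fetch_recommendations(customer_id, inject_failure=False):
    # Hypothetical recommendation service; the flag simulates an outage.
    if inject_failure:
        raise RecommendationOutage("recommendation engine unavailable")
    return ["offer-a", "offer-b"]


def render_product_page(customer_id, product, inject_failure=False):
    """Build the page, falling back to no offers if recommendations fail."""
    try:
        offers = fetch_recommendations(customer_id, inject_failure)
    except RecommendationOutage:
        # Degrade gracefully: the customer loses offers, not the ability to buy.
        offers = []
    return {"product": product, "can_buy": True, "offers": offers}


healthy = render_product_page("c-1", "kettle")
degraded = render_product_page("c-1", "kettle", inject_failure=True)

print(healthy)
print(degraded)
```

The experiment passes if the degraded page can still be used to buy the product. If the outage instead broke the whole page, that would be a failure mode worth fixing before it happens in production.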

What are the trade-offs?

Chaos engineering adds another layer of testing, so you can expect some additional expense. For some low-priority applications, you may decide it isn’t necessary.

And introducing failure on its own is not enough: you also need to invest time and money in fixing things you discover.

Finally, even a well-implemented chaos engineering policy isn’t foolproof. Complex distributed computer systems will still fail.

How is it being used?

Chaos engineering started out at Netflix in the form of Chaos Monkey. These days, few companies inject failures directly into production systems. Nonetheless, interest in chaos engineering has grown, and it is used by many enterprises that deploy distributed cloud applications.

The toolset around chaos engineering continues to grow and improve.