Reliability under abnormal conditions — Part One

Preparing Systems for the ‘100-year wave’.

Keeping complex distributed systems available to service customer requests under peak load is hard. The challenge is exacerbated by several factors: a growing number of services, servers and external integrations, combined with the rapid pace of new feature delivery; heavy spikes in load during annual peak periods; and traffic anomalies driven by promotions and external events. Luckily, there are strategies that limit the impact of problems, so you can keep serving customers and generating revenue even though it is not feasible to reduce the risk to zero.

Breaking down the problem

The two major dimensions to address are: preventing as many issues from arising as possible; and then limiting the impact of issues that do arise. Prevention is often described as increasing mean time between failures (MTBF), and mitigation is decreasing mean time to recovery (MTTR), though time may not be as important a measure as the impact on revenue or customer experience — more on that later.
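
As a rough illustration of how the two levers interact, the standard steady-state availability formula, MTBF / (MTBF + MTTR), can be sketched in a few lines. The function name and the sample figures are hypothetical, chosen only for the example:

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the system is up."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


if __name__ == "__main__":
    # Hypothetical figures: one failure every 500 hours, one hour to recover.
    baseline = availability(mtbf_hours=500, mttr_hours=1)
    # Prevention lever: failures happen half as often.
    better_prevention = availability(mtbf_hours=1000, mttr_hours=1)
    # Mitigation lever: recovery takes half as long.
    better_mitigation = availability(mtbf_hours=500, mttr_hours=0.5)
    print(f"baseline:          {baseline:.6f}")
    print(f"doubling MTBF:     {better_prevention:.6f}")
    print(f"halving MTTR:      {better_mitigation:.6f}")
```

Under this simple model, halving recovery time buys the same availability as doubling the time between failures, which is one reason mitigation deserves as much attention as prevention. The model says nothing about revenue or customer impact, which may matter more than raw uptime.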

For both prevention and mitigation, there are cost/benefit trade-offs. Cost is measured not just in dollars but also in delays to shipping new features, which is an opportunity cost. Ultimately, every organization needs to make its own judgment about the service level it’s willing to commit to, given the cost implications of achieving that service level. Even so, most organizations will strive to continuously lower the cost of supporting their desired service level. This two-part article explores the various strategies and techniques for doing that.

Most prevention techniques involve testing the system, or parts of it, before releasing to production. Major categories to cover include: testing for functional correctness; ability to perform under expected load; and resilience to foreseeable failures.

Mitigation involves limiting the breadth of impact, mainly through architectural patterns of isolation and graceful degradation of service, and limiting the duration of impact by improving time to notice, time to diagnose and time to push a fix.

There are also some hybrid strategies that straddle prevention and mitigation. Canary releasing to a subset of users is a type of prevention strategy but performed in production with the impact heavily mitigated. Likewise, the advanced technique of Chaos Engineering is an approach for testing and practicing prevention and mitigation approaches in a production environment.

The following diagram outlines the major categories:
Figure 1: Paths to improved reliability

Cost/benefit analysis

Most prevention strategies share a similarly shaped cost/benefit curve: early investments provide good results, but you eventually hit a point of diminishing returns, where catching ever more obscure issues in complex corner cases is no longer worth the cost, or even viable.
Figure 2: Levels of confidence gained from pre-production testing
Example 1: a system that connects with many partners to get real-time pricing information started receiving malformed XML responses from one partner. That caused an infinite recursion in an open source XML parser, which maxed out CPUs on multiple servers. The effort required to test enough permutations of malformed XML to catch this framework issue ahead of time was extreme.

Example 2: the server-side content caching of one application saw a Thundering Herd problem when all search engines started crawling the content in production. The problem was not caught in performance testing because the problem occurred only when the cache eviction timing coincided with search engine requests for content.
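
The article does not say how the Thundering Herd in Example 2 was resolved. Purely as an illustration, one commonly used mitigation is request coalescing: when a cache entry expires, only one caller regenerates it while concurrent callers wait for that result. The sketch below is a minimal version under that assumption; the class name `SingleFlightCache` and the loader function are hypothetical.

```python
import threading
import time
from typing import Any, Callable, Dict, Tuple


class SingleFlightCache:
    """TTL cache in which only one thread regenerates an expired key."""

    def __init__(self, loader: Callable[[str], Any], ttl_seconds: float) -> None:
        self._loader = loader
        self._ttl = ttl_seconds
        self._values: Dict[str, Tuple[float, Any]] = {}  # key -> (expires_at, value)
        self._locks: Dict[str, threading.Lock] = {}
        self._guard = threading.Lock()  # protects the per-key lock table

    def _lock_for(self, key: str) -> threading.Lock:
        with self._guard:
            return self._locks.setdefault(key, threading.Lock())

    def get(self, key: str) -> Any:
        entry = self._values.get(key)
        if entry is not None and entry[0] > time.monotonic():
            return entry[1]  # fresh hit, no locking needed
        with self._lock_for(key):
            # Re-check: another thread may have refreshed the key while we waited.
            entry = self._values.get(key)
            if entry is not None and entry[0] > time.monotonic():
                return entry[1]
            value = self._loader(key)
            self._values[key] = (time.monotonic() + self._ttl, value)
            return value


if __name__ == "__main__":
    calls = []

    def render_page(key: str) -> str:
        calls.append(key)  # count how often the expensive work runs
        time.sleep(0.05)   # stand-in for slow content generation
        return f"rendered:{key}"

    cache = SingleFlightCache(render_page, ttl_seconds=60)
    threads = [threading.Thread(target=cache.get, args=("home",)) for _ in range(20)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(f"20 concurrent requests, {len(calls)} regeneration(s)")
```

Without the per-key lock, every request arriving during the regeneration window would trigger its own expensive load, which is the pattern that is hard to reproduce in a performance test unless request timing happens to line up with cache eviction.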

As systems grow more complex — in terms of the number of components, infrastructure, users, integrations, and features — the cost of ensuring a fixed level of correctness and resilience increases to the point that the trade-off becomes less worthwhile.

Figure 3: Maintaining level of confidence as system complexity grows
Example 3: companies like Facebook and Uber have such scale and complexity in their production environments that attempting to replicate production for load testing is unrealistic. They tend to lean more heavily on testing performance in production, using techniques such as canary releases to minimize the impact of problems.

Overall recommendations
Our overall recommendations are:
  • Apply rigor to pre-production testing, but be aware of the curve of diminishing returns
  • Use layered strategies to improve the cost/benefit equation
  • Investigate options for load testing safely in production
  • Focus on mitigating the impact of problems in production, both by containing how far problems can spread and by improving the speed of noticing, diagnosing and responding to issues

Pre-production testing

While not all potential problems can be caught within reasonable budgets and timescales through pre-production testing, we still advise investing effort into mitigating basic risks through rigorous automated testing for correctness, performance and resilience, before releasing to production.

Functional testing

This article isn’t primarily concerned with the functional correctness of a system. Nevertheless, good unit and functional tests can head off many resilience problems by testing how components handle various error scenarios and whether they degrade gracefully.

Load testing

Pre-production load testing has three aims: to confirm that code changes haven’t had a negative impact on (localized) performance; to establish the load an individual node can support; and to measure the scaling efficiency of adding more resources. Together, the last two give you a baseline for capacity planning. The first two aims can be covered by running basic load testing in your CI/CD pipeline. Understanding the linearity (or lack thereof) in the scaling characteristics of your components will require a more specific set of tests, ones that observe the impact on throughput of your components as you add nodes, CPUs or other resources.

For load testing pre-production, you can use a tool like Gatling or Tsung to generate reasonably realistic loads against a full or partial set of deployed services. We recommend running these types of tests as part of your build pipeline, but since they likely take a while to run, they can often be run out of band in a fan-out/fan-in manner.
Figure 4: Recommended tests for your build pipeline
It is useful to test out your monitoring/observability capabilities during these load tests to check that you can identify the cause of bottlenecks. For example, network call monitoring (or code profiling) can catch n+1 problems in chatty service or database calls.

Capacity planning

With the advent of auto-scaling, it is often assumed that systems can scale linearly by adding extra nodes, as long as you follow some basic 12-factor approaches. Sadly, this is rarely true, so it is important to understand how the performance of your system, or elements of your system, responds to the addition of more resources. It is rarely a straight (linear) line; it will normally top out at some point through contention for resources; it may well start degrading as cross-talk (chattiness) increases quadratically with the addition of more nodes. Understanding and applying the Universal Scaling Law, potentially using tools like USL4J, can help with capacity planning and fixing issues that are leading to sub-linearity in scaling.
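
To make the shape of that curve concrete, here is a minimal sketch of the Universal Scaling Law in its usual form, X(N) = λN / (1 + σ(N − 1) + κN(N − 1)), where λ is single-node throughput, σ the contention coefficient and κ the coherency (cross-talk) coefficient. The coefficient values are invented for illustration; in practice they would be fitted to measured throughput, which is what a tool like USL4J is for.

```python
import math


def usl_throughput(n: int, lam: float, sigma: float, kappa: float) -> float:
    """Predicted throughput with n nodes under the Universal Scaling Law."""
    return lam * n / (1 + sigma * (n - 1) + kappa * n * (n - 1))


def peak_nodes(sigma: float, kappa: float) -> float:
    """Node count at which predicted throughput is highest."""
    if kappa == 0:
        return math.inf  # no coherency penalty: throughput never turns down
    return math.sqrt((1 - sigma) / kappa)


if __name__ == "__main__":
    # Hypothetical coefficients, not measurements from any real system.
    lam, sigma, kappa = 1000.0, 0.05, 0.001
    for n in (1, 2, 4, 8, 16, 32, 64):
        x = usl_throughput(n, lam, sigma, kappa)
        efficiency = x / (lam * n)  # 1.0 would be perfectly linear scaling
        print(f"{n:3d} nodes: {x:8.0f} req/s, efficiency {efficiency:.2f}")
    print(f"throughput peaks near {peak_nodes(sigma, kappa):.1f} nodes")
```

With these made-up coefficients, adding nodes beyond the peak reduces total throughput, so an auto-scaler that keeps adding capacity under load would make matters worse rather than better.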

Resilience testing

Testing how resilient a system will be to unknown issues is a tough problem. We mentioned how elements of resilience could be tested as part of your unit and functional tests if you include suitable tests for predictable “unhappy paths”. In part two, we’ll also look at testing resilience in production with Chaos Engineering approaches. There are a few other failure modes that can be tested relatively easily before getting to production. Your unit and functional tests should test that your application responds to network failures or errors in upstream systems in a graceful way that doesn’t propagate and amplify failures through the system.
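
As a minimal sketch of such an “unhappy path” test, the example below checks that a failing upstream call degrades to a stale cached value instead of propagating the error. Everything here is hypothetical (`UpstreamError`, `price_with_fallback`, the SKU and the prices); it only shows the shape such a test might take.

```python
import unittest
from typing import Callable, Dict, Optional


class UpstreamError(Exception):
    """Hypothetical error raised when the upstream pricing service fails."""


def price_with_fallback(
    fetch: Callable[[str], float],
    sku: str,
    last_known: Dict[str, float],
) -> Dict[str, object]:
    """Return a live price, or the last known price flagged as stale."""
    try:
        return {"sku": sku, "price": fetch(sku), "stale": False}
    except (UpstreamError, TimeoutError):
        # Degrade gracefully: serve what we have rather than fail the request.
        cached: Optional[float] = last_known.get(sku)
        return {"sku": sku, "price": cached, "stale": True}


class PriceFallbackTest(unittest.TestCase):
    def test_live_price_when_upstream_is_healthy(self) -> None:
        result = price_with_fallback(lambda sku: 9.99, "sku-1", {"sku-1": 8.50})
        self.assertEqual(result, {"sku": "sku-1", "price": 9.99, "stale": False})

    def test_stale_price_when_upstream_fails(self) -> None:
        def failing_fetch(sku: str) -> float:
            raise UpstreamError("partner returned malformed response")

        result = price_with_fallback(failing_fetch, "sku-1", {"sku-1": 8.50})
        self.assertEqual(result, {"sku": "sku-1", "price": 8.50, "stale": True})

    def test_timeout_is_contained(self) -> None:
        def slow_fetch(sku: str) -> float:
            raise TimeoutError("upstream did not answer in time")

        result = price_with_fallback(slow_fetch, "sku-2", {})
        self.assertTrue(result["stale"])
        self.assertIsNone(result["price"])


if __name__ == "__main__":
    unittest.main(argv=["price_fallback_test"], exit=False)
```

The point of the test is the contract, not the fallback policy itself: whatever a component does when its upstream fails, the behavior should be deliberate and verified rather than discovered in production.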

There are some other categories of problems like slow networks, low bandwidth, dropped packets and timeouts that can be simulated pre-production, but aren’t always caught by unit testing. Network conditioning tools can help simulate these issues. This is particularly useful for testing mobile applications, but can also be applied to inter-service communications.

Layered strategies

Multiple approaches can be layered to improve your chances of success. Key areas to investigate are:
  • Use contract testing to reduce the number of end-to-end integration tests needed
  • Use service virtualization (e.g. mountebank) to reduce the number of services that need to be deployed for a performance test and to allow you to simulate downstream latency using record and playback
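
To show the idea behind service virtualization, here is a minimal sketch of a stub downstream service that returns canned responses after an artificial delay. It uses only the Python standard library as a stand-in and is not mountebank's API; the helper `start_stub`, the path and the payload are hypothetical.

```python
import json
import threading
import time
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from typing import Dict


def start_stub(canned: Dict[str, dict], latency_seconds: float) -> ThreadingHTTPServer:
    """Start a local stub that serves canned JSON after a simulated delay."""

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self) -> None:
            time.sleep(latency_seconds)  # simulate downstream latency
            known = self.path in canned
            payload = canned[self.path] if known else {"error": "not stubbed"}
            body = json.dumps(payload).encode("utf-8")
            self.send_response(200 if known else 404)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, *args) -> None:
            pass  # keep test output quiet

    server = ThreadingHTTPServer(("127.0.0.1", 0), Handler)  # port 0: pick a free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


if __name__ == "__main__":
    stub = start_stub({"/prices/42": {"sku": 42, "price": 9.99}}, latency_seconds=0.2)
    url = f"http://127.0.0.1:{stub.server_address[1]}/prices/42"
    started = time.monotonic()
    with urllib.request.urlopen(url, timeout=2) as response:
        data = json.loads(response.read())
    elapsed = time.monotonic() - started
    print(f"got {data} after {elapsed:.2f}s")
    stub.shutdown()
    stub.server_close()
```

Pointing the service under test at a stub like this means the real dependency does not need to be deployed for a performance run, and raising the latency value shows how the caller behaves when a downstream service slows down.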

Figure 5: Layering multiple approaches delivers more benefit
Setting up and maintaining fully production-like environments can be costly and occasionally problematic. As a result, there are approaches for using production to gain confidence in how your evolving system will respond under load. We'll explore these approaches in more detail in Part Two along with the practices required to mitigate issues when they do arise.

Read Part Two here.

Many thanks to my colleagues who provided insights and feedback on drafts of this article: Zhamak Dehghani, Linda Goldstein, Joshua Jordan, Praful Todkar, Brandon Byars, Unmesh Joshi, Ken Mugrage, and Bill Codding.