ThoughtWorks
Reliability under abnormal conditions — Part One

Jonny LeRoy

Published: Nov 4, 2017

Preparing Systems for the ‘100-year wave’.

Keeping complex distributed systems available to service customer requests under peak load is hard. The challenge is exacerbated by a number of factors: a growing number of services, servers and external integrations combined with a rapid pace of new feature delivery; heavy spikes in load during annual peak periods; and traffic anomalies driven by promotions and external events. Luckily, there are strategies that help you keep serving your customers and generating revenue by limiting the impact of problems, even if it is not feasible to reduce the risk to zero.
 
“Here’s the thing: in distributed systems, or in any mature, complex application of scale built by good engineers, the majority of your questions trend towards the unknown-unknown. Debugging distributed systems looks like a long, skinny tail of almost-impossible things rarely happening. You can’t predict them all; you shouldn’t even try. You should focus your energy on instrumentation, resilience to failure, and make it fast and safe to deploy and roll-back (via automated canaries, gradual rollouts, feature flags, etc.).”
Charity Majors

Breaking down the problem

The two major dimensions to address are: preventing as many issues from arising as possible; and then limiting the impact of issues that do arise. Prevention is often described as increasing mean time between failures (MTBF), and mitigation is decreasing mean time to recovery (MTTR), though time may not be as important a measure as the impact on revenue or customer experience — more on that later.
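
To make the two levers concrete, one common way to combine them (an assumption here, not a formula used by this article) is steady-state availability, MTBF / (MTBF + MTTR). The sketch below uses made-up numbers to show that doubling MTBF and halving MTTR have the same effect on this particular measure; it says nothing about revenue or customer impact.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the system is up."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical figures, chosen only for illustration.
baseline = availability(mtbf_hours=500, mttr_hours=4)
better_prevention = availability(mtbf_hours=1000, mttr_hours=4)  # fail half as often
better_mitigation = availability(mtbf_hours=500, mttr_hours=2)   # recover twice as fast

for label, value in [
    ("baseline", baseline),
    ("better prevention", better_prevention),
    ("better mitigation", better_mitigation),
]:
    print(f"{label}: {value:.4%}")
```

Which lever is cheaper to pull depends on the system, which is the cost/benefit question the rest of the article explores.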

For both prevention and mitigation, there are cost/benefit trade-offs. Cost is measured not just in dollars, but also in the delays to push out new features — an opportunity cost. Ultimately, every organization needs to make its own judgment about the service level it’s willing to commit to, given the cost implications of achieving that service level. Even so, most organizations will strive to continuously lower the cost of supporting their desired service level. This two-part article explores the various strategies and techniques for doing that.

Prevention
Most prevention techniques involve testing the system, or parts of it, before releasing to production. Major categories to cover include: testing for functional correctness; ability to perform under expected load; and resilience to foreseeable failures.

Mitigation
Mitigation involves limiting the breadth of impact, mainly through architectural patterns of isolation and graceful degradation of service, and limiting the duration of impact by improving time to notice, time to diagnose and time to push a fix.

Hybrid
There are also some hybrid strategies that straddle prevention and mitigation. Canary releasing to a subset of users is a type of prevention strategy but performed in production with the impact heavily mitigated. Likewise, the advanced technique of Chaos Engineering is an approach for testing and practicing prevention and mitigation approaches in a production environment.

The following diagram outlines the major categories:
Figure 1: Paths to improved reliability


Cost/benefit analysis

Most prevention strategies share a similarly shaped cost/benefit curve: early investments provide good results, but you eventually hit a point of diminishing returns, where investing in catching ever more obscure issues in complex corner cases becomes less valuable and less viable.
Figure 2: Levels of confidence gained from pre-production testing

Example 1: a system that connects with many partners to get real-time pricing information started receiving malformed XML responses from one partner. That caused an infinite recursion in an open source XML parser, which maxed out CPUs on multiple servers. The effort required to test enough permutations of malformed XML to catch this framework issue ahead of time was extreme.

Example 2: the server-side content caching of one application saw a Thundering Herd problem when all search engines started crawling the content in production. The problem was not caught in performance testing because the problem occurred only when the cache eviction timing coincided with search engine requests for content.

As systems grow more complex — in terms of the number of components, infrastructure, users, integrations, and features — the cost of ensuring a fixed level of correctness and resilience increases to the point that the trade-off becomes less worthwhile.



Figure 3: Maintaining level of confidence as system complexity grows

Example 3: companies like Facebook and Uber have such scale and complexity in their production environments that attempting to replicate production for load testing is unrealistic. They tend to lean more heavily on testing performance in production, using techniques such as canary releases to minimize the impact of problems.
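
As an illustration of how a canary release can limit exposure, the sketch below routes a fixed percentage of users to a new version by hashing their ID. It is a minimal sketch under assumptions: the function name, the salt and the percentage are hypothetical, and real rollouts usually rely on the routing or feature-flag tooling already in place.

```python
import hashlib


def in_canary(user_id: str, canary_percent: float, salt: str = "new-checkout") -> bool:
    """Deterministically place a user in the canary group.

    Hashing (salt, user_id) gives each user a stable bucket from 0 to 9999,
    so the same user always sees the same version for a given release.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < canary_percent * 100


# Roughly 5% of a hypothetical user population lands in the canary group.
users = [f"user-{i}" for i in range(10_000)]
canary_share = sum(in_canary(u, canary_percent=5) for u in users) / len(users)
print(f"share of users in canary: {canary_share:.2%}")
```

Changing the salt for each release avoids always exposing the same users to new code first.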

Overall recommendations
Our overall recommendations are:
  • Apply rigor to pre-production testing, but be aware of the curve of diminishing returns
  • Use layered strategies to improve the cost/benefit equation
  • Investigate options for load testing safely in production
  • Focus on mitigating the impact of problems in production by both containing how far problems can spread and improving the speed of noticing, diagnosing and responding to issues

Pre-production testing

While not all potential problems can be caught within reasonable budgets and timescales through pre-production testing, we still advise investing effort into mitigating basic risks through rigorous automated testing for correctness, performance and resilience, before releasing to production.

Functional testing

This article isn’t primarily concerned with the functional correctness of a system. Nevertheless, good unit and functional tests can head off many resilience problems by testing how components handle various error scenarios and whether they degrade gracefully.

Load testing

In pre-production testing, you want to ensure that code changes haven’t had a negative impact on (localized) performance, and also to get a baseline for capacity planning by understanding the load an individual node can support and then the scaling efficiency of adding more resources. The first two (checking for performance regressions and measuring what a single node can support) can be covered by running basic load testing in your CI/CD pipeline. Understanding the linearity (or lack thereof) in the scaling characteristics of your components will require a more specific set of tests, ones that observe the impact on throughput of your components as you add nodes, CPUs or other resources.
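
A basic pipeline check of this kind can be sketched as follows. This is not a replacement for a dedicated load testing tool: `run_load`, `fake_endpoint` and the latency budget are hypothetical names and values, and the target here is an in-process function rather than a deployed service.

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor


def run_load(target, requests: int, concurrency: int) -> dict:
    """Call `target` `requests` times from `concurrency` threads and summarize."""

    def timed_call(_):
        start = time.perf_counter()
        target()
        return time.perf_counter() - start

    started = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(timed_call, range(requests)))
    elapsed = time.perf_counter() - started

    return {
        "requests": len(latencies),
        "throughput_rps": requests / elapsed,
        # quantiles(n=20) returns 19 cut points; the last one is the 95th percentile.
        "p95_s": statistics.quantiles(latencies, n=20)[18],
    }


def fake_endpoint():
    """Stand-in for a call to the service under test (assumption)."""
    time.sleep(0.005)


report = run_load(fake_endpoint, requests=50, concurrency=10)
P95_BUDGET_S = 0.5  # hypothetical budget; a real one comes from your own baseline
if report["p95_s"] > P95_BUDGET_S:
    raise SystemExit(f"p95 latency {report['p95_s']:.3f}s exceeds budget")
print(report)
```

Recording the report for every build gives the baseline against which a later regression becomes visible.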

For load testing pre-production, you can use a tool like Gatling or Tsung to generate reasonably realistic loads against a full or partial set of deployed services. We recommend running these types of tests as part of your build pipeline, but since they likely take a while to run, they can often be run out of band in a fan-out/fan-in manner.
Figure 4: Recommended tests for your build pipeline

It is useful to test out your monitoring/observability capabilities during these load tests to check that you can identify the cause of bottlenecks. For example, network call monitoring (or code profiling) can catch n+1 problems in chatty service or database calls.
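
The n+1 pattern is easy to see once calls are counted. The sketch below uses a hypothetical in-memory store that counts round trips, standing in for whatever your network call monitoring would report; all class and function names are made up for illustration.

```python
class CountingOrderStore:
    """Hypothetical stand-in for a database or downstream service.

    Every method call represents one network round trip.
    """

    def __init__(self, items_by_order):
        self._items = items_by_order
        self.calls = 0

    def list_order_ids(self):
        self.calls += 1
        return list(self._items)

    def get_items(self, order_id):
        self.calls += 1
        return self._items[order_id]

    def get_items_bulk(self, order_ids):
        self.calls += 1
        return {order_id: self._items[order_id] for order_id in order_ids}


def total_items_chatty(store):
    # One call for the IDs plus one call per order: the n+1 pattern.
    return sum(len(store.get_items(oid)) for oid in store.list_order_ids())


def total_items_batched(store):
    # One call for the IDs plus one bulk call, however many orders there are.
    order_ids = store.list_order_ids()
    return sum(len(items) for items in store.get_items_bulk(order_ids).values())


data = {f"order-{i}": ["a", "b"] for i in range(50)}
chatty, batched = CountingOrderStore(data), CountingOrderStore(data)
total_items_chatty(chatty)
total_items_batched(batched)
print(f"chatty calls: {chatty.calls}, batched calls: {batched.calls}")
```

Both versions return the same answer, so functional tests pass either way; only the call count gives the problem away.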

Capacity planning

With the advent of auto-scaling, it is often assumed that systems can scale linearly by adding extra nodes, as long as you follow some basic 12-factor approaches. Sadly, this is rarely true, so it is important to understand how the performance of your system, or elements of your system, responds to the addition of more resources. It is rarely a straight (linear) line; it will normally top out at some point through contention for resources; it may well start degrading as cross-talk (chattiness) increases quadratically with the addition of more nodes. Understanding and applying the Universal Scaling Law, potentially using tools like USL4J, can help with capacity planning and fixing issues that are leading to sub-linearity in scaling.
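
The Universal Scaling Law models throughput at N nodes as C(N) = λN / (1 + α(N − 1) + βN(N − 1)), where λ is single-node throughput, α captures contention and β captures coherency cost (the cross-talk). The sketch below evaluates that formula directly; the parameter values are invented for illustration, whereas in practice they are fitted to measurements from your own scaling tests.

```python
import math


def usl_throughput(n: int, lam: float, alpha: float, beta: float) -> float:
    """Predicted throughput at n nodes under the Universal Scaling Law."""
    return lam * n / (1 + alpha * (n - 1) + beta * n * (n - 1))


def usl_peak_nodes(alpha: float, beta: float) -> float:
    """Node count at which predicted throughput peaks (requires beta > 0)."""
    return math.sqrt((1 - alpha) / beta)


# Hypothetical parameters: 1000 req/s per node, 5% contention, 0.1% coherency cost.
LAM, ALPHA, BETA = 1000.0, 0.05, 0.001

for nodes in (1, 8, 16, 31, 64):
    print(f"{nodes:>3} nodes -> {usl_throughput(nodes, LAM, ALPHA, BETA):8.0f} req/s")
print(f"throughput peaks near {usl_peak_nodes(ALPHA, BETA):.1f} nodes")
```

With these numbers, adding nodes beyond roughly 31 reduces total throughput, which is the degradation described above.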

Resilience testing

Testing how resilient a system will be to unknown issues is a tough problem. We mentioned how elements of resilience could be tested as part of your unit and functional tests if you include suitable tests for predictable “unhappy paths”. In part two, we’ll also look at testing resilience in production with Chaos Engineering approaches. There are a few other failure modes that can be tested relatively easily before getting to production. Your unit and functional tests should test that your application responds to network failures or errors in upstream systems in a graceful way that doesn’t propagate and amplify failures through the system.
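
A unit test for one such unhappy path might look like the sketch below. The pricing function, the partner client and the fallback behavior are all hypothetical; the point is only that the upstream failure is simulated with a mock and the test asserts a degraded answer instead of a propagated exception.

```python
import unittest
from unittest import mock


def get_price(partner_client, sku, cached_prices):
    """Return a live price, or degrade to a cached one if the partner fails."""
    try:
        return {"price": partner_client.fetch_price(sku), "stale": False}
    except (TimeoutError, ConnectionError):
        # Degrade gracefully rather than passing the failure upstream.
        return {"price": cached_prices.get(sku), "stale": True}


class GetPriceTests(unittest.TestCase):
    def test_returns_live_price_when_partner_is_healthy(self):
        partner = mock.Mock()
        partner.fetch_price.return_value = 12.5
        self.assertEqual(
            get_price(partner, "sku-1", {}), {"price": 12.5, "stale": False}
        )

    def test_falls_back_to_cache_on_timeout(self):
        partner = mock.Mock()
        partner.fetch_price.side_effect = TimeoutError("partner too slow")
        self.assertEqual(
            get_price(partner, "sku-1", {"sku-1": 11.0}),
            {"price": 11.0, "stale": True},
        )

    def test_reports_missing_price_when_nothing_is_cached(self):
        partner = mock.Mock()
        partner.fetch_price.side_effect = ConnectionError("connection refused")
        self.assertEqual(
            get_price(partner, "sku-1", {}), {"price": None, "stale": True}
        )
```

Saved to a file, these tests can be run with `python -m unittest` alongside the rest of the suite.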

There are some other categories of problems like slow networks, low bandwidth, dropped packets and timeouts that can be simulated pre-production, but aren’t always caught by unit testing. Network conditioning tools can help simulate these issues. This is particularly useful for testing mobile applications, but can also be applied to inter-service communications.

Layered strategies

Multiple approaches can be layered to improve your chances of success. Key areas to investigate are:
  • Use contract testing to reduce the number of end-to-end integration tests needed
  • Use service virtualization (e.g. mountebank) to reduce the number of services that need to be deployed for a performance test and to allow you to simulate downstream latency using record and playback
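
The idea behind contract testing can be sketched in a few lines: the consumer records the fields it actually relies on, and the provider's build checks its responses against that record without deploying both services together. This is a deliberately simplified illustration with hypothetical names; dedicated contract testing tools cover far more (requests, status codes, versioning).

```python
# The contract: fields and types this consumer depends on (hypothetical).
CONSUMER_CONTRACT = {"sku": str, "price": float, "currency": str}


def contract_violations(response: dict, contract: dict) -> list:
    """List the ways a provider response breaks the consumer's expectations.

    Extra fields in the response are allowed; only missing or mistyped
    fields that the consumer relies on count as violations.
    """
    problems = []
    for field, expected_type in contract.items():
        if field not in response:
            problems.append(f"missing field: {field}")
        elif not isinstance(response[field], expected_type):
            actual = type(response[field]).__name__
            problems.append(f"{field}: expected {expected_type.__name__}, got {actual}")
    return problems


# Run in the provider's pipeline against a sample of its real output.
ok_response = {"sku": "sku-1", "price": 19.99, "currency": "USD", "region": "EU"}
broken_response = {"sku": "sku-1", "price": "19.99"}

print(contract_violations(ok_response, CONSUMER_CONTRACT))
print(contract_violations(broken_response, CONSUMER_CONTRACT))
```

Because the check runs against the provider alone, a breaking change is caught without an end-to-end environment.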

Figure 5: Layering multiple approaches delivers more benefit

Setting up and maintaining fully production-like environments can be costly and occasionally problematic. As a result, there are approaches for using production to gain confidence in how your evolving system will respond under load. We'll explore these approaches in more detail in Part Two along with the practices required to mitigate issues when they do arise.

Read Part Two here.

Many thanks to my colleagues who provided insights and feedback on drafts of this article: Zhamak Dehghani, Linda Goldstein, Joshua Jordan, Praful Todkar, Brandon Byars, Unmesh Joshi, Ken Mugrage, and Bill Codding.
