ThoughtWorks
Risk Management for Engineering Resilience

Anthony O'Connell

Published: Dec 23, 2018

Risk is the intentional interaction with uncertainty. It is a consequence of a decision or an action taken (or inaction) in spite of that uncertainty. Reputational risk, risk of financial losses, business continuity, and failure to evolve are all types of risk which impact organisations today.

In product engineering, risk can be understood as a combination of three things: the severity of the effect of a potential failure, the likelihood or chance of the failure occurring, and the ability to detect the failure (or effect of the failure) if it does occur.

A failure is the inability of a system to perform its normal, intended function over its specified life or comply with an intended requirement. Failures can cost time, money, reputation, resources, and even the health and wellbeing of the people involved.

Understanding risk

Risk needs to be deeply understood both from the perspective of how design decisions we make create risk and how we manage risk once a decision is made and the risk is built into a system. The definition of a system here can be taken as a discrete product or a business process. Be aware that we are not talking here about what the possible causes of risks might be, only the potential for the system to fail (or produce undesirable outcomes) based on the decisions we make.

It is important to understand that systems, by themselves, do not fail; components or individual functions within a system fail, and those failures result in system failure. To be effective, risk analysis needs to be conducted both at the system level and at the component or function level, so that we understand how and why a component or function could fail.

When we make fundamental engineering or architectural decisions, each decision carries the risk of failure, dependent on the environment within which the system exists and how it is being used. This happens whether we know it or not.

While no-one has a crystal ball to say exactly how and when a system will fail, we can apply logical thinking and past experience, together with risk prediction tools, to help understand the likelihood of different types of failures. We can then identify appropriate controls, both prevention and detection, to help mitigate or reduce the risk.

Analysing risk

In product engineering, risk analysis is a formal process conducted by the team responsible for both delivery and run. It cannot be delegated to a risk team or a risk function. To be effective, risk analysis needs to be as objective as possible so that there is consistency of evaluation (across teams and projects and over time) and there is a way to compare risks for prioritisation of action.

To this end, an agreed set of principles, which express 'good' architectural attributes, along with descriptive risk ratings, are needed to rank each risk. Descriptive ratings help avoid, or at least minimise, biases and fallacies - confirmation bias and false equivalence are two common ones - often present in evaluating risk.

Good risk analysis is done from the perspective of both the design and architectural decisions we need to make and the run conditions in which we intend to operate. Often, potential risk identified in one can be used to inform the other. The approach to analysis is common to both phases, in that similar questions are posed to evaluate risk, but there are differences in how we evaluate the likelihood of a failure and determine our ability to quickly and accurately detect a failure if it were to occur.

Controls in the design and delivery phase and the run phase are different (although our experience is that most high-performing components make little distinction between design, build and run - they are a continuum). However, both phases value a combination of prevention controls (controls that prevent the cause of the potential failure) and detection controls (controls that identify the effect of the failure or the failure itself). This is especially true where we are trying to mitigate critical risks, risks that, when realised, represent an unacceptable level of failure and loss, especially to an end user.

Examples of controls include organisational design rules and architectural principles, good engineering standards and practices, and information obtained from corrective and preventive actions from previously solved problems (lessons learned).
 
Design influences 70% of a system's performance, availability, security, and compliance; run, by itself, influences only 30%

Mitigating risk

The best way to mitigate risks, especially design risks, is to increase our ability to make better decisions. The risk management process utilises as much previous knowledge and experience as possible to answer the following questions about each function of a system compared to the desired state of each function.
  • How could the component or function within the system fail? What does failure look like?
  • How bad might the failure be (from the perspective of any customer reasonably expected to experience the failure)?
  • What are the possible root causes of the failure?
  • What is the likelihood of the failure occurring, given the design decisions we are about to make and the run conditions and noise factors the system is expected to experience?
  • How good are our controls (for example, automated testing or observability and automated correlation of events) at detecting the failure or the effects of failure?
  • What, if anything, should we do to reduce or mitigate the risk? How can we improve the controls to reduce the chance of a failure occurring and/or increase our ability to detect a failure?
Laid out as a process, the questions are answered in a logical sequence. Severity and occurrence ratings are multiplied together to give a criticality (CRIT) score, and severity, occurrence and detection ratings are multiplied together to give a risk priority number (RPN). These scores help prioritise which risks we should address first.

[Figure: Risk analysis and mitigation process]

Criticality (CRIT) = Severity (S) x Occurrence (O)
Risk Priority Number (RPN) = Severity (S) x Occurrence (O) x Detection (D)
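
As an illustration, the two scores can be sketched in a few lines of Python. The 1 to 10 rating scale, the convention that a higher detection rating means the failure is harder to detect, and the names Risk and prioritise are assumptions made for this example; the article does not prescribe them, and the example ratings are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Risk:
    """One potential failure with its three ratings (assumed scale: 1 to 10)."""
    failure_mode: str
    severity: int    # assumed: 1 = negligible effect, 10 = catastrophic
    occurrence: int  # assumed: 1 = very unlikely, 10 = almost certain
    detection: int   # assumed: 1 = almost always detected, 10 = undetectable

    def __post_init__(self):
        # Reject ratings outside the assumed scale so scores stay comparable.
        for name in ("severity", "occurrence", "detection"):
            value = getattr(self, name)
            if not 1 <= value <= 10:
                raise ValueError(f"{name} must be between 1 and 10, got {value}")

    @property
    def criticality(self) -> int:
        # CRIT = Severity x Occurrence
        return self.severity * self.occurrence

    @property
    def rpn(self) -> int:
        # RPN = Severity x Occurrence x Detection
        return self.severity * self.occurrence * self.detection


def prioritise(risks):
    """Return risks ordered from highest to lowest RPN, using CRIT to break ties."""
    return sorted(risks, key=lambda r: (r.rpn, r.criticality), reverse=True)


# Hypothetical ratings, for illustration only.
risks = [
    Risk("Slow login page", severity=4, occurrence=6, detection=2),
    Risk("Expired security certificate", severity=8, occurrence=3, detection=7),
]
for risk in prioritise(risks):
    print(risk.failure_mode, risk.criticality, risk.rpn)
```

Both example risks have the same criticality (24), but the expired certificate ranks first because it is much harder to detect (RPN 168 against 48). This is the kind of distinction the detection rating is meant to surface.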


These questions prompt us to consider the conditions in which the system and its functions need to operate and the noise factors that the system will be subjected to.

Noise factors are anything that interferes with a system performing its normal, intended functions and can include things like unexpected loads from other services, security patching and updates, hardware faults and failures, and denial of service attacks. With a little imagination, we can quickly develop a list of noise factors that we should consider when making decisions that affect how the system is built and operated.

Where we are able to identify the possible causes of a serious failure, we can introduce or improve controls that work to prevent the cause from occurring in the first place. And where a failure is potentially catastrophic, we should consider both controls to prevent the failure and controls to detect it as early as possible so that we minimise the chance of the failure affecting customers.
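
A rough sketch of how this rule could be checked against a list of risks follows. The names ControlType, Control, RiskEntry and missing_control_types, and the severity threshold of 9 used to mark a failure as potentially catastrophic, are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass, field
from enum import Enum


class ControlType(Enum):
    PREVENTION = "prevention"  # stops the cause of the failure from occurring
    DETECTION = "detection"    # identifies the failure or its effect


@dataclass
class Control:
    description: str
    kind: ControlType


@dataclass
class RiskEntry:
    failure: str
    severity: int  # assumed scale: 1 (negligible) to 10 (catastrophic)
    controls: list = field(default_factory=list)


def missing_control_types(entry, critical_severity=9):
    """List the control types a potentially catastrophic risk still lacks.

    Risks below the (assumed) critical threshold return an empty list,
    because this sketch only enforces the rule for catastrophic failures.
    """
    if entry.severity < critical_severity:
        return []
    present = {control.kind for control in entry.controls}
    return [kind for kind in ControlType if kind not in present]
```

For example, a risk rated 10 for severity that has only a prevention control would return [ControlType.DETECTION], signalling that a detection control should still be considered.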

Let's use a simple example to illustrate the difference.

If a specific function of a software component is to provide secure authentication between a user and a service, there is a risk of the loss of integrity of the user's authentication information (the potential failure of the function of the component). One possible cause of this loss of integrity could be missing, weak or incomplete encryption of information in transit.

Therefore, a decision is made in the design phase to use a security certificate issued by a Certificate Authority for end-to-end encryption of the user's authentication information. Based on the decision to use a digital certificate (the intended design control), the likelihood of the potential failure occurring is then evaluated using the rating system. If the risk is still deemed too high after the evaluation, further mitigation may be required.

Correspondingly, in the operation of that same function in the run environment, the same failure is considered (the potential loss of the integrity of the user's authentication information), but in this case a possible cause of the failure might be determined to be an expired security certificate (no encryption).

Assuming there are no existing controls in place to address this risk, the risk of the expired security certificate could be mitigated by the planned introduction of an automated check and notification script to flag a certificate when it is less than 1 month from expiry. The service management plan would need to be updated to incorporate the new control.
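
A minimal sketch of such a check, using only the Python standard library, might look like the following. The function names, the 30-day interpretation of "1 month" and the idea of passing the current time in as a parameter are assumptions made for this example; how the notification is sent is left out because it depends on the tooling already in place.

```python
import socket
import ssl
from datetime import datetime, timezone


def fetch_not_after(host, port=443, timeout=5.0):
    """Return the 'notAfter' field of the certificate presented by host.

    Note: the default context verifies the certificate, so an already
    expired certificate raises ssl.SSLCertVerificationError here. A caller
    should treat that exception as an alert in its own right.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_until_expiry(not_after, now):
    """Days between 'now' (timezone-aware) and the certificate expiry time."""
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires - now).total_seconds() / 86400


def needs_renewal(not_after, now, threshold_days=30):
    """True when the certificate is less than threshold_days from expiry."""
    return days_until_expiry(not_after, now) < threshold_days
```

A scheduled job could call fetch_not_after for each monitored host (a network call), pass the result to needs_renewal together with datetime.now(timezone.utc), and raise a notification whenever it returns True.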

Problem solving as a preventive activity in risk management

When failures do occur in live systems, proper problem solving (with deliberate root cause analysis) can be used to feed information back into future risk analysis. The actual root causes of failures are found and verified, and that knowledge can be used to inform future decisions that carry a risk of similar failures.

This information becomes proprietary knowledge about how systems actually work in the real world and it increases the accuracy of predictions of future failures for better decision-making and better delivery and run controls.

The knowledge must be centrally stored within the organisation and must be accessible for use in all new projects and incidents. Over time and across projects, this knowledge grows and is supported by real-world experience. This, in turn, reduces risk and increases the resilience of systems that we build to survive in the wild.

Risk needs to be deeply understood, both from the perspective of how design decisions we make create risk and how we manage risk once a decision is made and the risk is built into a system. Like problem-solving skills, good risk management tools and techniques are core skills, essential to anyone involved in architectural and engineering decisions and product build. The better we understand how what we do affects risk and, by extension, the experience of the customers who use the systems we build, the more resilient we can make those systems.


 
