This flowchart draws insights from the asset library and reviews whether appropriate controls are in place or need to be created in the context of the story.
In the following, we'll highlight ways to do threat modeling, depending on your desired outcome. It is good to know the advantages and drawbacks of these models ahead of time, especially when you are looking towards working with security professionals.
Agile Threat Modeling
Our goal is to find the highest value security work we can do, and get it into the team’s backlog right away. We do this by applying a timebox so we are threat modelling “little and often”. We capture a new and different partial view of the system each time we do threat modelling rather than overthinking it. Over time, as we try lots of perspectives and zoom levels on the system, threat modelling becomes an agile continuous process!
- Ask participants to draw a technical diagram of the agreed scope
- Highlight what (data, services, assets) we need to protect
- Evil Brainstorming of threats based on STRIDE
- Prioritise by voting for riskiest threats
- Work on the top three threats and define actions (see below)
Brainstorming with STRIDE is quick and flexible enough to extend your existing ways of working: you simply ask ‘what can go wrong?’. The STRIDE categorization, popularized by Adam Shostack, is used to examine the current iteration’s functionality delta and whether it can be attacked or otherwise broken by applying the following attack patterns. STRIDE is an acronym for:
- Spoofing identity allows attackers to do things they are not supposed to do by posing as someone else. Key Concepts are Identity, Authentication.
- Tampering with input, e.g. by modifying data submitted to your system, can break trust boundaries and alter the control-flow decisions in your system. Key Concepts are Validation, Integrity, Injection.
- Repudiation of action allows actors to use ambiguity to successfully dispute that they have committed an action, which means they cannot be held accountable for their actions. Key Concepts are Non-Repudiation, Logging, Audit.
- Information disclosure threats involve the exposure of information to, or its interception by, unauthorised individuals. Key Concepts are Confidentiality, Encryption, Leakage, Man in the middle.
- Denial of Service attacks work by flooding, wiping or otherwise breaking a particular service or system, making it unavailable. Key Concepts are Availability, Botnets, DDoS / DDoSaaS.
- Elevation of Privilege attacks are possible when authorization boundaries are missing or inadequate, allowing a user to gain higher privileges than they should have. Key Concepts are Authorization, Isolation, Blast radius, Remote Code Execution.
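As a rough illustration of the session steps above, the sketch below tags brainstormed threats with a STRIDE category and tallies votes to pick the top three. The threat descriptions, the votes, and the helper name `top_threats` are hypothetical; a real session would more likely use sticky notes and dot voting.

```python
from collections import Counter
from enum import Enum


class Stride(Enum):
    """The six STRIDE categories, in acronym order."""
    SPOOFING = "Spoofing"
    TAMPERING = "Tampering"
    REPUDIATION = "Repudiation"
    INFORMATION_DISCLOSURE = "Information disclosure"
    DENIAL_OF_SERVICE = "Denial of service"
    ELEVATION_OF_PRIVILEGE = "Elevation of privilege"


def top_threats(votes, limit=3):
    """Return the `limit` most-voted threats; ties keep first-seen order."""
    return [threat for threat, _count in Counter(votes).most_common(limit)]


# Hypothetical output of an evil-brainstorming round.
threats = {
    "Attacker replays a stolen session token": Stride.SPOOFING,
    "Order total is modified in the request body": Stride.TAMPERING,
    "Admin actions leave no audit trail": Stride.REPUDIATION,
    "Stack traces are returned to the client": Stride.INFORMATION_DISCLOSURE,
}

# One list entry per vote cast by a participant (made-up data).
votes = [
    "Order total is modified in the request body",
    "Attacker replays a stolen session token",
    "Order total is modified in the request body",
    "Stack traces are returned to the client",
    "Attacker replays a stolen session token",
    "Order total is modified in the request body",
]

for threat in top_threats(votes):
    print(f"{threats[threat].value}: {threat}")
```

Threats that received no votes simply drop out, which matches the intent of the timebox: only the riskiest items reach the backlog.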
Please be rational when coming up with attacks. Many threat modeling frameworks advise you to name a threat actor. Understanding the characteristics of threat actors and building some actor personae requires insight into their identity, relationship, motive, intent, capability. Personal experience shows that this is rather hard and work-intensive, but does not yield better results. Unless you’re likely to be attacked by the NSA, just keep it rational.
Map and order the findings according to their impact as explained in the risk profiling section: Take a moment to reflect on what would happen to the business if this actually came to pass. Are you going out of business? Will you have to spend a day restoring your database? Will you lose a critical competitive advantage? Or is it possible that your elaborate disaster scenario actually turns out to be a minor nuisance at best?
The top threats can be used as:
- Additional acceptance criteria on an existing user story
- Security debt that is tracked in a shared place, e.g. a radiator on the wall of the team space
- Changes to the team’s definition of done
- Timeboxed spikes to determine if we are really vulnerable
- Epics to implement significant security safeguards
If the agile threat modeling approach is not for you, tabletop exercises are a good starting point for inexperienced teams and a way to understand the security risks hidden in tech debt.
In tabletop-style threat modeling, a team is confronted with one or more disaster scenarios and lists all the countermeasures necessary to cover the five functions of NIST’s Cybersecurity Framework: identify, protect, detect, respond and recover. This technique draws on pre-existing knowledge and experience of the team members from other engagements, and tries to map them to the current tech stack and architecture.
Sources for scenarios are likely to incorporate:
- Boundaries and interfaces between systems that might break
- Recovering from unavailability due to misconfiguration
- Recovering from unexpected data loss
- Gracefully degrading due to unavailability of third party services
- Detecting intrusion and data leakage from outside attackers
Examples of tabletop scenarios:
- A monitoring solution shows a large, sustained amount of outbound traffic, indicating data exfiltration
- A crypto-ransomware attack has occurred and infected a production system
- Someone checked in their AWS access keys and all of a sudden there's a lot of cryptocurrency mining going on
- A laptop with access to intellectual property, data, and credentials was stolen in an unlocked state while its owner was having coffee.
- An unpatched server has been pwned and is part of a botnet.
- Amazon sends an EC2 abuse notice saying that your EC2 instance has been SSH brute forcing targets outbound.
- InfoSec sends you a message saying that your user account failed to authenticate over 1000 times in a 24 hour time frame.
Attack Trees are recommended when focussing on a critical component in the context of high-risk-high-yield assets, as well as in digital forensics. They are a methodology for analyzing the security of systems that allows for top-down discovery of attack vectors in a tree-like structure. They are very labor-intensive and require expert knowledge, but have limited payoff.
During conversations with security practitioners, a common theme was that attack trees lend themselves to waterfall analysis and upfront design in high assurance environments, and should be avoided by agile teams.
Development Phase
Many security controls in the development phase can be part of your automated CI/CD practices and contribute to overall system stability. Examples like test pyramids and feature toggles are well-known and well-documented. However, a seldom discussed topic is scanning dependencies, libraries, frameworks, etc., despite them constituting the lion’s share of the codebase at runtime.
Semantic Versioning gives version numbers a meaning that package managers build on to define version ranges for floating dependencies, within which new releases can be automatically integrated. The expectation for floating version ranges is that upon the next build, bug fixes are automatically pulled from upstream and non-breaking updates are integrated without manual intervention in configuration. This is a reasonable assumption, especially since most developers work exclusively fix-forward, instead of backporting fixes to older releases. However, blindly integrating new, untested releases can result in broken builds and runtimes, non-deterministic builds, and the “works on my machine” problem. Tools like Greenkeeper are a sign of increased attention to dependencies, as they integrate new versions and bug fixes automatically if the test suite executed by your CI is green. When your CI is green, deployment is the next step. Fitness for production is then usually tested through a blue/green deployment, which can be easily discarded or rolled back in case of an error. A closely related practice is the canary deployment, in which only a small share of traffic is routed to the new version at first. Should you not be using blue/green deployments, please make sure that you have automated rollbacks in place.
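To make the idea of a floating version range concrete, here is a minimal sketch of the caret-range convention used by npm (`^1.2.3` accepts any release that keeps the left-most non-zero component unchanged). The function names are made up for illustration, and pre-release or build tags are deliberately ignored, so this is not a replacement for a real range resolver.

```python
def parse(version: str) -> tuple:
    """Split a plain 'MAJOR.MINOR.PATCH' string into a tuple of integers."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def satisfies_caret(version: str, base: str) -> bool:
    """Check whether `version` falls inside the caret range `^base`."""
    candidate, floor = parse(version), parse(base)
    if candidate < floor:
        return False  # older than the declared minimum
    if floor[0] > 0:
        return candidate[0] == floor[0]  # ^1.2.3 -> >=1.2.3 <2.0.0
    if floor[1] > 0:
        return candidate[:2] == floor[:2]  # ^0.2.3 -> >=0.2.3 <0.3.0
    return candidate == floor  # ^0.0.3 -> exactly 0.0.3


# Which upstream releases would the next build pull in for "^1.2.3"?
for release in ["1.2.2", "1.2.3", "1.9.0", "2.0.0"]:
    print(release, satisfies_caret(release, "1.2.3"))
```

The range only encodes the publisher's promise of compatibility; whether a newly matched release actually works is still up to your test suite.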
As time goes on and new vulnerabilities are discovered, we need to safeguard existing builds and live systems against a growing number of vulnerabilities in our libraries and frameworks. A prominent example is Equifax’s Apache Struts vulnerability staying unpatched for months. Tools like OWASP Dependency-Check or npm audit scan dependencies for published “Common Vulnerabilities and Exposures” (CVEs), security vulnerabilities that are published by the Mitre Corporation and other CVE Numbering Authorities, as well as other sources of published weaknesses, e.g. findings on HackerOne or NIST’s NVD.
Like libraries and frameworks, containers and runtime environments are also subject to vulnerabilities and need to be inspected regularly. Tools like Clair and JFrog Xray allow scanning the layers in containers for CVEs. Scanning of dependencies and components is universally accepted as a good practice, so package managers (npm) and repositories (JFrog, Quay.io) often come with these capabilities out of the box. Scanners like Snyk, Twistlock, or Aqua hook into your CI/CD and production environments and can provide these insights on an ongoing basis.
It is a good practice to continuously check, e.g. once per day, whether your dependencies are outdated or have had CVEs discovered. The recommendation is to run this check and its notifications on a separate pipeline that is executed as a cronjob, instead of running it only in your deployment pipeline. Not being able to deploy a fix for a production problem because the pipeline is blocked by a notification of a CVE or a newer version is not a good situation to be in. Another advantage of running a security scan as a cronjob is that it is not uncommon for microservices to remain untouched for months, during which the deployment pipeline would not run and so could not alert you about newly discovered vulnerabilities.
This gives us a good picture of the state of security in development. Now let’s look at live production support.
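The core of such a scheduled check can be sketched in a few lines. Everything here is an assumption made for illustration: the `Advisory` record, the placeholder identifiers, and the installed versions are invented, and a real pipeline would obtain both lists from a scanner such as the ones named above rather than hard-coding them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Advisory:
    identifier: str   # placeholder id, not a real CVE number
    package: str
    fixed_in: tuple   # first version that no longer contains the flaw


def find_vulnerable(installed, advisories):
    """Return advisories that apply to the currently installed versions."""
    return [
        advisory
        for advisory in advisories
        if advisory.package in installed
        and installed[advisory.package] < advisory.fixed_in
    ]


def run_scan(installed, advisories):
    """Print findings and return an exit code for the scheduled job."""
    findings = find_vulnerable(installed, advisories)
    for finding in findings:
        fixed = ".".join(str(part) for part in finding.fixed_in)
        print(f"{finding.identifier}: upgrade {finding.package} to {fixed}")
    return 1 if findings else 0


# Hypothetical data standing in for a lock file and an advisory feed.
INSTALLED = {"web-framework": (2, 3, 1), "json-parser": (1, 0, 4)}
ADVISORIES = [
    Advisory("EXAMPLE-0001", "web-framework", (2, 3, 5)),
    Advisory("EXAMPLE-0002", "json-parser", (1, 0, 2)),
]

# A scheduled job would hand this value to sys.exit() so that a non-zero
# status triggers the notification without touching the deployment pipeline.
exit_code = run_scan(INSTALLED, ADVISORIES)
```

Because the scan lives in its own job, a finding produces a notification but never stands between you and an urgent production fix.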
Deployment and live support
At any point in time, you need to be able to visualize the health of your system in a dashboard. This is not about response times and other technical measures, but the ability of your system to serve the business. Your system is built along a user journey or different business focus points for which it creates value. You need to visualize every single one of them on a dashboard simultaneously, one aspect per tile.
For example, you would want to know the number of interactions with a shopping basket in an ecommerce scenario. Should this number ever suddenly drop, you know there might be a problem you're not detecting.
The basis for any kind of insight into your production environment is structured logging. Instead of logging a string in natural language, you log an indexable data structure. The log output of your container is consumed by a collector, aggregated at a central place, and indexed, so that in the event of an incident, you can pull up the logs and have the pre-formulated search queries ready that allow you to inspect this data. JSON Lines is a convenient format for storing structured data that may be processed one record at a time. The so-called ELK stack is a popular log analysis toolchain.
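A minimal sketch of structured logging with Python's standard `logging` module follows. The formatter name `JsonLinesFormatter`, the `fields` key passed via `extra`, and the example event are choices made for this illustration, not an established convention.

```python
import json
import logging
import sys


class JsonLinesFormatter(logging.Formatter):
    """Render each log record as one JSON object per line (JSON Lines)."""

    def format(self, record):
        entry = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "event": record.getMessage(),
        }
        # Structured fields are passed via `extra={"fields": {...}}`.
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)


handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonLinesFormatter())

log = logging.getLogger("shop")
log.addHandler(handler)
log.setLevel(logging.INFO)

# Log an indexable data structure instead of a natural-language sentence.
log.info("basket_item_added", extra={"fields": {"basket_id": "b-42", "sku": "A1"}})
```

A collector can then index `event`, `basket_id`, and `sku` as separate fields, which is what makes pre-formulated search queries possible.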
Now that we can aggregate the logs of different systems, we need to understand how to correlate logs across systems. This is possible with the use of a trace ID. You assign each external request a unique trace ID, e.g. via an HTTP header. This ID is passed on to all services that are involved in handling the request, and it is included in all log messages. This allows you to pull up the interactions of different services to see which request caused which action or downstream error.
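The propagation described above could look roughly like this. The header name `X-Request-ID` is a common convention but an assumption here, and the helper names are hypothetical; in practice a web framework's middleware would call them for you.

```python
import uuid
from contextvars import ContextVar

TRACE_HEADER = "X-Request-ID"  # assumed header name, adjust to your setup
_trace_id: ContextVar[str] = ContextVar("trace_id", default="")


def begin_request(headers: dict) -> str:
    """Reuse the caller's trace ID, or mint one for a new external request."""
    # Note: plain dict lookup is case-sensitive, real HTTP headers are not.
    trace_id = headers.get(TRACE_HEADER) or uuid.uuid4().hex
    _trace_id.set(trace_id)
    return trace_id


def outgoing_headers(extra=None) -> dict:
    """Headers for a downstream call, carrying the current trace ID."""
    headers = dict(extra or {})
    headers[TRACE_HEADER] = _trace_id.get()
    return headers


def log_fields(**fields) -> dict:
    """Fields for a log entry, always including the current trace ID."""
    return {"trace_id": _trace_id.get(), **fields}


# An edge service receives a request without an ID and creates one ...
begin_request({})
print(log_fields(event="order_received"))

# ... and passes it on, so the downstream service logs the same ID.
downstream_request = outgoing_headers({"Accept": "application/json"})
begin_request(downstream_request)
print(log_fields(event="payment_started"))
```

Searching the aggregated logs for one `trace_id` value then returns every entry that belongs to the same external request, regardless of which service wrote it.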
The crux of the matter lies in alerting, or rather in distinguishing nominal from erroneous behavior. Having dashboards that show the behavior of your system from a business perspective will allow you to learn this over time, the same way that pilots are using the dials in their cockpit to safely pilot the airplane. Sending out alerts comes in handy when you understand the operating conditions of your services, so that you can automatically ping your team when one of your services operates outside of these parameters.
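One simple way to encode "operating conditions" is a range learned from recent history. The sketch below treats anything more than three standard deviations from the mean as out of range; the tolerance, the metric, and the sample numbers are assumptions for illustration, and real traffic with daily or weekly patterns would need a more careful baseline.

```python
from statistics import mean, pstdev


def operating_range(history, tolerance=3.0):
    """Derive (low, high) bounds from past observations of a metric."""
    centre = mean(history)
    spread = pstdev(history)
    return centre - tolerance * spread, centre + tolerance * spread


def out_of_range(value, history, tolerance=3.0):
    """True if `value` lies outside the learned operating range."""
    low, high = operating_range(history, tolerance)
    return value < low or value > high


# Hypothetical basket interactions per minute over the last few minutes.
recent = [100, 104, 98, 101, 97]

for observed in (102, 20):
    if out_of_range(observed, recent):
        print(f"ALERT: {observed} basket interactions/min is outside the usual range")
    else:
        print(f"ok: {observed} basket interactions/min")
```

The same check works in both directions: a sudden drop can point to a problem you are not detecting, and a sudden spike can point to abuse.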
When you receive an alert, you need to act. Looking at how emergency first responders work, Standard Operating Procedures (SOPs) can help keep things straight when things get hectic. An SOP is a concise playbook or checklist that contains the information necessary to deal with incidents. SOPs are not intended to be checklists that "dumb down" the problem at hand to the point where it can be handled by anyone. An SOP can contain procedures for restarting services, information on where to look for logs, or procedures for handling data. SOPs are not service manuals that deal with every eventuality, but are more akin to the emergency procedures that you see on an aircraft. The goal is to be able to react to incidents and outages largely independent of seniority, tenure, and experience.
The obvious examples are fire drills, as they show what to do to get out of the building. SOPs can take the form of paper-/wiki-based documentation, shell scripts, service interfaces, maintenance endpoints, and many more.
Apart from being used in live support, SOPs can be tested in tabletop exercises similar to the ones used for tabletop threat modeling. This will make sure that the SOPs are commonly understood, up to date, and actionable.
Also, set some resources aside to aggressively pay off tech debt, as this is the most common source of bugs. It helps to have a one-person rotating firefighter role that, when there’s no actual fire to fight, can pick up things from a tech debt wall and work on these many little tasks that rarely make it into a story. The emphasis is on "one person", since nothing drives familiarity with the ugly parts of a codebase like having to shepherd a production system for a day when you’re new to a team. Do not pair senior and junior people, as it is likely that the senior person will be in the driver's seat most of the time. Giving everyone the task of keeping the system operational for a day or two, regardless of experience and tenure, forces the team to take collective ownership of all parts of the code, regardless of how old or ugly the code is and who authored it.
Security is a culture, not a means to an end
Even if efforts around security are often well received within the organization, please note that many teams will be breaking new ground here. In order for the investment to pay off over a long period of time, it is beneficial to work closely with your business stakeholders. Only if the team understands their work in the context of the business can measures around security be explained, justified, and agreed upon. Of course, this requires the business side's willingness to invest in quality and security.
Now go forth and build securely!