When organizations move toward microservices, one of the main drivers is the hope for faster time to market. However, this aspiration only tends to be realized when services (and their supporting teams) are cleanly sliced along long-lived business domain boundaries. Otherwise meaningful features will naturally require tight coordination between multiple teams and services, introducing natural friction in competing roadmap prioritization. The solution to this problem is good domain modeling, and event storming has rapidly become one of our favorite methods for rapidly identifying the key concepts in a problem space and aligning a variety of stakeholders in the best way to slice potential solutions.
Fast feedback is one of our core values for building software. For many years, we've used the canary release approach to encourage early feedback on new software versions, while reducing the risk through incremental rollout to selected users. One of the questions regarding this technique is how to segment users. Canary releases to a very small segment (say 1%) of users can be a catalyst for change. While starting with a very small segment of users enables teams to get comfortable with the technique, capturing fast user feedback enables diverse teams to observe the impact of new releases and learn and adjust course as necessary—a priceless change in engineering culture. We call this, the mighty 1% canary.
Most organizations that don't have the resources to custom-build their software will select out-of-the-box or SaaS solutions to meet their requirements. All too often, however, these solutions tend to aggressively expand their scope to entangle themselves into every part of your business. This blurs integration boundaries and makes change less predictable and slow. To mitigate this risk, we recommend that organizations develop a clear target capability model and then employ a strategy we call Bounded Buy—that is, only select vendor products that are modular and decoupled and can be contained within the Bounded Context of a single business capability. This modularity and independent deliverability should be included in the acceptance criteria for a vendor selection process.
Maintaining proper control over sensitive data is difficult, especially when—for backup and recovery purposes—data is copied outside of a master system of record. Crypto shredding is the practice of rendering sensitive data unreadable by deliberately overwriting or deleting encryption keys used to secure that data. For example, an entire table of customer personal details could be encrypted using random keys for each record, with a different table storing the keys. If a customer exercised their "right to be forgotten," we can simply delete the appropriate key, effectively "shredding" the encrypted data. This technique can be useful where we're confident of maintaining appropriate control of a smaller set of encryption keys but less confident about control over a larger data set.
The State of DevOps report, first published in 2014, states that high-performing teams create high-performing organizations. Recently, the team behind the report released Accelerate, which describes the scientific method they've used in the report. A key takeaway of both are the four key metrics to support software delivery performance: lead time, deployment frequency, mean time to restore (MTTR), and change fail percentage. As a consultancy that has helped many organizations transform, these metrics have come up time and time again as a way to help organizations determine whether they're improving the overall performance. Each metric creates a virtuous cycle and focuses the teams on continuous improvement: to reduce lead time, you reduce wasteful activities which, in turn, lets you deploy more frequently; deployment frequency forces your teams to improve their practices and automation; your speed to recover from failure is improved by better practices, automation and monitoring which reduces the frequency of failures.
On-demand self-service is a key characteristic (and benefit) of cloud computing. When large-scale service landscapes are deployed using a single account, rules and processes around usage of that account become necessary, often involving approval steps that increase turnaround time. A better approach is a multi-account cloud setup where several accounts are used, in the extreme one account per team. This does add overhead in other places, for example, ensuring shared billing, enabling communication between VPCs and managing the relationship with the cloud provider. However, it often accelerates development and it usually improves security, because single-service accounts are easier to audit and, in the case of a breach, the impact is greatly reduced. Having multiple accounts also reduces stickiness, because an account provides a good boundary for services that can be moved en bloc to another cloud provider.
The observability is an integral part of operating a distributed and microservices architecture. We rely on different system outputs such as distributed tracing, aggregate logs and metrics to infer the internal state of the distributed components, diagnose where the problems are and get to the root cause. An important aspect of an observability ecosystem is monitoring—visualizing and analyzing the system's output—and alerting when unexpected conditions are detected. Traditionally, configuration of monitoring dashboards and setting up alerts is done through GUI-based point-and-click systems. This approach leads to nonrepeatable dashboard configurations, no ability to continuously test and adjust alerts to avoid alert fatigue or missing out on important alerts, and drift from organizational best practices. We highly recommend treating your observability ecosystem configurations as code, called observability as code, and adopt infrastructure as code for your monitoring and alerting infrastructure. Choose observability products that support configuration through version-controlled code and execution of APIs or commands via infrastructure CD pipelines. Observability as code is an often-forgotten aspect of infrastructure as code and, we believe, crucial enough to be called out.
Often, in an effort to outsource risk to their suppliers, businesses look for "one throat to choke" on their most critical and risky system implementations. Unfortunately, this gives them fewer solution choices and less flexibility. Instead, businesses should look to maintain the greatest vendor independence where the business risk exposure is highest. We see a new risk-commensurate vendor strategy emerging that encourages investment to maintain vendor independence for highly critical business systems. Less critical business functions can take advantage of the streamlined delivery of a vendor-native solution because it allows them to absorb more easily the impact of losing that vendor. This trade-off has become apparent as the major cloud providers have expanded their range of service offerings. For example, using AWS Secret Management Service can speed up initial development and has the benefit of ecosystem integration, but it will also add more inertia if you ever need to migrate to a different cloud provider than it would if you had implemented, for example, Vault.
We still see teams who aren't tracking the cost of running their applications as closely as they should as their software architecture or usage evolves. This is particularly true when they're using serverless, which developers assume will provide lower costs since you're not paying for unused server cycles. However, the major cloud providers are pretty savvy at setting their pricing models, and heavily used serverless functions, although very useful for rapid iteration, can get expensive quickly when compared with dedicated cloud (or on-premise) servers. We advise teams to frame a system's run cost as architecture fitness function, which means: track the cost of running your services against the value delivered; when you see deviations from what was expected or acceptable, have a discussion about whether it's time to evolve your architecture.
We've long cautioned people about the temptation to check secrets into their source code repositories. Previously, we've recommended decoupling secret management from source code. However, now we're seeing a set of good tools emerge that offer secrets as a service. With this approach, rather than hardwiring secrets or configuring them as part of the environment, applications retrieve them from a separate process. Tools such as Vault by HashiCorp let you manage secrets separately from the application and enforce policies such as frequent rotation externally.
Although we've had mostly new blips in this edition of the Radar, we think it's worth continuing to call out the usefulness of Security Chaos Engineering. We've moved it to Trial because the teams using this technique are confident that the security policies they have in place are robust enough to handle common security failure modes. Still, proceed with caution when using this technique—we don't want our teams to become desensitized to these issues.
When it comes to large-scale data analysis or machine intelligence problems, being able to reproduce different versions of analysis done on different data sets and parameters is immensely valuable. To achieve reproducible analysis, both the data and the model (including algorithm choice, parameters and hyperparameters) need to be version controlled. Versioning data for reproducible analytics is a relatively trickier problem than versioning models because of the data size. Tools such as DVC help in versioning data by allowing users to commit and push data files to a remote cloud storage bucket using a git-like workflow. This makes it easy for collaborators to pull a specific version of data to reproduce an analysis.
Chaos Katas is a technique that our teams have developed to train and upskill infrastructure and platform engineers. It combines Chaos Engineering techniques—that is, creating failures and outages in a controlled environment—with the systematic teaching and training approach of Kata. Here, Kata refers to code patterns that trigger controlled failures, allowing engineers to discover the problem, recover from the failure, run postmortem and find the root cause. Repeated execution of Katas helps engineers to internalize their new skills.
When building Docker images for our applications, we're often concerned with two things: the security and the size of the image. Traditionally, we've used container security scanning tools to detect and patch common vulnerabilities and exposures and small distributions such as Alpine Linux to address the image size and distribution performance. In this Radar, we're excited about addressing the security and size of containers with a new technique called distroless docker images, pioneered by Google. With this technique, the footprint of the image is reduced to the application, its resources and language runtime dependencies, without operating system distribution. The advantages of this technique include reduced noise of security scanners, smaller security attack surface, reduced overhead of patching vulnerabilities and even smaller image size for higher performance. Google has published a set of distroless container images for different languages. You can create distroless application images using the Google build tool Bazel, which has rules for creating distroless containers or simply use multistage Dockerfiles. Note that distroless containers by default don't have a shell for debugging. However, you can easily find debug versions of distroless containers online, including a busybox shell.
At ThoughtWorks, as early adopters and leaders in the agile space, we've been proponents of the practice of incremental delivery. We've also advised many clients to look at off-the-shelf software through a "Can this be released incrementally?" lens. This has often been difficult because of the big-bang approach of most vendors which usually involves migrating large amounts of data. Recently, however, we've also had success using incremental delivery with COTS (commercial off-the-shelf), launching specific business processes incrementally to smaller subsets of users. We recommend you assess whether you can apply this practice to the vendor software of your choice, to help reduce the risks involved in big-bang deliveries.
For some time now we've recommended increased delivery team ownership of their entire stack, including infrastructure. This means increased responsibility in the delivery team itself for configuring infrastructure in a safe, secure, and compliant way. When adopting cloud strategies, most organizations default to a tightly locked-down and centrally managed configuration to reduce risk, but this also creates substantial productivity bottlenecks. An alternative approach is to allow teams to manage their own configuration, and use an Infrastructure configuration scanner to ensure the configuration is set in a safe and secure way. Watchmen is an interesting tool, built to provide rule-driven assurance of AWS account configurations that are owned and operated independently by delivery teams. Scout2 is another example of configuration scanning to support secure compliance.
In more complex architectures and deployments, it may not be immediately obvious that a build that depends on the code currently being checked in is broken. Developers trying to fix a broken build could find themselves working against a moving target, as the build is continually triggered by upstream dependencies. Pre-commit downstream build checks is a very simple technique: have a pre-commit or pre-push script check the status of these downstream builds and alert the developer beforehand that a build is broken.
As large organizations transition to more autonomous teams owning and operating their own microservices, how can they ensure the necessary consistency and compatibility between those services without relying on a centralized hosting infrastructure? To work together efficiently, even autonomous microservices need to align with some organizational standards. A service mesh offers consistent discovery, security, tracing, monitoring and failure handling without the need for a shared asset such as an API gateway or ESB. A typical implementation involves lightweight reverse-proxy processes deployed alongside each service process, perhaps in a separate container. These proxies communicate with service registries, identity providers, log aggregators and other services. Service interoperability and observability are gained through a shared implementation of this proxy but not a shared runtime instance. We've advocated for a decentralized approach to microservices management for some time and are happy to see this consistent pattern emerge. Open source projects such as Linkerd and Istio will continue to mature and make service meshes even easier to implement.
When organizations choose a vanilla Hadoop or Spark distribution instead of one of the vendor distributions, they have to decide how they want to provision and manage the cluster. Occasionally, we see "handcranking" of Hadoop clusters using config management tools such as Ansible, Chef and others. Although these tools are great at provisioning immutable infrastructure components, they're not very useful when you have to manage stateful systems and can often lead to significant effort trying to manage and evolve clusters using these tools. We instead recommend using tools such as Ambari to provision and manage your stateful Hadoop or Spark clusters.
The major cloud providers have become increasingly competitive in their pricing and the rapid pace of releasing new features. This leaves consumers in a difficult place when choosing and committing to a provider. Increasingly, we're seeing organizations prepare to use "any cloud" and to avoid vendor lock-in at all costs. This, of course, leads to generic cloud usage. We see organizations limiting their use of the cloud to only those features common across all cloud providers—thereby missing out on the providers' unique benefits. We see organizations making large investments in home-grown abstraction layers that are too complex to build and too costly to maintain to stay cloud agnostic. The problem of lock-in is real. We recommend approaching this problem with a multicloud strategy that evaluates the migration cost and effort of capabilities from one cloud to another, against the benefits of using cloud-specific features. We recommend increasing the portability of the workloads by shipping the applications as widely adopted Docker containers: use open source security and identity protocols to easily migrate the identity of the workloads, a risk-commensurate vendor strategy to maintain cloud independence only where necessary and Polycloud to mix and match services from different providers where it makes sense. In short, shift your approach from a generic cloud usage to a sensible multicloud strategy.
A defining characteristic of a microservices architecture is that system components and services are organized around business capabilities. Regardless of size, microservices encapsulate a meaningful grouping of functionality and information to allow for the independent delivery of business value. This is in contrast to earlier approaches in service architecture which organized services according to technical characteristics. We've observed a number of organizations who've adopted a layered microservices architecture, which in some ways is a contradiction in terms. These organizations have fallen back to arranging services primarily according to a technical role, for example, experience APIs, process APIs or system APIs. It's too easy for technology teams to be assigned by layer, so delivering any valuable business change requires slow and expensive coordination between multiple teams. We caution against the effects of this layering and recommend arranging services and teams primarily according to business capability.
Master data management (MDM) is a classic example of the enterprise "silver bullet" solution: it promises to solve an apparently related class of problems in one go. Through creating a centralized single point of change, coordination, test and deployment, MDM solutions negatively impact an organization's ability to respond to business change. Implementations tend to be long and complex, as organizations try to capture and map all "master" data into the MDM while integrating the MDM solution into all consuming or producing systems.
Microservices has emerged as a leading architectural technique in modern cloud-based systems, but we still think teams should proceed carefully when making this choice. Microservice envy tempts teams to complicate their architecture by having lots of services simply because it's a fashionable architecture choice. Platforms such as Kubernetes make it much easier to deploy complex sets of microservices, and vendors are pushing their solutions to managing microservices, potentially leading teams further down this path. It's important to remember that microservices trade development complexity for operational complexity and require a solid foundation of automated testing, continuous delivery and DevOps culture.
On a number of occasions we have seen system designs that use request-response events in user-facing workflows. In these cases, the UI is blocked or the user has to wait for a new page to load until a corresponding response message to a request message is received. The main reasons cited for designs like this are performance or a unified approach to communication between backends for synchronous and asynchronous use cases. We feel that the increased complexity—in development, testing and operations—far outweighs the benefit of having a unified approach, and we strongly suggest to use synchronous HTTP requests when synchronous communication between backend services is needed. When implemented well, communication using HTTP rarely is a bottleneck in a distributed system.
Robotic process automation (RPA) is a key part of many digital transformation initiatives, as it promises to deliver cost savings without having to modernize the underlying architecture and systems. The problem with this approach of focusing only on automating business processes, without addressing the underlying software systems or capabilities, is that this can make it even harder to change the underlying systems by introducing additional coupling. This makes any future attempts to address the legacy IT landscape even more difficult. Very few systems can afford to ignore change and hence RPA efforts need to be coupled with appropriate legacy modernization strategies.