Much documentation can be replaced with highly readable code and tests. In a world of evolutionary architecture, however, it's important to record certain design decisions for the benefit of future team members as well as for external oversight. Lightweight Architecture Decision Records are a technique for capturing important architectural decisions along with their context and consequences. We recommend storing these records in source control, rather than in a wiki or website, so that they remain in sync with the code itself. For most projects, we see no reason why you wouldn't want to use this technique.
We've seen a steep increase in interest in the topic of digital platforms over the past 12 months. Companies looking to roll out new digital solutions quickly and efficiently are building internal platforms, which offer teams self-service access to the business APIs, tools, knowledge and support necessary to build and operate their own solutions. We find that these platforms are most effective when they're given the same respect as an external product offering. Applying product management to internal platforms means establishing empathy with internal consumers (read: developers) and collaborating with them on the design. Platform product managers establish roadmaps and ensure the platform delivers value to the business and enhances the developer experience. Some owners even create a brand identity for the internal platform and use that to market the benefits to their colleagues. Platform product managers look after the quality of the platform, gather usage metrics, and continuously improve it over time. Treating the platform as a product helps to create a thriving ecosystem and avoids the pitfall of building yet another stagnant, underutilized service-oriented architecture.
Borrowed from evolutionary computing, a fitness function is used to summarize how close a given design solution is to achieving the set aims. When defining an evolutionary algorithm, the designer seeks a ‘better’ algorithm; the fitness function defines what ‘better’ means in this context. An architectural fitness function, as defined in Building Evolutionary Architectures, provides an objective integrity assessment of some architectural characteristics, which may encompass existing verification criteria, such as unit testing, metrics, monitors, and so on. We believe architects can communicate, validate and preserve architectural characteristics in an automated, continual manner, which is the key to building evolutionary architectures.
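To make this concrete, a fitness function can be as simple as an automated test asserting a structural property of the codebase so it is verified on every commit. The following is a minimal sketch in Python; the `app/domain` and `app/infrastructure` package names are hypothetical stand-ins for whatever dependency rule your architecture defines.

```python
# fitness_function_test.py -- minimal illustrative architectural fitness function.
# Asserts that no module under the (hypothetical) app/domain package imports from
# app/infrastructure, preserving the intended dependency direction.
import pathlib
import re

FORBIDDEN_IMPORT = re.compile(r"^\s*(from|import)\s+app\.infrastructure", re.MULTILINE)

def test_domain_does_not_depend_on_infrastructure():
    violations = []
    for source_file in pathlib.Path("app/domain").rglob("*.py"):
        if FORBIDDEN_IMPORT.search(source_file.read_text()):
            violations.append(str(source_file))
    assert not violations, f"Domain modules must not import infrastructure: {violations}"
```

Run as part of the build pipeline, a check like this validates and preserves the architectural characteristic continually rather than relying on reviews alone.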
Many organizations we work with are trying hard to use modern engineering approaches to build new capabilities and features, while also having to coexist with a long tail of legacy systems. An old strategy that, based on our experience, has turned out to be increasingly helpful in these scenarios is Eric Evans's Autonomous bubble pattern. This approach involves creating a fresh context for new application development that is shielded from the entanglements of the legacy world. This is a step beyond just using an anticorruption layer. It gives the new bubble context full control over its backing data, which is then asynchronously kept up-to-date with the legacy systems. It requires some work to protect the boundaries of the bubble and keep both worlds consistent, but the resulting autonomy and reduction in development friction is a first bold step toward a modernized future architecture.
In previous editions of the Radar, we've talked about using Chaos Monkey from Netflix to test how a running system is able to cope with outages in production by randomly disabling instances and measuring the results. Chaos Engineering is the nascent term for the wider application of this technique. By running experiments on distributed systems in production, we're able to build confidence that those systems work as expected under turbulent conditions. A good place to start understanding this technique is the Principles of Chaos Engineering website.
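The shape of a single experiment can be sketched in a few lines. The hypothetical Python snippet below (the service tag and region are assumptions) terminates one randomly chosen instance of a service using boto3; a Chaos Monkey-style tool automates and schedules this kind of controlled experiment, and in practice you would scope it carefully and measure how the system responds.

```python
# chaos_experiment.py -- illustrative sketch of one chaos experiment using boto3.
# Assumes instances of the target service carry the (hypothetical) tag service=checkout.
import random
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Find running instances belonging to the target service.
reservations = ec2.describe_instances(
    Filters=[
        {"Name": "tag:service", "Values": ["checkout"]},
        {"Name": "instance-state-name", "Values": ["running"]},
    ]
)["Reservations"]
instances = [i["InstanceId"] for r in reservations for i in r["Instances"]]

# Terminate one instance at random, then observe dashboards, alerts and user impact.
victim = random.choice(instances)
ec2.terminate_instances(InstanceIds=[victim])
print(f"Terminated {victim}; verify the system degrades gracefully.")
```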
In previous Radar issues we mentioned tools such as git-crypt and Blackbox that allow us to keep secrets safe inside the source code. Decoupling secret management from source code is our way of reminding technologists that there are other options for storing secrets. For example, HashiCorp Vault, CI servers and configuration management tools provide mechanisms for storing secrets that are not linked to the source code of an application. Both approaches are viable and we recommend you use at least one of them in your projects.
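As a simple illustration of the decoupled approach, the sketch below has the application read its secret from the environment at runtime rather than from anything committed to the repository; the variable could be populated by Vault, the CI server or a configuration management tool. The variable and function names are hypothetical.

```python
# db_config.py -- illustrative sketch: the application resolves its secret at runtime,
# so nothing sensitive lives in source control. DATABASE_PASSWORD is a hypothetical
# variable injected by Vault, the CI server or configuration management.
import os

def database_password() -> str:
    password = os.environ.get("DATABASE_PASSWORD")
    if password is None:
        raise RuntimeError("DATABASE_PASSWORD not set; inject it from your secret store")
    return password
```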
Inspired by the DevOps movement, DesignOps is a cultural shift and a set of practices that allows people across an organization to continuously redesign products without compromising quality, service coherency or team autonomy. DesignOps advocates for the creation and evolution of a design infrastructure that minimizes the effort necessary to create new UI concepts and variations, and to establish a rapid and reliable feedback loop with end users. With tools such as Storybook promoting close collaboration, the need for upfront analysis and specification handoffs is reduced to the absolute minimum. With DesignOps, design is shifting from being a specific practice to being a part of everyone's job.
Working with legacy code, especially large monoliths, is one of the most unsatisfying, high-friction experiences for developers. Although we caution against extending and actively maintaining legacy monoliths, they continue to be dependencies in our environments, and developers often underestimate the cost and time required to develop against these dependencies. To help reduce the friction, developers have used virtual machine images or Docker container images to create immutable images of legacy systems and their configurations. The intent is to contain the legacy in a box that developers can run locally, removing the need for rebuilding, reconfiguring or sharing environments. In an ideal scenario, teams that own legacy systems generate the corresponding boxed legacy images through their build pipelines, and developers can then run and orchestrate these images in their allocated sandbox more reliably. Although this approach has reduced the overall time spent by each developer, it has had limited success when the teams owning the downstream dependencies have been reluctant to create container images for others to use.
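For teams consuming such an image, running the boxed legacy system locally can be as simple as the sketch below, which uses the Docker SDK for Python; the registry, image name, tag and port are hypothetical.

```python
# run_legacy_box.py -- illustrative sketch: start a boxed legacy dependency locally
# for development or tests, then throw it away. Image name and ports are hypothetical.
import docker

client = docker.from_env()

# Run the pre-built legacy image published by the owning team's pipeline.
legacy = client.containers.run(
    "registry.example.com/legacy-billing:snapshot",
    detach=True,
    ports={"8080/tcp": 18080},  # expose the legacy API on localhost:18080
)
try:
    pass  # point your application or test suite at http://localhost:18080 here
finally:
    legacy.stop()
    legacy.remove()
```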
We've seen significant benefits from introducing microservices architectures, which have allowed teams to scale the delivery of independently deployed and maintained services. Unfortunately, we've also seen many teams create a front-end monolith — a single, large and sprawling browser application — on top of their back-end services. Our preferred (and proven) approach is to split the browser-based code into micro frontends. In this approach, the web application is broken down into its features, and each feature is owned, from front end to back end, by a different team. This ensures that every feature is developed, tested and deployed independently of other features. Multiple techniques exist to recombine the features — sometimes as pages, sometimes as components — into a cohesive user experience.
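One of the simpler recombination techniques is server-side page composition: a thin layer assembles HTML fragments served by the team-owned features. The sketch below is a hypothetical illustration using Flask and requests (the feature names and fragment URLs are assumptions), not a recommendation for any particular composition framework.

```python
# composition_layer.py -- illustrative sketch of server-side composition for micro frontends.
# Each feature team serves its own HTML fragment; this thin layer stitches a page together.
# Fragment URLs are hypothetical.
import requests
from flask import Flask

app = Flask(__name__)

FRAGMENTS = {
    "search": "http://search-team.internal/fragment",
    "basket": "http://basket-team.internal/fragment",
}

@app.route("/")
def home():
    parts = [requests.get(url, timeout=2).text for url in FRAGMENTS.values()]
    return "<html><body>" + "".join(parts) + "</body></html>"
```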
The use of continuous delivery pipelines to orchestrate the release process for software has become a mainstream concept. However, automatically testing changes to infrastructure code isn’t as widely understood. Continuous integration (CI) and continuous delivery (CD) tools can be used to test server configuration (e.g., Chef cookbooks, Puppet modules, Ansible playbooks), server image building (e.g., Packer), environment provisioning (e.g., Terraform, CloudFormation) and integration of environments. The use of pipelines for infrastructure as code enables errors to be found before changes are applied to operational environments — including environments used for development and testing. They also offer a way to ensure that infrastructure tooling is run consistently, from CI/CD agents, as opposed to being run from individual workstations. Some challenges remain, however, such as the longer feedback loops associated with standing up containers and virtual machines. Still, we've found this to be a valuable technique.
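A pipeline stage for infrastructure code can be as simple as running the tooling's own checks before any apply step. The hypothetical Python wrapper below (the `infrastructure/` directory layout is an assumption) could run as a CI stage to validate and plan Terraform changes, failing the build before anything touches a real environment; the same commands could equally be declared directly in your CI tool's pipeline configuration.

```python
# check_infra.py -- illustrative sketch of a CI stage that tests Terraform code
# before it is applied. The "infrastructure/" directory is hypothetical.
import subprocess
import sys

def run(*cmd: str) -> None:
    result = subprocess.run(cmd, cwd="infrastructure")
    if result.returncode != 0:
        sys.exit(result.returncode)

run("terraform", "init", "-backend=false")  # set up providers without touching remote state
run("terraform", "validate")                # catch configuration errors early
run("terraform", "plan", "-out=tfplan")     # record the exact change set for a later apply
```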
The use of serverless architecture has very quickly become an accepted approach for organizations deploying cloud applications, with a plethora of choices available for deployment. Even traditionally conservative organizations are making partial use of some serverless technologies. Most of the discussion centers on Functions as a Service (e.g., AWS Lambda, Google Cloud Functions, Azure Functions), while the appropriate patterns for its use are still emerging. Deploying serverless functions undeniably removes the nontrivial effort that traditionally goes into server and OS configuration and orchestration. Serverless functions, however, are not a fit for every requirement. At this stage, you must be prepared to fall back to deploying containers or even server instances for specific requirements. Meanwhile, the other components of a serverless architecture, such as Backend as a Service, have become almost a default choice.
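For reference, a serverless function is typically nothing more than a handler invoked by the platform. The sketch below is a minimal, hypothetical AWS Lambda handler in Python, assuming invocation through an API Gateway proxy-style event; the function's logic and response shape are illustrative only.

```python
# handler.py -- minimal illustrative AWS Lambda function (API Gateway proxy event assumed).
import json

def handler(event, context):
    # The platform owns servers, OS and scaling; the team owns only this logic.
    name = (event.get("queryStringParameters") or {}).get("name", "world")
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"message": f"hello, {name}"}),
    }
```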
Many development teams have adopted test-driven development practices for writing application code because of their benefits. Others have turned to containers to package and deploy their software, and it's accepted practice to use automated scripts to build the containers. What we've seen few teams do so far is combine the two trends and drive the writing of the container scripts using tests. With frameworks such as Serverspec and Goss, you can express the intended functionality for either isolated or orchestrated containers, with short feedback loops. This makes it possible to apply the same principles we've long championed for code: in effect, TDD'ing containers. Our initial experience doing so has been very positive.
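In the Python ecosystem, testinfra plays a role similar to Serverspec; the hedged sketch below (the image contents and config path are hypothetical) shows the kind of test you might write first and then evolve the Dockerfile until it passes.

```python
# test_nginx_image.py -- illustrative sketch of test-driving a container image with testinfra
# (a Python analogue of Serverspec). Run against a running container with, for example:
#   pytest --hosts='docker://my-nginx-container' test_nginx_image.py
# The image contents and config path are hypothetical.

def test_nginx_is_installed(host):
    assert host.package("nginx").is_installed

def test_config_is_present(host):
    assert host.file("/etc/nginx/nginx.conf").exists

def test_nginx_is_listening(host):
    assert host.socket("tcp://0.0.0.0:80").is_listening
```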
The amount of data collected by IT operations has been increasing for years. For example, the trend toward microservices means that more applications are generating their own operational data, and tools such as Splunk, Prometheus, or the ELK stack make it easier to store and later process that data to gain operational insights. When combined with increasingly democratized machine learning tools, it's inevitable that operators will start to incorporate statistical models and trained classification algorithms into their toolsets. Although these algorithms have been available for years, and various attempts have been made to automate service management, we're only just starting to understand how machines and humans can collaborate to identify outages earlier or pinpoint the source of failures. Although there is a risk of overhyping Algorithmic IT operations, steady improvement in machine learning algorithms will inevitably change the role of humans in operating tomorrow's data centers.
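As a small taste of what such collaboration might look like, the sketch below applies an off-the-shelf anomaly detector from scikit-learn to a window of operational metrics; it only flags unusual points for a human to investigate. The metric columns, synthetic data and contamination threshold are all assumptions for illustration.

```python
# anomaly_sketch.py -- illustrative sketch: flag anomalous operational samples for human review.
# Feature columns (p99 latency, error rate, queue depth) and thresholds are assumptions.
import numpy as np
from sklearn.ensemble import IsolationForest

# Each row represents one minute of metrics: [p99_latency_ms, error_rate, queue_depth].
history = np.random.normal(loc=[120.0, 0.01, 40.0], scale=[10.0, 0.005, 5.0], size=(1440, 3))

detector = IsolationForest(contamination=0.01, random_state=0).fit(history)

latest = np.array([[310.0, 0.09, 150.0]])  # the most recent minute of metrics
if detector.predict(latest)[0] == -1:
    print("Unusual combination of metrics -- page a human to investigate")
```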
Blockchains have been widely hyped as the panacea for all things fintech, from banking to digital currency to supply chain transparency. We’ve previously featured Ethereum because of its feature set, which includes smart contracts. Now, we're seeing more development using Ethereum for decentralized applications in other areas. Although this is still a very young technology, we're encouraged to see it being used to build decentralized applications beyond cryptocurrency and banking.
As event streaming platforms, such as Apache Kafka, rise in popularity, many consider them an advanced form of message queuing, used solely to transmit events. Even when used in this way, event streaming has its benefits over traditional message queuing. However, we're more interested in how people use event streaming as the source of truth, with platforms (Kafka in particular) acting as the primary store for data as immutable events. A service with an Event Sourcing design, for example, can use Kafka as its event store; those events are then available for other services to consume. This technique has the potential to reduce duplication of effort between local persistence and integration.
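A minimal event-sourced flow on Kafka might look like the sketch below, using the kafka-python client; the topic name and event shape are hypothetical. The service appends immutable domain events to a topic that acts as the system of record, and any service, including the writer itself, can rebuild state by replaying that topic.

```python
# event_store_sketch.py -- illustrative sketch of using Kafka as the source of truth.
# Topic name and event structure are hypothetical; uses the kafka-python client.
import json
from kafka import KafkaConsumer, KafkaProducer

TOPIC = "orders.events"

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda event: json.dumps(event).encode("utf-8"),
)
# Append an immutable domain event; this, not a database row, is the record of truth.
producer.send(TOPIC, {"type": "OrderPlaced", "order_id": "42", "total": 99.5})
producer.flush()

# Any service can rebuild its own view by replaying the topic from the beginning.
consumer = KafkaConsumer(
    TOPIC,
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)
```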
The adoption of cloud and DevOps, while increasing the productivity of teams who can now move more quickly with reduced dependency on centralized operations teams and infrastructure, has also constrained teams that lack the skills to self-manage a full application and operations stack. Some organizations have tackled this challenge by creating platform engineering product teams. These teams operate an internal platform that enables delivery teams to self-service deploy and operate systems with reduced lead time and stack complexity. The emphasis here is on API-driven self-service and supporting tools, with delivery teams still responsible for supporting what they deploy onto the platform. Organizations that consider establishing such a platform team should be very cautious not to accidentally create a separate DevOps team, nor should they simply relabel their existing hosting and operations structure as a platform.
The major cloud providers (Amazon, Microsoft and Google) are locked in an aggressive race to maintain parity on core capabilities, while their products are differentiated only marginally. This is causing a few organizations to adopt a Polycloud strategy — rather than going ‘all-in’ with one provider, they pass different types of workloads to different providers in a best-of-breed approach. This may involve, for example, putting standard services on AWS but using Google for machine learning, Azure for .NET applications that use SQL Server, or potentially using the Ethereum Consortium Blockchain solution. This differs from a cloud-agnostic strategy of aiming for portability across providers, which is costly and forces lowest-common-denominator thinking. Polycloud instead focuses on using the best that each cloud offers.
As large organizations transition to more autonomous teams owning and operating their own microservices, how can they ensure the necessary consistency and compatibility between those services without relying on a centralized hosting infrastructure? To work together efficiently, even autonomous microservices need to align with some organizational standards. A service mesh offers consistent discovery, security, tracing, monitoring and failure handling without the need for a shared asset such as an API gateway or ESB. A typical implementation involves lightweight reverse-proxy processes deployed alongside each service process, perhaps in a separate container. These proxies communicate with service registries, identity providers, log aggregators, and so on. Service interoperability and observability are gained through a shared implementation of this proxy but not a shared runtime instance. We've advocated for a decentralized approach to microservice management for some time and are happy to see this consistent pattern emerge. Open source projects such as linkerd and Istio will continue to mature and make service meshes even easier to implement.
Microservices architectures, with large numbers of services exposing their assets and capabilities through APIs, present an increased attack surface and demand a zero trust security architecture — ‘never trust, always verify’. However, enforcing security controls for communication between services is often neglected, due to increased service code complexity and a lack of libraries and language support in a polyglot environment. To get around this complexity, some teams delegate security to an out-of-process sidecar — a process or a container that is deployed and scheduled with each service and shares the same execution context, host and identity. Sidecars implement security capabilities such as transparent encryption of the communication and TLS (Transport Layer Security) termination, as well as authentication and authorization of the calling service or the end user. We recommend you look into using Istio, linkerd or Envoy before implementing your own sidecars for endpoint security.
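To make the shape of the pattern concrete, the sketch below is a deliberately simplified, hypothetical sidecar in Python: it runs next to the service, checks the caller's bearer token and only then forwards the request to the service listening on localhost. Real sidecars such as Envoy also handle TLS termination, mutual TLS and policy, all of which this sketch omits; the token, ports and route are assumptions.

```python
# auth_sidecar_sketch.py -- deliberately simplified sketch of a security sidecar.
# It authenticates callers and forwards allowed requests to the co-located service on
# localhost:8080. Token value and ports are hypothetical; production setups would use
# Envoy, linkerd or Istio instead of hand-rolled code like this.
import requests
from flask import Flask, Response, request

app = Flask(__name__)
UPSTREAM = "http://127.0.0.1:8080"    # the real service, same host and execution context
EXPECTED_TOKEN = "Bearer demo-token"  # hypothetical; fetched from a secret store in reality

@app.route("/<path:path>", methods=["GET", "POST"])
def proxy(path):
    if request.headers.get("Authorization") != EXPECTED_TOKEN:
        return Response("unauthorized", status=401)
    upstream = requests.request(
        request.method, f"{UPSTREAM}/{path}", data=request.get_data(), timeout=5
    )
    return Response(upstream.content, status=upstream.status_code)

if __name__ == "__main__":
    app.run(port=9443)  # callers talk to the sidecar, never directly to the service
```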
Traditional approaches to enterprise security often emphasize locking things down and slowing the pace of change. However, we know that the more time an attacker has to compromise a system, the greater the potential damage. The three Rs of enterprise security — rotate, repair and repave — take advantage of infrastructure automation and continuous delivery to eliminate opportunities for attack. Rotating credentials, applying patches as soon as they're available and rebuilding systems from a known, secure state — all within a matter of minutes or hours — makes it harder for attackers to succeed. The three Rs of security technique is made feasible with the advent of modern cloud-native architectures. When applications are deployed as containers, and built and tested via a completely automated pipeline, a security patch is just another small release that can be sent through the pipeline with one click. Of course, in keeping with best distributed systems practices, developers need to design their applications to be resilient to unexpected server outages. This is similar to the impact of implementing Chaos Monkey within your environment.
We're compelled to caution, again, against creating a single CI instance for all teams. While it's a nice idea in theory to consolidate and centralize Continuous Integration (CI) infrastructure, in reality we do not see enough maturity in the tools and products in this space to achieve the desired outcome. Software delivery teams that must use a centralized CI offering regularly experience long delays while waiting for a central team to perform minor configuration tasks or to troubleshoot problems in the shared infrastructure and tooling. At this stage, we continue to recommend that organizations limit their centralized investment to establishing patterns, guidelines and support for delivery teams to operate their own CI infrastructure.
We've long been advocates of continuous integration (CI), and we were pioneers in building CI server programs to automatically build projects on check-ins. Used well, a CI server runs as a daemon process that monitors a shared project mainline to which developers commit daily. The CI server builds the project and runs comprehensive tests to ensure the whole software system is integrated and in an always-releasable state, thus satisfying the principles of continuous delivery. Sadly, many developers simply set up a CI server and falsely assume they are "doing CI" when in reality they miss out on all the benefits. Common failure modes include: running CI against a shared mainline but with infrequent commits, so integration isn't really continuous; running a build with poor test coverage; allowing the build to stay red for long periods; or running CI against feature branches, which results in continuous isolation. The ensuing "CI theatre" might make people feel good but would fail any credible CI certification test.
When enterprise-wide quarterly or monthly releases were considered best practice, it was necessary to maintain a complete environment for performing testing cycles prior to deployment to production. These enterprise-wide integration test environments (often referred to as SIT or Staging) are a common bottleneck for continuous delivery today. The environments themselves are fragile and expensive to maintain, often with components that need manual configuration by a separate environment management team. Testing in the staging environment provides unreliable and slow feedback, and testing effort duplicates what can be performed against components in isolation. We recommend that organizations incrementally create an independent path to production for key components. Important techniques include contract testing, decoupling deployment from release, a focus on mean time to recovery and testing in production.
Kafka is becoming very popular as a messaging solution, and along with it, Kafka Streams is at the forefront of the wave of interest in streaming architectures. Unfortunately, as they start to embed Kafka at the heart of their data and application platforms, we're seeing some organizations recreate ESB antipatterns with Kafka by centralizing the Kafka ecosystem components — such as connectors and stream processors — instead of allowing these components to live with product or service teams. This recalls the seriously problematic ESB era, when ever more logic, orchestration and transformation was thrust into a centrally managed tool, creating a significant dependency on a centralized team. We're calling this out to dissuade further implementations of this flawed pattern.
Back in the days when SOAP held sway in the enterprise software industry, generating client code from WSDL specs was an accepted—even encouraged—practice. Unfortunately, the resulting code was often complex, untestable, difficult to modify and frequently didn't work across implementation platforms. With the advent of REST, we found it better to evolve API clients that use the tolerant reader pattern to extract and process only the fields they need. Recently we have observed a disturbing return to old habits, with developers generating code from API specifications written in Swagger or RAML—a practice we refer to as spec-based codegen. Although such tools are very useful for driving the design of APIs and for extracting documentation, we caution against the tempting shortcut of simply generating client code directly from these specifications. The chances are that such code will be difficult to test and maintain.
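For contrast with generated clients, a tolerant reader is small and easy to test: it picks out only the fields it needs and ignores everything else, so the client doesn't break when the provider adds data. The sketch below is a hypothetical Python example; the field and type names are assumptions.

```python
# tolerant_reader.py -- illustrative sketch of the tolerant reader pattern.
# Only the fields this client actually needs are extracted; unknown or extra fields
# in the API response are simply ignored. Field names are hypothetical.
from dataclasses import dataclass

@dataclass
class Customer:
    customer_id: str
    email: str

def read_customer(payload: dict) -> Customer:
    # Reach only for what we need; tolerate anything else the provider adds or renames later.
    return Customer(
        customer_id=str(payload.get("id", "")),
        email=str(payload.get("email", "")),
    )
```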