Much documentation can be replaced with highly readable code and tests. In a world of evolutionary architecture, however, it's important to record certain design decisions for the benefit of future team members as well as for external oversight. Lightweight Architecture Decision Records is a technique for capturing important architectural decisions along with their context and consequences. We recommend storing these details in source control, instead of a wiki or website, as then they can provide a record that remains in sync with the code itself. For most projects, we see no reason why you wouldn't want to use this technique.
We've seen a steep increase in interest in the topic of digital platforms over the past 12 months. Companies looking to roll out new digital solutions quickly and efficiently are building internal platforms, which offer teams self-service access to the business APIs, tools, knowledge and support necessary to build and operate their own solutions. We find that these platforms are most effective when they're given the same respect as an external product offering. Applying product management to internal platforms means establishing empathy with internal consumers (read: developers) and collaborating with them on the design. Platform product managers establish roadmaps and ensure the platform delivers value to the business and enhances the developer experience. Some owners even create a brand identity for the internal platform and use that to market the benefits to their colleagues. Platform product managers look after the quality of the platform, gather usage metrics, and continuously improve it over time. Treating the platform as a product helps to create a thriving ecosystem and avoids the pitfall of building yet another stagnant, underutilized service-oriented architecture.
Borrowed from evolutionary computing, a fitness function is used to summarize how close a given design solution is to achieving the set aims. When defining an evolutionary algorithm, the designer seeks a ‘better’ algorithm; the fitness function defines what ‘better’ means in this context. An architectural fitness function, as defined in Building Evolutionary Architectures, provides an objective integrity assessment of some architectural characteristics, which may encompass existing verification criteria, such as unit testing, metrics, monitors, and so on. We believe architects can communicate, validate and preserve architectural characteristics in an automated, continual manner, which is the key to building evolutionary architectures.
Many organizations we work with are trying hard to use modern engineering approaches to build new capabilities and features, while also having to coexist with a long tail of legacy systems. An old strategy that, based on our experience, has turned out to be increasingly helpful in these scenarios is Eric Evans's Autonomous bubble pattern. This approach involves creating a fresh context for new application development that is shielded from the entanglements of the legacy world. This is a step beyond just using an anticorruption layer. It gives the new bubble context full control over its backing data, which is then asynchronously kept up-to-date with the legacy systems. It requires some work to protect the boundaries of the bubble and keep both worlds consistent, but the resulting autonomy and reduction in development friction is a first bold step toward a modernized future architecture.
In previous editions of the Radar, we've talked about using Chaos Monkey from Netflix to test how a running system is able to cope with outages in production by randomly disabling instances and measuring the results. Chaos Engineering is the nascent term for the wider application of this technique. By running experiments on distributed systems in production, we're able to build confidence that those systems work as expected under turbulent conditions. A good place to start understanding this technique is the Principles of Chaos Engineering website.
It’s important to remember that encapsulation applies to events and event-driven architectures just as it applies to other areas of software. In particular, think about the scope of an event and whether we expect it to be consumed within the same application, the same domain or across an entire organization. A domain-scoped event will be consumed within the same domain as it’s published, as such we expect the consumer to have access to a certain context, resources or references in order to act on the event. If the consumption is happening more widely within an organization, the contents of the event might well need to be different, and we need to take care not to "leak" implementation details that other domains then come to depend upon.
Identity management is a critical platform component. External users on mobile apps need to be authenticated, developers need to be given access to delivery infrastructure components, and microservices may need to identify themselves to other microservices. You should ask yourself whether identity management should be “self-hosted”. In our experience, a hosted identity management as a service (SaaS) solution is preferable. We believe that top-tier hosted providers such as Auth0 and Okta can provide better uptime and security SLAs. That said, sometimes self-hosting the solution is a realistic decision, especially for enterprises that have the operational discipline and resources to do so safely. Large enterprise identity solutions typically offer a much more expansive range of capabilities such as centralized entitlements, governance reporting and separation of duties management among others. However, these concerns are typically more relevant for employee identities, especially in regulated enterprises with legacy systems.
We've seen significant benefits from introducing microservices architectures, which have allowed teams to scale the delivery of independently deployed and maintained services. Unfortunately, we've also seen many teams create front-end monoliths — a single, large and sprawling browser application — on top of their back-end services. Our preferred (and proven) approach is to split the browser-based code into micro frontends. In this approach, the web application is broken down into its features, and each feature is owned, frontend to backend, by a different team. This ensures that every feature is developed, tested and deployed independently from other features. Multiple techniques exist to recombine the features — sometimes as pages, sometimes as components — into a cohesive user experience.
The use of continuous delivery pipelines to orchestrate the release process for software has become a mainstream concept. However, automatically testing changes to infrastructure code isn’t as widely understood. Continuous integration (CI) and continuous delivery (CD) tools can be used to test server configuration (e.g., Chef cookbooks, Puppet modules, Ansible playbooks), server image building (e.g., Packer), environment provisioning (e.g., Terraform, CloudFormation) and integration of environments. The use of pipelines for infrastructure as code enables errors to be found before changes are applied to operational environments — including environments used for development and testing. They also offer a way to ensure that infrastructure tooling is run consistently, from CI/CD agents, as opposed to being run from individual workstations. Some challenges remain, however, such as the longer feedback loops associated with standing up containers and virtual machines. Still, we've found this to be a valuable technique.
Organizations are becoming more comfortable with the Polycloud strategy — rather than going "all-in" with one provider, they are passing different types of workloads to different providers based on their own strategy. Some of them apply the best-of-breed approach, for example: putting standard services on AWS, but using Google for machine learning and data-oriented applications and Azure for Microsoft Windows applications. For some organizations this is a cultural and business decision. Retail businesses, for example, often refuse to store their data on Amazon and they distribute load to different providers based on their data. This is different to a cloud-agnostic strategy of aiming for portability across providers, which is costly and forces lowest-common-denominator thinking. Polycloud instead focuses on using the best match that each cloud provider offers.
Previously in the Radar, we’ve discussed the rise of the perimeterless enterprise. Now, some organizations are doing away with implicitly trusted intranets altogether and treating all communication as if it was being transmitted through the public internet. A set of practices, collectively labeled BeyondCorp, have been described by Google engineers in a set of publications. Collectively, these practices — including managed devices, 802.1x networking and standard access proxies protecting individual services — make this a viable approach to network security in large enterprises.
When developing mobile applications, our teams often find themselves without an external server for testing apps. Setting up an over-the-wire mock may be a good fit for this particular problem. Developing the HTTP mocks and compiling them into the mobile binary for testing — embedded mobile mocks — enables teams to test their mobile apps when disconnected and with no external dependencies. This technique may require creating an opinionated library based on both the networking library used by the mobile app and your usage of the underlying library.
Blockchains have been widely hyped as the panacea for all things fintech, from banking to digital currency to supply chain transparency. We’ve previously featured Ethereum because of its feature set, which includes smart contracts. Now, we're seeing more development using Ethereum for decentralized applications in other areas. Although this is still a very young technology, we're encouraged to see it being used to build decentralized applications beyond cryptocurrency and banking.
As event streaming platforms, such as Apache Kafka, rise in popularity, many consider them as an advanced form of message queuing, used solely to transmit events. Even when used in this way, event streaming has its benefits over traditional message queuing. However, we're more interested in how people use event streaming as the source of truth with platforms (Kafka in particular) as the primary store for data as immutable events. A service with an Event Sourcing design, for example, can use Kafka as its event store; those events are then available for other services to consume. This technique has the potential to reduce duplicating efforts between local persistence and integration.
One pattern that comes up again and again when building microservice-style architectures is how to handle the aggregation of many resources server-side. In recent years, we've seen the emergence of a number of patterns such as Backend for Frontend (BFF) and tools such as Falcor to address this. Our teams have started using GraphQL for server-side resource aggregation instead. This differs from the usual mode of using GraphQL where clients directly query a GraphQL server. When using this technique, the services continue to expose RESTful APIs but under-the-hood aggregate services use GraphQL resolvers as the implementation for stitching resources from other services. This technique simplifies the internal implementation of aggregate services or BFFs by using GraphQL.
For some time now we've recommended increased delivery team ownership of their entire stack, including infrastructure. This means increased responsibility in the delivery team itself for configuring infrastructure in a safe, secure, and compliant way. When adopting cloud strategies, most organizations default to a tightly locked-down and centrally managed configuration to reduce risk, but this also creates substantial productivity bottlenecks. An alternative approach is to allow teams to manage their own configuration, and use an Infrastructure configuration scanner to ensure the configuration is set in a safe and secure way. Watchmen is an interesting tool, built to provide rule-driven assurance of AWS account configurations that are owned and operated independently by delivery teams. Scout2 is another example of configuration scanning to support secure compliance.
We're seeing some interesting reports of using Jupyter for automated testing. The ability to mix code, comments and output in the same document reminds us of FIT, FitNesse and Concordion. This flexible approach is particularly useful if your tests are data heavy or rely on some statistical analysis such as performance testing. Python provides all the power you need, but as tests grow in complexity, a way to manage suites of notebooks would be helpful.
One problem with observability in a highly distributed microservices architecture is the choice between logging everything — and taking up huge amounts of storage space — or randomly sampling logs and potentially missing important events. Recently, we’ve noticed a technique that offers a compromise between these two solutions. Set the log level per request via a parameter passed in through the tracing header. Using a tracing framework, possibly based on the OpenTracing standard, you can pass a correlation id from service to service in a single transaction. You can even inject other data, such as the desired log level, at the initiating transaction and pass it along with the tracing information. This ensures that the additional data collected corresponds to a single user transaction as it flows through the system. This is also a useful technique for debugging, since services might be paused or otherwise modified on a transaction-by-transaction basis.
We’ve previously talked about the technique of Chaos Engineering in the Radar and the Simian Army suite of tools from Netflix that we’ve used to run experiments to test the resilience of production infrastructure. Security Chaos Engineering broadens the scope of this technique to the realm of security. We deliberately introduce false positives into production networks and other infrastructure — build-time dependencies, for example — to check whether procedures in place are capable of identifying security failures under controlled conditions. Although useful, this technique should be used with care to avoid desensitizing teams to security problems.
As large organizations transition to more autonomous teams owning and operating their own microservices, how can they ensure the necessary consistency and compatibility between those services without relying on a centralized hosting infrastructure? To work together efficiently, even autonomous microservices need to align with some organizational standards. A service mesh offers consistent discovery, security, tracing, monitoring and failure handling without the need for a shared asset such as an API gateway or ESB. A typical implementation involves lightweight reverse-proxy processes deployed alongside each service process, perhaps in a separate container. These proxies communicate with service registries, identity providers, log aggregators, and so on. Service interoperability and observability are gained through a shared implementation of this proxy but not a shared runtime instance. We've advocated for a decentralized approach to microservice management for some time and are happy to see this consistent pattern emerge. Open source projects such as linkerd and Istio will continue to mature and make service meshes even easier to implement.
Microservices architecture, with a large number of services exposing their assets and capabilities through APIs and an increased attack surface, demand a zero trust security architecture — ‘never trust, always verify’. However, enforcing security controls for communication between services is often neglected, due to increased service code complexity and lack of libraries and language support in a polyglot environment. To get around this complexity, some teams delegate security to an out-of-process sidecar — a process or a container that is deployed and scheduled with each service sharing the same execution context, host and identity. Sidecars implement security capabilities, such as transparent encryption of the communication and TLS (Transport Layer Security) termination, as well as authentication and authorization of the calling service or the end user. We recommend you look into using Istio, linkerd or Envoy before implementing your own sidecars for endpoint security.
Traditional approaches to enterprise security often emphasize locking things down and slowing the pace of change. However, we know that the more time an attacker has to compromise a system, the greater the potential damage. The three Rs of enterprise security — rotate, repair and repave — take advantage of infrastructure automation and continuous delivery to eliminate opportunities for attack. Rotating credentials, applying patches as soon as they're available and rebuilding systems from a known, secure state — all within a matter of minutes or hours — makes it harder for attackers to succeed. The three Rs of security technique is made feasible with the advent of modern cloud-native architectures. When applications are deployed as containers, and built and tested via a completely automated pipeline, a security patch is just another small release that can be sent through the pipeline with one click. Of course, in keeping with best distributed systems practices, developers need to design their applications to be resilient to unexpected server outages. This is similar to the impact of implementing Chaos Monkey within your environment.
The major cloud providers continue to add new features to their clouds at a rapid pace, and under the banner of Polycloud we've suggested using multiple clouds in parallel, to mix and match services based on the strengths of each provider’s offerings. Increasingly, we're seeing organizations prepare to use multiple clouds — not to benefit from individual provider’s strengths, though, but to avoid vendor "lock-in" at all costs. This, of course, leads to generic cloud usage, only using features that are present across all providers, which reminds us of the lowest common denominator scenario we saw 10 years ago when companies avoided many advanced features in relational databases in an effort to remain vendor neutral. The problem of lock-in is real. However, instead of treating it with a sledgehammer approach, we recommend looking at this problem from the perspective of exit costs and relate those to the benefits of using cloud-specific features.
Kafka is becoming very popular as a messaging solution, and along with it, Kafka Streams is at the forefront of the wave of interest in streaming architectures. Unfortunately, as they start to embed Kafka at the heart of their data and application platforms, we're seeing some organizations recreating ESB antipatterns with Kafka by centralizing the Kafka ecosystem components — such as connectors and stream processors — instead of allowing these components to live with product or service teams. This reminds us of seriously problematic ESB antipatterns, where more and more logic, orchestration and transformation were thrust into a centrally managed ESB, creating a significant dependency on a centralized team. We're calling this out to dissuade further implementations of this flawed pattern.