Fitness functions introduced by evolutionary architecture, borrowed from evolutionary computing, are executable functions that inform us if our applications and architecture are objectively moving away from their desired characteristics. They're essentially tests that can be incorporated into our release pipelines. One major characteristic of an application is the freshness of its dependencies on other libraries, APIs or environmental components. A dependency drift fitness function tracks these dependencies and flags the out-of-date ones that require updating. With the growing number of maturing tools that detect dependency drift, such as Dependabot or Snyk, we can easily incorporate dependency drift fitness functions into our software release process and take timely action to keep our application dependencies up to date.
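To make the idea concrete, here is a minimal sketch of a dependency drift fitness function in Python. It is not how Dependabot or Snyk work internally. The `Dependency` record, the `max_major_drift` threshold and the assumption that versions are purely numeric and dot-separated are all ours, chosen for illustration. In a real pipeline the `latest` values would come from whatever scanner or registry lookup the team already uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dependency:
    name: str
    current: str  # version pinned in the project (hypothetical data)
    latest: str   # newest released version, as reported by some scanner


def parse_version(version: str) -> tuple:
    """Turn '2.14.1' into (2, 14, 1). Assumes numeric, dot-separated versions."""
    return tuple(int(part) for part in version.split("."))


def major_drift(dep: Dependency) -> int:
    """Number of major versions the pinned dependency lags behind the latest."""
    return parse_version(dep.latest)[0] - parse_version(dep.current)[0]


def dependency_drift_fitness(deps, max_major_drift=0):
    """Return the dependencies whose major-version drift exceeds the threshold.

    An empty result means the fitness function passes.
    """
    return [dep for dep in deps if major_drift(dep) > max_major_drift]
```

A pipeline step could call `dependency_drift_fitness` on the project's dependency inventory and fail the build, or merely raise a warning, when the returned list is not empty.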
Automating the estimation, tracking and projection of cloud infrastructure's run cost is necessary for today's organizations. The cloud providers' savvy pricing models, combined with the proliferation of pricing parameters and the dynamic nature of today's architecture, can lead to surprisingly expensive run costs. For example, the price of serverless based on API calls, of event streaming solutions based on traffic and of data processing clusters based on running jobs all change over time as the architecture evolves. When our teams manage infrastructure on the cloud, implementing run cost as an architecture fitness function is one of their early activities. This means that our teams can observe the cost of running services against the value delivered; when they see deviations from what was expected or acceptable, they'll discuss whether it's time to evolve the architecture. The observation and calculation of the run cost is implemented as an automated function.
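As a rough illustration, the sketch below expresses run cost as a fitness function in Python. It assumes, purely for the example, that "value delivered" can be approximated by the number of requests served and that cost figures have already been pulled from the provider's billing data. The `CostSample` type, the service names and the budget figure are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CostSample:
    service: str
    monthly_cost: float    # observed run cost, in any one currency
    monthly_requests: int  # assumed proxy for the value delivered


def cost_per_thousand_requests(sample: CostSample) -> float:
    """Unit cost; a service with no traffic is treated as infinitely expensive."""
    if sample.monthly_requests == 0:
        return float("inf")
    return sample.monthly_cost / sample.monthly_requests * 1000


def run_cost_fitness(samples, budget_per_thousand: float) -> dict:
    """Return {service: unit cost} for every service that exceeds the budget."""
    return {
        s.service: round(cost_per_thousand_requests(s), 4)
        for s in samples
        if cost_per_thousand_requests(s) > budget_per_thousand
    }
```

A non-empty result would be the "deviation from what was expected or acceptable" that prompts the team to discuss whether the architecture needs to evolve.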
As the technology landscape is becoming more complex, concerns such as security need more automation and engineering practices. When building systems, we need to take into consideration security policies, which are rules and procedures to protect our systems from threats and disruption. For example, access control policies define and enforce who can access which services and resources under what circumstances; by contrast, network security policies can dynamically limit the traffic rate to a particular service.
Several of our teams have had a great experience treating security policy as code. When we say "as code," we mean not only writing these security policies in a file but also applying practices such as keeping the code under version control, introducing automatic validation in the pipeline, automatically deploying the policies to the environments and observing and monitoring their performance. Based on our experience and the maturity of the existing tools (including Open Policy Agent and platforms such as Istio, which provide flexible policy definition and enforcement mechanisms that support the practice of security policy as code), we highly recommend using this technique in your environment.
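The following Python sketch shows the shape of the practice rather than any particular tool. It does not use Open Policy Agent or Istio APIs. The policy is plain data that could live in version control, the evaluation function denies by default, and the assertions in the usage note stand in for the automatic validation a pipeline would run. The roles, resources and rule format are hypothetical.

```python
# Access control policy kept as data, so it can be versioned and reviewed.
# Roles, resources and the rule layout are illustrative assumptions.
POLICY = [
    {"role": "admin", "resource": "billing", "actions": {"read", "write"}},
    {"role": "analyst", "resource": "billing", "actions": {"read"}},
]


def is_allowed(role: str, resource: str, action: str, policy=POLICY) -> bool:
    """Default deny: allow only if an explicit rule grants the action."""
    return any(
        rule["role"] == role
        and rule["resource"] == resource
        and action in rule["actions"]
        for rule in policy
    )
```

A pipeline check could then assert, for instance, that `is_allowed("analyst", "billing", "write")` is false before the policy is deployed, so that an accidental widening of access fails the build instead of reaching production.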
Since we last mentioned tailored service templates, we've seen a broader adoption of the pattern to help pave the road for organizations moving to microservices. With constant advances in observability tooling, container orchestration and service mesh sidecars, a template provides sensible defaults to bootstrap a new service, removing a great deal of setup needed to make the service work well with the surrounding infrastructure. We've had success applying product management principles to tailored service templates, treating internal developers as customers and making it easier for them to push code to production and operate it with appropriate observability. This has the added benefit of acting as a lightweight governance mechanism to centralize default technical decisions.
About a decade ago we introduced continuous delivery (CD), our default way to deliver software solutions. Today's solutions increasingly include machine-learning models and we find them no exception in adopting continuous delivery practices. We call this continuous delivery for machine learning (CD4ML). Although the principles of CD remain the same, the practices and tools to implement the end-to-end process of training, testing, deploying and monitoring models require some modifications. For example: version control must include not only code but also the data, the models and their parameters; the testing pyramid extends to include model bias, fairness and data and feature validation; the deployment process must consider how to promote and evaluate the performance of new models against current champion models. While the industry is celebrating the new buzzword of MLOps, we feel CD4ML is our holistic approach to implement an end-to-end process to reliably release and continuously improve machine-learning models, from idea to production.
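One of the deployment concerns above, evaluating a new model against the current champion, can be sketched as a simple promotion gate. This is an illustration under our own assumptions, not a prescribed CD4ML step. It uses plain accuracy on a shared holdout set, and the `min_gain` threshold is a hypothetical parameter. A real gate would likely also check bias and fairness metrics.

```python
def accuracy(predictions, labels) -> float:
    """Fraction of predictions that match the labels."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def should_promote(champion_preds, challenger_preds, labels, min_gain=0.01) -> bool:
    """Promote the challenger only if it beats the champion by at least min_gain.

    Both models are scored on the same holdout labels.
    """
    gain = accuracy(challenger_preds, labels) - accuracy(champion_preds, labels)
    return gain >= min_gain
```

Run as a pipeline stage, a false result would keep the champion in place and leave the challenger available for further analysis.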
Data mesh marks a welcome architectural and organizational paradigm shift in how we manage big analytical data. The paradigm is founded on four principles: (1) domain-oriented decentralization of data ownership and architecture; (2) domain-oriented data served as a product; (3) self-serve data infrastructure as a platform to enable autonomous, domain-oriented data teams; and (4) federated governance to enable ecosystems and interoperability. Although the principles are intuitive and attempt to address many of the known challenges of previous centralized analytical data management, they transcend the available analytical data technologies. After building data mesh for multiple clients on top of the existing tooling, we learned two things: (a) there is a large gap in open-source or commercial tooling to accelerate implementation of data mesh (for example, implementation of a universal access model to time-based polyglot data which we currently custom build for our clients) and (b) despite the gap, it's feasible to use the existing technologies as the basic building blocks.
Naturally, technology fit is a major component of implementing your organization's data strategy based on data mesh. Success, however, demands an organizational restructure to separate the data platform team, create the role of data product owner for each domain and introduce the incentive structures necessary for domains to own and share their analytical data as products.
Many data pipelines are defined in a large, more or less imperative script written in Python or Scala. The script contains the logic of the individual steps as well as the code chaining the steps together. When faced with a similar situation in Selenium tests, developers discovered the Page Object pattern, and later many behavior-driven development (BDD) frameworks implemented a split between step definitions and their composition. Some teams are now experimenting with bringing the same thinking to data engineering. A separate declarative data pipeline definition, maybe written in YAML, contains only the declaration and sequence of steps. It states input and output data sets but refers to scripts if and when more complex logic is needed. A La Mode is a relatively new tool that takes a DSL approach to defining pipelines, but airflow-declarative, a tool that turns directed acyclic graphs defined in YAML into Airflow task schedules, seems to have the most momentum in this space.
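The split between step logic and step composition can be sketched in a few lines of Python. This is not the format used by A La Mode or airflow-declarative. A Python dictionary stands in for the YAML file to keep the example dependency-free, and the step names, data set names and registry are hypothetical.

```python
# Declarative part: only the sequence of steps and their input/output data sets.
# In practice this could be loaded from a YAML file.
PIPELINE = {
    "name": "daily_orders",
    "steps": [
        {"id": "load", "run": "load_orders", "output": "raw_orders"},
        {"id": "clean", "run": "drop_cancelled", "input": "raw_orders", "output": "orders"},
        {"id": "total", "run": "sum_amounts", "input": "orders", "output": "revenue"},
    ],
}


# Imperative part: the logic of the individual steps, referenced by name.
def load_orders(_unused=None):
    return [
        {"amount": 10, "status": "paid"},
        {"amount": 5, "status": "cancelled"},
        {"amount": 7, "status": "paid"},
    ]


def drop_cancelled(orders):
    return [order for order in orders if order["status"] != "cancelled"]


def sum_amounts(orders):
    return sum(order["amount"] for order in orders)


STEP_REGISTRY = {
    "load_orders": load_orders,
    "drop_cancelled": drop_cancelled,
    "sum_amounts": sum_amounts,
}


def run_pipeline(definition, registry):
    """Execute the steps in declared order, passing named data sets between them."""
    datasets = {}
    for step in definition["steps"]:
        step_input = datasets.get(step.get("input"))  # None for source steps
        datasets[step["output"]] = registry[step["run"]](step_input)
    return datasets
```

Because the definition carries no logic of its own, it can be reviewed, validated or visualized independently of the step implementations.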
We're seeing more and more tools that enable you to create software architecture and other diagrams as code. There are benefits to using these tools over the heavier alternatives, including easy version control and the ability to generate the DSLs from many sources. Tools in this space that we like include Diagrams, Structurizr DSL, AsciiDoctor Diagram and staples such as WebSequenceDiagrams, PlantUML and the venerable Graphviz. It's also fairly simple to generate your own SVG these days, so don't rule out quickly writing your own tool either. One of our authors wrote a small Ruby script to quickly create SVGs, for example.
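As a small example of generating a diagram DSL from another source, the Python sketch below turns a list of service relationships into Graphviz DOT text. The `to_dot` helper and the sample edges are our own illustration. The output would still need to be rendered with Graphviz itself.

```python
def to_dot(name: str, edges) -> str:
    """Render (source, target, label) tuples as a left-to-right Graphviz digraph."""
    lines = [f"digraph {name} {{", "  rankdir=LR;"]
    for source, target, label in edges:
        lines.append(f'  "{source}" -> "{target}" [label="{label}"];')
    lines.append("}")
    return "\n".join(lines)


# Hypothetical architecture, e.g. derived from service configuration files.
EDGES = [
    ("web", "orders-api", "HTTPS"),
    ("orders-api", "orders-db", "SQL"),
]

if __name__ == "__main__":
    print(to_dot("shop", EDGES))
```

Since the diagram is plain text produced by code, it can be regenerated on every build and diffed in version control like any other source file.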
When building Docker images for our applications, we're often concerned with two things: the security and the size of the image. Traditionally, we've used container security scanning tools to detect and patch common vulnerabilities and exposures, and small distributions such as Alpine Linux to address the image size and distribution performance. We've now gained more experience with distroless Docker images and are ready to recommend this approach as another important security precaution for containerized applications. Distroless Docker images reduce the footprint and dependencies by doing away with a full operating system distribution. This technique reduces security scan noise and the application attack surface. There are fewer vulnerabilities that need to be patched and, as a bonus, these smaller images are more efficient. Google has published a set of distroless container images for different languages. You can create distroless application images using the Google build tool Bazel or simply use multistage Dockerfiles. Note that distroless containers by default don't have a shell for debugging. However, you can easily find debug versions of distroless containers online, including a BusyBox shell. Using distroless Docker images is a technique pioneered by Google and, in our experience, is still largely confined to Google-generated images. We're hoping that the technique catches on beyond this ecosystem.
As many more companies migrate away from their legacy systems, we feel it's worth highlighting an alternative to change data capture (CDC) as a mechanism for getting data from these systems. Martin Fowler described event interception back in 2004. In modern terms it involves forking requests on ingress to a system so that it's possible to gradually build a replacement. Often this is done by copying events or messages, but forking HTTP requests is equally valid. Examples include forking events from point-of-sale systems before they're written to a mainframe and forking payment transactions before they're written to a core banking system. Both lead to the gradual replacement of parts of the legacy systems. We feel that as a technique, obtaining state changes from the source, rather than trying to recreate them in postprocessing using CDC, has been overlooked, which is why we're highlighting it in this issue of the Radar.
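A minimal sketch of forking events on ingress might look like the following Python class. The `EventInterceptor` name and the handler signatures are hypothetical. The point is only that each event reaches the replacement system as a copy before the legacy write, and that a failure in the replacement never disturbs the legacy flow.

```python
import logging

logger = logging.getLogger("event_interception")


class EventInterceptor:
    """Forks each incoming event to a replacement system before the legacy write."""

    def __init__(self, legacy_handler, replacement_handler):
        self._legacy = legacy_handler
        self._replacement = replacement_handler

    def handle(self, event: dict):
        try:
            # Hand the new system its own copy so it cannot mutate the original.
            self._replacement(dict(event))
        except Exception:
            # A failure in the replacement must never block the legacy flow.
            logger.exception("replacement handler failed for event %r", event)
        # The legacy system remains the system of record and provides the response.
        return self._legacy(event)
```

In a point-of-sale scenario, `legacy_handler` would be the existing write to the mainframe and `replacement_handler` the feed into the system being built to replace it.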
Replacing legacy code at scale is always a difficult endeavor and one that often benefits from executing a parallel run with reconciliation. In practice, the technique relies on executing the same production flow through both the old and new code, returning the response from the legacy code but comparing the results to gain confidence in the new code. Despite being an old technique, we've seen more robust implementations in recent years building on continuous delivery practices such as canary releases and feature toggles and extending them by adding an extra layer of experimentation and data analysis to compare live results. We've even used the approach to compare cross-functional results such as response time. Although we've used the technique multiple times with bespoke tooling, we certainly owe a nod to GitHub's Scientist tool, which they used to modernize a critical piece of their application and which has now been ported to multiple languages.
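The core of the technique fits in a short Python wrapper. This sketch is our own and does not reproduce the API of GitHub's Scientist. The `parallel_run` helper, the mismatch record layout and the two pricing functions are hypothetical.

```python
def parallel_run(legacy, candidate, mismatches):
    """Wrap two implementations: always return the legacy result, record differences."""

    def wrapper(*args, **kwargs):
        expected = legacy(*args, **kwargs)
        try:
            actual = candidate(*args, **kwargs)
        except Exception as exc:
            # Errors in the new code are recorded, never surfaced to the caller.
            mismatches.append(
                {"args": args, "kwargs": kwargs, "expected": expected, "error": repr(exc)}
            )
        else:
            if actual != expected:
                mismatches.append(
                    {"args": args, "kwargs": kwargs, "expected": expected, "actual": actual}
                )
        return expected

    return wrapper


def legacy_price(quantity):
    return quantity * 10


def new_price(quantity):
    # Deliberate discrepancy for larger orders, to show reconciliation at work.
    return quantity * 10 if quantity < 5 else quantity * 9
```

Wrapping the two functions with `parallel_run(legacy_price, new_price, mismatches)` keeps production behavior unchanged while the `mismatches` list accumulates the evidence needed to reconcile the new code. A production version would likely sample traffic and ship the records to an analysis store rather than an in-memory list.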
As the pandemic stretches on it seems that highly distributed teams will be the "new normal," at least for the time being. Over the past six months we've learnt a lot about effective remote working. On the positive side, good visual work-management and collaboration tools have made it easier than ever to collaborate remotely with colleagues. Developers, for example, can count on Visual Studio Live Share and GitHub Codespaces to facilitate teamwork and increase productivity. The biggest downside to remote work might be burnout: far too many people are scheduled for back-to-back video calls all day long, and this has begun to take its toll. While online visual tools make it easier to collaborate, it's also possible to build complex giant diagrams that end up being very hard to use, and the security aspects of tool proliferation also need to be carefully managed. Our advice is to remember to take a step back, talk to your teams, evaluate what's working and what's not and change processes and tools as needed.
While the fabric of computing and data continues to shift in enterprises — from monolithic applications to microservices, from centralized data lakes to data mesh, from on-prem hosting to polycloud, with an increasing proliferation of connected devices — the approach to securing enterprise assets for the most part remains unchanged, with heavy reliance and trust in the network perimeter: Organizations continue to make heavy investments to secure their assets by hardening the virtual walls of their enterprises, using private links and firewall configurations and replacing static and cumbersome security processes that no longer serve the reality of today. This continuing trend compelled us to highlight zero trust architecture (ZTA) again.
ZTA is a paradigm shift in security architecture and strategy. It’s based on the assumption that a network perimeter is no longer representative of a secure boundary and no implicit trust should be granted to users or services based solely on their physical or network location. The number of resources, tools and platforms available to implement aspects of ZTA keeps growing and includes: enforcing policies as code based on the principles of least privilege and finest possible granularity, along with continuous monitoring and automated mitigation of threats; using service mesh to enforce security controls application-to-service and service-to-service; implementing binary attestation to verify the origin of the binaries; and including secure enclaves in addition to traditional encryption to enforce the three pillars of data security: in transit, at rest and in memory. For introductions to the topic, consult the NIST ZTA publication and Google's white paper on BeyondProd.
One of the most nuanced decisions facing companies at the moment is the adoption of low-code or no-code platforms, that is, platforms that solve very specific problems in very limited domains. Many vendors are pushing aggressively into this space. The problems we see with these platforms typically relate to an inability to apply good engineering practices such as versioning. Testing too is typically really hard. However, we noticed some interesting new entrants to the market — including Amazon Honeycode, which makes it easy to create simple task or event management apps, and Parabola for IFTTT-like cloud workflows — which is why we're including bounded low-code platforms in this volume. Nevertheless, we remain deeply skeptical about their wider applicability since these tools, like Japanese Knotweed, have a knack of escaping their bounds and tangling everything together. That's why we still strongly advise caution in their adoption.
Polyfills are extremely useful to help the web evolve, providing substitute implementations of modern features for browsers that don't implement them (yet). Too often, though, web applications ship polyfills to browsers that don't need them, which causes unnecessary download and parsing overhead. The situation is becoming more pronounced now as only a few rendering engines remain and the bulk of the polyfills target only one of them: the Trident renderer in IE11. Further, market share of IE11 is dwindling with support ending in less than a year. We therefore suggest that you make use of browser-tailored polyfills, shipping only necessary polyfills to a given browser. This technique can even be implemented as a service with Polyfill.io.
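Server-side selection of polyfills can be sketched as below. This is a deliberately naive Python illustration, not how Polyfill.io is implemented. It relies only on the fact that IE11's user agent string contains a `Trident/` token, and the list of features shipped to that engine is an illustrative assumption, not a complete inventory.

```python
# Polyfills to ship per rendering engine; the feature list is illustrative only.
POLYFILLS_BY_ENGINE = {
    "trident": ["Promise", "fetch", "Array.prototype.includes"],
}


def detect_engine(user_agent: str) -> str:
    """Very coarse engine detection: only singles out Trident (IE11)."""
    if "Trident/" in user_agent:
        return "trident"
    return "modern"


def polyfills_for(user_agent: str) -> list:
    """Return only the polyfills the requesting browser is assumed to need."""
    return POLYFILLS_BY_ENGINE.get(detect_engine(user_agent), [])
```

Browsers that don't match receive an empty list and therefore download and parse nothing extra, which is the whole point of the technique.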
In 2016, Christopher Allen, a key contributor to SSL/TLS, inspired us with an introduction of 10 principles underpinning a new form of digital identity and a path to get there, the path to self-sovereign identity. Self-sovereign identity, also known as decentralized identity, is a “lifetime portable identity for any person, organization, or thing that does not depend on any centralized authority and can never be taken away,” according to the Trust over IP standard. Adopting and implementing decentralized identity is gaining momentum and becoming attainable. We see its adoption in privacy-respecting customer health applications, government healthcare infrastructure and corporate legal identity. If you want to rapidly get started with decentralized identity, you can assess Sovrin Network, Hyperledger Aries and Indy OSS, as well as decentralized identifiers and verifiable credentials standards. We're watching this space closely as we help our clients with their strategic positioning in the new era of digital trust.
Cloud providers have slowly started supporting Kubernetes-style APIs, via custom resource definitions (CRDs), for managing their cloud services. In most cases these cloud services are a core part of the infrastructure, and we've seen teams use tools such as Terraform or Pulumi to provision them. With these new CRDs (ACK for AWS, Azure Service Operator for Azure and Config Connector for GCP) you can use Kubernetes to provision and manage these cloud services. One advantage of these Kube-managed cloud services is that you can leverage the same Kubernetes control plane to enforce the declarative state of both your application and infrastructure. The downside is that it tightly couples your Kubernetes cluster with infrastructure, so we're carefully assessing it and you should too.
We've talked a lot about the benefits of creating platform engineering product teams in support of your other product teams, but actually doing it is hard. It seems that the industry is still searching for the right abstraction in the world of infrastructure as code. Although tools such as Terraform and Helm are steps in the right direction, the focus is still on managing infrastructure as opposed to application development. There are also shifts toward the concept of infrastructure as software with new tools such as Pulumi and CDK being released. The Open Application Model (OAM) is an attempt to bring some standardization to this space. Using the abstractions of components, application configurations, scopes and traits, developers can describe their applications in a platform-agnostic way, while platform implementers define their platform in terms of workload, trait and scope. Whether the OAM will be widely adopted remains to be seen, but we recommend keeping an eye on this interesting and needed idea.
Secure enclaves, also known as trusted execution environments (TEE), refer to a technique that isolates an environment (processor, memory and storage) with a higher level of security and only provides a limited exchange of information with its surrounding untrusted execution context. For example, a secure enclave at the hardware and OS levels can create and store private keys and perform operations with them, such as encrypting data or verifying signatures, without the private keys leaving the secure enclave or being loaded into the untrusted application memory. A secure enclave provides a limited set of instructions to perform trusted operations, isolated from an untrusted application context.
The technique has long been supported by many hardware and OS providers (including Apple), and developers have used it in IoT and edge applications. Only recently, however, has it gained attention in enterprise and cloud-based applications. Cloud providers have started to introduce confidential computing features such as hardware-based secure enclaves: Azure confidential computing infrastructure promises TEE-enabled VMs and access through the Open Enclave SDK open-source library to perform trusted operations. Similarly, GCP Confidential VMs and Compute Engine, still in beta, allow using VMs with data encryption in memory, and AWS Nitro Enclaves is following them with its upcoming preview release. With the introduction of cloud-based secure enclaves and confidential computing, we can add a third pillar to data protection: at rest, in transit and now in memory.
Even though we're still in the very early days of secure enclaves for enterprise, we encourage you to consider this technique, while staying informed about known vulnerabilities that can compromise the secure enclaves of the underlying hardware providers.
Controlled experiments using A/B testing are a great way to inform decisions around product development. But they don't work well when we can't establish independence between the two groups involved in the A/B test, that is, when adding someone to the "A" group impacts the "B" group and vice versa. One technique to address this problem is switchback experimentation. The core concept is that we switch back and forth between the "A" and "B" modes of the experiment in a certain region at alternating time periods instead of running both during the same time period. We then compare the customer experience and other key metrics between the two time buckets. We've tried this to good effect in some of our projects; it's a good tool to have in our experimentation toolbelt.
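The alternating time buckets can be sketched as follows in Python. The one-hour window, the strict A/B alternation and the use of a simple mean are assumptions made for illustration. Real experiments may randomize the order of windows and would need a proper statistical comparison.

```python
from datetime import datetime, timedelta, timezone


def variant_for(timestamp: datetime, start: datetime,
                window: timedelta = timedelta(hours=1)) -> str:
    """Whole region runs 'A' in even-numbered windows and 'B' in odd-numbered ones."""
    bucket = (timestamp - start) // window  # timedelta // timedelta yields an int
    return "A" if bucket % 2 == 0 else "B"


def mean_by_variant(observations, start: datetime,
                    window: timedelta = timedelta(hours=1)) -> dict:
    """Average a metric per variant from (timestamp, value) observations."""
    values = {"A": [], "B": []}
    for timestamp, value in observations:
        values[variant_for(timestamp, start, window)].append(value)
    return {name: sum(vals) / len(vals) for name, vals in values.items() if vals}
```

Every request in the region within a given window is served by the same mode, so the two modes never compete for the same shared resources at the same time.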
Credentials are everywhere in our lives and include passports, driver’s licenses and academic certificates. However, most digital credentials today are simple data records from information systems that are easy to modify and forge and often expose unnecessary information. In recent years, we've seen verifiable credentials steadily mature to solve this issue. The W3C standard defines them in a way that is cryptographically secure, privacy respecting and machine verifiable. The model puts credential holders at the center, which is similar to our experience when using physical credentials: users can put their verifiable credentials in their own digital wallets and show them to anyone at any time without the permission of the credentials’ issuer. This decentralized approach also enables users to better manage their own information and selectively disclose certain information, which greatly improves data privacy protection. For example, powered by zero-knowledge proof technology, you can construct a verifiable credential to prove that you are an adult without revealing your birthday. The community has developed many use cases around verifiable credentials. We've implemented our own COVID health certification with reference to the COVID-19 Credentials Initiative (CCI). Although verifiable credentials don't rely on blockchain technology or decentralized identity, in practice this technique often works with decentralized identifiers (DIDs) and uses blockchain as a verifiable data registry. Many decentralized identity frameworks also have verifiable credentials built in.
When we first covered GraphQL in the Radar, we cautioned that its misuse can lead to antipatterns which, in the long run, have more disadvantages than benefits. Nevertheless, we’ve seen an increasing interest in GraphQL among our teams because of its ability to aggregate information from different resources. This time we want to caution you about using Apollo Federation and its strong support for a single unified data graph for your company. Even though at first glance the idea of having ubiquitous concepts across the organization is tempting, we have to take into account previous similar attempts in the industry, such as MDM and the canonical data model among others, that have exposed the pitfalls of this approach. The challenges can be significant, especially when the domain we find ourselves in is too complex to be captured in a single unified model.
We've long warned against centralized enterprise service buses and defined "smart endpoints, dumb pipes" as one of the core characteristics of a microservices architecture. Unfortunately, we're observing a pattern of traditional ESBs rebranding themselves, creating ESBs in API gateway's clothing that naturally encourage overambitious API gateways. Don't let the marketing fool you: regardless of what you call it, putting business logic (including orchestration and transformation) in a centralized tool creates architectural coupling, decreases transparency, and increases vendor lock-in with no clear upside. API gateways can still act as a useful abstraction for crosscutting concerns, but we believe the smarts should live in the APIs themselves.
Several years ago, a new generation of log aggregation platforms emerged that were capable of storing and searching over vast amounts of log data to uncover trends and insights in operational data. Splunk was the most prominent but by no means the only example of these tools. Because these platforms provide broad operational and security visibility across the entire estate of applications, administrators and developers have grown increasingly dependent on them. This enthusiasm spread as stakeholders discovered that they could use log aggregation for business analytics. However, business needs can quickly outstrip the flexibility and usability of these tools. Logs intended for technical observability are often inadequate to infer deep customer understanding. We prefer either to use tools and metrics designed for customer analytics or to take a more event-driven approach to observability where both business and operational events are collected and stored in a way they can be replayed and processed by more purpose-built tools.
Since we originally introduced the term in 2016, micro frontends have grown in popularity and achieved mainstream acceptance. But like any new technique with an easy-to-remember name, it has occasionally been misused and abused. Particularly concerning is the tendency to use this architecture as an excuse to mix a range of competing technologies, tools or frameworks in a single page, leading to micro frontend anarchy. A particularly egregious form of this syndrome is using multiple frontend frameworks — for example, React.js and Angular — in the same "single-page" application. Although this might be technically possible, it is far from advisable when not part of a deliberate transition strategy. Other properties that should be consistent from team to team include the styling technique (e.g., CSS-in-JS or CSS modules) and the means by which the individual components are integrated (e.g., iFrames or web components). Furthermore, organizations should decide whether to standardize on consistent approaches or to leave it up to their teams to decide on state management, data fetching, build tooling, analytics and a host of other choices in a micro frontend application.
Over the last few decades computational notebooks, first introduced by Wolfram Mathematica, have evolved to support scientific research, exploration and educational workflows. Naturally, in support of data science workflows and with the likes of Jupyter notebooks and Databricks notebooks, they've become a great companion by providing a simple and intuitive interactive computation environment for combining code to analyze data with rich text and visualization to tell a data story. Notebooks were designed to provide an ultimate medium for modern scientific communication and innovation. In recent years, however, we've seen a trend for notebooks to be the medium for running the type of production-quality code typically used to drive enterprise operations. We see notebook platform providers advertising the use of their exploratory notebooks in production. This is a case of good intentions (democratizing programming for data scientists) implemented poorly and at the cost of scalability, maintainability, resiliency and all the other qualities that long-lived production code needs to support. We don't recommend productionizing notebooks and instead encourage empowering data scientists to build production-ready code with the right programming frameworks, thus simplifying the continuous delivery tooling and abstracting complexity away through end-to-end ML platforms.