More and more companies are building internal platforms to roll out new digital solutions quickly and efficiently. Companies that succeed with this strategy are applying product management to internal platforms. This means establishing empathy with internal consumers (the development teams) and collaborating with them on the design. Platform product managers create roadmaps and ensure the platform delivers value to the business and enhances the developer experience. Unfortunately, we're also seeing less successful approaches, where teams create a platform in the void, based on unverified assumptions and without internal customers. These platforms, often despite aggressive internal tactics, end up being underutilized and a drain on the organization's delivery capability. As usual, good product management is all about building products that consumers love.
Although infrastructure as code is a relatively old technique (we’ve featured it in the Radar in 2011), it has become vitally important in the modern cloud era where the act of setting up infrastructure has become the passing of configuration instructions to a cloud platform. When we say "as code" we mean that all the good practices we've learned in the software world should be applied to infrastructure. Using source control, adhering to the DRY principle, modularization, maintainability, and using automated testing and deployment are all critical practices. Those of us with a deep software and infrastructure background need to empathize with and support colleagues who do not. Saying "treat infrastructure like code" isn't enough; we need to ensure the hard-won learnings from the software world are also applied consistently throughout the infrastructure realm.
We've seen significant benefits from introducing microservices, which have allowed teams to scale the delivery of independently deployed and maintained services. Unfortunately, we've also seen many teams create a front-end monolith — a large, entangled browser application that sits on top of the back-end services — largely neutralizing the benefits of microservices. Micro frontends have continued to gain in popularity since they were first introduced. We've seen many teams adopt some form of this architecture as a way to manage the complexity of multiple developers and teams contributing to the same user experience. In June of last year, one of the originators of this technique published an introductory article that serves as a reference for micro frontends. It shows how this style can be implemented using various web programming mechanisms and builds out an example application using React.js. We're confident this style will grow in popularity as larger organizations try to decompose UI development across multiple teams.
The pipelines as code technique emphasizes that the configuration of delivery pipelines that build, test and deploy our applications or infrastructure should be treated as code; they should be placed under source control and modularized in reusable components with automated testing and deployment. As organizations move to decentralized autonomous teams building microservices or micro frontends, the need for engineering practices in managing pipelines as code increases to keep building and deploying software consistent within the organization. This need has given rise to delivery pipeline templates and tooling that enable a standardized way to build and deploy services and applications. Such tools use the declarative delivery pipelines of applications, adopting a pipeline blueprint to execute the underlying tasks for various stages of a delivery lifecycle such as build, test and deployment; and they abstract away implementation details. The ability to build, test and deploy pipelines as code should be one of the evaluation criteria for choosing a CI/CD tool.
We firmly believe that pair programming improves the quality of code, spreads knowledge throughout a team and allows overall faster delivery of software. In a post COVID-19 world, however, many software teams will be distributed or fully remote, and in this situation we recommend pragmatic remote pairing: adjusting pairing practices to what's possible given the tools at hand. Consider tools such as Visual Studio Live Share for efficient, low-latency collaboration. Only resort to pixel-sharing if both participants reside in relative geographic proximity and have high-bandwidth internet connections. Pair developers who are in similar time zones rather than expecting pairing to work between participants regardless of their location. If pairing isn't working for logistical reasons, fall back to practices such as individual programming augmented via code reviews, pull-request collaboration (but beware long-lived branches with Gitflow) or shorter pairing sessions for critical parts of the code. We've engaged in remote pairing for years, and we've found it to be effective if done with a dose of pragmatism.
Unfortunately, feature toggles are less common than we'd like, and quite often we see people mixing up its types and use cases. It's quite common to come across teams that use heavyweight platforms such as LaunchDarkly to implement feature toggles, including release toggles, to benefit from Continuous Integration, when all you need are if/else conditionals. Therefore, unless you need A/B testing or canary release or hand over feature release responsibility to business folks, we encourage you to use the simplest possible feature toggle instead of unnecessarily complex feature toggle frameworks.
Applying machine learning to make the business applications and services intelligent is more than just training models and serving them. It requires implementing end-to-end and continuously repeatable cycles of training, testing, deploying, monitoring and operating the models. Continuous delivery for machine learning (CD4ML) is a technique that enables reliable end-to-end cycles of development, deploying and monitoring machine learning models. The underpinning technology stack to enable CD4ML includes tooling for accessing and discovering data, version control of artefacts (such as data, model and code), continuous delivery pipelines, automated environment provisioning for various deployments and experiments, model performance assessment and tracking, and model operational observability. Companies can choose their own tool set depending on their existing tech stack. CD4ML emphasizes automation and removing manual handoffs. CD4ML is our de facto approach for developing ML models.
Over the past year, we've seen a shift in interest around machine learning and deep neural networks in particular. Until now, tool and technique development has been driven by excitement over the remarkable capabilities of these models. Currently, though, there is rising concern that these models could cause unintentional harm. For example, a model could be trained inadvertently to make profitable credit decisions by simply excluding disadvantaged applicants. Fortunately, we're seeing a growing interest in ethical bias testing that will help to uncover potentially harmful decisions. Tools such as lime, AI Fairness 360 or What-If Tool can help uncover inaccuracies that result from underrepresented groups in training data and visualization tools such as Google Facets or Facets Dive can be used to discover subgroups within a corpus of training data. We've used lime (local interpretable model-agnostic explanations) in addition to this technique in order to understand the predictions of any machine-learning classifier and what classifiers (or models) are doing.
We see more and more tools such as Apollo Federation that can aggregate multiple GraphQL endpoints into a single graph. However, we caution against misusing GraphQL, especially when turning it into a server-to-server protocol. Our practice is to use GraphQL for server-side resource aggregation only. When using this pattern, the microservices continue to expose well-defined RESTful APIs, while under-the-hood aggregate services or BFF (Backend for Frontends) patterns use GraphQL resolvers as the implementation for stitching resources from other services. The shape of the graph is driven by domain-modeling exercises to ensure ubiquitous language is limited to subgraphs where needed (in the case of one-microservice-per-bounded-context). This technique simplifies the internal implementation of aggregate services or BFFs, while encouraging good modeling of services to avoid anemic REST.
Since introducing it in the Radar in 2016, we've seen widespread adoption of micro frontends for web UIs. Recently, however, we've seen projects extend this architectural style to include micro frontends for mobile applications as well. When the application becomes sufficiently large and complex, it becomes necessary to distribute the development over multiple teams. This presents the challenge of maintaining team autonomy while integrating their work into a single app. Although we've seen teams writing their own frameworks to enable this development style, existing modularization frameworks such as Atlas and Beehive can also simplify the problem of integrating multiteam app development.
The adoption of cloud and DevOps — while increasing the productivity of teams who can now move more quickly with reduced dependency on centralized operations teams and infrastructure — also has constrained teams that lack the skills to self-manage a full application and operations stack. Some organizations have tackled this challenge by creating platform engineering product teams. These teams maintain an internal platform that enables delivery teams to deploy and operate systems with reduced lead time and stack complexity. The emphasis here is on API-driven self-service and supporting tools, with delivery teams still responsible for supporting what they deploy onto the platform. Organizations that consider establishing such a platform team should be very cautious not to accidentally create a separate DevOps team, nor should they simply relabel their existing hosting and operations structure as a platform. If you're wondering how to best set up platform teams, we've been using the concepts from Team Topologies to split platform teams in our projects into enablement teams, core "platform within a platform" teams and stream-focused teams.
Security policies are rules and procedures that protect our systems from threats and disruption. For example, access control policies define and enforce who can access which services and resources under what circumstances; or network security policies can dynamically limit the traffic rate to a particular service. The complexity of the technology landscape today demands treating security policy as code: define and keep policies under version control, automatically validate them, automatically deploy them and monitor their performance. Tools such as Open Policy Agent or platforms such as Istio provide flexible policy definition and enforcement mechanisms that support the practice of security policy as code.
Semi-supervised learning loops are a class of iterative machine-learning workflows that take advantage of the relationships to be found in unlabeled data. These techniques may improve models by combining labeled and unlabeled data sets in various ways. In other cases they compare models trained on different subsets of the data. Unlike either unsupervised learning where a machine infers classes in unlabeled data or supervised techniques where the training set is entirely labeled, semi-supervised techniques take advantage of a small set of labeled data and a much larger set of unlabeled data. Semi-supervised learning is also closely related to active learning techniques where a human is directed to selectively label ambiguous data points. Since expert humans that can accurately label data are a scarce resource and labeling is often the most time-consuming activity in the machine-learning workflow, semi-supervised techniques lower the cost of training and make machine learning feasible for a new class of users. We're also seeing the application of weakly supervised techniques where machine-labeled data is used but is trusted less than the data labeled by humans.
We had this technique in Assess previously. The innovations in the NLP landscape continue at a great pace, and we're able to leverage these innovations in our projects thanks to the ubiquitous transfer learning for NLP. The GLUE benchmark (a suite of language understanding tasks) scores have seen dramatic progress over the past couple of years with average scores moving from 70.0 at launch to some of the leaders crossing 90.0 as of April 2020. A lot of our projects in the NLP domain are able to make significant progress by starting from pretrained models from ELMo, BERT, and ERNIE, among others, and then fine-tuning them based on the project needs.
Distributed teams come in many shapes and setups; delivery teams in a 100% single-site co-located setup, however, have become the exception for us. Most of our teams are either multisite teams or have at least some team members working off-site. Therefore, using "remote native" processes and approaches by default can help significantly with the overall team flow and effectiveness. This starts with making sure that everybody has access to the necessary remote systems. Moreover, using tools such as Visual Studio Live Share, MURAL or Jamboard turn online workshops and remote pairing into routines instead of ineffective exceptions. But "remote native" goes beyond a lift-and-shift of co-location practices to the digital world: Embracing more asynchronous communication, even more discipline around decision documentation, and "everybody always remote" meetings are other approaches our teams practice by default to optimize for location fluidity.
The technology landscape of organizations today is increasingly more complex with assets — data, functions, infrastructure and users — spread across security boundaries, such as local hosts, multiple cloud providers and a variety of SaaS vendors. This demands a paradigm shift in enterprise security planning and systems architecture, moving from static and slow-changing security policy management, based on trust zones and network configurations, to dynamic, fine-grained security policy enforcement based on temporal access privileges.
Zero trust architecture (ZTA) is an organization's strategy and journey to implement zero-trust security principles for all of their assets — such as devices, infrastructure, services, data and users — and includes implementing practices such as securing all access and communications regardless of the network location, enforcing policies as code based on the least privilege and as granular as possible, and continuous monitoring and automated mitigation of threats. Our Radar reflects many of the enabling techniques such as security policy as code, sidecars for endpoint security and BeyondCorp. If you're on your journey toward ZTA, refer to the NIST ZTA publication to learn more about principles, enabling technology components and migration patterns as well as Google's publication on BeyondProd.
Data mesh is an architectural and organizational paradigm that challenges the age-old assumption that we must centralize big analytical data to use it, have data all in one place or be managed by a centralized data team to deliver value. Data mesh claims that for big data to fuel innovation, its ownership must be federated among domain data owners who are accountable for providing their data as products (with the support of a self-serve data platform to abstract the technical complexity involved in serving data products); it must also adopt a new form of federated governance through automation to enable interoperability of domain-oriented data products. Decentralization, along with interoperability and focus on the experience of data consumers, are key to the democratization of innovation using data.
If your organization has a large number of domains with numerous systems and teams generating data or a diverse set of data-driven use cases and access patterns, we suggest you assess data mesh. Implementation of data mesh requires investment in building a self-serve data platform and embracing an organizational change for domains to take on the long-term ownership of their data products, as well as an incentive structure that rewards domains serving and utilizing data as a product.
Since the birth of the internet, the technology landscape has experienced an accelerated evolution toward decentralization. While protocols such as HTTP and architectural patterns such as microservices or data mesh enable decentralized implementations, identity management remains centralized. The emergence of distributed ledger technology (DLT), however, provides the opportunity to enable the concept of decentralized identity. In a decentralized identity system, entities — that is, discrete identifiable units such as people, organizations and things — are free to use any shared root of trust. In contrast, conventional identity management systems are based on centralized authorities and registries such as corporate directory services, certificate authorities or domain name registries.
The development of decentralized identifiers — globally unique, persistent and self-sovereign identifiers that are cryptographically verifiable — is a major enabling standard. Although scaled implementations of decentralized identifiers in the wild are still rare, we're excited by the premise of this movement and have started using the concept in our architecture. For the latest experiments and industry collaborations, check out Decentralized Identity Foundation.
Many data pipelines are defined in a large, more or less imperative script written in Python or Scala. The script contains the logic of the individual steps as well as the code chaining the steps together. When faced with a similar situation in Selenium tests, developers discovered the Page Object pattern, and later many behavior-driven development (BDD) frameworks implemented a split between step definitions and their composition. Some teams are now experimenting with bringing the same thinking to data engineering. A separate declarative data pipeline definition, maybe written in YAML, contains only the declaration and sequence of steps. It states input and output data sets but refers to scripts if and when more complex logic is needed. With A La Mode, we're seeing the first open source tool appear in this space.
DeepWalk is an algorithm that helps apply machine learning on graphs. When working on data sets that are represented as graphs, one of the key problems is to extract features from the graph. This is where DeepWalk can help. It uses SkipGram to construct node embeddings by viewing the graph as a language where each node is a unique word in the language and random walks of finite length on the graph constitutes a sentence. These embeddings can then be used by various ML models. DeepWalk is one of the techniques we're trialling on some of our projects where we've needed to apply machine learning on graphs.
We recommend caution in managing stateful systems via container orchestration platforms such as Kubernetes. Some databases are not built with native support for orchestration — they don't expect a scheduler to kill and relocate them to a different host. Building a highly available service on top of such databases is not trivial, and we still recommend running them on bare metal hosts or a virtual machine (VM) rather than to force-fit them into a container orchestration platform.
Even though we strongly advocate in favor of CI rather than Gitflow, we know that committing straight to the trunk and running the CI on a master branch can be ineffective if the team is too big, the builds are slow or flaky, or the team lacks the discipline to run the full test suite locally. In this situation a red build can block multiple devs or pairs of devs. Instead of fixing the underlying root cause — slow builds, the inability to run tests locally or monolithic architectures that necessitate many people working in the same area — teams usually rely on feature branches to bypass these issues. We discourage feature branches, given they may require significant effort to resolve merge conflicts, and they introduce longer feedback loops and potential bugs during conflict resolution. Instead, we propose using preflight builds as an alternative: these are pull request–based builds for “micro branches” that live only for the duration of the pipeline run, with the branch opened for every commit. To help automate this workflow, we've come across bots such as Bors, which automates merging to master and branch deletion in case the mini branch build succeeds. We're assessing this flow, and you should too; but don't use this to solve the wrong problem, as it can lead to misuse of branches and may cause more harm than benefit.
It is rather curious, that after over a decade of industry experience with cloud migration, we still feel it's necessary to call out cloud lift and shift; a practice that views cloud simply as a hosting solution, resulting in the replication of an existing architecture, security practices and IT operational models in the cloud. This fails to realize the cloud's promises of agility and digital innovation. A cloud migration requires intentional change across multiple axes toward a cloud-native state, and depending on the unique migration circumstances, each organization might end up somewhere on the spectrum from cloud lift and shift to cloud native. Systems architecture, for example, is one of the pillars of delivery agility and often requires change. The temptation to simply lift and shift existing systems as containers to the cloud can be strong. While this tactic can speed up cloud migration, it falls short when it comes to creating agility and delivering features and value. Enterprise security in the cloud is fundamentally different from traditional perimeter-based security through firewalls and zoning, and it demands a journey toward zero trust architecture. The IT operating model too has to be reformed to safely provide cloud services through self-serve automated platforms and empower teams to take more of the operational responsibility and gain autonomy. Last but not least, organizations must build a foundation to enable continuous change, such as creating pipelines with continuous testing of applications and infrastructure as a part of the migration. These will help the migration process, result in a more robust and well-factored system and give organizations a way to continue to evolve and improve their systems.
We find that more and more organizations need to replace aging legacy systems to keep up with the demands of their customers (both internal and external). One antipattern we keep seeing is legacy migration feature parity , the desire to retain feature parity with the old. We see this as a huge missed opportunity. Often the old systems have bloated over time, with many features unused by users (50% according to a 2014 Standish Group report) and business processes that have evolved over time. Replacing these features is a waste. Our advice: Convince your customers to take a step back and understand what their users currently need and prioritize these needs against business outcomes and metrics — which often is easier said than done. This means conducting user research and applying modern product development practices rather than simply replacing the existing ones.
Several years ago, a new generation of log aggregation platforms emerged that were capable of storing and searching over vast amounts of log data to uncover trends and insights in operational data. Splunk was the most prominent but by no means the only example of these tools. Because these platforms provide broad operational and security visibility across the entire estate of applications, administrators and developers have grown increasingly dependent on them. This enthusiasm spread as stakeholders discovered that they could use log aggregation for business analytics. However, business needs can quickly outstrip the flexibility and usability of these tools. Logs intended for technical observability are often inadequate to infer deep customer understanding. We prefer either to use tools and metrics designed for customer analytics or to take a more event-driven approach to observability where both business and operational events are collected and stored in a way they can be replayed and processed by more purpose-built tools.
Five years ago we highlighted the problems with long-lived branches with Gitflow. Essentially, long-lived branches are the opposite of continuously integrating all changes to the source code, and in our experience continuous integration is the better approach for most kinds of software development. Later we extended our caution to Gitflow itself, because we saw teams using it almost exclusively with long-lived branches. Today, we still see teams in settings where continuous delivery of web-based systems is the stated goal being drawn to long-lived branches. So we were delighted that the author of Gitflow has now added a note to his original article, explaining that Gitflow was not intended for such use cases.
The value of snapshot testing is undeniable when working with legacy systems by ensuring that the system continues to work and the legacy code doesn't break. However, we're seeing the common, rather harmful practice of using snapshot testing only as the primary test mechanism. Snapshot tests validate the exact result generated in the DOM by a component, not the component's behavior; therefore, it can be weak and unreliable, fostering the "only delete the snapshot and regenerate it" bad practice. Instead, you should test the logic and behavior of the components emulating what users would do. This mindset is encouraged by tools in the Testing Library family.