The continued adoption of containers for deployments, especially Docker, has made container security scanning a must-have technique and we've moved this technique into Adopt to reflect that. Specifically, containers introduced a new path for security issues; it's vital that you use tools to scan and check containers during deployment. We prefer using automated scanning tools that run as part of the deployment pipeline.
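As a rough illustration of such a pipeline step, the sketch below shells out to the open-source Trivy scanner (one possible tool, not named above) and blocks the deployment when high-severity vulnerabilities are found; the image name is hypothetical.

```python
# Hypothetical pipeline step: fail the build if the image has high-severity CVEs.
# Assumes the Trivy CLI is installed on the CI agent; the image name is illustrative.
import subprocess
import sys

IMAGE = "registry.example.com/orders-service:1.4.2"  # hypothetical image

result = subprocess.run(
    ["trivy", "image", "--severity", "HIGH,CRITICAL", "--exit-code", "1", IMAGE]
)
if result.returncode != 0:
    print(f"Blocking deployment: {IMAGE} has unresolved HIGH/CRITICAL vulnerabilities")
    sys.exit(1)
print(f"{IMAGE} passed the container security scan")
```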
Today, many organizations' answer to unlocking data for analytical usage is to build a labyrinth of data pipelines. Pipelines retrieve data from one or multiple sources, cleanse it and then transform and move it to another location for consumption. This approach to data management often leaves the consuming pipelines with the difficult task of verifying the inbound data's integrity and building complex logic to cleanse the data to meet its required level of quality. The fundamental problem is that the source of the data has neither the incentive nor the accountability for providing quality data to its consumers. For this reason, we strongly advocate for data integrity at the origin, by which we mean that any source providing consumable data must describe its measures of data quality explicitly and guarantee those measures. The main reason behind this is that the originating systems and teams are most intimately familiar with their data and best positioned to fix it at the source. Data mesh architecture takes this one step further, treating consumable data as a product, where data quality and its objectives are integral attributes of every shared data set.
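To make the idea concrete, a producing team might publish its quality measures alongside each data set and let consumers check them before ingestion rather than re-cleansing downstream. The following is a minimal sketch only; the metric names, values and thresholds are hypothetical.

```python
# Hypothetical example: a producer publishes explicit quality measures with its data set,
# and a consumer verifies them before ingesting, instead of cleansing downstream.
published_quality = {
    "dataset": "customer_orders_v3",   # illustrative name
    "completeness": 0.998,             # share of rows with all mandatory fields populated
    "timeliness_hours": 2,             # maximum age of the newest record
    "schema_version": "3.2",
}

consumer_requirements = {"completeness": 0.99, "timeliness_hours": 6}

def meets_requirements(published, required):
    return (published["completeness"] >= required["completeness"]
            and published["timeliness_hours"] <= required["timeliness_hours"])

if not meets_requirements(published_quality, consumer_requirements):
    raise ValueError("Inbound data does not meet the guaranteed quality measures")
```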
We've seen significant benefits from introducing microservices, which have allowed teams to scale the delivery of independently deployed and maintained services. Unfortunately, we've also seen many teams create a front-end monolith — a large, entangled browser application that sits on top of the back-end services — largely neutralizing the benefits of microservices. Micro frontends have continued to gain in popularity since they were first introduced. We've seen many teams adopt some form of this architecture as a way to manage the complexity of multiple developers and teams contributing to the same user experience. In June of this year, one of the originators of this technique published an introductory article that serves as a reference for micro frontends. It shows how this style can be implemented using various web programming mechanisms and builds out an example application using React.js. We're confident this style will grow in popularity as larger organizations try to decompose UI development across multiple teams.
The use of continuous delivery pipelines to orchestrate the release process for software has become a mainstream concept. CI/CD tools can be used to test server configuration (e.g., Chef cookbooks, Puppet modules, Ansible playbooks), server image building (e.g., Packer), environment provisioning (e.g., Terraform, CloudFormation) and the integration of environments. The use of pipelines for infrastructure as code lets you find errors before changes are applied to operational environments — including environments used for development and testing. They also offer a way to ensure that infrastructure tooling is run consistently, using CI/CD agents rather than individual workstations. Our teams have had good results adopting this technique on their projects.
Automating the estimation, tracking and projection of cloud infrastructure's run cost is necessary for today's organizations. The cloud providers' savvy pricing models, combined with the proliferation of pricing parameters and the dynamic nature of today's architecture, can lead to surprisingly expensive run costs. For example, serverless priced per API call, event-streaming solutions priced by traffic and data-processing clusters priced by running jobs all have costs that change over time as the architecture evolves. When our teams manage infrastructure on the cloud, implementing run cost as an architecture fitness function is one of their early activities. This means that our teams can observe the cost of running services against the value delivered; when they see deviations from what was expected or acceptable, they'll discuss whether it's time to evolve the architecture. The observation and calculation of the run cost is implemented as an automated function.
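A minimal sketch of such an automated function, assuming an AWS account and the Cost Explorer API via boto3; the service tag, date range and threshold are hypothetical, and a real implementation would feed an alerting or reporting channel rather than print.

```python
# Sketch of a run-cost fitness function: compare the daily cost of a tagged service
# against an agreed threshold. Tag, dates and threshold are illustrative.
import boto3

THRESHOLD_USD_PER_DAY = 150.0  # hypothetical "expected" run cost for the service

ce = boto3.client("ce")
response = ce.get_cost_and_usage(
    TimePeriod={"Start": "2019-11-01", "End": "2019-11-08"},
    Granularity="DAILY",
    Metrics=["UnblendedCost"],
    Filter={"Tags": {"Key": "service", "Values": ["checkout"]}},  # illustrative tag
)

for day in response["ResultsByTime"]:
    cost = float(day["Total"]["UnblendedCost"]["Amount"])
    if cost > THRESHOLD_USD_PER_DAY:
        print(f"Fitness function failed on {day['TimePeriod']['Start']}: ${cost:.2f}")
```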
When adopting continuous delivery (CD) successfully, teams strive to make the various test environments look as close to production as possible. This allows them to avoid bugs that would otherwise only show themselves in the production environment. This remains just as valid for embedded and Internet of Things software; if we don't run our tests in realistic environments we can expect to find some bugs for the first time in production. Testing using real devices helps avoid this issue by making sure the right devices are available in the CD pipeline.
The power and promise of machine learning has created a demand for expertise that outstrips the supply of data scientists who specialize in this area. In response to this skills gap, we've seen the emergence of Automated machine learning (AutoML) tools that purport to make it easy for nonexperts to automate the end-to-end process of model selection and training. Examples include Google's AutoML, DataRobot and the H2O AutoML interface. Although we've seen promising results from these tools, we'd caution businesses against viewing them as the sum total of their machine-learning journey. As stated on the H2O website, "there is still a fair bit of knowledge and background in data science that is required to produce high-performing machine learning models." Blind trust in automated techniques also increases the risk of introducing ethical bias or making decisions that disadvantage minorities. While businesses may use these tools as a starting point to generate useful, trained models, we encourage them to seek out experienced data scientists to validate and refine the results.
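To show how low the barrier to entry is (and why expert validation still matters), here is a minimal sketch of the H2O AutoML interface from Python; the CSV file and target column are hypothetical.

```python
# Minimal AutoML sketch with H2O: the library searches over candidate models for you.
# The data set and target column are illustrative; results still need expert review.
import h2o
from h2o.automl import H2OAutoML

h2o.init()
train = h2o.import_file("loan_applications.csv")  # hypothetical data set
target = "defaulted"
features = [c for c in train.columns if c != target]
train[target] = train[target].asfactor()

aml = H2OAutoML(max_models=10, max_runtime_secs=600, seed=42)
aml.train(x=features, y=target, training_frame=train)

print(aml.leaderboard.head())  # ranked candidate models, not a finished ML journey
```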
As the usage of containers, the deployment of large fleets of services by autonomous teams and the increased speed of continuous delivery become common practice for many organizations, the need for automated deploy-time software security controls arises. Binary attestation is a technique for implementing deploy-time security controls: cryptographically verifying that a binary image is authorized for deployment. Using this technique, an attestor, such as an automated build process or a security team, signs off on the binaries that have passed the required quality checks and tests and are authorized to be deployed. Services such as GCP Binary Authorization, enabled by Grafeas, and tools such as in-toto and Docker Notary support creating attestations and validating the image signatures before deployment.
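To illustrate the underlying idea rather than any of the tools named above, the sketch below uses the Python cryptography library to sign an image digest after the quality gates pass and to verify that signature at deploy time; the digest and key handling are simplified and hypothetical.

```python
# Simplified sketch of the idea behind binary attestation; not how the named tools work
# internally. Key management and digest values are illustrative only.
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

# The attestor (e.g., the build pipeline after tests pass) signs the image digest.
attestor_key = Ed25519PrivateKey.generate()
image_digest = b"sha256:9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08"
attestation = attestor_key.sign(image_digest)

# The deployment controller verifies the attestation before admitting the image.
public_key = attestor_key.public_key()
try:
    public_key.verify(attestation, image_digest)
    print("Image is attested; deployment allowed")
except InvalidSignature:
    print("No valid attestation; deployment blocked")
```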
With the increased popularity of ML-based applications, and the technical complexity involved in building them, our teams rely heavily on continuous delivery for machine learning (CD4ML) to deliver such applications safely, quickly and in a sustainable manner. CD4ML is the discipline of bringing CD principles and practices to ML applications. It removes long cycle times between training models and deploying them to production. CD4ML removes manual handoffs between the different teams involved (data engineers, data scientists and ML engineers) in the end-to-end process of building and deploying a model served by an application. Using CD4ML, our teams have successfully implemented the automated versioning, testing and deployment of all components of ML-based applications: data, model and code.
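One small piece of such a pipeline, sketched below, is an automated quality gate that only promotes a newly trained model when it clears an agreed metric threshold; the file path, metric and threshold are hypothetical.

```python
# Hypothetical CD4ML pipeline stage: gate model promotion on an evaluation metric
# produced by the training stage. Path, metric name and threshold are illustrative.
import json
import sys

MINIMUM_ACCURACY = 0.92  # illustrative value agreed with the product team

with open("metrics/evaluation.json") as f:  # written by the training stage
    metrics = json.load(f)

if metrics["accuracy"] < MINIMUM_ACCURACY:
    print(f"Model accuracy {metrics['accuracy']:.3f} is below the gate; stopping the pipeline")
    sys.exit(1)

print("Quality gate passed; promoting model artifact to the deployment stage")
```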
One of the main points of friction for data scientists and analysts in their workflow is locating the data they need, making sense of it and evaluating whether it's trustworthy enough to use. This remains a challenge due to missing metadata about the available data sources and a lack of adequate functionality for searching and locating data. We encourage teams who are providing analytical data sets or building data platforms to make data discoverability a first-class function of their environments: to provide the ability to easily locate available data, detect its quality, understand its structure and lineage and get access to it. Traditionally this function has been provided by bloated data cataloguing solutions. In recent years, we've seen the growth of open-source projects that improve the developer experience for both data providers and data consumers by doing one thing really well: making data discoverable. Amundsen by Lyft and WhereHows by LinkedIn are among these tools. What we'd like to see is a change in providers' behavior: intentionally sharing the metadata that aids discoverability, rather than relying on discoverability tools to infer partial metadata from silos of application databases.
Many teams and organizations have no formal or consistent way of tracking technical dependencies in their software. This issue often shows itself when that software needs to be changed, at which point the use of an outdated version of a library, API or component causes problems or delay. The dependency drift fitness function is a technique for introducing a specific evolutionary architecture fitness function that tracks these dependencies over time, giving an indication of the possible work needed and whether a potential issue is getting better or worse.
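As one possible shape of such a fitness function for a Python codebase, the sketch below compares installed package versions against the latest releases on PyPI and reports a drift score; the package list is illustrative, and a real implementation would record the score over time to show the trend.

```python
# Minimal dependency drift fitness function: count how many tracked packages lag
# behind their latest PyPI release. The tracked package list is illustrative.
from importlib.metadata import version
import requests

TRACKED_PACKAGES = ["requests", "flask", "sqlalchemy"]

drift = 0
for package in TRACKED_PACKAGES:
    installed = version(package)
    latest = requests.get(f"https://pypi.org/pypi/{package}/json").json()["info"]["version"]
    if installed != latest:
        drift += 1
        print(f"{package}: installed {installed}, latest {latest}")

print(f"Dependency drift score: {drift}/{len(TRACKED_PACKAGES)} packages behind")
```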
As application development becomes increasingly dynamic and complex, it's a challenge to achieve the effective delivery of accessible and usable products that are consistent in style. Design systems define a collection of design patterns, component libraries and good design and engineering practices that ensure consistency in the development of digital products. We've found design systems a useful addition to our toolbox when working across teams and disciplines in product development, because they allow teams to focus on more strategic challenges around the product itself without the need to reinvent the wheel every time they need to add a visual component. The types of components and tools you use to create design systems can vary greatly.
The day-to-day work of machine learning often boils down to a series of experiments in selecting a modeling approach, the network topology, training data and various optimizations or tweaks to the model. Because many of these models are still difficult to interpret or explain, data scientists must use experience and intuition to hypothesize changes and then measure the impact those changes have on the overall performance of the model. As these models have become increasingly common in business systems, several different experiment tracking tools for machine learning have emerged to help investigators keep track of these experiments and work through them methodically. Although no clear winner has emerged, tools such as MLflow or Weights & Biases and platforms such as Comet or Neptune have introduced rigor and repeatability into the entire machine learning workflow. They also facilitate collaboration and help turn data science from a solitary endeavor into a team sport.
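As a minimal sketch of what this rigor looks like in practice, the example below records one experiment as an MLflow run; the model, data set and hyperparameter are illustrative.

```python
# Minimal experiment tracking sketch with MLflow; model and data are illustrative.
import mlflow
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

with mlflow.start_run(run_name="baseline-logreg"):
    mlflow.log_param("C", 0.5)
    model = LogisticRegression(C=0.5, max_iter=1000).fit(X_train, y_train)
    mlflow.log_metric("test_accuracy", model.score(X_test, y_test))
    # Each tweak to features, topology or hyperparameters becomes a comparable run.
```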
Deep neural networks have demonstrated remarkable recall and accuracy across a wide range of problems. Given sufficient training data and an appropriately chosen topology, these models meet and exceed human capabilities in select problem spaces. However, they're inherently opaque. Although parts of models can be reused through transfer learning, we're seldom able to ascribe any human-understandable meaning to these elements. In contrast, an explainable model is one that allows us to say how a decision was made. For example, a decision tree yields a chain of inference that describes the classification process. Explainability becomes critical in certain regulated industries or when we're concerned about the ethical impact of a decision. As these models are incorporated more widely into critical business systems, it's important to consider explainability as a first-class model selection criterion. Despite their power, neural networks might not be an appropriate choice when explainability requirements are strict.
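A small illustration of that chain of inference, using scikit-learn (our choice for the example, not mandated by the technique) to print the rules of a decision tree that a human can follow; the data set is just an example.

```python
# An explainable model in action: a shallow decision tree whose rules can be read
# and audited, in contrast to the opaque weights of a deep neural network.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(iris.data, iris.target)

# Prints a human-readable chain of inference, e.g. "petal width <= 0.8 -> class 0".
print(export_text(tree, feature_names=list(iris.feature_names)))
```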
Security policies are rules and procedures that protect our systems from threats and disruption. For example, access control policies define and enforce who can access which services and resources under what circumstances; or network security policies can dynamically limit the traffic rate to a particular service. The complexity of the technology landscape today demands treating security policy as code: define and keep policies under version control, automatically validate them, automatically deploy them and monitor their performance. Tools such as Open Policy Agent, or platforms such as Istio provide flexible policy definition and enforcement mechanisms that support the practice of security policy as code.
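As a sketch of what consuming such a policy looks like at runtime, a service can ask a locally running Open Policy Agent for an access-control decision through OPA's data API, while the policy itself lives in version control and is tested and deployed automatically; the policy package name and input document below are hypothetical.

```python
# Hypothetical access-control check against a local OPA instance. The policy package
# ("httpapi/authz") and the input fields are illustrative; the Rego policy itself
# would be versioned, tested and deployed like any other code.
import requests

decision = requests.post(
    "http://localhost:8181/v1/data/httpapi/authz/allow",
    json={"input": {"user": "alice", "method": "GET", "path": "/payments/42"}},
).json()

if decision.get("result") is True:
    print("Request allowed by policy")
else:
    print("Request denied by policy")
```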
Many of the technical solutions we build today run in increasingly complex polycloud or hybrid-cloud environments with multiple distributed components and services. Under such circumstances, we apply two security principles early in implementation: zero trust network, never trust the network and always verify; and the principle of least privilege, granting the minimum permissions necessary for performing a particular job. Sidecars for endpoint security is a common technique we use to implement these principles, enforcing security controls at every component's endpoint, e.g., APIs of services, data stores or the Kubernetes control interface. We do this using an out-of-process sidecar — a process or a container that is deployed and scheduled with each service, sharing the same execution context, host and identity. Open Policy Agent and Envoy are tools that implement this technique. Sidecars for endpoint security minimize the trusted footprint to a local endpoint rather than the network perimeter. We like to see the responsibility for the sidecar's security policy configuration left with the team responsible for the endpoint rather than with a separate centralized team.
Zhong Tai has been a buzzword in the Chinese IT industry for years, but it has yet to catch on in the West. At its core, Zhong Tai is an approach to delivering encapsulated business models. It's designed to help a new breed of small businesses deliver first-rate services without the costs of traditional enterprise infrastructure and to enable existing organizations to bring innovative services to market at breakneck speed. The Zhong Tai strategy was originally proposed by Alibaba and soon followed by many Chinese Internet companies, because their digital-native business model makes it suitable to replicate for new markets and sectors. Nowadays, more Chinese firms are using Zhong Tai as a lever for digital transformation.
BERT stands for Bidirectional Encoder Representations from Transformers; it's a new method of pretraining language representations which was published by researchers at Google in October 2018. BERT has significantly altered the natural language processing (NLP) landscape by obtaining state-of-the-art results on a wide array of NLP tasks. Based on the Transformer architecture, it learns from both the left and right side of a token's context during training. Google has also released pretrained general-purpose BERT models that have been trained on a large corpus of unlabeled text, including Wikipedia. Developers can use and fine-tune these pretrained models on their task-specific data and achieve great results. We talked about transfer learning for NLP in our April 2019 edition of the Radar; BERT and its successors continue to make transfer learning for NLP a very exciting field with significant reduction in effort for users dealing with text classification.
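A minimal sketch of reusing a pretrained BERT model for text classification, assuming the Hugging Face transformers library (one popular way to consume these models, not mentioned above); fine-tuning on your own labelled data would follow with a standard training loop.

```python
# Load a pretrained BERT and attach an (untrained) two-class classification head.
# Assumes the "transformers" library with a PyTorch backend; inputs are illustrative.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

inputs = tokenizer("The package arrived broken and late.", return_tensors="pt")
outputs = model(**inputs)      # logits from the not-yet-fine-tuned classification head
print(outputs.logits)
```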
Data mesh is an architectural paradigm that unlocks analytical data at scale; rapidly unlocking access to an ever-growing number of distributed domain data sets, for a proliferation of consumption scenarios such as machine learning, analytics or data intensive applications across the organization. Data mesh addresses the common failure modes of the traditional centralized data lake or data platform architecture, with a shift from the centralized paradigm of a lake, or its predecessor, the data warehouse. Data mesh shifts to a paradigm that draws from modern distributed architecture: considering domains as the first-class concern, applying platform thinking to create a self-serve data infrastructure, treating data as a product and implementing open standardization to enable an ecosystem of interoperable distributed data products.
Over the past year, we've seen a shift in interest around machine learning and deep neural networks in particular. Until now, tool and technique development has been driven by excitement over the remarkable capabilities of these models. Currently though, there is rising concern that these models could cause unintentional harm. For example, a model could be trained to make profitable credit decisions by simply excluding disadvantaged applicants. Fortunately, we're seeing a growing interest in ethical bias testing that will help to uncover potentially harmful decisions. Tools such as lime, AI Fairness 360 or What-If can help uncover inaccuracies that result from underrepresented groups in training data and visualization tools such as Google Facets or Facets Dive can be used to discover subgroups within a corpus of training data. However, this is a developing field and we expect standards and practices specific to ethical bias testing to emerge over time.
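As one example of such a test, the sketch below uses AI Fairness 360 to compute disparate impact, the ratio of favorable outcomes between unprivileged and privileged groups; the data frame, column names and group definitions are hypothetical.

```python
# Hypothetical ethical bias check with AI Fairness 360: disparate impact of loan
# approvals between a privileged and an unprivileged age group. Data is made up.
import pandas as pd
from aif360.datasets import BinaryLabelDataset
from aif360.metrics import BinaryLabelDatasetMetric

df = pd.DataFrame({
    "approved":  [1, 0, 1, 1, 0, 1, 0, 0],   # favorable outcome = 1
    "age_group": [1, 1, 1, 1, 0, 0, 0, 0],   # 1 = privileged, 0 = unprivileged
    "income":    [60, 32, 75, 58, 41, 67, 29, 35],
})

dataset = BinaryLabelDataset(df=df, label_names=["approved"],
                             protected_attribute_names=["age_group"])
metric = BinaryLabelDatasetMetric(dataset,
                                  privileged_groups=[{"age_group": 1}],
                                  unprivileged_groups=[{"age_group": 0}])
print("Disparate impact:", metric.disparate_impact())  # values far from 1.0 warrant scrutiny
```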
Model training generally requires collecting data from its source and transporting it to a centralized location where the model training algorithm runs. This becomes particularly problematic when the training data consists of personally identifiable information. We're encouraged by the emergence of federated learning as a privacy-preserving method for training on a large diverse set of data relating to individuals. Federated learning techniques allow the data to remain on the users' device, under their control, yet contribute to an aggregate corpus of training data. In one such technique, each user device updates a model independently; then the model parameters, rather than the data itself, are combined into a centralized view. Network bandwidth and device computational limitations present some significant technical challenges, but we like the way federated learning leaves users in control of their own personal information.
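A toy numpy illustration of that parameter-combining idea (federated averaging on a linear model with synthetic data); real systems add secure aggregation, sampling of devices and far more careful optimization.

```python
# Toy federated averaging: each "device" updates the model on its own data and only
# the resulting weights, never the raw data, are averaged into the global model.
import numpy as np

def local_update(weights, local_data, local_labels, lr=0.1):
    # One gradient step for a linear model, computed entirely on-device.
    predictions = local_data @ weights
    gradient = local_data.T @ (predictions - local_labels) / len(local_labels)
    return weights - lr * gradient

global_weights = np.zeros(3)
devices = [(np.random.rand(20, 3), np.random.rand(20)) for _ in range(5)]  # synthetic data

for round_ in range(10):
    local_weights = [local_update(global_weights, X, y) for X, y in devices]
    global_weights = np.mean(local_weights, axis=0)  # only parameters leave each device

print("Aggregated model weights:", global_weights)
```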
Linking records from different data providers in the presence of a shared key is trivial. However, you may not always have a shared key; even if you do, it may not be a good idea to expose it due to privacy concerns. Privacy-preserving record linkage (PPRL) using Bloom filters (a space-efficient probabilistic data structure) is an established technique that allows probabilistic linkage of records from different data providers without exposing personally identifiable data. For example, when linking data from two data providers, each provider encrypts its personally identifiable data using a Bloom filter to get cryptographic linkage keys and then sends them to you via a secure channel. Once the data is received, the records can be linked by computing similarity scores between sets of cryptographic linkage keys from each provider. Among other techniques, we found PPRL using Bloom filters to be scalable for large data sets.
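A toy illustration of the core idea: each provider encodes a name as a Bloom filter of character bigrams, and the linker compares the filters with a Dice coefficient without ever seeing the names. Real PPRL schemes add keyed hashing, salting and hardening, which this sketch omits.

```python
# Toy PPRL sketch: encode names as Bloom filters of bigrams and compare them with a
# Dice coefficient. Filter size, hash count and inputs are illustrative only.
import hashlib

FILTER_SIZE = 256
NUM_HASHES = 4

def bloom_encode(value):
    bits = set()
    bigrams = [value[i:i + 2] for i in range(len(value) - 1)]
    for gram in bigrams:
        for seed in range(NUM_HASHES):
            digest = hashlib.sha256(f"{seed}:{gram}".encode()).hexdigest()
            bits.add(int(digest, 16) % FILTER_SIZE)
    return bits

def dice_similarity(a, b):
    return 2 * len(a & b) / (len(a) + len(b))

provider_a = bloom_encode("jonathan smith")
provider_b = bloom_encode("jonathon smith")  # same person, slightly different spelling
print(f"Similarity: {dice_similarity(provider_a, provider_b):.2f}")  # high score suggests a link
```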
Semi-supervised learning loops are a class of iterative machine-learning workflows that take advantage of the relationships to be found in unlabeled data. These techniques may improve models by combining labeled and unlabeled data sets in various ways. In other cases they compare models trained on different subsets of the data. Unlike either unsupervised learning, where a machine infers classes in unlabeled data, or supervised techniques, where the training set is entirely labeled, semi-supervised techniques take advantage of a small set of labeled data and a much larger set of unlabeled data. Semi-supervised learning is also closely related to active learning techniques, where a human is directed to selectively label ambiguous data points. Since experts who can accurately label data are a scarce resource and labeling is often the most time-consuming activity in the machine-learning workflow, semi-supervised techniques lower the cost of training and make machine learning feasible for a new class of users.
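One simple form of such a loop is self-training, sketched below with scikit-learn (our choice for the example): train on the small labeled set, pseudo-label the unlabeled points the model is most confident about, and retrain. The data, confidence threshold and number of iterations are illustrative.

```python
# A self-training loop: a small labeled set plus confident pseudo-labels from a much
# larger unlabeled pool. Data, threshold and iteration count are illustrative.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y_true = make_classification(n_samples=1000, random_state=0)
labels = np.full(len(y_true), -1)   # -1 marks unlabeled examples
labels[:50] = y_true[:50]           # only a small labeled set is available

model = LogisticRegression(max_iter=1000)
for _ in range(5):
    labeled = labels != -1
    model.fit(X[labeled], labels[labeled])
    probabilities = model.predict_proba(X[~labeled]).max(axis=1)
    confident = np.where(~labeled)[0][probabilities > 0.95]
    labels[confident] = model.predict(X[confident])  # accept confident pseudo-labels

print(f"{(labels != -1).sum()} of {len(labels)} examples labeled after self-training")
```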
The old term 10x engineer has come under scrutiny these past few months. A widely shared Twitter thread essentially suggests companies should excuse antisocial and damaging behaviors in order to retain engineers who are perceived as having immense individual output. Thankfully, many people on social media made fun of the concept, but the stereotype of the "rockstar developer" is still pervasive. In our experience, great engineers are driven not by individual output but by working in amazing teams. It's more effective to build teams of talented individuals with mixed experiences and diverse backgrounds and provide the right ingredients for teamwork, learning and continuous improvement. These 10x teams can move faster, scale more quickly and are much more resilient — without needing to pander to bad behaviors.
When teams embrace the concept of micro frontends they have a number of patterns at their disposal to integrate the individual micro frontends into one application. As always there are antipatterns, too. A common one in this case is front-end integration via artifact. For each micro frontend an artifact is built, usually an NPM package, which is pushed into a registry. A later step, sometimes in a different build pipeline, then combines the individual packages into a final package that contains all micro frontends. From a purely technical perspective this integration at build time results in a working application. However, integrating via artifact implies that for each change the full artifact needs to be rebuilt, which is time-consuming and will likely have a negative impact on developer experience. Worse, this style of integrating frontends also introduces direct dependencies between the micro frontends at build time and therefore causes considerable coordination overhead.
We've been building serverless architectures on our projects for a couple of years now, and we've noticed that it's quite easy to fall into the trap of building a distributed monolith. Lambda pinball architectures characteristically lose sight of important domain logic in the tangled web of lambdas, buckets and queues as requests bounce around increasingly complex graphs of cloud services. Typically they're hard to test as units, and the application needs to be tested as an integrated whole. One pattern we can use to avoid these pinball architectures is to draw a distinction between public and published interfaces and apply good old domain boundaries with published interfaces between them.
We find that more and more organizations need to replace aging legacy systems to keep up with the demands of their customers (both internal and external). One antipattern we keep seeing is legacy migration feature parity, the desire to retain feature parity with the old. We see this as a huge missed opportunity. Often the old systems have bloated over time, with many features unused by users (50% according to a 2014 Standish Group report) and business processes that have evolved over time. Replacing these features is a waste. Our advice: Convince your customers to take a step back and understand what their users currently need and prioritize these needs against business outcomes and metrics — which often is easier said than done. This means conducting user research and applying modern product development practices rather than simply replacing the existing ones.