With the growing and diverse data needs of enterprises comes a growing need for metadata management. Apache Atlas is a metadata management framework that fits the data governance needs of enterprises. Atlas provides capabilities to model types for metadata, classify data assets, track the data lineage and enable data discovery. However, when building a metadata management platform, we need to be careful not to repeat the mistakes of master data management.
We first placed AWS in Adopt seven years ago, and the breadth, depth and reliability of its services have improved in leaps and bounds since then. However, we're now moving AWS back into Trial, not because of any deficiencies in its offering, but because its competitors, GCP and Azure, have matured considerably and selecting a cloud provider has become increasingly complex. We reserve Adopt for when we see a clear winner in a field. For many years, AWS was the default choice, but we now feel that organizations should make a balanced selection across cloud providers that takes into account their geographic and regulatory footprint, their strategic alignment (or lack thereof) with the providers, and, of course, the fit between their most important needs and the cloud providers' differentiating products.
Microsoft has steadily improved Azure, and today not much separates the core cloud experience offered by the major cloud providers: Amazon, Google and Microsoft. The providers seem to agree and seek to differentiate themselves in other areas such as features, services and cost structure. Microsoft is the provider that shows real interest in the legal requirements of European companies. It has a nuanced and plausible strategy, including unique offerings such as Azure Germany and Azure Stack, which gives European companies some certainty in anticipation of the GDPR and possible legislative changes in the United States.
Headless content management systems (CMSes) are becoming a common component of digital platforms. Contentful is a modern headless CMS that our teams have successfully integrated into their development workflows. We particularly like its API-first approach and its support for CMS as code. It offers powerful content modeling primitives as code as well as content model evolution scripts, which let teams treat the content model like any other data store schema and apply evolutionary database design practices to CMS development. Other features we've liked include the two CDNs it uses to deliver media assets and JSON documents, good support for localization and the ability, albeit with some effort, to integrate with Auth0.
As Google Cloud Platform (GCP) has expanded in terms of available geographic regions and maturity of services, customers globally can now seriously consider it for their cloud strategy. In some areas, GCP has reached feature parity with its main competitor, Amazon Web Services, while in other areas it has differentiated itself—notably with accessible machine learning platforms, data engineering tools, and a workable Kubernetes as a service solution (GKE). In practice, our teams have nothing but praise for the developer experience working with the GCP tools and APIs.
As we've gained more experience with the public cloud across organizations large and small, certain patterns have emerged. One of those patterns is a virtual private cloud network managed at the organizational level and divided into smaller subnets under the control of each delivery team. This is closely related to the idea of a multiaccount cloud setup and helps to partition an infrastructure along team boundaries. After configuring this setup many times using VPCs, subnets, security groups and NACLs, we really like Google's notion of the shared VPC. Shared VPC makes organizations, projects, VPCs and subnets first-class entities in network configurations. VPCs can be managed by an organization's administrators, who can delegate subnet administration to projects. Projects can then be explicitly associated with subnets in the VPC. This simplifies configuration and makes security and access control more transparent.
TICK Stack is a collection of open source components that combine to deliver a platform for easily storing, visualizing and monitoring time series data such as metrics and events. The components are: Telegraf, a server agent for collecting and reporting metrics; InfluxDB, a high-performance time series database; Chronograf, a user interface for the platform; and Kapacitor, a data-processing engine that can process, stream and batch data from InfluxDB. Unlike Prometheus, which is based on the pull model, TICK Stack is based on the push model of collecting data. The heart of the system is the InfluxDB component, which is one of the best time series databases. The stack is backed by InfluxData and although you need the enterprise version for features such as database clustering, it's still a fairly good choice for monitoring. We're using it in a few places in production and have had good experiences with it.
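To make the push model concrete, the following is a minimal sketch of how an agent might format a metric in InfluxDB's line protocol (`measurement,tag=value field=value timestamp`) before pushing it to the database. The function name `to_line_protocol` and the sample measurement are hypothetical, the escaping is simplified, and the actual network write is left out; in a real deployment Telegraf handles all of this.

```python
import time


def _escape_tag(value: str) -> str:
    # Tag keys, tag values and field keys escape commas, spaces and equals signs.
    return value.replace(",", r"\,").replace(" ", r"\ ").replace("=", r"\=")


def _render_field(value) -> str:
    # bool must be checked before int, because bool is a subclass of int.
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"  # integers carry an "i" suffix
    if isinstance(value, float):
        return repr(value)
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'  # strings are double-quoted


def to_line_protocol(measurement, tags, fields, timestamp_ns=None):
    """Format a single point as one line of line protocol text."""
    if not fields:
        raise ValueError("a point needs at least one field")
    name = measurement.replace(",", r"\,").replace(" ", r"\ ")
    tag_part = "".join(
        f",{_escape_tag(k)}={_escape_tag(str(v))}" for k, v in sorted(tags.items())
    )
    field_part = ",".join(
        f"{_escape_tag(k)}={_render_field(v)}" for k, v in fields.items()
    )
    ts = time.time_ns() if timestamp_ns is None else timestamp_ns
    return f"{name}{tag_part} {field_part} {ts}"


# The agent decides when to send this line; the database never polls for it.
line = to_line_protocol("cpu", {"host": "web-1"}, {"usage": 0.5}, timestamp_ns=1)
```

With a pull-based system such as Prometheus the direction is reversed: the monitored service exposes its current values and the server scrapes them on its own schedule.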
Azure DevOps services include a set of managed services such as hosted Git repos, CI and CD pipelines and an artifact repository. They have replaced Visual Studio Team Services. We've had a good experience in starting projects quickly with Azure DevOps services: managing, building and releasing applications to Azure. We've also run into a few challenges, such as the lack of full support for CI and CD pipelines as code, slow build agent startup times and the separation of build and release into different pipelines, and we've experienced some downtime. We're hoping that Azure DevOps services improve over time to provide a good developer experience when hosting applications on Azure, with a frictionless experience integrating with other Azure services.
CockroachDB is an open source distributed database inspired by the white paper Spanner: Google's Globally-Distributed Database. In CockroachDB, data is automatically divided into ranges, usually 64 MB in size, and distributed across nodes in the cluster. Each range is replicated by its own consensus group, which uses the Raft consensus algorithm to keep the replicas in sync. With its unique design, CockroachDB provides distributed transactions and geo-partitioning while still supporting SQL. Unlike Spanner, which relies on TrueTime with atomic clocks for linearizability, CockroachDB uses NTP for clock synchronization and provides serializability as the default isolation level. If you're working with structured data that fits in a single node, then choose a traditional relational database. However, if your data needs to scale across nodes, be consistent and survive failures, then we recommend you take a closer look at CockroachDB.
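Two ideas from this description can be illustrated with a toy sketch: splitting an ordered keyspace into size-capped ranges, and the majority quorum a Raft group needs before a write counts as committed. All names here are hypothetical and the logic is deliberately simplified; it does not reflect CockroachDB's actual implementation.

```python
def split_into_ranges(rows, max_bytes):
    """Group (key, size_in_bytes) pairs into contiguous key ranges.

    Keys are processed in sorted order, and a new range starts whenever
    adding the next row would push the current range past max_bytes.
    """
    ranges, current, current_size = [], [], 0
    for key, size in sorted(rows):
        if current and current_size + size > max_bytes:
            ranges.append(current)
            current, current_size = [], 0
        current.append(key)
        current_size += size
    if current:
        ranges.append(current)
    return ranges


def raft_quorum(replicas):
    """Smallest number of replicas that forms a strict majority."""
    return replicas // 2 + 1


def is_committed(acks, replicas):
    """A write is committed once a majority of replicas acknowledged it."""
    return acks >= raft_quorum(replicas)


# Sizes are in MB for readability; the 64 cap mirrors the usual range size.
ranges = split_into_ranges([("a", 40), ("b", 30), ("c", 20), ("d", 50)], max_bytes=64)
```

The quorum rule is why a range with three replicas keeps accepting writes when one node fails: two acknowledgements are still a majority.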
Debezium is a change data capture (CDC) platform that can stream database changes onto Kafka topics. CDC is a popular technique with multiple use cases, including replicating data to other databases, feeding analytics systems, extracting microservices from monoliths and invalidating caches. We're always on the lookout for tools or platforms in this space (we talked about Bottled Water in a previous Radar) and Debezium is an excellent choice. It uses a log-based CDC approach, which means it works by reacting to changes in the database's log files. Debezium is built on Kafka Connect, which makes it highly scalable and resilient to failures, and it has CDC connectors for multiple databases including Postgres, MySQL and MongoDB. We're using it in a few projects and it has worked very well for us.
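As an illustration of the cache invalidation use case, here is a minimal sketch of a consumer-side handler for a Debezium change event. Debezium events carry an `op` code (`c` for create, `u` for update, `d` for delete, `r` for a snapshot read) together with `before` and `after` row images. The in-memory `cache` dictionary, the `id` key column and the function name are assumptions made for this example, and the Kafka consumer wiring is omitted.

```python
import json


def apply_change_event(cache, event_json, key_field="id"):
    """Evict the cached entry for a row that was updated or deleted.

    Returns the evicted key, or None when the event needs no invalidation.
    """
    event = json.loads(event_json)
    # Depending on converter settings the envelope may be wrapped in a
    # {"schema": ..., "payload": ...} object, so accept both shapes.
    payload = event.get("payload", event)
    if payload.get("op") not in ("u", "d"):
        return None  # creates and snapshot reads cannot make a cached row stale
    # "after" is null for deletes; "before" may be null for updates,
    # depending on how the source database is configured.
    row = payload.get("before") or payload.get("after") or {}
    key = row.get(key_field)
    if key is not None:
        cache.pop(key, None)
    return key


cache = {1: {"name": "old"}, 2: {"name": "other"}}
update = json.dumps(
    {"payload": {"op": "u", "before": {"id": 1}, "after": {"id": 1, "name": "new"}}}
)
evicted = apply_change_event(cache, update)
```

Because the handler reacts to the database's own log rather than to application code paths, writes made by any client of the database trigger the same invalidation.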
Google Cloud Dataflow is useful in traditional ETL scenarios for reading data from a source, transforming it and then storing it to a sink, with configuration and scaling managed by Dataflow. Dataflow supports Java, Python and Scala and provides wrappers for connections to various types of data sources. However, the current version won't let you add additional libraries, which may make it unsuitable for certain data manipulations. You also can't change the Dataflow DAG dynamically. Hence, if your ETL has conditional execution flows based on parameters, you may not be able to use Dataflow without workarounds.
gVisor is a user-space kernel for containers. It limits the host kernel surface accessible to the application without taking away access to all the features it expects. Unlike existing sandbox technologies, such as virtualized hardware (KVM and Xen) or rule-based execution (seccomp, SELinux and AppArmor), gVisor takes a distinct approach to container sandboxing by intercepting application system calls and acting as the guest kernel without the need for translation through virtualized hardware. gVisor includes an Open Container Initiative (OCI) runtime called runsc that integrates with Docker and provides experimental support for Kubernetes. gVisor is a relatively new project and we recommend assessing it for your container security landscape.
In most cases, a blockchain is not the right place to store a blob file (e.g., an image or audio file). When developing a DApp, one option is to put blob files in some off-chain centralized data storage, which usually signals a lack of trust. Another option is to store them on the InterPlanetary File System (IPFS), which is a content-addressed, versioned, peer-to-peer file system. It's designed to distribute high volumes of data efficiently and without relying on any centralized authority. Files are stored on peers that don't need to trust each other. IPFS keeps every version of a file so you never lose important files. We see IPFS as a good complement to blockchain technology. Beyond its blockchain application, IPFS has an ambitious goal to decentralize the Internet infrastructure.
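The idea of content addressing can be sketched in a few lines: a blob's address is derived from its bytes, so anyone who holds the address can verify what an untrusted peer returns. This is a simplified illustration only. Real IPFS splits files into chunks and uses multihash-based content identifiers rather than a bare SHA-256 hex digest, and the `ContentStore` class is hypothetical.

```python
import hashlib


class ContentStore:
    """Toy in-memory store in which the address of a blob is its SHA-256 digest."""

    def __init__(self):
        self._blobs = {}

    def put(self, data: bytes) -> str:
        address = hashlib.sha256(data).hexdigest()
        self._blobs[address] = data
        return address

    def get(self, address: str) -> bytes:
        data = self._blobs[address]
        # The reader does not have to trust the store: it re-hashes the
        # bytes and compares the result with the address it asked for.
        if hashlib.sha256(data).hexdigest() != address:
            raise ValueError("content does not match its address")
        return data


store = ContentStore()
v1 = store.put(b"logo, first version")
v2 = store.put(b"logo, second version")
```

A new version of a file gets a new address, so the old version stays reachable under its old one. A smart contract would then record only the short address on-chain while the blob itself lives off-chain.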
When building and operating a microservices ecosystem, one of the early questions to answer is how to implement cross-cutting concerns such as service discovery, service-to-service and origin-to-service security, observability (including telemetry and distributed tracing), rolling releases and resiliency. Over the last couple of years, our default answer to this question has been to use a service mesh. A service mesh implements these cross-cutting capabilities as an infrastructure layer that is configured as code. The policy configurations can be applied consistently to the whole ecosystem of microservices and enforced both on traffic entering and leaving the mesh (via the mesh proxy acting as a gateway) and on the traffic at each service (via the same mesh proxy running as a sidecar container). While we're keeping a close eye on the progress of different open source service mesh projects such as Linkerd, we've successfully used Istio in production with a surprisingly easy-to-configure operating model.
As application developers, we love to focus on solving core business problems and let the underlying platform handle the boring but difficult tasks of deploying, scaling and managing applications. Although serverless architecture is a step in that direction, most of the popular offerings are tied to a proprietary implementation, which means vendor lock-in. Knative tries to address this by being an open source serverless platform that integrates well with the popular Kubernetes ecosystem. With Knative you can model computations on request in a supported framework of your choice (including Ruby on Rails, Django and Spring among others); subscribe, deliver and manage events; integrate with familiar CI and CD tools; and build containers from source. By providing a set of middleware components for building source-centric and container-based applications that can be elastically scaled, Knative is an attractive platform that deserves to be assessed for your serverless needs.
Ethereum is the leading developer ecosystem in blockchain tech. We've seen emerging solutions that aim to spread this technology into enterprise environments, which usually require network permissioning and transaction privacy as well as higher throughput and lower latency. Quorum is one of these solutions. Originally developed by J.P. Morgan, Quorum positions itself as "an enterprise-focused version of Ethereum." Unlike the Hyperledger Burrow node, which creates a new Ethereum virtual machine (EVM), Quorum forks code from Ethereum's official client so that it can evolve alongside Ethereum. Although it keeps most features of the Ethereum ledger, Quorum changes the consensus protocol from PoW to more efficient ones and adds private transaction support. With Quorum, developers can apply their existing Ethereum knowledge, for example of Solidity and Truffle, to build enterprise blockchain applications. However, based on our experience, Quorum is not yet enterprise ready; for example, it lacks access control for private contracts, doesn't work well with load balancers and only has partial database support, all of which lead to a significant deployment and design burden. We recommend caution when implementing Quorum and keeping an eye on its development.
Resin.io is an Internet of Things (IoT) platform that does one thing and does it well: it deploys containers onto devices. Developers use a software as a service (SaaS) portal to manage devices and assign applications, defined by Dockerfiles, to them. The platform can build containers for various hardware types and deploys the images over the air. For the containers, Resin.io uses balena, an engine based on the Moby framework created by Docker. The platform is still under development, has some rough edges and lacks some features (e.g., working with private registries), but the current feature set, including the option to ssh into a container on a device from the web portal, points toward a promising future.
Rook is an open source cloud-native storage orchestrator for Kubernetes. Rook integrates with Ceph and brings file, block and object storage systems into the Kubernetes cluster, running them seamlessly alongside other applications and services that consume the storage. By using Kubernetes operators, Rook orchestrates Ceph at the control plane and stays clear of the data path between applications and Ceph. Storage is one of the important components of cloud-native computing and we believe that Rook, though still an incubating-level project at CNCF, takes us a step closer to self-sufficiency and portability across public cloud and on-premises deployments.
Making key elements of Google's groundbreaking, high-scale platform available as open source offerings appears to have become a trend. In the same way that HBase drew on Bigtable and Kubernetes drew on Borg, SPIFFE is now drawing upon Google's LOAS to bring to life a critical cloud-native concept called workload identity. The SPIFFE standards are backed by the OSS SPIFFE Runtime Environment (SPIRE), which automatically delivers cryptographically provable identities to software workloads. Although SPIRE isn't quite ready for production use, we see tremendous value in a platform-agnostic way to make strong identity assertions between workloads in modern, distributed IT infrastructures. SPIRE supports many use cases, including identity translation, OAuth client authentication, mTLS "encryption everywhere," and workload observability. Istio uses SPIFFE by default.
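A workload identity in SPIFFE is expressed as a URI of the form `spiffe://<trust domain>/<path>`. The following minimal sketch parses such an ID and checks whether two workloads share a trust domain. It covers only the basic shape of the ID, not the full validation rules of the standard or the cryptographic verification that SPIRE performs, and the function names and example IDs are hypothetical.

```python
from typing import NamedTuple
from urllib.parse import urlsplit


class SpiffeId(NamedTuple):
    trust_domain: str
    path: str


def parse_spiffe_id(uri: str) -> SpiffeId:
    """Split a SPIFFE ID into its trust domain and workload path."""
    parts = urlsplit(uri)
    if parts.scheme != "spiffe":
        raise ValueError("a SPIFFE ID must use the spiffe scheme")
    if not parts.netloc:
        raise ValueError("a SPIFFE ID must name a trust domain")
    if parts.query or parts.fragment:
        raise ValueError("a SPIFFE ID must not carry a query or fragment")
    return SpiffeId(trust_domain=parts.netloc, path=parts.path)


def same_trust_domain(a: str, b: str) -> bool:
    """True when both workloads were issued identities by the same trust domain."""
    return parse_spiffe_id(a).trust_domain == parse_spiffe_id(b).trust_domain


web = parse_spiffe_id("spiffe://example.org/ns/default/sa/web")
```

Because the identity names the workload rather than a host or IP address, the same assertion holds regardless of which platform the workload is scheduled on.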
Data-hungry packages are solutions that require absorption of data into themselves in order to function. In some cases they may even require that they become the "master" for that data. Once the data is owned by the package, that software becomes the only way to update, change or access the data. The data-hungry package might solve a particular business problem such as ERP. However, inventory or finance "data demands" placed upon an organization will often require complex integration and changes to systems that lie well outside of the original scope.
Low-code platforms use graphical user interfaces and configuration in order to create applications. Unfortunately, low-code environments are promoted with the idea that this means you no longer need skilled development teams. Such suggestions ignore the fact that writing code is just a small part of what needs to happen to create high-quality software—practices such as source control, testing and careful design of solutions are just as important. Although these platforms have their uses, we suggest approaching them with caution, especially when they come with extravagant claims for lower cost and higher productivity.