Azure DevOps comprises a set of managed services, including hosted Git repos, CI/CD pipelines, automated testing tooling, backlog management tooling and an artifact repository. Our teams have gained more experience with the platform, with good results, which suggests Azure DevOps is maturing. We particularly like its flexibility: it allows you to use only the services you want, even alongside offerings from other providers. For instance, you could use an external Git repository while still using the Azure DevOps pipeline services. Our teams are especially excited about Azure DevOps Pipelines. Nevertheless, all the services offer a good developer experience that helps our teams deliver value.
Debezium is a change data capture (CDC) platform that can stream database changes onto Kafka topics. CDC is a popular technique with multiple use cases, including replicating data to other databases, feeding analytics systems, extracting microservices from monoliths and invalidating caches. Debezium reacts to changes in the database's log files and has CDC connectors for multiple databases, including Postgres, MySQL, Oracle and MongoDB. We're using Debezium in many projects, and it has worked very well for us.
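As a sketch of how this looks in practice, a Debezium connector is typically registered with Kafka Connect via a small JSON configuration; the connector name, host, credentials and table list below are illustrative placeholders, not values from any real system:

```json
{
  "name": "orders-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "postgres",
    "database.port": "5432",
    "database.user": "debezium",
    "database.password": "secret",
    "database.dbname": "shop",
    "database.server.name": "shop",
    "table.include.list": "public.orders"
  }
}
```

Once registered, each committed change to the listed tables appears as an event on a Kafka topic that downstream consumers, such as caches, search indexes or analytics systems, can react to.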
Honeycomb is an observability service that ingests rich data from production systems and makes it manageable through dynamic sampling. Developers can capture large numbers of rich events and later decide how to slice and correlate them. This interactive approach is useful when working with today's large distributed systems, because we've passed the point where we can reasonably anticipate which questions we might want to ask of production systems. The Honeycomb team is actively developing support for a number of languages and frameworks, with plugins now available for Go, Node, Java and Rails among others, and new features are being added at a rapid pace. The pricing model has also been simplified to make it more attractive. Our teams love it.
Since we introduced JupyterLab in the Assess ring in our last issue, it has become the preferred web-based user interface for Project Jupyter among many of our data practitioners. JupyterLab use is rapidly overtaking Jupyter Notebooks, which it will eventually replace. If you're still using Jupyter Notebooks, give JupyterLab a try. Its interactive environment is an evolution of the Jupyter Notebook: it extends the original capabilities with drag-and-drop cells and tab autocompletion, among other new features.
Data scientists spend a large part of their time on data discovery, which means tooling to help in this space is bound to generate some excitement. Although the Apache Atlas project has become the de facto tool for metadata management, data discovery is still not easily accomplished. Enter Amundsen, which can be deployed in concert with Apache Atlas to provide a much nicer search interface for data discovery.
Organizations are looking to support and streamline their development environments through developer portals or platforms. As the number of tools and technologies increases, some form of standardization is becoming increasingly important for consistency, so that developers can focus on innovation and product development instead of getting bogged down reinventing the wheel. A centralized developer portal can offer easy discoverability of services and best practices. Backstage is an open-source platform by Spotify for creating developer portals. It is built around software templates, unified infrastructure tooling and consistent, centralized technical documentation. Its plugin architecture allows it to be extended and adapted to an organization's infrastructure ecosystem.
Dremio is a cloud data lake engine that powers interactive queries against cloud data lake storage. With Dremio, you don't have to manage data pipelines to extract and transform data into a separate data warehouse for predictable performance. Dremio creates virtual data sets from data ingested into a data lake and provides a uniform view to consumers. Presto popularized the technique of separating the storage and compute layers, and Dremio takes it further by improving performance and optimizing the cost of operation.
DuckDB is an embedded, columnar database for data science and analytical workloads. Analysts spend significant time cleaning and visualizing data locally before scaling it to servers. Although databases have been around for decades, most of them are designed for client-server use cases and are therefore not suitable for local interactive queries. To work around this limitation, analysts usually end up using in-memory data-processing tools such as Pandas or data.table. Although these tools are effective, they limit the scope of analysis to the volume of data that can fit in memory. We feel DuckDB neatly fills this gap in tooling with an embedded columnar engine that is optimized for analytics on local, larger-than-memory data sets.
K3s is a lightweight Kubernetes distribution built for IoT and edge computing. It's packaged as a single binary with minimal to no OS dependencies, making it really easy to operate and use. Instead of etcd, it uses sqlite3 as the default storage backend, and it reduces its memory footprint by running all relevant components in a single process. It also achieves a smaller binary by stripping out third-party storage drivers and cloud providers that are not relevant to the K3s use cases. For environments with constrained resources, K3s is a good choice and well worth considering.
Materialize is a streaming database that lets you do incremental computation without complicated data pipelines. Just describe your computations as standard SQL views and connect Materialize to the data stream. The underlying differential dataflow engine performs incremental computation to provide consistent and correct output with minimal latency. Unlike traditional databases, Materialize imposes no restrictions on how these materialized views are defined, and the computations are executed in real time.
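To illustrate, the workflow is roughly this; the source, topic and view names are hypothetical, and the exact source syntax follows Materialize's early releases and may differ in later versions:

```sql
-- Connect a Kafka topic of order events as a streaming source
-- (illustrative names and broker addresses).
CREATE SOURCE orders
  FROM KAFKA BROKER 'kafka:9092' TOPIC 'orders'
  FORMAT AVRO USING CONFLUENT SCHEMA REGISTRY 'http://schema-registry:8081';

-- A plain SQL view, joins and aggregations included;
-- Materialize keeps its result incrementally up to date.
CREATE MATERIALIZED VIEW revenue_per_region AS
  SELECT region, SUM(amount) AS revenue
  FROM orders
  GROUP BY region;

-- Reading the view returns the current, consistent result with low latency.
SELECT * FROM revenue_per_region;
```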
Tekton is a young Kubernetes-native platform for managing continuous integration and delivery (CI/CD) pipelines. Not only does it install and run on Kubernetes, it also defines its CI/CD pipelines as Kubernetes custom resources. This means the pipelines can be controlled by native Kubernetes clients (CLI or APIs) and can take advantage of underlying resource management features such as rollbacks. The pipeline declaration format is flexible, allowing workflows with conditions, parallel execution paths and final cleanup tasks, among other features. As a result, Tekton can support complex and hybrid deployment workflows with rollbacks, canary releases and more. Tekton is open source and also offered as a managed service by GCP. Although the documentation has room for improvement and the community is still growing, we've been using Tekton successfully for production workloads on AWS.
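Because pipelines are just Kubernetes custom resources, build steps are declared in YAML and applied like any other manifest. A minimal, illustrative Task might look like this (the task name and image are our own assumptions):

```yaml
apiVersion: tekton.dev/v1beta1
kind: Task
metadata:
  name: run-tests
spec:
  steps:
    - name: test
      image: golang:1.15   # any container image can serve as a step
      script: |
        go test ./...
```

Tasks like this are composed into Pipelines and executed via PipelineRuns, all of which can be managed with kubectl or Tekton's tkn CLI.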
Continuing challenges with how individuals and organizations establish trust digitally, over the internet, are giving rise to a new approach to proving identity, to sharing and verifying the attributes needed to establish trust and to transacting securely. Our Radar features some of the foundational technologies, such as decentralized identity and verifiable credentials, that enable this new era of digital trust.
However, such a global-scale change won't be possible without the standardization of a technical and governance stack that enables interoperability. The new Trust over IP Foundation, part of the Linux Foundation, has set out to do just that. Taking its inspiration from how the standardization of TCP/IP as the narrow waist of the internet enabled interoperability across billions of devices, the group is defining a four-layer technical and governance Trust over IP stack. The stack ranges from public utilities such as decentralized identifiers, through decentralized identity communication protocols that allow agents such as digital wallets to talk to one another, and data exchange protocols such as flows to issue and verify verifiable credentials, up to application ecosystems in areas such as education, finance and healthcare. If you're revisiting your identity systems and how you establish trust with your ecosystem, we suggest looking into the ToIP stack and its supporting tooling, Hyperledger Aries.
Technologies, especially wildly popular ones, have a tendency to be overused. What we're seeing at the moment is Node overload, a tendency to use Node.js indiscriminately or for the wrong reasons. Among these, two stand out in our opinion. First, we frequently hear that Node.js should be used so that all programming can be done in one programming language. Our view remains that polyglot programming is a better approach, and this still goes both ways. Second, we often hear teams cite performance as a reason to choose Node.js. Although there are myriad more or less sensible benchmarks, this perception is rooted in history. When Node.js became popular, it was the first major framework to embrace a nonblocking programming model, which made it very efficient for IO-heavy tasks. (We mentioned this in our write-up of Node.js in 2012.) Due to its single-threaded nature, though, Node.js was never a good choice for compute-heavy workloads, and now that capable nonblocking frameworks also exist on other platforms, some with elegant, modern APIs, performance is no longer a reason to choose Node.js.