
How to select technology for Data Mesh

There's one question that has been coming up a lot at the moment: what tech should be used for Data Mesh? People often ask whether Databricks is a good choice, or whether they should use AWS, Snowflake or open source options. However, just as there's no single right tech for Microservices, there's no single right tech for Data Mesh. So while this blog post won't provide you with a shopping list of technologies, it will offer some help in understanding what tech is out there — and how to go about evaluating it for your mesh implementation.

 

Data Mesh is a paradigm, not a solution architecture

 

Different organizations will have different Data Mesh implementations supported by different architectures. In short, Data Mesh is a style, not a single architecture. That means there's more than one way of architecting a Mesh on AWS, Azure or Google Cloud. A good basic picture for Mesh is 'like microservices, but for analytical data'.

 

This means there is no simple list of technologies that will let you start doing Data Mesh. As we'll see, there are some useful tools. But rather than diving straight into the tools, it's best to consider the characteristics or capabilities of a Data Mesh — that will help us understand what kinds of tools are needed and for what purpose.

 

The easiest way to do this is to begin with the core principles of Data Mesh and to consider what these principles mean from a technology perspective:

 

Domain ownership 

 

This requires data products to be divided according to clear value streams rather than technical boundaries. It's also essential that each data product team is able to look after its own pipelines and policies, as well as its data storage and output ports (such as APIs).
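
To make that ownership concrete, here is a minimal sketch (in Python) of the kind of descriptor a domain team might publish for its data product. The field names and example values are purely illustrative rather than part of any standard.

```python
from dataclasses import dataclass, field


@dataclass
class DataProduct:
    """Illustrative descriptor for a domain-owned data product.

    The fields are hypothetical examples of what a team might declare,
    not a prescribed schema.
    """
    name: str                 # e.g. "customer-orders"
    domain: str               # the value stream the product belongs to
    owning_team: str          # the team accountable for pipelines and policies
    storage: str              # storage tech the team has chosen for itself
    output_ports: list[str] = field(default_factory=list)  # how consumers get the data


orders = DataProduct(
    name="customer-orders",
    domain="order-management",
    owning_team="orders-analytics",
    storage="postgresql",
    output_ports=["rest-api", "parquet-export"],
)
print(orders)
```

A descriptor along these lines is also what a catalog or a self-serve platform could consume later on.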

 

Data as a product

 

Consumers want data in the form that works for them, so teams need to be able to transform and distribute data in ways that delight those consumers. That's why the data as a product principle requires a polyglot ecosystem that can flex to the demands of individual data products. This need for flexibility can constrain your tech choices: while an opinionated off-the-shelf solution may look great, it could be restrictive.

 

Self-serve data platform

 

Teams shouldn’t have to constantly reinvent the wheel when it comes to infrastructure — that wastes their time and energy and keeps them from focusing on building great data products. A self-serve platform empowers and supports developers so that tasks like provisioning are taken care of.

 

Federated computational governance

 

Data Mesh products should have some level of interoperability — this means we need to ensure that distributed ownership is balanced with standardization. There are two key reasons for this: the first is to make data products more discoverable inside an organization, and the second is to guarantee and maintain certain quality, interoperability and security standards.
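
One way to make governance 'computational' is to automate checks against whatever standards the federated group agrees. The sketch below is illustrative only: the required fields and allowed classifications are examples, not a recommended policy.

```python
# Illustrative automated governance check. The required fields and the allowed
# classification values are placeholders for whatever a federated governance
# group actually standardizes on.
REQUIRED_FIELDS = {"name", "domain", "owner", "description",
                   "data_classification", "output_ports"}
ALLOWED_CLASSIFICATIONS = {"public", "internal", "restricted"}


def check_compliance(metadata: dict) -> list[str]:
    """Return a list of governance violations for one data product's metadata."""
    violations = [f"missing metadata field: {f}"
                  for f in sorted(REQUIRED_FIELDS - metadata.keys())]
    if metadata.get("data_classification") not in ALLOWED_CLASSIFICATIONS:
        violations.append("data_classification must be one of: "
                          + ", ".join(sorted(ALLOWED_CLASSIFICATIONS)))
    return violations


print(check_compliance({"name": "customer-orders",
                        "domain": "order-management",
                        "owner": "orders-analytics"}))
```

A check like this could run as part of the platform's CI so that non-compliant data products are flagged automatically.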

Mapping principles to features and technologies

 

Now we’ve outlined the core principles of Data Mesh we can begin to map these to a number of features and functionalities. (For the sake of simplicity, I’ve combined domain ownership and data as a product into ‘data products’.)

 

For each I’ve included some possible tools that can be used.

 

  • Platform

    • Provisioning shared capabilities (API, UI, connector code)

      • Infrastructure as code tools such as Terraform, Ansible or CloudFormation; an off-the-shelf data platform API; continuous integration tools such as Jenkins.

    • Streaming capabilities

      • Tools such as Kafka, MQ, Flink, Kinesis.

    • Developer portal (full integration)

      • Built in-house or using something like Spotify’s Backstage.

         

  • Governance

    • Catalog (documentation and metadata index; crawlers for discoverability)

      • Wiki tools such as Confluence (for a basic catalog); commercial catalogs such as Collibra; or open source options such as Amundsen and DataHub.

    • Data and API standards

      • Data governance tooling, wikis, Swagger/OpenAPI, the Open Data Protocol (OData).

    • Access policies

      • Wikis, metadata in the catalog, individual policy implementations (most likely inside the data products themselves), OPA or similar for custom HTTP APIs or integrated tools, Apache Ranger for Hadoop.

    • Monitoring and automated compliance

      • Data governance tooling; data observability tools and libraries such as Great Expectations; in-house dashboards.

    • Data lineage

      • Data lineage tools, or data governance tools with integrated lineage; lineage could also be handled in the ETL layer (e.g. Pachyderm) or the storage layer (e.g. Delta Lake).

    • Identity management (IDM) integration

      • Custom code or connectors in data storage tools.

         

  • Data products

    • Data stores

      • Databases: SQL, NoSQL, Graph, Search.

    • Input and output ports

      • Custom APIs, ETL tools, connectors (e.g. plug-ins to streaming input or to BI or reporting output)

    • Cross-product integrations

      • Data virtualization tools such as Starburst; cloud provider or off-the-shelf data platform tools.

There’s more detail about all of these types of tools available from my overview in GitHub.

 

What about data ownership?

 

Data ownership doesn’t translate to features — it is more a question of organizational practices and processes.  However, this doesn’t mean there aren’t some important things to consider in the context of a Data Mesh implementation:  methodologies such as Domain-Driven Design, Event Storming, Team Topologies, Use Cases and Discovery Workshops are all particularly useful in Data Mesh.

 

 

From lightweight solutions to a fully-fledged developer experience portal

 

It’s important to note that there is a spectrum of possible features you may need. At the most sophisticated and “mature” end, you might have a data product developer experience portal where everything is automatically provisioned. This would essentially offer developers a wizard from which new data products can be created, making it really easy to select things like data storage tech they want, and which technologies for input, output ports and connectors and so on. This is essentially what Zhamak Dehghani has called an ‘experience plane’ (You can see more of what an experience plane can look like in the webinar Lessons From the Trenches in Data Mesh.)

 

Building a developer experience portal is difficult, and you certainly shouldn't feel like it's essential. At the more lightweight end of the spectrum you could just use templates for bootstrapping data product infrastructure and a wiki for the data catalog; that could still be very effective. Zhamak Dehghani has suggested (at 52:00 of the above-referenced webinar) putting platform APIs in place early, even if they don't initially do everything you'd like them to do. You can begin at the lighter end and gradually add more to the platform as required.
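
As an illustration of that lightweight end, 'templates for bootstrapping' can be as simple as copying a skeleton repository and substituting a product name. The template path and the placeholder convention below are made up for the example, and the sketch assumes the template contains only text files.

```python
import shutil
from pathlib import Path


def bootstrap_data_product(name: str,
                           template_dir: str = "templates/data-product",
                           target_root: str = ".") -> Path:
    """Copy a skeleton data product repo and fill in the product name.

    The '{{product_name}}' placeholder is an illustrative convention,
    not a standard; assumes every file in the template is plain text.
    """
    target = Path(target_root) / name
    shutil.copytree(template_dir, target)
    for path in target.rglob("*"):
        if path.is_file():
            path.write_text(path.read_text().replace("{{product_name}}", name))
    return target


# bootstrap_data_product("customer-orders")
```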

 

 

Can I use off-the-shelf solutions?

 

There are off-the-shelf cloud and data platform solutions that can be used in Data Mesh implementations. Here I don't just mean particular tools but solutions that pitch themselves as end-to-end platforms. These have value, but it should be noted that, at the time of writing, they are typically generic. If you're looking for the level of customization offered by a developer experience portal, you'll have to develop it yourself.

 

If you’re interested in using an off-the-shelf solution, you can find my overview of the capabilities of some of the leading vendors on GitHub

 

There are a number of caveats that you need to bear in mind if you are considering an off-the-shelf approach:

 

  1. Off-the-shelf data platforms are organized around services and features, not data products, so there's no way to say that you want to build a data product with a certain storage technology, certain output ports and so on. That means they can't provide a true mesh experience plane.

     

  2. Off-the-shelf data platforms have no concept of custom APIs for data. There may be generic data APIs but if developers want to write code to expose a custom API for data or for a machine learning model, that may require going outside the data platform.

     

  3. Their ETL philosophy isn’t polyglot — they are typically opinionated (with integrations tailored to the platform’s native storage and metadata).

     

  4. They offer a generic experience without any option to curate what tools developers choose from.

     

  5. The account-based tenancy model can be very restrictive. If you try to do a lot of things in a single cloud account then you’ll likely hit limits (as discussed in ‘Lessons from the trenches in data mesh’).  

 

Despite those caveats, off-the-shelf platforms can certainly be of value to mesh implementations. It’s just important to remember that they are not Data Mesh platforms; none (at the time of writing) are a silver bullet for a successful Data Mesh implementation. 

 

So, how do I select technology for Data Mesh?

 

Hopefully you’re now clearer about what the technology options are and what elements you might want in your mesh implementation. But you might still have some high level tool selection questions that you’re not sure about. It’s you that knows your use cases and your landscape so you’ll need to decide how to select tools based on your aims and your company’s technology strategy. What I can offer is some general advice on key themes:

 

Should you use an off-the-shelf data platform or assemble your own? To decide this you’ll want to look at factors like cost, what your specific use cases are and how tailored you want your platform experience to be.

 

Do you need a data platform UI, or could you make do with an API, or even just some guidance and templates? It's good to start building the platform early, but your initial platform for your first couple of use cases might look different from the platform you'll eventually end up with. There are trade-offs between getting a platform going early to bootstrap the first data products versus investing a lot upfront in the platform.

 

Do you need governance that extends to policies and monitoring? This likely depends on the sensitivity of your data, how concerned you are about its quality and interoperability, and what the impact of quality or data security issues would be for your use cases.

 

The most important advice is to identify key use cases early on. If you try to evaluate tools without clarity about your use cases, you won't know what you're trying to solve for. You don't want to get sucked into searching for the 'best' tools or the 'ideal' mesh implementation; these are rabbit holes. You want to be clear about what you're trying to achieve so you can find the tools that work best for you.

 

 

 

 

Disclaimer: The statements and opinions expressed in this article are those of the author(s) and do not necessarily reflect the positions of Thoughtworks.
