Brief summary
In our latest episode, we explore the ideas behind the Data Mesh, an alternative approach to managing, surfacing, and serving data organizationally. Our regular co-hosts Mike Mason and Neal Ford talk to Ken Collier, Head of Data Science and Engineering at Thoughtworks, and Zhamak Dehghani, one of our regular co-hosts and also a Principal Consultant with a focus on distributed systems architecture.
Podcast Transcript
Mike Mason:
Hello everyone and welcome to the Thoughtworks Podcast, my name is Mike Mason.
Neal Ford:
And I'm another of your regular hosts, Neal Ford, and we're joined today by two of our colleagues. I'll let them introduce themselves.
Ken Collier:
I'm Ken Collier, I'm the head of Data Science and Data Engineering at Thoughtworks.
Zhamak Dehghani:
Hi everyone, I'm Zhamak Dehghani, and I'm a Technical Principal at Thoughtworks from San Francisco.
Mike Mason:
And of course you might recognize Zhamak as one of the hosts of the podcast, but in fact today we have her as a guest, so we're going to be picking her brains.
Neal Ford:
That's right, we're into one of her areas of domain expertise, and so we're going to be talking today about data and data architecture, and in particular, the ideas around what's the next generation architecture beyond the Data Lake.
Neal Ford:
So many of you may be familiar with the concept of Data Lake that Martin Fowler writes about in his website, but we've been doing some thinking about what goes beyond that, and that's what Ken and Zhamak are here to talk about.
Zhamak Dehghani:
Thank you for having us.
Neal Ford:
So, what's the problem we're trying to address here?
Ken Collier:
So, I'll jump in a little bit here, because I am an old guy who goes way back in data.
Ken Collier:
Data management and data architectures have largely had a centralized focus, from enterprise data warehousing to data marts and Kimball-style architectures.
Ken Collier:
We've continued to focus on centralizing data. In 2009 James Dixon introduced the concept of Data Lake, which captured everyone's attention and got everybody's imagination working.
Ken Collier:
Largely, Data Lake architectures have followed the paradigm of collating and harmonizing data in one central place, or a few central places. So Zhamak has introduced some new ideas that I think are very exciting.
Zhamak Dehghani:
In fact, I don't go back so far in data and I don't have that historical background, but what I've noticed working with our clients over the course of the last two years is that there are many failure modes in building big data architectures or big data platforms.
Zhamak Dehghani:
We have customers and clients that are stuck in building and designing a Data Lake that never realizes any value. We have customers that have invested immensely in data warehouses and proprietary hardware and software, and they don't get the value they want. So there are problems with bootstrapping, problems with scaling, and problems with actually getting value from investments in big data.
Zhamak Dehghani:
So that led to my curiosity to look under the hood to see what's going on, and to bring some observations and thinking from operational systems, and how operational systems and architectures at large have evolved over the last 10 years, into the world of data.
Ken Collier:
And a lot of those failure modes that Zhamak talks about are pretty well understood in the data community: trying to consolidate data from a broad set of disparate source systems is complicated, trying to do it for all imaginable use cases is nearly impossible, and trying to manage all of your transformation logic in batch jobs or even streaming jobs is overwhelming. For data teams that are only about 13 to 20 people, there are just a number of friction points and bottlenecks.
Mike Mason:
So help me out here, because I thought part of the premise of the Data Lake was that because you could store the raw data, you could do all that sorting-everything-out later in the process. Because you were capturing everything, you could then figure out, okay, we do need to do this harmonization, or we need to build a lakeshore mart for a particular purpose.
Mike Mason:
That seemed promising to me because you could do just enough of that work. Are you saying even that has not really worked out as we'd hoped?
Ken Collier:
So it's gotten better, and in fact over the last several years I've been talking a lot about keeping data in its raw, native state until the last responsible moment, and applying business logic and transformation logic as close to the business as possible. That's a fundamental shift from the way data warehousing has been done in the past.
Ken Collier:
It's been an improvement; I think it certainly has been helpful in reducing friction. But we still have a number of conversations with IT leaders who have a fairly vague sense that there are a lot of use cases that need to be supported, and many of those use cases come from very disparate end users, whether it's marketing or finance or supply chain. One centralized data management platform to do all of those things kind of doesn't make sense.
Ken Collier:
So I think Zhamak has a good point of view on this.
Zhamak Dehghani:
I think, Mike, when you just mentioned the Data Lake, you said the premise was that we just store data from everywhere. That statement, "we're just going to store data from everywhere," is part of the problem.
Zhamak Dehghani:
So one side of the problem, as Ken mentioned, is responding to a very diverse set of use cases and supporting accessibility for them. But the other point of friction, and the problem for scale, is getting data, or making data ubiquitously available, from all these diverse domains through a centralized team that doesn't know those domains, doesn't intimately understand what the data means, what the business represents, or how the business is represented through those data sets or data events, and yet tries to capture that, keep it up to date, and keep up with changes. I think that's flawed. We should accept the reality that data is ubiquitous, and the people who are responsible for operational domains, at the point of generation of that data, should think about that data as an asset for the rest of the organization to consume.
Zhamak Dehghani:
So the distribution of ownership is a fundamental shift to think about, whether it's ownership of the native, raw data at the point of origin, or ownership of the data that is aggregated and modeled at the point of consumption.
Zhamak Dehghani:
So try to distribute that across, and make it the fabric of the organization, as opposed to this Data Lake thing on the side.
Ken Collier:
And there's an interesting pattern that I think can carry us forward. Going back to data warehousing and the Data Lake, there is the need to ingest data from sources, the need to transform or process that data, and the need to serve that data up.
Ken Collier:
So that's a three-tiered architecture that is not new; it's the sort of thinking we've been doing for a long time. The challenge, or I think the new thinking, is that a more federated approach enables that sequence of steps to be broken down at a domain level. Those data domains can take care of their own ingestion of whatever upstream data are needed, take care of their own processing specific to that domain of data, and serve up the data for the consumers that are going to be using it.
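To make Ken's point concrete, here is a minimal sketch, not from the episode, of a domain data product that keeps its ingest, transform, and serve steps as internal details while exposing only a consumable data set. All names here (ClaimsDataProduct, Claim, and the sample data) are hypothetical:

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Claim:
    claim_id: str
    member_id: str
    amount: float


class ClaimsDataProduct:
    """Owned by the claims domain team; the pipeline is an internal detail."""

    def ingest(self) -> Iterable[dict]:
        # Pull from the domain's own upstream operational system (stubbed here).
        yield {"claim_id": "c-1", "member_id": "m-1", "amount": 120.0}

    def transform(self, raw: Iterable[dict]) -> Iterable[Claim]:
        # Domain-specific cleansing and modeling stays inside the domain.
        for row in raw:
            yield Claim(row["claim_id"], row["member_id"], float(row["amount"]))

    def serve(self) -> list[Claim]:
        # The only part other teams see: a well-defined, consumable data set.
        return list(self.transform(self.ingest()))


print(ClaimsDataProduct().serve())
```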
Ken Collier:
Now when you package up lots of those small domain data products, or domain data nodes, in a mesh, you can start to isolate and encapsulate the work, the governance, and the trustworthiness of the data, so that you don't have it all trying to be done in one large Lake or one collection of data.
Neal Ford:
Well, and that's exactly what Zhamak was alluding to earlier, when she was talking about taking modern thinking in architecture, and applying it to the world of data, and this whole domain versus technical partitioning stuff, right?
Zhamak Dehghani:
Yeah, absolutely. I think we've seen this in operational systems: when the technology is new, especially early in the maturity of a trend, we put technology, techniques, and tooling at the top of our thinking, and they become the first-class concerns around which we decompose.
Zhamak Dehghani:
So for example, we created layered enterprise architectures, with the customer-facing applications as one layer, business process as another, and centralized databases with DBAs as another layer. Maybe because database technologies and business process modeling technologies were new, creating architecture boundaries around the technology was the main concern.
Zhamak Dehghani:
And with microservices we realized that is actually not an efficient or effective way of decomposing architecture, because change actually requires going across all those layers; the axis of change is orthogonal to the axis of architectural decomposition. So we turned it around and said, "let's localize functionality to where the change happens," and we realized that this concept of domains, with Eric Evans' thinking around domain-driven design, was a nice boundary for localizing and creating the components of a distributed architecture. We still have layered architecture as the internal implementation of those services in the operational world.
Zhamak Dehghani:
So I think the same thing is happening in the data world. As Ken was mentioning, any time you talk to a data engineer about their architecture, they talk about data pipelines. It's a 90-degree rotation of a layered architecture into a horizontal pipeline of ingestion, transformation, and serving. I think it's because the technology is still new, so we're focusing on the challenges around technology, optimizing our ingestion services and optimizing our serving services, as opposed to considering that a second-class concern, breaking the data up around domains, as Ken was mentioning, and keeping data pipelines as a second-class decomposition, an internal implementation of the data domains.
Neal Ford:
At the architectural level, we refer to this as the top-level partitioning, whether it's a technical top-level partitioning, like a layered architecture, or domain partitioning at the top level. So what you're talking about is shifting the top-level partitioning away from the mechanics of data and toward domains as the first-class citizen, and embedding the mechanics within each of those domains.
Zhamak Dehghani:
Exactly, and I think the beauty of that is the kind of ecosystem effect you get, because then you can compose new solutions, new data models, and new data aggregates by composing these domain data products.
Mike Mason:
So I want to keep exploring this, but I'm actually a little bit lost, and I think the listeners might be as well.
Mike Mason:
[inaudible 00:11:50] We talked about paradigm shifts, software engineering approaches, and problems with centralization, and you touched a little bit on data producers thinking about data as a product. From the top, can you sketch for me what this thing looks like?
Mike Mason:
I'm enjoying the discussion, but I'm feeling it's a tiny bit theoretical and I've kind of lost track of the moving parts.
Ken Collier:
An example would be good right about now, as Ward Cunningham would say.
Ken Collier:
So I think about it this way: we're working right now with a healthcare provider, as an example. They have insurance claims and claims processing, they have the health of their members and the medical care that their members receive, and they have financial concerns, the sort of back-office issues that all companies deal with.
Ken Collier:
So what this data architecture might look like, from a decomposition point of view, is an identification of the key domains. We may have a member domain that is de-identified, so that members are anonymous and data scientists can do machine learning from it; they may be the consumers of the data from that data domain. There may be longitudinal data about a member's care over time, with a prediction of what's to come next, or maybe the outcome is a recommended action for the member or the patient, to not eat so much salt, or something like that.
Ken Collier:
So if you think about the end use cases, either data scientist consumers for machine learning or business consumers for other purposes, that begins to drive those data domains. Zhamak and I are both working with the same company, so we're familiar with their scenario. It could be a call center for another company, or it could be streaming music; I think that's the example you talk about, so I'll let you add to it.
Zhamak Dehghani:
Sure. I think maybe this is the point where we introduce the concept and give it a name?
Ken Collier:
Sure, give it a name.
Zhamak Dehghani:
So we talk about this architecture as the Data Mesh, and there are certain characteristics of this architecture; I think we've touched on a few already.
Zhamak Dehghani:
What is the Data Mesh? The Data Mesh is an alternative approach to managing, surfacing, and serving data organizationally, addressing a diverse set of needs, such as analytical needs, business needs, or machine-learning-based needs.
Zhamak Dehghani:
The first characteristic of this architecture is what we just talked about: capturing and exposing your data around domains, and having those data sets consumed by whoever wants to consume them downstream, so there is not a direct pipeline. It means distributed ownership of those data sets, as opposed to central ownership, and bringing the ownership of those data sets as close as possible to the point of origin or the point of consumption.
Zhamak Dehghani:
So as Ken was mentioning in the healthcare example, the people who are actually responsible for building the operational domains that deal with claims and claims systems also become responsible for providing claims information as easily consumable, trustworthy data, whether as events or, you know, historical snapshots, to the rest of the organization. Similarly for the people who deal with members.
Zhamak Dehghani:
So those are the data sets around the point of origin, and there will also be a set of data products or data domains that might be newly aggregated, say a patient's history, for various fit-for-purpose consumptions.
Zhamak Dehghani:
So that's one of the concepts: data around domains, with distributed ownership. The second attribute of this architecture is that if we need to build data pipelines to provide those domain data sets, we do so, but those pipelines are just internal implementations, specific to a particular data domain.
Zhamak Dehghani:
And the third characteristic, which we haven't touched on yet, is that to support this distributed ownership of data across your organization, and allow this rapid creation of different data domains, we need to provide some form of self-serve infrastructure designed to support building these data sets. That means I need to easily set up polyglot storage, depending on the type of data that I have, and I need ways of providing data securely to the rest of the organization, because now I have exposed my data to whoever has access to it.
Zhamak Dehghani:
So there's a set of infrastructure, as a platform, that needs to be put in place to support these distributed data domains.
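As a rough sketch of what such a self-serve platform might offer, here is a hypothetical provisioning call; the SelfServeDataPlatform API and endpoint shown are invented for illustration, not a real product:

```python
from dataclasses import dataclass


@dataclass
class ProvisionedStore:
    domain: str
    storage_kind: str
    endpoint: str


class SelfServeDataPlatform:
    def provision_storage(self, domain: str, storage_kind: str) -> ProvisionedStore:
        # A real platform would create the buckets, topics, or tables here and
        # wire up encryption and access policies automatically, so domain teams
        # never build infrastructure themselves.
        endpoint = f"https://data-platform.internal/{domain}/{storage_kind}"
        return ProvisionedStore(domain, storage_kind, endpoint)


platform = SelfServeDataPlatform()
store = platform.provision_storage(domain="claims", storage_kind="event-stream")
print(store.endpoint)
```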
Zhamak Dehghani:
We emphasize, and you might have heard it in our conversation, these attributes around products: think about data as a product. Because now you're providing this as an asset to the rest of the organization, there are certain attributes that come with product thinking. This data should be easily discoverable, it should have its own SLOs, or service-level objectives, describing the quality associated with the data, and it should have good documentation, to provide a delightful experience to the data scientists who want to find this data and use it. And that's the concept of the Data Mesh.
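One way to picture this product thinking is a small descriptor that travels with each data product; the fields below (owner team, documentation, freshness and completeness SLOs) are an illustrative assumption, not a standard:

```python
from dataclasses import dataclass, field


@dataclass
class DataProductDescriptor:
    name: str
    owner_team: str
    description: str             # documentation for would-be consumers
    freshness_slo_minutes: int   # how stale the data is allowed to get
    completeness_slo_pct: float  # the quality bar the owners commit to
    tags: list[str] = field(default_factory=list)  # aids discoverability


# Hypothetical example, loosely based on the healthcare scenario discussed.
claims_product = DataProductDescriptor(
    name="claims.monthly_snapshots",
    owner_team="claims-domain",
    description="De-identified monthly claim snapshots for analytics.",
    freshness_slo_minutes=60,
    completeness_slo_pct=99.5,
    tags=["claims", "de-identified"],
)
print(claims_product)
```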
Ken Collier:
And I'll add a little bit to that, because as a long-time data guy, when Zhamak first introduced these ideas to me, one of the first things I said was, "So does that mean all this Data Lake stuff we've been talking about for the last 10 years is nonsense?" And it's not; there may still be Data Lakes designed around domain collections of data. But instead of a centralized one, I think you used the term "one Data Lake to rule them all", instead of focusing there, we might have multiple data lakes that are consolidation points along the way.
Ken Collier:
The other thing that you didn't mention is the question of "How do I master my data, so that if a customer record is coming from one source, or from, say, five sources, which one do I trust?"
Ken Collier:
So that interoperability issue requires some global governance umbrella that sits over the top of this Mesh, and you may want to elaborate on that.
Zhamak Dehghani:
Yeah, I think with any distributed architecture, if you don't have standardization at the seams, which is how we communicate information, and you don't have that global governance, the system just falls apart.
Zhamak Dehghani:
The analogy I give is that the API revolution happened because we had standardization around HTTP and REST to allow interoperability. So if I now need to join data from two different sources, what sort of standardization or interoperability do I need to build in so that I can actually join data from different sources? That comes under the umbrella of that global or central governance.
Neal Ford:
Well, being a long-time architecture guy, I can see the parallels where the Mesh idea comes from, because service meshes are very popular in the microservices world. They're a way to consolidate and couple the operational concerns in the architecture while leaving all the domains decoupled, and you're using this for the same purpose here: as a way of tying your operational concerns together, like queryability, while leaving the domains highly decoupled from one another. So the name matches very nicely.
Zhamak Dehghani:
Absolutely, I didn't use a whole lot of imagination to come up with anything.
Neal Ford:
But that also provides you a platform for doing this kind of automated governance at the Data Mesh level, because if there are certain things that the services need to expose, you can build that into the platform and make sure that all of them have a consistent interface, so that you can get to the data you need. That allows you to pave over differences between graph databases, relational databases, or name-value pairs, those kinds of differences.
Ken Collier:
So capabilities like encryption or de-identification, or other kinds of common transformations or calculations, could live at that infrastructure level, and be consumed by the product teams that are creating the data domains.
Mike Mason:
That's something that's quite interesting to me. One of the use cases that you talked about was de-identifying the data, so that the de-identified data product could be used by a different team to generate useful machine-learning-based insights or whatever, but in a way where they don't need to be cleared for access to that PII data, or whatever else it is with a healthcare provider.
Mike Mason:
That actually seems quite interesting and powerful, because one of the problems that we run into, when we talk about democratizing access to data and all this stuff, is that the first thing you get is somebody saying, "well, that's highly personal patient data and we need to secure it really well," and then you run into roadblocks on being able to do anything interesting. This seems to be a really interesting way of producing safe data sets that people can be authorized to use.
Ken Collier:
I think so. One of the things that I've thought about in this, especially in the healthcare sector, is that you may now focus on role-based access control at the data domain level, as opposed to worrying about cell-level or row-level authorization. So you may say, well, we have this data domain that is de-identified, and it's trustworthy, and it's verified that we're not going to identify patients, therefore a broader universe of users, analysts, and data scientists can subscribe to it. Meanwhile, here's another data domain that has patients who are still identifiable, and a smaller subset of people are allowed to subscribe to or consume data from there.
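A minimal sketch of what domain-level role-based access control could look like, with invented role and product names; access is granted per data product, not per row or cell:

```python
# Hypothetical mapping from data products to the roles allowed to subscribe.
PRODUCT_ROLES: dict[str, set[str]] = {
    "members.de_identified": {"analyst", "data-scientist"},  # broad audience
    "members.identified": {"care-coordinator"},              # narrow audience
}


def can_subscribe(user_roles: set[str], product: str) -> bool:
    # Governance happens at the domain seam: one check per product.
    return bool(user_roles & PRODUCT_ROLES.get(product, set()))


assert can_subscribe({"data-scientist"}, "members.de_identified")
assert not can_subscribe({"data-scientist"}, "members.identified")
```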
Mike Mason:
And the thing that's interesting is that even with the de-identified version, people can create useful insights and say, "Look, I've clustered the data, and I want to give this advice to this cluster of patients." You can create that insight and give it to the folks who still actually have all the patient names and addresses, in order to actually get the advice out there, but without causing a privacy issue throughout.
Zhamak Dehghani:
Yeah, absolutely. And I think one of the key aspects of interoperability is that you could still pass around those global identifiers, or federated identifiers, without passing the personal details, and using those global identifiers you can join the personal information back to the insights that you found.
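As an illustration of that global-identifier idea, with invented data and names: the analytics side works only with stable pseudonymous IDs, and a privileged step joins the insights back to personal details at the very end:

```python
# Insights produced from de-identified data only; no PII ever leaves that side.
insights = [
    {"global_id": "g-42", "advice": "reduce salt intake"},
]

# Held only by the team authorized to see personal information.
pii_lookup = {
    "g-42": {"name": "Pat Smith", "address": "1 Main St"},
}


def deliver_advice(insights: list[dict], pii_lookup: dict) -> None:
    for row in insights:
        # Join on the shared global identifier at the last responsible moment.
        person = pii_lookup[row["global_id"]]
        print(f"Send '{row['advice']}' to {person['name']}")


deliver_advice(insights, pii_lookup)
```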
Ken Collier:
One of the other benefits to this way of thinking, especially bringing product thinking into the discussion, is that each data domain is supported by a cross-functional product team, and that product team includes business domain experts as well as technologists and whoever else needs to be involved. Through product ownership, those data domains don't need to live on if they're no longer useful, whereas in our current paradigm, data accumulates immutably, data models grow immutably, data just never really gets cleaned up, and it incurs a lot of technical debt.
Ken Collier:
So this is the notion of encapsulating data as domain data products, with investment being made in the product as long as that product is serving a useful life, and then killing it when it's not. If you don't have users anymore who need that data domain, the data is still available upstream, and you don't need the domain products to live on.
Neal Ford:
That's a huge advantage of not centralizing all that data, because when it gets centralized, it gets coupled too, and you can't get rid of it, and so you're left with it forever.
Neal Ford:
So let me engage in a little bit of metaphorical whimsy here. I think what you guys are suggesting is that rather than Data Lakes, what we've actually had are Data Oceans, and what you're proposing are Data Ponds, with canals between them.
Ken Collier:
Yeah, that's a fair analogy. One of the architects at our healthcare company refers to this as Michigan, the Land of Lakes.
Zhamak Dehghani:
I've been trying to not use the water metaphor.
Ken Collier:
Yeah. Just trying to stay away from water.
Zhamak Dehghani:
I think, to your point, Ken, these cross-functional teams, with product ownership being a recognized role, are so important. But I think there is another side effect.
Zhamak Dehghani:
As an industry, we are struggling to find data engineers, both ways: organizations are struggling to hire, and software engineers are struggling to develop those skills, because of the siloing. If you're already a data engineer somehow, then you know all these tools; you go into these siloed data engineering teams and work with your fellow data engineers on these data platforms, and that has caused data engineers to miss out on a lot of the advancements in software engineering practices that have happened in the operational world. Conversely, if you're a software engineer, you never talk to or sit next to a data engineer colleague to learn from them.
Zhamak Dehghani:
I read the stats: in 2016, I think LinkedIn had 60,000 people, if I remember correctly, who claimed to be data engineers, and that year, in the Bay Area alone, there were 65,000 open jobs for data engineers, and I'm sure it has grown since then. I think by bringing software engineers and data engineers together as one team, we allow that cross-pollination, so that software engineers can add working with data engineering tools to their tool belts, and that becomes just part of the generalist toolkit.
Ken Collier:
I think it is important to point out that the tooling doesn't change, and the underlying technologies don't need to change; we don't need any special new things. We just need to rethink how we're implementing and managing.
Zhamak Dehghani:
Yeah, I think it's just an inverted model, and I hope that with this inverted thinking we develop a whole new language to go with it, because that's important.
Neal Ford:
Well, it sounds like taking some of the best practices and good perspectives we picked up from the software architecture world, particularly microservices, seeing the obvious parallels, and applying those same principles to the data world.
Zhamak Dehghani:
I think when I visualize the ideal state, when we get there, data is really part of the fabric of the organization, the same way that APIs today are part of the fabric of the organization, and not siloed in some Lake or platform in the corner.
Ken Collier:
And one question that comes up, and I think it's a legitimate one, is: isn't there still the need to get a holistic view of the enterprise, and to do interesting analysis, a more 360-degree view, from all these data sources? In that context, it may make sense to have some very clearly stated use cases or analytical goals, and if that's necessary, then that becomes a data domain. It's just a different type of data domain, and you create it for the kind of holistic view that you want, rather than it being the central source.
Zhamak Dehghani:
And I think for that to happen, we do need some centralized views, or global views, of this data. So even though we have this kind of decentralized world, with different teams owning the data, there should be governance in place that says: if your data wants to become a data product used by the rest of the organization, it needs to register itself with this catalog so it can be found, and it needs to have this sort of documentation. That data catalog, or data discovery tool, is a globally available, central tool. Right now there are a lot of technologies around data cataloging, but they came from a different place: a need for discovering data that is siloed and hidden, not data that is intentionally designed to be shared. So I hope that the next generation of data catalogs will actually support data teams that are intentionally trying to make their data available and discoverable.
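As a small sketch of a catalog built for intentional sharing, where a global governance rule rejects undocumented products, here is a hypothetical DataCatalog; the API is invented for illustration:

```python
class DataCatalog:
    """A central, globally available registry for domain data products."""

    def __init__(self):
        self._products: dict[str, dict] = {}

    def register(self, name: str, owner: str, documentation: str) -> None:
        # Governance rule: a product is only discoverable if it ships with
        # documentation and a named owner.
        if not documentation.strip():
            raise ValueError("Governance: products must ship documentation.")
        self._products[name] = {"owner": owner, "docs": documentation}

    def discover(self, keyword: str) -> list[str]:
        # Consumers search the catalog rather than hunting for hidden data.
        return [name for name in self._products if keyword in name]


catalog = DataCatalog()
catalog.register("claims.monthly_snapshots", "claims-domain",
                 "De-identified monthly claim snapshots.")
print(catalog.discover("claims"))
```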
Neal Ford:
Okay, so if you had to summarize this approach in one sentence, what do you think that that summary would be?
Zhamak Dehghani:
It's going to be a long sentence.
Neal Ford:
That's okay, as long as it's a sentence.
Zhamak Dehghani:
It's a Mesh of data, organized around domains, owned by cross-functional teams, governed by a centralized governance to allow interoperability, and served by a self-serve infrastructure.
Zhamak Dehghani:
I hope that makes sense.
Neal Ford:
Perfect.
Mike Mason:
If people wanted to find out more, where could they do that?
Zhamak Dehghani:
Actually, on Martin Fowler's website right now, there is an article on how to move beyond a monolithic Data Lake to a distributed Data Mesh. That's where they can find it, or they can reach us on Twitter.
Mike Mason:
And we'll link it in the show notes as well.
Neal Ford:
All right, thank you very much, very interesting and very informative.
Mike Mason:
Thank you both for joining us.
Zhamak Dehghani:
Thank you.
Ken Collier:
Thank you.
Rebecca Parsons:
Next time on the Thoughtworks podcast, I will be speaking with Satyam Argawala about compliance as code, and we'll be seeing how we're bringing yet another operations and organizational function into the *-as-code family, looking at compliance as code and some of the implications of automating governance, risk, and compliance. So please join us. Thank you.