Zhamak Dehghani on Data Mesh, Domain-Oriented Data, and Building Data Platforms

In this podcast, Daniel Bryant sat down with Zhamak Dehghani, principal consultant, member of technical advisory board, and portfolio director at ThoughtWorks. Topics discussed included: the motivations for becoming a data-driven organization; the challenges of adapting legacy data platforms and ETL jobs; and how to design and build the next generation of data platforms using ideas from domain-driven design and product thinking, and modern platform principles such as self-service workflows.

Key Takeaways

  •  Becoming a data-driven organization remains one of the top strategic goals of many organizations. Being able to rapidly run experiments and efficiently analyse the resulting data can provide a competitive advantage.
  • There are several “architecture failure modes” within existing enterprise data platforms. They are centralized and monolithic. The composition of data pipelines is often highly coupled, meaning that a change to the data format requires a cascade of changes throughout the pipeline. And finally, the ownership of data platforms is often siloed and hyper-specialized.
  • The next generation of enterprise data platform architecture requires a paradigm shift towards ubiquitous data with a distributed data mesh.
  • Instead of flowing the data from domains into a centrally owned data lake or platform, domains need to host and serve their domain datasets in an easily consumable way.
  • Domain data teams must apply product thinking to the datasets that they provide; considering their data assets as their products, and the rest of the organization's data scientists, ML and data engineers as their customers.
  • The key to building the data infrastructure as a platform is (a) to not include any domain-specific concepts or business logic, keeping it domain agnostic, and (b) to make sure the platform hides all the underlying complexity and provides the data infrastructure components in a self-service manner.

Show Notes

Can you introduce yourself?

  • 01:15 I work for Thoughtworks as a principal consultant, which means I wear several hats.
  • 01:20 I act as the technical portfolio director, which means I work across multiple clients, and play a role of the technology director on the ground.
  • 01:35 I sit on the tech advisory board for Thoughtworks, helping to produce the technology radar.
  • 01:40 It's a great way to analyse the projects on the ground by Thoughtworkers across the globe.

I recently read your article on Data Mesh - before we explore this, I wanted to understand: what are the benefits for organisations of becoming data driven?

  • 02:25 That's how the idea of data mesh came to be, by looking at the problem space and the challenges.
  • 02:40 I noticed that a lot of CxOs have a mandate to become data driven, and to treat data as an asset.
  • 02:55 It means using data to improve the operational performance of an organisation.
  • 03:05 For example, you might have a lot of manual interactions with your customers and you want to improve the operation of your company.
  • 03:10 If you are an e-commerce company, you might want to price the products more efficiently using data.
  • 03:25 The other possibility is to use data about customer experience and behaviour to serve customers better.
  • 03:45 To serve your customers better, you can personalise your recommendations based on their requirements.
  • 03:50 It's either around optimising the business - essentially, removing operational cost - or serving your customers better.
  • 04:00 At the end of the day, competing with your competitors comes down to using the data and the insights about the industry, customers and business.

I'm hearing a lot of things about using experimentation to improve things.

  • 04:20 I think experimentation is a big element of that.
  • 04:25 Rule-based decision making is easy; you make rules and logic, code it, and put it out there.
  • 04:35 When you're speaking about changing your business based on data, a lot of it is about experimenting, tweaking your business and seeing how the customers respond.
  • 04:40 You can harvest the data back into the system - the cycle of intelligence, which is about connecting the data from your customers back into intelligence.
  • 05:00 Turning that data into insights, and into augmented intelligent decision making (AI/ML) - you can see it's a cycle.

What solutions have we explored in this space before?

  • 05:30 There is a spectrum of tooling and technology out there.
  • 05:40 If you think about the cycle, the applications sit at the top, running on the compute platforms.
  • 05:50 As you work through the cycle, we look at the data - capture and management - and on top we can build machine learning models that augment those applications.
  • 06:10 If we focus on the data management side, we can see the progression of technology and its challenges.
  • 06:20 I see three generations; in the early days of big data, analytical uses of the data were confined to analytics, reports, and visualisation of metrics and sales trends.
  • 06:50 The technology that addressed that concern and that use case of the data was the data warehouse, where you gathered data through ETL jobs from many systems into one warehouse.
  • 07:15 The data is modelled into schemas against which you can run SQL queries.
  • 07:25 At the end of the day, a human would sit and look at the reports produced by that application.
  • 07:30 That has been OK for that use case, for the monthly reports.
  • 07:40 You throw a lot of data people at the warehouse to model the data and create thousands of reports.
  • 07:50 The technology was problematic at scale, and not suited to data-scientist-style access for ML and AI, where you want the facts as they happen, in a raw format.
  • 08:15 We migrated towards the data lake, where we assume that the data gets extracted from the point of origin and, instead of being modelled, gets dumped into storage.
  • 08:35 You can then unleash the data scientists to swim in this lake and discover the insights.
  • 08:50 Once we have figured out what we want to get out of the data lake, downstream data marts come into play.
  • 08:55 That's the dominant paradigm we live in today.
  • 09:00 Going one step further with the technology, we now talk about data lakes in the cloud.
  • 09:10 That comes with its own tech stack, which is different from the one on-prem.
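The warehouse-era extract/transform/load flow described above can be sketched in a few lines. This is a minimal illustration of the general pattern, not any specific tool from the podcast; all source, field, and table names are hypothetical.

```python
# A sketch of a classic ETL job: extract raw records from an
# operational system, transform them into the warehouse schema,
# and load them into a central store. Names are hypothetical.

def extract(source_rows):
    """Pull raw records from an operational source (stubbed as a list)."""
    return list(source_rows)

def transform(rows):
    """Conform raw records to the warehouse schema, dropping bad rows."""
    return [
        {"order_id": r["id"], "amount_usd": round(r["amount_cents"] / 100, 2)}
        for r in rows
        if r.get("amount_cents") is not None  # skip incomplete records
    ]

def load(warehouse, rows):
    """Append the cleaned rows to a central warehouse table."""
    warehouse.setdefault("fact_orders", []).extend(rows)
    return warehouse

# One run of the job over a stubbed operational source.
source = [{"id": 1, "amount_cents": 1999}, {"id": 2, "amount_cents": None}]
warehouse = load({}, transform(extract(source)))
```

In a real warehouse the load step writes to schema-modelled tables that analysts then query with SQL, as described above; the coupling to the upstream record shape sits in the transform step.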

You talked about architectural failures of existing solutions?

  • 09:40 I was shocked when I arrived in the data world from the distributed computing world, where decentralisation of capabilities was the norm.
  • 10:00 Coming from microservices to the world of big data, I was shocked - even the most modern articulations of the data architecture involved a centralised lake.
  • 10:15 You talk to CIOs and they dream about putting their data into the lake, and their incentives are about getting the data into the lake in the cloud.
  • 10:30 It was shocking to me because I came from a different world.
  • 10:35 The symptoms and failure modes of this centralisation are that the centralised architecture is hard to scale, hard to recover from failure, and hard to get value from.
  • 10:50 Go to any of the cloud providers' white papers on their data lake architectures, and you will see a bunch of technologies wired together, moving data into one place.
  • 11:10 Conway's law replicates the structure of an organisation into code; one lake (or one warehouse) to rule them all, with downstream consumers hanging off it.
  • 11:30 All of the data lake architectures I see today have a monolithic and centralised view of the world, and that view is dictated by the technology.
  • 11:45 The most fundamental concept is the domain; it's a function of the business, and of the data that it generates or consumes.
  • 11:55 That domain thinking and decomposition is lost when we think about big data.
  • 12:05 Any monolithic architecture will fail under its own weight; the architects lie awake thinking how to break it up.
  • 12:25 Folks in the big data world think about this in terms of pipelines, at best.
  • 12:30 They divide the work around pipelines and phases of the pipeline, such as the ingestion/transformation/serving phases.
  • 12:50 It's a division of that monolith around technology more than anything else.

One of the other failures you mentioned was 'coupled pipeline composition' - can you explain what that was?

  • 13:10 If you talk to any data engineer or listen to a data podcast, the phrase you hear most is "the data pipeline".
  • 13:20 I would like to remove the data pipeline from our vocabulary.
  • 13:30 As it stands today, your units of architecture are measured in pipelines.
  • 13:40 A data pipeline is a set of jobs that are responsible for taking data from one or more upstream sources, running it through some transformations, and providing a cleaned view downstream.
  • 14:40 It could be produced in columnar format, in some sort of object storage or data lake, or via events.
  • 14:55 The pipeline is the unit of architecture, and if you zoom out, the monolithic structure is a labyrinth of data pipelines connected together.
  • 15:10 An upstream failure can cause a cascading effect through all of the downstream pipelines.
  • 15:20 At best, what I see is people trying to break down the pipelines into a set of reusable components, to scale out work across different services.
  • 15:30 You can scale out to have people working on the ingestion service - how data is being ingested from upstream - transformation services, etc.
  • 15:40 The pipeline is a fundamental unit of the current architecture that we live in.
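The ingestion/transformation/serving decomposition described above can be sketched as three chained functions; the point of the sketch is that each phase depends on the exact record shape the previous one emits, so an upstream format change cascades downstream. All field names are hypothetical illustrations.

```python
# A sketch of the pipeline-as-unit-of-architecture shape: ingestion,
# transformation, and serving phases wired in sequence. Each phase is
# coupled to the record shape produced by the phase before it.

def ingest(raw_events):
    # Ingestion phase: parse upstream records; any change to the
    # upstream format ("user_id", "timestamp") must be absorbed here.
    return [{"user": e["user_id"], "ts": e["timestamp"]} for e in raw_events]

def transform(events):
    # Transformation phase: depends on the exact keys ingest() emits.
    return [{"user": e["user"], "day": e["ts"][:10]} for e in events]

def serve(events):
    # Serving phase: an aggregate for downstream consumers, again
    # coupled to the shape produced by transform().
    counts = {}
    for e in events:
        counts[e["user"]] = counts.get(e["user"], 0) + 1
    return counts

raw = [
    {"user_id": "a", "timestamp": "2020-01-01T10:00:00"},
    {"user_id": "a", "timestamp": "2020-01-02T11:00:00"},
    {"user_id": "b", "timestamp": "2020-01-01T12:00:00"},
]
result = serve(transform(ingest(raw)))   # {'a': 2, 'b': 1}
```

Renaming `user_id` upstream breaks `ingest`, and any change to the dict it emits breaks `transform` and `serve` in turn - the cascading coupling the podcast describes.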

The final architectural failure was 'siloed and hyper-specialised decomposition' which leads to problems?

  • 16:10 If you think about the s-curve of growth or the maturity of technology, data engineering is on the early days of that curve.
  • 16:30 We focus on skill sets of people, or the organisation around those skill sets.
  • 16:40 The skill set of data engineers is built around experience with the tooling for building these data pipelines, and you put them together and make them responsible for providing and consuming data.
  • 17:00 You silo them into a corner of the organisation.
  • 17:05 If you see the typical organisation of a company today, you see the chief digital officer and data department separate from all of the other domains that form the operational world.
  • 17:15 If you're in the commerce world, you might have the data team separated from the e-commerce team, the buying team etc.
  • 17:25 Siloing people based on their skills leads to a lot of friction, because these people don't know the domains - any change in the signals from upstream will break the pipelines.
  • 17:50 They don't understand the consumption models for the data scientists to work on or generate reports.
  • 18:00 They're fairly siloed because of their separation; that means there is high friction, and second, the skill-set gap still remains.
  • 18:20 The cross section between people who know the domains and people who know the tooling is very limited.

Could you introduce what a data mesh is?

  • 18:55 It's a new paradigm (an overused phrase), because it needs thinking about from a different perspective and in a different language.
  • 19:15 The principles are a response to the problems we've talked about.
  • 19:25 The first principle is to bring distributed architecture and modelling to the world of big data; how can we split up the problem into smaller problems?
  • 19:40 It introduces domain thinking and domain driven design and architecture.
  • 19:45 If you look at a data mesh at a high level, you shouldn't see pipelines any more.
  • 19:55 What you should see is nodes that are representative of a particular domain data set, with a polyglot representation.
  • 20:05 For example, if you are in the commerce domain, you probably have your own e-commerce system that is generating customer interaction data.
  • 20:15 You have your order management generating data, payments, customer data ...
  • 20:30 There's a difference between operational data and analytical data.
  • 20:40 For analytical data, the nodes of the mesh - the top-level components - are polyglot, domain-oriented data sets.
  • 20:55 A particular domain might provide data sets in different formats.
  • 21:00 For example, it might provide the order data as an infinite log of events, for consumption by near real-time systems.
  • 21:15 It might also provide daily snapshots as columnar data, for consumption by machine learning through batch processes.
  • 21:25 The same domain might provide this analytical data (events and facts of the businesses) in multiple formats for analytical or machine learning processes.
  • 21:35 That's the domain oriented decomposition of the architecture, which concludes the first pillar.
  • 21:45 The second principle is platform thinking; if you think about any distributed architecture, the very first problems faced were around the data domains.
  • 22:25 We talked about large-scale data storage for polyglot data, and things like Spark clusters.
  • 22:40 There's a fair bit of complexity in the technology stacks inside these data teams, while still needing to provide consistency.
  • 22:55 That leads to the idea of platform thinking: abstracting the technology into a layer of self-serve data infrastructure.
  • 22:15 One of the measurements for the data infrastructure team is how quickly a new data product can be created.
  • 22:25 If I'm in the customer domain, how quickly can I build and serve a pipeline for customer data events?
  • 22:40 Platform thinking and abstracting the technology can give autonomy to these data teams.
  • 22:55 For the third pillar: we have decentralised the data, with siloed databases for each application.
  • 24:20 We can learn a lot from product thinking - having data as an asset, how do we get there?
  • 24:35 We have to treat the users (data scientists) as consumers of that product.
  • 24:50 It has to be discoverable, trustworthy, timely, well documented and secure, so that the data sets are meaningful.
  • 25:15 You need to have people who are data product owners, who go round the organisation and evangelise and talk about the data sets that exist.
  • 25:30 They should be measured by how easily someone can come and find this data and use it, and later on, by how it can be scaled.
  • 25:50 The fourth pillar is around interoperability and standardisation.
  • 25:55 One of the big challenges of big data is how to harmonise the identities of data that crosses boundaries.
  • 26:05 You want the customer id in the commerce system and in the order system to be linked together.
  • 26:35 If you unpack that, there is the standardised way of representing data and identifying entities, and the governance that needs to be applied to the mesh so it can act as an ecosystem.
  • 26:55 The data mesh is a paradigm shift that is applying domain-driven distributed architecture to big data and uses platform thinking to create self-serving data infrastructure, federated governance to create a harmonious ecosystem and product thinking to deliver this data as a node on the mesh to the rest of the organisation.
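The polyglot domain data product described above - one domain serving the same order facts as an infinite event log for near-real-time consumers and as daily snapshots for batch consumers - can be sketched as follows. The `OrdersDataProduct` class, its methods, and all field names are hypothetical illustrations, not an API from the podcast.

```python
# A sketch of one domain data set served in two polyglot forms:
# an append-only event log (streaming consumers) and a daily
# snapshot (batch consumers). Names are hypothetical.
from collections import defaultdict

class OrdersDataProduct:
    def __init__(self):
        self._events = []  # append-only log of order facts

    def publish(self, order_id, day, amount):
        """Record one business fact as an immutable event."""
        self._events.append({"order_id": order_id, "day": day, "amount": amount})

    def event_log(self):
        """Streaming-style interface: the full ordered log of events."""
        return iter(self._events)

    def daily_snapshot(self, day):
        """Batch-style interface: an aggregated columnar-like view for one day."""
        totals = defaultdict(float)
        for e in self._events:
            if e["day"] == day:
                totals[e["order_id"]] += e["amount"]
        return dict(totals)

orders = OrdersDataProduct()
orders.publish("o1", "2020-03-01", 20.0)
orders.publish("o1", "2020-03-01", 5.0)
orders.publish("o2", "2020-03-02", 9.5)
snapshot = orders.daily_snapshot("2020-03-01")   # {'o1': 25.0}
```

The design point is that both interfaces are owned and served by the order domain itself, rather than by a central pipeline team extracting the data into a lake.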

What are the main challenges in a data mesh?

  • 27:40 The pain points fall into two camps; on one side, the technology isn't geared towards distributed architecture in the big data domain.
  • 27:55 There are technical challenges in this space; the other side is the organisational challenge.
  • 28:05 Some of the challenges are related to data mesh, but some are related to how we are approaching data driven programs of work.
  • 28:15 The target state of a data mesh requires that the operational domains take responsibility for providing their domain data sets as data products, and today they are not incentivised to do that.
  • 28:35 If I'm running the e-commerce system, I'm incentivised based on how many transactions take place.
  • 28:45 I'm not incentivised in providing customer interaction events to the rest of the organisation.
  • 29:05 Being able to create incentive structures, metrics, KPIs etc., so that the domains take ownership of providing those capabilities and analytical data to the rest of the mesh, is an area of focus.
  • 29:25 I don't expect organisations to be able to get there on day one; bootstrapping a data mesh looks very different from being in the final state.
  • 29:30 One of them would be incentive structures about decentralised ownership of the data.
  • 29:40 Governance is very centralised today, and governance teams have a difficult time doing what they need to do - being able to federate that would help.
  • 29:55 That applies to any migration from monolithic to distributed ownership journey.
  • 30:00 The second set of challenges are around technology.
  • 30:05 There are a lot of advances in the open-source world, where you can take the technology and apply it to a distributed data consumption model.
  • 30:15 Interestingly enough, those consumption technologies have been incentivised by the fragmented world of operational data.
  • 30:25 In a way, they have come to exist to deal with silos of data stores, but they can't be used as-is in the world of data mesh.
  • 30:35 One of the concerns around a data mesh is discoverability - ability to discover data, to make sense of it, to see the documentation of it.
  • 30:50 The reason the open-source solutions exist isn't that teams wanted to intentionally share documentation and lineage - the data was hidden, so they had to make sense of it.
  • 31:10 For example, Lyft's discoverability tool came to be because it was so hard to discover data without it.
  • 31:20 There needs to be a new generation of data discoverability tools to allow intentional sharing of data, the same way that API gateways allowed the sharing of APIs.
  • 31:35 The other aspect of technology is access control: having distributed access control is a solved problem in the microservice world.
  • 32:00 That doesn't exist in the data storage world, because you don't have a unified way of applying federated access control to all of the data nodes with their different polyglot data storage.
  • 32:30 How can we apply the data access rules to these different kinds of data?
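The federated access-control gap raised above can be illustrated with a sketch: a single policy layer that every data product node consults before serving data, regardless of its underlying storage format. The policy table, role names, and dataset names are all hypothetical; this shows the shape of the idea, not any existing tool.

```python
# A sketch of federated access control across mesh nodes: one shared
# policy function, applied uniformly by each node before serving data,
# whatever its polyglot storage looks like. All names are hypothetical.

POLICIES = {
    # (role, dataset) -> allowed?
    ("data_scientist", "orders.events"): True,
    ("data_scientist", "customers.pii"): False,
}

def can_access(role, dataset):
    """Central policy decision, shared by every node in the mesh."""
    return POLICIES.get((role, dataset), False)  # deny by default

def read(role, dataset, store):
    """Uniform entry point a mesh node applies before serving data."""
    if not can_access(role, dataset):
        raise PermissionError(f"{role} may not read {dataset}")
    return store[dataset]

# Two stubbed data product stores behind the same policy layer.
store = {"orders.events": [1, 2, 3], "customers.pii": ["alice"]}
rows = read("data_scientist", "orders.events", store)   # -> [1, 2, 3]
```

In practice each node's storage would be a different technology (event log, columnar files, a database), which is exactly why a storage-agnostic policy layer is needed.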

If people want to follow your work or data mesh, how can they do that?

  • 32:45 You can follow me on Twitter (https://twitter.com/zhamakd) or LinkedIn (https://www.linkedin.com/in/zhamak-dehghani/).
  • 32:50 I am setting up a data mesh website, and will try to collect all of my talks into one place.
  • 33:05 I'm excited to read other blogs that people have written, so I'd like to collect those.

More about our podcasts

You can keep up to date with the podcasts via our RSS Feed, and they are available via SoundCloud, Apple Podcasts, Spotify, Overcast and Google Podcasts. From this page you also have access to our recorded show notes. They all have clickable links that will take you directly to that part of the audio.
