Data Mesh Paradigm Shift in Data Platform Architecture



Zhamak Dehghani introduces Data Mesh, the next generation data platform, that shifts to a paradigm drawing from modern distributed architecture considering domains as the first class concern, applying platform thinking to create self-serve data infrastructure, and treating data as a product.


Zhamak Dehghani is a principal technology consultant at ThoughtWorks with a focus on distributed systems and modern data platform architecture at Enterprise. She is a member of ThoughtWorks global Technology Advisory Board and contributes to the creation of ThoughtWorks Technology Radar.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.


Dehghani: For the next 50 minutes I'll talk about data mesh, a long overdue paradigm shift in data architecture. I know, I did try to resist using the phrase "paradigm shift." I couldn't resist. It ended up in the title, and it's one of the most used and abused phrases in our industry. Have you heard the phrase? Do you know the origin of the phrase?

Participant 1: Thomas Kuhn, "The Structure of Scientific Revolutions."

Dehghani: Thank you very much. You are one of the very few people who actually know the origin of this. The other person who knew the origin of this was our CTO, Rebecca Parsons. As you rightly said, in 1962, an American physicist, historian, and philosopher of science wrote this book, "The Structure of Scientific Revolutions." He coined the term paradigm shift in this book, which was very controversial at the time. He actually made quite a few scientists upset.

What he shared in his book was his observations about how science progresses through the history. What he basically said was scientists start their journey in terms of progressing science in this phase he called normal science, where, essentially, scientists are working based on the assumptions and theories of the existing paradigm. They're looking and doing observations to see what they expect to see, what they expect to prove. Not a whole lot of critical thinking is going on there, and you can imagine why scientists weren't so happy about this book.

After that, they start running into anomalies. They're making observations that don't quite fit the current norm, and that's when they go into the phase of crisis. They start doubting what they believe to be true, and they start thinking out of the box. That's where the paradigm shift happens to the revolutionary science. Essentially, we're going from incremental improvements in whatever scientific field we are to a completely new order. An example of that, when scientists couldn't make sense of their observations in subatomic level, we had the paradigm shift from the Newtonian mechanics to quantum mechanics.

What does that have anything to do with modern data architecture? I think we are in that crisis phase in the Kuhnian observation. The paradigm that we have adopted for 30, 40, 50 years about how to manage data doesn't really solve our problems today. The inconvenient truth is that companies are spending more and more on data. This is an annual survey, NewVantage from the Fortune 1000 companies and they surveyed the leaders.

What they found out is that we're seeing an immense increase in the pace of investment. Over the course of one year, there was an increase in the budgets being spent in the range of 50 million to 500 million and above, despite the fact that the leaders in those organizations are seeing a decline in their confidence that that money is actually giving measurable results. There are pockets of innovation in terms of using data, and we don't have to go far to find them. We just look around Silicon Valley, where we see how digital natives are using data to change their businesses.

The incumbents and a lot of large organizations are measuring themselves as failing on any transformational measure. Are they using data to compete? Are they using analytics to change their business? Have they changed their culture? I don't want to underestimate the amount of work that goes into multifaceted change and transformation in organizations to actually use data to change the way we behave: changing our culture, changing our incentive structure, changing how we make decisions. But technology has a big part in it. This is an architecture track, so that's where I'm going to focus.

Data Technology Solutions Today

The current state is that the currently accepted norm and paradigm has put this architectural landscape into two different spheres with hardly any intersection. We have the sphere of operational systems. That's where the microservices are happening. That's where the systems running the business are operating: your e-commerce, your retail, your supply chain. Really, we've seen an immense amount of improvement over the last decade in how we run our operational businesses. You just have to go to the microservices track or the DevOps track to see how much we have moved forward.

Then on the other side of the organization, down the hall in the data department, we are dealing with the big data analytical architecture. Its purpose is, "How can I optimize the business? How can I run the business better so I can upsell, cross-sell, personalize the experience of my customers, find the best route for my drivers, see the trends of my business: BI, analytics, ML?"

That has very different architectural patterns and paradigms that we've accepted. If you think about that sphere of big data architecture, there are three big generations of technology that I've seen while working with a lot of clients, starting with data warehousing. Do you know when the first writing, research, and implementation of the data warehouse entered the industry? In the 70s. The first research papers were in the late 60s, and the data marts and implementations of them were in the 70s. We had data warehousing. We improved around 2010. We evolved to the data lake, and now the data lake on the cloud. If you look at the implementations or the existing paradigms of data warehousing, the job of the data warehouse has always been to get the data from the operational systems, typically by running some sort of job that goes into the guts of a database and extracts the data.

Before you use the data, you try to model it into this one model that's going to solve all the problems, like world hunger, so we can do all sorts of analysis on it. You model it into snowflake schemas or star schemas and run a bunch of SQL-like queries over it so we can create dashboards and visualizations, and put a human behind the analytical system to see what the heck is going on around that business.
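As an illustration of the star-schema approach described here (not something shown in the talk), the following is a minimal sketch using Python's built-in sqlite3 module. The table names, columns, and values are all hypothetical; the point is only the shape: one fact table keyed to dimension tables, with a SQL aggregate feeding a report.

```python
import sqlite3

# Hypothetical star schema: one fact table surrounded by dimension tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
    CREATE TABLE fact_sale (
        sale_id     INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES dim_customer(customer_id),
        date_id     INTEGER REFERENCES dim_date(date_id),
        amount      REAL
    );
""")

# Made-up rows standing in for data extracted from operational systems.
conn.executemany("INSERT INTO dim_customer VALUES (?, ?)", [(1, "west"), (2, "east")])
conn.executemany("INSERT INTO dim_date VALUES (?, ?, ?)", [(10, 2019, 1), (11, 2019, 2)])
conn.executemany(
    "INSERT INTO fact_sale VALUES (?, ?, ?, ?)",
    [(100, 1, 10, 250.0), (101, 1, 11, 100.0), (102, 2, 11, 75.0)],
)


def sales_by_region(connection):
    """Join the fact table to a dimension and aggregate, as a dashboard query would."""
    rows = connection.execute("""
        SELECT c.region, SUM(f.amount)
        FROM fact_sale AS f
        JOIN dim_customer AS c ON f.customer_id = c.customer_id
        GROUP BY c.region
        ORDER BY c.region
    """).fetchall()
    return dict(rows)


totals = sales_by_region(conn)
```

Every new question of the data tends to mean another dimension or another join against the same central model, which is part of why this approach gets harder to keep nimble as the number of tables grows.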

The type of technologies that we've seen in this space – by the way, disclaimer, this is no endorsement of any of these technologies. It's just a random selection of things that you might see in the wild as representative technology to support data warehousing. You have things from the cloud providers, like BigQuery, or Power BI if you're on Azure, that give you the full stack to get the data into hundreds of tables and be able to query them in different ways. Then you have your dashboards and analytics on top for your reporting.

Data warehousing, which we have used for about 40 years, has been problematic at scale. This notion that we can get data from all the different complex domains and put it in one model, with thousands of tables and thousands of reports, and then really use that in an agile and nimble way, has been an unfulfilled promise. We improved, we evolved, and we said, "You know what, don't worry about that whole modeling we talked about, just get the data out of the operational systems, bring it to this big, fat data lake in its original form. Don't do so much modeling, we'll deal with modeling afterwards."

Then we throw in a few data scientists to swim in this data lake and figure out what insights they can discover. Then we model the data for downstream consumption in a fit-for-purpose way, whether it's specific databases or a data warehouse down the line. That has also been problematic at scale. The data department running Hadoop clusters or other ways of storing this big data hasn't been that responsive to the data scientists that need to use that data.

The type of technology that we see around here is the big storage, like blob storage, because now we're talking about storing data in its native format, so we go with plain blob storage. We have tools for processing the data, Spark and so on, to join, to filter, to model it, and then we have orchestrators, like Airflow and so on, to orchestrate these jobs. A lot of clients that I work with are still not satisfied. They still don't get value at scale in a responsive way from the data lake.

Naturally, the answer to that is to get the lake onto the cloud. Cloud providers are racing and competing to get your data into the cloud and to provide services that are easier to manage. They're doing a great job, but essentially, they're following the same paradigm. This is a sample solution architecture from GCP. I can promise you, if you google AWS or Azure, they pretty much look the same. You've got, on the left-hand side, this idea that your operational systems, your OLTP, everything, goes through batch or stream processing, gets thrown into the data lake, and then downstream gets modeled into BigQuery, or Bigtable if you want to be faster, and so on.

Look Convincing?

That looks wonderfully convincing. Wiring fabulous technology to shove the data from left to right into this big cloud. I want to step back for a minute, look from a 50,000-foot view at the essential characteristics that are commonly shared across these different solutions that we've built, and get to the root cause of why we're not seeing the benefits that we need to see. From the 50,000-foot view, I can promise you that I've seen so many enterprise data architectures that pretty much look like this. Obviously, they're drawn with fancier diagrams rather than my squiggly hand drawing.

Essentially, it's one big data platform, a data lake or data warehouse, and its job is to consume data from hundreds of systems, the yellow-orange boxes that are drawn, across the organization or beyond the bounds of the organization, then cleanse, process, and serve it, and then satisfy the needs of hundreds of consumer use cases: feed the BI reports, empower the data scientists, train the machine learning algorithms, and so on.

If you look at that technology, the solution architecture that I showed you, there is nowhere a discussion around the domains, around the data itself. We always talk about throwing the data in one place. In this idea of a monolithic architecture, the idea of domains, of the data itself, is completely lost. The job of the architects in the organization, when they find themselves with this big architecture, is to somehow break it down into its pieces, so that they can assign different teams to implement the functionality of the different boxes here.

This is one of the ways that companies at scale are trying to break down their architecture into smaller pieces. They design ingestion services, so services that are getting the data out of the devices or operational systems. They have the processing team that is building the pipelines to process that, and there are teams that are working on the APIs or downstream databases to serve them.

I'm very much simplifying. Behind this is actually a labyrinth of data pipelines stitched together. When you step back for a minute, what are we seeing here? We're seeing a layered architecture as the top-level decomposition. It's been decomposed based on its technical capabilities: serving, ingesting, and so on. The boundaries are the technical functionality. If you tilt your head 90 degrees, you have seen this before. We have seen layered enterprise architecture, where we had UI and business logic, and databases underneath.

What was wrong with that? We moved from that to microservices. Why? Because change is not constrained to these boxes that we've drawn on the paper. Change happens orthogonally to these layers. If I want to introduce a new signal that I want to get from my device and now process it, if I want to introduce a new source or introduce a new model, I pretty much have to change all of these pieces. That's a very friction-full process: the handover, the handshake, to make sure that the change happens consistently across those layers. If you come down a little bit closer and look at the life of the people who actually build this architecture and support it, what do we see?

We see a group of people, siloed, data engineers, ML engineers in the middle stuck in between the world of operational systems that generate this data and the world of consumers that need to consume the data without any domain expertise. I really don't envy the life of data engineers I work with. I'm hoping that we can change the life of these data engineers right here right now from here on. What happens is that the orange people that are running the operational systems, they have no incentive to provide their analytical data, those historical snapshots, the events and reality and facts of the business to the rest of the organization in an easily consumable way.

They are incentivized to run their operational business. They are incentivized to run that e-commerce system and build a database that is optimized to run that e-commerce system. On the other side, the purple folks are just hungry for the data. They need the data to train the machine learning, and they're frustrated because they constantly need to change it and modify it, and they're dependent on the data engineers in the middle. The data engineers are under a lot of pressure because they don't understand the data coming to them. They don't really have the domain expertise. They don't know how the data is being used.

They've been essentially siloed based on their tools expertise. Yes, we are at that point in the evolution or growth of the technology where the data tooling is still a fairly niche space. Knowing Spark and Scala and Airflow is a much more niche space than general software engineering. We've seen these silos before. We saw the silo of Dev and Ops, and we removed the wall. The wall came down, and we brought the folks together. We created a completely new generation of engineers, called them SREs, and that was wonderful, wasn't it? With silos, we just have a very difficult process full of friction.

Just to show the skill set gap that we are facing, and will continue to face with a wall in between, here are the stats that you can get from LinkedIn. Last time I searched was a few weeks back. I doubt things have changed much in three weeks. If you look for data jobs open today with the label "data engineer," you find about 46,000 jobs open on LinkedIn. If you look for people who are claiming to be data engineers on the platform, you see 37,000 folks. I'm pretty sure all of them are in good jobs with good pay. There's this huge gap in the skill set that we can't close by siloing people.

This centralized monolithic paradigm was great, maybe, for a smaller scale. The world we live in today is a world where data is ubiquitous. Every touchpoint, every action and interaction is generating data, and businesses are driven to innovate. That cycle of innovation, to test, and learn, and observe, and change, requires constant change to the data, and modeling and remodeling. This centralized system simply doesn't scale: a centralized monolithic system that has divided the work based on the technical operation, implemented by a silo of folks.

Going back to Thomas Kuhn's observation, with the data warehouse, and the lake, and the lake on the cloud, what have we been doing for 40 or 50 years? We've been stuck in that normal science. We believe that the only way we get use out of data is by getting it into a big, fat data lake or platform, and getting our arms around it so we can make sense of it. This centralization was the dream of the CIOs of 30 years ago: "I have to get the data centralized because it's siloed in these databases that I can't get into." Moving beyond that is the paradigm shift I'm hoping to introduce.

Where Do We Go From Here?

Let's talk about data mesh. Hopefully, so far, I've nudged you to question the existing paradigm. I'm going to go a bit top-down - my mental model was fairly top-down - talk about the principles that drive this change, and then go deeper into some of the implementations and hopefully leave you with a couple of next steps. The principles underpinning data mesh are basically the ingredients of the best and most successful projects that we have had globally at ThoughtWorks. It's applying the learnings of modern architecture that we've seen in the adjacent world of operational systems and bringing that to data. The very first one is decentralization.

How can we apply domain-driven thinking and distributed architecture to data? How can we hide the complexity of all that infrastructure that runs and operates the big data? I don't want to trivialize that: it is very hard to operate a Kafka cluster at scale. It is very difficult to run your Spark cluster. How can we abstract that away into self-serve infrastructure with platform thinking? And to avoid those silos of hard-to-find, hard-to-use, meaningless, untrustworthy data, how can we apply product thinking to really treat data as an asset? Finally, to have a harmonious and well-played ecosystem, what sort of governance do we need to bring to the table? I'm going to go into each of these one by one, and hopefully, [inaudible 00:19:44] better.

Domain-driven distributed architecture. Raise your hand [inaudible 00:20:00] of Eric Evans', "DDD." About 10%. Go on Amazon, [inaudible 00:20:14] just through this and get the book, or go to [inaudible 00:20:20] website [inaudible 00:20:21] stake.

What domain-driven design or domain-driven distributed architecture introduces is this idea of breaking down monolithic systems into pieces that are designed around domains, picking the business domains that you have. What we discussed so far was the way we're trying to break down these centralized monolithic data platforms around pipelines, around the jobs of the different pipeline phases. Now we are applying a different approach. We're saying, find the domains. The examples I put up here are from health insurance because that's where I am. We're waist-deep right now with a client implementing their next-generation data platform.

When you think about the operational domains, a lot of organizations are already divided that way. In the healthcare domain, you have your claims systems that provide claims, like the pharmaceutical or medical claims that you're putting together. You might have your biomarkers, lab results, and so on. These are the different domains that you see in this space. If you think about these data domains as a way to decompose your architecture, you often find domains that are very close to the source, where the data originates, for example claims. You have the claims systems already either accepting, or rejecting, or processing different claims.

Those systems are generating historical analytical data about claims. These are domains that are closer to the facts of the business as they're getting generated. We're talking about immutable data. We're talking about historical data that is just going to be generated forever and ever and stay there. These data domains hardly change because the facts of the business don't change as much. Of course, there are industries where I get a new app, and my app features change, so the signals coming from that app constantly change. Normally, in bigger organizations, these are more permanent and static data domains.

Then you have domains that you are refining, that you're basically creating based on the needs of your business. These are aggregate data domains. I put in the example of patients' critical moments of intervention, which is actually a wonderful use case, and a data set that the client that I'm working with right now is generating by aggregating a lot of information about the members, the members' behavior, their demographics, their changes of address, and applying machine learning to find out, "What are those moments when, as an insurance provider, I need to reach out to my members and say, 'You need to do something about your health. You just changed your address. You haven't seen a doctor for a while. You don't have a support network. Probably you haven't picked a doctor or done your dental checkups. Go and visit Dr. A or B,'" so creating these data sets.

These are aggregate views. Or, the holy grail of healthcare data right now is longitudinal patient records, so aggregating all of your clinical visits and lab results into some sort of time series data. These are more consumer-oriented designed domains. Theoretically, we should always be able to regenerate and recreate these from those native data products that we saw. Where did the pipelines go? The pipelines still exist. Each of those data domains still needs to ingest data from some upstream place, maybe just a service next door that is implementing the functionality, or the operational systems.

They still have to cleanse it and serve that data, but those pipelines become the second class concern. They become the implementation details of these domain data sets or domain data products. As we go towards the right-hand side, the orange and red blobs, we see more of the cleansing, more of the integrated testing to get accurate source of data out built into the pipelines. As you go towards the consumer-facing and aggregate views, you see more of the modeling, and transformations, and joins, and filters, and so on.

In summary, with distributed domain-driven architecture, your first architectural partition becomes these domains and the domains' data products, which I go into in detail towards the end. I really hope that we don't use the data pipeline as a first-class concern. Every time I ask one of our data engineers, "Can you draw your architecture?" he just talks about pipelines. Pipelines are just implementation details. What really matters is the data itself and the domain that it belongs to. There's this wonderful concept, the architectural quantum, that Neal Ford and Rebecca Parsons, co-authors of the "Building Evolutionary Architectures" book, coined, which is the smallest unit of your architecture that has high cohesion and can be deployed independently of the rest.
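To make the idea of pipelines as implementation details concrete, here is a minimal sketch that is not from the talk. The class name `DomainDataProduct`, its methods, and the two pipeline steps are all hypothetical. The cleansing steps live inside the data product, and consumers only ever see the served data.

```python
class DomainDataProduct:
    """A hypothetical domain data product; the pipeline is private to it."""

    def __init__(self, domain, name, steps):
        self.domain = domain
        self.name = name
        self._steps = list(steps)   # pipeline steps: an internal detail
        self._served = []           # data exposed to consumers

    def ingest(self, upstream_records):
        """Run upstream records through the internal pipeline, then serve them."""
        records = upstream_records
        for step in self._steps:
            records = step(records)
        self._served.extend(records)

    def read(self):
        """The only thing consumers depend on: the served data set."""
        return [dict(record) for record in self._served]


def drop_incomplete(records):
    # Cleansing step: discard records without an identifier.
    return (r for r in records if r.get("claim_id") is not None)


def normalize_amount(records):
    # Cleansing step: store amounts as numbers rounded to cents.
    return ({**r, "amount": round(float(r["amount"]), 2)} for r in records)


claims = DomainDataProduct("claims", "claims-history", [drop_incomplete, normalize_amount])
claims.ingest([
    {"claim_id": "c1", "amount": "12.50"},
    {"claim_id": None, "amount": "3"},
])
served = claims.read()
```

In this sketch, swapping or reordering the steps changes nothing for a consumer calling `read()`, which is the sense in which the pipeline is a second-class concern.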

We are moving to a world where the architectural quantum becomes these domain data products, which are immutable and show the snapshots and the history of the business. How can we avoid the problem that made us move to centralization in the first place: this problem of having silos of databases and data stores, now spread across these domains, where nobody knows what is going on or how to get to them? That's where product thinking helps us.

I think it's actually become quite common for us to think about the technical platforms that we build as products, because the developers and the data scientists are the consumers and customers of those products, and we should treat them so. If you ask any data scientist today, they would tell you that they spend 80% to 90% of their time actually finding the data that they need, and then making sense of it, and then cleansing it and modeling it to be able to use it. Why don't we apply product thinking to really delight that data scientist and remove that 80% to 90% waste?

What does that mean? That means that for each of these domains that we talked about, like the claims domain, the data, the historical analytical data for it, becomes a product. Yes, it has multiple shapes; it's a polyglot data set. You might have streams of claims for the users that prefer real-time or near real-time events about the claims. It might have buckets of batch or historical snapshots for data scientists, because they love bucket files and batch processing for 80% of their job. For data to be an asset and be treated as such, I think there are some characteristics that each of these data products needs to carry.
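The polyglot shape described here can be sketched as one product with two output ports over the same facts. This is only an illustration under assumed names (`ClaimsDataProduct`, `stream`, `daily_batches`) with in-memory storage; a real product would sit on an event log and file storage.

```python
from collections import defaultdict


class ClaimsDataProduct:
    """Hypothetical product serving the same facts as a stream and as batches."""

    def __init__(self):
        self._events = []  # append-only log of immutable claim events

    def publish(self, event):
        self._events.append(dict(event))

    def stream(self, from_offset=0):
        """Output port 1: ordered events for near real-time consumers."""
        for offset in range(from_offset, len(self._events)):
            yield offset, dict(self._events[offset])

    def daily_batches(self):
        """Output port 2: the same events grouped by day for batch consumers."""
        batches = defaultdict(list)
        for event in self._events:
            batches[event["date"]].append(dict(event))
        return dict(batches)


product = ClaimsDataProduct()
product.publish({"claim_id": "c1", "date": "2020-01-01", "status": "accepted"})
product.publish({"claim_id": "c2", "date": "2020-01-01", "status": "rejected"})
product.publish({"claim_id": "c3", "date": "2020-01-02", "status": "accepted"})

recent = list(product.stream(from_offset=1))
batches = product.daily_batches()
```

Both ports read from one log, so the stream and the batch view cannot drift apart in this sketch.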

First and foremost, they need to be discoverable. Chris [Riccomini] mentioned in the previous talk that the world of data cataloging is on fire, in a good way, and there are tons of different applications, because data discoverability is the first and foremost characteristic of any healthy data platform. Once we discover the data, we need to programmatically address it so we get access to the data easily. As a data scientist or data analyst, if I can't trust the data, I will not use it. It's really interesting because in the world of APIs and microservices, running a microservice without announcing your uptime and having an SLO is crazy.

You have to have an understanding of what your commitment to the rest of the organization is in terms of your SLOs. Why can't we apply the same thing to the data? If you have maybe real-time data with some missing events and some inconsistencies, that's acceptable; you just have to explicitly announce that and explicitly support that for people to trust the data they're using. Then there's good documentation: a description of the schema, who the owners are, anything that helps data scientists or data users to self-serve using your product.
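The characteristics listed so far (discoverable, addressable, trustworthy through announced SLOs, self-describing) could be captured in a small descriptor that each product registers with a catalog. The sketch below is an assumption for illustration: the field names, the address format, and the SLO keys are invented, and the talk does not prescribe any of them.

```python
from dataclasses import dataclass


@dataclass
class DataProductDescriptor:
    address: str       # hypothetical unique, programmatic address
    owner: str         # who to ask about this data
    description: str   # human-readable documentation
    schema: dict       # field name -> type name
    slo: dict          # explicitly announced guarantees


class Catalog:
    """A toy in-memory catalog where products announce themselves."""

    def __init__(self):
        self._entries = {}

    def register(self, descriptor):
        self._entries[descriptor.address] = descriptor

    def search(self, keyword):
        """Discoverability: find product addresses by keyword in the description."""
        keyword = keyword.lower()
        return sorted(
            address
            for address, entry in self._entries.items()
            if keyword in entry.description.lower()
        )

    def describe(self, address):
        """Addressability: look up one product by its address."""
        return self._entries[address]


catalog = Catalog()
catalog.register(DataProductDescriptor(
    address="claims/history",
    owner="claims-team",
    description="Historical claims events, near real-time, some late events possible",
    schema={"claim_id": "string", "amount": "decimal"},
    slo={"freshness_minutes": 15, "completeness_pct": 99.0},
))

found = catalog.search("claims")
entry = catalog.describe("claims/history")
```

Publishing the SLO next to the schema is the data equivalent of a service announcing its uptime: a consumer can decide whether 99% completeness is good enough before using the data.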

Interoperability – if I have distributed data and I can't join the customer from the sales domain to the customer from the commerce domain, I really can't use these pieces of data. That interoperability, unifying the IDs or some other field formats to allow that join and filter and correlation, is another attribute of a data product.
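A minimal sketch of why the unified ID matters: once two domains publish the same identifier field, a consumer can correlate them with a plain join. The field name `global_customer_id` and the sample rows are made up for this example.

```python
def join_on_global_id(left, right, key="global_customer_id"):
    """Inner-join two lists of records on a shared, standardized ID field."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    joined = []
    for left_row in left:
        for right_row in index.get(left_row[key], []):
            joined.append({**right_row, **left_row})
    return joined


# Hypothetical outputs of two independent domain data products.
sales = [
    {"global_customer_id": "g-1", "lifetime_value": 900},
    {"global_customer_id": "g-2", "lifetime_value": 120},
]
commerce = [
    {"global_customer_id": "g-1", "last_order": "2020-01-03"},
    {"global_customer_id": "g-3", "last_order": "2020-01-05"},
]

customers = join_on_global_id(sales, commerce)
```

If each domain kept only its own local customer key, the index lookup would never match, and the two data sets would be unusable together.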

Finally, security. It's such a privilege to talk while Chris [Riccomini] is here because I can just point to his talk. He talks about RBAC and applying access control in an automated way at every endpoint, at every data product. These things just don't happen out of good intentions. We need to assign people with specific roles, so a particular role that we're defining when we're building this data product or data mesh is the data product owner: someone whose job is to care about the quality, the future, the lifecycle of a particular domain's analytical data, and really evangelize this to the rest of the organization, "Come and see. I've got this wonderful data you can tap into," and show how that can create value.

In summary, it's about bringing the best practices of product development and product ownership to data. If you're putting one of these cross-functional teams together with the data product owner, I would start by asking for one success criterion, one KPI to measure. That is delighting the data users: the decreased lead time for someone to come and find that data, make sense of it, and use it. That would be the only measure that I track first, and then, of course, the growth in the number of users using it.
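As a worked example of that KPI, the lead time can be computed per consumer as the time between finding a data product and first using it, and then summarized. The timestamps below are invented, and using the median as the summary is an assumption, not something stated in the talk.

```python
from datetime import datetime
from statistics import median


def lead_time_hours(found_at, first_used_at):
    """Hours between a user discovering a data product and first using it."""
    return (first_used_at - found_at).total_seconds() / 3600


# Made-up (found, first used) timestamps for three data users.
observations = [
    (datetime(2020, 1, 6, 9, 0), datetime(2020, 1, 6, 15, 0)),   # 6 hours
    (datetime(2020, 1, 7, 9, 0), datetime(2020, 1, 8, 9, 0)),    # 24 hours
    (datetime(2020, 1, 9, 10, 0), datetime(2020, 1, 9, 12, 0)),  # 2 hours
]

lead_times = [lead_time_hours(found, used) for found, used in observations]
median_lead_time = median(lead_times)
```

Tracking the median rather than the mean keeps one unusually slow onboarding from dominating the number, which seems a reasonable default for a team watching this measure over time.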

If you've been listening so far, you're probably wondering, "What are you asking me?" A question that a lot of CIOs and people that actually spend the money ask me is, "You're telling us to distribute the analytical data ownership to different domains, create different teams. Then what happens with all that technical complexity, the stack that needs to implement each of these pipelines?" Each of these pipelines needs some sort of data lake storage, needs to have the storage account set up, needs to have the clusters to run its jobs, and probably needs to have some services. There's a lot of complexity that goes into that.

Also, there are decisions, such as wanting to have your compute closer to your data, or perhaps wanting a consistent storage layer. If we just distribute these decisions, we create a lot of duplicated effort and probably inconsistencies. That's where our experience in the operational world of creating infrastructure as a platform comes into play. We can apply the same thing here. Capabilities like data discovery, setting up the storage account, all of the technical metawork that we have to do to spin up one of these data products, can be pushed down to a self-serve infrastructure with a group of data infrastructure engineers to support that. Just to give you a flavor of the type of complexity that exists and needs to be abstracted away, here's a list.

Out of this list, if I had a magic wand and I could ask for one thing, that's unified data access control. Right now, it's actually a nightmare to set up unified policy-based access control across different mediums of storage. If you're providing access control to your buckets, or if you're on Azure ADLS, versus your Kafka, versus your relational database, every one of them has a proprietary way of supporting that. There are technologies that are coming into play to support that, like future extensions to Open Policy Agent, and so on. There's a lot of complexity that goes into that.
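The wish expressed here, one policy definition enforced the same way in front of every storage medium, can be sketched as follows. This is a toy in-memory model with hypothetical names (`Policy`, `PolicyEngine`, `GuardedStore`); it does not use Open Policy Agent or any real storage API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    role: str
    product: str
    action: str


class PolicyEngine:
    """One place to define and evaluate policies, independent of storage medium."""

    def __init__(self, policies):
        self._policies = set(policies)

    def is_allowed(self, role, product, action):
        return Policy(role, product, action) in self._policies


class GuardedStore:
    """Wraps any medium (bucket, stream, table) behind the same policy check."""

    def __init__(self, product, medium, reader, engine):
        self.product = product
        self.medium = medium
        self._reader = reader
        self._engine = engine

    def read(self, role):
        if not self._engine.is_allowed(role, self.product, "read"):
            raise PermissionError(f"{role} may not read {self.product} via {self.medium}")
        return self._reader()


engine = PolicyEngine([Policy("analyst", "claims/history", "read")])

# Two different mediums for the same product, guarded by the same policy.
bucket = GuardedStore("claims/history", "bucket", lambda: ["snapshot-2020-01-01"], engine)
stream = GuardedStore("claims/history", "stream", lambda: [{"claim_id": "c1"}], engine)

bucket_data = bucket.read("analyst")
stream_data = stream.read("analyst")
```

The design choice is that the policy names a data product and a role, not a bucket path or a topic ACL, so adding a third medium requires no new policy.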

In summary, the platform thinking, or self-serve data infrastructure, is set up to build all of the domain-agnostic complexity needed to support data products. If I set up one of these teams - and often, very early on in the projects, we set up a data infrastructure team, because there is demand for them - the metric they get measured by is the amount of time that it takes for a data product team to spin up a new product: how much complexity they can remove from the job of those data engineers or data product developers, so that it takes very little time to get data for one domain and provide it in a polyglot form to the rest of the organization. That's their measure of success.

Anybody who's worked on distributed systems knows that without interoperability, a distributed system will just fall on its face. If you think about microservices and the success of APIs, we had the one thing that we all agreed on. We had HTTP and REST. We pretty much all agreed that's a good idea. Let's just start getting these services to talk to each other based on some standardization. That was the key to the revolution of APIs. We need something similar here when we talk about these independent data products that are providing data from different domains, so that that data can be correlated, joined, and processed or aggregated.

What we are trying to do is create this nexus of folks that are coming from different domains and form a federated governance team to decide which standardizations we want to apply. Of course, there are always going to be one or two data products that are very unique, but most of the time, you can agree upon a few standards. The area that we are standardizing first and foremost is how each data product describes itself so that it can be self-discovered: the APIs to find and describe a data product and discover it, which I will share with you in a minute. The other area that we work on very early is federated identity management. In the world of domain-driven design, there are often entities that cross the boundaries of domains. Customer is one of them, member is one of them, and every domain has its own way of identifying these entities, which we call polysemes.

There are ways to build inference services in machine learning to identify the identity of the customer across different domains that has a subset of attributes and generate a global ID so that now, as part of publishing a data product out of my domain, I can do internally a transformation to a globally identifiable customer ID so that my data product is now consistent in some ways with the other data products that have the notion of customer in them. Most importantly, we try to really automate all of the governance capabilities or capabilities that are related to the governance. The Federated ID system management is one of them. Access Control is another one. How can we really abstract away the policy enforcement and policy configuration for accessing polyglot data into the infrastructure?
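The publishing-time transformation to a global ID might look like the sketch below. Note the swap: the talk describes machine-learning inference services for matching identities, whereas this example uses a simple deterministic rule (a normalized email address) purely to show the shape of the transformation. All function and field names are hypothetical.

```python
import hashlib


def match_key(record):
    # Deterministic stand-in for an inference service: normalize one attribute.
    return record["email"].strip().lower()


def global_customer_id(record):
    """Derive a stable global ID from the attributes used for matching."""
    digest = hashlib.sha256(match_key(record).encode("utf-8")).hexdigest()
    return "cust-" + digest[:16]


def publish(records, local_id_field):
    """Replace each domain's local ID with the global ID before publishing."""
    published = []
    for record in records:
        output = {k: v for k, v in record.items() if k != local_id_field}
        output["global_customer_id"] = global_customer_id(record)
        published.append(output)
    return published


# The same person, identified differently inside two domains.
sales_out = publish(
    [{"sales_cust_no": 42, "email": " Ada@Example.com ", "tier": "gold"}],
    local_id_field="sales_cust_no",
)
commerce_out = publish(
    [{"shopper_id": "s-9", "email": "ada@example.com", "basket": 3}],
    local_id_field="shopper_id",
)
```

Because the ID is assigned inside each domain at publishing time, consumers never need to know the local keys, and the matching rule can later be replaced by a learned model without changing the published field.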

Let's bring it together. What is data mesh? If I can say that in one breath, in one sentence: a decentralized architecture where your unit of architecture is a domain-driven data set that is treated as a product, owned by the domains or teams that most intimately know that data, either because they're creating it or because they're consuming and re-sharing it. We allocate specific roles that have the accountability and the responsibility to provide that data as a product, abstracting away complexity into a self-serve infrastructure layer so that we can create these products much more easily.

Real World Example

This is a real-world example from the health insurance domain. In the top corner, we see a domain. We call it Call Center Claims. It happens that these organizations have been running for 50 years, and they usually have some kind of legacy system. This is the Online Call Center application, which is a legacy system. The owners and writers of it are no longer with us. We had no other option but to run some change data capture as an input into a data product that we call Online Call Center, which runs within the domain of Online Call Center. It's not something different. What this data product does is provide daily snapshots of the call center claims, because that's the best representation of the data from that domain, given the data of that legacy system.

In the other corner of the organization, we have these brand new microservices handling the online claims information. The microservice is new, the developers are sharp, and they're constantly changing it, so they're providing the claims events as a stream of events. We bundle a data product within that domain, called Online Claims, that now gets data from the event stream for the claims and provides polyglot data output, essentially. One output is, similarly, the events that it's getting, with a bit of transformation. It unifies the IDs and a few different field formats that we agreed upon. Also, for data scientists, it provides [inaudible 00:39:54] files in some kind of data lake storage.

A lot of the downstream organizations don't want to deal with the duality of whether it's online or whether it's call center, so we created a new data product. We called it just the Claims data product. It consumes from the ports of the upstream data products, from Online and from Call Center, and aggregates them together as one unified stream. Obviously, it provides a stream. We still want to maintain the real-timeness of the online events.
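The aggregation into one unified stream can be sketched as a lazy merge of two time-ordered inputs. This is only an illustration under stated assumptions: the event shape, the `occurred_at` field (ISO 8601 timestamps in a single format, so string order matches time order), and the `source` tag are all hypothetical, and each input is assumed to arrive already sorted by time.

```python
import heapq
from typing import Iterable, Iterator


def tag(events: Iterable[dict], source: str) -> Iterator[dict]:
    """Label each event with the upstream data product it came from."""
    for event in events:
        yield {**event, "source": source}


def unified_claims(online: Iterable[dict],
                   call_center: Iterable[dict]) -> Iterator[dict]:
    """Lazily merge two time-ordered inputs into one time-ordered stream."""
    return heapq.merge(
        tag(online, "online"),
        tag(call_center, "call_center"),
        key=lambda event: event["occurred_at"],
    )


# Frequent events from the online port, daily ones from the call center port.
online_events = [
    {"claim_id": "o-1", "occurred_at": "2020-01-02T09:15:00Z"},
    {"claim_id": "o-2", "occurred_at": "2020-01-02T17:40:00Z"},
]
call_center_events = [
    {"claim_id": "c-1", "occurred_at": "2020-01-02T00:00:00Z"},
    {"claim_id": "c-2", "occurred_at": "2020-01-03T00:00:00Z"},
]

merged = list(unified_claims(online_events, call_center_events))
```

Because `heapq.merge` pulls from its inputs on demand, the same function works on unbounded iterators, which keeps the online events flowing without waiting for the slower source.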

The events that get generated for the legacy system are actually synthesized from the daily changes, so they're not as frequent. We also have the snapshots, so now we have the claims domain. We can play this game forever and ever. Let's continue. You've got the claims; on the other side of the organization you've got the members, the people who deal with registration of new members, changes of address, changes of marital status, and so on. They happen to provide member information right now as buckets of file-based information.
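Synthesizing events from daily changes, as mentioned above for the legacy claims, amounts to diffing consecutive snapshots. The sketch below assumes each snapshot is a dictionary keyed by claim ID; the event type names and the `as_of` field are hypothetical.

```python
def synthesize_events(previous: dict, current: dict, snapshot_date: str) -> list:
    """Diff two snapshots keyed by claim ID and emit one event per change."""
    events = []
    for claim_id in sorted(previous.keys() | current.keys()):
        before = previous.get(claim_id)
        after = current.get(claim_id)
        if before == after:
            continue  # unchanged claims produce no event
        if before is None:
            kind = "claim_created"
        elif after is None:
            kind = "claim_removed"
        else:
            kind = "claim_updated"
        events.append({"type": kind, "claim_id": claim_id,
                       "as_of": snapshot_date, "data": after})
    return events


monday = {"c-1": {"status": "open"}, "c-2": {"status": "open"}}
tuesday = {"c-1": {"status": "paid"}, "c-3": {"status": "open"}}

daily_events = synthesize_events(monday, tuesday, "2020-01-07")
```

The events can only be as fine-grained as the snapshots: several changes to one claim within a day collapse into a single update event, which is why this stream is less frequent than the online one.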

We had this wonderfully ambitious plan to use machine learning to aggregate information from claims, from members, and from a bunch of other upstream data products, and create a new data product that can provide the staff with information about members that need some kind of intervention for better health, fewer claims, and less cost for the insurance company. That downstream data product, the Member Interventions data product, actually runs a machine learning model as part of its pipeline. The ones you saw in the previous diagram are more native, closer-to-source data products. As we move towards this one, we move towards aggregated, new models that are consumer-oriented.

Look Inside a Data Product

One of the questions or puzzles for a lot of new clients is, "What is this data product? What does it look like? We can't really understand what it is," because we're kind of inverting the mental model. The mental model has always been upstream into the lake, and then the lake converted into downstream data. That is very much a pipeline model. A data product looks like this; it looks like a little bug. This is your unit of architecture. I have to say, this is the first incarnation of this. We've been building this now for a year, but, hopefully, you will take it away, make your own bug, and build a different model. This is what we're building. Every data product that I just showed, like Claims, Online Claims, and so on, has a bunch of input data ports that get configured to consume data from upstream streams, or file dumps, or CDC, or APIs, depending on how they're consuming the data from the upstream systems or upstream data products.

They have a bunch of polyglot output data ports. Again, these could be streams. This is the data that they're serving to the rest of the organization. It could be files, it could be SQL query interfaces, it could be APIs. It could be whatever makes sense for that domain as a representation of its data. There are two other lollipops here, which we call control ports. Essentially, every data product is responsible for two other things beyond just consuming data and providing data. The first one is to be able to describe itself. All of the lineage, the metadata, the addresses of the output ports that people care about, the schemas, everything comes from this endpoint. If there is a centralized discovery tool, it would call this endpoint to get the latest information. Because of GDPR, or CCPA, or some of the audit requirements that the governance teams in the organization usually have, we also provide an audit port.
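The shape described above (input data ports, polyglot output data ports, and the two control ports) can be sketched as a small data model. This is not the speaker's implementation: the class names, field names, and addresses are all hypothetical, and lineage is reduced here to the list of upstream addresses.

```python
from dataclasses import dataclass, field


@dataclass
class Port:
    name: str
    kind: str      # e.g. "stream", "files", "sql", "api", "cdc"
    address: str   # hypothetical URI of the port


@dataclass
class OutputPort(Port):
    schema: dict = field(default_factory=dict)


@dataclass
class DataProduct:
    name: str
    domain: str
    input_ports: list
    output_ports: list
    audit_log: list = field(default_factory=list)

    def describe(self) -> dict:
        """Control port 1: self-description for discovery tools."""
        return {
            "name": self.name,
            "domain": self.domain,
            "lineage": [port.address for port in self.input_ports],
            "outputs": [
                {"name": port.name, "kind": port.kind,
                 "address": port.address, "schema": port.schema}
                for port in self.output_ports
            ],
        }

    def record_access(self, consumer: str, port_name: str) -> None:
        """Remember who read from which output port."""
        self.audit_log.append({"consumer": consumer, "port": port_name})

    def audit(self) -> list:
        """Control port 2: a copy of the access history for governance."""
        return list(self.audit_log)


claims = DataProduct(
    name="claims",
    domain="claims",
    input_ports=[
        Port("online", "stream", "stream://online-claims/events"),
        Port("call-center", "files", "files://call-center-claims/daily"),
    ],
    output_ports=[
        OutputPort("claims-events", "stream", "stream://claims/events",
                   {"claim_id": "string", "occurred_at": "timestamp"}),
    ],
)
claims.record_access("member-interventions", "claims-events")
description = claims.describe()
```

A centralized discovery tool would only ever need `describe()`, and a governance team only `audit()`; neither has to know how the pipelines inside the product are built.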

If you think about the on-prem to cloud movement, if your upstream happens to be on-prem and your downstream happens to be in the cloud, the copying from on-prem to cloud happens in the port configuration. If you look inside, you will see a bunch of data pipelines that are really copying the data around, transforming it, snapshotting it, or doing whatever they need to do to provide it downstream. We also deploy a bunch of services or sidecars with each of these units to implement the APIs that I just talked about, the audit API and the self-description API.

As you can see, this is quite a hairy little bug. The microservices world is wonderful: you build a Docker image, and inside it has all the complexity of implementing its behavior. Here, every CI/CD pipeline, which is an independent CI/CD pipeline for every data product, actually deploys a bunch of different things. For example, on Azure, the input data ports are usually Azure Data Factory, which is like the data connectors to get the data in; the pipelines are Databricks Spark jobs; and the storage is ADLS. There's a whole bunch of things that need to be configured together as a data product.

Discoverability is a first-class concern. You can hit a RESTful endpoint to get the general description of each of these data products. You can hit an endpoint to get the schemas for each of the output ports that you care about, the documentation, and so on. Where's the lake? Where's the data warehouse? Where is it? It's not really in this diagram. I think the data warehouse as a concept, like having BigQuery for your big tables to run fast SQL queries, can be a node on the mesh. The lake as storage: you can still have consistent storage underneath all of these data products, but it's not a centralized piece of architecture any longer.
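The two discovery endpoints mentioned above could be routed roughly as follows. The talk does not give the actual paths or payloads, so `/description` and `/output-ports/<name>/schema`, along with the descriptor contents, are assumptions made for illustration; the function returns a status code and a JSON body rather than running a real HTTP server.

```python
import json

# Hypothetical descriptor for one data product.
DESCRIPTOR = {
    "name": "claims",
    "description": "Unified claims from online and call center sources.",
    "output_ports": {
        "claims-events": {
            "kind": "stream",
            "schema": {"claim_id": "string", "occurred_at": "timestamp"},
        },
        "claims-snapshots": {
            "kind": "files",
            "schema": {"claim_id": "string", "status": "string"},
        },
    },
}


def handle_get(path: str) -> tuple:
    """Route a GET path to a (status code, JSON body) pair."""
    parts = [part for part in path.split("/") if part]
    if parts == ["description"]:
        body = {
            "name": DESCRIPTOR["name"],
            "description": DESCRIPTOR["description"],
            "output_ports": sorted(DESCRIPTOR["output_ports"]),
        }
        return 200, json.dumps(body)
    if len(parts) == 3 and parts[0] == "output-ports" and parts[2] == "schema":
        port = DESCRIPTOR["output_ports"].get(parts[1])
        if port is not None:
            return 200, json.dumps(port["schema"])
    return 404, json.dumps({"error": "not found"})


status, body = handle_get("/output-ports/claims-events/schema")
```

Keeping the general description separate from the per-port schemas lets a catalog list many products cheaply and fetch a schema only when a consumer asks for a specific port.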

The paradigm shift we're talking about is from centralized ownership to decentralized ownership of the data, from monolithic architecture to distributed architecture, from thinking about pipelines as a first-class concern to domain data as a first-class concern, and from data being a byproduct of what we do, an exhaust of an existing system, to data being a product that we serve.

We do that by having, instead of siloed data engineers and ML engineers, cross-functional teams with data ownership, accountability, and responsibility. Every paradigm shift needs a language shift, a language change. Here are a few tips on how we can use a different language in our everyday conversation to change the way we imagine data, from this big divide between operational systems and analytical systems to one piece of architecture that is the fabric of our organizations.

Hopefully, now I've nudged you to question the status quo, to question the paradigm, a 50-year paradigm of centralized data architecture, and to get started implementing this decentralized one. This is a blog post that, to be honest, I wrote in a week when I got really frustrated and angry. I hope it helps.




Recorded at:

Feb 27, 2020