Data Pipelines & Data Mesh: Where We Are and What the Future Looks Like

Summary

Zhamak Dehghani, Tareq Abedrabbo and Jacek Laskowski discuss the current challenges of building modern data pipelines and applying Data Mesh in the real world, what the future looks like, and the tools involved.

Bio

Zhamak Dehghani is Director of Emerging Technologies @thoughtworks & Creator of the Data Mesh concept. Tareq Abedrabbo is Core Data Principal Engineer @CMCMarkets. Jacek Laskowski is an IT freelancer, Java Champion & Author of "The Internals Of".

About the conference

QCon Plus is a virtual conference for senior software engineers and architects that covers the trends, best practices, and solutions leveraged by the world's most innovative software organizations.

Transcript

Nardon: My name is Fabiane Nardon. The panelists are Zhamak, Tareq, and Jacek Laskowski, who is the author of several books on technologies used for creating data pipelines, and who creates data pipelines every day. Can I ask you to briefly introduce yourselves?

Dehghani: I'm Zhamak. I work with ThoughtWorks as the director of emerging technologies in North America. Myself and a group of us at ThoughtWorks came up with the idea of data mesh a few years back, and since then, that has taken over my life. That's all I do. I talk about data mesh and try to implement it with clients.

Abedrabbo: My name is Tareq. I'm a core data engineer at CMC Markets. My background is in software development. I've had an interest in data for the last few years, and the whole NoSQL and big data space. I used to work for different consultancies as well, and now I'm focusing more on working in a product company. We started on our data mesh journey around six months ago, so we're very excited about that.

Laskowski: Jacek Laskowski here from Poland. I've been with Spark for the past five-plus years as an independent consultant. Can't imagine any day without doing more Spark, more Delta Lake, and Apache Kafka, so all around data.

Nardon: Several of my Spark problems were solved with Jacek's books.

The Main Bottlenecks in Data Engineering Today

Nardon: Since this panel is about where we are and where we are going, everybody knows that data engineering has lots of bottlenecks right now. Data science is still not delivering on the promise it had in the beginning, because we haven't made all the investments we should have made in data engineering in the past. What are the main bottlenecks for data engineering that you see today? I know that data mesh tries to solve some of these problems, and Jacek, who builds data pipelines every day, probably knows what his problems are, the difficulties that developers have creating scalable data pipelines, and so on. What do you see as the biggest bottlenecks for the data space right now?

Dehghani: Of course, I'm biased towards the data mesh view, I have to say that. The image that I have in my head of where we are today and what our problems are is really a diagram, a graph that shows the inflection point that we're in. On the x-axis of that graph, you have this progression of scale and complexity. We're getting data from more sources, and the aspirations of the data science and models that we have go beyond a single dataset within a single domain, within a single organization. They require data from many different origins and places. Those aspirations are growing. We have more use cases. We have more sources. You have that complexity and scale growing on one hand. On the other hand, on the y-axis, you have the impact of that complexity. Are we becoming faster at responding to our data needs? Are we getting more value from our data? Are we resilient to the change of our organization? I see that, really, we are tapering off instead of going up and becoming more agile and getting value. We're really at that inflection point.

I think the challenge is that we are tapering off and flattening in getting results, or perhaps declining, and that's why we came up with this: why don't we challenge some of these assumptions? Some of those assumptions and challenges are based on the idea that we've got to bring data all together in one place, one lake, one warehouse, or one team. We've got to build all of these pipelines, so that we can get value. Immediately we've created a bottleneck: an organizational bottleneck, and a technology bottleneck. What if, the same way that we mesh APIs together, we could alternatively take our requests and queries and compute to where the data is, and create this new notion of the data as a product, so that we can really federate and decentralize this problem?

I think the other problem that we have is this separation of people who write the microservices and operations and know Kubernetes, and people who know the data. There's this bifurcation of people, technology, and skill set. What does it take to have a generalist? Jacek, you're an amazing specialist, and you know these tools. We are proud to call ourselves generalist developers, full stack developers, because we're expert enough that we can move around. What does it take to really elevate the data platform, or abstract some of its complexities, so every generalist can be using data? Can we put the data science in a box, so to speak, and abstract the complexity so that expert generalists can really address the needs of the data, rather than a group of specialist data engineers that we will never have enough of?

Tech Shortcomings in Creating Data Pipelines

Nardon: Right now, you need to be an expert on several tools to be able to deliver a data product. What do you see as the main difficulties for developers who have to create data pipelines? What is technology still not delivering? What do you see?

Laskowski: When I was invited to this panel, I was worried that all these questions were going to be very high level and architectural, and given my age, 40-plus, I still try to be very practical and very tool oriented. I've been focusing on Apache Spark for the past 5-plus years. A year ago, Databricks, the company behind Apache Spark, open sourced this great product called Delta Lake. Zhamak mentioned this architectural term, Data Lake, and I can see a proliferation of tools in this space now. There are Apache Software Foundation projects like Apache Hudi from Uber and Iceberg from Netflix, which are data formats for large datasets, and Delta Lake itself. We've got three tools that try to solve the very same problem: having a consistent schema across different versions of data shared across different teams.
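As a rough, editorial illustration of the schema-and-versions problem Jacek describes, here is a minimal PySpark sketch using Delta Lake's time travel; the table path and version number are made up, and it assumes the Delta Lake package is already on the Spark classpath.

    # Minimal sketch: Delta Lake keeps a transaction log per table, so different
    # teams can pin a job to a known snapshot of the data. Path and version are
    # hypothetical.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("delta-time-travel-sketch")
             .getOrCreate())

    # Appends are validated against the table's schema on write:
    # new_rows_df.write.format("delta").mode("append").save("/data/events")

    # Read the table as it looked at an earlier version ("time travel").
    events_v3 = (spark.read.format("delta")
                 .option("versionAsOf", 3)
                 .load("/data/events"))
    events_v3.show()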

There was this question about what our thoughts are on the conflicting requirements of data scientists and other teams. There is this conference, Data + AI Summit, organized by Databricks, the guys behind Spark and Delta Lake, and they just announced a new product to share large datasets in a more unified and streamlined way. Data scientists using different tools and data engineers can all benefit from all the data that was collected in this space called the Data Lake. I try to be practical, not just pragmatic, but definitely tool oriented. My world is as wide as Spark plus Delta Lake are, or as wide as they give me the opportunity to be.
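The product Jacek refers to here is the open Delta Sharing protocol (Zhamak names it later in the discussion). As a hedged sketch of how a consumer might read a shared table with the Python connector, assuming a provider-issued profile file and made-up share, schema, and table names:

    # Illustrative only: profile path and table coordinates are assumptions.
    import delta_sharing

    # The ".share" profile file comes from the data provider and contains the
    # sharing server endpoint plus a bearer token.
    profile = "/config/provider.share"

    # Tables are addressed as <profile>#<share>.<schema>.<table>.
    table_url = profile + "#sales_share.analytics.daily_orders"

    # A data scientist can pull the shared table straight into pandas...
    orders_pdf = delta_sharing.load_as_pandas(table_url)

    # ...while a data engineer can load the same table as a Spark DataFrame:
    # orders_df = delta_sharing.load_as_spark(table_url)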

I had a call with a customer, and I was surprised that, given that Spark has been on the market for 10 years, they're still on very early versions of Spark. I couldn't believe it, me thinking that Spark is all over the place, that everybody is using Spark, or at least has heard about it. Even after 5, 10 years, I still see teams that only quite recently have heard about Spark, the tool they needed. I couldn't believe that with all the years, with all the advances in how to manage large datasets, there are still people who don't even know about the tools they could use for the problems they have. There are these terms: machine learning, AI, data scientists, all these great terms. I keep hearing all these phrases, and I'm worried that the more I'm in this data pipeline world, the further I am from my customers, who haven't even started their data journey yet. I was worried that my experience might not match the expectations of my customers, and our audience. Usually, people need recipes for how to do stuff better using more efficient tools. Yes, I had this thought. That's my take on it.

Nardon: That's probably actually the problem, because right now, when you start your data journey, you have so many options of architectures and tools, and there's so much fragmentation, that it's really a challenge to choose your path. Finding people to actually guide you on this is also very hard. I don't know how the market in your country is. At least here in Brazil, it's very hard to find good, experienced data engineers, or data scientists, or senior people with experience in different technologies and architectures. There's also a problem in the market of finding these people.

Abedrabbo: Just very quickly on the main blockers point. I think it cannot be explained by the technology, because you can take any space in data, for example, and you will find multiple alternatives that are possible to adopt somehow. With the cloud, there's a large choice. It cannot be the technology itself. From my personal experience, what I saw in the organizations I worked with is that there is a focus on the technology, trying to find a technology-based solution to all data problems. There is also a lack of understanding that, and again I'm biased here, the data is central. The problem is more about the organization and the people around it; the data part is what is actually doable. I saw a few organizations, for example, who want to adopt data science, and then they get a data science team, but they have absolutely no data engineering capability to get the data to the data scientists.

In terms of having multiple choices and so on, this is a bit of a problem everywhere. There will always be some skills that are niche, more or less, and I think some of the data engineering aspects beyond the specificity of any tool, but things like designing a system that is resilient and fault tolerant and so on, will require some level of specialization. This works very well with a more squad based and iterative delivery model. We try to match both. We know that we cannot always have someone who's competent in all data engineering aspects on every single squad, but we can cover different things at different points in time. I think this will always be a struggle, similar to what we saw in DevOps, and cloud engineering, and so on.

The Gap between Experimentation and Prod

Nardon: Another aspect that maybe you can shed some light on is the huge gap between experimentation and production of data solutions. Jacek, for example, when you're creating a data pipeline, someone probably has to give you sample data to experiment and create with, and then you have to actually deploy this in production. Depending on how you do it, it's like doing everything all over again. It's very common today to experiment using Jupyter Notebooks, for example, but then you have to deliver this in production. How do you handle that? Are you going to implement or use an architecture to run the notebook in production? Also, thinking of a data mesh architecture, when you're creating your data pipelines, you have to find data, and your organization accumulates more data all the time. I see, Tareq, that you're using data discovery tools for that, but then data discovery has to work for experimentation and then for production, and there are different security concerns in both. This is a huge factor in how productive we can be in creating these products. How do you see that? How do you solve it? Is there a good solution? How do you see this problem of actually developing the product and then using it in production?

Laskowski: At the Data + AI Summit, there were products announced that could address the problems you mentioned, among them something they called Databricks Unity. It's a product they designed, or thought might be relevant to some data teams, where there is a data catalog, like a Hive catalog or a Hive metastore, where you can just query for all available datasets. There is also a brand new protocol developed by the people behind Apache Spark, who work for Databricks, together with other companies: an open source specification for sharing large datasets. It's something I haven't thought much about, because I'm more practical, in the sense that people tell me what to do, so I do it, rather than thinking the way architects might be doing all day long. While being busy with Spark, and seeing no end to me pushing Spark to teams, I see there is a need for data orchestration, data discovery, and all these high level concepts. It's only now that Databricks, the company behind all this large, complex data processing, has pushed these new products to the market. It looks like what you mentioned is a brand new space to explore and address with new tools. I'm yet to learn more from these tools about the problems people might have.
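For readers unfamiliar with the catalog idea Jacek mentions, here is a minimal sketch of browsing datasets registered in a Hive metastore-backed catalog from Spark; the database and table names are assumptions for illustration.

    # Minimal discovery sketch: list what's registered in the metastore and
    # inspect a candidate dataset's schema. "analytics" and "daily_orders" are
    # hypothetical names.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("catalog-discovery-sketch")
             .enableHiveSupport()
             .getOrCreate())

    for db in spark.catalog.listDatabases():
        print(db.name)

    for table in spark.catalog.listTables("analytics"):
        print(table.name, table.tableType)

    spark.sql("DESCRIBE TABLE analytics.daily_orders").show(truncate=False)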

It might also be that many of the teams I've been in touch with don't even know they have these problems, because no one told them. They are struggling with something you mentioned, like data discovery. It's like me, very often, when I'm doing something and I don't see that I'm repeating myself over and over, until someone comes and says, "Why are you repeating this and that? You can just automate it. Why don't you do this, or why don't you use that?" It looks like the problems you mentioned are new to me, but not to others. There are new products from Databricks that were just announced.

Dehghani: Unlike Jacek, who actually spends all his day on the keyboard writing, I spend my day concluding and thinking, really looking at: what are we doing? Someone asked, how do you respond to the doubters? I will just give you this quote from Lao Tzu: if you don't change direction, you may end up where you're heading. That speaks to the challenges we've had no matter what shiny tool we get. I'm actually a big fan of the Databricks folks, what they've done, and their contribution to open source. I think the tooling is a part of it. What I would say is, again, let's ponder what things we can fundamentally change.

You had two questions, Fabiane, in what you posed to us. One was around discoverability, the ability to find data. The other one was the gap between experimentation and what it takes to have production quality data. I think there are two fundamental assumptions that I would like us to rethink. One is the assumption that data is unintelligent: data is bits and bytes that we dump into a sink. We read from one sink, you have this intelligent pipeline code, and then we dump it into another sink. It's just dead bits and bytes on the disk or in the stream. Then what do you have to do to make that data discoverable, useful, understandable? You build all these fancy systems outside of that data to bring intelligence and usability to it. Let's invert that model. Why can't we define a new unit for data, a new concept, the data product, a new quantum, a new unit that encapsulates not only the data, but also that intelligence, the code that makes the data maintainable? Maybe that's the pipeline that goes with it, plus all the metadata that makes the data usable, the documentation. Yes, if you want to have a fancy tool on top that gets that information out and gives you an aggregate view so you can search across many data products, that's possible.
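To make the "data product as a unit" idea a bit more concrete, here is one possible, purely illustrative way to describe such a unit in code; the field names are assumptions, not a standard data mesh schema.

    # Illustrative only: a self-describing data product bundles the data's
    # location with the code that maintains it and the metadata that makes it
    # discoverable and usable.
    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class DataProduct:
        name: str                   # e.g. "orders.daily-summary"
        owner_domain: str           # domain team accountable for the product
        output_location: str        # where consumers read the data
        schema_ref: str             # pointer to the published schema
        pipeline_entrypoint: str    # the code (e.g. a Spark job) that maintains it
        documentation_url: str      # usability docs for consumers
        quality_checks: List[str] = field(default_factory=list)

    orders_product = DataProduct(
        name="orders.daily-summary",
        owner_domain="order-management",
        output_location="s3://warehouse/orders/daily_summary",
        schema_ref="schemas/orders/daily_summary.avsc",
        pipeline_entrypoint="jobs/build_daily_summary.py",
        documentation_url="https://wiki.example.com/data/orders-daily-summary",
        quality_checks=["no_null_order_ids", "row_count_within_bounds"],
    )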

I think we've got to shift our thinking from data as a byproduct, something that we dump somewhere and that somebody else needs to figure out how to actually make sense of, to data as a product. That means a very different unit of architecture from how we think about data today. It's not files. It's not tables. It's the data, plus the code that makes it usable and movable. That's one aspect of discoverability: the data inherently needs to include all of that information and those capabilities. Then the other aspect was around production quality, this experience of turning unusable data into usable products. If we now think about data and code as a new unit, then you can apply all of the CI/CD practices that we've had in the microservices and operational world to that unit itself. You can manage, verify, and test the quality of the data, as well as test the code that is managing that data, and that code very well might be a Spark job or a Spark pipeline. That's a new unit of architecture that can be verified, built, tested, deployed, really applying the CI/CD practices that we've been applying for decades.
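As a hedged sketch of applying CI/CD practices to such a unit, here is a pytest-style check that exercises a hypothetical pipeline transformation against a small sample dataset; the function and column names are made up for illustration.

    # Illustrative only: run in CI to verify both the pipeline code and basic
    # data quality expectations before a data product version is published.
    import pytest
    from pyspark.sql import SparkSession

    @pytest.fixture(scope="session")
    def spark():
        return (SparkSession.builder
                .master("local[1]")
                .appName("data-product-ci-tests")
                .getOrCreate())

    def build_daily_summary(orders_df):
        # Stand-in for the real transformation under test.
        return orders_df.groupBy("order_date").count()

    def test_daily_summary_has_no_null_dates_and_one_row_per_date(spark):
        sample = spark.createDataFrame(
            [("2022-03-01", 10.0), ("2022-03-01", 5.5), ("2022-03-02", 7.25)],
            ["order_date", "amount"],
        )
        result = build_daily_summary(sample)
        assert result.filter("order_date IS NULL").count() == 0
        assert result.count() == 2  # one row per distinct date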

Abedrabbo: Just to add an engineering perspective to what you said about the data as a product. As an engineering problem, it doesn't start and stop at just coding the jobs, part of the engineering problem is making it usable for the organization. This does include the development lifecycle, testability, deployment, all these things. In our case, again, you need to rely on a sound technical foundation, where you have CI/CD, automation, multiple environments, what have you. Then you have to work with the people to understand how that actually would work for them. What is the life cycle? Because different teams have different life cycles in terms of getting the data, doing something with it, and then building product. Actually, we include that as part of the engineering problem we are trying to solve. This is what engineering is, it's solving problems, and the problems are a tiny bit technical, most of the time organizational and people related as well.

Conflicting Requirements of Data Scientists and Data Mesh

Nardon: What are your thoughts on the seemingly conflicting requirements of data scientists and data mesh? Having all data very quickly versus having proper data products?

Dehghani: In fact, I disagree with that question. I feel that question comes, again, from an assumption at the back of your head when you asked it: what do data scientists really care about? I work with a lot of data scientists. I go to a lot of data science talks. At the end, regardless of all these beautiful models they generate, they say this disappointing sentence: we didn't have the right data to tune this and actually deploy it. What they care about is the ability to experiment, to go from a question, would this model answer this particular question, to finding the data that they can explore and discover, to very quickly learning where that data is, to actually trusting it, knowing that it has the quality and integrity required, and then being able to use it. Yes, they need many datasets of different kinds that can be stitched together.

Data scientists don't care if you physically put that data into one lake, and they shouldn't care. What they care about are those affordances, that they can easily discover and use data. The idea of the mesh, in fact, is that instead of waiting for some centralized data team to build the pipelines and deliver the data far away from its source of truth, where the data actually came from, through some pipelining later on, let's get the data as close to the source as possible. Give the power and autonomy to the source to share that data as quickly as possible, as close to the source as possible, but have the mesh interconnectivity so that when you run a search or explore function on your mesh, you can see all of the data products. The mere fact that something is a data product doesn't mean that it will take a long time. I don't know why that assumption was made. I'm curious about that. If you ask data scientists, they will also say, I spend 80%, 60%, whatever, doing data wrangling, doing data cleansing, changing the data. That wrangling and cleansing needs to be delegated and decentralized back to the sources of the data. The data should be really usable when they get it.

In fact, I disagree that because you have a mesh, you have conflicting requirements. You just have to see that requirement implemented in a different way, one that removes some of the challenges that data scientists have today. With the caveat that, in terms of technology and implementation, we're still a few years away from that beautiful target state. We have to work with our technology providers who have the tools that exist today, perhaps the Delta Sharing that Jacek was referring to, or Spark. These are very wonderful tools; they just have to be reconfigured into a different architecture.

The Data Mesh Architecture

Nardon: There's another question here around the architecture of the data mesh. Is there a case study that you can share of a client moving directly from siloed data to data mesh, without migrating to a centralized data lakehouse first?

Abedrabbo: This is my own personal understanding and interpretation. The way we interpret data mesh is that it's not a technology recipe that you just implement. It does not dictate what technology you use. It's an architectural and organizational approach that puts data at the center, and in that it shares some of the ethos of microservices, where we used to look at services as just monolithic applications, and then things became more decentralized and more collaborative. One thing we are lucky with in our context is that the backdrop is a transformation initiative. Because data mesh requires a change in mindset, a shift that is not just about the technology, you need this remit or background, if you want, to be able to implement something that goes beyond cosmetic changes to the technology and beyond just linking up some Data Lakes or silos. It would be great to link them up, but for us, it wouldn't realize the promise of the data mesh.

Again, I see it as an enabler, but as an enabler it requires some belief that there will be different and positive outcomes, and it requires an investment. For me, it's not something that will give you so-called low hanging fruit or immediate gains. In terms of the approach to implementing it, we're obviously trying to focus on identifying the big problems and the important things, and then turning this into multiple implementation initiatives or components or projects that actually add up to something coherent. This was the idea about the data mesh itself as a product. It's a target state, and it's a journey, and you have to imagine the concrete steps on the journey. It's probably a journey where you never fully get there. It's a continuous refinement, basically.

Dehghani: Something about that question makes me a little bit suspicious, so I'm going to try to double click into it. I don't have a case study where people didn't have Data Lakes or didn't have a data warehouse. The reason is that the people who are now going to data mesh did have those. They were at the forefront of technology and data-driven transformation, so they invested in the warehouse, invested in the lake, and they have pain points, so now they are trying something new. If your organization doesn't have any of those, or maybe it has them but in silos, then I question, does your organization have a data-driven strategy?

That's the question at the back of my mind: if you don't have that data-driven strategy or those aspirations, if your CEO doesn't envision certain functions of your business changing using ML, then I would be suspicious of using data mesh. However, you might say, no, we do, but for some other reason that I can't think of right now we haven't invested in those approaches. If you have the engineering capacity, if you want to bring data experts like Jacek into your organization to actually build that technology for you, then there is no problem going back to the source rather than relying on some interim lake or warehouse. In fact, in a way, it might be easier for you because you don't have a legacy to deal with. You can go to the source and start creating those data products as the point of sharing your data.

Data Roles at Companies

Nardon: There are a few questions around the roles in the company for working with data. What's the role of a data owner? There's also a discussion of full stack developers. How do companies organize themselves in terms of roles to implement data mesh? Jacek, in the companies you consult with, how are they organized? What works best?

Laskowski: I'm going to disappoint you with my answer, because from my practical point of view, people are struggling with very basic concepts, and hearing all these high level concepts, I'm not really surprised, but I'm astonished by how well we have named all these problems, yet we've got almost no solutions, or no easy solutions. My clients usually say, we hired you because we've got problems, whatever you call them, so solve them. I say, I've got only a hammer called Spark; if you need Spark, I can guide you on how to use it. If you need some architectural design, or some business transformation, that's definitely not my area, and I can't help you. Hearing all these nice concepts, observability, discoverability, all these abilities, made me wonder how little I know about the problems of my clients, focusing solely on Spark and Delta Lake, and thinking I am in this data crowd already. It's amazing. I thought I knew a lot, yet given all these great speeches and answers, I know so little. I'm wondering what I'm doing here. I can't help you, unless you ask something about Spark.

Nardon: I think you represent the present very well, of course. Data mesh is still missing lots of tooling, and even some technologies, to make it easier and real. It's good to have a present-day perspective, so we know that we still have lots of things to address.

How to Organize Teams to Put Data Mesh in Production

Nardon: Explain how the teams are organized to put data mesh in production.

Abedrabbo: I'm happy to give a brief overview of what we are doing. One part is the core data team, and it's important to mention what we are not. We are not the data police. We are not the data implementers of every single thing in the organization. We are enablers, because many of the data mesh concerns are quite crosscutting and require bridging gaps, and also investing in some core capabilities like the low-latency bridge, or data discovery, and what have you. Some of the other roles already exist in the organization, but I think they need reframing from a data mesh perspective. For example, as the owner of a dataset today, how do you become a data owner? What are the responsibilities around understanding what data to share and what metadata to advertise, and so on? Obviously, there are crosscutting discussions about finding a common language, so humans don't have to just go around and wonder what to do. Also, from a self-service perspective, there is the responsibility of data consumers, when they discover what data they need, where it is, and its characteristics, to consume it and adapt it to their requirements. A lot of it is there; it just requires framing and pointing in the right direction. It's a long journey. I'm not saying that this is where everything is now. This is what we're trying to head towards.

 


Recorded at:

Mar 25, 2022
