Data Mesh in Action: A Journey From Ideation to Implementation

49:52

Summary

Anurag Kale discusses the transition from centralized data bottlenecks to a decentralized Data Mesh architecture at Horse Powertrain. He explains the four pillars - domain ownership, data as a product, self-serve platforms, and federated governance - to empower autonomous teams. Learn how to apply DDD and platform engineering to scale analytical value and align data strategy with business goals.

Bio

Anurag Kale is a seasoned Cloud and Data Architect with more than 11 years of experience driving digital transformation across multiple industries. Currently serving as Cloud and Data Architect at Horse Powertrain, he spearheads the design of a hybrid Data Mesh. Anurag is an AWS Data Hero, recognized for his contributions to the cloud and data community.

About the conference

The InfoQ Dev Summit Munich software development conference focuses on the critical software challenges senior dev teams face today. Gain valuable real-world technical insights from 20+ senior software developers, connect with speakers and peers, and enjoy social events.

Transcript

Anurag Kale: I would just like to quickly understand, how many of you here are data professionals? You're working with data somehow? Software engineers? Architects? Product people? Today I'm going to be talking about data mesh. What exactly is data mesh? I'm going to walk you through the journey that we have been through at my current organization, Horse Powertrain. Just to give you a little bit of background, Horse Powertrain is an engine and transmission manufacturing company. We were earlier part of Volvo Cars in Sweden, but we took the engine division and the transmission division and spun them off into a separate company. You will see why this is important and relevant during the talk, because for services we had relied on from our mothership, one of them being data analytics, we were cut off abruptly and had to develop something of our own.

Who am I? Why should you listen to me? My name is Anurag. I'm what we call an AWS Data Hero. Ironically, I'm going to talk about Azure today. Sometimes you have to play the cards you've been dealt. I have been a public speaker for a few years. My biggest achievement has been speaking at re:Invent in 2023, and I take that as a badge of honor. I've been working in the industry for about 11 years, and I have worked across the software value chain. What I mean by that is I've been a developer, an ops person, and a tester, thanks to my unique ability to jump between companies and do whatever the company asks me to do. Is it good or bad? I'll let you decide. I work as a cloud and data architect at Horse Powertrain, specifically for the Swedish side of the business. I'm based in Gothenburg, Sweden, where Volvo Cars is headquartered, along with Volvo Trucks and Volvo Buses.

Data Teams

Let's take a moment to talk about data teams, specifically data teams in big organizations and enterprises. I hope that you will agree with the picture that I'm going to show you. What typically happens is that these data teams' main job is to produce reports or dashboards, or, if you're lucky and your organization is a step ahead, maybe some machine learning and AI on top of that. To produce these reports, you typically have a data warehouse or a data lakehouse backing them up. To run this data warehouse or lakehouse, you have a collection of centralized teams with data engineers, data architects, ETL developers and whatnot, maybe even analysts, and they try to work as a team.

If there is a request for some kind of report, dashboard, or ML use case, they will try to figure out what application the data comes from, talk to that application team, figure out how to get the data in, model it, and then hopefully produce a report that has some meaningful value. Everything is well and good while there are only a few applications to work with, but as time goes on, the list of applications you need to fetch data from to support these business use cases keeps growing. In an enterprise, we are talking about hundreds of different small products, small systems, third-party applications, SaaS tools, Excel sheets, CSVs, and whatnot. It becomes messy.

The reports, dashboards, and machine learning outcomes are the main products. Yet I'm curious how many of you agree that instead of spending the majority of their time on the reports themselves, most people spend the majority of their time building, maintaining, and operating ETL pipelines, because these ETL or ELT pipelines, depending on your use case, are extremely brittle and ops heavy.

What I mean by brittle is that any small change in your source data, if you rename a column, misplace a hyphen or an underscore, brings your whole ETL pipeline down, and you have to figure out what went wrong. It's very susceptible to change, and it saps a lot of energy and time from this particular team. The irony of all of this is that we were once the heroes producing the reports the business used, and after a while, when you are supporting a lot of applications, you become the villains. There is a famous line in Batman: you either die a hero or you live long enough to see yourself become the villain. That's what happens to most centralized data teams out there.

Another irony is that the data belongs to the application, but these teams are not able to freely adjust it. They are not able to make changes or do as they will, even though the data belongs to them in the first place. Meanwhile, it's a bunch of 5, maybe 10 people at most taking care of all of these applications. It's a problem, probably. If not, you will see why it is a problem in a few minutes. There is a methodology out there in the market, an architectural style, that can help us solve some of these problems. I say some because it all depends on how you implement and interpret it. In comes the superhero, in comes data mesh. Data mesh basically says: all of these problems that you have at hand, I can help you solve. You can only solve the problems that you acknowledge. That's the first problem we need to start with.

Like we said, this team setup is challenging but not inherently bad. In theory you can find really senior engineers, have really good practices and good communication, and hopefully you don't have those ETL breaks and your team runs like well-oiled machinery. In practice, that's far from reality. You will see this picture a lot during this talk, and I'm going to challenge it from two different perspectives. The first perspective is that it ignores agile principles. What I mean by that is that there are four agile principles.

The first principle basically says that if you want to deliver software that really creates value, you should favor individuals and interactions over processes and tools. We do the exact opposite: when you're dealing with a central data team, it's all about tools and processes. Without documentation you cannot go anywhere, so you have to spend an inordinate amount of time writing documentation for the data platform so that you can do good data governance. It also violates customer collaboration: the customer is left waiting for the data, so you are not customer friendly. In the end you become a bottleneck, and you are not responding to change fast.

The second set of principles it violates is the DevOps mindset. From a DevOps perspective, this model kills autonomy. If I'm a data producer, I want to be autonomous and make my own decisions, and I cannot do that in this model. It creates impedance: I have to wait for you to be available, to have time to build things that are important for me, so that's not ideal either. There is always a large backlog; you have to wait for services from this data team. And the data ownership lines are really blurred: I own the data, but I can never fully own the data. That's a problem for me. Most of this can be fixed if we bring in a software term that is very popular in the microservices world, and that is decoupling. That is what data mesh is all about. It is about achieving a decoupled data analytics practice that allows teams to run freely while making sure you stay compliant.

Outline

What are we going to cover today? I'm going to cover what exactly data mesh is, with some examples. I'm going to tell you how it helps the data practice thrive. I will show you a reference implementation of how we have done data mesh at Horse Powertrain. Then, what's needed to implement it. You'll be surprised: it's not always technology. Finally, if you have the same pain points I've just described and you want to implement data mesh, what are some of the concepts you can take away so that you can sell it to the business? Because you are senior developers, and half of your job most of the time is selling a solution to your business counterparts.

What is Data Mesh?

What is data mesh? Let's start with a formal definition. Zhamak Dehghani published a book on data mesh 3 or 4 years ago, and this is the definition that I like very much. It basically says that data mesh is a decentralized, sociotechnical approach to bringing analytical data value in complex environments and complex organizations. The key term here is sociotechnical. Sociotechnical refers to how people and technologies interact with each other. How do they interconnect? How do they work with each other? As engineers, we are really comfortable with code and tools. If I throw a new tool at you, you will be familiar with it in a few weeks at most. Bringing in the communication side of things, that's the hardest part.

The Four Pillars of Data Mesh

There are four key pillars of data mesh for achieving the decoupling we just talked about. The first pillar is domain ownership. Domain ownership basically means that if the data is produced by a system, that system should be the owner of that data. They should have full control and full say over what that data is, how it is used, how it is exposed, and whatnot. Right now, that doesn't happen. The second pillar is that you need to stop treating data as a byproduct and start looking at data as a product. It is an outcome, not an output.

The third pillar is that, to bring this harmony, you use self-serve data platforms so that the data generated in our ecosystem is freely available, accessible, and consumable by whoever wants to use it, given they have the right permissions and access. The fourth pillar is federated governance, which basically means that the governance and control of the data should happen where the data resides. We are going to look at each of these pillars individually, and I will try to give you an example of how we have interpreted each pillar in our organization so that you have a reference use case to take with you when you leave.

Pillar 1: Domain Ownership

Let's talk about the first pillar, domain ownership. What does this mean? Right now, in this setup, the ETL or ELT is owned by the central team. Let's take a shift-left approach and say that we enable the application team, which hopefully is a cross-functional product team in your case, enough that we can draw a line and say: anything to the left of this line is your responsibility; anything on the right is the responsibility of the data analytics team. This requires you to move or bring some competences into these individual product teams so that they can expose the data in the structures they want. This is where domain ownership starts. The biggest challenge is figuring out where to draw that line. As an architect, it is very easy to just say, let's replicate the organization structure.

In the organization structure, if I have a team delivering some part of the product, let's draw the line there and let each team take ownership of their particular data. That's the easy solution to reach for. But as Conway's Law reminds us, simply replicating the communication structures in your technical systems is not very optimal. So how do you find that tipping point? How do you find that particular line? How many teams should shift left? How many products should keep operating as they are? What I typically like to do in this case is use a concept called domain-driven design. Domain-driven design is a big discipline that architects use to figure out how the business talks about its value chains. I'm selling engines. When you sell engines, how are the teams organized around selling them? You try to understand that and capture it in a diagram that is helpful. We'll talk about that diagram.

In manufacturing, just to give you a little bit of context, this is what a typical product lifecycle looks like for the actual product, the engine. It goes from concept to reality. You start with design, you do the design in CAD, and you do some verification. Then you do the production of the engines. Once the engine is sold, there is also an aftermarket, which is basically selling spare parts and whatnot. To support all of this, you use a variety of software, from a CAD system to a defect management system, and these are all collecting data in some way or another. If I want answers across the full value chain, I need to collect the data from all of these systems. As you see, these are COTS-heavy environments, which basically means that I have a lot of commercial off-the-shelf products serving some of the needs. There are a lot of vendors, there is lock-in, and vendors don't want to let data out of their commercial systems either. It's a lot of chaos.

To find that particular line, like I said, domain-driven design is one of the key tools you can use. A specific part of domain-driven design that you can start with is called a context map. In a context map, you talk to your business; it's a fairly elaborate process, and you will find a lot of documentation around it. At the end of this context-map exercise, you will be able to build a diagram something like this. Let's say that in my ecosystem the dominant system is the manufacturing execution system. That manufacturing execution system is feeding data into my ERP, which basically means the ERP has a conformist kind of relationship with the manufacturing execution system.

Similarly, you can have things like identity and access management, auditing, and whatnot. After this, you will figure out that, ok, maybe the manufacturing execution system and all the systems that support it can be bundled together as one team, one cohesive deployment of data, because they are often represented together. This diagram only shows three; at the end of this exercise you will have maybe 10, 15, 20, who knows, depending on how big your ecosystem is. Each of these circles represents that shift-left boundary.

In that case, you have aligned your data delivery with the business capability, with the business value being delivered by that use case. That is where you draw the shift-left line. To build this, we started using Azure Databricks, and this is where we drew the line. To keep things consistent, we identified the three use cases I've shown you on the screen and said: these are the business capabilities. Behind each business capability, there may be several applications coming together to provide the data for that kind of business decision making. This gives you a rough idea of where to draw the line. It's a bit of trial and error. Look up domain-driven design and you will find plenty of examples on the internet that will help you go deeper.
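
The bundling step described above, grouping systems that feed each other into one candidate domain, can be sketched as a connected-components pass over the context map. This is a minimal sketch in Python; the system names and relationships are invented for illustration, not the actual Horse Powertrain context map:

```python
from collections import defaultdict

# Hypothetical context-map entries: (upstream system, relationship, downstream system).
relationships = [
    ("MES", "conformist", "ERP"),
    ("MES", "customer-supplier", "QualityTracking"),
    ("CAD", "shared-kernel", "Simulation"),
]

def group_into_domains(relationships):
    """Group tightly coupled systems into candidate data domains
    (connected components of the context map)."""
    graph = defaultdict(set)
    for upstream, _, downstream in relationships:
        graph[upstream].add(downstream)
        graph[downstream].add(upstream)
    seen, domains = set(), []
    for system in graph:
        if system in seen:
            continue
        stack, component = [system], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        domains.append(component)
    return domains

print(group_into_domains(relationships))
# two candidate domains: MES + ERP + QualityTracking, and CAD + Simulation
```

Each resulting component is a candidate for one workspace, one "circle" in the diagram; in practice you would still sanity-check the grouping with the business rather than trust the graph alone.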

Pillar 2: Self-Serve Data Platform

The next pillar we are going to talk about is the self-serve data platform. This diagram is awesome. The business capability team now just wants to use Databricks. These people are CAD experts, testing experts, engine optimization experts; they understand the data that is there for consumption, like how much air-fuel mixture there was and what output it produced, but they are not Databricks experts or data platform experts. You need to make this data platform really easy for them so that they can self-serve on top of it. Self-serve basically means you don't have to ask somebody from the data team: can I get access, can I do this kind of analysis? You should not be bottlenecked by them. You should, in theory, be able to just go in and request access in an automated fashion most of the time, and then start working in your area. What does a platform mean? There are multiple definitions of a platform, and this is the one I will use for the rest of the talk.

Basically, when I'm doing platform engineering, I am building two decoupled lifecycles. The first lifecycle is oriented towards infrastructure: reusable patterns and systems that deliver capabilities required by more than one team in the organization. These are designed as products, with well-defined interfaces; you can interact with them like any other product out in the market. Then, on top of these services, you deploy your applications or do your data work. These are two decoupled systems. The underlying system is what I call the platform, and anything you deploy on it is value delivery on top of that platform.

How did we implement this platform at Horse Powertrain? We started with Azure Databricks, as my image showed. In Azure Databricks, in order to create this mesh of things, there is a hierarchy to how Azure sets this up. The biggest abstraction we have in Azure is an Azure tenant. Inside a tenant, you have a Databricks account, which spans your entire Azure tenant. Within the Databricks account, you have regions; Azure has multiple regions, and for each region Databricks deploys something called a metastore. This is their mechanism for collecting metadata about the data you store in their application. Then inside a region you can have subscriptions, which are logical separations of workloads, and inside a subscription you can have your workspaces.

A workspace is a logically separated user space in which you are the controller of whatever data you put there. Unless you specify that you want to share data with someone, it will not be available to other workspaces. You can have multiple workspaces per subscription per region. You will see where I'm going with this in just a moment. In this setup, the tenant, the Databricks account, and the region are managed by the data and platform team. You, as the product team or the business capability team we identified, get a workspace that is fully dedicated to your use case, and you have admin access in there.

How do you request these workspaces? We did this with a chain of GitHub Actions and a combination of infrastructure-as-code tools, Terraform and Bicep. There is a set of platform or admin repos managed by the platform team: a dedicated repo for requesting a new subscription, a dedicated repo for requesting virtual networks, and repos for the things Databricks needs to work, like setting up endpoints if you want to use serverless compute. We also bake certain strategic roles into each workspace.

Once this is done, all you need to do is make a pull request, and you will get a working Databricks workspace in 10 or 15 minutes at most. How do we distribute that? We take these repos and bundle them together using a CLI application called Copier. Developers can run this Copier CLI on their own computers, and inside it they get all of the available templates: one for requesting a subscription, one for a VNet, one for a Databricks workspace, and you choose what you want. You make a local copy of that template, which also includes fully configured GitHub Actions. Then you set some variables. We do basic linting and basic checks, and there are CI checks as well.

Then you make a pull request, and it gets merged into the main repository. Sometimes that is the platform repository, sometimes it's your own repository. The moment you have passed the tests, you get a working workspace out. None of the developers on the platform team are involved at this point. This is well tested, and it only takes 10 to 15 minutes to request a new workspace.
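
To make the flow concrete, here is a minimal sketch of what a merge-triggered provisioning workflow could look like as a GitHub Actions file. The repo layout, paths, and secret names are assumptions for illustration, not Horse Powertrain's actual setup:

```yaml
# Hypothetical workflow; repo layout, paths, and secret names are assumptions.
name: provision-databricks-workspace

on:
  push:
    branches: [main]          # i.e. after the pull request is merged
    paths: ["workspaces/**"]  # only run when a workspace request changes

jobs:
  provision:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - name: Apply the requested workspace
        working-directory: workspaces
        env:
          ARM_CLIENT_ID: ${{ secrets.ARM_CLIENT_ID }}
          ARM_CLIENT_SECRET: ${{ secrets.ARM_CLIENT_SECRET }}
          ARM_TENANT_ID: ${{ secrets.ARM_TENANT_ID }}
          ARM_SUBSCRIPTION_ID: ${{ secrets.ARM_SUBSCRIPTION_ID }}
        run: |
          terraform init
          terraform apply -auto-approve
```

The point of the design is that the only human action is the pull request review; everything after the merge runs unattended.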

Now let's re-visualize. At the beginning, I said I have multiple different applications. These are business delivery areas. Each of these business delivery areas now has a dedicated workspace. In this particular workspace, they are the admins. They have full control of how they can bring the data in, and how they can run this particular data, and what kind of processing they want to do.

At the same time, they all report to the central Databricks account. There is the metastore, which collects metadata about all the workspaces and all the tables in them, just the metadata. Unless a workspace grants you access to a particular asset, nobody has access to it. Doesn't it look similar to microservices? As a product team, I have a fully functional data space where I have full control and can try and test different things. Databricks itself provides a lot of capabilities I can use, from AI and BI to machine learning. We have decoupled these two lifecycles: infrastructure is taken care of by the platform team, and you take care of the data lifecycle.

At the end of these two pillars, I hope we are a little more agile. Let's do a quick check. Are we focusing on individuals and interactions? We are, because you have full control of the data and its lifecycle. Are we producing useful data assets? We are. You are not waiting for the data team to free up resources to build the report or the AI bot you are hoping for; you are in full control. So we are a little more agile. What we have not solved is contracts: if you are producing data, your customer is expecting certain data, and you make breaking changes that propagate, how do we fix that? You need some plan in place from the platform team. The response to change is not that fast yet. Let's see if we can fix this with the remaining two pillars.

Pillar 3: Data as a Product

The third pillar that is going to help us with this is treating data as a product. What do I mean by that? A data product is a combination of data, metadata, and semantics. I have a table: what are the columns? What does each column mean? How do I interpret it? That's semantics. When you combine all of this, that's a data product. A data product is curated and trustworthy. I can trust that the CAD people giving me the data product have done all the quality checks, because they are the domain experts for that data. It is packaged, discoverable, and governed.

Finally, it serves a specific purpose. To enable this, we use something called Databricks Asset Bundles. An asset bundle is essentially a YAML-based project collection that you work with in your local environment. You create a project bundle with a specific structure, and the data pipelines you are producing for your reports can be defined declaratively. You can specify how they are orchestrated in Databricks' ecosystem. They can be version controlled: all you have to do is commit them, and through CI/CD they can be deployed into multiple workspaces. You can have dev, stage, and production workspaces. You can create a new pipeline on your local computer, do the testing and sanity checks, put it in version control, run CI/CD, and within minutes you have it inside your workspace. You get rapid feedback, and that's what we are after.
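
As a sketch, a minimal `databricks.yml` for such a bundle might look like the following; the bundle name, workspace hosts, and job layout are placeholders, not the configuration from the talk:

```yaml
# Hypothetical asset bundle definition; names and hosts are placeholders.
bundle:
  name: quality_reports

targets:
  dev:
    mode: development
    workspace:
      host: https://adb-1111111111111111.11.azuredatabricks.net
  prod:
    mode: production
    workspace:
      host: https://adb-2222222222222222.22.azuredatabricks.net

resources:
  jobs:
    build_quality_report:
      name: build_quality_report
      tasks:
        - task_key: transform
          notebook_task:
            notebook_path: ./notebooks/transform
```

Deploying to a target is then a single CLI call, for example `databricks bundle deploy -t dev`, which is the kind of step a CI/CD pipeline runs after the commit.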

One good thing about this entire setup is that it comes with a built-in notion of data contracts. These are analogous to the API contracts you are used to: how you do a POST, how you make those requests. The guarantees you have for APIs, you have similar semantics available inside the project. The moment a change is committed, you have all of the information and can easily get metadata out of the system. You can also add CI checks: if the data you are committing does not have a definition for a column, the check fails. You need to describe each column. What does user_ID mean? What kind of user ID is it? You need at least a one-line description of that user ID in this table. You can enforce those checks in CI/CD, so you always get enriched data inside your data platform.
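
The description check can be as small as a script run in CI. Here is a sketch in Python, using a simplified stand-in for the table metadata format; a real check would read the asset bundle or catalog metadata instead:

```python
# Sketch of the CI check described above: fail the pipeline if any column
# in a table definition lacks a human-readable description.
def check_column_descriptions(table):
    """Return the names of columns that are missing a description."""
    return [
        col["name"]
        for col in table["columns"]
        if not col.get("description", "").strip()
    ]

table = {
    "name": "cad_users",
    "columns": [
        {"name": "cad_user_id", "description": "User ID in the CAD system"},
        {"name": "user_ID", "description": ""},  # would fail the check
    ],
}

missing = check_column_descriptions(table)
assert missing == ["user_ID"], f"Columns missing descriptions: {missing}"
```

In CI, a nonempty result would simply exit nonzero and block the merge, so undocumented columns never reach the platform.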

Next, how do you build these data products? There are multiple levels at which you can build them; we chose three levels at Horse. The first level is where the data is in third normal form, which is typically how the data looks in the operational systems. It is replicated as-is. This is useful for workspaces with machine learning use cases, where you want to build a feature store or do vector embeddings for RAG and whatnot.

The second level is what we call technical data products. When you bring data into this second layer, you have cleaned and denormalized it, and all the attribute names are resolved: user_ID is changed to something like CAD system user ID or MES system user ID, so that I understand a little better what it is. If your use case requires BI, you can also remodel it into facts and dimensions in this layer.

The third level is business data products. This is where you can bring together data from either of the lower two levels, but the result should be a flat table that you can plug a BI tool into, and that even a business user can interpret. This is only possible because, throughout this value chain, I have made sure through CI checks, asset bundles, and other mechanisms that every column has a description and a definition. With that, a business user can easily ask: price point, which price point is it? And the description says whether it is the manufacturing price point or the aftermarket price point. You have fully qualified names, which also enables self-service BI, a big buzzword in the market today.
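
The three levels can be illustrated with plain Python rows; the table names, columns, and joins below are invented for illustration, not the actual Horse Powertrain schema:

```python
# Level 1: replicated as-is from the operational system (third normal form).
mes_users = [{"user_ID": 7, "site_id": 2}]
mes_sites = [{"site_id": 2, "site_name": "Gothenburg"}]

# Level 2: technical data product - cleaned, denormalized, names resolved
# (user_ID becomes mes_user_id, the site lookup is folded in).
def build_technical_product(users, sites):
    site_names = {s["site_id"]: s["site_name"] for s in sites}
    return [
        {"mes_user_id": u["user_ID"], "site_name": site_names[u["site_id"]]}
        for u in users
    ]

# Level 3: business data product - one flat table a BI tool can consume.
def build_business_product(technical_rows, defects_per_user):
    return [
        {**row, "open_defects": defects_per_user.get(row["mes_user_id"], 0)}
        for row in technical_rows
    ]

technical = build_technical_product(mes_users, mes_sites)
business = build_business_product(technical, {7: 3})
print(business)
# [{'mes_user_id': 7, 'site_name': 'Gothenburg', 'open_defects': 3}]
```

In a Databricks pipeline these steps would be SQL or DataFrame transformations across catalogs; the point here is only the shape of each level, from normalized source rows to a flat, self-describing business table.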

Pillar 4: Federated Data Governance

The final pillar: I've talked about all of these, but what about control? What about making sure that certain data is not given out to just anybody? Sensitive financial details, for example: apart from the CFO, nobody should be able to see them. If you remember, we baked certain rules into every workspace earlier. As the admin of the workspace, I have full control over who gets access. How does it happen? Each business capability has full control of its workspace, and you get Unity Catalog built into Databricks. In Unity Catalog, as you can see here, there are details, lineage, and quality, and there is also a permissions property where, for each table, you can define which role or which user has access. You can create a business reader group in your workspace, put all your analysts in it, and say the analysts only have access to these three tables.

All of that data governance, that control, is in your hands. I'm enabling you as a user; you don't have to depend on the central team to figure out how to share access with your counterparts. With that, governance is also solved. If you look at the idea of decoupling from before and ask whether we are a little more agile, you will see that we now also fulfill the last two criteria. My customers, the business users, can request the data easily, and technical users can as well.
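
In practice this boils down to Unity Catalog grants issued by the workspace admin. A sketch follows; the catalog, schema, table, and group names are placeholders, not the real objects from the talk:

```sql
-- Hypothetical grants: the workspace admin, not a central team, decides
-- who sees what. `business_reader` is an illustrative group name.
GRANT USE CATALOG ON CATALOG manufacturing TO `business_reader`;
GRANT USE SCHEMA ON SCHEMA manufacturing.quality TO `business_reader`;
GRANT SELECT ON TABLE manufacturing.quality.defect_summary TO `business_reader`;
```

Because privileges compose hierarchically, the group can read exactly the granted table and nothing else in the catalog.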

I can create a specific role for a technical user, and for a business user as well. I can also respond to changes over time. Because we have data contracts, you can do contract versioning: you can build version one of a data pipeline and then build version two without introducing breaking changes. Again, it's just bringing some software experience into your data space.
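
Contract versioning can be sketched as serving two schema versions side by side, where the new version only adds columns so existing consumers keep working. The product and column names here are invented for illustration:

```python
# Hypothetical contract registry: version 2 adds a column, non-breaking.
CONTRACTS = {
    "invoice_summary": {
        1: ["invoice_id", "amount"],
        2: ["invoice_id", "amount", "currency"],  # additive change only
    }
}

def select_for_contract(rows, product, version):
    """Project rows down to exactly the columns a contract version promises."""
    columns = CONTRACTS[product][version]
    return [{c: row[c] for c in columns} for row in rows]

rows = [{"invoice_id": 1, "amount": 99.0, "currency": "EUR"}]

# A v1 consumer never sees the new column; a v2 consumer does.
assert select_for_contract(rows, "invoice_summary", 1) == [
    {"invoice_id": 1, "amount": 99.0}
]
```

The design choice mirrors API versioning: producers publish both views until every consumer has migrated, then retire version one.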

What Does the Data Analytics Team Do?

We have shifted left quite a lot, and we have now asked the workspace owners, the business owners, to do a lot of things. So what does the data analytics team do? What are their roles and functions? Their main job is to maintain and upgrade the platform, and to support products. Here is one way we do it. Let's say a business initiative has started, with a timeline running from left to right. We do a division of labor between the local team, which is the product team, and the data analytics team. The division of labor starts with asking: does this particular use case have a business initiative behind it?

We work together and ask the why behind this initiative. We dig deep into understanding what they are trying to build. You look at business value, you look at budget. Do you have local competence and resources to make sure this project happens? Those are some of the questions we ask the business teams to answer. Once they have shown that those things are in place, we start what we call an incubation phase, which is similar to what YC and startup incubators do. We inject one or two people, depending on the use case, into the product team. We work with them for one or two sprints, depending on how complex the use case is, build a POC, and train the people on using the platform.

Onboard them, show the tools, show the templates. Walk them around and show a proof of value. Once they have gotten used to this particular phase, we help them productionize this use case. This is where then we start to take a step back. We say, now that you have productionized, you have shown the value, you need to take a bigger role in that.

At the top, you will see that there is color coding that tells you which phase we are in. Hopefully, towards the end, a data product is born. Depending on the use case, it's either a level one, level two, or level three data product. That data product, since it is baked in with all of the goodness of CI/CD, the checks and all of those, is an independent, scalable, and repeatable product. We say, product teams, that's your responsibility now. They can always come back to us and ask for support, guidance, or new tools.

Then, ultimately, because of the presence of the data catalog, we can build a marketplace. Take the dataset exposed by finance, which is all of the invoices that have been processed. I have the metadata for those invoices available in a single catalog. If tomorrow somebody from ESG, which is the compliance side of things, also needs access to that invoice data, all you need to do is add that person to the group that allows them to read that data. They can take the invoice data, the CAD data, the incident data, bring these datasets together in their own workspace, and build a new report for their use cases. This cycle continues from there on until the point where the product teams need no more help.
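The group-based access described above can be sketched as a toy model in Python. All product, group, and user names here are hypothetical; in practice a governed catalog manages the grants, but the core idea is the same: onboarding a new reader is a single group-membership change, not a new pipeline.

```python
# Which groups may read which data products (illustrative names).
grants = {
    "finance.invoices": {"finance-readers"},
    "engineering.cad_models": {"engineering-readers"},
}
# Who belongs to each group.
group_members = {
    "finance-readers": {"alice"},
    "engineering-readers": {"bob"},
}

def can_read(user: str, data_product: str) -> bool:
    """True if the user belongs to any group granted on the product."""
    return any(user in group_members.get(g, set())
               for g in grants.get(data_product, set()))

# Onboarding the ESG analyst is one membership change:
group_members["finance-readers"].add("esg-analyst")
print(can_read("esg-analyst", "finance.invoices"))  # True
```

The point of the model is that access decisions stay with the data product's owning team, while the central platform only has to host the catalog and enforce the grants.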

How do you make these changes happen? Again, building a platform is easy. What you don't know is, will people use it? How do you push changes? Of course, we are building backwards from the organization needs. We are starting small from one particular use case, but you also need help from the management to drive these changes. We use a combination of OKRs and KPIs that are pushed on top of the business, not on developers.

Developers have their own KPIs, but there are management KPIs as well. We work closely with the management and build a set of these KPIs. We have quarterly tech-focused OKRs for product teams, and these are owned by the business owner, not by the technical lead. They have their own set of KPIs. We follow a framework called SMART, where the OKRs or KPIs need to be Specific, Measurable, Achievable, Relevant, and Time-bound. What does this mean? An example would be: for a particular business area which is not so mature, which is still using Excel, we can say, you need to move at least five of the Excel reports that you are building, spending maybe 30 hours on each of them, to the data platform in Q1 or Q2 or Q3. This puts pressure on the management, so that the management teams can then work with the developers and drive the changes. The business requirements are coming in not only from the technical side, but also from the business side.

Summary

In summary, shift left works, but you need to make sure that you have the management structures in place so that you can enable people to shift left. If you are just saying that you need to own ETL pipelines, but all I have in my team is one product expert who knows how to use this particular product, they are going to struggle big time to build anything of this sort. You need to have management mechanisms in place. This management structure means the possibility to let people be borrowed, or you will also need buy-in from the management to say, yes, we need to hire a few people here, maybe we need a consultant for three months, whatnot. Unless there is a management structure and you have the ambition, it's bound to fail. The platform, again, is a no-brainer. You have heard a lot of people say already in multiple talks: build the platform backwards. When we started on this particular journey, we just went to one particular use case.

That one particular use case was our engine testing team. They were testing the engines. Whenever an engine is born out of the factory, they plug it in and run various stress tests on it. They wanted to analyze this data. We went to them and asked, help us understand what it is that you need. They needed some analytical use cases. From those use cases, using automation right from the beginning, we created a flag bearer use case and used it to show that it brings value, that this approach works. Since we have done everything with infrastructure as code, it's very easy to replicate for the next team that comes in, which basically means that the start time I had for this particular team, which might have been two or three weeks, is only one day the next time.

The second day, they can actually start working on the platform. Yes, build the platform backwards from one particular use case, find that one big flagship use case, which has a lot of business value, and then your use cases become easy. Also, meet where your developers are or have the possibility to train them. In this use case, when I used the CLI tool to distribute my templates, I did that because the developers in our use case were very comfortable using CLI tools. If I had non-technical users, then I might have used ServiceNow's ticketing system, where then you could have done the automation behind the scenes. It depends on how mature your organization is. Only you can make that decision for yourself, no consultant, no other person can make that decision for you.
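As an illustration of what such a template-distributing CLI might do underneath, here is a small stdlib-only Python sketch. The template layout and the `{{team}}` placeholder are assumptions for the example, not the actual tool: it copies a template tree and substitutes the team name, which is what makes onboarding the next team a one-day exercise.

```python
# Sketch of template scaffolding: copy a template tree and fill in the team name.
import pathlib
import shutil
import tempfile

def scaffold(template_dir: pathlib.Path, target_dir: pathlib.Path, team: str) -> None:
    """Copy the template tree and substitute the {{team}} placeholder in each file."""
    shutil.copytree(template_dir, target_dir)
    for path in target_dir.rglob("*"):
        if path.is_file():
            path.write_text(path.read_text().replace("{{team}}", team))

# Demo: create a tiny template, then scaffold it for a hypothetical team.
root = pathlib.Path(tempfile.mkdtemp())
tpl = root / "template"
tpl.mkdir()
(tpl / "pipeline.yml").write_text("owner: {{team}}\nschedule: nightly\n")

scaffold(tpl, root / "engine-testing", "engine-testing")
print((root / "engine-testing" / "pipeline.yml").read_text())
```

The same scaffolding logic could just as easily sit behind a ServiceNow ticket for less technical teams, as mentioned above; only the front end changes.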

Questions and Answers

Participant 1: Do you have any issues when you want to create reports, like overall products?

Anurag Kale: Initially, yes, because we were missing a lot of data. One thing that is not so visible from the diagrams that I've shown is that once you have multiple workspaces exposing this data, in the central workspace you will be able to see one unified data catalog. In that one unified data catalog, because we have implemented so many measures for things like, who is the owner of this particular data product? Which team is it coming from? Which cost center? Unless you have that metadata, we will not allow you to put the data in the platform. In that metadata catalog, the metastore that I mentioned, you will see all of the data products that are used by other teams. I'm not seeing the sample data. I'm not seeing what is inside of the table. All I'm seeing is the name of the table, the columns, and the description of what each column means.

In theory, what you can do is say, my workspace ID is X, I need this particular data product from workspace Y. In the data product description, there is an owner of that particular workspace. You can contact that workspace owner and say, I need this data asset, I need read access for this use case. You can then start collecting 4, 5, 10, however many of these data products you need to build the report you need. That's the whole value proposition of data mesh. Once you have started exposing data as reusable assets, life becomes very simple.

The example that I used was ESG reports: environmental, social, and governance. This is a report that is required by the EU, and it spans across people, economics, as well as incidents. For that, you have to get data from multiple teams. That's how this entire methodology helps you get there. What we are trying out now is putting in another data catalog, a little more business-oriented one, which should allow business users to do that as well.

Participant 1: Like a layer on top.

Anurag Kale: There are so many different data products out there, so you just need to pick and choose. Then you will get a copy, or most of the time in Databricks, you don't even copy, you just get access. You write whatever new ETL you want on top of the data products that are inside of Databricks, and you get a whole new table. Again, depending on the use case, you can make facts and dimensions or you could build one flat table, whatever the case. Databricks, in that sense, is the most complete data platform I find on the market. It has a publish to Power BI button. You publish the new dataset, and its semantic model is reflected in Power BI, and then you can build the Power BI report on top of it. I think it supports other BI tools as well, but Power BI I'm sure of.

Participant 2: What kind of data sources do you support? Do you support Sheets, MySQL? Do you have to do some work whenever a new need arises, or is this built into Databricks?

Anurag Kale: No. Databricks, behind the scenes, uses a technology called open table formats. This is a newer piece of technology built on top of object stores like S3 or Azure Blob Storage. Essentially, they store the data on Blob. The format that Databricks natively uses is called Delta Lake, but there are open-source alternatives as well: Apache Hudi and Apache Iceberg are out there in the market. These are the storage formats. No matter where you read the data from, when you store it in Databricks, it is stored in that Delta table format, and then Databricks can natively read it.

Participant 2: When you read it in the first place, do you need to support the teams to convert this?

Anurag Kale: Yes and no, sometimes, because it depends on how complex the data from the team is. When you use any kind of lakehouse platform, you have a two-step approach. The first step is to just land that data in Blob storage. To do that, depending on the source technology, you will have to use either some connectors or an ETL tool. Most of the time, Azure Data Factory has connectors to MySQL and the like. Use Azure Data Factory to pick that data from a MySQL database, a SQL Server database, or Oracle, dump it into Blob Storage, and then Databricks can pick it up from there. Yes, it is possible.
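That two-step landing pattern can be sketched with the Python standard library alone. Here sqlite3 stands in for the MySQL/SQL Server source and a local directory stands in for Blob Storage; the table and path names are made up for the example.

```python
# Step 1 of the lakehouse pattern: extract from a relational source
# and land the rows as a file where the lakehouse can pick them up.
import csv
import pathlib
import sqlite3
import tempfile

# A local directory standing in for the Blob Storage landing zone.
landing = pathlib.Path(tempfile.mkdtemp()) / "landing" / "erp" / "invoices"
landing.mkdir(parents=True)

# A toy source table standing in for MySQL / SQL Server / Oracle.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE invoices (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO invoices VALUES (?, ?)", [(1, 99.5), (2, 150.0)])

# Dump the table into the landing zone as CSV.
rows = conn.execute("SELECT id, amount FROM invoices").fetchall()
out = landing / "invoices_000.csv"
with out.open("w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "amount"])
    writer.writerows(rows)

# Step 2 (not shown): the lakehouse reads files under the landing path
# and writes them into a managed Delta table.
print(out.read_text())
```

In production this extract step would be an Azure Data Factory copy activity rather than hand-written code, but the shape is the same: source query, land as files, then let Databricks ingest from the landing path.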

Participant 2: How often do you need to?

Anurag Kale: That depends on your use case. If you need a report that runs every hour, then you pick every hour. If you need it every night, you run it every night.

Participant 2: Did you encounter scalability issues?

Anurag Kale: No. The compute model behind Databricks is that when you run these pipelines, you can either choose to connect them to a cluster or you can choose serverless mode. Serverless mode is available all the time, and you pay for how much you use. If the pipeline runs for one hour, you pay for one hour. That's it. The serverless service is always available.

 


Recorded at:

Mar 23, 2026
