Taming the Data Mess, How Not to Be Overwhelmed by the Data Landscape

Summary

Ismaël Mejía reviews the current data landscape and discusses both technical and organizational ideas to avoid being overwhelmed by the current lack of consolidation of the data engineering world.

Bio

Ismaël Mejía is a Senior Cloud Advocate at Microsoft working on the Azure Data and AI team. He has more than a decade of experience architecting systems for startups and financial companies. He has recently focused on distributed data frameworks and is an active open-source contributor to Apache Beam (Google Dataflow SDK), Apache Avro, and many other open-source projects.

About the conference

QCon Plus is a virtual conference for senior software engineers and architects that covers the trends, best practices, and solutions leveraged by the world's most innovative software organizations.

Transcript

Mejía: We're going to talk about taming the data mess, about how not to be overwhelmed by the data landscape. There are many systems, and we have to choose from them, so it's difficult. My name is Ismaël Mejía. I'm a Microsoft Advocate for Azure.

I got interested in this subject because I had a new job opportunity, and this can be the case for many of you for different reasons. It can be, for example, a new project that starts from a new perspective, or creating or integrating a new system. In my background, I have been doing open source for many years. I have worked with the Apache Software Foundation. I worked with Hadoop and Spark, first as a client, creating jobs and data transformation pipelines, and then I moved deep into open source, working on Apache Beam and doing many other things. In both cases, I was working with AWS infrastructure, and then afterwards with Google Cloud. When I got an opportunity to work at Microsoft, it was really interesting, because I was not so knowledgeable about the technology. I was presented with Azure, and I was like, this is really nice. This is a cloud infrastructure from Microsoft. They welcomed me with this little diagram of the fundamental services that we have in Azure. Of course, I quickly found some similarities in the data services, like databases such as Cosmos DB, or Synapse, which integrates Apache Spark, so it was an easy place to fit in. Then I realized the platform was quite big. As you can see, you can easily become overwhelmed by the sheer size of it. Then I started to talk with people, and they asked me questions about security or about networking issues that they had while integrating with data systems. It's a feeling of being overwhelmed, because I had to ramp up on so many technologies and understand everything to explain it. I had the feeling that I had already lived this. This was not the first time I had to deal with a lack of experience with a system; I had been confronted with systems like that in the past. The Hadoop ecosystem of tools, in the old times, was a little bit like that.

The Data Landscape

Then the real point that triggered my attention some years ago was this image. This image is the data landscape produced by Matt Turck. It is quite overwhelming, because you see many of the different systems, frameworks, and databases, and the companies that produce all of them. It gives a broad view of all the things that we can integrate in our architectures. When we zoom in on just a little part of this diagram, we can see that there are many things that we will have to choose from to create our architectures or solutions. It's the moment of despair, when we get to the real question: how can we choose the right project, the right service, the right product, or the right platform with so many options? This happens to everyone. Of course, there is this big element that they call fear of missing out, the feeling that if I'm not using the latest thing or the latest family of systems, I'm probably missing something that could give my company an advantage. Even with those two things in mind, nobody can learn or even evaluate all of these. There are so many solutions that you can sit all day and not produce any work, just trying to decide between so many options.

This happens because there is a lack of consolidation. We can hope and be optimistic about this and say that in the future systems will consolidate, so why should I care about this? There are many reasons why this has not happened. One is business-wise, of course. There are many people putting money into companies, like venture capitalists who give money just to create the next new big data thing, because there is a lot of money in data. Of course, there is also the legislation part: there is nothing that obliges any of these companies to interoperate with each other. In a more optimistic way, there is also the technology aspect: we are creating new systems that were not possible 15 or 20 years ago. All of this opens new opportunities, and there is innovation. We have to live with this. This is the mess we have to deal with, all this variety.

Current Data Trends

In the end, if we want to tackle this, we come to the critical problem of decision making. I am not an MBA, so I'm not an expert on decision making, or a person who's certified on IT along with these problems. I got interested in all of this from the perspective of a data engineer, or a data architect. To understand how we can make choices, we probably first need to understand what the current trends in the data landscape are. In the batch family of systems, let's say, we have two big families. We have operational data. Operational data corresponds to all the databases that we know, like Postgres, SQL Server, and Oracle from the old times. All of this is quite mature, and many of those now have cloud versions. This is the more developed area. Then we have the analytical data part, which is what we use to produce reports and do data analysis. Even more recently, to start with all these annual trends. In this, we have two subfamilies that appeared because of the cloud. We have the data lake part. Basically, we use the storage services of the cloud vendors, or a distributed file system, because this can be done on-premises too. We put files there and we create an open system that we can access. Of course, there are data warehouses, which come from a long time ago, but now there are cloud versions of those. These are more closed systems that give you a more consolidated experience, mostly around SQL, to do all these operations. This is one part.

Then we have the recent appeal of streaming data. The idea of streaming data is that we are going to deal with data at the time it happens, so we want to reduce latency, and produce and integrate results as fast as possible. Streaming data is like the Year of the Linux Desktop of the data engineering world. Why is this? Because in the Linux desktop community, there are always people saying that next year is going to be the year of world domination for Linux, but this never happens. In the data engineering world it is something like this: every year we say streaming is going to be mainstream now, everybody's going to use streaming data. We're getting there slowly, but we are not fully there. Of course, there has been a lot of progress, because most companies now have a replicated log like Kafka. There is all this event streaming and change data capture going on. More recently, there's this trend around SQL over streams, to have real-time materialized views of what's going on. That's another big set of products and tools that we can integrate in our systems.
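To make the idea of a real-time materialized view a bit more concrete, here is a minimal sketch in plain Python. It is not the API of any streaming product: the event schema (`customer_id`, `amount`) and the class name `RunningTotalsView` are assumptions made up for illustration. The point is only that the view is updated incrementally as each event arrives, instead of recomputing the whole aggregate in a batch job.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical event schema: (customer_id, order_amount).
Event = Tuple[str, float]


class RunningTotalsView:
    """Toy, incrementally maintained equivalent of:

        SELECT customer_id, COUNT(*), SUM(amount)
        FROM orders GROUP BY customer_id
    """

    def __init__(self) -> None:
        self._count: Dict[str, int] = defaultdict(int)
        self._total: Dict[str, float] = defaultdict(float)

    def apply(self, event: Event) -> None:
        # Each incoming event updates only the affected key.
        customer_id, amount = event
        self._count[customer_id] += 1
        self._total[customer_id] += amount

    def query(self, customer_id: str) -> Tuple[int, float]:
        # Reads are cheap lookups against the already-computed state.
        return self._count.get(customer_id, 0), self._total.get(customer_id, 0.0)


def consume(view: RunningTotalsView, events: Iterable[Event]) -> None:
    """Stand-in for a consumer loop reading from a replicated log."""
    for event in events:
        view.apply(event)


view = RunningTotalsView()
consume(view, [("alice", 10.0), ("bob", 5.0), ("alice", 2.5)])
print(view.query("alice"))  # (2, 12.5)
```

A real system also has to handle late or out-of-order events, retractions, and state that does not fit in memory, which this sketch deliberately ignores.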

More recently, we have data mesh. Data mesh is a set of concepts that came from a problem that appeared because of data lakes and poorly integrated data ownership. It's not because of data, it's because infrastructure became complex, and really specialized people took control of data. What data mesh proposes is just to have a more decentralized view of data. This is not 100% about tools, but tools help you to get there. We can bring data back to the people who have the domain knowledge. It's a question of access and ownership.

Of course, now we also have many different scenarios for data: batch, maybe using notebooks, or maybe using SQL, and maybe also some parts with streaming. This has given rise to what we call the cloud data platform. These are fully integrated offerings. Some of them come from cloud vendors, and there are things like Snowflake or Databricks. Both offer a fully integrated solution for data. That's another thing. The opposite of this is the so-called modern data stack movement. These are tools that are mostly service oriented. They're cloud first, so there's little configuration and they are very easy to set up. Their goals are more focused on specific problems. The good part of this is that I would call these second-generation tools, somehow. In the previous 10 or 15 years, we have really been dealing with the problem of scale with data, how we scale data and get it faster. How will you scale data for shipping? All of this is really important. But a really big part of the problems that data engineers have is around different things, like governance. How do I govern these data accesses? How can I trace those? How can I do data lineage? How can I create better catalogs? All of these are things that were not immediately solved, but now there are many companies and services that are trying to solve them. Again, another set of things that we can integrate. Of course, we have the ML and AI systems. The recent trend is that data supports ML; that is basically one big thing. The other thing is that we can also use ML to improve our data systems and their implementations. Again, many things we can choose from. When we are going to choose, we need some approach. That's the real question: how can we choose with so many tools, with so many trends? What is the correct approach?

Strategy

I came up with a small framework of three things that we can take into account to choose. The first one is the strategy, because the strategy of the company, or of the project we're working on, is the main area we should focus on. Then we have the technology, of course, because technology constrains our choices. Then we have the people, because people matter, and it's something that we engineers leave as an afterthought.

If we go with our first one, the strategy, the first thing we have to check is, what are our current systems? What is the legacy we have as a company? Also, what are our business priorities? Why are we doing this? What results do we want to have first? Maybe there's this report that we want to have first, with different requirements. Of course, none of this can be done without paying for it in one way or another. We have to take cost into account: how much we pay for licenses, but also how much we pay for engineers, and how much we pay for operations, all of these things. We also have to take support into account. That's something that we tend to forget. When I say support, I mean how fast your vendor is going to answer your questions and help you get there. This is something that Microsoft does well in Azure. Of course, another important dimension is knowledge and experience. If you have a team, does this team have knowledge of the particular tool we want to use? That's something that's interesting. For example, when you're recruiting, if you recruit someone who already knows this next thing we want to integrate, that's pretty good. Of course, all these decisions have an impact on the future of what we do, and we have to take this into account. Is choosing this tool maybe solving the problem right now but bringing me more issues in the future? That's another thing we have to analyze.

One common thing that happens is that people feel a little bit pressed because of marketing, or because of the buzz among peers about the next nice thing everybody's using. Sometimes this is not appropriate for the problems you have. This is something that happened to me with a friend a long time ago, who asked me, "You see there are these companies. Now we're hearing a lot about this thing called feature stores, and we want to do machine learning in our company. This is a little startup. Maybe I need a feature store for this." That's not the right way to approach this problem. The first question I asked was, do you already have data available, and is this data clean and of good quality? Do you have processes to keep it like that? Because it's not a one-time shot, you have to keep these things going. More important, is that data already used in your business? Do you have reports, or things like that? These are all steps that are pretty important, even before thinking about machine learning and stuff like that. Even if we go into some specific machine learning project, like text recognition, for example, do you have to do this by yourself or can you use a managed service to do that? That's another thing you have to consider. In general, marketing, new projects, and new ideas are really good and exciting, but you have to take things with a grain of salt.

Another thing that is common is that we have cognitive biases. Cognitive biases happen when we see a first image of something and we start assuming that that's the way it is. One common case of this for data systems is performance. Everybody says, my system is the most performant, it is faster than all the others, or it beats this benchmark. Performance is only one dimension of the many things you have to take into account when you choose a data system. This is a photo of the countryside in Scotland, which is really beautiful. As you can see, this is a pretty day with sun and not so many clouds. It can be quite a different experience if you go during the winter and it's raining. Don't let yourself be anchored by the things that they show you first.

Another thing that matters when we make these decisions is timeliness, for example, when we decide whether to go with some new, more experimental technology or to wait before we jump into something. That's something where we as engineers, or decision makers, have to take a little bit of distance before taking the decision. We cannot do it immediately. We have to think a little bit more. Of course, even taking all of this into account, we end up having to deal with things that are not set in stone. We have to have faith in many things. For example, is your tech provider going to be there in three or four years, or is it maybe going to change its prices or strategy, and how will this impact you? We can say, this is for vendors and all this stuff, but what about open source? Will this be a guarantee for the future? It can be, but you don't know if this open source project will be maintained in the future. This is something that happened recently with a workflow system for data that a big company open sourced, and now they say they're not going to maintain it anymore. If your company depends on the system, you have to be ready. All of this is to say that part of the strategy is to be ready for changes. This is what we have to do.

Technology

What about the technology? The technology part is something that we always have quite strong opinions about. We always say, our systems are bad. There's this negativity about the current solutions. We have to be realistic. We have to deal with what we have, and every company has a mess somewhere. Even if you want to change all your architecture, or these things that are not done very well, this is what it is and you have to deal with it. We can incrementally fix things. This is something that we have to consider first. Choosing new technologies, or dealing with all these products in the landscape, is exciting, and gives us the possibility to experiment with new things. Sometimes, though, we have to play a little bit conservative and choose the technologies that are on the proven path for critical operations, because nobody wants to be called in the night because something's not working because of some experiment. We are now open to experimentation in many areas, but we also have to be conservative. This is a common tension and tradeoff when choosing systems. Of course, cloud services are a fantastic thing because they let us experiment with an easy setup, sometimes at a really affordable cost, at least for the exploration part when you're still doing a proof of concept. Doing a proof of concept is also something that helps keep your engineers interested. It's something that we should consider.

Another technology choice that we can arguably say is good is to choose open source tools, because open source gives you control of the tool. You can adapt the tool to your needs. That's attractive, of course. Open source has to be looked at in detail, though, because there are always incentives behind the tools that are open sourced. One thing you have to check first is who open sourced this tool and why. More importantly, check if there is a healthy community around the project that you are going to choose. Of course, there is the question of licenses. It is also important to see if there are multiple actors, people who are interested in this tool, or if it's probably going to disappear tomorrow if someone does not maintain it anymore. If you choose open source tools, and you say, ok, but we can take care of it, check that you have the resources to take care of it. This is something that matters. Of course, apart from open source, recently we also have open data formats and table formats, which are ways to represent the data we store in our data lakes. Since you control the data, you control the format, and you control how you modify the data and how you access it. The good thing here is to choose formats that have a healthy ecosystem. There is a new family of table formats, like Apache Iceberg or Delta, that allows you to do nice new things. If you control the data, take care that the data is in an open format that you can access in the future; you need to have that guarantee over time. Of course, a good table format also allows you to control schema evolution and to time travel between different versions of the data, which is really nice and a recent feature of these systems.
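As a rough illustration of what schema evolution and time travel mean, here is a toy sketch in plain Python. It is not the API or the on-disk layout of Apache Iceberg or Delta; the names `ToyTable`, `Snapshot`, `append`, `add_column`, and `read` are hypothetical. It only shows the underlying idea that every change produces a new immutable snapshot, so older versions stay readable and old rows can be read under a newer schema.

```python
from dataclasses import dataclass
from typing import Any, Dict, Iterable, List, Optional, Tuple

Row = Dict[str, Any]


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    schema: Tuple[str, ...]   # column names valid for this version
    rows: Tuple[Row, ...]     # rows visible in this version


class ToyTable:
    """Append-only list of snapshots; snapshot_id equals its list index."""

    def __init__(self, schema: Iterable[str]) -> None:
        self._snapshots: List[Snapshot] = [Snapshot(0, tuple(schema), ())]

    @property
    def current(self) -> Snapshot:
        return self._snapshots[-1]

    def _commit(self, schema: Tuple[str, ...], rows: Tuple[Row, ...]) -> int:
        snap = Snapshot(len(self._snapshots), schema, rows)
        self._snapshots.append(snap)
        return snap.snapshot_id

    def append(self, rows: Iterable[Row]) -> int:
        cur = self.current
        return self._commit(cur.schema, cur.rows + tuple(dict(r) for r in rows))

    def add_column(self, name: str) -> int:
        # Schema evolution: existing rows are not rewritten.
        cur = self.current
        if name in cur.schema:
            raise ValueError(f"column {name!r} already exists")
        return self._commit(cur.schema + (name,), cur.rows)

    def read(self, as_of: Optional[int] = None) -> List[Row]:
        # Time travel: read any earlier snapshot by its id.
        if as_of is None:
            snap = self.current
        elif 0 <= as_of < len(self._snapshots):
            snap = self._snapshots[as_of]
        else:
            raise KeyError(f"unknown snapshot {as_of}")
        # Rows written before a column existed read as None for it.
        return [{col: row.get(col) for col in snap.schema} for row in snap.rows]


table = ToyTable(["id", "name"])
v1 = table.append([{"id": 1, "name": "a"}])
v2 = table.add_column("email")
v3 = table.append([{"id": 2, "name": "b", "email": "b@example.com"}])

print(table.read(as_of=v1))  # the table as it looked after the first write
print(table.read())          # latest version, old row has email=None
```

Real table formats track snapshots through metadata files rather than in-memory copies, and they also handle concurrent writers, deletes, and partitioning, none of which is modeled here.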

Of course, when you choose a tool, you probably also need to decide how you integrate it with other tools or systems that you already have. Then we have to check compatibility. We have API compatibility, which is really important, and is something that is happening more and more. In the case of data, sometimes it is a concrete API, like the MongoDB API that is supported by Cosmos DB, for example, or sometimes it is a SQL flavor: is the SQL flavor supported? That can be another way to check this. More recently, there are tools that support compatibility at the level of the wire protocol. For example, they support, let's say, Postgres compatibility. The nice thing with this is that once you support the database protocol, the tools that integrate with that database will integrate with your tool too. For example, a newer database like CockroachDB supports the Postgres way of connecting. You can, in theory at least, replace one with the other easily. Of course, all the existing tools that you have are supported: your Power BI, your Tableau, and all these things can already integrate with the new system. That's another question you have to ask when you choose technology.
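The practical consequence of wire compatibility can be sketched as follows: the client code and the driver stay the same, and only the connection URL changes. This is a minimal sketch under stated assumptions. The host names, user, and database are hypothetical; 5432 and 26257 are the default SQL ports of PostgreSQL and CockroachDB respectively; `fetch_one` assumes the `psycopg2` driver is installed and a server is reachable, so it is defined here but not called.

```python
from urllib.parse import quote


def pg_url(host: str, port: int, user: str, database: str) -> str:
    """Build a PostgreSQL-style connection URL (libpq URI format)."""
    return f"postgresql://{quote(user)}@{host}:{port}/{quote(database)}"


def fetch_one(url: str, sql: str = "SELECT 1"):
    """Run one query with a standard Postgres driver.

    Assumes psycopg2 is installed and the server behind `url` is reachable.
    The function does not know or care which database answers, as long as
    it speaks the Postgres wire protocol.
    """
    import psycopg2  # imported lazily so the sketch loads without the driver

    conn = psycopg2.connect(url)
    try:
        with conn.cursor() as cur:
            cur.execute(sql)
            return cur.fetchone()
    finally:
        conn.close()


# Hypothetical hosts; only the endpoint differs between the two systems.
POSTGRES_URL = pg_url("pg.internal.example", 5432, "app", "sales")
COCKROACH_URL = pg_url("crdb.internal.example", 26257, "app", "sales")

print(POSTGRES_URL)
print(COCKROACH_URL)
```

Wire compatibility does not guarantee identical SQL semantics or feature coverage, which is why the replacement is easy "in theory at least" and still needs testing against the actual queries in use.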

Another thing we have to choose when we choose technology is to choose how we're going to operate it. I realized with the proposed images, that operations are something that has three traits. One is that operations are critical. Operations are hard, and are done mostly by experts. As you can see in these different domains, from the control center, to the surgery, to the military operation, these are things that have these patterns. Let's not forget that we also have to do this for software. Operations in software are critical, so if you can go cloud and deal less with this, probably it's a good idea. You have to be prepared to deal with the complicated part.

Of course, when we choose technology, we have the issue of risk planning. Will I be locked in with this vendor, and why? In particular, what is the tradeoff? Being locked in with a vendor can have some advantages, but it can also have some hidden costs. One common one in the cloud, for example, is the cost of data egress. Is data egress something we care about in the long term? If we're going to get all this data out, it's going to be really expensive, so we have to think in advance about what's going to happen in two or three years. Of course, another part that we usually don't consider a risk, but that is important, is the user experience of these different tools. Do our developers like them? I say this is critical because if people don't like these things, they don't use them. If they don't use them, we are losing the advantages. Of course, part of this experience is documentation: is the documentation good and sufficient? And there are the different support channels. Is it easy to find a community and find answers to my questions? All of this is part of the technology choice.
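The data egress point lends itself to a back-of-the-envelope calculation. The sketch below is only that: the price per gigabyte, the starting volume, and the growth rate are made-up assumptions, not the rates of any cloud vendor, and the function names are hypothetical. It shows how a volume that looks manageable today turns into a sizeable one-off bill when projected a few years ahead.

```python
def projected_volume_tb(current_tb: float, yearly_growth: float, years: int) -> float:
    """Compound growth of the stored volume, e.g. 0.5 means +50% per year."""
    return current_tb * (1.0 + yearly_growth) ** years


def egress_cost_usd(data_tb: float, price_per_gb_usd: float, free_gb: float = 0.0) -> float:
    """Cost of moving `data_tb` terabytes out once, at a flat per-GB price."""
    if data_tb < 0 or price_per_gb_usd < 0 or free_gb < 0:
        raise ValueError("inputs must be non-negative")
    total_gb = data_tb * 1000.0                 # decimal terabytes to gigabytes
    billable_gb = max(total_gb - free_gb, 0.0)  # subtract any free allowance
    return billable_gb * price_per_gb_usd


# All numbers below are hypothetical, for illustration only.
HYPOTHETICAL_PRICE_PER_GB_USD = 0.08
volume_in_three_years = projected_volume_tb(current_tb=100.0, yearly_growth=0.5, years=3)
cost = egress_cost_usd(volume_in_three_years, HYPOTHETICAL_PRICE_PER_GB_USD)

print(f"{volume_in_three_years:.1f} TB to move, roughly ${cost:,.0f} in egress fees")
```

Real pricing is usually tiered and varies by region and destination, so a serious estimate should use the vendor's current price sheet instead of a single flat rate.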

Of course, one advantage that we can use and sometimes forget is communication. Before choosing these tools, you can discuss with your peers and ask them, why would you use this tool? What do you think about this tool? Commonly, people are really enthusiastic and tell you what's going on, what's good, and what is bad. What is bad is also really important for taking your decisions. You can, of course, start with the providers, who will give you an overview of things. You can engage with the communities. This is especially important for open source tools. When you decide to use an open source tool, you can check the community, check the GitHub, see how many people are active in the project. You can ask a question and see how readily they answer, or go into their Slack, and of course, read technical information about all these things.

People

The last aspect is the people dimension. In the people dimension, something that is important is that people's happiness matters. This is important. We need to balance stability and innovation. Sometimes people are happy because they're going to do something new and it's pretty cool, and it's different. Sometimes people are happy because they can sleep at night and they're not bothered by production issues. We have to make both camps happy. One thing that's important for people is expertise: what are the things that engineers know, and what can they do with their tools? Sometimes when someone knows a tool pretty well, maybe it's not the right tool to do the job, but sometimes it's the right tool for the person. Use the right tools. Of course, when you're, for example, repairing your home or doing something like that, and you don't have enough tools to do the actual work, you learn. You can decide how critical some tool could be for you. This can be something we can use to decide what is the most critical thing we need when we choose among all of these tools in the landscape.

No amount of data and ML technology will solve issues for any project without domain knowledge. We have to know what we are talking about. This domain specialist knowledge is something that really matters. We used to say that data is the new gold. This is a very nice expression, so I decided to search for it on the internet. I typed "data is the new", and Bing gave me, as a result, "data is the new bacon". I was like, maybe this is because of Bing. Then I decided to Google it, and "data is the new bacon" was also there. What matters with this "data is the new bacon" is that data should be something we care about, whether it is the new oil or the new gold, call it what you want. More critical than this is that data must be part of our culture. This is not only for companies and software projects, this is for all of our culture. You have seen all the things that have happened in recent years with politics, fake news, and all this. If we have an education about data, we can be aware of so many things. Data matters for everything.

Sometimes it's not only about technology, and I want to tell this little story about a recent project I was part of. At Microsoft there is this thing that is called Microsoft Docs. Microsoft Docs is all the people who produce the websites with the documentation, the examples, and the modules. This is huge. If you think about the whole of Microsoft, this is huge. They did this hackathon project where they wanted to integrate all the data that users produce from all these systems. This is like an analytics project: when you click a link, or you do a module, or you read documentation, all of this is tracked and stored in some datastore. They wanted to call on different stakeholders who would be interested in the data to be part of this. There were many people who were interested. I was interested because I wanted to know what the trends were, what people are looking for on particular products. There are also product managers who are interested in this, advocates, and the people who are producing the content too.

I participated in this hackathon. The idea was also to consolidate the queries we have. What is interesting is that this hackathon first made us aware, especially those of us who were not, of the data catalog we had internally for this. I was not aware of it. Then we could contribute to it with our own queries for analytics. We ended up with a consolidated shared library that was produced because of this. Because some of us were new, and we were figuring out how to get into this, it also produced an onboarding guide for all of this data analytics. There were discussions about the future strategy. How can we integrate all this? How can we produce more data queries that are part of this? Are we using the right technology? There was the critical question of whether this is the right language to use for this. All we did here did not require any new tool. Of course, it required some tools that were already in place, but we didn't have to buy a new solution for this. The return for the company is interesting, because a lot of us who were not aware of the data became aware of it for some parts of the tasks we have to do now. This is an interesting project. This reminds me that there are these company-mandated compliance trainings; you have probably seen those. Does your company have those for data processes? We say data is the most important thing, data is the new gold or the new bacon, but we're not giving data the importance it should have in our companies. Learning about the data used in your company is something that probably everyone cares about. Even as a normal software engineer, you want to know how your company is growing, or whether it is really growing. This data that we can have access to is important to know. If data is the new gold, we probably need more miners. So, data matters.

Another approach we can take to get this control of data is to have some role rotation. Many times, there is tension between data analysts, who are producing the reports or analyzing the current data, and data engineers, who are in many cases dealing with infrastructure. Sometimes, the other parts of the organization don't know the struggles they have, for example if you're doing just frontend or backend engineering. Maybe an interesting approach would be to have a rotation program like DevOps people have: you take someone who does not normally do this data analytics or engineering job and put them there just to see how it works. Definitely, this will create empathy about the different issues that one or the other has. This is another good idea.

Recap

There are many things which are not black or white. They are usually not at the extremes. There are always tradeoffs, and we have to choose our tradeoffs wisely. Of course, probably the easiest way to approach things is to do it step by step. Don't jump into the next product that is going to solve everything for you, because you can hit the wall with this product, as opposed to doing a step-by-step adoption, seeing what we need or what we care about, and taking a little bit of it at a time. Don't focus on magical solutions, because magical solutions basically don't exist, especially the ones that worked at x company, because big companies don't have the same problems. The dimension is different and what they are doing is different, so focus on your own. Of course, focus on the three main issues or the [inaudible 00:35:09] aspects to take the decision: strategy, technology, and people. If you have doubts, people are normally more important than technology; you can find your way with technology more easily than with people.
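One way to keep the three dimensions visible when comparing options is a simple weighted scoring matrix. The sketch below is an assumption layered on top of the talk, not a method the talk prescribes: the weights, the 0 to 5 scale, and the two candidate names are hypothetical. The only thing taken from the talk is the direction of the weights, with people counting somewhat more than technology.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical weights; they sum to 1 and put people above technology.
WEIGHTS: Dict[str, float] = {"strategy": 0.35, "technology": 0.25, "people": 0.40}


@dataclass
class Candidate:
    name: str
    scores: Dict[str, float]  # one score from 0 to 5 per dimension


def weighted_score(candidate: Candidate, weights: Dict[str, float] = WEIGHTS) -> float:
    missing = set(weights) - set(candidate.scores)
    if missing:
        raise ValueError(f"{candidate.name} has no score for {sorted(missing)}")
    return sum(weights[dim] * candidate.scores[dim] for dim in weights)


def rank(candidates: List[Candidate]) -> List[Candidate]:
    """Best candidate first."""
    return sorted(candidates, key=weighted_score, reverse=True)


# Made-up scores for two made-up options.
options = [
    Candidate("new_engine", {"strategy": 3, "technology": 5, "people": 2}),
    Candidate("managed_warehouse", {"strategy": 4, "technology": 3, "people": 5}),
]

for option in rank(options):
    print(f"{option.name}: {weighted_score(option):.2f}")
```

The numbers are only a way to structure the discussion. The scores themselves still come from the conversations with peers, vendors, and the team described earlier, and a close result is a reason to talk more, not to trust the arithmetic.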

Data engineering is still exciting. There are a lot of things that feel like we have not completed them, and the way this is integrated in companies, the approach, is changing, so there is a lot of opportunity. There are a lot of things to do.

Questions and Answers

Polak: One of the things that you mentioned was actually about opportunities in the ecosystem. Also, why the data engineering ecosystem is so diverse, because you mentioned that it's existed for more than 60 years, which is incredible. How are those opportunities in particular in the data ecosystem so diverse?

Mejía: A thing that jumped to my mind when I prepared this presentation was the fact that we don't really reflect much on the past. We think today that the past, let's say, had simpler systems, because we have the old databases that come from the '60s, but in reality in the industry, we're in the '80s somehow. If you think about it, many things have not changed. We are doing the full cycle: many systems are going back to standard SQL. We are getting to Postgres as a standard binary protocol. All these things show that there is something standard that we can find. This standard is born not only in technology, but also in processes. I think there is a huge opportunity there. The whole data mesh movement, I think, is a really good step in that direction. We're thinking about what's critical and why. It's not only technical. Of course, there are technical challenges, but we have also evolved a lot in the technical aspect. We now have cheap data storage, so we can have data lakes. We have immutability, first principles, reproducibility with containers. We're almost getting to a point where maybe we can create a data architecture that can stand the test of time.

Polak: You mentioned data mesh. I remember when it was discussed at QCon in 2019 in San Francisco, where Zhamak went on stage and shared her vision for data mesh. I remember it created a lot of question marks for people. Because sometimes we tend to focus so much on technology aspects, we forget about people and strategies, and how to scale that beyond just the scalability of technology, which I think is fascinating. What do you think about this aspect of decentralizing the data architecture that data mesh suggests?

Mejía: I think decentralization matters. It's a little bit like microservices, if you know them; it's somehow the equivalent in the data world. I think decentralization matters if you want to scale, in the sense that the people who are responsible for these data products are the ones who have the knowledge. This definitely is a step in a good direction. Of course, this requires that there are also systems that help do that. It is not only the systems; you need both, the technology and the knowledge from people.

Polak: Usually in companies there is compliance training for different aspects, for example, security. How do you stay secure? How do you secure your systems and laptops? How can we actually build compliance training around data? Can you share a little bit more about the thought process around this?

Mejía: A long time ago I did a compliance training on GDPR, the European law. When I did it, it was interesting because we were learning about something really generic, but we were not learning about the data in our companies: what is happening, what data we produce. That's probably one of the reasons why we always end up with the one domain expert who knows the dataset and knows what to do with it. Nobody else knows about it, because the other people are maybe just plumbing the data into place but not learning about the data. I think this is something that should change with the new, more domain-oriented mindset that also comes from data mesh. When I mentioned role rotation, I think that's interesting, especially for junior engineers. You want to touch many things to see which area you like the most. One of the things you can do is just go play with an area and learn about it. I think we need diversity. That's the point that is different from, for example, infrastructure: data involves a much wider range of people, from analytics people and pure business people to infrastructure people. You have to think wider also.

Polak: I remember when being data driven just started, there was a whole conversation on how you make data-driven decisions. Where's the data? Can you trust that data? How can we collect the data in a safe manner? You also mentioned GDPR. There are a lot of compliance requirements that we need to adhere to as data engineers, as people who work with data. How do you see the trend of making data-driven decisions in the industry compared to how it was when it started? Are people making data-driven decisions, or is it more of a gut feeling?

Mejía: I think we all love the idea of having data-driven decisions, but there is still a lot of gut feeling out there now. The good part is that we are maturing into this, and we are realizing it as software engineers in general. We often have this analogy that software architectures are like normal architectures, let's say, [inaudible 00:42:48]. We are not at all like this. We are always evolving. That realization is recent. It's only recently that everybody has realized that you don't build your big application and then never touch it again. No, it doesn't work like that. You should continue iterating on it, especially now with the cloud, where you're always in a live system. The machine that you used to turn off is now a live system. I think we're slowly getting into this with data processes and data knowledge, which will hopefully help us solve this issue.

Polak: You said that one of the strategies you presented around technology is making sure that you manage the tradeoffs, that you have a clear understanding of them. However, one of the challenges is that it is a complicated world. How do you manage these tradeoffs in such a complicated world, where there are many moving parts and you have to decide on adopting a new technology or a new paradigm?

Mejía: It's definitely a question of balance, of risk. Sometimes being an early player is an advantage for you. Sometimes you pay the price, like the people who adopted Hadoop early, which is one of the good examples. It was really interesting as a technology, but it was too early for many people to adopt. Kubernetes is arguably the same. We cannot create a rule set in stone; it's case by case. You also have to be conscious that sometimes things are shiny on the outside, but they are not the thing that you need. You have to know how to prioritize them.

 

Recorded at:

Mar 10, 2023
