
Data Versioning at Scale: Chaos and Chaos Management


Summary

Einat Orr discusses several technologies that version large data sets, the use cases they support and the technology developed to best support those use cases.

Bio

Einat Orr has 20+ years of experience building R&D organizations and leading the technology vision at multiple companies, the latest being Similarweb, which IPO'd on the NYSE last May. Currently she serves as Co-founder and CEO of Treeverse, the company behind lakeFS, an open source platform that delivers a Git-like experience to object-storage based data lakes.

About the conference

QCon Plus is a virtual conference for senior software engineers and architects that covers the trends, best practices, and solutions leveraged by the world's most innovative software organizations.

Transcript

Orr: I'm Einat Orr. This is data versioning at scale, chaos and chaos management. When I was 23 years old, I was a master's student. I held a student position in an Israeli networking company called Telrad. I was a BI analyst within the operations department, and I was working on modeling the company's inventory. Inventory comes with very high costs, and it was very important to optimize it. It so happened that I was invited to observe a meeting that included the company's management together with a few consultants, and of course, in the presence of the CEO. I didn't have a seat at the table, I was just sitting with my back to the wall with all kinds of people sitting in front of me, and I was listening to the meeting. At some point, the CEO said something along the lines of, since our inventory costs are mainly due to extremely high-cost components. That was something I knew not to be true, so I just stood up and said, "I'm afraid 80% of our costs are actually from low-cost, high-quantity items." He looked at me, he didn't know who I was, of course, and said, "I believe you're wrong." To which I replied, "I believe I'm not." He said, "I'll bet you a box of chocolates." I said, "Fine. As long as it's dark chocolate with nuts, I'm going for it." We sealed the deal. He did not make a decision based on the data he had at that moment. He checked the data. Two days later, he changed his decision, and I had a lovely box of chocolates on my desk.

I'm telling you that because that was a moment where I realized how important it is to understand the data and to be able to bring the right data at the right moment to the people who make the decisions. I've never left the world of data since. I have seen the world of data evolve from 10 or 15 datasets on a relational database, which is what I had then, to petabytes of data over object storages, distributed compute, and so on. I have realized that when we had those small datasets in a relational database, we had the safety of managing complexities in the data, such as its transient nature, in a much better way than we do today. Since data is transient, and we have a lot of it, our world today is extremely chaotic. Let's talk about how we stabilize that world and make it manageable.

Brief History

I was doing my master's then; I have since finished a PhD in mathematics at Tel Aviv University. I started my career as an algorithms developer, and later on moved to managing people. I have been VP R&D and CTO four times with Israeli startup companies, the last of which was SimilarWeb, a data company. I am now the CEO of Treeverse, the company behind the open source project lakeFS.

The Complexity of Transient Data

I would like to talk about the complexity that comes with the fact that the data is transient. As I mentioned earlier, the amount of data grew and the computation became distributed, which brings a lot of complexity. There are many complexities in the modern world of data, but something stays the same, and that is that the data changes. If you are now thinking, of course the data changes, there is additional data every day, then this is not what I mean. The historical data that we have accumulated might change. It might change once. It might change many times. This complexity is becoming harder to manage with the amount of data and the number of data sources that we have.

Let's look at a few examples so that I can convince you that the data indeed changes. The first one is that at the time of the first calculation, as you can see on the left-hand side, some of the data is missing. You can see the null values in the price column. We did not have the prices of those products at the time we did the calculation. This data was late, but after a few days or a few weeks it arrived, with the value that was actually true at the point in history that the table records. We can now go back and backfill this information, and instead of the null values, have the values that we missed earlier. This means that the data has changed from the original version that we had of it.
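For illustration, here is a minimal sketch of that backfill scenario in Python with pandas. The table and column names are made up, and this is just one way the update could be expressed; the point is that after the backfill, the dataset is a new version of the historical table.

```python
# A minimal sketch (not from the talk) of backfilling late-arriving values,
# using pandas and hypothetical column names.
import pandas as pd

# The table as it looked at the first calculation: some prices arrived late and are null.
original = pd.DataFrame({
    "product_id": [1, 2, 3, 4],
    "price": [9.99, None, None, 4.50],
})

# The late-arriving records, carrying the values that were true historically.
late_arrivals = pd.DataFrame({
    "product_id": [2, 3],
    "price": [12.00, 7.25],
})

# Backfill: fill the nulls from the late data, keeping values we already had.
backfilled = original.merge(late_arrivals, on="product_id", how="left", suffixes=("", "_late"))
backfilled["price"] = backfilled["price"].fillna(backfilled["price_late"])
backfilled = backfilled.drop(columns=["price_late"])

# 'backfilled' is now a *new version* of the historical table; the original
# version is gone unless something versioned it.
print(backfilled)
```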

Another example would be the case where we are fixing a mistake. We had values in the data, but for some reason those values were incorrect: either a human error, someone typed the wrong value, maybe a wrong calculation, maybe a bug in the logic that collects the data. Once we have fixed that, we can also fix the values that were incorrect. Some of the values were correct and stay the same, and other values are replaced by the correct values because they were influenced by the bug. Another example would be a case where the data we had originally doesn't come from collecting data, but rather from a certain calculation, maybe an estimation of some sort. We estimated the price using a certain algorithm, and then we realize there's a better logic for estimating the price, and we implement that logic back on historical data to get better estimations of the past as well. Once we do that, the entire dataset we had changes and becomes a different, hopefully better version of the data. Those are examples.

If we want unstructured data examples, which is extremely important, we do a lot of work on semi-structured and unstructured data. Think, for example, about pictures. You had a training set for a computer vision algorithm, and you were using a set of pictures that represent a good sample of the world that you would like to develop the machine learning model over. Now you have decided to replace some of those pictures with other pictures that represent the same point in the space you wish to cover, but do it in a better way. Some of the pictures remain the same, and some you have decided to replace. Another case would be that you may have decided to look at the problem at a different resolution: same picture, but different properties of the picture, and so on.

All these changes that happen in the data make it challenging for us to do the simplest things that we need to do when we work with data. First, we want to collaborate. If we want to collaborate, we need to be able to talk about a single source of truth. If we changed historical data, we now need to agree on which version of the data we are discussing. If we need to understand what dataset created what result in the past, we have to have lineage. If data changes over time, then lineage becomes more complex. We want to have reproducibility. If we run the same code over the same data, we expect to get the same result. We need to know what the dataset was, and we need to have it available in order to run this and get reproducibility. We need to have cross-collection consistency, so if we updated one table with better or different historical data, we would want to make sure that all tables downstream in its lineage, or that depend on it, are also immediately and atomically updated by that change, in a way that allows us to view the history in a consistent way. All of those are things that we expect to be able to do. If we do not have the right infrastructure, they are extremely painful, manual, and error prone.

Version Control

The world has already solved the problem of how to deal with sets of files that are changed by a lot of people; it is called version control. Yes, it was built for code. It wasn't built for data. It means that if we try to use Git over our data, we would probably fail. The notion of version controlling our data the way we version control our code would take this chaos of transient data, which I hope I've convinced you exists, and really manage it properly. We are looking for alternatives to Git, but for data. The 23-year-old me would probably tell you, I'm not sure what the problem is. Remember, I was managing five tables in a relational database. I would just add to each row in my table the period it was relevant for. This value came into effect on the date specified in the 'from' column, and it stopped being relevant on the date I would put into the 'to' column. This is referred to in the literature as the bi-temporal method. It allows us to know, for each row of the data, for what time it was relevant. It seems simple, but even with five tables it's quite annoying. Because if the tables are connected to one another, and they're not independent, then if we attach a validity period to a row in one table, we need to be able to drag that all along our process of building the tables and the views that depend on that table, to make sure that everything is updated according to certain dates. Of course, the date would now become the minimum, the maximum, or whatever is needed between all the timestamps that we have on all the tables and all the values. It is not simple at all to cascade that information into a more complex logic that runs over the tables, but it is possible if you have a handful of tables and a handful of processes. The larger things get, the more complex it is, and the better it is to try and use an infrastructure such as Git.
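To make the bi-temporal idea concrete, here is a small sketch in Python with pandas, using hypothetical valid_from/valid_to columns; an "as of" filter reconstructs the table as it was relevant at a given date. This is my own illustration, not code from the talk.

```python
# A minimal sketch of the bi-temporal idea: every row carries the period for
# which it was valid, and an "as of" query filters on that period.
import pandas as pd

prices = pd.DataFrame({
    "product_id": [1, 1, 2],
    "price":      [9.99, 11.49, 4.50],
    "valid_from": pd.to_datetime(["2022-01-01", "2022-06-01", "2022-01-01"]),
    "valid_to":   pd.to_datetime(["2022-06-01", "2099-01-01", "2099-01-01"]),
})

def as_of(df: pd.DataFrame, ts: str) -> pd.DataFrame:
    """Return the rows that were valid at timestamp `ts`."""
    t = pd.Timestamp(ts)
    return df[(df["valid_from"] <= t) & (t < df["valid_to"])]

# The table as it looked in March 2022: product 1 still costs 9.99.
print(as_of(prices, "2022-03-15"))
```

The difficulty described above is that every join and view built on top of this table has to carry the same filtering logic, which is exactly what gets painful beyond a handful of tables.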

Let's review a few of the solutions that exist today that provide the ability to version control data the way Git version controls code. Let's start with the database option. Let's say I do run my data over a relational database. Now I would like to make that relational database version-aware, so I'll be able to manage the tables, or the views, or a set of tables as a repository, and version everything, allowing me to use Git-like operations such as branching, merging, and commits. All of that would happen naturally as an interface to the database. What do I need to actually change in the database? The answer, basically, is everything. If you take a very simple abstraction of a database, let's say it is structured with two main components. One is the storage engine. This is where we keep the data. We keep it in a data structure that allows us to retrieve and save data into the database with decent performance. In our case, we also want that storage engine to allow us Git-like operations. On the other hand, the other part of the application is the parser that parses the SQL queries, the executor that executes them, and of course, the optimizer that makes sure they happen as fast as possible. Those three components build the server of the database, and they are usually optimized for the storage engine that the database has. If we want to make the storage engine version-aware, we need to make the application version-aware, and then the entire database actually changes.
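As a rough illustration of what "version-aware" means for a storage engine, here is a toy sketch of my own (not any real database's code): every read and write is addressed not only by key, but also by a ref (a branch or commit), which is exactly what the parser, executor, and optimizer above it would also have to learn to pass through.

```python
# A toy version-aware key-value storage engine. Real engines share structure
# between versions instead of copying; this sketch copies for simplicity.
class VersionAwareEngine:
    def __init__(self):
        self.refs = {"main": {}}               # ref -> {key: value}

    def put(self, ref: str, key: str, value):
        self.refs[ref][key] = value

    def get(self, ref: str, key: str):
        return self.refs[ref].get(key)

    def branch(self, new_ref: str, from_ref: str = "main"):
        # Naive: copy the whole state of the source ref.
        self.refs[new_ref] = dict(self.refs[from_ref])

engine = VersionAwareEngine()
engine.put("main", "product:3", {"price": None})
engine.branch("backfill")
engine.put("backfill", "product:3", {"price": 7.25})
print(engine.get("main", "product:3"), engine.get("backfill", "product:3"))
```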

The Dolt Database

A very beautiful example of an open source project that is a versioned database is Dolt. Dolt is based on a storage engine called Noms, also open source. It provides Git-like operations. Our use case is: I'm using a relational database for my data, I wish to continue using a relational database, but I want to have version control capabilities, and the answer to that is, for example, Dolt, a version controlled database. How does that work? I'll try to explain what is, in my opinion, the most important component of the Dolt database, which is the Noms storage engine. It relies on a data structure called Prolly Trees. A Prolly Tree is what you get when a B-Tree and a Merkle Tree have a baby; the baby is called Prolly. Why is it important to merge those two logics? A B-Tree is the tree used to hold indices in relational databases, because the way you structure it strikes a good balance between the performance of reading and writing from a database. Clearly, if you want to write very quickly, you would just throw all your tuples into a heap. Then if you want to read them, you would have to scan everything and work pretty hard. Whereas if you save them into a B-Tree, you have an easier way to logically go through the structure of the tree and locate the values that you wish to retrieve, so retrieval has higher performance. This is why B-Trees are used. We have to keep that property, because Dolt wants to be a relational database with decent performance.

On the other hand, we want the database to support Git-like operations, so we want this tree to resemble a Merkle Tree, which is the data structure used in Git. The logic is that we want to save data according to a hash of its content. The data would still be pointed to from the leaves of the tree. We would have sets of actual data that we calculate the hash from, but the middle and root nodes of the tree would actually be addresses that are calculated from the hashes of the content of the values below, so we would be pointing at addresses. That helps us because if a tuple changes, if this tree represents a table and a row in that table changes, then, of course, the hash of its content changes. Then the calculation going up, hashing the values, changes the value at the root of the tree. We would have that tree representing the table and a different hash representing the change in its content. Since other rows haven't changed, their hashes didn't change, and the pointers to them are the same as in the Prolly Tree that represented the table before the change. We can imagine that it is now easier to calculate diffs between such trees, and to calculate and implement merges and so on. This is how a Prolly Tree combines a B-Tree and a Merkle Tree into an efficient way of versioning tables within a relational database.
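Here is a toy sketch of the content-addressing idea, not Dolt's or Noms' actual code: leaves are hashed by content and parents by their children's hashes, so unchanged rows keep their addresses across versions, and a diff only has to descend where the hashes differ.

```python
# A minimal sketch of Merkle-style content addressing over table rows.
import hashlib

def h(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def leaf_hashes(rows):
    # Hash each row (here, a serialized tuple) by its content.
    return [h(repr(row).encode()) for row in rows]

def root_hash(hashes):
    # Hash the concatenation of child hashes to address the parent node.
    return h("".join(hashes).encode())

v1 = [(1, "widget", 9.99), (2, "gadget", 4.50)]
v2 = [(1, "widget", 9.99), (2, "gadget", 5.00)]   # only row 2 changed

h1, h2 = leaf_hashes(v1), leaf_hashes(v2)
print(h1[0] == h2[0])                  # True: unchanged row, same address
print(h1[1] == h2[1])                  # False: changed row, new address
print(root_hash(h1) == root_hash(h2))  # False: the root reflects the change
# A diff only needs to walk the branches whose hashes differ.
```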

To sum it up, when data is saved in a relational database, it's probably operational data that is saved there. Usually, this is how you use MySQL, and Dolt is compatible with MySQL from an interface perspective. You could think about operational data. You can also think about a feature store, because features are, at the end of the day, rather small data that could be kept in a database but should be versioned, because they change over time. In general, it means that you are willing to either lift and shift your existing MySQL, or lift and shift your data into a database. You have all the guarantees of a database, so ACID guarantees and so on, which is extremely important. Of course, you have tabular data, because we are dealing with a relational database. It won't help you if you don't want to lift and shift, if you need to keep your data in place. It will probably be harder to manage at petabyte scale. If you are in real need of higher performance in OLAP, then this structure would be less efficient for you. Of course, if you rely heavily on unstructured data, then you need a different solution. There are different solutions. Let's see what's out there for us if we don't want to use a relational database.

Git LFS

When we started, we said Git would have been a good idea, it's just that Git doesn't scale for data, and we can't use Git for data. We can use an add-on for Git called Git LFS that helps us combine the management of data together with the management of code. Where does this idea come from? It doesn't come from the world of data, it comes from the world of game development. Game developers had their code, the code of the game, that they had to manage. They also had tons of artifacts, mostly binaries, that influenced the way the game looked. They had to manage those assets together with the code, so the repositories became very strange and extremely heavy, because they were checking those very large binary files in and out of Git, and it made their work very hard. They built an add-on to Git that allows them not to do that when they don't have to. The logic behind it is extremely simple: it relies on managing metadata. This use case that comes from game development was then adopted by people who do machine learning and research. Because they said, we also have files that are not code that we want to version, but they're a little larger than what you would expect from files that hold code, or there are many of them. Still, we want to manage them together, because we have a connection between the model and the data that it was running on. It's the same use case, so why don't we just use the same solution? This is how Git LFS found its way into the world of data. Let's see how it works.

Basically, it's Git, with a large file storage added alongside the actual Git repository. The large file storage saves those large binary files, or data files, or whatever it is that we want to manage with Git LFS. We then calculate a pointer to this data. When we check out the repository, we do not check out those files, we only check out the pointers to the files. Only when we decide we want those files locally do we fetch them and actually create a local copy of them. If they already exist, of course, they're not fetched again. Then the back and forth between the large file storage and your local copy of the code and the artifacts does not happen on every operation. It's a very simple idea, and it makes it easier to manage large files. The same, of course, goes for data files that you want to manage.
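Conceptually, what gets committed to Git is just a tiny pointer like the one produced by the sketch below. The hashing code is mine and the file name is made up for illustration; only the three-line pointer format mirrors the Git LFS spec.

```python
# A conceptual sketch (not Git LFS's real code) of the pointer idea: the
# repository stores a small text pointer, while the large file lives in the
# large file storage, addressed by the hash of its content.
import hashlib
from pathlib import Path

def make_pointer(large_file: Path) -> str:
    data = large_file.read_bytes()
    oid = hashlib.sha256(data).hexdigest()
    return (
        "version https://git-lfs.github.com/spec/v1\n"
        f"oid sha256:{oid}\n"
        f"size {len(data)}\n"
    )

# Stand-in for a large binary asset, just to make the sketch runnable.
demo = Path("demo_large_file.bin")
demo.write_bytes(b"\x00" * 1024)

# This small pointer is what Git tracks; the bytes themselves are uploaded to,
# and later fetched from, the large file storage on demand.
print(make_pointer(demo))
```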

This is definitely good for game developers, or for repos that hold, in addition to code, data that their owners would like to manage, for example, data science repos. The whole thing is format agnostic: since we simply create a path or pointer to the large file, we don't care what format it has. We do care that Git manages your code. It means that the repository of the code is the Git repository, or if you're using a service, then the code is hosted there. It means that those files would also be hosted there, meaning you have to lift and shift your data to coexist with your code wherever it is. Of course, since you're using Git, all changes are at human scale, so this supports up to a certain size of files and up to a certain number of files, depending on the configuration or the hosted solution. It is not built to manage data in place, or to handle petabytes of data. It also doesn't support high performance in data writing or reading, because this is not something Git or Git LFS optimizes for at all.

DVC (Data Version Control)

As we continue, there are other solutions we can use. Let's see what the next solution is. DVC, Data Version Control, is a project that was built with the inspiration of Git LFS, but with data scientists and researchers in mind. Say I want Git LFS, but I wrote my own Git LFS that actually provides additional capabilities suitable for my specific use case as a data scientist. I need to keep my data where it is, because I have large amounts of data, or I have large files, or I can't take the data out of the organization, and very many other excuses. The data must stay in place. In place might be in an object storage, or in local storage, or anywhere. Also, I am working in a team, and retrieving the data, because the files are pretty big, can take some time. I prefer having a caching layer that is shared between me and my coworkers, so that when we retrieve a file we all need, it is cached and can very quickly be used by all of us, rather than each one of us needing to load the file into our local Git repository, as in Git LFS. Of course, there are other subtleties: what a commit looks like if I am a data scientist, what a merge looks like, what metadata fields I would want to save. We want to create here a solution that is targeted really well at the needs of data scientists, and even more specifically, ones that are dealing with ML. This is the premise of DVC. This is why its architecture looks very much like the Git LFS one, but with the required improvements to answer the requirements that we have just specified.

This is what it looks like. We have a remote code storage that is actually a Git server, and this is where the code is kept. We also have a remote storage for the data, which could be any object storage on any cloud provider, or hosted, or on-prem, with SSH access that allows us to reach file systems and local storage. Now the data stays in place, and I can address it and see it as part of my repository. I also have a caching layer, shown here as the local cache. When I would like to read a data file, in this case a pkl file that holds a model, I usually only manage the path, but if I decide to actually pull the file, it is also kept in the local cache, so that when others want to pull that file, the performance is much better for them. This is a very short version of how DVC is more adequate for data science than Git LFS.
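As a usage sketch, assuming DVC's Python API (dvc.api) and a hypothetical repository URL, file path, and tag, reading a versioned artifact at a specific revision could look like this:

```python
# A minimal sketch, assuming DVC's Python API; the repo URL, path, and tag are
# hypothetical placeholders -- adjust them to your own project.
import dvc.api

# Read a DVC-tracked file as it existed at a given revision (a Git tag, branch,
# or commit). The bytes come from the configured remote data storage, not from
# the Git repository itself.
model_bytes = dvc.api.read(
    "models/price_model.pkl",                      # path tracked by DVC (hypothetical)
    repo="https://github.com/example/ml-project",  # hypothetical repository
    rev="v1.2.0",                                  # hypothetical tag
    mode="rb",
)
print(len(model_bytes))
```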

As we said: data science and machine learning use cases. We can support both structured and unstructured data. Data can stay in place. Still, we are using Git infrastructure underneath, so this is a human scale of changes, both in the data and in the Git-like operations such as merge, branch, and commit. It allows collaborative caching. What would be missing? If you are a relational database person, then this solution is less adequate for you. If you are at petabyte scale and working with hundreds of millions of objects, caching becomes unrealistic. At machine scale, where operations might need to happen thousands or tens of thousands of times a second, this would not be a suitable solution for you. You would have to move to the next solution.

lakeFS

The last solution that I'm going to talk about refers to a world where the architecture looks like this. We have very many data sources that are saving data into an object storage. We have ETLs that are running on distributed compute such as Spark. Those could be ETLs that are built out of tens or hundreds or thousands of small jobs, running in an order that is dictated by a directed acyclic graph that is saved somewhere in an orchestration system. Then we have a bunch of consumers consuming the data. Those could be people who are developing machine learning models, those could be BI analysts, or they could simply be the next person to write an ETL over the data for some use that the company has for it. This is the world most of us are either already living in or will probably live in within the next couple of years, since the amount of data keeps growing. For this world, we need a slightly different solution. We want something like lakeFS that allows us, first, to keep the data in place, so the data stays in the object storage. We're not going to lift and shift it anywhere. That is one. The second is that we have to be compatible with all the applications that can now run over an object storage and serve the data to the organization. We have to impact performance as little as possible. We also have to work at extremely high scale when it comes to the number of objects that we are managing; it could be hundreds of millions or billions of objects. Also, the Git-like operations have to be extremely efficient, so that we can have very many of them a second, according to our architecture and our needs, or the logic by which we want to create commits and manage versions in our data.

How would we build something like that? First, from a logical perspective, since the data stays in place, look at the right-hand side: it says S3, but it means S3 as an interface. It also supports Azure Blob Storage and Google Cloud Storage, basically any object storage that you can think of. We can see that both the data itself, in pink, and the metadata that describes it are saved into the object storage, so the system is extremely reliable. In the middle runs the lakeFS server. Let's look at the top part of the screen. There, we're looking at a case where you have an application running on Spark, in Java or Scala, or you might be running your own Python application, whatever it is that you're using. You can then use a lakeFS client together with your code, which allows you to use Git-like operations. When you use a lakeFS client, the client communicates with the lakeFS server to get the lakeFS metadata, while your application is the one accessing the data for reading and writing. That is done directly by your application, as it is done today. This allows lakeFS to work with all the applications and be compatible with the architectures that we all have today.

The other option is to run through the lakeFS gateway, which is reserved for cases where you can't use a client; then your data goes through the lakeFS server. That architecture is less efficient, but it is possible to use. We now understand that this is a wrapper around the parts of our data lake that we would like to version control. The storage stays an object storage, and we are adding an additional layer that provides Git-like operations over it. It's somewhat similar to what Noms does for Dolt with the storage engine: we put an application in front of the storage that allows us to have versioning. In this case, we don't change the storage, we wrap it, and this interface now allows lakeFS to version control the data using a set of metadata that it creates as you use lakeFS and work with the data.
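As an illustration of the gateway path: since the gateway speaks the S3 API, any S3 client can address a repository as a bucket and a branch as a key prefix. The endpoint, credentials, repository, and branch names below are hypothetical placeholders.

```python
# A minimal sketch, assuming the lakeFS S3-compatible gateway; fill in your own
# lakeFS endpoint, credentials, repository, and branch.
import boto3

# Point a standard S3 client at the lakeFS gateway instead of the cloud provider.
s3 = boto3.client(
    "s3",
    endpoint_url="https://lakefs.example.com",   # hypothetical lakeFS server
    aws_access_key_id="<lakefs-access-key-id>",  # placeholder credentials
    aws_secret_access_key="<lakefs-secret-key>",
)

# The repository is addressed as the bucket, and the key is prefixed with a
# branch (or commit), so writes can go to an isolated branch while 'main' is
# untouched until you merge.
s3.put_object(
    Bucket="sales-repo",
    Key="experiment-1/daily/2023-01-01.parquet",
    Body=b"example bytes",
)
obj = s3.get_object(Bucket="sales-repo", Key="experiment-1/daily/2023-01-01.parquet")
print(obj["ContentLength"])
```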

When we look under the hood, it beautifully resembles a different version of a Prolly Tree. We don't manage rows in tables in lakeFS; we manage objects in an object store. An object can have a hash of its content, just like a tuple in a database has a hash of its content. Then what we have is a two-layered tree. The first layer holds ranges of objects. The layer on top of it holds the range of ranges; it's called the meta-range. Then, just like in a Prolly Tree, if you want to quickly access or compare only a small portion of the data that the tree represents, you can do that by locating the changed ranges of data, comparing only them, and getting the diff. The same logic also works for merging or committing data. This is how those data structures are similar, although lakeFS sits in a completely different scenario, and actually relies on Pebble, a RocksDB-inspired key-value store.
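Here is a toy sketch of that two-layer idea, not lakeFS's actual code: each range gets an address derived from the hash of its contents, the meta-range is a hash over the range addresses, and a diff between two commits only has to open the ranges whose addresses differ.

```python
# A minimal sketch of ranges and a meta-range for two commits of a repository.
import hashlib

def h(s: str) -> str:
    return hashlib.sha256(s.encode()).hexdigest()

def build(ranges):
    """ranges: list of lists of (object_path, object_content_hash) tuples."""
    range_addrs = [h(repr(r)) for r in ranges]   # one address per range
    meta_range = h("".join(range_addrs))         # address of the whole commit
    return meta_range, range_addrs

commit_a = [[("logs/1", "aaa"), ("logs/2", "bbb")], [("imgs/1", "ccc")]]
commit_b = [[("logs/1", "aaa"), ("logs/2", "bbb")], [("imgs/1", "ddd")]]  # one object changed

meta_a, ranges_a = build(commit_a)
meta_b, ranges_b = build(commit_b)

# Only ranges whose addresses differ need to be opened to compute the diff.
changed = [i for i, (ra, rb) in enumerate(zip(ranges_a, ranges_b)) if ra != rb]
print(meta_a != meta_b, changed)   # True [1]
```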

To sum up what lakeFS can provide: it is useful if you want to develop and test in isolation over an object storage, and if you want to manage a resilient production environment through versioning. It allows collaboration, so very much what version control gives us in all cases. It supports both structured and unstructured data, so it is format agnostic just like Git LFS, but your data stays in place. It is built to be highly scalable and to keep performance high. It is, of course, compatible with all the compute engines that exist out there in the modern data stack over an object storage. What wouldn't you do with lakeFS? If you don't have an object storage, lakeFS won't be the right solution for you. Also, in data diffs and merges, lakeFS only provides information at the level of object lists, rather than going into the tables, or, for unstructured data, into the bytes themselves, to tell us exactly where the diff is. That does not exist yet; there is a place to contribute code to do that for specific formats of data.

Summary

Here is me at 23. I can go back to being 23 if we use the right data version control tool, to allow us to take this chaos that is created by the amount of data, the complexity, and the transient nature of the data, and turn it from this chaotic environment that is hard to manage into a manageable environment where we have full control. Then we know exactly where the data comes from, we know what had changed and why. We can go back to being the person in the room that corrects the CEO when they're about to take the wrong decision based on the wrong data. Here you have the full advantages of each one of those systems, and you can use them whenever you need them. Please use them to make your life easier and to manage the chaos of data. Or as one of my team members loves to say, you can mess with your data, but don't make a mess of the data.

Questions and Answers

Polak: How does lakeFS relate to existing columnar formats such as Parquet and ORC?

Orr: lakeFS is actually format agnostic. What it manages and versions are the objects within the object store, and those could be in any data format. On top of whatever the format provides, it gives you the ability to treat a set of tables as a repository, manage them together, and travel in time within all those tables together as a repository. It is actually a layer over the storage format that you're using. The same answer goes for more advanced formats, such as Delta, Hudi, and Iceberg, which provide a bit more mutability, something that is lacking in object storages, which are immutable. That is the strength of those formats. Again, lakeFS works over that and allows the treatment of very many tables as one repository.

Polak: What grain is an object, a single file, a Parquet file?

Orr: A single file, yes.

Polak: Single file that exists on object store.

Orr: If you're using some virtualization such as Hive Metastore, you will be able to see a collection of files as a table within Hive, and then lakeFS, and also DVC, can work together with Hive Metastore and its table representation.

Polak: What does lakeFS do when there is a conflict between the branches?

Orr: It does what Git does. When lakeFS looks at a diff between main and a branch, it looks at the list of objects that exist in both. If we see a change that happened in an object that exists in both, that constitutes a conflict. This conflict is simply flagged for the user to decide what logic they would like to implement in order to resolve it. That could be done by using one of the special merge strategies that allow you, for example, to ignore one of the copies. Or, if you have another type of logic that you would like to use, you can always contribute it to the set of strategies that already exists within lakeFS.

Polak: Does version control increase storage cost because we now save several versions of the same dataset?

Orr: It does seem like that intuitively at the beginning, because instead of just saving the one latest copy, I will be saving the history of my data, and that might create a situation where I am saving way more data than I did before. In actuality, in order to manage ourselves properly and to have isolation, we already copy a lot of data. When you use solutions like lakeFS, those copies are actually soft copies, so a very small set of metadata is created instead of copying large amounts of data. When you look at the overall influence, lakeFS actually reduces the amount of storage that you use by around 20%, from what we see from our users. If you add to that the fact that lakeFS provides the ability to control which data you decide to delete in a way that is optimal from a business perspective, you get a really powerful result. On top of the deduplication, you also have the ability to say: I don't want to save more than two months of versions for some datasets; I want to save seven years in cases where I have to because of regulation. Of course, I can decide that branches given to R&D people for short tests are short-lived and are always deleted after 14 days, for example. You have full control, based on business logic, to make sure that you only keep the data that you actually need.

Polak: It completely makes sense. I've been seeing people copying data again and again in production, in staging environments, so it completely resonates, also makes sense to keep track of who copied, and where the data is.

Can Git LFS be used on managed services such as GitHub or GitLab?

Orr: Yes, definitely. They both support that. You do have some limitations on the amount of storage that you can use within a managed service. It's just a click of a button and you get Git LFS running for you, whether you're using Bitbucket, GitHub, GitLab, or any of those solutions.

 


 

Recorded at:

Feb 10, 2023
