InfoQ Homepage Presentations Petastorm: A Light-Weight Approach to Building ML Pipelines

Petastorm: A Light-Weight Approach to Building ML Pipelines

View Presentation

Speed:

Download

42:45

Summary

Yevgeni Litvin describes how Petastorm facilitates tighter integration between Big Data and Deep Learning worlds, simplifies data management and data pipelines, and speeds up model experimentation.

Bio

Yevgeni Litvin is a senior software engineer with Perception team at Uber Advanced Technology Group (ATG). Litvin builds machine learning infrastructure used to train and deploy models on autonomous vehicles.

About the conference

QCon.ai is a practical AI and machine learning conference bringing together software teams working on all aspects of AI and machine learning.

Transcript

Litvin: I'd like to start with a story. I joined Uber ATG and the ATG is a subdivision of Uber that works on developing self-driving technologies about two and a half years ago. Let me tell you what kind of environment I observed.

Deep-Learning for Self Driving Vehicles

We have multiple cars driving around collecting logs. What are these logs? The vehicle that is intended to be a self-driving platform has multiple sensors, multiple cameras, multiple radars, it has the LiDAR, this bucket that spins on top of the roof and it collects lots of data. This data is offloaded to some data center to be stored offline and to be used for training models and analyzing behavior later on.

How does this data look like? Each sensor emits messages with the content. For example, camera would send out camera images, LiDAR would send out sweeps. Each kind of sensor emits messages of different sizes at different frequencies, cameras take 30 frames per second images and LiDAR would emit one sweep every 100 milliseconds. All this data is aggregated into this data center storage.

There are different additional sources of information that are relevant for machine learning which are high-resolution maps and there are also supervision labels, people that sit and draw boxes around vehicles around other pedestrians. Now we need to develop deep-learning models that use all this data to understand the world around us so the vehicle can drive autonomously in the future.

We have multiple research engineers sitting at different sites and trying to develop the best architectures possible and to experiment a lot with the different kind of architectures to produce the best performance possible to allow for the best and safest driving and best self-driving experience. They're using different frameworks because they are coming from different backgrounds, some of them like to use TensorFlow, other ones like to use PyTorch and they want to work with the data.

How did this process of developing a model looked like two years ago? We have multiple researchers, the first thing they need to do is to extract the data. They have three different systems they need to interface with and start querying, getting data and somehow collecting or organizing into tables into records that can be streamed later into the training algorithms. If you're a TensorFlow user, you are likely to be using TFRecords because that's the recommendation from TensorFlow guys. If you're a PyTorch guy, you might be using HDF5 data format or maybe you would be using just the BMG files lying somewhere around not thinking about working at scale because these things cannot go, for example, to HDFS that easily.

Looking at this as an engineer, my conclusion was quite obvious. The work is being done multiple times and the work is not simple because the APIs here are complex. They are not developed only for machine learning, they need to serve different kind of systems that support the organization. The work needs to be done on the cluster usually because we are talking about huge amount of data, the data sets could be tens of terabytes in size. Each role used for training could be multiple megabytes in size so the learning curve is pretty steep both from learning the APIs of the system and both learning how to operate cluster.

This is something that you do once at the beginning of the project, maybe you go and do it again sometimes later but you don't have an opportunity to specialize in doing so. We actually want the research engineers and data scientists to specialize on experimentation and finding the best architectures possible. You end up with that kind of data set by the researcher, the data sets are huge so it becomes a challenge, plus you'll need to track and all the different kind of issues that I'm sure you're all familiar with.

This is the ramble through our story. We want to consolidate all these efforts into one data set and to allow multiple researchers work with a single dataset with minimal amount of effort. Forget about data extraction stage, just go directly and do your work, find the best architecture possible for auto driving.

Our mission as Uber ATG is to introduce self-driving technology through the network in order to make transporting people and goods safer, more efficient, and more affordable around the world but this is just the background on our organization.

Introduction

My name is Yevgeni Litvin. I work with Uber ATG and I'm with perception team. Responsibility of perception is to understand the world around us from the sensor data and provide some predictions on the behavior of the actors around us so we can connect.

In our talk today, we are going to talk about this kind of a one data set approach. How do we bring data closer to the data scientist, to the research engineer and spare this tedious and the non-simple task of creating the data, generating the data? We'll talk which file format we use for this single storage and we'll talk about the library, the data-storing library that we have developed in order to support different ML procedures running directly against this huge storage that is shareable by multiple researchers. When I say one dataset, it doesn't have to be literally one data set, there would be several datasets and where we are updating them frequently but the point is that we don't want to do it for each research project that starts because we experiment a lot, we try many different things.

One Dataset

This is one dataset and if we are able to get there, if we are able to create such a thing, it will be much easier for us to compare models because everyone is training from the same data and everyone is operating on the same data, it becomes trivial to compare. It's easier to reproduce because it's well-known datasets and they go directly to the training, there is no multiple steps we need to track, did I extract this data set I'm training with, with this version of code that had the bug and was fixed later? It's all well documented in one place and we can maintain it.

The ramp up to our engineers, who will be the research engineers is much quicker, if you have a camera image, you know exactly how it looks like, you know what is the name of the field, you know how LiDAR fields looks like, how labels look like, there is no variations on the set. Everyone can jump from one project to another, at least they don't need to think about data whether it looks different or not.

We can move the support and the maintenance of these huge datasets to a infrastructure team so we can gain the knowledge and maybe more and more efficient in creating these datasets, fixing bugs, there is a group of people that specialize specifically in that. What would be the rate inside dataset? We use dataset so since we are talking about sensor data that comes from the vehicle, there's not many variations to think about, the camera image is a camera image. If you keep the model specific pre-processing out of the data set and we push it downstream and you all do all your pre-processing on the fly when you read the data end of the training, if it's possible, in most of the cases it is possible. Then this thing can become just a golden standard of data for all unknown engineers in the organization and there are many research engineers that work with this data. Keep model specific pre-processing out of the data set, keep the data as raw as possible, and develop tools so you can do pre-processing on the fly, you do need tools for that.

You need to be able to access data efficiently, for example, column data access is a must. Imagine you're an engineer working on a project for detecting vehicles from 2-D images. You don't want to load all the LiDAR data, you don't care about the LiDAR data, you just want your columns. Our format and the tools that we are building on top need to support that.

In this image, this frame says I want just these particular columns in my dataset. You would like to be able to focus on particular roles in this dataset, for example, you can say I want to elaborate model specifically on pedestrians, I don't care now looking at all the images that have vehicles or the rest of the objects. You should be able to query your rolls efficiently and maybe it's a combination, I want just the camera data and I only want the camera frames that have pedestrians. This ability is a must if we want to actually have this is one data setup approach implemented.

The research engineers and data scientists wouldn't take anything less than very framework native way to access the data, it must look and feel like this is a TensorFlow or PyTorch primitive bracket. The lines of code, we want to give them access to the data and we’ll show how this looks like if it actually looks like that.

Most of our engineers work with TensorFlow, we do support PyTorch but it's more polished for TensorFlow rather than PyTorch. We chose Apache Parquet file format just because for multiple reasons, first, it's very widely supported by existing tools, existing big data tools like Spark type, we use Spark a lot for our cluster processing, it supports columnar access, you can read only LiDAR data without paying the price for uploading camera data, if this is what the data scientists wants and it provides a row group random access. It's not exactly random access per row but you can load a bunch of rows that are stored together. I won't go specifically into details of Parquet storage format but it allows us to do some nice things with the queries. However, Apache Parquet doesn't support tensors and since tensors is bread and butter for everyone who works in PyTorch or TensorFlow, you want this native look and feel like you have tensor stored in your Parquet store. This is what we need and it doesn't have these abilities on its own.

Petastorm

Petastorm is a library we started developing about one and a half years ago. We open sourced it half a year ago and this is a library that supports our internal processes and it helps us implement this one dataset approach. It has a bunch of features listed here, we'll go a little bit more in deep into most of them in the following slides. The library runs on your training machine within the TensorFlow process space and helps your stream data directly from these parquet stores into deep learning frameworks.

Research Engineer Experience

Before we started, this was our process, just to sum up everything. The Research Engineer and data scientist would first go learn the APIs, we would run the data extraction jawbone class that will be very frustrated because it doesn't run correctly the first time, the second time in each iteration it would take a couple of hours, unpleasant. Then he will go into cycle of developing the architecture, training, evaluation, back and forth multiple times. Then he might figure out that “Well, I had the bug and data extraction” ok, then we start all over again.

He doesn't remember exactly what was done, how did it work the first time or maybe the code changed underneath, this is pretty frustrating. Eventually, we go into deployment and we run the models on the vehicle. Once we've implemented all this infrastructure on all this work was done, the process becomes simpler, you just jump directly into developing your architecture and training that has practically nothing to learn ,the ramp up is really quick.

Two Integration Alternatives

Once we open-sourced Petastorm, we suddenly started getting requests by people saying, "Hey, we don't have any image data. We don't have any LiDAR data, but we have tons of parquet stores in our organization." We want to hook up TensorFlow and PyTorch and train directly from the stores like having data warehouses, tons of data. We don't want to start doing this CTL steps, creating intermediate tests or records or whatever other form of developing framework support."

We went and actually, we disabled a bunch of code to make it easier to do so. If your dataset doesn't have any tensor data but just a bunch of scalas, maybe lists, list of lists, it has only native Parquet data types, you can still use Petastorm and we have a couple of tools that optimize these particular scenario as well. You will get great performance and you don't need to do any additional steps in your training projects.

Extra Schema Information

We said there are two different scenarios, one when we work with Petastorm generated parquet stores which has tensor data, images, etc. and to facilitate this we just have to make Apache Parquet, we add a layer on top of Apache Parquet that adds this additional schema information, we call it uni-schema in Petastorm. You can see that we can specify the shape of the tensor and the type of tensor, we can also specify a specific codec that we would use to serialize our tensor into parquet binary fields and of course on the way back to the serializer. If you have an image you have some a priori knowledge, so it's not just the tensor, you know how to compress it efficiently so it can achieve much better storage compression when you specify it explicitly here, and not using some generic indirect codec, a multidimensional array codec for storing the data.

You can specify the shape, some of the shape dimensions could be unknown, you could have viable number of rows in your matrix. It helps us to do runtime validation, so when we store the data we know that we don't have a bug in the code that just like adds an extra dimension that shouldn't be there. We stack the information especially important for seamless integration and for TensorFlow. TensorFlow lets you define the graph which is strong type, you cannot just have a variable number of dimensions in a tensor, it's supposed to be strong type connections between different operations in the TensorFlow graph. This allows us to do the magic that cooks up Petastorm with TensorFlow seamlessly.

Generating a Dataset

This is the code that generates dataset, if you're familiar with PySpark, you will recognize here are some patterns, if not, we'll go really quick about it. This is something that data engineers specialize in creating right and researchers don't need to know about all this header column. We have this context manager that helps us configure Spark properly and writes the uni-schema and additional metadata at the end of data extraction. By the way, I am pointing out only Petastone specific code that is here. We have this function that turns a dictionary with your row elements into Spark SQL row, using these using our uni schema object to compress the data properly and to check the types. Then we create a data frame, a standard Spark dataframe and we write it as a parquet, this is standard Spark code.

Let's talk about how do we read the data. I promised you that it's going to be really simple, so now you're going to see how this looks like. In the scenario when you have tensors, we have this make reader function call that will determine your reader objects that can function as a regular standard Python iterator. I'm showing you now like pure Python code if you want to access your data just from Python. Still not TensorFlow not PyTorch but if you are doing evaluation you could be reading your samples from your evaluation set and then passing it through your classifier or detector.

Here is pure Python code, you can just iterate, you're going to get numpy arrays back that are stored in your parquet store and we could show we can do whatever we want with it. For the non-Petastorm data set existing repositories with the region in Parquet format, listing conventional data warehouses would create a batch reader. This is a different API but we still can iterate simply each sample is going to be a batch from that dataset. It's more efficient to do it like that.

If you're a TensorFlow person and half of the audience apparently is, this is how it looks like. Makereader we saw in the previous slide, exactly the same function, now we have this reader object, we can use TFTensor. This is Petastorm function to convert the reader into TensorFlow tensor and this structure return here, you can hook up into the rest of your graph transparently.

This is enabled by the uni-schema because we know all the shapes we can do it automatically. My model is just pure TensorFlow code, do whatever you want with the data. Remember we have made batch reader so if you are working with existing data warehouse you can just use make batch reader and you'll get all the same functionality, you get the data stream [inaudible 00:19:43].

PyTorch, similar native look and feel, you use data loader but this data loader is not a standard PyTorch data loader but is from our package. You pass the reader and you can start iterating on the sample, but the implementation is different from the standard PyTorch data loading mechanism.

This is how accessing the LiDAR sweep would looks like within our framework. If you're a data scientist, this is all you need to do to get the LiDAR sweep and start streaming and creating models from LiDAR sweeps. You specify create the reader, you say where the data set is and the path that is well known within the organization. You say I want only the LiDAR data, LiDAR returns, I'm not going to read all the columns that I shouldn't be reading and I can start plotting the points already. This is incomparably much simpler to what we had before.

Reader Architecture

This is a little bit going into the weeds of how the reader is actually structured, as we were going to show some internals. We use Apache Arrow under the hood, a very good library, a better performance library. We use it to load Parquet data, we spin up a pool of workers and it's configurable where we are going to spin up multi-process pool or multi-thread pool. For TensorFlow users, this will become pretty important quite soon because it's not something you get natively with TensorFlow. We spin up a pool of workers that are going to read row groups from parquet, minimal loading units from the parquet repository.

It's going to be coded, it could have PNG codec, they're going to decode later PNG. We have the entire schema information, high-level schema information so we can do all these things automatically. We're going to write the records output to this results scheme and either read it from Python so it will look like numpy or read it with TensorFlow so it's going to look like native TensorFlow tensor. We can also use TF data API, which is a new way of reading data in TensorFlow, it's not that new anymore but it's a modern way of reading the data and you can do whatever you do with data sets or you can use PyTorch code, PyTorch data loader to get your data in PyTorch.

Petastorm Row Predicate

We're going to go through some of the features of Petastorm, these features are pretty important because I show this one data set approach. You need a bunch of tools to enable you actually to work with this data, you need to make sure you're shuffling the data properly, you need to make sure that you can select efficiently just some rows from your data set. The first feature is a row filter, you can specify your piece of a function, a Python function that is going to be executed by the worker to filter out rows that you don't really want to get.

I said I want just pedestrian frames with pedestrian labels, why do we need to do it here? Why don't we do we just in TensorFlow after we get the data? Because we don't want to load sensor data, this is the most expensive part. If I don't have a pedestrian label in my row group, I'm going to skip loading all the rest and this I can do only here inside because I control the way the data is being loaded. I can choose to load first the columns that are used by the predicate by this condition and not load anything if the condition doesn't hold.

Here is an example, I've got a bunch of rows and at the output of the reader, I see only the ones that make my predicate function.

Transform

Transform is an interesting feature especially for TensorFlow users and especially for the case that we are reading data from my organizational data warehouse. Why? We provide a function that also being run here on the worker, it could be a multi-process worker that is going to modify the data on the fly.

Why I said it's interesting for TensorFlow users because this gives you ability to run pure Python code on multiple processors without the limitations of Global Interpreter Lock. If you are a TensorFlow user you have the tools of using tf.python which allows you to run some Python code. It's going to be run inside your process and multiple parallelism is limited by the Global Interpreter Lock. If you are using process tool with TensorFlow, you can actually write heavy Python code that is not GIL friendly and still get a good performance because you are running a multiple process.

This is this something that PyTorch users have like taken for granted because of the data loading pipeline there but for TensorFlow, you get a smart reader. Why it's important for organizational parquet stores? Because frequently we would have various types that are not tensors at all, like lists of lists of variable length. They cannot go into TensorFlow, there is no way to put it in, maybe you could use a rack of tensors but this is tricky.

Our transform function could take any type that is storable in Paquet dataset and convert it into tensors if you basically put your pre-processing code here in this transformed function. Here we have a list of lists and let's say we have a code that converts this list of lists into tensors and this will be hooked up directly into TensorFlow graph and go natively downstream for your training organization.

Local Cache

Another nice feature is local cache. If you have a slow or expensive link and your data is an S3, you can with just one argument here specify that you want to use local cache so the data will not be downloaded twice, depending on the size of your local cache, you can limit it by the size. If you don't trust your caches, you're going to download only once this data and that's it and it's one algorithm.

Sharding

You can use sharding. If you're doing distributed training, often you would want each of your worker to work only on an orthogonal set of data, on a subsample rate. You can specify again when you create the reader that you want to shard your data, let's say into 10 pieces and I want the third piece out of it. You're going to get only this subset of your data.

If you remember the previous slide about local caching, if you have a large data set that doesn't fit to one trainer, to one machine that runs distributed training, you can first shard it and then local cache and maybe that local cache is actually going to fit into your local disk. This was just three parameters to configure the reader you are going to get non-trivial functionality and have a pretty good performance.

NGrams(windowing)

Another interesting example that is pretty unique, maybe not used to solve [inaudible 00:27:17] but for us it's pretty important, when you want to train a model. Imagine you want to train a model that predicts velocity of the vehicle and you have several several images, one after another. You take one to take a sequence of images and predict the velocity. How would you do that? How would you store these samples? The main way is to take one row and to put all these three consequent frames into one row and that now starts streaming.

Imagine for a second it's not 3 images but 10 images, so immediately your dataset grows in size tenfold. If you store your data in a sorted fashion, you might load a sequence of frame and then carve out several sequences out of it and emit the sequences downstream so they would look like this case. In this example if you have 10 consequent frames you achieve 10 times smaller datasets and this is pretty important. We support this kind of access as well in Petastorm.

We say we build a bunch of tools one on top of another, it's all derived from requirements. How do I do this? We add this functionality and now we open source it then it's available for a call.

Conclusion

To conclude, Petastorm has developed the support of this one data set approach, to let the social genius work directly from the data and not work hard on how to create this training data. It extends a patch of parquet to support tensors, to store tensors, and provides a bunch of tools that are a must if you actually want to do this kind of direct training from parquet.

We developed this support to connect to our organizational data warehouses and train directly from them another shortcut that allows you to simplify your pipelines. We still have tons of work to be done, so if anyone is interested, we are planning and looking for specialists to work in this domain.

Questions and Answers

Participant 1: I have one question about the distributed training part. You're saying that you support sharding to different working units, so how is this working? How they set up these things?

Litvin: We use the Horovod library which is also developed internally in Uber and it's quite popular, it's used by many organizations. We do set up the distributed training to work with horabord but we have internal classified infrastructure that makes it very easy and simple to implement. Then eventually, you have multiple trainers that from the data perspective, at least, let's say I have 64 trainers. They are called ranks, so each trainer has its own rank. You would say I have 64, I would say that you pass here 64 as number of shards and here, each trainer would substitute this number with its own rank. Then it would basically start looking at his shard of the data and iterate over it.

Participant 1: So a static sharding crew based on the notes that you have.

Litvin: You would configure probably here one of parameters we didn't talk about is specified number of epochs, so you would say it would be infinite epochs. Between each epoch, the roles would be reshuffled, so the same trainer would continue working on the same subset of data in constantly reshuffling, in a reshuffled fashion.

Participant 1: One more question. You asked about the field and the transform which we'll be doing after you get a data pack and then you do the transform in the future. My question is why are you not doing it from the source like a passbox, that you need to print the internal data and load it to memory, then to the field, and then do the transform.

Litvin: If I understand correctly, your question is why not to do the transform in advance and store the data transform [crosstalk 00:32:16]. This is another approach but this is exactly the difference between one data set approach and better project approach. Both of them are valid, we chose to use this, we see certain benefits. For us, what was important is to reduce the amount of these hopes of EPR hops. We don't need to like to spend too much energy in managing and thinking what are these intermediate private directories that are lying or all around, and thinking what kind of data they exactly contain. We just say we want to streamline the entire process, so we train from the raw data and we do all the processing on the fly. You have CPU power available and you can ask for more CPUs and let them work on pre-processing. Just by paying this price of doing the multiple times basically the same pre-processing, you simplify the whole process.

Participant 1: I can understand this part, but using PySpark you have a way to defend this scala notion of UTF which seems 50 or more hundreds times better than Python video player. Doesn't it mean performance issue?

Litvin: I'm talking about our particular scenario, what's right for us, usually we are spending most of the time loading these huge chunks of data and then going through back and forth in the network. It doesn't matter if this particular code runs 10000 times slower than if it's a native founder's expression or a native arrow expression or running on PySpark on gazillion of nodes. This time for us at least it's neglectable.

Participant 2: [inaudible 00:34:28].

Litvin: Let me repeat your question, the question is whether this is a parquet push down or whether we can actually use this mechanism from Parquet. The answer is we thought that we would be able to do that, we didn't do it. Eventually, it wasn't very beneficial for us and actually, when you get to more complex types, for example, each row has a list of labels, you want to check if any of them is a pedestrian. This is not something that is easy to push down effectively downwards. We are using Apache Parquet but we do feel that it's not exactly built for our scenarios. Once we started growing this huge rows of tens of megabytes, we end up with pretty small row groups and this kind of fights all the optimization that parquet has underneath.

We feel it's not exactly going to support us much more if we continue scaling our data and we will. We are starting to think about other solutions, other back-end storage solutions. However, since parquet provides a high-level front end, it gives us freedom to select what kind of back end storage are we going to use without disrupting any of the downstream processes. This separation layer, like high-level data access library with Python back ends gives us huge amount of flexibility. I would say that parquet probably will be running to a dead end for us while we continue increasing the amount of data we store in one row.

Participant 3: I just wanted to ask you about the relationship between your online data systems, for example, whose car is this? What's the current location of the car? Things that might be mutable and things that might be relational in nature and how that would relate to this data store and also how you would ingest that kind of data into this data store.

Litvin: I would say that the way the data is being logged on the vehicle is sequential logs of everything that is going on. I could see heterogeneous messages written one after another, these data centers we are talking about are focused on a certain slice in time and say that many milliseconds window. We can go and pick whatever color or small vector data we want into this row to use it also in training. Maybe you're talking about the graphical location like latitude, longitude, nothing prevents us from adding more columns that describe that. Does it answer your question?

Participant 3: What I'm trying to get at more generically is the ingestion process. You were talking about the cars logging events or omitting events that is consumed by this system so are there aspects of that that are consumed by other systems?

Litvin: For sure.

Participant 3: Is the architecture that these messages can have many consumers of them or are the other systems also using this data store?

Litvin: This store is fine-tuned for machine learning, we want to simplify data scientists' work as much as possible. However, if you remember in the first slides, we have this huge systems with raw AV data, with labels, with maps. These are big systems with a lot of different consumers that serve different processes within the organization.

Participant 4: ATG is a different group, totally different company from Uber, not different company but different group. All this telemetry gets lands Uber where different groups can access it and then this group is one group that acts as if that telemetry data.

Litvin: Yes, we are quite separate from main Uber fleet that are being used daily by Uber consumer, but we also think about our vehicle as becoming part of the network into the future. For now, we were talking specifically about self-driving. The models that enabled say self-driving, so it is not about payments or rides.

Participant 5: One question is what happens if you change something intrinsic about a vehicle? You already said that if you add longitude-latitude, you can add columns, but what if you move the camera at one point or take in your LiDAR. How do you update your back end single data source?

Litvin: I think it's pretty important in general for autonomous driving discipline to have consistent fleet. If you don't have consistently fleet or your fleet upgrades all the time you will have a lot of issues with creating this dataset.

Participant 5: But on the other hand, I can imagine that your engineers at one point want to experiment where they want to put on a new LiDAR, for example, and they want to collect a bit of data, label a bit of it, and then train not only that bit of it or add it to the same single dataset?

Litvin: That's why one data set is not exactly one dataset, there is a mainstream of large experimentation that is going on. However, if you want to experiment with new sensors or sensor configuration you are likely to use the same tools to generate other instances that are not going to pollute main line of work. There is only one data set in mobile ATG, but this is a way to allow the data and be closer to your researchers.

Participant 6: With this one data set workflow how do you handle the different partitioning that happens in the different datasets? How does that work? For the columnar data store in parquet, you would have to partition it to make it efficient?

Litvin: In which sense? Can you elaborate?

Participant 6: If you convert into parquet, you would have to have some kind of IDs or some indexing?

Litvin: Yes. We enumerate our row groups, we use immutable data sets, right. Once the dataset is created, we are not creating additional partitions, at least for now, so it simplifies thinking about your data right when your data is immutable and we do use IDs of the row groups in order to build the datasets.

Participant 7: I've got a question about the interface to PyTorch because you said you have your own data loader. Is it still working with the native distributed package in PyTorch? Because especially if you have your own sharding, normally you integrate it into the data loader object.

Litvin: Yes. Everything that is behind our version of data loader is different and there is a good reason for that, I think it's good. The PyTorch users are used to be like the dataset recognizing data set requirement that you can access your row randomly by index. Apache Parquet is exactly the opposite of that, we have to work in chunks. It violates immediately a bunch of expectations from PyTorch users, you cannot do stratified sampling, you cannot have fully IAB samples. You need to shuffle and then hope they're pretty IAB. We replaced both the backend and the multiprocessing mechanism, we replace it with the one that already exists for us. For us, it's good to have the same code base, for PyTorch user, it might be a little bit uncomfortable to start working like this. However, you should realize from some data types it's much more efficient to load them in batches, but this is not something that would be trivial for PyTorch users.

See more presentations with transcripts

Recorded at:

Jun 11, 2019

Yevgeni Litvin

InfoQ Software Architects' Newsletter