
Netflix Drive: Building a Cloud Native Filesystem for Media Assets


Summary

Tejas Chopra discusses Netflix Drive, a generic cloud drive for storing and retrieving media assets, which are collections of media files and folders at Netflix.

Bio

Tejas Chopra is a Senior Software Engineer on the Data Storage Platform team @Netflix.

About the conference

QCon Plus is a virtual conference for senior software engineers and architects that covers the trends, best practices, and solutions leveraged by the world's most innovative software organizations.

Transcript

Chopra: My name is Tejas Chopra. I'll be talking about Netflix Drive. I work on infrastructure solutions, specifically software that deals with billions of assets and exabyte-scale data that is generated and managed by Netflix studios and platforms.

Outline

We'll go over a brief explanation of what Netflix Drive is, some of the motivations for creating software such as Netflix Drive, the design and architecture, the lifecycle of a typical instance of Netflix Drive, and some learnings from the process.

What Is Netflix Drive?

You can think of Netflix Drive as edge software that runs on studio artists' workstations. It's a multi-interface, multi-OS cloud file system, and it is intended to provide the look and feel of a typical POSIX file system. In addition to that, it also behaves like a microservice in that it has REST endpoints. It has backend actions that are leveraged by a lot of the workflows and automated use cases where users and applications are not directly dealing with files and folders. Both interfaces, the REST endpoints as well as the POSIX interface, can be leveraged together for a Netflix Drive instance; they are not mutually exclusive. The other main aspect of Netflix Drive is that it is a generic framework. We intended it to be generic so that different types of data and metadata stores can be plugged into the Netflix Drive framework. You could imagine Netflix Drive working on cloud datastores and metadata stores, as well as hybrid datastores and metadata stores. For example, you could have Netflix Drive with DynamoDB as the metadata store backend and S3 as the datastore backend. You could also have MongoDB and Ceph Storage as the backend datastore and metadata store for Netflix Drive. Event and alerting backends can also be configured as part of the framework, and eventing and alerting are first-class citizens in Netflix Drive.

Motivations

Let us get into some of the motivations for having Netflix Drive. Netflix in general is pioneering the idea of a studio in the cloud. The idea is to give artists the ability to work from different corners of the world on creating stories and assets to entertain the world. In order to do so, the platform layer needs to provide a distributed, scalable, and performant infrastructure. At Netflix, assets, which you can think of as collections of files and folders, have data and metadata that are stored and managed by disparate systems and services. Starting at the point of ingestion, where data is produced right out of the camera, to the point where the data eventually makes its way into movies and shows, these assets get tagged with a variety of metadata by different systems based on the workflow of the creative process. At the edge, where artists work with assets, the artists' applications and the artists themselves expect a file and folder interface so that there can be seamless access to these assets. We wanted to make working with studio applications a seamless experience for our artists. This is not just restricted to artists; it can actually extend beyond the studio use case as well. A great example is all the asset transformations that happen during the rendering of content, for which Netflix Drive is being used today.

The other thing is that studio workflows have a need to move assets across various stages of the creative iterations. At each stage, a different set of metadata gets tagged onto an asset. We needed a system that could provide the ability to add and support attaching different forms of metadata to the data. Along with that, there is also a level of dynamic access control, which can change per stage and which projects only a certain section of assets to the applications, users, or workflows. Looking at all of these considerations, we came up with the design of Netflix Drive, which can be leveraged in multiple scenarios. It can be used as a simple POSIX file system that stores data in the cloud and retrieves data from the cloud, but it also has a much richer control interface. It is a foundational piece of storage infrastructure to support a lot of Netflix studios' and platforms' needs.

Netflix Drive Architecture

Let us dive a bit into the architecture of Netflix Drive. Netflix Drive actually has multiple types of interfaces. The POSIX interface allows simple file system operations, such as creating a file, deleting a file, opening a file, renames, moves, close, all of that. The other interface is the API interface. It provides a control interface and a controlled I/O interface. We also have events and telemetry as a first-class citizen of the Netflix Drive architecture. The idea is that different types of event backends can be plugged into the Netflix Drive framework. A great example of where this may be used is audit logs that keep track of all the actions that have been performed on a file or a set of files by different users. We've also abstracted out the data transfer layer. This layer abstracts the movement of data from the different types of interfaces that are trying to move the data. It deals with bringing files into a Netflix Drive mount point on an artist's workstation or machine, and pushing files to the cloud as well.

Getting a bit deeper into the POSIX interface, it deals with the data and metadata operations on Netflix Drive. All the files that are stored in Netflix Drive get read, write, create, and other requests from different applications and users, and there can also be separate scripts and workflows that perform these operations. This is similar to any live file system that you use.

The API interface is of particular interest to a lot of workflow management tools or agents. It exposes some form of control operations on Netflix Drive. The idea is that a lot of these workflows that are used in the studio actually have some notion and awareness of assets or files. They want to control the projection of these assets onto the namespace. A simple example is, when Netflix Drive starts up on a user's machine, the workflow tools will only allow a subset of the large corpus of data to be available to the user to view initially. That is managed by these APIs. They are also available to perform dynamic operations, such as uploading a particular file to the cloud, or downloading a specific set of assets dynamically, surfacing them, and attaching them at specified points in the namespace.

Events carry telemetry information. You could have a situation where you want audit logs, metrics, and updates to all be consumed by services that run in the cloud. Making it a generic framework allows different types of event backends to be plugged into the Netflix Drive ecosystem.

The data transfer layer is an abstraction that deals with transferring data out of Netflix Drive to multiple tiers of storage. Netflix Drive does not send the data inline to the cloud, for performance reasons. The expectation is that Netflix Drive will perform as close as possible to a local file system. What we do is leverage local storage, if available, to store the files. Then we have strategies to move the data from local storage to the cloud. There are two typical ways in which our data is moved to the cloud. The first is dynamically issued APIs, exposed by the control interface, that allow workflows to move a subset of the assets to the cloud. The other is auto-sync, which is the ability to automatically sync all the files that are in local storage to the cloud. You can think of this the same way that Google Drive stores your files to the cloud. Here, we have different types of tiers in cloud storage as well. We have specifically called out media cache and Baggins here. Media cache is a region-aware caching tier that brings data closer to the edge at Netflix, and Baggins is our layer on top of S3 that deals with chunking and encrypting content.
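
As a rough illustration of the two transfer strategies described above, here is a minimal C++ sketch of a data transfer layer with pluggable cloud tiers: an explicit, API-driven save of a single asset, and an auto-sync sweep of the local cache. The class and method names (CloudTier, TransferManager) are illustrative assumptions, not the actual Netflix Drive implementation.

```cpp
#include <filesystem>
#include <memory>
#include <string>
#include <vector>

// Hypothetical sketch: a data transfer layer that moves files from local
// storage through a chain of cloud tiers (e.g. a region-aware media cache
// in front of object storage). Names are illustrative only.
class CloudTier {
public:
    virtual ~CloudTier() = default;
    // Upload one locally cached file under a logical namespace path.
    virtual bool upload(const std::filesystem::path& localFile,
                        const std::string& namespacePath) = 0;
};

class TransferManager {
public:
    explicit TransferManager(std::vector<std::unique_ptr<CloudTier>> tiers)
        : tiers_(std::move(tiers)) {}

    // Explicit, API-driven transfer: a workflow asks for one asset to be saved.
    bool save(const std::filesystem::path& localFile,
              const std::string& namespacePath) {
        for (auto& tier : tiers_) {
            if (!tier->upload(localFile, namespacePath)) return false;
        }
        return true;
    }

    // Auto-sync: sweep the local cache and push everything, Google Drive style.
    void autoSync(const std::filesystem::path& localCacheRoot) {
        for (const auto& entry :
             std::filesystem::recursive_directory_iterator(localCacheRoot)) {
            if (entry.is_regular_file()) {
                save(entry.path(),
                     std::filesystem::relative(entry.path(), localCacheRoot).string());
            }
        }
    }

private:
    std::vector<std::unique_ptr<CloudTier>> tiers_;
};
```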

Overall, the picture of the Netflix Drive architecture looks as follows. You have the POSIX interface that handles data and metadata operations. The API interface deals with different types of control operations. The event interface tracks all the state change updates. In fact, events are also used to build on top of Netflix Drive; notions such as shared files and folders can be built using this event interface. Then, finally, we have the data transfer interface that abstracts moving the bits in and out of Netflix Drive to the cloud.

Netflix Drive Anatomy

Let us now discuss some of the underpinnings of the design that go back to the motivation, by discussing the anatomy of Netflix Drive. Here is some terminology that we'll use. CDrive is a studio asset-aware metadata store that is used at Netflix. Baggins is Netflix's S3 datastore layer that deals with chunking and encrypting content before pushing it to S3. Media cache is an S3 region-aware caching tier whose intention is to bring the data closer to the applications and users. Intrepid is an internally developed, highly leveraged transport protocol used by a number of Netflix applications and services to transfer data from one service to another.

This is a picture of the Netflix Drive interface, or Netflix Drive in general. We have the interface layer, which is the top layer, and this has all the FUSE file handlers alongside the REST endpoints. The middle layer is the storage backend layer. One thing to note is that Netflix Drive provides a framework where you can plug and play different types of storage backends. Here we have the abstract metadata interface and the abstract data interface. In our first iteration, we have used CDrive as our metadata store, and Baggins and S3 as our datastore. Finally, we have the Intrepid layer, which is the transport layer that transfers the bits from and to Netflix Drive. One thing to note is that Intrepid is not just used to transport the data, but here it is also used to transfer some aspects of the metadata store as well. This is needed to save some state of the metadata store on cloud.

To look at it another way, we have the abstraction layers in Netflix Drive. You have libfuse, because we are using a FUSE-based file system, which handles the different types of file system operations. You initially start Netflix Drive and bootstrap it with a manifest. You have your REST APIs and control interface as well. The abstraction layer abstracts the default metadata stores and datastores. You can have different types of data and metadata stores here. In this particular example, we have the CockroachDB adapter as the metadata store and an S3 adapter as the datastore. We can also use different types of transfer protocols, and they are also a plug-and-play interface in Netflix Drive. The protocol layer that is used can be REST or gRPC. Finally, you have the actual storage of data.
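
To make the plug-and-play idea concrete, here is a minimal C++ sketch of what abstract metadata and data interfaces might look like, with the FUSE and REST layers programming only against the abstractions. All class and method names here are hypothetical; the real adapters (CDrive, CockroachDB, S3, Baggins) sit behind interfaces of Netflix's own design.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

// Hypothetical sketch of the abstraction layer: the drive core only ever
// sees these interfaces, and concrete adapters are plugged in at bootstrap.
struct FileMetadata {
    std::string namespacePath;
    std::size_t sizeBytes = 0;
    std::string contentHash;
};

class MetadataStore {
public:
    virtual ~MetadataStore() = default;
    virtual void put(const FileMetadata& meta) = 0;
    virtual FileMetadata get(const std::string& namespacePath) = 0;
};

class DataStore {
public:
    virtual ~DataStore() = default;
    virtual void write(const std::string& key, const std::vector<char>& bytes) = 0;
    virtual std::vector<char> read(const std::string& key) = 0;
};

// The FUSE handlers and REST endpoints call into a core like this one,
// never into a specific backend directly.
class DriveCore {
public:
    DriveCore(std::unique_ptr<MetadataStore> meta, std::unique_ptr<DataStore> data)
        : meta_(std::move(meta)), data_(std::move(data)) {}

    void createFile(const std::string& path, const std::vector<char>& bytes) {
        data_->write(path, bytes);                      // datastore adapter
        meta_->put({path, bytes.size(), /*hash*/ ""});  // metadata adapter
    }

private:
    std::unique_ptr<MetadataStore> meta_;
    std::unique_ptr<DataStore> data_;
};
```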

This shows the different services and how they are split between the workstation and the cloud. You have the typical Netflix Drive API and POSIX interface on the workstation machine that sends the bits and bytes to the transport agent and library. You have a bunch of services in the cloud as well: your metadata store, which is CDrive in our case; a media cache, which is a middle caching tier of storage; and finally object storage in S3. Netflix Drive on your local workstation talks to the metadata store and the datastore using the transport agent and the library.

One thing to note here is that we also use local storage to cache the reads and the writes, and to deliver a lot of the performance that users expect out of Netflix Drive. Security is a first-class citizen in Netflix Drive. We wanted to provide two-factor authentication on Netflix Drive. The reason is that a bunch of these cloud services are actually used by a lot of applications, and they front the entire corpus of assets in Netflix. It is essential to make these assets secure, and to only allow users that have proper permissions to view the subset of assets that they are allowed to use and view.

Typical Lifecycle

Let us discuss a typical lifecycle of Netflix Drive and some runtime aspects of it. Given the ability of Netflix Drive to dynamically present namespaces and bring together disparate datastores and metadata stores, it is essential to discuss the lifecycle. This may not be the case in typical file systems, where you do not necessarily have such a stream of events in the lifecycle. In Netflix Drive's case, we initially bootstrap Netflix Drive using a manifest. An initial manifest could be an empty manifest as well. You have the ability to allow workstations or workflows to download some assets from the cloud and preload and hydrate the Netflix Drive mount point with this content. The workflows and the artists would then modify these assets. They will periodically either snapshot using explicit APIs, or leverage the auto-sync feature of Netflix Drive, and upload these assets back to the cloud. This is how a typical Netflix Drive instance runs.

Let us get into the bootstrapping part of it. During the bootstrap process, Netflix Drive typically expects a mount point to be specified; some amount of user identity, for authentication and authorization; the location of the local storage, where the files will be cached; the endpoints, namely the metadata store endpoint and the datastore endpoint; optional fields for preloading content; and also a persona. Netflix Drive is envisioned to be used by different types of applications and workflows. The persona gives Netflix Drive its flavor when working for applications. For example, a particular application may specifically rely on the REST control interface because it is aware of the assets, and so it will explicitly use APIs to upload files to the cloud. Some other application may not necessarily be aware of when it wants to upload files to the cloud, so it would rely on the auto-sync feature of Netflix Drive to upload files to the cloud in the background. That is defined by the persona of Netflix Drive.

Here is a sample bootstrap manifest. A particular Netflix Drive mount point can have several Netflix Drive instances that are separate from each other. You have a local file store, which is the local storage used by Netflix Drive to cache the files. The instances get manifested under the mount. In this case, we have two separate instances, a dynamic instance and a user instance, with different backend datastores and metadata stores. The first instance, which is a dynamic instance, has a Redis metadata store and an S3 datastore. You also uniquely identify a workspace for data persistence. The second one has CockroachDB as the metadata store and Ceph as the datastore.
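
A hedged sketch of how such a manifest could be modeled in code, using the two instances from the example above. The real manifest is a configuration file; the field names, paths, and persona value below are illustrative assumptions only.

```cpp
#include <string>
#include <vector>

// Hypothetical model of the bootstrap manifest described above.
struct InstanceConfig {
    std::string name;           // e.g. "dynamic" or "user"
    std::string metadataStore;  // e.g. "redis", "cockroachdb"
    std::string dataStore;      // e.g. "s3", "ceph"
    std::string workspace;      // uniquely identifies data persistence
};

struct BootstrapManifest {
    std::string mountPoint;                // where the drive is exposed
    std::string localFileStore;            // local cache location
    std::string persona;                   // interactive vs. workflow-driven
    std::vector<InstanceConfig> instances; // several instances under one mount
};

// The sample manifest from the talk: two instances under one mount.
const BootstrapManifest kSampleManifest{
    /*mountPoint=*/"/mnt/netflixdrive",         // illustrative path
    /*localFileStore=*/"/var/cache/netflixdrive",
    /*persona=*/"interactive",
    {
        {"dynamic", "redis", "s3", "workspace-dynamic"},
        {"user", "cockroachdb", "ceph", "workspace-user"},
    },
};
```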

The namespace of Netflix Drive is all the files that are viewed inside Netflix Drive. There are two options to actually create the namespace. Netflix Drive can create the namespace statically at bootstrap time, where you specify the exact files and folders that you need to pre-download and hydrate your current instance with. For this, you present a file session and Netflix Drive container information. You have workflows that can prepopulate your Netflix Drive mount point with some files, so that subsequent workflows can then be built on top of it. The other way to hydrate a namespace is to explicitly call Netflix Drive APIs on the REST interface. In this case, we use the stage API to stage the files, pull them from the cloud, and attach them to specific locations in our namespace. One thing to note is that these two interfaces are not mutually exclusive.
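
As an illustration of the second option, here is a small C++ sketch (using libcurl) of a workflow calling a staging endpoint on a local Netflix Drive instance to pull files from the cloud and attach them in the namespace. The endpoint URL, port, and JSON body are assumptions for illustration; the actual Netflix Drive REST API is not documented here.

```cpp
#include <curl/curl.h>
#include <string>

// Hypothetical sketch: POST a staging request to a locally running drive
// instance. The path "/api/v1/stage" and the JSON fields are illustrative.
bool stageFiles(const std::string& sessionId, const std::string& attachPoint) {
    CURL* curl = curl_easy_init();
    if (!curl) return false;

    const std::string url = "http://localhost:8080/api/v1/stage";  // assumed
    const std::string body =
        "{\"session\":\"" + sessionId + "\",\"attach_at\":\"" + attachPoint + "\"}";

    struct curl_slist* headers =
        curl_slist_append(nullptr, "Content-Type: application/json");
    curl_easy_setopt(curl, CURLOPT_URL, url.c_str());
    curl_easy_setopt(curl, CURLOPT_HTTPHEADER, headers);
    curl_easy_setopt(curl, CURLOPT_POSTFIELDS, body.c_str());

    const CURLcode rc = curl_easy_perform(curl);  // issue the staging request

    curl_slist_free_all(headers);
    curl_easy_cleanup(curl);
    return rc == CURLE_OK;
}
```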

Update

Let us now get into some of the Netflix Drive operations. Modifications to Netflix Drive content can happen through the POSIX interface or the REST APIs. File system POSIX operations that can modify a file would be open, rename, move, read, write, and close. There is also a subset of REST APIs that can be used to modify a file: for example, staging a file, which pulls the file from the cloud; checkpointing a file; and saving a file, which explicitly uploads the file to the cloud. An example of how a file is uploaded to the cloud is using the publish API. We have the ability to autosave files, which periodically checkpoints the files to the cloud, and the ability to also have an explicit save. The explicit save would be an API that is invoked by different types of workflows to publish content.

A great example of where these different APIs can be used is the case where artists are working on a lot of ephemeral data. A lot of this data does not have to make it to the cloud because it's work in progress. In that case, for those workflows, explicit save is the right call. Once they are sure of the data and they want to publish it to the cloud to be used by subsequent artists or in subsequent workflows, that's when they would invoke this API. It would snapshot the files in the Netflix Drive mount point, then pick them up, deliver them to the cloud, and store them there under the appropriate namespace. That is where you can see the difference between autosaving, which is like the Google Drive way of saving files, and an explicit save that is called by artists or workflows.
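
A minimal sketch of the two modes, assuming a hypothetical checkpoint callback: a background autosave loop that periodically checkpoints dirty files, versus an explicit publish invoked by a workflow once the work is ready.

```cpp
#include <atomic>
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical sketch of the autosave persona: a background thread that
// periodically checkpoints dirty files to the cloud. The callback type and
// interval are illustrative assumptions.
class AutoSaver {
public:
    AutoSaver(std::function<void()> checkpointDirtyFiles,
              std::chrono::seconds interval)
        : checkpoint_(std::move(checkpointDirtyFiles)), interval_(interval) {}

    void start() {
        running_ = true;
        worker_ = std::thread([this] {
            while (running_) {
                std::this_thread::sleep_for(interval_);
                if (running_) checkpoint_();  // Google Drive style background sync
            }
        });
    }

    void stop() {
        running_ = false;
        if (worker_.joinable()) worker_.join();
    }

    ~AutoSaver() { stop(); }

private:
    std::function<void()> checkpoint_;
    std::chrono::seconds interval_;
    std::atomic<bool> running_{false};
    std::thread worker_;
};

// An explicit save, by contrast, is only invoked when an artist or workflow
// decides the work-in-progress is ready to be published, e.g. a hypothetical
//   drive.publish("/show/episode01/shot042");
```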

Learnings

Given that Netflix Drive is used in multiple personas by different types of workflows, here are some of the learnings that we had while developing Netflix Drive. The number one learning was that there were several points at which we had to make different choices for our architecture. We intended it to be a generic framework that can have any datastore and metadata store plugged into it. Also, a lot of our architectural choices were dictated by the performance and latency aspects of files, workflows, the artists' workstations, and the artists' experience that we wanted to provide using Netflix Drive. A great example is that we used a FUSE-based file system. We implemented a lot of the code of Netflix Drive in C++. We compared other languages and concluded that C++ gave us the best performance results, and performance was a critical feature of Netflix Drive that we wanted to provide.

The other is that designing a generic framework for several operating systems is very difficult. In our case, we support Netflix Drive on CentOS, OS X, and Windows. We leverage the FUSE file system, so we had to investigate a lot of the alternatives for FUSE-based file systems on these different operating systems. That also multiplied our testing matrix and our supportability matrix.

The third learning is that the key to scalability is handling metadata. In our case, because we work with disparate backends and we have different layers of caching and tiering, we rely heavily on metadata operations being cached in Netflix Drive. That gives us great performance for a lot of the studio applications and workflows that are very metadata heavy. Having multiple tiers of storage can definitely provide performance benefits. When we designed Netflix Drive, we did not restrict ourselves to just local storage or cloud storage; we in fact wanted it to be built in a way that different tiers of storage can easily leverage the Netflix Drive framework and be added as a backend for Netflix Drive. That came through in our design, in our architecture, and in our code.
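
As a simple illustration of the metadata-caching idea, here is a small C++ sketch of an LRU cache keyed by namespace path that a drive could consult before going to the cloud metadata store. This is a generic cache sketch, not the actual caching logic inside Netflix Drive.

```cpp
#include <cstddef>
#include <list>
#include <optional>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical workstation-side metadata cache: recently used entries are
// served locally; a miss means the caller falls back to the cloud store.
class MetadataCache {
public:
    explicit MetadataCache(std::size_t capacity) : capacity_(capacity) {}

    void put(const std::string& path, const std::string& metadataJson) {
        auto it = index_.find(path);
        if (it != index_.end()) lru_.erase(it->second);
        lru_.push_front({path, metadataJson});
        index_[path] = lru_.begin();
        if (lru_.size() > capacity_) {        // evict least recently used
            index_.erase(lru_.back().first);
            lru_.pop_back();
        }
    }

    std::optional<std::string> get(const std::string& path) {
        auto it = index_.find(path);
        if (it == index_.end()) return std::nullopt;  // miss: go to cloud store
        lru_.splice(lru_.begin(), lru_, it->second);  // mark as most recent
        return it->second->second;
    }

private:
    using Entry = std::pair<std::string, std::string>;
    std::size_t capacity_;
    std::list<Entry> lru_;
    std::unordered_map<std::string, std::list<Entry>::iterator> index_;
};
```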

Having a stacked approach to software architecture was very critical for Netflix Drive. A great example is, again, the idea of shared namespaces. We are currently working on the ability to share files between different workstations or between different artists. This is built on top of our eventing framework, which is itself part of the Netflix Drive architecture. When one Netflix Drive instance has a file added to its namespace, it generates an event, which is consumed by different cloud services, which then use the REST interface of the other Netflix Drive instance to inject that file into that instance's namespace. This is how you can build on top of the existing primitives of Netflix Drive.

Resources

If you would like to learn more about Netflix Drive, we have a tech blog available on the Netflix tech blog channel.

Questions and Answers

Watson: Being an application you built natively on the cloud, what was your biggest challenge around scalability? I assume everything wasn't just flawless day one. Did you have to make some technical tradeoffs to achieve the scale you wanted given how many assets Netflix handles?

Chopra: We are targeting Netflix Drive to serve exabytes of data and billions of assets. Designing for scalability was one of the cornerstones of the architecture itself. When we think in terms of scaling a solution in the cloud, oftentimes we think that the bottleneck would be the datastore. Actually, it is the metadata store that becomes the bottleneck. We focused a lot on metadata management and how we could reduce the number of calls that are made to metadata stores. Caching a lot of that data locally in Netflix Drive was something that gave us great performance.

The other thing is, in terms of the datastore itself, we explored having file systems in the cloud, like EFS. With file systems, you cannot scale beyond a point without impacting your performance. If you really want to serve billions of assets, you need to use some form of object store and not a file store. That meant that the files and folders our artists are used to had to be translated into objects. The simplest thing to do is have a one-to-one mapping between every file and an object. That is very simplistic, because sometimes file sizes may be bigger than the maximum object size supported. You really want the ability to have a file mapped to multiple objects, and therein lies the essence of deduplication as well, because if you change a pixel in the file, you then only change the object that has that chunk of the file. Building that translation layer was a tradeoff and was something that we did for scalability. These are examples of how we designed for the cloud, thinking about the scalability challenges there.
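
A minimal sketch of the file-to-object translation idea: split a file into fixed-size chunks, key each chunk by a content hash, and record the ordered chunk keys in a per-file manifest, so that a change to one chunk only requires re-uploading that one object. The 64 MB default and the use of std::hash as a stand-in for a real content hash (such as SHA-256) are assumptions for illustration.

```cpp
#include <cstddef>
#include <fstream>
#include <functional>
#include <string>
#include <vector>

// Hypothetical file-to-object translation: one file maps to an ordered list
// of chunk keys; each chunk is an independent object in the object store.
struct FileManifest {
    std::string path;
    std::vector<std::string> chunkKeys;  // order matters for reassembly
};

FileManifest chunkFile(const std::string& path,
                       std::size_t chunkSize = 64 * 1024 * 1024) {
    FileManifest manifest{path, {}};
    std::ifstream in(path, std::ios::binary);
    std::vector<char> buffer(chunkSize);

    while (in) {
        in.read(buffer.data(), static_cast<std::streamsize>(buffer.size()));
        const std::streamsize n = in.gcount();
        if (n <= 0) break;
        // Key the chunk by its content; a real system would use a
        // cryptographic hash, not std::hash.
        const std::size_t h = std::hash<std::string>{}(
            std::string(buffer.data(), static_cast<std::size_t>(n)));
        manifest.chunkKeys.push_back(std::to_string(h));
        // Upload this chunk only if the key is not already present in the
        // object store (deduplication); unchanged chunks are never resent.
    }
    return manifest;
}
```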

Watson: How would you compare Netflix Drive with other cloud storage drive systems like Google Drive?

Chopra: That was one of the first things that we were asked when we were designing Netflix Drive. We compared the performance of Google Drive with Netflix Drive. The advantage that Netflix Drive has is that we have a tiered storage architecture, where we leverage our media cache. Media cache is nothing but a media store, a caching layer that is closer to the user and the application. We also cache a lot of data on the local file store. Google Drive doesn't do that. We could always get local file system performance compared to Google Drive. If you remove the local file system from the picture, even then, because we have our media caches that are spread across the globe, we can still perform much better than Google Drive.

Watson: It probably doesn't help to log your data in S3 as well, I don't know if Google Drive does well.

What prompted you to invent Netflix Drive over, for example, using the AWS File Storage Gateway?

Chopra: One of our cornerstones for thinking about Netflix Drive is security. We wanted to design a system that could provide our artists, who are globally distributed, a secure way of accessing only the data that is relevant to them. We investigated AWS File Storage Gateway, but the performance and the security aspects of it did not match our requirements. The second thing is that File Storage Gateway, when we investigated it, could not, I believe, translate files into objects on the backend. For us, we really wanted the power to have these objects closer to the user in our media caches, and to control where these objects are stored. A great example is the case where, let's say, multiple artists are working on an asset. If every iteration of these assets is stored in the cloud, your cloud costs will explode. What we wanted to do was enable these assets to be stored in media caches, which is something that we own, so that only the final copy goes to the cloud. That way, we can leverage a hybrid infrastructure and control what gets pushed to the cloud versus what stays locally or in shared storage. Those are parameters that were not provided to us by File Storage Gateway.

Watson: We have similar challenges at Pinterest, where people will say, why don't you use EKS? Ninety percent of the work we do is integrating our paths around the cloud offering. It's like you still have to do all that work with security.

How many requests can your system handle in an hour? Do you have any sense of like rough metrics?

Chopra: I ran fio to compare Netflix Drive with other alternatives, and we perform much better than Google Drive. Google Drive, I think, has a lot of limits on the number of folders and the number of files that you can put in; Netflix Drive has no such limits.

Would EFS work as a substitute to some extent for Netflix Drive?

EFS could have worked as a substitute, but EFS will not scale beyond a point. Because, in general, if you have to design scalable systems, you have to pick object stores over file stores. Object stores give you more scalability, and you can also cache pieces of objects. Also, deduplication is a challenge because if you change a pixel of a file, then you have to probably have that synced to EFS, and have it stored as a file in multiple tiers of caches. That is not something that we felt was giving us the performance that we needed, so we went with object stores.

Why was CockroachDB selected as a database? Was there a specific use case for it? Did you guys compare it with other options?

We actually have a lot of databases that are built inside Netflix, and we have layers on top of different databases. CockroachDB was the first one that we picked because it gave us a seamless interface to interact with it. We wanted to leverage something that was built in-house, and it's a SQL database that is horizontally scalable. That's the reason why we went with it. Also, you can deploy it on-premise or in the cloud, so if we ever wanted to go into a hybrid situation, we always had the option of doing that. That being said, CockroachDB was the first one that we picked, but our code is written in a way that we could have picked any one. We also built a lot of security primitives, event mechanisms, all of that, on top of CockroachDB. We wanted to leverage what was already built, so we went with CockroachDB.

Watson: Having been at Netflix myself for quite a few years, I know you have so many partners there. Have your partners in the industry adopted use of the drive? I assume you have to work with different publishing houses and producers.

Chopra: At this point we are working with artists that are in Netflix, and they have adopted Netflix Drive. We haven't reached out to the publishers as yet. The independent artists work with our tools that actually would be supported by Netflix Drive. Those tools will be a part of Netflix Drive.

You've described how big objects are split, with every small bit of them referenced in a file, and that there was a challenge deduplicating. How is that different from symlinking?

In object stores, you have versioning. Every time you change or mutate a small part of an object, you create a new version of the object. If you imagine a big file where only a small pixel of it gets changed, that means that in the traditional sense, if your file was mapped to a single object, you would have to write the entire file again as an object, or send all of the bits again. You cannot just send the delta and apply that delta on cloud stores. By chunking, you actually reduce the size of the object that you have to send over to the cloud. Choosing the appropriate chunk size is more of an art than a science, because if you have many smaller chunks, you will have to manage a lot of data and a lot of translation logic, and your metadata will just grow. The other aspect is also encryption, because today we encrypt at chunk granularity. If you have many chunks, you will then have that many encryption keys and have to manage the metadata for them. That is how deduplicating content is also challenging in such scenarios.

Do you use copy-on-write techniques in Netflix Drive to optimize your storage footprint?

At this point, we do not use that. We could definitely explore using copy-on-write to optimize the storage footprint. Yes, that's a good suggestion.

What is the usual chunk size?

Chunk size, it depends. I think if I remember correctly, it's in megabytes, like 64 megabytes is what I remember last seeing. It is configurable. Configurable in the sense that it is still statically assigned at the start, but we can change that number to suit our performance needs.

Watson: How are security controls managed across regions, it's usually an external API or built into the drive?

Chopra: It's actually built into our CRDB, as a layer on top of CockroachDB that we have. Netflix Drive leverages that layer to handle all the security primitives. There are several security services that are built within Netflix, so we leverage those at this point. We don't have external APIs that we can plug in to at this point. We plan to abstract them out as well, whenever we release our open source version so that anyone can build those pluggable modules as well.

Watson: You mentioned open sourcing this, and I know open sourcing is always a hurdle and takes a lot of effort. Do you have a general sense as to when this would be open source?

Chopra: We are working towards it for next year. We hope to have it open sourced next year, because we have a lot of other folks that are trying to build studios in the cloud reaching out to us. They want to use Netflix Drive as well, the open source version of it, and build pluggable modules for their use cases. We do intend to prioritize this and get it out next year.

C++ for performance reasons. Have you considered using Rust?

We did not consider using Rust, because one of the reasons was that a lot of FUSE file system support was not present in Rust at that time. Secondly, C++ was a choice. First of all, you pick a language for performance reasons. The second thing is you pick a language that you're more familiar with. I'm not familiar with Rust. I have not used it. It would take me some time to get used to it, explore it, and use it in the best way possible. That would have meant less time devoted to the development of Netflix Drive. There is nothing that stops us from investigating it in the future, when we've made it a framework and we release it out there.

 


 

Recorded at: Apr 29, 2022
