InfoQ Homepage Presentations Improving Video Encoding System Efficiency @Netflix

Improving Video Encoding System Efficiency @Netflix

View Presentation

Speed:

Download

19:07

Summary

Susie Xia discusses the video encoding system used by Netflix, and the tools and techniques used to analyze performance and to improve the system efficiency.

Bio

Xia Susie works on the Performance Engineering team at Netflix. On the team, she has been focusing on making video encoding systems fast and efficient. Prior to Netflix, she worked at LinkedIn and Salesforce, where she helped improve service performance end-to-end and build automations for root-cause analysis and capacity analysis.

About the conference

QCon Plus is a virtual conference for senior software engineers and architects that covers the trends, best practices, and solutions leveraged by the world's most innovative software organizations.

Transcript

Xia: I'm Susie from Netflix performance engineering team. Netflix is one of the biggest media content providers. During the global pandemic, where everyone is staying at home, 11% of global internet traffic is coming from Netflix. Every time when you hit Play button from Netflix app on your favorite show, whether it is "Tiger King," "Stranger Things" or "Black Mirror," you are contributing to this number. While enjoying a great variety of media content from Netflix, there is a hidden yet very important system running behind the scene to support a pleasant streaming experience to you. This system is video encoding system.

Outline

That is what we're going to talk about. First, I will give an overview of the video encoding system. Then I will share some of the technical challenges we faced. Most importantly, I'm going to share three techniques that we applied to improve the system efficiency, and hope that will be helpful for your domain as well.

Video Is Big Data

First of all, we need to understand that video is big data. Films are shot in 4K or higher. For a full day of shots, it will generate 2 to 8 terabytes of raw camera footage. Imagine your favorite movie may be shot in weeks or months, so this number could be even higher. After editing, when a film is ready, video source files are generated and sent into Netflix systems. Normally, video source files are big, even though not as big as the original footage. For example, for a 1-hour long episode of, "Crash Landing on You," the video source file is almost 500 gigabytes. It will take over 10 hours for me to download with my home WiFi. Delivering several hundreds of gigabytes of video to our customers is not practical. We need to condense the content so that our customers are streaming them smoothly under all sorts of conditions, whether they're watching with a 4K HDR TV with the fastest broadband, or a mobile cell phone with very choppy cellular network. This process of condensing video source file is what we refer to as video encoding. Video encoding is the foundation of a great streaming experience. Netflix is committed to delivering the most optimized picture quality to all our customers on over 1000 supported devices. Video encoding is one of the most important workloads running at Netflix cloud infrastructure.

Video Encoding Pipeline

Let's zoom into what the video encoding pipeline looks like in a bit more detail. On the left-hand side, we have a video source, again coming from a production studio. On the right-hand side, we have a 4K TV on which we want to watch the show. When the original content is first ingested into our system, we run some quick validations. Next, the media content is broken up into smaller chunks. Each of the chunk is a portion of the video, usually about 30 seconds to a couple of minutes in duration. In a massively parallel way, we encode these video chunks independently on our servers. Once all the chunks have finished the encoding process, they're reassembled to become a single encoded video asset. This video file is about 1000 times smaller compared to the original video source file. Then the DRM protections will be added to the video asset and the package, the stream is generated. The package, the stream asset is distributed to the CDNs around the world and is ready to be streamed by customers. This simplified pipeline view shows what it looks like for one profile. In order to support thousands of different devices and optimize the experience of each of them, we need to run this process for different video codecs, resolutions, and bitrates. In reality, for a single title, we do a lot more work. In practice, we have to do this for the entire Netflix catalog, which contains many shows. This becomes a massive parallel computing problem in the cloud, which requests a lot of capacity with hundreds of thousands of CPUs running concurrently at peak.

The Encoding Workload

Now that we know what the encoding pipeline looks like, let's talk about some of the workload characteristics. First, the level of work is bursty and unpredictable, because the load is driven by our partners or teams. For example, we received the entire season of "Emily in Paris" from the production studio, and we want to encode them as quickly as possible. Most work can be decomposed into smaller processing steps that can be run in parallel. The arrival of one, 30 minutes episode can result in 250,000 work items. Third, media processing algorithms are CPU intensive, and can take hours or even days for just a small segment of video. It also requests a lot of memory disk I/O. Lastly, every now and then, we have to generate encodes for portion or sometimes even the entire catalog, so that we can take advantage of the latest codec and optimizations. This process may take months to complete.

Optimize In Two Dimensions

When we analyze a system like this, there were two dimensions that we need to keep in mind, latency and throughput. For latency, at the highest level is about how fast we can release a new title. That doesn't just mean how fast a customer can start watching the show, but also means internal teams can start using the materials for marketing purpose. At a lower level, we want to measure how fast the work items can be completed individually. For throughput, we ask ourselves how long it would take to re-encode the entire catalog, and also, how much work can be done in parallel with a given set of resources?

Three Techniques

With this in mind, let me share with you the three techniques that our teams at Netflix have applied to help improve the video encoding system efficiency. We are going to talk about them in detail, one by one. For quick preview, these techniques are about using the right tools to inspect your workload, knowing what your hardware can provide, and how to take full advantage of it. Finally, figuring out a way to fit the workload properly in a distributed system.

Using High Resolution Tools

Let's start with the first technique, what I call using high resolution tools. The idea is pretty simple. What you can optimize is limited to what you can observe. A lot of you probably have seen something like this, a standard time series diagram showing CPU utilization. What I'm showing to you here is a tool called Atlas that we use at Netflix for analyzing microservices. Atlas gives us 1-minute average on metrics, and works pretty well for what it is meant for. In this chart, we are showing the CPU utilization of a video encoding worker, in a 3-hour span. Note that a worker fetches work items from the queues and processes them sequentially. As you see, the CPU utilization is quite spiky, ranges from 40% to 100%. Let's take a look at the first 30 minutes. During this time, the worker was working on 59 tasks converting the content into H264. In the next 30 minutes, it worked on exactly one task. In the final hour and a half, there were 80 tasks doing VP9 encoding. In a total of 3 hours, there were 140 tasks being run with quite a lot of them starting and finishing within the same minute. Recall that the chart is at 1-minute interval. Now we see that the view is too coarse to tell us exactly what is going on within a task. We want to dig further.

Vector: Per Second System Metrics

To dig further, let's use a tool that can provide more granularity. Here, we use Vector, an open source tool that provides 1-second average of system metrics. Using this tool, you can clearly see that within this 15 second task, some drama is going on. CPU climbs up from 0% to 50%, dropped to 10%, and then it is back up again to 100% before dropping all the way down to 0%. We're very curious about what the system is doing, during this time window.

FlameScope: CPU Subsecond-Offset Heatmap

Then we refer to FlameScope, which is a performance visualization analysis tool that allows us to zoom into a fraction of CPU profile, so we can find out what is going on within a second. In this visualization, each cell has color ranging from white to pink to red. Before going into the details of it, doesn't this look like something delicious? Using these two to analyze the tasks for different codecs, we got different interesting steaks, or rather, different charts. Let's take a look at one of them and see how to interpret it. Bottom to top, each column is one second, and the next second was painted from bottom to top again. It is time range in both axes. The color depth indicates how many CPUs were busy for the time range. There were 50 rows, the size of each cell or pixel is about 20 milliseconds, so we can now see the patterns within a second. What's cooler is by clicking and selecting a set of pixels, we can see the Flame Graph during the time-frame.

Back to the visualization for different encoding tasks. We can see that none of them seem to fully utilize the CPUs, because if they were, they would just be all red. We can also see that there were different types of patterns, which indicates common performance issues. Let's take a look at them and see what we can do about them. Number one, idle CPU time is indicated as the white columns, which means CPU is doing nothing, which usually indicates opportunities for more parallelization. We also have areas of light pink color, which indicates only one CPU was used. If we take H264, as an example, the big pink area with some red dots for perturbations, which could be interrupts or system tasks. For this scenario, a common optimization strategy is to see if you can remove resource contention of these blocks. Finally, we have large red blocks, which is a good indicator of the program taking full advantage of CPUs. We can still try to optimize by looking into which functions are running and rewrite them to make them run faster.

Flame Graph: Identity Hot-spots

To do that, a tool that could be very handy is Flame Graph. You can get to Flame Graph by clicking a subset of pixels in a previous heatmap. It is a tool for visualizing application and kernel stacks, allowing the most frequent code path to be identified quickly and accurately. The wider a frame is, the more often it was presented in the stacks. The top edge usually shows what is on CPU, and beneath is its ancestry. With this, you can easily identify something that we can further optimize. Using high resolution tools is like looking at the system through a magnifier, it helps us to find out all the suspects for potential performance optimizations.

Leverage Hardware Features

We just talked about our first technique of using high resolution tools. Let's talk about the second technique, which is leveraging hardware features to make things run even faster. An analogy here is that even if you have a super car, if you don't drive it at full speed, of course, in a safe racetrack environment, you are not maximizing its full potential.

Processor Specific Optimizations

Looking at the same encoding task as an example, by studying this Flame Graph, we know that 33% of the total CPU time was spent on a particular function, which is highlighted in purple color. This is from Kakadu, which is encoder and decoder library that we use for our J2K. It is our focus area for optimizations. Once we dig into this code, we find that the majority of the encoding work involves time and frequency domain conversions, and other methods of handling convolution. This type of metrics computation is usually a good candidate of leveraging special hardware, such as FPGAs, or GPUs. However, there were easy wins available on most x86 already. For example, in our case, we decided to leverage vector instructions such as AVX2, SSE, which allow faster metrics map. By turning on the corresponding to the functions, you might be able to speed up your programs significantly depending on your workload. Of course, this is not an easy feat, you really need to know your hardware and software well, so you know what to tune. There might be a special knob on a hardware that is waiting for you to discover.

More Work Done With Resources

The third and the final technique we would like to share is about sizing the workload properly for better utilizations, like playing Tetris. Here is an example of improving our AV1 encode. AV1 is a new codec that reduces 15% of the bitrate but requests 1.5 times more CPU time for encoding. Let's say we have already applied the previous two techniques, and you've optimized the worker to its full potential, and also taking full advantage of the hardware features. By monitoring the CPU utilization, we noticed it's running at 25%. We were thinking, how can we improve the throughput of the entire instance? Then what we do is to implement some concurrent workers that scaled based on the instance type that we're running on. Eventually, we achieved 100% CPU, and I also doubled the single instance throughput.

Know the Shape of Your Work

This makes us want to think, maybe we want to do this to all our different encodes. We want to shape the type of encode work item with a lot of parameters, such as resolution, codec, number of frames, and also resource utilization, such as CPU, memory, disk, network. We also want to add the job duration. What we did is, we annotate the encoding process to know the resource consumption at different application phases. Here is an example of annotating CPU utilization of an encode task. In fact, we did this for our other system metrics. To do this, we basically did instrumentation to our code, where at different stages, it automatically emits messages with all the attributes that we want. These attributes include phases, duration, source, encode width and height, codecs, and number of frames. We persist this data to a data warehouse for further analysis, and this turned out to help us better understand the jobs running. Ultimately, we used this data as a guidance of implementing bin packing algorithms for the workers.

Takeaways

The three techniques are, using high resolution tools to know where and how you should optimize. Don't be afraid to go full speed on your hardware. Finally, know that in today's cloud computing world, you need to know how to fit workloads together to maximize throughput.

See more presentations with transcripts

Recorded at:

May 21, 2021

Susie Xia

InfoQ Software Architects' Newsletter