Crisis to Calm: Story of Data Validation @ Netflix

Summary

Lavanya Kanchanapalli shares her experience in maintaining a great Netflix customer experience while enabling fast and safe data propagation. She also talks about detecting and preventing bad data that is essential to high availability, ways to make circuit breakers, data canaries and staggered rollout effective, and efficient validations via sharing data and isolating change.

Bio

Lavanya Kanchanapalli works as a Senior Software Engineer at Netflix.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

After a full day of very interesting, stimulating talks and discussion, I'm also looking forward to chilling with Netflix. But before that, I'm going to talk about February 3, 2015. It was a very normal day at Netflix; we had feature requests, team meetings, builds, pushes, and so on. But towards the end of the day, Netflix went down. That means no user was able to play any video in any country. It was just down. This was a big deal. When Netflix goes down, it feels like an eternity, just as much for you as for the engineers working on it, because we want to do everything to bring it back up as soon as possible. That's where our story starts.

Our SRE team did not know what the cause was. But since my team is responsible for critical data on which all applications at Netflix rely, we were paged and asked, "Did you do a code push?" No, we did not do a code push. Did we think it was a data issue? No, we did not think it was a data issue. But when Netflix is down, we try to diagnose it as fast as possible, so we jumped in. The first thing we did, since we are a data infrastructure team, was to roll back the data to 2 p.m. that day to see if that would stabilize things. And incidentally, the errors started going down. This was a great sign, but it also meant that it was a data issue. Surprise!

The impacted applications could not yet auto-recover, so we had to increase their capacity to handle the extra load. And then, after 45 minutes, about the length of this talk, imagine that, Netflix started to work again. It was a huge relief to every team involved. We were all happy. But the next thing was to find out what had happened. So we debugged, we tried to root-cause it, and eventually we found that one of the applications had created duplicate objects, meaning that for the same primary key we were creating multiple objects. Yes, it was a glitch.

Applications did not know how to handle duplicate objects, as they were not expected. They went into retry storms and other cascading failures and brought the entire system down. This was a wake-up call for my team to realize what could happen should unexpected data make it into applications. This is where my team's journey of bad data detection starts. I'm going to share that journey and some of the learnings from how we have tried to build bad data detection to help increase availability.

These learnings, I hope, can be applied to small and big systems alike to increase availability and resiliency. Before we jump in, I want to introduce myself. My name is Lavanya. I work in Platform Data Technologies at Netflix. My team builds runtime data infrastructure for all applications at Netflix. We do data compaction, data modeling, data validation, all things data.

Big Data

Let's take a step back and talk about data. I spoke to some of you here, and I heard that data is the key part that's very attractive. Let's talk about what data means. Depending on where we apply it, data could mean multiple things. Data could be metrics, logs, reports, Kafka topics, databases. It could mean many things. In a microservice architecture, data can also be used as a way for applications to communicate with each other. But when I talk about data throughout this presentation, I'm talking about data in data-driven systems: data that flows through systems and whose changes cause those systems to change their behavior. If you think about it, to change the behavior of a software system, we write code, we add a new feature. But once the code is in production, by changing a flag, or a number, or any data associated with it, you make the system behave differently. This has many advantages; among them, it is convenient, agile, and fast.

We use this quite a bit at Netflix as well. Let me give you an example. Here is an example of a data change: a contract being signed for a show. As it makes its way through the microservice architecture of Netflix, it changes the behavior of Netflix by making that show available to all Netflix users. We can find similar examples in many other systems. For example, if you modify the authorization on a Google document, it makes the document available to someone else, or takes that availability away. There are many other examples. And at Netflix, on January 6, 2016, when the service was made available globally, it was a few pieces of data that changed at the stroke of midnight that made it available in 130 countries.

This is a familiar screen with the audio and subtitles. These are not hardcoded either. As new assets, such as audio languages and subtitles, become available, they automatically show up for the greater enjoyment of Netflix users. Here is a screen that shows what we call catalog metadata. Catalog metadata is the set of attributes that we tag every video with. These are what determine how the video behaves inside of Netflix. As you can see, it is the title, maturity rating, episodes, all of that information. This catalog metadata architecture is a great example of a robust, resilient data-driven system within Netflix.

Metadata Architecture

I'm going to jump a little bit into the architecture of this system. This system gathers data from multiple sources. One source system might deliver all the localized data for all videos, another one all the images, another one the streams. Streams are nothing but the attributes or data that make a video playable. Video Metadata Service is an application that aggregates data from all these sources, applies business logic, compacts it, creates an efficient data model, and publishes it to S3. Once new data is available, all applications at Netflix are asked to get the latest data, and they do so by going to S3. As you can see, it's a single publisher with multiple consumers. Before I go any further: the reason I'm talking about this architecture is that I'm going to use it as an example when I talk about data validation.

Under the covers, we use a technology called Hollow. This is open-sourced by Netflix; you can find more at hollow.how. It provides us the capability to version every data change that we publish. Versioning data changes has a lot of advantages. One of them is being able to provide delta updates: from one data state to another, I can publish just the changes. This makes it very efficient, both to transport and for the applications to apply.

This also makes very fast data propagation possible. Just to give you an idea: before we started, we used to publish every few hours, and all the data changes that happened in those hours would be published to all applications. Once we started using Hollow, and after a couple of performance improvements, we now publish in minutes. So imagine: if something changes in the source data, it becomes available to Netflix users within minutes. That is very fast. This became a very efficient and fast way for any data to propagate through Netflix applications.
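
To make the versioning and delta idea concrete, here is a minimal Java sketch of a versioned data store that moves between states by applying deltas. It is only an illustration of the concept under assumed, simplified types; it is not Hollow's actual API (see hollow.how for that).

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// A minimal sketch of versioned data states with delta updates, in the spirit of what
// Hollow provides. The class and method names here are hypothetical illustrations,
// not Hollow's actual API.
public class VersionedDataStore {

    // Full snapshot of the current version, keyed by primary key.
    private final Map<String, String> currentState = new HashMap<>();
    private long currentVersion = 0;

    // A delta carries only the changes needed to move from one version to the next.
    public static class Delta {
        final long fromVersion;
        final long toVersion;
        final Map<String, String> upserts = new HashMap<>(); // records added or modified
        final Set<String> removals = new HashSet<>();        // primary keys deleted

        Delta(long fromVersion, long toVersion) {
            this.fromVersion = fromVersion;
            this.toVersion = toVersion;
        }
    }

    // Consumers apply just the delta instead of re-downloading the whole snapshot,
    // which keeps both transport and application of each data version cheap.
    public void applyDelta(Delta delta) {
        if (delta.fromVersion != currentVersion) {
            throw new IllegalStateException("Delta does not start at version " + currentVersion);
        }
        currentState.putAll(delta.upserts);
        delta.removals.forEach(currentState::remove);
        currentVersion = delta.toVersion;
    }

    public long version() {
        return currentVersion;
    }
}
```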

But "any" is the catch; bad data happens, and it propagates just as fast and can bring down the entire system. We have seen over time that data can result in leaked content. One of the popular Netflix shows was released a couple of weeks before its launch date; we scrambled and took it down very quickly, because it was a big problem for us. Later, we found that it was a single date attribute that caused this. Bad data could also disable features, as you can clearly see something is missing here. Or it could enable features prematurely, it could delete data, or it could bring the entire system down, as we saw in the incident I first talked about.

That incident happened because, in the video metadata service, there was a race condition that created those duplicate objects. And like any data, they propagated within minutes to all applications, and the entire system went down. So once this incident happened, we started with one goal: we wanted to be able to handle data issues much better the next time. That is all we wanted to do. This is where we started. As we talked more, and in discussions, we found that the data changes we were propagating through all these systems every few minutes were exactly like code pushes.

At that point, our data changes were riskier than code pushes, because code pushes, per industry standard, went through detection with unit tests, integration tests, and canaries, and they went through staggering: we never pushed new code to all instances at once, we pushed to a small set of instances and then rolled out gradually over time. And there was rollback with red-black pushes: you push code, and should something happen, you quickly stabilize by moving back to a safer version while you debug the issue with the new one. But our data pushes had none of this. Every few minutes, we just kept pushing.

Knowing how impactful data changes were, we considered applying each of these techniques to our data pushes. But trust me, not everybody, even within our team, was convinced that prevention was a better approach than fixing forward. You might have heard that Netflix has a bias towards action, and we do not try to build too many roadblocks into our systems. We were concerned that it might make our data propagation too slow. We had seen the impact of going from three hours to a few minutes for data propagation, and it was extremely beneficial for Netflix; we did not want to slow it down. Or, on the other hand, our techniques to find failures might be too successful: we might find a few failures every couple of hours, and somebody would have to sit there, validate each one, and let it through. That would cause too much operational noise. Or it might be too expensive. We did not want our solution to become like the white elephant that you see in this picture: rare, luxurious, but expensive and impractical. Yet, knowing how impactful data changes could be, we decided to go down this path.

Detection

Our first task was: how do we detect bad data? How do we know that a data change we are propagating is not going to have an adverse impact on the applications that depend on it? We looked back at our past data issues, we spoke to the teams that depended on this data, and we spoke to the teams that produced the source data, to identify what's important, how we can guard it, and how we would know. We came up with a set of checks, and these checks fell into two categories: circuit breakers and canaries.

If you think about it, circuit breakers are like unit tests. Unit tests for a code push are the checks, or scenarios, that we validate before the code goes to production. In the data case, circuit breakers are the checks that we apply to data before the data is made available to other applications in production. This is the architecture I've just shown: source systems, metadata service, Amazon S3, and the other applications at Netflix. We added a new step at the end of the creation of data, which runs through all the data to figure out whether it is healthy. As the name indicates, circuit breakers work like this: when a bad data condition is met, the circuit opens, and whatever is flowing through can no longer flow. Same with data: these circuit breakers check for conditions that indicate bad data, and when the circuit opens, the data does not flow through until we go debug and validate it.
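
As a rough illustration of this stage, here is a minimal Java sketch of a circuit breaker step that runs over a candidate data version before it is announced to consumers. The interface and names are hypothetical, not the actual Netflix implementation.

```java
import java.util.List;

// A minimal sketch of a data circuit breaker stage, run after a new data version is
// built but before it is announced to consumers. Names are hypothetical illustrations.
public class DataCircuitBreakerStage {

    // One check over a candidate data version; returns a failure reason, or null if healthy.
    public interface DataCircuitBreaker<T> {
        String check(T previousState, T candidateState);
    }

    // Runs all registered checks; if any circuit "opens", the new version is not published.
    public static <T> boolean isSafeToPublish(T previous, T candidate, List<DataCircuitBreaker<T>> breakers) {
        for (DataCircuitBreaker<T> breaker : breakers) {
            String failure = breaker.check(previous, candidate);
            if (failure != null) {
                System.err.println("Circuit open, holding back data version: " + failure);
                return false; // consumers keep serving the previous version while we debug
            }
        }
        return true;
    }
}
```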

The circuit breakers that we came up with fell into a few types. The first was integration checks. These are like sanity checks: can the consumers of the data actually get the data from S3? Does the data checksum work? And so on. Then duplicate detection: we talked about the glitch, and we just wanted to identify if that situation ever happened again. Then object counts: depending on your data, you identify the key object types and look at the count of those objects from one data change to the next.

I'll give you an example from the Netflix world. In Netflix, the most important things are videos. So say at 4 o'clock, when we published the data, there were 500 videos. At 4:30, when I'm publishing, if there are only 20 videos, that does not sound right; something must be wrong, so I go validate the data. That is the object count check. The last type is semantic checks. These are completely business- and data-dependent. Again, going back to the Netflix world, these would be things like: every video should have a title, every season should have episodes, and so on.
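
Here is a hedged sketch of two of those check types, duplicate detection and object counts, assuming simple in-memory collections of records; the record shapes and helper names are illustrative only.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;
import java.util.function.Function;

// Sketches of two circuit breaker types described above: duplicate detection and
// object counts. Types and names are hypothetical, for illustration only.
public class CatalogChecks {

    // Duplicate detection: flag the version if two records share a primary key.
    public static <T, K> boolean hasDuplicates(Collection<T> records, Function<T, K> primaryKey) {
        Set<K> seen = new HashSet<>();
        for (T record : records) {
            if (!seen.add(primaryKey.apply(record))) {
                return true; // the same primary key produced more than one object
            }
        }
        return false;
    }

    // Object counts: compare the count of a key object type (e.g. videos) against the
    // previous version. A naive absolute check like this was only the starting point;
    // the talk goes on to describe moving to percentage thresholds.
    public static boolean countLooksWrong(long previousCount, long candidateCount, long maxAllowedDrop) {
        return previousCount - candidateCount > maxAllowedDrop;
    }
}
```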

It's important to know your data, especially when you're writing the assertions that validate it; it is really important to know your data and how it changes. When we added the circuit breakers, we really thought we knew the data. We said, "a show needs to have seasons; a large number of data changes is an indication of a problem." Boy, were we wrong. Apparently, before a show goes live, it does not have seasons. That does not mean it's bad data. And when a big deal is signed by Netflix, a lot of videos come online, which results in a lot of changes. Is that bad? No, it's not bad; we didn't want to block it at all.

So when we added circuit breakers, they would fail, and each time it was either bad data or we learned why the data was not bad. And I can tell you, in the first few weeks after we added the circuit breakers, we did a lot of learning, a lot of it. The only way to move away from noisy circuit breakers is to add knobs. One of the knobs we used frequently was on and off. Say we have a circuit breaker that identified a condition that is actually valid, but we had assumed it was bad data; the only way to correct that is to turn off the circuit breaker and fix it while the data keeps flowing, because we do not want to stop data propagation for a day while we fix the circuit breaker.

Thresholds. We started with naive checks like the example I gave: at 4 o'clock I have 500 videos, at 4:30 I have 20 videos, and I wanted to do the object count check that way. But we soon realized that this was brittle. What if there are millions? What would I do then? We quickly moved to a percentage. And even with a percentage, we could not be rigid. Like I said, a large data change is still valid in some cases, and we needed to be able to adjust based on how the data behaved.

Another good one was exclusions. I was talking about the show that did not have any seasons where we expected it to; there are two ways to handle it once that condition is met. Either I could turn off the circuit breaker for all data, or I could exclude that one show and continue to provide coverage for the rest of the data. So exclusions were something we added later as well.
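
A minimal sketch of those knobs, on/off, percentage thresholds, and exclusions, might look like the following. The names are hypothetical, and in practice the values would come from dynamic configuration rather than hard-coded fields.

```java
import java.util.HashSet;
import java.util.Set;

// A sketch of the "knobs" described above for an object-count circuit breaker:
// an on/off switch, an adjustable percentage threshold, and per-record exclusions.
// All names are illustrative assumptions.
public class ObjectCountBreakerConfig {

    private volatile boolean enabled = true;                  // on/off knob
    private volatile double maxDropPercent = 10.0;            // threshold knob
    private final Set<String> excludedIds = new HashSet<>();  // exclusions knob

    public void disable() { enabled = false; }                // keep data flowing while the check is fixed
    public void setMaxDropPercent(double percent) { maxDropPercent = percent; }
    public void exclude(String id) { excludedIds.add(id); }   // e.g. a show that legitimately has no seasons yet

    public boolean isExcluded(String id) { return excludedIds.contains(id); }

    // Percentage-based count check instead of a brittle absolute number.
    public boolean circuitShouldOpen(long previousCount, long candidateCount) {
        if (!enabled || previousCount == 0) {
            return false;
        }
        double dropPercent = 100.0 * (previousCount - candidateCount) / previousCount;
        return dropPercent > maxDropPercent;
    }
}
```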

We quickly realized that not all data is equal. But to our circuit breakers, a test video was as important as The Crown, and we all know that is not the case. What we were after was the business value: what is it that makes some data more important than other data? For Netflix, what was important was what our users loved; whatever our users liked was most important to us. So the first thing we did was to teach our systems and our checks how popular a particular video was, because how popular a video is determines how important it is to Netflix.

So we initially started with naive checks based on absolute numbers, then moved on to percentage thresholds, and then came to smart circuit breakers, which identify the business value and behave accordingly. This kind of evolution made these techniques a key part of the availability story for many applications at Netflix. To add circuit breakers to your systems: identify a few checks around the key use cases that your data supports, and allow for exceptions to the rules. Have lots of knobs that can dynamically change the behavior of the circuit breakers or checks you're making. And third, use business value as the key to making checks that are both noiseless and very effective.
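
As one possible shape for a business-value-aware check, here is a sketch that weighs failing videos by an assumed popularity score and only opens the circuit when the aggregate impact is significant. The scores, threshold, and names are illustrative assumptions, not the actual Netflix logic.

```java
import java.util.Map;

// A sketch of a "smart" circuit breaker that weighs failures by business value, so an
// issue on a popular title matters more than one on a test video. Hypothetical names.
public class PopularityWeightedBreaker {

    private final Map<String, Double> popularityByVideoId; // e.g. share of recent plays
    private final double maxImpactedPopularity;

    public PopularityWeightedBreaker(Map<String, Double> popularityByVideoId, double maxImpactedPopularity) {
        this.popularityByVideoId = popularityByVideoId;
        this.maxImpactedPopularity = maxImpactedPopularity;
    }

    // Open the circuit only if the aggregate popularity of failing videos is significant.
    public boolean circuitShouldOpen(Iterable<String> failingVideoIds) {
        double impacted = 0.0;
        for (String id : failingVideoIds) {
            impacted += popularityByVideoId.getOrDefault(id, 0.0);
        }
        return impacted > maxImpactedPopularity;
    }
}
```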

You could also consider the following techniques for efficiency. One of the things we have used is change isolation. We process all the data changes in a few minutes; we did not want to spend hours validating to make it very safe. We had to balance safety against speed. So one of the things we invested in is identifying exactly what changed and validating just that. If you think about it, the more frequent your data changes are, the fewer changes there are in each one, and the less there is to validate.

Sampling. This, again, is similar to semantic checks in that it is very data- and business-specific. If you can come up with a sample that represents your data well, you can amplify the effect of the coverage you provide. I'll give you an example. Knowing our data, we knew that the metadata associated with a video is exactly the same across different countries. So if we validated the data for one particular pair of video and country, we were providing coverage for many other countries as well. We used this technique to our advantage to make our circuit breakers very efficient.
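
Here is a sketch of both efficiency techniques, change isolation and sampling, assuming the data is available as simple maps and (video, country) pairs; the types and names are placeholders for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

// Sketches of the two efficiency techniques described above. Change isolation: validate
// only the records that actually changed between versions. Sampling: validate one
// representative (video, country) pair where the metadata is known to be identical
// across countries. All types and names are illustrative.
public class ValidationScoping {

    // Change isolation: diff the previous and candidate states and return only the changed keys.
    public static <K, V> List<K> changedKeys(Map<K, V> previous, Map<K, V> candidate) {
        List<K> changed = new ArrayList<>();
        for (Map.Entry<K, V> entry : candidate.entrySet()) {
            if (!Objects.equals(previous.get(entry.getKey()), entry.getValue())) {
                changed.add(entry.getKey()); // added or modified records
            }
        }
        for (K key : previous.keySet()) {
            if (!candidate.containsKey(key)) {
                changed.add(key); // deleted records also need validation
            }
        }
        return changed;
    }

    // Sampling: if a video's metadata is the same across countries, validating one
    // representative country per video provides coverage for the rest.
    public static Map<String, String> sampleOneCountryPerVideo(List<String[]> videoCountryPairs) {
        Map<String, String> sample = new HashMap<>(); // videoId -> representative country
        for (String[] pair : videoCountryPairs) {
            sample.putIfAbsent(pair[0], pair[1]);
        }
        return sample;
    }
}
```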

Moving on to data canaries. Just as circuit breakers are like unit tests, canaries are like integration tests. The way integration tests work for code is that you have a set of checks that run to validate a full piece of functionality or a use case involving more than one system or application. Data canaries follow the same principle.

Again, going back to the architecture picture, I just want to talk about code canaries, traditional canaries, and how they work. New code and old code are deployed in two separate clusters. Each time a request comes in to Netflix, a copy of that request is sent to both the new code and the old code. While this is running, key performance metrics like memory, heap, or anything else that is important to that particular application are measured. Once the test is done, the two are compared to see whether the new code has degraded any of the dimensions that were measured.

We wanted to do the same thing for our data. What we ended up doing was this: once the new data becomes available and gets published to S3, instead of propagating the latest data to all applications at Netflix, we first send it to the Netflix Data Canary Service. If the Data Canary Service gives a green signal, we send it to all applications. Should it be red, all applications continue to work as they were while we debug and fix the issue, and then send the good data to all applications.

Let's talk a little bit about what the Data Canary Service did, and how it ended up giving us the confidence we wanted from these canaries. We picked a key use case supported by our data. The key use case for the catalog metadata was to play a given video, like our Netflix users would. The next thing was to pick the data that was important to us. Remember all the effectiveness and efficiency techniques I talked about in the circuit breaker case: pick the data that is key to your business, pick the data that changed, pick the data that is representative of the rest. We used all of these and came up with a data set that was not exhaustive but provided very high coverage, and sent it to the Netflix Data Canary Service.

The Netflix Data Canary Service would run each video with the old data, and then with the new data. This helps us reduce noise due to environmental issues. If a video was playing with the old data and was down with the new data, it was obvious that the data brought it down, and we would go debug the issue and fix it. To apply a data canary to your application, you can use the same technique: pick key use cases, pick representative data to test, and run it with the old and new data to eliminate noise and get a clear signal about the health of the data.
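
A minimal sketch of that canary loop might look like the following, assuming a hypothetical playback check that can be run against a chosen data version; comparing old and new runs is what filters out environmental noise.

```java
import java.util.List;
import java.util.function.BiPredicate;

// A minimal sketch of a data canary: play each sampled video once against the old data
// version and once against the new one, and only blame the data if the old version
// succeeds where the new one fails. Names and the playback check are hypothetical.
public class DataCanary {

    // playbackCheck.test(videoId, dataVersion) returns true if playback works.
    public static boolean newDataIsHealthy(List<String> sampledVideoIds,
                                           String oldVersion,
                                           String newVersion,
                                           BiPredicate<String, String> playbackCheck) {
        for (String videoId : sampledVideoIds) {
            boolean worksWithOld = playbackCheck.test(videoId, oldVersion);
            boolean worksWithNew = playbackCheck.test(videoId, newVersion);
            // A failure only counts against the new data if the old data still works.
            if (worksWithOld && !worksWithNew) {
                System.err.println("Canary red: " + videoId + " fails only with " + newVersion);
                return false;
            }
        }
        return true; // green signal: safe to announce the new version to all applications
    }
}
```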

Staggering

Moving on to staggering now. We just finished talking about how to detect bad data; now I'm going to talk about staggering. Staggering for code pushes has been used for a while. New code, when it is pushed to production, is applied to a small portion of the instances in production and then rolled out gradually over time to ensure stability.

We wanted to do the same thing for data; we called it staggered rollout. As you know, Netflix runs on the AWS cloud, and this is a picture of AWS regions. The key idea behind staggered pushes is that you're reducing the blast radius of the change, whether it's data or code. We wanted to use the AWS region as a proxy for blast radius. We push the data to a single region, and after we get the signal that it's stable, we push it to more regions, until all regions have the latest data. Wait a minute, did I just say that an application deployed in multiple regions will see different data depending on the region? Yes, I said that.

I can imagine what you must be thinking, because these applications in different regions interact with each other, and it could cause huge problems if they're seeing inconsistent data. But the solution came in the form of eventual consistency. Eventual consistency is built into the Netflix architecture. It's another way of saying that, over time, all applications see the same data, but at any point in time there might be inconsistencies in the data seen by different applications. Because this was built into the Netflix architecture, we were able to implement staggered rollout for data and get the protection right away. In fact, after the February incident, this was the very first thing that was implemented.
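
As a rough sketch of staggered rollout, the loop below announces a new data version region by region and halts on the first unhealthy signal; the region names, health check, and announcement hook are assumptions for illustration, not the actual Netflix mechanism.

```java
import java.util.List;
import java.util.function.BiConsumer;
import java.util.function.Predicate;

// A sketch of staggered rollout: announce the new data version one AWS region at a
// time, and stop if an already-updated region does not look healthy. Eventual
// consistency is what makes the temporary cross-region difference acceptable.
public class StaggeredRollout {

    public static boolean rollOut(String newVersion,
                                  List<String> regions, // e.g. ["us-west-2", "eu-west-1", "us-east-1"]
                                  BiConsumer<String, String> announceToRegion,
                                  Predicate<String> regionIsHealthy) {
        for (String region : regions) {
            announceToRegion.accept(region, newVersion);
            if (!regionIsHealthy.test(region)) {
                System.err.println("Halting rollout of " + newVersion + " after " + region);
                return false; // remaining regions keep serving the previous, stable version
            }
        }
        return true; // all regions now see the latest data
    }
}
```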

Rollback

Rollback. Rollback is something we have used for code pushes; one of the techniques is red-black pushes. Even after all these checks and other techniques, should a bad code push get out into production and start impacting applications, one of the first things we can do is quickly point the traffic back to the old, stable build while we go debug the issue with the new build, and then roll it out again at a later point in time.

This is what we wanted to do for data. Whenever there's a data issue, we do what we call a "pin": we roll back the data to a stable point in time, so that the applications work with stale but stable data. We root-cause, debug, and fix the data, then make the latest data available to all applications again; we call this unpinning the data. To be able to do this, we need two things. The first is visibility: we need to understand what data went into which data version, and which data objects it impacted. The second is for the applications to have the ability to traverse from one data version to another. As I mentioned before, Hollow, which gave us versioned data sets, also provides the ability to see what data impacted which version and how it changed a particular data object.
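
A minimal sketch of pinning and unpinning, assuming versioned data states like the ones sketched earlier, might look like this; the class and method names are hypothetical, not Netflix's actual tooling.

```java
// A minimal sketch of pin/unpin. When bad data gets out, consumers are pinned to a
// known-good earlier version (stale but stable) while the data is fixed, then unpinned
// to follow the latest version again. Names are hypothetical illustrations.
public class VersionPin {

    private volatile Long pinnedVersion = null; // null means "follow the latest version"

    public void pin(long stableVersion) {
        pinnedVersion = stableVersion;
    }

    public void unpin() {
        pinnedVersion = null;
    }

    // Consumers consult this before refreshing; traversing back to the pinned version
    // relies on being able to move between data versions (deltas in both directions).
    public long versionToServe(long latestVersion) {
        Long pinned = pinnedVersion;
        return pinned != null ? pinned : latestVersion;
    }
}
```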

Here is another tool that shows exactly how a particular video was impacted, how it changed, at what time, and in which data version. I don't think the data version is shown here, but you can actually see exactly what data version it was. This gives us the ability, whenever there's a data issue, to consult these kinds of tools and say, "Oh, this data change impacted this video at 2 p.m. today; let's go back to the data version from 1 p.m. today." The applications stabilize, and then we can go fix the data.

Having talked about staggering, detection, and rollback, I want to highlight the tools and the UIs that are needed for these techniques. None of these techniques would have been as effective or usable without the tools. As you consider adding these techniques to your systems, you should also consider what tools are required to make them operational and usable, and build those tools alongside the techniques. Here is the tool that we use for pinning, which rolls the data back, or forward to the latest version.

With all these tools and techniques in place, our systems became robust and resilient to bad data changes. At this point, they became a blanket protection against even unforeseen or unpredictable data changes. These could happen due to data migration, new features, anything, you name it.

Over a year ago, a new feature was rolled out into production. This feature was turned off; it was not supposed to impact production. The new feature used a key called "package", and unfortunately, this key was also used by another data object that was running in production. Due to this key space collision, it started impacting stream data. Stream data is what makes a video playable. And then, as more and more data was processed, more and more videos were impacted and brought down, until the entire catalog for many countries was completely down. Netflix users would not be able to play videos for hours until the feature was turned off, and all the impacted data was brought back, fixed, and rolled out again.

But none of this happened. As soon as the data was impacted, our circuit breakers and canaries kicked in, and as the circuit opened, no data flowed through. Alerts went out to the appropriate teams, who started investigating the data issue; the bad data was fixed and the feature was turned off. Eventually, the latest data was rolled out to all applications, and all this while, Netflix users were enjoying Netflix.

As I wrap up, I want to say that a data change is as impactful as a code push, and we should treat it as such. Circuit breakers and canaries can be used for detection. Staggering and rollback can be used to quickly stabilize applications while debugging. As you think about your systems, if your system or data supports a key business use case, think of ways you can identify such scenarios and check the use case with circuit breakers and canaries. Identify ways to protect your systems with staggering and rollback. That is all I have for today. I'm ready to take questions.

Questions and Answers

Man 1: I was just wondering, so the circuit breaker, you say that the data flow stops when the circuit breaker is open. What does it actually mean?

Kanchanapalli: I'll go back to the slide with the architecture. It just means that although the data is prepared and published, it will not go to any of the applications, so the other applications within Netflix will not see this data. For example, going back to the example where I said at 4 o'clock I publish 500 videos and the 4:30 version has only 20 videos: say the circuit breakers detected that scenario, then all applications continue to work with the 500 videos while I go debug why there are only 20, and whether it is a valid case.

And either I fix the data to bring it back close to 500, or we determine it's valid; then we close the circuit, and the data continues to flow along with the other changes. Each version does not contain only one data change; it has multiple data changes. One of them might have brought down the video count, but the others still need to propagate.

Man 2: In order to keep the data versions, are you using event sourcing?

Kanchanapalli: I'll describe what we're doing. In order to keep the data versions, we're using a technology called Hollow. Each time we create a new data state, we attach it to a particular version, and from whatever version we had to the next version, we create a delta. So we're able to propagate and move both forwards and backwards.

Man 2: You build a copy that is for reading, right?

Kanchanapalli: We build a copy for reading, exactly.

Woman 1: Did you have to make any performance considerations? Or what was the performance impact of these data validation systems?

Kanchanapalli: Yes, we did have performance considerations. Like I said, we did not want to make it slow at all. To improve performance, we came up with change isolation and sampling, which we were able to use effectively to bring down the performance impact. Also, in the system architecture, as you can see, the metadata service, which is the service that makes the data available, does the pre-processing and is not in the request path.

Woman 2: I'm still a little fuzzy on the circuit breaker. Is there a possibility that some services which have already consumed the previous bad data are working with that bad data set, and are now inconsistent up until the time you roll back and bring it back to a stable version?

Kanchanapalli: Yes. It's not really about the circuit breakers; it's the rollback part you're asking about. Rollback is a reactive scenario, where we actually had bad data, it was already rolled out, and we are trying to recover and stabilize. It's similar to a code push that is identified as a bad one, where you roll back to the previous stable one. So in that case, yes, the data was bad, it was rolled out, and we are just stabilizing until the data is fixed. Then we roll it out again with the fixed data.

Man 3: Do you apply any data quality checks upon the data that is being delivered by the source systems?

Kanchanapalli: Actually, as I said, not even all of our team was on board when we talked about adding these checks at the end of the video metadata service. One of the major pushbacks, including from me, was that I wanted the source systems to take on the validation, because they own the data and they know the data best. But what was most important for Netflix was that we had these checks somewhere at all. So we did end up implementing them ourselves, but guess what, now that the checks are so stable, our source systems actually handle that validation.

Man 4: In this metadata architecture, what is the scale of the data that flows through the system? And do the techniques you talk about apply only to metadata, or to other types of data as well?

Kanchanapalli: I cannot talk about the numbers, as you might have guessed, but if you're thinking about Hadoop, big data, things like that, this is not that; this is runtime data. And we use Hollow, which is great for compacting [inaudible 00:45:43] data. That said, the second part of your question was whether it applies just to metadata or to other systems. After we saw success with the metadata systems, we ported this into an infrastructure that all applications at Netflix can take advantage of, where they publish and subscribe using this architecture, and validation is built into the system.

Man 5: Yes, going back to the source system thing. So if, let's say, one of the sources is now sending bad data and your circuit breaker triggers, do you cut off all sources while you figure it out? Or do you have it segregated by source?

Kanchanapalli: Looks like you're actually trying to implement this.

Man 5: Yes.

Kanchanapalli: We started with a simple approach where we would stop the whole thing. But we have built so much into it since then that at this point we have the capability to isolate, although we usually do not use it.

Woman 3: When you call out bad data, were most of the issues coming from the shape of the data, or was it more from the content of the data?

Kanchanapalli: Okay, I think I understand the question this way: you're asking whether these data issues were caused by structural changes to the data, versus the content of the data?

Woman 4: Yes.

Kanchanapalli: I'm talking more about the content of the data, because I have not seen many systems that make structural data changes dynamically without doing code pushes. When it is a code push scenario, we have established that we have good coverage and good detection for such things. But yes, structural data changes are something that publishers can easily make without realizing they would have a bad impact on the data. We do have some validations for those cases as well.

For example, in the case of structural data changes, the canary would go down, and then you clearly know: although there's no red-green signal, the canary is down and you can't proceed. You have to identify the issue, fix it, and then roll it out.

Man 6: Hey, thanks for the nice talk, it was good. I have one question. Each data field has its own boundary conditions, which we cannot control. Sometimes an attacker injects something like an XSS attack or a SQL injection. So how do we protect our input data validation?

Kanchanapalli: So if I understand the question correctly, you're saying each of the source systems has its own limitations and error conditions. How do we prevent those conditions? How do we detect them?

Man 6: How do we detect those conditions, and how do we mitigate it if an attacker injects SQL injections or XSS attacks? Sometimes they can do that. So how do we protect our services?

Kanchanapalli: I'm afraid I don't have a silver bullet. Either the application needs to detect what the problem is, or, if that is not possible, you have the checks at the end, where all the data is combined and put into one common place. If I didn't do justice to the question, let's talk offline.

Recorded at:

Jan 02, 2019