BT

Facilitating the Spread of Knowledge and Innovation in Professional Software Development

Write for InfoQ

Topics

Choose your language

InfoQ Homepage Presentations Managing 238M Memberships at Netflix

Managing 238M Memberships at Netflix

Bookmarks
50:04

Summary

Surabhi Diwan discusses how the Netflix’ membership team outgrew many of its technology and architectural choices as memberships went from a few hundred thousand to 200 million.

Bio

Surabhi Diwan builds large scale distributed systems by day and paints by night. She has spent the last decade building and operating complex software systems in insurance, advertising tech at Yahoo, mission critical financial software at Deutsche Bank and doing cloud management at Vmware.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Diwan: When a backend engineer works at Netflix, and when you tell somebody that I work for Netflix, there's a whole bunch of misconceptions that follows everywhere. Some of these misconceptions start at home. You'll have a long hard day at work, of course, doing the usual debugging, writing code, your meetings, design review, so on. You get back home, finally, it's your moment to sit back, relax, feet up on the ottoman, you hit play. That's when your kids will be quick to confront you and ask, what are you doing? I'll be like, I'm watching Netflix. Then they'll be like, what did you do the whole day? I wish I had the luxury of watching Netflix the whole day. Actually, I don't.

Sometimes these misconceptions travel with us to work. There was this instance, one time, an Uber driver, he was dropping me off to work and probably his first time on the campus. It can happen, our lobby is really amazing, and we've got all of these Emmy Awards, Oscars, you name it. They look at the lobby, and they look at me and then they're like, finally. They asked me, I hope you have a change of clothes in your backpack for this red-carpet event that you're going to. I'm like, I wish. While Netflix is synonymous with a lot of red-carpet events, we backend engineers are often not the VVIPs there, but we could be VVIPs here. I'll take this any day.

These misconceptions travel internationally. There was this one time my teammate BD, he's traveled to India for the summer, long flight, New York to Mumbai, 17 hours, jet lag. He's at the Mumbai airport, and the port of entry, the immigration official, like going through all of it, and then finally asked him, so where do you work? He said, I work for Netflix. That's when the person looks up and asks, so what do you do for Netflix? What show? What movie? Where are you featured? Here we go. I'm telling you, the immigration official pretty disappointed with whatever BD had to tell them that this is membership. I'm an engineer, this is what I do.

Background, and Outline

I'm a Senior Software Engineer at Netflix. I'm Surabhi Diwan. This is what the technical programming looks like. I'll start out by establishing what we do. Where do we sit? Netflix has a big microservice architecture, there's probably hundreds of thousands of services in the backdrop. I'll try to establish where membership sits. If you as an end user were to use the Netflix application, where is it that you would have a run-in with some of our UIs, or some of the fun things that you could do on the app? What is it that we do on the background? Roughly, once you're at the end of that is when we go deeper. Till then I'm just establishing what we do, where we're at, and what technical choices we made. The use case studies are where we go deeper. I will start out with some of the pricing, technology choices we made probably a decade ago that predates my tenure at Netflix. Moving on to a use case study for member history that is really our second persistent store that helps us answer questions to the microsecond granularity about any and every single change that has been made to anybody's subscription. I'm sure most of you are Netflix members. If you're not, I will show you exactly how to sign up as we get into the details of this. Last but not least, I will try to answer, what does the subscriptions ecosystem evolution look like? It's 238 million subscribers. Really, what is the journey like? If you have to add another 5 million subscribers, what does that look like? If you have to add another 100, what does that look like?

Membership Engineering

This is really about what my team does. We are in the critical path for both signups and streaming at Netflix. Those are just words there. I have UIs which I'll show how that all comes together. We operate a dozen microservices. While Netflix itself could be 500,000 microservices, we, within membership have a dozen microservices. We have the four nines availability, because the critical paths that we serve, if any of our services is down, it would have a direct impact on signups and somebody's ability to play and have a good evening. We own mid-tier services, so we are not at the edge, we are not Zuul. We're really in the middle of how things come together. Our services serve a lot of traffic. Based on what use case it is, whether it is subscriptions, or whether it is plan or pricing, that scale can go up to millions of requests per second. I will talk about the technical choices we made. What is off-the-shelf tech that we just use to make all that scale happen. We act as a source of truth for total membership count, for that number, whether it's to 220 million, 238 million. It's actually membership systems, which emit the events and writes to the persistent store, which are used by downstream analytics, both within Netflix, outside Netflix, to actually say for real that that's the actual global Netflix subscription count on a given day. In short, we're solving hard distributed problems at scale with high accuracy. How we do it is what lies ahead.

We own and operate the catalog of all plan and pricing. In the United States, you can count it on your fingers. It's really basic, standard, premium. We have a set number of choices. That's how we run our business for a good length of time. We are the source of truth. Particularly in your subscription, we care about the correct plan, price, the billing country, and the payment status, because this would determine what is the quality of service you receive. Is your account in good standing? If you are a paying member, are we doing right by you and allowing you to have the concurrent number of streams, or the devices that you need to be streaming on, or if downloads are allowed or not? We manage membership lifecycle. We shall get into the details of some of this. It could be really when you're trying to begin your journey with this, you're trying to sign up for Netflix. You can also sign up for Netflix using other partner channels. You could sign up with T-Mobile, or any other partner, Jio in India, and so on. As a member, you could change your plan. We will try to renew you. We are a subscription service, so we try to use your method of payment to get the amount that we need to make sure there's continuity of service. That could actually put your account on hold, or you might receive a grace period in case there's some issues for us to charge your account. You could pause your membership, cancel, or at the end of your journey with us. We are also on the hook to delete all customer data.

When Do Members Use Our Flows?

I think that was a broad establishment of what my team does. I think as end users you would really care about, if you were to go on the Netflix app right now, these are some of the places. This is a very important flow. This start membership button is what the counts add up to. If you hit start membership, pretty much all of the flow, it hits a bunch of our apps on the backend. We'll actually go deeper. I have this whole flowchart where we talk about all the different services within membership, outside of membership, all the different things that actually happen correctly, before your membership is in good standing and you can start streaming. On the right is the very coveted play button. When you hit play, there's a direct call to the membership systems, which actually determines what is the quality of service associated with your plan. Which plan are you on? What concurrent streams are you allowed, what devices, and so on? That's also where we get the highest traffic perpetually in all of our paths is really the streaming button because that's billions of streams every day. The other place you could run into membership flows is your account page. Not everything displayed here is what's owned by our team. If you were to change your plan, if you were to manage your extra members, or do some of the other actions, hopefully not cancel, all of these flows go directly into membership services. On the right, what you see is bundle activations or partner signups, I have a Xfinity example. Sometimes when you get your internet from Comcast, they might give you an offer that sign up for Netflix, three months free, and so on. All of those signup flows are also actuated by membership services on the background.

How Do We Make It Happen?

I think this is the meat of the puzzle, is establish what we do. This is really how we do it. This is a little hard to unpack. Let's start from the right side. We have the membership plan and pricing catalog service, which holds all of the plans. In the United States, it's a limited set. Across geographies, across different countries, we do a lot of experimentation. That's where all the plan and prices go. That service is also responsible for managing the rules, because the availability of plans is really managed by where you're at, and so on. Given we have different offerings in different places, there is a whole rule set that goes there. In terms of database choices, we have two CockroachDB databases in our fleet. The plan pricing is backed by CockroachDB. We also have code redemptions. During winters or more like Christmas season, we see that a lot of friends, family buy Netflix gift cards, and then they hand it out to when you're gifting. All the code redemption flows also work with membership services. Then we have a member pricing service once you're signed up, once we know what your plan price tier is for upholding all of your member actions, for example, changing plan, if you want to sign up an extra member, so on, all of those would flow through the member pricing services.

In the middle is what takes care of some of our partner interactions. There's a dedicated microservice which will take care of all bundle activation, signup, or if you were to change plans using a bundle. We have the membership app store service. We also have an integration with Apple, with the recent gaming launches, you have the ability to sign up for a Netflix subscription from your iOS App Store. Those signups are actuated in the background using the member App Store or the member iTunes service. On the left is where we have the other persistent store where all your memberships are saved. Cassandra is the choice of database. We have the subscription service and the subscriptions history service. These are really our persistence backed stores. This is where the meat of the 238 million memberships sit. 238 million is the active membership count. As engineers, our concerns are not just about the current members, we also have to care for former members and rejoins, and so on. There are folks who come to Netflix and then use it for a few days and go away, but we want to uphold the rejoin experience for you. Really, we are not in the millions game, we are in the billions game. Fronting some of this, as I mentioned, all the state management that we do. We have the member state service, and we have member holds and grace service. I have omitted a few other services, which do have other use cases. This, in a broad sense give you an overview of all of the things that we concerned ourselves with and how are they broken down. Finally, we have a whole bunch of work that happens using big data. All of our databases have backups. Then we use Casspactor, which would put the active live databases in backup tables where we can run Apache Spark jobs to do reconciliation within membership, across membership, and other teams. Also, emit events which are used by a lot of downstream consumers, for example, messaging and analytics, which actually pick up all of this data and come back with numbers that this is what the current signups are and this is what the revenue is projected to be.

Signup Journey

I will teach you how to sign up. We are at that point now. If you just go to netflix.com, and if you've not punched in your credentials yet, you would see something on the left, which is really what we call the non-member homepage. Once you start your journey, you would see these options of the plans that you can sign up on. This is something that's powered by membership systems. This is where we get millions of requests per second. Because you may or may not sign up, it's important for us to render this page for you correctly. There's complexity in the background because these numbers vary by geography. The currency and the pricing would change, what actual plans are available would change. Historically, Netflix did have offers. What you see, don't see, what your terms and conditions for signup, all of that varies. These rules are also managed by our team. In the background, if you were to start signing up, that's the plan selection page, those are all your options. Just a little disclaimer here, so all the green boxes here is what is membership services purview. Everything else in white is the other sister teams that membership collaborates with to orchestrate a flow, which is like signup, as an example.

We start out with a plan selection. It's fronted by the growth engineering apps. The member chooses a plan. This is queried from the membership plan and pricing service, which is backed by the rules and the persistent store. This is the CockroachDB. This is really the plan pricing bits of it. This is being queried, it's established what offerings are available, the plan is selected. That's when we take you to the next place where we're just trying to confirm and say that this is everything, everything looking good. Once you say start membership, that's when it hits membership state service and the membership history service. We will write to our persistent records, and say that this is the row which is suggesting this is some PII information, of course. The other bits being what plan, what price tier, what country. This would establish what catalog we offer you and what streaming experience you would enjoy at that point. This is also the juncture where we emit all the events which suggest that this membership is in a good state, it is started out. Messaging pipelines would pick that up and you would receive your welcome email saying, "Welcome to Netflix. This is a subscription. This is what you're paying," so on. That same image data is also read by downstream DSC teams and analytics. They will now know that the membership count has gone up by a little bit. Again, it's simplified. There are a lot more collaborating entities within Netflix. There are more persistent stores that we have to write to. This is also just the happy path. This is all being done at extremely high scale. This is a distributed system, there are lots of failures. Our concerns are not just to uphold the happy path, but also to jump in, reconcile, retries, and make choices so that if our online systems fail us, we are able to still go find where the gaps are at and then fix them when we can.

Membership Team Tech Footprint

This is roughly what our tech footprint looks like. We are a distributed systems architecture. We are optimized for high read RPS, which is requests per second. We have about 12-plus microservices. We are running gRPC or the Google Remote Procedure Call at the HTTP layer. As far as the source code goes, we are really a Java shop. More recently, we've been rewriting our apps using Kotlin. We use Spring Boot to bring it all together. That's the server right there. We've been using Kafka a whole lot for all our message passing. That's where we interface with, for example, other teams like messaging or the downstream analytics. We have Spark and Flink to perform a lot of offline reconciliation over our big data. We will be getting deeper into what the reconciliation looks like, and why it's needed, and so on.

Technology Choices for Operations, and Monitoring

We don't just architect for the happy path. The moment you're done writing code, deploy to production, the job doesn't end there. We also go on-call and take care of operations and hope that no pagers go out. When they do, that can mean a lot of critical failures. There are a bunch of things we've chosen to do to make life easy, so here we go. This talks about once the code is in production, what else can happen? We use a lot of lightweight transactions and retries for writes. This is to avoid gaps as the first line of defense in our online system. This is really before anything bad can happen, we are aware it's a distributed system, so we try to retry or use lightweight transactions at the Cassandra level to make sure that all the systems that need to know about a certain record are in sync. That can fail, network failures. We run multiple reconciliation jobs. This is where the Spark and Kafka work comes in, that reconciles these records within membership. As you've already seen, membership itself has two systems of record. We have the subscriptions database, and we have the member history database. If a write is present here, but not in the other place, our offline jobs would work with that entire dataset, find that discrepancy, and then bring both of those systems up to date. Having said that, membership also upholds that same accuracy guarantees outside of our team so if membership thinks it's a write, we also believe that payments, and billing, and messaging, and all the other systems in the ecosystem should also know that this is the latest, greatest state. In the offline world, we run reconciliation across all of those places to catch gaps. We have a lot of data alerts that are built off of these Hive tables or the backups from the Cassandra to make sure that if there are any gaps, we find them and we fix them. If all of our online systems, like we also run repair jobs in Spark, like this is a discrepancy, we know that the last write wins, so update the other place. We have a lot of those checks. If everything says at the very end, an on-call human would go check every last record and put every single one of them in good state. That's the promise that our team really values that if our members are in good state, they're paying, they are good standing citizens, they should be able to stream.

Now we get to the monitoring, or really observability. Bad things happen. When the pager goes off, you usually have about a few minutes to respond. It's important that these choices are weaved into the development cycle. It's never an afterthought. We do extensive request and response logging. We have the ability to create dashboards and read all of this data using Kibana, Elasticsearch to make sure that if there's a spike in the error rate, we really can just go look it up on the logs dashboard and figure out very quickly, what endpoint, and what is it that's actually erroring out. We use distributed tracing extensively. This is, again, going back to Netflix having a humongous number of microservices. The ability to isolate and say which particular microservice is being latent, or which particular microservice just had a bad deployment and is suffering from a higher error rate, to be able to cascade that back and triangulate and say exactly what's going on, distributed tracing comes in really handy. We have a lot of production alerts. This would monitor all of your usual operational metrics, like what's the latency per endpoint, every app, the averages, and we really track latency. We have those thresholds defined that if a particular endpoint is 20% more latent, like fire an alert. We do care about some of our core metrics. Of course, if the signup rate is not what it should be, or if a lot more people are going on hold, we need to go back and check. We get alerts for all of those. We at Netflix believe that being operationally on top of things is extremely valuable, because that's how our members enjoy the guarantee that when you go home, hit play, it actually plays. In the backdrop, we use a lot of this operational data to build ML models that can get better at anomaly detection. The second an issue happens, somebody is paged, or we have that resilience or that auto-correction, auto-repair built in. All of those are either fixed automatically, or an on-call human could just jump in and resolve that problem for you.

Use Case Studies

We have also established ground as to, where does my team sit? What is our architecture? What are some of the core flows that we uphold? I think it's important that we get deeper and see, what is it that could have been different? How are we getting ready for the future? Here, I would also like to draw a comparison, getting good at system design would be a lot like getting good at a game of chess. Just like you need to know the rules, you need to figure out your strategy, you need to play a lot of games, you need to join a club. Sometimes you need to go back and analyze what were your plays, what worked out and what didn't. That's how you will really get better for the next set of games that await you. Our journey in system design or architectural evolution would be something similar. I roughly started my tenure here in this team about five years ago. There are some decisions that were made that predate me, but there is a learning for us and how we are using that learning as we position ourselves to make the next set of architectural choices. That first one is really about the Netflix pricing, technology choices, and what is it that we learned there? The next one is history, or more like cataloging every single modification made to every single record in our systems, and why that's important. Why at our scale, why we get asked all of these questions all the time, and where were we at? Finally, the last one is where I'll walk you through, what is that journey in that scale expansion looks like and what is it that we are doing to get better. Where is it that we started out? Some more requirements happened, we evolved, and then what does that path look like?

1. Learning from the Past - Netflix Pricing Tech Choices

This pricing architecture choice was made about a decade ago. This was at a time where the Netflix plan and pricing model was very basic. We just had a few plans. Like anybody familiar with relational databases, that's a very simplistic representation. We just had the three plans, one country, one set of prices. The membership services were more or less stateless. There was probably one monolith architecture or just one monolith at the time. Membership services were called by a lot of other Netflix services, because we would still uphold important things. We would still be the service or the team of choice, if you had to start a membership, make any modifications, if you had to switch billing partners, cancel, so on. The decision that was made was to have an in-memory library that would just be these few items here. It would be small. It would just be a few MBs. It would be quick to edit and release. It would just be consumed by one or two applications within Netflix. At the time, this seemed like a good choice, but Netflix got extremely successful. Again, some of these errors and times predate my tenure here. This is what happened. The plan offerings grew. Netflix went global in 2016. The expansion of, let's go in a few more countries, let's try Canada, let's try Mexico, let's try Europe, all of that goes back to 2010. The domain scope of this particular membership library expanded to include quality of service, supported devices, download access, and the list goes on. In some ways, this library started doing much more than it had originally thought out or originally proposed to be doing. More applications within and outside membership became part of critical flows, and the adoption of this library grew. In simple terms, what I'm trying to say is if there was a popularity contest for libraries, this library just won hands down. We'll see what that journey looked like.

The plan offerings grew. As an example, we have the basic, standard, premium, but then we did experimentation, just to name a few. We had the mobile plan that's now offered in India, Asia, and many countries in Africa. It was not there from the get-go. We do run experimentation. We've had a mobile-plus plan and so on. Basically, we grew in the plan offerings dimension. On the other hand, as the Netflix business evolved, we grew in the geography dimension. If that was not enough, the domain expanded, we started now doing supported devices, and we started doing quality of service. The membership services grew, so we started out with one service, give or take, one giant monolith. Now there's a subscription service, there's a gateway service, there's an async service. As you can tell, on the right is pretty much how the library is expanding. On the left is how the adoption is growing. The same library lived in one server, now is already part of many different servers. Time went on, and the library kept getting popular. What we're seeing here is growth engineering systems, messaging, billing, playback, API Gateway, Edge services, you name it, everybody needed that library. It kept getting bigger. Let's also not forget, this is also when Netflix probably went from 10 million to 200 million users. There was horizontal scale. Every single app that existed within Netflix was now going from a 10-node cluster to probably 100. Every single one of them had this library. This is how we finally landed where the library itself grew because now there's so much complexity. It's a combination of an in-memory persistent store. There is code logic to go with the modifications, availability, rules, so on, and the general sprawl across all of Netflix.

This highlights all the challenges, but I think the peak one here, which is actually my favorite because I wrote, typically when you're appearing for an interview, somebody asks you to write a web crawler. All the practice that I got in the interviews really came in handy, because at its peak, we were running a web crawler at Netflix scale to make sure every single node within Netflix had the latest greatest version of the library. Also, let's be mindful of the fact that this library had very critical information, we're talking about Netflix prices. If altered, you need guarantees that every single node or every single app or system that is giving out a price number needs to be the same. You can't afford to have 100 incarnations of a price value. In terms of operational inefficiency, I'll just paraphrase all of this. It was a giant library, so developing, releasing, validation became extremely hard. If there was any problem in how this is configured, it would go fail in some remote corner because the library wouldn't just go by itself, it would also pull in 50, 60 other dependencies. It's really not something that you want to sign up for operationally, that I released this library and then it went and failed 20 nodes away from me, or 20 hops away from me. It was getting very hard. Reliability is something that we care about very deeply. This is price, so we really don't have the luxury of showing you a different price on the signup screen, and when you get that email that's off. Making sure that this price point is exactly what needs to be at all points, required us to really run away from this architecture.

The requirements for the new plan pricing architecture, like if you just look at this, you'll be like, this is a no-brainer. This is so standard. You put a persistence store, you get all the guarantees. You have the gRPC service right in the middle, it takes care of all of the burden of the scale, scale it horizontally, and so on. The only fun thing which is not truly simplistic is that because this was such a popular library, if you really have to create a service for this from the get-go, we had to onboard millions of requests per second right from day zero. We didn't really get that slow ramp-up. We migrated traffic off of the library and into a live system. While the other things are straightforward, that's where the challenge lived.

To summarize, this is what the current version looks like. We have that CockroachDB database. At the very beginning, I introduced you to the fact that our plan pricing information lives there. It just didn't start out like that. We are now in that state where we have all the relational guarantees, persistence, so on. We have a gRPC service that manages all of your traffic, which like on a good day, 3 million, 4 million requests per second. That's where we are using a lot of the shenanigans around client-side caching at the gRPC level. We ourselves cache the entire records in memory, so as not to have the CockroachDB as a single point of failure. We had to deploy a whole bunch of optimizations there to hold up in the face of as much traffic, and also to open the gates to so many important services within Netflix. As you can see, all of the apps directly call into this plan pricing service. All of the external Steams and apps within Netflix also call into this service.

While this architecture itself is simplified and clean, the real takeaway is, the journey to get here was a multi-year long migration across 20-plus engineering teams, and 50-plus apps, and 5-plus libraries, which is also interesting that given you're really talking about all of Netflix, so let alone the apps, there are also these libraries that have an important footprint within Netflix. These libraries themselves would distribute the membership libraries, or your library is everywhere, and then good luck trying to migrate it away. While this was our journey, the key takeaway here that all of us can benefit from, if you're making architectural choices, in this case, it was as simple as, do we need a library or an app? It was a simple one at the very beginning, probably 10, 15 years ago. Somebody chose that the library was a good choice. Really, maybe future proofing your technology choices a little bit can really go a long way. This was showing up to be a problem in our operational logs. The other thing is, you should not ignore that that's a signal. Every week or every two weeks, when you're going off call, you look at what's going on. If there's this item that is pesky and shows up all the time, that's a sign of tech debt. You should pivot quickly and tackle it, or else damage control after many years can be expensive. I have to report back that while a new service is really the de facto place, that's where all our plan pricing, availability, everything is handled, it is serving all of your traffic. This library still exists in certain corners of Netflix. It's really like we have to find the bandwidth to run these migrations, and it's still an ongoing effort. I just wished we had pivoted sooner. Now when we are faced with technology choices, if you're really doing a rewrite, we just don't solve for today, tomorrow, or the next month, we actually solve for a few years. Answer some of those questions when you're in the driving seat.

2. Member History

The next use case study that we go deeper into is member history. When we first started out with member history, it was really only available via application-level events produced by services in a fire and forget fashion. Every service endpoint wrote a bunch of updates and then would emit a single event afterwards. These events were used for tracing historical state, triggering actions for other services, customer service operations, and business reporting. I think I've touched upon how membership importance and criticality evolved. We were at that juncture where the ability to answer questions related to member state at the microsecond granularity became incumbent. Since we only maintain the latest greatest, or the current state of a member, this was really a requirement that was left unfulfilled. This is what was really going on. I think by now we are already familiar with the membership architecture. On the right, you have the plan pricing bits, and the code redemptions, and the member pricing service, which holds all of your state management once you're already a current member. On the left is more of our subscription management going on. In the middle is some of the partner bits. If you see all these events in green, is everything that the system is emitting. On the left, if somebody started their membership, we would emit a start membership event. This would be used by, for example, your messaging system, and then send and craft you that email saying that, "Congratulations, you've just signed up. This is your subscription. This is your billing country. This is what you're going to be paying us. These are your other options. These are your show recommendations." Same thing, if you were to take any action, if you were to change plan, or if you were to switch your, we call it the partner. You could sign up as a Netflix organic member, but at any point, you might say that I want to now pay for my Netflix account via, for example, App Store, like an iTunes setup. That's something we call as a billing partner change event. Our system would emit that. An example here is you are getting renewed, so we would emit a renewal event. From this architecture, what you're really seeing is that there are hundreds of events that are getting emitted, there's a whole bunch of writes that are happening. These events could go out at any time or not. If at any microsecond, if somebody were to call me and question, like I'm the membership on-call, and they're like, at this time, somebody took this action for this account, can you come back and say, who made that change? Why did they make it? What really happened? Or, how did our systems respond? Given membership at the time, this is like a few years ago, we only had the current state of a member, we could not answer, like it's a blind spot. We could hope for the best that these events were persisted in a limited capacity. If there any failures for us to trace that, figure it out, reconcile, all of that was not possible.

This summarizes what the challenges were. These events are also a combination of change history and business logic. There is no persistence, no guarantees that they'll be written if there's an application logic or database failure. There's a whole bunch of problems. To summarize, these are the key ones. There's less than desired observability. There's a gap in the ability to easily and accurately view the series of state changes to our operational datastores. There's really a split brain on mission critical information. Different consumers are eating these events, using them in different ways. Then after the math, they come back and ask questions that, why is this the case? There's no way for us to go back and actually reconcile all of that state and build it out and say, ok, this is exactly what happened. This is where we decided that we needed a new service in our midst, so this is the established goals that we had when we took this turn on that architectural conversation. This is something that some of the databases already do. It's really an append-only log, but we were looking to borrow from that and create a direct delta or capture the direct delta changes on our key operational data sources. This is really a change data capture pattern. In layperson terms, this is really what your append-only log is. We didn't have the luxury to just pick this off the shelf, we were on Cassandra 2 at the time. Given we have to uphold high availability, our team doesn't really have the luxury to now start doing a science project on top of an operational database, which can potentially bring down service. While, yes, we could hack into Cassandra, we chose not to.

This is what the new architecture looks like. This is again, a little tick, but I think that's where the meat of the puzzle is. We have a lot of membership services, so that's abstracted away, but at any time you're trying to update or record. Right in the middle is our membership subscription service, that is the system of record for all of our 220 million going, backed by the Cassandra database. We are here working through an example. We are working through a billing partner update. This is when somebody who's already paying via Netflix chooses to move to a different paying partner, it could be Apple, it could be Xfinity, it could be something else. This is really an update to the cass_member data, which is our database right here. This will just maintain the current state. Once this write goes through, what you would see is, this has been updated. What we also want to capture is at that particular timestamp, we will, in an async way, write to the history database capturing that this is the change that was made, and capture that. Every time a write is updated, we're basically tracking it in an append only log fashion. The other cleanup that we did. Earlier, you saw there's like all of these 20, 30 apps, they're writing out your events to whatever other team they would like to in whatever fashion, that split brain problem is also solved. Because now member history came in the picture, we started tracking all of this information, persisting it, and started using Keystone streams to emit. The ability to retrace, retract, or if there's a failure to come back and reconcile, all of that became possible with the new architecture.

These were some of the planned wins. We were able to unlock super power debugging ability. For, if I'm on-call, somebody comes in and asks that at this microsecond, this happened to this record, can you come back and tell me exactly why it happened? Or my system received this event, and somebody provided consent, can you double check that that actually happened? I was able to answer all of those questions, because now we just didn't have a current view of all our memberships, but we could go back to every single write that happened at every single microsecond, since that membership started to exist. We replaced all our app-level events with a view on top of member history. This goes back to, there is this one system, which is the interface to all other systems within Netflix, so everything needed to go through there. There was never this that this write happened or didn't happen, or like the What IF's. In addition, with schematization and persistence, it was possible. We also brought in better tech, off-the-shelf. We started using Iceberg tables. Using these, we were persisting every single event that we emitted using member history streams. This gave us the ability to reconcile. We could reconcile history with our core membership systems and make sure everything is in order. We were also able to replay. Suppose the sister team analytics or downstream came back and said that we didn't hear about these 20,000 memberships on that certain day, can you replay them? Given now we have a persistent store backing all of these events, we were also able to do that. Reconciliation and making sure all of the systems within Netflix see the same exact accurate records became possible with the history service.

It didn't end there. There were some other ways where history just came and stole the day. There was this issue that happened with data corruption, this was at the Cassandra layer. Netflix was one of the few teams or apps that was impacted. This was really because of the scale that we operate in. We had made a change to our datastore to add a collections data type, but this resulted in data corruption and we lost some data. Because we already had history at play, and this is really our second system of record, we were able to replay back all of those records that were corrupted, and bring everything back to what it should be. Other than saving the day in the times of critical issues, the history service now feeds into customer service contact analysis. If somebody calls and says so and so happened, you'd straight dig in, run a query in membership history, and you will know exactly to the microsecond granularity what changed and why. It is now feeding into downstream analytics pipelines, messaging experience, and has become a backbone of membership data systems. The key takeaway here is some architectural choices may pay off heavy dividends once at play. We should have the courage to invest in big bets. Building out member history and getting it to this juncture was probably a 3 or 4-year-old long journey, but it has really paid for all the effort that we put in worth its weight in gold and some more.

3. Preparing for the Future - Member Subscriptions Ecosystem Evolution

With this, we're really getting to the final lap, and this is where we'll talk about how the subscriptions ecosystem evolved. It starts out with what are the basic architectural choices we did. What are the basic off-the-shelf bits we took, pieced together, and started serving traffic? Then, how in our journey from really, every 100 million, how that has evolved. This also brings back all our system design 101 classes, all of us engineers have to go through this whenever you try switching jobs. We're all aware. If somebody tells you, you need a scalable, performant, highly available, distributed, fault-tolerant system, how would you do it? Because we are joining the distributed systems journey here, a lot of these problems have already been solved for us. You pick up the latest and greatest. You pick up a gRPC service at your app level, you pick up a database, which could be Cassandra for all your scaling needs. Remember to put that backup Cassandra there for fault tolerance, and I think this should pretty much get you through.

Does it really work? There is no ability to reconcile state without talking to the live system. Given this is a very highly available service and system, we can't really go and run queries in the backend or reconcile data on the live system. There is no ability to walk the dataset quickly, and there is no deletion pipeline to keep the records in check, going back to the fact that this is a billions game. There is really no infinite scale. What we needed here was to unlock big data capabilities. For all of our database needs, we also now have added a Spark Casspactor, which picks up the latest backup of the data, and then puts it in Hive tables. This is where we run all our reconciliation jobs, run all our data auditor alerts and figure out if there are gaps, and do some degree of self-healing.

If this were grade school and not QCon, I would ask you what were we missing there? Yes, now we have history. We just got better at answering a lot of the other questions about better debuggability or tracing member state. Adding history in the mix also removed that single point of failure vulnerability from membership systems. Of course, like as an engineer, now I can turn back and see exactly what's going on, where. This architecture solves a lot of our problems, but no infinite scale, so we would have issues with our SLAs if the data keeps growing. That's where cache is something that we have to start looking into. This is something that's on the table. Probably we'll start executing on this. This is really coming back to say that when consensus-based lookups will not serve us well, hash-based lookups will. We're really looking to front all of our persistent database with EVCache, and we will be trading off for performance at the cost of lower consistency. The key takeaway here is there is really nothing like infinite scale. Every system has a limit to which it will perform optimally. It is important to stay on top of innovation and invest in rewrites and architectural evolution proactively. We should not be waiting for an outage to really understand what our system limits are.

Key Takeaways

We're really getting at the juncture where it's all going to come to an end. The Netflix pricing choice that really is the key takeaway from that use case, is, we should future proof our technology choices as much as we can, or react or pivot before it's too late. The key learning from the member history case is some architectural choices may pay off heavy dividends once at play. Have the courage to invest in big bets. Finally, about the member subscriptions evolution, which is really, you really need to keep at it and you're never really done. You might wonder why I say that. That's really because this is a problem that my team is faced with. Somebody very famous said this many years ago. There are only two hard things in computer science, one is cache invalidation, and the other one is naming things. While I'm busy taking care of caching and figuring out what's going on there, if you're looking for your next show to watch, that would be the Queen's Gambit. That's also where I borrowed a lot of the visuals from. This is a story about a girl, Beth Harmon, who goes on this journey to become the best chess player in the whole wide world.

 

See more presentations with transcripts

 

Recorded at:

Feb 05, 2024

Hello stranger!

You need to Register an InfoQ account or or login to post comments. But there's so much more behind being registered.

Get the most out of the InfoQ experience.

Allowed html: a,b,br,blockquote,i,li,pre,u,ul,p

Community comments

Allowed html: a,b,br,blockquote,i,li,pre,u,ul,p

Allowed html: a,b,br,blockquote,i,li,pre,u,ul,p

BT