
Disenchantment: Netflix Titus, Its Feisty Team, and Daemons


Summary

Andrew Spyker talks about Netflix's feisty team’s work across container runtimes, scheduling & control plane, and cloud infrastructure integration. He also talks about the demons they’ve found on this journey covering operability, security, reliability and performance.

Bio

Andrew Spyker worked to mature the technology base of Netflix Container Cloud (Project Titus) within the development team. Recently, he moved into a product management role, collaborating with supporting Netflix infrastructure teams as well as supporting new container cloud usage scenarios, including user on-boarding, feature prioritization/delivery, and relationship management.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Moderator: A quick little cool thing about Netflix. Netflix, I enjoy your video service. That's nice. But really what I like about Netflix is that Netflix as a company has run into a large number of technical problems. And what Netflix has done in a lot of cases is, instead of looking out to the community and waiting for solutions, Netflix says, "Well, we'll just roll our sleeves up and get busy." So I think that's what this talk is going to be about: how Netflix rolled up their sleeves and got busy in the container space. So here we go, give it up for Andrew Spyker and his talk about Titus.

Spyker: I'm here to tell you a story about a hard-drinking working princess, her team that gets everything done, including Elfo and her own personal daemon, and the adventures that she has in Dreamland. Anybody recognize that? All right. For people that aren't raising their hands, go home, binge watch it. This is actually a show from Matt Groening, of "Simpsons" and "Futurama" fame. I'm not actually going to talk about that. I'm going to talk about the adventures of another cartoon character, "Titus," and our container management platform that we've been running for the last three years.

"Titus" is our container management platform at Netflix. It's a scheduling system that handles both service and batch jobs. When I think about service, think about your typical micro-services. When you think about batch, think about things that run to completion. It does fleet wide resource management and it's deeply integrated with AWS in our Netflix ecosystem. Jumping right to some numbers, we have peaked out at this point. We've been running this since the end of 2015. We peaked out at 3 million container launches per week. We started with batch as you can see in the bottom left with the green, and service has been growing on top of our batch workload over the years that we've been running the system. It's a very high churn rate system. So if you think about batch, you can see that we have things that run less than a second. We have things that run from a minute to an hour, and that's the predominant, but we have batch systems that are doing machine learning and other things that are out longer than a day. But this causes a lot of containers to launch fairly frequently. Also, we're pretty much in churn state if it comes to services as well, because we auto-scale our workloads. So workloads don't typically last longer than a day due to our Auto Scaling. So some big numbers. I'll return to some of these.

Introducing myself, I'm Andrew Spyker. I am the manager of the 10-person team, 11 if you count me, that has developed, designed, operates, and supports our entire container management platform. That is the runtime for the rest of our ecosystem at Netflix; we have a platform engineering team, and we're standing on the shoulders of AWS for our cloud.

Titus Product Strategy

So when we started Titus, we had a strategy for how we were going to do containers. What we were looking at, and what was most valuable to us, was using containers to get features out to the service as fast as possible. So it was all about the developer velocity that containers unlocked, compared to our existing virtual machine-based infrastructure. Of course, this is such a key part of the infrastructure that reliability was just as important, but it was kind of second on the list. We didn't actually go after containers for cost efficiency, which was a distant third in terms of our prioritization.

The other thing is we weren't trying to move beyond VMs, move everything to containers, and just make a hard switch. So we needed to make sure that our container environment worked with our VM environment, and that you could take an Amazon-based application, pick it up, and run it in a container unchanged. These were our product focuses. When I say deeply integrated with AWS, I'll just cover a couple of the aspects, but this is a container management platform that is fine-tuned to work on Amazon Web Services. We decided early on we were going to do IP per container, so we natively integrate with Amazon's networking, something called VPC. We do that through ENIs, and we enable application-level firewalling, their concept of security groups.

We also wanted to make sure IAM roles worked correctly, so we actually emulate the metadata proxy that delivers IAM credentials to VMs, and we do that in containers. We leverage Amazon's cryptographic identity for machines to bootstrap our ideas of cryptographic identity in containers. We also have our own service discovery and load balancing inside of Netflix, but we wanted to make sure it worked with Amazon's load balancing as well, so it works with ALB. And rounding this out, fairly recently we added auto scaling to our service-based jobs. We can do this on SQS queue depth and any CloudWatch metric. So all the things that you were used to using in virtual machines are now available to containers.
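
To make the IAM piece more concrete, here is a minimal, hypothetical sketch of the idea behind a per-container metadata proxy: intercept requests for the EC2 instance-metadata credential path and answer with credentials scoped to the role the container was registered with, rather than the host's role. The class, port, and role registry are illustrative assumptions, not the Titus implementation.

```java
// Minimal sketch (Java) of a per-container IAM metadata proxy. All names, the port,
// and the role registry are illustrative assumptions; this is not the Titus code.
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.Map;

public class MetadataProxySketch {
    // In a real system this mapping would come from the scheduler when the container starts.
    private static final Map<String, String> CONTAINER_ROLE =
            Map.of("container-1234", "arn:aws:iam::000000000000:role/example-app-role");

    public static void main(String[] args) throws Exception {
        // A real proxy would answer on the link-local metadata address inside the
        // container's network namespace; a local port keeps this sketch runnable.
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 9090), 0);
        server.createContext("/latest/meta-data/iam/security-credentials/", exchange -> {
            // Identify which container is calling (here: a hard-coded example id).
            String containerId = "container-1234";
            String role = CONTAINER_ROLE.get(containerId);
            // A real implementation would call STS AssumeRole for that role and return
            // the temporary credentials; here we return a placeholder body.
            String body = (role == null)
                    ? "no role registered for " + containerId
                    : "credentials for " + role + " (fetched via STS in a real proxy)";
            byte[] bytes = body.getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(role == null ? 404 : 200, bytes.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(bytes);
            }
        });
        server.start();
    }
}
```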

This has been a long journey for us. So I'm standing up here in 2018 talking about the lessons learned. These are hard lessons learned over three years of running a container management platform. We were mostly doing batch, as you saw on the graph, at the end of 2015, and around the midpoint of 2017 is the first time we actually started having services that were impactful to you as Netflix customers running in the container management platform, and we grew between those two ends of the spectrum.

Our containers are used for things that you may know from the Netflix experience. They're used to manage some aspects of our CDN. Encoding is done in them, taking raw streams and turning them into what you see on the service. Our big data platform team is using this, as is our data pipeline. All kinds of workloads are running in containers at this point. Akin to the last talk, I would say one thing that's not running in the system is stateful database systems, so I'll call that out.

A couple of other numbers beyond the first couple of graphs; you can look at these and see which ones are interesting to you. I'll just point out three different things. In terms of the workloads, we have thousands of different images that are now running in containers. We are managing over 7,000 virtual machines that are sliced up into containers. That's around 450,000 CPUs in Amazon compute terms. In terms of launches, I said 3 million a week; we peaked at 750,000 a day, and you can do the math to figure out what that means per minute and per second.

We're based on Mesos and Docker, although we've customized each. We've written our own custom scheduler, and we've written our own custom executor to wrap around Mesos and to wrap around Docker. I will use these terms for the rest of the talk, so it's worth keying onto this. There is the front door of our container management platform that our deployment systems and our workflow systems work against, which we call our scheduler and our control plane. That's what hosts our API, and it then uses Amazon concepts to spin up virtual machines that are Titus hosts, which are just your normal VMs but really big VMs, like the 16XLs of the instance families. Then we slice those up into user containers, and we run a whole bunch of system agents. We also run our own Docker registry, if you're interested in that.

We've also open-sourced this. So if you want more details on any of this, you can just go to netflix.github.io/titus and you'll get a lot more details. And you can join our Slack and ask questions. Our main reason for open sourcing is not to get a whole bunch of people using Titus. It's to be able to share what we've learned, so the rest of the container community goes in a direction that's beneficial to us and we can help them get there faster.

End-to-End User Experience

So that was the overview. What we're here for is the lessons learned. So let's dive into some things we've seen in the last three years. First, our team is a runtime team at Netflix. So, our initial view of containers in this team was, someone uploads a container to the container registry, they submit a job. We run the job, maybe it's a service and we make sure we monitor it. We make sure it runs reliably and then sometime in the future, they redeploy said job. That was our view of containers and we spent a lot of time getting this really solid.

The reality is there's a lot more to it, and we've started to fill this out over the last year or so, even in the runtime space. What about compliance? What about security scanning? How do you do change campaigns if someone needs to update a Docker image and you need to tell people that their jobs need to be redeployed? There was a whole bunch more that we were missing in the runtime that we've been filling out. And equally important, the runtime is only the third thing that you get to. If you're not thinking about what tooling you need for local development, and if you're not thinking about how this affects your CI/CD systems, it doesn't really matter if you have a solid runtime.

At Netflix, we've created tools in this space. We've had tools in this space for AMI-based, VM-based deployments, and we've created new tools or customized those tools to work for us with containers. For local development, we have a tool called Netflix Workflow Toolkit that helps you manage Docker on your local machine and do local development tasks. For continuous integration, we're using Jenkins, and we put Node onto our CI systems to give more of a common tooling environment between the two. For continuous delivery, an open source project that's out there that, hopefully, everybody knows about is Spinnaker. It has a cloud driver not only for EC2 but also for Titus, which is really a cloud on top of a cloud; it's nothing more than a cloud of containers on top of AWS.

Change campaigns: we have an internal tool called Astrid that would help you understand if your AMI was out of date and what you needed to update in it. Now it's being changed to be able to tell you that for container images. And then for performance analysis, we have two open source projects, around flame graphs and Vector, that we've open-sourced. So really the tooling guidance of this story is: make sure you're considering the entire SDLC of containers. It's not sufficient to think about just your orchestration engine. What does it take to get an application out to your orchestration engine?

We also added things to Docker. One of the things we've done is we've curated a set of base images. At Netflix, you can deploy whatever image you feel like you want to deploy. It's freedom and responsibility, so you've got to make sure you're responsible for that, but for people that are looking for a base Java image or a base Node image to start with, we have a curated set of images for that. We also found that image tagging, consistently through the entire infrastructure, was a really big deal. Once you're running a container and you see a problem, you want to know: what was the continuous delivery pipeline that delivered it? What was the CI system that built it? What was the git commit that actually got that code out to production? If you're not consistently tagging things, it's just a guessing game once it makes it out to your orchestration system. And we try to make this really consistent across our tools.
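
As a small illustration of the consistent-tagging idea, here is a hedged sketch that stamps provenance labels onto an image at build time. The label keys and environment variables are assumptions for illustration, not Netflix's actual conventions.

```java
// Sketch: apply a consistent set of provenance labels at image build time so the
// runtime can always trace a container back to its pipeline. The label keys and the
// environment variables used here are assumptions, not Netflix's actual keys.
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ImageLabels {
    public static void main(String[] args) throws Exception {
        Map<String, String> labels = new LinkedHashMap<>();
        labels.put("org.example.git-commit", System.getenv().getOrDefault("GIT_COMMIT", "unknown"));
        labels.put("org.example.ci-build", System.getenv().getOrDefault("BUILD_URL", "unknown"));
        labels.put("org.example.cd-pipeline", System.getenv().getOrDefault("PIPELINE_ID", "unknown"));

        // Build the docker command: docker build --label k=v ... -t app:candidate .
        List<String> cmd = new ArrayList<>(List.of("docker", "build"));
        labels.forEach((k, v) -> { cmd.add("--label"); cmd.add(k + "=" + v); });
        cmd.addAll(List.of("-t", "app:candidate", "."));

        // inheritIO() streams docker's output to this process's stdout; requires Docker locally.
        int exit = new ProcessBuilder(cmd).inheritIO().start().waitFor();
        System.out.println("docker build exited with " + exit);
    }
}
```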

Operations and High Availability

So moving on from tools, moving onto an area I really enjoy around reliability, high availability, and operations. Learning how things fail is really important, and you don't get these lessons until things have failed at least once. So I wanted to share some of these with you, in roughly increasing severity. A single container crashes; everyone probably understands that. A single host crashing is, of course, going to take down multiple containers, because you're probably running multi-tenant and you're probably going to have a bunch of containers on that host. The control plane failing, that's your scheduler, that's your API; this is where it gets a bit more severe. At this point, none of your containers that are currently running have problems, but you can't start any new containers. What that means is if you're auto-scaling, or if containers are failing and need to be replaced, none of this is occurring once the control plane is in this state. And then finally, the control plane getting into a bad state, which I'll go into in a lot of detail; this is where it can get really scary.

So first, what do we do about single container crashes? We actually have three SLOs in Titus that we look at, and this is on a dashboard in our development area: availability, crashes, and the latency of container starts. This is the actual dashboard that we see in this area. Crashes we consider our failures. So if a user submits a container that exits with a bad error code, that's their problem; that's a failed job. Whereas a crashed job is one where we tried to start the container and something about starting that container didn't work very well.

What you'll find is this happens mostly during startup and shutdown. It's things like configuring your networks, configuring your disk. These are the kinds of things that will fail, and you have to be very careful to watch for what we call "crash loops," which is when a container does something new in the system that we haven't seen before and triggers a bug in our system, so it starts, the container fails, it starts, the container fails, it starts, the container fails. If you're not monitoring crashes in your system, you're not going to see these, and you're not going to learn what you need to improve.
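
To illustrate the crash-loop idea, here is a minimal sketch that counts failed container starts in a sliding window and flags a loop once a threshold is crossed. The thresholds and class names are illustrative assumptions, not the Titus logic.

```java
// Sketch of crash-loop detection: count container start failures in a sliding window
// and back off (and alert) once a threshold is crossed. Thresholds are illustrative.
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayDeque;
import java.util.Deque;

public class CrashLoopDetector {
    private final Deque<Instant> recentCrashes = new ArrayDeque<>();
    private final int maxCrashes;
    private final Duration window;

    public CrashLoopDetector(int maxCrashes, Duration window) {
        this.maxCrashes = maxCrashes;
        this.window = window;
    }

    /** Record a failed container start; returns true if we are now in a crash loop. */
    public synchronized boolean recordCrash(Instant now) {
        recentCrashes.addLast(now);
        // Drop crashes that fell out of the window.
        while (!recentCrashes.isEmpty()
                && recentCrashes.peekFirst().isBefore(now.minus(window))) {
            recentCrashes.removeFirst();
        }
        return recentCrashes.size() >= maxCrashes;
    }

    public static void main(String[] args) {
        CrashLoopDetector detector = new CrashLoopDetector(5, Duration.ofMinutes(10));
        Instant t = Instant.now();
        for (int i = 0; i < 6; i++) {
            boolean looping = detector.recordCrash(t.plusSeconds(i * 30L));
            System.out.println("crash " + (i + 1) + " -> crash loop: " + looping);
            // A real system would stop restarting and emit a metric once 'looping' is true.
        }
    }
}
```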

Moving onto single host crashes, there was a great point on this in the last talk, about making sure that you have a placement engine that's spreading your work across hosts, so you don't end up with everything on one host. But once you do have hosts crashing, which will happen, you need a way to detect and remediate bad hosts. I'm going to pepper in some Kubernetes notes here. Obviously, we're not running Kubernetes, but I thought it might help you bridge back to things that may help you. There's a project going on around monitoring node health in the Kubernetes community, which right now runs a DaemonSet that can detect some of the problems I'm going to talk about. But I don't think it's integrated back with the scheduler, which I'll show you the value of. So this is an important project to keep moving forward.

On Titus, each of our agents, the hosts that are running these containers, has a very extensive health check mechanism. It's constantly checking, on each node: can I pull from the registry? Is Docker up? Is Mesos up? Is the OS in a good state? Are the AWS resources that are attached to this instance in a good state? And it's reporting that back through our health check and service discovery mechanism at Netflix. Our scheduler is actually taking advantage of that data to say, "If this node is bad, I shouldn't send any more work to this node. Let the operator take care of that node in whatever way is possible, and let's schedule around this bad host." This is pretty important, because you don't want a bad host to become a drain for all your work going into the system.
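
Here is a hedged sketch of that agent-side health check pattern: run a set of small probes and treat any failure as a signal to mark the host unschedulable. The specific checks and interfaces are illustrative, not the Titus implementation.

```java
// Sketch of an agent health check that a scheduler could consume: each check is a
// small probe, and any failure marks the host unschedulable so no new work lands on it.
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;

public class NodeHealth {
    private final Map<String, Supplier<Boolean>> checks = new LinkedHashMap<>();

    public void register(String name, Supplier<Boolean> check) {
        checks.put(name, check);
    }

    /** Runs every check; all-true means the node is schedulable. */
    public Map<String, Boolean> evaluate() {
        Map<String, Boolean> results = new LinkedHashMap<>();
        checks.forEach((name, check) -> results.put(name, check.get()));
        return results;
    }

    public static void main(String[] args) {
        NodeHealth health = new NodeHealth();
        health.register("docker-daemon-up", () -> true);   // e.g. ping the Docker socket
        health.register("registry-pull", () -> true);      // e.g. pull a tiny test image
        health.register("agent-connected", () -> false);   // e.g. Mesos agent heartbeat
        health.register("aws-resources-ok", () -> true);   // e.g. ENI attachments healthy

        Map<String, Boolean> results = health.evaluate();
        boolean schedulable = results.values().stream().allMatch(Boolean::booleanValue);
        System.out.println("checks: " + results);
        // A scheduler would read this flag and route new work away from the node.
        System.out.println("schedulable: " + schedulable);
    }
}
```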

We started with manual intervention in these cases. Of course, at our scale that doesn't scale very well, so we moved to automated remediation. What actually happens is when we detect something is bad with a host, we have a system that does automatic playbook automation, something called Winston. We've written scripts that, when a node goes bad, will log into the host, run a set of analyses for known problems we know about, try to remediate any that are actually recoverable, and maybe fix and heal that host so it goes back in healthy and the scheduler starts sending work to it again. But if it's unrecoverable, and it's something we haven't yet been able to fix in the system, we'll tell the scheduler to reschedule the work, we'll terminate the instance, and work will proceed.
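
The flow just described can be sketched roughly as follows: try the known fixes, and if none of them recovers the host, drain it and replace it. All interfaces here are hypothetical placeholders, not the Winston playbooks.

```java
// Sketch of automated host remediation: run known-fix playbooks; if none works,
// reschedule the host's work and terminate the instance. Names are illustrative.
import java.util.List;
import java.util.function.Predicate;

public class HostRemediation {
    /** A known problem: a named fix that reports whether it recovered the host. */
    record Playbook(String name, Predicate<String> attemptFix) {}

    public static void remediate(String hostId, List<Playbook> playbooks) {
        for (Playbook playbook : playbooks) {
            System.out.println("trying " + playbook.name() + " on " + hostId);
            if (playbook.attemptFix().test(hostId)) {
                System.out.println(hostId + " recovered; marking healthy again");
                return; // the scheduler will start sending work to it again
            }
        }
        // Unrecoverable: drain and replace the host instead of leaving it half-broken.
        System.out.println("rescheduling work off " + hostId);
        System.out.println("terminating instance " + hostId);
    }

    public static void main(String[] args) {
        List<Playbook> playbooks = List.of(
                new Playbook("restart-docker-daemon", host -> false),        // fix did not help
                new Playbook("clean-stuck-network-namespaces", host -> false));
        remediate("i-0abc1234", playbooks);
    }
}
```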

Another thing that's been very helpful to us as you get up into the thousands of hosts, is being able to spot what's happening across the instances. For new problems that we're unaware of, there's lots of information coming out from your container runtime, coming out from your OS in the form of logs. We actually push all of our logs from all of our hosts that are running Titus into centralized stream processing, and we look for signal amongst that noise. It does mean we are processing about 2 billion log lines per day to get that done. We're finding very interesting things that new workloads will trigger on our nodes that we didn't know about before.

Moving on to the third thing. Now we're beyond the individual host or individual containers; now we are up to the control plane failing. The easier case, before we get to the really bad one, is a hard failure. Your control plane goes down, your API server is down, people can't submit new work; how do you deal with that? We found two ways to deal with it. First, in terms of white box monitoring, we have metrics and telemetry in our scheduler that say, "This is how long this task has been queued without actually running." We time-bucket that. We say, if a certain number of tasks have been waiting more than a minute, more than 5 minutes, more than 10 minutes, more than 15 minutes, that's increasing severity of the queue backing up with work, and we need to find out what's happening with our scheduler.
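
A minimal sketch of that white-box signal might look like the following: bucket how long each queued task has been waiting and alert with increasing severity as the older buckets fill up. The bucket boundaries follow the ones mentioned in the talk; everything else is illustrative.

```java
// Sketch of time-bucketing queued task age into 1/5/10/15-minute severity buckets.
import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.TreeMap;

public class QueueAgeBuckets {
    private static final List<Duration> BOUNDARIES = List.of(
            Duration.ofMinutes(1), Duration.ofMinutes(5),
            Duration.ofMinutes(10), Duration.ofMinutes(15));

    public static TreeMap<Duration, Integer> bucket(List<Instant> enqueueTimes, Instant now) {
        TreeMap<Duration, Integer> counts = new TreeMap<>();
        BOUNDARIES.forEach(b -> counts.put(b, 0));
        for (Instant enqueued : enqueueTimes) {
            Duration waited = Duration.between(enqueued, now);
            for (Duration boundary : BOUNDARIES) {
                if (waited.compareTo(boundary) >= 0) {
                    counts.merge(boundary, 1, Integer::sum); // counted in every bucket it exceeds
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Instant now = Instant.now();
        List<Instant> queued = List.of(
                now.minus(Duration.ofSeconds(30)),
                now.minus(Duration.ofMinutes(6)),
                now.minus(Duration.ofMinutes(20)));
        // Prints {PT1M=2, PT5M=2, PT10M=1, PT15M=1}: the 15-minute bucket would page someone.
        System.out.println(bucket(queued, now));
    }
}
```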

That's okay if your scheduler is up but failing for some reason. It's not so okay if your scheduler isn't up at all. So, again back to our SLOs here, we have a process sitting external to all of our Titus stacks that's submitting synthetic workload into each of the stacks, tracking the work, and making sure it gets scheduled correctly and executes correctly. And that's more of a black box monitoring system for the entire orchestration platform.
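
The black-box probe could be sketched like this: a process outside the orchestration stack submits a tiny synthetic job and alerts if it does not get scheduled and run within the SLO. The client interface here is a hypothetical stand-in, not the Titus API.

```java
// Sketch of a black-box prober for an orchestration platform. Interfaces are hypothetical.
import java.time.Duration;
import java.time.Instant;

public class SyntheticProber {
    /** Hypothetical minimal client: submit a no-op job and poll whether it completed. */
    interface OrchestratorClient {
        String submitNoOpJob();
        boolean isFinished(String jobId);
    }

    public static boolean probe(OrchestratorClient client, Duration slo) throws InterruptedException {
        Instant start = Instant.now();
        String jobId = client.submitNoOpJob();
        while (Duration.between(start, Instant.now()).compareTo(slo) < 0) {
            if (client.isFinished(jobId)) {
                return true; // scheduled and executed within the SLO
            }
            Thread.sleep(1_000);
        }
        return false; // page the operator: the control plane may be down or wedged
    }

    public static void main(String[] args) throws InterruptedException {
        // A fake client that "finishes" immediately, just to make the sketch runnable.
        OrchestratorClient fake = new OrchestratorClient() {
            public String submitNoOpJob() { return "job-1"; }
            public boolean isFinished(String jobId) { return true; }
        };
        System.out.println("probe ok: " + probe(fake, Duration.ofMinutes(2)));
    }
}
```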

So the fourth one is soft failures, and these are the ones that are going to make you feel kind of sick to your stomach. These are not the ones you want to deal with. But before we go into that, let's talk about zombies. Zombies is one of the words we've used; we've also used orphaned and disconnected. These are all words for containers that are running in the environment that the control plane doesn't know about. Think about a case where Docker has hung for some reason, or the OS has hung for some reason. And the user has come along and said, "I want to kill that job. I want to redeploy the next version of my microservice." Our control plane is trying really hard to get information down to that container saying, "You should stop." But it can't. There's something broken in the control plane between where the user is and where the container is.

We had a "fix" for this initially, and I'm going to use air quotes there, called implicit reconciliation in Mesos, which we leverage. What we did was we said, "Hey, Mesos. Tell us everything you know that's running." Then we compared that to what we knew should have been running, and we'd send a kill command for everything that shouldn't be running and say, "That will get it to a consistent state." Thumbs up, right? But what happens if your cluster state goes bad? What happens if the systems, like your controllers, or your agents, or hosts, or your scheduler, have a problem reading that data?

It actually happened to us. It's probably happened to everyone that's run this for a long enough amount of time. What happened was our scheduler came up. It actually had a latent bug that was introduced months before. It came up, tried to read data out of the database, and there was a bit of a drift in the version of the schema that it was reading. It threw an error, swallowed said error, and then loaded up with zero state in the scheduler. So five minutes later we said, "We're going to reconcile this problem." And this earned us the sticker on my laptop here of "I've killed production," where in the East Coast data center we had 12,000 containers running, and within 30 seconds, we shut them all down. This is a funny story to tell, and it's amazing to learn from, but it's very impactful when you realize what it means. What was it? Winston's mom could not show her grandchildren Moana at that time. It makes you feel a lot rougher about what's happening when you do things like this at Netflix.

The more interesting thing is it took us an hour to restore service. We have a global traffic system I'm going to talk about that saved us within five minutes. But it took us an hour as a set of people that have expertise in this, to understand what was actually happening, redeploy the code, catch that exception, remediate that data and get the system back up and running. Our data was still there, so we could restore our state. So you can see on the right-hand side, about an hour later, the containers come back.

But the guidance here is you're running a fairly complex distributed database underneath your Kubernetes cluster or your orchestration system. If you don't know how to back that thing up, you don't know how to test that the restores are working, you haven't tested for corruption, and you don't know what failures can actually occur here, you're waiting for a world of hurt to come your way. And it's really important to be an expert about this, or leverage an off-the-shelf containers-as-a-service like EKS or GKE, those kinds of things, to do this on your behalf. But it's really important that you're an expert in how etcd interacts with all the systems that Kubernetes offers you.

At Netflix, we learned from this. We moved to a less aggressive reconciliation model. If the scheduler comes up and there is a mismatch of more than a couple of containers, we'll immediately page the operator. The operator will get in and figure out what's going on. All of our containers continue to run in their previous state, and then we can heal the system with a human involved. We also moved to automated snapshotting and testing of backup and restore, as I said. So every so often, we actually back up all the data out of the production clusters, load it into our staging clusters, and make sure it loads as expected.
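
Here is a minimal sketch of that guarded reconciliation idea: compare what the cluster says is running with what the scheduler believes should be running, but refuse to issue mass kills when the mismatch is larger than a small threshold, paging a human instead. The threshold and interfaces are illustrative, not the Titus values.

```java
// Sketch of guarded reconciliation: never auto-kill a large set of "unknown" containers,
// because a large mismatch probably means the scheduler's own state is bad.
import java.util.HashSet;
import java.util.Set;

public class GuardedReconciler {
    private static final int MAX_AUTOMATIC_KILLS = 2; // assumed threshold for illustration

    public static void reconcile(Set<String> actuallyRunning, Set<String> expectedRunning) {
        Set<String> unknown = new HashSet<>(actuallyRunning);
        unknown.removeAll(expectedRunning); // containers the control plane does not know about

        if (unknown.size() > MAX_AUTOMATIC_KILLS) {
            // Too large a mismatch: leave everything running and get an operator involved.
            System.out.println("PAGE OPERATOR: " + unknown.size() + " unknown containers, not killing");
            return;
        }
        for (String containerId : unknown) {
            System.out.println("killing orphaned container " + containerId);
        }
    }

    public static void main(String[] args) {
        Set<String> running = Set.of("c1", "c2", "c3", "c4", "c5");
        Set<String> expected = Set.of("c1"); // e.g. the scheduler loaded with almost no state
        reconcile(running, expected);        // pages instead of killing four containers
    }
}
```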

Security

So let's move on to security. We do a couple of things. We try to reduce the escape vectors through things like seccomp and AppArmor profiles that are configured for what we know is needed. We also make sure that containers can only get the cryptographic identity that they were set up to have. And we've done something that I think is still fairly unique, in terms of reducing the impact if a container escape were to occur: we run user namespaces. This is something I think is still a challenge in Kubernetes. Let me explain user namespaces first. If you're in the container and you think you're root, you're not actually root on the host operating system. You take all the user IDs and you shift them: inside of the container, they might have lower UIDs; outside of the container, they are running with higher UIDs, and they're not shared with system user IDs.
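
The UID shift is easiest to see with a tiny worked example. With Docker's userns-remap, the offset comes from the /etc/subuid range assigned to the remap user; the offset value below is just an example.

```java
// Sketch of the UID shift that user namespaces provide: UID 0 ("root") inside the
// container maps to an unprivileged high UID on the host.
public class UidShift {
    private static final long HOST_UID_OFFSET = 100_000L; // e.g. first UID of the subuid range

    static long containerToHostUid(long containerUid) {
        return HOST_UID_OFFSET + containerUid;
    }

    public static void main(String[] args) {
        // Root inside the container is not root on the host...
        System.out.println("container uid 0 -> host uid " + containerToHostUid(0));       // 100000
        // ...and an app user maps to an equally unprivileged, non-system host UID.
        System.out.println("container uid 1000 -> host uid " + containerToHostUid(1000)); // 101000
        // This is also why persistent storage gets tricky: files written by the container
        // land on disk owned by the shifted host UIDs, not the IDs the app thinks it has.
    }
}
```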

This was pretty challenging to get working for us. It took a couple of quarters. The big challenge is with persistent storage. If you shift those user IDs and then you write to persistent storage, you may have random different user IDs writing to your persistent storage. We ended up having to do some kernel work to make this work with some of the NFS that we use, but not everybody has done that. And I think the challenge in Kubernetes is trying to get that to work for all the persistent storage systems that are out there. But this is a really good mitigation: if someone escapes the container, they don't immediately get root on your hosts.

The other one that I see as really important in the security space is isolating the control plane network from the container network. People that are running overlay networks, make sure that they're not bridgeable back to your main control plane. Make sure it's mTLS, and make sure that if you're on Amazon or elsewhere you have the firewall rules set up between them. You can see clearly that hackers know this. They are out there scanning for Docker and Kubernetes and for weak configurations in this space. Even this year, Borg, which is well ahead of Titus from a maturity perspective, had a security researcher point out that they had some gaps in how the network was bridged, and he was able to kind of bounce into the control plane at Google.

And being honest, we had the same problem. Hackers didn't find it on our behalf; we actually found it through our own tracing and testing, and we have since remediated it. But if you knew some tricks, you could actually bounce from our container network into our control plane network, and that's a really bad thing, because at that point, you can start launching whatever containers you want with whatever security you want, which is something you want to really avoid. So really lock down and isolate your control plane network.

We've also tried to keep our users off of our hosts. Initially, we had our hosts wide open. If someone wanted to log in and look at what was happening to a container, they could log in to the host and look at the containers that were on it. We found people that didn't want to set up a Docker environment logging into our production hosts and running Docker commands, and that was not goodness. What we found was that the real usages were things that needed kernel-level access, which were still hard in a container environment. So we've helped that by making it more self-service. One of the things we've done is, for anyone that wants to SSH in and get access to a container, we give them a drop-down link out of all of our tools and say, "Copy this to your clipboard." The way it works is they get a command that SSHes into our bastions under their user account, and then goes through our special SSH daemon into their container, which only gives them access to what they need inside the container.
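
For illustration only, a "copy to clipboard" link like that might hand the user a pre-built command along these lines. The host names, the container-shell command, and the daemon behavior are assumptions, not the Netflix tooling.

```java
// Sketch of generating a self-service SSH command that jumps through a bastion under the
// user's own account and lands in one specific container. Everything here is hypothetical.
public class ContainerSshCommand {
    static String build(String user, String bastionHost, String containerHost, String containerId) {
        // OpenSSH's -J (ProxyJump) routes the connection through the bastion; the final
        // hop would be handled by a daemon that scopes the session to one container.
        return String.format("ssh -t -J %s@%s %s@%s container-shell %s",
                user, bastionHost, user, containerHost, containerId);
    }

    public static void main(String[] args) {
        // This is what a "copy to clipboard" link in a UI might hand the user.
        System.out.println(build("jdoe", "bastion.example.internal",
                "titus-agent-42.example.internal", "container-1234"));
    }
}
```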

We also made our performance tools self-service, so things that you might need a kernel-level debugger to do, around Node.js or whatnot, you can do through a point-and-click out of our UI. So really focusing on those use cases that drove users to want host access, and making them self-service, is one of the things we've focused on.

Scale – Scheduling Speed

I'm moving on to scale and scheduling speed. I told you during that incident we had five minutes of downtime. In fact, Netflix was down; if you're an iOS user, you didn't get Netflix. That was our bad. Sorry. But five minutes later, Kong came along. What Kong is for us is the ability to shift traffic. We run out of three AWS data centers. We can take one data center offline and shift the traffic within five to seven minutes to the other two data centers in the world. So you start moving around the world in terms of where you're connecting to Netflix, and we call that thing Kong.

So this is a real-world example of Amazon Web Services having elevated API problems, which basically meant we couldn't start up new hosts, which meant Titus was going to have capacity issues. We pulled the lever and shifted traffic. What you can see here is our overall calls to netflix.com. Around the time of the red arrow, we took all the traffic out of US-East-1. We took half of that traffic (it's not exactly half, but I'll say half) and shifted it to EU-West-1, and we took the other half and sent it to US-West-2. But you can see there's a huge spike in the amount of demand that the EU had to take at this point in time.

So the infrastructure challenge here is: okay, now you need to increase capacity in that savior region, really quickly. Within the next seven minutes, can you launch a few thousand containers? It's easy, right? If you go and you read things, it's like, "Yes. We did this amazing performance work." There's one here from Kubernetes, one here from HashiCorp. Great work, and I don't want to take anything away from it; the performance work here was amazing. But if you look into the details, almost all of these are synthetic benchmarks. They're not running the heterogeneity of workloads that you would run in production. They aren't including full end-to-end launches where you're doing a Docker pull and you're setting up the network. A lot of them are really synthetic tests of how fast your scheduler can go disconnected from the agents.

And the last one is critically important because I know for a fact, it's really hard. The network in the cloud today is still not fast enough to be able to support this kind of elastic burst at once. So you need to be looking at the more real-world examples. We can do this in Titus though. Actually, that graph I showed you before was real. So how do we get it done? Two things. We do dynamically changeable scheduling behavior. So we don't have one scheduler. We have a customizable scheduler, and then we also do networking optimizations which we can do in Amazon.

So, explaining our scheduling: this is our normal scheduling algorithm. During normal business hours, when nothing unusual is happening, we're spreading our workload as much as possible. What that means is, as application A comes on, we'll spread it across as many hosts as we possibly can. And as application 2 comes on (sorry, app 1 and app 2), we'll again spread it out across as many hosts. They'll end up using some of the same hosts, but hopefully they are not all co-located on a single host. So normal scheduling is really trading off speed for reliability concerns.

Now, when that failover event occurs, we've got to do things really fast. So we will shift our scheduling algorithm over to packing, where we can take advantage of batching and other things that I'll show you. What that means is during that failover event, instead of spreading an application out across as many hosts as we possibly could, we start co-locating it as hard as we possibly can, and we start doing the same thing for application 2. This is what gives you speed. The reason is, if you previously spread your workload out across all of your hosts, there's a high likelihood that the Docker images are now on as many hosts as possible. And the things that take time to set up, like networking interfaces that are tied to security groups, are already configured, so you can take advantage of those. So the fact that we spread makes it easier to bin-pack very tightly at a fast rate during failover.
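
A toy sketch of the "dynamically changeable scheduling behavior" idea: score candidate hosts by how many containers of the same application they already run, and flip the ordering to switch between spreading (normal operation) and packing (failover). This is a simplified illustration of the concept, not the Titus scheduler.

```java
// Sketch of switching between spread and pack placement by inverting a co-location score.
import java.util.Comparator;
import java.util.List;

public class SpreadOrPack {
    enum Mode { SPREAD, PACK }

    record Host(String id, int containersOfThisApp, int freeCpus) {}

    static Host pickHost(List<Host> candidates, Mode mode) {
        Comparator<Host> byColocation = Comparator.comparingInt(Host::containersOfThisApp);
        // SPREAD prefers hosts with the fewest copies of this app; PACK prefers the most.
        Comparator<Host> order = (mode == Mode.SPREAD) ? byColocation : byColocation.reversed();
        return candidates.stream()
                .filter(h -> h.freeCpus() > 0)
                .min(order)
                .orElseThrow();
    }

    public static void main(String[] args) {
        List<Host> hosts = List.of(
                new Host("host-a", 3, 8),
                new Host("host-b", 0, 8),
                new Host("host-c", 1, 8));
        System.out.println("normal hours:  " + pickHost(hosts, Mode.SPREAD)); // host-b
        System.out.println("failover mode: " + pickHost(hosts, Mode.PACK));   // host-a
    }
}
```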

The other thing that is still unsolvable, and we're doing our best here, is bursting on IP addresses. The only resource we don't have during that failover is the IPs that are in those application firewalls, and propagating that information across the fleet. We've moved to a model where we basically burst-allocate IPs. What I mean by that is, on any one node, if a container starts, we guess that there are a few more containers behind it. So instead of asking for one IP address for that container, we'll ask for as many IP addresses as we can get, and if we overshoot, no big problem; in a few minutes, we'll garbage collect them and get back to a normal state. So this is our dynamically changeable scheduling behavior with our bursty networking, and these are real-world results. This is during that event: we launched about 7,500 containers in the EU in about five minutes, and this is what it gets you.
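
A small sketch of the "burst allocate, then garbage collect" idea: when one container needs an address, optimistically request a batch on the assumption that more containers are right behind it, and release the unused ones later. The batch size and allocator interface are assumptions for illustration.

```java
// Sketch of burst-allocating IPs per node and garbage collecting the spares later.
import java.util.ArrayDeque;
import java.util.Deque;

public class BurstIpAllocator {
    private static final int BURST_SIZE = 8;          // assumed batch size per request
    private final Deque<String> spareIps = new ArrayDeque<>();
    private int nextIp = 10;                          // fake address source for the sketch

    /** Get one IP for a container, pre-fetching a batch when the local pool is empty. */
    public String allocate() {
        if (spareIps.isEmpty()) {
            // A real implementation would ask EC2 to assign BURST_SIZE addresses to the ENI.
            for (int i = 0; i < BURST_SIZE; i++) {
                spareIps.add("10.0.0." + (nextIp++));
            }
        }
        return spareIps.removeFirst();
    }

    /** Periodically return addresses nobody ended up using. */
    public int garbageCollectSpares() {
        int released = spareIps.size();
        spareIps.clear(); // a real implementation would unassign them from the ENI
        return released;
    }

    public static void main(String[] args) {
        BurstIpAllocator allocator = new BurstIpAllocator();
        System.out.println("container 1 got " + allocator.allocate()); // triggers a batch of 8
        System.out.println("container 2 got " + allocator.allocate()); // served from the batch
        System.out.println("released " + allocator.garbageCollectSpares() + " unused addresses");
    }
}
```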

Scale – Limits

So the next is scale in terms of limits. About two quarters ago we had a question: how far can Titus go in a single stack? We actually knew through Twitter's use of Mesos that Mesos wasn't going to be a problem. But we started to see cracks in our own scheduler; it was creaking. At our current level of scale, we were like, "We don't know how much further we can actually go and still be okay." Also, any error we made would take down, as you saw, one region at a time, which takes down basically a third of the capacity of all of Netflix simultaneously.

So we wanted to look for a solution to this. We had two options, and we had great arguments on this, for what it's worth. One was idealistic: we could just keep improving performance. We knew there were things we could improve in our scheduler performance, and we could keep picking up watermelons there, and we could try to keep automating our way out of making mistakes. Or we could be realistic. We talked to a whole bunch of folks, like Google, like Amazon and others, and found that they had this concept of cells, where you test yourself up to a certain level and you know you're okay at that level, as opposed to worrying about where that cliff is, with a little fuzziness as to where that cliff is. You know where it is; as long as you run at that size, you're good to go. And we can use that as another way to contain our own mistakes, because realistically, we are going to keep making mistakes.

So we introduced something called Titus Federation, and it's just purely a way that we can scale out a single Titus environment. It's not about the Federation to do cloud bursting. We are not interested in bursting between the data center and a cloud or multiple clouds. We are not trying to join resources across business units. We're just trying to scale out a singular pool of resources for one Titus environment.

Our implementation is fairly straightforward. As far as our users know, there is one Titus environment in each of the regions of AWS. It's behind a VIP named by what environment it's in and what region it's in. So they go to titusapiuseast1.prod, and there's one Titus there as far as they know. Under the covers, we actually stand up cells that are an entire copy of that architectural diagram I showed you at the beginning, and we'll stamp that out more than one time. And we maintain, without the users worrying about it, which applications are in cell 1 and which applications are in cell 2. If someone does a query, we'll actually join the results between cell 1 and cell 2. So as far as they know, it's one big environment.
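
The proxy behavior described here could be sketched roughly as follows: pin each application to its assigned cell for writes, and fan queries out across all cells and join the results. Cell names and the client interface are illustrative, not the Titus federation code.

```java
// Sketch of a federation proxy: route submissions to the owning cell, merge query results.
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class FederationProxySketch {
    /** Hypothetical per-cell client. */
    interface CellClient {
        String submitJob(String application, String jobSpec);
        List<String> listJobs(String application);
    }

    private final Map<String, CellClient> cells;   // cell name -> client
    private final Map<String, String> appToCell;   // application -> cell name

    FederationProxySketch(Map<String, CellClient> cells, Map<String, String> appToCell) {
        this.cells = cells;
        this.appToCell = appToCell;
    }

    String submitJob(String application, String jobSpec) {
        // Pin each application to one cell so its jobs stay together.
        String cell = appToCell.getOrDefault(application, "cell-1");
        return cells.get(cell).submitJob(application, jobSpec);
    }

    List<String> listJobs(String application) {
        // Queries fan out and the results are merged, so users never see the cells.
        List<String> merged = new ArrayList<>();
        for (CellClient cell : cells.values()) {
            merged.addAll(cell.listJobs(application));
        }
        return merged;
    }

    public static void main(String[] args) {
        CellClient fakeCell = new CellClient() {
            public String submitJob(String application, String jobSpec) { return "job-1"; }
            public List<String> listJobs(String application) { return List.of("job-1"); }
        };
        FederationProxySketch proxy = new FederationProxySketch(
                Map.of("cell-1", fakeCell, "cell-2", fakeCell),
                Map.of("api-service", "cell-2"));
        System.out.println(proxy.submitJob("api-service", "{}"));
        System.out.println(proxy.listJobs("api-service")); // joined across both cells
    }
}
```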

So how does that look? As I said, it's that simple. For every one of these little sets of computers in that architectural diagram, there's a whole Titus environment, and there's just the federation proxy sitting in front of them that does the routing I talked about. Then the next question is how many cells should we have? Should we have N? Should we have two? Should we have three? We've gone with two, at our current level of scale. We just want as few large cells as contain our blast radius and our scalability limits. We're not trying to do this for any other reason, and we know larger resource pools can help us with cross-workload efficiency, with our operational overhead, and with the impact of bad workloads. If you throw a bad workload into a small pool, it affects a majority of the pool. If you throw a bad workload into a very large pool, it's much less noticeable.

Performance and Efficiency

So moving on to performance and efficiency. This is something we're actually deploying right now, so you'll learn more over the next few months; I think it will be open source fairly quickly as well. We run on 64 vCPU servers, but I'm trying to make this simpler, so this is now down at the individual machine level, showing a fictitious 16 vCPU host. If you look at the left side, that's going to be one chip, one package. The right side will be another chip in the same machine, and inside each package there are cores. Each of those cores, from top to bottom, has the main core and the hyper-threaded core. So that's why it shows up as 16.

What we have seen is, consider a couple of workloads landing on this host; let's call them static A, B, C, and burst D. Suppose you start placing the workloads by just smearing them across the cores that are in the server. So let's put workload A there, workload B there, workload C here, and the burst workload there. Well, it all fits, right? There's still some room for other workloads to come in. What we found is that even using quotas, even doing CPU isolation, we saw outliers in terms of latency. We saw outliers in terms of how long a batch job would run and what the latency of a service job was. And the reason is the problems that this layout incurs.

So some of the potential performance problems I see in how this got laid out: static C is crossing packages, so if you have NUMA memory in your system, you're going to be thrashing memory between packages for workload C. Also, B can steal from C, because they are on a core and its hyper-threaded core: one is going to get good performance, the other is going to get almost-good performance. And finally, the bursty workload is sitting there while there's a whole bunch of compute that's not being used that could be used.

We're working on something called low-level CPU rescheduling. I think there's an alpha implementation in Kubernetes doing something like what we're doing with CPU sets, but I think the feature to configure this from a Kubernetes control plane perspective is in beta. After the workloads are placed on the node, we'll actually rearrange them through CPU sets to be optimally efficient. So what you can see in this case is we moved static A to use the first core and its hyper-thread, as well as the first core of the next compute. We moved static B to be over on one package. We moved static C to not cross packages. These are the ways that you are going to get the best performance, and have the least amount of what we call "noisy neighbor" within the container environment.
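
Here is a simplified sketch of the rearrangement idea: after placement, assign each static workload to whole physical cores (a core plus its hyper-thread sibling) handed out in package order, so no two static workloads share a core. The topology is the fictitious 16-vCPU host from the talk; the assignment logic is an illustration, not the Titus agent.

```java
// Sketch of low-level CPU rescheduling via cpusets: hand each static workload whole cores.
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CpuSetRebalancer {
    /** One physical core: its two hyper-thread vCPU ids, and which package it lives on. */
    record Core(int pkg, int vcpuA, int vcpuB) {}

    static Map<String, List<Integer>> assign(Map<String, Integer> vcpusWanted, List<Core> cores) {
        Map<String, List<Integer>> cpuSets = new LinkedHashMap<>();
        int next = 0;
        for (Map.Entry<String, Integer> workload : vcpusWanted.entrySet()) {
            List<Integer> vcpus = new ArrayList<>();
            // Hand out whole cores, in package order, until the request is covered.
            while (vcpus.size() < workload.getValue() && next < cores.size()) {
                Core core = cores.get(next++);
                vcpus.add(core.vcpuA());
                vcpus.add(core.vcpuB());
            }
            cpuSets.put(workload.getKey(), vcpus); // would be written to the cgroup cpuset
        }
        return cpuSets;
    }

    public static void main(String[] args) {
        // Fictitious 16 vCPU host: 2 packages, 4 cores each, sibling threads n and n+8.
        List<Core> cores = new ArrayList<>();
        for (int pkg = 0; pkg < 2; pkg++) {
            for (int c = 0; c < 4; c++) {
                int coreId = pkg * 4 + c;
                cores.add(new Core(pkg, coreId, coreId + 8));
            }
        }
        Map<String, Integer> wanted = new LinkedHashMap<>();
        wanted.put("static-A", 4);
        wanted.put("static-B", 4);
        wanted.put("static-C", 4);
        // With these sizes each workload ends up on whole cores inside a single package;
        // a real implementation would also actively avoid package-crossing for odd sizes.
        assign(wanted, cores).forEach((name, cpus) -> System.out.println(name + " -> " + cpus));
    }
}
```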

At this point, we're doing it basically as the workloads show up. So our control plane has no idea of this, but our agents will optimize it. Over time, I think we will move to this being more dynamic: we'll rearrange as workloads disappear, and we'll give this knowledge back up to the control plane. And there are some interesting things you can do there, like saying, "Hey, I just happened to see this noisy neighbor problem over here between these two workloads. The next time you schedule these, could you push them across hosts, as opposed to putting them on the same host?" These are some of the things we're working on in this space.

The final thing I will talk about is opportunistic workloads, building on that idea. I talked about those bursty workloads. Actually, I think I skipped that; let me go back one second. The bursty workload was the other one, taking up two compute units. We can let bursty workloads burst into capacity that the workloads which care about static, repeatable performance aren't using. So we will let the bursty workloads expand out, and then as workloads come in and need those cores and repeatable performance, we'll shrink those bursty workloads back to their minimum requirements. Thank you.

Taking that a bit further: at Netflix, we buy all of our compute as reserved instances. So we buy three years ahead. The dotted blue line is our projection of what we need to run the entire Titus environment. The reality is, once we buy that, users use it. So there's this green line of what they have actually allocated out of that system, what they're actually saying they're going to use. And then there is a yellow line of what they are actually using, because they may run the workload but not actually use all the capacity they asked for, or they may use the capacity but underutilize it.

There's an amazing opportunity here for us. We've done this with virtual machines, at a virtual machine level; now we're looking at doing this at a container level as well, where we can take advantage of essentially that trough of underutilization. So if you imagine our encoding farm that spins up every night and has to encode massive parts of the Netflix catalog, if you can run this on compute that's been purchased but not used, there's a definite financial win in there. So you can see we've transitioned now from focusing on developer velocity, to reliability, and now to efficiency. And with that, I think I can take questions.

Questions and Answers

Participant 1: I wanted to ask if you thought about using bare-metal servers at any point to have the 16 vCPUs?

Spyker: So the question was, have we thought about bare-metal servers? Absolutely. We're working towards that, but at the scale we are at right now … So bare-metal is an Amazon product offering that came out at the last re:Invent. At the scale we're at, I think we've got a little bit more time, both us and them, maturing that to be able to handle the tens of thousands of instances that we are going to need, but it's definitely, totally of interest. The most interesting thing, and part of the reason why we run at the largest size of the family, is that once you get everyone else off of your physical machine, you start to get access to performance counters and other instrumentation that is kind of unsafe to use on the smaller machine sizes, such that we see the future of the M5s as well as bare-metal as giving us a lot more instrumentation to help us power Titus in terms of our efficiency work.

Participant 2: I was wondering about the Titus Federation. Do the different individual pools need to talk to each other to coordinate any kind of resource allocation?

Spyker: They're completely isolated from each other. They have no idea that the other exists. It's only us as operators that know they exist and the federation proxy above. So clarification on that is the Titus stacks don't know each other exist, but every container is given an IP address with the security group. So as long as the security groups are configured that they can talk to each other, it's no different than them reaching sideways and talking to a VM. So the question was do they get their own IP space? We actually use VPC and we allocate the IPs out of the VPC. So we don't have like a centralized IP pool per cluster. It's at the VPC level.

Participant 3: I wanted to ask about the last thing you talked about, with the CPU set leveling. If you're not notifying the control plane that you're doing that, it sounds like you have to be concerned about maybe a little race condition, like, what if the control plane is scheduling workloads that you've just rebalanced CPU away from? How do you call back to the control plane to say, "No more workloads here please"?

Spyker: In terms of over-scheduling the box, we will still honor the minimums and anyone that's a static workload, so we won't end up creating a problem where the scheduler doesn't know resources are overbooked. But your question of, "Hey, we just rebalanced that, now the control plane sends something else and it needs to be rebalanced again," I think that's a learning we're going through right now. The most interesting thing is it was actually our algorithms team that saw some of the noisy neighboring, especially around burst, and they're actually helping us with the algorithms to do that convergence over time. So it's going to be interesting to see how that evolves.

Participant 4: So you had mentioned that Titus is the accumulation of work over the past three years, and there's been some issues [inaudible 00:44:26] and different directions of kind of correcting all that. Do you have any recommendations or suggestions on maybe things you wish you would have known at the very beginning of all this?

Spyker: I think the most interesting thing that affects the most people is just the amount of technical investment that it takes to run a system like this, and develop a system like this. The fact that the Kubernetes community is growing is immensely valuable, but beyond just the technology, what's growing is the knowledge of how to operate something like this, and I think for the masses, I would strongly recommend moving towards a managed service for containers. Even though we had an idea of what kind of investment it was going to take, we were probably undershooting on how many experts we would need in kernel-level systems, distributed systems, all the things that go into running this complex an environment. That's one answer. I'm sure I have more.

Participant 5: You talked before about developers actually doing SSH to the agent machines. What are the reasons a developer does an SSH instead of using tools on top of it?

Spyker: I would say most users are not, but we always wanted them to have the ability to operate the application that's running on top of us. We have no restrictions on what sorts of applications people will run inside of their containers; we just expect them to be responsible for what they are running. If we locked down that access, forced them to go to a centralized logging facility or those kinds of aspects, we'd limit the number of workloads that could come onto the platform. And we just wanted to keep it as [inaudible 00:46:15], as opposed to sort of a PaaS or more of a 12-factor solution. So SSH is always that external escape hatch of, "We forgot something. You still have access."

Participant 6: My question is, I assume for Netflix, a lot of requests must be stateful sessions, right? And when you fail over between different data centers, or when you fail a host, how do you maintain a session?

Spyker: So the requests are not stateful, and this is not my area of expertise, so I will lead off with that. We triply replicate all the data that's associated with the Netflix customer experience. So we're ready to serve you in the other regions of the world, and we pay for that extra data and extra capacity; that's how we deal with that. So if you get switched over, we'll say, "Hey, you're a new person here," authenticate based on your tokens, and then proceed to use the data that you had before.

 


 

Recorded at:

Feb 09, 2019
