


Full Cycle Developers @Netflix


Transcript

How many of you went to that talk earlier by Slack? It was really interesting. If you didn't get a chance to attend, check out the video. I thought it was really informative, really fascinating. But good afternoon. Welcome to my talk. I'm going to be speaking on full cycle developers at Netflix. I'm going to take you on a journey back in time today.

But first, let's begin with the present, the state of Edge Engineering at Netflix in 2018. Ten years ago, we had less than a million streaming subscribers. Today, we have over 137 million. Ten years ago, Netflix streaming was only available in the United States. Today, we're in over 190 countries. According to the latest Sandvine report, Netflix now accounts for 15% of downstream traffic across the entire global internet. In Edge Engineering, we build and operate more than 50 applications, and yet we only have about 80 developers. That's 80 developers for 50 applications. Our deployments happen daily. Our developers are empowered to quickly troubleshoot and fix any production problem on their own. Despite all this, Edge Engineering has zero test teams and zero dedicated operations teams.

So how did we get here? Well, at this point you might be asking yourself, "All right. Who is this guy? Who is this Greg Burrell? Why is he qualified to talk to me about Netflix?" Well, I've been there for over 13 years. And in terms of Silicon Valley time, that's pretty much forever. I realize it's unusual for one person to stay in one place for so long. But to me, it hasn't felt like one place, because Netflix is constantly evolving, and I've evolved along with it. I was hired in 2005 to be a part of this experimental group within Netflix, a group that was playing around with this crazy idea of sending movies over the internet, wacky stuff like that. To put that into context, when I was hired in 2005, I tried to explain to my parents about this great new job and the exciting things we were doing. There were no examples of an online, subscription-based service that sent movies over the internet. And so for years, my mom told her friends that I worked at a video store. I spent a lot of time fielding questions about where our store was located and what our late fees were like. Fortunately, Netflix is now a household name, and that explanation is a lot easier.

Now, I've mentioned Edge Engineering. Let me tell you about this wonderful group. Edge Engineering, we build and operate the services that power sign-up. Maybe you've heard about some of our great programming, some of our great shows, and you want to check them out for yourself. So on your phone or your tablet or your TV, you go to Netflix and you join. You become a member. Now, you want to find something to watch, so you browse around. Maybe you check out some new releases. Watch a few previews. Maybe you scroll through some of our Netflix originals. But you're a fan of the Marvel Universe, and so you look for Marvel's "The Defenders" and you find it. Then you press play. And within moments, you're watching. You're engulfed in the story. So maybe after an hour or six, you decide to turn off the TV and then you go into the bedroom, but you pull out your smartphone and you resume watching right where you left off. All this magic happens via Edge Engineering. We own the majority of critical tier-one services within Netflix. Despite the name, Edge, we're actually at the edge and several layers deep. So this means dozens of microservices owned by a number of teams.

Our Journey Begins

Okay. So, now that you know who I am and where I come from, let's go on that journey I told you about. Let's go back to the prehistoric days of 2005. I mentioned that Netflix streaming started as an experimental group within Netflix. And this group operated like a start-up within Netflix. Everyone wore multiple hats. Everyone had the freedom to do whatever needed to be done. And operating like a start-up was okay because we were still experimenting. We were still exploring possibilities. We were still trying to get the service off the ground. But then in 2007, something wonderful happened. We released streaming to the public. And the release of streaming, this was the bet on the future of Netflix. And so we felt the need to get out of this isolated start-up mode and integrate with the rest of Netflix and how they did things.

The Specialized Teams Model

And so, in 2007, we adopted the specialized teams model. Now, in this model, the developers, well, they develop and they propose configuration changes. And at some point, they take a bunch of those changes and they just throw them over the wall to test. I was a member of test. We'd spend weeks doing a combination of automated and manual tests. And at some point, we'd decide, "Well, we've hammered on this thing long enough. Let's get it out in the world. Let's release it." And so, I had to open up a change request ticket for our network operations center. And I had to write it out in detail, line by line, exactly all the changes we wanted to make this release happen.

Now, back in this time, back in 2007, Netflix was still operating its own data centers. And so we needed a network operations center, or NOC, to stand guardian over those data centers. The NOC controlled all access to the servers. It controlled all configuration changes. It controlled what got deployed and when. The NOC was staffed 24/7. They had these huge monitors with lots of graphs. They watched out over Netflix, and they waited for alarms to go off. Once a week, they'd take the service down for maintenance, pick up a stack of these change request tickets, and start trying to decipher what I had written, trying to interpret the instructions I'd put there. And invariably, something would go wrong and they'd have to kick the problem back down the chain.

Skipping forward to 2011, at this time, Netflix had mostly moved out of its own data centers, and our migration to the cloud was mostly complete. And so we adopted a new model to go along with this. We call it in retrospect, the hybrid model. In this model, our developers, well, they still develop, of course. But they're also free to make their own deployments into production. And what's more, they're also on call for after-hours production issues. At the same time, we spun up a DevOps team. I was on this team. We were operational specialists. We did deployments. We put out fires during the day. We worked on monitoring. We worked on performance tuning. And, of course, we were also tasked with working on tooling in our spare time, which was never.

Well, since we were no longer in our own data centers, we didn't really need a network operations center. So we evolved this into something we called Core. Core was like an operations team for all of Netflix. If there's a problem with the website or with the streaming service or with the content delivery network, Core would coordinate a response. Get the right people together. Core would see the problem through to the end and then coordinate the follow-up postmortem. Of course, Core was also responsible for working on tooling when time permitted.

Pain Points of These Models

So by now, I'm sure you can see some of the pain points inherent in these models. I'd like to call out a few particularly egregious ones. Maybe you'll recognize or relate to some of these. The first pain point we encountered was lack of context. Our developers and our testers, well, they knew the applications really well, but they didn't know the production systems. Our DevOps and our NOC, they knew the production systems, but they weren't so familiar with all the details of every application. So we spent a lot of time trying to work around this lack of context with a lot of communications. We did things like changelogs, which really just meant dumping a bunch of commit messages into a text file. We did things like handoff meetings: handoff meetings between dev and test, between test and the NOC, or handoff meetings with everyone involved. We invariably would end up with these really long email threads with lots of replies and forwards. And it was impossible to figure out what was going on. It was like trying to unravel a ball of yarn or that box of Christmas lights in your garage. We spent a lot of time trying to gather vital information, you know, trying to piece together the state of the world, trying to get the right people onto a conference call.

The second pain point I'd like to highlight is lengthy troubleshooting and fixing. Our DevOps and our NOC people, well, they knew the production systems. They could describe symptoms, but they didn't know the applications well enough to get at root causes. Conversely, our developers, they could theorize about root causes, but they didn't have access to the production systems in order to test or validate those theories. During production incidents, when minutes mattered, we'd spend a lot of time trying to get people up to speed and on the same page. Up to speed on the latest state of the apps, up to speed on the state of production, up to speed on all the problems that were going on at the time.

A lot of troubleshooting really just turned into conference calls. We did troubleshooting via conference call. Often, I'd be on that call and I'd be like, "Hi, yes, can you try restarting the server? Yes, okay, here's how you restart the server. Did it come up? Here's how you can tell. All right. Go look in the log file and tell me if you see some errors there. Yes, here's where you find the log file. Okay. I'll wait. Yes, whose dog is that? She likes to bark. Oh, you found something, great. Send it to me." And so I'd get an email with maybe 100,000 lines of log messages. And I'd have to look at this and try to make some sense of it. And if I couldn't, I'd forward it off to a developer. And the developer, she might say, "Well, I don't know. Try this or that." And I'd forward that back up the chain. As you can imagine, any production incident at this time had a very high MTTR, a high mean time to resolution.

A third pain point I'd like to highlight is a lossy feedback cycle. Our developers, they really wanted to focus on code. You know, production, all that messy stuff, that was somebody else's problem. Our operations team, well, all we knew how to do was put out fires. We were operational specialists. And so that meant a lot of quick fixes and workarounds. For example, we might add more servers to work around a performance degradation, or we might spend a lot of time constructing a really complex dashboard, trying to infer some missing piece of data. When, in fact, the correct fix would be to go back to the development team and say, "Hey, could you instrument the code to give us that information directly?"

Once, when I was on the DevOps team, we noticed that for one particular application, servers would occasionally start performing really badly, with lots of high latency and lots of CPU burn, or one of them would just crash at random. Well, we were DevOps. We knew how to investigate. So we rolled up our sleeves, dug in, and found there was a lot of garbage collection activity going on. Aha, these looked like the classic symptoms of a memory leak. So we did what you'd expect. We wrote a script to go through and reboot all these servers periodically. Problem solved. We felt good about ourselves and patted ourselves on the back. Of course, we did open a bug against the server. But by now, nothing was on fire, so that got deprioritized. It was months before we got that memory leak fixed. And, in fact, there were developers on the team who weren't even aware that there was a memory leak. This poor feedback cycle led to a lack of urgency. The only people feeling the pain were the operations teams and, of course, our customers.
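
To make that workaround concrete, here's a rough sketch of the kind of periodic rolling-restart script I'm describing. This is not our actual script; the hostnames, the service name, and the `ssh` restart command are all invented for illustration, and the `dry_run` flag keeps the sketch safe to run.

```python
import subprocess
import time

# Hypothetical list of leaky hosts; in reality this would come from a
# server inventory, not a hard-coded list.
HOSTS = ["app-01.example.com", "app-02.example.com", "app-03.example.com"]

def restart_host(host, dry_run=True):
    """Restart the app server on one host. With dry_run=True, just print
    the command that would run instead of actually executing it."""
    cmd = ["ssh", host, "sudo", "service", "appserver", "restart"]
    if dry_run:
        print("would run:", " ".join(cmd))
        return True
    return subprocess.run(cmd).returncode == 0

def rolling_restart(hosts, pause_secs=0, dry_run=True):
    """Restart hosts one at a time, pausing between them, so capacity
    never drops by more than one server -- the band-aid that hid the
    memory leak instead of fixing it."""
    results = {}
    for host in hosts:
        results[host] = restart_host(host, dry_run=dry_run)
        time.sleep(pause_secs)
    return results

if __name__ == "__main__":
    rolling_restart(HOSTS)
```

The point of showing it is how little it knows about the application: it treats the symptom on a timer, which is exactly why the real bug sat unfixed for months.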

Oh, by now, I'm sure you recognize a lot of our problems stemmed from the fact that we were working in silos. Specialized teams can create a real efficiency within one segment but a lot of inefficiency across the cycle. This meant that every change, every deployment, every problem had to be coordinated across those silos. Now, occasionally, we ventured out of our silos, maybe hand off some code or a little piece of information, and then we scurried back inside where it was warm and safe.

One time, there was this team that works on the Netflix streaming client for Xbox. And they were waiting for a new server feature to land in production so they could begin their own testing. So the Xbox team, they did what you'd expect: they went to the NOC and said, "Hey, when's the next release scheduled?" And the NOC looked through their tickets and they said, "Well, we don't know anything about a new release. Why don't you go check with test, see if they have the build?" So the Xbox team comes to me. I'm a member of test. And they say, "Hey, Greg, when's that new server build going to be ready? We're kind of waiting on you to do some of our own testing." Well, this was news to me. I didn't know anything about this new server build. Let's go check with dev. So we go over to the developer area, and we ask around. And somebody says, "Oh, yes, I think that was done. Why don't you check with so-and-so? She worked on it." So we track down so-and-so, and she says, "Oh, yes, I finished that two weeks ago. I kicked it over to test." Okay. This is majorly embarrassing. How come I don't know about this?

So I asked around with my colleagues. And one of them says, "Oh, yes, I finished testing that. It's ready to go." "Then why haven't you released it? Why is it not out there yet?" Well, at this time, back in this era, our servers were much more monolithic, and so we had to test all the changes together before we could release anything. So we were held up waiting for some particular subcomponent to be tested. So I go find the tester for that particular subcomponent, and I say, "Hey, there's a lot of people waiting for this release, have you even started yet?" And this tester, he looks up at me. He looks at me with a mixture of surprise and annoyance. And he says to me, he says, "Dude, we just did a release like two weeks ago. What are we going to do, one like every month or something?"

I wish I could say that was a one-off event or even a rare event, but it was far too common. We spent far too much time living at the worst end of the pain scale, the Matt Murdock level 18 on a scale of 1 to 10. Lack of context, high communications overhead, we literally had to walk from team to team, from cube to cube trying to gather information. Lengthy troubleshooting, can you imagine this level of friction if it were applied to a critical bug or a service outage? We worked in silos. Useful information was stored in silos. Critical context was locked away inside those silos. Something had to change.

In addition to all these pain points, there's another factor that drove us to re-examine how we build and operate our services, and that was growth. You see, back in 2007, Netflix was still a DVD-only company. How many of you out there were DVD subscribers? And how many of you just realized you're not sure if you're still a DVD subscriber? Yes, remember the last time you moved and you threw those red envelopes into a box? Well, on behalf of Netflix, thank you.

Since 2007, Netflix has gone from 0 streaming members to over 137 million today. And along with that, we've grown in headcount, number of teams, number of micro services, and architectural complexity. And so all this growth, it would just amplify the pain points we were feeling. Something definitely had to change. But sometimes, sometimes you have to hit pause. Sometimes you get stuck trying to solve that problem that's right in front of you. Sometimes you become reactive. You keep trying things hoping something will work. You keep going down some rathole or another. We knew that throwing more people at the problem wouldn't work. That'd just lead to bigger silos or more links in the email chain. The pain might get diluted or spread out, but it would still be there. And our customers would still feel it.

First Principles

So around 2014, we went back to first principles. We asked ourselves some hard questions. What are the problems we're trying to solve? What's not working? How can we do better? How can we get back to systems thinking? How can we get back to that place of end-to-end ownership? How can we optimize our feedback loops so our developers feel that pain and then remove it? How can we better understand our internal and external customers' needs so we can address them? How can we get back to that place of continuous experimentation and learning, a place of continuous improvement, a place of reducing the friction in this cycle? Remember when we were operating in start-up mode, when every team member was free to do whatever needed to be done to make fixes and improvements? How can we recapture that sense of ownership, that sense of empowerment? Can we get all that again, but in a way that's sustainable at our scale and complexity?

The Full Cycle Developer

So here we are. In 2018, the Netflix Edge Engineering organization is operating under the full cycle developer model. In this model, full cycle developers are responsible for all areas of the software lifecycle. They apply an engineering discipline to all the areas you see here. They evaluate all problems from a developer perspective. For example, a developer might say, "I've got some great ideas on how to fix your deployment automation. Let me go do that." Or a developer might create a tool to help with support requests. Everything, everything you see in the cycle comes under the purview of our full cycle developers.

It sounds great. But how do you make this work? What do you need, and what does this mean for you? Well, the first thing you need is a mindset shift. You need to shift away from thinking, "I'm a developer. I only design and work on code. Everything else, not my problem." You need to shift away from the idea of, "I want to finish this deployment so I can get back to my real job of writing code." You need to shift into the idea of, "Hey, I've done some deployments. I've got some really good ideas on improving the automation. Let me make that happen." You need to shift into the idea of, "I can't do a code review right now because I'm handling a critical support request. I'll get back to you right after that."

The second thing you need is tooling. You need good tooling. It's key, and you need specialists to consult with. At Netflix, we have centralized teams that develop common tooling infrastructure to solve the problems that every development team has. Our centralized teams also have specialists in areas such as storage or security or AWS infrastructure. You know, for example, we don't expect all our developers to become deep security experts. But we do have central teams that can provide that security expertise and consult with us. We have central teams that can provide tooling that encapsulates security best practices. In this way, our central teams really become force multipliers.

So, let me give you an example. I've talked about the importance of key tooling. It really is important, and so let me give you an example of some of the tooling that Netflix provides, tooling that supports the full cycle developer. It used to be that if you wanted to create a new application, you'd find one you liked, maybe one of your favorites, and you'd copy it. You'd copy all the source code and then you'd start hacking away at it, trying to remove the vestiges of the old application until you were left with just a skeleton.

Creating a new application this way took a long time and a lot of work just to get to a starting point where you could then begin adding new code. And what's more, you were also carrying over outdated technology and outdated practices along with it. And once you got something that would compile, well, you still had to set up Jenkins build jobs and continuous integration workflows, not to mention the dashboards and alerting you would need to run in production. Creating an application like this was a lot of work. It was a lot of toil. To address this pain point, we created NEWT. That's the Netflix Workflow Toolkit. So what you do is you run NEWT. It asks you some questions, and it goes out and creates the world for you. It sets up a repo with a skeleton application, and that skeleton encapsulates the latest technologies and best practices. Right out of the box, your new application is already integrated with the rest of the Netflix ecosystem. NEWT also generates some tests for you. It sets up your Jenkins jobs. It sets up your Spinnaker pipelines. It even goes so far as to create some dashboards and some alerts on your behalf. With NEWT, getting a new application up and off the ground went from being a week-long endeavor to something a developer does in minutes.
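
NEWT is internal to Netflix, so I can't show you the real thing, but the idea is easy to sketch: a scaffolder that writes out a repo skeleton plus CI/CD and alerting stubs in one shot. Every file name below is invented for the illustration.

```python
import pathlib

def scaffold_app(name, root="."):
    """Toy application scaffolder, in the spirit of a tool like NEWT:
    create a source skeleton, a smoke test, and stub config files for
    the build job, delivery pipeline, and alerts. Returns the list of
    files created, relative to the new app directory."""
    base = pathlib.Path(root) / name
    files = {
        "src/main.py": f'print("hello from {name}")\n',
        "tests/test_smoke.py": "def test_smoke():\n    assert True\n",
        "jenkins.yaml": f"job: {name}-build\n",          # CI build job stub
        "spinnaker-pipeline.json": '{"stages": []}\n',   # delivery pipeline stub
        "alerts.yaml": "alerts: []\n",                   # generated alert stub
    }
    for rel, content in files.items():
        path = base / rel
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(content)
    return sorted(str(p.relative_to(base)) for p in base.rglob("*") if p.is_file())
```

The value isn't any one file; it's that a new service starts life already wired into build, deploy, and monitoring conventions instead of inheriting a copied app's leftovers.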

Okay. So you have your new application generated and building, and now you can use Spinnaker for continuous integration, for canaries, for production deployments. Spinnaker is our continuous delivery platform. It's something we're developing in conjunction with Google and an active open source community. Spinnaker also lets our developers do production deployments and manage the production environment. So remember that application we created with NEWT? Say we add some code and maybe some tests, and then we take all those changes and commit them to Git. Bam, just like that, our continuous integration kicks in. Jenkins will pull the code and build it. Spinnaker will take that build, bake an image, and deploy it out to the test environment. You as a developer can then decide when you want to deploy that same build out into production. And just like that, the new application is live in the world. No more gatekeepers, no more wondering where the build is, no more silos.
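
That flow is worth spelling out, because the key property is that the exact same baked image goes to test and, only on an explicit developer decision, to production. Here's a toy model of it; the stage names are illustrative, not real Jenkins or Spinnaker stage types.

```python
def run_pipeline(commit, promote_to_prod=False):
    """Toy model of the commit-to-production flow described above:
    CI builds the commit, an image is baked from that build and deployed
    to the test environment, and the developer decides when to promote
    that same image to production. Returns the ordered list of events."""
    events = []
    build = f"build-{commit[:7]}"    # Jenkins pulls the code and builds it
    events.append(("ci", build))
    image = f"ami-{commit[:7]}"      # Spinnaker bakes an image from the build
    events.append(("bake", image))
    events.append(("deploy", "test", image))
    if promote_to_prod:              # explicit developer decision, no gatekeeper
        events.append(("deploy", "prod", image))
    return events
```

Notice there's no rebuild between environments: promotion reuses the baked image, which is what makes "deploy that same build out into production" safe.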

But wait. We've only covered half the cycle. We've now got an application running out there in the world. We need to take off our developer hat and start thinking like an operator. Maybe you go to Atlas, our open source time series database, and you check on the health of your application. You might even want to aggregate a number of these graphs into a dashboard, but nobody really likes staring at graphs or dashboards. So we've created some tools to help with that. We've created alerts. I mentioned that NEWT generated some alerts for you, so you can automatically be notified when something's wrong with your application. But creating alerts yourself can be very repetitive and error-prone. So we have things like alert templates and alert packs. You, as the operator, as the service owner, decide which sets of alerts apply to your application, and you just adopt those, just like that.
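
The alert-pack idea is simple enough to sketch. This is not the actual Netflix implementation; the pack names, metrics, and thresholds below are all made up to show the shape of it: a named bundle of alert templates that a service owner applies wholesale instead of hand-writing each alert.

```python
# Hypothetical alert packs: named bundles of alert templates.
ALERT_PACKS = {
    "java-service": [
        {"metric": "jvm.gc.pause", "threshold": 500, "unit": "ms"},
        {"metric": "http.5xx.rate", "threshold": 0.01, "unit": "ratio"},
    ],
    "edge-basics": [
        {"metric": "request.latency.p99", "threshold": 250, "unit": "ms"},
    ],
}

def apply_alert_packs(app, pack_names):
    """Expand the chosen packs into concrete alerts scoped to one
    application, instead of writing each alert by hand."""
    alerts = []
    for name in pack_names:
        for template in ALERT_PACKS[name]:
            alerts.append({"app": app, **template})
    return alerts
```

The win is the same as with NEWT: the templates encode someone's hard-won best practices, so every service owner who adopts a pack gets them for free.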

So these are just a few of the examples of the tools that Netflix provides, tools that support the full cycle developer, tools that are offered by our centralized teams as part of what we call the Paved Road. So the Paved Road is a set of well-integrated and supported tools that simplify the workflows used by our development teams. Now an individual team can decide to go off the Paved Road. For example, an Edge team might decide they want to explore some new technology or try out a new tool. But staying on the Paved Road means you get something, a set of tools that's proven, it's well-integrated and it's growing. And we frequently work with the central tools teams. We work with them daily even, to add to and expand the set of tools. If I might extend this metaphor a little, our central teams are constantly laying down new pavement, growing this tool set to meet our changing needs.

Staffing

So now that we've gone into tooling a bit, let's get back to what's needed to support the full cycle developer model. In addition to a mindset shift and tooling, the third thing you need is staffing. I'd like to take a moment to speak to the leadership out there in the audience, if I may. I want to be clear that this model, this is not a cost savings measure. Let me say that again. This is not a cost savings measure. You can't take the work of multiple teams and just dump it on your existing development team. That won't work. That's just a recipe for poor quality. It's a recipe for availability issues. It's definitely a recipe for developer burnout. When we were first starting out, we could get away with a lot of heroics to make this work. But that doesn't really scale. It just leads to always being one small step away from an outage. It leads to always living in a state of frustration and anxiety. You must invest in adequate staffing.

Training

The next thing you need is training. Some developers, well, they just aren't used to being hands-on with production. You know, maybe they came from a shop where they weren't even allowed to touch or see production. So we invest in training to bulk up our developers' production muscles. Training needs to be ongoing. It's not just a one-time thing. Training needs dedicated focus. This isn't just reading a manual in your spare time or playing with some new technology on the weekend. For example, at Netflix, we do a lot of shadowing. I often get new developers saying, "Hey, Greg, can you sit with me while I make this production change? Just look over my shoulder and shout out if I'm about to do anything bad." We do bootcamps. We do ongoing training to keep our developers up to speed on the latest technology. You have to invest in training. So if I might just speak to leadership again: support your development teams by investing in training.

Commitment and Prioritization

Fifth and finally, you need commitment and prioritization. Commitment and prioritization means the whole team, and indeed your organization, buys into the model. It means investment. Again, you must prioritize all aspects of the lifecycle. Or to put that another way, all areas of the software lifecycle are first-class deliverables. For a long time, we tried to get away with this idea of, "Hey, developers, you keep developing full speed ahead. Whoever's on call, you do everything else. You do the deployments. You put out the fires. You answer things in the support chat room. You respond to support emails. You work on the monitoring, the alerting, the automation." Well, is it any wonder that most of that stuff never got done? All we accomplished was to accumulate this really big backlog of operational work and technical debt. What it took to make forward progress was having managers who prioritized operations and support and automation alongside bug fixes and new features. So these are some of the keys that you need to support the full cycle developer model. Switching into this model wasn't something we did lightly or overnight. It took a lot of time and thought.

Trade-offs

I'd like to take a few moments to address some of the trade-offs of this model. The first trade-off is, this is not for everyone. Even at Netflix, not all teams employ this model. There are teams outside of Edge Engineering that don't use it. It really depends on what's the right fit for your team. It also depends on whether your organization can support this model. Can you make the investments without trying to cut corners? And some developers, well, they just want to develop. Maybe they want to go deep in some particular technology or become a specialist in some area. Or maybe some developers, maybe they've done the full cycle thing and they're tired of carrying the pager. We all know that operations work can be exhausting, particularly after a series of late-night pages. My wife used to joke that Netflix should really buy us a new couch, because I spent a lot of late nights there trying to fix production or recover from a bad deployment. She called that couch my second Netflix office. At least, I think she was joking. And I'd be remiss if I didn't say change is scary.

Switching models takes a lot of courage and honesty. You might look at this model and you might say, "Well, what happens to our existing test team or what about our operations team? What's going to happen to those guys?" But change can also be a time of opportunity. For example, when the Edge team switched models, I'm happy to say we didn't let anybody go as a result. Our testers, they moved into other teams within Netflix and they applied their testing passion there. And it was great because they understood Edge and they understood where we were coming from. And so we could form these great partnerships that way. When we disbanded the DevOps team, a lot of those engineers went over to the Core team. And, in fact, one particular DevOps engineer, he decided to follow his passion and his schooling and he became one of our data scientists in Netflix. I myself have evolved along from tester to DevOps, to SRE, and now I work as part of the developer productivity group. This change has kept things interesting and exciting for me.

Another trade-off of this model is an increase in breadth. There'll be an increase in cognitive load. You have to shift away from only focusing on code to focusing on all areas of the software lifecycle. There's also the risk of getting interrupted too often. Not to mention, you'll need to balance more priorities. At Netflix, we mitigate these interrupts by having on-call schedules. So this week, you might be neck-deep in code. Next week, you might be doing deployments and production work. The week after, you might be fielding support requests. So in this way, we let each developer focus on one thing at a time, and yet keep up to speed by going through these rotations. On the flip side, this can be really liberating. There can be a lot of satisfaction in understanding the whole end-to-end picture. Our siloed models often left our developers feeling like they were constrained and in the dark.

This model, it can be empowering. It can be really frustrating when you know what's best for your code, when you know how to operate your service most efficiently, when you know exactly what's wrong with production and how to fix it, but you lack the power to act. The full cycle model gets back to that place, back to that feeling of being in start-up mode, back to a place where every team member has a sense of control, a sense of freedom, where every team member feels like they're having a direct impact on our customers and, ultimately, a direct impact on the success of the company.

Improving on This Model

But we know we're not perfect. We know there's still work to be done. We know we need to keep asking ourselves some hard questions. You know, what are the new pain points? What are the new sources of friction? How can we do better? Well, we have some immediate ideas for improving on this model. The first one is tooling. I've said it before, and I'll say it again: tooling is key. We need tooling that's easier to use. Our developers, they don't have the bandwidth to become experts on some really complex tooling. We need tooling that's opinionated, tooling that encapsulates best practices, tooling that reduces risk, eliminates toil, and lowers cognitive overhead.

We also need metrics. We need metrics to keep us informed, metrics to make sure we're addressing all aspects of the software lifecycle equally. We need metrics to measure ourselves. How are we doing? Metrics to measure our productivity, our complexity, our team health, metrics that will tell us where to invest. I don't want to keep harping on it, but I'll say it again, tooling is key. We need to keep improving that tooling. For example, a tool really shouldn't just dump a bunch of graphs onto an operator. That tool should provide context and correlation. Maybe it can correlate this outage to a specific change so you can roll that back. Maybe it can initiate a rollback of a bad deployment. We've also been working closely with the Spinnaker and Kayenta teams to improve our canary analysis. Our aim is really to remove human judgment from the deployment process. Let these tools apply rigorous statistical tests to determine if the build is good or not. So we need the tooling. We need the metrics to measure how we're doing. We need the metrics to measure ourselves. We need the metrics to tell us where to invest.
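
To give a feel for what "let the tools apply statistical tests" means, here's a deliberately tiny canary gate. This is not Kayenta's algorithm (Kayenta uses more robust nonparametric comparisons); it's a toy z-score check on a metric like error rate, shown only to illustrate replacing a human eyeballing two graphs with an automatic pass/fail decision.

```python
from math import sqrt
from statistics import mean, stdev

def canary_passes(baseline, canary, max_z=3.0):
    """Toy canary gate: compare the canary's metric samples against the
    baseline's with a simple z-score on the difference of means.
    Returns True if the canary is not significantly worse (higher)."""
    if len(baseline) < 2 or len(canary) < 2:
        return False  # not enough data: fail closed
    # Welch-style standard error of the difference in means.
    se = sqrt(stdev(baseline) ** 2 / len(baseline) +
              stdev(canary) ** 2 / len(canary))
    if se == 0:
        return mean(canary) <= mean(baseline)
    z = (mean(canary) - mean(baseline)) / se
    return z <= max_z
```

A real deployment pipeline would run a check like this across many metrics and sample windows, and roll back automatically on failure, but the principle is the same: the judgment call becomes a test.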

So this is the full cycle developer model. This is how Netflix builds and operates its key services in 2018. Our evolution wasn't always easy. We made a lot of mistakes over the years. It took a lot of courage and some really hard questions and a desire for continuous improvement.

As I was thinking back over this model and its evolution, a particular incident came to mind. And as you know, Netflix is always hiring. And that means a steady influx of new developers. So, earlier this year, I went to a team happy hour, and I sought out one of these new developers. And I said, "Hey, welcome to Netflix. How's it going?" And he got very excited. He got very animated. He says, "Oh, I tell you, today is such a good day. Today, I did my first deployment to production all by myself. Today, I have my first feature live in production. I created a dashboard. I'm watching adoption. I'm watching for errors. I already have some ideas about bug fixes and for improvements. I'll roll those out next week. I tell you, I've been at Netflix for three weeks, and I already have my first feature live. I still have code changes at my previous company that are waiting to get picked up. To be able to do this by myself, this is such an exciting place to be." Thank you.

Woman 1: Awesome. Thank you, Greg. We totally have time for some questions. I see two. Okay.

Man 1: Hi, Greg. First of all, thank you for your talk. It's really great.

Burrell: Thank you.

Man 1: And question, one of the downsides of working in operations is where sometimes you have to work on weekends. So does it mean now that everyone at Netflix sometimes has to work on weekends?

Burrell: Yes, that's exactly what it means. Our developers, they're responsible for their applications. They own those applications. And that means they have rotations for being on call. So whoever's on call that weekend, that developer might be the person to answer the alerts and to respond to the pages. They get a lot of support from their team members. They can get a lot of support from the tooling. But ultimately, it's an individual developer who will go in and fix those problems. You'd be surprised how many Netflix outages and problems I've fixed from my couch in my pajamas on the weekends at 2:00 a.m. with maybe my cat beside me. And you think, "Well, some companies, they have these huge rooms that look like NASA mission control and they've got all kinds of graphs and people and it's really hard to get in." But at Netflix, it might just be fixed by a developer in their pajamas on the couch. It's a trade-off. Yes, you do have to work some weekends, but you have a real sense of power and control as well.

Woman 2: Could you define some weekends?

Burrell: Well, our teams, as I mentioned, our teams are staffed adequately. So your time in the rotation may not come around for a while.

Woman 2: Okay. So it is a rotation. It's not every weekend.

Burrell: Yes, exactly.

Woman 3: Hi. How do you decide what tooling to build to support your developers?

Burrell: That's a great question. Thanks. The tooling you build is the tooling your developers need. For a long time in Netflix, our central tools, they were developed by these tool teams that were developing the tools they wanted to build. And so we've turned that around. We give direct input from the developers, the feature sets, the use cases, those come from the bottom up. There's no sense in developing some tool that, you know, "Hey, it's got this great feature. It's so exciting," but no one's going to use it. What's the point of that? So our developers are really involved from the ground up. If a tool team wants to create a tool, they'll come out and ask us, or if developers say, "Hey, we need a new tool. Let's build it ourselves. Let's make a prototype, an MVP.” And then once we demonstrate how useful this is, some central team will adopt it and support it and manage it from there. So it really has to come from the bottom up. Build those tools your developers need, not the tools that are fun or interesting or exciting.

Man 2: Hi, great talk.

Burrell: Thank you.

Man 2: Is there any kind of criteria or coordination or mutual agreement to push stuff to production? Is it just up to the developer?

Burrell: That's a good question. There is a little bit of coordination. For example, we often enact like a quiet period around Christmas. We don't want people pushing on Christmas Day. We sometimes enact a quiet period say around Thanksgiving, Black Friday, maybe the Oscars night, something like that, Super Bowl because as soon as that game goes off, people turn on Netflix. So we don't want add or to strain those times. If we have a really big, really complex deployment, we might coordinate with our up and downstream teams just to give them a heads up and make sure we're not going to step on each other's toes. But for the most part, no, each team, they know their application best. They know what to do. They know how to recover quickly. They know what signs to look for if something's gone wrong. It's up to them to do it. Now that said, we don't just, you know, “hey, build's down, let's deploy it out to 1,000 servers." We do things like canaries. We do things like small rollouts, by regions maybe. Trying to limit the blast radius instead of just going all in. So there's very little YOLO kind of pushes at Netflix. We're a little more methodical and careful about that.

Man 3: Hi. Thank you for your talk. I have a couple of questions. So first is, do you still have a level one dedicated support team or not? And second, is the team that is responsible for tooling, do they also operate in a full cycle developer mode, they have support for the tools or something like that?

Burrell: Yes. That's a good question. Netflix does still have a Core team. But often, if there's a problem with your application, you as the application owner, you should be the first person to know. You should have crafted your alerts in such a way that you get notified first. Yes, the Core team is still there. If you need a wider array of teams to be involved, they will coordinate that response. Maybe you run a conference call. If there's a problem between, say, streaming and the content delivery network, they'll help coordinate that. But in general, you are the person getting the first page. You as the developer and the service owner, you get the first page. You jump on your laptop immediately. You are free to diagnose and do whatever you need to fix the problem as fast as possible. You don't have to wait for approval. You don't have to get your manager to sign off. You don't have to ask for permission from Core or anybody. You just make it happen.

Oh, and tools teams. Yes, thank you. As I mentioned, not all teams within Netflix employ this full cycle developer model. Some of the tool teams do. In their case, the support work might be a little lighter; it depends on how critical that tool is. For example, I mentioned Spinnaker. That's almost a tier-zero tool because we really rely on it for managing our production environment and managing our deployments. So the Spinnaker team, yes, they have a rotation. All the developers are on call. They will respond to problems at any time of day. Other teams, if the tool is a little less critical, maybe more of a tier-two or tier-three thing, might have an agreement: "Yes, we'll respond to support requests during business hours. If it's a real emergency, yes, you can always page us. But please try not to." So it really depends on the team, what's the right fit for them, and the level of criticality of that tool.

Woman 4: Thank you. It was a beautiful talk and a beautiful presentation.

Burrell: Thank you.

Woman 4: You mentioned that it's not for all the teams in the trade-offs.

Burrell: That's correct.

Woman 4: Would you mind giving an example of what teams is it not suitable for?

Burrell: Oh, sure. To give you one example, you know, for example, the teams that develop, say, our client applications. And we've got a lot of UI and front-end teams. They're not really used to operating services and things like that. So those teams, they might have a dedicated operations team. The content delivery network, they also have a dedicated operations team because that really requires some specialists to do a lot of that global networking stuff.

Woman 5: Hi.

Burrell: Hi.

Woman 5: It sounds like things were very messy in the stage where there were the developers and the core team split out. Was that as bad as it sounded? It seems like that perhaps would have been rolled back versus continuing on for a long time.

Burrell: When streaming started, Netflix was already a DVD company for a long time. It's pretty surprising how long Netflix has actually been around and most of that was as a DVD company. So they already had a lot of structure and processes in place. And so we kind of tried to fit into that existing structure and processes. A lot of the problems we were seeing was a result of that poor fit there. Also lot of the existing issues they had… You mail a DVD, there's a very high latency in that. They weren't so used to dealing with instant problems that come with streaming. Some of those things really surfaced in the pain points that I've described.

Man 4: How do you opt in teams to new improvements to generators? So once the generator is run and the code exists, how do teams take advantage of improvements as the tooling team continues to add new features?

Burrell: Okay. Let me give you a couple of responses to that. The first is that the ability to regenerate a generator project, that's something that's still in development. The second is it's really just up to the team to decide. They know their code best. One team member may say, "You know what? We need to regenerate this. I'll take it on myself and do that work on behalf of the team." We give our teams a lot of freedom and control to do their…and manage their own applications.

Man 5: So how big is a team typically that has this type of cycle? And is there any room for depth on that team as well as the breadth?

Burrell: Yes. That's a good question. These teams are generally anywhere from, say four to a dozen people. And so within that team, there may be areas of interest. Maybe one of that team member comes from a security background. So he might be the security expert in the team. It doesn't mean he does all the security work, but it means maybe he has a little more expertise in the area. He can help his colleagues get up to speed. One of the questions we have is, would this model really work if you had a huge team? Is it possible or would you need to really break into smaller teams? We don't really know if that's… That's some of the ongoing questions. This is the current model we have in 2018, but we need to keep asking ourselves, "Is this still working? Will it still work? Will it still work as we grow? What can we learn from other companies and the way they do things?"

Woman 6: Hi. So I'm just curious, in the situation when you have one or several developers who own the lifecycle of an entire service, how do you reduce the risk of knowledge islands, or having a low bus count for that particular service, if that person leaves or goes off sick when that service goes down?

Burrell: That's a good question. In general, we try not to have teams of one or two people for exactly that reason. Too much information becomes compartmentalize. That's effectively their own silo again, right? We don't generally go teams that small. Instead, if we had a team that's small, it might just get absorbed by another team that does very similar stuff, maybe the team that does the up and downstream services from their own service.

Woman 1: More questions? Yes.

Man 6: Thank you for your talk. It was a nice talk. So my question is that you mentioned you are full cycle developers, so no testers, no DevOps, all as full cycle developers. So what confidence does a developer have to check in his code, without being tested properly?

Burrell: It doesn't mean we don't do any testing. Yes. Okay. It means that the developers are responsible for doing the testing. So what if the developer says, "Uh, I don't really feel like doing testing. I'm just going to check it in and go," right? We try and hire mature, proven developers who have a track record of making good decisions. So, yes, you might try and do that, but you're probably not going to be there very long. We give you freedom. Our mantra is really freedom and responsibility. That doesn't mean freedom from responsibility. So you can do that if you want, but you're responsible for the outcome.

Man 7: A great presentation. Greg, how do you prioritize your work?

Burrell: How do we prioritize our work?

Man 7: Yes.

Burrell: That's a good question. So every team, we do have managers. The managers don't come in and say, "Here's the agenda of what we need to do." The managers are really there to set context. We have partner teams. We have upstream and downstream. We have client teams that are depending on us. These are their needs. And then the team can organize itself around that work. They can divvy it up. They can decide what the priorities are. They can decide when something is a little more urgent than something else. But we really believe in context, not control. So your manager will not say, "These are the agenda items for this week." Instead, they'll say, "These are all the needs. These are what our partners need. So let's figure out how we can address this.

Woman 7: So some kind of follow-up question for this. So I suppose that some of the things you are doing are driven by business rather than clients, like by product management or whatever it's called in your organization. So do you engage with product management to decide also on testing and reliability priorities, or it's like you get input from your business partners and then you prioritize within a team? And how do you push off features that are overflowing the team?

Burrell: That's a good question. How do we juggle all these priorities? If we have client teams that depend on us or upstream services that need something, we really try and get ahead of these problems. It's kind of up to a client team to let us know ahead of time so that we can schedule that work. We don't do well with last minutes, "Oh, by the way, we need this new feature by Friday." But if that has to happen, it will. But generally, we try and work with our partner teams. We're there to make their lives easier. They're there to help us, and so it's really to both to our advantage to get way ahead of these problems and start that scheduling early. What do we do if we have competing requests from multiple teams? How do we juggle that? How do we say, "Sorry, I can't help you because I'm helping this other team?" Really, it comes back to- it would be tempting for managers to say, "Oh, yes, we'll take on all project. We'll take on all features. Yes, my team can do everything. They're wonderful." And then overload their team. No, it's really up to a manager, up to the partners who have that relationship to set that context to say, "Yes, we've already taken on these projects. We're not going to overload ourselves," because if we overload ourselves with asks, then nobody benefits. We're just going to deliver poor quality results. We're going to burn out our people. And what's really the point of that? We're looking at a longer game there.

Recorded at:

Dec 11, 2018
