Dianne Marsh on Engineering Velocity at Netflix


1. I’m Barry Burd, professor of computer science and mathematics at Drew University in Madison, New Jersey. I’m here at QCon New York speaking with Dianne Marsh, she is director of engineering at Netflix and she gave a keynote talk during this conference, Engineering Velocity: Shifting the Curve at Netflix. Dianne, can you help us understand what you mean by engineering velocity?

Sure, that’s a great question. A lot of times when people talk about velocity they are talking about the agile concept of measuring how much our teams do in order to be able to estimate how much we can get done in the future; that’s not what I am talking about. What I am talking about when I say engineering velocity is the action of actually engineering velocity into our systems, so bringing in concepts and tools and practices and culture so that we can actually have an environment that moves very quickly. We’ll reduce the wait time on our tools, we’ll reduce overhead in the way that we interact with people, so that we can have a system that truly moves as fast as it can, understanding that our developers are already responsible for making themselves move at an appropriate rate.

Barry: By velocity you mean roughly the same thing that the agile camp means, but your idea is not so much to measure it and live with it as to actually engineer it so that it moves ahead faster.

Yes, it’s a continuing evolution. We are not trying to come up with a number and stick with that, we are actually trying to continually improve how we can build our systems so that we can make changes that contribute to that velocity throughout the entire process.

Barry: What are some of the good things that you can do to engineer high velocity and what are some of the bad things that we can do to inadvertently, if you will, engineer low velocity?

You can put practices in place that slow people down. We spend a lot of time hiring really great software developers to build tools and to build systems, and then a lot of times companies throw processes in their way and slow them down. Let me give you an example, a lot of process around “Write this entire detailed document, describe what you intend to do, describe what you think your findings will be, make a really elaborate plan around how you are going to build this system”. We can actually treat that in a different way, we can experiment and we can actually build the system sometimes faster than we can talk about what we’re building, we can head down parallel paths with different developers and not worry about the fact that we are duplicating effort, sort of let that evolve. We can learn a lot from all the practices and processes that we have in engineering software and build that back into our systems.

Barry: I take it one of the negative things is over-managing your developers.

Yes. We hire these people because they are really bright and then sometimes managers say “this is how I want you to do this” and we impose order and process and implementation details on the developers when in fact they are closer to the implementation process than we are, there are more of them than us, they have these really great insights and if we tell them how to do things what we are doing is squashing that creativity and we are building the systems that we envision rather than letting the other people, the engineers, come up with these ideas independently. And there is just a lot more scale that we can get from letting them come up with ideas.

Barry: What do you do about the team member that is going off on a limb spinning his or her wheels, and you can see that it’s happening, due to no fault of their own? Or about the team member who is not playing with the team, not playing nice with the team? Certainly, you have to do some managing, some telling them what to do; I think what you are arguing for is not doing too much of that when things are going at least fairly well.

Actually, I am arguing for not doing that at all, I’m arguing for providing our developers with context, about what we are doing, about what the benefit is of what we are trying to build and let them come up with the ideas about how we are going to build it. So, by providing context, let’s say for an example, I lead the engineering tools team at Netflix and so we are building tools that the rest of the organization uses for building, baking and deploying their systems, so I go out and talk to lots of different teams and find out what they need, I bring that context back to my team and talk to my team about what the rest of the company needs; they are building contexts as well, but it’s really my responsibility to bring that context back to them. They give me context about what’s possible and what they are working on and I give that back into the organization, so my job is really a facilitator of information, I’m spreading and gathering context.

   

2. Do you have specific guidelines that you follow then in order to engineer this velocity?

No, we actually don’t have a lot of rules or processes at Netflix. What we do is we all use imagination and creativity to be able to build systems for the entire organization in whatever way that we think will actually suit the organization. And I gave an example in my keynote about how a team had gone off and created a solution for a predictive auto-scaling engine. We already had a reactive auto-scaling engine that came to us from Amazon’s auto-scaling services and that was working really well, but the team that came up with this idea said “I think we can do better”, and I think when we are talking about engineering velocity, we are constantly approaching the idea that we can do better. And so they said “we can do better, we have a very predictable peak of what our customers do during the day and during the week and we can anticipate where we are going to need to scale up our services”.

And so, we have two choices here: we can over-provision our services all the time so we don’t suffer that scale-up time too late, or we can come up with this predictive opportunity, this predictive auto-scaling engine that allows us to decide, based on the statistical modeling that we can gather from the way the systems work, to pre-scale those systems so that they are ready ahead of the curve rather than behind the curve when we have already needed it. So that’s an example of where creativity inside the organization provided the solution that the managers wouldn’t have probably thought about. So that’s what I am talking about, continuing to push that envelope and say yes, we have a solution for this, but I think we can do better.
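Editor's illustration: the pre-scaling idea described in this answer can be sketched in a few lines. This is a minimal sketch under stated assumptions, not Netflix's engine. The function names, the hour-of-week averaging model, the `headroom` margin, and the per-instance capacity figure are all hypothetical, chosen only to show the difference between reacting to load and provisioning ahead of a predictable peak.

```python
import math
from statistics import mean


def predict_load(history, slot):
    """Predict load for a time slot as the average of that slot in past weeks.

    `history` is a list of weeks; each week is a list of observed loads,
    one per slot (hypothetical data layout).
    """
    return mean(week[slot] for week in history)


def instances_needed(load, capacity_per_instance, headroom=0.2):
    """Instances required to serve `load` with a safety margin (assumed 20%)."""
    return math.ceil(load * (1 + headroom) / capacity_per_instance)


def prescale_plan(history, current_slot, lead_slots, capacity_per_instance):
    """Instance count to provision now for the slot `lead_slots` ahead.

    Looking ahead by the scale-up lead time is what makes this predictive
    rather than reactive: capacity is ready before the peak arrives.
    """
    slots_per_week = len(history[0])
    target_slot = (current_slot + lead_slots) % slots_per_week
    return instances_needed(predict_load(history, target_slot),
                            capacity_per_instance)


# Toy data: two past "weeks" of three slots each (requests per second).
history = [[100, 200, 400], [120, 220, 380]]
plan = prescale_plan(history, current_slot=1, lead_slots=1,
                     capacity_per_instance=50)
print(plan)
```

A reactive scaler would only add capacity once the load in slot 2 had already arrived; the sketch instead sizes the fleet from the historical pattern one slot early.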

   

3. Right. I guess what I was asking earlier, and let me clarify, is do you have specific guidelines that you as a manager follow?

I, as a manager, get out of the way of my developers, most of the time. We talk a lot about ideas that they have and we talk a lot about what that means in the broader organization. Every team at Netflix is asked to consider “Is this the best solution for the company?” Our developers have a lot of freedom around the decisions that they make, but with that freedom comes responsibility to make that a great choice for the company, for the team and for themselves. And so, in examining that freedom and responsibility, our teams are able to make really good choices and I really push the idea of freedom and responsibility and context not control on my team.

   

4. How important is the human factor? To what extent can best practices overcome any staffing limitations that you have?

That’s a really good question. The human factor, the culture that we bring into the company is absolutely critical. We could prematurely optimize our solutions and that would have a huge impact not only on the fact that we don’t get very creative responses today, but also that people may not be encouraged to do that in the future.

   

5. Before we go on, can you clarify what you mean by “prematurely optimize”?

Sure. By “prematurely optimize” we could do a lot of different things, we could say “This product that you want to build is already being built in a very similar way someplace else in the organization, don’t build it yourself, even if your needs are a little bit different, even if the deadline, the timeframe by which that other team will be able to deliver doesn’t meet your needs, you could say that’s that team’s responsibility”. We don’t really worry about team responsibilities at Netflix, what we worry about is serving the organization well, sometimes what that means is we have two teams working on the same problem but from a different angle because they have slightly different needs, so we don’t prematurely say “This is an engineering tools team responsibility, this is a platform team responsibility, this is an edge services team responsibility”, we work that out individually on a case by case basis, but we also don’t really pre-decide about who should work on things.

Barry: I think that one of the things I am hearing is that you don’t have to compromise velocity and quality.

Velocity and quality can be at opposite ends of the graph. What we know from building systems over years and years in the industry is that you can have a highly available system if you build it and you never make changes to it, because you spend all of your time perfecting that system and it is just rock solid, or you can have lots and lots of changes introduced in your system, but at great risk for reliability, because you are introducing lots and lots of changes and that introduces a lot of uncertainty into the system and you maybe haven’t tested it right. What we are doing as an organization, what we are being asked to do in the organization that I am a part of, operations and engineering inside of Netflix, is figure out how to shift that curve: how can we actually not impact reliability as we increase that rate of change? Because if you look at the graph, it levels out; it’s a fairly predictable pattern. But let’s say we want to go from three nines to four nines: can we do that and still maintain this high rate of change that we are accustomed to introducing to our customers? Can we deliver them the features and the new things that we would like to deliver to them without impacting reliability?

And so, what we are charged with is figuring out how to do that, and I think the two key things to doing that are the cultural changes that I described, where the developers have this freedom and responsibility, where they are building things because they have the context of what needs to be built, and engineering tools that take the burden off the developers to be intimately involved with little details that they shouldn’t have to focus on. What I want the developers at Netflix to worry about, outside of my team, is the problems that they are trying to solve, whether that’s how to communicate with devices or how to make a really great personalization algorithm or all the multitude of things that developers at Netflix worry about, and let my team worry about the tooling and take the tooling out of the other developers’ consciousness, so they can treat it as a reliable tool just how they treat their IDE; they don’t spend hours and hours and hours every day figuring out how to configure their IDE and make it just right. I want our tooling to manage their build, their bake, their deployment, so that we are picking up a lot of that load for the rest of the engineering teams and so that we can actually deliver features at a very fast rate without impacting reliability.
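Editor's illustration: the move from three nines to four nines mentioned in this answer can be made concrete with a little arithmetic. The sketch below assumes the usual reading of "N nines" as an availability of 1 − 10^(−N) measured over a 365-day year; the function name is hypothetical, and no actual Netflix availability figures are implied.

```python
def allowed_downtime_minutes(nines, period_days=365):
    """Downtime budget, in minutes, for an availability of `nines` nines.

    Three nines means 99.9% availability, four nines means 99.99%, and so on.
    """
    unavailability = 10 ** (-nines)
    return period_days * 24 * 60 * unavailability


three = allowed_downtime_minutes(3)  # about 525.6 minutes, roughly 8.76 hours
four = allowed_downtime_minutes(4)   # about 52.56 minutes
print(f"three nines: {three:.1f} min/year, four nines: {four:.2f} min/year")
```

Each additional nine cuts the yearly downtime budget by a factor of ten, which is why holding the same rate of change while adding a nine means shifting the curve rather than moving along it.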

   

6. Now I am going to ask you a leading question, and I know what the answer is, but it’s a good wrap up question. Do people like working at Netflix?

I can say I love working at Netflix, I joined the company just over a year ago and previously I had owned my own company, so people say how can you go from owning your own company to working for somebody else, and I have to say I was lured there by the idea of the impact that Netflix has and the impact I can have inside this organization both on the things that we're developing, but also on the open source community, but I’ll stay because of the people. It’s an amazing culture to work in, the people that work at Netflix are of an amazing caliber. And you hear that about every company, but it’s really truly the case, we have a culture where we try really hard to raise the bar with every hire that we make, we are constantly vigilant about the culture of the company, because if we let our guard down about bringing in people that don’t value this freedom and responsibility culture, that will just wear away at the culture and eventually will erode it and people like me that joined the company because that culture is there understand the importance and the value of that and so we are not letting those people in the door and it’s just an amazing experience to actually have that happen.

Barry: Dianne, thank you so much for being here today.

Thank you.

Oct 11, 2014
