
Roy Rapoport on Freedom to Decide and Open Sourcing at Netflix

Interview with Roy Rapoport by Manuel Pais on Jun 20, 2015
18:34

Bio: Roy Rapoport manages the Insight Engineering group at Netflix, responsible for building Netflix's Operational Insight platforms, including cloud telemetry, alerting, and real-time analytics. He originally joined Netflix as part of its datacenter-based IT/Ops group, and prior to transferring over to Product Engineering, was managing Service Delivery for IT/Ops.

   

1. I'm Manuel Pais and I'm here at QCon London 2015 with Roy Rapoport, Insight Engineering Manager at Netflix. Thanks for accepting our invitation. Can you explain to our readers what an Insight Engineering Manager does exactly?

An Insight Engineering Manager manages a group of engineers who are the Insight Engineering Group at Netflix. So Insight Engineering at Netflix is a 10-person group right now, software developers who build real-time operational insight systems at Netflix.

   

2. I wonder what it's like to work at Netflix with the amount of attention you get these days. Specifically when you release something to the public, open source, do you feel like you're under some kind of scrutiny or how does that feel?

Yeah. We're pretty aware of the fact that when we release something and we make some sort of repo in GitHub public, chances are people are going to look at it and I think there's some degree of pressure to make sure that when we release something, it's good at least from a code perspective.

Internally, we've actually talked about open sourcing as a forcing function for doing code hygiene and for writing better code because frankly developers are sometimes highly motivated by trying to avoid shame.

   

3. Your talk here at QCon focused on deciding between buying and building non-core products. Can you give us a brief rundown of the decision process that you use at Netflix? In particular who is involved in that decision process and do you have some kind of checklist to decide? Or how does it work?

I think the kindest way to describe it is highly organic. The people who make the decision are typically the people on the ground who need to figure out how to get something done. Just like every other decision at Netflix, generally speaking, we try to have the decision happen at the lowest possible level in the company.

So for example, three and a half years ago, when we decided to outsource notification for alerting, that was a decision that I, at the time just an individual contributor, made. I discussed it with my director who actually wasn’t entirely supportive of this but didn’t get in the way. So quite often it happens in the lowest possible way.

   

4. Would you like to, or dare to, share with our readers maybe some case where you didn’t make a good decision, i.e. a posteriori you realized it wasn’t a good decision [either] to buy or to build? Do you think that could have been avoided somehow or is it just an organic process again to learn and maybe the value is in that process?

It's very tempting to engage in hindsight bias and everything that doesn’t work out you look at it as “well, we should have known better at the time”. It's also very tempting to feel better about yourself by saying “we've never made any mistakes, even when we've changed our mind later”.

I think it's more complex than that. We've certainly made decisions sometimes to, for example, use another company's product and found that, especially at our scale, we've had cases where the external product couldn’t keep up with us and at some point we stopped and we started building our own product.

I would say even in those cases, in my experience, we got some benefit from seeing a polished product even if it couldn’t keep up with us as a way to inspire our own internal efforts. Perhaps it's the overly optimistic part of me but I can't point to anything that feels like an obvious mistake, though certainly in many cases we've made a decision that later turned out to be no longer the right decision for us.

How's that as a potentially political answer?

   

5. That makes sense. And a similar question but now for an example of a decision that turned out really well in the long run and maybe you weren’t really sure when you took the decision how it was going to work out?

I don’t want to promote vendors so much but I'll make an exception here. Three and a half years ago when we were building our next generation alerting platform, we were clear that we wanted to keep telemetry in-house and maybe even alerting in-house. But when we looked at notification, we looked at PagerDuty as a notification-as-a-service product. And when I talked to my manager about it, this was actually the decision about which we had some disagreement because he felt like, "Oh, we can probably just build it ourselves. How hard can it be?"

I argued that building a bad notification product is relatively easy and building a great one is not. And we ended up using PagerDuty. That was three and a half years ago and that relationship is still going strong. I'm nowhere near deciding that we should build our own product for that.

   

6. In general, how much leeway is there for experimentation? And do you have some kind of learning budget at Netflix for when you need to experiment and maybe see if a vendor product would be the best option or you might just want to build a small prototype to see if you could do it yourselves?

I think there's tremendous space for experimentation. It's not just something that's allowed but I would actually argue it's required. We don’t have a budget for this because I think budget essentially is some sort of arbitrary decision making from the top as to how much should be done.

Rather we try to hire very smart engineers and let them figure out, for any given decision they need to make, is this something that I can sort of follow the beaten path and implement or is this something I need to experiment in. If I need to experiment, how do I minimize the cost of that experimentation so I know as quickly as possible whether or not this thing is going to work for us.

   

7. I'd like to ask about one particular example that you gave during your talk: your monitoring system. How was that process to decide to build it since it's rather complex and what were the factors that led you to decide to build and invest in that?

I think it might depend on who you ask. I was having a lot of conversations with our director at the time who felt strongly that we should build our own product. I would say that the reasoning that I heard was a certain degree of distrust in commercial products out there and other groups within Netflix who might have provided this product. So from his perspective, we had a bunch of really smart developers. If we are willing to invest the developer time to build it, clearly we could come up with something that's better for us.

Having said that, I will tell you that I thought we should use an open source product and got permission to essentially spend about maybe a month, six weeks, to investigate options in that space. And I actually came from that process believing we should actually build our own because I didn’t see anything that was going to be a great fit for us.

So at the time I think we made a decision that felt like a relatively easy decision maybe. We thought it was going to be a relatively smaller project than it ended up being. We might have engaged in more discussions if we knew how long it would take us to actually deliver this, but I don’t think the outcome would have been different. I think in the end it was the right decision for us.

   

8. A related question: with systems as large as Netflix and that keep growing, how can you, in anticipation, envision what are the requirements you need? In this case for the monitoring tool considering that you can't predict really in maybe two years, three years, how much load you will have?

We'll have a lot more load than we have today, that's for sure. The first two years of building this platform ended up taking us... we moved a little slower than we expected partially because we kept running into scaling concerns. The reason we kept running into scaling concerns is because it took us a while to notice that we're increasing our volume of telemetry by about 100% every quarter. You can maybe even start predicting that and say, okay, so that means that essentially I'm going to increase by about what? 8x or is it? Actually no, it's I think 16x, 2^4. So 16x every year but that's just ridiculous.

It's ridiculous for two reasons. One, when you're increasing at that speed, you keep running into technical constraints and you spend most of your time not improving the product but solving scalability concerns which is certainly very attractive but not necessarily useful to our customers. And the other one is cost. Even at Netflix, which isn’t particularly necessarily cost sensitive, when you start running the biggest cloud ecosystem within Netflix and costing Netflix hundreds of thousands of dollars a week, people pay attention.

So at some point, we actually sat down with the very prolific metrics producers within our environment and the good news is there are only about three different teams at Netflix who produce the vast majority of metrics. So the good news is if you're not one of those three teams, you can do anything you want. You don’t even have to be thoughtful about metrics because if you quadruple your metrics count, we're not even going to notice it. But those three teams need to be a lot more thoughtful.

And so we've had many more conversations with them. They are much more thoughtful about this. And we actually have a target for metrics growth of not increasing our metrics count from these systems faster than the business is growing. They've done a pretty decent job keeping to that target.
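Editor's note: the growth arithmetic in this answer can be checked with a few lines of Python. This is only an illustrative sketch. The function names and the business-growth figure are hypothetical and do not come from Netflix; the only number taken from the interview is the roughly 100% increase in telemetry volume per quarter.

```python
def annual_growth_factor(quarterly_growth_pct: float, quarters: int = 4) -> float:
    """Compound a per-quarter percentage increase over the given number of quarters."""
    per_quarter_multiplier = 1 + quarterly_growth_pct / 100
    return per_quarter_multiplier ** quarters


def within_growth_target(metrics_factor: float, business_factor: float) -> bool:
    """True when metrics volume grows no faster than the business (the stated target)."""
    return metrics_factor <= business_factor


# 100% growth per quarter means doubling each quarter: 2 * 2 * 2 * 2 = 2**4.
yearly = annual_growth_factor(100)
print(yearly)  # 16.0

# Hypothetical business growth of 1.5x per year, used only for illustration.
print(within_growth_target(yearly, business_factor=1.5))  # False
```

Doubling every quarter compounds to 2^4 = 16x per year, which is why the team moved to a target tied to business growth instead.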

   

9. Interesting. I'd like to know also when you build a non-core system or tool, do you decide beforehand if you would be willing to open source this tool or does that come later on for other reasons?

I think most of our stuff was built before 2012 when we started actually open sourcing our products. So obviously, anything that happened before 2012 we built originally without any thoughtfulness about OSS.

These days, generally speaking, most teams who start working on a new product probably have some idea about whether or not they're going to open source their product. And generally speaking, frankly, the answer is yes. We're going to open source it because there's no reason not to potentially, other than sort of the increased maintenance load. But given the fact that we don’t need permission to do this, given the fact that generally speaking engineers seem to like supporting open source, generally speaking it's going to happen.

   

10. Has it happened that you start open sourcing before the tool actually has matured enough that you can use it inside Netflix? Or it's always at the later stage?

Traditionally, I would say we've waited until it was really good and it had been working in production at Netflix for a while. That's certainly what we did with Atlas.

In some cases we haven’t, so some of the more recent development we've done, we've actually started off, from day one, with a completely non-functional product being open sourced. In some cases, we did it much earlier than when it's actually ready.

So for example, with RxNetty, which is basically the reactive framework that some people at Netflix use, our developers would probably agree that RxNetty is incredibly powerful but not necessarily ready for wide scale adoption at Netflix. The vast majority of engineering teams have not adopted it. But it's open source. My team is one of the maybe two or three different teams at Netflix who use RxNetty in production. It's been a pretty good product for us.

   

11. In those cases, do you welcome contributions from the community? Do you have a strong review process or how does it work?

I suspect most people at Netflix would argue that if you open source something you should not be thinking of this as some sort of read-only open source project. If we're open sourcing something, we're hoping that other people are going to use it and contribute to it. I think we're very happy when that happens.

As for the review process, this is where it gets kind of interesting. I think we definitely need to review code. In the end, it's our repository. We're the ones who own it and that means that we need to be responsible for every line of code that runs in this environment.

We can do a pretty decent job, I think, making sure, for example, that we don’t incorporate malicious code into our project. Where it gets interesting is if people want to incorporate features and capabilities into this product that we don’t have the ability to test, how do we do that?

If people wanted to figure out how to make Atlas run on the Google Compute Engine environment and proposed some code enhancements to make it happen, I think we'd be really interested in figuring out how we validate that those things actually work because we'd be interested in accepting them. But we don’t necessarily have our own way to test that. We haven’t yet figured out exactly how to work that balance.

   

12. On the other side of this open source question, do you take in a lot of tools to use at Netflix? Is there also a kind of review process to check that the tool won't be malicious or have side effects that you don’t want in your environment?

We use a tremendous spectrum of open source products. I think frankly you'd be somewhat foolish if you're running a Linux environment and you're not using a whole lot of open source products. I think maybe the question is why not? We use Cassandra, we use Apache, we use Tomcat, we use Linux. That's sort of just the first few that came to the top of my head.

As for the review process, there's nothing particularly formal. I think when we have questions about a product, typically if it's not necessarily a very famous or very well-adopted product, I'm lucky enough to work with some really fantastic security engineers. Cloud security at Netflix are some of the very few security professionals out there I've actually enjoyed working with and these guys are fantastic resources and fantastic consultants if we have questions around that.

   

13. Another topic: here at QCon we've seen many talks on microservices, DevOps. Given Netflix's open culture and very resilient architecture, these are things I assume are embedded in your culture? Is there any intentional acknowledgment of these topics, in the sense of “we need to do more microservices” or “more DevOps”? Or is it just something which is part of the culture?

I think from a microservices perspective I suspect I may be a little too much of a hipster here in saying that we were doing microservices before they were cool. We didn’t call them microservices. We just basically looked at the natural reaction to running a monolithic stack and saying "well, God, we should do this differently in the cloud." As a result, we've got about a thousand different services in the cloud and there's no particular desire to sort of try to change that particular trend.

Sometimes we find that people sort of notice “oh, hey, we built this thing that was originally a microservice, but now it's doing three distinct things. Well, maybe we should separate them.” They re-architect their platform to create three smaller microservices so to speak.

As for DevOps, I got to tell you, we never used that phrase internally. When we moved from the model of IT supporting production to engineers both writing code and deploying it and running it in production and waking up at two o'clock in the morning, it wasn’t in response to the DevOps movement. It was a fundamental way we thought to align the responsibilities and incentives of the release process with what developers should be thinking about.

   

14. So as a final question, you mentioned that you work with the security guys and you have a good relationship with them. I also saw some time ago a tweet from you, kind of a call to action, on DevSec. Can you explain to our readers what you meant? Do you have some kind of involvement with that movement?

I have a shameful admission which is that I never actually realized that DevSec was a term that anybody had used until I actually used it. The point that I was trying to make is we talk a lot about DevOps and when we talk about DevOps, largely what we're talking about is an alignment between operations people and developers. At Netflix we don’t really have a DevOps movement because we don’t really have operations people. So we've basically localized both the development and the operations in the same people.

But what we do have are developers and security people, and I think that DevOps is a good start but really what you've got to look at is alignment between different teams. If you've solved the DevOps problem, congratulations. Figure out where else you have alignment issues and work those out. I mentioned that I love working with the security engineers at Netflix and one of the reasons is because I would argue that what they consider success and what I consider success are the same thing. We work together for a common goal. In many other organizations, security people largely think of themselves as defending the organization from the bad decisions of developers.

At Netflix, security people help me get my job done better. If you solve DevOps and if you solve DevSec, then the next question is what's the next group you want to look at? One of the reasons I love working at Netflix is because of, not just the engineers or the security people, but the facilities people and the purchasing people and the HR people. If you solve those two things, look at Dev-HR and Dev-Purchasing and Dev-Facilities. It's all about alignment. That was the point I was trying to make.

Manuel: Thank you very much, Roy.

Thanks.
