Kolton Andrus on Breaking Things at Netflix


1. Hi. My name is Ralph Winzinger and I am a software architect and editor for InfoQ. I am here at QCon 2015. I am here with Kolton Andrus. Kolton, would you please introduce yourself to our readers and watchers?

My name is Kolton Andrus. I work at Netflix. Before I worked at Netflix I worked at Amazon for a few years. It is my privilege to break things on purpose to see what happens.

   

2. That is all to say. Our speaker page reads that you are a “chaos engineer”, which sounds really awesome. So, how did you actually manage to get paid for breaking things in production?

During my time at Amazon I had the privilege of working on the availability team and building some software similar to what I built at Netflix, to break things on purpose and help improve the reliability of the retail website. Netflix is renowned for Chaos Monkey and for the great work they have done in this area, so I came to them and found a team that really shared that vision and that focus. I joined it and I have been able to work on it.

Ralph: Usually we are trying to keep our production systems as safe as possible and really not to touch them. So, it is also a possibility to ensure that your production system is really ready for the failures that might come.

Yes. There is an interesting concept from the book Antifragile, which I have recently read. It deals with risk and investment, and it talks about a barbell strategy: on the one hand you take the safest approach possible, you are very careful, you do everything that you can. So, in production, we do everything we can to prevent outages and to build our systems to be safe and resilient. But the other end of the barbell is to take a more risky or speculative approach, and what the author posits is that the combination of the two is actually more successful than either one on its own. So, failure testing is a bit of this more risky approach. But in my opinion, it is not as risky as it sounds, because by going out and causing that failure in production, by seeing how the system is going to behave and, most importantly, by validating the assumptions that you have about your system and that your fallbacks will work, you are saving yourself pain and trouble further down the line.

Ralph: Yes, that sounds reasonable. Do you think that injecting failure in production systems is a way that works for every company, for every product? So, for example, I am working a lot for customers from the financial area so I guess they would be surprised if I recommend to intrude in their production systems.

So, my experience with injecting failures comes primarily from Amazon and Netflix, but to turn the question around to you: do you still have failures in a banking system? I would imagine you do, and so they are going to happen whether you choose to or not. So, the question is “Is your system going to handle it well and what is going to occur?” I was thinking a little bit about banks in particular and someone spoke recently about the transactional nature of banks. In the past, when we had ledger books, if there was an error, then there was another entry that backed the error out. So, it may be that when you are dealing with those transactional-type systems, your normal error handling should be one that has a line item that backs out the trouble that occurred. If that is the approach you take, then does it not make sense, if you are going to do failure testing, to essentially have those on the books? Say, “Look, we did this on purpose, we wanted to see what occurred and then we backed that transaction out”, because it is a bit more honest about how the system behaves. Sometimes we software engineers want to hide it all under the covers, but really it is a fact of life and we have to deal with it.

   

3. Yes, that is right. You were referring to anti-fragility here and also in your talk. Could you please explain a little bit what is the idea behind anti-fragility actually?

I think that the way it really made sense for me is to ask what is the opposite of fragile? Something that is fragile does not handle change well and a lot of people, myself included, would think that the opposite would be robust or resilient. But in reality that is the middle of the road. Resilient or robust just means you are indifferent to change. You do not actually improve upon it. So, the concept of anti-fragility is something that actually gets better in the face of change and the best examples are people or humans or society, or you see it all over the place in nature.

Ralph: So, for example, the human body that gets vaccinated gets actually a little bit ill, but then it will be immune to the disease afterwards.

Exactly. That was one of the examples I used in my talk: vaccinate yourself with failure now so that you are immune to it in the future.

   

4. Yes, but the examples are all kind of referring to systems that are able to evolve somehow. So they are able to change themselves, but usually our software systems will not work that way. So, how can we create anti-fragile software?

Yes, so that is the question and I do not know the answer. I do not know a way to build an anti-fragile software system, but in thinking about that a little bit I think we can build the cultures and the organizations – those can be anti-fragile, those can benefit from the change and from the failure to get better instead of simply staying the same or being worse off from it.

Ralph: So it’s not only the software that is the system, but the software including the team responsible for it.

We have come to learn culture and the company and the people make a big difference in how successful a project is so, certainly, that applies when it comes to anti-fragility.

   

5. The framework you created is called FIT. Everybody knows about the Monkeys at Netflix, they introduce chaos and latency and whatever – is it going to be some kind of replacement for the monkeys?

We still have the monkeys and the monkeys still serve their purpose. Chaos Monkey does a great job at what it does. Latency Monkey was the one that was a little bit problematic. The way I phrase it is that developers were kind of afraid to let it out of its cage because they got bitten. Things did not quite work the way they expected and they really could not control to the fine degree necessary how the failure was going to manifest. So, FIT was really built to solve that problem, and one of the key concepts is the failure scope: determining how much of an impact you are going to have, the blast radius. It is built in a way that essentially replaces Latency Monkey. It does everything that Latency Monkey could do, but it does a lot more as well.

   

6. Is it also open source or will it be open source?

It has not been open-sourced yet. Part of what I wanted to hear, and I want to hear it from anyone interested, is whether it is valuable to open source and whether they would use it. I think the concepts are great. Right now it is a little bit tied to how we do things at Netflix, so there are moving pieces that would need to be put together. But if there is sufficient interest, I am more than happy to share not just the concept, but the code behind it.

   

7. I think there will be a lot of downloads. How does it work? Do I have to instrument my service to use the FIT framework?

So, when a request comes in, we leverage our proxy layer that everything goes through to decide whether or not we should cause a failure on that request or not and then we decorate the request and that is a little bit of the magic: we have this request context that flows through our system. So that is how we get the failure in, but then the individual – the injection points they are called – those are the layers where we want to cause the failure. Our RPC clients, our caching layer, our persistence layer, our circuit breaker layer - Hystrix - that is where we want to cause the failure and so each of those has been instrumented to have a hook and to implement the failure behavior. So, what was nice about that was that when we rolled it out to the company, I went and talked to the teams that owned those layers and the team that owned the proxy layer and I was able to make the changes there and all of the application developers in the middle just kind of got it for free.
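Editor's note: the flow described in this answer can be sketched roughly as follows. This is a minimal illustration, not FIT's actual code, which has not been published. Every name here (`RequestContext`, `proxy_decorate`, `injection_point`, the `"rpc:recommendations"` label and the scenario dictionary) is a hypothetical stand-in. The sketch assumes the proxy marks in-scope requests, the mark travels with the request, and an instrumented client layer checks the mark before doing real work.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


class InjectedFailure(Exception):
    """Raised by an injection point when the request is marked to fail there."""


@dataclass
class RequestContext:
    """Hypothetical per-request context that flows through every layer."""
    customer_id: str
    # Maps an injection point name to the kind of failure to simulate.
    failures: Dict[str, str] = field(default_factory=dict)


def proxy_decorate(ctx: RequestContext, scenario: dict) -> RequestContext:
    """Edge proxy: decorate the request only if it falls inside the failure scope."""
    if ctx.customer_id in scenario["customers"]:
        ctx.failures[scenario["injection_point"]] = scenario["kind"]
    return ctx


def injection_point(ctx: RequestContext, point: str, real_call: Callable[[], List[str]]) -> List[str]:
    """Hook inside an instrumented layer (RPC client, cache, persistence)."""
    if ctx.failures.get(point) == "fail":
        raise InjectedFailure(point)
    return real_call()


def get_recommendations(ctx: RequestContext) -> List[str]:
    """Application code: unchanged by the tooling, it only relies on its fallback."""
    try:
        return injection_point(ctx, "rpc:recommendations",
                               lambda: ["personalized-1", "personalized-2"])
    except Exception:
        # Fallback path: serve a generic list instead of failing the request.
        return ["popular-1", "popular-2"]


scenario = {"customers": {"alice"}, "injection_point": "rpc:recommendations", "kind": "fail"}
in_scope = get_recommendations(proxy_decorate(RequestContext("alice"), scenario))
out_of_scope = get_recommendations(proxy_decorate(RequestContext("bob"), scenario))
```

In this sketch only the request for the customer named in the scenario takes the fallback path, which is one way to keep the blast radius small. The application function itself contains no failure-testing logic, which mirrors the point that developers in the middle get the capability for free.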

   

8. It is actually implemented in some kind of infrastructure code at Netflix and everybody can benefit. In the morning I was listening to a talk about AB testing and then I had the idea that that sounds a little bit like AB testing, but with failure. So, maybe 5% too high latency and 95% will be normal. Can you compare it this way?

It is an interesting thought. At least as it has stood thus far, if we are going to impact a large number of customers, we want to be very careful. We do not want to cause our customers pain and we want to ensure that they can stream and that we do not get in the way. So, with successful failure tests no one really knows that anything happened. So, if we were to be always causing failure on a high percentage of customers, I think there is inherent risk there that makes me a little uncomfortable. That being said, we are looking at doing a side project this summer on some more advanced failure injection, where we do it automatically on a very, very small percentage of requests and make sure that the customer still has a good outcome. Then we look at what could have failed, are intelligent about determining whether we are resilient to it, and keep searching through the service-oriented architecture to see if there are things that we are vulnerable to.
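Editor's note: one common way to pick "a very, very small percentage" of requests is to hash a request identifier into a fixed number of buckets. The sketch below illustrates that general technique only. The interview does not say how Netflix selects requests, and the function name, the use of a request id, and the basis-point unit are all assumptions.

```python
import hashlib


def in_failure_sample(request_id: str, basis_points: int) -> bool:
    """Return True for about basis_points / 10,000 of all request ids.

    One basis point is 0.01%, so basis_points=10 selects roughly 0.1%.
    Hashing makes the decision deterministic for a given request id.
    """
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 10_000  # value in 0..9999
    return bucket < basis_points


# Count how many of 100,000 synthetic request ids land in a 1% sample.
sampled = sum(in_failure_sample(f"req-{i}", 100) for i in range(100_000))
```

Because the decision depends only on the id, every layer that sees the same request reaches the same answer without coordination.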

   

9. It sounds a little bit like machine learning and stuff like that?

There are some great opportunities there. We will see how that turns out.

   

10. Who defines the test scenarios for FIT? If the service teams are not really involved and they benefit without doing anything, who is defining the tests and who is running them?

We built it to be self-service. One of our core values is freedom and responsibility. We want people to be able to go out and do what they think is best. In this regard we want the mid-tier services, as we call them, those that do not sit on the edge, to be able to go out and do their own testing and understand how the system handles it when their service fails or gets slow. Likewise, we want the UI engineers and those that run the clients to understand how different failures occur. But we sit in this unique position on the edge between the two, and we really have the context to know what fails often and how it fails. So, in some cases we drive the failure testing and we know what we want to make sure we are resilient to. But in other cases, if another team has something that they want to replicate or reproduce, they can go out, they can create the scenario, they can go test it and make sure that it behaves as they expect.

Ralph: So the teams do not have to change the software, but they are able to define the scenarios based on their knowledge of what might be useful for their service.

One of the things we did to make it easier and more approachable is that we built a nice web interface on top of the simple service and part of that is that when you are building a scenario, you can look at the domain of things that are possible to fail and this helps with a team that may not understand. They know that they need to be resilient to losing a social service or the AB testing service, but they may not know what that entails. So sometimes they may need to come and talk to someone that has more context or they may be able to just use the UI to kind of rely on naming conventions and find those.

   

11. After all do people have fun in trying to break their stuff?

I have fun in trying to break things, but I am a little sadistic in that manner. I also volunteered to be a call leader or an incident commander. If Netflix goes down I am happy to be on the hook to go help fix it and make things better. But yes, certainly we have talked to some teams that have enjoyed it. I think that what they enjoyed most is that it makes it easy for them to do their job and that is part of my approach to freedom and responsibility and getting people to adopt software: make it easy to do the right thing. So, when people go out and they know that their service needs to be resilient, but if they have to go through a lot of custom traffic routing, a lot of manual configuration and it is really not how the system behaves in production, then they are spending a lot of time and effort to do this failure testing and they may not be getting out of it what they really expect. So, giving them a realistic failure scenario and making it easy to do and yet safe, guiding them in how it happens, I think really enables that culture that we want to propagate.

   

12. I guess if you are testing in production and introduce failures in production, there must be a lot of testing before so that you can be sure that the services that will be propagated to production are really ready for it. So, what is involved before a piece of code is introduced in production?

Well, on one hand, we have the failure testing service in the test environment as well. I want to deploy to test before I deploy to production and make sure that things behave as expected. So people are certainly able to do that. We have a wide variety of the normal kinds of tests and ways in which we ensure things behave as expected. We have also used this concept of FIT to do some integration tests, where we mock some requests, introduce this bad behavior and execute that as part of our build, to make sure that these isolated pieces of code behave correctly. These functional tests make sure that when we cannot talk to a dependency or something goes wrong, at least at our layer it works correctly and we can make sense of it and return a degraded response.
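Editor's note: a build-time test of the kind described here might look like the sketch below. It uses Python's standard `unittest.mock` to stand in for a dependency and to make it fail. The `RatingsClient` class, the `title_page` function and the response shape are hypothetical examples, not Netflix code.

```python
from unittest import mock


class RatingsClient:
    """Hypothetical client for a downstream dependency."""

    def fetch(self, title_id: str) -> float:
        raise NotImplementedError("a real client would make a network call")


def title_page(client: RatingsClient, title_id: str) -> dict:
    """Layer under test: returns a degraded response if the dependency fails."""
    try:
        rating = client.fetch(title_id)
        return {"title": title_id, "rating": rating, "degraded": False}
    except (TimeoutError, ConnectionError):
        # The dependency is unreachable or slow: omit the rating, keep the page.
        return {"title": title_id, "rating": None, "degraded": True}


def test_degraded_response_when_dependency_times_out() -> dict:
    client = mock.Mock(spec=RatingsClient)
    client.fetch.side_effect = TimeoutError("injected failure")
    page = title_page(client, "title-1")
    assert page == {"title": "title-1", "rating": None, "degraded": True}
    return page


def test_normal_response_when_dependency_is_healthy() -> dict:
    client = mock.Mock(spec=RatingsClient)
    client.fetch.return_value = 4.5
    page = title_page(client, "title-1")
    assert page == {"title": "title-1", "rating": 4.5, "degraded": False}
    return page
```

Both functions follow the `test_` naming convention, so a runner such as pytest would collect them as part of a normal build.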

   

13. There is one last thought I had. Yesterday, you had a live demo in your talk and you broke your account – I guess that was the recommendation service for your account. So it did not work anymore and then you put it back to normal. Is there some kind of security issue with FIT because if somebody gets hold of FIT maybe he can break the streaming service for all of your customers?

Yes, it is a good question and it is definitely one we have thought about. On one hand it is an internal tool; we do not have any external access to it. On the other hand, if you were able to inject some failure, you would only be able to do it for your own request, so you could not break other people as a consequence. When it comes to the internal tool, if you are able to go in and use it to kick off a large-scale failure test and impact many customers, in the top right corner we have this bright red button that says “Halt all failures”. Anyone can go push it, and it shuts down the system and disables it throughout our infrastructure so that the failures will not be injected. So if, heaven forbid, that were to occur, we would quickly mitigate the issue and prevent any of that bad behavior from going on. So, layers of defense: multiple ways in which we make sure that does not happen.
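Editor's note: the "Halt all failures" button amounts to a global kill switch that every injection point consults before acting. The sketch below shows that idea within a single process. The `FailureInjector` class and its method names are hypothetical, and a real deployment would have to propagate the halt across many machines, which this example does not attempt.

```python
import threading
from typing import Set


class FailureInjector:
    """Hypothetical in-process injector guarded by a global halt flag."""

    def __init__(self) -> None:
        self._halted = threading.Event()

    def halt_all(self) -> None:
        """The red button: stop all injection from now on."""
        self._halted.set()

    def should_inject(self, marked_points: Set[str], point: str) -> bool:
        """Inject only if not halted and this request is marked for this point."""
        if self._halted.is_set():
            return False
        return point in marked_points


injector = FailureInjector()
before_halt = injector.should_inject({"cache"}, "cache")
injector.halt_all()
after_halt = injector.should_inject({"cache"}, "cache")
```

Checking the halt flag first means a halted system behaves exactly as if no request had ever been marked.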

Ralph: So hopefully no one is keeping me from watching Breaking Bad.

If our customers can stream, that is of the utmost importance, that is what keeps me up at night. I want to make sure our customers have a good experience.

Ralph: Ok. Thanks a lot for the interview. It was very fun and thank you for sharing your insights.

My pleasure. Thank you.

Sep 18, 2015
