
InfoQ Homepage Interviews Michael Nygard on Building Resilient Systems

Michael Nygard on Building Resilient Systems


1. Hi, my name is Ryan Slobojan. I'm here with Michael Nygard. Michael, what's the difference between feature complete software and production ready software? Is there a difference?

There is definitely a difference. In fact, I think to some extent, the only possible answer for that question is Mu. You have to unask the question. Feature completeness really tells us nothing at all about the software's ability to survive the real world of production. Feature complete tells us that it has passed QA, which means that, by and large, when I click this button, that label gets activated, or when I enter a date, it's in the proper format. It says nothing at all about whether the software will handle continuous traffic from millions of users for weeks at a time.


2. As a developer, what can one do to think about these kinds of considerations?

Some of it can be tested out. We do have mechanisms for load testing and, if you run a load test for more than a few hours at a time, you get into what I consider a longevity test. Whereas the typical QA system runs for a few hours before it gets shut down and reloaded with yet another build of the software, I like to see QA running for several days at a time under continuous load. That gives you some idea about whether the software exhibits memory leaks or resource leaks of other sorts. Other kinds of issues, though, can't be tested out, they have to be engineered out. For example, I talk about scaling effects a lot.

In development and QA, every system looks like a pair of servers at the web tier, a pair of servers at the app tier, and a pair of database servers, or maybe just one database server in dev. System A talking to system B in QA is 2-2-2 talking to 2-2-2. Scale that to production and you may have 100 servers in system A calling into 10 or 4 servers in system B - that's a very common kind of scaling effect. We also see scaling effects between tiers. In dev and QA it's 2-2-2, and in production that may be 10-20-2. Things that work OK when it's just a 1-1 ratio don't work as well when it's a 20-1 ratio.


3. One of the principles of Agile and of Lean development is that you defer a decision until the last responsible moment. Do you think that considerations like this are really something you can defer, or is the last responsible moment not at the end of the project but more at the beginning?

I would agree with that interpretation. The last responsible moment for some of these issues can come fairly early. Take the decision about what type of integration call to make, for example. Whether it is going to be a synchronous HTTP request-response to get information from the back end or asynchronous message queuing has tremendous operational implications. With anything dealing with synchronous calls, you're inherently coupling availability. With anything dealing with asynchronous messaging, you are decoupling availability at the expense of some complexity in the logic on the calling side. That's a decision you really do need to make fairly early on in the game; it's tough to swap that out later on.
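
As a rough illustration of the availability coupling described above, here is a minimal sketch. It assumes the systems fail independently, and the availability figures and function names are hypothetical, not taken from the interview. With a synchronous call the caller can only answer when the back end answers; with a queue in between, the caller only needs the queue to accept the message.

```python
def sync_availability(caller: float, callee: float) -> float:
    """Synchronous call: the caller is only useful when the callee is also up,
    so (assuming independent failures) the availabilities multiply."""
    return caller * callee


def async_availability(caller: float, queue: float) -> float:
    """Asynchronous messaging: the caller only needs the queue to accept the
    message; the callee can be down and catch up later."""
    return caller * queue


# Hypothetical figures, chosen only for illustration.
front_end = 0.999
back_end = 0.99
message_queue = 0.9999

print(f"synchronous:  {sync_availability(front_end, back_end):.4f}")
print(f"asynchronous: {async_availability(front_end, message_queue):.4f}")
```

The price of the better number is the one mentioned above: the calling side has to cope with answers that arrive later, or not at all.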


4. One of the problems that I've seen with respect to things like having load testing in addition to QA and all the other things that have to happen during the software development process is that you need to have a certain amount of hardware available in order to be able to do that. So, you would dedicate, say, 3 servers to your long-running testing and you have 3 servers for QA and you also have this and that, which tends to be very prohibitive cost-wise. Do you think that the rise of cloud computing and platform as a service, having the ability to just instantiate a bunch of machines out in the cloud, will help increase the adoption of these processes?

Yes, I absolutely do! I want to go on a brief tangent first and talk about continuous performance testing. In the realm of functional testing, we used to have this model where functional testing was extremely expensive. It was overhead, it involved a whole bunch of people reading scripts and clicking around through applications. So, we didn't do it very often. We tried to defer all of that until a release, which meant that we were accumulating a lot of technical risk. As we did the development we had a large number of lines of code of unknown quality until the end. One of the things we did with unit testing, when we started adopting that practice, was reduce the cost of each additional test to the point where we could do it continuously throughout the development cycle.

It's possible also to do that with load testing. There are open source tools like JMeter that work pretty well. They are not suited for really high volume load tests, but they work pretty well for testing as you go and checking to see when you introduce a performance regression. It is possible to do more of that on a continuous basis.
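
The as-you-go check for performance regressions can be sketched as a comparison of timing samples against a stored baseline. This is a minimal sketch and not how JMeter works: the function name, the 20% tolerance and the sample values are all assumptions for illustration, and the samples could come from any load tool.

```python
import statistics


def has_regressed(baseline_ms, current_ms, tolerance=0.20):
    """Return True when the median response time of the current build exceeds
    the baseline median by more than `tolerance` (0.20 means 20%)."""
    baseline = statistics.median(baseline_ms)
    current = statistics.median(current_ms)
    return current > baseline * (1 + tolerance)


# Hypothetical response times in milliseconds from two builds.
last_good_build = [100, 102, 98]
todays_build = [130, 128, 135]

if has_regressed(last_good_build, todays_build):
    print("performance regression: investigate before merging")
```

A check like this says nothing about production-scale load; it only flags a build that got noticeably slower than the previous one.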

While it won't prove that you are ready for production scale load, it can tell you if you've introduced something that broke your performance. That's one aspect. The other aspect is you are absolutely right. Cloud computing can help in this arena in 2 ways: one is that you can spin up a new environment of your software in the cloud relatively cheaply and quickly, pay for what you need and tear it down. This will have a side benefit of getting you to be very good at deployments and automated deployments, which is pretty important for operations in its own right. But some organizations have security concerns or data privacy concerns. Anyone dealing with US regulatory systems like HIPAA or industry standards like PCI is going to have a very hard time doing that. Another alternative is that you can spin up a load farm in the cloud to test your software where it happens to reside, probably in your own data center. That's another option that can bring you lower cost than going large scale with some other vendors.


5. An approach that seems to be taken frequently is that it's thought that you can harden the system, make it production ready shortly before release. Do you believe that's something that's really possible?

I would call that necessary but not sufficient. I think you do need to harden a system. I think you need to apply several forms of production readiness testing that certainly includes load and stress testing, longevity testing. I believe that it also includes security testing, which fortunately we see more and more organizations do, but I don't believe that that's sufficient to really gain a lot of confidence. I think you have to move some activities upstream. Even as an Agile guy it pains me to say it, I think there is a certain amount of design and architecture that you have to do throughout the entire process, to make sure that you are production ready when you reach that point.


6. What are some of the patterns that a developer or an architect can follow to make a system more resilient?

In my book, Release It!, I have a catalog of what I call "Stability Patterns" that help around resiliency and a catalog of capacity patterns - things you can do to help reach higher volumes with your current set of resources. In the Stability Patterns I talk about ways that you can preserve functionality for some features or some users, even in the event of partial system failures. One of the things that I try and get people to adopt is what I call a failure-oriented mindset, where I want everyone to understand that, at some point, every piece of your system and every external system you call on is going to break and you need to be able to survive that.

It might mean that there are certain features you can't provide to your users while the breakage is occurring, but you should preserve as many features as possible. So, I have patterns that I call "The Circuit Breaker" and "Bulkheads" - those are 2 of the key patterns. The circuit breaker essentially cuts off an external integration point when it's malfunctioning and it does this to preserve request handling threads in the calling system. Very often, when you make a call to an external integration point that's broken, it will tie up a thread in a blocking synchronous call for an indefinite period of time.

Once you get enough threads tied up waiting on that external integration point, you are down as well, which means you've allowed a fault in an external system to propagate into a failure in your system. We should never allow that. The Circuit Breaker allows you to cut off that integration point when it's malfunctioning. Bulkheads are another approach to preserving functionality for as many users as possible. Bulkhead means I partition my system into separate sets of resources that are unlikely to fail together. I'm trying to segregate, perhaps processors into processor pools. So, when a process goes wrong and starts eating CPU, the most it can eat is 4 CPUs instead of 16 because boxes come in pretty large numbers of cores these days.

That's one kind of a Bulkhead. Another kind is to say, "Within an application, I'm going to create multiple thread pools that each serves different purposes", so if the thread pool that handles JMS queues gets overloaded because of a message flood, I can still process front end transactions because that's in a different thread pool and I can still get to it through the admin interface because the admin threads are in a separate pool. That's another example of Bulkheads.
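
The thread-pool form of Bulkheads can be sketched with separate executors per purpose. This is a minimal sketch: the pool names, pool sizes and the simulated message flood are hypothetical, and a real application would size and wire the pools very differently.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def make_pools():
    """One pool per purpose, so exhausting one cannot starve the others.
    Names and sizes are hypothetical."""
    return {
        "messaging": ThreadPoolExecutor(max_workers=2, thread_name_prefix="messaging"),
        "frontend": ThreadPoolExecutor(max_workers=4, thread_name_prefix="frontend"),
        "admin": ThreadPoolExecutor(max_workers=1, thread_name_prefix="admin"),
    }


def flood_messaging_but_keep_serving():
    pools = make_pools()
    release = threading.Event()
    try:
        # Simulate a message flood: every messaging worker blocks, and more
        # work piles up behind them.
        for _ in range(10):
            pools["messaging"].submit(release.wait)
        # Front-end and admin work still run, because they use their own threads.
        page = pools["frontend"].submit(lambda: "page rendered").result(timeout=5)
        status = pools["admin"].submit(lambda: "admin ok").result(timeout=5)
        return page, status
    finally:
        release.set()  # let the stuck messaging tasks finish
        for pool in pools.values():
            pool.shutdown(wait=True)


print(flood_messaging_but_keep_serving())
```

Had all three kinds of work shared one pool, the ten blocked messaging tasks would have held every worker and the front-end request would have timed out.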


7. How can you approach making an existing legacy system more resilient and more monitorable?

Interesting - 2 questions embedded there! I'll tackle monitoring first and then I may need a reminder to go back and do resilience. Monitoring begins with log files. I know it sounds primitive, it sounds like debug with printf, but log files actually still have a lot of advantages. They're plain text, so you can move them around to a lot of different places. You can ship them off to a vendor or a consultant, if you need some postmortem analysis. They persist after the condition that was causing the fault is gone, so you don't have to wait for it to happen again and try and catch it in the act. Log files are still a pretty important mechanism.

I also advocate exposing information through a management interface. For example, anyone in the Java community has JMX available to them. JMX can let you open up the internals of your application and make them transparent. Status variables, counters, thresholds, timers - all of these kinds of things can be very interesting. I mentioned the Circuit Breakers a little bit earlier; the open or closed state of the Circuit Breaker can be a pretty interesting quick indicator of the overall health of your system. Exposing the state of your Circuit Breakers through JMX makes them immediately monitorable. Then, the last mile of monitoring is getting the information out of the application and into an operations system.

JMX has connectors to a lot of different monitoring systems and I really advocate creating transparency in your application and leaving the last mile up to the operations group. I don't advocate developers directly hooking into the monitoring systems. There are a couple of reasons for that. First, things that change at different rates should be separate concerns. Rules about monitoring and rules about thresholds change very rapidly. Rules about how to react to a particular monitoring event change very rapidly and they change under the control of the operations group. You really don't want a code deployment cycle just to accommodate that. Second, the operations group sometimes buys a new product and makes a sweeping change to replace NetCool with OpenView or OpenView with CA or something like that. Again, we don't need that to be a code change and we particularly don't need to go sweeping through our code looking for calls to one API and replacing it with calls to another API. That was the monitoring aspect. I'll go back and address resilience, if you remind me what the question was.


8. It was resiliency with respect to existing legacy systems.

Creating resilience in legacy systems, like anything else with legacy systems, will be a challenge, no matter what. The first aspect is to identify the weak points. I would always start with resource pools and external integration points. Any kind of a thread pool, any kind of a database connection pool is a place where you know there is going to be blocking in the application, so it's very likely to be a failure point. Any place where you are calling out to an external system, particularly if it's a synchronous call, it's going to be a place where a fault will be introduced into your system - so that's a failure point to look at.


9. Have you got any tips or recommendations for the types of information referring to your point about monitoring, types of information that can be verified by those different mechanisms like logging or monitoring through JMX - that type of thing? Thinking of the observation that quite often it's only after the problem has happened that you realize you wish you had a bit more monitoring or logging at the time that these happened?

That's an excellent point and absolutely true! We always want just a smidgen more information than we actually got from any particular event. First and foremost I'm going to start with resilience and then think about capacity. In terms of resilience, you always want to know who's blocking, where and how often. Again, on something like a database connection pool, I'll keep track of the high watermark, how many connections have been checked out. I'll keep track of how many times a thread has blocked. I'll keep track of the longest blocking time.

If I have a good statistics collection mechanism built into my app, I might even keep stats on every time a thread begins blocking and ends blocking. I'll just record that, so I can get an idea of the distribution of blocking times as well. The instant you see that 40 out of 40 threads are blocked on a database connection pool, you immediately know that that's where the problem is or, at least, that's the proximate cause - trace backwards to figure out why you leaked connections.
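
The pool statistics described above (high-water mark, number of times a caller blocked, longest blocking time) can be sketched as a thin wrapper around a queue of resources. This is a minimal sketch: the class name, field names and string "connections" are hypothetical, and a real connection pool would also validate and recycle connections.

```python
import queue
import threading
import time


class InstrumentedPool:
    """A fixed-size resource pool that records who had to wait and for how long."""

    def __init__(self, resources):
        resources = list(resources)
        self._idle = queue.Queue()
        for resource in resources:
            self._idle.put(resource)
        self._lock = threading.Lock()
        self.size = len(resources)
        self.checked_out = 0
        self.high_water = 0          # most resources ever checked out at once
        self.block_count = 0         # how many times a caller had to wait
        self.longest_block_s = 0.0   # longest single wait, in seconds

    def acquire(self, timeout=None):
        try:
            resource = self._idle.get_nowait()
        except queue.Empty:
            # Pool is exhausted: this caller is about to block.
            started = time.monotonic()
            with self._lock:
                self.block_count += 1
            try:
                # Raises queue.Empty if nothing is released within `timeout`.
                resource = self._idle.get(timeout=timeout)
            finally:
                waited = time.monotonic() - started
                with self._lock:
                    self.longest_block_s = max(self.longest_block_s, waited)
        with self._lock:
            self.checked_out += 1
            self.high_water = max(self.high_water, self.checked_out)
        return resource

    def release(self, resource):
        with self._lock:
            self.checked_out -= 1
        self._idle.put(resource)

    def stats(self):
        with self._lock:
            return {
                "size": self.size,
                "checked_out": self.checked_out,
                "high_water": self.high_water,
                "block_count": self.block_count,
                "longest_block_s": self.longest_block_s,
            }


pool = InstrumentedPool(["conn-1", "conn-2"])
first = pool.acquire()
second = pool.acquire()
print(pool.stats())
```

Seeing `checked_out` pinned at `size` while `block_count` keeps climbing is the signal mentioned above that connections are being leaked or held too long.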

Anywhere there is a pool, definitely track who's blocking and how often, high water, low water and some stats about the number of times things are being checked in and out. Other kinds of health indicators: any place you've got a cache, keep track of how many items are in cache, what the hit rate is, what the eviction rate is; any place you've got circuit breakers, keep track of how many times the circuit breakers are flipping from an open to a closed state or from closed to open, the current state of all of them, of course, and the thresholds that are configured into them. Those are all useful things to expose through a monitoring and management interface.

It can also be useful to expose controls on these things - for instance, with the circuit breaker, a control to reset it; with a pool a control to change what the high water and low water mark will be. I can think of several cases where we've had an ongoing partial failure mode and we needed to go in and change the maximum number of connections in a connection pool and dial it down, so that the front end system would stop crushing the back end system. That's a very useful kind of control to have at runtime.
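
Putting the circuit breaker together with the indicators and controls mentioned above (current state, number of flips, configured thresholds, a reset control) might look like the sketch below. It is a minimal, single-threaded sketch and not the implementation from Release It!: the class name, the threshold and timeout values, and the half-open trial call are assumptions, and it does not put a timeout on the protected call itself.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised instead of calling an integration point that has been cut off."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # hypothetical default
        self.reset_timeout_s = reset_timeout_s      # hypothetical default
        self._clock = clock
        self.state = "closed"
        self.failures = 0
        self.opened_at = None
        self.trips = 0  # number of flips into the open state

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            if self._clock() - self.opened_at < self.reset_timeout_s:
                # Fail fast so no request-handling thread gets tied up.
                raise CircuitOpenError("integration point is cut off")
            self.state = "half-open"  # timeout elapsed: allow one trial call
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.state == "half-open" or self.failures >= self.failure_threshold:
                self._trip()
            raise
        self.reset()
        return result

    def _trip(self):
        self.state = "open"
        self.opened_at = self._clock()
        self.trips += 1

    def reset(self):
        """Also usable as an operator control to force the breaker closed."""
        self.state = "closed"
        self.failures = 0
        self.opened_at = None

    def status(self):
        """Values worth exposing through a management interface."""
        return {
            "state": self.state,
            "failures": self.failures,
            "trips": self.trips,
            "failure_threshold": self.failure_threshold,
            "reset_timeout_s": self.reset_timeout_s,
        }


breaker = CircuitBreaker(failure_threshold=2, reset_timeout_s=10.0)
print(breaker.status())
```

In a Java application the `status()` values and the `reset()` control are the kind of thing that would be exposed through JMX; here they are plain methods so the sketch stays self-contained.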


10. You heard Richard Gabriel talk about ultra large scale systems about a month ago?

I must say I missed his talk last month, but I've heard him deliver the talk before.


11. The things that you know now and are confident of now with large complicated systems, are they going to be equally applicable and useful when you get into ultra large scale complex systems, particularly when you start integrating human components with the merely mechanical or electromechanical components?

The ultra large scale system is almost by definition one that is too complex for any single human to understand or comprehend. That necessitates a very de-centralized approach to things. I think that many of the issues that I'm looking at and dealing with really help in a de-centralized world. We deal a lot with centralized operations now. Monitoring assumes a nervous system that can look at everything and control everything, but I also look at patterns that we can apply within components to make individual components more self-aware, more defensive and more able to survive on their own.

The patterns I'm talking about, like the circuit breaker that I keep bringing up, are really a way for an individual component to stop trusting everyone else around it and decouple itself from other components that are hurting it. So, you could almost view that as a pain response, which I think would help in the ultra large scale where there is not going to be any centralized awareness of the health and well-being of all the components.


12. It seems that the operation of an application is often considered a secondary concern to the development of an application. Why do you think that is?

Because mostly we ask developers about it. If you talk to CIOs and IT managers, they'll happily tell you that more of their budget goes to operations than to development. We often focus on development because of the time pressures. As soon as a requirement has been identified, it represents a current need, so any delay between the identification of that current need and the satisfying of that need is painful to the people who have the needs.

Since the people who have the needs are usually the ones writing the cheques, that means that time pressure on development is always severe, so it can be very difficult to have a conversation where you say "We need a few more weeks in development to handle production readiness issues". Particularly if you say things like "We're not going to implement that feature you want because we need to add data purging to the application layer".

It's a very difficult discussion to have. What we need to look at more often is total cost of ownership. That's a term that gets thrown around a lot, usually by people who have something to sell you, but in truth we often skip things in development that create an ongoing operational cost. We need to actually balance and say "If it's an extra week of development time, there is direct cost for the development and there is indirect cost in delaying revenues. We need to look at how much that cost is and we need to compare that to the total of the ongoing operations cost over the expected lifetime of the system."

If we do that we will make the decision differently in some cases and we'll make the decision the same way in some cases. By that I mean we'll sometimes choose to incur that ongoing operational cost and we'll sometimes choose to spend some additional development time to avoid the ongoing operations cost. One of the examples that I use when I talk about capacity is if you're handling say, web page requests and you have 1 million hits per day - 1 million hits per day is not all that large these days - and each one takes just an extra 250 milliseconds.

First of all, that's going to have an impact on your revenues, and companies like Google and Amazon have identified that very clearly, but secondly an extra 250 milliseconds on 1 million hits per day is about 70 hours of additional computing time, which means roughly you need 4 additional servers to handle the load. 4 additional servers draw power every month, they require administration every month, they may or may not require software licensing every month, they probably have support contracts. Once you get enough administrators, you need managers of administrators to keep the organization in check, so really, that 250 milliseconds per page that seems pretty small in development, translates into a pretty substantial ongoing operations cost.
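
The arithmetic behind those figures can be checked with a short calculation. The hit count and the 250 milliseconds come from the example above; the 75% target utilization used to get from compute hours to a server count is an assumption added here, since the interview does not say how the rounding to 4 servers was done.

```python
import math

HITS_PER_DAY = 1_000_000
EXTRA_SECONDS_PER_HIT = 0.250


def extra_compute_hours(hits_per_day, extra_seconds_per_hit):
    """Additional compute time per day, in hours."""
    return hits_per_day * extra_seconds_per_hit / 3600


def extra_servers(hits_per_day, extra_seconds_per_hit, target_utilization=0.75):
    """Servers needed to absorb the extra work, assuming each server is only
    run at `target_utilization` of its capacity (an assumed headroom figure)."""
    hours = extra_compute_hours(hits_per_day, extra_seconds_per_hit)
    fully_busy_servers = hours / 24  # one server supplies 24 compute-hours per day
    return math.ceil(fully_busy_servers / target_utilization)


print(f"{extra_compute_hours(HITS_PER_DAY, EXTRA_SECONDS_PER_HIT):.1f} extra compute-hours per day")
print(f"{extra_servers(HITS_PER_DAY, EXTRA_SECONDS_PER_HIT)} additional servers")
```

At 100% utilization the same load would fit on 3 servers; any realistic headroom pushes it to 4.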


13. You mentioned the ongoing operations cost of a system. With many systems that I've seen in the past, the operations cost can be very opaque or you're not certain what causes it. For instance, you have a certain number of servers, a certain amount of load and there isn't really the understanding or the visibility in the system that allows you to say "If we were to tweak this thing, we could go from 30 servers to 5". What are your recommendations for how you can look at a system and determine that and see where do you think there might be something that you can do, if you have an existing goldfish-like system?

I think it begins with thinning the barrier or tearing down the wall between development and operations. In many companies, particularly large companies, development and operations have completely different management chains; they have different imperatives operating on them and they may even have a history of conflict and antagonism - it's not uncommon at all.

We need to start getting some visibility into it and that works in both directions: we need developers to get visibility into how their systems behave in production, and we need operations to get visibility earlier in the development process, so that they can have an influence on the architecture and design of the systems. Too often development happens in a vacuum where the developers don't know what sort of machine they are going to be running on, maybe don't even know what size or capacity the machine is, and don't know where the files are going to live, or the difference between accessing a file from a local disk versus a SAN versus direct attached storage of some kind. We've created these layers of abstraction in the development world that we needed to create to get more leverage out of our platforms and environments, but it's also isolated us from the realities of where we run.

If you talked to somebody who was programming 30 years ago, they probably knew how many cycles each instruction they were writing was going to take to execute. How many cycles does it take to invoke a method call in a C# VM? Who even knows? How could you even find that out? We need some of the abstraction to deal with the complexity of the environments, but we need some visibility so that we know what we are causing.

If you don't have the visibility, you'll never have a closed feedback loop and without a closed feedback loop, the person creating the pain is never going to stop creating the pain. I think that's really where it has to begin. One interesting aspect is that people in the cloud computing community are starting to talk about private clouds quite a lot. In a private cloud you could envision the IT department running the cloud and the developers subscribing to it and being consumers of the service.

To me, this naturally creates a closed loop feedback system, because the developers are going to incur charges for their business unit based on the amount of resources that they consume. The business unit will see the charges instead of only IT seeing the charges. It's too often the case in most organizations that developers create applications that maybe don't perform well or consume more resource than they need to, but they never have to pay for that. IT pays for it and, when they need more, they ask for more in the giant IT budget and then the CIO gets beaten up by the CFO on why the IT budget is so large. The feedback is all wrong in that mechanism.


14. You very correctly pointed out the need to have the ops people and development people integrated in order to deal with some of these problems. Don't you also need to start thinking about the business people and their workflow and the kind of assumptions that they make about the way things should be and even the unspoken assumptions that IT makes about the need to have a relational database as opposed to formally and flat files distributed across the network - these kinds of assumptions, but most importantly, the business and their workflow and how that has an impact on performance and scaling and everything else? How do you bring them into the cycle as well?

That's an excellent point! I agree completely. I think with Agile development we've started doing a better job of bringing the business closer to development and gaining understanding of not just what are you telling me you need, but what do you actually need. Let's have a conversation instead of throwing spreadsheets back and forth to each other. I think we need to continue that integration through to operations, so I definitely do agree. We are starting to see some inklings of that coming out of the social network and the extremely high-scale startup space where we are talking about things like architecting to take advantage of latency instead of being victimized by it.

You can ask about a business requirement where we would normally assume that the instant some piece of content gets approved or some price change occurs or something like that, the very next millisecond, the web page should reflect that. That pushes you down the road towards complete transactionality and checking the database all the time or building elaborate cache flush mechanisms, all of which I've done and seen done many times. But it turns out that that price change, as a system event, may have taken only a few milliseconds to execute.

It was probably the result of a decision process that involved humans and phone calls and meetings that may have taken hours, days or weeks. After that length of time, is it really necessary for the web page to reflect the change one millisecond later? - Probably not. We can have more conversations with the business users about how long it is allowable to let something propagate through the system. How fast does this need to show up to people? Understanding that the smaller your answer is, the larger the costs are going to balloon out, we need to have more of those conversations. I think it's beginning and it's a healthy trend we should encourage.
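
One way to turn the agreed answer to "how fast does this need to show up?" into code is a cache whose entries expire after that agreed delay, instead of a flush mechanism tied to every change. This is a minimal sketch under that assumption: the class name, the 300-second figure and the price lookup are hypothetical.

```python
import time


class TtlCache:
    """Serve a cached value until it is older than the delay the business
    has agreed is acceptable, then reload it from the source."""

    def __init__(self, loader, max_age_s, clock=time.monotonic):
        self._loader = loader
        self._max_age_s = max_age_s
        self._clock = clock
        self._entries = {}  # key -> (value, time loaded)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[1] < self._max_age_s:
            return entry[0]
        value = self._loader(key)  # e.g. a database read
        self._entries[key] = (value, now)
        return value


# Hypothetical price source and a 5-minute allowable propagation delay.
prices = {"sku-1": 10.00}
cache = TtlCache(loader=prices.__getitem__, max_age_s=300)
print(cache.get("sku-1"))
```

The smaller `max_age_s` is made, the more often the source is hit, which is the cost trade-off described above.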

Aug 24, 2009