Book Excerpt and Review: Release It!
This book comes from my extensive experience living with systems in production. I've often been the one to get woken up at three in the morning when some supposedly 24x7 system goes down.
Other books on design and architecture only tell you how to meet functional requirements. They help your software pass QA. In "Release It!", I'll show you how to make your software production ready. If you don't want to wear an electronic leash, you need this book.
Nygard also has several excerpts from the book available, including Sample Patterns and Sample Anti-Patterns, and Nygard's InfoQ article The 5 a.m. Production Problem was an adaptation from this book.
InfoQ spoke with Nygard about the areas that the book covers and some questions around how the book's philosophy fits in with concepts such as Agile:
InfoQ: What is the difference between "production-ready" software and feature-complete software?
Michael Nygard: First off, there's quite a bit of variation in what people mean by "feature complete". Even at best, it just means that all the specified functionality for a release has passed functional testing. For an agile team, it should also mean that all the acceptance tests pass. In some cases, though, all it means is that the developers finished their first draft of the code and threw it over the wall to the testers.
"Production ready" is orthogonal to "feature complete". Whether the acceptance tests pass or the testers give it a green check mark tells me nothing about how well the system as a whole is going to hold up under the stresses of real-world, every day use. Could be horrible, could be great.
For example, does it have a memory leak? Nobody actually runs a test server in the QA environment for a week or a month at a time, under realistic loads. We're lucky to get a week of testing, total, let alone a week just for longevity testing. So, passing QA doesn't tell me anything about memory leaks. It's very easy for memory leaks to get into production. Well, now that creates an operational problem, because the applications will have to be restarted regularly. Every memory leak I've seen is based on traffic, so the more traffic you get, the faster you leak memory. That means that you can't even predict when you'll have to restart the applications. It might be the middle of the busiest hour on your busiest day. Actually, it's pretty likely to happen during the busiest (i.e., the worst) times.
Another aspect of production-readiness is resilience to what I call "transient impulses". These are short shocks to the system. For example, the surges in traffic when your site hits the front page of Digg.com or a mispriced item shows up on FatWallet.com. Many three-tier web apps do really badly when they have to create a huge number of sessions all at once. Lost connectivity to the database or another back-end system is another kind of transient impulse. This is kind of a question about how quickly the system breaks when things start to degrade.
I live in Minneapolis, Minnesota. The freeway system here is marginal even in the best times. It's slow, but functional. You can get where you're going without the crazy commutes you see around Washington DC or the Bay Area in California. But, you know, it's Minnesota. It snows here a few times a year. When it does, the freeway system gets incredibly bad, incredibly quickly. Basically, every freeway comes to a standstill simultaneously. You can imagine a more robust mesh of freeways that would hold up better, like one inch of snow means everything moves 10% slower, instead of 200% slower.
You should ask yourself the same question about your systems. Do they only hold up in sunshine? Or will they keep working when the snow flies and some back-end service suddenly takes a minute to respond instead of 250 milliseconds?
InfoQ: What are some of the major issues encountered when trying to make feature-complete software production-ready?
Michael Nygard: I think it begins with awareness. If you're an architect or a developer, you need to design good failure modes into your systems, just like automobile engineers design the crumple zones into a car. Sooner or later, every single box and arrow on your architecture diagram will go wonky. Guaranteed. So, it's your duty to make sure you can preserve as much functionality as possible when it does.
The other big challenge I see is in testing. Some of these problems are incredibly hard to test for. I tell a story in the book about a crash that brought a multinational corporation to a halt, in a very publicly visible way. Huge consequences. It started with a tiny interaction between some well-written exception handling code and a really obscure behavior in Oracle's JDBC driver that only triggered after a cluster failover and virtual IP address handoff. We could see it in retrospect, and we were even able to replicate it in staging, once we knew what to look for. But nobody, I mean _nobody_ would have predicted it in advance. It's simply not possible to think of, let alone test for, the combinatorial multiplicity of interactions within a system of any meaningful size.
InfoQ: Can you give a couple of examples of big systems which were launched when not production-ready, and what the failures were?
InfoQ: What are some common problems encountered with application stability, and how can they be resolved?
Michael Nygard: I go into a lot of the specific problems, which I call Stability Antipatterns, in the book. To counter them, I introduce a set of Stability Patterns that are good at addressing classes of problems.
The meta-problem, if you will, though, is graceful degradation. It comes from looking at a system as a single, sort of monolithic, entity. From the users' perspective, and from the sponsors' perspective, the "system" is really a collection of interrelated features. Some features are more important than others, so you should work to preserve as many of the important features as you can, while jettisoning features that misbehave in some destabilizing way.
For instance, if you asked, "Should customers still be able to book a room on the hotel site when we can't show local restaurant listings for that location?" you'd probably get laughed at. Those features don't seem remotely connected! Why should you stop taking reservations just because some third-party local search function isn't working? Ah, but those features are coupled, because both kinds of page requests are served by the same request-handling thread pools in the same app servers. So if that local search service gets slow, or stops responding altogether, and you let all your request-handling threads block forever, waiting for a response, then you've just allowed a non-essential feature to take down the core of your business.
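To make the coupling concrete, here is a minimal Python sketch of one way to keep a slow non-essential call from tying up the booking path: run it on its own small thread pool and stop waiting after a timeout. The function names, pool size, and timeout value are all hypothetical and chosen for illustration; this is not code from the book.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

# Hypothetical: a small, dedicated pool for the non-essential feature, so slow
# restaurant lookups cannot use up the threads that serve reservations.
restaurant_pool = ThreadPoolExecutor(max_workers=2)


def fetch_restaurants(location, delay_s):
    """Stand-in for the third-party local search; delay_s simulates latency."""
    time.sleep(delay_s)
    return [f"Cafe near {location}"]


def restaurants_or_fallback(location, delay_s, timeout_s=0.2):
    """Wait at most timeout_s for listings, then degrade to an empty list."""
    future = restaurant_pool.submit(fetch_restaurants, location, delay_s)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        future.cancel()  # only has an effect if the task has not started yet
        return []


def render_booking_page(location, search_delay_s):
    """Booking stays available whether or not the listings arrive in time."""
    return {
        "can_book": True,
        "restaurants": restaurants_or_fallback(location, search_delay_s),
    }
```

Calling `render_booking_page("Minneapolis", 1.0)` returns a page with an empty restaurant list instead of blocking the request. Note that the worker thread in the small pool is still occupied until the slow call finishes, so in a real system the remote call itself would also need its own timeout.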
InfoQ: What are some common problems encountered with application capacity, and how can they be resolved?
Michael Nygard: There are a bunch of specific problems that I commonly see, which I wrote up as the Capacity Antipatterns. They mostly spring from architects and developers who don't think about multiplier effects.
I'll give you an example. Every retailer has some kind of category structure to organize their product catalog. That structure shows up in the navigation of the site, usually as some kind of menu. Any time I see that, I ask a couple of questions, "How often does the category structure change?" and "How often is it being displayed?" The answers are always something like, "Once a month" and "Four hundred times a second". So why render that dynamically? Just in case it changes and you need to show the new structure seven nanoseconds later? It doesn't make sense. Far better to render the menus into HTML once during content publishing and just spool out that fragment on each page display. (Better still to push it out into Akamai and let _them_ stuff it into your pages through edge script.)
So, any place your system spends resources, ask yourself what the multiplier effects are.
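The category menu example can be sketched in a few lines of Python. This is an illustration under assumed names (`CategoryMenu`, `publish`, `render_page` are hypothetical), showing the menu rendered into an HTML fragment once at publish time and reused on every page display.

```python
import html


class CategoryMenu:
    """Holds a pre-rendered menu fragment; re-renders only when published."""

    def __init__(self):
        self._fragment = ""
        self.render_count = 0  # counts how often the menu was actually rendered

    def publish(self, categories):
        """Called about once a month, when the category structure changes."""
        items = "".join(f"<li>{html.escape(name)}</li>" for name in categories)
        self._fragment = f"<ul>{items}</ul>"
        self.render_count += 1

    def fragment(self):
        """Called hundreds of times a second; just returns the stored string."""
        return self._fragment


def render_page(menu, body):
    """Spools the cached fragment into the page instead of rebuilding it."""
    return f"{menu.fragment()}<main>{body}</main>"
```

With this shape, the rendering cost is multiplied by the publish rate rather than by the page-view rate.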
InfoQ: Aside from stability and capacity, what other sorts of problems can occur?
Michael Nygard: I commonly see two other issues: systems that are opaque or rigid. An opaque system is like a sick goldfish. It's either going to live or die, and there's not a darn thing you can do about it. The system doesn't tell you how it's doing, whether it's healthy or not.
Fortunately, this is changing. I see a lot more awareness of monitoring, especially with free software like Nagios and Zenoss.
Rigid systems are ones that can't evolve, or only do so with difficulty and downtime. Systems that require me to reboot the world to deploy code or content. Ones that are version-locked with half of the enterprise, so I have to schedule a massive "upgrade the company" day, with its attendant high risk.
InfoQ: How does operations fit into production-ready software? Are they involved in the development process, do they help with the specifications, do they get trained when the product is near release, etc?
Michael Nygard: Operations is integral to creating production-ready software. There's a tendency in development and architecture to stay too abstract, for too long. I think there's great value in getting concrete about your deployment architecture, and Operations can help do that. Work out your directory structures. How are you going to do code releases in a way that enables easy rollback? How can you push the code separately from activating it? There are simple solutions to each of these that just involve a bit of communication and negotiation with operations.
As an example of one possible solution, if you're on UNIX, you can use directories named for the releases, with a symlink pointing to the current release. So if version 1.5.2 of the store is current, you might have a "store_1_5_2" directory with a symlink called "store" pointing to it. Then, when you want to push out the code for version 1.6.0, you deploy it to "store_1_6_0" where it sits waiting until Operations updates the symlink and bounces the app server. It's not hard to see how you could build this mechanism right into the build process.
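The symlink scheme described above might look like the following Python sketch. The `activate_release` helper and the temporary link name are assumptions for illustration; the directory and link names follow the example in the text. It builds the new symlink under a temporary name and then renames it over the old one, which on POSIX systems swaps the link in a single step. Bouncing the app server is left out.

```python
import os
import tempfile


def activate_release(base_dir, release_name, link_name="store"):
    """Point base_dir/link_name at base_dir/release_name."""
    target = os.path.join(base_dir, release_name)
    if not os.path.isdir(target):
        raise FileNotFoundError(target)  # refuse to activate a missing release
    link = os.path.join(base_dir, link_name)
    tmp_link = link + ".tmp"
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)  # clean up a leftover from an interrupted run
    os.symlink(release_name, tmp_link)  # relative link, resolved inside base_dir
    os.replace(tmp_link, link)  # rename over the old link (atomic on POSIX)


def demo():
    """Deploy 1.5.2, activate 1.6.0, then roll back; return each link target."""
    base = tempfile.mkdtemp()
    for name in ("store_1_5_2", "store_1_6_0"):
        os.mkdir(os.path.join(base, name))  # code is pushed before activation
    seen = []
    for name in ("store_1_5_2", "store_1_6_0", "store_1_5_2"):
        activate_release(base, name)
        seen.append(os.readlink(os.path.join(base, "store")))
    return seen
```

Rollback is the same operation as release: call `activate_release` again with the previous directory name.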
The other thing to consider is that Operations already has a number of systems that you want to enable. I'm sure they've got some kind of monitoring solution in place, so you need to make your system work well with it. They may even have a CMDB that tracks versions and dependencies among applications.
Ultimately, the better you can support Operations, the better they can support your system. You really want to enable that, if you like sleeping through the night.
InfoQ: There seems to be a tension between the up-front work involved in creating production-ready software and the Agile idea that you do something only when you need to, and refactor as necessary - what are your thoughts on this?
Michael Nygard: As an agile developer, I struggle with this tension myself. I don't have a perfect answer for resolving it, but I think there's a parallel here to good object-oriented design.
Once you've written some code and some unit tests that pass, regardless of which came first, you refactor the code to improve the design. "Improve". Well, what does it mean to improve the design? Doesn't that mean you have to have some notion of "better" and "worse" as it applies to OO design? It does, and that's where Martin Fowler's "code smells" from Refactoring come in. "Code smells" are a qualitative way to talk about better and worse design without getting all hung up on metrics.
I think there's something similar for the architecture. For me, a remote call without a timeout is an "architecture smell". So is a SOAP call or a REST GET that tries to fetch all orders for a customer, without applying a limit.
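As an illustration of removing the second smell, here is a small Python sketch of a bounded fetch using the standard `sqlite3` module. The `orders` table, the `fetch_orders` function, and the `MAX_PAGE_SIZE` cap are all hypothetical; the point is only that the caller cannot ask for an unlimited result set.

```python
import sqlite3

MAX_PAGE_SIZE = 100  # hypothetical server-side cap on any single fetch


def make_demo_db(order_count, customer_id=7):
    """Build an in-memory database with order_count orders for one customer."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER)"
    )
    conn.executemany(
        "INSERT INTO orders (customer_id) VALUES (?)",
        [(customer_id,)] * order_count,
    )
    return conn


def fetch_orders(conn, customer_id, limit=50, offset=0):
    """Return (order_ids, has_more), never more than MAX_PAGE_SIZE ids."""
    limit = max(1, min(limit, MAX_PAGE_SIZE))  # clamp whatever the caller asks
    rows = conn.execute(
        "SELECT id FROM orders WHERE customer_id = ? "
        "ORDER BY id LIMIT ? OFFSET ?",
        (customer_id, limit + 1, offset),  # one extra row signals more pages
    ).fetchall()
    has_more = len(rows) > limit
    return [row[0] for row in rows[:limit]], has_more
```

A caller that asks for 1000 orders still gets at most 100, plus a flag saying whether another page exists.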
So, while I do not subscribe to big design up front, or "big architecture up front", I do believe in defining the boundaries within the system, designing failure modes into it, and eliminating "architecture smells" as we encounter them.
InfoQ: Do you see software or APIs (e.g. Hibernate Shards, Puppet) as aiding in creating production-ready software? If not, do you think that such software can be created, or is this strictly an architecture or learning problem?
Michael Nygard: I suppose this would be an ideal time to announce a product. I haven't got one. I think it's orthogonal. Framework developers, product vendor developers, application developers... we're all just developers. I see the same variation in production-safety in vendor code and framework code that I see in application code. So, there's nothing automatic about framework code.
The right kind of framework, well-written, can improve production readiness, the same way that Doug Lea's concurrency library vastly improved Java thread safety and concurrency. But, ultimately, we have to gain confidence in products and frameworks. Proven is better than shiny, open is better than closed, and diverse, real-world, production longevity trumps everything else.
Re: Superb book
An excellent book indeed, and a needed one. I really think "the industry" (i.e. we developers) needs to put more focus on how applications behave in production.
Maybe part of the problem is also that many developers will never get awakened at 3am anyway, because operations will be someone else's job, and they don't get a chance to learn much about this very important aspect of software development.
I think it would be interesting to investigate "organisational" patterns as well that can help improve application robustness, in addition to the technical design patterns.