Agile, Architecture and the 5am Production Problem

Jun 25, 2007 16 min read

Introduction

I love the first line of the Agile Manifesto: "We are uncovering better ways of developing software by doing it and helping others do it." It captures the collective, collaborative journey that many of us have been on. Agile methods are evolving as we learn and discover. Most importantly, we have created these methods through the rough-and-tumble of real work. The practices we employ were not invented in sterile academic isolation, but in the muddy trenches of the working developer.

Agile methods tell us a lot about how to build software that meets its requirements and still changes easily over time. Probably the most influential technique to emerge is unit testing. Unit testing emphasizes isolation of the unit under test. Interfaces, with their perennially controversial mocks, stubs, and fakes, provide clean separation of each unit being tested.

The trouble is that problems emerge in the white spaces between units. They live in the connections (and gaps) between the boxes on the diagram, between the units that get tested. This is hardly news. The entire practice of functional testing exists precisely because an assembly of well-functioning units is not necessarily well-functioning as a whole.

Functional testing falls short, however, when you want to build software to survive the real world. Functional testing can only tell you what happens when all parts of the system are behaving within specification. True, you can coerce a system or subsystem into returning an error response, but that error will still be within the protocol! If you're calling a method on a remote EJB that either returns "true" or "false" or it throws an exception, that's all it will do. No amount of functional testing will make that method return "purple". Nor will any functional test case force that method to hang forever, or return one byte per second.
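The out-of-spec behavior described above can be provoked on purpose in a test harness. What follows is a minimal sketch, not anything from the original system: it assumes Python and a loopback socket, and the names `start_silent_server` and `call_with_timeout` are hypothetical. The "server" accepts a connection and then never answers, and the caller protects itself with a read timeout.

```python
import socket
import threading


def start_silent_server():
    """Listen on an ephemeral loopback port, accept connections, never reply."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    accepted = []  # keep references so accepted sockets stay open

    def accept_forever():
        while True:
            try:
                conn, _ = listener.accept()
            except OSError:
                return  # listener was closed
            accepted.append(conn)

    threading.Thread(target=accept_forever, daemon=True).start()
    return listener, listener.getsockname()[1]


def call_with_timeout(port, timeout_s):
    """Send a request and wait for a reply; return None if none arrives in time."""
    with socket.create_connection(("127.0.0.1", port), timeout=timeout_s) as client:
        client.sendall(b"isValid?\n")
        try:
            return client.recv(1024)
        except socket.timeout:
            return None  # the remote end hung; the caller survives
```

Without the timeout, the `recv` call would block for as long as the silent server stayed up, which is the failure mode a functional test never exercises.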

One of my recurring themes in "Release It" is that every call to another system, without exception, will someday try to kill your application. The threat usually comes from behavior outside the specification. When that happens, you must be able to peel back the layers of abstraction, tear apart the constructed fictions of "concurrent users", "sessions", and even "connections", and get at what's really happening.

I'll tell you a story about a vexing problem that could not be explained or solved up at Layer 7. Getting to the bottom of it took us right down to the wire and back again.

The 5 a.m. Problem

Software has to pass QA before you release it, but it lives and dies in production. Actually, it dies in production all too often. That's where I come in. I don't have a limp or a Vicodin habit, but other than that, I'm like Dr. Gregory House. When high-volume, transactional systems die, I get the strange cases that don't make sense. You won't see these cases on TV, but then again, I don't have Hugh Laurie's sexy misanthropic sneer, either.

House usually assumes the problem is either drugs or an infection, until some symptom points elsewhere. My first suspect is always an integration point. I've seen integration points cause more failures than anything else, and that includes shoddy coding. I contend that, sooner or later, every integration point in your system will fail somehow. It might refuse connections or return a partial response. It might answer with HTML when you're expecting XML. It might get really, really slow. And sometimes, it just won't answer at all.

I consider database calls as just a special case of an integration point. Databases are enormously complex, yet most developers just expect them to behave well. They take it for granted when the database chugs away, accepting any kind of Byzantine SQL you throw at it, returning tidy result sets. Most of the time, they give little thought to what goes on beneath the neat abstraction of a SQLConnection.

Don't be lulled into a false sense of security. Any call to a database can hang. I'm not just talking about inserts and updates that can deadlock. I'm not just talking about stored procedures. I saw a case where queries to a read-only database---one that nobody was even allowed to update---caused a recurring pattern of failure.

Get the Crash Cart! This Website Just Flatlined.

Downtime at the same time every day.

One of the sites I launched developed this very nasty pattern of hanging completely at almost exactly 5 a.m. every day. This was running on around thirty different instances, so something was happening to make all thirty different application server instances hang within a five-minute window (the resolution of our URL pinger). Restarting the application servers always cleared it up, so there was some transient effect that tipped the site over at that time. Unfortunately, that was just when traffic started to ramp up for the day. From midnight to 5 a.m., there were only about 100 transactions per hour of interest, but the numbers ramped up quickly once the East Coast started to come online (one hour ahead of us Central Time folks). Restarting all the application servers just as people started to hit the site in earnest was what you'd call a suboptimal approach.

On the third day this occurred, I took thread dumps from one of the afflicted application servers. The instance was up and running, but all request-handling threads were blocked inside the Oracle JDBC library, specifically inside of OCI calls. (We were using the thick-client driver for its superior failover features.) In fact, once I eliminated the threads that were just blocked trying to enter a synchronized method, it looked as if the active threads were all in low-level socket read or write calls.

The next step was tcpdump and ethereal. (Ethereal is now called Wireshark.) The odd thing was how little that showed. A handful of packets were being sent from the application servers to the database servers, but with no replies. Also nothing was coming from the database to the application servers. Yet, monitoring showed that the database was alive and healthy. There were no blocking locks, the run queue was at zero, and the I/O rates were trivial.

Packet Capture

From abstract to concrete.

Abstractions provide great conciseness of expression. We can go much faster when we talk about fetching a document from a URL than if we have to discuss the tedious details of connection setup, packet framing, acknowledgments, receive windows, and so on. With every abstraction, however, there comes a time when you must peel the onion, shed some tears, and see what's really going on---usually when something is going wrong. Whether for problem diagnosis or performance tuning, packet capture tools are the only way to understand what is really happening on the network.

tcpdump is a common UNIX tool for capturing packets from a network interface. Running it in "promiscuous" mode instructs the network interface card (NIC) to receive all packets that cross its wire---even those addressed to other computers. In a data center, the NIC is almost certainly connected to a switch port that is assigned to a virtual LAN (VLAN). In that case, the switch guarantees that the NIC receives packets bound for addresses only in that VLAN. This is an important security measure, because it prevents bad guys from doing exactly what we're doing: sniffing the wire to look for "interesting" bits of information.

Wireshark is a combination sniffer and protocol analyzer. It can sniff packets on the wire, as tcpdump does. Wireshark goes farther, though, by unpacking the packets for us. Through its history, Wireshark has experienced some security flaws---some trivial, some serious. At one point, a specially crafted packet sent across the wire (by a piece of malware on a compromised desktop machine, for example) could trigger a buffer overflow and execute arbitrary code of the attacker's choice. Since Wireshark must run as root to put the NIC into promiscuous mode---as any packet capture utility must---that exploit allowed the attacker to gain root access on a network administrator's machine.

In addition to the security issues, Wireshark is a big, heavy GUI program. On UNIX, it requires a bunch of X libraries, which might not even be installed on a headless system. On any host, it takes up a lot of RAM and CPU cycles to parse and display the packets. That is a burden that should not be on the production servers. For these reasons, it is best to capture packets noninteractively using tcpdump and then move the capture file to a nonproduction environment for analysis.
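As a rough illustration of the noninteractive capture recommended above, here is a small helper that assembles a tcpdump command line. The helper name, interface, file path, host address, and port are all assumptions made for the example. The `-i`, `-c`, and `-w` options and the `host ... and port ...` filter are standard tcpdump usage, but check the man page on your platform before relying on them.

```python
def build_capture_command(interface, capture_file, db_host, db_port, max_packets=10000):
    """Return a tcpdump argv that writes raw packets to a file for later analysis."""
    return [
        "tcpdump",
        "-i", interface,         # network interface to capture on
        "-c", str(max_packets),  # stop after this many packets
        "-w", capture_file,      # write raw packets to a file instead of decoding them
        "host", db_host, "and", "port", str(db_port),  # capture filter expression
    ]


# Hypothetical values: capture traffic to one database listener on eth0.
command = build_capture_command("eth0", "/tmp/db-hang.pcap", "192.0.2.10", 1521)
print(" ".join(command))
```

The resulting capture file can then be copied to a workstation and opened in Wireshark there, which keeps the heavy analysis off the production host.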

The screenshot below shows a capture from my home network. The first packet shows an Address Resolution Protocol (ARP) request. This happens to be a question from my wireless bridge to my cable modem. The next packet was a surprise: an HTTP query to Google, asking for a URL called /safebrowsing/lookup with some query parameters. The next two packets show a DNS query and response, for the "michaelnygard.dyndns.org" hostname. Packets five, six, and seven are the three-way handshake for a TCP connection setup. We can trace the entire conversation between my web browser and server. Note that the pane below the packet trace shows the layers of encapsulation that the TCP/IP stack created around the HTTP request in the second packet. The outermost layer is an Ethernet frame. The Ethernet frame contains an IP packet, which in turn contains a TCP segment. Finally, the payload of the TCP segment is an HTTP request. The exact bytes of the entire packet appear in the third pane.
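The layers of encapsulation described above can be peeled apart by hand. The sketch below is illustrative only: it builds a synthetic Ethernet/IPv4/TCP frame around an HTTP request (made-up MAC addresses, zeroed checksums, no options) and then unwraps it layer by layer. The function names are hypothetical, and a real analyzer handles far more cases.

```python
import socket
import struct


def build_frame(payload):
    """Wrap payload in minimal TCP, IPv4, and Ethernet headers (checksums left at 0)."""
    tcp = struct.pack("!HHIIBBHHH", 32770, 80, 0, 0, 5 << 4, 0x18, 65535, 0, 0)
    ip = struct.pack(
        "!BBHHHBBH4s4s",
        0x45, 0, 20 + len(tcp) + len(payload), 0, 0, 64, 6, 0,
        socket.inet_aton("192.0.2.98"), socket.inet_aton("192.168.1.199"),
    )
    eth = struct.pack("!6s6sH", b"\x02\x00\x00\x00\x00\x02", b"\x02\x00\x00\x00\x00\x01", 0x0800)
    return eth + ip + tcp + payload


def unwrap(frame):
    """Peel Ethernet, then IPv4, then TCP, and return the addressing plus payload."""
    _dst_mac, _src_mac, ethertype = struct.unpack("!6s6sH", frame[:14])
    if ethertype != 0x0800:
        raise ValueError("not an IPv4 frame")
    ip = frame[14:]
    ip_header_len = (ip[0] & 0x0F) * 4   # IHL field counts 32-bit words
    if ip[9] != 6:
        raise ValueError("not a TCP segment")
    tcp = ip[ip_header_len:]
    src_port, dst_port = struct.unpack("!HH", tcp[:4])
    tcp_header_len = (tcp[12] >> 4) * 4  # data offset also counts 32-bit words
    return {
        "src": (socket.inet_ntoa(ip[12:16]), src_port),
        "dst": (socket.inet_ntoa(ip[16:20]), dst_port),
        "payload": tcp[tcp_header_len:],
    }


layers = unwrap(build_frame(b"GET / HTTP/1.1\r\nHost: example.com\r\n\r\n"))
print(layers["src"], "->", layers["dst"], layers["payload"][:14])
```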

I strongly recommend keeping a copy of Kozierok's "The TCP/IP Guide" or W. Richard Stevens' "TCP/IP Illustrated" open beside you for this type of activity!

Repetition and Paranoia

Understanding the root cause behind the crashes

By this time, we had to restart the application servers. Our first priority is restoring service. We do data collection when we can, but not at the risk of breaking an SLA. Any deeper investigation would have to wait until it happened again. None of us doubted that it would happen again.

Sure enough, the pattern repeated itself the next morning. Application servers locked up tight as a drum, with the threads inside the JDBC driver. This time, I was able to look at traffic on the databases' network. Zilch. Nothing at all. The utter absence of traffic on that side of the firewall was like Sherlock Holmes' dog that didn't bark in the night---the absence of activity was the biggest clue. I had a hypothesis. Quick decompilation of the application server's resource pool class confirmed that my hypothesis was plausible.

I said before that socket connections are an abstraction. They exist only as objects in the memory of the computers at the endpoints. Once established, a TCP connection can exist for days without a single packet being sent by either side. As long as both computers have that socket state in memory, the "connection" is still valid. Routes can change, and physical links can be severed and reconnected. It doesn't matter; the "connection" persists as long as the two computers at the endpoints think it does.

There was a time when that all worked beautifully well. These days, a bunch of paranoid little bastions have broken the philosophy and implementation of the whole Net. I'm talking about firewalls, of course.

A firewall is nothing but a specialized router. It routes packets from one set of physical ports to another. Inside each firewall, a set of access control lists defines the rules about which connections it will allow. The rules say such things as "connections originating from 192.0.2.0/24 to 192.168.1.199 port 80 are allowed." When the firewall sees an incoming SYN packet, it checks it against its rule base. The packet might be allowed (routed to the destination network), rejected (TCP reset packet sent back to origin), or ignored (dropped on the floor with no response at all). If the connection is allowed, then the firewall makes an entry in its own internal table that says something like "192.0.2.98:32770 is connected to 192.168.1.199:80." Then all future packets, in either direction, that match the endpoints of the connection are routed between the firewall's networks.
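A toy model makes the rule check and the connection table concrete. This is a deliberately simplified sketch with hypothetical class and method names. It models only "allowed" and "silently dropped", leaves out the TCP reset case, and bears no relation to any particular firewall product.

```python
import ipaddress


class ToyFirewall:
    def __init__(self, rules):
        # Each rule: (allowed source network, destination address, destination port).
        self.rules = [
            (ipaddress.ip_network(src), ipaddress.ip_address(dst), port)
            for src, dst, port in rules
        ]
        self.connections = set()  # established (src_ip, src_port, dst_ip, dst_port)

    def handle_syn(self, src_ip, src_port, dst_ip, dst_port):
        """Check a new connection attempt against the rule base."""
        src, dst = ipaddress.ip_address(src_ip), ipaddress.ip_address(dst_ip)
        for network, rule_dst, rule_port in self.rules:
            if src in network and dst == rule_dst and dst_port == rule_port:
                self.connections.add((src_ip, src_port, dst_ip, dst_port))
                return "allowed"
        return "dropped"

    def forwards(self, src_ip, src_port, dst_ip, dst_port):
        """Later packets pass only if they match a table entry in either direction."""
        forward = (src_ip, src_port, dst_ip, dst_port)
        reverse = (dst_ip, dst_port, src_ip, src_port)
        return forward in self.connections or reverse in self.connections


firewall = ToyFirewall([("192.0.2.0/24", "192.168.1.199", 80)])
print(firewall.handle_syn("192.0.2.98", 32770, "192.168.1.199", 80))  # allowed
print(firewall.forwards("192.168.1.199", 80, "192.0.2.98", 32770))    # True
```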

So far, so good. How is this related to my 5 a.m. wake-up calls?

The key is that table of established connections inside the firewall. It's finite. Therefore, it does not allow infinite duration connections, even though TCP itself does allow them. Along with the endpoints of the connection, the firewall also keeps a "last packet" time. If too much time elapses without a packet on a connection, the firewall assumes that the endpoints are dead or gone. It just drops the connection from its table. But, TCP was never designed for that kind of intelligent device in the middle of a connection. There's no way for a third party to tell the endpoints that their connection is being torn down. The endpoints assume their connection is valid for an indefinite length of time, even if no packets are crossing the wire.
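The "last packet" bookkeeping can be sketched in a few lines. The class below is a hypothetical illustration: it uses simulated timestamps in seconds and expires entries lazily when the next packet arrives, whereas a real firewall would use its own timers. The point is that an expired entry simply vanishes and nobody tells the endpoints.

```python
class ConnectionTable:
    def __init__(self, idle_timeout_s):
        self.idle_timeout_s = idle_timeout_s
        self.last_packet = {}  # connection key -> time the last packet was seen

    def open(self, key, now):
        """Record a newly allowed connection."""
        self.last_packet[key] = now

    def forward(self, key, now):
        """Return True if the packet is routed, False if it is silently dropped."""
        last = self.last_packet.get(key)
        if last is None:
            return False  # no entry: the packet falls on the floor
        if now - last > self.idle_timeout_s:
            del self.last_packet[key]  # entry expired; the endpoints are never told
            return False
        self.last_packet[key] = now  # any traffic resets the idle clock
        return True


table = ConnectionTable(idle_timeout_s=3600)
table.open("app1->db", now=0)
print(table.forward("app1->db", now=1800))  # True: still within the idle window
print(table.forward("app1->db", now=5500))  # False: idle for 3700 seconds
```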

After that point, any attempt to read or write from the socket on either end does not result in a TCP reset or an error due to a half-open socket. Instead, the TCP/IP stack sends the packet, waits for an ACK, doesn't get one, and retransmits. The faithful stack tries and tries to reestablish contact, and that firewall just keeps dropping the packets on the floor, without so much as an "ICMP destination unreachable" message. (That could let bad guys probe for active connections by spoofing source addresses.) My Linux system, running on a 2.6 series kernel, has its tcp_retries2 set to the default value of 15, which results in a twenty-minute timeout before the TCP/IP stack informs the socket library that the connection is broken. The HP-UX servers we were using at the time had a thirty-minute timeout. That application's one-line call to write to a socket could block for thirty minutes! The situation for reading from the socket is even worse. It could block forever.
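To see how a retry count turns into minutes, here is a back-of-the-envelope calculation. It assumes the retransmission timer doubles after every unanswered send and is capped at 120 seconds, which is how Linux behaves as far as I know. The starting timer depends on the measured round-trip time, so the two starting values below are illustrative assumptions, and the function name is hypothetical.

```python
def retransmission_timeout_s(retries, initial_rto_s, max_rto_s=120.0):
    """Total wait before giving up: the first send plus `retries` retransmissions,
    each followed by a timer that doubles until it reaches max_rto_s."""
    return sum(
        min(initial_rto_s * 2 ** attempt, max_rto_s)
        for attempt in range(retries + 1)
    )


fast_link = retransmission_timeout_s(retries=15, initial_rto_s=0.2)
slow_link = retransmission_timeout_s(retries=15, initial_rto_s=1.0)
print(round(fast_link / 60, 1), "minutes")  # roughly 15.4
print(round(slow_link / 60, 1), "minutes")  # roughly 20.1
```

Under these assumptions the wait lands somewhere between fifteen and twenty minutes, the same order of magnitude as the figures above. Either way, it is far longer than any request-handling thread should sit blocked.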

When I decompiled the resource pool class, I saw that it used a last-in, first-out strategy. During the slow overnight times, traffic volume was light enough that one single database connection would get checked out of the pool, used, and checked back in. Then the next request would get the same connection, leaving the thirty-nine others to sit idle until traffic started to ramp up. They were idle well over the one-hour idle connection timeout configured into the firewall.

Once traffic started to ramp up, those thirty-nine connections per application server would get locked up immediately. Even if the one connection was still being used to serve pages, sooner or later it would be checked out by a thread that ended up blocked on a connection from one of the other pools. Then the one good connection would be held by a blocked thread. Total site hang.
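A small simulation shows why the checkout order matters. It is a sketch under stated assumptions: forty connections per pool, one request every 36 seconds (about 100 per hour), five quiet hours, and a one-hour firewall idle timeout. Requests are handled one at a time, and the function name is hypothetical. The first-in, first-out variant is included only for contrast and is not something the original pool offered, as far as the article says.

```python
from collections import deque


def stale_connections(strategy, pool_size=40, request_interval_s=36,
                      quiet_period_s=5 * 3600, firewall_idle_timeout_s=3600):
    """Count connections idle longer than the firewall timeout after a quiet period."""
    last_used = {conn: 0 for conn in range(pool_size)}
    pool = deque(range(pool_size))
    for now in range(request_interval_s, quiet_period_s + 1, request_interval_s):
        # LIFO takes the most recently returned connection; FIFO takes the oldest.
        conn = pool.pop() if strategy == "lifo" else pool.popleft()
        last_used[conn] = now
        pool.append(conn)  # check the connection back in
    return sum(
        1 for used_at in last_used.values()
        if quiet_period_s - used_at > firewall_idle_timeout_s
    )


print(stale_connections("lifo"))  # 39: one connection did all the overnight work
print(stale_connections("fifo"))  # 0: every connection was used within the hour
```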

Dead Connection Walking

Finding a solution

Once we understood all the links in that chain of failure, we had to find a solution. The resource pool has the ability to test JDBC connections for validity before checking them out. It checked validity by executing a SQL query like SELECT SYSDATE FROM DUAL. Well, that would just make the request-handling thread hang anyway. We could also have the pool keep track of the idle time of the JDBC connection and discard any that were older than one hour. Unfortunately, that involves sending a packet to the database server to tell it that the session is being torn down. Hang.

We were starting to look at some really hairy complexities, such as creating a "reaper" thread to find connections that were close to getting too old and tearing them down before they timed out. Fortunately, a sharp DBA recalled just the thing. Oracle has a feature called dead connection detection that you can enable to discover when clients have crashed. When enabled, the database server sends a ping packet to the client at some periodic interval. If the client responds, then the database knows it is still alive. If the client fails to respond after a few retries, the database server assumes the client has crashed and frees up all the resources held by that connection.

We weren't that worried about the client crashing, but the ping packet itself would be enough to reset the firewall's "last packet" time for the connection, keeping the connection alive. Dead connection detection kept the connection alive, which let me sleep through the night.
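Oracle's dead connection detection is configured on the database side, so it is not shown here. A related technique at the operating-system level is TCP keepalive, which has the same side effect of putting periodic packets on an otherwise idle connection. The sketch below is that swapped-in technique, not the fix we used. The helper name and timing values are assumptions, and the three tuning options are platform-specific, which is why they are guarded.

```python
import socket


def enable_keepalive(sock, idle_s, interval_s, probes):
    """Ask the OS to probe an idle connection so middleboxes keep seeing traffic."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # These tuning knobs exist on Linux; other platforms may name them differently.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_s)       # idle time before first probe
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval_s)  # gap between probes
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)        # failed probes before giving up


# Probe after ten idle minutes, comfortably inside a one-hour firewall timeout.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
enable_keepalive(sock, idle_s=600, interval_s=60, probes=5)
sock.close()
```

Whatever mechanism generates the traffic, the interval has to be shorter than the firewall's idle timeout, or the table entry will still expire between probes.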

Lesson Learned?

We would never have written a unit test to simulate a database call hanging indefinitely inside the TCP/IP protocol itself. Why would you? Worse, there are an unlimited number of ways for fallible networks, servers, and applications to generate these "out of spec" failures. So what can we do? Is there a missing practice in the agile world? Is there some testing technique or coding practice that we can employ to avoid this type of failure?

At one time, no one would have envisioned programmers testing their own code. Twenty years ago, the very idea was laughable. Today, it is expected... sometimes demanded. A growing number of us even define "legacy code" by Michael Feathers' definition: code without unit tests. Perhaps someone will invent a testing technique to ward off the myriad failures that can bubble up when foundational abstractions misbehave.

Until then, I contend that we need to consider the architecture, even in agile projects, and guard ourselves against this kind of failure. We must apply design patterns for resilience, just as we apply them for functionality. I have created a set of such "Stability Patterns" in "Release It". I hope these are just the beginning.

About the author

Michael strives to raise the bar and ease the pain for developers across the country. He shares his passion and energy for improvement with everyone he meets, sometimes even with their permission. Michael has spent the better part of 20 years learning what it means to be a professional programmer who cares about art, quality, and craft. Michael has been a professional programmer and architect for nearly 20 years. During that time, he has delivered running systems to the U.S. government, the military, banking, finance, agriculture, and retail industries. More often than not, Michael has lived with the systems he built. This experience with the real world of operations changed his views about software architecture and development forever.

Most recently, Michael wrote "Release It! Design and Deploy Production-Ready Software", a book that realizes many of his thoughts about building software that does more than just pass QA: it survives the real world. After its release, the book was Amazon's #1 "Hot New Release" in the Software Design category for over a month. Michael previously wrote numerous articles and editorials, spoke at Comdex, and co-authored one of the early Java books.

Adapted with permission from Release It! Design and Deploy Production-Ready Software, by Michael T. Nygard

Copyright 2007 Michael T. Nygard, published by the Pragmatic Bookshelf, ISBN 0-9787392-1-3.
Book available in paper and PDF from www.PragmaticProgrammer.com
