Agile, Architecture and the 5am Production Problem
I love the first line of the Agile Manifesto: "We are uncovering better ways of developing software by doing it and helping others do it." It captures the collective, collaborative journey that many of us have been on. Agile methods are evolving as we learn and discover. Most importantly, we have created these methods through the rough-and-tumble of real work. The practices we employ were not invented in sterile academic isolation, but in the muddy trenches of the working developer.
Agile methods tell us a lot about how to build software that meets its requirements and still changes easily over time. Probably the most influential technique to emerge is unit testing. Unit testing emphasizes isolation of the unit under test. Interfaces, with their perennially controversial mocks, stubs, and fakes, provide clean separation of each unit being tested.
The trouble is that problems emerge in the white spaces between units. They exist in the connections, and the gaps, between the boxes on the diagram, between the units that get tested. This is hardly news. The entire practice of functional testing exists precisely because an assembly of well-functioning units is not necessarily well-functioning as a whole.
Functional testing falls short, however, when you want to build software to survive the real world. Functional testing can only tell you what happens when all parts of the system are behaving within specification. True, you can coerce a system or subsystem into returning an error response, but that error will still be within the protocol! If you're calling a method on a remote EJB that either returns "true" or "false" or it throws an exception, that's all it will do. No amount of functional testing will make that method return "purple". Nor will any functional test case force that method to hang forever, or return one byte per second.
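A functional test cannot produce that behavior, but a purpose-built test double can. What follows is a minimal sketch in Java, not something taken from the original project. SlowDripServer and DeadlineClient are hypothetical names, and the byte counts and delays are arbitrary. The server answers one byte at a time with a pause after each byte. The client therefore cannot rely on a per-read timeout alone, because every individual read succeeds quickly, so it enforces an overall deadline instead.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Hypothetical test double: a server that answers one byte at a time,
// pausing after each byte.
class SlowDripServer implements AutoCloseable {
    private final ServerSocket listener = openListener();
    private final Thread worker;

    SlowDripServer(int totalBytes, long pauseMillis) {
        worker = new Thread(() -> drip(totalBytes, pauseMillis));
        worker.setDaemon(true);
        worker.start();
    }

    private static ServerSocket openListener() {
        try {
            // Port 0 lets the OS pick a free port on the loopback interface.
            return new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private void drip(int totalBytes, long pauseMillis) {
        try (Socket peer = listener.accept(); OutputStream out = peer.getOutputStream()) {
            for (int i = 0; i < totalBytes; i++) {
                out.write('x');
                out.flush();
                Thread.sleep(pauseMillis);
            }
        } catch (IOException | InterruptedException e) {
            // Listener closed or client gone: the test is over, nothing to do.
        }
    }

    int port() {
        return listener.getLocalPort();
    }

    @Override
    public void close() {
        worker.interrupt();
        try {
            listener.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}

class DeadlineClient {
    // Reads until end of stream or until deadlineMillis has elapsed overall;
    // returns how many bytes arrived in that time.
    static int bytesBeforeDeadline(int port, long deadlineMillis) {
        long giveUpAt = System.nanoTime() + deadlineMillis * 1_000_000L;
        try (Socket socket = new Socket(InetAddress.getLoopbackAddress(), port)) {
            InputStream in = socket.getInputStream();
            int received = 0;
            while (true) {
                long remainingMillis = (giveUpAt - System.nanoTime()) / 1_000_000L;
                if (remainingMillis <= 0) {
                    return received; // budget spent, even though bytes keep coming
                }
                socket.setSoTimeout((int) remainingMillis);
                try {
                    if (in.read() < 0) {
                        return received; // server finished its response
                    }
                } catch (SocketTimeoutException e) {
                    return received; // nothing arrived in the time that was left
                }
                received++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // 100 bytes at 50 ms each would take about five seconds; we allow 300 ms.
        try (SlowDripServer server = new SlowDripServer(100, 50)) {
            System.out.println("bytes before deadline: "
                    + bytesBeforeDeadline(server.port(), 300));
        }
    }
}
```

A harness like this sits outside ordinary functional testing: the response is perfectly valid, and only its timing is hostile.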
One of my recurring themes in "Release It" is that every call to another system, without exception, will someday try to kill your application. It usually comes from behavior outside the specification. When that happens, you must be able to peel back the layers of abstraction, tear apart the constructed fictions of "concurrent users", "sessions", and even "connections", and get at what's really happening.
I'll tell you a story about a vexing problem that could not be explained or solved up at Layer 7. Getting to the bottom of it took us right down to the wire and back again.
The 5 a.m. Problem
Software has to pass QA before you release it, but it lives and dies in production. Actually, it dies in production all too often. That's where I come in. I don't have a limp or a Vicodin habit, but other than that, I'm like Dr. Gregory House. When high-volume, transactional systems die, I get the strange cases that don't make sense. You won't see these cases on TV, but then again, I don't have Hugh Laurie's sexy misanthropic sneer, either.
House usually assumes the problem is either drugs or an infection, until some symptom points elsewhere. My first suspect is always an integration point. I've seen integration points cause more failures than anything else, and that includes shoddy coding. I contend that, sooner or later, every integration point in your system will fail somehow. It might refuse connections or return a partial response. It might answer with HTML when you're expecting XML. It might get really, really slow. And sometimes, it just won't answer at all.
I consider database calls just a special case of an integration point. Databases are enormously complex, yet most developers just expect them to behave well. They take it for granted when the database chugs away, accepting any kind of Byzantine SQL you throw at it, returning tidy result sets. Most of the time, they give little thought to what goes on beneath the neat abstraction of a SQLConnection.
Don't be lulled into a false sense of security. Any call to a database can hang. I'm not just talking about inserts and updates that can deadlock. I'm not just talking about stored procedures. I saw a case where queries to a read-only database---one that nobody was even allowed to update---caused a recurring pattern of failure.
Get the Crash Cart! This Website Just Flatlined.
Downtime at the same time every day.
One of the sites I launched developed this very nasty pattern of hanging completely at almost exactly 5 a.m. every day. This was running on around thirty different instances, so something was happening to make all thirty different application server instances hang within a five-minute window (the resolution of our URL pinger). Restarting the application servers always cleared it up, so there was some transient effect that tipped the site over at that time. Unfortunately, that was just when traffic started to ramp up for the day. From midnight to 5 a.m., there were only about 100 transactions per hour of interest, but the numbers ramped up quickly once the East Coast started to come online (one hour ahead of us Central Time folks). Restarting all the application servers just as people started to hit the site in earnest was what you'd call a suboptimal approach.
On the third day this occurred, I took thread dumps from one of the afflicted application servers. The instance was up and running, but all request-handling threads were blocked inside the Oracle JDBC library, specifically inside of OCI calls. (We were using the thick-client driver for its superior failover features.) In fact, once I eliminated the threads that were just blocked trying to enter a synchronized method, it looked as if the active threads were all in low-level socket read or write calls.
The next step was tcpdump and ethereal. (Ethereal is now called Wireshark.) The odd thing was how little that showed. A handful of packets were being sent from the application servers to the database servers, but with no replies. Also nothing was coming from the database to the application servers. Yet, monitoring showed that the database was alive and healthy. There were no blocking locks, the run queue was at zero, and the I/O rates were trivial.
From abstract to concrete.
Abstractions provide great conciseness of expression. We can go much faster when we talk about fetching a document from a URL than if we have to discuss the tedious details of connection setup, packet framing, acknowledgments, receive windows, and so on. With every abstraction, however, there comes a time when you must peel the onion, shed some tears, and see what's really going on---usually when something is going wrong. Whether for problem diagnosis or performance tuning, packet capture tools are the only way to understand what is really happening on the network.
tcpdump is a common UNIX tool for capturing packets from a network interface. Running it in "promiscuous" mode instructs the network interface card (NIC) to receive all packets that cross its wire---even those addressed to other computers. In a data center, the NIC is almost certainly connected to a switch port that is assigned to a virtual LAN [VLAN]. In that case, the switch guarantees that the NIC receives packets bound for addresses only in that VLAN. This is an important security measure, because it prevents bad guys from doing exactly what we're doing: sniffing the wire to look for "interesting" bits of information. Wireshark is a combination sniffer and protocol analyzer. It can sniff packets on the wire, as tcpdump does. Wireshark goes farther, though, by unpacking the packets for us. Through its history, Wireshark has experienced some security flaws---some trivial, some serious. At one point, a specially crafted packet sent across the wire (by a piece of malware on a compromised desktop machine, for example) could trigger a buffer overflow and execute arbitrary code of the attacker's choice. Since Wireshark must run as root to put the NIC into promiscuous mode---as any packet capture utility must---that exploit allowed the attacker to gain root access on a network administrator's machine.
In addition to the security issues, Wireshark is a big, heavy GUI program. On UNIX, it requires a bunch of X libraries, which might not even be installed on a headless system. On any host, it takes up a lot of RAM and CPU cycles to parse and display the packets. That is a burden that should not be on the production servers. For these reasons, it is best to capture packets noninteractively using tcpdump and then move the capture file to a nonproduction environment for analysis.
The screenshot below shows a capture from my home network. The first packet shows an Address Resolution Protocol (ARP) request. This happens to be a question from my wireless bridge to my cable modem. The next packet was a surprise: an HTTP query to Google, asking for a URL called /safebrowsing/lookup with some query parameters. The next two packets show a DNS query and response, for the "michaelnygard.dyndns.org" hostname. Packets five, six, and seven are the three-way handshake for a TCP connection setup. We can trace the entire conversation between my web browser and server. Note that the pane below the packet trace shows the layers of encapsulation that the TCP/IP stack created around the HTTP request in the second packet. The outermost frame is an Ethernet packet. The Ethernet packet contains an IP packet, which in turn contains a TCP packet. Finally, the payload of the TCP packet is an HTTP request. The exact bytes of the entire packet appear in the third pane.
I strongly recommend keeping a copy of Kozierok's "The TCP/IP Guide" or W. Richard Stevens' "TCP/IP Illustrated" open beside you for this type of activity!
Repetition and Paranoia
Understanding the root cause behind the crashes
By this time, we had to restart the application servers. Our first priority is restoring service. We do data collection when we can, but not at the risk of breaking an SLA. Any deeper investigation would have to wait until it happened again. None of us doubted that it would happen again.
Sure enough, the pattern repeated itself the next morning. Application servers locked up tight as a drum, with the threads inside the JDBC driver. This time, I was able to look at traffic on the databases' network. Zilch. Nothing at all. The utter absence of traffic on that side of the firewall was like Sherlock Holmes' dog that didn't bark in the night---the absence of activity was the biggest clue. I had a hypothesis. Quick decompilation of the application server's resource pool class confirmed that my hypothesis was plausible.
I said before that socket connections are an abstraction. They exist only as objects in the memory of the computers at the endpoints. Once established, a TCP connection can exist for days without a single packet being sent by either side. As long as both computers have that socket state in memory, the "connection" is still valid. Routes can change, and physical links can be severed and reconnected. It doesn't matter; the "connection" persists as long as the two computers at the endpoints think it does.
There was a time when that all worked beautifully well. These days, a bunch of paranoid little bastions have broken the philosophy and implementation of the whole Net. I'm talking about firewalls, of course.
A firewall is nothing but a specialized router. It routes packets from one set of physical ports to another. Inside each firewall, a set of access control lists define the rules about which connections it will allow. The rules say such things as "connections originating from 192.0.2.0/24 to 192.168.1.199 port 80 are allowed." When the firewall sees an incoming SYN packet, it checks it against its rule base. The packet might be allowed (routed to the destination network), rejected (TCP reset packet sent back to origin), or ignored (dropped on the floor with no response at all). If the connection is allowed, then the firewall makes an entry in its own internal table that says something like "192.0.2.98:32770 is connected to 192.168.1.199:80." Then all future packets, in either direction, that match the endpoints of the connection are routed between the firewall's networks.
So far, so good. How is this related to my 5 a.m. wake-up calls?
The key is that table of established connections inside the firewall. It's finite. Therefore, it does not allow infinite duration connections, even though TCP itself does allow them. Along with the endpoints of the connection, the firewall also keeps a "last packet" time. If too much time elapses without a packet on a connection, the firewall assumes that the endpoints are dead or gone. It just drops the connection from its table. But, TCP was never designed for that kind of intelligent device in the middle of a connection. There's no way for a third party to tell the endpoints that their connection is being torn down. The endpoints assume their connection is valid for an indefinite length of time, even if no packets are crossing the wire.
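The bookkeeping described above can be modeled in a few lines. This is a toy sketch in Java, not the behavior of any particular firewall product: ToyFirewall and its method names are hypothetical, time is passed in explicitly so the example stays deterministic, and expired entries are removed lazily when the next packet shows up, whereas a real device would more likely expire them on a timer.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of a stateful firewall's connection table (illustrative only).
class ToyFirewall {
    enum Verdict { FORWARDED, DROPPED }

    private final long idleTimeoutMillis;
    // Key: "srcHost:srcPort->dstHost:dstPort"; value: time of the last packet seen.
    private final Map<String, Long> lastPacketAt = new HashMap<>();

    ToyFirewall(long idleTimeoutMillis) {
        this.idleTimeoutMillis = idleTimeoutMillis;
    }

    // A SYN that passed the rule base creates the table entry.
    void connectionAllowed(String connection, long nowMillis) {
        lastPacketAt.put(connection, nowMillis);
    }

    // Every later packet either refreshes the entry or is silently dropped.
    Verdict packet(String connection, long nowMillis) {
        Long last = lastPacketAt.get(connection);
        if (last == null) {
            return Verdict.DROPPED; // no entry: dropped, and nobody is told
        }
        if (nowMillis - last > idleTimeoutMillis) {
            lastPacketAt.remove(connection); // idle too long: forget the connection
            return Verdict.DROPPED;
        }
        lastPacketAt.put(connection, nowMillis); // reset the "last packet" time
        return Verdict.FORWARDED;
    }

    public static void main(String[] args) {
        long minute = 60_000L;
        ToyFirewall firewall = new ToyFirewall(60 * minute);
        String conn = "192.0.2.98:32770->192.168.1.199:80";
        firewall.connectionAllowed(conn, 0);
        System.out.println(firewall.packet(conn, 30 * minute)); // within the idle window
        System.out.println(firewall.packet(conn, 91 * minute)); // 61 minutes of silence
        System.out.println(firewall.packet(conn, 92 * minute)); // entry is already gone
    }
}
```

The important property is in the last two calls: once the entry has expired, neither endpoint hears about it, and every later packet on that connection simply disappears.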
After that point, any attempt to read or write from the socket on either end does not result in a TCP reset or an error due to a half-open socket. Instead, the TCP/IP stack sends the packet, waits for an ACK, doesn't get one, and retransmits. The faithful stack tries and tries to reestablish contact, and that firewall just keeps dropping the packets on the floor, without so much as an "ICMP destination unreachable" message. (That could let bad guys probe for active connections by spoofing source addresses.) My Linux system, running on a 2.6 series kernel, has its tcp_retries2 set to the default value of 15, which results in a twenty-minute timeout before the TCP/IP stack informs the socket library that the connection is broken. The HP-UX servers we were using at the time had a thirty-minute timeout. That application's one-line call to write to a socket could block for thirty minutes! The situation for reading from the socket is even worse. It could block forever.
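The unbounded read is the easier half to defend against in application code. The sketch below is an illustration in plain Java sockets, not the fix used in this story, and the JDBC driver in question may not expose an equivalent knob. SilentPeerDemo is a hypothetical name. It connects to a local listener that never sends a byte, which looks to the reader much like a peer hidden behind a firewall that is dropping packets, and uses Socket.setSoTimeout to turn an indefinite block into an exception.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Hypothetical demo: reading from a peer that never answers.
class SilentPeerDemo {

    // Returns true if the read gave up after timeoutMillis instead of blocking.
    static boolean readTimesOut(int timeoutMillis) {
        // Port 0 asks the OS for any free port. The listener never calls accept(),
        // but the OS completes the TCP handshake anyway, so the connect succeeds.
        try (ServerSocket silent = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client = new Socket(silent.getInetAddress(), silent.getLocalPort())) {
            client.setSoTimeout(timeoutMillis); // bound every blocking read on this socket
            try {
                client.getInputStream().read(); // with a timeout of 0 this would block forever
                return false;
            } catch (SocketTimeoutException expected) {
                return true;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("read timed out: " + readTimesOut(200));
    }
}
```

Note that the read timeout does nothing for a blocked write, which is still at the mercy of the operating system's retransmission limits described above.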
When I decompiled the resource pool class, I saw that it used a last-in, first-out strategy. During the slow overnight times, traffic volume was light enough that one single database connection would get checked out of the pool, used, and checked back in. Then the next request would get the same connection, leaving the thirty-nine others to sit idle until traffic started to ramp up. They were idle well over the one-hour idle connection timeout configured into the firewall.
Once traffic started to ramp up, those thirty-nine connections per application server would get locked up immediately. Even if the one connection was still being used to serve pages, sooner or later it would be checked out by a thread that ended up blocked on a connection from one of the other pools. Then the one good connection would be held by a blocked thread. Total site hang.
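The effect of the checkout order is easy to reproduce in a simulation. This is a minimal sketch in Java under stated assumptions: PoolIdleDemo is a hypothetical name, the pool size of forty is inferred from the "thirty-nine others" above, and requests are strictly sequential, as they roughly were overnight. It counts how many distinct connections ever get used under last-in, first-out versus first-in, first-out checkout.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical simulation of pool checkout order under light, sequential traffic.
class PoolIdleDemo {

    // Runs `requests` sequential check-out/check-in cycles against a pool of
    // `poolSize` connections and returns how many distinct connections were used.
    static int connectionsTouched(int poolSize, int requests, boolean lifo) {
        Deque<Integer> idle = new ArrayDeque<>();
        for (int id = 0; id < poolSize; id++) {
            idle.addLast(id);
        }
        boolean[] used = new boolean[poolSize];
        for (int r = 0; r < requests; r++) {
            int conn = idle.removeFirst(); // check out from the head of the queue
            used[conn] = true;
            if (lifo) {
                idle.addFirst(conn);       // LIFO: the same connection is next in line
            } else {
                idle.addLast(conn);        // FIFO: go to the back, others get a turn
            }
        }
        int touched = 0;
        for (boolean u : used) {
            if (u) {
                touched++;
            }
        }
        return touched;
    }

    public static void main(String[] args) {
        System.out.println("LIFO touched: " + connectionsTouched(40, 500, true));
        System.out.println("FIFO touched: " + connectionsTouched(40, 500, false));
    }
}
```

Under these assumptions the last-in, first-out pool exercises one connection and leaves the rest untouched, while a rotating pool would send traffic over every connection in turn. Whether that rotation alone would have stayed inside the firewall's one-hour window depends on the overnight request rate.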
Dead Connection Walking
Finding a solution
Once we understood all the links in that chain of failure, we had to find a solution. The resource pool has the ability to test JDBC connections for validity before checking them out. It checked validity by executing a SQL query like SELECT SYSDATE FROM DUAL. Well, that would just make the request-handling thread hang anyway. We could also have the pool keep track of the idle time of the JDBC connection and discard any that were older than one hour. Unfortunately, that involves sending a packet to the database server to tell it that the session is being torn down. Hang.
We were starting to look at some really hairy complexities, such as creating a "reaper" thread to find connections that were close to getting too old and tearing them down before they timed out. Fortunately, a sharp DBA recalled just the thing. Oracle has a feature called dead connection detection that you can enable to discover when clients have crashed. When enabled, the database server sends a ping packet to the client at some periodic interval. If the client responds, then the database knows it is still alive. If the client fails to respond after a few retries, the database server assumes the client has crashed and frees up all the resources held by that connection.
We weren't that worried about the client crashing, but the ping packet itself would be enough to reset the firewall's "last packet" time for the connection, keeping the connection alive. Dead connection detection kept the connection alive, which let me sleep through the night.
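The reasoning comes down to simple arithmetic: the ping interval has to be shorter than the firewall's idle timeout. Oracle documents the dead connection detection interval as the SQLNET.EXPIRE_TIME parameter, in minutes, in the server's sqlnet.ora file; the value used on this site is not stated here. The sketch below is a deterministic Java model of that arithmetic only, with a hypothetical class name and made-up intervals. It does not talk to a database or a firewall.

```java
// Hypothetical model: does a periodic ping keep a firewall table entry alive?
class HeartbeatDemo {

    // A connection carries no application traffic for idleMillis. A ping crosses
    // the firewall every pingIntervalMillis (0 means no pings at all). Returns true
    // if the firewall never saw a gap longer than firewallTimeoutMillis.
    static boolean survivesIdlePeriod(long idleMillis, long pingIntervalMillis,
                                      long firewallTimeoutMillis) {
        long lastPacket = 0;
        if (pingIntervalMillis > 0) {
            for (long t = pingIntervalMillis; t <= idleMillis; t += pingIntervalMillis) {
                if (t - lastPacket > firewallTimeoutMillis) {
                    return false; // the ping arrived after the entry had expired
                }
                lastPacket = t;   // the ping reset the firewall's "last packet" time
            }
        }
        return idleMillis - lastPacket <= firewallTimeoutMillis;
    }

    public static void main(String[] args) {
        long minute = 60_000L;
        long night = 5 * 60 * minute; // five quiet hours
        long timeout = 60 * minute;   // one-hour firewall idle timeout
        System.out.println("no pings:        " + survivesIdlePeriod(night, 0, timeout));
        System.out.println("ping every 10m:  " + survivesIdlePeriod(night, 10 * minute, timeout));
        System.out.println("ping every 90m:  " + survivesIdlePeriod(night, 90 * minute, timeout));
    }
}
```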
Lesson Learned?
We would never have written a unit test to simulate a database call hanging indefinitely inside the TCP/IP protocol itself. Why would you? Worse, there are an unlimited number of ways for fallible networks, servers, and applications to generate these "out of spec" failures. So what can we do? Is there a missing practice in the agile world? Is there some testing technique or coding practice that we can employ to avoid this type of failure?
At one time, no one would have envisioned programmers testing their own code. Twenty years ago, the very idea was laughable. Today, it is expected... sometimes demanded. A growing number of us even define "legacy code" by Michael Feathers' definition: code without unit tests. Perhaps someone will invent a testing technique to ward off the myriad failures that can bubble up when foundational abstractions misbehave.
Until then, I contend that we need to consider the architecture, even in agile projects, and guard ourselves against such failures. We must apply design patterns for resilience, just as we apply them for functionality. I have created a set of such "Stability Patterns" in "Release It". I hope these are just the beginning.
About the author
Michael strives to raise the bar and ease the pain for developers across the country. He shares his passion and energy for improvement with everyone he meets, sometimes even with their permission. Michael has spent the better part of 20 years learning what it means to be a professional programmer who cares about art, quality, and craft. Michael has been a professional programmer and architect for nearly 20 years. During that time, he has delivered running systems to the U.S. Government, the military, banking, finance, agriculture, and retail industries. More often than not, Michael has lived with the systems he built. This experience with the real world of operations changed his views about software architecture and development forever.
Most recently, Michael wrote "Release It! Design and Deploy Production-Ready Software", a book that realizes many of his thoughts about building software that does more than just pass QA: it survives the real world. After its release, the book was Amazon's #1 "Hot New Release" in the Software Design category for over a month. Michael previously wrote numerous articles and editorials, spoke at Comdex, and co-authored one of the early Java books.
Adapted with permission from Release It! Design and Deploy Production-Ready Software, by Michael T. Nygard
Copyright 2007 Michael T. Nygard, published by the Pragmatic Bookshelf, ISBN 0-9787392-1-3.
Book available in paper and PDF from www.PragmaticProgrammer.com
Just enough design up front
Happened to me too
Nice article. The same once happened to me, too. See www.epischel.de/wordpress/?p=45. I agree with you that most failure points in the field are integration issues - in particular when practicing TDD. In most cases you would try to mock up external systems. And even if you use the real one, you probably won't test in your production environment.
In another, much bigger project, the first failure during load-testing was (against all odds) a network switch that failed under heavy network load only. Should we take that into account when developing software? When human lives depend on it - yes. But otherwise?
Re: Just enough design up front
Is agile Silver bullet?
I would agree that integration problems tend to be very complex ones that are not possible to handle in "unit" testing environments, but as usual a very important aspect of agile development is forgotten - and this aspect is "evolution".
We can't fix/test/predict integration problems with our unit tests, whatever methodology we are using. It is indeed very complex, if not impossible, to create fake interfaces that match the interfaces of the integration point 100%, but what we can do with Agile/TDD/XP/(put your agile method here) is make system evolution simple!
I would never assume that TDD will replace normal functional testing or solve 100% of the problems a system will have in the production environment, but I strongly believe that high unit test coverage and a large number of automated end-to-end functional tests will help us deliver new functionality faster without worrying about breaking existing functionality.
We also have a lot of discussions in our company about the value of automated unit/functional testing, and only one conclusion is feasible for me - all automated tests are by nature regression tests.
Re: Just enough design up front
1. Set the first one or two iterations as architectural ones. Some of the work in these iterations is to spike technological and architectural risk. Nevertheless, most of the work in architectural iterations is still about delivering business value and user stories. The difference is that the prioritization of the requirements is also done based on technical risks and not just business ones. By the way, writing quality attribute requirements as scenarios makes them usable as user stories and helps customers understand their business value.
2. Try to think about prior experience to produce the baseline architecture
3. One of the quality attributes that you should bring to the table is flexibility - but be wary of putting too much effort into building this flexibility in
4. Don't try to implement architectural components thoroughly - it is enough to run a thin thread through them and expand them when the need arises. Sometimes it is even enough just to identify them as possible future extensions.
5. Try to postpone architectural decisions to the last responsible moment. However, when that moment comes - make the decision. Try to validate the architectural decisions by spiking them out before you introduce them into the project
Here is the correct link to the post I made on quality attributes - the link in my first message is broken.
Re: Is agile Silver bullet?
So, this isn't so much meant to complain that there was a problem we didn't find through unit testing as it is meant to draw a parallel.
We (using the "royal we" for a moment) invented and adopted unit testing to solve our own problem of producing buggy code. Here, I see a similar problem.
I often hear XPers say there should be no architecture up front, that it should all emerge through the practices. On the opposite end of the spectrum, there are the Zachman framework types that want to define the world before any projects can begin. Even on the most pragmatic of agile teams, there's still a kind of connotation that some amount of up-front architecture is probably necessary, but it's a compromise---a necessary evil.
That leaves us wide open to this kind of problem, and myriad others that I've seen. Failures in the white space. Cracks originate in the gaps between boxes.
Is there something analogous we could invent to address architecture issues while remaining consistent with agile values?
Re: Is agile Silver bullet?
Is there something analogous we could invent to address architecture issues while remaining consistent with agile values?
As I said above - I think this can be handled within the practices of agile development, if you express architectural constraints as user stories - by demonstrating how the concern is manifested in the application. You can then prioritize and handle it like other user stories (you can look at an architect as a type of a technical product owner).
Re: Is agile Silver bullet?
But, with respect to this particular example (in the article), I'm not sure it has anything to do with Agile or TDD at all. Production problems happen - Agile or not.
The fair question to ask would be 'could I have avoided this problem?' And if your answer is yes, that you could have foretold this problem, the next question would be 'with what accuracy?'
By and large, we in the technical community are TERRIBLE fortune-tellers. You will miss things. But if you try to foresee all, you will over-design, over-complicate, and increase your code debt.
I think the question should be stated differently - 'What kinds of architectures evolve?' If TDD (via refactoring) is a local improvement - isn't this analogous to a steepest-descent algorithm for design? We know that steepest descent gets stuck in local minima. Does TDD really give us a good architecture? Does it give us good-enough architecture? Or does the local nature of TDD preclude evolving towards an acceptable architecture?
Architecture & Agile
Re: Architecture & Agile
I'm puzzled -- is there really an Agile school of thought that says "architecture" is something bad?
I think that for many YAGNI is just that
Re: Architecture & Agile
Actually, YAGNI says no fortune-telling because we tend to be wrong more than we tend to be right. You are within YAGNI if you have an architecture in mind and then wait to evolve it in that direction when the requirements ask for it.
This is very similar to Real Options (www.infoq.com/articles/real-options-enhance-agi... ) or the interview with Erich Gamma last year (www.artima.com/lejava/articles/designprinciples... ), where he described how the Eclipse team refactors to patterns. Which, of course, brings up Joshua Kerievsky's Refactoring to Patterns work (www.industriallogic.com/xp/refactoring/).
When I heard of refactoring, I remember thinking, "Now I know what architecture is. It's the stuff that's hard to refactor!" I guess that's the art of "just-enough" or delaying decisions--knowing when you have to make them. There are some decisions that do have to be made earlier.
One agile value is "don't throw stuff over the wall." I've almost always had to support what I wrote, and that forces a production mindset. I don't want the phone to ring at 5 AM, and if it does, I want the problem to be obvious. So I build in monitoring and logging functionality from the start. I guess I could cover proper behavior of logs and monitors with unit tests. Find a copy of "Writing Solid Code." It's 10 years old, and C-centric, but I learned a ton from that book.
Another agile value is "test early and often," and I guess that can include load testing. I like to try and build the simplest-possible feature that spans all of the components in the architecture, and load test that. If you log and graph CPU, memory, network, and disk I/O on all components as you test, you will start to see patterns long before flames start shooting out. If you have underpowered hardware, all the better. You're trying to see where and how the software breaks.
Re: Architecture & Agile
No test environment?
Saying that you CAN'T test your production architecture because you don't bother to is not a good enough answer. There are some things you cannot test effectively, but firewall rules should definitely be the same. Where do Agile Development methodologies recommend that functional testing be done in an environment that does not mimic production?
Having said that, I admire your skill at finding the problem, and this is a good write-up of how to do this sort of low-level packet sniffing.
Re: No test environment?
You make a great point, and one that I address in the book. One of my major themes is getting grounded and connecting with the actual deployment environment. It's the only way to have true confidence in what you deliver.
Most companies will not build exact replicas for their test environments, though. They choose to save a bit of money by eliminating expensive network components like firewalls and hardware load balancers. This is a penny-wise, pound-foolish decision. Whatever money they save on network equipment will surely be lost in production outages. Nevertheless, budgetecture happens, particularly in QA.
Sometimes, it's not as much a budget issue as it is a knowledge gap. Development may not know what the enterprise network will be, particularly if development is outsourced. Other times, the network architecture changes late in the game. I've heard, "We can't disrupt the QA environment now! We're too busy getting ready for release to lose a day while you change the network." Of course, what happens then when it does hit the real network?
Anyway, I always fight to have the QA network match the production network. About 50% of the time, I win that fight.
Re: No test environment?
Problem vs. Architecture, Agile
An underlying concept in Agile is that not everything can be foreseen. I am not sure what could have predicted this problem or discovered it more quickly than actually fielding the software. It is only by fielding quickly that we can discover what we don't know.
Re: Problem vs. Architecture, Agile
What Agile doesn't do is try to design and build it before doing any other coding, which often involves trying to predict every darned thing the application(s) will ever need. Just build enough to support today's needs, and be mentally and technically prepared to add or change or remove bits as the app evolves.
To the original post - this was a fascinating story. That detective work would be beyond me and the project teams I know in my company. We do some "unplug the network cable" testing for failures, but this situation would have been way hard to predict and test for.
Re: No test environment?
My purpose here is certainly not to bash Agile. I've been a proponent and practitioner since before the moniker existed. I was doing unit testing, pairing, refactoring, and short iterations back when it was all just called "XP" or, more generally, lightweight methods. Several years back, I even quit my job to start a company explicitly built on agile methods. More recently, I spent an intense year in a fully agile Scrum/XP project. In the first 8 months, we delivered what the client had failed to deliver over the previous 2 1/2 years. In the next 4 months of my time on that project, we did six additional releases.
I'm speaking from within the Agile community, not from outside of it.
I can see that several people have misread my intention. I blame myself, as the author, for not being clear enough. I will try to make myself more clear here in the comments.
I don't attribute the failure here to a "failure of agile". Nor do I expect that agile methods, as formulated today, should have prevented this problem.
What I am presenting is a problem that has two very difficult characteristics:
- It's virtually impossible to predict.
- It only occurs in the actual production environment.
I'm drawing an analogy to unit testing. In days past, people thought it was impossible to test software within the development environment. Testing was done in a test lab, by testers, using testing tools. We have rewritten those rules. We now understand that unit testing won't catch every bug, but it sure catches a lot of them. (And, yes, unit testing also motivated changes in the way we design the code itself. We don't mind that much, since the design changes needed for unit testing are all "virtues" that we endorse anyway: decoupling, isolation, single-responsibility, and so on.)
Furthermore, we use automation to solve problems once and keep them solved. So, once a bug is discovered, we write a test to verify the bug. Once we fix the bug, the test acts as a barrier to keep the bug from re-emerging. We use our suite of automated tests to "nail down" the functionality. (And, they allow us to retain existing value while incrementally adding more value.)
As a practice, automated unit testing supports many positive virtues. We don't expect it to prevent or solve every problem. There are known challenges---areas that work, sort of, but not very well---mostly around databases and GUIs. Despite those challenges, I would never give up unit testing.
My point in this piece is to ask a question, not to bash anyone or anything. Can we think of a practice, consistent with agile values, that would advance architecture work the same way that unit testing has advanced coding? I am asking this question by using a specific example of a general class of problems to illustrate a difficult, costly situation that I would like to have avoided rather than solved.
I'm asking this question because I see a need for more connection to the actual deployment environment: filesystems, servers, networks, databases, etc. There are times and places for isolation, but we cannot always be isolated from the deployment environment. By the way, this "disconnectedness" is not unique to agile developers. I suspect that the best solution to disconnectedness will come from the agile community. The Ivory Tower architects have already had their whack at it---and they responded with even larger diagrams that got further disconnected from the real environment.
I very much want to avoid problems like this one, but I've got a hundred stories like this. Some come from agile projects, most come from non-agile projects. Some come from projects with heavy "big architecture up front", some come from projects with incrementally developed architecture. I'm certainly not blaming "agile" for these problems. I'm looking for a solution to them, and for that solution, I'm asking the agile community if we can find a practice that fits our values: incremental, automated, expressive, "just-in-time", self-describing, executable documentation, and enabling.
Like software testing, traditional (heavy) approaches to architecture have not moved the needle on the quality gauge. Let's see if there's a way to do for architecture what unit testing did for software quality.
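One possible shape for such a practice, offered only as a sketch and not as an established technique: an automated test that makes a dependency misbehave outside its protocol. The example below stands up a local server that accepts a connection and then never replies, and checks that the caller gives up instead of hanging. The names `start_silent_server`, `fetch_status`, and `check_survives_hung_dependency`, the `STATUS` request, and the timeout values are all assumptions made for illustration.

```python
import socket
import threading
import time


def start_silent_server():
    """Start a local server that accepts one connection and never replies."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)
    held = []  # keep the accepted socket referenced so it stays open

    def accept_and_hang():
        try:
            conn, _ = listener.accept()
            held.append(conn)  # accept, then say nothing at all
        except OSError:
            pass  # listener was closed before anyone connected

    threading.Thread(target=accept_and_hang, daemon=True).start()
    return listener, held


def fetch_status(host, port, timeout_seconds):
    """Hypothetical client call that must not block forever."""
    try:
        # The timeout covers the connect and every later send/recv on the socket.
        with socket.create_connection((host, port), timeout=timeout_seconds) as conn:
            conn.sendall(b"STATUS\n")
            reply = conn.recv(1024)
            return reply.decode() or "unavailable"
    except OSError:  # socket.timeout is a subclass of OSError
        return "unavailable"


def check_survives_hung_dependency():
    """Return (result, elapsed seconds) for a call against the silent server."""
    listener, held = start_silent_server()
    try:
        port = listener.getsockname()[1]
        started = time.monotonic()
        result = fetch_status("127.0.0.1", port, timeout_seconds=0.2)
        return result, time.monotonic() - started
    finally:
        for conn in held:
            conn.close()
        listener.close()
```

Calling `check_survives_hung_dependency()` should return `"unavailable"` after roughly the timeout. A check like this only covers one failure mode on one machine; it says nothing about the real production network, which is the harder part of the question.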
Hope this helps clarify my intentions.
Re: No test environment? (Preventing Failures)
Yes, we should do the best we can in testing, given real-world constraints. The value we get from agile development, though, is risk reduction: less cost is sunk in an absolute failure. With multiple, iterative releases, a failed iteration can be rolled back at the sunk cost of only 2-4 weeks of effort. However, we must plan for and be prepared to roll back.
Iterative development moves us past the risks of all-or-nothing one-shot project development. The way to address the risks of unknowns is to push them to the front of the queue and force them to arise as soon as possible. There is no way to plan to avoid the unknown; one can only force it to arise as early as possible, allowing time to recover from it.
How would upfront architectural design have prevented this problem
What I do not see is how doing upfront architectural design would have prevented this from occurring (except armed with 20/20 hindsight).
It seems to me that the same problem could easily have occurred in a waterfall project. The lack of unit tests and functional tests in most waterfall projects (and the likely code bloat written to handle dozens of other potential problems that never happen in production) would have made the set of possible causes orders of magnitude larger. With less assurance that each unit was working correctly and that the system worked correctly under normal conditions, discovering the root cause would have been much more difficult.
Furthermore, once a fix was determined, establishing that the fix did not break anything else would have also been much more difficult without all those automated unit and functional tests.
I do not even see how doing a few iterations of architectural spikes at the beginning of the project would have prevented this anomaly. In my experience, making sure the first story really goes end-to-end forces a slice of each architectural component to be implemented. This gives us the best of both worlds - validating everyone's understanding of the requirements as well as laying out and testing the architectural approach.
Re: How would upfront architectural design have prevented this problem
Patterns in methodologies
If all methodologies strive for predictable outcomes, and if patterns that worked before are trusted more than unfamiliar patterns, can we arrive at this generalized principle that all software development projects will apply patterns (architectural, methodological, testing practice, and coding practice) based on the dominant makeup of the team and little else?
I believe this to be true. Given a brilliant technologist with incredible people skills and lots of self confidence vs. a more junior technologist with less self confidence but a better solution, which will a business person choose to listen to? How will a business person make budget decisions? Who will they trust their reputation to?
In IT, more than any other industry, people are the dominant success factor. Agile development recognizes this. Traditional waterfall recognizes it much less. The false concept of man-months recognizes it not at all.
In fifteen years, I've never seen a project that failed for purely technical reasons. Political? ROI? Bad management? Yes. If, by failure, we mean over budget and late, we can always tie the failures back to our inability to estimate accurately the coding effort of a feature more than a few weeks out.
Why? There is a cliché: "In the time it took to figure out the requirements and write up the spec, I could have coded the feature." I believe this is the seed that germinated into short iterations in all the modern methodologies. It also speaks to our inability to estimate effort without knowing all the details.
Once you 'know the details', the code flies out of your head and onto the screen. Except for those pesky parts where surprising new details are discovered. We'll save that for the next iteration :-)
Nice to read your writing again, Mike.
On a pragmatic note, I don't think a team and methodology exist that can accomplish what you suggest; namely, covering all production scenarios. Based on my aerospace manufacturing experiences, all design refinement is based on feedback loops from production experience.
Where to place the fuel tanks is influenced by what the intended top speed of the aircraft should be, which influences the material of the tanks, which influences their shape and manufacturing method, which influences the type of fuel sensor, which influences which side of the fuel tank it's mounted on, which influences where to place the tanks...