Posted by Deborah Hartmann on Jun 25, 2007 01:06 AM
Author Michael Nygard counts himself among those who still believe there is such a thing as architecture. In his InfoQ article Agile, Architecture and the 5am Production Problem, Nygard walked the reader through the whodunnit mystery of a real production problem. The surprising conclusion illustrated his message that building applications for the real world, and not just QA, requires a failure-oriented mindset and strong defensive programming tactics. The article poses a challenge to the Agile community's ideas about what constitutes "just enough" architecture.
Agile methods tell us a lot about how to build functional software that changes easily over time. Programmers created techniques such as unit testing and refactoring for use by other programmers, and they improved the craft as a result. For the most part, though, agile methods focus on the interior of the system boundary. In the agile community, debate continues about how much attention we should pay to the architecture of things outside the application boundary. The most extreme adherents (or should that be "eXtreme" adherents?) say, "Let the architecture emerge from relentless refactoring and vigorous unit testing!"
The article told the true story of an interesting and unexpected failure that would only occur in the wee hours of the morning after a quiet period on the website: an application that would hang at 5 AM every day, involving a database that was only ever queried. The guilty parties---simultaneously the victims---were a web server, a database server, and a firewall. For those whose first reaction is to think "there's no way to create a deadlock if you're just querying": you'll be interested to see what Nygard uncovered.
I am an agile developer and architect, but you should count me ... among those who think architecture must stay grounded in implementation. A good architecture is one that survives contact with the real world. A bad one creaks and groans its way through the day, chewing up people and computers. I have often observed that architects who retreat into abstractions create architecture that cannot be built successfully.
I agree. On a decent-sized project, bottom-up design techniques such as TDD are not enough. You also need to do some up-front work and think about quality attributes. I think that architectural requirements expressed as user stories (so that implementing them also brings business value) should be emphasized in the first few iterations.
Hi, nice article. The same once happened to me, too. See http://www.epischel.de/wordpress/?p=45. I agree with you that most failure points in the field are integration issues, in particular when practicing TDD. In most cases you would try to mock up external systems. And even if you use the real one, you probably won't test in your production environment. In another, much bigger project, the first failure during load testing was (against all odds) a network switch that failed only under heavy network load. Should we take that into account when developing software? When human lives depend on it, yes. But otherwise?
Me too. I've never believed (or found) that architecture and agile are mutually exclusive. Overblown and over-engineered architectures, sure, but then these should have no place in the waterfall world either. I do have some sympathy for the "let the architecture emerge" school of thought, and certainly no architecture should be so inflexible that it cannot change, but properly skilled architects (who are grounded in something more technically relevant than just PowerPoint) ought to be able to lay down some appropriate guidelines to meet NFRs and shifting requirements that save buckets of time later. The trick is to get 'just enough' in place, and not to get carried away with gold-plating the design when you can't be 100% sure it won't all change in the next iteration.
I tend to think that the author expects to solve all problems with Agile/TDD/XP methods, which is obviously not correct. I would agree that integration problems tend to be very complex ones that cannot be handled in "unit" testing environments, but as usual a very important aspect of agile development is forgotten, and this aspect is "evolution". We can't fix, test, or predict integration problems with our unit tests, whatever methodology we use. It is indeed very complex, if not impossible, to create fake interfaces that match the interfaces of the integration point 100%, but what we can do with Agile/TDD/XP/(put your agile method here) is make system evolution simple! I would never assume that TDD will replace normal functional testing or solve 100% of the problems a system will have in the production environment, but I strongly believe that high unit test coverage and a large number of automated end-to-end functional tests will help us deliver new functionality faster without worrying about breaking existing functionality. We also have a lot of discussions in our company about the value of automated unit/functional testing, and only one conclusion is feasible for me: all automated tests are by nature regression tests.
Yup. As I posted in my blog, I found you need to take a mixed approach: try to come up with a basic architecture and evolve it with the project.
1. Set the first one or two iterations as architectural ones. Some of the work in these iterations is to spike technological and architectural risk. Nevertheless, most of the work in architectural iterations is still about delivering business value and user stories. The difference is that the prioritization of the requirements is also done based on technical risks and not just business ones. By the way, writing quality attribute requirements as scenarios makes them usable as user stories and helps customers understand their business value.
2. Try to draw on prior experience to produce the baseline architecture.
3. One of the quality attributes that you should bring to the table is flexibility, but be wary of putting too much effort into building this flexibility in.
4. Don't try to implement architectural components thoroughly. It is enough to run a thin thread through them and expand them when the need arises. Sometimes it is even enough just to identify them as possible future extensions.
5. Try to postpone architectural decisions to the last responsible moment. However, when that moment comes, make the decision. Try to validate the architectural decisions by spiking them out before you introduce them into the project.
PS Here is the correct link to the post I made on quality attributes; the link in my first message is broken.
It's not so much that I expect TDD or XP to solve everything, but I am a strong proponent of agile methods and the agile values. So, this isn't so much meant to complain that there was a problem we didn't find through unit testing as it is meant to draw a parallel. We (using the "royal we" for a moment) invented and adopted unit testing to solve our own problem of producing buggy code. Here, I see a similar problem. I often hear XPers say there should be no architecture up front, that it should all emerge through the practices. On the opposite end of the spectrum, there are the Zachman framework types that want to define the world before any projects can begin. Even on the most pragmatic of agile teams, there's still a kind of connotation that some amount of up-front architecture is probably necessary, but it's a compromise---a necessary evil. That leaves us wide open to this kind of problem, and myriad others that I've seen. Failures in the white space. Cracks originate in the gaps between boxes. Is there something analogous we could invent to address architecture issues while remaining consistent with agile values?
Is there something analogous we could invent to address architecture issues while remaining consistent with agile values?
As I said above, I think this can be handled within the practices of agile development. If you express architectural constraints as user stories, by demonstrating how the concern is manifested in the application, you can then prioritize and handle them like other user stories (you can look at an architect as a type of technical product owner).
I'm not sure I agree that there is any conflict at all. Nobody, to my knowledge, has ever said that Agile or TDD is a silver bullet. But, with respect to this particular example (in the article), I'm not sure it has anything to do with Agile or TDD at all. Production problems happen, Agile or not. The fair question to ask would be 'could I have avoided this problem?' And if your answer is yes, that you could have foretold this problem, the next question would be 'with what accuracy?' By and large, we in the technical community are TERRIBLE fortune-tellers. You will miss things. But if you try to foresee all, you will over-design, over-complicate, and increase your code debt. I think the question should be stated differently: 'What kinds of architectures evolve?' If TDD (via refactoring) is a local improvement, isn't this analogous to a steepest-descent algorithm for design? We know that steepest descent gets stuck in local minima. Does TDD really give us a good architecture? Does it give us good-enough architecture? Or does the local nature of TDD preclude evolving towards an acceptable architecture?
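The steepest-descent analogy in the comment above can be made concrete with a small sketch. The cost function, learning rate, and starting points below are assumptions chosen purely for illustration; nothing here comes from the article. The function has two valleys, and plain steepest descent ends up in whichever valley is downhill from where it starts.

```python
def f(x):
    # Illustrative cost function with two valleys (an assumption, not from the article).
    return x**4 - 3 * x**2 + x


def df(x):
    # Derivative of f, used as the local "which way is downhill" signal.
    return 4 * x**3 - 6 * x + 1


def descend(x, rate=0.01, steps=500):
    # Plain steepest descent: repeatedly take a small step downhill.
    for _ in range(steps):
        x -= rate * df(x)
    return x


right = descend(2.0)   # settles in the shallower valley, near x = 1.13
left = descend(-2.0)   # settles in the deeper valley, near x = -1.30

print(f"start  2.0 -> x = {right:.3f}, f(x) = {f(right):.3f}")
print(f"start -2.0 -> x = {left:.3f}, f(x) = {f(left):.3f}")
```

Both runs make only local improvements, yet the run that starts at 2.0 stops at a clearly worse result than the run that starts at -2.0. That is the concern being raised about purely local refactoring: where you end up depends on where you began.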
I'm puzzled -- is there really an Agile school of thought that says "architecture" is something bad?
I'm puzzled -- is there really an Agile school of thought that says "architecture" is something bad?
I think that for many YAGNI is just that
That is a misread/misunderstanding of YAGNI. Actually, YAGNI says no fortune-telling because we tend to be wrong more than we tend to be right. You are within YAGNI if you have an architecture in mind and then wait to evolve it in that direction when the requirements ask for it. This is very similar to Real Options (http://www.infoq.com/articles/real-options-enhance-agility) or the interview with Erich Gamma last year (http://www.artima.com/lejava/articles/designprinciples.html) where he described how the Eclipse team refactors to patterns. Which, of course, brings up Joshua Kerievsky's Refactoring to Patterns work (http://www.industriallogic.com/xp/refactoring/).
A few random thoughts. When I heard of refactoring, I remember thinking, "Now I know what architecture is. It's the stuff that's hard to refactor!" I guess that's the art of "just-enough" or delaying decisions: knowing when you have to make them. There are some decisions that do have to be made earlier. One agile value is "don't throw stuff over the wall." I've almost always had to support what I wrote, and that forces a production mindset. I don't want the phone to ring at 5 AM, and if it does, I want the problem to be obvious. So I build in monitoring and logging functionality from the start. I guess I could cover proper behavior of logs and monitors with unit tests. Find a copy of "Writing Solid Code." It's 10 years old, and C-centric, but I learned a ton from that book. Another agile value is "test early and often," and I guess that can include load testing. I like to try to build the simplest possible feature that spans all of the components in the architecture, and load test that. If you log and graph CPU, memory, network, and disk I/O on all components as you test, you will start to see patterns long before flames start shooting out. If you have underpowered hardware, all the better. You're trying to see where and how the software breaks. Robert Merrill www.ufunctional.com
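The idea of covering log behavior with unit tests could look something like the following sketch. The logger name `checkout` and the helper `call_backend` are hypothetical names invented for this illustration; the only library feature relied on is `assertLogs` from Python's standard `unittest` module.

```python
import logging
import unittest

log = logging.getLogger("checkout")  # hypothetical logger name


def call_backend(backend):
    # Hypothetical helper: call a backend and make any timeout obvious in the log.
    try:
        return backend()
    except TimeoutError:
        log.error("backend call timed out")
        raise


class BackendLoggingTest(unittest.TestCase):
    def test_timeout_is_logged(self):
        def slow_backend():
            # Stand-in for a real dependency that hangs and then times out.
            raise TimeoutError("simulated timeout")

        # The test fails if no ERROR record is emitted on the "checkout" logger.
        with self.assertLogs("checkout", level="ERROR") as captured:
            with self.assertRaises(TimeoutError):
                call_backend(slow_backend)

        self.assertIn("timed out", captured.output[0])
```

Saved in a module, this can be run with `python -m unittest`. If someone later removes the log call, the test fails, so the diagnostic stays in place for whoever answers the phone at 5 AM.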
I consider YAGNI to be a good guideline for a good architecture - i.e., one that is only as complicated as it needs to be. Having no architecture at all, and no-one explicitly or implicitly responsible for it, is a sure recipe for failure.
I agree with both you and Amr. However, what I see is that a lot of people use YAGNI as an excuse not to do any forward-thinking activity like architecture or design.
Am I the only one reading this who is amazed that you didn't have a test environment that exactly replicated (down to the firewalls used) the production environment? Saying that you CAN'T test your production architecture because you don't bother to is not a good enough answer. There are some things you cannot test effectively, but firewall rules should definitely be the same. Where do Agile Development methodologies recommend that functional testing be done in an environment that does not mimic production? Having said that, I admire your skill at finding the problem, and this is a good write-up of how to do this sort of low-level packet sniffing.
Angus, You make a great point, and one that I address in the book. One of my major themes is getting grounded and connecting with the actual deployment environment. It's the only way to have true confidence in what you deliver. Most companies will not build exact replicas for their test environments, though. They choose to save a bit of money by eliminating expensive network components like firewalls and hardware load balancers. This is a penny-wise, pound-foolish decision. Whatever money they save on network equipment will surely be lost in production outages. Nevertheless, budgetecture happens, particularly in QA. Sometimes, it's not as much a budget issue as it is a knowledge gap. Development may not know what the enterprise network will be, particularly if development is outsourced. Other times, the network architecture changes late in the game. I've heard, "We can't disrupt the QA environment now! We're too busy getting ready for release to lose a day while you change the network." Of course, what happens then when it does hit the real network? Anyway, I always fight to have the QA network match the production network. About 50% of the time, I win that fight. Cheers, -Michael Nygard
Excellent. So what does this have to do with Agile? Forgive me, but it is a good article with a misleading title. It seems to me you are only perpetuating the misunderstanding of Agile methods.
I guess I don't see the mapping between the problem encountered and either Architecture or Agile. I am not sure what the author proposes to do differently. An underlying concept in Agile is that not everything can be foreseen. I am not sure what could have predicted this problem or discovered it more quickly than actually fielding the software. It is only by fielding quickly that we can discover what we don't know.
There were several references to "no architecture". That's simply not possible. Everything has one whether you gave it any thought or not. What Agile doesn't do is try to design and build it before doing any other coding, which often involves trying to predict every darned thing the application(s) will ever need. Just build enough to support today's needs, be mentally and technically prepared to add or change or remove bits as the app evolves. To the original post - this was a fascinating story. That detective work would be beyond me and the project teams I know in my company. We do some "unplug the network cable" testing for failures, but this situation would have been way hard to predict and test for.
Have to agree with this one. There's no evidence here that a more "architectural heavy" approach would have inexorably discovered this flaw, nor is there any definitive proof that "Agile" and not "gross developer error" was the root culprit. It's a great case study right up until you start Agile Bashing. At that point, this article becomes FUD, pure and simple. Michael, please stop spreading it.
"Agile Bashing" and "FUD" are both very incendiary terms... not conducive to conversation at all. Neither is asking people to be silent. I will attempt to respond to the substance of your comment rather than the terms you've put it in.
My purpose here is certainly not to bash Agile. I've been a proponent and practitioner since before the moniker existed. I was doing unit testing, pairing, refactoring, and short iterations back when it was all just called "XP" or, more generally, lightweight methods. Several years back, I even quit my job to start a company explicitly built on agile methods. More recently, I spent an intense year in a fully agile Scrum/XP project. In the first 8 months, we delivered what the client had failed to deliver over the previous 2 1/2 years. In the next 4 months of my time on that project, we did six additional releases.
I'm speaking from within the Agile community, not from outside of it.
I can see that several people have misread my intention. I blame myself, as the author, for not being clear enough. I will try to make myself more clear here in the comments.
I don't attribute the failure here to a "failure of agile". Nor do I expect that agile methods, as formulated today, should have prevented this problem.
What I am presenting is a problem that has two very difficult characteristics:
I'm drawing an analogy to unit testing. In days past, people thought it was impossible to test software within the development environment. Testing was done in a test lab, by testers, using testing tools. We have rewritten those rules. We now understand that unit testing won't catch every bug, but it sure catches a lot of them. (And, yes, unit testing also motivated changes in the way we design the code itself. We don't mind that much, since the design changes needed for unit testing are all "virtues" that we endorse anyway: decoupling, isolation, single-responsibility, and so on.)
Furthermore, we use automation to solve problems once and keep them solved. So, once a bug is discovered, we write a test to verify the bug. Once we fix the bug, the test acts as a barrier to keep the bug from re-emerging. We use our suite of automated tests to "nail down" the functionality. (And, they allow us to retain existing value while incrementally adding more value.)
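As a minimal sketch of that "solve it once and keep it solved" cycle, consider the following. The function `average` and its empty-input bug are hypothetical, invented only to illustrate the pattern of pinning a fixed bug with a test.

```python
import unittest


def average(values):
    # Hypothetical function. Assume an earlier version divided by zero on an
    # empty list; the guard below is the fix.
    if not values:
        return 0.0
    return sum(values) / len(values)


class AverageRegressionTest(unittest.TestCase):
    def test_empty_input_returns_zero(self):
        # Written when the bug was found; it failed until the guard was added.
        self.assertEqual(average([]), 0.0)

    def test_existing_behavior_is_preserved(self):
        # Nails down the functionality that already worked before the fix.
        self.assertEqual(average([2, 4, 6]), 4.0)
```

Run with `python -m unittest`. The first test acts as the barrier against the bug re-emerging, and the second guards the value that was already there while new behavior is added.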
As a practice, automated unit testing supports many positive virtues. We don't expect it to prevent or solve every problem. There are known challenges---areas that work, sort of, but not very well---mostly around databases and GUIs. Despite those challenges, I would never give up unit testing.
My point in this piece is to ask a question, not to bash anyone or anything. Can we think of a practice, consistent with agile values, that would advance architecture work the same way that unit testing has advanced coding? I am asking this question by using a specific example of a general class of problems to illustrate a difficult, costly situation that I would like to have avoided rather than solved.
I'm asking this question because I see a need for more connection to the actual deployment environment: filesystems, servers, networks, databases, etc. There are times and places for isolation, but we cannot always be isolated from the deployment environment. By the way, this "disconnectedness" is not unique to agile developers. I suspect that the best solution to disconnectedness will come from the agile community. The Ivory Tower architects have already had their whack at it---and they responded with even larger diagrams that got further disconnected from the real environment.
I very much want to avoid problems like this one, but I've got a hundred stories like this. Some come from agile projects, most come from non-agile projects. Some come from projects with heavy "big architecture up front", some come from projects with incrementally developed architecture. I'm certainly not blaming "agile" for these problems. I'm looking for a solution to them, and for that solution, I'm asking the agile community if we can find a practice that fits our values: incremental, automated, expressive, "just-in-time", self-describing, executable documentation, and enabling.
Like software testing, traditional (heavy) approaches to architecture have not moved the needle on the quality gauge. Let's see if there's a way to do for architecture what unit testing did for software quality.
Hope this helps clarify my intentions.
I think the issue is that there will always be unknowns in software development and any test environment will always be just an approximation of the actual operating environment. We can only make our best guesses at what parts of the physical and operational environments are significant, what needs to be accurately reflected, what needs to be simulated or approximated, and what can be ignored. Yes, we should do the best we can do in testing given real-world constraints. The value we get from agile development, though, is risk reduction through a reduction in the costs sunk in an absolute failure. By having multiple, iterative releases, a failed iteration can be rolled back with only the sunk costs of 2-4 weeks of effort; however, we must plan for and be prepared to roll back. Iterative development moves us past the risks of all-or-nothing, one-shot project development. The way to address the risks of unknowns is to push them to the front of the queue and force them to arise as soon as possible. There is no way to plan to avoid the unknown; one can only force it to arise as early as possible, allowing time to recover from it.
Really nice article! One thing that was nicely illustrated was that discovering and fixing this problem was a team effort, not just a heroic effort by one "architect". What I do not see is how doing upfront architectural design would have prevented this from occurring (except armed with 20/20 hindsight). It seems to me that the same problem could easily have occurred in a waterfall project. The lack of unit tests and functional tests (and likely code bloat to handle dozens of other potential problems that never do happen in production) in most waterfall projects would have made the set of possible causes orders of magnitude larger. Being less sure that each unit was working correctly and that the system works correctly under normal conditions, discovering the root cause would have been much more difficult. Furthermore, once a fix was determined, establishing that the fix did not break anything else would have also been much more difficult without all those automated unit and functional tests. I do not even see how doing a few iterations of architectural spikes at the beginning of the project would have prevented this anomaly. In my experience, making sure the first story really goes end-to-end forces a slice of each architectural component to be implemented. This gives us the best of both worlds: validating everyone's understanding of the requirements as well as laying out and testing the architectural approach. Steven Gordon
Agreed! It's like saying a hammer is no good because it can't drive a screw, while forgetting that a stone can't drive a screw either.