InfoQ

Interview

Michael Nygard on Building Resilient Systems

Interview with Michael Nygard by Ryan Slobojan on Aug 24, 2009

Community: Architecture
Topics: Methodologies
Tags: QCon, Interviews, QCon London 2009, Large Projects
Summary
Michael Nygard on: feature complete vs. production ready, how to make a system more resilient and monitorable, explaining stability patterns like Bulkhead and Circuit Breaker, and the need for the development department to cooperate with the operations one and the business managers.

Bio
Michael Nygard is a software architect with over 15 years of experience designing and writing applications for the US Government, military, banking, finance and retail industries. He speaks frequently at conferences like QCon, No Fluff Just Stuff and JAOO, and he is the author of Release It! and co-author of Java Developer's Reference.

About the conference
QCon is a conference that is organized by the community, for the community. The result is a high quality conference experience where a tremendous amount of attention and investment has gone into having the best content on the most important topics presented by the leaders in our community. QCon is designed with the technical depth and enterprise focus of interest to technical team leads, architects, and project managers.
Hi, my name is Ryan Slobojan. I'm here with Michael Nygard. Michael, what's the difference between feature complete software and production ready software? Is there a difference?
As a developer, what can one do to think about these kinds of considerations?
One of the principles of Agile and of Lean development is deferring a decision until the last responsible moment. Do you think that considerations like these are really something you can defer, or is the last responsible moment not at the end of the project but closer to the beginning?
One of the problems that I've seen with respect to things like load testing, QA and all the other things that have to happen during the software development process is that you need to have a certain amount of hardware available in order to be able to do that. So, you would dedicate, say, 3 servers to your long running testing and you have 3 servers for QA and so on, and that tends to be very prohibitive cost-wise. Do you think that the rise of cloud computing and platform as a service, having the ability to just instantiate a bunch of machines out in the cloud, will help increase the adoption of these kinds of processes?
An approach that seems to be taken frequently is to assume that you can harden the system and make it production ready shortly before release. Do you believe that's really possible?
What are some of the patterns that a developer or an architect can follow to make a system more resilient?
How can you approach making an existing legacy system more resilient and more monitorable?
It was resiliency with respect to existing legacy systems.
Referring to your point about monitoring, have you got any tips or recommendations for the types of information that can be verified by those different mechanisms, like logging or monitoring through JMX? I'm thinking of the observation that quite often it's only after the problem has happened that you realize you wish you had had a bit more monitoring or logging in place at the time.
You heard Richard Gabriel talk about ultra large scale systems about a month ago?
The things that you know now and are confident of now with large complicated systems, are they going to be equally applicable and useful when you get into ultra large scale complex systems, particularly when you start integrating human components with the merely mechanical or electromechanical components?
It seems that the operation of an application is often considered a secondary concern to the development of an application. Why do you think that is?
You mentioned the ongoing operations cost of a system. With many systems that I've seen in the past, the operations cost can be very opaque or you're not certain what causes it. For instance, you have a certain number of servers, a certain amount of load and there isn't really the understanding or the visibility in the system that allows you to say "If we were to tweak this thing, we could go from 30 servers to 5". What are your recommendations for how you can look at a system and determine that and see where do you think there might be something that you can do, if you have an existing goldfish-like system?
You very correctly pointed out the need to have the ops people and development people integrated in order to deal with some of these problems. Don't you also need to start thinking about the business people and their workflow and the kind of assumptions that they make about the way things should be, and even the unspoken assumptions that IT makes about the need to have a relational database as opposed to formally and flat files distributed across the network? These kinds of assumptions matter, but most importantly, the business and their workflow have an impact on performance and scaling and everything else. How do you bring them into the cycle as well?
  1. Mu, not Mule

    Aug 25, 2009 9:41 AM by Mark Wutka

    In reply to the first question, I think it should be "the only possible answer for that question is 'Mu'" - it's a Zen reference.

  2. Re: Mu, not Mule

    Aug 25, 2009 1:09 PM by Ryan Slobojan

    Thanks for catching that - the transcript has been updated.

  3. Re: Mu, not Mule

    Aug 25, 2009 8:25 PM by totoro totoro

    "mu", as in nothing. Not "mu" as in the Greek letter. They sound completely different. Say "moo", shorten the vowel and what you get is "mu" meaning nothingness.
