Jason Little discusses how to avoid an organizational change failure when introducing Agile by leveraging principles of Lean Startup and Customer Development.
Jeremy Edberg discusses how Netflix designs their systems and deployment processes to help the service survive both catastrophic events like zone and regional outages and less catastrophic events like network latency and random instance death.
Ariel Tseitlin discusses Netflix's suite of tools, collectively called the Simian Army, used to improve resiliency and maintain the cloud environment. The tools simulate failure in order to see how the system reacts to it.
Mike Andrews discusses architecting for failure even when you don’t know what might fail.
Poul-Henning Kamp argues that unless developers get better, we are going to repeat many of the major IT project failures. He illustrates the point with major project failures in Denmark.
Michael Brunton-Spall talks about various types of system failure that can happen, sharing the lessons learned at the Guardian and measures taken to prevent and mitigate failure.
John Allspaw discusses pitfalls to avoid while troubleshooting failed systems, comparing web operations at scale with practices in the aviation and nuclear power industries.
Blake Mizerany presents various ways in which distributed systems can fail and how to recover using Doozer, a highly available, consistent data store.
Justin Sheehy talks about failure and the need to prepare for it, giving some real life examples along with techniques implemented in Riak to make it resilient to faults.
Robert Myers talks about the role played by failure in Agile development, sharing a number of Lean and Agile practices that help teams embrace failure and showing how to interpret the feedback received.
Herbjörn Wilhelmsen discusses the reasons why an SOA project failed while trying to reuse existing resources, and how it succeeded later starting from the same business case with reuse in mind.
Justin Sheehy explains why a paradigm shift is necessary when dealing with large concurrent distributed systems and outlines some of their requirements: sharing no global state, moving from ACID to BASE and CAP, getting rid of RPC and using protocols instead of APIs, preparing for failure and degradation, understanding the harvest-yield balance, and using measurement.