On Distributed Failures (and handling them with Doozer)
Blake Mizerany presents various ways in which distributed systems can fail and shows how to recover from those failures using Doozer, a highly available, consistent data store.