All Right It Failed, What Next?
Failures usually result in anger, frustration, and the blame game. However, a failure is wasted if nothing is learned from it. How can Agile teams make failures beautiful?
Rather than blaming people, I blame the process. What is it about the way we work that allowed this mistake to happen? How can we change the way we work so that it's harder for something to go wrong? This is root-cause analysis.
One of the most effective ways of doing root cause analysis after a failure is the 5 Whys technique, which has its origins in lean manufacturing. It finds the root cause of a problem by identifying a symptom and then repeating the question “Why?” five times. The solution usually becomes clear after five iterations of asking why.
Another technique used by some Agile teams is the Fishbone diagram, which looks at the big picture around the problem. In fact, the fishbone diagram is often a very useful way to visualize the 5 Whys process. A related yet interesting technique suggested by Joel Spolsky is the 'Fix it Twice' method. It suggests applying a quick fix for the incident so that the team can move on, and then a slower fix that prevents the incident from occurring again.
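As a minimal sketch of how a team might record such a chain of questions, the Python snippet below stores a symptom and the answer given at each round of "Why?". The `FiveWhys` class, its method names, and the incident itself are hypothetical and invented purely for illustration; they do not come from any particular tool.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FiveWhys:
    """Records a symptom and the answers to repeated "Why?" questions."""
    symptom: str
    answers: List[str] = field(default_factory=list)

    def ask_why(self, answer: str) -> None:
        """Record the answer to "Why?" asked of the previous statement."""
        self.answers.append(answer)

    def root_cause(self) -> str:
        """Return the deepest answer so far (the candidate root cause)."""
        return self.answers[-1] if self.answers else self.symptom

    def report(self) -> str:
        """Render the whole chain, one line per round of questioning."""
        lines = [f"Symptom: {self.symptom}"]
        for depth, answer in enumerate(self.answers, start=1):
            lines.append(f"Why #{depth}: {answer}")
        return "\n".join(lines)


# Hypothetical incident, made up for this example.
analysis = FiveWhys("Checkout page returned errors for ten minutes")
analysis.ask_why("The application ran out of database connections")
analysis.ask_why("A new report query held connections for too long")
analysis.ask_why("The query was never tested against production-sized data")
analysis.ask_why("The test environment only has a small sample data set")
analysis.ask_why("Nobody owns keeping test data representative")

print(analysis.report())
print("Candidate root cause:", analysis.root_cause())
```

Note that the last answer points at the way the team works rather than at a person, which is where the questioning is meant to lead.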
So what is the best way to conduct a root cause analysis?
- Get the right people in the room.
- Create the right environment for blameless problem solving.
- Don't stop unless the real problems and solutions have been identified.
- Don't be satisfied with a single root cause. Many situations are more complex than that.
- Don't settle for "human error" as the outcome.
Likewise, Gojko Adzic reported Douglas Squirrel's suggestion that, once all the affected parties are together, the group should hold a poll to identify the problems. Once the problems are identified, follow the 5 Whys technique until it hurts; if it does not hurt, you are not doing it right. A very important next step is to define outcomes that are proportional to the problem.
Don’t get carried away and “retrain your development team because of five minutes of downtime”, said Squirrel, “but define tasks proportionate to the problem”. “It’s not necessary to solve problems, but make progress”, said Squirrel. Instead of gold-plating solutions, he suggested acting quickly. “If you do it wrong, it will come back again”. Solutions that take too long will never get done, so Squirrel suggested thinking about what you can do in a week or even in an hour, and building up the solution the next time a problem happens.
Jim, too, suggested that the real work begins once the root cause analysis is over. It is easy for people to slip back into delivery mode and forget about the failures. However, the tasks that come out of the root cause analysis need to be actively managed and tracked in the backlog. Metrics need to be collected, and people need to be made aware of the right way of working.
You’ll need to use metrics and cost data to drive behavior and to drive change, and to decide how much to push and how often: are you changing too much too often, running too loose; or is change costing you too much, are you overcompensating?
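One way to make that question concrete is to record, for each change, whether it caused an incident and how long recovery took. The Python sketch below is only an illustration under assumptions: the `Change` record, the two metrics chosen, and the sample data are all hypothetical, and a real team would pick the measures that fit its own process.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Change:
    """One deployed change (hypothetical record layout)."""
    caused_incident: bool
    recovery_hours: float  # time spent recovering; 0.0 if no incident


def change_failure_rate(changes: List[Change]) -> float:
    """Fraction of changes that caused an incident (0.0 for no changes)."""
    if not changes:
        return 0.0
    failed = sum(1 for change in changes if change.caused_incident)
    return failed / len(changes)


def total_recovery_hours(changes: List[Change]) -> float:
    """Total time spent recovering from incidents caused by changes."""
    return sum(change.recovery_hours for change in changes)


# Made-up sample data for one iteration.
changes = [
    Change(caused_incident=False, recovery_hours=0.0),
    Change(caused_incident=True, recovery_hours=3.0),
    Change(caused_incident=False, recovery_hours=0.0),
    Change(caused_incident=True, recovery_hours=1.5),
    Change(caused_incident=False, recovery_hours=0.0),
]

rate = change_failure_rate(changes)
hours = total_recovery_hours(changes)
print(f"Change failure rate: {rate:.0%}")
print(f"Recovery cost: {hours} hours")
```

Tracked from one iteration to the next, numbers like these give the team a basis for judging whether it is running too loose or overcompensating.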
Thus, failures are best utilized as learning grounds. The key lies in identifying the root cause and tracking the 'proportional to problem' solution tasks actively to closure.
Built-in Self Regulation
A related yet interesting technique suggested by Joel Spolsky is the 'Fix it Twice' method. It suggests having a quick solution for fixing the incident so that the team can move further and then having a slower fix, which prevents the incident from occurring again.
That second fix should come under software resiliency engineering, which I think is going to be a big area of concern in the coming years. It starts with software being imbued with self-observation and self-regulation capabilities that are continually extended through knowledge acquisition during incident and problem management.
Automated Performance Management starts with Software’s Self Observation
Activity Based Costing & Metering (ABC/M) – The Ultimate Feedback Loop