Soft Skill Patterns for Software Developers: The “Learning from Unintended Failures” Pattern

Key Takeaways

  • Soft Skill Patterns are combinations of personal and interpersonal behaviours that are proven to solve commonly occurring problems.
  • System failures are almost impossible to entirely avoid, but each failure offers improvement opportunities.
  • The “Learning from Unintended Failures” pattern guides us to improve the resilience of our system following a failure incident.
  • There are four distinct steps in the pattern: identifying a failure, quickly resolving any immediate impact, analysing root cause and system behaviour during the failure, and finally generating and implementing improvement ideas.
  • Driving system resilience improvements from failures can only happen where open, honest and blame-free incident reviews are held.

What are Soft Skill Patterns?

Software developers require strong soft skills to effectively solve many of the problems which we face.

Peter F Drucker, the famous management educator, tells us that “Doing the right thing is more important than doing the thing right.” Intuitively this adage makes sense. For programmers, is there any value in building a great product that no one wants?

Soft skills, which include communication, team-work, and problem solving, define our ability to “do the right thing.” Our hard (technical) skills can only help us “do the thing right.” Our soft skills are therefore, arguably, more important to our effectiveness—our ability to deliver value—than our hard skills.

Ever since the “Gang of Four” gave us “Design Patterns: Elements of Reusable Object-Oriented Software” in 1994, software developers have understood the benefits of well-known patterns. We know that, whilst no two problems we face are ever identical, recurring themes are often identifiable.

Once we identify such themes, we can turn solutions that prove effective into defined reusable patterns. These patterns not only help us solve common hard skill problems effectively, but also reduce the time it takes us to make a decision and increase shared understanding of the solution.

So, if we can benefit from patterns that solve hard skill problems, can we do the same for soft skill problems? 

In this article we will take a look at a Soft Skill Pattern we can use to help us drive big improvements following a system failure. We are going to walk through the “Learning from Unintended Failures” pattern.

Why This Pattern?

It is a frustrating truth that software systems sometimes fail. These failures impact the system's users, so a primary goal of the system's developers is to minimise both the failures and their impact. Fortunately, every failure provides learning opportunities to improve the resilience of the system.

The “Learning from Unintended Failures” pattern is a four-step approach where unintended system failures are identified, resolved as quickly as possible to limit impact and then analysed to establish root cause. Improvement ideas are generated based on the analysis and then delivered.

This pattern appears very well-known—even obvious—to many at first glance. The real benefits from this approach are only gained, however, if the analysis is effective and thorough and the ideas are actually implemented. This pattern describes an effective method for gaining real system improvements following system failures.

Real world examples of Learning from Unintended Failures are all around us. Flood barriers, sprinkler systems in buildings, electrical fuses, and airbags in vehicles are just a few examples of human responses to previous failures.

What is the “Learning from Unintended Failures” Pattern?

The "Learning from Unintended Failures" pattern describes an effective approach that ensures maximum value is gained from an unexpected failure of a system. For the purposes of this pattern, "system" refers to any software-based solution or product.

This pattern takes inspiration from Matthew Syed's book "Black Box Thinking". Syed introduces his book with: "...it is about the willingness and tenacity to investigate the lessons that often exist when we fail, but which we rarely exploit."

Virtually every system failure is avoidable with hindsight. It is inevitable, however, that some failures will impact even the most prepared. This pattern helps us harvest valuable learning from all severities of failure: from the near-misses to the mission-critical outages. New ideas to significantly improve a system typically appear, to the observant, during a high-profile failure. Even when applied to near-misses or minor failures, the pattern will generate improvement ideas. These ideas can be a prime driver towards system resilience—as long as they are acted on.

How to Use the Pattern

Let’s take a look at how to use this pattern, step-by-step. You will find each of these steps in the pattern definition diagram above. I’ll highlight the primary soft skills used in each step, too.

Step 1: A failure occurs to your system.

This is the entry point to the pattern. The system has operated in an unintended and sub-optimal way. This may have caused a negative impact to the use or output of the system. The following steps will help you resolve the issue and make long-term improvements to your system.

Step 2: Resolve any immediate impact the failure has caused.

If a negative impact was caused by the failure, then this should be resolved immediately. The “Broken System Crisis Resolution” pattern describes an effective approach for this activity in detail. One important consideration is to ensure that any action does not make the impact worse. Ineffective and detrimental "fixes" are a real risk when there is urgency to resolve a fault.

During this step, the soft skills applied will include: Cool-Headedness, Problem Solving Skills, Risk Awareness, Collaboration, and Communication Skills.

Once the system is returned to its normal state you can proceed to the next step.

Step 3: Analysis of what went wrong: root cause and the system's behaviour.

Many soft skills will be practised during this step. Root Cause Analysis needs to be performed to understand what caused the system to fail. Logical Thinking and Analytical Thinking will be needed to understand exactly how the system behaved during the failure. Collaboration Skills will also be used as this step is ideally performed by more than one person.

There are many theories and practices for performing Root Cause Analysis (RCA). When investigating system failures, one RCA practice that often proves effective is the "5 Whys" approach. Wikipedia provides this simplified example of the "5 Whys" approach:

  1. The vehicle will not start. (the problem)
  2. Why? - The battery is dead. (First why)
  3. Why? - The alternator is not functioning. (Second why)
  4. Why? - The alternator belt has broken. (Third why)
  5. Why? - The alternator belt was well beyond its useful service life and not replaced. (Fourth why)
  6. Why? - The vehicle was not maintained according to the recommended service schedule. (Fifth why, a root cause)

"5 Whys" ensures the investigation searches deeply enough to find the real root cause, not just a shallow effect of the root cause. Fixing these effects can be useful, but more material benefits are gained by resolving the deep root causes. In the example above, we could replace the alternator belt, but this would only resolve this specific problem. Fixing the root cause, the lack of maintenance, would give much broader benefits.

Identifying the root cause(s) will allow improvements to be made that will stop a recurrence of the same failure. More benefits may be attainable by gaining a thorough understanding of the system's behaviour during the failure. The root cause led your system to follow an abnormal flow through its process. Fixing the root cause will stop one entry point to the abnormal flow. Analysis of the abnormal flow itself will often yield additional improvement opportunities, covering broader possibilities than the root cause.

To understand the possible benefits of analysing the system behaviour during the failure, let's pretend we go for a woodland hike. We are reliant on the signs as we don't know the area. After a while we reach a crossroads where the signpost has fallen down. It's not clear which way we should be going. We are now in abnormal flow: what do we do? It is clear that the root cause is the fallen post, but our behaviour now will define its impact. If we take a wrong route or run around screaming in a panic, we might look back later and think "going back the way we came would have been preferable". We would then have learned, not from the root cause, but from analysing our behaviour following it.

Step 4: Make improvements to your system based on what you have learnt.

During the previous step, you identified root cause(s) of the failure and gained a thorough understanding of how your system behaved during that failure. During this step you will use this new learning to generate improvement ideas, select which of these ideas to implement and deliver them. The final stage will be to test and measure the effectiveness of the improvement idea.

Generate ideas

In many cases the root cause(s) and system behaviour provide obvious improvement ideas. Extending the vehicle example introduced in the previous step, it is clear from the analysis that, in addition to replacing the alternator belt, we need to take the vehicle for a service.

Improvement ideas can be generated by defining the behaviour you would have liked your system to have taken during a failure. Consider an example where analysis of a system failure concluded that the cause was a broken hard drive on your only server. In this scenario you might identify that with a second "hot" server running, the failure's impact would have been vastly reduced.
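The "second hot server" idea can be sketched in a few lines. This is a minimal illustration only: `call_with_failover`, `primary` and `standby` are hypothetical names, and treating `OSError` as the failure signal is an assumption made for the example.

```python
from typing import Callable, Optional, Sequence


def call_with_failover(servers: Sequence[Callable[[str], str]], request: str) -> str:
    """Try each server in order and return the first successful response."""
    last_error: Optional[Exception] = None
    for server in servers:
        try:
            return server(request)
        except OSError as error:  # assumed failure signal, e.g. a broken disk
            last_error = error
    raise RuntimeError("all servers failed") from last_error


def primary(request: str) -> str:
    # Hypothetical primary server whose only hard drive has failed.
    raise OSError("disk failure")


def standby(request: str) -> str:
    # Hypothetical hot standby that is already running.
    return f"standby handled {request}"


print(call_with_failover([primary, standby], "GET /prices"))
```

With only `primary` in the list, the same call raises an error, which mirrors the single-server failure described above.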

The ideas you generate should not be restricted to the physical system—the running code. Improvements to the way a system is used and the ability to identify failures may well be identifiable when analysing a system failure.

When generating ideas be Creative and "Dream Big". Be aware of, but not controlled by, constraints such as budgets, time, required skills etc.

Select ideas to deliver

When selecting ideas to deliver, initially consider each idea independently. You may ultimately select one, some, all or none of your ideas. For an idea to be selected for delivery, it needs to pass three subjective tests:

  • Do the expected benefits outweigh the expected costs?
  • Do you believe that the idea can be delivered effectively?
  • Does the idea have a senior sponsor who can remove the barriers to delivery?

If any of these tests fail, then the idea should be dropped. Put the remaining ideas through one final test: Do you have the resources to deliver all the ideas? If not, use Comparative Analysis to finalise the selection.
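The selection rules above can be expressed as a small filter. The sketch below is one possible reading, not a prescribed method: the `Idea` fields, the numeric benefit and cost estimates, and ranking by net benefit as the "Comparative Analysis" step are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Idea:
    name: str
    expected_benefit: float  # subjective estimate, any consistent unit
    expected_cost: float     # same unit as expected_benefit
    deliverable: bool        # do we believe it can be delivered effectively?
    has_sponsor: bool        # is there a senior sponsor to remove barriers?


def passes_three_tests(idea: Idea) -> bool:
    """An idea survives only if it passes all three subjective tests."""
    return (idea.expected_benefit > idea.expected_cost
            and idea.deliverable
            and idea.has_sponsor)


def select_ideas(ideas: list[Idea], budget: float) -> list[Idea]:
    """Drop ideas that fail any test, then fit the rest within the budget."""
    candidates = [idea for idea in ideas if passes_three_tests(idea)]
    if sum(idea.expected_cost for idea in candidates) <= budget:
        return candidates
    # Assumed comparative analysis: prefer the highest net benefit first.
    ranked = sorted(candidates,
                    key=lambda idea: idea.expected_benefit - idea.expected_cost,
                    reverse=True)
    selected, spent = [], 0.0
    for idea in ranked:
        if spent + idea.expected_cost <= budget:
            selected.append(idea)
            spent += idea.expected_cost
    return selected
```

The estimates stay subjective; the code only makes the order of the checks explicit.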

Deliver the selected ideas

Value is only attained from ideas when the change defined by the idea is delivered. Until delivery, ideas are a cost to the organisation, not a benefit. It is like buying a book that you never read.

By following the pattern, ideas will only have been selected that have a strong chance of being delivered effectively. Often the hardest part of the delivery process is getting started. As with those books on your shelf that you have never read—the first page always looks challenging. Be Decisive and start the delivery process. If you encounter barriers, use the idea's senior sponsor to remove them.

Measure and test the value attained

Effective development processes continuously assess whether the expected value is being delivered and adapt throughout the process. Delivering value from improvement ideas is no different. Regularly re-assess the three tests: is the idea still beneficial, viable, and sponsored? Be Adaptable to get the delivery back on course if needed.

When an idea has been delivered, it is time to evaluate the benefit that has been attained. Two evaluation approaches that can be used are measuring and testing.

Where possible, objectively compare before-and-after relevant measurements. Quantified comparisons clearly demonstrate the value attained from the change. An example of measuring would be "Prior to the change, the System could process 10 concurrent transactions. It can now process 100 concurrent transactions."

Quantifiable measurements are not always available. Value can also be demonstrated by testing a specific scenario. For example: "Previously, errors were typically first identified by affected users and this resulted in complaints. Following the change, all system errors are now trapped by the system and result in immediate automated text messages being sent to the support team. This provides an opportunity for us to fix the system before users are affected."
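The scenario in that example, where errors are trapped by the system and the support team is alerted, might look something like the sketch below. The names `run_with_alert`, `load_feed` and `notify` are hypothetical, and a real system would send a text message rather than append to a list.

```python
from typing import Callable


def run_with_alert(step: Callable[[], None], notify: Callable[[str], None]) -> bool:
    """Run one processing step; on any error, alert support instead of failing silently."""
    try:
        step()
        return True
    except Exception as error:
        notify(f"{step.__name__} failed: {error}")
        return False


def load_feed() -> None:
    # Hypothetical step that hits bad data.
    raise ValueError("bad value")


alerts: list[str] = []          # stand-in for an SMS or paging service
succeeded = run_with_alert(load_feed, alerts.append)
print(succeeded, alerts)
```

A test for the improvement can then assert that a deliberately broken step produces exactly one alert.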

Outcome 1: Improved system resiliency.

The primary outcome from following this pattern is that the system is now more resilient than it was prior to the failure. Following the steps in the pattern should have helped you identify improvement ideas, select the most beneficial, and attain value by delivering them. At the very least, you would have a better outcome if your system were to suffer the same problems again. Naturally, the value attained is dependent on the number and effectiveness of the ideas and the delivery.

Outcome 2: Better understanding of the system.

The steps of this pattern require you to look deeply into how your system behaves in normal and abnormal situations. Numerous people will have been involved in this process, so the understanding is not only deeper, it is also more broadly distributed. This greater understanding is likely to influence future enhancements to the system and improve processes that build and use it. Having greater understanding of a system gives you greater control of it.

Outcome 3: More effective cultural reaction to failure.

As we have seen, positively and openly addressing failures can drive significant improvements to a system. Alternatively, if an organisation has a culture where failures are hidden, denied or obfuscated by people fearful of blame, the learning opportunities are missed. A repeated pattern of failure-driven-progress will remove the fear, increase the openness and drive further progress. And of course, an open culture is far more pleasant to work in than one looking for the next person to blame.

The Pattern in Action

Note: this is a fictitious story with fictitious characters (apart from me).

My mobile rang at 01:24, which could only mean one thing. "Kevin, the production system is down. Nothing is working! The users here in Singapore are seriously unhappy. What was in that change you guys made?" shouted a panicked Ed, our support lead. "Morning Ed...", I replied as coherently and calmly as I could muster, "...see if you can get Fiona on a conference call and give me 5. Let's work the problem."

The conference call took place with me (Kevin, London dev lead), Ed (New York, support lead) and Fiona (Singapore dev lead). With help from others it took around four hours to understand what had gone wrong and to get the system back online. We will save the details of those four hours for the "Broken System Crisis Resolution" pattern. This "Learning from Unintended Failures" pattern is primarily concerned with what happened after the crisis was resolved and the system restored to health.

Shortly after the fix was implemented, Ed and I received an email from Gail, the head of the Singapore Sales team—one of our system's primary users. The message read:

"That's the fifth time in two months that the new Sales system has been unusable. How could those guys not test that new data feed? I am running the Post Incident Review myself—it’s in an hour. Get everyone on the phone who had anything to do with this. Especially Hardeep!"

I could see what was coming. It was clear that Gail had already decided what had happened and who was to blame. For a short moment, I was tempted to do what she had asked. After all, most of her wrath wasn't being aimed at me. Hardeep was the developer responsible for the change. He had helped us identify and resolve the issue. At the time he couldn’t explain why the issue hadn’t occurred during the extensive testing he said he had performed.

I picked up the phone a little cautiously to ring Gail. I knew I had to do the right thing, even at the risk of attracting negative attention. "Hi Gail. I just wanted to have a quick chat about this review. Any chance we can delay it 24 hours? While we were fixing the issue, I found a few things that don't look quite right. They might explain why Hardeep is so convinced he tested the change properly." Gail sounded flustered in her reply, "Really? Well, ok then. I've got so much to sort out here now anyway. Tomorrow though. No later. This can't keep happening. How can we sell anything when the sales system is down?"

Our Sales system receives data from many sources. The change that Hardeep developed added some new fields to an existing input feed (a plain text file) of Market Data. The feed is critical to our system with the rest of the process depending on this data.

Root Cause Analysis established that the issue occurred because the updated feed included some values in scientific notation (like 1.5E-5) for the first time. Our production system didn't recognise these values as numbers due to the “E” character, and threw an error. This was not identified during the extensive testing of the change as the test environment used was significantly different to the production environment. The scientific notation values had been used on the test environment without issue.
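To illustrate how two environments can disagree about the same value, here is a small sketch. It does not reproduce the Sales system's actual parsing, which the story does not describe. `is_plain_decimal` is a hypothetical stand-in for a strict production-style check, and `is_float` stands in for a more lenient test-environment check.

```python
import re

# Hypothetical strict rule: optional sign, digits, optional decimal part.
PLAIN_DECIMAL = re.compile(r"[+-]?\d+(\.\d+)?")


def is_plain_decimal(text: str) -> bool:
    """Strict check that rejects anything containing an exponent."""
    return PLAIN_DECIMAL.fullmatch(text) is not None


def is_float(text: str) -> bool:
    """Lenient check: anything Python's float() can parse counts as numeric."""
    try:
        float(text)
        return True
    except ValueError:
        return False


for value in ["0.000015", "1.5E-5"]:
    print(value, is_plain_decimal(value), is_float(value))
```

Both checks accept `0.000015`, but only the lenient one accepts `1.5E-5`. That is the kind of inconsistency that lets a change pass testing and then fail in production.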

I sketched a diagram showing the normal system flow and the actual behaviour following the change so that we could refer to it during the meeting.

Investigation of the system behaviour during this failure yielded some surprising findings. Firstly, we found that the error caused by the bad data was not thrown until very late in the overall process. The data loading process wasn't affected, as the data is stored in an unstructured manner. Identifying the bad data so late in the process meant there was no time to correct it. Secondly, it was clear that the whole system was dependent on the arrival of this data. Without it the system was unusable. This wasn't just the case with the Market Data feed; it applied to every data feed.

Ed hosted the review meeting the next day. We took two hours to go through all the details. We added changes to the sketch that would give us preferable behaviour were the same events to recur. The first change would catch the issue immediately and notify the data providers. If they weren't able to send corrected data by the critical time, the second change would allow us to reuse the latest data. With both changes in place a repeat of this issue would have negligible impact and the Sales System would be usable.
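The first change, catching bad data at load time and notifying the provider, could be sketched as follows. Everything here is an assumption for illustration: the comma delimiter, the rule that every field is numeric, and the names `validate_feed`, `is_plain_decimal` and `notify`. The important design point is that the load-time check applies the same rule as the later processing steps.

```python
import re
from typing import Callable

# Hypothetical rule shared with the downstream processing steps.
PLAIN_DECIMAL = re.compile(r"[+-]?\d+(\.\d+)?")


def is_plain_decimal(text: str) -> bool:
    return PLAIN_DECIMAL.fullmatch(text) is not None


def validate_feed(
    lines: list[str],
    is_valid: Callable[[str], bool],
    notify: Callable[[str], None],
    delimiter: str = ",",
) -> bool:
    """Check every value as the feed is loaded; notify on the first bad one."""
    for line_number, line in enumerate(lines, start=1):
        for value in line.split(delimiter):
            if not is_valid(value):
                notify(f"line {line_number}: {value!r} is not numeric")
                return False
    return True


alerts: list[str] = []  # stand-in for a message to the data providers
print(validate_feed(["1.0,2.5", "3.0,1.5E-5"], is_plain_decimal, alerts.append))
print(alerts)
```

Failing at load time leaves the providers time to resend corrected data before the critical time.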

The meeting concluded with a recorded failure summary, a root cause and a list of resolution actions. Everyone agreed that these actions were now our top priority, more important than building the next set of features. Hardeep was delighted with the opportunity to show everyone the diligent testing he had performed. Even Gail seemed happy with the outcome. At the end of the review she commented “Thanks everyone for taking this so seriously. I can see that it wasn’t just down to a lazy mistake as I first thought. I am really confident that these actions will make a big difference. Please just make sure they happen.”

The Failure Summary

“A change was made to the Sales System on 9th September intending to add four new fields to the Market Data feed. The change unintentionally altered some values on an existing field to be sent in scientific notation. These unexpected values were treated as non-numeric by the System. An uncaptured error was thrown when attempting to read these values during the daily batch process. The dependent process steps did not start due to the error. The Sales system was inoperable as the daily batch process had failed.”

The Root Cause

“The failure was caused by a defective test environment. The data type issue was not identified during testing due to inconsistency between the test and production environments.”

The Resolution Actions

  1. Environments: ensure test environments are consistent with the production environment to maximise the effectiveness of testing.
  2. Data Quality: add a validation to the load process for all feeds to ensure all values match the expected data type. Send an alert when any mismatch is identified and remediate urgently.
  3. Batch Process Dependencies: alter the daily batch process so that the previous day's data is re-used if new data is not available on time. Using old data where necessary is preferable to having no system.
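The third action, reusing the previous day's data when new data is unavailable, might be sketched like this. The function and parameter names are hypothetical. The sketch assumes the loader for today's feed returns `None` when nothing has arrived by the cut-off and raises `ValueError` when validation fails.

```python
from typing import Callable, Optional


def choose_batch_data(
    load_today: Callable[[], Optional[list[str]]],
    load_previous: Callable[[], list[str]],
    notify: Callable[[str], None],
) -> list[str]:
    """Return today's data if it arrived and is usable, else the previous day's."""
    try:
        data = load_today()
    except ValueError as error:  # assumed signal for a failed validation
        notify(f"today's feed unusable: {error}")
        data = None
    if data is None:
        notify("falling back to previous day's data")
        return load_previous()
    return data


alerts: list[str] = []
# Hypothetical loaders: today's feed has not arrived, yesterday's is on hand.
print(choose_batch_data(lambda: None, lambda: ["yesterday's prices"], alerts.append))
print(alerts)
```

The alert matters as much as the fallback: the system stays usable, but someone still knows the data is stale.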

Anti-Patterns: Pitfalls to Avoid

In the example, following the pattern ensured some great system improvements were made. There are some important things to be aware of when incorporating this pattern into your decision making.

We don’t make mistakes here.

Let’s get this myth out of the way straight away. Everyone makes mistakes and if you are a software developer those mistakes will cause failures to your system. The trick is to learn from mistakes, that is, don’t make the same mistake twice. You can only do that if you accept that the mistake was made and take action.

Don’t act on the learning.

I have frequently observed this anti-pattern. First there is the panic of the crisis. Then the excitement of the resolution. The optimism of the review. And finally: the de-prioritisation of the actions. Until the next crisis. Caused by the same issues. Fixable by the same actions. It goes without saying that those great actions you identify only help if you actually action them. They can’t help you whilst they are sitting in your backlog.

Reaching a premature conclusion.

It is very easy to make an early judgement on the cause of a failure. This is hard to avoid, so a great approach is to delay sharing this judgement with others until the facts become clearer. Announcing an early opinion publicly often causes two detrimental effects. Firstly, it artificially influences others to hold the same view. Even when entirely flawed, a declared opinion can come to be treated as fact, especially in the absence of contrary facts. Ever played Planning Poker when someone showed their card a little early? Did it make you reconsider your choice? Secondly, Cognitive Dissonance Theory teaches us that, once we have made our opinions known to others, the personal cost of changing those opinions can be so great that we close our eyes to any contradictory evidence. In our example, delaying the review meeting by 24 hours gave everyone a chance to gather facts before sharing opinions and drawing conclusions. Had the review happened straight away, far less valuable conclusions would have been drawn.

Blame and self-preservation.

Virtually all the valuable learning that can be gained from a failure relates to what went wrong. Very rarely does it matter who was involved. There are of course rare cases of negligence or intentional detriment. In those cases, appropriate action should be taken with the individual. Start your review of the facts by assuming there was no intent and no negligence. Focus on what went wrong and how you can make changes to prevent it recurring. In a culture where blame is prevalent, individuals will hide learning-rich failures and facts to protect themselves, and others will be driven out for inevitable mistakes. Both of these events weaken the organisation and remove the opportunity to learn from the failure.

We are too busy for reviews.

Do you perform post-incident reviews? If not, how are you going to learn the lessons from those incidents? Yes, an effective review takes time, but a few hours can discover huge improvements to your system. Don’t sprint in the wrong direction, jog in the right direction—you will get there quicker.

Simpler and Learning from Experience

Soft Skill Patterns come down to finding ways to make our jobs simpler by learning from experience, both our own and that of others. Keeping things simple makes us more effective in our jobs and our lives, and helps us to constantly “do the right thing.”

We can use patterns to solve problems related to soft skills (communication, teamwork, problem solving) just as we use them to solve hard skill problems.

Consider the potential value of Soft Skill Patterns, and give the “Learning from Unintended Failures” pattern a try. I hope that if you find it works for you, you’ll consider giving other patterns a try, too.

About the Author

Kevin Jackson is an agile-certified Software Development Manager with two decades of experience working in the financial services industry. During this time, he has performed most software development practices including coding (mainly .NET), analysis, testing, support, project management, architecture, stakeholder management and team leadership. Kevin gets a buzz out of collaboratively creating simple solutions to complex problems that his users love to use. His approach is to: focus on the goal, reduce the waste, work together and iterate. And if that doesn't work, then at least he will have learnt something new! You can find him on Twitter as @softskillpatterns.
