Bug Fixing Vs. Problem Solving - From Agile to Lean

There are many definitions of lean, but the most inspiring for me is the one that Lean Enterprise Institute Chairman John Shook gives in his book Managing to Learn: lean is developing products by developing people. Drawing on this definition, this seminal book then explains the lean way to develop people: through problem solving. This definition sheds a clear light on the beauty of this management practice: designing the work so as to make problems (and, as such, learning opportunities) visible and to solve them, using the scientific approach, as they appear.

One of the mistakes I made while working with software development teams using agile methodologies was to confuse bugs with problems: I tended to believe our agile process was Lean because it made bugs visible. During the last few months, my thinking has cleared up a bit and, in retrospect, I now believe that our agile team producing bugs was not a Lean system producing learning opportunities: it was a team with quality problems, which is something I have seen with many teams.

The goal of this article is to describe how my thinking has been evolving on the topic of bugs and problems, provide some hints on how to better understand the problems causing bugs in order to improve performance, and put this into perspective with some real-life stories. (Disclaimer: the goal here is not to pretend that all agile teams have similar misconceptions.)

What is a bug?

In the software industry, a bug can be anything from a system error (a NullPointerException, an HTTP 404 error code, a blue screen, etc.) to a functional issue (when I click B the system should do Z and it does Y), a performance problem, a configuration issue, and so on.

A bug is not a problem in lean terms unless it is clearly expressed as defined in the next section. Believe me, I have seen (and produced!) my share of them, and 95% of the bugs I’ve known don’t look like problems - performance bugs might be a general exception, but, funnily enough, they are qualified as performance, aren’t they?

What is a problem?

Let’s proceed with a standard definition here. In The Toyota Way Fieldbook, Jeffrey Liker defines the four pieces of information required to define a problem:

  1. The actual current performance
  2. The desired performance (standard or goal)
  3. The magnitude of the problem as seen by the difference between current and target performance
  4. The extent and characteristics of the problem

As Brené Brown quoted in her TED talk about vulnerability, if you can’t measure it, then it doesn’t exist. More practically, if you can’t explain a problem as a performance gap, it could well be because you haven’t been thinking about it long enough.

Before starting to work on a problem, it is critical to express it clearly, to take time to understand it (lean expert Michael Ballé says to be kind to it) and to resist the temptation of jumping to solutions. We all know the famous quote attributed to Einstein: « If I had an hour to solve a problem, I would spend the first 55 minutes thinking about the problem and 5 minutes thinking about the solution ». No one said it was easy.

In the context of an agile software development team, the performance indicators can be a burndown chart (cost and delay), number of bugs, response time (quality), customer evaluation of the delivered User Stories (a grade out of 10 for customer satisfaction) and number of user stories (or story points) delivered per sprint (productivity).

From these indicators, some examples of problems could be:

  • Quality : the target of this page response time is 500 ms and we measured 1500 ms with 5000 simultaneous users
  • Quality : number of open bugs left at the end of the sprint (2 instead of none)
  • Cost/Delay : We thought this user story would take us 3 days to complete, it took 8
  • Productivity : the number of user stories delivered by the team at the end of the sprint is 5 instead of the 7 planned
  • Customer Satisfaction : We want to have an 8/10 grade for each user story and we had 2 below this grade last sprint (6.5 and 7 respectively).
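To make the idea of a performance gap concrete, here is a minimal Python sketch that restates some of the examples above as a target, an actual value and the gap between them. The Problem class and its field names are assumptions made for illustration only; they do not come from Liker's book or from any tool the teams used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Problem:
    """A problem stated as a gap between actual and target performance."""
    indicator: str    # e.g. "Quality", "Cost/Delay"
    description: str  # what is being measured
    target: float     # desired performance (standard or goal)
    actual: float     # actual current performance
    unit: str

    @property
    def gap(self) -> float:
        # Magnitude of the problem: distance between actual and target.
        return abs(self.actual - self.target)

    def statement(self) -> str:
        return (f"{self.indicator}: {self.description}: "
                f"target {self.target:g} {self.unit}, "
                f"actual {self.actual:g} {self.unit} "
                f"(gap {self.gap:g} {self.unit})")


# Figures taken from the examples in the list above.
problems = [
    Problem("Quality", "page response time with 5000 simultaneous users", 500, 1500, "ms"),
    Problem("Quality", "open bugs at the end of the sprint", 0, 2, "bugs"),
    Problem("Cost/Delay", "time to complete the user story", 3, 8, "days"),
    Problem("Productivity", "user stories delivered in the sprint", 7, 5, "stories"),
]

for p in problems:
    print(p.statement())
```

If a bug cannot be written down in this shape, with a target and a measured value, that is usually a sign that it has not yet been turned into a problem.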

How to extract problems from bugs?

Bugs are symptoms of more general issues and it is critical for a lean team to relate these symptoms to actual problems. We could say that just as Michelangelo saw and then extracted beautiful shapes out of marble pieces, the job of lean teams (i.e. teams doing continuous improvement as part of the job, on a daily basis) is to see and then extract insightful problems out of heaps of bugs. This requires some analysis and work to transform the raw material into learning opportunities.

A great way I found to start this analysis is by classifying bugs into families and understanding the weight of each bug family. Most of the time a bug family can be a cause of an existing problem or can be a problem in itself. This correlation helps you make sure you are tackling the problems in the right order, starting with the problem that has the most impact on operational performance. If you still don’t know where to start, starting on quality is a safe bet.

 

Example 1 : in an agile world

I was the manager of this team doing agile development. Like many such teams, it was not a cross-functional team (the scrum-but plague) but a silo team developing its iteration, producing the foundation server-side software for application teams to use in subsequent iterations.

We did some Paretos of the bugs we were having and identified a family grouping 20% of them: these were opened by application teams and related to the « implicit » definition of the API exposed by our server-side software. When the application teams were using it, there were input parameters missing, output data lacking, etc., so they would open bugs and the team would go like « hey, but it was implicit that we would not return this data ».

We also noticed that the life span of such bugs, between the time they were created and the time they were closed, was around 4 weeks. The code was released at the end of a one-month iteration for the client teams to use in the next one (in the best case). They would then open a bug, which reached the developer 2 or 3 weeks after she had written the code, so she had to go back into it, and so on.

In order to tackle the issue we decided to re-engineer the work and put people together as part of co-located, cross-functional and cross-disciplinary teams.

With this approach, we noticed a drastic reduction (about 50%) of these « implicit API » bugs. Most interestingly, the average life span of this type of bug went down to a couple of days. Yet this figure is not really meaningful, because some of these bugs were found and fixed on the spot during pair programming, without any ticket being created.

Despite the clear results, I was still somehow uncomfortable, but I couldn’t tell why back then. It appeared to me later that there were two small flaws from a lean perspective:

  1. Since we were still having bugs, there was rework and the development system was producing waste: there was no built-in quality to ensure that the problem didn’t even get past the server-side developer writing the API. Besides, there was no real standard in the team other than « let’s sit together when we have an issue ».
  2. Even if these results were quite significant and encouraging, there was no direct correlation with the daily performance of the team that would have allowed us to take immediate action and witness the result the next day. We only checked the macro-result at the end of the 6-month release: out of so many bugs, only so many were related to the API. So we could see that setting up cross-disciplinary teams had somehow improved quality, but we had not provided the means to monitor it and act on it on a daily basis.

Example 2 : in a leaner world

Fast forward a couple of years. In the same organization I am now project leader and coach in charge of deploying agile in a large multi-team, multi-technology project. There is this team developing a rather challenging technical integration with a technology we don’t have much expertise in. The team has not delivered any User Stories for the last two sprints and is struggling with quality issues, i.e. bugs. At the retrospective of the second sprint that didn’t deliver any completed User Story (i.e. without any pending bug on the functional scope, as per our definition of Done), the team decides to hold a bug review (red bin analysis in lean) on a weekly basis.

In the first session, the team builds a Pareto of the problems. It is set as a table with the bug family in one column, the number of bugs and the ID of the bugs in the following ones.  
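Such a table could be built with a few lines of Python. This is only a sketch: the bug IDs, the family assigned to each bug and the pareto function are hypothetical, and a real team would read its bugs from its own tracker.

```python
from collections import defaultdict

# Hypothetical bug records: (bug ID, family). IDs and families are made up.
bugs = [
    ("BUG-101", "regression"),
    ("BUG-102", "configuration"),
    ("BUG-103", "regression"),
    ("BUG-104", "implicit API"),
    ("BUG-105", "regression"),
    ("BUG-106", "implicit API"),
]


def pareto(bugs):
    """Group bug IDs by family and sort families by number of bugs, largest first."""
    by_family = defaultdict(list)
    for bug_id, family in bugs:
        by_family[family].append(bug_id)
    rows = [(family, len(ids), ids) for family, ids in by_family.items()]
    # Sort by count descending, then by name so that ties are ordered predictably.
    rows.sort(key=lambda row: (-row[1], row[0]))
    return rows


# One line per family: name, number of bugs, share of all bugs, bug IDs.
for family, count, ids in pareto(bugs):
    share = 100 * count / len(bugs)
    print(f"{family:<15} {count:>3} {share:5.1f}%  {', '.join(ids)}")
```

The family at the top of the table is the one whose root cause the team would look at first.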

The objective from then on is to eliminate the root cause of each bug family, one by one, starting with the one with the most occurrences. To foster collaboration on the topic, the Scrum Master decides to display this Pareto next to the Scrum Board and the number of bugs, and to update it every day. Any new bug is classified on the spot during the morning stand-up meeting, when the team reports the bugs of the day. This helps in making daily quality performance explicit for the team. This also provides a great way to do the C in PDCA: the Check. When the problem is eradicated, there should not be any bug left on that line for a week or so. Yet bugs sometimes still appear on it: another space for learning.
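The Check described here can be sketched as a small Python function. The function name, the sample dates and the 7-day window (standing in for "a week or so") are assumptions chosen for illustration, not something the team actually coded.

```python
from datetime import date, timedelta


def is_eradicated(bug_dates, today, quiet_days=7):
    """Return True if no bug of the family was reported in the last
    `quiet_days` days, today included."""
    cutoff = today - timedelta(days=quiet_days)
    # Every reported bug must be on or before the cutoff date.
    return all(reported <= cutoff for reported in bug_dates)


# Hypothetical report dates for two bug families.
today = date(2024, 3, 15)
regression_dates = [date(2024, 3, 1), date(2024, 3, 8)]
configuration_dates = [date(2024, 3, 2), date(2024, 3, 13)]

print(is_eradicated(regression_dates, today))
print(is_eradicated(configuration_dates, today))
```

A family that fails this check after a counter-measure was put in place is the signal to go back and ask what was missed.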

As an example, the team identifies regression as one bug family: a software modification has broken an existing working feature. This happens mostly in the graphical user interface, as it is very difficult to test automatically. One of the identified root causes is that more junior programmers do not always fully understand the impact of their code changes. The counter-measure is to introduce a new step in the process: a pre-commit code review with a more senior developer. This 15-minute step drastically reduces regressions, which is measured daily through the number of bugs per release (there are 2 releases per day), while improving the skills of the junior developers.

Eventually, all problems are tackled and the result is just stunning: the problems are eliminated one by one, using standards (code review before commit being one). The number of daily bugs plummets and the number of fully functional, bug-free user stories delivered each iteration increases. Within 3 months, the team has turned from the one producing the largest volume of bugs into an example of a high quality / high velocity team within the project.

This approach is leaner than the previous one, as there has been a direct impact on daily performance (quality) and productivity (number of user stories delivered) and the team has set new operational standards.

 

Fig 2 : Example of performance indicators for agile teams

Turning an agile team into a learning team

With the two examples above in mind and from what I’ve learned during this journey, this is the roadmap I would recommend to turn an agile team into a lean and learning one:

  • Measure the performance, make it visible and discuss it, every day

I know this one is rather hard to swallow for some hippy Agile coaches (which I somehow still am deep inside). But here is the sad truth: if we want to improve, the first thing to do is to measure. Besides, and most importantly, we don’t learn unless we confront reality. This is what web giants (Google, Amazon, Twitter, Facebook) or practice leaders (Etsy) do: they measure just about everything. They didn’t reach that level of performance by only counting story points. A practical example in agile teams: beyond the sprint burndown, display the quality performance (number of bugs left open, number of bugs per release, per family, etc.) and customer satisfaction (a grade out of 10 for delivered User Stories, for instance), and question on a daily basis why the burndown is not meeting the target.

  • Make sure problems are expressed the lean way

A problem must be expressed as a difference between the observed performance and the target performance. Pareto is a great tool to turn raw bugs into families, but then there must be supplementary analysis to understand how each family affects the performance.

This will allow you to make sure you’ve clearly formulated the problems and that you tackle them in the right order from a business performance perspective.

  • Treat problems one by one, as they appear

This is one of the keys to the lean problem solving approach: you don’t want to tackle many problems at once. You only want to tackle one, to understand how it impacts your performance indicators and to make sure you understand the cause-and-effect relationship.

  • Do the check

From my experience this is the stage we tend to skip, unfortunately. Confront your estimates with reality. It didn’t work as expected? Great! What is there to be learned? The precious space between what you thought would happen and what has actually happened is where the learning occurs. This is exactly what the team did in the second example. As Steven J. Spear wrote in his awesome book Chasing the Rabbit, this is where your organizational system whispers in your ear: « there is something you don’t yet know about me but if you listen carefully, I’ll tell you ». This is where the team develops expertise both on its work and its process, and quickly and surely turns into a dream team.

From Agile to Lean

As an agile practitioner since 2004, my thinking has been evolving towards lean for the last couple of years, as it has helped me to move beyond obstacles that Agile alone could not help me overcome.

In my experience, Lean has proved to be instrumental in moving beyond Agile to set up a practice of continuous improvement with direct effects on team performance and engagement.  Making a clear distinction between bugs and problems has proved to be instrumental in this improvement.

If you have started the very same journey, what key differentiating elements have you identified?

About the Author

A Lean IT Coach at Operae Partners, Cecil Dijoux is an IT professional with 25 years of international experience. Passionate about 21st century Management (Lean, Agile, Enterprise 2.0), Cecil blogs in French and English on http://thehypertextual.com about organizational cultures in an interconnected world. One of his blog articles has been discussed by The New York Times online and Read Write Enterprise. Cecil also happens to be an international speaker and the author of "#hyperchange - petit guide de la conduite du changement dans l'économie de la connaissance", a French downloadable e-book on change management in the knowledge economy.
