The Original Sin of Software Metrics
When I was still a very naive, starry-eyed software manager, there was one technical issue for which I proposed creating two defects: one for fixing the issue, and one for adding more logging around the area to make it more diagnosable in the future. In the meeting, a DEV manager more senior than I objected: “No, the logging one should be an enhancement, not a defect. We are putting things into the product which it didn’t have before, therefore it should be an enhancement.” The manager was not kidding; he said it in all seriousness, and other managers agreed. The logic was so far from my “common sense” that I was both astonished and amused. All this was because, a few days earlier, an even more senior manager had written an email asking why the number of defects was so high. That manager didn’t ask us to reduce the number: he simply showed an interest and asked a question.
After I grew more cynical, I played along to make metrics more pleasant. For example, we had a metric tracking the number of server defects reported by customers. The usual tricks to make it more pleasant were:
- Lower severity
- Combine multiple defects into one
- Argue a defect is actually an enhancement
My distrust and disregard of software metrics grew. Was it because I was being insubordinate, or are most software metrics screwed from the beginning? To my relief, I do not have an attitude problem. There are plenty of books out there that explain why metrics in a creative industry are problematic. The books that inspired me are:
These books are really scientific and dry. In this article I apply what I have learned from these books to software metrics, and I hope to do so in a less dry way.
Other references include:
Tom’s first attempt
In this imaginary scenario, Tom was a program manager of a big software company overseeing several products. Some products had bad reputations as customers had reported many defects.
The workflow for customers reporting defects was:
Tom opened the metric dashboard and saw that the products with bad reputations had many defects. He noticed that products with good reputations had far fewer. So he established a rule: within one year, every product must reduce its number of defects to below 50.
Here, Tom made his first mistake. Just because A leads to B (good quality leads to fewer defects), doesn’t mean B leads to A (fewer defects means good quality). Aristotle (335 B.C.) was among the first to record the human tendency to forget the directional nature of implication on observing correlation.
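The asymmetry can be checked mechanically. The following is only an illustrative sketch: it enumerates a truth table, and the names `good_quality` and `few_defects` are hypothetical labels for A and B, not anything taken from Tom's dashboard.

```python
from itertools import product


def implies(a: bool, b: bool) -> bool:
    """Material implication: 'a leads to b' fails only when a is true and b is false."""
    return (not a) or b


# Enumerate every combination of (good_quality, few_defects) and keep the
# cases where "good quality -> few defects" holds but the converse
# "few defects -> good quality" does not.
counterexamples = [
    (good_quality, few_defects)
    for good_quality, few_defects in product([False, True], repeat=2)
    if implies(good_quality, few_defects)
    and not implies(few_defects, good_quality)
]

print(counterexamples)
```

The one surviving case is a product that is not of good quality and yet shows few defects, which is exactly the situation Tom's rule could not distinguish from success.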
One year later, Tom was puzzled. On the one hand, the products with bad reputations had dramatically reduced their backlogs of defects; one product had even cut its number from 250 to 40, a big achievement! On the other hand, Tom still received complaints from customers. In fact, the complaints were getting louder.
Tom decided to probe into the whole workflow of “defect reporting – defect creating – defect fixing”. He observed how each product team handled customer issues. It didn’t take him long to observe these things:
- Product team members argued with customers that the issue they reported was not really a defect but something the product hadn’t implemented, so the issue should be an enhancement request. If the customer pressed to get the enhancement implemented, the product team members would argue that it was not in line with the product roadmap.
- The product team decided a defect was of such low severity that it was not worth fixing, and closed it.
- For some issues that could not be consistently reproduced, such as dirty data issues and performance issues, the product team didn’t even create defects. Instead, various “workarounds” were given to customers, such as cleaning up data or buying more powerful computers.
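The three behaviors above can be sketched in a few lines. This is a toy model under stated assumptions: the `Issue` fields and the sample issue titles are made up for illustration and do not describe any real tracker or product.

```python
from dataclasses import dataclass


@dataclass
class Issue:
    """A customer-reported issue (hypothetical model, not a real tracker schema)."""
    title: str
    reproducible: bool = True
    relabeled_as_enhancement: bool = False
    closed_as_low_severity: bool = False


def counted_defects(issues):
    """Return only the issues that still appear on the defect dashboard."""
    return [
        issue for issue in issues
        if issue.reproducible                    # non-reproducible issues were never filed
        and not issue.relabeled_as_enhancement   # argued into an enhancement request
        and not issue.closed_as_low_severity     # closed as not worth fixing
    ]


# Made-up sample data: four things customers complained about.
issues = [
    Issue("Export crashes on large files"),
    Issue("No audit log for failed logins", relabeled_as_enhancement=True),
    Issue("Typo corrupts report footer", closed_as_low_severity=True),
    Issue("Nightly job is slow on dirty data", reproducible=False),
]

reported = len(issues)
on_dashboard = len(counted_defects(issues))
print(f"customers reported {reported} issues; dashboard shows {on_dashboard} defect(s)")
```

Nothing about the product changes between the two numbers; only the bookkeeping does, which is why the dashboard improved while the complaints grew.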
Tom’s second attempt
Tom still believed his defect metric was good; it was just that people were doing it wrong. Being a reflective man, Tom’s first thought was not “those damn software engineers cheated me”, but “I should make it harder for them to game the metric.” So he changed the process to the following:
Tom reasoned that his previous attempt had failed because the RnD team controlled how the data was generated. So instead of RnD creating defects, support was put in charge of creating them. Support had substantial knowledge of the product, had the final say on whether an issue was a defect or an enhancement, and could also decide on the severity of a defect from the users’ point of view.
Tom was happy with the idea. It was not easy to change the process, and it caused some squabbles between support and RnD teams. But Tom muscled through and persevered. He believed that the cost of this change would be justified by the gains.
One year later, Tom was even more puzzled. From the metric dashboard, he could see that the RnD teams were burning down the backlog of defects quickly, but new defects were being reported almost as quickly. What went wrong?
Tom talked to the managers of these products and he was frustrated with their feedback. Everyone said they had met the targets and therefore they were performing well, and there were things that were out of their control.
Tom didn’t know that his metric, intended to make people accountable, actually caused nobody to be accountable. W. Edwards Deming, often considered the father of both the Japanese and the American quality movements, declared performance measurement “the most powerful inhibitor to quality and productivity in the Western world” (Gabor, Andrea, Catch a Falling Star System, U.S. News and World Report, June 5, 1989, p. 43).
Tom’s conversation with Bob
Tom sought out Bob, the manager of the product Phoenix, who had been his college classmate. Tom always thought Phoenix was a bad name for a product: he had seen two products named Phoenix go up in flames. He secretly worried that his old buddy from college would be caught in a fire too. Phoenix was not in good shape: its backlog of defects was burning down quickly, but the influx of new defects was alarming.
Tom: “Hey, buddy, what is the matter with the defect metric? Why hasn’t it improved Phoenix’s quality?”
Bob (raised his eyebrows almost irritably): “How do you define quality of software?”
Tom (taken aback by the question, and stuttered): “Well… Good-quality software has fewer defects.”
Bob: “That is only remotely true. There are three aspects of software quality, depending on who you are talking to.”
Bob opened his laptop, and pointed to the following diagram, and continued:
“The first aspect is functional quality, which means the software performs the tasks that it is intended to do for users. It includes:
- Meeting or anticipating the specified requirements. Since we are working on products, not on contracted projects, we are working on assumptions that what we are working on will solve customers’ problems.
- Creating software that has few defects.
- Good enough performance.
- Ease of learning and ease of use.
The metric you defined is largely around ‘creating software that has few defects’.
The second aspect is structural quality, which is what I am most interested in because of my former life as a software engineer. It is all about ‘-abilities’, which are as hard to define as art, because sometimes it is in the eyes of the beholder. This aspect includes:
- Code testability.
- Code maintainability.
- Code understandability.
- Code efficiency.
- Code security.
I like ‘code understandability’ the best. Code is meant to be read by human beings. When I read code, I expect it to read like a very boring novel: there should be neither twists nor surprises. A machine can read the code no matter how ugly it is, but the machine doesn’t maintain the code; human beings do.”
Tom: “Are you telling me that in order to satisfy the defect metric, you had to sacrifice these ‘-ability’ qualities?”
Bob: “Yes! We were able to work around the metric two years ago by creating fewer defects. It was cheating, but we were managing the damage. You redesigned the workflow last year and tied our hands even more. In order to keep up with the metric, we had to increase our speed, and in the process we were forced to sacrifice structural quality. The code became more and more complex because we were applying more and more band-aids. By our estimation, the code became 20% more complicated. The result, as you can see, is a lot of regressions and a much longer time for new development. By neglecting structural quality, we are incurring technical debt, which, sooner or later, we will have to pay back…”
Tom (light in his eyes): “Wait a minute. You said the ‘-ability’ qualities are like art, impossible even to define. But you can measure them! You just said you measured the code to be 20% more complicated. What if we measure the ‘-ability’ qualities?”
Bob (smiled at his friend’s enthusiasm): “Let me first finish the 3rd aspect of software quality. It is about process quality, which includes:
- Meeting the delivery dates.
- Meeting the budgets.
- A repeatable development process that reliably delivers quality software. If a process is so stressful that it forces out the best people in the team then it’s not a good process.
- Long term maintenance cost. To lower cost, it sometimes means the product has to embed some monitoring and supportability features that are not specified by and paid for by customers. But an easier-to-maintain system lowers the cost and makes happier customers.
These three aspects affect each other. I care most about structural quality because I am a hard-core software engineer. If code is written beautifully – sorry, I am describing it as a piece of art -- it will lead to fewer defects and quicker development cycles. But I am also part of the value-delivering chain: I can’t be too obsessed with the beauty of the code. I and everybody in the value-delivering chain have to balance these factors. Every project is different and the circumstances can change every day.”
Tom: “OK, I get it now. But I need to understand the status of each product. I promise I won’t set up goals on these metrics, but will simply use them as information to understand the status of each product.”
Bob (sighed): “I trust you because I’ve known you for a long time. But there is nothing inherent in a metric that guarantees it stays informational. What if you were moved to another department and someone else replaced you? Or what if you had a change of heart? The moment you set up a board with numbers written on it, people will treat them as targets and will be motivated to meet them. That is the magic of numbers.”
Tom’s heart sank. He now came to the full realization of the mistakes he had made.
- He was mistaken about the directional nature of correlation. Having fewer defects doesn’t mean better quality.
- He was mistaken that defects are the only dimension. In fact, there are so many dimensions in developing software that other dimensions are greatly distorted when people maximize one single dimension.
- He was mistaken that he was able to find all the key dimensions. Software work is so creative, so complex and so constantly evolving that it is impossible for a manager to understand every aspect of it and thus control it. In fact, it is not his job to understand and control everything; his job is to enable people to do their jobs and to provide inspiration and assistance.
But Tom shouldn’t have beaten himself up so much. To his credit, he proposed to use metrics as informational, but it takes more than his personal charm to convince people that metrics won’t be used otherwise. There are features that are at the corporation level that affect trust-building between employees and managers, such as:
- Organization size. Large organizations are perceived as less personal than smaller ones.
- Organizational prestige. Employees are more likely to identify with firms that are widely known and widely regarded.
- Degree to which employee needs are met in the organization.
- Perceived level of mutual commitment. The more employees perceive that the organization has taken on responsibility for their well-being, the more loyal employees are likely to be.
Tom (sounded desperate): “What are you saying? We should abandon all metrics? Wouldn’t that lead to chaos?”
Bob (winked at Tom): “If we abandoned all metrics, you wouldn’t have a job. But no, it wouldn’t lead to chaos. People have an intrinsic motivation to do a good job. Every decent software engineer wants to write good code and come up with good solutions. Software changes so quickly, and new technology pops up every day; an engineer without an intrinsic motivation to learn new technologies will quickly find himself left behind. The metrics you created forced my team into quick-and-dirty solutions. That affects my team’s morale, because they are not doing things that they can be proud of. In fact, some members are so disappointed with their work that I am afraid they might want to leave.”
Tom winced; he hadn’t known that his metrics had done such damage. Bob caught his expression and went on to explain: “If you think about it, it is not surprising. Software, by definition, is creative and requires innovative, non-routine solutions. If something is routine, we devise programs to have computers do it for us. Because of this nature, the work is interesting and rewarding. When you imposed this metric, you practically dictated what we should do (fix defects), how we should do it (quickly) and when we should do it (within a year). It took away our autonomy, reduced us to worker ants, made the job less interesting and dampened our initiative and passion. Not only that, it went against the nature of software: it discouraged innovation, which is the key element of software. Because our focus was narrowed to one aspect, we were not thinking outside the box, which is necessary for innovation.
But you can still use the metrics, because many of them do reflect quality. You can also set a target (most people will treat those metrics as targets anyway), but you should realize that these metrics measure only a small subset of dimensions, and arguably not the key ones. So, first, you should be careful not to set an extreme target. For example, 50 defects is an extreme number and will motivate people to take short-cuts by any means. Second, you should leave each team to interpret the data themselves. In other words, these metrics become each team’s self-assessment tools: they tell the team that something is going wrong, and the team should come up with practical ways to improve. For example, a team might conclude that its code ‘testability’ is suffering, which leads to a lot of regressions, so it should invest more in refactoring or automation.”
Tom: “What you said makes a lot of sense. But I’ve learned so much today that it has shaken my beliefs and confidence. Do you think we can make successful software after making the changes you suggested?”
Bob (chuckled): “How I wish that were true. You must have seen this amusing picture, which sums up the problem so perfectly.”
Bob pointed to the picture on his laptop.
Tom (smiled): “The business consultant did a fantastic job, didn’t she? She painted for the customer something more comfortable than what he asked for, and by the time the customer realized the truth it was too late: the data was in and the process had changed. He couldn’t get out.”
Bob: “The landscape has changed. Competitors are more nimble in providing what customers want at a low cost. Back to the metrics question: even if we write beautiful code, even if we have zero defects, even if we complete the work within time and budget, it doesn’t mean we are producing something that the customer wants. What we are measuring is not the end result, which is how customers perceive our product, but substitute characteristics. We can do a fantastic job on these substitute characteristics and still get miserable end results. Since we are making products, we are essentially anticipating or even making up customer needs. If we make a mistake, it might be many releases before we realize it, and that could be very costly.”
Tom: “Yes, we need to shorten the time from the first panel of the picture to the last.”
Bob: “Exactly! We have a golden opportunity here: dog food! We are a very big company, with plenty of opportunities to eat our own dog food.”
Tom: “Dog food?”
Bob: “Yes. We can sell products internally. After all, if our own company does not want to use our own products, chances are slim that we can force them down other companies’ throats, no matter how brilliant the business consultants are. Instead of lumping twenty big features into a major release every 6 months, we can dissect big features and release each small feature continuously to internal customers. You can measure how quickly each team does continuous release and how well end customers perceive the results. The ability to do good continuous release is a testament to how well the whole team, including developers, QA and product owners, functions. Customers’ feedback is the ultimate measurement. Moreover, seeing that what they are doing makes a direct impact on end users will greatly boost people’s intrinsic motivation.”
Tom (smiled wryly): “I like your idea, but it is not as easy as you might think to smooth out internal cooperation and communication so that we can eat our own dog food. Last year, I ruffled quite a few feathers trying to push through the workflow changes. In a big company there are many people, but there are also many turf wars and political plays.”
Bob: “But we can’t make modernized products without modernizing our management approaches while the rest of the world is modernizing. I do not want to see Phoenix go up in flames…”
Tom: “At a minimum, I’d like to have a discussion about the current metrics. What are their intentions? Are they triggering behaviors that comply with the letter of those intentions but defy the spirit? And more importantly, right now there is no metric that measures the end result. How do customers respond to each release? How do they like each new feature? Are they using the new features at all? If we have such discussions, I am sure many new ideas will pop up. For example, in addition to dog food, we might want to develop some reference customers and make arrangements with them.”
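The end-result measurements Tom and Bob discuss (how quickly a feature reaches internal users, and whether those users actually use it) could be computed along the following lines. This is only a sketch under stated assumptions: the `FeatureRelease` record, the feature names, the dates, the user counts and the 25% adoption threshold are all invented for illustration. The numbers are aggregated per team, not per person.

```python
from dataclasses import dataclass
from datetime import date
from statistics import median


@dataclass
class FeatureRelease:
    """One small feature released to internal users (hypothetical record)."""
    name: str
    started: date        # when the team began work on the feature
    released: date       # when internal users could first use it
    internal_users: int  # users who had access to the feature
    active_users: int    # users who actually used it after release


def lead_time_days(feature: FeatureRelease) -> int:
    """Days from starting work to the feature being in users' hands."""
    return (feature.released - feature.started).days


def adoption_rate(feature: FeatureRelease) -> float:
    """Fraction of users with access who actually used the feature."""
    if feature.internal_users == 0:
        return 0.0
    return feature.active_users / feature.internal_users


# Made-up data for one team's recent releases.
releases = [
    FeatureRelease("bulk import", date(2014, 3, 3), date(2014, 3, 17), 200, 150),
    FeatureRelease("audit log", date(2014, 3, 10), date(2014, 4, 9), 200, 20),
    FeatureRelease("saved filters", date(2014, 4, 1), date(2014, 4, 11), 200, 90),
]

median_lead_time = median(lead_time_days(f) for f in releases)
barely_used = [f.name for f in releases if adoption_rate(f) < 0.25]

print(f"median lead time: {median_lead_time} days")
print(f"features barely used: {barely_used}")
```

As Bob warns, even numbers like these are likely to be read as targets once they are on a board, so a sketch of this kind is better suited to a team's own self-assessment than to a management dashboard.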
I wrote this article mainly to point out the original sin of software metrics and to break people away from the conception that “the metrics are good, people are doing it wrong, so more processes should be set up to make people comply with the metrics.” The end of the article has a feeble attempt to offer some solutions, but at the end of the day, it is about building a culture that taps into and nurtures intrinsic motivations, which is a topic I do not have enough experience to talk about. I hope this article has provided some basis for a discussion about management approaches in a creative industry such as software.
About the Author
Chen Ping lives in Shanghai, China and graduated with a master’s degree in Computer Science in 2005. Since then she has worked for Lucent and Morgan Stanley. Currently she is working for HP as a development manager. Outside of work, she likes to study Chinese medicine. She blogs here.
Will the reality be that, although Tom can run his product without metrics, he still has to prepare one for his boss?
Chen, Loved your Article - BUTTT - - -
R Douglas Shelton
Re: Chen, Loved your Article - BUTTT - - -
The problem is that, without a culture where people trust each other and trust the management, it will be impossible to convince people that the metrics won't be used against them. The story at the beginning, in which a senior manager simply asked a question and prompted a reaction, is a demonstration. That is why I couldn’t give a "perfect" solution: at the end of the day, it is about corporate culture.
But I put out some thoughts at the end of the article:
1) Have an aggregated metric that measures the whole team (the whole software chain, including DEV, QA, product managers, etc.), so the information in the metric can't be used against individuals. One such metric could be how fast a feature can be delivered to users (which really tests how well the whole team functions and how good the development process and the structural quality of the code are), together with how users respond to the feature. To implement such a metric when the product is not offered in the cloud, perhaps the best way is to eat one's own dog food: make arrangements with internal users (so new features can be deployed continuously) and gather their feedback.
2) Use metrics as self-assessment tools, and have the team come up with action plans on their own, rather than having higher-up managers impose action plans after reading the metrics. The reason why the latter approach is demoralizing, I think, is that it takes away people's autonomy in doing their jobs, and autonomy is essential for people to feel fulfilled and happy in their jobs (according to Driver). Controlling the process (i.e. the action plans) is also harmful to innovation, which essentially requires breaking away from the routine and thinking outside the box.