Queues – the true enemy of flow
When a project is late, it’s rarely because of how long the actual work takes us. It’s more often connected to how long our tasks have spent inactive, sitting in a queue. And yet project management offices tend to focus on activity time, not queuing time. The only queue most IT departments measure is the backlog, but there are hundreds of others. This article examines why we ought to track them and how much they cost us.
Why is it taking so long?
Discovering the true enemy of flow
A watched kettle never boils.
If we’re longing for a cup of tea, we complain about how long the kettle takes to boil. But boiling a given amount of water is an activity you can do very little to speed up. It’s only your desperation for tea – symbolised by your anxious attention on the kettle – that makes it seem slow.
Let’s imagine you thought about having a cup of tea half an hour ago. If you don’t yet have one, the issue is unlikely to be the kettle malfunctioning. Far more likely that other things have got in the way. Perhaps the phone rang. Maybe you and your partner needed to have a long row about whose turn it was to make the tea. Perhaps you had to tackle the mound of dirty dishes before you could fill the kettle! These non-value-added activities are the true cause of delay between wanting a cup of tea and getting one. All this time the task ‘boil water for tea’ is essentially sitting in a queue behind other tasks.
It’s a proverb, and a situation, that ought to ring bells for most software development teams – and yet worryingly, it doesn’t. That’s because we’re so blind to these ‘queues’ that we don’t even consider them. Rather than asking why it took us so long to put the kettle on in the first place, we stare angrily at the kettle, willing it to boil and muttering about how long it takes.
Queues are everywhere
All of our working life is filled with queues. Because we are so busy, we find it odd to realise that very little of a project’s total length is actual work. A project spends most of its time in a series of queues. These range from big and obvious delays (waiting to have a team assigned to the project) to minor and almost untraceable ones (a request for information sitting unanswered in a colleague’s inbox).
Why don’t we see them?
We tend to respond very well to visible queues. We can avoid them (big queue at the bank, I’ll come back later); we can get angry with them (complain to the manager and threaten to move your account elsewhere); and we can manage them (invest in automatic paying-in machines to try and reduce queues at the bank).
In manufacturing, where queues are large piles of inventory on the factory floor, and thus present on the balance sheet as unrealized assets, management will expend a lot of effort on reducing them.
But inventory queuing up in software development is invisible. Most of our work is made up of information: ideas, design and lines of code. The impact is just as severe for us as for the manufacturer, however. The longer part-completed work sits there gathering metaphorical dust, the greater the danger that the value could disappear altogether. The opportunity could pass, a customer might walk away, technology or the environment might change... The sunk cost would be irretrievably lost, the hours of time invested to date, wasted.
Because our queues are invisible, we find it easy to ignore them. If we double the number of requirements in a project, there is no warning bell that sounds. Developers in the department might look slightly more stressed, but there is no real way of knowing what the change has done to their work or to how quickly they will complete it. Now imagine we doubled the number of developers on the project. Everyone would notice that! Not only is there a sudden scuffling for desk-space, but managers would be in crisis meetings trying to find extra budget to pay their wages.
Queues have a major effect on our cycle time
In a system that had no queues at all, the cycle time (the time taken on average to deliver a single set of requirements) would be the sum of all the activities. That means decision gates wouldn’t take the week marked out for them in the project plan, or even a day, they’d take the two hours actually spent in discussion. Researching an idea for development (maybe a daring new user interface) wouldn’t take four weeks, it would take the actual five days of prototyping potential layouts (four prototyping teams would all run concurrently) plus the actual time to collate and analyse the results – an extra day, perhaps.
A project run with zero queues takes an extremely short amount of time. It is also, of course, very costly. To have zero queues, you need to keep your people free from other work – your tester needs to be on standby, ready to jump in whenever required; you need a stable of consumers ready to answer questions on your latest idea whenever you want feedback, or perhaps the CEO must be willing to immediately receive your calls whenever you require his approval on something.
If the fastest cycle time means zero queues, then long queues mean the opposite: the longer the queue, the slower the cycle time. We understand this – we look for short queues at supermarkets because we expect to be served faster. Work in a queue is doing nothing: each day in a queue is an extra day added to total cycle time.
In short, queues have a direct economic impact on the business. They increase inventory, stall valuable projects (which increases the risk of loss), delay feedback, and damage motivation and quality. Yet in spite of this, they are rarely tracked or targeted. A company that carefully keeps account of every hour of overtime is quite likely to be blissfully unaware of the cost of delay to a project caused by long queues.
Long queues cost more
Faced with a long queue, we tend to react with optimism. True, we say, I just had a big rush of work, but now things will calm down and I can get through my list of things to do (a queue). But our optimism is misplaced. As soon as any single task takes longer than we expected, a queue begins to form. It is unlikely that this will correct itself through lots of quick and easy tasks arriving in the queue. Instead, the longer the queue grows, the lower the probability that we will get it back under control. This is known as the ‘diffusion principle’ and there’s quite a neat mathematical proof for it. The further a trend moves in one direction, the less likely it is that we will return to our original starting point. Our inability to grasp this probability intuitively is one reason that so many investors hold onto falling shares, stubbornly hoping they will go back up.
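One way to get a feel for this is a toy model. The sketch below assumes the queue behaves like a fair coin-flip random walk: at each step it either grows or shrinks by one task with equal probability. That is our simplifying assumption, not the proof mentioned above, and the function name is hypothetical. It computes the exact chance that a queue of a given length empties within a fixed number of steps.

```python
def prob_clear_queue(start, steps):
    """Chance that a queue of length `start` reaches zero within `steps` steps.

    Assumed model: each step the queue grows or shrinks by one task,
    each with probability 1/2 (a fair random walk).
    """
    if start == 0:
        return 1.0
    dist = {start: 1.0}   # probability of each queue length, not yet cleared
    cleared = 0.0         # probability mass that has already reached zero
    for _ in range(steps):
        nxt = {}
        for length, p in dist.items():
            for new_length in (length - 1, length + 1):
                if new_length == 0:
                    cleared += p / 2
                else:
                    nxt[new_length] = nxt.get(new_length, 0.0) + p / 2
        dist = nxt
    return cleared


# Compare short and long queues over the same 50-step horizon.
for start in (2, 5, 10):
    print(start, round(prob_clear_queue(start, 50), 3))
```

Under this assumed model, the longer the starting queue, the smaller the chance of clearing it in the same amount of time.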
In actual fact, when it comes to queues of work, the principle is even further weighted against us. Experience shows that even with honest planning, most of us tend to underestimate how long tasks will take – so most tasks take longer than expected, making the rate at which long queues get longer even faster.
If the first task takes longer than expected, then all the tasks behind it will be delayed. If each of these has a cost of delay then you will pay the cost on all of them (even though only one task actually overran). This is why long queues have a much more devastating economic impact – and when really long queues have formed, catching up with them becomes increasingly unlikely.
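The arithmetic can be sketched in a few lines. The daily cost-of-delay figures and the function name below are made up for illustration; the point is only that one overrun is charged against every task waiting behind it.

```python
def total_delay_cost(overrun_days, daily_costs_of_delay):
    """Cost paid when the task at the head of the queue overruns.

    Every task waiting behind it is delayed by the same number of days,
    so each one incurs its own daily cost of delay for that period.
    """
    return overrun_days * sum(daily_costs_of_delay)


# Hypothetical cost of delay per day for four tasks waiting in the queue.
queued = [500, 300, 200, 1000]

short_queue_cost = total_delay_cost(3, queued[:1])  # one task waiting
long_queue_cost = total_delay_cost(3, queued)       # four tasks waiting
print(short_queue_cost, long_queue_cost)
```

The same three-day overrun costs far more when the queue behind it is long.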
The UK Border Agency is a famous example. In 2006 it was ordered by the government to deal with 450,000 unresolved asylum cases within five years. By the summer of 2011, the agency still had 147,000 unresolved cases. There were 150 boxes of unopened mail from asylum applicants, their lawyers and constituency MPs stored in the office. As each case was delayed it became harder to trace applicants. Often circumstances had changed – having had a child in the intervening years, for example, often meant applicants now had a right to remain. Such cases blocked up all the cases queued behind them, meaning that they too would suffer the same problems. Politicians remained focused on the activity – how fast and accurately were staff processing cases, rather than the true block defeating them – the length of the queue. Reducing the queue meant facing unpopular decisions – like granting amnesty to anyone in the queue for longer than 5 years. Without that, the queue continues, spreading its economic, and human, damage.
So what action should we take?
1. Measure our queues.
If we only look at output (the stream of asylum cases being resolved), or activity (all the border agency staff look busy), there is a long delay in noticing a problem. When we measure queues, we receive an early warning. At its simplest, we can measure how long work takes to pass through the queue – this could be overall cycle time from concept to cash, or the time through specific processes.
Start by recording when a task enters development. Measure the exact amount of time the task spends being worked on in each process. Record when the task is finished and deployed. By subtracting the ‘work time’ from the ‘total time’ you will get an idea of how long each task spends in queues.
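As a minimal sketch of that subtraction, assuming we have logged a start and end timestamp for each spell of hands-on work (the dates and the function name are illustrative):

```python
from datetime import datetime, timedelta


def queue_time(entered, finished, work_periods):
    """Time spent waiting: total elapsed time minus hands-on work time."""
    total = finished - entered
    worked = sum((end - start for start, end in work_periods), timedelta())
    return total - worked


# Illustrative task: ten days from entering development to deployment.
entered = datetime(2015, 5, 1, 9, 0)
finished = datetime(2015, 5, 11, 9, 0)
work_periods = [
    (datetime(2015, 5, 4, 9, 0), datetime(2015, 5, 4, 17, 0)),   # 8 hours of development
    (datetime(2015, 5, 8, 13, 0), datetime(2015, 5, 8, 17, 0)),  # 4 hours of testing
]

waiting = queue_time(entered, finished, work_periods)
print(waiting)
```

In this made-up case the task received twelve hours of work and spent the remaining nine and a half days in queues.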
2. Make queues visible
A helpful visual depiction of the queue is the Cumulative Flow Diagram (CFD) – this tells you not only the size of the queue, but whether it is exacerbated by a large number of arrivals or a lengthy service time that results in few departures. It is particularly helpful for spotting emerging queues. Many teams show queues as post-it notes or cards stuck on a board in swim lanes that represent processes or individual developers.
Almost immediately it becomes clear when a particular process is in danger of being overwhelmed as a queue begins to form.
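The numbers behind a CFD are easy to produce. The sketch below assumes we count arrivals and departures for one process each day (the figures are invented); the gap between the two cumulative lines is the queue.

```python
from itertools import accumulate


def cumulative_flow(arrivals, departures):
    """Return cumulative arrivals, cumulative departures and queue size per day."""
    cum_in = list(accumulate(arrivals))
    cum_out = list(accumulate(departures))
    queue = [a - d for a, d in zip(cum_in, cum_out)]
    return cum_in, cum_out, queue


# Invented daily counts for a single process, for example testing.
arrivals = [3, 2, 4, 5, 4]
departures = [2, 2, 2, 2, 2]

cum_in, cum_out, queue = cumulative_flow(arrivals, departures)
print(queue)
```

Here departures are steady while arrivals climb, so the widening gap points to a surge of arrivals rather than a slow service time.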
3. Estimate our queues
We can estimate queues using Little’s Law. It equates average waiting time with queue size and processing rate. It is very robust, applicable to everything from the overall system to the individual product queue and it is also simple to explain compared to other queuing theory concepts.
Average waiting time = queue size / average processing rate
This is the calculation used to tell us how long it should take for our call to be answered, or how long we will wait from a given point to the front of the rollercoaster queue. It means we can work out how long it will take to get to any individual task – a piece of information that can really help product owners decide whether they need to reprioritise.
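A worked example, with made-up numbers: suppose 30 tasks are waiting and the team completes an average of 5 tasks a week. The function name is ours; the formula is the one given above.

```python
def average_waiting_time(queue_size, processing_rate):
    """Little's Law: average wait = queue size / average processing rate.

    The result is in whatever time unit the processing rate uses
    (tasks per week gives weeks).
    """
    if processing_rate <= 0:
        raise ValueError("processing rate must be positive")
    return queue_size / processing_rate


weeks = average_waiting_time(queue_size=30, processing_rate=5)
print(weeks)
```

A task joining the back of this queue can expect to wait about six weeks before anyone starts on it.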
4. Sequence our queues
Once we have good visibility of our queues and can measure them, we can really start to prioritise or sequence them so as to maximise value and minimise pain. We can do this pretty quickly and know that we’ve got a good approximation. We begin with tasks for projects with the highest value. Where the value is equal, the shorter task should take priority, since it blocks the resource for a shorter time and by completing it we can realise its value.
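The rule above (highest value first, shorter task wins a tie) can be sketched as a simple sort. The task names, values and durations are hypothetical, and this is one reading of the rule rather than a complete prioritisation scheme.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    value: float          # estimated value of the project the task belongs to
    duration_days: float  # estimated time the task blocks the resource


def sequence(tasks):
    """Order tasks by highest value first; where value is equal, shortest first."""
    return sorted(tasks, key=lambda t: (-t.value, t.duration_days))


backlog = [
    Task("A", value=100, duration_days=10),
    Task("B", value=100, duration_days=3),
    Task("C", value=250, duration_days=8),
    Task("D", value=40, duration_days=1),
]

order = [t.name for t in sequence(backlog)]
print(order)
```

C goes first on value; B beats A because the two are worth the same and B frees the resource sooner.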
5. Purge our queues
If a job has not been worked on for a certain length of time, perhaps because other tasks have been moved up ahead in the queue, then it’s time to purge the job. If people object then the purge acts as a way to focus their attention – to assign new priority to the purged feature or task and resubmit it.
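A purge can be as simple as a date check. The sketch below assumes each job records the date it was last worked on; the 90-day limit and the job names are examples, not recommendations.

```python
from datetime import date, timedelta


def purge(queue, today, max_idle):
    """Split jobs into those kept and those purged for being idle too long.

    `queue` is a list of (job name, date last worked on) pairs.
    """
    kept, purged = [], []
    for name, last_worked in queue:
        if today - last_worked > max_idle:
            purged.append(name)
        else:
            kept.append(name)
    return kept, purged


# Example queue with an assumed 90-day idle limit.
queue = [
    ("export report", date(2015, 1, 10)),
    ("login fix", date(2015, 5, 15)),
    ("new theme", date(2015, 2, 1)),
]

kept, purged = purge(queue, today=date(2015, 5, 21), max_idle=timedelta(days=90))
print(kept, purged)
```

Anything in the purged list goes back to its owner, who can resubmit it with a fresh priority if it still matters.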
Where should we hunt for queues in IT?
The Fuzzy Front End
Reinertsen and Smith memorably described the period in a project before the development build begins as ‘the fuzzy front end’ – approvals, exploration, capability studies etc. Companies invest a great deal of time in ensuring that they only devote resources to the ‘right’ idea. This causes a big queue at the very start as each idea needs a project plan with cost and expected return on investment (none of which can exist without prior investigation). It is ironic that companies do this in order to minimise their risk, but in the process cause such long delays to their overall cycle time that they actually increase the risk of the ideas they select becoming obsolete.
What can we do?
If you can quantify the cost of delay for each project or idea, you can help focus management attention on making faster decisions or testing ideas out to gain funding incrementally.
We tend to manage specialists for maximum efficiency. Because they are often expensive we like to keep specialists fully utilised. This is the recipe for a queue. Employing more specialists is expensive and companies are often reluctant to invest a specialist’s time to train others.
What can we do?
Having ‘generalised specialists’ can be a fantastic way of adding temporary spare capacity when required – developers who can test; statistical modellers who are happy to pair. The team can also provide support to make work flow as smoothly as possible through the bottleneck. This can range from offering secretarial support (you want specialists working, not booking train tickets) to doing as much advance preparation as possible.
Big queues in software are not always about people. They are quite as likely to be caused by a different resource: hardware and environments. Hardware is frequently a constrained resource, either in itself or because of how it is set up.
What can we do?
Sharing an expensive resource makes complete sense to those considering efficiency, but efficiency must also factor in the cost of delay. Teams themselves must also work to manage the bottleneck – preparing set-up instances in advance, for example.
Queues are a fact of life. We are not trying to present them as the embodiment of business evil. They lead to delays and delays usually have an associated cost. In many cases this cost may be worth bearing compared to the cost of eradicating the queue. National Health Service GP surgeries tend to be happy to function with long queues, for example. They are more concerned with the capacity utilization of the doctor than the cumulative delay to patients. Private Healthcare clinics will have a very different attitude to queues and be prepared to run with lower efficiency in order to ensure patients don’t wait.
A blindness to queues means you are unable to make such decisions. It means that your business could be suffering huge bottlenecks and incurring a heavy cost of delay in complete ignorance. A company that is worried about delivery times but looks only at activity and not queues is watching a kettle when the fire has gone out.
About the Author
Paul Dolman-Darrall is an IT director known for developing people and successfully leading large global teams across various change programs for some of the largest companies in the world; he has also contributed to government strategy. At Emergn, in his role of Executive Vice President, he has helped launch Value, Flow, Quality (VFQ) Education, a work-based learning program to help practitioners achieve immediate business results through the application of skills in practice. The program is designed to help IT departments and business leaders who rely on technology to put in place smarter, more effective work practices to facilitate change, generate significant return on investment and inspire innovation in practice.