
Facilitating the Spread of Knowledge and Innovation in Professional Software Development



Lean and Accelerate: Delivering Value as an Engineering Leader


Summary

David Van Couvering walks through the Lean principles behind the Accelerate metrics, giving his audience a deeper understanding of why these principles work and why they should be implemented.

Bio

David Van Couvering is senior principal engineer @Optimizely.

About the conference

QCon Plus is a virtual conference for senior software engineers and architects that covers the trends, best practices, and solutions leveraged by the world's most innovative software organizations.

Transcript

Van Couvering: This is David Van Couvering. We'll be talking about Lean and Accelerate. In particular, we'll be diving into the key Lean principles that are part of how Accelerate measures software delivery performance, going into some of the theory and math behind these principles to help you better understand why they work so well, and to help you convince others who might need convincing that these are changes worth making to how you deliver software, so that you get better value overall and make for happier employees.

Accelerate Provides Scientific Evidence for Key Practices

I want to start out talking a little bit about the book "Accelerate" that came out a few years ago. I don't know if you've read this book. I really like it. It was a great thing to come out into the industry, because it has research backing practices that I've been trying to put into place and convince others to adopt. It lets you actually say: here's evidence for how this works, we don't need to keep arguing about whether or not it's going to work. Four years of research, across many different markets and many different sizes of organizations, produced compelling evidence that certain practices predict that you're going to exceed your business goals and improve the engagement and morale of your team. In particular, they follow four metrics: lead time, deploy frequency, mean time to recovery, and change failure rate.

Major Contributors to Improved Performance

Then the next question they ask is: what are the major contributors to improving your software delivery performance, to improving on those metrics? There are a number that they list, but three stick out to me as really key: practicing continuous delivery, having a loosely coupled architecture, and having Lean management and development practices. We're going to lean into some of the key Lean principles, and the math and theory behind them.

The Principles of Product Development Flow

A lot of the more detailed research into Lean principles that I draw on comes from a really great book written about 10 years ago by Donald Reinertsen, called "The Principles of Product Development Flow." A friend recommended it to me, and I've had to read it three or four times to understand what he's trying to say. There's a lot of math in there, and a lot of principles. Not all of them apply everywhere; some are specific to very particular circumstances. But a number of things in there were really eye opening for me, and I wanted to share those with you here.

Key Economic Principle: The Cost of Delay

First of all, Dr. Reinertsen spends a lot of time talking about the economics of software delivery, and how we need to take those economics into consideration. There's a whole section on that. The key thing I got out of it is a concept we have to keep in mind: the cost of delay. Often, when we're prioritizing products and features, we think about the business value: what value will this deliver to the business once it ships? What we don't consider is how that value degrades over time as delivery gets delayed. Sometimes, as you see on the left here, the delay just means you don't get to pick up the value until it's released; once it is released, it retains its overall value. Sometimes, though, if you miss your release date, the value starts getting reduced, and it may peak at a lower level than it would have if you'd released sooner. In some cases, where there's a fixed-date cost, the cost of delay can become significant very quickly if you miss those dates. For pretty much anything you're building, there is a cost to taking longer to deliver it, and that affects a lot about how you think about delivering software.
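
To make that concrete, here is a minimal sketch in Python, with entirely made-up numbers, of one way you might estimate a cost of delay for a feature whose peak value also degrades as it slips:

    # Hypothetical cost-of-delay estimate: a feature worth $10k/week,
    # whose peak value also decays 5% for every week it ships late.
    def cost_of_delay(weekly_value, weeks_late, decay_per_week=0.05):
        """Value lost while waiting, plus the permanently degraded peak."""
        lost_while_waiting = weekly_value * weeks_late
        degraded_value = weekly_value * decay_per_week * weeks_late
        return lost_while_waiting + degraded_value

    print(cost_of_delay(weekly_value=10_000, weeks_late=4))  # 42000.0

The specific decay model is an assumption for illustration; the point is simply that delay has a price you can put a number on.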

A Production Pipeline aka Value Stream

If you think about a production pipeline as a value stream, it's basically a set of processes where one process takes some inputs, does some transformation, and delivers an output that gets pulled into further processes. It's a lot like stream processing, but in this case it's product development processing, and the final output is the value. Nothing delivers value until it reaches that green node at the end. You're not going to get any value when it's half done. That's an important thing to keep in mind.

What contributes to delay in one of these pipelines? There are two main things: the time it takes to process work in one of those processing nodes, and the waiting that happens as work moves from node to node, or as things get backed up for whatever reason and a queue forms. Whenever you have wait times, they manifest as queues, as you can see here at a local coffee shop in the good old days before COVID. If your processing rate isn't keeping up with demand, you're going to get queues. In those queues, nothing is happening; everyone is just sitting there waiting. In the meantime, you're incurring cost purely through delay.

The wild thing is that in software delivery, these queues are invisible. You don't see a line of people backing up out the door. You don't see boxes piling up next to a machine. It's really hard to see, and it's really important to find ways to make it visible, because most of the time, since we don't see these queues, we don't think about them, even though they're having a big impact on our ability to deliver value quickly.

Queuing Theory

I'd like to get into a little bit of queuing theory, not a lot, but enough to help us all understand how queues behave, and what that means for the techniques we can use to improve the speed at which we deliver products.

Little's Law

You've probably heard of Little's Law. You might not have, but it's pretty basic. I know that when I'm standing in a line, how long the line is matters to me, but so does how quickly people in that line are getting served. That's basically Little's Law: the average cycle time, the time it takes to move through a process, is the amount of work in progress (how long your queue is) divided by the average processing rate. It's very simple, but it becomes very useful, because you can rearrange it to answer different questions: what's my average processing rate? Take the work in progress and divide it by the average cycle time. It's a really nice formula to have, like F = ma.
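
Little's Law is easy to play with directly. A minimal sketch, with illustrative numbers:

    # Little's Law: average cycle time = work in progress / throughput
    wip = 12            # items in the system (queue + in service)
    throughput = 3.0    # items finished per day
    print(f"Average cycle time: {wip / throughput} days")  # 4.0 days

    # Rearranged: what throughput do we need for a target cycle time?
    target_cycle_time = 2.0
    print(f"Required throughput: {wip / target_cycle_time} items/day")  # 6.0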

Use Cumulative Flow Diagrams to Make Queues Visible

One really useful tool for visualizing all this is what's called a cumulative flow diagram. On one line you show the cumulative arrivals of jobs to be done; the next line shows how many of those arrivals have been picked up by your processor; and then there's the departure line, the times at which items get finished in that processor. From the diagram you can very quickly read off your queue length, how long items wait in the queue, how long they spend in processing, and your overall cycle time. It's all very visible. For example, on the far right, as you increase your processing rate, the queue gets shorter; if you don't increase your processing rate and arrivals come in faster, your queue gets longer. You can visually see Little's Law in play there.
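
You can build a crude version of this from nothing more than cumulative counts. A sketch, assuming you track total arrivals and total departures per day (numbers are illustrative):

    # Cumulative arrivals and departures per day.
    arrivals   = [3, 7, 12, 18, 22, 25]
    departures = [1, 4,  8, 11, 15, 20]

    # The vertical gap between the two curves is that day's queue length (WIP);
    # the horizontal gap between them is the cycle time.
    for day, (a, d) in enumerate(zip(arrivals, departures), start=1):
        print(f"day {day}: WIP = {a - d}")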

Characterizing Queues

It's very important to think about queues, and processing systems in general, in different ways. In some processing systems, like manufacturing, the time it takes to process an item and the rate at which items arrive can be controlled very precisely, so there isn't much variability in either job size or arrival rate. That behaves very differently from what you might call stochastic systems, where the arrival rate and the processing time vary a lot. These are modeled as a Markov process, in which the times between arrivals and the processing times are exponentially distributed: faster-than-usual arrivals and longer-than-usual jobs become exponentially less frequent. Most of the time things arrive slowly and are fairly easy to do, but there are bursts, where something takes a long time or items start arriving faster. This kind of queue is called M/M/1/Infinity: Infinity means the queue can get as long as it needs to get, and 1 means there's one processor. That's how most of our delivery systems work.
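
To get a feel for an M/M/1 queue, you can simulate one in a few lines. A sketch with exponentially distributed interarrival and service times (the two M's), one server, and an unbounded queue:

    import random

    def simulate_mm1(arrival_rate, service_rate, n_jobs=100_000, seed=42):
        """Return the average time a job spends in the system (wait + service)."""
        random.seed(seed)
        clock = 0.0      # arrival time of the current job
        free_at = 0.0    # when the single server next becomes free
        total_time = 0.0
        for _ in range(n_jobs):
            clock += random.expovariate(arrival_rate)       # next arrival
            start = max(clock, free_at)                     # wait if server busy
            free_at = start + random.expovariate(service_rate)
            total_time += free_at - clock                   # departure - arrival
        return total_time / n_jobs

    # 75% utilization: arrivals at 3/day against a service rate of 4/day.
    print(simulate_mm1(arrival_rate=3.0, service_rate=4.0))
    # Theory predicts 1 / (service_rate - arrival_rate) = 1.0 day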

Capacity Utilization and Wait Times for M/M/1/Infinity Queue

When you have this kind of queue, you can ask questions about the relationship between how busy a particular process is, like a Scrum team, and how long the queue is going to be. The average number of items in the queue is your utilization squared, divided by 1 minus your utilization. Let's say your utilization is 75%: then it's 0.75 squared divided by 0.25, about 2.25 items. What does this all mean? A busier team means a nonlinearly longer cycle time. As you can see in this graph, at 50% utilization your cycle time is two times the time in service; at 90% utilization it's 10 times. Every time you halve your remaining idle capacity, it doubles. This is really important, because it's very natural as a manager to think: I want busy teams, busy teams means we're getting more done. What this says is that if you account for the delays that happen and the queues that build up as a result, that's actually not what you want. You don't want your team so busy that the queues start exploding. Another interesting result is that the percentage of an item's time spent waiting in the queue is equal to your utilization: at 50% utilization, half the time is spent in the queue. That was a real eye opener for me. I've always had the intuition that you don't want teams that are too busy; this demonstrates it, and it's a really great tool to communicate with.
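
Here are those relationships worked out directly, a small sketch of the standard M/M/1 formulas:

    # M/M/1 relationships as a function of utilization rho.
    def queue_stats(rho):
        avg_queue_length = rho**2 / (1 - rho)  # items waiting, not in service
        cycle_multiplier = 1 / (1 - rho)       # cycle time vs. raw service time
        return avg_queue_length, cycle_multiplier

    for rho in (0.50, 0.75, 0.90, 0.95):
        q, m = queue_stats(rho)
        print(f"utilization {rho:.0%}: avg queue {q:.2f}, "
              f"cycle time {m:.1f}x service time")
    # 50% -> 2x, 75% -> 4x, 90% -> 10x, 95% -> 20x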

Most of the Damage Is Caused by High Queue States

Another really important thing to understand is that even though the higher queue states are less likely, the damage done when you hit them is significantly higher than for the shorter lengths. For example, at 75% utilization the chance of having 3 items in the queue is about 10.5%, roughly half the chance of having 1 item. But it takes three times as much time to get through those three items, so that state contributes about 1.7 times as much delay as the one-item state. The probability falls off more slowly than the damage grows, which is why the high queue states cause most of the damage. You really want to avoid getting into high queue states.
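
You can check this with the steady-state probabilities of an M/M/1 queue, where P(n) = (1 - rho) * rho^n. A sketch at 75% utilization:

    # Probability of n items in the system, and the delay each state contributes.
    rho = 0.75
    for n in range(1, 8):
        p_n = (1 - rho) * rho**n   # chance of being in state n
        damage = n * p_n           # expected delay contributed by state n
        print(f"n={n}: P={p_n:.3f}, damage={damage:.3f}")
    # n=3 is about half as likely as n=1, yet contributes ~1.7x the delay.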

Once you're in a Long-Queue State, it is Very Hard to Get out of it

Another problem with a high queue, or long queue, state is that once random processing rates and random arrival rates have drifted you into it, you're not going to randomly drift back to a low queue state. You have to add extra effort to process the excess and get yourself back down; you can't just expect to drift back to where you were before. As a matter of fact, Reinertsen describes an experiment with coin flips: every heads adds one to the queue, every tails subtracts one. Over time the count drifts in a positive or negative direction, and once you're at 10, the chance of getting back to zero is 1 in 1,000. It's really important to stay out of those high queue states.
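
The coin-flip experiment is easy to reproduce. A minimal sketch (clamping at zero, since a real queue can't go negative, is my assumption here):

    import random

    def drift_experiment(flips=200, seed=1):
        """Random walk: heads adds an item to the queue, tails removes one."""
        random.seed(seed)
        queue = peak = 0
        for _ in range(flips):
            queue += 1 if random.random() < 0.5 else -1
            queue = max(queue, 0)   # a real queue can't go below zero
            peak = max(peak, queue)
        return queue, peak

    print(drift_experiment())  # the walk wanders; it rarely drifts back on its own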

What Can You Do?

What can you do about it? Looking back at Little's Law, you can either increase your throughput or shorten your wait times. How do you increase throughput? You can add capacity. That's OK, but it takes a while to notice you're at high utilization, and a while longer to actually add people. By that point, if you're already at high utilization, your queue has grown very quickly, you can't respond fast enough, and you end up in those high queue states. You could try to be more efficient, but there's only so much you can do there. Another option is to reduce your variability. This whole behavior comes from stochastic queues, so if you can make jobs arrive at a more regular rate and keep the level of effort more reliably the same size, you get more of a steady flow and you don't get into these backed-up states. That's hard to accomplish, though, and if you look at the math, variability reduction has a linear impact on your processing, whereas queue length has an exponential impact. So really, the best thing to do is shorten your wait times by controlling your queue length. The thing you don't want to do is try to get more utilization out of your team, to make them busier; that's going to make things worse, not better.

The Importance of Small Batches

Let's talk about some solutions that have been specifically identified as working really well. The first is the concept of small batches. Most of us are well aware of this one, but it's great to understand some of the theory behind it. I really love this diagram. Remember, your overall cycle time is a function of both your processing rate and your wait times. When you're working with large batches, each node in the pipeline has to wait for the previous one to finish processing the entire batch. Then when the batch finally arrives, you suddenly have a large amount of work in progress to deal with, and processing through all of it takes another 10 minutes. On the left, with large batches, the first piece arrives after 21 minutes and the entire batch is done at 30 minutes. With one item at a time going through (and this assumes each item arrives at a regular rate and can be processed at a regular rate, so nothing backs up), your cycle time drops from 30 minutes to 12 minutes. Just moving to small batches is very effective.
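
You can reproduce the numbers in a diagram like that with a tiny model. A sketch, assuming a hypothetical three-stage pipeline, one minute per item per stage, and ten items:

    # Three stages, 1 minute per item per stage, 10 items total.
    stages, items, minute = 3, 10, 1

    # Large batch: each stage must finish all 10 items before handing off.
    batch_total = stages * items * minute                  # 30 minutes
    batch_first = (stages - 1) * items * minute + minute   # 21 minutes

    # One-piece flow: items move on as soon as a stage finishes them.
    flow_first = stages * minute                           # 3 minutes
    flow_total = stages * minute + (items - 1) * minute    # 12 minutes

    print(batch_first, batch_total, flow_first, flow_total)  # 21 30 3 12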

Advantages of Smaller Batches

It's really important to understand the other advantages of smaller batches too. You get faster feedback, which matters for a lot of reasons I'll come back to later. You make more efficient use of resources: with fewer things piled up in a large batch, you don't need as much management overhead. When a large batch arrives in your system, it overwhelms it; think of a bus arriving at a restaurant at lunchtime, where suddenly everyone is very busy because a large batch just showed up. And as I mentioned when talking about variability, smaller batches tend to make your arrival rate and your processing times more predictable, which improves flow by reducing variability. Small batches also significantly reduce waste and rework, because fast feedback means errors are caught sooner and you can adjust sooner; you don't end up building a pile of work on top of a failed assumption. An interesting side effect is that it improves the overall engagement of your team: when people get fast feedback and see their work being used quickly, they get more engaged. When there's a long wait between building something and delivering it, there's much less engagement.

Optimum Batch Size is a Minimum on a Movable Curve

The interesting question is: what's the right batch size? It depends. It depends on the cost of not processing a batch, of having it sit and wait (the delay cost), and on the transaction cost, the overhead of processing each unit. For example, every time you work on a ticket, you have to open the ticket, get approval for it, and submit a code review. If those processes take a long time, maybe because you have to run manual tests or get sign-off from a change approval board, then making your batches too small means the overhead becomes too much. So there's a curve of overall cost against batch size, and you want to find the minimum on that curve. However, you can move the curve, by reducing your transaction costs. That's why we spend so much time automating things: it lowers the transaction cost, which allows us to make our batch sizes smaller. The interesting thing is that this feeds back on itself, because as you get the benefits of smaller batches, you can build in more automation, and then you can keep making your batch sizes smaller.
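
This is the same trade-off as the classic economic order quantity curve. A sketch with hypothetical costs, showing how lowering the transaction cost moves the minimum toward smaller batches:

    # Total cost per item = holding (delay) cost + per-batch transaction cost.
    def cost_per_item(batch_size, holding_cost, transaction_cost):
        avg_delay_cost = holding_cost * batch_size / 2  # items wait ~half a batch
        return avg_delay_cost + transaction_cost / batch_size

    for transaction_cost in (100, 25):  # e.g., before and after automating tests
        best = min(range(1, 51),
                   key=lambda b: cost_per_item(b, 2.0, transaction_cost))
        print(f"transaction cost {transaction_cost}: optimal batch size ~{best}")
    # Prints ~10, then ~5: lower transaction cost -> smaller optimal batch
    # (the continuous optimum is sqrt(2 * transaction_cost / holding_cost)).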

Limiting Work in Progress

I want to talk a little bit about limiting work in progress. This is another really important technique, and one I've noticed has much less adoption in software development processes; I'll talk a little about why I think that is. We've talked about how small batch sizes improve overall flow and cycle time. But in a stochastic system, small batches don't prevent you from suddenly ending up in a high queue state, and a small batch size doesn't necessarily help you get out of that state either. So how do you get out of it?

Manage Work In Progress to Set a Hard Limit on Queue Length

Managing work in progress allows you to set a hard limit on your queue length. It's like rate limiting on an API: if you get a burst of traffic, you cap it so your system doesn't get overwhelmed. Most of the time when we talk about work in progress, we mean the things we've taken off the backlog and are actively working on as a team. In this case, though, I want to count both what's in process and what's sitting in your committed queue. If that's a little confusing, think of the difference between the backlog of things you've committed to, where there's an expectation they will be delivered, versus a backlog of things you'd merely like to work on. I think of the committed items as work in progress too.

When you're managing work in progress, you're also managing how big that committed backlog gets. When you do this, you do lose some capacity. Your team is not going to be at 100% utilization, which is a good thing, though you might say you've lost some ability to produce. Yes, maybe so, and you don't want to reduce work in progress to the point where people are sitting around doing nothing because there's not enough work; generally, that's not our problem. In exchange, you get significant gains in cycle time, because queues don't get too long. Where should you set the limit? It's hard to say. It's a bit of a game, and you want to approach it intelligently. I usually say: start from your average work in progress on a normal basis, when things aren't getting crazy, set the limit to about 2x that, and then watch. Are the queues still getting too long? Lower it. Is there too much idle time, with people sitting around wasting capacity? Then maybe increase it. You evaluate and you adjust.

A really important tip: it's useful to say, we're only going to have five initiatives going on at once in the overall organization, but it's also good to set limits at the local, team level. Then you're able to respond faster to emerging queues, and each team better understands what it can handle. When a bottleneck happens, you can push it upstream; it's basically backpressure, and the upstream systems can respond.
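
Bounded queues are what the same idea looks like in code. A minimal sketch of a WIP-limited hand-off with backpressure, using a hypothetical limit of 5:

    import queue

    # A WIP limit is just a bounded queue: when it's full, the producer is
    # refused (backpressure) instead of invisible work piling up.
    team_backlog = queue.Queue(maxsize=5)

    def accept_work(item):
        try:
            team_backlog.put(item, timeout=0.1)
            return True
        except queue.Full:
            # Signal upstream: "we'll let you know when we're ready for more."
            return False

    for i in range(8):
        print(f"item {i}: {'accepted' if accept_work(i) else 'pushed back upstream'}")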

What to Do When Work In Progress Limit Is Met

Let's say you've set a WIP limit. What are the techniques you can use to say, I'm not going to accept more work? You could simply do load shedding: once we have 10 items in our queue, that's it, we're not accepting any more work, sorry. Rather than just dropping the work, it might be better to communicate: "Upstream team, we're not accepting any more work. We'll let you know when we're ready for more." That team can then decide whether they've hit their own WIP limit, now that they have things backing up, and push the signal further upstream in turn. What's also interesting is that you can identify a place to let work settle where it's far less expensive for it to sit in queue. "We haven't decided to do this this quarter" is a much cheaper place for work to wait than sitting in front of testing.

A great example of that: if there's a lot of fog at San Francisco Airport, it can't safely accept any more planes landing. It's really expensive to have planes circling, but if the message goes back to the airports that are sending the planes, they can keep them on the ground at a much lower cost. You can also make sure you prioritize your work correctly, including the cost of delay, and drop the lower-value work. There are many more techniques in the book, but these are some really useful ones.

Sequence Work Economically: Use Weighted Shortest Job First

I really want to talk about a great technique called weighted shortest job first. It's a technique where you weight a particular job not just by its business value, but by its overall estimated cost of delay, and then divide by the time it takes to do the work. Factors in the cost of delay might include increased risk, lost opportunity, or time to value; and if there's a hard date, the loss of value over time might increase the cost of delay a lot. In this example, item B has a high cost of delay but takes a lot of work, so it actually gets a lower priority than something with a lower cost of delay. As you can see at the bottom, the numbers add up so that if you order things using weighted shortest job first, you incur a lower total cost of delay. That's also another reason smaller batch sizes are important: smaller jobs incur less cost of delay.
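
WSJF itself is a one-liner, and sequencing is just a sort. A sketch with made-up jobs:

    # Weighted Shortest Job First: priority = cost of delay / job duration.
    jobs = [
        {"name": "A", "cost_of_delay": 10, "duration": 2},   # WSJF 5.0
        {"name": "B", "cost_of_delay": 30, "duration": 10},  # WSJF 3.0
        {"name": "C", "cost_of_delay": 8,  "duration": 1},   # WSJF 8.0
    ]

    for job in sorted(jobs, key=lambda j: j["cost_of_delay"] / j["duration"],
                      reverse=True):
        print(job["name"], job["cost_of_delay"] / job["duration"])
    # Order: C, A, B. The high cost-of-delay job B goes last because it's so big.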

Work In Progress Constraints over Time Can Create Constant Flow

What's really nice is that once you have these WIP constraints, you can keep watching and reducing the limit. Over time, you can get to a point where your departure rate starts to exceed your incoming rate, and that keeps queues very close to zero. Reinertsen gives an example of a company that kept reducing WIP to find the right level, and went from a cycle time of 220 days to 30 days just by doing that. It's pretty impactful.

Controlling Work In Progress Needs Leadership Support To Succeed

This is not adopted nearly as much as small batch sizes in most companies I've worked at. I think that's because it feels wrong on so many levels for management to say: we're going to do less work, we're going to start saying no, we're not going to accept this thing until we get the next thing done, we're not going to have so many things going on at once. To many managers, and I know I've thought this myself, a busy team is a good team, and seeing people work really hard is a great thing. It goes against our intuition about what it means to improve velocity. But as we've seen, if you're at full capacity, you're going to get long queues, and everything is going to start getting delayed. I find it really helpful to walk your leaders through the math I've just gone through, and help them understand why limiting work in progress is so effective at delivering value.

It's All about Business Value and Happier Teams

It's good to remember that the goal of all these practices is better business value and happier teams. When you practice Lean, you reduce your queues, which improves your cycle time, gives you faster feedback, better software delivery performance, better business outcomes, and happier engineers and teams. And improvements usually yield more improvements. It's very important to get the support of your leadership to make this happen. For example, if you decide as a team to limit work in progress and start saying no, but your leaders don't really understand why, they're just going to put pressure on you to take on the work: "Why do you keep saying no? You can't say no. We need to keep going on this stuff." There needs to be strong leadership support. That's why I hope this talk gives you ammunition, as it were, to help your leaders understand why these steps are important. Use these tools; they apply in many different situations. If something seems slow or frustrating, ask whether you can apply one of these principles.

Again, read "Accelerate." Read "The Principles of Product Development Flow." There's a lot to learn there.

 


 

Recorded at:

Mar 09, 2021
