Dominica DeGrandis on Dealing with a World of Uncertainty (Especially in Ops)


1. Hi. I am Manuel Pais and I am here at AgileConf 2015 with Dominica DeGrandis, Agile and Lean Coach. Thank you for accepting our invitation, Dominica. Can you briefly introduce yourself to our audience?

I am Dominica DeGrandis. I am director of training and coaching at LeanKit. I live in Seattle, four children, married. My hobbies include Yoga - I am a big Yoga fan – and I like to rearrange kitchens. I think that between Yoga and Kanban we could solve about 92% of the world’s problems.

Manuel: Probably. Your talk here at AgileConf is about an Ops team that wanted to improve their practices, but apparently they had no slack to be able to do that. In that kind of situation, how would you propose to actually stop firefighting all the time and get buy-in from management to invest in improving the flow of work, knowing that this typically means not being able to respond as quickly to customers in the short term?

I think firefighting comes in a couple of different flavors. There are the true fires, when your servers are under attack and production threatens to come tumbling down – that is a real fire. But there are also perceived fires, where maybe product owners or features that are trying to be delivered are competing with each other and you have conflicting priorities. Those kinds of fires come from escalations that go up the food chain, over to somebody else’s director and back down, and then you get a walk-up to your desk.

We find that a lot of escalations are of the second type (perceived fires) rather than the first (real fires). If we do have a significant number of real fires and escalations, it is because we are not taking time to do the maintenance, to fix the technical debt, to figure out what our priorities are. So, sooner or later, the problem gets worse and we end up having a DDoS attack because we do not have enough bandwidth. Or there is an audit coming up and now, because we have not been doing regular maintenance and working on compliance, we have to stop this other work.

Eventually, if we don't do the maintenance, we will have more priority 1's. So I think bringing some visibility of that to management can help them begin to understand the importance of taking some time to better prepare and do some maintenance.


2. How can you then demonstrate to management that you are making progress towards a better workflow and fixing technical debt and those kinds of problems, and not wasting effort on one-off improvements?

When I hear that question, I think of how we show that our flow is improving instead of being trapped into doing “fixed date” kinds of releases. I think the key to demonstrating that we are improving our flow is showing that our cycle time is improving: that the elapsed time for getting work done, from the time we start working on something to the time we've delivered it, is getting shorter. And that the amount of work in progress that we are taking on is being reduced, because of the direct relationship between work in progress and cycle time.

So when measuring, we want to show those two metrics. Then your second question, about having regular fixed-date-driven work going on constantly – that is a really long conversation. But the short answer, I think, is understanding how much uncertainty there is in our world, particularly in operations, and that when we are dealing with a world of uncertainty, we need to look at probabilities and not hard due dates.
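One way to picture "probabilities and not hard due dates" is to read a forecast off the team's past cycle times instead of committing to a single date. The following is only a minimal sketch under stated assumptions: the list of past cycle times is invented for illustration, and the nearest-rank percentile is just one simple way of turning history into a statement like "85% of items finished within N days". It is not a method described in the interview.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value such that at least
    pct percent of the values are less than or equal to it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical cycle times (calendar days) of recently finished work items.
past_cycle_times_days = [2, 3, 3, 4, 5, 5, 6, 8, 13, 21]

# Instead of promising one date, state a range with a confidence level.
p50 = percentile(past_cycle_times_days, 50)
p85 = percentile(past_cycle_times_days, 85)
print(f"Half of past items finished within {p50} days")
print(f"At least 85% of past items finished within {p85} days")
```

With this kind of summary, a conversation with stakeholders can be framed as "there is roughly an 85% chance this is done within that many days", assuming new work resembles past work.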


3. And in the case of that organization you are talking about, were those the metrics you put in place for tracking their progress?

We did. We looked at cycle time, we looked at work in progress and we looked at throughput: three key metrics. Originally they were starting out using story points, but story points are a bit arbitrary, right? We are interested in looking at actual time. How long did it take in terms of calendar time? Because that is what people want to know, especially management. They want to know when this thing is going to be done.
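The three metrics mentioned here can all be derived from nothing more than the start and finish dates of each work item. The sketch below is illustrative only: the work items and dates are made up, the function names are hypothetical, and the definitions (cycle time in calendar days, throughput as items finished in a period, work in progress as items started but not yet finished on a given day) are common conventions rather than the exact ones used by the team in the interview.

```python
from datetime import date

# Hypothetical work items as (started, finished) calendar dates.
items = [
    (date(2015, 6, 1), date(2015, 6, 4)),
    (date(2015, 6, 2), date(2015, 6, 9)),
    (date(2015, 6, 3), date(2015, 6, 5)),
    (date(2015, 6, 8), date(2015, 6, 12)),
]


def cycle_times(items):
    """Elapsed calendar days from start to finish for each item."""
    return [(finished - started).days for started, finished in items]


def throughput(items, period_start, period_end):
    """Number of items finished within the period (inclusive)."""
    return sum(1 for _, finished in items
               if period_start <= finished <= period_end)


def wip_on(items, day):
    """Number of items started on or before `day` and not yet finished."""
    return sum(1 for started, finished in items if started <= day < finished)


times = cycle_times(items)
print("Cycle times (days):", times)
print("Average cycle time:", sum(times) / len(times))
print("Finished in first week:", throughput(items, date(2015, 6, 1), date(2015, 6, 7)))
print("WIP on June 3:", wip_on(items, date(2015, 6, 3)))
```

Because everything is measured in calendar time, the output answers the question management actually asks ("how long did it take?") without any story-point conversion.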


4. What was the net improvement in the end for that organization, from the start to whatever point you are now?

I just have to say that there is still room for improvement. So this answer might surprise you a bit. I think one of the biggest improvements was the ability for the team to now have a voice in the matter, to bring a voice of reason about why they cannot do 500 projects at the same time. Why that does not work, why 100% capacity utilization does not work and why we need to have some flexibility in the schedule to do some improvements, to make sure that we are secure, to make sure that we are going at a sustainable pace. Because if we expect people to be on call 24/7, burnout starts to happen. I think a happier crew is one of the net improvements and when we've got a happier group of people, particularly in Ops, we've got better service.


5. What were some of the lessons learned during that project?

A couple of big ones.

One, know the dependencies. We would march off on a project, there would be a due date set instead of a probabilistic outcome, and we would find that this project was dependent on three other teams and we had had no idea going in. Really thoroughly understanding the uncertainty that is involved because of unknown dependencies was huge.

The other lesson I learned personally was about coming in to help this team, because I was brought in by some of the leadership but did not necessarily have the buy-in of the individuals who were already overburdened. And now they have to look at you as this new thing that is coming. I think getting all-around consensus, in addition to executive support, for trying this new thing would have been a good idea upfront.


6. Centralized Ops teams, which in some cases are even external to the organization, still seem to be the norm in the industry. How does that impact DevOps adoption, from both an organizational and a technical point of view?

Adopting DevOps from an organizational point of view - we see it go back and forth. This organization started out as centralized – they've been reorganized three different times so then they went to a more matrixed approach. So it was one central organization and then it was broken up into different components so you have a site reliability team and then you have a team that is building out new data centers and then you have a team that is handling project work or feature work.

I think that, no matter what structure you are in, how we communicate is what becomes important. So while siloed teams may have the benefit of being cohesive and tight and have a bit more connectedness, as far as moving to a DevOps perspective we want to be able to anticipate what is headed our way as soon as possible. So when Ops people are embedded upstream into product development teams, they can participate, they can have a voice and a say in what is headed their way. When you have time to discover things upfront, you can design a more elegant solution to the problem instead of trying to hack something. I think it works more in DevOps fashion because now we become more part of a team. We can join the party too.


7. So is it more like a trade-off where there is some advantage to centralized teams and other advantages to more distributed ones?

I think one of the hardest cases or scenarios is when half the team is on site and half the team is distributed. Because that core group sees each other every day, it is probably easier for them to bond, versus the people working remotely, who feel a little bit left out. It is harder to feel like you are part of the team, so maybe bringing those remote people on site at a frequent cadence is pretty helpful.

Manuel: And in terms of staffing – usually, adding more developers during growth periods is not a problem, at least if the technical stack is a popular one. But that does not usually happen for Ops; it seems to be harder to grow the Ops team according to the needs.

Yes, we have more specialization in operations. I think there is an idea that we can bring on more developers, but then you have a shared services team and operations that is trying to support multiple projects that are all conflicting with each other. If the expectation is that Operations is going to be at all those individual stand-ups, it is just not going to happen because they would spend all day in stand-ups. They can't make it to 10 stand-ups every day, so then they miss one, and that one key piece of information that they needed to hear is missed. It is like sorting through your mail inbox with 500 e-mails: maybe there are 2 or 3 key e-mails that you need to get, but you are inundated with e-mail and you miss them. It is a problem.


8. It's hard to grow the Ops team mostly because it has a steeper learning curve, would you say?

That is what I observe when I ask people how long it takes to get people up to speed in this organization – 6 months, 9 months. And when it comes to hiring, often people will say “We need help. We need more people. There is too much demand. We need to hire more people.” But when you consider the amount of effort that is involved with writing job descriptions, telephone interviews, interviewing people on site, hiring new people, getting them up to speed, your performance actually goes down for that period of time until you can cross-train those individuals.


9. Going back to Lean flow and improving flow of work, can you share some strategies that you have applied successfully in multiple organizations?

Sure. If we can focus on finishing some work before starting new work, that is key. Because what typically happens is we've got too much work on our plates and then when new work comes in, we are expected to do that also, and the amount of work in progress piles up and piles up and cycle time gets longer.

So the first thing you want to do is start measuring how much work we have on our plate, how long is it taking – the elapsed time, calendar time – and address that first thing. Some successful approaches I have seen are temporary teams that are created to finish projects that are 90% done. They just need that little bit of effort to get it done. So, we have special SWAT teams that just really focus on finishing some of the projects to get them done and then hold off on starting new projects until those other projects are done.

Another key thing we look at is getting a collective agreement on prioritization. Usually everything is a priority one. When everything is a priority one, nothing is a priority one. So, we are trying to get collective agreement above, at the higher level, from stakeholders, on what the priorities are. It is very helpful for teams in helping get their work done.


10. Last question. You have written about Kanban for DevOps. Can you explain what you mean and what the benefits are?

What is Kanban for DevOps? Well, let’s define DevOps first because that is a little nebulous right now. It seems to be defined more by what it isn't, than by what it is. My take on DevOps is that it is about improving organizational health, it is about improving performance, it is about improving co-operation and trust, and improving job satisfaction. That is key.

Kanban is about doing continuous improvement and just looking at bringing visibility to the work and reducing the amount of work we have going on so that we can increase our throughput and have a smoother flow. I think that using the Kanban approach to improve performance, to improve feedback, to improve job satisfaction and culture is the perfect marriage. They go hand in hand.

We are trying to move away from command and control and to work in a more experimental nature. We are trying to do smaller batch-size releases. We are trying to do gradual changes over time and bring about a better place to work, a more sustainable place to work instead of relying on heroism to get things done.

Oct 18, 2015