Anders Wallgren of Electric Cloud on Metrics for DevOps and the Importance of Culture

Key takeaways

  • The importance of metrics for DevOps
  • The value of having visibility into the whole deployment pipeline
  • It is possible to reduce deployment cycle times by orders of magnitude
  • A supportive, safe culture is needed to enable these types of improvements
  • Start by making the whole deployment pipeline visible and optimize the whole, not just one part

At the Agile 2016 conference in August Anders Wallgren, CTO of Electric Cloud, gave a talk on the importance of the right metrics in DevOps. Afterwards he spoke to InfoQ about the talk and the state of DevOps in general.

InfoQ: Please tell us a bit about Electric Cloud?

Anders: Electric Cloud provides DevOps Release Automation solutions. We work with our customers to automate their software delivery chain, their software pipeline. Our approach is to take an end-to-end view of how things work, so what we’re really striving for is “laptop to live”. We want to have one chain of automation from beginning to end. Lots of other tools are going to be involved in that: test tools, build tools, performance testing, security testing and so on. We are kind of the conveyor belt and the single pane of glass for managing and coping with all of the software delivery headaches that we all deal with.

InfoQ: Please tell us about your talk.

Anders: I spoke about metrics and the role that they play, particularly in DevOps, and in Agile as well. I tend to lump these things together a little bit, which upsets some people. But there’s a continuum of all these things and there’s overlapping circles. I think one of the interesting things that the State of DevOps report, which Puppet Labs and Gene Kim at IT Revolution put out once a year, has really shown, and pretty rigorously in terms of statistics, is that there are certain behaviors and ways of doing this stuff that actually impact the bottom line, customer happiness, employee satisfaction, all of those kinds of things.

And a lot of that is rooted in metrics, and they found a bunch of pretty interesting metrics that turn out to be highly correlated with success in software teams. Early on, in the 2014 report, they pointed out that organizations that use binary artifact repositories tend to have lower failure rates and lower mean time to recovery. There’s a high correlation between how often you are able to deploy and how successful you are as an organization.

So the talk was “Let’s look at some examples of where organizations are using these kinds of metrics.” I laid out the prototypical software pipeline: Dev, CI, QA, Release, Operations, with a bunch of metrics relevant to one or more of those phases, and then layered in a little bit of the information that we’ve gleaned from our customers and how they’ve been using those metrics. And the kinds of returns that they’re getting: cuts in cycle time, fewer errors, less rework due to errors, and so on.

And the interesting thing is that actually ties nicely together with the numbers that the State of DevOps report talks about in terms of, “Here are some predictors for high performing organizations.” It turns out that a lot of the anecdotal evidence we have backs that up, which you would hope would be true, and it lines up pretty nicely.

InfoQ: So what are some of these important metrics?

Anders: Cycle time is always a really important metric. We did a survey a couple of years ago of 800 people or so, not a scientific survey, and we asked developers and QA people, “How much time do you spend waiting each week?” For developers, it was 12 hours a week. For QA people, it was 20 hours a week, waiting for builds, waiting for environments, waiting for tests to finish, waiting for deployments to finish, just sitting there basically reading their newsfeed, or context switching, which can be very expensive.

So cycle time has always been a pretty important thing that influences a lot of things down the line. I think one of the interesting things that we’re seeing, and this is actually covered in the 2016 report, is that the amount of rework that you have to do, whether that’s something like security rework or bugs found downstream, highly impacts your productivity as an organization. A low performing organization spends twice as much time remediating security problems as a high performing organization. That’s a pretty significant difference. You know, if you didn’t spend your time doing that rework, you could spend it doing useful stuff.

So those kinds of metrics: cycle time, lead times, mean time to recovery in case of an outage or a problem during a deploy (how long does it take us to discover it, remediate it, and make sure it doesn’t happen again?). All of those kinds of things are very, very valuable metrics. There’s probably no one perfect set of metrics for everyone, but there’s a cluster of very useful ones that are probably pretty universally applicable.
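
As a rough illustration of how metrics like these might be computed, here is a minimal Python sketch that derives lead time, change failure rate and mean time to recovery from a list of deployment records. The record structure and field names are hypothetical, invented for this example; they do not come from the interview or from any specific tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

@dataclass
class Deployment:
    committed_at: datetime                   # when the change was committed
    deployed_at: datetime                    # when the change went live
    failed: bool                             # did the deployment cause an incident?
    restored_at: Optional[datetime] = None   # when service was restored, if it failed

def lead_time(deploys):
    """Average time from commit to running in production."""
    deltas = [d.deployed_at - d.committed_at for d in deploys]
    return sum(deltas, timedelta()) / len(deltas)

def change_failure_rate(deploys):
    """Fraction of deployments that caused a failure."""
    return sum(d.failed for d in deploys) / len(deploys)

def mean_time_to_recovery(deploys):
    """Average time from a failed deployment to restored service."""
    outages = [d.restored_at - d.deployed_at
               for d in deploys if d.failed and d.restored_at]
    return sum(outages, timedelta()) / len(outages) if outages else timedelta(0)
```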

InfoQ: Can you give us some examples of these metrics and their impact?

Anders: Sure. We’ve been working with some pretty large organizations. Our customers tend to be GE Healthcare or Huawei or Qualcomm or Juniper, not so much the five people building a mobile app kind of thing. A couple of years ago, Huawei decided to focus on: what is our lead time for feature delivery? So from the time that we decide to do something to when it’s ready to go, how long does it take? So they studied it and worked on it and realized it was about 30 days, which isn’t that bad, actually. But they decided they wanted to bring that down to seven.

Now, that drove a whole set of initiatives. It drove an initiative to get cycle time for CI down, which they did, from ten minutes to about one minute; it drove an initiative to get their production builds down from about 300 minutes to about an hour. They cut their test times by a factor of four by parallelizing execution of tests, and so on and so on.
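
To make the parallel-testing point concrete, here is a minimal sketch, assuming a set of independent test suites that can run at the same time. The suite names and the pytest invocation are placeholders for whatever a real project uses, not details from Huawei’s setup.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

# Hypothetical, independent test suites; a real project would list its own.
SUITES = ["unit", "integration", "api", "ui"]

def run_suite(name: str) -> int:
    # Each suite runs in its own process; the threads just wait for it to finish.
    return subprocess.run(["pytest", f"tests/{name}"]).returncode

with ThreadPoolExecutor(max_workers=len(SUITES)) as pool:
    codes = list(pool.map(run_suite, SUITES))

# Wall-clock time is roughly the slowest suite rather than the sum of all suites.
raise SystemExit(0 if all(code == 0 for code in codes) else 1)
```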

There’s a large contingent of customers as well, HP, Qualcomm, a bunch of others, where one of the big returns that they got was automating the testing so they don’t do so much manual testing anymore. That has lots of impact: it cuts down your cycle time, but it also lets you run more test cycles. That’s going to improve quality. You’re going to find more problems. That’s going to lead to less rework and more time spent on task.

So there are all kinds of interesting things like that. Some of our customers that have longer build and test times are cutting them dramatically. We’ve got lots of examples where, for example, Juniper I think (I hope I’m getting the numbers right, but the magnitudes are about right) basically went from a 20-hour build to a one-hour build. That’s significant: now you don’t go home and come back a couple of days later only to find out that it failed. You wait an hour and find that out.

So you’re able to take that slack out of the schedule. The other thing that we see a lot is those islands of automation with manual handoffs in between, and a lot of the time is wasted and a lot of the errors happen in those handoffs. So the kind of end-to-end automation that we work with our customers to do really helps to tighten that up, move things to the left more and more, and gives you more and better test cycles.

An interesting example that I heard from one of our customers, FamilySearch: their goal was to reduce delivery time. When they started their continuous delivery journey, they were at about 90 days from code complete and built to when it got deployed and was live. They now do that in ten minutes, and that’s a game changer. They can be on the phone with a customer who just reported a bug, chat with them for a while and say, “Oh, try it now.” They’ve fixed it and deployed the fix, assuming they can fix it quickly, obviously. But deploying to production is no longer a problem for them in terms of their ability to deliver value to their customers.

InfoQ: So what needs to change in an organization to move from a 90-day deployment cycle to a ten minute deployment cycle?

Anders: A lot of things. Automation is key. We need to get people to stop doing things that machines are better at. As a species, we’re kind of bad at typing the right command into the right terminal window at the right time and rebooting the right machine.

But also culture. Culture in terms of working in a culture that’s not punitive or bureaucratic. You can tell a lot about a culture by what they do to the messengers, the bringers of bad news. Do they get executed? Do they get ignored, or do they get helped? And then we go figure out, “Okay, let’s remediate the problem and fix it now, quickly.” But then also, let’s go back and figure out, “Why didn’t we catch that earlier? Why did this happen in production, or why did it happen in UAT? Why didn’t it get caught in the unit tests?”

So cultures that are more blameless and more generative, in the typology of cultures. There’s a massive correlation between the type of culture you have and your success at DevOps and your success at software. A lot of it is lean management techniques. You need to do continuous improvement, and you can’t do continuous improvement if you’re always shooting the person who’s telling you what needs to be improved. I think, honestly, it’s still one of the hardest things to get right. Nobody likes to have their cheese moved. And to do these things, we do have to change the way we work a little bit, both in Ops and in Dev and everywhere in between.

InfoQ: What about upstream?

Anders: Developers build quality in. I think one of the biggest indicators of how well you are going to do is how much time you spend doing rework and how much time you spend dealing with problems downstream that should have been caught upstream, whether that means the individual developer should have written a unit test or it should have been caught in regression testing or performance testing or security testing. All these things we used to do as follow-on phases: we’ll build the software, then we’ll build in the quality, then we’ll build in the performance, then we’ll build in the security, and then we’ll ship it.

InfoQ: Except we often missed the bits in the middle.

Anders: Well, yes, of course. But now it’s much more common to say, “Look, we’re going to have our CI system. We’re going to do the minimum amount of work that we have to do there to make sure that we didn’t completely blow things up.” But then we have follow-on stages after CI to do things like UAT and performance, the things that probably take too long to do in your CI cycle, because you want fast feedback to the developers. But these things have to happen for every piece of code that gets released. They have to happen automatically. You have to collect all the data.
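
As a toy sketch of that shape of pipeline (a fast CI gate followed by heavier, automatic follow-on stages that always record their results), here is some illustrative Python. The stage names and placeholder actions are invented for the example, and a real pipeline would of course live in a CD tool rather than a script.

```python
import time

def run_stage(name, action, results):
    start = time.time()
    ok = action()
    results.append({"stage": name, "passed": ok,
                    "seconds": round(time.time() - start, 1)})
    return ok

def pipeline(stages):
    results = []
    for name, action in stages:
        if not run_stage(name, action, results):
            break          # stop on failure, but keep the data collected so far
    return results         # every run leaves a record, pass or fail

# Placeholder actions; each would invoke real tooling in practice.
stages = [
    ("ci: build + unit tests", lambda: True),   # kept fast for developer feedback
    ("deploy to UAT",          lambda: True),
    ("performance tests",      lambda: True),   # too slow to block CI
    ("security scan",          lambda: True),
]
print(pipeline(stages))
```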

All of these things have to happen for you not to get into a bad situation once the software goes out there, especially as we’re starting to get into the Internet of Things, where all my personal data is available, including my heart rate and all that kind of stuff: security, privacy. All of those things are important and are not add-ons; they’re features, primary first-class features, and you need to treat them as such. And I think the industry is getting a little bit better at doing that. We’re still not great at it, but we’re getting better. There’s enough public shaming of problems now that nobody wants to be that company.

InfoQ: Looking even further back up to where needs come from. How do we make sure we build the right stuff?

Anders: I think that’s where Agile has really helped a lot, obviously. I mean I don’t want to say that Agile has solved the problem because it’s still a tricky problem to get it right. But I think the focus on cycle time helps quite a bit. The more prototypes you can put in front of your customer, the more you can have them trial things out, the more you can show them what it’s going to do, and the more you can involve them in that, the better.

So I think it’s fair to say Agile has largely won the methodology war. And now, I may get in trouble for saying this, but continuous delivery with DevOps is plumbing the last mile. Obviously, the Manifesto says, “We care about the whole thing” and that’s absolutely true. From a practical perspective, though, I think the focus was more on Dev, QA and product ownership and less on how we deliver from there; that was probably where the most pain was ten years ago. So we focused on that, and now we’re trying to figure out how to plumb the last mile out to the customer, whether that’s firmware that goes in a box that goes to Amazon, or whether it goes up on the web, or what have you.

InfoQ: Aside from culture, what are some of the other impediments?

Anders: One of the things that we, I don’t know if I’d call it an impediment, maybe a challenge. Let me put it this way: the most frequent question I get when I talk about these things is, “Okay, but where do I start?”, because this is a big thing. I think people have realized this by now, but at first it was like, “Well, we’ll send some people to a class and then we’ll be Agile or we’ll be DevOps, and we’ll leave on Friday doing waterfall and on Monday morning we’re Scrum or we’re DevOps.”

It doesn’t work that way. It is a transformation. There are cultural changes that probably have to happen. There are tooling changes that have to happen. There are technological and process changes that have to happen. One of the things that we encourage our customers to do is figure out where you are on that spectrum, because company A’s challenges aren’t necessarily the same as company B’s.

The other thing we encourage our customers to do is to sit down and figure out what your process looks like from laptop to live. From the time a developer checks in a piece of code to the point where it’s usable by the customer, what do you do? Especially in large organizations, especially in organizations with distributed teams and all of these kinds of things, it’s pretty common that not everybody has that picture in their head. In fact, it’s extremely common to say, “Hmm, you know what? Bob knows about that. Let me go get Bob.” So we bring Bob to the room and we talk and he says, “No, no, no, that’s not what we do at all. We do this. And by the way, if you guys did this, then that would help us.”

So get everybody in the room, ideally at the same time, and really map out what the process is. One of the benefits of that is that you can now look at it and figure out where the biggest pain point is. Because what tends to happen when we’re siloed is that it’s like the blindfolded people looking at the elephant: everybody cares about their little piece of the elephant.

If I’m a build engineer, I want faster builds. If I’m a QA engineer, I want faster testing. But what is your biggest problem? Is your problem rework due to bugs that aren’t detected in test? Is it unreliable deployments? Is it downtime in operations? Unless you know that, you can frequently spend a lot of time not moving the ball forward, because you’re making a local optimization over here which doesn’t help the bottom line at all.

So we always encourage our customers to, at least once, sit down and figure that out. What we then encourage them to do is to model that, codify it in our product and run it that way. Then you can evolve your process along with your product, and your process should really live with your product a little bit. The analogy I’ve started using now is: back when it was all desktop software, we built installers. Our deployment scripts, our deployment processes, are our installers these days, right? They’re the installers that we give to Ops, as it were.

And we’re kind of making the same mistake that a lot of us did with installers, which was, “Give it to the new guy. He’ll do it.” Even though it’s a pretty tricky thing to do, it’s not something you necessarily want to throw inexperienced people at. In fact, you want to throw your best people at the problem. How do we automate our DevOps pipeline? How do we do continuous delivery? That’s not something you throw some inexperienced people at. You want your best people on that to get that right because it is difficult and important.

InfoQ: What is new in the Electric Cloud product suite?

Anders: We’ve got a new release of Electric Flow coming out. The banner headline on that one is that we now basically have one-click rolling deployments into large-scale environments. We now have the ability to model, on a per-environment basis, and say, “When you deploy into this environment, I want you to do this sort of phased rollout: I want you to deploy to either this many machines in this tier or this percentage, then I want you to run this test or this verification. Then we can deploy to 50% and do another test.” So you can pretty easily, either in our domain-specific language or in the UI, plan out a rolling deploy into a large-scale environment.
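
The phased-rollout pattern being described can be sketched in a few lines of Python. This is only an illustration of the general idea, not Electric Flow’s actual DSL or API, and the deploy and verify callbacks are assumed placeholders.

```python
from typing import Callable, Sequence

def rolling_deploy(hosts: Sequence[str],
                   deploy: Callable[[str], None],
                   verify: Callable[[], bool],
                   phases: Sequence[float] = (0.10, 0.50, 1.0)) -> bool:
    """Deploy to a growing fraction of hosts, verifying after each phase."""
    done = 0
    for fraction in phases:
        target = int(len(hosts) * fraction)
        for host in hosts[done:target]:
            deploy(host)                 # push the new version to this host
        done = target
        if not verify():                 # smoke tests / verification after each phase
            print(f"verification failed after {done}/{len(hosts)} hosts; halting rollout")
            return False
    return True
```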

One of the things that we’ve heard from customers is about scale. One of our customers is a Fortune 20 bank. They have 6,000 applications and 144,000 endpoints in their data centers. So at that massive scale, one of the tricky issues for them is: how do we not step on each other’s toes? So release scheduling and environment calendaring and reservation is another piece of functionality that we have in there, as well as a full dependency view of all these applications, all of these components and so on.

So you can say, “Okay. We’re going to rev this service. Who uses it? These three apps. Okay, we’re going to do it on Friday. We need to touch these environments. Let’s reserve them and make sure that nobody else is doing deployments into those environments.” And by doing so, I mean we literally prevent people from doing deployments into that environment unless you hold that reservation, or you can set blackouts or those kinds of things. I think one of the exciting things is that, obviously, this kind of capability has been around for a long time, but it’s usually been bespoke, scripted, custom stuff written in-house, or the unicorns of the world have been able to do it. So now we’re kind of bringing that to the horses, not just the unicorns, being a little bit more off-the-shelf in doing those kinds of things, which is pretty exciting and fun to see happen.
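
As a small illustration of the reservation idea (deploys into an environment are allowed only for the team holding the current reservation, and never during a blackout window), here is a hypothetical check; all of the names and fields are invented for the example.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional, Tuple

@dataclass
class Reservation:
    environment: str
    team: str
    start: datetime
    end: datetime

def may_deploy(env: str, team: str, now: datetime,
               reservations: List[Reservation],
               blackouts: List[Tuple[datetime, datetime]]) -> bool:
    if any(start <= now <= end for start, end in blackouts):
        return False                                    # blackout window: nobody deploys
    holder: Optional[Reservation] = next(
        (r for r in reservations if r.environment == env and r.start <= now <= r.end),
        None)
    return holder is not None and holder.team == team   # only the holder may deploy
```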

About the Interviewee

Anders Wallgren is Chief Technical Officer of Electric Cloud. Anders brings with him over 25 years of in-depth experience designing and building commercial software. Previously, Anders held executive and management positions at Aceva, Archistra, Impresse, Macromedia (MACR), Common Ground Software and Verity (VRTY). Anders holds a B.Sc. from MIT.
