
How to Measure Continuous Delivery


Stability and throughput are the two things you can measure when adopting continuous delivery practices. These metrics can help you reduce uncertainty, make better decisions about which practices to amplify or dampen, and steer your continuous delivery adoption process in the right direction.

Steve Smith, an independent continuous delivery consultant, will talk about measuring continuous delivery at Lean Agile Scotland 2017. The conference will be held on October 4-6 in Edinburgh.

InfoQ interviewed Smith about what makes continuous delivery so hard, how metrics can help you adopt continuous delivery and what to measure, what he learned from using the metrics at a UK government department, and how Google's SRE concept of "error budgets" relates to his continuous delivery metrics.

InfoQ: What makes continuous delivery so hard?

Steve Smith: I always say there are two kinds of people doing continuous delivery - those that know it’s hard, and those that are in denial. Continuous delivery is hard because you’re trying to introduce a huge number of technology and organisational changes into an organisation.

Adopting changes such as automated database migrations or blameless post-mortems isn’t the hard part. Choosing the tools to use isn’t the hard part either - most tools are absolutely fine, as long as you avoid the really bad ones. The hard part is applying those changes to the unique circumstances and constraints of the organisation. Continuous delivery is different for every organisation, and that needs to be recognised from the start.

InfoQ: How can metrics help to adopt continuous delivery?

Smith: Metrics can help you reduce uncertainty, and make better decisions. They help you understand if your adoption process is headed in the right direction.

I recommend to clients that they start out with the Improvement Kata, which creates a cycle of iterative, incremental improvements around your current ways of working. But when you’ve set out your vision, how do you know how far away it is? When you want to establish your next improvement milestone, how do you know what to aim for? When you have experimented with a change such as automated database migrations, how do you know if it has improved the current situation?

Metrics can’t give you those answers, but they can guide you to where those answers are. I worked for 2.5 years in a major UK government department and we had 60 teams working towards continuous delivery. Without metrics we didn’t know which teams were doing well, which teams needed our help, which practices should be amplified, or which practices should be dampened. The metrics we used pinpointed which teams we needed to speak with, and what we should speak about.

InfoQ: What are the things that you suggest to measure?

Smith: Continuous delivery is all about improving the stability and speed of your release process, so unsurprisingly you should measure stability and speed! Those are intangibles, but they’re not hard to measure. In How To Measure Anything, Douglas Hubbard shows how to use clarification chains to measure intangibles - you create tangible, related metrics that represent the same thing.

Luckily for us, the measures have been identified for us. In the annual State Of DevOps Report Nicole Forsgren, Jez Humble, et al. have measured how stability and throughput improve when organisations adopt continuous delivery practices. They measure stability with Failure Rate and Failure Recovery Time, and they measure throughput with Lead Time and Frequency. I’ve been a big fan of Nicole and Jez’s work since 2013, and I’ve done a deep dive into the measures used and how they pertain to continuous delivery. Those are the measures I recommend.
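The four measures Smith names can be made concrete with a small sketch. This is an illustrative calculation over invented deployment records, not the tooling the State Of DevOps Report or Smith actually uses; the record fields are assumptions:

```python
from datetime import datetime, timedelta

# Hypothetical deployment records (fields invented for illustration):
# committed = code committed, deployed = released to production,
# recovered = failed deployment restored to a working state.
deployments = [
    {"committed": datetime(2017, 9, 1, 9), "deployed": datetime(2017, 9, 1, 15),
     "failed": False, "recovered": None},
    {"committed": datetime(2017, 9, 2, 9), "deployed": datetime(2017, 9, 2, 18),
     "failed": True, "recovered": datetime(2017, 9, 2, 19)},
    {"committed": datetime(2017, 9, 4, 9), "deployed": datetime(2017, 9, 4, 11),
     "failed": False, "recovered": None},
]

# Throughput: Lead Time (commit to deploy) and Frequency (deploys per window).
lead_times = [d["deployed"] - d["committed"] for d in deployments]
avg_lead_time = sum(lead_times, timedelta()) / len(lead_times)
frequency = len(deployments)

# Stability: Failure Rate and Failure Recovery Time.
failures = [d for d in deployments if d["failed"]]
failure_rate = len(failures) / len(deployments)
recovery_times = [d["recovered"] - d["deployed"] for d in failures]
```

Tracked per team over time, even this crude form shows whether a change to the release process moved the numbers in the right direction.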

InfoQ: What have you learned from using the metrics at a government department?

Smith: I learned that adopting continuous delivery without metrics is a bit like operating a production environment without monitoring. Without adoption metrics, you’re flying blind. You don’t know which changes have been successful and should be amplified, or which have failed and should be reversed as soon as possible.

In this particular UK government department, we created an internal website that showed stability and throughput metrics for each team and all of their services. That gave us pointers to all kinds of interesting conversations, and insights into some unusual problems. For example, one team massively improved deployment stability over a short period of time, and when I met them it wasn’t clear what they were doing differently. The only small difference was, they had their own custom logging and monitoring dashboards. We extracted the dashboard JSON, wrote a DSL to generate the same JSON, and rolled it out to all teams and services across the country. Within a few weeks, multiple teams reported to us that operating their production services had become easier as a result.
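The "DSL to generate the same JSON" pattern can be sketched briefly. This is a minimal, assumed shape - the real department DSL, field names, and monitoring tool are not described in the article:

```python
import json

# Hypothetical declarative description of a team's dashboard, expanded into
# the JSON a monitoring tool might consume. All names here are invented.
def dashboard(team, services, metrics=("error_rate", "latency_p95")):
    return {
        "title": f"{team} operations",
        "panels": [
            {"title": f"{svc} {metric}", "query": f"{svc}.{metric}"}
            for svc in services
            for metric in metrics
        ],
    }

spec = dashboard("team-a", ["payments", "accounts"])
print(json.dumps(spec, indent=2))
```

The point of generating the JSON rather than copying it is that a change to the shared template rolls out to every team's dashboard at once.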

InfoQ: What’s your take on the Google SRE concept of "error budgets" and how does it relate to your continuous delivery metrics?

Smith: The Site Reliability Engineering book is very good. Interestingly, Betsy Beyer et al. define reliability as a function of MTBF and MTTR, which is of course synonymous with the continuous delivery definition of stability as a function of Failure Rate and Failure Recovery Time.
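That MTBF/MTTR relationship is a one-line calculation - availability is the fraction of time the service is up. The figures below are assumed for illustration, not from the book:

```python
# Availability as a function of MTBF and MTTR (illustrative figures).
mtbf_hours = 720.0   # mean time between failures: assumed ~30 days
mttr_hours = 2.0     # mean time to repair: assumed 2 hours
availability = mtbf_hours / (mtbf_hours + mttr_hours)
print(f"{availability:.4%}")
```

Shortening recovery time improves availability just as directly as making failures rarer, which is why Failure Recovery Time sits alongside Failure Rate in the stability measures.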

Error budgets are a good idea. I always encourage product owners to define their operational requirements, and that includes the required reliability for their product. Using the inverse as an allowance for risk taking could work well, and it’d be interesting if a team practising continuous delivery did something similar - if they rigorously measured release stability and throughput, and blocked auto-deployments to production when stability drops below a configurable threshold. I’ve seen some companies score builds in the past on static analysis, OWASP testing, etc. Scoring proposed deployments on the stability of past deployments is not something I’ve seen done, but it’d be good to see.
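The gating idea Smith floats could look something like the sketch below - a rolling window of deployment outcomes, with auto-deployments blocked once the failure rate crosses a configurable threshold. The window size and threshold are assumptions, not values from the interview:

```python
from collections import deque

class DeploymentGate:
    """Block auto-deployments when recent failure rate exceeds a threshold."""

    def __init__(self, window=20, max_failure_rate=0.15):
        self.outcomes = deque(maxlen=window)  # True = deployment failed
        self.max_failure_rate = max_failure_rate

    def record(self, failed):
        self.outcomes.append(failed)

    def allows_deploy(self):
        if not self.outcomes:
            return True  # no history yet: allow deployment
        rate = sum(self.outcomes) / len(self.outcomes)
        return rate <= self.max_failure_rate

gate = DeploymentGate(window=10, max_failure_rate=0.2)
for failed in [False, False, True, False, True, True]:
    gate.record(failed)
print(gate.allows_deploy())  # 3 failures in 6 deploys: 0.5 > 0.2, so blocked
```

Like an error budget, this turns stability from a talking point into an automatic constraint on release throughput.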
