
Data-Driven Decision Making – Optimizing the Product Delivery Organization


Key Takeaways

  • The three main activities in software delivery - Product Management, Development and Operations - can be supported by data-driven decision making to increase the effectiveness, efficiency and service reliability of a software delivery organization. 
  • In Product Management, Hypotheses can be used to steer the effectiveness of product decisions. In Development, Continuous Delivery Indicators can be used to steer the efficiency of the development process. In Operations, SRE’s SLIs and SLOs can be used to steer the reliability of services in production. 
  • Applying Hypotheses, Continuous Delivery Indicators and SRE’s SLIs / SLOs at the same time enables the software delivery organization to optimize for effectiveness, efficiency and service reliability in parallel! 
  • Introducing Hypotheses, Continuous Delivery Indicators and SRE’s SLIs / SLOs to an organization means significant change, which should be rolled out gradually and supported by hands-on on-the-job training activities. 
  • Data-driven prioritization trading off effectiveness, efficiency and reliability is enabled using Hypotheses’ measurable signals, Continuous Delivery Indicators’ values and SRE’s SLO breach counts. 

The Data-Driven Decision Making Series provides an overview of how the three main activities in software delivery - Product Management, Development and Operations - can be supported by data-driven decision making.

 

Introduction

Software product delivery organizations deliver complex software systems on an ever more frequent basis. The main activities involved in software delivery are Product Management, Development, and Operations (by this we mean activities, not the separate siloed departments that we do not recommend). In each of these activities, many decisions have to be made fast to advance the delivery. In Product Management, the decisions are about feature prioritization. In Development, they are about the efficiency of the development process. And in Operations, they are about reliability. 

The decisions can be made based on the experience of the team members. Additionally, they can be made based on data, which should lead to a more objective and transparent decision-making process. Especially with the increasing speed of delivery and the growing number of delivery teams, an organization’s ability to be transparent is an important means for everyone’s continuous alignment without time-consuming synchronization meetings. 

In this article, we explore how the activities of Product Management, Development and Operations can be supported by data and how the data can be used for rapid data-driven decision making. This, in turn, leads to increased transparency and decreased politicization of the product delivery organization, ultimately supporting better business results such as user engagement with the software and accrued revenue. 

We report on the application of the explored data-driven decision making framework in Product Management, Development and Operations at Siemens Healthineers in a large-scale distributed software delivery organization consisting of 16 software delivery teams located in three countries. 

Data for Decision Making 

In order to steer a product delivery organization in a data-driven way, we need to have a way of expressing the main activities of Product Management, Development and Operations using data. 

That data needs to be treated as Process Indicators of what is going on, rather than as People Key Performance Indicators (KPIs) used for people evaluation. This is important because if the data is used for people evaluation, people may be inclined to tweak it in order to be evaluated in favorable terms. 

It is important that this treatment of the data as Process Indicators instead of people evaluation KPIs be set by the leadership of the product delivery organization, in order to achieve unskewed data quality and data evaluation. 

Indicators in Product Management 

One of the central questions in Product Management is “what to build?” To approach this question, product delivery teams run small experiments to explore customer needs. This is ideally done in production; however, it can also be done in environments preceding production that are used by a selected set of collaborating customers. Each experiment needs associated measurements that are used to either confirm or disprove the initial assumptions. 

This process is the subject of Hypothesis Driven Development (HDD). It is well-described in How to Implement Hypothesis-Driven Development. In essence, an experiment is called a Hypothesis in HDD and is described using a <Capability> / <Outcome> / <Measurable Signal> notation: 

Hypothesis: 

We believe that this <Capability>

Will result in this customer <Outcome>

We will know we have succeeded when we see this <Measurable Signal> in production 

The definition of Hypothesis for a feature is done before the feature implementation begins. A product delivery team declares which <Capability> they want to put into the product to achieve a specific customer <Outcome>. The customer <Outcome> becomes evident when a defined <Measurable Signal> becomes visible in production. Thus, the focus of a product delivery team is set on the value provided to the customers, as opposed to counting features delivered to production. 
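
As a minimal illustration, a Hypothesis can be captured in code and evaluated against its Measurable Signal once production data is available. The capability, signal name and threshold below are invented for the example and are not part of the HDD notation itself:

```python
from dataclasses import dataclass

@dataclass
class Hypothesis:
    capability: str          # the <Capability> we will put into the product
    outcome: str             # the customer <Outcome> we expect
    signal_name: str         # the <Measurable Signal> observed in production
    signal_threshold: float  # value at which the Hypothesis counts as confirmed

    def evaluate(self, observed_signal: float) -> bool:
        """Confirm the Hypothesis if the production signal meets the threshold."""
        return observed_signal >= self.signal_threshold

# Hypothetical example: a report-export capability
export_hypothesis = Hypothesis(
    capability="One-click PDF export of usage reports",
    outcome="Clinic admins share reports with management more often",
    signal_name="weekly_pdf_exports_per_active_clinic",
    signal_threshold=3.0,
)

print(export_hypothesis.evaluate(observed_signal=4.2))  # True -> confirmed
```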

More on Hypotheses can be learned in Data-Driven Decision Making – Product Management with Hypotheses.

Indicators in Development

Now that we have focused the Product Management Process on the value for users (effectiveness), we want to ensure that when we are building features, we are doing it efficiently. Therefore, one of the central questions in Development is “how to build the product efficiently?” 

The efficiency of the development process can be measured by analysing the value stream of a software development team. The value stream is Code → Build → Deploy and can be seen on the team’s deployment pipeline. 

It is possible to measure the speed with which the value flows through the value stream. Likewise, it is possible to measure the stability of the value flow. The so-called Continuous Delivery Indicators of stability and speed are doing exactly that. The indicators are defined in "Measuring Continuous Delivery" by Steve Smith.

The Continuous Delivery Indicators of stability are Build Stability and Deployment Stability. The Continuous Delivery Indicators of speed are Code Throughput, Build Throughput and Deployment Throughput. 

The Build Stability Indicator consists of the Build Failure Rate and Build Failure Recovery Time. 

The Deployment Stability Indicator consists of the Deployment Failure Rate and Deployment Failure Recovery Time. 

The Code Throughput Indicator consists of the Master Branch Commit Lead Time and Master Branch Code Commit Frequency. The Build Throughput Indicator consists of the Build Lead Time and Build Frequency. And the Deployment Throughput Indicator consists of the Deployment Lead Time and Deployment Frequency. 
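
To illustrate, here is a minimal sketch of how two of the stability metrics could be derived from raw build events. The event format is an assumption made for this example, not the schema of Steve Smith's definitions or of our tooling:

```python
from datetime import datetime, timedelta

# Hypothetical build events: (start time, succeeded?)
builds = [
    (datetime(2020, 1, 1, 9, 0), True),
    (datetime(2020, 1, 1, 11, 0), False),
    (datetime(2020, 1, 1, 11, 40), True),   # recovered 40 minutes later
    (datetime(2020, 1, 2, 10, 0), True),
]

# Build Failure Rate: share of builds that failed
failure_rate = sum(1 for _, ok in builds if not ok) / len(builds)

# Build Failure Recovery Time: time from a failed build to the next green one
recovery_times = []
failed_at = None
for started, ok in builds:
    if not ok and failed_at is None:
        failed_at = started
    elif ok and failed_at is not None:
        recovery_times.append(started - failed_at)
        failed_at = None

avg_recovery = sum(recovery_times, timedelta()) / len(recovery_times)
print(f"Build Failure Rate: {failure_rate:.0%}, avg recovery: {avg_recovery}")
# -> Build Failure Rate: 25%, avg recovery: 0:40:00
```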

More on CD Indicators can be found in Data-Driven Decision Making – Product Development with Continuous Delivery Indicators.

Indicators in Operations 

Now that we have focused the Product Management Process on the value for users (effectiveness) using Hypotheses and the Development Process on the efficiency using Continuous Delivery Indicators, we want to ensure that when we are operating the product, we are doing it reliably. That is, one of the central questions in Operations is “how to operate the product reliably?” 

The product reliability in production consists of many indicators including availability, latency, throughput, correctness, etc. In Site Reliability Engineering (SRE), the indicators are called Service Level Indicators (SLIs). Each service deployed to production has a set of SLIs that make up its reliability. 

In SRE, each SLI of each service gets an objective assigned. The objective is called a Service Level Objective (SLO). For example, the availability SLI of a service can be assigned a 99.5% SLO. This means that the team developing and operating the service commits to operating it in such a way that it is available in 99.5% of cases. 

Conversely, with the 99.5% availability SLO, the so-called Error Budget is 100% - 99.5% = 0.5%. This means that in 0.5% of cases the service is allowed, and publicly declared, to be unavailable. The Error Budget is the budget available for making errors in the service. 

The development team operating the service keeps track of the Error Budget remaining within a given time frame (e.g. four weeks). Once the Error Budget is consumed, the team enacts a self-defined Error Budget Policy. The Error Budget Policy can, for example, dictate to stop the feature work on the service and only perform the reliability work until the service is back within its SLO. 
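
To make the mechanics concrete, here is a minimal sketch in Python. The four-week window and the 99.5% SLO are the examples from above; the policy action is a hypothetical placeholder, not our actual Error Budget Policy:

```python
# Error Budget arithmetic for a 99.5% availability SLO over a four-week window
WINDOW_MINUTES = 28 * 24 * 60          # 40,320 minutes
slo = 0.995
error_budget = 1.0 - slo               # 0.5%
budget_minutes = WINDOW_MINUTES * error_budget
print(f"Allowed unavailability: {budget_minutes:.0f} min (~{budget_minutes/60:.1f} h)")
# -> Allowed unavailability: 202 min (~3.4 h)

def enact_policy_if_needed(unavailable_minutes: float) -> None:
    """Hypothetical Error Budget Policy check run by the team."""
    if unavailable_minutes >= budget_minutes:
        # e.g. stop feature work and do reliability work until back within SLO
        print("Error Budget consumed: enacting Error Budget Policy")
    else:
        remaining = budget_minutes - unavailable_minutes
        print(f"Error Budget remaining: {remaining:.0f} min")

enact_policy_if_needed(unavailable_minutes=150.0)  # -> Error Budget remaining: 52 min
```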

More on SRE can be found in Data-Driven Decision Making – Product Operations with Site Reliability Engineering.

Indicators Framework 

Isolated application of some of the above Indicators supports individual optimization of the Product Management, Development and Operations processes. 

However, combining the Indicators into an overall Indicators Framework for data-driven decision making in a product delivery organization supports better global optimization of the overall value flow in the organization. This is outlined below. 

The definition of Feature Hypotheses makes the Product Management process measurable based on the Measurable Signals from Production. The Continuous Delivery Indicators make the Dev / Test / Deploy activities of a development team measurable in terms of stability and speed. And the definition of SLIs and SLOs per service running in Production makes the Ops process measurable. 

The Feature Hypotheses, Continuous Delivery Indicators and SLIs/SLOs build on each other. 

The Hypotheses are used to steer the effectiveness of product decisions. The Continuous Delivery Indicators are used to steer the efficiency of the development process. And the SLIs and SLOs from SRE steer the reliability of service operation in production. 

That is, with the application of Hypothesis-Driven Development, Continuous Delivery Indicators and SRE simultaneously, we are able to impact the effectiveness, efficiency and reliability of the product organization at the same time!  

That is why the application of the three suggested methods together is such a powerful combination. 

It enables us to simultaneously approach in a data-driven way the systematic decision making for: 

  • Product Management - what to build? 
  • Development - how to build it efficiently? 
  • Operations - how to operate it reliably? 

Organization Enablement 

With the Indicators Framework defined, it was clear to us that its introduction to the organization of 16 development teams could only be effective if sufficient support could be provided to the teams. 

We introduced Hypotheses first. Six months later we introduced SRE. And six months after that we introduced Continuous Delivery Indicators to the organization. We chose a staged approach to introducing these changes in order to have the organization focus on one change at a time. 

In terms of preparation for the introduction, Hypotheses were the easiest; it took an extension of our Business Feature Template and a workshop with each team. 

To prepare for the SRE introduction, we implemented basic infrastructure for two fundamental SLIs - Availability and Latency. The infrastructure is able to generate SLI and Error Budget Dashboards for each service of each team. Most importantly, it is able to alert on Error Budget Consumption in all deployment environments (a simplified sketch of such a check follows the list below). In addition to the infrastructure, we ran many workshops with each development team to: 

  • Familiarize team members with the SRE concepts of SLI, SLO, Error Budget and Error Budget Policy 
  • Define initial SLOs for Availability and Latency SLIs 
  • Fine tune alerting 
  • Come up with additional SLIs relevant for the services owned by the team 
  • Set up On Call rotas to react to alerts in a timely and efficient manner
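
One common way to implement alerting on Error Budget Consumption is burn-rate alerting: compare how fast the budget is being used with how much of the SLO window has elapsed. The sketch below follows that idea under our own simplifying assumptions; we are not claiming it is the exact check our infrastructure runs:

```python
def error_budget_alert(budget_used_fraction: float,
                       window_elapsed_fraction: float,
                       burn_threshold: float = 2.0) -> bool:
    """Alert when the budget burns faster than the window elapses.

    E.g. 50% of the budget used after 20% of the window is a burn
    rate of 2.5x and should page the team.
    """
    if window_elapsed_fraction == 0:
        return False
    burn_rate = budget_used_fraction / window_elapsed_fraction
    return burn_rate >= burn_threshold

print(error_budget_alert(budget_used_fraction=0.5, window_elapsed_fraction=0.2))  # True
```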

To prepare for the introduction of Continuous Delivery Indicators, we implemented a tool that could process the data from deployment pipelines in our Continuous Integration and deployment environments. After processing, the tool visually displays the team’s deployment pipeline with stability and speed bottlenecks along the pipeline environments. With that, the teams can immediately focus on the bottlenecks and discuss how they could be relieved. 
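
The core idea behind the bottleneck display can be sketched in a few lines; the environments and numbers below are invented for illustration and do not reflect the tool's actual data model:

```python
# Hypothetical speed indicator values per pipeline environment (lead time in minutes)
lead_times = {"Build": 12, "Test": 95, "Staging": 30, "Production": 22}

# The bottleneck candidate is the environment with the longest lead time
bottleneck = max(lead_times, key=lead_times.get)
print(f"Biggest speed bottleneck: {bottleneck} ({lead_times[bottleneck]} min)")
# -> Biggest speed bottleneck: Test (95 min)
```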

Adoption by the Organization

We introduced the suggested Indicators Framework to an organization of 16 development teams working on “teamplay” - a global digital service from the healthcare domain (more about "teamplay" can be learned at Adopting Continuous Delivery at teamplay, Siemens Healthineers). 

When referring to "teams" below we mean development teams. They define their Hypotheses, ways of software development and SLOs. They also use the resulting data from Hypotheses’ measurable signals, Continuous Delivery Indicators and SLO breach counts. Additionally, we have a small central Operations team, which provides SRE infrastructure to the development teams. The SLO breach counts are first order data for the Operations team as well.

The teams became quite interested in Hypotheses and SRE right after the introduction. The topic of Continuous Delivery Indicators required more explanation as it introduced a new way of looking at software development through the lens of a value stream analysis, which is not something done routinely in the software domain. 

The Hypotheses definition process turned out to be very helpful in hammering out the scope of a feature very early in the specification process. It served as a good basis for subsequent User Story Mapping and BDD Scenario definition. The definition of measurable signals dictated what developers needed to learn about hooking into production monitoring and retrieving insights from it. To contribute to data-driven decision making, the teams started implementing Measurable Signals and, based on their values, making decisions regarding future feature implementation steps. 

The adoption of SRE took a substantial amount of time as it required “bringing the Ops world” into Development. Many workshops were conducted with each team until alerting on Error Budget Consumption by SLO breaches could be switched on. The introduction of On Call duty is the next step here. To contribute to data-driven decision making, the teams will need to evaluate which services consume the Error Budget the most / fastest, and prioritize reliability improvements over new feature work. 

SRE data also started being used in the context of data-driven budget allocation decisions. Some of our services did not have enough developers looking after them to provide a good quality of service. It was difficult to argue for additional headcount because other initiatives had hard data at hand showing the benefits of investing in them. Once the SRE data for the services became available, it provided evidence of insufficient availability and poor latency. It became possible to argue for additional headcount using data that showed how customers were affected by the current quality of service.

The adoption of Continuous Delivery Indicators started with a few teams. The first insights were that teams work very differently and are not aware of the stability and speed bottlenecks on their pipelines until these are shown in the tool. One team looked at their bottlenecks and found the biggest one in the Deployment Failure Rate; the Deployment Failure Recovery Time, though, was low. To contribute to data-driven decision making, the team prioritized the analysis of the Deployment Failure Rate bottleneck, and within a day was able to significantly reduce it. Simple deployment checks, which ensured individual environments had the necessary resources deployed for applications to run, drove down the Deployment Failure Rate significantly. Fast recovery from deployment failures was no longer needed on a frequent basis. 
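
Such a deployment check can be as simple as verifying up front that a target environment has all the resources an application needs. The resource names below are hypothetical, not those of the team's actual checks:

```python
REQUIRED_RESOURCES = {"database-connection-string", "blob-storage-account", "service-bus-queue"}

def preflight_check(environment_resources: set[str]) -> list[str]:
    """Return the resources missing in the target environment before deploying."""
    return sorted(REQUIRED_RESOURCES - environment_resources)

missing = preflight_check({"database-connection-string", "blob-storage-account"})
if missing:
    print(f"Aborting deployment, missing resources: {missing}")
# -> Aborting deployment, missing resources: ['service-bus-queue']
```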

Prioritization

Our teams need more experience with Hypotheses, Continuous Delivery Indicators and SRE’s SLIs/SLOs in order to consistently use the data at hand as an input for prioritization. The data comes in different forms: 

  • From Hypotheses
    • Positively / negatively tested Hypotheses
    • Unexpected insights from Measurable Signals
  • From Continuous Delivery Indicators
    • Most / least stable pipelines in terms of
      • Failure Rates
      • Recovery Times
    • Fastest / slowest pipelines in terms of 
      • Lead times between pipeline environments 
      • Intervals between respective activities in the environments 
  • From SRE
    • Fastest / slowest error budget consumers in production (by service endpoint)
    • Fastest / slowest service endpoints in terms of Latency SLI
    • Most / least available service endpoints in terms of Availability SLI
    • Likewise for each defined SLI 

Now that the data is available, it needs to be taken into account by the development teams, and especially product owners, to make the best prioritization decisions. The prioritization trade-offs are: 

  • Invest in features to increase product effectiveness and / or
  • Invest in development efficiency and / or 
  • Invest in service reliability 

That said, the Error Budget Policy needs to become a binding document that overrides other prioritization activities. 

To facilitate the data-driven prioritization process, it would be great to create dashboards that display all the data points from a team’s Hypotheses, CD Indicators and SRE in a way that simplifies the prioritization process. 
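
One conceivable shape for the input of such a dashboard, with all names and values invented purely for illustration, is a single per-team snapshot that juxtaposes the three data sources:

```python
# Hypothetical per-team snapshot combining the three data sources
team_snapshot = {
    "team": "Team A",
    "hypotheses": {"confirmed": 3, "disproved": 1, "in_flight": 2},
    "cd_indicators": {"build_failure_rate": 0.08, "deployment_lead_time_min": 45},
    "sre": {"slo_breaches_last_window": 2, "fastest_budget_consumer": "report-service"},
}
print(team_snapshot["sre"]["slo_breaches_last_window"])  # feeds the prioritization discussion
```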

Future Topics 

We can think of several future topics based on the Hypotheses, Continuous Delivery Indicators and SRE’s SLIs/SLOs. 

As mentioned above, to facilitate the data-driven prioritization process we need to explore how to visualize all the data points from a team’s Hypotheses, CD Indicators and SRE in a combined dashboard. This is going to be our next concrete step. 

Additionally, it might be possible to use machine learning to predict the stability and speed in a pipeline environment based on the data of stability and speed in preceding environments on the deployment pipeline. 

We could also think of creating an overall team score combining different inputs from the Continuous Delivery Indicators and SRE. However, the usefulness of this remains to be discussed. 

Additionally, we can look at correlations between SLO breaches and Hypotheses fulfilments. Our current conjecture is that Hypotheses evaluated through user workflows executed by services that regularly fall out of their SLOs are not going to test positively. 

Summary

In summary, if a team optimizes their: 

  • Product Management Process using Hypotheses, 
  • Development Process using Continuous Delivery Indicators and 
  • Operations Process using SRE 

then the team is able to gradually optimize their ways of working in a data-driven way so that over time the team can achieve a state in which they: 

  • build features that are evidently being used by the users 
  • build features efficiently, avoiding big bottlenecks in their value stream (= on their deployment pipeline)
  • operate the features in an evidently reliable way 

Overall, the Indicators Framework suggested in this article aims to offer a holistic data-driven approach to the continuous improvement of common software delivery processes. It helps depoliticize and bring transparency to the decision-making process of the software delivery organization. Finally, it supports the organization in driving better business results such as user engagement with the software and revenue. 

Acknowledgments

Many people contributed to the thinking behind this article. The following individuals worked directly on the implementation of the infrastructure and tools presented. 

  • Kiran Kumar Gollapelly, Krishna Chaithanya Pomar and Bhadri Narayanan ARR were instrumental to the creation of the Continuous Delivery Indicators Tool. 
  • Philipp Guendisch conceptualized and implemented the SRE infrastructure, SLI/SLO dashboards and alerting. 

Thanks go to the entire team at “teamplay” for introducing and adopting the methods in this article. 

About the Author

Vladyslav Ukis graduated in Computer Science from the University of Erlangen-Nuremberg, Germany and, later, from the University of Manchester, UK. He joined Siemens Healthineers after each graduation and has been working on Software Architecture, Enterprise Architecture, Innovation Management, Private and Public Cloud Computing, Team Management and Engineering Management. In recent years, he has been driving the Continuous Delivery and DevOps Transformation in the Siemens Healthineers Digital Ecosystem Platform and Applications - "teamplay".

