Data-Driven Decision Making – Product Operations with Site Reliability Engineering

Mar 25, 2020 12 min read

Key Takeaways

Data-driven decision making regarding whether to invest in reliability of services vs. in new features can be done using SRE methods.
The reliability of a service can be measured using a set of Service Level Indicators (SLIs) to be defined by the team developing and operating the service.
To enable the service reliability measurements along the SLIs, Service Level Objectives (SLOs) need to be defined for each SLI.
Defined SLOs for each SLI determine the budget available for errors - the Error Budget per SLI - to be tracked by the team developing and operating the service.
Introducing SRE methods in a development organization greatly fosters collaboration between Product, Development and Operations by agreeing on SLIs and SLOs.

The Data-Driven Decision Making Series provides an overview of how the three main activities in the software delivery - Product Management, Development and Operations - can be supported by data-driven decision making.

Introduction

Software product delivery organizations deliver complex software systems on an ever more frequent basis. The main activities involved in software delivery are Product Management, Development and Operations (by this we mean activities, not separate siloed departments, which we do not recommend). In each of the activities, many decisions have to be made fast to advance the delivery. In Product Management, the decisions are about feature prioritization. In Development, they are about the efficiency of the development process. And in Operations, they are about reliability.

The decisions can be made based on the experience of the team members. Additionally, decisions can be made based on data. This should lead to a more objective and transparent decision-making process. Especially with the increasing speed of the delivery and the growing number of delivery teams, an organization’s ability to be transparent is an important means for everyone’s continuous alignment without time-consuming synchronization meetings.

In this article, we explore how the activities in Operations can be supported by data from SRE and how the data can be used for rapid data-driven decision making. This, in turn, leads to increased transparency and decreased politicization of the product delivery organization, ultimately supporting better business results such as user engagement with the software and accrued revenue.

We report on the application of SRE in Operations at Siemens Healthineers in a large-scale distributed software delivery organization consisting of 16 software delivery teams located in three countries.

Process Indicators, Not People KPIs

In order to steer Operations in a data-driven way, we need to have a way of expressing the main activities in Operations using data. That data needs to be treated as Process Indicators of what is going on, rather than as People Key Performance Indicators (KPIs) used for people evaluation. This is important because if used for people evaluation, the people may be inclined to tweak the data to be evaluated in favorable terms.

It is important that the leadership of the product delivery organization establishes this treatment of the data as Process Indicators rather than people evaluation KPIs, so that data quality and data evaluation remain unskewed.

Site Reliability Engineering (SRE)

One of the central questions in Operations is "how to operate the product reliably?"

The product reliability in production consists of many indicators including availability, latency, throughput, correctness, etc. In Site Reliability Engineering (SRE), the indicators are called Service Level Indicators (SLIs). Each service deployed to production has a set of SLIs that make up its reliability.

In SRE, each SLI of each service gets an objective assigned. The objective is called a Service Level Objective (SLO). For example, the availability SLI of a service can get assigned a 99.5% SLO. This means that the team developing and operating the service commits to operating the service in such a way that it is available in 99.5% of cases.

Conversely, with the 99.5% availability SLO, the so-called Error Budget is 100% - 99.5% = 0.5%. This means that in 0.5% of the cases, the service is expected, and publicly declared, to be unavailable. The Error Budget is the budget available for making errors in the service. It is consumed, for example, by deployments that require downtime (expected downtime), by bugs (unexpected downtime), or by failing dependencies that prevent you from providing the service (unexpected downtime).
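The arithmetic above can be written down in a few lines. This is a minimal sketch, assuming availability is counted per request; the function names and the request volume in the usage note are hypothetical and only serve to illustrate the calculation.

```python
def error_budget(slo_percent: float) -> float:
    """Return the Error Budget in percent for an SLO given in percent."""
    if not 0.0 <= slo_percent <= 100.0:
        raise ValueError("SLO must be between 0 and 100 percent")
    return 100.0 - slo_percent


def allowed_failures(slo_percent: float, total_requests: int) -> int:
    """Number of requests that may fail before the Error Budget is used up."""
    return int(total_requests * error_budget(slo_percent) / 100.0)
```

With an assumed volume of 200,000 requests, `allowed_failures(99.5, 200_000)` gives the number of failed requests the 99.5% SLO tolerates.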

The development team operating the service keeps track of the Error Budget remaining within a given time frame (e.g. four weeks). Once the Error Budget is consumed, the team enacts a self-defined Error Budget Policy. The Error Budget Policy can e.g. dictate to stop the feature work on the service and only perform the reliability work until the service is back within its SLO.
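Tracking the remaining Error Budget and deciding when the Error Budget Policy applies could look roughly like the following. It is a sketch under assumptions: events are counted per request within one time frame, the policy applies as soon as the remaining budget reaches zero, and all names are hypothetical rather than taken from the tooling described in this article.

```python
from dataclasses import dataclass


@dataclass
class BudgetStatus:
    allowed_bad: float    # failures the SLO tolerates in the time frame
    remaining: float      # failures still available (negative if overspent)
    policy_active: bool   # True once the Error Budget is used up


def budget_status(slo_percent: float, total_events: int, bad_events: int) -> BudgetStatus:
    """Compare observed failures with the budget implied by the SLO."""
    allowed = total_events * (100.0 - slo_percent) / 100.0
    remaining = allowed - bad_events
    # Assumption: the Error Budget Policy is enacted at zero remaining budget.
    return BudgetStatus(allowed, remaining, remaining <= 0)
```

A team could evaluate `budget_status` once per day over the current four-week frame and switch to reliability work when `policy_active` becomes true.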

The definition of SLIs and SLOs for a service is done by the product owner, developers and the operations engineers of the service. This way, we get all the relevant roles to agree, in a measurable way, on how important the reliability of the service is.

A tight SLO means that the development team will spend more time on making the service reliable, and therefore, less time on implementing customer-facing features. A relaxed SLO means that the development team will spend less time on making the service reliable, and therefore, more time on implementing customer-facing features.

Defined SLIs and SLOs, therefore, enable data-driven decision making regarding when to invest in reliability and when in customer-facing features.

A development team that takes an Error Budget Policy-based approach to reliability makes data-driven decisions about when to invest in reliability and when in customer-facing features. Additionally, a team like that knows at all times whether their services in Production are within the defined SLOs, which is a proxy measure for the service user happiness.

On the contrary, a development team that does not follow SRE (no SLIs, no SLOs, no Error Budget Policies) does not know how reliably their services run in Production. Such a team usually reacts to service outages reported by the users and invests in reliability based on the severity of the outages.

Enablement

The introduction of SRE involved many working sessions with each team. Initial sessions laid out the foundations of the SRE discipline, such as SLIs, SLOs, Error Budgets, Error Budget Policies, Alerts and Dashboards. This ensured all team members got on the same page.

Subsequent sessions involved clear specifications of users, whose happiness the team is optimizing, typical workflows of those users, resulting service endpoint call chains, SLO definitions for the service endpoints and enablement of alerting.

To define SLIs and SLOs, we instrumented our services using Microsoft Azure AppInsights. We started with two SLIs: Availability and Latency. For each SLI, we defined a general SLO threshold (98% for Availability and 500 ms for Latency). This threshold determines whether an SLO is set automatically to the threshold itself or manually in a workshop with a dev team. Service endpoints that perform within the general threshold are considered to be running fine and are automatically assigned an SLO equal to that threshold. Service endpoints that fall outside the general threshold require SLOs that are set manually and carefully, after a series of discussions between the PO, Ops and developers.

With this procedure, we set the SLOs automatically for service endpoints that are sufficiently fast and available. For the other service endpoints, we engage with the teams and stakeholders to set the SLOs manually. This way, we spend development teams’ time on the most problematic areas.
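The triage between automatic and manual SLOs can be sketched as follows. The two thresholds come from the text; everything else is an assumption made for illustration: the function names, the choice of which latency measurement is compared (for example a percentile), and the decision to treat a value exactly at the threshold as acceptable.

```python
from typing import Optional

AVAILABILITY_THRESHOLD = 98.0   # percent, general threshold from the article
LATENCY_THRESHOLD_MS = 500.0    # milliseconds, general threshold from the article


def propose_latency_slo(measured_latency_ms: float) -> Optional[float]:
    """Return the automatic latency SLO, or None if a workshop is needed."""
    if measured_latency_ms <= LATENCY_THRESHOLD_MS:
        return LATENCY_THRESHOLD_MS  # fast enough: SLO equals the threshold
    return None                      # too slow: set the SLO manually


def propose_availability_slo(measured_availability: float) -> Optional[float]:
    """Return the automatic availability SLO, or None if a workshop is needed."""
    if measured_availability >= AVAILABILITY_THRESHOLD:
        return AVAILABILITY_THRESHOLD  # available enough: SLO equals the threshold
    return None                        # not available enough: set the SLO manually
```

Endpoints for which either function returns `None` are the ones that go into the manual sessions with the PO, Ops and developers.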

Our Operations Team provided the following Dashboards for all service endpoints:

The dashboards are generated automatically from the SLO definitions and service logs, which are available in Microsoft Azure AppInsights.

In the upper left corner we can see the Latency SLI for a service. The target SLO for Latency is shown as a horizontal line. Every time the service breaks the Latency SLO, the Latency Error Budget is consumed. The consumption of the Latency Error Budget over time is shown on the graph in the upper right corner. The Latency Error Budget is used up when the graph in the upper right corner touches zero.

In the lower left corner we can see the Availability SLI for a service. The target SLO for Availability is shown as a horizontal line. Every time the service breaks the Availability SLO, the Availability Error Budget is consumed. The consumption of the Availability Error Budget over time is shown on the graph in the lower right corner. The Availability Error Budget is used up when the graph in the lower right corner touches zero.
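The remaining-budget curves on the right-hand side of the dashboards can be pictured as a running total. The following is a minimal sketch, assuming the budget is expressed as a number of allowed SLO breaches per time frame and that breaches are counted per interval; the function name and the sample numbers are hypothetical and do not describe the actual AppInsights queries.

```python
from itertools import accumulate
from typing import List


def remaining_budget_series(breaches_per_interval: List[int], budget: int) -> List[int]:
    """Return the remaining Error Budget after each interval, floored at zero."""
    consumed = accumulate(breaches_per_interval)  # running total of breaches
    return [max(budget - c, 0) for c in consumed]
```

For a budget of 5 breaches and interval counts of `[0, 2, 1, 0, 3]`, the series falls step by step and reaches zero in the last interval, which is the point where the graph touches zero.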

In terms of alerting, we alert on the Error Budget Consumption Rate. This ensures that alerts arrive in time on the one hand, and that the teams are not overwhelmed with too many alerts on the other hand.

Short-term alerting: scan SLO breaches every 10 minutes looking at the last hour. Alert immediately if an SLO was broken twice in the last hour. Once an alert has been raised, pause that kind of alert for an hour.
Long-term alerting: once in 24 hours, look at the SLI / SLO data of the last 24 hours. If the Error Budget for the last 24 hours was proportionally used up, an alert is raised once.
Error Budget Consumed alerting: whenever the Error Budget is used up, an alert is raised once.
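The short-term rule above can be sketched in code. This is an illustration under assumptions, not the alerting implementation used at teamplay: "broken twice" is read as "at least twice", the caller is expected to invoke `scan` every 10 minutes, and the class and constant names are hypothetical.

```python
from datetime import datetime, timedelta
from typing import List, Optional

WINDOW = timedelta(hours=1)   # look-back period for SLO breaches
PAUSE = timedelta(hours=1)    # quiet period after an alert was raised
BREACH_LIMIT = 2              # assumption: alert on two or more breaches


class ShortTermAlerter:
    def __init__(self) -> None:
        self.last_alert: Optional[datetime] = None

    def scan(self, now: datetime, breach_times: List[datetime]) -> bool:
        """Return True if an alert should be raised at this scan."""
        # Stay silent while the pause after the previous alert is running.
        if self.last_alert is not None and now - self.last_alert < PAUSE:
            return False
        # Count only the breaches that fall into the last hour.
        recent = [t for t in breach_times if now - WINDOW <= t <= now]
        if len(recent) >= BREACH_LIMIT:
            self.last_alert = now
            return True
        return False
```

Keeping the pause state inside the alerter means one instance is needed per service endpoint and SLI, so that pausing one kind of alert does not silence the others.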

Once the Error Budget is consumed by SLO breaches within a four-week period, we ask the teams to follow their self-defined Error Budget Policies. For example, a team’s Error Budget Policy can require that backlog items from Incident Reviews are prioritized over all other work until they are finished.

Our future work will be concerned with the introduction of effective Incident Reviews and On Call Rotas with the support of a tool like PagerDuty or OpsGenie, as well as with advancing our culture using SRE. SRE really helps bring Product, Development and Operations together in the spirit of the DevOps philosophy.

Adoption

We introduced the suggested Indicators Framework to an organization of 16 development teams working on "teamplay" - a global digital service from the healthcare domain (more about "teamplay" can be learned at Adopting Continuous Delivery at teamplay, Siemens Healthineers).

All the teams welcomed SRE activities as they held the promise to gain more insight into production and become aware of failures before they manifested themselves with the customers.

The teams enabled logging in Microsoft Azure AppInsights for all their services in all environments. That put the teams in a position to automatically get data on latency and availability of the services.

As a next step, the teams were challenged by the request to define very specific detailed user profiles in order to be able to set the SLOs for those specific user segments. If the services are within the defined SLOs, those specific user segments are happy. If the services are outside of the defined SLOs, those specific user segments are unhappy.

Initially, some teams came up with generic user profiles like "Radiologist" or "Physicist". We challenged the teams to get more specific so that the user intents could be understood. Based on the user intents, typical user workflows could be inferred. Finally, based on the user workflows, we could arrive at the typical call chains of service endpoints.

At the same time, some teams were able to quickly point to some service endpoints very central to nearly all user workflows based on technical considerations. For these service endpoints, tighter latency and availability SLOs were defined.

The teams were able to understand the SLI/SLO Dashboards as well as alerting logic quickly.

Some teams started organizing themselves in rotas to react to alerts. This is good preparation for the introduction of On Call duty as part of the SRE activities in future.

A very common request we got from the teams was to add additional SLIs that go beyond Availability and Latency. Here, we need to find SLIs that would be applicable to all teams in order to extend our SRE infrastructure in a generic way. This would also maximize the outcome of the effort by the Operations team that goes into the creation and maintenance of the infrastructure. The following additional SLIs are commonly requested:

Dead Letter Queue Length
Certificate Expiry

Beyond that, we understood that while it would be very beneficial to offer Custom SLIs as desired by the teams, this would stretch the capacity of our Operations Team to provide the necessary SRE infrastructure extensions. Nevertheless, we can enable the teams to take first steps towards Custom SLIs, initially without providing the SLI/SLO/Error Budget semantics. What we can do initially is provide the teams with information from Production on the state of Custom SLIs based on log queries. The information can simply be pushed to dedicated Slack channels twice a day. This is much better than operating in the dark. Later, we can extend the SRE Infrastructure to support Custom SLIs one by one.

Finally, our Ops Team currently does all SRE infrastructure related adaptations. This is fine to start with. However, once development teams get familiar with the infrastructure, we will put them in a position to make changes without having to wait for the Ops Team.

Prioritization

Our teams need more experience with SRE’s SLIs/SLOs in order to consistently use the data at hand as an input for prioritization. The data comes in different forms:

Fastest / slowest error budget consumers in production (by service endpoint)
Fastest / slowest service endpoints in terms of Latency SLI
Most / least available service endpoints in terms of Availability SLI
Likewise for each defined SLI
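The first of these views, the ranking of error budget consumers, amounts to a sort. The following is a sketch under assumptions: the Error Budget is counted in failed requests per time frame, and the class, field and endpoint names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class EndpointBudget:
    endpoint: str      # hypothetical service endpoint name
    allowed_bad: int   # failures the SLO tolerates in the time frame (must be > 0)
    bad: int           # failures observed so far in the time frame

    @property
    def consumed(self) -> float:
        """Fraction of the Error Budget used so far (1.0 means used up)."""
        return self.bad / self.allowed_bad


def rank_by_consumption(budgets: List[EndpointBudget]) -> List[EndpointBudget]:
    """Sort endpoints from fastest to slowest Error Budget consumer."""
    return sorted(budgets, key=lambda b: b.consumed, reverse=True)
```

Because all endpoints are compared over the same time frame, the consumed fraction orders them the same way as the consumption rate would.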

Now that the data is available, it needs to be taken into account by the development teams, and especially product owners, to make the best prioritization decisions. The prioritization trade-offs are:

Invest in features to increase product effectiveness and / or
Invest in development efficiency and / or
Invest in service reliability

That said, the Error Budget Policy needs to become a binding document that overrides other prioritization considerations.

Future Topics

We can think of several future topics based on SRE’s SLIs/SLOs.

We could think of creating an overall team score combining different inputs from the Continuous Delivery Indicators and SRE. The usefulness of this would remain to be seen.

Additionally, we can look at correlations between SLO breaches and feature hypotheses fulfilments. Our current conjecture is that hypotheses that are evaluated using user workflows executed by services that are regularly falling out of their SLOs are not going to be tested positively.

Summary

In summary, a team that optimizes its Operations process using SRE can gradually improve its ways of working in a data-driven way, so that over time it reaches a state where it demonstrably operates its features reliably.

Data-driven SRE helps depoliticize the decision-making process of the software delivery organization and makes it transparent. Finally, it helps the organization drive better business results, such as user engagement with the software and revenue.

This article is part of the Data-Driven Decision Making for Software Product Delivery Organizations Series. The Series provides an overview of how the three main activities in the software delivery - Product Management, Development and Operations - can be supported by data-driven decision making. Future articles will shed light on data-driven decision making in Development, Operations and combinations of data-driven decision making in Product Management, Development and Operations.

Acknowledgements

Many people contributed to the thinking behind this article. Philipp Guendisch conceptualized and implemented the SRE infrastructure, SLI/SLO dashboards and alerting presented. Thanks go to the entire team at "teamplay" for introducing and adopting the methods from this article.

About the Author

Vladyslav Ukis graduated in Computer Science from the University of Erlangen-Nuremberg, Germany and, later, from the University of Manchester, UK. He joined Siemens Healthineers after each graduation and has been working on Software Architecture, Enterprise Architecture, Innovation Management, Private and Public Cloud Computing, Team Management and Engineering Management. In recent years, he has been driving the Continuous Delivery and DevOps Transformation in the Siemens Healthineers Digital Ecosystem Platform and Applications - "teamplay".
