Establishing a Scalable SRE Infrastructure Using Standardization and Short Feedback Loops

Key Takeaways

  • Standardizing service operations is a business requirement: without standardization, development teams either spend lots of effort implementing the basics of monitoring or just live with insufficient monitoring, resulting in UX issues.
  • The standardization is achieved by using a shared infrastructure. The infrastructure is best built around the SRE methodology, which already provides a conceptual framework for standardizing production monitoring. 
  • The architecture of the SRE infrastructure needs to be scalable to handle any number of onboarded teams. The effort to maintain a team on the infrastructure should scale sublinearly with the number of teams onboarded. This ensures the right economy of scale for the initial and ongoing infrastructure investment. 
  • Visualization dashboards provided by the SRE infrastructure need to be optimized for fast data-driven decision-making. They should immediately show where the biggest reliability problems are (e.g. least availability, highest latency in a given time frame). This should help the teams allocate reliability investments where it matters most. 
  • Success criteria for the SRE infrastructure can be measured using the percentage of successfully defended SLOs and SLAs (those staying within the allocated error budgets); the number of services, teams and environments onboarded; and the number of views of the visualization dashboards per time unit.

In software organizations, there is an increasing need to operate services reliably at scale. The need can be met in different ways. One way proposed by Google is the so-called Site Reliability Engineering (SRE). It is a discipline rooted in applying software engineering techniques to operations problems. SRE enables a software delivery organization to scale the number of services in operation without linearly scaling the number of people required to operate the services. Furthermore, SRE enables teams to make data-driven decisions about when to invest in reliability of services vs. implement new features. 

In recent years, SRE has become quite popular in the software industry. A growing number of software delivery organizations have adopted the discipline. At the Siemens Healthineers teamplay digital health platform, SRE adoption started in 2019. It helped the organization operate the platform reliably at scale.

In our SRE implementation: 

  • The operations team builds and runs the SRE infrastructure 
  • The development teams build and run the services leveraging the SRE infrastructure 

It is a classic responsibility split for “you build it, you run it”. Our primary reason for implementing “you build it, you run it” was that this model provides maximum incentives for the development teams to implement reliability during feature development. Developers on-call do not like to be paged. With “you build it, you run it”, the developers are in full control of avoiding being paged by implementing sufficient reliability before the services hit production. 

The establishment of the SRE infrastructure enabling the developers to run their services was done in an agile manner. We employed very short feedback loops and standardization in order to make the infrastructure useful and scalable at the same time. The short feedback loops support scaling because the infrastructure development is steered using the feedback from the developers trying to achieve their monitoring goals using the infrastructure. This way, over time, the infrastructure has enough useful features for many teams to want to jump on it. 

In this article, we describe the architecture and implementation of our SRE infrastructure, how it is used and how it was adopted. 

Implementing the SRE Infrastructure 

The SRE infrastructure is a set of tools, algorithms and visualization dashboards that together enable a team to operate their services reliably at scale in an efficient manner. The implementation, maintenance and operation of the SRE infrastructure requires a dedicated, possibly small, team of developers. 

Agile Delivery

When building up the SRE infrastructure, we established a very tight working mode between the operations team doing the implementation and the development teams being onboarded on the infrastructure. We ensured that the operations team was always only one step ahead, not more, of the development teams in terms of feature implementation. Every implemented feature went into immediate use by the development teams and was iterated upon relentlessly based on feedback. 

Architecture 

The architecture for the SRE infrastructure needs to fulfill multiple almost equally important goals. These are described in the following sections along with external and internal views on the architecture. 

Goals

As the most important architectural goal for the SRE infrastructure, we selected scalability, so that a limited number of infrastructure developers can serve a large number of development teams and their heterogeneous needs. A non-negotiable, standardized set of monitoring tools is essential to reduce the initial effort for the SRE infrastructure engineers. The telemetry logs need to be standardized so that the same database query can be evaluated against different log tables. Otherwise, each service could log its own format of, e.g., availability logs, which would require a specialized query to check for the service's availability. In that scenario, the SRE infrastructure engineers would need to build and maintain a huge number of frequently outdated queries, which is not feasible.

This connects to the second goal, which is consistency throughout time and across teams. The reasoning for any triggered alert should be immediately presentable for any interested alert responder. Ensuring a common schema throughout all the SRE relevant logs enables aggregations as well as monitoring of the huge amount of telemetry with a small amount of logic templates (e.g. queries on the log data to check for SLO breaches). These templates can be filled with team and service dependent information to customize them for each team’s needs. With this setup, consistency and scalability are not conflicting goals, and in fact benefit from each other. 
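As an illustration of such a logic template, a single query skeleton can be parameterized with team- and service-specific values. The sketch below assumes a hypothetical table and column naming, not the actual teamplay schema; it only works because every service logs to the same standardized schema:

```python
# Minimal sketch of a shared query template; table and column names
# (RequestLogs-style schema, ServiceName, ResultCode) are hypothetical.
AVAILABILITY_QUERY_TEMPLATE = """
{table}
| where ServiceName == '{service}' and TimeGenerated > ago({window})
| summarize failed = countif(ResultCode >= 500), total = count()
| extend availability = 100.0 * (total - failed) / todouble(total)
"""

def render_availability_query(table: str, service: str, window: str = "1h") -> str:
    """Fill the shared template with team/service-specific values."""
    return AVAILABILITY_QUERY_TEMPLATE.format(
        table=table, service=service, window=window
    )
```

The same template then serves every onboarded service, so the number of queries to maintain stays constant as teams are added.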

The third goal is the usability of the infrastructure. Any developer who is familiar with the common SRE terms should be able to adjust, create and remove SLOs for their services on their own. To achieve this, a lot of thought should be put into the intuitiveness of any provided tool. In an optimal scenario you want to integrate your custom tool for SLO definitions into any existing deployment pipeline for the service so that the SLOs are active right from the start and the thought process of defining those has taken place before any new product is released. The configuration file itself can be very simplistic, and standardized targets should be automatically attached to a service so nothing gets missed by accident. Of course, the dev team should be able to explicitly remove or override default SLOs by extending the configuration.

Customer View

This section describes the SRE infrastructure architecture from the point of view of a developer who consumes the tools to keep track of their services. As seen in the picture, the infrastructure itself is not accessible to the developers and therefore appears as a black box. You will find details on this part in the next section. The input comprises all the information that needs to be provided for each technical service. It usually contains several service level objectives. 

At teamplay a distinction is made between objectives for endpoints exposed to the outside world (SLAs) and objectives for internally and externally used services (SLOs). SLAs are SLOs committed to external partners and are always paired with tighter internal SLOs. There is also a margin between the expected service behavior and what is contractually agreed upon, so that alerts arrive prior to contract breaches. An SLO definition for the latency SLI contains an error budget, a threshold, and a list of service endpoints it applies to. It is recommended to define a default (fallback) SLO which gets automatically assigned to all endpoints of a service if no explicit SLO is defined. An SLO definition for availability contains the same information except the threshold, and instead defines valid and invalid HTTP result codes. Also, the identifier of a database needs to be given so that the checks for SLO breaches can be installed in the correct location. The alerting information defines a webhook or a set of mail addresses to be notified when SLO breaches occur. To close the loop, the alert notifications are sent to the defined developers and webhook addresses. We also offer an interactive Power BI report which shows all SLOs, SLAs and their error budget depletions over the last three months.
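A configuration carrying this information could look roughly as follows. All field names and values here are illustrative assumptions, not the actual teamplay configuration format:

```python
# Hypothetical SLO configuration for one service; every field name and
# value below is illustrative only.
slo_config = {
    "service": "categories-service",
    "database": "prod-eu-telemetry",  # where the breach checks are installed
    "alerting": {
        "webhook": "https://example.test/pagerduty-hook",
        "emails": ["team-oncall@example.test"],
    },
    "slos": [
        {
            "sli": "availability",
            "target_pct": 99.0,  # implies the error budget: 1% of requests
            "invalid_result_codes": [500, 502, 503, 504],
            "endpoints": ["post /api/categoriesdistribution/getcategoriesdistributiondata"],
        },
        {
            "sli": "latency",
            "target_pct": 99.0,
            "threshold_ms": 800,
            "endpoints": ["*"],  # default (fallback) SLO for all endpoints
        },
    ],
}
```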

Internal View

This section will cover the required steps to transform the SRE configuration into alerts and dashboards. 

The first step will install or update alerts on the configured SLO targets whenever a service gets deployed. In our scenario this includes the setup of Azure Monitor alert rules which are frequently executed log queries with an alert condition. They compare the defined SLO targets against the live data and alert on the current error budget consumption. Furthermore, a webhook entry will be configured in PagerDuty to consume the alerts on SLO breaches and notify relevant stakeholders and developers to mitigate the ongoing issue.
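The core of such an alert condition, comparing live failure counts against the SLO target, can be sketched as follows. This is a simplification for illustration, not the actual Azure Monitor query logic:

```python
def error_budget_consumed(failed: int, total: int, slo_target_pct: float) -> float:
    """Fraction of the error budget consumed by `failed` out of `total`
    requests under an availability SLO of `slo_target_pct` percent.
    A value of 1.0 means the budget is exactly exhausted."""
    if total == 0:
        return 0.0
    allowed_failures = total * (100.0 - slo_target_pct) / 100.0
    if allowed_failures == 0.0:
        return float("inf") if failed else 0.0
    return failed / allowed_failures
```

An alert rule would then fire when this value crosses a configured threshold for the evaluation window.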

After that, the SRE configuration is uploaded to an SLO database. It is recommended to have all defined and deployed SLOs and services at the same place so they can easily be viewed. At teamplay, Azure Data Explorer is used to store this kind of information since it integrates well with Power BI and the query language is very powerful.

Apart from the deployment, a set of daily tasks needs to be executed to keep the infrastructure up to date. The alerts were already set up during the deployment and fire on real-time issues. The error budget, by contrast, is consumed over a longer time frame. Due to the potentially massive amount of logs, it is recommended to aggregate the telemetry logs and precalculate the error budget depletion for each SLO on a daily basis. This SLO statistics summary is migrated by scripts to the same Kusto database as the SLO definitions and may include further attributes such as the number of total requests for that day and endpoint. Once that data is available in Azure Data Explorer, an automated Power BI refresh pulls the data in. 
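The daily precalculation step can be sketched like this; the input shape and output field names are assumptions for illustration:

```python
from collections import defaultdict

def daily_slo_summary(requests, slo_target_pct):
    """Aggregate raw request logs into one summary row per (day, endpoint).

    `requests` is an iterable of (day, endpoint, succeeded) tuples; the
    field names in the output rows are illustrative."""
    counts = defaultdict(lambda: [0, 0])  # (day, endpoint) -> [failed, total]
    for day, endpoint, succeeded in requests:
        row = counts[(day, endpoint)]
        row[1] += 1
        if not succeeded:
            row[0] += 1
    summary = []
    for (day, endpoint), (failed, total) in sorted(counts.items()):
        budget = total * (100.0 - slo_target_pct) / 100.0
        summary.append({
            "day": day,
            "endpoint": endpoint,
            "total": total,
            "failed": failed,
            "budget_consumed": failed / budget if budget else 0.0,
        })
    return summary
```

Only these small summary rows, not the raw telemetry, then need to be shipped to the Kusto database backing the dashboards.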

Service Alerting

Alerting is a very important topic in SRE. There is always a tradeoff between alerting too often versus alerting too late. In the famous series of Google SRE books, a few different flavors of alerting strategies are introduced and all of them have their benefits and drawbacks. At teamplay we calculate the burn rate over the last hour and send an alert if the service is consuming more than twice the amount of error budget as planned. At the end of each error budget period (four calendar weeks) we also send a notification for those endpoints which have consumed all of their error budget. 
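The one-hour burn-rate check described above can be sketched as follows, assuming a four-week (672-hour) error budget period:

```python
ERROR_BUDGET_PERIOD_HOURS = 28 * 24  # four calendar weeks

def burn_rate(budget_consumed_last_hour: float) -> float:
    """Observed hourly consumption relative to the steady rate that would
    exactly exhaust the error budget over the whole period.
    `budget_consumed_last_hour` is a fraction of the total budget (1.0 = all)."""
    planned_per_hour = 1.0 / ERROR_BUDGET_PERIOD_HOURS
    return budget_consumed_last_hour / planned_per_hour

def should_alert(budget_consumed_last_hour: float) -> bool:
    # Alert when consuming more than twice the planned rate.
    return burn_rate(budget_consumed_last_hour) > 2.0
```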

The figure below, taken from the book “Establishing SRE Foundations”, shows the benefits of alerts based on SLIs over regular alerts. 

Visualization Dashboards 

The visualization dashboards enable teams to check the defined SLOs of any service, investigate the error budget depletion by SLO and prioritize reliability work based on the error budget depletion over time. The dashboards are interactive Power BI reports that support the user in filtering the data to their needs. They are updated once every 24 hours. 

The following dashboards are available: 

  • SLOs definitions dashboard (individually for availability, latency and interval metrics) 
  • SLO adherence dashboard (individually for availability, latency and interval metrics; shows SLO adherence over the last three error budget periods for a single SLO) 
  • Reliability prioritization dashboard (shows SLO adherence over the last three error budget periods for all SLOs for a team, service or deployment sorted by the biggest error budget consumers shown on top) 
  • Current overview dashboard (shows the SLO adherence so far in the current error budget period) 

Dashboard with additional insights: 

  • Retry suggestions (detection of missing retries for the HTTP result code “500”) 

Ideas for future dashboards: 

  • Absence of applicable stability patterns for distributed systems (e.g. detection of missing circuit breakers) 
  • An SLO higher in the service hierarchy is tighter than a dependent SLO lower in the service hierarchy (although in some cases, it may be possible to build more reliable things on top of less reliable things: SLOconf: SLO Math - by Steve McGhee)

Some of the existing dashboards are shown and explained in detail below. 

Error budget depletion dashboard per SLI

The following dashboard shows availability error budget depletion over time. It does so for an SLO specified at the top of the dashboard: 99% availability for “post /api/categoriesdistribution/getcategoriesdistributiondata”.

On top of the dashboard, the SLO, the definition of a failed request, the dates of the current error budget period and the remaining days in the current error budget period are shown to provide a quick orientation. 

The upper graph with horizontal lines shows the remaining error budget in % per day per deployment environment over time. The steepest error budget depletion is in Production EU shown as a mahogany line. The second steepest error budget depletion is in Production JP shown as a red line. The error budget depletion in Production US shown as a yellow line is rather small. The error budget replenishments at the beginning of each shown error budget period are clearly visible with all the lines approaching 100% fast. 

Premature error budget exhaustion (negative error budget in an error budget period) can be clearly seen in Production EU (mahogany line) in March, April and May. It can also be seen in Production JP (red line) in April. This gives the team a clear direction as to where to investigate and invest in reliability. 

The lower graph with vertical lines shows the failed request rate per day per deployment environment over time. It is provided for informational purposes. 

Similar dashboards exist to track the error budget depletion for the latency SLI and others. 

Reliability prioritization dashboards 

There are two reliability prioritization dashboards provided by the SRE infrastructure: 

  • Short-term dashboard with an overview of the current error budget period 
  • Long-term dashboard with an overview of the last three completed error budget periods 

Below is the short-term reliability prioritization dashboard. On top in the middle of the dashboard, the dates for the current error budget period are shown. Right below, the remaining number of days in the current error budget period are displayed: 13 out of 28. 

For the selected data, the pie chart in the upper left corner of the dashboard displays an overview of all applicable SLOs vs. SLOs whose error budget is being depleted. That is, about 70% of SLOs have no error budget depletion so far in the 13 elapsed days of the current error budget period. 

The tables in the lower part of the dashboard show individual SLOs for the data selection. On the left hand side, availability SLOs are shown. On the right hand side, latency SLOs are displayed. In color, the adherence to each SLO is shown in a visually compelling way: 

  • Red cells demonstrate the least SLO adherence, which corresponds to the most error budget depletion in the elapsed days of the current error budget period. The remaining error budget is too small to avoid premature exhaustion before the end of the current error budget period. The red cells are shown on top because this is where the developers’ attention is required most. 
  • Yellow cells (there are none in the example above) demonstrate SLO adherence where the current error budget depletion is fairly high. If depletion continues at the current rate, it may well lead to premature error budget exhaustion in the current error budget period. It might be worth investigating the reasons for the significant error budget depletion in the elapsed days of the current error budget period. 
  • Green cells demonstrate a fully acceptable error budget depletion in the current error budget period. If the depletion rate remains the same, the current error budget period will not end with a negative error budget. 
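The traffic-light logic above can be sketched as a projection of the current depletion rate to the end of the period. The exact thresholds below are illustrative, not the dashboard's actual ones:

```python
def slo_cell_color(budget_consumed: float, elapsed_days: int,
                   period_days: int = 28) -> str:
    """Classify error budget depletion in the current period.
    `budget_consumed` is the fraction of the budget spent so far (1.0 = all).
    Thresholds are illustrative."""
    # Project the depletion to the end of the period at the current rate.
    projected = budget_consumed * period_days / max(elapsed_days, 1)
    if budget_consumed >= 1.0 or projected >= 1.5:
        return "red"     # premature exhaustion certain or very likely
    if projected > 1.0:
        return "yellow"  # current rate would exhaust the budget early
    return "green"       # on pace to end the period with budget left
```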

Using the dashboard, the teams can focus on the areas where immediate reliability attention is required. One of the use cases where the dashboard is useful is post release monitoring of services. The days after a production release are particularly interesting to assess the impact of the deployed changes on error budget depletion patterns. 

The long-term reliability prioritization dashboard is shown below. It allows the teams to zoom out a bit and take a look at the error budget depletion of their services over the last three months. 

Based on the data filtering, a list of applicable SLOs is displayed in the middle of the dashboard. On top of the dashboard, the dates of the last three completed error budget periods are displayed. In color on the right hand side, the adherence of each SLO for each error budget period is shown in a visually compelling way: 

  • Red cells demonstrate the least SLO adherence, with a negative error budget left at the end of a given error budget period. In brackets, the actual fulfillment percentage of an SLO is shown. The red cells are shown on top because this is where reliability prioritization is required most. 
  • Yellow cells demonstrate a nearly exhausted error budget in a given error budget period. Although some error budget is left, it is worth investigating the reasons for such a significant error budget depletion. 
  • Green cells demonstrate a fully acceptable error budget depletion. 

Using the dashboard, the teams can focus on the areas where reliability prioritization is required in a matter of seconds. Based on the data from the dashboard, the teams typically create work items that describe the necessary reliability work to be prioritized using an existing prioritization procedure established in a given team (e.g. Kanban or Scrum). 

UX for productivity 

We recommend using a very intuitive color coding, such as red / yellow / green, for error budget depletion and all other metrics the user should make decisions upon. Also, it should be easy for a user to filter all displayed data based on the context of a geographical region or technical service so that issues can be detected as well as limited to a certain domain. 

For comparison purposes, it makes sense to normalize error budget depletion graphs to 100%, regardless of the absolute size of the defined error budget. This shifts the focus towards customer experience since the SLOs are defined based on customer happiness, and a 99.9% SLO for one use case is therefore equally important as a 95% SLO set for another use case. 
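The normalization can be sketched as follows: the remaining budget is expressed relative to the budget implied by the SLO target, so a 99.9% SLO and a 95% SLO plot on the same 0-100% scale:

```python
def normalized_budget_remaining(actual_availability_pct: float,
                                slo_target_pct: float) -> float:
    """Remaining error budget in percent, independent of how tight the SLO is.
    100 = untouched budget, 0 = exactly exhausted, negative = overspent."""
    budget = 100.0 - slo_target_pct          # e.g. 1.0 for a 99% SLO
    spent = 100.0 - actual_availability_pct
    return 100.0 * (1.0 - spent / budget)
```

With this normalization, a service that has spent half of its budget shows 50% remaining whether its SLO target is 99% or 99.9%.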

For alerts, we recommend presenting detailed contextual information within the alert payload so that a developer can easily locate the root cause. Helpful links to resources, runbooks or even database queries should also be included in the alert payload itself. All this contributes to reducing the time to recover from incidents. 

The maintenance effort for the dashboard is pretty low, since automated daily refreshes collect the latest data. Manual maintenance effort is only needed if the data propagation fails due to a transient error (e.g. source data unavailable), which can be fixed by another refresh in most cases. Of course, the dashboard should be continuously improved according to the SREs’ needs. This work can include usability improvements or extensions with new graphs and additional information condensed from the logs. A daily refresh of the data is sufficient since the dashboard should not be used for reactive production monitoring; that is fulfilled by the alerting mechanism. The dashboard instead should be used to prioritize reliability against the development of new features on a development cycle basis. In addition, it should offer an overview of the defined SLOs, and it is beneficial to visualize the effect of newly deployed changes on reliability. This can be achieved by comparing the service performance before, during and after a rollout.

Using the SRE Infrastructure 

Getting the SRE infrastructure adopted requires development teams to be ready for SRE. The teams need to understand the advantages of operating services using SRE as opposed to other ways of doing so. To familiarize the teams with the SRE methodology, concepts and infrastructure, we used team coaching as a core method. 

We went team by team and: 

  • implemented standardized logging 
  • defined SLOs through iteration 
  • taught the team members to react to the SLO breaches using on-call rotations 
  • showed how to use error budget depletions to identify reliability issues 
  • demonstrated how to prioritize reliability using provided dashboards 
  • implemented error budget policies

We took a team-based approach to SRE infrastructure adoption recognizing the fact that each team is unique and is on its own maturation journey. 

Based on the experience, we embedded SRE in our continuous improvement programme using an indicators framework described earlier, which is systematically rolled out in the entire organization. 

SRE Infrastructure Adoption 

The adoption of the SRE infrastructure can be measured using several dimensions. These are: 

  1. Number of teams on the SRE infrastructure over time 
  2. Number of services on the SRE infrastructure over time 
  3. Number of environments monitored with the SRE infrastructure 
  4. Percentage of services within SLOs over time 
  5. Percentage of services within SLAs over time 
  6. Number of postmortems over time 
  7. Number of views of visualization dashboards over time 
  8. Number of major outages over time 
  9. Number of customer escalations over time 
  10. Number of customer support tickets reporting failures over time 

Dimensions 1-7 are outputs rather than outcomes; dimensions 8-10 can be considered outcomes. After three years of SRE infrastructure adoption, the current data snapshot for some of the dimensions above is as follows:

  1. Number of teams on the SRE infrastructure: 17
  2. Number of services on the SRE infrastructure: 57
  3. Number of environments monitored with the SRE infrastructure: 15
  4. Percentage of services within SLOs in the last 180 days
    1. Percentage of availability SLOs across all services with no premature availability error budget exhaustion in the last 180 days: 99.72%
    2. Percentage of latency SLOs across all services with no premature latency error budget exhaustion in the last 180 days: 95.89%
  5. Percentage of services within SLAs in the last 180 days
    1. Percentage of availability SLAs across all applicable services with no premature availability error budget exhaustion in the last 180 days: 100%
    2. Percentage of latency SLAs across all applicable services with no premature latency error budget exhaustion in the last 180 days: 100%
  6. Number of post mortems in the last six months: 9
  7. Number of views of visualization dashboards in the last three months: 559
  8. Number of major outages in the last six months: 1
  9. Number of customer escalations in the last six months: 2

We do have qualitative evidence that the application of SRE has significantly reduced the number of customer escalations and, generally, outages reported to our teams from the outside. Further, we do have qualitative evidence that the application of SRE provides a good structure for defining, splitting and fulfilling operational responsibilities across roles in a development team, and beyond. 

We do not yet have evidence that the application of SRE allows scaling the number of people running the services sublinearly with the number of services in operation. Additionally, we need to invest more into measuring the value of the SRE infrastructure to be able to see trends over time at a glance. 

Finally, we have evidence that teams that adopted SRE well for existing services apply it to new services by default from the outset. This is a very profound insight. It allows new services to be incepted, prototyped, implemented and deployed with reliability built in from the start. It will surely lead to more reliable services in production in the future. 

Summary

A net new SRE implementation in a product delivery organization is a considerable undertaking. It requires a build-up of an SRE infrastructure by dedicated people and the adoption of the infrastructure by the development teams. The process needs to be facilitated by extensive team coaching in order to drive the awareness and understanding of SRE as well as application of its concepts to the unique circumstances of each team. The coaching also establishes a tight feedback loop between developers in the teams and SRE infrastructure engineers to ensure the infrastructure meets the needs of its users. 

More details about our work can be found in the upcoming book “Establishing SRE Foundations”, to appear later in 2022. 

Acknowledgements 

We would like to acknowledge many people in various roles at the Siemens Healthineers digital health platform who enabled and contributed to driving the SRE adoption at teamplay. 

About the Authors

Community comments

  • SRE Standardization - Incident Remediation

    by Ashley Stirrup,

    I love the approach described in this article because it gives you a toolset for measuring reliability in a standard way which is the first step towards improving reliability. Are you able to track types of small incidents that happen on a daily basis and measure which are having the most impact? This would give you insight into where to invest to improve reliability. If you are able to do it, what patterns are you seeing in terms of top areas for investment?

  • Re: SRE Standardization - Incident Remediation

    by Vlad Ukis,

    Hi Ashley!

    Thanks for the feedback! Standardization is definitely key.

    I am not sure about types of incidents and patterns. My co-author, Philipp, will comment on that once he is back from vacation.

    In terms of impact measurements: these are directly possible based on the error budget consumption. In fact, tracking error budget consumption by the SRE infrastructure enables comparisons of incidents' impact.

    Looking forward to any further questions you might have!

    Vlad

  • Re: SRE Standardization - Incident Remediation

    by Philipp Gündisch,

    Hi Ashley,

    Yes, we are able to track and identify incidents with a very small error budget consumption as well. For these incidents we will not alert the developers via PagerDuty; transparency is provided via the dashboard instead. As Vlad already mentioned, the graphs showing the error budget consumption over multiple days will clearly indicate the severity of these incidents based on the rate at which the remaining error budget gets consumed. It is also possible to order the overview list by the amount of consumed error budget to retrieve a priority list.

    As the top areas for investment, we prioritize availability over latency if we face more incidents than a team can work on simultaneously. Once a team reaches a state with no frequent breaches of the defined SLOs, the dashboard can be used to identify the SLOs closest to being broken.

    Happy to answer further questions if they arise!
    Philipp
