Building an SLO-Driven Culture at Salesforce

CRM software company Salesforce have revealed their approach to service reliability using service-level indicators and objectives (SLIs and SLOs). After building a platform to monitor SLOs, they saw massive adoption, with 1,200 services onboarded in the first year. The platform gives service owners deep, actionable insight into how to improve or maintain the health of their services: it helps them find dips in SLIs, identify dependent services that are not meeting their own SLOs, and better understand customers’ experience of their services.

Building a platform to monitor service reliability abstracts away organizational complexities and toil, allowing teams to focus on driving business value. Tripti Sheth explains that it was crucial for Salesforce to agree on a definition of "highly reliable" across a range of tech stacks, and across the many products and supporting services within the organisation. This agreement let them frame reliability in terms of SLIs and SLOs.

As documented by Google Cloud, Site Reliability Engineering (SRE) begins with the idea that availability is a prerequisite for success. A Service-Level Objective (SLO) is a precise numerical target for service availability. A Service-Level Agreement (SLA) is a promise to a service user that the SLO will be met over a specific time period, and Service-Level Indicators (SLIs) are direct measurements of the service's performance. These generally accepted definitions are often used to express customer experience in a clear, quantitative and actionable way.
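To make the distinction concrete, the following is a minimal Python sketch of an availability SLI measured against an SLO. It is an illustration of the definitions above, not Salesforce's implementation; the class name `AvailabilitySLO`, the function names and the example figures are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AvailabilitySLO:
    """Target fraction of successful requests over a window (hypothetical type)."""
    target: float      # e.g. 0.999 for "99.9% of requests succeed"
    window_days: int   # period over which the target applies


def availability_sli(good_requests: int, total_requests: int) -> float:
    """SLI: the measured fraction of requests that succeeded."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return good_requests / total_requests


def meets_slo(sli: float, slo: AvailabilitySLO) -> bool:
    """True when the measurement is at or above the objective."""
    return sli >= slo.target


# Example figures are invented for illustration.
slo = AvailabilitySLO(target=0.999, window_days=28)
sli = availability_sli(good_requests=99_950, total_requests=100_000)
print(f"SLI={sli:.4%} target={slo.target:.1%} met={meets_slo(sli, slo)}")
```

The SLI is a measurement, the SLO is the number it is compared against, and an SLA would be the external commitment built on top of that comparison.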

In the past, Salesforce’s teams had assembled SLOs manually, meaning that updating these metrics and reporting on them was a time-consuming and error-prone task. Additionally, different teams would calculate and store these values in different ways, preventing the company from gaining a clear picture of customer experience.

Forming a standardized view of service availability was crucial, and Salesforce approached this in three areas:

Standardised Measurements: Salesforce used a previously established SLO framework based on five readings of request rate, errors, availability, duration/latency, and saturation (READS) to define standardised measurement of product and service health.
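As a rough sketch of what a READS record might look like, the Python below summarises one observation window into the five readings. The field units, the use of a 95th-percentile latency for "duration", and the `summarize` helper are assumptions made for illustration; the source does not describe how Salesforce computes each reading.

```python
from dataclasses import dataclass
from statistics import quantiles
from typing import Sequence


@dataclass(frozen=True)
class ReadsSnapshot:
    """One window of READS readings for a service (hypothetical shape)."""
    request_rate: float  # requests per second
    errors: int          # failed requests in the window
    availability: float  # fraction of requests that succeeded
    duration_ms: float   # latency, here the 95th percentile (assumption)
    saturation: float    # fraction of capacity in use, 0.0 to 1.0


def summarize(latencies_ms: Sequence[float], failures: int,
              window_s: float, saturation: float) -> ReadsSnapshot:
    """Aggregate raw per-request data into a single READS snapshot."""
    total = len(latencies_ms)
    if total < 2 or window_s <= 0:
        raise ValueError("need at least two requests and a positive window")
    # quantiles(n=20) returns 19 cut points; the last one is the 95th percentile.
    p95 = quantiles(latencies_ms, n=20)[-1]
    return ReadsSnapshot(
        request_rate=total / window_s,
        errors=failures,
        availability=(total - failures) / total,
        duration_ms=p95,
        saturation=saturation,
    )


snapshot = summarize([float(i) for i in range(1, 101)],
                     failures=2, window_s=50.0, saturation=0.4)
print(snapshot)
```

Keeping all five readings in one record is what lets every service be compared on the same terms, whatever its tech stack.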

Standardised Tooling: a dedicated SLO platform for hosting the definitions of SLIs, SLOs and services, including ownership, health thresholds and alert configurations. This metadata is held in a single data store, with long-term storage and retention to give visibility of historical health trends. Automated alerts can be set up based on the data collected.
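The kind of metadata described here could be modelled roughly as follows. This is a hedged sketch only: the `SloDefinition` and `ServiceDefinition` types, the in-memory `registry` dictionary, and the service and team names are hypothetical stand-ins for whatever data store and schema the platform actually uses.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SloDefinition:
    """An objective on one SLI, plus the threshold that triggers an alert."""
    sli_name: str                  # e.g. "availability" or "duration_ms"
    target: float                  # the objective itself
    alert_threshold: float         # alert when the SLI crosses this value
    higher_is_better: bool = True  # False for readings such as latency


@dataclass
class ServiceDefinition:
    """A service entry with ownership and its SLO definitions."""
    name: str
    owner: str
    slos: list[SloDefinition] = field(default_factory=list)


def alerts_to_fire(service: ServiceDefinition,
                   measurements: dict[str, float]) -> list[str]:
    """Return the SLI names whose current value crosses the alert threshold."""
    firing = []
    for slo in service.slos:
        value = measurements.get(slo.sli_name)
        if value is None:
            continue  # no data for this SLI in the current window
        crossed = (value < slo.alert_threshold if slo.higher_is_better
                   else value > slo.alert_threshold)
        if crossed:
            firing.append(slo.sli_name)
    return firing


# A single store for all definitions; an in-memory dict stands in for it here.
registry: dict[str, ServiceDefinition] = {}
registry["checkout"] = ServiceDefinition(
    name="checkout",
    owner="payments-team",
    slos=[
        SloDefinition("availability", target=0.999, alert_threshold=0.9995),
        SloDefinition("duration_ms", target=300.0, alert_threshold=250.0,
                      higher_is_better=False),
    ],
)

firing = alerts_to_fire(registry["checkout"],
                        {"availability": 0.9991, "duration_ms": 180.0})
print(firing)
```

In this sketch the alert threshold is set tighter than the target, so an alert can fire before the objective itself is missed; that is a design choice of the example, not something the source specifies.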

Standardised Visualisation: as soon as a new service is added to the platform, an out-of-the-box standard view of metrics is generated, with the standard READS SLIs and any custom SLIs added for that specific service. The visualisation includes a dedicated Grafana dashboard for real-time monitoring, which is automatically generated and populated with live data. The service is also added to the service analytics dashboard, which is regularly reviewed to drive conversations about service health and availability.

The combination of these three areas creates many benefits:

  • Confidence that SLOs are calculated in a standardized way
  • Insights from visualized SLI and SLO metrics
  • Granular SLO targets for judging whether a service is meeting expectations
  • Alerting on SLI and SLO metrics
  • Correlation of breaches with incidents
  • Identification of service dependencies

The SLO platform architecture comprises multiple components. It is centered around a service registry and configuration store, which holds service ownership information, service statuses and service-specific configuration, along with data on SLIs, SLOs and the thresholds required to trigger alerts. Peripheral to this are data stores for change and release information, collected for future use in correlating changes with SLO breaches, and a time-series monitoring platform with pipelines for collecting and aggregating metrics.
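The article describes change and release data as being collected for future correlation with SLO breaches, without saying how that correlation will work. One simple approach, sketched below under that caveat, is to look for changes to the same service deployed shortly before a breach began. The `ChangeEvent` and `SloBreach` types, the one-hour lookback and the sample data are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class ChangeEvent:
    """A release or configuration change recorded for a service."""
    service: str
    description: str
    deployed_at: datetime


@dataclass(frozen=True)
class SloBreach:
    """The moment an SLI for a service started missing its objective."""
    service: str
    sli_name: str
    started_at: datetime


def changes_preceding(breach: SloBreach, changes: list[ChangeEvent],
                      lookback: timedelta = timedelta(hours=1)) -> list[ChangeEvent]:
    """Changes to the breached service within the lookback window, newest first."""
    window_start = breach.started_at - lookback
    candidates = [
        c for c in changes
        if c.service == breach.service
        and window_start <= c.deployed_at <= breach.started_at
    ]
    return sorted(candidates, key=lambda c: c.deployed_at, reverse=True)


# Sample data invented for illustration.
changes = [
    ChangeEvent("checkout", "release A", datetime(2023, 5, 1, 9, 0)),
    ChangeEvent("checkout", "release B", datetime(2023, 5, 1, 11, 40)),
    ChangeEvent("search", "release C", datetime(2023, 5, 1, 11, 50)),
]
breach = SloBreach("checkout", "availability", datetime(2023, 5, 1, 12, 0))
suspects = changes_preceding(breach, changes)
print([c.description for c in suspects])
```

Proximity in time only suggests a candidate cause; a real system would likely also need the dependency information discussed later to rule out failures originating in other services.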

The unified service health dashboard has become a focal point for operational reviews. The team has used these metrics to trigger architectural reviews and to stimulate discussions around strategic investments and tactical improvements.

Future work will enable a more comprehensive view of the dependencies for a service, with the goal of pinpointing exactly where a failure occurs and minimising recovery times. Furthermore, having collected these data per service, and with a realistic view of each service's dependencies, Salesforce will be able to set realistic SLIs across the entire stack.

The full article with further details is available on Medium.
