Google Explains Why Others Are Doing SRE Wrong

Stephen Thorne, customer reliability engineer at Google, recently spoke at the DevOps Enterprise Summit London about what Site Reliability Engineering (SRE) is and how many organizations fail to understand its basic premises and benefits [PDF of slides]. Key misunderstandings that Thorne has seen in other organizations include: confusing service level objectives (SLOs), which focus on early failure detection, with service level agreements (SLAs), which often serve as financial compensation for past incidents; not enforcing error budgets; and not dedicating at least 50% of SRE teams' effort to improving systems and tools, instead letting them drown in toil (aka "firefighting") in production.

SLOs are fundamental to detecting issues early, ideally before the effects become visible to customers. A good SLO is aligned with outcomes for the customer (service availability or response time, for example) and thus reflects whether the system's behavior is meeting user needs. Resource usage, such as CPU utilization or network throughput, should be monitored, but not used as an SLO per se. Thorne put it simply: "if the customer is happy, then the SLO is being met". Typical SLOs at Google include:

  • uptime of 99.9% a month (i.e. 43 minutes of downtime a month)
  • 99.99% of HTTP requests in a month succeed with a 200 OK
  • 50% of HTTP requests returned in under 300ms
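
The arithmetic behind SLOs like these can be sketched in a few lines of Python. This is only an illustration, not Google's tooling: the function names, the 30-day month, and the sample numbers are all assumptions made for the example.

```python
def downtime_budget_minutes(slo: float, days: int = 30) -> float:
    """Minutes of downtime an availability SLO tolerates over the window.

    Assumes a 30-day month by default (an assumption for this sketch).
    """
    return (1.0 - slo) * days * 24 * 60


def success_ratio(total_requests: int, ok_requests: int) -> float:
    """Fraction of requests that succeeded (e.g. returned 200 OK)."""
    if total_requests == 0:
        return 1.0  # no traffic means nothing failed
    return ok_requests / total_requests


def fraction_under(latencies_ms: list[float], threshold_ms: float) -> float:
    """Fraction of requests served faster than threshold_ms."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    return sum(1 for ms in latencies_ms if ms < threshold_ms) / len(latencies_ms)


# 99.9% uptime over a 30-day month leaves roughly 43 minutes of downtime.
print(round(downtime_budget_minutes(0.999), 1))  # 43.2

# Hypothetical sample data: is the "99.99% of requests succeed" SLO met?
print(success_ratio(1_000_000, 999_950) >= 0.9999)  # True

# Hypothetical sample data: did at least 50% of requests return in under 300 ms?
print(fraction_under([120, 250, 310, 900], 300) >= 0.5)  # True
```

Each check is expressed in terms of what the customer experiences (availability, success, latency) rather than resource usage.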

SLAs, on the other hand, typically come into play only when customers are already dissatisfied with a service, so they do nothing to proactively improve the system's reliability. Further, SLAs can create the wrong incentives. For example, combining an SLA of two hours to fix an email issue with an SLA of one day to fix a serious production incident might lead a team to work on one (or more) email problems first, although the production issue should clearly be the priority.

Just defining SLOs is not enough, Thorne warned. Error budget policies help teams meet SLOs by setting clear rules for action (not monetary compensation) before a system gets close to an SLO's threshold. This also minimizes confrontation between ops and dev when systems are failing to meet user needs. "The error budget is the gap between perfect reliability and our SLO", said Thorne. For Google, a typical error budget policy is to disallow launching new features once an application has exhausted its error budget (for example, when it is already over the 43 minutes downtime budget for this month), or to dedicate a sprint to corrective actions stemming from previous post-mortem analysis.
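
A launch-freeze policy of this kind could be sketched as follows. This is a minimal illustration under stated assumptions, not Google's implementation: the `ErrorBudget` class, the `launches_allowed` function, and the downtime figures are hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    """Gap between perfect reliability and the SLO, expressed in minutes."""
    slo: float             # availability target, e.g. 0.999
    window_minutes: float  # length of the SLO window, e.g. a 30-day month

    @property
    def budget_minutes(self) -> float:
        return (1.0 - self.slo) * self.window_minutes

    def remaining_minutes(self, downtime_minutes: float) -> float:
        return self.budget_minutes - downtime_minutes


def launches_allowed(budget: ErrorBudget, downtime_minutes: float) -> bool:
    """Hypothetical policy: freeze feature launches once the budget is spent."""
    return budget.remaining_minutes(downtime_minutes) > 0


# Assumed 30-day month with a 99.9% availability SLO (about 43 minutes of budget).
monthly = ErrorBudget(slo=0.999, window_minutes=30 * 24 * 60)

print(launches_allowed(monthly, downtime_minutes=10))  # True: budget remains
print(launches_allowed(monthly, downtime_minutes=50))  # False: budget exhausted
```

The point of encoding the rule is that the consequence is agreed in advance, so ops and dev do not have to negotiate it while an incident is in progress.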

Thorne stressed, however, that what works for Google won't work for every organization: "SRE needs SLOs with consequences that balance an acceptable level of failure with the necessary cost and speed of delivery". Exact SLOs and policies must suit the organization (not be a copy/paste from Google) and focus on continuously improving customers' experience, not on setting lofty goals or harsh punishments that could be counterproductive. Thorne gave the example of one organization struggling to reduce the processing time of a recommendations system. It turned out that users would only see those recommendations when they came back to the site, on average six hours later. An adequate SLO was to process all recommendations within six hours, which meant the organization could save the cost of three engineers previously working half-time on the perceived "issue" of slow response time.

Empowering SRE teams to balance their workload between the everyday (often unplanned) ops work and planned work to reduce toil (aka "firefighting") is the third key to SRE, said Thorne. At Google, this means at least 50% of SRE effort is spent on project work: early consulting on new systems' architecture to identify resiliency anti-patterns (and avoid more toil later on), improving monitoring, automating repetitive tasks, or coordinating the implementation of postmortem corrective actions.

Thorne also pointed to some clear anti-patterns in implementing SRE, such as simply rebranding the ops team as the SRE team, or hiring SREs without first putting in place the SRE principles and mechanisms (SLOs, error budget policies and workload balancing) needed for success.

These are five key steps to get on the right path to SRE, according to Thorne:

  1. Define contextual, customer-focused SLOs
  2. Define sensible error budget policies
  3. Hire (internally or externally) SREs and empower them via leadership support
  4. Allow SREs to fine-tune SLOs and enforce error budget policies
  5. Assign responsibility for the reliability of mission-critical systems to SRE teams, and leave other systems under the responsibility of the corresponding development teams

Google developed and expanded the site reliability engineering discipline internally for some years before condensing its lessons learned into the SRE book. Thorne mentioned that an accompanying SRE workbook will be coming out later this month.

UPDATE: The new SRE workbook is available for free at this link (PDF) until Aug 23, 2018.
