Google Production Excellence Program "ProdEx": Christof Leng at DOES 2022

Christof Leng, site reliability engineering lead at Google, presented ProdEx, Google's production excellence review program for SRE teams, which helps manage operational risks, promote best practices, and drive continuous improvement across Google SRE.

SRE at Google is a central specialist organization that is matrixed into the individual product and business groups it aligns with. But SRE is also a community that builds platforms together, establishes standards, and promotes best practices, so people learn from each other and grow. That is the purpose of ProdEx. The production excellence review program started in 2015, and today more than 100 SRE teams have signed up. More than 1,000 reviews have been conducted by more than 40 reviewers, both internal and external to the SRE organization.

The mission and goals of ProdEx are to drive best practices and production health across SRE. They assess the main risk areas for SRE-owned production services; identify SRE teams that need help; provide coaching opportunities for SRE teams; and improve cross-SRE visibility and awareness for SRE leadership.

To do this, they developed and adopted a structured approach for each of these reviews with shared metrics. They use dedicated tools for automated data collection. The teams get reviewed at least annually, and teams that struggle can get reviewed more often. All identified improvements resulting from the reviews are tracked as action items.
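The review cadence and action-item tracking described above can be sketched in a few lines. This is a hypothetical illustration only: the record fields, the semi-annual cadence for struggling teams, and all names are assumptions, not Google's actual tooling.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta

# Hypothetical sketch: fields and cadences are illustrative assumptions,
# not a description of Google's internal ProdEx tooling.
@dataclass
class Review:
    team: str
    review_date: date
    scores: dict                      # capability area -> score
    action_items: list = field(default_factory=list)  # tracked improvements

def needs_review(last: Review, today: date, struggling: bool) -> bool:
    """Teams are reviewed at least annually; teams that struggle can be
    reviewed more often (semi-annually here, as an assumption)."""
    cadence = timedelta(days=182 if struggling else 365)
    return today - last.review_date >= cadence
```

A scheduler built on `needs_review` could then queue teams automatically, matching the article's point that data collection and review preparation are automated.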

Google's ProdEx program overview

Generally, two senior reviewers (directors or principal engineers) conduct each SRE team's operational health review. The reviews focus on six capability areas:

  1. The team's information, such as a charter and a clear roadmap guiding the team toward its goals.
  2. The team's on-call health, to assess pager fatigue and paging quality: incident load, alert-to-incident ratio, on-call rotation staffing, and unactionable incidents.
  3. The team's interruptions, to check whether the team has the bandwidth to dedicate to impactful engineering work.
  4. The team's SLOs and postmortems, to check whether system performance is measured and aligned with users' needs.
  5. The team's data integrity, to identify any risks of losing data.
  6. The team's capacity planning, to minimize the cost of suboptimal utilization and manual capacity management.
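Two of the metrics named in these capability areas can be sketched concretely. The function names, the unweighted averaging, and the area labels below are assumptions for illustration; the article does not disclose ProdEx's actual formulas or thresholds.

```python
# Illustrative only: ProdEx's real metrics and scoring are not public.

def alert_to_incident_ratio(alerts: int, incidents: int) -> float:
    """On-call health signal: many alerts per incident suggests noisy,
    fatigue-inducing paging."""
    return alerts / incidents if incidents else float("inf")

# The six capability areas from the review, as hypothetical keys.
AREAS = ["team_info", "oncall_health", "interruptions",
         "slos_postmortems", "data_integrity", "capacity_planning"]

def overall_health(scores: dict) -> float:
    """Aggregate score as a simple unweighted mean over the six areas
    (an assumption; the real weighting is unknown)."""
    return sum(scores[a] for a in AREAS) / len(AREAS)
```

In such a model, a low `data_integrity` score would pull the mean down directly, which is consistent with the article's later observation that data integrity is highly predictive of overall health.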

The impact and outcomes to date are significant, and more teams are enrolling in the program. For example, in the first year the reviews were conducted, only 23% of teams scored high. Over the years, this percentage increased to 66%. At the same time, the fraction of at-risk teams scoring low dropped from 44% to 9%.

The pager load and incident rate dropped by 34%, decreasing teams' fatigue. Data integrity became the most predictive section for the overall health score: teams that score low in data integrity are unlikely to be well-performing teams. Automated review preparation also saved leadership thousands of hours that would otherwise have been needed to prepare the reviews.
