Site Reliability Engineering Content on InfoQ
-
SLOs Are the API for Your Engineering Team
SLOs provide a simple, common language for evaluating risk in terms of error budgets. They save everyone involved time and energy, which can be redirected toward more important things, like keeping your customers happy.
-
Sustainable Operations in Complex Systems with Production Excellence
Successful long-term approaches to production ownership and DevOps require cultural change in the form of production excellence. Teams are more sustainable if they have well-defined measurements of reliability, the capability to debug new problems, a culture that fosters spreading knowledge, and a proactive approach to mitigating risk.
-
Book Review: Site Reliability Engineering - How Google Runs Production Systems
"Site Reliability Engineering - How Google Runs Production Systems" is an open window into Google's experience and expertise in running some of the largest IT systems in the world. The book describes the principles that underpin the Site Reliability Engineering discipline. It also details the key practices that allow Google to grow at breakneck speed without sacrificing performance or reliability.