Book Review: Site Reliability Engineering - How Google Runs Production Systems

Sep 21, 2016 10 min read

Key takeaways

Software engineering is fundamental to modern ops
Defining reliability targets (“error budgets”) allows devs and ops to have enlightening conversations on the new features vs availability debate
A cap on manual operational work allows the ops team to scale sub-linearly with the IT systems growth
Monitoring is a basic building block for operations and quality assurance activities
Handling load requires a multi-pronged approach, with load balancing and graceful overload handling at the forefront

“Site Reliability Engineering - How Google Runs Production Systems” is an open window into Google’s experience and expertise on running some of the largest IT systems in the world. The book describes the principles that underpin the Site Reliability Engineering (SRE) discipline. It also details the key practices that allow Google to grow at breakneck speed without sacrificing performance or reliability.

Although SRE predates DevOps, Benjamin Treynor Sloss, Vice President at Google and Google’s SRE founder, says that SRE can be seen as a “specific implementation of DevOps with some idiosyncratic extensions”. SRE has 8 tenets: availability, latency, performance, efficiency, change management, monitoring, emergency response and capacity planning. Much of the book discusses how SRE teams align and execute their work with these core tenets.

Software Engineering in SRE

So what exactly is Site Reliability Engineering? According to Sloss, SRE is “what happens when you ask a software engineer to design an operations team”. It all starts with the team’s composition: 50-60% of SREs are software engineers by training. They have the same skills as the software engineers who work on product development. The rest of the team has similar skills, but with operations-specific knowledge, such as networking and Unix systems internals. This mix is intentional. It’s not feasible to grow the organization linearly with the IT systems’ growth. So Google SRE teams benefit from software engineers’ urge to automate everything and their dislike of manual work (toil).

But how do SREs manage to get the time to automate problems away? Google’s answer is a 50% cap on toil. SREs must spend at least 50% of their time on engineering activities. Google monitors the amount of operational work each SRE team is doing. If it’s consistently greater than 50%, then the product teams get the excess operational work. This approach has the nice side-effect of motivating the product teams to build systems that do not rely on manual operation. Also, as a rule of thumb, on-call engineers should handle about two events per shift. Fewer than two becomes a waste of time. Much more than two and the engineer becomes overwhelmed, unable to handle each event thoroughly and to learn from it.

Handling Risk

The way Google handles risk is distinctive. The tension between the devs’ thirst for new features and the ops’ focus on reliability is a common source of conflict. Google facilitates this discussion with the concept of the error budget, changing the whole discussion in the process.

The cost of increasing levels of reliability grows exponentially. Excepting some niche domains, such as pacemakers, 100% reliability is the wrong target. On a practical level, it’s easy to understand that 100% availability is impossible when you factor in all the pieces that sit between the end users and your web service, such as their Wi-Fi, their internet provider, their laptop. So, at Google, the reliability target is a product decision, not a technical one. Let’s say that a product team sets a 99.9% reliability target for a given service. This means that the service can be down for 8.76 hours per year (0.1% of the 8,760 hours in a year). That’s the service’s error budget and the product team can use it on experimentation and innovation.
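The arithmetic behind an error budget is simple enough to sketch in a few lines. The function name and the non-leap-year assumption below are illustrative choices, not something taken from the book:

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; assumes a non-leap year


def error_budget_hours(target: float, period_hours: float = HOURS_PER_YEAR) -> float:
    """Return the allowed downtime, in hours, for an availability target.

    `target` is a fraction between 0 and 1, for example 0.999 for 99.9%.
    """
    if not 0.0 <= target <= 1.0:
        raise ValueError("target must be between 0 and 1")
    return (1.0 - target) * period_hours


# Each extra "nine" shrinks the budget by a factor of ten.
for target in (0.99, 0.999, 0.9999):
    print(f"{target:.2%} -> {error_budget_hours(target):.2f} hours per year")
```

The 99.9% line reproduces the 8.76 hours quoted above.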

The book contains lots of real-life stories that bring concepts such as error budgeting to life. A memorable one is about Chubby, Google’s reliable lock service. Over time, more and more services came to depend on Chubby, assuming it would never fail. Alas, everything fails eventually, so Chubby’s failures started to translate into user-visible failures. Chubby’s SRE team decided to ensure that Chubby met, or just slightly exceeded, its reliability targets. If Chubby far exceeded its targets, then a controlled failure would bring down the service. The practice nudged all those services to plan for Chubby’s failure.

The Crucial Role of Monitoring

Monitoring receives extensive coverage, both from the principles and practices viewpoints. SREs consider that the four golden signals of monitoring are latency, traffic, errors and saturation. Tracking errors is crucial for measuring the service’s reliability. Given that most of Google’s services are global, it’s rare to have a full outage. This makes uptime and downtime inadequate to measure availability. Instead, Google defines it as the ratio of successful requests to total requests.
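As a rough illustration of request-based availability, the sketch below computes the ratio and, under the assumption that the error budget is also counted in requests, how much of that budget is left. The function names and the sample numbers are hypothetical:

```python
def availability(successful_requests: int, total_requests: int) -> float:
    """Availability as the ratio of successful requests to total requests."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return successful_requests / total_requests


def budget_remaining(successful_requests: int, total_requests: int, target: float) -> float:
    """Fraction of the request-based error budget still unspent.

    Assumes the budget is (1 - target) * total_requests failed requests.
    The result is negative once the budget is overspent.
    """
    if not 0.0 <= target < 1.0:
        raise ValueError("target must be in [0, 1)")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    allowed_failures = (1.0 - target) * total_requests
    actual_failures = total_requests - successful_requests
    return 1.0 - actual_failures / allowed_failures


# Hypothetical numbers: 500 failures out of one million requests.
print(availability(999_500, 1_000_000))
print(budget_remaining(999_500, 1_000_000, target=0.999))
```

With a 99.9% target, one million requests allow about 1,000 failures, so 500 failures consume roughly half of the budget.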

A good monitoring and alerting pipeline does not overload pagers. It focuses on imminent real problems, only paging the symptoms to the on-call engineer. During troubleshooting and debugging, the system can provide detailed cause-oriented heuristics. Monitoring systems should avoid email-based alerts, instead striving to provide insightful dashboards. These dashboards display aggregated data, coupled with logs for historical analysis. Standardized processes collect and process metrics that feed those dashboards and alerts. Every Google binary has an embedded HTTP server that exposes sets of metrics through a standardized interface. These are then collected and aggregated according to a rich rules language. Prometheus is an open source monitoring tool that closely follows Google’s model.
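To make the idea of a binary exposing its own metrics over HTTP concrete, here is a minimal sketch using only the Python standard library. The `/metrics` path, the `name value` line format and the counter names are assumptions made for illustration; they are not Google's internal interface, and the sketch does not claim to implement any real tool's exposition format:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-process counters; a real service would update these per request.
METRICS = {"requests_total": 0, "errors_total": 0}


def render_metrics(metrics: dict) -> str:
    """Render metrics as one `name value` pair per line, sorted by name."""
    return "".join(f"{name} {value}\n" for name, value in sorted(metrics.items()))


class MetricsHandler(BaseHTTPRequestHandler):
    """Serve the current metric values on the assumed /metrics path."""

    def do_GET(self) -> None:
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(METRICS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8000) -> None:
    """Block forever, serving metrics on the given port (not called here)."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

A collector would then poll each instance periodically and aggregate the values, which is the pull-based model the paragraph above describes.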

Handling Load

The editors devote several chapters to the hard task of handling load. Load balancing is the first line of defense. While load balancing may seem simple, the two chapters describing Google’s strategies make it plain it isn’t so. At Google, load balancing starts by determining which data center will handle the user’s request. The first load balancing layer is DNS, which focuses on trying to forward the user’s request to the nearest data center. The second layer uses virtual IP addressing to select the backend that will serve the request.

Once inside the data center, Google has a lot more control over the infrastructure, so it can refine the handling of the user’s request. Clients in Google’s RPC system open a number of long-lived connections to their servers (a specific instance of a service) on startup, to avoid the overhead of opening and closing connections for each request. This strategy creates two challenges. First, how to handle unhealthy servers? Having the servers in one of three states - healthy, refusing connections and lame duck - provides the answer. A lame duck server is still serving but is asking its clients to stop sending requests, to allow for clean shutdowns. The second challenge is optimizing resource usage (CPU and memory). Given the sheer number of servers providing a specific service, it would be wasteful for a client to connect to every server, so each client connects to a subset of them. How does a client then spread its requests across that subset? Google uses a weighted round robin algorithm, which the book presents as the most effective of the policies it describes. Weighted round robin works by having each client keep track of each server’s capability over time. Each server reports its own capability according to the observed query rates and errors per second.
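The combination of server states and capability-weighted selection can be sketched as follows. This is not Google's implementation: the variant shown is the "smooth" weighted round robin used by nginx, picked here because it is short, and the class, method and server names are hypothetical. Capability scores are taken as given rather than derived from reported query and error rates:

```python
from collections import Counter
from enum import Enum


class ServerState(Enum):
    HEALTHY = "healthy"
    REFUSING_CONNECTIONS = "refusing connections"
    LAME_DUCK = "lame duck"


class WeightedRoundRobin:
    """Client-side picker over a subset of servers (illustrative sketch)."""

    def __init__(self, capabilities: dict) -> None:
        if not capabilities or any(w <= 0 for w in capabilities.values()):
            raise ValueError("need at least one server with a positive capability")
        self._weights = dict(capabilities)
        self._current = {name: 0.0 for name in capabilities}
        self._state = {name: ServerState.HEALTHY for name in capabilities}

    def set_state(self, server: str, state: ServerState) -> None:
        self._state[server] = state

    def set_capability(self, server: str, capability: float) -> None:
        """Update a known server's score, e.g. from its reported rates."""
        if server not in self._weights or capability <= 0:
            raise ValueError("unknown server or non-positive capability")
        self._weights[server] = capability

    def pick(self) -> str:
        # Only healthy servers receive new requests; lame ducks are skipped.
        eligible = [n for n, s in self._state.items() if s is ServerState.HEALTHY]
        if not eligible:
            raise RuntimeError("no healthy servers available")
        total = sum(self._weights[n] for n in eligible)
        for name in eligible:
            self._current[name] += self._weights[name]
        chosen = max(eligible, key=lambda n: self._current[n])
        self._current[chosen] -= total
        return chosen


picker = WeightedRoundRobin({"server-a": 3, "server-b": 1})
print(Counter(picker.pick() for _ in range(400)))
```

Over 400 picks, `server-a` receives three requests for every one sent to `server-b`, and the picks are interleaved rather than sent in bursts. Marking a server as a lame duck removes it from selection without touching the others.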

Even when load balancing is perfect, sometimes overload is unavoidable. In these situations, the goal must be for the server to continue to serve traffic at the rate it was provisioned for, while rejecting the excess requests and avoiding cascading failures. The most efficient ways to handle overload involve cooperation between clients and servers. Clients should not exceed their consumption quotas for a given service. Servers should reject excess requests without failing catastrophically and with adequate error codes. Engineers can understand services’ failure modes by load testing them, then taking mitigating measures. Retry policies must be fine-tuned or they’ll just add to a server overload problem. Randomized exponential backoff retries and retry amplification minimization (a failure at a lower layer may trigger a cascade of retries further up the stack) are just a few policies to use.
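A minimal sketch of randomized exponential backoff with a bounded number of retries follows. The function names and default values are assumptions, and each delay is drawn uniformly between zero and an exponentially growing ceiling, which is one common way to randomize backoff rather than the specific policy described in the book:

```python
import random
import time
from typing import Callable, Iterator, TypeVar

T = TypeVar("T")


def backoff_delays(base: float, cap: float, retries: int) -> Iterator[float]:
    """Yield one randomized delay (in seconds) per allowed retry."""
    for attempt in range(retries):
        ceiling = min(cap, base * 2 ** attempt)  # doubles each retry, up to cap
        yield random.random() * ceiling  # jitter spreads clients apart


def call_with_retries(
    operation: Callable[[], T],
    retries: int = 3,
    base: float = 0.1,
    cap: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation`, retrying failures at most `retries` times."""
    delays = backoff_delays(base, cap, retries)
    while True:
        try:
            return operation()
        except Exception:
            delay = next(delays, None)
            if delay is None:
                raise  # retries exhausted: surface the failure to the caller
            sleep(delay)
```

Keeping `retries` small, and retrying at only one layer of the stack, limits the amplification described above: three layers that each retry three times can turn one user request into dozens of backend calls.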

The Challenge of Data

A software engineer’s life would be much simpler if the problem of ensuring consistent, persistent data just didn’t exist. Alas, it does exist. Google’s SREs have to deal with that problem at a massive scale, so the book spends several chapters discussing the challenges of ensuring consistent state in distributed systems, large-scale periodic scheduling with cron, data processing pipelines and ensuring data integrity. Most interesting is the in-depth discussion of system architecture patterns for distributed consensus, mostly based on Paxos and its variations, including their performance trade-offs, deployment considerations and metrics to be monitored.

The data integrity chapter highlights how important it is to have defense in depth against data loss. Google applies three layers of defense. The first one is soft deletion, where data is marked as deleted rather than destroyed, so recovering it is pretty simple. Data integrity must be balanced with data privacy as, depending on privacy policies or laws, data may have to be hard deleted after a specific time span. The second layer is the traditional, but critical, backups and restores. The third layer, early detection, addresses those insidious bugs where data corruption is detected days, weeks or months after it actually happened. Google applies out-of-band data validation to increase the chances of detecting inconsistencies between and within data stores.
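The first layer, soft deletion with a later hard delete, can be illustrated with a small in-memory store. Everything here is hypothetical (class names, the 30-day retention window, the explicit `now` argument used to keep the example deterministic); it only shows the shape of the idea:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class Record:
    value: str
    deleted_at: Optional[datetime] = None  # None means the record is live


class SoftDeleteStore:
    """In-memory store where deletes are reversible until retention expires."""

    def __init__(self, retention: timedelta) -> None:
        self._retention = retention
        self._records = {}

    def put(self, key: str, value: str) -> None:
        self._records[key] = Record(value)

    def get(self, key: str) -> Optional[str]:
        record = self._records.get(key)
        if record is None or record.deleted_at is not None:
            return None  # soft-deleted data is hidden from normal reads
        return record.value

    def delete(self, key: str, now: datetime) -> None:
        self._records[key].deleted_at = now  # mark only; data stays on disk

    def undelete(self, key: str) -> None:
        self._records[key].deleted_at = None

    def purge(self, now: datetime) -> list:
        """Hard-delete records whose retention window has passed."""
        expired = [
            key
            for key, record in self._records.items()
            if record.deleted_at is not None
            and now - record.deleted_at >= self._retention
        ]
        for key in expired:
            del self._records[key]
        return expired


store = SoftDeleteStore(retention=timedelta(days=30))
store.put("doc-1", "hello")
store.delete("doc-1", now=datetime(2016, 1, 1, tzinfo=timezone.utc))
store.undelete("doc-1")  # recovery is a metadata change, not a restore
print(store.get("doc-1"))
```

The `purge` step is where the privacy requirement comes in: once the retention window has passed, the data is removed for good and only the backup layer can bring it back.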

It may be unexpected, but Google has its own offline backup system, backed by tape drives. The book describes two instances when the venerable tape drives were critical to restore users’ data. Reading how the recovery process involved trucks carrying 1.5 petabytes of data stored on tapes from an offsite storage facility is enlightening. Learning why SREs then manually loaded the tapes into the tape libraries is inspiring: they knew from previous DiRT (Disaster Recovery Test) exercises that manually loading the tapes is much faster than using robot-based methods.

Different authors, all current or former SREs at Google, wrote the book’s 34 chapters. As the editors state in the preface, each chapter is more like an essay that can be read on its own (as this review makes plain). Indeed, some of the chapters are based on previously published articles. Although this format makes the book a little less homogeneous and more long-winded than ideal, the editors - Betsy Beyer, Chris Jones, Jennifer Petoff and Niall Richard Murphy - did an expert job of meshing this wealth of knowledge together. You can read the 524-page book from beginning to end and it will feel coherent. But reading the principles chapters and then focusing on the specific practices you’re most interested in is the best way to enjoy this book.

About the Book Editors

Betsy Beyer is a Technical Writer for Google in New York City specializing in Site Reliability Engineering. She has previously written documentation for Google’s Data Center and Hardware Operations Teams in Mountain View and across its globally distributed datacenters. Before moving to New York, Betsy was a lecturer on technical writing at Stanford University. En route to her current career, Betsy studied International Relations and English Literature, and holds degrees from Stanford and Tulane.

Chris Jones is a Site Reliability Engineer for Google App Engine, a cloud platform-as-a-service product serving over 28 billion requests per day. Based in San Francisco, he has previously been responsible for the care and feeding of Google’s advertising statistics, data warehousing, and customer support systems. In other lives, Chris has worked in academic IT, analyzed data for political campaigns, and engaged in some light BSD kernel hacking, picking up degrees in Computer Engineering, Economics, and Technology Policy along the way. He’s also a licensed professional engineer.

Jennifer Petoff is a Program Manager for Google’s Site Reliability Engineering team and based in Dublin, Ireland. She has managed large global projects across wide-ranging domains including scientific research, engineering, human resources, and advertising operations. Jennifer joined Google after spending eight years in the chemical industry. She holds a PhD in Chemistry from Stanford University and a BS in Chemistry and a BA in Psychology from the University of Rochester.

Niall Murphy leads the Ads Site Reliability Engineering team at Google Ireland. He has been involved in the Internet industry for about 20 years, and is currently chairperson of INEX, Ireland’s peering hub. He is the author or coauthor of a number of technical papers and/or books, including "IPv6 Network Administration" for O’Reilly, and a number of RFCs. He is currently cowriting a history of the Internet in Ireland, and is the holder of degrees in Computer Science, Mathematics, and Poetry Studies, which is surely some kind of mistake. He lives in Dublin with his wife and two sons.
