Tanya Reilly on Site Reliability Engineering and the Evolution of the New York City Fire Code

Dec 17, 2018

Podcast with

Tanya Reilly

Wesley Reisz

This week on the InfoQ Podcast, Wes Reisz talks to Tanya Reilly (Principal Engineer at Squarespace and previously a staff SRE at Google). Tanya discusses her research into how the fire code evolved in New York and draws on some of the parallels she sees in software. Along the way, she discusses what it means to be an SRE, what effective aspects of the role might look like, and her opinions on what we as an industry should be doing to prevent disasters. This podcast features discussion on paved roads, prevention, testing, firefighting (in software), and reliability questions to ask throughout the software lifecycle.

Key Takeaways

Teams are increasingly responsible for the entire software lifecycle. When this happens, they think about the software differently because they know they’re the ones who will get paged if it fails. This idea is at the core of the “You Build It, You Run It” philosophy in DevOps.
The role of SRE is to define how to do things in a really reliable way. The focus is to make the majority of the operations work go away and, for the things that can’t go away, to make them as easy as possible.
At the very start of a project (when you’re writing the initial design), you should be thinking about the system’s dependencies and how those who come after you will be able to discover them. A great way to do this is to offer an API that people will want to use and then instrument it.
We can learn a lot from the growth of fire safety regulations as a metaphor for software. Examples of the parallels include fireproof interior walls, socializing best practices, software inspections, and circuit breakers.
The work SREs do varies from place to place. In some places SREs make recommendations on patterns; in others they create libraries. Occasionally, SREs are firefighters, but only as a last resort.
We use error budgets and SLOs to quantify how much risk we’re comfortable taking. They inform whether we’re willing to take on more or less risk.
We need to consider software reliability throughout the full cycle of software development. When you build a system, think about it as if there will not be someone on call for it.
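
The circuit breakers mentioned in the takeaways can be illustrated with a small sketch. This is not code from the podcast or from any particular library. The class name `CircuitBreaker`, the exception `CircuitOpenError`, and the parameters `max_failures` and `reset_after` are all assumptions made for illustration. The idea is that after repeated failures the breaker "opens" and fails fast, so a broken dependency is isolated instead of dragging its callers down with it.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker refuses a call without trying the dependency."""


class CircuitBreaker:
    """Minimal, single-threaded circuit breaker (illustrative names and defaults)."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # seconds to wait before a trial call
        self.clock = clock                # injectable so tests can fake time
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Open: fail fast instead of piling load on a sick dependency.
                raise CircuitOpenError("dependency isolated; failing fast")
            # Cooldown elapsed: fall through and allow one trial call.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures in a row, (re)opens it.
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each request to a dependency in `breaker.call(...)` and treat `CircuitOpenError` as a signal to serve a fallback. A production version would also need thread safety and metrics, which this sketch leaves out.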

Show Notes

00:21 In 1911, in New York City, the Triangle Shirtwaist Factory fire killed 146 people in 18 minutes. In her keynote presentation at QCon New York, The History of Fire Escapes, Tanya Reilly discussed how this fire led to modern fire safety standards, and how the lessons learned map to the software industry.
02:21 TL;DR: We should be building fireproof software.
02:44 Site reliability engineers often deal with contingency plans, which, by their nature, are not the normal way of doing things, and are often not tested. Fire escapes are the ultimate contingency plan, the untested way of getting out of a building when something bad is happening.
03:51 Since the 1960s, fire escapes have no longer been part of the building code. It turns out they were not a very good contingency plan.
04:31 Reilly describes her job as an SRE at Squarespace as basically asking, "Why are we doing this?" and "Have we written this down?"
04:48 A developer should be aware of the infrastructure underneath the software, but it should just be a layer of abstraction. The underlying operating system, storage systems, and databases should be very well defined, almost like an API to your infrastructure, with SLOs to tell you how reliable each part should be.
05:31 Reilly was at Google for 12 years, working on site reliability and disaster recovery. It was important to trace the dependencies to see how things were interconnected, and know that if they ever turned Google off, they'd be able to turn it back on again.
06:11 The SRE teams at Squarespace provide infrastructure services, and are like advisors on reliability for the company, defining what launching safely means.
06:36 While some may say SRE is just another name for operations, it's operations as envisioned by software engineers, where the majority of the operations work would be automated away or rolled into the software development process.
06:56 Historically, there has been a tendency to think of SREs as "those who carry the pager," and "those who run the software." Instead of thinking of operations as a job function, it's better to think of it as a kind of work.
07:16 DevOps and SREs have moved us past the old metaphor of developers throwing the code over the wall to a dedicated operations team to make it run.
07:50 As an industry, we have created a lot of automation to remove a lot of work previously done manually by humans.
08:18 Often, the best people to run a system are the people who wrote the code. We increasingly have teams that are responsible for the whole software lifecycle. This means, when they're designing the system, they think about how to support the system.
08:28 Each team can end up doing a lot of repetitive research in how to support software. The role of the SRE is to define the right way to do this.
09:08 Mike McGarr has spoken about his work at Netflix and the idea of defining the "paved road." That captures the role of the SRE: make as much of the infrastructure work as possible go away, and make what remains as easy as possible.
10:12 SRE is a role that involves a lot of education to remove the perception of magic. Reliability, like security, is something we should always be thinking about, not something to just add on at the end. However, there is still a need for a security team whose primary motivation is keeping the site secure and who won't be distracted by other priorities. The same is true for reliability.
12:22 At SREcon 2017, Dave Rensin said anything you create will eventually become a platform. If you don't offer a way for people to consume your data, they will scrape your site. So you should offer a good API for accessing your data, and then instrument your service to see how people are using it.
13:27 One of the questions SREs should be asking is, "How do we know, in a microservices world, how our services are interconnected?" Every team shouldn't figure this out from scratch. In this, and other situations, there should already be a "right answer." In a production readiness review, instead of asking "How are you connecting to the database?" we should be asking "Are you connecting to the database in the standard way, which is documented over here?"
14:11 The SREs shouldn't care about the dependencies -- they should be making sure they're discoverable.
15:25 The keynote covered four aspects of the evolution of fire safety (prevention, detection, isolation, and response) and how those concepts relate to software.
15:51 Prevention - The number of fires, and the number of fire deaths, didn't decrease because we got better at responding to fires; we got better earlier on in the fire lifecycle. One of the ways was by making it harder for fires to start at all. The measures ranged from wiring inspections to public safety campaigns about safe cooking techniques. In tech, we have similar practices, including unit tests and integration tests. We do experiments, like chaos engineering, to determine what may cause a fire.
16:51 The role of SREs to provide education is like those public safety campaigns. It used to be common to have manually configured servers, installed from CDs, but no one should be doing that now, since configuration management is an established best practice.
17:26 Detection - Sprinklers are a form of automation -- they detect a fire and react to it. The reaction may be messy, but the fire is out before it causes a problem. In software, we can have automated responses such as redirecting traffic, or even semi-automated responses like one-click rollback. If a new version of software has a problem, we'll first roll back, put the fire out, and then debug to find the cause.
18:26 Isolation - Tenements built with firewalls didn't stop fires; they stopped fires from spreading. We can use similar isolation techniques in software to limit the blast radius, such as circuit breakers.
20:06 The role of SREs really varies. In some companies, they will develop tools, while in others they may only serve in an advisory role and provide guidance. They are also the people responsible for writing things down to define, "this is how we should do it."
20:55 Response - When there is an incident, SREs can occasionally be firefighters. But they should not respond to every incident, just as you don't call the fire department if you burn dinner.
21:36 The vast majority of outages that require human response should be trivial to deal with, preferably by the DevOps team that built the software. When an outage crosses many teams, or you don't know where to start to fix it, then it is useful to have people who are experts at incident response.
22:23 A good incident response involves a lot of communication. It should also be followed by a blameless post-mortem. In the retrospective, try to understand what happened, with the result being actions that will prevent the same thing from happening again. These could be fixes to the software, but they could also address things that made the situation worse, like fixing documentation that was wrong or creating documentation that was missing.
24:06 One post-mortem technique is Wheel of Misfortune disaster role playing, where you walk through the situation and ask "what would you do next?" This is helpful in training incident commanders in how to respond, keep detailed notes, and manage the situation, more than arriving at a technical solution.
26:18 An error budget is a way of quantifying the amount of downtime you're comfortable with, for whatever downtime means for you. Ben Treynor, the founder of SRE at Google, said 100% is the wrong reliability target for basically everything because it's incredibly expensive.
27:56 If we have an error budget of 10 minutes of downtime per year, and in March we've been down for nine minutes, that will influence the risks we take. Conversely, if we are near the end of the year and haven't had any downtime, we may take more risks in our development philosophy.
28:41 The fire code didn't really change until a lot of people died. As our lives become more dependent on software, we have to begin to care about software reliability before there are deadly consequences.
30:14 Reilly wants professional standards and a fire code for software. Whenever there is a major outage, there should be a public post-mortem, so the entire industry can learn from it.
30:41 We need to consider software reliability throughout the entire software development lifecycle -- we can't sprinkle reliability on at the end. When you design a new system, assume there won't be anyone on call for it, and build accordingly.
31:21 Final advice from Reilly: Please get a smoke alarm. They aren't expensive, and they save lives.
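
The error budget arithmetic from the notes at 26:18 and 27:56 can be worked through in a few lines. This is a rough sketch, not a tool described in the podcast. The class `ErrorBudget`, its methods, and the "spend no faster than the year elapses" rule in `can_take_risk` are assumptions chosen for illustration. Real teams typically use rolling windows and more nuanced burn-rate policies.

```python
from dataclasses import dataclass

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


@dataclass
class ErrorBudget:
    """Yearly downtime allowance in minutes (hypothetical helper)."""

    budget_minutes: float
    used_minutes: float = 0.0

    @classmethod
    def from_slo(cls, availability: float) -> "ErrorBudget":
        # An availability target of 0.999 leaves 0.1% of the year as budget.
        return cls(budget_minutes=(1.0 - availability) * MINUTES_PER_YEAR)

    def record_outage(self, minutes: float) -> None:
        self.used_minutes += minutes

    @property
    def remaining_minutes(self) -> float:
        return max(self.budget_minutes - self.used_minutes, 0.0)

    def fraction_used(self) -> float:
        return self.used_minutes / self.budget_minutes

    def can_take_risk(self, year_elapsed: float) -> bool:
        # Assumed policy: take on more risk only while the budget is being
        # spent no faster than the year is passing (year_elapsed in 0..1).
        return self.fraction_used() <= year_elapsed


# The scenario from the show notes: 10 minutes per year, 9 used by March.
budget = ErrorBudget(budget_minutes=10.0)
budget.record_outage(9.0)
print(budget.remaining_minutes)                  # minutes left for the year
print(budget.can_take_risk(year_elapsed=0.25))   # budget nearly spent early on
```

For scale, 10 minutes of downtime per year corresponds to an availability target of roughly 99.998%, which hints at why a 100% target is described as incredibly expensive.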

More about our podcasts

You can keep up-to-date with the podcasts via our RSS Feed, and they are available via SoundCloud, Apple Podcasts, Spotify, Overcast and YouTube. From this page you also have access to our recorded show notes. They all have clickable links that will take you directly to that part of the audio.