
Facilitating the Spread of Knowledge and Innovation in Professional Software Development


Greenwater, Washington: an Availability Story


Summary

Marc Brooker discusses defining and designing for availability that takes people into account, including examples of massive-scale cloud systems designed using these principles.

Bio

Marc Brooker is a Senior Principal Engineer at Amazon Web Services, currently focused on database services. He's been at AWS for 12 years, and has worked on EC2, EBS, Lambda, and IoT. He's interested in distributed systems, databases, serverless, and economics. Marc has a PhD in EE from the University of Cape Town.

About the conference

QCon Plus is a virtual conference for senior software engineers and architects that covers the trends, best practices, and solutions leveraged by the world's most innovative software organizations.

Transcript

Brooker: My name is Marc Brooker. I'm a senior principal engineer at AWS, focusing mostly on serverless technologies. I'm going to be talking about Greenwater, Washington, a small town up in the mountains of Western Washington, but more broadly, about the way that we think about, talk about, and define the availability of the distributed systems that we build.

First, a little bit about Greenwater. Greenwater is a small town up in the Cascade foothills of Western Washington, and is connected to the larger world by three long highways through the forest. One of them heads north through Enumclaw to Seattle. One heads east over the mountains to Yakima. One heads south, down to Portland. These long highways through the forest are critical to life in Greenwater. If you want to go to a hospital, to see a doctor, to go to a grocery store, to go to a drugstore, you need one of these highways to be available to you. Availability of the road system is critical to the people who live out there.

99.95% Availability

Say I'm a highway engineer for Washington State, and I say I want Washington State's highways to be 99.95% available. That's a pretty typical number for distributed systems availability, and it sounds really good. Here's the problem for the people of Greenwater. About one in every 100,000 people who live in Washington lives in Greenwater. Washington's highway system could be 99.999% available, and still be completely unavailable to the 70 people who live in that particular town.
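A quick back-of-the-envelope calculation makes this concrete. The populations below are rough, illustrative numbers (not from the talk), but they show how a population-weighted availability figure can look excellent while one small group gets no service at all:

```python
# Hypothetical numbers: a statewide availability figure can hide
# total unavailability for a small group of people.
state_population = 7_700_000   # approximate Washington State population
greenwater_population = 70     # roughly 1 in 110,000 residents

# Suppose the roads are perfectly available to everyone except
# Greenwater, which is completely cut off.
available_people = state_population - greenwater_population
population_weighted_availability = available_people / state_population

print(f"{population_weighted_availability:.5%}")  # ≈ 99.99909%
```

The aggregate number clears a five-nines bar even though, for the people of Greenwater, availability is exactly zero.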

Availability is Personal

What kinds of things happen? In winter, as is typical of the Cascades, it snows heavily. If you head to the south or to the east, you're likely to be eaten by a snow monster. In the summer and the fall, forest fires are pretty common. In fact, as I speak, Highway 410 north out of Greenwater towards Seattle is closed because of damage from a forest fire. It is not unusual that Greenwater gets mostly disconnected from the highway system of the larger Washington State. What is unusual is that that state doesn't last long. What's unusual, perhaps to us as distributed systems engineers, is the lengths to which the Department of Transportation and other stakeholders go to make sure that the people in that town are able to get the services they need: opening up parallel roads, servicing forest roads, all kinds of other things. That's because the people doing that work understand something that a lot of distributed systems engineers haven't internalized: availability is personal. It is absolutely no comfort to me that a system is available for everyone else if it is unavailable for me. It is absolutely no comfort to the people in a small mountain town that the road system is available to everyone else in Washington State, if it's not available to them.

Suppose they said, today, I need to get out to the drugstore to get some critical medicine, and the answer was: no, you can't do that, because the road system is unavailable. Don't worry, it is 99.99% available, and most people are very happy. They're not buying that. You wouldn't buy that. Yet it is the common way that a lot of people think about the availability of distributed systems. Availability is personal. It's about the people getting service, not about the number.

Distributed System Redundancy

One common way that we as distributed systems engineers think about adding availability to a system is redundancy. Say I live in Greenwater and I want to make sure that housing is always available to me. What could I do? The typical distributed systems engineering approach would be to build a second house. Sounds like a good idea. I'm going to go and build a second house just down the way. Is that actually going to improve my availability? It might, if the things that make my house unavailable, storms or a failing roof or the like, are uncorrelated across the two houses, things that fail separately. But for most of the houses in this town, the thing that limits their real-world availability, the real customer experience of availability, is that road connectivity. And that road connectivity is highly correlated. What correlated means in this context is: if it fails for one house in Greenwater, it's quite likely to fail for others, and potentially all of them. Building a second house in the same town, even a couple of miles away, probably adds no real-world availability to my experience of living in Greenwater.

Correlation Limits Achievable Availability

This is the second most important thing that people neglect when thinking about the availability of distributed systems: correlation in the real world limits achievable availability. The correlation between failures of nominally independent components puts a practical, real-world limit on the availability of the most available system you can build. The naive model would suggest that as we add redundancy to a system, as we add more copies, the availability goes up exponentially. This doesn't happen in the real world, because in the real world, correlation limits achievable availability. The fact that multiple copies are likely to fail at the same time, for various reasons, is most likely to be the real-world limit on the availability of any system that you build.

Blast Radius Is Critical

Let's talk about a third factor. I live in Seattle. Something that I really appreciate about the road network is that even if there are snow monsters and fires and bad weather up in the forests, they're not affecting my experience. I can still get to school. I can still get to the hospital. I can still get to the grocery store. This is a good thing. Similarly, if one of the interstates in Seattle were unavailable, the people over in Greenwater wouldn't be affected by that. That is because road networks have a very low, localized blast radius. Even the biggest, worst failures have, at worst, a small regional impact. This is a great property for the real-world, human-experienced availability of a distributed system. Blast radius, how much fails at once, and our ability to reason about what will be affected and where, is critical to the real-world human experience of the availability of a distributed system, far more than a design availability of however many nines.
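One common engineering expression of this idea is cell-based partitioning: divide customers into independent cells so that any single failure is bounded to one cell. The numbers here are invented for illustration:

```python
# Sketch: partitioning customers into independent cells bounds the
# fraction of customers any single failure can affect.
customers = 100_000
cells = 20
customers_per_cell = customers // cells

# Worst case for a cell-based design: one entire cell fails.
worst_case_affected = customers_per_cell / customers
print(f"{worst_case_affected:.0%} of customers affected, at most")
```

A monolithic design with the same aggregate availability can fail for 100% of customers at once; the cell-based design trades that for many smaller, bounded failures.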

Single Points of Failure

I'm going to build a house out in the forest. To be able to live in that house, I need five things. I need the house itself with nice, happy flowers around it. I need a well for water. I need a small field of corn for food. I need a pile of firewood to keep my house warm in the winter. I need a septic tank for obvious reasons. The availability of my house requires all five of these things to be available. My house is only available if all five of them are available. What if I'm not happy with that? What if I don't think that's good enough? The typical approach would be to say, let's build a second house with a completely independent set of things. This is great. Completely independent sets of things means exponentially increasing availability, means I can choose as an engineer, to balance the amount of money I spend against the availability of my system. I have the ultimate power tool to do that, and that is redundancy.

It never quite works like that, because there are these causes of correlation. Possibly the most obvious cause of correlation is a single point of failure. Say I build five houses, and to save a little bit of money, instead of building five wells, I build a single community well that is shared by all of the houses. What's interesting is that the availability is still going to be pretty great. One empty firewood pile won't bring down the whole thing. One broken septic tank won't bring down the whole thing. Only very unlucky circumstances, very unlikely to happen, lots of nines, or the failure of that well, will bring things down. As I add houses, the availability of my overall system is going to asymptotically approach the availability of that well. That's how single points of failure get you. You all know this. This is something that you are quite likely to already understand.
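The asymptote is easy to see in a small model. The per-component availabilities below are assumed numbers, not from the talk; each house needs four per-house components (the house, firewood, cornfield, septic tank) plus a well that is either private or shared:

```python
# Sketch: system availability with independent wells vs. one shared well.
a_component = 0.99   # assumed availability of each per-house component
a_well = 0.995       # assumed availability of a well

def at_least_one_house_up(n, shared_well):
    a_house_rest = a_component ** 4      # the four per-house components
    if shared_well:
        # The shared well must be up, and then at least one house's
        # remaining four components must be up.
        return a_well * (1 - (1 - a_house_rest) ** n)
    # Independent wells: each house is a fully independent set of five.
    a_house = a_well * a_house_rest
    return 1 - (1 - a_house) ** n

for n in (1, 3, 10, 50):
    print(n, round(at_least_one_house_up(n, True), 6),
             round(at_least_one_house_up(n, False), 6))
# With the shared well, availability is capped at a_well = 0.995 no
# matter how many houses are added; with independent wells it keeps
# climbing toward 1.
```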

There are two other ways that single points of failure get you. One of those ways is actually a way that they get your customers. Say I have my five houses, and I choose to rent them out. I rent them to somebody who, like me, really values availability. They're a distributed systems person. They like thinking about this. They want to have a highly available set of houses to live in, so they rent three houses from me. Their expectation, their mental model, is that they have bought themselves three-way redundancy, and therefore have a very highly available solution. What I haven't told them, because it's buried way down in the small print of the contract as an implementation detail, is that there is the shared community well. The shared well will break all three of those houses at the same time, and so that well has very high blast radius. That well is a single point of failure that will defeat their efforts to have high availability. By having the single shared well, I have made it impossible for my customers to build a housing solution more available than that well. I have bounded their availability asymptotically to the availability of that well, and so limited what they can do and what they can achieve. They might not know this.

The Percentage of Postmortems Drops as the System Scales

There's a third way that this catches you, an organizational way. A lot of people approach the question of how to invest in improving the availability of their systems by thinking about failures: the history of failures, the postmortems. It's quite a natural thing to do, to say: to pick our investments, we are going to go through our last year of postmortems and weight our investments around the things that caused failures. That sounds very sensible, doesn't it? Doesn't that sound like a good strategy? The problem is that the percentage of postmortems that involve the single point of failure, this community well, drops really quickly as the system scales. If you go and look at the postmortems from the system at scale, with many houses, over a year, you will see that about 20% of them involve firewood piles, and about 20% of them involve cornfields. Only a tiny, vanishing percentage involve that single point of failure, that community well.
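Why the well disappears from the postmortem pile falls out of simple counting. Assume, purely for illustration, that every component fails at the same rate; the per-house components multiply with fleet size while the shared well stays a single component:

```python
# Sketch: the shared well's share of incidents shrinks as the fleet
# grows, even though its blast radius stays the largest.
def well_postmortem_fraction(n_houses, rate=1.0):
    per_house_components = 4   # house, firewood pile, cornfield, septic tank
    total_incident_rate = n_houses * per_house_components * rate + rate
    return rate / total_incident_rate    # well incidents / all incidents

for n in (1, 10, 100, 1000):
    print(n, f"{well_postmortem_fraction(n):.4%}")
# At 1000 houses, the well appears in well under 0.1% of postmortems.
```

Postmortem-weighted investment would therefore starve the one component that caps the whole system's availability.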

If we weight things in the most obvious way, if we allocate resources in the most obvious way, if we're good owners and we think about the availability of the system that we build, then we're going to make a bad decision and invest in the wrong places. We're not going to invest in the thing that has the most impact. This takes careful thought, because we need to think about our system not in terms of what is causing the most failures, but what is most likely to cause a large failure. What is most likely to cause pain to our customers and cause a large outage of our system? That is very hard to get from metrics alone. It's very hard to get from history alone, and has to come from some understanding of our architecture. This is hard, because as humans we have this thing called the availability heuristic: we're heavily biased to worry about things that have happened recently or come easily to mind. Maybe we just can't remember the last time that single point of failure failed. Maybe it's never failed. Instead, we have this flood of postmortems from other things. That doesn't mean it's bad to invest in those things. It just means that you're not improving the asymptotic availability of your system by investing in them.

Who, How Much, and Who Else?

The people in Greenwater and the people in my houses know that the availability they experience is personal and matters to them. It's not easily captured by a global number. We know that correlation matters: if they want high availability, they need to be able to think about and reason about where in the network, or where in the system, their resources are landing. What does this mean for the way that we think about, design, and monitor systems? As engineers, we tend to be highly biased towards the how much. We can measure something, and measuring is something that we tend to love to do. We can put a number on a thing. Maybe we can draw a time series, a graph of availability over time, up and to the right. That's the how much. We're strongly biased towards thinking about that. Often, that is the only thing that people are thinking about when they think about availability. That's a very limited view of the world.

A lot of what matters to availability is also the who. Who is it out for? What is their experience? For example, if we say our system is 99.95% available, do we mean that it is 100% available for 99.95% of customers, or that it is 99.95% available for 100% of customers? One of those things is way worse than the other. One of them means that 0.05% of our customers are getting no service at all. That is a really bad property for a system to have. We also need to think about, who else? When a failure happens, what other failures does it correlate with? Are we surprised by the groups of failures that are correlated together, the failures that happen at the same time? Will our customers be surprised by that? Can our customers build architectures on top of our system that are more available than any component of that system? Remember, in my example with the shared well, there is no way for a customer to build something on top of that system that is more available than the shared well. That's a really bad property for a system to have, because it breaks their intuition about building distributed systems.
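The two readings of "99.95% available" can be made explicit with made-up per-customer numbers. Both systems below report the same aggregate figure, but the worst-served customer's experience is entirely different:

```python
# Two systems with the same aggregate availability of 99.95%,
# distributed very differently across customers (illustrative numbers).
customers = 10_000

# System A: 99.95% of customers see 100%; the other 0.05% see nothing.
hard_down = int(customers * 0.0005)                  # 5 customers
a_per_customer = [0.0] * hard_down + [1.0] * (customers - hard_down)

# System B: every customer sees 99.95% availability.
b_per_customer = [0.9995] * customers

print(sum(a_per_customer) / customers)   # 0.9995, same aggregate
print(sum(b_per_customer) / customers)   # 0.9995
print(min(a_per_customer), min(b_per_customer))   # 0.0 vs 0.9995
```

Any availability metric that only reports the aggregate cannot distinguish these two systems; tracking the worst-served customer (or a low percentile across customers) can.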

The Purpose of a System Isn't to Hit Numbers but to Provide Service

This is something that tends to get lost in engineering conversations and in measurement conversations, because often we think that the purpose of a system is to hit numbers. That 99.95%, that's the goal. That's what we're trying to achieve. It isn't. It's only a proxy measure for the thing we actually care about. The thing we actually care about is the service we're offering to people: the purpose of a distributed system, the purpose of any system that we build, is to provide a high quality of service to people. When we think about, design, measure, monitor, and build culture around our systems, that has to be at the front of our mind: who is experiencing failures, and who else is experiencing failures at the same time.

Resources

If you want to see what this looks like in practice, we wrote a paper earlier this year, which we presented at NSDI, called "Millions of Tiny Databases." It describes in detail a system that we built at AWS based on these principles, on these ideas of optimizing for blast radius and for correlation, by building a system that is aware of who its customers are and aware of likely areas of correlated failure in the underlying infrastructure, and that actively optimizes to drive those down. If you want to see what this looks like as a real-world architecture, please check out that paper.

 


 

Recorded at:

Mar 04, 2021
