The Medieval Census Problem


Summary

Andy Walker discusses the principles of distributed computing used in medieval times, and the need to understand high latency, low reliability systems, bad actors, data migration, and abstraction.

Bio

Andy Walker has been doing software engineering for nearly 30 years at companies like Google, Netscape, Skyscanner, Sony and IBM. He is currently semi-retired, which consists of trying to write up everything he has learned in the last 30 years and doing advisory work for startups and scale-ups.

About the conference

QCon Plus is a virtual conference for senior software engineers and architects that covers the trends, best practices, and solutions leveraged by the world's most innovative software organizations.

Transcript

Walker: Welcome to the medieval census problem. My name is Andy Walker. I spent quite a long part of my life at Google. This is my attempt to reconcile the problems of conducting a medieval census with microservices. You might think this is a slightly odd topic for a talk, given that we are in the 21st century. It turns out, there's nothing new under the sun. One of the problems I ran into a lot during my career is that it's very hard for developers to think about why distributed systems are hard. Making the switch to microservices is all about thinking about why distributed systems are hard, and what you can do about it. It turns out, all of the problems we run into today are problems you would have faced in trying to conduct a census in medieval times.

It Starts With a Question - How Many People Are In My Castle?

Welcome to the fictional Kingdom of Andzania, ruled by wise Queen E-lizabeth, population unknown. This progressive, small territory was looking to move forward in civilization. The queen wanted to embrace everything that would make her country successful, but she didn't know anything about her country or her subjects. This led to the first question: how many people are in my castle? This was also the first poorly worded requirement, because how can you know how many people are in a castle at any point in time? It's in a constant state of flux. People are going in and going out. Do you mean the castle itself, or do you mean the castle and its environs? Therefore, the first product manager was born, to try to understand what the requirements and the success criteria were. The first data scientists were created too, because now we have to count something which is constantly in flux, which is a lot harder than you may think.

When I was working on Google Maps, we had the constant problem that there is no authoritative source of map data anywhere in the world. All you have is a series of unreliable and possibly conflicting signals to tell you where roads and restaurants are, which are constantly changing, and in some instances people have vested interests in giving you the wrong information. We also have race conditions: somebody could be born, somebody could die, somebody could be leaving the environment. The best answer you're ever going to come up with is an approximation. One of the problems we have when we try to represent the real world is that we can only ever come up with an approximation. The problem here is the traditional tension between data science on one side and product management and leadership on the other. The data scientists would like to go away and come up with a perfect, mathematically sound answer for how many people there are, including confidence intervals and an understanding of the limits of both the question and the answer. The leader just wants an answer yesterday. Therefore, there is a tension between getting the answer right and getting an answer good enough to make progress.

More Questions, More Data - Let's Store It in the Shed

Once we know how many people are in our castle, we then realize that this information is useful, but not that useful, so we have more questions. If you're a monarch in medieval times, you may also wonder how many people of fighting age are in your city. Therefore, you start needing to capture other information, such as their age. You might need other information, such as their names, because you might want to understand how many greeting cards for Bob you have to print. You need a richer set of data. Where do we store this data? In this case, the Kingdom of Andzania decided to store it in a shed. They would just write it on slips of paper and store it there. They also realized they needed to access that information, and suddenly you need primary keys. You also realize that the same kinds of information apply to everybody; therefore, you have structured data. Your information changes over time, because people are born and people die. Therefore, you have stale data problems and you need processes for keeping that data up to date.
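In modern terms, the shed's ledger amounts to structured records addressed by a primary key, kept fresh by explicit update processes. A minimal sketch in Python, with all the names invented for illustration:

```python
from dataclasses import dataclass

@dataclass
class Resident:
    person_id: int   # primary key: lets a clerk find one slip among thousands
    name: str
    age: int

# The shed: structured records indexed by primary key.
census: dict[int, Resident] = {}

def record_birth(person_id: int, name: str) -> None:
    census[person_id] = Resident(person_id, name, age=0)

def record_death(person_id: int) -> None:
    census.pop(person_id, None)  # unless someone reports this, the data goes stale

record_birth(1, "Bob")
print(sum(1 for r in census.values() if r.name == "Bob"))  # greeting cards for Bob
```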

You also have the problem that you have free-for-all access to the shed, where anybody can walk in and do anything they want to the information. Over time, you're going to want to add or change your information, and therefore the structure of that data is going to change. This is not an optimal situation. In fact, a database with unrestricted access over time becomes your worst nightmare from a maintenance perspective. Back at Google, by the time the original ads database was finally deprecated, it had tens of thousands of clients, to the point where it was impossible to make meaningful schema changes without breaking a large proportion of the clients who depended on it, and there was no way of automatically working through those dependencies to know what would break before you made the change. It was a case of both chaos and stasis at the same time.

How Do We Access The Shed? (Ned's In Charge Of the Shed)

This led to hiring Ned as a clerk running the shed, and the first microservice was born: Ned 1.0. Ned very quickly realized that a question from the queen was far more important than a question from one of the local merchants in town; therefore, quality of service was born. Ned was also the first abstraction: he realized it was his job to hide the complexity of the data in the shed from the outside world, so that they could get value from it.
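A minimal sketch of Ned 1.0's quality of service, assuming a simple priority queue in which the queen's questions outrank the merchants' (the requester names and priority levels are invented):

```python
import heapq

# Lower number = more important requester.
PRIORITY = {"queen": 0, "noble": 1, "merchant": 2}

class NedService:
    """Ned 1.0: one clerk, always serving the most important request first."""
    def __init__(self):
        self._queue = []
        self._seq = 0  # tie-breaker so equal priorities stay first-in, first-out

    def submit(self, requester: str, question: str) -> None:
        heapq.heappush(self._queue, (PRIORITY[requester], self._seq, question))
        self._seq += 1

    def serve_next(self) -> str | None:
        return heapq.heappop(self._queue)[2] if self._queue else None

ned = NedService()
ned.submit("merchant", "What size shoes should I make?")
ned.submit("queen", "How many people are in my castle?")
print(ned.serve_next())  # the queen's question jumps the queue
```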

Doctor Hofmann's Leeches - Nobody Can Read His Writing

As time progressed, it became obvious that other people wanted to store information. Here you have a choice: do you have one unified data store for absolutely everything, which quickly becomes impractical, or do you wind up with little islands of heterogeneous data? In this case, Dr. Hofmann's handwriting was so terrible that they realized the only sensible option was to have him store his own information. Then, every time medical information was requested from the shed, they'd send a runner out to Dr. Hofmann, he would answer it, and his answer would be combined into the original response. This was Ned 2.0. This is where microservices act as an abstraction, because nobody wants to understand where all of that data is; they just want to be able to get an answer. One of the reasons we build microservices is to provide that layer of architectural glue, to simplify things for everybody else trying to use complex systems.

In another system I worked on at Google, in order to understand what a user should be able to access, you had to join data from four separate, very large databases. This led to some interesting problems, because there was no foreign key enforcement between these heterogeneous data islands. Therefore, the logic for joining them together was quite involved, because of all of the possible failure modes. This is another reason to use a microservice: to provide protection from that, so that every client wishing to ask the same question does not have to go through the same complexity of coming to an answer. The more times you replicate that complexity, the more times you do it differently, and the harder it is to effect meaningful change on your system afterwards.
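A sketch of the aggregation pattern Ned 2.0 represents: one service fans out to the separate data islands and joins the answers, so no client ever repeats that logic. The two backing stores here are stand-ins, not real services from the talk:

```python
# Ned 2.0 as an aggregator: clients ask one question; Ned fans out to
# whichever stores hold the pieces and merges the answers.

def query_shed(person_id: int) -> dict:
    return {"person_id": person_id, "name": "Bob", "age": 42}

def query_hofmann(person_id: int) -> dict | None:
    # No foreign-key enforcement between the islands: the record may be missing.
    records = {42: {"ailments": ["leeches prescribed"]}}
    return records.get(person_id)

def get_person_report(person_id: int) -> dict:
    """The one place that knows how to join the islands of data."""
    report = query_shed(person_id)
    medical = query_hofmann(person_id)
    report["medical"] = medical if medical is not None else {"ailments": []}
    return report

print(get_person_report(7))  # callers never learn where the data actually lives
```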

Ned Becomes Ill - We Need More Neds

Then, woe of woes, Ned became ill. Suddenly, nobody could ask their questions of the shed anymore. The local cobbler could not work out what size shoes to start making in order to satisfy the needs of the castle. The queen couldn't find out how many people had been born in the last month or the last year. They realized that a non-redundant system was a problem: we need to hire more Neds. This production outage then led to an additional problem, namely who actually answers a given question at any point in time, which led to the first load balancers. We also need to deploy changes to the microservice to multiple clerks, so now, whenever we want to change how people can request data, we need to train people, and we're deploying to multiple instances. This is still relatively easily contained, because all of the clerks are working on the same shed and it's a relatively small data set.
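A minimal sketch of the "more Neds" idea: a round-robin load balancer that routes around unhealthy instances. The clerk names and the health-tracking scheme are invented for illustration:

```python
import itertools

class LoadBalancer:
    """Round-robin over interchangeable clerks, skipping any who are ill."""
    def __init__(self, clerks: list[str]):
        self._clerks = clerks
        self._rr = itertools.cycle(range(len(clerks)))
        self._healthy = set(clerks)

    def mark_ill(self, clerk: str) -> None:
        self._healthy.discard(clerk)

    def pick(self) -> str:
        for _ in range(len(self._clerks)):
            clerk = self._clerks[next(self._rr)]
            if clerk in self._healthy:
                return clerk
        raise RuntimeError("no healthy clerks: total outage")

lb = LoadBalancer(["ned", "ted", "fred"])
lb.mark_ill("ned")
print([lb.pick() for _ in range(4)])  # requests keep flowing around Ned
```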

Crazy Bob Is Drunk Again - Our Data Is Now Broken

Unfortunately, one of the clerks, Crazy Bob, has a drinking problem. When he attempts to change data in the shed, sometimes he gets it wrong. We now have an unintentional bad actor in our infrastructure, and we need our first backups, which in turn leads to our first scheduled maintenance, because there's no way of snapshotting handwritten information. Therefore, every Sunday the clerks would get together, divide up the work, and make a copy of everything in the shed from the last week. This meant that if Bob went nuts during the week, they had at least one checkpoint to recover from. They also realized that they needed to keep a log of requests and responses, so that they could recreate the changes that had happened. This was stored as a transaction log, so that they could recover from a known state using only partial updates. Unfortunately, Crazy Bob didn't last long in this job.
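A sketch of the Sunday checkpoint plus transaction log, assuming writes are logged before they're applied, so recovery can restore the snapshot and replay everything up to the last known-good entry:

```python
import copy

class Shed:
    """Weekly snapshot plus a transaction log of every change since."""
    def __init__(self):
        self.data: dict[str, int] = {}
        self._snapshot: dict[str, int] = {}
        self._log: list[tuple[str, int]] = []

    def write(self, key: str, value: int) -> None:
        self._log.append((key, value))  # record the change before applying it
        self.data[key] = value

    def sunday_snapshot(self) -> None:
        self._snapshot = copy.deepcopy(self.data)
        self._log.clear()

    def recover(self, up_to: int) -> None:
        """Restore the checkpoint, then replay only the known-good log entries."""
        self.data = copy.deepcopy(self._snapshot)
        for key, value in self._log[:up_to]:
            self.data[key] = value

shed = Shed()
shed.write("bob.age", 41)
shed.sunday_snapshot()
shed.write("bob.age", 42)   # a good write
shed.write("bob.age", 999)  # Bob, drunk again
shed.recover(up_to=1)       # checkpoint + replay up to the bad write
print(shed.data)            # {'bob.age': 42}
```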

Nicholas Wants To Undermine the Queen - We Have Our First Bad Actor

The bad prince, Sir Nicholas, decides to undermine the queen. Data has become such a way of life for the noble Kingdom of Andzania that Nicholas realizes the way to take it apart is to attack the fabric on which decisions are made, so he starts inserting bogus data into the shed, knowing that over time the decisions made by the queen will look increasingly foolish, and he will be able to scheme against her. This leads to our first access control lists. This leads to data validation, where we need the ability to look at data going into and out of our system and ask ourselves, is it actually sane? We need to look at the aggregate of data in our system and ask, is it actually sane? We need abuse protection, to understand when a bad actor is corrupting the data that we have. We also need to understand that one person should not be able to disproportionately affect the operation of our data and microservices infrastructure, so we have rate limiting.

I had a fun experience of this where I inadvertently broke the microservices infrastructure running Google's routing tables by making millions of requests. I woke up the next morning, realized there were no rate limits in place, and had to find a non-destructive way of putting them in place. Having the ability to change, per user or per client, the number of requests they can make is one of the most effective protections we can have. If you're not thinking about dividing capacity up by client in your microservice, you're setting yourself up for problems later on, because everybody has the same level of access. When one client goes bad, for a good reason or a bad one, you're going to have a problem, and it's going to affect everybody.
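A sketch of per-client rate limiting using a token bucket, one bucket per client so a runaway caller exhausts only its own budget. The rates and client name are illustrative:

```python
import time

class PerClientRateLimiter:
    """Token bucket per client, so one caller can't starve everybody else."""
    def __init__(self, rate_per_sec: float, burst: float):
        self.rate = rate_per_sec
        self.burst = burst
        self._state: dict[str, tuple[float, float]] = {}  # client -> (tokens, last seen)

    def allow(self, client: str) -> bool:
        now = time.monotonic()
        tokens, last = self._state.get(client, (self.burst, now))
        tokens = min(self.burst, tokens + (now - last) * self.rate)  # refill over time
        if tokens >= 1.0:
            self._state[client] = (tokens - 1.0, now)
            return True
        self._state[client] = (tokens, now)
        return False

limiter = PerClientRateLimiter(rate_per_sec=5, burst=10)
print(sum(limiter.allow("andy") for _ in range(20)))  # only the burst gets through
```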

Census All The People - We're Going To Need a Bigger Shed

After a period of time, the queen realized that the census was actually very powerful, and she decided to expand it to her entire country. We have now gone from a single data center architecture to multi-region with replication, and requests are done via Horse 1.0, which is a very unreliable protocol. It has very high latency, and there is a high possibility of bandits, or the messengers might just go astray or lose the information they're carrying. We now have too much data to store in a single shed, so we need to go to a multi-shedded data environment. This allows us to store information differently, but it also means the way we access it is different. Luckily, having Ned's microservice means that this abstraction can be largely hidden from the users, but we do need additional technology to search and collect information later on, which leads to things like MapReduce.

We have primaries and secondaries because of the difficulty of replicating data: having to route a request via the capital to update data in a local city is disproportionately slow. Therefore, it makes sense to have the data primary where it is most likely to be used, and replicate it elsewhere later on. We have replication of data, and we need retransmission, because Horse 1.0 is just a terrible protocol for moving things about.

Fire! Fire! - Some of Our Shed Burned

Then another disaster happened: there was a fire in one of our sheds, and it burned to the ground. If we lose an entire city, we have the problem of primary and secondary election. We have the problem that requests going to the town where the shed burned down need to be rerouted. This is the business case for the first service mesh, because you really do not want every client making requests to your system to have to understand the logic for rerouting to the right place when there is a problem with one cluster of your infrastructure. Having a service mesh in place means that when the city of Bobville burns to the ground, we're able to decide where requests for that information should go. This is built into our microservices layer, so that, again, clients do not need to understand this particular complexity.
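A sketch of the routing decision the mesh makes on the clients' behalf: send each request to the data's home region, and fail over to a surviving replica when that region is down. The region names and replica map are invented for illustration:

```python
# Ordered preference of replicas per data set, plus a set of failed regions.
REPLICAS = {
    "bobville": ["bobville", "capital"],
    "capital": ["capital", "bobville"],
}
DOWN = {"bobville"}  # the shed that burned to the ground

def route(data_home: str) -> str:
    """Pick the first surviving replica for this data set."""
    for region in REPLICAS[data_home]:
        if region not in DOWN:
            return region
    raise RuntimeError("no surviving replica for " + data_home)

print(route("bobville"))  # -> "capital"; the client never sees the fire
```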

Now the Real Problems Start

This is the start of our real problems, because once you go multi-region, everything changes and everything becomes orders of magnitude more difficult. We realize very quickly that the replication, latency, and reliability problems of Horse 1.0 simply make it untenable for running this wide area network. Therefore, we need to deprecate Horse 1.0 and come up with something better. We don't have electricity, so the ability to build any radio or wired communications is limited. Therefore, some kind of semaphore system, where we can relay messages quite quickly along routes by line of sight, is quite important. This is provisionally codenamed Clerks 1.0. We have the problem that we now have multiple primaries all over the place, and there is still the possibility of two people trying to change the same data at the same point in time. This will eventually lead to the development of Paxos. Unfortunately, the latency of anything, including the future version of Clerks, is still going to be too high for Paxos to work. In fact, it's going to be many years later, when Google comes up with Spanner, that somebody is actually able to build that properly.
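A tiny demonstration of the multi-primary problem that consensus protocols like Paxos exist to solve. This is not Paxos; it shows how a naive last-write-wins merge resolves a conflict arbitrarily when two primaries accept writes to the same record:

```python
# Two primaries accept a write for the same record at "the same" time.
write_in_capital = {"key": "bob.age", "value": 41, "timestamp": 100}
write_in_bobville = {"key": "bob.age", "value": 42, "timestamp": 100}  # a tie!

def last_write_wins(a: dict, b: dict) -> dict:
    # With equal timestamps, max() keeps its first argument, so the "winner"
    # depends on call order rather than on any real ordering of events.
    return max(a, b, key=lambda w: w["timestamp"])

print(last_write_wins(write_in_capital, write_in_bobville))
# One update is silently lost, which is why you need a protocol that agrees
# on a single order of changes before applying them.
```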

We have too many sheds. As our data grows, at a certain point it simply becomes impractical to manage it the way we did before. We can keep joining shards together, or sheds together, only up to the point where it just takes too long to process the information. When the ads database at Google grew to a certain size, they added a second shard, and then another. By the time it was finally deprecated in favor of Spanner, there were 130 shards. This meant that any time you wanted to search the data, unless you could do some magic around which shard the data might be living in and didn't have to do any joins across instances, you had to have 130 database connections open, any of which could drop out on you and force you to redo the query from scratch. This was insanely painful. It was one of the motivators for coming up with a notionally shardless interface for storing information, where that complexity was hidden from the users. We also now have the problem that updating both our schema and our microservices has to happen over a much wider area.
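A sketch of what a notionally shardless interface looks like: the store maps each key to a shard deterministically, so callers never pick shards or hold 130 connections themselves. The hashing scheme here is illustrative, not how the ads database actually worked:

```python
import hashlib

SHARD_COUNT = 130  # the ads database eventually reached 130 shards

def shard_for(key: str) -> int:
    """Deterministically map a primary key to a shard."""
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % SHARD_COUNT

class ShardedStore:
    """Callers see one store; the shard choice is hidden inside."""
    def __init__(self):
        self._shards = [dict() for _ in range(SHARD_COUNT)]

    def put(self, key: str, value: object) -> None:
        self._shards[shard_for(key)][key] = value

    def get(self, key: str) -> object | None:
        return self._shards[shard_for(key)].get(key)

store = ShardedStore()
store.put("bob", {"age": 42})
print(store.get("bob"))  # no shard math, no 130 open connections for callers
```

Cross-shard joins still require fanning out behind this interface, but only one component has to carry that complexity.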

When Ned decides that Ned 2.0 is insufficient and starts building more functionality into Ned 3.0, he has to train clerks both in the capital city and in all of the cities where there is either a primary or a secondary data store, so that they're able to give consistent answers. There is the problem that the data is going to be in an inconsistent state, so if you ask a question of one region, you may get a very different answer from a different region. Therefore, we have to be comfortable with the fact that as long as a region is consistent, that is good enough. If you look at Google search, this is exactly what happens. Google runs on many tens of thousands of machines. If you assume that you're going to get the same answer every single time from different regions, then you're building yourself into a hole, because there is no way to replicate information that quickly. You have to accept that an answer that is good enough and useful is better than an answer that is perfect. The problems go on and on. When you go multi-region, you have to understand that you're going to be making tradeoffs which lead to imperfect answers, in the name of getting a useful answer quickly.

Conclusion

As a software engineer building microservices, you need to learn to think about systems in terms of high latency and low reliability. This is particularly difficult because when we're building software, we tend to be building on localhost, which has low latency and very high reliability. The failure modes that are crippling are unlikely to be experienced in the day-to-day development of software. We have no obvious way of testing for them unless we invest the time and effort in building infrastructure that simulates unreliable components, or test cases where components are unreliable. If we don't invest the time in this, then when we push our software to the real world, we will be constantly surprised as it finds new and exciting ways to break, ways which are user impacting and reputation impacting for our employers.
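One cheap way to build that kind of infrastructure is fault injection: wrap calls in simulated latency and failures so the real network's failure modes show up on localhost during tests. A minimal sketch, with the failure rate and delay chosen arbitrarily:

```python
import random
import time

def unreliable(failure_rate: float, max_delay_s: float):
    """Wrap a call with injected latency and failures for testing."""
    def decorate(fn):
        def wrapper(*args, **kwargs):
            time.sleep(random.uniform(0, max_delay_s))  # injected latency
            if random.random() < failure_rate:
                raise TimeoutError(f"injected failure calling {fn.__name__}")
            return fn(*args, **kwargs)
        return wrapper
    return decorate

@unreliable(failure_rate=0.2, max_delay_s=0.05)
def count_people() -> int:
    return 1042

# Callers must now handle timeouts and slowness, just as they would in production.
```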

Successful systems will outgrow their original designs. Part of engineering is the tradeoff between building something useful now, in a sane timescale, and building something which is going to stand the test of time. If we go to either one of those extremes, we wind up with a system which is either prohibitively expensive to change once it becomes useful, or prohibitively expensive to build, because we're trying to account for all future variations. Knowing where our failure modes are, however, means we can start anticipating what abstractions we need to build to simplify things later on. This means we need to invest in clean interfaces. One of the beautiful things about microservices is their ability to put a clean interface around something messy in your infrastructure. The loose coupling they allow enables everybody else around your infrastructure to change, as long as they are willing to obey that particular contract.

If we look at things like protocol buffers, for example, the ability to have optional fields, and for protocol buffers of slightly different versions to remain compatible with each other, means that you don't have to big-bang update all of your infrastructure. It means you can move things progressively, as long as the core contract is still maintained. When you break that core contract, you know you have to come up with a new sub-version of that protocol, because clients are going to break. You need to minimize the cost of change. If I change the RPC spec between two microservices, I need to know as soon as possible, because the more microservices you have, the harder that is to account for. If I build sensible abstractions, then I can hide that cost of change behind an abstraction that nobody else has to worry about.
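A sketch of the compatibility idea, using plain JSON rather than actual protocol buffers: a reader tolerates fields it doesn't know about and defaults fields it's missing, so old and new versions of a message can coexist during a progressive rollout. The field names are invented:

```python
import json

# An "old" client reading a message produced by a "new" server: unknown
# fields are ignored, missing optional fields get defaults.
def read_census_reply(raw: str) -> dict:
    msg = json.loads(raw)
    return {
        "population": msg.get("population", 0),     # the core contract
        "confidence": msg.get("confidence", None),  # optional, added later
    }

new_server_reply = '{"population": 1042, "confidence": 0.9, "region": "capital"}'
old_server_reply = '{"population": 977}'

print(read_census_reply(new_server_reply))  # ignores "region", which it doesn't know
print(read_census_reply(old_server_reply))  # defaults the missing optional field
```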

In the real world, we can't assume that all of our users have good intent. Assuming that people won't try to misuse our system, or that nobody will use it badly because they don't properly understand it, is not an option for us. If our system is public facing and it's possible to make money from it, then people will industrialize the amount of abuse they put through that system once they discover the holes. If our system is internal facing, and a development team isn't quite sure how to use it and is just poking data into it to see how it works, you're going to find yourself with broken data pretty quickly. Therefore, being able to segregate access, and being able to recover from failure caused by both intentional and unintentional bad actors, is critical to maintaining the integrity of the data you're managing with your microservices.

 


Recorded at:

Mar 26, 2021
