Transcript
Nolan: Welcome to essential complexity in systems architecture. I am Laura Nolan. What is essential complexity anyway? I think many of you have probably read Fred Brooks' original No Silver Bullet paper. It's very focused on the complexity of implementation of software logic. Essential complexity, according to Brooks, is anything that's caused directly by the problem that you're trying to solve. Whereas accidental complexity is anything that's related to details of the implementation. If you're building a web application, the actual features and problems that you're trying to solve, the business logic, that's your essential complexity. Your accidental complexity is anything related to details, so anything related to your database, or to your programming languages, or to your frameworks, or to HTTP itself, or to your certificates. All this stuff, accidental complexity.
However, Brooks was writing at a time when computing was a bit more like this. I think this is probably significantly older than the 1980s. However, it's surprisingly difficult to find a good photo of a mainframe on the internet. Please take this as representing big iron. In the 1980s, if you wanted to solve a bigger problem, you would scale vertically. You would get a bigger computer, or you might just run your job for longer. Things were more oriented towards batch, and less oriented towards real-time. Scale looked like that. Now scale looks like running your system across a whole bunch of relatively small servers. That's for a bunch of good reasons. The scale that you want to work at is pretty often larger than the largest single box available to us. We want real time. We don't want batch. Even a lot of data pipeline type applications these days, if not quite real time, don't want things overnight either; they want things within a few minutes, particularly if it's anything numbers-related that customers might see. Anything related to optimizing recommendations or reducing fraud, all these things. We need to be able to split work up and run it across many small servers, and that adds a whole bunch of complexity. It's essential complexity, because these are non-functional requirements. We can't just wave it away and go, it's not directly the business problem that we're trying to solve. This is real stuff that we have to deal with. This matters. It matters for the architectures that we build.
A Problem Statement
Here is maybe the simplest possible problem statement for a distributed system. However, this is actually a real problem. This is a problem that I worked on while I was at Google a number of years back. I was in ads data infrastructure. We had this problem where we needed to join two streams of structured logs based on a shared identifier. You have log A, log B, a shared ID that's common to both, and you want to come up with a log output that has the fields from log A, and the fields from log B, and the shared identifier. We also had an extra constraint, which was that we could only ever join any log entry once at most. We ended up with a pretty complex system out of that simple statement. The reason why is, there's an elephant in the room, and that elephant in the room is scale, because one of those structured log inputs was pretty much all of the ad results that Google search was serving, for example. There's also a cheetah in the room. We wanted this not overnight, not next week, we wanted it within a few minutes. There were good business reasons for that. There's a donkey in the room. The donkey is all about robustness. We wanted this to survive not only the demise of any one single machine, we also wanted it to survive the loss of a data center, which it could in fact do. Finally, there's a hawk in the room. This is a room that's getting very full of different animals, and that is just how practical distributed systems architecture works. The hawk is about being able to observe and understand and debug your system. That's something that we don't necessarily think about at the time we design our systems. I think it's something that we think about later on, when we get to the launch and readiness review phase. I think it's something that we should be thinking about early on. I think that our systems are better for it when we do.
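To make that problem statement concrete, here is a minimal sketch of the logical join on its own, ignoring the scale, latency, and fault-tolerance requirements entirely. The record shapes and field names are hypothetical, not Photon's actual schema.

```python
# Minimal sketch of the logical join, ignoring scale and fault tolerance.
# Record shapes and field names are hypothetical, not Photon's actual schema.

def join_logs(log_a, log_b):
    """Join two iterables of dicts on 'shared_id', joining each ID at most once."""
    b_by_id = {record["shared_id"]: record for record in log_b}
    already_joined = set()
    for a in log_a:
        shared_id = a["shared_id"]
        if shared_id in already_joined or shared_id not in b_by_id:
            continue  # unmatched, or already joined once: skip it
        already_joined.add(shared_id)
        yield {**b_by_id[shared_id], **a}  # fields from B, fields from A, shared ID

# Example: joining two tiny in-memory "logs".
log_a = [{"shared_id": "q1", "click_time": 100}]
log_b = [{"shared_id": "q1", "query": "tasty breakfast"}]
print(list(join_logs(log_a, log_b)))
```

Everything that follows is about what it takes to do essentially this across two data centers, at enormous volume, within minutes.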
Photon: Fault-Tolerant and Scalable Joining of Continuous Data Streams
This is the system architecture that came out of all that. Photon would run in two data centers. It would coordinate its work through this thing called the IdRegistry. This was a strongly consistent datastore that would get updated whenever two log records got joined, so they couldn't be joined twice. This was effectively a transaction that had to be done here. Each pipeline was able to join any pair of logs, but only one would be able to join them, so that the load could slosh between each data center very naturally, very easily. We had four main components here: the IdRegistry; the dispatcher, which would read the smaller log as it came in and attempt to find the matching entry in the larger log; the EventStore, which is basically a key-value store that would index into the query logs, the larger log; and then the joiner, which would do the work and update the IdRegistry. There were a couple of other smaller parts as well that are not shown in this slightly simplified version, but this is the main story. This is not the most complicated system, but it's fairly complicated for something that addresses one simple business problem: join two sets of logs. This is how it goes when you're dealing with scale and when you're dealing with reliability. The predecessor to this system was significantly simpler. It was a single-homed pipeline. It was just a couple of different jobs that would run. The problem with it was it didn't have any robustness. If it failed, or if one of the data centers that you were running it in went down, you had to pick it up by hand, try to get the pipeline into a consistent state, and move it over. That's stressful. There's a lot of potential for things to go wrong. It's quite a manual operation. If you have to do this, somebody has to come along and manually take this action. It's going to take several minutes for that person to come along and do what's necessary, especially if you need to do a little bit of debugging to understand what's going on and whether it's actually the right action. It was a very big improvement in reliability, but also an increase in complexity. Was it worth it? I think so. I remember it well: it was a Sunday morning, and I was sitting there having some nice breakfast. One of my colleagues pinged me. He said, "Laura, when is Photon going to be caught up?" I said, "What do you mean, when is Photon going to be caught up? There's nothing wrong with Photon." He said, "One of the data centers is down." I went and I looked, and sure enough, yes, one of the data centers was down, but all the load had just switched over as designed. We did SLO-based monitoring. We weren't responsible for the data center, a different team was responsible for that, so we wouldn't get paged just because the data center was unavailable. We got paged when our system was breaching its SLOs, and it was well within them. I guess the moral of the story here is, get your essential complexity right and relax with the tasty breakfast, so you don't have to spend your weekends firefighting. It generally was a revolution compared to the previous system. It was simpler and less stressful to run.
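Here is a hedged sketch of the join-at-most-once idea: before emitting a joined record, a joiner tries to claim the shared ID in a strongly consistent store, and only the winner emits. The `InMemoryRegistry` below is invented for illustration; it stands in for whatever transactional or compare-and-set primitive the real IdRegistry provides, not its actual API.

```python
# Hypothetical sketch of at-most-once joining via a claim in a strongly
# consistent store. InMemoryRegistry is a toy stand-in for the IdRegistry;
# the real thing is a replicated datastore shared by both data centers.

class InMemoryRegistry:
    def __init__(self):
        self._claimed = set()

    def claim(self, shared_id: str) -> bool:
        # In the real system this is a transaction against a replicated,
        # strongly consistent store, not a local set.
        if shared_id in self._claimed:
            return False
        self._claimed.add(shared_id)
        return True

def join_once(registry, shared_id, record_a, record_b, emit):
    """Emit the joined record only if this pipeline wins the claim on the ID."""
    if not registry.claim(shared_id):
        return False  # the other data center's pipeline already joined this ID
    emit({"shared_id": shared_id, **record_a, **record_b})
    return True
```

Because either pipeline can win the claim but only one ever does, load can move between data centers without any record being joined twice.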
How Software Operations Work Scales
This comes to a question of how software operations work scales, because a lot of people think that software operations work scales according to the number of different jobs that you have to run, so the number of different binaries. It does to a degree, but there's a little bit more to the story than that. It's true enough that every binary you run needs build and deploy, needs configuration, it needs monitoring, and it needs people who understand it well enough to fix it when it breaks. You don't really care about any one binary, you care about your system as a whole and whether or not it's doing the thing for your business that you intend it to do. We shouldn't be thinking about people understanding things at the level of the binary, but at the level of the whole system that it sits in. I think that's something that becomes important.
Running 1-10 Copies of the Same Code
When you're running a fairly small number of copies of the same code, you have those same concerns we saw on the previous slide. You need to know, is it working correctly? Can I roll it out? Can I roll it back? Can I understand what it's doing? Can I scale it? Although, if you're running up to 10 copies, you probably don't need to scale up much. If something goes wrong, it's small enough that you can probably SSH in and figure it out there. At this scale, the operational work is more or less proportional to the number of binaries.
Running 100-1000 Copies of the Same Code
If I'm running a lot more copies of the same code, things get a little bit more interesting. I need to go to some extra effort here to make sure that I won't overwhelm the backends that I depend on. It's this phenomenon called the laser of death, where I scale up my service, and now I've taken down the things that I depend on. You don't want to do that. You might need to start adding caching very soon for this. You might need to think really carefully about circuit breakers to make sure that you don't overwhelm your backends. Another thing that you should think about: if your system gets a bit bigger, it's probably more utilized, so you need to think about maybe isolation between user requests. If you've got 100 or 1000 copies of something, it's probably a shared service, so you need to think about that. You need to think about hotspotting. Whatever is load balancing to my service, is it doing a good job at spreading that load? If the load that any piece of the system is getting is based on the data that it's serving, do I have ways of splitting that up? Do I have ways of replicating it? Where things also start to get interesting is if something goes wrong, you can't really just SSH into a random instance anymore, because it might be fine. You need significantly better visibility into what your system is doing, where the load is, where errors are, where saturation is. This starts to become very important.
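As an illustration of the circuit-breaker point, here is a minimal sketch: stop calling a backend that keeps failing, then let a probe through after a cooldown. The thresholds and the interface are made up for this example; in practice you would more likely reach for an existing library or a service-mesh feature than hand-roll this.

```python
# Minimal circuit breaker sketch: fail fast instead of hammering a struggling
# backend, and probe again after a cooldown. Thresholds are illustrative only.
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_after_seconds=30.0):
        self.failure_threshold = failure_threshold
        self.reset_after_seconds = reset_after_seconds
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed (calls allowed)

    def call(self, backend_fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after_seconds:
                raise RuntimeError("circuit open: failing fast without calling backend")
            # Cooldown elapsed: close the circuit and let a probe request through.
            self.opened_at = None
            self.failures = 0
        try:
            result = backend_fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result
```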
Running 1000 to 10,000 Copies of the Same Code
If I'm running 1000 to 10,000 copies of the same code, optimizing is very important. We're starting to get into things that have a significant impact on the environment, in terms of energy use, and cost. Cost is part of this too. You want to start monitoring for regressions. You want to think about actually exceeding the capacity of the entire data center you're in. I have seen this happen before. I've also seen it prevented from happening. If you need to back up data that spans across that many instances, it can get very challenging as well, in terms of bandwidth. If something goes wrong, there's a significantly higher chance that it's a complex systems behavior problem, or some emergent property. This is where you have a bunch of things that should be acting independently somehow ending up synchronizing their behavior. Think about cases where you have a bunch of services that need to talk to the same thing, and it goes down for a little while. They're all retrying. Then it comes back up, and they all instantly query it. That's one of those laser of death type problems. You can get all these sorts of cascading behaviors, thundering herds, gossip storms in peer-to-peer type systems, these kinds of quite hard to diagnose complex systems behavior problems. You need to understand the system, and how information is flowing in the whole system, not just bad behavior on a single host.
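One standard way to break up that kind of accidental synchronization is retrying with exponential backoff and jitter, so a fleet of clients recovering at the same moment doesn't hit the shared dependency in lockstep. A minimal sketch, with illustrative parameters:

```python
# Sketch of retries with exponential backoff and full jitter. The randomness
# spreads out retries so thousands of clients don't all hit a recovering
# backend in the same instant. Parameters here are illustrative only.
import random
import time

def call_with_backoff(fn, max_attempts=6, base_delay=0.1, max_delay=10.0):
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller handle it
            cap = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, cap))  # full jitter up to the cap
```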
Architecture Considerations and Tradeoffs
All this is to say, if you want to build a skyscraper, you need the techniques for building a skyscraper. You don't just take a thatched cottage and build it super tall. There is a middle ground. You cannot scale small-scale system techniques indefinitely. You have to start spreading out that architecture in the same way that we saw with the Photon pipeline. The considerations and tradeoffs and tasks that you have when you need to scale up or build a distributed system like this: how do I split my work and my data? All distributed systems need to do this in some fashion. Maybe it's very simple. Maybe it's a very stateless system, and all I need to do is run n copies of it and load balance between them. When you start getting into data intensive systems, it gets more tricky. How do I coordinate state? Even if that's just as simple as, how many of my nodes are up and healthy, and which ones are they? How do I manage failure or failover? How do I scale up? What's the next bottleneck I'm going to hit in terms of my total system capacity? Am I efficient? I've seen cases where a large fleet of instances were able to save 50% of the RAM they were using because they had some giant hash table they weren't even using. You need to start thinking about this stuff. Then there's questions of understandability and predictability. Is the system designed with predictable, manageable, understandable flows and behavior between components, or is it something that tends more towards these emergent type properties?
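On the "how do I split my work and my data" question, the simplest possible answer is hashing a key modulo the number of nodes. The sketch below shows that approach and why it gets painful as you grow: changing the node count remaps nearly every key. The numbers are purely illustrative; consistent hashing, which comes up later with Dynamo, is one way out of this.

```python
# Simplest data splitting: hash the key modulo the number of nodes. The catch
# is that changing the node count remaps almost every key, which is the
# problem that consistent hashing (discussed later with Dynamo) addresses.
import hashlib

def owner(key: str, num_nodes: int) -> int:
    digest = hashlib.sha256(key.encode()).hexdigest()
    return int(digest, 16) % num_nodes

keys = [f"user-{i}" for i in range(10_000)]
before = {k: owner(k, 10) for k in keys}
after = {k: owner(k, 11) for k in keys}  # grow the fleet by one node
moved = sum(1 for k in keys if before[k] != after[k])
print(f"{moved / len(keys):.0%} of keys moved after adding one node")  # roughly 90%
```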
Two Competing Styles of System Architecture
I believe that there are two main competing styles of distributed systems architecture. I call the first one command and control, and the second one peer-to-peer. With a command and control architecture, you have some distinct control component. You scale by adding hierarchy, maybe caching proxies. An example of this is Google's Bigtable key-value store. Then there's peer-to-peer. You've got coordination performed by the worker nodes; you don't have any nice, special controller node. You split up ranges using techniques like consistent hashing. You can coordinate by using peer-to-peer protocols. An example of that is Amazon's Dynamo key-value store. You have two key-value stores, both of these designed in the 2000s. I think both the papers came out in roughly 2006, 2007, doing something that's similar, although Dynamo is eventually consistent, whereas Bigtable is strongly consistent. They're still pretty similar problems with radically different solutions.
Bigtable and Dynamo Overview
Bigtable has a controller; it's called the Bigtable controller. It assigns tablets to tablet servers. Tablet servers are the ones that actually run the queries and manage the data. They store data on another system called GFS, which later became Colossus. The controllers do the health checking. They use a lock service to manage the controller failover. The original Bigtable paper used a different term, but I'm using controller because the older term is no longer considered ok. Actually, controller is more descriptive. Dynamo doesn't have any controller or any distinguished node. It gossips between the nodes for failure detection. It uses consistent hashing, which is a really good technique for being able to incrementally add capacity into a system without having to move everything else around. It's really useful in distributed systems, and this is a great use of it. It uses a decentralized synchronization protocol to try to keep everything synced up over time.
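Since consistent hashing is the technique that lets Dynamo add capacity incrementally, here is a minimal hash-ring sketch. It leaves out virtual nodes and replication, which a real Dynamo-style system needs, so treat it as an illustration of the idea rather than Dynamo's implementation.

```python
# Minimal consistent hashing ring: each key belongs to the first node clockwise
# from its hash, so adding a node only moves the keys on one arc of the ring.
# Virtual nodes and replication, which Dynamo uses, are omitted for brevity.
import bisect
import hashlib

def _hash(value: str) -> int:
    return int(hashlib.sha256(value.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, nodes):
        self._ring = sorted((_hash(node), node) for node in nodes)
        self._points = [point for point, _ in self._ring]

    def owner(self, key: str) -> str:
        # Find the first node position clockwise from the key's hash, wrapping
        # around to the start of the ring if necessary.
        idx = bisect.bisect(self._points, _hash(key)) % len(self._ring)
        return self._ring[idx][1]

ring = HashRing(["node-a", "node-b", "node-c"])
print(ring.owner("user-42"))  # adding "node-d" later only relocates one arc's keys
```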
Bigtable and Dynamo Operational Characteristics
In Bigtable, you have one system state, and it's managed by the Bigtable controller. If your Bigtable controller fails, you're going to have potentially a little bit of unavailability for some seconds, while the next controller comes up. It'll need to get a lock in the Chubby consensus database. Your cell size is going to be limited, because eventually you're going to get to so many tablets and so many tablet servers that your controller is going to get overwhelmed, so there is a limitation here. Dynamo doesn't really have a global view of system state. It's one of those things that's constantly converging, maybe a little bit like running a network routing protocol like BGP or something. You can look at any one node and you'll see its view of the state, but there's no one canonical view. It also means that the loss of any one node is not going to have disproportionate impact on the system, so all the data should be replicated to secondary nodes, and my query should fail over. On system size limits: the system doesn't become infinitely scalable just because you don't have the controller bottleneck, because each node still has to have the full routing table. The way I understand the routing works is that clients contact an arbitrary Dynamo node and then can get proxied or rerouted over to where the actual key that they want to read is. You can address both of these scalability limitations in different ways, by segmenting your data, and segmenting the clusters, adding hierarchy. You can totally do this. It should also be said that I'm working from very old versions of these papers. The systems are quite different now, I think. They're still really interesting, reasonably simple illustrations of these different architectural styles.
Maintainability and Operability
In terms of operability, in Bigtable, if you're looking at any coordination problems, the controller is the place to look. It gives you one place to look, an authoritative view of the state of the system, and one place where that coordination behavior is happening. If you need to change something, it's entirely possible that you'll be able to apply some configuration there rather than have to change the entire system in order to influence it. Whereas Dynamo is the opposite: there's no single point for monitoring or for control. For monitoring, you're going to have to look at properties across the entire system. In Bigtable, for example, if you want to know about availability of your tablets, you can in principle observe that from the controller. If you want to know it from Dynamo, you're going to have to collect statistics very widely. You might even need to rely on some probing rather than metrics that your system is going to give you itself. With Bigtable, you've got to run two types of job, and with Dynamo, one type, but that one type is more complicated. It's going to be a more complicated code base, and it's more complicated behavior. You've got these peer-to-peer communication flows, and particularly in outages or times when things are degraded, you could find that your synchronization data flows could blow up. You could end up with a lot of the time being spent on those sorts of background activities if you aren't careful. That's something that can also be harder to understand and harder to control if it goes wrong. Bigtable as well, it should be said, does less work because it uses external infrastructure, namely Chubby, the coordination and lock service, and also the GFS file system. That's really interesting, because instead of having to manage every single thing, in particular a lot of spinning rust, they just manage the query logic. They're able to leverage the expertise of those teams that run that shared infrastructure. Whereas Dynamo, as far as I can tell, is pretty much all in-house. The styles are a little bit like this: on the one hand, a very coordinated, quite hierarchical thing, where you have a controller who is able to influence the system state and whom you as an operator can influence if necessary; whereas in the peer-to-peer style, people just turn up, grab a pint, grab a musical instrument, and dig in. It's a very different world, and it's a very different style of system to manage and to operate.
Architecture Considerations and Tradeoffs
You can do both. Both of these kinds of systems absolutely work, and you can absolutely run them. I do think that you need to think a little bit harder about understandability and predictability. How do I measure and monitor things in the peer-to-peer systems? You need to be extremely careful with controlling potentially emergent system behaviors related to communication between nodes. Make sure that you don't get overwhelming amounts of messages, particularly in error or failure situations. You need to do this as well in your more controller-based systems, but understanding how those control flow messages work there is a little bit easier. It's a little bit easier to predict. The controller gives you a place where you can potentially make changes and influence the system more easily than when you have to potentially roll out binary changes to hundreds of thousands of nodes. I think there are operability differences between these different styles.