Facilitating the Spread of Knowledge and Innovation in Professional Software Development

Write for InfoQ


Choose your language

InfoQ Homepage Presentations 24/7 State Replication

24/7 State Replication



Todd Montgomery discusses lessons learned in designing systems, especially those based on replicated state machines, that need to continue operating.when things go wrong.


Todd Montgomery is a networking hacker who has researched, designed, and built numerous protocols, messaging-oriented middleware systems, and real-time data systems, done research for NASA, contributed to the IETF and IEEE, and co-founded two startups. He currently works as an Engineering Fellow at Adaptive Financial Consulting and is active in several open source projects.

About the conference

QCon Plus is a virtual conference for senior software engineers and architects that covers the trends, best practices, and solutions leveraged by the world's most innovative software organizations.


Montgomery: In this session, we'll talk about 24/7 state replication. My name is Todd Montgomery. I've been working around network protocols since the '90s. My background is in network protocols, and especially as network protocols, and networking, and communications apply to trading systems in the financial world. I've worked on lots of different types of systems. I spent some time working with systems at NASA early on in my career, which was very formative. I have also worked on gaming and a lot of other types of systems. I'll try to distill some of the things that I've noticed especially over the last few years, with systems that are in the trading space, around 24/7 operation.

What's Hard About 24/7?

What's so hard about 24/7? If you probably are doing something in retail, or you're doing something which is consumer facing, you probably are wondering why we're even talking about something like 24/7. Don't most systems operate 24/7? The answer to that is many do, but they're not the same. Financial systems almost all in trading do not operate 24/7. They have periods where they operate, and then periods where they do not. The answer is, there's nothing hard about it, except when the SLAs are very tight. The SLAs need to be met contractually. The fines that can be leveraged on an exchange, for example, can be extremely penalizing, if they are unavailable during trading hours. It's not simply that it's ok to take 10 seconds, for example. Ten seconds is an eternity for most algorithmic trading. Components need to be upgraded, often. Upgrading components should not feel like it is a chore, it should be something that should be done often. It should be something that should be done continuously. It does not stop. This is hard for a system that has to operate in a way that it is always available. It's always providing its SLAs even when it's being updated. Transactions must be processed in real time. Almost all systems have that requirement if they're doing something that involves business. It's not always the case when you are doing things that involve consumers, for example. These all do have some differences.

When we think about this, we also have to think about disaster recovery. In fact, we'll talk a lot about disaster recovery. Disaster recovery is very important for applications because of the cost of losing some transactions or data, or things like that. This is not the same for all businesses. Not all businesses are equal in this regard. We're going to actually use one term, but I wanted to bring two in. Recovery time objective is one thing we will look at. That gives us an idea about how to return to operation looks like. Recovery point objective is also a term that is also used also with this in terms of data protection. We're not going to actually talk about recovery point objective. It's its own point that we could be making about it. When we're really talking about 24/7, we're really talking about business continuity and business impact. Just being able to continue operation when bad things happen that we're not planning for is a business differentiator. Also in trading, it can be something that is table stakes. An exchange, for example, in the EU, and the UK, systems that offer trading services have to be available and have to have a certain minimum capacity to operate in a disaster recovery case and be able to resume operation and be able to demonstrate it. Those are not things that are easy to do, most of the time.


Hopefully, thinking of it now, that maybe it's not always the case that this is easy. Maybe it's not always the case that it's something that is just kind of, we're already doing that. This can be difficult. We're going to break this down into a couple different areas. The first we're going to talk about is fault tolerance, because the fault tolerance model that is in use for an application has a big determination in terms of how 24/7 is done. Disaster recovery as well because that is also part of it. How fast can you return to operation, when you have to come and recover from a disaster? Also, upgrades. Components also need to be upgraded so you're operating in an environment where sometimes things are just not on the same version. Hopefully, we'll have some takeaways from this.

Fault Tolerance

First, fault tolerance. Providing fault tolerance is not difficult. There are lots of ways to do it. In fact, you could have a piece of hardware with a service running on it, and a client, and they're operating just fine. You restart the service when it goes down, everything's happy. Your first step in making this a little bit more fault tolerant in just restarting is you may put it on the cloud. That's perfectly fine. That allows you to provide you with some easier mechanisms of restarting services or be able to migrate services a little bit easier, providing you some protection of data in the event of things not going very well. You can also duplicate it and shard. You could partition your dataset so that things are handled by specific servers. This provides you with some forms of fault tolerance. It does provide you with a little bit of things there to think about. For example, now you start to have clients that could connect to all of them, or one of them, or route to them. You have to think about that. That's where we start to really see systems that operate in some cases just like this. There's nothing wrong with that. In fact, there's a lot of systems that work very well in this regard. When the problem domain fits into that, this is a perfectly fine model. It does have one potential problem that is tough to deal with, and that is where all of the problems start to occur. It's the state that is in these services. A lot of times what we do is kick the can down the road. We use something else to hold the state. A database or a cache or something else that is pushed away. Essentially, it is still the same problem. That state is really what we need to be fault tolerant. That is fault tolerance of state. That state that we have needs to be tolerant to losing members, services, machines, everything.

When you start looking at this, there is a continuum between partitioning that state for scaling and replicating that state for fault tolerant operation, where it is replicated in several areas. There's lots of ways to address this. Lots of systems do address this in different ways. We're going to talk about one specific one, because it is a semantic that has become very popular within certain types of problems, many of the ones that I work with in trading, and also out of trading. That is the idea of having a model where you have a contiguous log of data, or log events, with the idea of a snapshot and replay when you restart. In other words, you have a continuous log of events, you can periodically take snapshots. You can then start by rehydrating from the log, the previous state, or loading a snapshot and then replaying as well. Let's take a look at this in a little bit more detail.

Let's say we have a sequence of events. This could be a long sequence of events. It really doesn't matter. We do have some state that our service goes through from one event to the next. It's compounding the state. There might be a particular point in this state where a lot of the previous state just collapses, because it's no longer relevant. Think about if you're trying to sell a particular item and you sell out of it, while the current state, once you sold out of it, you don't have any more to sell. Why keep it around? That idea also applies to things like trading, like an exchange. You sell a commodity. You sell all that is available. You don't have to keep track of what is currently available to sell because there are no more sellers in the market. All the buyers have bought up everything from the sellers at any particular time. Think of it as your active state. Your history is not part of that. It's the active state of what is going on. At some point, let's say after event 4 here, you took a snapshot of that state. That's your active working set, you take a snapshot of it. Now we're going to use the term snapshot, but a checkpoint, or various things, there's all kinds of different ways to think about this. The idea is that you roll up the current state into something which can then be rehydrated. You would save that, and then the events that happened after that. The idea here is that you would have a contiguous log of events, state that is associated with that as it moves along. You can be able to equate the left and the right here with the idea of having a snapshot that rolls all that previous activity up into the current active state. The left and the right here should have the same state, at the end. They should have the same state at each individual place in that log.

This allows us to build basically clustered services. Now we can have services that are doing the same thing, from the same log of events, the same set of snapshots, everything. In fact, they're replicas, this is called a replicated state machine. These replicas, if you had the same event log, if you had the same input ordering. In other words, one event comes out right after another event, and that's always the case, that they don't just suddenly flip, and you can keep those logs locally, you now have a system which can be deterministic. Everything allows that to happen now. You can have a system that is very deterministic. You can play this forward. You can restart it. It should always come back to the same state. That's an interesting thing, because that allows you to now have an idea that it is deterministic. The system, given the same set of input will always result in the same state. Those checkpoints, or those snapshots are really events in the log, if you think about it. They roll up the previous log events in the idea that they just roll up the previous state. Some of that state just absolutely collapses, because it's no longer relevant. The active state can be saved, loaded, stored just like anything else, because it is just a set of data. Then each event is a piece of data as well. There are a couple different things here.

What about consensus? When do we process something? If we've got these multiple ones running about these replicas, when is it safe to consume a log event? The idea is that log event then reaches some consensus. We actually have lots of different protocols, but the best one to use for this that I've seen in many others is the idea of Raft. Raft is well known. It's used in many systems. It's implemented in many systems. The idea is that an event must be recorded and stable at a majority of replicas before it's being consumed by any replica. What this gives us is the ability to basically have that durable. It's durable. It's in storage. Now, if you lose, let's say you have three members, that majority comes into play. Once it's on one, that's not good enough, it's got to be a two. Once it reaches two, then it can be processed. That's a very key thing. You can lose one member, and continue on just fine, because another member has it. This works for five, works for all the numbers. A clear majority of those numbers having that reaches consensus. The idea is you're using consensus to help you with that. If you now have a system that is a set of nodes, cluster, and they're replicas and they're all processing the same log, now you have the ability to lose a member and keep going, or lose two members, if you're running five, or if you're running seven, lose three, and just continue on as if nothing had happened.

Disaster Recovery

There are some things in there to think about, and they all have an impact on what we're talking about, which is 24/7. I want to talk about a couple specific ones. The first thing is disaster recovery, because this is very much something that is always on the mind of system developers. Disaster recovery. We're not going to talk about recovery point objective, RPO. It's relevant, but it's a little different. It is not the main thing we want to talk about. What we really want to talk about here is RTO, recovery time objective, because it's very relevant when you're talking about 24/7, and you have tight SLAs. Where does disaster recovery come into this picture? We've got services. We've got archives. We got multiple machines that are all doing the same thing. We're able to recover very quickly when things happen. In fact, we may not even notice, and just continue on. If you think about this, those machines have an archive. Those archives are then being used to store snapshots. They have the log. If you lazily disseminate those snapshots to DR, and you stream, when each individual message in a stream reaches consensus, you stream it to DR. Now you have a way of actually recovering, because now you've got that log and the snapshots and you can recover your state. None of this should be new. The systems we have work this way. I'm trying to break it down so that we can actually see some of the dynamics at play and what 24/7 really means in those cases. If we think about this, that dissemination of the snapshots and the log and the log events, helps us to actually then understand what is actually the problem that we want to solve, to make it so that we can even in disaster recovery cases be able to meet our SLAs, or at least get close to being there.

It breaks down into two things. One, is the disaster recovery cold? In other words, it's just a backup, and it's just on disk. Or, is it warm? Is there some logic that's running there? Let's take the first idea of it being cold. This means basically, you have a cold replicated state machine for your recovery. In other words, it's not running yet. Everything is just being saved, just ready for backup. That means when it becomes hot, not warm, but when it becomes hot, because it now becomes active, you have to first load the snapshot and then replay the log. Let's break that down. This picture was what we saw before earlier. The snapshot takes some time to load, and then the replay of the log takes some time. Thus we end up basically having a particular time that it takes us to recover. If we break that down, the time to load the snapshot, and the time to replay the log, and the time to recover, they have an interesting association with one another. The first thing is, the time to load a snapshot is usually small, because the snapshots are usually just a serialized version of current state. They're not a history. In fact, that's the worst thing you could do to put into them would be history. What you really want is just the active state that you need to maintain. The time to replay the log is going to be determined by probably the length of the log. Yes, the length of the log is going to be the thing that has the biggest impact, and how complex it is to actually process the log. That can be large.

In fact, if you have to have a log that is 24 hours old, let's say you're snapshotting every 24 hours, the worst case is going to be you have a 23:59:59.99999 log that you have to replay. Some systems that are doing millions of transactions a second, and may have trillions of events in their log from the last snapshot, that can take a while to go through. It's not so much loading that data from disk or loading it from some storage, it's usually the fact that the processing of that log can take a very long time. That is something that you have to always consider is what happens if it takes a long time to process that log? Seeing systems where the processing of a log after that can take hours. That's hours before that is ready to go from disaster recovery to become something that is done right now.

How do we ameliorate that? How do we actually make it better so that the loading of a member going from cold to hot is faster? Reduce the size of the log that you have to replay. How you do that, snapshot more often. Instead of doing every 24 hours, you do it every hour. That means you have an hour's worth log, that you may be able to run through in a few seconds, or a minute, or a couple minutes, or 10 minutes. That may be acceptable. If you need to snapshot more often, that just means that you have a smaller log to replay. That's good. That's your main control, in that case. You don't have a lot more else that you can do immediately, but it does come at a cost. Taking a snapshot may not be something you can do very quickly. It may be something that takes a period of time to have that snapshot be actually done. In Raft, a snapshot is an optimization. You're taking it and you're trying to make it so that it's an optimization. You could always have just replayed the log from scratch. The idea of that snapshot, you may want to think about it as it might have a disruptive influence on your machine, on all the replicas. In fact, the way that a lot of systems do this is a snapshot is an event in the log, all the replicas take a snapshot at the same time. That simplifies some of the operation, but it also introduces the fact that when you take a snapshot it better be quick because it will disrupt things that are behind it that'll have to wait. What happens if you were to use the idea of taking snapshots asynchronously outside of the cluster and in disseminating them out of band back into the cluster? That could be a way to minimize that impact. It doesn't really change the recovery of those, but it does make it so that it should be a little bit less disruptive to take an actual snapshot.

Let's think about also this in a different way. What if we were to look at it from a warm standby perspective? Instead of it being cold, and just thinking of it as a backup, we think about it more like the service replica is running in DR, and it's ready. It's hot if it needs to be. It just has to be told to go hot. That is actually another way to handle this. Instead of it having being cold, and it just being a backup, you could just be streaming those snapshots, and that log, and have a service replica just taking and doing exactly what a normal service would do. This is the power when it's deterministic. Because the service, the same one it's running in the cluster could be running in your DR. It's hot at the drop of a dime. It's sitting there just warm, and you can light a fire under it at any time. It does not need to be active in the cluster, it just has to be there getting the same log, getting the same snapshots, doing the same logic. That means that recovering from a disaster can be just the matter of making that go from warm to hot. That allows SLAs to be met.

24/7 Upgrades

That's how you can think about potentially making 24/7 in a SLA demanding type of environment, work really well within some systems that allow them to use this model. That's only the beginning of a lot of different pieces. The other pieces I want to talk about that is normally a problem for a lot of systems is really more along continuous delivery, and 24/7 upgrades. How to do upgrades within an atmosphere like this. The first thing is, there is no big bang, forklift, stop the world upgrade, because that is too much of a disruption to all SLAs. Blows them out of the water. You're never going to be able to do it. It's never going to be possible to go ahead and do that in any real system that needs to be 24/7 and has tight SLA. That's not something you can really rely on. When you start to think about that, the implication of that is, since you can't upgrade everything at once, you're going to have to live with the fact that you're going to have to upgrade things piecemeal. That means that components will be on varying versions. They're going to be on varying versions of an operating system. They're going to be on varying versions of an OS. They're going to be on varying versions of software that's outside of your system. Your own system will be on varying versions. It's just a fact.

That gets us down into actually an area that I come from, which is protocol design. Protocols have to work in ways that mean that they can operate with varying component versions of various pieces, they just have to. How do you make them work in those situations. I'm going to take a lot of inspiration from protocol design here. I suggest you also take a look at this because, in my experience, this is the hardest part of making something where you have the ability to upgrade things 24/7 at the drop of a hat. Good protocol design is a part of that. It's a very big part of that. The first thing is backwards compatibility, table stakes. You have to be backwards compatible. You have to make it so that new versions won't necessarily break old versions. You have to figure out how to do that in the best way while moving things forward. That's not always easy. It gets a little bit easier when you start to think about how to do forward compatibility. What does that mean? It means that version everything: your messages, your events, your data, everything should have a version. It should have a version field, whether it be a set of bits, or whether it be a string, or something else. There has to be a way to look at a piece of data that's on a wire or a piece of data in a database or a piece of data that is in memory and be able to look at it and go, that's the version for that.

Not all versioning schemes are the same. They all break down in certain ways. I want to suggest you take a look at something called semantic versioning. Semantic versioning, there's actually a very nice web page that goes into the idea behind semantic versioning. What it does at its heart, is it gives you some definitions for MAJOR, MINOR, and PATCH that are useful for a lot of systems. You don't have to use this for messages. You don't have to use this for data. I strongly suggest looking at it for that. A lot of times it's looked at for versions of whole packages, things like that. Think about it from the protocol perspective too, the messages, and the data, and everything else. Give it a version number, MAJOR.MINOR.PATCH, increment the major version when you make incompatible API changes. Version two is incompatible with version one. That's an interesting thing there, the API. Minor version when you add functionality in a backwards compatible way. Minor version can change as long as it is backwards compatible, and it's adding functionality, not changing it. Patches when you actually make a backwards compatible bug fix. That's the big-ticket item. There are some details here that I suggest you look at.

Additional labels like pre-release, build metadata, all those are extensions to that format, but the heart of it is those particular types of things. Now we have a way to think about what happens if we introduce an incompatibility, should be a major version change. Minor if it's compatible, but you've added some functionality, and patch if you've just added in some fixes. That's very interesting. We have some semantics that you can hold your hat on.

Let's take a look at this example applied to that log idea. If all of our events or messages are versioned, we can see a sequence of version changes, 1.2.0, 1.2.0, 1.2.1, 1.3.0, 1.3.0, 2.1.5. That was a big change right there. What happened to that? What happens to that if an old version of a service, say for example, sees that, does it ignore it? Does it try to struggle on? Does it just let an exception happen if that's ok? Does it terminate? What should it do? There's no hard and fast rule here. Your system probably has some impact. For all those actions, there's going to be an impact. In the case of replicated state machines, termination may be the best option, because you might want to just terminate, and then continue on. If you're upgrading each individual member of a cluster, you probably want to upgrade one, upgrade the second, and upgrade the third for a three-node cluster, and then start using a new format after that, after all those have been upgraded. What happens if you're doing that you forget to update one. You just have it terminate. That might be the best thing in the world as opposed to corrupting state somewhere. That can be the catch if something doesn't happen.

With large network systems, what could possibly go wrong? There might be not only this, we may need to have a little bit more information here. For example, let's think about what we were to do if we were to add a couple more bits here. Let's borrow something from protocol design. TCP itself has a set of TCP options, and some of them have been added over time. They thought upfront about how to actually add them. They have a bit and other protocols besides TCP do this. It's probably the best example, at the moment, that comes to mind. It has an ignore bit. All the options have what's called an ignore bit. Instead of it being version, it's type, so the type of option. If it doesn't understand the type, and you receive a TCP message, you look at the I bit, that can be ignored. If the I bit is set, then you can continue processing the rest of the options, and you just ignore that option. If it's set to zero, it can't be ignored and you have to actually drop the entire message, the entire TCP frame. This is forward compatibility at its best, in my mind. You're actually thinking, I may come into a situation where I have to throw away this whole message, or I can just ignore this particular piece of it. That's thinking about things upfront, and then you have that ability.

Let's add the ability to terminate to that. You have maybe a bit that says, can be ignored. Can it be ignored? Yes, this particular piece, whether it be a whole message or part of a message, yes, it can be ignored. You might have an additional piece in here that says, should I terminate if I receive this, and I don't know what to do? Because the answer might be true. This is how you can think about it. You don't have to think about everything upfront, but you can start to think about an additional way to have some additional pieces of information that give you an idea of how to handle these situations. When you set them up beforehand, you now have logic that can be exercised when it needs to be exercised. I'm not saying termination is the best thing. I'm not saying that dropping the message is the right thing. It may be the best thing to just let an exception occur, and just then have it logged, and then just continue on. Or it might be, have that exception recorded, logged, and then have the thing terminate. You may have a set, this is what we do when the unexpected happens in our system. I'm just saying you can give yourself a little bit more freedom and a little bit more ability to handle the unexpected when you look at that.

Feature flags are an example of things you need to think about for upgrades. Often, feature flags, in a lot of instances are tied to a version of a protocol. The hard thing there is you have to have a new version to understand the new feature flags. That hurts. That can be tough, because it ends up in a chicken and egg effect. We have to upgrade the protocol so that we have new feature flags, but we can't actually have the feature flag because we have to actually update the protocol. There is a dependency there that you want. If you version them separately, if you version the feature flags separately, you actually get some interesting effects. If you can be able to push out a new set of feature flags, then push out a new version of a protocol, that's actually pretty powerful. Now you have some ability to not have to do the opposite. I also think that feature flags, they are not always a bit true or false. They often have values. They often have meaning. They also have other things with them. Having them in a self-describing format is actually fairly useful. Some of the things that you might think about from a self-describing format are things like, is it actually going to be an integer? Is it unsigned? Is it signed? How big is the integer? How many bits is it, in other words? Those are things that I think about because I'm a protocol designer. If you start to think about this and step back for a moment, and think about it from a usability standpoint, knowing that something is a integer versus something which is just a string, gives you some ability to even think about these things a little bit more. When they're self-describing, they do open up some doors.

I want you to think about that for a moment. Let's go back to having a set of messages and having them in different versions. Interesting thing about this is that log now has distinct events of protocol change in it. That's interesting. That's really powerful to be able to look at and see the sequence of when, in operation, protocols changed through a system. I think that is neat anyway. What it comes down to with upgrades is thinking about things not as being static and thinking about things as just a big upgrade. You have to start thinking about these things as your system, it has varying versions of things going on. Not everything is going to be on the same version. You may try to minimize the time that they're on different versions, but you'll never get rid of it. Unless you can actually upgrade everything all at the same time. Which is not going to work really for 24/7 types of upgrades.


I really did just hit on the biggest things that I've seen over the years. I've seen a lot of people with a lot of systems try to skip over some of the versioning things that I mentioned. I think one of the things that I've noticed repeatedly is there is no shortcuts here. You actually have to sit down and think, how does older versions handle these things? What can we do? How can we minimize the disruption when doing upgrades? This goes well beyond just the 24/7 idea. I've noticed a lot of microservices architectures have the typical thing of, you put in a microservice architecture in place, you had no versioning of your protocol or data. Then you've realized, now we have to add it in, because we're running different versions all over the place that speak various things and we have to handle these things as piecemeal. That I've seen a few times now of just looking back on it going, it's really hard to shoehorn that in later. It's really hard to add versions later. I and some of the other people that I work with are no exception to this. We added a protocol that had no versioning at the application layer, to exactly these types of systems, and we've had to go back and add it. We were very careful with it.

Questions and Answers

Reisz: It's a fine line because you don't want to prematurely optimize something you need to get to product market fit, before you start adding any boilerplate, any gold plating. It's tough sometimes to make all the right calls, so a talk like this really helps you think about some of these things.

Montgomery: That's been always my goal to just give some ideas of things for people to think about and adopt what you like. If it doesn't matter to you, then just forget about it.

Reisz: Honestly, it's what I liked about this talk, because there's a lot of things to think about. It's like almost a checklist. These are some things to think about when you want to run a high resilience system like this. Speaking of that, we did the panel on microservices in San Francisco, you were on it. One of the things that we were talking about was slowness. You talked about, like getting to what does it mean when something is slow? When someone tells you they need a high resilience system, a 24/7 system, how do you have a conversation with them just like slowness? To be able to say, what do they really need. How do you go about that?

Montgomery: I always come at it from business impact, whether it be slow, or all the different qualities like, is it slow? What's the quality? What is the business impact of that? Fault tolerance is also part of that. There are systems where losing data is ok. It can be figured out later without any business impact. There are a lot of businesses like that. Unfortunately, I don't deal in any of them. The impact of losing a million-dollar trade, that is much more than a million dollars to large numbers of entities. A lot of times when you're thinking of, whether it be security, quality, performance, and even fault tolerance, it comes down to what's the business impact and how much is it worth? Certain things, the business impact is too big, therefore, you need to spend additional time and come up with the right thing.

Reisz: How do you deal with things like blue-green deployment and A/B testing? Is it all really like the trickiness around that state that you were talking about with that log replication? What are your thoughts?

Montgomery: Something like A/B testing is an interesting mental exercise to think about with something like a cluster where it's replicated. How do you do A/B testing? In that case, you almost have to have two different systems. One that is running one version, and one that's running a separate version, or they're running the same version, but the feature flags are different between the two. Then you're feeding that back into another system. There's only that real way to do that, because any other way that you do it, introduces an opportunity for divergence of those replicas, which is hard to recognize. Because once you put any type of non-deterministic behavior into a system, that you're relying on replication, that non-determinism will replicate just as quickly as everything else. You do not want that. You almost have to think of it as two separate systems.

Reisz: There's a few things that you talked about, deterministic, durable, consensus. I want to dive in just for specific on a few of these, like deterministic. A deterministic system is something that, given a set of inputs gives you the same results. Dive a little bit more into that, what does that mean in practical sense for folks?

Montgomery: Deterministic systems mean that the same input gives you the same output for all the replicas. What that really means is that the same input sequence can be thought of as being immutable. You don't go back and add or change it or anything. They always present the same state, for example. When you add non-determinism in, a couple different things that you could do to add non-determinism. You could have a piece of randomness that is using a different seed on each node, that is non-deterministic because they would then diverge at some point that is important, where they would have different state. A couple other things that you might do that is a source of non-determinism is data structures that don't iterate the same. You might go through and iterate through something like let's say, a hash map, but it's different iteration order at different replicas. If your iteration order has some impact into the state that you would come up with, that is a source of non-determinism.

A lot of the things that we end up doing in, basically, business logic has a lot of non-deterministic aspects to it, that we don't realize. The data structures, some of the details about how different things function, it's very interesting to see that some of the practices we've developed over the years, are very non-deterministic. Here's another source of non-determinism, reading from files on disk that are different between different replicas. That is another source of non-determinism. When you have a system that is deterministic, you do have something powerful.

Reisz: What are some ways to deal with that, like reading from files, for example? Are there some frameworks out there, some projects out there that may help us along some of these lines?

Montgomery: For me, I think about everything should come in through that log. The reason why you go to a file is usually to get some other piece of data. If you're going to go to that file to get it, you should be going to the same place for all the different replicas, so it should be a service. It should be, you're going and grabbing the same piece of data. That service that you're going to should be deterministic as well, and give you the same result. You can't get around those types of things. Most of the time, when you go to a file, you're looking for something very small, you're not looking for a large piece of data. I like having that come into the log, because it's a fact in the log. It's a command almost. In that case, you can look at it, instead of reading it from a file, you get it pushed into the system externally. Usually, that's also an indication that you need that piece of data at that particular point in the log, which is actually itself a very interesting way to think about how you would normally push things into it. Really, when you come to think about it, that piece that comes into the log is updating the state in some way. It should really come in through the log.

Reisz: There's a question about microservices and data ownership, each microservice owning their own data, for example. What are your thoughts on maintaining data integrity and cross-service transactions in a microservice architecture where things own their own data? Is it, again, all come back to that log? What are your general thoughts.

Montgomery: I think it actually does come back to the log. To give you an idea of how you would think about two different clustered services that are depending on one another, which is a very common pattern, is if you think that as you're processing data, you're sending a message to another cluster, another set of services, then it sees that as a message coming in, processes it. Sends a response back to another cluster, or the same one and places that in the log. If you think about it, from everything coming into the log, and everything going back out to other services, if you think about it that way, things get a lot cleaner. Because if you look at it from that first service, you see the logic that spawned that request, it went out to another cluster. At some point, it comes back, because the response comes back in the log. When you think about everything coming into the log that way, it's a little bit easier to think about the sequence of events that went through.

Reisz: Finding that non-determinism, that seems like that's going to be a challenge.

Montgomery: Yes, once you let it seep in, it is very hard then to make your system deterministic or non-deterministic.

Reisz: Tony's got a question around legacy code. He has lots of legacy code, and he's wondering your thoughts on how to introduce versioning, or how to update the protocol. The legacy code has some intolerant protocols, some older protocols, and he wants to update the protocol. He's curious your thoughts on how to apply versioning to update this protocol. What strategy might you use to update this? He's asking here, do you think it would be a good idea to do a minor version update to add versioning to the protocol, and then do a major version and update the protocol? What are your thoughts? As someone who designs protocols, how would you go about recommending him address the solution?

Montgomery: I would actually call it a major version change. Previous versions will never understand the protocol, that version, scheme, or anything else. At that point, I would consider it to be a breaking change, so in the semantic versioning sense, would be a major protocol change. The reason why I do it is this, it's better to take the pain at that point than to push it off. You really want to take that pain at that point of, now we do versioning. Now going forward, we have this scheme of how we do versioning, and how we can rely on things to do. Because if you're thinking of it from like, we upgrade everything all at once, the best place to add your versioning is when you're going to do that. If you upgrade everything all at once so it now understands versioning. Now afterwards, you can pick and choose whether you need to have that. Going into it, you do not want an older piece that doesn't understand that there even is versioning, to try to do anything else. It's got to be a big change. I think you take that pain up initially, and then later on you reap the benefits of taking that pain.


See more presentations with transcripts


Recorded at:

Jul 21, 2023