
Interaction Protocols: It's All about Good Manners



Martin Thompson explores the history of protocols and their application when building distributed systems. Protocols provide the foundation on which the quality attributes are delivered; qualities such as performance, resilience, and security.


Martin Thompson is a Java Champion with over 2 decades of experience building complex and high-performance computing systems. He is most recently known for his work on Aeron and SBE. Previously at LMAX he was the co-founder and CTO when he created the Disruptor. He blogs at, and can be found giving training courses on performance and concurrency.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.


Thompson: What am I going to talk about here today? This is a performance track, and it seems a little bit odd to talk about protocols. But I want to cover two major things to begin with, which are interesting. First thing is code. And probably not the code that you think, but the code that we write is usually encoding something else. So I'm going to cover a little bit of that, and it's the way we do stuff, the way we behave, the way things interact.

There are interesting codes that we have, like, what is the code to make good coffee I want to find? You go to Tuscany or anywhere in Italy and you get amazing coffee; go to a hotel anywhere in the U.K., or pretty much in the U.S., and you get coffee that's disgusting. Maybe they make coffee a bit like this. Is this the correct protocol to go make coffee? Well, actually, it would work, and that's most of the code that I get to see in production. People make stuff work. Questionable. Is it the right way to do it?

I want to talk about protocols, really. I'm going to say that I think protocols are one of the most significant discoveries of human history. At the end of the talk, we'll come back to that and see if we agree with this or not, and some of the points around it.

Protocols, what are they about? What sort of things are we thinking about here? As a word, what does it mean? Well, we're talking about a code, yeah. Okay, we get that, a code describing a strict adherence to etiquette and precedence together. And particularly the precedence side, I find, is so important in code, because precedence is all about timing, the order in which things happen. As we get into concurrent and distributed systems, the order in which things happen is very key to making that work. How do things behave? Etiquette is really about behavior. I'm going to explore this in some detail.

From a computing perspective, it's also described like this: a set of conventions governing the treatment and formatting of data. Often people just think about the formatting. It is an important part, but it's only a small part. The treatment is actually way more important, and treatment is about etiquette and precedence. We're going to go into that. But before we get into some of the computing stuff, I want to talk a little bit about how we've evolved as humans and what we have learned. I find this subject fascinating, because I find there's so much to learn in it for myself as well.

Evolutionary Biology

Going back, I started studying this, and there's a lot of origin in evolutionary biology, where we come up with protocols, where we come up with etiquette, manners, all of that sort of thing, and how it influences what we do. Interesting things go right back to things like facial expressions. Charles Darwin, we know his work on things like On the Origin of Species, but he did a lot of other things, like looking at how different societies evolved and how we communicate and learn. There are some real innate things that don't require teaching. People know how to recognize these things without being taught. We recognize shame, we recognize disgust, regardless of language, even if someone has never been taught language. There are instances in the world where people have unfortunately had a feral upbringing and were never introduced to language, and they still understand things like shame and disgust. So we get that.

We see this in other things. I love dogs. I've interacted with dogs all my life. It's really clear, dogs know shame. There are some things that are innate in what we do. There are things that we can learn from this. And does it work in code? I've seen a lot of JavaScript, and it's pretty disgusting. We can definitely learn from some of these sorts of things.

Then it's evolved. So it goes much further. That's the core biology stuff. We have taken this much further. We're the smart apes that exist in the world. We've taken it through things like etiquette and manners. This is actually really interesting, where we start to form protocols around doing stuff. The etiquettes around things are very interesting.

There are three types of manners that come out from etiquette. At a base level, we have what are known as the hygiene style manners. That's all the really base level stuff that's important, mostly about the prevention of disease. So a lot of the shame and disgust and different things come about from that as well. We teach people things that will help prevent disease. It's also things like how close we get to different people: the public space, personal space, intimate space. You don't invade people's intimate space, mostly because that prevents the spread of disease. We have evolved in certain ways to do stuff. And then there are ways we relax some of these rules.

Going on from the hygiene stuff, we then have the courtesy side of things. This is how we bond and create societies. This is being polite, is one way of thinking about it. So someone else can eat before I eat. What you're saying there is, you're saying that the society is more important, the community is more important than me as an individual. We learn these things because, actually, our individual survival is based on survival of the society. So you start to see how things should behave around that level.

Then it gets more advanced again with like cultural norms. So people all behaving in a similar way, and you identify as part of the group. What that does is it builds trust, and then certain other things can be relaxed. You can relax some of the more strict things you would have around boundaries, mistrusting other people, you make things more efficient. We've evolved through a lot of these sorts of things. As we got more advanced as a society, we start to use them in interesting ways.

One of the very common ones, which is well-documented and well-understood, is the military. We have very strict protocols around how we go to war. How do we act in the case of war? This is known as the rules of engagement, for anybody who's had any military experience. And you'll see, it's all about what is acceptable whenever you go to war, because war should be considered as a last ditch attempt to do something. I saw a great talk by General Mike Jackson a number of years ago, and he said, "The point of war is not to defeat or crush your enemy. The point of war is to change someone's mind." Whenever you've got a disagreement, it's the last thing you try in order to change their mind. It's the ultimate thing you can do, but it should only be done as a last resort.

Talking of the good conditions to succeed. If you look at the laws that surround this, they fall into two groups. This is the jus ad bellum and the jus in bello. The first one is the Latin term for the right to go to war; the second covers right conduct once you are in one. There's a strict protocol you should follow before you start a war. That's things like you must go through diplomacy, you must behave in a certain way, and then it's a last ditch attempt. If you don't follow those, there's a difference between being a soldier and a murderer or a criminal. The difference between murder and war is, do you follow these laws? So there's a strict protocol around this.

There's a lot to be learned about how people have done this. By doing that, they end up behaving in better ways, we cause less damage, we do certain things. An example of this is that things should be proportional. If you're a superpower that has nuclear weapons, and somebody happens to blow up one of your ships, you don't retaliate and nuke their capital. That is not acceptable as behavior. You're supposed to be proportionate. Then if it escalates, you're still within the law. So this gets interesting. A lot to be learned from our history in this.

Concurrent & Distributed Systems

Let's go on to concurrent and distributed systems. How does some of this start to go through and how can we apply it? Because it might seem a bit high level. What fundamentally we're talking about is how do things interact? This is really the whole essence of when we write code: we are encoding how things interact. Getting that right makes a big difference to the quality of the given output. We've evolved as humans with all of these protocols because we want a certain quality outcome. As an engineering discipline, we're very young. We're very immature and we're still working some of this out.

What can we do to get a bit better at this and what can we learn? There are some bodies out there that have done really great work in this space. The IETF is probably one of the best examples of this, where we very strictly document our protocols. They're known as the RFCs typically, and we follow those. This allows us to create great things like the internet. We wouldn't have the internet without the RFCs to describe the protocols for what we're going to deal with. Coming back to coffee again for a second, how do we make coffee? The IETF has even got a fun thing that happens. On the 1st of April each year, they usually release an RFC that describes something interesting. This particular one was a hypertext protocol for controlling the coffee pot. They produce some of this. And it's fun, and some of the bits is like you can return a 418 to say, "I'm a tea pot," rather than, "I make coffee."

That's the fun side. And they do continue on, there are various other examples of this. Some of them are fun, but also really informative. Anyone heard of Big-Endian and Little-Endian as a way of encoding integers? This paper is the best description I have seen of it anywhere. It's one of the 1st of April papers that come out from the IETF, back in 1980. If you haven't read this, you'll actually discover the origin of why we call it big-endian and little-endian. It's to do with "Gulliver's Travels," Lilliput, where the people there were divided and had different views. How do you have societies believing in different things? Well, what they believed in was this: one group said you eat your boiled eggs by smashing the big end, and the other group smashed the little end to get into the egg. And that is the origin of big-endian and little-endian, it's the end of the egg.
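The two byte orders can be seen directly by encoding the same integer both ways. This is a minimal sketch; the value `0x0A0B0C0D` and the helper `hex_bytes` are chosen only for illustration.

```python
import struct

value = 0x0A0B0C0D  # illustrative 32-bit value

# Big-endian: most significant byte first ("network byte order").
big = value.to_bytes(4, byteorder="big")

# Little-endian: least significant byte first (what x86 uses in memory).
little = value.to_bytes(4, byteorder="little")


def hex_bytes(data: bytes) -> str:
    """Render bytes as space-separated hex pairs for reading by eye."""
    return " ".join(f"{b:02x}" for b in data)


# The struct module expresses the same choice with '>' and '<' prefixes.
big_via_struct = struct.pack(">I", value)
little_via_struct = struct.pack("<I", value)
```

Printing `hex_bytes(big)` and `hex_bytes(little)` shows the same four bytes in opposite order, which is why both sides of a protocol have to agree on one of them.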

This paper describes the origin of that, but it also describes why we do it with memory, and the interesting stuff around that. That's a cool way to learn some of these things, but also have a bit of fun at the same time. Various other examples are things like, how do we do IP over Avian Carriers? That was there. That was expanded with, how do we add QoS to that? We've got quality of service. We can take it a bit further. And more recently, it was updated again to cover IPv6. This is the stuff in there, good fun, sort of interesting.

How Should We Document Our Protocols?

How do we make this a bit more practical for ourselves? Particularly, how do we document our protocols? I find this is actually one of the really good things you should do when you're working on any project that's of any sort of complexity, is start documenting your protocol. What do I mean by this? Well, you can hear about APIs versus protocols, I think APIs are a very one-dimensional, a very anemic view, and protocols are a much better way of thinking about it. So people who have API design, that's a very narrow dimension of it. But it's not complicated. For example, let's say I wanted to write a protocol for how to use a file system. I want to use a file. It could be as simple as, I open a file, I can perform zero or more read or write operations, and I close the file.

Now, that's the pseudo regular expression style syntax. I've seen some people use Extended Backus-Naur Form. There are other ways we can do this, but I like this as a way to do something. It's actually pretty simple. Notice, it tells you the precedence, as well as what operations are available there. An API does not give you the precedence for its usage. That's really important. So, you end up making a lot fewer mistakes when you've described the precedence.
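The file protocol described above (open, then zero or more reads or writes, then close) can be written down and checked mechanically. The following is a minimal sketch; the one-letter encoding and the names `OPERATION_CODES`, `FILE_PROTOCOL`, and `follows_protocol` are assumptions made for illustration, not notation from the talk.

```python
import re

# One letter per operation so a sequence of calls can be matched as a string.
OPERATION_CODES = {"open": "O", "read": "R", "write": "W", "close": "C"}

# The protocol in regular-expression form: open (read | write)* close
FILE_PROTOCOL = re.compile(r"O[RW]*C")


def follows_protocol(operations):
    """Return True if the operations occur in an order the protocol allows."""
    # Unknown operations map to '?', which can never match the grammar.
    encoded = "".join(OPERATION_CODES.get(op, "?") for op in operations)
    return FILE_PROTOCOL.fullmatch(encoded) is not None
```

For example, `follows_protocol(["open", "read", "write", "close"])` is accepted, while `["read", "open", "close"]` is rejected. An API listing the same four operations would allow both.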

Then you go on and you expound that and you say, "Okay, what is a read operation? What's a write? What's an open?" These sorts of things. But just doing that really improves the quality of the component you're developing or the system you're developing. You know how you should interact with it. That can just be a piece of ASCII that's there documenting your component and how to use it. That's usually how I start with most of the distributed systems or concurrent systems I work on; I put something like that together, and I flesh out so much of my own thinking by doing something as simple as that.

And then what do you do with this? Well, just think, what are the events that are happening in your system? What are the pre and post conditions? What are the invariants that are going to hold true whenever these things occur? And just document it. Write it down in text. You're encoding what your protocol is going to be, and then later, you will encode it in whatever your preferred language is, hopefully not JavaScript. Otherwise, you see the disgust again coming through. Oh, and particularly in the disgusting bit, what can go wrong besides writing JavaScript? Todd Montgomery, who I get to work with a lot, this is one thing he's taught me that's such an important thing in designing protocols: when you look at any step in the system, or any interaction that's going to happen, what can go wrong? Always ask that question and try to answer it.

Todd has written lots of software that was involved in space exploration, so he worked for NASA and different things. Just have the mindset of what can go wrong. This is why I think things like RPC are so broken; most people just assume it's going to work. That leads us down a wrong road. Instead you should think, "It's asynchronous communication. I'll send a request, I'll maybe get a response back. What can go wrong?" You may not get the response. That should be a normal thing that you think about. How do you deal with it? How do you time out? How do we do things in an interesting way? Very, very simple examples.

Multicast Example

Let's make it a bit more real and show a reasonably interesting, complicated example. So, let's say I want to build a system, say it's a whiteboard or a chat room, or just something I want to show lots of information to a large number of users. Multicast is a really nice way to do that. I'm going to send the data out from a source to many different receivers. Now, we can implement that in a number of different ways. If we did it with TCP, one of the problems is it doesn't scale. If you do it that way, you end up having to send the data to all of the different receivers and handle acknowledgement back from all of them. So you get a problem of all of this traffic exploding up. Multicast is nice, because you can send it once and all the receivers get it.

But we've got some issues with multicast and UDP in general: it's not reliable. How do you know if you've lost data? Do you use an acknowledgement protocol where every receiver acknowledges back? That doesn't scale up either, because as you add receivers, we're dealing with too much data coming back. So we take a different approach; we won't just acknowledge, we will negatively acknowledge. So just send a NAK when you don't get data or data arrives out of sequence. That way we can scale up much better because you're going to rely on the fact that loss is going to be a more infrequent event than successful delivery. That's a different protocol, a different way of thinking, but it's got some issues as well.
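The NAK idea above can be sketched as a receiver that stays silent while data arrives in sequence and only reports the gaps it sees. This is a simplified illustration that assumes every message carries a sequence number starting at 0; the class `NakReceiver` and its fields are hypothetical names, and a real receiver would also need timers for loss at the tail of a stream.

```python
class NakReceiver:
    """Receiver that sends feedback only when it detects missing data."""

    def __init__(self):
        self.highest_seen = -1   # highest sequence number received so far
        self.missing = set()     # sequence numbers known to be lost
        self.received = {}       # sequence number -> payload
        self.naks_sent = []      # (first_missing, last_missing) ranges

    def on_data(self, sequence, payload):
        if sequence > self.highest_seen:
            # Anything between the last message and this one was skipped.
            gap = range(self.highest_seen + 1, sequence)
            if gap:
                self.missing.update(gap)
                self.naks_sent.append((gap.start, gap.stop - 1))
            self.highest_seen = sequence
        else:
            # A retransmission (or duplicate) fills a previously reported hole.
            self.missing.discard(sequence)
        self.received[sequence] = payload
```

With in-order delivery `naks_sent` stays empty, so the feedback traffic is proportional to loss and not to the number of messages delivered.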

We've avoided the ACK implosion, but we can also get NAK implosions. Let's say you send from the source, and most times when you have loss, loss is correlated with some event, like I have overrun a buffer at the source, or I've overrun a buffer at the network switch that's then going to distribute this data. Once you've done that, you now have some loss. All the receivers are likely to experience the same loss. If they're going to NAK, they all NAK at the same time. The source gets all of those NAKs, a huge amount of NAKs, and it then goes to send all that data multiple times to all of the different receivers. And now we've got a meltdown scenario, because all of this data goes out and floods the network. You've probably got a loss problem in the first place. Now you've got even more data in your feedback cycle, and you get into what's known as a NAK implosion or a congestion collapse. These sorts of things end up happening.

How could we deal with that better? Well, there is some really interesting work in this space, going back to the 1980s and slightly beyond that. One of them is a nice paper by Sally Floyd. Sally Floyd worked with Van Jacobson, and they rescued a lot of the internet in the 1980s. We suffered from a thing called congestion collapse over TCP. They brought in congestion control to TCP and really improved things. Another of Sally's great contributions is the work here, where when you go to NAK, how about not everybody NAKs, not everybody screams at the same time? So let's say something goes wrong in the room. If everybody screams it's wrong, it's just too much noise. What if, say, you notice that something's gone wrong and you decide, "I'm going to wait a little bit of time to see, has anybody else noticed it first? Just a reasonable little bit of time. If somebody else raises it first, I won't raise it as an issue because it's fine, it's already been dealt with."

She encoded it in this algorithm by doing this. What you do is, if you get loss, you set a random timer for a very short period of time in the future related to how long it takes for communication to happen, so quite a short period of time. The random timers are going to go off between now and this given timeline in that normal distribution. The first one to go off will end up sending the request to get the data sent again. The data will be re-sent; because it's multicast it's sent to everyone. Then whenever the others have seen this data, they know that they don't need to send the NAK themselves because they've already got it. So they just listen for the data to be re-sent.
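The effect of the random timers can be sketched with a small simulation. This is an illustration under stated assumptions, not the algorithm from the paper: every receiver notices the loss at time zero, the retransmission reaches all receivers `repair_time` after the first NAK is sent, and delays are drawn uniformly purely to keep the sketch short. The names `count_naks` and `draw_delays` are hypothetical.

```python
import random


def count_naks(timer_delays, repair_time):
    """Count the receivers whose timer fires before the repair arrives.

    timer_delays: one back-off delay per receiver, in any consistent unit.
    repair_time: time from the first NAK being sent until the retransmitted
                 data has reached every receiver (assumed equal for all).
    """
    first_nak = min(timer_delays)
    repair_arrives = first_nak + repair_time
    # A receiver cancels its timer once it has seen the retransmitted data.
    return sum(1 for delay in timer_delays if delay < repair_arrives)


def draw_delays(receivers, max_backoff, rng):
    """Draw one random back-off per receiver (uniform is an assumption)."""
    return [rng.uniform(0.0, max_backoff) for _ in range(receivers)]
```

With no back-off at all, for example `count_naks([0, 0, 0, 0], 10)`, every receiver NAKs. With delays spread over a window much longer than `repair_time`, only the few receivers whose timers fire close to the earliest one send anything.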

Now we've only had one or maybe a small number of NAKs, and we've dealt with all of this data. Nice, simple, clean. No central control, using random timers as a way of doing something. Beautifully simple algorithm, scales really, really nicely. I think that's a really interesting way of thinking about the problem. To me, that's the essence of really good protocols, especially thinking about how a group or community of entities behaves. Whatever they happen to be, in this case computing nodes that are saying that they haven't got the data, they're behaving in a really nice, interesting, coordinated way. And this ends up scaling quite nicely.

But that sort of thinking inspired some other people. Now, this was done with a normal distribution. What happens with a normal distribution is you can still get clusters of NAKs, you can get quite a lot coming along. So you end up with a thing called NAK suppression, which needs to be added to that algorithm to dampen it out a little bit more. Someone had a good idea of how to change this. Rather than having the timers fire with a normal distribution, if we change the distribution to be an exponential distribution and we truncate it at both ends, we get this really nice, interesting property where the bulk of the timers are going to fire later in the cycle. But we're going to truncate the lower end, and with reasonable numbers, and if you set the sample sets correctly, you will likely get a timer to fire really quite soon, but you don't have to suppress stuff because there are very few NAKs flying around.

This ended up being a really nice improvement on that. We end up getting even lower latency, we don't need NAK suppression, and we take a lot of network traffic away. So, again, beautiful algorithm. Using mathematics, using cooperation, we get higher throughput, lower latency, and a really nice, elegant solution. It's not all clever coding and bit shifting, but we get a really good solution to the thing by thinking, what are the protocols of interaction? As people are thinking differently, I think there's some really cool stuff in here. By the way, that is the algorithm we ended up implementing in Aeron and we've been getting really great results from it. It's such a simple, elegant code. It just works really well, rather than going with something overly complicated and complex.
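A truncated exponential timer of the kind described can be sketched with inverse-transform sampling. This is a minimal illustration and not Aeron's implementation: the density is assumed to grow as e^(lam * t / max_backoff) on the interval from 0 to `max_backoff`, and the shape parameter `lam` and both function names are hypothetical.

```python
import math


def truncated_exponential_delay(u, max_backoff, lam):
    """Map a uniform sample u in [0, 1] to a delay in [0, max_backoff].

    Inverts the CDF F(t) = (e^(lam*t/T) - 1) / (e^lam - 1), with T = max_backoff,
    so most of the probability mass sits near the end of the window.
    """
    return (max_backoff / lam) * math.log1p(u * math.expm1(lam))


def fraction_fired_by(t, max_backoff, lam):
    """CDF of the same distribution: expected fraction of timers fired by time t."""
    return math.expm1(lam * t / max_backoff) / math.expm1(lam)
```

With `lam = 8`, fewer than 2% of timers are expected to fire in the first half of the window, yet in a large group at least one is still likely to fire early, so the repair is requested quickly while most receivers never need to send a NAK.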

What Should We Focus on?

What should we focus on if we're going to look at protocols in general? There are lots of things within protocols. I'm not going to cover all of these, but I just want to highlight, it's a really interesting, rich subject and a rich history, the different things that are on there. I'm going to pick out a few of them and show as I go, how do they impact how we interact? How do we learn from that? How do we make some of our code a bit better?

Since we're in a performance track, I'm going to give you the one tip that will make the biggest difference to your code from a performance perspective. And that's when you come to the encoding step, please don't use text protocols. Use binary instead, it's the single biggest thing you can probably make as a difference, yet we keep producing all of this JSON and XML and YAML and, oh, disgust is going to come out of me again. Do the people who write all of these things, are they typically neurotypical and have no interaction skills with the world, did they never experience disgust from the people around them as they produce this stuff?

Clearly, some of it has gone wrong in our interactions with these things. Let's use binary protocols. Well, people say, "I know, but text is so much easier, it's human readable." Rubbish. It's not human readable. I know very few people who can read ASCII, and virtually no one who can read UTF-8. You've got a text editor that can read it, and you can read that. You've got a tool that can do it. If you want binary to work well, you write a tool to do the binary. That's what you need. It's not that the text itself is human readable. It's really interesting how we contort how we think about the world. We miss some of the interesting points.

Whenever we do it in binary, we can go with what people in the IETF would call protocol porn, where we describe in ASCII how things are laid out. And then we can write our tools to deal with this and work quite nicely. We'll save so many cycles in how we deal with these things. So, that's the main thing on performance. I'll come back to a few other things on performance, but if you just at least do that, hopefully we'll save a few more trees around the planet and people will stop wasting lots of CPU time. I'm not joking. I get to profile many real world applications, and sometimes the percentage of time spent in the business logic is an absolute rounding error, often less than 1%. So much of the time is taken to scrape data off the network and go through many, many, many layers of translation to eventually invoke some business functionality at the end of it. That's what we do a lot of.
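A fixed binary layout and the small tool that reads it back can be sketched as follows. The message, its field names, and the layout are all hypothetical, chosen only to show the idea; the size comparison says nothing about any particular production format.

```python
import json
import struct

# Hypothetical layout, little-endian, no padding (24 bytes in total):
#
#   0        2        4                 12                20        24
#   +--------+--------+-----------------+-----------------+---------+
#   |template| version|    order_id     |      price      | quantity|
#   | uint16 | uint16 |     uint64      |      int64      |  uint32 |
#   +--------+--------+-----------------+-----------------+---------+
ORDER_LAYOUT = struct.Struct("<HHQqI")
FIELD_NAMES = ("template_id", "version", "order_id", "price", "quantity")


def encode_order(order_id, price, quantity, template_id=1, version=1):
    """Pack the fields at fixed offsets; no parsing is needed to read them."""
    return ORDER_LAYOUT.pack(template_id, version, order_id, price, quantity)


def decode_order(buffer):
    """The 'tool' that makes the binary readable: bytes in, named fields out."""
    return dict(zip(FIELD_NAMES, ORDER_LAYOUT.unpack(buffer)))
```

Encoding an order this way takes 24 bytes, whereas `json.dumps` of the same fields is several times larger and has to be parsed character by character on the way in.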


Let's pick another topic in here, versioning. Versioning is a really interesting, fascinating one as well. It's something where my own learning and understanding has been really advancing. It's a subject where I keep finding more and more interesting things. And so, what are we talking about in versioning? So, versioning and identity can often be interlinked as well. When you talk about protocols, what conversation are we having? That's the first of the important things: when you're engaged in any conversation, for any sort of message that you get, which protocol does this belong to? It's good to know that and to be able to version that over time, so we can upgrade stuff and keep things going.

For the messages themselves, what is the version of the encoding? Is this version one? Is it version two? Is it version three? We can add fields to things. We can never take stuff away if we want to have a system be up all of the time. We can add stuff and we can expand, and we deal with this through versioning. It all works really well. What's way more interesting, and something my own understanding has been coming forward a lot with, is the versioning of state. This is a really fascinating topic in its own right. What do I mean by this? Let's give some examples.
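The rule that fields can be added but never taken away can be sketched with a header that carries a version and a block length, so an old reader can skip what it does not understand. The message layout and all names here are assumptions made for illustration.

```python
import struct

HEADER = struct.Struct("<HH")    # version, block_length (bytes after the header)
V1_BODY = struct.Struct("<Qq")   # order_id, price
V2_EXTRA = struct.Struct("<I")   # quantity, appended in version 2


def encode_v1(order_id, price):
    return HEADER.pack(1, V1_BODY.size) + V1_BODY.pack(order_id, price)


def encode_v2(order_id, price, quantity):
    body = V1_BODY.pack(order_id, price) + V2_EXTRA.pack(quantity)
    return HEADER.pack(2, len(body)) + body


def read_as_v1(buffer, offset=0):
    """An old reader: decodes the fields it knows, then skips by block_length."""
    _version, block_length = HEADER.unpack_from(buffer, offset)
    order_id, price = V1_BODY.unpack_from(buffer, offset + HEADER.size)
    return {"order_id": order_id, "price": price}, offset + HEADER.size + block_length


def read_as_v2(buffer, offset=0):
    """A new reader: the added field is absent (None) in version 1 messages."""
    _version, block_length = HEADER.unpack_from(buffer, offset)
    body_offset = offset + HEADER.size
    order_id, price = V1_BODY.unpack_from(buffer, body_offset)
    quantity = None
    if block_length >= V1_BODY.size + V2_EXTRA.size:
        (quantity,) = V2_EXTRA.unpack_from(buffer, body_offset + V1_BODY.size)
    message = {"order_id": order_id, "price": price, "quantity": quantity}
    return message, body_offset + block_length
```

Because new fields are only ever appended and the block length is explicit, old and new readers can share one stream of mixed-version messages while the system stays up.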

Well, I've done some work on some multi-producer, multi-consumer queues, and I like to do things lock-free. The space I work in requires a lot of performance, so we are doing lock-free algorithms using atomic primitives. And there's some great work like Lamport's or FastFlow, and various other stuff that's out there over time. I've been taking algorithms that were single-producer/single-consumer or multi-producer/single-consumer, evolving all of these, combining some of the algorithms, doing quite well. Multi-producer/multi-consumer is a really interestingly complicated one. And you can evolve some of Lamport's work and some of the other work to get a working solution.

I found that some of the ones I produced worked reasonably well, and got reasonable performance. And then after a few years, I saw odd bugs appear, and discovered there were some flaws in my algorithm that I needed to fix. I eventually dug into it to find out what was wrong, and found these odd cases where queues were particularly full or empty at wrap points. You can end up with ABA problems, and you get bugs in your code and you have to fix them. I came up with some quite complicated solutions to fix it by CAS-ing different things, and I could do an algorithm over two CASs and some ordering rules. And it would work, but it was uncomfortable.

Then I started searching around to see, "Well, what can be done better? How can we deal with this better?" I came across this site. The guy who produces it is Dmitry Vyukov. He now works on the Go team, he was at Intel before, and he has a beautiful algorithm for multi-producer/multi-consumer. I remember whenever I read it, I just thought, "Wow." It just made it so much easier. He does it with one CAS, but the thing that he does that's fundamentally different is he versions everything that comes into the queue and everything that goes out of it. So you've got the sequence for the slot that you use, and you never get the ABA problem because everything is given its own sequence.

There's an interesting trade-off for that. It requires more memory to track this, but it actually performs really, really well. In C, it's beautiful because you can keep an array of structures where you've got a pointer to the object that you're putting in and the sequence number side-by-side. Unfortunately, we can't do that sort of thing in languages like Java, because we've only got arrays of references. If we had arrays of value types, this would be much better and we could keep them side by side. But you can implement it with two arrays that share the same slot index, and it means the same sort of thing. This was so much more elegant, but it was just versioning the state of the interactions, because you've got producers and you've got consumers. If a producer produces something and it's given a version ID, you know that it's the right thing that's been handed off. And you know who owns a slot at any given point in time without any locks. So, versioning the state, really interesting.
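The sequence-per-slot idea can be sketched with the two-array layout mentioned above. This is an illustration of the logic only, not Vyukov's code and not a performant queue: Python has no hardware compare-and-swap, so the hypothetical `AtomicCounter` emulates one with a lock, and memory-ordering concerns that matter in C or Java are not modelled.

```python
import threading


class AtomicCounter:
    """Stand-in for a hardware atomic; a lock emulates compare-and-set."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def get(self):
        return self._value

    def compare_and_set(self, expected, new):
        with self._lock:
            if self._value != expected:
                return False
            self._value = new
            return True


class SequencedQueue:
    """Bounded queue where every slot carries a sequence (version) number."""

    def __init__(self, capacity):
        assert capacity >= 2 and capacity & (capacity - 1) == 0, "power of two"
        self._mask = capacity - 1
        self._items = [None] * capacity
        self._sequences = list(range(capacity))  # slot i is free for position i
        self._head = AtomicCounter()             # next position to consume
        self._tail = AtomicCounter()             # next position to produce

    def offer(self, item):
        while True:
            pos = self._tail.get()
            index = pos & self._mask
            diff = self._sequences[index] - pos
            if diff == 0:
                # Slot is free for exactly this position: claim it with one CAS.
                if self._tail.compare_and_set(pos, pos + 1):
                    self._items[index] = item
                    self._sequences[index] = pos + 1  # publish to consumers
                    return True
            elif diff < 0:
                return False  # slot still holds an unconsumed item: queue full
            # Otherwise another producer claimed the position; retry.

    def poll(self):
        while True:
            pos = self._head.get()
            index = pos & self._mask
            diff = self._sequences[index] - (pos + 1)
            if diff == 0:
                if self._head.compare_and_set(pos, pos + 1):
                    item = self._items[index]
                    self._items[index] = None
                    # Mark the slot free for the producer one lap ahead.
                    self._sequences[index] = pos + self._mask + 1
                    return item
            elif diff < 0:
                return None  # nothing published at this position: queue empty
```

Because a slot's sequence only ever moves forward, a producer or consumer that was delayed cannot mistake a reused slot for the one it originally looked at, which is what removes the ABA problem.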

Another example of versioning the state is consensus algorithms. So, Raft is the current popular algorithm in this space. We version everything around this. Every message has got a version number. It goes forward monotonically, but so do things like elections. So if we get an election, the election has a leadership term ID and a candidate term ID that will deal with this. And this goes forward monotonically. If you get a message that has been delayed on its way around your network and it comes back later, but it's from an earlier election, you know it's from another election because it's got the version number on it. These sorts of things start to matter. And then it makes it so much easier in a distributed or a concurrent context to do this. I think this is one of the most important personal discoveries for me, getting some of these algorithms right. It really makes it so much easier. So be aware of that.
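Using a monotonically increasing term to discard stale messages can be sketched in a few lines. This is a simplified illustration of the idea and not an implementation of Raft; the class `TermTracker` and its method are hypothetical.

```python
class TermTracker:
    """Accepts messages from the current or a newer term and rejects older ones."""

    def __init__(self):
        self.current_term = 0
        self.accepted = []

    def on_message(self, term, payload):
        if term < self.current_term:
            # Delayed message from an earlier election: safe to ignore.
            return False
        if term > self.current_term:
            # A newer election has happened; terms only ever move forward.
            self.current_term = term
        self.accepted.append((term, payload))
        return True
```

A message stamped with term 1 that arrives after the node has seen term 2 is rejected without any further reasoning about where it came from or how long it was in flight.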

Sync vs. Async

The good old sync versus async debate is really interesting, especially in concurrent distributed systems. Let's explore what some of this means from a timing perspective. If I've got two things that want to communicate together and they're going to use a synchronous protocol, I send a request from one and the other sends the response back. Send another request, another response back. Request, response. Time is progressing as we go with this. If we increase the latency, what happens? I can get less done in the same amount of time. We program lots of things using synchronous protocols by request-response.

What's happening in the meantime? We end up having to send a request, but waiting for a long enough period of time. We often block on the response, which means the operating system has to get involved, this thread has to be suspended, it has to be woken up again whenever the packet comes back from the other side. We spend a lot of time on overhead and bureaucracy and dealing with these, plus the time to deal with these sorts of algorithms. If it's asynchronous, what happens instead? We send a request, we don't wait for a response, we can send another request, we can send another request. And then the responses all start coming back. We greatly compress the time we're dealing with this. In any distributed system, if you start being asynchronous, we can get so much more done by sending requests and getting the responses.
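The timing difference can be sketched with `asyncio`, using a sleep as a stand-in for network latency. The function names and the latency figure are assumptions for illustration; no real network is involved.

```python
import asyncio
import time


async def remote_call(request, latency):
    """Pretend remote service: the sleep stands in for the network round trip."""
    await asyncio.sleep(latency)
    return f"response-{request}"


async def request_response(requests, latency):
    """Synchronous style: wait for each response before sending the next request."""
    return [await remote_call(request, latency) for request in requests]


async def pipelined(requests, latency):
    """Asynchronous style: send every request, then collect the responses."""
    return await asyncio.gather(*(remote_call(r, latency) for r in requests))


def timed(coroutine_factory):
    """Run a coroutine to completion and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = asyncio.run(coroutine_factory())
    return result, time.perf_counter() - start
```

For ten requests at 20 ms each, `timed(lambda: request_response(range(10), 0.02))` takes roughly ten round trips, while the pipelined version takes roughly one, and the gap widens as latency grows.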

We increase the latency, what happens? It does much better. So, even as latency increases, async pays off much better. This is what Todd was banging into me, the guy who taught me about what can go wrong. He was writing software that was talking to things going to Mars and farther out in the solar system. Latency is a real serious problem there. You do not do REST, request-response, when you're talking to something on its way to Saturn. It's not going to really work. We have these silly ways of thinking about and looking at the software which deals with it.

What's happening in this sort of case? Well, if I'm making non-blocking sends and non-blocking receives, I can do other things in the meantime. So, I can send the stuff. I can go off and do other things. I can come back. I teach concurrency, and I do a few courses a year, it's interesting. Looking at, what are the characteristics that make good concurrent programmers? One of the things I find quite common are the people who play musical instruments in a band, because they're used to working with other people, and people who are good at cooking. Because if you're in the kitchen, and you have to deal with multiple things all happening at the same time, if you're synchronous, you can't make anything interesting. Thinking of the ways we approach this stuff, it's really interesting.

But people will say, "But synchronous is so much easier." No, it's not. It's just how you think about the problem. It's how you manage the feedback and how you manage the state. Using the thread stack to manage your state is a very opaque way of doing that. You've got multiple requests going, how do you monitor a system that's doing that without doing something really weird? If you have the state model for what's going on, like proper state machines, then okay, I send a request off, I change the state. I get a response back, I change the state. I don't get a response back, I get a timeout, whatever. You make all of that part of the model of your domain and make it observable. You can now see what's going on in your system. It's not opaque anymore. It's a different way of thinking.
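A state machine of that kind might look like the sketch below. The state names, the `RequestTracker` class and its methods are all hypothetical, chosen only to show request state held in the model rather than on a blocked thread's stack.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class State(Enum):
    IDLE = auto()
    AWAITING_RESPONSE = auto()
    COMPLETED = auto()
    TIMED_OUT = auto()


@dataclass
class RequestTracker:
    """Tracks one in-flight request as explicit, inspectable state."""
    timeout: float
    state: State = State.IDLE
    sent_at: float = 0.0
    history: list = field(default_factory=list)  # (time, old state, new state)

    def _move(self, new_state: State, now: float) -> None:
        self.history.append((now, self.state, new_state))
        self.state = new_state

    def on_send(self, now: float) -> None:
        if self.state is not State.IDLE:
            raise RuntimeError(f"cannot send from {self.state}")
        self.sent_at = now
        self._move(State.AWAITING_RESPONSE, now)

    def on_response(self, now: float) -> None:
        # A response that arrives after a timeout is ignored in this sketch.
        if self.state is State.AWAITING_RESPONSE:
            self._move(State.COMPLETED, now)

    def on_tick(self, now: float) -> None:
        # Called periodically by the owning loop; no thread blocks while waiting.
        if self.state is State.AWAITING_RESPONSE and now - self.sent_at >= self.timeout:
            self._move(State.TIMED_OUT, now)
```

Because every transition goes through `_move`, a monitoring tool can read `state` and `history` at any moment, which is the observability the thread stack does not give you.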

I find that with synchronous systems, you get going really quickly, and then it gets really hard as complexity builds up. Asynchronous systems are a bit harder to get going initially, but then you hit the same level of complexity. Unfortunately, we're driven by the instant: we want instant feedback and to get things done instantly. This is why we have problems like JavaScript, why we had problems with VB. We have to accept that we have to do a little bit more work up front, but then we can do it much better later on. We need to demonstrate that disgust much more when we see these bad practices being put into place.

Think about it, this is the real world. The real world is decoupled in space and time. To pretend it's not is crazy. All underlying protocols are send something off, get something back; they're asynchronous under the hood. But we force synchronous on top of it. If you're going to build your own system and your own API, at least think about this: build it with an asynchronous interface. Then wrap it with synchronous. If you've got people who need methadone, who can't get off the bad drug habit they've got of using synchronous programming, give them this for a bit and try to wean them off it. Don't go ahead and just build a synchronous API, because then you're just locking it in and you're making it worse for the future. You won't get the performance.
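As a rough sketch of that layering in Python: the asynchronous function is the real interface, and the synchronous one is only a thin wrapper around it. The `lookup` names and the simulated delay are assumptions for illustration.

```python
import asyncio


async def lookup_async(key: str) -> str:
    """Primary, asynchronous interface (a hypothetical lookup service)."""
    await asyncio.sleep(0.01)  # stands in for a network round trip
    return key.upper()


def lookup(key: str) -> str:
    """Thin synchronous wrapper for callers not yet ready for async.

    Must be called from code that is not already inside a running event loop.
    """
    return asyncio.run(lookup_async(key))


async def lookup_many(keys: list[str]) -> list[str]:
    # Async callers keep the ability to overlap their requests.
    return list(await asyncio.gather(*(lookup_async(k) for k in keys)))


if __name__ == "__main__":
    print(lookup("milk"))
    print(asyncio.run(lookup_many(["milk", "cheese"])))
```

Going the other way, starting from a blocking `lookup` and trying to make it asynchronous later, generally means tying up a thread per call, which is the lock-in described above.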

We can see some of this with the history of certain protocols and how they work. Things like RPC, HTTP, TCP. Typically these days, RPC is on top of HTTP, which is on top of TCP. Really inappropriate protocols. TCP was not designed for request-response. Anybody who wants to grab me afterwards, we'll talk about TIME_WAIT and slow start, and all the problems that are in TCP. It's such the wrong protocol. It's a great protocol, just not for what it's being used for. We then wrap it with HTTP, which is a document-fetching model, and we're using this to communicate with servers, and then we put in RPC. It's been known to be broken for years, and yet we keep doing this. Come on, please, let's show our disgust more and stop doing some of this stuff, because it's bad.

We end up trying to fix this with gaffer tape and bodging. TCP now has Fast Open to try to get around some of the slow start process. We're now using QUIC. So if you use a Google browser, talking to a Google service, you don't use TCP anymore. You're typically going over QUIC, which is just a UDP-based protocol. It's much better for what you end up doing. We're seeing HTTP/1 go through things like SPDY, and HTTP/2 is available now. Also things like the encodings: we're not using text encodings anymore. We're moving to binary encodings. People are learning, "This is what we need to do. We need to get better." There are some really good examples in TLS going from 1.2 to 1.3 to reduce round-trip time, because being locked into some of the synchronous stuff is really, really hurting us. Go read protocols, you're going to find out much more about this.


Batching. This is an interesting one. Take this back to protocol. Think about the etiquette of making a request. Humans do this really horrible thing, whenever we want to be unpleasant to any group of society or any group of individuals, we dehumanize them. Because then that takes us away from our instincts, our protocols that come into place. We dehumanize them. Imagine you going to a friend's house, and they're making dinner and you're getting involved with it. And he says, "Do you know what? I need a pint of milk. Can you run down to the shops for me and get a pint of milk to finish this?" All friends go, "Yeah, no problem. Let's go do this." So you go get the pint of milk and you come back again. And then they said, "Oh, actually I need another pint of milk. Could you go back to the shops and get me another one?" You go back to the shops and get another pint of milk, and then you come back and he says, "No, actually I need another one, and I need a block of cheese." You're going to get pissed off with your friend at this. You're going to start showing some of those basic hygiene manners. You're going to train them into, "You should have really thought about how much you need and go get them."

So, think about it. We deal with computers without thinking about the interactions. We disconnect ourselves from that, and that causes a lot of really bad behaviors, things that we have evolved to deal with between people. We've got to think a little bit differently. We need to, given where some of this stuff is going. Like, can we use a 100 gigabit Ethernet connection with our current APIs and ways of interacting with it? We can't. We can't use the BSD sockets API to get 100 gigabit Ethernet working, because we just can't make that number of syscalls. It becomes a real problem.

Syscalls are such a limiting factor in our performance. And now with things like Spectre and Meltdown, they're becoming even more expensive. If these are an expensive thing, we need to use them as infrequently as possible. And we batch and we think of different ways of doing stuff. For example, if you're in the networking space, if you're on Linux, for a while you've been able to use things like sendmmsg and recvmmsg, where you can send multiple things to the network in one system call, and receive multiple things. Like, don't send a friend to get a pint of milk three times. Ask for three pints of milk. Do that sort of stuff. So think about, "Can I do this?"
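The multi-message socket calls are not, as far as I know, exposed by Python's standard library, so the sketch below uses a related call that is: `os.writev`, which hands several buffers to the kernel in a single system call. It shows the same idea of asking once for three pints rather than making three trips. The function names are made up for illustration, and partial writes are ignored for brevity.

```python
import os


def send_one_at_a_time(fd: int, messages: list[bytes]) -> int:
    """One write system call per message; returns the number of calls made."""
    calls = 0
    for message in messages:
        os.write(fd, message)  # a real sender would also handle partial writes
        calls += 1
    return calls


def send_batched(fd: int, messages: list[bytes]) -> int:
    """One gather-write system call for the whole batch (Unix only)."""
    os.writev(fd, messages)  # a real sender would also handle partial writes
    return 1


if __name__ == "__main__":
    reader, writer = os.pipe()
    pints = [b"milk\n", b"milk\n", b"milk\n"]
    print(send_one_at_a_time(writer, pints), "system calls")
    print(send_batched(writer, pints), "system call")
    os.close(writer)
    print(os.read(reader, 4096))
    os.close(reader)
```

The bytes that reach the reader are identical either way; only the number of crossings into the kernel changes.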

We've seen other protocols that have come up in user space to save us from these. Then we're going to asynchronous protocols, so things like DPDK, ef_vi, RDMA. These allow us to get full usage of things like 100 gigabit Ethernet stacks. We've got to think differently. RPC, with TCP and HTTP and all that stuff, forget it. You're not going to get proper usage out of these. We've got to think differently.

Natural Batching & Mechanical Sympathy

Think about naturally batching up your stuff, treat things with some sympathy. Think of the computer not as an inanimate object. Let some of your instincts come into play, that were designed for interacting with people, and other organic things come into play. So we'll do things in the right sort of way, rather than dehumanizing and going the other way. Then we can get better use. It's not that we're obsessing about it, it's not like it is a human, but don't try to overcome the other things. Try to think about, its natural way is to write a frame to the network. Its natural way is to write a block to a disk, or whatever. Well, take account of that. Show some sympathy for it.
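One way to read natural batching, sketched here under assumptions: the consumer takes whatever has queued up while it was busy and writes it as a single unit, never waiting to fill a batch. The `drain_batch` and `write_frame` names are hypothetical, and joining bytes stands in for a real frame or block write.

```python
import queue


def drain_batch(pending: "queue.Queue[bytes]", max_batch: int) -> list[bytes]:
    """Take whatever has accumulated, up to max_batch, without waiting for more."""
    batch = [pending.get()]  # block only for the first item
    while len(batch) < max_batch:
        try:
            batch.append(pending.get_nowait())
        except queue.Empty:
            break  # nothing else is ready, so send what we have
    return batch


def write_frame(batch: list[bytes]) -> bytes:
    # Stand-in for one write of a network frame or disk block.
    return b"".join(batch)


if __name__ == "__main__":
    pending: "queue.Queue[bytes]" = queue.Queue()
    for word in (b"three ", b"pints ", b"of ", b"milk"):
        pending.put(word)
    while not pending.empty():
        print(write_frame(drain_batch(pending, max_batch=3)))
```

Under light load each batch holds a single item, so latency stays low; under heavy load the batches grow on their own and the cost per item falls.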

An example of where we don't show sympathy: ORMs. Again, like I said, another great source of some disgusting performance issues that we have out there is things that just aren't designed to work together. With a database, you need stuff from a number of tables. Write a join statement, get it back in one go. And actually, SQL is not hard. It's actually quite pleasant if you bother to learn it. Most ORMs are not that pleasant if you've got to deal with them. They're pretty disgusting.
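The difference can be sketched with Python's built-in sqlite3 module. The tables and data are invented for illustration, and the first function imitates, by hand, the one-query-per-row pattern that lazy loading in an ORM can produce.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Build a small in-memory database with made-up customers and orders."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, item TEXT);
        INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
        INSERT INTO orders VALUES (1, 1, 'milk'), (2, 1, 'cheese'), (3, 2, 'milk');
    """)
    return db


def orders_one_query_per_customer(db):
    """One query for the customers, then one more per customer: 1 + N round trips."""
    queries, rows = 1, []
    customers = db.execute("SELECT id, name FROM customers ORDER BY id").fetchall()
    for cid, name in customers:
        queries += 1
        for (item,) in db.execute(
            "SELECT item FROM orders WHERE customer_id = ? ORDER BY id", (cid,)
        ):
            rows.append((name, item))
    return rows, queries


def orders_with_join(db):
    """The same rows from a single JOIN statement: one round trip."""
    rows = db.execute(
        "SELECT c.name, o.item FROM customers c "
        "JOIN orders o ON o.customer_id = c.id ORDER BY c.id, o.id"
    ).fetchall()
    return rows, 1
```

With an in-memory database the extra queries cost little, but against a database across a network each one is another trip to the shops.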

I'm going to ramp up a little bit here and think about things I call snake oil protocols. This is a really interesting way of starting to look at how some of our human behaviors go. One of my favorites is two-phase commit, or XA, as it's known. This is something people often ask me about, and typically these are people who are architects. What they're doing is trying to make what should be their problem someone else's problem. This is bad protocol. This is the sort of stuff that we should not allow in society. There are a lot of interesting things when we dig into what's going on with XA.

So, more great papers to look at: one of them is by Jim Gray and Leslie Lamport. Unfortunately, we don't have Jim anymore. This is only 2004. Relatively recent, but there's a body of work showing how bad XA and two-phase commit are. Here's one of the quotes from these guys: "Two-Phase Commit is not fault tolerant because it uses a single coordinator." It's a single point of failure, and it's synchronous with a blocking approach to it. There are loads of problems with it. It's a really bad way to design. Consensus-based algorithms are significantly better, and give you better performance, better resilience, better qualities of service that we need for this. We think about different protocols. We keep asking for these same stupid things over and over again. We calibrate things into doing the wrong sort of offset.

Particularly, I find in these spaces, with any protocols that require arbitration, it's the command and control mindset, and it doesn't tend to work. It's much better having that mindset. So think about Sally Floyd's idea with scalable reliable multicast, using timers to go off, getting consensus across. It's a much, much better way to achieve these things. You actually get greater performance as a side effect. Another favorite is guaranteed delivery. What the hell does this mean? Again, it's making the problem someone else's. It's like pushing things away. It's terrible human behavior.

Applications Should Have Feedback & Recovery Protocols

The way to think about it is that all applications need to have their own feedback and recovery protocols. You can't just tell the underlying transport, "Whatever, you're going to totally take care of this," because there are so many failure conditions, there's so much that can go wrong in this. We have to think about it. And how do we think about it? How do we deal with it? We start thinking about protocol layering. Because you happen to know that your transport does something a certain way, you start relying on the implementation. That's breaking the contract. You've got to be relying on behavior, not the actual implementation, to do this.

For example, if you're communicating between two systems, you have your own protocol. Let's say you give monotonic sequence numbers to your messages; you can detect gaps. You can deal with idempotent delivery, you can deal with all of this at the right level. Trying to make it a problem for the transport is crazy. We've got people who are even enabling the bad behavior. I was really shocked when the Kafka folks were saying, "We are going to give you exactly-once delivery." It's like, "No, it's been proven to be wrong so many times," and yet we keep going forward with this. Again, let's show our disgust a bit more.
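A minimal sketch of that application-level recovery, assuming messages carry sequence numbers starting at 1. The `SequencedReceiver` class is hypothetical: it drops duplicates, holds back anything that arrives early, and reports which sequence numbers are missing so the application can ask for them again.

```python
class SequencedReceiver:
    """Receiver side of a hypothetical application protocol with sequence numbers."""

    def __init__(self) -> None:
        self.next_expected = 1
        self.delivered: list[str] = []
        self.held: dict[int, str] = {}  # arrived early, waiting for a gap to fill

    def missing(self) -> list[int]:
        """Sequence numbers to ask the sender to retransmit."""
        if not self.held:
            return []
        return [s for s in range(self.next_expected, max(self.held)) if s not in self.held]

    def on_message(self, seq: int, payload: str) -> None:
        if seq < self.next_expected or seq in self.held:
            return  # duplicate: already seen, so receiving it again changes nothing
        self.held[seq] = payload
        # Deliver in order for as long as the sequence is contiguous.
        while self.next_expected in self.held:
            self.delivered.append(self.held.pop(self.next_expected))
            self.next_expected += 1


if __name__ == "__main__":
    receiver = SequencedReceiver()
    for seq, payload in [(1, "a"), (1, "a"), (4, "d"), (3, "c")]:
        receiver.on_message(seq, payload)
    print(receiver.delivered, "missing:", receiver.missing())
```

Nothing here depends on how the transport behaves: loss, duplication and reordering are all handled by the application's own protocol.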

So, let's wrap up here. Are protocols some of the most significant things we have come up with as humans? I argue that they are. I'll give you one example. Anybody know this protocol? Scientific method. Probably one of the greatest things that we have discovered as humans, but at its core, it's a protocol. It led us to things like this, the famous Photo 51. What was happening with these protocols? How do we encode information in ourselves? Crick and Watson were, "How is it?" Their question was, "How do we do that?" Their hypothesis was, "We think it's done in a helix structure." "Okay, what's the experiment to do that?" "Well, if we use X-ray diffraction, we should see the X." They performed the experiment. That's the result of it. They were able to conclude, because of what they knew about molecular interactions, that this is a helical structure.

How good is this from an evolutionary perspective? Well, it looks like we can encode about 2.2 petabytes per gram. That's a pretty impressive tape to store stuff. This is some pretty cool stuff out there. But it comes from a protocol. So, Newton defined the protocol for how we do this. There's much more detail to this, right? You're going to do science, you're going to do single blind, double blind, triple blind protocols on how we deal with stuff. We do really amazing things when we apply them correctly. Medicine, science, engineering, many things have advanced to a really good level because they take protocol seriously.

Some of this is my call for, if we're going to grow up and get better as an industry, we have to take protocols more seriously. Actually, protocols are the essence of ethics as well. Because ethics is how we encode some of this. So, I ask, why do we not take this more seriously? Why are they not studied and practiced more? I find people have heard the terms. We even get misled. Google talked about protocol buffers; they're not a protocol, they're a codec. Us as an industry, we do terrible things.

So I'll leave you with one thing to think about. On your next project, your first step that you go to do, is your first step to install all different frameworks and tools and do whatever, or do you actually start thinking about, "What is the domain of this problem and what are the protocols of interaction? What are the domain events that I'm going to start dealing with?" And start from that perspective, seeing what you get out as a result. It's a different way of thinking, rather than just being tool and framework-oriented, and playing with the latest thing. And so try to discourage some of the really nasty tools. On that note, I'll thank you very much.




Recorded at:

Apr 11, 2019