The protocols we use are very important in many aspects and should be studied and practiced more, Martin Thompson claimed in his presentation at QCon London 2019. He first looked back at the evolution of mankind, arguing that protocols are the most significant human discovery, and then made a critical analysis of the protocols we use today.
Thompson, a high-performance and low-latency specialist, started by defining the term protocol from both a general and a computing perspective: a code prescribing strict adherence to correct etiquette and precedence, and a set of conventions governing the treatment and formatting of data in an electronic communications system. He emphasizes the word precedence because it's about timing: the order in which things happen. In concurrent and distributed systems, this is key to making them work correctly.
Documentation
Thompson highly recommends that we document our protocols. Comparing an API with a protocol, he describes the API as a one-dimensional and anaemic view. He notes that protocols are not complicated, and used a file system as an example, written in a pseudo-regular-expression style syntax:
Open, *[Read | Write], Close
After you open a file, you can do zero or more read and write operations, and finally close the file. He points out that besides describing the operations available, this also tells you the precedence of their usage, leading to fewer mistakes; an API alone will not give you this. You can then expand this and describe each operation:
- Open: …
- Read: …
- Write: …
- Close: …
This describes the interaction in a very simple way and is, for Thompson, a good way to improve the quality of the component or service you are developing. Usually, this is how he starts working on a concurrent and distributed system. He also recommends thinking about and documenting the events that happen in a system and what their pre-conditions, post-conditions, and invariants are.
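As an illustration, a protocol documented this way can also be enforced in code. The following minimal Java sketch (the class and method names are invented for the example) models the file protocol above as an explicit state machine, rejecting calls that violate the documented precedence:

```java
// A minimal sketch (invented class and method names) of enforcing the
// Open, *[Read | Write], Close protocol with explicit precondition checks.
public final class ProtocolFile {
    private enum State { CLOSED, OPEN }

    private State state = State.CLOSED;

    public void open() {
        // Pre-condition: the file must not already be open.
        if (state != State.CLOSED) {
            throw new IllegalStateException("open() requires CLOSED, was " + state);
        }
        state = State.OPEN;
    }

    public byte[] read() {
        requireOpen("read()");
        return new byte[0]; // actual I/O elided
    }

    public void write(byte[] data) {
        requireOpen("write()"); // actual I/O elided
    }

    public void close() {
        requireOpen("close()");
        state = State.CLOSED;
    }

    // Invariant for every operation between open() and close().
    private void requireOpen(String operation) {
        if (state != State.OPEN) {
            throw new IllegalStateException(operation + " requires OPEN, was " + state);
        }
    }
}
```

Calling read() before open(), for instance, fails immediately with a clear error instead of silently violating the protocol.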
Another recommendation from Thompson is to think about things that can go wrong when you look at the steps or interactions in a system. A simple example is asynchronous communication: you send a request and wait for a response. What can go wrong? You may, for instance, not get a response at all, which is something you have to deal with.
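One minimal way of dealing with the missing-response case is to attach an explicit timeout and a recovery path to every request. The Java sketch below assumes an invented sendRequest helper and uses CompletableFuture.orTimeout from Java 9:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

public final class RequestWithTimeout {
    // Invented stand-in for an asynchronous service call.
    static CompletableFuture<String> sendRequest(String request) {
        return CompletableFuture.supplyAsync(() -> "response to " + request);
    }

    public static void main(String[] args) {
        sendRequest("hello")
            // The response may never arrive, so bound the wait explicitly.
            .orTimeout(500, TimeUnit.MILLISECONDS)
            // Recovery path: retry, fall back, or surface the error to the caller.
            .exceptionally(error -> "fallback after: " + error)
            .thenAccept(System.out::println)
            .join();
    }
}
```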
Multicasting
A more realistic and complex example involves multicasting. If you want to build a chat room that shares a lot of information among a lot of users, multicast is a common solution. If you implement this with TCP, you will have a scalability issue, because you must send the data separately to each client and handle acknowledgements (ACKs) back from all of them, which may cause an implosion of network traffic.
With multicast, using UDP, you can send data once to all the clients, but this also has some issues. UDP is not reliable, so to know if you lost data you need acknowledgements, but then you are back to a scalability problem: an ACK implosion.
A different way of thinking, and a different protocol, is to negatively acknowledge (NAK) when data is received out of sequence or not received at all. This also has its issues, because if data loss occurs, all clients will likely experience it and return a NAK at the same time. This will cause the server to resend all the data to all clients, causing a network meltdown.
To deal with this, Thompson refers to a 1997 paper by Floyd et al., A Reliable Multicast Framework for Light-weight Sessions and Application Level Framing, which introduced an algorithm where a client waits a very short random delay before returning a NAK, and only returns it if no other client has already done so. This minimizes the number of NAKs sent, while still getting all lost data resent. For Thompson this is a nice and simple solution, a simple algorithm that scales really well, representing the essence of a good protocol: thinking about a problem in new ways.
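The following Java sketch illustrates the suppression idea under simplifying assumptions (invented names, a single scheduler, and the actual multicast I/O elided): on detecting a gap, a client schedules its NAK after a short random delay and cancels it if it observes another client's NAK for the same sequence number first.

```java
import java.util.concurrent.*;

// A sketch of the NAK-suppression idea from Floyd et al.: delay each NAK by a
// short random interval and suppress it if another client's NAK is seen first.
public final class NakSuppression {
    private final ScheduledExecutorService scheduler =
        Executors.newSingleThreadScheduledExecutor();
    private final ConcurrentHashMap<Long, ScheduledFuture<?>> pendingNaks =
        new ConcurrentHashMap<>();

    /** Called when this receiver detects that sequence number seq is missing. */
    public void onGapDetected(long seq) {
        long delayMs = ThreadLocalRandom.current().nextLong(10, 50); // short random delay
        pendingNaks.computeIfAbsent(seq, s ->
            scheduler.schedule(() -> sendNak(s), delayMs, TimeUnit.MILLISECONDS));
    }

    /** Called when a NAK for seq from any other client is observed on the group. */
    public void onNakObserved(long seq) {
        ScheduledFuture<?> pending = pendingNaks.remove(seq);
        if (pending != null) {
            pending.cancel(false); // someone else already asked; stay quiet
        }
    }

    private void sendNak(long seq) {
        pendingNaks.remove(seq);
        System.out.println("NAK " + seq); // real multicast send elided
    }
}
```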
Protocol considerations
When we look at protocols in general, there are many aspects to consider, but since his presentation was in the performance track, Thompson highlighted a few important ones with a focus on performance:
- Encoding. Don't use text protocols; use binary protocols. Thompson emphasizes that this is the single most important change that will make a difference. He also points out that a text codec is not really human-readable either: we use a tool, an editor, to read it, so we can just as well use a tool to read a binary codec (see the encoding sketch after this list).
- Sync vs Async. Synchronous protocols are very slow, and they block. With increased latency they get even slower and you get less done. Using an asynchronous protocol, you can get much more done, and it still works fine with increased latency. A common argument for using synchronous protocols is that they are so much easier, but Thompson claims they are not. For him it's about how you think about the problem and how you manage state: you should design for asynchronous communication from the beginning, with a state model and a proper state machine. His experience is that synchronous systems are easier to start with but get much harder as complexity increases, while asynchronous systems are a bit harder to start with but then stay at the same level of difficulty.
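To illustrate the encoding point from the first bullet, here is a toy Java comparison of the same two fields written with a text codec and with a fixed-width binary codec; the format is invented for the example and does not correspond to any particular wire protocol:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// A toy comparison of text vs binary encoding of the same fields.
public final class EncodingExample {
    public static void main(String[] args) {
        long timestamp = 1_552_000_000_000L;
        int quantity = 42;

        // Text codec: numbers rendered as characters, parsed back on receipt.
        byte[] text = ("timestamp=" + timestamp + ",quantity=" + quantity)
            .getBytes(StandardCharsets.US_ASCII);

        // Binary codec: fixed-width fields written directly, no parsing needed.
        ByteBuffer binary = ByteBuffer.allocate(Long.BYTES + Integer.BYTES);
        binary.putLong(timestamp);
        binary.putInt(quantity);

        System.out.println("text: " + text.length + " bytes, binary: " + binary.position() + " bytes");
        // A hex-dump tool makes the binary form as readable as an editor makes the text form.
    }
}
```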
We should use events more. Thompson emphasizes that the real world is distributed and decoupled in space and time, and we should use events to model it. Most, if not all, real-world protocols are asynchronous, but we enforce synchronous protocols on top of them. He strongly recommends providing an asynchronous interface when you are designing an API; then, if you really must, add a synchronous wrapper around it. Don't just build a synchronous interface, because then you are stuck with it.
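A minimal Java sketch of this design direction, with an invented PriceService as the example, keeps the asynchronous interface primary and layers the optional synchronous wrapper on top:

```java
import java.util.concurrent.CompletableFuture;

// The primary, asynchronous interface; the service and its names are invented.
interface PriceService {
    CompletableFuture<Double> priceOf(String symbol);
}

// The synchronous convenience wrapper is layered on top, not the other way around.
final class BlockingPriceService {
    private final PriceService delegate;

    BlockingPriceService(PriceService delegate) {
        this.delegate = delegate;
    }

    double priceOf(String symbol) {
        return delegate.priceOf(symbol).join(); // block only at this outer layer
    }
}
```

Nothing in the core API forces callers to block; blocking is a choice made at the edge, and only by the callers who need it.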
Looking at protocols that he thinks have performance issues or other problems, he points out:
- RPC – HTTP – TCP, a common stack of protocols used today that Thompson considers inappropriate. TCP was not designed for request-response; it's a great protocol, but not for what it's being used for here. We then wrap it in HTTP, a document-fetching model, and on top of that we put RPC, which has been known to be broken for years. He notes, though, that TCP Fast Open, QUIC and TLS 1.3 are attempts at fixing the problems.
- 2PC / XA. Two-phase commit is an example of trying to make what should be our problem someone else's problem. He points out that these protocols are not fault-tolerant, and refers to a paper by Jim Gray and Leslie Lamport: Consensus on Transaction Commit.
- Guaranteed Delivery, something that has been proven wrong so many times. For Thompson, applications should not rely on the underlying transport to deal with all errors that can occur; they should have their own feedback and recovery protocols (see the sketch after this list). He also notes that we must depend on the behaviour of a protocol, not on its implementation.
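As a sketch of what such application-level feedback and recovery might look like (invented names, network I/O elided), a receiver can track its own sequence numbers, buffer out-of-order messages, and ask for retransmission itself rather than trusting the transport:

```java
import java.util.TreeMap;

// A sketch of application-level recovery: the receiver tracks sequence
// numbers itself instead of trusting the transport to guarantee delivery.
public final class AppLevelRecovery {
    private final TreeMap<Long, byte[]> outOfOrder = new TreeMap<>();
    private long nextExpectedSeq = 0;

    public void onMessage(long seq, byte[] payload) {
        if (seq < nextExpectedSeq) {
            return; // duplicate of something already delivered
        }
        if (seq > nextExpectedSeq) {
            outOfOrder.put(seq, payload);            // buffer until the gap is filled
            requestResend(nextExpectedSeq, seq - 1); // the application's own feedback
            return;
        }
        deliver(payload);
        nextExpectedSeq++;
        // Drain buffered messages that are now in sequence.
        while (outOfOrder.containsKey(nextExpectedSeq)) {
            deliver(outOfOrder.remove(nextExpectedSeq));
            nextExpectedSeq++;
        }
    }

    private void requestResend(long fromSeq, long toSeq) {
        System.out.println("resend request: " + fromSeq + ".." + toSeq); // real send elided
    }

    private void deliver(byte[] payload) {
        // hand off to the application
    }
}
```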
Thompson concludes by arguing that protocols should be studied and practiced more; they are very important in many aspects.
The slides from the presentation are available for download. Most presentations at the conference were recorded and will be available on InfoQ over the coming months. The next QCon conference, QCon.ai, will focus on AI and machine learning and is scheduled for April 15 – 17, 2019 in San Francisco. QCon London 2020 is scheduled for March 2 – 6, 2020.