Facilitating the Spread of Knowledge and Innovation in Professional Software Development

Write for InfoQ


Choose your language

InfoQ Homepage Presentations Building a High-Performance Networking Protocol for Microservices

Building a High-Performance Networking Protocol for Microservices



Robert Roeser and Arsalan Farooq talk about how techniques used in Fintech and Adtech – such as zero copy, flyweights, composite buffers, pooled memory, shared memory transport, and direct memory in languages like Java – can be used to improve performance in distributed applications.


Robert Roeser is a co-founder and CEO of Netifi, where he is working to lay the groundwork for the next generation of cloud-native applications. Arsalan Farooq is CEO of Netifi, where he is focused on bringing high performance, application networking to the traditional enterprise developer with minimal effort.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.


Farooq: Topic is RSocket which is a new application-level networking protocol, well not new, it's been around for a while, but new in terms of concept. I'm Arsalan Farooq, the CEO at Netifi. I'm a bit of serial entrepreneur, as well as I grew up at Oracle, so lots of history dealing with enterprise-class deployments and infrastructures.

With me today is Robert [Roeser]. He's our Chief Innovation Officer, which is just a big word for the smartest guy in the company. I'll just very quickly introduce the project and then we'll let Robert [Roeser] talk about things that you guys are more interested in. We're Netifi and RSocket is an open-source protocol that was basically germinated sometime back inside Netflix and other places, and right now we're working with Facebook, Alibaba, Pivotal and other partners who are on board and we are using and promoting RSocket.

We're hoping that us together and with other incoming partners will be able to change the fact that few people have heard of RSocket, hopefully more will have and will use. I should mention that Facebook, Pivotal, and Alibaba, they're all using RSocket internally. Pivotal's, for you Java developers, Spring Framework 5.2 is coming up next month and that has RSocket built-in. If you like to use Spring, you'll be able to get on RSocket immediately.

Basically, it's an application layer 5, 6, loosely speaking, I call it layer 7, but I know I'm wrong. It's got built-in reactive semantics, so back pressure, composable, flow control, all of that stuff is built into the protocol. You should be thinking, compare that to HTTP or other est type protocols you've heard, and we'll see how that works in a minute. It does have flow control built-in, which means that you don't have to worry about things like retries and other things that you have to write in your glue logic when you write your applications, the network and the protocol will take care of it for you.

For those of us who are old enough and remember TCPIP at a lower level that had flow control built-in, and that's why we'd never have to write a lot of the glue code that we have to write at application level, so going to RSocket will change that. Binary encoded – it’s message-passing based, which Robert [Roeser] will talk about why that's important, and allows you to get asynchrony in your applications very quickly. I'll hand it over to Robert [Roeser].

What’s Needed for High Performance Microservice Networking

Roeser: I want to talk about what's needed for high-performance microservice networking. When I thought about this talk, I came up with a list of things that I think are important for building performing applications. Firstly, we need temporal and spatial decoupling. Next, you should actually be sending binary across the network, you shouldn't be sending texts and coding. Then finally, you need application flow control, because you want a consistent application, an application that isn't consistent, that has bad tail latency in a distributed system is actually very slow.

Let's look at loose coupling. When I think about loose coupling, I think about spatial and temporal decoupling. What spatial decoupling is, is a message sender should not directly call a destination, instead it should pass a message to somebody else through a stream or a channel. Then temporal coupling is that when you call something, the thing that you call should not block the execution, so, you can take that as like nonblocking code, and this will let you more efficiently use your resources.

The Difference Between Message-Passing and Event-Driven

RSocket is loosely coupled because it uses asynchronous message-passing. When we talk about message passing, something that comes up a lot is, what's the difference between an event-driven architecture and a message-driven or message-passing architecture? I created a quick list of things here. The really big thing to take away from this is is that, in a message-driven architecture, you send messages to a destination. In an event-driven architecture, what you do is you actually listen to events emitted from an entity. One way to think about this is, message passing would be your boss tells you to go fill your timesheet out. An event-driven architecture would be more like your boss actually sends an email to everybody and says, "Fill out your timesheet."

Let's take a look at what that looks like. On one side, you have something that emits events, in the middle you have some kind of queue usually, and on the other side you have things that are listening for events. It's all good, you begin to emit events, things go in the queue, many people are familiar with this, and you begin to just pull stuff off the process it. The events that are sent aren't actually correlated to anybody, so anyone can go ahead and just pull it off and you can't actually tell where the message originated and where it came from. They just go into the queue and people go ahead and process this.

This is generally ok, except it ends up leading to a few problems, and some of those problems are, how do you actually send a response back to someone or send a message back to a particular person emitting events? It's very difficult to do this. You can't do this because you generally don't actually know where the event came from, Besides just sending an event, you might actually want to send an exception. What could happen is, you might have a device that's emitting payloads that are causing exceptions. Those payloads go into the queue and you can't actually tell which device is doing that because there's no correlation between the thing emitting the payloads and the processing causing the exception.

Then finally, back pressure is a lot harder to deal with in this kind of environment, so you can only really do back pressure around a queue right now. You can't actually tell one of these things emitting events to slow down and stop sending you events, because there's no signal. You can't actually say, "You're sending things too fast for me to process," What could happen is, you can have one bad device or one bad deployment that actually goes ahead and blows up your queue for everybody else, because you can't say, "Stop sending me stuff."

Actually this is kind of funny, at Nike when I worked on the FuelBand, we call this the "Mark Parker Problem." Mark Parker is the CEO of Nike, and every time we gave him a FuelBand, inevitably he would have a bug and then buried halfway through the queue of his device he'd end up actually breaking the FuelBand so he'd have executive tech support come to fix it. I'm very familiar with this thing, and it ends up causing a lot of issues. Let's take a look at how this would work with message-passing.

In message passing, instead, you have different destinations, What the destinations do is they begin to send messages to each other over distinct streams. Destination six sends a message to destination one. Conceptually, unlike event-driven architectures, something that receives a message can go ahead and pass it on to somebody else. In this case, destination three gets a message, it can send it off to another place. Then, more importantly, you can actually go ahead and send messages back and forth between different destinations. You can use this for request reply or you can also begin to use this for other things, signal, maybe you need to slow down or send an exception for someone to handle.

Let's take a look at the way that RSocket uses message passing in the protocol. RSocket has a bunch of different interaction models and there's some really great talks about it, I won't go over that. One of the interaction models it has is the ability to create a bi-directional channel between two different destinations. The way that it starts doing that is it goes ahead and sends a message over from destination two to one, it says, "I want to do a bi-directional channel." RSocket has this built-in idea of back pressure. Even though destination two sent a message over, nothing will be emitted until the person goes ahead and asks for it. What happens is, it sends a message and says, "Go ahead and send me three different payloads." It begins to go emit the different payloads at the rate it cares about, and then importantly too, it doesn't actually have to emit them at any set rate, so it can choose not to emit it or, if it ends up chewing up too much resources, it can just stop.

It emits these payloads and then when it gets to the third payload it stops sending anything else. If I talked about the previous example in an event-driven architecture where you go ahead and have this rogue program at submitting data, it can't actually happen in this, because what will happen is, it'll emit the three pieces of data and then it will just stop, it won't send any more data and you won't actually have that situation where one bad actor can go ahead and just take up all your system resources.

Then because this is a bi-directional channel, destination one can go ahead and say, "Go ahead and send me data back as well." But two doesn't have to do that, instead it can go ahead and send other messages, which is "Go ahead and complete this transaction, I want to stop doing it."

Then the thing that's actually the most useful in this architecture is, let's say the third payload is actually causing you issues. When destination two gets the third payload in an event-driven architecture there's no way to actually go ahead and really correlate that back to the area where it's emitted, but in RSocket, when it gets it, it goes ahead and basically sends an exception frame back to the caller so that you can very easily see "I sent this message to this guy, and then the message comes back." This is how RSocket is loosely coupled. It is spatially decoupled because it sends binary messages over a dedicated stream and it is temporarily decoupled because all of its APIs are nonblocking.

What Type of Messages Does RSocket Use?

What kind of messages does RSocket send between the different destinations? RSocket sends binary payloads between the different destinations and it accesses those payloads and processes them using a flyweight pattern, which is very efficient. It doesn't have to allocate objects that do this and it allows you to do things like zero-copy in a very efficient manner, and it does this across languages. The different implementations of RSocket, whether it's in Javascript, Java, C++, all use this same pattern, so they don't create extra garbage.

How Does RSocket Process Messages?

Let's take a look at how RSocket actually does this processing. This is something like when you move to threaded code and multi-threaded applications that you can get wrong really easily and actually create a situation where it looks like you might be CPU-bound but you're actually not. I have a really simple application that's just doing some work, pretty easy to see it's going to go ahead and count. What this conceptually looks like is, there's one thread going ahead and calling this simple little program and running it. It turns out this is incredibly fast. On my laptop, this took less than a microsecond to run. In a microsecond you can count to two billion on a laptop. Of course, we need to like make this faster and everyone knows the best way to do that is to just go ahead and add as many threads as possible. We get 128-core box, we're going to just add 128 threads to this guy and we are going to totally blow this out of the water.

Not only that, we're going to use atomics and just make the best program ever. How long do you guys think this took to run? This is what it looks like, everyone is helping you out counting. Everyone's going to lend a hand and add to this, but, when I ran this, guess what happened? I didn't wait for it to finish it. Let it run for 10 minutes on my laptop and it never completed. In fact, I didn't actually think it was working, so I'd changed the target of 2 billion to 2000, it still took 40 milliseconds to actually run.

What's happening is you're causing contention in the CPU. Not only are you not on your CPU cache, which is very fast, you're actually in slow main memory. Then you have all these people going in at once trying to go at it. The way I like to think about it is, imagine you have an eight-lane highway. Then at the end of the eight-lane highway, it all goes down into one road and you're just trying to jam as much stuff through. When you run something like this, you can actually go look at your task manager and it looks like you're running at 800% CPU on my laptop, it looks like it's just churning through stuff and I'm "I must be totally CPU bound." You're not, because you saw the last application ended up like taking less than a microsecond.

What can we do? This is something that happens a lot behind the scenes in RSocket and other interactive libraries. What we want to do is, we want to take our application and turn it into a bunch of single-threaded applications for you. What we end up doing is we end up controlling threads accessing the work we want to do with a shared variable, and then locally, we go ahead and cache it. We end up doing the work, so it's all in the same CPU, so it's very fast, and then when we're done we go ahead and we make it available to everybody. What this ends up looking like is this, you basically have a bunch of little single-threaded programs that run. There's way less contention, so when you go ahead and run this, it also ends up being pretty quick, it takes less than a millisecond to go through and churn through all this. A nice side effect of this is, the threads that aren't actually doing the work are free to go do other stuff, They're "I want to do counting, but I can't count," so they might go do some other application.

The way this actually ties back into message passing is replaces this silly little counter with a message queue or something. Everyone's pulling the message queue, one person wins, they get to drain the queue, the other people are free to go off and do other stuff. But no one actually wants to program like that all the time. I can tell you it's pretty awful to deal with, it gets way more complicated. Thankfully, we have libraries available that abstract this away for you and end up actually doing the same thing under the scenes. They end up going ahead and doing all that complicated work. You end up getting the same result that happens is if you did this by hand. You get these little single thread applications, and when you run that application, it ends up taking less than a millisecond as well.

To circle back to RSocket and how this connects together, reactive streams abstracts away message passing for you. The interesting thing about message passing is, there is no difference in message passing between something that's on-box or off-box, like for the caller. If I call something, I don't necessarily have to have it be on the same box as I'm on. What RSocket ended up doing is taking that abstraction and then creating a protocol that makes that work across network. The end result is, you can basically take this reactive code that was there earlier and swap in RSocket without actually changing any of the calling code. As far as the person making the call is concerned, acts as if it's on-box, or just some asynchronous code that's running.

If we go back and look at the last little snippet of code, what we have here is, the piece that goes off and does the counting is actually encapsulated in a reactive streams context, and we can go ahead and switch it with a really simple little request-reply piece. Then if you go look, the person that's actually doing the calling isn't changed at all. As far as they're concerned this could just be some threaded code. You can begin to go ahead and use this to build out more complicated applications that could either run on-box, run off-box, you combine them together, and as far as the person calling it, there's no difference to you. Then conceptually you end up with this: a bunch of applications with the work paralyzed out where you don't have to go ahead and worry about it.

To go ahead and review what we talked at the top, we need something like temporal and spatial decoupling. We want to make sure that we pass messages through a stream or channel to caller. We need spatial decoupling which is sending a message through a channel or a stream. We need temporal decoupling, which of course is like nonblocking APIs. We want what we're sending across the network to actually be encoded in binary so it's efficient. I should mention how much more efficient the binary is. The encoding RSocket, the frame encoding in binary can do about 10 to 20 million encodes a second versus some of the faster stuff like JSON, you can do like 30,000 a second, so it's not an insignificant difference when you switch to stuff like that.

Then finally, you need some kind of application flow control. The reason you want application flow control is, when you move to a distributed system, as you begin to make network calls, your tail latency actually dominates the entire request. The slowest requests that you make ends up actually slowing down your whole application. Using something like flow control, it actually evens out your latency and controls your tail latency which actually makes your entire application faster, and that's the stuff that RSockets provides you for free basically using it. Then RSocket is going to go ahead and go over what this actually works or looks like in the real world.

Real-World RSocket Performance

Farooq: It works in concept, so the question we asked ourselves was, we can build small test programs and show ourselves that we're millions of times faster than other things, but, what happens if you build a real-world example? What does that look like? What we did was, we applied it to the microservices problem, because it's a distributed computing flavored Azure. What we're going to do is use a test that's out there - some of you may have heard of it, it's called ACME. It's a bit of a test and test harness that was built to simulate real-world networked microservices infrastructures. Good news is, IBM actually tests and publishes the test results on a specific system that's listed here on a regular basis.

What we said was, that's a good benchmark. This is what these days is known as a service mesh - Istio, Linkerd, a couple of others. It's a rest-based, proxy-based, networking scheme that people are trying to use for building their highly-granular distributed microservices applications. We have this test that's published and I'd like to point out a couple of things in this test. These are not our numbers, these numbers are available online, published by IBM. They ran the test. The first thing I'd like to point out is, they got up to around 60 users. If you look at the throughput, you see that it's starting to level out.

The throughput chart there in the middle is telling us that we're reaching some sort of a capacity constraint there, so, they were able to ramp it up to 60 users and did a pretty good job. They created a throughput of around 4.4K requests per minute. Then they delivered the response time profile of 12 milliseconds, so 4.4K for 60 concurrent users, simulated users, in a real application at 12 milliseconds performance delivered and that's where they started tapping out in this, and they run this regularly so you can go and get the updated results any time you want.

We said, "That's working on HTTP rest, all the good stuff, let's run this entire thing based on RSocket," and so we did. Stare at that for a second: same test, same machine, same everything. We got to 60 users, delivered a throughput that was four times greater at a response time of a third of what they had. Not only that, we are still ramping in the throughput chart, so there's capacity available. 4X the throughput in a real-world system at one third the latency just by using RSocket. We got very excited - by the way, there's a blog post that lays out all of this, it's listed there. We said, "How far can we go?" Same system since we had more capacity.

We went to 300 users, got to 27 - close to 30K - and look at their response time, and compare that to the 60 user response time, and then we just got crazy. We said, let's just go all the way, we turned it up to 11, got to a thousand users on the same system, concurrent users, and the system kept churning, kept delivering more throughput, 37.8K and the response time starting to spike. RSocket, with all of the flow control built-in, etc, all these things are acting together to start yielding a little bit on response time but still generating more throughput in the system. All we did was, did the same test except we did it on RSocket.

This was the moment, and this was when we realized that all the stuff that Robert [Roeser] just talked about, when applied in a realistic sense can be, at least we think, pretty game-changing. We invite you to look at this, take a look at the blog post, you'll see what our methodology was, and also, if you're interested, we've got some good quick starts so you guys can get started if you're not familiar with RSocket on our website.

That's pretty much it. I wanted to frame this in a real-world scenario because sometimes it's hard to see what the basic theory is going to get you.

Questions and Answers

Participant 1: There are different companies when you're making service to service calls. Are service discovery, serialization and deserialization part of the RSocket?

Roeser: There's RSocket protocol, and then on top of that we have a routing and forwarding protocol that builds on top what RSocket offers. With it you basically don't need separate service discovery, load balancing. We have built-in serialization and then coming in like the spring framework. This is on the Java side, there's built-in integration with serialization stuff into there, and then other languages as well have sterilization built-in. What you do is you just set up the application, they point at each other and then all the, everything is taken care of.

Participant 2: I think GRPC has two components, the server-side and then peer-to-peer communication. GrPC peer-to-peer is similar to on-stream or how does it compare?

Roeser: There’s one thing I didn't talk about, and there's other talks that talk about this. In RSocket, once you make a connection, both sides are co-equal. There isn't really a client and a server in RSocket, there's the person that made the connection and that's it. I can connect to you and then actually act like a server, and in both directions. On either side you can actually set up clients and servers and exchange messages. We have a portable serialization, if that's what you like. We're looking at GraphQL and then there's more of the standard, kind of more straight-up message passing kind of stuff where you just send messages back and forth.

It's this nice layer with pluggable transports, which I also didn't talk about, so you can say, "I want this one web socket so it works in a browser, I want to use TCP," I have a shared memory driver based on some stuff Todd did. Then you can go ahead and plug that stuff in and pick what you want and then pick the API that you want to use to talk in your application.

Farooq: Just to summarize, there's no server or client distinction because it's a completely symmetric scheme, so you don't have to worry about. We have some demos where we show an entire service class demo running inside a browser in Javascript that is being accessed by other browsers, so you can do those kinds of things with this thing.

Participant 3: My question specifically is for our message compared TR.

Farooq: Performance or in features?

Participant 3: Performance.

Roeser: We could do with 92 megs of ram, we could do like 800,000 RPS and they died.

Farooq: You'll see similar deltas and we can point you to, if you visit our blog, similar deltas to this, in fact, better in many cases.

Participant 4: When you had the graph up with destination one and two and destination requests and say "You can send me three packages or three response packages," is destination one dependent on getting all three packages? And what kind of retry methodology or mechanics do you have then?

Roeser: First off, it's this request and semantics from reactive stream. I send you three, that means you can send me three messages. You don't actually have to send them, I could even say, "Send me eight." That's like these tokens you can use up, once you use it up then the rules say you can't send anything else. As far as retry semantics go, once a stream receives a terminal event, like an exception, the stream is torn down. Because of this message passing, what you can do is you can intercept the exception and then not hand it back and then just do a retry. That's built into the frameworks around it, so it's not something you do yourself, you just basically dot retry.

Participant 5: This seems somewhat reminiscent of and I'm wondering if you're familiar with that and could sort of maybe draw some distinctions, because I'm more familiar with that.

Roeser: Nats is probably more like a protocol for a message broker. One thing they did that some people liked, and that maybe I don't like as much, is that it was string-based. It's a very simple text protocol so it's easy to understand and maybe debug, but then you get this trade-off with having things encoded in string, and it's actually really significant to do that. You can end up making all your work, like JSON and text.

I had a Joke with a guy at work. If you just stopped any box randomly at amazon and you go look at it, 99 times out of 100, they will be parsing text. You're basically like burning money. The other thing I said, if you care about the environment you shouldn't use text. That's a huge difference, because it's very inefficient to do that, and it's one of the big things. Then the different interaction models and the final thing is the pluggable transport piece.

When we built RSocket, we wanted to make it so that you could pick a transport that was appropriate for a use case. Sometimes it's web sockets, sometimes it's shared memory, sometimes it's TCP. We want to future proof it so if something like Quick comes out, you don't actually have to change your APIs, you just say, "I'm going to use Quick," and that's it.

Farooq: Robert [Roeser] briefly mentioned the various interaction patterns that you can have. If you just look at resting HTTP, you've got a request response, that's it. In RSocket, first-class interaction can be request response, fire and forget channel or a bi-directional channel, so, stream in channels. These are all first pass primitives of the protocol. A lot of the work that people do with buffering and cues, etc, are actually handled straightaway inside the network protocol itself, so that's something we didn't get to spend time on, but if you're interested you should probably take a look at it, you'll find that very interesting.

Participant 6: Being that you are using [inaudible 00:31:54] connections how do you deal with load balancing?

Roeser: When you connect you get an RSocket object. Then what we do is we actually collect those RSocket objects together and you frankly just load balance across the connections. Are you talking about application load balancing or wanting to go ahead to load balance and quiess and move them between?

Participant 6: On level, if you're in Kubernetes, you've got three pods that all do the same thing and they're communicating with two other pods that do something else.

Farooq: It's like ELBs talking.

Participant 6: If you have one pod that goes down and is replaced, how do you not overload the one pod that?

Roeser: This is actually something that we do that is pretty cool, because there's back pressure and this availability metric, we can actually look and see how much load you can handle. We also take a look at the latency that you have and then we actually predict the latency. If you're not within a window, we can slow you down and speed you up. We get this bucket of persistent connections come in. We actually use a flyweight for the load balancer, put it over the bucket of things you want to talk to, and then pull two at random and then actually go ahead and find the best candidate and send it.

If one of them goes down, very quickly it moves them off and goes ahead and sends it to somebody else. Here's the other cool thing, because there's this idea of back pressure. If someone's down, they're not actually sending you a request and to send you more data, so you'll actually just stop getting having traffic sent to them too.

Participant 6: Do you then also do take over? If you did have one node that did go down but was actively trying to communicate or process some sort of work with another service.

Roeser: There's this concept in the protocol called resumption. If you have a flappy connection, something goes down, what it does is it actually makes it seem like a stable, persistent connection. Connection flaps, buffer stuff on either side, someone will pick it up and fill in where you left off, and so that's how you would do something like that.

Farooq: Resumability can be both warm and cold, so mobile use cases, unreliable networks, connections, it's a natural to use RSocket.

Participant 7: Some of the things require eventing. Does RSocket have anything to support eventing or it's completely different?

Roeser: Do you mean event sourcing?

Participant 7: Yes.

Roeser: I've been asked this before, we haven't built anything right now, but it would be very natural. Another thing we didn't talk about is it's connection-oriented. When I connect to you, all the streams that I have come on that same channel, you could begin to go ahead and save them. You could maybe make the case that resumption could be a form of event sourcing too, so you might not need to use it quite the same way so you could probably use resumption as a fill-in for event sourcing in some use cases I think.

Farooq: It's important to understand that we're not advocating for death to all event queues, we're not saying that. I mean there is a natural use case for events, maybe your application requires a queue, then use the queue, it's there. We're also not advocating that you throw out all of your MQ and your Kafkas because now you can actually do. Where RSocket becomes very useful is when you are using those mechanisms. You may be using rabbit, or you may be using Kafka because you have bi-directional streaming interaction, and there is no such facility in rest or HTTP, so you basically use something that was designed to ingest today and consume tomorrow as to do your bi-directional streaming. It's those cases that tend to introduce a lot of infrastructure in your applications.

There's a place for queues, there's a place for Kafkas of the world, and the Rabbits of the world and even HTTP, I mean it was designed to schlep documents, that's great. The important thing is that, oftentimes you end up being forced in a place where you're using some of these architectural elements, these infrastructure elements unnaturally because you have no other options, and that's where a massive amount of performance degradation etc comes in.

For example, if you just take the example of these service mesh thinking that's going on these days, what is a service mesh? What is the problem it's trying to solve and how is it solving it? A service mesh is basically a response to the fact that I have two services sitting apart from each other and they need to be able to talk to each other in a secure controlled fashion.

The response is, let's put a sidecar there, which is a reverse proxy. Let's put more proxies there, let's introduce service infrastructure, discovery infrastructure, all of that stuff, build all this apparatus of layers upon layers so that service A can talk to service B. Well, we have a model for this, it's called networking. That's how you talk, one service talks to another service, not proxy upon proxy.

Is it saying that you can't ever use Rest or you should never use a service descrip...? No, that's not what we're saying. The question is, when you are trying to build a performant, simple to deploy and use system, do you have options other than building those big scaffolding around your services? That's one of the ways you use this.

Participant 8: How does the contract between two different services look like? Do you do cogeneration, or do you use a proxy?

Roeser: There's two ways you can do message passing between both of them, which is mark and arrest, you just send. Spring has this concept of billing into billing to other stuff like a named stream. You can do IDLs, so Protobuff, GraphQL, stuff like that. Like you said, you can bring your own API. You can pick what is appropriate for the use case that you're doing.

Farooq: Strong contracts are always good if your application can have. We do provide the obvious instruction, the IDL instruction.


See more presentations with transcripts


Recorded at:

Oct 21, 2019