Facilitating the Spread of Knowledge and Innovation in Professional Software Development

Write for InfoQ


Choose your language

InfoQ Homepage Presentations RSocket: Solving Real-World Architectural Challenges

RSocket: Solving Real-World Architectural Challenges



Ondrej Lehecka of Facebook, Robert Roeser of Netifi, and Andy Shi of Alibaba explain the use cases for RSocket within their companies, as well as how it can be used by enterprises to simplify the way they build and operate cloud-native applications.


Robert Roeser is the Co-Founder & CEO of Netifi. Andy Shi is a developer advocate for Alibaba group. He is mainly focused on Service Mesh and middleware technologies. Ondrej Lehecka is a Software Engineer at Facebook.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.


Roeser: I'm Robert Roeser, I'm a co-founder of a company called Netifi and prior to Netifi, I worked at Netflix on the Edge Engineering Platform team. Netflix's edge is resilient front door to all of Netflix's control plane traffic, so we have to deal with massive scale. You might be familiar with some of the projects our team created, like RxJava in Hystrix. When I was there, I actually helped work on the Java version of our socket.

I found this graph that InfoQ had, and I thought it was pretty interesting. It's design trends from 2019. I don't think it comes as a surprise to anybody that pretty much everyone is building microservices today, but the thing I found that was interesting about it was in the early adopter's section, it listed correctly built distributed systems, why is this interesting? Microservices are distributed systems, once you start with a monolith, you break it in these different components, you basically end up having distributed system problems, so you run into all sorts of things that you wouldn't have built before. What happens is people start on this microservice journey, and they end up running into complexities and issues that they wouldn't have had to deal with before. We had to deal with this a lot at Netflix, wo we ended up building tools like Hystrix to deal with the problems that we ran into with these systems. When we're thinking about how can we actually architect this and move this forward, we wanted to encapsulate a lot of learnings that we had, and build it into a protocol that we thought can make it easier to build these distributed systems.

RSocket’s Help in Distributed System Building

This is the impetus for starting to build out RSocket. What we wanted to do is we wanted to create a standard way for applications to communicate with each other across the network, we wanted to solve the ways that these applications communicate in a consistent way. We didn't want to do it between microservices, we wanted to deal with all the different devices that Netflix has to deal with, and all of the different back end systems. In most modern systems, people think of their distributed systems, just as their microservices, but it also includes all the web devices that you might have to deal with, all the set-top boxes, all the different mobile devices and how they interact.

When I was thinking about distributed systems, I came up with a list of things that I think that you need to actually go ahead and build a system when you're going to go decide to do this. From this list, I thought, "Okay. Where is RSocket help? Where can it provide help for you when you're building this system?"

Where RSocket helps is the communication model, it helps you define a way that applications communicate, how they interact with each other. RSocket also helps with how you transport data between the nodes in your system, so it abstracts that away and provides a way for you to cleanly transport data between nodes, but as a developer, not actually have to worry about the underlying transport and then, really importantly, it provides flow control.

RSocket’s Communication Model

RSocket is a Message-based Binary Protocol, in RSocket, there is a concept of a requester and responder, they interact with each other by sending binary frames between each other as messages. We picked binary for this instead of text-based because it's up to 30% faster, we found from testing to actually process binary data. When you get to a distributed system, latency becomes more and more important, the slower your system is, it actually affects calls down, your graph of services, creating a worse experience.

The other thing we want to do is make it Payload Agnostic. We have some use cases that required JSON, some use cases that required Protobufs, some that required Avros, we wanted to create a system that made it easy to go ahead and send these without needing to actually worry about how you're doing that. The other thing we wanted to do is make it multiplex and connection-oriented, it's really inefficient to have a protocol where each request, you actually go ahead and create a new request each time. What we wanted to do is basically have each device create one connection into our infrastructure, and then over that one connection, multiplex or send multiple requests through virtual streams. This way, we can make efficient use of the number of connections, and we could actually lower the number of servers that we required because it wouldn't have to maintain as many connections between devices.

The other thing this allows you to do is keep soft sticky state from the device in the actual application, so you could go ahead and keep some interesting data about the application you're working with, tied to the connection. When the connection goes away, it would actually go ahead and clean up that state.

RSocket Interaction Models

At the network level, RSocket provides some different interaction models, the most common one is Request Response, which is a stream of one. You send a request to a responder and they emit back a single response to you. The next interaction model is Fire-and-Forget, this is sending a single frame to a responder and they don't send back anything, so actually, it's not sending the request and ignoring the response, it's literally just sending the frame of the network, it's more efficient than sending something and ignoring it. The next model we've provided is Request Streams, this is more like a pub-sub model, so you send a single request and then get a stream of responses back. The final interaction model we provide is a bi-directional stream, you can send a stream of requests and get back a stream of responses.

Something that's pretty unique to RSocket is it's actually bi-directional. With RSocket, a client and a server are no different, the only difference between a client and a server is that a client initiates a connection to a server. Once the connection is established, they're co-equal members of the transit of the connection, a client connect to a server, and then the server can actually go ahead and request API calls from the client as if it's connected to itself. This lets you do stuff like actually have a web browser connect to a service, and then you can go ahead and call API's that exist in the JavaScript code running in the browser.

RSocket Transport

The other thing that RSocket did is it abstracted away the underlying transport, we wanted to create a protocol that you could pick a transport that was appropriate for the needs you were working with. If you wanted to communicate to a web browser and support WebSockets, we had some use cases for some high-performance storage, so we use Aaron for that. For our normal microservices, we want something where we can plug in TCP. The way it works with RSocket is the developer goes ahead and programs against the API, then you can switch out the underlying transport without actually having to go change the business logic. You could write an application once, provide an interface to it via TCP and then take that same application and make it available to, say, like a web browser via WebSocket.

RSocket Flow Control

Then, really importantly, on the Edge, we were constantly dealing with tuning Hysterix, dealing with circuit breakers, and tuning request throttles for incoming traffic, so we wanted to come up with the way that the underlying protocol could do this for us automatically, so we added a concept of Message-based Flow Control. A lot of flow control for network protocols is based on the number of bytes flowing through the system, that's really hard to reason about. It's very hard to figure out that if I send you 100 more bytes, than your system is going to blow up, but it's very easy to figure out that your system can handle 50 messages a second. The underlying flow control from RSocket is based on the number of messages flowing between this system. It still differs to the byte-level flow control, so if TCP is filled up, it will actually stop sending messages and send that downstream to other systems involved in the interaction. The other thing that happens is the flow control has no semantic loss between hops and a microservice called chain, so if your microservice ends up creating a call that calls five services downstream, the back pressure is actually transmitted seamlessly between the different nodes in the system. You don't actually end up with a slow system causing your service to fall over.

RSocket in Practice

In practice, HTTP works like basically drinking from a firehose, you have no way to control the data that is being sent to your downstream systems, but with RSocket, because you can control the flow of data, it's like drinking from a glass. If you don't have flow control, you end up working around it by building in circuit breakers, having to deal with retry logics, there's thundering herd, you get cascading failures, and all those sorts of things that people are used to in a distributed system.

To recap, where RSocket helps, I think it helps with your communication model, it defines the way that your applications communicate with each other across the network, the method that they do this. It abstracts away the network transport, so you can go ahead and plug in the appropriate transport for the use case you're working with and then it provides flow control between the nodes at the application level, so you don't have to be getting to work on stuff like circuit breakers and, like, throttling anymore takes for that automatically.

RSocket Use Cases

Lehecka: For the next 15 minutes, I would like you to have two things in your mind. One, we're talking about a connection-oriented architecture, that means, once the client establishes a connection to the server, then both client and server, act as a requester or responder, it's symmetrical that way. The second thing I would like you to keep in mind is that I'm going to walk you through two use cases, I'm going to show you how the architecture looks like when you model it with the sort of standard current technologies. I would like to invite you to think how the architecture can look like if you would have a ritual protocol. How would you design it with the most convenient technology if you can sort of start like from scratch?

My name is Ondrej Lehecka, and I'm from Facebook. I worked on the RSocket project for a couple of years and brought it through production for a couple of use cases, I'm going to talk about two of them. The first use case, which I'm going to talk about is going to be Live Queries. You will see that it's one of the main vehicles for getting data from our services and building UIs. The second thing I'm going to talk about, the Client Monitoring.

Live Queries

To set up for the first use case, imagine that you're building user interface for your app, you need to acquire or get some data from your back end system and use some query language. The main vehicle for Facebook is GraphQL, it's how we get data from the social graph, and then we display it in the UI. This is an example of how the NewsFeed query can look like. The two interesting aspects which I would like to look at is, there is a property called liked by and likes count. Those are the two properties which the user sort of is expecting that they will update on the UI, as the other systems are interacting with our system.

How would you build a UI to have this kind of interaction? You typically have a client who will issue a query, it's like a point query to your back end, so you say, "This is what I want." and the system, eventually, the backend system, like, responds back and you have the data to display. You know that some of the data updates as time goes, so the most simple thing you can do is to just introduce some periodical sort of refresh, so every so many seconds, minutes, hours defense, you just re-execute the query. We basically just display the new data and that's how the user gets interactivity with the system. If you think about it, it's quite a simple model, but it has a lot of limitations when it comes to scale, large scale, especially, like the one we have.

If the query is expensive, then you're wasting load of compute, especially for cases when the data didn't change. That's one sort of downside or one sort of problems which you deal, typically, on the backend side. On the client side your devices to wake up every so often, make the network call and effectively, you're just draining battery and often just wasting resources to make sure you didn't miss any updates, there are a lot of these cases where nobody really changed.

The incremental change and improvement can be that you sort of introduce a signaling service, it's essentially another service, which you subscribe to some event, which will fire to basically tell you that you need to refresh your UI. The typical sort of solution is that you do long polling over this signal so you, essentially get notified, and then you re-execute the query. This is just like an incremental sort of improvement.

Now, what if you would build it in a different sort of more effective way? I'll start again with the GraphQL Query because this is like a declarative way of how to get the data, you see that the only thing which changed is that we added the attribute @live. What it means for our end for components is that it actually goes through a different channel. It doesn't issue a regular HTTP request, but it goes through RSocket connection. The query, which we're executing with the @live attribute has slightly different semantic, it's essentially, get-and-subscribe on the results set. You want to get the data, but you also want to get new data as the data changes over time.

This is how the server architecture of our backend system looks like, the client is connecting to the server, in our case, we call it Live Server over RSocket, essentially, just establish a connection. The client can issue as many queries as at once to one system, the system remembers the client as a session. The interesting bit here is that once you sort of represent a client as a session, the client can come and go, you can survive disconnects, you can resume, but the sort of stay for the client stays as an instance of a session. The session also remembers the state of every query, so, you can also deal with cases where if the client is disconnected, what do I do with the query, which I'm about to deliver, and my client is gone? You can do all sorts of things like caching, and you can essentially just program for these cases.

You see on the right side that the Live Server is connected to the sources of data, I call it Reactive Data Source, but it's just like a general term. This is basically connected through another RSocket connection. When there is a change in the data, the reactive source just essentially sends the signal back to the server, the server evaluates what really changed, re-executes pieces of the query or the whole query, all happens on the back end, there's no round trip to the client whatsoever, then it sends the updated sort of results and back on the client. So far, the client didn't have to request anything, there was no time interval on refresh, so the client didn't really sort of wasted any resources to constantly pull.

If I recap this kind of use case, we built a stateful back end. We use the connection-oriented architecture, to basically save a lot of compute on the back end. We also improve the latency in the system because we can basically notify the client as soon as the data changes. You can imagine, the very typical would be, if you have a system, which sort of refreshes data every so often, then this will ensure your latency in the system. We also built the resumability in the protocol, so that now we can survive a short intermittent disconnect, which is very common. You're going to get too elevated or when you are switching from Wi-Fi to mobile network and back, you basically just lose connection and once you reestablish the connection, you want to just carry on in your session without re-executing all data. That was the use case where we prove the sort of efficiency of that architecture.

Client Monitoring - Architecture

For the second use case, I want you to imagine you have mobile applications, you have all sorts of different types of clients, and you have tens of thousands of them, maybe millions. Every device is collecting its own runtime information, it collects telemetry, counters, debugging information, logging, all sorts of things. Typically, when you design your system, you need to basically collect this logging, such that you are able to debug issue, and also when somebody logs a problem, or somebody sort of like files a bug, that you are able to go back and look at chases, and see if you can understand what actually happened.

In this kind of design, you have some log store and your client constantly update logs and telemetry into this log store, so you might have a last month of data for what was going on for some users. The problem is that if you have millions of devices, obviously, you can't save all of it. It also feels that, for every query, you can't really log, the request sort of payload and the response payload because that will be just extreme amount of data to store, so, typically, just sample the data and store it in the log store.

The same thing you can do on the server side, as the server is responding with a payload, you can log the payload, so that later on when you will try to find a correlation between a bug which I'm dealing with, like what data was flowing through the client, and the server to make some sort of sense out of it. If I will tell you that out of your million devices, you have issue with a specific type of device for users in a specific type of place, say, East Coast, only at a certain time, and they're filing that the system or their application is just not working, then what do you do?

What you need is to target this very small specific group of users and you want to extract as much data from them as possible, but it's sometimes very hard to turn only these specific devices to start sending you more detailed information. Debugging the system is like post-processing the log data, you have to find the entries which belong to these users and which belongs to this particular problem, and you analyze after the fact. What if we would build this from scratch in a way that you get real-time updates from the particular clients that we target, and you can debug the issue, not after the fact, but right away, as the users are interacting with the system?

In this model you have your monitoring/debugging tools, which contact your backend, your service, and say "For this particular user, this particular connection, I want all detailed information." One of the things I told you, we have connection-oriented architecture, so this server can actually initiate a request to the client. The server can say "I want you to start sending me detailed logs about this piece of API," -their request can be quite detailed of what he wants- and then the client starts streaming the telemetric data back right away. You have a system with very low latency and you have real-time data, you can look at it right away, and this is essentially the monitoring which we built.

The fact that the server can initiate the request, it's one of the key aspects of the solution. With this technology, you're able to target very specific clients, you can just basically say, "On the connection level, I want the login for this particular connection.". I don't have to collect a ton of data for the rest of the users because they might be doing just fine, and you basically reduce latency in your debugging tools.

Future Use Cases: Delayed Execution

As a little teaser, I'm not going to dive too much into it, but this is where RSocket or connection-oriented sort of stream protocol can help. You can decouple your system in time, as well. Imagine that the client initiates a request, and says "I want this kind of data." If you know that this data is not timely, you can basically just say, "I don't have capacity to calculate this right now, but since I know that I have a connection with this device, I can calculate it later." This is essentially just different compute based on your needs, in this case, you might differ it because of your end peak hours. This is just a teaser for you to think where else can I use connection-oriented architecture, and namely, RSocket, to decouple my system in time?

Streaming Data Use Case

Shi: A couple of months ago, I started reading the fantastic RSockets back and I was reading all those features, like request response and request streams. I was thinking, "It's nice, but what is that to do with me? How is it going to help my organization, my company?" My name is Andy Shi, I'm from Alibaba. Today, I'm going to talk about some demos that we are familiar with because I think that same question that I had is probably in your mind right now. How is that going to help your organization and your company?

Use Case: Streaming Log Files

Let's take a simplified example of what Ondrej [Lehecka] was just saying, of the Client Monitoring and we turn to something we're all familiar with.

Log aggregations have been around for a long time and we have seen solutions, I bet each and every company has its own way of doing things. This is normally the architecture we're using, we have certain amount of devices connected to a messaging cluster, and then the messenger cluster will aggregate all the logs from the devices because we cannot have a real-time query on the devices. You don't know where they are, you don't know how to query them, so you put all the data in the database, and then you have consumers to query those databases.

The problem with that, as Ondrej [Lehecka] was talking about, is your database would just really big and it's not only associated with your physical facility, it's also associated with your cost. Imagine you have so much data and all you need is this amount.

Challenge: Send Log Files on Demand

Let's do a challenge and think how we can solve this problem using what Ondrej [Lehecka] has described. We want to single out a device, and we want to stream the file, the log files on the device to the server and to be consumed by your consumers in real-time. Here are some of the requirements.

We want to identify each individual devices, we want to locate the device and locate the file and we want to be able to pull and not push the file. Think about that, how do we do that using today's message queue solutions? We will not want to use any databases involved because we don't want to waste money on databases and we need to be able to support large number of devices like Facebook has or Alibaba has. We want to run queries, not only real-time queries, but we want to have multiple consumers to run that query because 2, 3, 5 engineers can work on this same problem, and they might be querying on the same file, but with different queries, so we want to be able to do that. Lastly, we want to have this flow control capability because we are streaming and when we are doing streaming, we all know that if you don't have flow control, you get into situations that's out of control.

These are all the requirements, now let's think about how we can do that using the traditional message queue solutions. Here's a math question on how we're going to solve this, let's calculate how many topics or subjects or queues we're going to need using message queue. How many queues for each device? It's actually tied to the number of APIs because each API is a subject and I hope you remember what Rob was talking about. There's a diagram showing, there's a request, and there's a response, so these are the two queues that you need. But wait, you need to actually send the data file back, so you need the third one. Then think about that we have different consumers that might be querying the same file. You want to be able to identify which consumer to send it back to, so you need to be able to identify the responder. There you go. that's your simple math formula for all the things we need, but that's only on the device side. We have to repeat that on the consumer side, it's the same situation with a little bit different formula.

What I could come up with last night was only two commands for each consumer. I know the lines are crooked, the spaces are not even, but don't feel sorry for me, feel sorry for those people who have to work on those queues, and feel sorry for those who have to pay to Bezos or whoever you're paying to as club provider because that wastes a lot of infrastructure and it's a lot of costs associated with that.

RSocket Solution

Let's take a look at how we can solve this with RSocket. That's pretty simple, there's not much to say about this. We have devices connected to the server and we have consumers connected to the server. Once they're connected, they're peers, they can talk to each other. Then we have server do some magic and let's see how many connections we're going to need. Basically, four in the situation, that's pretty straightforward.

Before I show you the demo itself, I would like to go through some code and show you how that's done, so you get the principles Robert [Roeser] and Ondrej [Lehecka] just talked about. This is a server code, it starts to transport, there's a WebSocket one and there's a TCP one. Normally, it's really hard for two different transports to talk to each other, but since we use RSocket frames, they're the same, we are actually able to relay the messages.

On the device side, there's a thing in RSocket called setup frame, which means you can send out your own configuration, you can send out whatever you want, as long as it can be converted into a setup frame. In this case, the device was sent to user server, its device ID and the log files on the device, very straightforward, then it's going to connect to the server. Even though we're talking about binary, you can actually use JSON to use the serialization on your side. This code is a little bit compact, but actually, what I want to show you is, on the server side, what it does when it receives the setup frame.

The server is going to take the setup frame and process it, it's going to do two things. It's going to keep the device information, it's going to take the device ID and it's going to take that handler to that RSocket connection so we can reuse it. Look at how simple it is to do the closing part, it's two lines, there's one line that's doing the logging. As Robert [Roeser]was saying, it's a soft sticky connection, and it keeps a state of the connection, so when you are doing the thing up, it's pretty straightforward. You know where you are, you know what to do.

This is a function that's on the server, it's calling the device, “send me your log file”. This happens when the server is getting the request from the consumer, and it's going to look up the device, and it's going to tell the device, "Hey, give me the file." Remember this is a server and as both Ondrej [Lehecka] and Robert [Roeser] were saying and emphasizing on, there's no server or client. In this case, a server is requesting the device to send back the file, that's pretty important thing to remember.

On the device, this is a request stream, RPC, that's implemented on the device so it's going to look at the requirement that's bypassed by the server, it's going to look up the file, and it's going to transform back. Here's the thing, when the consumer is sending the server its request stream, it's in a frame. That frame has the information, it also has the back pressure setup. I want to have certain amount of buffers, I want to have the rate at this level, these are all passed through to the device, so the device actually knows what the consumer is asking for, without any loss of information.

Connections Needed for One Device

Let's recap, connections needed for one device is one connection and the consumers and the devices are not intertwined in calculating the number of connections you need. The next thing you can do is, when you're changing API's, since we don't have the idea of the API, in the stream, you don't have to change any code, unlike in your message queue situation, you have to change the topics.

Here I'm going to start the server first and as you can see, it has initiated two sockets, one is TCP, one is WebSocket and this is the device. It has started, now let's run some commands. This is the listing function that I asked for all the files that's available on this device. There are three log files and we're going to query one of them.

The file has been received, this is one of Shakespeare's play, I just copied it from the internet because we're in London. That shows you how simple that is, look at how many lines of code that's involved to get this thing done.

In summary, RSocket simplifies the architecture, it saves this money, that's the most important thing to remember. There are three features that we just talked about, one is, connection-oriented, one is bi-directional and then we have the end-to-end flow control.

Questions and Answers

Participant 1: Very interesting talk. I was wondering, maybe I missed it, what about error control? Maybe for the first speaker. Do you handle error control?

Lehecka: Error handle is built in, so, on the protocol, there's an airframe that actually is sent back across. When an exception occurs, it's sent back, it will cancel the existing stream upstream, and it will send the error message downstream.

Participant 2: Thanks for the presentation. Are we now back with servers that are stateful in order to manage our connections?

Roeser: You can choose to do it. Because it's connection-oriented, the connection stays around. If you wanted to do something, imagine you want to have more of a conversation with the application you're dealing with. Generally, people build stuff, where it's like one and done, you do your request response, you send the data, and you have to do all your processing and sending it back. Because you have this idea of this connection, I can send a couple of requests to you, you can send a response, and you can send data back and forth. You don't actually have to use it that way, if you want, you can send one off and then discard it.

Participant 3: How would you compare and contrast this to gRPC?

Roeser: RSocket is a layer five and six protocol in the OSI stack, so you would use something like RSocket to build an RPC layer. It's better to think of RSocket more like HTTP for microservices. Some people don't want to use an RPC obstruction, some people want to use more of a pure message passing abstraction. Spring is actually adding support for RSocket to an upcoming release and they're going to make it available to you as message handlers, more like a restful or message passing thing. We created an RPC layer on top of it, so if you like the RPC style semantics, it's a drop in replacement for gRPC.


See more presentations with transcripts


Recorded at:

Jun 08, 2019