Transcript
Joras: We're going to talk about how Facebook brought QUIC to billions. First off, who are we? My name is Matt Joras. I'm a software engineer on the Traffic Protocols team at Facebook. I'm also one of the co-chairs of the IETF QUIC Working Group, which is the group that standardized QUIC.
Chi: My name is Yang. I'm also a software engineer on the Traffic Protocols team at Facebook.
What Is QUIC?
Joras: What is QUIC? QUIC is the next internet transport protocol. In terms of the more traditional OSI layering model that everyone's probably familiar with, you have application, security, and transport layers. At the very top, you have HTTP semantics. These are things like GET, POST, and other things like that, that form the basis of HTTP requests. Under that is the mapping of those semantics to an actual protocol. The one that's existed for a while now is HTTP/2. As you might expect, HTTP/3 is the next version, and it is QUIC-specific. Both of these provide things like prioritization and header compression. HTTP/2 also provides the stream multiplexing. The next layer is the security layer. This is where things start to diverge a little bit. With the traditional stack with HTTP/2, underneath that is usually TLS, and usually TLS 1.3. TLS provides things like authentication and encryption, and generally makes the transport secure for use on the internet. On the QUIC side, you notice that there's not really a clean delineation. This is because TLS is baked into QUIC. What that means is there's only one handshake; there's not a separate TLS handshake and a QUIC handshake, there's just the QUIC handshake. Also, the encryption is done on a per-packet basis.
Then, when we go into the transport layer, again, we see that there's a bit of a difference. On the left-hand side, we have TCP, which everyone's familiar with. It is connection oriented, provides reliability on the internet, provides congestion control, things like that. It also provides the typical TCP bytestream semantic, which is usually exposed through something like a socket. QUIC has some of the same things. It has things like congestion control, reliability, and connection-oriented behaviors. What's different is that it has stream multiplexing. What this means is that QUIC exposes multiple independent streams on a single connection, and they're independent insofar as one stream does not block the delivery of another stream. This is a fundamental difference from TCP, where you can only achieve this with multiple TCP connections.
Additionally, we see another difference, which is that QUIC has UDP underneath. Why is this? TCP was defined a long time ago and had the luxury of being its own separate internet protocol, and getting its own internet protocol number. This isn't realistic to do anymore, because it's hard to update old equipment to understand new internet protocols. Instead, QUIC basically piggybacks on top of UDP like a lot of other protocols do. That allows it to transit around the internet unperturbed. Of course, both of these sit on top of IPv4 and IPv6, which is just the way that we move data around the internet.
Why QUIC?
Why QUIC? One of the biggest reasons is something that we call ossification in protocol design. Ossification is a word that literally means to become bone. What that's referring to is that protocols like TCP have existed for so long, and their wire format is so well known, that it gets really hard to change them. That's because there's a lot of equipment on the internet that makes assumptions about that wire format, and may do things like drop or modify packets if they diverge from its expectations. QUIC solves this by encrypting the vast majority of the protocol, so there's not really much exposed, which means that we can change whatever goes on underneath the encryption because no one can make any assumptions about it.
Another key advantage of QUIC is since it's built on top of UDP, that allows us to implement it in user space. This is important because TCP is generally implemented by your operating system, and is untouchable and unchangeable. It will change slowly over time, but you don't have much control over it. It's hard to iterate on it, or extend it, or modify it, and then experiment with those modifications. With QUIC, since it's implemented in user space, we can change and iterate it as fast as we can push software to our servers or release software to things like mobile apps. Of course, additionally, QUIC has independent streams that can be multiplexed at the connection layer. Since it's been designed recently, it can integrate decades of experience running protocols on the internet, like TCP, which lets us have state of the art loss recovery.
QUIC at Facebook
Let's talk about QUIC at Facebook. We have the same diagram we saw earlier on the left. This is how our software fits into it. All of these are pieces of open source software. The top three are Proxygen, Fizz, and mvfst. They together are C++ implementations of HTTP, TLS 1.3, and the QUIC transport protocol. All three of these taken together are how we implement our servers, so our web servers, our load balancers, things like that. It's also been packaged together with what we use in the mobile apps to implement basically a networking stack. This is really valuable because we can share code on the client and the server and iterate it as fast as we can release apps and as fast as we can release software to our servers. The bottom one is the one that's a little bit of an odd one out, but it's our layer 4 load balancing software called Katran, which is based on XDP and eBPF. It allows us to load balance both TCP and QUIC connections consistently and reliably for our data centers.
All that sounds great. We have this new modern transport protocol. It's encrypted. Everything's great about it. We have all the software that we implemented. We probably just implemented it, turned it on, and everything was great. Obviously, though, that's not really how it happens.
Facebook and Instagram Baselines
Chi: It turns out changing the transport layer protocol and HTTP at the scale of Facebook is actually quite challenging. Before we dive into some of the interesting problems we've solved in the past couple of years, let's first look at what exactly the networking stack looked like in the Facebook and Instagram apps before we started this transition to QUIC. They were both using TLS 1.3 with early data enabled. On the Facebook side, all the API requests, aka GraphQL, and all the static requests, whether image or video, used HTTP/2. For Instagram, the API requests used HTTP/2, but the static content like image and video was still going over HTTP/1.1 connections. In the Facebook app, GraphQL, image, and video all go through separate connections. On the Instagram side, image and video can potentially go through the same connection. On the server side, we just used the standard Linux TCP stack. The congestion control we use is version one of BBR.
QUIC with TCP Backup
On the internet, there are still roughly 1% of users who cannot have their traffic go through UDP over port 443. One percent of Facebook's user base is a very large number. To make sure that when we enable QUIC and HTTP/3 for our users, they can still safely and reliably use the app, we came up with an algorithm to race the QUIC and TCP connections whenever we need to open a new connection. The idea is very simple. We open a QUIC connection and we wait 200 milliseconds; after that, we fall back to a TCP connection. If we have both QUIC and TCP available in our session pool on the client side, we always prefer the QUIC connection. This algorithm clearly has a preference for QUIC over TCP. That's because our data shows that as long as QUIC is not blocked, it performs as well as or even better than TCP. With this algorithm ready, we were ready to run real experiments with production traffic.
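To make the racing flow concrete, here is a minimal, self-contained C++ sketch of the same idea. The handshakes are simulated with sleeps, and all of the names and timings are illustrative assumptions rather than Facebook's actual client code.

```cpp
// A minimal sketch of racing a QUIC handshake against a TCP fallback.
// Handshakes are simulated with sleeps; names and timings are illustrative.
#include <chrono>
#include <future>
#include <iostream>
#include <string>
#include <thread>

using namespace std::chrono_literals;

// Simulated handshakes: in a real client these would be async network calls.
std::string quicHandshake() { std::this_thread::sleep_for(350ms); return "QUIC"; }
std::string tcpHandshake()  { std::this_thread::sleep_for(80ms);  return "TCP";  }

std::string openConnection() {
  // 1. Always start the QUIC handshake first.
  auto quic = std::async(std::launch::async, quicHandshake);

  // 2. Give QUIC a 200 ms head start. If it completes in time, use it.
  if (quic.wait_for(200ms) == std::future_status::ready) return quic.get();

  // 3. QUIC is slow or blocked: start TCP as a fallback and race them.
  auto tcp = std::async(std::launch::async, tcpHandshake);
  while (true) {
    // QUIC is checked first so it wins any tie; in the real client the
    // session pool also prefers QUIC whenever both end up established.
    if (quic.wait_for(5ms) == std::future_status::ready) return quic.get();
    if (tcp.wait_for(5ms)  == std::future_status::ready) return tcp.get();
  }
}

int main() { std::cout << "Using " << openConnection() << "\n"; }
```

The key design choice is simply that QUIC gets a 200 millisecond head start, and whenever both handshakes succeed, the QUIC connection is the one that gets used.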
Congestion Control and Flow Control
Joras: To understand some of the challenges we faced next, let's start with some basic definitions of congestion control and flow control. Congestion control is essentially an algorithm used by a sending transport like TCP or QUIC to minimize loss or congestion, and to maximize something, usually throughput: our ability to deliver data to the client. Flow control is different. Flow control is essentially a way to limit the amount of data that an application buffers. If I'm receiving data, I might want to say, I only want to receive 100 kilobytes or 2 megabytes of data. Once you hit that, don't send me any more data until I tell you it's ok. This is essentially a way to limit the amount of data that a receiving application has to buffer.
Initial Flight: 6 TCP conns vs. 1 QUIC conn
An interesting thing that we saw with congestion control, specifically on Instagram, was that Instagram was using 6 connections to fetch data from the server. It was not only doing this for multiplexing reasons, so that it could have multiple in-flight requests at the same time, but another side effect of this is that for the first request that the app makes, it's able to download quite a lot of data in that first initial flight. This is because the sending congestion controllers are all independent. They all decide, ok, we're going to send not too many packets, because we don't know how much congestion there might be, so we're going to send 10. Since there are 6 connections, the same server will actually end up sending about 60 packets to the same app.
When we changed everything to QUIC, since QUIC has multiplexing on a single connection, we don't necessarily need those 6 connections to achieve multiplexing. However, it ends up being disadvantaged compared to those 6 TCP connections early on, because that single connection is still limited to those 10 packets. This showed up meaningfully in the metrics when we tried to use QUIC in Instagram, mostly in metrics related to app startup time and media loading. How did we decide to fix this? Basically, we split the difference. We had one connection, but we decided we're not going to go all the way to 60, because that's quite high. What about 30? 30 is a more reasonable number. Indeed, it's still less impactful to the network than having those 6 connections. By doing this, we were able to completely close the gap with TCP in those metrics and actually even improve on them. The media load times during startup were improved by switching to QUIC with this increase in initial congestion window.
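As a rough sketch of the kind of change involved, the snippet below raises a hypothetical initial congestion window setting from the common default of 10 packets to 30. The struct and field names are assumptions for illustration, not mvfst's actual API.

```cpp
// Illustrative transport settings for the initial congestion window change.
#include <cstdint>
#include <cstdio>

struct TransportSettings {
  uint32_t maxUdpPayloadBytes = 1452;  // typical QUIC packet payload size
  uint32_t initCwndInPackets = 10;     // commonly used default initial window
};

int main() {
  TransportSettings s;
  s.initCwndInPackets = 30;  // compromise between 10 (one default QUIC conn)
                             // and ~60 (6 TCP conns x 10 packets each)
  std::printf("initial flight ~= %u bytes\n",
              s.initCwndInPackets * s.maxUdpPayloadBytes);  // ~43 KB
}
```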
Congestion Control and Congestion Window
I've already mentioned congestion window a couple of times. Let's talk about it again, a little bit more in detail. Suppose you have a 10 megabit per second link with 100 millisecond RTT, or round trip time. A round trip time of 100 milliseconds means that the time it takes for a packet to go from one end to the other and back again is 100 milliseconds. A question you might ask, and a question that we ask in servers, is: how much data do I need to have in flight in order to fully utilize this 10 megabit per second link? To answer this, it's just simple math. You take the link speed, you multiply by the round trip time, and you get something called the bandwidth-delay product, or BDP. The BDP is also the idealized congestion window, which is the amount of data that a server should have in flight to maximize the throughput here.
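For the numbers used in the talk, the arithmetic works out as in this small sketch; 125,000 bytes is roughly 122 KiB, the figure quoted on the next slide.

```cpp
// Worked example of the bandwidth-delay product for the link described above:
// a 10 Mbit/s link with a 100 ms round trip time.
#include <cstdio>

int main() {
  const double linkBitsPerSec = 10e6;   // 10 Mbit/s
  const double rttSec = 0.100;          // 100 ms round trip time
  const double bdpBytes = linkBitsPerSec * rttSec / 8;  // bits -> bytes
  // Prints: BDP = 125000 bytes (~122 KiB)
  std::printf("BDP = %.0f bytes (~%.0f KiB)\n", bdpBytes, bdpBytes / 1024);
}
```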
TCP + HTTP/2 Has Two Levels of Flow Control
Why is this relevant? This is relevant because of flow control, because the congestion window can essentially be bound by flow control. What I mean by that is that flow control essentially limits the amount of data an application is willing to receive. With TCP and HTTP/2, there are two layers of flow control: the TCP flow control and the HTTP/2 flow control. The TCP one is determined by the TCP stack. It shrinks and grows over time, and you don't really mess with it too much, usually. The HTTP/2 flow control is set by the application very explicitly. For us, it's usually static in size. You set it to a static number like 100 kilobytes, 1 megabyte, 10 megabytes, whatever you want. In QUIC, since stream multiplexing and all that is built into the transport, there's no separate flow control for the transport and HTTP. There's just one, and it's controlled by the transport. Similar to the HTTP/2 one, we have it set by the application, and it's also static in size. Something that we found was that the static limits set by our applications for HTTP/2 did not work well for QUIC. It was set to something like 165 kilobytes. When we tried to use this with QUIC, we would see a lot of negative metrics. To figure out why, we have the next slide.
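To make the two setups concrete, here is an illustrative sketch of where those static limits live. The protocol parameters referenced in the comments (HTTP/2's SETTINGS_INITIAL_WINDOW_SIZE, QUIC's initial_max_data and initial_max_stream_data) are real, but the structs and field names themselves are assumptions, not any particular library's API.

```cpp
// Illustrative sketch of the two flow control configurations.
#include <cstdint>

struct Http2OverTcp {
  // TCP receive window: managed dynamically by the kernel, not by the app.
  // HTTP/2 window (SETTINGS_INITIAL_WINDOW_SIZE): set statically by the app.
  uint32_t http2InitialWindowBytes = 165 * 1024;
};

struct Http3OverQuic {
  // QUIC has a single level of flow control, in the transport, also set
  // statically by the application: one limit per connection, one per stream.
  uint64_t initialMaxDataBytes = 165 * 1024;        // carried over unchanged,
  uint64_t initialMaxStreamDataBytes = 165 * 1024;  // which turned out too small
};

int main() {}  // settings sketch only
```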
Flow Control Constrains Bandwidth
One way to visualize this is to go back to that ideal congestion window idea, which is that for that 10 megabit per second link at 100 milliseconds RTT, we ideally want to have 122 kilobytes of data in flight in order to maximize throughput. Flow control limits the sender from sending more than a set amount of data. If you have flow control set at 100 kilobytes, then the sender has no choice but to respect that. It's unable to fully utilize the link. In this case, the flow control is too small: it's 100 kilobytes, which is less than 122 kilobytes. All that image and video data is constrained, simply by the flow control, in what could potentially be sent.
The way that we can fix this is essentially to allow the application to have a higher flow control limit. In this case, you set it to something higher, like 2 megabytes, well above the ideal congestion window of 122 kilobytes. Now the congestion window can grow, because the server is able to fit more data onto the wire, which gives higher throughput. That leads to a better experience for the user, as they're able to download things like images and videos much faster. This is something we had to do for QUIC in order to show improvements over HTTP/2.
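To see the effect in numbers, this small sketch uses the same 10 megabit per second, 100 millisecond link as before and caps the data in flight at the smaller of the congestion window and the flow control window.

```cpp
// How a too-small flow control window caps throughput: the sender can never
// have more than min(congestion window, flow control window) in flight per RTT.
#include <algorithm>
#include <cstdio>
#include <initializer_list>

int main() {
  const double rttSec = 0.100;      // 100 ms round trip time
  const double cwndBytes = 125000;  // ideal cwnd = BDP (~122 KiB)

  for (double flowControlBytes : {100e3, 2e6}) {  // 100 KB vs 2 MB limit
    const double inFlight = std::min(cwndBytes, flowControlBytes);
    const double throughputMbps = inFlight * 8 / rttSec / 1e6;
    // 100 KB window -> ~8 Mbit/s (underutilized); 2 MB window -> full 10 Mbit/s
    std::printf("window %.0f KB -> %.1f Mbit/s\n",
                flowControlBytes / 1000, throughputMbps);
  }
}
```

With a 100 kilobyte window the link tops out around 8 megabits per second; raising the limit to 2 megabytes lets the congestion window, rather than flow control, govern the sending rate.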
Clever Heuristics Aren't Always that Clever
Chi: We covered congestion control and flow control. We talked about how we can safely race a QUIC connection and a TCP connection so that users always have a connection to use. If you're a networking person, I hope you find all those topics very relevant to you. As a networking person myself, when I started to work on this project, I was also thinking those would be the problems I would run into and have to solve. It turns out there are also some very surprising application behaviors that I actually needed to understand to make QUIC and HTTP/3 work well. The first example here is that when we started to enable QUIC and HTTP/3 in the Facebook app, we started small. We first only enabled it for GraphQL, or the API requests. Surprisingly, by enabling QUIC for GraphQL, we regressed image and video requests, which were still over HTTP/2 and TCP. We did not touch image and video at all, but somehow they regressed. How did this happen?
After some debugging, what we figured out is that when the initial feed loading request was sent out, the application had this heuristic: if it gets the response within 2 seconds, it will load fresh feed from the network. If after 2 seconds it still has not received the feed loading response, it will instead load feed from a local cache. As you can imagine, when we switched to QUIC for the GraphQL request, because it is now faster, more users fall within this 2-second limit, but their image and video requests are still over TCP. The percentage of image and video requests issued on relatively bad networks actually increased. This is why those metrics regressed. Once we also enabled QUIC and HTTP/3 for image and video, this problem disappeared.
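The heuristic itself is simple. Here is a simulated sketch of it; the function names, return values, and timings are illustrative assumptions, not the actual app code.

```cpp
// Simulated sketch of the 2-second feed loading heuristic described above.
#include <chrono>
#include <future>
#include <iostream>
#include <string>
#include <thread>

using namespace std::chrono_literals;

// Simulated GraphQL feed request; slower than 2 s on a "bad" network.
std::string fetchFeedOverNetwork() {
  std::this_thread::sleep_for(2500ms);
  return "fresh feed (triggers fresh image/video fetches)";
}

std::string loadInitialFeed() {
  auto pending = std::async(std::launch::async, fetchFeedOverNetwork);
  // Heuristic: if the feed response arrives within 2 seconds, use the fresh
  // feed from the network; otherwise fall back to the locally cached feed.
  if (pending.wait_for(2s) == std::future_status::ready) {
    return pending.get();
  }
  return "cached feed (no fresh media fetches)";
}

int main() { std::cout << loadInitialFeed() << "\n"; }
```

Speeding up only the GraphQL leg with QUIC pushes more bad-network users onto the "fresh feed" branch, whose media requests were still going over TCP, which is what made the TCP-side metrics look worse.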
App Request Queue
Another very surprising app behavior that we had to understand was in the Instagram Android app. We enabled QUIC for video, image, and API requests. For API and video, everything looked very positive. On the image side, the numbers were quite mixed. A notable regression was the image queuing time in the app itself. It turns out that in the application there is a rather complex request queue that has a separate portion for images. As QUIC makes video downloads faster, the video player started to fetch video of higher resolution. The video responses started to take a larger chunk of the request queue, because the request queue also has a response bias limit. As a result, some images had to wait inside the image queue slightly longer. We solved this problem by tuning the response bias limit in this request queue.
QUIC Had a Transformative Effect on Video
With all the changes we have mentioned, we were eventually able to deliver a very significant win in both the Facebook and Instagram apps, especially for video performance. We were able to improve the mean time between buffering in the video player by 22%, and we saw a 20% decrease in video stalls. We were also able to reduce the networking errors that video runs into by 8%.
The Voyage
This is the timeline of our project of working on QUIC and HTTP/3 at Facebook. There are many milestones across the past three years. One thing that's very notable is that the time from the beginning, which was early 2017, to the point where our code was able to interop with other implementations, was certainly more than half a year. From there, it took us another three years to actually ship this project.
Takeaways
Joras: What are some of the takeaways here? One thing that I think is really significant is that, while QUIC does perform better and can give you great performance, it's not as simple as turning it on. It's not as simple as implementing it, flipping a switch, and turning it on. This is usually because existing, especially complex, applications can be implicitly tuned for or coupled to TCP. Another interesting thing that we saw when we were doing early experiments is that HTTP and transport metrics for things like latency, number of losses, number of retransmits, things like these, will often just show blanket improvement with QUIC. They'll just say, "Yes, this all got better. This is all really awesome." Then we look at the application metrics, and there'll be regressions. This can be really confusing and really difficult to reason out. You often have to dig into the application metrics in order to ensure that you're not just making some numbers better on paper, but actually improving the experience of those using your applications. Once you do this and put in this effort, QUIC will have a transformative effect on your application performance on the modern internet.
Future Work
In terms of future work, where are we going? You might have noticed that we didn't mention two of the probably headline features of QUIC, which are 0-RTT connection establishment and connection migration. These are very important features, but we didn't actually need them to replace TCP with QUIC. There's a couple of reasons for this. One is that 0-RTT essentially improves the experience when you need to make a new connection. Our applications had already been highly optimized to reuse connections really aggressively. Being in a state where you need a new connection is relatively rare. It mostly happens when you're just starting the app up, which is not the majority of the time. It's important, and in fact we're experimenting with 0-RTT right now, but we didn't need it to ship QUIC.
Another one is, of course, connection migration, which is this notion that you can seamlessly transition between a WiFi interface and a cellular interface without interrupting the user's connection. It's really great on paper, but the implementation complexities are significant, both on the client side and the server side, in order to get the most potential out of it. We just haven't been able to invest the time in it yet to ship connection migration. We do want to do it in the future. In addition to these big headline features, QUIC, because it has inherent extensibility, because it's resistant to ossification, and because we can iterate on it really quickly, gives us a platform for transport and HTTP level experimentation well into the future in a way that was never possible with TCP and HTTP/2.
Questions and Answers
Fedorov: You've mentioned in that timeline that it took you quite a bit, about three years, to go through all the steps. What are the main reasons for that taking three years? Was it some of the operational challenges around it, some aspects of QUIC standardization, or something else?
Joras: It was mostly the fact that it took a while to get to the baseline where the infrastructure supported it. That took quite a while. It's one thing to interop with some of the other implementations as a one-off. It took a lot longer to make sure that everything was ready to go, and that the code was there in the app and actually working. Then the majority of the time after that is, we can't just turn stuff on because we want to. We have to make sure it actually improves the experience. Otherwise, it's pointless. I think most of it was investigating regressions that were caused either by tuning or by bugs.
Chi: I think one thing I would add is understanding the networking behavior, and understanding why an application is going to see a regression. There is always a gap there between visibility and causation: we change A, and then suddenly some number in system B moves. Sometimes that's not clear at all. Figuring that out took a surprising amount of time.
Fedorov: Can you identify maybe some of the good practices that you follow at Facebook to optimize that process? In my experience, I can totally relate to that. You nailed it. This is the key to rolling out changes, especially on the transport or infrastructure layer. At Facebook, what do you do to make it more optimal and prevent some of the delays, or prevent some of the outage issues that would impact users?
Joras: We had a couple of things. First off, when we say we turned on or shipped QUIC, this is all gated by experimentation frameworks that let us selectively turn it on in the application, so in the mobile apps. We can also turn it off. We can turn it on, and we can turn it back off. Then, of course, we can also turn it on and off on the server side. If we found a problem, which did happen, where we shipped a bug, we could still turn it off on the clients. Then, really in an emergency, we can just turn it off entirely. Facebook has a lot of this tooling just for general experimentation, because this is how the products are developed and how the infrastructure is developed: iterate quickly. There's a lot of tools to help us do that, and we were able to integrate into those really easily.
Fedorov: As you go through those migrations, do you enable things for all the traffic, or do you try to break things down by function or by user sharding, or something else? You mentioned that you treat Facebook and Instagram as two different use cases, because they operate quite differently. Are there any other segments that you use to make those deployments easier?
Joras: Yes. We would basically have little groups: this group has this on, that group has that on. With the flow control experimentation, for example, we had one group with the default, a different group with another number, another group with yet another number, and we just ran that experiment. Separately, we'd have, say, the experiment where we're changing the way we map error codes or something. We try to test things individually to see what their individual effect is, and try not to compound the signals too much. Like Yang said, things can have really unexpected effects: you turn two things on, and a third thing moves, but only if you turn those two things on. Being able to isolate that is really important.
Fedorov: QUIC works on top of UDP. Can you elaborate a little bit more on the experience of dealing with UDP being blocked? How do you generally deal with that?
Joras: Basically, the flow that Yang showed with the racing is the way that we deal with UDP being blocked, which is to say that we always have a TCP fallback in the case that we can detect that a QUIC handshake is not going to complete. Most of the time when that happens, it's because UDP is blocked or malfunctioning for some reason. In terms of its prevalence, it's difficult to answer that question, just because it depends on how you frame it: in terms of number of users, number of networks, number of sessions, these are all different things. What we can say is that in the vast majority of places, it is not blocked. A lot of this is probably because it's been normalized a bit. Before we were even rolling out QUIC, Google had a proprietary variant of QUIC that also ran UDP on port 443. They'd already done a lot of groundwork to get it unblocked generally, but there still are places. Usually, you can tell really obviously that some network is blocking it; a lot of the time it's companies clearly blocking it with their firewalls. In general, it's not a huge problem. No matter how you slice it, less than 1% of users are going to be impacted by this.
Fedorov: I think I remember from Yang's flow that he chose 200 milliseconds as the threshold to fall back to TCP. Why 200 milliseconds? What's so magical about this number?
Chi: We actually tried a couple of different numbers, and honestly it didn't show too much difference. We tried something slightly longer and something slightly shorter, and there isn't much difference once you are in the sub-500 millisecond range. There isn't too much science behind it either; it's just an experimental result.
Fedorov: Do you save that decision to stay on TCP, say locally on the device, or is that decision made for every new session?
Chi: Yes, for every new session there is this 200 millisecond delay, with one exception: if your last attempt at using QUIC actually ran into an error, we don't delay opening the TCP connection the next time.
Fedorov: Is there a Java implementation for QUIC at this point?
Joras: There's a lot of implementations. The QUIC Working Group has a GitHub and there's a list of implementations there. There's literally dozens. I think there's at least two or three that are Java or something that runs on top of the JVM. The nice thing about QUIC is that since it's built on top of UDP, lots of people have implemented it. There are many more QUIC implementations now than there have ever really been TCP implementations.
Fedorov: How would you estimate the maturity level of those implementations? Because still, it's a relatively new protocol, especially compared to TCP. Being the co-chair of the QUIC Working Group, what's your current take on it?
Joras: It varies a lot. We're at the beginning of people really using it. Facebook and companies like Google are in a unique position where we often control both sides, the client and the server. This gives us the ability to say, we're going to control these implementations. A lot of people are not in that position, they only have one side, and that makes the iteration time slower. That means, though, that bugs and other issues are going to be found more slowly. The implementers have done a lot of great work. If you're going to go invest time into QUIC, expect to find problems like that right now, because it's still early days for these implementations. I'm sure they will mature really rapidly over the next year or two.
Fedorov: You've mentioned some interesting findings that you've had at Facebook of application code making assumptions about transport behavior. Can you elaborate a little bit more? Were there any more stories and learnings from Facebook, or maybe you've seen some other patterns in the industry in general of how applications are coupled with TCP?
Joras: With HTTP/2, you have multiplexing, but it goes over that single bytestream. That makes the multiplexing different than in QUIC, where you have these independent bytestreams. I think a lot of complex applications will see this weird behavior where they have some gating mechanism that they're implicitly relying on. We saw it in Instagram with that request queue. We saw it in both apps with the flow control, but some mechanism to try to limit how much data is being fetched. These seem common, because we've seen them multiple times. They seem to arise naturally. When you have these gates, you often have to decide on a number. Those numbers are where the implicit tuning to TCP comes in, which is to say that somebody iterated on it a few times, tweaked it, and arrived at this magic number. That will potentially be where you need to change stuff. It can be difficult to realize, because you don't know that somebody made this change a while ago. That's where you often need to look. Other things are more like concurrency. In general, your concurrency limits for how many requests you can send might be different. Things like that are where I'd say it's just fundamentally different, because of how TCP works versus how QUIC works.
Fedorov: Basically, the connection pool on the client or the server is making specific assumptions about how the transport layer works?
Joras: Yes, pretty much.
Fedorov: You've talked about the public facing network interactions and using QUIC for that. What's your position on using QUIC for data center communication, so communication between the edge and the origin server?
Joras: We had one slight mistake on our slide there, which is corrected in the downloadable slides, which said 50% of edge-to-origin traffic using QUIC. That was specifically for when we proxy the API requests: 50% of those go over QUIC. We had that as a test bed. In terms of QUIC for other traffic, it's something that we're working on. It has more challenges in the data center, which is to say that because it's implemented in user space, there's obviously performance overhead that comes with that. Also, the encryption happens a bit differently in QUIC, on a per-packet basis, which is less efficient than with TCP. These are challenges that we think we can overcome, but they make it difficult for certain use cases. Some use cases in the data center that aren't super high throughput, like sub-gigabit per second use cases, or use cases where you go from an edge location out to another location on the internet, are places that we think could benefit from QUIC more immediately, for the same reasons that QUIC benefits traffic going from the edge to the client. It's something that we are hoping to explore over the next year. We're also hoping to get the cost down, to make QUIC truly usable everywhere TCP is currently used.