LinkedIn was able to dramatically improve the scalability and performance of its Espresso database by migrating it from HTTP/1.1 to HTTP/2, reducing the number of connections, latency, and garbage collection times. To achieve these gains, the team had to optimize Netty's default HTTP/2 stack to make it fit their needs.
LinkedIn uses Espresso, its document platform built on top of MySQL, to store and serve most of its data. With the organic growth of LinkedIn’s platform, the volume of data keeps increasing, forcing the company to constantly expand the footprint of Espresso clusters and work on optimizations, such as introducing a centralized caching layer for Espresso or adopting Protocol Buffers for inter-service communication.
Espresso’s High-level Architecture (Source: LinkedIn Engineering Blog)
Espresso's transactional stack comprises two major components: the router and the storage node. The router is responsible for directing each request to the right storage node, which in turn takes care of interacting with the MySQL cluster and adapting the data format appropriately. The two components communicate over HTTP, implemented using the Netty framework. The team observed a gradual degradation of scalability as Espresso clusters grew over time.
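For illustration, below is a minimal sketch of the routing step, assuming a simple hash-based partition scheme; the class and node names are hypothetical and do not reflect LinkedIn's actual code:

```java
import java.util.List;

// Hypothetical sketch of the router's node-selection step: a request is
// directed to the storage node owning the document's partition. The
// partitioning scheme shown here is illustrative only.
public final class StorageNodeRouter {
    private final List<String> storageNodes; // e.g., host:port pairs

    public StorageNodeRouter(List<String> storageNodes) {
        this.storageNodes = storageNodes;
    }

    /** Maps a document key to the storage node that owns its partition. */
    public String route(String documentKey) {
        int partition = Math.floorMod(documentKey.hashCode(), storageNodes.size());
        return storageNodes.get(partition);
    }

    public static void main(String[] args) {
        StorageNodeRouter router = new StorageNodeRouter(
                List.of("storage-1:12345", "storage-2:12345", "storage-3:12345"));
        System.out.println(router.route("member/42")); // same node for a given key
    }
}
```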
More recently, adding 100 router nodes resulted in increased memory usage in storage nodes, as well as a 15% latency increase caused by additional garbage collection activity. Additionally, because connection pooling is applied to a large number of HTTP/1.1 connections, acquiring a connection can take several milliseconds (15ms at the 95th percentile). Lastly, during networking events, such as switch upgrades, re-establishing thousands of connections can result in errors due to hitting connection limits on storage nodes.
Abhishek Andhavarapu, staff software engineer at LinkedIn, explains the differences between HTTP/1.1 and HTTP/2 and how these impact the scalability and performance of the Espresso platform:
In the communication between the router and storage layer, our earlier approach involved utilizing HTTP/1.1, a protocol extensively employed for interactions between web servers and clients. However, HTTP/1.1 operates on a connection-per-request basis. In the context of large clusters, this approach led to millions of concurrent connections between the router and the storage nodes. This resulted in constraints on scalability, resiliency, and numerous performance-related hurdles.
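HTTP/2 addresses this by multiplexing many concurrent streams over a single TCP connection. As a rough illustration, here is a minimal sketch, assuming Netty 4.1's http2 codec, of a client issuing many requests over one connection; the host, port, and paths are placeholders, not Espresso's actual endpoints:

```java
import io.netty.bootstrap.Bootstrap;
import io.netty.channel.*;
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.channel.socket.SocketChannel;
import io.netty.channel.socket.nio.NioSocketChannel;
import io.netty.handler.codec.http2.*;

// Sketch: with HTTP/2, one TCP channel carries many concurrent streams,
// so a router no longer needs one pooled connection per in-flight request.
public final class Http2MultiplexExample {
    public static void main(String[] args) throws Exception {
        EventLoopGroup group = new NioEventLoopGroup();
        try {
            Channel channel = new Bootstrap()
                    .group(group)
                    .channel(NioSocketChannel.class)
                    .handler(new ChannelInitializer<SocketChannel>() {
                        @Override
                        protected void initChannel(SocketChannel ch) {
                            ch.pipeline().addLast(
                                    Http2FrameCodecBuilder.forClient().build(),
                                    // Handles server-initiated streams; unused here.
                                    new Http2MultiplexHandler(new ChannelInboundHandlerAdapter()));
                        }
                    })
                    .connect("storage-node.example", 8080).sync().channel();

            // Each open() creates a new stream on the SAME TCP connection.
            Http2StreamChannelBootstrap streams = new Http2StreamChannelBootstrap(channel)
                    .handler(new SimpleChannelInboundHandler<Http2Frame>() {
                        @Override
                        protected void channelRead0(ChannelHandlerContext ctx, Http2Frame frame) {
                            System.out.println("received " + frame.name());
                        }
                    });
            for (int i = 0; i < 100; i++) {
                Http2StreamChannel stream = streams.open().sync().getNow();
                Http2Headers headers = new DefaultHttp2Headers()
                        .method("GET").scheme("http").path("/EspressoDB/doc/" + i);
                stream.writeAndFlush(new DefaultHttp2HeadersFrame(headers, true));
            }
            // A real client would wait for responses before shutting down.
        } finally {
            group.shutdownGracefully();
        }
    }
}
```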
The team decided to continue using the Netty framework while migrating to HTTP/2 but quickly discovered that out-of-the-box performance wasn’t satisfactory (45% lower throughput than the HTTP/1.1 implementation, with around 60% higher latency), so the engineers had to investigate and address the bottlenecks in the HTTP/2 stack. The investigation pointed to two areas for improvement: connection acquisition and request processing, and request encoding/decoding.
The developers augmented several of Netty's internal implementation details by forking the relevant classes. They created a handler implementation that would reuse existing channels to avoid creating new processing pipelines for each request. They also introduced a custom EventLoopGroup implementation that balanced connections across worker threads more evenly. To reduce context switching when acquiring a connection, the team reworked the connection pool implementation to employ a high-performance, thread-safe queue.
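As an illustration of the queue-based pool idea, the following sketch uses a ConcurrentLinkedQueue so that acquiring a channel becomes a non-blocking poll rather than a lock-guarded checkout; it is a simplification for clarity, not LinkedIn's forked Netty classes:

```java
import io.netty.channel.Channel;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Sketch of a connection pool backed by a thread-safe, lock-free queue:
// acquire() and release() never block, reducing contention and context
// switching under high request rates.
public final class QueueBackedChannelPool {
    private final Queue<Channel> idle = new ConcurrentLinkedQueue<>();

    /** Returns a pooled channel, or null if the caller must open a new one. */
    public Channel acquire() {
        Channel ch;
        // Skip over channels that died while sitting in the pool.
        while ((ch = idle.poll()) != null) {
            if (ch.isActive()) {
                return ch;
            }
            ch.close();
        }
        return null;
    }

    /** Hands a healthy channel back for reuse by the next request. */
    public void release(Channel ch) {
        if (ch.isActive()) {
            idle.offer(ch);
        } else {
            ch.close();
        }
    }
}
```

With HTTP/2 multiplexing, such a pool only needs to hold a handful of long-lived channels per storage node rather than thousands of per-request connections.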
Furthermore, SSL handling was optimized by using the native, JNI-based SSL engine, and custom SSL initialization logic avoided lengthy DNS lookup delays. Lastly, the team optimized encoding/decoding performance by creating a custom codec that encapsulated an HTTP/2 request as an HTTP/1.1 request, to better handle the many custom HTTP headers used by Espresso, and by disabling HPACK header compression.
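Both the SSL provider and the HPACK settings are exposed through Netty's public builders. The sketch below, assuming Netty 4.1 with a netty-tcnative dependency on the classpath, shows where these knobs live; LinkedIn's exact configuration is not published:

```java
import io.netty.handler.codec.http2.Http2FrameCodec;
import io.netty.handler.codec.http2.Http2FrameCodecBuilder;
import io.netty.handler.codec.http2.Http2Settings;
import io.netty.handler.ssl.SslContext;
import io.netty.handler.ssl.SslContextBuilder;
import io.netty.handler.ssl.SslProvider;

// Sketch of the two configuration points mentioned above, not LinkedIn's
// actual settings.
public final class TransportConfig {

    // Use the native, JNI-based OpenSSL engine (via netty-tcnative)
    // instead of the JDK's SSLEngine.
    static SslContext nativeSslContext() throws Exception {
        return SslContextBuilder.forClient()
                .sslProvider(SslProvider.OPENSSL)
                .build();
    }

    // Advertising a zero-size HPACK dynamic table disables stateful
    // header compression for headers the peer sends to this endpoint.
    static Http2FrameCodec http2CodecWithoutHpack() {
        return Http2FrameCodecBuilder.forClient()
                .initialSettings(new Http2Settings().headerTableSize(0))
                .build();
    }
}
```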
Latency Reduction After HTTP/2 Migration (Source: LinkedIn Engineering Blog)
The team reported that, after all these customizations, the migration to HTTP/2 brought substantial improvements over HTTP/1.1: it reduced the number of TCP connections by 88%, latency by 65-75%, and garbage collection times by 75-81%, and brought the waiting time to acquire a connection down from 11ms to 0.02ms (a 99% improvement).