Reddit introduced Envoy into their backend framework as a service-to-service proxy to support their ongoing architectural improvements. As their architecture evolves from a monolith into smaller services, the complexity of supporting and debugging their existing framework is becoming too costly. By adopting Envoy as a service-to-service Layer 4/Layer 7 proxy, they found significant improvements in observability, ease of adoption, and performance.
According to Courtney Wang, senior software engineer at Reddit, Reddit has undergone significant growth in both their engineering team size and product complexity over the past three years. This has run in parallel with an evolution of their backend architecture as they move from a monolithic application and begin to adopt a more service-oriented architecture. These changes have made debugging more complex, as engineers move from investigating function calls to tracing RPCs between multiple services. In addition, the number of considerations that engineers need to weigh when standing up new services has increased and now includes understanding client request behaviours, retry handling, circuit breaking, and granular route control.
Reddit has been using Airbnb's SmartStack as their service mesh since they began splitting services out of their monolith. As service instances are stood up and torn down, registration is handled by SmartStack Nerve. Nerve is a Ruby process that runs as a sidecar on each instance and registers that instance in a central ZooKeeper cluster. To simplify work for developers, Reddit developed Baseplate, a common framework that provides a health check interface and an abstraction layer for connecting to Nerve.
Reddit makes use of Synapse, a per-instance Ruby process, to manage their service endpoint discovery. Synapse reads the ZooKeeper registry that Nerve populates and then writes the endpoint entries to a local HAProxy configuration file. HAProxy runs as a sidecar process that handles proxying and load balancing of the downstream service traffic.
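The discovery flow described above can be sketched roughly as follows. This is an illustrative Python sketch, not Synapse's actual Ruby implementation: the registry is modeled as a plain dictionary standing in for the data that Nerve populates, and the function name `render_haproxy_backends`, the service name, and the addresses are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical stand-in for the registry that Nerve populates:
# service name -> list of (host, port) endpoints.
Registry = Dict[str, List[Tuple[str, int]]]


def render_haproxy_backends(registry: Registry) -> str:
    """Render one HAProxy `backend` section per registered service."""
    lines: List[str] = []
    for service in sorted(registry):
        lines.append(f"backend {service}")
        lines.append("    mode tcp")
        lines.append("    balance roundrobin")
        # Sort endpoints so repeated renders of the same registry are stable.
        for index, (host, port) in enumerate(sorted(registry[service])):
            lines.append(f"    server {service}-{index} {host}:{port} check")
        lines.append("")
    return "\n".join(lines)


if __name__ == "__main__":
    # Illustrative data only; not taken from Reddit's infrastructure.
    example = {"comments": [("10.0.0.2", 9090), ("10.0.0.1", 9090)]}
    print(render_haproxy_backends(example))
```

A real deployment would also need to watch the registry for changes and reload the proxy after each rewrite, which this sketch leaves out.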
While the SmartStack implementation has remained relatively unchanged and operational, their evolving infrastructure has begun to push against the limits of what SmartStack offers. As Wang notes, this led the team to re-evaluate the service mesh landscape and see if a replacement made sense for them. The key pain points they were hoping to resolve were:
- Nerve and Synapse can only accept static configuration, meaning that service registration updates required Puppet configuration changes and updates across their service fleet
- Synapse's configuration writer for HAProxy provides only basic routing definitions
- They had minimal observability for traffic going through HAProxy as it doesn't understand Thrift, which is Reddit's primary internal protocol
In selecting a new service mesh candidate, Wang indicates their key requirements were ensuring no performance impact, gaining Layer 7 Thrift support in the proxy, and ease of extending and integrating with the new tool. The team decided on Envoy as it fit within these requirements with tradeoffs that they deemed acceptable.
The largest gap with Envoy was lack of first-class Thrift support. Wang recounts that they worked with Turbine Labs, who had recently announced their support for Envoy, to contract development for Thrift support. With that partnership they were able to introduce Thrift-aware proxying, routing, request/response metrics, and rate-limiting.
The first step that they took toward deploying Envoy was to have it replace HAProxy for basic TCP proxying. Nerve and Synapse would still handle service registration and discovery, which meant that they would not be able to leverage Envoy's dynamic discovery service. This allowed them to keep their service discovery layer stable while also rolling out Envoy to production. By running both HAProxy and Envoy in parallel, listening on different ports, they could roll back simply by adjusting the configuration. This also allowed them to audit the Envoy configuration against their HAProxy configuration to verify the accuracy of their Synapse configuration generator.
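The audit step mentioned above could look something like the following. This is a minimal sketch under stated assumptions, not Reddit's tooling: it assumes both proxy configurations have already been parsed into a mapping from service name to a set of `host:port` strings, and the function name `audit_endpoints` and the sample data are hypothetical.

```python
from typing import Dict, Set

# Hypothetical, already-parsed view of a proxy configuration:
# service name -> set of "host:port" endpoint strings.
EndpointMap = Dict[str, Set[str]]


def audit_endpoints(
    haproxy: EndpointMap, envoy: EndpointMap
) -> Dict[str, Dict[str, Set[str]]]:
    """Return, per service, the endpoints present in only one configuration."""
    mismatches: Dict[str, Dict[str, Set[str]]] = {}
    for service in sorted(set(haproxy) | set(envoy)):
        in_haproxy = haproxy.get(service, set())
        in_envoy = envoy.get(service, set())
        if in_haproxy != in_envoy:
            mismatches[service] = {
                "only_haproxy": in_haproxy - in_envoy,
                "only_envoy": in_envoy - in_haproxy,
            }
    return mismatches


if __name__ == "__main__":
    # Illustrative data only.
    haproxy_view = {"comments": {"10.0.0.1:9090", "10.0.0.2:9090"}}
    envoy_view = {"comments": {"10.0.0.1:9090"}}
    print(audit_endpoints(haproxy_view, envoy_view))
```

An empty result means the two generated configurations agree on every service's endpoints, which is the condition such an audit would be checking for.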
Wang indicates that Envoy has now been serving production traffic smoothly for nearly four months. He states that there were no show-stopping issues, but notes that Envoy's network connection handling differed enough from HAProxy's to cause some unexpected errors in the application connection management code.
With Envoy and the new Thrift filter, they are finding greater observability at the network layer including request and response metrics that weren't available before without application code changes. They have yet to be able to make an accurate measurement on service latencies as HAProxy is still running as a sidecar to facilitate quick rollbacks during this transition period.
With the success of adopting Envoy at the proxy level to manage Layer 4 traffic, the next step in Reddit's plan is to deploy Envoy's discovery service API backed by a centralized configuration store. Further-reaching plans include investigating running Envoy at the edge as a replacement for HAProxy as the load balancer for their core Reddit backend application and for AWS ALBs at some of their external ingress points. Wang believes that this will provide greater observability and service routing control, such as shadowing inbound traffic and traffic shifting at the edge. eBay recently went through a similar migration, leveraging Envoy to replace physical load balancers at external ingress points, and reported some of the wins that Wang and team are hoping to achieve. Wang's hope is that this further adoption of Envoy will assist the teams as they work to split their monolith into smaller services.