The Many Faces of Envoy Proxy: Edge Gateway, Service Mesh, and Hybrid Networking Bridge


At the inaugural EnvoyCon in Seattle, USA, engineers from Pinterest, Yelp and Groupon presented their current use cases for the Envoy Proxy. The overarching message was that Envoy appears to be moving closer to fulfilling its vision of providing the "universal [proxy] data plane API" for modern networking. Large commercial organisations are entrusting Envoy to manage production traffic for a variety of use cases, including edge gateways, service meshes and hybrid networking bridges.

Over the past year the Pinterest engineering team has migrated from a "perimeter load balancer" model to an Envoy-based edge proxy, in which all production edge traffic now passes through Envoy. Derek Argueta, traffic site reliability engineer (SRE) at Pinterest, described how the existing infrastructure was deployed within the AWS public cloud and used a Layer 7 AWS Classic Elastic Load Balancer and a Varnish cache to provide ingress traffic management. Challenges with this solution included ELB scaling issues, suboptimal TLS termination, and a lack of effective support for dynamic upstream handling (for example, updating routing as new services are deployed or old instances are retired). After analysing the currently available networking proxies, the team also discovered additional motivations for migrating to Envoy, which included the extension API, more effective observability, and consistency with service mesh plans.

The initial Envoy migration effort focused on creating a new edge solution that offered feature parity with the existing stack, which included serving A/B deploys, bot detection, and CDN request signing. Several "hiccups" were encountered during the migration, including Envoy's aggressive default circuit breaking, orchestrating hot restarts across containers to prevent traffic being dropped, and various HTTP "nits and mismatches" (related to Hyrum's Law). Accordingly, the engineering team invested heavily in developing the skills needed to debug Envoy issues. This was a challenge, as existing skills within the edge engineering team typically focused on hardware-based enterprise load balancer solutions.
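To illustrate the circuit-breaking "hiccup": Envoy's per-cluster circuit-breaker thresholds default to relatively low values (1024 connections and requests), which can be too aggressive for high-traffic edge clusters. The sketch below is not Pinterest's configuration; it simply builds a JSON-shaped Envoy cluster fragment with raised thresholds, using real Envoy field names but assumed example values.

```python
def cluster_with_circuit_breakers(name, max_connections=10240,
                                  max_pending_requests=10240,
                                  max_requests=10240):
    """Build a JSON-style Envoy cluster fragment with raised
    circuit-breaker thresholds (Envoy defaults are 1024 each)."""
    return {
        "name": name,
        "circuit_breakers": {
            "thresholds": [{
                "priority": "DEFAULT",
                "max_connections": max_connections,
                "max_pending_requests": max_pending_requests,
                "max_requests": max_requests,
            }]
        },
    }
```

A fragment like this would be merged into the cluster definition served by the control plane, overriding the defaults before edge traffic ramps up.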

Argueta provided a comprehensive overview of how Pinterest performs "stage-based" A/B deployments with the new Envoy-powered edge solution. This was implemented as a "custom solution" due to the requirement for compatibility with the existing deployment system. During the rollout of a new feature, multiple versions of the service are deployed, and routing is controlled via a list of host, IP and stage data added to ZooKeeper. This data is passed to the edge Envoys via a custom control plane consisting of a sidecar process interacting with the endpoint discovery service (EDS) API.
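As a rough illustration of what such a sidecar control plane does (this is not Pinterest's actual code, and the record schema is invented), the function below maps host/IP/stage records, as they might be read from ZooKeeper, into the JSON shape of an Envoy EDS `ClusterLoadAssignment`, filtering endpoints to a single deployment stage:

```python
def build_eds_response(cluster_name, records, stage):
    """Convert stage-tagged host records into an EDS-style
    ClusterLoadAssignment (real Envoy field names)."""
    endpoints = [
        {"endpoint": {"address": {"socket_address": {
            "address": r["ip"], "port_value": r["port"]}}}}
        for r in records if r["stage"] == stage
    ]
    return {
        "cluster_name": cluster_name,
        "endpoints": [{"lb_endpoints": endpoints}],
    }
```

Serving a response like this from the EDS API lets routing follow deploys automatically: as records for a new stage appear in ZooKeeper, the edge Envoys pick up the new endpoints without a restart.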

Pinterest Envoy edge control plane.

In regard to observability, a series of graphs and dashboards has been created and iterated on, with the goal of providing a "high signal to noise ratio" and helping SREs and other engineers quickly identify issues and their underlying causes. Alerts have been configured for edge Envoys that watch for high resource usage, connection errors, and protocol errors. Future work to be undertaken includes providing Thrift support and Memcached integration, adding a MySQL and Apache Kafka filter, and moving API authentication to edge Envoys.
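A minimal sketch of the kind of check such an alert might run: Envoy's admin interface exposes plain-text stats as `name: value` lines, and the counters below (`upstream_cx_connect_fail`, `downstream_cx_protocol_error`) are real Envoy stat names for connection and protocol errors. The threshold and parsing logic here are assumed for illustration.

```python
def find_error_stats(stats_text, threshold=0):
    """Scan Envoy admin /stats output and return any watched
    error counters that exceed the threshold."""
    watched = ("upstream_cx_connect_fail", "downstream_cx_protocol_error")
    alerts = {}
    for line in stats_text.splitlines():
        name, _, value = line.partition(": ")
        if any(name.endswith(w) for w in watched) and int(value) > threshold:
            alerts[name] = int(value)
    return alerts
```

In practice these counters would more likely be scraped into a metrics system and alerted on there, but the principle is the same: watch a small set of high-signal error stats per edge Envoy.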

The next session, "How to DDOS yourself with Envoy (and other migration horrors)", was presented by Ben Plotnick and John Billings, software engineer and technical lead at Yelp, respectively. Billings began the talk by stating that application developers primarily care about quickly shipping bug-free code so that they can solve customer problems; they are not so concerned with things such as using new RPC technologies or setting communication timeout values. Accordingly, in 2014 the Yelp infrastructure engineering team implemented what they refer to as "service mesh v1". This abstracted away some of the network communication handling from the application to the infrastructure, and was implemented by running a centralised (and manually configured) HAProxy routing layer. This was somewhat similar to the enterprise service bus (ESB) pattern from classical service-oriented architecture (SOA).

Although useful, the manual reloading of HAProxy (and the associated engineering effort), combined with scaling issues, led the infrastructure team to implement "service mesh v2", which was based on Airbnb's SmartStack service discovery solution. This solution still used HAProxy, but many routine operational tasks were implemented, and automated, using sidecar processes. This meant that there was no more waiting on human operators to implement configuration changes, and no more manual scaling issues.

Although this solution ran successfully for four years, a survey of application "developer happiness" in 2018 revealed several areas for potential improvement. These included the ability to implement dynamic request routing, and to gain access to production canaries prior to exposing them to user traffic (without requiring operations engineers' time). An associated operational risk assessment was conducted, and three potential solutions explored: extending HAProxy to provide this functionality, as the engineering team were already familiar with it; switching to Envoy, which was gaining traction via success stories from Lyft, Google and IBM; or switching to Linkerd, another popular open source service mesh implementation. The decision was ultimately made to migrate to an Envoy-powered service mesh, using the underlying architecture pattern provided by SmartStack.

Yelp Envoy architecture

The infrastructure engineering team implemented "meshd", a simple control plane for Envoy that was written in Python (due to the team's comfort with this language) and used flat files for the loading of configuration. Applying the "incremental development" principle, a narrow use case was identified and a solution using meshd implemented end-to-end in order to verify this approach. Plotnick discussed how identifying "cut points" that provided maximum impact (with minimum intrusion) led them to implement thin client libraries for use by the development teams. Functionality provided in these thin clients included: context propagation, setting x-envoy-* headers and data marshalling. The client libraries were also an ideal location to implement feature flags to allow experimentation by toggling the routing between the existing and new Envoy-powered solution.
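A hypothetical thin-client helper of the kind described in the talk might look like the following. The `x-envoy-*` header names used here are real Envoy request headers that control per-request timeout and retry behaviour; the function signature and the `x-request-id` propagation scheme are assumptions for illustration, not Yelp's actual client library.

```python
def outbound_headers(timeout_ms, retries, incoming_headers=None):
    """Build headers for an outbound request through the mesh:
    set Envoy control headers and propagate tracing context."""
    headers = {
        "x-envoy-upstream-rq-timeout-ms": str(timeout_ms),
        "x-envoy-max-retries": str(retries),
        "x-envoy-retry-on": "5xx",
    }
    # Propagate the incoming request ID so traces span service hops.
    request_id = (incoming_headers or {}).get("x-request-id")
    if request_id:
        headers["x-request-id"] = request_id
    return headers
```

Centralising this logic in a thin client keeps timeout and retry policy out of application code, and gives the infrastructure team a single place to add feature flags for toggling between the old and new routing paths.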

Continuous integration and continuous delivery principles were used extensively during the development of the new Envoy-powered mesh, and automated end-to-end tests replaced the manual validation of the previous solution.

The presentation concluded with a reminder that although working with emerging technology like Envoy is a lot of fun, the primary focus for engineering should be on solving users' problems -- after all, this is why the organisation and team were formed.

In the next session Tristan Blease, staff software engineer at Groupon, and Michael Chang, senior software engineer at Groupon, presented "Bridging the gap between on-prem and cloud: a story about Envoy + a hybrid boundary". Groupon currently runs the majority of its application workloads on premises, but the plan is to ultimately run a hybrid stack, with the inclusion of public cloud and containerised services that are deployed onto Kubernetes.

The current stack uses a variety of technologies for handling edge and internal traffic, and the planned migration created new requirements that needed to be assessed: avoiding manual processes for configuration, providing a "stronger story" around observability, and implementing TLS and access control for service-to-service communication. The engineering team experimented with and identified their "ideal architecture components", one of which was Envoy.

Groupon ideal architecture components.

Blease and Chang provided a comprehensive overview of the current and planned system architecture, and discussed the challenges of getting traffic from on-premises to cloud and vice versa. Envoy instances will be deployed as edge proxy nodes, and will be responsible for routing traffic between environments. Envoy configuration will primarily be stored in a NoSQL database which, with a small amount of glue, will act as the control plane, serving host, route and endpoint data via the associated Envoy xDS management APIs.
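A sketch of what that "small amount of glue" could look like, under assumptions: route documents in a schema invented here for illustration, mapped into the JSON shape of an Envoy RDS `RouteConfiguration` (the Envoy field names are real) for serving over the route discovery API:

```python
def build_route_config(name, route_docs):
    """Map stored route documents into an RDS-style
    RouteConfiguration with one virtual host per document."""
    return {
        "name": name,
        "virtual_hosts": [{
            "name": doc["service"],
            "domains": doc["domains"],
            "routes": [{
                "match": {"prefix": doc["prefix"]},
                "route": {"cluster": doc["cluster"]},
            }],
        } for doc in route_docs],
    }
```

Because the edge Envoys poll or stream this configuration via xDS, moving a service between on-premises and cloud becomes a database update rather than a proxy redeploy.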

Groupon Envoy architecture.

Slide decks of talks can be found on the EnvoyCon Sched page, and the recording of many of the presentations can be found on the CNCF YouTube channel.
