Envoy as a Service-to-Service Proxy at Mux

In a recent blog post, Mux discussed their use of Envoy and how they incorporated it into their systems to solve a scaling and load-balancing problem with Kubernetes and long-lived HTTP/2 connections. Mux focuses on video streaming infrastructure software and performance analytics. Their services power popular applications such as Udemy and TED.

Mux runs its services on Kubernetes, and service-to-service communication is carried out via gRPC on top of long-lived HTTP/2 connections. Unfortunately, Kubernetes Services balance traffic per connection rather than per request, so all requests multiplexed over one long-lived connection keep going to the same Pod. As a result, some Pods might receive more requests than others, resulting in uneven load and ineffective horizontal scaling. Although Mux’s engineering team could mitigate the issue with manual restarts, that workaround was clearly undesirable and unsustainable.
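The imbalance described above can be illustrated with a small simulation. This is a minimal sketch, not Mux's code: the Pod names, client count, and request count are made up, and round-robin is assumed as the selection policy in both cases so that the only difference is when the Pod is chosen.

```python
from collections import Counter
from itertools import cycle


def connection_level(pods, clients, requests_per_client):
    """Each client picks a Pod once, at connect time, and reuses that connection."""
    load = Counter({pod: 0 for pod in pods})
    pick = cycle(pods)
    for _ in range(clients):
        pod = next(pick)                  # chosen once per long-lived connection
        load[pod] += requests_per_client  # every request follows that connection
    return load


def request_level(pods, clients, requests_per_client):
    """A proxy or smart client picks a Pod again for every single request."""
    load = Counter({pod: 0 for pod in pods})
    pick = cycle(pods)
    for _ in range(clients * requests_per_client):
        load[next(pick)] += 1
    return load


# Hypothetical cluster: four server Pods, two long-lived gRPC clients.
pods = ["pod-a", "pod-b", "pod-c", "pod-d"]
conn = connection_level(pods, clients=2, requests_per_client=1000)
req = request_level(pods, clients=2, requests_per_client=1000)

print(dict(conn))  # two Pods carry all traffic, two sit idle
print(dict(req))   # all four Pods share the traffic evenly
```

With only two long-lived connections, adding more Pods does nothing for the connection-level case, which is why horizontal scaling was ineffective.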

Mux’s engineering team first considered taking advantage of features within the gRPC client libraries to handle load balancing on the client side, as suggested in a blog post on scaling long-lived connections in Kubernetes. The elegance of this solution is that it requires minimal extra dependencies or points of failure external to the service itself. However, it also comes with several significant shortcomings. First, Mux's gRPC services were written in multiple languages and frameworks, so the engineering team would need to implement client-side load balancing more than once. Additionally, most client-based solutions were in somewhat early stages of development; even the official gRPC balancer for Go, the primary language at Mux, was still listed as experimental. Last but not least, client-based approaches in a distributed system require version and functional compatibility across many services in various languages, creating a standardization challenge for future maintenance.

Client-Based Strategy, source: Mux’s engineering blog
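To give a sense of what the client-based strategy asks of every service, here is a minimal sketch of the picking logic a client would have to carry. It is plain Python rather than any gRPC library API; the class name `RoundRobinPicker`, its methods, and the backend addresses are hypothetical.

```python
class RoundRobinPicker:
    """Rotates requests across the currently known backend addresses."""

    def __init__(self, backends):
        self._backends = list(backends)
        self._next = 0

    def update(self, backends):
        """Replace the backend list, e.g. after Pods are added or removed."""
        self._backends = list(backends)
        self._next = 0

    def pick(self):
        """Return the backend that should receive the next request."""
        if not self._backends:
            raise RuntimeError("no backends available")
        backend = self._backends[self._next % len(self._backends)]
        self._next += 1
        return backend


# Hypothetical Pod addresses, as a client might learn them from service discovery.
picker = RoundRobinPicker(["10.0.0.1:50051", "10.0.0.2:50051"])
first_picks = [picker.pick() for _ in range(4)]

# A new Pod appears; the client has to notice and refresh its list.
picker.update(["10.0.0.1:50051", "10.0.0.2:50051", "10.0.0.3:50051"])
later_picks = [picker.pick() for _ in range(3)]
```

The picker itself is small, but it also needs discovery, health checking, and connection management around it, and all of that would have to be repeated and kept compatible in each language Mux uses.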

The second option, a full service mesh, combines network-level components such as Envoy with a management layer, such as Istio, Linkerd, Maesh, or Consul Connect, that serves as a configuration and control plane running over the network-level components. However, Mux dismissed this solution because it comes with a fair amount of complexity and would incur additional costs to operate and maintain. In addition, some of these tools were not compatible with the older parts of Mux’s Kubernetes clusters, which were on older versions such as 1.8 and 1.12.

Full Service-Mesh Strategy, source: Mux’s engineering blog

Mux wanted a middle-ground solution that was language agnostic and did not introduce too much overhead, either in new services or in ongoing maintenance. Mux's compromise was to use Envoy as the network proxy without adopting a full service mesh. Mux had the option to deploy Envoy as a standalone Kubernetes deployment, a daemonset, or a sidecar container.

In the standalone deployment setup, Mux would create a pool of Envoy instances to field requests for all services that wanted to make gRPC requests. However, this approach adds one network hop to every gRPC request, which potentially impacts network performance, and it requires additional autoscaling configuration to ensure sufficient proxying capacity.

The daemonset approach would reduce the cost of the additional network hop by ensuring that every host running Pods that make gRPC requests also runs an Envoy Pod that proxies all of their traffic. However, this approach has several risks that outweigh the benefits. The loss of this single daemonset Pod would cause all outbound gRPC requests on that host to fail. In addition, fluctuating network traffic on the host would be handled by only a single Pod, making horizontal scaling more of a challenge.

The sidecar container approach addresses the challenges of both the standalone deployment and the daemonset. In this approach, Mux adds an Envoy sidecar container to the primary service Pod, along with a flag or environment variable on the primary container that points it at the localhost address of the Envoy sidecar for gRPC networking. This setup allows network proxying capacity to scale alongside any automatic scaling of services.

Sidecar/Proxy-based Strategy, source: Mux’s engineering blog
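From the service's point of view, the sidecar setup can amount to little more than choosing a different dial address. The sketch below is an assumption about how that might look, not Mux's implementation: the environment variable name `GRPC_PROXY_ADDR`, the port `9001`, and the service address `video-api:50051` are all hypothetical.

```python
import os


def grpc_target(service_addr, env=None):
    """Return the address the gRPC client should dial.

    If the (hypothetical) GRPC_PROXY_ADDR variable is set on the container,
    traffic goes to the local Envoy sidecar; otherwise the client dials the
    service directly, as it did before the sidecar was introduced.
    """
    env = os.environ if env is None else env
    return env.get("GRPC_PROXY_ADDR", service_addr)


# Without the sidecar, the client dials the Kubernetes Service directly.
direct = grpc_target("video-api:50051", env={})

# With the sidecar, the same code dials Envoy on localhost instead.
proxied = grpc_target("video-api:50051", env={"GRPC_PROXY_ADDR": "127.0.0.1:9001"})
```

Because the change is confined to configuration, the same pattern applies regardless of the language a service is written in, which is the language-agnostic property Mux was after.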

Since getting this setup running and rolled out, Mux no longer sees any load-balancing issues with gRPC and Kubernetes. Mux also got some excellent networking metrics for free via Envoy, and has been able to find several other valuable use cases for Envoy proxy injection outside of its gRPC services.

About the Author

Patrick Zhang
