Implementation of Zero-Configuration Service Mesh at Netflix


In a recent blog post, Netflix described why it engaged the Envoy community and Kinvolk to implement a new feature for Envoy, the open-source proxy originally developed at Lyft. This new feature, called On-Demand Cluster Discovery, helped Netflix implement a zero-configuration service mesh.

Inter-Process Communication (IPC) is crucial for communication within the Netflix stack. Since the company moved all of its infrastructure to the cloud (AWS) in 2010, it has needed tooling suited to cloud-native environments. To manage IPC easily, Netflix developed Eureka for service discovery and Ribbon for client-side load balancing and communication.

The main goal of Eureka is to abstract the names of destination services behind Virtual IPs (VIPs) and, where secure communication is needed, Secure VIPs (SVIPs). To connect to another service, a client needs only the destination service's name and the type of communication (secure or not). IPC clients are instantiated with the target VIP or SVIP, and the Eureka client translates that VIP or SVIP and port into a set of IP addresses by fetching information from the Eureka server. The downside is that the single point of failure migrates from the load balancers to Eureka. Ribbon complements this by providing a client-side load balancer implemented as a code library.
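The VIP-to-endpoint translation described above can be sketched as follows. This is an illustrative model only, assuming a simple in-memory registry; the class and method names (`VipRegistry`, `pickEndpoint`) are hypothetical and not Eureka's or Ribbon's actual APIs.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;

// Illustrative sketch of client-side VIP resolution and load balancing.
// The names here are hypothetical, not the real Eureka/Ribbon API.
public class VipRegistry {
    // In Eureka, this mapping is fetched from the Eureka server and
    // refreshed periodically; here it is a static in-memory map.
    private final Map<String, List<String>> vipToInstances;

    public VipRegistry(Map<String, List<String>> vipToInstances) {
        this.vipToInstances = vipToInstances;
    }

    // Translate a VIP name into one concrete instance, as a Ribbon-style
    // client-side load balancer would (random choice stands in for
    // Ribbon's pluggable load-balancing rules).
    public String pickEndpoint(String vip) {
        List<String> instances = vipToInstances.get(vip);
        if (instances == null || instances.isEmpty()) {
            throw new IllegalStateException("no instances registered for " + vip);
        }
        return instances.get(ThreadLocalRandom.current().nextInt(instances.size()));
    }

    public static void main(String[] args) {
        VipRegistry registry = new VipRegistry(Map.of(
            "recommendations-vip", List.of("10.0.0.1:8080", "10.0.0.2:8080")));
        System.out.println(registry.pickEndpoint("recommendations-vip"));
    }
}
```

Note that because the resolution happens inside the client, the load balancer infrastructure drops out of the request path, which is exactly why the single point of failure shifts to the registry itself.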


How Netflix implements IPC with Eureka


This architecture worked well for over a decade, but the company needed to migrate to a service mesh for three main reasons:

  1. The Netflix IPC technology stack now includes a mix of REST, GraphQL, and gRPC
  2. Services evolved from a purely Java-based architecture to a polyglot architecture
  3. Features have been incrementally added to IPC clients

The decision was to use Envoy to centralize IPC features in a single implementation and keep the per-language clients as simple as possible. Furthermore, Envoy supports the discovery abstraction, so the IPC clients can continue to use it. The downside is that Envoy requires clusters to be specified in the proxy's configuration, and this became problematic for the Netflix architecture because a service can potentially communicate with dozens of clusters. Furthermore, the Netflix architecture is constantly changing, which means that the set of clusters changes over time.
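For context, a statically defined Envoy cluster looks roughly like the following (the field names are standard Envoy v3 configuration; the cluster name and address are illustrative). Every upstream a service talks to would need such an entry ahead of time, which is what made static configuration impractical at Netflix's scale:

```yaml
static_resources:
  clusters:
  - name: recommendations            # illustrative upstream service
    type: STRICT_DNS
    connect_timeout: 1s
    load_assignment:
      cluster_name: recommendations
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address:
                address: recommendations.example.internal  # placeholder host
                port_value: 8080
```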

To overcome this constant change of cluster topologies, the Netflix team evaluated several approaches:

  • Get service owners to define the clusters their service needs to talk to
  • Auto-generate Envoy config based on a service’s call graph
  • Push all clusters to every app

But all of these approaches have multiple downsides, so the solution implemented was to fetch cluster information on demand at runtime. However, Envoy Proxy did not support this functionality out of the box. Through a collaboration between the Envoy community, Netflix, and Kinvolk, On-Demand Cluster Discovery (ODCDS) was developed. With this new feature, a proxy can look up cluster information at the time of the first connection. The new flow becomes the following:

  1. Client request comes into Envoy
  2. Extract the target cluster based on the Host. If that cluster is known already, go to step 7
  3. If the cluster doesn’t exist, the request is paused
  4. Make a request to the Cluster Discovery Service (CDS) endpoint on the control plane. The control plane generates a customized CDS response based on the service’s configuration and Eureka registration information
  5. Envoy gets back the cluster (CDS), which triggers a pull of the endpoints via Endpoint Discovery Service (EDS). Endpoints for the cluster are returned based on Eureka status information for that VIP or SVIP
  6. The client's request is unpaused
  7. Envoy handles the request as usual: it picks an endpoint using a load-balancing algorithm and issues the request
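The on-demand flow above can be approximated with a cache that "pauses" a request only on first use of a cluster. This is a simplified sketch under stated assumptions, not Envoy's actual ODCDS implementation; the control-plane lookup (steps 4-5) is stubbed out as a plain function.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Simplified model of on-demand cluster discovery: the first request for an
// unknown cluster triggers a control-plane fetch and waits for it; later
// requests for the same cluster proceed immediately from the cache.
public class OnDemandClusterCache {
    private final ConcurrentHashMap<String, CompletableFuture<List<String>>> clusters =
        new ConcurrentHashMap<>();
    private final Function<String, List<String>> controlPlane; // stand-in for CDS/EDS

    public OnDemandClusterCache(Function<String, List<String>> controlPlane) {
        this.controlPlane = controlPlane;
    }

    // Known cluster -> returns immediately from the cache (step 2 -> step 7).
    // Unknown cluster -> the caller blocks ("pauses") until discovery
    // completes (steps 3-6).
    public List<String> endpointsFor(String cluster) {
        return clusters.computeIfAbsent(cluster,
                c -> CompletableFuture.supplyAsync(() -> controlPlane.apply(c)))
            .join(); // "unpauses" once the discovery response arrives
    }
}
```

Using `computeIfAbsent` means concurrent first requests for the same cluster trigger only one discovery fetch, mirroring how Envoy holds all pending requests for a cluster until its CDS/EDS response arrives.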

IPC with Eureka and Envoy

This flow completes in milliseconds, but some use cases require even lower latency on the first request. To overcome this limitation, the solutions identified included:

  1. Having services predefine the clusters they communicate with, or prime connections before their first request
  2. Pre-pushing clusters from the control plane as proxies start up, based on historical request patterns
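The first option above, priming, can be sketched as a startup step that warms the cluster cache before traffic arrives, so no request pays the discovery round-trip. All names here (`ClusterPrimer`, `prime`, `isWarm`) are hypothetical; the discovery fetch is stubbed.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of priming: resolve a service's declared clusters at startup so
// the first real request finds them already warm. Names are illustrative.
public class ClusterPrimer {
    private final Map<String, List<String>> warmClusters = new ConcurrentHashMap<>();

    // Called at startup with the clusters a service declares it will use.
    public void prime(List<String> predefinedClusters) {
        for (String cluster : predefinedClusters) {
            // In a real mesh this would be a CDS/EDS fetch; stubbed here.
            warmClusters.put(cluster, fetchEndpoints(cluster));
        }
    }

    // A request to a warm cluster skips the on-demand discovery pause.
    public boolean isWarm(String cluster) {
        return warmClusters.containsKey(cluster);
    }

    private List<String> fetchEndpoints(String cluster) {
        return List.of(cluster + ".internal:8080"); // placeholder endpoints
    }
}
```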

Full details of this and continuing work can be found in the Netflix blog article, "Zero Configuration Service Mesh with On-Demand Cluster Discovery".
