Monolith to Microservices: Migrating Snap’s Architecture Using a Service Mesh

Snap's two-year evolutionary architectural shift from a monolith to cloud-hosted microservices has led to a 65% reduction in compute costs, along with reduced redundancy and increased reliability for customers, all while meeting security and privacy compliance requirements.

A service-oriented architecture gave engineers scalability and clear ownership. The open-source edge proxy Envoy is the core building block, creating a consistent layer for inter-service communication. An internal web app, Switchboard, serves as the control plane for Snap's service mesh and gives service owners one place to manage their service dependencies.

The cloud infrastructure evolved in the past two years from running a monolith in Google App Engine to microservices in Kubernetes across Amazon Web Services and Google Cloud.

Starting a microservice-based system from scratch posed several challenges, because any new design had to account for the existing underlying infrastructure: network topology, authentication, cloud resource provisioning, deployment, logging and monitoring, traffic routing, rate limiting, and staging and production environments.

As detailed on the Snap Engineering blog, the team also had to take the current experience of Snapchatters into account when looking for a feasible plan. The blog adds that Snap had no dedicated team of engineers for the effort, and therefore no time to build everything from scratch.

Instead of starting from scratch, Snap adopted the service mesh design pattern, built on the open-source edge proxy Envoy.

Envoy provided features such as support for gRPC and HTTP/2, client-side load balancing, pluggable filters, and a clear separation of data plane and control plane through a set of dynamic management APIs called xDS. Because it runs on both AWS and Google Cloud, Envoy became the layer for Snap's service-to-service communication. At Snap, each Envoy proxy connects to a custom control plane and receives service discovery and detailed traffic management settings over its xDS API.
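To make "client-side load balancing" with dynamically discovered endpoints concrete, here is a minimal Python sketch. It is not Envoy's implementation and does not speak the xDS protocol; the `RoundRobinBalancer` class and its `update` and `pick` methods are hypothetical names chosen for illustration. The idea it shows is only that the caller picks a backend itself, and that the endpoint list can be replaced at any time by a discovery update.

```python
class RoundRobinBalancer:
    """Hypothetical client-side balancer: the caller chooses the backend itself."""

    def __init__(self, endpoints):
        self.update(endpoints)

    def update(self, endpoints):
        # Replace the endpoint list, as a service discovery update would.
        self._endpoints = list(endpoints)
        self._index = 0

    def pick(self):
        # Rotate through the known endpoints in order.
        if not self._endpoints:
            raise LookupError("no endpoints available")
        endpoint = self._endpoints[self._index % len(self._endpoints)]
        self._index += 1
        return endpoint


balancer = RoundRobinBalancer(["10.0.0.1:8080", "10.0.0.2:8080"])
first_three = [balancer.pick() for _ in range(3)]

# A discovery update swaps the endpoint set without restarting the client.
balancer.update(["10.0.0.3:8080"])
after_update = balancer.pick()
```

Round-robin is only one of several possible policies; it is used here because it is the simplest to follow.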

With a service mesh, it was important to address questions about how mobile clients would communicate through Envoy. In addition, because the mesh spans AWS and Google Cloud, engineers had questions about managing their Envoy configurations securely.

Out of these questions came the Snap Service Mesh. It includes an internal web app named Switchboard, a single control panel for Snap's services where service owners can manage their service dependencies.

The Switchboard configuration has services at its core. Each service has a protocol and basic metadata: owner, email list, and description. The clusters that run these services can be in any cloud provider, region, or environment. Switchboard services have dependencies and consumers, which are themselves Switchboard services. Had Snap exposed the entire system API surface to its engineering teams, the number of possible configurations would have made them difficult to manage.
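The service-centric model described above can be sketched as a small data structure. This is an illustrative assumption, not Switchboard's actual schema: the `Cluster` and `Service` classes, their field names, and the `consumers_of` helper are all hypothetical. The sketch stores dependencies explicitly and derives consumers from them, which is one way to keep the two views consistent.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """One deployment of a service (field names are hypothetical)."""
    cloud: str        # e.g. "aws" or "gcp"
    region: str
    environment: str  # e.g. "staging" or "production"


@dataclass
class Service:
    """A Switchboard-style service entry: a protocol plus basic metadata."""
    name: str
    protocol: str
    owner: str
    email_list: str
    description: str = ""
    clusters: list = field(default_factory=list)      # list of Cluster
    dependencies: set = field(default_factory=set)    # names of other services


def consumers_of(name, services):
    """Consumers are derived: every service that lists `name` as a dependency."""
    return {s.name for s in services.values() if name in s.dependencies}


services = {
    "feed": Service(
        "feed", "grpc", "feed-team", "feed@example.com",
        clusters=[Cluster("aws", "us-east-1", "production")],
        dependencies={"auth"},
    ),
    "auth": Service("auth", "grpc", "identity-team", "identity@example.com"),
}

auth_consumers = consumers_of("auth", services)
```

Keeping the model this small reflects the point made above: owners see services, clusters, and dependencies rather than the full proxy configuration.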

Configuration changes in Switchboard are saved in DynamoDB. Each Envoy proxy on the service mesh connects to an xDS control plane through a bidirectional gRPC stream. When the Envoy configuration is generated for a service, the control plane sends the updated config to a small subset of Envoy proxies and measures their health before committing the changes across the mesh.
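The staged rollout described above (push to a small subset, check health, then commit everywhere) can be sketched in a few lines of Python. This is a simplified illustration, not Snap's control plane: `staged_rollout`, the `push` and `is_healthy` callbacks, and the `canary_fraction` parameter are hypothetical, and a real system would observe health over a time window and roll the canaries back on failure.

```python
import random


def staged_rollout(proxies, push, is_healthy, canary_fraction=0.05, seed=None):
    """Push config to a small canary subset first; commit mesh-wide only if healthy.

    Returns True if the config reached every proxy, False if the rollout halted.
    """
    proxies = list(proxies)
    if not proxies:
        return True

    rng = random.Random(seed)
    canary_count = max(1, int(len(proxies) * canary_fraction))
    canaries = rng.sample(proxies, canary_count)

    for proxy in canaries:
        push(proxy)

    if not all(is_healthy(proxy) for proxy in canaries):
        # Halt: the remaining proxies keep their previous config.
        # (Rolling the canaries back is omitted from this sketch.)
        return False

    canary_set = set(canaries)
    for proxy in proxies:
        if proxy not in canary_set:
            push(proxy)
    return True


mesh = [f"envoy-{i}" for i in range(20)]

pushed_ok = []
committed = staged_rollout(mesh, pushed_ok.append, lambda p: True, 0.25, seed=1)

pushed_bad = []
halted = staged_rollout(mesh, pushed_bad.append, lambda p: False, 0.25, seed=1)
```

In the unhealthy case only the canary subset ever receives the new configuration, which limits the blast radius of a bad change.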

In addition, service owners can provision and manage Kubernetes clusters directly from Switchboard. They can also generate Spinnaker pipelines with canaries, health check endpoints, and zonal rollouts.

To keep the number of services exposed to the internet to a minimum, Snap designed a shared, internal, regional network for its microservices. Only an API gateway is exposed to the internet, so no external traffic source can communicate directly with the internal network.

This API gateway runs the same Envoy image as the microservices and connects to the same control plane. Custom Envoy filters handle Snapchat's authentication schemes, along with rate limiting and load shedding.
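The article does not say which algorithm Snap's filters use for rate limiting. As a generic illustration of the technique, the following Python sketch implements a token bucket, a common approach; the `TokenBucket` class and its parameters are hypothetical, and time is passed in explicitly so the behaviour is deterministic.

```python
class TokenBucket:
    """Generic token-bucket rate limiter (illustrative, not Snap's filter)."""

    def __init__(self, rate_per_sec, capacity, now=0.0):
        self.rate = rate_per_sec   # tokens added per second
        self.capacity = capacity   # maximum burst size
        self.tokens = capacity     # start with a full bucket
        self.last = now

    def allow(self, now):
        # Refill in proportion to elapsed time, capped at capacity.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = max(self.last, now)
        # Spend one token per request, or reject when the bucket is empty.
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


bucket = TokenBucket(rate_per_sec=1.0, capacity=2.0)
burst = [bucket.allow(0.0) for _ in range(3)]   # two pass, the third is rejected
later = bucket.allow(1.0)                        # one token refilled after 1 second
```

A gateway would typically keep one bucket per client or per route and answer rejected requests with an error, so excess load is shed before it reaches internal services.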

[Figure: the unified Snap service mesh architecture diagram]



Snap's service mesh is currently live in seven regions across AWS and Google Cloud, with 300+ production services live on the mesh.
