Service Mesh Ultimate Guide 2021 - Second Edition: Next Generation Microservices Development

Sep 09, 2021 28 min read

Key Takeaways

Learn about the emerging architecture trends in the adoption of service mesh technologies, especially the multi-cloud, multi-cluster, and multi-tenant models, how to deploy service mesh solutions in heterogeneous infrastructures (bare metal, VMs, and Kubernetes), and application/service connectivity from the edge computing layer to the mesh layer.
Learn about some of the new patterns in the service mesh ecosystem, like Multi-Cluster Service Mesh, Media Service Mesh and Chaos Mesh as well as classic microservices anti-patterns like the “Death Star” architecture.
Get an up-to-date summary of the recent innovations of using service mesh in the area of deployments, with rapid experimentations, chaos engineering and canary deployments between Pods (K8s clusters) and VMs (non-K8s clusters).
Explore innovations in the area of Service Mesh extensions including: enhanced identity management for securing microservices connectivity including custom certificate authority plugins, adaptive routing features for higher availability and scalability of the services, and enhancing sidecar proxies.
Learn about what's coming up in the operational aspects, such as configuring multi-cluster capabilities and connecting Kubernetes workloads to servers hosted on VM infrastructure, and the developer portals to manage all the features and API in multi-cluster service mesh installations.

In the last few years, service mesh technologies have come a long way. Service mesh plays a vital role in cloud native adoption by various organizations. By providing the four main types of capabilities—Connectivity, Reliability, Observability, and Security—service mesh has become a core component of IT organizations’ technology and infrastructure modernization efforts. Service mesh enables Dev and Ops teams to implement these capabilities at infrastructure level, so application teams don’t need to reinvent the wheel when it comes to the cross-cutting non-functional requirements.

Since the publication of the first edition of this article back in February of 2020, service mesh technologies have gone through significant innovations and several new architecture trends, technology capabilities, and service mesh projects have emerged in the ever evolving service mesh space.

The previous year has seen the service mesh products evolve to be much more than Kubernetes-only solutions where apps that are not hosted on Kubernetes platform couldn’t take advantage of the service mesh. Not all organizations have transitioned all their business and IT apps to the Kubernetes cloud platform. So, since the inception of service mesh there has been a need for this technology to work in diverse IT infrastructure environments.

With the growing adoption of microservice architectures, application systems have become decoupled and distributed in terms of cloud providers, infrastructure (Kubernetes, VMs, bare metal servers), geographies, and even types of workloads to be managed in service mesh integrated environments.

Let’s start off with some history of how service mesh came about.

Around 2016, the term “service mesh” appeared in the arenas of microservices, cloud computing, and DevOps. The Buoyant team used the term in 2016 to explain their product Linkerd. As with many concepts within computing, there is actually a long history of the associated pattern and technology.

The arrival of the service mesh has largely been due to a perfect storm within the IT landscape. Developers began building distributed systems using a multi-language (polyglot) approach, and needed dynamic service discovery. Operations began using ephemeral infrastructure, and wanted to gracefully handle the inevitable communication failures and enforce network policies. Platform teams began embracing container orchestration systems like Kubernetes, and wanted to dynamically route traffic in and around the system using modern API-driven network proxies, such as Envoy.

This article aims to answer pertinent questions for software architects and technical leaders, such as: What is a service mesh? Do I need a service mesh? How do I evaluate the different service mesh offerings?

The Service Mesh Pattern

The service mesh pattern focuses on managing all service-to-service communication within a distributed software system.

Context

The context for the pattern is twofold: First, that engineers have adopted the microservice architecture pattern, and are building their applications by composing multiple (ideally single-purpose and independently deployable) services together. Second, the organizations have embraced cloud native platform technologies such as containers (e.g., Docker), orchestrators (e.g., Kubernetes), and gateways.

Intent

The problems that the service mesh pattern attempts to solve include:

Eliminating the need to compile into individual services a language-specific communication library to handle service discovery, routing, and application-level (Layer 7) non-functional communication requirements.
Externalizing service communication configuration, including network locations of external services, security credentials, and quality of service targets.
Providing passive and active monitoring of other services.
Decentralizing the enforcement of policy throughout a distributed system.
Providing observability defaults and standardizing the collection of associated data.
- Enabling request logging
- Configuring distributed tracing
- Collecting metrics

Structure

The service mesh pattern primarily focuses on handling traditionally what has been referred to as “east-west” remote procedure call (RPC)-based traffic: request/response type communication that originates internally within a datacenter and travels service-to-service. This is in contrast to an API gateway or edge proxy, which is designed to handle “north-south” traffic: Communication that originates externally and ingresses to an endpoint or service within the datacenter.

Service Mesh Features

A service mesh implementation will typically offer one or more of the following features:

Normalizes naming and adds logical routing (e.g., maps the code-level name “user-service” to the platform-specific location “AWS-us-east-1a/prod/users/v4”)
Provides traffic shaping and traffic shifting
Maintains load balancing, typically with configurable algorithms
Provides service release control (e.g., canary releasing and traffic splitting)
Offers per-request routing (e.g., traffic shadowing, fault injection, and debug re-routing)
Adds baseline reliability, such as health checks, timeouts/deadlines, circuit breaking, and retry (budgets)
Increases security, via transparent mutual Transport Layer Security (TLS) and policies such as Access Control Lists (ACLs)
Provides additional observability and monitoring, such as top-line metrics (request volume, success rates, and latencies), support for distributed tracing, and the ability to “tap” and inspect real-time service-to-service communication
Enables platform teams to configure “sane defaults” to protect the system from bad communication
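
Circuit breaking, one of the baseline reliability features listed above, can be sketched in a few lines of Python. This is an illustrative toy, not the implementation used by any particular mesh; the class name, the `failure_threshold` and `reset_after_seconds` parameters, and the injectable `clock` are all assumptions made for the example.

```python
import time


class CircuitBreaker:
    """Toy circuit breaker: opens after consecutive failures, retries after a delay."""

    def __init__(self, failure_threshold=3, reset_after_seconds=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_seconds = reset_after_seconds
        self.clock = clock          # injectable so the behavior can be tested
        self.failures = 0           # consecutive failures seen so far
        self.opened_at = None       # time the circuit opened, or None if closed

    def allow(self):
        """Return True if a request may be sent to the upstream service."""
        if self.opened_at is None:
            return True
        # After the reset delay, let a trial request through ("half-open").
        return self.clock() - self.opened_at >= self.reset_after_seconds

    def record(self, success):
        """Record the outcome of a request that was allowed through."""
        if success:
            self.failures = 0
            self.opened_at = None
            return
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()
```

A proxy would call `allow()` before forwarding a request and `record()` once the outcome is known, so a failing upstream stops receiving traffic until the reset delay has passed.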

Service mesh capabilities can be categorized into four areas as listed below:

Connectivity
Reliability
Security
Observability

Let’s look at what features service mesh technologies can offer in each of these areas.

Connectivity:

Traffic Control (Routing, Splitting)
Gateway (Ingress, Egress)
Service Discovery
A/B Testing, Canary
Service Timeouts, Retries

Reliability:

Circuit Breaker
Fault Injection/Chaos Testing

Security:

Service-to-service authentication (mTLS)
Certificate Management
User Authentication (JWT)
User Authorization (RBAC)
Encryption

Observability:

Monitoring
Telemetry, Instrumentation, Metrics
Distributed Tracing
Service Graph

Service Mesh Architecture: Looking Under the Hood

A service mesh consists of two high-level components: a data plane and a control plane. Matt Klein, the creator of the Envoy Proxy, has written an excellent deep-dive into the topic of “service mesh data plane versus control plane.”

Broadly speaking, the data plane “does the work” and is responsible for “conditionally translating, forwarding, and observing every network packet that flows to and from a [network endpoint].” In modern systems, the data plane is typically implemented as a proxy (such as Envoy, HAProxy, or MOSN), which is run out-of-process alongside each service as a “sidecar.” Linkerd uses a micro-proxy approach that’s optimized for the service mesh sidecar use cases.

A control plane “supervises the work,” and takes all the individual instances of the data plane—a set of isolated stateless sidecar proxies—and turns them into a distributed system. The control plane doesn’t touch any packets/requests in the system, but instead, it allows a human operator to provide policy and configuration for all of the running data planes in the mesh. The control plane also enables the data plane telemetry to be collected and centralized, ready for consumption by an operator.

Combined, the control plane and data plane provide the best of both worlds: policies can be defined and managed centrally, yet enforced in a decentralized manner, locally in each pod on a Kubernetes cluster. The policies can relate to security, routing, circuit breaking, or monitoring.
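
The division of labor between the two planes can be illustrated with a small Python model. It is a sketch only and does not reflect how Istio or any other mesh is built; `Policy`, `SidecarProxy`, and `ControlPlane`, along with their fields and methods, are hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Policy:
    """Hypothetical policy record defined centrally by an operator."""
    timeout_seconds: float


@dataclass
class SidecarProxy:
    """Toy data plane instance: keeps a local policy copy and enforces it."""
    service: str
    policy: Optional[Policy] = None
    requests_seen: int = 0

    def handle(self, duration_seconds: float) -> str:
        # Enforcement is local; the control plane is not consulted per request.
        self.requests_seen += 1
        if self.policy and duration_seconds > self.policy.timeout_seconds:
            return "timeout"
        return "ok"


@dataclass
class ControlPlane:
    """Toy control plane: distributes policy and gathers telemetry."""
    proxies: List[SidecarProxy] = field(default_factory=list)

    def register(self, proxy: SidecarProxy) -> None:
        self.proxies.append(proxy)

    def push(self, policy: Policy) -> None:
        # Central definition, decentralized enforcement.
        for proxy in self.proxies:
            proxy.policy = policy

    def telemetry(self) -> Dict[str, int]:
        return {p.service: p.requests_seen for p in self.proxies}
```

Note that `ControlPlane` never sees a request: it only pushes configuration and reads counters, which mirrors the point that the control plane does not touch packets.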

The diagram below is taken from the Istio architecture documentation, and although the technologies labeled are specific to Istio, the components are general to all service mesh implementations.

Istio architecture, demonstrating how the control plane and proxy data plane interact (courtesy of the Istio documentation)

Use Cases

There are a variety of use cases that a service mesh can enable or support.

Dynamic Service Discovery and Routing

A service mesh provides dynamic service discovery and traffic management, including traffic shadowing (duplicating) for testing, and traffic splitting for canary releasing and A/B type experimentation.

Proxies used within a service mesh are typically “application layer” aware (operating at Layer 7 in the OSI networking stack). This means that traffic routing decisions and the labeling of metrics can draw upon data in HTTP headers or other application layer protocol metadata.
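
As a rough illustration of Layer 7 routing, the Python sketch below combines a header-based override with weighted traffic splitting for a canary release. The function name, the `x-debug-route` header, and the backend names are hypothetical, and hashing the request ID is just one possible way to assign requests to weights.

```python
import hashlib


def pick_backend(request_id, headers, weights):
    """Choose a backend for one request.

    A matching header wins (per-request routing); otherwise the request ID
    is hashed into a bucket from 0 to 99 and mapped onto percentage weights.
    """
    # Hypothetical debug header, not a standard one.
    forced = headers.get("x-debug-route")
    if forced in weights:
        return forced

    if sum(weights.values()) != 100:
        raise ValueError("weights must sum to 100")

    # A stable hash keeps a given request ID on the same backend.
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100

    threshold = 0
    for backend, weight in weights.items():
        threshold += weight
        if bucket < threshold:
            return backend
    raise AssertionError("unreachable when weights sum to 100")
```

For example, `pick_backend("req-1", {}, {"users-v1": 90, "users-v2": 10})` sends roughly one request in ten to the canary, while a request carrying `x-debug-route: users-v2` always reaches it.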

Service-to-Service Communication Reliability

A service mesh supports the implementation and enforcement of cross-cutting reliability requirements, such as request retries, timeouts, rate limiting, and circuit-breaking. A service mesh is often used to compensate (or encapsulate) dealing with the eight fallacies of distributed computing. It should be noted that a service mesh can only offer wire-level reliability support (such as retrying an HTTP request), and ultimately the service should be responsible for any related business impact such as avoiding multiple (non-idempotent) HTTP POST requests.
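
The caveat about non-idempotent requests can be made concrete with a small Python sketch of a wire-level retry helper. Retrying only idempotent methods is a policy chosen here for illustration, not the default of any specific mesh; `send_with_retries` and its parameters are hypothetical.

```python
# HTTP methods defined as idempotent by the HTTP semantics specification.
IDEMPOTENT_METHODS = {"GET", "HEAD", "PUT", "DELETE", "OPTIONS", "TRACE"}


def send_with_retries(method, send, max_attempts=3):
    """Call send() until it returns a non-5xx status or attempts run out.

    Non-idempotent methods such as POST are attempted only once, because a
    wire-level retry cannot know whether the first attempt took effect.
    Returns a (status, attempts_made) tuple.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    allowed = max_attempts if method.upper() in IDEMPOTENT_METHODS else 1
    status = None
    for attempt in range(1, allowed + 1):
        status = send()
        if status < 500:
            return status, attempt
    return status, allowed
```

Whether a failed POST should be repeated is a business decision, which is why that responsibility stays with the service rather than the mesh.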

Observability of Traffic

As a service mesh is on the critical path for every request being handled within the system, it can also provide additional “observability,” such as distributed tracing of a request, frequency of HTTP error codes, and global and service-to-service latency. Although a much overused phrase in the enterprise space, service meshes are often proposed as a method to capture all of the data necessary to implement a “single pane of glass” view of traffic flows within the entire system.
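
To show what such top-line metrics amount to, here is a minimal Python sketch that derives request volume, success rate, and latency percentiles from a list of request records. The record shape, the choice to count any status below 500 as a success, and the nearest-rank percentile method are assumptions made for the example.

```python
import math


def top_line_metrics(records):
    """Summarize (status_code, latency_ms) tuples observed by a proxy."""
    if not records:
        raise ValueError("no records to summarize")

    total = len(records)
    # Assumption: anything below 500 counts as a success.
    successes = sum(1 for status, _ in records if status < 500)
    latencies = sorted(latency for _, latency in records)

    def percentile(p):
        # Nearest-rank percentile: smallest value with at least p% at or below it.
        rank = max(1, math.ceil(p / 100 * total))
        return latencies[rank - 1]

    return {
        "request_volume": total,
        "success_rate": successes / total,
        "p50_ms": percentile(50),
        "p99_ms": percentile(99),
    }
```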

Communication Security

A service mesh also supports the implementation and enforcement of cross-cutting security requirements, such as providing service identity (via X.509 certificates), enabling application-level service/network segmentation (e.g., “service A” can communicate with “service B,” but not “service C”), ensuring all communication is encrypted (via TLS), and ensuring the presence of valid user-level identity tokens or “passports.”
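
The segmentation rule in the example above can be sketched as a deny-by-default check in Python. The `Request` fields and the `authorize` function are hypothetical, and the sketch only checks that a user token is present; a real mesh would also verify the certificate chain and validate the token.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Request:
    # Identity taken from the caller's verified certificate, if any.
    source_identity: Optional[str]
    destination: str
    user_token: Optional[str]


# "service A" may call "service B"; every other pair is denied by default.
ALLOWED_CALLS = frozenset({("service-a", "service-b")})


def authorize(request, allowed=ALLOWED_CALLS):
    """Return (decision, reason) for one service-to-service request."""
    if request.source_identity is None:
        return False, "no verified service identity"
    if (request.source_identity, request.destination) not in allowed:
        return False, "service pair not allowed"
    if not request.user_token:
        # Presence check only; token validation is out of scope here.
        return False, "missing user token"
    return True, "allowed"
```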

Antipatterns

It is often a sign of a maturing technology when antipatterns of usage emerge. Service meshes are no exception.

Too Many Traffic Management Layers (Turtles All the Way Down)

This antipattern occurs when developers do not coordinate with the platform or operations team, and duplicate existing communication handling logic in code that is now being implemented via a service mesh. For example, an application implementing a retry policy within the code in addition to a wire-level retry policy provided by the service mesh configuration. This antipattern can lead to issues such as duplicated transactions.
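
The cost of stacking retry layers is easy to underestimate, so the following Python sketch counts how many times a failing backend is called when an application-level retry loop wraps a mesh-level one. All names are hypothetical, and the backend is assumed to fail every time.

```python
def with_retries(call, attempts):
    """Wrap call so it is tried up to `attempts` times until it returns True."""
    def wrapped():
        for _ in range(attempts):
            if call():
                return True
        return False
    return wrapped


def count_backend_calls(app_attempts, mesh_attempts):
    """Count calls reaching a backend that always fails."""
    calls = 0

    def backend():
        nonlocal calls
        calls += 1
        return False  # simulate a persistent failure

    mesh_layer = with_retries(backend, mesh_attempts)    # retry policy in the mesh
    app_layer = with_retries(mesh_layer, app_attempts)   # retry loop duplicated in code
    app_layer()
    return calls
```

With three attempts configured in each layer, `count_backend_calls(3, 3)` returns 9: the layers multiply rather than add, which amplifies load on a service that is already struggling.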

Service Mesh Silver Bullet

There is no such thing as a “silver bullet” within IT, but vendors are sometimes tempted to anoint new technologies with this label. A service mesh will not solve all communication problems with microservices, container orchestrators like Kubernetes, or cloud networking. A service mesh aims to facilitate service-to-service (east-west) communication only, and there is a clear operational cost to deploying and running a service mesh.

Enterprise Service Bus (ESB) 2.0

During the pre-microservice service-oriented architecture (SOA) era, Enterprise Service Buses (ESBs) implemented a communication system between software components. Some fear that many of the mistakes from the ESB era will be repeated with the use of a service mesh.

The centralized control of communication offered via ESBs clearly had value. However, the development of the technologies was driven by vendors, which led to multiple problems, such as: a lack of interoperability between ESBs, bespoke extension of industry standards (e.g., adding vendor-specific configuration to WS-* compliant schema), and high cost. ESB vendors also did nothing to discourage the integration and tight-coupling of business logic into the communication bus.

Big Bang Deployment

There is a temptation within IT at large to believe that a big bang approach to deployment is the easiest approach to manage, but as research from Accelerate and the State of DevOps Report shows, this is not the case. Because a complete rollout of a service mesh puts this technology on the critical path for handling all end user requests, a big bang deployment is highly risky.

Death Star Architecture

When organizations adopt microservices architecture and development teams start creating new microservices or leverage existing services in their applications, the service-to-service communication becomes a critical part of the architecture. Without a good governance model, this can lead to a tight coupling between different services. It will also be difficult to pinpoint which service is having issues when the whole system is having problems in production.

Lacking a service communication strategy and governance model, the architecture becomes what’s called the “Death Star Architecture.”

For more information on this architecture anti-pattern, check out the articles Part1, Part2, and Part3 on cloud native architecture adoption.

Domain-Specific Service Meshes

Local implementation and over-optimization of service meshes can sometimes lead to too narrow of a scope of the service mesh deployment. Developers may prefer service mesh instances specific to their own business domains but this approach has more disadvantages than benefits. We don’t want to implement a too fine-grained scope of service mesh, like a dedicated service mesh for each business or functional domain in the organization (e.g., Finance, HR, Accounting, etc.). This defeats the purpose of having a common service orchestration solution like service mesh for capabilities such as enterprise level service discovery or cross-domain service routing.

Service Mesh Implementations and Products

The following is a non-exhaustive list of current service mesh implementations:

Linkerd (CNCF graduated project)
Istio
Consul
Kuma (CNCF sandbox project)
AWS App Mesh
NGINX Service Mesh
AspenMesh
Kong
Solo Gloo Mesh
Tetrate Service Bridge
Traefik Mesh (formerly Maesh)
Meshery
Open Service Mesh (CNCF sandbox project)

Also, other products like DataDog are starting to offer integrations with service mesh technologies like Linkerd, Istio, Consul Connect, and AWS App Mesh.

Service Mesh Comparisons: Which Service Mesh?

The service mesh space is extremely fast moving, and so any attempt to create a comparison is likely to quickly become out of date. However, several comparisons do exist. Care should be taken to understand the source’s bias (if any) and the date that the comparison was made.

https://layer5.io/landscape
https://kubedex.com/istio-vs-linkerd-vs-linkerd2-vs-consul/ (correct as of August 2021)
https://platform9.com/blog/kubernetes-service-mesh-a-comparison-of-istio-linkerd-and-consul/ (up to date as of October 2019)
https://servicemesh.es/ (last published August 2021)

InfoQ always recommends that service mesh adopters perform their own due diligence and experimentation on each offering.

Service Mesh Tutorials

For engineers or architects looking to experiment with multiple service meshes the following tutorials, playgrounds, and tools are available:

Layer 5 Meshery—a multi-service mesh management plane
Solo’s Gloo Mesh—a service mesh orchestration platform
KataCoda Istio tutorial
Consul service mesh tutorial
Linkerd tutorial
NGINX Service Mesh Tutorial

History of the Service Mesh

InfoQ has been tracking the topic that we now call service mesh since late 2013, when Airbnb released SmartStack, which offered an out-of-process service discovery mechanism (using HAProxy) for the emerging “microservices” style architecture. Many of the previously labeled “unicorn” organizations were working on similar technologies before this date. From the early 2000s, Google was developing its Stubby RPC framework that evolved into gRPC, and the Google Frontend (GFE) and Global Software Load Balancer (GSLB), traits of which can be seen in Istio. In the early 2010s, Twitter began work on the Scala-powered Finagle from which the Linkerd service mesh emerged.

In late 2014, Netflix released an entire suite of JVM-based utilities including Prana, a “sidecar” process that allowed application services written in any language to communicate via HTTP to standalone instances of the libraries. In 2016, the NGINX team began talking about “The Fabric Model,” which was very similar to a service mesh, but required the use of their commercial NGINX Plus product for implementation. Also, Linkerd v0.2 was announced in February 2016, though the team didn't start calling it a service mesh until later.

Other highlights from the history of the service mesh include the releases of Istio in May 2017, Linkerd 2.0 in July 2018, Consul Connect and Gloo Mesh in November 2018, Service Mesh Interface (SMI) in May 2019, and Maesh (now called Traefik Mesh) and Kuma in September 2019.

Even service meshes that emerged outside of the unicorns, such as HashiCorp’s Consul, took inspiration from the aforementioned technology, often aiming to implement the CoreOS coined concept of “GIFEE”; Google infrastructure for everyone else.

For a deep-dive into the history of how the modern service mesh concept evolved, Phil Calçado has written a comprehensive article “Pattern: Service Mesh.”

Service Mesh Standards

Even though the service mesh technologies have seen a major transformation year after year for the last few years, the standards on service mesh haven’t caught up with the innovations.

The main standard for using service mesh solutions is the Service Mesh Interface (SMI). The Service Mesh Interface is a specification for service meshes that run on Kubernetes. It doesn’t implement a service mesh itself but defines a common standard that can be implemented by a variety of service mesh providers.

The goal of the SMI API is to provide a common, portable set of Service Mesh APIs which a Kubernetes user can use in a provider agnostic manner. In this way, people can define applications that use Service Mesh technology without tightly binding to any specific implementation.

SMI is basically a collection of Kubernetes Custom Resource Definitions (CRD) and Extension API Servers. These APIs can be installed onto any Kubernetes cluster and manipulated using standard tools. To activate these APIs, an SMI provider is run in the Kubernetes cluster.

The SMI specification allows for both standardization for end users and innovation by providers of service mesh technology. SMI enables flexibility and interoperability, and covers the most common service mesh capabilities. Current specification components focus on the connectivity aspect of service mesh capabilities. The API specifications include the following:

Traffic Access Control
Traffic Metrics
Traffic Specs
Traffic Split
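
To give a feel for the Traffic Split API in the list above, the Python sketch below builds a resource shaped like an SMI `TrafficSplit` as a plain dictionary. The field layout follows the v1alpha2 version of the spec as best understood here and should be checked against the specification version in use; the helper function and the service names are hypothetical.

```python
def traffic_split(name, root_service, weights):
    """Build a TrafficSplit-shaped resource as a plain dict.

    `weights` maps each backend service name to a non-negative integer weight.
    """
    if not weights:
        raise ValueError("at least one backend is required")
    if any(not isinstance(w, int) or w < 0 for w in weights.values()):
        raise ValueError("weights must be non-negative integers")

    return {
        # Assumed API group and version; verify against the SMI spec in use.
        "apiVersion": "split.smi-spec.io/v1alpha2",
        "kind": "TrafficSplit",
        "metadata": {"name": name},
        "spec": {
            # The root service that clients address.
            "service": root_service,
            # The concrete services that share the traffic.
            "backends": [
                {"service": backend, "weight": weight}
                for backend, weight in weights.items()
            ],
        },
    }
```

Because the resource names services rather than mesh-specific objects, the same definition can in principle be applied to any mesh that implements this part of SMI.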

The current SMI ecosystem includes a wide range of service meshes, including Istio, Linkerd, Consul Connect, and Gloo Mesh.

The SMI specification is licensed under the Apache License Version 2.0.

If you want to learn more about SMI specification and its API details, check out the following links.

Core Specification (current version: 0.6.0)
Specification Github project
How to Contribute

Service Mesh Benchmarks

Service Mesh Performance is a standard for capturing the details of infrastructure capacity, service mesh configuration, and workload metadata. SMP specification is used to capture the following details:

Environment and infrastructure details
Number and size of nodes, orchestrator
Service mesh and its configuration
Workload/application details
Statistical analysis to characterize performance

William Morgan from the Linkerd team wrote about benchmarking the performance of Linkerd and Istio. There is also an article from 2019 about Istio best practices on benchmarking service mesh performance.

As with any other performance benchmark, you should not put too much weight on any of these external publications, especially those by product vendors. Design and execute your own performance testing in your own server environment to validate which specific product fits the business and non-functional requirements of your application.

Exploring the (Possible) Future of Service Meshes

Kasun Indrasiri has explored “The Potential for Using a Service Mesh for Event-Driven Messaging,” in which he discussed two main emerging architectural patterns for implementing messaging support within a service mesh: the protocol proxy sidecar, and the HTTP bridge sidecar. This is an active area of development in the service mesh community, with the work toward supporting Apache Kafka within Envoy attracting a fair amount of attention.

Christian Posta has previously written about attempts to standardize the usage of service meshes in “Towards a Unified, Standard API for Consolidating Service Meshes.” This article also discusses the Service Mesh Interface (SMI) that was announced in 2019 by Microsoft and partners at KubeCon EU. The SMI defines a set of common and portable APIs that aims to provide developers with interoperability across different service mesh technologies including Istio, Linkerd, and Consul Connect.

The topic of integrating service meshes with the platform fabric can be further divided into two sub-topics.

First, there is work being conducted to reduce the networking overhead introduced by a service mesh data plane. This includes the Data Plane Development Kit (DPDK), which is a userspace application that “bypasses the heavy layers of the Linux kernel networking stack and talks directly to the network hardware.” There is also a Linux-based BPF solution by the Cilium team that utilizes the extended Berkeley Packet Filter (eBPF) functionality in the Linux kernel for “very efficient networking, policy enforcement, and load balancing functionality.” Another team is mapping the concept of a service mesh to L2/L3 payloads with Network Service Mesh, as an attempt to “re-imagine network function virtualization (NFV) in a cloud-native way.”

Second, there are multiple initiatives to integrate service meshes more tightly with public cloud platforms, as seen in the introduction of AWS App Mesh, GCP Traffic Director, and Azure Service Fabric Mesh.

The Buoyant team is leading the charge with developing effective human-centric control planes for service mesh technology. They have recently released Buoyant Cloud, a SaaS-based “team control plane” for platform teams operating Kubernetes. This product is discussed in more detail in the section below.

There have also been several innovations in the service mesh area since last year. Let’s look at some of these innovations.

Multi-cloud, multi-cluster, multi-tenant service meshes

In recent years, cloud adoption by organizations has shifted from a single cloud solution (private or public) to a new infrastructure based on multi-cloud (private, public, and hybrid) supported by multiple vendors (AWS, Google, Microsoft Azure, and so on). Also, the need for supporting diverse workloads (transactional, batch, and streaming) is critical to realize a unified cloud architecture.

These business and non-functional requirements in turn lead to the need for deploying service mesh solutions in heterogeneous infrastructures (bare metal, VMs, and Kubernetes). The service mesh architecture needs to transform accordingly to support these diverse workloads and infrastructures.

Technologies like Kuma support the multi-mesh control plane to make the business applications work in multi-cluster and multi-cloud service mesh environments. These solutions abstract away the synchronization of service mesh policies across multiple zones and the service connectivity (and service discovery) across those zones.

Another emerging trend in multi-cluster service mesh technologies is the need for application/service connectivity from edge computing layer (IoT devices) to the mesh layer.

Media Service Mesh

Media Streaming Mesh or Media Service Mesh, developed at Cisco Systems, is used for orchestrating real-time applications like multi-player gaming, multi-party video-conferencing, or CCTV streaming using service mesh technologies on Kubernetes cloud platform. These applications are moving more and more away from monolithic applications to microservices architectures. A service mesh can help the applications by providing capabilities like load balancing, encryption, and observability.

Chaos Mesh

Chaos Mesh, a CNCF hosted project, is an open-source, cloud-native chaos engineering platform for applications hosted on Kubernetes. Though not a direct service mesh implementation, Chaos Mesh enables Chaos Engineering experiments by orchestrating fault injection behavior into the applications. Fault injection is a key capability of service mesh technologies.
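
Fault injection itself is a simple idea, as the following Python sketch suggests: wrap a request handler so that a configurable percentage of requests is aborted with an error. This is not the Chaos Mesh API or the fault injection API of any service mesh; `make_fault_injector`, its parameters, and the use of status 503 are assumptions made for illustration.

```python
import random


def make_fault_injector(abort_percent, rng):
    """Return a decorator that aborts roughly `abort_percent` of requests.

    `rng` is a random.Random instance, passed in so experiments can be
    seeded and repeated.
    """
    if not 0 <= abort_percent <= 100:
        raise ValueError("abort_percent must be between 0 and 100")

    def inject(handler):
        def wrapped(request):
            # rng.random() is in [0.0, 1.0), so 0 never aborts and 100 always does.
            if rng.random() * 100 < abort_percent:
                return 503  # injected failure
            return handler(request)
        return wrapped

    return inject
```

A chaos experiment would apply the injector to a small share of traffic and then observe whether callers degrade gracefully through timeouts, retries, and circuit breakers.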

Chaos Mesh hides the underlying implementation details so the application developers can focus on the actual chaos experiments. Chaos Mesh can be used along with a service mesh. Check out this use case on how the team used Linkerd and Chaos Mesh to conduct chaos experiments for their project.

Service Mesh as a Service

Some service mesh vendors, like Buoyant, are offering managed service mesh or “service mesh as a service” solutions. Earlier this year, Buoyant announced the public beta release of a SaaS application called Buoyant Cloud that allows customer organizations to take advantage of a managed service mesh with on-demand support features for the Linkerd service mesh.

Some of the features offered by the Buoyant Cloud solution include the following:

Automatic tracking of Linkerd data plane and control plane health
Managing service mesh lifecycles and versions across pods, proxies, and clusters on Kubernetes platform
SRE-focused tools including service level objectives (SLOs), workload golden metric tracking, and change tracking

Network Service Mesh (NSM)

Network Service Mesh (NSM), another Cloud Native Computing Foundation sandbox project, provides a hybrid, multi-cloud IP service mesh. NSM enables capabilities such as network service connectivity, security, and observability which are core features of a service mesh. NSM works with existing Container Network Interface (CNI) implementations.

Service Mesh Extensions

Service mesh extensions are another area that has seen a lot of innovation. Some of the developments in service mesh extensions include:

enhanced identity management for securing microservices connectivity including custom certificate authority plugins
adaptive routing features for higher availability and scalability of the services
enhancing sidecar proxies

Service Mesh Operations

Another important area of service mesh adoption is in the operations side of the service mesh lifecycle. The operational aspects—such as configuring multi-cluster capabilities and connecting Kubernetes workloads to servers hosted on VM infrastructure, and the developer portals to manage all the features and API in multi-cluster service mesh installations—are going to play a significant role in the overall deployment and support of service mesh solutions in production.

FAQ

What is a service mesh?

A service mesh is a technology that manages all service-to-service (east-west) traffic within a distributed (potentially microservice-based) software system. It provides both business-focused functional operations, such as routing, and nonfunctional support, for example, enforcing security policies, quality of service, and rate limiting. It is typically (although not exclusively) implemented using sidecar proxies through which all services communicate.

How does a service mesh differ from an API gateway?

For service mesh definition, see above.

On the other hand, an API gateway manages all ingress (north-south) traffic into a cluster, and provides additional support for cross-functional communication requirements. It acts as the single entry point into a system and enables multiple APIs or services to act cohesively and provide a uniform experience to the user.

If I am deploying microservices, do I need a service mesh?

Not necessarily. A service mesh adds operational complexity to the technology stack and therefore is typically only deployed if the organization is having trouble scaling service-to-service communication, or has a specific use case to resolve.

Do I need a service mesh to implement service discovery with microservices?

No. A service mesh provides one way of implementing service discovery. Other solutions include language-specific libraries (such as Ribbon and Eureka or Finagle).

Does a service mesh add overhead/latency to my service-to-service communication?

Yes, a service mesh adds at least two extra network hops when a service is communicating with another service (the first is from the proxy handling the source’s outbound connection, and the second is from the proxy handling the destination’s inbound connection). However, these additional hops typically occur over the localhost or loopback network interface and add only a small amount of latency (on the order of milliseconds). Experimenting with and understanding whether this is an issue for the target use case should be part of the analysis and evaluation of a service mesh.
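
A back-of-the-envelope estimate can help frame that evaluation. The Python sketch below multiplies the number of sequential service-to-service calls in a request path by the two proxies each call passes through. The function and the per-proxy cost are hypothetical; real overhead depends on the proxy, payload size, and configuration, and should be measured.

```python
def added_latency_ms(calls_in_chain, per_proxy_ms, proxies_per_call=2):
    """Estimate latency added by sidecars along a chain of sequential calls.

    Assumes a fixed cost per proxy traversal that covers both the request
    and the response, and that each call passes through two proxies.
    """
    if calls_in_chain < 0 or per_proxy_ms < 0:
        raise ValueError("inputs must be non-negative")
    return calls_in_chain * proxies_per_call * per_proxy_ms
```

Under these assumptions, a user request that fans through five sequential calls with a per-proxy cost of 1 ms gains about 10 ms, which may or may not matter depending on the latency budget of the use case.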

Shouldn’t a service mesh be part of Kubernetes or the “cloud native platform” that applications are being deployed onto?

Potentially. There is an argument for maintaining separation of concerns within cloud native platform components (e.g., Kubernetes is responsible for providing container orchestration and a service mesh is responsible for service-to-service communication). However, work is underway to push service mesh-like functionality into modern Platform-as-a-Service (PaaS) offerings.

How do I implement, deploy, or roll out a service mesh?

The best approach would be to analyze the various service mesh products (see above), and follow the implementation guidelines specific to the chosen mesh. In general, it is best to work with all stakeholders and incrementally deploy any new technology into production.

Can I build my own service mesh?

Yes, but the more pertinent question is, should you? Is building a service mesh a core competency of your organization? Could you be providing value to your customers in a more effective way? Are you also committed to maintaining your own mesh, patching it for security issues, and constantly updating it to take advantage of new technologies? With the range of open source and commercial service mesh offerings that are now available, it is most likely more effective to use an existing solution.

Which team owns the service mesh within a software delivery organization?

Typically the platform or operations team owns the service mesh, along with Kubernetes and the continuous delivery pipeline infrastructure. However, developers will be configuring the service mesh properties, and so both teams should work closely together. Many organizations are following the lead of the cloud vanguard, such as Netflix, Spotify, and Google, and are creating internal platform teams that provide tooling and services to full-cycle, product-focused development teams.

Is Envoy a service mesh?

No. Envoy is a cloud native proxy that was originally designed and built by the Lyft team. Envoy is often used as the data plane within a service mesh. However, Envoy only becomes part of a service mesh when it is used in conjunction with a control plane. The control plane can be as simple as a centralized config file repository and metric collector, or as comprehensive and complex as Istio.
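
The simple end of that spectrum can be sketched in a few lines of Python. The example below models a control plane that is only a versioned config store plus a metric collector, with proxies that pull config and push request counts. It is an illustrative model, not Envoy's or Istio's actual API: the ControlPlane and DataPlaneProxy classes, their methods, and the addresses are all hypothetical.

```python
class ControlPlane:
    """Hypothetical minimal control plane: config store + metric collector."""

    def __init__(self):
        self._version = 0
        self._routes = {}
        self.request_counts = {}  # proxy id -> total requests reported

    def publish_routes(self, routes):
        """Operator action: replace the routing table for every proxy."""
        self._version += 1
        self._routes = dict(routes)

    def fetch_config(self, known_version):
        """Return (version, routes) if newer than known_version, else None."""
        if known_version >= self._version:
            return None
        return self._version, dict(self._routes)

    def report(self, proxy_id, requests):
        """Accumulate request counts pushed by a proxy."""
        self.request_counts[proxy_id] = self.request_counts.get(proxy_id, 0) + requests


class DataPlaneProxy:
    """Hypothetical proxy that pulls config from and reports to the control plane."""

    def __init__(self, proxy_id, control_plane):
        self.proxy_id = proxy_id
        self._control_plane = control_plane
        self._version = 0
        self._routes = {}
        self._handled = 0

    def sync(self):
        """Pull newer config if there is any, then push metrics since last sync."""
        update = self._control_plane.fetch_config(self._version)
        if update is not None:
            self._version, self._routes = update
        self._control_plane.report(self.proxy_id, self._handled)
        self._handled = 0

    def route(self, service):
        """Resolve a service name using the proxy's current copy of the config."""
        self._handled += 1
        return self._routes[service]
```

Each proxy keeps routing with its last known config until its next sync, which is why the individual proxies plus the control plane behave as one distributed system rather than a set of unrelated processes.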

Can the words “Istio” and “service mesh” be used interchangeably?

No. Istio is a type of service mesh. Due to the popularity of Istio when the service mesh category was emerging, some sources were conflating Istio and service mesh. This issue of conflation is not unique to service mesh—the same challenge occurred with Docker and container technology.

Which service mesh should I use?

There is no single answer to this question. Engineers must understand their current requirements, and the skills, resources, and time available for their implementation team. The service mesh comparison links above will provide a good starting point for exploration, but we strongly recommend that organizations experiment with at least two meshes in order to understand which products, technologies, and workflows work best for them.

Can I use a service mesh outside of Kubernetes?

Yes. Many service meshes allow the installation and management of data plane proxies and the associated control plane on a variety of infrastructure. HashiCorp’s Consul is the best-known example of this, and Istio is also being used experimentally with Cloud Foundry.

Additional Resources

Glossary

API gateway: Manages all ingress (north-south) traffic into a cluster, and provides additional support for cross-functional communication requirements. It acts as the single entry point into a system and enables multiple APIs or services to act cohesively and provide a uniform experience to the user.

Consul: A Go-based service mesh from HashiCorp.

Containerization: A container is a standard unit of software that packages up code and all its dependencies so the application runs quickly and reliably from one computing environment to another.

Control plane: Takes all the individual instances of the data plane (proxies) and turns them into a distributed system that can be visualized and controlled by an operator.

Circuit breaker: Handles faults or timeouts when connecting to a remote service. Helps to improve the stability and resiliency of an application.

Data plane: A proxy that conditionally translates, forwards, and observes every network packet that flows to and from a service network endpoint.

Docker: A Docker container image is a lightweight, standalone, executable package of software that includes everything needed to run an application: code, runtime, system tools, system libraries and settings.

East-West traffic: Network traffic within a data center, network, or Kubernetes cluster. Traditional network diagrams were drawn with the service-to-service (intra-data center) traffic flowing from left to right (west to east) in the diagrams.

Envoy Proxy: An open-source edge and service proxy, designed for cloud-native applications. Envoy is often used as the data plane within a service mesh implementation.

Ingress traffic: Network traffic that originates from outside the data center, network, or Kubernetes cluster.

Istio: C++ (data plane) and Go (control plane)-based service mesh that was originally created by Google and IBM in partnership with the Envoy team from Lyft.

Kubernetes: A CNCF-hosted container orchestration and scheduling framework that originated from Google.

Kuma: A Go-based service mesh from Kong.

Linkerd: A Rust (data plane) and Go (control plane) powered service mesh that was derived from an early JVM-based communication framework at Twitter.

Maesh: A Go-based service mesh from Containous, the maintainers of the Traefik API gateway.

MOSN: A Go-based proxy from the Ant Financial team that implements the (Envoy) xDS APIs.

North-South traffic: Network traffic entering (or ingressing) into a data center, network, or Kubernetes cluster. Traditional network diagrams were drawn with the ingress traffic entering the data center at the top of the page and flowing down (north to south) into the network.

Proxy: A software system that acts as an intermediary between endpoint components.

Segmentation: Dividing a network or cluster into multiple sub-networks.

Service mesh: Manages all service-to-service (east-west) traffic within a distributed (potentially microservice-based) software system. It provides both functional operations, such as routing, and nonfunctional support, for example, enforcing security policies, quality of service, and rate limiting.

Service Mesh Interface (SMI): A standard interface for service meshes deployed onto Kubernetes.

Service mesh policy: A specification of how a collection of services/endpoints are allowed to communicate with each other and other network endpoints.

Sidecar: A deployment pattern, in which an additional process, service, or container is deployed alongside an existing service (think motorcycle sidecar).

Single pane of glass: A UI or management console that presents data from multiple sources in a unified display.

Traffic shaping: Modifying the flow of traffic across a network, for example, rate limiting or load shedding.

Traffic shifting: Migrating traffic from one location to another.

Traffic Split: Allows users to incrementally direct percentages of traffic between various services. Used by clients such as ingress controllers or service mesh sidecars to split the outgoing traffic to different destinations.
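
As an illustration of the Traffic Split entry, the following Python sketch picks a destination in proportion to configured weights, which is one simple way a sidecar or ingress controller could split outgoing traffic. It is a sketch under stated assumptions rather than the SMI specification or any mesh's implementation: the pick_backend function, the backend names, and the 90/10 weights are hypothetical.

```python
import random


def pick_backend(weights, rng=random.random):
    """Choose a backend with probability proportional to its weight.

    weights: dict mapping backend name -> non-negative weight.
    rng:     function returning a float in [0, 1); injectable for testing.
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one backend needs a positive weight")

    threshold = rng() * total
    cumulative = 0
    for backend, weight in weights.items():
        cumulative += weight
        if threshold < cumulative:
            return backend
    # Only reachable through floating-point rounding at the upper edge.
    return backend
```

Calling pick_backend({"reviews-v1": 90, "reviews-v2": 10}) for each request would send roughly 90 percent of traffic to the first backend and 10 percent to the second. A canary rollout then consists of adjusting the weights step by step.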


About the Author

Srini Penchikala is a senior IT architect based out of Austin, Texas. He has over 25 years of experience in software architecture, design, and development, and has a current focus on cloud-native architectures, microservices and service mesh, cloud data pipelines, and continuous delivery. Penchikala wrote Big-Data Processing with Apache Spark and co-wrote Spring Roo in Action, from Manning. He is a frequent conference speaker, is a big-data trainer, and has published several articles on various technical websites.
