The Top Five Challenges of Running a Service Mesh in an Enterprise

Summary

Christian Posta takes a look at some of the common challenges organizations face when adopting service mesh and how to overcome them.

Bio

Christian Posta is Global Field CTO at Solo.io, former Chief Architect at Red Hat, and well known in the community for being an author (Istio in Action, Manning; Istio Service Mesh, O'Reilly, 2018; Microservices for Java Developers, O'Reilly, 2016), frequent blogger, speaker, open-source enthusiast, and committer on various open-source projects including Istio, Kubernetes, and many others.

About the conference

InfoQ Live is a virtual event designed for you, the modern software practitioner. Take part in facilitated sessions with world-class practitioners. Hear from software leaders at our optional InfoQ Roundtables.

Transcript

Posta: I'm going to be talking about the top five challenges of running a service mesh in an enterprise. My name is Christian Posta. I'm a VP and Field CTO here at Solo.io. At Solo, we focus on application networking: connecting, securing, and observing various services across clouds, clusters, regions, zones, and so on. I've been here almost three years, and I've been focused on integration, distributed systems, and building cloud architectures for quite some time now. I'm the author of a few books, including a book that will be released in the next few months called "Istio in Action" from Manning.

What Solo Offers

At Solo, we're working on application networking, and doing that in a way that better fits cloud architectures and modern architectures. We have two main products: Gloo Edge, which is our modern API gateway, and Gloo Mesh, which is our enterprise service mesh. We've seen in recent analyst reports how our vision and our products compare to our competitors and start to outperform them. Service mesh is still a relatively new, evolving, and maturing space, and you can see in one of the first analyst reports that we perform very strongly. As of last week, we made public that we raised our Series C at a billion dollar valuation. This is a testament to our customers, the work we're doing with the industry, working with organizations deploying Envoy-based technology at massive scale, including service mesh and Istio, and the tremendous success that we've had.

Service Connectivity

This is an area of very deep interest for us at Solo. We are service mesh experts. We are application networking experts, through and through. Let's take a look at some of the challenges that we've seen organizations face as they start to adopt this technology, or the next phase of modernizing their application infrastructure. The talk is about running a service mesh and some of the challenges of adopting it, but we might want to take a step back and understand: what is the problem that we're trying to solve with a mesh? It's partially around service connectivity; I think that's a big tactical part of why we're talking about service mesh. Then, operationally, how can we apply policies about how services communicate with each other, independent of how they were written or how they're deployed? The service mesh brings a lot of value in being able to do that. There are two main areas: one is the technical side, and the other is more operational, an efficient, principled way of building and managing these types of infrastructures in the cloud.

Service connectivity involves things that we typically worry about: service discovery, load balancing, timeouts, retries, circuit breaking, transport security, observability. These are not optional things to solve; we need to solve them somehow. In the past, we saw how the cloud generation that came even before enterprises started getting in, companies like Amazon and Netflix, started to do it: they built these things into the applications themselves. They tried to govern the teams so that they used specific libraries and particular pieces of infrastructure to implement the patterns correctly, but across languages and across different frameworks. Especially in an organization that has been around for 50 years and is very successful, and is now trying to modernize, those types of things get very difficult to operate.

Centralized Approach

Another approach that we saw in large organizations is to stand up some centralized team and system through which all the traffic in the company should flow, whether that's an enterprise service bus or some centralized API management server. The decentralized approach scaled fine but came with other challenges; this centralized approach does not scale well and has bottleneck challenges, whether in the technology, obviously, or more importantly in the processes that build up around these teams. The team that supports service A, how do they go about making changes? They go to some centralized team and say, can you do this for me? Tickets get opened, and things slow down.

Decentralized Communication between Services

What we want is some balance in between, and that's where the service mesh and its surrounding and supporting components come into play. It's the decentralization, as well as the self-service and automation around the policies enforced in the mesh, that allow an organization to move faster. That's the whole point of these modern architectures and of adopting things like containers, Kubernetes, and CI/CD: to move faster. The last missing piece of that is how services communicate with each other, so decentralizing that communication is very important. The service mesh brings a pattern that is based on offloading some of the application networking concerns to something else, an agent, a proxy. A lot of the service meshes have implemented the proxy with a technology called Envoy. In this model, when the application talks to the network, it talks through this proxy first. When a service communicates with another service, the inbound traffic goes through this proxy first, and then to the app. These proxies are co-located with the app instance; they are not centralized reverse proxies. They're actually one to one with the application instance, and become atomically associated with the application.

In these proxies, we can implement things consistently, regardless of what language and framework was used to write the app. We can consistently apply application networking implementations and policies around things like timeouts, retries, service discovery, circuit breaking, and load balancing, and not have to worry about what languages and frameworks the applications were written in. We can control and manage this stuff remotely. That's where the control plane comes into the picture: the control plane configures the individual proxies that live with each of these instances.
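
To make this concrete, here's a minimal sketch of what such a remotely managed policy can look like in Istio: a VirtualService that configures timeouts and retries for a hypothetical service named recommendation (the name and values are illustrative, not from the talk).

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: recommendation
spec:
  hosts:
  - recommendation
  http:
  - route:
    - destination:
        host: recommendation
    # Enforced by the sidecar proxies; no application code changes needed.
    timeout: 3s
    retries:
      attempts: 3
      perTryTimeout: 1s
      retryOn: 5xx,connect-failure

The control plane pushes this configuration to every affected sidecar, which is what makes the behavior consistent across languages and frameworks.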

Do You Need a Service Mesh?

This sounds good on one hand. The architecture fits what we're trying to do by modernizing. The benefits of doing this, decoupling some of these policies, automating them away, removing the burden from developers, are actually a very big win and bring a lot of value. Do you need a service mesh? We work with organizations all the time that are going through this process of deciding whether they need a service mesh, because maybe they don't. That's going to be very context specific; you will have to decide that. Are you building a microservices-style architecture? Do you have tens or hundreds of different applications or services that you are deploying? Are you allowing teams to "use the right tool for the job," or the right language for that particular team's skill set? Microservices in general are complicated. The more moving pieces you have, the more you put out over the network, and the less you have governed and centralized, the bigger the mess it could create. This is where a service mesh, at least on the application networking side, can help establish some consistency and some knowns about the system, especially when things start to go wrong.

Are you using things like containers and Kubernetes, or cloud infrastructure that scales elastically? Are you communicating more over the network, RPC style? Some of it may be async or messaging based, but typically gRPC, REST, GraphQL, SOAP, and these types of RPC-style communications are where a service mesh brings its biggest value. Then, of course, there's deploying a large set of services, where you need consistency in the policies about how services communicate with each other. If you have an environment that resembles this, or will tend toward this, then a service mesh might be a good solution or a good thing to start looking at.

Challenges of Running a Service Mesh at Scale in an Enterprise - Which Service Mesh to Pick

Let's take a look at some of the challenges of running a service mesh at scale in an enterprise. The first one starts with which service mesh to pick. Over the last three or four years, a lot of different vendors have entered this ecosystem with different options for which service mesh to pick. There are various technologies being used, different architectures, and different strengths or initial focal points in what a particular mesh might be trying to solve. For the most part, a large percentage of the meshes are converging on Envoy proxy, a technology which brings a lot of different features for doing service-to-service communication. Some meshes are opting to either reuse technology they already had, or to rebuild something completely new from scratch.

You also want to evaluate: what are your use cases? What is the maturity of the battle-tested implementations that could support your use cases? We've noticed at Solo, after observing this market for the last four years, that Istio has become one of the more mature deployments of a service mesh, especially at scale, and especially at enterprises with a lot of those unique edge cases that some of the other meshes are still trying to catch up on and figure out how to solve, or haven't even seen yet. Where do you go for help? Deploying this technology requires a deep level of expertise. Where can you go for help, or who can you partner with, either in an open source community or with a vendor of your choice? We see that the community itself has converged around Istio, while some of the others are supported by single vendors. Some don't have any real commercial support at all.

How Does a Mesh Fit In With Existing API Gateway Technology?

The next question is, how do API gateways fit into this mix? Some meshes have a gateway implementation; Istio is a good example, with its ingress gateway. You start to bring that in and say, I'll just use the gateway. Then you realize, I have a need for things like integrating with OIDC, or web application firewalling, or I need to do message transformations, or more sophisticated rate limiting or quota policy enforcement. Then you start to build all this stuff yourself. The question becomes: what is the role of the API gateway? How does it fit in with a service mesh? Am I supposed to go build all this stuff myself? There are solutions. You can stand up one of your existing API gateways in front of your services and try to route traffic into the service mesh. You could offload some of the API gateway capabilities to the applications themselves, although that seems like a step backwards. Or you could use an API gateway that is native to Istio or to a particular service mesh. By native, I mean it's actually built on the proxy that Istio is built on, and provides the various capabilities that you will likely need at the edge: things like transformation, data loss prevention, web application firewalling, maybe even SOAP translation. These are things that gateways can do today, but now you can pull them in natively to the service mesh instead of hacking together something around your existing gateways that you're going to have to maintain.
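
As a rough sketch of that "native" idea: the Istio ingress gateway is configured with the same APIs as the rest of the mesh, so edge routing and mesh routing share one model. The host name and certificate secret below are hypothetical.

apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  name: web-api-gateway
  namespace: istio-system
spec:
  selector:
    istio: ingressgateway    # binds to the default Istio ingress deployment
  servers:
  - port:
      number: 443
      name: https
      protocol: HTTPS
    tls:
      mode: SIMPLE
      credentialName: web-api-cert    # hypothetical TLS secret
    hosts:
    - "web-api.example.com"           # hypothetical edge hostname

Plain Istio gives you the routing and TLS shown here; the WAF, transformation, and OIDC capabilities mentioned above are what a gateway product layers on top of this same proxy.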

Global Service Routing, Failover, and Continuity across Infrastructure

Then there's the point around deploying across multiple clusters: how services find each other across multiple clusters, and how to build a highly available service-to-service communication fabric. Basically, the approach in the past came down to setting up a bunch of hardware load balancers; when you curl the load balancer, it spreads the request across a pool of services. That incurs additional expense and additional hops in the network. Instead of forcing everything out of a cluster, back through a gateway, into the cluster, and then back out again when services need to talk with each other, it might be more worthwhile to give services the ability to talk directly to each other when it makes sense, and to be smart enough to know how to route and how to fail over without having to rely on external and expensive load balancers. Something more like this, where traffic from app A can go to app B, even across failure domains, without having to cross back through some centralized load balancer.
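
In Istio terms, a minimal sketch of that direct cross-cluster failover is a DestinationRule with locality-based failover. The region names and the global host are hypothetical, and outlier detection is required so the mesh knows when to eject failing endpoints and trigger the failover.

apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: purchase-history-failover
spec:
  host: purchase-history.global    # hypothetical topology-agnostic name
  trafficPolicy:
    loadBalancer:
      localityLbSetting:
        enabled: true
        failover:
        - from: us-west    # prefer endpoints in the local region...
          to: us-east      # ...fail over to the other region when unhealthy
    outlierDetection:      # required for locality failover to take effect
      consecutive5xxErrors: 3
      interval: 10s
      baseEjectionTime: 30s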

Workload Certificate Management Integration with Existing PKI

Another really big topic when it comes to adopting and operating a service mesh in these organizations is how we solve the certificate management problems when we're talking about enabling mutual TLS. That's one of the benefits of using a service mesh at layer 7, where applications are communicating with each other: we can assign identity to the applications, encode that identity in the transport using certificates, and then apply policies to those identities, policies about whether A is allowed to talk with B. The underpinnings of an implementation like this right now depend on certificates. You will likely want to tie that back into your own PKI, Public Key Infrastructure. You might have Vault, or you might be using one of the cloud CAs or something like that, and you need to do this safely, because you don't want to start handing out intermediate signing CAs, or root CAs, or any of this stuff into your infrastructure without keeping things extremely secure. This is definitely an area where, first of all, you don't write things to disk, you don't put things in secrets, you keep things in memory, and you offload root CA handling to some offline hardware management. These are all practices compatible with deploying a service mesh. Getting this right is extremely important.
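
For reference, this is roughly what enforcing those identity-based policies looks like in Istio: a mesh-wide strict mTLS policy, plus an authorization rule that only lets the web-api service account call the recommendation workload (the names are illustrative). The PKI integration itself, plugging Vault or a cloud CA in as the signing authority, happens at the control-plane level and isn't shown here.

apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system    # root namespace = mesh-wide policy
spec:
  mtls:
    mode: STRICT             # reject plaintext traffic between workloads
---
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: allow-web-api
  namespace: default
spec:
  selector:
    matchLabels:
      app: recommendation
  action: ALLOW
  rules:
  - from:
    - source:
        # SPIFFE-style identity carried in the workload's certificate
        principals: ["cluster.local/ns/default/sa/web-api"]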

Extending the Service Mesh

One of the last pieces that I'll cover is extending the service mesh. The proxies are on the request path in the mesh, and they can be configured with the control plane. The proxies have a set of capabilities coded into them that you can use or not use. You might need a customization to fit what your organization is already doing. Typically, what we've seen here at Solo is that those customizations are around security, like when you're trying to retrofit the mesh into a brownfield environment where services are already communicating with each other. They have some existing security protocol; maybe they're parsing some token or signature that you have to verify, maybe it's not using JWT or some accepted practice, maybe it was built 10 years ago, and you need to be backward compatible with that.

At Solo, one of the things we've been excited about for a while now, and have seen adoption of, is using WebAssembly to extend the capabilities of the proxy, and to do that dynamically. You can write your security plugin in WebAssembly, inject it into the proxies where it makes sense, for the applications that care about it, and dynamically alter the behavior of the mesh. This is an extremely important and versatile way of getting that last 10%, that last-mile fit for your organization's use cases, by making the customizations yourself, without having to dig into Envoy proxy and C++, managing a build of Envoy, and basically forking Envoy and maintaining your own build. You can use WebAssembly to do that, which is pretty powerful.
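
Recent Istio versions expose a first-class API for this. Here's a sketch of attaching a hypothetical custom-auth module; the OCI image URL and plugin config are made up for illustration.

apiVersion: extensions.istio.io/v1alpha1
kind: WasmPlugin
metadata:
  name: legacy-token-check
  namespace: default
spec:
  selector:
    matchLabels:
      app: web-api           # only inject where this check is needed
  url: oci://registry.example.com/plugins/legacy-token-check:v1
  phase: AUTHN               # run during the authentication phase
  pluginConfig:
    header: x-legacy-token   # hypothetical plugin-specific setting

On older Istio versions, the same effect is typically achieved with lower-level EnvoyFilter resources.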

Integrating VMs into the Mesh

There's actually one more really important concept that comes up when deploying a service mesh into the enterprise: a lot of the meshes run nicely on Kubernetes, that is, with containerized workloads, but a lot of enterprise workloads actually run on VMs, and we need a way to integrate those VMs into the mesh, either by deploying the sidecar or by using gateways to integrate the VMs into the service mesh. Different mesh providers have different levels of support for this. Things like Consul from HashiCorp started off in that generation of technology on the VMs, and are slowly trying to inch into Kubernetes. You have people like Linkerd who are not doing anything with VMs. Then you have Istio, which is in the middle: it was Kubernetes first, but not Kubernetes only, and offers support for integrating VMs as first-class citizens into the mesh. That's a very important piece of the puzzle when adopting a service mesh.
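
In Istio's case, a VM is modeled with a WorkloadEntry, which makes the VM look like just another pod to the rest of the mesh; a sketch with hypothetical values:

apiVersion: networking.istio.io/v1beta1
kind: WorkloadEntry
metadata:
  name: purchase-history-vm-1
  namespace: default
spec:
  address: 10.0.0.12          # hypothetical VM IP
  labels:
    app: purchase-history     # lets existing Services select the VM
  serviceAccount: purchase-history
  network: vm-network         # used to route through gateways across networks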

Demo

Let's go back to a big use case. I like to say that people don't want a service mesh, but they have needs around consistent policies, failover, high availability, compliance requirements, and so on, and a service mesh can be used to solve those. Let's take a look at an example and a demo of an architecture that illustrates, or mimics, something you might want to do in a multi-cluster setup. In this case, we have two different clusters: one in the West region, one in the East region. We also have a third cluster at the bottom that runs our Gloo Mesh management controllers, which automate the configuration federation across multiple clusters.

Then on the top part of the diagram, we have another cluster with API gateways that are built on Istio. Basically, it's another multi-cluster Istio scenario, but the API gateways can do things like rate limiting, web application firewalling, request transformation, and invoking AWS Lambdas directly. Traffic flows through them into the clusters. Once traffic is in the clusters and in the mesh, we can apply these failover policies. Everything looks transparent when failing over and maintaining continuity, whether you're calling from outside the mesh or you're a client inside the mesh. Let's go and take a quick look at that.

Here, what we're going to see is that we are outside the mesh, calling from my laptop with a curl command to a set of services, and the call will go through the gateway. The name will get resolved using external DNS; nothing's super special about this. The curl will go through the gateways into an app called Web API, which then curls recommendation, which then curls purchase history. There's a graph of curls, a sequence of curls here. If things stop working in this cluster, then we should be able to fail over directly between clusters, without going back through the API gateways. Let's see how that works.

In this particular curl, it looks like DNS routed me to cluster 2. That's fine. Curl it a couple more times; looks like I got cluster 2 again. Now I got cluster 1. Curl it a few more times, and we see that externally we land on the cluster that's actually closest to me, which happens to be the West cluster, cluster 1. One thing we'll notice is that we are calling global names, whether externally here from my laptop, or internally within the service mesh. We don't want to pin ourselves to topology-specific names like Kube DNS names. We want to deploy the apps using global names, so that when they run inside the mesh they behave one way, and when they run outside the mesh they behave another way. The apps don't know, they don't care, and they shouldn't.
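
One way to think about those global names in Istio terms is a hostname that isn't tied to any one cluster's Kube DNS, registered into the mesh with something like a ServiceEntry. The host and port here are hypothetical; in a setup like this demo's, a federation layer such as Gloo Mesh would generate entries like this across clusters for you.

apiVersion: networking.istio.io/v1beta1
kind: ServiceEntry
metadata:
  name: purchase-history-global
spec:
  hosts:
  - purchase-history.global    # topology-agnostic name the apps call
  location: MESH_INTERNAL
  resolution: DNS
  ports:
  - number: 8080
    name: http
    protocol: HTTP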

On the top pane, we see cluster 1, which has Web API, which curls recommendation, which curls purchase history. On the bottom pane, we see cluster 2, same thing. If I come back here to cluster 1, go into a sleep pod, a client, and do a curl on the Web API service, we can see the full response. Again, we're calling the global name, and it's going into cluster 1. It will always go into cluster 1, and it will stay in cluster 1 unless there's some failover event. We won't see the load balancing like we did when I was calling from my laptop. Another thing to notice, if I do a slightly more verbose curl here: we are resolving to an IP that is not a public gateway address. It is an internal, mesh-only address. We're not going back out through the API gateways for this curl. We're going directly from the client, which is in the mesh, to the Web API service, which is also in the mesh, and we let the service mesh apply its policies to that communication.

The last thing that I'll show here is taking down the purchase history replica running in this cluster. We take that down to zero replicas, come back to the client in the mesh, and make a curl to the Web API service. We'll notice that the curl ended up in cluster 1 like we expected. What we have there is locality-aware routing, and when it hits the purchase history service, it will actually fail over to cluster 2, automatically and transparently. Again, we're calling the global name, purchase history, and the application networking layer is responsible for implementing the various failover, priority, and sometimes regulatory policies about how traffic should flow through the system. In this case, we correctly fail over to cluster 2.

Summary

At Solo, we are the leaders in this space. We're working with probably the largest deployments of service mesh in the world. It is a great place to work, to learn, to contribute to open source, to contribute to the ecosystem and the industry in general around this space. We're all over the world. It doesn't matter where you're located.

Questions and Answers

Losio: First of all, I'll start with a question that has a chicken-and-egg problem: you start with some microservices, you have something at the beginning. When should you really start to think about the service mesh? Say I'm a new startup with a brand-new project, or I have something already running and I'm slowly growing. I have high hopes, but when is the tipping point?

Posta: It first starts off with your microservices journey in general. If you're building a system as a set of services that communicate over the network, and you plan on adding more, that by itself is already a fairly complicated situation. You need to have the supporting infrastructure to be able to run that and to operate it going forward. The first things that you want to consider in that scenario are probably not a service mesh. They're probably things like: how do we deploy these? Where do we deploy them? How are the teams going to make updates to them? How do we build a process to enable self-service? Because a big part of why you're likely building microservices is to be able to move faster. Those are some of the initial things that you want to consider.

You also have to factor in the rate of learning or adoption: how comfortable is the organization with building those pieces out first? You don't want to try to do everything all at once, because that's not going to be very fruitful. You have to consider, who's driving this in the organization? Is it the developers, bottom up? Is it the executives, top down? Is it somewhere in the middle? All of these things contribute. Once you get that stuff sorted out, and you figure out how you're going to bring 2, 3, 5 microservices into a working environment, then you have to worry about and consider the network. That problem will never go away. When you're talking about microservices, you're putting things on the network, requests and responses, and you have to deal with the realities of the network, front and center.

If you happen to be adopting something like Kubernetes, it does start helping with some of those things. If a service goes down, Kubernetes will try to bring it back up. If you are looking for some basic service discovery, Kubernetes has stuff like that. You might be able to get something out of your platform already, even if it's just a core, bare-bones platform, and you should consider that a step in that direction. There are other things: if you have a small number of services, it might make sense to look at a gateway, a more modern edge gateway, ingress gateway, API gateway. Actually, the way a lot of people start with something like Istio is to deploy the ingress gateway first, start operationalizing those pieces, use it for some basic routing, and then start to tiptoe into exploring how sidecars work for their applications. There are no hard and fast rules here, but there are areas that you want to consider. Have some foundational infrastructure in place first to support microservices, then think about the growth of the platform and how you're going to operate it. For some teams that might mean, yes, you look at a service mesh right away, because maybe they're already familiar with Kubernetes, maybe they already have CI/CD, maybe they have their security pipelines and scanning all set up, and now they're just trying to spin up a new set of apps.

Losio: You see that as an easier transition at that point. It really depends as well; as you mentioned already, with a single-digit number of services you may already have a very good use case to move to, or at least to start working with.

Is a service mesh just something for microservices? How do you see that?

Posta: I think anything that communicates over the network could tend to benefit from what the service mesh is doing. It just happens that in a larger-scale services environment, where communication over the network is more prevalent, a service mesh brings more value. If you just have one monolith, then maybe less so; maybe a simple gateway will be useful there. In today's IT age, more and more things are communicating over the network, even to the extent that you scale out to IoT; the network is being highly leveraged here. The last piece is the type of infrastructure that you're deploying to as well. Let's say you have 10 physical boxes, machines, and you're deploying applications on those 10; it's not a very dynamic environment. You're still using the network, so you still need to solve these network concerns, but if the infrastructure is not all that dynamic and the applications are not changing, then maybe a gateway would be sufficient. If the infrastructure and applications could be becoming healthy or unhealthy, going away completely, scaling and autoscaling, then that adds another variable that exacerbates the problems around the network.

Losio: How do you implement observability on a service mesh, or how do you integrate what you already have, in terms of observability?

Posta: Observability is a property of a system, first of all; let's start off with that. What that means is observability has to be taken into account at all levels of the system. Why you want to build an observable system is so that when things go wrong, you can start asking questions, peel back the onion, figure out what's happening, and have enough data to be able to do that. That's an observable system. The service mesh is playing at the layers of the network between the applications, at the application level, so it's looking at HTTP requests, at application-layer protocols and requests, and it plays a part in that observability story. Typically, people integrate the mesh with the rest of that story by starting with the top-line application-to-application networking metrics that they might be interested in. Things like, how many requests are going from service A to B?

Losio: High level request and high level metrics, so start from there.

Posta: Throughput, latency, error rates, saturation: these four or five top-level metrics are what you might want to capture between services. You can also do things like distributed tracing, capturing access logs, and these types of things. Then you want to pull that back into your larger system and use it for observability.
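
As a sketch of wiring that up, newer Istio versions expose a Telemetry API for turning on access logs and tracing mesh-wide; the provider names below are Istio's built-in defaults, and the sampling rate is an arbitrary example.

apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: mesh-default
  namespace: istio-system    # root namespace = mesh-wide defaults
spec:
  accessLogging:
  - providers:
    - name: envoy            # built-in Envoy access log provider
  tracing:
  - providers:
    - name: zipkin
    randomSamplingPercentage: 10.0

The top-line metrics themselves (request rates, latencies, error rates) are emitted by the sidecars as Prometheus metrics such as istio_requests_total, which you can scrape into whatever observability stack you already run.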

 


 

Recorded at:

Feb 24, 2022
