
Deploying Service Mesh in Production


Summary

Christian Posta shares practical guidance on how to adopt a service mesh for an organization including separating out control plane and data plane, plugging in with observability tools, leveraging gateways appropriately, rolling out mTLS safely, and overall preparing for troubleshooting and debugging.

Bio

Christian Posta is Global Field CTO at Solo.io, former chief architect at Red Hat, and well-known in the community for being an author (Istio in Action, Manning, Istio Service Mesh, O'Reilly 2018, Microservices for Java Developers, O’Reilly 2016), frequent blogger, speaker, open-source enthusiast and committer on various open-source projects including Istio, Kubernetes, and many others.

About the conference

InfoQ Live is a virtual event designed for you, the modern software practitioner. Take part in facilitated sessions with world-class practitioners. Hear from software leaders at our optional InfoQ Roundtables.


Transcript

Posta: My name is Christian Posta. I'm a Field CTO at a company called Solo.io, where we focus on service mesh deployments and tooling and running in production. We have a couple of products that we've built that large organizations have taken and deployed, to solve some of the service connectivity problems. This is an area of expertise for us. I specifically have been involved in the Istio community since the very beginning. I've written a couple books, including one that should be launched or published at the end of the fall, 2021.

Communication Mesh between Services

We're going to be talking about service mesh and running service mesh in production. I definitely want to start off with some context about what a service mesh is. You may have heard of it. You might have seen people deploy it and talk about it. What it is, basically, is a way of solving service-to-service communication challenges with sidecar proxies that allow you to transparently instrument your network calls with observability. You can grab telemetry. You can enforce security. You can control routing between services, by using and programming these little proxies that live with the application. This is an alternative to writing a lot of this functionality into the application yourself, or using a centralized gateway through which all traffic flows, which at times creates potential bottlenecks. The service mesh pattern puts these little agents next to the application code. It doesn't matter what language you've written your applications in; the proxy sits out of process, as a sidecar or a helper process to the main application instances. These little proxies are configured and managed by a control plane component that operators and end users then interact with to drive the behavior of the network. In many ways, this is an API on top of your network that understands application traffic.

Do You Need a Service Mesh?

Do you need a service mesh? That's a question you're going to have to answer if you're coming to look for tips for deploying a service mesh into production. You have to look at your context and your environment. Are you dealing with a lot of services, a lot of different components that need to interact with each other over the network to solve a business problem: multiple languages, multiple frameworks? You don't want to be writing this functionality over and over in different languages and different frameworks, and then be on the hook to maintain the correctness of those libraries, to upgrade them, and to manage their lifecycle. Those are tall orders for an operations team: making sure that all of these different languages implement everything exactly the same when it comes to security, telemetry collection, routing, policy enforcement, and rate limiting. These are not language, framework, or even business differentiating concerns; they're cross-cutting concerns. Service mesh plays especially well in a cloud native environment where you have ephemeral workloads scaling up and scaling down. There's more decentralization and autonomy in the teams that are deploying these services, probably revving them quickly, and you need to build some guardrails for how that actually happens. Service mesh can help with that. It's typically useful for RPC-type interactions, or really anything that communicates on the network. It gives operations teams a way to get consistency in how traffic and services communicate over the network. When things start to fail, being able to reason about how the system should behave in a production environment is very difficult if any team can go off and write this logic and functionality however they want. Service mesh brings some consistency to that, running in a distributed system.

Where Do I Start?

The next question is, where do you start? A service mesh, like anything that deals with application-level networking, can be complicated to get right. This was true in the past, and it's no different now. The best place to start is to start small: start iteratively, and adopt and grow into some of the capabilities that a mesh offers. I have been advising people for the last four years to start adopting a service mesh at the edge, where traffic comes into a boundary, because there you can start to get the benefits of a mesh without directly affecting the model of how you deploy your applications. Kubernetes and containers make sidecar deployments easier; on VMs, the pattern is still applicable, but it's not as explicit. Start with something that doesn't force developers to change, or even think about, what and how they deploy their application. Start at the edge: a common ingress API gateway. Start building the capabilities at the edge, and then take those learnings as an important step in adopting service mesh. The maturity an organization needs to adopt a service mesh comes from trying to operate this thing in production. How does it behave? How do you debug it? Where do you get the logs? What does the telemetry tell you? How do you demystify the proxy technology that, if you adopt a service mesh, will inevitably be living directly with the applications?
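As a concrete sketch of "start at the edge": in Istio, the ingress boundary is described by a Gateway resource. The names, namespace, and host below are hypothetical, just to show the shape of the config:

```yaml
# Sketch: an edge ingress boundary described as an Istio Gateway.
# Names and hosts are assumptions, not from the talk.
apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  name: edge-gateway          # assumed name
  namespace: istio-ingress    # assumed namespace for gateway workloads
spec:
  selector:
    istio: ingressgateway     # binds to the ingress gateway deployment's labels
  servers:
  - port:
      number: 80
      name: http
      protocol: HTTP
    hosts:
    - "api.example.com"       # assumed host
```

Routing through this gateway is then declared with VirtualService resources, so teams can start exercising mesh capabilities at the edge before any sidecar touches their workloads.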

Start at the edge, and from there, start to iterate closer to the application. Start high and slowly push the sidecar proxies down. Maybe pick a group of applications to start with, slowly add others. Enable things like mutual TLS. Start collecting telemetry for the services that are communicating over the network. Implement safety valves, resilience mechanisms, these types of things. It's an iterative approach to adopting. Through Solo and Red Hat, before that I've been working with organizations adopting this type of technology for a while. This is the tried and true approach to doing that. If this is the framework, this is the foundation for whether or not you need a mesh. Once you start doing it, start small. Understand the various pieces that make up the mesh, and then iteratively bring them to your applications. Show wins, show value.

Tips for Deploying to Production - Initial Hello World Experience Is Not Suitable for Production

Let's jump right into some of the observations that I've made over the last few years about people adopting and deploying service mesh to production. The first one starts in the evaluation phase, the initial hands-on experience when adopting a service mesh. That Hello World experience, what you get when you go to the docs and click the Getting Started link, is not real-world production use. You're not going to take that guide, run it, and then be in production. There's a lot of real-world tuning and configuration that needs to go into getting your mesh running in production. I'm going to try to show a quick little demo that illustrates this.

The point here is when you're starting to evaluate, go a level deeper than just Hello World. Try to actually bring your workloads into the mesh. Think about, how are you going to prepare this for the lifecycle upgrades? Integrating with other parts of the system like telemetry collection, time-series databases, tracing engines, these types of things. Go a level deeper. Don't just say, this looks so easy. It's not. Make sure that you invest a little bit of time in getting to that right level to determine whether or not the mesh that you're looking at, can adequately be supported in production.

Gateway Functionality Is Crucial for Self-Service and Multi-Cluster Service

Number two, when people start using the mesh, look at the edge, look at the gateways. Understand those pieces. Those are critical pieces of functionality. Going from the previous point to now, you also have to plan the architecture for how gateways will be used to fence off boundaries, enable cross-cluster communication. You have to think about, how do you keep those gateways safe? How and when do you enable developer teams to own their own gateways? There's a lot of architectural guidance around gateways, and we'll take a look at least a little bit of it in the demo. Don't neglect or don't overlook the fact that gateways are very important to building a successful service mesh architecture.

Treat the Data Plane as Part of Your Application

Three, as I was hinting at earlier, the data plane is extremely important. A data plane when it lives as a sidecar with your application, becomes part of the application. It shouldn't be treated as some black box or some other thing. It's part of your application. It's not written in the same code that your application is, and it lives outside the application, but it is functionally part of your application, so you should understand how to deploy it, how to safely roll it out to existing applications, how to debug it. Those are very important considerations just like you would for your application code.

Default Certificate Management Is Not Suitable for Production

One of the benefits of using a service mesh is to enable more secure communication between services: to enable authentication and authorization at the connection level and at the request level between services, and stand up the underpinnings of a zero-trust network. With that in mind, the service mesh does enable some conveniences, things like certificate management, and so on. You need to figure out how to plug in your existing PKI infrastructure, or if you don't have any, build a PKI infrastructure into what the mesh does and how it orchestrates minting workload certificates, rotating those certificates, and enabling mutual TLS or TLS. That means integrating with things like cert-manager and Vault, or in a public cloud, AWS PCA or ACM. Those are really important steps in deploying a service mesh into production.
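As one concrete example, Istio lets you replace its self-signed root with an intermediate CA from your own PKI by creating a secret named `cacerts` in the control plane namespace. A sketch of the shape of that secret (the file names are what Istio expects; the certificate material itself would come from your PKI, or from tooling like cert-manager or Vault, and the base64 placeholders below are obviously not real values):

```yaml
# Sketch: plugging an existing PKI into Istio's workload-certificate minting.
# istiod reads a secret named "cacerts" in istio-system at startup and uses
# the intermediate CA to sign workload certificates.
apiVersion: v1
kind: Secret
metadata:
  name: cacerts
  namespace: istio-system
type: Opaque
data:
  ca-cert.pem: <base64 intermediate CA cert>    # signs workload certs
  ca-key.pem: <base64 intermediate CA key>
  root-cert.pem: <base64 root CA cert>          # trust anchor distributed to workloads
  cert-chain.pem: <base64 chain from intermediate to root>
```

Keeping the root key offline and handing Istio only an intermediate is the usual production posture, since the intermediate can be rotated without re-establishing trust everywhere.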

Understand How to Debug the Mesh Configuration

Last, you've got to understand how to debug the mesh, how to debug the network. This is a first principle anyway: when you build microservices and distributed systems, you need to figure out how to debug that system, how to debug the network. It's no different when you put a service mesh in place. If anything, a service mesh, with these proxies that live with the application, shines a light on the network and gives you a lot more understanding about what's happening. You as an operator, you as a user deploying a service mesh into production, need to understand what those telemetry signals are, how to interpret them, and how to very quickly debug when things aren't working exactly the way you're expecting.

Resources

A lot of this material comes from real experience working with organizations deploying a service mesh into production. We've built workshops for this at Solo. You can go to this link, https://www.solo.io/events-webinars/, to check out some of the upcoming webinars, and more importantly, the hands-on workshops that we've built that incorporate these types of learnings. When we build our service mesh product at Solo, it takes a lot of this into account, because our goal is to simplify using and operating this type of technology to solve the challenges people have around high availability, service failover, and zero-trust networking in multi-cloud or, more likely, hybrid cloud, on-prem and public cloud, gluing those two worlds together nicely at the network.

Deploy Istio for Production: Hands-on Workshop

What we're going to see here is we have a sample Kubernetes cluster running, and we have a few workloads. The web API app ends up calling recommendation, which ends up calling purchase history. It's a set of microservices where there's a network communication, we can use this sleep app as a client to call into the microservices. If we get our namespaces, we see we don't have a service mesh installed. What we're going to do is we're going to install a service mesh. The first thing we're going to do is we're going to notice that we're going to follow a different approach than what the official documentation shows. This is point number one. You got to dig deeper past the Hello World experience. In this case, what we're going to do is we're going to install a very minimal control plane that we can then layer things on and add more functionality. We're doing this specifically so that we can enable longer term lifecycle type things like upgrades and patching, layer in different observability tools and tracing. The best way to do that is to bring the pieces together yourself, or you obviously can build automation around doing this. The most important thing is understanding the service mesh a little bit deeper than just Hello World.

We've got a couple of sets of steps here. We're going to install our service mesh using this config here, which specifies some interesting production configurations. We're going to annotate it with a specific revision. This is an Istio-specific thing, but what it means is that it tags the install with a version, which means we can run canary versions of the service mesh itself. We don't have to worry about doing in-place upgrades and that kind of thing; we can canary the service mesh itself. You have to dig through the docs a little bit deeper to find this, but it's absolutely something you want to keep in mind for a production deployment. Now that we've deployed it, we can see we have our control plane running, very simple, one component. From here, we can layer in Prometheus. We can layer in Grafana. We can layer in gateways. We can do all kinds of things, but keeping the components deployed in such a way that we can manage their lifecycle without downtime is incredibly important.
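The kind of install described here, a minimal control plane tagged with a revision so it can later be canaried, looks roughly like the following IstioOperator config (the revision naming convention is an assumption, typically matched to the Istio version):

```yaml
# Sketch: minimal, revisioned Istio control plane, installed with
# e.g. `istioctl install -f control-plane.yaml`. Addons like Prometheus,
# Grafana, and gateways are layered on separately.
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
metadata:
  name: control-plane
spec:
  profile: minimal    # just istiod, no bundled gateways or addons
  revision: 1-9-5     # assumed tag; lets a second revision run side by side
```

Because the revision is part of the install, a later version can be installed in parallel under a different revision and workloads moved over gradually, rather than upgrading the control plane in place.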

The next thing is we want to start with the service mesh or start with the gateway, start at edge. We want to separate out the lifecycle of the gateway from the control plane, because we want to be able to upgrade and update each of them independently and without taking down traffic. In this case, we see we have our control plane up, and we have our configs to enable the gateway. This is going to be an ingress gateway that allows traffic into the service mesh. We're going to create a different namespace for hosting the gateway. We don't want to combine them with the namespace that's holding the control plane. Then we'll do another install, we'll give another revision. This aligns with the Istio version that we're using. We'll give it a second to install the various components that are necessary for the gateway. Now if we take a look, get pod for the ingress gateway, we see that indeed in a different namespace, we see the ingress gateway. Again, you have to dig through the docs, dig past the Hello World, and understand how to actually deploy a service mesh for production. Separate out failure domains and failure boundaries, so that you have better lifecycle management, better fault tolerance, and better end user experience.
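Separating the gateway's lifecycle from the control plane, as described above, can be done with a second install that enables only the ingress gateway component, in its own namespace. A sketch with assumed names:

```yaml
# Sketch: ingress gateway installed separately from the control plane,
# so each can be upgraded and rolled back independently.
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
metadata:
  name: ingress-gateways
spec:
  profile: empty               # install no default components
  revision: 1-9-5              # align with the control plane revision in use
  components:
    ingressGateways:
    - name: istio-ingressgateway
      namespace: istio-ingress # assumed; deliberately not the control plane namespace
      enabled: true
```

Keeping the gateway in its own namespace also gives it its own failure and RBAC boundary, which is part of the fault-tolerance point made here.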

Now we have our gateway deployed. What we want to do is configure our gateway to allow traffic into the mesh. These are Istio specific configs, don't worry about them. We apply them, and now we should be able to call our services through this gateway, through this service mesh ingress gateway. Indeed, we see the web API service calls recommendation, which calls purchase history. That works correctly.
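The "Istio specific configs" that allow traffic in are typically a Gateway plus a VirtualService that routes matching requests to the first service in the chain. A hypothetical sketch for the web-api service (gateway name, host match, and port are assumptions):

```yaml
# Sketch: route traffic arriving at the ingress gateway to web-api,
# the first service in the demo's call chain.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: web-api
spec:
  hosts:
  - "*"                          # assumed: accept any host at the edge
  gateways:
  - istio-ingress/edge-gateway   # assumed Gateway name and namespace
  http:
  - route:
    - destination:
        host: web-api            # Kubernetes service name from the demo
        port:
          number: 8080           # assumed service port
```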

One thing to notice here, is if we take a look at our workloads, we still don't have a service mesh running here. We have basically this slide right here, where we have some workloads, and we've deployed a gateway. This happens to be part of the Istio service mesh, and we started at the edge. Now we want to roll out the sidecar proxies to our workloads. We want to roll out the service mesh to our workloads. The first thing that we're going to do is we're going to label our namespace, telling it, this is the version of the control plane that we want for this service mesh, which will then inject a sidecar proxy into the workload. We're not going to inject it directly into the existing workloads, because I said here on point three, "Treat the data plane as part of your application, rollouts should be done as canaries." We're going to do that. We just created a new deployment that we're going to treat as a canary. This should have the service mesh data plane deployed next to it. If we take a look at it, we see, yes, here's the web API. We're going to make a change to it, we're going to do it in a canary rollout. We're not going to just start changing things in place. In this case, we're going to deploy the service mesh data plane next to it. Notice the rest of them do not have the service mesh yet.
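The namespace label that ties workloads to a specific control plane revision, so that newly created pods get a sidecar injected from that revision, looks like this (namespace name assumed):

```yaml
# Sketch: opt a namespace into sidecar injection from a specific revision.
# Only pods created after the label is applied get the sidecar, which is
# what makes the canary-style rollout described here possible.
apiVersion: v1
kind: Namespace
metadata:
  name: istioinaction        # assumed namespace name
  labels:
    istio.io/rev: "1-9-5"    # matches the control plane revision
```

Existing pods are untouched until they are recreated, so deploying a fresh canary Deployment into the labeled namespace is what first brings the data plane next to the application.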

Here we'll take a look at what version of the service mesh we're connected to. From here, we can make a bunch of calls and verify that, yes, the canary is working correctly. Everything is fine. We haven't disrupted, we haven't broken anything. If all looks good, then let's roll out the rest of the updates to the applications. Now we went through the canary phase fairly quickly. We see that the canary as well as the regular deployment has the data plane. All we have to do is we're going to fast forward, we're going to speed up the rest of it for the rest of the applications here.

This is a bit of an obscure detail, but a very important one for deploying a service mesh into production, and something you will likely run into. It goes back to points number five and number one: go a little bit farther with your service mesh, don't just accept Hello World as what it's really going to be like, because "it just works" in a POC or Hello World environment is not the same thing as a mature service mesh actually deployed into production. What we're going to show here is a couple of things, which is why I said to start at the edge to get some quick wins. One of the problems of deploying the service mesh data plane, the sidecar, next to your application is that the application might have some assumptions. Maybe the application assumes that when it comes up, it reaches out to something to pick up some config or some security credentials; it makes a call before the application actually starts up. With a service mesh, you could run into a problem where you put the data plane next to the application, and when the pod comes up, there's a race between which comes up first, the proxy or the workload. If the workload comes up first and tries to make a network call, but the proxy is not ready, that call will fail. The whole point of the service mesh is to intercept application networking traffic and apply policy to it, enrich it, secure it, observe it, all that stuff. If the proxy is not ready, that call is going to fail. What happens? The application is going to fail. The whole thing will get restarted, and it'll race again. Then you get into these unpredictable scenarios.
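Istio has a mesh-wide setting for exactly this race: holding the application container until the proxy is up and ready. A sketch of enabling it in the control plane config:

```yaml
# Sketch: make the sidecar start and become ready before the app container,
# so early outbound calls don't fail because the proxy isn't up yet.
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  meshConfig:
    defaultConfig:
      holdApplicationUntilProxyStarts: true
```

The same behavior can also be enabled per workload via the `proxy.istio.io/config` pod annotation, if you only want it for the applications that make startup-time network calls.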

What we're going to do is we're going to upgrade our service mesh, to try to influence the ordering of the proxy and the workload. Hold the application till the proxy starts, so that if there is any communication, it goes through the proxy. We're going to actually embark here on an Istio upgrade. We're going to upgrade from 1.9.5 to 1.10.0, in parallel, as a canary. This is about as safe of an upgrade as you can do, which is important because things like a service mesh are critical pieces of infrastructure for your system. Now let's take a look, we have both control planes running at the same time. Nothing's happened on the application side, everything is still running just fine.

What we're going to do is label our namespace with this new revision, telling it that we intend to upgrade to this new control plane. We're going to do the same thing that we did earlier: deploy a canary. With this new canary, just like I said earlier, we're making a change to the application. In this case, it happens to be the service mesh data plane, but it's a change, and the data plane is part of the application. We're going to do this as a canary. When we do this as a canary, we want to check it. We deploy. We check it, and make sure it's functioning properly. If it's not, we want to back that canary out. Now we just created this new canary pointing to the new control plane, as we're in the middle of this upgrade. Remember, we configured this service mesh to hold the application till the proxy starts. Now if we take a look at the pod, this canary, we should see that the ordering was influenced: we had the initialization, the proxy started first, and then the workload started up. Look closely at the service mesh you choose. Istio is probably the most mature, most deployed service mesh out there right now. If you look at the other ones, make sure you're digging a little bit deeper than just Hello World, and preparing yourself for success in a production environment.

Wrap-up

I'm just scratching the surface on all this stuff. Check out our various workshops and stuff that we do at Solo. We do a lot for the community to help educate and further the solutions around service mesh and others.

Questions and Answers

Betts: That was a great overview. I think as you said, you just scratched the surface. It was a whirlwind tour. It was how to get stuff deployed. Obviously, you had a lot of scripts that were like, let me get this done quickly. I like that you pointed out a few of the key ideas that people need to be aware of when they're starting with a service mesh.

I wanted to start with the idea that you'd discussed was the separation of concerns. You mentioned one of the benefits of a service mesh is that it handles a lot of the networking responsibility, the TLS and security and other aspects of networking that people were having to put into their applications, and so you don't have to write that. That leads me to the question of, who then has to handle that code. Where does that code live? Is it just Istio configuration files, if I need to say my service allows these other services to call me but no one else? Where does that configuration go? Who handles that?

Posta: In the past, developers had to solve these types of networking problems themselves. They would bring in their own custom libraries to do it. They would go Google around and find, I'm using Node.js, so I'll go find whatever the circuit breaking implementation is for Node.js. The Java developers would go, fine, this looks right because Netflix used it. They would cobble together these pieces. In a service mesh, you delegate that to the sidecar proxy. That sidecar proxy has an implementation that is consistent regardless of what language you end up using. The sidecar proxy is driven by the control plane. The control plane is what configures the sidecar proxies to behave a certain way: we're going to load balance this way for this app; we're going to implement timeouts and retries, and per-retry timeouts, for this particular app. The control plane delivers that configuration to the proxies at runtime.

Who is in charge of specifying what those policies are, and where do those policies live? It's typically organizationally dependent. Some teams that we've worked with at Solo, the platform team owns all of that configuration. They put that in Git, and each app has its own folder, and the app teams can influence those configurations. You end up with this contention, because now to make a change to the application networking, the app team has to talk to the platform team, and they have to synchronize somehow right there. That could slow things down. Other approaches are, the platform team will enforce global configuration, baseline configuration, especially around security, but in some cases around the resilience or telemetry collection aspects. Then individual teams, when they deploy their applications, they deploy their applications with configuration that configures the service mesh for their app. There's variations in between that as well. It really depends on the way the teams operate, and the way they're comfortable operating. Some are not as comfortable giving each team their own autonomy, which has been shown in some organizations that the more that you build the platform to enable teams to self-service, the more you reduce those coupling points and those bottlenecks, and slowing down the process to make changes in the system.

Betts: Then stuff like the resilience, so that you have retries and timeouts and circuit breakers, I still have to write some code in my application. It's not that I'm writing all the circuit breaker code in my app, but I'm aware that there is going to be a potential retry or a circuit breaker and I have to handle that; it's just that I don't have to own all of that code. That code is going to be handled organizationally in our service mesh.

Posta: That's right. The services in a distributed system like this, still need to know that things will go wrong. When things go wrong, how does the app recover from that? When things go wrong, you can configure and say, if the network is having some issues then retry a couple times. Wait only this amount of time before retrying. Envoy has a really interesting feature that you have to dig deep in Istio to actually use, but it's called request racing. If one request is taking too long, Envoy can send out another request to the same service, and then whichever one returns first, take that response. This is obviously good if you're pulling referential data, not making writes and that kind of stuff. There are different resilience aspects that the developer can configure, but they don't have to write that implementation, because those implementations might differ between applications. When you start deviating from how you reason about the networking, then that's when you start to make things a little bit harder for yourself.
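As an example of what "the developer configures resilience but doesn't write the implementation" looks like, retries and timeouts for a service are declared in a VirtualService, and Envoy executes them identically whatever language the service is written in. Service name and values below are hypothetical:

```yaml
# Sketch: resilience declared as config rather than written as code.
# The sidecar (Envoy) enforces the timeout and retry policy for every caller.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: recommendation
spec:
  hosts:
  - recommendation               # assumed service name from the demo
  http:
  - route:
    - destination:
        host: recommendation
    timeout: 10s                 # overall budget for the call
    retries:
      attempts: 3
      perTryTimeout: 2s          # each attempt gets its own, smaller budget
      retryOn: 5xx,connect-failure,reset
```

The application still has to handle the case where all retries are exhausted, which is the point being made here: the mesh owns the mechanism, the app owns the recovery behavior.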

Betts: One of the questions is one that I was thinking about, the common issues that you see people running into. I like the little example at the end that this application starts up, but the proxy isn't in place. You don't have to rewrite the app, you have to reconfigure the order of how the app is deployed after the proxy starts up. That seems like a common scenario that people don't think about until they run into and then they just have to handle. What are the other common scenarios like that, that you see people, yes, that's this thing, you just have to follow this pattern to solve it?

Posta: There's a couple. There's some that are directly related to the service mesh, and there's some others that are related more to the way the applications were already written and the assumptions that those organizations have. For example, one issue that we see is how do you safely enable mutual TLS? Because mutual TLS and encrypting the traffic on the wire is one of the big selling points for a service mesh. People struggle with, do I use self-signed certs? Do I integrate with my own PKI, and my own certificate authorities? Do I use intermediate CAs? How do I actually get this thing up and running? The second part to that is, once you enable mutual TLS, that's going to break any other client that wasn't using mutual TLS to talk to that service. If there's already some service communication there, the idea is to slowly, iteratively bring services into the mesh. If you enable strict mutual TLS, then the services that are not using mutual TLS, those calls will break, and that will be a problem. That's the mechanics of bringing the mesh into the system.
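The usual way to avoid breaking those not-yet-migrated clients is to enable mutual TLS in permissive mode first, which accepts both plaintext and mTLS, and only tighten to strict once everything is in the mesh. A sketch scoped to a single (assumed) namespace:

```yaml
# Sketch: enable mTLS without breaking existing plaintext clients.
# Flip mode to STRICT only once all callers are in the mesh.
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istioinaction   # assumed namespace
spec:
  mtls:
    mode: PERMISSIVE         # accept both mTLS and plaintext during migration
```

Telemetry can then tell you what fraction of traffic is still plaintext, which is how you know when it is safe to switch to STRICT.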

On the other side, you see things like, my services, when they communicate with each other, the mutual TLS, that's all good. We can do some authorization policies built on that. The requests that we send back and forth, we have to verify the HMAC, and some other stuff that we've built into the way applications communicate today. Being backward compatible with existing protocols and handshakes and stuff that the services were already doing, that tends to trip people up. There's a few ways to solve that. We have people that we'll work with at Solo that are using WebAssembly to solve those problems. You can build external auth services to solve those problems. Maybe even inject Lua code into the proxy to do that. That is one of those cases that people are already running a set of services, they want to introduce a service mesh, but they can't break what's already there. They have to be as transparent as possible. That's not as easy as just saying, turn the service mesh on.

Betts: I think the one big takeaway I've got is you can't just throw in a service mesh. It's not as simple as the Hello World app. You can't just have it and get all the benefits. A lot of time and effort needs to go into thinking through the questions of how we're going to use it. How we're going to deploy it. What are we hoping to get out of it? What does it take to do that?

Posta: There are some complexities. Just like any technology, there are learning curves, and so on. But compare the benefits of doing this with the alternative: bringing your own libraries in, your own frameworks in, trying to maintain those for all the different languages, and hoping that every developer uses them consistently. Managing a system like that without a service mesh is a lot more expensive, a lot more error prone. We do see a lot of value when people adopt this approach.

 


 

Recorded at:

Jan 21, 2022


