
Service Meshes and Linkerd with William Morgan

Feb 01, 2021

Today on the podcast, we talk about Linkerd and the larger service mesh space with William Morgan (CEO of Buoyant). We cover William’s thoughts on important concerns such as latency and the cost (both in your cloud bill and in real human costs) of operating services; we talk a bit about the birth and evolution of Linkerd (including design decisions such as using Rust in the data plane and Go in the control plane); and, finally, we talk about the importance of security with service meshes (and how to reason about it).

Key Takeaways

A service mesh lets developers focus on their core competency (the business logic of an application) and not on the patterns and practices of making service-to-service distributed systems operate effectively.
A common complaint concerns the number of containers and the added latency when adding a service mesh to your system. In answering these concerns, it’s important to think about where your true costs lie. In almost every environment, the human cost is the thing that dominates, not the machine costs.
The evolution of Linkerd is in many ways the evolution of the service mesh market. Linkerd evolved from a library used at Twitter called Finagle, to a JVM-based sidecar with much of what Finagle offered in it, to a low-footprint microproxy, and finally to a zero-configuration microproxy that offers those features along with mutual TLS.


Transcript

00:21 Introduction

00:21 Wesley Reisz: Among other things, service meshes move code used for the wiring and telemetry of services out of application-level logic, where it has traditionally been included in things like libraries, into a whole new layer, most often implemented with proxies typically called sidecars. A lot has been written about service meshes. A recent CNCF Survey shows the uptick in it, with 27% of respondents saying they are using service meshes today. That's up 50% from the previous year, but what is a service mesh? What are these things called sidecars, and why do they all matter?

00:53 Wesley Reisz: Today on the podcast, we're talking about service meshes and more specifically diving into Linkerd with William Morgan of Buoyant. Hi, my name is Wes Reisz, I'm one of the co-hosts of the podcasts and chair of the first QCon Plus event that was held in November of last year. Speaking of that, there's another QCon Plus planned for May of 2021. Held between May 17th and May 28th, this spring version of QCon Plus will be held over two weeks. In just a few hours a day, we will feature 16 tracks curated by domain experts to help you focus on the topics that matter right now in software development. These are things like leading full-cycle engineering teams, modern-day data pipelines, and building continuous delivery system workflows, and the platforms used to build all of these things. If you're a senior software engineer, architect, or team lead, take a look at qcon.plus for more information.

01:43 Wesley Reisz: As I mentioned, today on the podcast we're talking with William Morgan. William is the CEO of Buoyant, which is one of the leading manufacturers of a service mesh called Linkerd. Linkerd, according to that survey that I mentioned before, has a market share of about 41%. Today we're going to dive into some details of Linkerd. We're going to talk a bit about the service mesh Linkerd, and also the larger service mesh space. We'll talk about things like William's thoughts around important concerns that deal with things like latency and costs, both in the cloud service bill and in the human cost. We'll talk a bit about the birth and evolution of Linkerd. We'll talk about some of the design principles and details. Linkerd uses Rust in the data plane and Go in the control plane. We'll talk a bit about why that is, and some of the decisions made along the way. We'll also talk about the importance of security and zero trust configuration as we leverage a service mesh.

02:38 Wesley Reisz: As always, thank you so much for joining us on your jogs, walks, and commutes.

02:42 Wesley Reisz: So William, welcome to the podcast.

02:44 William Morgan: Thanks Wes, great to be here.

02:45 How do you define service mesh?

02:45 Wesley Reisz: I kind of attribute my definition of service mesh to stuff I've heard you talk about or seen you write about. Described this way, it's a dedicated infrastructure layer, often implemented with sidecar proxies (not always, but often), for handling east-to-west (service-to-service) traffic. It kind of separates business logic from the networking plumbing that makes things work correctly. I've paraphrased that quite a bit, but what is your definition today? How do you define service mesh?

03:12 William Morgan: Yeah, I think what you said was a pretty good definition. I think, if anything what it is missing is the value, like the "why?", you know?

03:22 Wesley Reisz: Yeah, totally.

03:23 William Morgan: What's a car? Well, it's four wheels, and some seats, and a steering wheel. Okay? Well, that's good, but what's the why of the car?

03:30 Wesley Reisz: It's the "So what?" Right?

03:31 William Morgan: Yeah. So for me, whenever I talk about what a service mesh is, I kind of start with something like that. I immediately follow up with: the reason why it's important is because it gives you a bunch of features that you otherwise would have to have at the application level, and it instead gives those to you at the platform level. That shift in where it's located in the stack is the profound value of the service mesh. There's a reason why we go through all these crazy pains to implement it in this insane way with all these proxies and stuff that, from the outside, probably looks a little weird.

04:02 Wesley Reisz: To be real clear, you're talking things like discovery? You're talking like application patterns, like circuit breakers, and resiliency features like that, right?

04:10 William Morgan: Yeah, that's right, that kind of stuff. The observability side of things. Things like success rates, and latencies, and request tracing. Then even on the security side of things like mutual TLS.

04:22 Wesley Reisz: Yeah, definitely. I want to dive into security. I think there's some interesting conversations that we can have on that front. So the CNCF Cloud Native Survey for 2020 came out, I think in November; it was taken some time around May and June. But in it, it listed 27% of respondents... again, respondents to the CNCF Survey. But 27% of respondents are using service mesh in production, a 50% increase over last year. They expect the growth to continue, with 23% evaluating a service mesh and another 19% kind of planning to use it in the next 12 months. Why? Is it these things, these features that you were talking about before that we kind of kicked this off with? Is that what's driving this, kind of moving that stuff out of the application tier and putting it into kind of this infrastructure layer?

05:07 William Morgan: I think it's mostly, there's all these blog posts about it, so-

05:12 Wesley Reisz: That you're writing, right?

05:13 William Morgan: If you want to keep up with the Joneses, then you got to quickly add a service mesh to your stack, and who cares what it does. You got to check that box. No, more seriously, although we actually do see some of that approach, sadly, in some of the service mesh adoption, I think the reality is the service mesh is just a really convenient way to get a set of features that you really do need, right? There's not really a good alternative to that other than implementing all that stuff, as I said, in the application. That can be done, certainly that's kind of the traditional approach, but it's not great. There's a pretty significant cost of implementing a service mesh. So I do expect that trend to continue because in the Cloud Native space, whether you like it or not, you're going to end up using one because it's just a better trade-off than any other trade-off you can make.

05:59 Wesley Reisz: Yeah. Now that we have the network in the mix rather than doing just calls inside of a process.

06:04 William Morgan: Right, right.

06:06 At what point should teams start to look at a service mesh?

06:06 Wesley Reisz: At what point do you think it becomes important to really start to look at a service mesh? At what point does the overhead of the things that you're doing in the application start to become enough that it makes sense to really implement a service mesh. What I mean, are we talking dozens or hundreds? Or do you think it's always good to start there?

06:24 William Morgan: Yeah. I think there's a couple of factors that go into it, and I touch a little bit... Not to toot my own horn, but I wrote a very lengthy Meshifesto on servicemesh.io, which I think now redirects to a Buoyant page. But I try and outline some of the factors that go into it there. I think certainly, your application and kind of the structure of your application is a big one. If you have a monolith, a service mesh is not going to do a lot for you because the way that it adds its features is by adding them to those connections between services, so you have to be running microservices. I think another big factor is if you are running in an environment like Kubernetes. Because what Kubernetes gives us is the ability to implement a service mesh in a way that actually is not that painful for you. Right? When we talk about deploying thousands of sidecar micro-proxies, or whatever it is we do with Linkerd, we can do that basically transparently in Kubernetes.

07:19 William Morgan: Because there's a lot of features we can build on top of and set the network routing rules and all that stuff kind of automatically for you. If you are in a different environment where you don't have those primitives, then yes, you can do a service mesh. But the cost, the deploy time costs, and the operational cost is going to be significantly higher, and I think that changes the equation. So in my mind, I think it really comes down to, yes, what's the structure of your application? Then are you in an environment that makes it easy to add a service mesh? If you're not, the cost might outweigh the benefits.

07:48 There are common issues raised when adopting a service mesh, including the number of containers and latency. How do you respond to these comments?

07:48 Wesley Reisz: Let's talk about that for a second. So one of the first pushbacks you hear comes from a company that might be a smaller, maybe startup, maybe cost-averse company: when you start talking about a service mesh, that means that we're going to double the number of containers that we're running in our environment. It's going to be a lot more expensive to be able to run this. What's the response when someone brings up that argument?

08:08 William Morgan: The easiest response is what is the actual cost for you? Is it the compute cost, or is it the human cost? I think in almost every environment, the human cost is the thing that dominates, so that's the one you want to solve for. Then maybe the other response there is where do you want to spend your innovation tokens? Do you want to spend that on inventing basic infrastructure, and having your developers write retry logic, and have to get that right in a big distributed system? Or do you want to just get the best practices implemented for you, and spend your mental energies on the actual logic of your business?
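
As a rough illustration of the kind of retry logic a developer would otherwise have to write, and get right, inside each service, here is a minimal Rust sketch. The function name `retry_with_backoff`, its parameters, and the flaky call in `main` are hypothetical and are not taken from Linkerd or Finagle; a production version would also need things like jitter, retry budgets, and idempotency checks.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Retry `op` up to `max_attempts` times, doubling the delay after each failure.
/// Returns the first success, or the last error once attempts are exhausted.
/// `op` receives the 1-based attempt number.
fn retry_with_backoff<T, E, F>(max_attempts: u32, base_delay: Duration, mut op: F) -> Result<T, E>
where
    F: FnMut(u32) -> Result<T, E>,
{
    assert!(max_attempts > 0, "need at least one attempt");
    let mut delay = base_delay;
    let mut attempt = 1;
    loop {
        match op(attempt) {
            Ok(value) => return Ok(value),
            // Out of attempts: surface the last error to the caller.
            Err(err) if attempt == max_attempts => return Err(err),
            Err(_) => {
                sleep(delay);
                delay *= 2; // exponential backoff
                attempt += 1;
            }
        }
    }
}

fn main() {
    // Hypothetical flaky call: fails on the first two attempts, then succeeds.
    let result: Result<&str, &str> =
        retry_with_backoff(5, Duration::from_millis(1), |attempt| {
            if attempt < 3 {
                Err("connection reset")
            } else {
                Ok("200 OK")
            }
        });
    println!("{:?}", result);
}
```

Even this small version has to decide how many attempts to allow, how long to wait, and which error to report. Those are the decisions that get repeated in every service when the logic lives in the application rather than in the platform.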

08:42 Wesley Reisz: Wonderful answer. The next question is latency, right? With the sidecar model, what's... I guess, your response when people say, "Yeah, but there's a performance cost here"?

08:51 William Morgan: That's certainly true. In fact, with the service mesh, it's not like you're just adding one proxy between every hop, you're adding two proxies, right? You've got both the client side and the server side. So when we think about not the human cost, which I still think is the most important thing to consider... But when we think about the kind of machine cost, there's three things: there's latency, there's memory consumption, there's CPU consumption. You're going to pay a price, as with any piece of technology. It's not like that's new to the service mesh, you're adding stuff. If you added the stuff in library form, then you'd pay those costs too. Maybe less so. So that's all part of the cost-benefit analysis. I will say on the Linkerd side, we are extremely aware of the resource and the latency cost. We do a whole lot of work, especially on the proxy side, to minimize those things, so that you end up with what we call a micro-proxy, that's extremely fast and extremely lightweight. It has a really, really fantastically good profile, but it's not zero by any means.
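
The "two proxies per hop" point can be put into back-of-the-envelope form. The Rust sketch below assumes every proxy adds the same fixed overhead, which real systems will not do exactly; the function name and the numbers are illustrative assumptions, not measurements of Linkerd or any other proxy.

```rust
use std::time::Duration;

/// Extra latency a request picks up from sidecars along a call chain,
/// assuming each hop crosses two proxies (client side and server side)
/// and each proxy adds the same fixed overhead.
fn added_proxy_latency(hops: u32, per_proxy: Duration) -> Duration {
    per_proxy * (2 * hops)
}

fn main() {
    // Hypothetical numbers: a request that crosses 3 service-to-service hops,
    // with 500 microseconds of overhead per proxy.
    // 3 hops * 2 proxies * 500 microseconds = 3 milliseconds in total.
    let extra = added_proxy_latency(3, Duration::from_micros(500));
    println!("added latency: {:?}", extra);
}
```

The overhead scales with the depth of the call chain, which is why per-proxy cost matters more in deeply nested microservice topologies than in shallow ones.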

09:48 Wesley Reisz: Yeah, totally. I've got that down. Actually, I want to talk to you a bit about, you talked earlier about all the cool kids doing certain things, you're not using Envoy. So, I wanted to dive in and specifically talk about that, and talk a bit about the implementation with Rust, because that fascinates me. I've heard, at least you can verify, but Linkerd is one of the most, if not the most performant of any of the service meshes that at least I'm tracking.

10:10 William Morgan: I believe that's the case. Benchmarking any piece of software is an art form, as well as a science. We do our own internal benchmarks, which gives me confidence in that answer. But there's always situations where you can find a particular scenario where something else is better, but I do believe that overall, yeah, Linkerd is the most performant, and has the least resource consumption. Which, that's all stuff that I'm very proud of. Although honestly, again, I would sacrifice that in the interest of human-facing simplicity. Happily, we haven't really had to make that sacrifice, but I would.

10:44 Wesley Reisz: As I was kind of doing a little bit of research prepping for this, I read a blog, I think... and I wish I had captured better notes, so I could quote the blog. But I think it talked about Linkerd as being a service mesh that was relentlessly focused on simplicity and user-friendliness. Is that a design principle of Linkerd?

11:01 William Morgan: I love that blog post already, did I write that?

11:04 Wesley Reisz: You may have.

11:06 William Morgan: Yeah, that's 100% right. That has been a design principle, I want to say since the beginning, but really since the kind of 2.X rewrite. Because we went through this transformation from the 1.X branch, which was quite different to 2.X. But in the modern Linkerd yeah, 100%. Our focus first and foremost is on simplicity and on reducing the human cost to operating a service mesh. Because we have this fundamental belief that the service mesh doesn't have to be complicated. Which makes Linkerd a weird project because that whole space, the whole term service mesh now is so synonymous with complexity and weirdness. Here we are trying our best to fight against that. It doesn't have to be that way. I think what we've demonstrated with Linkerd it doesn't actually have to be complex.

11:44 Can you retrace how the current version of Linkerd came about?

11:44 Wesley Reisz: Yeah, very nice. I want to talk a bit about security, but before we do that, you mentioned a little bit before about libraries and things being maybe in this network mesh as well. You also kind of just mentioned some transformation from Linkerd 1.0, to Conduit, to 2.0. So I kind of wanted to spend a minute and go back and just talk about where Linkerd came from, all the way back from the Finagle days forward. Thought we might trace a little bit on that. Could you talk a little bit about where Linkerd came from, kind of its origin story?

12:17 William Morgan: Oh, you want the ancient history. Well, again, not tooting my own horn, but I have a very nice article up on InfoQ about this.

12:23 Wesley Reisz: Yeah, I can... is that where that is?

12:26 William Morgan: Uh-huh (affirmative). Yeah, yeah.

12:26 Wesley Reisz: Well, the reason I want you to go through it though, is because I think in a lot of ways the stages that you're about to talk about show a lot about the journey from libraries to what is now current modern service mesh.

12:37 William Morgan: Yeah. The short story is Linkerd 1.X, which maybe I'll call ancient Linkerd, although it still powers some pretty huge deployments around the world. It came out of this project that we were very familiar with at Twitter, which is where Oliver and myself, kind of the initial Linkerd creators, worked. We came out of the project called Finagle, which you mentioned. Finagle was a Scala library that was powering, and I think continues to power, Twitter's infrastructure. The very first version of Linkerd was literally Finagle, just wrapped up in proxy form. So Finagle has all these beautiful programming idioms about doing functional programming and RPC calls. We just threw that all away. We said, just give us the operational side of things. So, it's the retries, and the timeouts, and the load balancing and so on. So that first Linkerd, it was just that: Finagle in a box.

13:25 Wesley Reisz: But Finagle was a library that was included with the services at Twitter?

13:30 William Morgan: Mm-hmm (affirmative). Yeah, that's right. That's right. There was no sidecar proxy. Or even that wasn't really an idea that was too much in our heads when we were at Twitter. Because Twitter had just mandated like, "Hey, guess what if you're writing a service, it's going to be on the JVM."

13:42 Wesley Reisz: But that speaks to that move of taking things out of the application and into this other service layer?

13:48 William Morgan: Yeah, that's right. I mean, Scala Library, for doing like functional programming and RPC calls... It's like the audience who can make use of that directly, it's pretty limited. But once you wrap it up in a proxy... and especially, the thing that really happened at that point was the rise of Docker and containers. All of a sudden it didn't matter what this thing was written in. You just stick it in a container and you stick it next to your application. Then the rise of Kubernetes, and now we've got these nice networking controls where we can just transparently wire stuff up. All of a sudden you've got this way of doing... what's effectively like runtime binding of functionality. That's what ended up really driving momentum on Linkerd.

14:27 Wesley Reisz: So then there was Linkerd, the classic version, as you kind of mentioned... I don't think you used the word classic, but the classic version you mentioned. Then Conduit came about, and then the two merged into Linkerd. What happened there?

14:41 William Morgan: Yeah. I feel like Obi-Wan Kenobi, I'm like, "Oh, that's the name I have not heard in a long time." So yeah, around the time that Linkerd, the 1.X Finagle-based version, was really taking off, we were already aware that the JVM was awesome at scaling up, but it was very poor at scaling down. So we'd put a lot of engineering resources into it, but we couldn't really get the proxy itself to get under 150 MB, maybe under 120. So that was okay for people who were running these giant three-gig JVM apps. But there was an audience who were writing their microservices in Go, and they were like 50 MB. You couldn't really ask them to run this 150 MB proxy next to their 50 MB service instances and be like, "Hey, it's a transparent proxy. You don't have to worry about it." So we knew we had to rewrite it.

15:25 William Morgan: Then the other thing, maybe even more profound than that was that having brought that thing into the world, we saw how people were struggling to adopt it. I mean, it was in production use, but it was much harder to get there than it should be. In part, because I think what we had done is we've basically taken every possible Finagle feature that we had our hands on and just exposed it as like "Here's a giant yaml file." So when we set about to rewrite this thing... And by the way, you should never do this, right? It's like second system syndrome. You can literally look it up on Wikipedia. It's like, "Here's what you should never do. Take a functioning thing and then rewrite it from scratch. There's no way that will ever work."

15:59 William Morgan: So like idiots, we did it anyway, and I guess we happened to make it work. But we knew that there were two things we had to fix. Number one, we had to get off the JVM, that was not a path forward for us. Then number two, we had to make it simple. We had to make it something that could actually be adopted in the order of minutes, rather than the order of months. So all of those design principles went into what is the modern version of Linkerd. I think we're up to 2.9 as of last month. It has a control plane written in Go, kind of the lingua franca of Kubernetes, and then a data plane written in Rust.

16:34 Why does Linkerd implement its own proxy with Rust?

16:34 Wesley Reisz: So you talked a little bit before about the micro-proxy about this Rust proxy. So I'm curious about some of the design decisions that you all had when you were going about creating this. So questions that come to mind, why Rust, for example?

16:48 William Morgan: Oh, I don't know. We just picked-

16:49 Wesley Reisz: It seemed like the thing to go?

16:51 William Morgan: We read this Hacker News blog post and it sounded pretty good. So actually, we made this call in 2018, and it was pretty scary at the time. Fast forward two years, Rust has got crazy momentum and has got this very well-developed ecosystem of networking libraries. In fact, I think that most modern kind of asynchronous network programming engineering is all happening in Rust right now. So that's a really exciting ecosystem to be a part of. But in 2018, it was barely there. We were like, "Holy moly, here's this language. And if we go down this path we're going to have to do a bunch of investment into core networking libraries." It's not like we get to just wrap Finagle in a box anymore. We're going to be like figuring out how to deal with bits and bytes and move them around between thread pools or like whatever... I don't know how any of this stuff actually works. Whatever happens in that proxy.

17:40 William Morgan: But the thing that was so compelling for us about Rust was that it had two things that we... well, maybe three things that we really looked for, and the third was maybe less important. So the thing was, it allowed us to write native code proxies. So we could compile these things down just like with C or C++, we can compile these things down to basically about as fast as the computer can execute them, right? There's no managed environment. Even something like Go has a managed environment, you have the garbage collector. All that stuff gets complicated when you're trying to develop these really low latency network proxies because every once in a while, the garbage collector will rear its ugly head. Then it's like, "Ah, additional, 100 milliseconds of latency.", and then your tail latency looks crappy and that's a problem.
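
To see why an occasional pause hurts tail latency much more than typical latency, consider the small Rust sketch below. It is a synthetic model, not a simulation of any real garbage collector: every request takes 1 ms, and every 50th request also eats a 100 ms stall. The function names and numbers are assumptions chosen for illustration.

```rust
/// Nearest-rank percentile over a set of latency samples (in milliseconds).
fn percentile(samples: &[u64], pct: f64) -> u64 {
    assert!(!samples.is_empty() && pct > 0.0 && pct <= 100.0);
    let mut sorted = samples.to_vec();
    sorted.sort_unstable();
    let rank = ((pct / 100.0) * sorted.len() as f64).ceil() as usize;
    sorted[rank - 1]
}

/// Synthetic latencies: every request takes `base_ms`, and every
/// `pause_every`-th request additionally stalls for `pause_ms`.
fn synthetic_latencies(n: usize, base_ms: u64, pause_every: usize, pause_ms: u64) -> Vec<u64> {
    (1..=n)
        .map(|i| if i % pause_every == 0 { base_ms + pause_ms } else { base_ms })
        .collect()
}

fn main() {
    // 1000 requests, 1 ms each, with a 100 ms stall on every 50th request.
    let samples = synthetic_latencies(1000, 1, 50, 100);
    println!("p50 = {} ms", percentile(&samples, 50.0)); // p50 = 1 ms
    println!("p99 = {} ms", percentile(&samples, 99.0)); // p99 = 101 ms
}
```

Only 2% of requests are affected, so the median does not move at all, yet the 99th percentile jumps by two orders of magnitude. That is the "tail latency looks crappy" effect described above.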

18:19 William Morgan: So we needed to have really fine-grained control of memory usage especially, so we could do allocations on a per-request basis and amortize all that stuff and have a really sharp distribution. Then the other thing that was really compelling for us, and why we ended up on Rust instead of C or C++, was the memory safety guarantees. So, the data plane of a service mesh, that's the critical component here. That's where our users, their HIPAA data or their PCI data, or their financial transactions are going through the data plane. So any kind of vulnerability there is like, it's a huge, huge problem. So what Rust allowed us to do was to sidestep this entire class of kind of endemic problems in C and C++ programs around memory access and buffer overflow exploits, and those sorts of CVEs.
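
A small example of the class of bug being described: parsing a length-prefixed message where the sender lies about the length. In C, trusting that length can read past the end of the buffer. The Rust sketch below uses a hypothetical frame format (one length byte followed by the payload) that is not taken from linkerd2-proxy; it shows how safe Rust's checked slice access turns the over-read into an ordinary `None`.

```rust
/// Parse a hypothetical length-prefixed frame: one length byte, then payload.
/// Returns `None` instead of reading past the end of the buffer.
fn read_frame(buf: &[u8]) -> Option<&[u8]> {
    // `split_first` returns None for an empty buffer.
    let (&len, rest) = buf.split_first()?;
    // `get` is bounds-checked: a length larger than `rest` yields None.
    rest.get(..len as usize)
}

fn main() {
    let honest = [3u8, b'a', b'b', b'c'];
    let lying = [200u8, b'a', b'b', b'c']; // claims 200 bytes, carries 3

    println!("{:?}", read_frame(&honest));
    println!("{:?}", read_frame(&lying));
}
```

Plain indexing such as `rest[..len as usize]` would also be memory safe here: it would panic rather than read out of bounds. Using `get` simply lets the caller handle the malformed frame without a panic.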

19:07 William Morgan: So it was really the right combination of those two things. It was also kind of a nice language for us coming to it from Scala. I mean it had zero cost abstraction. So we could write in these higher order ways and still have them compiled down to zero cost stuff. But yeah, all of that at the time it seemed like a really risky decision, and it was a risky decision. It kind of paid off in retrospect, but it was a little scary.
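
"Zero-cost abstraction" here refers to Rust's goal that higher-level constructs such as iterator chains and closures compile to code comparable to a hand-written loop. The Rust sketch below only demonstrates that the two styles compute the same answer; whether the generated machine code matches depends on the compiler and optimization level. The function names and the status-code example are illustrative assumptions.

```rust
/// Count successful (2xx) responses using iterator adapters and a closure.
fn count_success_iter(statuses: &[u16]) -> usize {
    statuses
        .iter()
        .filter(|&&s| (200..300).contains(&s))
        .count()
}

/// The same logic written as an explicit index loop.
fn count_success_loop(statuses: &[u16]) -> usize {
    let mut n = 0;
    for i in 0..statuses.len() {
        if statuses[i] >= 200 && statuses[i] < 300 {
            n += 1;
        }
    }
    n
}

fn main() {
    let statuses = [200u16, 503, 204, 404, 201];
    println!("iterator version: {}", count_success_iter(&statuses));
    println!("loop version:     {}", count_success_loop(&statuses));
}
```

For someone coming from Scala, the first form is the familiar one. The appeal is being able to write that way without paying for a managed runtime.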

19:26 Why even write your own proxy in the first place?

19:26 Wesley Reisz: Earlier, when we were talking, you talked about where you want to spend your innovation token, so to speak. Is your core competency building this plumbing between services, or is it in the business logic of your application? In some ways I could ask the same question with building a proxy. So the micro-proxy that you built is incredibly performant, it's very secure. All those are facts. However, what was the reason that you all chose to build that rather than doing pretty much like everybody else and using Envoy? What was the business reason, I guess, that made you all want to do that?

19:59 William Morgan: Yeah. That's a great question, and Linkerd is a weird project. Well, weird sounds bad. It's unique. It's very unique in the service mesh space. A lot of these projects... I know you said we shouldn't compare them, but they kind of start to blend together. They all kind of feel pretty similar, and Linkerd is out here with a very different approach to a lot of this. Primarily that approach and our ability as a project to be so much simpler, and faster and lighter, I would argue to have the best security kind of foundations are due to that investment in our micro-proxy. The reason we call it a micro-proxy is because it is not like Envoy. Envoy is a general purpose kind of Swiss army knife. You can use it for all sorts of things. You can use it at the edge, you can use it in sidecars. You can use it in a central proxy... I don't know, you can do 100 different things with it. Because of that, it's complex. Right?

20:47 William Morgan: That's not a knock on Envoy, it does a lot of different stuff, and it has a lot of different features. Linkerd's little proxy, which we just call linkerd2-proxy, a very boring name, doesn't do anything except act as a service mesh sidecar. So that allows us to strip out all the complexity around all those use cases that we don't need, and to be really, really focused on just serving the kind of minimum that we absolutely need to do. Because remember these proxies are inserted everywhere. You're getting two of them between every single hop. So every byte we can shave off there, every millisecond we can reduce from the latency, it's really, really important.

21:20 William Morgan: So I would argue that yeah, 100%, this is part of our core value prop: we are adding these proxies everywhere. It's kind of this invasive thing. We want to make sure that they are not just fast and they're not just light, but that they are built in this very secure framework so that you can do this very... It's a scary thing to add a service mesh to your system. You are putting it in this very sensitive part. You are vulnerable to what this thing does. Every single request that's happening in your application is going through these proxies. So we really want to give you something that you can have a lot of confidence in.

21:56 Wesley Reisz: Yeah, totally. It totally makes sense. That is your core competency. I do have one question on it, and that is what does the community look like around it specifically? Obviously, there's a huge community around Envoy and Nginx those communities. What about the micro-proxy here? What does that community look like? Am I at risk because there's maybe not as large a community for it?

22:16 William Morgan: Yeah. So the community around linkerd2-proxy is the Linkerd community. It's not a separate thing. It's not designed to be reusable. It's not designed to be like a separate part. It's in the same repo that goes through the same security audits that the CNCF funds, everything's done in the open. It's like a CNCF project. So we try and make it as transparent as possible, but it doesn't have its own community.

22:43 William Morgan: What it does have though is simplicity. I actually wrote a blog post about this recently too. Sorry to keep referencing my own materials here. But I wrote a blog post called simply "Why Linkerd Doesn't Use Envoy." I go through kind of the play-by-play of each of those decisions. It's a question that we get asked a lot because Linkerd is unique in that it doesn't... But one of the things I look at is just kind of the size of the code bases. Linkerd2-proxy is something like a fifth the size of Envoy. Which is not a moral judgment again, but it's just a very different sort of project.

23:16 You use Rust at the data plane, what made you choose Go for the control plane?

23:16 Wesley Reisz: So I know I've asked this question to Oliver before, but I'm going to ask it again because it's in my mind all over again. But for all the reasons that you just mentioned at the data plane, why Go at the control plane?

23:26 William Morgan: Yeah. Well, because the requirements are actually pretty different there, right? The data plane, we want to be as fast as possible. We want to be as small as possible. We care about every millisecond and every byte used. The control plane is much more relaxed about those requirements. The control plane sits off to the side, it's not in the data path. What we care about there primarily is, can we interact with the rest of Kubernetes in a really nice way? So being in Go meant that we could use the Kubernetes libraries directly, and that ecosystem was already there. It's also an open source project, we wanted to attract as many contributions as possible. We wanted to make it welcoming and friendly for everyone, and Go is a relatively easy language to pick up. So that part was attractive too. So with different requirements, those two components ended up with different languages for us.

24:14 How can the attack vectors change when a system is using a service mesh?

24:14 Wesley Reisz: Yeah. Very nice. Okay. So I want to switch gears just a bit and talk a bit about security in the context of service meshes, in general and then Linkerd in specific. So I believe with the latest versions, 2.9.1, I think with 2.9 you introduced some new security constructs, some zero trust configurations and things like that. So I thought it might be a nice space to talk. So I guess to first start off with when you talk about security in a service mesh, how does the, I guess, attack vector change? Does it change or is it the same?

24:48 William Morgan: Yeah. So, I'll do my best to answer this. I'll put a big asterisk, which I always do when I'm asked this question, which is, I'm not a security expert. I know what they tell me. I try and reason about it as best I can, but I am ultimately a little bit decoupled from the ground truth here. But what I know is that there's two things that we really focus on in Linkerd. The first thing, which I think is hugely important, is actually not making the system worse. So whenever you add a component to a system, if you're actually introducing a vulnerability somewhere in there, if you're making it harder for the overall environment to be secure, then you've made things worse. It doesn't matter what kind of fancy features you've added, if you have made things worse over here, then the whole chain is only as strong as its weakest link. So a lot of what we do in Linkerd is purely trying to not make things worse.

25:38 William Morgan: That ranges all the way from making sure that we have a data plane that's written in Rust, and has as many kinds of security guarantees as we can around your critical sensitive data, all the way to keeping the system as a whole really simple, so that when you do add these features, the poor human being who actually has to configure this can build a mental model of how this thing works, can reason about it, and can be presented with a set of intelligent defaults. What we wanted to avoid was adding a lot of complexity to the system. Especially around the security features, where all of a sudden now it's hard to reason about, it's easy to make mistakes. Or maybe just so difficult that you never enable that stuff. So we wanted to minimize that and to make it so that whatever we did... if we did nothing else, let's not make the system worse. Let's not make it more insecure, or less secure.

26:25 William Morgan: Then in terms of the actual features, the big one for Linkerd, I think, and for most service meshes, is around mutual TLS. Which means that as Service A talks to Service B over the network within a cluster, or even, as of 2.8 I believe, across clusters, Linkerd will initiate and terminate TLS and verify the identity on both sides, which makes it mutual TLS, and will kind of handle all of that transparently. So the actual communication that's happening is not just encrypted, but we've kind of authenticated both sides, so that A knows it was talking to B, and B knows it was talking to A. Now what's the actual attack vector that that protects against? There's some. There's some, but it's not like a panacea, not by any means.

27:08 William Morgan: If someone still has access to the host, if someone pops the host and gets root on a node somewhere in the cluster, it's not like this stuff really helps you. You can inspect the memory. There's a variety of other things you can do. So really what I think the value of that is, is around having a very straightforward mechanism for getting encryption in transit everywhere, as easily as you can. Then having service identity in place and enforced by the platform. I think that's the real value.
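
As a rough illustration of what "mutual" means in the exchange above, here is a minimal sketch using Python's standard `ssl` module. It is not how Linkerd's Rust proxy is implemented; the function names and the optional file-path parameters are hypothetical, chosen only for this example. The point it shows is that in ordinary TLS only the client verifies the server, whereas in mutual TLS the server also demands and verifies a certificate from the client.

```python
import ssl


def mtls_server_context(cert_file=None, key_file=None, ca_file=None):
    """Server side (hypothetical helper): present a certificate AND require one from the client."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    # A plain TLS server does not ask the client for a certificate.
    # Requiring one is the step that makes the handshake mutual.
    ctx.verify_mode = ssl.CERT_REQUIRED
    if cert_file:
        ctx.load_cert_chain(cert_file, key_file)  # this service's own identity
    if ca_file:
        ctx.load_verify_locations(ca_file)  # trust anchor used to verify the peer
    return ctx


def mtls_client_context(cert_file=None, key_file=None, ca_file=None):
    """Client side (hypothetical helper): verify the server AND present our own certificate."""
    # PROTOCOL_TLS_CLIENT already verifies the server certificate and hostname by default.
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    if cert_file:
        ctx.load_cert_chain(cert_file, key_file)  # identity shown to the server
    if ca_file:
        ctx.load_verify_locations(ca_file)
    return ctx


server_ctx = mtls_server_context()
client_ctx = mtls_client_context()
```

In a mesh, the sidecar proxies would do the equivalent of this on behalf of the application, including issuing and rotating the certificates, which is why the application code does not have to change.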

27:37 Wesley Reisz: What does it look like to enable mutual TLS using Linkerd for an operator?

27:43 William Morgan: Yeah. So you just install it, that's it.

27:47 Wesley Reisz: It's on by default?

27:48 William Morgan: Yeah, that's right. It's on by default, there's zero configuration. That was the big push for us: okay, here's this thing, we believe it has value, but if you have to configure it, if you have to do a bunch of hard work, then you're either not going to do it, or you're going to do it wrong. So it's on by default, zero config. If you're running Kubernetes today, you're five minutes away from having mutual TLS between all your TCP connections. All you have to do is install Linkerd. That was a big, big push for us. But it speaks right to our principles around not just simplicity, but around having security as part of the default features and not as a later add-on.

28:24 Wesley Reisz: Very nice. Well said.

28:25 William Morgan: The only other thing I'd really add is that security's a very broad topic and there's a whole lot more to running a secure application in Kubernetes than adding a service mesh. So that is one thing that I believe is helpful, especially when we start talking about zero trust and trying to move the security enforcement down to the most granular layer you can. But there's a lot more that you have to do that the service mesh can't help you with. So it needs to be part of a holistic strategy.

28:49 What’s next for Linkerd and the service mesh space?

28:49 Wesley Reisz: So what's next? What's next for Linkerd, and what's next for this whole service mesh industry? Where are we going now that we're all mutual TLS enabled?

28:57 William Morgan: Yeah. Gosh. For Linkerd itself, there's one kind of big feature set that we really have our eyes on in the short term. So, upcoming we've got 2.10 and 2.11. 2.10 is primarily going to be focused on minimizing the control plane even more and making it modular, so that you can install smaller and smaller, more and more stripped-down versions of Linkerd. That's important to us because we are big believers in doing the minimum amount necessary, right? It's like we don't want to have the global kitchen sink project that can solve all things for all people. We want to give you the bare minimum that you need to build a secure and reliable Kubernetes system. So after 2.10 is 2.11, and then we will get to policy, which we've been wanting to get to for a long time. Actually, it's kind of very topical to security as well.

29:43 William Morgan: So policy means right now Linkerd basically allows every request to happen. If A wants to talk to B, Linkerd will do its best to make that request happen. Once we have policy in there, we'll give you the ability as the operator to restrict things, to say "A is not allowed to talk to B," or "It's only allowed to make these calls," or "It has to satisfy these conditions," or whatever it is. Then after that, I think the remaining set of hurdles for us are really around expanding Linkerd. So can we get the micro-proxy to run outside of Kubernetes? Can we give you the ability to incorporate kind of more and more things into the same operational paradigm? That's kind of the Linkerd roadmap at a very rough sketch.
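
To make the policy idea above concrete, here is a small sketch of the kind of decision such a feature would have to make. It is purely illustrative and is not Linkerd's policy API, which was still on the roadmap at the time of this conversation. The `Policy` class, its field names, and the route strings are all hypothetical. The default mirrors the behavior described above (every request is allowed), and an operator can then deny a pair of services outright or restrict a pair to specific calls.

```python
from dataclasses import dataclass, field


@dataclass
class Policy:
    """Hypothetical service-to-service policy, keyed on (client, server) identities."""
    default_allow: bool = True  # with no rules, every request goes through
    denied: set = field(default_factory=set)  # pairs that may never talk
    allowed_routes: dict = field(default_factory=dict)  # pair -> set of permitted calls

    def permits(self, client: str, server: str, route: str) -> bool:
        pair = (client, server)
        if pair in self.denied:
            return False  # "A is not allowed to talk to B"
        routes = self.allowed_routes.get(pair)
        if routes is not None:
            return route in routes  # "it's only allowed to make these calls"
        return self.default_allow


# No rules configured: everything is allowed.
open_mesh = Policy()

# Operator-defined restrictions (service names and routes are made up).
restricted = Policy(
    denied={("A", "B")},
    allowed_routes={("C", "B"): {"GET /health"}},
)
```

A check like this only means something if the client and server names are trustworthy, which is where the identities established by mutual TLS come in.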

30:22 Wesley Reisz: Understood. All right, and as we're about to wrap up, just to give you a little bit of a softball pitch: I have heard of Buoyant Cloud, a commercial product I believe, but what's that all about?

30:30 William Morgan: Yeah. So this is tied to what we believe kind of the future of the service mesh is. Which is to become really boring, so that what we can start focusing our time and energy on is what sorts of things we can build on top of the service mesh. So Buoyant Cloud builds on top of Linkerd to provide a dashboard for platform owners, the same audience who are adopting Linkerd. It allows them to solve kind of the rest of the story, right? So the service mesh here is playing a role. It's providing metrics, it's providing kind of mutual TLS. Buoyant Cloud is then tying that to everything else that's happening in your Kubernetes cluster. So that's an example of the sort of thing that I think has got to be the future of the service mesh. We've got to make the infrastructure itself very, very boring so that we can get back to the work that we actually want to do, which is building these platforms that are reliable, and safe, and resilient, and flexible, that we can then launch our business logic on top of.

31:24 Wesley Reisz: All right, William. Thanks for joining me on the InfoQ Podcast.

31:27 William Morgan: Thanks, Wes. It's great to be here.

More about our podcasts

You can keep up-to-date with the podcasts via our RSS Feed, and they are available via SoundCloud, Apple Podcasts, Spotify, Overcast and Google Podcasts. From this page you also have access to our recorded show notes. They all have clickable links that will take you directly to that part of the audio.