
Michelle Noorali on the Service Mesh Interface Spec and Open Service Mesh Project

In this podcast, Michelle Noorali, senior software engineer at Microsoft, sat down with InfoQ podcast co-host Daniel Bryant. Topics discussed included: the service mesh interface (SMI) spec, the open service mesh (OSM) project, and the future of application development on Kubernetes.

Key Takeaways

  • The service mesh interface (SMI) specification provides a standard interface for service meshes on Kubernetes. SMI includes definitions for traffic management, traffic policy, and traffic telemetry.
  • The SMI spec is aimed at both platform builders/operators and application developers. Platform operators can provide sensible service-to-service communication defaults, and developers can specify service/application release configuration, such as canary releases.
  • Open Service Mesh (OSM) is a lightweight and extensible cloud native service mesh. Built on the CNCF Envoy proxy, OSM implements SMI with the primary goals of securing and managing microservice applications.
  • A common use case for adopting service mesh technology is to enable transport security and verification of service-based identities (e.g. via mTLS and SPIFFE).
  • Although functionality provided by a service mesh will most likely not be added to the core Kubernetes project, it is likely that service mesh implementations will be provided within Kubernetes distributions (distros).

Transcript

00:05 Daniel Bryant: Hello, and welcome to the InfoQ podcast. I'm Daniel Bryant, news manager here at InfoQ and product architect at Ambassador Labs. I recently had the pleasure of sitting down with Michelle Noorali, senior software engineer at Microsoft. I've been following Michelle's work for a number of years, from her time at Engine Yard, where she worked on platform-as-a-service technologies, to her work on the Helm Kubernetes package manager, and now on to her work in the service mesh community.

00:26 Daniel Bryant: Michelle has been influential in the creation of the service mesh interface specification, commonly referred to as the SMI spec, and she is a strong community advocate for this and many other CNCF projects. I was keen to explore the current developments in the SMI space and learn more about the recently launched, Microsoft-backed Open Service Mesh (OSM) project. I also wanted to get Michelle's opinion on topics such as multi-cluster networking and the future of service mesh. Hello, Michelle, and welcome to the InfoQ podcast.

00:53 Michelle Noorali: Thank you so much for having me.

00:55 Introductions

00:55 Daniel Bryant: Yeah, thank you for joining us today. Could you briefly introduce yourself to the listeners, please?

00:59 Michelle Noorali: Hi, I'm Michelle. I'm a software engineer at Microsoft. I've been working on Kubernetes and container-related tooling for five-ish years now. A lot of the work that I do has been in open source and part of the Cloud Native Computing Foundation, CNCF. So I've been a core maintainer on projects like Helm, Draft, and CNAB, Cloud Native Application Bundles, and now I mostly work on service mesh stuff. So in the service mesh space, I work on the Service Mesh Interface, SMI, I know we'll talk about that a little bit today, and Open Service Mesh, OSM, as well. In my spare time, I also serve on the Technical Oversight Committee of the CNCF and I'm on the Governing Board, where I serve as a developer representative. So I'm very involved in the foundation.

01:42 Daniel Bryant: You sound very busy, Michelle, too, right?

01:44 Michelle Noorali: Sometimes it does get that way.

01:46 Could you explain the motivations for the service mesh interface (SMI) spec, please?

01:46 Daniel Bryant: So I've followed your work for quite a while now, as we were saying off mic. And I've followed the Helm work, but I'm definitely most interested to chat around the service mesh space, I think that's really cool. So I know you've been working on the service mesh interface (SMI) spec quite a bit recently. Could you kind of explain the motivations for the project and how you came to be working on this as well, please?

02:04 Michelle Noorali: So a few years back, as you probably remember, the service mesh was introduced, and when it was first introduced the term and the technology really resonated and caught on with a lot of people. And so there were lots of folks who were excited about it and it seemed like everybody wanted one, but they were confused about whether they needed one, and there wasn't a lot of consensus around what features actually make up a service mesh. So William Morgan of Buoyant did a great job of coining the term. He basically defined it as a dedicated layer of infrastructure that deals with or handles service-to-service communication. So another way to say the same thing is essentially if you have a large environment with a lot of microservices, it's highly dynamic, things are changing all the time. You might want some features around observability and control around the networking related to your applications.

02:56 Michelle Noorali: So security features, things like that, management of deployments, things like that. So that's when you would use a service mesh. And in general, for the most part, service mesh implementations use the sidecar proxy approach. So you just take all of that application network-related logic and you plop it into a sidecar next to your application, and you basically have this control plane running that manages all your proxies next to your applications. And then that takes care of all the features that you want, the observability and the security features. Going back to SMI, there wasn't a lot of consensus around what features a service mesh provided at the time. So a bunch of folks in the community, Microsoft, the Buoyant and Linkerd folks, the HashiCorp and Consul folks, folks from Meshery and Solo, we all got together and basically created this list of the top, most requested features that people actually want from a service mesh.
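
For context, the sidecar proxy approach described above usually looks something like the pod sketch below. The container names, image tag, and port are illustrative rather than taken from any particular mesh, and in practice the control plane injects the proxy container automatically rather than you writing it by hand.

    apiVersion: v1
    kind: Pod
    metadata:
      name: orders
    spec:
      containers:
      - name: orders                      # the application container
        image: example.com/orders:1.0     # illustrative application image
        ports:
        - containerPort: 8080
      - name: envoy-sidecar               # proxy container injected by the mesh control plane
        image: envoyproxy/envoy:v1.14.1   # illustrative proxy image and version
        # inbound and outbound traffic for the pod is redirected through this proxy,
        # which the control plane configures to provide mTLS, routing and telemetry
        # without any changes to the application container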

03:51 Michelle Noorali: So when you say, if you want a service mesh, these are the things you probably want, and they are traffic management and control, they are observability, they are access control, so managing which services can communicate with each other. And then this group of people defined the APIs that folks can use to implement those features in a service mesh or build against a service mesh using those features. So that's what SMI ended up being: a spec, a list of APIs that you can use to do observability and Canary deployments and access control for services running as part of your service mesh. And it becomes this standard that tooling can build against and people can consistently rely on regardless of what service mesh you end up using. Long definition and intro, but I hope that kind of gives you an idea of what SMI is.
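
To map the features Michelle lists onto the spec itself, the SMI APIs are split across a handful of Kubernetes API groups. The grouping below follows the published spec; the exact kinds and API versions have evolved between SMI releases.

    # split.smi-spec.io    - TrafficSplit              (traffic management, canary releases)
    # access.smi-spec.io   - TrafficTarget             (access control between services)
    # specs.smi-spec.io    - HTTPRouteGroup, TCPRoute  (route definitions referenced by access control)
    # metrics.smi-spec.io  - TrafficMetrics            (traffic telemetry / golden metrics)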

04:42 What has been happening with SMI in the 18 months since it was created?

04:42 Daniel Bryant: Super useful Michelle, because I think level setting in this space is really important. In the service mesh space, I kind of keep up to date with it quite a lot, but it's still quite challenging at times to figure out all the stuff that's going on, right? There are a lot of moving parts in there. So I think that was brilliant. And on that note, actually, the SMI spec has been moving on quite a bit this year. I see it was accepted into the CNCF Sandbox project space. So what's been happening? Because it's been, I think, about 18 months since the SMI spec was created?

05:08 Michelle Noorali: So SMI was created, and it was created with a bunch of different folks from different organizations, but going into the CNCF, I think we got in maybe in April as a Sandbox project. And that was just really nice because then it became a project that is now in a vendor-neutral IP space. So legally the barrier to entry is reduced for people who can't necessarily contribute to projects that are owned by a particular company or in a particular company's IP space. So that was really nice. I think that's one of the benefits, and we've been seeing more and more folks join the community from different companies. We've been seeing folks from Red Hat and a bunch of other companies that design service mesh implementations. So I think it definitely has opened up that community, and then what's also really nice is, being in the CNCF is kind of helping us figure out, okay, what other projects can we coordinate with? What pieces make sense to fit with other projects? So those are still early days and early talks, but that's what's going on. We're doing more frequent releases and just getting more people involved and more eyes on the project. So lots of contribution and lots of changes, stuff is happening and all exciting things.

06:20 What integrations will SMI have with the wider CNCF landscape?

06:20 Daniel Bryant: So I was keen to dive a little deeper into your mention of integration points there, because when I look at the CNCF landscape, I do sometimes get overwhelmed, as I'm sure many folks do. Are there specific integrations that you're going to be targeting or is it all work in progress or secret, for example?

06:33 Michelle Noorali: No, nothing is secret. Everything's recorded and posted, but that's the beautiful thing about being in the CNCF and also even creating specs, so we're not the only spec in this space. The idea is there are so many components, just take a look at the landscape, and people need a way to mix and match components and make things fit. And so these interface projects are really... They're not glamorous, I would say, but I think they're very important and they're good building blocks. So hopefully there can be some collaboration and again, very early talks, but in the metrics space, we're trying to figure out, okay, is there any way to collaborate with OpenTelemetry? I don't actually know the answer to that question, but maybe, and there are new things happening with the ingress resource in Kubernetes. A service mesh really deals with east-west traffic, so what would we want to say in terms of north-south traffic?

07:24 Michelle Noorali: How do we want to deal with that? Is there anything the spec wants to define there? And in terms of traffic policy, like access control-related things, how can we integrate with different types of workload identity? So right now, kind of getting into the weeds, when you define which service can talk to another service in an access control policy, which is part of the SMI spec, you have to tie your workload to some workload identity, which defines what the workload can and can't do. And essentially that spec defines what sources can communicate with what destinations, and it does that by saying that the pods with these service accounts can communicate with the pods that are attached to other service accounts. So that's one form of identity for a workload, but there are lots of other ways to do identity in different cloud providers and things like that. And then SPIFFE is another way to identify a workload. So these are very early conversations and we're still trying to figure out what makes sense, but those are kind of the areas that we could tease out a little bit more and kind of figure out how we integrate with other projects.
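
As a sketch of the service-account-based model described above, an SMI TrafficTarget names a destination service account and the source service accounts that are allowed to call it, optionally scoped to a set of routes. The resource names, namespaces and route group below are illustrative, and the API versions shown have changed across SMI releases.

    apiVersion: specs.smi-spec.io/v1alpha4
    kind: HTTPRouteGroup
    metadata:
      name: api-routes
      namespace: default
    spec:
      matches:
      - name: all-api
        pathRegex: "/api/.*"
        methods: ["*"]
    ---
    apiVersion: access.smi-spec.io/v1alpha3
    kind: TrafficTarget
    metadata:
      name: web-to-api
      namespace: default
    spec:
      destination:
        kind: ServiceAccount
        name: api            # pods running under this service account may be called...
        namespace: default
      sources:
      - kind: ServiceAccount
        name: web            # ...by pods running under this service account
        namespace: default
      rules:
      - kind: HTTPRouteGroup
        name: api-routes     # ...and only on the routes defined above
        matches:
        - all-api

With a policy like this in place, a mesh that enforces SMI access control will reject traffic to the api service account from any workload identity other than web.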

08:25 Who is the SMI spec primarily aimed at? Is it platform operators or application developers?

08:25 Daniel Bryant: Who is the SMI spec primarily aimed at? Is it platform folks or is it more app developers?

08:32 Michelle Noorali: It's so funny you ask that, because I think that's something that's been brought up in the community and we want to define basically the personas and the perspectives that use the project. So that's still in progress formally, but just in my opinion, I think it kind of is both and it'll change. So right now, generally, for the most part, if you want to do a Canary deployment, we have a traffic split resource that you can go apply. SMI has some CRDs and you can define what custom resources you want to use, and then you can apply those in your cluster, and as long as your service mesh implements those, then you can get something like traffic splitting. So right now a service owner might actually create that resource, right? Depending on who has access to the Kubernetes cluster. But I do feel that as more tooling comes along and is built on top of service mesh, then these things will be more like underlying components and things that just the cluster admin sets up, and then tooling on top of that might go and do all the coordination needed to build the actual resource underneath. But at the moment it's geared towards both the end app developer as well as the cluster admin and the platform developer and all parties involved.
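
As a concrete example of the traffic split resource mentioned here, a canary release in SMI is expressed by applying something like the following. The service names, weights and API version are illustrative.

    apiVersion: split.smi-spec.io/v1alpha2
    kind: TrafficSplit
    metadata:
      name: checkout-canary
      namespace: shop
    spec:
      service: checkout          # the root service that clients address
      backends:
      - service: checkout-v1     # current stable version
        weight: 90
      - service: checkout-v2     # canary version
        weight: 10

A mesh that implements the split API then routes roughly 90% of requests for checkout to checkout-v1 and 10% to checkout-v2, and rolling the canary forward is just a matter of adjusting the weights, either by hand or via an automation tool.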

09:46 How do engineers configure SMI? Is there lots of YAML to write?

09:46 Daniel Bryant: Yeah, I think that's the general vibe, right? With Kubernetes at the moment, because I followed it like you from a very early stage of the project, and we've got used to writing YAML. Yeah, we're cool. We may not like it, but we're cool with it, right? Whereas I think a lot of folks in the enterprise are like, " ..., I come from a Java background." People are kind of still a bit unsure of all the XML they've had to write in the past, and now it's YAML instead. Yeah. So as folks are adapting, and we mentioned custom resources that define things like traffic policy and telemetry, and you mentioned traffic management, how would app developers use these things? Like what kinds of things are they thinking about when they are defining, say, traffic policies or traffic management?

10:26 Michelle Noorali: By the way, I have hope on the YAML thing. I've been seeing the rise of these operational UIs in the CNCF and around the CNCF. Spotify has a thing called backstage.io and Lyft has a thing called Clutch, and I really have hope that we won't have to do that much YAML soon, but it's happening. But essentially what folks really need to think about is what features and functionality they need. Some people look to a service mesh and they're like, "I just need mTLS, because there's some regulatory requirement that I need mTLS for and it's a non-starter if I don't have that." Or, "I need to have it by some year, X year." And so you might look to a service mesh for that, or if you want observability in a specific way. So like SMI says, not only can you get golden metrics, so like latency, error count, success count, things like that. Not only can you get those for your individual resources and groups of pods, or resources like whole namespaces or deployments, but you may also want edge metrics.

11:30 Michelle Noorali: So like you want to get golden metrics from between all those services running in one namespace and another namespace, or between one deployment and another deployment. So if you want things that granular and in that format, you might look at SMI metrics, or people who implement SMI metrics. If you want to do a Canary deployment, you might look at using traffic split, as long as your service mesh implementation implements traffic split. So you can just define what service your requests are going to, and then you can say there are different backends you want those requests to go to, and when you have certain percentages you want to route to a particular backend, you might define a traffic split.
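
The edge metrics described above are exposed in SMI as read-only TrafficMetrics resources, served through a metrics APIService rather than applied by users. The sketch below follows the general shape of the example in the SMI metrics proposal, with illustrative resource names and values.

    apiVersion: metrics.smi-spec.io/v1alpha1
    kind: TrafficMetrics
    resource:                      # the workload these metrics describe
      kind: Deployment
      name: checkout
      namespace: shop
    edge:                          # the other side of the edge being measured
      direction: to
      resource:
        kind: Deployment
        name: payments
        namespace: shop
    timestamp: 2020-09-01T00:00:00Z
    window: 30s
    metrics:
    - name: p99_response_latency
      unit: ms
      value: 12
    - name: success_count
      value: 940
    - name: failure_count
      value: 3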

12:12 Michelle Noorali: And then access control is... I think I've mentioned a few times, which services can talk to which services. And so I think the idea is you got to figure out what thing you want, and sometimes it's not that everybody wants everything, sometimes just one thing is enough. And I think this probably resonates with you because you've been in the community for so long. Since day one, folks have been asking, how do I do a Canary deployment in Kubernetes? Or how do I do AB testing, and things like that? And so this is a good abstraction to use if you're trying to do those kinds of things.

12:41 Daniel Bryant: Perfect, yeah. I remember Kelsey Hightower in particular doing a number of A/B traffic demos, even like a long time ago. It seems like so long. We've got solutions, but none of them are quite a perfect fit, as you've mentioned, at sort of the UI control level, because I think a lot of developers are more comfortable with sort of setting up just the UI, saying, "Canary this over like 12 hours." At the moment, I think the reality is, in my experience, we're mainly doing these things at the CLI. Maybe using something like Flux, there are some cool tools out there to automate it, but it's still a bit more low level compared to UI-driven stuff, right?

13:15 Michelle Noorali: Yeah, it's just been so long. It feels like so long, but I do feel like we're building these really solid, modular interoperable components. So we're building the pieces to get up to that UI level, and we can see it and I can see the sun through the clouds or whatever the expression is. So it's almost there, I have hope. I'm hanging on.

13:36 Will projects/vendors implement all of the SMI spec? And do you think individual service mesh projects will provide functionality above and beyond the base spec?

13:36 Daniel Bryant: How do you think the standards will be adopted? Because one thing I've seen in my past with standards is it becomes a bit of a sort of race where each vendor is trying to out-compete others. So they may implement the kind of core, sometimes a slightly derogatory term is “lowest common denominator”, but they implement the basic standards, but they're always looking to add extra things on. What do you think is going to be the case? Are folks going to implement all of SMI, and do you think the service mesh vendors will also try and sneak in extra value adds as well?

14:04 Michelle Noorali: That's a great thing to hit on, and there's a few parts to that question, actually. I do feel like it's a hard thing building the set of APIs for multiple implementations, because it's not like everybody's just building on one type of proxy. There are different types of proxies under the hood. What I am hopeful about is, I love seeing that there are implementations out there that were never part of the community, but they just built against SMI, and we just found out about them pretty organically. I was like, "That's a really good sign." That means that we built something easy enough for folks to understand, and it resonates with even implementation builders that they can go and implement that, and that must be of some value to end users. So that's really nice, or to tooling adopters.

14:46 Michelle Noorali: So that's nice. I don't know, to be honest, if everyone will adopt all of the APIs. I think we're still figuring out what needs to be done there, and it's still evolving pretty quickly, but if an implementation implements a certain version, that version is frozen so they can build on top of it and not break things, which is nice. I don't see people adopting all components. I think they'll just adopt the thing that is really useful to them, and there's actually a conformance test that's being built right now. The folks at Meshery are working really hard on that, and also folks in the CNCF networking SIG. So they're building this conformance tool that allows you to run tests against an implementation to see if it actually conforms to the APIs. And it's like, yeah, they have different suites for the different APIs, so you can pass traffic split and not pass access control.

15:41 Michelle Noorali: And that's okay, because you may just implement traffic split, and that's fine. And I think that that's a conversation we have ongoing in the community as well. I haven't seen any discomfort with the fact that an implementation might not implement all of the components. The other piece, though, that you touched on was around extensibility. And we are still trying to figure out what that story looks like; there are implementations that do a lot more than SMI does. In fact, SMI actually falls behind most of the implementations. It really looks to see what things the implementations have in common and then builds an API against that, rather than spec-driven development, which we don't necessarily do. So that's, I think, an interesting way to go about it as well. I think extensibility is something to look forward to.

16:28 Michelle Noorali: People are going to need a lot more functionality, so we're working on OSM, Open Service Mesh, and we implement SMI. Microsoft started that project, we're donating it to the CNCF as well, and we implement SMI, but we also wanted circuit breaking. So we have a CRD that does circuit breaking, but we also need to figure out on our end, what does the extensibility story look like? How do you do more complex tasks in a way that's complementary to SMI rather than it being one or the other? So we'll figure it out, and as we figure it out and test new things and experiment, we'll bring those learnings back to SMI, because at the end of the day, we want SMI to kind of hold all of those patterns.

17:07 Could you introduce the Open Service Mesh (OSM) project, please?

17:07 Daniel Bryant: Awesome stuff, Michelle. That sounds all super interesting, and you touched on my next question actually, which is perfect. I was curious about Open Service Mesh, because it's an Envoy-based service mesh, if I'm understanding correctly. I saw it popping up a few months back, I think it was, and I said, "Ooh, this is super interesting. SMI compliant, interesting people involved." Could you give us sort of the background as to where that popped up from and what the motivations for it were?

17:28 Michelle Noorali: Yeah, the one-liner, what we put on the marketing stuff for OSM, is: it's a lightweight, extensible cloud-native service mesh, and it helps you manage, secure and observe microservices in highly dynamic environments. It's SMI-native, though it doesn't completely implement SMI just yet. It doesn't implement SMI metrics, we're working on some features around there, but it does have metrics in it. The goal is to fully implement SMI. It runs on Kubernetes, and like you mentioned, it does take the sidecar-based approach and we do build on the Envoy proxy. So the basic features that we have built out so far are traffic encryption, so you get mTLS between services, traffic shifting, so you can do Canary deployments and we're working on making sure you can do AB testing as well, access control, and getting those golden metrics, traffic metrics. So those are the basic features, and these days mutual TLS is all the rage, so I think that's something that we really wanted to make sure that we had in there from the get-go. The motivation was just simply, we wanted to have our own implementation that we can help our customers with, as well as contribute back to SMI with.
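
As a very rough sketch of how a namespace is brought into an OSM mesh so that new pods get the Envoy sidecar injected: the osm CLI normally does this for you, and the label and annotation names below are assumptions about OSM's conventions rather than a definitive reference, so check the project documentation before relying on them.

    apiVersion: v1
    kind: Namespace
    metadata:
      name: bookstore
      labels:
        # assumed label: marks the namespace as managed by the "osm" mesh instance
        openservicemesh.io/monitored-by: osm
      annotations:
        # assumed annotation: enables automatic Envoy sidecar injection for new pods
        openservicemesh.io/sidecar-injection: enabled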

18:40 Michelle Noorali: When I first got involved, it was actually after the project had initially gotten started, so it was a couple of months after. I was just really excited about service mesh, and we wanted to basically contribute back to the spec, but it's really hard to do that if you don't have your own implementation. So it was very key for us, because we think that SMI is really valuable and not just to the community, but also to our customers, to have this kind of consolidation in our community. So that was really important to us, so that's kind of where it came from. We wanted something simple and easy to understand. That's really the essence of the project and everything we do, we kind of keep these core principles in mind. So it should be easy to understand. It should be easy to contribute to. It should be easy to debug. And easy is such a relative term, but it really helps us keep this perspective. If it's getting too complicated for someone to really digest in one sitting, then we're doing something that is not core and key to what we wanted to do and wanted to build. So a lot of times we've iterated already on what we're building and we're continuing to iterate to make sure that those things are super easy to use.

19:48 When should engineers look at adopting a service mesh like OSM?

19:48 Daniel Bryant: I think that's something I hear from folks in general around cloud-native tech. It can be a lot to take onboard sometimes, it can be a lot to learn. And that leads, I think, nicely on to my next question: what would be the typical use cases for adoption of OSM, and roughly where in the journey do you think people would start looking? They're probably going to spin up a Kubernetes cluster, deploy a few microservices. When they get to a certain scale or maybe getting closer to production, is that the time you look at OSM?

20:15 Michelle Noorali: Yeah, that's a great way to look at it. I think what we've seen though, it's really driven by the key features that you need. So the same features, it's like I'm a broken record, right? Just whatever features you might want from SMI, those are the things that you're going to look to OSM for. So if you want management of your traffic and being able to do different deployment patterns like Canary and AB testing, those are things that you might look to OSM for. Traffic encryption, that's a really big one. mTLS is huge, I didn't know that it was such a big thing until I really started talking to people and they were like, "Yeah, actually this is a requirement. We need to make sure it's done in a few years. And this is important. And that's why we're looking at service mesh technology, because I know it's going to take care of all the hard bits for us."

20:59 Michelle Noorali: So that's kind of what you would look to it for. It's really one or two features that you need, and then you might need the rest of the features at some other point. So that's why you're going to look to it, but yeah, just generally not for folks who have small clusters and just a few microservices. Generally it's for larger environments. Maybe you have a few clusters, maybe you have a few clusters across different regions, and that's why you're going to start looking at service mesh. We're still figuring out what our multi-cluster story is and things like that, but that's something that we want to make sure we tackle relatively soon, because it is something that folks want. And also we're trying to tackle the VM and functions case too, so it's likely that your environment is not just Kubernetes and that you do have your legacy services running in VMs, and if you're using functions, that's really cool too. So you want to be able to extend your mesh in those environments to those applications as well. And so we are looking at figuring that out and making that scenario very friendly as well.

21:58 What is the multi-cluster story with OSM? Will OSM support the connection of multiple Kubernetes clusters?

21:58 Daniel Bryant: Funny you should mention it, I'm doing a KubeCon talk with the Buoyant folks around multi-cluster. So I did a demo with Thomas Rampelberg, which is cool. A lot of fun. I think multi-cluster could be quite big. Is there any insight into where OSM is going to go in relation to multi-cluster, or is that still quite early days?

22:14 Michelle Noorali: It's still pretty early days. We're just figuring out what the design might look like and we're just figuring out what a prototype might look like, so still super early. I think there are a lot of ways to do it and I feel like a lot of the hard problems have been tackled, to be honest. So I'm not entirely sure what that will look like, I don't have anything concrete right this second, but it's something to look forward to in the next few months. It's just the early stuff, we are still very early in the project and not production-ready at the moment. So we want to make sure that we have a very reliable thing that folks can run before they start playing with multi-cluster. That's a big thing, but there's active development on it. And it's great that you're working with the Buoyant folks on a talk, those are all fantastic people. Thomas is one of my favorite people in the SMI community.

22:59 Do you think service mesh-like functionality will eventually be integrated into Kubernetes itself?

22:59 Daniel Bryant: Do you see something like OSM or a service mesh being bundled with Kubernetes at some point? Because some folks sort of say, really it's such a fundamental piece of functionality, this kind of east-west traffic management. It could almost be part of the Kubernetes platform. I've heard arguments to say no, as well. I'm kind of curious what your opinion is on bundling these communication primitives with Kubernetes itself.

23:20 Michelle Noorali: Yeah, so like in core Kubernetes, or do you think in different distributions of Kubernetes?

23:25 Daniel Bryant: Both good questions, actually. I'd love to hear your opinion on both, actually.

23:27 Michelle Noorali: Okay, cool. So in core Kubernetes, I'm all about keeping it super lightweight, super simple, as slender as possible, because I really love that the pattern that Kubernetes implements, the design of the system, is so modular. It's so un-opinionated, you have these core primitives and they work generally for whoever's deploying applications into whatever environment, whether it's in the cloud or bare metal or whatever. And that's, I think, what the most beautiful thing to me about Kubernetes is. When you learn about the abstractions, yeah, it's a lot to take in, but when the abstractions kind of sit in your mind, you're like, "Oh yeah, I get it. This makes sense." It's not unintuitive, necessarily. I don't know if that's a statement everybody would make, but that's how I feel about it.

24:13 Daniel Bryant: I get it.

24:15 Michelle Noorali: And I wouldn't necessarily see those features as part of core Kubernetes, but as CRDs, as things that sit on top of core Kubernetes and then get bundled into distributions, I'm all about that. I would love to see folks head in that direction, and I think people already have, like Rancher and folks have already done that. So I definitely love seeing that direction, because although Kubernetes is such a nice, beautiful, people would argue complex, but I think it's just what you need, set of primitives, I think it's not enough for the folks that are actually trying to do really cool things, like deploy applications all across the world, and not have downtime and do updates frequently and things like that. I think they need more, and so you definitely need to package up Kubernetes with more stuff.

25:01 What do you think the future holds in the app dev, deployment, Kubernetes, and service mesh space?

25:01 Daniel Bryant: What do you think the future holds in this space? The sort of app dev, deployment, Kubernetes, service mesh space. I know you're super excited about it, and for folks who really get it and really enjoy the space, I'm always keen to hear what they think the most important and most exciting things are in the future.

25:17 Michelle Noorali: I'm not a security expert, but I think I'm most excited about the security-related aspects of the project. I think it's great that service meshes allow you to do mTLS, so you can do TLS and that encrypts your traffic, but mTLS really validates both that the server is who it says it is and that the client is who it says it is, not just the server. So that's really awesome, and it's exciting for me to see the enablement of that type of security-related functionality, but there are even more exciting use cases and scenarios and things like that that I'm really keen on getting into.

25:55 Michelle Noorali: I think there was an Istio blog post actually, or article, that talks about one of their scenarios: not only do you want to make sure that the service identity is able to access the other service, or that services are able to talk to each other, but they also want to lock down even more around what service is getting talked to, what services in particular regions, running on particular hardware, just to add all the levels of security to that type of communication. So I'm interested in seeing the development in that space and learning and being a part of that, and contributing, hopefully, some of that back to the project and back to the spec.

26:35 Daniel Bryant: Brilliant stuff, Michelle. Oh, this has been an awesome conversation. I've definitely taken a bunch of notes here. Thanks for your time. Thanks for joining us today.

26:41 Michelle Noorali: Thank you so much for having me. I really enjoyed talking to you.
