Introduction to SMI (the Service Mesh Interface)


Summary

Brendan Burns talks about the generic interface for service mesh technology. The goal of this abstraction layer is to provide an easy-to-consume API that can be implemented by many different service mesh implementations (e.g. Istio, Linkerd, etc). Users are free to adopt service mesh concepts without being bound to any particular implementation. He covers the SMI specification and implementations.

Bio

Brendan Burns is a Distinguished Engineer in Microsoft Azure and co-founder of the Kubernetes open source project. In Azure he leads teams that focus on containers, open source and DevOps, including the Azure Kubernetes Service, Azure Resource Manager, Service Fabric and Azure Linux teams.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Burns: Thank you for joining, thank you for coming in. My name is Brendan Burns, I'm here from Microsoft Azure. I'm going to be talking about the service mesh interface. This is my first time through this talk, so I'm going to apologize in advance if I make any mistakes or if I go short or go long. It's always a little exciting the first time, so hopefully it'll be entertaining for everybody and you'll learn something as well. There'll be some time for questions at the end, although, if you feel inclined, feel free to throw up your hand in the middle too if something is unclear or unappealing.

The Service Mesh Landscape

I think that when we look out into the service mesh landscape, one of the things that is striking to me - because one of the main things that I do is go and talk to customers, and service mesh is something that comes up repeatedly with people - one of the things that is striking to me when we talk with customers about the service mesh landscape is that, in many ways, it reminds me of some of the early days with containers and everything else. There's a lot of excitement, a lot of sense of the potential, a great deal of fragmentation, and a great deal of confusion amongst people.

They see the potential, they're interested in some of the ideas, they see a lot of complexity, they see a lot of different implementations. I think it's a very challenging landscape in which to make a decision, especially when you understand that it's not necessarily a decision that you can roll back easily. I think that when people look at things like service mesh, they're going to be building large, very important parts of their application infrastructure on top of the technology - things like authorization, things like request routing and experimentation. There's a lot of stuff that you'd build very deep into your application, such that if nine months down the road you decided it was a bad idea, it would be very challenging to rip it out.

I think that when we look out, when I am trying to help enable our customers to be successful, and I look into the service mesh ecosystem, I see a challenging landscape and a challenging environment in which to help those people be successful. When I talk to people who are in the tools vendor space, or who are thinking about an ecosystem that could live alongside or above a service mesh, it becomes even more confused and confusing. This has all led its way towards the service mesh interface that I'm going to talk to you about.

The Problem for Users

From a user perspective, the main problem is that there's a combination of confusion, a sense that they don't really necessarily know which way to turn, as well as a real, a very real sense of fear that they might get locked into something that they are not necessarily going to be happy with later on. They know that they can see the promise. If you go and you adopt the technology, and if you take a look here, you're excited about the idea. If this timeline is the timeline of the adoption of technology, you get excited about the idea, you see the potential for the idea, you go ahead and adopt the technology but usually by the time you hit production you're a long way down into the choice of adapting your toolchains, adapting your development teams, learning about particular things.

When you get to production, if it turns out that the thing that you went to production with is not the thing that you're going to be happy in production with, you are pretty committed at this point. It becomes something that is extremely challenging for you to unwind because you've built deployment processes around this idea that this is here. You've built experiments around this, you've built developer workflows around this, you've spent the time and gone to weeks of training or whatever it is to figure out how to use a very specific implementation. If you end up in a place that's not comfortable, you have some very awkward choices.

I think we've all been through this with one technology or another where we went in for a certain value prop, but when we came out the other side, when you actually productionized it, when you actually scaled up, when you actually started using it in anger, the reality of the challenges became very clear, and yet it became something that was a very difficult thing to unwind. That's bad. We saw this with container orchestration, in all honesty, we saw this in the first few years of container orchestration. It's really bad because it hinders adoption; people hold back because they don't want to get stuck.

I think pre-Kubernetes, actually, cloud had some of the same challenges, because people didn't necessarily know, "How do I get to a multi-cloud strategy?" It holds back adoption. It also means that people are stuck making choices that they're not necessarily happy with. For both the general cloud-native landscape, because we think this is an important area for people to move forward into, as well as for our customers who really want to find a successful way to organize their applications, this fragmentation is not something that's useful for people.

Likewise, in many cases, the systems themselves are more complex than the use cases that the users want to put them to. Whenever you're designing an API, you have two different choices. You can go with the MVP, the minimum viable product - build something that hits the 80% use case for where an end-user is at - or you can throw the kitchen sink in and try to put in every single possible knob that every single person might possibly want, which results in a very flexible API, but one that's very complicated to use.

I think when we look out into the service mesh environment right now, this is also one of the challenges that users are facing, which is that the APIs that are there are all APIs that were designed around technology. They weren't necessarily designed around users. They were designed to highlight all of the possible ways a particular piece of technology could be used, not necessarily to take a look at, "Well, concretely, if I am a user who's thinking about doing this stuff, what is it that I actually really want out of it?"

The Problem for the Ecosystem

If we take a look at the ecosystem, there's a similar problem; it becomes a spot-the-differences exercise, one of those they put in the back of a games magazine. The problem for the ecosystem is that if I'm a tools vendor and I'm thinking about building tooling on top of a particular service mesh, or tooling that would use service mesh technology - maybe it's to do testing of a flag, staged rollout of a particular flag configuration, maybe it's visualization for monitoring - there are a lot of different tools where you might say, "Hey, you know what, it would be really great to build on top of the topology that's represented inside of a service mesh."

If I'm in that tools ecosystem, I have a really horrible world because I have to actually then bind myself. I have to either figure out a way to abstract myself away from the specific implementations of service mesh technology, or I have to figure out a way to bind myself to each and every one of those service mesh technologies, or I make a single choice and I've bet my tool on the success of a particular implementation. If I'm in the tools space or the tools ecosystem, then none of those are good choices.

Some of them involve a lot of extra work for me. Others involve forcing myself to narrow the set of people who can use my tooling because if I go to 10 different customers, for a variety of different reasons, they're going to have chosen different kinds of service meshes. Maybe it's because they came from a company that used a particular implementation before. Maybe it's because they had a pre-existing thing that they used for interacting with their VMs. There's a lot of different reasons why individual companies will make individual decisions around a choice like service mesh and if I’m either a vendor or an open-source project, trying to develop something in the tools ecosystem, I'm stuck.

This hinders the development of the tools ecosystem; it means that I may not go and build that useful tool that I wanted to build yet, because I'm waiting for consolidation to occur, I'm waiting for there to be a single thing that I can target with my tools. The net effect of that is unhappy users, because it means that I don't have as many tools, I don't have as many visualizations, I don't have as many ways of managing my application, because people aren't going and building them, because they're dismayed by the fragmentation in the landscape that they see.

We saw this also in the days of orchestration where a monitoring startup would come and it would have to build something for one, two, three, different orchestration technologies, different container runtimes like Rocket or Docker. There were a bunch of places where people who really just wanted to build useful tools had to make choices and pay attention to implementations that really, honestly, they didn't care about. This is a problem not just for end-users who want to build their application, but for people who want to go and build out a rich ecosystem of tools, that might live on top of something like a service mesh.

The Solution? Moar Abstraction

Well, what's the answer to this? Anyone who has been paying attention to the history of computer science knows that the answer to this is to add more abstraction. There is no problem that we can't solve by sweeping it under the rug and trying to put a pretty interface on top of it, and pretending like we don't see the thing that's below us. I say it jokingly, but it is actually literally the history of computer science. You don't really like machine code? Well, we'll introduce assembly code. You don't really like assembly code? We'll introduce languages. You don't really like programming languages that target specific machines? We'll introduce managed languages that give you a virtual language runtime on top. You don't like machines at all? We'll do serverless and we won't even think about the fact that there are machines there.

We keep layering these things up because the truth is, if we had to step down and write every single application in assembly language, I would dare say that none of the technology that I am using today would exist. It would simply be too hard and too complicated to go and build all of that without building these higher levels of abstraction above us. In particular, when you think about the tools ecosystem, they will either build the adapter interface themselves, because they have to, or we can come along and build the adapter interface for them.

That is precisely what the service mesh interface is intending to be. It's intending to be an abstraction layer on top of all of these service mesh implementations that focuses on the end-user, that develops the minimum viable set of APIs that we think a user wants when they're thinking about service mesh, that a tools ecosystem can develop against, then saying, "You know what, from an implementation perspective, we actually don't care". Implement it using Istio, implement it using Linkerd, implement it using Consul, implement it yourself using just Envoy and the Go control plane for Envoy. It doesn't matter, but know that all the tools that are built against this interface will work, and know that if you do decide to use a service mesh and you do adopt this layer, six months down the line, when you get to production, you will be able to actually swap implementations without your developers knowing, without your processes changing.

That's a huge value add and it is an insurance policy, if you will, that makes it easier for someone to step in and adopt service mesh into the core of their application. That makes it easier for a tools project or a tools vendor to jump in and say, "Yes, it's time. Let's go build something really great on top of the ideas inside of a service mesh."

We announced this specification, we worked on it with a bunch of people. I think one of the most fascinating things for me was that when we set out to do this, and I made a bunch of calls, and Gabe made a bunch of calls to people in the ecosystem, nearly universally - I was expecting a big song and dance to convince people - nearly everyone said, "Yes, we were already thinking about building something like this." It was a huge wake-up call for me to say, "This isn't just a good idea that I thought was a good idea. This is an industry-wide understanding that this is a necessary layer that we have to build for cloud-native computing."

The net result of that, I think, is that we've really introduced a pretty impressive community of people who are interested in and actively contributing to work on the SMI spec. It includes Microsoft, obviously, but also Linkerd, who was here before, HashiCorp, Solo, Red Hat, Rancher, Docker, Pivotal - lots and lots of people involved in putting together this interface. The net result of that is something that we really believe can be generally applicable to 80% of the customer needs that are out there; we know it can be targeted to multiple implementations, because there are at least three that we know of right now, with Istio, Consul from HashiCorp, and Linkerd; and it really provides the nucleus of a growing community that we can build forward with.

Obviously, if what we're talking about is giving an abstraction layer to people, it's important that people who take the bet on that abstraction layer have a sense that it's a living thing with a community behind it, and that it's going to continue to move forward.

Service Mesh Interface Goals

When we think about building this service mesh interface, and the community we're building around service mesh, it's important to talk about what our goals are. I think I mentioned this but, abstractly, at the top level, what we're trying to do is isolate the concepts from the implementation. When you are building anything, really, you don't necessarily care about the implementation of sorting. You care about its characteristics - that it's performant, that it puts things in the right order, that it puts them in the same order every time. You care about the characteristics of a concept that you're using, but you actually don't care about the implementation.

Whether somebody decided to implement it in assembly language or in a compiled language, it just doesn't matter. You want to use the concept, and I think in general, most users are there. They don't necessarily care about the hypervisor that's under their VM. They don't necessarily care about the JVM that they happen to be using, whether it's OpenJDK, or IcedTea, or whatever else, but they do care a lot about the fact that the concepts that they're using work. I think that's an important thing that the SMI is trying to introduce: separate out concept from implementation. In service mesh so far, they've been way too tightly bound. The product has been the implementation in a way that's not productive.

I think also, there was a desire to simplify things. To take a step back and say, "Hey, you know what, we're a year into service mesh, or a year and a half, two years into service mesh - longer for some people - as a mainstream thing that people are talking about. What are we actually seeing out in the field? What are our users asking for? The people who want to use service mesh, what are they excited about?" Let's not try and throw in all the technology that we could possibly cram in, and load up our Kubernetes cluster with 80 or 100 custom resources. Let's really try and focus it down on the minimum viable set of things that people want to accomplish with the service mesh.

I think there's also a real sense that - and we said this also when we developed Kubernetes - we're going to get it wrong. We're going to get it wrong on the first try. It's important to get out there and iterate, and move with the customers, move with the users, so that we can continue to improve it and make it into the thing that is right, knowing that the thing we have right now is our best try, but it's not necessarily going to be the right solution. Then finally, really build out a real community. I think we've done that pretty successfully so far, but building a community is a journey, not a destination, so we're continuing that journey around service mesh as a whole.

This Isn’t a New Pattern

I wanted to highlight that this is really not a new pattern. This is exactly what has happened throughout the history of Kubernetes and the open-source community around it so far. Going back to the early, early days of Docker and rkt, that led to the Open Container Initiative image format. The notion that we had to have a bunch of different networking providers, a bunch of different networking implementations to provide Kubernetes networking led to the Container Network Interface. A similar understanding led to the Container Storage Interface. Even before then we had storage volumes that were pluggable.

I didn't mention it here, but we have cloud providers that are pluggable all over the place. We've implemented Ingress in this way, where we have an abstract API and a bunch of implementations of that abstract API. We've implemented network policy in Kubernetes in the exact same way, where we have a representation of network policy and a bunch of different implementations of that network policy. This has happened over and over and over again, and there's a good reason for that. The good reason is that this is what users need: they need the concept. They don't need the implementation; they need the concept that they can use. Tool vendors need the abstraction, not specialization, and implementers actually need isolation from their users.

I think this is probably the least well-appreciated one, which is that just like the user wants to be isolated from the implementation, if I'm implementing, I actually want to be isolated from my users too, because it makes it easier for me to build a V2. This is classic don't-build-a-leaky-abstraction stuff, where having a degree of isolation from how you're being used actually makes it more flexible and easier for you to iterate forward as well. I think all across the board there's a reason why - and we're seeing this with cloud providers being moved out of tree in Kubernetes - there are good reasons for everybody involved in the community to do this, and so I'm optimistic that it's going to be successful.

Service Mesh Interface - API Overview

I'm going to take a deep dive now then into what the API actually looks like since these are the concepts that we think resonate with the end-users. There are actually just four API objects in the service mesh interface. One of them is so simple that it doesn't even necessarily count. There is a traffic spec, there is a traffic target, traffic split, and traffic metrics. This starts with the idea that you want to be able to specify kinds of traffic. In particular, this is specified through these route definitions.

We have an HTTP route group; all it is saying is that it's representing an HTTP request, so that you can apply this group to an HTTP request and say whether or not that request is a member of that group. In particular, in this case, this matches all GETs for something with /api. That's it. It is a regex, but I don't have any regex characters in there, so it just matches /api; you could add a dot star, you can do whatever you want. Also, it's a little bit hard to tell in the YAML, but of course the matches field is actually a list. The formatting makes that list hard to show on a slide, but you can add other match requirements, you can add other methods, in order to match the specific requests that you're interested in.
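For reference, a minimal sketch of an HTTPRouteGroup along these lines - assuming the early alpha API group specs.smi-spec.io and an illustrative resource name, so the exact fields may differ from the version of the spec you're reading:

apiVersion: specs.smi-spec.io/v1alpha1
kind: HTTPRouteGroup
metadata:
  name: the-routes
matches:
- name: api          # a named class of traffic, referenced later from a TrafficTarget
  methods:
  - GET
  pathRegex: "/api"  # a regex; add ".*" or other patterns as needed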

This is just a way of, basically, identifying traffic, that's all it's being used for. It's a little bit abstract but it's really just to say, "This is a class of traffic that I want to represent." One of the things that's interesting with service mesh as well is that it has been very deeply tied to HTTP. Yet, a lot of the concepts that you want in a service mesh apply to TCP as well, and so actually SMI takes care of that because there's a recognition that if you're going to build out a mesh, you do probably want to do more than just HTTP, you could also represent a TCP route. A TCP route is way simpler because there's really nothing to it. It's just a representation of the fact that there's a TCP flow happening and I need a reference point to point to. Obviously, by themselves, these objects aren't very useful, because they don't define anything really beyond our way of recognizing traffic.
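A TCPRoute, as described, carries essentially nothing beyond a name to point at; a sketch, again assuming the alpha specs.smi-spec.io group:

apiVersion: specs.smi-spec.io/v1alpha1
kind: TCPRoute
metadata:
  name: tcp-traffic   # just a reference point for a TCP flow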

Then, we take a step forward and we say, "Actually, we can put them to use with a traffic target." A traffic target is intended to represent a declaration that a particular source can talk to a particular destination. This is doing access control. In the traffic target object - and I'm going to get into the details; obviously, fitting API objects on slides is challenging - the three high-level pieces are the destination, the route spec that we just covered, and the sources, which are the places where the traffic comes from.

If we dive in a little bit on the destination, you say, "Hey, our destination is a set of pods, and those pods are going to be identified by a Kubernetes service account and a particular port." We use Kubernetes service accounts for identity in the system. We use them to identify containers of a particular class, and so any pod that matches this service account is going to be included in this destination access control rule. Obviously, we give it a name, and then we have a port associated with the traffic as well. That slots into where the destination is.

If we go next to the spec, this is the thing that's going to define the traffic. We've defined where the traffic is going, now we're going to define the traffic itself. We're going to reference that route group that we defined before, so I'm going to say, "I want to match all traffic to this destination that matches that API route group that I defined before." Obviously, this is an array so that I can put multiple route groups, I could have TCP, I can do whatever I want.

Then finally, going forward, we have to talk about the sources: who's allowed to talk to this destination. In the source, we use a service account as well. It's a different service account - it doesn't have to be, but presumably you want it to be a different service account - that represents who's allowed to call my API. We're using built-in Kubernetes concepts around identity that are already supported and already managed by the API server. We're not trying to add a new shadow identity system on the side or anything else like that. We're going to just define access control in terms of the identity that's already present in the cluster.
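Putting the destination, the route spec, and the sources together, a TrafficTarget along these lines might look roughly like the following. This is a sketch: the access.smi-spec.io group, the port, and the service account names (my-api-implementation, my-api-callers) are illustrative assumptions based on the description above.

apiVersion: access.smi-spec.io/v1alpha1
kind: TrafficTarget
metadata:
  name: api-access
  namespace: default
destination:             # pods identified by this service account may be called
  kind: ServiceAccount
  name: my-api-implementation
  namespace: default
  port: 8080
specs:                   # which traffic this rule applies to
- kind: HTTPRouteGroup
  name: the-routes
  matches:
  - api
sources:                 # pods identified by this service account may call
- kind: ServiceAccount
  name: my-api-callers
  namespace: default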

Putting It All Together

If we put this all together it looks like this. The traffic target exists to say, "Hey, the service account for my API callers, talking on this particular route definition, /api, can talk to the pods that implement the API and are part of this other service account, my API implementation." This is how we put it together to do access control from pods identified by the service account of the callers of the API to pods identified by the service account of the implementers of the API.

At this point, we've defined in a declarative way, the concept that the user cares about, which is the access control inside of my cluster. It's up to the implementation of the service mesh, whatever it happens to be, to go ahead and implement it. That is the first important concept that I think most people are interested in, in service mesh, which is access control.

Service Mesh Interface: TrafficSplit

The second topic that I think a lot of people are interested in, is a notion of doing traffic splitting. In the Service Mesh Interface, we also have this object that represents traffic split. I should have mentioned, it may be obvious, but I'll mention it, these are all custom resources that are defined in the context of Kubernetes installed into your cluster, so you do end up with a few extra custom resource definitions, four is a fairly reasonable number.

When we're talking about traffic split, what we did was define traffic split in terms of Kubernetes-native services. We don't actually want to build a whole separate shadow set of APIs that exists alongside Kubernetes. We want to reuse Kubernetes concepts as much as we can.

When you're talking about doing traffic splitting, you're going to talk about referencing Kubernetes services. That's exactly what this does here. It says, "Hey, I want to create a new traffic split. In the backends that I'm going to do traffic splitting on, I'm going to just reference existing Kubernetes services: a service named experiment, a service named canary, and a service named production."

The weights that I'm going to give it are proportional weights that indicate where I should load balance traffic to. There's an interesting discussion, which I'm not going to go into, about doing weights versus percentages. That was a point of discussion in speccing this out. We ultimately came down on weights, but if you're interested in the esoteric details of how one creates a spec, you can go read it up on GitHub. This indicates that the weight of traffic will be 1 to the experiment service, 10 to the canary service, and 100 to the production service. It's roughly a 1%, 10%, 90% split, although the math majors out there will know that it's not exactly 1%, 10%, 90%, but it is in many ways easier to reason about weights than it is to reason about percentages.

If we take this, then we look at the picture of what that looks like. What's going to happen when I declare this traffic split, is a new Kubernetes service is going to get created. That new service is going to have the name of the traffic split, and when you send traffic to that new service, you'll see the traffic splitting behavior to the pre-existing services. Not trying to add our own implementation of ingress, not trying to add our own implementation of anything, really trying to reuse core Kubernetes concepts, but implement this traffic splitting. In particular, my experiment service comes into existence, and roughly 1% of the traffic gets split off to the experiment service, 10% of the traffic to the canary service, and 90% of the traffic to the production service. That's traffic splitting.
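A rough sketch of that TrafficSplit, assuming the alpha split.smi-spec.io group; the field names and the weight encoding varied across early versions of the spec, so treat this as illustrative:

apiVersion: split.smi-spec.io/v1alpha1
kind: TrafficSplit
metadata:
  name: my-split        # per the talk, callers send traffic to a service with this name
spec:
  service: my-split     # the root service the split applies to
  backends:
  - service: experiment
    weight: 1           # roughly 1% of traffic
  - service: canary
    weight: 10          # roughly 10% of traffic
  - service: production
    weight: 100         # roughly 90% of traffic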

Service Mesh Interface - TrafficMetrics

The third piece that we think - when we talked to most customers - they were interested in with service mesh technology was metrics and monitoring. That leads to the traffic metrics object. The traffic metrics object refers to a resource; in this particular case, it's referring to a pod. It has a bunch of edges that we're going to go into in a second. Then it also has the metrics data, which we'll cover in a little bit: the timestamp of the observation, the time window of the observation, and all the metrics themselves.

What's interesting about the resource and the edges is that it's a very flexible way to represent all of the different kinds of traffic you might want to measure and monitor. The resource is the central point of the monitoring, so there's only one resource, it could be a pod, it could be a service, it could be a deployment. If it's a service or a deployment, it's going to aggregate across a bunch of pods, but it is the center of the graph, if you will.

Then from that point, you have a choice: you can either measure all of the edges coming in, or you can measure all of the edges going out. To show you that, we'll look at a bunch of different edge definitions. For example, you might say that the direction is "to" and the resource is empty, which means I'd like to measure all inbound traffic. I want all inbound traffic to this particular pod to be captured in this metric. I might also want to measure all outbound traffic to a specific pod. The previous example is very generic - all the inbound traffic to this container, this pod. This one is very specific - all outbound traffic from this container, or this pod, to this other pod. That's a very specific kind of traffic that I'm interested in measuring.

You might say, a little bit more broadly, I want all inbound traffic from a particular service. In this case, I have a particular pod, and that's the resource that is the center point of my metrics. Now I want all traffic that originated from pods that match this service to be recorded in this metric. It's a little bit specific, because it's specific to a pod and a service, but it's a little bit more generic, because there are probably multiple pods that actually implement that service and send traffic down to this particular container. When we look at the traffic metrics, what we see here is that it's exactly the sort of metrics that you'd expect. It's got a name, it's got a unit, it's got a value. Obviously, this can be easily scraped into whatever your favorite metrics system is.
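A sketch of what such a traffic metrics result might look like, using the API group named later in the talk (traffic.metrics.k8s.io); the metric names, values, and exact field layout here are illustrative assumptions, not the definitive spec:

apiVersion: traffic.metrics.k8s.io/v1alpha1
kind: TrafficMetrics
resource:                 # the center of the graph
  kind: Pod
  name: my-api-pod
  namespace: default
edge:
  direction: to           # measure inbound edges
  resource: {}            # empty means "all inbound traffic"
timestamp: 2019-07-22T10:00:00Z
window: 30s
metrics:
- name: p99_response_latency
  unit: ms
  value: 10
- name: success_count
  value: 100
- name: failure_count
  value: 1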

It's an interesting discussion, and one that we had early on: why don't we just export Prometheus metrics, for example? Why don't we just make a scrape interface available? Why are we defining an API? Ultimately, it comes down again to flexibility and abstraction. We don't want to presuppose for any implementation that you're using any particular monitoring implementation. There are a lot of different monitoring implementations out there. A lot of people use more than one, some doing aggregation inside the cluster before pushing to cloud monitoring, or more centralized monitoring. By making it instead a metrics API, which is the pattern that pre-exists in Kubernetes, it becomes much more flexible for adapting to whatever particular monitoring solution you might want to have.

In particular, what we do is we make an API interface available, an extension API interface, on the metrics endpoint of Kubernetes, so traffic.metrics.k8s.io becomes a new API endpoint that you can go and talk to. Then Prometheus, or whatever you want, can go and scrape those metrics. It's an aggregated API server, so unlike custom resource definitions, it actually does a pass-through to the metrics server. The metrics server then implements all of the aggregation of metrics that you need from your containers and your services into the traffic metrics server, and then it can be scraped out into whatever monitoring system you want. This is the same way that other metrics used by auto-scaling and other things work in the system today.
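Because this is served by an aggregated API server rather than a CRD, an implementation would register itself with the Kubernetes aggregator. A hedged sketch of such a registration, where the backing service name and namespace (smi-metrics, smi-system) are placeholders:

apiVersion: apiregistration.k8s.io/v1
kind: APIService
metadata:
  name: v1alpha1.traffic.metrics.k8s.io
spec:
  group: traffic.metrics.k8s.io
  version: v1alpha1
  service:                        # the in-cluster service that serves the metrics API
    name: smi-metrics
    namespace: smi-system
  insecureSkipTLSVerify: true     # illustration only; use a caBundle in practice
  groupPriorityMinimum: 100
  versionPriority: 100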

Hopefully, that gave you a rough idea of the scope and shape of the current specification of the Service Mesh Interface. As always, these things are works in progress, both in terms of the specification itself, as well as the implementation of the specification. I encourage you to go check out the specifics of the spec and specifics of the implementation, if you're interested.

Concerns: Lowest Common Denominators

I want to talk a little bit about one of the biggest concerns that came up - it's actually come up in a couple of different forums, but it came up in a thread a while ago with Tim Hockin, who I've known for a long time - this concern around a lowest common denominator interface. I think that this, along with naming, is one of the core concerns of computer science.

I spent a long time earlier talking about how we've always done abstraction. For as long as we've done abstraction, people have said, "But isn't that just the least common denominator? What about the special characteristics of my hardware processor? Why are you programming in Java, when you don't have access to this particular instruction? Or even, why are you programming in C, where the compiler can't target this very specific instruction?" There's always been this constant tension between an abstraction that is empowering and simplifying, and a fear that that abstraction is too simplifying and lacking in some very specific characteristic. As we always think about these things, we have to keep that in the forefront of our minds, and have an understanding that an API is not a flag placed in a single place, but an evolution over time.

We add capabilities as we see new uses; every single thing that we do iterates as we go forward so that we can adapt. The model that I think about for the Service Mesh Interface, the inspiration that I put forward to people - there are lots of different examples, but this is the one that I highlight - is OpenGL. OpenGL, if you look at the specification, was built basically to solve exactly the same problem that the Service Mesh Interface is trying to solve.

I've got a whole bunch of graphics cards, I've got a bunch of games, I've got a bunch of operating systems, and I want to be able to write code once and have it work across all of these different systems. I would say that OpenGL has actually been very, very successful at this. It has been quite successful at enabling game developers to build a lot of really interesting applications without necessarily dealing with the details of how the graphics card implements it. Except, anybody who's done any real graphics programming knows that there's a giant list of vendor-specific extensions - be it shader languages, be it everything else - and that's actually where most of the meat of the graphics programming goes.

What happens over time is that, from OpenGL 1.0 to 1.1 and so on, as things became more and more commonplace, as they became used by more and more of the applications, they were folded into the spec. They moved from being an extension to being a part of the API. The spec wasn't treated as if it were tablets handed down that could never change. It was treated as a document that needed to be iterated on at a regular cadence to add new capabilities, to ensure that it continued to hit 80% of the needs of the people who are programming in this world.

Service Mesh Interface: Iteration Plan

That's exactly where we intend to go with the Service Mesh Interface. Let's start with the basics, let's make the value prop really crystal clear to people, make it simple to think about, simple to adopt. I hope that I explained it reasonably well in 15 minutes. Then let's move forward; we're going to have lots of custom extensions. Let's embrace that. I think there are a lot of people who are like, "Oh, my God. Ingress, it's this horrible stew, and there's extensions all over the place, and it's not portable." I have very little patience for that argument, because it has served a large variety of people successfully. Maybe yes, we need a v2 where we're going to fold those extensions back in. I think what we didn't do with Ingress is a good job of acknowledging that we had to continue to iterate it forward, and we needed to continue to fold the extensions back into a V plus one.

With the Service Mesh Interface, I think we're really adopting an approach that says, "Yes, we're going to embrace extensions, we're going to embrace incompatibility, because that's the way that we find out what users want and what users need." Over time, the things that everybody's using, all those extensions, we're going to find the commonality, and we're going to fold it back into the spec so that we continue to iterate and move forward and have something that is relevant to people over time. I think this is the way that you design abstraction API's everywhere.

What's the state of the system? As I mentioned, we have a whole bunch of implementations, we have a bunch of tooling that's starting to work on it, and a bunch of partners who are working together. When I start thinking about the future of this, I would really ask you to come join us. Check out smispec.io. We're up on GitHub, there's an SDK, a Go SDK. I'm this far away from pushing up a TypeScript SDK too, because I'm having a lot of fun with TypeScript these days.

There's a bunch of different Getting Started docs that are out there from Hashi, from Weaveworks, from other people. This thing is only going to be successful if there's a community of people who adopt it and who help us push it forward, and also tell us if it doesn't work for them or what they think is terrible about it. Please come and be part of that community.

Questions and Answers

Participant 1: Great presentation, Brendan [Burns], lots to think about there. One of the questions I had was around, say, the Ingress spec; it took a long time in Kubernetes to iterate. I totally get your point about folding things in, and it makes complete sense. Do you sense there's any need for an increased cadence for taking the learnings we have with the SMI and folding them back in, say, every six months, every year? Maybe that sounds quite a long time?

Burns: I think the tricky thing that happens there is that, in some ways, the Kubernetes APIs and the philosophy of Kubernetes APIs were designed for the long haul; there's not really a notion of it - I don't think there's a single V2 that I know of in the API. Someone just highlighted horizontal pod autoscaling V2, I think, so there's a little bit of effort there. In the way that we have approached the APIs, we iterate a bunch to get to the V1, but I don't think we acknowledged the fact that in some of these places we're also going to probably iterate after the V1, and set up that cadence.

I think it would be viewed as a failure at some level to iterate every three months, to have a V2, a V3, every three months. Just at a philosophical level, I feel like the project feels that way. That's something we should change for some of these APIs. Not for all of them - backwards compatibility is still really important. OpenGL agglomerated new stuff: 1.1 was compatible with 1.0, you could still do the 1.0 stuff in most cases. We have done some of that, but I do think that there isn't that sense that this is a place where it's good to get something out, but we should also set up some regular iteration. It's not something that we've done a good job of so far.

Participant 2: Similar question about the iterations. Do you anticipate that something might be removed from the spec? You have to somehow either carry some legacy stuff with a...

Burns: I really hope not, I don't think so. I think one of the reasons to start small and to start with the MVP, is to be really certain, only start with the things that you're really certain about. I guess I take a step back, I don't think the spec is fully formed. We're not at a 1.0, we're not there, maybe over time, and certainly, the spec looks a lot different than the thing I scribbled down on a napkin a little while ago. There may continue to be evolution but I hope we don't do a lot of removal, if we do, it'll be a failure.

Participant 3: Thanks, Brendan [Burns], great talk. Do you see, in the future, service orchestration being a capability of a service mesh, or do we still want the applications to manage it? Say they have a use case of a process service: they need to call multiple microservices, so should they be calling each one on their side, or should the service mesh be responsible for those use cases?

Burns: I think it's going to be far more likely that it's in the code, if it needs to be, in some cases. Service meshes are great when you want the fact that you're shipping it somewhere else to be transparent to the application - experiments and canaries and things like that, where you actually kind of want the application to not care, or failover, where you need to fail over. If you think that it matters to the application, the application probably should be making the decision and understand that it's making the decision.

I view service mesh as transparent technology, like technology that the app developer is not supposed to see.

Participant 4: My question would be, in a way, this is really oriented toward people who want to use service meshes, people who want to implement them and use them. I didn't necessarily see users being involved in this. What is the best way for the user community to weigh in and go, "Yes, we really would like a simpler way for this all to come together?”

Burns: Partially it's probably vote with your feet, and use tools that adopt the interface. Make it clear to people, when you do that, that you're using the interface. One of the things we're not necessarily trying to do is replace the interface to Consul, for example. Consul has its own CLI, it has its own API; we're not trying to necessarily replace that API. You want to be clear with people that you're using the interface API, as opposed to the native API. Obviously, it's got a GitHub, so you're welcome to make whatever comments you want there as well.

Then, my suspicion is every user is also writing some degree of tooling, and so think about how you would use it in your tooling and if it works for you. A lot of this is influenced also simply by the conversations that I've had. Come and talk to us. If you're an Azure customer, I certainly am talking to you. Likewise, I don't actually really care what cloud or not cloud you're a customer of. I'd be eager to talk about your use cases and hear whether you think it would work or not.

Participant 5: As a follow-up question, in the earlier days of Kubernetes adoption, you formed all of those community channels, added a Slack, so that users could send questions on all those things and you could collect the feedback. Do you guys have those things?

Burns: Yes, you can check that. That's all up on the SMI spec site, smispec.io. Yes, for sure. If there are any forums like that that we're missing, Stack Overflow or whatever, let us know. I'm on Twitter, too, if you want to just hit me up on there. Actually, SMI is on Twitter, too, so that's also possible. It's not me actually behind the SMI account, I don't know who it is.

Participant 6: Could this become part of the Kubernetes spec at some point?

Burns: I think it could be, but I don't know that we want it to. In the sense that I'd rather not see Kubernetes come to encompass the kitchen sink.

Participant 6: How hard is the dependency on Kubernetes?

Burns: It's totally dependent on Kubernetes but actually, that's ok. As long as the arrows flow from the ecosystem down, that's great. You look at Helm - there are a lot of tools that only exist for Kubernetes, but that doesn't mean they should be part of Kubernetes. Or maybe I would say, it's very much intended to be part of the Kubernetes ecosystem, but I don't necessarily see it being part of Kubernetes. I do think that getting it into some neutral governance, in the form of the CNCF or something like that, is actually very critical, and is part of the long-term roadmap, because that stuff is critical. The last thing Kubernetes needs is more APIs.

Participant 7: Do you think that AKS may ever ship the CRDs by default in its installation?

Burns: I can't comment on future things. Generally speaking, what we try and do is if lots of customers want it, we ship it.

Participant 8: Maybe less about AKS, do you see this as something that should be promoted into the providers themselves?

Burns: It's an interesting question. I suspect over time we will get there. Although I have a lot of people who are interested in service mesh, I don't have enough customers who have adopted it that I would turn it on by default. Make it a checkbox, sure. Turn it on by default, I'd need to get to 80% of customers using it, and we're nowhere near there - either in terms of SMI or service mesh in general.

 


Recorded at:

Jul 22, 2019
