
Service Mesh: Past, Present and Future



Idit Levine discusses the unique opportunities service mesh presents for multi-cluster and multi-mesh operations.


Idit Levine is the founder and CEO of a company that develops tools to help enterprises adopt and extend innovative cloud technologies while modernizing their existing IT investments. She has spent her career building all areas of cloud infrastructure and open source software at both startups and large enterprises, including DynamicOps, VMware, CloudSwitch, and Verizon.

About the conference

InfoQ Live is a virtual event designed for you, the modern software practitioner. Take part in facilitated sessions with world-class practitioners. Connect, see, and speak with like-minded people. Join us to accelerate your learning, be better informed, and drive innovation.


Levine: My name is Idit Levine. I'm a founder and CEO. Today we're going to talk a little bit about service mesh: its past, its present, and its future.

The Way Applications Are Built Has Been Revolutionized

Why do we even need a service mesh? As you probably know, in the last few years there has been quite a big shift in architecture. Where in the past we had one monolithic application with all the business logic in one binary, now we've split it up into a lot of small microservices. That means we moved from one binary to a distributed application, where everything has to go through the network. Every communication between services, between any components of your application, has to go over the wire. That makes the network an extremely important piece of the infrastructure.


Challenge number one is routing. You need to make sure that service A is capable of talking to service B. Number two, you need to make sure that when they communicate with each other, it's extremely secure, because you definitely don't want a third party sabotaging that. The last one: because service A can potentially talk to a lot of replicas of service B, you don't know where the request will actually land. It could go to service B1, to service B2, or to service B3. You have no knowledge of where it will land because they're identical. That means it's very hard to reason about observability. The request can be anywhere. The logs can be anywhere. It's very hard to piece together the work that happened on a request. It becomes very challenging.

We need to fix that. The way to go about fixing it is by making sure you can route and control the traffic. Make sure your traffic is extremely resilient: you have to do things like retries or circuit breaking. You need to make sure it's secure, with a root certificate, that communication between the services uses mTLS, Mutual TLS, and that policy is enforced. Lastly, you need some way to reason about the metrics and the logs. Today, if two services need to communicate with each other, in order to achieve all that you put all this operational code for routing and security inside your microservices. You couple the business logic with the operational code. That is what service mesh is trying to solve.
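To make the resilience piece concrete, here is a minimal sketch in plain Python of the two policies mentioned above, retries and circuit breaking, as a sidecar might apply them on behalf of the application. All names here (`CircuitBreaker`, `call_with_retries`, the thresholds) are illustrative, not any real mesh's API.

```python
class CircuitOpenError(Exception):
    """Raised when the breaker refuses to send traffic to an unhealthy upstream."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3):
        self.failure_threshold = failure_threshold
        self.failures = 0

    @property
    def open(self):
        return self.failures >= self.failure_threshold

    def call(self, fn):
        if self.open:
            raise CircuitOpenError("failing fast: upstream marked unhealthy")
        try:
            result = fn()
        except Exception:
            self.failures += 1      # track consecutive failures
            raise
        self.failures = 0           # any success closes the breaker again
        return result


def call_with_retries(breaker, fn, max_retries=3):
    """Retry a failed request a bounded number of times through the breaker."""
    last_error = None
    for _ in range(max_retries):
        try:
            return breaker.call(fn)
        except CircuitOpenError:
            raise                   # circuit is open: fail fast, do not retry
        except Exception as e:
            last_error = e
    raise last_error
```

The point is that none of this logic lives in the business code: the sidecar applies it to every request on the wire.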

Network Abstraction

Before, we coupled the operational code and the business logic together. Every time you want to change some security setting, you need to redeploy your application. If we can separate them and abstract the network, then the business logic changes only when you change the business logic, and the operational code is something you can change more often and more easily.


How can we achieve something like that? The technique is pretty simple. We put a proxy next to each of those services, and we trick the IP tables so that all traffic in and out of those services goes through this proxy. Now we can give configuration to the proxy. It's important to understand that the proxy is a very powerful tool. It can do so much: it can manipulate the headers, manipulate the body, append headers, decide whether to let a request through or not, do retries, do circuit breaking. But it's also pretty dumb, and you need someone to tell it what to do. You definitely don't want it fetching those instructions while it's handling a request, when latency matters most. You need someone to feed it this information ahead of time. That's what the control plane does. You have the data plane, which is the proxy itself, handling requests as the data actually flows. And you have the control plane, which feeds in the configuration ahead of time so that when the request comes, the proxy already knows what to do. Now we've achieved it: we've abstracted the network and separated the concerns. We can easily configure the proxy without needing to change anything in the business logic. That's pretty cool.
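A toy sketch of that control-plane/data-plane split, with illustrative names rather than any real proxy's API: the control plane pushes routing configuration ahead of time, and at request time the proxy only does a local lookup, so no control-plane round trip sits on the latency-sensitive path.

```python
class SidecarProxy:
    """Data plane: forwards requests using only pre-loaded configuration."""

    def __init__(self):
        self.routes = {}                 # host -> upstream address, pushed in advance

    def apply_config(self, routes):
        self.routes = dict(routes)       # called by the control plane, out-of-band

    def forward(self, host, request):
        upstream = self.routes.get(host)
        if upstream is None:
            return (503, "no route configured")
        # A real proxy would open a connection; here we just record the decision.
        return (200, f"forwarded {request} to {upstream}")


class ControlPlane:
    """Feeds configuration to every proxy before traffic arrives."""

    def __init__(self, proxies):
        self.proxies = proxies

    def push(self, routes):
        for proxy in self.proxies:
            proxy.apply_config(routes)
```

In a real mesh the push happens over an API such as Envoy's xDS, but the shape is the same: configure first, forward later.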

Service Mesh: Present

As with every cool technology, a lot of people are interested in taking part in it. What you see in the market is quite a lot of solutions trying to solve the same problem. The question is, which one are you going to choose? There are quite a lot of reasons why it's so hard to choose. First of all, you have to try them, and that's a lot of options to try. Second, it's not simple. It's definitely not simple. The implementation of service mesh today is extremely complex, which means you need to understand a lot of implementation details. It's very hard to onboard and to do the initial setup. You also need to learn the API, and it's a different API for each mesh, so if you're trying several, you need to learn a lot of APIs. Lastly, we genuinely do not know who is going to be the winner here. It's very dangerous to choose the wrong one, end up locked to that provider, and then need to switch everything you built on that mesh. We saw that happen before with orchestration: Kubernetes, Docker Swarm, Mesos, and Cloud Foundry. Everyone who didn't choose Kubernetes had to throw everything away and start from the beginning. We definitely do not want that to happen again.


I predicted that two years ago. Around then, I created a project called SuperGloo. The idea of the project was: let's abstract that. Let's make sure there is one API, and make it a dead-simple API. Then let's create an adapter to all those different meshes, because I predicted there would be a swarm of them. Why is it important to make the API dead simple? Because if you look at APIs like Istio's, you will discover that they are extremely complex. They expose notions like the sidecar, but the sidecar is an implementation detail; it's not what the mesh does. The API is just extremely complex and exposes a lot of the implementation details.

KISS: Keep It Simple Sxxxxx

I tried to figure out the simplest way to describe a mesh. What is the functionality we're giving our customers? It's pretty simple. We have a group of sources and a group of destinations, something like client and server. In a mesh, the only thing we want to do is define a policy rule on the pipe between them. This is my source, this is my destination, and I don't want them to talk. This is my source, this is my destination, and I want to retry 10 times.
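That "source, destination, policy rule" shape can be sketched in a few lines of Python. The field names here are made up for illustration; they are not SuperGloo's actual schema.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class PolicyRule:
    source: str        # label selecting the client workloads
    destination: str   # label selecting the server workloads
    action: dict       # e.g. {"deny": True} or {"retries": 10}


def evaluate(rules, source, destination):
    """Merge the actions of every rule matching this source -> destination pair."""
    merged = {}
    for rule in rules:
        if rule.source == source and rule.destination == destination:
            merged.update(rule.action)
    return merged
```

The whole mesh API collapses to a list of such rules, which is exactly what makes it simple enough to adapt onto Istio, Linkerd, or App Mesh underneath.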

Service Mesh Interface (SMI)

We open sourced this project, and 9 months later Microsoft reached out to us and said, "This is a brilliant idea. We really want to help you push it as a community." Microsoft, with the power of a big company, announced SMI together with us. It brought the idea, the vision, and the implementation, and convinced other companies like Buoyant of Linkerd, HashiCorp, Red Hat, and Pivotal to join the initiative. That's SMI, the Service Mesh Interface. Right now it's part of the CNCF.

Service Mesh Rarely Comes Alone

There are more challenges. The challenge is that you never have just one. You probably have more than one cluster, and therefore more than one instance of a service mesh. It could be the same one; you could have two instances of Istio. Potentially, there are different groups in your organization using different meshes. Or you're going to use Istio on-prem, but you really want to use App Mesh on AWS. How is that going to work?

Delivering on the Vision of Multi-Mesh

Basically, what I wanted to achieve is the same rule of source, destination, and policy, but even when the source is in a different cluster than the destination. After doing SuperGloo and putting it out there, as well as writing the SMI spec, we learned a lot. We came back and said, "From all this learning, let's see what is the right thing to put out there." What we realized is that there are a few problems with the solutions we had put out so far. Number one, when we did SMI with Microsoft and all those service mesh providers, they insisted on making it the lowest common denominator. That's always a problem, because it's very hard for me to explain to a customer running Istio in production why they cannot use circuit breaking just because Linkerd doesn't support it. That doesn't make any sense. It actually makes the spec unusable.

That's something we approached differently in SuperGloo from the get-go. We said we're going to support every piece of functionality that a service mesh has; if the service mesh you're using doesn't support something, we simply don't implement it for that mesh. The second thing we discovered is that the community constantly wants the newest and greatest. When we announced SuperGloo, it was, I think, Istio 1.0. Immediately when 1.1 came out, we got the request, "What about 1.1?" What we understood is that it's extremely important to constantly keep pace. Istio just announced 1.7, and we already support it. Lastly, I understood that it's extremely important to do multi-cluster support. That's what we discovered when we talked to everybody: everyone has more than one cluster, and they're interested in multi-cluster support.

Service Mesh Hub

That's what we did. We spun up a solution called Service Mesh Hub, which is basically the evolution of SuperGloo, and we put it out there for the community to use. What Service Mesh Hub does that is extremely useful is give you the ability to manage your clusters: you can register a cluster, and it discovers every service mesh on it. Then, if you want, you can install a new mesh with the help of Service Mesh Hub. We discover all the workloads and the services, so it gives you tons of visibility. You can then just use a very simple API: source, destination, and policy rule.

Another thing you can do, and again this comes to solve the multi-cluster problem, is take as many service meshes as you want, no matter which type they are. It could be two instances of Istio, two instances of Istio of different versions, or Istio plus App Mesh. You group them together into what we're calling a virtual mesh. When you group them together, we do a lot of work behind the scenes to make sure they are safely orchestrated, and we treat them as one big mesh. Everything you can do with one mesh, you can do with a virtual mesh. That's the beauty of it: you can apply all those API simplifications and just use, again, source, destination, and policy rule.
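The grouping idea can be sketched like this, with hypothetical names; a real implementation would translate each rule into the native configuration of Istio, App Mesh, and so on behind the scenes.

```python
class Mesh:
    """One registered mesh instance, of whatever type and version."""

    def __init__(self, name, kind):
        self.name, self.kind = name, kind
        self.rules = []

    def apply(self, rule):
        # A real adapter would translate the rule into Istio/App Mesh/etc. config.
        self.rules.append(rule)


class VirtualMesh:
    """One logical mesh: a rule applied here fans out to every member mesh."""

    def __init__(self, members):
        self.members = list(members)

    def apply(self, rule):
        for mesh in self.members:
            mesh.apply(rule)
```

The caller keeps using one source/destination/policy rule and never sees that it lands on several heterogeneous meshes.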

We also created a function called Inspect. Why? Because it's very simple to say you have a source, a destination, and a policy rule, but in a real mesh you have way more than one source and way more than one destination. You lose the picture of what is happening to one specific service and which rules apply to it. For that, we created the Inspect function. If you go into a service, you can see all the rules that apply to it, inbound and outbound, as a source and as a destination. I think this is extremely powerful. Basically, this is what service mesh is. This is the present. This is what our customers are running today.

Service Mesh: Future

Then the question is, where are we going with this? What is the future? From running in production in a lot of big organizations, I can tell you that no one just takes the service mesh or the API gateway as it is and uses it. People constantly customize it. Everybody has different rules, different customizations they want to make, different environments and third-party tools they're using. This is something Envoy actually anticipated. [inaudible 00:13:14] of Envoy, they came up with an architecture called the filter chain, which means that when a request comes in, it goes through a chain of filters, some of which you can write yourself, putting your own logic inside. The only problem, or challenge, with this is that in order to write those filters, you need to write them in C++, asynchronously, and you need to recompile Envoy. Trust me, this is not an easy assignment.
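The filter-chain idea itself is just function composition, which a short Python sketch with made-up filter names can show; Envoy's real filters are compiled C++ running inside the proxy, which is exactly the pain point described above.

```python
def add_header(name, value):
    """Filter that stamps a header on the request, then calls the next filter."""
    def f(request, next_filter):
        request["headers"][name] = value
        return next_filter(request)
    return f


def deny_path(prefix):
    """Filter that short-circuits matching requests before they reach upstream."""
    def f(request, next_filter):
        if request["path"].startswith(prefix):
            return {"status": 403}       # never reaches the upstream service
        return next_filter(request)
    return f


def build_chain(filters, upstream):
    """Compose filters right-to-left around the final upstream call."""
    handler = upstream
    for flt in reversed(filters):
        def make(current, nxt):
            return lambda request: current(request, nxt)
        handler = make(flt, handler)
    return handler
```

Each filter can transform the request, pass it on, or answer directly, and they run in the configured order, just like Envoy's chain.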

WebAssembly and Envoy

We sat together with Google and looked at a very cool technology that was emerging in a different market: WebAssembly. WebAssembly was targeting a problem that exists on the web: we want to run fast applications in the browser, and maybe JavaScript is not a great language for that. We want it to be portable, so we can write it in one place but run it on every browser platform. We want to be able to write it in any language we want. It has to be secure, because we don't want it to take the browser down. Ideally, it would be great if we could also run it on different platforms like ARM, or even outside the web. That's exactly what WebAssembly does.

We looked at this together with Google. We said, this is pretty brilliant, and maybe it can solve those problems with Envoy. What if we create a filter that's capable of interacting with a WASM module? Then you could write those WASM filters in any language you want, extremely secure and very fast, and you don't need to recompile Envoy. That solves a lot of those problems. We did this exercise and started working on it. Then I saw this tweet. Lin Clark is one of the leaders of the WebAssembly community, and she announced WASI, which gives WebAssembly the ability to run not only in the web browser. Solomon Hykes, the founder of Docker, tweeted that if WASM and WASI had existed in 2008, he would never have needed to create Docker. I saw a lot of similarity between Docker, WebAssembly, and the Linux container architecture and use cases.

What I understood is this. Google is not the one who made the Linux container usable. The technology itself was created by Google, with a lot of help from the community. It's exactly like WebAssembly: the majority of the work bringing WebAssembly to Envoy was done by Google. What actually made Linux containers usable and adoptable was what Docker did, which was make them extremely simple for people to use. The user experience was extremely important. It was very important to us to do the same thing for WebAssembly and Envoy. Therefore, we created WebAssembly Hub.

WebAssembly Hub

The idea of WebAssembly Hub is extremely simple. You have a common tool you can write WASM with. You choose the language you want to write in and which platform you're targeting: Gloo, which is our API gateway; Istio, which we support extensively; or plain Envoy. It creates a scaffold for you, and you open it and write only the business logic of the WASM filter. Then you can build it and push it to a repository that we built, WebAssembly Hub. You can also pull from it, for community sharing. Eventually, you deploy it, choosing the platform you want to run it on, and that just happens automatically: it takes the WASM module, brings it to Envoy or Istio, and compiles and configures everything to work with it.

WebAssembly Hub is out there. Seriously, go try it. There are community modules there. It's very exciting, and it's going to take service mesh to the next level. Again, just keep up to date. This ecosystem is moving extremely fast, and I'm really excited to be one of the leaders there.




Recorded at:

Jan 15, 2021