Transcript
Talwar: My name is Varun. I'm a co-founder of a company called Tetrate. We are an enterprise service mesh company. I'm going to talk about resiliency, and to be more precise, runtime resiliency, and something that is built into your network. I like to start any talk about a technology topic from history. Cloud 1.0 was the first era of cloud. It was when we saw the wave of virtualization, with people basically getting more out of their hardware. That went on for quite a few years before we hit the era of current cloud, which is cloud 2.0, which is basically getting compute resources from someone else. You don't have to run machines in data centers; someone else is running them for you more efficiently. You swipe your credit card and you get resources that they're managing. That has helped tremendously in provisioning agility and bringing up compute anywhere we want. Where we are really heading is cloud 3.0, which is more dynamic and distributed compute. Dynamic in the sense of containers and autoscaling, scheduled through orchestrators like Kubernetes. Distributed in the sense of different regions: private cloud, public cloud, hybrid, and so on. As well as distributed in the sense of application components being distributed. In a world where compute is so dynamic, our networking and security stacks are the ones that are lagging behind. Those are the ones that need to catch up.
Cloud 3.0 Transformation - Innovations in Networking
Prior to starting Tetrate, I had the opportunity to work at Google for about 11 years. A lot of people ask, how come Google infrastructure is so reliable and secure? How come it's so resilient? Despite launching more services, despite having thousands of new developers join every year, the infrastructure is always up and available. One thing core to that is the investment in the core network. There have been quite a few networking innovations at Google, and not all of them have been talked about. I had a front-row seat for the last two, namely gRPC and Istio, both of which I was a co-creator of. Those are where the networking stack was taken to the application level, really making the network application-aware. gRPC is a modern RPC fabric, and was launched in 2016. Istio is a proxy-based approach, injecting proxies into the network and making them L7 proxies that are aware of what is going through them. It was launched in 2017. Both of these are thriving open source projects today.
Context
Coming back to the context of this talk, resiliency is super important. As more companies move to public cloud, anytime there is an outage in any of the cloud providers, the list of brands that are affected just keeps getting longer. That significantly hampers not just their uptime, but their business and their brand image.
Resiliency Is Not Just About Software
How can we do better? Before we go into that, let's scope the problem of resiliency. It's a multi-layer problem, which starts at the infrastructure layer but then extends to the network layer. The more distributed applications are, the more critical the network layer is for reliability. It obviously extends to the data layer as well, and to your people, practices, and operations. Failures can be of different types. You could go from a host, to a node, to a given service, to a given data center, to a given region. And obviously there are failures at the physical level, in terms of cabling, switches, and routers. All of these can cause failure modes and availability issues for your applications. The question is, how do you design your applications to be resilient against them? Can we do better than just two deployments, active-active or active-passive?
In a world where compute is becoming available everywhere, my claim is that you should be deploying applications in multiple availability zones. They're easier now to provision, run, and manage anyway. Deployment pipelines are more automated. All we really need is a smart, connected network, which can route traffic to the right healthy deployments all the time, and we will have resilient applications. Easier said than done; how can we actually do this in practice? Let's look at some scenarios.
Scenario 1: Service Instance Failure
Imagine a simple scenario of a 3-tier application. You have your frontend, web server, and database, and you have traffic coming in at some edge. It could be a data center or a cloud region, into some application proxy or an ingress proxy, and then into your application. The first thing is that the application should be deployed in multiple availability zones. That's the first premise of making it more resilient. The second is to simulate faults, and harden your service code base so it is able to handle faults. Things like service mesh and Istio have capabilities where you can inject and simulate faults, and make the service ready to be more fault tolerant. Once you have applications deployed in multiple availability zones, the thing you need for failover is connectivity between the zones, so you can actually route traffic over. Those are some good practices to improve availability.
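As a rough illustration of the fault-injection capability mentioned above, here is a minimal Istio VirtualService sketch. The hostname and percentages are hypothetical, not from the talk; it just shows the shape of a configuration that delays some requests and aborts others so you can verify your service tolerates faults:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: web-fault-injection
spec:
  hosts:
    - web.example.svc.cluster.local      # hypothetical service hostname
  http:
    - fault:
        delay:
          percentage:
            value: 10.0                  # delay 10% of requests
          fixedDelay: 5s                 # by 5 seconds each
        abort:
          percentage:
            value: 5.0                   # fail 5% of requests
          httpStatus: 503                # with an HTTP 503
      route:
        - destination:
            host: web.example.svc.cluster.local
```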
Service Proxies: Route to Healthier Instance
Let's say you have a specific service instance down on a given node. It could be the database, the web server, the frontend; just take one. The approach in this resilient network is to have a service proxy next to each service, or an application proxy in front of the entire application, which can detect that a given instance is misbehaving. That can be detected through higher latency, a higher error rate, or some other signal coming from that instance, usually via the sidecar proxy running next to it. That can signal: ok, I should be load balancing to a different instance which is healthier, which has healthier compute, healthier pods if you're on Kubernetes. That is an easy way to keep availability and resiliency. The other point is that failures will happen. How do you make sure that the proxies are smart enough to have timeouts built in, and quick retries built in, so they can recover from those modes? Those are good tips and practices as well.
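As a sketch of what that looks like in Istio (names and thresholds are hypothetical), outlier detection in a DestinationRule ejects unhealthy instances from the load-balancing pool, and a VirtualService adds the timeouts and retries mentioned above:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: web-outlier-detection
spec:
  host: web.example.svc.cluster.local    # hypothetical service hostname
  trafficPolicy:
    outlierDetection:
      consecutive5xxErrors: 5            # eject an instance after 5 straight 5xx errors
      interval: 10s                      # checked every 10 seconds
      baseEjectionTime: 30s              # keep it out for at least 30 seconds
      maxEjectionPercent: 50             # never eject more than half the pool
---
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: web-timeouts-retries
spec:
  hosts:
    - web.example.svc.cluster.local
  http:
    - timeout: 2s                        # fail fast instead of hanging
      retries:
        attempts: 3                      # quick retries against other instances
        perTryTimeout: 500ms
        retryOn: 5xx,connect-failure
      route:
        - destination:
            host: web.example.svc.cluster.local
```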
Scenario 2: Service Failure
Let's say the entire service is down, and none of the instances is available in that given zone or data center. What do you do then? Then what you need to do is route to a different availability zone. It's easier said than done. For that to happen, you need the state and health of each of these services in all of the zones, in real time, in a controller that can then decide where to route traffic. You need connectivity between the zones to actually be able to route traffic over. Data and data consistency is a whole other layer of problem that needs to be solved for you to have a consistent outcome. The other point is that it is always desirable to have these running on autoscaling infrastructure, so resource capacity doesn't become an issue for availability.
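One way to express this in Istio is locality-aware failover; here is a minimal sketch at region granularity (the host and locality names are hypothetical, and outlier detection must be configured for the failover to activate):

```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: web-locality-failover
spec:
  host: web.example.svc.cluster.local    # hypothetical service hostname
  trafficPolicy:
    loadBalancer:
      localityLbSetting:
        enabled: true
        failover:
          - from: us-east                # hypothetical primary region
            to: us-west                  # hypothetical failover region
    outlierDetection:                    # required: ejecting unhealthy endpoints
      consecutive5xxErrors: 5            # is what triggers the failover
      interval: 10s
      baseEjectionTime: 30s
```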
Availability Math
One thing that we all know but sometimes forget, and it's good to put it in numbers: what is availability? Availability is defined by how much average downtime you have in any given year. We often talk about two nines, four nines, five nines availability, but really, just going from one to two availability zones significantly decreases your downtime and improves your resiliency. Even going from one to two has a very meaningful impact.
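A quick worked version of that math, assuming zones fail independently and failover is instantaneous (both simplifying assumptions):

```latex
% Availability of one zone: A. With n independent zones, the system is
% down only when all of them are down at once:
A_{\text{system}} = 1 - (1 - A)^n

% Example with two nines per zone (A = 0.99, about 3.65 days down per year):
A_{\text{system}} = 1 - (1 - 0.99)^2 = 1 - 0.0001 = 0.9999
% i.e. four nines, roughly 53 minutes of downtime per year.
```

So adding a second zone does not just halve the downtime; under these assumptions it squares the failure probability.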
Scenario 3: Application Failure
Continuing this trend, let's say instead of a service or a service instance, the entire application is down. How do you then route traffic to a completely different instance of that application? Here you need to set up two-tier load balancing, such that the layer above, in this case the edge proxy, knows which application proxy to send traffic to. What's important here is that all of the security controls and compliance controls that you have built in, and that you need to operate the application, are actually available in all of these availability zones. That's something which is done through configuration in a service mesh, and in these L7 mesh-like architectures, which can make sure the same configuration is shipped to all of them, and therefore you can guarantee the same behavior. It's easy on a diagram, but not easy for everyone to achieve a setup such that health signals are propagating to edge proxies, the proxy is making the right decisions, and you are load balancing in the correct way.
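To make the two-tier idea concrete, here is a minimal sketch of an edge-level Envoy cluster (hostnames, ports, and thresholds are hypothetical) in which a second application ingress only receives traffic once the primary is unhealthy:

```yaml
# Fragment of an Envoy (v3 API) static configuration for the edge proxy.
clusters:
  - name: app_ingress
    type: STRICT_DNS
    connect_timeout: 1s
    load_assignment:
      cluster_name: app_ingress
      endpoints:
        - priority: 0                    # primary application proxy
          lb_endpoints:
            - endpoint:
                address:
                  socket_address:
                    address: app-zone-a.example.com   # hypothetical
                    port_value: 443
        - priority: 1                    # used only when priority 0 is unhealthy
          lb_endpoints:
            - endpoint:
                address:
                  socket_address:
                    address: app-zone-b.example.com   # hypothetical
                    port_value: 443
    health_checks:                       # the health signal that drives failover
      - timeout: 1s
        interval: 5s
        unhealthy_threshold: 2
        healthy_threshold: 2
        http_health_check:
          path: /healthz                 # hypothetical health endpoint
```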
Scenario 4: Region Failure
You can elevate this problem to not just one application: the entire region is down, or the entire data center is down or out of capacity. In those scenarios, you want to route to a completely different data center. In a cloud setup, it could even mean routing to a completely different cloud. As long as you have the applications deployed in these zones, the solution is similar: you have a layer above, which has the signals for health and performance at any given time, and can make routing decisions and route traffic to optimal zones. Then from there, to optimal application instances, and from there to actual services.
Resiliency via a Dynamic Autoscaling L7 Network
In summary, my main point is that we're moving towards a world where we can have this dynamic, autoscaling, application-aware network. The reason to call it autoscaling is that all of these load balancers can also run on compute that autoscales, so they can be elastic as well, just like your compute nodes. This setup, if deployed and architected properly, can do two things: one, significantly improve your application resiliency; two, not overload your developers by making them build all of that into each of their services and applications, making it instead part of the network fabric itself. We at Tetrate do this for a living. We have a platform that enables this. Having done that in a bunch of places, we have quite a few best practices and blueprint architectures that fit well in actual, real environments.
Questions and Answers
Rettori: While you were going over the story of network evolution at Google, what came to me is, what led to the creation of gRPC and Istio? What did you have before that wasn't optimal, that then led to the creation of, initially, gRPC? What problem wasn't being solved? Maybe you want to touch a little bit on that.
Talwar: gRPC is the next rendition of something called Stubby inside Google. Stubby has been there since the beginning of Google, which is like 1999. Really, any two services could talk to each other via this Stubby mechanism. It existed for a long time and went through a bunch of iterations over maybe 12 years. The reason it was needed was twofold. At that scale, if you do JSON over HTTP, which was the classic way of doing it for client traffic, it wasn't optimal enough. Just to give you an example: just by doing protobuf, which is binary over the wire rather than text over the wire as with JSON over HTTP, you get a 10x improvement in many scenarios. That meant millions, even billions, of dollars of savings at our scale.
Then, gradually, a lot of things got built into Stubby itself, like load balancing, retries, and sending out spans for tracing. A bunch of stuff happened in there. gRPC was nothing but the next version of it, which got open sourced. The thing is, within one organization, you can be very opinionated: I'll just support three or four languages, in some cases only one, and these are my libraries. gRPC is library based, and that was ok. When we put it out there in the open, you cannot have that single-organization opinionation. Google practically runs on three languages, C++, Java, and Python, and everything is in those. When we had to go into the polyglot world, and also support a lot of existing services, that's when the need arose for something that is not library based but proxy based, and that led to Istio.
Rettori: There is a debate that with circuit breakers it's preferred to avoid backoffs or retries, and the typical argument is that they need to be implemented in the application and not at the network layer. What are your thoughts?
Talwar: These are interesting times in terms of what belongs in the application and what belongs in the network, and in many cases there needs to be cooperation. Take tracing: passing on the header values is a good example of something you have to do in the application. For circuit breakers, the core proxies, be it Envoy or other proxies, have these concepts built in, in terms of the ability to check the health of the upstream, or wherever you're sending traffic to, and defining the rules of when to break, all via configuration. Those paradigms exist in these proxies and control planes. Obviously, since this is all proxy based, it's based only on the traffic going through the proxies, in terms of the latency and error rate of the requests. They don't know other aspects of your underlying compute. Let's say your CPU is overloaded because the application is consuming it; that won't be known. What is happening more now is that some of these things are getting added on: passing on signals from your node, like CPU and memory signals, to make some of these decisions, or being able to take external signals from the application to let the proxy make decisions.
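As an example of defining those break rules via configuration, here is a minimal Istio DestinationRule sketch (the host and limits are hypothetical) that combines connection-pool circuit breaking with ejection of failing hosts:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: backend-circuit-breaker
spec:
  host: backend.example.svc.cluster.local  # hypothetical upstream service
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100                # trip beyond 100 open connections
      http:
        http1MaxPendingRequests: 10        # queue at most 10 pending requests
        maxRequestsPerConnection: 1
    outlierDetection:
      consecutive5xxErrors: 5              # eject a host after 5 straight 5xx errors
      interval: 10s
      baseEjectionTime: 30s
```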
Obviously, the application itself has the most context, but how many people are actually developing that, understanding everything from the node all the way through the different things that could go wrong? I think that's hard. The two things we are at least seeing more of are interaction between the proxy and the underlying node and application, and also the reverse, which basically means the proxy signaling to the underlying autoscaling infrastructure to scale. That is also happening more: I know the health is degrading because latency is going up, so signal down to the autoscaling infra, something like Kubernetes or just the cloud provider's instance group. That's a signal that is not being used and should be used.
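One plausible shape for that proxy-to-autoscaler signal, sketched as a Kubernetes HorizontalPodAutoscaler; this assumes a custom metrics adapter (such as the Prometheus adapter) is exposing a proxy latency metric, and the metric name, deployment name, and thresholds are all hypothetical:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: backend-latency-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: backend                            # hypothetical deployment
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Pods
      pods:
        metric:
          name: envoy_request_duration_p95_ms  # hypothetical proxy-reported metric
        target:
          type: AverageValue
          averageValue: "250"                # scale out when p95 latency exceeds 250ms
```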
Rettori: There is a perception that Istio is not being fully adopted at enterprises. What do you think? What are the things that enterprises need to realize to then take advantage of, as you call it, this smart application-aware network?
Talwar: Istio took on marketing momentum before technology momentum. That was one of the reasons for the feedback loop it got into. It's gotten a lot better now. The other thing is, it has too many knobs and too many configurations, which just makes it complicated for people to grok and adopt. You also need very clean controls on who can do what. I often tell people that unlike Kubernetes and other things like that, Istio, and service mesh in general, is a multi-persona problem, not a single-persona problem. Inside an enterprise, how does a platform team manage the gateway and manage the sidecar? The sidecar is usually coupled with the application, so how do you do application upgrades? That is for the app team. Gateways are usually managed by a different team. If you're going all the way to the edge, there are usually edge and edge proxy teams. Then security always wants to be there as somebody who at least has visibility, and in many scenarios even wants to enforce which policies are mandatory and which are optional. They even want to be in the workflow of exposing a service to the outside.
The net of that is you basically have to solve for what each team gets in terms of their views and controls, and how you make the knobs simpler to use. There are too many knobs and too much YAML, if you ask me. One thing is to just make it simple to say: this is my API, this is the behavior I want, and that should just happen. Things like Istio are just there in the platform and the infrastructure to make it happen. That's the approach we have taken at Tetrate. I think that's the long-term way, if this is to be adopted for real, at scale, and for a long time. That's how it will come to be. Like most technologies, it will become boring and not visible, and there'll be a way to just use them without having to tinker with the details.
Rettori: When we talk about service mesh, and then Istio and of course other technologies, there's always that concern about how it relates to traditional API gateways. The lines get blurred. It gets a little bit fuzzy: what is an edge proxy? What is an API gateway? Are they different? Should they be different? What are your thoughts on this?
Talwar: I'm obviously biased here. I think they should not be different. The platform we have built and deployed uses Envoy throughout: you can deploy it as a tier-1 load balancer and as an edge proxy. Each application can have an application proxy up front. Then you can have sidecars based on the same data plane. One data plane throughout, with each application, an application being a first-class concept for us, doing what it needs to do. In some cases you say, just do authentication at the ingress layer, and that's all I need; I'm not going to go into the sidecar business for a while, which is ok. Whereas someone else will say, no, I'm ready, it's all HTTP, I'm comfortable, it's not super performance sensitive, and the latency overhead doesn't concern me. You can go down that path as well.
The way I think about it is, people build services and people deploy services. You can expose them to your internal teammates and/or your partners via internal APIs. You can expose them to the public via public APIs. The controls you need are similar. That line is blurring between what traditionally used to be north-south and east-west. As people do more microservices and API contracts, you need internal API-based interactions. The difference is that for internal APIs you'll do token-based authentication, while for external APIs you will demand something like OAuth, where you need to go through a flow. For external APIs you will want WAF-style policies, like bulk protection against certain sets of IPs, whereas for internal APIs you will just say, test traffic from this team should not DoS me, so rate limit it. The scenarios are a little different, but the controls at the technical level are similar. I think it makes a ton of sense to just have it in one platform; that's the only difference.
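As a sketch of what that token-based authentication control can look like when it lives in the mesh layer (the issuer and JWKS endpoint are hypothetical), here Istio validates JWTs at the ingress gateway and rejects requests without a valid token:

```yaml
apiVersion: security.istio.io/v1beta1
kind: RequestAuthentication
metadata:
  name: ingress-jwt
  namespace: istio-system
spec:
  selector:
    matchLabels:
      istio: ingressgateway
  jwtRules:
    - issuer: "https://auth.example.com"        # hypothetical token issuer
      jwksUri: "https://auth.example.com/jwks"  # hypothetical key endpoint
---
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: require-jwt
  namespace: istio-system
spec:
  selector:
    matchLabels:
      istio: ingressgateway
  action: ALLOW
  rules:
    - from:
        - source:
            requestPrincipals: ["*"]            # allow only requests with a valid JWT
```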
In fact, the team that Istio came from was actually called One Platform, which was Google's way of saying: internal APIs, external APIs, you just tell us what you want in your APIs and these are the behaviors. All we used to do as a team at Google was take every team's API spec and what they wanted, and things would just happen. Today it's an internal API; tomorrow it becomes an external API. You could just append a few things to that API spec, and that's it. Nothing else changed in terms of rolling it out.
Rettori: The mesh of meshes, is that a thing or not?
Talwar: I don't really like that term, but the concept is true. What we do at Tetrate, and I think what is becoming true more generally in industry, is that there are three layers in this, and this has not been explained that well yet. There is the data plane, which has to be right where the traffic is. There's a control plane, which needs to be in its vicinity, in the same cluster or same VPC, not too far. Then there is a third layer, which is what we call the management plane, where you look from the top and say: what do I need to do for each of the applications across the fleet, and make routing decisions, resiliency decisions, and so on. We believe in that world, and we're building the management plane. Istio is still used as-is, as it grows in features, as the control plane in the vicinity.
Absolutely, it should be done in a way that is agnostic of compute and cloud. If I have N clusters in Microsoft Cloud and N clusters in Amazon Cloud, each of them can have Istio as the control plane. Can you really make those resiliency decisions: instead of routing to this Microsoft zone, route to this Amazon zone? People have come and asked me whether we can do that for cost, performance, security, or whatever other service they like in their cloud region. That absolutely is possible, though not super easy for people to achieve. We want to make it easy to achieve. I think we are going towards that world. Mesh of meshes sounds like a bad name, but architecturally we are going there.
Yes, it can sit anywhere. That's the beauty of the management plane. It could be in any of those places, wherever you decide it to be. The thing is, the required edge or ingress needs to be placed close to where the application is.
Rettori: I like the smart application-aware network term that you have. Do you think it can always be automatically equated to service mesh in general? If I'm not to use an edge technology for this, what's my alternative, if any?
Talwar: Service mesh became this everything term. The concept is that your network and your platform layer are smarter. For example, gRPC, which was my other baby, supports xDS, so you can just build stuff in gRPC without any proxies, ask for the same behavior from the control plane, and it all works. I think more language stacks and more frameworks will start supporting these xDS-like capabilities. Once they do, we can build this into how things are built themselves. It hasn't happened yet, but that I think is what should happen. Today it's the Envoy proxy; tomorrow it could be something even better. The concept is the same, both in language frameworks and in these proxies: make them more intelligent so that you're not doing it in the application, and you can do it consistently across your fleet without loading up the application developer. I think that concept is very useful, and here to stay.