
Life of a Packet through Istio



Matt Turner talks about Istio - a service mesh for Kubernetes that offers advanced networking features. He gives insight into Istio’s full power, and its architecture.


Matt Turner is a software engineer at Tetrate, working on Istio-related products. He's been doing Dev, sometimes with added Ops, for over a decade. His idea of "full-stack" is Linux, Kubernetes, and now Istio too. He's given several talks and workshops on Kubernetes and Istio, and is co-organiser of the Istio London meetup.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.


Turner: Hopefully I've condensed the three-hour version down enough, because I know I stand between you and beer. You've had literal Google tell you about how they literally invented microservices, and then you were meant to have the actual CTO of CERN tell you about how they actually - what's the Sheldon Cooper quote? - rip the mask off nature and stare into the face of God. But you've got me. I think Wes [Reisz] is giving refunds by the door, actually.

So who am I? I'm Matt Turner. I have done a bunch of things. I've done some dev and some ops. I helped Skyscanner move to Kubernetes - if anybody likes cheap flights. I was a Kubernetes consultant, and I worked at a startup doing service mesh stuff, focused on Istio. I'm now starting my own thing. It's meant to be in stealth mode, but the T-shirts turned up on Sunday and they're a really nice color, so I guess it's out of stealth right now. It's a cloud-native consultancy - helping people use this kind of tech.

If anybody's here to learn about Istio because you saw it in the name - I know it's a new, cool thing that a lot of people are interested in - I'll spend five minutes on the pitch, on what it does and why, but what I'll then do is dive in fairly deep on how it works. We'll try to look at the architecture, why it's built the way it is, because I think that's really interesting - which is, I think, why I was asked to give this talk. We'll do a bit on what containers and Kubernetes pods actually are and how they work, and then we'll look at that architecture from the kernel point of view, and then we'll look at the wider architecture of how you build a big distributed system out of all of that kind of stuff. If anybody's completely unfamiliar with containers and kernels and networks and Kubernetes, this might not be the right thing for you. But hopefully it'll make sense as we go along.

The original objective of this talk - and as I say, I'll try to look at it in a slightly different way today - is to see how a packet goes from left to right and traverses an Istio system, which is probably running on Kubernetes and made of the Envoy proxy that you've probably heard of. Then we look at the control plane calls that are made during that process to the Istio control plane, which is this sort of management system. The original purpose of this is to build a useful mental model for debugging Istio, should you ever hit any of its very few, very, very few edge case bugs, and for reasoning about it, which is maybe what I'll focus on today. Yes, you should probably know a bit about networking and containers, but I'll go straight into that. Oh, this is the three-hour version. We're not going to do all of that.

Context and Introduction

The Istio pitch. Why are we here? Your business wants value, right? Your business wants business value which is basically new features because that's what customers pay for. They want to get them out fast. They want quick experimentation time, quick cycle time and low risk. So we want this fast feedback loop, we want this scrum or the lean startup approach. So what did we do? We broke the monolith. There's our monolith, our single rock and we cracked it in half.

Did we get microservices out of that? No, we got a distributed monolith, which has all the previous problems and now a whole lot more, because what was a function call that could never fail is now going over a network that's probably on fire. You are left with a distributed monolith. This leads you to do a bunch of things to mitigate that. So you might have two services running in these two anonymous gray boxes representing a compute environment. Previously they would be namespaces in the same Java process, the same JVM. Now they're different processes in different containers that could be on different sides of the planet.

We started off by deferring this to a library, putting something like Hystrix or Finagle in there to get the back-off and retry and deadlines and rate limiting - all these facilities that we needed, that we heard about in a lot of the earlier talks on this track. The problem with these is that they're libraries; they're in-process. So if the library changes, you need to spin a new release of your service as well - so you'd better hope it builds at that point in time and is passing tests. These two are specifically JVM-only - Java, Kotlin, Scala only - so it's no good if you want to start writing Golang or Rust. You need another implementation of the library that does all the same things. And for anything that requires coordination - global coordination, or an on-the-wire consensus between two things - they have to speak the same wire protocol, so you can't develop these things in isolation. It quickly becomes very hard.

This is where Istio comes in. You take those same two services and you basically admit that all communication these days is HTTP, be that gRPC, which is streaming over HTTP/2, or good old-fashioned JSON REST over HTTP/1. So you put an HTTP reverse proxy by every service. This is the logo for the Envoy proxy. This thing can then do all of those network resiliency functions for you. And what Istio can do is put that in front of these services completely transparently.

Istio then has a control plane which is what we're going to look at. It has these three components that sort of program these proxies up and tell them what to do, so you're not there writing manual config and injecting it into your containers. And importantly, this thing has an API on the front so you can write configuration and the control plane will take care of it and get it to the right place and roll it out. And this API is declarative. It's a lot like the Kubernetes API. It takes YAML documents that describe the state of the world as you would like to see it.
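As a concrete sketch of that declarative API - all the names, subsets, and weights here are hypothetical - a traffic-shifting rule is just a YAML document describing the state you want, which you apply and let the control plane roll out:

```yaml
# Hypothetical VirtualService: send 90% of traffic for "service-b"
# to subset v1 and 10% to v2. The control plane compiles this and
# pushes the resulting Envoy config to every sidecar.
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: service-b
spec:
  hosts:
  - service-b
  http:
  - route:
    - destination:
        host: service-b
        subset: v1
      weight: 90
    - destination:
        host: service-b
        subset: v2
      weight: 10
```

The subsets v1 and v2 would be defined in a companion DestinationRule; the point is that you describe the desired traffic split, not the mechanics of achieving it.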

Istio bills itself as an open platform. It's an open source project to connect these services together, to secure them, to control them - as in manipulate that traffic - and to observe them. This is pretty much what it does. This is the emergent behavior of all the low-level features we're going to look at. I think Sarah said in her keynote that we should be using service meshes to give these distributed systems the network functions that they need, for free - retries and back-off and deadlines. Then I think Ben was saying that Google does actually have this stuff in a library, because they have much more control over their source code, but it's the same principle. You need that stuff. And the dream is that your service is left being only business logic.

Networking and Containers

So that was the super-fast pitch. Super-fast operating systems 201. Oh, no, we're starting already. This is our left-to-right packet - little Nyan cat packet here coming across an Istio system. The first thing it does is it hits an ingress point. So this is a request from a user out on the internet with an app or a website, and they hit an ingress point. This isn't actually too interesting, just to say there's no magic here. This ingress is a bank of Envoy proxies. It does the same as your ingress controller on Kubernetes. There's no magic. It does the dance that everybody has to learn when they first get hands-on with Kubernetes: you point your wildcard DNS record and your wildcard TLS cert at a load balancer from your cloud; it terminates things, reissues to a NodePort, in through a ClusterIP, it gets to a proxy. The whole dance. We know how this works. For the sake of 15 minutes and your beer, let's take ingress as a sort of fait accompli.

Our request has come through ingress, and it's been routed to the right service. So this is all layer seven. You get this in any Kubernetes system - your ingress controller does layer seven. These requests come in with an HTTP host and a path and various headers, and those can all be used to route the request to the right service in your microservices system - in this case, service A. So the packet moves across and it gets there. What does it find? Well, it's no big secret that we've got this Envoy proxy stuck in front of the actual service, the actual business logic. But what is going on here? What is the architecture of this?
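To make that layer-seven ingress routing concrete, here's a sketch of how it's expressed in Istio's declarative config - the hostnames, cert name, and service name are invented, and the exact TLS wiring varies between Istio versions:

```yaml
# Hypothetical Gateway: accept HTTPS for *.example.com on the
# Envoy-based ingress, terminating a wildcard TLS cert.
apiVersion: networking.istio.io/v1alpha3
kind: Gateway
metadata:
  name: public-gateway
spec:
  selector:
    istio: ingressgateway
  servers:
  - port:
      number: 443
      name: https
      protocol: HTTPS
    tls:
      mode: SIMPLE
      credentialName: wildcard-cert   # assumed cert secret name
    hosts:
    - "*.example.com"
---
# Route on host and path: /api on app.example.com goes to service-a.
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: service-a-routes
spec:
  hosts:
  - "app.example.com"
  gateways:
  - public-gateway
  http:
  - match:
    - uri:
        prefix: /api
    route:
    - destination:
        host: service-a
```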

Let's remember what a container is, and let's remember that actually there's no such thing. In the Linux kernel, you probably won't find this word in the source code. There's no first-class concept of a container in Linux. Previously, BSD had jails and Solaris had zones. Plan 9 is the best operating system - it had full namespaces - but we're not on the best timeline; we've got some student's reimplementation of 4.4BSD. So what we've got in our Linux kernel is containers made of namespaces. We've got these six namespaces that are software isolation mechanisms.

Very briefly, imagine we have a container - morally an Nginx container, called nginx. It's running Nginx, and then maybe it's running something else, some other Unix process like supervisord to keep it up. It sits inside these six isolation mechanisms. Briefly: the mount namespace isolates the mount table. It's like chroot on steroids. These processes see a different version of the mount table - a different virtual file system built up from forward slash, the root path - to things outside of this namespace. This, if you think about it, is necessary, right? Containers run from an image, so actually the very first thing you need to do is make sure that forward slash, the root of the file system, is that image, that tarball, rather than the host's disk. And then you can mount volumes into these containers. That's obviously just inserting mounts into the mount table, like you would mount a USB drive on your host operating system. That's isolated.

The UTS namespace means that this container can have a separate hostname and DNS domain to the stuff outside of it. The PID namespace means that, from inside, you can't see the process IDs on the outside. You shell into a container, run a shell, you do ps and you see like two things: you see bash and ps. You don't see any of the stuff from the host system. The user namespace isolates the user IDs as well, so user 1000 in here isn't the same as user 1000 on the outside. Can't start mapping pages. User namespaces - we won't go there. The IPC namespace stops you sending System V IPC messages or sharing System V memory across this boundary. Then, as will become interesting for us, the network namespace isolates "networking" - and we'll see what that means, but it basically means that there's a different set of interfaces, a different set of IPs, inside this namespace to outside.

What is a Kubernetes pod? Because what this packet hit was a Kubernetes pod, containing two containers. Because that's what Kubernetes pods are: several containers kind of stuck together. So a Kubernetes pod is quite interesting. What we build, to give us this dev experience, from the primitives that we find in the Linux kernel, is kind of shown by this diagram here. It's two containers coupled together. This is why you have to deploy a pod as an atomic unit - as the atomic unit of scheduling. It has to go onto one machine, because the containers have to share a kernel, because these two containers actually share some of these namespaces. They both have a separate mount namespace, as they have to, because they run from separate images. They also both have separate UTS namespaces, so they can have different hostnames.

But they share a PID namespace, so the processes in one can see the processes in the other; they can signal them, talk to them. They share a user namespace, so that Unix file permissions work, because they agree on the set of users and groups. They share an IPC namespace, if you want to use any of the System V mechanisms. And importantly for us, they share this network namespace, which means they have one view of the network. What does that mean? It means they've got one interface. It's actually one end of a veth pair. I think Wes likes the long explanation of veth pairs and virtual networking I gave in the three-hour version. It doesn't fit here, but this is an interface that we've renamed to eth0, so it looks like a standard, actual PCI network card. This thing kind of looks like a small virtual machine.

But anyway, we have our one interface with one IP address, shared by all of these processes in these two containers. We have loopback. We also have one set of routes, one route table. We have one set of sockets, one domain for Unix domain sockets. And importantly, we have one set of iptables rules. A process in here could set up an iptables rule that, say, dropped all traffic, and that would drop all traffic coming in and out of the other one as well, because they're in the shared network namespace, although they're allegedly two separate containers. So Nginx can, for example, bind to 8080/TCP, and then FluentD couldn't, because there's one socket space - address is in use.
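That shared network namespace is exactly what you get when you list two containers in one Kubernetes pod spec. A minimal sketch (names and image tags are assumptions for illustration):

```yaml
# Both containers get the same eth0, the same pod IP, the same
# loopback, the same socket space, and the same iptables rules.
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  containers:
  - name: nginx            # primary container, binds 8080/TCP
    image: nginx:1.17      # tag assumed
    ports:
    - containerPort: 8080
  - name: sidecar          # e.g. a log exporter or an Envoy proxy
    image: istio/proxyv2:1.4.0   # image/tag assumed
```

If the sidecar also tried to bind 8080/TCP, it would fail with "address already in use", because there is only one socket space between them.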

For our purposes, for building this service mesh, we can do more interesting things. Imagine we replace that FluentD, which was a sidecar log exporter, with a proxy like Envoy, and then we set up some iptables rules to say, "I want to intercept all traffic coming in and out of this port 8080; I first want it to come into Envoy, and then Envoy is going to punt it back out to, say, loopback." And then there's another rule that says, "Okay, from loopback you can finally go to Nginx." This is how we leverage this shared iptables system to do this transparent interception of network traffic.

This is called the sidecar model. We originally had FluentD. I didn't point it out, but there's this idea that, morally, a process should do one thing and it should do it well - that's the Unix philosophy. Morally, a container should do one thing too. It should have one primary process, like an Nginx. If you're putting a database in here, you're doing it wrong, but it's okay to have ancillary services. The same with the pod: it should have one primary purpose. This thing presumably serves HTML, but it also has this sidecar giving extra functionality that we might want.

Sidecar injection is a big topic. Liz Rice, who works for Aqua, does this amazing talk where she basically builds containers from scratch. She live-codes a couple of hundred lines of Golang and makes all the system calls to build these namespaces and make these containers. Watching this go through is really interesting. But basically, as the kubelet starts to build these pods up, it creates these namespaces. Then it goes through this list of init containers, which are more container images, each containing a one-shot process that does something and quits.

But what's interesting is that they affect this namespace. If something writes iptables rules or changes the route table - just like running route on your command line - then when that userspace command quits, the route table is still changed. The kernel remembers that. It's a persistent thing, and it then affects every other process subsequently in that network namespace. So the first init container just turns on core dumps. You can draw your own conclusions about perceptions of stability of the system from that. I will not comment.

The next one is more interesting, and it runs a very long shell script that I won't go through, that basically sets up all those interception rules that I talked about. This is what makes it transparent: this comes along and sets up these rules, and then your primary container has no idea that it's in this intercepting environment, and it doesn't need to contain any of the networking logic we've talked about, for the retries and the back-offs that we want.

This is an init container basically because it's a privileged operation to manipulate iptables rules. This thing runs with the CAP_NET_ADMIN capability, and then obviously when it's gone, that's dropped. Then we can start the actual two containers. There's a bunch of details: Envoy listens on port 15001; the iptables rules have persisted from the manipulation by that init container, and they reroute the packets through Envoy.
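In the pod spec, that injected init container looks roughly like this - a sketch based on what Istio's sidecar injector produces, with the image tag and exact flags varying by version:

```yaml
initContainers:
- name: istio-init
  image: istio/proxy_init:1.4.0    # image/tag assumed
  # Redirect all inbound and outbound TCP traffic to Envoy on
  # port 15001, skipping traffic from Envoy's own UID (1337)
  # so the proxy's outbound calls aren't intercepted in a loop.
  args: ["-p", "15001", "-u", "1337", "-m", "REDIRECT", "-i", "*"]
  securityContext:
    capabilities:
      add: ["NET_ADMIN"]           # needed to write iptables rules
```

The container runs once, writes the iptables rules into the pod's shared network namespace, and exits; the rules persist for every process that starts afterwards.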

That's how the very early interception works, and that's how we build this up from those namespace primitives in the kernel - and cgroups as well. I didn't mention them, but you've probably heard of cgroups, which are the kind of hardware side of that isolation mechanism. A container or a pod also exists in a bunch of cgroups, which limit its visibility of hardware devices and limit its rate of access. So you can limit network bandwidth and CPU and memory usage with cgroups. Together they provide hardware and software isolation.

Pilot and Routing

So, how am I doing? What happens next? The next thing is that maybe the packet bounces in through Envoy into service A and back out again. If service A has just crashed and never responds to Envoy, Envoy will maybe wait a second and then just return a 503 on your behalf, or whatever you've got it configured to do. But service A is going to issue another call in the back end; it wants to talk to service B. The first thing it's going to do is service discovery. How does it know where that service B is? You're probably in Kubernetes - as I say, I'm assuming you've got a bit of Kubernetes knowledge here. You've probably got a Kubernetes Service fronting all of the pods that comprise service B. You will have a cluster IP, as it's called - a virtual IP, a VIP - for that service. So you could just fling packets at that.

The problem with that is that it's then the Kubernetes proxy that does the back-end selection, the load balancing, and it really has no idea what's going on. It can't do a great job - it just does round robin. But we can do better than that, right? That's part of what Istio is for, because, like I said, the ingress is layer seven and does host- and path-based routing and can look at headers and make much more intelligent decisions. In an Istio system, that's not the only thing that's HTTP-aware: all of these sidecars are. What this Envoy wants is the full list of potential back ends, and it wants to be able to choose one itself rather than just throwing it at a round-robin thing. It wants to be able to talk to the other Envoys and do a least-weighted or a geographically-closest or something like that. So it needs to be able to find out where all of these things are in order to do that.

So what do I do? I'm probably on a Kubernetes system. I can ask Kubernetes. So we can get the Service for service B, and we see that, sure enough, it's got one cluster IP. This is a virtual IP, and this just gets round-robined between the real back ends. That's not really good enough. The way kube knows where to find the back ends, though, is this label selector. In this case we've got app equals service-b, because that's how the pods are selected. So can I do anything with that? Well, let's have a look in DNS first. service-b - that's the kind of name I've given this thing. Again: one A record, in the service IP range. Not a pod IP.

Indeed, though, I can go get all the pods, and I know we're looking for service B, so maybe I grep for that, and I find all these things. The labels aren't shown, but I now find that I do have three pods, and they all start with service-b. But this just isn't sufficient. Why? Because services are based on these label selectors, which can be arbitrarily complicated. You can't just go fumbling around like this. This isn't sufficient either. We need a way to always get the right set of pods. We need to take that label selector and run it ourselves, basically.

Kubernetes luckily gives you an API endpoint to do that. Unsurprisingly, it's called Endpoints. So I can ask to get the endpoints for service B, and now I get three IP addresses in this example, and they are in pod IP ranges, not service IP ranges. This is an example where the execution environment I found myself in has a service discovery mechanism - it's actually got two in this case. It's got a naïve one, which is, "Oh, you want to get packets to something that quacks like a service B? Great, I can do that for you. I offer you a layer of abstraction." And it's also got an, "Okay, if you want to kind of lift the lid off, if you know what you're doing, if you want to be making the intelligent networking decisions, I can give you the actual IPs of the workloads, if that would help." In Kubernetes we can hit that endpoint, and in other systems there are similar things. You might imagine raw DNS on VMs - you'd just look for SRV records or something.
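The Endpoints object for the example above would look something like this - the IPs are invented for illustration, but note they're pod IPs, not the service's virtual IP:

```yaml
apiVersion: v1
kind: Endpoints
metadata:
  name: service-b
subsets:
- addresses:
  - ip: 10.1.0.27     # pod IPs, assumed for illustration
  - ip: 10.1.0.28
  - ip: 10.1.1.5
  ports:
  - port: 8080
    protocol: TCP
```

Kubernetes keeps this list in sync by continuously evaluating the Service's label selector against the live set of pods, which is exactly the "run the selector for us" mechanism described above.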

I can get a longer form of this. For each entry I get the IP address, and I also get interesting things like what node it's on, so I could go and look up what region and what zone that's in, and see how close it is to me. I find out what ports are expected for this. All of that useful information is in this service discovery mechanism.

I need to take that service discovery information, and I need my Envoys to have it. This is maybe one of the first talking points about the Istio architecture: I want this Envoy to have this preprogrammed. I don't want it to have to reach out every time. It can't look in DNS; that's not good enough. I don't really want it calling Kubernetes to hit the Endpoints API every time, because that's going to really load the system and slow Envoy down. So what I really want is to have that service discovery information ready and available in Envoy so it can start doing things. We introduce our first Istio control plane component, which is this thing called Pilot, that does exactly that. Pilot's the thing that configures these Envoys and pushes configuration to them, so that, as I say, you don't have to - because out of the box it configures them to do default retries and timeouts and all these kinds of things. So that's the Pilot component.

How does it get its configuration? Well, it's actually got a bunch of adapters. Pilot is the interface to your environment. It knows how to talk to Kubernetes service discovery to go and find out where your pods are. If you're not running in Kube, it will also talk to Consul; it'll also talk to ZooKeeper. It can actually talk to all of them at once, so it can build a shared database: "Here are all the endpoints in Kubernetes, and then you also told me about a Consul system managing your legacy VMs. I've ingested those services as well and synthesized that information together." It then churns that data and pushes it to each of the Envoy proxies.

Another interesting point is that that API is what Matt Klein at Lyft, who wrote Envoy, calls the data plane API. They've gone to lengths to standardize and open this API, the idea being that in a system like this - this is open source, this is free software - you are free to swap Envoy out for anything else that implements the data plane API. And I think maybe HAProxy does now. So that's another interesting part. This is, as I say, a push model. Actually, Pilot will do its best to establish watches on these back ends, so it gets kind of long-polls when they change, so it doesn't have to spam them - though it'll spam them if it has to. Then Pilot churns that config information, and if and when it changes, it pushes it to Envoy asynchronously, so Envoy has new configuration ready to go. Envoy doesn't have to poll. So that's how we do this remote, asynchronous ingestion of service discovery and push it into the Envoys.

What can Pilot do? Well, it's the thing that configures your proxy in a static way. It can affect the routing of the ingress, because that ingress controller is provided by Istio and subject to the same configuration. It's the thing that'll do your traffic mirroring - say, mirroring prod traffic into staging for testing. It'll do your A/B testing, traffic shifting, canary deployments. It'll set up circuit breakers, it'll set up fault injection - anything that that proxy can be told to do. "One percent of the time, I want you to return a 503, just because. And if the back end takes more than so-many seconds, then throw a circuit breaker and return this default."
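Those last two examples look roughly like this as Istio config - a sketch with invented numbers, using v1alpha3-era field names (the fault percentage moved to a `percentage` field in later API versions):

```yaml
# "One percent of the time, return a 503 just because" -
# plus a timeout so a slow back end fails fast.
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: service-b
spec:
  hosts:
  - service-b
  http:
  - fault:
      abort:
        percent: 1
        httpStatus: 503
    timeout: 2s
    route:
    - destination:
        host: service-b
---
# Circuit breaking via outlier detection: eject a back end
# after 5 consecutive errors, for at least 30 seconds.
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: service-b
spec:
  host: service-b
  trafficPolicy:
    outlierDetection:
      consecutiveErrors: 5
      interval: 10s
      baseEjectionTime: 30s
```

Pilot compiles both documents into concrete Envoy listener and cluster configuration and pushes them to every sidecar.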

Mixer and Policy

We found service B. Here it is. There are three instances of it, three pods, in an amorphous gray compute blob - remember, they may not be on top of each other. Can the packet now traverse? Well, not necessarily. There are a few more checks we need to make, a few more things that Pilot can't configure the proxy to do ahead of time. We need to check that there's no security policy in place that says that A isn't allowed to talk to B. We need to check there are no rate limits that have been exceeded. So this isn't the kind of stuff you could preprogram the proxy with. It needs to go and ask.

So unicast rate limiting - a local limit in one proxy - is easy. We could tell this Envoy that it's got 1,000 QPS over here. Well, what does that mean? This one instance of service A gets 1,000 QPS to what - each instance of service B, or all of them? And then what if other instances of service A are calling this? What if there's a service C that's calling this? So, to do global rate limiting - to basically say, "I've load tested my new service. SRE are happy to take it over. We know it hockey-sticks at 5,000 QPS per pod. I've got three, so I want a global rate limit of 15,000 QPS," from wherever - and, by the way, service A is a higher priority than service C - that is a more difficult thing to do, and it requires a few extra components.
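With Mixer in the picture (introduced just below), that global limit is expressed as Mixer configuration rather than per-proxy configuration. A heavily simplified sketch using the in-memory quota adapter - all names and numbers are invented, and a real setup also needs a QuotaSpec binding the quota to services:

```yaml
# Handler: the memquota adapter holds the single global counter -
# 15,000 requests per second across all callers.
apiVersion: config.istio.io/v1alpha2
kind: handler
metadata:
  name: quotahandler
spec:
  compiledAdapter: memquota
  params:
    quotas:
    - name: requestcount.instance.istio-system
      maxAmount: 15000
      validDuration: 1s
---
# Instance: which dimensions the quota is counted over.
apiVersion: config.istio.io/v1alpha2
kind: instance
metadata:
  name: requestcount
spec:
  compiledTemplate: quota
  params:
    dimensions:
      destination: destination.service.name | "unknown"
---
# Rule: wire the quota instance to the handler.
apiVersion: config.istio.io/v1alpha2
kind: rule
metadata:
  name: quotarule
spec:
  actions:
  - handler: quotahandler
    instances:
    - requestcount
```

Because Mixer holds the counter, every Envoy checks against the same budget, which is exactly what a per-proxy limit can't give you.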

Introducing Mixer, the next control plane component. This thing does those policy checks about security and rate limits, and it's also the thing that gathers telemetry. Not only have we moved retries and rate limits and whatever out of our service by pushing them to Envoy; also, because this thing's on the wire, because every packet goes through it, and because it understands layer seven, HTTP, it can generate logs, it can add trace headers and generate trace spans, it can generate metrics for us. Again, something that can be taken out of the service - Envoy implements that for us, and then Mixer is just an aggregation point for it, again to plug it into the environment.

We now take a digression. This is where we get into some architectures you probably haven't wondered about. I've been talking about layer seven - treating these things like a layer seven network, routing this stuff based on HTTP information. When you're doing IP, layer three, layer four networks, you have this thing called the IP 5-tuple. This is the set of five data points that are sufficient to identify an IP flow, an IP connection: the source address and source port, the destination address and destination port, and the protocol - that being UDP, TCP, that kind of IP protocol. So with these five you can identify any TCP stream, any UDP connection between endpoints.

The way you build these big IP routers - the big systems that do internet backbone kind of stuff - is with this segregated architecture. They have a control plane and they have a data plane. The control plane ingests all the information it needs to make routing decisions, from BGP and OSPF (Open Shortest Path First), and also local protocols like spanning tree and ARP. So all of these different pieces of information that come together to tell a big iron router where to send a packet all come into this control plane, which is a general-purpose computer, and it builds this thing called the routing information base, the RIB, which is like a SQL database. There's a different data schema for each one of these protocols. They all get put into tables with that schema, and there are these big JOIN statements that merge them all together, apply priorities, and work out what decisions to make.

You do that on this general-purpose, asynchronous computer with no real hard deadlines. Every time one of these protocols gives you a new piece of information about your topology or your peers, you put it into your RIB and you churn it and compile it. What you emit is entries for your forwarding information base. This FIB is much more like a NoSQL database. It's a bunch of denormalized tables that are heavily indexed - they're all meant to be constant-time lookup. So as soon as I get a packet, I can look at that IP 5-tuple and say, "Okay, which TCP connection is this? Oh, that's your current YouTube stream, because it's your IP - your browser's IP and port - hitting YouTube on port 80, protocol TCP."

I can look that up in constant time, because there's probably a table keyed on it, and I can then make a really fast decision about where to send it. I don't have to do all that crunching involving the business logic that understands all of these protocols, and I don't have to call up into that database every time. So this is pushed into the FIB, which is part of what's called the data plane. Actually, that data plane itself has - and this is getting into implementation details - what are called slow paths and fast paths.

So if this packet has very recently been seen, the actual interrupt handler for the network card can probably deal with it, because it's got a small, fixed piece of memory, it's got a small cache. It can cache parts of the FIB. It knows it's done an access control list check in the last 100 milliseconds, so it still considers that information good, and the top half of the interrupt handler can probably just punt that packet without doing anything. If that information isn't in that small cache, or if you need a few more decisions taken - maybe checking ACL information in a different table than this FIB, or something - you might actually have to call into the kernel. You might have to come out of the interrupt into the kernel proper, somewhere you can allocate memory; for example, you might actually have to call into a kernel module. And if that can't handle it, the architecture of these systems is that you punt over a socket. You actually get into user space, where you really can do anything you want.

These things have slower paths and faster paths, based on the locality and the recent validity of cached information, but they all access smaller or bigger parts of this forwarding information base, which is this denormalized, indexed store. Why the aside into an IP router? Why do we care about big iron boxes? Because I think this actually looks very similar to the Istio architecture. I would say that Pilot is your control plane. It's your RIB that ingests all of these service discovery protocols and all of the user configuration that tells it who's allowed to talk to what. Then it compiles that configuration and punts it off to Envoy, which is the thing that actually has the "if I see this set of headers, I need to send it over there; if it's literally this path, it goes to service B" kind of thing. So Envoy, to me, is the data plane - but it's the fast path of the data plane, because there are some decisions Envoy can't make on its own. For example, applying a global rate limit can't be done locally; it can't be done in Envoy's equivalent of the little interrupt handler, because we don't have all the information we need. We need to go and coordinate with some other parties.

To me, Mixer is not actually control plane, even though it's drawn there in the diagram. To me, it's the slow path of this data plane, because Mixer is online: it's part of every packet flow as far as I'm concerned. So where would we draw Mixer? I would take it out of its box and put it down here. As I say, there are two things it can do. It can do the checking: is this packet allowed to traverse, based on security rules modeled as RBAC, based on global rate limits, and Mixer's the thing that holds that counter. And then, as I say, it also sees all of this telemetry information. It's an aggregation point and it's also an adapter. If you want your metrics to go into Prometheus and then also into CloudWatch Metrics, and you want your logs to go into Elasticsearch, you just tell Mixer where those things are; it gets everything from all the proxies and it will talk to those backends for you.

What's interesting about this is, I said it's on the data plane, I've said it's on the hot path. That's not entirely true. It's an architecture diagram, right? You're a senior engineer, you're in a design review: that looks like a single point of failure, maybe. It certainly did to me to start with. But there's a whole bunch of implementation details that mitigate that. Envoy obviously calls to Mixer, but it uses what they call a fat client; there's quite a lot of code in this Envoy plugin that calls to Mixer. Basically what that means is that, firstly, the reporting stuff, the telemetry information that's sent up, is batched, sometimes aggregated, and it's asynchronous. It's not on the main thread. It gets off the main Envoy thread straight away, and then if you can't reach a Mixer, if it's being slow, it blocks a different thread, and that thread times it out, and that's asynchronous too. So simple batching and asynchronicity take it off that hot path, off the main worker threads in Envoy.

The checking is even more interesting. A request will come in: I want service B, I'm getting this path, and I'm this user agent. Say we're blocking some buggy version of IE that's just sending us malformed requests, a query of doom for our system. Your service A is allowed to access service B on this path, as long as it's not IE. So the first request hits Envoy, and the fat Mixer client talks to Mixer and says, "Well, I've got all these headers. Am I allowed? Yes or no?" And Mixer says, "No, you're not. Drop that packet. Don't send it across."

"And by the way, you can cache that, and you can cache it for 500 milliseconds or 100 requests, whichever comes first. And by the way, the key for that cache is just the user agent." You send Mixer the user agent and all the other headers and the path and the host, but it's telling you that IE is just blanket banned: when it did all of its machinations, it decided to ban the request based on that user agent header alone. So the Envoy fat client can put that "no" in its own cache just under the heading of that user agent. If the IE user agent ever comes up again, it doesn't matter what host or path it's going to: just drop that packet. So Mixer gives an optimal cache key back to Envoy, and it says, "This is valid for 100 milliseconds or 50 requests. And if that cache expires and you can't reach me to get another answer, this thing fails open or fails closed," depending on how you've got it set up, maybe on whether it's a security mechanism or a soft rate limit.
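The Mixer-backed policy path had its own adapter configuration, but a blanket user-agent ban of the kind described can also be sketched in Istio's routing API. A minimal sketch, assuming the v1alpha3-era API and a hypothetical service name and regex:

```yaml
# Hypothetical sketch: reject requests whose User-Agent looks like old IE,
# regardless of path, before they ever reach service-b.
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: block-old-ie
spec:
  hosts:
  - service-b
  http:
  - match:
    - headers:
        user-agent:
          regex: ".*MSIE [67].*"   # illustrative pattern
    fault:
      abort:
        httpStatus: 403            # fail closed for matching clients
    route:
    - destination:
        host: service-b
  - route:                         # everyone else passes through
    - destination:
        host: service-b
```

The difference from the Mixer flow is that this rule is compiled by Pilot and evaluated entirely in Envoy, with no online check; Mixer was for decisions that needed shared state, like global rate-limit counters.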

All of those implementation details go into hopefully making this thing that looks like a single point of failure actually a more resilient system, because Envoy is preprogrammed by Pilot to do what it can, and then, in a way, it's almost preprogrammed by Mixer too. If Envoy can get even one answer out of Mixer, it's got this preprogramming, which is "yes for now, but after that you've got to fail closed if you can't reach me". So it's basically a no, because this is a security thing, so we're going to err on the side of caution. If you can reach me, and I can validate that all the rules are in place, then I might say yes; I might open the gates for 100 milliseconds. In that sense it's almost preprogrammed, and that actually makes the system more resilient, even though it might look like a single point of failure. So, yes: Mixer can do this checking of ACLs and authorization, it can do rate limiting, and it can do reporting of logs and metrics and tracing.

Can we finally traverse? Maybe not. This lady here is called Eve. She's interested in dropping in on our packets and hearing what they have to say. How do we mitigate this? Well, we stick it in an mTLS tunnel; we encrypt it. When your browser talks to an origin web server, you use simple TLS, right? The server presents a certificate. You trust that because it's signed by a root authority whose certs you've got installed. That gives you encryption, and it gives your browser verification of the identity of the origin server. But it doesn't give the origin server verification of the identity of your browser. It doesn't let it know who you are. You could be anybody, because you're not presenting a certificate. That's why you have to log in to Amazon: you have to use a different form of credentials, a username and a password.

Because we've got control of all of this and because it's between two services that we control, we can actually do mutual TLS. You not only get that encryption on the wire, but you get strong verification of the identity of both ends. In order to do that, they need certificates, mutually trusted certificates. This is the third control plane component, a thing called Citadel, which issues those certificates to the Envoys. It pushes the certs out and they're quite short lived and it renews them quite often.
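In 2019-era Istio, turning this on was itself just more config pushed through the control plane. A sketch, assuming the `authentication.istio.io/v1alpha1` API of that period and the `default` namespace; the server side requires mTLS, and the DestinationRule tells client-side Envoys to present the Citadel-issued certs:

```yaml
# Sketch: require mTLS for all workloads in a namespace...
apiVersion: authentication.istio.io/v1alpha1
kind: Policy
metadata:
  name: default
  namespace: default
spec:
  peers:
  - mtls: {}
---
# ...and tell clients to use the mesh-issued certificates when calling them.
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: default-mtls
  namespace: default
spec:
  host: "*.default.svc.cluster.local"
  trafficPolicy:
    tls:
      mode: ISTIO_MUTUAL   # use Citadel-issued certs, not user-supplied ones
```

Note the application containers never see any of this; the certificates live in the sidecars, which is what makes the encryption transparent.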

There's a whole bunch of stuff again that I don't have time to go into about how Envoy calls Citadel and says, "Hey, I'm service B. Can I have a certificate to assert that?" Citadel has to trust that Envoy, right? Your security chain's only as strong as its weakest link. So actually there's a side channel through to an agent that runs on the node where Citadel can verify that. It doesn't do much today, but there's a whole bunch of work going on to have that side channel agent check the binary signature and the binary hash of the service binary, the Docker daemon, the kernel, talk to the TPM, the BIOS, all of these security vectors will be verified. Those attestations go to Citadel. Citadel then says, "Oh, yes. I know you’re service B. Have a certificate to prove it." Service A then accepts the cert.

So that's kind of the third part of the architecture. If you see Pilot as a reactive config compiler and pusher, and Mixer as the slow path of the data plane, Citadel is like a batch job, I guess, something that runs in the background. It's almost like the Let's Encrypt agent, whatever that's called: it just keeps rotating your certs. So that's the third part of the control plane, and again, it's got a slightly different model.

I'd say we're there. The packet can reach; it's gone left to right. We've seen how, and we've seen all of the control plane components that it hits along the way, and what they all do. Just a few more things to say on the subject. There's also an egress controller, which is ingress in reverse: another bank of proxies controlling your access to the internet. This isn't normally done, but if you think about it, your average back-end microservice probably doesn't want to talk to the internet. It should only be talking to other microservices. It might need to talk to databases and queues and stuff from your cloud provider, from your PaaS. It almost certainly shouldn't be accessing Russian IP ranges, so you might want to block that by default. You also might just have used an Ubuntu-based image because you were lazy, and the damn thing's trying to update itself in the background. Just stop it from doing that.
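The block-by-default posture described here is usually done by setting the mesh's outbound traffic policy to registry-only, and then whitelisting individual external hosts with ServiceEntry resources. A sketch, with an illustrative hostname:

```yaml
# Sketch: with outbound traffic locked down (outboundTrafficPolicy mode
# REGISTRY_ONLY in the mesh config), explicitly allow one external API.
apiVersion: networking.istio.io/v1alpha3
kind: ServiceEntry
metadata:
  name: allow-external-api
spec:
  hosts:
  - api.example.com        # illustrative; the one host services may reach
  location: MESH_EXTERNAL  # this endpoint lives outside the mesh
  ports:
  - number: 443
    name: https
    protocol: TLS
  resolution: DNS
```

Anything not covered by a ServiceEntry, including that Ubuntu image phoning home, simply fails to connect.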

So egress control is provided by the same Envoy proxy again, under the control of the same control plane, with the same set of config documents: the config applied to the ingress Envoys is applied to these as well. We actually need to get configuration into this system, and so Pilot takes the config. It takes the information from all of these service discovery mechanisms, but it needs to mix that in with what the user wants. The user's got to say, "Well, I want this particular rate limit, and I want this fault injection, but only between staging service A and staging service B." So the user has to get configuration into this as well.
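That kind of user intent is expressed as routing config too. A sketch of a fault-injection rule scoped to traffic from one service to another, assuming the v1alpha3 API; the namespace, labels, and service names are made up for illustration:

```yaml
# Sketch: delay half of the requests from staging service A to staging
# service B by five seconds, leaving all other callers untouched.
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: staging-b-fault
  namespace: staging
spec:
  hosts:
  - service-b
  http:
  - match:
    - sourceLabels:
        app: service-a     # only traffic originating from service A
    fault:
      delay:
        percentage:
          value: 50
        fixedDelay: 5s
    route:
    - destination:
        host: service-b
  - route:                 # everyone else gets normal routing
    - destination:
        host: service-b
```

Pilot merges a document like this with what it learned from service discovery and compiles the result down into the Envoy route tables.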

Istio is normally run in Kubernetes, and it currently hijacks the Kubernetes API server to do this. You use kubectl: you write these YAMLs that look like kube configs, and you use kubectl to pump them into the kube API server. Through various hooks and hacks, Istio just goes and reads those, and Kubernetes uses its own etcd instance, its key-value store database, to persist that data.

What Istio is doing soon (this is in development at the moment, and there was a small change to this slide, which was to name this component Galley) is writing its own component to take user configuration, validate it, persist it, and send it into Pilot, and that'll just be another stream of information, like the service discovery is. This, I think, completes the picture for me. This gives us the full three-tier architecture. I say "three-tier architecture" as if it's the 90s; I guess everybody's thinking Oracle DB and PHP and all that horribleness. And sure, that was a thing, but actually it didn't serve us that badly. In a lot of cases, we got ourselves ORMs, we got ourselves schema migration tools. We applied science to it, computer science at least.

But now, with Galley, I think you've got that same sort of three-tier model, and it does fit, almost. You've got a management plane now, and then you've got a control plane, and then you've got a data plane, much like this would be the UI of your web app, this would be the execution tier, and this would be the database. This thing, the management plane, is optimized for user-friendliness. It doesn't need to be fast; it can operate on human time. It doesn't need to be particularly highly available. We just optimize for that user-friendliness, and all it does is take user input, present it in a nice way, validate it, and store it in a very resilient way.

Then you have a control plane which, if you like, actually does most of the work, in that it implements most of your business logic, most of your actual value. The complicated business logic is in this control plane, as it would be in the execution tier if you were doing your three-tier web app. And these things, as a group, all the replicas of them, are optimized for concurrency and for availability, because that's what a control plane is doing in your system. And then they push configuration to the data plane; this is maybe even your database. As we know, Envoy does all the heavy lifting, but it does it in a very dumb, preprogrammed way, in as dumb a way as possible, because we just want it to work. These things are optimized for latency and for throughput, and the way to do that is to make them dumb and to give them these pre-indexed, pre-chewed configurations.

To me that's the equivalent of putting views and indexes and stored procedures into your database. I'm sure we've all seen applications that are implemented entirely in stored procedures, and that's a nightmare, that's an anti-pattern. But a little bit of a stored procedure to give you a wrapper and some transactions around updates over several tables, or a view so you don't have to fetch a bunch of tables and then do the joins yourself in Java code up here: that is a legitimate use of pushing things to the data plane. And I think you can see analogies for that now with Cloudflare Workers, and with eBPF, which is basically Lambda for the Linux kernel. All of these systems have these little hook points where you can add little bits of code that need to get run all the time, need to be highly performant, and need to scale with the data plane, optimized for latency and for throughput.

That's me trying to fit into the architecture track, maybe. That may be a bit of a squint, but I actually think that model works quite well. I think the analogy to the router, with the control plane and the data plane, is there. To me, Galley, or currently what the Kubernetes API server does, is a good approximation to a management plane. So that's the architecture of Istio. That's how it works. Its reason to exist is that it's a service mesh, right? It's those network functions taken out of your service, or it's an HTTP-addressed overlay network, whatever you want to call it.

We heard the pitch for what it does; this is the way it's built. I hope I've explained why, with some parts of the control plane being batch jobs, some being online things, some being compilers. So, yes, I don't think I've got any more slides. Hopefully, that was interesting. I took you through the introduction; we did a bit of how you use the Linux kernel primitives to build something a lot more emergent, like a Kubernetes pod, which is containers (and then, what are containers?). And then, given that ability to transparently intercept traffic and do intelligent HTTP-aware things with it, how do we build a distributed system across an entire region, or maybe the world, give it a consistent configuration and a consistent set of things that you want it to do for you, and keep it secure? That's really all I wanted to say.


Recorded at: May 06, 2019