
Istio - Weaving the Service Mesh

Summary

Louis Ryan talks about Istio, a tool which provides a common networking, security, telemetry and policy substrate for services, called a 'service mesh'. He also talks about how the service mesh helps enable the transition to microservices, empower operations teams, adopt security best practices, and more.

Bio

Louis Ryan is a core contributor to Istio and gRPC, and a Principal Engineer at Google.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

My Google Career

So I have a quick slide here. C'mon, why is that not doing what it is supposed to do. Okay, one second. Sorry. There we go. So I'm going to tell you about my career at Google in one slide. This slide spans 10 years. So, yep, it is -- it is kind of sad sometimes. There's actually a little bit before this, but it is probably best not to talk about what that was. Otherwise, I will just get sad.

And so, I worked on API infrastructure at Google. I have worked on API infrastructure for 8 years now. When I started working on it, this was the general architecture. You know, Google: lots of HTTP traffic comes in, there's a big reverse proxy that sits in front of everything, it routes traffic to a bunch of services, and some of those services implement public RESTful APIs using a library, GData. Have any of you used AtomPub, before JSON was cool? Probably best that you didn't. Okay.

And so, you know, Google is happily building APIs, and when I say happily, not really happily at all. And a couple of things happened, but the biggest thing that happened was smartphones. All of a sudden, a whole bunch of teams at Google needed to build a lot more APIs that served a lot of traffic, and the existing API infrastructure was not serving the need. They needed more sophisticated models: quotas, denial-of-service protections, terms of service; if you are using the API, you should say what you are using it for. There were concerns that were not being dealt with, so we built up a team to build a lot of that stuff out and we put the functionality into a middle-tier proxy, called the API proxy, inventively enough. And that worked great, and it worked well for several years; it scaled up and we were serving lots of traffic.

But there were a couple of problems. There were services that were not APIs, where you paid for the resources you used, and we needed something to measure and track that; and there were properties that we wanted to track that were not APIs. So we needed a centralized way to track those behaviors, something that allowed people to funnel usage data into it for billing. So we created a thing called the control plane, and this worked fine, and then something horrible happened, and that's the cloud.

At some point, that was way too slow. Having this middle proxy in the way, in the middle of the data path, was too slow, too expensive; we built that thing in JavaScript, and it was a single failure domain. There were a million APIs coming through the proxy, and that caused problems. I have taken out significant portions of all of Google's API traffic by pushing config; config is risky, don't push it everywhere at once. So we needed a new version of the API proxy, and we wrote it as a sidecar proxy in C++, and it runs on every job that actually implements the service. And you will notice that the protocols started to shift, they look like our internal protocols. The truth is that cloud and Google are not really all that different, right?

Everything is on the same physical network; your workloads run on our machines, just like our workloads run on our machines, and they talk in the same ways. This was a trend going on; we had to support orders of magnitude more scale, and reductions in latency and cost. If you are going to serve an API like Cloud Bigtable, which can serve millions of writes per second, you cannot run it the same way as a centrally managed JavaScript job. So that's my career for the last 10 years doing this stuff, well, really 9; I spent the last year doing something slightly different.

Cloud -> Internal & External Convergence

So, you know, I mentioned a little bit about this cloud and internal convergence; we had the network and physical convergence discussion, we talked about isolation and reliability, and we also had security concerns. So when you send us your data, you really want the same kind of protections that we apply to that data when we serve our consumer-facing products. And you need the same kind of end-to-end security and threat protection models we have been providing inside of Google for a long time. You will see cloud vendors building solutions in this space, but it is important to get this stuff right.

So there's a big convergence going on; we are using this sidecar for a variety of patterns that were designed to aid integration, to help us insert behaviors into the network or data path of services, behaviors that were massively cross-cutting. And that's a similar problem to the one most of you have today when you build services, whether they are micro or not.

And so we have taken some of the things that we had done and started to apply them to that problem. As I mentioned before, the API proxy is a sidecar, and sidecars have been pretty trendy recently, so we will talk about how that pattern enables, you know, open source projects and also vendors to provide solutions that help you start to abstract away some of the concerns at the networking level, and also some of the concerns of the security model that you really don't want living in your code.

Decoupling -> Velocity

One of the themes in the earlier talks today was decoupling and velocity. If you can decouple your operators from your developers, they can do their jobs faster. Just like microservices help you decouple one application concern from another, so that the two teams building and developing them can iterate faster, you have similar coupling problems at different parts of the stack or process.

And so, you know, in this track, and in a number of tracks at other conferences, you will see a fair amount of conversation about decoupling operators from developers; it is an important trend in the industry. If you have a big portfolio of services, you have those two roles. But that is not the only coupling I want to talk about today.

So today, when you write services or microservices or whatever type of services you want to call them, you are often writing networking code, whether it is HTTP requests, or you are using a library and trying to make it work with TLS, or trying to configure a load balancer or anything like that. You are writing code or config to control the behavior of the network, to make the network do what you want it to do. So let's step back and think about what applications actually need from the network.

And so we also want to talk about decoupling the network topology from security. Traditionally, when you talk to big IT shops about security, a lot of the conversation revolves around network segmentation: how do I package workloads into networks with very specific boundaries so I can reason about what is allowed to talk to what? I think there are some problems with that model, because it is not fine-grained, portable or flexible enough to meet the use cases we see today in microservice development. And I want to talk about modernization and architecture. You go to a lot of conferences, and one of the phrases you hear is lift and shift: how do I take my workload and move it to the cloud? You will see that conflated with modernization. Do you have to lift and shift to modernize your application stack? Certainly, you can use lift and shift to manage your opex and capex distribution when you are buying compute and storage from a cloud vendor, but is that coupled to modernization, or are there other ways of going about it? And I talk to people that need to maintain certain IT assets while working with the cloud, and ask how do I make those things work seamlessly; all of that feeds into modernization. Okay.

What is a "Service Mesh"?

And so, what is a service mesh? I like to think of a service mesh as a network for services, not for bytes. When you deal with networking, you reason about sockets, ports, packets; maybe you are using an L7 protocol like HTTP, and I suspect a fair number of you are. But it tends to tail off pretty quickly once you get to the L7 protocol level. When you are building services, you need a lot more out of your network. You want the network to route away from failures, you want the network to avoid high-latency routes to a specific service, you would like the network to avoid hitting a service that has a cold cache. You would like the network to tell you when there are unexpected behaviors, latency spikes, packet loss. You would like the network to participate actively in identifying the root causes of failure. And you would like to make sure that the data flowing over the network is secure against trivial network attacks.

You would also like some observability out of the network -- sorry, let me come back to this in a little bit. There's a variety of features we would like to talk about, observability and security, but you would like a lot of this given to you for free. As a developer today, I shouldn't have to go and buy an awful lot of software to get a variety of properties that I want. The history of software development has been that the underlying infrastructure rises up to meet the needs of the application over time. But we have all been sitting on the networking stack at layer 3 or 4 for a while, and not getting much more value out of the network than that. How do we raise the abstraction up? And when I say free, I don't mean that you didn't pay anything for the software; that's probably not the cost you care about. The cost is that you did not have to re-write your applications to do it. That would be a much, much more expensive proposition for you. And I care about free in that sense, as this is an open source project, and that is the real cost in the system. If you have to modernize by re-writing the applications, it is hugely expensive and completely untenable. It is possible that whoever wrote the application left the company five years ago, you have no idea where the source code is, nobody knows how to build it, but it is still running in production. That's not that uncommon.

So how do you help, like, in that situation? So I talked a little bit about a network for services. And the goal here is to reimagine the network. You, as application developers, what do you want the network to do for you? Do you actually care which IP and port a packet went to? On a fundamental level, do you actually care physically where it went? Or do you just care that it went to the right service, that it did the right thing on your behalf, and that you know that?

Right, so, you know, if we start to think in those terms then, you know, there's a variety of features we all want. We want the network to handle service discovery for us. So when I say that I want my sales app to talk to my HR app, I shouldn't have to put into my application code the fact that the HR app lives on these IPs in these regions, these zones, and all of these concerns. Don't make me think about that. Please stop. And also, when the HR app talks to the sales app or some other thing, does it really need to know how to load balance, or does the receiving app tell it, or tell the network, how it wants the traffic load balanced? This is stuff that should not live in the application code, it starts to pile up.

And, if you think about that for a second, we are all running companies, or working in shops, with a lot of different languages, and networking code is hard. In my history at Google, I worked on a lot of networking stacks and high-level customization stacks, and they were all client libraries running their own stacks with unique and special modes of failure. We had many examples of applications or clients that caused weird behaviors because they had a quirk in their networking code.

So if you are using some framework to build a set of services, that might work fine when both the clients and the services are written in the same language, because the framework was designed for that. And maybe the framework is specialized to a couple of platforms and you stay razor-focused on that. And as your portfolio of services starts to grow, more languages, run times, and things start to creep in and things start to fall apart a little bit, and it becomes extraordinarily expensive to keep that consistent. I have a fair amount of experience on that. I worked on gRPC for a number of years, the goal was to be a consistent framework that spans a range of languages. That was a huge engineering effort to make sure that they all behave the same, and the truth is they don't. They have little quirks that occasionally make things go funny. So how do we solve the problem, and particularly solve the problem when you are dealing with application code that can't be updated? You can't go put the shiny new framework into it, you cannot really control it.

Weaving the Mesh - Sidecars

How do we go about doing this? I'm sorry to do this after lunch, but it is going to be a heavy meal. We use sidecars: what Istio does is inject sidecars into both sides of the network, and doing that allows us to do a number of things. If you look at the outbound features -- so service A wants to talk to service B -- when A makes a call, Istio captures the network traffic, routes it through the sidecar, and the sidecar layers on the behavior that you want. So the sidecar on the client side injects authentication credentials, and it can inject failures if you want to do chaos stuff. If you need fine-grained routing control, it can split the traffic, and it can initiate tracing. There's a number of other features, but it gives you a lot of power in the network. Now we have a smart endpoint that belongs in the same trust domain as your application; it is part of the application, and we will talk about the security properties of that, and it is able to take on a bunch of application-level networking concerns. So you can think of it as an out-of-process library, whichever mental model works for you. It is part of the application, but not in the process space of the application.
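
To make the client-side picture concrete, here is a minimal, illustrative Go sketch of what an outbound sidecar conceptually does: intercept a call, layer on a credential and a trace ID, and forward it. The listen port, upstream address, header names, and token function are assumptions for the example, not Istio's or Envoy's actual implementation.

    package main

    // Toy "client-side sidecar": forwards outbound calls while layering on
    // cross-cutting concerns the application never has to see.
    import (
        "crypto/rand"
        "encoding/hex"
        "log"
        "net/http"
        "net/http/httputil"
        "net/url"
    )

    // fetchWorkloadToken stands in for however the mesh obtains a workload credential.
    func fetchWorkloadToken() string { return "example-token" }

    // newRequestID generates a random ID so calls can be traced end to end.
    func newRequestID() string {
        b := make([]byte, 16)
        rand.Read(b)
        return hex.EncodeToString(b)
    }

    func main() {
        upstream, err := url.Parse("http://service-b.internal:8080") // assumed upstream address
        if err != nil {
            log.Fatal(err)
        }
        proxy := httputil.NewSingleHostReverseProxy(upstream)

        handler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            r.Header.Set("Authorization", "Bearer "+fetchWorkloadToken()) // inject credentials
            if r.Header.Get("x-request-id") == "" {
                r.Header.Set("x-request-id", newRequestID()) // start a trace
            }
            proxy.ServeHTTP(w, r)
        })
        log.Fatal(http.ListenAndServe("127.0.0.1:15001", handler)) // assumed sidecar port
    }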

And on the service side, the server-side sidecar is in the trust domain of the receiving application, so it is fine for it to do authentication on that side; it checks that the protocol it received is appropriate. It provides policy checks and, if you have a policy system behind it, it can enforce policies and rate limits. A lot of production deployments don't do rate limiting, but as you grow in scale, one of the more common forms of DoS attack is the accidental kind: somebody runs a test, it has an unintended consequence, and it takes down the system. Load shedding is a big part of that too: deciding which requests I want to drop on the floor when I'm in one of these DoS modes. It can participate in request tracing, it can provide telemetry, which we will talk a little bit about, and it can inject faults to see if your system is behaving how you would like it to. That is Istio in a nutshell, that's the fundamental architecture.
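
And a matching sketch for the inbound side: a toy server-side sidecar that sheds load with a crude per-second limit before handing traffic to the local application. The ports and the limit are made up; in Istio the real policy and rate-limit checks go through Mixer.

    package main

    import (
        "log"
        "net/http"
        "net/http/httputil"
        "net/url"
        "sync"
        "time"
    )

    // limiter is a deliberately simple fixed-window counter, enough to show the idea.
    type limiter struct {
        mu     sync.Mutex
        count  int
        window time.Time
        limit  int
    }

    func (l *limiter) allow() bool {
        l.mu.Lock()
        defer l.mu.Unlock()
        now := time.Now()
        if now.Sub(l.window) > time.Second {
            l.window, l.count = now, 0
        }
        l.count++
        return l.count <= l.limit
    }

    func main() {
        app, _ := url.Parse("http://127.0.0.1:8080") // the local application behind the sidecar
        proxy := httputil.NewSingleHostReverseProxy(app)
        lim := &limiter{limit: 100} // assumed budget: 100 requests per second

        http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
            if !lim.allow() {
                http.Error(w, "rate limited", http.StatusTooManyRequests) // shed load instead of falling over
                return
            }
            proxy.ServeHTTP(w, r)
        })
        log.Fatal(http.ListenAndServe(":15006", nil)) // assumed inbound port
    }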

Istio – Putting it all Together

So how do we put it together? We have, as I showed you here, the sidecars, and there's a control plane that sits on top of that which configures the behavior of the system. There's an API that sits above that, and operators, or developers, or whoever wants to affect these types of things, pushes configuration into it that says: this is the behavior that I want the network to have. Then we have a component on the left, Pilot, and it distributes the configuration, the intent, to the sidecars so they implement the behaviors. Mostly, that is pushing configuration down into Envoy live. Envoy is a really interesting proxy because, when it receives an update to its configuration, it does not need to restart. This is actually a novel thing in networking and proxy land; it is an API-driven config model for the proxy, and it stays up the whole time. You can change the network routing and it has no impact on, like, your SLOs and availability; it just changes its behavior.
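
The "no restart" property is worth illustrating. Here is a minimal Go sketch of the same idea: routing intent arrives from somewhere (a Pilot-like push is simulated with a timer, which is an assumption for the example) and is swapped in atomically while the process keeps serving.

    package main

    import (
        "fmt"
        "net/http"
        "sync/atomic"
        "time"
    )

    type routes map[string]string // path prefix -> upstream cluster name

    func main() {
        var current atomic.Value
        current.Store(routes{"/": "service-b-v1"})

        // Stand-in for a control-plane push: new intent installed in place,
        // no restart, no dropped connections.
        go func() {
            for range time.Tick(10 * time.Second) {
                current.Store(routes{"/": "service-b-v2"})
            }
        }()

        http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
            rt := current.Load().(routes)
            fmt.Fprintf(w, "would route %s to %s\n", r.URL.Path, rt["/"])
        })
        http.ListenAndServe(":15000", nil)
    }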

And we have another component on the right-hand side, Istio Auth, which injects certificates into the sidecars and rotates them on a pretty regular basis. The reason we send certificates down into the sidecar is that we use them to identify the workloads, and we secure the traffic between both sides. I will get into some more details about security later.

And then the last part, the piece in the middle which we call Mixer, is the extension model of Istio, effectively. It receives telemetry from the sidecar proxies, which it federates downstream to whatever telemetry collection system you want to work with, and it implements a policy check. So when the server-side sidecar receives a call from the client, it will ask, hey, should this call go through? And the information that is available to make that policy decision depends on the protocol being used.

If the protocol is HTTP or gRPC, we can send a lot of information to Mixer, layer 7 information about the call, to enable Mixer to make that policy decision. And, on the bottom corner, we try to do this as transparently as possible. In the case of Kubernetes, we do the magic in Kubernetes land; it re-writes your pods on the fly to inject this networking behavior, and you don't have to know that it happened. That also segregates the operator role: operators can enforce this on deploys in a Kubernetes environment. And this is not tied to Kubernetes; we have shown in Istio how to do this on VMs, we did it with Mesos in that environment, and we talk to the Docker folks a lot. The pattern just generally applies; we do not care what orchestrator you used, it is the same basic model.
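
The shape of that check, reduced to a sketch: the sidecar hands over whatever attributes the protocol exposes and gets back an allow-or-deny answer. The attribute names echo the kind of vocabulary discussed later in the talk; the checker itself is a stand-in, not Mixer's API.

    package main

    import "fmt"

    // Attributes carries the L7 facts about a call that the protocol makes available.
    type Attributes map[string]string

    // CheckPolicy is a toy decision: allow a known caller, but never DELETE.
    func CheckPolicy(attrs Attributes) error {
        if attrs["source.service"] != "sales" {
            return fmt.Errorf("caller %q is not allowed", attrs["source.service"])
        }
        if attrs["request.method"] == "DELETE" {
            return fmt.Errorf("DELETE is not permitted for this caller")
        }
        return nil
    }

    func main() {
        err := CheckPolicy(Attributes{
            "source.service":      "sales",
            "destination.service": "hr",
            "request.method":      "GET",
            "request.path":        "/employees/42",
        })
        fmt.Println("decision:", err) // nil means the call may go through
    }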

Our Sidecar of Choice - Envoy

So I mentioned Envoy; I saw the Datawire guy talk about Envoy, and he has good taste. It is a C++ proxy. You know, I have worked on proxies in a variety of different languages, and the first two iterations of the stuff that I worked on at Google probably should have been done in C++; there is really solid tooling behind it. You can get a lot of performance, you can do a lot of interesting and quirky things that are hard to do in other languages, and it has been used at Lyft for a while. They have beaten it to death, and it scales up pretty well with their service. They have been very happy with it, and they have a great community to work with. If you have a chance to see a talk by Matt Klein, I recommend it, he is one of my favorite angry men in tech. And his Twitter feed is pretty good, too.

But I mentioned the API-driven updates; it has features like zone-aware balancing, and it does HTTP/2 on both the inbound and outbound side, which is unique in the industry. More important than that, it was designed for observability. The Envoy folks have this philosophy of making the behavior of the network a first-class thing: you cannot manage the network if you don't know what it is doing, so Envoy exposes a lot of information about its behavior that you can send downstream into telemetry systems. This helps in production rollouts and in understanding how systems are behaving, when they are having failure modes, and being able to diagnose them.

Modeling the Service Mesh

So how do we model this? I talked a little bit about Kubernetes, but we are effectively environment-agnostic. We do, however, need to understand the topology of the network, and that means understanding the service discovery or orchestration tools, because they dictate that topology. So we have this system, Pilot, which programs the Envoys, and it receives topology information from Kubernetes, Consul, Eureka, or whatever you want to write and plug into Pilot. We suck in all of this topology information, and then we merge it with the config that you apply to produce the view of the network topology that we emit down the stack.

So we have done a number of integrations -- I mentioned Consul and Eureka, and there's a number going on in the community right now. If there is one that is not tracked there that you would like to see covered, let me know after, or file a GitHub issue. That would be great. And we push the config to the network, and we don't re-start Envoy to do it, or we try very hard not to. If it is re-starting, it is a bug, not a feature. And that's how we program the network.
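
A minimal sketch of the Pilot idea: every registry (Kubernetes, Consul, Eureka, or something custom) sits behind one interface, and their topologies are merged into a single view before being combined with operator intent. The interface and type names here are assumptions for the example, not Pilot's actual API.

    package main

    import "fmt"

    // Instance is one endpoint backing a named service.
    type Instance struct {
        Service string
        Addr    string
        Labels  map[string]string
    }

    // Registry is anything that can report the current topology.
    type Registry interface {
        Instances() []Instance
    }

    type staticRegistry struct{ items []Instance }

    func (s staticRegistry) Instances() []Instance { return s.items }

    // mergeTopology flattens every registry into one service -> endpoints view.
    func mergeTopology(regs ...Registry) map[string][]Instance {
        out := map[string][]Instance{}
        for _, reg := range regs {
            for _, inst := range reg.Instances() {
                out[inst.Service] = append(out[inst.Service], inst)
            }
        }
        return out
    }

    func main() {
        kube := staticRegistry{items: []Instance{{Service: "hr", Addr: "10.0.0.4:8080", Labels: map[string]string{"version": "v1"}}}}
        consul := staticRegistry{items: []Instance{{Service: "hr", Addr: "10.1.2.3:8080", Labels: map[string]string{"version": "v2"}}}}
        fmt.Println(mergeTopology(kube, consul)) // one merged view, whatever the source
    }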

So let's get back to the other high-level features. I talked about observability and the properties of Envoy; what do we mean by observability? We have this mesh of applications, or services, that are all talking to each other; they probably all have names, they probably all run under a certain authority, they are probably all performing different sets of named operations, or roughly-named operations, on resources within your application domain. They may live in different zones, they may have different physical characteristics, and when you make a call, it has a latency and an error distribution. All of that is information that you want to easily extract from the system and be able to put into nice, shiny dashboards, into analytics pipelines, and into things that feed back into deploy management and CI/CD systems so you can do incremental rollouts. That's what we mean by observability: extract as much information as you can and package it in a way that you can consume.

That is what we do in Istio; we see what is going on, and we help fill in the gaps when the application is really not helping to fill them in. A good example: say we want to show a graph of all the different operations invoked by a service, and the 90th percentile latency for each of those operations, but the API is RESTful. Paths, if they are parameterized, make a terrible classification system. So maybe you want to use OpenAPI to classify paths into operation IDs and generate your graph that way. But you don't want to re-write the application code; how do you do that? We provide means for Mixer to classify the traffic before it goes into the telemetry system, without updating the application.
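
A small Go sketch of that classification step: parameterized REST paths are collapsed into stable operation IDs (the kind of mapping an OpenAPI spec could drive) before they become metric labels. The patterns and IDs below are invented for the example.

    package main

    import (
        "fmt"
        "regexp"
    )

    type classifier struct {
        pattern *regexp.Regexp
        opID    string
    }

    // In practice these rules could be generated from an OpenAPI document.
    var classifiers = []classifier{
        {regexp.MustCompile(`^/employees/[^/]+$`), "GetEmployee"},
        {regexp.MustCompile(`^/employees$`), "ListEmployees"},
    }

    // operationID collapses parameterized paths into a small, stable vocabulary.
    func operationID(path string) string {
        for _, c := range classifiers {
            if c.pattern.MatchString(path) {
                return c.opID
            }
        }
        return "Unknown"
    }

    func main() {
        // /employees/42 and /employees/97 are the same operation as far as a dashboard cares.
        fmt.Println(operationID("/employees/42"), operationID("/employees/97"), operationID("/employees"))
    }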

Visibility

And so today, Istio ships with an out of the box monitoring experience; we use Prometheus and Grafana to do that. You get these metrics without instrumenting your apps, and they are keyed by the source and destination of the traffic. I have a graph that says, how much traffic is going between service A and B, not just how much traffic is going into service B. And, right, that's a very powerful tool.
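
For a sense of what "keyed by the source and destination" looks like, here is a sketch using the standard Prometheus Go client; the metric name and label set are assumptions, and in the mesh a sidecar, not your application, would be recording this.

    package main

    import (
        "net/http"

        "github.com/prometheus/client_golang/prometheus"
        "github.com/prometheus/client_golang/prometheus/promhttp"
    )

    // requestTotal counts traffic between pairs of services, not just into one service.
    var requestTotal = prometheus.NewCounterVec(
        prometheus.CounterOpts{
            Name: "mesh_request_total",
            Help: "Requests observed between services.",
        },
        []string{"source_service", "destination_service", "response_code"},
    )

    func main() {
        prometheus.MustRegister(requestTotal)

        // A sidecar-like component would do this for every request it proxies.
        requestTotal.WithLabelValues("sales", "hr", "200").Inc()

        http.Handle("/metrics", promhttp.Handler())
        http.ListenAndServe(":9090", nil)
    }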

I can also trace requests. Now, obviously, tracing requires some participation from the application to propagate context through the application code itself. But Istio can help you with initiation, sampling, etc. It can make the behavior of the network, as traffic flows around the application parts, visible to you. So you can see the network latencies as well as the internal application latencies, if you have properly instrumented your code. And we want to do this in a vendor-neutral way, so there's an extensible pipeline that goes out of the back of Mixer and plugs into a variety of instrumentation systems. We use Zipkin today, but we will probably ship different out-of-the-box experiences, or you will see other variations from other vendors. There is plenty of choice on the monitoring side; maybe you like a commercial vendor like New Relic, or Datadog, or Splunk. And on the tracing side, you can use OpenTracing or whatever you want; we actually do not care. We want to make sure it is easy to extract this from the network and push it into these tools in coherent ways.
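
The one piece the application still owns is context propagation. Here is a sketch of that responsibility in Go, copying the x-request-id and Zipkin/B3-style headers commonly used with Envoy-based tracing from the inbound request to the outbound one; check your tracer's documentation for the exact header set it expects.

    package main

    import "net/http"

    // traceHeaders follows the x-request-id / B3 convention used with Zipkin.
    var traceHeaders = []string{
        "x-request-id",
        "x-b3-traceid",
        "x-b3-spanid",
        "x-b3-parentspanid",
        "x-b3-sampled",
    }

    // propagate copies trace context from the incoming request onto an outgoing one.
    func propagate(in, out *http.Request) {
        for _, h := range traceHeaders {
            if v := in.Header.Get(h); v != "" {
                out.Header.Set(h, v)
            }
        }
    }

    func handler(w http.ResponseWriter, r *http.Request) {
        out, _ := http.NewRequest("GET", "http://hr.internal/employees", nil) // assumed downstream call
        propagate(r, out)
        if resp, err := http.DefaultClient.Do(out); err == nil {
            resp.Body.Close()
        }
        w.WriteHeader(http.StatusOK)
    }

    func main() {
        http.ListenAndServe(":8080", http.HandlerFunc(handler))
    }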

And so basically, this is an example of the topology. It is showing you the Prometheus setup that we ship; we also have examples with Stackdriver -- I do work for Google, there will be occasional plugs. And we have GUIs; I showed you the previous screen from Rework -- they have an observability tool sitting on top of the Istio metrics, showing you a whole bunch of interesting stuff. If you want a product that does those types of things, they are good people to talk to. And it is doing that thing again.

Resiliency

And so, let's talk about resiliency for a second. I think there are some talks today about chaos engineering; this is something that Netflix pioneered in the ecosystem many years ago. So what are the features you want from the network to help make sure that your application code is well-behaved in the face of failure? People talk about timeouts; timeouts are a very useful tool, and a double-edged sword if you are not careful, but they are an important tool in the toolbox for making sure the system stays stable. Maybe you want to try to read from a remote service, and if the read fails within a certain unit of time, you can read out of a cache or a data structure and still serve content to the user.

So timeouts help you do things like that. Retries are the sharp end of the timeout world; retry storms can cascade outages, which is another example of those accidental attacks. Then there are circuit breakers, health checking, and maybe your application is big enough to span many regions and you want to make sure that it routes traffic appropriately when bad things happen.
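
A sketch of the timeout-plus-fallback pattern in application terms: bound the remote read with a deadline and serve a cached value when it expires. The URL, budget, and cache are assumptions; the point of the mesh is that the timeout itself can live in configuration rather than in code like this.

    package main

    import (
        "context"
        "fmt"
        "io"
        "net/http"
        "time"
    )

    var cached = "stale-but-usable content" // whatever was fetched last time

    func fetchWithFallback(ctx context.Context) string {
        ctx, cancel := context.WithTimeout(ctx, 200*time.Millisecond) // a budget, not a hope
        defer cancel()

        req, _ := http.NewRequestWithContext(ctx, "GET", "http://content.internal/page", nil)
        resp, err := http.DefaultClient.Do(req)
        if err != nil {
            return cached // timed out or failed: degrade gracefully instead of cascading
        }
        defer resp.Body.Close()
        body, err := io.ReadAll(resp.Body)
        if err != nil {
            return cached
        }
        cached = string(body)
        return cached
    }

    func main() {
        fmt.Println(fetchWithFallback(context.Background()))
    }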

And you want to do fault injection. At Google, we run an annual exercise, DiRT, where we make systems start to fail in very specific ways to elicit awful behaviors in their dependencies; this is something that I think people should definitely be doing as part of their operational approach to rolling out big, complex applications into production.

On the left-hand side, there's an example of a configuration object, which you send to the Istio control plane API, and it says: here are a bunch of properties about how I want the network to behave, in this case applied to any traffic going to the destination service. And we have a fairly sophisticated language that allows you to apply these controls to intersections of source and destination.

Traffic control, okay. So that is the resiliency side of things, but there are lots of use cases that don't have much to do with resiliency; maybe you are doing blue/green deployments, or solving hosting problems, and you are just making sure that the right traffic goes to the right place.

Traffic Splitting

Here is an example of traffic splitting. Traffic splitting is important because you want it to work at the right layer in the networking stack. If you look at Kubernetes today, it has a load balancing system for pods which is layer 4-based, so if a pod wants to talk to another pod, the traffic is balanced at layer 4 rather than layer 7. If you want one percent of HTTP requests to go to the other pod, you cannot do it. You need something that gives you that flexibility.

And, you know, this is a common practice in the industry, traditionally mostly done using middle proxies. We are just pushing the same functionality down into the sidecar on the client, so it happens at the point of origination. Here, we are showing a weighted percentage traffic split between two destinations.
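
The selection logic behind a weighted split, as a sketch: pick a destination version with the configured probability. In the mesh this decision is made by the client-side sidecar from pushed configuration; the weights and version names here are examples only.

    package main

    import (
        "fmt"
        "math/rand"
    )

    type weightedRoute struct {
        version string
        weight  int // out of 100
    }

    // pick chooses a destination version in proportion to its weight.
    func pick(routes []weightedRoute) string {
        n := rand.Intn(100)
        for _, r := range routes {
            if n < r.weight {
                return r.version
            }
            n -= r.weight
        }
        return routes[len(routes)-1].version
    }

    func main() {
        routes := []weightedRoute{{"service-b-v1", 99}, {"service-b-v2", 1}}
        counts := map[string]int{}
        for i := 0; i < 10000; i++ {
            counts[pick(routes)]++
        }
        fmt.Println(counts) // roughly 9900 to v1, 100 to v2
    }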

Traffic Steering

And similarly, you might want to do what we call traffic steering, which is not dividing traffic up arbitrarily based on percentages, but dividing it up based on some property of the traffic. The example we show is a user-agent filter: I want to send iPhone traffic to one service and Android traffic to another service. This is a simplistic example; the routing rules can be quite complicated. I can write a rule that matches on the source as well, and I can do complex combinations of these things to test out variations of traffic and behavior, which allows me to influence roll-outs, or a whole bunch of properties on the operations side of things. Okay.
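
And the steering case, sketched the same way: the decision keys on a property of the request (here the User-Agent header) rather than on a percentage. Service names are invented; in Istio this is a match clause in routing configuration, not application code.

    package main

    import (
        "fmt"
        "net/http"
        "strings"
    )

    // steer picks a destination based on the calling device.
    func steer(r *http.Request) string {
        ua := r.Header.Get("User-Agent")
        switch {
        case strings.Contains(ua, "iPhone"):
            return "frontend-ios"
        case strings.Contains(ua, "Android"):
            return "frontend-android"
        default:
            return "frontend-default"
        }
    }

    func main() {
        req, _ := http.NewRequest("GET", "http://frontend.internal/", nil)
        req.Header.Set("User-Agent", "Mozilla/5.0 (iPhone; CPU iPhone OS 11_0 like Mac OS X)")
        fmt.Println(steer(req)) // frontend-ios
    }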

Securing Services

So I'm going to give you a heavy meal about security for a second, because it is an important point. When it comes to securing services, what are the things that you actually want to do? We talked about segmentation: you want to make sure that only the services that are supposed to talk to each other are talking to each other, and that when they do, they only use the operations they are allowed to use. There's a couple of things you need to do before you can do that.

One, you need to make sure that you know who the service is, right? Who is calling you. There is no point in having a policy if you can't, in a strong and verifiable way, know who is calling you; otherwise, you might as well be guessing. So we have this notion of verifiable identity. And you want to make sure that encryption is on by default.

One of the things that we do at Google is we assume that our internal production network is insecure: there could have been a penetration attack against any workload on any node running on the network, so how do we protect all the other things on the network from that event? I'm not saying that's what is happening, but that's the mental model. We want defense in depth against a variety of attacks. One attack is sniffing the network; we want to ensure that payloads are encrypted as they go between services, and that the encryption is tied to identity. We also have the reverse problem, the secure naming problem: when service A wants to call service B and it gets an IP address, how does it know that the IP address is actually part of service B? That's what we call the secure naming, or secure addressing, problem.

And we want to respond to threats rapidly. What Istio does is issue certificates to the workloads so they can identify each other. We use mutual TLS to give verifiable identity and encryption by default; that's how we know who is talking to whom. And if service B is compromised, I want A to stop talking to it as fast as possible. I will revoke B's credential; we have tools for rapid revocation, so trust will terminate within a limited time window, and we can limit the blast radius of a particular security event in time. That's an important property of your security posture, and we can often do better than that; we can shrink it down to very, very short amounts of time.
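
What "mutual TLS with verifiable identity" means in code, as a sketch of the receiving side: present a workload certificate and require callers to present and prove theirs against the mesh CA. The file names and port are placeholders; in Istio the CA issues and rotates these credentials for you.

    package main

    import (
        "crypto/tls"
        "crypto/x509"
        "log"
        "net/http"
        "os"
    )

    func main() {
        cert, err := tls.LoadX509KeyPair("workload-cert.pem", "workload-key.pem") // issued by the mesh CA
        if err != nil {
            log.Fatal(err)
        }
        caPEM, err := os.ReadFile("mesh-ca.pem")
        if err != nil {
            log.Fatal(err)
        }
        pool := x509.NewCertPool()
        pool.AppendCertsFromPEM(caPEM)

        server := &http.Server{
            Addr: ":15006",
            TLSConfig: &tls.Config{
                Certificates: []tls.Certificate{cert},
                ClientCAs:    pool,
                ClientAuth:   tls.RequireAndVerifyClientCert, // the "mutual" part: callers must prove who they are
            },
        }
        log.Fatal(server.ListenAndServeTLS("", ""))
    }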

Problem: Strong Service Security at Scale

Why is this important? Think about strong service security at scale and the concerns you might have. Well, you should be concerned about insiders, accidental or deliberate. One of the biggest security risks to any company is the people who work at the company, or the people who accidentally walk in the door and stick a USB thumb drive into a machine. You have to worry about hijacked services; Equifax is probably the gold standard on hijacked services these days, unfortunately.

And what hijacking means is that whatever was hijacked is now a malicious agent, and it gets to use all of the properties and power that the service had before it was hijacked. It is part of the trust domain, and if everything else trusts it because it is network reachable, or for some other reason, that is not good. Trust needs to be managed in a fine-grained way. But fine-grained models are hard to manage, or fine-grained networking models are hard to manage. How do you reason about fine-grained security, what is the mental model you want to have? As I have been talking, I say that service A wants to talk to service B. That's the natural model in the microservices world.

And so, how do I say this is service A, and this is service B, and how do I do that when service A is mobile? If I'm using container orchestration, service A is not going to sit there on one port for the rest of its life. It is going to move around a lot, possibly quite quickly, across many geographic regions, and it might even move to a development laptop. We talked about, you know, test in prod. There are variations of that where you develop on your laptop, and the thing you run on your laptop, which is one of the services, is able to call the other production services which are running in the cloud.

And now your network perimeter model has to be way more flexible to accommodate these things. You also need to reason about securing resources. I mentioned securing APIs and operations within APIs, but resources are the next level down in granularity. I have a database, and that database is a resource. I have a file in a file system. I may have one API to read and write files, but one of the files is way more valuable than the other. I have to reason about those types of things.

At some point, you have to deal with audit and compliance, either for statutory reasons or because it is good practice. I would like to be able to know what my security posture is and whether there are things I can do to make it better. And there's a bunch of wants. I mentioned workload mobility; it is a need and a want. You want to move between on-premises and cloud based on whether it is cheaper, or easier, or more performant, or whatever your criteria are. I want to administer this remotely and do local development; bring-your-own-device is a common thing in the developer community and the enterprise world in general. And I also want to be able to reason about people and machines in a somewhat similar way.

When I talk about authority, I talk a lot about service to service authority. A lot of what happens in the operations land is user to service authority. And how do I reason about those two things in a consistent way? If they are entirely distinct, it makes it very hard to think about what the security properties of my system are. And I want to keep costs down.

And if you want all of these properties, and you want to deal with all of these concerns, you could be looking at a hefty bill. So, you know, Google's approach is that we don't believe traditional perimeter security models are sufficient. We think they are useful, and we do a fair amount with them, but we have an additional layer on top, which is identity-based, with strong and verifiable identities.

Istio – Security at Scale

And so, here is a quick overview; I showed you a little bit of this before. We have a CA, it provides certificates that are pushed to the sidecars, those certificates are rotated and have quite a short expiration, and Envoy initiates mutual TLS between the two sidecars; that is how we get verifiable identity, and those identities are passed into the policy layer to make decisions.
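
One detail worth a sketch is how short-lived, rotated certificates can be picked up without restarting anything: have the TLS stack ask for the current certificate on every handshake instead of loading one at startup. The file paths stand in for wherever a mesh agent writes rotated credentials; this is an illustration of the idea, not Istio Auth's implementation.

    package main

    import (
        "crypto/tls"
        "log"
        "net/http"
    )

    // currentCert reloads whatever certificate the CA most recently issued.
    func currentCert(*tls.ClientHelloInfo) (*tls.Certificate, error) {
        cert, err := tls.LoadX509KeyPair("workload-cert.pem", "workload-key.pem")
        if err != nil {
            return nil, err
        }
        return &cert, nil
    }

    func main() {
        server := &http.Server{
            Addr: ":15006",
            TLSConfig: &tls.Config{
                GetCertificate: currentCert, // consulted per handshake, so rotation needs no restart
            },
        }
        log.Fatal(server.ListenAndServeTLS("", ""))
    }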

What’s Mixer for?

And so I might just glide over the policy stuff quickly; we showed you this earlier. The policy layer is what enables integrating extensions into the system. In the Istio community there is Open Policy Agent, a spec for people to declare policy constraints on network or service behavior in a standardized expression language, and we have an integration with that. We provide whitelists and blacklists as features out of the box. We will probably do integrations with LDAP systems, or even with Active Directory. But the goal is to funnel all of those things through a common API, an API that enables us to cache policy decisions as close to the edge as possible so that, if the downstream policy systems start to fail, the network stays up. That's the goal.

That is not an easy thing to do engineering-wise; we are working through the bumps with that, but that's the goal. That is as opposed to integrating every one of the policy systems directly into the edge; the problem with doing that is you see inconsistent failure modes between the policy systems and it is hard to aggregate the SLIs. It is important to have a common point with a common API, and it is even more important for caching behavior and system stability.
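
The caching idea behind that common point, as a sketch: remember recent decisions near the data path and fall back to the last known answer if the policy backend is down, so the network keeps serving. The TTL, key shape, and backend call are assumptions for the example.

    package main

    import (
        "fmt"
        "sync"
        "time"
    )

    type decision struct {
        allowed bool
        expires time.Time
    }

    type cachingChecker struct {
        mu      sync.Mutex
        cache   map[string]decision
        ttl     time.Duration
        backend func(key string) (bool, error) // the real policy system
    }

    func (c *cachingChecker) Check(key string) bool {
        c.mu.Lock()
        d, ok := c.cache[key]
        c.mu.Unlock()
        if ok && time.Now().Before(d.expires) {
            return d.allowed // served from the edge, no backend round trip
        }
        allowed, err := c.backend(key)
        if err != nil {
            // Backend unavailable: reuse the last known answer if there is one,
            // so a failing policy system does not take the network down.
            if ok {
                return d.allowed
            }
            return false
        }
        c.mu.Lock()
        c.cache[key] = decision{allowed: allowed, expires: time.Now().Add(c.ttl)}
        c.mu.Unlock()
        return allowed
    }

    func main() {
        c := &cachingChecker{
            cache:   map[string]decision{},
            ttl:     5 * time.Second,
            backend: func(key string) (bool, error) { return key == "sales->hr", nil },
        }
        fmt.Println(c.Check("sales->hr"), c.Check("intruder->hr")) // true false
    }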

And this stuff is focused at operators, so we want to say: hey, operator, here are the properties of the network, go configure them. This is a config model, so you can manage and roll changes out without requiring application developers to go and create a new build. Okay?

Attributes- The Behavioral Vocabulary

So here are some examples of, you know, some attributes that we produce. And we call this the behavioral vocabulary, and there's a long list of things, if you go to the site, you can see more of this stuff.

Roadmap

And here is a short road map, and I think this is probably a good point to switch over to questions.

I thought I saw it in the roadmap, and I wanted to verify: is it cross-cluster?

Oh, yes. So when I talk about workload mobility, right, and obviously Istio has to deal with workloads that span many physical locations, different orchestraters, on-prem cloud, you can name it.

How far are you from that?

Ask me after Christmas.

Okay.

Hi, a two-part question. So you moved the library into a process by itself, in the sidecar; why not take it one level further and move the sidecar out into the cluster, as a traditional middle proxy? The rationale behind that: we do it that way, and the only problem we have is the deployment life cycle of the proxy. For example, when we have a new build of the sidecar to push, we are dependent on the app to update itself.

That's a good question. One of the reasons why we run a sidecar is because we want it to be part of the application trust domain. When we issue a certificate, that certificate identifies the application. If we do the same thing with middle proxies, they become super powers: they act on behalf of N other things in the network. If the proxy is compromised, you have a blast radius problem. You want to scope down the privilege of the credential and the workload that credential is associated with. We do plan to do a variety of things to help people manage the updates of sidecars, and while in Envoy we try not to re-start it, it has hot restart capability to reroute traffic in place, so we will do roll-outs without impacting traffic at the sidecar level, but it will require some qualification to make that work. That is the goal. And I'm not saying that middle proxies have no role here; in Istio we actually have many use cases for middle proxies, but we try not to turn them into super powers. That is very important.

Related to the proxy discussion and the roadmap: in terms of API management, can you talk about what you expect on the API management side?

I gave a subtle example of one thing that API management does today, and that is classifying API features. If you can classify features of your API, and extract operation IDs out of RESTful endpoints, like OpenAPI does, then you can tie quotas, ACLs, and those types of things to those features. I talked about this for the internal-to-internal use case, but it is just as applicable to the external-to-internal use case, which is classic API management if you look at what the vendors sell. There are other features in the long tail of API management, such as content transformation and end-user identity integration, which we plan to provide support for at the platform level. So, for instance, you will see that kind of support come out in Istio in the not-too-distant future, and there's a lot of things in there.

I had a question about transformations on the incoming payloads. You showed the example of how you can key on a header value or something like that for routing. Is there extensibility to that? If I wanted to integrate with a MaxMind geolocation database, can I make those extensions?

Yes, Mixer is designed to be an extensible platform, and we have a guide telling you how to write extensions for it. That's the primary extensibility model, and there will be an API-based extensibility model as well: for all of the signals we are extracting from the network, there's a standard API that you can implement that receives them, and it can augment traffic and control routing behavior.

Great, we will thank Louis one more time.

Live captioning by Lindsay @stoker_lindsay at White Coat Captioning @whitecoatcapx.

Recorded at: Jan 30, 2018
