Transcript
Louis Ryan: We're going to talk about cloud native hybrid networking, or whatever that actually means. I don't think anybody really knows exactly what it means. I'll cover some of the patterns and things that I've seen talking to people trying to do this kind of stuff: where they're coming from, what they're trying to get to, the struggles that they have. Maybe a little bit of a brief interlude with a bit of a philosophical rant just to keep things entertaining.
This is my background. I ran Google's API management platform for about a decade. I helped create something called gRPC in open source. I work on an open-source project called Istio, which is involved in the cloud native hybrid networking service meshy space. I work now for a company called Solo. Probably the most important point of this whole thing is I am not a networking person.
If you feel like you've been sold a bill of goods, my background is in applications, in APIs, and services, and the communication patterns between those things. My background is not in configuring routers. It is not in configuring firewalls, although I have done such horrible things in the past. I do not consider myself a networking infrastructure person. That said, I do have some opinions about what networks can and should be doing, and I've been working on that for a long time at this point, and so that's really what I'm going to talk about today.
Outline
If I'm not a networking guy, why are you even here? What do you care about? There are three things I think about in networking, and this is about how networks should evolve. I think networks need to elevate their functionality. Networks are too primitive. If you consider how networks have evolved over the last 50-odd years, we have gotten amazingly good at shoving bits around the network at incredible speeds. We've not gotten too much better at a lot of other things. Networks, and the abstractions and features that they provide, haven't become much more useful to the application. There's not a lot going into the network to make the network more useful to the application.
There are definitely some things, but not all of them are part of the network, and we will get into what that means. They should compose. They should work together to build higher-level systems, to build higher-level capability. Those patterns should be repeatable. If I do something here, I should be able to do the same thing over there, because if I have to do something different in two different places, that drives up my cost: in the people and the skills that need to know how to do those things, in the systems that have to compensate for those differences, and in the money that I spend, because maybe I have to use different systems to achieve the same effect in different ways. That's not great. We want to be able to do roughly the same thing, particularly in the hybrid universe, where I've got two clouds and an on-premise system, and I've spent a lot of money, or I've inherited these things from a legacy setup, and I'm trying to evolve the system.
Elevate
The first one, we want a network to be more useful to the application, not the other way around. Too much time is spent in networking, or in application development, trying to compensate for the limitations of the network. The network is infrastructure. Its job is to be more useful to the application layer, not to be more useful to network admins. I don't think enough time and effort is going into this. As people here who run operations for platforms, or who make buying decisions, you should be thinking about, when I acquire something in the broad sense of it, is it providing utility up to the application layer? Because the application layer is what is delivering value to your business.
First, the big problem. This is not a joke about F5, or maybe it is a joke about F5, for those of you who run those things in production. It's more a joke about the concept. The IP address dominates how people think about networking. How many do I have? How can I get one? Who do I have to ask to get one? What does it mean when I don't have one, or something's in another network? How do they communicate? Applications do not care about any of this. Why should they care? I really don't care what your phone number is. I care that I'm calling a human being. Applications call services and APIs. They are called by services and APIs.
This is an implementation detail, and yet we are hyper-focused on it. We are bound by it. Yes, IPv6 will reduce some of the constraints, and we'll have to think about these things a little less. Everybody deals with these things. Why? Why hasn't this gone away, as far as we're concerned? It should have. I don't talk to somebody's social security number, or their national insurance identity, or whatever the local thing is. We talk about Big Pharma, jokingly. We have Big IP. It dominates how people think about networks.
Here are all the things that we do to make IPs work. Again, the application does not care, and yet we spend so much time in operations dealing with these things. Why? Why hasn't the networking infrastructure that you pay an ungodly amount of money for fixed these things for you, so you don't have to deal with them? Yet you're still dealing with a lot of them. I'm not here to tell you that there's a solution to these problems. I'm here to tell you that this is a problem. When you buy things, or acquire things, or build things, you should be working towards making these things not a problem. Systems that do a better job of shielding you from this stuff are better systems. These are philosophical statements about how we should be thinking about networking, and how we should be thinking about control. They're very common things. Like, I want to build applications and services. I want them to talk to each other. Because we're talking about hybrid, I want it to work efficiently and securely no matter where they are. It should be the same everywhere.
The applications mostly don't care where they run. There are reasons for them to care. For the most part, they shouldn't care. The counterpoint to all of this, because we're all admins and we all have to meet compliance goals, is: I want controls. I want to be able to inject control and policy into the network to make sure that the applications are doing the things they're supposed to be doing, or more to the point, not doing the things they're not supposed to be doing. I want that to work regardless of how the infrastructure changes.
If I move from one networking infrastructure to another, why do I have to rewrite every single policy I wrote? It's insanely expensive. The single biggest barrier organizations have to moving is the investment they have in policy. It represents a huge amount of business value. We'll talk a little bit more about what that means. These are things that you want, or ideally you should want. Certainly, having talked to a lot of people, I hear people say this is what they want.
We're going to go through a little theatrical diversion. Who has not heard of this play? We're going to have a little fun with Romeo and Juliet. In retrospect, I think it's a security story. It's a play in five acts. Act one, a misconfigured firewall. Romeo should not be able to talk to Juliet, and Juliet should not be able to talk to Romeo, makes sense. A bunch of audit and reporting gaps. Mercutio and Tybalt, middle managers in your organization, are too busy fighting about stuff to report up the chain that Romeo and Juliet are talking to each other, and there are no controls to report that they are talking to each other. Nobody knows. Romeo and Juliet are services. Now we have an accidental dependency. Happens all the time, and can cause problems. We have a bunch of confused deputies. I think it's the nurse and Friar Lawrence who are supposed to be sorting things out. They think their prime objective is to make this love thing happen, but in reality, they're just not aligned to the corporate goals, and they're not meeting the OKRs. Again, middle management problems.
Then we have a cascading outage. Juliet initially has a brownout. That brownout causes Romeo to have a total outage. Then, because Romeo goes out, Juliet finally goes completely out, and the whole system goes to hell. Then we have a really ineffective post-mortem where everybody comes in at the end and goes, let's fire Mercutio and Tybalt. They got fired earlier on anyway. We should realign our corporate objectives. They don't actually go back and look at what the real problem was. Not an uncommon situation in production systems. I have had this happen to me in production systems, with inadvertent dependencies causing outages. A really big one about 15 years ago at Google took out every public-facing API. The post-mortem was a little better than the one at the end of Romeo and Juliet, but this is roughly the play.
I said it was a tragedy about DNS. We all remember this lovely line, and this is what happens in today's networks. Wherefore art thou Romeo? Romeo is at this IP address. We're talking to a name. Names are good. Names are an abstraction of the infrastructure. It's what we want. The problem is DNS is asymmetric. When Juliet calls Romeo, how does Romeo know that Juliet is calling him? He's listening on a socket, or on a balcony in this case, or below a balcony. He's going to use an API to try and figure out who's calling him. The problem is this API returns an IP address. It does not say Juliet. It could be anybody talking to Romeo, as far as he's concerned. When we think about security, we want to know who's talking to us. Who art thou?
If the answer is an IP address, we don't know. Not really. Not in modern networking. I see people who go, I'm just going to reverse look up the IP on my corporate network to figure out who this person is. They may be across a NAT, or you may have split-horizon DNS, or all manner of things that prevent you from doing this. I'm sure it will come as a shock to nobody, but there is no RFC standard or anything in networking today such that when I call an API on a socket and say, tell me who my peer is, it will reliably tell me a name. Fifty years doing the same thing, and such an API does not exist in POSIX. Why? We solved so many other problems in networking, but we don't solve that one. This is the fabric, the very basis of modern application networking, the socket API, and yet we can't do that one thing. Hard problem.
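As a minimal illustration of that gap (the listener address and loop here are made up, not from the talk), this is roughly the most a plain socket will tell a Go server about its peer: an IP and a port, never a verifiable name.

```go
// Sketch only: when Juliet dials Romeo, all the socket layer can tell Romeo
// about his peer is an address, never a name he can trust.
package main

import (
	"fmt"
	"net"
)

func main() {
	ln, err := net.Listen("tcp", "127.0.0.1:8080")
	if err != nil {
		panic(err)
	}
	defer ln.Close()

	for {
		conn, err := ln.Accept()
		if err != nil {
			continue
		}
		// RemoteAddr is the Go equivalent of getpeername(2): it yields an IP
		// and port, e.g. "10.0.3.17:54021". There is no standard call that
		// answers "who art thou?" with a name you can verify.
		fmt.Println("peer is", conn.RemoteAddr().String())
		conn.Close()
	}
}
```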
Going back and looking at the firewall rule, this is what they had in production. For everything in the Capulet network, you can't talk to the Montagues, and Montagues can't talk to Capulets. Looks good. Except those four stars are doing a lot of heavy lifting here: at runtime, in a firewall system, they turn into a list of IP addresses. There are lots of ways things can be talking that don't use the network the way you would expect. You have a ton of IP churn, which makes those rules unstable. You could be using a serverless system like Lambda where they don't have IP addresses. They could be going through messaging tunnels, all kinds of messaging tunnels. Squid proxies here, there, everywhere. You could have the right code but run it in the wrong place. Maybe I accidentally run Romeo in Juliet's network. Plain old misconfiguration. We're all familiar with that one. Just a ton of complexity. It's really hard to maintain these networks and keep the rules up to date. This could be thousands of rules, tens of thousands of rules.
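As a rough sketch of what those four stars turn into (the hostnames are hypothetical, and this is not any real firewall tooling), a name-level intent gets flattened into whatever addresses those names resolve to at the moment the rule is written, which sets up the rot problem described next.

```go
// Sketch only: "compiling" a name-based intent into an IP-based firewall rule.
package main

import (
	"fmt"
	"net"
)

// Intent: "anything in capulet.internal must not reach montague.internal".
var intent = [][2]string{
	{"capulet.internal", "montague.internal"},
}

// compile resolves names to concrete addresses. The result is only as fresh
// as the moment it ran: IP churn, NATs, proxies, and serverless endpoints all
// invalidate it silently.
func compile(rules [][2]string) [][2][]net.IP {
	var out [][2][]net.IP
	for _, r := range rules {
		src, _ := net.LookupIP(r[0]) // errors ignored in this sketch
		dst, _ := net.LookupIP(r[1])
		out = append(out, [2][]net.IP{src, dst})
	}
	return out
}

func main() {
	fmt.Println(compile(intent))
}
```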
My personal favorite, and the most likely culprit in most cases, is that the policy was right once upon a time, but it is not right now. It's really hard to know that, because the policy involves translation. That translation depends on an understanding of the infrastructure that existed at the time the policy was written. If the infrastructure changes and those assumptions are invalid, the policy has rotted. What's the solution? Maybe we should give things identities. Imagine what networking would be like, and what network security would be like, if everything that ran on a network had an identity that was verifiable and provable by everything it talked to, and vice versa. Think about it for a second. The entire firewall industry would be different. How applications and authorization policies are written would be different. Yet we haven't done this in networking. It's shocking.
It's not like the universe has sat still. People have built solutions to these things, they just haven't built them into the network in a way that makes it easy for the applications to consume. The networking layer doesn't think it's its responsibility to do this. I think it is. We have PKI, X.509, we know how to give identities to things. Fifteen years ago, we used to think it was too expensive to give a certificate to everything on the network. I don't think that anymore. Give certificates to everything. Let's Encrypt happened a decade ago. Nobody cares. You can give a certificate to every IoT device on the planet. It's not a problem.
We have things like mTLS, a technology I know a painful amount about, so we can mutually exchange credentials and identify peers when they communicate with each other. It's done in the application layer typically, but WireGuard has a very similar system to mTLS: it uses PKI, everything is authenticated, and you can use it as an identity. IPsec too. There's just no good API abstraction up into the application layer, and you can do it wrong: you can share the same key with everything, so don't use IPsec that way. These are reasonable technologies that should just be baked into the lower layer. Then there's what the application layer has done, which is compensate. The application developer got tired of waiting for the network to do what it was supposed to do, and so we invented JWT, so we could exchange information about who we are. In reality, servers do TLS and clients do JWTs, so it's an asymmetric and ugly system, but this is what we're all doing. This is how the server knows who the client is, and vice versa.
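To contrast with the plain socket sketch earlier, here is a minimal sketch of a server that demands mutual TLS and reads the peer's identity out of the presented certificate. The file names, the CA setup, and the SPIFFE-style identity in the comment are assumptions for illustration, not anything prescribed by the talk.

```go
// Sketch only: with mTLS, the answer to "who art thou?" is a name, not an address.
package main

import (
	"crypto/tls"
	"crypto/x509"
	"fmt"
	"os"
)

func main() {
	caPEM, err := os.ReadFile("ca.pem") // hypothetical CA bundle
	if err != nil {
		panic(err)
	}
	pool := x509.NewCertPool()
	pool.AppendCertsFromPEM(caPEM)

	cert, err := tls.LoadX509KeyPair("romeo.pem", "romeo-key.pem") // hypothetical server cert
	if err != nil {
		panic(err)
	}

	cfg := &tls.Config{
		Certificates: []tls.Certificate{cert},
		ClientAuth:   tls.RequireAndVerifyClientCert, // demand a provable peer identity
		ClientCAs:    pool,
	}

	ln, err := tls.Listen("tcp", "127.0.0.1:8443", cfg)
	if err != nil {
		panic(err)
	}
	defer ln.Close()

	conn, err := ln.Accept()
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	tlsConn := conn.(*tls.Conn)
	if err := tlsConn.Handshake(); err != nil {
		panic(err)
	}
	// The peer certificate carries a verified identity, for example a CN or a
	// URI SAN such as spiffe://verona/ns/capulets/sa/juliet (illustrative).
	peer := tlsConn.ConnectionState().PeerCertificates[0]
	fmt.Println("peer identity:", peer.Subject.CommonName, peer.URIs)
}
```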
Compose
With the rant about identity over: we want to be able to inject controls and tools into the conversation between applications, services, and people, so I can meet my organization's policy goals. I have a network. Maybe now I've even got a network that has identity at the heart of it. That would be lovely. We'll talk a little bit about patterns. What are people doing today? How are they trying to solve for problems in hybrid networking? What's good? How are they evolving? What are they trying to get to? You've all heard of the famous paper, The Cathedral and the Bazaar. It's mostly about open source versus commercial software development, but it also describes how networking works in organizations today. You have two parties. You have an ivory tower, which is a highly curated thing. It's the thing the organization invests in. In the modern enterprise today, it's the VPC, usually. Everything new and shiny is put on the VPC.
Most of the capital investment and training go there, because the VPC is the target. Enterprises were told to move to cloud, so I'm going to pour my resources, my intellectual capital, thought, and effort into making the VPC as good as it can be. Then there's everything else: everything your organization has acquired over the last 20 years, on-premise, private data centers, offices. They just look radically different. You have an ivory tower and a lot of other random stuff. Super common. For the types of enterprise I talk to, that usually means on-premise or private data centers are that kind of thing. They have different infrastructure. They have different cost models. They have different deployment strategies, different maintenance cycles. They just end up being super different. This is painful. Because I have all these other objectives coming in from my business, we now need the network to do this type of policy control, so I can audit, so I can meet ISO 27001. These kinds of requirements are constantly coming at businesses, and so the network is being forced to evolve.
Companies starting in this place are like, ok, we have a big problem to solve. How do we go about getting the controls that we need into the network? There's just so much diversity. Maybe on-premise you've got a bunch of J2EE, versus in the cloud you've got Kubernetes. You've got a few different clouds. It's quite common, actually. In the previous talk I was in, we were talking about building SaaS, and the recommendation was, if you're going to build a SaaS solution, start with one cloud. I couldn't agree with that more. Not everybody gets that choice. Some people are forced to use multiple clouds. Sometimes they've acquired companies that used other clouds, and so they end up in this state regardless. They have serverless and VMs and all these other technologies. They have a bunch of VPCs, and they're trying to rationalize again how they want to organize those VPCs, because they're still beholden to the IP gods, and they have a bunch of different firewall technologies, which makes this super painful.
There's a pattern that almost everybody follows. There are some exceptions, but by and large, this is the thing I see the most often: the hairpin pattern. What is the hairpin pattern? It is a big proxy sitting somewhere in the network that I'm going to send everything through. I mean everything. All my ingress traffic, all my internal traffic. The one exception is that for a lot of organizations that do all of this for their internal and ingress traffic, it's still the Wild West for egress traffic. They do this really well, but everything is just allowed to egress out through a whole bunch of random places. Some more mature organizations make all their egress go through this as well, or maybe they have a separate big egress proxy alongside the big ingress and internal proxy, but the same basic pattern applies. There's a huge proxy, and I'm going to funnel everything through it. To go along with the big proxy, there is a big policy store.
All those rules, all those controls that you want, authentication, authorization, quotas, rate limits, audit requirements, monitoring, observability. There's nothing wrong with this pattern per se, or there are some things wrong with it. You get to see all the traffic. All the traffic goes through the proxy. There's no mystery anymore, which is good. It gives you all the control you want. You wanted control. You were told to have control. You have control. It's actually a pretty rational evolution. I don't have a big problem with this. There are some issues. Big proxies, when they fail, cause big outages. They tend to be SPOFs.
Some of that's because of the technological choices that we made. I'll come back to that a bit when we talk more about repeatability. They tend to live in the VPC. That's the place where we're doing all the modernization. I'm going to run the big proxy in the VPC, but I still want all the on-premise traffic to go through it, so I'm going to have to send all the on-premise traffic up into the cloud, through the big proxy, and back down. That can get a little pricey. People traditionally use the most powerful tool they have to implement the big proxy, which usually means an API management product and maybe an expensive load balancer. Often, that's functional overkill for the set of controls they're trying to apply. You're paying one of the big API management vendors per request going through the proxy, even though there aren't that many policies being applied to it, so the cost is disproportionate, and that creates a lot of pressure.
Then there's the whole Conway's Law thing. Conway's Law is that software will match the organizational structure. It'll just evolve into that state. But because we're forced into this architectural state, what will happen is your organization will start to mirror this architectural state. Big proxy and big policy store will become big proxy team and big policy store team. The way you interact with the team is to file a ticket saying, I want you to do something. We're all familiar with Ticket Ops and the glorious efficiency that is Ticket Ops today. It's just as bad in this world as it is in the networking, VPC, firewall world, possibly more complicated. That's where the composability part starts to become a problem. Organizations that put a little more time and effort into it do put effort into making sure that their policy stores compose, and what we mean there is policy composability. Imagine you have some engineer who just wants to serve 5 terabytes. Everybody remembers Broccoli Man, the meme.
If somebody in your organization wants to do something pretty simple, and to do that simple thing they have to go and integrate with the big proxy, the big proxy feels like this massive engineering effort. The impedance mismatch between what they think they have to do, which is serve 5 terabytes, or launch a simple API or a service internally in my network, and the experience of doing so creates organizational friction. It creates concern. It creates an incentive for people to work around it. That's a risky thing. Romeo and Juliet again. They worked around, by accident or intentionally, an internal control. Then that dependency took root and became a fixed part of the organization.
Then, when it gets discovered, you have to decide what to do about it. Sometimes the decision is, leave it alone. It's already become an entrenched part of my business and I can't touch it. It's Conway's Law fragmenting and driving your organization to do things, constant backpressure against controls. This is a real struggle in most organizations.
Talking about organizational structure and what organizations do: when I talk to bigger enterprises, there's a networking team that builds the networking infrastructure. There's one of those for each of the major infrastructural compute platforms that they run on. There's a team responsible for API management for ingress. There's a separate team responsible for platforms. There's a security team that buys the firewall. The egress stuff is still left as the Wild West; the security team normally owns it, but really the platform team should, and people are constantly trying to figure that out. One thing that I often say to people is, external services are APIs that you didn't build but want all the same controls for. If A is calling B or C, it doesn't matter that one of them was built internally by you and the other was built by somebody else. Why do you have different controls for those things?
At some point, the auditing universe is going to come for everybody and say, no, you have to treat them the same way. You need the same controls on them. It's just a piece of software that you didn't write. You still consume it. Don't treat it any differently. All these organizational pressures start to pile up. I like to think of this as Conway's airplane. Hopefully everybody knows what this thing is. It's the largest plane by volume on the planet, I think. It's called the Beluga. It carries the wings around for Airbus, between England and France, I think. Maybe the other way around. A really big piece of infrastructure dedicated to solving one problem, which, for solving other types of problems, can be hard and unwieldy to use.
We have this composability problem. I said we want to inject controls and tools. Probably this is the biggest challenge. This is anathema to Ticket Ops. Policy must compose. What does that mean? I'm an admin and I write a policy that says Montagues cannot talk to Capulets and Capulets cannot talk to Montagues. Great, that's a very broad policy. I also want individual teams, and there are hundreds or thousands of them within my organization, to be able to write more specific policies than that. Or to leverage the tooling that's available in the network to achieve the effect they want. Maybe they don't want to control access, but they might want to control routing.
Perfectly reasonable to let an application team control routing within their own domains, their own path structure, whatever portion of the URI space they've been granted ownership of. I want those policies to compose. That should be a feature of the infrastructure, not of the organizational process. We want tooling that does this type of stuff.
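As a toy sketch of what that composition might look like with identities rather than IPs (the types, rules, and names are illustrative, not a real policy engine), a broad organization-wide rule and a team's narrower rule can be evaluated together, with the broad rule never overridable from below.

```go
// Sketch only: deny-overrides composition of a broad admin rule and a team rule.
package main

import (
	"fmt"
	"strings"
)

type Request struct {
	SourceOrg, DestOrg, Path string
}

// A Policy reports whether it allows the request and whether it matched at all.
type Policy func(Request) (allowed bool, matched bool)

// Admin-scope rule: Montagues and Capulets may never talk to each other.
func feudRule(r Request) (bool, bool) {
	if (r.SourceOrg == "montague" && r.DestOrg == "capulet") ||
		(r.SourceOrg == "capulet" && r.DestOrg == "montague") {
		return false, true
	}
	return true, false
}

// Team-scope rule: the team owning /apothecary/ only admits Capulets.
func apothecaryRule(r Request) (bool, bool) {
	if strings.HasPrefix(r.Path, "/apothecary/") {
		return r.SourceOrg == "capulet", true
	}
	return true, false
}

// compose makes a deny from any layer final: a team can narrow access within
// its own URI space, but it can never undo the organization-wide rule.
func compose(policies ...Policy) Policy {
	return func(r Request) (bool, bool) {
		for _, p := range policies {
			if ok, matched := p(r); matched && !ok {
				return false, true
			}
		}
		return true, true
	}
}

func main() {
	policy := compose(feudRule, apothecaryRule)
	ok, _ := policy(Request{SourceOrg: "montague", DestOrg: "capulet", Path: "/apothecary/poison"})
	fmt.Println("allowed:", ok) // false: the broad rule wins
}
```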
What has this got to do with cloud native? One of the biggest features of cloud native, and Kubernetes likes to talk about eventual consistency and microservices, is that we're also thinking about eventual consistency with policy and composability, in systems designed to compose. When I think about networking, and we talk about this in Kubernetes land a little bit, we say cattle, not pets. Usually, people are talking about pods and application deployments. I think we should think about networking in the same way. Networks are cattle, they're not pets. If a network goes away, it shouldn't affect the organizational behavior of the system or the applications. It's just some IPs vanished into the Ethernet. Who cares? I don't care. What would allow us to treat them as cattle instead of pets is getting some repeatability: that my policy will work on any network.
My applications can be deployed on any network. Something on one network can talk to something on another network, and it doesn't care if it moves around all the time. Repeatability, flexibility, composability. We want these kinds of properties in our networks.
Repeat
I want the same capabilities and controls everywhere that I run my applications. If you had that property today, how much easier would your life be as an operational person? How much more time would you have to do the things that you want to do? To provide more value to the application layer, because that's the thing that's actually moving your business forward. To worry less about the infrastructure. How do we get repeatability? Probably the first thing is commodity infrastructure. The one thing we can say about commodity infrastructure, the reason why it's a commodity, is that it's designed to run anywhere. It doesn't care, really. That's why open source is such a strong and powerful force in our industry. nginx, a piece of commodity infrastructure, revolutionized how people run ingress. It runs on everything. It does a pretty good job.
The proxy I work with, Envoy proxy, is the same thing. Those are powerful policy systems. They provide a lot of capability. If I can run the same thing everywhere and get the same controls everywhere, that makes my life a lot easier. At least it should. Or maybe not easier, but at least consistent. Consistency in the long run is key for enterprises. I want policy controls. There's such a long list of things that people want to control. AI services are a real thing that organizations are figuring out how to let their developers consume and interoperate with. They need particular kinds of controls.
The cost model for AI is you pay per token. You want to keep control of your bill. You don't want to make too many API calls to an expensive service, or write too many things to S3 buckets. You also don't want to consume too many tokens, but rather than telling a developer to turn the thing off, you want to quota and rate limit on tokens instead of RPCs or REST requests. There's just a vast universe of policy out there that people are using. PCI controls. API management features. We just keep adding more and more on. That's ok. We need that, and we need that everywhere.
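Going back to the AI example, a quota keyed on tokens rather than request count might look roughly like this sketch; the limit and window are made-up numbers, not a recommendation from the talk.

```go
// Sketch only: a fixed-window quota charged in tokens consumed, not calls made.
package main

import (
	"fmt"
	"sync"
	"time"
)

type tokenQuota struct {
	mu       sync.Mutex
	limit    int           // tokens allowed per window
	window   time.Duration
	used     int
	windowAt time.Time
}

// Allow charges the quota by the tokens a call will consume, so two small
// prompts and one enormous prompt are not treated as "three requests".
func (q *tokenQuota) Allow(tokens int) bool {
	q.mu.Lock()
	defer q.mu.Unlock()
	now := time.Now()
	if now.Sub(q.windowAt) >= q.window {
		q.used, q.windowAt = 0, now
	}
	if q.used+tokens > q.limit {
		return false
	}
	q.used += tokens
	return true
}

func main() {
	q := &tokenQuota{limit: 100000, window: time.Minute, windowAt: time.Now()}
	fmt.Println(q.Allow(80000)) // true
	fmt.Println(q.Allow(30000)) // false: the team is over its token budget
}
```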
I had a long diversion about identity, and we want identity-oriented policy. What do I mean by identity-oriented policy? It means policy that's rooted in the structure of your organization, not in some infrastructural concept. I'm Louis. I work at Solo. I have a boss, she's the CEO. That's the organizational structure. The policies in my organization revolve around that structure, not the fact that I log in on a laptop and get allocated this IP address on the VPN. I was at one of the unconference sessions, and they were talking about policy and policy management, and how to turn a policy goal into policy enforcement across the various infrastructural mechanisms they had, in the cloud providers they'd acquired and on-premise. They had a mixture of on-premise and cloud. What they ended up doing was defining their own policy infrastructure, their own way of expressing their policies, and then a compiler to compile it into the infrastructure.
That's actually the general pattern that you want. You want to express your policy in terms of your organization, so it makes sense, it's grounded in your organization, and it can evolve as your organization evolves, because most of your policies are organizational in nature, and then compile it into the infrastructure. Today, those compilers are human beings running in infrastructure teams. We're making people do that, not code, and that tends to cause problems. Identity, and having identity be a fundamental feature of the network, means that the policy enforcement side can look a lot more like the policy definition side, so it's easier to reason about, it's easier to audit, and it becomes more consistent.
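As a toy illustration of that compiler pattern (all names, CIDRs, and output formats here are assumptions, not any real product's syntax), here is a sketch that takes one organizationally phrased rule and emits it for two hypothetical backends: an identity-aware one, where enforcement looks like the definition, and an IP firewall, where the point-in-time address mapping is exactly the part that rots.

```go
// Sketch only: compiling an org-level policy into two enforcement backends.
package main

import "fmt"

// OrgPolicy is stated in organizational terms, so it survives infrastructure change.
type OrgPolicy struct {
	From, To string // team or workload identities, e.g. "montague", "capulet"
	Action   string // "deny" or "allow"
}

// compileToMesh emits an identity-based rule: enforcement resembles the definition.
func compileToMesh(p OrgPolicy) string {
	return fmt.Sprintf("%s traffic from identity %q to identity %q", p.Action, p.From, p.To)
}

// compileToFirewall needs a point-in-time mapping of identities to CIDRs,
// which is the translation that goes stale as the infrastructure moves.
func compileToFirewall(p OrgPolicy, cidrs map[string][]string) []string {
	var rules []string
	for _, src := range cidrs[p.From] {
		for _, dst := range cidrs[p.To] {
			rules = append(rules, fmt.Sprintf("%s %s -> %s", p.Action, src, dst))
		}
	}
	return rules
}

func main() {
	p := OrgPolicy{From: "montague", To: "capulet", Action: "deny"}
	fmt.Println(compileToMesh(p))
	fmt.Println(compileToFirewall(p, map[string][]string{
		"montague": {"10.1.0.0/16"},
		"capulet":  {"10.2.0.0/16", "192.168.40.0/24"},
	}))
}
```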
Also, no SPOFs. One big problem with that big proxy, and the design patterns people use with it, is that they tend to have big outages. One of our customers had a big centralized thing. Somebody pushed configuration to a system adjacent to it that caused high packet loss between it and the other thing, and instant global SPOF, because everything was coming in through it. That was a very unpleasant event for that customer, because it occurred during the middle of a hurricane in the United States, and they're a major insurer. We have to be really careful about where we put SPOFs. It's tempting to use big centralized infrastructure to solve policy problems, but we really need to be able to distribute that into the network, so that the failure domains are more correlated with the individual application failure domains, because everything has failure modes, and we don't want the property of, if this one thing goes down, everything goes down. That's not good.
Conclusion
I'm not here to talk about technology or other things. I'm not going to tell you to use a solution. I put this slide in, and I even felt bad about it. I work in this space. This is what I do. This is what I've been trying to do for years. I believe in what I've been saying. This is a technology that I've been involved in building for the last four or five years to try and address some of these problems. There are other people building solutions to try and address some of these problems, but this is one that I built, or have worked on, and work on in the community. I put the slide up earlier, and I said I was the service mesh guy, so I figured I'd better put this up.
Questions and Answers
Participant: I have a question referring to your example of the big IP. Let's assume we have an application which talks to other applications, so we have an application-to-application interface. Let's assume that it's using, for instance, mutual TLS, or a certificate, or JWT. On top of that, we, for instance, introduce some extra firewall rule with an IP whitelist for this. Is that an extension of the application logic, or is it part of the control, in your understanding?
Louis Ryan: If I have mutual TLS, or let's say single-sided TLS and JWT, for the application-to-application communication, and then I also have a firewall rule, are they part of this composed system? How should you think about those things? Overwhelmingly, in most enterprises, those are written completely independently. There's no conversation about those things. A lot of firewall rules tend to be very broad in nature. Zero Trust isn't a thing in most organizations, at least in the networking world, and so the firewall rules are boundary controls.
Like Montagues can't talk to Capulets, this VPC can't talk to that VPC, and so they're not related to each other. They're organizationally distant. Those policies are written by separate people at separate times, so they're not part of a common posture. They're not reasoned about the same way, or by the same people, or even necessarily visible to one person in the organization. Policies that are written independently, with an implied dependency between them but no way of validating that dependency, become problematic.
The argument here is that you should have a policy language that says those things, and then it can compile down to a firewall rule, or it can compile down to an authorization policy that's implemented maybe in your API management system, or some other system. I've definitely worked with more forward-thinking organizations that literally have their own policy language that expresses the relationships they want between the systems, and then compiles down to those types of systems.
There's a bank in Australia that we work with, and they have a policy system that compiles down; I think they produce new versions of Palo Alto firewall rules every 5 minutes. It's pretty impressive. There are some scary parts about that, but that's what they do. The policy language is theirs. They deal with the complexity of building the compiler, but they get good value for that. Their policies then become sustainable; they don't rot. That's really the main thing, because rotting policies are usually the prime cause of a security breach or some internal lateral movement attack.