InfoQ Roundtable: Multi-Cloud Microservices: Separating Fact from Fiction

Summary

The panelists discuss whether it is possible to implement an architecture across multiple clouds, and whether multi-cloud delivers on its promises of removing vendor lock-in and shifting load during cloud-provider-specific outages.

Bio

Bastian Spanneberg, Idit Levine, Armon Dadgar, Richard Seroter, Wes Reisz (moderator).

About the conference

InfoQ Live is a virtual event designed for you, the modern software practitioner. Take part in facilitated sessions with world-class practitioners. Hear from software leaders at our optional InfoQ Roundtables.

Transcript

Reisz: Security is too weak, or it's not mature enough. It's too expensive. There are too many legal and regulatory risks to doing it. Enterprise apps can't be conveniently migrated. Each of these is a statement or argument I found in several articles dated between 2009 and 2012, arguing not against multi-cloud but against cloud computing itself. Facts & Factors, a marketing research company, recently published a report estimating the cloud computing market at $321 billion (U.S.) in 2019, and expecting it to be over a trillion by 2026. According to Computer World UK, 80% of organizations are predicted to migrate toward cloud hosting or colocation services by 2025. It's safe to say those concerns from 2009 to 2012 have been addressed.

Today, the discussion isn't about whether we should be considering cloud. The discussion is about hybrid, multi-cloud operating models and how we should be leveraging them. A Cloud Guru's recent State of the Cloud Learning Report found that 75% of the shops are using AWS as their primary cloud provider, but the same 75% also had workloads running in other clouds. What is meant by multi-cloud? How are enterprises leveraging multi-cloud? That's the topic for today's discussion.

Background, and what Multi-Cloud Entails

My name is Wes Reisz. We will be diving into, "Multi-cloud Microservices: Separating Fact and Fiction." In my day job, I work for VMware as a platform architect and solution engineer on Tanzu, which also very much targets this space.

I'm going to have each of our panelists introduce themselves. What I'd like to do is have you tell a little bit about your background. Then, when I say multi-cloud, what comes to mind? What types of things do you start to immediately think about? Then, what we'll do is we'll take those things and we'll try to shape our conversation around these particular ideas.

Richard, would you mind introducing yourself and tell me what comes to mind when I say multi-cloud?

Seroter: My name is Richard Seroter. I'm a Director of Product Management at Google Cloud. Came here from VMware, and Pivotal, and companies like that beforehand. I'm also an InfoQ editor. I do training for Pluralsight. Used to be a Microsoft MVP for Azure till I took the pill and joined Google. I spend too much time on Twitter, so you can find me in lots of places.

What do I think of when you say multi-cloud? The word that comes to mind for me, mostly, is misunderstood. I think that's because, on the one hand, there's an extreme that says multi-cloud is awful and all of those people are terrible people who hate cats and everything. That extreme says multi-cloud is terrible, because who would build apps that span clouds? That's nuts. You have latency, performance, security. There's a strawman argument that multi-cloud means apps that get spread across clouds, and it seems silly. There's the other side, though, that maybe is a little extreme, that says, let's just build it, write once, run anywhere. That we can just build clouds and treat the whole cloud as a utility service. Therefore, let's just make it all commodity. All the apps can run everywhere, no code changes. This is amazing. Obviously, there's something in the middle there. It just seems like we all mess up the narrative by hitting both of those extremes sometimes. Hopefully, today, we find something that's a little more practical and pragmatic.

Spanneberg: My name is Bastian Spanneberg. I'm currently working as the head of SRE for Instana. My initial background is software engineering. I was a software developer, software engineer. Soon in my career, I started to look more at the whole picture, building and delivering products, operating products, automating infrastructure setup. Somehow, over time, when I was still working in consulting, I was going towards this continuous delivery, DevOps, SRE topics more. Now I'm working in this field.

When I think of multi-cloud, the first word that comes to my mind is complexity. I like Richard's word, misunderstood, actually, because as he said, a lot of people, especially in marketing, have tried to paint the picture that, yes, multi-cloud will solve all our problems. We can move workloads seamlessly between clouds. Coming from a more technical background, I tend to see the complexities and the problems that come with it.

Levine: My name is Idit. I'm the founder and CEO of Solo.io. Before that, I worked in a lot of startup companies that got acquired by VMware and Verizon. Afterwards, I worked at EMC in the CTO office, and as their cloud management CTO. Generally, in the last seven years, I've been working in open source. Right now we are focusing on service meshes like Istio and others, and on Envoy, contributing a lot there, and so on. That's the stuff that we're focusing on.

As part of doing this, we see a lot of customers in the field. Service mesh is a buzzword right now, and therefore everybody is interested in talking about that. What I see is that multi-cloud, in general, is something every company offers, in the sense that they allow their divisions to go to a multi-cloud environment and have an agreement with each of the providers. You rarely see a group leveraging more than one of them. For instance, they may have the ability to go to the big three. I think that we have a lot of types of customers, from the small to the big ones. I will argue that, besides one of the huge companies, only one of the groups has over 40 data centers. Most of them are basically focusing on one of the clouds and building everything for it. They like the fact that they maybe can use the other one, but nothing in their software is aligned with that. There's no mobility between them. That's what we see at least.

Dadgar: My name is Armon Dadgar. I'm one of the co-founders of HashiCorp. Our focus is generally cloud infrastructure automation, but very much with a bent toward, how do we span these multiple environments? I love, Wes, that you started with almost quoting the contrarians of 2009 of, why not cloud? In some sense I think you're right, that we see the same critique now of multi-cloud. I think, from our vantage point, certainly from the HashiCorp customer base, our view is every major company, call it the Fortune 2000, will be a multi-cloud company. We'll get into it later as to why, what are the reasons and motivations? Our thesis is very much that you will be, whether by choice or by accident.

I think in many ways that HashiCorp's focus as a result of that is, great, if you're going to be operating it that way. I think the word that comes to mind for us is fragmentation. Everything is API driven. Everything is software driven, but there's no standards, there's really no interoperability between any of these environments. If you're going to span on-prem and multiple cloud vendors, it's a highly fragmented landscape. How do you do that in a way that you have some common operating model? That's very much the HashiCorp focus of how do we bring that common operating model and address the fragmentation that we see in the market? I tend to agree that it feels obvious that where we're going is every big business will be a multi-cloud business.

The Middle Ground that's Not Being Addressed with Multi-Cloud

Reisz: I got words like misunderstood, complexity, fragmentation, inevitable, that this is the direction whether it's through purchases, or whether it's from actual need in multi-cloud. It's almost inevitable that we're going to go there. Is this just a tooling problem? Do we just need better tools to be able to multi-cloud? What's the reason, Richard, that there's all this middle ground that isn't being addressed by what we're doing today?

Seroter: I don't know. I talked to a game company a couple weeks ago, and if you look at most game creators now, if you're going to go on Xbox, you get Azure credits, if you're going to go on Stadia, you get GCP credits, if you're going to go with Luna, you get AWS credits. A game company is never going to be single cloud at this point, because there are financial incentives. Other companies purposely spread out for risk purposes or geo purposes or different services. I'm a believer that the multi-cloud that makes sense today is right cloud, right app. If you make great choices, you'll be all in on GCP. Normal people are going to use a lot of different types of services, because GCP is amazing at some stuff. Azure is great at certain things. AWS, as well. Let alone the DigitalOceans and Linodes and other things. We're going to have lots of stuff.

Gartner said it in a nice way, "It's not about data centers anymore. It's about centers of data." When I'm putting compute close to data, I'm thinking about distributed compute. I think the hard part is most haven't played that out totally in their head yet. Are we missing a magic tool? I don't think so. I think a lot of this is architectural. Some of this is continually thinking about, how are you doubling down on the right programming skills? Are you building distributed systems that are fault tolerant, that are loosely coupled, that can actually be running in different places? I don't think there's some cloud management platform that's going to solve all this. Personally, I work on Anthos, so clearly, we're trying. There's more to it than that. To me, it's architectural. It's a skills based thing. It's exciting. I think if we're going to cross this chasm and make multi-cloud truly successful for most people, we're going to have to think about the skills we need to do it first.

Does Multi-Cloud Mean One of the Major Cloud Providers?

Reisz: Armon, when we talk about multi-cloud, do we just mean one of the major cloud providers? When we talk about cloud operating models, common operating models, I think is the phrase that you use, are we just talking about the major providers when we talk about multi-cloud? Let's try to define this.

Dadgar: In fact, I think we need to bring private cloud into the discussion too. When we talk about it, I think it's easy to write it off, and be like, obviously, everything's going to public cloud. The reality is when you go and talk to some of these Fortune 50 companies, they have 200,000 machines spanning 80 data centers. Their investments are in the billions of dollars. I think it's difficult to just write all of that off as well. I think the way we talk about cloud operating model is, it's really about cloud as a process, cloud as a way of thinking and architecting, less as a location. In that sense, you can think about on-premise as, if you want to call it that, a cloud endpoint. It's a different way of working there than a traditional data center. I think you can treat it like a cloud.

Yes, I think that when you're talking about public cloud, then the challenge there is really a question of the capital expenses required to be a major player today, is not for the faint of heart. If you're looking at the earnings of all the major players, you're talking about billions in CapEx. I think the reality is, there's only a handful of players that are able to operate at that league, when we talk about where public cloud is going.

Reisz: Idit, any thoughts?

Levine: First of all, I totally agree that we still see a lot of on-prem, kind of like cloud. I think it's important to bring this to the table. I also think that the clouds that we see are the big three: Google, Amazon, and Azure. I think that, ideally, everybody will want, exactly like Richard said, to leverage the right solutions, or the right services that exist on different clouds. There is always a tradeoff. If you build your infrastructure to actually run everywhere, that would be far more expensive than just saying, I'm going to focus on one. I know what it is. It's perfect. That's the only thing that I would test. It's just way simpler. These tradeoffs have to be weighed, and you ask yourself the question, why should I support more than one? That's the real question. Is that a finance question? Is that because you don't have services in other places? Is that because you want to fail over between them? I don't think we see. This is the real question, in my opinion. Why would you want to go with more of them? What was the motivation? Because it is expensive.

Why People Are Driving Towards Multi-Cloud

Reisz: Let's actually talk about that a bit. Armon, I remember I read a post, or I think it was a video discussion when I was prepping for this, with you and Mitchell talking about why people are talking about multi-cloud. What is that? Why are people driving towards multi-cloud?

Dadgar: I think to Idit's point, in some sense you need a compelling why? Because there is a real cost, and there's a real complexity curve there. I think from what we've seen that there's a few really practical real world things. You decide to go all in on GCP, let's say. You buy a company that went all in on AWS, welcome to multi-cloud. I think there's a certain level of large business that do M&A all the time, and great, if you only have three or four major cloud providers to pick from, and you're doing a lot of M&A, you're going to end up in all of the above. It's just a matter of time. I think that's a really practical reason that you see.

Another one goes to the credit conversation earlier. You spend $20 million a year on Oracle Database licensing, you're going to get a sweetheart deal on Oracle OCI. If you're like, I can save 50% of my Oracle licenses, and what do I care where they're running? You're going to be on Oracle Cloud. I think you see that ELA conversation play out whether you're a Microsoft customer, or IBM customer, Oracle customer. There's the transition, if you have existing spend, we want to move you to our cloud. There's an economic motivation to do that.

Then, I think you get into the regulatory stuff. If you're a UK bank, your regulators told you, unless you operate in two different cloud providers, you're not going to cloud. You're going to enable multi-cloud. I think there's a lot of these different very practical things that you're either driven through a regulatory regime, you're doing M&A. You had these other practical concerns, you need to operate in whatever, Germany. You need a specific region, and maybe only Azure has what you need there, or whatever. There's a bunch of these reasons. You want to sell to government, you need to be in GovCloud. There's a bunch of these, I think, really practical reasons that just force you there, whether or not you necessarily want to.

I think with all of those things, we step back and go, there are things that affect big companies, by and large. You're not talking about startups that have complex regulatory regimes generally. I think that's why. It tends to be, I think, you see the split of the very biggest companies are multi-cloud, almost by definition, versus, yes, if you're a startup and a smaller business, why would you take that complexity on in an optional way?

Dealing with Developers and Teams in a Multi-Cloud Environment

Reisz: Bastian, I'm curious, you manage SREs, you're an SRE, when people start talking about multi-cloud, like Armon just mentioned, you just purchased another company, welcome to multi-cloud, what starts to go through your mind when you need to help work with developers and teams that are now all of a sudden in a multi-cloud environment?

Spanneberg: Currently, it's the other way around for us. We have been acquired by IBM. A lot of IBM people are talking to me about multi-cloud strategy at the moment. I want to get back one step, I want to add one thing to the implications of multi-cloud, because another thing I see is it depends a lot on your business model. If you're an enterprise company, like we are, the point that Armon mentioned is really a very valid point. A lot of people have big tiers with cloud providers, and they might have credits, they have committed spend. You want to offer your service to the customer where he is. That's another point for us, for example, why we run multi-cloud. We want to be, for one, close to the customer. It's often also way easier to do a deal with a customer, if he's already on the cloud. If he has a committed spend, he has credits, and you're in the marketplace, you can have quicker deal cycles. For enterprise companies, I think that's a very big point, too.

What Workload, Data, Traffic, and Workflow Portability Means

Reisz: Let's actually come back to the observability SRE question, because I want to keep going with this just a little bit. I'll go back to you again Armon. There was another article that I read where you or HashiCorp talked a bit about workload portability, data portability, traffic portability, workflow portability. I really like that, because it starts to segment a little bit on some of the things that are really good targets for multi-cloud, and things that are at least a little bit longer pole in the tent. Could you talk a little bit about what's meant by workload, data, traffic, and workflow portability?

Dadgar: I think Richard did a great job at the beginning talking about how multi-cloud isn't a binary where you have these extreme points. I think that it is a spectrum. That's where we talk about these four checkpoints, maybe, along the spectrum of multi-cloud. When we talk about data portability, the way I see it is there are two ways to think about it. One is what I'll call an insurance contract, which is, I'm going to put all my data in MySQL in Google. If I decide tomorrow, for whatever reason, I want to move that to Amazon, great, I can spin up a MySQL in both regions. I can do a SQL dump, I can do a SQL import, but it's this expensive one-time data move. You can think about it as almost an option. By picking MySQL, I have a sort of option contract. It's cheap for me to pick the option. Picking MySQL is not an expensive thing, but if I exercise that option, all of a sudden that's expensive, where I have to move 100 gigs or a terabyte, or a petabyte, or whatever it is. That's what I'll call optional data portability, or one-time data portability. I think that's pretty easy.
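The cost of exercising that one-time option can be approximated with simple arithmetic. The sketch below is an illustration, not something the panel presented: the 1 Gbps link and the 80% efficiency factor are assumed values, and `transfer_hours` and `estimate_all` are hypothetical helper names. It only estimates wire time for the data sizes mentioned above and ignores dump time, import time, and egress charges.

```python
GB = 10**9  # decimal gigabyte, in bytes

# Data sizes mentioned in the discussion.
SIZES = {"100 GB": 100 * GB, "1 TB": 1_000 * GB, "1 PB": 1_000_000 * GB}


def transfer_hours(size_bytes: float, link_gbps: float, efficiency: float = 0.8) -> float:
    """Hours needed to push size_bytes over a link of link_gbps gigabits per second.

    efficiency is an assumed fraction of the nominal bandwidth that is
    actually usable for the copy.
    """
    usable_bits_per_second = link_gbps * 1e9 * efficiency
    return size_bytes * 8 / usable_bits_per_second / 3600


def estimate_all(link_gbps: float = 1.0) -> dict:
    """Estimated transfer time in hours for each size in SIZES."""
    return {label: transfer_hours(size, link_gbps) for label, size in SIZES.items()}


if __name__ == "__main__":
    for label, hours in estimate_all().items():
        print(f"{label}: {hours:.1f} h")
```

Under these assumptions, 100 GB moves in well under an hour, while a petabyte takes on the order of months, which is one way to see why the option is cheap to hold but expensive to exercise.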

For most people, if you pick a set of relatively standard services, great. If you're running Kubernetes, and you're storing data in MySQL and memcached, you have that. It's pretty easy to migrate that. I think that then if you go one click further and say, I want real-time data portability, I want to be able to write to my Google Cloud instance and have it replicate in real time to Amazon, now you're getting into a much higher cost. Now you're saying, I need a truly cloud native, maybe Cassandra-type architecture. I'm running active-active. I'm constantly double paying for bandwidth and things to do this real-time replication. That's the next click over.

I think past that, then it's really around what we call workflow portability, which is, I want a consistent CI/CD pipeline for how I take my app out to cloud, whether that's Google, whether that's Amazon. That tends to be where HashiCorp and friends play. It's like, great, I might pick Terraform and Kubernetes. I can build a common pipeline where I go through the CI/CD thing, it packages a container, and it lands on a Kube cluster. Whether it's GKE or EKS, I can have a consistent way of app delivery.
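One way to picture workflow portability is a delivery plan in which only the cluster selection varies by provider. The following is a minimal sketch under that assumption. `ClusterTarget`, `plan_delivery`, and the context names are hypothetical, and the steps are plain strings rather than calls to Terraform, kubectl, or any real CI system.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ClusterTarget:
    """A Kubernetes cluster to deploy to. All values are hypothetical."""
    name: str          # e.g. "gke-prod" or "eks-prod"
    provider: str      # e.g. "gcp" or "aws"
    kube_context: str  # name of the kubeconfig context for this cluster


def plan_delivery(image: str, target: ClusterTarget) -> List[str]:
    """Return the ordered pipeline steps for one target.

    Only the cluster-selection step depends on the target; build, push,
    and apply are identical everywhere, which is the point of a common
    workflow.
    """
    return [
        f"build container image {image}",
        f"push container image {image}",
        f"select cluster context {target.kube_context}",
        f"apply manifests for {image}",
    ]


def provider_specific_steps(image: str, a: ClusterTarget, b: ClusterTarget) -> List[int]:
    """Indexes of the steps that differ between two targets."""
    steps_a, steps_b = plan_delivery(image, a), plan_delivery(image, b)
    return [i for i, (x, y) in enumerate(zip(steps_a, steps_b)) if x != y]
```

For a GKE target and an EKS target with different contexts, `provider_specific_steps` reports a single differing step, the context selection. A real pipeline would have more provider-specific seams (credentials, registries, ingress), so this understates the work involved.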

I think that next layer up becomes truly workload portable, which is this idea of, today I'm deploying that up to Google. Now I want to switch gears and deploy it into AWS, the same app. This is getting closer to the extreme end of Richard's spectrum. I think the challenge to actually really do that is, what if you're using any sticky service? You're using Stackdriver. How do you move from using Stackdriver in Google? There isn't a good equivalent for that in Amazon, all of a sudden. Now you really have to architect around truly everything portable, no sticky dependencies, cloud as a commodity, to Richard's point. It's difficult to architect against that.

Then the most extreme, I think, is the panacea. Which is, I push a button. I migrate it. I just change my DNS traffic routes, and everything just works. I think to achieve that, you need all of the other prerequisites. Your data is active-active replicated across site. You have workflow portability. You have workload portability, and no sticky dependencies. You can just turn this knob. I think the cost of architecting for that, and the cost of the infrastructure to be able to have that optionality is immense. You might need it. If you're a bank, and you have a regulatory requirement, too bad, you have to do it. You have no choice. I think most people don't need to live in that world of crazy cost and complexity.
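The prerequisites listed above can be written down as a small checklist. This is a sketch of one reading of the spectrum, not a formal model from the panel: the field names and the `PortabilityProfile` type are hypothetical, and treating workload portability as "common pipeline plus no sticky dependencies" is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PortabilityProfile:
    """Hypothetical self-assessment of one application."""
    standard_data_store: bool        # e.g. MySQL rather than a provider-only store
    active_active_replication: bool  # data is replicated across sites in real time
    common_delivery_pipeline: bool   # same CI/CD workflow for every cloud
    no_sticky_dependencies: bool     # no provider-only services the app relies on


def one_time_data_portable(p: PortabilityProfile) -> bool:
    return p.standard_data_store


def real_time_data_portable(p: PortabilityProfile) -> bool:
    return p.active_active_replication


def workflow_portable(p: PortabilityProfile) -> bool:
    return p.common_delivery_pipeline


def workload_portable(p: PortabilityProfile) -> bool:
    # Assumption: the same app can land elsewhere only if the pipeline is
    # common and nothing it depends on is tied to one provider.
    return workflow_portable(p) and p.no_sticky_dependencies


def traffic_portable(p: PortabilityProfile) -> bool:
    # "Push a button and change DNS" needs every other prerequisite.
    return real_time_data_portable(p) and workflow_portable(p) and workload_portable(p)
```

A profile with a standard data store and a common pipeline, but no active-active replication, passes the first and third checks and fails `traffic_portable`, which matches the point that most teams can stop well short of the most expensive end.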

Starting on the Multi-Cloud Journey

Reisz: Richard, where do you start on this journey? Some great points here. What are your thoughts on if you want to strangle the unicloud? How do you go about this?

Seroter: Such a vivid image, Wes. Even to Armon's point, you alluded to it, but multi-region is hard. Pick just GCP, AWS, or Azure: how many people are even running in more than one region? That's hard. Replication, latency, data duplication, deployment consistency, drift. Start with that. Don't start multi-cloud, probably. That's probably a terrible idea. Armon, you mentioned the point of startups. Some startups are going to use BigQuery for analytics, others like using Lambda at the same place. They're using Twilio. They're using Confluent Cloud. Arguably, everybody is multi-cloud because no one is finding all the services with a single public cloud provider. When I think of starting, you're already going to be doing it even when you just start with a single cloud, you're dipping into other managed services in the cloud. I wouldn't start multi-cloud, not from an IaaS perspective, or not like a cloud application service. Start multi-region. Learn the difficulty of provisioning consistently across regions. I have Terraform up in my Visual Studio Code right now building out Anthos clusters, because it makes my life easier. Learn some of the foundational tooling to even get multi-region working and deployment consistency and operational consistency. Then get a little bold and brave and say, let me add a GCP region here. Let me add this here. Then all of a sudden, it's not so scary. Trying to pick up a cloud when you haven't even done distributed systems in a cloud is probably a recipe for failure.

Reisz: Bastian, anything you want to add? Any thoughts?

Spanneberg: No, I totally agree. That pretty much is the picture that we went through. When I joined Instana, we just had one simple region, then we had a migration to a different operational stack at that point. Then we expanded to two regions in one cloud. Now we are running over several clouds. I can only totally agree with what you said, Richard. Every step in this way comes with its own problems and its own difficulties you have to solve. I think it would be a bad idea to try to start multi-cloud from the get-go. That said, however, I think it's maybe not bad to have it in the back of your head in terms of how you build your architecture, because some errors you might make are hard to build back once you're big enough.

Dadgar: I think maybe just adding on to that thought. There is some amount of upfront thinking, I think, that can save you a lot of pain, sort of the ounce of prevention, worth a pound of cure type of thing. If you know that at some point you're going to go multi-cloud, multi-region, I think some of those upfront decisions that actually do make sense are things like, if I know I'm going to pick AWS and Google, what are the set of common services, at least if I can constrain myself to those, then I know that changes my optionality in a way, that if I had said, I'm going all in on DynamoDB and Spanner. You've built yourself into a cul-de-sac that's going to be very difficult to then architect back out of. There are some of those architectural decisions that you can make early on, that you're like, I've preserved optionality at least. I might not use it. I might decide I don't need to go multi-cloud, but it's a lot easier if you're not all in on a technology that's not portable.

Where to Draw the Line

Reisz: Where do you draw the line, though? Premature optimization is the root of all evil. Where do you say this is a decision that I should make multi-cloud, this is not? That it's something I should just punt for later. How do you really know that right now in your journey?

Dadgar: I think the policy we take internally is take a hard look at what we think cost of rewrite would be. What I mean by that is like if we have a service that's using Aurora, by and large, if you squint, it's MySQL compatible. Aurora has certain features and limitations specific to Amazon. There is a path off of Aurora if we need to, that's not horrendous. Versus if we said, Dynamo, you're like, it's such a specific object model, API model, execution model that a migration off of Dynamo is almost an app rewrite to deal with that. That's where we look at and say, can we draw these lines? Maybe DocumentDB, for example. That's ok, because it's Mongo equivalent. You could migrate to Mongo, fine. I think that's how we think about it internally.
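The "cost of rewrite" test can be sketched as a lookup from a managed service to the portable engine it is broadly compatible with. The three entries below mirror the examples in the answer above. The function names and the "migration" / "rewrite" / "unknown" labels are hypothetical, and this is not HashiCorp's actual internal policy, only an illustration of the reasoning.

```python
from typing import Dict, Iterable, List, Optional

# Managed service -> portable engine it is broadly compatible with,
# or None when leaving it is close to an application rewrite.
PORTABLE_EQUIVALENT: Dict[str, Optional[str]] = {
    "Aurora": "MySQL",        # "if you squint, it's MySQL compatible"
    "DocumentDB": "MongoDB",  # described above as Mongo equivalent
    "DynamoDB": None,         # specific object, API, and execution model
}


def rewrite_cost(service: str) -> str:
    """Classify the exit cost of a service as 'migration', 'rewrite', or 'unknown'."""
    if service not in PORTABLE_EQUIVALENT:
        return "unknown"  # not assessed yet; do not guess
    return "migration" if PORTABLE_EQUIVALENT[service] is not None else "rewrite"


def services_needing_rewrite(services: Iterable[str]) -> List[str]:
    """Sorted list of the services whose exit path is an app rewrite."""
    return sorted(s for s in services if rewrite_cost(s) == "rewrite")
```

Running `services_needing_rewrite(["Aurora", "DynamoDB", "DocumentDB"])` flags only DynamoDB, which is where the line was drawn in the discussion.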

What Works with Multi-Cloud

Reisz: Idit, you do a lot with customers that are having to deal with these problems. You see, most people are really invested in multi-cloud or a single cloud, unicloud. What are strategies that are working? Is this, we need to go service mesh? What's actually working when people have multi-clouds?

Levine: I wanted to reiterate what Armon said, which he put very well. If you're thinking about what cloud is today, we should go back to computing basics. It's basically compute, it's networking, and it's storage, and some services and monitoring and tools to help. Let's look at the three that we have. The first one is compute. I think that we did a very good job with abstracting it. I think Kubernetes is doing a very good job of doing it. I deal with that and other tools like Terraform, and others. We get that. Then there is the traffic, then the storage. The storage, it's not going to go away. As Armon said, we can, we tried. I was at EMC before, and I can tell you, there are so many people trying to solve this problem, and it's just a very big problem to solve. Eventually, the only good way to do this is to take all this data and move it physically by truck to another cloud. If you're looking at the Netflixes of the world, they're totally locked in. There is no other way for them to move all this giant data, and there's nothing we can do about it. First of all, let's be aware of it. It doesn't matter what. This is something that will lock you to a specific cloud. That's number one.

Then there is the traffic. The question is, did we actually abstract the traffic? I'm selling service mesh. I can tell you that that's where we're going with. That's what we're trying to do, the connectivity between the clouds. It's also expensive. If I'm running right now in one cluster at the same time, and one service wants to talk in the same cluster and have exactly the same service in a different cloud, physically, I will not want to do this because it will take me more time to get there. Latency is extremely important when you're talking about application. What we see in a lot of places is what we're calling routing based on localization. Where are you? What is the region? Then I will fail over. Until you're actually going to get this failover to a multi-cloud, to be honest, you will have a lot of fallbacks before because you want it to be fast. Latency is extremely important.
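The locality-based routing described here can be sketched as "pick the closest healthy endpoint, and only cross a cloud boundary when nothing nearer is available". The sketch below assumes a simple four-level ranking (same cluster, same region, same cloud, other cloud). `Endpoint`, `locality_rank`, and `pick_endpoint` are hypothetical names and do not correspond to the API of Istio, Envoy, or any other mesh.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Endpoint:
    """One instance of a service. All values are hypothetical."""
    address: str
    cloud: str
    region: str
    cluster: str
    healthy: bool = True


def locality_rank(caller: Endpoint, candidate: Endpoint) -> int:
    """Lower is closer: 0 same cluster, 1 same region, 2 same cloud, 3 other cloud."""
    if candidate.cloud != caller.cloud:
        return 3
    if candidate.region != caller.region:
        return 2
    if candidate.cluster != caller.cluster:
        return 1
    return 0


def pick_endpoint(caller: Endpoint, candidates: List[Endpoint]) -> Optional[Endpoint]:
    """Return the closest healthy candidate, or None if none is healthy."""
    healthy = [c for c in candidates if c.healthy]
    if not healthy:
        return None
    return min(healthy, key=lambda c: locality_rank(caller, c))
```

With this ordering, a call fails over to another cloud only after the local cluster, the local region, and the rest of the same cloud have all been exhausted, which reflects the point that there are many fallbacks before a cross-cloud hop.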

On the three things that we are talking about, compute, awesome, we totally solved. I'm talking about solving. When we're talking about services, we didn't solve Lambda and serverless, which is locking you to the provider again. Then there is the storage, we didn't solve it. I don't think we will. It's a very physical problem that I don't know how we can go about doing it. Then the last one is connectivity that we can allow that, as worst case, you can't go to the close one, go to the very far one, but it's worth it. You will not solve that as well. Our solution is multi-cloud. You can group mesh it between Istio on-prem and App Mesh in AWS, and bring them as one entity of virtual mesh. You still eventually will prefer to stay on the same boundary because of the latency. Forget about security, latency, I think that's something that is really critical.

Reisz: Armon, you know a thing or two about service meshes. What are your thoughts?

Dadgar: I totally agree with Idit, in the sense that latency is the great killer of a lot of these things. We haven't found a way to defeat physics yet. I think the other piece maybe I'll add in terms of thinking about compute storage network. Even as we think about the networking layer as, do we have a common abstraction? To me, I think there's a bigger challenge, which is, the clouds and on-prems all have different mental models for this stuff. I think what you see on-prem is great. Maybe you went all in on Cisco gear and use Palo Alto firewalls, and you had network fabrics and VLANs, and whatever. Great. You try and translate that to cloud, now you're like, it's not equivalent. They have a different notion of security groups and Vnets, and VPCs, and things like that. All of a sudden, you're like, how do I fit this back into my on-prem network model? Then as you go cross cloud provider, they're all different, slightly. It's not like fundamentally different. You squint, and it's same abstractions. They're just different enough. Then you're like, there's no way for me to easily say this Amazon VPC can talk to this Google VPC, or this security group can talk to this Azure Vnet. Once you start crossing those network boundaries, it just becomes a complete and utter nightmare mess, basically. Because effectively, what you have is each cloud is running an SDN, it's all software defined networking, but they're all completely incompatible with one another. They have no notion of each other's identity and primitives that build up those networks.

I think this is, in my mind, one of the biggest motivators for service mesh. I think when we talk about it again, to me, it goes back to that fragmentation question of how do I rationalize the fact I have four different network models between Amazon, Google, Azure, and on-prem. It's really almost saying, we're going to abandon the network model from the physical layer. You're like, in some sense, I don't care that I have VLANs over here and VPCs over there. I'm going to move it to a service mesh layer. I'm going to click it one level up, and just treat my network as a dumb bit pipe. I don't care about what my security groups are, my VPCs, it's a dumb bit pipe. As long as the bits flow, I'm going to solve it at a logical layer with the service mesh, where I'll bring application identity and manage the traffic routing and rules at that layer. That gives me a networking abstraction, where now I'm not battling the fragmentation and inconsistencies of these SDN models.

I think to me that's probably one of the most compelling reasons to think about service meshes. How do you rationalize these SDNs that otherwise don't speak to one another, don't interoperate with one another? You ignore them, to a large extent. Treat it like a dumb network, push it up a level, and then rationalize. To Idit's point, let's be really careful about what our motivation is. If our motivation is I'm going to have an app call from Amazon to Google to on-prem and back to the user, your latency is going to be outrageous. Hopefully, that's not your use case. If it's more about, there's these selective cases where I'm crossing these boundaries and I want a rational security model and a rational networking model, yes, that's actually a really good use case for these things where you're paying the latency cost anyways. How do you do it in a way that you're not losing your mind managing firewall policy?
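
As an illustration of the idea Dadgar describes, the following is a minimal sketch of making the allow/deny decision on service identity at a logical layer while treating the network as a dumb pipe. It does not reflect the API of any particular service mesh; the `Service` type, the `ALLOWED` rule set, and the service and network names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    name: str     # logical service identity (hypothetical names below)
    network: str  # where the service runs; deliberately ignored by the policy


# Hypothetical allow-list of (source identity, destination identity) pairs.
ALLOWED = {("web", "orders"), ("orders", "payments")}


def is_allowed(src: Service, dst: Service) -> bool:
    """Decide on service identity alone; the underlying network is never consulted."""
    return (src.name, dst.name) in ALLOWED


web = Service("web", "aws-vpc")
orders = Service("orders", "gcp-vpc")
payments = Service("payments", "on-prem-vlan")

print(is_allowed(web, orders))    # True: the rule holds across cloud boundaries
print(is_allowed(web, payments))  # False: no rule for this pair of identities
```

Because the rule never looks at `network`, moving `web` to a different cloud would not change the policy, which is the property being argued for here.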

Rationalizing Certain Aspects of the Cloud

Reisz: Richard, I want to get your thoughts here on service mesh, but before I do, you come from a major cloud provider. I'm curious, you must get this question that Armon talked about: with AWS EKS Anywhere or Anthos, these things we're getting from some of the providers, are we just shipping the constructs of that particular cloud provider and bringing them on-prem, or to other locations? How do you answer the question that Armon just brought up about how every cloud provider has a slightly different way of rationalizing certain aspects of the cloud?

Seroter: That's the management challenge. Tenancy is wildly different in each one. How do you segment tenants? How do you get access, identity models, API surfaces, SDKs? That's the underrated part of multi-cloud. You can grab a Postgres database and a K8s cluster in every cloud. That's not the hard part. The hard part is figuring out AWS IAM, or how do GCP projects work, and things like that? Part of what I think, whether it's an EKS Anywhere or what we're doing with Anthos, is that it's taking some of our model. We're trying to do it all with open source tech, Istio, Knative, Kubernetes, things like that, so that surface is at least similar. Let's be honest, everything has a lock-in component. At some point, you just have to embrace the fact that everything is lock-in. Java is lock-in. Kubernetes is lock-in. My marriage is lock-in. It's all switching costs and value. If I'm getting value from the thing, go all in on Lambda. I don't care. If you're getting a ton of value from it, yes, maybe it's a rewrite, but you're not planning to rewrite. Same with Cloud Run, or same with BigQuery. At some point, architecting for extreme portability is a waste of your time. If your whole goal is better business outcomes through software, I've got to make some bets.

I think the service mesh is going to be an interesting one. I'm not super smart on service mesh. I've been late to finally get the religion. I think I have it now. That seems like maybe that's one of those good layers you do bet on, that says, this is a normalization layer, but let's not try to normalize everything else. Because we're going to do that and we're not going to ship value for two years while we build some magical multi-cloud platform that's now out of date the second it's available. At some point, that's not the value. What is the simplest thing you can do that's maintainable? Do that. If it involves lock-in and you want to use Dynamo, awesome, Dynamo is terrific. Use Bigtable. Use whatever. Then just pick maybe a few strategic things that should be cross-cutting, that should be portable. Yes, bet on Terraform. Bet on Istio. Bet on a handful of things. Then just ship good software, so you actually stay in business long enough to do the next version.

Dadgar: I totally agree. I think this goes back to when we talked about that framework, it's like, not architecting for workload portability, unless you absolutely have to. It's the same recommendation. If you have to, ok. Otherwise, yes, I totally agree with Richard. The business value case is really hard to justify there. You're way better off being like, just go use whatever, Cloud Run and Spanner, and call it a day, and don't worry about it.

Thoughts on Multi-Cloud and Service Mesh

Reisz: Bastian, from observability, from SRE, from thinking about multi-cloud and having to troubleshoot things and to interact and keep things reliable as services are operating, what thoughts do you have with multi-cloud and service mesh?

Spanneberg: I'm also not too smart on service mesh. I can't add a lot there. A lot of it has to go into your application, into how you engineer your application: what mechanisms you build in, what failover mechanisms or load sharing mechanisms your software has, what metrics you expose. Maybe you do distributed tracing, you add some OpenTracing tags, whatever. At least a part of observability has to happen in the engineering process. I think after that, it really depends which route you want to go. In our case, it's very easy because we are an observability company, so we use ourselves to observe us in all the different clouds we use.

The Role of Observability with Multi-Cloud and Service Mesh

Reisz: Idit, what's the role of observability with multi-cloud and service meshes?

Levine: Let's talk once again about service mesh, because I think that in this panel we're moving to multi-cloud. That is a side effect, but that's not the reason service meshes exist. Service mesh actually started because people started adopting microservices environments. They saw that they could go faster. It was a solution for organizational problems. I had a lot of teams contributing to the same source code, and I needed to separate that somehow to make them autonomous. Then the service mesh came to bring it all together. When we're talking about observability, the thing that I'm looking at with service mesh is that what it's doing is basically taking the operational code, like observability, security, and routing, outside of the application code. It's abstracting. I think this is the power. Just understand, we basically abstract all the network at that point. I think that, again, we can leverage it on multi-cloud, of course, and I can tell you because that's what we're doing.

We have a lot of customers that started with the process of walk, run, fly. Starting with a simple cluster, they just want to see what's going on inside the cluster, at least that's what people think they want. Then they wanted to go and say, ok, now we need maybe more security on the edge, maybe develop a portal to share it with other people. Then we're moving to the run one, which basically is saying, now I want to put it in my organization. I need a workflow, an approval process. I need to see, how does that fit? Then the fly is basically, now I need to run multi-cluster, multi-cloud. How do we do this? Now we'll bring all the tools that we built for this all the way there. That's what we're doing. If you're looking at this, I can name a lot of customers that we have. What we see is they're running over 100 clusters in production, different clouds, and so on. Again, that's what service mesh is solving. Multi-cloud is just a nice [inaudible 00:40:43].

Reisz: We shouldn't try to insinuate that it's there to solve a multi-cloud problem. It's there to let developers focus on the business logic and not the plumbing between services for those communications.
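
The separation Levine and Reisz describe, operational concerns handled outside the business logic, can be sketched in a few lines. This is only an in-process analogy under stated assumptions: a real mesh sidecar is a separate proxy process, and the `with_sidecar` wrapper, the `business_logic` handler, and the metric names are hypothetical.

```python
import time
from typing import Callable, Dict


def with_sidecar(handler: Callable[[str], str],
                 metrics: Dict[str, float]) -> Callable[[str], str]:
    """Wrap a handler so request counts and timing are recorded outside of it."""
    def proxied(request: str) -> str:
        start = time.perf_counter()
        try:
            return handler(request)
        finally:
            # Observability lives here, not in the application code.
            metrics["requests"] = metrics.get("requests", 0) + 1
            elapsed = time.perf_counter() - start
            metrics["total_seconds"] = metrics.get("total_seconds", 0.0) + elapsed
    return proxied


def business_logic(request: str) -> str:
    # The application contains no metrics or routing code at all.
    return request.upper()


metrics: Dict[str, float] = {}
service = with_sidecar(business_logic, metrics)

print(service("order-42"))   # ORDER-42
print(metrics["requests"])   # 1
```

The handler stays unchanged whichever cluster or cloud it runs in; only the wrapper knows that metrics exist.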

Parting Thoughts on How to Address Multi-Cloud

Reisz: Richard, what are your parting thoughts? If someone has purchased another company and they're multi-cloud, what are your thoughts for them to address this reality they're now in?

Seroter: We assume it's inevitable. Your primary cloud choice probably lasts as long as your CIO does. They keep switching a lot. That's what's happening right now. Or maybe you're multi-cloud because you made your primary cloud so hard to use. If you're an IT team, loosen up a bit, because your teams are using other clouds because the primary has too much process and procedure behind it. If you're a dev team, think about the right architecture that's going to prepare you for these distributed systems. Pick a few foundational things that are going to matter, and don't over-invest upfront in portability. Always keep your focus on speed and maintainability.

Spanneberg: I can only agree. Pick the right abstractions in the beginning, but start small, and then take it step by step, as we had with the example before, like a region, two regions, two clouds. One thing after the other.

Levine: I think that Richard said it nicely. Competition is driving innovation. I think that the fact that we have a lot of clouds competing right now with the on-prem solutions, it's actually fantastic, because it forces them to keep up. I think that this is actually very interesting. To me, as Yoda with just force, is like, always know where you want to go, what is the vision? Where do you want to go? Ideally, you want to abstract everything: I'm not locked into anything and everything is great. But build only what you need right now. Abstract, create an interface for that, but build only what you need right now, and then know where you're going.

Dadgar: I would just add to maybe what Idit was saying, which is, I think what we often see is a lot of customers end up sleepwalking into multi-cloud. It's not really done with any intentionality. I think probably my biggest advice is be really intentional. What I mean by that is, have a solid understanding about what's the motivation, what's the goal, what's the outcome you're trying to achieve? In some cases, if you say, our goal is the regulator says we have to be able to do this full failover. Great. From that goal, maybe you figure out that you actually do need to architect for true workload portability. That's going to constrain your design and your thinking. If you start and you say, my goal is actually just, I have some credits that I got from Oracle Cloud and so I should also be using those, then don't kill yourself in terms of workload portability. Just figure out what's a viable set of workloads that you can consume those credits with. You can have a much more reasoned approach to what you're doing with your multi-cloud. I think just starting with understanding what the goal and constraints are lets you be a lot more intentional about it rather than, to Richard's earlier point, immediately gravitating to the extreme views, where most likely you don't need to fall into one of those. Figure out what you are actually solving for, and be very intentional about it.

Reisz: I did a podcast specifically talking about constraints being enabling, because it gives you a box to actually work in rather than being greenfield and just completely wide open.

Recorded at:

Jul 16, 2021
