
Scaling Cloud-Native Applications



Jim Walker, Yan Cui, Colin Breck, Liz Fong-Jones, and Wes Reisz look at lessons from scaling applications and things that may go wrong.


Jim Walker is VP product marketing at CockroachDB. Yan Cui is developer advocate at Lumigo. Colin Breck is Sr. staff software engineer at Tesla. Liz Fong-Jones is site reliability engineer at Honeycomb. Wes Reisz is QConSF chair, co-host of the InfoQ Podcast, & senior engineer at VMware. (moderator)

About the conference

InfoQ Live is a virtual event designed for you, the modern software practitioner. Take part in facilitated sessions with world-class practitioners. Hear from software leaders at our optional InfoQ Roundtables.


Reisz: What do you think of when I say cloud native and scaling? Do you think of tech, maybe things like Kubernetes, serverless? Or maybe you think about it in terms of architecture, microservices, and everything that entails with a CI/CD pipeline. Or maybe it's Twelve-Factor Apps, or maybe more generally, just an event-driven architecture. Or maybe it's not even the tech itself. Maybe when you hear scaling and cloud native, it's more about the cultural shifts that you need to embrace, things like DevOps, truly embracing DevOps, so you can get to things like continuous deployment and testing in production. Regardless of what comes to mind when you hear scaling cloud native apps, Here Be Dragons, or simply put, there are challenges and complexities ahead. We're going to dive in and explore this space in detail. We're going to talk about lessons, patterns. We're going to hit on war stories. We're going to talk about security, observability, and implications around data with cloud native apps.

Background, and Scaling With Cloud Native

My name is Wes Reisz. I'm a Platform Architect working on VMware Tanzu. I'm one of the co-hosts of the InfoQ podcast. In addition to the podcast, I'm a chair of the upcoming QCon Plus software conference, the purely online version of QCon, coming up this November. On the roundtable, we're joined by Jim Walker of Cockroach Labs, Yan Cui of Lumigo and The Burning Monk blog, Colin Breck of Tesla, and Liz Fong-Jones of Honeycomb. Our topic is scaling cloud native applications.

I want to ask you each to first introduce yourself. Tell us a little bit about the lens you're bringing to this discussion. Then answer the question: what do you think of when I say scaling and cloud native all in one sentence?

Walker: I'm originally a software engineer. I was one of the early adopters of BEA Tuxedo. Way back in '97, we were doing these kinds of distributed systems. My journey has really been on the marketing side. I may have started as a software engineer, but it's always been data and distributed systems. I moved into big data. I was at Talend. I was at Hortonworks. I was early days at CoreOS. Today I'm here at Cockroach Labs, so really the amalgamation of a lot of things.

When I think of cloud native, honestly, it was funny when you asked me this earlier, I was like, I think of the CNCF. I think about this community of great people that do some really cool things, and a lot of friends that I've made. Then I had to really think about, what does it mean for the practitioner? It's seemingly simple, but practically extremely complex and difficult to do. When I think cloud native, I think there are lots of vectors we have to think about. When I think about scale in particular, in cloud native, is it scale of compute? Is it scale of data? Is it scaling your operations and what that means for observability? Is it eliminating the complexities of scale? There are just so many different directions we can go in, and it all leads back to this: practically and pragmatically, it's extremely complex. I think we're trying to simplify things, and we are getting a lot better, we are actually seeing huge advances in simplification, but it's a complex world. That's the most generic answer.

Breck: I'm Colin. I spend my career developing software systems that interact with the physical world, so operational technology and industrial IoT. I work at Tesla at the moment, leading the cloud software organization for Tesla Energy. Building platforms focused on power generation, battery storage, vehicle charging, as well as grid services. This includes things like the software experience around supercharging, the virtual power plant program, Autobidder, and the Tesla mobile app, as well as other services.

When I think about scaling cloud native, I don't think of technologies, actually, I think about architectural patterns. I think about abstracting away the underlying compute, embracing failure, and the fact that that compute or a message or these kinds of things can disappear at any time. I think about the really fundamental difference between scaling stateful and so-called stateless services. That's a real big division there in terms of decision making and options in your architecture. Then in IoT specifically, I think a lot about architectures that model physical reality, and eventual consistency, failure, uncertainty, and the ease of modeling something and being able to scale it to millions is a real advantage in IoT.

Fong-Jones: I'm Liz Fong-Jones. I'm one of two Principal Developer Advocates now at Honeycomb. Before Honeycomb, I spent 11 years working at Google as a site reliability engineer. What I think about with regard to what cloud native is, I think it relates to two achievable practices, specifically around elasticity and around workload portability. That if you can move your application seamlessly between underlying hardware, if you can scale up on demand, and within seconds to minutes, not tens of minutes, I think that that is cloud native to me. Basically, there are a number of socio-technical things that you need to do in order to achieve that, but those best practices could shift over time. It's not tied to a specific implementation for me.

Cui: My name is Yan. I've been working as a software engineer for 15 years now, most of that as an AWS customer, building stuff for mobile games, social games, sports streaming, and other things on AWS. From my experience, when I think about cloud native, I think about using the managed services from the cloud, offloading as much responsibility to the cloud provider as you possibly can so that you work at a higher level of abstraction, where you can provide value to your own customers as an engineer for your own business.

In terms of scaling that cloud native application, I'm thinking about a lot of the challenges that come with it. How they necessitate a lot of the architectural patterns that I think Colin touched on, in terms of needing high availability, needing things like multi-region, and thinking about resilience and redundancy. Applying things like active-active patterns, so that if one region goes down, your application continues to run. There are also implications that come with it when you want to do something like that, in terms of your organization and the culture stuff that I think Wes mentioned. You need to have CI/CD. You need to have well-defined boundaries, so that different teams know what they're doing. Then you have ways to isolate those failures, so if one team messes up, it's not going to take down the whole system. There are other things around that which touch on infrastructure and on tooling, like observability. It all comes together into a big ball of complexity that you need to tackle when it comes to scaling those applications.

Reisz: When I was putting this together, I came up with so many questions, and the best way to describe it, I just came up with that Here Be Dragons. It resonated with me when I put this together.

Is Cloud Native Synonymous with Containers?

When we talk about cloud native, is it synonymous with containers?

Cui: From my perspective, I do see a lot of the dialogue around cloud native actually focused on containers, which to me is weird. Think about any animal or plant that you consider native to the U.S. Is the first thing that comes to mind that it can grow anywhere, or live anywhere? Probably not. There's something specific about the U.S. that these things are particularly well suited to, and so they can blossom there. When I think about containers, one of the first things that comes to mind is portability. You can take your workload, you can run it in your own data center, you can run it in different clouds, but that doesn't make it native to any cloud. When I think about cloud native, I'm thinking about the native services that allow you to extract maximum value from the cloud provider that you're using, as opposed to containers. I think containers are a great tool, but I don't think they define cloud native, at least in my opinion.

Fong-Jones: That's really interesting, because to me, I think about cloud native as a contrast to on-prem workloads. On-prem workloads that have been lifted and shifted to the cloud are not necessarily cloud native to me, because they don't have the benefits of scalability. They don't have the benefits of portability. I think the contrast is not portability between different cloud providers. It's the portability to take that same workload and stamp out a bunch of copies of it, for instance, between your dev and prod environments. To have that standardization, so you can take that same workload and run it with a slight tweak.

Do You Have To Be On A Cloud Provider, To Be Cloud Native?

Reisz: I hear Liz and Yan talking about cloud providers, in particular. To be cloud native, do you have to be on a cloud provider?

Breck: No, I think that goes back to those architectural principles. Actually, Erlang/OTP is the most cloud native you can get in some ways, and that's old news. That's abstracting away the underlying compute, embracing multicore, embracing distributed systems, embracing failure, those things. Especially in IoT, in my world, the edge becomes a really important part of the cloud native experience. If, from the edge, the cloud is just another API to throw your data at, you're not going to develop great products. If the edge becomes an extension of this cloud native experience, you can develop really good platforms. If you look at the IoT platforms from the major cloud providers, that's the direction they've gone. There's an edge platform that marries with what they have in the cloud. I think that cloud native thinking can extend beyond a cloud provider into your own data center, or out to the edge, or whatever you're doing.

Reisz: I loved what Yan said about portability, because when I think cloud native, I often tend to think about containers. Yan's right, it's more about the portability. It's about being able to move things. It's not necessarily a specific technology. That may be hot today, may be popular today, but it's really the portability.

Walker: Don't just lift and shift. I talk to customers all the time about this: you need to move and improve, because simply taking something and running it on new infrastructure is not cloud native. You're running it somewhere else on somebody else's server. That's all that is. Containers allow us to do that. There's a fundamentally different approach to the way you build software when you're cloud native. It's not just all the tools around it. You have to take into consideration, at least in my opinion, the physical location of what you're doing. I think that's this distributed mindset. Moving to this new world has to take that into consideration. We deal with this all the time. The speed of light is no joke. When you start talking about some of the things you were talking about, Colin, like eventual consistency, and are things active-active? The problem we're dealing with here is that software engineering has advanced to the point where we've caught up with the speed of light. To be truly scalable across the globe, in real time, whatever that means for you, I think we're running into those things. It's not just simply moving a container into the cloud. It's rethinking how it works inside that piece of software. I think that's the stuff that gets really challenging for people.

Thinking Security with Cloud Native

Reisz: We've talked about some different ways of thinking when we talk cloud native. I want to talk a little bit about security. How do you have to think about security when we're talking about cloud native?

Fong-Jones: I think that security can be a challenge when you introduce new control planes, when you introduce new ways to interact with your workloads. For instance, Ian Coldwater gives these amazing talks about how to break out of one container into another container to take over the Kubernetes control plane. These are new threat models that are introduced when you have all these things that are meant to be flying in formation. I think there are things you need to worry about around authorization. You can no longer have this idea that if it's in my cluster, I trust it; if it can talk to me, I trust it. You have to start adopting things like gRPC, security certificates. When you want to level up, you have to level up all the practices, including your security practices. That being said, you don't have to use Kubernetes in order to be cloud native. If you don't adopt Kubernetes, you can still use some of the more tried and true practices, and you don't have to stay quite as much on the cutting edge.

Walker: When I think about security and cloud native, all of the core principles that we've been following for security for years still apply. It's still AAA. It's all this stuff. I think this layer of zero trust, and this whole movement towards zero trust between everything, thinking about it in that way, is what is allowing people to identify these types of threats, and to actually figure out how to contend with them. The threat vectors have gone up exponentially. I think zero trust and that whole movement is wildly interesting.

Reisz: It's with the supply chain, too.

Breck: I think part of your cloud native scaling strategy is adopting managed services or fundamental platforms from the cloud providers. I think the security story gets a lot better. I trust Microsoft and Amazon to be better at threat modeling, and patching, and all those kinds of things than most organizations. The more you can turn over to them, I think the security story in some ways gets better.

Walker: Do you remember back in the day when you didn't want to go to the public cloud provider because you were worried about their security? That's a serious change.

Breck: I agree. I think that's completely inverted now.

Cui: I've been talking to quite a few customers, because my focus is on serverless. That's one of the big selling points of serverless to large enterprise companies: security. When you offload infrastructure security to AWS, or to Google, or to Microsoft, you essentially eliminate a whole massively complex class of security problems around infrastructure, around your operating system, around patching and making sure everything is nice and secure. That is no longer your problem. You don't have to worry about VM security. You don't have to worry about the security of the virtualization layer, or the operating system, just the application-level security, which is still a lot of things to do. It's a much smaller attack surface that you have to worry about.

When you're using managed services, you're also much more shifted towards this mindset of zero trust, because for every service you want to talk to, you have to authenticate yourself somehow. I can't talk to DynamoDB anonymously. That just doesn't happen. Everything is based on the same consistent authentication model. You can easily use the same model in your own application as well, which makes security much more interesting and consistent for me. I think that has been one of the big reasons a lot of enterprise customers, especially banking and the financial institutions, are moving to serverless: because of security.
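The consistent authentication model Yan describes can be made concrete: every AWS API call, whatever the service, is signed with a key derived from the caller's credentials. Below is a minimal sketch of the Signature Version 4 key derivation from the public spec; the input values are illustrative only.

```python
import hashlib
import hmac


def _hmac_sha256(key: bytes, msg: str) -> bytes:
    return hmac.new(key, msg.encode("utf-8"), hashlib.sha256).digest()


def derive_signing_key(secret_key: str, date: str, region: str, service: str) -> bytes:
    """AWS SigV4 key derivation: date -> region -> service -> signing key.

    The same chain signs a DynamoDB call, an S3 call, or a Lambda invoke --
    there is no unauthenticated path to a managed service.
    """
    k_date = _hmac_sha256(("AWS4" + secret_key).encode("utf-8"), date)
    k_region = _hmac_sha256(k_date, region)
    k_service = _hmac_sha256(k_region, service)
    return _hmac_sha256(k_service, "aws4_request")
```

In a real SDK this key then signs a canonical request; the point here is only that an identity is attached to every single call, which is what makes the zero-trust posture the default rather than something you bolt on.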

Cloud Native and Observability

Reisz: How does the observability story change, or does it, when we start talking cloud native?

Fong-Jones: I think that observability becomes more of a requirement as you adopt cloud native technologies, because of that decoupling from hardware, because of the fact that you're potentially running many more microservices. I think all those complex interactions make it so that you can no longer achieve observability with the previous set of tools that we all used to use, and that you have to think about your ability to examine unknown interactions. Things that you didn't anticipate were going to happen and things that were complex interactions between services or between your users and your services.

Reisz: How do we deal with that, particularly when we start thinking IoT, with thousands of devices and digital twins? We start talking about thousands of potential things. How do we really get our minds around doing some of the stuff that Liz just said, with cloud native and observability?

Breck: I don't think we quite have, as an industry. Actually, I wrote an article along these lines: at huge scale, we need to get away from that feeling of, I want to trace every event through the system, or I want to know what's happening with every customer. Discrete becomes more continuous at huge scale. How do we take continuous signals and use them to tell what's going on in our systems? To me that looks like systems engineering or process engineering. You don't control a distillation column by looking at every molecule; you control it with these global signals around temperature and flow rates, and those kinds of things. We use a lot of those techniques for streaming systems in IoT at scale to tell what's going on, not looking at discrete events.
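One way to read Colin's point in code: treat telemetry as a continuous signal and reason about its trend, not about individual events. A minimal sketch, with the smoothing factor as an assumed tuning value:

```python
def ema(samples, alpha=0.2):
    """Exponentially weighted moving average over a stream of samples.

    Per-event noise is damped; what survives is the kind of continuous
    signal (temperature, flow rate, error rate) you can alert on.
    """
    avg = None
    smoothed = []
    for value in samples:
        avg = value if avg is None else alpha * value + (1 - alpha) * avg
        smoothed.append(avg)
    return smoothed
```

The same idea scales to millions of devices: you keep one rolling number per signal rather than a record per event, and the alerting question becomes "did the trend move?" instead of "what did this one device just say?".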

Walker: It's signal and noise here. There are reasons why Honeycomb and some of the other people out there doing this have a business: there's a lot of information that we can collect. In fact, I love what's going on with tracing, and I find it extremely interesting. I'm really excited about OpenTelemetry. The OpenTelemetry project is right now the number two most active in the CNCF. It's awesome, because it's important. It's not just the tracing side of it. It's the telemetry coming off of each one of these machines as well. To me, it's one of the most interesting movements going on in the cloud native space right now. In the database, we use this all the time. We use tracing, because what is a query when you're going across multiple different nodes, and you're doing some join across huge, massive regions? How do we help our customers troubleshoot a query, a statement that was really simple in a single instance of Postgres? Now it's spread out all over the whole planet. What's worse, there is network lag between this node and that node. Why? Between this region and that region something's gone, maybe a piece of hardware. The telemetry that's coming off of these things is awesome. It's hard to find the signal.

Challenges with Monitoring Everything

Fong-Jones: I want to back up a step here, and let's just talk through. Some people might have the naïve thought, why don't we just monitor everything? Why don't you monitor the network lag between all of these things? Why don't you alert on everything? I think that that's an important subject that we should also talk about when we talk about this, is like, why can't we just monitor everything?

Reisz: Why can't we? Why can't we just monitor all the things?

Walker: We could collect the data, it's just too difficult to identify the issues. We're right back in the same problem. This has been our problem for a long time, but there's ways to do it.

Fong-Jones: Yes. I think that that goes to this premise that if you start from the top of your system, where your user traffic is coming in, and you're able to systematically investigate, ok, it got slow here, then it got slow here, until you find where the problem is. You're going to have a lot more luck doing that, than if you start from a million things screaming at you, because this one node is unhealthy, and the network lag has temporarily spiked for five seconds between these two data centers. That's just too noisy. You're just never going to be able to correlate it. Having the data isn't the only thing that matters, it also matters to alert on the right signals of user pain, and also to have that ability to trace things all the way through the system. I think that's what you're getting at, Jim, when you were like, we can figure out that the issue with this whole query was because of this network issue. You wouldn't otherwise have alerted on that network issue problem.

Walker: We're actually looking into the query itself. We actually have telemetry coming out of the execution of the statements. It's not just the physical and everything we think about network, it's actually inside the software. This is what I meant earlier when I was talking about the distributed mindset does not extend just to operations and hardware and everything that's around these cloud native systems. It's the way you think about the software you're building itself. I've seen that firsthand with the engineers at Cockroach Labs. I always say that, I think the CockroachDB code base is almost a PhD in a distributed system, because some of the problems that they're challenged with, in terms of the speed of light, in terms of these sorts of things, and observability, it's just tremendous at this scale. We're just scratching the surface and trying to figure these things out, I think as an overall community, in my opinion. I think we're just getting there.

Cui: There's certainly something to the fact that we're building resilient, self-healing systems as well. If the network is slow or some node is down, your system should just handle it and recover. There shouldn't be any manual intervention. Why would you be monitoring those things when all you care about is user experience? If it doesn't impact the user, you probably shouldn't be waking up in the morning to have a look at it.


Reisz: Jim, you mentioned OpenTelemetry and literally the room perked up. There's people on here that are not familiar with OpenTelemetry. Give us the two-minute elevator pitch, why should everybody care? What is this? I've never heard of it.

Fong-Jones: OpenTelemetry is a vendor-neutral standard for instrumenting and sending telemetry out of your code so that you can then analyze it in some tool in order to achieve observability. The formats of data that OpenTelemetry supports include traces and metrics. There are SDKs in almost every popular language. As Jim mentioned, we are the number two project in the CNCF right now, in terms of number of contributors. We're a collaboration of a bunch of end users like Shopify and DoorDash, as well as a number of vendors such as Honeycomb and Lightstep. Pretty much the entire observability community has come together to say: we are building a set of standards so that you only need to instrument your code once. That way, even if you decide you want to change vendors, it provides vendor portability, which is really exciting, or lets people grow from Jaeger on to a more sophisticated solution.
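That vendor portability is easiest to see in an OpenTelemetry Collector pipeline: applications send OTLP to the Collector, and switching vendors is a one-line change to the exporter endpoint. A minimal sketch of such a config, with a placeholder endpoint rather than a real vendor:

```yaml
receivers:
  otlp:                  # apps send traces in the vendor-neutral OTLP format
    protocols:
      grpc:
      http:

exporters:
  otlphttp:
    endpoint: https://telemetry.example.com  # swap vendors by changing this line

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlphttp]
```

The instrumentation in the application never changes; only the Collector's exporter section does.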

Walker: I just wanted to highlight one thing: it is practitioners as well, not just the vendors, which is such a key part of that project. DoorDash was the team that pushed us towards it, because they get this problem at scale. It's a great project, because it has a lot of practitioners.

Challenges with Observability in Serverless

Reisz: Talk about challenges with observability in the serverless space. What are the problems and what are the realities of it today?

Cui: To my mind, it's a very different execution paradigm compared to when you're running containers or VMs, in that you don't have access to a server. We used to just install an agent on a box and be done; you can't do that anymore. With those monitoring agents, people talk about how much overhead they add to your machine, because any CPU cycle you spend collecting telemetry is a CPU cycle you're not using to process user events. When it comes to things like Lambda functions, you don't have to worry about that anymore, because an invocation handles one user request at a time. You also just have more event-driven systems and architectures all over the place: a file lands in S3, which triggers a Lambda function that picks it up, processes the event, and writes something to DynamoDB. That triggers a stream event, which triggers another Lambda function. There are a lot more of these distributed-system type problems that you have to figure out when it comes to debugging. You have to be able to put together a picture, end-to-end, of what actually happens. Why does something bomb at the fourth function in a call chain? It may all trace back to the user request that came into the API in the first place with an invalid payload or something. In terms of debugging problems, it gets more interesting. The fact that the execution model is quite different also puts a spin on how we collect telemetry in the first place. A lot of that has been improving and getting better. We can now run something similar to a container sidecar in the Lambda function, but it still involves a bit more complexity. It's an interesting challenge because there's a different set of constraints you have to work with.

Fong-Jones: The OTel Lambda extension is awesome.

Cui: Yes, it makes life a lot easier. Do you see the fact that now you can actually run after the user function returns?

Fong-Jones: Yes.

Cui: That was so good.

Observability in the Device and IoT Space

Reisz: Any tips, tricks, and thoughts on the device or IoT space, when it comes to observability?

Breck: It's hard. Even if you have the best intentions in IoT, you end up with really long tails of things that you may have improved, but there are still four devices out there on some old firmware, and the tails are just really long. IoT is so diverse in terms of moving things forward. I think the hardest problem in IoT is actually managing assets. It plays into scaling. It plays into observability. It plays into security, and ultimately building a great product. I don't think there are a lot of great patterns or tooling for managing millions of assets. I'm sure that Apple and Google have great ways of managing millions of phones, but those types of platforms are not really available to others. I think there's a huge opportunity there.

The Challenges with Cloud Native and Data

Reisz: When we talk about cloud native and data, what are the challenges that come to mind for you? What are some of the things that you think about?

Walker: Cloud native and data can be really simple if you're in a single region, or a single AZ. Prop up an instance of a database, preferably Cockroach, and you're going to get active-active, all these things. It's when we start thinking about scale across regions. What is a region but another cloud provider, in my opinion? I think right now people are trying to say, you can do Kubernetes and you can have it in multiple regions, but what happens with data at that point? For me, it's a very different lens because I'm at a database company. We believe in federating data above the Kubernetes clusters. I don't care if it's three cloud providers, multi-cloud, on-prem, we don't care. We federate at a different layer, and it's at the data layer.

The biggest challenge I think we have with scale is thinking about regions as clouds, thinking about regions as wholly owned different environments. When we get to the point where it is truly a serverless world, and you're running in lots of different environments, what is that going to mean for the database? Colin, you touched on this: what's a stateful versus a stateless application? We start talking about apps, and for most developers I know, there's a database behind a large amount of our applications. It's a big challenge. Multi-region is a problem, first, because of the speed of light, again. There are network hops between areas. Once you have data moving between New York and Sydney, what does that mean for you? What happens when there are two transactions happening at the same time, one in New York and one in Sydney, who wins? Understanding Raft and understanding MVCC, these core principles behind distributed systems that basically guarantee transactional consistency, becomes really important and really difficult. I think the speed of light and consistency of data are two huge challenges when we think about these things.

Stateful and Stateless

Reisz: Colin, in your introduction, you focused on stateful and stateless a bit. Why was that one of the things you drew attention to?

Breck: Services that have state are just fundamentally harder to scale. Often you're joining data there, and you need to pay attention to how joins work out, especially in eventually consistent systems. In a distributed system, if you want to have multi-region failover and you're trying to replicate state across regions, that's just really difficult. In IoT or OT, you're ultimately controlling things in the physical world too, so the decisions you're making from that consistency of data also play into the model. It's just really hard. I think platforms that have a good separation between compute and storage often end up with better models. Another problem that comes up in OT and IT all the time is that you can't take downtime. You need to migrate systems or upgrade systems or add new functionality while you're still flying the airplane. If you have a good separation between compute and state, and that offers you some migration patterns or upgrade patterns, that's also a real advantage.

War Stories

Reisz: We all have some great war stories about scale in cloud native, anybody have anything in particular that comes to mind that you might want to share?

Fong-Jones: The one hilarious one that I wanted to share is the $15,000 Amazon Lambda bill. It turns out that sometimes elastic scaling gets a little bit too elastic and gets out of control.

Cui: I've seen a few. The surprises are things like infinite recursion. What happens a lot is that you trigger a Lambda function with a file dropping into S3, and then you process the data. You write into the same bucket in a different folder, but your trigger is pointing at the same function, so it just keeps running. That can fan out pretty quickly as well. Those are problems that have hit a few people in the past.
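A common guard against that loop is sketched below: scope the trigger to an input prefix and write output under a different prefix the function ignores. The event shape follows the standard S3 notification format; the prefix names are hypothetical.

```python
INPUT_PREFIX = "incoming/"     # the S3 trigger should be scoped to this prefix
OUTPUT_PREFIX = "processed/"   # results land here, where they cannot re-trigger us


def should_process(key: str) -> bool:
    """Process only objects under the input prefix, never our own output."""
    return key.startswith(INPUT_PREFIX)


def handler(event, context=None):
    """Lambda-style handler over S3 notification records."""
    written = []
    for record in event.get("Records", []):
        key = record["s3"]["object"]["key"]
        if not should_process(key):
            continue  # defense in depth: skip anything outside the input prefix
        # ... download, transform, upload would go here ...
        written.append(OUTPUT_PREFIX + key[len(INPUT_PREFIX):])
    return written
```

Scoping the S3 event notification itself to the input prefix is the first line of defense; the in-code check covers the case where the trigger configuration is later widened by mistake.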

In terms of cloud native war stories, I don't have one myself. This didn't come from me. I heard it from Corey Quinn, who was talking about a company that was spending something in the order of hundreds of thousands of dollars a month just on network bandwidth, because they were using the managed NAT gateway that you get out of the box from AWS. That is charged based on the number of gigabytes processed, so when you are sending terabytes of data over the network every day, it can get really expensive really quickly. I think he managed to cut it down to almost nothing by just using a bunch of NAT instances that the client runs themselves instead. Those problems pop up regularly, way more than any other problems people run into with serverless, just because cost management in the cloud is really complex. Especially once you factor in all the different layers that play into it, like data transfer, networking, picking the right instance sizes, and things like that, it gets really messy really quickly.

Resiliency in Event-Driven Systems

Reisz: There's a question here specifically about Pub/Sub. When we talk about resiliency in event-driven systems, how do you think about recovery? How do you think about making them resilient? How do you think about just recovery modes, when there's issues that happen? What comes to mind? What do you think about? What are some quick hits for someone who may be struggling with an event-driven system? How do you start to think about this?

Breck: If I constrain it to event driven, and not think too much about IoT, you do have to have a sense of when devices, or customers, or whatever it is, talk to you. Because if it's purely Pub/Sub or purely event driven, and you haven't heard from something in five minutes, is it offline? Is it in a reboot loop? Is it broken, or does it only talk to you every 10 minutes or something like that? You need to have a sense of the data model built in when you're in a pure Pub/Sub or event-based system.

Walker: I would extend that a little bit, and just say it's not active-passive. It's not one-to-one replication. It's collaboration and coordination. If you're going to get any resilience out of a distributed system, it's about coordination. Don't think about one-to-one, think about many nodes coordinating and reaching consistency. That's the trick, and that's in your software itself.

Breck: Especially when it's interacting with the physical world, just say what you know, don't try and make things up. Don't try and infer from data, or a Pub/Sub system, or a join that didn't quite work, something like that. Don't infer, just present what you know.

Reisz: A lot of the time devices are offline, and you have to infer to be able to get a complete picture. Particularly when we're talking not at a macro level but at a micro level, you're going to get an incomplete picture. You're not going to get the full story. How do you deal with that?

Breck: The example there would be, maybe you haven't heard from a device in a while. You can say this is the last state that was reported, but it's also stale. I would have expected to hear from this device a minute ago, so here's the last state that I know. Don't draw a line on a graph extrapolating from that. Draw that in some shade of gray that says, here's what I know, but take it with a grain of salt because it's not up to date.

Walker: You're right, but it comes back to your workload and what you want to accomplish. Maybe you can infer in some correct way; it depends on what you're trying to accomplish. If you have to have consistency, like it's going to cause a nuclear missile to go up or something, you can't do that. There's a range of what you want to accomplish, based on the value of the data, that you have to interrogate before you implement the system. Sometimes it can cope with a five-minute delay, and that's ok. It depends on how mission critical it is. If it's a bank account, no, you need quorum. It doesn't work otherwise. It really comes down to that question.

Key Takeaways on Scaling an Event-Driven System

Reisz: I want to give everybody an opportunity to, your just one more thing, your big takeaway. I want to give everyone an opportunity to give some food for thought for the developers and the leadership that's in the room on what it means to scale an event-driven system.

Cui: You need to have good practices before you even attempt this kind of system, because they are quite complex. There's no magic bullet where you can just use this tool and you'll be good. No, you need to scale up your team. You need to scale up the organization in how you do things: having good CI/CD, security models, and everything else around it.

Fong-Jones: A provocative statement: everyone should have the privilege of being able to deploy to production within an hour. It doesn't matter if you have all the elasticity in the world and all the workload portability, if you don't have the ability to actually make changes to your code, deploy it, and get it out there.

Walker: I'll go old school and give it the framework of people, process, tech. A distributed mindset will go a long way. The best practices around these things are critical to get right. You can go adopt all these cloud native technologies, but if you don't understand what an SRE does, or what that function is actually there for, you're going to gyrate for a while and not really get far. I think it's really important to choose the right technology. Some tech claims to be cloud native, but it's a lift and shift. It's a move and improve. I think the technologies out there that are interesting right now are those that have been re-architected for this world. This is a fundamentally different world we live in. I'm a startup-aholic, and there are so many really interesting startups right now, because they are re-architecting for the cloud. I think that's the stuff that gets really interesting. People, process, tech is always my lens for these things.

Breck: Don't get too hung up on technologies; focus more on the fundamental architectural principles. I actually like what the Reactive Foundation is doing in that regard: it's technology agnostic but focuses on the principles. It's really a fundamental shift. Some of the new service platforms, or what the cloud providers are doing, actually encourage you in a much better direction.



Recorded at: Jan 27, 2022