
Building Resilient Serverless Systems



John Chapin explains how to use serverless technologies and an infrastructure-as-code approach to architect, build, and operate large-scale systems that are resilient to vendor failures, even while taking advantage of fully managed vendor services and platforms.


John Chapin is a co-founder of Symphonia, an expert Serverless and Cloud Technology Consultancy based in NYC. He has over 15 years of experience as a technical executive and senior engineer. He was previously VP Engineering, Core Services & Data Science at Intent Media, where he helped teams transform how they delivered business value through Serverless technology and Agile practices.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.


Chapin: My name is John Chapin, I'm a partner at a very small consultancy called Symphonia. We do serverless and cloud architecture consulting. We're based here in New York. If you want to talk more about that, come find me after. I'm happy to chat more about that, but what we're here for today is to talk about building resilient serverless systems. Here's what we're going to cover. We're going to review a little bit about what we, as Symphonia, think serverless is. We're going to talk about resiliency, both in terms of applications and in terms of the vendor systems that we rely on, we're going to give a brief live demo, so I'll be tempting the demo gods today, and then we'll hopefully have some time for some discussion and Q&A at the end. If for whatever reason we don't have time for that, I will be out in the hallway right after the session happy to answer all of your questions.

What is Serverless?

What is serverless? The short answer is go read the little report that we put together. We did this with O'Reilly. There's a link down at the bottom here. This link is in the slides, and you can download the free PDF. We think serverless is a combination of functions as a service, so things that you all have heard of, like AWS Lambda, Auth0 Webtask, Azure Functions, Google Cloud Functions, etc., and also backends as a service. These are things like Auth0 for authentication, DynamoDB, Firebase, Parse - it no longer exists, but that would have been one of these - S3 even, SQS, SNS, Kinesis, stuff like this. Serverless is a combination of functions as a service, these little snippets of business logic that you upload to a platform, and backends as a service. These are more complete capabilities that you make use of.

What are some attributes of serverless that are common across both of these things? We cover this in detail in this report, so I encourage you to check that out. There's no managing of hosts or processes, and these services are self-provisioning and auto-scaling. Their costs are based on precise usage, and the really important thing that we're going to cover today related to that is that when you're not using them, the cost is zero, or very close to zero. What we'll also cover today is that they have this idea of implicit high availability, and I'll talk more about what that means a little later on.

What are some of the benefits of serverless? You've heard this from a bunch of different places in the last couple of years. These are the benefits of the cloud, but just taken one more step. Further reduced total cost of ownership, much more flexibility in scaling, and much shorter lead time for rolling out new things, much lower cost of experimentation. If I want to put up a new serverless application with some Lambdas and a DynamoDB table, I can get that up and running in a couple of hours. If I don't like it, I can tear it down in two minutes, and I haven't invested all of this money in this persistent infrastructure.

With all of these benefits though, we give up some things as well. One thing we call out in this report, and this is the thing we're going to talk about a lot today, is this idea of loss of control. The more we use these vendor-managed services, the more we're giving control over to vendors like Amazon, like Microsoft, like Google, in return for them running these services on our behalf. We're sort of saying, "Ok, Amazon, I think you're going to do a much better job of running a functions as a service platform than John Chapin will."

With that loss of control though, we're also giving up our ability to configure these things at a really granular level so however that Lambda service behaves, we have pretty limited number of configuration options for how it works for us. We have far fewer opportunities for optimization, especially optimization that might be specifically related to our business case or our use case. Again, with Lambda, we have one knob to turn. We get to adjust the memory and affect the performance, and we get to optimize how we structure our code and things like that, but we don't get to go in there and tune Linux kernel parameters or things like that.

The last part of this loss of control is this idea of hands-off issue resolution. Who here remembers when S3 went down a few years ago? Who here had a direct part in fixing that? One person here...lying. When S3 goes down, this guy right here can fix it. S3 went down, we all built systems that relied on that or relied on systems that relied on S3, and when it was down, we couldn't do anything to bring it back up other than ring the pagers for our account managers and whatnot. We couldn't proactively take any action to bring that back. We had to trust the vendor to get that back up and running. We lose some control when we're using these vendor-managed services, these serverless services, or as some people like to call them now, servicefull services.


With all that in mind - we've established what is serverless, what are some of the benefits and tradeoffs - let's talk about resiliency. This quote, "Failures are a given and everything will eventually fail over time." This is a quote from Werner Vogels, who's the CTO of Amazon, and he has this great blog post, "10 Lessons from 10 Years of AWS." This link is on the slides as well. I highly encourage you to go read this. He's basically talking about 10 years of running AWS at ever-increasing scale, and what's happening as they've been doing that.

What does he say on embracing failure? He says systems will fail, and at scale, systems will fail a lot. What they do at AWS, is embrace failure as a natural occurrence. We know things are going to fail, so we're just going to accept that fact. They try to architect their systems to limit the blast radius of failures, and we'll see what some of those blast radiuses look like. They try as much as possible to keep operating, and they try to recover quickly through automation.

I throw this up here. A lot of people use this as, "We're doing something really bad, and this is not where we want to be." This is actually the current state of the cloud right now. This should say, "This is normal." Something is always failing, and at scale, some things are always failing a lot. I used to joke that this was a webcam view of us-east-1 but nobody laughed. Thank you.

We talked about what is serverless, we talked about resiliency, what Werner Vogels had to say about resiliency, so what do failures in serverless land look like? Serverless or servicefull architectures are all about using these vendor-managed services. In doing that, we have two broad classes of failures. First, we have the application-level failures. Things like, "Ok, we shipped some bad code," or, "We misconfigured our cloud infrastructure," or, "We did something to cause our application to fail in some way." These problems were caused by us, and they can also be resolved by us. We can fix our code, we can redo our CloudFormation or Terraform template, or whatever the case may be.

There's that class of failures, and then there are the service failures. These are the things like S3 going down, for example. From our customer's perspective, those failures are still our problem. If our customers can't get to our application, or our website, or whatever, they still blame us. The resolution, like we talked about in the first section, is not within our grasp. We have to rely on our vendor to resolve that, so, again, application failures and service failures.

What can we do? Is there anything we actually can do when those vendor-managed services fail? This presentation is really about the answer to that question, being yes. What we're going to try to do is mitigate these large-scale vendor failures through architecture. We don't have any control of resolving the acute vendor failures. Ok, S3 goes down, none of us except for that guy in the third row can go in and fix S3 specifically, so we have to plan for failure. We have to architect and build our applications to be resilient.

The way we're going to do that at a large-scale is we're actually going to take advantage of the vendor-designed isolation mechanisms. Werner Vogels was saying, "You've got to limit the blast radius for failures." They document those blast radiuses, and they have isolation mechanisms in place - Amazon, Microsoft, Google, all of them. We're going to take advantage of those isolation mechanisms - in this case, it'll be AWS regions - and we're going to take advantage of the vendor services that are architected and built specifically to work across those regions to help us keep our application up, even when one of those regions goes down.

This last bullet point, if I only had one thing to say, and this was an AWS-specific presentation - it's sort of an AWS-specific presentation - it would be, go read the Well-Architected Framework Reliability Pillar. Amazon, and also Microsoft, and Google, they document how to architect applications to take advantage of their isolation mechanisms. They give you the answers right here, so I highly encourage you to go check that out. If you're on one of the other clouds, seek out the same information there.

Let's talk really quickly about AWS isolation mechanisms. I'm going over this because we're going to see it later in the architecture diagrams and in the demo. These big circles are AWS regions, and within each geographic region is a number of, you can call them logical data centers. We have these availability zones, is what AWS calls them. Each availability zone is its own isolated little unit of maybe power, and network, and other things that a data center needs, other things that servers need, and the idea is you architect your application. If you're running something like on EC2, for example, you might have several EC2 virtual instances in us-east-1a, you might have several in us-east-1b, and the idea is you would have a load balancer in front of those. If one availability zone goes down, you're still up and running in another availability zone.

This is the classic regional high availability that AWS gives you. This is services running across multiple availability zones, and one quick way you can tell that you need to do this explicitly is if the service you're using addresses resources, and it's addressing them down to an availability zone level. When you spin up an EC2 server, it says, "What availability zone do you want to put this in?" You know that's your clue right off the bat, "Ok, if I want to be resilient to an availability zone failure, I need to have more than one of these in different zones," versus services like Lambda, services like Dynamo that you address on a regional level. We'll talk about how we use those, but I just wanted to go over this real quick.

Serverless resiliency on AWS - we talked about regional high-availability, these are services running across multiple AZs in one region. EC2, that's our problem, we have to architect our applications to do that. With serverless, again, Lambda, Dynamo, S3, SNS, SQS, AWS handles that for us. We say, "I just want Lambda in us-east-1, in one region," or," I want Lambda in eu-west." AWS is handling what happens if an availability zone within that region goes down. We don't have to worry about that.

To take another step up, global high-availability, these are services running across multiple regions, and we can actually architect our application at this level. We can architect our application to take advantage of multiple regions and have global high-availability. If an entire region of us-east-1 goes down, we can stay up and running. I mentioned it before, but that serverless cost model is one of the huge advantages. There are some other advantages to that serverless model when we're trying to build these global applications. This idea of event-driven serverless systems, things like Lambda, things like API Gateway, these are event-driven by nature.

When you combine that with this idea of externalized state in something like DynamoDB - and I think several of the other serverless presentations at QCon this year are talking about this idea of event-driven serverless systems or serverless microservices. What that means is we have little or no data in flight when a failure does occur, and that data's persisted to highly reliable stores. If we need to switch from one region to another, it's really seamless. We don't have anything in flight that we need to figure out what to do with. We can just switch from one to the other.
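To make the externalized-state idea concrete, here is a minimal Python sketch of a request handler that keeps nothing in memory between calls. It is an illustration under stated assumptions, not the demo's actual code: `InMemoryStore` is a hypothetical stand-in for a durable table such as DynamoDB, and `handle_post_message` and its event fields are invented for this example.

```python
import time
import uuid
from dataclasses import dataclass, field


@dataclass
class InMemoryStore:
    """Hypothetical stand-in for a durable table such as DynamoDB."""
    items: dict = field(default_factory=dict)

    def put(self, item: dict) -> None:
        self.items[item["id"]] = item


def handle_post_message(event: dict, store: InMemoryStore, region: str) -> dict:
    """Handle one request; all state goes to the store, none stays in the function."""
    item = {
        "id": event.get("id") or str(uuid.uuid4()),
        "text": event["text"],
        "source_region": region,      # which region accepted the write
        "received_at": time.time(),
    }
    store.put(item)
    return {"statusCode": 200, "id": item["id"], "source_region": region}


# One shared store here simulates a table replicated across regions.
shared_table = InMemoryStore()
first = handle_post_message({"id": "m1", "text": "Hello QCon"}, shared_table, "us-east-1")
second = handle_post_message({"id": "m2", "text": "Hello from Denmark"}, shared_table, "eu-west-2")
```

Because the handler holds no state of its own, either region can serve the next request, which is what makes switching regions straightforward.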

Another property of these serverless systems, and it's surprising that it comes out of this, is that serverless systems tend to be continuously deployed because they're so damn hard to manage otherwise. At least that's something that we see. When you have that continuous deployment, you're specifying all of your infrastructure in code or configuration. You don't have any of this persistent infrastructure to rehydrate. If you need to have that running in many regions, and we'll see this in the demo, it's very straightforward to do, and oftentimes, that just comes naturally.
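One way to picture "the same infrastructure in many regions" is a deployment script that applies one template to a list of regions. The sketch below only builds the AWS CLI commands and prints them; it does not run anything. The template file name `packaged.yaml` and the stack name `global-chat-api` are hypothetical, and the sketch assumes the template has already been packaged for each target region.

```python
from __future__ import annotations

import shlex

# Regions used in the talk's demo.
REGIONS = ["us-east-1", "eu-west-2"]


def deploy_commands(template_file: str, stack_name: str, regions: list[str]) -> list[list[str]]:
    """Build one `aws cloudformation deploy` command per region from the same template."""
    commands = []
    for region in regions:
        commands.append([
            "aws", "cloudformation", "deploy",
            "--template-file", template_file,
            "--stack-name", stack_name,
            "--region", region,
            "--capabilities", "CAPABILITY_IAM",
        ])
    return commands


# Hypothetical file and stack names, for illustration only.
commands = deploy_commands("packaged.yaml", "global-chat-api", REGIONS)
for command in commands:
    print(shlex.join(command))
```

Adding a region is then a one-line change to the region list, which is the incremental-work argument in a nutshell.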

We talked about resiliency, and what we actually also get out of this style of architecture is not just resiliency, we actually get, again, this distributed global application that's better for our users and may not actually cost a lot more than running in a single region. We'll see this in the demo, and we'll see this in the architecture diagram, but that regional infrastructure is closer to your regional user. If we have our application deployed in us-east-1 and in eu-west-2, we can actually route users in those regions to those two instances of the application.

Because serverless is pay per request, those total costs are similar. If half of our users are going to us-east-1 and half are going to eu-west-2, our total bill is the same as if all of them were coming to us-east-1 in that first region. That pay per request total costs are about the same. Infrastructure-as-code minimizes the incremental work in deploying to that new region, so if we decide, "We want to spin up in Asia-Pacific," we use that same configuration template, send it to Asia-Pacific, and we have the capability to do automated multi-region deployments. I've got a link at the end of this presentation for some work that Symphonia did providing an example of how to do that, but that makes it easy to keep this multi-region infrastructure up-to-date. The premise of this talk really is that the nature of serverless systems makes it possible to architect for resiliency to vendor failures. Not easy, but possible.
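The pay-per-request argument can be checked with simple arithmetic. The sketch below uses a made-up flat price and assumes it is the same in every region; it ignores cross-region replication and data transfer charges, which is why the talk says the totals are "about" the same rather than identical.

```python
# Hypothetical flat price in dollars, assumed identical in every region.
PRICE_PER_MILLION_REQUESTS = 0.20


def monthly_cost(requests_by_region: dict) -> float:
    """Total request cost when billing depends only on the number of requests."""
    total_requests = sum(requests_by_region.values())
    return total_requests / 1_000_000 * PRICE_PER_MILLION_REQUESTS


# Same traffic, served from one region or split across two.
single_region = monthly_cost({"us-east-1": 10_000_000})
two_regions = monthly_cost({"us-east-1": 5_000_000, "eu-west-2": 5_000_000})
```

Under these assumptions the two totals are equal, because the bill follows the request count rather than the number of regions that are standing by.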


Let's talk about the demo. We're going to build a global highly-available API, and I say we're going to build it - I'm going to show you, and make available the source code for you so you can take this home and play with it yourself. We're going to build a global highly-available API. The source code is there, and the slides will be distributed after. It's a SAM - Serverless Application Model - template, some Lambda code, and what we'll call a basic front-end written in Elm. What does the architecture look like, though?

Here's what this application is going to look like from the architecture side. Imagine I'm actually standing all the way to the left with a web browser, or a phone, or something like that, and I'm using the front-end app, and it's making a network request. The first thing it's going to do is say, "I need to take this DNS name and translate that into an IP address. I want to hit" It sends that request, that request is handled by Route 53, this is AWS's global DNS service. Route 53 looks at where I'm coming from on the network and returns an IP address that's closest to me based on what's available for this application. If I'm sitting here in New York, it's going to give me an IP address in us-east-1 down in Virginia. If I'm sitting over in London, or in Europe somewhere, it's going to give me an IP address that's over in Europe, actually, that's in the London region, eu-west-2.

Based on that, at that top level, traffic is going to get routed to one of two regions. The Lambda functions deployed in that region are going to handle that request. They're going to talk to a couple of DynamoDB tables, and the other interesting thing on the back-end of this is what AWS calls a global DynamoDB table. We've essentially set up two mirror DynamoDB tables that both accept writes and then propagate those writes to their peers behind the scenes. A lot of people stop me and say, "John, you're just describing some DNS tricks, and some AWS configuration magic." Yes, that's totally the point here. We have to architect in this way to survive these regional failures. The point is, we can do it. I described the request flow here. What we're actually going to do, going back to this architecture diagram, we're going to see that users sitting in one of these two locations get routed appropriately, so that's better for our users, that's that good user experience we talked about. We're also going to see, if we simulate a failure, that traffic can get rerouted as well. We might get lucky because we're in us-east-1, but if we don't get lucky, basically what I'm going to do is fail the health check that Route 53 uses to determine what regions it has available, and that'll simulate a regional failure. Or, maybe this guy in the third row can help us out too, but you could bring down S3 on purpose this time.
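The routing behavior described here, closest region first and another region when a health check fails, can be sketched as a small selection function. This is a simulation of the idea only, not how Route 53 is implemented; the latency figures and the `pick_region` helper are hypothetical.

```python
# Made-up round-trip latencies in milliseconds, for illustration only.
LATENCY_MS = {
    "new-york": {"us-east-1": 10, "eu-west-2": 80},
    "london": {"us-east-1": 80, "eu-west-2": 10},
}


def pick_region(user_location: str, healthy: dict) -> str:
    """Return the lowest-latency region whose health check is passing."""
    candidates = {
        region: ms
        for region, ms in LATENCY_MS[user_location].items()
        if healthy.get(region, False)
    }
    if not candidates:
        raise RuntimeError("no healthy region available")
    return min(candidates, key=candidates.get)


all_up = {"us-east-1": True, "eu-west-2": True}
east_down = {"us-east-1": False, "eu-west-2": True}

normal_route = pick_region("new-york", all_up)       # closest region wins
failover_route = pick_region("new-york", east_down)  # falls back to the healthy region
```

A user in New York lands in us-east-1 while it is healthy and in eu-west-2 once its health check fails, which matches what the demo goes on to show.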

We've talked about simulating failure. Let's jump right into the demo. I'm just going to run this little Elm front-end. Again, the instructions for doing this are in the source repo, so I encourage you to take that and experiment with it. The only thing you'll have to change is the DNS setup to match whatever domain name you have available because you can't use mine, it's taken.

I'm going to go over to Chrome here, I'm going to bring up our little front-end. This is a chat application, what we're seeing here is nothing yet, so let me put something in here, "Hello QCon." Obviously, on the left side, we have our message, the source and read column. That source column is telling us that's where the message came into the system. What we're seeing here is this message was accepted and processed by a Lambda in us-east-1, and we're actually using WebSockets. What this is also telling us is that that message was read from the system in us-east-1. That makes sense, pretty straightforward, but I just want to explain what you're seeing here.

Next thing we want to do, let's go ahead and switch our network location so that the back-end, at least, thinks that we're coming from Europe. I'm going to just travel over to Denmark via VPN. I've refreshed the application. Our message is there, our message was persisted, but we're now talking to an instance of this application running in eu-west-2. That information is being read from eu-west-2. That all makes sense. We can put another message into the system, "Hello, QCon from Denmark." We can see that this message was accepted by our application in eu-west-2 and then read back out of there as well. This is pretty easy, pretty straightforward.

Let's jump back to the United States. I'm just going to refresh here, and now you can see that we're reading all of these messages out of us-east-1 again. I'm back in the U.S., this application is behaving just the way we'd expect, reading data from us-east-1, and I'll show you where that data is living on the back-end a little later on in the demo.

Now, what we're going to do is we're going to simulate us-east-1 going down. On the network, I'm here in New York, I should be routed to us-east-1. If we take us-east-1 down, if anybody needs to turn off their PagerDuty or something, now's the time. The way we're going to do that is, we're going to go to Route 53, I'm going to my Health checks, and you can see I've got two health checks here - the first one is us-east-1, the second one is eu-west-2. I found, for the purposes of this demo, the quickest way to fail a health check is actually not to change the code behind it, but is to do what's called inverting it.

We basically are telling Route 53 to treat bad as good, and good as bad, and cats as dogs, and up and down, and all of that. I've now inverted that health check, and we're going to wait for that to go from healthy to unhealthy. While we do that, I'm just going to point out the DynamoDB configuration. DynamoDB is AWS's highly available, very performant key-value store. We have a messages table, and so I'm sitting here in us-east-1. I'm looking at the messages there. This is set up to receive messages in us-east-1 for traffic routed here. These messages get copied over to the same table in eu-west-2 in London. I can jump over there and see I have the same table in London with the same messages.
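Inverting a health check amounts to flipping the reported status without touching the endpoint. A minimal sketch of that logic, with a hypothetical helper name:

```python
def effective_health(endpoint_ok: bool, inverted: bool) -> bool:
    """Status the router acts on: the raw result, flipped when the check is inverted."""
    return endpoint_ok != inverted


# The endpoint itself is fine, but the inverted check reports it as unhealthy.
before_invert = effective_health(endpoint_ok=True, inverted=False)
after_invert = effective_health(endpoint_ok=True, inverted=True)
```

That is why the trick is convenient for a demo: the application in us-east-1 keeps running untouched, and only the routing layer is told it is down.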

This is DynamoDB global tables, I encourage you to check it out. There are some limitations that we'll talk about later, but it's a really interesting way to have these global applications that share data across different instances in the deployed application. I'm going to go back and check Route 53, see if this thing is showing up as unhealthy yet. Route 53 is telling us that that health check is unhealthy. What we would expect here, if we refresh this page, is that we'll be reading data from eu-west-2, from London, even though on the network, we're still here in New York.

I'm going to try to refresh this. I've actually hit this before, this is a Chrome DNS issue, so I'm going to do an incognito window here, and you can see that this is now reading from eu-west-2. It doesn't seem like much, but we failed a region there, and our application stayed up. I can still interact with this application, "Where'd us-east-1 go?" Our users are still able to use this application. People sitting in New York might get routed to Europe, but it's still up. We've architected successfully around a regional AWS failure. It doesn't seem like much on the screen here, but this is huge.

If you were architecting in this way when S3 went down however many years ago, your application would've stayed up and your users never would have noticed. They might have hit a little bit of extra latency, but that's it. AWS and the cloud vendors give us the capability to do this, we just have to take advantage of it and do it. I'm going to reset our health check here to bring us back to the U.S., although we may not wait on that to actually happen. That is our demo of a globally available application on AWS.

Rough Edges

Let's talk about what some of the rough edges of this are. Many of these rough edges are specific to AWS: global tables, and custom domains for WebSockets, don't work very well with CloudFormation. What that means is that in my infrastructure-as-code approach, which I really want to use for deploying this easily to many different regions, there are some rough edges and some pieces are broken. I have to perform some manual actions to actually get this completely set up. These things will be fixed over time.

That third one, if you're using global tables, they can't have any data in them when you link them together, so that's a big caveat. This Stack Sets thing at the bottom - Stack Sets are just a way on AWS to deploy to multiple regions at once easily. We can actually mitigate that just with our deployment pipeline. Some other rough edges - and these are just more architectural challenges, I don't see a slide for this, unfortunately - but your application may have special considerations around whether you can accept data written in multiple regions or not. That's an architectural challenge for you to overcome. Your users may need to have affinity to one region or another. You may not be able to actually move them around or accept their data in multiple places for compliance reasons or for other reasons. Some rough edges there and some architectural challenges, but what we showed is that it is definitely possible.

I've got some links here for multi-region deployment. There are some other ways to architect different pieces of an application, so there's this great documentation on a new feature of Amazon's CDN, CloudFront, called Origin Failover. That basically means I can set up a CDN. I can have the backing store for that CDN, maybe it's S3, or a database, or something. If that backing store becomes unavailable, then CloudFront, the CDN, will fail over to another one.

Other CDNs, Fastly, and CloudFlare, and folks like that have had those features for a while, but it's now also available on AWS. There's Global Accelerator, which lets you do all of this but at a much more fundamental network level. I also have some AWS resources that I just want to call your attention to. I mentioned earlier that well-architected framework that you should all read if you're running applications, building applications on Amazon.

In James Hamilton's talk from re:Invent 2016 called "The Amazon Global Network Overview," he goes into deep technical detail about how they build out their global infrastructure, how they structure regions, how they structure availability zones. He talks about how generator cutover switches are poorly designed. It's super deep, and it's super interesting, and he's incredibly enthusiastic when he's talking about it. If you're using Dynamo, I strongly recommend Rick Houlihan's talk from last year at re:Invent, "Advanced Design Patterns for DynamoDB," and he goes into some of the technical detail around global tables as well.

Then there's a bunch of other prior art around building these global applications. We brought some of it together in this demo today. Then Symphonia, we have some resources out there as well, which I encourage you to check out. Feel free to email or hit us up on Twitter if you have any questions or just want to talk more. I would love for you to stay in touch again. I welcome questions by email, Twitter.

Questions & Answers

Participant 1: This seems like a nice, happy path story. Can you talk about the problems, the pitfalls, the edge cases which we're not thinking about? This is a nice, obvious happy pathway to do things, and you could potentially have routing within your region as well for local-regional failures, and it goes all the way down. What are some of the problems? What are the things we are not seeing here?

Chapin: The question, to rephrase it and condense it a little bit is, "Boy, John [Chapin], you showed a nice happy path here, surely it's not always like this. What can go wrong? Can you do this within a region?" What I would go back to on that second part of the question, "Can you do this within a region," is if you're doing that, you're probably not using serverless components anyway, and so I'm going to just not cover that because we're focused on serverless event-driven systems in this case.

I pointed out some of the minor rough edges somewhere in here. Those are the architectural challenges. If your application is not like this - when I say, "not like this," I mean, if your application can't operate like this - then obviously you need to make different choices. We're relying heavily though on globally available, or what should be globally available services like Route 53. There have certainly been cases where Route 53 has had problems. That's a big rough edge.

That being said, I would rather that Amazon, or Microsoft, or Google, or whoever runs a global DNS system than myself. Also, you could take Route 53 out of the equation here and use a third-party DNS provider too. You're still susceptible to some of those vendor failures and some of those service failures, but just at a much, much higher level.

Participant 2: This kind of routing is great when your service is absolutely not available, but you hinted there, what if there's a service failure, or you have a chain in tiers, and in one of the tiers there's a service failure. What do you have to propagate that, show that failure all the way to the routing so that it can be routed to the other region, potentially?

Chapin: The question is, what if you have multiple tiers, and within those tiers maybe, can you redirect, or can you account for failures?

Participant 2: Yes. The failure is not an availability one, perhaps a functional one. You might want to route to the other region completely.

Chapin: The way we're doing this is with a health check that is programmatically produced. If your failure is a functional one, reflect that in your health check. Your health check could encompass not only, "Do I exist at all," so it's either there or it's not. It could also encompass, "Am I getting the right kind of data back that I expect?" Or, "Am I producing the results that are needed?" Or whatever the case may be.
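A health check that covers functional failures as well as plain availability might look like the sketch below. It is an assumption-laden illustration: the check names and the `health_response` helper are hypothetical, and it relies on the health checker treating a non-2xx status as unhealthy.

```python
import json
from typing import Callable, Dict


def health_response(checks: Dict[str, Callable[[], bool]]) -> dict:
    """Run every named check; report 200 only if all of them pass."""
    failed = []
    for name, check in checks.items():
        try:
            ok = check()
        except Exception:
            ok = False  # a check that crashes counts as a failure
        if not ok:
            failed.append(name)
    status = 200 if not failed else 503
    return {"statusCode": status, "body": json.dumps({"failed": failed})}


# Hypothetical checks: "do I exist at all" plus one functional check.
healthy = health_response({
    "service_reachable": lambda: True,
    "returns_expected_data": lambda: True,
})
degraded = health_response({
    "service_reachable": lambda: True,
    "returns_expected_data": lambda: False,
})
```

Keeping each check small helps with the risk raised in the answer that follows, where the health check ends up re-implementing much of the system it is checking.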

You can roll all of that up into a health check. There's a lot of danger there too, and you run the risk of with testing sometimes, it's re-implementing a lot of your system in the health check, but it's certainly possible. A lot of this can be sliced up into tiers. Route 53, for example. This doesn't all necessarily have to be just at the front-end of your application. This could be different tiers within it.

Participant 3: You mentioned that the database has to be set up with no data. How would you go about implementing or onboarding, say, another region? Is that something you've done, or is that something that you're just waiting a solution for?

Chapin: DynamoDB global tables have to be empty when you want to link them together, make them globally available, basically. The question to me was, have we onboarded applications that already have data into an architecture like this? The answer is no. It's so new that we're not using it heavily yet, and I'm waiting for a solution. The solution to that would be a manual cutover; dump all the data, create a backup, and then very carefully move things over. It'd be a dance though. It'd be a pain.

Participant 4: On the rough edges, you mentioned that it's not available in CloudFormation?

Chapin: Yes.

Participant 4: Just elaborate on that. Is that just set it up manually and stuff like that?

Chapin: Yes. The comment was, some of these services cannot be configured in CloudFormation. Just to review, CloudFormation is Amazon's infrastructure-as-code service. I have a big YAML file or a JSON file, and it has all of my resources in there, a bunch of Lambda functions, some API Gateways, maybe some DNS records. Some of these pieces that we showed in this architecture cannot be configured right now using that infrastructure-as-code service.

For some of these things, in particular, the custom domain name for the WebSockets, you have to either go into the AWS console and manually configure that, or you can do it through API calls, but again, you have to write the script to do it. Terraform may support that, I haven't tried, actually. The other piece of that is the linkage between DynamoDB tables across regions, which is the other thing that you can't do in CloudFormation, in that infrastructure-as-code tool.

Participant 5: With both databases accepting writes, there could be data discrepancies, and maybe a split-brain situation between the two sites that could lead to ambiguity about how consistent the data is. If there are more writes, there may be replication lag. How do we handle all of these?

Chapin: I agree with all of those things. I didn't hear a question in there, but the comment was basically, "You have a dual master situation, or writes going to two tables in different regions at the same time. What do you do about the data?" The answer is, architect and build your application to handle that possibility. You could use convergent data structures. For example, in this case, this being a chat application, it's pretty straightforward just to have the messages be singular, and if we have duplication, we actually handle it on the client side, which if you dig into that app is actually what it does. You design your application with that in mind if you want this kind of system. No silver bullet, no magic there. You have to put in the work.
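As a rough sketch of handling duplication on the client side, the function below merges message lists from any number of replicas by message ID. It is not the demo app's actual code (that front-end is written in Elm); the field names `id` and `sent_at` are hypothetical, and it assumes a message never changes once written, so copies with the same ID are identical.

```python
def merge_messages(*replicas):
    """Union messages from several replicas, keeping one copy per message ID."""
    merged = {}
    for replica in replicas:
        for message in replica:
            # Messages are assumed immutable, so the first copy seen is as good as any.
            merged.setdefault(message["id"], message)
    # Sort on (timestamp, id) so every client shows the same order.
    return sorted(merged.values(), key=lambda m: (m["sent_at"], m["id"]))


us_view = [
    {"id": "m1", "text": "Hello QCon", "sent_at": 1},
    {"id": "m2", "text": "Hello from Denmark", "sent_at": 2},
]
eu_view = [
    {"id": "m2", "text": "Hello from Denmark", "sent_at": 2},
    {"id": "m3", "text": "Where'd us-east-1 go?", "sent_at": 3},
]

combined = merge_messages(us_view, eu_view)
```

The merge gives the same result regardless of which replica is read first or how often a message is repeated, which is the property that lets both regions accept writes without coordination.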




Recorded at:

Sep 10, 2019
