Designing and Building a Resilient Serverless System: John Chapin at QCon London

In a presentation at QCon London 2019, John Chapin explained the basics of serverless technologies and how to architect and build a resilient serverless system. He also ran a demo of how a globally distributed, highly available application can be built and run in multiple regions on AWS.

Chapin, co-founder of Symphonia, uses AWS as an example but notes that what he describes is also available from other cloud vendors. He starts by describing serverless as a combination of Function as a Service (FaaS), services that run your business logic, and Backend as a Service (BaaS), for example the databases or messaging systems used by the application, and refers to his free book for a more detailed description. The more important attributes of serverless include:

No managing of hosts or processes
Automatic scaling and provisioning
Cost based on exact usage, even down to zero
Implicit high availability

This gives some benefits, in addition to those we already get from the cloud:

Reduced total cost of ownership (TCO)
Scaling flexibility
Shorter lead time

But as we take steps down the cloud path, we also give up a little bit of control compared to when we were using our own hardware:

Limited configuration options
Fewer opportunities for optimization
Hands-off issue resolution: if an outage occurs, there is nothing we can do but wait for the provider to fix the problem

Looking at resilience, Chapin refers to Werner Vogels, CTO at AWS, who in a blog post points out:

Failures are a given and everything will eventually fail over time…

For Chapin this means that the systems we are building today are so big and so complex that failure is a statistical inevitability. Some part of the system will be in failure mode almost all the time, and we must handle that. In his blog post, Vogels also talks about embracing failure, and although he does so from the perspective of a vendor running cloud services, Chapin thinks we as serverless application developers should consider the same concepts, which include:

Systems will fail. At scale, they will fail a lot
Embrace failure as a natural occurrence
Limit the blast radius of failures by using isolation mechanisms
Keep operating
Recover quickly, and automatically

In the scope of serverless, it is all about using vendor-managed services, and Chapin notes that there are two classes of failures. The first is application failures, which are your problem and your resolution. The second is essentially all other failures, which are your problem but not your resolution. To mitigate vendor-related failures, Chapin recommends that we plan for failure by architecting and building systems that are resilient. We should also take advantage of the tools provided by our vendor:

Isolation mechanisms, like AWS regions
Services designed to work across regions, like Route 53
Recommended architectural practices, like Reliability Pillar, a document Chapin highly recommends

AWS uses regions and availability zones as isolation mechanisms. A region contains at least two availability zones in proximity to each other and is guaranteed to be in one country. An availability zone is an isolated data centre within a region, a separate building with separate power and network connectivity. If it goes offline for whatever reason, it is designed to not affect any other availability zone. For more information about how this works, Chapin recommends a presentation by James Hamilton, architect at AWS, in which he gives an overview of Amazon's global network.

Serverless resiliency in AWS is based on either regional or global high availability: services run across multiple availability zones within one region, or across multiple regions. Regional high availability is handled automatically, but to achieve global high availability we must architect our applications for it. Other attributes of serverless applications include:

They are typically event-driven with state stored in external and reliable storage. This means there is little or no data in flight when a failure occurs
Continuous deployment, which means that there is no persistent infrastructure to re-hydrate and that the application is likely to be portable
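
As a rough illustration of the first attribute, the following is a minimal sketch of a stateless, event-driven function. It is not taken from Chapin's talk: the handler signature, the MessageStore class and the event fields are all hypothetical, and the in-memory store merely stands in for durable, vendor-managed storage.

```python
import json
import time
import uuid


class MessageStore:
    """Hypothetical stand-in for durable external storage (e.g. a managed database)."""

    def __init__(self):
        self._items = {}

    def put(self, item):
        self._items[item["id"]] = item

    def get(self, item_id):
        return self._items.get(item_id)


def handler(event, store):
    """Process one event; all state goes to `store`, none is kept in the function."""
    body = json.loads(event["body"])
    item = {
        "id": str(uuid.uuid4()),
        "text": body["text"],
        "received_at": time.time(),
    }
    store.put(item)
    return {"statusCode": 201, "body": json.dumps({"id": item["id"]})}
```

Because the function keeps nothing between invocations, an instance that fails mid-flight loses at most the single event it was processing.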

In an end-to-end demo, Chapin builds a globally distributed, highly available API that basically is a chat application. It can be deployed in several regions using replicated data storage which means that if a region goes down, the application will continue to work with all data available. In his demo he also shows this in action by simulating a failing region (by having his application return error codes).
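
The failover behaviour described above can be sketched roughly as follows. This is not Chapin's demo code: the region names, make_region and call_with_failover are hypothetical, replication is modelled as a single shared list, and routing is done in client code purely for illustration, since the summary does not describe how the demo routes traffic between regions.

```python
class RegionUnavailable(Exception):
    """Raised when every regional endpoint answers with a server error."""


def make_region(name, messages, failing=False):
    """Build a hypothetical regional endpoint backed by replicated messages."""

    def endpoint():
        if failing:
            # Simulate a failed region by returning an error status code.
            return {"statusCode": 500, "region": name}
        return {"statusCode": 200, "region": name, "messages": list(messages)}

    return endpoint


def call_with_failover(endpoints):
    """Try each regional endpoint in order and return the first healthy response."""
    for endpoint in endpoints:
        response = endpoint()
        if response["statusCode"] < 500:
            return response
    raise RegionUnavailable("all regions returned errors")


# Replicated data is modelled as one shared list, an assumption made for brevity.
replicated = ["hello", "world"]
primary = make_region("region-a", replicated, failing=True)
secondary = make_region("region-b", replicated)

result = call_with_failover([primary, secondary])
print(result["region"], result["messages"])
```

With the first region forced to fail, the request is served by the second region, which still sees all of the data.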

Chapin has published both example code and slides for the presentation. Most presentations at the conference were recorded and will be available on InfoQ over the coming months. The next QCon conference, QCon.ai, will focus on AI and machine learning and is scheduled for April 15 – 17, 2019, in San Francisco. QCon London 2020 is scheduled for March 2 – 6, 2020.
