Monzo Outage Post Mortem

by Alex Giamas on Nov 06, 2017. Estimated reading time: 1 minute

Monzo, the UK-based, mobile-only digital bank, recently suffered outages in their current account payments and prepaid debit card systems. Oliver Beattie, Monzo’s head of engineering, took to Monzo’s community forum to provide a post mortem of the outage.

Monzo has designed their infrastructure from the very beginning with global scaling as one of the core hypotheses. This has led to hundreds of microservices being developed over time.

These microservices are packaged into Docker containers, which are deployed on AWS using Kubernetes. Kubernetes handles orchestration and keeps its cluster state in etcd, a distributed key-value store that records which services are deployed where, along with each service’s state. Routing and load balancing between services are handled by linkerd.

The outage, which affected both prepaid card and current account holders, was caused by a combination of several factors.

First, a bug in Kubernetes caused requests to time out after a cluster reconfiguration. A reconfiguration carried out a week before the actual outage triggered these timeouts, which prevented linkerd from receiving updates from Kubernetes.
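The failure mode described above, a consumer that silently stops receiving updates, can often be caught by tracking how long it has been since the last successful update. The following is a minimal Python sketch of such a staleness watchdog. The class name `UpdateWatchdog`, the threshold value, and the injectable clock are illustrative assumptions; this does not reflect how linkerd or Kubernetes work internally, nor what Monzo actually runs.

```python
import time
from typing import Callable, Optional


class UpdateWatchdog:
    """Tracks when a service-discovery update was last received.

    Hypothetical helper for illustration only; not part of linkerd or Kubernetes.
    """

    def __init__(
        self,
        max_age_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._max_age = max_age_seconds
        self._clock = clock  # injectable so the logic can be tested without sleeping
        self._last_update: Optional[float] = None

    def record_update(self) -> None:
        """Call this every time an update arrives from the control plane."""
        self._last_update = self._clock()

    def is_stale(self) -> bool:
        """True if no update was ever seen, or the last one is older than max_age."""
        if self._last_update is None:
            return True
        return self._clock() - self._last_update > self._max_age
```

An alerting job could poll `is_stale()` periodically and raise an alert when it returns `True`, so that a stalled update stream surfaces well before it turns into an outage.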

Then, when the outage happened, one of the immediate reactions was to restart all linkerd instances. This exposed an incompatibility between the versions of Kubernetes and linkerd that Monzo was using, escalating the situation from a service-specific outage to a full platform outage. The thread in Monzo’s community forum also provides a full timeline of events.

There are interesting lessons to learn from an outage like this one. Beyond fixing bugs and keeping the versions of different libraries in check for incompatibilities and other issues, Monzo has identified the need to improve their procedures for communicating outages, both internally and externally.

Another lesson is the importance of alerts, dashboards and health checks in every layer of an application for early detection of human and other errors. All in all, it’s important to do everything we can to prevent outages, to resolve them when they occur, and to communicate clearly afterwards what happened, so that we can build better safeguards for the future.
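As a rough illustration of health checks at every layer, the Python sketch below runs one named check per layer and reports which layers are failing. The function names and the layer names used in the usage note are hypothetical and chosen only for this example; they are not taken from Monzo's post mortem.

```python
from typing import Callable, Dict, List


def run_health_checks(checks: Dict[str, Callable[[], bool]]) -> Dict[str, bool]:
    """Run each named check; a check that raises is recorded as failed."""
    results: Dict[str, bool] = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            # A crashing check is treated as an unhealthy layer, not a crash of the monitor.
            results[name] = False
    return results


def failing_layers(results: Dict[str, bool]) -> List[str]:
    """Return the names of the layers whose check failed, sorted for stable output."""
    return sorted(name for name, ok in results.items() if not ok)
```

Each layer (for example a proxy, a service-discovery client, or a database) would register its own check function, and a dashboard or alerting rule could be driven by the output of `failing_layers`.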
