
Modern Banking in 1500 Microservices

Summary

Matt Heath and Suhail Patel explain how Monzo team builds, operates, observes and maintains the banking infrastructure. They talk about how they compose microservices to add new functionality, Monzo’s culture, deployment and incident tooling, monitoring practices and how they share knowledge effectively.

Bio

Matt Heath is an engineer at Monzo, where he works on Monzo's microservice platform and payment services. Suhail Patel is a back-end engineer at Monzo, focused on working on the core platform. His role involves building and maintaining Monzo's infrastructure which spans hundreds of microservices and leverages key infrastructure components like Kubernetes, Cassandra, and more.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Patel: When we got asked to give this presentation, we had about 1500 microservices. Since then, we've grown to nearly 1600; I checked this morning. Our microservice ecosystem continues to grow. Here's a connected call graph of every one of our services. Each edge represents one service actually calling another service over the network. The number of connections is really large, and it's constantly changing as new features are developed.

Background

My name is Suhail, and I'm joined by Matt. We're both engineers at Monzo. We spend our time working on the underlying platform powering the bank. We think that all of the complexity of scaling our infrastructure, and making sure that servers are provisioned and databases are available, should be dealt with by a specific team, so that engineers who are working on the product can focus on building a great bank and not have to worry about the infrastructure underneath them.

What Is Monzo?

At Monzo, our goal is to make money work for everyone. We deal with the complexity to make money management easy for all of you. Monzo is a fully licensed and regulated bank in the UK. We have no physical branches. I think it's a legal requirement to have an API which lists all of your branches, so we've had that API ever since we began; it's been our least-changed service over time. You can manage all of your money and finances within the Monzo app. Yesterday, we hit a big milestone of 4 million customers in the UK. That number actually makes us bigger than some banks in the UK that have been around for over 100 years, banks that I used to walk past when I was a kid that had physical branches. We also have these really striking, hot coral debit cards. They actually glow under UV light, which is really nice when you're in the dark.

Where to Start Building a Bank

Heath: Five years ago, a group of people decided to build a bank from the ground up. There are a lot of reasons for that. We wanted to build something a bit different, something that lets people manage their money easily and simply. That means you're competing in a space with, honestly, quite a large number of quite big, very well established banks. When we were trying to work out how we would approach that problem, we wanted to build something that was flexible, and could let our organization flex and scale as we grew. The real question is, with that in mind, where do you start? I occasionally suggest this would be the approach: you just open your text editor, start with a new bank file, and go from there. Five years ago, I didn't know anything about banking. It's an interesting problem to start from something where you need to work out and understand the domain, and also work out how you're going to provide the technology underneath that.

Time Pressure

In that world, we have an interesting pressure, which is time. Getting a banking license takes quite a long time, as you'd expect. It's important that companies that become banks are well regulated, and have customers' best interests in mind. There are lots of things that you need to do to make sure that you're looking after people's money. The responsibility you have there is huge. At this point, you have an interesting dilemma as a product company. You want to build a product that helps people manage their money as quickly as possible. To do that, you need feedback as quickly as possible from customers. But you can't get feedback, because you don't have a product that you can give to people, because you can't take and hold deposits until you're a regulated bank.

The Product Development Process

In this world, right at the beginning, we weren't even planning on building a prepaid card; we decided to do that a few months later. That allowed us to partner with another company and go through the product development process. At the beginning of that, we needed to work out a way to build a technology platform that would be extensible, so we could quickly and easily adapt to the additional systems that we needed to plug into. We wanted it to be scalable. Ideally, we wouldn't have to do a large re-platforming effort four or five years in, which is a common approach, and also only a problem that happens if you're successful: you only need to re-architect an entire system if you actually have scale that requires it. We needed to be resilient, because we're a bank. We needed to be secure, because we're holding people's money.

The Technology Choices

With those four main things in mind, we wanted to work out what technology choices we would make to drive those things. We made a few quite early on. We use Go as our primary programming language. There are lots of reasons for that. Ultimately, as a language, it's quite simple and statically typed, which makes it quite easy for us to get people on board. If you're using a language that not many people know, you have to get people up to speed on how to use it. Honestly, if you're working in a company where you have quite a large framework, you already have that problem: you have to get people to understand how your toolset works, how your framework works, and how they can be effective within your organization. Go also has some interesting things such as a backwards compatibility guarantee. We've been using Go from the very early versions of Go 1. Every time a new version of Go comes out, it has a guarantee that we can recompile our code and we basically get all of the improvements. What that means is that the garbage collector, for example, has improved by several orders of magnitude over the time that we've had our infrastructure running. Every time, we recompile, test that it still works, and then we just get those benefits for free.

The other things that we chose early on were emphasizing distributed technologies. We didn't want to be in a world where you have one really resilient system, then a second backup system, and a big lever that you pull but don't pull very often. Because if you don't exercise those failover modes, how can you know that they work reliably? We wanted to pick distributed technologies from very early on. We use Cassandra as our database. Back in 2015, Kubernetes wasn't really an option, so we actually used Mesos. Then a bit later, in 2016, we revisited that, looked around, and it was clear that Kubernetes was the emerging market leader. Before we expanded into our current account, we switched over to Kubernetes. The thing that we were taking from that is providing an abstraction from the underlying infrastructure for our engineers who were building banking systems on top of it.

I think the first version of Kubernetes we ran in production was version 1.2. For anyone who has used those versions of Kubernetes, that was an interesting time. There were many benefits to moving to Kubernetes. We actually saved loads of money quite quickly. We had lots of machines that were running Jenkins worker pools and loads of other things that we couldn't easily run on our Mesos cluster. By moving to Kubernetes, we could use the spare capacity on the cluster to do all of our build jobs and various other things, and we could more tightly pack our applications. That saved us a load of money, and we shut down loads of other infrastructure.

Outage at Monzo

It wasn't all plain sailing. One of our values at Monzo is that we're very transparent. We believe that that is the right thing to do. Unfortunately, sometimes that's quite painful. As an example, in, I think, 2017, we had quite a large outage because of a combination of bugs between Kubernetes, how it interacted with etcd, the consistent storage layer backing it, and Linkerd version one, which we were using as our service mesh at the time. Due to a combination of different bugs, which honestly are quite hard to test for, that resulted in a complete platform outage for us. Those are things that we have to think about as we are developing and introducing new technologies and evolving how our platform works. We have to think about how we can test them and how we can be confident that we're providing the extensible, scalable, resilient, and secure platform that we want, and providing a really good product to our customers so that they can trust us. Hopefully, all of you can trust us.

Iteration of the Monzo App

We started off in the early days with a really basic product. We didn't even have debit cards to start with. Then slowly from that point, we've iterated and added more and more features. We've added Pots so that you can organize your money. You can pay directly out of them. You can save money through those. You can pick how to do that in the app. Or, you can have your salary paid a day early. You'll get a prompt in the app if you're eligible. Then you can sort that into the Pots so you can segregate all of your money for bills just straight away. You just never see it. Your bill money goes over here. You pay your bills straightaway. All of these are provided by an API. This part is relatively straightforward. We have many product features. We have many aspects of our API that we need to build.

Diverse Product Features

Then I think this is the point where things may be less clear, or where things get a bit more complicated. We have many product features, but those aren't the only things that Monzo has to build. For example, we have to connect to lots of different payment systems. These are some of the ones in the UK. Take MasterCard: we're both a card processor, so we process all of our own transactions, and a card issuer, so we issue our own cards. There are lots of different systems that go into that. That's relatively complicated. Adding those things as separate systems allows us to keep those things simpler. Some of those things we've in-housed. We used to have an external card processor, and now we've brought that inside. We have a MasterCard processor written in Go. We had to add a load of EBCDIC code pages to the Go programming language. Does anyone know what EBCDIC is? It's a whole interesting, different type of thing. By doing that in one section of our infrastructure, we can isolate that complexity, and everything else doesn't have to be aware of it. Many of these things we've in-housed: MasterCard is one of them, and Faster Payments more recently. We now have our own direct connection to Faster Payments that is built on our own infrastructure. We brought our own gateway in-house as well. All of those things we're doing because we want to build more resilient systems. We want to have more control over the product experience, but we also want to be able to own our own availability.

Some of the other examples are chat systems. When you chat to someone through Monzo, 24 hours a day, every day of the year, that goes through our own systems now. We've in-housed that as a critical function of the business. Behind that, there's a whole team of people who will chat to you and help you with your problems. We have APIs that power this in the app, and we also have bespoke systems behind that, which are what our internal teams use to talk to people, manage a variety of tasks, that kind of thing. Over 50% of our staff use this on a daily basis. One of our colleagues, Sophie, gave a really interesting talk, "Support at the Speed of Thought." It's a really good insight into how we build those internal-facing products. Again, with the examples we're giving here, the problem space is quite large. It gets even larger because we integrate with loads of other companies. We integrate with IFTTT, so you can get real-time notifications of your payments, and Flux, so you get receipts in the app. All of these things are increasing the scope of the domain and the number of systems that we're having to build. Then behind that, there's all the stuff that you don't see: things like detecting and preventing financial crime, and security. All of these things are systems that we build in a consistent way on our platform. They're not things that are visible, really, to the outside world.

Increase in Services and Complexity

Over time, clearly, the number of things that we're working on as a company has increased. That means the number of systems that we build has increased. Over time, that's meant that the number of services we have has increased quite dramatically. Currently, we have 1600 services in production. All of them are very small. We use bounded contexts. They're very tightly scoped on the thing that they do. That allows us to be flexible, because different groups can operate a small section of this codebase. As our organization grows and we have more teams, they can specialize in progressively smaller areas. These systems are responsible for everything that powers our bank: from payment networks, moving money, maintaining a ledger, fighting fraud and financial crime, to providing world-class customer support. All of these things are systems that we built as services within our infrastructure.

That's how we get from, essentially, what started as a relatively simple product through to a system that now at first glance looks really complicated. In this particular diagram, we have color coded areas based on teams that own and maintain these systems. You can see there are clusters owned by different teams within the company. This is clearly not a great way to look at data. This isn't an internal tool we use on a daily basis, but it shows that there are many of these services and they are interlinked. The way that we interlink those things is the interesting part.

Adding a Microservice

Patel: You want to add a microservice? Where do you get started? You start with a blank canvas. This is the surface area that engineers are typically exposed to. They put their business logic in a well-defined box. The surrounding portion makes sure that it works and is production ready, and provides all the necessary interfaces and integrations with the rest of our infrastructure. One of our biggest decisions as an organization was our approach to writing microservices for all of our business functions. Each of these units, each of these microservices of business logic, is built on a shared core. Our goal is to reduce the variance of each additional microservice we add as much as we can. If a microservice gets really popular, we can scale it independently. Engineers are not rewriting core abstractions like marshaling of data, or HTTP servers, or integration with metrics systems for every new service that they add. They can rely on a well-defined, well-tested, and well-supported set of libraries, tooling, and infrastructure that we provide.
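
To make that "well-defined box" concrete, here's a minimal sketch of what such a service could look like. It uses only Go's standard library to stand in for the shared core, and the endpoint, types, and values are hypothetical rather than Monzo's real code: the point is simply that the handler is the only part a product engineer would typically write.

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"
)

type balanceResponse struct {
	AccountID string `json:"account_id"`
	Pence     int64  `json:"pence"`
}

// handleBalance is the kind of thing a product engineer would actually
// write: a small piece of business logic in a well-defined box.
func handleBalance(w http.ResponseWriter, r *http.Request) {
	resp := balanceResponse{
		AccountID: r.URL.Query().Get("account_id"),
		Pence:     12345, // placeholder value for the sketch
	}
	w.Header().Set("Content-Type", "application/json")
	if err := json.NewEncoder(w).Encode(resp); err != nil {
		log.Printf("encoding response: %v", err)
	}
}

func main() {
	// In a shared core, registration would also wire in metrics, tracing,
	// and standard middleware automatically; here plain net/http stands in.
	http.HandleFunc("/balance", handleBalance)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```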

When we make an improvement or fix a bug in the shared library layer, every service can benefit, usually without needing a single line of code change within the business logic. Here's an example where we made some improvements to reduce the CPU time spent unmarshaling data between Cassandra, our primary datastore, and Go, which is what we use to write all of our microservices. Some of our services saw a significant CPU and latency drop. This work had cascading, global improvements across the platform. It's a free speed improvement for anyone who's working on business logic, and everyone loves a free speed improvement.

How to Compose Services Together to Form Cohesive Products and Services

How can we compose services together to form a cohesive product, or offering, or service? We take a problem and subdivide it into a set of bounded contexts. The whole premise behind this is the single responsibility principle: take one thing, do it correctly and do it well. Each service provides a well-defined interface. Ideally, we have safe operations. Consider that if you were going to expose this interface to the public world, what tunable parameters would you want to expose? You don't want to expose every particular knob, because that means you might have lots of different permutations that you need to support.
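
As an illustration of that kind of narrow, well-defined interface for a bounded context, here's a hedged Go sketch. The package, types, and method names are assumptions made up for the example, not Monzo's actual account service API; the idea is just that callers see a few safe operations rather than every internal knob.

```go
// Package account sketches the public surface of a narrowly scoped
// bounded context. All names here are hypothetical.
package account

import (
	"context"
	"time"
)

type Account struct {
	ID        string
	OwnerID   string
	CreatedAt time.Time
}

// Service is the well-defined interface other services call. It exposes
// a few safe operations rather than every internal knob, so there are
// far fewer permutations to support.
type Service interface {
	// GetAccount returns a single account by ID.
	GetAccount(ctx context.Context, id string) (*Account, error)
	// ListAccountsForOwner returns the accounts owned by a customer.
	ListAccountsForOwner(ctx context.Context, ownerID string) ([]*Account, error)
	// CloseAccount marks an account closed without exposing how closure
	// is persisted or which downstream systems are notified.
	CloseAccount(ctx context.Context, id string) error
}
```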

As a particular example, here's a diagram of all the services that get involved when you tap your Monzo card at a payment terminal. Quite a few distinct components are involved in real time when you make a transaction, to contribute to the decision on whether a payment should be accepted, or rejected, or something in between. All of this needs to work as one cohesive unit to provide that decision. Part of that is calling key services, like our account service, which deals with the abstraction of accounts all across Monzo. It's not just providing bank accounts, but accounts as a whole, as a singular abstraction at Monzo. Another is the ledger service, which is responsible for tracking all money movements, no matter what currency or what environment. It is a singular entity that's responsible for tracking all money movements all across Monzo.

This diagram is actually the maximal set of services. In reality, not every service gets involved in every invocation of every transaction. Many of these are there to support the complexity of receiving payments, for example. There is different validation and work we need to do to support chip-and-PIN versus contactless, versus if you swipe your card in the U.S., or occasionally in the UK if the card terminal is broken. A service will only get called if it needs to get involved with a particular type of transaction. This thing is still really complex, because accepting payments is really complex. Why do we have such granularity? We want to break down the complexity and minimize the risk of change. For example, if we want to change the way contactless payments work, we're not affecting the chip-and-PIN system or the magstripe system, so we can fall back to those if we get it wrong.

EU Legislation - Strong Customer Authentication

In September 2019, a piece of EU legislation called strong customer authentication came into effect, which is intended to enhance the security of payments and reduce the amount of fraud. The regulation focused on adding, essentially, a two-factor authentication layer when making payments. For a payment to succeed, a customer needed two out of the three elements. We had to make changes in our payments flow in order to be compliant, just like most other banks. We had the option of baking the strong customer authentication logic into the existing services that were part of accepting and making a decision on payments. Instead, we chose to add additional services with a strict API boundary for validation. That means when the legislation inevitably changes and a new law comes into effect, or we want to iterate on our implementation of this particular legislation, we need only be concerned with that layer, rather than tightly coupling the logic into our existing payments flow. It also means that other services which are not strictly payments related, like our customer support services, can call into the strong customer authentication set of services to get information about the specific state that a customer might be in. If a customer writes in because all of their transactions are getting rejected, our customer support staff can help by seeing information about the state of that customer through these services. Strong customer authentication is a great example of a feature that had a lot of cross-collaboration between engineers working on payments, product, financial crime, and security at Monzo. Different parts of the project were independently implemented and deployed by engineers in the various teams. Each team was able to act locally on the specific components they were allocated, but think globally to deliver the overall project. By having these API boundaries, each team was focused on their specific mental model. Essentially, this caused the natural breaking down of a complex problem, implementing a large piece of legislation with a lot of moving parts, into a set of simpler, composable microservices.

Building Extensible Systems

Heath: When we're building systems like this, we talked about how we can either add systems or change systems. The way that we generally think about this is that when we're building a system, we want it to be extensible rather than flexible. That's a subtle distinction. Rather than building a highly abstract service that can foresee the many different ways we may need to approach a problem, we try not to do that. We try to add it in a relatively simple way, and only abstract those things once we've actually done that thing a few times and we know that that abstraction is correct. There have been a couple of talks at QCon about the cost of abstraction and premature abstraction. I think this is a very specific point. We can try and guess what our future requirements might be; most of the times we've done that at Monzo, we've been wrong. Every time we try to build a highly abstract system, it turns out that even if we're in the right direction, it's still not quite bang on. Instead, we want to be able to add new services that do the job very well and have those defined responsibilities, and to make small changes to other systems to allow them to use that functionality and be extended. That's the difference we see: we usually want to add some functionality rather than changing something to be super abstract.

That's not always the case. If we have a particular responsibility, you can think of that as providing an API that provides a service to someone else in the company, or to some other group of services or people. Sometimes the responsibility of that system remains the same, and we just haven't implemented certain things that logically sit within that boundary. Or, we change the responsibility. We refactor these things over time. In this particular example, we might have a service that has quite a defined boundary. If we add additional code to that, then we're going to effectively extend that boundary; the surface area is now larger. Sometimes that's the right decision, if we're still operating in the same area of responsibility. Obviously, this is now a larger application and it's now more complex. When an engineer either moves onto a team or joins the company, they now need to read more code within that box to understand how the system works. It also means that if we just added code directly to the system without thinking about the larger, overall problem, we may not be thinking about how it was originally implemented. When we're doing this, we tend to have a few patterns. Either we add additional services and update callers to use these; or we'll add additional functionality to existing services and they grow; or, many times, we'll remove code and split them out into a larger set of smaller services.

An Evolving Infrastructure

I think this is the thing to bear in mind: our infrastructure is very much an evolving process. There are many services that won't change for a long time. We have a system that generates IDs, in a standardized format, that has not changed in three or four years. We might add some functionality to that, but it's very stable. There are other areas where we're still learning more as we grow, like those core abstractions; clearly, we've learned a lot more about banking in the last five years than we knew five years ago. We need the ability to change those things over time and refactor them. While we may refactor individual services, generally we're evolving the system over time as we learn more about it. We'll potentially expand services or break them up into smaller ones. Sam talked about some of the patterns for these things with monolith decomposition. All of those patterns still apply. If we're breaking functionality out, we may pull that into another service. The original service may temporarily proxy through to that new service while we update the callers to switch over. It's a bit of a migration process. In a few cases, we've found that we've artificially split something, and after we've used it for a bit, it didn't really make sense. At that point, we can combine those things back together. If you find that you're changing a couple of services together a lot of the time, even if you can deploy them independently, which all of the services at Monzo can be, then maybe that's a signal that those things were prematurely pulled apart.

Many times we've completely retired or replaced systems. As an example, we used to have a prepaid card; we don't have a prepaid card anymore. The way that we interacted with that model of payments is actually very different to the way that our direct card processor interacts. We built the new system and issued everybody new cards. At some point, once all those cards were out of rotation, we could shut those systems down and retire quite a lot of them.

Iteratively Building a Better Product

I think that iterative process is the thing that we generally take to heart at Monzo, both from an infrastructure perspective and from a product perspective. A particular cartoon we love, from Henrik Kniberg, is about iteratively building a better product. Even right back at the beginning, if we had tried to build a bank in isolation and not had enough customer feedback, then we might have ended up with something that didn't really satisfy our customers' needs. By talking to people continuously, and by making small changes but making them quite frequently, we can hopefully make sure we're going in the right direction. That applies both to our individual services and to our product.

The Core Platform

Patel: There have been a few instrumental components that have allowed this ecosystem to flourish at Monzo. We've talked about how we compose microservices and how we develop a set of robust libraries. The other key layer is our core platform. The team we work on focuses on providing components like Kubernetes, so that we can host, deploy, and develop containers; Cassandra for data storage; etcd for distributed locking; and components like Prometheus for instrumentation. We provide these components as services so that engineers can focus on building a bank, rather than having lots of different teams doing individual operational work with many different components. Even with these components that we've standardized on, we provide well-defined interfaces and abstractions rather than surfacing the full implementation details of each of these components.

Reducing the Barrier of Deployments

One key superpower we've been able to leverage is reducing the barrier of deployments. Engineers can ship to production from their very first week. Just today, right about now, we would have had hundreds of deployments of various services all across Monzo. Once code goes through automatic validation, gets peer reviewed, and is approved and merged into the mainline, it's ready to be deployed to production. We've built a bespoke deployment tool called Shipper, which handles all of the complexities like rolling deployments in Kubernetes and running migrations in Cassandra. It deals with services that might look unhappy, so that you can roll them back, and with deployments going bad. All this means that we can build and roll out changes in minutes using a single command. Every engineer is empowered to do this at Monzo. Engineers shouldn't be expected to know complex things like Kubernetes and Cassandra; they don't have to hand-write YAML or CQL commands to deploy their services.

Code Generation

Code generation is another avenue where we optimize for engineer productivity, and it gives us a lot of standardization. As you can imagine, we now have 1600 microservices, so the number of endpoints that are exposed is really large. We define our API semantics in Protocol Buffers format, then use code generation tools with our own extensions (we've extended the existing tooling available) to generate the majority of boilerplate code. You can achieve something like this with gRPC. What this means is that each service is usually about 500 to 1000 lines of actual business logic. This includes, if you've worked with Go, all of the "if err != nil" code as well. That size is really understandable for a group of engineers.
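
Here's a hedged sketch of that split between generated boilerplate and handwritten business logic. The names (PotsService, CreatePot) are invented for illustration; in practice the "generated" half would be produced from a .proto definition by tooling such as the protobuf/gRPC generators, with Monzo's own extensions on top.

```go
// Package pots sketches the split between generated and handwritten code.
// Everything under "generated" would normally come from a .proto file via
// tooling; only the handler at the bottom is written by hand.
package pots

import "context"

// --- generated from the Protocol Buffers definition (illustrative) ---

type CreatePotRequest struct {
	AccountID string
	Name      string
}

type CreatePotResponse struct {
	PotID string
}

// PotsService is the interface the generated server plumbing expects a
// service to implement; routing, marshaling, and transport wiring are
// generated alongside it.
type PotsService interface {
	CreatePot(ctx context.Context, req *CreatePotRequest) (*CreatePotResponse, error)
}

// --- handwritten business logic (typically a few hundred lines) ---

type service struct{}

func (s *service) CreatePot(ctx context.Context, req *CreatePotRequest) (*CreatePotResponse, error) {
	// Validation and persistence would go here; request decoding and
	// response encoding never appear in this file.
	return &CreatePotResponse{PotID: "pot_0000A"}, nil
}
```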

Standardize Service Naming

Even really simple, core things like service naming are standardized. Nobody is deploying a service with an innuendo name; each service is well described by its name. Service structure, the way we structure files, where you put particular files within your code, is all standardized. The vast majority of services use a standardized service generator. All this code and the sub-structure is generated up front. No matter what team I go into, I know where I can find the database code: it will be in the dao folder. I know where I can find the routing logic: it will be in the handler folder. Queue consumers will be in the consumer folder. This allows for much easier collaboration and onboarding for engineers onto different teams. At Monzo, engineers move around teams really often. We're a really flexible and growing organization, so having this standardization across all the teams is really important. Once you get used to the structure in one area, you can be a power user across the entire repository, across all of our services.

Tooling

If you're working in a language like Go, you can build parsers to understand your existing code and extract information from it. Go provides this to you right from the standard library. As we've standardized our service structure, we've been able to build tooling that can operate across all of our services. For example, this tool on-screen, called service query, can print out all of the API endpoints for a given service, pulled straight from the code. Even if an endpoint hasn't been well defined in the Protocol Buffers, which is definitely an anti-pattern, the tool can extract that information directly from the code. We can use the same tooling to do static analysis and validation when you submit a pull request. That means the cognitive overhead for an engineer to peer review a change and make sure it is safe, and backwards and forwards compatible, is delegated to automated tooling. We've reduced the risk of engineers shipping breaking changes when they deploy their code. Violations are automatically detected and can be rectified during the pull request process, using automated tooling, before they're merged into the mainline.
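
To show how cheap this kind of tooling is in Go, here's a small sketch in the same spirit as the service query tool: it walks a directory of Go source with the standard library's go/parser and prints the exported functions it finds, as a rough stand-in for "list this service's endpoints". The directory name is an assumption; this is not Monzo's actual tool.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
	"log"
	"os"
)

func main() {
	// Directory to inspect; "./handler" is just an illustrative default.
	dir := "./handler"
	if len(os.Args) > 1 {
		dir = os.Args[1]
	}

	fset := token.NewFileSet()
	pkgs, err := parser.ParseDir(fset, dir, nil, parser.ParseComments)
	if err != nil {
		log.Fatalf("parsing %s: %v", dir, err)
	}

	// Print every exported function, a rough stand-in for "list the
	// endpoints this service exposes".
	for _, pkg := range pkgs {
		for _, file := range pkg.Files {
			for _, decl := range file.Decls {
				fn, ok := decl.(*ast.FuncDecl)
				if !ok || !fn.Name.IsExported() {
					continue
				}
				fmt.Printf("%s.%s\n", pkg.Name, fn.Name.Name)
			}
		}
	}
}
```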

Metrics and Alerts

Every single Go service using our libraries gets a wealth of metrics for free. Engineers can go to a common, fully templated dashboard, type in their service name, and within the first minute of deploying a new service have up-to-date visualizations and metrics about how many HTTP calls they're making, how many Cassandra calls they might be making, how many locks they're taking, CPU information: a wealth of information. This also feeds into automated alerting. If a team has deployed a service and has not quite figured out the correct thresholds, they can fall back on the automated alerting we already have, so that if a service is really degrading and causing potential impact, the automated alerting will catch that beforehand. Alerts are automatically routed to the right team, which owns the service. When a service is first built, before it's even merged into the mainline, it has to have a team owner assigned to it. This is recorded in a code owners file, which GitHub understands and enforces. This means that we have good visibility and ownership across our entire set of services.
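
Here's a hedged sketch of how a shared library can hand out metrics "for free": a generic middleware records a request-duration histogram around every handler, and Prometheus scrapes the /metrics endpoint to drive dashboards and alerts. The metric and label names are illustrative, not Monzo's real conventions.

```go
package main

import (
	"log"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// requestDuration is the kind of metric a shared library would register
// once and reuse for every handler in every service.
var requestDuration = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name: "service_request_duration_seconds",
		Help: "Time spent handling requests.",
	},
	[]string{"path"},
)

// instrument wraps a handler so timing is recorded without the handler
// author writing any instrumentation code.
func instrument(path string, next http.HandlerFunc) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		next(w, r)
		requestDuration.WithLabelValues(path).Observe(time.Since(start).Seconds())
	}
}

func main() {
	prometheus.MustRegister(requestDuration)

	http.HandleFunc("/ping", instrument("/ping", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("pong"))
	}))
	// Prometheus scrapes this endpoint; dashboards and alerts build on it.
	http.Handle("/metrics", promhttp.Handler())

	log.Fatal(http.ListenAndServe(":8080", nil))
}
```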

Unifying the RPC Layer and Tracing

Similarly, we've spent a lot of time on our backend unifying our RPC layer, so that when a service calls another service, they communicate with each other in a consistent way. This means that trace IDs and context parameters are passed across service boundaries. From there, we can use technologies like OpenTracing and OpenTelemetry, and open-source tools like Jaeger, to provide rich traces of each hop. Here, you can narrow down how long each hop took, and the dependencies on external services and systems. We've baked Cassandra integration and etcd integration right into the library so that we can visualize all of that in Jaeger. It's not just about RPCs; you also want to trace your queries to the database: what actual query was made, how long did it take? Sometimes engineers want to follow a request path through service boundaries and see logs in a unified view. By having consistent trace IDs which are propagated, we can tag logs automatically on our backend, which makes it really easy to query what happened across service boundaries. You can log information and see in detail what every single request went through.
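
A minimal sketch of that trace-propagation idea: pull a trace ID off an incoming request into the context, and copy it onto any outgoing call. The header name and helper functions are assumptions for illustration; a real setup would normally lean on OpenTelemetry's propagators rather than hand-rolling this.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"net/http"
)

type traceIDKey struct{}

const traceHeader = "X-Trace-Id" // illustrative header name, not Monzo's

// fromRequest pulls the incoming trace ID into the context so every
// downstream call and log line can carry it.
func fromRequest(r *http.Request) context.Context {
	return context.WithValue(r.Context(), traceIDKey{}, r.Header.Get(traceHeader))
}

// inject copies the trace ID from the context onto an outgoing request.
func inject(ctx context.Context, req *http.Request) {
	if id, ok := ctx.Value(traceIDKey{}).(string); ok && id != "" {
		req.Header.Set(traceHeader, id)
	}
}

func main() {
	http.HandleFunc("/pay", func(w http.ResponseWriter, r *http.Request) {
		ctx := fromRequest(r)

		// Calling a downstream service: the trace ID travels with the hop.
		out, err := http.NewRequestWithContext(ctx, http.MethodGet, "http://localhost:9090/ledger", nil)
		if err == nil {
			inject(ctx, out)
			// A shared RPC client would send `out` here and record the span.
		}

		fmt.Fprintf(w, "trace %v propagated\n", ctx.Value(traceIDKey{}))
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```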

There is nothing unique about our platform, which makes this exclusive to Monzo. We leverage the same open-source tools like Prometheus, Grafana, The Elastic Stack, and OpenTelemetry to collect, aggregate, and visualize this data. You can do the same on your platform.

Having a Paved Road

Heath: All of this philosophy is around having, effectively, a paved road, where the paved road is just the easiest option. If you go along with it, and the tools that we provide satisfy the problem that you have, you just get all of this stuff for free. If that path is compelling enough (there are other options, but you'd have to do a lot of extra work yourself), then you end up with this very consistent view across most of your platform. It's not mandatory; there are certain cases where we need to use other tools. But because we have a very strong default, it means that 99% of our system uses the default.

Backend Engineering 101

This starts from day one as an engineer. We go through the onboarding flow, we talk about the things that we want people to think about, and we have a documented way to bring people up to speed. I think having that training and giving people an easy path to onboard is really important. Then you can rely on shared expertise across the company. We have lots of patterns for solving problems across all these services and within many teams. As engineers join the company and as we grow, we can leverage that repeatedly. Crucially, you don't need to know how 1600 different systems work. Different teams are working on a cluster of services, and those clusters operate very particular systems. Generally, a person will join a team and they will be looking at a very particular thing.

That applies to lots of different things. Local development, for example: right now, it's clearly impossible to spin 1600 things up on your laptop. Docker containers with your database in them don't really like that many things connecting to them. Building Go binaries is really quick, though, and you very rarely need to run many of these things; you're running a subset. We have, essentially, an RPC filter that can detect that you're trying to send a request to a downstream that isn't currently running. It can compile it, start it, and then send the request to it. That means that as you're using the platform locally, you just spin up the things that you need. We'll refine that progressively over time.
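
As a rough illustration of that local-development trick, here's a hedged Go sketch: before a request is forwarded, check whether anything is listening on the downstream's local port and, if not, start it and wait for it to come up. The service name, port, and "go run" command are assumptions, not how Monzo's RPC filter actually works.

```go
package main

import (
	"fmt"
	"net"
	"os/exec"
	"time"
)

// ensureRunning starts a service with `go run` if nothing is listening on
// its local port yet, then waits briefly for it to accept connections.
func ensureRunning(name, addr string) error {
	if conn, err := net.DialTimeout("tcp", addr, 200*time.Millisecond); err == nil {
		conn.Close()
		return nil // already running
	}

	cmd := exec.Command("go", "run", "./"+name)
	if err := cmd.Start(); err != nil {
		return fmt.Errorf("starting %s: %w", name, err)
	}

	// Poll until the freshly started process is listening.
	for i := 0; i < 50; i++ {
		if conn, err := net.DialTimeout("tcp", addr, 200*time.Millisecond); err == nil {
			conn.Close()
			return nil
		}
		time.Sleep(100 * time.Millisecond)
	}
	return fmt.Errorf("%s did not start listening on %s", name, addr)
}

func main() {
	// A local RPC layer could call this before forwarding each request.
	if err := ensureRunning("service.ledger", "localhost:9091"); err != nil {
		fmt.Println(err)
	}
}
```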

Deviating From the Paved Road

There are some times where we do need to deviate from this paved road. For example, we use machine learning systems that are primarily in Python. For most of our business logic, our approach in Go satisfies those requirements. That's really the benefit of a microservice architecture, we can use the right tool for the job. In our case, Go is generally that right tool. When we do need something else, we can totally use that.

Improved Organizational Flexibility

Patel: We accept that we've traded off some computational efficiency. In return, we've been able to gain organizational flexibility by building services which are granular enough to be easily understood. Ownership is really well defined (it must be well defined), but it can be more fluid in response to market behavior and the goals of the company. Each service shares the same code structure and the same tooling. This reduces cognitive overhead for every engineer who is currently at Monzo, and for everyone who joins Monzo. It has also allowed us to gain a lot of scalability by being able to independently deploy these microservices.

Focus On the Problem

Heath: By standardizing on that small set of technology choices, we can, as a group, collectively improve those tools. Engineers can focus on the business problem at hand, and our underlying systems get progressively better over time. We don't have to think about the underlying infrastructure all the time. At the same time, our platform teams can continuously work on that and raise the bar of abstraction, so that as we go, things get easier.

Increase Velocity While Reducing Risk

Patel: Breaking down the complexity into bite-sized chunks means that each service is simpler and easy to understand. The granularity and ownership of services reduces the contention between teams, while risk is reduced as we can make small, isolated changes to specific sections of our systems. All of this is in aid of reducing the barriers to make changes. It allows us to serve our customers better, which is ultimately what we want to do as an organization. We want engineers to feel empowered to work on new and innovative functionality and deliver a better product to customers.

Questions and Answers

Participant 1: I just wondered, do you have strict guidelines on things like the maximum size of a service? Do you have strict coding guidelines as well, like maximum length or maximum number of lines of code in a method, and things like that?

Patel: We don't prescribe any strict guidelines, because there will always be exceptions to any guidelines that you put in place. In terms of code formatting and stuff like that, Go is really good at enforcing that: making sure that you have proper error handling, making sure that you're not skipping errors, which is something that Go allows you to do. We have static analysis as an additional check on top of that, to make sure that you are not papering over errors. You have to make a really strong justification and add an explicit annotation or comment in the code to explain why an error is being skipped, why you think this will succeed 100% of the time even though the interface doesn't allude to that. We don't prescribe any strict guidelines in terms of the amount of code you can have within a function. Ultimately, code needs to be readable by other humans. Computers are really good at optimizing code, squishing it down, and inlining it when necessary. Optimize code for readability for humans. One of our engineering principles is not to optimize unless there's a bottleneck. That works really well.
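
To illustrate the convention being described, here's a hedged Go sketch: errors are either handled or discarded with an explicit, justified annotation that static analysis can look for. The comment format is invented for the example; Monzo's actual annotation and tooling are bespoke.

```go
// Package example sketches the convention: handle every error, or discard
// it with an explicit, justified annotation that static analysis can find.
// The "#skip-error" comment format is invented for this example.
package example

import (
	"os"
	"strconv"
)

// port handles its error by falling back to a sensible default.
func port() int {
	p, err := strconv.Atoi(os.Getenv("PORT"))
	if err != nil {
		return 8080
	}
	return p
}

// bestEffortCleanup deliberately discards an error, with a justification
// a reviewer (or a linter) can see at a glance.
func bestEffortCleanup(path string) {
	// Removing a temp file is best-effort; it may already be gone.
	_ = os.Remove(path) //#skip-error: removal failure is harmless here
}
```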

Heath: We also have our engineering principles, which are something we expect everybody to adhere to. We want to optimize for code being read and for code being debugged. Optimizing introduces complexity every time, and it's pretty rare that we actually need that complexity, so optimizing for legibility, whether that's the size of functions or whatever, is the best way.

Participant 2: Do you ever find that you need to update a whole bunch of services at once? How do you go about doing that, for a security vulnerability or something like that?

Heath: I can't think of many times we've had to do it for security reasons. I can give you a very concrete example. We have an interface to send metrics that goes through a Prometheus library, but it is wrapped. We used to use InfluxDB, and we subbed that out for Prometheus without any code changes, because we have this very simple interface. That interface does not have a context. We want to be able to propagate contexts all the way through our stack. At this point, we have 1500 services, which have many instrumentation calls, and we need to add a context parameter at the beginning of every single one because we want to update the shared library. I think this is where having a mono repo works really well for us. In that particular case, we wanted to change the library to change the interface. If we couldn't refactor everything in one go, we would have had to add a new one. Mark the old one as deprecated, have some migration period. Then some enforcement mechanism to get people to move over to the new thing. Because we have a mono repo, we could change that and use Go's reformatting tools to add a context to every single call across the entire codebase in a single commit. Then, that was a pull request. We had to get approval from the teams because we'd changed their code. That meant that we could do that refactoring in essentially one atomic unit. Now we have to deploy 1500 services. We have tools around that. We have lots of testing around these things. We have lots of tooling that allows us to initiate rollout, and if that's not working well it will automatically roll back. Because we have those safety mechanisms in place, we can do that with a quite high degree of confidence.

Participant 3: You're using Cassandra as your main database, which I'm sure surprises some because it's not generally associated with ACID and all of these other things that you expect of banks, but it does have tunable consistency. Could you comment on how you use it a little bit?

Heath: I think there's a common misconception that banks must be ACID compliant. Eric Brewer wrote a really interesting post several years ago about how banks are basically available with soft state and eventual consistency, BASE rather than ACID, at their core. I think that's the thing that we really have to think about here. In our case, we use Cassandra because it provides a masterless, horizontally scalable database that gives us a lot of control over writing the data to multiple locations. We don't have this one server, a primary that fails over to a secondary; we can avoid those problems. We've traded off the ability to have transactional consistency. That might sound insane for a bank, but we can provide those things where we need them. In many cases, the financial networks already deal with this. If you tap your card in a store on a terminal that supports offline transactions, most cards will allow you to do three or so transactions of up to £30. Hypothetically, your bank may only find out about that two or three days later, at which point the bank is now told you spent £28 on your card. If you don't have £28, they're still going to remove £28 from your account. You have a series of commutative operations that are like debit £28, credit lots of pounds, debit some number of pounds.

When we talk about consistency and transactional isolation, you're trading off financial risk for consistency. Sometimes that's ok; sometimes that's not ok. We have a variety of different systems. For some things, we won't allow a transaction to happen unless we can acquire a lock, successfully complete the transaction, and then unlock. In many other cases, we get told about a transaction two or three days later, in which case we don't need a lock; we have a queue of things to apply to people's accounts. We don't really rely on eventual consistency that much in that particular case, but we don't need transactional isolation either.
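
As a tiny worked example of those commutative operations, here's a sketch where late-arriving card entries are just signed movements applied from a queue; the final balance comes out the same regardless of the order they arrive in, even if it temporarily goes negative. The types and amounts are illustrative only.

```go
package main

import "fmt"

// movement is a signed amount applied to an account balance.
type movement struct {
	Description string
	Pence       int64 // positive = credit, negative = debit
}

// apply folds a queue of movements into a balance. Because addition is
// commutative, the result is the same whatever order the entries arrive
// in, even if the balance temporarily goes negative.
func apply(balance int64, queue []movement) int64 {
	for _, m := range queue {
		balance += m.Pence
	}
	return balance
}

func main() {
	queue := []movement{
		{"offline contactless, settled two days late", -2800},
		{"salary", 150000},
		{"coffee", -320},
	}
	fmt.Println("final balance (pence):", apply(10000, queue))
}
```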

Participant 4: Considering the volume of microservices, how do you maintain versioning and integration testing, and system integration testing?

Patel: For versioning and integration testing, I think this is where our mono repository, our single repository for all of our services, really helps. For example, every single service that is deployed runs through a battery of unit tests. We can do automatic validation and inference, going through the AST to figure out which services have changed. We only need to run the tests for the changed services, their dependencies, and their upstream dependencies, so the service graph under test can get smaller. If you've made a localized change to your particular service, we can isolate the testing to that service. Obviously, we also have full integration testing on offline boxes that we run periodically, to make sure that the full state of the system is consistent and in a nice, green state.

Naturally, what we do for our integration testing is we spin up these real components in containers. We're doing a bunch of work at the moment so that these systems can call into our platform. We do run integration tests against our staging platform. Ideally, we'd be able to run these integration tests in an isolated platform, maybe in production, or in some variation of production which is not a testing environment with test data, to make sure that we are testing our assumptions properly. We now have 4 million customers; that's a completely different order of magnitude from having a few thousand test cases in the staging environment. That's the piece of work that we are undertaking right now.

Participant 5: Just a quick question in terms of testing microservices. In terms of environments, you get your test pre-prod and prod. I'm more interested in terms of the data side and how you test around that and manage, considering GDPR, considering you're working with a lot of customer data. How do you manage? Or, what guidelines do you have or tips in terms of just testing lower database environments, if that makes sense?

Heath: In the case of testing in staging or pre-production, we want to create test cases that are realistic. They don't need to be real people; we have lots of things that generate people's names and randomize data, so we have relatively realistic test cases that we can generate, prefill into the database, and then test with. There's that aspect. As soon as you're in production, yes, you have real people's data. That is something that we take very seriously; we can't just run tests against people's accounts. One thing that we did when we ran crowdfunding a year and a half ago, which we ran through the same platform, which might sound a bit crazy: in order to know that you can support that, you need to be able to test things very accurately. One of the techniques we used there was shadow traffic. I think Facebook and a couple of other companies have tried this. Essentially, we updated our system so that it could take a request that was idempotent, like a read, pass it through, and then give the response back to the customer. This proxy could then sit there and repeat that request a randomized number of times.
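
Here's a hedged sketch of that shadow-traffic technique: serve the real response to the caller first, then asynchronously replay idempotent reads a random number of times against a target backend to generate realistic extra load. The target URL, handler, and repeat count are assumptions for illustration, not Monzo's actual proxy.

```go
package main

import (
	"io"
	"log"
	"math/rand"
	"net/http"
)

// shadow serves the real response first, then replays idempotent reads a
// random number of times against the target to generate extra load.
func shadow(target string, next http.HandlerFunc) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		// Answer the customer before doing anything else.
		next(w, r)

		// Only replay idempotent reads.
		if r.Method != http.MethodGet {
			return
		}
		repeats := rand.Intn(3) // zero to two extra copies of this request
		go func(path string, n int) {
			for i := 0; i < n; i++ {
				resp, err := http.Get(target + path)
				if err != nil {
					log.Printf("shadow request failed: %v", err)
					continue
				}
				io.Copy(io.Discard, resp.Body)
				resp.Body.Close()
			}
		}(r.URL.RequestURI(), repeats)
	}
}

func main() {
	http.HandleFunc("/balance", shadow("http://localhost:9090", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte(`{"pence":12345}`))
	}))
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```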


Recorded at: Sep 19, 2020
