A Guided Journey of Cloud-Native Featuring Monzo

Summary

Cheryl Hung and Matt Heath present the Cloud Native Computing Foundation, what it is and what it does, and how Monzo is using cloud-native.

Bio

Cheryl Hung is the Director of Ecosystem at the Cloud Native Computing Foundation. Her mission is to increase the adoption of Kubernetes and cloud native by fostering sustainable open source communities. She leads and advocates on behalf of cloud native end users, consulting and training companies. Matt Heath works as a Distributed Systems Engineer at Monzo.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Hung: I'm Cheryl [Hung]. I'm here from the CNCF. I'm presenting with Matt [Heath] from Monzo, who is over there, and together we're going to go on a journey through Cloud Native. Matt is going to talk to us about Monzo and their actual journey. I'll start by introducing what the Cloud Native Computing Foundation is, if you haven't heard of it. I'll talk you through an example of a Cloud Native architecture and then we'll do the real-life example from Monzo.

About me, I'm Director of Ecosystem at the Cloud Native Computing Foundation. My mission is to increase the adoption of Kubernetes and other Cloud Native technologies by advocating for end users. I founded and run the Cloud Native London Meetup group, so if you're not completely exhausted by QCon after the last three days, the Meetup is actually running tonight. It's the first Wednesday evening of every month, so feel free to join. Previously, I ran Product and DevOps at a startup called StorageOS - a storage startup. I was an engineer at Google for about five years and I have a Computer Science degree from Cambridge. I'm also on Twitter @oicheryl and these slides are also on my blog at oicheryl.com.

What Is the CNCF?

The mission of the CNCF is to make Cloud Native computing ubiquitous by fostering and sustaining an ecosystem of open source, vendor-neutral projects. In practice, this means doing the stuff that open source developers don't want to do: community management, hiring technical writers to improve documentation, hiring translators to translate it into different languages, holding and defending the legal trademarks like the Kubernetes trademark, doing marketing and PR - everything from printing stickers to running the social media accounts - and then running events like KubeCon + CloudNativeCon, which is the conference for Kubernetes. It's a nonprofit; it sits underneath the Linux Foundation and there are six full-time staff including myself, plus about 20 who are shared with the rest of the Linux Foundation - these would be events organizers, for example. The Linux Foundation as a whole is about 200 employees globally. We're a nonprofit, so we're funded by our members and membership. There are 350 companies or so which are members of the CNCF. I focus on the end user members and end user supporters on the bottom left-hand side. These are companies that are using Kubernetes but not necessarily selling Cloud Native services themselves. We have lots of logos, lots of companies. If you want to join, come and talk to me.

CNCF Structure

Externally, the CNCF is structured as three pillars. The governing board are the ones who decide on the strategy, marketing and budget. The technical oversight committee are nine architects who are elected from the community, and they're the ones who decide which projects are Cloud Native, high-quality and high-velocity and therefore should join the CNCF. The current chair of the technical oversight committee, the TOC, is Kelsey Hightower. The end user community is the one that I focus on. These are companies such as Adidas, who you wouldn't think of as being Kubernetes users, but my job is to make sure that they're successful and productive as they adopt Cloud Native and that their requirements go back into the projects and make them better.

CNCF Projects

I've implied this before, but the CNCF is most well-known for hosting Kubernetes and Prometheus, and a whole host of other projects as well. It's more than 30 projects now, and the TOC, those nine architects I mentioned before, are the ones who decide on the criteria that define what's a graduated project, what's incubating and what's sandbox. These are levels of maturity, and the TOC decides which projects should live at which level. For instance, to be a graduated project you have to have committers from at least two different organizations. This is to stop the situation where an open source project is backed by one company, and then when that company loses interest the project stops being maintained.

What Matters Is the Complexity

Let's pretend now that you've got a new idea for an application for a startup, and you want to build it in a Cloud Native way and try to use these open source projects. Where do you begin? Who recognizes this? SimCity. I played way too much SimCity when I was a kid, so I'm going to use SimCity as a metaphor for Cloud Native. In the SimCity games, you start with a very small number of houses and you try to grow into a large city, and as time goes on you add services, so you add police and hospitals and schools depending on what your residents need. There's no linear progression through the game, but there are things you tend to do in stages before others. Usually, you would add roads before you add a university, for example. Similarly, in Cloud Native, there's no fixed order for when you should do things. What matters is the complexity of what you're building and therefore the challenges that you face.

I'm going to use a four-stage model and use the number of services that you're running as a proxy for the complexity. A house has maybe 1 to 10 services, a village might have 10 to 100, a town might have 100 to 500 and then a city is more than 500 services. Once upon a time, you have a great idea and you are going to build a Tinder for Cheesecake app: people are going to swipe left and right on their favorite cheesecakes and you're going to deliver personalized cheesecake to their doorstep every month. This is brilliant and you're totally going to have the next billion-dollar unicorn startup here.

First off, you start with your very small demo and start playing around with some ideas. The first thing you quickly realize is that you need to manage your builds so that they are consistent, and you don't spend time having to deal with a lot of inconsistent builds of what you're working on. You start out by containerizing your services. Containerd and rkt are both container runtimes; containerd is the runtime behind Docker. You pick one of these and start running containers using it. You define your services using gRPC, so you have a consistent method of defining services, and you start building your little MVP Tinder for Cheesecake. It's going well, so you decide to add one or two people and start hiring engineers to join your team. Everybody now needs to share that code, so you set up your central source control and a continuous integration system so that you can build your container images directly from a shared repository. Everything is looking good. You decide that now is the time to soft launch this to your friends and family.
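
As a rough illustration, a minimal gRPC service in Go might look like the sketch below. The Recommendation service and the pb package are hypothetical stand-ins for code that protoc would generate from a .proto file; none of this comes from the talk.

```go
package main

import (
	"context"
	"log"
	"net"

	"google.golang.org/grpc"

	// Hypothetical package, standing in for protoc-generated code from recommendation.proto.
	pb "example.com/cheesecake/gen/recommendation"
)

// server implements the hypothetical Recommendation service interface.
type server struct {
	pb.UnimplementedRecommendationServer
}

// NextCheesecake returns the next cheesecake to show the user.
func (s *server) NextCheesecake(ctx context.Context, req *pb.NextRequest) (*pb.NextResponse, error) {
	return &pb.NextResponse{Name: "Basque burnt cheesecake"}, nil
}

func main() {
	lis, err := net.Listen("tcp", ":8080")
	if err != nil {
		log.Fatal(err)
	}
	s := grpc.NewServer()
	pb.RegisterRecommendationServer(s, &server{})
	log.Fatal(s.Serve(lis)) // serve gRPC requests until the process exits
}
```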

Once you're at the village stage, your number of services is growing, so you quickly realize that managing all of these containers manually is really time-consuming and tedious. You've only got a couple of engineers; you don't want to dedicate someone full time to watching services and restarting them when they fail, so you use Kubernetes to automate restarting your services as they run. You also realize that you need to package certain services together, so you can use Helm to package those.
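
A common way to make those automatic restarts useful is to expose a health endpoint that Kubernetes liveness and readiness probes can poll. A minimal sketch in Go follows; the path and port are arbitrary choices for illustration, not anything prescribed here.

```go
package main

import (
	"log"
	"net/http"
)

func main() {
	// A liveness/readiness probe in the pod spec can poll this endpoint;
	// if it stops returning 200, Kubernetes restarts the container.
	http.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
		w.Write([]byte("ok"))
	})
	log.Fatal(http.ListenAndServe(":8081", nil))
}
```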

Things are looking good. Your friends are saying, "Oh, your cheesecake is delicious. We're loving this service, but we are seeing some occasional errors. Can you tell us what's going on? Can you fix it?" You need to track down the source of those problems, and you can do so by using Prometheus to instrument your services and Fluentd to ship your logs out. This is going well. You decide that your friends and family are happy, you get ready for your public launch, put up your three nodes on a public cloud service and launch it.
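
Instrumenting a Go service with the Prometheus client library is fairly mechanical. The sketch below uses a hypothetical /recommend endpoint and an illustrative error counter, not metrics from the talk.

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// requestErrors counts failed requests per endpoint so error spikes show up on a dashboard.
var requestErrors = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "cheesecake_request_errors_total",
		Help: "Total number of failed requests, labelled by endpoint.",
	},
	[]string{"endpoint"},
)

func recommend() error { return nil } // placeholder business logic

func main() {
	prometheus.MustRegister(requestErrors)

	http.HandleFunc("/recommend", func(w http.ResponseWriter, r *http.Request) {
		if err := recommend(); err != nil {
			requestErrors.WithLabelValues("/recommend").Inc()
			http.Error(w, "something went wrong", http.StatusInternalServerError)
			return
		}
		w.Write([]byte("strawberry cheesecake"))
	})

	// Prometheus scrapes metrics from this endpoint.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```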

Now in your Tinder for Cheesecake app, you have a recommendation system to recommend the next cheesecake that's coming up. You have a payment system to take payments. You have a delivery system, because you have to marshal your drivers to actually deliver the cheesecake. It's getting increasingly difficult to figure out how these services are interacting and calling each other. You can use NATS to standardize the messaging between services, and Jaeger to actually trace the calls between services so that you can understand what's going on.
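
A rough sketch of what NATS publish/subscribe looks like from Go, using the official nats.go client; the subject name and services are illustrative only.

```go
package main

import (
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	// Connect to a local NATS server (nats://127.0.0.1:4222 by default).
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	// The delivery service subscribes to order events...
	if _, err := nc.Subscribe("orders.created", func(m *nats.Msg) {
		log.Printf("scheduling delivery for order: %s", string(m.Data))
	}); err != nil {
		log.Fatal(err)
	}

	// ...and the payment service publishes them once payment succeeds.
	if err := nc.Publish("orders.created", []byte("order-42")); err != nil {
		log.Fatal(err)
	}
	nc.Flush()

	time.Sleep(time.Second) // give the async handler a moment to run in this demo
}
```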

One day you're watching your metrics dashboard and you see a sudden spike like this. It turns out an influencer has featured your product. Now your usage is starting to rise and all these people are coming in to join. You're celebrating all these lovely new users, but you still need to be able to ship new features without disrupting your existing ones. With a combination of good test-driven development practices and a service mesh like Envoy or Linkerd, you can actually have continuous deployment as well. You rely on the service mesh to do canary deployments for your releases, health checks, load balancing and so on, and with that you actually launch your Tinder for Cheesecake app. Users are loving this, investors are loving this. You get a whole bunch more funding and now you need to build Tinder for Ice Cream and Tinder for Crème Brûlée and whatever.

Now you're really talking about managing things at scale. Up till now, you've just dumped all your data into a single database running on a single node, done some occasional backups and kind of hoped for the best. But users are complaining that accessing their accounts gets slower and slower, and you don't want a single point of failure in your system anyway. Vitess is a project that allows you to shard databases across multiple nodes, so now you also get some high availability out of it. Then you also need to start worrying about all these external bad actors who might affect your system and start recommending strawberry cheesecakes that nobody wants, or something. Your engineering team is also growing, because now you have quite a few products with engineering teams working on them. You can use Harbor, which is a container registry, to set up an internal registry so you no longer rely on an external one like Docker Hub. And you can use Notary to securely distribute software, so that you're not accidentally shipping something malicious or something that you didn't expect to.
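
Because Vitess fronts the sharded database with vtgate, which speaks the MySQL wire protocol, application code can keep using a standard SQL driver. A hedged sketch in Go, with placeholder address, credentials and schema:

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/go-sql-driver/mysql" // MySQL wire protocol driver
)

func main() {
	// vtgate speaks the MySQL wire protocol, so the application code barely
	// changes even though the data is sharded across many nodes.
	// The address, credentials and schema here are placeholders.
	db, err := sql.Open("mysql", "app:password@tcp(vtgate.internal:3306)/accounts")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	var count int
	if err := db.QueryRow("SELECT COUNT(*) FROM orders WHERE user_id = ?", 42).Scan(&count); err != nil {
		log.Fatal(err)
	}
	log.Printf("user 42 has %d orders", count)
}
```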

That's a very quick run through from 0 to 100, but everything is good. Your users are happy, your developers are happy because they can still ship stuff quickly, your SREs are happy because everything's running smoothly, and you skip off into the sunset surrounded by piles of creamy desserts; everything's great. A couple more projects that I didn't mention but want to: CoreDNS and etcd are both built into Kubernetes anyway. And then the bottom three projects are specification projects that you can look at if you're looking to build your own tracing system, networking or software distribution.

Summary

In summary, I would say don't feel like you have to do everything all at once. Focus on getting your basic software engineering principles down first before trying to jump on Kubernetes just because it's the cool, hot thing. Only add things when you face a challenge that requires them, and understand why you're doing it. Obviously, this is a bit of a simplified version, so Matt [Heath] is going to talk us through his experience.

Then KubeCon + CloudNativeCon. How many people have heard of KubeCon + CloudNativeCon? This is the biggest conference around Kubernetes. The European one last year was 4,000 people. This year in Barcelona we're expecting between 10,000 and 12,000. It's really grown, it's really big. It's in May in Barcelona, and if you're interested in learning more I highly recommend it. It's a really good place to be. With that I'm going to hand over to Matt [Heath] and then we're going to do joint questions at the end.

How Monzo Works

Heath: Hi, I'm Matt, I'm a backend engineer, I work at Monzo. If you have any questions we'll do some questions at the end; I'll be around all day, or you can get me on Twitter @mattheath, so please fire anything my way. As I mentioned, I work at Monzo. Monzo is a digital bank based in the UK. We have these neon-colored debit cards which stand out in lots of places. They're also UV reactive, which is great if you're somewhere dark. Our mission at Monzo is to make money work for everyone. Everyone has a terrible banking story where they've made some tiny mistake and been really penalized by loads of fees, or they weren't able to do something, their bank didn't work at some point in the night, they needed phone support at 2:00 a.m. and that just isn't a thing. Those annoyances are what we're trying to fix at Monzo. We're trying to provide the kind of banking experience you would expect if you created a bank in the 21st century, which, given that most banks in the UK are 300 years old, is maybe not that unfair.

A quick run through of how Monzo works. We have an app - when you use your card you get real-time notifications, and you can see where you actually spent rather than some obscure description on your bank statement. That gives you a bit more information about what you spent your money on. As I mentioned, we deliver sweet emoji in real time, so as you can see this one is Burger Bear, because you bought burgers at this Burger Bear place. Mostly you get one emoji. There's a lot of code that deals with emoji; we love emoji at Monzo.

Then we have more normal things: you can pay your friends really quickly, you can split bills, you can request money for restaurants, you can have shared tabs with your friends or housemates or if you go on holiday. We also do spending analysis so you can do budgeting. If you're paying somewhere that uses 3DS, rather than a weird, obscure web page from the '90s you can approve by push notification on your phone - mind-blowing. You can automate your finances with [inaudible 0:17:24], and one of my favorite features is that we provide a gambling block, which is something that a number of other companies are doing now as well. This helps people who identify as gamblers self-exclude, and provides a small amount of social friction, which really helps lots of people in that kind of situation.

A Little Bit of History

We actually started in February 2015, but we released our first cards in November 2015. At this point, we had 3,000 made. We really didn't think anybody would want a prepaid debit card. We needed to get some test data so we needed some cards, but we really didn't think that people would find it a compelling offering compared to a bank account, so we forced people to have these to start with. We had events, we gave the cards out, and eventually that started picking up and we ran out of cards. Then we had a beta program, again on our prepaid system, so we got a lot more cards made and our customer numbers grew. Then we moved over to current accounts, at which point things spiraled rapidly out of control. At this point, we're the fastest growing UK bank. We have 1.6 million customers, and I guess the real question is why am I here talking about Cloud Native? Why is that important to us as a bank? I think the real reason is that if you want to watch Netflix at 2:00 a.m., Netflix always works. Other services like that always work, and your bank should always work. If you can't actually use your cards, that's usually a lot more of a problem than not being able to watch the latest episode of something on TV.

We provide chat 24/7 every day of the year, because we think that's important, but equally our systems have to be up for that to work, and we think the best way to do that is by building with Cloud Native technologies. There's a quote from Adrian Cockcroft that I think I put in a talk five years ago, and it's still true: we're trying to build a highly available service that allows our company to move really quickly, on top of a load of broken cloud components. They're getting progressively more reliable, but this approach is what allows us to build more reliable systems on top of them. That, I think, sums up most of our ethos at Monzo.

How Does Monzo Operate?

Where are we today? We run on a combination of Amazon and Google Cloud Platform, and we actually have physical data centers. It turns out you can't put fiber into a cloud yet, and payment networks require that. We also run on Kubernetes. We use Docker, Calico, Cassandra, etcd, Prometheus, Jaeger, etc. Most of our platform is written in Go; of the services that we've written, I think 99% of them are written in Go. Here's a graph that is relatively similar to the other one: the number of services we have has also spiraled rapidly out of control. As our company size has increased, that means that we can quickly build new features and quickly build new things. I think Suhail [Patel] actually got this data for us, so thanks for that. Suhail [Patel] did a talk on Monday about our load testing in production, and we have I think 947 services in prod right now, and that number will only increase progressively faster.

Obviously, we didn't have all of that on day one. How do you build a bank? Does anyone know? I sure as hell didn't know. The natural conclusion is you've got to think about it a bit, you open your text editor and then you intensely panic while you have no idea what to type. For any large project, if you look at what you want to achieve and you're staring at a blank slate, it's like writer's block. You have no idea where you're going to start. In our case, we built a small prototype.

Starting with a House

We didn't have a banking license yet, we couldn't connect to any payment networks and we couldn't actually move money in the real world, so instead we built a house: a tiny prototype that allowed us to quickly test moving money between people. In our case, it was a really basic app. As you can see we didn't have names yet; we memorized everyone's three-digit account numbers. I don't know why we didn't sequentially assign them, but we randomly assigned them. At some point later, we added a name feature so you could have a name; that's pretty great.

On day one, that really meant that we started off with a GitHub repository; source control was the first thing we did - created an organization, created a repository. I think 65,000 commits later, we have a bank. These things happen. The next challenge we had was, if we build an app, where are we going to run the infrastructure? In our case, we wanted to run this in Amazon. It's the quickest thing to do; we didn't want to go and buy servers. That was the natural thing to do for everyone who was around at that point. The problem is, in 2015 you were not legally allowed to run a bank in the cloud - the FCA hadn't released their cloud guidance yet. This was a bit of a problem, so we decided that since we were building a prototype, we'd build it on Amazon anyway and deal with that later. We'd pretend it was okay for now. What did that mean? We had source control, we spun up a CI server, something like Jenkins, and we had some artisanally handcrafted servers on EC2 - the most basic thing.

We found a couple of servers, we wrote an app and we put it on the servers all manually, and that allowed us to test the thing. That's clearly not a good practice, I would say, but that's what we did on day one. We also started off with Go-based services; we were quite keen to build with microservices from the start because we knew that we'd have to abstract many different payment networks and things would get really complicated, and there were pros and cons to that. We had no idea of the problem we were solving, so actually the boundaries we defined were often a bit wrong. We defined those with protocol buffers, so a bit like gRPC, you can define what the interface is in a file. We used Cassandra as our database, which is not what you would expect: on day one we went with a distributed database, and that caused us a number of problems for the first year and a half, where we didn't have the scale requirements to really need Cassandra, but we had the data modeling problems because we were using a non-relational database, which slowed down our development. But the reason was that we had to get a banking license, which would take two years, and we were hoping we'd have a successful company, and none of us wanted to change the database underneath a running bank - which would be about now, basically. We avoided that problem but traded it off up front.
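
For reference, talking to Cassandra from Go typically looks like the sketch below, using the gocql driver; the keyspace, table and columns are invented for illustration and are not Monzo's schema.

```go
package main

import (
	"log"

	"github.com/gocql/gocql"
)

func main() {
	// Connect to a Cassandra cluster; the host and keyspace are placeholders.
	cluster := gocql.NewCluster("127.0.0.1")
	cluster.Keyspace = "bank"
	cluster.Consistency = gocql.Quorum

	session, err := cluster.CreateSession()
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()

	// Non-relational data modeling: rows are designed around query patterns,
	// e.g. transactions partitioned by account ID.
	if err := session.Query(
		`INSERT INTO transactions (account_id, id, amount_pence, description) VALUES (?, ?, ?, ?)`,
		"acc_123", gocql.TimeUUID(), -450, "Burger Bear",
	).Exec(); err != nil {
		log.Fatal(err)
	}
}
```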

Growing into a Village

A little bit later we got our real cards and started giving those out to people. We actually moved some money in the real world; as you can see, the app was still called "Bank" at this point. Now we're heading into the next level of complexity, the village. In our case, it really was a village. There were a load of services that sat on some servers we manually configured, which wasn't great. At this point, we needed to introduce some method to deploy things easily. We needed an orchestration system, or some way to put the apps on the servers. In our case, this being back in 2015, we picked Mesos and Marathon, because Mesos was battle-tested at a number of large companies, so we thought we'd go with that one. We were using, I think, a script in Jenkins to do the deployments. We also had a load of RPC calls between all of these services, and in order to add new functionality without changing a critical system, we added asynchronous messaging, using NSQ at the time, which allowed us to do Pub/Sub between additional services.
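
A sketch of what NSQ-based Pub/Sub looks like from Go with the go-nsq client; the topic, channel and addresses are placeholders, not Monzo's actual setup.

```go
package main

import (
	"log"

	"github.com/nsqio/go-nsq"
)

func main() {
	cfg := nsq.NewConfig()

	// A downstream service consumes "transaction.created" events without the
	// core transaction service having to know it exists.
	consumer, err := nsq.NewConsumer("transaction.created", "notifications", cfg)
	if err != nil {
		log.Fatal(err)
	}
	consumer.AddHandler(nsq.HandlerFunc(func(m *nsq.Message) error {
		log.Printf("sending push notification for: %s", string(m.Body))
		return nil // returning nil marks the message as finished
	}))
	if err := consumer.ConnectToNSQD("127.0.0.1:4150"); err != nil {
		log.Fatal(err)
	}

	// The transaction service just publishes the event and carries on.
	producer, err := nsq.NewProducer("127.0.0.1:4150", cfg)
	if err != nil {
		log.Fatal(err)
	}
	if err := producer.Publish("transaction.created", []byte("tx_789")); err != nil {
		log.Fatal(err)
	}

	select {} // block forever so the consumer keeps running in this demo
}
```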

Now that real people were using our products, real money was moving in the world and people needed access to their money, so when errors happened it was a problem, so we introduced error tracking as well. I think we were using Sentry at the time, as a hosted service. We also introduced metrics at this point, and this graph hopefully demonstrates why. This is one from our Cassandra clients showing that, for some reason, one of the nodes is not doing as many queries as the other ones. Before we had metrics we were basically just guessing; it was like whack-a-mole. We had no idea what was going on, but by adding a metrics system we could see that something's gone wrong here, and a bit later something's gone wrong over here. It turns out that these were due to network saturation on the nodes while they were running backups. While that was happening, the queries on that node were slower because the network was saturated, but without graphs of that we had no way to identify it.

At this point, we're in late 2015. We've added a messaging system, we've got deployment, we've got orchestration, we've added error tracking and metrics, and in our case we also needed distributed locking. We use Cassandra as a database, and people who have seen it before may know that it's not ACID-compliant; it doesn't have transactions. There are occasions in a financial system where you might need some method of consensus - fewer than you might imagine, but some. In our case, we introduced etcd. We used it in a couple of places and built up experience with it, and when we moved to Kubernetes, that meant that we had experience with one of its core components.
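
For illustration, a distributed lock on top of etcd can be taken with the client's concurrency helpers; the endpoint and lock key below are placeholders.

```go
package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
	"go.etcd.io/etcd/client/v3/concurrency"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"127.0.0.1:2379"}, // placeholder etcd endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// Sessions keep the lock alive via a lease; if this process dies,
	// the lock is released automatically when the lease expires.
	sess, err := concurrency.NewSession(cli)
	if err != nil {
		log.Fatal(err)
	}
	defer sess.Close()

	mu := concurrency.NewMutex(sess, "/locks/account/acc_123")
	ctx := context.Background()

	if err := mu.Lock(ctx); err != nil {
		log.Fatal(err)
	}
	// ...do the small piece of work that genuinely needs consensus...
	if err := mu.Unlock(ctx); err != nil {
		log.Fatal(err)
	}
}
```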

Reliability and Scalability

A little bit later, moving on, we were building up to a bank: we built lots of the additional banking services and we started to launch our current account. At this point, reliability and scalability were our primary concerns. People were depending on our product now. Some people had gone "full Monzo", so they'd moved their entire life into our product, and that meant that if it didn't work, people couldn't pay their bills, which would affect their credit record. There were all sorts of really terrible things that could happen, so reliability was our primary concern.

We were also growing really quickly. I think at this point we grew at 5% every week for 10 months, which is soul-crushing growth; that is hard to deal with on both the technical and the cultural side. Hiring people that quickly is really hard. These are the things we were dealing with at that point, and we took a step back and wanted to evaluate the technologies we'd used before we got a lot further down the road and it became much harder to change them.

At this point we wrote a blog post on this, I think in 2016, and we looked at the things we were doing. We were still running Go binaries natively on the machines, so we containerized those. We moved to Kubernetes because we saw that that was where the mindshare of the industry was going and the development of that ecosystem was significantly higher. It also had the pods model, which was something that Mesos and Marathon didn't provide at the time. We switched to using a service mesh - we used Linkerd, so we ran that [inaudible 00:29:19] machine - and we also started to introduce Kafka as a resilient message queue. We were using NSQ before, which is really good for super high throughput messages, but it doesn't guarantee delivery, whereas Kafka has tunable consistency guarantees.
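
Those tunable guarantees show up directly in producer configuration. A sketch in Go using the sarama client, with a placeholder broker address and topic; waiting for all in-sync replicas to acknowledge is one of several possible settings.

```go
package main

import (
	"log"

	"github.com/Shopify/sarama"
)

func main() {
	cfg := sarama.NewConfig()
	// "Tunable" guarantees: wait for all in-sync replicas to acknowledge the
	// write and retry on transient failures, trading latency for durability.
	cfg.Producer.RequiredAcks = sarama.WaitForAll
	cfg.Producer.Retry.Max = 5
	cfg.Producer.Return.Successes = true // required for SyncProducer

	producer, err := sarama.NewSyncProducer([]string{"127.0.0.1:9092"}, cfg)
	if err != nil {
		log.Fatal(err)
	}
	defer producer.Close()

	partition, offset, err := producer.SendMessage(&sarama.ProducerMessage{
		Topic: "transaction.created",
		Value: sarama.StringEncoder("tx_789"),
	})
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("stored on partition %d at offset %d", partition, offset)
}
```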

At this point, we have a similar system to what we had before. We have an orchestration platform, Kubernetes, and we can run our services on it. If we deploy something, Kubernetes will manage it across failure zones, so we make sure we don't run all of the things on the same server, which is great. If one of the services fails it will handle that; if a server fails it will handle that.

The other thing we gained at this point, by moving to containers, is that we can limit the CPU, so we can throttle things and make sure that works. We can also now run a variety of other software on top of our platform, and that meant that our costs dropped massively. We had a load of servers that were running databases and things that were idle, and we could move those onto our platform, and we saved loads of money by packing densely and by moving everything onto it. This provides us with a standardized platform that we use for basically everything now. We've moved from what we had before to using Docker, Kubernetes, Linkerd, and Kafka.

Onwards to a City

Then the next stage is onwards and upwards, really. Things are getting really complex at this point. We have hundreds of services and our team has grown a lot. It's very hard for people to understand how things work, and that means you end up with people who've been around the company for a long time and have an instinctual feeling for, "Hmm, this graph looks a bit weird, therefore this extremely complicated thing has happened," while anyone who's just joined the company has no ability to do that. Tools like Jaeger have allowed us to debug things like this. It's interesting that we only installed Jaeger and got it running about six months ago. We'd made sure in our code that we passed through the context, but we'd not actually deployed a tracing system until that point.
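
Passing the context through is what makes adopting tracing later cheap: once a tracer is registered, spans attach to whatever is already in the context. A small Go sketch using the OpenTracing API (which Jaeger implements); the operation and tag names are made up.

```go
package main

import (
	"context"
	"log"

	opentracing "github.com/opentracing/opentracing-go"
)

// chargeCard starts a child span from whatever span is already in the context,
// so the whole request shows up as one trace in a backend like Jaeger.
func chargeCard(ctx context.Context, txID string) {
	span, ctx := opentracing.StartSpanFromContext(ctx, "charge-card")
	defer span.Finish()
	span.SetTag("transaction_id", txID)

	// Downstream calls receive ctx, so their spans attach to the same trace.
	_ = ctx
	log.Printf("charging card for %s", txID)
}

func main() {
	// With no tracer configured, OpenTracing falls back to a no-op tracer, so
	// this runs unchanged; wiring up a Jaeger tracer and calling
	// opentracing.SetGlobalTracer(...) is what actually exports the spans.
	span, ctx := opentracing.StartSpanFromContext(context.Background(), "handle-payment")
	defer span.Finish()

	chargeCard(ctx, "tx_789")
}
```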

As I've been saying, you can actually get that far without what you would assume is a critical tool. We also switched our metrics over to Prometheus; we were having some problems with the open source version of InfluxDB because it runs on one machine, so we switched over to Prometheus, again because we found the industry was moving that way. For security, we could do network policy enforcement with Calico, and by moving our service mesh into the pods we also got that additional security.

What’s Next?

What's next? The main problems we have now are around managing complexity. We have some technical challenges scaling if we're going multi-region; I imagine multi-region consensus is a relatively hard problem if you're a bank. And then the other aspects are more cultural, like how do we instill the values and provide tools that people can use, so that they understand the impact they'll have and how to effectively and quickly build new functionality for customers.

We formalized some of the cultural things in our engineering principles, which we wrote a blog post about, and this defines how we think about developing software. That's pretty much it. The main problems we have are managing organizational complexity, and by using a load of open source Cloud Native tools we don't have to build them ourselves. We can benefit from the work of everyone else, and the Cloud Native Computing Foundation provides most of those. Thank you very much.

Questions and Answers

Participant 1: DevOps, normally banks and DevOps don't go very well together. How have you tackled that problem?

Heath: What would you see DevOps as meaning in that context?

Participant 1: Segregation of duties tends to be pretty high on the agenda.

Heath: That's an interesting question. I feel segregation of duties is one way of solving a problem. The problem you're trying to stop is one person being able to ultimately steal everyone's money or do other kinds of nefarious things; you also don't want specific people to be able to collude. Our security team has access to certain things, and for certain things people are assigned left and right sides. But those are a relatively small number of tasks, and only, to be honest, because we're interacting with other financial networks and that's their model. In our model we have multi-party authentication for various things. You can submit an RPC call through our command line to do something sensitive. Most sensitive things are blocked, but for certain things that you need to be able to do, that will actually open a task in our web tool that someone can go and approve, and then you just run the command again and it will execute. That allows us to limit access to specific things without strict segregation of duties. We really believe that the more information and the more context you have, the better you can do your job, so if I only know about one bit, that's a problem.
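
Purely as a sketch of that flow, and not Monzo's actual tooling: a sensitive command can be wrapped so that the first invocation opens an approval task and only an approved re-run executes. Everything below is hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical sketch of a multi-party authorization flow: a sensitive command
// first opens an approval task, and only executes once someone else approves it.

type ApprovalStore interface {
	IsApproved(command, args string) bool
	OpenTask(command, args string)
}

var ErrPendingApproval = errors.New("approval task opened; re-run once a colleague approves")

// RunSensitive executes run only if the command has already been approved.
func RunSensitive(store ApprovalStore, command, args string, run func() error) error {
	if !store.IsApproved(command, args) {
		store.OpenTask(command, args)
		return ErrPendingApproval
	}
	return run()
}

// memoryStore is a toy in-memory implementation for this example.
type memoryStore struct{ approved map[string]bool }

func (m *memoryStore) key(c, a string) string      { return c + "|" + a }
func (m *memoryStore) IsApproved(c, a string) bool { return m.approved[m.key(c, a)] }
func (m *memoryStore) OpenTask(c, a string)        { fmt.Println("task opened for:", c, a) }

func main() {
	store := &memoryStore{approved: map[string]bool{}}
	refund := func() error { fmt.Println("refund issued"); return nil }

	// First attempt: blocked, opens a task for someone else to approve.
	fmt.Println(RunSensitive(store, "issue-refund", "acc_123", refund))

	// After approval (simulated here), the same command executes.
	store.approved[store.key("issue-refund", "acc_123")] = true
	fmt.Println(RunSensitive(store, "issue-refund", "acc_123", refund))
}
```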

Participant 2: For Kubernetes, you said that you got a lot of benefit from packing density. Did you go straight to one big Kubernetes cluster, or did you start with smaller ones and then decide to move them together?

Heath: I can answer in Monzo's case. In our case, yes, we have one giant cluster, and maybe we might want to split that between availability zones, but we went from that setup to a single HA cluster. In 2016 I don't think there was an officially supported way to do a high availability master, so that was an interesting learning experience. Now there are tools that do that for you, but yes, we have a cluster per environment effectively, and when we expand to multiple regions I imagine it'll be a cluster per region.

Hung: Generally, I work officially with about 80-ish companies, so I see a lot of different configurations, and actually when I was at Google I was using Borg, which was the predecessor to Kubernetes. For Google, efficiency is everything, so it's giant clusters, bin-packed as much as you can. But I've seen some very strange setups, like a financial company, which I'm not going to name, where they run one application per cluster because that's how they manage their isolation of services. I don't know if I would recommend it from the efficiency point of view.

Participant 3: You showed this great slide of a text editor, trying to figure out how in the world you were going to build a bank, and the challenge of trying to figure out your business problem at the same time that you might be trying to figure out all these other new technologies. You took the path of "I know all about EC2, I can spin these things up manually." Part of the argument here, I think, is that there are all these Cloud Native solutions that I could use to bootstrap, but there's a learning curve to that. If you had to do this again today, what path would you take?

Heath: If I was doing it again today, there are so many options that just didn't exist two or three years ago. I don't think we'd run our own Kubernetes clusters. There are a certain number of pretty interesting network things we have to do which might require us to run a small number of nodes, but we would probably push most of that into a hosted option, either on Google or Amazon. If we were starting now - it took three of us three months to do that re-platforming work, and that is three months that you could spend doing things for your business, which would be even better.

Participant 4: Given there's been a bit of a theme through last year's and this year's conference as well, would serverless and maybe Function as a Service platforms be part of the options you'd consider if you were starting afresh today on a public cloud provider?

Heath: Yes, potentially. I don't have a huge amount of experience with it, and I know the development flow has improved significantly over the last year or so, so I think that would now be relatively straightforward for us to do. Most of our services are essentially a load of functions that are just grouped because they have some common functionality, but pretty much our entire platform is based on an interface where a request comes in and a response goes out, and we could model that entirely in a serverless way. That would require you to spin up no infrastructure and would be much quicker. I think if those things had existed when we were doing this, we definitely would have used them.

Participant 5: In your giant cluster, has a noisy neighbor ever been a problem for you?

Heath: I think that must be a generalizable question.

Hung: Well, I don't know. When you said "giant cluster" you were specifically asking about Monzo.

Heath: We've definitely had problems with that, but it must be a problem that other people see as well.

Hung: Yes. There's no one solution to it, right? You have to implement a whole bunch of different things to try and tackle it in different ways.

Heath: Kubernetes allows you to constrain CPU and memory primarily, and I think some other resources. We've had cases where we deployed our Telegraf thing for our metrics stack; it was just in the cluster and it was doing hundreds of megabytes a second of UDP traffic, which just destroyed the performance of anything else on the machine. Also, at the time we switched to Calico, which was pretty great, we moved away from Flannel, which we'd been using in production for two years. When we started using Flannel, the AWS setup didn't work, and we tested that in production for a while, so we basically ran everything in UDP userland mode for two years, which ate loads of CPU on all our machines. That was a noisy neighbor, and then our metrics system and other things definitely had problems. I think disk I/O was actually one of the problems we saw. Most of our services don't write to disk until they have a problem, and then they write infinitely to disk. That is a problem that has happened multiple times; now you can tune that - we run on CoreOS - so you can generally tune it to just throw stuff away once you hit a limit there.

Participant 6: The general guidance given for any greenfield project is to not start with microservices but to build a monolith and split it up later, because you don't know the contexts, especially since you were quite new to the banking domain as well. What was the journey like? I'm curious to know how you start with microservices for a greenfield project.

Heath: I'd say if you don't know what the problem is, if you're building a new company or trying to solve a business problem, ideally you'd spend as little time as possible on the technology and get to a point where you have a product in people's hands to test your business model. In our case, we had a slightly weird situation with getting a banking license: we optimistically thought it might take a year, but it took two, obviously - classic engineering estimate there. That meant that we had some amount of time. We weren't originally planning to launch a prepaid card, we had no plans to do that; we were going to build a banking system, so we had a small amount of time to deal with that. But because we didn't understand the business, we built a load of services that we've had to refactor interfaces for, and actually we have a couple where we just have to live with them. We'll probably replace them at some point this year, but we've lived with some not-great abstractions for four years. I'd say generally, building a prototype as a monolith and shipping it and getting customers is probably a better idea. I actually don't really know how to do that anymore; I've been writing Go services for about six years, so I'm probably not the best person to ask.

Participant 7: My question is a bit related to what you just said. In a world of so many microservices - I think you said you have up to 900-plus now - how do you go from a user journey, when you are trying to create a new feature, to managing the number of microservices that you have to touch, so basically clear boundaries? Then, if it boils down to the developer, how do you pass that across several squads? Do you potentially get slower because of that, or how do you manage it?

Heath: So just to restate, that's: how do you take a new feature and work out which services are getting touched, or how you would decide to create new ones?

Hung: I would say I don't think that there's a formal, standardized way to do it. It's a bit of an art, deciding what the right level of abstraction is and what things you should put inside a service versus splitting into separate services.

Heath: Yes, I'd say so. We have lots of services that do lots of things already, so I think the first thing, if I'm going to build this feature, is there some other generalized thing over here that I can reuse parts of. One thing we have seen problems with is where we've prematurely tried to abstract lots of functionality into common services. That has actually caused a lot more problems at Monzo than just copying and pasting some code and writing another service. I think there's no hard and fast rule. It is pretty much “We are going to build this thing. How would I model the data into this object, that object, that object? What functionality do I need for that kind of thing?”

Participant 8: Just as a follow-up to that: with 900 services and fewer than 900 developers working on those 900 services, how do you know what you've got?

Heath: That's a really good question. That's actually probably one of our main problems. We have, I'm not actually sure, let's say roughly 150 engineers and 950 services - yes, it's hard to know. And we have them in one repo, so you open that in your text editor and there are 970 folders, and how do you know which ones are which? There are a lot of core ones, and there are core abstractions we provide: we only have one ledger, we have a transaction model, various things like that. Then certain teams have established patterns for how they might build a payment scheme that can deal with authorizations and presentments from Mastercard and take settlement files and do all the reconciliation and reporting and various other things. It's pretty much asking around to see if there's a pattern you can use - communication, really.

Hung: I don't actually know how many services Google has. Easily in the tens of thousands, hundreds of thousands, I don't really know. You just abandon the idea that you can understand the whole thing. You literally just give up; you're like, "If I need to do something, I will understand it from this point to this point, and everything else I will ignore." That's the way it works.

Participant 9: As you're a bank, do you need to introduce layers of encryption that you manage yourselves between you and cloud providers, with storage for example? And if yes, is it a pain for you, having to manage your own layer of encryption between you and the cloud provider?

Heath: We have lots of different types of encryption; it really depends on the use case. As an example, database backups we encrypt on the nodes, and then we also use encryption at rest on our third-party provider, so they're encrypted that way. We have lots of zero-visibility data at Monzo, things that we are not allowed to see, that no human should ever see. If we do see it, then that's a problem, because if I see someone's card number, we need to reissue that card, and we can do that nicely for the customer. But that means if I logged a load of unencrypted traffic and then looked at it, we might have to replace a thousand people's cards, which would be very problematic for those thousand people, less so for us. The way that we deal with that is tokenization, for example. Any log lines, I think, we encrypt with a key that's not available online; it's an offline key, so we can do asymmetric encryption on that, and the times that we would actually need to go and look at that are extremely slim. There are so many different encryption cases: we'll have TLS, and if we're using S3, for example, we can put encrypted objects in S3.
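
One way to get that property is sealed-box style asymmetric encryption, where services hold only a public key and the private key stays offline. A hedged sketch in Go using NaCl boxes; this is illustrative, not Monzo's actual scheme.

```go
package main

import (
	"crypto/rand"
	"fmt"
	"log"

	"golang.org/x/crypto/nacl/box"
)

func main() {
	// In a setup like the one described, the private key would be generated
	// once and kept offline; only the public key ships with the services.
	// Generating both here is just so the example is self-contained.
	offlinePub, offlinePriv, err := box.GenerateKey(rand.Reader)
	if err != nil {
		log.Fatal(err)
	}

	// Services can encrypt sensitive log lines with the public key alone...
	line := []byte("card_number=4111111111111111 declined")
	sealed, err := box.SealAnonymous(nil, line, offlinePub, rand.Reader)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("stored %d encrypted bytes\n", len(sealed))

	// ...and only someone with access to the offline private key can read them,
	// in the rare case an investigation requires it.
	plain, ok := box.OpenAnonymous(nil, sealed, offlinePub, offlinePriv)
	if !ok {
		log.Fatal("decryption failed")
	}
	fmt.Println(string(plain))
}
```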

Participant 9: Cheryl, presumably how you encrypt traffic between services is a bit of a puzzle. How do you solve that? Are there things in the projects that the CNCF curates that can help in this area?

Hung: I think this might be a bit of a hole in the projects because, even with 30 projects, they don't actually cover everything, so I don't think that there's an obvious solution for it right now.

Participant 9: That's if the stuff is in your cluster, but we're talking here about a case where we are going outside of a Kubernetes cluster into a different world.

Hung: Yes, you're kind of on your own at the moment.

Participant 10: You mentioned that in 2015 you couldn't run a bank in the cloud; you weren't allowed to by the FCA. What's the current situation? What changed with that?

Heath: We started building Monzo in February 2015. In June 2015 the FCA released their draft cloud guidance, which provided some guidelines on how public or private clouds could be used by financial institutions, and I think that was firmed up at the beginning of 2016. At that point I think the FCA were running quite a lot of their own things on either Amazon or some other provider, but before that there was no guidance. If you're in a regulated world and you have to do audits on all of your systems, for example, you'll need to work with auditors who understand cloud systems, because some of the payment scheme requirements we've had are along the lines of "In your network diagram, please point to the physical firewall that will do this particular job." I don't know, it's in Dublin somewhere. You work with them to demonstrate that security groups, network ACLs and egress policies - or, if you're doing it with policy enforcement in Kubernetes, Calico - provide the same level of security and the same level of access control over changing them. Those are the things you have to provide to the auditors. I think 2015-2016 was a really interesting turning point. Using a cloud is totally fine now.

 


 

Recorded at: Jul 12, 2019
