Building a Reliable Cloud-Based Bank in Java

Summary

Jason Maude talks about the experience of Starling Bank, a mobile-only, cloud-based bank that launched in the UK in 2017. He looks at the system architecture of the bank, the design principles that give them the ability to release quickly and reliably, and why they decided to build the back end using Java.

Bio

Jason Maude has over a decade of experience working in the financial sector, primarily in creating and delivering software. He is passionate about creating teams and explaining complex technical concepts to those who are convinced that they won't be able to understand them. He currently works at Starling Bank as one of their lead engineers and host of the Starling podcast.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Maude: My name is Jason Maude and I am a senior engineer at Starling Bank. What I want to talk to you about today is Starling Bank's journey, how we built Starling Bank, and the architecture behind it. I particularly want to address a point that was raised in the very first keynote of this conference: the idea that this trade-off between reliability on the one hand, and speed of delivery on the other, is a false dichotomy. It's a false choice. The top companies, the top players in this game are not making that choice, they are going for both. They're trying to improve both their reliability and their speed of delivery. I want to talk through how you can do that, what sort of architecture you can implement in order to achieve that goal. I also want to explain why we chose Java at Starling Bank as our language of choice to do this, and why that decision helped us implement the architecture that we wanted.

The Problem with Banking

So first up, I'm going to talk about banking. What is the problem with banking? Banking IT has proven itself somewhat resistant to changes in the software industry. Does anybody work in banking here or financial services? Is that fair to say, do you think? Yes, it's slightly slower, slightly more conservative? This is a quote from Mark Zuckerberg, and it was for a time the unofficial motto of Facebook, before they changed it to something much less catchy. The idea behind it was not, "Let's introduce bugs. Let's have as many bugs as possible." The idea behind it was that sometimes you have to develop and ship code fast. All the time you have to develop and ship code fast. The reason behind that is that you want to learn, you want to learn quickly.

Sometimes learning is achieved through failure and that failure means that you have to cope with bugs, bugs get introduced. So you break things. You learn through that. And that learning helps you deliver better software. It allows you to deliver software that your customers really want. This sort of epitomizes the culture we're moving towards, or have moved towards. This sort of fail fast, learn stuff, continuous delivery culture. Brilliant. So, why don’t banking and financial services, why don't they fully embrace that culture?

Here's a quote from Harry Potter, and Gringotts is the bank in Harry Potter. This quote epitomizes the essence, the core of banking. Banking is about trust. If I go and put my money in the bank, I want to know that it's going to be there. I want to know that it's not going to be stolen by any child wizards, and I want to know that I can reliably go and get that money when I need to and take it out. I need to be able to reliably know that I can make a payment and that payment will go out to the other bank that I want to send the money to, and that payment will happen once, it won't happen twice. I won't drain my account by sending out all my money, and I won't not send out the payment and then get charged late fees or have people shouting at me down the phone because I haven't paid them.

Reliability and security and overall trust are big things in banking. This leads to a conservative culture. People don't really want to change in case they break something, in case one of these payments doesn't go out, in case they lose their customers' trust and people withdraw their money and then the bank collapses. This conservatism leaks into the software engineering practices that many banks have. It means that for them, continuous delivery means once every three months and even then, "Whoa, slow down cowboy. You're being a bit fast." So this inherent conservatism, this inherent, "Let's make sure that we release slowly, let's make sure we don't break anything," that is often the experience of many people working with software in banks or other financial institutions.

But at least it means that banks never suffer IT problems. I was worried that one was going to be a bit too dry, so I'm glad that you caught onto that. Let's delve into one of the biggest problems that a bank has ever seen. This is a tale of banking woe from the United Kingdom. So it's the 18th of June and a group of engineers are preparing to release an update to a batch job scheduler. This is a thing that takes jobs and schedules them. It says, "Right, now run this one, run this one, run this one."

They prepared the release, they've tested the code, they've got all the sign off from 20 different layers of management, they've got their backup plan in place, and they release the code and they monitor it. They're following good practices here, they're not just trusting that everything will go right. They've released the code, they're monitoring the code. They monitor it. It's not going so well or it's a bit too slow, there are too many failures. The error rate is slightly too high.

So what do they do? Next day, the 19th, they decide, "Let's roll back. Let's move the changes back to the previous version." Great. And they roll back the code successfully. The database, however, is another story. They don't manage to roll back the database successfully. So now, when the batch job starts for the 19th, you have the database at one version and the code at another, and these two versions aren't compatible.

The batch job scheduler starts running jobs. Those jobs fail because the database is in the wrong place, and the batch job scheduler doesn't care. It just keeps going and keeps chucking out jobs and says, "Run this one, run this one." Some of those jobs, it turns out, were dependent on each other. One needed to complete successfully before the next one could run successfully. But merrily the system plowed along, churning up the data and causing no end of chaos. And so, it comes to the 20th of June and the system is in a nightmarish state. People's mortgages have been calculated wrong, no one can get into their banking IT systems through the online web portal, customers are phoning up and shouting down the phone. It's all a big nightmare.

The engineers have to desperately run around and try and reset everything and put everything back in place. But it takes them almost a month to correct the problems, during which time everything is in this chaotic state. So, if you move fast like Mark Zuckerberg, you'll break things, and if you move slow like this bank or many other banks, you break things.

If you develop, you break things. You will break things if you develop code. And once you accept this fact, and once crucially you put this as part of your design philosophy and move it into your code, then you can start to eliminate this speed versus reliability dichotomy.

Who Are Starling Bank?

I'm now going to give you a brief introduction to who Starling Bank are. We describe ourselves as a tech start-up with a banking license. We're a mobile-only bank in the UK, and by mobile-only bank, what we mean is that you can only access Starling Bank through your phone. You download an app onto your iPhone or Android device, you apply for a bank account through the app submitting all the data that you need, and then you get access to all your banking information through the app.

We've got all of the standard features that you would associate with a bank account. You have a debit card, you have the ability to pay money from your bank to other banks. We also have some features that we think are quite innovative, quite new: fine-grained card control that allows you to turn your card off for particular interactions such as online payments; aggregated spending data so that you can see how much you spent at a particular merchant or in a particular category every month; and the ability to automatically provision your card from your app into Google Pay or Apple Pay or any other virtual wallet.

We built this bank in a year. We started building in earnest in July, 2016, and by May, 2017, we were ready to launch ourselves publicly. We launched the apps and the public started downloading them and they liked them so much that come March 2018, we were awarded the best British Bank award at the British Bank Awards. The question is then how have we managed to deliver so quickly and deliver all of these features, not only the existing ones that banks have, but all these new ones, while at the same time maintaining this reliability that customers demand? Now I'm going to talk about how we have built our architecture to be reliable.

Self-Contained Systems

We work off the principle of self-contained systems, and you can read more about the design philosophy of self-contained systems at the web address there. Now, we don't use all of the design philosophy there, but we use quite a lot of it, we base quite a lot of our design thinking on that. Self-contained systems are systems that are designed to handle one type of job. So in the context of a bank, those jobs might be, say, to handle cards and card payments, or to handle sending money to another bank, or to maintain a list of the customer's transactions, or to send notifications out to other people.

Now, these self-contained systems, I wouldn't really describe them as microservices, I think they're too big. We can have the how-big-is-a-microservice argument later. I would more describe them as microliths. They have their own database, which for us is running in an RDS instance in the cloud. They have their own logic and they have their own APIs, and their APIs can be called from other self-contained systems or from the outside world.

Each of these self-contained systems for us runs in an EC2 instance in AWS and we can have multiple of these spun up at any particular time. So if there's a particularly large number of card payments coming through, for example, we can spin up three, four, five, six, however many instances we need of that particular self-contained system. We have at least two of each running at all times for redundancy purposes. They each have their database they can connect to, and these self-contained systems can be accessed by each other via their APIs. They can also be accessed by the mobile apps and they can be accessed by our browser-based management portal, which is the management system written for the bank managers so that they can manage the bank. There are no startup dependencies between these things, and crucially there are no distributed transactions either. So, there's nothing to link these things together or couple them together. All of these different services are running independently of one another.

Recovery in Distributed Architectures

So we have a cloud-based modern scalable architecture. How do we make it reliable? Recovery in distributed systems. The problem that the people running the batch scheduler found is that if you have a load of distributed systems talking to one another, and one of them happens to go wrong, then that problem can spread to the next one. The batch job scheduler can sit there and say, "I'd like to run this job please." And the second system will go, "Fine, yes, I will run that job." And it will create a data problem. It'll process things incorrectly, store them in the database incorrectly, so it will create a problem. And then that problem could spread to a third system, or a fourth system.

In fact, the bug is very much like a virus. It can spread from system to system and you can find yourself in a situation where you have corrupted data all over the place. You have commands that you don't really want to run all over the place. Fundamentally, this is happening because the systems, all of these different services you're running, are too trusting. They accept without question that the commands they are being given are good to run. “Yes, sure. I'll run that for you. No problem.” I'm obviously in the correct state, otherwise, you wouldn't have asked me to run that.

But as we know, bugs can happen, things can break. So, the question then is how do we stop this wildfire or virus-like spread of a problem from one system to the next system in a scenario that becomes very difficult to unwind? We have invented an architectural philosophy around this, which we call LOASCTTDITTEO, or Lots Of Autonomous Services Continually Trying To Do Idempotent Things To Each Other. We like this because it just rolls off the tongue. No, so it's obviously a bit too long, this one. We showed this to Adrian Cockcroft from AWS and he shortened it for us down to DITTO architecture, Do Idempotent Things To Others.

Who here is aware of what idempotency means, what the concept is? That's good. For the benefit of those of you who don't know, I'll explain idempotency at least in the context that we're using it. Idempotency means that if you try and run a command twice, the exact same command twice, the outcome is the same as if you ran it once. And indeed, if you run it three times or four times or n times, no matter how many times you run this same specific command, the outcome will be the same as if you ran it once. This is a very important concept when it comes to building reliable software, especially in banking.
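To make the definition concrete, here is a minimal sketch in Java. The `CardControls` class and its method names are hypothetical, invented for illustration rather than taken from Starling's code. It contrasts a command whose outcome does not depend on how many times it runs with one whose outcome does.

```java
/** Hypothetical card-control model; the names are illustrative only. */
class CardControls {
    private boolean onlinePaymentsEnabled = true;

    /** Idempotent: however many times it runs, online payments end up disabled. */
    void disableOnlinePayments() {
        onlinePaymentsEnabled = false;
    }

    /** Not idempotent: the outcome depends on how many times it runs. */
    void toggleOnlinePayments() {
        onlinePaymentsEnabled = !onlinePaymentsEnabled;
    }

    boolean onlinePaymentsEnabled() {
        return onlinePaymentsEnabled;
    }
}
```

A retry of `disableOnlinePayments()` is harmless, whereas a retry of `toggleOnlinePayments()` silently undoes the first call. That difference is why a system that retries needs its commands to be of the first kind.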

DITTO Architecture

DITTO architecture and its principles. The top core principle of DITTO architecture is this idea that everything you do, every command you run, every work item you process should be run at least once and at most once. Now, why didn't I just say once rather than extending it out? The reason is that saying at least once and at most once gives you the two problems that you must tackle when trying to work out how to make a reliable system. When you are trying to run a command, you've got to make sure that it actually happens. So you've got to retry that command until you're sure that it has been a success. But you don't want to retry it and have it happen again and again and again and again. It must be idempotent. You must be able to make sure that when you ask to make that payment, you only make it once. You don't make it multiple times, even if there is a problem in your system.

The systems that we have, all of these self-contained systems, all of these services, are trying to work towards eventual consistency. So they receive instructions and they immediately store these instructions in the database. Then the database becomes the repository of what work must be done. The store, the list of what work has to be achieved, what payments have to be made, what cards have to be reissued, what addresses have to be changed, whatever the command is, whatever the piece of work that you need to do is, that is stored and logged in the database at first. And then from there you can try and catch up and make everything consistent and make sure that the processing that needs to be done to make this payment or issue the card or what have you, is correctly performed by all the systems that need to be involved.

Your smarts are in your services here, not in the pipes between them. The connection between the services has no queuing mechanism or no detection mechanism, they just pass the requests between services as they're asked. All of the smarts to detect whether you are running something once or twice or whether you haven't run it at all, are contained within the services which are continually, as I said, trying to catch themselves up to eventual consistency. As I mentioned before, there are no distributed transactions here in our system, so you have to anticipate that something could get halfway through and fail. If a particular command, a particular work item requires you to make changes in three different systems, you have to anticipate that at any particular stage this could fail and then the system will need to make sure that it can catch up, and make sure that it can reach that situation of eventual consistency. Above all, mistrust, suspicion, skepticism needs to be built into each of the services. "Have I already done this? Have I not done this at all?" It needs to check and re-check.

Now I'm going to go through an example of how this will work. So, the example I'm going to pick is someone needs to make a payment from their Starling Bank account to another bank account somewhere else. How would this work? They take their mobile phone and they say, "I'd like to make a payment to this bank account please, £20. Thank you. Go." That request is then sent to customer, the customer service. That's the service responsible for storing the balance and storing the transactions that the customer wants to make.

The customer service receives the request and the first thing it does is put that in the database. It does a little bit of validation beforehand just to make sure the customer has enough money to send that out. But it very quickly, as quickly as it possibly can, stores that in the database. And that is the synchronous bit. We try and reduce synchronicity to a minimum. The synchronous bit is taking it, validating it, putting it in the database, done. Then responding to the mobile and saying, "Thank you very much, 202 accepted. Tick."
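The synchronous step described above (validate lightly, store, respond) can be sketched as follows. This is a sketch under stated assumptions, not Starling's implementation: the `PaymentIntake` class is hypothetical, an in-memory map stands in for the service's database, HTTP 202 is assumed to be the "accepted" status, and 422 is an assumed status for a request that fails validation.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.UUID;

/** Hypothetical intake step of the customer service; a map stands in for its database. */
class PaymentIntake {
    static final int ACCEPTED = 202;  // HTTP 202 Accepted: stored, to be processed later
    static final int REJECTED = 422;  // assumed status for a request that fails validation

    private final Map<UUID, Long> storedWork = new LinkedHashMap<>();
    private final long availablePence;

    PaymentIntake(long availablePence) {
        this.availablePence = availablePence;
    }

    /** The synchronous part: validate, store, respond. No downstream calls happen here. */
    int submit(UUID paymentId, long amountPence) {
        if (storedWork.containsKey(paymentId)) {
            return ACCEPTED;                     // same request seen before: nothing new to store
        }
        if (amountPence <= 0 || amountPence > availablePence) {
            return REJECTED;                     // light validation only
        }
        storedWork.put(paymentId, amountPence);  // the database now holds the work to be done
        return ACCEPTED;
    }

    int storedWorkCount() {
        return storedWork.size();
    }
}
```

Everything after the store (recording the transaction, calling the payment service) would happen asynchronously, driven by what is in the database rather than by the original request.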

It then records the transaction, says, "Great," reduces the balance as necessary, and then sends that payment off to the payment system. The payment system is there to communicate with the outside world, it's there to connect to the faster payments network that we have in the UK, which allows us to send payments between different banks. It will say, "Hey, payment, please make this payment." And payment will go, "Thank you very much. Store in the database, accepted," and then payment will send it out to the payment network.

Payment then has to contact bank. Bank is the system that maintains the ledger, and it has to record in the ledger that a payment has been sent out, so that later, when the faster payments network comes to us with the bill and says, "Here's the bill that you need to pay because we've made these payments on your behalf," we will be able to reconcile that bill. So bank takes the payment, stores it in the database, "202 accepted, thank you very much." Then writes it into the ledger.

Now that is all a very nice and jolly and happy path, but what happens when things start to go wrong? This time we're going to imagine things going wrong. So the customer comes along, they say, "I'd like to make a payment, please." Customer accepts it, "202 accepted, thank you." Stores it in the database, reduces the balance, records the transaction, and then it sends it off to payment and payment's not there. The payment instances aren't working, none of them are up. There's a problem somewhere. So customer goes, "Ah, okay." It doesn't get a good response back, so what does it do?

Well, it's got this work item. It waits for five minutes and then it tries again, and it sends that payment over to the payment service again, and then the payment service says, "Oh great, this time I'm here and I will accept it. Thank you very much, 202 accepted." So this concept of being able to retry work items to make sure that you have actually done them, by going, "Well, if I didn't get a good response, I'll wait five minutes and then try again," provides the at least once component that we need.

Now let's imagine that payment tries to send something to bank, because obviously, it needs to store the payment in the ledger. Payment sends the payment to bank and bank stores the payments in the ledger, and then tries to communicate back that it's done this and the communication fails. Maybe the bank instance goes down before it's had a chance to communicate back. Maybe the payment instance disappears before it has a chance to be responded to. Either way, there's some breakdown in communication and bank actually does the work of putting this payment in the ledger, but that isn't communicated back to payment.

So, following our catch-up retry pattern, payment comes along and says, "All right. I'll try it again five minutes later," and it sends off the message to bank. Bank receives this payment and says, "Okay, I've got this. Let me just check to see if I've put this in the ledger already." Goes off to the ledger, finds out it has put it in the ledger already. Now at this point, it doesn't throw an exception. It doesn't complain and shout and go, "Oh, there's a problem." It just goes, "Thank you very much, 202 accepted. I've done that for you. What you wanted to achieve, that this has gone into the ledger, has been achieved. Tick." So that is the idempotency, that is the at most once.

Now, you'll notice that all over this diagram and the previous one I've been putting UUID in brackets all over the place. A UUID is a universally unique identifier, which for us is a 32-character hexadecimal string that we associate with everything, and I mean everything. Everything in our system gets a UUID: every card, every customer, every payment, every transaction, every notification, every ledger entry, everything gets a UUID. This UUID is passed around with the items. That is the key to making sure that idempotency is achieved, because the only sure-fire way you can guarantee that this is exactly the same command and I have processed this one already is by having a UUID. You can see the UUID and you can say, "Well, this UUID matches a UUID I already have in the database, I'm not going to do this again." Idempotency.
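The lost-acknowledgement scenario above can be sketched in a few lines. This is a sketch under assumptions, not Starling's code: `LedgerService` and `LedgerClient` are hypothetical names, a map stands in for the ledger table, the five-minute wait between attempts is omitted, and the lost reply is simulated with a counter.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

/** Hypothetical ledger service: entries are keyed by the payment's UUID. */
class LedgerService {
    static final int ACCEPTED = 202;
    private final Map<UUID, Long> entries = new HashMap<>();

    /** Safe to call repeatedly with the same UUID; only the first call writes. */
    int record(UUID paymentId, long amountPence) {
        entries.putIfAbsent(paymentId, amountPence); // a repeat is not an error
        return ACCEPTED;
    }

    int entryCount() {
        return entries.size();
    }
}

/** Hypothetical sender that retries until it has seen a good response. */
class LedgerClient {
    private final LedgerService ledger;
    private int responsesToLose; // simulates replies lost in transit

    LedgerClient(LedgerService ledger, int responsesToLose) {
        this.ledger = ledger;
        this.responsesToLose = responsesToLose;
    }

    /** One attempt; returns true only if the acknowledgement actually arrived. */
    boolean trySend(UUID paymentId, long amountPence) {
        int status = ledger.record(paymentId, amountPence); // the work is done on the far side
        if (responsesToLose > 0) {
            responsesToLose--;
            return false; // the sender never hears back
        }
        return status == LedgerService.ACCEPTED;
    }

    /** Retries up to maxAttempts times; returns the attempt that succeeded, or -1. */
    int sendUntilAcknowledged(UUID paymentId, long amountPence, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (trySend(paymentId, amountPence)) {
                return attempt;
            }
        }
        return -1;
    }
}
```

With one lost reply, the sender needs two attempts, yet the ledger holds a single entry. The retry supplies "at least once" and the UUID check supplies "at most once".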

Catch-up Processing

What does this look like when we implement it in code? There are two concepts here that I want to explain: the catch-up processor and the recoverable command. The recoverable command, the RC bit labeled there, is an idempotent command to do something: to make the payment, to send a new card out, to cancel the card, etc. It has built into it checks to make sure that the work item hasn't already been processed. It takes a UUID in and then interprets what that UUID is and what it means, so everything, in order to be processed, has to have a UUID associated with it. Once it has processed the item successfully, it stores that fact in the database: "Hey, database, I have processed this item successfully. Please mark this with a timestamp or similar marker to say this work item has been processed, do not try to process it again."
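A recoverable command along these lines might look like the sketch below. The class name `CancelCardCommand` and its fields are hypothetical, and a map of UUID to timestamp stands in for the "processed" marker that would live in the database.

```java
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

/** Hypothetical recoverable command; the map stands in for a "processed at" database column. */
class CancelCardCommand {
    private final Map<UUID, Instant> processedAt = new HashMap<>();
    private int cancellationsPerformed = 0;

    /** Idempotent: running it again for the same work item changes nothing. */
    void execute(UUID workItemId) {
        if (workItemId == null) {
            throw new IllegalArgumentException("every work item needs a UUID");
        }
        if (processedAt.containsKey(workItemId)) {
            return;                                   // already processed: do not repeat the work
        }
        cancellationsPerformed++;                     // the actual side effect
        processedAt.put(workItemId, Instant.now());   // mark as processed with a timestamp
    }

    boolean isProcessed(UUID workItemId) {
        return processedAt.containsKey(workItemId);
    }

    int cancellationsPerformed() {
        return cancellationsPerformed;
    }
}
```

In a real service the side effect and the timestamp write would presumably share one local database transaction, so that the marker is never recorded without the work or vice versa. That detail is an assumption and not something the talk spells out.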

The catch-up processor is there to provide the retry function. It is a job that runs every five minutes or one minute or 10 minutes, depending on what sort of cadence you need, and it will pick up 1,000, 100, however many items you need from the database that need processing. It will farm those out to the various recoverable commands. We have a big ring buffer command bus which we put work items on, and then we have threads going around picking up those work items and processing them through the appropriate recoverable command.

If something goes wrong, if there's a bug, what happens is that the recoverable command will not say that the work item is complete. The catch-up processor will come back in five minutes and try and re-process it. If it's still a problem, it'll keep processing it until it's fixed, until the desired state has been achieved and then it will go tick. These catch-up processors help us continually work towards eventual consistency.
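The interplay between the catch-up processor and a failing command can be sketched as follows. All names here are hypothetical, a map stands in for the work-item table, and items are processed sequentially in one thread. The ring buffer and worker threads mentioned in the talk are deliberately left out to keep the sketch short.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;

/** Hypothetical command contract: throw to signal that the item is not done yet. */
interface RecoverableCommand {
    void execute(UUID workItemId) throws Exception;
}

/** Hypothetical catch-up processor; the map stands in for the table of work items. */
class CatchUpProcessor {
    private final Map<UUID, Boolean> completed = new LinkedHashMap<>(); // id -> done?
    private final RecoverableCommand command;
    private final int batchSize;

    CatchUpProcessor(RecoverableCommand command, int batchSize) {
        this.command = command;
        this.batchSize = batchSize;
    }

    void enqueue(UUID workItemId) {
        completed.putIfAbsent(workItemId, false);
    }

    /** One scheduled run: take up to batchSize pending items and attempt each one. */
    int runOnce() {
        List<UUID> batch = new ArrayList<>();
        for (Map.Entry<UUID, Boolean> entry : completed.entrySet()) {
            if (!entry.getValue() && batch.size() < batchSize) {
                batch.add(entry.getKey());
            }
        }
        int succeeded = 0;
        for (UUID id : batch) {
            try {
                command.execute(id);
                completed.put(id, true);  // only reached if no exception was thrown
                succeeded++;
            } catch (Exception failure) {
                // Leave the item pending; the next run will pick it up again.
            }
        }
        return succeeded;
    }

    long pendingCount() {
        return completed.values().stream().filter(done -> !done).count();
    }
}
```

In a running service, `runOnce()` could be driven by something like `ScheduledExecutorService.scheduleWithFixedDelay` at whatever cadence suits the work. A completed item is never selected again, so once the underlying fault is fixed the backlog drains and the processor goes quiet.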

But Why Java?

Now I want to get down to - given this is the enterprise languages track - the reason that we chose Java for our enterprise. One of the benefits of Java at a tactical work item level is that exceptions are really noisy. You get a lot of noise, you can throw an exception, you can stop processing, you can bubble this exception up. That allows you to make sure that a work item will not be processed, that any further processing done on the work item, including marking it as complete in the database, will be stopped. This then becomes obvious to monitor, you can see what's going on, and you can make sure that your system will carry on iterating over this item until it is complete.

At a more strategic level, we have a reliable ecosystem in Java. Java has a large user base, it has good tooling, and many people know it, so it's easy to hire for. This provides us with a higher order of reliability. We're not just thinking now about reliability at the level of an individual payment. Customers don't just want to know that single payments or transactions or their day-to-day banking will go through very well. They want to know that they won't be called in a year's time to be told that their bank has gone out of business and now they need to spend loads of time and go through mountains of paperwork, switching their bank account to another bank account. They need to know that this bank account will be here in a decade's time, 15 years, 20 years, etc. So choosing a language that is well supported in terms of its tooling, its user base and so on, is a very good consideration for an enterprise that is thinking long term, which banks have to.

It also means that we have easier integrations with legacy third parties. A lot of the trick with banking is allowing consumers to interface with systems that were designed 30, 40, 50 years ago, and removing the difficulty of interfacing with things that work by transferring fixed-width, delimited text files around the system, which we have to do. Since Java provides us with an easy way of interfacing with all of those crazy, more outdated systems, it becomes easier for us to offer our consumers an easy interface into the banking world.

The Benefits of DITTO Architecture

What are the benefits of this architecture? What are the things that we can do? What does this give us? Instance termination is safe. This is the key point about our architecture. So, if we feel an instance is in trouble either because something has gone wrong and it's run out of memory, or if we think it's under attack, or even if we just want to terminate it and bring up a new version, we can do so, safe in the knowledge that if it's in the middle of processing anything that that work item will be re-picked up by a new instance and processed in an idempotent manner, processed in a manner that means that we don't have a duplicate occurring, a duplicate payment or what have you.

When I say instance termination is safe, I don't just mean the instances running code. I also mean the database instances. If a database instance goes out for some reason, all right, we'll lose some functionality for a while, but at least we won't get into a position where we have that spreading bug problem, because when anything comes along to that instance to do something, the first thing that the instance tries to do is save the work item in the database. And if the database isn't there, then the service will just go, "500. I can't save anything in the database. This is all broken. Go away and come back later." So anything trying to contact it will be able to catch up by contacting it later through the catch-up processors and the retry functionality. That allows us to be a bank that does database upgrades in the middle of the day during office hours, which we do and have been doing recently.
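The "save first, otherwise refuse" behavior can be sketched briefly. The `WorkStore` interface and `WorkItemEndpoint` class are hypothetical names used only for illustration, and 202 is assumed as the success status.

```java
import java.util.UUID;

/** Hypothetical persistence interface; save throws if the database cannot be reached. */
interface WorkStore {
    void save(UUID workItemId) throws Exception;
}

/** Hypothetical request handler that refuses any work it cannot persist. */
class WorkItemEndpoint {
    static final int ACCEPTED = 202;
    static final int SERVER_ERROR = 500;

    private final WorkStore store;

    WorkItemEndpoint(WorkStore store) {
        this.store = store;
    }

    int receive(UUID workItemId) {
        try {
            store.save(workItemId);  // first step, before any processing
        } catch (Exception databaseUnavailable) {
            return SERVER_ERROR;     // nothing was done; the caller should retry later
        }
        return ACCEPTED;
    }
}
```

Because no processing happens before the save, a database outage leaves no half-finished work behind. The caller simply keeps the item pending and its own catch-up processing tries again once the database is back.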

It allows us to be a bank where continual delivery means slightly faster than once every three months. Because we can kill instances at any time, we can take old versions out of service and bring new versions into service any time we choose, without worrying about what each instance is doing. Now, we don't quite have continual delivery to production, we still get sign-off to go into production, but we are releasing to production at least once a day, if not twice, three, four, five times a day. This allows us to move much faster and deliver features much faster.

Traditional banking, or incumbent banking, I should say, works in this "bi-modal" manner. What that means is that they end up delivering their user interface, their apps and their web portals very fast, but their back end goes very much more slowly. You have this problem whereby you're trying to deliver the user interface over a back end that may not support the features that you want to implement. It must be an interesting challenge to work in those banks where you have to iterate the user interface to deliver things over a back end that doesn't really change. Whereas with us, our back end goes 10 times faster than our front end. So we're actually in a position where if a new feature needs to be released, we can release the whole feature, back end and front end, all at once. Fantastic. We can deploy that and make sure that it goes into production, that the feature gets out to customers as quickly as we can.

But you might be thinking we're a regulated industry. Banking is a regulated industry in the UK as I'm sure it is here. What do the regulators say about this? Here's an example. One of the things that the regulations say is when you do a release, you must let everyone know that a release is happening. Now, generally, banks will do that by sending out an email, but if we're releasing every day, we don't want to spam people with emails going, "There's a release, there's a release, there's a release." So what we do is we go into Slack and we post what we call the "rolling" giphy indicating that our production system is rolling. You can see various examples of our rolling here. This lets everyone know that the system is rolling. We've fulfilled our regulatory requirement. This has been signed off by auditors. They're happy with this. They're happy that we are fulfilling our requirements to let our employees know that there is a change in code coming along.

Oftentimes, people use regulation as an excuse. They say, "Ah no, can't ... Oh, we'd love to satisfy and delight our customers, but alas, regulation prevents us from doing so." That's a load of rubbish. Yes, it's complete BS. This shows that regulators are happy with you delivering as long as you fulfill the spirit of what is intended you can deliver quickly to customers. You just have to think of an inventive way of fulfilling the spirit of the regulation.

What happens if something goes wrong? If something explodes, we have screens everywhere. These are screens showing Grafana graphs that we've created of various metrics: card payments going out, faster payments going out, exceptions happening. We've got these screens all throughout the office. So if something goes wrong, people will quickly notice because these things are very visible and they'll say, "Hang on a minute, what's going on with all the exceptions coming out of such and such?" We can quickly look into it. And because we are happy with releasing on a very fast cadence, we can release a fix to that problem. It doesn't require an emergency. "Oh God, we've got to release today. How on earth do we do that? Quick, get the VP of blah blah blah on the phone and this person and this person and the 20 other people we need to do the release." We can just do the release. So we can fix bugs quickly.

We can also test this. Here's our chat-ops killing of particular instances of things. Here's me going into Slack and saying, "I'd like to recycle all of the instances of this particular service, please. Take them all out of service one by one and bring up new ones." That's recycle. But I could also go "Kill. Kill all of the instances, bring them down." The eagle-eyed amongst you might be able to see that that's in demo. Yes, fine, testing in demo is okay, but in order to make sure that this is actually working, you need to test it in production.

We run chaos in production. We're a bank, we run chaos engineering in production. We don't run a lot of it. We don't have a full Simian army going, we only have one little monkey who goes and kills instances, but he will kill instances in production. He will take instances of services down. Now, we do that fairly regularly. We don't often take all instances of a particular service down at once, but on occasion, we have had to do that and we felt that that's the most prudent thing to do in response to an emergency. And we can do all this because we're happy that things will catch up.

The Future

I want to briefly talk about two topics regarding the future of where Starling is going, where this architecture is going. The first one is scale. We started hitting scale problems recently. One of the scale problems we've hit is the self-DoS problem, where we create a denial-of-service attack against ourselves by having one service pick up a work item and try to communicate with another service: "Hey, can you process this work item?" And that service goes, "Error." Then five minutes later the first service goes, "Can you process this work item?" If there's a large-scale problem, it will be doing that with a thousand work items. Obviously, because the catch-up processors are run in each instance, it will be doing 2,000 work items, because there are two instances each running a thousand work items. And if we've scaled this up to many instances, there'll be N thousand work items coming across, which will all be rejected. This will fill up the ring buffers, and it can overheat the database of particular services. We found that happening.

So we've started instituting queue management, where we've visualized all of this in our management application and we can go and pause particular catch-up processors. We can say, "Stop this. Pause this catch-up processor, it's overheating the system. We need to go and fix it." Then we can click play when we want to play it again, and it'll catch up when we're happy it's been fixed. Also, we can pause or even delete, in a real emergency, particular work items. So we can say, "Stop this work item, it's causing a problem. Let us fix it first." That visualization helps us overcome that problem.
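The pause and play control described here can be sketched in a few lines of Java. This is a minimal, single-threaded illustration and not Starling's implementation: the class `PausableCatchUpProcessor` and all of its method names are assumptions made up for the example, and the "processing" is just moving an ID from one collection to another.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical catch-up processor that an operator can pause and resume.
class PausableCatchUpProcessor {
    private final Queue<String> pending = new ArrayDeque<>();
    private final List<String> processed = new ArrayList<>();
    private final AtomicBoolean paused = new AtomicBoolean(false);

    void enqueue(String workItemId) {
        pending.add(workItemId);
    }

    void pause() {
        paused.set(true);
    }

    void play() {
        paused.set(false);
    }

    // One scheduled pass. While paused it does nothing, so a struggling
    // downstream service is not hit with the same work items again.
    int runOnce() {
        if (paused.get()) {
            return 0;
        }
        int count = 0;
        String id;
        while ((id = pending.poll()) != null) {
            processed.add(id); // stand-in for the real processing
            count++;
        }
        return count;
    }

    int pendingCount() {
        return pending.size();
    }

    List<String> processed() {
        return processed;
    }
}
```

While paused, the work items stay queued rather than being dropped, so pressing play lets the processor catch up from where it left off.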

The second problem we have is race conditions. Obviously, we've got various different instances trying to go into the database and pick up work items through their catch-up processors. And if two instances both go and pick up the same work item, then the idempotency checks become very difficult because you can check whether that work item has been processed, but if two things are processing at the same time, then the idempotency checks will pass on both of them because the work item hasn't been processed in either case. So this is a tricky situation where you could possibly get a duplicate because your idempotency checks get bypassed.
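The race described above can be reproduced deterministically by interleaving the steps by hand. The sketch below is only an illustration of the problem, with no fix attempted: `CheckThenProcess` and its methods are hypothetical names, and an in-memory set stands in for the database record of processed work items.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.UUID;

// Hypothetical service whose idempotency check and processing are separate steps.
class CheckThenProcess {
    private final Set<UUID> processed = new HashSet<>();
    private int timesProcessed = 0;

    boolean alreadyProcessed(UUID workItemId) {
        return processed.contains(workItemId);
    }

    void process(UUID workItemId) {
        timesProcessed++;          // the side effect, for example sending a payment
        processed.add(workItemId); // recorded only once the work is done
    }

    int timesProcessed() {
        return timesProcessed;
    }

    // Two instances run one after the other: the second check sees the first result.
    static int sequentialRun() {
        CheckThenProcess service = new CheckThenProcess();
        UUID item = UUID.randomUUID();
        if (!service.alreadyProcessed(item)) service.process(item); // instance A
        if (!service.alreadyProcessed(item)) service.process(item); // instance B skips
        return service.timesProcessed();
    }

    // Both instances check before either has processed: both checks pass.
    static int interleavedRun() {
        CheckThenProcess service = new CheckThenProcess();
        UUID item = UUID.randomUUID();
        boolean seenByA = service.alreadyProcessed(item);
        boolean seenByB = service.alreadyProcessed(item);
        if (!seenByA) service.process(item);
        if (!seenByB) service.process(item);
        return service.timesProcessed();
    }

    public static void main(String[] args) {
        System.out.println("sequential: " + sequentialRun());   // prints "sequential: 1"
        System.out.println("interleaved: " + interleavedRun()); // prints "interleaved: 2"
    }
}
```

The interleaved run processes the same work item twice even though each instance performed its idempotency check, which is exactly the duplicate the talk warns about.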

We're thinking about how to fix that and we haven't come up with a good solution yet, but we're progressing towards an answer, which is that maybe we need to separate out that work item processing. Maybe we need to be more intelligent with how the catch-up processors pick up items out of the database, and have that separated from the actual instances and the running of the particular work items. But we still want to maintain the smarts in the services, not in the pipes.

The other thing to say is about the future of the JVM, and the future of Java in general. We're at the moment on Java 8 and we're trying to upgrade to Java 10 and then hopefully to Java 11, which has long-term support. And we're encountering a few problems, and those problems mainly relate to third-party libraries we use. We use third-party libraries, and those third-party libraries use other third-party libraries, and so on, which use methods and classes that are visible in Java 8, so they can go and get them. In Java 9, when modularization was brought in, a load of that visibility disappeared. Now these third-party libraries have had to upgrade themselves. The third-party libraries that depend on them have to upgrade themselves, and so on and so on, in order to cope with this new world.

So in upgrading our system, we're finding that we're spending a lot of time slowly finding versions of these libraries that work with later versions of Java. This takes a lot of effort. The question is, will Java maintain that reliability, that backwards compatibility, which gives Java a lot of its cachet? Will it still be there? Can we still say it has backwards compatibility? Will it still maintain that in the future? I guess watch this space and come to the next talk for answers on that one.

Some Important Takeaways

Let me finish up with some important takeaways from this talk. First up, design software to be skeptical. I first put this down as "design software to be suspicious," and then I thought, "No wait, that doesn't really scan, people might misinterpret that." So design software to be skeptical, to question what it is doing. Don't just blindly accept that the command you have received is, "Yes, that must be good. Someone else must've checked that." Check it yourself. Have your individual services mistrust what they are being handed by other services or by the outside world, and check whether those commands have already been run, whether they have been run at all, whether they have been completed.

Give everything a UUID. All your work items, all your objects in the database, everything everywhere must have a UUID if you are to implement an architecture like this, because that is the only way to guarantee the reliability that you need. It's the only way to guarantee idempotency. Fire alarms are good. Having something that's continually going off going, "There's a problem here. There's a problem here. There's a problem here." Yes, it might be annoying, but it allows you to quickly identify that there is a problem and help your cause and help fix it and help your customers, who presumably this error is affecting, get back on track.
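To make the link between UUIDs and idempotency concrete, here is a minimal sketch, assuming a hypothetical `IdempotentLedger` class that is not taken from Starling's code. Each command carries a UUID, and the ledger remembers which UUIDs it has already applied, so a retried command changes nothing the second time.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.UUID;

// Hypothetical ledger that applies each command at most once, keyed by its UUID.
class IdempotentLedger {
    private final Set<UUID> appliedCommands = new HashSet<>();
    private long balancePence = 0;

    // Returns true if the credit was applied now,
    // false if a command with this UUID had already been applied.
    boolean credit(UUID commandId, long amountPence) {
        if (!appliedCommands.add(commandId)) {
            return false; // duplicate delivery or retry: skip the side effect
        }
        balancePence += amountPence;
        return true;
    }

    long balancePence() {
        return balancePence;
    }
}
```

Sending the same credit twice with the same UUID leaves the balance increased once, which is what lets a catch-up processor retry safely. This single-threaded version does not address the concurrent race discussed earlier in the talk.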

And above all, to end on a positive note, you can do anything that you can undo. If you are in a situation where there's going to be a problem, something's going to go wrong, which inevitably it will, and you anticipate that and you build your system to cope with that, then you can deploy at speed. You can go at speed because if you create a bug, if you create a problem, which you inevitably will, then your system will be able to catch up and get itself into a good position, and it won't fall over or fail your customers. If you do that, then you can break this dichotomy, this false choice between speed and reliability, and you can have both.

Questions & Answers

Participant 1: You mentioned that for every request the node will first write the request to the database before processing it. Do you have an idea of how much overhead this has, and how it affects the performance?

Maude: It certainly affects the performance a lot. In order to answer that question, we'd have to consider what would happen if we decided not to do that, and how it affects the performance. But if we decided not to do it, then I'm not sure we could run a banking system like this. So, in essence, it's a cost that we have to pay at the moment. We have to be able to pay that cost. I think it does add quite a lot of load onto the system. And maybe if you don't need a system as reliable as a banking system, you wouldn't do it. But we haven't tried to exactly measure what would happen if we take it away because then that would just completely wreck our systems. I'm afraid I can't give you an exact answer on that one.

Participant 2: Just to clarify, I'm trying to understand DITTO, this is a very interesting concept. If everything has to be idempotent, it seems that there is a constraint that the flow has to be in only one direction. So you cannot expect a response from the second service. You can only say, “Create it, accept it,” but you cannot wait until it produces something that will be used by the caller.

Maude: I see what you mean. In a certain way, yes, you're right. I think what you're saying is that if service A calls service B, then service B can't hand back a fully formed, completed object to service A. Yes, you are absolutely right about that. The only thing service B can hand back is a promise, essentially a UUID saying, "I promise to do this piece of work. I will do this piece of work for you. If you want to know how this piece of work is going, please enquire using the UUID you have provided."

Generally speaking, most of the time in the happy path, things happen very quickly. So we don't actually need to worry too much about that. The reason that we make it asynchronous, even though it appears synchronous to the human eye, in the happy path, is because in the sad path we need to have that asynchronicity so that we can catch up later and have that eventual consistency.
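The "promise" exchange in this answer can be sketched as follows. This is an assumed shape, not Starling's API: `WorkService`, `Status`, `accept`, `completePending`, and `enquire` are hypothetical names, and a map stands in for service B's database.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Hypothetical "service B": it accepts work and hands back only the UUID as a promise.
class WorkService {
    enum Status { UNKNOWN, ACCEPTED, COMPLETED }

    private final Map<UUID, Status> statuses = new HashMap<>();

    // The caller supplies the UUID. Accepting the same UUID again is a no-op,
    // so a retried request never resets work that is already under way or done.
    UUID accept(UUID workItemId) {
        statuses.putIfAbsent(workItemId, Status.ACCEPTED);
        return workItemId;
    }

    // Stand-in for the asynchronous processing that happens later.
    void completePending() {
        statuses.replaceAll((id, status) -> Status.COMPLETED);
    }

    // The caller asks how the work is going, using the UUID it provided.
    Status enquire(UUID workItemId) {
        return statuses.getOrDefault(workItemId, Status.UNKNOWN);
    }
}
```

In the happy path the caller would see `COMPLETED` almost immediately. The separation only matters in the sad path, when the enquiry keeps returning `ACCEPTED` until the catch-up logic finishes the work.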

Participant 3: What if your retriable command just keeps failing forever? How do you deal with that?

Maude: We have people who are watching it during office hours. Everyone's looking at those screens, as I said, screens with exceptions on them. If you get a particular work item that is failing consistently, what you see on those exception screens is a consistent pattern: every five minutes there is an exception blip, normally of two exceptions because we're running two instances. People notice that and then go in and say, "What's going on with this work item? How can we fix it?" and then release a fix. If there's a problem that's out of office hours, we have a PagerDuty rotation to alert us that a serious problem is occurring and we need to go and fix it now, and then people will get up in the middle of the night and go and fix it. So, we don't have any work items that infinitely fail, because we jump on the exceptions and correct them.

Participant 4: I'm wondering in your architecture, did you choose to use message queues in any of your workflows or maybe your retry logic? And if not, what was your rationale for avoiding that?

Maude: We didn't choose to use message queues for anything. We chose to put all of the queuing and processing and retry logic in the services, the smarts in the service, not in the pipes. So we chose to use the database, essentially, as our queue of work. The rationale behind that is that it's then easy to kill an instance, bring it back up again, and still process those work items. You haven't lost those work items. Maybe you lost their position in the command bus, the ring buffer that's having the recoverable commands farmed out to it, but it doesn't matter, because that ring buffer will refill with exactly the same commands, because they're all stored in the database and you can go and get them. So we felt that that was a more reliable way of processing, basically.
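A rough sketch of why killing an instance loses nothing under this design: the durable record lives in the database, and the in-memory buffer is only a cache of it. Everything here is an assumption for illustration. `WorkItemStore` is a map standing in for a database table, and `ServiceInstance` uses a plain queue where the talk describes a ring buffer.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

// Stand-in for the database table of work items (id -> done flag).
class WorkItemStore {
    private final Map<String, Boolean> done = new LinkedHashMap<>();

    void save(String workItemId) {
        done.putIfAbsent(workItemId, false);
    }

    void markDone(String workItemId) {
        done.put(workItemId, true);
    }

    List<String> incomplete() {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Boolean> entry : done.entrySet()) {
            if (!entry.getValue()) {
                result.add(entry.getKey());
            }
        }
        return result;
    }
}

// Hypothetical service instance whose buffer is purely in memory.
class ServiceInstance {
    private final WorkItemStore store;
    private final Queue<String> buffer = new ArrayDeque<>();

    ServiceInstance(WorkItemStore store) {
        this.store = store;
    }

    // Rebuild the in-memory buffer from whatever the store says is unfinished.
    void refill() {
        buffer.clear();
        buffer.addAll(store.incomplete());
    }

    // Process everything buffered and record completion durably.
    int drain() {
        int count = 0;
        String id;
        while ((id = buffer.poll()) != null) {
            store.markDone(id);
            count++;
        }
        return count;
    }

    int buffered() {
        return buffer.size();
    }
}
```

If the first instance is discarded after refilling but before draining, a fresh instance built on the same store refills with exactly the same work items.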

Participant 5: Can a work item be partially finished?

Maude: Yes, absolutely, work items can definitely be partially finished. Every single service along the way has to check that it's done its bit of the work item and that, if it needs to pass that work item on to another service, that pass-on has been done. So, you need to build this retry catch-up processor logic in every service that needs to process that item.
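One way to picture a partially finished work item is as a set of completed steps that each retry consults before doing anything. The sketch below is illustrative only: `PartialWorkItem`, its two `Step` values, and the `unavailable` parameter used to simulate a failing downstream service are all assumptions made up for the example.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical work item that remembers which of its steps have completed.
class PartialWorkItem {
    enum Step { RECORD_LOCALLY, PASS_TO_NEXT_SERVICE }

    private final Set<Step> completed = EnumSet.noneOf(Step.class);

    // One catch-up attempt: skip steps already done and stop at the first
    // step that cannot succeed right now. 'unavailable' simulates such steps.
    boolean attempt(Set<Step> unavailable) {
        for (Step step : Step.values()) {
            if (completed.contains(step)) {
                continue; // done on an earlier attempt, do not repeat it
            }
            if (unavailable.contains(step)) {
                return false; // leave it for a later catch-up pass
            }
            completed.add(step);
        }
        return true;
    }

    Set<Step> completed() {
        return completed;
    }
}
```

A first attempt with the next service unavailable records the local step and stops. A later attempt skips the local step and completes only the pass-on.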

Just before we go, I would like to say that if you want to know more, please go and check out the Starling podcast, which I host, where I get people from Starling Bank to come on and discuss various topics around how we built the bank from a technical point of view.

Recorded at:

Mar 30, 2019

Jason Maude