Rampant Pragmatism: Growth and Change at Starling Bank

Summary

Daniel Osborne and Martin Dow discuss relational theory, functional relational programming and self-contained systems. Osborne and Dow explain their approach to complexity, and show how they inform the design of many parts of their system, including the ledger and their web stack.

Bio

Daniel Osborne is Web Technology Practice Lead at Starling Bank. Martin Dow is Engineering Lead for Core Banking at Starling Bank.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Osborne: Thank you for joining us to hear our talk about pragmatism, which is rampant at Starling Bank. Starling was founded by our CEO, Anne Boden, in 2014, in reaction to the banking crisis. She believed, and continues to believe, that customers need to be treated fairly and to be empowered, and that building a bank on modern technology is the best way to achieve this. Martin and I started within about a month of each other 4 years ago, in what was then a very small team with just a couple of engineers. I'm now Practice Lead for Web Technologies, which makes me responsible for anything that runs in the browser and how it gets there, ideally, quickly and securely.

Dow: I'm Martin. I'm Engineering Lead for core banking, where I deal with services at the heart of the bank, involving finance, treasury, credit, and fraud. To achieve Anne's mission, Starling has always been a really customer and delivery focused organization. That has translated into a pretty pragmatic engineering culture. Dan and I sit at opposite ends of the technology stack. We thought it'd be really interesting to talk to you about how similar many things actually are across those two different areas, if you dig a little beneath the surface.

Agenda

We're going to start by talking about some ideas, some theory. We'll talk about complexity, a bit about relational modeling, functions, projections. That will mainly give us a bit of a language to help explain how we think about our systems. Then we're going to use that language to describe a couple of slightly more practical examples. Dan, the web guy over here, is going to talk to us about how that theory applies in the web stack. I'm going to try and do the same in my neck of the woods, and talk about a banking ledger and how we push data around our platform. Then we'll sum up by talking a bit about some of Starling's engineering principles.

Complexity: Essence vs. Accident

I want to start off by talking about complexity, to think a little bit about what makes building systems difficult. What makes that difficulty grow, the longer you are building these systems? The distinction I want to introduce was suggested by Fred Brooks in a very famous 1986 paper called, "No Silver Bullet." The bit we're interested in is the subtitle, which is, "Essence and Accident in Software Engineering." Brooks distinguished between essential and accidental tasks. The essential is the system that you're building as it exists in an idealized world. If you didn't have to worry about the technical constraints, what of your system would be left? It's a form of complexity that you can never remove, or you would be left with no working system at all. It's the data. It's the logic at the heart of a system. It's declarative. It's the what, not the how.

On the other hand, the accidental is how you get that system running, whether you're poking holes in a punch card, writing assembly code, or deploying an EC2 instance in AWS using CloudFormation. Accidental is definitely a slightly odd word here, but what Brooks is trying to get us to think about is how much time and effort we are spending defining what the software should do for the customer and solving real problems, versus persuading a machine to push some bits around to actually execute that stuff.

We can think of complexity as a spectrum. On the left, we have the core of the system. We can never remove that. The system won't do anything useful without it. We can divide that into state or data, and a wrapper of logic or business rules. As we move to the right, we can think about accidental complexity. Without the accidental part, the system won't execute. A certain amount of that we classify as useful. We almost certainly need it to build a useful and performant system. However, we want to make sure we don't drift too far along into the useless complexity. That adds no value to the system, but adds lots of costs. It can take on at least two forms. Under-investment in our system might cause us to expend a load of effort on boilerplate, on wrestling abstractions that are no longer fit for purpose. On the other side, maybe there's over-investment, or over-engineering of our systems. We create these high upfront build costs, and probably some long-term maintenance costs, in the hope of some future gain in productivity.

As you move along the spectrum, you have to ask the question: how much accidental complexity are you willing and able to accept given your particular context, given your business, given your strategy, given the stage of your growth? Perhaps we can think of pragmatism as avoiding some of the temptations that lurk on the right side of the spectrum. Choosing and moving with care.

Essential Complexity: Data Modeling vs. System Design

What exactly is essential complexity? What does it look like if you're actually building systems for a living? Dan and I love this quote from Rich Hickey, the godfather of Clojure, "It doesn't really matter what language you use. Data is really the foundation of the system." For many of us who build data-centric systems, we would argue that you can think of essential complexity as more or less data modeling. There'll be lots of other things to consider. For now, we can think of modeling the data our system stores as the best way of getting to grips with that essential complexity. It's a simple and effective tool for the team to think through the problems it faces. What I'm really talking about is a team standing around a whiteboard drawing boxes. It's a focus on the domain. It gives us half a chance of actually sharing a discussion about the problems that we're solving. That effective communication is obviously essential. Finding this clear, simple syntax, whatever it is, allows everyone to contribute to the discussion, to challenge, to improve the solution, no matter what their level of expertise. I acknowledge this is really simple stuff. When as a community we spend so much time focused on the complexity of our cloud deployments, or on the subtleties of our favorite programming language's syntax, it does serve occasionally to remind ourselves that at the core this is a job about creative thinking, about finding simple mental models to understand the problems that we are solving and the solutions. Once you have those, we'd argue that quite a lot of the code actually writes itself.

Flex the Model

The team keeps talking, challenging this model we've come up with. I like to think of it as flexing the model as you're looking at it, stretching it to see how it responds to a bit of strain from a few different angles. Imagine how it would handle some future requirement. Think about scenarios where a cardinality you've drawn on the board might no longer make sense if we introduced some new concept. You're trying to identify concepts, entities that you might have missed or conflated. We think really carefully about time. We're drawing this as a snapshot, but the data you store arrives over a period of time and is stored over a period of time. You'll probably also want to ensure that, as you look back on the system, you know what state that system was in. You can reproduce it. You can, and often should, store event-modeled data in a relational database. You don't need every entity to share the same generic data type to store an event-based model. It might feel a little bit like upfront design. You're spending quite a lot of time at this stage. But the cost of iterating at this point in the process is so low. It's a form of really efficient prototyping. The idea is obviously not necessarily to support all those future requirements. We don't even have to create all the tables. The thing is to think deeply about the problem.

Relational Modeling - Access Path Independence

I'm not going to ask any interview questions about Boyce-Codd. What I'm really talking about here is the process of normalization in relational database design: questions that you ask yourself in the hope that the model might survive a little bit of future change. When Edgar Codd described relational modeling in his 1970 paper, he advocated for this concept of data independence. It's a clear separation of the application program from the physical representation of the data it's accessing on disk. He also described this problem of access path dependence. It's a problem we still see today all over the place. Document databases are really useful in many contexts. But the code that operates against them needs to have a pretty intimate knowledge of the structure of the data as it sits in the database, and probably of the structure of many versions of that data, if you're updating on read, for example. REST APIs have a similar property. Note the popularity of talks on GraphQL appearing now at QCon. The same goes even for breaking a system into microservices. In each case, you're forcing yourself to collapse this multi-dimensional space down along a single dimension, which you happen to deem more important than the others at that point in time. Inevitably, there'll be other contexts, and in those contexts we'll wish that we'd chosen a different dimension. The key thing here is that the data is stored without knowledge of the path by which it is accessed. We're pulling that out of the data itself, so that we can add it back in other parts of the system where we think it's more appropriate.

Postgres

We use Postgres as our relational database management system. With Postgres, we start to move to the right along that spectrum, beyond the essential data into the essential logic and even into the accidental. SQL gives us this data independence. It's that declarative language through which we can choose, on a query by query basis, which dimension and which access path we want to pick for the data. It lets us define constraints, which is incredibly important for us. It allows us to create a declarative barrier. Facts are only stored if they are true according to the predicates or the rules that we define up front. The system enforces the rules. You don't need every developer to do so, although you obviously rely on them to write the rules themselves.
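As a minimal sketch of that declarative barrier and of choosing an access path per query, the following uses Python's built-in `sqlite3` module as a stand-in for Postgres so that it runs anywhere. The `account` and `payment` tables and their columns are hypothetical, invented for illustration; they are not Starling's schema.

```python
import sqlite3

# In-memory SQLite stands in for Postgres (an assumption, for runnability).
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite needs this to enforce REFERENCES
conn.executescript("""
    CREATE TABLE account (
        id    INTEGER PRIMARY KEY,
        owner TEXT NOT NULL
    );
    CREATE TABLE payment (
        id           INTEGER PRIMARY KEY,
        account_id   INTEGER NOT NULL REFERENCES account(id),
        amount_pence INTEGER NOT NULL CHECK (amount_pence > 0),
        made_on      TEXT NOT NULL
    );
    INSERT INTO account (id, owner) VALUES (1, 'alice');
    INSERT INTO payment (account_id, amount_pence, made_on) VALUES
        (1, 500, '2020-03-01'),
        (1, 250, '2020-03-02');
""")


def is_rejected(sql):
    """Return True if the database refuses to store the fact."""
    try:
        conn.execute(sql)
    except sqlite3.IntegrityError:
        return True
    return False


# The rules are declared once in the schema; the database enforces them.
rejected_negative = is_rejected(
    "INSERT INTO payment (account_id, amount_pence, made_on) "
    "VALUES (1, -5, '2020-03-03')")
rejected_orphan = is_rejected(
    "INSERT INTO payment (account_id, amount_pence, made_on) "
    "VALUES (99, 5, '2020-03-03')")

# The same stored facts, read along two different access paths.
by_owner = conn.execute("""
    SELECT a.owner, SUM(p.amount_pence)
    FROM payment p JOIN account a ON a.id = p.account_id
    GROUP BY a.owner
""").fetchall()
by_day = conn.execute("""
    SELECT made_on, SUM(amount_pence) FROM payment
    GROUP BY made_on ORDER BY made_on
""").fetchall()
```

Neither query is privileged by the way the rows are stored; the grouping dimension is picked at query time.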

Not a System Yet

Osborne: Martin has talked a lot about the importance of a solid relational core in implementing the essential state of the system. He started us on this journey into our system's essential logic by committing to using Postgres with its views and constraints. We're only so far along the spectrum, about halfway along the logic section. It's not a system yet, unless our customers are willing to log into a psql console, get to grips with our data model, and check their balance. Our data model is perhaps complex. Through banning deletes and updates, we forced ourselves to thoroughly model time and change. That means we have a lot of tables. We have so many relations through this careful, up front, future-proof, and flexible design that, even for a very sophisticated customer well-versed in SQL, it really would be a nightmare to check your balance or to get tax data for last year. Now we need to build this stuff for customers, and for our operations teams to do their jobs proactively. It's in this business logic layer that we add back in the access path that Codd removed. The data exposed to a customer via mobile is very different to that exposed to fraud personnel in ops. This business logic views and projects a dimension of the relational core from Postgres appropriate to a customer or an operative. Here is where your differing user needs are expressed, and indeed, where your security model could be implemented. Doing so is a joy owing to the data independence provided by the relational core.

Out of the Tar Pit

I'm talking about how to survive building and maintaining your essential logic without getting tied up and straying too far over to the right side of the complexity spectrum. I'm going to draw on the works of Ben Moseley and Peter Marks to help me do it. These guys wrote the awesome "Out of the Tar Pit" paper in 2006, which I highly recommend everybody read, if you haven't already. They build on Brooks' complexity dichotomy. One thing they do quite early on in this paper is review programming languages: how they relate to the modern software crisis, and how they relate to complexity. They cover object oriented programming as an example of the imperative style, and functional programming and logic programming as the more declarative styles. They say that the latter, logic programming, is really the Holy Grail for expressing your essential business logic in such a way that it describes what is desired of a system, rather than how the system should work. They actually settle on the recommendation of functional programming, though. I suppose, as Haskell developers, that's perhaps unsurprising.

Java

Hang on, you use Java. I want to be clear that the paper didn't directly influence engineering at Starling. It does give us a really good language to talk about software systems, and explains really well some of the most critical engineering principles that we follow. Our services are written in Java, and so is our business logic.

What is a function anyway? It's a transformation where some data comes in and some data goes out. Ideally, nothing else happens. No other dependencies wrapped up or hidden within the function. No side effects. No sending an email. No notification, or making a call out to another service over HTTP. Just the domain and the range, nice and simple like in math. We have this distributed system written in Java, a well known, strongly typed, imperative OOP language. But from 100 feet up, each one of our services is actually really just a collection of pure functions, or compositions, or pipelines of pure functions. Data comes in through HTTP at the top, is transformed in services, and is written to the datastore. Or, data comes in from the datastore, is transformed in services, and goes up and out as JSON over HTTP.

Every step of the pipeline is declarative. The data lives in Postgres. Derived data lives in Postgres too, defined in SQL. At the bottom of our Java stack is a Java interface annotated with some SQL. Java interfaces and annotations are declarative. SQL is declarative. This data is in the form of an immutable Java object, or perhaps an immutable collection of Java objects, which may be transformed with functions applied to Java streams. Then off it goes as JSON over the wire. With libraries like Jackson, that transformation is cross-cutting and declarative. Libraries like JAX-RS allow us to bind a Java interface to HTTP declaratively. We use RESTEasy, because it automatically generates HTTP clients from annotated HTTP server interfaces. There's a ton of awesome declarative stuff out there in Java land. If you keep it stateless and you keep it immutable, then you stay in a safe space away from the right-hand side of the complexity spectrum.
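The shape of such a pipeline (immutable rows in, pure transformations, JSON out) can be sketched as follows. This is written in Python for brevity rather than the Java that Starling uses, and the `Transaction` type, the field names, and the function names are all hypothetical.

```python
import json
from dataclasses import dataclass, FrozenInstanceError


@dataclass(frozen=True)  # frozen: instances cannot be mutated after creation
class Transaction:
    account_id: str
    amount_pence: int


def total(transactions):
    """Pure: the result depends only on the argument."""
    return sum(t.amount_pence for t in transactions)


def to_balance_view(account_id, transactions):
    """Pure: project the rows for one account into the shape a client needs."""
    own = tuple(t for t in transactions if t.account_id == account_id)
    return {"accountId": account_id, "balancePence": total(own)}


def render(view):
    """Pure: serialise the view; no I/O happens here."""
    return json.dumps(view, sort_keys=True)


# Stand-in for rows read from the datastore (hypothetical data).
rows = (
    Transaction("acc-1", 1000),
    Transaction("acc-2", 300),
    Transaction("acc-1", -250),
)
response = render(to_balance_view("acc-1", rows))

# Attempting to mutate a row fails instead of silently changing shared state.
try:
    rows[0].amount_pence = 0
    mutation_blocked = False
except FrozenInstanceError:
    mutation_blocked = True
```

Because every step is a function of its inputs, the side effects (reading rows, writing the HTTP response) can sit at the two ends of the pipeline rather than inside it.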

FRP System Components

Ben and Peter, in this paper, go on to propose a hypothetical architecture called FRP, Functional Relational Programming. It shares an acronym with Functional Reactive Programming, but it's different. It's posed as a solution to the modern software crisis. It consists of essential state down at the bottom right. This is the relational core at the heart of the system. Then there's essential logic, which is producing derived data: transforming data, like creating views in your RDBMS, and executing pure functions. It's part relational, part functional. You need integrity constraints there in your DBMS. Then there are these things called feeders and observers. Feeders are interfaces through which relations are created and you get your data in. Observers are interfaces through which your data is projected. You get your data out to customers on a mobile app, or perhaps to that other reporting system that your boss needs for his daily dose of MI.

Learning to classify these systems in terms of this complexity spectrum, or diagrams like this one, means we can start to divide and conquer. Then Ben and Peter advise us to avoid and separate. Avoid is what we've spent most of our time talking about so far: stripping away anything inessential, stripping it back to only the essential state and behavior. For example, out at the top, it says that we should avoid mutable state in our business objects, because mutable state is accidental complexity. Avoiding things in general means thinking very carefully about which costs we want to pay. Everything that you don't manage to avoid might cost you something. Always question whether now is the right time. When will we see the benefits? Are we doing this for the right reasons? Am I just padding my CV? You're going to come up with so many ideas. You're going to see so many talks. You're going to read lots of blogs and tweets, and you can't do or use them all. At Starling, I think we work really hard to avoid complexity. Some of us take a daily dose of healthy skepticism.

Separation. It turns out that we can't really avoid all complexity, accidental or otherwise. Separation is this technique we can use to reduce the potential cost of taking on useful accidental complexity. When you separate things, you have to wrap up and share cross-cutting concerns, to get the benefits of consistency. By wrapping up cross-cutting concerns, expressing them declaratively, and applying them consistently, we're able to think more about what is desired of the system than how it works. By wrapping up cross-cutting concerns we make our services homogeneous and consistent. At Starling, you probably won't find that one team one day will go off and write a service in Haskell. Both being huge fans of Clojure, Martin and I found this a bit tough, actually. We work hard at avoiding heterogeneity in our systems. Consistent services also promote, or allow, consistent infrastructure, which can then also be automated and declarative, through tools like Terraform.

Observers and Feeders

Observers and Feeders are like an instance of separation. They implement separation. Separation isn't really a new idea. It's been around for a while. It happens on another axis: the separation of read and write. Bertrand Meyer in the '80s coined the term CQS, Command-Query Separation. It's about placing read and write operations in separate methods on your objects. Queries have to be referentially transparent. They must produce no side effects. They must merely return data. Greg Young took it a little bit further with Command Query Responsibility Segregation, CQRS, wherein writes and reads are placed in separate objects. Separating read and write creates space for a range of architectures and patterns, like the option to have multiple read models, granted, potentially at a cost of accidental complexity through things like synchronization, time lag, and eventual consistency. This allows the physical scaling of read and write to occur asymmetrically, which is both realistic and pragmatic. We've done quite a lot of papers and theory. Now for some examples. Martin's the core banking and database-y guy.
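The read/write separation described above can be sketched in a few lines. This is an illustrative toy, not Starling's code: the class names `AccountCommands` and `AccountQueries` and the shared in-memory list are hypothetical, and a real CQRS system would usually keep separate, synchronized read models.

```python
class AccountCommands:
    """Write side: methods change state and return nothing."""

    def __init__(self, facts):
        self._facts = facts

    def deposit(self, account_id, amount_pence):
        if amount_pence <= 0:
            raise ValueError("deposit must be positive")
        # Append-only: existing facts are never updated or deleted.
        self._facts.append((account_id, amount_pence))


class AccountQueries:
    """Read side: methods return data and change nothing."""

    def __init__(self, facts):
        self._facts = facts

    def balance(self, account_id):
        return sum(amount for acc, amount in self._facts if acc == account_id)


facts = []  # stand-in for the essential state
commands = AccountCommands(facts)
queries = AccountQueries(facts)

command_result = commands.deposit("acc-1", 500)
commands.deposit("acc-1", 250)

first_read = queries.balance("acc-1")
second_read = queries.balance("acc-1")  # asking again changes nothing
```

With commands and queries in separate objects, as in CQRS, the read side could later be pointed at its own projection of the data without touching the write side.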

React

At Starling we use React. It's a very well-known JavaScript library, often thought of as the view layer for single page applications. We can define React at a high level, in some formal sounding terms, by saying that it encourages developers to write a pure set of functions over the domain of your system state, your business and user state, into the range of the DOM. You might go on to say that your browser is a function over the domain of the DOM into the range of pixels on your monitor. React brings together functional and reactive programming patterns, and it raises the consideration of state to the forefront of the developer's mind by offering you two kinds of data: props, which are values your component receives from the outside, and state, which your component manages itself.

Functional Programs

By avoiding state and side effects, by programming in a functional style, and by using immutable data structures, your entire system gains this property of referential transparency, which is a great aid to informal reasoning. This is where every expression can be replaced with its value without changing the program's behavior. That's the definition of referential transparency. All it means is that you have a function foo, and every time you give it 2, it gives you back 4, and it doesn't send you an email. Functional programming shortens the brain to code gap. I think that's more important than anything else, when you and your teams are all reading back each other's code at least 10 times more often than you're writing it.
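To make the foo example concrete, here is a small sketch. `foo` is taken from the talk; `foo_with_email` and the `outbox` list are hypothetical stand-ins for a function with a side effect.

```python
def foo(x):
    """Referentially transparent: foo(2) can always be replaced by 4."""
    return x * 2


outbox = []  # stand-in for an email gateway


def foo_with_email(x):
    """Not referentially transparent: each call also 'sends an email'."""
    outbox.append(f"foo was called with {x}")
    return x * 2


# Replacing the call with its value leaves the program unchanged.
with_calls = foo(2) + foo(2)
with_values = 4 + 4

# The impure version returns the same number, but substituting 4 for each
# call would silently drop two emails, so the substitution is not safe.
impure_result = foo_with_email(2) + foo_with_email(2)
emails_sent = len(outbox)
```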

Redux

Redux is another thing we use at Starling. It's a predictable state container for JavaScript applications. It's a state management tool, if you will. It's in no way coupled to React out of the box, but it's very common to use the two paired. It has a single global state store, which can be an immutable data structure. I think it should be an immutable data structure. It formalizes the change in state through actions, which describe a request to change state. It notifies subscribers when the store is changed. You can only change the store via these things called reducers, which you supply as a pure function of state and an action, and which must yield a new state. Redux assumes that you won't mutate your state anywhere other than inside this function, which you have provided as a reducer. The store is a database. It's more than a database. If you use an immutable data structure there, then it's a value. That gives you this unparalleled capability in the debugging space.

There are echoes of CQS-like separation here. C is for commands. They're like events, but they carry intent. They're immutable. These are akin to actions in Redux. Q is for queries. These are things like selectors or cursors, lenses, and zippers: any pure functional technique for reintroducing modularity over, and navigating, your global immutable state purely, without side effects, and as declaratively, logically, and composably as possible.
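The reducer, store, and selector roles described above can be sketched as a Python analogue. This is not Redux's JavaScript API; the `Store` class, the action types, and the state shape are hypothetical, and the dictionaries are treated as immutable by convention (the reducer always returns a new one).

```python
INITIAL_STATE = {"payees": (), "filter": ""}


def reducer(state, action):
    """Pure function of (state, action) that yields a new state."""
    if action["type"] == "payee/added":
        return {**state, "payees": state["payees"] + (action["name"],)}
    if action["type"] == "filter/changed":
        return {**state, "filter": action["text"]}
    return state  # unknown actions leave the state untouched


class Store:
    """Single state container; the only way to change it is dispatch."""

    def __init__(self, reducer, state):
        self._reducer = reducer
        self._state = state
        self._listeners = []

    def get_state(self):
        return self._state

    def subscribe(self, listener):
        self._listeners.append(listener)

    def dispatch(self, action):
        self._state = self._reducer(self._state, action)
        for listener in self._listeners:
            listener(self._state)


def select_visible_payees(state):
    """Query side: a pure selector over the global state."""
    needle = state["filter"].lower()
    return tuple(p for p in state["payees"] if needle in p.lower())


store = Store(reducer, INITIAL_STATE)
history = [store.get_state()]
store.subscribe(history.append)  # every state is a value we can keep

store.dispatch({"type": "payee/added", "name": "Alice"})
store.dispatch({"type": "payee/added", "name": "Bob"})
store.dispatch({"type": "filter/changed", "text": "al"})

visible = select_visible_payees(store.get_state())
```

Because each state is a separate value, `history` holds every intermediate state intact, which is the property that makes replaying and inspecting past states straightforward when debugging.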

Web Development

React helps us avoid and separate. It forces us to think about values and state, to treat them differently, to handle them explicitly. It separates us from the mutable world of the DOM and event handlers. Redux helps us avoid and separate. It provides us with a middleware stack, in which we can place our side effecting code and deal with the outside world, like network calls, separated from the rest of your otherwise rational, reasonable, pure functional program. We chose React, Redux, and Immutable.js because they each do one thing well, and they each stand alone. They compose well with other things that do one thing well. Each of these is simple, and they stay simple when composed together, which gives rise to understandability. That helps informal reasoning, impact analysis, and debuggability. Things that compose easily are easily pulled back apart. This gives rise to changeability. They support growth: the growth of your product complexity, the growth of your teams, the number of committers working in the same space.

This is a screenshot of one of the apps I look after. We call it the management portal. It's built on this web stack. It's simple. We adopted it really early, 4 years ago. It's a simple architecture. It served the original single engineer, muggins here, very well. What I found really pleasing is that the architecture hasn't really changed in 4 years. The tools changed a lot. Now we've got like 80, I think it was 96 committers at last count. Back then we had no customers, now we've got over a million. Back then we had no operations function and now we're heading towards 1000 people on board. Most of them use this application, day in, day out, to service the needs of our customers, the bank as a business, and of course, the regulator.

React, Redux, Immutable.js: can I claim it's an FRP implementation? It's definitely functional programming. It's probably reactive programming. Is it relational? No, not quite. Probably not yet. For many of you out there, and I think perhaps for us in the future, the missing piece from this collection of libraries to implement an FRP-like architecture is tools like GraphQL. Through GraphQL, web clients, which for years have been tied to prefabricated REST APIs, re-attain the power to query, to achieve data independence. At Starling, we lay out our homogeneous services as HTTP accessible projections of relational data. We have cross-functional teams. Together, by and large, this usually results in APIs that actually are fit for purpose. I don't need any more queryability than I already have. Lucky me, for now. GraphQL is very much on the radar and so is functional relational programming.

Starling Ledger

Dow: We've talked about complexity, about how we model and store data independently of the access path, then add that access path back later elsewhere as projections or functions of this relational core. Dan's explained a bit about how that applies in his area. I'm going to try and do the same in my area, with a view of a hypothetical ledger that is pretty close to what we do. Hopefully I'll show, along the way, a little bit of the accidental complexity you might find in our system. The data models and everything, I've just sketched them out here. They don't particularly resemble what we run in production or anything.

A Bank Is an Accounting Machine

A bank really is just an accounting machine. We'll start by thinking a little bit about what a business's accounts really are. In the language we've covered so far, you can think of them as a projection of all of that business's financially impacting data, a function. It's a function that takes as input all of those historically modeled, detailed relations that we came up with in front of the whiteboard. It transforms them into a unified, simple data structure. I've drawn some T accounts, which is what you draw on whiteboards with finance departments when you're building a bank. We've got debits and credits made as postings into ledger accounts. We group those postings into sets called journals, which are the numbers joining the postings together. Each of those adds up to zero. The ledger, as a whole, always adds up to zero, which apparently is quite important. Then, a ledger is immutable, which is also reasonably important. Though, you can and you do post immutable adjustments, by distinguishing the posted timestamp, when the event landed in the ledger, from the value timestamp.
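The structure just described can be sketched as follows: postings grouped into journals that each sum to zero, with separate posted and value timestamps. Everything here is an illustrative assumption, not Starling's model: the `Posting` fields, the account names, and the sign convention (positive for debit, negative for credit) are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)  # postings are immutable once written
class Posting:
    journal_id: int
    account: str
    amount_pence: int  # assumed convention: positive debit, negative credit
    posted_at: str     # when the posting landed in the ledger (ISO date)
    value_at: str      # the date the posting takes effect (ISO date)


LEDGER = (
    # Journal 1: an original entry.
    Posting(1, "cash", 1000, "2020-03-01", "2020-03-01"),
    Posting(1, "customer_deposits", -1000, "2020-03-01", "2020-03-01"),
    # Journal 2: an adjustment posted later but valued back on 1 March.
    # Nothing in journal 1 is edited; the correction is a new fact.
    Posting(2, "cash", -100, "2020-03-10", "2020-03-01"),
    Posting(2, "customer_deposits", 100, "2020-03-10", "2020-03-01"),
)


def journal_totals(postings):
    """Sum of postings per journal; each should be zero."""
    totals = defaultdict(int)
    for p in postings:
        totals[p.journal_id] += p.amount_pence
    return dict(totals)


def balance(postings, account, as_of, by="value_at"):
    """Balance of an account as of a date, by value date or posted date."""
    return sum(
        p.amount_pence
        for p in postings
        if p.account == account and getattr(p, by) <= as_of
    )


totals = journal_totals(LEDGER)
ledger_total = sum(p.amount_pence for p in LEDGER)
cash_by_value = balance(LEDGER, "cash", "2020-03-05", by="value_at")
cash_as_known = balance(LEDGER, "cash", "2020-03-05", by="posted_at")
```

Querying by value date gives the corrected position (900), while querying by posted date reproduces what the ledger showed on 5 March (1000), so history is adjusted without being rewritten.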

Detailed and Diverse to Unified and Generic

Assuming that you've stored enough data in your data model, you could actually think of a company's accounts as little more than a query over this really diverse set of financially significant events that flow through our system. That is the essential state of our system, expressed through a set of facts or relations: transaction settlement messages from the MasterCard or Visa payment networks, or inbound and outbound messages representing bank transfers over Faster Payments or SEPA. In each case, we store lots of detail, so that we can keep track of everything from an operational perspective. We can switch your card transaction in the feed from pending to settled. We know when we need to settle with a particular payment network, and for how much. We know which customer payments have been settled as part of that batch, and so on. The finance team doesn't care about all that detail. Their domain looks at the business from 30,000 feet up. To represent these relations in the accounts, they all first need to be transformed so that they share the same data type. We go from these diverse, detailed, individual representations into this simple, generic, unified data model in the ledger. We're purposely throwing away the detail in order to see the wood for the trees.
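A rough sketch of that transformation from detailed, diverse events into one generic posting shape is shown below. The event types, their fields, the ledger account names, and the direction of each debit and credit are hypothetical placeholders chosen for illustration; they do not describe real accounting treatment at Starling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CardSettlement:  # detailed card-domain fact (hypothetical fields)
    settlement_id: str
    customer_account: str
    merchant: str
    scheme: str
    amount_pence: int


@dataclass(frozen=True)
class InboundTransfer:  # detailed payments-domain fact (hypothetical fields)
    transfer_id: str
    customer_account: str
    sender_sort_code: str
    amount_pence: int


def postings_for(event):
    """Project one detailed event into generic
    (journal_id, ledger_account, amount_pence) postings that sum to zero.
    Operational detail such as merchant or sort code is deliberately dropped.
    """
    if isinstance(event, CardSettlement):
        return (
            (event.settlement_id, "customer_deposits", event.amount_pence),
            (event.settlement_id, "scheme_settlement", -event.amount_pence),
        )
    if isinstance(event, InboundTransfer):
        return (
            (event.transfer_id, "payments_settlement", event.amount_pence),
            (event.transfer_id, "customer_deposits", -event.amount_pence),
        )
    raise TypeError(f"no ledger projection for {type(event).__name__}")


events = (
    CardSettlement("cs-1", "acc-1", "Coffee Shop", "mastercard", 350),
    InboundTransfer("it-1", "acc-1", "00-00-00", 2000),
)

# The ledger is the union of every event's projection.
ledger = tuple(p for e in events for p in postings_for(e))
ledger_total = sum(amount for _, _, amount in ledger)
```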

Denormalization and Projection of Essential Data

If we're talking essential versus accidental complexity, then the way I've described it, the ledger is actually on the accidental side, because you could construct this crazy query. You could derive the ledger postings from this underlying complex data model. It would be a bit of a beast, of course, involving the union of dozens of different relations all projected into these ledger postings. It would be pretty unmaintainable. It would be represented in code. You would run the very real risk that a future change to that code, a bug that you might introduce, could rewrite history, and you would violate this property of immutability. At this point, we start to think about denormalization. We're going to store the ledger tables effectively as simplified copies of those raw, underlying facts.

Denormalized Postings

For the sake of illustration, I've drawn out what a logical data model might look like. It's obviously a pretty basic and weak model. All I was really trying to illustrate is that I've got a payments domain, a card domain, treasury, and FX, with lots of complicated detail. Everything joins to this simple ledger model in the center.

Separate Services?

I won't spend too much time talking about the why. I drew it as a monolithic, big, system-wide domain of, if you can imagine, hundreds of tables all joining into the ledger. In reality, we have good reasons to break our system up. We break it up into about 30-odd services. There's an element of Conway's Law here, allowing teams to own their services and to control their releases. There's also an element of domain-driven design: allowing the domains of the system to diverge and to grow independently, based on the expertise of the particular team owning that system. Why do we land on about 30 services? Where does that granularity come from?

Self-contained Systems

Firstly, we follow a strategy called self-contained systems. If you're going to break the bank up into many different services, we feel that you should be able to run those services independently of each other. We're trying to reduce the blast radius of errors here, to stop a problem in one service taking down others. That means minimizing synchronous calls between separate services. If each of those services is allowed to talk to a few others, then you're quickly in danger of building this distributed monolith: a network of expensive and brittle calls, in which any error can quickly cascade throughout the system. Using self-contained systems also means that each service has its own database.

Splitting the (Relational) Core

We've talked repeatedly about the importance of relational modeling. I've drawn some teal boxes here which are supposed to represent some imaginary systems, each a self-contained system, dividing my domain up into many different services, each with its own database. Now I've got to cut my database into loads of parts. I can't write nice queries across them. I can't enforce constraints across those records. What's going on? What have I done? Of course, it's all about trade-offs. To gain the benefit in one area, you pay the price elsewhere. We want to gain the benefits of self-contained systems, of these multiple services, so we give up a bit of that simple, consistent modeling. We don't want to go all the way. We don't want to give it all up. There's definitely a role for reconciliation. As a bank, consistency is obviously really important to us. Given what I've described, how do we go about maintaining that consistency in the face of this destruction of a lovely, neat model?

I've drawn up here my service boundaries going right through the middle of some entities. Those entities are effectively shared across services. Each piece of data in the system has a home. It is owned by one part of the system as its primary owner. We might have a finance message being something related to a card in the cards domain. You have a payment, perhaps created by the payment service, something to do with FX down at the bottom. In each of those domains, you have a rich set of historically modeled events, but one of them is going to be the one that is pushed across into the ledger. We see this as a form of event sourcing. From its home in one service, data is pushed into other services where it becomes joined into the relational model of that service. We've reduced the surface area of how these different services connect: a detailed model on one side, and a simplified, domain-specific model within the particular ledger service that I might have drawn here.

Pushing Data Around

We're pushing data rather than doing remote procedure calls, or anything like that, because we want to minimize the amount of transformation that happens to the data as it crosses this barrier, this bridge. We might send a subset of the columns or filter it. We're really trying to avoid having a complicated dependency on some piece of code that happens to execute, that is versioned in GitHub and changes. Otherwise, I won't know, what's the representation of this particular card transaction on the other side? This part of the system we've described in the past as a DITTO architecture, Do Idempotent Things to Others. On the consumer side of that, we ensure that the receiving service here, the ledger, the endpoints receiving those messages are idempotent. We can replay the message as many times as we like. It will land once. That means you can handle errors by repeatedly replaying. You just keep throwing it over the fence, and when you know it has stuck, you stop trying.
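As a rough illustration of the idempotent receiving side described above, here is a minimal in-memory sketch in Java. It is not Starling's code: the class name `IdempotentLedger`, the `receive` and `balance` methods, and the use of a UUID as the message id are all assumptions made for the example. A real service would more likely enforce this with a unique constraint in its own database.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Hypothetical ledger endpoint: delivering the same message many times
 * has the same effect as delivering it once.
 */
final class IdempotentLedger {
    // Keyed by the sender's message id (an assumed field); stands in for
    // a table with a unique constraint on that id.
    private final Map<UUID, Long> entriesByMessageId = new ConcurrentHashMap<>();

    /** Returns true if the message landed now, false if it had already landed. */
    boolean receive(UUID messageId, long amountMinorUnits) {
        // putIfAbsent is atomic, so concurrent replays cannot double-post.
        return entriesByMessageId.putIfAbsent(messageId, amountMinorUnits) == null;
    }

    /** Sum of all entries that have landed. */
    long balance() {
        return entriesByMessageId.values().stream().mapToLong(Long::longValue).sum();
    }
}
```

Because a replay is a no-op, the sender can retry blindly after any error without tracking whether an earlier attempt actually arrived.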

On the producer side, we've gone through a few different iterations and implementations of this. It definitely shows a little bit of accidental complexity. We've narrowed it to this, hopefully, thin channel across which these entities are pushed through the system. The implementation we use now is based on Postgres, on a feature called pg_notify, where Postgres notifies our Java services through the JDBC connection when a record is inserted into the database. What that means is that, in the same database transaction, we're able to commit a record representing a transaction together with another record representing the command to push that piece of data into another service. We can delegate the consistency of pushing the data across the boundary to Postgres transaction management. We don't need to worry about complicated two-phase commits or expensive distributed locking or synchronization technology. It works pretty well for us. Though, it's definitely part of the system that will continue to evolve and improve. I'd like, for a start, for it to be much more declarative, so there's less pushing, more declaring that I need to subscribe to a particular type of event.
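The producer side can be pictured as an outbox: the business record and the command to push it are written together, and a dispatcher later drains the pending commands. The sketch below is an in-memory stand-in only, assuming hypothetical names (`OutboxStore`, `commit`, `drain`, `pending`). It does not use Postgres or pg_notify; in the setup described in the talk, the two writes would be inserts inside one Postgres transaction, and the notification would be what wakes the dispatcher.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.function.Predicate;

/** In-memory stand-in for one service's database: business rows plus an outbox. */
final class OutboxStore {
    private final List<String> transactions = new ArrayList<>();
    private final Deque<String> outbox = new ArrayDeque<>();

    /** Stands in for a single database transaction: both writes happen together. */
    synchronized void commit(String transactionId) {
        transactions.add(transactionId); // the business record
        outbox.addLast(transactionId);   // the command to push it to another service
    }

    /**
     * Tries to deliver pending commands in order. A command is removed only
     * after the send succeeds, so a failed send is retried on the next drain.
     * The receiver must be idempotent, because a send may succeed remotely
     * and still be reported here as a failure.
     */
    synchronized int drain(Predicate<String> send) {
        int delivered = 0;
        while (!outbox.isEmpty() && send.test(outbox.peekFirst())) {
            outbox.removeFirst();
            delivered++;
        }
        return delivered;
    }

    /** Number of commands still waiting to be delivered. */
    synchronized int pending() {
        return outbox.size();
    }
}
```

The point of the design is that nothing outside the local transaction has to be coordinated at write time: delivery is at-least-once, and the idempotent receiver turns that into an effect that lands once.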

Kafka?

Why don't you just use Kafka? For a start, early on, we were unwilling to take on what we considered inessential complexity. We would have sacrificed a lot of developer productivity by requiring everyone to run this complex piece of technology, as well as needing to build up the expertise in production, and we had so much to do delivering features, getting stuff done for the business. We also couldn't see immediately what benefit it would bring, because we would still have had to worry about the consistency guarantees. We'd still have to make sure that the data crosses the barrier, and so we'd still have to implement that stuff I described before. Kafka is, obviously, improving rapidly. They've got exactly-once semantics, and so on. I'm sure lots of our concerns have been addressed. I can see a use case where, if we increased the number of consumers that we're pushing data to, or if we were regularly replaying large swathes of the data, then we would try to introduce the broker in the middle to take the load off the source system.

Postgres Logical Replication?

A better option for us might actually be logical replication, which is a Postgres feature. Postgres actually exposes the write-ahead log within the database, and that's the log of changes that the database records as it's storing the data. You're able to effectively subscribe to that, and also specify the format in which you consume that piece of data. The client can be another Postgres database as in traditional replication, but it can also be a service. You can write a Java client to consume this write-ahead log effectively, or the logical replication log. That allows us to turn this place-oriented datastore into an event producer. It's effectively the solution I spent a bit of time describing before. We get to keep our consistency guarantees and the transaction boundaries with a lot fewer moving parts than running a big, separate cluster. This architecture will continue to evolve over time.

Software Engineering: The Art of Compromise

You've seen a bit of the theory we presented at the start, and how it translates into real-life engineering at Starling, in the web, in the vault, where I live. We start to wrap things up now by thinking a little bit about Starling's engineering principles and what pragmatism means to us. On Greg's track, you're going to see quite a few Fintechs all solving really similar problems. We've all had very different journeys. Why is that? Software engineering really is, no surprise, a long series of trade-offs. There's no single right answer. The questions we're asking are changing every day. We're really engaged as engineers in the art of compromise. The orchestration of compromise. Lots of small decisions, which in sum define the journey that we take. If we exist in this state of constant compromise, what is it that stops us undertaking some myopic random walk through this forest of options, of opportunities, of choices? The compromises we make are through decisions. Those decisions are hopefully grounded on a principled foundation. It's those principles which give purpose and direction to the compromises we make, and thereby determine the system that we actually end up building.

Rampant Pragmatism

Osborne: What is pragmatism? It's an approach that evaluates theories or beliefs in terms of the success of their practical application. Your theory and experience inform what you believe. Then you can apply your beliefs, practically. Then we experience building systems, running systems, being on call for systems, smashing your mobile phone when PagerDuty rings. We evaluate it constantly. We figure out what to keep, what to tweak, what to throw away. Pragmatism is a feedback loop. It's navigating the world with your eyes open and moving deliberately.

Bets

What we call architecture as software engineers is really just the result of some people placing some bets on what they think needs to be easy to change later, and what doesn't. We did this 4 years ago. We started with a blank canvas. We made up some principles. We made up some high level aims that we thought were going to help us win at life. We said developers must love the experience of building this thing. Problems in production must not affect the entire system. Their blast radius must be limited. You place these bets, and you close some stuff off, and you leave some stuff open. You mix and pour some concrete over there. You place some lightweight, eco-friendly, modular furniture made out of hemp over there. For sure, for us as a startup back then with a small team with big ideas, data was always going to change. As a bank, looking after your money, and regulated to the highest degree in the finance world, we knew it had to add up. It has to be consistent. We placed a couple of bets to allow our data to change easily without compromising consistency. Obviously, we chose a relational model, as opposed to a graph, or a document, or object model. We placed emphasis on the ACID properties of transactions within our services. We think these were good bets.

However, we learned a few things about betting along the way. A few words of warning, pieces of advice, be aware that these bets, decisions and compromises you might make in the spirit of pragmatism, they can produce artifacts. Artifacts which have gravity and momentum. They can be big. They could last. In other words, concrete is a little bit harder to break up than hemp. Be guided by principles, explicit or implicit. Be guided by experience, or the experienced among you. Place your bets on what will or will not be expensive to change.

Bets + Principles = Architecture

How can we help ourselves make good bets? I think that bets plus principles yields an architecture. You've got a choice, option A and option B. You can put your spectacles on, and perhaps become less myopic, and reduce the risk of this decision by testing A and B against your principles. You can ask how A affects developer productivity. What about B? These principles make your bets less of a gamble. What are the principles that we established at Starling? What other bets did we place as engineers? First and foremost, we optimize for developer productivity. I can run the whole thing on my laptop. Granted, it's a beast of a laptop, but I can do that. I can run a raft of simulators at the same time representing external dependencies like the Faster Payments network, or the SWIFT network. I'd never experienced anything quite like this anywhere else before joining Starling. Not to this extent. Not to this scale. Not to this effect. I genuinely believe that this on its own, this principle of optimizing for developer productivity, is what allowed us to basically build this and go live in about a year with a handful of engineers. Since then, while we've been growing rapidly, we build and empower cross-functional teams, requiring of them minimal process. I think these things keep us focused, moving fast and unhampered.

We also strive for understandability, or perhaps better, changeability, or even better, low-risk changeability. We have these 30 services. They're written in Java, and they're in a monorepo. I get to open them in my IDE, all in one go. Press CTRL+ALT+H, and see how the data flows through my entire system, both within a service and across a network boundary into another service. This yields easy static analysis and informal reasoning. These help me to make the right change. We also preach a little bit: thou shalt write code for a future you, your colleague to your left, that you respect, and the guy who's on call. In practice, this looks like simple stuff that I hope we all know about already. Things like changing your stuff incrementally, so that you have small, short-lived branches and tractable PRs. Then, since your feature is going to be spread out across PRs, engage a few people in the medium-term review of those features across those PRs. Be kind. Communicate. Do unto others as you'd have them do unto you. Really simple stuff.

We believe in simplicity and consistency. All of our stateless services are composed of just a handful of common tokens. They all communicate externally and internally through HTTP APIs. They may or may not have daemon background processes or scheduled jobs. They all persist state to their own transactional relational datastore. We have homogeneous services and homogeneous infrastructure. We wrap up cross-cutting concerns and apply them consistently across those services. We limit dependencies in two ways. First, within our programming language: our POM file, our pom.xml (we still use Maven), doesn't have thousands of dependencies. Second, we carefully manage the dependencies between our services too. Why all this care, or restriction even? Because simple and consistent is understandable and changeable. When we were a young and small team, we could all do everything. We could switch and pivot and keep pace. We had a lot of finance and banking domain to learn. That was hard enough. The development had to be fast. We had to be able to take it for granted. While we've been growing, this has also been very useful for onboarding new engineers, getting them up to speed quickly.

Conclusion

Dow: Hopefully, that has given you a bit of an overview of how we interpret pragmatism at Starling Bank. I'm certain that we will continue to grow and change. We will move further along that spectrum of complexity. We'll take on more complicated tools. We'll hire teams to manage them. We'll continue to denormalize our data. I can't wait to tell my boss that I desperately need to build a distributed database for a start. We will continue to focus on the essential, on data, on our customers, and on delivery.

We're, as an industry, quite obsessed with accidental complexity. I like shiny tools as much as the next person. You really do have to think carefully about the cost of the choices you're making and whether they're appropriate in your context. Don't let some sense of tech status anxiety cause you to make choices that are not appropriate to your business now. Not many companies actually have the tech challenges of a Google or a Facebook. I bet that you would not be willing to accept a lot of the compromises that they've been perfectly happy to make. Know also when to take that leap of faith, tackle the next big problem, address the bit of tech debt that you've built up. By the time you get to that point, you'll have all the context to make the decision and to do the right thing.

We won't stand up here and tell you particularly what choices are best for you. All the talks are here. You can watch them yourselves, and weigh up the costs of the bets that you're making. As Vicki Boykis says, you may not actually need Kafka or Kubernetes right now. Perhaps the price is too high. Then again, they're powerful. They're exciting technologies. Maybe it's not and you do. You decide.

 

Recorded at:

Aug 12, 2020
