The Abyss of Ignorable: a Route into Chaos Testing from Starling Bank

Key Takeaways

  • The hostility of cloud environments was a gift to resilient architecture: it gave rise to chaos engineering, which may well outlive the unpredictability that inspired it.
  • Chaos engineering is acquiring the rigour and trappings of a discipline, but the key insights are still simple and powerful.
  • When Starling Bank started out with chaos, they began by setting out to remove risks from the abyss of ignorable. It was cheap, easy, and effective, and remains a compelling starting point.
  • In 2016, Starling implemented their own simple chaos daemon, just as they implemented their own core banking system. The reason? Same as ever: simplicity.  
  • Once even the most basic chaos testing is in place, you are running a system that expects failure and a development organisation that builds for it. You have removed the temptation to ignore it.

The greatest gift cloud computing gave the world was unreliable instances.

Unreliable instances took us from a world where architects wrestled with single points of failure in Visio diagrams to a world where engineers built crash-safe architectures as a matter of course. Instance failure was real enough, frequent enough, in your face enough, that systems were built to sustain it. In doing so we built systems that were orders of magnitude more resilient than those we ran on bare metal.

It wasn’t really a case of the theory changing so much as the practice changing. Unreliable instances were so much a part of the way the world was that everyone had to stop forgetting about them.

It can take tectonic shifts, like cloud, to force us into good behaviour. 

So, as instances get more reliable, as techniques such as live migration insulate us from that unreliability, and as managed services with extreme reliability characteristics become ever more available, will the benefits slowly peter out and be forgotten? Will we allow our systems to lapse into fragility once again?

I don’t think so. Chaos is our salvation.

Indeed for some, it was already Chaos Monkey rather than cloud instability that put instance failure in their face enough to change the mindset. The natural incidence was bad enough to be a problem but not bad enough to change behaviour.

You see, these days we don’t take the chance of things falling into that abyss of ignorable. We lend nature a helping fist.

Whichever way you remember it, without instances which were to a certain degree unreliable, we’d have taken much longer to get into the habit of making them really unreliable. Natural unreliability bred chaos engineering.

Complexity

Today, injecting instance failure is entry-level stuff. If you’re not torching random servers in production, where have you been for the last few years? You mean you’re not simulating VPC-peering go-slows, EBS jitter and internal DNS outages, whilst simultaneously revoking your primary site cert and upgrading your K8s cluster?

(Don’t do this.)

We are already in a world where instance failure looks like a decidedly on-prem concern compared with the range of failure modes the cloud can rain down. 

Cloud failure modes are extremely hard to pin down because cloud is stuffed full of opaque abstractions. Sometimes you don’t even know if a thing is really a thing, let alone whether that thing can fail, or the different ways it can fail, or the impacts its failure might have. Anyone who’s worked in a cloud or SRE role for any period of time has at some point sat in a group of engineers and uttered the immortal line: “Who knew that could happen?”.

There is a radical unpredictability about cloud which is not going away any time soon. It will keep us building defensively for some time yet. 

And that is before we even consider the systems we build on the cloud.

What have we learned?

We know that our mental model of the platform’s operational characteristics is wrong. We know that our mental model of our own systems is wrong. We know it is foolish to build for the lie of the first mental model, just as it is foolish to trust the lie of the second mental model. 

Chaos engineering is acquiring the rigour and trappings of a discipline, and a definition — experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production.

Turbulence and complex systems easily introduce enough terror to warrant chaos. But if we focus too much on complexity as the source of unpredictability, I think we miss a much simpler point - a point that is less about the systems we work with and more about our own limitations.

The Abyss of Ignorable

Imagine if every abstraction came with a divinely guaranteed SLA. (They don’t.) Every class and method call, every library and dependency.

Pretend that the SLA is a simple percentage. (They never are.)

There are some SLAs (100%, fifty nines) for which it would be wrong to even contemplate failure let alone handle it or test for it. The seconds you spent thinking about it would already be worth more than the expected loss from failure. In such a world you would still code on the assumption that there are no compiler bugs, JVM bugs, CPU instruction bugs - at least until such things were found. 

On the other hand there are SLAs (95%, 99.9%) for which, at reasonable workloads, failure is effectively guaranteed. So you handle them, test for them and your diligence is rewarded.
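
To make that concrete, here is a tiny back-of-the-envelope sketch in Python. The 99.9% figure and the workload are illustrative assumptions, not anyone’s real SLA or traffic:

p_fail = 1 - 0.999                         # per-call failure probability
calls_per_day = 100_000                    # hypothetical workload

expected_failures = p_fail * calls_per_day
p_at_least_one = 1 - (1 - p_fail) ** calls_per_day

print(round(expected_failures))            # ~100 failures expected per day
print(p_at_least_one)                      # ~1.0, i.e. effectively guaranteed

At that volume the question is not whether the failure happens, only how often, so handling it and testing for it pays off immediately.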

We get our behaviour in these cases right. We rightly dismiss the absurd and handle the mundane.

However, human judgement fails quite badly when it comes to unlikely events. And when the cost of handling unlikely events (in terms of complication) looks unpleasant, our intuition tends to reinforce our laziness. A system does not have to be turbulent or complex to expose this. Developers have generally been poor error handlers since development began.

In between the absurd and the mundane lurks what I call the abyss of ignorable. Into here fall the events that are unlikely enough that they seem to us to justify less attention.

In bare-metal server world, “instance” failure was firmly in the abyss of ignorable. This naturally led to fragile systems.

Cloud and chaos fixed the fragility by relocating instance failure out of the abyss of ignorable.

When Starling Bank started out with chaos, they started simply, and quite unscientifically, by setting out to remove risks from the abyss of ignorable. 

This was cheap, easy, effective, and if you’ve not already gone way beyond this, I’d like to suggest it might suit you very well too.

The Power of Simple

In 2016, Starling implemented their own simple chaos daemon, just as they implemented their own core banking system. The reason was the same one that drove so much of the decision making then and today: simplicity.  

I remember the decision. Even at the time the alternatives (complete with evil scripts for filling disks and killing the network and so on) were not simple to use in our environment. They came with dependencies and configuration, neither of which suited our needs. Back then we wanted to avoid agent-based arrangements - we were striving for as few moving parts as possible, so we embarked on a regime of using the AWS APIs to take potshots at servers. 

Is it crazy to build your own? Of course not. At the start, it’s not very much more complicated than:

forever {
  if rollD6() == 6 { 
    killRandomServer()
  }
  sleep(someMinutes())
}

In fact… let’s think it through. How much more complicated do we actually need to get?

Well, you must have control:

forever {
  if switchedOn() && rollD6() == 6 { 
    killRandomServer()
  }
  sleep(someMinutes())
}

Brutally simple. Even without such a flag, you could probably just undeploy your chaos service or scale your chaos group down to zero or something. The type of control that is acceptable in your context will vary. Just bear in mind that if you want chaos off, you probably really want chaos off, with no risk of it springing back to life through its own marvels of resilience.

What else? You must have observability.

forever {
  if switchedOn() && rollD6() == 6 {
    log("Hey everyone, by the way I'm killing a server now, okay?")
    killRandomServer()
    log("Done it now. No going back. Soz.")
  }
  sleep(someMinutes())
}

So primitive! But it hits the basics. When you’re investigating an incident and your data shows a server evaporating, you really want these log lines.

(Knowing which server? Pah! Bikeshedding.)

Incidentally, Starling have used log lines such as these to evidence certain DR-related audit requirements in the past. Being able to evidence things with real actual evidence is great fun.

And (I feel obliged to point out) you do need observability of your other services too. Remember you want to show that you can kill a server and that none of the things that matter to you suffer.

You are probably already monitoring the things that matter to you.
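
For the curious, here is a minimal sketch of what the whole loop might look like against the AWS APIs. To be clear, this is an illustration in Python rather than Starling’s actual daemon, and the region, tag name, parameter name and schedule are all assumptions:

import random
import time

import boto3

REGION = "eu-west-1"                                  # placeholder region
ec2 = boto3.client("ec2", region_name=REGION)
ssm = boto3.client("ssm", region_name=REGION)

def switched_on() -> bool:
    # The kill switch lives outside the daemon (here a hypothetical SSM
    # parameter), so turning chaos off never depends on the daemon cooperating.
    value = ssm.get_parameter(Name="/chaos/enabled")["Parameter"]["Value"]
    return value == "true"

def kill_random_server() -> None:
    # Only instances explicitly opted in via a hypothetical "chaos=enabled" tag.
    resp = ec2.describe_instances(Filters=[
        {"Name": "tag:chaos", "Values": ["enabled"]},
        {"Name": "instance-state-name", "Values": ["running"]},
    ])
    instances = [i["InstanceId"]
                 for r in resp["Reservations"] for i in r["Instances"]]
    if not instances:
        return
    victim = random.choice(instances)
    print(f"Hey everyone, by the way I'm killing {victim} now, okay?")
    ec2.terminate_instances(InstanceIds=[victim])
    print("Done it now. No going back. Soz.")

while True:
    if switched_on() and random.randint(1, 6) == 6:   # rollD6() == 6
        kill_random_server()
    time.sleep(random.randint(5, 30) * 60)            # someMinutes()

Note that the switch sits outside the process itself; if you want chaos off, that should not rely on a well-behaved daemon.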

There is a world of complexity that you can open up from here. You can get cuter and more devious. Today there are commercial products and OSS projects to draw inspiration from, and you can certainly use real-life incidents to inspire your chaos.

The point I want to make is this: you have functioning chaos right there and it is already a very powerful force in the evolution of your system. 

From this moment onward, forever, you know your system is not vulnerable to a certain class of problems. See how powerful those choices can be? Especially if you made them as early as Starling Bank did.

Even more importantly, from this moment, you are by default building a system in a way that expects this failure condition, not just paying lip service to it. You have removed the temptation to discard it. You’ve squeezed the abyss of ignorable.

(OK, OK, … you might worry about expectations of chaos frequency being stymied by restarts, which could be caused unpredictably by chaos hitting itself. Go to the back of the class. What’s that? Randomisation seed? Out.)

Making it Stick

The longevity of the regime of chaos is something that deserves comment. I said “forever” and yet the experience of some is that their forays into chaos have been immuno-rejected before they’ve even got going. The culture has been unable to shift to the point where the true cause can be seen behind the rampaging monkey. This happens when people have come to accept the fragility of their systems as a fact of life.

Cultures evolve, priorities shift, engineers are reassigned, architectures grow and evolve, scale hits, pandemics change the world around us. There is still a bit of code, running today in Starling’s production environment, which does the simple thing. There has never yet been a chaos team; that code is everyone’s responsibility, and yet it has been running for over four years. It has been maintained, fixed, moved, and its importance is known and respected.

It quickly developed the ability to run other experiments that mattered to us, like killing all servers in an autoscaling group, which in effect takes out an entire service in Starling’s architecture. In our non-production environments, that could be invoked on demand via a slackbot and was available to all engineers to investigate or verify hypotheses.

This sits squarely in the abyss of ignorable. It happens rarely, but it happens. In case you think that simultaneously losing three servers, in three different availability zones, in the same scaling group, is extremely unlikely, remember that they are all running the same code and were deployed at roughly the same time. We’re simulating a type of failure I call “intrinsic” (caused by our own development) as opposed to “extrinsic” (caused by the environment). If we push a bad change to all servers, we can easily take them all out.
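
As a sketch of how such an experiment might be expressed against the AWS APIs (again an assumption in Python, not Starling’s code, with the group name and region as placeholders):

import boto3

REGION = "eu-west-1"                                  # placeholder region
autoscaling = boto3.client("autoscaling", region_name=REGION)
ec2 = boto3.client("ec2", region_name=REGION)

def kill_service(asg_name: str) -> None:
    # Terminate every instance in the named autoscaling group at once,
    # simulating an "intrinsic" failure that takes out an entire service.
    groups = autoscaling.describe_auto_scaling_groups(
        AutoScalingGroupNames=[asg_name])["AutoScalingGroups"]
    instance_ids = [i["InstanceId"]
                    for g in groups for i in g["Instances"]]
    if instance_ids:
        print(f"Killing every instance in {asg_name}: {instance_ids}")
        ec2.terminate_instances(InstanceIds=instance_ids)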

We added the ability to cause database failovers. Services are notorious for ignoring the possibility that their database might disappear, for two reasons: i) it’s very rare and ii) it’s widely believed that there is nothing useful you can do. i) puts us back in the abyss of ignorable. ii) is simply wrong. There is something you can do, namely sit tight and be prepared to get going again the second you can contact the database, which is unlikely to be the behaviour of your services unless you have given it careful thought.
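
One way to inject that failure, sketched here as an assumption in Python rather than a description of Starling’s actual mechanism, is to force a Multi-AZ failover through the AWS APIs:

import boto3

rds = boto3.client("rds", region_name="eu-west-1")    # placeholder region

def failover_database(instance_id: str) -> None:
    # For a Multi-AZ RDS instance, a forced reboot-with-failover promotes the
    # standby and briefly severs existing connections, which is exactly the
    # condition services tend to ignore. The identifier is a placeholder.
    print(f"Forcing failover of {instance_id}")
    rds.reboot_db_instance(DBInstanceIdentifier=instance_id, ForceFailover=True)

The interesting part is not the injection itself but what your services do next: whether they sit tight and reconnect the moment the database is reachable again.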

Whatever experiments come and go, the thing that pegs chaos in everyone’s minds is still the simple knowledge that the chaos daemon is there and it’s always angry and it’s always killing servers. When people (technology and business alike) understand what is going on with server failures, you have an important step up towards explaining, normalising and embedding more interesting experimentation.

So don’t not do the simple thing. Don’t get side-tracked configuring Spiritual Discombobulation Monkey. Put a great deal of focus on the simple thing. Simple things can become part of the culture. They are effective quickly. 

And never forget that the system you are refining is a system of technology and the people who work on it. Determine what terrors languish in your abyss of ignorable and use chaos to haul them out into the open.

About the Author

Greg Hawkins is an independent consultant on tech, fintech, cloud and devops, and an advisor to banks and fintechs in the UK and abroad. He was CTO of Starling Bank from 2016-2018, during which time the fintech start-up acquired its banking licence, built its systems from scratch and went from zero to smashing through the 100K download mark on both mobile platforms. He remains a senior advisor to Starling Bank today. Starling built full-stack banking systems from scratch, launched the first publicly available cloud-based current account in the UK, was the first UK bank with PSD2-capable open APIs, and was voted Best British Bank and Best British Current Account in 2018, 2019 and 2020.
