Creating Resilient Software with Akka
In this article I will give you a tour behind the scenes of Akka, but not in a technical sense—that curiosity can be satisfied by looking at the source code—rather I will show you what the driving forces behind the development of its guiding principles are. While these ideas are not new I hope that I can give you a new perspective on how they are connected.
What does Resilience Mean?
Let’s face it: failure does happen. No matter how hard you try, there is always one more case which you did not foresee and which your testing did not reveal. The question is thus not “how can I avoid failure”, it is “how do I deal with it”. Resilience is defined by Oxford Dictionaries as:
- the ability of a substance or object to spring back into shape; elasticity
- the capacity to recover quickly from difficulties; toughness
Everyday examples of such difficulties, which may bend parts of your application out of shape, include the database server that is temporarily unavailable, the load spike introduced by everybody wanting to use that awesome new product, the malformed user input which somehow made it through to that subroutine which now chokes on it. Recovery from these situations should ideally be automatic in order to restore normal service as quickly as possible.
Resilience in this context means that failure must be compartmentalized: one failing component shall not take down other components if that can be avoided. Service levels may be reduced, perhaps some functionality is temporarily not available, but the rest of the application still runs. Once the fire has been stopped from spreading it must be extinguished, the failing component needs to be restarted or a fail-over is initiated to a backup system. These processes need to be automatic and reliable because they catch all those unplanned for failure cases. Once the component’s functionality has been restored normal collaboration with other components resumes.
Loose Coupling is Key
From the view just presented it is clear that the individual components making up a resilient application cannot be strongly coupled, they cannot be intimately entangled. Every bit of knowledge about the internal state of another component would become invalid after failure recovery, making one component vulnerable to the other’s failure. The idea of loosely coupled objects is a very old one, for example realized in Smalltalk-80, where an object—besides holding state—can only communicate with other objects using messages. This abstraction is a very natural one in that it matches how humans interact: we do not literally pick each other’s brains, instead we exchange messages via verbal and non-verbal communication. The key feature of loose coupling is that the internal state is not directly accessed and all interaction is mediated by a protective layer which is the object’s behavior.
A shadow of this principle is still visible in the common practice of not exposing variable fields of Java classes to external callers directly. In its stead we reflexively generate getters and setters with the help of an IDE; these do not yield loose coupling, though, as I will demonstrate next.
Another property of mainstream object-oriented languages is that invoking an object’s behavior—calling a method—happens synchronously so that the resulting value is immediately available to the calling code and all side-effects have happened when the method call returns. The problem with this is that it does not allow for an object to be unavailable: any external caller must immediately be satisfied with an answer and if that is currently not possible the only choice is to escalate the failure by throwing an exception. This exception must then be handled by the caller, triggering failure recovery. The caller is thus responsible for knowing how to do that, which is akin to saying that every customer must be able to repair the vending machine. This leads to a rather strong coupling between objects even in the absence of direct access to private state.
Loose coupling therefore demands that components communicate by sending each other messages asynchronously, receiving possible replies at a future point in time in the form of further messages. If you are thinking “component == object” in this context for the first time, this will likely seem odd to you, but in fact this technique has been applied at a higher level for a long time. Different components or services in an enterprise architecture are commonly decoupled using message brokers and queues, so that the failure and recovery of a service can be bridged by storing incoming requests within the messaging infrastructure and resuming processing when the service is back up. Another example is the ubiquitous RESTful HTTP interface in front of a service, where an HTTP request message is sent by the client, and the server answers with a HTTP reply message.
When you think about a service in your architecture, that is usually not something which is written in a few minutes; a service is usually one of up to a few hundred components, where the latter is already a rather large application, a sizable piece of engineering work. Now imagine that your services were constructed internally in the same loosely coupled fashion, just on a smaller scale. The objects inside collaborate to provide the service and each object just solves a tiny part of the overall problem. These small objects communicate only via asynchronous message passing and if one of them fails all others happily keep working. These small objects are Akka’s actors and it is completely reasonable to write one actor in a few minutes.
Asynchronous Message Passing needs Supervision
The example of an HTTP server makes it very obvious that the client cannot be responsible for restoring the server’s functionality in the event of trouble; it will merely react to the absence of a meaningful reply by disabling the service or switching to a backup server and thus achieving compartmentalization, inserting a bulk-head between the client and the service instance. But who is then responsible for restarting the HTTP server and bringing the RESTful interface back on-line?
In a resilient application every component must have a supervisor which is responsible for dealing with its failures. You can also think of this relationship between two components as that of parent and child. When the child fails it tells the parent—or the parent detects it by constantly monitoring its children—and the parent will decide upon the way forward, eventually restoring the child component’s function or escalating the failure to its own parent. The important feature of this approach is that it is no longer the responsibility of the caller to handle the callee’s failure; in the vending machine example someone other than the customer is responsible for either fixing it or calling in support. Decoupling the failure handling path (child–parent) from the business communication path (caller–callee) is perhaps the most important contribution of the Actor Model to software design.
Since every child needs to have a parent, all components form a supervision hierarchy, not unlike the management structure of a large corporation. And at the top of that hierarchy you find the one who symbolizes the whole effort, the component which ties the application together and whose failure signifies that the application as a whole has failed. The further up a failure escalates the more disruptive the resulting recovery actions will have to be, affecting ever larger parts of the application. This has two consequences:
- operations which have a high likelihood of failure should be performed low in the hierarchy so that recovery is as swift and simple as possible.
- important object state should be retained as high-up in the hierarchy as possible to protect it from said recovery actions—it frequently is necessary to reset a component to a known-good state in order to get it going again, whereby it loses all its accumulated state.
The second pattern is referred to as the Error Kernel and the obvious goal is to keep this core part of the application, for which recovery is difficult or costly, as small and as simple as possible, so that failure can largely be avoided.
Typical enterprise architectures do not display such a homogenous supervision hierarchy: the exemplary HTTP server is perhaps monitored by a network surveillance tool and either an automatic script restarts it or an operator is alerted to perform manual recovery. Akka actors implement the described form of mandatory parental supervision in a completely homogenous fashion, the hierarchy consists of actors all the way up. (Well, obviously there must be a top actor whose parent is not “real”, but that is a technical albeit interesting detail.) Every actor defines a supervisor strategy which determines whether a failed child actor is resumed, restarted (i.e. reset to initial state) or stopped completely. And if nothing else helps, the failure is escalated, which means that the supervisor fails itself and entrusts the problem to its own parent.
Loosely Coupled Objects can be Distributed
Looking back at the story so far I have introduced components which only communicate by sending each other messages and which do not directly access each other’s private internal state. Communication happens asynchronously, allowing for some time to pass between when the caller sends the message and when the callee processes it. These characteristics make it possible to execute the behavior of these components completely independently of each other, possibly one after the other on a single thread—without the nested call stack usually encountered during synchronous method invocations—but equally possible in parallel on different threads, which may in turn be scheduled on the cores of a single server or even on different network hosts. The latter is enabled by not sharing mutable state between components, whereby the need to execute in an environment with shared memory vanishes.
Consider this for a moment: starting from the desire for a resilient software design we obtained objects which lend themselves well for distribution across multiple network nodes. Coming from the other side, fault tolerance of a system demands redundant hardware, for even the most reliable server will eventually fail. Therefore distribution of an application is even required to achieve resilience in the face of node failures.
Another astonishing consequence of resilience is that it requires the introduction of concurrency, due to the supervision-related communication, but at the same time gives us a tool for managing it. The communicating objects securely encapsulate their private state and present a natural boundary within which all processing is sequential and consistent—one message after the other—and none of the dreaded utilities for managing synchronization in concurrent programs are needed at the level of your own code. In exchange for this nice property, the development of Akka’s core has provided a very gratifying challenge for the Akka team so far.
Distribution of actors is fully transparent in Akka, going so far as to make it a deployment choice. This means that you write your application using actors and then by configuration deploy parts of the supervision hierarchy on collaborating network nodes or group certain actors for execution on the same thread pool. Actors send messages among each other using ActorRefs, immutable handles which can be shared and passed around freely, exposing the same messaging semantics no matter if the designated target actor is situated within the same JVM or on a different network host.
Distributed Supervision calls for Cluster Support
The fault-related communication between child actor and parent is performed via messages as well, but what if the parent lives on a different network node and someone trips over a network cable precisely when that child is in need of help? Your intuition probably is that the failure message would need to be buffered until the cable is plugged back in and then resent, and the same goes for the parent’s reply. This is quite close to what happens, but the truth is a little more complicated because Akka needs to keep your application running even if the parent’s machine will never become reachable again, for example because its power supply finally gave up. In such cases the actors on the parent’s machine are considered stopped, which in this case would mean that the failed child actor will also have to be stopped (and the grand-children and so on); after all no actor can be without a parent. The complicated part is that other network nodes may also be collaborating with that machine which is currently not reachable from the child’s node, and these other nodes may still be able to communicate with the parent’s machine (because a network cable is at fault after all). The decision whether or not all actors on that machine shall be declared stopped needs to be taken consistently between all collaborating nodes.
This is where Akka’s cluster support enters the picture. It manages a consistent membership list of all collaborating cluster nodes and it does so without any central master node. Instead it is built upon the ideas of Dynamo and uses a gossip protocol to disseminate the shared membership information. Every node monitors the liveness of a number of peers using heartbeat messages and if a heartbeat pause exceeds a configurable threshold word will be spread among the others of the silent node’s unreachability. Once an unreachable node is taken out of the cluster by marking it “down” (either automatically/programmatically or manually) all remaining nodes will consider all actors on the removed node as stopped, inform their parents and recursively stop their child actors, respectively.
The supervision link between parent and child is not the only beneficiary of the cluster’s service: every actor can monitor the life-cycle of any other actor in the cluster in order to receive a message when that other actor is stopped. This works in the nominal case by the target actor sending out these messages during its end-of-life procedures, but thanks to the cluster it also works if the target actor cannot actually perform those actions because its node crashes or becomes unreachable.
Resilience is achieved by dividing an application into loosely coupled components and by accepting failure as a fundamental part of the programming model, which is explicitly expressed and addressed by supervision. This is why I strongly believe that the Actor Model—specifically with mandatory parental supervision as implemented in Akka—is the primary tool for developing resilient applications. Another benefit is that actor-based applications naturally scale horizontally and vertically since the basic abstraction lends itself well to distribution. If you want to read up on these topics in more depth, especially on the technical details and how to get started with writing actor-based application components, please refer to the online documentation.
About the Author
After earning a PhD in high-energy particle physics and while working as a systems engineer in the space business, Roland came in contact with Akka. He started contributing to the open-source project in 2010 and has been employed by Typesafe since 2011 where he has been leading the Akka team since November 2012.
resilience is much more than failure of a component (or actor)
better still resilience need not piggyback on the failure of components, in fact there does not need to be a self destruct button to achieve resilience if software is able to self regulate, predict and adapt
“Resilience is the ability to absorb disturbances, to be changed and then to re-organise and still have the same identity (retain the same basic structure and ways of functioning). It includes the ability to learn from the disturbance. A resilient system is forgiving of external shocks. As resilience declines the magnitude of a shock from which it cannot recover gets smaller and smaller. Resilience shifts attention from purely growth and efficiency to needed recovery and flexibility…Systems are never static, they tend to move through four, recurring phases, known as an adaptive cycle. Generally, the pattern of change is a sequence from a rapid growth phase through to a conservation phase in which resources are increasingly unavailable, locked up in existing structures, followed by a release phase that quickly moves into a phase of reorganisation, and thence into another growth phase.” – Resilience Alliance
there is a cycle to life and it need not be manage by a particular supervisory routine which is blind to what it is it actually supervises...this article goes into what it truly means to be supervised beyond start, stop and restart
system dynamics of resilience
runtime (in-flight) supervision...some call it governance
what is far more important is to think in system dynamics rather than actors and their failures...we can influence (both good & bad) the execution behavior of our software via the promotion and late injection of feedback loops in a software system
you will notice that in the links presented the emphasis is placed far more on learnt behavior and its intelligent runtime adaption and not failure and its blind restart (which does not mean the problem has been resolved or will be handled anyway differently in the future)...from my experience large distributed systems experience far more problems that are workload related and any early intervention (even just to allow the software catch some air) increases hugely the resilience of the system and mitigates the introduction of more serious faults & failures
Uwe Zdun, Rafael Capilla, Huy Tran, Olaf Zimmermann Mar 09, 2014
Olav Maassen, Liz Keogh & Chris Matts Mar 08, 2014