Presentation: Progressive Architectures at the Royal Bank of Scotland
In their presentation posted at InfoQ systems and data architects Ben Stopford, Farzad Pezeshkpour and Mark Atwell show how RBS leveraged new technologies in their architectures while facing difficult challenges such as regulation, competition and tighter budgets. They also need to cope with stringent technical challenges including efficiency and scalability.
In high frequency trading an unavailable system causes significant costs, even if the unavailability only lasts for a second. As Stopford explains, a full garbage collection might lead to such a break which is why full GC needs to be avoided under all circumstances. More technical details about pauses in the CMS (Concurrent Mark Sweep) garbage collection being used can be found in Alexey Ragozin's blog.
In the Manhattan Processor project the RBS engineers considered two solutions. While the first one relies on a huge amount of Eden space and small old generations, the second model uses a small Eden space that can be collected in a predictable time (for example, in one millisecond.) The engineers decided for the latter option, because, 3rd party libraries cannot always be controlled, the creation of garbage is inevitable, and smaller collections are tolerable. The Manhattan Processor comprises concurrent processing with a small garbage footprint. Its architecture introduces a non-allocating queue specialized for multiple producers and single consumers. Stopford emphasizes the following benefits of the resulting architecture:
- Fast, pluggable architecture for concurrent processing
- Predictable pauses <1ms
- No full collection pauses in 15hr trading day
Farzad Pezeshkpour talked about handling business risk in the financial domain. He talked about using multi-step Monte Carlo simulations, which require large amounts of storage and computations, to handle business risk - all those interested in the mathematical foundation may read an article on how to apply such simulations to credit risk.
But how can IT handle the issue of storage volumes and efficient computations? A standard Grid architecture with application servers connected to directors and brokers which themselves are connected to the grid computers was the base for further investigation. Such an environment must deal with terabytes of storage and requirements such as speed, robustness, scalability, interoperability with Linux and Windows servers, as well as the need for an adequate support infrastructure. Among the possible options are databases, storage appliances, and distributed filesystems. The software engineers decided for using the hadoop HDFS distributed file system. In their architecture approach they collocated grid engines and HDFS nodes using 10 GB Ethernet for inter-node communication, after experimenting with various architecture alternatives.
As Pezeshkpour mentioned, there'll be further optimizations in the future with the innovations addressing all aspects, "methodologies, architecture & engineering."
Ben Stopford covered one core issue related to messaging. Whenever facts are distributed via messages, different actors may see different historical versions of the same facts. That's why Scott Marcar (Head of Risk and Finance Technology at RBS) asked for a centralized approach:
The copying of data lies at the root of many of the bank’s problems. By supplying a single real-time view that all systems can interface with we remove the need for reconciliation and promote the concept of truly shared services.
Two factors make this difficult to solve: Throughput and low latency. The base element of RBS's ODC approach is the Java-based in-memory data grid solution Oracle Coherence database. RBS needed to add joins to recompose different slices of data and handle object versioning. Stopford and his team introduced various patterns to add these joins while avoiding performance and memory penalties.
They implemented replication to increase performance while at the same time applying the Connected Replication Pattern that only replicates used data. This strategy reduces replication necessities to one-tenth and depends on tracking data usage. ODC consists of two tiers, an in-memory database tier on top and a sharded in-memory database below. Both tiers together enable RBS to meet their desired quality attributes.
Mark Atwell presented an approach for data virtualization that prevents the YAS (Yet Another System) syndrome by introducing a strategic DAL (Data Access Layer) instead. The RBS architects created a uniform layer that offers a unified database view. On top of this layer virtual transformations take place that adapt the data in accordance with what the accessing applications expect. As Atwell explains, the approach reveals benefits such as improvements in distributed joins & query optimizations. But there are also some liabilities such as "Pesky physics still get in the way!" but also the necessity to create adapters for some "unusual" data sources.
As the RBS show cases illustrate, software engineering for large systems is not only about technology but also about how to leverage the right technologies efficiently and effectively within an appropriate architecture.
It'd be interesting to know if dispatching work to multiple consumers (or conflating to one) done in the Manhattan processor would be easier to solve in Erlang. I don't know enough to answer that one I'm afraid.
Dimitar Bakardzhiev Mar 29, 2015