BT

Key Lessons Learned from Transition to NoSQL at an Online Gambling Website

Posted by Dan Macklin on Nov 21, 2015 |

As one of the world’s biggest online gambling websites - serving around 19 million customers in nearly 200 hundred countries - everything about bet365 and its IT is on a huge scale. 

It has to be. At peak times our betting platform runs at more than half a million transactions per second and must serve in excess of 2.5 million concurrent users.

Downtime is expensive, both in terms of loss of income and brand perception. Customers are quickly frustrated by any dips in service and are very vocal on social media in expressing that disappointment! Clearly, in an environment such as this, availability is a primary requirement.

But large and demanding as our systems already are, maintaining our leadership position calls for more than just a flawless service. In addition to ensuring we can meet the growing demand, we are also under pressure to introduce new services that move the customer experience forward.  

As you'd expect, introducing new services places even more demands on an already highly complex system and there is only so hard you can push. 

As it was, our SQL architecture was being pushed to its limits. Having already scaled our databases as far as it was cost effective to go, it was clear we needed to find a new way of working. 

The R&D division had been set up a few years earlier to deal with challenges of this nature and so it fell to my team to find the solution.

We found it in open-source software. By moving to Erlang programming language and Riak Key Value NoSQL data store coupled with Convergent Replicated Data Types (CRDTs), we can now quickly build new systems to handle the load.

Erlang

We chose Erlang because we found that it makes it easier to build reliable, simple systems that scale horizontally.  These attributes come from Erlang's concurrency semantics, “let it crash” error handling philosophy and functional programming nature. 

Erlang concurrency is designed around the actor model and encourages an elegant style of programming where problems are modelled by many isolated processes (actors) that communicate through immutable message passing.  

Each process has its own heap and by default is very lightweight (512 bytes) making it practical to spin up many hundreds of thousands of processes on commodity type servers.  These individual processes are scheduled by a virtual machine over all available processor cores in a soft real time manner making sure that each process gets a fair share of processing time. 

The fact that each Erlang process has its own heap means that it can crash independently without corrupting shared memory.  This makes it far easier for programmers to deal with failure and build software where individual subsystems can fail without bringing down the entire system. If something goes wrong at a process level simply let it crash and rely on Erlang’s supervision functionality to automatically restart it.

Finally Erlang’s functional heritage makes it easy (with a little discipline) to compose complex functionality from many small side effect free functions that can be easily tested.  The number of lines of code needed to solve a problem is reduced and on the whole, once you get used to the syntax, the code becomes far simpler and easy to maintain.

The net result was by adopting Erlang we could greatly improve the parallelism, throughput and reliability of our software, whilst writing fewer lines of simpler code. In our experience simple code is good code.

NoSQL Database

When we first started investigating NoSQL databases, we looked at many different types of systems from many different companies. We soon realised that a Key Value Store seemed to be an interesting architecture for us to look at, because it was a good fit for our use cases and there are several Key Value databases on the market.

We were drawn to Amazon’s well-known Dynamo system, a reliable, distributed, masterless key/value store used to maintain shopping carts.  From the Dynamo paper, several companies have implemented open source versions of the architecture. We studied these and chose Riak KV database because it has more features to support data consistency at scale.

Let’s look at some of these features that are critical to our requirements and use cases.

The problem for us is as soon as you've got data in a distributed system, managing the consistency of that data becomes much more difficult, particularly if you value performance and availability as we do.  In the event of a network partition - a temporary failure or network split between datacentres – some members of the distributed system can’t talk to each other for a period of time.  

This is important because in a distributed system running many nodes (in our infrastructure we are using hundreds of nodes), in many datacentres, even if you have the most expensive kit in the world failures will happen, servers will go down and networks will break.   This makes it imperative that systems can deal with these conditions without losing or corrupting data.

The CAP theorem says that in the event of a network partition where a quorum of nodes is unable to communicate you have two options. You can go for consistency - a CA system - or you can go for availability - an AP system. If you go down the consistency route, it effectively means that if your system finds it can’t be consistent it will sacrifice availability to preserve the consistency of your data. For us, this would be unacceptable.

In an Available system, the kind that’s essential for us, the system will carry on working. The problem here is that if a system keeps working in a network-partitioned environment or in the face of many concurrent unsynchronised updates you can’t hope to maintain data consistency. Data is bound to get out of sync eventually and you will need to repair it.  The most that you can hope for is that it will become eventually consistent.

Eventual consistency throws up a whole host of new challenges.  For example in an eventually consistent system you can’t even guarantee that a client will be able to read its own writes.  So how do you build systems that can cope? 

The answer is you have to have a deep understanding of your use-cases and take a probabilistic approach (how likely is it that my data is inconsistent, how long will it take to repair, how does the user interact with the application, what is the probability of that happening and if it does can the business live with the risk?) and adopt mechanisms that detect inconsistency and repair it without introducing a burden on developer productivity.

Fortunately Riak has a number of features in this area that differentiate it from the competition. 

Firstly Riak uses a mechanism called Vector Clocks (a causality tracking mechanism based on tuples of actors and counters) to detect concurrent updates to data without the need for a central clock (which is very hard if not impossible to keep completely in sync in a distributed system). 

If Riak detects a problem (for example the same data has been updated on two sides of a network partition) it stores both copies of the data (known as siblings) and returns them to the client application for resolution on the next read.  

This means you may have two, or three, or four, or even five or more copies of your data that has been saved and you’ve got to write a merge function that makes sense of it.  This is a great step forwards and is much better than simply using timestamps to keep the last written value (an approach taken by other systems) but still leaves developers with the difficult task of writing bespoke merge functions to repair data.

Initially when we looked into this issue we concluded that while Riak alone is great technology, writing merge functions was likely to confuse the hell out of all our developers and slow down development - something that is also not going to be acceptable for a company like ours, which is all about getting things done at pace.  So we did more research and that’s when we discovered Convergent Replicated Data Types – CRDT’s.

State based CRDT’s are a relatively new class of algorithm that typically operate over sets and have the mathematical properties of Associativity, Commutativity and Idempotence (which just happen to be the algorithmic characteristics required to produce a deterministic merge function over a number of sets). 

Fortunately after some analysis we found that much of our data could be modelled within sets so by leveraging CRDT’s our developers don't have to worry about writing bespoke merge functions for 95% of carefully selected use cases.   This gave us the confidence to push ahead with the project as it gives the best of both worlds.

You've got an available system, where you deal with Eventual Consistency and if something goes wrong or something happens in parallel that you’re not expecting to happen, you’ve got a way of maintaining a level of consistency (eventually).   From version 2.0 Basho (the developers of Riak) have built numerous types of CRDT into Riak.

Initially CRDT’s can be quite confusing and mysterious.  Even the name is confusing as there are two types of CRDT which are both slightly different.  Operation based CRDT’s are known as Commutative Replicated Data Types and State based CRDT’s are known as Convergent Replicated Data Types.

In Operation based CRDT’s commutative operations are propagated and applied to all data replicas in situ, and as such their implementation resembles a distributed log-ship.  Whilst this approach reduces transmission volumes, the fact that these operations are not idempotent means that practical systems require additional network protocol guarantees that are not easy to implement.  Therefore most state of the art production systems end up going with state based CRDT’s.  In State based CRDT’s full state updates are sent to all replicas.

Upon receiving a new state the replica queries its own state, runs an update function (which must monotonically increase the state – for example adding a value to an ordered set is a monotonic operation) and finally calls a merge function (which must be commutative, associative and idempotent), to bring causal consistency to the new and previous states.

If this sounds tricky hopefully some examples will help.  At bet365 we ended up using the ORSWOT (Observe Remove Set without Tombstones) as it facilitated add and remove operations in a relatively space efficient manner.  Each ORSWOT is stored as a value in a key value store.

The grey text represents our data.  In this case it’s a set of Erlang binary strings ([<<"Data1">>, <<"Data2">>, <<"Data3">>]).  The green text represents a Version Vector (a type of server side Vector Clock) whose job is to keep track of the entire sets top level causal history.  The blue text next to each element represents its Dot (Dotted Version Vector).   The Dot stores the actor and its associated count for the last mutation.

So let’s take a look at what happens during an add operation.  For simplicity let’s set our initial state to be:

{[{x,1}], [ {<<“Data1”>>, [{x,1}]}]}

Adding <<“Data2”>> to this set using unique actor y results in:

{[{x,1},{y,1}], [ {<<“Data1”>>, [{x,1}]}, {<<“Data2”>>,[{y,1}]}]}

Note that the new actor y has been added to the Version Vector and <<"Data2">> has been added to the set with a birth Dot of [{y,1}].

That was easy enough, so what happens if you need to merge two concurrently updated ORSWOTS (ORSWOT A and ORSWOT B)?

First we’ll set up some new data for our example.

We will be merging ORSWOT A:

ORSWOT A = {[{x,1},{y,2}], [ 
{<<“Data1”>>,[{x,1}]},{<<“Data2”>>,[{y,1}]},{<<“Data3”>>,[{y,2}]}]}

Where ORSWOT A has seen the addition of:

  1. element <<“Data1”>> via actor x
  2. element <<“Data2”>> via actor y
  3. element <<“Data3”>> via actor y

With ORSWOT B:

ORSWOT B = { [{x,1},{y,1},{z,2}], [ {<<“Data2”>>,[{y,1}]}, {<<“Data3”>>,[{z,1}]}, 
{<<“Data4”>>,[{z,2}]}]}

Where ORSWOT B has seen:

  1. The adding element <<“Data1”>> via actor x    
  2. The adding element <<“Data2”>> via actor y
  3. The adding element <<“Data3”>> via actor z    
  4. The adding element <<“Data4”>> via actor z
  5. The removal of element <<“Data1”>> via actor z

Resulting in a merged result of:

{[{x,1},{y,2},{z,2}], [{<<“Data2”>>,[{y,1}]},{<<“Data3”>>,[{y,2},{z,1}]},{<<“Data4”>>,[{z,2}]}]}

Which reduces to [<<"Data2">>, <<"Data3">>, <<"Data4">>] once all of the ORSWOT meta-data is removed.

I will now try to demystify the merge function.  At a high level this involves looking for common elements and establishing “a happens before relationship” between each elements dot and the Version Vector in the other set.

First we’ll look at the common elements in both sets, as they are the easiest to understand.

  • Data2 is retained as it exists in both sets with the same dot {y,1}.
  • Data3 is retained as :

    It’s element in ORSWOT A has a dot {y,2} which has a greater count than {y,1} from ORSWOT B’s version vector of [{x,1},{y,1},{z,2}].

    It’s element in ORSWOT B has a dot {z,1} which does not exist in than ORSWOT A’s version vector of [{x,1},{y,2}] so is implicitly greater.

    Therefore the dots {y,2} and {z,1} are merged to give a new dot of [{y,2},{z,1}] thereby maintaining the elements causal history for subsequent operations.

Next let’s look at the elements solely from ORSWOT A:

  • <<"Data1">> is deleted because its dot of [{x,1}] is not dominated by (less than or equal to) ORSWOT B’s Version Vector of [{x,1},{y,2},{z,2}].  This suggests that ORSWOT B has seen the addition of <<"Data1">> by Actor x and its subsequent removal by Actor z.

Finally let’s look at the elements solely from ORSWOT B:

  • <<"Data4">> is included because it’s dot of [{z,2}] dominates (is greater than) ORSWOT A’s Version Vector [{x,1}, {y,2}].

Hopefully this has given a taste of how the ORSWOT State based CRDT works.  For more details I’d recommend to check out riak_dt github page which contains a full Erlang implementation of the algorithm.

Conclusions

In summary given the probabilistic nature of eventually consistent systems (even those that use CRDT’s) one of our main insights is that it’s important that development teams really understand their use cases. We approached this by building many Proofs of Concepts so that we could demonstrate the different failure scenarios and see where Eventual Consistency could safely be applied to our business.

Talking about something theoretically is one thing - showing exactly how this failure or that failure could happen is something else altogether. In R&D we take a very activist attitude, and rely on POCs to make sure things are working as expected and demonstrate trade-offs, rather than sitting there poring over the theoretical aspects.

The environment in which we operate has been changing very rapidly and the move to distributed systems and technologies like Erlang and Riak NoSQL, have been progressing concurrently, so our developers have had to deal with multiple sea changes in the way they work. 

It’s natural for development teams to resist radical change, be cautious and stick to what is known. But the task for an R&D team like ours is to say, “If we want to be market leading then we must use technology to get a business edge. To do this we need to innovate and take calculated risks - it’s crucial to think differently – and most importantly to bring the wider development teams along with the change”

A lot of the marketing hype about NoSQL suggests you can just implement it and it will work straight out of the box.  Our own experience is that if you’re solving large and interesting real-world problems and/or dealing with existing systems this is very rarely the case.

Some of the techniques that we now use to structure our data are very alien - if we applied them to our SQL databases we would get fired - but in a NoSQL environment they work well. So sometimes you have really got to think in a totally different way about how you structure the system and sometimes, non standard approaches work.

About the Author

Dan Macklin is a hands-on technical manager who loves to learn, make things happen and get things done.  After running his own business for ten years, Dan is now the Head of Research and Development at bet365.

Rate this Article

Relevance
Style

Hello stranger!

You need to Register an InfoQ account or or login to post comments. But there's so much more behind being registered.

Get the most out of the InfoQ experience.

Tell us what you think

Allowed html: a,b,br,blockquote,i,li,pre,u,ul,p

Email me replies to any of my messages in this thread

correction by Eugene Sotirescu

"As one of the world’s biggest online gaming websites"

bet365 is an online GAMBLING, not "gaming" website.

Re: correction by Roxana Bacila

Hi Eugene,
Thank you for reporting the error. It is now fixed.
Roxana Bacila
Community Contributions Facilitator
InfoQ Enterprise Software Development Community

How does it compare to Gemstone? by Cameron Purdy

Previously Bet365 was a Gemstone user, right? www.businesswire.com/portal/site/google/?ndmVie...

Allowed html: a,b,br,blockquote,i,li,pre,u,ul,p

Email me replies to any of my messages in this thread

Allowed html: a,b,br,blockquote,i,li,pre,u,ul,p

Email me replies to any of my messages in this thread

3 Discuss
General Feedback
Bugs
Advertising
Editorial
Marketing
InfoQ.com and all content copyright © 2006-2016 C4Media Inc. InfoQ.com hosted at Contegix, the best ISP we've ever worked with.
Privacy policy
BT

We notice you’re using an ad blocker

We understand why you use ad blockers. However to keep InfoQ free we need your support. InfoQ will not provide your data to third parties without individual opt-in consent. We only work with advertisers relevant to our readers. Please consider whitelisting us.