Badoo is a dating social network that currently handles billions of events per day, explains Vladimir Kazanov, data platform engineering lead. At Skills Matter, he talked through some of the challenges of operating at this scale, and what tooling Badoo uses in order to process and report on this data.
The goal of the business intelligence department at Badoo is to collect user event information, processing and reporting on it to create insights. It’s these insights that help the company make formed decisions. Kazanov explains that these integral events go through a lifecycle:
- Receive: Using Protobuf, various client libraries are generated for producing events. These are then streamed through LSD, an open source streaming daemon which is used to filter and route the events.
- Store: Data is stored in a data lake in the ORC file format, running on HDFS. Events with schemas are stored in Exasol, a columnar distributed analytics database.
- Process: Data is processed using Spark, a Java-based distributed computation framework which allows data to be queried over a cluster.
- Report: A reporting tool called microstrategy is used which allows Exasol to be queried using dashboards and reports. In addition to this, a custom tool called CubeDB is used, designed to run queries faster for technical reporting.
In order to create a new event, first, a business analyst creates a schema for it. From this schema, Protobuf client libraries are generated for various platforms. Kazanov sees this cross-platform support as one of its core advantages, as it makes it easy for mobile and web applications to start publishing this new event.
When streaming events through LSD, Badoo batches the data hourly rather than doing so in realtime. This is because, in the event of failure, Kazanov believes that re-loading a batch is easier, as it’s simple to compare against the target database to see if it was written correctly.
Kazanov also believes that storing data in ORC is particularly useful. He lists some of the reasons for this as being columnar-orientated, having strong compression properties, and that it’s supported by multiple applications. It can also be easily be queried using Hive, a database on top of Hadoop with an SQL like query language.
When talking about querying data, Kazanov explains that one of the advantages of Exasol is that it uses SQL. This presents a low learning curve for developers, who don’t need to learn a new query language. On top of this though, he sees the core benefit as its performance:
Exasol allows us to store terrabytes of data in the cluster, and do really efficient queries to it. I’m talking about minutes, whereas comparable systems don’t come close.
The full talk can be watched online, with open source tools from Badoo, such CubeDB, accepting community contributions.