Discord Migrates Trillions of Messages from Cassandra to ScyllaDB

Discord has migrated trillions of message records from Apache Cassandra to ScyllaDB, reducing the size of the largest cluster from 177 Cassandra nodes to 72 ScyllaDB nodes and reducing tail latencies for reads and writes. The move has unlocked new product use cases because of the improved database stability and performance.

As Discord grew, it migrated its data from MongoDB to Cassandra in 2017 because it was looking for a scalable database to handle ever-growing data volumes. Initially, the Cassandra cluster consisted of 12 nodes and stored billions of messages. Still, after five years, the cluster had 177 nodes and was frequently experiencing performance problems, forcing the team to reduce some of the maintenance operations, which became too expensive to run.

Some performance issues were caused by hot partitions resulting from the table schema design, where partitioning was based on the Discord channel and time bucket. Bo Ingram, a senior software engineer at Discord, explains the impact of hot partitions on the database cluster:

When we encountered a hot partition, it frequently affected latency across our entire database cluster. One channel and bucket pair received a large amount of traffic, and latency in the node would increase as the node tried harder and harder to serve traffic and fell further and further behind. [...] Since we perform reads and writes with a quorum consistency level, all queries to the nodes that serve the hot partition suffer latency increases, resulting in a broader end-user impact.

Based on the experimenting and testing done internally, the team has decided to move its data across all clusters to ScyllaDB. They opted for ScyllaDB primarily to improve performance, including avoiding garbage-collection-related issues they experienced with Cassandra. They also worked with the ScyllaDB team to improve some use cases they depended on, like reverse queries.

After migrating all smaller clusters by 2020, the team prepared to migrate the biggest cluster, containing trillions of messages. To minimize the hot partition problem, they created a new intermediary service layer in their architecture, named data services, written in Rust and interfaced via gRPC API.

One important responsibility of data services is request coalescing, which avoids multiple database calls when many users request the same message. Secondly, the team implemented consistent hash-based routing to data service instances based on a routing key, such as a channel id. Together, these changes significantly reduced hot partition problems, giving Discord extra time to prepare for the big migration.

Source: https://discord.com/blog/how-discord-stores-trillions-of-messages

For the migration itself, the team first considered using ScyllaDB’s Apache Spark migrator but, in the end, decided to implement a bespoke solution in Rust with SQLite used for checkpointing, which allowed them to shorten the migration time from three months to nine days. After addressing some minor hiccups, they validated the completed migration and switched to ScyllaDB in May 2022. Since then, the new cluster has proven stable and provided consistent performance, which, together with the data service layer, allowed it to handle extra traffic generated by the World Cup gracefully.

About the Author

Rafal Gancarz

Show moreShow less

InfoQ Software Architects' Newsletter

Login with:

Don't have an InfoQ account?

InfoQ Article Contest

About the Author

Rafal Gancarz

Rate this Article

This content is in the Database topic

Related Topics:

Related Editorial

Related Sponsored Content

Popular across InfoQ

The InfoQ Newsletter