Uber Open-Sourced Its Highly Scalable and Reliable Shuffle as a Service for Apache Spark

Uber engineering has recently open-sourced its highly scalable and reliable shuffle as a service for Apache Spark.

Apache Spark is one of the most important tools and platforms in data engineering and analytics. Applications in spark are called jobs and Spark jobs have multiple stages. Shuffling is a well-known technique to transfer data among stages in a distributed environment. The tasks writing out shuffle data are known as map tasks and the tasks reading the shuffle data are known as reduce tasks. These stages and shuffling are illustrated in the following figure for more clarity.

Basic Shuffle Operation

Spark is shuffling data on local machines by default. It causes challenges while the scale is getting very large (about 10,000 nodes on Uber Scale). At this scale of operation, major reliability and scalability problems happen.

One main challenging area in using Spark at Uber scale is system reliability. Machines are generating terabytes of data to shuffle every day. This causes disk SSDs to wear out faster while they are not designed and optimized for high IO workloads. SSDs are designed to work generally for 3 years but in heavy Spark shuffling operations, they are working for about 6 months. Also, lots of failures happen for shuffling operations which decreases system reliability.

The other challenge in this area is scalability. Applications could produce lots of data that could not be fitted on a single machine. It causes a full disk exception problem. Also, When upgrading a machine to more CPU and memory, the disk size is not growing in the same proportion. This leads to competition over disk resources for many applications.

To resolve the mentioned issues, engineers at Uber architected and designed Remote Shuffle Service (RSS) as shown in the following diagrams. It solves the mentioned reliability and scalability problems in the common Spark shuffling operation.

Overall architecture of RSS with main modules as Client, Server, and Service Registry

This architecture is simply explained in the blog post by writers :

All Spark executors will use clients to talk to the service registry and remote shuffle servers. At a high level, the Spark driver will identify the unique shuffle server instances for the same partition with the help of a Zookeeper and will pass that information to all mappers and reducer tasks. As shown in the Figure, all the hosts with the same partition data are writing to the same shuffle server; once all mappers are finished writing, the reducer task for the same partition will go to a specific shuffle server partition and fetch the partition data.

By using this architecture engineers could offload 9-10 PB disk write. It increased disk wear out time from 3 to 36 months. They could also achieve reliability above 99.99% by using RSS. It also helps to evenly distribute the loads on the servers.

Uber engineering open-sourced RSS and plans to donate it to Apache Software Foundation. As for the future, Uber engineering plans to make RSS available on Kubernetes.

Data shuffling is one of the important challenges in Spark Optimization. There are other shuffle services for Spark like Mesos Shuffle Service, Cosco by Facebook, Magnet by LinkedIn, and EMR Remote Shuffle Service by Alibaba.

About the Author

Reza Rahimi

Show moreShow less

InfoQ Software Architects' Newsletter

Login with:

Don't have an InfoQ account?

Write for InfoQ

About the Author

Reza Rahimi

Rate this Article

This content is in the AI, ML & Data Engineering topic

Related Topics:

Related Editorial

Related Sponsored Content

Popular across InfoQ

The InfoQ Newsletter