Spark Sets New Record in Sort Performance

by Benjamin Darfler on Nov 26, 2014. Estimated reading time: 2 minutes

Databricks, a company founded by the creators of Apache Spark, has recently announced a new record in the Daytona GraySort contest using the Spark processing engine. The Daytona GraySort contest is a third-party benchmark measuring how fast a system can sort 100 terabytes of data. For their official run, Databricks posted a throughput of 4.27 TB/min on a cluster of 206 machines, a 3x performance improvement using 10x fewer machines than the previous record, which was submitted by Yahoo! running Hadoop MapReduce.
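To put the headline numbers in perspective, the short calculation below derives the approximate run time and per-machine throughput from the figures quoted above. It is only a back-of-the-envelope sketch: the function names are made up for illustration, and it assumes decimal units (1 TB = 1,000,000 MB) and a constant throughput over the whole run.

```python
# Figures quoted in the article.
TOTAL_TB = 100.0               # size of the GraySort input
THROUGHPUT_TB_PER_MIN = 4.27   # reported cluster throughput
MACHINES = 206                 # reported cluster size


def elapsed_minutes(total_tb, tb_per_min):
    """Approximate run time, assuming constant throughput."""
    return total_tb / tb_per_min


def per_machine_mb_per_s(tb_per_min, machines):
    """Average sort throughput per machine in MB/s.

    Assumes decimal units: 1 TB = 1,000,000 MB.
    """
    return tb_per_min * 1_000_000 / machines / 60


if __name__ == "__main__":
    print(f"{elapsed_minutes(TOTAL_TB, THROUGHPUT_TB_PER_MIN):.1f} minutes")
    print(f"{per_machine_mb_per_s(THROUGHPUT_TB_PER_MIN, MACHINES):.0f} MB/s per machine")
```

Under these assumptions the run takes a little over 23 minutes, with each machine sorting on the order of a few hundred megabytes per second end to end.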

In a blog post announcing their submission to the Daytona GraySort contest, Databricks explained some of the technological improvements recently introduced to Spark that allowed it to sustain such a large throughput.

Spark 1.1 introduced a new shuffle implementation called sort-based shuffle. The previous shuffle implementation required an in-memory buffer for each partition in the shuffle, which led to notable memory overhead. The new sort-based shuffle requires only one in-memory buffer at a time. This significantly reduces memory usage and allows considerably more tasks to run concurrently on the same hardware.
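The idea behind a sort-based shuffle can be illustrated with a toy model. The sketch below is not Spark's implementation: the names `partition_for`, `sort_based_shuffle_write` and `read_partition` are hypothetical, keys are assumed to be integers, and a Python list stands in for the output file. It only shows how a single buffer, sorted by target partition and accompanied by an offset index, can replace one buffer per partition.

```python
def partition_for(key, num_partitions):
    """Hypothetical partitioner: integer keys are assigned by modulo."""
    return key % num_partitions


def sort_based_shuffle_write(records, num_partitions):
    """Toy model of one map task's shuffle output.

    Returns (data, index): `data` is a single list ordered by target
    partition, and index[p]:index[p + 1] is the slice for partition p.
    """
    # One buffer holds every record, tagged with its target partition.
    tagged = [(partition_for(key, num_partitions), key, value)
              for key, value in records]
    # Python's sort is stable, so records keep their input order
    # within each partition.
    tagged.sort(key=lambda record: record[0])

    # Count records per partition, then turn counts into start offsets.
    index = [0] * (num_partitions + 1)
    for partition, _, _ in tagged:
        index[partition + 1] += 1
    for p in range(num_partitions):
        index[p + 1] += index[p]

    data = [(key, value) for _, key, value in tagged]
    return data, index


def read_partition(data, index, partition):
    """What a reduce task would fetch: one contiguous slice."""
    return data[index[partition]:index[partition + 1]]


if __name__ == "__main__":
    records = [(5, "a"), (2, "b"), (4, "c"), (1, "d"), (3, "e")]
    data, index = sort_based_shuffle_write(records, num_partitions=2)
    print(index)
    print(read_partition(data, index, 0))
    print(read_partition(data, index, 1))
```

In this model the number of buffers no longer grows with the number of partitions, which is the property the paragraph above credits for the reduced memory usage.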

In addition to the new shuffle algorithm, the network module was revamped based on Netty’s native Epoll socket transport, which maintains its own pool of memory, bypassing the JVM’s memory allocator and reducing the impact of garbage collection. The new network module was then used to build an external shuffle service that allows shuffled files to be served even during garbage collection pauses in the main Spark executor.

Finally, Spark 1.1 included TimSort as its new default sorting algorithm. TimSort is derived from merge sort and insertion sort and performs better than quicksort on most real-world datasets, especially those that are partially ordered.
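The way TimSort combines its two parent algorithms can be sketched in a few lines. The version below is a deliberate simplification and not the code used by Spark: it cuts the input into fixed-size runs instead of detecting naturally ordered runs, and it omits the galloping merge that gives the real algorithm its edge on partially ordered data. The names `simple_timsort` and `run_size` are chosen here for illustration.

```python
def insertion_sort(items, lo, hi):
    """Sort items[lo:hi] in place; efficient for short runs."""
    for i in range(lo + 1, hi):
        current = items[i]
        j = i - 1
        while j >= lo and items[j] > current:
            items[j + 1] = items[j]
            j -= 1
        items[j + 1] = current


def merge(left, right):
    """Merge two sorted lists, preferring `left` on ties (stable)."""
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        if right[j] < left[i]:
            merged.append(right[j])
            j += 1
        else:
            merged.append(left[i])
            i += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged


def simple_timsort(items, run_size=32):
    """Simplified TimSort-style sort; returns a new sorted list."""
    items = list(items)
    n = len(items)
    # Step 1: insertion-sort each fixed-size run.
    for lo in range(0, n, run_size):
        insertion_sort(items, lo, min(lo + run_size, n))
    # Step 2: merge neighbouring runs, doubling the run width each pass.
    width = run_size
    while width < n:
        for lo in range(0, n, 2 * width):
            mid = min(lo + width, n)
            hi = min(lo + 2 * width, n)
            if mid < hi:
                items[lo:hi] = merge(items[lo:mid], items[mid:hi])
        width *= 2
    return items


if __name__ == "__main__":
    print(simple_timsort([5, 3, 8, 1, 9, 2], run_size=2))
```

Insertion sort is cheap on short or nearly sorted runs, while merging keeps the overall cost at O(n log n); the real algorithm builds on that combination by exploiting the ordering already present in the input.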

All of these improvements allowed the Spark cluster to sustain 3 GB/s/node of I/O activity during the map phase and 1.1 GB/s/node of network activity during the reduce phase, which saturated the 10 Gbps Ethernet link.
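The claim that 1.1 GB/s saturates a 10 Gbps link follows from a simple unit conversion, sketched below. The helper names are illustrative, and the calculation uses the raw line rate; usable throughput is somewhat lower in practice because of protocol overhead.

```python
def link_capacity_gb_per_s(link_gbps):
    """Convert a line rate in gigabits/s to gigabytes/s (8 bits per byte)."""
    return link_gbps / 8


def utilization(observed_gb_per_s, link_gbps):
    """Fraction of the raw line rate consumed by the observed traffic."""
    return observed_gb_per_s / link_capacity_gb_per_s(link_gbps)


if __name__ == "__main__":
    print(f"capacity: {link_capacity_gb_per_s(10):.2f} GB/s")
    print(f"utilization: {utilization(1.1, 10):.0%}")
```

A 10 Gbps link carries at most 1.25 GB/s, so 1.1 GB/s per node is roughly 88% of the raw line rate, close to what such a link can deliver once overhead is accounted for.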

Spark is an advanced execution engine born out of research done at the AMPLab at UC Berkeley. It allows programs to run up to 10x faster than Hadoop MapReduce when data is on disk, and up to 100x faster when data resides in memory. Spark supports programs written in Java, Scala or Python and uses familiar functional programming constructs to build data processing flows.

Spark has garnered significant attention as a next-generation execution platform for Hadoop and is seen by some as a replacement for MapReduce. It graduated to a top-level Apache project in February and has since been included in the Cloudera, Hortonworks and MapR Hadoop distributions. More recently, Hortonworks announced they will support running Hive on Spark as part of their Stinger.next initiative.

Databricks was founded in 2013 as a commercial entity supporting Spark and its associated projects. Those projects include Spark Streaming for stream processing, Spark SQL for querying Hive data and MLlib for machine learning.
