At the 2019 Spark AI Summit Europe conference, NVIDIA software engineers Thomas Graves and Miguel Martinez hosted a session titled "Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS Library." InfoQ recently talked with Jim Scott, head of developer relations at NVIDIA, to learn more about accelerating Apache Spark with GPUs and the RAPIDS library.
InfoQ: Can you tell our readers what the RAPIDS toolkit is?
Jim Scott: GPUs have driven the advancement of deep learning over the past several years, while ETL (Extract, Transform, Load) and traditional machine-learning workloads continue to be written in Python, often with single-threaded tools like Scikit-Learn or large, multi-CPU distributed solutions like Spark.
RAPIDS is a suite of open-source software libraries and APIs for executing end-to-end data science and analytics pipelines entirely on GPUs, allowing for substantial speed-ups, particularly on large data sets. Built on top of NVIDIA CUDA, RAPIDS exposes GPU parallelism and high-bandwidth memory speed through user-friendly Python interfaces, and a DataFrame API that integrates with a variety of machine-learning algorithms for end-to-end pipeline acceleration.
For scaling from GPU workstations to multi-GPU servers and multi-node clusters, the RAPIDS toolkit integrates with Dask or Apache Spark.
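To make that concrete, here is a minimal sketch of the RAPIDS DataFrame API (cuDF) and its Dask integration; the file names and column names are hypothetical, and the pandas-style calls shown are among those cuDF supports on the GPU.

```python
# A minimal sketch of the RAPIDS DataFrame API; file paths and column
# names are hypothetical. cuDF mirrors much of the pandas API, but each
# operation executes on the GPU.
import cudf

# Read a CSV directly into GPU memory and aggregate on the GPU.
gdf = cudf.read_csv("transactions.csv")            # hypothetical input file
totals = gdf.groupby("customer_id")["amount"].sum()
print(totals.head())

# Scaling out: dask_cudf partitions the same workload across multiple GPUs.
import dask_cudf

ddf = dask_cudf.read_csv("transactions-*.csv")     # hypothetical file glob
print(ddf.groupby("customer_id")["amount"].sum().compute())
```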
InfoQ: Apache Arrow and GPU-accelerated ETL were discussed in the Spark AI Summit session. What are the benefits of Apache Arrow memory format with GPUs for ETL?
Scott: The cost of copying and converting data formats between different storage, ETL, and machine-learning frameworks can frequently exceed that of the underlying GPU kernel execution.
RAPIDS takes advantage of Apache Arrow, a cross-language development platform for in-memory data.
Arrow specifies a standardized language-independent columnar memory format optimized for data locality to accelerate analytical processing performance on modern hardware like CPUs and GPUs.
Arrow also provides zero-copy streaming messaging and inter-process communication without serialization overhead.
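As an illustration, the following sketch uses Arrow's Python bindings (pyarrow) to build a columnar table and round-trip it through the Arrow IPC stream format; the data values are made up.

```python
# A small sketch of Arrow's columnar format and IPC in Python;
# the data values are illustrative.
import pyarrow as pa

# Build a columnar table: each column is a contiguous Arrow array.
table = pa.table({
    "id": pa.array([1, 2, 3], type=pa.int64()),
    "score": pa.array([0.9, 0.7, 0.8], type=pa.float64()),
})

# Serialize with the Arrow IPC stream format...
sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, table.schema) as writer:
    writer.write_table(table)
buf = sink.getvalue()

# ...and read it back. Because the wire format matches the in-memory
# layout, the reader can reference the buffer without per-value
# serialization or conversion.
reader = pa.ipc.open_stream(buf)
roundtrip = reader.read_all()
assert roundtrip.equals(table)
```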
InfoQ: XGBoost was also discussed in the Spark AI Summit session. What are the advantages of XGBoost + RAPIDS?
Scott: XGBoost is a scalable, distributed gradient-boosted decision tree (GBDT) machine-learning library. XGBoost provides parallel tree boosting and is the leading machine-learning library for regression, classification, and ranking problems. The RAPIDS team works closely with the Distributed Machine Learning Community (DMLC) XGBoost organization, and XGBoost now includes seamless, drop-in GPU acceleration, significantly speeding up model training and improving accuracy for better predictions.
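As a rough illustration of how "drop-in" that acceleration is, the sketch below trains on synthetic data; in XGBoost releases of that era, switching the tree_method parameter from "hist" to "gpu_hist" was the only change needed to move training onto the GPU.

```python
# A minimal sketch of XGBoost's drop-in GPU acceleration; the synthetic
# data is illustrative.
import numpy as np
import xgboost as xgb

X = np.random.rand(100_000, 50)
y = np.random.randint(2, size=100_000)
dtrain = xgb.DMatrix(X, label=y)

params = {
    "objective": "binary:logistic",
    "tree_method": "gpu_hist",   # CPU equivalent: "hist"
}
model = xgb.train(params, dtrain, num_boost_round=100)
```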
InfoQ: How does RAPIDS XGBoost4J-Spark reduce classification-model training time and cost?
Scott: RAPIDS XGBoost4J-Spark has three features that reduce training time and cost:
- GPU-accelerated reader: This reads supported input file formats, regardless of their number or size, directly into GPU memory and divides the data evenly among the different training nodes.
- GPU-accelerated training: We have improved XGBoost training time with a dynamic in-memory representation of the training data that optimally stores features based on the sparsity of the dataset, rather than a fixed in-memory representation based on the largest number of features among the different training instances (see the training sketch after this list).
- Efficient GPU memory utilization: XGBoost requires that data fit into memory, which restricts the size of data that can be used with either a single GPU or distributed multi-GPU, multi-node training. The latest release has improved GPU memory utilization by 5X; users can now train with data five times the size of what the first version supported. This is one of the critical factors in improving the total cost of training without impacting performance.
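XGBoost4J-Spark itself exposes a Scala/Java API on top of Spark; to keep the examples in Python, the sketch below shows the same distributed pattern (data read directly into GPU memory, training spread across multiple GPUs) through RAPIDS' Dask integration. The cluster setup and file names are hypothetical.

```python
# A Python sketch of distributed multi-GPU training with RAPIDS' Dask
# integration; file paths are hypothetical.
import dask_cudf
import xgboost as xgb
from dask_cuda import LocalCUDACluster
from dask.distributed import Client

cluster = LocalCUDACluster()        # one Dask worker per local GPU
client = Client(cluster)

# Each partition is read directly into the memory of a worker's GPU.
ddf = dask_cudf.read_csv("train-*.csv")     # hypothetical file glob
X = ddf.drop(columns=["label"])
y = ddf["label"]

dtrain = xgb.dask.DaskDMatrix(client, X, y)
output = xgb.dask.train(
    client,
    {"objective": "binary:logistic", "tree_method": "gpu_hist"},
    dtrain,
    num_boost_round=100,
)
booster = output["booster"]
```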
InfoQ: Can you discuss how Spark 3.0 empowers GPU apps?
Scott: Apache Spark 3.0 empowers GPU applications by providing user APIs and configurations to easily request and utilize GPUs, and it is now extensible to allow columnar processing on the GPU.
Internally, Spark added GPU scheduling, further integration with the cluster managers (YARN, Kubernetes, etc.) to request GPUs, and plugin points that allow it to be extended to run operations on the GPU. This makes GPUs easier for Spark application developers to request and use, allows for closer integration with deep learning and AI frameworks such as Horovod and TensorFlow on Spark, and allows for better utilization of GPUs. The extensibility for columnar processing opens up the possibility for users to add plugins that accelerate queries using the GPU.
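For example, a PySpark application might request GPUs and enable columnar GPU processing like this; the resource amounts and discovery-script path are illustrative, and com.nvidia.spark.SQLPlugin is the plugin class shipped with NVIDIA's RAPIDS Accelerator for Apache Spark.

```python
# A sketch of Spark 3.0's GPU scheduling configuration from PySpark;
# resource amounts and the discovery-script path are illustrative.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("gpu-accelerated-etl")
    # Ask the cluster manager for one GPU per executor...
    .config("spark.executor.resource.gpu.amount", "1")
    # ...and let four tasks share it.
    .config("spark.task.resource.gpu.amount", "0.25")
    # Script that tells Spark which GPU addresses an executor can use.
    .config("spark.executor.resource.gpu.discoveryScript",
            "/opt/spark/getGpusResources.sh")   # illustrative path
    # Columnar-processing plugin that moves SQL operators onto the GPU.
    .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
    .getOrCreate()
)

# From here, ordinary DataFrame/SQL code can run on the GPU unchanged.
df = spark.range(1_000_000).selectExpr("id", "id % 10 AS bucket")
df.groupBy("bucket").count().show()
```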
The GPU-accelerated stack shown below illustrates how NVIDIA technology will accelerate Spark 3.0 applications without application code changes.