A team of scientists at Facebook AI Research (FAIR) announced a system for training deep-learning recommendation models (DLRM) using PyTorch on a custom-built AI hardware platform, ZionEX. Using this system, the team trained models with up to 12T parameters and achieved nearly an order-of-magnitude speedup in training time compared to other systems.
The research was described in a paper published on arXiv. The ZionEX hardware, the latest iteration of the Zion platform, is a cluster of thousands of custom-built compute nodes; each node has 4 CPU sockets and 8 GPUs, as well as 4 network interface controllers (NICs) for the CPUs and an RDMA over Converged Ethernet (RoCE) NIC for each GPU. The training software is based on Facebook's deep-learning framework PyTorch, with several custom-coded operators and enhancements to the framework's distributed training features, including GPU pipelining and network communication tuning. To evaluate the performance of the system, the team trained DLRMs of various sizes, from 95B parameters to 12T. Compared to the previous-generation training system, ZionEX achieves nearly an order-of-magnitude speedup in throughput.
Deep-learning recommendation models are used extensively by Facebook's news feed and search features to rank results and predict click-through rate; according to Facebook, DLRMs represent "the single largest AI application in terms of infrastructure demand in its data-centers." The DLRM architecture consists of a fully-connected multi-layer perceptron (MLP) for dense features and many embedding tables that map high-cardinality categorical inputs to dense vectors. Because the embedding tables account for the bulk of the parameters, the models can be quite large, even rivaling OpenAI's GPT-3 or Google's Switch Transformer. In 2019, Facebook open-sourced their DLRM implementation and unveiled Zion, their custom AI training hardware.
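In broad strokes, the architecture can be sketched in PyTorch as a bottom MLP for dense features, a set of embedding tables for categorical features, a pairwise feature-interaction step, and a top MLP that produces the click-through prediction. The sketch below is a minimal illustration rather than Facebook's open-source implementation; the table sizes, layer widths, and dot-product interaction are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TinyDLRM(nn.Module):
    """Simplified DLRM: bottom MLP for dense features, embedding tables for
    sparse (categorical) features, pairwise interaction, then a top MLP."""
    def __init__(self, num_dense=13, table_sizes=(1000, 1000, 1000), dim=16):
        super().__init__()
        self.bottom_mlp = nn.Sequential(
            nn.Linear(num_dense, 64), nn.ReLU(), nn.Linear(64, dim), nn.ReLU())
        # One embedding table per categorical feature; in production these
        # tables hold the vast majority of the model's parameters.
        self.tables = nn.ModuleList(
            nn.EmbeddingBag(n, dim, mode="sum") for n in table_sizes)
        num_feat = 1 + len(table_sizes)
        self.top_mlp = nn.Sequential(
            nn.Linear(dim + num_feat * (num_feat - 1) // 2, 64), nn.ReLU(),
            nn.Linear(64, 1))

    def forward(self, dense, sparse):
        # dense: (batch, num_dense) floats; sparse: (batch, num_tables) indices
        x = self.bottom_mlp(dense)
        feats = [x] + [t(sparse[:, i].unsqueeze(1)) for i, t in enumerate(self.tables)]
        f = torch.stack(feats, dim=1)              # (batch, num_feat, dim)
        inter = torch.bmm(f, f.transpose(1, 2))    # pairwise dot-product interactions
        iu = torch.triu_indices(f.size(1), f.size(1), offset=1, device=f.device)
        z = torch.cat([x, inter[:, iu[0], iu[1]]], dim=1)
        return torch.sigmoid(self.top_mlp(z))

model = TinyDLRM()
dense = torch.randn(32, 13)                        # 32 samples, 13 dense features
sparse = torch.randint(0, 1000, (32, 3))           # one index per table per sample
print(model(dense, sparse).shape)                  # torch.Size([32, 1]) -- predicted CTR
```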
However, with the demand for larger models, Facebook found several scaling challenges with the Zion hardware and their DLRM training algorithm. To train the MLP module, Facebook used a data-parallel setup, with a single parameter server and distributed trainers; for the embedding tables, by contrast, a model-parallel scheme was used, with multiple parameter servers. The communication required to coordinate these distributed servers was bottlenecked by the networking design of the Zion nodes.
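A quick back-of-the-envelope calculation shows why the two parts are parallelized differently; the table size used here is an illustrative assumption, not a figure from the paper.

```python
# Why embeddings are model-parallel while the MLP is data-parallel (illustrative numbers):
rows, dim, bytes_per_fp32 = 10**9, 64, 4       # one hypothetical embedding table
table_gb = rows * dim * bytes_per_fp32 / 1e9   # = 256.0 GB: far beyond a single GPU's memory,
                                               # so tables must be sharded across devices,
                                               # while the comparatively small MLP can be
                                               # replicated and trained data-parallel.
print(f"{table_gb} GB per table")
```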
In particular, the node's NIC was attached to the CPU; thus, communication between the GPUs of different nodes required CPU cycles. Furthermore, using the datacenter network meant relying on the TCP/IP protocol, which Facebook found to be suboptimal for distributed training. The redesigned ZionEX nodes improve networking by adding a NIC for each GPU and attaching these NICs to a dedicated backend network that uses the more efficient RDMA/GPUDirect protocols. This backend network can be scaled to many thousands of nodes.
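On the software side, this kind of fabric is what PyTorch's NCCL backend is designed to exploit: when GPUDirect RDMA is available, collective operations move tensors directly between GPU memories over the RoCE NICs rather than staging them through the CPU and TCP/IP. The initialization sketch below shows the standard PyTorch calls; the rendezvous values are placeholders, not Facebook's configuration.

```python
import os
import torch
import torch.distributed as dist

# Placeholder rendezvous settings; in a real cluster these come from the job scheduler.
os.environ.setdefault("MASTER_ADDR", "10.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

def init_distributed(rank: int, world_size: int) -> None:
    """Initialize the NCCL process group, which can use GPUDirect RDMA over RoCE
    when the underlying network fabric supports it."""
    torch.cuda.set_device(rank % torch.cuda.device_count())
    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
```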
To take full advantage of the new hardware, Facebook also redesigned the training software. The previous system used asynchronous updates of the embedding parameters, which reduced model accuracy at large scale. The redesigned training software continues to use a hybrid approach, with data-parallelism for the MLP and model-parallelism for the embeddings, but eliminates the parameter servers and uses synchronous updates of the embedding parameters. To further improve the performance of the embedding training, the team wrote and open-sourced more efficient implementations of several PyTorch operators, reducing memory and network usage during training.
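A single training step under this scheme might look roughly like the sketch below. The collective calls (all_to_all, all_reduce) are real PyTorch APIs, but the function, its arguments, and the tensor layout are illustrative assumptions: local_tables is assumed to map global table indices to the nn.EmbeddingBag shards held on this rank, sparse holds lookup indices for the whole global batch, and dense and labels hold only this rank's slice. The differentiable all-to-all that carries embedding gradients backwards is omitted for brevity.

```python
import torch
import torch.distributed as dist
import torch.nn.functional as F

def hybrid_sync_step(mlp, local_tables, dense, sparse, labels, optimizer):
    """One synchronous hybrid-parallel step (conceptual sketch, no parameter servers).
    Assumes every rank holds the same number of embedding tables and the global
    batch divides evenly across ranks; labels is a float tensor shaped like the
    MLP output."""
    world_size = dist.get_world_size()

    # 1. Model-parallel part: each rank looks up its own tables for every sample
    #    in the global batch, then an all-to-all scatters the pooled embeddings
    #    so that each rank ends up with *all* features for its local batch slice.
    pooled = torch.cat(
        [tbl(sparse[:, int(idx)].unsqueeze(1)) for idx, tbl in local_tables.items()],
        dim=1)
    # The production system uses a differentiable all-to-all so embedding gradients
    # flow back the same way; detach() keeps this sketch simple.
    send = [t.detach() for t in pooled.chunk(world_size, dim=0)]
    recv = [torch.empty_like(t) for t in send]
    dist.all_to_all(recv, send)
    sparse_feats = torch.cat(recv, dim=1)          # (local_batch, dim * num_tables)

    # 2. Data-parallel part: forward/backward through the replicated MLP.
    logits = mlp(torch.cat([dense, sparse_feats], dim=1))
    loss = F.binary_cross_entropy_with_logits(logits, labels)
    optimizer.zero_grad()
    loss.backward()

    # 3. Synchronous update: average the MLP gradients across ranks with all-reduce.
    for p in mlp.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad)
            p.grad /= world_size
    optimizer.step()
    return loss.detach()
```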
To evaluate the performance of their system, the team trained four different DLRMs, with 95B, 793B, 845B, and 12T parameters respectively. The models were trained on a cluster with up to 16 nodes, or 128 GPUs. Increasing the number of nodes does speed up training, by up to 8x for the 845B-parameter model, but the researchers noted that network communication among the GPUs becomes the bottleneck, especially during the "all-to-all" communication required for updating the embedding parameters. To reduce the amount of data transferred in this operation, the team quantized it to 16 bits, cutting network traffic without significantly affecting model accuracy.
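Concretely, the payload of the all-to-all can be cast to a 16-bit format just before the exchange and widened again on the receiving side. The sketch below uses fp16 as an assumed example format and the same chunk layout as the earlier sketch; the paper's actual quantization scheme may differ in detail.

```python
import torch
import torch.distributed as dist

def all_to_all_16bit(send_chunks):
    """Exchange pooled embedding chunks with the payload quantized to 16 bits.
    Halving the element size roughly halves the all-to-all network traffic; the
    received values are widened back to fp32 for the rest of the forward pass.
    Assumes all ranks contribute equally shaped fp32 chunks."""
    send16 = [t.to(torch.float16) for t in send_chunks]
    recv16 = [torch.empty_like(t) for t in send16]
    dist.all_to_all(recv16, send16)
    return [t.to(torch.float32) for t in recv16]
```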
The proliferation of large deep-learning models has spurred several attempts to improve the scalability of model training. Because the models are so large, multiple GPUs must be used, which requires changes in the training software to handle distributed operations. Google has developed Mesh TensorFlow, a distributed layer for their TensorFlow framework, as well as GShard, an extension to TensorFlow's XLA compiler that supports parallel operations. Microsoft recently updated their DeepSpeed library for reducing memory requirements of large models, such that researchers can train models with hundreds of billions of parameters on a single GPU or scale to trillions of parameters using parallel techniques.
The source code and benchmarks for Facebook's operator library are available on GitHub, as is their DLRM implementation.