A new open-source library called Alpa aims to automate the distributed training and serving of large deep networks. It introduces a compiler that combines existing model-parallel strategies and optimizes the use of computing resources according to the architecture of the deep network.
The test loss of a neural network decreases predictably, following a power law in both the number of weight parameters and the amount of data used during training. This has led to an increased effort within the deep learning community to develop ever-larger models. To keep up with the growing network sizes, the scaling of training compute has also accelerated, doubling approximately every 6 months. As accelerator memory capacity is limited, the main engineering obstacle is mapping the network parameters onto the available accelerator devices in a way that accounts for both cluster properties and communication primitives. The problem is especially pronounced during training, because the corresponding gradient tensors must also be stored in and exchanged through accelerator memory.
There are two main methods to carry out model-parallel distributed training. In the first strategy, the operators (e.g. convolutional layers) that constitute the computational graph are divided among the devices (aka inter-operator parallelism); input mini-batches are split into micro-batches, and each micro-batch flows through the same sequence of operators placed on multiple devices, forming a pipeline (e.g. Device Placement Optimization, GPipe). In the second strategy, the parameters of individual operators are partitioned (aka intra-operator parallelism), and each batch is processed against the different parameter shards placed on different devices (e.g. GShard, DeepSpeed-ZeRO).
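To make the distinction concrete, below is a minimal sketch of both styles written with plain JAX sharding primitives rather than Alpa's own API; the two-device setup, the layer sizes, and the "model" axis name are illustrative assumptions.

```python
import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

devices = jax.devices()  # assumes at least 2 accelerators are available

# Inter-operator parallelism: whole layers are placed on different devices,
# and activations are transferred between them (a two-stage pipeline).
w1 = jax.device_put(jnp.ones((512, 512)), devices[0])
w2 = jax.device_put(jnp.ones((512, 512)), devices[1])

def pipeline_forward(x):
    h = jnp.dot(jax.device_put(x, devices[0]), w1)  # stage 1 on device 0
    h = jax.device_put(h, devices[1])               # move activations
    return jnp.dot(h, w2)                           # stage 2 on device 1

# Intra-operator parallelism: a single weight matrix is sharded column-wise
# across all devices, and XLA's SPMD partitioner splits the matmul.
mesh = Mesh(np.array(jax.devices()), axis_names=("model",))
w = jax.device_put(jnp.ones((512, 2048)),
                   NamedSharding(mesh, P(None, "model")))

@jax.jit
def sharded_forward(x, w):
    return jnp.dot(x, w)

x = jnp.ones((64, 512))
y_pipeline = pipeline_forward(x)
y_sharded = sharded_forward(x, w)
```

Only activations cross device boundaries in the pipelined variant, whereas the sharded variant relies on collective communication whenever a downstream operator needs values held by other shards.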
Both strategies come with trade-offs. For example, inter-operator parallelism requires less network bandwidth, which can be favorable for multi-node training. Intra-operator parallelism, on the other hand, minimizes GPU idle time but imposes higher data-exchange requirements. It is therefore better suited to GPUs connected through high-bandwidth interconnects such as NVIDIA NVLink or AMD xGMI. Dedicated training servers are generally built from multiple GPUs with custom inter-GPU connection modules, but this is not always the case on public clouds. Hence, a hybrid strategy that combines inter- and intra-operator parallelism may significantly improve resource usage.
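As a rough illustration of that hybrid idea (not Alpa's own mechanism), a logical 2D device mesh can dedicate the fast intra-node axis to intra-operator sharding and the slower inter-node axis to pipeline stages; the 2x4 topology and the axis names below are assumptions.

```python
import numpy as np
import jax
from jax.sharding import Mesh

# Assumed topology: 2 nodes x 4 GPUs per node = 8 devices in total.
devices = np.array(jax.devices()).reshape(2, 4)
mesh = Mesh(devices, axis_names=("stage", "model"))
# "stage": inter-operator (pipeline) axis across nodes -> tolerates lower bandwidth
# "model": intra-operator axis within a node -> exploits NVLink/xGMI bandwidth
```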
Deep learning libraries continue to release new APIs that assist with placement planning for model parameters and input data, such as DTensor for TensorFlow and FSDP for PyTorch. Distribution plans can be created manually (e.g. Megatron-LM), but auto-generated plans and training schedules are beneficial, especially for very large networks and for AutoML-designed architectures. Alpa can be considered an attempt to automate such placement procedures. Its compiler leverages both intra- and inter-operator parallelism to output runtime-optimized distributed training strategies that depend on the cluster and the structure of the deep network.
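Alpa exposes this automation through a decorator-style API. The sketch below is adapted from the project's README and applies alpa.parallelize to an ordinary JAX training step; the linear model, the synthetic batch, and the SGD update are illustrative placeholders, and cluster setup is omitted.

```python
import alpa
import jax
import jax.numpy as jnp

# Hypothetical linear-model parameters and a synthetic batch, just to make
# the example concrete; only alpa.parallelize is part of Alpa's API.
params = {"w": jnp.zeros((128, 10)), "b": jnp.zeros((10,))}
batch = {"x": jnp.ones((64, 128)), "y": jnp.ones((64, 10))}

@alpa.parallelize
def train_step(params, batch):
    def loss_fn(p):
        out = jnp.dot(batch["x"], p["w"]) + p["b"]
        return jnp.mean((out - batch["y"]) ** 2)

    grads = jax.grad(loss_fn)(params)
    # Plain SGD update; Alpa decides how params, grads, and the computation
    # are partitioned across the available devices.
    return jax.tree_util.tree_map(lambda p, g: p - 0.1 * g, params, grads)

params = train_step(params, batch)
```

The decorated function keeps its usual JAX signature, so the surrounding training loop does not need to change.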
Currently, Alpa is built on top of JAX, which offers composable function transformations (i.e. automatic vectorization, gradient computation, SPMD parallelization, and JIT compilation) for side-effect-free functions applied to data.
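The snippet below is a minimal, self-contained illustration of such composed transformations in plain JAX (it does not involve Alpa): grad, vmap, and jit stacked on an ordinary pure function, with an arbitrary toy loss and shapes.

```python
import jax
import jax.numpy as jnp

def loss(w, x, y):
    # Squared-error loss of a linear model; pure (side-effect-free).
    return jnp.mean((jnp.dot(x, w) - y) ** 2)

# Gradient w.r.t. the first argument, vectorized over a leading batch axis
# of (x, y), and finally JIT-compiled via XLA.
grad_fn = jax.jit(jax.vmap(jax.grad(loss), in_axes=(None, 0, 0)))

w = jnp.zeros((3,))
x = jnp.ones((8, 4, 3))   # 8 examples in the vmapped batch, each of shape (4, 3)
y = jnp.ones((8, 4))
per_example_grads = grad_fn(w, x, y)   # shape (8, 3)
```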
The results presented in the OSDI '22 paper indicate that Alpa produces training strategies that are competitive with manual placement and with previous state-of-the-art methods. Additional information is available on the official Google blog, and the reasoning behind the design decisions is explained in the official documentation. The project also showcases a proof-of-concept model server for OPT-175B.