Microsoft Releases AI Training Library ZeRO-3 Offload

Microsoft recently open-sourced ZeRO-3 Offload, an extension of their DeepSpeed AI training library that improves memory efficiency while training very large deep-learning models. ZeRO-3 Offload allows users to train models with up to 40 billion parameters on a single GPU and over 2 trillion parameters on 512 GPUs.

The DeepSpeed team provided an overview of the features and benefits of the release in a recent blog post. ZeRO-3 Offload increases the memory efficiency of distributed training for deep-learning models built on the PyTorch framework, providing super-linear scaling across multiple GPUs. Offloading the storage of some data from the GPU to the CPU lets each GPU train a larger model, enabling model sizes up to 40B parameters on a single GPU. Adopting the DeepSpeed framework for training requires minimal refactoring of model code, and current users can take advantage of the new features by modifying a config file. According to the DeepSpeed team, the release is "geared towards our continued goal of democratizing AI by making efficient large-scale DL training available to everyone."
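As a rough illustration of what such a config change might look like, the sketch below builds a DeepSpeed-style configuration in Python and serializes it to JSON. The key names (`zero_optimization`, `stage`, `offload_optimizer`, `offload_param`) follow the DeepSpeed configuration documentation as best understood here, but they have changed between releases, so treat them as assumptions to check against the installed version. The helper `build_zero3_offload_config` and the batch-size value are hypothetical and used only for illustration.

```python
import json


def build_zero3_offload_config(micro_batch_size=1):
    """Return a dict resembling a DeepSpeed config with ZeRO stage 3 and CPU offload.

    Hypothetical helper; key names are assumed from the DeepSpeed docs and
    may differ between DeepSpeed versions.
    """
    return {
        "train_micro_batch_size_per_gpu": micro_batch_size,
        "fp16": {"enabled": True},
        "zero_optimization": {
            # Stage 3 partitions optimizer states, gradients, and parameters.
            "stage": 3,
            # Keep optimizer states in CPU memory instead of GPU memory.
            "offload_optimizer": {"device": "cpu"},
            # Keep partitioned parameters in CPU memory as well.
            "offload_param": {"device": "cpu"},
        },
    }


config = build_zero3_offload_config()
# Serialize to the JSON text that would be saved as the config file.
config_json = json.dumps(config, indent=2)
```

The resulting JSON would typically be saved to a file and passed to the training launcher; the model code itself stays unchanged, which matches the claim above that existing users only need to edit their configuration.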

A recent trend in deep-learning models has been the exponential growth in the number of model parameters: from hundreds of millions of parameters in 2018 to hundreds of billions in 2020. While the larger models do achieve improved performance, training them requires multiple GPU accelerators and distributed training techniques such as model parallelism, which places different subsets of a model's parameters on separate GPUs. However, while the PyTorch framework does support this technique, it typically requires changes to the model training code.

Microsoft first released the DeepSpeed library and the Zero Redundancy Optimizer (ZeRO) in early 2020. Microsoft's Project Turing used the library to train the Turing Natural Language Generation (T-NLG) model, which at 17B parameters was at the time the largest known language model. ZeRO defines three stages of optimization:

ZeRO-1: Optimizer State Partitioning
ZeRO-2: ZeRO-1 plus Gradient Partitioning
ZeRO-3: ZeRO-2 plus Parameter Partitioning
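The effect of the three stages on per-GPU memory can be estimated with some simple arithmetic. The sketch below assumes the accounting used in the ZeRO paper for mixed-precision Adam training, which is not stated in this article: 2 bytes per parameter for fp16 weights, 2 bytes for fp16 gradients, and 12 bytes for optimizer states. It ignores activations and temporary buffers, and the function name `zero_model_state_bytes` is hypothetical.

```python
def zero_model_state_bytes(num_params, num_gpus, stage):
    """Estimate per-GPU bytes of model state under a given ZeRO stage.

    Assumes mixed-precision Adam: 2 B/param weights, 2 B/param gradients,
    12 B/param optimizer states. Stage 0 means no partitioning.
    """
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2, or 3")
    params = 2.0 * num_params
    grads = 2.0 * num_params
    optim = 12.0 * num_params
    if stage >= 1:
        optim /= num_gpus   # ZeRO-1: partition optimizer states
    if stage >= 2:
        grads /= num_gpus   # ZeRO-2: also partition gradients
    if stage >= 3:
        params /= num_gpus  # ZeRO-3: also partition parameters
    return params + grads + optim


# Example: a 7.5B-parameter model on 64 GPUs, reported in gigabytes (1e9 bytes).
estimates_gb = {
    stage: zero_model_state_bytes(7.5e9, 64, stage) / 1e9
    for stage in (0, 1, 2, 3)
}
```

Under these assumptions, only stage 3 makes the per-GPU model-state footprint shrink in full proportion to the number of GPUs, since the earlier stages still replicate the weights (and, for ZeRO-1, the gradients) on every device.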

The initial release of DeepSpeed included only the first stage, ZeRO-1. Later releases included ZeRO-2 as well as ZeRO-Offload, a scheme for "offloading" data and compute from the GPU to the CPU of a training machine. This frees up GPU memory and allows for a single GPU to manage a larger model.

The new release implements ZeRO-3 Offload, a combination of all three stages of ZeRO optimizations and offloading. This allows for single-GPU training of models 3x larger than those supported with ZeRO-2, which according to Microsoft was the previous state of the art. At the larger end, a 2T-parameter model can be trained on a cluster of 512 GPUs, compared to the previous state-of-the-art requirement of 1600 GPUs. The system can also support higher throughput: 50 Tflops per GPU compared to 30 Tflops with standard PyTorch training.

As large models have become more commonplace, distributed training has become a major focus for research and development. Google's TensorFlow framework, PyTorch's main rival, has a separate library called Mesh TensorFlow for training large models. Facebook, the creator of PyTorch, created a distributed training extension for PyTorch called FairScale. Hugging Face, a popular source for pre-trained AI models, has adopted both DeepSpeed and FairScale for training their large models. In a discussion on Hacker News, one user noted that FairScale has also implemented several of the ideas from DeepSpeed. Another pointed out that the ZeRO optimizations might not perform as well for all types of AI models:

"It is mostly applicable to transformer models, the ideas in the paper would be alien if you work on computer vision."

The DeepSpeed source code is available on GitHub.
