Azure Optimized Stack with DeepSpeed for Hyperscale Model Training

Azure Machine Learning (AzureML) now provides an optimized stack that uses the latest NVIDIA GPU technology with Quantum InfiniBand to efficiently train and fine-tune large models like Megatron-Turing and GPT-3.

In recent years, large-scale transformer-based deep learning models trained on huge amounts of data have been used for new products and a variety of cognitive tasks. These models have grown in size, and customers' needs for training and fine-tuning have grown accordingly.

Training and fine-tuning models of this kind requires a complex, distributed architecture, and setting up such an architecture involves several manual and error-prone steps. With this new optimized stack, AzureML offers a better experience in terms of usability and performance, providing a simple-to-use training pipeline. The proposed AzureML stack includes hardware, OS, VM image, and Docker image (with optimized PyTorch, DeepSpeed, ONNX Runtime, and other Python packages) for performance and scalability without complexity.
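As a rough illustration of what such a pipeline looks like in practice, the sketch below submits a distributed PyTorch job with the AzureML Python SDK v2; the workspace identifiers, environment, cluster name, and training script are illustrative assumptions, not details from the announcement.

```python
# Sketch: submitting a distributed PyTorch training job with the AzureML
# Python SDK v2. Workspace identifiers, environment, cluster name, and the
# training script are illustrative placeholders.
from azure.ai.ml import MLClient, command
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    credential=DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace>",
)

job = command(
    code="./src",  # folder containing train.py and ds_config.json
    command="python train.py --deepspeed ds_config.json",
    # example: an ACPT-style curated environment with PyTorch, DeepSpeed,
    # and ONNX Runtime preinstalled (name is illustrative)
    environment="AzureML-ACPT-pytorch-1.13-py38-cuda11.7-gpu@latest",
    compute="a100-cluster",  # an NDm A100 v4 cluster
    instance_count=2,        # number of VM nodes
    distribution={
        "type": "PyTorch",
        "process_count_per_instance": 8,  # one process per A100 GPU
    },
)

ml_client.jobs.create_or_update(job)
```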

Optimized stack for scalable distributed training on Azure

A possible experimental setup is based on the NDm A100 v4-series, which includes two-socket AMD EPYC 7V12 64-core CPUs, 1.7TB of main memory, and eight NVIDIA A100 80GB GPUs. A balanced PCIe topology connects four GPUs to each CPU socket, and each GPU has its own topology-agnostic 200 Gb/s NVIDIA Mellanox HDR InfiniBand connection. The 1.7TB of main memory, combined with the offload capabilities of the DeepSpeed library, allows scaling to very large model sizes. This setup can be used both from AzureML studio and from Azure VMSS; the AzureML studio solution is recommended because it is the easiest way to get the setup up and running correctly.
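A cluster of this kind can be provisioned programmatically; the sketch below uses the AzureML Python SDK v2 with the Standard_ND96amsr_A100_v4 SKU (the NDm A100 v4 series), while the cluster name and instance counts are hypothetical.

```python
# Sketch: provisioning an NDm A100 v4 compute cluster in AzureML (SDK v2).
# The cluster name and instance counts are hypothetical.
from azure.ai.ml import MLClient
from azure.ai.ml.entities import AmlCompute
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    credential=DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace>",
)

cluster = AmlCompute(
    name="a100-cluster",
    size="Standard_ND96amsr_A100_v4",  # NDm A100 v4: 8x A100 80GB, 1.7TB RAM
    min_instances=0,                   # scale to zero when idle
    max_instances=2,
)

ml_client.compute.begin_create_or_update(cluster).result()
```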

Differences between distributed architecture and AzureML training setup

The proposed AzureML stack allows efficient training of 2x larger model sizes (2 trillion vs. 1 trillion parameters), scaling to 2x more GPUs (1024 vs. 512), and up to 1.8x higher compute throughput per GPU (150 TFLOPs vs. 81 TFLOPs). The stack also offers near-linear scalability with respect to both model size and number of GPUs. Thanks to DeepSpeed ZeRO-3 with its CPU offloading capabilities and this new AzureML stack, an efficient throughput of 157 TFLOPs per GPU is maintained as the model grows from 175 billion to 2 trillion parameters, and, for a given model size (e.g., 175 billion in the figure below), linear scaling is achieved as the number of GPUs increases.

More detailed results are described in the extended DeepSpeed technical blog.

Figure: (a) throughput/GPU vs. model size, from 175 billion to 2 trillion parameters (BS/GPU=8); (b) near-linear performance scaling as the number of GPU devices increases for the 175B model (BS/GPU=16).
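To make the ZeRO-3 CPU-offloading setup concrete, here is a minimal sketch of a DeepSpeed configuration and initialization; the model, optimizer settings, and precision choices are illustrative assumptions rather than the exact benchmark configuration.

```python
# Sketch: initializing a model with DeepSpeed ZeRO stage 3 and CPU offloading.
# The model and hyperparameters are illustrative placeholders.
import deepspeed
import torch.nn as nn

ds_config = {
    "train_micro_batch_size_per_gpu": 8,  # matches BS/GPU=8 in the figure
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,  # partition parameters, gradients, and optimizer state
        "offload_optimizer": {"device": "cpu"},  # push optimizer state and
        "offload_param": {"device": "cpu"},      # params to host memory
    },
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
}

model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024))

# deepspeed.initialize wraps the model for distributed, ZeRO-partitioned training
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```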
