Near-Optimal Scaling of Large Deep Network Training on Public Cloud

A recently published study from AWS, MiCS, provides experimental evidence that the infrastructure used to carry out model training should be taken into account when designing architectures, especially for large deep neural networks trained on the public cloud. The article shows that distributing the model weights unevenly between nodes decreases inter-node communication overhead on AWS V100 (p3dn.24xlarge) and A100 (p4d.24xlarge) GPU instances. As a result, the bulk of the gradient exchange happens within a node, providing a throughput boost during training that depends on the model size. The work is part of a recent effort to increase the efficiency of large-scale training workloads.

For deep neural networks, test loss scales logarithmically with the number of network parameters and the amount of training data. Therefore, in the last couple of years, research and industrial efforts have shifted towards developing high-capacity deep networks that can be used for many downstream tasks (e.g. via supervised fine-tuning). To meet the requirements of training such large networks, training compute has also scaled rapidly, doubling approximately every six months.

As large-scale deep networks have become common, parameter-sharding strategies such as ZeRO and GShard have been proposed for training them. Proof-of-concept frameworks are generally developed on on-premises GPU stations equipped with high-bandwidth interconnects. In practice, however, industrial applications generally reside on the public cloud, which brings additional technical challenges due to the limitations and availability of architectural components there.
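As a rough illustration of what parameter sharding means, the sketch below splits a flat parameter vector evenly across data-parallel ranks so that each rank stores and updates only its own shard. The helper name and shapes are illustrative only, not the actual ZeRO or GShard APIs.

```python
# Minimal sketch of ZeRO-style parameter sharding (illustrative only,
# not the DeepSpeed/ZeRO or GShard implementation).
import numpy as np

def shard_parameters(flat_params: np.ndarray, world_size: int):
    """Split a flat parameter vector into one contiguous shard per rank."""
    padded_len = -(-len(flat_params) // world_size) * world_size  # round up
    padded = np.zeros(padded_len, dtype=flat_params.dtype)
    padded[: len(flat_params)] = flat_params
    return np.split(padded, world_size)  # shard i lives on rank i

# Example: 10 parameters sharded across 4 ranks; each rank holds 3 values
shards = shard_parameters(np.arange(10, dtype=np.float32), world_size=4)
for rank, shard in enumerate(shards):
    print(f"rank {rank} owns {shard}")
```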

The public cloud employs software-defined reusable components, which allow straightforward management of compute instances. Unlike purpose-built GPU stations, cloud virtual-machine clusters may have inter-node bandwidth 12 to 24 times lower than the intra-node bandwidth between GPUs (e.g. NVIDIA NVLink and NVSwitch, AMD xGMI). This makes distributed gradient synchronization a major bottleneck for training large deep networks.
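A back-of-the-envelope estimate makes the gap concrete. The sketch below compares a bandwidth-only estimate of ring all-reduce time over a gradient buffer at an assumed intra-node versus inter-node bandwidth; the model size, worker count, and bandwidth figures are illustrative assumptions, not measurements from the paper.

```python
# Back-of-the-envelope ring all-reduce time estimate
# (assumed numbers, not measurements from the MiCS paper).

def ring_allreduce_seconds(num_params: int, bytes_per_param: int,
                           num_workers: int, bandwidth_gbps: float) -> float:
    """Bandwidth-only estimate: a ring all-reduce moves ~2*(n-1)/n of the payload."""
    payload_bytes = num_params * bytes_per_param
    traffic = 2 * (num_workers - 1) / num_workers * payload_bytes
    return traffic / (bandwidth_gbps * 1e9 / 8)  # Gbit/s -> bytes/s

params = 10_000_000_000  # assumed 10B-parameter model
fp16 = 2                 # bytes per gradient element

intra = ring_allreduce_seconds(params, fp16, 8, bandwidth_gbps=2400)  # NVLink-class (assumed)
inter = ring_allreduce_seconds(params, fp16, 8, bandwidth_gbps=100)   # 100 Gbps network
print(f"intra-node sync ~{intra:.2f}s vs inter-node sync ~{inter:.2f}s")
```

Under these assumed numbers the inter-node synchronization is roughly 24 times slower, which matches the order of magnitude quoted above.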

MiCS proposes to limit inter-node communication by placing model parameters on GPUs that are as close to each other as possible. This is done by decreasing the model partition size and by favoring intra-node GPUs. When multiple nodes are required to cover the whole parameter range, the smallest possible number of nodes is used to divide the weights. In this way, practical communication differences are reflected at the algorithmic level. The authors also modify the gradient accumulation method to accommodate the uneven weight distribution.
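The placement idea can be sketched as: pick the smallest group of GPUs that can hold the model state, filling one node before spilling over to additional nodes, so that most gradient traffic stays on intra-node links. The sketch below is a simplified illustration of that partition-group selection, not the MiCS implementation; the memory footprint and GPU counts are assumed values.

```python
# Illustrative sketch of MiCS-style partition-group selection: use the
# smallest group of GPUs, packed into as few nodes as possible.
# (Simplified; not the actual MiCS implementation.)

def pick_partition_group(model_bytes: int, gpu_mem_bytes: int, gpus_per_node: int):
    """Return (num_nodes, num_gpus) of the smallest group that fits the model state."""
    gpus_needed = -(-model_bytes // gpu_mem_bytes)        # ceiling division
    if gpus_needed <= gpus_per_node:
        return 1, gpus_needed                             # stays inside one node
    nodes_needed = -(-gpus_needed // gpus_per_node)       # smallest number of nodes
    return nodes_needed, nodes_needed * gpus_per_node

# Example: a 100 GB parameter/optimizer footprint on 40 GB GPUs, 8 GPUs per node
nodes, gpus = pick_partition_group(100 * 2**30, 40 * 2**30, gpus_per_node=8)
print(f"shard across {gpus} GPUs on {nodes} node(s)")
```

Keeping the partition group inside a single node means the frequent parameter and gradient exchanges use the fast intra-node links, while the slower inter-node links are used less often.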

The paper presents results from several experiments in 100 Gbps and 400 Gbps network settings, varying the number of GPUs and the size of the deep networks for comparative performance results. MiCS shows consistent improvements of up to 2.82 times higher throughput in the 100 Gbps setting and up to 2.21 times higher throughput in the 400 Gbps case.

Additional information about the paper can be found in the official Amazon Science blog post. A similar recommendation can also be seen in the official GCP blog post (third suggestion).
