PyTorch 1.8 Release Includes Distributed Training Updates and AMD ROCm Support

PyTorch, Facebook's open-source deep-learning framework, has released version 1.8, which includes updated APIs, improvements for distributed training, and support for the ROCm platform for AMD's GPU accelerators. New versions of the domain-specific libraries TorchVision, TorchAudio, and TorchText were also released.

The PyTorch team highlighted the major features of the release in a recent blog post. The new release includes binaries built for the ROCm platform, to improve performance on systems using AMD GPUs. There are several new NumPy-compatible APIs, as well as two new features for distributed training: pipeline parallelism and gradient compression. The release also includes a new beta toolkit, torch.fx, for Python-to-Python functional transformations of PyTorch programs. Overall, version 1.8 contains more than 3,000 commits since the 1.7 release.

The new FX toolkit was inspired by JAX and TensorFlow and gives developers a mechanism for transforming the Python code of modules that subclass PyTorch's nn.Module class. The three main components of the toolkit are a symbolic tracer, an intermediate representation, and a Python code generator. These components allow developers to convert a Module subclass into a Graph representation, modify the Graph programmatically, and then generate new Python source code from the modified Graph. The generated code is compatible with PyTorch's existing eager-execution system. The PyTorch documentation includes several example use cases of FX, including optimizing models for inference and model quantization.
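As a minimal sketch of that workflow (the module and layer names below are illustrative), the following code traces a small module into a Graph, swaps every call to torch.relu for torch.nn.functional.gelu by editing the Graph, and regenerates the Python source:

    import torch
    import torch.fx

    class MyModule(torch.nn.Module):
        def __init__(self):
            super().__init__()
            self.linear = torch.nn.Linear(4, 4)

        def forward(self, x):
            return torch.relu(self.linear(x))

    # Symbolic tracer: record the forward pass as a Graph intermediate representation
    traced = torch.fx.symbolic_trace(MyModule())
    print(traced.graph)  # one node per operation in forward()

    # Transform the Graph directly: replace relu calls with gelu
    for node in traced.graph.nodes:
        if node.op == "call_function" and node.target is torch.relu:
            node.target = torch.nn.functional.gelu
    traced.graph.lint()

    # Code generator: regenerate the module's forward() from the modified Graph
    traced.recompile()
    print(traced.code)

The recompiled module behaves like any other nn.Module, so the generated code runs under the normal eager-execution system.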

To address the problem of large-scale models that do not fit on a single GPU, PyTorch introduced model-parallel training in version 1.4. This allows a model to be trained across multiple GPUs, each hosting a subset of the model's parameters. However, typical implementations of this paradigm often leave all but one GPU idle at any given time. The new release introduces pipeline parallelism, similar to GPipe, which splits an input mini-batch into several micro-batches that are pipelined across the GPUs, reducing the overall idle time, although not completely eliminating it. The release also introduces communication hooks to the distributed data-parallel framework, allowing developers to customize the gradient-communication step of training. Several pre-built hooks are included, such as gradient compression and PowerSGD.
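As a minimal sketch of the pipeline-parallel API, assuming a machine with two GPUs (the layer sizes and chunk count here are illustrative), a sequential model can be split across devices with torch.distributed.pipeline.sync.Pipe, which requires the RPC framework to be initialized:

    import os
    import torch
    import torch.nn as nn
    import torch.distributed.rpc as rpc
    from torch.distributed.pipeline.sync import Pipe

    # Pipe uses RRefs internally, so the RPC framework must be initialized first
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29500")
    rpc.init_rpc("worker", rank=0, world_size=1)

    # Each pipeline stage lives on its own GPU
    stage1 = nn.Sequential(nn.Linear(16, 32), nn.ReLU()).cuda(0)
    stage2 = nn.Linear(32, 4).cuda(1)

    # chunks=8 splits each mini-batch into 8 micro-batches pipelined across the stages
    model = Pipe(nn.Sequential(stage1, stage2), chunks=8)

    output = model(torch.randn(64, 16).cuda(0)).local_value()  # forward returns an RRef
    rpc.shutdown()

Registering a communication hook on a DistributedDataParallel model is a single call. The sketch below assumes a one-process NCCL group on a single GPU and uses the built-in FP16 compression hook; the PowerSGD hooks live in the same ddp_comm_hooks package:

    import os
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP
    from torch.distributed.algorithms.ddp_comm_hooks import default_hooks

    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29501")
    dist.init_process_group("nccl", rank=0, world_size=1)

    ddp_model = DDP(torch.nn.Linear(16, 4).cuda(0), device_ids=[0])
    # Compress gradients to FP16 before the all-reduce to cut communication volume
    ddp_model.register_comm_hook(state=None, hook=default_hooks.fp16_compress_hook)

    ddp_model(torch.randn(8, 16).cuda(0)).sum().backward()  # hook runs during gradient sync
    dist.destroy_process_group()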

For developers using AMD GPU accelerators, the PyTorch installer now offers the option to choose a binary built for the Radeon Open Compute (ROCm) platform; previously, AMD users needed to build PyTorch from source for ROCm or download binaries from AMD. For users migrating NumPy code to PyTorch, the new release adds NumPy-compatible APIs for fast Fourier transforms (FFTs) and common linear algebra operations. Finally, several domain-specific PyTorch libraries have also been updated: TorchVision adds several mobile-focused features, including a mobile version of Detectron2; TorchAudio improves I/O performance; and TorchText's dataset API is now compatible with the PyTorch DataLoader utility.
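A quick sketch of the NumPy-style modules (the tensor shapes are arbitrary): the functions in torch.fft and torch.linalg mirror their numpy.fft and numpy.linalg counterparts:

    import torch

    x = torch.randn(8)
    spectrum = torch.fft.fft(x)         # mirrors numpy.fft.fft
    recovered = torch.fft.ifft(spectrum)

    a = torch.randn(4, 4)
    u, s, vh = torch.linalg.svd(a)      # mirrors numpy.linalg.svd
    norm = torch.linalg.norm(a)         # mirrors numpy.linalg.norm

Because the signatures follow NumPy's, existing NumPy code can often be ported by changing only the module prefix.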

In a discussion about the release on Hacker News, users contrasted PyTorch with TensorFlow, noting that although TensorFlow is lagging PyTorch in ROCm support, TensorFlow's support for Google's TPU devices is superior. One user praised PyTorch as:

[T]he most impressive piece of software engineering that I know of....There's just an incredible amount of complexity being hidden behind a very simple interface there...

The PyTorch 1.8 release notes and code are available on GitHub.
