MLCommons Announces Latest MLPerf Training Benchmark Results


Engineering consortium MLCommons recently announced the results of the latest round of its MLPerf Training benchmark competition. Fourteen organizations submitted over 158 AI training performance results, with the best results improving by up to 2.3x compared to the previous round.

The announcement was made on the MLCommons blog. The MLPerf Training benchmark suite consists of eight deep-learning models covering a variety of tasks, including computer vision (CV), natural language processing (NLP), and reinforcement learning (RL). Entrants train a deep-learning model for each task until it reaches a target quality score, such as a given model accuracy, with the goal of doing so in the lowest "wall clock" elapsed time. In the Closed division, where all entrants train the same model architecture, NVIDIA posted the best performance in seven of the eight tasks and was narrowly beaten on the eighth by Microsoft Azure's entry. Most tasks showed improved times compared to the previous round, and compared to the first round's results in 2018, some benchmarks have improved by as much as 30x. According to David Kanter, executive director of MLCommons,
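The "time to train" metric described above can be sketched as a simple loop: train until a target quality threshold is reached and report the elapsed wall-clock time. This is a minimal illustration, not the actual MLPerf harness; `train_one_epoch`, `evaluate`, and the parameter names are hypothetical stand-ins.

```python
import time

def time_to_train(train_one_epoch, evaluate, target_quality, max_epochs=100):
    """Train until evaluate() meets target_quality; return wall-clock seconds.

    train_one_epoch: callable performing one pass of training (assumed).
    evaluate: callable returning the current quality metric, e.g. accuracy.
    """
    start = time.monotonic()
    for _ in range(max_epochs):
        train_one_epoch()
        if evaluate() >= target_quality:
            # Elapsed wall-clock time, the quantity MLPerf entries compete on
            return time.monotonic() - start
    raise RuntimeError("target quality not reached within max_epochs")
```

Note that because the metric is wall-clock time to a quality target rather than raw throughput, both hardware speed and training-efficiency improvements (better optimizers, architectures, or hyperparameters, within division rules) can lower an entry's score.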

That rapid increase in performance will ultimately unleash new machine learning innovations that will benefit society.

The MLPerf Training benchmark suite was launched in 2018 with the goal of creating a "level playing field" for comparing how quickly various systems can train ML models. The current suite contains several tasks and associated datasets: image classification on ImageNet, image segmentation on KiTS19, lightweight and heavyweight object detection on COCO, speech recognition on LibriSpeech, NLP on Wikipedia, and recommendation on 1TB Click Logs. There is also an RL task, learning to play the game Go, which does not require a dataset. Each task has a defined quality metric and target metric value. Each submission includes timing data from multiple training runs; the highest and lowest values are dropped, and the average of the remaining times is used as the entry's final result.
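The scoring rule at the end of the paragraph above is a trimmed (sometimes called "olympic") mean. A minimal sketch, with a hypothetical function name:

```python
def benchmark_score(run_times):
    """Drop the single highest and lowest run times, average the rest."""
    if len(run_times) < 3:
        raise ValueError("need at least three runs to drop both extremes")
    trimmed = sorted(run_times)[1:-1]  # discard fastest and slowest runs
    return sum(trimmed) / len(trimmed)
```

Dropping the extremes makes the reported result less sensitive to one-off outliers, such as a run slowed by a transient infrastructure issue or one that got unusually lucky with random initialization.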

Participants may choose to enter multiple systems, where a system is a defined set of hardware resources and a model implementation, and each system may have results for one or more of the tasks. Systems compete in either the Closed division, where the model architecture and ranges of hyperparameters are pre-defined, or the Open division, which allows arbitrary models. Systems are also categorized as available, containing only components that can be purchased directly or used via a public cloud; preview, meaning not publicly available currently but required to become so before the next MLPerf round; or research, meaning experimental or internal-only.

While NVIDIA directly submitted results for all of the tasks, most of the other competitors also used NVIDIA hardware accelerators, including Azure, Baidu, Dell, Fujitsu, GIGABYTE, HPE, Inspur, Lenovo, and Supermicro. Graphcore and Intel-HabanaLabs used their own hardware, but competed only on the ImageNet and NLP tasks. Google did not compete in the Closed division, but competed in the Open division against Samsung and Graphcore, submitting a 480B-parameter BERT model trained on its Cloud TPUs. In a blog post, Google suggested that its model should be added to the Closed division benchmark, in order to bring focus to "scalability challenges that large models present."

In a Hacker News discussion about the results, commenters suggested reasons why improvements in ML training times seem to be outpacing Moore's law. One pointed out a list of optimizations and improvements to hardware, while another noted:

Neural architecture search is finding more efficient architectural building blocks which translate to better quality models with fewer overall parameters. There's a lot of work going on right now to improve transformer architectures in particular; I think once the dust settles they will be massively more efficient than the current models. Efficient transformers often also introduce an inductive bias, which ends up improving model quality.

The MLPerf Training rules as well as reference implementations of models for each benchmark are available on GitHub.
