A team of researchers from MIT, Yonsei University, and the University of Brasilia has launched a new website, Computer Progress, which analyzes the computational burden of over 1,000 deep learning research papers. Data from the site show that computational burden is growing faster than the theoretical lower bound predicts, suggesting that current algorithms leave substantial room for improvement.
Lead researcher Neil Thompson announced the launch on Twitter. Thompson, along with Kristjan Greenewald of the MIT-IBM Watson AI Lab, professor Keeheon Lee of Yonsei University, and Gabriel Manso of the University of Brasilia, detailed the motivation for the work and their results in an article published in IEEE Spectrum. The team analyzed 1,058 deep learning research papers from arXiv to determine a scaling formula relating a model's performance to its computational burden: the amount of compute resources needed to train the model. Theoretically, the lower bound on computational burden is a fourth-order polynomial with respect to performance. The researchers found that current algorithms do much worse: ImageNet image classification algorithms, for example, scale as a ninth-order polynomial, meaning that halving the error rate would require roughly 500 times the compute. According to the authors, these scaling trends suggest that researchers should search for better algorithms:
Faced with computational scaling that would be economically and environmentally ruinous, we must either adapt how we do deep learning or face a future of much slower progress.
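The 500x figure is straightforward arithmetic on the ninth-order result: if compute grows as the ninth power of the inverse error rate, then halving the error multiplies the required compute by 2^9.

```latex
C \propto \left(\frac{1}{\mathrm{error}}\right)^{9}
\qquad\Longrightarrow\qquad
\frac{C_{\mathrm{error}/2}}{C_{\mathrm{error}}} = 2^{9} = 512 \approx 500
```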
Deep neural networks are often over-parameterized, meaning that they have more model parameters than would be expected for the size of the training data. Empirically, over-parameterization has been shown to improve model performance and generalization, while training methods such as stochastic gradient descent (SGD) and regularization keep the models from over-fitting. Researchers have also found that increasing model performance, or accuracy, requires an increase in training data, with a corresponding growth in model size. Assuming that performance improvements require a quadratic increase in training data size, and that computation increases quadratically with model parameters, Thompson and his colleagues propose a theoretical lower bound in which computation grows as the fourth power of performance.
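A minimal sketch of that derivation, using only the two assumptions stated above (quadratic growth of training data with performance, and quadratic growth of computation with data via model size):

```latex
% Assumption 1: data needed grows quadratically with performance
n \propto \left(\tfrac{1}{\mathrm{error}}\right)^{2}
% Assumption 2: compute grows quadratically with data (via model size)
C \propto n^{2}
% Combining the two yields the fourth-power lower bound
\Longrightarrow\; C \propto \left(\tfrac{1}{\mathrm{error}}\right)^{4}
```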
To check this assumption, the researchers reviewed deep learning papers covering several tasks in computer vision (CV) and natural language processing (NLP), including image recognition, object detection, question answering, named-entity recognition, and machine translation. From the papers they extracted the accuracy metrics of the models discussed, along with the computational burden of training them, defined as the number of processors × computation rate × time (essentially, the total number of floating-point operations). They then performed linear regression to express model performance as a function of computation. The resulting equations show that model performance scales much worse than the fourth-degree polynomial predicted by theory: from 7.7th degree for question answering to a polynomial of degree "around 50" for object detection, named-entity recognition, and machine translation.
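A minimal sketch of this kind of fit, assuming the regression is done in log-log space, where a power law appears as a straight line (the compute and error values below are hypothetical, chosen only to illustrate the ninth-order ImageNet example):

```python
import numpy as np

# Hypothetical (compute, error-rate) pairs extracted from papers.
compute = np.array([1e16, 1e17, 1e18, 1e19, 1e20])   # total FLOPs to train
error = np.array([0.20, 0.155, 0.12, 0.093, 0.072])  # model error rate

# A power law, error ~ k * compute^(-1/d), is linear in log-log space:
#   log(error) = log(k) - (1/d) * log(compute)
slope, intercept = np.polyfit(np.log(compute), np.log(error), 1)

# The polynomial degree d relating compute to performance is the
# negative reciprocal of the fitted slope.
degree = -1.0 / slope
print(f"estimated scaling: compute ~ (1/error)^{degree:.1f}")  # ~9.0 here
```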
However, it is possible that these scaling challenges can be solved by improved algorithms. The MIT team's research suggests "three years of algorithmic improvement is equivalent to an increase in computing power of 10x." In 2020, OpenAI published a similar study of image recognition algorithms, finding that "since 2012 the amount of compute needed to train a neural net to the same performance on ImageNet classification has been decreasing by a factor of 2 every 16 months." More recently, Thompson and another colleague conducted a survey of 113 computer algorithm problem domains, including computer networking, signal processing, operating systems, and cryptography, to analyze how algorithmic improvements affected problem-solving performance. They found that while "around half" of the problems, or "algorithm families," have seen no improvement, 14% achieved "transformative" improvements, and 30%-43% achieved improvements "comparable or greater than those that users experienced from Moore’s Law."
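As a back-of-the-envelope comparison of the two reported rates (a sketch only; the 16-month halving figure is OpenAI's, the three-year/10x figure is the MIT team's):

```python
# Compare the two reported rates of algorithmic progress as compound
# compute-savings factors over the same three-year window.

# OpenAI: compute needed halves every 16 months.
halvings_per_year = 12 / 16
openai_factor_3yr = 2 ** (halvings_per_year * 3)  # ~4.8x over three years

# MIT team: three years of algorithmic progress ~ 10x more compute.
mit_factor_3yr = 10

print(f"OpenAI rate over 3 years: {openai_factor_3yr:.1f}x")
print(f"MIT estimate over 3 years: {mit_factor_3yr}x")
```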
The Computer Progress team also suggested several complementary approaches that could improve deep learning efficiency, many of which have been covered on InfoQ. Optical computing could reduce the power consumed by large deep learning models, and overall model size can be addressed by quantization and pruning. Finally, meta-learning offers a way to reduce the number of training cycles required to train a model.
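As one concrete illustration of the model-size angle, here is a minimal sketch of uniform 8-bit post-training quantization of a weight matrix; this is not any specific framework's API, just the underlying idea of trading a small approximation error for a 4x storage reduction:

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Map float32 weights to int8 with a single per-tensor scale."""
    scale = np.abs(weights).max() / 127.0  # largest magnitude maps to 127
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximation of the original float32 weights."""
    return q.astype(np.float32) * scale

w = np.random.randn(256, 256).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

# int8 storage is 4x smaller than float32, at the cost of a small
# per-weight rounding error (at most half the quantization step).
print("max abs error:", np.abs(w - w_hat).max())
print("size reduction: 4x (8-bit vs 32-bit)")
```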
The Computer Progress website hosts the compute vs. performance scaling data, along with links to the source papers, and includes a call for researchers to submit their own performance results.