New Technique Speeds up Deep-Learning Inference on TensorFlow by 2x

Researchers at North Carolina State University recently presented a paper at the International Conference on Supercomputing (ICS) on their new technique, "deep reuse" (DR), that can speed up inference time for deep-learning neural networks running on TensorFlow by up to 2x, with almost no loss of accuracy.

Dr. Xipeng Shen, along with graduate student Lin Ning, authored the paper describing the technique, which requires no special hardware or changes to the deep-learning model. By taking advantage of similarities among the data values that are input into a neural network layer, DR eliminates redundant computation during inference, reducing the total time taken. Reducing computation also reduces power consumption, a key feature for mobile or embedded applications. In experiments running several common computer-vision deep-learning models on GPUs, including CifarNet, AlexNet, and VGG-19, DR achieved speedups of 1.75x to 2.02x, with an increase in error of 0.0005. In some cases, DR actually improved accuracy slightly. In similar experiments on a mobile phone, DR "achieves an average of 2.12x speedup for CifarNet and 2.55X for AlexNet."

A large portion of the processing during neural-network inference is spent multiplying a vector of data by a matrix of weights; the vector could be input data or the activation maps that are fed into the hidden layers of the network. While there are training techniques that produce smaller models with fewer vector-matrix products, DR does not require any changes to the training process or to the model.

At inference time, DR uses locality-sensitive hashing (LSH) to cluster the input of each network layer. The centroid of each cluster is used in the vector-matrix product, instead of the actual input vector. The result is saved in memory; whenever a new input is encountered, it is quickly mapped to a cluster and the saved vector-matrix product is output, instead of being recalculated with the new input. Substituting centroids for actual inputs can reduce the accuracy of the computations, but according to the team the loss is "marginal compared to the 54-78% overall inference accuracies." Applying the LSH algorithm also adds some overhead, but the overall time savings more than make up for it.
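The idea can be illustrated with a small, self-contained Python sketch. This is not the researchers' implementation: it assumes random-hyperplane LSH (one common LSH family), clusters a whole batch of input vectors at once, and uses hypothetical helper names (`make_hyperplanes`, `lsh_signature`, `matvec`, `reuse_layer`) chosen only for this example. Vectors that hash to the same signature form a cluster, and one vector-matrix product per cluster is computed on the centroid and shared by every member.

```python
import random


def make_hyperplanes(dim, num_bits, seed=0):
    """Draw random hyperplanes; each contributes one bit to the LSH signature."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def lsh_signature(vec, hyperplanes):
    """Hash a vector to a tuple of bits: the side of each hyperplane it falls on."""
    return tuple(
        1 if sum(h * v for h, v in zip(plane, vec)) >= 0.0 else 0
        for plane in hyperplanes
    )


def matvec(vec, weights):
    """Multiply a length-d vector by a d x m weight matrix (a list of d rows)."""
    return [
        sum(vec[i] * weights[i][j] for i in range(len(vec)))
        for j in range(len(weights[0]))
    ]


def reuse_layer(inputs, weights, hyperplanes):
    """Compute one product per LSH cluster and share it among the members.

    Returns the per-input outputs and the number of products actually computed.
    """
    # Group input indices by LSH signature (assumption: one signature = one cluster).
    clusters = {}
    for idx, vec in enumerate(inputs):
        clusters.setdefault(lsh_signature(vec, hyperplanes), []).append(idx)

    dim = len(inputs[0])
    outputs = [None] * len(inputs)
    for members in clusters.values():
        # The centroid stands in for every vector in the cluster.
        centroid = [
            sum(inputs[i][k] for i in members) / len(members) for k in range(dim)
        ]
        result = matvec(centroid, weights)  # computed once per cluster
        for i in members:
            outputs[i] = list(result)       # reused for each member
    return outputs, len(clusters)


# Toy data: two identical inputs and one pointing the opposite way.
inputs = [[1.0, 2.0], [1.0, 2.0], [-1.0, -2.0]]
weights = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]]
planes = make_hyperplanes(dim=2, num_bits=4)
outputs, num_products = reuse_layer(inputs, weights, planes)
print(num_products, "products computed for", len(inputs), "inputs")
```

In this sketch, the number of signature bits controls the trade-off: more bits give finer clusters, so less reuse but centroids that sit closer to the actual inputs; fewer bits give more reuse at the cost of a larger approximation error.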

DR's improved runtime means reduced power consumption, which is especially attractive for mobile or embedded devices; however, it does not reduce the storage and memory requirements of the model. Many deep-learning models are too large to run on mobile devices; for example, the research team could not run the VGG-19 model on their mobile device. This problem can be addressed by producing smaller models or by compression techniques such as post-training quantization. The researchers investigated DR's performance with compressed networks, and found it achieved 2x to 3x speedup on the convolutional layers of a compressed AlexNet.

The team implemented DR using TensorFlow for GPU experiments and TensorFlow Lite for mobile device experiments. In a previous paper, the researchers investigated the use of DR to speed up training. In these latest experiments, the team used pre-trained models from the TensorFlow slim library as the baseline against which to measure DR's inference improvements.
