Researchers from Arm Limited and Princeton University have developed a technique that produces deep-learning computer-vision models for internet-of-things (IoT) hardware with as little as 2KB of RAM. Using Bayesian optimization and network pruning, the team is able to shrink image-recognition models while still achieving state-of-the-art accuracy.
In a paper published on arXiv in May, the team described Sparse Architecture Search (SpArSe), a technique for finding convolutional neural network (CNN) computer-vision models that can run on severely resource-constrained microcontroller-unit (MCU) hardware. SpArSe uses multi-objective Bayesian optimization (MOBO) and neural-network pruning to find the best trade-off between model size and accuracy. According to the authors, SpArSe is the only system that has produced CNN models that run on hardware with as little as 2KB of RAM. Previous work on MCU-hosted computer vision used other machine-learning models, such as nearest-neighbor classifiers or decision trees. The team compared its CNN results with several of these solutions, showing that SpArSe produces models that are "more accurate and up to 4.35x smaller."
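Multi-objective optimization returns a set of Pareto-optimal candidates rather than a single winner: models for which no other candidate is both more accurate and smaller. The selection step can be sketched as follows; this is an illustration, not SpArSe's implementation, and the candidate names and numbers are hypothetical.

```python
# Illustrative sketch: select the Pareto-optimal trade-offs between
# model error and model size from a set of evaluated candidates.

def pareto_front(candidates):
    """Return candidates not dominated on (error, size); lower is better."""
    front = []
    for c in candidates:
        dominated = any(
            o["error"] <= c["error"] and o["size"] <= c["size"]
            and (o["error"] < c["error"] or o["size"] < c["size"])
            for o in candidates
        )
        if not dominated:
            front.append(c)
    return front

# Hypothetical (error, size-in-bytes) results for four candidate models
models = [
    {"name": "A", "error": 0.02, "size": 90_000},
    {"name": "B", "error": 0.05, "size": 1_800},
    {"name": "C", "error": 0.04, "size": 2_500},
    {"name": "D", "error": 0.06, "size": 2_000},  # dominated by B
]
print([m["name"] for m in pareto_front(models)])  # → ['A', 'B', 'C']
```

Model D is dropped because B is both more accurate and smaller; the remaining three each represent a different point on the size/accuracy trade-off.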
MCU devices are popular for IoT and other embedded applications because of their low cost and power consumption, but these qualities come with a trade-off: limited storage and memory. The Arduino Uno, for example, has only 32KB of flash storage and 2KB of RAM. These devices do not have the resources to perform inference with state-of-the-art CNN models: flash storage constrains the number of model parameters, and RAM constrains the number of activations. A relatively small CNN such as LeNet, with approximately 60,000 parameters, requires almost double the Uno's storage even after applying compression techniques such as integer quantization. The only solution is to reduce the overall number of weights and activations.
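The storage claim can be checked with back-of-the-envelope arithmetic, assuming 8-bit integer quantization stores one byte per parameter:

```python
# Rough check: does a ~60,000-parameter model fit in 32KB of flash?
params = 60_000           # approximate LeNet parameter count
flash_bytes = 32 * 1024   # Arduino Uno flash storage
model_bytes = params * 1  # one byte per weight with int8 quantization
print(model_bytes / flash_bytes)  # ≈ 1.83, nearly double the flash
```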
The key to reducing model size with SpArSe is pruning. Similar to dropout, pruning removes neurons from the network. However, instead of randomly turning off neurons during a forward pass in training, pruning permanently removes the neurons from the network. This technique can "reduce the number of parameters up to 280 times on LeNet architectures...with a negligible decrease of accuracy."
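As an illustration, here is a minimal sketch of magnitude-based pruning, one common pruning criterion (the paper's pruning strategy is more involved): the smallest-magnitude weights are zeroed and stay removed, unlike dropout's temporary random masking during training.

```python
# Magnitude-based pruning sketch: permanently zero the smallest weights.
import numpy as np

def prune_by_magnitude(weights, sparsity=0.9):
    """Zero out the smallest-magnitude `sparsity` fraction of weights."""
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)
    threshold = np.partition(flat, k)[k]   # k-th smallest magnitude
    mask = np.abs(weights) >= threshold    # keep only the large weights
    return weights * mask, mask

rng = np.random.default_rng(0)
w = rng.normal(size=(100, 100))            # toy weight matrix
pruned, mask = prune_by_magnitude(w, sparsity=0.9)
print(mask.mean())                         # fraction kept, roughly 0.1
```

In a real deployment the surviving weights would then be stored in a sparse format so the zeroed entries cost no flash or RAM.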
In addition to network pruning, SpArSe searches over hyperparameters such as the number of network layers and the convolution filter size, attempting to find the most accurate model that also has a minimal number of parameters and activations. Hyperparameter optimization and architecture search, often described as automated machine learning (AutoML), are active research areas in deep learning; Facebook recently released tools for Bayesian optimization similar to the techniques used by SpArSe. In contrast to SpArSe, however, these tools usually concentrate on finding the most accurate model, without regard to size.
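The search loop can be sketched with random search standing in for Bayesian optimization (MOBO additionally fits a surrogate model of the objectives to propose promising candidates); the search space and toy objective below are illustrative, not the paper's.

```python
# Simplified hyperparameter search balancing accuracy against size.
import random

search_space = {
    "num_layers": [1, 2, 3],
    "filter_size": [3, 5, 7],
    "num_filters": [4, 8, 16],
}

def evaluate(config):
    # Placeholder objective: in practice this trains the candidate CNN
    # and measures validation error and memory footprint.
    params = config["num_layers"] * config["num_filters"] * config["filter_size"] ** 2
    error = 1.0 / (1.0 + params / 100)  # toy proxy: more capacity, less error
    return error, params

random.seed(0)
best, best_score = None, float("inf")
for _ in range(20):
    config = {k: random.choice(v) for k, v in search_space.items()}
    error, params = evaluate(config)
    score = error + 1e-4 * params       # scalarized size/accuracy trade-off
    if score < best_score:
        best, best_score = config, score
print(best)
```

The weight on the parameter count (here 1e-4) controls how aggressively the search favors small models over accurate ones.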
The research team compared SpArSe with Bonsai, a computer-vision system based on decision trees that also produces models that operate in less than 2KB of RAM. While SpArSe outperformed Bonsai on the MNIST dataset, Bonsai won on CIFAR-10. Furthermore, SpArSe required one GPU-day of training time on MNIST and 18 GPU-days on CIFAR-10, whereas Bonsai required only 15 minutes on a single-core laptop.