Google Open-Sources Computer Vision Model Big Transfer

Google Brain has released the pre-trained models and fine-tuning code for Big Transfer (BiT), a deep-learning computer vision model. The models are pre-trained on publicly available generic image datasets and can meet or exceed state-of-the-art performance on several vision benchmarks after fine-tuning on just a few samples.

Paper co-authors Lucas Beyer and Alexander Kolesnikov gave an overview of their work in a recent blog post. To help advance the performance of deep-learning vision models, the team investigated large-scale pre-training and the effects of model size, dataset size, training duration, normalization strategy, and hyperparameter choice. As a result of this work, the team developed a "recipe" of components and training heuristics that achieves strong performance on a variety of benchmarks, including an "unprecedented top-5 accuracy of 80.0%" on the ObjectNet dataset. Beyer and Kolesnikov claim:

[Big Transfer] will allow anyone to reach state-of-the-art performance on their task of interest, even with just a handful of labeled images per class.

Deep-learning models have made great strides in computer vision, particularly in recognizing objects in images. One key to this success has been the availability of large-scale labeled datasets: collections of images paired with text labels for the objects they contain. These datasets must be created manually, with human workers applying a label to each image: the popular ImageNet dataset, for example, contains over 14 million labeled images spanning 21k different object classes. However, the images are usually generic, showing commonplace objects such as people, pets, or household items. Creating a dataset of similar scale for a specialized task, say for an industrial robot, might be prohibitively expensive or time-consuming.

In this situation, AI engineers often apply transfer learning, a strategy that has become popular with large-scale natural-language processing (NLP) models. A neural network is first pre-trained on a large generic dataset until it achieves a certain level of performance on a test dataset. Then the model is fine-tuned on a smaller task-specific dataset, sometimes with as few as a single example of each task-specific object. Large NLP models routinely set new state-of-the-art performance levels using transfer learning.
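
The pre-train-then-fine-tune pattern can be illustrated with a toy, standard-library-only sketch. It is not the BiT fine-tuning code: `frozen_features` is a hypothetical stand-in for a pre-trained network, and the sketch trains only a small logistic-regression head on top of it, which is a common simplification of fine-tuning rather than the method described in the paper. The four-example dataset is likewise invented for illustration.

```python
import math

def frozen_features(x):
    """Hypothetical stand-in for a pre-trained backbone: a fixed feature map
    whose parameters are never updated during fine-tuning."""
    a, b = x
    return [a, b, a * b, 1.0]  # the trailing 1.0 acts as a bias feature

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fine_tune_head(examples, steps=500, lr=0.5):
    """Train only a logistic-regression head on top of the frozen features,
    using plain stochastic gradient descent on the log loss."""
    weights = [0.0] * 4
    for _ in range(steps):
        for x, label in examples:
            feats = frozen_features(x)
            pred = sigmoid(sum(w * f for w, f in zip(weights, feats)))
            error = pred - label  # gradient of the log loss w.r.t. the logit
            weights = [w - lr * error * f for w, f in zip(weights, feats)]
    return weights

def predict(weights, x):
    feats = frozen_features(x)
    score = sum(w * f for w, f in zip(weights, feats))
    return 1 if sigmoid(score) >= 0.5 else 0

# Invented few-shot task: two labeled examples per class.
few_shot = [((1.0, 1.0), 1), ((-1.0, -1.0), 1),
            ((1.0, -1.0), 0), ((-1.0, 1.0), 0)]
head = fine_tune_head(few_shot)
```

Because the fixed features already expose the product `a * b`, four labeled examples are enough for the head to separate the two classes; this mirrors, in miniature, why a strong pre-trained representation reduces the amount of task-specific data needed.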

For BiT, the Google researchers used a ResNet-v2 neural architecture. To investigate the effects of pre-training dataset size, the team replicated their experiments on three groups of models pre-trained with different datasets: BiT-S models pre-trained on 1.28M images from ILSVRC-2012, BiT-M models pre-trained on 14.2M images from ImageNet-21k, and BiT-L models pre-trained on 300M images from JFT-300M. The models were then fine-tuned and evaluated on several common benchmarks: ILSVRC-2012, CIFAR-10/100, Oxford-IIIT Pet, and Oxford Flowers-102.

The team noted several findings from their experiments. First, the benefits from increasing model size diminish on smaller datasets, and there is little benefit in pre-training smaller models on larger datasets. Second, the large models performed better with group normalization than with batch normalization. Finally, to avoid an expensive hyperparameter search during fine-tuning, the team developed a heuristic called BiT-HyperRule, where all hyperparameters are fixed except "training schedule length, resolution, and whether to use MixUp regularization."
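
A heuristic of this kind can be sketched as a small lookup function that derives the three variable settings from the size of the task. The function name, the returned keys, and every threshold and value below are assumptions for illustration; they are not given in this article and should be checked against the BiT paper and the released code before being relied on.

```python
# All constants are illustrative assumptions, not values from this article.
SMALL_TASK_MAX_EXAMPLES = 20_000    # assumed boundary: small vs. medium task
MEDIUM_TASK_MAX_EXAMPLES = 500_000  # assumed boundary: medium vs. large task
LOW_RES_MAX_SIDE = 96               # assumed boundary for "small" images

def hyperrule_sketch(num_examples, image_side):
    """Pick schedule length, resolution, and MixUp usage from task size.

    Everything else (optimizer, learning rate, and so on) is assumed to be
    fixed elsewhere, which is the point of the heuristic.
    """
    if num_examples < SMALL_TASK_MAX_EXAMPLES:
        steps, use_mixup = 500, False
    elif num_examples < MEDIUM_TASK_MAX_EXAMPLES:
        steps, use_mixup = 10_000, True
    else:
        steps, use_mixup = 20_000, True

    # Resize first, then take a random crop of the smaller size.
    if image_side < LOW_RES_MAX_SIDE:
        resize, crop = 160, 128
    else:
        resize, crop = 448, 384

    return {"steps": steps, "resize": resize,
            "crop": crop, "use_mixup": use_mixup}

# Hypothetical tasks: a tiny high-resolution dataset and a mid-sized
# low-resolution one.
tiny_task = hyperrule_sketch(num_examples=1_000, image_side=224)
mid_task = hyperrule_sketch(num_examples=50_000, image_side=32)
```

Whatever the exact values, the design choice is the same: only the dataset size and the image resolution are inspected, so a new task needs no hyperparameter search.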

Google has released the best-performing pre-trained models from the BiT-S and BiT-M groups. However, it has not released any of the BiT-L models, which were pre-trained on the JFT-300M dataset. Commenters on Hacker News pointed out that no model trained on JFT-300M has ever been released. One commenter pointed to several models released by Facebook which were pre-trained on an even larger dataset. Another said:

I've wondered if legal/copyright issues block any release: there's always someone who tries to argue that a model is a derived work, and nothing in the JFT-300M papers mentions having licenses covering public redistribution.

The code for fine-tuning and tutorials for using the released pre-trained models are available on GitHub.
