
Google Announces 800M Parameter Vision-Language AI Model ALIGN


Google Research announced the development of A Large-scale ImaGe and Noisy-Text Embedding (ALIGN), an 800M-parameter pre-trained deep-learning model trained on a noisy dataset of 1.8B image-text pairs. The model can be applied to several downstream tasks and achieves state-of-the-art accuracy on multiple image-text retrieval benchmarks.

Researchers Chao Jia and Yinfei Yang gave an overview of the work in a recent blog post. The team scraped HTML pages from the web and used the alt-text tags associated with the images to produce a dataset of image-text pairs. The ALIGN model, which combines a BERT-style natural language processing (NLP) encoder with an EfficientNet-style computer vision (CV) encoder, was pre-trained on this dataset. The result is a model that can map both images and text into a shared latent embedding space. This shared embedding can then be used for several image-text tasks, including image-text retrieval and image classification. The model also exhibits "image math" search properties: for example, an image of a panda plus the text "Australia" returns an image of a koala.
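The "image math" behavior falls out of the shared embedding space: because images and text map to the same vector space, their embeddings can be added and the sum used as a nearest-neighbor query. The toy sketch below illustrates the idea with hand-made 2-D vectors standing in for real encoder outputs; the function names and the embedding values are illustrative assumptions, not ALIGN's actual retrieval code.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def image_math_query(image_emb, text_emb, index):
    """Add an image embedding and a text embedding, then return the key
    of the nearest indexed image by cosine similarity (toy version)."""
    query = [i + t for i, t in zip(image_emb, text_emb)]
    return max(index, key=lambda k: cosine(query, index[k]))

# Toy 2-D embeddings standing in for real model outputs.
panda_img = [1.0, 0.0]       # pretend image embedding of a panda photo
australia_txt = [0.0, 1.0]   # pretend text embedding of "Australia"
index = {"koala": [0.7, 0.7], "panda": [1.0, 0.1], "kangaroo": [0.1, 0.9]}
print(image_math_query(panda_img, australia_txt, index))  # → koala
```

With these made-up vectors, the combined query sits between the "panda" and "Australia" directions, so the koala image, which shares both components, ranks highest.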

Figure: image plus text queries

Training large deep-learning AI models requires large datasets. While recent NLP models have been pre-trained using unsupervised learning on datasets scraped from the web, most CV models are trained on curated datasets such as ImageNet and COCO, which have been built and annotated by human workers. These curated datasets are therefore much smaller than the NLP datasets used to train models such as GPT-3; for example, COCO contains only 330K images, whereas GPT-3 was trained on nearly half a trillion words.

In 2018, Google researchers published a paper describing the Conceptual Captions dataset, which was built by scraping images from web pages and using the alt-text tags to create annotations for the images. Conceptual Captions contained around 3M images, an order of magnitude more than COCO. Because the alt-text data was "noisy," Google created an automated filtering pipeline to improve the data quality: producing the 3M images required scraping over 5B candidates, a rejection rate of 99.94%. Along with this large dataset, Google also launched the Conceptual Captions challenge, which evaluates models against a held-out test set of about 12.5K image-text pairs.
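To give a flavor of what such a filtering pipeline does, the sketch below applies two simple heuristics, caption length and the fraction of alphabetic tokens, to decide whether to keep an image-text pair. These particular rules and thresholds are hypothetical illustrations, not the filters Google actually used.

```python
def keep_pair(alt_text, min_words=3, max_words=50):
    """Hypothetical alt-text filter: discard pairs whose captions are
    too short, too long, or dominated by non-alphabetic tokens
    (e.g. raw filenames). Thresholds are illustrative assumptions."""
    words = alt_text.split()
    if not (min_words <= len(words) <= max_words):
        return False
    alpha = sum(1 for w in words if w.isalpha())
    return alpha / len(words) >= 0.5

pairs = ["a dog catching a frisbee in the park",
         "IMG_0042.JPG",
         "photo of two children playing chess outdoors"]
print([keep_pair(p) for p in pairs])  # → [True, False, True]
```

A real pipeline would add many more signals (language detection, image-text relevance models, deduplication), but the overall shape, a cheap predicate applied to billions of candidates, is the same.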

For this newest research, the Google team dispensed with the filtering steps and simply scraped nearly two billion noisy image-text pairs, more than two orders of magnitude larger than Conceptual Captions. The resulting dataset was used to train ALIGN, a deep-learning model based on two encoder architectures, a 340M-parameter BERT for the text data and a 480M-parameter EfficientNet for the images, using contrastive loss as the training objective for the combined model. The team evaluated the resulting model on the Flickr30K and COCO benchmarks, using both zero-shot and fine-tuning scenarios. Compared to previous work, ALIGN achieved new state-of-the-art accuracy on all tasks, by a "large margin." The model also performs well on the ImageNet classification benchmark, scoring 6th place on the leaderboard.
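The contrastive objective pulls each matched image-text pair together in the embedding space while pushing apart every mismatched pairing in the batch. A minimal pure-Python sketch of such a symmetric, InfoNCE-style loss is shown below; the exact form and hyperparameters in ALIGN may differ, so treat this as an illustration of the technique rather than the paper's implementation.

```python
import math

def contrastive_loss(img_embs, txt_embs, temperature=0.1):
    """Symmetric contrastive (InfoNCE-style) loss: matched image-text
    pairs on the diagonal are pulled together; all other pairings in
    the batch act as negatives. Embeddings assumed L2-normalized."""
    n = len(img_embs)
    # Temperature-scaled similarity matrix: sims[i][j] = <img_i, txt_j> / T
    sims = [[sum(a * b for a, b in zip(img_embs[i], txt_embs[j])) / temperature
             for j in range(n)] for i in range(n)]

    def xent(row, target):
        # Cross-entropy of a softmax over one row, with the matched
        # pair at position `target` as the correct class.
        z = sum(math.exp(s) for s in row)
        return -math.log(math.exp(row[target]) / z)

    # Average the image-to-text and text-to-image directions.
    i2t = sum(xent(sims[i], i) for i in range(n)) / n
    t2i = sum(xent([sims[i][j] for i in range(n)], j) for j in range(n)) / n
    return (i2t + t2i) / 2

# Perfectly aligned toy batch: diagonal similarities dominate, loss is tiny.
imgs = [[1.0, 0.0], [0.0, 1.0]]
txts = [[1.0, 0.0], [0.0, 1.0]]
print(contrastive_loss(imgs, txts))
```

In training, this loss is minimized with respect to the encoder weights that produce the embeddings; here the embeddings are fixed toy vectors, so the example only shows that aligned batches score lower than misaligned ones.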

Several other organizations have recently investigated combined vision-language models. In January of this year, OpenAI released the CLIP model that was also trained on a dataset based on alt-text tags, containing 400M image-text pairs. CLIP had set the previous state-of-the-art records on many of the benchmarks used to evaluate ALIGN and has been open-sourced on GitHub. In April, Alibaba announced their M6 model which was trained on an image-text dataset of 1.9TB of images and 292GB of text, also scraped from the web.

In a discussion on Reddit, AI writer Gwern Branwen compared ALIGN to similar research done by Google subsidiary DeepMind, noting:

It may be underperforming intramodal fusion, but nevertheless, simple archs mean 'TPUs go brrrrr' and get SOTA and this ALIGN even beats CLIP!

The Google team will present their paper on ALIGN at the upcoming International Conference on Machine Learning (ICML).
