Google Announces 800M Parameter Vision-Language AI Model ALIGN

Google Research announced the development of A Large-scale ImaGe and Noisy-Text Embedding (ALIGN), an 800M-parameter pre-trained deep-learning model trained on a noisy dataset of 1.8B image-text pairs. The model can be used on several downstream tasks and achieves state-of-the-art accuracy on several image-text retrieval benchmarks.

Researchers Chao Jia and Yinfei Yang gave an overview of the work in a recent blog post. The team scraped HTML pages from the web and used the alt-text tags associated with the images to produce a dataset of image-text pairs. The ALIGN model, which combines a BERT-style natural language processing (NLP) encoder with an EfficientNet-style computer vision (CV) encoder, was pre-trained on this dataset. The result is a model that can map both images and text into a shared latent embedding space. This shared embedding can then be used for several image-text tasks, including image-text retrieval and image classification. The model also exhibits "image math" search properties, where an image of a panda plus the text "Australia" returns an image of a koala.

Image plus text queries

Source: https://ai.googleblog.com/2021/05/align-scaling-up-visual-and-vision.html
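To make the shared embedding space more concrete, the following is a minimal sketch of retrieval by cosine similarity, including an "image plus text" query formed by adding two embeddings. The three-dimensional vectors are made-up stand-ins for encoder outputs, and the function names (`retrieve`, `image_plus_text`) are hypothetical; nothing here comes from the ALIGN code itself.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))

def retrieve(query, index, k=1):
    """Return the k keys of `index` whose embeddings are most similar to `query`."""
    ranked = sorted(index, key=lambda name: cosine(query, index[name]), reverse=True)
    return ranked[:k]

def image_plus_text(image_vec, text_vec):
    """Build a combined query by summing the two unit-length embeddings."""
    return [a + b for a, b in zip(normalize(image_vec), normalize(text_vec))]

# Toy embeddings, invented for illustration only.
# The axes loosely stand for "bear-like", "China", and "Australia".
image_index = {
    "panda.jpg": [1.0, 1.0, 0.0],
    "koala.jpg": [1.0, 0.0, 1.0],
    "kangaroo.jpg": [0.0, 0.0, 1.0],
}
text_australia = [0.0, 0.0, 1.0]

# Text-to-image retrieval: the text embedding alone is the query.
print(retrieve(text_australia, image_index))

# "Image math": panda image + "Australia" text as a single query.
query = image_plus_text(image_index["panda.jpg"], text_australia)
print(retrieve(query, image_index))
```

With these toy vectors the combined query lands closest to the koala image, because it shares both the "bear-like" and "Australia" directions. Real embeddings have hundreds of dimensions and would be searched with an approximate nearest-neighbor index rather than a full sort.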

Training large deep-learning AI models requires large datasets. While recent NLP models have been pre-trained using unsupervised learning on datasets scraped from the web, most CV models are trained on curated datasets such as ImageNet and COCO that have been built and annotated by human workers. Thus, these datasets are much smaller than the NLP datasets used to train models such as GPT-3; for example, COCO contains only 330K images, whereas GPT-3 was trained on nearly half a trillion words.

In 2018, Google researchers published a paper describing the Conceptual Captions dataset, which was built by scraping images from web pages and using the alt-text tags to create annotations for the images. Conceptual Captions contained around 3M images, an order of magnitude more than COCO. Because the alt-text data was "noisy," Google created an automated filtering pipeline to improve the data quality: producing the 3M images required scraping over 5B images, a rejection rate of 99.94%. Along with this large dataset, Google also launched the Conceptual Captions challenge, which evaluates models against a held-out test set of about 12.5K image-text pairs.
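As a rough illustration of harvesting image-text pairs from alt-text, the sketch below uses Python's standard `html.parser` module to collect `(src, alt)` pairs from `<img>` tags. The class and function names are hypothetical, the sample page is invented, and the only "filter" is dropping images with empty or missing alt text; Google's actual pipeline is not described in enough detail here to reproduce.

```python
from html.parser import HTMLParser

class AltTextCollector(HTMLParser):
    """Collect (image URL, alt text) pairs from <img> tags."""

    def __init__(self):
        super().__init__()
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr = dict(attrs)
        src = attr.get("src")
        alt = (attr.get("alt") or "").strip()
        if src and alt:  # skip images that have no usable alt text
            self.pairs.append((src, alt))

def extract_pairs(html):
    """Parse an HTML string and return its list of (src, alt) pairs."""
    collector = AltTextCollector()
    collector.feed(html)
    return collector.pairs

# Invented sample page: one captioned image, one empty alt, one missing alt.
page = """
<html><body>
  <img src="https://example.com/koala.jpg" alt="A koala asleep in a eucalyptus tree">
  <img src="https://example.com/spacer.gif" alt="">
  <img src="https://example.com/logo.png">
</body></html>
"""
print(extract_pairs(page))
```

Only the first image survives, since the other two carry no text to pair with. A production pipeline would add further checks on the image and the text on top of this step.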

For this newest research, the Google team dispensed with the filtering steps and simply scraped nearly two billion noisy image-text pairs, two orders of magnitude larger than Conceptual Captions. The resulting dataset was used to train ALIGN, a deep-learning model based on two encoder architectures, a 340M-parameter BERT for the text data and a 480M-parameter EfficientNet for the images, using contrastive loss as the training objective for the combined model. The team evaluated the resulting model on the Flickr30K and COCO benchmarks, using both zero-shot and fine-tuning scenarios. Compared to previous work, ALIGN achieved new state-of-the-art accuracy on all tasks, by a "large margin." The model also performs well on the ImageNet classification benchmark, scoring 6th place on the leaderboard.
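The article does not spell out the exact form of the contrastive loss, so the following is only a sketch of one common formulation for dual-encoder models: a symmetric softmax loss over a batch, where each image should score highest against its own caption and vice versa. The fixed `temperature` value, the choice to sum the two directions, and the function names are all assumptions made for illustration.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def diag_nll(logits):
    """Mean negative log-softmax of the diagonal entry in each row.

    Row i holds the similarity of item i to every candidate; the matching
    candidate sits at column i.
    """
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)

def contrastive_loss(image_embs, text_embs, temperature=0.1):
    """Symmetric contrastive loss over a batch of paired embeddings.

    image_embs[i] and text_embs[i] are assumed to come from the same pair.
    The temperature is an assumed constant in this sketch.
    """
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    # logits[i][j]: scaled cosine similarity of image i and text j
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]
    transposed = [list(col) for col in zip(*logits)]
    # image-to-text term plus text-to-image term
    return diag_nll(logits) + diag_nll(transposed)

# Toy batch: correctly paired embeddings versus the same texts shuffled.
images = [[1.0, 0.0], [0.0, 1.0]]
matched = contrastive_loss(images, [[1.0, 0.0], [0.0, 1.0]])
shuffled = contrastive_loss(images, [[0.0, 1.0], [1.0, 0.0]])
print(matched < shuffled)
```

Minimizing this quantity pulls matching image and text embeddings together and pushes non-matching ones in the same batch apart, which is what produces the shared embedding space. Because every other item in the batch acts as a negative, no hand-labeled negatives are needed.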

Several other organizations have recently investigated combined vision-language models. In January of this year, OpenAI released the CLIP model, which was also trained on a dataset based on alt-text tags, containing 400M image-text pairs. CLIP had set the previous state-of-the-art records on many of the benchmarks used to evaluate ALIGN and has been open-sourced on GitHub. In April, Alibaba announced their M6 model, which was trained on an image-text dataset of 1.9TB of images and 292GB of text, also scraped from the web.

In a discussion on Reddit, AI writer Gwern Branwen compared ALIGN to similar research done by Google subsidiary DeepMind, noting:

"It may be underperforming intramodal fusion, but nevertheless, simple archs mean 'TPUs go brrrrr' and get SOTA and this ALIGN even beats CLIP!"

The Google team will present their paper on ALIGN at the upcoming International Conference on Machine Learning (ICML).
