Alibaba Announces 10 Billion Parameter Multi-Modal AI M6

Alibaba has created an AI model called Multi-Modality to Multi-Modality Multitask Mega-transformer (M6). The model contains 10 billion parameters and is pretrained on a dataset consisting of 1.9TB of images and 292GB of Chinese-language text. M6 can be fine-tuned for several downstream tasks, including text-guided image generation, visual question answering, and image-text matching.

The model and several experiments were described in a paper published on arXiv. M6 is based on the Transformer architecture, modified to accept both image and text input. To perform pretraining, Alibaba used several sources, including online encyclopedias, discussion forums, and e-commerce sites, to create a dataset combining Chinese-language text and related images. After pretraining, Alibaba researchers fine-tuned the model to perform several computer vision (CV) and natural language processing (NLP) tasks: image generation, visual question answering, image captioning, question answering, poem generation, and image-text matching. For some of these tasks, such as image-text matching, M6 showed improved performance compared to baseline models. Results on other tasks, such as image generation and poem generation, were evaluated by human judges.

Extremely large NLP models such as GPT-3 have made recent headlines, demonstrating near-human or even super-human performance on benchmark tasks. Inspired by the success of these models, researchers have adapted the Transformer architecture for other domains, including CV and combined vision-and-language problems. In 2019, Microsoft created UNiversal Image-TExt Representation Learning (UNITER), which achieved state-of-the-art performance on vision/language tasks, including visual question answering (VQA) and image-text retrieval. In 2020, Alibaba published a paper on InterBERT, their first iteration of M6, which they deployed to their e-commerce site Taobao, where they observed improved click-through rates from its search results. Earlier this year, OpenAI announced their DALL·E image generation model, based on GPT-3, and released several images demonstrating its ability to generate high-quality yet surreal images from natural language descriptions.

One challenge with these large models is that they require correspondingly large datasets. Because researchers often assemble these datasets by scraping websites such as Wikipedia, the data is dominated by English-language content. To train M6, Alibaba researchers assembled a combined text-and-image Chinese-language dataset that is, according to the team, "the first large-scale, multimodal and multidomain corpus for Chinese pretraining." The dataset contains both plain text and image-text pairs. There are 60.5M images, each of at least 5k pixels, for a total of 1.9TB, and 292.4GB of text containing nearly 420M text passages with almost 112B tokens.
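To put these figures in perspective, the following back-of-the-envelope sketch derives per-item averages from the reported totals. It assumes decimal units (1TB = 10^12 bytes, 1GB = 10^9 bytes) and treats the rounded counts ("nearly 420M", "almost 112B") as exact, so the results are rough estimates rather than figures from the paper.

```python
# Reported dataset totals (rounded figures from the article).
IMAGE_BYTES = 1.9e12      # 1.9TB of images, assuming decimal terabytes
NUM_IMAGES = 60.5e6       # 60.5M images
TEXT_BYTES = 292.4e9      # 292.4GB of text, assuming decimal gigabytes
NUM_PASSAGES = 420e6      # "nearly 420M" passages, treated as exact
NUM_TOKENS = 112e9        # "almost 112B" tokens, treated as exact

# Derived averages; these are estimates, not values stated in the source.
avg_image_bytes = IMAGE_BYTES / NUM_IMAGES
avg_tokens_per_passage = NUM_TOKENS / NUM_PASSAGES
avg_bytes_per_token = TEXT_BYTES / NUM_TOKENS

print(f"average image size: {avg_image_bytes / 1e3:.1f} KB")
print(f"average passage length: {avg_tokens_per_passage:.0f} tokens")
print(f"average text bytes per token: {avg_bytes_per_token:.2f}")
```

By this estimate, a typical image is a few tens of kilobytes and a typical passage a few hundred tokens long.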

To perform pretraining, the images in the dataset are split into smaller patches which are then fed into a feature extractor to produce a sequence of image features. The image feature sequences and text sequences are then fed into the Transformer as with a typical NLP model. M6 is pretrained using several different objectives, including text de-noising (similar to BERT and other NLP models); image-to-text transfer, where the model learns to generate image captions; and multimodality-to-text transfer, where the model learns to generate a target text string given image input and a masked text input.
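The patch-and-concatenate step described above can be sketched in a few lines. This is a minimal illustration, not Alibaba's implementation: the function names, the 2x2 patch size, and the toy image are hypothetical, and `extract_feature` simply flattens each patch as a stand-in for the learned feature extractor the article mentions.

```python
def split_into_patches(image, patch_size):
    """Split a 2D image (a list of rows) into non-overlapping square
    patches, returned in row-major order."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [image[row][left:left + patch_size]
                     for row in range(top, top + patch_size)]
            patches.append(patch)
    return patches


def extract_feature(patch):
    """Hypothetical stand-in for a learned feature extractor:
    flattens the patch into one feature vector."""
    return [float(value) for row in patch for value in row]


def build_input_sequence(image, text_token_ids, patch_size):
    """Place image-patch features first, then text tokens, forming the
    single sequence that would be fed to the Transformer."""
    image_items = [("img", extract_feature(patch))
                   for patch in split_into_patches(image, patch_size)]
    text_items = [("txt", token_id) for token_id in text_token_ids]
    return image_items + text_items


# Toy 4x4 single-channel "image" and three made-up token ids.
image = [[row * 4 + col for col in range(4)] for row in range(4)]
sequence = build_input_sequence(image, [101, 7, 42], patch_size=2)
print(len(sequence))  # 4 image patches + 3 text tokens
```

In a real model each item would be projected into the same embedding space before entering the Transformer; the tagged tuples here only make the ordering visible.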

Alibaba trained both a 10B-parameter version of M6, called M6-10B, and a 100B-parameter version based on a mixture of experts (MoE), dubbed M6-100B. Even with memory-saving techniques such as mixed-precision training and activation checkpointing, M6-10B was too large to fit in a single GPU, so the team used model-parallel training to spread the model across multiple GPUs. Training M6-100B was "much more difficult"; for this, the team used Alibaba's in-house parallel training framework, Whale.
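A rough estimate shows why a 10B-parameter model does not fit on one GPU. The sketch below assumes the commonly cited figure of about 16 bytes per parameter for mixed-precision training with the Adam optimizer (fp16 weights and gradients, plus fp32 master weights and two fp32 optimizer moments) and a hypothetical 32GB GPU. Neither the optimizer nor the hardware is stated in the article, and activation memory is ignored, so this is a lower bound under those assumptions.

```python
import math

# Assumed per-parameter cost of mixed-precision Adam training:
# 2 (fp16 weights) + 2 (fp16 gradients) + 4 (fp32 master weights)
# + 4 + 4 (fp32 Adam first and second moments) = 16 bytes.
BYTES_PER_PARAM_TRAINING = 16


def model_state_gb(num_params, bytes_per_param=BYTES_PER_PARAM_TRAINING):
    """Memory for weights, gradients and optimizer state, in decimal GB."""
    return num_params * bytes_per_param / 1e9


def min_gpus(num_params, gpu_memory_gb):
    """Smallest number of GPUs whose combined memory holds the model
    state, ignoring activations and any communication buffers."""
    return math.ceil(model_state_gb(num_params) / gpu_memory_gb)


M6_10B_PARAMS = 10_000_000_000
GPU_MEMORY_GB = 32  # hypothetical GPU size, not taken from the article

print(model_state_gb(M6_10B_PARAMS))           # training state in GB
print(min_gpus(M6_10B_PARAMS, GPU_MEMORY_GB))  # GPUs needed, lower bound
```

Under these assumptions the model state alone is several times larger than a single GPU's memory, which is consistent with the team's need for model-parallel training.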

Writing on Twitter, OpenAI's head of policy research Miles Brundage noted:

They mention a 100B model but no results from it, suggesting it didn't quite work. And MOE = less compute than a dense 100B. Still, a serious data/engineering/eval effort + further along than I'd have extrapolated from the 1st public GPT-2 scale Chinese LM a few months ago.

At this time, neither the M6 model nor the training data has been released, although Alibaba states that they intend to release the dataset to "nourish further development in the community." Alibaba's Damo Academy has several GitHub repositories containing the code used in their recent NLP research papers.
