Alibaba Announces 10 Billion Parameter Multi-Modal AI M6

Alibaba has created an AI model called Multi-Modality to Multi-Modality Multitask Mega-transformer (M6). The model contains 10 billion parameters and is pretrained on a dataset consisting of 1.9TB of images and 292GB of Chinese-language text. M6 can be fine-tuned for several downstream tasks, including text-guided image generation, visual question answering, and image-text matching.

The model and several experiments were described in a paper published on arXiv. M6 is based on the Transformer architecture, modified to accept both image and text input. To perform pretraining, Alibaba used several sources, including online encyclopedias, discussion forums, and e-commerce sites, to create a dataset combining Chinese-language text and related images. After pretraining, Alibaba researchers fine-tuned the model to perform several computer vision (CV) and natural language processing (NLP) tasks: image generation, visual question answering, image captioning, question answering, poem generation, and image-text matching. For some of these tasks, such as image-text matching, M6 showed improved performance compared to baseline models. Results on other tasks, such as image generation and poem generation, were evaluated by human judges.

Extremely large NLP models such as GPT-3 have made recent headlines, demonstrating near-human or even super-human performance on benchmark tasks. Inspired by the success of these models, researchers have adapted the Transformer architecture for other domains, including CV and combined vision-and-language problems. In 2019, Microsoft created UNiversal Image-TExt Representation Learning (UNITER), which achieved state-of-the-art performance on vision/language tasks, including visual question answering (VQA) and image-text retrieval. In 2020, Alibaba published a paper on InterBERT, their first iteration of M6, which they deployed to their e-commerce site Taobao, where they observed improved click-through rates from its search results. Earlier this year, OpenAI announced their DALL·E image generation model, based on GPT-3, and released several images demonstrating its ability to generate high-quality yet surreal images from natural language descriptions.

One challenge with these large models is that they require correspondingly large datasets. Because researchers often assemble these datasets by scraping websites such as Wikipedia, the data is dominated by English-language content. To train M6, Alibaba researchers assembled a combined text-and-image Chinese-language dataset that is, according to the team, "the first large-scale, multimodal and multidomain corpus for Chinese pretraining." The dataset contains both plain text and image-text pairs. There are 60.5M images, each of at least 5k pixels, for a total of 1.9TB, and 292.4GB of text containing nearly 420M text passages with almost 112B tokens.

To perform pretraining, the images in the dataset are split into smaller patches which are then fed into a feature extractor to produce a sequence of image features. The image feature sequences and text sequences are then fed into the Transformer as with a typical NLP model. M6 is pretrained using several different objectives, including text de-noising (similar to BERT and other NLP models); image-to-text transfer, where the model learns to generate image captions; and multimodality-to-text transfer, where the model learns to generate a target text string given image input and a masked text input.
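The patch-and-sequence step described above can be illustrated with a minimal NumPy sketch. This is not Alibaba's code: the function names, the patch size, the random linear projection standing in for the feature extractor, and the random embedding table with a vocabulary of 1,000 are all assumptions made for illustration.

```python
import numpy as np


def image_to_patch_sequence(image, patch_size):
    """Split an (H, W, C) image into non-overlapping square patches.

    Returns an array of shape (num_patches, patch_size * patch_size * C),
    with patches ordered row by row.
    """
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("image height and width must be multiples of patch_size")
    rows, cols = h // patch_size, w // patch_size
    patches = image.reshape(rows, patch_size, cols, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)  # (rows, cols, p, p, c)
    return patches.reshape(rows * cols, patch_size * patch_size * c)


def build_input_sequence(image, token_ids, patch_size, d_model, seed=0):
    """Concatenate image-patch features and text embeddings into one sequence."""
    rng = np.random.default_rng(seed)
    patches = image_to_patch_sequence(image, patch_size)

    # Assumption: a random linear projection stands in for the real
    # feature extractor, which the article does not specify.
    w_img = rng.normal(size=(patches.shape[1], d_model))
    image_features = patches @ w_img

    # Assumption: a random embedding table with a hypothetical vocabulary size.
    vocab_size = 1000
    embedding_table = rng.normal(size=(vocab_size, d_model))
    text_features = embedding_table[np.asarray(token_ids)]

    # The combined sequence is what a Transformer would consume.
    return np.concatenate([image_features, text_features], axis=0)
```

For example, a 64x64 RGB image with a patch size of 16 yields 16 patch features; combined with three text tokens, the resulting sequence has 19 positions, each of width d_model.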

Alibaba trained both a 10B-parameter version of M6, called M6-10B, and a 100B-parameter version based on a mixture of experts (MoE), dubbed M6-100B. Even with memory-saving techniques such as mixed-precision training and activation checkpointing, M6-10B was too large to fit in a single GPU, so the team used model-parallel training to spread the model across multiple GPUs. Training M6-100B was "much more difficult" and was done using Alibaba's in-house parallel training framework, Whale.
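To give a rough idea of why an MoE model can hold many parameters while using only a fraction of them per token, here is a generic top-k routing sketch in NumPy. It is not M6-100B's architecture and has nothing to do with the Whale framework; the function name, the softmax gate, and the use of plain square matrices as experts are assumptions chosen to keep the example short.

```python
import numpy as np


def moe_layer(x, gate_w, expert_ws, top_k=1):
    """Route each token to its top_k experts and mix their outputs.

    x:         (tokens, d) float array of token representations
    gate_w:    (d, n_experts) gating weights
    expert_ws: (n_experts, d, d) one weight matrix per expert (hypothetical)
    Returns (output, chosen_experts).
    """
    # Softmax over experts gives a routing probability per token.
    logits = x @ gate_w
    logits = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)

    # Indices of the top_k most probable experts for each token.
    top = np.argsort(-probs, axis=1)[:, :top_k]

    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        for e in top[t]:
            # Only the selected experts run for this token.
            out[t] += probs[t, e] * (x[t] @ expert_ws[e])
    return out, top
```

In this sketch the layer stores n_experts * d * d expert weights, but each token multiplies against only top_k * d * d of them, which is the sense in which a sparse model needs less compute than a dense one of the same parameter count.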

Writing on Twitter, OpenAI's head of policy research Miles Brundage noted:

They mention a 100B model but no results from it, suggesting it didn't quite work. And MOE = less compute than a dense 100B. Still, a serious data/engineering/eval effort + further along than I'd have extrapolated from the 1st public GPT-2 scale Chinese LM a few months ago.

At this time, neither the M6 model nor the training data have been released, although Alibaba states that they intend to release the dataset to "nourish further development in the community." Alibaba's Damo Academy has several GitHub repositories containing the code used in their recent NLP research papers.
