Meta Open-Sources Multi-Modal AI Algorithm Data2vec

Meta AI recently open-sourced data2vec, a unified framework for self-supervised deep learning on images, text, and speech audio data. When evaluated on common benchmarks, models trained using data2vec perform as well as or better than state-of-the-art models trained with modality-specific objectives.

The algorithm and experiments are described in a paper published on arXiv. Data2vec unifies self-supervised learning by having models learn to predict representations of the input data (that is, the values in the hidden layers of a neural network) rather than the input data itself. This abstraction away from the input allows the same training algorithm to be used for many different data types. To demonstrate data2vec's effectiveness, the Meta researchers separately trained models for computer vision (CV), natural language processing (NLP), and speech recognition (SR). Their models outperformed previous self-supervised models on CV and SR tasks, and were "competitive" on NLP. According to the Meta team:

"In addition to helping accelerate progress in AI, data2vec brings us closer to building machines that learn seamlessly about different aspects of the world around them. It will enable us to develop more adaptable AI, which we believe will be able to perform tasks beyond what today’s systems can do."

Because supervised machine learning often requires training on large hand-labeled datasets to perform well, many researchers have turned to transfer learning, where a model is pre-trained via self-supervised learning on a large unlabeled dataset, then fine-tuned for a specific task. Many pre-trained NLP models, such as BERT, use a masked language model objective for self-supervised training, where the model is trained to predict words or tokens that are masked from an input sequence. Similar objectives have been applied to other domains, but different data types are often pre-trained with different training objectives; for example, CV models often use a contrastive loss, learning to map similar images to neighborhoods in a latent space.
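As a rough illustration of the masked language model objective described above, the following sketch hides a random subset of tokens and records the originals as prediction targets. It is a simplification rather than BERT's actual procedure; the function name `mask_tokens`, the `[MASK]` placeholder string, and the masking probability are assumptions chosen for the example.

```python
import random

MASK = "[MASK]"  # placeholder symbol; an assumption for this sketch


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Hide a random subset of tokens.

    Returns (masked_tokens, targets), where targets maps each masked
    position to the original token the model would be trained to predict.
    """
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(MASK)
            targets[i] = tok  # the training label for this position
        else:
            masked.append(tok)
    return masked, targets


tokens = "the model learns to predict hidden words".split()
masked, targets = mask_tokens(tokens, mask_prob=0.3, seed=1)
```

A model trained with this objective sees `masked` as input and is scored only on the positions listed in `targets`.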

For data2vec, the Meta team opted to use a masked learning objective, but instead of predicting masked tokens or units of input, the training objective is to predict "contextualized latent representations" based on the entire input. The model is based on a Transformer network and is used during training in either "teacher" or "student" mode. First, the teacher encodes the full input into a representation. Next, the student is fed the same input with some of the data masked. The student must then predict the full representation produced by the teacher; that is, it must predict the state of multiple hidden layers in the teacher.
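The teacher-student objective can be sketched in a few lines. The example below is a toy under stated assumptions, not the released implementation: the teacher's hidden states are hard-coded lists rather than Transformer outputs, the target is taken to be a plain average over the teacher's layers, and the loss is a mean squared error computed only at masked positions. The helper names `average_layers` and `masked_regression_loss` are hypothetical.

```python
def average_layers(layer_outputs):
    """Average per-position vectors across layers.

    layer_outputs: list of layers; each layer is a list of
    per-position vectors (lists of floats) of equal size.
    """
    num_layers = len(layer_outputs)
    num_positions = len(layer_outputs[0])
    dim = len(layer_outputs[0][0])
    return [
        [sum(layer[p][d] for layer in layer_outputs) / num_layers
         for d in range(dim)]
        for p in range(num_positions)
    ]


def masked_regression_loss(student_pred, teacher_target, masked_positions):
    """Mean squared error, counted only at the masked positions."""
    total, count = 0.0, 0
    for p in masked_positions:
        for s, t in zip(student_pred[p], teacher_target[p]):
            total += (s - t) ** 2
            count += 1
    return total / count


# Hypothetical teacher hidden states: 2 layers, 3 positions, 2 dimensions.
teacher_layers = [
    [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]],
    [[3.0, 0.0], [0.0, 3.0], [4.0, 0.0]],
]
target = average_layers(teacher_layers)

# Hypothetical student predictions; positions 1 and 2 were masked in its input.
student_pred = [[2.0, 0.0], [0.0, 1.0], [3.0, 1.0]]
loss = masked_regression_loss(student_pred, target, masked_positions=[1, 2])
```

Because the target is a vector of hidden-layer values rather than a word, pixel, or audio sample, nothing in the loss computation depends on the type of the input data.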

To evaluate the performance of data2vec, the Meta researchers used the algorithm to pre-train several models. To do this, the team first implemented "modality-specific feature encoders and masking strategies" to feed into a generic Transformer. They pre-trained three sets of models and evaluated them on the ImageNet (CV), Librispeech (SR), and GLUE (NLP) benchmarks. On ImageNet-1K, the data2vec models outperformed similar-sized ViT models, and on Librispeech, data2vec outperformed "the best prior work," including HuBERT. On GLUE, the data2vec model performed "competitively" with a baseline RoBERTa model.
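The idea of modality-specific front ends feeding a shared model can be illustrated with a toy sketch. The three functions below are stand-ins invented for this example, not the encoders used by the researchers: each turns a raw input into a sequence of units, which is the only form a generic sequence model would need to see.

```python
def encode_text(text):
    """Text: split on whitespace (a stand-in for subword tokenization)."""
    return text.split()


def encode_image(pixels, patch):
    """Image: cut a 2D grid into non-overlapping patch x patch squares."""
    rows, cols = len(pixels), len(pixels[0])
    units = []
    for r in range(0, rows - patch + 1, patch):
        for c in range(0, cols - patch + 1, patch):
            # Flatten each square patch into one unit, row by row.
            units.append([pixels[r + i][c + j]
                          for i in range(patch) for j in range(patch)])
    return units


def encode_audio(samples, frame):
    """Audio: cut a waveform into fixed-length, non-overlapping frames."""
    return [samples[i:i + frame]
            for i in range(0, len(samples) - frame + 1, frame)]


# Each modality ends up as a plain sequence of units.
text_units = encode_text("speech vision language")
image_units = encode_image([[1, 2, 3, 4], [5, 6, 7, 8]], patch=2)
audio_units = encode_audio([0.1, 0.2, 0.3, 0.4, 0.5], frame=2)
```

In this sketch, trailing pixels or samples that do not fill a whole patch or frame are simply dropped.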

On Twitter, lead researcher Alexei Baevski answered several questions about the work. He noted that training the NLP model took "about 3.5 days" using 16 GPUs.

The data2vec code and pre-trained models for SR and NLP are available on GitHub. The CV model is not currently available, but is listed as "coming soon."

About the Author

Anthony Alford
