Meta has released version 2.0 of data2vec, a self-supervised algorithm that can learn in the same way from three different modalities: speech, vision, and text. It matches the accuracy of existing computer vision models while being 16x faster. The code and pretrained models have also been shared with the research community.
Self-supervised learning enables machines to learn without relying on labeled data: these models learn the structure of text, speech, and images simply by observing the world. Making such models more efficient to train is particularly important for videos, which are computationally expensive to process.
The data2vec 2.0 algorithm improves the efficiency of the original data2vec: it predicts contextualized representations of the data instead of individual words of a sentence or pixels of an image. This means the algorithm takes the whole training example into account, which lets data2vec learn faster than other algorithms.
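The idea of a contextualized target can be illustrated with a minimal PyTorch sketch (a simplification with toy dimensions, not Meta's released code): a teacher network encodes the full, unmasked example, and the training target for each position is derived from the teacher's internal layer outputs, here averaged over the top-K layers, so every target depends on the whole input rather than a single token or pixel.

```python
import torch
import torch.nn as nn

class ToyTeacher(nn.Module):
    """A small Transformer stack standing in for the teacher network (hypothetical sizes)."""
    def __init__(self, dim=64, depth=4, heads=4):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, heads, batch_first=True) for _ in range(depth)
        )

    def forward(self, x):
        outputs = []
        for layer in self.layers:
            x = layer(x)
            outputs.append(x)
        return outputs  # one tensor per layer, each of shape (batch, seq, dim)

@torch.no_grad()
def contextualized_targets(teacher, x, top_k=2):
    # Each position's target is an average over the teacher's top-K layer outputs
    # computed on the *unmasked* input, so it reflects the whole example,
    # not just the local token or pixel.
    layer_outputs = teacher(x)
    return torch.stack(layer_outputs[-top_k:]).mean(dim=0)

teacher = ToyTeacher()
x = torch.randn(8, 16, 64)              # (batch, sequence, feature) toy input
targets = contextualized_targets(teacher, x)
print(targets.shape)                    # torch.Size([8, 16, 64])
```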
The efficiency improvement comes from several techniques. Target representations are built once for a particular training sample and reused across multiple masked versions, where different parts of the training example are hidden; a student model predicts the same contextualized target for each masked version, so the computational effort of creating the targets is amortized. Another trick is to skip running the student encoder network on the parts of the training example that are hidden, saving significant compute cycles, an approach similar to masked autoencoders (MAE). Finally, a more efficient decoder based on a multilayer convolutional network is used instead of a Transformer network.
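The following hypothetical PyTorch sketch (made-up module names and sizes, not Meta's implementation) shows roughly how the three tricks fit together: the teacher target is computed once per sample and reused for several masked versions, the student encoder only processes the visible positions, and a small convolutional decoder predicts the teacher representations at the masked positions.

```python
import torch
import torch.nn as nn

class ConvDecoder(nn.Module):
    """Lightweight multilayer 1-D convolutional decoder (hypothetical configuration)."""
    def __init__(self, dim=64, layers=3, kernel=5):
        super().__init__()
        blocks = []
        for _ in range(layers):
            blocks += [nn.Conv1d(dim, dim, kernel, padding=kernel // 2), nn.GELU()]
        self.net = nn.Sequential(*blocks)

    def forward(self, x):                       # x: (batch, seq, dim)
        return self.net(x.transpose(1, 2)).transpose(1, 2)

def training_step(teacher, student_encoder, decoder, x, num_masks=4, mask_ratio=0.6):
    batch, seq, dim = x.shape
    with torch.no_grad():                       # teacher target computed once per sample...
        target = teacher(x)                     # (batch, seq, dim)

    loss = 0.0
    for _ in range(num_masks):                  # ...and reused for several masked versions
        keep = int(seq * (1 - mask_ratio))
        idx = torch.rand(batch, seq).argsort(dim=1)[:, :keep]   # random visible positions
        visible = torch.gather(x, 1, idx.unsqueeze(-1).expand(-1, -1, dim))

        encoded = student_encoder(visible)      # encoder never sees the hidden positions
        full = torch.zeros(batch, seq, dim)     # scatter encoder output back, zeros elsewhere
        full.scatter_(1, idx.unsqueeze(-1).expand(-1, -1, dim), encoded)

        pred = decoder(full)                    # conv decoder fills in the masked positions
        masked = torch.ones(batch, seq, dtype=torch.bool)
        masked[torch.arange(batch).unsqueeze(1), idx] = False
        loss = loss + nn.functional.mse_loss(pred[masked], target[masked])
    return loss / num_masks

# Toy usage with stand-in networks; the real model uses Transformer encoders.
dim = 64
teacher = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
student = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
decoder = ConvDecoder(dim)
print(training_step(teacher, student, decoder, torch.randn(8, 16, dim)))
```

The sketch only captures where the compute savings come from: the teacher pass is amortized over several masked versions, and the student encoder's cost scales with the number of visible positions rather than the full input length.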
To assess the performance of data2vec 2.0, Meta tested it on computer vision, speech, and text tasks, considering both the final accuracy and the time needed to pre-train the model, measured on the same hardware.
For the vision task, the benchmark was ImageNet-1K image classification. data2vec 2.0 equals the accuracy of MAE (masked autoencoders) but is 16x faster.
data2vec 2.0 performances for computer vision task
To test data2vec 2.0 on speech, the LibriSpeech speech recognition benchmark was used. data2vec 2.0 performed more than 11 times faster than wav2vec 2.0 (a self-supervised speech recognition model previously developed by Meta), with similar accuracy.
data2vec 2.0 performances for speech task
For the NLP task, the GLUE benchmark was used to evaluate data2vec 2.0. It achieves the same accuracy as RoBERTa (a reimplementation of BERT, the Transformer-based pre-trained NLP model developed by Google) in half the training time.
data2vec 2.0 performances for NLP task