
AI Listens by Seeing as Well


Meta AI released a self-supervised speech-recognition model that also uses video and achieves up to 75% better accuracy than current state-of-the-art models trained on the same amount of labeled data.

This new model, Audio-Visual Hidden Unit BERT (AV-HuBERT), uses audiovisual features to improve on models that rely on audio alone. The visual features are based on lip reading, similar to what humans do: watching lip movements helps filter out background noise while someone is speaking, which is an extremely hard task using audio only.

To generate the input data, the first pre-processing step extracts audio and video features from the source video and groups them into clusters using k-means. The audiovisual frames are the input to the AV-HuBERT model, and the cluster IDs are the prediction targets.

Figure 1: Clustering video and audio features
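The clustering step can be sketched as follows. This is a minimal illustration, not the paper's actual pipeline: the feature dimensions, cluster count, and the toy k-means loop are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-frame features; the real model derives audio features
# (e.g. MFCCs) and lip-region visual features, with different dimensions.
n_frames, n_clusters = 1000, 20
audio_feats = rng.normal(size=(n_frames, 39))
visual_feats = rng.normal(size=(n_frames, 64))

# Concatenate audio and visual features frame by frame.
av_feats = np.concatenate([audio_feats, visual_feats], axis=1)

# Minimal k-means: assign each frame to its nearest centroid, then
# move each centroid to the mean of its assigned frames.
centroids = av_feats[rng.choice(n_frames, n_clusters, replace=False)]
for _ in range(10):
    dists = np.linalg.norm(av_feats[:, None, :] - centroids[None, :, :], axis=2)
    cluster_ids = dists.argmin(axis=1)
    for k in range(n_clusters):
        members = av_feats[cluster_ids == k]
        if len(members):
            centroids[k] = members.mean(axis=0)

# Each frame's cluster ID becomes the pseudo-label the model learns to predict.
print(cluster_ids[:10])
```

The key point is that no transcriptions are needed at this stage: the cluster IDs act as self-generated labels for pre-training.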

The next step is similar to BERT, the self-supervised language model: masks are applied to spans of the audio and visual streams so that the model can learn to predict the missing content from context. The features are fused into contextualized representations using transformers, and the loss function is computed on the frames where the audio or visual stream is masked.
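The span-masking idea can be illustrated with a small sketch. The sequence length, span length, and number of spans below are made-up values for demonstration; the real model uses a learned mask embedding rather than zeros.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sequence of fused audio-visual frames (dimensions are illustrative).
n_frames, dim = 50, 8
frames = rng.normal(size=(n_frames, dim))

# Mask a few contiguous spans, as BERT-style pretraining does for tokens.
mask = np.zeros(n_frames, dtype=bool)
span_len = 5
for start in rng.choice(n_frames - span_len, size=3, replace=False):
    mask[start:start + span_len] = True

masked_frames = frames.copy()
masked_frames[mask] = 0.0  # stand-in for a learned mask embedding

# After the transformer predicts cluster IDs for every frame, the loss is
# computed only on the masked positions, conceptually:
#   loss = cross_entropy(predictions[mask], cluster_ids[mask])
print(mask.sum())  # frames the model must reconstruct from context
```

Restricting the loss to masked frames forces the model to use the surrounding audio-visual context, which is what makes the learned representations useful downstream.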

Meta AI released the code implementing this model on GitHub.

To load a pre-trained model, the following snippet from the repository can be used:

import fairseq
# hubert_pretraining and hubert come from the av_hubert repository and
# register the AV-HuBERT task and model architecture with fairseq
import hubert_pretraining, hubert
ckpt_path = "/path/to/the/"
models, cfg, task = fairseq.checkpoint_utils.load_model_ensemble_and_task([ckpt_path])
model = models[0]

Figure 2: AV-HuBERT model representation

This framework can be useful for detecting deepfakes and generating more realistic avatars in AR. By syncing image and speech, the model can help generate avatars that speak coherently with their face movements. Text-to-image generation is still a hot topic in the AI research community. In addition, the model can recognize speech more effectively in noisy environments. Another promising application is lip syncing across many languages with few resources, since the model needs less labeled data to train.
