
Facebook Open-Sources Multilingual Speech Recognition Deep-Learning Model


Facebook AI Research (FAIR) open-sourced Cross-Lingual Speech Recognition (XSLR), a multilingual speech recognition AI model. XSLR is trained on 53 languages and outperforms existing systems when evaluated on common benchmarks.

The model architecture and related experiments were described in a paper published on arXiv. XSLR is built on the wav2vec architecture and uses transfer learning to improve performance on "low-resource" languages. The system is pre-trained on three public datasets containing 53 languages. When evaluated on the CommonVoice and BABEL benchmarks, the model outperforms existing baselines. The system can also learn languages not seen during pre-training, outperforming monolingual models specifically trained on those languages. According to lead author Alexis Conneau,

Our [goal is] to enable few-shot learning for languages that are actually low-resource, leveraging unsupervised data from higher-resource languages.

Training a deep-learning model requires a large dataset of labeled examples; for speech recognition, this means audio data with corresponding text transcripts. Acquiring such a dataset can be challenging for many non-European languages, often termed low-resource languages because of the lack of readily available data. In this situation, researchers turn to transfer learning: fine-tuning models that have been pre-trained on a large, publicly available dataset. Facebook and others have applied this strategy to neural machine translation, using popular Transformer-based natural-language models such as BERT.
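The transfer-learning recipe can be sketched in a few lines: freeze a pre-trained feature extractor and train only a small task head on the scarce labeled data. The following NumPy illustration is hypothetical, not Facebook's code; the random-projection "encoder" stands in for a real pre-trained network:

```python
import numpy as np

rng = np.random.default_rng(0)

def pretrained_encoder(x):
    """Stand-in for a frozen pre-trained encoder (hypothetical):
    a fixed projection from raw inputs to feature vectors."""
    W = np.random.default_rng(42).normal(size=(x.shape[1], 8))
    return np.tanh(x @ W)

# Tiny labeled "low-resource" dataset: 20 examples, 4 raw features.
X = rng.normal(size=(20, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

# Fine-tune only a logistic-regression head on the frozen features.
feats = pretrained_encoder(X)
w = np.zeros(feats.shape[1])
for _ in range(500):
    p = 1 / (1 + np.exp(-(feats @ w)))      # head predictions
    w -= 0.5 * feats.T @ (p - y) / len(y)   # gradient step on head only

accuracy = ((feats @ w > 0) == y).mean()
print(f"head-only fine-tuning accuracy: {accuracy:.2f}")
```

In practice the "head" is larger and the encoder may also be partially unfrozen, but the division of labor is the same: the expensive representation is learned once on abundant data, and the scarce labels train only the final mapping.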

FAIR published the original wav2vec deep-learning model for automated speech recognition (ASR) in 2019 and the updated wav2vec 2.0 model in 2020. The model uses a convolutional neural network (CNN) feature encoder to convert audio into latent speech representations, which are quantized and then fed into a Transformer; the Transformer converts sequences of speech representations into text. For the pre-training phase, a certain percentage of the latent representations are masked, and the network learns to predict the masked values; this is analogous to the masked-language-model training used in BERT.
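The masking step can be sketched in a few lines of NumPy. This is an illustration only, not Facebook's code: a random subset of latent time steps is replaced by a mask vector, and the Transformer's pre-training task is to recover what was hidden (in wav2vec 2.0 the mask embedding is learned rather than zero, and the objective is contrastive over quantized targets):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical latent speech representations: 50 time steps x 16 dims,
# as would be produced by the CNN feature encoder.
latents = rng.normal(size=(50, 16))

mask_prob = 0.15                   # illustrative fraction of steps to mask
mask_embedding = np.zeros(16)      # stand-in for a learned mask vector

masked = latents.copy()
mask_idx = rng.random(50) < mask_prob
masked[mask_idx] = mask_embedding

# The Transformer sees `masked` and is trained to identify the true
# latent at each masked position.
print(f"masked {mask_idx.sum()} of {len(latents)} time steps")
```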

XSLR uses the same architecture as wav2vec 2.0. It is pre-trained using multilingual batches of audio data drawn from three datasets: CommonVoice, a corpus of read speech; BABEL, a corpus of telephone conversations; and Multilingual LibriSpeech (MLS), a corpus of audiobooks. The full dataset contains over 56k hours of speech in 53 languages. The fine-tuned model is evaluated against held-out datasets from CommonVoice and BABEL. The team trained several models of varying size; the largest model contained 24 Transformer blocks of dimension 1,024 with 16 attention heads.
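As a sanity check on the scale, the weight matrices of such a stack can be estimated with the standard Transformer accounting: roughly 12·d² weights per block (4·d² for the Q, K, V, and output projections plus 8·d² for a feed-forward layer of width 4·d). The 4·d feed-forward width and the omission of the CNN encoder, quantizer, embeddings, and biases are assumptions of this sketch, not figures from the paper:

```python
def transformer_params_estimate(n_blocks: int, d_model: int, ffn_mult: int = 4) -> int:
    """Rough weight count for a stack of Transformer blocks:
    4*d^2 attention projections + 2*ffn_mult*d^2 feed-forward
    matrices per block. Biases, layer norms, embeddings, and the
    CNN feature encoder are all ignored."""
    per_block = 4 * d_model**2 + 2 * ffn_mult * d_model**2
    return n_blocks * per_block

# Largest reported configuration: 24 blocks of dimension 1,024.
print(f"{transformer_params_estimate(24, 1024) / 1e6:.0f}M parameters")
```

The estimate lands around 300M parameters for the attention and feed-forward weights alone. Note that the number of attention heads only partitions the projection matrices; it does not change the parameter count.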

On low-resource languages, even those used only in fine-tuning and not in pre-training, the large XSLR model outperforms baseline models. Low-resource languages especially benefit from pre-training on related languages; for example, performance on Italian improves when additional Spanish-language data is included in pre-training. The researchers also noted that XSLR performs worse than the baselines on high-resource languages due to interference, the sharing of model capacity across languages. This interference can be mitigated by increasing model capacity and adjusting the sampling of languages during pre-training.
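"Adjusting the sampling of languages" typically means drawing pre-training batches from each language with a tempered probability, so that high-resource languages are down-weighted. A sketch of such temperature-based sampling follows; the exponent alpha and the per-language hours are illustrative values, not numbers from the paper:

```python
import numpy as np

def language_sampling_probs(hours: dict, alpha: float) -> dict:
    """Tempered sampling distribution over languages:
    p_l is proportional to (n_l / N) ** alpha. alpha=1 keeps the
    natural data distribution; smaller alpha upsamples
    low-resource languages toward uniform."""
    n = np.array(list(hours.values()), dtype=float)
    p = (n / n.sum()) ** alpha
    p /= p.sum()
    return dict(zip(hours.keys(), p))

# Illustrative (made-up) hours of audio per language.
hours = {"en": 40000, "es": 5000, "it": 800, "sw": 50}
natural = language_sampling_probs(hours, alpha=1.0)
tempered = language_sampling_probs(hours, alpha=0.5)
print({lang: round(prob, 3) for lang, prob in tempered.items()})
```

With alpha below 1, the smallest language's share of each batch rises at the expense of the largest, trading some high-resource performance for better low-resource transfer.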

In response to a Twitter question about fine-tuning the model, Conneau said:

[F]ine-tuning on 10-minutes or 1-hour of annotated data...leads to good performance for character/phoneme recognition...Then the more supervision the better the performance.

The wav2vec and XSLR models and code are available on GitHub.
