Facebook AI Research (FAIR) open-sourced XLS-R, a cross-lingual speech recognition (SR) AI model. XLS-R is trained on 436K hours of speech audio from 128 languages, an order of magnitude more than the largest previous models, and outperforms the previous state-of-the-art on several downstream SR and translation tasks.
FAIR announced the release on their blog. XLS-R is based on wav2vec 2.0, a self-supervised approach to learning representations of speech audio. The model is trained on several publicly available audio datasets, including VoxPopuli, a recently released corpus containing audio recordings of the European Parliament. Overall, the model was trained on 128 European, Asian, and African languages, including 88 low-resource languages with less than 100 hours of audio data each. XLS-R achieved new state-of-the-art performance on several benchmarks, including speech recognition on CommonVoice, VoxPopuli, and several languages of BABEL, language identification on VoxLingua107, and speech translation to English on CoVoST-2. According to the FAIR team,
We trust this [research] will enable machine learning applications that better understand all human speech and catalyze further research to make speech technology more accessible across the globe, especially among underserved populations. We will continue to improve our algorithms by developing new ways to learn from less supervision and scale our approach to the more than 7,000 languages around the world.
Training a deep-learning speech-recognition model requires a large dataset containing audio data with corresponding text transcripts. Acquiring such a dataset can be challenging for low-resource languages because of the lack of readily available data. In these cases, researchers turn to transfer learning: fine-tuning models that have been pre-trained on a large, publicly available dataset. FAIR's previous work in this area resulted in XLSR-53, a 300M-parameter model trained on 50K hours of audio data in 53 languages.
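As a rough illustration of that fine-tuning recipe (not FAIR's exact setup), the sketch below loads a pre-trained XLS-R checkpoint from the Hugging Face Hub and attaches a fresh CTC head sized for a new language's character vocabulary; the checkpoint name, vocabulary size, and pad-token id are assumptions for illustration.

```python
# Minimal transfer-learning sketch: start from a self-supervised XLS-R checkpoint
# and add a new CTC head for a low-resource target language.
# Assumptions: Hugging Face `transformers`, the "facebook/wav2vec2-xls-r-300m"
# checkpoint, and a hypothetical 40-character target vocabulary.
from transformers import Wav2Vec2ForCTC

TARGET_VOCAB_SIZE = 40  # hypothetical character-set size for the target language

model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-xls-r-300m",      # pre-trained encoder; the CTC head is newly initialized
    vocab_size=TARGET_VOCAB_SIZE,        # size the new output layer for the target alphabet
    ctc_loss_reduction="mean",
    pad_token_id=TARGET_VOCAB_SIZE - 1,  # placeholder; must match the target tokenizer's pad token
)

# Common practice with small labeled sets: freeze the CNN feature encoder and
# fine-tune only the Transformer layers and the new CTC head.
model.freeze_feature_encoder()
```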
Image source: https://arxiv.org/abs/2111.09296
XLS-R is based on the wav2vec 2.0 architecture, which uses a convolutional neural-network (CNN) feature encoder to convert raw audio into latent speech representations; these are fed into a Transformer, while a quantized version of the same representations serves as the training target. During training, spans of the input are masked, and the model's objective is to identify the quantized representation of the masked input. The trained model acts as an encoder of audio input; for downstream tasks, the encoder's output can be sent to a linear layer for speech classification and recognition, or to a decoder for translation.
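To make the encoder/head split concrete, here is a minimal sketch that extracts latent representations with the Transformer encoder and pools them for a classification-style head; it assumes the Hugging Face `transformers` implementation of wav2vec 2.0 and the "facebook/wav2vec2-xls-r-300m" checkpoint, and the linear classifier is illustrative rather than the exact head used in the paper.

```python
# Sketch: XLS-R as an audio encoder whose output feeds a simple linear head.
# Assumes Hugging Face `transformers`, `torch`, and `numpy`; the classifier is illustrative.
import numpy as np
import torch
from transformers import AutoFeatureExtractor, Wav2Vec2Model

checkpoint = "facebook/wav2vec2-xls-r-300m"
feature_extractor = AutoFeatureExtractor.from_pretrained(checkpoint)
encoder = Wav2Vec2Model.from_pretrained(checkpoint)

# One second of silent 16 kHz audio stands in for a real recording.
waveform = np.zeros(16000, dtype=np.float32)
inputs = feature_extractor(waveform, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    # (batch, frames, hidden_size) contextual speech representations from the Transformer
    hidden_states = encoder(**inputs).last_hidden_state

# For classification-style tasks (e.g. language or speaker identification), mean-pool
# the frames and apply a linear layer; ASR instead uses a per-frame linear layer with CTC.
num_classes = 107  # e.g. the VoxLingua107 language-identification label set
classifier = torch.nn.Linear(encoder.config.hidden_size, num_classes)
logits = classifier(hidden_states.mean(dim=1))
print(logits.shape)  # torch.Size([1, 107])
```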
The FAIR team compared the performance of XLS-R with baseline models on several benchmark tasks, including automatic speech translation (AST), automatic speech recognition (ASR), language identification, and speaker identification. For the AST task of translating from other languages to English, the model outperformed previous work by an average of 7.4 BLEU. In translating from English, XLS-R performed similarly to the baselines; the authors speculate this is "likely because English data dominates the training corpus" of previous models. On BABEL, the hardest task according to the authors, XLS-R outperformed the baselines, "even on languages for which XLS-R does not add any pretraining data," showing the benefits of cross-lingual transfer. Overall, the authors found that XLS-R "performs best for low-resource and mid-resource languages."
In a Twitter discussion about the work, a reader asked co-author Alexis Conneau about approaches to ensure XLS-R's safety regarding bias. Conneau replied,
Depends on the downstream tasks and the biases you have in mind. At pre-training time, you can filter the unlabeled data. At fine-tuning time, there's a ton of work on controlling generation (ASR/AST), it's hard to do a comprehensive summary.
The XLS-R code is available on GitHub, and the pre-trained models are available from the Hugging Face model repository.
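For readers who want to try the released models, a minimal inference sketch follows; it assumes the Hugging Face `transformers` pipeline API, and the model name is a hypothetical placeholder for any XLS-R checkpoint that has already been fine-tuned for ASR on the Hub.

```python
# Sketch: transcribing an audio file with a fine-tuned XLS-R checkpoint.
# "your-org/wav2vec2-xls-r-300m-finetuned-asr" is a hypothetical placeholder;
# substitute any CTC-fine-tuned XLS-R model from the Hugging Face Hub.
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="your-org/wav2vec2-xls-r-300m-finetuned-asr",
)

# The pipeline accepts a path to an audio file (or a raw waveform array)
# and returns a dictionary containing the transcript.
result = asr("sample_recording.wav")
print(result["text"])
```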