Google Research announced Universal Speech Model (USM), a 2B-parameter automatic speech recognition (ASR) model trained on over 12M hours of speech audio. USM can recognize speech in over 100 languages, including low-resource languages, and achieves new state-of-the-art performance on several benchmarks.
The model and experiments were described in a paper published on arXiv. USM uses a Conformer-based encoder as its backbone, trained with unsupervised learning on unlabeled audio from YouTube videos spanning over 300 languages. The Google team also introduced a chunk-wise attention mechanism into the network architecture to address the quality degradation that Conformer models commonly exhibit on long-form audio input (a minimal sketch follows the quote below). Instead of fine-tuning the entire model for downstream tasks, the team added small adapter network units to the frozen encoder for ASR and automatic speech translation (AST) tasks. According to Google:
USM...can perform automatic speech recognition (ASR) on widely-spoken languages like English and Mandarin, but also languages like Punjabi, Assamese, Santhali, Balinese, Shona, Malagasy, Luganda, Luo, Bambara, Soga, Maninka, Xhosa, Akan, Lingala, Chichewa, Nkore, and Nzema, to name a few. Some of these languages are spoken by fewer than twenty million people, making it very hard to find the necessary training data. We demonstrate that utilizing a large unlabeled multilingual dataset to pre-train the encoder of our model and fine-tuning on a smaller set of labeled data enables us to recognize these under-represented languages. Moreover, our model training process is effective for adapting to new languages and data.
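Chunk-wise attention restricts each audio frame to attending only within its own fixed-size chunk, so the model behaves the same on long-form input as on the short segments it saw during training. The following is a minimal, illustrative sketch, not Google's implementation; the function names and `chunk_size` parameter are assumptions:

```python
import torch

def chunk_wise_attention_mask(seq_len: int, chunk_size: int) -> torch.Tensor:
    # Block-diagonal mask: True where attention is allowed, i.e. where the
    # query frame and key frame fall in the same fixed-size chunk.
    chunk_ids = torch.arange(seq_len) // chunk_size
    return chunk_ids[:, None] == chunk_ids[None, :]

def chunked_self_attention(q, k, v, chunk_size):
    # q, k, v: (seq_len, dim). Standard scaled dot-product attention,
    # but cross-chunk scores are masked out so long-form inference matches
    # the short-segment behavior seen during training.
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    mask = chunk_wise_attention_mask(q.shape[0], chunk_size)
    scores = scores.masked_fill(~mask, float("-inf"))
    return scores.softmax(dim=-1) @ v
```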
USM was trained in three stages on three types of data: audio-only, text-only, and paired audio-text. The first stage used unsupervised learning to train the Conformer encoder backbone on audio-only data: 12M hours of YouTube audio covering over 300 languages, plus public datasets containing another 429k hours of speech in 51 languages. The second stage used Multi-Objective Supervised pre-Training (MOST) to further train the encoder on all three types of data. The final stage fine-tuned a task-specific model; here Google achieved better results by freezing the encoder and training a small task-specific adapter instead of fine-tuning the full model.
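A residual adapter is a small bottleneck network inserted into each encoder layer; only its parameters (and the task head) receive gradient updates while the pre-trained encoder stays frozen. A hedged sketch of the idea, with hypothetical dimensions and placeholder loading code:

```python
import torch
from torch import nn

class ResidualAdapter(nn.Module):
    """Small bottleneck module inserted into a frozen encoder layer; only
    adapter (and task-head) parameters are updated during fine-tuning."""
    def __init__(self, d_model: int, bottleneck: int = 64):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.down = nn.Linear(d_model, bottleneck)   # project down
        self.up = nn.Linear(bottleneck, d_model)     # project back up

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual connection keeps the frozen encoder's representation
        # largely intact while the adapter learns a task-specific correction.
        return x + self.up(torch.relu(self.down(self.norm(x))))

# Hypothetical fine-tuning setup (encoder loading and task_head are placeholders):
# encoder = load_pretrained_conformer_encoder()          # not a real API
# for p in encoder.parameters():
#     p.requires_grad = False                            # freeze the backbone
# adapters = nn.ModuleList(ResidualAdapter(1024) for _ in encoder.layers)
# optimizer = torch.optim.Adam(
#     list(adapters.parameters()) + list(task_head.parameters()), lr=1e-4)
```

Because the adapter adds only a small fraction of the encoder's parameter count, a separate adapter can be trained per task or language without duplicating the 2B-parameter backbone.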
USM Training Pipeline. Image Source: https://arxiv.org/abs/2303.01037
The Google team evaluated USM on several ASR and AST benchmarks. USM achieved state-of-the-art performance on multiple benchmarks, including SpeechStew, FLEURS, and CoVoST. USM also achieved a word error rate (WER) of less than 30% on a YouTube captioning dataset with speech from 73 languages; according to Google, "no published model can successfully decode all 73 languages" from this dataset. The researchers also compared USM to OpenAI's Whisper model; USM outperformed Whisper on the "18 languages that Whisper can successfully decode with lower than 40% WER."
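WER, the metric cited above, is the word-level edit distance (substitutions, deletions, and insertions) between the model's output and the reference transcript, divided by the number of reference words. A simple illustrative computation:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count,
    computed with word-level edit distance."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # delete all remaining reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # insert all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))  # ~0.167
```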
In a discussion about USM on Hacker News, several users praised its performance on low-resource languages, given the scarcity of available training data. One noted:
What this implies is that the core encoder model (trained on popular languages with a ton of data) does a really good job at learning the generalized basics of any language (period). Then, they enter a comparatively minuscule amount of labeled data from the small languages at the end (supervised fine tuning - since parameters are tuned iteratively, the latest training data has the biggest impact on the final performance). The model has enough of a general "understanding" of grammar across languages that it can fill in the gaps it doesn’t get from that small labeled set.
USM is currently available only via a private hosted inference API on Google Cloud Platform. Users must request access, with priority given to "researchers and institutions."