Google's Universal Speech Model Performs Speech Recognition on Hundreds of Languages


Google Research announced the Universal Speech Model (USM), a 2B-parameter automatic speech recognition (ASR) model trained on over 12M hours of speech audio. USM can recognize speech in over 100 languages, including low-resource languages, and achieves new state-of-the-art performance on several benchmarks.

The model and experiments were described in a paper published on arXiv. USM uses a Conformer-based encoder as its backbone, trained with unsupervised learning on unlabeled audio from YouTube videos spanning over 300 languages. The Google team also introduced a new chunk-wise attention mechanism into the network architecture to solve a quality-degradation problem common with Conformer models on long-form audio input. Instead of fine-tuning the entire model for downstream tasks, the team added small adapter network units to the frozen encoder for ASR and automatic speech translation (AST) tasks. According to Google:

USM...can perform automatic speech recognition (ASR) on widely-spoken languages like English and Mandarin, but also languages like Punjabi, Assamese, Santhali, Balinese, Shona, Malagasy, Luganda, Luo, Bambara, Soga, Maninka, Xhosa, Akan, Lingala, Chichewa, Nkore, and Nzema, to name a few. Some of these languages are spoken by fewer than twenty million people, making it very hard to find the necessary training data. We demonstrate that utilizing a large unlabeled multilingual dataset to pre-train the encoder of our model and fine-tuning on a smaller set of labeled data enables us to recognize these under-represented languages. Moreover, our model training process is effective for adapting to new languages and data.
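The chunk-wise attention mentioned above restricts each audio frame to attend only within a fixed-size chunk, which keeps attention well-behaved on long-form input. The pattern can be sketched as a block-diagonal attention mask; the chunk size here is illustrative, not a value from the paper:

```python
import numpy as np

def chunk_attention_mask(seq_len: int, chunk_size: int) -> np.ndarray:
    """Boolean mask where frame i may attend to frame j only when both
    frames fall inside the same fixed-size chunk (block-diagonal)."""
    chunk_ids = np.arange(seq_len) // chunk_size
    return chunk_ids[:, None] == chunk_ids[None, :]

# 8 frames split into two chunks of 4: two 4x4 blocks on the diagonal
mask = chunk_attention_mask(seq_len=8, chunk_size=4)
```

This only illustrates the attention pattern; the paper's full mechanism involves additional per-chunk processing details.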

USM was trained in three stages on three types of data: audio-only, text-only, and paired audio-text. The first stage used unsupervised learning to train the Conformer encoder backbone on audio-only data: 12M hours of YouTube audio from 300 languages, plus public datasets containing another 429k hours of speech in 51 languages. The next stage used Multi-Objective Supervised pre-Training (MOST) to further train the encoder on all three types of data. The final stage fine-tuned a task-specific model; here, Google achieved better results by freezing the encoder and fine-tuning a small task-specific adapter instead of the full model.
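Adapter tuning of this kind is commonly implemented as a small residual bottleneck module attached to a frozen encoder layer, so only the adapter's weights are updated during fine-tuning. A minimal sketch follows; the dimensions and the zero-initialized up-projection are illustrative assumptions, not details published for USM:

```python
import numpy as np

rng = np.random.default_rng(0)

class BottleneckAdapter:
    """Residual bottleneck adapter: down-project, nonlinearity, up-project.

    Only these weights would be trained; the encoder itself stays frozen.
    Dimensions are hypothetical, chosen for illustration.
    """
    def __init__(self, d_model: int, d_bottleneck: int):
        self.w_down = rng.normal(0.0, 0.02, (d_model, d_bottleneck))
        # Zero-init up-projection: the adapter starts as an identity map,
        # so fine-tuning begins from the pre-trained encoder's behavior.
        self.w_up = np.zeros((d_bottleneck, d_model))

    def __call__(self, x: np.ndarray) -> np.ndarray:
        h = np.maximum(x @ self.w_down, 0.0)  # ReLU bottleneck
        return x + h @ self.w_up              # residual connection

    def num_params(self) -> int:
        return self.w_down.size + self.w_up.size

adapter = BottleneckAdapter(d_model=512, d_bottleneck=16)
frames = rng.normal(size=(100, 512))  # 100 frames of frozen-encoder output
out = adapter(frames)
```

With these toy dimensions the adapter has only 2 × 512 × 16 = 16,384 trainable parameters, a vanishingly small fraction of a 2B-parameter model, which is what makes per-task fine-tuning cheap.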


USM Training Pipeline. Image Source:

The Google team evaluated USM on several ASR and AST benchmarks. USM achieved state-of-the-art performance on multiple benchmarks, including SpeechStew, FLEURS, and CoVoST. USM also achieved a word error rate (WER) of less than 30% on a YouTube captioning dataset with speech from 73 languages; according to Google, "no published model can successfully decode all 73 languages" from this dataset. The researchers also compared USM to OpenAI's Whisper model; USM outperformed Whisper on the "18 languages that Whisper can successfully decode with lower than 40% WER."
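WER, the metric behind these comparisons, is the word-level edit distance (substitutions, insertions, and deletions) between the hypothesis transcript and the reference, divided by the number of reference words. A minimal implementation:

```python
def wer(reference: list[str], hypothesis: list[str]) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    m, n = len(reference), len(hypothesis)
    # dp[i][j] = edit distance between reference[:i] and hypothesis[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[m][n] / max(m, 1)
```

For example, `wer("the quick brown fox".split(), "the quik brown".split())` counts one substitution and one deletion against four reference words, giving a WER of 0.5.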

In a discussion about USM on Hacker News, several users hailed its performance on low-resource languages, given the scarcity of training data available. One noted:

What this implies is that the core encoder model (trained on popular languages with a ton of data) does a really good job at learning the generalized basics of any language (period). Then, they enter a comparatively minuscule amount of labeled data from the small languages at the end (supervised fine tuning - since parameters are tuned iteratively, the latest training data has the biggest impact on the final performance). The model has enough of a general "understanding" of grammar across languages that it can fill in the gaps it doesn’t get from that small labeled set.

USM is currently only available via a private hosted inference API on Google Cloud Platform. Users must request access, with priority given to "researchers and institutions."
