Meta's Open-Source Massively Multilingual Speech AI Handles over 1,100 Languages

Meta AI open-sourced the Massively Multilingual Speech (MMS) model, which supports automatic speech recognition (ASR) and text-to-speech synthesis (TTS) in over 1,100 languages and language identification (LID) in over 4,000 languages. MMS can outperform existing models and covers nearly 10x the number of languages.

MMS is based on the wav2vec model and is pre-trained on a dataset containing 491K hours of speech in 1,406 languages, drawn from existing cross-lingual datasets as well as a new dataset of 9,345 hours of unlabelled recordings of religious text readings, songs, and other speech in 3,860 languages. To fine-tune the ASR and TTS models, Meta used recordings of Bible readings in 1,107 languages, which provided labeled cross-lingual speech data. The fine-tuned MMS models can perform ASR and TTS in those 1,107 languages as well as LID in 4,017 languages. According to Meta:

"Many of the world’s languages are in danger of disappearing, and the limitations of current speech recognition and speech generation technology will only accelerate this trend. We envision a world where technology has the opposite effect, encouraging people to keep their languages alive since they can access information and use technology by speaking in their preferred language."

Training speech processing AI models using supervised learning requires large datasets of labeled speech data, usually audio recordings paired with transcripts. For many languages, such as English, such datasets are readily available; however, for low-resource languages with very few native speakers, collecting a large dataset might be impossible. Meta's previous research on XLS-R and NLLB showed that a single cross-lingual model combined with self-supervised pre-training can, after fine-tuning on small amounts of data, perform well on approximately 100 languages, even low-resource ones. More recently, InfoQ covered OpenAI's Whisper and Google's USM, which also support around 100 languages each.

To scale their model to handle thousands of languages, Meta needed an audio dataset with more languages. The team chose to use audio recordings of the Christian New Testament; this provided labeled audio data in over 1,000 languages, with an average of 32 hours per language. Although each language's recording featured a single speaker, usually male, the researchers found that this introduced very little bias in the final models: the models performed similarly on female and male benchmark audio. They also did not find any bias due to the model being trained largely on religious texts.

Meta's chief AI scientist Yann LeCun called out several highlights of MMS on Twitter, noting in particular that it has "half the word error rate of Whisper." Several users pointed out that the model's usefulness was limited by its non-commercial license. Another user pointed out other drawbacks and questioned whether it was indeed better than Whisper:

"In my testing, it performs worse than Whisper for transcription to text, mis-hearing words and not hearing implied punctuation. Also it's about 10x slower than Faster-Whisper. [MMS] uses 20 GB of RAM, while Whisper uses about 1 GB. For these reasons and others this is fairly impractical for people to use for a real application. Also note that you need to specify the language being spoken while Whisper will identify it for you. Hope these issues get resolved over time and OpenAI has a competitor eventually in this area."
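For readers unfamiliar with the metric behind these comparisons, word error rate (WER) is commonly computed as the word-level edit distance between a reference transcript and the model's output, divided by the number of words in the reference. The following is a minimal sketch of that calculation, not the evaluation code used by Meta or OpenAI; the function name and the sample sentences are illustrative assumptions, and real benchmarks usually normalize casing and punctuation first, which this sketch does not do.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Illustrative helper (hypothetical name); no text normalization is applied.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1        # reference word missing from output
            insertion = curr[j - 1] + 1   # extra word in the output
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Made-up sentences, for illustration only.
reference = "the cat sat on the mat"
print(word_error_rate(reference, "the cat sat on a mat"))  # one substituted word
print(word_error_rate(reference, "the cat on the mat"))    # one dropped word
```

Under this definition, a claim of "half the word error rate" means the model makes roughly half as many word-level substitutions, deletions, and insertions per reference word on the benchmark in question. It says nothing about punctuation, speed, or memory use, which are the points the user above raises.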

The MMS code and pretrained model files are available on GitHub. A list of the supported languages for each task (ASR, TTS, and LID) is available online.

About the Author

Anthony Alford
