Meta's Voicebox Outperforms State-of-the-Art Models on Speech Synthesis

Meta recently announced Voicebox, a speech generation model that can perform text-to-speech (TTS) synthesis in six languages, as well as edit and remove noise from speech recordings. Voicebox is trained on over 50k hours of audio data and outperforms previous state-of-the-art models on several TTS benchmarks.

Unlike many TTS models, which are autoregressive, Voicebox is based on a newer technique called flow matching. The model is trained to predict masked sections of an audio input, which allows it to perform infilling tasks, such as removing environmental noise from speech recordings or correcting mispronounced words. It can also perform tasks it was not specifically trained to do, such as cross-lingual style transfer. Voicebox is trained on audiobook recordings paired with their text, recorded in English, French, German, Spanish, Polish, and Portuguese. Although Meta recently open-sourced another multilingual TTS model, the researchers have not done so with Voicebox, citing safety concerns:

There are many exciting use cases for generative speech models, but because of the potential risks of misuse, we are not making the Voicebox model or code publicly available at this time. While we believe it is important to be open with the AI community and to share our research to advance the state of the art in AI, it’s also necessary to strike the right balance between openness and responsibility. With these considerations, today we are sharing audio samples and a research paper detailing the approach and results we have achieved.
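The flow-matching objective mentioned above trains a network to regress a vector field that transports noise samples toward data samples along simple probability paths. Below is a minimal NumPy sketch of the optimal-transport conditional flow-matching training target from the flow-matching literature that the Voicebox paper builds on; the function and variable names are illustrative, not from Meta's code:

```python
import numpy as np

def cfm_sample_and_target(x1, sigma_min=1e-4, rng=None):
    """Build one conditional flow-matching training pair.

    x1: a data sample (e.g. a frame of mel-spectrogram features).
    Returns (t, x_t, u_t): a random time, the interpolated point on
    the conditional path from noise to data, and the target velocity
    the network should regress to at that point.
    """
    if rng is None:
        rng = np.random.default_rng()
    x0 = rng.standard_normal(x1.shape)                # noise sample
    t = rng.uniform()                                 # random time in [0, 1]
    x_t = (1 - (1 - sigma_min) * t) * x0 + t * x1     # point on the path
    u_t = x1 - (1 - sigma_min) * x0                   # target velocity field
    return t, x_t, u_t
```

In training, a network v(x_t, t, conditioning) would be fit with a mean-squared-error loss against u_t; at inference, integrating the learned vector field from t=0 to t=1 turns noise into audio features.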

Similar to large language models (LLMs), Voicebox was trained to predict a masked section of its input. In the case of Voicebox, the input includes a segment of speech audio and its text transcript. A portion of the audio is masked, but the text is not; thus, given the text, the model learns to synthesize the audio of the masked words in a way that matches the surrounding audio.
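In simplified terms, a training example for this objective might be constructed as follows. This is an illustrative NumPy sketch, not Meta's actual data pipeline; the frame representation and masking policy are assumptions:

```python
import numpy as np

def make_infilling_example(audio_frames, rng=None, min_frac=0.1, max_frac=0.5):
    """Mask a contiguous span of audio frames. The text transcript stays
    fully visible, so the model must synthesize the hidden span in a way
    consistent with both the text and the surrounding audio context.

    audio_frames: (num_frames, feature_dim) array, e.g. mel features.
    Returns (masked_frames, mask), where mask is True on hidden frames.
    """
    if rng is None:
        rng = np.random.default_rng()
    n = len(audio_frames)
    span = max(1, int(n * rng.uniform(min_frac, max_frac)))
    start = rng.integers(0, n - span + 1)
    mask = np.zeros(n, dtype=bool)
    mask[start:start + span] = True
    masked = audio_frames.copy()
    masked[mask] = 0.0                      # hide the target span
    return masked, mask
```

At inference time the same mechanism doubles as an editing tool: masking a mispronounced word, or a noisy stretch, and letting the model re-synthesize just that span.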

Although Voicebox was trained only on this one task, as with LLMs it can use in-context learning to perform other tasks. For example, it can perform style transfer: given audio and its transcript, plus additional text to synthesize, it will use the style of the given audio. It can also remove noise: given audio of speech and its associated transcript, along with a mask indicating the noisy section of the audio, the model can resynthesize that section. The Meta team evaluated Voicebox's performance on several of these tasks, including zero-shot TTS, cross-lingual zero-shot TTS, and text-guided denoising. When measuring the model's word error rate (WER) and audio similarity, Voicebox outperformed the previous state-of-the-art models VALL-E and A3T.
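Word error rate, one of the metrics used in these evaluations, is the word-level edit distance between a reference transcript and a hypothesis transcript, normalized by the number of reference words. A small reference implementation:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / reference words,
    computed with standard Levenshtein dynamic programming over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution/match
    return dp[len(ref)][len(hyp)] / len(ref)
```

For example, `word_error_rate("the cat sat", "the cat sit")` is 1/3: one substituted word out of three reference words. In a TTS evaluation, WER is typically computed by running an ASR system over the synthesized audio and comparing its output to the input text.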

In an effort to minimize the potential safety risks of the Voicebox model, Meta also developed a classifier model trained to detect synthesized speech. When tested on the LibriSpeech benchmark, the classifier could "trivially" distinguish the original benchmark audio from speech synthesized by Voicebox from the text transcripts.

In a Hacker News discussion about Voicebox, one user pointed out Meta's decision not to release the model and wondered how difficult it would be to replicate, given the size of the training data. Another user replied:

Assuming 10 hours [each], 6k books feels a very achievable dataset. Even Librivox claims 18k books (with many duplicates and hugely varying quality levels). If you wanted to get expansive, you could dig into the podcast archives of BBC, NPR, etc which could potentially yield millions of hours.

InfoQ recently covered Massively Multilingual Speech (MMS), a speech model that Meta did open-source. MMS can perform ASR and TTS in over 1k languages; however, it cannot perform tasks such as editing and style transfer that Voicebox can. InfoQ also covered Google's AudioPaLM model which can perform ASR, TTS, and speech-to-speech translation (S2ST) with voice transfer.
