Microsoft Unveils VALL-E, a Game-Changing TTS Language Model

Microsoft has introduced VALL-E, a language-modeling approach to text-to-speech (TTS) synthesis that uses audio codec codes as intermediate representations and can replicate a speaker's voice from just three seconds of recorded audio.

VALL-E is a neural codec language model: it tokenizes speech into discrete codes, then generates new tokens that are decoded into waveforms that sound like the speaker, preserving the speaker's timbre and emotional tone.
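To make the idea of speech tokens concrete, here is a minimal sketch of turning a waveform into discrete codes and back. It assumes 8-bit mu-law companding as a stand-in for the codec; VALL-E itself uses a learned neural audio codec, not mu-law, and the function names `encode` and `decode` are hypothetical.

```python
import math

MU = 255  # assumption: 8-bit mu-law, i.e. a vocabulary of 256 discrete codes


def encode(samples):
    """Map float samples in [-1, 1] to integer tokens in [0, 255]."""
    tokens = []
    for x in samples:
        x = max(-1.0, min(1.0, x))  # clip out-of-range samples
        # Logarithmic companding: quiet samples get finer resolution.
        y = math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)
        tokens.append(int(round((y + 1) / 2 * MU)))
    return tokens


def decode(tokens):
    """Map integer tokens back to approximate float samples."""
    samples = []
    for t in tokens:
        y = 2 * t / MU - 1  # back to the companded range [-1, 1]
        samples.append(math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y))
    return samples


# A 10 ms, 440 Hz tone sampled at 16 kHz (toy input, not real speech).
wave = [0.8 * math.sin(2 * math.pi * 440 * n / 16000) for n in range(160)]
codes = encode(wave)     # a sequence of integers a language model could consume
approx = decode(codes)   # a waveform reconstructed from the tokens alone
```

The point of the sketch is only the shape of the pipeline: once audio is a sequence of integers, speech generation can be framed as predicting the next token, and a decoder turns the predicted tokens back into sound.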

According to the research paper, VALL-E can produce high-quality personalized speech using only a three-second enrolled recording of an unseen speaker as an acoustic prompt. It does so without additional structure engineering, pre-designed acoustic features, or fine-tuning. It exhibits in-context learning and supports prompt-based zero-shot TTS.
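As a rough sense of scale, the following sketch estimates how many codec tokens a three-second prompt turns into. The constants are assumptions based on the codec configuration described in the VALL-E paper (24 kHz audio, 75 frames per second, 8 codebooks of 1024 entries each); none of them appear in this article, and `prompt_token_budget` is a hypothetical helper.

```python
# Assumed codec configuration (from the VALL-E paper, not from this article).
SAMPLE_RATE_HZ = 24_000   # input audio sample rate
FRAME_RATE_HZ = 75        # codec frames produced per second of audio
NUM_CODEBOOKS = 8         # residual quantizer levels per frame
CODEBOOK_SIZE = 1024      # entries per codebook


def prompt_token_budget(seconds):
    """Return sample, frame, and code counts for a prompt of the given length."""
    frames = int(seconds * FRAME_RATE_HZ)
    bits_per_code = (CODEBOOK_SIZE - 1).bit_length()  # 10 bits for 1024 entries
    return {
        "samples": int(seconds * SAMPLE_RATE_HZ),
        "frames": frames,
        "codes": frames * NUM_CODEBOOKS,
        "bitrate_bps": FRAME_RATE_HZ * NUM_CODEBOOKS * bits_per_code,
    }


budget = prompt_token_budget(3.0)
# 72,000 raw samples become 225 frames, or 1,800 discrete codes in total.
```

Under these assumptions the acoustic prompt is short enough to sit in a language model's context alongside the text, which is what makes the prompt-based, no-fine-tuning setup practical.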

The VALL-E researchers provide audio demonstrations of the model in action. The "Speaker Prompt," one of the samples, is the three-second audio cue that VALL-E must imitate. For comparison, the "Ground Truth" is an existing recording of the same speaker saying the target phrase (much like the control in an experiment). The "Baseline" sample is the output of a conventional text-to-speech system, and the "VALL-E" sample is the output of the VALL-E model.

According to the evaluation results, VALL-E significantly outperforms the state-of-the-art zero-shot TTS system on the LibriSpeech and VCTK datasets, setting new state-of-the-art zero-shot TTS results on both.

Speech synthesis has advanced significantly in recent years thanks to neural networks and end-to-end modeling. Today, cascaded TTS systems typically combine an acoustic model with a vocoder, using mel spectrograms as the intermediate representation. Advanced TTS systems can synthesize high-quality speech for a single speaker or a set of speakers.

TTS technology has been integrated into a wide range of applications and devices, such as virtual assistants like Amazon's Alexa and Google Assistant, navigation apps, and e-learning platforms. It's also used in industries such as entertainment, advertising, and customer service to create more engaging and personalized experiences.
