
Amazon Announces One Billion Parameter Speech Model BASE TTS

Amazon Science recently published their work on Big Adaptive Streamable TTS with Emergent abilities (BASE TTS). BASE TTS supports voice-cloning and outperforms baseline TTS models when evaluated by human judges. Further, Amazon's experiments show that scaling model and data size improves the subjective quality of the model's output.

The core of BASE TTS is an autoregressive Transformer, similar to large language models (LLMs). The model is trained on 100k hours of unlabeled speech audio scraped from the web; the researchers generated transcripts for the data automatically using automatic speech recognition (ASR). To evaluate the effects of data and model size on quality, the Amazon team trained small- and medium-sized versions of the model. They also created a test dataset for linguistic experts to evaluate the model's emergent abilities: capabilities, such as expressing emotion, that the model was not explicitly trained to perform. According to Amazon:
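The pseudo-labeling step described above can be sketched as follows. This is an illustrative outline only, not code from the paper: `transcribe` is a hypothetical stand-in for whatever ASR system produces the transcripts, and the pairing logic simply joins each unlabeled clip with its generated text to form training examples.

```python
def transcribe(audio_clip: str) -> str:
    """Placeholder ASR call; a real system would return recognized text
    for the given audio file. Hypothetical, not the paper's ASR model."""
    return f"<transcript of {audio_clip}>"


def build_training_pairs(audio_clips: list[str]) -> list[tuple[str, str]]:
    """Pair each unlabeled clip with its ASR-generated transcript,
    yielding (audio, text) examples for TTS training."""
    return [(clip, transcribe(clip)) for clip in audio_clips]


pairs = build_training_pairs(["clip_001.wav", "clip_002.wav"])
```

At the scale reported (100k hours), this step would of course run as a distributed batch job, but the data flow is the same: unlabeled audio in, (audio, transcript) pairs out.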

From BASE TTS’s strong performance on English and Spanish, we caught a first glimpse of a multilingual TTS approach that achieves high expressiveness, adaptation to textual clues and data efficiency, using only public domain data, and applicable to streaming TTS use cases such as voicing LLM outputs. Our approach points towards potential Scaling Laws of [Large TTS] models, where an even larger amount of speech and other (text, image) data are needed to support multimodal objectives and to break new grounds in TTS.

BASE TTS is the latest of several LLM-inspired TTS models that support voice-cloning or transfer. In 2023, InfoQ covered Microsoft's VALL-E, which can replicate a voice given three seconds of audio recording; Google's AudioPaLM, which is based on an LLM and can perform TTS, ASR, and speech-to-speech translation (S2ST) with voice transfer; and Meta's Voicebox, a non-autoregressive model that can perform TTS in six languages, as well as edit and remove noise from speech recordings.

BASE TTS Architecture

BASE TTS Architecture (Image Source: Amazon Research Paper)

The key idea in BASE TTS is to convert speech audio to and from discrete speech tokens. Amazon used a model called WavLM to create an encoder that separates "phonetic and prosodic information" from the audio and extracts a representation of the speaker's voice. An autoregressive Transformer called SpeechGPT can then generate speech tokens for synthesis, conditioned on text tokens (the text to speak) and on a reference voice to use for synthesis. Finally, to produce audio, the output of SpeechGPT is passed to a speech token decoder.
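The inference flow described above can be sketched as a three-stage pipeline. All function bodies here are hypothetical placeholders for the paper's components (the WavLM-based speech tokenizer, SpeechGPT, and the speech token decoder); none of this is the authors' actual code, and the dummy values exist only to show the shape of the data passing between stages.

```python
def extract_speaker_embedding(reference_audio: str) -> tuple[float, ...]:
    """Stand-in for the WavLM-based encoder's speaker representation:
    a fixed-size vector capturing the reference voice."""
    return (0.0, 0.0, 0.0, 0.0)  # dummy embedding


def speechgpt_generate(text_tokens: list[str],
                       speaker: tuple[float, ...]) -> list[int]:
    """Stand-in for the autoregressive Transformer: maps text tokens,
    conditioned on the speaker embedding, to discrete speech tokens."""
    return [hash((tok, speaker)) % 1024 for tok in text_tokens]


def decode_to_audio(speech_tokens: list[int]) -> list[float]:
    """Stand-in for the speech token decoder that turns discrete
    speech tokens back into a waveform."""
    return [t / 1024.0 for t in speech_tokens]  # dummy samples


text_tokens = ["hello", "world"]
speaker = extract_speaker_embedding("reference.wav")
speech_tokens = speechgpt_generate(text_tokens, speaker)
waveform = decode_to_audio(speech_tokens)
```

The important structural point is the middle stage: because SpeechGPT emits discrete tokens rather than raw audio, the synthesis problem looks like next-token prediction, which is what lets LLM-style scaling apply.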

In a discussion about BASE TTS on Hacker News, users compared examples of its output with speech produced by other models:

The emotion examples are interesting. One of the current most obvious indicators of AI-generated voices/voice cloning is a lack of emotion and range, which make them objectively worse compared to professional voice actors, unless a lack of emotion and range is the desired voice direction. But if you listen to the emotion examples, the range [is] essentially what you'd get from an audiobook narrator, not more traditional voice acting.

While the BASE TTS demo site contains several sample audio files, Amazon has opted not to open-source the model, citing concerns about potential misuse of its voice-cloning ability.
