Alexa Soon to Offer "Newscaster" Voice: Applying Generative Neural Networks for Text-to-Speech

Amazon recently announced the development of a customized Alexa voice suitable for reading the news. In earlier implementations, text-to-speech was achieved by concatenating small snippets of recorded audio to form full sentences. Amazon is now using a generative neural network to synthesize a voice that is not only more natural but can also adopt different speaking styles depending on the context of the text being converted to speech.
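The older concatenative approach mentioned above can be illustrated with a minimal sketch. The unit names, the tiny sample values, and the `concatenate_units` helper are hypothetical and chosen only for illustration; production systems select among thousands of recorded snippets and smooth the joins between them.

```python
from typing import Dict, List

# Hypothetical unit inventory: each key is a speech unit and each value is a
# short list of audio samples standing in for a recorded snippet.
UNIT_INVENTORY: Dict[str, List[float]] = {
    "HH": [0.1, 0.2],
    "EH": [0.3, 0.4, 0.5],
    "L": [0.2, 0.1],
    "OW": [0.6, 0.5, 0.4],
}


def concatenate_units(units: List[str], inventory: Dict[str, List[float]]) -> List[float]:
    """Join the pre-recorded snippet for each unit, in order, into one waveform."""
    waveform: List[float] = []
    for unit in units:
        waveform.extend(inventory[unit])
    return waveform


# Build a toy "hello" from four stored snippets.
hello = concatenate_units(["HH", "EH", "L", "OW"], UNIT_INVENTORY)
```

Because the output is stitched from fixed recordings, changing the speaking style in such a system means recording a new inventory, which is the cost the neural approach aims to avoid.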

The first application of this system is a voice that sounds more natural when reading the news. Amazon's Alexa will switch to the new voice in the coming weeks. The newscaster-like voice was made possible by capturing audio snippets from news channels and then using machine learning to detect the way newscasters read the text. These nuances are difficult to capture with a deterministic algorithm, so a statistical approach is employed to detect and apply them. Amazon needed only a few hours of data to teach the machine learning algorithm how to sound like a newscaster, suggesting that other styles could be on the way.

One way to get a newscaster-like voice is to enlist voice talent to read in that style, split the recordings into small voice samples and stitch them together in the final output. This is time-consuming and expensive. The innovation of the neural text-to-speech system is that it employs a 'style encoding' module that identifies the speaking style of each voice sample. This allows the system to combine a large amount of neutral-style speech data with a few hours of supplementary data in the desired style. It can therefore model both the aspects of speech, such as prosody and other nuances, that are independent of speaking style and those that are particular to a single speaking style.
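As a rough illustration of how a style label can let one model learn from both neutral and newscaster data, here is a minimal sketch. The article does not describe how Amazon represents the style encoding, so the one-hot code, the `STYLES` list, and the helper names below are assumptions made purely for illustration.

```python
from typing import List

# Hypothetical style labels; the source only mentions neutral and newscaster data.
STYLES = ["neutral", "newscaster"]


def style_code(style: str) -> List[float]:
    """Return a one-hot vector marking which style a sample belongs to (assumed encoding)."""
    return [1.0 if s == style else 0.0 for s in STYLES]


def condition_on_style(text_features: List[List[float]], style: str) -> List[List[float]]:
    """Append the style code to every feature frame so a single model sees both inputs."""
    code = style_code(style)
    return [frame + code for frame in text_features]


# The same text features, tagged with two different styles.
features = [[0.2, 0.7], [0.5, 0.1]]
neutral_input = condition_on_style(features, "neutral")
news_input = condition_on_style(features, "newscaster")
```

With inputs tagged this way, the shared part of each frame can be learned from the large neutral corpus, while the style code gives a model a signal for what differs in the few hours of newscaster recordings.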

The announcement follows the recent addition of whisper mode in Alexa, which allows for a softer tone of voice for late-night or early-morning conversations with the digital assistant. Google Assistant already uses machine-learning-based speech synthesis developed by its London-based AI lab DeepMind. Apple's Siri uses machine learning based on hidden Markov models to synthesize its voice from up to 20 hours of professional recordings.
