Microsoft Previews Neural Network Text-To-Speech Capabilities

In a recent blog post, Microsoft announced a public preview of its neural network-powered text-to-speech capability, part of the Azure Cognitive Services offering. According to Microsoft, this release makes computer-generated voices indistinguishable from actual recordings. The technology has applications in chatbots, virtual assistants and the conversion of digital text, such as e-books, into audiobooks.

The technology was first revealed at the Microsoft Ignite conference earlier this fall. Since then, improvements have been made in voice quality, runtime performance and service availability.

Voice quality has improved as a result of large-scale supervised training across a diverse set of speakers. In addition, more features from unsupervised pre-training have been included, and the neural model design is more robust. Xuedong Huang, a technical fellow at Microsoft, explains the benefits of these enhancements:

Our text-to-speech capability uses deep neural networks to overcome the limits of traditional text-to-speech systems in matching the patterns of stress and intonation in spoken language, called prosody, and in synthesizing the units of speech into a computer voice.

Text-to-speech systems are not new, but Huang explains the differences between these previous systems and the latest service from Microsoft:

Traditional text-to-speech systems break down prosody into separate linguistic analysis and acoustic prediction steps that are governed by independent models. That can result in muffled, buzzy voice synthesis. Our neural capability does prosody prediction and voice synthesis simultaneously. The result is a more fluid and natural-sounding voice.

The neural text-to-speech engine is now six times faster than the previous version as a result of code optimization, hardware acceleration, parallel inference and model simplification. Microsoft considers the runtime performance to be near-instantaneous. Huang explains the impact of these enhancements on the service:

The real-time factor has been improved from the previous version to less than 0.05X, meaning 1 second of audio can be generated in less than 50 milliseconds.
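The arithmetic behind that figure can be sketched in a few lines of Python. This is only an illustration of how a real-time factor is usually defined (synthesis time divided by the duration of the audio produced). The function names are hypothetical and are not part of any Microsoft API.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Return synthesis time divided by the duration of the audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def max_synthesis_ms(audio_seconds: float, rtf: float) -> float:
    """Upper bound on synthesis time, in milliseconds, at a given real-time factor."""
    return audio_seconds * rtf * 1000.0


# One second of audio at the quoted bound of 0.05 (hypothetical helper names).
print(max_synthesis_ms(1.0, 0.05))   # 50.0
print(real_time_factor(0.05, 1.0))   # 0.05
```

A real-time factor below 1.0 means audio is generated faster than it plays back. At 0.05, synthesis takes at most one twentieth of the playback time.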

Microsoft has provided samples that it says demonstrate the computer-generated voices are "indistinguishable from actual recordings":

Sentences (each provided as a recording and as text-to-speech):
The third type, a logarithm of the unsigned fold change, is undoubtedly the most tractable.
As the name suggests, the original submarines came from Yugoslavia.
This is easy enough if you have an unfinished attic directly above the bathroom.

The preview service currently offers two pre-built neural text-to-speech voices in English: a female voice named Jessa and a male voice named Guy. Additional languages will be available in the future, as will customization services for customers who want to build their own branded voices.

Azure Kubernetes Service (AKS) provides the underlying infrastructure that powers the neural text-to-speech service, which is available in three data centers across the US, Europe and Asia.

Discounts for the service are available during the preview. Please visit the Azure pricing page for more details.
