Google Announces General Availability of Cloud Text-to-Speech and Updates to Cloud Speech-to-Text

Google announced the general availability of Cloud Text-to-Speech, which allows developers to add natural-sounding speech to their devices or applications. Google also announced updates to Cloud Speech-to-Text that add a broader set of features and improve its availability and reliability.

Cloud Text-to-Speech was first announced in March of this year, and customers have since asked for more language support for WaveNet voices, which use a technology that mimics human speech to sound more natural. Google responded by adding 17 new WaveNet voices, allowing Cloud Text-to-Speech customers to build apps in many more languages. Currently, Cloud Text-to-Speech supports 14 languages and variants with 56 voices in total: 30 standard voices and 26 WaveNet voices.

Google’s Cloud Text-to-Speech leverages several technologies, including WaveNet, a deep neural network for generating raw audio waveforms that produces more realistic-sounding speech. Google also added Audio Profiles (beta) for use with text-to-speech, which let users optimize the output for playback on different kinds of hardware. As Google stated in the announcement:

You can now specify whether audio is intended to be played over phone lines, headphones, or speakers, and we’ll optimize the audio for playback. For example, if the audio your application produces is listened to primarily on headphones, you can create synthetic speech from Cloud Text-to-Speech API that is optimized specifically for headphones.
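
As a rough illustration of the audio profile option described above, the following sketch builds a JSON request body that selects a profile for a given playback target. The field names (`audioConfig`, `effectsProfileId`) and the profile identifiers follow the public REST documentation as best recalled here and should be treated as assumptions to verify; the helper `build_synthesize_request` and the `AUDIO_PROFILES` mapping are hypothetical names introduced only for this example. No network call is made.

```python
import json

# Assumed profile identifiers; verify them against the current
# Cloud Text-to-Speech documentation before relying on them.
AUDIO_PROFILES = {
    "headphones": "headphone-class-device",
    "phone": "telephony-class-application",
    "speaker": "large-home-entertainment-class-device",
}


def build_synthesize_request(text, playback, language_code="en-US"):
    """Return a JSON-serializable body for a speech synthesis request.

    `playback` is one of the keys in AUDIO_PROFILES (a hypothetical
    mapping used only in this sketch).
    """
    if playback not in AUDIO_PROFILES:
        raise ValueError(f"unknown playback target: {playback!r}")
    return {
        "input": {"text": text},
        "voice": {"languageCode": language_code},
        "audioConfig": {
            "audioEncoding": "MP3",
            # The profile asks the service to tune audio for this device class.
            "effectsProfileId": [AUDIO_PROFILES[playback]],
        },
    }


if __name__ == "__main__":
    body = build_synthesize_request("Hello from the cloud.", "headphones")
    print(json.dumps(body, indent=2))
```

Keeping the device-to-profile mapping in one place means an application can switch between phone, headphone, and speaker output without touching the rest of the request.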


At Google Cloud Next last July, Google announced new features for Cloud Speech-to-Text, which are now available on the service in beta. Developers can build applications that accept multiple languages through language autodetection, separate different speakers through speaker diarization and multi-channel recognition, and use word-level confidence scores.

Cloud Speech-to-Text is primarily a transcription service that interprets human speech and records what is said. It can also add punctuation such as commas and periods to the text output. Google is now extending the service with multi-channel recognition for transcribing audio from multiple speakers, and the resulting transcripts can be combined with Cloud Natural Language for sentiment analysis. When speakers are not recorded on separate channels, developers can use a feature called speaker diarization, which lets them pass the number of speakers as an API parameter. Through machine learning, according to the same announcement, the service then offers the following capability:

Cloud Speech-to-Text will tag each word with a speaker number. Speaker tags attached to each word are continuously updated as more and more data is received, so Cloud Speech-to-Text becomes increasingly more accurate at identifying who is speaking, and what they said.
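
To make the per-word speaker tags more concrete, here is a minimal sketch of how an application might collapse tagged words into speaker turns. It assumes each word arrives as a dictionary with `word` and `speakerTag` keys, which mirrors the JSON response shape as recalled here and should be checked against the API reference; the sample data and the `speaker_turns` helper are hypothetical.

```python
from itertools import groupby


def speaker_turns(words):
    """Collapse consecutive words with the same speaker tag into turns.

    Returns a list of (speaker_tag, text) tuples in spoken order.
    """
    turns = []
    for tag, group in groupby(words, key=lambda w: w["speakerTag"]):
        turns.append((tag, " ".join(w["word"] for w in group)))
    return turns


# Hypothetical word list, shaped like a diarized transcription result.
sample_words = [
    {"word": "hello", "speakerTag": 1},
    {"word": "there", "speakerTag": 1},
    {"word": "hi", "speakerTag": 2},
    {"word": "how", "speakerTag": 1},
    {"word": "are", "speakerTag": 1},
    {"word": "you", "speakerTag": 1},
]

for tag, text in speaker_turns(sample_words):
    print(f"Speaker {tag}: {text}")
```

Because the quoted announcement says tags are updated as more audio arrives, an application working with streaming results would likely want to recompute the turns from the latest word list rather than append to earlier output.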


Besides speaker diarization and multi-channel recognition, Cloud Speech-to-Text can accept multiple languages and detect which one is spoken. For voice and command use cases, developers can send up to four language codes in each query to Cloud Speech-to-Text. The API then automatically determines which language was spoken and returns the transcript in that language. Finally, word-level confidence scores allow developers to build apps that highlight specific words and, depending on the score, prompt users to repeat those words as needed.
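
The two features above can be sketched together: a recognition config that carries up to four language codes, and a filter that picks out words whose confidence falls below a threshold. The config field names (`languageCode`, `alternativeLanguageCodes`, `enableWordConfidence`) are taken from the beta REST API as recalled here and are assumptions to verify; the helpers, the sample words, and the 0.8 threshold are hypothetical choices for illustration only.

```python
MAX_LANGUAGES = 4  # per-query limit stated in the article


def build_recognition_config(language_codes):
    """Build a config dict with one primary and up to three alternative languages."""
    if not 1 <= len(language_codes) <= MAX_LANGUAGES:
        raise ValueError(f"expected 1 to {MAX_LANGUAGES} language codes")
    primary, *alternatives = language_codes
    return {
        "languageCode": primary,
        "alternativeLanguageCodes": alternatives,
        "enableWordConfidence": True,
    }


def words_to_confirm(words, threshold=0.8):
    """Return the words whose confidence score is below the threshold."""
    return [w["word"] for w in words if w["confidence"] < threshold]


# Hypothetical per-word results with confidence scores.
sample_words = [
    {"word": "play", "confidence": 0.97},
    {"word": "bohemian", "confidence": 0.62},
    {"word": "rhapsody", "confidence": 0.91},
]

config = build_recognition_config(["en-US", "fr-FR", "de-DE"])
print(config)
print("Please repeat:", words_to_confirm(sample_words))
```

A suitable threshold depends on the application; a voice command interface may want to confirm more aggressively than a note-taking tool.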

Google's Text-to-Speech service is not the only one available in the public cloud. Amazon offers Polly on AWS, which lists 54 available voices, and Microsoft provides its Text to Speech service, still in preview, with more than 75 voices in over 45 languages. With Speech-to-Text, Google competes with Amazon Transcribe on AWS, a feature-rich and generally available service, and with Microsoft's Speech to Text service, which is also still in preview. Beyond the competition, users of these speech and text services are sharing their opinions and discussing them. In a Hacker News thread on the Google text and speech services, one of the participants said:

Not sure why it would be a worthwhile endeavour to roll up your text to speech system when all major cloud providers give that service for a price unless this cost would make a significant fraction of your total costs. Unless that's the case, why not use this service until the day comes when Google might raise prices and then we could decide what to do about it? It's still an API call, after all.

According to Mike Wheatley in a recent Silicon Angle article, Google will target three main markets with the Cloud Text-to-Speech service: 

  • Voice response systems for call centers, for which Cloud Text-to-Speech can provide real-time, natural-language conversation.
  • The IoT sector, specifically products such as car infotainment systems, TVs and robots, enabling these kinds of devices to talk back to users.
  • Applications such as podcasts and audiobooks, which convert text into speech.

Developers can try out both the Cloud Speech-to-Text and Cloud Text-to-Speech services. Pricing details for Speech-to-Text are available on its pricing page; similarly, pricing details for Text-to-Speech are available on the respective pricing page.
