Google Upgrades Its Speech-to-Text Service with Tailored Deep-Learning Models

A month after Google announced breakthroughs in Text-to-Speech generation technology stemming from the Magenta project, the company followed through with a major upgrade of its Speech-to-Text API cloud service. The updated service leverages deep-learning models for speech transcription that are tailored to specific use cases: short voice commands, phone calls, and video, with a default model for all other contexts. The upgraded service now handles 120 languages and variants, though model availability and feature levels vary by language. Business applications range from over-the-phone meetings to call centers and video transcription. Transcription accuracy is improved in the presence of multiple speakers and significant background noise.
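Selecting one of the tailored models is done per request. The following is a minimal sketch using the google-cloud-speech Python client, following the v1p1beta1 surface that exposes the new model field; module paths may differ across client versions, and the bucket URI is hypothetical.

```python
# Minimal sketch: requesting the tailored "video" model. Module paths
# follow the v1p1beta1 client surface and may differ across versions;
# the bucket URI is hypothetical.
from google.cloud import speech_v1p1beta1 as speech

client = speech.SpeechClient()

config = speech.types.RecognitionConfig(
    encoding=speech.enums.RecognitionConfig.AudioEncoding.FLAC,
    sample_rate_hertz=16000,
    language_code="en-US",
    model="video",  # or "phone_call", "command_and_search", "default"
)
audio = speech.types.RecognitionAudio(uri="gs://my-bucket/team-meeting.flac")

response = client.recognize(config=config, audio=audio)
for result in response.results:
    print(result.alternatives[0].transcript)
```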

Two other elements make up the upgrade. The standard service level agreement (SLA) now offers a commitment of 99.9% availability, and the service includes a new mechanism to tag transcription jobs and provide feedback to the Google team.
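The tagging mechanism takes the form of recognition metadata attached to a request. Below is a hedged sketch; the RecognitionMetadata message and its enum values are taken from the v1p1beta1 client and should be treated as assumptions if your client version differs.

```python
# Sketch: tagging a transcription job with recognition metadata so that
# feedback can be matched to a use case. Message and enum names follow
# the v1p1beta1 client and are assumptions for other versions.
from google.cloud import speech_v1p1beta1 as speech

metadata = speech.types.RecognitionMetadata(
    interaction_type=speech.enums.RecognitionMetadata.InteractionType.PHONE_CALL,
    original_media_type=speech.enums.RecognitionMetadata.OriginalMediaType.AUDIO,
    recording_device_type=speech.enums.RecognitionMetadata.RecordingDeviceType.PHONE_LINE,
)

config = speech.types.RecognitionConfig(
    language_code="en-US",
    model="phone_call",
    metadata=metadata,
)
```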

The specialized models are adapted to the characteristics of the audio medium in terms of sampling rate, resulting bandwidth, and signal duration. Audio over the phone is sampled at 8 kHz, resulting in lower audio quality compared to audio from videos, which is usually sampled at 16 kHz; hence the need for models optimized for each media type.
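In practice this means declaring the medium's native sampling rate alongside the matching model. A short sketch, under the same client assumptions as above:

```python
# Sketch: a phone-call configuration at the telephony-native 8 kHz rate,
# in contrast with the 16 kHz video configuration shown earlier.
from google.cloud import speech_v1p1beta1 as speech

phone_config = speech.types.RecognitionConfig(
    encoding=speech.enums.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=8000,  # native sampling rate of telephone audio
    language_code="en-US",
    model="phone_call",
)
```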

Crowdsourcing real-world audio samples is at the heart of Google's strategy to improve its models, with the launch of an opt-in program called data logging in which users can choose to share their audio with Google in order to help improve the models. Enabling data logging gives the user access to enhanced models with even better performance. Google reports a 54% reduction in word errors compared to the standard phone-call model and a 64% error reduction for the enhanced video model.
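Once data logging is enabled for a project, an enhanced model is requested through a flag on the recognition config; in the v1p1beta1 client this is the use_enhanced field, as sketched below.

```python
# Sketch: opting into the enhanced phone-call model. Requires data
# logging to be enabled for the project; field names follow the
# v1p1beta1 client.
from google.cloud import speech_v1p1beta1 as speech

config = speech.types.RecognitionConfig(
    language_code="en-US",
    model="phone_call",
    use_enhanced=True,  # serve the enhanced variant of the model
)
```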

In terms of best practices, Google suggests working with audio data compressed with a lossless codec such as FLAC, sampled at 16 kHz, and refraining from any audio pre-processing such as noise reduction or automatic gain control.
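A recording can be brought in line with these recommendations before upload, for instance with ffmpeg. The sketch below assumes ffmpeg is installed; the file names are hypothetical.

```python
# Sketch: transcoding a recording to lossless 16 kHz mono FLAC with
# ffmpeg, applying no noise reduction or gain filters along the way.
import subprocess

subprocess.run(
    [
        "ffmpeg",
        "-i", "raw_recording.wav",  # hypothetical input file
        "-ar", "16000",             # resample to 16 kHz
        "-ac", "1",                 # mono
        "-c:a", "flac",             # lossless codec
        "recording_16k.flac",
    ],
    check=True,
)
```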

Word error reduction is not the only factor improving overall Speech-to-Text quality. Punctuation prediction remains an important yet challenging aspect of speech transcription. Google's Speech-to-Text API now offers the ability to add punctuation to the transcribed text, further improving the readability of texts produced from long audio sequences. The automatic punctuation feature leverages an LSTM neural-network model.
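Automatic punctuation is likewise switched on per request; in the v1p1beta1 client the flag is enable_automatic_punctuation, as in this sketch.

```python
# Sketch: enabling automatic punctuation for a long-form transcription.
from google.cloud import speech_v1p1beta1 as speech

config = speech.types.RecognitionConfig(
    language_code="en-US",
    model="video",
    enable_automatic_punctuation=True,  # adds commas, periods, etc.
)
```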

As recent publications on speech synthesis and speech recognition from Google Research show, deep learning for Speech-to-Text is frequently based on sequence-to-sequence neural-network models, which can also be applied to machine translation and text summarization. In short, seq2seq models use a first LSTM to encode the audio input and a second LSTM, conditioned on the input sequence, to decode the data into transcribed text.
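The toy PyTorch model below sketches that general encoder-decoder pattern. It is purely illustrative, with made-up dimensions, and bears no relation to Google's production architecture.

```python
# Schematic sketch of a seq2seq encoder-decoder built from LSTMs: one
# LSTM encodes the audio features, a second LSTM conditioned on the
# encoder's final state decodes text tokens. Illustrative only.
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, n_audio_features=80, n_tokens=100, hidden=256):
        super().__init__()
        self.encoder = nn.LSTM(n_audio_features, hidden, batch_first=True)
        self.embed = nn.Embedding(n_tokens, hidden)
        self.decoder = nn.LSTM(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_tokens)

    def forward(self, audio_frames, token_ids):
        # audio_frames: (batch, time, n_audio_features), e.g. spectrogram
        # token_ids:    (batch, length), previously emitted text tokens
        _, state = self.encoder(audio_frames)     # summarize the input
        dec_out, _ = self.decoder(self.embed(token_ids), state)
        return self.out(dec_out)                  # logits over vocabulary

model = Seq2Seq()
logits = model(torch.randn(2, 50, 80), torch.randint(0, 100, (2, 10)))
print(logits.shape)  # torch.Size([2, 10, 100])
```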

Other existing Speech-to-Text services include the Microsoft speech recognition API with 29 supported languages, the IBM Watson API which supports up to seven languages, and Amazon Transcribe, launched in November 2017, which so far only works with US English and Spanish speech. A recent comparison of some of these services from the Florida Institute of Technology shows lower error rates for the Google service API. Another set of comparative tests underlines the importance of latency in speech transcription services.
