Using Deep Learning Technologies, IBM Reaches a New Milestone in Speech Recognition

by Srini Penchikala on Mar 31, 2017. Estimated reading time: 1 minute

The research team at IBM recently announced they've reached a new industry record in speech recognition with a word error rate of 5.5% using the SWITCHBOARD linguistic corpus. This brings it closer to what's considered to be the human error rate of 5.1%. Humans typically miss one to two words out of every 20 words they hear. In a five-minute conversation, that could be as many as 80 words.
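To make the metric concrete, here is a minimal sketch of how a word error rate is commonly computed: the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and the recognizer's output, divided by the number of reference words. The function name `word_error_rate` and the sample sentences are illustrative assumptions, not IBM's scoring pipeline, and real evaluations also apply text normalization that is omitted here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one substituted word out of four reference words.
wer = word_error_rate("a b c d", "a x c d")
print(f"WER: {wer:.2%}")
```

Under this definition, a 5.5% word error rate corresponds to roughly one error for every 18 reference words.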

The research project applies deep learning technologies and incorporates multiple acoustic models. The speech recognition system used Long Short-Term Memory (LSTM) and WaveNet language models with a score fusion of three acoustic models. The acoustic models included an LSTM with multiple feature inputs, another LSTM trained with speaker-adversarial multi-task learning, and a third model, a residual net (ResNet) with 25 convolutional layers and time-dilated convolutions. The last model learns from positive examples but also takes advantage of negative examples, so it performs better where similar speech patterns are repeated.
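The article does not say how the scores of the three acoustic models were fused. As a rough illustration of the general idea, the sketch below combines per-hypothesis log scores from several models with a weighted sum and picks the highest-scoring hypothesis. The function `fuse_scores`, the equal weights, the model labels, and all score values are hypothetical and are not taken from IBM's system.

```python
from typing import Dict, List


def fuse_scores(
    model_scores: List[Dict[str, float]], weights: List[float]
) -> Dict[str, float]:
    """Weighted sum of per-hypothesis log scores across models."""
    if len(model_scores) != len(weights):
        raise ValueError("need exactly one weight per model")
    fused = {}
    for hypothesis in model_scores[0]:
        fused[hypothesis] = sum(
            weight * scores[hypothesis]
            for weight, scores in zip(weights, model_scores)
        )
    return fused


# Hypothetical log scores from three acoustic models for two candidate transcripts.
lstm_multi_feature = {"recognize speech": -12.0, "wreck a nice beach": -11.5}
lstm_adversarial = {"recognize speech": -10.0, "wreck a nice beach": -13.0}
resnet = {"recognize speech": -11.0, "wreck a nice beach": -12.5}

fused = fuse_scores(
    [lstm_multi_feature, lstm_adversarial, resnet],
    weights=[1 / 3, 1 / 3, 1 / 3],  # assumed equal weights
)
best = max(fused, key=fused.get)
print(best, fused[best])
```

In this toy example the first model alone would prefer the wrong transcript, while the fused score favors the one the other two models agree on, which is the usual motivation for combining models.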

Yoshua Bengio from the Montreal Institute for Learning Algorithms (MILA) Lab at the University of Montreal commented on the result:

"In spite of impressive advances in recent years, reaching human-level performance in AI tasks such as speech recognition or object recognition remains a scientific challenge. Indeed, standard benchmarks do not always reveal the variations and complexities of real data. For example, different data sets can be more or less sensitive to different aspects of the task, and the results depend crucially on how human performance is evaluated, for example using skilled professional transcribers in the case of speech recognition."

He also said IBM's research helps advance speech recognition by applying neural networks and deep learning to acoustic and language models.

In other speech processing news, IBM added diarization to its Watson Speech to Text service, which helps with use cases such as distinguishing individual speakers in a conversation. All these achievements help introduce technologies that will match the complexity of how the human ear, voice, and brain interact.

