
Apple Reveals the Inner Workings of Siri's New Intonation

By Roland Meertens, Sep 10, 2017. Estimated reading time: 1 minute

Apple has explained how they use deep learning to make Siri's intonation sound more natural.

iPhone owners can interact with Siri by asking questions in natural language, and Siri responds by voice. Siri's voice is available in 21 languages and localized for 36 countries. At WWDC 2017, Apple announced that Siri would use a new text-to-speech engine in iOS 11, and in August 2017 Apple's machine learning journal explained how they made Siri sound more human.

To generate speech, your iPhone stitches together segments of pre-recorded human speech. Many hours of recordings are broken down into words, and the words into their most elemental components: phonemes. Whenever a sentence must be synthesized, recordings of the appropriate phonemes are selected and stitched together.
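The stitching step above can be sketched in a few lines. This is a toy illustration only: the phoneme labels and "waveforms" (short lists of samples) are invented, and real concatenative synthesis also smooths the join points.

```python
# Toy sketch of concatenative synthesis: stitch pre-recorded phoneme
# waveforms together in order to form a word. All data here is invented.

def synthesize(phoneme_sequence, recordings):
    """Concatenate the recorded samples for each phoneme in the sequence."""
    samples = []
    for phoneme in phoneme_sequence:
        samples.extend(recordings[phoneme])  # append this phoneme's audio
    return samples

# Stand-in "recordings": short sample lists instead of real audio buffers.
recordings = {"HH": [0.1, 0.2], "EH": [0.3, 0.4], "L": [0.5], "OW": [0.6, 0.7]}
word = synthesize(["HH", "EH", "L", "OW"], recordings)
print(word)  # [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]
```

The hard part, as the article explains next, is not the concatenation itself but choosing *which* recording of each phoneme to use.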

Selecting which recording to use for each phoneme is a big challenge. Each unit has to match both the sound to be pronounced and the units selected around it. An old in-car navigation system has only a few recordings per phoneme, which is why its voice sounds unnatural. Apple decided to use deep learning to determine which properties a recording must have to fit naturally into a sentence.

Every iOS device contains a small database of pre-recorded phonemes. Each recording is associated with acoustic features: spectrum, pitch, and duration. A so-called "deep mixture density network" is trained to predict a distribution over the features a phoneme must have to fit into a natural-sounding sentence. To train this network, Apple designed a cost function that takes two aspects into account: how well a recording matches the sound you want to pronounce, and how well it fits next to the units around it.
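The two-part cost can be sketched as a target cost (distance between a candidate recording's features and the predicted features) plus a concatenation cost (how smoothly it joins the previous unit). The feature names, distance measures, and weights below are illustrative assumptions, not Apple's actual model.

```python
# Hedged sketch of a two-part unit-selection cost: target cost plus
# concatenation (join) cost. Features and weights are invented.

def target_cost(unit, predicted):
    # Squared distance between the unit's features and the predicted ones.
    return sum((unit[k] - predicted[k]) ** 2 for k in predicted)

def join_cost(prev_unit, unit):
    # Penalize a pitch discontinuity at the join point.
    return abs(unit["pitch"] - prev_unit["pitch"])

def total_cost(prev_unit, unit, predicted, w_target=1.0, w_join=0.5):
    return w_target * target_cost(unit, predicted) + w_join * join_cost(prev_unit, unit)

predicted = {"pitch": 120.0, "duration": 0.08}   # network's prediction
prev_unit = {"pitch": 118.0, "duration": 0.07}   # previously chosen unit
candidate = {"pitch": 121.0, "duration": 0.09}   # candidate recording
print(round(total_cost(prev_unit, candidate, predicted), 4))  # 2.5001
```

A lower total cost means the recording both sounds right in isolation and joins smoothly with its neighbor.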

After determining exactly what to look for, your phone searches its database using the Viterbi algorithm. The lowest-cost path through the candidate recordings is selected, and those recordings are concatenated and played.
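The search step above can be sketched as dynamic programming over candidate units: at each position in the sentence, keep the cheapest path ending in each candidate, then trace back the best one. The toy units and costs below are invented for illustration.

```python
# Minimal Viterbi-style search over candidate units, as a sketch of the
# selection step described above. Units and costs are illustrative only.

def viterbi(candidates, unit_cost, transition_cost):
    # best[i][j]: cheapest total cost of a path ending in candidate j at position i
    best = [[unit_cost(u) for u in candidates[0]]]
    back = []
    for i in range(1, len(candidates)):
        row, ptr = [], []
        for u in candidates[i]:
            costs = [best[i - 1][j] + transition_cost(prev, u)
                     for j, prev in enumerate(candidates[i - 1])]
            j_min = min(range(len(costs)), key=costs.__getitem__)
            row.append(costs[j_min] + unit_cost(u))
            ptr.append(j_min)
        best.append(row)
        back.append(ptr)
    # Trace back the cheapest path from the final position.
    j = min(range(len(best[-1])), key=best[-1].__getitem__)
    path = [candidates[-1][j]]
    for i in range(len(back) - 1, -1, -1):
        j = back[i][j]
        path.append(candidates[i][j])
    return list(reversed(path))

# Toy candidates: each unit is (name, pitch); we prefer smooth pitch joins.
candidates = [[("a1", 100), ("a2", 140)], [("b1", 105), ("b2", 150)]]
path = viterbi(candidates,
               unit_cost=lambda u: 0.0,
               transition_cost=lambda p, u: abs(p[1] - u[1]))
print([name for name, _ in path])  # ['a1', 'b1']
```

With these costs the search picks the pair with the smallest pitch jump, which is exactly the "best path through the recorded phonemes" behavior the article describes.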

An alternative is to generate the sound waves directly, without concatenating recordings. In September 2016, Alphabet's DeepMind unveiled a fully computer-generated text-to-speech engine called WaveNet. The downside is speed: generating speech this way takes a long time even on a fast desktop computer, so Siri won't be replaced by directly generated speech any time soon.
