Deep Learning for Speech Synthesis of Audio from Brain Activity

Research teams use deep learning neural networks to synthesize speech from electrical signals recorded in human brains, to help people with speech challenges.

In three separate experiments, research teams used electrocorticography (ECoG) to measure electrical impulses in the brains of human subjects while the subjects listened to someone speaking, or while the subjects themselves spoke. The data was then used to train neural networks to produce speech sound output. The motivation for this work is to help people who cannot speak by creating a brain-computer interface or "speech prosthesis" that can directly convert signals in the user's brain into synthesized speech sound.

The first experiment, which was run by a team at Columbia University, used data from patients undergoing treatment for epilepsy. The patients had electrodes implanted in their auditory cortex, and ECoG data was collected from these electrodes while the patients listened to recordings of short spoken sentences. The researchers trained a deep neural network (DNN) with Keras and TensorFlow, using the ECoG data as the input and a vocoder/spectrogram representation of the recorded speech as the target. To evaluate the resulting audio, the researchers had listeners hear reconstructed spoken digits and report what they heard; the best model achieved 75% accuracy.
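The Columbia write-up does not include training code, but the core regression, from a window of ECoG features to a spectrogram frame, can be sketched with a small Keras model. Everything below (channel count, window length, layer sizes, and the random placeholder data) is an illustrative assumption, not the team's actual configuration.

import numpy as np
from tensorflow.keras import layers, models

N_ELECTRODES = 128       # assumed number of ECoG channels
WINDOW = 20              # assumed number of time steps per input window
N_SPECTROGRAM_BINS = 32  # assumed number of spectrogram/vocoder output bins

# A simple fully-connected regressor: flattened ECoG window in, one spectrogram frame out.
model = models.Sequential([
    layers.Input(shape=(WINDOW * N_ELECTRODES,)),
    layers.Dense(512, activation="relu"),
    layers.Dense(512, activation="relu"),
    layers.Dense(N_SPECTROGRAM_BINS, activation="linear"),
])
model.compile(optimizer="adam", loss="mse")

# Random arrays stand in for time-aligned ECoG windows and spectrogram frames.
x = np.random.randn(1000, WINDOW * N_ELECTRODES).astype("float32")
y = np.random.randn(1000, N_SPECTROGRAM_BINS).astype("float32")
model.fit(x, y, epochs=2, batch_size=32, verbose=0)

At inference time, each predicted spectrogram frame would be handed to a vocoder to reconstruct audible speech, which is what the listeners then judged.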

A second team, led by Professor Tanja Schultz of the University of Bremen in Germany, gathered data from patients undergoing craniotomies. The patients were shown single words, which they read aloud while their ECoG signals were recorded. The spoken audio was also recorded and converted to a spectrogram. A densely-connected convolutional network (DenseNet) was then trained to convert the brain signals into spectrograms, and a WaveNet vocoder was used to convert the spectrograms into audible speech. To evaluate the synthesized speech, the researchers used an algorithm called short-time objective intelligibility (STOI) to estimate how intelligible the output was; the scores ranged from 30% to 50%.
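STOI is a standard signal-processing metric that compares a synthesized or degraded waveform against a clean reference. As a rough illustration of the kind of measurement reported above, the open-source pystoi package (an assumption here; the article does not name the tooling) exposes it as a single function call. The waveforms below are placeholders.

import numpy as np
from pystoi import stoi

FS = 16000  # assumed sample rate in Hz
reference = np.random.randn(FS * 3)    # placeholder for the recorded speech
synthesized = np.random.randn(FS * 3)  # placeholder for the reconstructed speech

# Returns a score roughly between 0 and 1; the article reports values around 0.3 to 0.5.
score = stoi(reference, synthesized, FS, extended=False)
print(f"STOI: {score:.2f}")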

Finally, a third team, led by Edward Chang of the University of California, San Francisco, also used data from patients who read aloud while ECoG signals were recorded. This team stacked two long short-term memory (LSTM) networks. The first learned a mapping from the brain signals to an "intermediate articulatory kinematic representation" that models the physical behavior of a speaker's vocal tract. The second LSTM learned a mapping from that kinematic representation to actual audio. This design allowed the researchers to synthesize speech from brain activity recorded while the patient only pantomimed speaking, without actually making sound. Using Amazon Mechanical Turk (https://www.mturk.com/), the researchers recruited listeners who, after hearing a synthesized sentence, selected from multiple-choice answers to identify the sentence they heard. The median percentage of listeners who correctly identified each sentence was 83%.
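Chang's group did not publish code with the article, but the two-stage structure, one recurrent network mapping ECoG sequences to articulatory kinematics and a second mapping those kinematics to acoustic features, can be sketched in Keras as follows. All dimensions, layer sizes, and the bidirectional-LSTM choice are illustrative assumptions.

import numpy as np
from tensorflow.keras import layers, models

TIMESTEPS = 100   # assumed sequence length
N_ECOG = 256      # assumed ECoG feature dimension per time step
N_KINEMATIC = 33  # assumed articulatory feature dimension
N_ACOUSTIC = 32   # assumed acoustic (spectral) feature dimension

# Stage 1: brain signals -> articulatory kinematic representation
brain_to_kinematics = models.Sequential([
    layers.Input(shape=(TIMESTEPS, N_ECOG)),
    layers.Bidirectional(layers.LSTM(128, return_sequences=True)),
    layers.TimeDistributed(layers.Dense(N_KINEMATIC)),
])

# Stage 2: articulatory kinematics -> acoustic features for a vocoder
kinematics_to_audio = models.Sequential([
    layers.Input(shape=(TIMESTEPS, N_KINEMATIC)),
    layers.Bidirectional(layers.LSTM(128, return_sequences=True)),
    layers.TimeDistributed(layers.Dense(N_ACOUSTIC)),
])

# Chaining the two stages on placeholder data.
ecog = np.random.randn(1, TIMESTEPS, N_ECOG).astype("float32")
kinematics = brain_to_kinematics(ecog)
acoustics = kinematics_to_audio(kinematics)
print(acoustics.shape)  # (1, TIMESTEPS, N_ACOUSTIC)

One appeal of the staged design is that the second network never sees brain data directly, which is consistent with the pantomime result described above.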

There is still a long way to go before this technology can become a practical working prosthesis. For starters, all three methods used data collected from electrodes implanted in the brains of patients whose skulls had been opened for brain surgery. And while Chang's group did demonstrate the synthesis of speech from signals generated by silent pantomime, many users who might need such a prosthesis may not be able to control their vocal tracts well enough to do even that.
 
