25th April 2019
Synthetic speech generated from brain recordings
New technology developed at the University of California San Francisco is a stepping-stone to a "neural speech prosthesis", researchers say.
A state-of-the-art brain-machine interface created by UC San Francisco neuroscientists can generate natural-sounding synthetic speech by using brain activity to control a virtual vocal tract – an anatomically detailed computer simulation including the lips, jaw, tongue and larynx. The study was conducted in research participants with intact speech, but the technology could one day restore the voices of people who have lost the ability to speak due to paralysis and other forms of neurological damage.
Stroke, traumatic brain injury, and neurodegenerative diseases such as Parkinson’s disease, multiple sclerosis and amyotrophic lateral sclerosis (ALS) often result in an irreversible loss of speaking ability. Some people with severe speech disabilities learn to spell out their thoughts letter-by-letter using assistive devices that track very small eye or facial muscle movements. However, producing text or synthesised speech with such devices is laborious, error-prone and painfully slow, typically permitting a maximum of 10 words per minute, compared to the 100 to 150 of natural speech.
The new system being developed in the laboratory of Edward Chang, MD – described yesterday in Nature – demonstrates that it is possible to create a synthesised version of a person's voice controlled by activity in their brain's speech centres. In the future, this approach could not only restore fluent communication to individuals with severe speech disability, the authors say, but could also reproduce the "musicality" of the human voice that conveys the speaker's emotions and personality.
“For the first time, this study demonstrates that we can generate entire spoken sentences based on an individual’s brain activity,” said Chang, professor of neurological surgery. “This is an exhilarating proof of principle that with technology that is already within reach, we should be able to build a device that is clinically viable in patients with speech loss.”
The team at UC San Francisco realised that previous attempts to directly decode speech from brain activity might have met with limited success because these brain regions do not directly represent the acoustic properties of speech sounds, but rather the instructions needed to coordinate the movements of the mouth and throat during speech.
“The relationship between the movements of the vocal tract and the speech sounds that are produced is a complicated one,” said Gopala Anumanchipalli, PhD, speech scientist and paper co-author. “We reasoned that if these speech centres in the brain are encoding movements, rather than sounds, we should try to do the same in decoding those signals.”
For their new study, the team contacted five volunteers being treated at the UCSF Epilepsy Center. These were patients with intact speech, who had electrodes temporarily implanted in their brains to map the source of their seizures in preparation for neurosurgery. Each patient was asked to read several hundred sentences aloud, while the researchers recorded activity from a brain region known to be involved in language production.
Based on the audio recordings of participants’ voices, the researchers used linguistic principles to reverse engineer the vocal tract movements needed to produce those sounds: pressing the lips together here, tightening vocal cords there, shifting the tip of the tongue to the roof of the mouth, then relaxing it, and so on.
This detailed mapping of sound to anatomy allowed the scientists to create a realistic "virtual vocal tract" for each participant, which could be controlled by their brain activity. This comprised two neural network machine learning algorithms:
• A decoder, for transforming brain activity patterns produced during speech into movements of the virtual vocal tract
• A synthesiser, to convert these vocal tract movements into a synthetic approximation of the participant's voice.
The synthetic speech produced by these algorithms was significantly better than synthetic speech directly decoded from participants' brain activity without the inclusion of simulations of the speakers' vocal tracts, the researchers found. The algorithms produced sentences that were understandable to hundreds of human listeners in crowdsourced transcription tests conducted on the Amazon Mechanical Turk platform.
As is the case with natural speech, the transcribers were more successful when given shorter lists of words to choose from – an accuracy of almost 70% for synthesised words from lists of 25 alternatives, but only 47% accuracy with a more challenging 50 words to choose from. The researchers are currently experimenting with higher-density electrode arrays and more advanced machine learning algorithms, which they hope will improve the synthesised speech.
“We still have a ways to go to perfectly mimic spoken language,” said Josh Chartier, a bioengineering graduate student in Chang's lab. “Still, the levels of accuracy we produced here would be an amazing improvement in real-time communication, compared to what’s currently available.”
“People who can’t move their arms and legs have learned to control robotic limbs with their brains,” said Chartier. “We are hopeful that one day, people with speech disabilities will learn to speak again using this brain-controlled artificial vocal tract.”