Computers Can Talk for Us Now

Neurotech@Berkeley
Nov 12, 2019 · 9 min read

Speech-generating technology offers many methods of non-verbal communication, from text-to-speech apps to eye-tracking systems, designed to let people with speech or hearing impairments communicate easily with those around them. Today, more intricate prototypes and experiments are surfacing that integrate brain-computer interfaces (BCIs), allowing people who not only cannot speak but also have limited motor control to communicate as well. Translating thoughts has become an increasingly attractive line of study within the neurotechnological world; the idea of talking to someone through your thoughts alone is tantalizing.

Hands-free control is a recurring theme in current technology, from devices like Amazon’s Echo and Google Home to built-in mobile applications like Apple’s Siri. For people with limited mobility or bodily function, this form of control is a necessity rather than a convenience. Those who have amyotrophic lateral sclerosis, or who are recovering from a stroke and are unable to speak, need a way to communicate with the people around them. Some such devices are already in use: machines that track eye, facial, or head movements and translate them into words. Stephen Hawking, for example, controlled a cursor with his cheek muscles, spelling out words from specific combinations of muscle twitches. Over time, the software could even predict and autofill words he commonly used; however, the system was high-maintenance and had to be routinely replaced and updated as his degenerative condition progressed. In addition, forming even single words proved time-consuming and tedious. Researchers are therefore hunting for a faster, more seamless method to help those who want to communicate.

In 2017, neuroscientist Dr. Gerwin Schalk of the Wadsworth Center, working with researchers from UC Berkeley, explored the mechanisms of the temporal lobe by placing 117 electrodes directly onto the surface of the cortex involved in talking and listening. An epileptic patient named Cathy faced a monitor that displayed single words and was told to repeat each word silently in her head. The electrodes recorded her brain activity, and the signals were later amplified, allowing Schalk and his team to study how imagined speech is reflected in neural activity. They found evidence that when we imagine speaking, the auditory cortex is also activated, allowing us to “hear” ourselves in our own heads.

Dr. Schalk also incorporated music into his study; he played the same short segment of a song for about a dozen surgery patients and recorded brain activity while they listened. He was able to reproduce the melody from the data, and although the sound is slightly muffled, the song is still distinguishable.

Analysis of brain activity as subjects listened to the song showed that different groups of neurons in the auditory cortex fire more vigorously in response to specific tones and amplitudes; it’s as if each neuron vibrates with an optimal resonance. When the melody moves away from a neuron’s preferred tone, its firing rate slows. Computers were then trained to use this information to translate the neural firing patterns back into sound.
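To make that intuition concrete, here is a toy sketch, not the Schalk team’s actual analysis, of how a melody could be read back out of a population of frequency-tuned units; every name and number below is illustrative.

```python
import numpy as np

# Toy illustration: each "neuron" has a preferred frequency (Hz) and fires
# most strongly when the melody hits that tone, falling off as the melody
# moves away (a Gaussian tuning curve).
preferred_freqs = np.linspace(200, 2000, 50)            # 50 tuned units
melody = np.array([440, 494, 523, 587, 659, 587, 523])  # a simple tune (Hz)

def firing_rates(tone_hz, preferred, bandwidth=150.0):
    """Population response to one tone: peak at each unit's preferred frequency."""
    return np.exp(-0.5 * ((tone_hz - preferred) / bandwidth) ** 2)

rng = np.random.default_rng(0)
activity = np.stack([firing_rates(f, preferred_freqs) for f in melody])
activity += rng.normal(scale=0.02, size=activity.shape)   # measurement noise

# Crude decoder: at each time step, report the preferred frequency of the
# most active unit. The melody comes back within the spacing of the bank.
decoded = preferred_freqs[np.argmax(activity, axis=1)]
print(np.round(decoded))
```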

Within the past year, more advanced work has integrated real-time human speech with interpretations of auditory cortex activity. In early 2019, neuroengineers from Columbia University employed a vocoder, which analyzes recordings of the human voice and in turn synthesizes speech. The scientists recorded brain activity while people spoke, listened, imagined speaking, and imagined listening. They found that even these imagined conditions display distinct patterns of brain activity related to the actual actions, and that neural activity during actual and imagined hearing is more similar than neural activity during hearing and the physical articulation of words.

Analyzing such brain processes requires devices such as EEG; however, the low, unreliable quality of speech reconstructed from EEG signals is a persistent factor holding thought-based communication technology back. Misinterpretations and inaccurate measurements are still commonplace, and much of the current research in the field focuses on finding better ways to gather accurate, usable data. The Columbia team aimed to create a direct communication pathway to the brain that could decode speech in both actual and imagined conditions.

In their experiment, brain activity was recorded using invasive electrocorticography (ECoG), which requires surgery to place electrode grids directly onto the surface of the brain. Because the electrodes sit so close to the brain tissue, they capture much higher-resolution signals than EEG, whose data is noisy because it has to pass through the skull.

To find this pathway, they used machine learning: training a computer on data and a set of rules so that it builds an analytical model that can make decisions without additional explicit programming. They examined three factors central to machine learning on brainwave data: the regression technique, the regression frequency range, and the speech representation. Regression techniques are the different ways a computer can estimate relationships in data. The simplest and best known is linear regression, which models a straight-line relationship between variables. In contrast, a deep neural network is nonlinear; by stacking many layers of functions between input and output, it can recognize patterns that are not linear. From these learned relationships, a model’s parameters can be chosen by methods such as maximum likelihood estimation, which finds the parameter values that make the observed data most probable.
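For readers who want to see the difference in code, below is a minimal, hypothetical comparison using scikit-learn rather than the Columbia team’s own models: a linear regression and a small feed-forward neural network are both fit to map simulated neural features onto spectrogram-like targets.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split

# Simulated stand-ins: time windows of "neural features" (20 channels) mapped
# to 32 spectrogram-like bins through a mildly nonlinear ground-truth mapping.
rng = np.random.default_rng(1)
X = rng.normal(size=(5000, 20))                     # neural features
W1 = 0.3 * rng.normal(size=(20, 16))
W2 = rng.normal(size=(16, 32))
y = np.tanh(X @ W1) @ W2 + 0.1 * rng.normal(size=(5000, 32))

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Linear regression: assumes a straight-line relationship between inputs and outputs.
linear = LinearRegression().fit(X_tr, y_tr)

# Small "deep" network: stacked nonlinear layers can capture curved relationships.
dnn = MLPRegressor(hidden_layer_sizes=(128, 128), max_iter=1000,
                   random_state=0).fit(X_tr, y_tr)

print("linear R^2:", round(linear.score(X_te, y_te), 3))
print("DNN R^2:   ", round(dnn.score(X_te, y_te), 3))
```

On this toy data the network should score noticeably higher, which mirrors the study’s broader point that a nonlinear model can reconstruct richer structure from neural signals than a purely linear one.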

ECoG data sampled in different frequency bands alters the predictions of any regression model, so specific frequency ranges must be matched to their models. In addition, a speech representation is needed to turn the regression model’s output into sound. One such representation is the auditory spectrogram, a visual representation of the frequency spectrum over time. Another is the vocoder, which is built for speech analysis but can also synthesize speech from those signals. In this experiment, the researchers compared linear regression and deep neural network models, low- and high-frequency ranges, and reconstruction through an auditory spectrogram versus a speech vocoder.
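As a small illustration of the spectrogram side of that comparison, the snippet below computes a plain short-time Fourier spectrogram with SciPy on a synthetic signal; it stands in for the auditory spectrogram model used in the study rather than reproducing it.

```python
import numpy as np
from scipy.signal import spectrogram

fs = 16_000                                  # sampling rate in Hz
t = np.arange(0, 2.0, 1 / fs)
# A toy "speech-like" signal: a tone that sweeps upward over two seconds.
audio = np.sin(2 * np.pi * (200 + 300 * t) * t)

# Frequency-vs-time representation: rows are frequency bins, columns are
# time frames, values are the spectral power in each bin.
freqs, times, power = spectrogram(audio, fs=fs, nperseg=512, noverlap=256)
print(power.shape)   # (257 frequency bins, ~124 time frames)
```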


To identify the optimal experimental conditions, 128 electrodes were placed in five neurosurgical patients, all of whom reported normal hearing. They listened to eight continuous sentences, each repeated six times in random order and spoken by four different speakers, for a total of about thirty minutes. The reconstructions were then evaluated objectively using the ESTOI (extended short-time objective intelligibility) measure, which quantifies distortion in a speech signal by comparing spectrograms of the processed speech against the clean speech. Early in the experiment, the team found that combining low- and high-frequency ranges captured different, complementary information about the auditory stimulus and produced higher ESTOI scores, so a wide range of frequencies was used in all trials. After multiple trials, the results were consistent across all five subjects: the combination of a deep neural network, a wide frequency range, and a speech vocoder gave the highest intelligibility and reconstruction quality.
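For a sense of how such an objective score is computed in practice, the snippet below uses the open-source pystoi package, which implements STOI and its extended variant; the signals here are synthetic stand-ins, not the study’s recordings.

```python
import numpy as np
from pystoi import stoi   # pip install pystoi (open-source STOI / extended STOI)

fs = 10_000                               # sampling rate in Hz
rng = np.random.default_rng(0)
clean = rng.normal(size=3 * fs)           # stand-in for a clean speech signal
reconstructed = clean + 0.5 * rng.normal(size=clean.size)   # "distorted" version

# extended=True selects the extended measure (ESTOI); scores near 1 mean the
# reconstruction's spectro-temporal pattern closely matches the clean signal.
score = stoi(clean, reconstructed, fs, extended=True)
print(round(score, 3))
```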


In addition, the subjects listened to isolated digits, zero through nine, each played only once; afterwards, they repeated the digits, rated the reconstruction quality, and reported the speaker’s gender. Although measured subjectively, the deep neural network paired with the vocoder still scored highest.


These results show how speech reconstruction from neural data improves with deep learning and speech synthesis algorithms, and similar gains are expected for conditions involving imagined speech.

Meanwhile, at the University of California, San Francisco, neurosurgeon Edward Chang is exploring the motor side of this phenomenon. In his clinical trial, he placed electrodes over brain areas associated with the vocal tract (the lips, tongue, jaw, and larynx), the muscles that would be engaged if a person were to speak. The subjects, who could speak and hear, were asked to read simple sentences aloud while the researchers recorded their brain activity.

From this, a decoder can generate a very digitized-sounding version of the same sentence (see video below) using a “two-stage approach.” First, electrodes placed over areas that control movements of the vocal tract let an algorithm learn how the vocal tract moves during normal speech. Then, using this map of predicted movements and machine learning, neural activity is translated into spoken sentences. Decoding activity into motor movements first, and only then into words, introduces less distortion, the same distortion the Columbia University researchers aimed to overcome.
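A rough sketch of that two-stage idea is shown below. It is a simplified, hypothetical pipeline (Chang’s group used recurrent neural networks and real articulatory measurements): one model maps neural activity to vocal-tract kinematics, and a second maps those kinematics to acoustic features.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(2)

# Simulated stand-ins for the three representations in the two-stage approach.
neural = rng.normal(size=(1500, 64))                              # ECoG features
kinematics = np.tanh(neural @ (0.2 * rng.normal(size=(64, 12))))  # vocal-tract movements
acoustics = np.tanh(kinematics @ (0.3 * rng.normal(size=(12, 32))))  # spectral features

# Stage 1: decode articulatory movements (lips, tongue, jaw, larynx) from
# neural activity recorded over speech motor cortex.
stage1 = MLPRegressor(hidden_layer_sizes=(64,), max_iter=500,
                      random_state=0).fit(neural, kinematics)

# Stage 2: map the predicted movements to acoustic features that a
# synthesizer could turn into audible speech.
stage2 = MLPRegressor(hidden_layer_sizes=(64,), max_iter=500,
                      random_state=0).fit(stage1.predict(neural), acoustics)

decoded_acoustics = stage2.predict(stage1.predict(neural[:10]))
print(decoded_acoustics.shape)   # (10 time windows, 32 acoustic features)
```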

The subjects were also asked to silently mouth the sentences, and even without audio, the decoder could still generate speech. About 70 percent of the output was accurate; the machine had difficulty differentiating between similar-sounding phonemes. It could easily decode the “sh” sound in ship, while sounds like “b” and “p” produced less accurate transcriptions. Still, the results were remarkable, since the computer needed only a few minutes of speech to learn from, whereas other approaches require days to weeks. For now, Chang wants to improve the quality of the reconstructed speech before moving on to experiments with patients who are paralyzed and cannot speak.

Recently, Chang and his colleagues devised a different approach to explore this concept — a question-and-answer experimental plan. This incorporated both aspects of listening and speaking in a timed environment.

During the training period, subjects heard each question ten times in random order and read each of the pre-selected answer choices aloud ten times, also in random order. During the test trials, each participant listened to questions and responded with one of the choices; when the same question came up again, subjects were encouraged to pick a different answer.

This experiment used only signals in the high-gamma frequency range so that detection could run in real time while the person listened and spoke. In essence, a speech detection model uses neural activity from the ECoG electrodes to predict whether a question is being heard or an answer is being produced. The corresponding time window of activity is then passed to either a question classifier or an answer classifier, which computes the likelihood of each possible question or answer utterance. The computer can therefore “predict” which answer has the highest probability of being spoken from the ECoG data alone, and prediction accuracy improves when the context linking question and answer is taken into account. The task was evaluated objectively on whether time windows were associated with the correct event, how predicted likelihoods compared with the actual utterances, how the number of detected utterances compared with the actual number, and how much each electrode contributed to distinguishing the events.
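The context integration can be sketched as follows, in a simplified, hypothetical form rather than the published decoder: the answer classifier’s output is weighted by how plausible each answer is given the question the system believes it just detected. The questions, answers, and probabilities below are invented for illustration.

```python
import numpy as np

questions = ["How is your room currently?", "How are you feeling today?"]
answers = ["Bright", "Dark", "Good", "Tired"]

# Hypothetical classifier outputs derived from the neural data.
p_question = np.array([0.8, 0.2])                      # P(question | ECoG while listening)
p_answer_neural = np.array([0.35, 0.30, 0.20, 0.15])   # P(answer | ECoG while speaking)

# Context priors: which answers plausibly follow which question.
p_answer_given_q = np.array([
    [0.5, 0.5, 0.0, 0.0],   # room question -> Bright / Dark
    [0.0, 0.0, 0.5, 0.5],   # feeling question -> Good / Tired
])

# Weight the neural evidence for each answer by its contextual plausibility.
context_prior = p_question @ p_answer_given_q
posterior = p_answer_neural * context_prior
posterior /= posterior.sum()

print(answers[int(np.argmax(posterior))])   # "Bright": context sharpens the choice
```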

The procedure scored highly in all four categories, suggesting that this predictive method will find its way into future BCI applications. It is important to note that the predictions in this experiment rely on a finite set of speech outcomes, so it may be a long time before a machine can “auto-fill” any answer without being given a fixed set of possibilities.


Although the idea of having a machine process private thoughts may seem frightening and intrusive, doctors and researchers have agreed that these upcoming neurotechnological innovations will help people. As Dr. Chang said, “If someone wants to communicate and can’t, I think we have a responsibility as scientists and clinicians to restore that most fundamental human ability.” Translating thoughts as a form of communication will ultimately help millions of people who are unable to verbally communicate, finally giving voices to the voiceless.


This article was written by Jandy Le and edited by Jwalin Joshi.

Jandy studies Molecular Biology at UC Berkeley.

Jwalin studies Applied Mathematics and Computer Science at UC Berkeley.

Contact Neurotech@Berkeley for a list of sources.
