Interspeech 2019 — Machine Learning-enabled Creativity and Innovation In Speech Tech
One thing that was evident at Interspeech 2019, held in Graz, Austria, is that the speech technology scene continues to flourish. Industry attention and investment keep rising rapidly, there were 2,075 registered attendees (more than in any previous year), and colleagues whom I met when we were doing our PhDs have grown into academic and industry leaders.
I really enjoy Interspeech. Compared with so many other conferences, I find there to be so much more engagement in poster and oral sessions and stronger attendee diversity, which seems to greatly contribute to the energetic and interactive atmosphere.
The four keynote speakers covered a wide breadth of topics across the speech sciences. Keiichi Tokuda presented a wonderful overview of the recent history of speech synthesis. Manfred Kaltenbacher gave a detailed, in-depth talk on the physiology and physics of speech production. Mirella Lapata’s keynote showed just how far natural language processing interfaces have come: while not quite at the level of Samantha from the film “Her”, the science fiction is becoming less and less fantastical.
But it was Tanja Schultz’s talk that elicited the strongest reaction from me. Hearing synthesized speech reverberate through the main conference hall, produced using electrodes placed directly on a human cortex, was certainly the eeriest moment of the week. Brain-to-speech synthesis is as troubling as it is intriguing! At the same time, her core message of “Think beyond acoustics” was inspiring for many in attendance.
My Paper Highlights
The number of impressive papers at this year’s Interspeech was at times overwhelming, but I managed to distill my favourites down to a list of ten.
Google continues to fully exploit its machine learning resources: 1) extreme volumes of data, 2) massive computing power and 3) some of the best scientific research talent. The idea of taking speech uttered by someone in one language and having it spoken automatically in another was close to unimaginable a decade ago. But in this paper they show how a sequence-to-sequence model can map spectrograms of speech spoken in Spanish to spectrograms of the same speech in English, which is then synthesized using a neural vocoder. The sheer computing power required to develop such a model is out of reach for almost all research groups, but Google consistently demonstrates that quantum leaps can be made using this approach.
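To make the shape of the problem concrete, here is a minimal numpy sketch of the spectrogram-to-spectrogram mapping. The linear “encoder” and “decoder” and all dimensions are purely illustrative stand-ins for an attention-based sequence-to-sequence network, not Google’s actual model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "encoder-decoder": one linear map from source-language spectrogram
# frames to a bottleneck, another from the bottleneck back to mel frames.
# Real systems use attention-based sequence-to-sequence networks; this
# only illustrates the tensor shapes involved.
N_MELS, BOTTLENECK = 80, 32

W_enc = rng.normal(scale=0.1, size=(N_MELS, BOTTLENECK))
W_dec = rng.normal(scale=0.1, size=(BOTTLENECK, N_MELS))

def translate_spectrogram(src_spec):
    """Map a (frames x mels) source spectrogram to a target spectrogram."""
    hidden = np.tanh(src_spec @ W_enc)   # encode each frame
    return hidden @ W_dec                # decode to target mel frames

src = rng.normal(size=(120, N_MELS))     # 120 frames of "Spanish" speech
tgt = translate_spectrogram(src)
print(tgt.shape)                          # (120, 80): frames for a vocoder
```

The final step in the real pipeline, turning the predicted spectrogram into a waveform, is handled by a separately trained neural vocoder.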
Another paper from Google, but this one far less data-hungry! In fact, the paper proposes an approach for developing automatic speech recognition systems for languages with no audio training data at all. They do this by reusing an existing acoustic model trained on a phonologically similar language and then modifying the pronunciation model. This is of course very early-stage research, but for some language pairs (e.g., among Austronesian languages such as Filipino) the results were very impressive. This line of research aims to provide accurate speech recognition for severely under-resourced languages.
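A toy illustration of the pronunciation-model idea: target-language pronunciations are rewritten onto the phone inventory of the donor acoustic model, substituting a nearby donor phone where there is no exact match. Every word, phone and substitution below is made up for the example; the actual paper’s mapping is far more principled.

```python
# Phones the (hypothetical) donor-language acoustic model can recognize.
donor_phones = {"s", "a", "m", "p", "l", "o", "d"}
# Hand-picked substitutions for target phones missing from the donor set.
nearest = {"t": "d", "u": "o"}

def adapt(pron):
    """Rewrite a pronunciation onto the donor phone inventory."""
    return [p if p in donor_phones else nearest[p] for p in pron]

# A tiny target-language lexicon (invented words for illustration).
lexicon = {"salamat": list("salamat"), "sulat": list("sulat")}
adapted = {word: adapt(pron) for word, pron in lexicon.items()}
print(adapted["salamat"])   # → ['s', 'a', 'l', 'a', 'm', 'a', 'd']
```

The donor acoustic model can then score these adapted pronunciations without ever having heard the target language.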
My final paper highlight from Google concerns their new Parrotron system, which aims at transforming the highly atypical speech of deaf speakers into much clearer, more intelligible speech. They treat this as a sequence-to-sequence problem, using a model that transforms a spectrogram representation of the source speech into a target spectrogram, and then use a neural vocoder (as used in the popular Tacotron 2 system) to generate the resulting audio. The results show a significant improvement in naturalness, as measured by mean opinion score (MOS) listening tests, and dramatically improved intelligibility, with the word error rate from an automatic speech recognizer dropping from around 89% on the source speech to around 33% with their fine-tuned system.
Staying on the topic of synthesis, Amazon researchers continue to push the boundaries of modern speech synthesis. This work covers developments towards universal neural vocoding, which would allow a spectrogram representation of any speech audio to be rendered with very high naturalness. If this can be achieved, future synthesis research can focus almost exclusively on generating such spectral representations, without much further development of the vocoder component. Their approach involves training a WaveRNN-style vocoder on 74 speakers from 17 languages, and their results show only a marginal reduction in naturalness compared with natural speech. Moreover, when the model is trained with sufficient variety in speakers and languages, it achieves extremely high naturalness even for speakers and languages not seen during training.
Switching topics from synthesis, Prof. Mower Provost’s group at the University of Michigan continues to publish mHealth-related research on automatically detecting suicidal ideation from study participants’ natural phone conversations. Their results to date highlight the challenges of mapping directly from acoustic features computed on phone conversations to categories related to suicide. In this paper, they demonstrate an alternative approach: first modeling self-reported emotions, then using these model outputs to separate the target classes. They report considerable improvement using this intermediate, bridging approach.
Microsoft has made consistent gains in both automatic speech recognition and speaker diarization over the past few years. This paper leverages those advances to deliver a system which can effectively separate multiple people speaking in the same room and recognize each person’s speech separately. They do this by having meeting attendees place their electronic devices (phones, tablets and laptops) on the table recording audio, and then applying some clever beamforming techniques to exploit the information from the multiple sources, improving on this challenging diarization problem.
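The core intuition behind beamforming is easy to demonstrate. In this deliberately simplified delay-and-sum sketch, a clean source reaches three microphones with known integer-sample delays plus independent noise; aligning the channels and averaging reinforces the source while the noise partially cancels. A real multi-device meeting system must also estimate the delays and channel gains, which is where the cleverness lies.

```python
import numpy as np

rng = np.random.default_rng(1)

# One periodic source observed at three "microphones" with different
# (known) integer-sample delays and independent additive noise.
n = 1600
clean = np.sin(2 * np.pi * np.arange(n) / 50.0)
delays = [0, 3, 7]
mics = [np.roll(clean, d) + rng.normal(scale=0.5, size=n) for d in delays]

# Delay-and-sum: undo each mic's delay, then average the channels.
aligned = [np.roll(m, -d) for m, d in zip(mics, delays)]
beamformed = np.mean(aligned, axis=0)

# Averaging 3 mics cuts the noise power by roughly a factor of 3.
err_single = np.mean((mics[0] - clean) ** 2)
err_beam = np.mean((beamformed - clean) ** 2)
print(err_beam < err_single)   # True: the beamformed signal is cleaner
```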
Automatic speech recognition (ASR) has in many ways the opposite goal of speaker, emotion and speaking-style recognition: ASR aims to suppress speaker-specific effects in order to focus purely on the lexical content. Many ASR methods handle this disentangling problem effectively; however, the problem of separating speech characteristics tied to an individual’s identity from those of common speaking styles has received considerably less attention. The authors first demonstrate that representations commonly used for speaker identification (i.e. i-vectors and x-vectors) contain salient information for discriminating speaking style and emotion. They then train a modified autoencoder with two encoders (one for speaker identity, the other for speaking style) and a single decoder. Their second set of results shows that the style embedding is an effective feature representation for emotion recognition whereas the speaker-identity embedding is much worse, suggesting that the approach effectively disentangles these two aspects of speech.
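The two-encoder, one-decoder architecture can be sketched in a few lines. The purely linear layers and all dimensions here are illustrative stand-ins, not the paper’s actual model; the point is only how the two embeddings are produced separately and consumed jointly.

```python
import numpy as np

rng = np.random.default_rng(2)

# Two encoders map the same input to a speaker-identity embedding and a
# speaking-style embedding; a single decoder reconstructs the input from
# their concatenation. Dimensions are made up for illustration.
D_IN, D_SPK, D_STYLE = 512, 64, 64

W_spk = rng.normal(scale=0.05, size=(D_IN, D_SPK))
W_style = rng.normal(scale=0.05, size=(D_IN, D_STYLE))
W_dec = rng.normal(scale=0.05, size=(D_SPK + D_STYLE, D_IN))

def forward(x):
    spk = np.tanh(x @ W_spk)        # speaker-identity embedding
    style = np.tanh(x @ W_style)    # speaking-style / emotion embedding
    recon = np.concatenate([spk, style], axis=-1) @ W_dec
    return spk, style, recon

x = rng.normal(size=(8, D_IN))      # batch of 8 utterance-level vectors
spk, style, recon = forward(x)
print(spk.shape, style.shape, recon.shape)  # (8, 64) (8, 64) (8, 512)
```

Training then combines a reconstruction loss with objectives that push identity information into one branch and style into the other; after training, each embedding can be used on its own as a feature representation.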
Breathing often receives little attention from speech researchers. However, understanding and modeling breathing is crucial for many purposes, including segmenting conversational speech, producing natural-sounding speech synthesis and understanding the speaker’s state. In this paper the authors use speech audio paired with synchronized time-series signals from a breathing belt worn by participants. Their modeling approach then predicts the breathing-belt samples directly from spectral features computed from the audio. This predicted breathing signal is far more salient than typical audio features for understanding a variety of breathing-related phenomena.
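The prediction setup reduces to a supervised regression from per-frame features to belt samples. This sketch uses synthetic data (a sinusoidal “breathing cycle” leaking into one feature) and a plain least-squares fit; the paper’s models and data are of course far richer.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic stand-in data: the breathing belt traces a slow sinusoid,
# and one of the per-frame spectral features correlates with it.
n_frames, n_feats = 500, 20
belt = np.sin(2 * np.pi * np.arange(n_frames) / 100.0)

feats = rng.normal(scale=0.2, size=(n_frames, n_feats))
feats[:, 0] += belt                  # breathing leaks into feature 0

# Ordinary least squares: belt ≈ feats @ w
w, *_ = np.linalg.lstsq(feats, belt, rcond=None)
pred = feats @ w

corr = np.corrcoef(pred, belt)[0, 1]
print(corr > 0.9)                    # prediction tracks the belt signal
```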
Even though this paper was not about human speech, it was one of the best papers of the conference. Christian Bergler’s work aims at automatically detecting and recognizing different types of Orca calls from audio recordings. What was most impressive was his thorough distrust of human decisions throughout the approach. He treated this as a fully unsupervised modeling problem, arguing that the existing human categorization of Orca calls may at times be erroneous or inconsistent, and may miss important sub-categories that could reveal much more about Orca communication. He also removed the perception-inspired Mel scaling of the spectral inputs to his ResNet autoencoder, because whale hearing does not work the same way as human hearing. The result was a technique for training embeddings that are highly effective at separating Orca calls, together with a clustering approach for finding discrete categories among those embeddings. This approach is likely to be effective for other audio processing tasks.
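The final, label-free step can be illustrated with a tiny k-means on synthetic embeddings. The two well-separated blobs below stand in for autoencoder embeddings of two call types; everything about the data and the initialization is invented for the example.

```python
import numpy as np

rng = np.random.default_rng(4)

# Two synthetic "call type" blobs standing in for learned embeddings.
a = rng.normal(loc=0.0, scale=0.3, size=(50, 2))
b = rng.normal(loc=3.0, scale=0.3, size=(50, 2))
emb = np.vstack([a, b])

def kmeans(x, k, iters=20):
    """Minimal Lloyd's algorithm with a simple deterministic init."""
    idx = np.linspace(0, len(x) - 1, k).astype(int)
    centers = x[idx].copy()
    for _ in range(iters):
        dists = np.linalg.norm(x[:, None] - centers[None], axis=-1)
        labels = dists.argmin(axis=1)
        centers = np.array([x[labels == j].mean(axis=0) for j in range(k)])
    return labels, centers

labels, centers = kmeans(emb, k=2)
# Each blob should land entirely in one discovered category.
print(len(set(labels[:50])) == 1 and len(set(labels[50:])) == 1)
```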
Given the advent of Google’s Tacotron 2 synthesizer and the recent advances in the quality and naturalness of speech synthesis, it is surprising that synthesizing conversational speech has received so little attention. One exception is the research coming out of KTH, where Éva, Gustav and team are training a Tacotron-style synthesizer on publicly available speech recordings, including conversational speech. Unlike in text and read speech, sentences do not really exist in conversational speech. So rather than arbitrarily segmenting the audio into “sentences”, they use detected breathing to group and segment it. They also demonstrate the importance of including filled pauses (e.g., “um”, “ah”) in order to produce more natural and authentic conversational speech. The KTH team also won the best show-and-tell award for their demonstration of this system.
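The breath-based segmentation step is conceptually simple: breath events become the cut points, and the stretches between them become the synthesizer’s training units. The timestamps below are hypothetical detector outputs, used only to show the bookkeeping.

```python
# Hypothetical breath-event timestamps (seconds) from a breath detector,
# plus the end time of the recording.
breath_times = [0.0, 2.3, 5.1, 9.8]
end = 12.0

# Each training segment spans from one breath event to the next.
bounds = breath_times + [end]
segments = list(zip(bounds[:-1], bounds[1:]))
print(segments)   # → [(0.0, 2.3), (2.3, 5.1), (5.1, 9.8), (9.8, 12.0)]
```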
Till Next Year
Well that’s it for another year. The Interspeech bar continues to be raised, so the organizers of Interspeech 2020 in Shanghai will have their work cut out for them. I very much hope that the speech community’s research efforts continue to target problems which were previously gated by the underlying modeling technology. Problems like synthesizing natural conversational speech, recognizing mood disorders from voice and direct speech-to-speech translation are now achievable if we creatively leverage modern signal processing and machine learning functionality.
- A simple technique for fairer speech emotion recognition — Interspeech 2019 pre-conference blog post
- Towards trustworthy signal processing and machine learning — ICASSP 2019 conference summary
- Redefining signal processing for audio and speech technologies — ICASSP 2018 conference summary
- Robots, Deep Neural Networks and the Future of Speech — Interspeech 2017 conference summary
- Gender de-biasing in speech emotion recognition — Interspeech 2019 Cogito paper
- Attention-based Sequence Classification for Affect Detection — Interspeech 2018 Cogito paper