ICASSP 2020 — A fully virtual meeting of the signal processing community

Cogito Corporation
7 min read · May 11, 2020


Consistent with the 2019 meeting in Brighton, this year’s ICASSP 2020 was also due to be held by the sea. But, because of the COVID-19 pandemic, the event had to be run as a completely remote conference. The organisers should be seriously commended for transitioning to this very different format in such a short space of time. Having ICASSP as a virtual conference had its pros and cons. Of course, it was hugely disappointing not to be able to meet and discuss ideas with the scientific community in Barcelona, and it was always going to be very challenging to match the level of interaction you get at an in-person conference. At the same time, there were some surprising benefits of the virtual platform, like being able to access video presentations on demand; for me at least, this was a very efficient way of digesting relevant scientific content. More striking, however, was the dramatic increase in the number of attendees (almost 15,000!), and it was great to see much higher attendance from researchers in countries typically underrepresented at this conference.

Figure: Prof. Ahmed H. Tewfik’s address during the opening ceremony

This year there was a very engaging set of plenary talks, including Georgios Giannakis’s tributes to the “ensemble” of signal processing giants (Gauss, Fourier, Wiener and Kalman). Professor Yonina Eldar gave a wonderful overview of recent advances in compressed sensing and described new methods for efficiently transmitting extremely compressed data and then recovering full resolution, when required, using neural network models. Yoshua Bengio gave an inspiring keynote on deep representation learning, with particular emphasis on representing knowledge as small reusable pieces, using methods like recurrent independent mechanisms (RIMs) which support this kind of modularisation. On the final day, Prof. Mari Ostendorf presented on a topic very close to my heart: speech and language processing for conversational systems. In her talk she contended that the reason we speak so monotonically to virtual assistants stems from a lack of certainty that the system will be able to understand us. However, with continuing improvements in automatic speech recognition, this apprehension will diminish, leading to more natural conversations with these systems. This in turn presents new challenges, such as handling disfluencies and the messy, non-linear nature of natural conversations.

Figure: Plenary talk from Prof. Yonina Eldar on future directions of compressed sensing and efficient data transmission.

Themes

There were several themes that I noticed over the course of the conference. One in particular was the focus on feature learning and on attempts to disentangle and remove factors that you would like the representation to be invariant to. You could see this in papers describing “speaker-invariant” representations (for instance, for classifying emotion) and in work aiming to generalise better across domains, contexts and corpora.

Another theme I noticed related closely to Yoshua Bengio’s keynote — and that was the focus on reusability. Of course, word embeddings have been hugely successful in the natural language processing (NLP) domain in terms of enabling reusability of representations. The audio processing area, however, is now catching up on this and there were many presentations demonstrating the benefits of developing and reusing audio embeddings. This was shown to be really effective when the particular target problem suffers from a lack of data. Prof. Ostendorf also highlighted the benefits of developing word-aligned prosodic embeddings which can be useful for multiple different conversational processing tasks.
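
To make the reuse pattern concrete, here is a minimal sketch (in PyTorch) of the typical recipe: freeze a pretrained audio embedding model and train only a small task-specific head on top of it. The `pretrained_embedding_model` below is a stand-in placeholder rather than any particular published model, and all sizes are illustrative.

```python
import torch
import torch.nn as nn

# Hypothetical placeholder for a pretrained audio embedding network;
# in practice this would be loaded from a checkpoint or a model hub.
pretrained_embedding_model = nn.Sequential(
    nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 512)
)

# Freeze the embedding model so only the small task head is trained.
for p in pretrained_embedding_model.parameters():
    p.requires_grad = False

# A lightweight classifier head for the low-resource target task.
num_classes = 4
head = nn.Linear(512, num_classes)
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Toy batch: 8 examples of 128-dim audio features (stand-ins for real frames).
features = torch.randn(8, 128)
labels = torch.randint(0, num_classes, (8,))

optimizer.zero_grad()
embeddings = pretrained_embedding_model(features)   # reused representation
logits = head(embeddings)
loss = loss_fn(logits, labels)
loss.backward()
optimizer.step()
```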

Figure: Plenary talk from Prof. Mari Ostendorf on speech and language processing for conversational systems

A third theme I would like to highlight is the convergence towards certain types of neural network architectures that are proving effective across many different modeling problems. I saw Convolutional Recurrent Neural Networks (CRNNs), augmented with specific components for domain adaptation and invariance, achieve state-of-the-art results in areas from sound event detection to emotion recognition.
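
For readers less familiar with the architecture, below is a minimal, illustrative CRNN sketch in PyTorch: a convolutional front end over mel spectrogram frames, followed by a bidirectional GRU and a classifier. Layer sizes are my own assumptions; the papers at the conference add domain adaptation and invariance components on top of a backbone roughly like this.

```python
import torch
import torch.nn as nn

class CRNN(nn.Module):
    """A minimal convolutional-recurrent network for audio classification."""

    def __init__(self, n_mels=64, n_classes=10):
        super().__init__()
        # Convolutional front end over (batch, 1, time, mel) spectrograms.
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((1, 4)),            # pool the frequency axis only
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((1, 4)),
        )
        # Recurrent layer models the temporal structure of the conv features.
        self.rnn = nn.GRU(64 * (n_mels // 16), 128, batch_first=True,
                          bidirectional=True)
        self.classifier = nn.Linear(2 * 128, n_classes)

    def forward(self, spec):                  # spec: (batch, 1, time, mel)
        x = self.conv(spec)                   # (batch, 64, time, mel // 16)
        x = x.permute(0, 2, 1, 3).flatten(2)  # (batch, time, 64 * mel // 16)
        x, _ = self.rnn(x)
        return self.classifier(x.mean(dim=1))  # utterance-level logits

model = CRNN()
logits = model(torch.randn(2, 1, 100, 64))     # two 100-frame mel spectrograms
```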

Paper highlights

In addition to a number of engaging discussions, there were a handful of papers that I found particularly noteworthy. Of those presented, I selected five this year that impressed me most. An important caveat with this selection is that my main areas of interest are sound event detection, emotion recognition, natural language processing, and general speech processing and machine learning. Here is my list, in no particular order:

Paper #1 — Coincidence, categorization, and consolidation: learning to recognize sounds with minimal supervision

This paper from Google Research broke new ground in the area of sound event classification, in particular for massively underrepresented classes. Their approach combined three main concepts, trained within a single end-to-end model. The first concept, “coincidence”, is motivated by the observation that semantic classes change relatively slowly over time. As a result, one part of their network predicts whether pairs of audio frames (or audio and video frames) occur closely in time or not; this is essentially an unsupervised method for learning embeddings. Next they focus on “categorization”, taking these learnt embeddings and applying category discovery to find clusters with some semantic coherence. This is basically like having a K-means clustering module, but included as one part of the end-to-end network. The final concept is “consolidation”, which involves selecting an example from each cluster, having a human annotate that sample, and then propagating the human-derived label to all the samples in that cluster.
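
As a rough illustration of the “coincidence” idea (my own reconstruction, not the authors’ code), the sketch below trains an encoder and a binary head to predict whether two frames of a recording were sampled close together in time; the feature dimensions and sampling scheme are arbitrary choices.

```python
import torch
import torch.nn as nn

# Illustrative coincidence pretext task: learn embeddings by predicting
# whether two frames come from nearby points in time.
encoder = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 128))
coincidence_head = nn.Linear(2 * 128, 1)    # binary "close in time?" score
opt = torch.optim.Adam(list(encoder.parameters()) +
                       list(coincidence_head.parameters()), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

def coincidence_step(frames):
    """frames: (time, feat) features from one recording."""
    t = frames.shape[0]
    anchor = torch.randint(0, t - 1, (16,))
    near = (anchor + torch.randint(1, 3, (16,))).clamp(max=t - 1)  # positives
    far = torch.randint(0, t, (16,))   # random frames as (noisy) negatives
    pos = torch.cat([encoder(frames[anchor]), encoder(frames[near])], dim=1)
    neg = torch.cat([encoder(frames[anchor]), encoder(frames[far])], dim=1)
    logits = coincidence_head(torch.cat([pos, neg], dim=0)).squeeze(1)
    labels = torch.cat([torch.ones(16), torch.zeros(16)])
    loss = bce(logits, labels)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

coincidence_step(torch.randn(500, 64))   # toy 500-frame recording
```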

Paper #2 — Pitch estimation via self-supervision

Estimation of fundamental frequency (often referred to as “pitch estimation” or “pitch tracking”) has made progress in fits and starts over the last few decades. David Talkin and Paul Boersma both developed pitch trackers in the early 1990s which, despite lots of subsequent research attention, remained the “go to” algorithms for many years afterwards. Some recent progress in this area has come from treating pitch tracking as a supervised machine learning problem and developing models using modern neural network approaches. I see this paper, however, as a major step forward: it produces state-of-the-art pitch tracking accuracy with a self-supervised model that needs only a tiny amount of labeled data to calibrate its outputs to absolute fundamental frequency values. The icing on the cake was a web application, powered by this new pitch tracker, which lets you practice your Freddie Mercury singing!
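
To give a flavour of the self-supervision trick (as I understood it, and heavily simplified), the sketch below transposes a CQT-like frame by a known number of bins and trains a network so that the difference between its outputs on the shifted and original frames matches the known shift. The network, frame size and shift range are all illustrative assumptions, not the paper’s design.

```python
import torch
import torch.nn as nn

# Illustrative relative-pitch self-supervision: shift a constant-Q-like frame
# up or down by a known number of bins and train a model whose output
# difference matches the known shift. My own reconstruction of the idea.
pitch_net = nn.Sequential(nn.Linear(190, 256), nn.ReLU(),
                          nn.Linear(256, 1))            # scalar "pitch" output
opt = torch.optim.Adam(pitch_net.parameters(), lr=1e-3)

def self_supervised_step(frame):
    """frame: (190,) magnitudes of a CQT-like frame (toy stand-in)."""
    shift = int(torch.randint(-10, 11, (1,)))   # known relative shift in bins
    shifted = torch.roll(frame, shifts=shift)   # crude pitch transposition
    pred_diff = pitch_net(shifted) - pitch_net(frame)  # predicted relative pitch
    target = torch.tensor(float(shift))
    loss = (pred_diff.squeeze() - target) ** 2  # match the known shift
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

self_supervised_step(torch.rand(190))
```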

Paper #3 — Generating empathetic responses by looking ahead the user’s sentiment

As conversational dialogue systems become increasingly effective at actually solving users’ problems, attention is shifting to ensuring that the user also leaves the interaction with a positive feeling. The researchers here propose a reinforcement learning framework for generating responses to users in a dialogue system. The key idea is to look ahead to the user’s sentiment after the computer-generated response, and to use this emotion-related information to incentivise the generation of responses that result in improved user sentiment. This is fairly early research with a lot of open questions; however, it is certainly a novel and promising idea.
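
As a toy illustration of the look-ahead signal (the paper itself trains the generator with reinforcement learning; this sketch only reranks candidate responses), the helper `predict_next_user_sentiment` below is a hypothetical placeholder for a learned forward model of the user’s next-turn sentiment.

```python
# Toy illustration of "sentiment look-ahead": score candidate responses by the
# sentiment the user is predicted to express afterwards, and prefer responses
# expected to leave the user feeling more positive.

def predict_next_user_sentiment(dialogue_history, candidate_response):
    """Hypothetical forward model: estimate how positive the user's *next*
    turn would be if we replied with candidate_response (score in [-1, 1])."""
    # Placeholder heuristic for illustration only.
    return 0.5 if "sorry" in candidate_response.lower() else 0.0

def choose_response(dialogue_history, candidates):
    # Use the predicted follow-up sentiment as the reward signal.
    return max(candidates,
               key=lambda c: predict_next_user_sentiment(dialogue_history, c))

history = ["User: my flight got cancelled and nobody is helping me"]
candidates = ["Please hold.", "I'm sorry about that, let me fix this now."]
print(choose_response(history, candidates))
```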

Paper #4 — Fast intent classification for spoken language understanding

Relatively few academic papers focus on the computational efficiency of speech and language processing algorithms, even though efficiency is a crucial research problem when deploying commercial systems. This paper from Amazon and the University of Massachusetts focuses on delivering efficient classification of spoken intent using a model architecture with multiple “exit” points. Their approach, which builds on the previously proposed BranchyNet (see original paper), attaches classifiers to intermediate layers of the network: if the confidence at an exit point exceeds a certain threshold, inference stops there and the computational cost of the higher layers is avoided; otherwise the example continues through the deeper layers.
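
A minimal sketch of this kind of early-exit inference is shown below, assuming a simple two-block network with one intermediate classifier; the layer sizes, the softmax-confidence rule and the 0.9 threshold are my own illustrative choices rather than the paper’s exact design.

```python
import torch
import torch.nn as nn

class EarlyExitClassifier(nn.Module):
    """Minimal early-exit sketch in the spirit of BranchyNet-style models:
    an intermediate classifier lets confident examples skip later layers."""

    def __init__(self, dim=128, n_intents=20, threshold=0.9):
        super().__init__()
        self.block1 = nn.Sequential(nn.Linear(dim, dim), nn.ReLU())
        self.block2 = nn.Sequential(nn.Linear(dim, dim), nn.ReLU())
        self.exit1 = nn.Linear(dim, n_intents)     # cheap intermediate exit
        self.exit2 = nn.Linear(dim, n_intents)     # full-depth exit
        self.threshold = threshold

    def forward(self, x):                          # one utterance at a time
        h = self.block1(x)
        probs1 = torch.softmax(self.exit1(h), dim=-1)
        if probs1.max() >= self.threshold:         # confident: stop early and
            return probs1                          # skip the deeper block
        h = self.block2(h)                         # otherwise keep computing
        return torch.softmax(self.exit2(h), dim=-1)

model = EarlyExitClassifier()
intent_probs = model(torch.randn(1, 128))
```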

Paper #5 — Speaker-invariant affective representation learning via adversarial training

My last paper highlight comes from the field of emotion recognition. The paper addresses the problem that emotion recognition models typically display a high degree of speaker variability. The proposed solution is a CRNN multi-task model, where one task is the primary emotion recognition problem and the other is speaker identification. The researchers use an adversarial training approach, combined with “gradient reversal” on the speaker identification branch, to prevent speaker-specific information from being encoded in the learnt representation.
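
Below is a minimal sketch of the gradient reversal pattern. A plain GRU encoder stands in for the paper’s CRNN front end, and the layer sizes and numbers of classes and speakers are illustrative assumptions; the point is simply that the speaker head is trained normally while the reversed gradient pushes the shared encoder towards speaker-invariant features.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass, negated gradient in the backward pass,
    so the shared encoder is pushed to *remove* speaker information."""
    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -grad_output

# Illustrative multi-task setup: shared encoder, emotion head, and an
# adversarial speaker head behind gradient reversal (sizes are assumptions).
encoder = nn.GRU(40, 128, batch_first=True)
emotion_head = nn.Linear(128, 4)        # e.g. four emotion classes
speaker_head = nn.Linear(128, 100)      # e.g. 100 training speakers

features = torch.randn(8, 200, 40)      # batch of 200-frame feature sequences
_, h = encoder(features)
shared = h[-1]                          # utterance-level representation

emotion_loss = nn.functional.cross_entropy(
    emotion_head(shared), torch.randint(0, 4, (8,)))
speaker_loss = nn.functional.cross_entropy(
    speaker_head(GradReverse.apply(shared)), torch.randint(0, 100, (8,)))

# Minimising the sum trains the speaker head normally, but the reversed
# gradient drives the encoder towards speaker-invariant emotion features.
(emotion_loss + speaker_loss).backward()
```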

Till next year

So that is it for ICASSP 2020. I am very glad that the conference was able to proceed despite the huge challenges of the ongoing global pandemic. Although the virtual conference was engaging and informative, I do look forward to an interactive, in-person meeting with this scientific community next year in Toronto.

Related articles

Machine Learning-enabled Creativity and Innovation In Speech Tech — Interspeech 2019 conference summary

A simple technique for fairer speech emotion recognition — Interspeech 2019 pre-conference blog post

Towards trustworthy signal processing and machine learning — ICASSP 2019 conference summary

Redefining signal processing for audio and speech technologies — ICASSP 2018 conference summary

Robots, Deep Neural Networks and the Future of Speech — Interspeech 2017 conference summary

Gender de-biasing in speech emotion recognition — Interspeech 2019 Cogito paper

Attention-based Sequence Classification for Affect Detection — Interspeech 2018 Cogito paper
