ICASSP 2018 — Redefining signal processing for audio and speech technologies
Canada has become a hotbed for deep learning and artificial intelligence (AI) in recent years, thanks in large part to how enthusiastically the country has embraced the field and to the level of government investment in AI and machine learning research. As a result, some of the world's best machine learning groups are based here, and they have produced many of the field's most prominent experts. The Montreal Institute for Learning Algorithms (MILA) is one of the largest and most successful academic machine learning organizations in the world, with Facebook investing over 7 million US dollars in a new lab there just last year. Further, the so-called "godfathers" of deep learning, Geoffrey Hinton, Yann LeCun and Yoshua Bengio, have all spent significant parts of their careers at Canadian universities: Hinton at the University of Toronto, Bengio at the Université de Montréal, and LeCun as a postdoctoral researcher in Hinton's Toronto lab. So when the organizers of the IEEE ICASSP (International Conference on Acoustics, Speech, and Signal Processing) conference decided to pull out of hosting in South Korea due to security concerns, Canada was an obvious alternative. The city of Calgary (that is CAL-gree, with the stress on the first syllable!), with its modern infrastructure and wonderfully warm welcome, proved a fitting choice.
The signal processing community has undergone dramatic change in recent years. This was previously a discipline built on the careful design of algorithms using expert domain knowledge, with systems created to handle only the edge cases they were explicitly programmed for. Fast forward to 2018, when the field is dominated by machine learning: algorithms are now programmed by example from large volumes of data, using what some consider to be black-box learning methods. It is not surprising, then, that many veterans of the field feel disillusioned. At the same time, it is those experienced researchers who have embraced the transformation of their field who are now having the most significant impact.
Figure 1: Yann LeCun and other leading experts at the panel session, Thursday evening
There were several emerging themes that I observed across the presentation areas. One striking trend was the increase in so-called end-to-end systems. From speech recognition to audio synthesis, there is a clear move towards replacing system components that typically require domain knowledge and careful configuration with components that can be learned entirely from data. One example of this was the Tacotron 2 speech synthesis system presented by Google. They have replaced the text processing part of speech synthesis, which converts input text to a phonetic and contextual representation, with a trainable system that maps directly from written characters to a spectrogram representation. Audio is then generated from this spectrogram using the WaveNet approach they released in 2016.
There were also quite a number of papers utilizing multi-task learning for various purposes. Multi-task learning refers to training a single model, most of whose parameters are shared, to perform several different tasks at once. One presentation demonstrated that generalization of speech recognition accuracy across dialects could be improved by training the speech recognition model to additionally recognize dialect. The dialect recognition output is not actually used by any application, but forcing the model to learn it improves robustness to variation in speaking style and dialect. Multi-task learning can also be helpful for general model regularization as well as for reducing computational load.
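The shared-encoder idea above can be sketched in a few lines of numpy. This is a minimal illustration, not the architecture from any ICASSP paper: the two heads stand in for a main task (e.g. speech recognition) and an auxiliary task (e.g. dialect identification), and all sizes and the 0.3 auxiliary weight are arbitrary choices for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: one shared encoder feeds two task-specific heads.
x = rng.normal(size=(8, 10))          # 8 examples, 10 input features
W_shared = rng.normal(size=(10, 4))   # shared encoder weights
W_task_a = rng.normal(size=(4, 3))    # head A: 3-class "main" task
W_task_b = rng.normal(size=(4, 2))    # head B: 2-class "auxiliary" task

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

h = np.tanh(x @ W_shared)             # shared features used by BOTH heads
p_a = softmax(h @ W_task_a)           # main-task predictions
p_b = softmax(h @ W_task_b)           # auxiliary-task predictions

# Joint objective: the auxiliary loss acts as a regularizer on the
# shared encoder, even if head B's output is never used in deployment.
y_a = rng.integers(0, 3, size=8)      # random labels, for illustration only
y_b = rng.integers(0, 2, size=8)
loss_a = -np.log(p_a[np.arange(8), y_a]).mean()
loss_b = -np.log(p_b[np.arange(8), y_b]).mean()
joint_loss = loss_a + 0.3 * loss_b    # 0.3: auxiliary-task weight (tunable)
```

Training would backpropagate `joint_loss` through both heads into `W_shared`, which is how the auxiliary task shapes the shared representation.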
Attention mechanisms are increasingly being used to help a model focus on the information that is most useful for prediction. There were examples in the area of spoken language understanding where intent recognition was improved using this approach. But beyond improving accuracy, the attention weights can be used to highlight the words or phrases that led to a given intent prediction. As a result, attention-based models can, to a certain extent, "explain" why a particular prediction was made.
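A minimal sketch of this dual use of attention, with random vectors standing in for learned word representations (the utterance, dimensions, and query vector are all hypothetical, not drawn from any presented system):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Hypothetical utterance for an intent classifier.
words = ["please", "book", "a", "flight", "to", "boston"]
rng = np.random.default_rng(1)
H = rng.normal(size=(len(words), 5))   # one 5-dim vector per word
q = rng.normal(size=5)                 # learned query vector for "intent"

scores = H @ q                         # relevance of each word to the query
alpha = softmax(scores)                # attention weights, sum to 1
context = alpha @ H                    # weighted summary fed to the classifier

# The same weights double as an explanation: the largest alpha marks
# the word the model attended to most when forming its prediction.
top_word = words[int(np.argmax(alpha))]
```

The classifier consumes `context`, while `alpha` can be rendered back over the input text as a heat map, which is exactly the "explanation" use described above.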
Researchers are increasingly emphasizing that convolutional neural networks can be more effective than recurrent networks for modeling sequences. For temporal sequences such as audio, historical and contextual information can be captured using temporal convolutional networks or dilated CNNs, which can train faster and achieve better accuracy thanks to their ability to incorporate longer spans of historical context. For streaming, real-time applications, however, it is still not clear how such approaches can yield practical and efficient models.
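The key mechanism is the dilated causal convolution: each layer looks only at past samples, and doubling the dilation at each layer grows the receptive field exponentially while the per-layer cost stays constant. A bare-bones numpy sketch (kernel values and sizes are illustrative, not from any specific paper):

```python
import numpy as np

def causal_dilated_conv(x, kernel, dilation):
    """1-D causal convolution: output at t sees only x[t], x[t-d], x[t-2d], ..."""
    k = len(kernel)
    pad = (k - 1) * dilation                  # left-pad so no future leaks in
    xp = np.concatenate([np.zeros(pad), x])
    return np.array([
        sum(kernel[j] * xp[t + pad - j * dilation] for j in range(k))
        for t in range(len(x))
    ])

x = np.arange(16, dtype=float)                # a toy signal
kernel = np.array([0.5, 0.5])                 # kernel size 2

# Stacking layers with dilations 1, 2, 4 doubles the span of context
# seen at each layer: after three layers the receptive field covers
# 1 + 1 + 2 + 4 = 8 past time steps.
y = x
for d in (1, 2, 4):
    y = causal_dilated_conv(y, kernel, d)
```

Causality is what makes the scheme usable for autoregressive audio models, but as noted above, turning the stacked convolutions into a low-latency streaming implementation remains an open engineering question.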
One final trend worth mentioning is the use of adversarial training. Most applications of generative adversarial networks (GANs) generate synthetic data, such as new images that combine themes and features from multiple source images. But researchers are finding increasingly novel ways to apply adversarial training to improve learned feature representations in various sub-fields, including emotion recognition.
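One generic form this takes in the broader literature is domain-adversarial training with a gradient-reversal trick: a discriminator tries to predict a nuisance attribute (say, recording condition) from the features, while the feature extractor is updated with the reversed gradient so the features become invariant to it. The sketch below is that generic technique in miniature, not the method of any specific ICASSP paper; the linear models and all constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Toy data: the "domain" label d is deliberately correlated with one
# input feature, so a plain discriminator could learn to predict it.
x = rng.normal(size=(32, 6))
d = (x[:, 0] > 0).astype(float)

W_f = rng.normal(size=(6, 4)) * 0.1   # feature extractor (linear, for brevity)
w_d = rng.normal(size=4) * 0.1        # domain discriminator
lr, lam = 0.1, 1.0                    # lam: gradient-reversal strength

for _ in range(50):
    h = x @ W_f                        # shared features
    p = sigmoid(h @ w_d)               # discriminator's domain guess
    err = (p - d) / len(d)             # grad of BCE loss w.r.t. logits
    grad_w_d = h.T @ err
    grad_W_f = x.T @ np.outer(err, w_d)
    w_d -= lr * grad_w_d               # discriminator: get BETTER at domains
    W_f += lr * lam * grad_W_f         # extractor: REVERSED gradient, push
                                       # features towards domain-invariance

final_bce = -np.mean(d * np.log(p + 1e-9) + (1 - d) * np.log(1 - p + 1e-9))
```

If the adversarial game works, the discriminator's loss hovers near chance level because the features carry little domain information, which is the property one wants when, for example, an emotion recognizer must not latch onto recording conditions.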
Figure 2: Conference banquet, with obligatory Smithbilt head attire
At the same time, there are some immense challenges for modern signal processing. Current state-of-the-art systems (e.g., automatic speech recognition or acoustic scene detection) require enormous amounts of training data, but in many cases this volume of labeled data is simply not available. New approaches need to exploit unlabeled data in an unsupervised way, and this is particularly important for developing multilingual speech and language technologies. For many applications where signal processing systems are deployed on devices and wearables, algorithms need to be far less computationally complex. Although improvements in hardware specifically designed for running deep learning models present part of the solution, optimizing the computational and memory utilization of models is still critically important.
Many of the leading researchers highlighted the fact that machine learning and AI can seem somewhat scary to the public. In addition to developing ever-improving learning algorithms, it is important that this is done in an open and ethical way and for constructive, positive purposes. Leaders in the field need to communicate this more clearly to the public, and they also need to be clear about what can and cannot be done given the current state of the art in machine learning. Otherwise, as we descend from the peak of this hype cycle, the public may become increasingly disillusioned with AI, which could lead to another so-called AI winter.
Figure 3: Prof. Julia Hirschberg giving her plenary talk on “Detecting Deceptive Speech: Humans vs. Machines”
During the opening ceremony, I was somewhat shocked to hear that 80% of the conference attendees identified as male. But perhaps more shocking to me was the lack of surprise among the female peers I spoke to about it. There is a huge opportunity within this community to address this severe gender imbalance, and several significant initiatives are now emerging to do so, such as the Workshop for Young Female Researchers in Speech Science & Technology (YFRSW), which will be held at Interspeech 2018 (https://sites.google.com/view/yfrsw2018/home). Professor Julia Hirschberg put it best during her keynote: "Diverse labs are a lot more fun and you get a lot more done".
Where next for signal processing?
The IEEE Signal Processing Society is in a major state of transition, which was evident at this year's ICASSP. The field of signal processing is being redefined by the rise of machine learning, and in particular deep learning. Many applications traditionally built on signal processing now achieve state-of-the-art results with some form of deep learning. At the same time, there is a great opportunity to fold the signal processing techniques developed over the last several decades into the deep learning framework. Convolutions and multi-resolution processing, for instance, are firmly rooted in the signal processing literature, and both are fundamental to many modern deep learning algorithms.
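The connection is direct: what deep learning frameworks call a "convolution" layer is the same filtering operation DSP has used for decades, differing only in that the kernel is not flipped (i.e., it is cross-correlation). A quick numpy check with an illustrative smoothing filter:

```python
import numpy as np

x = np.array([1., 2., 3., 4., 5.])     # a short signal
k = np.array([0.25, 0.5, 0.25])        # a smoothing FIR filter

# Classic DSP convolution (kernel flipped, by definition):
dsp = np.convolve(x, k, mode="valid")

# What deep learning frameworks compute is cross-correlation:
cnn = np.array([np.dot(x[i:i + 3], k) for i in range(len(x) - 2)])

# With a symmetric kernel the two are identical; in general they
# differ only by a flip of the kernel: the same decades-old math.
assert np.allclose(dsp, cnn)
```

In other words, a trained CNN layer is a learned filter bank, which is why so much classical filter design intuition transfers to deep models.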
Looking forward to next year's ICASSP (which will take place in Brighton, UK), I expect researchers will continue folding further parts of their audio and speech processing pipelines into a "trainable" deep learning paradigm. There will likely be papers describing parallel hardware and distributed algorithms for making model training on large volumes of data more efficient. And if Yann LeCun's keynote was anything to go by, there will be increasing attention on learning techniques that need less labeled data and that can observe and learn about the world (through audio, language and images) with little human interaction, as well as an effort to build machines with some degree of common sense.
Written by Dr. John Kane, VP of Signal Processing & Machine Learning at Cogito