What is Machine Listening? (Part 1)
Listening is easy for humans, but still difficult for computers.
One of the basic human senses
When I wake up in the morning while travelling, it is usually the sunlight coming into the bedroom that wakes me. But the very first moment I realise I'm not at home is when I hear unfamiliar birds chirping, along with, perhaps, the slightly different height of my pillow.
Hearing is one of the five basic human senses. We use this amazing ability naturally in our daily lives, but we often forget its importance because we cannot see it. We communicate with other people by talking, and we feel and perceive the world through acoustic information alongside other sensory data.
Speech recognition is probably the most widely adopted sound recognition technology in industry, and it allows people to communicate with computers in a more natural way. It used to be very hard for computers to understand human speech, but performance improved dramatically from around 2010, when modern deep learning techniques appeared.
Conventionally, such systems relied on rule-based methods and hand-crafted features designed by domain experts. With the power of advanced deep learning, however, many of these technologies have improved rapidly, and AI systems are now starting to see things and understand what people say to them.
Computer vision, natural language processing, and speech recognition are all crucial technologies for artificial intelligence. However, something important is missing here: sound. Speech is sound, but there are countless other sounds we hear every day, and machines still do not understand well what is going on around them. Let me show some examples.
It is easy to recognise that this is the sound of rain. When you hear it, you might think of taking an umbrella or closing the window.
It is also obvious that this is the sound of footsteps. If we listen carefully, we can even tell that they are high heels, getting closer, and walking at a normal pace.
The ringing of Big Ben (Elizabeth Tower) at Westminster tells everyone in town the time. Humans naturally listen to the world in everyday life in order to think and to act. The examples above only show information such as the weather, a type of shoe, or the time, yet this is just a tiny fraction of the contextual information that sound carries.
Machine listening is a research area that aims to build systems which can understand non-verbal information from audio. A formal definition from the machine listening research laboratory at Queen Mary, University of London is given below.
“Machine listening” is the use of signal processing and machine learning for making sense of natural / everyday sounds, and recorded music.
Speech is only a small part of acoustic information
The human voice carries linguistic information, but people can also infer various clues from it, such as the speaker's age, gender, emotion, and even health status. Music is another type of audio that contains even more complex information, such as genre, mood, tempo, chords, and pitch.
Still, music and voice make up only a tiny part of what we hear in our daily lives. In fact, we do not even know how many sounds people can distinguish, and there are no clear boundaries between sounds. In the machine listening research community, everything else is usually called environmental sound, and it is divided into two broad groups of topics: acoustic scenes and acoustic events.
An acoustic scene, as the name suggests, concerns location-related information such as a bus, park, library, cafe, or city centre. A scene cannot be recognised from very short audio, so researchers normally assume that at least 10 seconds of audio is required to estimate it. An acoustic event, on the other hand, is the term normally used for shorter sounds that give clues about surrounding events, such as glass breaking, a knock, a car horn, or a dog bark. An event might be as short as 0.1 seconds, but it can also be quite long, like continuously flowing water.
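To make the distinction a little more concrete, here is a rough sketch of how the two tasks differ in their outputs. It is my own illustration rather than anything from a particular system, and the class names, clip IDs, and data structures are assumptions: a scene classifier assigns a single label to a whole clip, while an event detector returns labelled time regions within it.

```python
# Illustrative only: hypothetical data structures for the two task outputs.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class SceneResult:
    clip_id: str
    scene: str                               # one label per clip, e.g. "bus", "park", "library"

@dataclass
class EventResult:
    clip_id: str
    events: List[Tuple[float, float, str]]   # (onset_sec, offset_sec, label)

# The same 10-second street recording could yield, for example:
scene = SceneResult("street_01", "city_centre")
events = EventResult("street_01", [(1.0, 1.4, "car_horn"), (5.2, 9.8, "dog_bark")])
```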
It is probably easier to understand if we compare these topics with their counterparts in computer vision, which are more obviously visible.
Optical character recognition (OCR) in computer vision corresponds to speech recognition in machine listening, as both deal with linguistic information. Facial recognition is a counterpart of music search or speaker identification, because it identifies specific, unique targets. Lastly, object detection is the concept in computer vision closest to acoustic scene/event detection, because it aims to identify a huge number of targets that all take different forms.
2017: A memorable year for machine listening
Although machine listening has been actively researched for decades, it was still quite far from a level that could be widely applied to real-world applications. Only simple recognition of a limited number of sounds was possible, and its performance was unstable, much like old speech recognition systems.
Even after modern neural network algorithms were introduced, the field, unlike others, struggled to outperform conventional approaches; simply adopting the latest deep learning techniques was not enough. But researchers finally made a breakthrough, and deep learning approaches outperformed conventional methods in 2017.
There is an annual challenge and workshop called DCASE (Detection and Classification of Acoustic Scenes and Events), organised by the IEEE. In 2017, the top submitted system for scene classification achieved 92% accuracy, up from only 76% in 2013. This result is particularly meaningful because the top 10 systems in 2016 were all conventional methods. And it was not only scene classification: the top-performing systems in all the other tasks were also replaced by deep learning methods.
I don't think deep learning can simply solve every problem in the world; it is only one part of a system. Still, 2017 is a memorable year because researchers' accumulated effort produced a meaningful result: they found a way to make the most of the latest ML techniques and moved one step closer to a human-like machine listening system.
Required domain knowledge
Simply pushing audio clips into off-the-shelf ML models might work well enough for a simple sound recognition demo, like the sketch below. But this kind of simple recognition works reasonably well even with traditional approaches, and modern machine listening should be distinguished from it, because all it can do is trigger a simple action under highly limited conditions.
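As an aside, here is a minimal sketch of what such an off-the-shelf demo might look like, assuming log-mel spectrogram features (via librosa) fed into a tiny PyTorch CNN. The file names, class labels, and network size are all illustrative assumptions; real systems are considerably more involved.

```python
# A minimal sketch of a simple sound recognition demo (illustrative assumptions only).
import numpy as np
import librosa
import torch
import torch.nn as nn

def log_mel(path, sr=16000, n_mels=64):
    """Load an audio clip and compute a log-scaled mel spectrogram."""
    y, _ = librosa.load(path, sr=sr, mono=True)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(mel, ref=np.max)           # shape: (n_mels, frames)

class TinySoundClassifier(nn.Module):
    """Two small conv blocks with global pooling, then a linear layer over classes."""
    def __init__(self, n_classes=3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),                       # global pooling copes with variable clip lengths
        )
        self.classifier = nn.Linear(32, n_classes)

    def forward(self, x):                                  # x: (batch, 1, n_mels, frames)
        return self.classifier(self.features(x).flatten(1))

# Hypothetical usage with made-up file and class names (rain / footsteps / bell):
# spec = log_mel("rain_clip.wav")
# x = torch.tensor(spec, dtype=torch.float32)[None, None]   # add batch and channel dims
# logits = TinySoundClassifier(n_classes=3)(x)
# predicted = ["rain", "footsteps", "bell"][logits.argmax(dim=1).item()]
```

Note that a model like this would still need labelled training data for the chosen classes; the point is only that the plumbing for a narrow demo is short, whereas general auditory intelligence is not.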
Advanced ML technology has opened up countless opportunities to positively impact our daily lives. Next-generation machine listening should aim for general auditory intelligence that can be used in real-world situations and improved continuously, rather than reinventing the wheel every time. Because real-world environments and human auditory perception are highly complicated, this requires domain knowledge across a range of fields, such as signal processing, cognitive science, music, psychoacoustics, acoustics, and machine learning.
Conclusion
In this article, I have briefly introduced the concept of machine listening. Next time, I will go into more detail about music information retrieval (MIR), which can be considered a part of machine listening, yet is still an extremely large research topic with plenty left to investigate.

