Intro to the audio processing world for a data scientist

Royal Jain · Published in DeepAffects · Apr 28, 2018

Coming from an NLP background, I had difficulty understanding the concepts of speech/audio processing, even though a lot of the underlying science and concepts are the same. This blog series is an attempt to make the transition easier for people facing similar difficulties. The first part of the series describes the feature space used by most machine learning/deep learning models.

Feature Space

This is the most confusing aspect, as most data scientists who don't work on images or audio are accustomed to using real-world entities like income, temperature, and area, or one-hot vectors/word embeddings, as input to machine learning models. In contrast, most audio/speech models use some derivative of the spectrogram as input. Herein lies the problem, especially for people unfamiliar with signal processing. We essentially have three ways to deal with this:

  1. Treat the inputs as a black box and use one of many libraries available for calculating the features.
  2. Study in depth the reasons, implementation and advantages of various feature vector calculation techniques.
  3. Get a good intuition about these feature vectors and then go in-depth when the time and need arise.

The problem with approach one is that once you start reading research papers, you'll realize you are missing the understanding of some parts that are important for implementing and learning about these systems. This approach is okay if you are only working in this area for a short while, but it is not feasible if you intend to stick around for long.

Approach two is comprehensive and gives you a greater advantage in understanding and implementing papers, but it requires a lot of effort and time. Also, you might end up spending time and energy on things you don't even need. More knowledge has never harmed anyone, but people working under time constraints, like myself, might want to follow the third approach.

Get a good understanding, start working, and revisit the things you need a deeper understanding of later. Sounds good, right? The only problem is that you need someone to tell you what you need to get started and to give you a good intuition for those things. This blog is an attempt to tackle that problem. Now that the outline is done, let's get started.

Spectrogram

One word that comes up very often in research papers and discussion forums in this field is spectrogram. So what is it? Most of us are used to visualizing a sound wave in the following form:

Waveform

This is called the waveform of an audio signal, and on zooming in, it looks something like this:

Magnified Waveform

This shows how the amplitude/pressure a sound wave creates at a point in space varies with time. However, thanks to Fourier and his friends, most feature vectors are calculated using another way of representing the same audio wave, called a spectrogram. It looks something like the image below. Awesome, ain't it?

Spectrogram

What does it mean?

In this graph, the x-axis represents time and the y-axis represents frequency. The colour of the graph represents how much power is present at that time and frequency: the redder it is, the more power there is at that frequency and time. Simple as that. Need some more examples?

Spectrogram for my voice plus background noise
Spectrogram of my sister's voice. She thinks she's good at singing; she is not

You can see that my sister's spectrogram has a lot of energy concentrated at a higher frequency, around 200–250 Hz (the typical adult female voice is around 170–260 Hz), while in mine you can observe energy at around 120–150 Hz (the typical adult male voice is around 90–180 Hz). The rest is noise and background effects.
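If you'd like to generate a spectrogram of your own recording, here is a minimal sketch using scipy and matplotlib (the file name my_voice.wav is just a placeholder for any mono WAV file you have lying around; it is not something provided with this post):

import numpy as np
import matplotlib.pyplot as plt
from scipy.io import wavfile
from scipy.signal import spectrogram

# Load a mono WAV file ("my_voice.wav" is a placeholder for your own recording)
sample_rate, samples = wavfile.read("my_voice.wav")

# Frequencies in Hz, frame times in seconds, and power for each (frequency, time) bin
frequencies, times, power = spectrogram(samples, fs=sample_rate)

# Plot power on a dB scale so both quiet and loud regions stay visible
plt.pcolormesh(times, frequencies, 10 * np.log10(power + 1e-10))
plt.xlabel("Time (s)")
plt.ylabel("Frequency (Hz)")
plt.colorbar(label="Power (dB)")
plt.show()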

Most feature vectors are derived from the spectrogram; each representation uses different mathematical operations to generate a vector from it. Here we look at one of them.

MFCCs

By far the most common features used are MFCCs (Mel-Frequency Cepstral Coefficients). This guy has an awesome explanation and intuition of MFCCs. I strongly urge you to go through the link above, but if time is short, I'll abridge the explanation, without the maths, here:

MFCCs are intended to model the process by which humans produce and listen to sounds (more on this in the link above). MFCCs are calculated on short frames, usually around 40 ms: if the frame is much shorter, we don't have enough samples to get a reliable estimate of the power associated with each frequency; if it is longer, the signal changes too much throughout the frame. The frequencies are then clubbed together into bins, because human ears cannot distinguish very close frequencies, and we calculate the energy associated with each bin. We then take the logarithm of the energies. This is also motivated by human hearing: we don't perceive loudness on a linear scale. Generally, to double the perceived volume of a sound we need to put about 8 times as much energy into it. The bins overlap, so the energies associated with the bins are correlated. To decorrelate the energies, we perform one final operation (a DCT, in case you are wondering about the operation; forget about it if you are not).
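In practice you rarely implement this pipeline by hand; a library does it in one call. Here is a minimal sketch using librosa (my choice of library, not something prescribed by this post; my_voice.wav is again a placeholder file name):

import librosa

# Load the audio (placeholder file name); librosa resamples to 22050 Hz by default
signal, sample_rate = librosa.load("my_voice.wav")

# 40 ms frames with 50% overlap, matching the description above
frame_length = int(0.040 * sample_rate)
hop_length = frame_length // 2

mfccs = librosa.feature.mfcc(
    y=signal,
    sr=sample_rate,
    n_mfcc=13,              # keep the first 13 coefficients per frame
    n_fft=frame_length,     # frame length in samples
    hop_length=hop_length,  # step between successive frames in samples
)

# Shape is (n_mfcc, n_frames): one 13-dimensional feature vector per frame
print(mfccs.shape)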

Typical Workflow

Most systems in the audio deep learning space follow this workflow (a code sketch follows the list):

  1. Calculate MFCCs (or similar feature vectors) on short frames
  2. Use a sequence of MFCC vectors (typically spanning 1–5 seconds) as input to a sequence-learning model, typically an RNN, though CRFs and HMMs are also in active use.
  3. Average the outputs of step two to get an output over the entire audio file.
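Here is a minimal sketch of that workflow as a Keras model, assuming a simple classification task; the number of classes, the LSTM size, and the 13 MFCCs per frame are illustrative choices of mine, not prescribed by the post:

import tensorflow as tf
from tensorflow.keras import layers

N_MFCC = 13    # size of each per-frame feature vector (step 1)
N_CLASSES = 4  # illustrative number of output classes

model = tf.keras.Sequential([
    tf.keras.Input(shape=(None, N_MFCC)),           # variable-length sequence of MFCC frames
    layers.LSTM(64, return_sequences=True),         # step 2: sequence model over the frames
    layers.Dense(N_CLASSES, activation="softmax"),  # per-frame class probabilities
    layers.GlobalAveragePooling1D(),                # step 3: average per-frame outputs over the clip
])

model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()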

It's okay if you are not very comfortable with the workflow yet; that's the objective of part 2 of this series. Hopefully, this blog has given you a good picture before you start out in the audio processing world. Part 2 will describe window sizes and frame sizes and walk you through how to work with DNNs in the audio space. Part 3 will start with how to read/write/play audio files in Python and end with everything you need to start training your first model.

Good luck and God Bless !!
