Age prediction of a speaker’s voice

Michael Notter
EPFL Extension School
Feb 18, 2022

How to perform EDA and data modeling on audio data using Python

[Find the code to this article here.]

Most people are familiar with how to run a data science project on image, text or tabular data, but not many have experience with analyzing audio data. In this article, we will learn how to do exactly that: how to prepare, explore and analyze audio data with the help of machine learning. In short, as with all other modalities (e.g. text or images), the trick is to get the data into a machine-interpretable format.

The interesting thing with audio data is that you can treat it as many different modalities:

  • You can extract high-level features and analyze the data like tabular data.
  • You can compute frequency plots and analyze the data like image data.
  • You can use temporal sensitive models and analyze the data like time-series data.
  • You can use speech-to-text models and analyze the data like text data.

In this article we will look at the first three approaches. But first, let’s take a closer look at what audio data actually looks like.

1. The many facets of audio data

While there are multiple Python libraries that allow you to work with audio data, for this example, we will be using librosa. So, let’s load an MP3 file and plot its content.
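A minimal sketch of how this could look, assuming a local file named audio.mp3 (the filename is a placeholder) and a recent librosa version that provides librosa.display.waveshow:

```python
import librosa
import librosa.display
import matplotlib.pyplot as plt

# Load the recording; sr=None keeps the file's original sampling rate
y, sr = librosa.load("audio.mp3", sr=None)

# Plot the waveform, i.e. the amplitude over time
fig, ax = plt.subplots(figsize=(12, 3))
librosa.display.waveshow(y, sr=sr, ax=ax)
ax.set(title="Waveform", xlabel="Time [s]", ylabel="Amplitude")
plt.show()
```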

What you see here is the waveform representation of the spoken sentence: “he just got a new kite for his birthday”.

1.1. Waveform — signal in the time-domain

Earlier we called it time-series data, but now we call it a waveform? Well, it’s both. This becomes clearer when we look at only a small segment of the audio file. The following illustration shows the same thing as above, but this time only 62.5 milliseconds of it.

What you can see is a temporal signal that oscillates around the value 0 with different frequencies and amplitudes. This signal represents the change in air pressure over time, or the physical displacement of a loudspeaker’s membrane (or of the membrane in your ear, for that matter). That’s why this depiction of audio data is also called a waveform.

The frequency is the speed at which this signal oscillates. A low frequency of e.g. 60 Hz could be the sound of a bass guitar, while a bird’s song could reach higher frequencies around 8,000 Hz. Human speech usually lies somewhere in between.

To know how quickly this signal needs to be interpreted, we also need to know the sampling rate at which the data was recorded. In this case, the signal was sampled 16,000 times per second, i.e. at 16 kHz. This means that the 1,000 time points we can see in the previous figure represent 62.5 milliseconds (1000/16000 = 0.0625 s) of audio signal.

1.2. The Fourier Transform — signal in the frequency domain

While the previous visualization can tell us when something happens (e.g. around the 2-second mark there seems to be a lot of signal), it cannot really tell us with what frequency it happens. Because the waveform shows us information about the when, this representation of the signal is also said to be in the time domain.

Using a fast Fourier transform (FFT), we can flip this perspective and get clear information about which frequencies are present, while losing all information about the when. In that case, the signal representation is said to be in the frequency domain.

Let’s see what our spoken sentence from before looks like represented in the frequency domain.
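A minimal way to get there, reusing y and sr from the snippet above, is to compute the FFT with NumPy and plot the magnitude spectrum on a logarithmic frequency axis:

```python
import numpy as np
import matplotlib.pyplot as plt

# FFT of the full signal; keep only the positive (real) frequencies
fft_values = np.abs(np.fft.rfft(y))
fft_freqs = np.fft.rfftfreq(len(y), d=1 / sr)

# Magnitude spectrum on a logarithmic frequency axis
fig, ax = plt.subplots(figsize=(12, 3))
ax.plot(fft_freqs, fft_values)
ax.set(xscale="log", xlabel="Frequency [Hz]", ylabel="Magnitude",
       title="Signal in the frequency domain")
plt.show()
```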

What you can see here is that most of the signal lies somewhere between ~100 and ~1,000 Hz (i.e. between 10² and 10³), and that there seems to be some additional content between 1,000 and 10,000 Hz.

1.3. Spectrogram

Luckily, we don’t always need to decide for either the time or the frequency domain. Using a spectrogram plot, we can profit from both domains while keeping their respective drawbacks to a minimum. There are multiple ways to create such spectrogram plots, but for this article let’s take a look at three in particular.

1.3.1. Short-time Fourier transform (STFT)

Using a slightly adapted version of the fast Fourier transform from before, namely the short-time Fourier transform (STFT), we can create such a spectrogram. The trick applied here is that the FFT is computed for multiple small time windows (hence “short-time Fourier”) in a sliding-window manner.
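A sketch of how such a spectrogram could be computed and plotted with librosa, again reusing y and sr from before (the log-frequency axis is an assumption chosen to match the plots discussed below):

```python
import numpy as np

# Compute the STFT and convert the amplitudes to decibels (relative to the maximum)
D = librosa.stft(y)
S_db = librosa.amplitude_to_db(np.abs(D), ref=np.max)

# Time on the x-axis, frequency on the y-axis, loudness as color
fig, ax = plt.subplots(figsize=(12, 4))
img = librosa.display.specshow(S_db, sr=sr, x_axis="time", y_axis="log", ax=ax)
fig.colorbar(img, ax=ax, format="%+2.0f dB")
ax.set(title="STFT spectrogram")
plt.show()
```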

As in all spectrogram plots, the color represents the amount (loudness/volume) of a given frequency, at a given timepoint. +0dB is the loudest, and -80dB is close to silence. On the horizontal x-axis we can see the time, while on the vertical y-axis we can see the different frequencies.

1.3.2. Mel spectrogram

As an alternative to the STFT, you can also compute the mel spectrogram, which is based on the mel scale. This scale accounts for the way we humans perceive a sound’s pitch. It is constructed so that frequencies that are the same distance apart on the mel scale are also perceived by humans as being equally far apart in pitch.

The mel spectrogram is computed very similarly to the STFT; the main difference is just that the y-axis uses a different scale.
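Computing and plotting it looks almost identical to the STFT sketch above, just with librosa.feature.melspectrogram and a mel-scaled y-axis (128 mel bands is a common default, not a requirement):

```python
# Compute the mel spectrogram and convert the power values to decibels
S_mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
S_mel_db = librosa.power_to_db(S_mel, ref=np.max)

fig, ax = plt.subplots(figsize=(12, 4))
img = librosa.display.specshow(S_mel_db, sr=sr, x_axis="time", y_axis="mel", ax=ax)
fig.colorbar(img, ax=ax, format="%+2.0f dB")
ax.set(title="Mel spectrogram")
plt.show()
```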

The difference to the STFT might not be too obvious at first, but if you take a closer look, you can see that in the STFT plot the frequencies from 0 to 512 Hz take up much more space on the y-axis than in the mel plot.

1.3.3. Mel-frequency cepstral coefficients (MFCCs)

The mel-frequency cepstral coefficients (MFCCs) are an alternative representation of the mel spectrogram from before. The advantage of the MFCCs over the mel spectrogram is the rather small number of features (i.e. unique horizontal lines), usually around 20.

Because the mel spectrogram is closer to the way we humans perceive pitch, and because the MFCCs consist of only a small number of component features, most machine learning practitioners prefer the MFCC way of representing audio data in an ‘image-like’ manner. That isn’t to say that for a given problem an STFT, mel or waveform representation might not work better.

So, let’s go ahead and compute the MFCCs and plot them.
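A minimal sketch, computing 20 coefficients per time frame (the exact number is a typical choice, not a requirement):

```python
# Compute 20 MFCCs per time frame and plot them
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)

fig, ax = plt.subplots(figsize=(12, 4))
img = librosa.display.specshow(mfccs, sr=sr, x_axis="time", ax=ax)
fig.colorbar(img, ax=ax)
ax.set(title="MFCCs", ylabel="Coefficient")
plt.show()
```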

2. Data cleaning

Now that we understand a bit better what audio data looks like, let’s visualize a few more examples. Note: You can download these four examples via these links: Audio 1, Audio 2, Audio 3, Audio 4.

From these four examples, and more importantly, when listening to them, we can gather a few more insights about this audio dataset:

  1. Most recordings have a long silence period at the beginning and the end (see samples 1 and 2). This is something we should take care of with ‘trimming’.
  2. However, in some cases these silence periods are interrupted by a ‘click’, caused by pressing and releasing the recording button (see sample 2).
  3. Some audio recordings don’t have such a silence phase at all, i.e. no flat line at the beginning or end (see samples 3 and 4). When listening to these recordings, we can observe that this is due to a lot of background noise.

To better understand how this is represented in the frequency domain, let’s look at the corresponding STFT spectrograms.

When we listen to the audio recordings we can observe that sample 3 has varying background noise covering multiple frequencies, while the background noise in sample 4 is rather constant. This is also what we see in the figures above. Sample 3 is very noisy throughout, while sample 4 is noisy only on a few frequencies (i.e. the thick horizontal lines). For now we won’t go into detail of how such noise could be removed, as this would be beyond the scope of this article.

So, let’s look at a ‘shortcut’ for removing such noise and trimming the audio samples. While a more manual approach, using custom filtering functions, might be the better way to remove noise from audio data, in our case we will go ahead and use the practical Python package noisereduce.
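A minimal sketch of how this could look, assuming y and sr hold one of the noisy recordings; the call below uses noisereduce’s default settings, so the exact parameters are an assumption, and the output filename is a placeholder:

```python
import noisereduce as nr
import soundfile as sf

# Reduce the background noise with noisereduce's default settings
y_reduced = nr.reduce_noise(y=y, sr=sr)

# Write the result to disk so we can listen to it
sf.write("sample_denoised.wav", y_reduced, sr)
```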

If you listen to the created wav files, you can hear that the noise is almost completely gone. Yes, we also introduced a few more artifacts, but overall, we hope that our noise removal approach did more good than harm.

For the trimming step we can use librosa’s .effects.trim() function. Note that each dataset might need a different top_db parameter for the trimming, so it’s best to try out a few values and see what works well. In our case it is top_db=20.
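For example, reusing the denoised signal from above (as noted, the top_db value would need to be tuned per dataset):

```python
# Trim leading and trailing silence; anything quieter than top_db below the peak counts as silence
y_trimmed, _ = librosa.effects.trim(y_reduced, top_db=20)

print(f"Duration before trimming: {len(y_reduced) / sr:.2f} s")
print(f"Duration after trimming:  {len(y_trimmed) / sr:.2f} s")
```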

Let’s now take another look at the cleaned data.

Much better!

3. Feature extraction

Now that our data is clean, let’s go ahead and look at a few audio-specific features that we could extract.

3.1. Onset detection

Looking at the waveform of a signal, librosa can identify the onset of a new spoken word reasonably well.
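A sketch using librosa’s onset detector on a cleaned, trimmed recording (here assumed to be stored in y_trimmed from the previous step); note that onsets mark sound events in general, which only roughly correspond to words:

```python
# Detect onsets, i.e. the starts of new sound events (roughly: new words), in seconds
onsets = librosa.onset.onset_detect(y=y_trimmed, sr=sr, units="time")
print(f"{len(onsets)} onsets detected at: {onsets.round(2)}")
```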

3.2. Length of an audio recording

Very much related to this is the length of an audio recording. The longer the recording, the more words can be spoken. So let’s compute the length of the recording and the speed at which words are spoken.
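A minimal sketch, reusing the detected onsets from the previous snippet as a rough proxy for the number of spoken words:

```python
# Duration of the trimmed recording in seconds
duration = librosa.get_duration(y=y_trimmed, sr=sr)

# Use the detected onsets as a rough proxy for the number of spoken words
words_per_second = len(onsets) / duration
print(f"Duration: {duration:.2f} s | approx. words per second: {words_per_second:.2f}")
```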

3.3. Tempo

Language is a very melodic signal, and each of us has a unique way and speed of speaking. Therefore, another feature that we could extract is the tempo of our speech, i.e. the number of beats that can be detected in an audio signal.
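One way to estimate this is librosa’s beat tracker; applied to speech rather than music, the resulting BPM value should be treated as a rough rhythm feature rather than a musical tempo:

```python
# Estimate the tempo (in beats per minute) of the speech signal
tempo, _ = librosa.beat.beat_track(y=y_trimmed, sr=sr)
print(f"Estimated tempo: {float(tempo):.1f} BPM")
```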

3.4. Fundamental frequency

The fundamental frequency is the lowest frequency at which a periodic sound oscillates. In music this is also known as the pitch. In the spectrogram plots that we saw before, the fundamental frequency (also called f0) is the lowest bright horizontal strip in the image, while the repetitions of this strip pattern above the fundamental are called harmonics.

To better illustrate what exactly we mean, let’s extract the fundamental frequency and plot it on top of our spectrogram.
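A sketch using librosa’s probabilistic YIN implementation; the frequency search range (C2 to C7) is an assumption that comfortably covers human speech:

```python
# Estimate the fundamental frequency with the probabilistic YIN algorithm
f0, voiced_flag, voiced_prob = librosa.pyin(
    y_trimmed, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr)
times = librosa.times_like(f0, sr=sr)

# Overlay the f0 track on top of the STFT spectrogram
S_db = librosa.amplitude_to_db(np.abs(librosa.stft(y_trimmed)), ref=np.max)
fig, ax = plt.subplots(figsize=(12, 4))
librosa.display.specshow(S_db, sr=sr, x_axis="time", y_axis="log", ax=ax)
ax.plot(times, f0, color="turquoise", linewidth=2, label="f0")
ax.legend()
plt.show()
```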

The turquoise lines that you see around 100 Hz are the fundamental frequencies. So, this seems about right. But how can we now use that for feature engineering? Well, what we could do is compute specific characteristics of this f0.
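For example, a handful of summary statistics over the voiced frames (one simple choice of characteristics, not the only one):

```python
# Summarize f0 with a few scalar features (unvoiced frames are NaN and are ignored)
f0_features = {
    "f0_mean": np.nanmean(f0),
    "f0_median": np.nanmedian(f0),
    "f0_std": np.nanstd(f0),
    "f0_min": np.nanmin(f0),
    "f0_max": np.nanmax(f0),
}
print(f0_features)
```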

Note: There are of course many more audio feature extraction techniques that you could explore. For a nice summary of a few of them, check out musicinformationretrieval.com.

4. Exploratory data analysis (EDA) on audio dataset

Now that we know what audio data looks like and how we can process it, let’s go a step further and conduct a proper EDA on it. To do so, let’s first download a dataset. Note, the dataset we will be using for this article was downloaded from the Common Voice repository on Kaggle. This 14 GB dataset is only a small snapshot of a 70+ GB dataset from Mozilla. But don’t worry, for our example here we will use an even smaller subsample of roughly 9,000 audio files. You can download this dataset here.

So let’s take a closer look at this dataset and some already extracted features.
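For example, loading the table of pre-extracted features with pandas (the filename is a placeholder for wherever you stored the CSV):

```python
import pandas as pd

# Load the table with the pre-extracted features
df = pd.read_csv("audio_features.csv")
print(df.shape)
df.head()
```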

4.1. Investigation of features distribution

4.1.1. Target features

First, let’s look at the class distributions of our potential target classes age and gender.
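A quick way to do this, assuming the CSV contains age and gender columns (and reusing matplotlib from before):

```python
# Class distributions of the two potential target features
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
df["age"].value_counts().plot(kind="bar", ax=axes[0], title="Age")
df["gender"].value_counts().plot(kind="bar", ax=axes[1], title="Gender")
plt.show()
```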


So, independent of which target feature we choose, the class distribution is imbalanced. This is something we need to keep in mind.

4.1.2. Extracted features

As a next step, let’s take a closer look at the value distributions of the extracted features.

Except for words_per_second, most of these feature distributions are right-skewed and could therefore profit from a log transformation. So let's take care of that.
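A minimal sketch; which columns to transform depends on the actual distributions, so the column names below are placeholders for whatever features are present in the CSV (NumPy is reused from before):

```python
# Log-transform the right-skewed features (column names are placeholders)
skewed_cols = ["duration", "tempo", "f0_mean", "f0_std", "f0_max"]
for col in skewed_cols:
    df[col] = np.log1p(df[col])  # log(1 + x) also handles zero values gracefully
```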

Much better, but what is interesting is the fact that the f0 features all seem to have a bimodal distribution. Let's plot the same thing as before, but this time separated by gender.

As suspected, there seems to be a gender effect here! But what we can also see is that some f0 values (here in particular for males) are much lower or higher than they should be. These could potentially be outliers due to bad feature extraction. Let's take a closer look at all data points with the following figure.

Given the small number of features and the fact that we have rather nice-looking distributions with pronounced tails, we could go through each of them and decide on the outlier cutoff threshold feature by feature.

4.2. Feature correlation

As a next step, let’s take a look at the correlation between all features. But before we can do that, let’s go ahead and also encode the non-numerical target features. Note, we could use scikit-learn’s OrdinalEncoder to do that, but that would potentially disrupt the correct order in the age feature. So let's rather perform a manual mapping.
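A sketch of such a manual mapping; the exact category labels are assumptions based on the Common Voice metadata and would need to match the values actually present in the CSV:

```python
# Manual ordinal mapping that preserves the natural order of the age groups
age_mapping = {"teens": 0, "twenties": 1, "thirties": 2,
               "fourties": 3, "fifties": 4, "sixties": 5}
gender_mapping = {"female": 0, "male": 1}

df["age_encoded"] = df["age"].map(age_mapping)
df["gender_encoded"] = df["gender"].map(gender_mapping)
```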

Now we’re good to go to use pandas .corr() function together with seaborn's heatmap() to gain more insight about the feature correlation.
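A minimal sketch of this step:

```python
import seaborn as sns

# Compute pairwise correlations over the numeric columns and plot them as a heatmap
corr = df.corr(numeric_only=True)

plt.figure(figsize=(10, 8))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", center=0, square=True)
plt.title("Feature correlations")
plt.show()
```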

Interesting! What we can see is that our extracted f0 features seem to have a rather strong relationship with the gender target, while age doesn't seem to correlate much with anything.

4.3. Spectrogram features

So far we haven’t looked at the actual audio recordings during our EDA. As we saw before, we have a lot of options (i.e. waveform, or STFT, mel or MFCC spectrograms). For this exploration, let’s go ahead and look at the mel spectrograms.

However, before we can do that we need to consider one thing: the audio samples are all of different lengths, meaning that the spectrograms will also have different lengths. Therefore, to normalize all recordings, let’s cut them to a length of exactly 3 seconds: samples that are too short will be padded, while samples that are too long will be cut.
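A sketch of such a fixed-length mel spectrogram computation (3 seconds, zero-padding at the end; both choices are assumptions that could be tuned):

```python
def fixed_length_mel(y, sr, duration=3.0, n_mels=128):
    """Compute a mel spectrogram over exactly `duration` seconds of audio."""
    target_len = int(sr * duration)
    if len(y) < target_len:
        # Too short: pad with silence at the end
        y = np.pad(y, (0, target_len - len(y)))
    else:
        # Too long: cut everything after `duration` seconds
        y = y[:target_len]
    S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(S, ref=np.max)
```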

Once we have computed all of these spectrograms, we can go ahead and perform some EDA on them too! And because we saw that ‘gender’ seems to have a special relationship to our audio recordings, let’s visualize the average mel spectrogram for both genders separately, as well as their difference.
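A sketch of how these averages could be computed, assuming spectrograms is an array of fixed-length mel spectrograms of shape (n_samples, n_mels, n_frames), one per row of df (e.g. built with fixed_length_mel from above):

```python
is_female = (df["gender"] == "female").to_numpy()
is_male = (df["gender"] == "male").to_numpy()

# Average spectrogram per gender, plus their difference
mean_female = spectrograms[is_female].mean(axis=0)
mean_male = spectrograms[is_male].mean(axis=0)
difference = mean_male - mean_female

fig, axes = plt.subplots(1, 3, figsize=(18, 4))
for ax, img, title in zip(axes, [mean_female, mean_male, difference],
                          ["Female (mean)", "Male (mean)", "Male - Female"]):
    librosa.display.specshow(img, sr=sr, x_axis="time", y_axis="mel", ax=ax)
    ax.set(title=title)
plt.show()
```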

While it is difficult to see in the individual plots, the difference plot reveals that male speakers have on average lower voices than female speakers. This can be seen as more energy in the lower frequencies (the red horizontal region) of the difference plot.

5. Machine learning models

Now, we’re ready for the modeling part. And as such, we have multiple options. With regards to models, we could …

  • train our own classical (i.e. shallow) machine learning models, such as LogisticRegression or SVC.
  • train our own deep learning models, i.e. deep neural network.
  • use a pretrained neural network from TensorFlow Hub for feature extraction and then train a shallow or deep model on these high-level features.

And with regards to data, we could use …

  • the data from the CSV file, combine it with the ‘mel strength’ features from the spectrograms and consider the data as a tabular data set
  • the mel spectrograms alone and consider them as an image data set
  • the high-level features from TensorFlow Hub, combine them with the other tabular data and consider it as a tabular data set as well

There are of course many different approaches and other ways to create the data set for the modeling part. For this article, let’s briefly explore one of them.

Classical (i.e. shallow) machine learning model

Let’s take the data from the descriptive statistics and combine it with a simple LogisticRegression model to see how well we can predict the age of a speaker. For this, let’s use a Pipeline object, so that we can explore the advantages of certain preprocessing routines (e.g. scalers or PCA). Furthermore, let's use GridSearchCV to explore different hyperparameter combinations, as well as to perform cross-validation.
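A sketch of such a pipeline; the feature columns, the train/test split and the exact parameter grid are assumptions (the grid below is simply chosen so that it yields the 96 candidates and 4 folds reported in the output that follows):

```python
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

# Feature columns are placeholders for whatever was extracted above
feature_cols = ["duration", "tempo", "words_per_second", "f0_mean", "f0_std", "f0_max"]
X = df[feature_cols]
y_target = df["age"]

# Withhold a test set for the final evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y_target, test_size=0.2, stratify=y_target, random_state=42)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# 3 scalers x 2 PCA options x 16 values of C = 96 candidates
param_grid = {
    "scaler": [StandardScaler(), MinMaxScaler(), RobustScaler()],
    "pca": [PCA(n_components=0.99), "passthrough"],
    "clf__C": np.logspace(-3, 3, 16),
}

grid = GridSearchCV(pipe, param_grid, cv=4, scoring="accuracy", verbose=1, n_jobs=-1)
grid.fit(X_train, y_train)
```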

Fitting 4 folds for each of 96 candidates, totalling 384 fits

As an addition to the DataFrame output above, we can also plot the performance score as a function of the explored hyperparameters. However, given that we have multiple scalers and PCA approaches, we need to create a separate plot for each separate combination of hyperparameters.

Taking the extra step and visualizing the performance metrics as curves often gives us relevant additional information that we wouldn’t get by just looking at the pandas DataFrame.

In this plot we can see that, overall, the models perform equally well. Some show a quicker ‘drop-off’ when we decrease the value of C, while others show a wider gap between the train and test (here actually validation) scores, especially when we don't use PCA.

Having said all that, let’s just go ahead with the best_estimator_ model and see how well it performs on the withheld test set.
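A minimal sketch, reusing the grid search and the held-out split from above:

```python
# Evaluate the best pipeline found by the grid search on the withheld test set
best_model = grid.best_estimator_
print(f"Test set accuracy: {best_model.score(X_test, y_test):.3f}")
```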

That’s already a very good score. But to better understand how well our classification model performed, let’s also look at the corresponding confusion matrix. To do this, let’s create a short helper function.
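A sketch of such a helper, built on scikit-learn’s confusion_matrix (and reusing seaborn and matplotlib from before); plotting raw counts next to a per-class normalized matrix is an assumption that matches the ‘left/right’ description below:

```python
from sklearn.metrics import confusion_matrix

def plot_confusion_matrices(y_true, y_pred, labels):
    """Plot raw counts (left) and per-true-class normalized values (right)."""
    cm = confusion_matrix(y_true, y_pred, labels=labels)
    cm_norm = cm / cm.sum(axis=1, keepdims=True)

    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", ax=axes[0],
                xticklabels=labels, yticklabels=labels)
    axes[0].set(title="Counts", xlabel="Predicted", ylabel="True")
    sns.heatmap(cm_norm, annot=True, fmt=".2f", cmap="Blues", ax=axes[1],
                xticklabels=labels, yticklabels=labels)
    axes[1].set(title="Normalized per true class", xlabel="Predicted", ylabel="True")
    plt.show()

y_pred = best_model.predict(X_test)
plot_confusion_matrices(y_test, y_pred, labels=sorted(y_target.unique()))
```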

As you can see, while the model detected more twenties samples than any other class (left confusion matrix), overall it was actually better at classifying teens and sixties entries (with accuracies of 59% and 55%, respectively).

Summary

In this article we first saw what audio data looks like, then which different forms it can be transformed into, how it can be cleaned and explored, and finally how it can be used to train machine learning models. We hope you have enjoyed it, and do not hesitate to leave a comment if you have any questions.
