Gender classification from raw audio with 1D convolutions

Oscar Knagg
Sep 1, 2018 · 7 min read


Deep learning has shaken up computer science by making a previously intractable class of perceptual recognition problems solvable. The initial spark of this revolution, as well as many of its most touted achievements, came from using convolutional neural networks to classify images. Yet in my opinion disproportionately little work has been done on the other main class of perceptual data: audio. I set out to even the balance with a deep learning project to classify someone's gender from a sample of raw audio containing their speech.

The video below is a demonstration of my final model applied to an interview between Sir Elton John and Kirsty Wark.

There is already a wealth of previous work on voice recognition problems. Previous research shows that the fundamental frequency of a male voice is typically between 85 and 155 Hz, compared to 165 to 255 Hz for a female voice [1]. Hence applying deep learning is overkill in some sense, as this is already a solved problem. However, deep learning has some benefits, such as being able to operate on raw audio with little or no preprocessing. I also took this opportunity to write a whole project in PyTorch in order to get a hands-on comparison with Keras, the deep learning library I am much more familiar with.

You might wonder where I would acquire the large amount of labelled training data required to train a deep learning model. The answer is simple: the OpenSLR dataset. This is roughly 100 hours of publicly available, annotated audiobook data typically used for speech-to-text problems. Luckily for me it also contains a SPEAKERS.txt file listing the name and gender of every speaker in the dataset.
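
Mapping speaker IDs to genders is then straightforward. The sketch below is a hypothetical helper of my own, assuming the pipe-delimited format visible in the diff later in this post and that comment lines in SPEAKERS.txt start with a semicolon.

# Hypothetical helper: build a speaker-id -> gender lookup from SPEAKERS.txt.
def load_speaker_genders(path='SPEAKERS.txt'):
    genders = {}
    with open(path) as f:
        for line in f:
            if line.startswith(';') or not line.strip():
                continue  # skip comment and blank lines
            speaker_id, sex, *_ = [field.strip() for field in line.split('|')]
            genders[int(speaker_id)] = sex  # 'M' or 'F'
    return genders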

Architecture

Audio data is sequential data. The vibration of a human's vocal cords produces a continuous pressure wave, and by sampling at discrete intervals we produce a sequence of real numbers (or two sequences in the case of a stereo recording). Knowing this, it would seem appropriate to use an architecture suited to sequential data. Recurrent networks would be one choice, but as a single second of audio from the OpenSLR dataset is a sequence of length 16,000, recurrent networks become infeasible due to long training times, long inference times and the difficulty of backpropagating errors across such long sequences.

Instead I opted to use a model based on 1D convolutions, the direct equivalent of the 2D convolutions that revolutionised image recognition. Initially I tried an architecture of stacked dilated convolutions inspired by WaveNet. Although this did produce good results, it was very slow to train, and a much simpler architecture based on alternating layers of 1D convolutions and max pooling produced equally good results in a fraction of the time. This fits with the knowledge that the underlying problem the model is solving is quite simple: we're just looking for a high or low fundamental frequency.

Model architecture

Explanation of the final architecture (a rough PyTorch sketch follows this list):

  • Each successive layer of convolutions and pooling extracts relevant frequencies and patterns at a successively higher level of abstraction, analogous to a 2D image classifier.
  • At the final layer we apply a global max pooling over the time dimension to find the highest value of each filter. My intuition is that this will help the model ignore irrelevant time periods, such as silence or noise, and focus on information from relevant periods.
  • Combine these features via a fully connected layer to form the prediction.
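
The post doesn't give exact layer sizes, so the snippet below is only a minimal sketch of this kind of architecture in PyTorch: the filter counts and kernel sizes are my own guesses, but the overall shape (alternating 1D convolutions and max pooling, a global max pool over time, then a fully connected layer) follows the description above.

import torch
import torch.nn as nn

class GenderClassifier(nn.Module):
    """Illustrative Conv1D classifier: layer sizes are guesses, not the
    author's exact architecture."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            # Input shape: (batch, 1, num_samples) of raw waveform
            nn.Conv1d(1, 32, kernel_size=9, padding=4), nn.ReLU(),
            nn.MaxPool1d(4),
            nn.Conv1d(32, 64, kernel_size=9, padding=4), nn.ReLU(),
            nn.MaxPool1d(4),
            nn.Conv1d(64, 128, kernel_size=9, padding=4), nn.ReLU(),
            nn.MaxPool1d(4),
        )
        self.fc = nn.Linear(128, 1)

    def forward(self, x):
        x = self.features(x)
        # Global max pool over the time dimension: keep each filter's peak response
        x, _ = x.max(dim=-1)
        return torch.sigmoid(self.fc(x))  # P(speaker is female)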

Training

During initial experiments I noticed the validation accuracy topped out lower than I expected. I investigated and produced a breakdown of average error on a per-speaker basis, and it turned out that a handful of speakers were misclassified in almost all of their samples. These speakers included the dubious Vincent Tapia (female) and Kathy Caver (male). After listening to a few samples I was certain that the genders of these speakers were mislabelled in the OpenSLR dataset, so I amended my SPEAKERS.txt file.

~$ diff SPEAKERS.txt SPEAKERS_original.txt
737c737
< 2053 | F | train-clean-360 | 25.03 | Vincent Tapia
---
> 2053 | M | train-clean-360 | 25.03 | Vincent Tapia
746c746
< 2078 | M | dev-clean | 8.03 | Kathy Caver
---
> 2078 | F | dev-clean | 8.03 | Kathy Caver
1220c1220
< 3989 | M | train-clean-360 | 25.06 | Rayne
---
> 3989 | F | train-clean-360 | 25.06 | Rayne
1498c1498
< 5123 | M | train-clean-360 | 25.11 | wvthcomp
---
> 5123 | F | train-clean-360 | 25.11 | wvthcomp
1905c1905
< 6686 | M | train-clean-360 | 20.63 | Elin
---
> 6686 | F | train-clean-360 | 20.63 | Elin
2011c2011
< 7085 | M | train-clean-360 | 25.19 | voicebynatalie
---
> 7085 | F | train-clean-360 | 25.19 | voicebynatalie
2020c2020
< 7117 | F | train-clean-360 | 25.17 | Art Leung
---
> 7117 | M | train-clean-360 | 25.17 | Art Leung
2269c2269
< 7976 | M | dev-clean | 8.13 | JenniferRutters
---
> 7976 | F | dev-clean | 8.13 | JenniferRutters

I trained the final model on 3-second fragments of audio at the full sampling rate of 16 kHz using SGD with momentum 0.9 and learning rate 0.05. A form of data augmentation was applied: on each pass over the training data we take only a random 3 seconds of audio from each file. This method trains the model to a best validation accuracy of 98.9% in 3 epochs, taking about 2.5 hours on a 1080 Ti. I was able to achieve ~98% accuracy in only 10 minutes using a smaller model on audio downsampled to just 4 kHz.
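
As a rough illustration of the crop-based augmentation and optimiser settings described above (GenderClassifier is the earlier sketch; the data-loading details are my own assumptions):

import random
import torch
from torch import optim

SAMPLE_RATE = 16000      # 16 kHz
FRAGMENT_SECONDS = 3

def random_fragment(waveform):
    # Take a random 3-second crop of a 1D waveform tensor, so each pass
    # over the training data sees a different slice of every file.
    n = FRAGMENT_SECONDS * SAMPLE_RATE
    start = random.randint(0, max(0, waveform.shape[-1] - n))
    return waveform[..., start:start + n]

model = GenderClassifier()
optimiser = optim.SGD(model.parameters(), lr=0.05, momentum=0.9)
loss_fn = torch.nn.BCELoss()  # binary target: 0 = male, 1 = female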

Segmenting the Interview

Once my final model was produced I was keen to test it out on some real-world data, and came across the interview embedded at the top of this article. To segment the video into Kirsty sections and Elton sections I first broke the audio up into many overlapping 3-second fragments, then applied the model to each of these fragments in turn. If the predicted probability that the speaker is female is close to 1 then we can conclude it is Kirsty speaking, and vice versa.
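
In code, the sliding-window inference looks something like the sketch below; the 3-second window matches the training fragments, while the 0.5-second hop is an arbitrary choice of mine to get overlapping fragments.

import torch

def predict_segments(model, waveform, sample_rate=16000,
                     window_seconds=3, hop_seconds=0.5):
    # Slide a window over the audio and return one P(female) per position.
    window = int(window_seconds * sample_rate)
    hop = int(hop_seconds * sample_rate)
    probs = []
    model.eval()
    with torch.no_grad():
        for start in range(0, waveform.shape[-1] - window + 1, hop):
            fragment = waveform[..., start:start + window]
            p = model(fragment.view(1, 1, -1))  # (batch, channel, time)
            probs.append(p.item())
    return probs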

A visualisation of the results is below:

  • Blue shaded regions are the true times when Kirsty is speaking (hand labeled)
  • Red lines are brief interjections by Elton or points when Elton talks over Kirsty
  • A 1-second rolling mean was applied to the predictions as the raw predictions were quite noisy (a small smoothing sketch follows this list)
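
The smoothing step mentioned in the last bullet can be as simple as the following, assuming the per-window predictions from the sketch above and a 0.5-second hop (so a window of 2 predictions spans roughly 1 second):

import numpy as np

def rolling_mean(probs, window=2):
    # Simple rolling mean over the per-window predictions; window=2
    # corresponds to ~1 second at a 0.5-second hop.
    kernel = np.ones(window) / window
    return np.convolve(np.asarray(probs), kernel, mode='same')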

There are some interesting points to make from this:

  1. The model has a clear bias towards predicting a male voice. When Elton is speaking the prediction is always confidently at P(female)=0, whereas Kirsty’s voice provokes a much less clear response from the model, rarely reaching P(female)=1. Also, at any points where both are speaking, the model tends to favour predicting a male voice.
  2. Sometimes there is a bit of “lag” in swapping from predicting Elton to predicting Kirsty. This could be due to the large 3 second window that the model takes as input. Experimenting with smaller input fragments could be beneficial.
  3. The model performs much worse than you’d expect given 98.9% validation accuracy. This is a typical “gotcha” in machine learning systems, and a main cause is that real-world data often differs from training data in unexpected, easy-to-miss ways. In this case all the training data was stored as FLAC files, whereas I acquired the interview data as an MP4; the different audio encodings are a potential reason for the drop in performance.

Although the model doesn’t do too badly considering the limited amount of architecture search and hyperparameter optimisation, it still looks like it needs some more effort to work smoothly outside of its training domain of audiobook data.

In addition to the points above there are many extensions I’d like to make: working with real-world noisy audio in various encodings, dealing with silence and people speaking over each other (moving from multiclass to multilabel), training directly for audio segmentation, and tuning the input fragment length and sampling rate.

I believe audio processing is an area ripe with potential deep learning applications (one-shot speaker recognition from raw audio anyone?) and I will be experimenting more with audio data in the future.
