Utilizing Domain Knowledge in End-to-End Audio Processing

Figure 1. Model A: learning the mel-spectrogram transformation. Model B: classifying sound using mel-spectrograms.

We performed an exploratory study into improving end-to-end audio classification models. By introducing the intermediate regression task of approximating mel-spectrograms, we were able to classify raw-waveform input with the same accuracy as mel-spectrogram input. In future experiments we aim to fine-tune the end-to-end classification model to outperform models trained on hand-crafted features.

Read the Full Paper
https://arxiv.org/abs/1712.00254
See the Code
https://github.com/corticph/MSTmodel

Motivation

Across the machine learning field we have seen a rise in the popularity of neural network models that do not rely on engineered features but instead learn relevant features directly from the data. On image classification tasks, for instance, such state-of-the-art ‘end-to-end’ models take raw pixel values as input.

Naturally, researchers have tried the same in the audio domain, but with less success. Likely reasons include the long-range temporal dependencies in speech, variation in how the same sound is realized due to temporal distortions and phase shifts, and the relatively small size of labelled speech datasets owing to the cost of transcription. Training models on raw waveform input therefore remains an active field of study.

Approach and contribution

The idea for our approach comes from the observation that unlabelled audio data is available in practically unrestricted amounts, while labelling is costly. We therefore split the task of classifying audio from raw waveform input into [A] modelling the mel-spectrogram transformation as a regression task and [B] performing audio classification on mel-spectrograms (see figure 1).
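To make the split concrete, here is a minimal PyTorch sketch of the two sub-tasks. It is not the authors' implementation (the released code is linked above); the model names, layer sizes, and mel parameters are illustrative assumptions.

```python
# Minimal sketch of the two sub-tasks; layer sizes and mel parameters
# are illustrative, not the configuration used in the paper.
import torch
import torch.nn as nn
import torchaudio

# The mel-spectrogram transform supplies the regression targets for model [A].
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000, n_fft=400, hop_length=160, n_mels=64
)

class ModelA(nn.Module):
    """[A] Approximates the mel-spectrogram transformation from raw waveform."""
    def __init__(self, n_mels=64):
        super().__init__()
        self.net = nn.Sequential(
            # Frame the waveform with the same stride as the STFT hop length.
            nn.Conv1d(1, 128, kernel_size=400, stride=160, padding=200),
            nn.ReLU(),
            nn.Conv1d(128, n_mels, kernel_size=3, padding=1),
            nn.ReLU(),
        )

    def forward(self, waveform):       # (batch, 1, samples)
        return self.net(waveform)      # (batch, n_mels, frames)

class ModelB(nn.Module):
    """[B] Classifies sound from a (mel-)spectrogram input."""
    def __init__(self, n_mels=64, n_classes=10):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(32, n_classes)

    def forward(self, spec):           # (batch, 1, n_mels, frames)
        return self.fc(self.conv(spec).flatten(1))

# Sub-task [A]: regress the mel-spectrogram from raw waveform (unlabelled audio suffices).
wave = torch.randn(8, 1, 16000)        # one second of placeholder audio at 16 kHz
target = mel(wave.squeeze(1))          # (batch, n_mels, frames)
model_a = ModelA()
loss_a = nn.functional.mse_loss(model_a(wave), target)

# Sub-task [B]: classify labelled mel-spectrograms.
model_b = ModelB()
logits_b = model_b(target.unsqueeze(1))
```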

Figure 2. Left: mel-spectrogram label. Right: model prediction.

After training, the predictions of model [A] closely match the mel-spectrogram labels and capture the essential features of the representation (see figure 2). We then initialize a combined system [C] (see figure 3) with the learned parameters from [A] and [B]. The results for [C] are on par with those of [B], showing that we are able to learn a representation sufficiently close to the mel-spectrogram features.

Figure 3. Model C: end-to-end sound classification.
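Continuing the sketch above (again an illustrative assumption, not the authors' code), the combined system [C] simply chains the learned front-end [A] into the classifier [B], so both parts start from their pre-trained parameters:

```python
# Combined model [C]: raw waveform in, class logits out. Both sub-modules
# are initialized with the parameters learned in sub-tasks [A] and [B].
class ModelC(nn.Module):
    def __init__(self, model_a, model_b):
        super().__init__()
        self.front_end = model_a       # waveform -> approximate mel-spectrogram
        self.classifier = model_b      # spectrogram -> class logits

    def forward(self, waveform):
        spec = self.front_end(waveform)              # (batch, n_mels, frames)
        return self.classifier(spec.unsqueeze(1))    # (batch, n_classes)

model_c = ModelC(model_a, model_b)     # parameters carried over, no retraining yet
logits_c = model_c(wave)               # end-to-end classification from raw audio
```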

In future work, we believe the first layers can be further optimized with standard backpropagation to enhance performance on various audio-related tasks.
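Since [C] is a single differentiable network, this amounts to training it end-to-end on the classification loss so that the gradients also update the waveform front-end. A hedged continuation of the sketch, with illustrative hyperparameters:

```python
# Joint fine-tuning of [C]: the classification loss now also updates the
# front-end layers learned in sub-task [A]. Hyperparameters are illustrative.
optimizer = torch.optim.Adam(model_c.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

labels = torch.randint(0, 10, (8,))    # placeholder class labels
optimizer.zero_grad()
loss = criterion(model_c(wave), labels)
loss.backward()                        # gradients flow back into the front-end
optimizer.step()
```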