Audio as data with Silvertone

Luiz Lianza
Data with cafezinho
3 min read · Dec 15, 2022

Let’s understand how to work with audio so we can feed it into ML and Deep Learning models. This post is based on the study we made during the Silvertone project.

As of now, Silvertone is available as a Streamlit app.

Also, don’t forget to check out the repositories of the app and the project.

Photo by Matt Botsford on Unsplash

Silvertone was a final project for Le Wagon Rio’s Data Science Bootcamp, batch 1011. My team was formed by me; Victor Sattamini, the team leader; Lucas Gama, the one with the best model; and Guilherme Barros, the one with front-end skills.

The idea was to use a series of audio datasets labeled with sentiment to build an app capable of describing the general feeling of a speech sample. It is relevant to say that, for this model, the word content is irrelevant: the app only analyses the sound.

In order to analyse sound, we need to take a few steps, which I will share with you. So…

grab your coffee, and let’s talk about sentiment analysis from audio.

Audio files are soundwaves converted into information that an interpreter can understand and convert back into soundwaves. Our ears and brain do a similar process with the physical signal. As far as basic models go, Keras/TensorFlow and Scikit-Learn can’t consume traditional audio formats like MP3 or WAV directly.

The first thing any model needs in order to interpret audio is the extracted wave content. The wave is a time series giving the amplitude of the signal over time in a linear form. This form is similar to the WAV format, though vectors are more convenient for modeling.

To extract the wave you can use the librosa library as follows:

import librosa

audio_file = "path/to/your_clip.wav"  # placeholder: any audio file you want to analyse

# x is the waveform as a NumPy time series, sr is the sampling rate
x, sr = librosa.load(audio_file)

x is the wave in a time series format, and sr is the sampling rate, which tells you how many samples of x correspond to one second of audio. We won’t use sr much for now. With x alone, you can already apply any time series strategy.
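As a quick sanity check (the shape below is just an example; it depends on your file and on librosa’s default 22050 Hz sampling rate):

# x is a 1-D NumPy array with one amplitude value per sample
print(x.shape)        # e.g. (66150,)
print(sr)             # 22050, librosa's default sampling rate

# sr converts sample counts into seconds
duration = len(x) / sr
print(duration)       # length of the clip in seconds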

This format is quite simple, and there are models out there that use it and get excellent results. That wasn’t our case.

The good news is that with librosa we can transform the audio into other representations. We used two of them the most: the mel-scaled spectrogram and tonal centroid features.

Tonal centroid features represent the perfect fifth, the minor third, and the major third as two-dimensional vectors each. So, the tonal centroid features will have six dimensions.

The perfect fifth is the musical interval between a pair of pitches with a frequency ratio of 3:2. The minor third and the major third are also intervals of the musical scale.

With the tonal centroid features, you can focus on more meaningful information about the wave. You will be working with six-dimensional vectors, but remember that time is still an important feature. With them, we were able to achieve around 70% accuracy using a random forest, which is quite good.

You can extract the tonal centroid features with librosa this way:

# six tonal centroid dimensions over time: shape (6, frames)
tonnetz = librosa.feature.tonnetz(y=x, sr=sr)

Because the sampling rate is relevant here, you need to pass it along with the wave. Luckily, we got both from the load method.
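I won’t go deep into the modeling here, but as a rough sketch of how those features can feed a random forest: since tonnetz has shape (6, frames), each clip first has to be collapsed into a fixed-length vector. The mean/std pooling below, as well as the names clips and y, are illustrative assumptions, not necessarily what we shipped.

import librosa
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def tonnetz_features(wave, rate):
    # six tonal centroid dimensions over time: shape (6, frames)
    tonnetz = librosa.feature.tonnetz(y=wave, sr=rate)
    # collapse the time axis so every clip becomes a 12-value vector
    return np.concatenate([tonnetz.mean(axis=1), tonnetz.std(axis=1)])

# clips: hypothetical list of (wave, sampling rate) pairs; y: their sentiment labels
X_feat = np.array([tonnetz_features(wave, rate) for wave, rate in clips])
model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_feat, y)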

To get the mel-scaled spectrogram, we need a few transformations. The wave time series is first transformed into a spectrogram, an image representing which frequencies are present in the wave and how intense they are over time. The spectrogram then goes through mel scaling, giving us the mel-scaled spectrogram.

Librosa can give us the mel-scaled spectrogram directly by:

S = librosa.feature.melspectrogram(y=x, sr=sr)

The mel scaling is a logarithmic transformation of the signal frequency. The main idea is that humans don’t perceive distances between different pitches equally. By applying this transformation, the scale becomes closer to how humans perceive distance between pitches.
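You can see this compression with librosa’s own conversion helper (the frequencies below are arbitrary examples; librosa uses the Slaney mel scale by default):

import librosa

# equal 100 Hz steps cover fewer and fewer mels as the frequency rises,
# mirroring how our pitch resolution drops at higher frequencies
for hz in (100, 200, 1000, 1100, 4000, 4100):
    print(hz, "Hz ->", round(float(librosa.hz_to_mel(hz)), 2), "mels")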

The mel-scaled spectrogram was our choice to model with, giving us the best results. In another post, I will talk about the models and how to use these features. Keep in mind that a mel-scaled spectrogram is still a spectrogram. That means we can treat it as an image.
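Since it is just a 2-D array of intensities, one common next step (a sketch here, not necessarily our exact pipeline) is to convert the power values to decibels and handle the result like a single-channel image:

import librosa
import numpy as np

# convert power to decibels, a more perceptually meaningful intensity scale
S_db = librosa.power_to_db(S, ref=np.max)

# shape is (n_mels, frames): effectively a grayscale image,
# ready for image-style models such as CNNs
print(S_db.shape)     # (128, ...) with librosa's default n_mels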

Photo by Mel Poole on Unsplash
