How To Generate MFCC From Audio — ML For Lazy 2021
MFCCs are very important in audio processing and speech recognition systems. If you want a machine to classify audio, MFCCs are the features to reach for, and they give fascinating results.
Want to learn how we can use Python to handle this complicated task and get the best results in audio processing and classification? Let us hop in then, get the basic idea of what an MFCC is, and see how we can extract them.
What are MFCCs?
MFCC is an acronym for Mel-Frequency Cepstral Coefficients. They are very important whenever we deal with audio and audio processing techniques. Think of them as the best of your audio, the part that represents it most faithfully.
Though the full form of the acronym sounds tense and complicated, the concept itself is simple and easy to understand.
Let me try to explain this 'fiery-looking' concept with an analogy. Suppose you are in class and you have a project: motivating students to take up some game, say football. Now, you give lectures daily for 10 days, and by the end of the 10th day, some students join the club and start playing football. Those students got motivated and joined.
In your speeches, you said some fascinating things and some weird things. The students who joined the club considered only the positive points, the ones that affected them in the real world. Those who didn't join focused on the weird stuff and decided to stay away.
Both groups took into account only the points that affected them most; in other words, the points that were most important, most useful, and most influential in their decision.
In the same way, MFCCs are coefficients: numerical values that keep only the part of the audio data that is important and carries the most value. These coefficients contribute the most towards the audio's features and are among the most influential values we can extract from an audio sample.
Now, let us see the technical side of the same.
MFCCs contain information about the rate of change in the different spectral bands.
To get the MFCCs, we follow these steps:
→ Take the Fourier transform of the signal
→ Map the powers of the spectrum onto the mel scale, using triangular (or cosine) overlapping windows
→ Take the logs of the powers at each of the mel frequencies
→ Take the DCT (discrete cosine transform) of the mel log powers
→ The MFCCs are the amplitudes of the resulting spectrum
We need not get into the details of these processes, since the libraries will take care of them. So remember them by name and nothing else. Still, they give you an idea of what is actually going on behind the scenes when we extract these coefficients; a rough sketch of the pipeline follows.
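For the curious, here is a minimal sketch of those five steps written out by hand with librosa's building blocks and scipy. It is an illustration of the idea, not librosa's exact internal implementation (librosa uses a decibel-scaled log, for example), and "audio.wav" is a placeholder file name:

import numpy as np
import scipy.fftpack
import librosa

y, sr = librosa.load("audio.wav")   # placeholder path; y is the mono signal, sr its sample rate

# Step 1: short-time Fourier transform -> power spectrogram
power = np.abs(librosa.stft(y, n_fft=2048, hop_length=512)) ** 2

# Step 2: map the powers onto the mel scale with a triangular filter bank
mel_basis = librosa.filters.mel(sr=sr, n_fft=2048, n_mels=128)
mel_power = mel_basis @ power

# Step 3: logs of the powers at each mel frequency (small offset avoids log(0))
log_mel = np.log(mel_power + 1e-10)

# Steps 4 and 5: DCT of the mel log powers; keep the first 20 amplitudes as the MFCCs
mfcc = scipy.fftpack.dct(log_mel, axis=0, norm='ortho')[:20]
print(mfcc.shape)                   # (20, number_of_frames)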
What do they Represent?
As seen above, we follow those steps to extract MFCCs from an audio signal. Since we take the Fourier transform, they contain information about the rate of change in the different spectral bands.
If an MFCC has a positive value, most of the spectral energy is concentrated at the low frequencies. If it has a negative value, most of the spectral energy is concentrated at the high frequencies.
When dealing with sound and MFCCs, the lower-order coefficients contain most of the information about the overall spectral shape of the source-filter transfer function.
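A toy example makes this concrete. If we keep only the first few DCT coefficients of a log spectrum and invert the transform, we get back a smoothed spectral envelope. The random data here is purely illustrative:

import numpy as np
import scipy.fftpack

rng = np.random.default_rng(0)
log_spectrum = rng.standard_normal(40).cumsum()       # stand-in for one frame of log-mel powers

coeffs = scipy.fftpack.dct(log_spectrum, norm='ortho')
coeffs[8:] = 0                                        # keep only the 8 lowest-order coefficients

envelope = scipy.fftpack.idct(coeffs, norm='ortho')   # a smoothed envelope of the original frame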
Enough of the technical stuff. Let us write some Python code and use a few libraries to extract these MFCCs from an audio signal.
MFCC Applications
MFCCs are commonly used as features in speech recognition systems, such as systems that automatically transcribe audio into text. Speech recognition has a wide range of applications and is today one of the major branches of machine learning and deep learning, contributing to the world's automated systems and language processing systems.
Another common application is in music information retrieval systems: genre classification, audio similarity measures, accent classification, and many more. These are the same features I am using in my own project on accent classification.
A Key Point Of Consideration
MFCCs are not very robust in the presence of noise, so in automated speech recognition systems we usually normalize their values to reduce the influence of noise on the features.
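A common way to do this is cepstral mean and variance normalisation (CMVN). Here is a minimal sketch, assuming mfcc is the array returned by a library such as librosa, with one row per coefficient and one column per frame:

import numpy as np

def cmvn(mfcc):
    # Normalise each coefficient across frames: zero mean, unit variance
    mean = mfcc.mean(axis=1, keepdims=True)
    std = mfcc.std(axis=1, keepdims=True)
    return (mfcc - mean) / (std + 1e-10)   # small offset avoids division by zero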
How to get the MFCC values
It may seem tedious to compute these MFCCs by hand and then work with them. Worry not: there are libraries we can use to get the MFCCs and work with them easily.
Librosa is one such library and is very popular for working with audio.
Scipy is another, a major package used for many purposes, including audio.
We will be using Librosa to get these MFCCs. So, let us begin.
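If you do not have it installed yet, librosa is available on PyPI and can be installed with pip:

pip install librosa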
How to get the MFCC spectrograms and Plot them?
First, let us import all the required libraries, and then we will see what we need to do further. I hope you know the purpose of each of them. Even though I am not using Numpy and Pandas very heavily in this tutorial, I like to keep them in my script; it gives me a sense of a proper working setup.
import numpy as np
import pandas as pd
import librosa as lb
import librosa.display
import matplotlib.pyplot as plt
After importing the libraries, I create two lists for storing the audio data and the sample rates. This is not strictly necessary, but I need these values for further processing later, so I store them here.
I have a CSV file, which contains the name of the audio files and their corresponding classes. So, let us import that CSV file and then work further.
csvPath = "path to csv"
metadata = pd.read_csv(csvPath)
metadata.head()
audioData = []
srate = []

for index, row in metadata.iterrows():
    # Build the path to each audio file, e.g. from the 'file_name' column
    filename = "Path to the file"
    data, sampleRate = lb.load(filename)
    audioData.append(data)
    srate.append(sampleRate)

    # Plot the waveform of each file as we go
    plt.figure(figsize=(10, 3))
    lb.display.waveplot(data, sr=sampleRate)
    plt.show()
'iterrows()' iterates over the rows of the pandas dataframe one by one, so we can operate on each row in turn. Here, I load every file in the dataset (the path is built using the 'file_name' column) and extract the audio data and sample rate of each audio file. To visualize the waveforms and get a pictorial feel for the audio, I use librosa's 'display.waveplot()'. Since the for loop runs over the whole dataset, I get the waveforms of all the audio files.
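If you want to make the path construction explicit, a hypothetical sketch could look like the following; the folder name audioDir and the column name 'file_name' are assumptions based on the description above:

import os

audioDir = "path to audio folder"   # hypothetical root folder of the audio files
for index, row in metadata.iterrows():
    filename = os.path.join(audioDir, str(row['file_name']))
    data, sampleRate = lb.load(filename)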
Now, let us iterate over the two lists I created, using the zip function.
a — audio data, s — sample rate
To get the MFCC features, all we need to do is call librosa's 'feature.mfcc' and give it the audio data and the corresponding sample rate of the audio signal. If we print the MFCCs, we will see something like an array of numbers.
for a, s in zip(audioData, srate):
    mfcc = lb.feature.mfcc(y=a, sr=s)
    print(mfcc)
Since we are looping over the lists, we get the MFCCs for all the audio samples one by one.
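Each call returns a 2-D array with one row per coefficient and one column per frame, and librosa computes 20 coefficients by default. Inside the same loop, a quick sanity check might look like this (the exact frame count depends on the clip length and is only an illustration here):

mfcc = lb.feature.mfcc(y=a, sr=s, n_mfcc=20)   # n_mfcc defaults to 20
print(mfcc.shape)                              # e.g. (20, 431) for a ~10-second clip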
We can do one further thing, which is to collapse each MFCC matrix into a fixed-length vector by averaging every coefficient over all frames. Let us do that.
for a, s in zip(audioData, srate):
    mfcc = lb.feature.mfcc(y=a, sr=s)
    # Average each coefficient across all frames -> one value per coefficient
    mfccScaled = np.mean(mfcc.T, axis=0)
    print(mfccScaled)
    plt.plot(mfccScaled, 'g')
    plt.show()
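As a hypothetical next step, these per-file mean vectors can be stacked into a feature matrix that a classifier can consume; the variable names here are my own:

features = []
for a, s in zip(audioData, srate):
    mfcc = lb.feature.mfcc(y=a, sr=s)
    features.append(np.mean(mfcc.T, axis=0))

X = np.array(features)   # shape: (number_of_files, 20), ready for a classifier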
Final Thoughts
In this blog post, we saw how to use the librosa library to extract MFCC features. This is one way of extracting important features from audio data and is widely used in audio processing systems and other automated speech systems.
Now, it is your turn to get your hands dirty with this simple yet important tutorial. So get up, start coding, and enjoy the learning.
If you like the post, share it, give it a 👍, and leave a comment below.
If you want to read my previous post, then check out this post about Master’s Project, My first day.
I am a postgraduate Computer Science student from Kashmir. In these COVID days, I have turned towards spreading information about machine learning, which is my passion and the focus of my future studies. The aim is to help people understand the basic concepts of machine and deep learning, and to deepen my own understanding along the way, since they are crucial for further success in this field.
Originally published at https://mlforlazy.in on May 27, 2021.