[Week 3-4] What Does The City Say?


First, we used the soundfile library to read our audio data and visualized it to understand the audio files and see how different they are from each other. The matplotlib and librosa libraries were used to draw these sound waves. Matplotlib's specgram method performs all the necessary computation and drawing of the spectrum. Librosa likewise provides useful methods for drawing the waveform and the log-power spectrogram. An example of the waveform and power spectrogram of a Bus recording:
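As a rough sketch of this step (the file path and figure layout below are illustrative, not our exact setup), reading a recording with soundfile and drawing its waveform and log-power spectrogram with librosa and matplotlib looks roughly like this:

```python
import soundfile as sf
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

# Read the audio with soundfile (hypothetical path to a bus recording)
signal, sample_rate = sf.read("dataset/bus01.wav")
if signal.ndim > 1:                      # mix down to mono if needed
    signal = signal.mean(axis=1)

plt.figure(figsize=(10, 6))

# Waveform
plt.subplot(2, 1, 1)
librosa.display.waveplot(signal, sr=sample_rate)
plt.title("Bus - waveform")

# Log-power spectrogram
plt.subplot(2, 1, 2)
D = librosa.stft(signal)
log_power = librosa.amplitude_to_db(np.abs(D), ref=np.max)
librosa.display.specshow(log_power, sr=sample_rate, x_axis="time", y_axis="log")
plt.title("Bus - log-power spectrogram")
plt.colorbar(format="%+2.0f dB")

plt.tight_layout()
plt.show()
```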

“Also, if you look more closely, you can see the monster from Stranger Things in the plot. :D”
The office is often quiet.

We extracted features from the audio data. MFCC coefficients were extracted from mono audio using 40 ms frames with a Hamming window and 40 mel bands. The baseline system for the dataset uses 19 static coefficients (the 0th coefficient is excluded, since research shows it does not contribute much); delta and acceleration (delta-delta) coefficients were calculated with a window length of 9 frames, resulting in a frame-based feature vector of dimension 59.
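A minimal sketch of this feature extraction with librosa is shown below; the 20 ms hop length and the exact librosa calls are assumptions on our part, while the 40 ms frames, Hamming window, 40 mel bands, 9-frame delta window and the 59-dimensional layout follow the description above:

```python
import librosa
import numpy as np

def extract_features(path):
    # Load as mono audio at its native sampling rate
    y, sr = librosa.load(path, sr=None, mono=True)

    frame_length = int(0.040 * sr)          # 40 ms frames
    hop_length = int(0.020 * sr)            # assumed 50% overlap

    # 20 MFCCs computed on 40 mel bands with a Hamming window
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20,
                                n_fft=frame_length, hop_length=hop_length,
                                n_mels=40, window="hamming")

    # Delta and acceleration coefficients over a 9-frame window
    delta = librosa.feature.delta(mfcc, width=9)
    accel = librosa.feature.delta(mfcc, order=2, width=9)

    # Drop the 0th static coefficient but keep it for delta/acceleration:
    # 19 + 20 + 20 = 59 features per frame
    features = np.vstack([mfcc[1:], delta, accel])
    return features.T                        # shape: (n_frames, 59)
```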

Example plots of Bus audio

After our review, we saw that the most common method used in the related works we studied is the GMM. For this reason, we first implemented the GMM classifier with help from the DCASE 2016 baseline code [here], using scikit-learn's GaussianMixture class. The GMM learns an acoustic model for each acoustic scene class and performs classification with a maximum-likelihood classification scheme.

We can represent each audio recording as a bag of acoustic features extracted from its audio segments. With these features, a GMM can be trained for each class label using only the audio data from that class, as sketched below.
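The sketch below shows this per-class training and maximum-likelihood decision with scikit-learn's GaussianMixture; the extract_features helper is the one sketched earlier, and the number of mixture components and the diagonal covariances are assumptions rather than our exact configuration:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_gmms(train_files_by_label, n_components=16):
    """Train one GMM per acoustic scene class (hypothetical input:
    a dict mapping each scene label to a list of training file paths)."""
    gmms = {}
    for label, paths in train_files_by_label.items():
        # Bag of frame-level features from all recordings of this class
        frames = np.vstack([extract_features(p) for p in paths])
        gmm = GaussianMixture(n_components=n_components,
                              covariance_type="diag")
        gmm.fit(frames)
        gmms[label] = gmm
    return gmms

def classify(path, gmms):
    """Pick the class whose GMM gives the highest total log-likelihood."""
    frames = extract_features(path)
    scores = {label: gmm.score_samples(frames).sum()
              for label, gmm in gmms.items()}
    return max(scores, key=scores.get)
```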

We trained and tested our model using two datasets. You can see a small sample test below.

Misclassifications can be caused by the dominance of background sounds. For example, even though the recording is of a bus, the model classifies it as openairmarket. When we open and listen to the recordings, we can hear that both of them have the same siren sound in the background. You can listen to these recordings below:

In another test, even though the recording is of a park, the model classifies it as bus. When we open and listen to the park recording, we can hear a bus sound in the background, because a bus passed by the park while the sound was being recorded. You can listen to this park recording below:

The information about Dataset-1 is in the previous post.

Dataset-2

The dataset we use is composed of 30-second recordings of various acoustic scenes, with 10 audio recordings for each scene. The list of scenes is: busy street, quiet street, park, open-air market, bus, subway train, restaurant, shop/supermarket, office, subway station.

Now we are working on implementing a DNN model. In the next post, we will talk about our test results with the DNN, compare the methods we use, and interpret the results.
