Week 2: What Does The City Say?

Published in bbm406f17 · Nov 26, 2017

We spent the last week researching techniques for extracting features from audio and deciding which classifiers to use for environment classification.

Features

In the related works we examined, many features are used to describe audio signals: Mel-frequency cepstral coefficients (MFCCs), statistical moments of the audio signal's spectrum (i.e., spectral centroid, spectral bandwidth, spectral asymmetry, and spectral flatness), zero-crossing rate, energy range, and frequency roll-off. Some studies also use the Smile983 (983-dimensional) and Smile6k (6,573-dimensional) feature sets obtained via the openSMILE feature extraction tool.
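As an illustration, here is a minimal sketch of how most of these frame-level features could be computed with the librosa library. Librosa itself, the 20-coefficient MFCC setting, and the mean/std summarization over frames are our assumptions for the sketch, not choices taken from the related works:

```python
import numpy as np
import librosa

def extract_features(path, n_mfcc=20):
    """Frame-level audio features summarized by per-clip mean and std.

    A sketch of the feature types mentioned above; frame size and
    number of MFCCs are assumptions, not values from the papers.
    """
    y, sr = librosa.load(path, sr=None)  # keep the file's native rate

    # Each call returns one or more rows of per-frame values; all use
    # librosa's default hop length, so the frame counts line up.
    frames = np.vstack([
        librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc),   # MFCCs
        librosa.feature.spectral_centroid(y=y, sr=sr),
        librosa.feature.spectral_bandwidth(y=y, sr=sr),
        librosa.feature.spectral_flatness(y=y),
        librosa.feature.spectral_rolloff(y=y, sr=sr),
        librosa.feature.zero_crossing_rate(y),
    ])
    # Collapse the time axis into fixed-length per-clip statistics.
    return np.concatenate([frames.mean(axis=1), frames.std(axis=1)])
```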

Datasets

The related works use the dataset from the ongoing IEEE DCASE challenge. It includes 15 different open and closed areas (labels), 9.75 hours of recordings in total, and 8.7 GB of WAV files. The classes are:

• Bus: traveling by bus in the city
• Cafe or Restaurant: small cafe or restaurant
• Car: driving or traveling as a passenger, in the city
• City center
• Forest path
• Grocery store: medium size grocery store
• Home
• Lakeside beach
• Library
• Metro station
• Office: multiple people, typical work day
• Residential area
• Train
• Tram
• Urban park

Classification Methods

In the related works we examined, the most commonly used method is the Gaussian Mixture Model (GMM). The other methods are K-Nearest Neighbors (KNN), Support Vector Machines (SVMs), Deep Neural Networks (DNNs), Recurrent Neural Networks (RNNs), Recurrent Deep Neural Networks (RDNNs), Convolutional Neural Networks (CNNs), and Recurrent Convolutional Neural Networks (RCNNs). A sketch of the GMM approach is shown below.
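To make the GMM approach concrete, here is a minimal sketch of a scene classifier that fits one GMM per class on frame-level features (e.g., the MFCCs above) and labels a recording by the class whose model gives the highest total log-likelihood over its frames. The scikit-learn implementation, diagonal covariances, and 16 mixture components are our assumptions, not settings taken from the related works:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

class GMMClassifier:
    """One GMM per scene class; predict by maximum log-likelihood."""

    def __init__(self, n_components=16):
        self.n_components = n_components
        self.models = {}

    def fit(self, features_by_class):
        # features_by_class: {label: array of shape (n_frames, n_dims)}
        for label, X in features_by_class.items():
            gmm = GaussianMixture(n_components=self.n_components,
                                  covariance_type="diag",
                                  random_state=0)
            self.models[label] = gmm.fit(X)

    def predict(self, X):
        # X: the frames of one recording. Sum the per-frame
        # log-likelihoods under each class model and pick the class
        # whose GMM explains the clip best.
        scores = {label: gmm.score_samples(X).sum()
                  for label, gmm in self.models.items()}
        return max(scores, key=scores.get)
```

One model per class keeps the decision rule simple and is why GMMs are a common baseline for this task: no joint training is needed, and adding a new scene class only requires fitting one more mixture.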

Results Reported in the Related Works

In the related works we examined, on small feature sets such as MFCCs and Smile983, RNNs and RDNNs perform better than models such as GMMs, SVMs, and DNNs. On large feature sets such as Smile6k, DNNs perform better than the others. The GMM with MFCC features, the baseline model provided by the DCASE contest, achieves 67.6% test accuracy, whereas a DNN with the Smile6k features reaches 80% test accuracy; RNNs and RDNNs generally reach 68–77%, SVMs 56–73%, and CNNs and RCNNs operating on spectrograms reach 63–64%.

Links to the Related Works

https://www.ml.cmu.edu/research/dap-papers/DAP_Dai_Wei.pdf
https://www.researchgate.net/profile/Maja_Mataric/publication/221263267_Where_am_I_Scene_Recognition_for_Mobile_Robots_using_Audio_Features/links/00b49527481161c422000000.pdf
https://www.csl.sony.fr/downloads/papers/2007/aucouturier-07b.pdf
http://sail.usc.edu/publications/files/SelinaChu-TASLP2009.pdf
