[Week 5–6] What Does The City Say?

Published in bbm406f17 · Jan 4, 2018

Since our previous blog post, we have given our classroom and video presentations and finished our final report. As promised, this week we present our DNN and GMM-DNN comparison results.

By reducing the full dataset, we created 747 training, 240 validation and 150 test samples. Shrinking the data from 1170 to 747 samples turned out to be difficult, time-consuming and in need of special attention because of several constraints: segments extracted from the same original recording, and all files recorded at the same location, must end up on the same side of the evaluation split [1], and files containing errors had to be dealt with. We then ran two tests using MFCC features: a GMM-DNN comparison and a DNN test. A sketch of such a grouped split is shown below.
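For reference, here is a minimal sketch of how such a grouped split can be done with scikit-learn's GroupShuffleSplit, so that segments sharing a recording ID never cross the split boundary. The metadata file name and its column layout are assumptions for illustration, not the actual dataset files.

```python
# Minimal sketch (not our exact pipeline): split the data so that all segments cut
# from the same original recording / location stay on the same side of the split.
# The metadata file name and its column layout are assumptions for illustration.
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

meta = pd.read_csv("meta.txt", sep="\t", names=["file", "scene", "recording_id"])

# First split off the test recordings, then split the rest into train/validation.
outer = GroupShuffleSplit(n_splits=1, test_size=150 / len(meta), random_state=0)
trainval_idx, test_idx = next(outer.split(meta, groups=meta["recording_id"]))
trainval, test = meta.iloc[trainval_idx], meta.iloc[test_idx]

inner = GroupShuffleSplit(n_splits=1, test_size=240 / len(trainval), random_state=0)
train_idx, val_idx = next(inner.split(trainval, groups=trainval["recording_id"]))
train, val = trainval.iloc[train_idx], trainval.iloc[val_idx]
```

Because the split works on whole recording groups rather than single files, the exact sample counts (747/240/150) may require small adjustments to the test_size fractions.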

  • GMM-DNN Test & Results:

As a result of our tests, we achieved a maximum validation accuracy of 76% with the GMM and 61.66% with the DNN. The DNN scoring lower than the GMM may be because a DNN needs more training data [2] and learns the general characteristics of the data, so it is not as sensitive as a GMM to local differences. GMMs perform well on small, clean datasets, and since we created a small and clean dataset, the GMM gave higher accuracy than the DNN. A minimal sketch of this kind of GMM baseline follows.
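To make the comparison concrete, here is a minimal sketch of a GMM baseline of this kind: one GaussianMixture per scene class fitted on that class's MFCC frames, with a clip classified by its total log-likelihood under each class model. The component count and feature shapes are illustrative assumptions rather than our exact settings.

```python
# Sketch of an assumed GMM baseline: one GaussianMixture per scene class, fitted on
# that class's MFCC frames; a clip is assigned to the class whose model gives the
# highest total log-likelihood. Component count is a placeholder assumption.
from sklearn.mixture import GaussianMixture

def train_gmms(frames_by_class, n_components=16):
    """frames_by_class: dict mapping scene label -> (n_frames, n_mfcc) array."""
    gmms = {}
    for label, frames in frames_by_class.items():
        gmms[label] = GaussianMixture(n_components=n_components,
                                      covariance_type="diag").fit(frames)
    return gmms

def classify_clip(gmms, clip_frames):
    """clip_frames: (n_frames, n_mfcc) MFCC matrix of one audio segment."""
    scores = {label: gmm.score_samples(clip_frames).sum()
              for label, gmm in gmms.items()}
    return max(scores, key=scores.get)
```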

  • DNN Test & Results:

We get 65.33% test accuracy with DNN+MFCC on the large dataset. As you can see from the confusion matrix above, our model always classified car sounds correctly, and it also classifies city-center sounds correctly. It sometimes confused residential area with park and forest path, which is a reasonable confusion because these scenes contain similar sounds such as human voices, birds and water. The model classified bus correctly but sometimes confused it with train or car. As our results show, when the background sounds are similar, it is understandable for the model to mix the scenes up. Irrational classifications can also occur, because MFCC features alone may be inadequate [3]. A rough sketch of this kind of DNN and of the confusion-matrix evaluation is shown below.
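The following sketch shows a small fully connected network on clip-level MFCC features together with the confusion-matrix computation we discuss above. The layer sizes, feature dimension and number of scene classes are placeholder assumptions, not our exact configuration.

```python
# Sketch of a small fully connected DNN on clip-level MFCC features, plus the
# confusion matrix used for the analysis above. Layer sizes, feature dimension
# and class count are placeholder assumptions.
from sklearn.metrics import confusion_matrix
from tensorflow.keras import layers, models

n_mfcc, n_classes = 40, 15

model = models.Sequential([
    layers.Input(shape=(n_mfcc,)),
    layers.Dense(256, activation="relu"),
    layers.Dense(128, activation="relu"),
    layers.Dense(n_classes, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# With X_* as (n_clips, n_mfcc) MFCC vectors and y_* as integer scene labels:
# model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=50)
# y_pred = model.predict(X_test).argmax(axis=1)
# print(confusion_matrix(y_test, y_pred))
```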

  • Future Work

MFCC features usually give good results when used alone in speech recognition; however, for scene classification, MFCC alone may be insufficient.
As future work, we can use other features, different methods, and cleaner and more training data. Feature reduction can also be done with PCA, and a regularizer can be used to reduce overfitting; this way we can get better results. A sketch of these two ideas is shown below.
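As an illustration of these two ideas, the sketch below applies PCA to the MFCC features and adds an L2 regularizer plus dropout to the network. The number of components, layer sizes and penalty strengths are placeholder assumptions.

```python
# Sketch of the future-work ideas: PCA to reduce the MFCC feature dimension, and
# an L2 regularizer plus dropout in the DNN to reduce overfitting. The number of
# components, layer sizes and penalty strengths are placeholder assumptions.
from sklearn.decomposition import PCA
from tensorflow.keras import layers, models, regularizers

pca = PCA(n_components=20)  # keep the 20 strongest directions of the MFCC features
# X_train_red = pca.fit_transform(X_train)
# X_val_red, X_test_red = pca.transform(X_val), pca.transform(X_test)

regularized_model = models.Sequential([
    layers.Input(shape=(20,)),
    layers.Dense(128, activation="relu",
                 kernel_regularizer=regularizers.l2(1e-4)),
    layers.Dropout(0.3),
    layers.Dense(15, activation="softmax"),
])
regularized_model.compile(optimizer="adam",
                          loss="sparse_categorical_crossentropy",
                          metrics=["accuracy"])
```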
