Audio and Image Features used for CNN
Analysis of different features in CNN
Convolutional Neural Network (CNN): A CNN is an artificial neural network that has so far been most widely used for analyzing images. Although image analysis is its most common application, CNNs can also be used for other data analysis and classification problems.
What sets a CNN apart from other artificial neural networks is its specialization in picking out patterns in the input and making sense of them.
Data sets required for a CNN
Data is raw information: a representation of both human and machine observations of the world. Almost anything can be represented as data. Science and literature alike can be stored as 1s and 0s in a computer. In the virtual world we are literally surrounded by data, since it is the fundamental building block of everything we see there, and when we observe something in physical life, it becomes data in our brain.
Step 1. Select the Data: Decide on the right kind of data to use. The choice depends entirely on the problem we are trying to solve. Public datasets are available for almost every topic. Kaggle.com is a good source, since its datasets are well formatted and well explained. There are also lists of public datasets on GitHub, and an advanced Google search lets you specify the file type you want. Often a website has an API that makes it easier to get the data you need; otherwise, you can use the Beautiful Soup library to take the raw HTML page and scrape the data directly.
Step 2. Process the Data: Write a function that extracts the data from the dataset and feeds it to the neural network. During training, the network learns a boundary that separates the classes (for two classes, a separating line), so that when it is given new data it can predict the required result.
Format the Data: Data can come in the form of a text file, a relational database, or a CSV file, and libraries are available to convert one file type to another. Make sure the data is in a file type that is appropriate for the program.
Clean the Data: Sometimes the data contains incomplete instances. We can iterate through the instances, check whether any of their values are empty, and delete those that are.
Features to Use: If the right features aren’t used, the model will make bad predictions. Use the features that are relevant to the problem.
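The formatting, cleaning, and feature-selection steps above can be sketched in a few lines. This is only a minimal illustration using Python's standard `csv` module. The column names, the inline CSV text, and the helper `load_clean_rows` are made up for the example and are not taken from any particular dataset.

```python
import csv
import io

# Hypothetical CSV content; a real project would read this from a file.
RAW_CSV = """height_cm,weight_kg,favorite_color,label
170,65,blue,adult
,30,red,child
120,25,green,child
180,,blue,adult
"""

def load_clean_rows(csv_text, feature_names, label_name):
    """Parse CSV text, drop incomplete rows, and keep only the chosen columns."""
    wanted = feature_names + [label_name]
    rows = []
    for record in csv.DictReader(io.StringIO(csv_text)):
        # Clean: skip any instance with an empty value in a column we need.
        if any(record[name].strip() == "" for name in wanted):
            continue
        # Select features: keep only the columns relevant to the problem.
        features = [float(record[name]) for name in feature_names]
        rows.append((features, record[label_name]))
    return rows

rows = load_clean_rows(RAW_CSV, ["height_cm", "weight_kg"], "label")
print(rows)
```

Here `favorite_color` is deliberately left out of the feature list, on the assumption that it is irrelevant to the label, and the two rows with missing values are dropped.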
Step 3. Transform the Data: One possible transformation is decomposition. Sometimes a feature is too complex, and decomposing it into simpler parts can make the model more accurate. Once we are satisfied with the features and their class labels, we can transform the features into vectors. Vectors are numerical representations of features, and all kinds of features (words, images, and videos) can be represented as vectors. We can then feed these vectors directly to the neural network.
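As a rough sketch of what "transforming features into vectors" can look like, the example below encodes a word as a one-hot vector and flattens a tiny grayscale image into a vector of values between 0 and 1. The vocabulary, the image, and the function names `one_hot` and `flatten_image` are illustrative assumptions; real pipelines often use richer encodings.

```python
def one_hot(word, vocabulary):
    """Return a vector with 1.0 at the word's position and 0.0 elsewhere."""
    return [1.0 if word == entry else 0.0 for entry in vocabulary]

def flatten_image(pixels, max_value=255):
    """Turn a 2D grid of pixel intensities into one flat vector scaled to [0, 1]."""
    return [value / max_value for row in pixels for value in row]

vocabulary = ["cat", "dog", "bird"]  # illustrative vocabulary
tiny_image = [[0, 255], [51, 0]]     # illustrative 2x2 grayscale image

word_vector = one_hot("dog", vocabulary)
image_vector = flatten_image(tiny_image)

print(word_vector)
print(image_vector)
```

Both results are plain lists of numbers, which is the form a network expects as input.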
Image Features in CNN
Convolutional layers have filters that detect patterns. Patterns in an image include edges, shapes, textures, objects, etc.
Different detectors can be used as filters such as:
- Edge Detector
- Corner Detector
- Shape Detector
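To make the idea of a detector concrete, here is a small sketch of an edge detector applied to two 3*3 patches. The kernel is a hand-written Sobel-style vertical-edge filter chosen for illustration. A CNN would normally learn its filter values during training rather than use fixed ones, and the patch values are made up.

```python
# A hand-written vertical-edge kernel (Sobel-style), used here only for illustration.
VERTICAL_EDGE = [
    [-1, 0, 1],
    [-2, 0, 2],
    [-1, 0, 1],
]

def filter_response(patch, kernel):
    """Sum of element-wise products between a 3x3 patch and a 3x3 kernel."""
    return sum(
        patch[r][c] * kernel[r][c]
        for r in range(3)
        for c in range(3)
    )

flat_patch = [[5, 5, 5], [5, 5, 5], [5, 5, 5]]  # uniform region, no edge
edge_patch = [[0, 0, 9], [0, 0, 9], [0, 0, 9]]  # dark-to-bright vertical edge

flat_score = filter_response(flat_patch, VERTICAL_EDGE)
edge_score = filter_response(edge_patch, VERTICAL_EDGE)

print(flat_score)  # 0: the kernel weights cancel out on a uniform patch
print(edge_score)  # 36: strong response where brightness changes left to right
```

The filter gives no response on the uniform patch and a large response on the patch containing a vertical edge, which is what "detecting a pattern" means here.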
The deeper the network goes, the more sophisticated the filters become. In later layers, rather than edges and simple shapes, filters may be able to detect specific object parts such as eyes, ears, feathers, fur, hair, scales, and beaks. Filters in even deeper layers are able to detect full dogs, cats, lizards, birds, etc.
Working of a Convolutional Neural Network with images
In the picture above, a neural network accepts images of handwritten digits and classifies each one into a category according to whether it is 1, 2, 3, … 9.
Assume the first hidden layer is a convolutional layer. For it we also have to specify the number of filters to use. A filter is a small matrix whose number of rows and columns we choose and whose values are initialized randomly.
In the picture above, the filter has size 3*3. When the convolutional layer receives input, the filter slides over each 3*3 set of pixels in the input until it has covered every 3*3 block of pixels in the entire image. This sliding is referred to as convolving.
The image of a “7” is represented as a matrix whose values are the individual pixels of the image, and this input is passed to the convolutional layer. Suppose this layer has only one filter, which is going to convolve across each 3*3 block of pixels in the input.
When the filter lands on the first 3*3 block, the dot product of the filter with that block of pixels is computed and stored in the first cell of the convolutional layer's output. The filter then slides to the next 3*3 block, the dot product is taken again, and the value is stored in the next cell. This continues until a dot product has been stored for every 3*3 block that the filter convolves over.
The matrix of dot products is the output of the layer, and it is passed to the next layer as input. The filters in the following layers then process it in the same way to generate their own results.
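The sliding-and-dot-product procedure described above can be sketched as follows. This is a minimal pure-Python illustration that assumes a stride of 1 and no padding, so an n*n input and a 3*3 filter give an (n-2)*(n-2) output. The 5*5 image loosely shaped like a "7" and the function name `convolve2d` are made up for the example. As in most deep learning libraries, the filter is applied without being flipped.

```python
import random

def convolve2d(image, kernel):
    """Slide the kernel over every kernel-sized block (stride 1, no padding)."""
    k = len(kernel)
    out_rows = len(image) - k + 1
    out_cols = len(image[0]) - k + 1
    output = []
    for i in range(out_rows):
        row = []
        for j in range(out_cols):
            # Dot product of the filter with the block whose top-left corner is (i, j).
            total = sum(
                image[i + r][j + c] * kernel[r][c]
                for r in range(k)
                for c in range(k)
            )
            row.append(total)
        output.append(row)
    return output

# Illustrative 5x5 binary image loosely shaped like a "7".
seven = [
    [1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0],
    [0, 0, 1, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 1, 0, 0, 0],
]

random.seed(0)  # fixed seed so the randomly initialized filter is reproducible
random_filter = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(3)]

feature_map = convolve2d(seven, random_filter)
print(len(feature_map), len(feature_map[0]))  # 3 3
```

The resulting 3*3 `feature_map` is what would be handed to the next layer. With a randomly initialized filter its values are not meaningful yet; training is what adjusts the filter so that the responses become useful.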
Leveraging the rapid growth in the amount of annotated data and the great improvements in the power of graphics processing units, research on convolutional neural networks has emerged swiftly and achieved state-of-the-art results on various tasks.
Audio features used in CNN
Speech can be represented as an image as well. A spectrogram presents sound as frequency versus time, so it can be treated as an image and a CNN can be applied to it.
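As a rough sketch of how a sound becomes a frequency-versus-time grid, the example below computes a magnitude spectrogram with a plain discrete Fourier transform over short overlapping frames. The sample rate, tone frequency, frame size, and hop length are all illustrative assumptions. A real pipeline would typically use an FFT library and often a mel or log scale.

```python
import cmath
import math

def spectrogram(samples, frame_size, hop):
    """Magnitude spectrogram: one list of frequency-bin magnitudes per time frame."""
    frames = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = samples[start:start + frame_size]
        # Hann window to reduce spectral leakage at the frame edges.
        windowed = [
            x * (0.5 - 0.5 * math.cos(2 * math.pi * n / (frame_size - 1)))
            for n, x in enumerate(frame)
        ]
        magnitudes = []
        # Keep bins 0..frame_size/2, the non-redundant half for real-valued input.
        for k in range(frame_size // 2 + 1):
            coeff = sum(
                x * cmath.exp(-2j * math.pi * k * n / frame_size)
                for n, x in enumerate(windowed)
            )
            magnitudes.append(abs(coeff))
        frames.append(magnitudes)
    return frames

SAMPLE_RATE = 800  # Hz, illustrative
TONE_HZ = 100      # Hz, illustrative pure tone
FRAME_SIZE = 64
signal = [math.sin(2 * math.pi * TONE_HZ * n / SAMPLE_RATE) for n in range(256)]

spec = spectrogram(signal, frame_size=FRAME_SIZE, hop=32)

# Locate the strongest frequency bin in the first frame and convert it to Hz.
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
peak_hz = peak_bin * SAMPLE_RATE / FRAME_SIZE
print(len(spec), len(spec[0]), peak_hz)
```

Each entry of `spec` is one time frame and each value within it is the energy at one frequency, so the whole structure is a 2D grid that can be given to a CNN like a single-channel image.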
A color image has three layers: R, G, and B.
Working of a Convolutional Neural Network with audio
The image below shows frequency, time, and frames. Frequency is the property of sound that determines pitch.
Speech can likewise be treated as an image with three layers. When building the CNN, take the speech image (frequency versus time) together with its 1st and 2nd derivatives as those layers.
A CNN can make predictions from speech data: it can learn not only from images but also from speech. It can analyze the data, learn from it, and identify words and utterances.
Difference between image features and audio features: An audio file has to be converted into an image (a spectrogram) before a CNN can be run on it, and it is also more difficult for the network to learn from, analyze, and make predictions on.
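One way to picture the three layers is sketched below: the spectrogram itself plus its first and second differences, stacked like the R, G, and B planes of a color image. The sketch assumes the derivatives are taken along the time axis and approximates them with simple frame-to-frame differences. Speech systems often use a smoothed regression formula instead, and the toy values and function names are made up.

```python
def time_derivative(spec):
    """Finite difference along the time axis; the first frame gets zeros."""
    deltas = [[0.0] * len(spec[0])]
    for t in range(1, len(spec)):
        deltas.append([cur - prev for cur, prev in zip(spec[t], spec[t - 1])])
    return deltas

def stack_three_channels(spec):
    """Return [static, delta, delta-delta], analogous to R, G, B image planes."""
    delta = time_derivative(spec)
    delta_delta = time_derivative(delta)
    return [spec, delta, delta_delta]

# Illustrative spectrogram: 4 time frames x 3 frequency bins.
toy_spec = [
    [1.0, 2.0, 3.0],
    [2.0, 4.0, 3.0],
    [4.0, 8.0, 3.0],
    [7.0, 8.0, 3.0],
]

channels = stack_three_channels(toy_spec)
for name, channel in zip(["static", "delta", "delta-delta"], channels):
    print(name, channel)
```

All three channels have the same shape, so they can be fed to a convolutional layer in the same way as a three-channel color image.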
Advancements in the last two years: With the development of cloud computing, available computing power has improved drastically, which makes computing various mathematical models easier. Improved image classification techniques have also raised the accuracy of neural networks. One such approach splits the image into smaller images and runs inference on each through a similar or classic neural network for classification.
Yann LeCun, a pioneering computer scientist, introduced the basic structure of the modern CNN, and in 2012 Alex Krizhevsky and his co-authors proposed AlexNet, the architecture that brought deep CNNs to prominence. The basic CNN architecture comprises multiple processing layers that are capable of learning feature representations directly from raw inputs. The representations learned at each layer build up multiple levels of abstraction, which allow CNNs to be applied successfully to a variety of tasks. In the early days, however, the lack of large training datasets and computing power meant that deep CNNs couldn’t perform well on more complex problems. With the development of high-performance computing hardware such as GPUs and the availability of large amounts of data thanks to the internet, this is now a rapidly developing field.
In recent years, major work has gone into acceleration methods for CNNs in terms of architecture compression, algorithm optimization, and hardware-based improvement.
The rapid growth in data size and the availability of datasets for different problems have made it easier for developers to work in the AI field. Datasets that can be used to train models in artificial intelligence projects are available from sources such as kaggle.com, LabelMe, ImageNet, LSUN, MS COCO, COIL100, Visual Genome, Google’s Open Images, Labeled Faces in the Wild, Stanford Dogs Dataset, and Indoor Scene Recognition.
[1807.08596] Recent Advances in Convolutional Neural Network Acceleration
Abstract: In recent years, convolutional neural networks (CNNs) have shown great performance in various fields such as…
[1512.07108] Recent Advances in Convolutional Neural Networks
Abstract: In the last few years, deep learning has led to very good performance on a variety of problems, such as…
Conclusion: Audio analysis with a CNN has some limitations, since an audio file must be converted into an image before learning and testing can be performed. Modern approaches often use LSTMs for audio analysis, where the network feeds its own state back into itself to keep some history of what the person has said in the past.