English — People Detection Using Convolutional Neural Networks
So, a couple of months ago I was about to finish college. I’d been majoring in Computer Engineering, a 5-year course (Engineering courses are 5 years long in Brazil), and one of the last things you do before saying bye-bye to college life is a “TCC”, which is like a final year project but normally done in 6 months (if you’re lucky). I’ve been meddling with Machine Learning since last year and I knew I wanted to do something related to that subject. Moreover, I’d had good experiences with Embedded Systems before and wanted to do something with that as well. Machine Learning tasks normally take a lot of time and space, which is something embedded systems do not have to spare, so I was playing with fire…
At the beginning of the semester I approached one of my college’s professors (he’s normally an Embedded Systems guy, but he’s got some knowledge of AI too) and asked for his help. In Machine Learning I wanted to work with Convolutional Neural Networks (CNNs) because I’d heard of their power before and wanted to experiment with them. TL;DR: he proposed that we study CNNs for detecting people in aerial images. It’s research that has been done before, so we’d need to tweak it a little bit to make it unique. We decided to study their use in embedded systems and how hard it would be to deploy one. I’m writing here parts of the article I produced as a report of that research, hoping that it will help anyone who is venturing into the same area.
The use of machine learning algorithms to solve the problem of detecting people in images is one of the most interesting approaches to this task. An SVM combined with a HOG feature extractor is one of the most commonly used algorithms. Other classifiers, such as cascade classifiers (with Haar features), have also been shown to yield satisfactory results. Finally, deep-learning techniques such as Convolutional Neural Networks have been presenting remarkable results in the image recognition area and might represent an appropriate solution for the proposed task.
Convolutional Neural Networks
Convolutional Neural Networks (CNNs) are a subclass of neural networks that have been found to work well for image classification. Just like in a conventional neural network, each layer is composed of weights and biases that can be updated in order to learn about a certain type of data. CNNs are composed of different layers, each feeding on the data produced by the previous ones. The main types of layers found in these networks are: convolutional layers, pooling, ReLU, normalization layers, fully connected layers and loss.
The convolutional layers are the core of a CNN. Each one holds a set of feature maps that are in fact learnable matrices of data (filters). During the learning phase these filters are convolved with the input data to produce the activation (output) map.
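To make the idea concrete, here is a minimal sketch of a single filter sliding over a tiny single-channel image. It is plain Python, with stride 1 and no padding; the function name `conv2d_valid` and the example values are mine, not from the original work, and a real layer would add a bias and handle many channels and filters.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (lists of lists), stride 1, no padding."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Weighted sum of the patch currently under the kernel.
            # (Strictly this is cross-correlation, which is what deep
            # learning frameworks usually call "convolution".)
            acc = 0.0
            for di in range(kh):
                for dj in range(kw):
                    acc += image[i + di][j + dj] * kernel[di][dj]
            row.append(acc)
        output.append(row)
    return output


# Hypothetical 4x4 input and 2x2 filter, for illustration only.
image = [
    [1, 2, 0, 1],
    [0, 1, 3, 1],
    [2, 1, 0, 0],
    [1, 0, 1, 2],
]
kernel = [
    [1, 0],
    [0, -1],
]
activation_map = conv2d_valid(image, kernel)  # 3x3 output
```

A 4x4 input with a 2x2 filter gives a 3x3 activation map, since the filter only fits in three positions along each axis.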
The pooling layer works as a subsampling tool: each pooling layer shrinks the feature maps it receives into smaller ones, which makes the network less sensitive to small rotations and shifts in the input. This subsampling can be done by calculating the average, the sum or the maximum of a certain number of data points. It also makes the calculations in subsequent layers faster.
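As an illustration, max pooling can be sketched in a few lines. The 2x2 window and stride of 2 are assumptions chosen to keep the example small (the network used here may pool with different settings), and `max_pool2d` is just a name I picked.

```python
def max_pool2d(feature_map, size=2, stride=2):
    """Keep only the largest value in each size x size window."""
    pooled = []
    for i in range(0, len(feature_map) - size + 1, stride):
        row = []
        for j in range(0, len(feature_map[0]) - size + 1, stride):
            window = [
                feature_map[i + di][j + dj]
                for di in range(size)
                for dj in range(size)
            ]
            row.append(max(window))
        pooled.append(row)
    return pooled


# Hypothetical 4x4 feature map, for illustration only.
feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 5],
    [0, 1, 7, 2],
    [6, 2, 3, 4],
]
pooled = max_pool2d(feature_map)  # 2x2 output, a quarter of the data
```

Swapping `max` for an average or a sum gives the other two pooling variants mentioned above.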
The rectified linear units (ReLUs) are the activation functions used to evaluate the neurons in the network. The function used is
f(x) = max(0, x)
This is a piecewise linear, non-saturating function that has been widely used in deep learning due to its behaviour. It is easily calculated (in both the forward and the backward pass), so it doesn’t require complex computations like other activation functions.
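ReLU is usually defined as max(0, x), and both passes fit in a couple of lines, which is why it is so cheap. A minimal sketch (the function names are mine):

```python
def relu(x):
    """Forward pass: negative inputs become 0, positive ones pass through."""
    return max(0.0, x)


def relu_grad(x):
    """Backward pass: the gradient is 1 for positive inputs and 0 otherwise."""
    return 1.0 if x > 0 else 0.0


activations = [relu(v) for v in [-2.0, 0.0, 3.5]]
gradients = [relu_grad(v) for v in [-2.0, 0.0, 3.5]]
```

Compare this with a sigmoid or tanh, which need an exponential for every neuron.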
Normalization layers usually apply a normalization function to a group of neurons in order to generalize the data they represent. Local Response Normalization (LRN) is often seen in CNNs because it works well with ReLUs. Since ReLU outputs are unbounded, LRN tends to make neurons with large activations stand out from their neighbours, which is good for the training process.
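Here is a rough sketch of LRN across channels at a single pixel position: each activation is divided by a term that grows with the squared activations of its neighbouring channels. The default parameter values below are assumptions (values commonly paired with this kind of architecture), not figures taken from this work, and whether alpha is divided by the window size varies between formulations.

```python
def local_response_norm(channels, local_size=5, alpha=1e-4, beta=0.75, k=1.0):
    """Normalize each channel value by its neighbours at one pixel position."""
    half = local_size // 2
    normalized = []
    for i, a in enumerate(channels):
        # Neighbouring channels, clipped at the edges of the channel list.
        lo = max(0, i - half)
        hi = min(len(channels), i + half + 1)
        sum_sq = sum(v * v for v in channels[lo:hi])
        scale = (k + (alpha / local_size) * sum_sq) ** beta
        normalized.append(a / scale)
    return normalized


# Hypothetical activations of 6 channels at one pixel.
normalized = local_response_norm([0.5, 8.0, 0.3, 0.0, 1.2, 0.7])
```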
The last layers in a CNN are normally fully connected layers, which form a common Multilayer Perceptron (MLP). At the end of the network a loss function is applied, and the back-propagation algorithm is run through the network to update the weights and biases, shifting the learning process towards the right results. The most common update algorithm in CNNs is Stochastic Gradient Descent. Finally, a technique called dropout can be applied in the fully connected layers, where with a certain probability some neurons are excluded from training at each iteration.
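Dropout is simple enough to sketch directly. This version is the "inverted" variant, which rescales the surviving activations during training so nothing has to change at classification time. That choice, the 0.5 rate and the function name are my assumptions, not details from the original work.

```python
import random


def dropout(activations, drop_prob=0.5, training=True, rng=random):
    """Randomly zero activations during training; pass through otherwise."""
    if not training or drop_prob == 0.0:
        return list(activations)
    keep_prob = 1.0 - drop_prob
    # Survivors are scaled by 1/keep_prob so the expected sum is unchanged.
    return [
        a / keep_prob if rng.random() < keep_prob else 0.0
        for a in activations
    ]


rng = random.Random(0)  # fixed seed so the run is repeatable
dropped = dropout([1.0, 2.0, 3.0, 4.0], drop_prob=0.5, training=True, rng=rng)
unchanged = dropout([1.0, 2.0, 3.0, 4.0], training=False)
```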
The CNN used in this work was proposed by Alex Krizhevsky: it has 5 convolutional layers and 3 fully connected layers.
Datasets and Data Preparation
Two datasets were used in the experiments: GMVRT-v1 and GMVRT-v2. The first one contains 4223 positive images and 8461 negative images with size of 64x128 each. The positive images were taken from different angles and variations of rotation, pitch and yaw. It contains images of people in different situations wearing a variety of clothes. In some cases, there is more than one person in the image. The negative images in this dataset contain urban and rural scenery with the same variations in rotation, pitch and yaw.
The second dataset contains 3846 positive images and 13821 negative images of size 128x128. The images in this dataset also contain people in different situations as well as different scenery images for the negative samples.
Convolutional Neural Networks require a large amount of data in order to perform well, so a data augmentation process was executed. Moreover, the numbers of positive and negative samples were unbalanced and needed to be corrected. Once the two datasets were merged, there were 8069 positive images and 22282 negative images in total. The augmentation process was then executed to even out these figures. It consisted of randomly choosing one of the positive pictures and applying (also randomly) one of three transformations: x/y shifting, rotation or scaling. After this the number of images amounted to 44564.
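The arithmetic behind those figures can be checked with a short sketch. It assumes that only positive images were generated until both classes had the same size, which is what the totals imply. The `plan_augmentation` helper and the transformation labels are hypothetical, and the actual image transformations are left out.

```python
import random

positives_v1, negatives_v1 = 4223, 8461    # GMVRT-v1
positives_v2, negatives_v2 = 3846, 13821   # GMVRT-v2

positives = positives_v1 + positives_v2    # 8069
negatives = negatives_v1 + negatives_v2    # 22282
to_generate = negatives - positives        # new positive images needed
total_after = negatives * 2                # both classes the same size

TRANSFORMS = ("shift", "rotate", "scale")  # hypothetical labels


def plan_augmentation(n_positive, n_needed, seed=0):
    """Pick a random source image index and a random transform for each new image."""
    rng = random.Random(seed)
    return [
        (rng.randrange(n_positive), rng.choice(TRANSFORMS))
        for _ in range(n_needed)
    ]


plan = plan_augmentation(positives, to_generate)
```

Under that assumption 14213 new positive images are needed, and the total comes out at the 44564 images reported above.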
Lastly, in order to be read by the framework, the images need to be converted to a file format called MDB. They first need to be manually split into validation and training sets, and then a script provided by the framework packs all images into a single MDB file. Note that this process resizes the images to 256x256 since this is the standard size for most deep learning frameworks.
Caffe and Convolutional Neural Network Training
The framework used to train and evaluate the CNN was Caffe, an open source framework specially designed for deep learning. It is written in C++ and its default configuration supports GPU processing. It was also used as the base implementation for the custom CNN developed by the author.
The training process was executed on an Amazon AWS G2 2x Large instance. This instance type is designed for GPU-based tasks, and a machine with a GPU yields a faster processing time. Moreover, the machine has 15GiB GDDR memory and 8 virtual CPUs, enabling a faster training time.
For the training process, the dataset was split into training and validation sets using a ratio of 70% and 20% respectively, while 10% of the data was saved for testing after the training process. The weights and biases in each layer of the network were updated using the backpropagation algorithm. Stochastic gradient descent (SGD) was the choice for optimization and the Softmax function was chosen for evaluating the loss at the end of each iteration. The learning rate and gamma were set to 0.005 and 0.1, with a step function applied every 30 epochs. The momentum was set to 0.9, which helps SGD converge faster to a satisfactory result. Finally, the maximum number of epochs was set to 1000.
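A step schedule like this one is easy to sketch: the learning rate starts at the base value and is multiplied by gamma every time a full step of epochs has passed. The function name is mine, and I am assuming the usual "base rate times gamma to the power of completed steps" rule.

```python
def step_learning_rate(epoch, base_lr=0.005, gamma=0.1, step_size=30):
    """Learning rate after `epoch` epochs under a step decay policy."""
    completed_steps = epoch // step_size
    return base_lr * gamma ** completed_steps


# Learning rate at the start of training and after each of the first two drops.
schedule = [step_learning_rate(e) for e in (0, 29, 30, 60)]
```

With these settings the rate stays at 0.005 for the first 30 epochs, then drops to 0.0005, then to 0.00005, and so on.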
An algorithm was developed in Python to reproduce the forward pass function of the Caffe framework. This function implements the classification process in the framework: it is the one responsible for applying the weights and biases of every layer when a new data point is fed into the model.
The output of the Caffe training process was used to generate files for weights and biases for each layer. These files could then be used to perform classification. Figure 6 shows a flowchart of the implemented algorithm.
The algorithm simply loads the new image as data and uses the weights and biases from each layer to compute the data for the next layer. In the convolutional layers some transformations are applied (according to the layer architecture), such as ReLU, LRN and max pooling. In the fully connected layers the dot product of the inputs and the weights is calculated and the biases are added, and at the end the class scores are generated. Some transformations might be applied in the fully connected layers, such as the dropout technique. At the end, every class (two in this case) has a score, and the classification simply picks the highest score to predict a class for the input data.
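The last part of that flow, fully connected layers followed by picking the highest score, can be sketched as below. This is not the code from the project: the layer sizes, weights and function names are made up for illustration, and the convolutional stages are left out.

```python
def fully_connected(inputs, weights, biases):
    """weights[k] holds the weights of output neuron k."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]


def relu_all(values):
    return [max(0.0, v) for v in values]


def predict(inputs, layers):
    """Run the data through each (weights, biases) pair and pick the best class."""
    data = inputs
    for index, (weights, biases) in enumerate(layers):
        data = fully_connected(data, weights, biases)
        if index < len(layers) - 1:
            data = relu_all(data)  # no activation after the final scores
    return data.index(max(data)), data


# Hypothetical two-layer network with two output classes.
layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.1]),
    ([[1.0, 0.0], [-1.0, 2.0]], [0.0, 0.0]),
]
predicted_class, scores = predict([2.0, 1.0], layers)
```

Here the second class ends up with the higher score, so `predicted_class` is index 1.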
The training process executed on the AWS machine took 6.8 hours to complete, yielding an average accuracy on the validation set of 99.46%. Even though the total number of iterations reached 26000, the optimization applied made the algorithm converge in the initial stages of training, with only trivial improvement afterwards.
To evaluate the correctness of the implemented algorithm, the Caffe framework was used to classify the 4456 test samples on a PC with 8GB of RAM and an Intel i3 processor. The framework reached an accuracy of 99.21%, very similar to the validation one. Note that the accuracy of the CNN is very high even though the images suffered slight changes during the data augmentation process.
The implemented algorithm was then used to classify the same test samples. The accuracy reached was 99.21%, the same as the framework’s.
The classification process was also run on the Raspberry Pi. The accuracy and confusion matrix were the same. One explanation for the small difference in the confusion matrices between the implementation and the framework lies in the way the two load the trained knowledge. The framework uses its own file, which might store the data with better precision, whereas the implementation took its data from the same file after exporting it to an intermediary format that it could read.
After the code was deployed to the Raspberry Pi, some timing tests as well as memory tests were run to evaluate the performance in an environment with limited resources. Once again, all test samples were classified, and an average classification time of 3.5s was found. As expected, the time taken to process the final layers is far less than that of the initial layers, due to the pooling process executed throughout the network, which reduces data size.
To analyse the memory used, one needs to evaluate the amount of data used by each layer. In this analysis we considered the amount of data needed to store the feature maps (trained knowledge) as well as the data produced by the layer. The final size of the data in each layer was calculated considering that each layer contains floating point numbers and that on the Raspberry platform a floating point number takes 4 bytes. At any given time it is necessary to have in memory the output produced by the previous layer as well as the feature map of the current layer; these two generate the output for the next layer, and so forth.
Note as well that the convolutional layers are sparser, so they hold less data than the fully connected layers. Consequently, in order to evaluate the maximum amount of memory required by the implementation, we’d need to look into one of the latter layers. As mentioned before, to be able to run, a layer needs its own feature map alongside the output of the previous layer at any given time. Thus the worst-case scenario is presented by fully connected layer 1, where the system needs approximately 152MB of memory available to compute the next layer.
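As a rough sanity check on that figure, here is a back-of-the-envelope estimate. It assumes the standard dimensions for the first fully connected layer of this architecture (a 6x6x256 input flattened to 9216 values and 4096 outputs), which are not stated in this write-up, and 4-byte floats as on the Raspberry Pi.

```python
BYTES_PER_FLOAT = 4


def fc_layer_bytes(n_inputs, n_outputs):
    """Memory to hold a fully connected layer's parameters plus its input and output."""
    weights = n_inputs * n_outputs
    biases = n_outputs
    return (weights + biases + n_inputs + n_outputs) * BYTES_PER_FLOAT


# Assumed dimensions: 6 * 6 * 256 = 9216 inputs, 4096 outputs.
fc1_bytes = fc_layer_bytes(6 * 6 * 256, 4096)
fc1_megabytes = fc1_bytes / 1e6
```

Under these assumptions the layer needs about 151 MB, almost all of it for the weight matrix, which is in line with the worst case reported above.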