Image Classification With Convolutional Neural Networks: What do they do? How do they do it? Let’s find out!

By: Dylan Zucker, Melanie Cheng, Ryan Pasculano

Bucknell AI & CogSci
8 min read · Nov 1, 2018

--

Our team was tasked with researching any artificial intelligence (AI) related topic for our Artificial Intelligence and Cognitive Science course and coding an implementation to demonstrate what we learned. In this article we explain the basics of convolutional neural networks (CNNs) and how we applied what we learned by building two separate image classifiers. For the implementation part of our project, we used Keras with the TensorFlow backend, then observed how the structure of the network affected the overall accuracy of the model. We trained and tested on a dataset of 10,000 images of cats and dogs for 30 epochs, with 3,500 training samples per epoch, until the model consistently achieved an error rate of 10%. To explore and understand the optimal structure for our model, we researched previous work on CNNs with comparatively low error rates, including AlexNet, GoogLeNet, and ResNet.

What Do They Do?

In a traditional neural network (NN) there are just three types of layers: an input layer, an output layer, and hidden layers. Each layer is fully connected, meaning that every node in a given layer is connected to every node in the previous and next layers. Additionally, these layers carry certain parameters, such as momentum, learning rate, and an activation function, that dictate how the network behaves. A CNN has different types of hidden layers, activation functions that differ from layer to layer, numerous training sets, and more, all of which allow for feature recognition. These features make it more suitable for image classification than other types of networks (Gandhi, 2018).

Figure 1: How CNNs look for characteristics that distinguish one object from another in order to identify it. From https://techcrunch.com/2017/04/13/neural-networks-made-easy/

How Do They Do It?

The various types of hidden layers that distinguish CNNs from a traditional NN are convolutional, pooling, dense, and flattening layers. A dense layer is synonymous with a fully connected layer in a traditional neural network. Convolutional layers are the most complex of the four types and vary depending on filters, stride, padding, and depth. A filter is an NxN grid of weights that are multiplied by the values of a small section of the input to produce a single value in the output. This filter is then slid across the entire input to generate the output (see Figure 2). The input to a given convolutional layer has height, width, and depth attributes, and each filter has the same depth as the input. The stride determines how far the filter slides after each calculation. To ensure that each pixel of the image is processed equally, padding is added around the edges: the height and width of the image are expanded by adding zeros around the border, which can be seen in Figure 2 (Fei-Fei, Johnson, & Yeung, 2018).

In the figure below, we can observe a convolutional layer that has a 5x5x3 input and two filters with dimensions of 3x3x3. The filter shifts by two after each calculation, and there is one layer of zeros around the border, so the padding is one. The output has a depth of 2 because there are two filters that the input runs through before the values propagate to the output. An equation to calculate the output layer size is (N - F) / S + 1, where N is the dimension of the padded input, F is the filter size, and S is the stride (Fei-Fei et al., 2018). In the example below N = 7 (the 5x5 input plus one layer of padding on each side), F = 3, and S = 2, so the height and width of the output will be (7 - 3) / 2 + 1 = 3. The depth of the output layer remains 2, matching the number of filters.

Figure 2: Demonstrates the use of filters and padding to calculate values for the output. From https://medium.freecodecamp.org/an-intuitive-guide-to-convolutional-neural-networks-260c2de0a050
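As a quick sanity check on that formula, here is a minimal sketch (our own illustration, not code from the project) that folds the padding into the calculation and reproduces the Figure 2 numbers:

```python
def conv_output_size(n, f, s, p):
    """Spatial size of a convolution output: (N - F + 2P) / S + 1,
    where N is the unpadded input size and P is the padding."""
    return (n - f + 2 * p) // s + 1

# The Figure 2 example: 5x5 input, 3x3 filter, stride 2, padding 1.
print(conv_output_size(5, 3, 2, 1))  # -> 3
```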

A pooling layer is used to scale the image down by sampling small areas and keeping only the most significant features. Similar to a convolutional layer, it slides across the image and looks at only a small portion of the picture at a time. There are several ways to perform the calculation, but the most common methods take either the maximum or the average value (LeCun, 1998). For example, if the input is a 4x4 image and the filter takes the maximum value over a 2x2 area with a stride of two, the output is a 2x2 image containing the maximum value from each quadrant of the original (see Figure 3 below). Lastly, a flattening layer takes the three-dimensional image layer and rearranges the nodes into a single one-dimensional layer to allow for easy conversion into a single column of output nodes.

Figure 3: Maximum pooling with a 2x2 filter and a stride of 2. From https://www.oreilly.com/ideas/visualizing-convolutional-neural-networks
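Here is a tiny NumPy sketch of the max pooling in Figure 3 (with our own example values): the 4x4 input is split into 2x2 blocks and the maximum of each block is kept.

```python
import numpy as np

x = np.array([[1, 3, 2, 4],
              [5, 6, 7, 8],
              [3, 2, 1, 0],
              [1, 2, 3, 4]])

# 2x2 max pooling with stride 2: reshape into 2x2 blocks, then take
# the maximum within each block (axes 1 and 3 index inside a block).
pooled = x.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)  # [[6 8]
               #  [3 4]]
```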

Another aspect of CNNs is the activation function. In CNNs, the Rectified Linear Unit (ReLU) has largely replaced the sigmoid function commonly used in classic NNs: ReLU sets all negative values to zero while leaving positive values untouched. This is computationally more efficient because it ignores extra noise in the image, and the function has a simpler derivative for later calculations. The simplified derivative for backpropagation is useful in reducing the runtime of large networks. In image classification with numerous possible classes, the output layer usually uses one-hot encoding, where each class has its own output node that is set to 1 when the object in question is present in the image. The softmax function takes all of the output values into consideration and returns the probability of the image belonging to each class (Softmax function, 2018). This normalizes the values and allows the CNN to easily rank its predictions (i.e., first, second, and third guess). One downside of softmax is that, unlike ReLU and sigmoid, it can only be used in the final output layer.

Figure 4: Example of softmax function from https://sefiks.com/2017/11/08/softmax-as-a-neural-networks-activation-function/
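Both functions are simple enough to sketch in a few lines of NumPy (again, an illustration of the math rather than our project code):

```python
import numpy as np

def relu(x):
    # Negative values become zero; positive values pass through.
    return np.maximum(0, x)

def softmax(x):
    # Subtract the max for numerical stability, then normalize so
    # the outputs are probabilities that sum to 1.
    e = np.exp(x - np.max(x))
    return e / e.sum()

print(relu(np.array([-2.0, 3.0])))         # [0. 3.]
print(softmax(np.array([1.0, 2.0, 0.1])))  # ranked class probabilities
```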

It is nearly impossible to train a CNN on every possible image of an object because there are seemingly infinitely many pictures of a single object. To get around this, a subset of pictures is selected so the network can be trained on a reasonable number of images. Throughout the training process, three independent datasets are used: a training set for fitting the weights, a validation set for periodic checking during training, and a test set to provide an unbiased evaluation of the final network. Datasets typically range from a few hundred to over a million images depending on the complexity of the CNN and the computational power of the GPU. The validation set measures the model's accuracy on data separate from the training data while training is still in progress. Once training is finished, the test set is used to determine the accuracy over random samplings of images the network has never seen (Shah, 2017). The information obtained through the test set makes it possible to compare the performance of different structures.

Figure 5: Testing error, validation error, and overfitting from https://buzzrobot.com/bias-and-variance-11d8e1fee627
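In Keras, wiring up the three splits might look like the sketch below. The model and data here are toy placeholders (random numbers, not our cats-and-dogs set); the point is where each split enters the workflow.

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Dense

# Toy stand-in data: 8 features per sample, binary labels.
rng = np.random.RandomState(0)
x_train, y_train = rng.rand(100, 8), rng.randint(2, size=100)
x_val, y_val = rng.rand(20, 8), rng.randint(2, size=20)
x_test, y_test = rng.rand(20, 8), rng.randint(2, size=20)

model = Sequential([Dense(16, activation='relu', input_shape=(8,)),
                    Dense(1, activation='sigmoid')])
model.compile(loss='binary_crossentropy', optimizer='adam',
              metrics=['accuracy'])

# The validation set is checked after every epoch during training...
model.fit(x_train, y_train, epochs=5, validation_data=(x_val, y_val))
# ...and the test set gives the final, unbiased evaluation.
loss, acc = model.evaluate(x_test, y_test)
```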

Lastly, the practically infinite number of possible inputs makes overtraining a real concern. As a regular NN trains, its error decreases and convergence slows until the error becomes minimal, giving maximum accuracy. When analyzing images, training on every possible input so the network can learn the entire set is simply not possible. If a CNN is trained for too long, it starts overfitting: it becomes very good at identifying the specific training set images rather than learning the distinct characteristics of the object, and as a consequence its ability to accurately classify new images decreases ("How Neural Networks are Trained", 2018). Imagine that you are trying to fit a curve to three points on a graph. A quadratic polynomial would easily work, but so would a higher-degree polynomial. Here the higher-degree polynomial would oscillate between the points, and any attempt at interpolating from it would be less useful than the lower-degree fit. Using dropout during training is one method of avoiding overtraining. Dropout is a parameter between 0 and 1 representing the probability that a node in a given layer is excluded from the model during training, and it can be lowered as the CNN increases in size. The process causes the remaining connections to become more significant and reduces the total number of computations, which can be a limiting factor in larger CNNs.

Figure 6: Example of dropout from http://jmlr.org/papers/volume15/srivastava14a.old/srivastava14a.pdf
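A bare-bones sketch of what dropout does during training (this is the common "inverted dropout" formulation, our illustration rather than Keras internals):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, rate=0.5):
    # Each node survives with probability (1 - rate); survivors are
    # scaled up so the expected total activation is unchanged.
    mask = rng.random(activations.shape) >= rate
    return activations * mask / (1.0 - rate)

# Roughly half of the eight activations are zeroed, the rest doubled.
print(dropout(np.ones(8)))
```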

Let’s Find Out!

Through our research and various test runs, our network worked best with a dropout rate of 0.5, three total convolutional layers (two initial 32x32x3 layers and one 64x64x3 layer) using the ReLU activation function, and max pooling layers after each convolutional layer with a pool size of 2x2. To account for the RGB values of the pixels, the depth of the convolutional layers was set to 3. After the final max pooling layer we flatten the output so it can feed into a dense layer of sixty-four nodes. This layer then feeds into a dropout layer, with a rate of 0.5, which feeds into the final dense layer: the output layer. Using the model we built, we expanded our understanding of CNNs by incorporating a third class of training images, bananas, and experienced the negative effects of overtraining firsthand.
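As a rough sketch, the architecture described above might be written in Keras as follows. This is a hedged reconstruction, not our exact code: we read the convolutional layer sizes as filter counts of 32, 32, and 64, and the 3x3 kernel size, 150x150 input, output head, and optimizer are all assumptions for illustration.

```python
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout

model = Sequential([
    # Three convolutional layers, each followed by 2x2 max pooling.
    Conv2D(32, (3, 3), activation='relu', input_shape=(150, 150, 3)),
    MaxPooling2D(pool_size=(2, 2)),
    Conv2D(32, (3, 3), activation='relu'),
    MaxPooling2D(pool_size=(2, 2)),
    Conv2D(64, (3, 3), activation='relu'),
    MaxPooling2D(pool_size=(2, 2)),
    # Flatten into a 64-node dense layer, then dropout at 0.5.
    Flatten(),
    Dense(64, activation='relu'),
    Dropout(0.5),
    # Assumed binary output head for the cats-vs-dogs classifier.
    Dense(1, activation='sigmoid'),
])
model.compile(loss='binary_crossentropy', optimizer='adam',
              metrics=['accuracy'])
```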

CNNs are complex in nature, and there are additional topics, such as preprocessing images via normalization to reduce the scale of the input data, that we did not look into due to both time constraints and complexity. Keras also provides optimization techniques such as RMSprop, which adapts the learning rate during training. Nor did we consider the many possible metrics that can be used to judge the precision and accuracy of a neural network. These are areas we would have explored with more time, and might still in the future. Now go out there and build yourself an AI to take over the world.

Check out our code HERE!
