DeepSnakes — Part III

Hermes Ribeiro Sant' Anna
The Artificial Neuron
Jul 2, 2018

· Can an AI distinguish venomous from non-venomous snakes?

· The first task is to distinguish python snakes from rattlesnakes.

· We try two deep learning approaches: a simple CNN and AlexNet.

· The AlexNet approach reaches 91.74% dev set accuracy.

How do we build deep learning architectures to tackle computer vision problems? The DeepSnakes series covers the pipeline of using deep learning techniques to solve this one problem:

Can an artificial intelligence tell a venomous snake from a non-venomous snake by only seeing pictures of serpents?

This is a report on the experiences and experiments of using increasingly complex NN models, starting with the simplest logistic regression up to the latest deep learning architectures.

In the previous post DeepSnakes — Part II, we used shallow networks to try to improve the model prediction. Although modest, there were some visible improvements from adding a single hidden layer between the input and the output. We now focus on applying deep learning schemes to improve our model's predictive performance even further.

You can find the code behind this task on GitHub.

Architecture

Our first approach is to build a simple convolutional neural network (CNN) to explore the basics of deep learning for computer vision. The main building blocks of a CNN are convolutional and pooling layers. It is common practice, however, to call a block containing one or more convolutions followed by a pooling operation a single layer. Below we explain the operations behind each building block.

Convolution

Convolution consists of sweeping a filter of size k x k pixels over patches of the input image of W x H pixels, where k < W and k < H. Since RGB images are a composition of three pixel matrices, with one channel for each primary color, the actual dimension of each filter is k x k x 3, just as the image's actual dimension is W x H x 3. However, for the sake of simplicity, we will omit the number of channels; you should take it as implicit in all demonstrations. The best way to grasp this operation is by watching an animation.
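As a toy illustration, here is what one such sweep looks like in plain NumPy. The 5x5 image and the Sobel filter below are illustrative choices for this sketch, not part of our model:

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide a k x k filter over the image (no padding, stride 1)."""
    k = kernel.shape[0]
    H, W = image.shape
    out = np.zeros((H - k + 1, W - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Each output activation is the elementwise product of the
            # filter with one image patch, summed into a single number.
            out[i, j] = np.sum(image[i:i + k, j:j + k] * kernel)
    return out

# An image with a sharp vertical brightness step...
image = np.zeros((5, 5))
image[:, 3:] = 1.0
# ...and a vertical edge detector (Sobel filter)
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]])
edges = convolve2d(image, sobel_x)
# The output lights up exactly where the brightness step is.
```

Note how the output shrinks from 5x5 to 3x3: without padding, a k x k filter produces a (W - k + 1) x (H - k + 1) activation map.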

Usually, each layer contains several filters. The activations of all filters are stacked along the third dimension, which increases the depth of the 3D activation matrix (tensor) at each convolutional layer. Intuitively, convolutions act as edge detectors in the first layers of a neural network and as high-level feature detectors in deeper layers, able to detect complex geometries such as human heads, car wheels and animal legs. They can also detect complex texture patterns, as Neural Style Transfer demonstrates.

Pooling

Pooling layers lump together groups of pixels by applying a simple statistic that turns many pixels into one activation. As a result, pooling operations are useful for decreasing the dimensionality of any hidden activation matrix. The most common forms of pooling are max and average pooling. Once again, this operation is best visualized through an animation.
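A minimal NumPy sketch of max pooling (the input values are made up for illustration):

```python
import numpy as np

def max_pool(x, size=2, stride=2):
    """Replace each size x size window with its maximum value."""
    H, W = x.shape
    out_h = (H - size) // stride + 1
    out_w = (W - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = x[i * stride:i * stride + size,
                      j * stride:j * stride + size]
            out[i, j] = patch.max()  # average pooling would use patch.mean()
    return out

x = np.array([[1., 2., 5., 6.],
              [3., 4., 7., 8.],
              [9., 0., 1., 2.],
              [5., 6., 3., 4.]])
pooled = max_pool(x)
# 2x2 pooling with stride 2 halves each spatial dimension: 4x4 -> 2x2
```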

Finally, it is worth noting that CNNs usually employ a regularization technique called dropout. Dropout randomly drops (multiplies by zero) a proportion of the activations at each training step. This technique helps avoid overfitting, since the final model is effectively a combination of several random sub-architectures.
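A sketch of this idea in NumPy, using the common "inverted dropout" formulation, where surviving activations are rescaled by 1/(1 - rate) so their expected value is unchanged. The rate and array size here are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, rate=0.5, training=True):
    """Zero out a fraction `rate` of activations and rescale survivors."""
    if not training:
        # At test time the full network is used, with no dropping.
        return activations
    mask = rng.random(activations.shape) >= rate
    return activations * mask / (1.0 - rate)

a = np.ones((4, 4))
dropped = dropout(a, rate=0.5)
# Each activation is now either 0 (dropped) or 2 (kept and rescaled)
```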

First architecture: neural model with three hidden layers

Our first attempt is a model with two convolution-pooling layers, followed by a fully connected hidden layer. In order to connect the conv-pool layers to the fully connected layer, we linearize the 3D activation tensors, just as we linearized our images in the previous posts (Part I and Part II).
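In Keras, such an architecture could be sketched roughly as follows. The filter counts, input resolution and hidden-layer size here are illustrative assumptions, not the exact hyperparameters of our model; see the GitHub repository for the real code:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Two conv-pool blocks, then flatten into a fully connected hidden layer.
model = models.Sequential([
    layers.Input(shape=(64, 64, 3)),           # illustrative image size
    layers.Conv2D(32, 3, padding="same", activation="relu"),
    layers.MaxPooling2D(2),
    layers.Conv2D(64, 3, padding="same", activation="relu"),
    layers.MaxPooling2D(2),
    layers.Flatten(),                          # linearize the 3D tensor
    layers.Dense(128, activation="relu"),      # fully connected hidden layer
    layers.Dense(1, activation="sigmoid"),     # python vs. rattlesnake
])
model.compile(optimizer="adam",
              loss="binary_crossentropy",
              metrics=["accuracy"])
```

Because the convolutional filters are small and shared across the whole image, almost all of this model's parameters live in the final dense layers rather than in the convolutions.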

This architecture rendered a lower dev set error (0.5392) than the shallow NN (0.5955); the dev set accuracy, however, is about the same (~70%). Despite the modest difference, this intermediate result reveals a crucial fact: the CNN has over 100 times fewer parameters (190,017) than the shallow NN (20,644,681). This is a clear demonstration of the expressive power that convolutional neural networks have over shallow fully connected ones, and it shows why we should stick to this type of architecture for real improvements in accuracy.

Second architecture: AlexNet

AlexNet is a pioneering deep learning architecture. Developed by Alex Krizhevsky, Ilya Sutskever and the godfather of deep learning, Geoffrey Hinton, this algorithm won the ILSVRC 2012 competition by a large margin, proving that deep learning was the top technique for image classification and localization.

AlexNet is made of eight learned layers, five convolutional and three fully connected, with the following characteristics:

96 kernels (filters) of size 11x11x3 with a stride of 4 pixels

  • Response normalization
  • Pooling (size 3, stride 2)

256 kernels of size 5x5x96

  • Response normalization
  • Pooling (size 3, stride 2)

384 kernels of size 3x3x256

384 kernels of size 3x3x384

256 kernels of size 3x3x384

  • Pooling (size 3, stride 2)

Two consecutive fully connected layers with 4096 neurons each and a dropout rate of 0.5, followed by the final fully connected output layer
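The layer list above could be sketched in Keras roughly as follows. Two assumptions for illustration: BatchNormalization stands in for local response normalization, and a single sigmoid output replaces the original 1000-way softmax, since our problem has two classes:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# AlexNet-style stack: five conv layers, then two dense layers of 4096
# units with dropout, then the output layer. 227x227 is the standard
# AlexNet input resolution.
model = models.Sequential([
    layers.Input(shape=(227, 227, 3)),
    layers.Conv2D(96, 11, strides=4, activation="relu"),
    layers.BatchNormalization(),
    layers.MaxPooling2D(pool_size=3, strides=2),
    layers.Conv2D(256, 5, padding="same", activation="relu"),
    layers.BatchNormalization(),
    layers.MaxPooling2D(pool_size=3, strides=2),
    layers.Conv2D(384, 3, padding="same", activation="relu"),
    layers.Conv2D(384, 3, padding="same", activation="relu"),
    layers.Conv2D(256, 3, padding="same", activation="relu"),
    layers.MaxPooling2D(pool_size=3, strides=2),
    layers.Flatten(),
    layers.Dense(4096, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(4096, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(1, activation="sigmoid"),
])
```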

We employed a similar architecture, with some simplifications such as using half the number of filters and replacing response normalization with batch normalization. We (and the paper) perform dataset augmentation. This technique consists of applying random cropping, mirroring and rotation to the input images, in order to render a model that is, as far as possible, invariant to translation, zoom and rotation. For more details concerning our implementation, please see the complete Jupyter notebook.
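A minimal NumPy sketch of random-crop-and-mirror augmentation — the image and crop sizes are illustrative, and real pipelines (including ours) typically add rotations and apply these transforms on the fly during training:

```python
import numpy as np

rng = np.random.default_rng(42)

def augment(image, crop=56):
    """Take a random crop of the image and mirror it half of the time."""
    H, W, _ = image.shape
    top = rng.integers(0, H - crop + 1)
    left = rng.integers(0, W - crop + 1)
    patch = image[top:top + crop, left:left + crop]
    if rng.random() < 0.5:
        patch = patch[:, ::-1]  # horizontal mirror
    return patch

# Eight different training examples derived from a single source image
image = rng.random((64, 64, 3))
batch = np.stack([augment(image) for _ in range(8)])
```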

After 1000 training epochs, this model rendered a dev set error of 0.3675 and an accuracy of 91.74%. The results speak for themselves. Thanks to the combination of several improvements over our first model, such as deeper convolutional layers, dropout, dataset augmentation and batch normalization, we obtained an absolute jump of about 20% in accuracy. We are still far from 100% accuracy, but this clearly shows that deep learning is the way to go for our ultimate goal: telling venomous from non-venomous snakes.

Conclusion

We are done with telling python snakes from rattlesnakes. From now on, our focus is to build an excellent architecture to tackle the bigger problem: telling venomous from non-venomous snakes. AlexNet showed amazing performance compared to every other architecture we tried previously. However, we intend to use even more advanced architectures in the future. We also gained a lot of insight along the way. Besides using dataset augmentation, we need more training images, since dataset size directly affects our results. We would also like to increase the image resolution, in order to capture more nuanced features from snake images. Hopefully, with these changes, we can obtain an even higher accuracy on our end goal.
