Identifying venomous snakes with Deep Learning

Hermes Ribeiro Sant' Anna
The Artificial Neuron
11 min read · Jun 4, 2018

DeepSnakes — Part I

· Can an AI distinguish venomous from non-venomous snakes?

· We first try to distinguish python snakes from rattlesnakes.

· Part I trains and evaluates a Logistic Regression classifier.

· Results were insightful, considering the low complexity of the model.

How do we build deep learning architectures to tackle computer vision problems? The Deep Snakes series will cover the pipeline of using deep learning techniques to solve this one problem:

Can an artificial intelligence tell a venomous snake from a non-venomous snake by only seeing pictures of serpents?

This is a report on the experience and experiments of using increasingly complex neural network models, starting from the simplest logistic regression and working up to the latest deep learning architectures. The source code is available on GitHub.

A little bit of background

Artificial intelligence is all over the news lately. Although it seems like a recent theme, the idea of using computers to mimic human cognition dates back to the days of Alan Turing. The bulk of media coverage gravitates around neural networks, one of many machine learning methods, whose seed was planted about 70 years ago by McCulloch and Pitts. You may have heard of it by its super cool name: Deep Neural Networks, or Deep Learning. To make things clear, deep learning (DL) and deep neural networks (DNNs) are a set of techniques inside the broader spectrum of neural networks.

Deep learning achieved tremendous mainstream success from about 2012 onward because it proved powerful at performing tasks previously deemed possible only for humans. This is an ever-changing, ever-evolving field. As of today, DL excels at tasks involving image and voice understanding, text processing and translation, and game playing, as well as insight extraction from huge datasets.

The problem

Although there are some rules of thumb for guessing whether a snake is venomous, like observing the head shape and skin patterns, these are not reliable methods. The task of identifying a venomous snake should always be left to specialists. However, since neural networks have the ability to learn specialized tasks by seeing many examples, we will see whether we can get a neural network to correctly identify whether a snake is venomous from image observation alone. In other words, we will try to develop the best machine learning snake specialist possible.

In this problem, the first challenge is to acquire data for processing. Unlike humans, who can learn something new from a handful of examples, current machine learning models are data hungry, needing large numbers of examples to internalize any knowledge. For instance, you can teach a child what a cow, a bird, or a snake is with a few textbook images, while machines need hundreds or even thousands of pictures to learn the same concept.

In order to acquire the dataset, the first source for snake pictures that comes to mind is image search engines. This work used the Bing Image Search API, through a Python routine provided by Adrian Rosebrock, to search and download snake pictures from the internet.
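To give a flavor of the process, here is a minimal sketch of such a download routine, in the spirit of that tutorial. The endpoint and parameters follow the Bing Image Search v7 REST API of the time; the API key, directory layout, and helper name are placeholders, not the actual code used.

```python
import os
import requests

# Hypothetical sketch of a Bing Image Search download loop.
# API_URL follows the (legacy) Bing Image Search v7 REST API;
# API_KEY is a placeholder you would replace with your own key.
API_URL = "https://api.cognitive.microsoft.com/bing/v7.0/images/search"
API_KEY = "YOUR_BING_API_KEY"

def download_images(query, total, out_dir, batch_size=50):
    os.makedirs(out_dir, exist_ok=True)
    headers = {"Ocp-Apim-Subscription-Key": API_KEY}
    saved = 0
    for offset in range(0, total, batch_size):
        params = {"q": query, "offset": offset, "count": batch_size}
        results = requests.get(API_URL, headers=headers, params=params).json()
        for item in results.get("value", []):
            try:
                img = requests.get(item["contentUrl"], timeout=30)
                ext = os.path.splitext(item["contentUrl"])[1][:5] or ".jpg"
                path = os.path.join(out_dir, f"{saved:05d}{ext}")
                with open(path, "wb") as f:
                    f.write(img.content)
                saved += 1
            except Exception:
                continue  # skip dead links and unreadable files
    return saved

# download_images("python snake", 400, "dataset/python")
# download_images("rattlesnake", 400, "dataset/rattlesnake")
```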

We downloaded pictures of two different snake species to try out different models for distinguishing between them. After that, we manually screened the dataset to remove misclassified pictures and duplicates. We ended up with 317 images of pythons and 288 of rattlesnakes, a slightly unbalanced dataset with a ratio of about 52%/48%. We assume this will not be a big problem. To prepare the dataset for training in Python, we took the following steps (a minimal sketch follows the list):

1. Resized all images to 128×128×3 RGB pixels.

2. Shuffled python and rattlesnake images together.

3. Split the dataset into approximate 80/20 train/dev sets.

4. Joined these images into two 4D arrays of size [# of samples, 128, 128, 3].
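Here is a sketch of these four steps, assuming one directory of images per class (the paths, labels, and random seed below are illustrative, not the repository's actual layout):

```python
import glob
import numpy as np
from PIL import Image

def load_class(pattern, label):
    """Load every image matching pattern, resized to 128x128 RGB (step 1)."""
    images, labels = [], []
    for path in glob.glob(pattern):
        img = Image.open(path).convert("RGB").resize((128, 128))
        images.append(np.asarray(img))
        labels.append(label)
    return images, labels

py_x, py_y = load_class("dataset/python/*", 1)       # python -> label 1
rt_x, rt_y = load_class("dataset/rattlesnake/*", 0)  # rattlesnake -> label 0

X = np.stack(py_x + rt_x)            # step 4: [# of samples, 128, 128, 3]
y = np.array(py_y + rt_y)

rng = np.random.default_rng(0)       # step 2: shuffle both classes together
perm = rng.permutation(len(X))
X, y = X[perm], y[perm]

split = int(0.8 * len(X))            # step 3: ~80/20 train/dev split
X_train, y_train = X[:split], y[:split]
X_dev, y_dev = X[split:], y[split:]
```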

First model — Logistic Regression

Logistic Regression (LR) is a simple model for classification tasks with continuous features. It bears much resemblance to artificial neural networks, so we are going to call it the simplest neural network possible. It has only two layers: an input layer (which, strictly speaking, does not contain any neurons) and an output layer. In the input layer, we flatten each 3D array [width, height, color channels] containing a single image into a 1D vector of size [width × height × channels], rendering a layer with 49152 neurons. Each pixel can take 256 values indicating the color intensity in each channel. We preprocess these values so that each pixel intensity lies between 0 and 1. Moreover, since our first problem is to choose between only two classes, we will have a single output neuron, which should ideally compute a value of 1 if the image contains a python or 0 if it contains a rattlesnake. The computation inside the output neuron is an affine transformation followed by a sigmoid activation function. More on that later. Translating the problem into a sentence:

Given a linearized input image, can we teach a single neuron to tell the probability of the image being either of the two snakes?

Note that, since this neuron outputs a probability rather than a binary number, its values can be any real number from 0 to 1. To transform it into a binary decision, we specify a cutoff. The most trivial choice is to say that if this number is smaller than 0.5, the picture probably contains a rattlesnake; otherwise, it probably contains a python. A sketch of this computation appears below.
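A minimal sketch of this inference path, assuming a trained weight vector w (49152 entries) and bias b, both hypothetical here:

```python
import numpy as np

def sigmoid(z):
    """Squash any real number into the (0, 1) range."""
    return 1.0 / (1.0 + np.exp(-z))

def predict(image, w, b):
    x = image.reshape(-1) / 255.0  # flatten 128x128x3 -> 49152, scale to [0, 1]
    z = np.dot(w, x) + b           # affine transformation (pre-activation)
    a = sigmoid(z)                 # probability of "python"
    return ("python" if a >= 0.5 else "rattlesnake"), a
```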

We start with only two classes of snakes and logistic regression because most modelling endeavors should begin simple and grow in complexity step by step. Growing slowly into complexity has stood the test of time as the least time-consuming approach to get from absolute ignorance to a rich and complex modelling scheme. Ignoring this advice can lead to a disastrous waste of time correcting problems from many different sources at once.

Training the model

Developing a model to perform this task is called training. During a single training step, the model:

· “Sees” a picture (feed-forward computation).

· Guesses, on a scale of 0–1, what is in the picture (inference).

· Compares it with the real label (0 or 1) and calculates an error (loss computation).

· Decides which direction to go in order to decrease the error (backpropagation).

· Takes a small step in the direction of greatest error improvement (optimization).

Let us go into the details for each step.

Feed-forward step: In LR models, there is one weight for each input pixel, which can be any positive or negative real number. The magnitude of each weight determines the relative importance of its pixel to the classification task. The sign of each weight determines whether its pixel contributes positively or negatively to the decision-making process (assuming the bias is zero; a similar logic holds for biased neurons). Therefore, "seeing" an image amounts to weighting all pixels in the input, summing the results, and adding a bias. This results in one single real number named the affine transformation result, pre-activation value, or simply Z.

Inference step: Since Z can be anywhere between -∞ and ∞, it is hard to interpret whether this value signals a python or a rattlesnake image. To make this inference, we pass Z through a sigmoid activation function, which squashes any real number into the 0–1 range. This value is called the activation. In our case, if the activation is greater than 0.5, the model infers a python; otherwise, it infers a rattlesnake.

Loss computation step: Now that the LR has output an activation between 0 and 1, we can compare it to the original picture label and calculate the inference error. For classification problems, it is typical to use binary log loss. Without going into its equation, log loss measures the prediction error on a scale from 0 to ∞, where 0 means the prediction is exactly right, and the greater the loss, the further the prediction is from the original label. As an example, if the predicted value is 0.999 (probably a python) and the image label is 1 (actually a python), the log loss will be close to zero (about 0.001). Since the predicted value is close to the label, the error (loss) is relatively small. However, if the predicted value is, again, 0.999 (probably a python) but the real label was 0 (actually a rattlesnake), the loss would be about 6.9, showing the model is completely wrong about this prediction.
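For the curious, the equation is loss = −[y·log(a) + (1−y)·log(1−a)], where y is the label and a the predicted activation. A tiny sketch reproducing the two worked examples above:

```python
import numpy as np

def log_loss(y, a):
    """Binary log loss between label y (0 or 1) and predicted activation a."""
    return -(y * np.log(a) + (1 - y) * np.log(1 - a))

print(log_loss(1, 0.999))  # correct "python" guess -> ~0.001
print(log_loss(0, 0.999))  # confidently wrong guess -> ~6.9
```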

Backpropagation step: This holds a loose analogy to retrospective thinking. The model goes back through its decision-making process to figure out, pixel by pixel, in which direction it should change to decrease the cost. In numerical terms, the algorithm computes how to change its inner neural weights in order to make a better prediction. The result is a vector containing the direction and magnitude of the greatest decrease in cost.

Optimization step: Finally, the algorithm changes the weight values in an attempt to decrease the cost. To make this change, we use batch gradient descent (also known as optimization by steepest descent). Gradient descent is analogous to a rabbit going down a valley: knowing the direction and magnitude of greatest descent (the gradient), the rabbit takes a small leap in the [x, y] direction to approach the bottom of the valley (descent). That is exactly what gradient descent does: it takes a small leap in the [weight1, weight2, …, weight49152] space in the direction of greatest descent, aiming to reach the bottom of the loss valley (the point of smallest error). "Batch" means that the algorithm tries to minimize the sum of errors over all 484 training images combined. This summation of losses yields one objective function variable called the "cost". The leap intensity is controlled by a hyperparameter called the learning rate.
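Putting the five steps together, here is a sketch of one vectorized batch gradient descent step, assuming X holds the flattened, scaled training images in shape [m, 49152] and y the 0/1 labels (the learning rate value is illustrative):

```python
import numpy as np

def train_step(X, y, w, b, lr=0.005):
    """One batch gradient descent step for logistic regression."""
    m = X.shape[0]
    z = X @ w + b                     # feed-forward: affine transformation
    a = 1.0 / (1.0 + np.exp(-z))      # inference: sigmoid activation
    cost = -np.mean(y * np.log(a) + (1 - y) * np.log(1 - a))  # batch log loss
    dz = a - y                        # backpropagation through sigmoid + loss
    dw = X.T @ dz / m                 # gradient w.r.t. each of 49152 weights
    db = np.mean(dz)                  # gradient w.r.t. the bias
    w -= lr * dw                      # optimization: small step downhill
    b -= lr * db
    return w, b, cost
```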

The learning algorithm repeatedly performs this five-step sequence for all images in the training set, until we judge that the model has reached the smallest error possible given the problem, the dataset, and the machine learning architecture. The picture below shows how a gradient descent walk in 2D happens. Four situations may occur, depending on the learning rate and the initial position: a) if the leap reaches too far, the rabbit can bounce back and forth out of the valley, increasing its altitude instead of decreasing it; b) conversely, if the leap is too small, it may get trapped in a local minimum, never finding the global minimum; c) although it jumped in the direction of greatest descent, it may sometimes end up higher than it previously was, bouncing high and low until it finally gets as close as possible to the bottom; d) if it starts from the right place and takes an ideal leap size, it will smoothly reach the bottom.

The four paths are called: a) divergence, b) the local minimum problem, c) oscillatory convergence, and d) smooth convergence. We are ultimately trying to train our model via smooth convergence. However, we often encounter paths a), b), and c) during training.
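A toy one-dimensional illustration of these regimes, using f(w) = w² (gradient 2w) as a stand-in loss; case b) requires a non-convex loss, so it is not reproducible here, and the learning rates are chosen purely to provoke each behavior:

```python
def descend(lr, w=1.0, steps=5):
    """Run a few gradient descent steps on f(w) = w**2 and record the path."""
    path = [w]
    for _ in range(steps):
        w = w - lr * 2 * w  # gradient descent update with gradient 2w
        path.append(round(w, 3))
    return path

print(descend(lr=1.1))   # a) divergence: |w| grows with every leap
print(descend(lr=0.75))  # c) oscillatory convergence: sign flips, shrinking
print(descend(lr=0.1))   # d) smooth convergence toward the minimum at 0
```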

Since it is impossible to visualize plots in higher dimensions (the 49152 weights plus one bias make 49153 dimensions in this snake problem), a workaround is to plot the magnitude of the error along each jump. The image below shows how each trajectory displayed above translates into a learning curve in two dimensions.

Results

We ran the above training procedure on 80% of the images (exactly 484), divided into python snakes and rattlesnakes. As an abstraction: we repeatedly showed an artificial intelligence 484 images of snakes in random order, with their names on them, and asked it to learn how to tell the difference between them. Each pass through all the images is called an epoch. We ran the training procedure for 500 epochs and recorded the loss at each one. We also showed the model the remaining 121 images as a test of its capacity, i.e., we tested its inference ability after each epoch and recorded the losses. The first error is called the train error and the second the dev error (also known as the test error). You can see below the learning curves for both datasets.
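A sketch of the epoch loop that produces these curves, reusing the train_step sketch from above and assuming X_train_flat and X_dev_flat are the flattened, scaled arrays (the names are illustrative):

```python
import numpy as np

# Record one train and one dev loss per epoch to plot the learning curves.
train_losses, dev_losses = [], []
for epoch in range(500):
    w, b, train_cost = train_step(X_train_flat, y_train, w, b)
    a_dev = 1.0 / (1.0 + np.exp(-(X_dev_flat @ w + b)))
    dev_cost = -np.mean(y_dev * np.log(a_dev) + (1 - y_dev) * np.log(1 - a_dev))
    train_losses.append(train_cost)
    dev_losses.append(dev_cost)
```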

The train error is also called the in-sample error, i.e., the error specific to the images the model has already trained on. The dev error, however, is the one we are most interested in evaluating, as it assesses whether the model can generalize to images other than the ones it has already seen; it is an approximation of the out-of-sample error. The ideal model is one with the ability to generalize its knowledge to any instance of a python or a rattlesnake.

In the earlier epochs, the train error swings down, then up, then down again, before steadily decreasing until training finishes. The dev error, however, oscillates heavily until about epoch 300. In the picture, it looks as though there is a wide orange patch that becomes thinner across the epochs; this is actually a continuous line oscillating heavily. By the end of these 300 epochs, the dev error becomes roughly constant, showing that the model cannot generalize any further. After 500 epochs, the model has a training error of about 0.4 and a dev (generalization) error of about 0.65.

In terms of accuracy, the logistic regression model correctly identified 442 out of 484 images (91% accuracy) in the training set. In terms of generalization, it correctly identified 79 out of 121 images (65% accuracy) in the dev set. The baseline against which to compare these performances is random guessing: if the model consisted only of a random number generator between 0 and 1, its performance would hover around 50% accuracy. Therefore, having 91% train accuracy and 65% dev accuracy shows, with high probability, that the model is actually learning distinguishing features in the images in order to decide the correct class of snake.
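Accuracy here is simply the fraction of predictions that match the labels under the 0.5 cutoff. A minimal sketch, reusing the hypothetical names from the earlier sketches:

```python
import numpy as np

def accuracy(X, y, w, b):
    """Fraction of images whose thresholded prediction matches the label."""
    a = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    return np.mean((a >= 0.5) == y)

print(accuracy(X_train_flat, y_train, w, b))  # ~0.91 reported in the text
print(accuracy(X_dev_flat, y_dev, w, b))      # ~0.65 reported in the text
```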

Conclusion

Beginning simple with logistic regression yielded some important insights. First, the baseline performance for this problem increased from 50% to 65% dev accuracy, meaning more complex models will only be relevant if they outperform this metric. Second, we do not need to worry about the train error so far: because the train accuracy was 91%, and could be even higher if we trained longer, we can focus our efforts on decreasing the dev error and increasing the dev accuracy. This might seem like a trivial observation; however, in some tasks the model cannot even learn the training set properly, let alone generalize its knowledge to examples it has never seen. Finally, we also learned that the model will likely follow an oscillatory convergence path, and we should take measures, whenever possible, to minimize this issue. There are other performance evaluations we could conduct, such as hyperparameter tuning and error analysis. However, since this model is so simple, we will leave these tasks to the next model: shallow neural networks, followed by deep learning architectures.
