Image Classification with Deep Learning: A theoretical introduction to machine learning and deep learning

Published in

Analytics Vidhya

10 min read · Jan 22, 2020

When it comes to identifying images, humans can usually recognize a vast amount of detail in objects with ease. Over the years, we train our brains to interpret what we see and thus subconsciously form our picture of reality.

With advanced technology becoming an irreplaceable part of our lives, breakthroughs such as artificial intelligence emerge, offering enormous potential across diverse fields and use cases.

In particular, convolutional neural networks (CNNs), a class of deep neural networks, are applied to analysing images. Making computer vision possible, or in other words enabling machines to see the world as humans do, is no longer a vision of the future.

In this article, you will get a short theoretical introduction to machine learning and deep learning and a better understanding of how images are classified with deep learning.

1. Difference between AI, Machine Learning and Deep Learning
2. Introduction to Machine Learning Theory
2.1 Key Terminology
2.2 Fundamental Concepts
3. Deep Learning: Neural Networks
3.1 Perceptrons
3.2 Sigmoid Neurons
3.3 Architecture of Neural Networks
4. Deep Learning: Image Classification with Convolutional Neural Networks

1. Difference between AI, Machine Learning and Deep Learning

To understand the concept of deep learning, it is crucial to place it in the context of artificial intelligence and machine learning. In computer science, artificial intelligence (AI) is a broad term that describes intelligence displayed by machines. It is characterized by the imitation of natural human intelligence, such as learning or problem-solving (Russell and Norvig, 2003). Machine learning is the ability of computers to “learn” directly from data, as humans and animals do from real-life experience. An important aspect is that the algorithms used don’t rely on predetermined equations but adaptively improve their performance (MathWorks Inc., 2016). Deep learning, in turn, is the part of machine learning that trains so-called neural networks, which are explained in the subsequent chapters (Burkov, 2019). Before that, the relevant machine learning terminology and concepts have to be introduced.

2. Introduction to Machine Learning Theory

2.1 Key Terminology

In short, machine learning systems learn how to combine input factors to generate predictions on previously unseen data. When a system or model is trained, labels are provided. The label is the variable y that is to be predicted. Take a model that categorizes e-mails as spam or not spam: here, “spam” and “not spam” are the labels.

With this in mind, features are input variables that describe the data, such as the words in the e-mail, the e-mail addresses and any other related information. An example is one piece of data, such as a single e-mail. It can be a labeled example, which is used to train the model, or an unlabeled example, for which the model makes a prediction.

Lastly, it is important to highlight the difference between regression and classification. A regression model predicts continuous values, such as the value or probability of something. A classification model, by contrast, predicts discrete values and would, for example, decide whether there is a cat, a dog or a hamster in an image (Google Developers, Framing: Key ML Terminology, 2019).
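The terminology above can be made concrete with a small Python sketch. The `Example` class, the feature names and the two e-mails are all hypothetical and only serve to show how features, labels, and labeled versus unlabeled examples relate to each other.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Example:
    """One piece of data: its features plus an optional label."""
    features: dict
    label: Optional[str] = None  # None marks an unlabeled example

    @property
    def is_labeled(self) -> bool:
        return self.label is not None


# Hypothetical e-mails; the feature names are invented for this sketch.
labeled = Example(
    features={"words": ["win", "prize", "now"], "sender": "promo@example.com"},
    label="spam",
)
unlabeled = Example(
    features={"words": ["meeting", "at", "noon"], "sender": "colleague@example.com"},
)

examples = [labeled, unlabeled]
training_set = [e for e in examples if e.is_labeled]    # used to train a model
to_predict = [e for e in examples if not e.is_labeled]  # the model predicts a label
```

Only the labeled example ends up in the training set, while the unlabeled one is what a trained model would be asked to classify.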

2.2 Fundamental Concepts

Linear Regression

As already mentioned, regression predicts a continuous output for a given set of inputs. Linear regression does this by estimating the linear relationship between the variables. The following table shows the size of seas in square meters and the corresponding number of fish in each sea.

The data can be examined by plotting it in a diagram. This simple linear relationship can be displayed by fitting a line to the given data.

This relationship is described by the following equation(MIT — Massachusetts Institute of Technology, n. D.):

y = b + mx

In this,

y is the dependent variable which is the value we are trying to predict (the number of fish),
x is the independent variable which is the value of our input feature (the size of the sea),
m is the slope and
b the intercept of the line (MIT — Massachusetts Institute of Technology, n. D.).
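As an illustration of fitting a line y = b + mx, the sketch below uses ordinary least squares, which is one common way to do this and not necessarily the method behind the original figure. The data points are made up for the example and are not the values from the table.

```python
def fit_line(xs, ys):
    """Return slope m and intercept b of the least-squares line y = b + m*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (both without the 1/n factor).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    b = mean_y - m * mean_x
    return m, b


# Hypothetical data: sea size (x) and number of fish (y), in arbitrary units.
sizes = [1.0, 2.0, 3.0, 4.0]
fish = [3.0, 5.0, 7.0, 9.0]

m, b = fit_line(sizes, fish)
prediction = b + m * 5.0  # predicted number of fish for a sea of size 5
```

Because the made-up points lie exactly on a line, the fit recovers a slope of 2 and an intercept of 1. Real data would scatter around the fitted line.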

Training and Loss

But how do we know whether the fitted line is “good” or “bad”? This is where the idea of loss comes in. Loss shows how well a model predicts the outcome. It can be defined as the difference between the predicted and the true value of an example. For a perfect prediction, the loss equals zero. The loss for a single observation is called squared loss or L2 loss:

The square of the difference between the label and the prediction = (observation − prediction(x))²

= (y − y’)²

To extend this concept to the whole dataset, the so-called mean squared error (MSE) is used:

MSE = (1/N) · Σ (y − prediction(x))²

Here, the sum of all squared losses is divided by the number of examples N.

Moreover, this is where training becomes relevant. Training describes the process of using training data to incrementally improve the model’s ability to predict outcomes. In practice, the machine learning algorithm builds a model by inspecting many labeled examples and searching for the parameters that minimize loss (Google Developers, Descending into ML: Training and Loss, 2019).
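The following sketch shows how loss and training fit together for the line y = b + mx. The mean squared error follows the definition above. Gradient descent is used as the training procedure, which is an assumption on my part since the article does not name a specific algorithm. The data, learning rate and step count are hypothetical.

```python
def predict(x, m, b):
    return b + m * x


def mse(xs, ys, m, b):
    """Mean squared error: sum of squared losses divided by the number of examples."""
    return sum((y - predict(x, m, b)) ** 2 for x, y in zip(xs, ys)) / len(xs)


def train(xs, ys, learning_rate=0.05, steps=2000):
    """Incrementally adjust m and b with gradient descent to reduce the MSE."""
    m, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Partial derivatives of the MSE with respect to m and b.
        grad_m = sum(-2 * x * (y - predict(x, m, b)) for x, y in zip(xs, ys)) / n
        grad_b = sum(-2 * (y - predict(x, m, b)) for x, y in zip(xs, ys)) / n
        m -= learning_rate * grad_m
        b -= learning_rate * grad_b
    return m, b


# Hypothetical labeled examples (not the data from the article).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

loss_before = mse(xs, ys, 0.0, 0.0)
m, b = train(xs, ys)
loss_after = mse(xs, ys, m, b)
```

Each step nudges the slope and intercept in the direction that lowers the loss, so `loss_after` ends up far below `loss_before`.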

Logistic Regression

What if, instead of predicting the number of fish in a sea, we want to predict whether a body of water is freshwater or saltwater? For such categorical problems, logistic regression comes into play. This classification algorithm can be either binary or multiclass. Instead of predicting a value, it predicts the probability of an occurrence. For this, it is necessary to shift from an output that could lie anywhere between minus infinity and plus infinity to an output between 0 and 1. An input x for which the model returns a value closer to 0 is then assigned the negative label, and one for which it returns a value closer to 1 is assigned the positive label (Burkov, 2019, p. 32). A function that fits this exact purpose is the sigmoid function:

f(x) = 1 / (1 + e^(−x))
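The sigmoid function is commonly written as f(x) = 1 / (1 + e^(−x)). The sketch below shows how it squashes any real number into the range between 0 and 1 and how a threshold of 0.5 turns that probability into a label. The single salinity feature and the values of `w` and `b` are hypothetical and were not learned from data.

```python
import math


def sigmoid(z):
    """Map any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(x, w, b):
    """Probability of the positive class for a single input feature x."""
    return sigmoid(w * x + b)


def classify(x, w, b, threshold=0.5):
    """Assign the positive label if the probability is at least the threshold."""
    return "positive" if predict_probability(x, w, b) >= threshold else "negative"


# Hypothetical model: x is the salinity of a waterbody, "positive" means saltwater.
w, b = 1.0, -10.0
low_salinity = classify(0.5, w, b)    # probability close to 0
high_salinity = classify(35.0, w, b)  # probability close to 1
```

With these made-up parameters, the low-salinity input is labeled negative and the high-salinity input positive.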

3. Deep Learning: Neural Networks

With rising complexity and a shift from linear to non-linear problems, the previous concepts quickly reach their limits. Deep learning, which trains neural networks with several layers, provides a solution.

3.1 Perceptrons

To understand the concept behind neural networks, it is crucial to begin with the perceptron.

This artificial neuron takes several inputs (xj) and generates one binary output.

Each input is assigned a weight (wj), and the output of 0 or 1 is determined by whether the weighted sum of the inputs, ∑j wjxj, is less than or greater than a threshold value.

Combining many perceptrons into a network makes more subtle decisions possible. While the first layer of perceptrons makes three basic decisions by weighing the inputs, the second layer makes decisions based on the outcomes of the first layer. The more layers a network has, the more complex the decisions it can make. Still, there is just a single output.
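A minimal sketch of this rule: the perceptron outputs 1 if the weighted sum of its inputs exceeds the threshold and 0 otherwise. The weights and threshold below are illustrative values chosen for the example.

```python
def perceptron(inputs, weights, threshold):
    """Return 1 if the weighted sum of the inputs exceeds the threshold, else 0."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return 1 if weighted_sum > threshold else 0


# Illustrative values: the first input matters much more than the other two.
weights = [6, 2, 2]
threshold = 5

only_first = perceptron([1, 0, 0], weights, threshold)   # 6 > 5
only_others = perceptron([0, 1, 1], weights, threshold)  # 4 is not > 5
```

With these weights, the first input alone is enough to push the output to 1, while the other two inputs together are not.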

Moving from a single perceptron to an entire network also changes the mathematical description in two respects.

First, ∑j wjxj is written as the dot product w · x, where w and x are vectors whose elements are the weights and inputs. Second, the threshold moves to the other side of the inequality and becomes the so-called perceptron’s bias, b ≡ −threshold. The perceptron then outputs 1 if w · x + b > 0 and 0 otherwise. The bias can be understood as a measure of how easy it is to get the perceptron to output a 1 (Nielsen, 2018, pp. 2–4).
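In the bias form, a perceptron outputs 1 if w · x + b > 0 and 0 otherwise. The sketch below uses this form to build a NAND gate with weights −2, −2 and bias 3, a standard textbook choice, and then wires several NAND perceptrons together into a small network that computes XOR. The function names are my own.

```python
def perceptron(x, w, b):
    """Bias form: output 1 if w . x + b > 0, else 0."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0


def nand(a, b):
    # A single perceptron with weights -2, -2 and bias 3 behaves like a NAND gate.
    return perceptron([a, b], [-2, -2], 3)


def xor(a, b):
    # A small network: later perceptrons decide based on earlier outputs.
    first = nand(a, b)
    return nand(nand(a, first), nand(b, first))


truth_table = {(a, b): xor(a, b) for a in (0, 1) for b in (0, 1)}
```

A single perceptron cannot compute XOR, but the layered combination can, which illustrates why networks of perceptrons make more subtle decisions possible.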

3.2 Sigmoid Neurons

At this point, the network of artificial neurons merely operates like a circuit of logic gates implementing a Boolean function, without any ability to learn. In this context, learning means making small modifications to the weights or biases that produce correspondingly small changes in the output, repeated until the model behaves as intended. The difficulty with perceptrons is that even the smallest adjustment can flip a perceptron’s output from 0 to 1 and thereby change the behaviour of the whole network radically. This issue is addressed by so-called sigmoid neurons, which behave similarly to perceptrons. Their advantage is that small changes in weights and bias cause only small changes in the output.

Both the inputs and the output of a sigmoid neuron can take any value between 0 and 1 instead of just 0 or 1. The output is σ(w · x + b), where σ is the already familiar sigmoid function (Nielsen, 2018, pp. 7–8).
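The difference between the two neuron types can be seen in a short sketch. The same tiny change in a weight flips the perceptron's output from 1 to 0, while the sigmoid neuron's output barely moves. The input and weight values are hypothetical.

```python
import math


def weighted_input(x, w, b):
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def perceptron_output(x, w, b):
    """Hard step: 1 if w . x + b > 0, else 0."""
    return 1 if weighted_input(x, w, b) > 0 else 0


def sigmoid_neuron_output(x, w, b):
    """Smooth step: sigmoid of w . x + b, a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-weighted_input(x, w, b)))


x = [1.0]
w_before, w_after = [0.01], [-0.01]  # a tiny change in the single weight
b = 0.0

perceptron_change = perceptron_output(x, w_before, b) - perceptron_output(x, w_after, b)
sigmoid_change = sigmoid_neuron_output(x, w_before, b) - sigmoid_neuron_output(x, w_after, b)
```

Here `perceptron_change` is a full jump of 1, whereas `sigmoid_change` stays well below 0.01. This smoothness is what makes gradual learning possible.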

3.3 Architecture of Neural Networks

The following graphic shows a standard depiction of the architecture of a neural network.

The leftmost layer is called the input layer and contains the input neurons. The rightmost layer is the output layer, which still contains a single output neuron. A new element is the hidden layer, of which there can be one or more. A hidden layer is simply defined as “not an input or an output”. Its task is to transform the inputs into a form the output layer can use.

The approach of the neural network can be illustrated with an example. If the neural network is supposed to determine whether an image of a handwritten digit shows a “9” or not, the intensities of the image pixels play a major role. For a 64 by 64 greyscale image, 4,096 input neurons hold greyscale intensities between 0 and 1. Finally, the output value of the neural network indicates that the input image is not a 9 if the value is below 0.5, or that it is a 9 if the value is above 0.5 (Nielsen, 2018, pp. 10–12).
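To make this architecture tangible, here is a rough sketch of a forward pass through such a network: 4,096 input values, one hidden layer and a single output neuron. The hidden layer size of 15, the random weights and the random "image" are assumptions for illustration. The network is untrained, so its verdict carries no meaning.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def layer(inputs, weights, biases):
    """Compute the outputs of one layer of sigmoid neurons."""
    return [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def random_layer(n_in, n_out, rng):
    """Untrained placeholder parameters, scaled so weighted sums stay small."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


rng = random.Random(0)

# Hypothetical 64 by 64 greyscale image, flattened to 4,096 intensities in [0, 1].
image = [rng.random() for _ in range(64 * 64)]

hidden_w, hidden_b = random_layer(len(image), 15, rng)  # 15 hidden neurons (assumed)
output_w, output_b = random_layer(15, 1, rng)           # a single output neuron

hidden = layer(image, hidden_w, hidden_b)
output = layer(hidden, output_w, output_b)[0]

verdict = "9" if output > 0.5 else "not 9"
```

Training would replace the random parameters with values that make the output exceed 0.5 only for images of a 9.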

4. Deep Learning: Image Classification with Convolutional Neural Networks

In the context of image classification, it is not beneficial to rely on classical neural network architectures. This is because they don’t take into account the spatial structure of images. As a result, input pixels that are far apart are treated in exactly the same way as pixels that are close together. This is why an architecture is used that takes advantage of the spatial structure: the convolutional neural network (CNN) (Nielsen, 2018, p. 169).

The CNN is used to progressively extract higher-level representations of the input images. It obtains more abstract features with every layer. For example, the edges of an object might be recognized first, followed by basic shapes and finally higher-level features such as faces (Albawi, Mohammed & Alzawi, 2017).

The model no longer relies on preprocessing the data to derive features like shapes and textures. Instead, the CNN works directly with the pixel data of the image and learns to extract these features itself (Google Developers, ML Practicum: Image Classification, 2019).

While the input layer of a neural network was displayed as a vertical line of neurons, the input of a convolutional neural network is a square of neurons: the input feature map. As before, the input pixels are connected to a hidden layer. This time, however, each hidden neuron is connected only to a small, localized region of the input image.

The CNN extracts tiles of the input feature map and applies filters to them to generate new features, which form the output feature map.

In the convolution process, filters slide over the input feature map, extracting one tile at a time. In this example, there is an input feature map and a convolutional filter with given values. The filter is applied to a tile of the input feature map by multiplying the values of the filter element-wise with the corresponding values of the tile. The products are then summed to a single value of the output feature map (Nielsen, 2018, p. 170; Google Developers, ML Practicum: Image Classification, 2019).

The following example illustrates this approach. The object on the left is an image representation of a 7 and serves as the input feature map that will be run through a convolutional layer with a single filter. The values of this matrix are the individual pixels of the image. The object in the middle is the filter, filled with random numbers. The filter starts in the upper left corner, on the first 3 by 3 block of pixels of the input. The dot product of the filter and this block is computed and stored at the corresponding position of the output feature map. The filter then slides on, and the same is done for every 3 by 3 block of the input feature map (deeplizard, 2017). Feel free to watch deeplizard’s “Convolutional Neural Networks (CNNs) explained” video on YouTube: https://www.youtube.com/watch?v=YRhxdVk_sIs&t=339s
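The sliding-filter procedure can be sketched in a few lines of Python. The 5 by 5 input feature map and the 3 by 3 filter below are illustrative numbers, not the pixel values of the 7 from the video. As is usual in deep learning, the filter is applied without flipping it and, in this sketch, without padding or strides.

```python
def convolve2d(feature_map, kernel):
    """Slide the filter over the input and store one dot product per position."""
    k = len(kernel)
    out_rows = len(feature_map) - k + 1
    out_cols = len(feature_map[0]) - k + 1
    output = []
    for i in range(out_rows):
        row = []
        for j in range(out_cols):
            # Multiply the filter element-wise with the current tile and sum up.
            total = sum(
                feature_map[i + a][j + b] * kernel[a][b]
                for a in range(k)
                for b in range(k)
            )
            row.append(total)
        output.append(row)
    return output


# Illustrative 5 by 5 input feature map and 3 by 3 filter.
input_map = [
    [3, 5, 2, 8, 1],
    [9, 7, 5, 4, 3],
    [2, 0, 6, 1, 6],
    [6, 3, 7, 9, 2],
    [1, 4, 9, 5, 1],
]
conv_filter = [
    [1, 0, 0],
    [1, 1, 0],
    [0, 0, 1],
]

output_map = convolve2d(input_map, conv_filter)
# output_map == [[25, 18, 17], [18, 22, 14], [20, 15, 23]]
```

A 3 by 3 filter fits into a 5 by 5 input at nine positions, so the output feature map is 3 by 3. The top-left value, for instance, is 3 + 9 + 7 + 6 = 25.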

While this is still a basic example, CNNs are capable of classifying images of objects, animals, humans, plants or screenshots with accuracies of up to 90 %. Prominent examples of CNN use cases are the photo tagging system at Facebook, virtual assistants like Siri, chatbots and object recognition cameras (Mhalagi, n. D.).

References

Albawi, S., Mohammed, T., Alzawi, S. (2017). Understanding of a Convolutional Neural Network. Diyala/Kirkuk: University of Diyala/University of Kirkuk. DOI: 10.1109/ICEngTechnol.2017.8308186.

Burkov, A. (2019). The Hundred-Page Machine Learning Book. Available at: https://file.ai100.com.cn/files/file-code/original/cd136ebe-0e34-4e43-966b-224acff83005/100MLBOOK/The+Hundred-Page+Machine+Learning+Book.pdf (28.09.2019).

deeplizard. (2017, December 9). Convolutional Neural Networks (CNNs) explained. Available at: https://www.youtube.com/watch?v=YRhxdVk_sIs (05.10.2019).

Drakos, G. (2018). How to select the Right Evaluation Metric for Machine Learning Models: Part 1 Regression Metrics. Available at: https://towardsdatascience.com/how-to-select-the-right-evaluation-metric-for-machine-learning-models-part-1-regrression-metrics-3606e25beae0 (01.10.2019).

Google Developers. (2019). Descending into ML: Training and Loss | Machine Learning Crash Course. Available at: https://developers.google.com/machine-learning/crash-course/descending-into-ml/training-and-loss (01.10.2019).

Google Developers. (2019). Framing: Key ML Terminology | Machine Learning Crash Course. Available at: https://developers.google.com/machine-learning/crash-course/framing/ml-terminology (31.09.2019).

Google Developers. (2019). ML Practicum: Image Classification. Available at: https://developers.google.com/machine-learning/practica/image-classification/convolutional-neural-networks (03.10.2019).

MathWorks Inc. (2016). Introducing Machine Learning. Available at: https://www.mathworks.com/content/dam/mathworks/tag-team/Objects/i/88174_92991v00_machine_learning_section1_ebook.pdf (25.09.2019).

Mhalagi, S. (n. D.) The Quest of Higher Accuracy for CNN Models. Available at: https://towardsdatascience.com/the-quest-of-higher-accuracy-for-cnn-models-42df5d731faf (06.10.2019).

MIT — Massachusetts Institute of Technology. (n. D.). Chapter 3 — Linear Regression. Available at: www.mit.edu/~6.s085/notes/lecture3.pdf (01.10.2019).

Nielsen, M. (2018). Neural Networks and Deep Learning. Available at: http://static.latexstudio.net/article/2018/0912/neuralnetworksanddeeplearning.pdf (03.10.2019).

Russell, S. and Norvig, P. (2003). Artificial intelligence (2nd ed.). New Jersey: Prentice Hall.

Written by David Lewenko