Image Classification with Convolutional Neural Networks


Machine learning has been gaining momentum over the last decades: self-driving cars, efficient web search, speech and image recognition. The successful results gradually propagate into our daily lives. Machine learning is a class of artificial intelligence methods that allows a computer to learn from data without being explicitly programmed. It is a very interesting and complex topic, which could drive the future of technology.

Two months ago I wanted to change my life and I enrolled in the programming course from Digital Academy — Czechitas, Prague. In addition to studying basic subjects, my task was to invent and develop my own project. I decided to focus on machine learning. During my course I was lucky to meet a mentor — Jan Matoušek from Data Mind, who helped me to discover a new world of artificial neural networks.


Since I’m a newcomer to this field, I took a ready-made model from the Keras blog. My goals were to understand how the model works, describe it, customize it and teach it to recognize photos of cars and elephants.

Techniques and tools

I used Python for this project. As a framework I used Keras, a high-level neural network API written in Python. Keras can’t work by itself; it needs a backend for low-level operations, so I installed Google’s TensorFlow library.

As a development environment I used PyCharm. I used Matplotlib for visualization.

For network training and testing I used a dataset of photos of elephants and cars downloaded from


A bit of theory in the beginning does not hurt :)

Neural network

A neural network is a machine learning algorithm built on the principles of the organization and functioning of biological neural networks. The concept arose from an attempt by Warren McCulloch and Walter Pitts in 1943 to simulate the processes occurring in the brain.

Neural networks consist of individual units called neurons. Neurons are arranged in a series of groups called layers (see the figure below). Neurons in each layer are connected to neurons of the next layer. Data flows from the input layer to the output layer along these connections. Each individual node performs a simple mathematical calculation and then transmits its result to all the nodes it is connected to.
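
To make the "simple mathematical calculation" of a single node more concrete, here is a minimal sketch. It assumes the common formulation of a neuron: a weighted sum of the inputs plus a bias, passed through an activation function. The name neuron_output, the example weights and the choice of a sigmoid are my own illustrative assumptions and are not taken from the model used later.

```python
import math

def neuron_output(inputs, weights, bias):
    """Weighted sum of the inputs plus a bias, squashed by a sigmoid."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))

# Illustrative values only: two inputs feeding one neuron.
activation = neuron_output([0.5, -1.0], [0.8, 0.2], bias=0.1)
```

The result is a number between 0 and 1 that would be passed on to the neurons of the next layer.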

The last wave of neural networks came in connection with the increase in computing power and the accumulation of experience. That brought Deep learning, where technological structures of neural networks have become more complex and able to solve a wide range of tasks that could not be effectively solved before. Image classification is a prominent example.

Convolutional neural networks and image classification

A convolutional neural network (CNN) is a special architecture of artificial neural networks, proposed by Yann LeCun in 1988. CNNs use some features of the visual cortex. One of the most popular uses of this architecture is image classification. For example, Facebook uses CNNs for automatic tagging algorithms, Amazon for generating product recommendations and Google for searching through users’ photos.

Let us consider the use of CNNs for image classification in more detail. The main task of image classification is to accept an input image and determine its class. This is a skill that people learn from birth, so they can easily tell that the picture shows an elephant. But the computer sees the picture quite differently:

Instead of the image, the computer sees an array of pixels. For example, if the image size is 300 x 300, the size of the array will be 300x300x3, where the first 300 is the width, the second 300 is the height and 3 is the number of RGB channels. Each of these numbers is assigned a value from 0 to 255, which describes the intensity of the pixel at that point.

To solve this problem the computer first looks for low-level features. In human understanding such features are, for example, the trunk or large ears. For the computer, they are edges or curves. Then, through groups of convolutional layers, the computer constructs more abstract concepts.

In more detail: the image is passed through a series of convolutional, nonlinear, pooling and fully connected layers, and the network then generates the output.
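
As a small illustration of what "an array of pixels" means, the sketch below builds a synthetic 300 x 300 RGB image out of plain Python lists. The all-black image and the names WIDTH, HEIGHT, CHANNELS and total_values are assumptions made for this example; a real photo would be loaded with an image library instead.

```python
WIDTH, HEIGHT, CHANNELS = 300, 300, 3

# A synthetic all-black image: HEIGHT rows, each with WIDTH pixels,
# each pixel a list of three intensities (R, G, B) in the range 0..255.
image = [[[0, 0, 0] for _ in range(WIDTH)] for _ in range(HEIGHT)]

# Paint the top-left pixel pure red.
image[0][0] = [255, 0, 0]

# Number of individual values the computer has to work with.
total_values = HEIGHT * WIDTH * CHANNELS
```

For this image size, total_values is 270000, so even a small photo is described by hundreds of thousands of numbers.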

The convolution layer is always the first. The image (a matrix of pixel values) is fed into it. Imagine that the reading of the input matrix begins at the top left of the image. There the software selects a smaller matrix, which is called a filter (or neuron, or kernel). The filter then performs the convolution, i.e. it moves along the input image. The filter’s task is to multiply its values by the original pixel values. All these multiplications are summed up, and one number is obtained in the end. Since the filter has read the image only in the upper left corner, it then moves further and further right by 1 unit, performing the same operation. After the filter has passed across all positions, a matrix is obtained that is smaller than the input matrix.
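
The sliding-window operation can be sketched in a few lines of plain Python. This is a simplified illustration, assuming a single-channel image, a stride of 1 and no padding; the function name convolve2d, the 4x4 image and the edge-detecting filter are hypothetical examples. Like deep learning libraries, it does not flip the filter (strictly speaking this is a cross-correlation).

```python
def convolve2d(image, kernel):
    """Slide `kernel` over `image` with stride 1 and no padding.

    At each position the overlapping values are multiplied element-wise
    and summed into a single output number.
    """
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = len(image) - k_h + 1
    out_w = len(image[0]) - k_w + 1
    output = []
    for top in range(out_h):
        row = []
        for left in range(out_w):
            total = 0
            for i in range(k_h):
                for j in range(k_w):
                    total += image[top + i][left + j] * kernel[i][j]
            row.append(total)
        output.append(row)
    return output

# Illustrative 4x4 single-channel image with a vertical edge in the middle.
image = [
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [0, 0, 9, 9],
]
# Illustrative 3x3 filter that responds to left-to-right intensity increases.
kernel = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]
feature_map = convolve2d(image, kernel)  # 2x2 result, smaller than the input
```

A 4x4 input and a 3x3 filter give a 2x2 output, which matches the observation that the resulting matrix is smaller than the input.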

This operation, from a human perspective, is analogous to identifying boundaries and simple colours on the image. But in order to recognize the properties of a higher level such as the trunk or large ears the whole network is needed.

The network will consist of several convolutional layers mixed with nonlinear and pooling layers. When the image passes through one convolution layer, the output of that layer becomes the input for the next. This happens with every further convolutional layer.

The nonlinear layer is added after each convolution operation. It applies an activation function, which introduces nonlinearity. Without it the network would not be expressive enough to model the response variable (such as a class label).

The pooling layer follows the nonlinear layer. It works with the width and height of the image and performs a downsampling operation on them. As a result the image volume is reduced. This means that if some features (for example boundaries) have already been identified in the previous convolution operation, then a detailed image is no longer needed for further processing, and it is compressed to a less detailed picture.

After a series of convolutional, nonlinear and pooling layers, it is necessary to attach a fully connected layer. This layer takes the output of the convolutional layers. Attaching a fully connected layer to the end of the network results in an N-dimensional vector, where N is the number of classes from which the model selects the desired class.
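
Downsampling can be illustrated with a short sketch. It assumes max pooling with a non-overlapping 2x2 window, which is the variant used by the model later in this article; the function name max_pool2d and the sample values are made up for the example.

```python
def max_pool2d(feature_map, pool=2):
    """Keep the largest value in each non-overlapping pool x pool window."""
    out_h = len(feature_map) // pool
    out_w = len(feature_map[0]) // pool
    return [
        [
            max(
                feature_map[r * pool + i][c * pool + j]
                for i in range(pool)
                for j in range(pool)
            )
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]

feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 0, 5, 6],
    [1, 2, 7, 8],
]
pooled = max_pool2d(feature_map)  # 4x4 input becomes 2x2
```

Only the strongest response in each window survives, so the picture becomes four times smaller while the detected features are kept.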

A fragment of the code of this model written in Python will be considered further in the practical part.


In the beginning of this part I would like to describe the process of Supervised machine learning, which was taken as a basis of the model.

Supervised machine learning

It is one of the approaches to machine learning, in which the model is trained on input data together with the expected output data.

To create such a model, it is necessary to go through the following phases:

  1. model construction
  2. model training
  3. model testing
  4. model evaluation

Model construction depends on the machine learning algorithm. In this project’s case, it was a neural network.

Such an algorithm looks like this:

  1. it begins with its object: model = Sequential()
  2. then it consists of layers with their types: model.add(type_of_layer())
  3. after adding a sufficient number of layers the model is compiled. At this moment Keras communicates with TensorFlow for construction of the model. During model compilation it is important to specify a loss function and an optimizer algorithm. It looks like this: model.compile(loss='name_of_loss_function', optimizer='name_of_optimizer_alg'). The loss function measures how far each prediction made by the model is from the expected output.

Before model training it is important to scale the data for further use.

After model construction it is time for model training. In this phase, the model is trained using training data and the expected output for this data.

In Keras this is done with the model’s fit method, which receives the training data and the expected output.

Progress is visible on the console when the script runs. At the end it will report the final accuracy of the model.

Once the model has been trained it is possible to carry out model testing. During this phase a second set of data is loaded. This data set has never been seen by the model, and therefore its true accuracy will be verified.

After the model training is complete, and it is understood that the model shows the right result, it can be saved to a file such as “name_of_file.h5” (Keras provides model.save for this).

Finally, the saved model can be used in the real world. The name of this phase is model evaluation. This means that the model can be used to evaluate new data.

Classification Model (elephants vs cars)

Here I would like to describe the code that was taken as the basis of this project. It is commonly believed that a deep learning model needs a large amount of data, but the model given in this script works well when trained on a small amount of data. Because of that I took only 200 photos per class for training and 80 photos per class for validation during training.

Using so little data is possible when the images are preprocessed with the Keras ImageDataGenerator class. This class can apply a number of random transformations, which helps to increase the number of images when needed.

from keras.preprocessing.image import ImageDataGenerator, array_to_img, img_to_array, load_img

# NOTE: the argument values below are assumptions (they follow the Keras blog
# tutorial this code is based on); only fill_mode='nearest' is confirmed by the text.
datagen = ImageDataGenerator(
    rotation_range=40,
    width_shift_range=0.2,
    height_shift_range=0.2,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True,
    fill_mode='nearest')

img = load_img('train/elephants/adventure-1822636_640.jpg')  # this is a PIL image
x = img_to_array(img)  # this is a Numpy array with shape (height, width, 3)
x = x.reshape((1,) + x.shape)  # this is a Numpy array with shape (1, height, width, 3)

# the .flow() command below generates batches of randomly transformed images
# and saves the results to the `preview/` directory
i = 0
for batch in datagen.flow(x, batch_size=1,
                          save_to_dir='preview', save_prefix='el', save_format='jpeg'):
    i += 1
    if i > 20:
        break  # otherwise the generator would loop indefinitely

ImageDataGenerator has the following arguments:

  1. rotation_range: the range for random rotations, given in degrees from 0 to 180
  2. width_shift_range: the range for random horizontal shifts, given as a fraction of the total width
  3. height_shift_range: the same as width_shift_range, but for vertical shifts and the total height
  4. shear_range: shear intensity, used for a linear mapping that displaces each point in a fixed direction
  5. zoom_range: the range for random zooming
  6. horizontal_flip: unlike the other arguments it has a boolean type; used for randomly flipping inputs horizontally
  7. fill_mode: can be “constant”, “reflect”, “wrap” or “nearest” (as in this case); indicates the method of filling the newly created pixels

These are not all the arguments that can be used; the rest can be found in the Keras documentation.
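
To give a feeling for what two of these transformations do to the pixel values, here is a simplified sketch in plain Python. It is not how Keras implements them; the functions horizontal_flip and shift_right_nearest and the tiny 2x3 "image" are hypothetical, and the shift only imitates the idea behind fill_mode "nearest".

```python
def horizontal_flip(image):
    """Mirror each row, as horizontal_flip=True may do at random."""
    return [list(reversed(row)) for row in image]

def shift_right_nearest(image, pixels):
    """Shift every row right by `pixels` and fill the gap on the left
    by repeating the nearest edge value."""
    return [[row[max(col - pixels, 0)] for col in range(len(row))] for row in image]

image = [
    [1, 2, 3],
    [4, 5, 6],
]
flipped = horizontal_flip(image)
shifted = shift_right_nearest(image, 1)
```

Each transformed copy still shows the same object, so the label stays valid while the model sees slightly different pixels every time.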

load_img is used to specify the input image file. It also loads the image in PIL format.

img_to_array converts the PIL image into a 3D Numpy array, which is then reshaped.

Then, in the loop with the flow(x, y) method, the image transformation takes place. The randomly transformed images are stored in the “preview” folder and look like this:

The following code fragment describes the construction of the model. Here the layers begin to be added. This architecture was built on the principle of convolutional neural networks. It consists of 3 groups of layers, where the convolution layers (Conv2D) alternate with the nonlinear layers (ReLU) and the pooling layers (MaxPooling2D). They are followed by 2 fully connected layers (Dense). Let us consider their structure in more detail.

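from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D
from keras.layers import Activation, Dropout, Flatten, Dense

model = Sequential()
model.add(Conv2D(32, (3, 3), input_shape=(300, 300, 3)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))

model.add(Conv2D(32, (3, 3)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))

model.add(Conv2D(64, (3, 3)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))

# the model so far outputs 3D feature maps (height, width, features)

model.add(Flatten())  # this converts our 3D feature maps to 1D feature vectors
model.add(Dense(64))
model.add(Activation('relu'))
model.add(Dropout(0.5))  # assumed rate (the Keras blog tutorial value); the text only says 0 to 1
model.add(Dense(1))
model.add(Activation('sigmoid'))

model.compile(loss='binary_crossentropy',
              optimizer='rmsprop',
              metrics=['accuracy'])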

Let us look at the first convolution layer, Conv2D. The number 32 is the number of output filters in the convolution. The numbers 3, 3 correspond to the kernel size, which determines the width and height of the 2D convolution window. An important component of the first convolution layer is the input shape, which is the shape of the input array of pixels. Further convolution layers are constructed in the same way, but do not include the input shape.

The activation function of this model is ReLU. This function sets a zero threshold and looks like this: f(x) = max(0, x). If x > 0, the value passes through unchanged, and if x < 0, it is replaced with zero, which cuts off unnecessary details in the channel.

The MaxPooling2D layer is a pooling operation for spatial data. The numbers 2, 2 denote the pool size, which halves the input in both spatial dimensions.

After the three groups of layers there are two fully connected layers. Flatten prepares their input. Next is Dense, a densely connected layer with an output dimension of 64 and the ReLU activation function. It is followed by Dropout, which helps to prevent overfitting. Overfitting is the phenomenon in which the constructed model recognizes the examples from the training sample, but works relatively poorly on the examples of the test sample. Dropout takes a value between 0 and 1, the fraction of inputs to drop. The last fully connected layer has 1 output and the sigmoid activation function.

The next step is compiling the model. It uses the binary cross-entropy loss function, which summarizes the individual losses of all predictions in a single number. The optimizer algorithm is RMSprop, which is known as a good choice for recurrent neural networks. The accuracy metric shows the performance of the model.
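
The formula f(x) = max(0, x) is short enough to try out directly. The following sketch applies it to a handful of made-up values; the function name relu and the sample list are only for illustration.

```python
def relu(x):
    """Rectified linear unit: f(x) = max(0, x)."""
    return max(0.0, x)

values = [-2.0, -0.5, 0.0, 1.5, 3.0]
activated = [relu(v) for v in values]
```

Negative inputs come out as zero, while positive inputs are left untouched.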
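
It can help to trace how the spatial size of the data changes through the three groups of layers. The sketch below does this arithmetic in plain Python. It assumes the Keras defaults for these layers (a convolution stride of 1 with no padding, and pooling windows that do not overlap); the helper names conv_output_size and pool_output_size are my own.

```python
def conv_output_size(size, kernel=3):
    """Spatial size after a convolution with stride 1 and no padding."""
    return size - kernel + 1

def pool_output_size(size, pool=2):
    """Spatial size after non-overlapping pooling (any remainder is dropped)."""
    return size // pool

size = 300
sizes = []
for filters in (32, 32, 64):
    size = conv_output_size(size)
    size = pool_output_size(size)
    sizes.append((size, size, filters))

# Length of the 1D vector produced by flattening the last feature maps.
flattened_length = sizes[-1][0] * sizes[-1][1] * sizes[-1][2]
```

Under these assumptions the 300x300 input shrinks to 149x149, then 73x73 and finally 35x35 feature maps with 64 channels, so the flattened vector has 78400 values.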
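
The idea behind Dropout can be sketched as follows. This is a simplified illustration of so-called inverted dropout, not the Keras implementation; the function name dropout, the rate of 0.5 and the fixed random seed are assumptions for the example. Dropout is only active during training.

```python
import random

def dropout(values, rate, rng):
    """Zero each value with probability `rate` and scale the survivors by
    1 / (1 - rate) so that the expected sum stays the same."""
    keep_scale = 1.0 / (1.0 - rate)
    return [0.0 if rng.random() < rate else v * keep_scale for v in values]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
activations = [1.0] * 10
dropped = dropout(activations, rate=0.5, rng=rng)
```

Because a different random subset of values is removed on every pass, the network cannot rely on any single activation, which makes pure memorization of the training photos harder.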
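
To show what the sigmoid output and the binary cross-entropy loss compute, here is a minimal sketch in plain Python. The labels and raw scores are hypothetical, the individual losses are averaged here, and the small eps value is an assumption that only guards against log(0).

```python
import math

def sigmoid(z):
    """Squash a raw score into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy over a batch of labels and probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

# Hypothetical labels (0 and 1 stand for the two classes) and raw scores.
labels = [1, 0, 1]
probabilities = [sigmoid(z) for z in (2.0, -1.0, 0.0)]
loss = binary_cross_entropy(labels, probabilities)
```

A confident correct prediction contributes a small loss and a confident wrong prediction a large one, which is the signal the optimizer uses to adjust the weights.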

The following code fragment prepares the model for training:

batch_size = 16

# this is the augmentation configuration we will use for training
# (only rescale is confirmed by the text; the other arguments are assumed
# from the Keras blog tutorial)
train_datagen = ImageDataGenerator(
    rescale=1./255,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True)

# this is the augmentation configuration we will use for testing:
# only rescaling
test_datagen = ImageDataGenerator(rescale=1./255)

# this is a generator that will read pictures found in subfolders of 'train', and indefinitely generate
# batches of augmented image data
train_generator = train_datagen.flow_from_directory(
    'train',  # this is the target directory
    target_size=(300, 300),  # all images will be resized to 300x300
    batch_size=batch_size,
    class_mode='binary')  # since we use binary_crossentropy loss, we need binary labels

# this is a similar generator, for validation data
validation_generator = test_datagen.flow_from_directory(
    'validation',  # hypothetical directory name; the original path is not shown
    target_size=(300, 300),
    batch_size=batch_size,
    class_mode='binary')

Batch size is the number of training examples used in one forward/backward pass, i.e. in one training step.

Then the already described ImageDataGenerator is added for the training and testing samples. It has a new transformation, called rescale, which multiplies the data by the given value. With 1./255 the pixel values from 0 to 255 are scaled to the range from 0 to 1.
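
The effect of rescale can be checked on a single pixel. The sketch below multiplies raw intensities by 1/255, the same factor as in the generators; the sample pixel and the names RESCALE and scaled are made up for the example.

```python
RESCALE = 1.0 / 255  # same factor as rescale=1./255

pixel = [0, 128, 255]  # one RGB pixel with raw 0..255 intensities
scaled = [value * RESCALE for value in pixel]
```

After scaling, every value lies between 0 and 1, which is a more convenient range for training than 0 to 255.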

The flow_from_directory(directory) method is added for training and testing data. First, the path to the folders is specified. Further, the target size follows. It shows width and height to which images will be resized. Next, the batch size is added. Finally binary class mode is set.

When the preparation is complete, the code fragment of the training follows:

model.fit_generator(train_generator,
                    steps_per_epoch=400 // batch_size,
                    epochs=50,
                    validation_data=validation_generator,
                    validation_steps=160 // batch_size)

model.save_weights('50_epochs.h5')  # always save your weights after training or during training

Training is done with fit_generator. Here it is important to indicate the number of epochs, which defines how many times the training will repeat. 1 epoch is 1 forward pass and 1 backward pass over all the training examples.

Also, in this section steps_per_epoch and validation_steps are set. steps_per_epoch (or number of iterations) is the total number of steps after which one epoch is declared finished and the next begins. Typically this number is equal to the number of training samples (in my case 400: 200 photos of cars and 200 photos of elephants) divided by the batch size (16). This means that the number of iterations is 400 / 16 = 25. validation_steps is the total number of steps (batches of samples) to validate before stopping.
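
The arithmetic behind these two settings can be written out as a short sketch. The sample counts come from the dataset sizes mentioned in this article (200 training and 80 validation photos per class); the variable names train_samples and validation_samples are only illustrative.

```python
batch_size = 16
train_samples = 200 + 200      # 200 car photos and 200 elephant photos
validation_samples = 80 + 80   # 80 photos per class

# Integer division: how many full batches fit into each dataset.
steps_per_epoch = train_samples // batch_size
validation_steps = validation_samples // batch_size
```

With these numbers one epoch consists of 25 training steps, followed by 10 validation steps.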

When the model is trained it should be saved with save_weights.

Now that the model is assembled, it can be run. Running takes some time. At the end the program shows this result:

It can be seen that after 50 epochs the validation accuracy is 0.9375, which shows the ability of the model to generalize to new data.

After running the code and saving the model it’s time to check its accuracy on new testing photos. This is done with the scoring code. After running this code with 400 new photos of elephants and cars, I got a classification accuracy of 96% (383 photos correct).
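
Classification accuracy is simply the share of photos that were labelled correctly. The sketch below is not the scoring code itself; the function classification_accuracy, the threshold of 0.5 and the four sample predictions are assumptions for illustration. The last line repeats the calculation for the figures reported above.

```python
def classification_accuracy(probabilities, labels, threshold=0.5):
    """Share of samples whose thresholded prediction matches the label."""
    predictions = [1 if p > threshold else 0 for p in probabilities]
    correct = sum(1 for pred, label in zip(predictions, labels) if pred == label)
    return correct / len(labels)

# Hypothetical sigmoid outputs for four photos and their true labels.
example_accuracy = classification_accuracy([0.9, 0.2, 0.6, 0.4], [1, 0, 0, 0])

# The figures reported above: 383 of 400 photos classified correctly.
reported_accuracy = 383 / 400
```

In the hypothetical example three of four photos are classified correctly, and 383 of 400 corresponds to 95.75%, which rounds to the 96% quoted above.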


As a result of testing the model, I got a very good accuracy: 96% of samples classified correctly after 50 epochs. The only drawback was that I had to wait about 40 minutes until the 50 epochs came to an end (even though I had a very small number of photos for training). So I wondered: what if I can achieve the same result in fewer epochs?

For this, I decided to build two plots. The first shows the dependence of the evaluation accuracy on the number of epochs. The evaluation accuracy was calculated using an additional dataset of 400 pictures. The second plot shows the dependence of accuracy and validation accuracy on the number of epochs during training.

On the first plot it can be seen that the high accuracy (96%) is achieved after 10 epochs. In subsequent epochs the accuracy does not improve (and even decreases in the interval of 10–25 epochs).

The second graph shows the intersection of accuracy and validation accuracy. Validation accuracy shows the ability of the model to generalize to new data. The validation dataset contains only data that the model never sees during training and therefore cannot just memorize. If your training data accuracy (“acc”) keeps improving while your validation data accuracy (“val_acc”) gets worse, you are likely in an overfitting situation, i.e. your model starts to basically just memorize the data.

This means that after the 10th epoch the model can show the same result, but it will not get better. Consequently, it is sufficient to train this model for 10 epochs.
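
The rule of thumb for spotting overfitting can also be checked programmatically. The sketch below is a minimal illustration with hypothetical per-epoch values, not the measurements from my plots; the function names best_epoch and overfitting_epochs are made up for the example.

```python
def best_epoch(val_acc):
    """1-based index of the epoch with the highest validation accuracy."""
    return max(range(len(val_acc)), key=lambda i: val_acc[i]) + 1

def overfitting_epochs(acc, val_acc):
    """Epochs where training accuracy rose while validation accuracy fell."""
    return [
        epoch + 1
        for epoch in range(1, len(acc))
        if acc[epoch] > acc[epoch - 1] and val_acc[epoch] < val_acc[epoch - 1]
    ]

# Hypothetical per-epoch values, not the measurements from this article.
acc = [0.60, 0.75, 0.85, 0.92, 0.97]
val_acc = [0.58, 0.72, 0.80, 0.78, 0.76]

peak = best_epoch(val_acc)
warning_epochs = overfitting_epochs(acc, val_acc)
```

In this made-up history the validation accuracy peaks at epoch 3, and epochs 4 and 5 show the pattern described above, so training longer would mostly lead to memorization.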


In this work, I figured out what deep learning is. I assembled and trained a CNN model to classify photographs of cars and elephants. I verified that this model works really well with a small number of photos. I measured how the accuracy depends on the number of epochs in order to detect a potential overfitting problem. I determined that 10 epochs are enough for successful training of the model.

My next step would be to try this model on more data sets and try to apply it to practical tasks. I would also like to experiment with the neural network design in order to see how a higher efficiency can be achieved in various problems.




