MALARIA PARASITIC CELL CLASSIFICATION

Akshat
4 min read · Jun 22, 2020


Let us discuss how a simple CNN architecture allows us to achieve a desirable accuracy, even for a complex image classification task. The dataset we will be using is the Malaria Cell Images Dataset. The images are categorized into two classes, namely Parasitized and Uninfected, with roughly 13,800 RGB images belonging to each class.

There are many differences between the two classes provided in the dataset. The primary one is that a parasitized cell, i.e. a cell containing the malaria parasite, has a purple-colored patch indicating the presence of the parasite. The parasitized class consists of images as shown below:

Parasitized cell

The uninfected cell class consists of images similar to:

Uninfected Cell

The images were converted into grayscale and resized to a dimension of 100×100 before splitting the dataset into train and test.

We need to reshape the data to the shape (number of images, 100, 100, 1). The following snippet shows the dataset being reshaped before it is split into training and testing datasets.

Reshaping the dataset
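
As a rough sketch of this preprocessing step (the folder layout, IMG_SIZE constant and variable names here are placeholders, not the exact code used):

import os
import cv2
import numpy as np

# Hypothetical folder layout: cell_images/Parasitized and cell_images/Uninfected
IMG_SIZE = 100
data, labels = [], []

for label, folder in enumerate(["Uninfected", "Parasitized"]):
    folder_path = os.path.join("cell_images", folder)
    for fname in os.listdir(folder_path):
        img = cv2.imread(os.path.join(folder_path, fname), cv2.IMREAD_GRAYSCALE)
        if img is None:  # skip unreadable files
            continue
        img = cv2.resize(img, (IMG_SIZE, IMG_SIZE))
        data.append(img)
        labels.append(label)

# Reshape to (number of images, 100, 100, 1) and scale pixels to [0, 1]
X = np.array(data).reshape(-1, IMG_SIZE, IMG_SIZE, 1) / 255.0
y = np.array(labels)
print(X.shape)  # e.g. (27558, 100, 100, 1)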

By utilizing train_test_split, we divide our dataset into training and testing datasets. The training dataset consists of 22,046 grayscale images and the testing dataset consists of 5,512 grayscale images. We also need to one-hot encode our targets/labels. The following snippet of the code provides an insight into it:

Snippet for train_test_split
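
A minimal sketch of this split and one-hot encoding step; the random_state value is an assumption:

from sklearn.model_selection import train_test_split
from tensorflow.keras.utils import to_categorical

# 80% training data, 20% testing data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# One-hot encode the targets: 0 -> [1, 0], 1 -> [0, 1]
y_train = to_categorical(y_train, num_classes=2)
y_test = to_categorical(y_test, num_classes=2)

print(X_train.shape, X_test.shape)  # (22046, 100, 100, 1) (5512, 100, 100, 1)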

In the above code, we keep 20% of the data reserved for testing the model once it is trained on the training data, i.e., 80% training data and 20% testing data.

We used data augmentation techniques to prevent overfitting and to produce more images for training purposes. The following snippet provides an insight into the same:

Data Augmentation
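
A sketch of the augmentation setup with the parameters described below (the batch size is an assumption):

from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Rotation, zoom, width/height shifts and shear as described below
datagen = ImageDataGenerator(
    rotation_range=20,
    zoom_range=0.10,
    width_shift_range=0.15,
    height_shift_range=0.15,
    shear_range=0.10)

# Batches of augmented training images; the test set is used as-is for validation
train_batches = datagen.flow(X_train, y_train, batch_size=32)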

Here, images have been rotated by up to 20 degrees, zoomed by 10%, the width and the height have been shifted by 15%, and a shear of 10% has been added. The testing dataset was used as the validation dataset and also to evaluate the model’s performance (shown ahead). As a part of experimentation, you can also produce the validation dataset from the training dataset, by adding subset=’training’ in train_batches and subset=’validation’ in val_batches after specifying validation_split=0.1 in ImageDataGenerator().

A SIMPLE LENET-5-LIKE MODEL

The model developed consists of 4 convolution layers, each followed by a MaxPooling2D layer. The structure of the developed model is as follows:

Simple LeNet-5-like model
Model’s summary

The model developed has 195,038 parameters.
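
One possible way to build such a model is sketched below; the filter counts, kernel sizes and the dense-layer width are assumptions, so the parameter count will not necessarily match 195,038:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

# Four Conv2D layers, each followed by a MaxPooling2D layer
model = Sequential([
    Conv2D(16, (3, 3), activation='relu', input_shape=(100, 100, 1)),
    MaxPooling2D((2, 2)),
    Conv2D(32, (3, 3), activation='relu'),
    MaxPooling2D((2, 2)),
    Conv2D(64, (3, 3), activation='relu'),
    MaxPooling2D((2, 2)),
    Conv2D(64, (3, 3), activation='relu'),
    MaxPooling2D((2, 2)),
    Flatten(),
    Dense(64, activation='relu'),
    Dense(2, activation='softmax')  # two classes: Parasitized / Uninfected
])

model.summary()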

After compiling and fitting/training, the model has a validation accuracy of approximately 93% with a training accuracy of 94%. Following are the results:

Compiling and training the model
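
A minimal sketch of the compile-and-train step, assuming the Adam optimizer and categorical cross-entropy loss; the epoch count is an assumption:

model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

# Train on augmented batches, validating on the test set (as described above)
history = model.fit(train_batches,
                    validation_data=(X_test, y_test),
                    epochs=20)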

The model had an accuracy of 94% when evaluated against the test data.

Testing results

The following snippet shows the confusion matrix and the classification report for the testing dataset.

Confusion Matrix and Classification report
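
A sketch of how such an evaluation, confusion matrix and classification report can be produced with scikit-learn:

import numpy as np
from sklearn.metrics import confusion_matrix, classification_report

# Overall accuracy on the test set
test_loss, test_acc = model.evaluate(X_test, y_test)

# Class predictions for the confusion matrix / classification report
y_pred = np.argmax(model.predict(X_test), axis=1)
y_true = np.argmax(y_test, axis=1)

print(confusion_matrix(y_true, y_pred))
print(classification_report(y_true, y_pred,
                            target_names=['Uninfected', 'Parasitized']))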

THE EXPERIMENTATION

As mentioned earlier, the training dataset can be split into training and validation datasets instead of using the testing dataset for validation purposes. This process gives us three datasets: training, validation and testing.

The following snippet shows the same:

Experimentation as discussed earlier
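
A sketch of this setup; the 10% validation split comes from the earlier description, while the batch size and epoch count are assumptions:

from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    rotation_range=20,
    zoom_range=0.10,
    width_shift_range=0.15,
    height_shift_range=0.15,
    shear_range=0.10,
    validation_split=0.1)  # reserve 10% of the training data for validation

train_batches = datagen.flow(X_train, y_train, batch_size=32, subset='training')
val_batches = datagen.flow(X_train, y_train, batch_size=32, subset='validation')

# The testing dataset is now used only for the final evaluation
history = model.fit(train_batches, validation_data=val_batches, epochs=20)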

Using the above experimentation with data augmentation, compiling and training the model produces slightly better results.

The following snippet shows the confusion matrix and the classification report for the experimented model:

There is a slight increase in the performance of the model, with an increased accuracy on the testing data (around 95% compared to the previous 94%).

To further develop a better model, the following experiments can be performed:

  1. Increase/Decrease the number of layers and/or the number of neurons in each layer.
  2. Try using a deeper network that captures more features. It also helps to reduce overfitting.
  3. Try a different optimizer while compiling the model. You can select from ‘Adam’, ‘RMSprop’ or ‘SGD’ (all of which are applied in a mini-batch gradient descent fashion). Adam was used during the experimentation.
  4. Try increasing the number of epochs you train the model for. There is no guarantee that the model will perform better, but it gives you better insight into the performance of the model over the duration of the training process.
  5. Use callbacks to save the best model and to schedule the learning rate on the basis of validation loss or validation accuracy (see the sketch after this list).
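
For point 5, a sketch using Keras callbacks might look like the following; the file name, monitored metrics and patience value are assumptions:

from tensorflow.keras.callbacks import ModelCheckpoint, ReduceLROnPlateau

callbacks = [
    # Save the weights of the best model seen so far, judged by validation accuracy
    ModelCheckpoint('best_model.h5', monitor='val_accuracy', save_best_only=True),
    # Lower the learning rate when the validation loss stops improving
    ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=3)
]

history = model.fit(train_batches, validation_data=val_batches,
                    epochs=30, callbacks=callbacks)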

THANK YOU
