Building a Malaria Classifier with Keras: Background & Implementation

Published in

GradientCrescent

7 min read, Feb 11, 2019

Introduction

With an increasing number of controversies surrounding the proliferation of ‘AI’, it’s prudent to demonstrate how machine learning can have far-reaching and positive real-world applications. In support of this, we are going to tackle one of the most serious diseases affecting the African continent today: malaria.

Malaria is a potentially fatal disease caused by Plasmodium parasites, which are transmitted to people through the bites of infected female Anopheles mosquitoes, known as malaria vectors. It has existed for 30 million years and has been identified as a major cause of death in ancient civilizations worldwide. Today, malaria remains a serious disease, with nearly half of the world’s population at risk. The WHO has identified the African region as carrying a disproportionately high share of the global malaria burden: the region was home to 92% of malaria cases and 93% of malaria deaths in 2017.

Malaria parasites can be identified by examining a sample of an infected patient’s blood under an optical microscope. Prior to examination, the sample is spread across a microscope slide and stained with a dye mixture that enhances the contrast of the Plasmodium parasites in the patient’s red blood cells. This technique is accepted as standard by health authorities worldwide and offers acceptable accuracy and cost-effectiveness, but it is also time- and labour-intensive, and highly dependent on the experience of the technician.

Plasmodium vivax parasites infesting red blood cells. Credit: Harvard T.H. Chan School of Public Health

To speed up this process, we are going to introduce partial automation for malaria detection by building a CNN-based malaria classifier using Keras. The dataset we will be using today is the Kaggle “Malaria Cell Images” dataset, which contains over 13 000 RGB images of both uninfected and parasitized cells.

To prevent excessive training time, we recommend that the user only run this code when linked up to GPU resources. As in our previous tutorial, we assume that the reader is familiar with the elements of deep learning, particularly with basic image classification. The theory behind CNNs has been widely covered in multiple courses and tutorials, and hence will not be repeated here.

Implementation

To begin, let’s import the Python os and shutil packages for data manipulation, and define the paths to our data along with the path to our working directories. We separate our data into positive and negative samples: “A” refers to the infected/parasitized samples, while “B” refers to the uninfected samples.

import os
import shutil

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Source data directories
base_dir = '../input/cell_images/cell_images/'
print(os.listdir(base_dir))
base_dir_A = '../input/cell_images/cell_images/Parasitized/'
base_dir_B = '../input/cell_images/cell_images/Uninfected/'

# Writable working directories
work_dir = "work/"
os.mkdir(work_dir)
work_dir_A = "work/A/"
os.mkdir(work_dir_A)
work_dir_B = "work/B/"
os.mkdir(work_dir_B)

Next, we split our data for training, validation, and testing purposes. Under each split folder, we will create two sub-folders for our output categories, named pos for positive (infected) and neg for negative (uninfected). We will be copying images from both source directories into these later on; working on copies circumvents some of the READ-ONLY limitations that affect certain environments.

train_dir = os.path.join(work_dir, 'train')
os.mkdir(train_dir)

validation_dir = os.path.join(work_dir, 'validation')
os.mkdir(validation_dir)

test_dir = os.path.join(work_dir, 'test')
os.mkdir(test_dir)

print("New directories for train, validation, and test created")

train_pos_dir = os.path.join(train_dir, 'pos')
os.mkdir(train_pos_dir)
train_neg_dir = os.path.join(train_dir, 'neg')
os.mkdir(train_neg_dir)

validation_pos_dir = os.path.join(validation_dir, 'pos')
os.mkdir(validation_pos_dir)
validation_neg_dir = os.path.join(validation_dir, 'neg')
os.mkdir(validation_neg_dir)

test_pos_dir = os.path.join(test_dir, 'pos')
os.mkdir(test_pos_dir)
test_neg_dir = os.path.join(test_dir, 'neg')
os.mkdir(test_neg_dir)

print("Train, Validation, and Test folders made for both A and B datasets")
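
The repeated os.mkdir calls above raise an error if a folder already exists, for instance when a notebook cell is re-run. As a hedged alternative, the same folder tree could be built with a small helper. This is only a sketch: the name make_split_dirs and the temporary demo directory are assumptions for illustration and are not part of the original code.

```python
import os
import tempfile


def make_split_dirs(root, splits=("train", "validation", "test"), classes=("pos", "neg")):
    """Create root/<split>/<class> folders and return their paths in a dict."""
    paths = {}
    for split in splits:
        for cls in classes:
            path = os.path.join(root, split, cls)
            # exist_ok=True makes the call safe to repeat
            os.makedirs(path, exist_ok=True)
            paths[(split, cls)] = path
    return paths


# Demo in a throwaway directory rather than the real "work/" folder
demo_root = tempfile.mkdtemp()
split_dirs = make_split_dirs(demo_root)
make_split_dirs(demo_root)  # a second call does not raise
print(sorted(os.listdir(demo_root)))  # ['test', 'train', 'validation']
```

With this helper, a path such as split_dirs[("train", "pos")] would play the role of train_pos_dir.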

To simplify things for later viewing and analysis, let’s copy each image into its working directory under a new name that reflects its target class: pos for positive (A) and neg for negative (B).

# Copy each parasitized image into work/A/ as pos0.jpg, pos1.jpg, ...
i = 0
for filename in os.listdir(base_dir_A):
    src = base_dir_A + filename
    dst = work_dir_A + "pos" + str(i) + ".jpg"
    # shutil.copy() leaves the source file in place; only the copy gets the new name
    shutil.copy(src, dst)
    i += 1

# Copy each uninfected image into work/B/ as neg0.jpg, neg1.jpg, ...
j = 0
for filename in os.listdir(base_dir_B):
    src = base_dir_B + filename
    dst = work_dir_B + "neg" + str(j) + ".jpg"
    shutil.copy(src, dst)
    j += 1

print("Images for both categories have been copied to working directories, renamed to pos/neg + num")
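
The copy loops above rename every entry returned by os.listdir, whatever its type. As a hedged sketch, a variant could skip stray non-image files and sort the listing so that the numbering is reproducible between runs. The helper name copy_with_prefix, the extension list, and the demo files are all assumptions made for illustration.

```python
import os
import shutil
import tempfile

# Assumed set of extensions that count as images
IMAGE_EXTENSIONS = (".png", ".jpg", ".jpeg")


def copy_with_prefix(src_dir, dst_dir, prefix, suffix=".jpg"):
    """Copy image files from src_dir to dst_dir as <prefix>0<suffix>, <prefix>1<suffix>, ..."""
    os.makedirs(dst_dir, exist_ok=True)
    # Sort so the same file always receives the same index
    names = sorted(
        name for name in os.listdir(src_dir)
        if name.lower().endswith(IMAGE_EXTENSIONS)
    )
    for index, name in enumerate(names):
        src = os.path.join(src_dir, name)
        dst = os.path.join(dst_dir, prefix + str(index) + suffix)
        shutil.copy(src, dst)  # the file content is copied unchanged, only the name differs
    return len(names)


# Demo with placeholder files in temporary folders
demo_src = tempfile.mkdtemp()
demo_dst = os.path.join(tempfile.mkdtemp(), "A")
for name in ("cell_b.png", "cell_a.png", "notes.txt"):
    with open(os.path.join(demo_src, name), "wb") as handle:
        handle.write(b"placeholder")

copied = copy_with_prefix(demo_src, demo_dst, "pos")
print(copied, sorted(os.listdir(demo_dst)))
```

The default suffix mirrors the .jpg naming used in this tutorial so that the later split code would still find the files.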

Now that all directories have been made, let’s perform a manual train/validation/test split and copy our renamed images into their respective directories. In our example, each class will have 3000 training images, 1000 validation images, and 500 test images. The training images are used to fit the network’s parameters (its weights), while the validation images are used to tune hyperparameters and to monitor how well the model generalizes during training. The test images are held back for a final, unbiased check. The roles of the three datasets roughly approximate those of practice questions, practice exams, and final exams, respectively.

fnames = ['pos{}.jpg'.format(i) for i in range(3000)]
for fname in fnames:
    src = os.path.join(work_dir_A, fname)
    dst = os.path.join(train_pos_dir, fname)
    shutil.copyfile(src, dst)

fnames = ['pos{}.jpg'.format(i) for i in range(3000, 4000)]
for fname in fnames:
    src = os.path.join(work_dir_A, fname)
    dst = os.path.join(validation_pos_dir, fname)
    shutil.copyfile(src, dst)

fnames = ['pos{}.jpg'.format(i) for i in range(4000, 4500)]
for fname in fnames:
    src = os.path.join(work_dir_A, fname)
    dst = os.path.join(test_pos_dir, fname)
    shutil.copyfile(src, dst)
    


fnames = ['neg{}.jpg'.format(i) for i in range(3000)]
for fname in fnames:
    src = os.path.join(work_dir_B, fname)
    dst = os.path.join(train_neg_dir, fname)
    shutil.copyfile(src, dst)

fnames = ['neg{}.jpg'.format(i) for i in range(3000, 4000)]
for fname in fnames:
    src = os.path.join(work_dir_B, fname)
    dst = os.path.join(validation_neg_dir, fname)
    shutil.copyfile(src, dst)

fnames = ['neg{}.jpg'.format(i) for i in range(4000, 4500)]
for fname in fnames:
    src = os.path.join(work_dir_B, fname)
    dst = os.path.join(test_neg_dir, fname)
    shutil.copyfile(src, dst)
    
print("Train, validation, and test datasets split and ready for use")

print('total training pos images:', len(os.listdir(train_pos_dir)))
print('total training neg images:', len(os.listdir(train_neg_dir)))
print('total validation pos images:', len(os.listdir(validation_pos_dir)))
print('total validation neg images:', len(os.listdir(validation_neg_dir)))
print('total test pos images:', len(os.listdir(test_pos_dir)))
print('total test neg images:', len(os.listdir(test_neg_dir)))
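
The six copy loops above differ only in their index range and target folder. The arithmetic behind the 3000/1000/500 split can be sketched with two small helpers. The names split_ranges and split_filenames are hypothetical, and the sketch only builds the filename lists without touching any files.

```python
def split_ranges(n_train, n_val, n_test):
    """Return consecutive, non-overlapping index ranges for the three splits."""
    return {
        "train": range(0, n_train),
        "validation": range(n_train, n_train + n_val),
        "test": range(n_train + n_val, n_train + n_val + n_test),
    }


def split_filenames(prefix, ranges, suffix=".jpg"):
    """Build the per-split filename lists, e.g. pos0.jpg ... pos2999.jpg for training."""
    return {
        split: [prefix + str(i) + suffix for i in indices]
        for split, indices in ranges.items()
    }


ranges = split_ranges(3000, 1000, 500)
pos_files = split_filenames("pos", ranges)
neg_files = split_filenames("neg", ranges)

print({split: len(names) for split, names in pos_files.items()})
# {'train': 3000, 'validation': 1000, 'test': 500}
```

Because the ranges are consecutive, no image can end up in more than one split, which is the property the manual loops rely on.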

Next, we prepare and normalize our data inputs for the network. We can use Keras’s ImageDataGenerator class to automate this step, while simultaneously producing input batches for mini-batch gradient descent training. To summarize, the ImageDataGenerator normalizes pixel intensities to between 0 and 1, converts the image content into floating-point tensors, and resizes each image to 150 x 150 pixels. As we split our data into two output folder classes, we specify the class_mode as “binary”, and flow_from_directory labels each image according to the sub-folder it sits in. Folder names are indexed in alphanumeric order, so neg maps to 0 and pos maps to 1.

from keras.preprocessing.image import ImageDataGenerator

train_datagen = ImageDataGenerator(rescale=1./255)
test_datagen = ImageDataGenerator(rescale=1./255)

train_generator = train_datagen.flow_from_directory(
        train_dir,
        target_size=(150, 150),
        batch_size=20,
        class_mode='binary')
validation_generator = test_datagen.flow_from_directory(
        validation_dir,
        target_size=(150, 150),
        batch_size=20,
        class_mode='binary')

print("Image preprocessing complete")
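
To make the two preprocessing ideas concrete, here is a minimal pure-Python sketch of pixel rescaling and folder-based labelling. It does not call Keras; the helpers rescale and class_indices are hypothetical stand-ins that mimic the rescale=1./255 argument and the alphanumeric ordering of class folders. On the real generator, the mapping can be checked via train_generator.class_indices.

```python
def class_indices(folder_names):
    """Map class folder names to integer labels in sorted (alphanumeric) order."""
    return {name: index for index, name in enumerate(sorted(folder_names))}


def rescale(pixels, factor=1.0 / 255):
    """Scale 8-bit pixel intensities (0..255) into the range 0..1."""
    return [value * factor for value in pixels]


labels = class_indices(["pos", "neg"])
print(labels)  # {'neg': 0, 'pos': 1}

scaled = rescale([0, 51, 255])
print(scaled)
```

The practical consequence is that a sigmoid output close to 1 corresponds to the pos (infected) class, and an output close to 0 to the neg (uninfected) class.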

With all data preparation completed, we can now build our network. Our CNN architecture consists of several convolutional layers for feature representation and fully-connected layers followed by a sigmoid classifier for binary classification.

Briefly, the role of the convolutional layers is to learn discriminative low-level and high-level features by identifying important pixel patterns in image regions. The max pooling layers reduce the computational burden by keeping only the maximum value within each pooling window of a convolution’s output activation map. The reduced output from the final convolution and max pooling layer is flattened, passed through a dropout layer, and fed into a 512-dimensional fully-connected layer, and finally into a classification layer featuring sigmoid activation to produce a classification result. In this tutorial, we use the RMSprop optimizer, but the reader is encouraged to try out other optimizers within the Keras library, such as SGD and Adam.

from keras import layers
from keras import models

model = models.Sequential()
model.add(layers.Conv2D(32, (3, 3), activation='relu',
                        input_shape=(150, 150, 3)))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(128, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(128, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Flatten())
model.add(layers.Dropout(0.5))
model.add(layers.Dense(512, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
model.summary()

from keras import optimizers
model.compile(loss='binary_crossentropy',
              optimizer=optimizers.RMSprop(lr=1e-5),
              metrics=['acc'])

print("Model created")
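
The figures reported by model.summary() can be reproduced by hand. The sketch below walks through the layer stack defined above, assuming the Keras defaults for these layers ('valid' padding and stride 1 for Conv2D, non-overlapping 2 x 2 windows for MaxPooling2D). The helper names are hypothetical.

```python
def conv_out(size, kernel=3):
    """Spatial size after a 'valid' convolution with stride 1."""
    return size - kernel + 1


def pool_out(size, pool=2):
    """Spatial size after non-overlapping max pooling."""
    return size // pool


def conv_params(in_channels, out_channels, kernel=3):
    """Weights plus one bias per output filter."""
    return kernel * kernel * in_channels * out_channels + out_channels


size, channels = 150, 3
total_params = 0
for filters in (32, 64, 128, 128):
    total_params += conv_params(channels, filters)
    size = pool_out(conv_out(size))
    channels = filters
    print("after conv + pool:", (size, size, channels))

flat = size * size * channels          # Flatten; Dropout adds no parameters
total_params += flat * 512 + 512       # Dense(512)
total_params += 512 * 1 + 1            # Dense(1) with sigmoid

print("flattened features:", flat)           # 6272
print("trainable parameters:", total_params)  # 3453121
```

Most of the parameters sit in the first dense layer (6272 x 512 weights), which is one reason a dropout layer is placed directly in front of it.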

Now let’s begin training our model. We will train for 30 epochs to balance accuracy and computational efficiency. The per-epoch metrics are stored in the history variable, from which we can plot the loss and accuracy curves, while the trained model itself is saved to an .h5 file.

history = model.fit_generator(
        train_generator,
        steps_per_epoch=100,
        epochs=30,
        validation_data=validation_generator,
        validation_steps=200)
model.save('basic_malaria_pos_neg_v1.h5')

import matplotlib.pyplot as plt

acc = history.history['acc']
val_acc = history.history['val_acc']
loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(1, len(acc) + 1)
plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.legend()
plt.figure()
plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.legend()
plt.show()
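
Besides reading the plots by eye, the recorded validation loss can be inspected numerically. One common heuristic is to find the epoch with the lowest validation loss and count how many epochs have passed since then. The sketch below uses made-up loss values purely for illustration; they are not results from this model, and the helper names are hypothetical.

```python
def best_epoch(val_loss):
    """Return the 1-based epoch with the lowest validation loss."""
    best_index = min(range(len(val_loss)), key=val_loss.__getitem__)
    return best_index + 1


def epochs_since_best(val_loss):
    """Number of epochs trained after the validation loss bottomed out."""
    return len(val_loss) - best_epoch(val_loss)


# Made-up values for illustration only
example_val_loss = [0.69, 0.52, 0.38, 0.30, 0.27, 0.29, 0.31]

print(best_epoch(example_val_loss))         # 5
print(epochs_since_best(example_val_loss))  # 2
```

In the training run above, the same functions could be applied to history.history['val_loss']. A validation loss that keeps rising after its minimum while the training loss keeps falling is the usual sign of overfitting.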

Looks pretty good! Our validation accuracy is over 92%, although validation plots indicate that we are beginning to overfit to our training data. However, not bad for a quick initial model. Finally, let’s use our trained model on our test dataset, and visually inspect the results.

eval_datagen = ImageDataGenerator(rescale=1./255)
eval_generator = eval_datagen.flow_from_directory(
        test_dir,
        target_size=(150, 150),
        batch_size=20,
        class_mode='binary',
        shuffle=False)  # keep predictions aligned with eval_generator.filenames
eval_generator.reset()
# One pass over the test set: 1000 images / batch size of 20 = 50 steps
pred = model.predict_generator(eval_generator, steps=len(eval_generator), verbose=1)
print("Predictions finished")

import matplotlib.image as mpimg

# Class folders are indexed alphanumerically: neg (B, uninfected) = 0, pos (A, infected) = 1
for index, probability in enumerate(pred):
    image_path = test_dir + "/" + eval_generator.filenames[index]
    img = mpimg.imread(image_path)

    plt.imshow(img)
    print(eval_generator.filenames[index])
    if probability > 0.5:
        plt.title("%.2f" % (probability[0]*100) + "% A")
    else:
        plt.title("%.2f" % ((1-probability[0])*100) + "% B")
    plt.show()
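
The decision rule applied to each sigmoid output can be isolated in a few lines. This sketch assumes the alphanumeric class ordering used by flow_from_directory (neg = 0, pos = 1) and a 0.5 threshold. The helper names and the example probabilities are made up for illustration.

```python
def decode_prediction(probability, class_names=("neg", "pos"), threshold=0.5):
    """Turn a sigmoid output into a (label, confidence) pair."""
    if probability > threshold:
        return class_names[1], probability
    return class_names[0], 1.0 - probability


def accuracy(probabilities, true_labels, threshold=0.5):
    """Fraction of thresholded predictions that match the integer labels."""
    predicted = [1 if p > threshold else 0 for p in probabilities]
    correct = sum(int(p == t) for p, t in zip(predicted, true_labels))
    return correct / len(true_labels)


# Made-up outputs and labels for illustration only
example_probs = [0.93, 0.08, 0.61, 0.40]
example_truth = [1, 0, 0, 0]

for p in example_probs:
    label, confidence = decode_prediction(p)
    print(label, "%.2f%%" % (confidence * 100))

print(accuracy(example_probs, example_truth))  # 0.75
```

With shuffling disabled on the evaluation generator, the true labels would be available as eval_generator.classes, so a test accuracy could be computed in the same way.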

Two test images with predicted class probabilities (left: infected, right: uninfected)

We’ve now given you a blueprint for classifying malaria-infected cells. Play around and see if you can improve upon it, and make a viable tool to save lives!

Special thank you to Arushi Goel for her knowledge and input.

References

Chollet, Deep Learning with Python

Written by Adrian Yijie Xu, PhD