Fighting Malaria with Machine Learning | Towards AI

Using Transfer Learning to Detect Malaria Diseases

Satsawat Natakarnkitkul
Jun 18 · 5 min read

Malaria remains one of the most common infectious diseases to date and a global health challenge. It is caused by a parasite transmitted through the bite of infected female Anopheles mosquitoes. The parasite that causes malaria is a microscopic, single-celled organism called ‘Plasmodium’.

  • In 2017, there were an estimated 219 million cases of malaria in 87 countries, with 435,000 deaths.¹
  • In 2017, the African Region was home to 92% of malaria cases and 93% of deaths.¹
  • Malaria is most commonly found in the tropical and sub-tropical areas of Africa, South America, and Asia.
  • Despite its fatality, malaria can be cured if detected early. The standard way to diagnose it accurately is to take a drop of blood, smear it on a slide, and examine it under a microscope for malaria parasites inside the red blood cells.
Figure 1: Microscope of blood smear; (1) Healthy red blood cell; (2) Malaria parasites developing within infected red blood cells; (3) Malaria parasites about to burst out (image credit: Will Hamilton)

The healthcare industry has started to turn to machine learning, training image classification models to reduce the burden on microscopists in resource-constrained regions and to improve diagnostic accuracy.

I will use a pre-trained model component from TensorFlow Hub to demonstrate how we can apply it to other problems, in this case, malaria detection. The article will follow the general machine learning workflow:

  1. Examine and understand the data by visualizing those microscope pictures
  2. Build the data pipeline to the model
  3. Compose the model
  4. Train the model
  5. Model evaluation

The data set and explanation can be found here².

Data Understanding

The slide images of red blood cells are provided, compressed in a zip file. We should confirm how many images are given and what file extensions are present.
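As a sketch of this inspection step (assuming the archive is named `cell_images.zip`; the helper name is my own), we can list the archive contents by extension with Python's `zipfile` module:

```python
import zipfile
from collections import Counter
from pathlib import Path

def summarize_zip(zip_file):
    """Count the files in a zip archive by extension, skipping directories."""
    with zipfile.ZipFile(zip_file) as zf:
        names = [n for n in zf.namelist() if not n.endswith("/")]
    # Tally extensions so non-image files stand out immediately.
    ext_counts = Counter(Path(n).suffix.lower() for n in names)
    return len(names), ext_counts
```

Calling `summarize_zip("cell_images.zip")` returns the total file count and a per-extension breakdown, which makes it easy to spot files that are not images.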

Figure 2: Data in the given Zip file

Once we have removed unnecessary files (anything that is not an image) from the list, let’s observe sample images from each class.

Figure 3: Infected red blood cells
Figure 4: Uninfected red blood cell

One thing we notice from plotting these sample images is that they are not equal in size; this needs to be fixed before feeding them into the model.

Humans are good at spotting patterns in images: we can notice that the infected red blood cell images contain dense purple spots, which indicate malaria infection.

Data Pipeline

Normally, when working with a dataset, we would load the data into memory (e.g., a pandas data frame or a NumPy array). However, when working with unstructured data or large amounts of data, we often cannot fit all the images into memory (or doing so would be prohibitively expensive).

In this article, I will demonstrate the use of the flow_from_directory() method from the ImageDataGenerator class to feed the training image to the model.

So now we need to restructure the folders from what we currently have; the image below demonstrates the thought process.

Figure 5: Preprocessing prior to building the data pipeline

Ultimately, we only scan the file names and the labels associated with the images, without loading the actual images into memory.
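A minimal sketch of this restructuring, assuming the extracted images live in `cell_images/Parasitized` and `cell_images/Uninfected` (the function name and the split fractions are my own choices): the goal is the `<split>/<class>/` layout that `flow_from_directory()` expects.

```python
import random
import shutil
from pathlib import Path

def restructure(src_root, dst_root, classes=("Parasitized", "Uninfected"),
                splits=(("train", 0.7), ("val", 0.15), ("test", 0.15)), seed=42):
    """Copy images into dst_root/<split>/<class>/ so that
    flow_from_directory() can infer labels from the folder names."""
    rng = random.Random(seed)
    src_root, dst_root = Path(src_root), Path(dst_root)
    for cls in classes:
        files = sorted(p for p in (src_root / cls).iterdir()
                       if p.suffix.lower() == ".png")
        rng.shuffle(files)
        start = 0
        for i, (split, frac) in enumerate(splits):
            # The last split takes whatever remains, so no image is dropped.
            n = len(files) - start if i == len(splits) - 1 else int(len(files) * frac)
            out_dir = dst_root / split / cls
            out_dir.mkdir(parents=True, exist_ok=True)
            for p in files[start:start + n]:
                shutil.copy2(p, out_dir / p.name)
            start += n
```

Only file paths are touched here; no image is ever decoded into memory.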

Transfer Learning with TFHub

In this step, I will use the MobileNet V2 feature extractor, which is available here.

Load the pre-trained feature extractor module and freeze the weights

We load the feature extractor from TF Hub and get the expected image size. This will be used when we generate the data pipeline that feeds the training, validation, and test datasets from their folders.

Create a data pipeline using folders we have created

Next, we can use the flow_from_directory() method of the ImageDataGenerator class.

  • In the training data generator, I have applied some data augmentation, such as zoom, shift, and flip.
  • All generators rescale pixel values to the [0, 1] range.
  • Images from all generators are resized to match the feature extractor module’s expected input size.
  • The training and validation generators are shuffled so that the images are fed in random order.
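The generator setup described in the bullets above can be sketched as follows (the `data/` folder layout and the `make_generators` helper are assumptions carried over from the restructuring step, and the augmentation ranges are illustrative):

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

def make_generators(data_root, image_size=(224, 224), batch_size=32):
    """Build train/val/test generators from data_root/{train,val,test}/<class>/."""
    # Training images get light augmentation; every generator rescales to [0, 1].
    train_datagen = ImageDataGenerator(
        rescale=1.0 / 255,
        zoom_range=0.2,
        width_shift_range=0.1,
        height_shift_range=0.1,
        horizontal_flip=True,
    )
    plain_datagen = ImageDataGenerator(rescale=1.0 / 255)

    train_gen = train_datagen.flow_from_directory(
        f"{data_root}/train", target_size=image_size,
        batch_size=batch_size, class_mode="binary", shuffle=True)
    val_gen = plain_datagen.flow_from_directory(
        f"{data_root}/val", target_size=image_size,
        batch_size=batch_size, class_mode="binary", shuffle=True)
    test_gen = plain_datagen.flow_from_directory(
        f"{data_root}/test", target_size=image_size,
        batch_size=batch_size, class_mode="binary", shuffle=False)
    return train_gen, val_gen, test_gen
```

The test generator is left unshuffled so predictions can later be matched back to file names.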

We can then wrap the feature extractor with the classifier layer on top, specifically for our malaria classification task.

Add top classifier layer to the network
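A sketch of the classifier head, wrapping whatever feature extractor was loaded earlier (the `build_model` helper and the dropout rate are my own choices, not necessarily the article's exact settings):

```python
import tensorflow as tf

def build_model(feature_extractor, dropout=0.2):
    """Wrap a frozen feature extractor with a binary classifier head."""
    model = tf.keras.Sequential([
        feature_extractor,
        tf.keras.layers.Dropout(dropout),
        # Single sigmoid unit: infected vs. uninfected.
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(
        optimizer=tf.keras.optimizers.Adam(),
        loss="binary_crossentropy",
        metrics=["accuracy"],
    )
    return model
```

Because the extractor is frozen, only the small head is trained, which keeps training fast even on modest hardware.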

We can start training the network with our data set. Among the callbacks, I have included functions to fine-tune the learning rate and to stop the training early if there is no improvement.

Train the network using the generator
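The callbacks and the training call might look like this (the patience values and epoch count are illustrative, not the article's exact settings; `model`, `train_gen`, and `val_gen` come from the earlier steps):

```python
import tensorflow as tf

# Reduce the learning rate when val_loss plateaus, then stop early
# if it still does not improve; keep the best weights seen so far.
callbacks = [
    tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.2,
                                         patience=2, verbose=1),
    tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                     restore_best_weights=True),
]

# history = model.fit(train_gen, validation_data=val_gen,
#                     epochs=30, callbacks=callbacks)
```

The `fit()` call is shown commented out because it depends on the generators built earlier; the `history` object it returns feeds the plot below.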

Use the history to plot the progress and information during the training’s epochs.
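A sketch of such a plot from the `history.history` dictionary that `model.fit()` returns (the helper name and figure layout are my own):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this also runs without a display
import matplotlib.pyplot as plt

def plot_history(history_dict, out_path="training_history.png"):
    """Plot training vs. validation loss and accuracy curves side by side."""
    fig, (ax_loss, ax_acc) = plt.subplots(1, 2, figsize=(12, 4))
    ax_loss.plot(history_dict["loss"], label="train")
    ax_loss.plot(history_dict["val_loss"], label="validation")
    ax_loss.set_title("Loss")
    ax_loss.legend()
    ax_acc.plot(history_dict["accuracy"], label="train")
    ax_acc.plot(history_dict["val_accuracy"], label="validation")
    ax_acc.set_title("Accuracy")
    ax_acc.legend()
    fig.savefig(out_path)
    return out_path
```

Call it as `plot_history(history.history)` after training finishes.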

Figure 6: History plot — epoch 15 provides the best val_acc and val_loss

Finally, we can use the model to predict the test data generator.

Make a data frame with the test image name, true label and prediction
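This evaluation step can be sketched as follows, using scikit-learn for the three metrics quoted below (the helper name and the data frame column names are my own):

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score, accuracy_score, f1_score

def evaluate_predictions(filenames, y_true, y_prob, threshold=0.5):
    """Tabulate per-image predictions and compute summary metrics."""
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)
    results = pd.DataFrame({
        "image": filenames,
        "true_label": y_true,
        "probability": y_prob,
        "predicted_label": y_pred,
    })
    metrics = {
        "auc": roc_auc_score(y_true, y_prob),
        "accuracy": accuracy_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
    }
    return results, metrics
```

With an unshuffled test generator, `test_gen.filenames` and `test_gen.classes` supply the file names and true labels in prediction order.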

This simple transfer learning can achieve good performance across the metrics.

The AUC score: 0.9405, accuracy score: 0.9405, f1 score: 0.9413

What else can we improve?

Building on the existing MobileNet V2 model, we can unfreeze the weights and retrain several of the upper layers (not all of them). This allows the upper layers, which specialize in detecting specific patterns, to adapt to the malaria images.
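A sketch of this partial unfreezing, written for a layered Keras model such as `tf.keras.applications.MobileNetV2` (note that a TF Hub module appears as a single layer, so per-layer unfreezing applies to the Keras-applications form of the model; the helper and the learning rate are my own choices):

```python
import tensorflow as tf

def unfreeze_top_layers(model, n_layers):
    """Unfreeze only the last n_layers of a model for fine-tuning."""
    model.trainable = True
    for layer in model.layers[:-n_layers]:
        layer.trainable = False  # keep the lower layers frozen
    # Recompile with a much lower learning rate so fine-tuning
    # does not destroy the pre-trained weights.
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),
                  loss="binary_crossentropy", metrics=["accuracy"])
    return model
```

After unfreezing, a few additional epochs of training with the same generators usually squeeze out extra accuracy.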

In this article, I have used a pre-trained feature extractor from TFHub to predict malaria from the slide image data. This approach enables reusable machine learning modules and rapid data product development.

For the full code, please visit the following GitHub repository for a full notebook explanation:

Towards AI

Towards AI is the world’s fastest-growing AI community for learning, programming, building and implementing AI.


