COVID-19 X-Ray Data set — Preprocessing of the data set to train a CNN network using Python

ARNOLD SACHITH A HANS
Published in Analytics Vidhya
4 min read · Mar 19, 2020

Greetings to everyone!

Before we dive into the technical part, I would like to urge everyone to stay hydrated, use sanitizer frequently and follow all the prudent measures to protect yourself from the COVID-19 virus.

Source : COVID-19 X-Ray data set

Objective : Extract the X-ray images of both the lungs of patients affected by COVID-19 and healthy lungs, store each set in a separate folder, and resize the images so that they can be used to train a CNN.

This post will take you through the following:

  1. An overview of the COVID-19 X-Ray data set
  2. A Python program that uses the data in the .csv file to extract the X-ray images of patients affected by COVID-19, which can be used to train a CNN model
  3. Extracting a set of images of healthy lungs from the Kaggle Chest X-Ray Images dataset
  4. Resizing the images to train a CNN

Datasets: The COVID-19 X-Ray data set can be accessed through this link. I would like to thank Joseph Paul Cohen, Postdoctoral Fellow at Mila, University of Montreal, who took the initiative to collect these X-ray images to aid researchers in their work.

The data set includes a metadata.csv file with 119 rows (excluding the header) and 15 columns, namely ‘Patient_id’, ‘offset’, ‘sex’, ‘age’, ‘finding’, ‘survival’, ‘view’, ‘date’, ‘location’, ‘filename’, ‘doi’, ‘url’, ‘license’, ‘clinical notes’ and ‘other notes’. The ‘finding’, ‘view’ and ‘filename’ columns are the ones that help us extract the images we need.

Note: The file names of the images in the data set match the entries in the ‘filename’ column, so kindly do not rename the image files.
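Before extracting anything, it can help to look at what the metadata actually contains. The following is a minimal sketch, not the code from the original post: it uses only the standard library, relies on the ‘finding’ and ‘view’ column names listed above, and the function name `summarize_metadata` is my own for illustration.

```python
import csv
from collections import Counter


def summarize_metadata(csv_path):
    """Return the column names, the row count and a count of (finding, view) pairs."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        rows = list(reader)
        columns = list(reader.fieldnames or [])
    # Count how many images exist for each finding/view combination.
    pair_counts = Counter((row["finding"], row["view"]) for row in rows)
    return columns, len(rows), pair_counts
```

Calling `summarize_metadata("metadata.csv")` (assuming the file sits in the current working directory) returns the header, the number of rows and a `Counter` whose keys are pairs such as `("COVID-19", "PA")`, which shows how many images each combination would yield.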

Python code to extract the images of COVID-19 cases recorded in PA view.

You can access the code through this link.
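For readers who cannot open the link, here is a rough sketch of how such an extraction could look. It is not the original code: it uses the standard `csv` and `shutil` modules, the function name `extract_images` and all folder paths are hypothetical, and it assumes the ‘finding’ column holds the value `COVID-19` and the ‘view’ column holds `PA` for the rows we want.

```python
import csv
import shutil
from pathlib import Path


def extract_images(metadata_csv, image_dir, output_dir, finding="COVID-19", view="PA"):
    """Copy every image whose metadata row matches `finding` and `view`.

    Returns the list of file names that were copied.
    """
    image_dir = Path(image_dir)
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    copied = []
    with open(metadata_csv, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            # Compare after stripping stray whitespace around the values.
            if (row.get("finding") or "").strip() != finding:
                continue
            if (row.get("view") or "").strip() != view:
                continue
            filename = (row.get("filename") or "").strip()
            source = image_dir / filename
            # Skip rows whose image is missing from the image folder.
            if not filename or not source.is_file():
                continue
            shutil.copy2(source, output_dir / filename)
            copied.append(filename)
    return copied
```

A call such as `extract_images("metadata.csv", "images", "covid_pa")` would copy the matching files into a `covid_pa` folder (the folder names here are placeholders). Because the lookup goes through the ‘filename’ column, the images must keep their original names.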

In total, 56 images were extracted to the output folder; these are the COVID-19 cases recorded in posteroanterior (PA) view. The folder looks something like this:

A total of 56 images were extracted to the destination folder

If you look at the images carefully, you will see that they are not all the same size. To train a CNN model, all input images must have the same size. The following code resizes and saves all the images in one go:

Python code for uniform resizing of images

You can access the code through this link.
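As a stand-in for the linked code, the resizing step could be sketched as follows. This is an assumption-laden example rather than the original implementation: it uses the Pillow library (the original may well use something else), the function name `resize_images` is hypothetical, and the target size of 224 × 224 pixels is my own choice because it is the default input size for models such as VGG and ResNet.

```python
from pathlib import Path

from PIL import Image

# File extensions treated as images; anything else in the folder is ignored.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def resize_images(input_dir, output_dir, size=(224, 224)):
    """Resize every image in `input_dir` to `size` and save it under `output_dir`.

    Returns the list of file names that were written.
    """
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    written = []
    for path in sorted(Path(input_dir).iterdir()):
        if path.suffix.lower() not in IMAGE_SUFFIXES:
            continue
        with Image.open(path) as img:
            # Convert to RGB so every output has three channels, then
            # stretch to the target size (the aspect ratio is not preserved).
            resized = img.convert("RGB").resize(size)
            resized.save(output_dir / path.name)
        written.append(path.name)
    return written
```

Stretching every image to a fixed size is the simplest option; if distortion turns out to matter, padding the image to a square before resizing is a common alternative.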

Now we have 56 X-ray images of COVID-19 cases recorded in PA view. Take the X-ray images of ‘Healthy lungs’ from the Kaggle Chest X-Ray Images dataset, repeat the same procedure of extracting and resizing them, and store the images in a separate folder.

Finally, we have two folders of X-ray images: COVID-19 positive cases and COVID-19 negative cases (normal X-ray images of healthy people). The images from the Kaggle Chest X-Ray Images dataset have some issues, such as noise and a few inappropriately labelled images, but they should work fine for training a basic CNN model.
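With the two folders in place, each image can be paired with a class label before it is fed to a model. The snippet below is only an illustrative sketch: the function name `list_labelled_images` and the label convention (1 for COVID-19 positive, 0 for normal) are my assumptions, not something fixed by the data sets.

```python
from pathlib import Path


def list_labelled_images(positive_dir, negative_dir):
    """Return (path, label) pairs: 1 for COVID-19 positive, 0 for normal."""
    samples = []
    for folder, label in ((positive_dir, 1), (negative_dir, 0)):
        # Sort so the order is the same on every run.
        for path in sorted(Path(folder).iterdir()):
            if path.is_file():
                samples.append((path, label))
    return samples
```

The resulting list can then be shuffled and split into training and validation sets in whichever framework you prefer.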

You can fine-tune a pre-trained model such as VGG or ResNet, or train a CNN from scratch, tuning the hyperparameters to achieve better accuracy.

Disclaimer : This article does not focus on the research perspective of identifying COVID-19; rather, it throws light on the contributions that machine learning, deep learning and artificial intelligence engineers can offer to society. I must remind you that I am not a medical expert. For any real-world application to detect COVID-19, the model should undergo rigorous testing and be verified by a medical expert before being deployed.

You can connect with me through LinkedIn and Instagram.

If you have any suggestions, kindly share them in the comments.

Happy Coding!

Cheers :)
Arnold Sachith

An aspiring AI engineer | M.Tech (Artificial Intelligence) | B.E (Mechatronics Engineering) | Writer | Robots Rule | AI for the betterment of society