Published in Phi Skills

How to use Deep Learning to detect COVID-19 from x-ray scans with 96% accuracy

Disclaimer: I am neither a doctor nor a medical researcher. This work is only intended as a source of inspiration for further studies.

The following notebook walks you through my journey of creating a dataset and training a deep convolutional network on it. I got to an amazing 96% accuracy. Don’t be too impressed though: it might very well be that the algorithm won’t generalise well, or that I made a mistake somewhere. That said, I hope you will enjoy it. Here is the link if you want to jump on Kaggle to play with the notebook; otherwise, just keep reading.

Why now?

The further the pandemic progresses, the more important it becomes that countries perform tests to help understand and stop the spread of COVID-19. Unfortunately, the capacity for COVID-19 testing is still low in many countries.

How are tests performed?

The standard COVID-19 tests are called PCR (polymerase chain reaction) tests. This family of tests looks for the genetic material of the virus itself rather than for antibodies. Two main issues with this test are:

  1. a shortage of tests available worldwide
  2. a patient might be carrying the virus without having symptoms, in which case the infection can go unidentified

Dr. Joseph Paul Cohen, Postdoctoral Fellow at the University of Montreal, recently open sourced a database containing chest x-ray pictures of patients suffering from COVID-19. As soon as I found this out, I decided to put into practice what I had learned during the first two weeks of the fast.ai DL course and build a classifier that predicts from a chest x-ray scan whether or not a patient has the virus.

The database only contains pictures of patients suffering from COVID-19. In order to build a classifier for x-ray images, we first need to find similar x-ray images of people who are not suffering from the disease. It turns out Kaggle has a database with chest x-ray images of patients suffering from pneumonia and of healthy patients. Hence, we are going to use images from both sources in our dataset.

The notebook is organized as follows:

  1. Data Preparation
  2. Train Network using Fastai
  3. Optimize Network
  4. Test
  5. What’s Next

But first let’s import necessary libraries

import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
import os

1. Data Preparation

Let’s import Fastai, create useful paths and create covid_df

from fastai import *
from fastai.vision import *

# useful paths
input_path = Path('/kaggle/input')
covid_xray_path = input_path/'xray-covid'
pneumonia_path = input_path/'chest-xray-pneumonia/chest_xray'

covid_df = pd.read_csv(covid_xray_path/'metadata.csv')
covid_df.head()

We notice straight away that there are a large number of NaN values. Let’s drop every column that contains one and see what we are left with.

covid_df.dropna(axis=1, inplace=True)
covid_df
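To make the effect of dropna(axis=1) concrete: it removes whole columns, not rows, as soon as a column has a single missing value. Here is a minimal pure-Python sketch of that rule. The rows are made-up placeholders (only the column names echo the metadata file), and drop_columns_with_missing is a hypothetical helper, not a pandas function.

```python
# Pure-Python illustration of the dropna(axis=1) rule:
# a column survives only if no row has a missing value in it.
# The rows below are hypothetical placeholders, not real metadata.
rows = [
    {"finding": "COVID-19", "view": "PA", "age": None, "filename": "a.png"},
    {"finding": "other", "view": "AP", "age": 54, "filename": "b.png"},
]


def drop_columns_with_missing(rows):
    """Return copies of the rows without any column that has a None value."""
    keep = [col for col in rows[0] if all(r[col] is not None for r in rows)]
    return [{col: r[col] for col in keep} for r in rows]


cleaned = drop_columns_with_missing(rows)
print(list(cleaned[0]))  # names of the columns that survived
```

Note that a column such as age disappears entirely even though only one row was missing it, which is why so few columns are left after this step.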

That looks better. We are mainly interested in two columns: finding and filename. The former tells us whether or not a patient is suffering from the virus, whereas the latter tells us the filename. The other interesting column is view. It turns out the view is the angle used when the scan is taken, and the most frequently used one is PA, which stands for posteroanterior.

covid_df.groupby('view').count()

PA makes up the majority of the datapoints. Let’s keep them and remove the rest.

covid_df = covid_df[lambda x: x['view'] == 'PA']
covid_df

For simplicity, let’s also rename the elements in column finding to be positive if the patient is suffering from COVID-19 and negative otherwise.

covid_df['finding'] = covid_df['finding'].apply(lambda x: 'positive' if x == 'COVID-19' else 'negative')
covid_df

Finally, let’s replace the filename column with the full system path and keep only the two columns we are most interested in.

def makeFilename(x=''):
    return input_path/f'xray-covid/images/{x}'

covid_df['filename'] = covid_df['filename'].apply(makeFilename)
covid_df = covid_df[['finding', 'filename']]
covid_df

We now need to create a dataframe of the same format using the pictures from the other database. Once we have that dataframe, we can use the mighty ImageDataBunch methods to create a dataset that we can feed to our convolutional network.

Since we have 92 pictures in our covid_df, I decided to take an equal number of pictures of healthy patients and an equal number of pictures of pneumonia patients. In other words: 92 COVID-19 images, 92 healthy patient images, and 92 pneumonia patient images.

# collect the healthy (NORMAL) images from every split
healthy_df = pd.DataFrame([], columns=['finding', 'filename'])
folders = ['train/NORMAL', 'val/NORMAL', 'test/NORMAL']
for folder in folders:
    fnames = get_image_files(pneumonia_path/folder)
    fnames = map(lambda x: ['negative', x], fnames)
    df = pd.DataFrame(fnames, columns=['finding', 'filename'])
    healthy_df = healthy_df.append(df, ignore_index=True)

# same for the pneumonia images, also labelled negative
pneumonia_df = pd.DataFrame([], columns=['finding', 'filename'])
folders = ['train/PNEUMONIA', 'val/PNEUMONIA', 'test/PNEUMONIA']
for folder in folders:
    fnames = get_image_files(pneumonia_path/folder)
    fnames = map(lambda x: ['negative', x], fnames)
    df = pd.DataFrame(fnames, columns=['finding', 'filename'])
    pneumonia_df = pneumonia_df.append(df, ignore_index=True)

# downsample both groups to the size of covid_df
pneumonia_df = pneumonia_df.sample(covid_df.shape[0]).reset_index(drop=True)
healthy_df = healthy_df.sample(covid_df.shape[0]).reset_index(drop=True)
negative_df = healthy_df.append(pneumonia_df, ignore_index=True)
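The balancing idea used here (shrink every class to the size of the smallest one) can be sketched without pandas. This is only an illustration: balanced_sample is a hypothetical helper, and the file names and the two larger counts are invented. Only the 92 comes from the text above.

```python
import random


def balanced_sample(groups, seed=0):
    """Downsample every group to the size of the smallest one."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    n = min(len(items) for items in groups.values())
    return {label: rng.sample(items, n) for label, items in groups.items()}


# Hypothetical file lists; only the relative sizes matter here.
groups = {
    "covid": [f"covid_{i}.png" for i in range(92)],
    "healthy": [f"normal_{i}.png" for i in range(1500)],
    "pneumonia": [f"pneumonia_{i}.png" for i in range(4000)],
}
sample = balanced_sample(groups)
print({label: len(items) for label, items in sample.items()})
```

Sampling without replacement keeps each image at most once per class, which matters when the same files later end up split between training and validation.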

2. Train Network using Fastai

I am going to run the Convolutional Net using two training sets.

  • The first will have covid_df and healthy_df
  • The second one will have covid_df and pneumonia_df

We will then compare the performance of the two and, hopefully, get comparable numbers, which would give us more confidence in our results.

First case: COVID-19 patients and healthy patients

First, we need to merge the dataframes:

# negative_df holds both the healthy and the pneumonia samples built above
df = covid_df.append(negative_df, ignore_index=True)
df = df.sample(frac=1).reset_index(drop=True)
df.sample(20)

Next, create the ImageDataBunch

np.random.seed(42)
data = ImageDataBunch.from_df(
    '/',
    df,
    fn_col='filename',
    label_col='finding',
    ds_tfms=get_transforms(),
    size=224,
    num_workers=4
).normalize(imagenet_stats)
data.show_batch(rows=80, figsize=(21,21))
batch taken from our dataset
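The call to normalize(imagenet_stats) rescales each colour channel with the mean and standard deviation of ImageNet, because the pretrained resnet weights expect inputs on that scale. Below is a minimal sketch of the arithmetic for a single pixel. normalize_pixel is a hypothetical helper written for illustration; fastai applies the same formula to whole image tensors.

```python
# Per-channel mean and standard deviation of ImageNet
# (RGB order, pixel values scaled to [0, 1]).
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def normalize_pixel(rgb):
    """Apply (value - mean) / std to each channel of one RGB pixel."""
    return tuple((v - m) / s for v, m, s in zip(rgb, IMAGENET_MEAN, IMAGENET_STD))


# A pixel equal to the ImageNet mean lands exactly on zero in every channel.
print(normalize_pixel(IMAGENET_MEAN))
# A pure white pixel ends up a couple of standard deviations above zero.
print(normalize_pixel((1.0, 1.0, 1.0)))
```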

To my untrained eyes, the images look consistent. We are going to use a resnet50 and leverage Kaggle’s free GPU quota.

learn = cnn_learner(data, models.resnet50, metrics=error_rate)

Let’s first train for 10 epochs and see how it improves

learn.fit_one_cycle(10)
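A note on what fit_one_cycle(10) does: it runs a single learning-rate cycle stretched over 10 epochs, warming the rate up and then annealing it down. The sketch below approximates such a schedule with cosine ramps. The default values (peak of 3e-3, 30% warm-up, division factors) are what I believe fastai v1 uses, so treat them as assumptions. one_cycle_lr is a hypothetical function, not the library's scheduler.

```python
import math


def cosine_anneal(start, end, pct):
    """Cosine interpolation from start (pct=0) to end (pct=1)."""
    return end + (start - end) / 2 * (math.cos(math.pi * pct) + 1)


def one_cycle_lr(step, total_steps, max_lr=3e-3, pct_start=0.3,
                 div_factor=25.0, final_div=25.0 * 1e4):
    """Learning rate at a given step of a one-cycle schedule (sketch).

    The default arguments are assumed values, not read from the library.
    """
    warmup_steps = int(total_steps * pct_start)
    if step < warmup_steps:
        # phase 1: ramp up from max_lr / div_factor to max_lr
        return cosine_anneal(max_lr / div_factor, max_lr, step / warmup_steps)
    # phase 2: anneal from max_lr down to max_lr / final_div
    pct = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return cosine_anneal(max_lr, max_lr / final_div, pct)


total = 100  # total number of batches over all epochs (illustrative)
for step in (0, 30, 65, 100):
    print(step, f"{one_cycle_lr(step, total):.2e}")
```

Calling fit_one_cycle a second time, as done next, simply starts a fresh cycle from the current weights.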

A little more homework won’t harm the kid.

learn.fit_one_cycle(10)

A 5.5% error rate, which is already good enough. We are now ready to run the model on data from covid_df and from pneumonia_df. But let’s save first.

learn.save('stage-1')

Second case: COVID-19 patients and pneumonia patients

The same process as in the previous case is applied here. We just append 2 to each variable name (e.g. df becomes df2).

df2 = covid_df.append(pneumonia_df, ignore_index=True)
df2 = df2.sample(frac=1).reset_index(drop=True)
np.random.seed(42)
data2 = ImageDataBunch.from_df(
    '/',
    df2,
    fn_col='filename',
    label_col='finding',
    ds_tfms=get_transforms(),  # data augmentation, including horizontal flips
    size=224,
    num_workers=4
).normalize(imagenet_stats)
learn2 = cnn_learner(data2, models.resnet50, metrics=error_rate)
learn2.fit_one_cycle(10)
learn2.fit_one_cycle(10)

That’s a nice enough error rate. Let’s save learn2 and start optimization.

learn2.save('learn2-stage-1')

3. Optimize

Results for the first case were already pretty solid in the previous section. Here we are first going to optimize the results for the first case and then those for the second case. Then, if the accuracies of the two cases do not differ too much, we will be confident in our result and try to predict a random image found online.

First case: COVID-19 patients and healthy patients

Let’s plot the learning rate curve first.

learn.save('stage-1')
learn.unfreeze()
learn.lr_find()
learn.recorder.plot()

The longest downward slope is found in the region around 1e-4, so let's use that as our starting point

learn.fit_one_cycle(10, max_lr=slice(8e-5,2e-4))
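Reading the learning rate curve by eye, as done above, amounts to looking for the region where the loss drops fastest. A rough way to mimic that is to compute the slope of the loss against log10 of the learning rate and pick the steepest descent. This is a heuristic sketch on invented numbers, not what lr_find does internally. steepest_descent_lr is a hypothetical helper.

```python
import math


def steepest_descent_lr(lrs, losses):
    """Return the learning rate where loss falls fastest per unit of log10(lr)."""
    best_lr, best_slope = None, 0.0
    for i in range(1, len(lrs)):
        step = math.log10(lrs[i]) - math.log10(lrs[i - 1])
        slope = (losses[i] - losses[i - 1]) / step
        if slope < best_slope:  # more negative means a steeper drop
            best_lr, best_slope = lrs[i - 1], slope
    return best_lr


# Hypothetical curve: nearly flat, then a clear drop, then divergence.
lrs = [1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1]
losses = [0.90, 0.88, 0.80, 0.55, 0.60, 2.00]
print(steepest_descent_lr(lrs, losses))
```

On a real, noisy curve the losses would need smoothing first, so the plot remains the more reliable guide.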

I obtained the 0% error rate after I updated my notebook on Kaggle and used a balanced dataset. This error rate, though, probably reflects the fact that I am still collecting data; many more images would be needed for a stable estimate. Previously, it was at 3.6%, which is about 96% accuracy (hence the title…)
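To put a number on how unstable an error rate is when the validation set is small, one option is a Wilson score interval. The sketch below is a standard formula applied to a hypothetical validation size; n_valid is an assumption, not the notebook's actual count.

```python
import math


def wilson_interval(errors, n, z=1.96):
    """Wilson score interval for an error rate of errors/n (z=1.96 is ~95%)."""
    p = errors / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


# n_valid is a hypothetical validation-set size chosen for illustration.
n_valid = 40
low, high = wilson_interval(errors=0, n=n_valid)
print(f"0/{n_valid} errors -> plausible true error rate: {low:.1%} to {high:.1%}")
```

With 40 validation images and zero observed errors, the upper bound still comes out near 9%, so a 0% reading on a set this size is compatible with a noticeably worse model.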

With this error rate we can be satisfied with these first results. We are going to save and plot the confusion matrix.

learn.save('stage-2')
interp = ClassificationInterpretation.from_learner(learn)
interp.plot_confusion_matrix()

Since the error rate is 0, the confusion matrix shows we have no errors.
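For readers who have not met a confusion matrix before, here is a small sketch of what it counts and of two rates that matter for a medical test. The labels and predictions are invented for illustration and deliberately contain errors. confusion_counts is a hypothetical helper, not part of fastai.

```python
from collections import Counter


def confusion_counts(y_true, y_pred, positive="positive"):
    """Count true/false positives and negatives for a binary task."""
    counts = Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            counts["tp" if pred == positive else "fn"] += 1
        else:
            counts["fp" if pred == positive else "tn"] += 1
    return counts


# Hypothetical labels and predictions, for illustration only.
y_true = ["positive", "positive", "negative", "negative", "negative"]
y_pred = ["positive", "negative", "negative", "negative", "positive"]
c = confusion_counts(y_true, y_pred)
sensitivity = c["tp"] / (c["tp"] + c["fn"])  # share of positive cases caught
specificity = c["tn"] / (c["tn"] + c["fp"])  # share of negative cases cleared
print(dict(c), sensitivity, specificity)
```

A single error rate hides the difference between a missed infection (fn) and a false alarm (fp), which is why the matrix is worth plotting even when the headline number looks good.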

Second case: COVID-19 patients and pneumonia patients

learn2.unfreeze()
learn2.lr_find()
learn2.recorder.plot()

Let’s try setting the learning rate around 1e-4 .

learn2.unfreeze()
learn2.fit_one_cycle(22, max_lr=slice(7e-5,1e-4))
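Passing a slice as max_lr asks fastai for discriminative learning rates: the earliest layer group trains at the lower bound, the last group at the upper bound, and the groups in between are spaced multiplicatively. The sketch below reproduces that spacing. spread_learning_rates is a hypothetical helper, and the count of three layer groups is an assumption made for illustration.

```python
def spread_learning_rates(start, stop, n_groups):
    """Geometrically space n_groups learning rates from start to stop."""
    if n_groups == 1:
        return [stop]
    ratio = (stop / start) ** (1 / (n_groups - 1))
    return [start * ratio ** i for i in range(n_groups)]


# Assuming three layer groups; the bounds are the ones used above.
rates = spread_learning_rates(7e-5, 1e-4, 3)
print([f"{lr:.2e}" for lr in rates])
```

The intent is that the pretrained early layers, which already encode generic image features, move less than the freshly added classification head.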

Both cases have an error rate < 3%. Given the scarcity of data, this is a promising first result. Since the COVID-19 predictions of the two models seem to be consistent, we can be reasonably confident in them.

4. Test

Finally, let’s test our model on a random COVID-positive image taken from Radiopaedia.

img = open_image(input_path/'test-img/df1053d3e8896b53ef140773e10e26_gallery.jpeg')
learn.predict(img)

Our model correctly predicted this image belongs to a positive covid-19 patient. That makes us very happy.
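In fastai v1, learn.predict returns the predicted category together with its index and the per-class probabilities, so it is worth looking at how confident the model is and not only at the label. A minimal sketch of that reading follows. The class order and the scores are invented, and top_class is a hypothetical helper.

```python
def top_class(classes, probabilities):
    """Return the most likely class and its probability."""
    idx = max(range(len(probabilities)), key=probabilities.__getitem__)
    return classes[idx], probabilities[idx]


# Hypothetical class order and scores; a real run reads them from the learner.
classes = ["negative", "positive"]
probabilities = [0.07, 0.93]
label, confidence = top_class(classes, probabilities)
print(label, confidence)
```

A single correct prediction is an encouraging sanity check, not a measure of how the model behaves on images from other sources.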

5. What’s next?

I would like to incorporate scans from other sources and see whether accuracy and generalization improve. Today, while I was about to publish this article, I found out that MIT has released a database containing x-ray images of COVID-19 patients. Next, I am going to incorporate MIT’s database and see where we get.

I would be delighted to hear any suggestion or criticism 😅.

Ciao,
Michele
