Modern RecSys

COVID-19 Case Study with CNN

We will cluster COVID-19 X-ray images based on severity with our CNN RecSys flow using transfer learning, Spotify’s Annoy, and PyTorch

Kai Xin Thia
Analytics Vidhya


This work is a proof of concept showing how the framework we set up in the previous CNN chapter can be applied to a completely different domain.

We will swap out the training data and employ a more powerful pre-trained model (ResNet152); the rest of the code remains identical to what we used for the DeepFashion images. We aim to identify clusters of X-ray images with a similar severity of infection using Approximate Nearest Neighbors.

This work is neither intended as medical research nor representative of how CNNs can be used to detect COVID-19.

This is part of my Modern Visual RecSys series; feel free to check out the rest of the series at the end of the article.

The COVID-19 Data

From left to right: Healthy, infected, seriously-infected X-ray images of patients. Source: COVID-19 image data collection by Joseph Cohen

Intuition for why a CNN should work well on this data set:

As outlined in the previous chapter, the strength of a CNN lies in its convolutional filters. These filters are very good at detecting shapes, lines, and boundaries within an image. In the X-ray images, we see that as the infection worsens, the image blurs, the white areas grow, and the rib cage becomes less visible; these are visual cues that a CNN can pick up and learn.
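To make the idea of a filter responding to a boundary concrete, here is a minimal sketch in plain Python. The toy image and the hand-written Sobel kernel are assumptions for illustration only; a real CNN learns its filter weights from data rather than using fixed ones.

```python
def convolve2d(image, kernel):
    """Valid-mode 2D cross-correlation, the core operation of a CNN layer."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]

# Toy 4x6 "scan": dark pixels (0) on the left, bright pixels (1) on the right.
image = [[0, 0, 0, 1, 1, 1] for _ in range(4)]

# Hand-written vertical-edge filter (Sobel kernel), standing in for a learned filter.
sobel_x = [[-1, 0, 1],
           [-2, 0, 2],
           [-1, 0, 1]]

response = convolve2d(image, sobel_x)
for row in response:
    print(row)
```

The response is zero over the flat regions and non-zero only where the window straddles the dark-to-bright boundary, which is the kind of signal that changes as the rib cage becomes harder to see.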

Cleaning the data

  • As there are fewer than 25 samples of ARDS, Pneumocystis, SARS, and Streptococcus in total, I removed those samples and kept only the COVID and healthy samples.
  • As there are fewer than 25 CT scans, and only 1 of them is from a healthy patient, I removed the CT scans and kept only the X-rays.
  • After cleaning, we have 102 COVID X-rays and 1,584 healthy X-rays.
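The two cleaning rules above can be sketched as a simple filter. The records, the field names (`finding`, `modality`), and the label strings below are assumptions made for illustration; they are not taken from the actual notebook, so adapt them to the metadata you are working with.

```python
from collections import Counter

# Hypothetical metadata rows; field names and values are illustrative assumptions.
records = [
    {"file": "a.png", "finding": "COVID-19", "modality": "X-ray"},
    {"file": "b.png", "finding": "SARS", "modality": "X-ray"},
    {"file": "c.png", "finding": "COVID-19", "modality": "CT"},
    {"file": "d.png", "finding": "healthy", "modality": "X-ray"},
    {"file": "e.png", "finding": "ARDS", "modality": "X-ray"},
]

# Rule 1: keep only the two classes with enough samples.
KEEP_FINDINGS = {"COVID-19", "healthy"}

def clean(rows):
    """Keep X-ray rows (rule 2) whose finding is COVID-19 or healthy (rule 1)."""
    return [
        row for row in rows
        if row["modality"] == "X-ray" and row["finding"] in KEEP_FINDINGS
    ]

cleaned = clean(records)
class_counts = Counter(row["finding"] for row in cleaned)
print([row["file"] for row in cleaned])
print(dict(class_counts))
```

Printing the class counts after cleaning is a cheap way to keep the imbalance between COVID and healthy samples in view before training.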

Model training

We will follow the exact same steps outlined in the previous Convolutional Neural Networks RecSys chapter (you can refer back to that chapter for more details):

  1. Convert images to embeddings
  2. Conduct Transfer Learning from ResNet152
  3. Use Fastai hooks to retrieve image embeddings from step 2
  4. Use Approximate Nearest Neighbors to obtain the most similar images based on the embeddings from step 3.
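As a rough illustration of the embedding-retrieval step, the sketch below uses a plain PyTorch forward hook, which is the mechanism that hook utilities such as Fastai's build on. The tiny network is a hypothetical stand-in for the fine-tuned ResNet152 so that the example runs on CPU without downloading weights; the layer choice and tensor sizes are assumptions, not the notebook's actual code.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in for a fine-tuned backbone such as ResNet152.
model = nn.Sequential(
    nn.Conv2d(1, 4, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),       # the output of this layer is treated as the embedding
    nn.Linear(4, 2),    # classification head: COVID vs healthy
)
model.eval()

captured = {}

def save_embedding(module, inputs, output):
    """Forward hook: store the layer output each time the model runs."""
    captured["embedding"] = output.detach()

# Attach the hook to the layer just before the classification head.
handle = model[3].register_forward_hook(save_embedding)

batch = torch.randn(5, 1, 32, 32)  # 5 fake grayscale images
with torch.no_grad():
    logits = model(batch)
handle.remove()  # detach the hook once the embeddings are collected

embeddings = captured["embedding"]
print(embeddings.shape, logits.shape)
```

The hook lets us reuse the trained classifier unchanged: the forward pass still produces class scores, while the intermediate activations are captured on the side and can then be handed to a nearest-neighbor index such as Annoy.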

Analysis

Healthy X-ray scan with 36 most similar scans generated by our model. Source: COVID-19 image data collection by Joseph Cohen

For healthy X-ray scans, the 36 most similar X-rays retrieved by our model are all healthy. The model can identify and cluster healthy scans.

Infected X-ray scan with 36 most similar scans generated by our model. Source: COVID-19 image data collection by Joseph Cohen

For infected X-ray scans, our model usually retrieves a mix of roughly 80% infected scans and 20% healthy scans. When the infection is light, the model finds it challenging to differentiate between infected scans and healthy scans.

Seriously-Infected X-ray scan with 36 most similar scans generated by our model. Source: COVID-19 image data collection by Joseph Cohen

For seriously infected X-ray scans, the 36 most similar X-rays retrieved by our model are all infected. The model can identify and cluster seriously infected scans.
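The neighbor composition discussed in this section can be computed with a few lines of code. The sketch below uses an exact cosine-similarity search in plain Python as a stand-in for Annoy, which approximates the same kind of ranking much faster on large collections. The 2-D embeddings, scan ids, and labels are made up for illustration.

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest_neighbors(query_id, embeddings, k):
    """Exact top-k search by cosine similarity, excluding the query itself."""
    query = embeddings[query_id]
    scored = [
        (cosine_similarity(query, vec), item_id)
        for item_id, vec in embeddings.items()
        if item_id != query_id
    ]
    scored.sort(reverse=True)
    return [item_id for _, item_id in scored[:k]]

def label_mix(neighbor_ids, labels):
    """Fraction of each label among the retrieved neighbors."""
    counts = Counter(labels[i] for i in neighbor_ids)
    return {label: n / len(neighbor_ids) for label, n in counts.items()}

# Toy embeddings (hypothetical): h* are healthy scans, c* are COVID scans.
embeddings = {
    "h1": [1.0, 0.0], "h2": [0.9, 0.1], "h3": [0.8, 0.3],
    "c1": [0.1, 1.0], "c2": [0.0, 0.9], "c3": [0.6, 0.7],
}
labels = {
    "h1": "healthy", "h2": "healthy", "h3": "healthy",
    "c1": "covid", "c2": "covid", "c3": "covid",
}

# c3 sits between the two groups, like a lightly infected scan.
neighbors = nearest_neighbors("c3", embeddings, k=3)
mix = label_mix(neighbors, labels)
print(neighbors, mix)
```

A query that sits between the two groups, like `c3` here, gets neighbors from both classes, which mirrors the mixed results seen for lightly infected scans.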

Potential use case of this work

We can use this model to track the change in scan severity over time. If today's scan has fewer healthy neighbors than earlier scans and is drifting towards the seriously-infected cluster, this is a sign that the patient's condition has worsened.
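One possible way to turn this idea into a number is to track the share of healthy scans among each scan's nearest neighbors. The sketch below is only an illustration: the neighbor labels, the scan dates, and the `min_drop` threshold are hypothetical, and nothing here has been validated clinically.

```python
def healthy_fraction(neighbor_labels):
    """Share of a scan's nearest neighbors that are labelled healthy."""
    healthy = sum(1 for label in neighbor_labels if label == "healthy")
    return healthy / len(neighbor_labels)

def is_worsening(fractions, min_drop=0.1):
    """Flag a drop of at least `min_drop` (hypothetical threshold)
    between the first and the latest scan."""
    return fractions[0] - fractions[-1] >= min_drop

# Hypothetical neighbor labels for one patient's scans on three dates.
scans = {
    "day 1": ["healthy"] * 3 + ["covid"] * 1,
    "day 4": ["healthy"] * 2 + ["covid"] * 2,
    "day 7": ["healthy"] * 1 + ["covid"] * 3,
}

trend = [healthy_fraction(neighbor_labels) for neighbor_labels in scans.values()]
worsening = is_worsening(trend)
print(trend, worsening)
```

A falling healthy fraction means the patient's scans are moving away from the healthy cluster, which is the drift described above.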

The Code

Link to Colab (you just need a free Google Account to run the code on GPU in the cloud)

What have we learned

In this chapter, we explored the use of our previously developed CNN RecSys flow in the healthcare domain. We observed how we can train a powerful model with minimal changes to our code, showcasing the flexibility of our flow.

Explore the rest of Modern Visual RecSys Series

Series labels:

  • Foundational: general knowledge and theories, minimum coding experience needed.
  • Core: more challenging materials with code.
  • Pro: difficult materials and code, with production-grade tools.

Kai Xin Thia
Analytics Vidhya

Snr Data Scientist at Refinitiv Labs, M.S. CS Georgia Tech. 9+ years in data, found ❤️ in RecSys, NLP, Computer Vision, Applied R&D. linkedin.com/in/thiakx