Published in Analytics Vidhya
Using COVID-19 X-ray images to discover variants of virus mutation

Photo by Fusion Medical Animation on Unsplash

Disclaimer: The methods and techniques explored in this post are meant for educational and sharing purposes. This is not a scientific study, nor will it be published in a journal.

TL;DR: In the end, the project's result is inconclusive: there are too many missing country fields in the dataset, and the sample size is too small. But the project did manage to split the dataset into 3 clusters, which might be just luck, or it might indicate the 3 virus clusters found in other research papers.

Motivation for this project

This project is an extension of my other post on using X-ray images for classification. That project ended with a model "not good enough" to be a stop-gap solution for identifying an asymptomatic person carrying the COVID-19 virus. To explore further, I wondered whether it is possible to use the features learned by the model to identify variants of the virus.

Variants of Viruses

The COVID-19 virus, like other viruses, has a chance of mutating and changing its genome every time it infects a host. Labs around the world are working hard to collect and sequence the virus from patients in order to map out the virus's family tree.

So why is it important to identify genetic changes in a virus?

  • Different variants of the virus can have different levels of contagiousness
  • Certain variants are more severe than others
  • A virus might develop resistance to antiviral drugs or detection

Quoted from the WHO paper “Variant analysis of COVID-19 genomes”:

As more patients are infected as time goes by, concerns are that the virus will accumulate more variants and that a virulent strain with stronger toxicity might emerge. Therefore, it is critical to track and characterize them in terms of variants, patient profiles, geographic locations, symptoms, and treatment responses.

With the above understanding of what this project is about, let's get started.


The following are the steps used to cluster the features learned by our model.

  1. Customise model for feature extraction
  2. Dimension reduction (PCA vs t-SNE)
  3. DBSCAN clustering
  4. Compute score for DBSCAN
  5. t-SNE and DBSCAN Grid Search
  6. Interpret the result

1. Customise model for feature extraction

Extracting features from the model

We need to extract the learned features from our trained model. This can be done by removing the classification layers and accessing the dense layer directly. This may sound complicated, but all we need to do is pop the layers from the model and redirect the model's output.

# Remove the final 2 layers, exposing dense_26
# Reconstruct the model to output features
model = Model(inputs=model.input, outputs=model.get_layer('dense_26').output)
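To make the step concrete, here is a self-contained sketch with a toy model standing in for the trained classifier. The layer name `dense_26` and the 64-unit feature size mirror my model; yours will almost certainly differ, so check `model.summary()` for the right layer name:

```python
import numpy as np
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Flatten, Dense

# Toy stand-in for the trained classifier: a dense feature layer
# (named 'dense_26' here) followed by the classification head we
# want to remove.
inp = Input(shape=(32, 32, 1))
x = Flatten()(inp)
features = Dense(64, activation='relu', name='dense_26')(x)
out = Dense(2, activation='softmax')(features)
model = Model(inputs=inp, outputs=out)

# Redirect the output to the feature layer, skipping the classifier
feature_extractor = Model(inputs=model.input,
                          outputs=model.get_layer('dense_26').output)

# Each image now maps to a 64-dimensional feature vector
X = np.random.rand(5, 32, 32, 1).astype('float32')
feats = feature_extractor.predict(X)
print(feats.shape)  # (5, 64)
```

These 64-dimensional vectors are what we feed into the dimension-reduction step next.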

2. Dimension reduction (PCA vs t-SNE)

Dimension reduction is the process of reducing the number of features to bring down the dimensionality of our dataset. This allows us to better visualise the data or work on downstream tasks. (😂 Try to imagine what a plot of 64-dimensional data looks like.)

Principal Component Analysis (PCA)

PCA cannot handle this

PCA was first introduced in 1933 and is one of the most commonly used methods for dimension reduction because it is fast and simple to use. PCA creates a low-dimensional embedding by preserving the overall variance. But it has a drawback: PCA is a linear projection, meaning it cannot capture non-linear data distributions.

T-distributed Stochastic Neighbor Embedding (t-SNE)

Fast forward to 2008: t-SNE was developed by Laurens van der Maaten and Geoffrey Hinton. Unlike PCA, t-SNE is a non-linear projection; it works by calculating similarity measures between pairwise distances, i.e. local similarities. The author (Laurens van der Maaten) himself said that t-SNE would be useful in areas like climate research, bioinformatics and cancer research, where highly dimensional data could be reduced and then used as input to some other classification model.

The following are the required parameters for t-SNE, with a short explanation of each.

n_components: Dimension of the embedded space. Our output dimension.

perplexity: Controls how to balance attention between local and global aspects of your data. It should usually be less than the number of data points you have; when it is equal to the number of data points, t-SNE will produce unexpected behaviour.

Since neither PCA nor t-SNE is a clustering algorithm, the data may look segmented when plotted, but we still need to cluster the data points. This is where DBSCAN comes into the picture.
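As a rough sketch of the two reduction methods side by side (the 179 × 64 feature matrix is random here, standing in for the features extracted from the model; `perplexity=17` matches the value the search settles on later):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Stand-in for the 179 x 64 feature matrix from the model
rng = np.random.RandomState(42)
features = rng.rand(179, 64)

# PCA: fast linear projection down to 2 dimensions
pca_2d = PCA(n_components=2).fit_transform(features)

# t-SNE: non-linear projection; perplexity must stay well below
# the number of samples (179 here)
tsne_2d = TSNE(n_components=2, perplexity=17,
               random_state=42).fit_transform(features)

print(pca_2d.shape, tsne_2d.shape)  # (179, 2) (179, 2)
```

Either 2-D embedding can then be plotted or handed to a clustering algorithm.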

3. DBSCAN clustering

Most popular clustering algorithms, like K-Means and agglomerative clustering, require us to provide the number of clusters as a parameter. DBSCAN does not require us to guess the number of clusters beforehand. Instead, it uses a distance-based measure (Euclidean distance) and a minimum number of points to determine a dense region.

eps: Specifies how close points should be to be considered part of a cluster. Lower values mean points need to be closer together to form a cluster.

min_samples: Specifies the minimum number of points needed to form a dense region. Higher values require more points to indicate a dense region.
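A minimal illustration of these two parameters, using three synthetic blobs in place of the t-SNE output (the eps and min_samples values here are chosen for the toy data, not taken from the post's search):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Three well-separated blobs in 2-D (stand-in for the t-SNE output)
rng = np.random.RandomState(0)
points = np.vstack([rng.randn(30, 2) * 0.2 + c
                    for c in ([0, 0], [5, 5], [0, 5])])

db = DBSCAN(eps=0.4, min_samples=5).fit(points)
labels = db.labels_  # label -1 marks outliers

# Number of clusters, excluding the outlier label
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(n_clusters)  # 3
```

Note that DBSCAN found the 3 blobs without being told how many clusters to look for.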

As we can immediately see, finding the best parameters for both dimension reduction and clustering is not an easy task. It would be naive to trial-and-error our way through this. We need some sort of scoring to help us determine which parameters best fit our dataset.

Hold on, let me overthink this.

4. Compute score for DBSCAN

This is the result of my overthinking; my attempt is a simple one. What we want is for each cluster to be as balanced as possible. We do not want, for example, two clusters where one has only a single data point and the other has all the rest. We also do not want a lot of outliers, where the algorithm clusters just a few points and leaves the rest as outliers.

The following function provides a score for DBSCAN (lower is better). It is not perfect, but for my project it seems to work fine.

from collections import Counter
import numpy as np

def compute_dbscan_score(db_labels):
    """Attempt to find the best DBSCAN parameters by scoring how
    balanced the clusters are (lower score = more balanced)."""
    # Count the number of points in each cluster
    cluster_dic = dict(Counter(db_labels))

    # Compute the size difference between the largest cluster and
    # each other cluster; the lower the total, the better the clusters
    max_value = np.amax(list(cluster_dic.values()), axis=0)
    score = 0
    for key in cluster_dic.keys():
        if key != -1:  # a regular cluster
            score += max_value - cluster_dic[key]
        else:  # -1 is the outlier class; we want as few outliers as possible
            score += cluster_dic[key]
    return score

With the scoring function done, the rest is easy. How about a brute-force search, anyone? 🤣

5. t-SNE and DBSCAN Grid Search

This part is the easiest to understand: we loop through all possible parameter combinations and find the best score from our function. The following are the parameters we are searching through.

total records: 179 records

min_cluster = 3
max_cluster = 8
eps_range = np.arange(0.1, 2.0, 0.1)
min_samples = np.arange(5, 20, 1)
perplexity = np.arange(1, 180, 1)

It will be around … 19 × 15 × 179 = 51,015 iterations. Time for some ☕.
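The search loop itself can be sketched as below. This is a simplified version: the parameter steps are much coarser than the real search so it runs quickly, the features are random stand-ins, and the balance score is re-implemented inline so the snippet is self-contained:

```python
from collections import Counter
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.manifold import TSNE

# Stand-in for the 179 x 64 extracted-feature matrix
rng = np.random.RandomState(1)
features = rng.rand(179, 64)

def balance_score(labels):
    """Inline version of the balance score: lower = more balanced."""
    counts = dict(Counter(labels))
    biggest = max(counts.values())
    return sum(counts[k] if k == -1 else biggest - counts[k] for k in counts)

results = []
for perplexity in range(5, 180, 45):        # coarser steps than the real search
    embedded = TSNE(n_components=2, perplexity=perplexity,
                    random_state=1).fit_transform(features)
    for eps in np.arange(0.5, 2.0, 0.5):
        for min_samples in range(5, 20, 5):
            labels = DBSCAN(eps=eps, min_samples=min_samples).fit(embedded).labels_
            n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
            if 3 <= n_clusters <= 8:        # keep only min_cluster..max_cluster
                results.append({'eps': float(eps), 'min_samples': min_samples,
                                'n_clusters': n_clusters,
                                'perplexity': perplexity,
                                'score': balance_score(labels)})

results.sort(key=lambda r: r['score'])      # lowest score first
```

With the full parameter grid this is the 51,015-iteration loop; the structure is the same, just with finer ranges.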

Just wait a few moments for our result.
[{'eps': 0.1,
  'min_samples': 5,
  'n_clusters': 4,
  'score': 31,
  'perplexity': 179},
 {'eps': 0.4,
  'min_samples': 19,
  'n_clusters': 4,
  'score': 47,
  'perplexity': 178},
 {'eps': 0.4,
  'min_samples': 14,
  'n_clusters': 3,
  'score': 48,
  'perplexity': 17}]
From these results, we will ignore the first two because their perplexity is too high; recall that a perplexity near our total number of data points causes t-SNE to behave weirdly. So the third result gives our best parameters.
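The filtering step above can be expressed in code using the actual search results. The 0.9 cutoff on perplexity is my own illustrative threshold for "too close to the sample count", not a standard rule:

```python
# The top results from the grid search
results = [
    {'eps': 0.1, 'min_samples': 5, 'n_clusters': 4, 'score': 31, 'perplexity': 179},
    {'eps': 0.4, 'min_samples': 19, 'n_clusters': 4, 'score': 47, 'perplexity': 178},
    {'eps': 0.4, 'min_samples': 14, 'n_clusters': 3, 'score': 48, 'perplexity': 17},
]

total_records = 179
# Discard candidates whose perplexity is too close to the sample count
usable = [r for r in results if r['perplexity'] < 0.9 * total_records]
best = min(usable, key=lambda r: r['score'])
print(best)
# {'eps': 0.4, 'min_samples': 14, 'n_clusters': 3, 'score': 48, 'perplexity': 17}
```

Only the third candidate survives the perplexity filter, so it becomes our best parameter set.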

6. Interpret the result

3 clusters formed from our best parameters

Finally, the plotted result shows 3 clusters. I recalled reading news articles about Type A, B and C variants of the COVID-19 virus. This got me really excited, but we need to further explore each cluster to see whether its cases come from a specific country or region.

Quoted from the above articles:

The team used data from virus genomes sampled from across the world between 24 December 2019 and 4 March 2020. The research revealed three distinct “variants” of COVID-19, consisting of clusters of closely related lineages, which they label ‘A’, ‘B’ and ‘C’.

Type ‘A’, the “original human virus genome” — was present in Wuhan, but surprisingly was not the city’s predominant virus type. Mutated versions of ‘A’ were seen in Americans reported to have lived in Wuhan, and a large number of A-type viruses were found in patients from the US and Australia.

Wuhan’s major virus type, ‘B’, was prevalent in patients from across East Asia. However, the variant didn’t travel much beyond the region without further mutations — implying a “founder event” in Wuhan, or “resistance” against this type of COVID-19 outside East Asia, say researchers.

The ‘C’ variant is the major European type, found in early patients from France, Italy, Sweden and England. It is absent from the study’s Chinese mainland sample, but seen in Singapore, Hong Kong and South Korea.

Labels distribution: {Red: 31, Yellow: 71, Green: 70, outliers: 7}
Green cluster (70)
Yellow cluster (71)
Red cluster (31)

From the looks of it, the data show these countries are spread across the clusters, so we cannot conclude that any of the clusters belongs to a particular variant of the COVID-19 virus. If we had more data and fewer “nan” country fields, we might be able to make some intelligent guesses. For now, we have to wait for the scientific community to provide more research materials and to collect more X-rays for future study.

Thanks for reading until the end. Stay indoors, stay safe and please remember to wash your hands.

Stay indoors even if your waifu comes to visit.


