Detecting social indicators contributing to COVID-19 Pandemic using Machine Learning algorithms

Published in COVID-19 Outbreak

6 min read, Dec 20, 2020

Early December 2019 will go down as one of the most infamous periods in modern history, as it marked the beginning of the deadliest pandemic in almost a century, since the Spanish Flu of 1918. It was in December 2019 that doctors in Wuhan, China detected a strange disease, which would later be recognized as a coronavirus disease [1]. The first case of COVID-19 has since been traced back to November 2019 [2].

Coronavirus as seen under an electron microscope [Source: Wikipedia]

Coronaviruses are a large family of viruses that cause illnesses ranging from the common cold to more severe diseases such as Middle East Respiratory Syndrome (MERS) and Severe Acute Respiratory Syndrome (SARS). Coronavirus disease (COVID-19) is an infectious disease caused by a newly discovered coronavirus that had not previously been identified in humans [1]. At the time of writing, a total of 76,167,173 people have tested positive for COVID-19 and the virus has claimed 1,684,482 lives. The mortality rate largely depends on the socio-economic background of the country, the quality of the medical facilities available there, and the rate of testing. We performed a study using the COVID-19 World Symptom Survey Data [3] to see how big a role public awareness and socio-economic conditions play in the spread of a pandemic.

COVID-19 patients in a hospital in Italy [Source: Sky News]

DATA DIFFICULTY:

COVID-19 is a novel disease, so identifying COVID-19 patients from their symptoms is not easy: no dataset is readily available for making such predictions. We looked at many datasets that contain the total count of COVID-19 patients in different regions and countries [6][7]. We found these datasets useful for visualizing the effect of COVID-19 on a particular country in terms of total deaths, confirmed cases, suspected cases, and current counts. They can also be used to check daily reports of new cases and new deaths for every country. Still, they did not serve our purpose, as they do not include any information about symptoms. We also looked at a dataset from an open-source repository on GitHub [5], which stores data on about 212 patients who showed signs of coronavirus and other viruses. Its text is unstructured; Machine Learning and Natural Language Processing techniques could be used to mine and refine it. This was the only dataset that covers symptoms, but it was very sparse, so we set it aside.

DATASET:

The COVID-19 World Survey Data, open data from UMD and Facebook, is used for identifying potential COVID-19 outbreak hotspots [3].

COVID-19 OUTBREAK HOTSPOTS:


Country and region-level statistics of COVID-19 cases using weighted data

DATA ANALYSIS:

The COVID-19 Symptom Survey was conducted as a collaboration between the University of Maryland and Facebook. The data was made available in the public domain for research purposes, and we used it to make our predictions. People from all over the world took part in the survey, answering several questions about flu (influenza) and COVID-like symptoms. The participants also answered questions about their socio-economic background and their level of awareness when it comes to dealing with the pandemic responsibly. The survey data provides column labels such as country, region, percent_cli (percentage of people reporting COVID-like illness), percent_ili (percentage of people reporting influenza-like illness), many more symptoms related to COVID-19, and a large number of socio-economic and social awareness factors. These columns were then weighted and normalized to adjust for bias, giving a final set of columns with the bias factors taken into account.

From the perspective of a supervised learning task, one of the first things to notice is the absence of a clear target column. The target therefore needs to be generated artificially, using some form of weighted averaging of the symptom parameters. The weights were assigned based on the severity of COVID-19 symptoms as declared by the World Health Organization [4].

Parameter Weights based on severity of symptoms

Now we have successfully generated a target column which can be used for regression analysis and also for classification analysis after binarization.
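The weighted averaging used to build the target can be sketched roughly as below. This is only an illustration: the weight values are placeholders (the actual severity-based weights are not reproduced here), `pct_fever` is an assumed column name, and `make_target` is a hypothetical helper.

```python
import numpy as np

# Placeholder weights: larger values stand for more severe symptoms.
# These numbers are assumptions for illustration, not the weights we used.
SYMPTOM_WEIGHTS = {
    "percent_cli": 3.0,  # COVID-like illness (column from the survey)
    "percent_ili": 2.0,  # influenza-like illness (column from the survey)
    "pct_fever": 1.0,    # assumed column name
}

def make_target(rows, weights):
    """Return one weighted-average symptom score per row (hypothetical helper)."""
    names = list(weights)
    w = np.array([weights[n] for n in names], dtype=float)
    x = np.array([[row[n] for n in names] for row in rows], dtype=float)
    # Weighted average: sum(w_i * x_i) / sum(w_i)
    return x @ w / w.sum()

# Two made-up rows, only to show the call.
rows = [
    {"percent_cli": 1.0, "percent_ili": 2.0, "pct_fever": 3.0},
    {"percent_cli": 4.0, "percent_ili": 1.0, "pct_fever": 0.0},
]
scores = make_target(rows, SYMPTOM_WEIGHTS)
```

Dividing by the sum of the weights keeps the score on the same percentage scale as the inputs, which makes it easier to compare across regions.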

Our feature set consists of: {pct_cmnty_sick, pct_ever_tested, pct_tested_recently, pct_worked_outside_home, pct_grocery_outside_home, pct_ate_outside_home, pct_spent_time_with_non_hh, pct_attended_public_event, pct_used_public_transit, pct_direct_contact_with_non_hh, pct__all_time, pct_wear_mask_most_time, pct_wear_mask_half_time, pct_wear_mask_some_time, pct_wear_mask_none_time, pct_no_public}

DATA VISUALIZATION

Now let us have a look at how some of the features from the feature set vary with our final target score.

Top left: Target Score vs. Percentage of people having fever, Top right: Target Score vs. Community Sickness, Bottom left: Target Score vs. Percentage of people working outside home, Bottom right: Target Score vs. Percentage of people not wearing mask

DIMENSIONALITY REDUCTION AND BINARIZATION

We can binarize the artificially generated target around its mean score and then visualize the class separation using a t-SNE plot:

As we can clearly see, the classes are separable after dimensionality reduction to 2D space. This is a good hint that our models would be able to classify well.
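The binarization and projection described above can be sketched as follows. The data here is synthetic (random features standing in for the 16 survey columns), and the t-SNE settings such as `perplexity=30` are assumptions rather than the values behind our plot.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Synthetic stand-in for the feature matrix and the generated target score.
X = rng.normal(size=(200, 16))
score = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

# Binarize around the mean: 1 if the score is above the mean, else 0.
y = (score > score.mean()).astype(int)

# Project the features to 2D; perplexity must be smaller than the sample count.
tsne = TSNE(n_components=2, perplexity=30, init="pca", random_state=0)
embedding = tsne.fit_transform(X)
```

Plotting `embedding[:, 0]` against `embedding[:, 1]` and coloring each point by `y` gives a class-separation plot of the kind discussed here.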

APPLYING MODELS

We applied several models to this data. Below we discuss the classifiers and regressors that gave us the best results.

1. k-NN Classifier: The nearest neighbor classifier gave us an accuracy of 95.3% on the binarized data. The optimal value of k was found to be k=11.

Mean Error vs. K to find optimal k value for kNN Classifier

2. Random Forest Classifier: Random Forest Classifier gave us an accuracy of 94.33% while making predictions.

3. SVM Classifier: SVM Classifier gave us an accuracy of 94.75% on the test set.

4. Random Forest Regressor: The Random Forest Regressor gave an RMSE of 0.8581 with the optimal parameters n_estimators=200, max_depth=60, max_samples=0.9, max_features=11.
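As a rough sketch of two of the steps above, the snippet below scans k for the k-NN classifier by mean error and fits a Random Forest Regressor with the parameters listed in item 4. It assumes scikit-learn and runs on synthetic data, so the numbers it produces will not match ours; the split ratio and the range of k are also assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in: 16 features, a continuous score, and its binarized label.
X = rng.normal(size=(400, 16))
score = X[:, :4].sum(axis=1) + rng.normal(scale=0.3, size=400)
y = (score > score.mean()).astype(int)

# One held-out split shared by both models (25% is an assumed ratio).
X_tr, X_val, s_tr, s_val, y_tr, y_val = train_test_split(
    X, score, y, test_size=0.25, random_state=0
)

# Scan odd values of k and record the mean classification error for each.
errors = {}
for k in range(1, 26, 2):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    errors[k] = float(np.mean(knn.predict(X_val) != y_val))
best_k = min(errors, key=errors.get)

# Random Forest Regressor with the parameters reported in item 4.
rf = RandomForestRegressor(
    n_estimators=200, max_depth=60, max_samples=0.9, max_features=11,
    random_state=0,
)
rf.fit(X_tr, s_tr)
rmse = float(np.sqrt(mean_squared_error(s_val, rf.predict(X_val))))
```

Plotting `errors` against k gives a Mean Error vs. K curve; choosing k on a validation split, separate from the final test set, avoids an overly optimistic accuracy estimate.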

CONCLUSION

With this undertaking we have demonstrated the impact of social trends on an ongoing pandemic. We have been able to show that it is possible to predict the extent of the pandemic and its severity from the social awareness of the public. Given below are the models which predicted most accurately how the social factors affect the pandemic:

#MachineLearning2020

CONTRIBUTIONS:

Tapadeep Chakraborty, MTech CSE (AI), IIITD (LinkedIn): Coding and fine-tuning.
Anjali Singh, MTech CSE (DE), IIITD (LinkedIn): Literature Survey, Data Collection and Data Preprocessing.
Prakriti Gupta, MTech CSE (AI), IIITD (LinkedIn): Coding and Data Visualization.

Under the guidance of:

1. Course Instructor: Dr. Tanmoy Chakraborty (LinkedIn, IIITD Faculty Profile, Twitter: @Tanmoy_Chak, Facebook)

2. Teaching Fellow: Ishita Bajaj

3. Teaching Assistants: Shiv Kumar Gehlot, Chhavi Jain, Nirav Diwan, Pragya Srivastava, Shikha Singh, Vivek Reddy

REFERENCES:

[1] Harapan Harapan, Naoya Itoh, Amanda Yufika, Wira Winardi, Synat Keam, Haypheng Te, Dewi Megawati, Zinatul Hayati, Abram L. Wagner, Mudatsir Mudatsir, Coronavirus disease 2019 (COVID-19): A literature review, Journal of Infection and Public Health, Volume 13, Issue 5, 2020, Pages 667–673, ISSN 1876–0341.
[2] https://www.livescience.com/first-case-coronavirus-found.html
[3] https://covidmap.umd.edu/
[4] https://www.who.int/health-topics/coronavirus#tab=tab_3
[5] https://github.com/Akibkhanday/Meta-data-of-Coronavirus
[6] https://github.com/CSSEGISandData/COVID-19
[7] https://www.kaggle.com/de5d5fe61fcaa6ad7a66/coronavirus-dataset-update-0206