CLASSIFICATION OF COUNTRIES BASED ON A COVID-19 RISK FACTOR

It has been almost six months since the first reported case of coronavirus, and national governments around the world have taken several different approaches to tackling the virus. In this article, we shall try to predict whether a country is a low-, medium-, or high-risk zone based on various factors, and decide which of those factors are the best predictors.

The factors considered for a country's risk are its population density, its economic strength, its median age, the share of elderly people in its population, etc. We will determine which risk zone a country belongs to when only these factors are provided and the death rate is not known.

In this article, we will use the dataset for Asia to train a model, as the continent has very diverse countries when it comes to population, wealth, and the spread of coronavirus. Later, we will test the model on another dataset.

Commonly, the death rate of a country is calculated as the annual number of deaths divided by the population of the country. In this analysis, we shall consider only the deaths due to COVID-19. That is, we will calculate the death rate as the ratio of the number of COVID-19 deaths to the number of COVID-19 cases, expressed as a percentage.
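As a minimal sketch of this calculation in Python (the function name and the example counts are hypothetical and not taken from the datasets), the rate can be computed as a percentage of cases:

```python
def covid_death_rate(deaths: int, cases: int) -> float:
    """Return COVID-19 deaths as a percentage of confirmed COVID-19 cases."""
    if cases <= 0:
        raise ValueError("cases must be a positive number")
    if deaths < 0:
        raise ValueError("deaths cannot be negative")
    return 100.0 * deaths / cases


# Hypothetical counts, for illustration only.
rate = covid_death_rate(deaths=250, cases=10_000)
print(f"{rate:.1f}%")
```

Note that this is a ratio of deaths to cases, not to the whole population, so it is not comparable with the annual death rate mentioned above.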

If the death rate is below 1%, the country is considered low risk. A death rate between 1% and 3% puts the country at medium risk, and one between 3% and 7% makes it high risk. Anything above 7% is considered very high risk. The risk is classified this way because these death rates come on top of the countries' annual death rates. Also, most countries have a coronavirus death rate of less than 5%.
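The thresholds above can be sketched as a small Python function. The function name is hypothetical, and the handling of values that fall exactly on a boundary (1 and 3 go to the higher zone, 7 stays in high) is an assumption, since the text does not specify it:

```python
def classify_risk(death_rate_pct: float) -> str:
    """Map a COVID-19 death rate (in percent) to one of four risk zones."""
    if death_rate_pct < 0:
        raise ValueError("death rate cannot be negative")
    if death_rate_pct < 1:
        return "low"
    if death_rate_pct < 3:   # assumption: exactly 1 counts as medium
        return "medium"
    if death_rate_pct <= 7:  # assumption: exactly 3 and exactly 7 count as high
        return "high"
    return "very high"


# Hypothetical death rates, for illustration only.
for rate in (0.4, 2.0, 5.5, 9.1):
    print(rate, classify_risk(rate))
```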

Let us take a look at how the risk factor behaves as a function of the other factors. The following graphs were generated using RStudio.

1. GDP per capita vs. median age for Asian countries
2. GDP per capita vs. percentage of population aged 65 and above for Asian countries
3. GDP per capita vs. population density for Asian countries

From the above plots, it can be seen that the risk is low for countries that have a high GDP per capita. Of the three plots, the second separates the risk zones more clearly than the other two, so let us use it for the model. If we model the risk factor as a function of GDP per capita and the percentage of the population aged 65 and above, the predictions are as follows.

This matrix shows that the high-risk and low-risk regions were predicted correctly, but 7 of the medium-risk regions were predicted as high risk, 6 as low risk, and 1 as very high risk. Only 12 medium-risk zones were predicted correctly. We see that the very high-risk row is all 0. This is because only one country falls under this category, which is not enough data for the model. This can be seen in the previous plots, where there is only one very high-risk region (purple point).

The error rate for this model is 0.4. That is, only 60% of the model's predictions are correct.
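For reference, an error rate like this is one minus the accuracy, which is the sum of the diagonal of the confusion matrix divided by the total number of predictions. The sketch below uses a hypothetical 3x3 matrix, not the matrix from this analysis, and assumes that rows are actual zones and columns are predicted zones:

```python
def accuracy(confusion: list[list[int]]) -> float:
    """Fraction of correct predictions: diagonal sum over the total count."""
    total = sum(sum(row) for row in confusion)
    if total == 0:
        raise ValueError("confusion matrix is empty")
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return correct / total


# Hypothetical counts, for illustration only (rows: actual, columns: predicted).
toy_matrix = [
    [8, 2, 0],
    [3, 4, 3],
    [0, 2, 8],
]
acc = accuracy(toy_matrix)
print(f"accuracy = {acc:.2f}, error rate = {1 - acc:.2f}")
```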

Let us consider another model where the risk factor depends on GDP per capita, the percentage of the population aged 65 and above, and the population density of the country. Additionally, the risk zones are reduced to 3: as there is only one very high-risk country, it is included in the high-risk zone. The confusion matrix, which illustrates the correctness of the prediction model, is shown below.
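To make the relabelling concrete, here is a rough Python sketch of folding the very high-risk zone into the high-risk zone and then tabulating a confusion matrix. The helper names and the label lists are hypothetical, and the rows-as-actual, columns-as-predicted layout is an assumption:

```python
from collections import Counter

ZONES = ["low", "medium", "high"]


def merge_very_high(zone: str) -> str:
    """Fold the single 'very high' zone into 'high'."""
    return "high" if zone == "very high" else zone


def confusion_matrix(actual, predicted, labels=ZONES):
    """Count (actual, predicted) pairs; rows are actual, columns predicted."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in labels] for a in labels]


# Hypothetical labels, for illustration only.
actual = [merge_very_high(z)
          for z in ["low", "medium", "very high", "high", "medium"]]
predicted = ["low", "high", "high", "high", "medium"]

matrix = confusion_matrix(actual, predicted)
for zone, row in zip(ZONES, matrix):
    print(f"{zone:>6}: {row}")
```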

The reliability of this model is around 63%. Of all the models that were tried, this was the most reliable model.

The above-mentioned model was tested on a different dataset with the same variables. The error was much larger than when the model was used for prediction on the dataset used to train it. The test dataset consists of data from all countries, as opposed to just Asia. The confusion matrix obtained is shown below.

We see that the predictions for high-risk and low-risk regions are comparatively better than those for medium-risk regions. On this new dataset, the model predicts correctly only about 54% of the time. Therefore, this model is not ideal for use in practical situations.
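One way to quantify a per-zone comparison like this is the recall of each zone: the share of countries actually in a zone that the model placed in that zone. The sketch below uses a hypothetical matrix, not the one from this analysis, and assumes rows are actual zones and columns are predicted zones:

```python
def per_class_recall(confusion, labels):
    """For each zone, the fraction of its actual members predicted correctly."""
    recall = {}
    for i, label in enumerate(labels):
        row_total = sum(confusion[i])
        # A zone with no actual members gets 0.0 rather than a division error.
        recall[label] = confusion[i][i] / row_total if row_total else 0.0
    return recall


# Hypothetical counts, for illustration only (rows: actual, columns: predicted).
toy_matrix = [
    [8, 2, 0],
    [3, 4, 3],
    [0, 2, 8],
]
recall = per_class_recall(toy_matrix, ["low", "medium", "high"])
for zone, value in recall.items():
    print(f"{zone:>6}: {value:.2f}")
```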

To understand why this model performs so badly on the new dataset, let us take a look at the dataset itself. On plotting the GDP vs age 65 and older for the test dataset, the following image is obtained.

4. GDP per capita vs. percentage of population aged 65 and above

From this plot, we see clusters of low-risk and high-risk zones on the right and the top of the chart respectively. However, medium-risk zones are not distinguishable from the other two zones, which explains why most of the medium-risk zones are predicted as high- or low-risk zones.

This classification model may be improved by changing the death rate limits for each zone. Another way to improve the model is to use only 2 risk zones instead of 3 or 4. Let us see what happens to the same model when there are only 2 risk classifications: low and high. In this case, any death rate above 1% is considered high.
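The two-zone labelling can be sketched in Python as follows. The function name is hypothetical, and a rate of exactly 1% is treated as high here so that "low" still means a rate below 1%; that boundary choice is an assumption:

```python
def classify_binary(death_rate_pct: float, threshold_pct: float = 1.0) -> str:
    """Two-zone labelling: 'low' below the threshold, 'high' otherwise."""
    if death_rate_pct < 0:
        raise ValueError("death rate cannot be negative")
    # Assumption: a rate exactly equal to the threshold is labelled high.
    return "high" if death_rate_pct >= threshold_pct else "low"


# Hypothetical death rates, for illustration only.
for rate in (0.4, 2.0, 9.1):
    print(rate, classify_binary(rate))
```

Making the threshold a parameter also allows experimenting with different death rate limits, the other improvement suggested above.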

Using the same prediction model as before, we obtain the following confusion matrix for the training dataset (Asia only).

The model is therefore 79% reliable for the new method of classification. When used on the test data set (all countries), we get the following matrix.

In this case, the correctness of the model is 87%, which means that this model works better when the dataset is classified into 2 zones. The scatter plot of GDP per capita vs. percentage of population aged 65 and above for this type of classification is shown below.

5. GDP per capita vs. percentage of population above 65 years of age

We see that a large portion of the plot is high risk because of the limited classification. So even though the model works well, such a classification might be of little practical use.

Links to datasets used: