Associative analysis and COVID-19 symptoms

Praga SV
Published in Analytics Vidhya
May 16, 2020

Introduction

Associative analysis, also known as market basket analysis, is a key technique for uncovering associations between items, and was initially used by large supermarkets and retailers. It analyses the combinations of items that occur together and looks at how frequently these combinations appear in transactions, which helps to understand the relationships between the items that people buy. The applications are many: placing products on aisles, recommending items on e-commerce websites, and recommending songs on Spotify.

With the current COVID-19 outbreak, many data sets have been made public for researchers to use. I came across one such data set published by Wolfram [1]. It contains some details about the symptoms that the patients were having, and I decided to dig a bit deeper into these symptoms.

This article discusses the insights drawn from the data, as well as the approach used to obtain them.

Approach

This data set contains 13,179 patient records altogether, but the majority of the columns are sparse. Since we are focusing only on the symptoms of COVID-19, symptom data is available for only 1,631 patients out of the entire data set. One might argue that this is not much information, but let's be optimistic, shall we?

After loading the data, it needs to be cleaned and transformed into a suitable format. The image below shows the format of the symptom data.

Next, the symptoms need to be extracted with the help of a regex library. The code segment below would be helpful for the extraction.
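A minimal sketch of such an extraction, using Python's built-in re module. It assumes that each raw cell holds a Wolfram-style list of entities, which is a hypothetical format chosen for illustration; the names SYMPTOM_PATTERN and extract_symptoms are also illustrative, and the pattern would need adjusting to the real column layout.

```python
import re

# ASSUMPTION: each raw cell looks like
#   {Entity["MedicalSymptom", "Fever"], Entity["MedicalSymptom", "Cough"]}
# This format is hypothetical; adapt the pattern to the actual data.
SYMPTOM_PATTERN = re.compile(r'Entity\["[^"]*",\s*"([^"]+)"\]')


def extract_symptoms(raw_cell):
    """Return the lower-cased symptom names found in one raw cell."""
    if not isinstance(raw_cell, str):
        return []  # missing values (None, NaN) carry no symptoms
    return [name.strip().lower() for name in SYMPTOM_PATTERN.findall(raw_cell)]


raw = '{Entity["MedicalSymptom", "Fever"], Entity["MedicalSymptom", "Cough"]}'
print(extract_symptoms(raw))  # ['fever', 'cough']
```

Lower-casing the names is a design choice made here so that "Fever" and "fever" end up as the same symptom.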

The next step towards association analysis is to one-hot encode the extracted data. This particular data set has 95 unique symptoms altogether. The image below lists all 95 unique symptoms.

One might prefer a library for this encoding, but I preferred to write the code from scratch.
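One way such a from-scratch encoding could look is sketched here. It is not the original code: the function name one_hot_encode and the toy patient lists are assumptions made for illustration, and only the standard library is used.

```python
def one_hot_encode(transactions):
    """Turn lists of symptoms into rows of True/False flags, one per symptom."""
    # Sorted vocabulary so that the column order is reproducible.
    columns = sorted({symptom for row in transactions for symptom in row})
    rows = []
    for row in transactions:
        present = set(row)
        rows.append({col: col in present for col in columns})
    return columns, rows


# Hypothetical toy data: one list of symptoms per patient.
patients = [["fever", "cough"], ["fever"], ["cough", "fatigue"]]
columns, rows = one_hot_encode(patients)
print(columns)  # ['cough', 'fatigue', 'fever']
print(rows[0])  # {'cough': True, 'fatigue': False, 'fever': True}
```

The resulting list of dictionaries can be passed directly to pandas.DataFrame(rows) to obtain a boolean table with one column per symptom.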

Now our data is prepared for association analysis.

Methodology and evaluation

There are many algorithms that can be used for association analysis. I applied the Apriori algorithm in this particular case. Other algorithms such as Eclat, FP-growth, ASSOC and OPUS search can also be used.

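The Apriori algorithm uses a breadth-first search strategy to count the support of itemsets. It uses a candidate generation function which exploits the downward closure property of support: every subset of a frequent itemset must itself be frequent, so any candidate that contains an infrequent subset can be discarded without counting it.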
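To make the level-wise idea concrete, here is a simplified sketch of Apriori in plain Python. It is an illustration only, not the mlxtend implementation; the function name apriori_sketch and the toy data are assumptions.

```python
from itertools import combinations


def apriori_sketch(transactions, min_support):
    """Level-wise (breadth-first) search for frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        # Fraction of transactions that contain every item of the itemset.
        return sum(itemset <= t for t in transactions) / n

    def keep_frequent(candidates):
        scored = {c: support(c) for c in candidates}
        return {c: s for c, s in scored.items() if s >= min_support}

    # Level 1: frequent single items.
    items = {item for t in transactions for item in t}
    current = keep_frequent({frozenset([i]) for i in items})
    frequent = {}
    k = 1
    while current:
        frequent.update(current)
        k += 1
        # Candidate generation: join frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Downward closure: drop candidates that have an infrequent (k-1)-subset.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        current = keep_frequent(candidates)
    return frequent


# Hypothetical toy data: one list of symptoms per patient.
patients = [["fever", "cough"], ["fever"], ["cough", "fever"], ["fatigue"]]
result = apriori_sketch(patients, min_support=0.5)
for itemset, supp in sorted(result.items(), key=lambda kv: -kv[1]):
    print(sorted(itemset), supp)
```

With a minimum support of 0.5, "fatigue" is dropped at the first level, so no pair containing it is ever counted. This is the saving that downward closure provides.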

I used the mlxtend library for the apriori algorithm.

The minimum support was set at 0.005, as the data set was relatively small. Overall, 132 combinations were found by the algorithm.

When it comes to evaluation, the following metrics are used:

· Support

This measure gives an idea of how frequently an itemset occurs across all the transactions.

The value of support helps to identify the rules worth considering for further analysis. For example, one might want to consider only the itemsets which occur at least 50 times out of a total of 100,000 transactions, i.e. support = 0.0005. If an itemset has a very low support, we do not have enough information on the relationship between its items, and hence no conclusions can be drawn from such a rule.

· Confidence

This measure defines the likelihood that the consequent occurs in the cart, given that the cart already contains the antecedents.

· Lift

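Lift controls for the support (frequency) of the consequent while calculating the conditional probability of occurrence of {Y} given {X}. A lift greater than 1 means that {X} and {Y} occur together more often than would be expected if they were independent.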
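The three measures above can be sketched in a few lines of Python. The function name rule_metrics and the toy data are assumptions made for illustration; the formulas are the standard definitions, with confidence as support(X and Y) / support(X) and lift as confidence / support(Y).

```python
def rule_metrics(transactions, antecedent, consequent):
    """Support, confidence and lift of the rule antecedent -> consequent."""
    transactions = [set(t) for t in transactions]
    x, y = set(antecedent), set(consequent)
    n = len(transactions)

    def support(itemset):
        # Fraction of transactions that contain every item of the itemset.
        return sum(itemset <= t for t in transactions) / n

    # ASSUMPTION: both sides of the rule occur at least once,
    # otherwise the divisions below would fail.
    support_xy = support(x | y)
    confidence = support_xy / support(x)
    lift = confidence / support(y)
    return {"support": support_xy, "confidence": confidence, "lift": lift}


# Hypothetical toy data: one list of symptoms per patient.
patients = [["fever", "cough"], ["fever"], ["cough", "fever"], ["fatigue"]]
metrics = rule_metrics(patients, ["fever"], ["cough"])
print(metrics)
```

For the toy rule fever -> cough, the support is 0.5, the confidence is about 0.67 and the lift is about 1.33, so in this made-up sample a cough is somewhat more likely among patients with a fever than among patients overall.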

Other metrics such as conviction and leverage can also be used for diagnostic purposes.

Results

The image below shows some of the itemsets with the highest support found by the apriori algorithm.

The above image shows the rules developed by the apriori algorithm, together with the evaluation metrics.

As we can see from the far right of the image, the lift values are not very good, but this could likely be improved with more data and by increasing the minimum support value for the apriori algorithm.

Application towards COVID-19

With these association rules, we are able to identify the symptoms and their development phase. This would be very beneficial in countries where medical resources are scarce. It could also be used to help identify the severity of a patient's condition.

References

1. https://www.wolfram.com/covid-19-resources/
