Determining Physicians’ Drug Combination Preferences for Regions in Africa Using the Apriori Algorithm: The mPharma Experience

Benjamin Umeh
mPharma Product & Tech Blog
Jun 29, 2020

https://mpharma.com/

Understanding the prevalence of different drug combinations for treating specific ailments gives a clearer picture of how drugs are prescribed and consumed in a healthcare ecosystem. Here in the Analytics Squad at mPharma, we use anonymous sales data collected through Bloom, our proprietary point-of-sale software, to surface prevalent drug combinations. We hope to use analyses like this in the future to better support our mission of providing affordable and accessible healthcare in Africa.

The goal here is to show how the Apriori unsupervised machine learning algorithm can be applied to a drug sales dataset to derive the preferred drug combinations for treating a particular disease in any given geographical area.

Why Care About Drug Combination

Drug combination typically refers to the use of multiple drugs in the treatment of an ailment to increase the likelihood of treating it successfully.

Knowing the prevalent drug combination in an area is useful both now and in the future. For instance, we need to know the prevalent drug combination before we can determine whether it is the best option for patients’ pockets, especially if a less expensive alternative would achieve equal or better treatment outcomes.

Physicians’ Prescription Preferences

While the prevalent drug combination derived from the approach laid out here could reflect physicians’ preferences in a given region, this analysis alone cannot establish that.

Furthermore, self-medication is prevalent in most African countries: reported rates are about 70% in Ghana, between 24% and 91% in Nigeria, 58.2% in Kenya, and 87.5% in Zambia. This tends to limit the extent to which the computed prevalent drug combination reflects physician preferences.

Nevertheless, all hope is not lost: a large majority of self-medication, especially in Africa, is “facilitated” by a chemist or pharmacist. In facilitating it, pharmacists make informed suggestions regarding treatment, albeit not as informed as may be desired.

The Apriori Algorithm

Apriori is an algorithm for frequent itemset mining and association rule learning over relational databases¹. It was among the first algorithms proposed for frequent itemset mining², and was introduced by R. Agrawal and R. Srikant in 1994³.

The algorithm uses frequent itemsets to generate association rules. It is based on the principle that every subset of a frequent itemset must itself be frequent.

In short, Apriori is an unsupervised machine learning algorithm used to gain insight into the relationships between the items in a dataset.
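To make the subset principle concrete, here is a minimal pure-Python sketch of the level-wise search. The function name apriori_itemsets and the toy transactions are illustrative assumptions, not the code used in this analysis; support is taken to be the fraction of transactions that contain the itemset.

```python
from itertools import combinations

def apriori_itemsets(transactions, min_support):
    """Return {itemset: support} for every itemset meeting min_support."""
    baskets = [frozenset(t) for t in transactions]
    n = len(baskets)

    def support(itemset):
        # Fraction of baskets that contain every item in the itemset.
        return sum(itemset <= b for b in baskets) / n

    def keep_frequent(candidates):
        scored = {c: support(c) for c in candidates}
        return {c: s for c, s in scored.items() if s >= min_support}

    # Level 1: frequent single items.
    items = {item for b in baskets for item in b}
    current = keep_frequent({frozenset([i]) for i in items})
    frequent = dict(current)

    k = 2
    while current:
        prev = set(current)
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: drop candidates that have an infrequent (k-1)-subset.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        current = keep_frequent(candidates)
        frequent.update(current)
        k += 1
    return frequent

# Toy transactions (hypothetical item labels).
transactions = [
    ["A", "B", "C"],
    ["A", "B"],
    ["A", "C"],
    ["B", "C"],
    ["A", "B", "C"],
]
result = apriori_itemsets(transactions, min_support=0.5)
for itemset, s in sorted(result.items(), key=lambda kv: -kv[1]):
    print(sorted(itemset), round(s, 2))
```

The prune step is where the principle pays off: a candidate is only counted against the data if all of its smaller subsets already passed the support threshold.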

Extracting The Prevalent Drug Combinations: A Python Implementation of the Apriori Algorithm

Our objective in this Apriori implementation is to find sets of drugs that occur frequently together in the dataset.

The Dataset and Features List

The dataset used in this post is anonymised drug sales data.

In all, there are 6 features (columns) in the dataset, a mix of categorical and numerical variables. Below is the list of features and what they stand for:

  • transaction_id: Unique identifier for sale transaction. Each sale transaction can consist of more than one item
  • product_id: Unique identifier for each product
  • drug name: The name of the product. These names may include some non-drug products which will be cleaned out during the data preprocessing stage
  • quantity: The quantity of each drug that is purchased in a given transaction
  • ailment_indication: The disease/ailment category that is assigned to the sale
  • country: The country where the sale was made
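For readers who want to follow along without the real data, a tiny stand-in table with the same six columns might look like the sketch below. Every value is invented, and the column name drug_name (with an underscore) is an assumption about how the drug name column is stored.

```python
import pandas as pd

# Hypothetical rows that mimic the schema described above.
sales = pd.DataFrame({
    "transaction_id": [1001, 1001, 1002, 1003],
    "product_id": [11, 12, 11, 13],
    "drug_name": ["DRUG A ", "DRUG B", "DRUG A ", "HAND SANITIZER"],
    "quantity": [1, 2, 1, 1],
    "ailment_indication": ["Malaria", None, "Malaria", "Consumables"],
    "country": ["Ghana", "Ghana", "Ghana", "Nigeria"],
})
print(sales.dtypes)
```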

Exploratory Data Analysis

Each row in the dataset represents the sales data for one item sold. The exploratory analysis shows the following:

  1. There are 253,984 observations in all
  2. About 31.5% of the ailment_indication values are missing. We will determine which drugs have a null ailment_indication when handling missing values
  3. There seem to be about 19 duplicate rows in the dataset.

We will handle these and other issues next.
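The three checks listed above (row count, share of missing values, duplicate count) can be sketched with pandas as follows. The toy frame is invented and the drug_name column name is an assumption; on the real data the same calls would produce the figures quoted above.

```python
import pandas as pd

# Hypothetical data: two missing indications and one exact duplicate row.
sales = pd.DataFrame({
    "transaction_id": [1, 1, 2, 3, 1],
    "drug_name": ["DRUG A", "DRUG B", "DRUG C", "DRUG B", "DRUG A"],
    "ailment_indication": ["Malaria", None, "Pain", None, "Malaria"],
})

n_rows = len(sales)                                      # number of observations
missing_share = sales["ailment_indication"].isna().mean()  # fraction missing
n_duplicates = int(sales.duplicated().sum())             # rows repeating an earlier row

print(n_rows, f"{missing_share:.1%}", n_duplicates)
```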

Data Pre-processing

Handling Duplicate Rows in Dataset

The indices of the supposed duplicate rows are held in the pandas Series dub_full. Let us pull these rows from the main data and investigate them further.

A cursory look at the supposed duplicate rows shows that, even though they share values for many of the features, they are not exact duplicates. So we do not need to drop any rows from the dataset.
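One way to tell apparent duplicates from exact ones is to compare a check on a subset of columns with a check on all columns. This is only a sketch on invented data, with assumed column names; it is not the code that produced dub_full.

```python
import pandas as pd

# Hypothetical rows: the first two match on id and drug but differ in quantity.
sales = pd.DataFrame({
    "transaction_id": [1, 1, 2],
    "drug_name": ["DRUG A", "DRUG A", "DRUG B"],
    "quantity": [1, 2, 1],
})

# Rows that agree on some columns only (candidate duplicates).
candidates = sales[sales.duplicated(subset=["transaction_id", "drug_name"], keep=False)]

# Rows that agree on every column (exact duplicates).
exact = sales[sales.duplicated(keep=False)]

print(len(candidates), len(exact))
```

Passing keep=False marks every member of a duplicate group, which makes it easier to inspect the rows side by side.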

Handling Missing Values

Let us examine the observations with missing values more closely to determine the best way to handle them.

To handle the missing values, we first split the dataset into two DataFrames: the rows with a missing ailment_indication and the rows without. We then check whether any of the drugs in the missing-value DataFrame have an ailment_indication assigned elsewhere, in the complete rows.

Finally, we fill in each missing ailment_indication with the value assigned to the same drug in the complete rows.

Define a function that finds, for the drugs in the missing-value DataFrame, the ailment_indications assigned to them in the rest of the dataset.
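A sketch of such a lookup function is shown below. The name known_indications, the drug_name column name, and the choice to take the most common indication when a drug has several are all assumptions made for illustration.

```python
import pandas as pd

def known_indications(missing, complete):
    """Map drug_name -> most common ailment_indication found in `complete`,
    limited to drugs that also appear in `missing`."""
    relevant = complete[complete["drug_name"].isin(missing["drug_name"])]
    return (
        relevant.groupby("drug_name")["ailment_indication"]
        .agg(lambda s: s.mode().iloc[0])  # assumption: the majority label wins
    )

# Hypothetical data.
complete = pd.DataFrame({
    "drug_name": ["DRUG A", "DRUG A", "DRUG A", "DRUG C"],
    "ailment_indication": ["Malaria", "Malaria", "Pain", "Pain"],
})
missing = pd.DataFrame({
    "drug_name": ["DRUG A", "DRUG B"],
    "ailment_indication": [None, None],
})

lookup = known_indications(missing, complete)
print(lookup.to_dict())
```

DRUG B has no labelled rows, so it is absent from the lookup and would stay missing.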

Next, we replace the missing ailment_indication values.

Then we rejoin the DataFrame with the now replaced missing ailment_indication to the rest of the dataset.

Finally, we check whether any rows still have a missing ailment_indication after the replacement.
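The replace, rejoin, and check steps can be sketched together as follows. The data is invented, the drug_name column name is assumed, and the lookup here simply takes the first known indication per drug.

```python
import pandas as pd

# Hypothetical data: DRUG A is labelled elsewhere, DRUG B never is.
sales = pd.DataFrame({
    "drug_name": ["DRUG A", "DRUG A", "DRUG B", "DRUG C"],
    "ailment_indication": ["Malaria", None, None, "Pain"],
})

# Split into rows with and without a missing indication.
is_missing = sales["ailment_indication"].isna()
missing, complete = sales[is_missing].copy(), sales[~is_missing]

# Build drug -> indication from the complete rows, then fill the gaps.
lookup = complete.drop_duplicates("drug_name").set_index("drug_name")["ailment_indication"]
missing["ailment_indication"] = missing["drug_name"].map(lookup)

# Rejoin and count what is still missing.
filled = pd.concat([complete, missing]).sort_index()
still_missing = int(filled["ailment_indication"].isna().sum())
print(still_missing)
```

Rows for drugs that are never labelled anywhere in the data remain missing after this step and need a separate decision.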

Clean up the drug name column

Let us clean up the dataset to make sure that any trailing whitespace in the drug names is removed.
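A minimal sketch of this cleanup, assuming the column is called drug_name. Note that str.strip removes leading as well as trailing whitespace; str.rstrip would remove trailing whitespace only.

```python
import pandas as pd

# Hypothetical names with stray whitespace.
sales = pd.DataFrame({"drug_name": ["DRUG A ", " DRUG B", "DRUG A"]})

sales["drug_name"] = sales["drug_name"].str.strip()
print(sales["drug_name"].tolist())
```

Without this step, "DRUG A " and "DRUG A" would be counted as two different items.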

Convert the transaction_id to strings

We need to convert the transaction_id column to the string datatype so the model will not treat it as a numerical value.
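In pandas this conversion is a single astype call, sketched here on invented identifiers:

```python
import pandas as pd

# Hypothetical integer identifiers.
sales = pd.DataFrame({"transaction_id": [1001, 1002]})

sales["transaction_id"] = sales["transaction_id"].astype(str)
print(sales["transaction_id"].tolist())
```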

Keep only transactions with two or more unique drugs

Since our concern is determining the most prevalent drug combinations, transactions containing just a single drug will not be useful to us. So we will need to filter out transactions with just one drug from the data set.
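One way to apply this filter is to count the unique drugs per transaction and keep only rows from transactions with at least two. The sketch below uses invented data and assumes the column names transaction_id and drug_name.

```python
import pandas as pd

# Hypothetical data: only transaction "1" has two different drugs.
sales = pd.DataFrame({
    "transaction_id": ["1", "1", "2", "3", "3"],
    "drug_name": ["DRUG A", "DRUG B", "DRUG A", "DRUG C", "DRUG C"],
})

# Number of unique drugs in each row's transaction.
drugs_per_txn = sales.groupby("transaction_id")["drug_name"].transform("nunique")

multi_drug = sales[drugs_per_txn >= 2]
print(multi_drug)
```

Using nunique rather than a plain row count means a transaction that lists the same drug twice, like transaction "3" here, is still treated as a single-drug sale.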

Remove observations with irrelevant ailment_indication type

A preview of the dataset shows that some observations have ailment_indications that are not relevant to our analysis. Two such ailment_indications are “Consumables” and “COSMETICS”.

We will therefore drop all rows with “Consumables” or “COSMETICS” as their ailment_indication to improve the quality of our result. Although some of these may be medical products, they are not actual drugs.
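A sketch of this filter on invented rows, with assumed column names:

```python
import pandas as pd

# Hypothetical rows, two of which are not drugs.
sales = pd.DataFrame({
    "drug_name": ["DRUG A", "HAND SANITIZER", "FACE CREAM", "DRUG B"],
    "ailment_indication": ["Malaria", "Consumables", "COSMETICS", "Pain"],
})

irrelevant = ["Consumables", "COSMETICS"]
drugs_only = sales[~sales["ailment_indication"].isin(irrelevant)]
print(drugs_only)
```

The match is case-sensitive, so the labels must be written exactly as they appear in the data.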

Before the dataset is transformed further, let’s get the list of drug names with their corresponding ailment_indications, to be used in a later analysis.
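Such a list can be built by keeping one row per distinct (drug, indication) pair. This is a sketch on invented data with assumed column names:

```python
import pandas as pd

# Hypothetical sales rows with a repeated drug.
sales = pd.DataFrame({
    "drug_name": ["DRUG A", "DRUG A", "DRUG B"],
    "ailment_indication": ["Malaria", "Malaria", "Pain"],
})

drug_ailments = (
    sales[["drug_name", "ailment_indication"]]
    .drop_duplicates()
    .reset_index(drop=True)
)
print(drug_ailments)
```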

Data Transformation

To work with the Apriori model we need to transform the dataset by consolidating the items into 1 transaction per row. We will create two functions to achieve this purpose.
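One common way to do this consolidation is sketched below with two small helper functions. Their names (to_basket, encode), the column names, and the data are assumptions for illustration; the functions used in this analysis may differ.

```python
import pandas as pd

def to_basket(df):
    """One row per transaction, one column per drug, holding summed quantities."""
    return (
        df.groupby(["transaction_id", "drug_name"])["quantity"]
        .sum()
        .unstack(fill_value=0)
    )

def encode(basket):
    """True where the drug appears in the transaction, False otherwise."""
    return basket > 0

# Hypothetical line items; DRUG A appears twice in transaction "2".
sales = pd.DataFrame({
    "transaction_id": ["1", "1", "2", "2", "2"],
    "drug_name": ["DRUG A", "DRUG B", "DRUG A", "DRUG C", "DRUG A"],
    "quantity": [1, 2, 1, 3, 2],
})

basket = to_basket(sales)
encoded = encode(basket)
print(encoded)
```

The boolean table is the shape that typical Apriori implementations expect: each row is a transaction and each column says whether an item was present.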

Create a function that generates the frequent itemsets for any given ailment and any given country covered by the data.
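A self-contained sketch of such a function follows. It rests on several assumptions: the column names, the rule that a transaction belongs to an ailment if at least one of its items carries that ailment_indication, and the use of plain enumeration up to max_len items in place of Apriori's pruned search, to keep the example short. A library Apriori implementation would normally replace the counting loop.

```python
from collections import Counter
from itertools import combinations

import pandas as pd

def frequent_itemsets(df, ailment, country, min_support=0.5, max_len=3):
    """Itemsets (tuples of drug names) and their support for one ailment and country."""
    in_country = df[df["country"] == country]
    # Assumption: a transaction counts if any of its items has the ailment.
    has_ailment = in_country["ailment_indication"] == ailment
    txn_ids = in_country.loc[has_ailment, "transaction_id"].unique()
    subset = in_country[in_country["transaction_id"].isin(txn_ids)]

    baskets = [frozenset(g) for _, g in subset.groupby("transaction_id")["drug_name"]]
    counts = Counter()
    for basket in baskets:
        for k in range(1, max_len + 1):
            counts.update(combinations(sorted(basket), k))

    rows = [
        (items, c / len(baskets))
        for items, c in counts.items()
        if c / len(baskets) >= min_support
    ]
    return pd.DataFrame(rows, columns=["itemsets", "support"]).sort_values(
        "support", ascending=False, ignore_index=True
    )

# Hypothetical sales lines.
sales = pd.DataFrame({
    "transaction_id": ["1", "1", "1", "2", "2", "3", "4", "4"],
    "drug_name": ["DRUG A", "DRUG B", "DRUG C", "DRUG A", "DRUG B",
                  "DRUG D", "DRUG A", "DRUG B"],
    "ailment_indication": ["Malaria", "Pain", "Infection", "Malaria", "Pain",
                           "Pain", "Malaria", "Pain"],
    "country": ["Ghana"] * 6 + ["Nigeria"] * 2,
})

malaria_ghana = frequent_itemsets(sales, "Malaria", "Ghana", min_support=0.6)
print(malaria_ghana)
```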

Now let’s get the top 3 preferred drug combinations (with at least 3 different drugs) for treating malaria in Ghana.
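Given a table of itemsets and their support, picking the top 3 combinations with at least 3 drugs is a filter followed by a sort. The table below is invented for illustration and assumes columns named itemsets and support.

```python
import pandas as pd

# Hypothetical frequent itemsets with made-up support values.
itemsets = pd.DataFrame({
    "itemsets": [
        ("A",), ("A", "B"),
        ("A", "B", "C"), ("A", "B", "D"), ("A", "C", "D"), ("B", "C", "D"),
    ],
    "support": [0.90, 0.70, 0.40, 0.30, 0.35, 0.10],
})

# Keep combinations of three or more drugs, then take the best supported.
at_least_three = itemsets[itemsets["itemsets"].map(len) >= 3]
top3 = at_least_three.nlargest(3, "support")
print(top3)
```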

We can also adapt this model to any locality so long as the data is granular enough.

From this analysis, the top 3 preferred drug combinations for treating malaria in Ghana are:

  1. (LONART DS x1, GEBEDOL TAB x6, PARACETAMOL 500MG x100)
  2. (LONART DS x1, ZULU 100MG x10, PARACETAMOL 500MG x100)
  3. (LONART SUSPENSION x1, AMOKSIKLAV 625MG x14, PARACETAMOL 500MG x100)

So, there you have it! We have demonstrated how the Apriori Algorithm can be used to determine preferred drug combinations in a geographic area.

Your comments and questions are very welcome!

See you next time!

References:

https://www.medscape.com/features/slideshow/dangerous-drug-combinations#page=3

https://www.sciencedirect.com/topics/medicine-and-dentistry/drug-combination

https://www.goodrx.com/blog/10-most-common-drug-combinations/

https://newsnetwork.mayoclinic.org/discussion/nearly-7-in-10-americans-take-prescription-drugs-mayo-clinic-olmsted-medical-center-find/

https://archivepp.com/storage/models/article/TFbjKdtR61XeOCmc0ZmpqNNHDXN1R5pXRuy6XtwG1ltbpAD3V18N2FzqDx04/physicians-drug-prescribing-patterns-at-the-national-health-insurance-scheme-unit-of-a-teaching-ho.pdf

https://www.amhsr.org/articles/a-systematic-review-of-the-literature-to-assess-selfmedication-practices.pdf

http://pubs.sciepub.com/ajphr/3/3/7/index.html

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5741028/

https://www.uspharmacist.com/article/pharmacists-take-center-stage-in-otc-counseling

https://en.wikipedia.org/wiki/Apriori_algorithm

https://www.softwaretestinghelp.com/apriori-algorithm/
