Data Science Applied to Heroin Consumption
Learning the theory and math behind an algorithm is extremely important, but the real challenge is applying it to a real-world problem. In this article, we will build a logistic regression model on the ‘Heroin consumption’ data. Let’s begin!
Setting up the dataset for binary classification
We will look at the problem of classifying an individual as a user or a non-user of heroin. Some of the predictors used are personality factors such as Nscore (neuroticism), SS (sensation-seeking), and Impulsive (impulsivity). The original dataset classifies individuals into seven classes, depending on how recently they used the drug, and it covers many other drugs besides heroin; in this project, we look only at heroin. The dataset comes from the UCI Machine Learning Repository and can be downloaded from the link at the end of this article.
For each drug, an individual falls into one of seven classes: CL0 (never used the drug), CL1 (used it over a decade ago), CL2 (used it within the last decade), CL3 (used it in the last year), CL4 (in the last month), CL5 (in the last week), and CL6 (in the last day).
First, I took the .data file and converted it into a .csv file. I restricted the dataset to heroin by deleting the columns for the other drugs. I then added a new column, “User”, with value 0 if the individual falls under CL0 or CL1 (not considered a user) and value 1 if the individual falls under CL2, CL3, CL4, CL5, or CL6 (considered a user). This gives us a binary classification problem: User or Not User.
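I did this recoding in Excel, but here is a minimal pandas sketch of the same step (the file name and the name of the heroin column are assumptions, not the actual names from my spreadsheet):

```python
import pandas as pd

# Load the heroin-only CSV (hypothetical file and column names).
df = pd.read_csv("heroin.csv")

# CL0 and CL1 -> 0 (Not User); CL2 through CL6 -> 1 (User).
df["User"] = (~df["Heroin"].isin(["CL0", "CL1"])).astype(int)
```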
I took a random sample of the heroin dataset using Excel and then split it into two separate CSV files: one containing 80% of the data (our training set) and the other containing the remaining 20% (our test set).
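If you would rather do the shuffling and splitting in code than in Excel, a sketch with pandas (continuing from the DataFrame df above) could look like this:

```python
# Shuffle and split: 80% training, 20% test.
train = df.sample(frac=0.8, random_state=1)   # random sample of 80% of the rows
test = df.drop(train.index)                   # the remaining 20%

train.to_csv("heroin_train.csv", index=False)
test.to_csv("heroin_test.csv", index=False)
```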
Visualizing the data
For this section, I’m going to use the entire dataset for heroin consumption. Let’s look at heroin consumption by gender. Here is a graph for the number of users for each gender.
As you can see, there are about the same number of males as females in the population. For both genders, only a small fraction are heroin users, and there are about twice as many male heroin users as female heroin users. We can test whether gender is statistically significant using the chi-squared test:
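A sketch of this test in Python with scipy, assuming the Gender and User columns from the DataFrame above:

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Contingency table of Gender vs. User, then the chi-squared test of independence.
table = pd.crosstab(df["Gender"], df["User"])
chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.3f}, p-value = {p_value:.6f}, dof = {dof}")
```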
Since the p-value is less than the significance level, gender is statistically significant.
Let’s look at the percentages of users by country.
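The original charts were made in Excel; a pandas/matplotlib sketch of this kind of chart, assuming a Country column in the DataFrame from earlier, might look like this:

```python
import matplotlib.pyplot as plt

# Percentage of heroin users within each country.
pct_users = df.groupby("Country")["User"].mean() * 100
pct_users.sort_values(ascending=False).plot(kind="bar")
plt.ylabel("Percentage of heroin users (%)")
plt.tight_layout()
plt.show()
```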
As you can see, the USA and UK have the largest representation in the sample. Most of the heroin users in the sample are from the USA. Applying the chi-squared test, we can check that country is statistically significant:
Since the p-value is less than 0.05, country is statistically significant.
Let’s see how Impulsivity affects the percentage of users:
As the Impulsive score increases, the percentage of users increases.
We can apply the Chi-Squared test to check whether Impulsivity is statistically significant:
Since the p-value is less than the significance level, Impulsivity is statistically significant.
A similar effect can be seen with Nscore:
The same thing can be seen with SS:
The opposite effect is seen with Ascore:
The percentage of users is higher for lower Ascores. We can see this more clearly by normalizing the bars:
Similarly, the percentage of users is higher for lower levels of education:
We can see this better by normalizing the bars:
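Here is a sketch of a normalized (100%) stacked bar chart for Education, assuming an Education column in the DataFrame from earlier; the same recipe works for Ascore:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Counts of non-users/users per education level, each bar rescaled to sum to 1.
counts = pd.crosstab(df["Education"], df["User"])
shares = counts.div(counts.sum(axis=1), axis=0)
shares.plot(kind="bar", stacked=True)
plt.ylabel("Proportion of individuals")
plt.legend(["Non-user", "User"], title="User")
plt.tight_layout()
plt.show()
```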
The Chi-squared test was applied to Nscore, SS, Ascore, and Education; all were found to be statistically significant. In contrast, applying the chi-squared test to Escore gives a p-value of 0.156615315, which is greater than 0.05. Therefore, Escore is not statistically significant. Ethnicity also turns out to be statistically insignificant. Interestingly, Cscore and Age turn out to be statistically significant although we will eliminate them in backward elimination when building the model. There does seem to be a relationship between Cscore and whether or not the individual is a user:
The percentage of users is higher for lower Cscores. Similarly, the percentage of users is higher for younger individuals:
Building a Logistic Regression Model
Using gretl, I applied logistic regression to the training set. In logistic regression, our model will give us a probability for each individual — namely, the probability that the individual is a User.
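I used gretl throughout, but for readers who prefer Python, here is a sketch of fitting the same kind of model with statsmodels, which reports coefficients and p-values much like gretl does (the column names are the ones assumed in the earlier sketches):

```python
import pandas as pd
import statsmodels.api as sm

train = pd.read_csv("heroin_train.csv")

predictors = ["Age", "Gender", "Education", "Country", "Ethnicity",
              "Nscore", "Escore", "Oscore", "Ascore", "Cscore",
              "Impulsive", "SS"]
X = sm.add_constant(train[predictors])
y = train["User"]

model = sm.Logit(y, X).fit()
print(model.summary())   # coefficients, standard errors, p-values
```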
First, I used backward elimination to eliminate some of the independent variables using the threshold of 0.05 for p-value. In this way, I was able to eliminate Escore, Cscore, Age, and Ethnicity. At this point, here is what our logistic regression model in gretl looks like:
Model 32: Logit, using observations 1–1508
Dependent variable: User
Standard errors based on Hessian
Number of cases ‘correctly predicted’ = 1334 (88.5%)
f(beta’x) at mean of independent vars = 0.316
Likelihood ratio test: Chi-square(8) = 199.051 [0.0000]
As you can see, the p-value for Oscore is greater than 0.05; however, removing Oscore from the model decreases the adjusted R-squared to 0.169289, so I decided to keep Oscore, along with Education and SS. I checked for collinearity and found no signs of it. The accuracy rate of the model on the training set is 88.5%.
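As a side note, the backward-elimination procedure described above can be automated. Here is a rough sketch, using the X and y from the statsmodels sketch earlier; it strictly drops the predictor with the largest p-value until all p-values are below 0.05, so unlike my final model it would also drop Oscore:

```python
import statsmodels.api as sm

cols = list(X.columns)
while True:
    fit = sm.Logit(y, X[cols]).fit(disp=0)
    pvalues = fit.pvalues.drop("const", errors="ignore")
    if pvalues.empty or pvalues.max() <= 0.05:
        break                              # all remaining predictors are significant
    cols.remove(pvalues.idxmax())          # drop the least significant predictor

print("Predictors kept:", [c for c in cols if c != "const"])
```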
Assessing the model
Here is what the CAP curve looks like:
Our training set is ordered by our logistic regression model from highest to lowest predicted probability. The horizontal axis measures the percentage of individuals considered, in that order; the vertical axis measures the percentage of all users in the training set identified so far. In other words, the red curve at x% shows the number of users found among the first x% of the ordered training set, divided by the total number of users in the training set. The blue line is the performance of a random model. As the CAP curve shows, our logistic regression model performs well: at x = 50%, the curve is already at approximately 90%.
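For the curious, the CAP curve itself is straightforward to construct. Here is a sketch, assuming the fitted statsmodels model from earlier and the true labels y:

```python
import numpy as np
import matplotlib.pyplot as plt

probs = model.predict(X)                               # predicted probability of being a user
order = np.argsort(-probs.to_numpy())                  # highest to lowest predicted probability
captured = np.cumsum(y.to_numpy()[order]) / y.sum()    # fraction of all users found so far
considered = np.arange(1, len(y) + 1) / len(y)         # fraction of the population considered

plt.plot(considered * 100, captured * 100, "r", label="Logistic model")
plt.plot([0, 100], [0, 100], "b", label="Random model")
plt.xlabel("Percentage of individuals considered")
plt.ylabel("Percentage of users identified")
plt.legend()
plt.show()
```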
Next, we applied our model to the test set and created the CAP curve for our test set. The test set is ordered by our logistic regression model from highest predicted probability to lowest predicted probability. Here is the CAP curve for our test set:
At x=50%, our CAP curve is approximately 95%. So our model is very good. What this CAP curve tells us is that, by ordering the list of individuals in our test set according to highest probability given by our model, we would have identified 95% of the users by considering the first 50% of the individuals in the list. If we were to consider a new set of individuals, we could potentially identify 95% of the users by considering only 50% of the individuals. The accuracy rate of our model on the test set is 89.9%.
Let’s take a look at the odds-ratios for the predictors:
As you can see, Impulsive, Nscore, SS, and Oscore have odds-ratios greater than 1 (positive coefficients), while Country, Gender, Ascore, and Education have odds-ratios less than 1 (negative coefficients). This suggests that higher Impulsive, Nscore, SS, and Oscore values are associated with a higher probability that the individual is a user, whereas higher Ascore and Education values are associated with a lower probability.
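In statsmodels, these odds-ratios are simply the exponentiated coefficients; a quick sketch, assuming the fitted model from earlier:

```python
import numpy as np

# Odds-ratio > 1: the predictor increases the odds of being a user;
# odds-ratio < 1: it decreases them.
odds_ratios = np.exp(model.params)
print(odds_ratios.sort_values(ascending=False))
```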
Recall that Cscore and Age were found to be statistically significant even though they were not included in the model. I trained a new model that includes them, to see whether they improve it. Here is the new model in gretl:
Model 1: Logit, using observations 1–1508
Dependent variable: User
Standard errors based on Hessian
Number of cases ‘correctly predicted’ = 1337 (88.7%)
f(beta’x) at mean of independent vars = 0.316
Likelihood ratio test: Chi-square(10) = 199.235 [0.0000]
As you can see, the accuracy rate is 88.7%, not much better than the accuracy rate of our original model.
Here is the CAP curve for our new model applied to the training set:
At x=50%, the CAP curve is approximately 90%. The accuracy rate of the model on the training set is 88.66%. Thus, our new model is not much of an improvement over the original model.
I applied the new model on the test set and got the following CAP curve.
At x=50%, the curve is approximately 95%. The accuracy rate of the model on the test set is 89.9%. Thus, including Cscore and Age in the model does not improve the model by much.
Here are the odds-ratios for the predictors Cscore and Age:
Since the coefficients for Cscore and Age are close to 0, they don’t contribute much to the predicted probability that an individual is a user. It may seem strange that Cscore and Age are statistically significant according to the chi-squared test and yet not needed in the model. A likely explanation is that Cscore is negatively correlated with Nscore and Impulsive, and Age is negatively correlated with SS. Here is a heatmap of the correlation matrix:
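A sketch of how such a heatmap can be produced with seaborn, assuming the DataFrame and column names from the earlier sketches (the original figure was produced separately):

```python
import seaborn as sns
import matplotlib.pyplot as plt

cols = ["Age", "Nscore", "Escore", "Oscore", "Ascore", "Cscore",
        "Impulsive", "SS"]
sns.heatmap(df[cols].corr(), annot=True, fmt=".2f", cmap="coolwarm",
            vmin=-1, vmax=1)
plt.tight_layout()
plt.show()
```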
The dataset can be found here: http://archive.ics.uci.edu/ml/datasets/Drug+consumption+%28quantified%29
E. Fehrman, A. K. Muhammad, E. M. Mirkes, V. Egan, and A. N. Gorban, “The Five Factor Model of Personality and Evaluation of Drug Consumption Risk,” arXiv preprint, 2015.