Using Machine Learning to Predict Alcohol and Drug Use with Personality Traits and Socio-Demographic Characteristics (Part 3)

Andrew Sik-On Leung
Jun 1, 2023


Part 3 covers my project’s results, discusses how the research questions were answered, and concludes with possible future improvements.

To jump back to Part 2, use this link here. Part 1 can be found here.

Results

Examining the personality traits and demographic variables that best predict each drug use outcome

To determine which independent variables were the best predictors of drug use, the coefficients of the multinomial logit regressions and the final random forest feature importance values were retrieved for the three drug outcomes.
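To make the "mean decrease in impurity" in the plots below more concrete, here is a minimal sketch of the quantity a single tree split contributes. It assumes Gini impurity as the criterion (the post does not state which criterion the forest used), and the toy labels are hypothetical. In a random forest, a feature's importance is accumulated from decreases like this over every split that uses the feature and then averaged across trees.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_k^2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def impurity_decrease(parent, left, right):
    """Weighted drop in Gini impurity when `parent` is split into two children."""
    n = len(parent)
    weighted_children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted_children


# Hypothetical node holding two usage classes for eight respondents.
parent = [0, 0, 0, 0, 1, 1, 1, 1]

# A split that separates the classes perfectly removes all impurity.
clean_split = impurity_decrease(parent, [0, 0, 0, 0], [1, 1, 1, 1])

# A split that leaves both children as mixed as the parent removes none.
noisy_split = impurity_decrease(parent, [0, 0, 1, 1], [0, 0, 1, 1])

print(clean_split, noisy_split)
```

A feature that keeps producing splits like `clean_split` ends up near the top of the importance plot, while one that only produces splits like `noisy_split` ends up near zero.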

Stimulants

Multinomial Logit Coefficients from the final model for stimulants
Feature importance plot for stimulants: plot of mean decrease in impurity by features
Top Three Features for stimulants based on Impurity-based Importance

Depressants

Multinomial Logit Coefficients from the final model for depressants
Feature importance plot for depressants: plot of mean decrease in impurity by features
Top Three Features for depressants based on Impurity-based Importance

Hallucinogens

Multinomial Logit Coefficients from the final model for hallucinogens
Feature importance plot for hallucinogens: plot of mean decrease in impurity by features
Top Three Features for hallucinogens based on Impurity-based Importance

Which machine learning algorithm best predicted drug use for each drug outcome?

In order to determine which algorithm performed the best at predicting drug use, the metric values for the models were tabulated for each drug outcome.

Initial Models

One of the main issues when I first calibrated the models was overfitting due to the imbalanced datasets. To help correct this issue, I used SMOTE to oversample the minority classes by generating synthetic data before recalibrating the models. We can see in the tables below that using SMOTE to fix the imbalance decreased the metric scores across all models substantially, irrespective of the drug outcome being examined.
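For readers unfamiliar with SMOTE, the core idea is to create each synthetic minority sample by interpolating between a real minority sample and one of its nearest minority neighbours. The sketch below is a simplified, standard-library illustration of that idea only. It is not the implementation used in this project, and the function name, parameters and toy points are all hypothetical.

```python
import random


def smote_like_samples(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points, each lying on the line segment between
    a minority point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)

    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        base_index = rng.randrange(len(minority))
        base = minority[base_index]
        # All other minority points, nearest first.
        others = [p for i, p in enumerate(minority) if i != base_index]
        neighbours = sorted(others, key=lambda p: squared_distance(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Hypothetical minority class with two features per respondent.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_like_samples(minority, n_new=6)
print(new_points)
```

Because every synthetic point sits between two real minority points, oversampling adds no genuinely new information, which is one reason the rebalanced scores can look quite different from the initial ones.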

Final Models

The final models were determined after:

  • re-balancing the data with SMOTE,
  • choosing the best algorithm,
  • applying a binarized approach,
  • applying cross-validation, and
  • finalizing parameters with GridSearchCV.
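Two of the steps listed above can be sketched in a few lines. This is an illustration under stated assumptions, not the project's code: it assumes the "binarized approach" means collapsing the usage classes into user versus non-user, and it assumes stratified folds for cross-validation. The helper names and the toy outcome are hypothetical.

```python
from collections import defaultdict


def binarize(labels, positive_classes):
    """Collapse a multi-class outcome into 1 (user) versus 0 (non-user)."""
    return [1 if label in positive_classes else 0 for label in labels]


def stratified_folds(labels, n_folds):
    """Assign each sample index to a fold so every fold keeps roughly the
    same class proportions as the full dataset (no shuffling, for clarity)."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        # Deal the indices of each class round-robin across the folds.
        for position, index in enumerate(indices):
            folds[position % n_folds].append(index)
    return folds


# Hypothetical three-class outcome for twelve respondents.
usage = [0, 1, 2, 2, 0, 1, 2, 0, 1, 2, 2, 0]
binary = binarize(usage, positive_classes={1, 2})
folds = stratified_folds(binary, n_folds=2)

for fold in folds:
    users = sum(binary[i] for i in fold)
    print(f"fold size {len(fold)}, users {users}, non-users {len(fold) - users}")
```

In practice each fold takes a turn as the validation set while the model is trained on the rest, and a grid search repeats that loop for every parameter combination.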

For stimulants, the best model was the neural network. For depressants and hallucinogens, the random forest was the best-performing algorithm.

Conclusion

In the following section we look at the model results using the research questions to provide context. Finally, the last portion discusses the models used and the rationale behind the choices.

Answering Research Question One

Stimulants

The first research question asked: which personality traits and sociodemographic variables predict the risk of consuming each class of drug?

To answer this for the stimulants drug class, I examined the feature importance ranking from the random forest and the coefficients produced by the multinomial logit. For stimulants, the top three features measured by impurity-based importance were Neuroticism (0.212), ethnicity (0.19) and gender (0.13), each with a mean information gain above 0.1.

Reviewing the multinomial logit provides some additional insights. The positive coefficients for Neuroticism (0.4627; 0.7381) for both class labels (2: infrequent user; 3: frequent user) indicate that as Neuroticism increases, so do the log odds of using stimulants. Ethnicity had positive coefficients (6.8; 5.11), indicating that Canada, Ireland, and the UK are countries that increase the log odds of using stimulants. Gender also had positive coefficients (0.7811; 0.9735), indicating that females have higher log odds of using stimulants than males. These results should be taken cautiously and used only as a directional guide, as neither the gender nor the Neuroticism coefficients were statistically significant at an alpha of 0.05. It is also unsurprising that the pseudo-R-squared is very low (0.060), as this was the most imbalanced drug outcome variable of the three, with very few observations in class 0 (non-users) and class 2 (infrequent users). Additional data needs to be gathered to re-calibrate the models.
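Log-odds coefficients are easier to read once they are exponentiated into odds ratios. The short sketch below does this for the two Neuroticism coefficients quoted above. It assumes the usual multinomial logit reading, where each coefficient is relative to a reference class and all other predictors are held fixed. The dictionary keys are just labels chosen for this example.

```python
import math

# Neuroticism coefficients reported above for the two stimulant classes.
neuroticism_coefs = {"infrequent user": 0.4627, "frequent user": 0.7381}

# exp(coefficient) is the multiplicative change in the odds of that class
# (versus the reference class) for a one-unit increase in the predictor.
odds_ratios = {label: math.exp(coef) for label, coef in neuroticism_coefs.items()}

for label, ratio in odds_ratios.items():
    print(f"{label}: odds ratio {ratio:.2f}")
```

The resulting odds ratios are roughly 1.6 and 2.1, so a one-unit increase in the Neuroticism score would be associated with higher odds of both usage classes. The same caution applies here as to the coefficients themselves, since they were not statistically significant.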

Depressants

For depressants, ethnicity (0.43), sensation seeking (0.26), impulsiveness (0.12) and Agreeableness (0.13) were the four largest contributors to the information gain.

Examining the coefficients of the multinomial logit shows that ethnicity has positive coefficients (2.89; 3.1167), indicating that those with an ethnicity of other, mixed black/Asian or mixed white/Asian had higher log odds of using depressants. In this case, ethnicity was statistically significant (p < 0.05). Sensation seeking had a somewhat positive association with depressants (0.475; 0.699) but was only statistically significant in one class (class 3, p < 0.05). Impulsiveness was not statistically significant, but did have a negative coefficient, indicating that it would potentially lead to a reduction in the log odds of consuming depressants. For Agreeableness, the negative coefficients were almost significant (p = 0.07) and indicated that increasing Agreeableness leads to lower log odds of consuming depressants. Once again, the model has poor overall log-likelihood and pseudo-R-squared, which can be attributed to the lack of data and a very imbalanced drug outcome variable that needed significant amounts of oversampling.

Hallucinogens

For hallucinogens, the features with the highest information gain were Country (0.29), age (0.27), sensation seeking and education (0.13).

Country had statistically significant, negative coefficients in the multinomial logit (-0.75; -1.72), indicating that individuals in countries like the US, New Zealand and Australia had higher log odds of consuming hallucinogens, presumably because these countries carry the lower encoded values in the dataset. Age had negative coefficients (-0.84; -1.35) that were strongly significant, indicating that increasing age decreases the log odds of consuming hallucinogens. Sensation seeking was also statistically significant and had positive coefficients (0.3505; 0.6772), indicating that higher levels of sensation seeking are associated with higher log odds of consuming hallucinogens. Finally, education had negative coefficients (-0.087; -0.50) that were statistically significant, indicating that higher education levels were associated with lower log odds of hallucinogen use. This model for hallucinogens had the greatest number of statistically significant variables and the best pseudo-R-squared (0.347), which was expected as the hallucinogen outcome was much more balanced between the three usage classes to begin with, resulting in only minor oversampling.
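The pseudo-R-squared values quoted for the three logit models compare the fitted model's log-likelihood with that of an intercept-only model. The post does not say which variant was reported, so the sketch below assumes McFadden's definition, and the log-likelihood values in the example are hypothetical rather than taken from this project.

```python
def mcfadden_pseudo_r2(ll_model, ll_null):
    """McFadden's pseudo-R-squared: 1 - LL(fitted) / LL(intercept-only)."""
    if ll_null == 0:
        raise ValueError("null log-likelihood must be non-zero")
    return 1.0 - ll_model / ll_null


# Hypothetical log-likelihoods, chosen only to illustrate the formula.
example = mcfadden_pseudo_r2(ll_model=-800.0, ll_null=-1000.0)
print(round(example, 3))
```

A value near zero means the predictors barely improve on always predicting the class proportions, which is how the 0.060 for stimulants should be read. Values are not directly comparable to the R-squared of a linear regression.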

Answering Research Question Two

Stimulants

Research question two asked: can we determine which machine-learning approach is the most effective for predicting consumption? For model comparisons between SVM, logistic regression, random forest and neural networks, the overall accuracy, precision, recall, F1-score and AUC (area under the curve) were examined. The overall accuracy, precision, recall and F1-score are very similar, indicating that precision and recall are well balanced¹.
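As a reminder of how these metrics relate, the sketch below computes precision, recall and F1-score from scratch for a binary outcome. The labels are hypothetical and the function is a plain-Python illustration, not the code used to produce the project's tables.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall and F1-score for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical labels: one false positive and one false negative.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(precision, recall, f1)
```

When false positives and false negatives occur at similar rates, precision and recall land close together and the F1-score sits between them, which matches the pattern described above.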

All four measures and the AUC point to the neural network as the best classifier for the stimulant drug outcome. However, the neural network has such a high accuracy and F1-score (both about 0.94) that this may indicate some overfitting, despite tuning the regularization hyperparameters, oversampling and collapsing the dependent variable categories. This is not surprising given the limited sample size.
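The AUC used in these comparisons can be read as the probability that a randomly chosen user receives a higher predicted score than a randomly chosen non-user. The sketch below computes it directly from that pairwise definition for a binary outcome. The scores are hypothetical, and this brute-force version is meant for illustration rather than for large datasets.

```python
def roc_auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison of scores.
    Ties between a positive and a negative count as half a win."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative sample")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical predicted probabilities for four respondents.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(y_true, scores)
print(auc)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, so a very high AUC on a small, oversampled dataset deserves the same overfitting caution as a very high accuracy.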

Depressants

For depressants, the random forest model was the best classifier by all five metrics (Table 14) and had an accuracy score of 0.84 and an F1-score of 0.809. Once again, the precision and recall scores are very similar, indicating that they have been optimized together.

Hallucinogens

For hallucinogens, the best classifier was once again the random forest, with an accuracy score of 0.78 and an F1-score of 0.720. Interestingly, the hallucinogen outcome had the lowest scores for all four models by the five metrics. This may be due to this outcome variable having the least class imbalance to begin with and therefore needing much less oversampling to balance the classes. As a result, the hallucinogen data may be more reflective of the actual population, as the sample size for each class was much higher and the classes were closer in size to each other.

Future Improvements

There are some limitations that need to be discussed for this project, namely: low sample size, imbalanced classes for the outcome variables, a small set of features and data processing from the original authors.

Low Sample Size

The primary limitation of this study is the small sample size of n=1185 of the drug consumption data set. This is likely the primary culprit for the overfitting that many of the models experienced in the initial calibrations. The poor model fits for the multinomial logits also point to insufficient data for creating accurate models. To remedy this, future studies should look for additional data from open data sources such as the US government's Substance Abuse and Mental Health Data Archive (SAMHDA). It should be noted that this study had personality trait data at the individual level, which may be hard to come by. Therefore, an additional recommendation is to gather more sociodemographic data with drug use outcomes and calibrate a model based solely on the sociodemographic features, as an alternate model for when no personality trait data is available. Alternatively, another future step could be to predict personality traits from sociodemographic and behavioural traits, deriving the personality variables for individuals before predicting their drug use.

Imbalanced Classes

Closely related to low sample size, the drug use data also feature very imbalanced drug use categories, with certain class labels being the majority group. With so few samples in certain classes, the algorithms would be biased towards predicting the majority class. To solve this issue, additional data gathering needs to be conducted, or another dataset with the same drug outcome variables needs to be retrieved. The additional data could also be gathered under the broader categories created for this analysis (hallucinogens, stimulants and depressants) in order to maintain enough sample size for each drug outcome.
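A quick way to see why imbalance pushes models towards the majority class is to compute the accuracy of a "model" that always predicts that class. The sketch below does this for a hypothetical class distribution. The counts are invented for illustration and are not the project's actual class sizes.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most common class."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)


# Hypothetical imbalanced outcome: 90 frequent users, 7 infrequent, 3 non-users.
labels = [2] * 90 + [1] * 7 + [0] * 3
baseline = majority_baseline_accuracy(labels)
print(baseline)
```

With a distribution like this, a classifier can reach 90% accuracy without learning anything about the minority classes, which is why accuracy alone was not relied on and why rebalancing was needed.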

Small set of Features

The feature set for this analysis was also quite small, with only 12 features serving as independent variables. It is very possible that the models are highly biased and that many important sociodemographic and behavioural variables are missing. The low log-likelihoods and pseudo-R-squared values, along with the poor accuracy scores before rebalancing and binarizing, lend credence to this notion. For future studies, additional individual-level sociodemographic and behavioural variables need to be gathered and included in the study. Aggregate-level data could also be added for a mixed-models approach. Additional variables should be researched from the literature. Once there are sufficient features, we can perform a proper feature selection process.

Alternative Data Processing

The creators of this drug consumption dataset made an interesting choice to convert nominal categorical data such as ethnicity, which has no order, into continuous values by mapping them to a continuous range of numbers through nonlinear Categorical Principal Component Analysis (CatPCA). As a result, the continuous variables had normal distributions and were very easy to work with in the modeling process. However, they were difficult to interpret, and it is uncertain whether a dummy coding approach would have provided more insights and created better models. Future research should try a dummy coding approach or try to obtain data that is originally continuous (e.g. age as the number of years instead of a range) so that there is no need to recode.
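For comparison, a dummy coding approach would replace the single continuous ethnicity value with one 0/1 column per category, leaving one category out as the reference. The sketch below shows the idea in plain Python. The category labels and the choice of reference are illustrative assumptions, not the dataset's actual encoding.

```python
def dummy_code(values, reference):
    """Return (column names, rows of 0/1 indicators), with one column per
    category except `reference`, which is represented by all zeros."""
    categories = sorted(set(values) - {reference})
    rows = [
        [1 if value == category else 0 for category in categories]
        for value in values
    ]
    return categories, rows


# Hypothetical ethnicity values for five respondents.
ethnicity = ["White", "Asian", "Other", "White", "Black"]
columns, encoded = dummy_code(ethnicity, reference="White")

print(columns)
for value, row in zip(ethnicity, encoded):
    print(value, row)
```

Each coefficient on a dummy column would then be read as the change in log odds for that category relative to the reference category, which is easier to interpret than a coefficient on a quantified score.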

Some Final Thoughts

In summary, this project helps to highlight the potential that machine learning algorithms have in assisting in health research and drug use prediction.

Future work should focus on gathering larger amounts of data to address the sample size issues and imbalanced classes found in this study. Additionally, a much larger dataset would help machine learning algorithms such as neural networks and random forests produce more robust and accurate estimates. A larger number of features also needs to be retrieved to provide more predictors, improve the models and decrease overall model bias. Once the three broader outcomes of stimulants, depressants and hallucinogens can be predicted accurately, future studies can then look to predicting very specific drug outcomes, as found in the original dataset.

To jump back to Part 2, use this link here. Part 1 can be found here.

The code for this project can be found here.

Thank you for giving my project a read through! Thoughts? Comments? Please let me know what you think below or email me: andrew.sleung@gmail.com

[1]: Burkov A. The Hundred-Page Machine Learning Book. Quebec City, QC, Canada: Andriy Burkov; 2019.
