Using Machine Learning to Predict Alcohol and Drug Use with Personality Traits and Socio-Demographic Characteristics (Part 2)

Andrew Sik-On Leung
10 min read · Jun 1, 2023


Part 2 covers my project’s exploratory analysis, data-cleaning/pre-processing phase, feature selection, and implementation of the machine learning models.

To jump back to Part 1, use this link here. Part 3 can be found here.

Exploratory Analysis

Selecting the Dataset

The dataset for the project was obtained from the open-source UCI Machine Learning Repository and was selected because it provided an opportunity to examine the association between personality traits and sociodemographic variables and drug and alcohol use. In particular, the availability of personality traits in an individual-level survey, along with each individual’s drug use status, seemed to provide a very interesting opportunity to apply machine learning algorithms to understand drug use as a classification problem. Additionally, statistical inference approaches could also be used to assess the relationship of personality and sociodemographic characteristics with drug use, providing insights that could inform public health policy.

Independent Variables/Input Feature Set Descriptive Statistics

In this drug consumption dataset, there were five sociodemographic variables (Age, Gender, Education, Country, and Ethnicity), seven measures of personality (NEO-FFI-R/FFM: Neuroticism, Extraversion, Openness, Agreeableness, Conscientiousness; plus Impulsiveness and Sensation Seeking) and one row identifier (ID). Only ID was excluded from the input feature set, as it provides nothing more than a unique row count.

The table below provides descriptive statistics for the 12 + 1 (ID) independent variables. Tests of normality were also added to see whether the independent variables could be used in parametric models, such as logistic regression, to test for significance (p-values) of association with the drug use outcomes. The Shapiro-Wilk test was applied first, and all but one variable (Neuroticism) failed the test for normality. This was a bit of a surprise, as all the means were 0 and the standard deviations were close to 1. However, taking a closer look at the histograms and plots (Appendix A3), it appears on visual inspection that Education, Ethnicity, and the seven measures of personality traits are much more normal than the Shapiro-Wilk test suggests. Further reading indicated that while the Shapiro-Wilk test is a very powerful test, it may perform worse than the D’Agostino K-squared test when a variable contains many repeated values. The raw data showed many repeated values, and the QQ plots confirmed this. Therefore, the D’Agostino statistic is a much more reliable judge of normality for this dataset.

Descriptive Statistics for Independent Variables
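As a quick illustration of how these two normality tests can be run, here is a minimal sketch using scipy; the file name and the column name "Nscore" (for Neuroticism) are assumptions, not the exact code used in the project.

```python
# Sketch of the Shapiro-Wilk vs. D'Agostino K-squared normality checks.
import pandas as pd
from scipy import stats

df = pd.read_csv("drug_consumption.csv")      # assumed file name

def normality_report(series: pd.Series) -> dict:
    """Run both normality tests on one column and return the p-values."""
    sw_stat, sw_p = stats.shapiro(series)      # Shapiro-Wilk: sensitive to repeated values
    da_stat, da_p = stats.normaltest(series)   # D'Agostino K-squared
    return {"shapiro_p": sw_p, "dagostino_p": da_p}

print(normality_report(df["Nscore"]))          # assumed column name for Neuroticism
```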

While skewness and non-normality should be monitored, independent variables do not necessarily violate model assumptions just because they are non-normal, especially for non-parametric models like decision trees and k-NN. Some categorical variables, like Gender, will not fit the assumptions of normality, which is to be expected. Overall, these independent variables were judged to be feasible for use in parametric models.

Histograms and QQ plots were produced for all the independent variables and a variety of transformations were also attempted.

Outcome Variables Descriptive Statistics

The data set contains eighteen outcome variables of drug use that are all categorical and are labelled with the same seven output classes. The table below outlines all the class labels and the respective definitions:

Drug Outcome variable levels (Frequency of Use)

Further exploration showed that there were no missing values, with 1885 values present for each variable. The distribution of the seven classes was examined by counting the values of each class for each drug outcome. The percentage distribution was also calculated, as follows:

Percentage Distribution of Values for Outcome Features

From the percentage distribution of each outcome variable, it is apparent that all the outcome variables have highly skewed distributions. Further processing was therefore needed to account for the imbalanced classes in each drug outcome in order to obtain consistent results from the machine learning algorithms.

Visualizations

To further explore the dataset, I created a variety of plots to visualize the different variables. The diagrams below show the histograms, QQ plots, and my transformation plan for the independent variables. For the full set of visualizations, please go to this link.

Visualizations and Transformation Plan for Age and Gender Variables
Visualizations and Transformation Plan for Education and Country Variables
Visualizations and Transformation Plan for Ethnicity, Neuroticism and Extraversion Variables
Visualizations and Transformation Plan for Openness, Agreeableness and Conscientiousness Variables
Visualizations and Transformation Plan for Impulsivity and Sensation Seeking Variables

Correlation Analysis

Following the creation of plots and descriptive statistics, I proceeded to correlation analysis to see whether there were potential linear relationships between the independent variables and the outcome drug variables (bivariate relationships). This serves two purposes: 1) to identify independent variables with strong linear relationships as important potential candidates for inclusion in the models, and 2) to determine whether the independent variables are correlated with each other, which would result in multicollinearity. In general, the presence of correlated independent variables leads to less reliable probability estimates from the machine learning algorithms, so this was a good opportunity to remove highly correlated variables before modelling.

Correlation matrices were produced for each of the outcome variables with the full set of independent variables (all twelve). The correlation matrix plot for cocaine is shown below. For the remaining plots please see the GitHub page.

Cocaine Correlation Matrix
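For reference, a correlation matrix like the one above can be produced with pandas and seaborn; this is a sketch only, the column names are assumptions, and "Coke" stands in for a numerically coded cocaine outcome.

```python
# Sketch: Pearson correlation matrix plotted as a heatmap (assumes the
# DataFrame `df` from the earlier snippet and assumed column names).
import matplotlib.pyplot as plt
import seaborn as sns

cols = ["Age", "Gender", "Education", "Country", "Ethnicity",
        "Nscore", "Escore", "Oscore", "Ascore", "Cscore",
        "Impulsive", "SS", "Coke"]             # assumed column names
corr = df[cols].corr(method="pearson")         # Pearson correlation matrix

sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Cocaine correlation matrix")
plt.tight_layout()
plt.show()
```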

The general takeaway from the correlation matrices was that most of the drug outcomes were not strongly correlated with the independent variables. Among the independent variables, Impulsivity and Sensation Seeking were most notably correlated somewhat strongly (0.62), and Sensation Seeking was moderately correlated with the Openness personality trait. This was expected, as these variables are described in the literature as related domains that nonetheless provide different, supplementary information about personality.

Data-Cleaning/Pre-Processing

From the information gathered through the exploratory analysis, there were several major data transformations that needed to be performed on the dataset to help make the modelling process more effective. These operations were:

1) Assessing for null and invalid values and determining appropriate imputations

2) Transforming the drug outcome variables from an ordinal categorical variable to an ordinal numerical variable

3) Collapsing the drug outcomes into the three major drug categories: stimulants, depressants, hallucinogens

4) Combining the 7 class labels for the drug outcomes into three classes to produce three new levels of drug use: 0 — non-user, 1 — infrequent use, 2 — high usage

Assessing for null and invalid values

Performing a quick field summary of the variables shows that there are no null values (see the Field Summary below). From the exploratory analysis, the min, max, 25th, 50th, and 75th percentile values, as well as the plots, indicate that the spread of values is centred around a mean of 0, with outliers unlikely.

Field Summary
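A field summary along these lines can be produced directly with pandas (a sketch, assuming the DataFrame `df` loaded earlier):

```python
# Null check and spread summary for every column (sketch, not the author's exact code).
print(df.isnull().sum())                                      # expect all zeros
print(df.describe().T[["min", "25%", "50%", "75%", "max"]])   # spread of values
```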

Transforming the drug outcome variables from an ordinal categorical variable to an ordinal numerical variable

All the drug outcome variables had the class labels CL0 to CL6, which are ordinal categories: CL0 represented “never used” and CL6 represented “used in the last day”. To convert the string class labels to a numerical ordinal variable, the “CL” prefix was removed with the following method:

Recoding From String to Numeric
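A sketch of that recoding, assuming the drug columns are collected in a list called `drug_cols` (only an illustrative subset of names is shown here):

```python
# Strip the "CL" prefix from each drug column and cast the remaining digit to int.
drug_cols = ["Alcohol", "Amphet", "Benzos", "Cannabis", "Coke", "Heroin"]  # illustrative subset

for col in drug_cols:
    df[col] = df[col].str.replace("CL", "", regex=False).astype(int)
```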

Collapsing the drug outcomes into the three major drug categories

To increase the sample size of the outcome drug variables and to simplify the drug outcomes in a meaningful way, the drug outcomes were grouped into three new outcome variables representing broader classes of drugs: Stimulants, Depressants, and Hallucinogens. For each new drug outcome, the maximum class label among its group of original drugs was selected as its value. The following shows the grouping functions that were applied and the constituent drugs for each new drug outcome group.

Grouping to Three Broader Drug Classes
Original Drug outcomes Mapped to Three Classes
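The grouping step might look roughly like the sketch below: each new outcome takes the maximum class label across its constituent drug columns. The column lists here are illustrative assumptions; the actual groupings are those shown in the figures above.

```python
# Assumed constituent columns for each broader drug class (see figures above
# for the author's actual groupings).
stimulant_cols    = ["Amphet", "Coke", "Crack", "Nicotine"]
depressant_cols   = ["Alcohol", "Benzos", "Heroin", "Meth"]
hallucinogen_cols = ["Cannabis", "Ecstasy", "Ketamine", "LSD", "Mushrooms"]

# Each new outcome takes the highest (most recent use) class label in its group.
df["stim_final"]  = df[stimulant_cols].max(axis=1)
df["dep_final"]   = df[depressant_cols].max(axis=1)
df["hallu_final"] = df[hallucinogen_cols].max(axis=1)
```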

Combining the 7 class labels for the drug outcomes into three classes to produce three new levels of drug use

The final preprocessing step was to collapse the ordinal classes, both to increase sample sizes and to restructure the data as a simpler multi-class problem with three classes instead of seven. The original classes were mapped to the final three classes in the following way:

Original Multi-Class Labels to Three New Labels
Recoding Function
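A recoding function along these lines could perform the collapse; the cut points shown are assumptions, with the author’s exact mapping given in the figure above.

```python
def collapse_classes(label: int) -> int:
    """Collapse the original CL0-CL6 labels into three levels (assumed cut points)."""
    if label == 0:       # CL0: never used
        return 0         # non-user
    elif label <= 3:     # CL1-CL3: less recent use
        return 1         # infrequent use
    else:                # CL4-CL6: recent use
        return 2         # high usage

for col in ["stim_final", "dep_final", "hallu_final"]:
    df[col] = df[col].apply(collapse_classes)
```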

The final distribution for the three new outcome variables is in the figure below. At first glance, the data was heavily imbalanced, especially for stimulants and depressants. I noted that oversampling would be needed to rebalance the classes and might have a significant impact on the final models.

Distribution of the Three New Outcomes

Feature Selection

Feature selection involved correlation analysis and bivariate regressions, pairing each drug outcome with one independent variable at a time.

Correlation Plots (Post Data Processing)

Having cleaned the data, I re-ran the Pearson correlations to re-check the drug outcomes (stim_final, dep_final, hallu_final) against the independent variables, to look preliminarily for relationships. In the following three figures, it is apparent that the correlations between each drug outcome category and the independent variables are weak to none (close to zero). This is not unexpected, as the distributions of the variables are highly imbalanced. It indicated to me that I should potentially keep all the variables, as there was no immediately distinguishing feature (see figures below).

Stimulants Correlation Matrix
Depressant Correlation Matrix
Hallucinogen Correlation Matrix

Bivariate Regressions

A second approach I used to understand which variables would be important predictors of the drug outcomes was to calibrate and run bivariate regressions, one independent variable at a time against one drug outcome, and check whether that variable was a significant predictor at a threshold of p < .05. Multinomial logistic regressions were used as they can account for the multi-class (three-class) nature of the drug outcome variables. Using statsmodels’ summary() method, the coefficients, p-values, and log-likelihood values were retrieved. The figure below shows an example using age as a predictor of depressants (dep_final). For the rest of the bivariate model summaries, please see the GitHub.

The general findings led me to include all twelve of the independent variables, as each one showed a statistically significant relationship with at least one of the drug outcomes. Retaining all twelve features also provided the opportunity to perform multivariate analysis with all of them accounted for in a single model.

Bivariate Regressions Example: Depressants with Age as predictor
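For illustration, one such bivariate model could be fit with statsmodels roughly as follows (a sketch; the column names are the assumed ones used above):

```python
# Bivariate multinomial logit: depressant outcome regressed on age (sketch).
import statsmodels.api as sm

X_age = sm.add_constant(df[["Age"]])           # single predictor plus intercept
y_dep = df["dep_final"]                        # three-class outcome (0, 1, 2)

model = sm.MNLogit(y_dep, X_age).fit(disp=False)
print(model.summary())                         # coefficients, p-values, log-likelihood
```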

Implementation

Initial modelling

For the modelling process, I initially calibrated the following models: SVM, logistic regression, k-NN classifier, decision tree, gradient boosted tree, random forest, linear discriminant analysis, and neural network. These were then assessed for accuracy before I narrowed the focus to the following four (a comparison sketch follows the list):

1) Support Vector Machine (SVM)

2) Logistic Regression (Multinomial Logit)

3) Random Forest

4) Neural Network (MLP Classifier)
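Below is a minimal sketch of how the four retained models could be compared, assuming the feature columns named earlier and one of the collapsed outcomes (stim_final); the specific settings are illustrative, not the author’s.

```python
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier

feature_cols = ["Age", "Gender", "Education", "Country", "Ethnicity",
                "Nscore", "Escore", "Oscore", "Ascore", "Cscore",
                "Impulsive", "SS"]             # assumed column names
X, y = df[feature_cols], df["stim_final"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

models = {
    "SVM": SVC(probability=True, random_state=42),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=42),
    "Neural Network": MLPClassifier(max_iter=1000, random_state=42),
}

for name, clf in models.items():
    clf.fit(X_train, y_train)
    print(name, clf.score(X_test, y_test))     # accuracy on the held-out split
```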

Rebalancing and Re-calibrating Models

Once those four models were determined, I rebalanced the data using SMOTE to oversample the training data, as the classes in each of the drug outcomes were highly imbalanced; this imbalance was leading to overfitting in the initial models. The models were then re-run with the rebalanced classes, leading to much better estimates.
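The rebalancing step might look like this sketch, assuming the imbalanced-learn package and the training split created above:

```python
from imblearn.over_sampling import SMOTE

# Oversample only the training data so the test set stays untouched.
smote = SMOTE(random_state=42)
X_train_res, y_train_res = smote.fit_resample(X_train, y_train)

# Models are then re-fit on the rebalanced training set, e.g.:
models["Random Forest"].fit(X_train_res, y_train_res)
```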

Applying a Binarized Approach — OneVsRest

To try to further improve modelling accuracy, and also to enable the use of the AUC metric to assess the models, a binarized approach with OneVsRestClassifier() was applied to convert the multi-class problem into binary problems in which one class is compared against all the other classes treated as a single class. Once these models were created, I was able to retrieve AUC, F1, accuracy, precision, and recall scores to compare the models.
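A sketch of the binarized approach with scikit-learn’s OneVsRestClassifier, again with illustrative settings:

```python
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000))
ovr.fit(X_train_res, y_train_res)

y_pred = ovr.predict(X_test)
y_score = ovr.predict_proba(X_test)            # per-class probability scores

print("Accuracy :", accuracy_score(y_test, y_pred))
print("Macro F1 :", f1_score(y_test, y_pred, average="macro"))
print("Macro AUC:", roc_auc_score(y_test, y_score, multi_class="ovr", average="macro"))
```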

Cross-Validation

To ensure that the results of the models are non-trivial and not due to a particular random split of the data, k-fold cross-validation was carried out for each model: the data was split into k folds and the model was recalibrated and tested on a different held-out fold each time. The cross_validate() method from scikit-learn was used to carry out 5- and 10-fold cross-validation to ensure the testing metrics remained relatively stable.
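For example, the cross-validation might be run along these lines (a sketch; the scoring metrics shown are assumptions):

```python
from sklearn.model_selection import cross_validate

for k in (5, 10):
    cv_results = cross_validate(ovr, X, y, cv=k, scoring=["accuracy", "f1_macro"])
    print(f"{k}-fold accuracy: {cv_results['test_accuracy'].mean():.3f}, "
          f"macro F1: {cv_results['test_f1_macro'].mean():.3f}")
```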

Fine-Tuning Parameters

The final step was to optimize the models by tuning their hyperparameters to produce the best scores possible. An exhaustive search of parameter combinations was carried out with GridSearchCV in scikit-learn.

The parameter grids searched for each model are shown below:

Parameter Grid for SVM
Parameter Grid for Logistic Regression
Parameter Grid for Random Forest
Parameter Grid for Neural Network
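As a sketch of how the grid search could be wired up, the snippet below uses an illustrative SVM grid only; the actual grids used are the ones shown in the figures above.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {                 # illustrative example values, not the author's grid
    "C": [0.1, 1, 10],
    "kernel": ["linear", "rbf"],
    "gamma": ["scale", "auto"],
}

search = GridSearchCV(SVC(), param_grid, cv=5, scoring="f1_macro", n_jobs=-1)
search.fit(X_train_res, y_train_res)

print(search.best_params_, search.best_score_)
```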

To jump back to Part 1, use this link here. Part 3 can be found here.

The code for this project can be found here.
