Using Machine Learning to Predict Alcohol and Drug Use with Personality Traits and Socio-Demographic Characteristics (Part 1)

Andrew Sik-On Leung
Jun 1, 2023 · 8 min read


In this three-part series, we will look at a project where two of my interests intersect: health outcomes and machine learning. Using drug use data retrieved from the UCI Machine Learning Repository, I applied machine learning techniques to predict drug use.

The overall goal for my Udacity Data Scientist Nanodegree Capstone project was to explore if machine learning models could help better predict drug use from personality and socio-demographic variables.

Part 1 provides a brief literature review on drug use, personality, and previous methods used to study the associations between the two, followed by the study methodology, an explanation of the machine learning methods used, and the chosen metrics.

Part 2 can be found here. Part 3 can be found here.

Introduction/Problem Statement

After reviewing the literature on personality traits, the association between personality and outcomes, the association between personality and drug/alcohol consumption, as well as the relationship between socio-demographics and drug/alcohol use, there are several conclusions that we can draw.

Firstly, the Five-Factor Model (FFM) of personality is a robust and well-validated model for operationalizing and quantifying the personality of an individual¹. As a result, the personality dimensions of Neuroticism (N), Extraversion (E), Openness to Experience (O), Agreeableness (A), and Conscientiousness (C)² can be applied as predictive factors to explore a quantitative relationship with drug and alcohol consumption. This makes the FFM dimensions amenable to machine learning methods, as they can be incorporated as input features.

Many studies have shown positive associations between personality traits and various outcomes³. Of particular importance, strong associations were found directly between personality traits and drug and alcohol consumption⁴, as well as indirectly through other related health outcomes such as well-being, mental health, and risky health behaviours⁵. This body of literature indicates the importance of including measures of personality when predicting drug and alcohol use in individuals.

Secondly, socio-demographics also play a vital role in determining drug and alcohol use⁶. These characteristics serve as important proxies of the social, economic, geographical and environmental influences that are interacting together to affect an individual⁷. Therefore, any analysis of drug and alcohol use needs to incorporate measures of socio-demographic characteristics along with personality to build a more comprehensive understanding.

Finally, there are also a few research gaps that can be identified from this literature review. Much of the previous work examining personality traits and outcomes uses correlational analysis to provide evidence of statistical association. More rigorous statistical modelling can be applied to help better understand the association and causation between personality traits and drug and alcohol use.

Additionally, the focus of the work has been on finding statistical associations between personality traits and drug and alcohol use, instead of predictive modelling. While associations are extremely important, we can build on them and use predictive modeling to derive novel insights with clinical applications.

The types of statistical modelling applied are also quite limited in most studies of drug use, with models primarily restricted to linear and logistic regressions. Applying a broader set of machine learning algorithms provides more options for classification and allows a more thorough comparison between methods to find the best classifier.

Problem Statement

To help address the knowledge gap, I determined that a project using machine learning methods could be implemented to help better predict alcohol and drug use from personality traits and socio-demographic variables.

The next step was to understand what data was available and what questions I could potentially help answer. In examining the drug use survey dataset that I obtained, I saw that the overarching theme was the relationship between personality traits and drug use. There were also a number of socio-demographic variables available for each individual, which meant we could also analyze socio-demographics as a driver of drug and alcohol use.

With these guiding themes in mind, I underwent an iterative process of exploratory analysis and data cleaning (pre-processing) in the next step of the project to help develop and refine the research questions. The final research questions were as follows:

Research Question One:

What are the personality traits and demographic variables that best predict each drug use outcome?

Research Question Two:

Can we determine which machine learning approach is the most effective for predicting consumption?

Creating a Study Methodology

Having settled on the two research questions above, my next objective was to ensure an effective project, one that could actually be carried out. To understand the full scope of my study and to ensure a systematic process, I created a study methodology/plan to guide my work. This overall view of the study also helped me determine the machine learning methods I wanted to test and settle on appropriate metrics.

Study Plan/Methodology

My study plan consisted of the following:

A flowchart documentation of my study methodology.

Determining the Methods

For this project, the following machine learning methods were used:

  • Support Vector Machines
  • Logistic Regression and Multinomial Logits
  • k-Nearest Neighbours
  • Decision Tree/Random Forest/Gradient Boosting Tree
  • Neural Networks

The following sections provide a summary overview of these techniques.

Support Vector Machine

Support Vector Machines are a supervised learning method that can be used for both classification and regression⁸. Knowing the labels for the data, the algorithm tries to find an optimal decision boundary, known as a hyperplane, in n-dimensional space (n is the number of input features used) that correctly classifies the data points into the given output labels. The hyperplane selected is the one that separates the positive and negative classes by the greatest margin, to allow for greater generalization. In the basic case the hyperplane is linear, so the decision boundary is a straight line (or flat plane); SVMs, however, can be adapted to multiclass problems and non-linear boundaries⁹.
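As a minimal sketch (not part of the original study code), the maximum-margin idea can be illustrated with scikit-learn's `SVC` on toy 2-D data; swapping in `kernel="rbf"` would give a non-linear boundary:

```python
from sklearn.svm import SVC

# Toy 2-D data: two linearly separable classes
X = [[0, 0], [0, 1], [1, 0], [3, 3], [3, 4], [4, 3]]
y = [0, 0, 0, 1, 1, 1]

# A linear kernel finds the maximum-margin hyperplane between the classes
clf = SVC(kernel="linear")
clf.fit(X, y)

# Two unseen points, one near each cluster
print(clf.predict([[0.5, 0.5], [3.5, 3.5]]))  # → [0 1]
```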

Logistic Regression and Multinomial Logits

Logistic Regression is a classification model that estimates probabilities for binary outcomes (two classes)⁹. In logistic regression, a linear combination of the inputs, the log-odds, is squeezed by the standard logistic (sigmoid) function into the interval between 0 and 1. Negative values of the log-odds map to probabilities below 0.5 (class “0”) and positive values map to probabilities above 0.5 (class “1”).

We can map the log-odds z back to a probability with the following equation¹⁰:

p = 1 / (1 + e⁻ᶻ)

A threshold probability is set (e.g., 0.5), and whenever p > 0.5 the sample is assigned the positive class (“1”). Multinomial Logistic Regression extends this model to produce probabilities for multiple classes (more than 2).
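A quick sketch of this mapping (toy log-odds values, not from the study data) shows how the sign of the log-odds determines the class:

```python
import math

def sigmoid(z):
    # Maps the log-odds z to a probability in (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

# Negative log-odds -> p < 0.5 -> class 0; positive -> p > 0.5 -> class 1
for z in (-2.0, 0.0, 2.0):
    p = sigmoid(z)
    label = 1 if p > 0.5 else 0
    print(f"z={z:+.1f}  p={p:.3f}  class={label}")
```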

k-Nearest Neighbors

The k-Nearest Neighbors algorithm is a non-parametric supervised learning algorithm that can be used for classification or regression. When the algorithm sees a new sample x that does not have a label, it finds the k training examples that are closest to x according to distances over the n input features. A distance metric, such as Manhattan distance or cosine similarity, is calculated between x and every data point in the training set. The k data points with the smallest distances are deemed the closest to x, and the majority label among those k data points is given to x⁹.
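A minimal sketch with scikit-learn's `KNeighborsClassifier` (toy 1-D data, not the study dataset) makes the majority-vote step concrete:

```python
from sklearn.neighbors import KNeighborsClassifier

# Toy data: small values labelled "low", large values "high"
X = [[1], [2], [3], [10], [11], [12]]
y = ["low", "low", "low", "high", "high", "high"]

# k=3: a new sample takes the majority label of its 3 nearest neighbours,
# here measured with Manhattan distance
knn = KNeighborsClassifier(n_neighbors=3, metric="manhattan")
knn.fit(X, y)

print(knn.predict([[2.5], [11.5]]))  # → ['low' 'high']
```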

Decision Tree/Random Forest/Gradient Boosting Tree

A decision tree is a non-parametric model that builds an acyclic graph⁹. The algorithm chooses a rule to split the data on (branching from nodes): when a value is above a threshold it follows one side of the branch, otherwise it follows the other. When no more splits can be made, a leaf node is reached and a decision is made about which class to assign the data point. To determine whether a split is good, entropy is calculated: high entropy means all values of a variable are roughly equally probable, while low entropy means only one value is likely.

A random forest extends this concept by generating multiple trees, randomly selecting a new subset of features at each split. The outputs are combined at the end (e.g., through majority vote) to get a final classification. This avoids correlated trees, which would decrease predictive accuracy, and reduces the variance of the final model to minimize the chance of overfitting.

Another extension of the decision tree is the gradient boosting tree, where multiple trees are built sequentially, each depending on the previous one: the residuals of the last tree are calculated and added back in as new labels. This modified training set is then used to produce the next tree, which will have even smaller errors (i.e., smaller residuals)⁹.
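Both ensemble variants are available in scikit-learn; a minimal sketch on a toy one-feature dataset (not the study data) shows the two classifiers side by side:

```python
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

# Toy data: label is 1 when the feature exceeds 5
X = [[x] for x in range(10)]
y = [1 if x > 5 else 0 for x in range(10)]

# Random forest: many decorrelated trees combined by majority vote
rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Gradient boosting: trees built sequentially on the previous tree's residuals
gb = GradientBoostingClassifier(n_estimators=50, random_state=0).fit(X, y)

print(rf.predict([[2], [8]]), gb.predict([[2], [8]]))  # → [0 1] [0 1]
```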

Neural Networks

A neural network consists of a series of nested functions called layers⁹. Each layer can have multiple units, and each unit applies an activation function (e.g., the logistic function) chosen by the analyst. The last layer is the output layer, which combines the signals from the previous layer into the output value(s). To get from the inputs (x) to the first layer, a different weight is applied to each input before it is fed into every unit of that layer. Each unit's activation output is then passed on to the units in the next layer, and this continues until the output layer produces a final regression value or class prediction. The network then compares its output with the expected output and uses the error, in a process called backpropagation, to adjust the weights of each layer and produce a better overall prediction.
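As a minimal sketch (toy data and hyperparameters chosen for illustration, not those of the study), scikit-learn's `MLPClassifier` wraps this whole train-by-backpropagation loop:

```python
from sklearn.neural_network import MLPClassifier

# Toy 1-D data: two well-separated classes
X = [[0.0], [0.1], [0.2], [0.9], [1.0], [1.1]]
y = [0, 0, 0, 1, 1, 1]

# One hidden layer of 8 units; weights are adjusted from the prediction
# error on each pass (lbfgs converges quickly on tiny datasets)
mlp = MLPClassifier(hidden_layer_sizes=(8,), solver="lbfgs",
                    max_iter=1000, random_state=1)
mlp.fit(X, y)

print(mlp.predict([[0.05], [1.05]]))
```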

Metrics

The next step after determining the methods was to decide on the appropriate metrics. Since the outcome variables in the drug and alcohol dataset were categorical and multiclass, the overall goal was to maximize the accuracy of the predictions while balancing precision and recall. Precision and recall, along with the F1 score and the average precision score, were the measures selected to assess model performance.
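For a multiclass outcome these metrics are averaged across classes; a small sketch with made-up predictions (macro averaging weights each class equally, which matters when some consumption classes are rare) shows the calculation:

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

# Hypothetical true vs. predicted labels for a 3-class outcome
y_true = [0, 1, 2, 2, 1, 0]
y_pred = [0, 1, 2, 1, 1, 0]

# Macro averaging: compute the metric per class, then take the mean
print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred, average="macro"))
print("recall   :", recall_score(y_true, y_pred, average="macro"))
print("f1       :", f1_score(y_true, y_pred, average="macro"))
```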

To document and examine the final metric scores and processing times, I prepared the following table:

Evaluation Metrics and Timings Results

To continue to Part 2, use this link here. Part 3 can be found here.

The code for this project can be found here.

[1]: Digman JM. Personality structure: Emergence of the five-factor model. Annual review of psychology. 1990 Feb;41(1):417–40.

[2]: Terracciano A, McCrae RR. Cross-cultural studies of personality traits and their relevance to psychiatry. Epidemiology and Psychiatric Sciences. 2006 Sep;15(3):176–84.

[3]: Costa PT, McCrae RR. Influence of extraversion and neuroticism on subjective well-being: happy and unhappy people. Journal of personality and social psychology. 1980 Apr;38(4):668.

[4]: Terracciano A, Löckenhoff CE, Crum RM, Bienvenu OJ, Costa PT. Five-Factor Model personality profiles of drug users. BMC psychiatry. 2008 Dec;8(1):1–0.

[5]: Trobst KK, Herbst JH, Masters III HL, Costa Jr PT. Personality pathways to unsafe sex: Personality, condom use, and HIV risk behaviors. Journal of Research in personality. 2002 Apr 1;36(2):117–33.

[6]: Sharpe DL, Abdel-Ghany M, Kim HY, Hong GS. Alcohol consumption decisions in Korea. Journal of Family and Economic Issues. 2001 Mar;22:7–24.

[7]: Leung A, Law J, Cooke M, Leatherdale S. Exploring and visualizing the small-area-level socioeconomic factors, alcohol availability and built environment influences of alcohol expenditure for the City of Toronto: a spatial analysis approach. Chronic Diseases and Injuries in Canada. 2019;39(1).

[8]: Shmilovici A. Support vector machines. Data mining and knowledge discovery handbook. 2010:231–47.

[9]: Burkov A. The hundred-page machine learning book. Quebec City, QC, Canada: Andriy Burkov; 2019 Apr.

[10]: Kleinbaum DG, Dietz K, Gail M, Klein M, Klein M. Logistic regression. New York: Springer-Verlag; 2002 Aug.
