Data Science Final Project: Myers-Briggs Prediction

Team 16: Bianca Antonio, Chang Park, Chethan Valleru, Connor Byron, Gopika Ajaykumar, Vivian Tan

Bianca Antonio
8 min read · May 5, 2018
Personality Types Key for Identifying your Myers-Briggs Personality

Overview

The Myers-Briggs Type Indicator (or MBTI for short) is a personality type system that divides everyone into 16 distinct personality types across 4 axes:

  • Introversion (I) — Extroversion (E)
  • Intuition (N) — Sensing (S)
  • Thinking (T) — Feeling (F)
  • Judging (J) — Perceiving (P)

The purpose of this project is to investigate whether any patterns connecting a person’s specific personality type to their style of writing can be detected, and to explore the validity of the test in analyzing, predicting, or categorizing behavior.

The dataset contains ~8600 observations (people), where each observation gives a person’s:

  • Myers-Briggs personality type (as a 4-letter code)
  • An excerpt containing their last 50 posts on the PersonalityCafe forum (each entry separated by “|||”)

The overall goal is to predict a person’s personality type from text they have written, and also to evaluate the validity of the MBTI test by checking the converse: whether MBTI types can predict language style and behavior.

(A more detailed description can be found here: https://www.kaggle.com/datasnaek/mbti-type)

Method Summary

In order to make our classifications, we will likely try using an XGBoost classifier (since XGBoost tends to do well in many Kaggle competitions, even when the data has had little preprocessing). We may also try using other classification methods that we learned about in class, such as Logistic Regression, SVM, or Random Forests (or a neural network with a softmax output layer). Since we are given text data, we will also spend a lot of time on feature extraction using tools like BeautifulSoup and existing sentiment analysis algorithms.

(We will go in-depth about these methods later in this blog post).

Data

The idea of the project is to use these posts to predict the personality type of a user. As mentioned above, the dataset from Kaggle comes with two columns: the Myers-Briggs type of a user and that user’s last 50 posts stored as a single string. Here is an abbreviated example of the first 5 rows:

The Kaggle data set

Class Imbalance

Our dataset was fairly imbalanced for the first two categories (I vs. E and N vs. S).

Data imbalance in the personality types of the dataset

Imbalance Correction

To correct for the data imbalance in each of the subclasses, we tried various tactics such as resampling, stratified cross-validation, F1 scoring, and AUC scoring.

Resampling

For resampling we tried undersampling, oversampling, and the Synthetic Minority Over-sampling Technique (SMOTE).

First, we tried undersampling the dataset to reach a 50:50 ratio of 0s and 1s in the target Y-variable. This involved scaling the majority class down to the size of the minority class by randomly selecting rows to keep from the majority class. However, after evaluating models on the new, smaller training set, it was clear that this method would not be effective, likely because the reduction in available data made our resampled dataset too small.

Next, we tried oversampling the dataset. Oversampling before the cross-validation split resulted in extreme overfitting of most of our models, because the validation sets usually included massive amounts of duplicated data. To correct for this, we attempted to oversample after each validation set was generated. This proved difficult and lengthy to implement, but gave better results once implemented. For both undersampling and oversampling we also tried gentler ratios such as 2:1 and 4:1; a ratio of 2:1 showed improvement over the other ratios.

Lastly, we tried generating synthetic samples with SMOTE to correct for the data imbalance. Upon evaluation, this method seemed to improve results by a small margin.
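As a rough illustration, here is a minimal sketch of the kind of resampling we experimented with, using the imbalanced-learn package. The X_train/y_train names and the exact ratio are placeholders, not our exact code:

```python
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler, SMOTE

# X_train: engineered numeric features, y_train: 0/1 labels for one axis (e.g. I vs. E)
# sampling_strategy=0.5 asks for a 2:1 majority:minority ratio after resampling.
under = RandomUnderSampler(sampling_strategy=0.5, random_state=42)
X_under, y_under = under.fit_resample(X_train, y_train)

over = RandomOverSampler(sampling_strategy=0.5, random_state=42)
X_over, y_over = over.fit_resample(X_train, y_train)

smote = SMOTE(sampling_strategy=0.5, random_state=42)
X_smote, y_smote = smote.fit_resample(X_train, y_train)
```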

Stratified K-Fold Cross Validation

To further correct for class imbalances, we tuned our hyperparameters using GridSearchCV (from the Scikit-Learn package). This generates stratified K-folds (we used 5 folds), which preserve the ratio of class values in each fold. Stratified folds handle imbalanced data well because each fold is guaranteed to contain both classes in the same proportion as the full training set.
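A small sketch of what the stratification guarantees, assuming X is the engineered feature matrix and y a NumPy array of 0/1 labels for one axis:

```python
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(skf.split(X, y)):
    # Each fold keeps roughly the same positive-class ratio as the full dataset,
    # so the minority class is never missing from a validation fold.
    print(fold, y[val_idx].mean())

# GridSearchCV uses stratified folds automatically for classifiers when cv=5,
# or the skf object above can be passed explicitly via cv=skf.
```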

Scoring Methods

For imbalanced classes such as E vs. I and N vs. S, we tested both F1 and AUC scoring instead of accuracy. This is because, when scored on accuracy, our models would simply pick the most common class. F1 scoring focuses on precision and recall, and AUC measures how well the model ranks positive examples above negative ones, both of which better reflect performance on the minority class.
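A toy example of the two metrics (the arrays here are made up purely for illustration):

```python
from sklearn.metrics import f1_score, roc_auc_score

y_true = [0, 0, 0, 0, 1, 1]               # imbalanced axis labels
y_pred = [0, 0, 0, 0, 1, 0]               # hard class predictions
y_prob = [0.1, 0.2, 0.3, 0.4, 0.9, 0.45]  # predicted probability of class 1

print('F1 :', f1_score(y_true, y_pred))       # sensitive to minority-class precision/recall
print('AUC:', roc_auc_score(y_true, y_prob))  # threshold-free ranking quality
```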

Feature Engineering

Since the original dataset only came with 2 columns, the type and the 50 posts for each person, we decided to create additional features. We ended up adding 16 additional features:

We’ll go into more detail below about how we made some of these features.

Counting Occurrences

We simply counted the average number of words, punctuation, etc. for the following features:

  • Words per comment
  • Variance of words per comment
  • Question marks per comment
  • Exclamation marks per comment
  • Ellipsis per comment
  • Links per comment
  • Images per comment

Here’s a sample code snippet of how we did this:

df_train['ellipsis_per_comment'] = df_train['posts'].apply(lambda x: x.count('...')/50.0)
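The other count features were built in much the same way. The snippet below is a reconstruction rather than our exact code; the link regex and the intermediate word_counts variable are our own shorthand:

```python
import re
import pandas as pd

# Per-comment word counts, then their mean and variance
word_counts = df_train['posts'].apply(lambda x: [len(p.split()) for p in x.split('|||')])
df_train['words_per_comment'] = word_counts.apply(lambda c: sum(c) / len(c))
df_train['variance_of_word_counts'] = word_counts.apply(lambda c: pd.Series(c).var())

# Punctuation and link counts, averaged over the 50 posts
df_train['question_marks_per_comment'] = df_train['posts'].apply(lambda x: x.count('?') / 50.0)
df_train['exclamation_marks_per_comment'] = df_train['posts'].apply(lambda x: x.count('!') / 50.0)
df_train['links_per_comment'] = df_train['posts'].apply(
    lambda x: len(re.findall(r'https?://', x)) / 50.0)
```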

Sentiment Analysis

Sentiment determines the ‘positivity’ or ‘negativity’ of a piece of writing. We used the TextBlob library to process the comments in our dataset and determine the sentiment for each row. Sentiment is rated from -1.0 to 1.0, where -1.0 is the most negative and 1.0 is most positive. We found later that this feature is one of the most important for distinguishing between types across the 4 MBTI axes.
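A minimal sketch of how this feature can be computed with TextBlob, assuming df_train['posts'] holds the “|||”-separated post string as above:

```python
from textblob import TextBlob

# Sentiment polarity of a user's posts, from -1.0 (most negative) to 1.0 (most positive)
df_train['sentiment'] = df_train['posts'].apply(
    lambda x: TextBlob(x.replace('|||', ' ')).sentiment.polarity
)
```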

Part of Speech Tags

Another group of features we added was the average number of occurrences of different parts of speech in each entry of text posts. A part of speech is a category of words that share similar grammatical properties. Below is a table of some common part-of-speech tags:

We chose a subset of these tags to add as features:

  • Nouns per comment
  • Verbs per comment
  • Adjectives per comment
  • Prepositions per comment
  • Interjections per comment
  • Determiners per comment

We first tried using the spaCy library to tag the comments in our dataset, but it ended up being too slow, so we switched to the NLTK library instead. We also used regular expressions to group the multiple tags (VBD, VBG, VBN, etc.) that we wanted to count toward a single feature.
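Here is a sketch of the NLTK-based tagging for two of the features; the helper function and regex patterns are a reconstruction, not our exact implementation:

```python
import re
import nltk
import pandas as pd

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

def pos_counts(posts):
    """Average nouns and verbs per comment for one user's 50 posts."""
    tokens = nltk.word_tokenize(posts.replace('|||', ' '))
    tags = [tag for _, tag in nltk.pos_tag(tokens)]
    nouns = sum(1 for t in tags if re.match(r'NN', t))  # NN, NNS, NNP, NNPS
    verbs = sum(1 for t in tags if re.match(r'VB', t))  # VB, VBD, VBG, VBN, VBP, VBZ
    return pd.Series({'nouns_per_comment': nouns / 50.0,
                      'verbs_per_comment': verbs / 50.0})

df_train[['nouns_per_comment', 'verbs_per_comment']] = df_train['posts'].apply(pos_counts)
```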

Data Exploration

After we added our features, we did some data exploration to see how the raw data looks and to see how important our features were for distinguishing types across the 4 MBTI axes. Below is a plot further showing the type imbalances in our data.

We used XGBoost to gauge how important each of the features we created was:
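A sketch of how the importances can be pulled from a fitted XGBoost model (X and y are assumed to be the engineered features and one axis’s labels):

```python
import xgboost as xgb
import matplotlib.pyplot as plt

# X: the engineered feature columns; y: the binary label for one axis (e.g. I vs. E)
model = xgb.XGBClassifier(n_estimators=200, max_depth=3)
model.fit(X, y)

xgb.plot_importance(model)  # bar chart of how often each feature is used in tree splits
plt.show()
```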

Below is a collection of graphs depicting our feature distributions. Although some of the feature distributions are very skewed, there is clearly a strong Gaussian shape for some of the features, such as sentiment and variance_of_word_counts. Unsurprisingly, these two features also happened to have the greatest impact on our models.

Model

Baseline Accuracy

For the baseline, each axis is assigned the most common classification for that category (E vs. I, S vs. N, F vs. T, J vs. P).
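A minimal sketch of such a majority-class baseline using Scikit-Learn’s DummyClassifier (variable names assumed as before):

```python
from sklearn.dummy import DummyClassifier

# Majority-class baseline for one axis; X_train, y_train, X_val, y_val assumed as before
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_train, y_train)
print('Baseline accuracy:', baseline.score(X_val, y_val))
```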

Support Vector Machine (SVM)

This model allowed us to create hyperplanes that divide our data into different subspaces. Each subspace is separated by the largest margin possible (within an error tolerance). The hyperplanes represent the classification boundaries. Here is a 2-dimensional example of a single hyperplane separating a dataset using SVM.

To implement this model, we used the SVC class from the Scikit-Learn package. We tuned the model hyperparameters using GridSearchCV (also from Scikit-Learn) with an AUC scoring function. After tuning a model for each of the four categories (E_i, S_n, F_t, P_j), we obtained the following scores, respectively:

The first 3 categories showed little success: low F1 and AUC scores show that our models need improvement at detecting the minority classes. The fourth set of scores shows vast improvement over our previous tests, achieving high scores across the board. We are still investigating the reason for this success, but overall our engineered features seem to be excellent predictors for the Judging vs. Perceiving category.
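For reference, here is a minimal sketch of the SVC + GridSearchCV setup described above; the parameter grid shown is a placeholder, not the grid we actually searched:

```python
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# One model per axis; y_train is the binary label for a single axis (e.g. E vs. I)
grid = GridSearchCV(
    SVC(),
    param_grid={'C': [0.1, 1, 10], 'gamma': ['scale', 0.01, 0.1]},  # placeholder grid
    scoring='roc_auc',  # AUC scoring, as described above
    cv=5,               # stratified folds are used automatically for classifiers
)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)
```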

Neural Network

In order to detect more complex, non-linear boundaries between classes, we tried out a simple neural network for classification. The network contained two hidden layers with rectified linear unit (ReLU) activations, and we included two dropout layers that randomly set half of their inputs to 0 to prevent overfitting. The output layer consisted of a single sigmoid unit. The loss function was binary cross-entropy, with an rmsprop optimizer, and the network was trained for 1000 epochs with a batch size of 128. Even with this relatively simple configuration, the neural network performed better than most of the other models we tried. The AUC/ROC scores for the neural networks are listed below.
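A sketch of this architecture in Keras; the hidden-layer widths (64) and n_features are placeholders, since the exact sizes are not stated above:

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Dense(64, activation='relu', input_shape=(n_features,)),  # n_features: engineered feature count
    layers.Dropout(0.5),
    layers.Dense(64, activation='relu'),
    layers.Dropout(0.5),
    layers.Dense(1, activation='sigmoid'),  # one binary axis at a time
])
model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=[keras.metrics.AUC()])
model.fit(X_train, y_train, epochs=1000, batch_size=128)
```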

Logistic Regression

Another model we used was logistic regression, since it is intended for binary (two-class) classification problems. These are the results from the 10-fold CV and AUC/ROC:

Logistic regression came closest to the baseline accuracy under 10-fold CV, but its AUC/ROC scores were poor overall.
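A minimal sketch of the corresponding evaluation (X and y assumed as before):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# 10-fold CV accuracy and ROC/AUC for one axis
logreg = LogisticRegression(max_iter=1000)
print('10-fold accuracy:', cross_val_score(logreg, X, y, cv=10).mean())
print('10-fold ROC/AUC :', cross_val_score(logreg, X, y, cv=10, scoring='roc_auc').mean())
```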

Random Forest

We thought random forest would be a good model to try on this dataset since random forests tend to do well on skewed datasets. We trained a random forest for each of the 4 axes, tuned with grid search. Below are the best results for each type:
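A sketch of the grid-searched random forest for a single axis (the parameter grid is a placeholder):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Grid-searched random forest for one axis; the same search is repeated for all 4 axes
rf_grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={'n_estimators': [100, 300], 'max_depth': [None, 10, 20]},  # placeholder grid
    scoring='roc_auc',
    cv=5,
)
rf_grid.fit(X_train, y_train)
print(rf_grid.best_params_, rf_grid.best_score_)
```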

Results
