Personality Prediction using Myers Briggs Type Indicator

Authors: Vidhi Sharma, Dolly Sidar, Udit Bhati

Dolly Sidar
Jan 8, 2022

Motivation

Personality is one of the most heavily researched and fascinating topics in psychology. Personality recognition is widely used in research domains such as recommendation systems and human-robot interaction. Traditional recommendation systems run into problems such as a lack of data about user preferences, the free-rider problem, and data sparsity; identified personality traits help fill that gap by indicating what a user is likely to prefer. The MBTI questionnaire itself is not perfectly repeatable: its test-retest error rate is reported to hover around 0.5, and on retest people come out with three or four of their four type preferences unchanged 75–90% of the time. Our methodology aims to give more consistent results than existing tests, so that users can rely on their outcomes. Personality classification based on digital data has also been reported to be a more efficient alternative to traditional psychological tests.

Source code [GitHub Link], Colab Notebook [Link]

Introduction

With over 3.5 million assessments conducted each year, the Myers Briggs Type Indicator (MBTI) is the most widely used personality indicator globally. It is a personality type system that sorts people into 16 distinct personality types based on four dimensions: Introversion (I) vs. Extraversion (E), Intuition (N) vs. Sensing (S), Thinking (T) vs. Feeling (F), and Judging (J) vs. Perceiving (P). MBTI-predicted personality traits retain essential properties of the traditional personality characteristics. Researchers widely use machine learning and deep learning algorithms to predict personality and psychological traits from digital records.

We developed an MBTI personality classifier that uses machine learning models to predict a person’s personality type, taking that person’s 50 most recent social media posts as input. We look for correlations between a person’s MBTI personality type and their writing style. The classifier’s results also lend some support to the validity of the MBTI test. We used a reasonably large set of mined, personality-annotated social media data. Furthermore, our model works with more text per person than a conventional personality test collects, so it can serve as a confirmation system and help people rely more on their results.

State of the Art

1) Personality Analysis using Social Media [Link]

The authors used Twitter posts for personality analysis based on MBTI. The preprocessing steps include removing hyperlinks, special characters, and stopwords, converting emojis to text, lemmatization, and stemming. Sampling methods and natural language processing (NLP) techniques such as N-grams, TF-IDF, Word2Vec, and GloVe word embeddings were used for feature selection. They trained K-Nearest Neighbour (KNN), Naive Bayes, and Logistic Regression models. The results of these two models were found to be slightly more accurate than the SVM model. Logistic regression performed the best using their methodology.

2) Inferred vs. traditional personality assessment: are we predicting the same thing? [Link]

The paper reviews 220 research papers and articles to check whether predicted personality estimates retain the characteristics of the original traits. The authors state that an automatic assessment should predict traits that are consistent over time (that is, traits that predict future behavior). They found that many predicted personality traits do not retain the characteristics of traditional personality traits. Most of the research papers reviewed used a Big-5 dataset, where personality traits are distributed in a five-dimensional space. The authors found that for most studies, the correlation between predicted and reported personality was below 0.5. More work on analyzing personality prediction using psychometric validation instruments is required.

Dataset Description and Analysis

Link: (MBTI) Myers-Briggs Personality Type Dataset | Kaggle

The dataset has 8675 rows and 2 columns, namely ‘type’ and ‘posts’. Each entry in column ‘posts’ contains the 50 most recent social media posts for one user. Column ‘type’ has 16 unique labels and no NULL values; each label is one of the 16 MBTI personality types.

The pie chart of the number of posts per personality type shows that the dataset is unbalanced. The distribution plot of post length shows that some entries have fewer than 2000 words, while others have 4500–9000 words.

The MBTI classifier has four main dimensions, namely ‘Introversion-Extraversion’ (IE), ‘Intuition-Sensing’ (NS), ‘Thinking-Feeling’ (TF), ‘Judging-Perceiving’ (JP). Four more columns are added to the dataset. In each column, ‘1’ represents the first part of each dimension (I, N, T, J), whereas ‘0’ represents the second part of each dimension (E, S, F, P), respectively.
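The four binary columns described above can be derived directly from the four-letter type string. The following is a minimal sketch using pandas; the tiny DataFrame is a hypothetical stand-in for the Kaggle data, and the column names IE, NS, TF, and JP are assumed to match the abbreviations used in this post.

```python
import pandas as pd

# Hypothetical miniature of the dataset; the real one has 8675 rows.
df = pd.DataFrame({
    "type": ["INFJ", "ENTP", "ISTP", "ESFJ"],
    "posts": ["post text a", "post text b", "post text c", "post text d"],
})

# For each dimension: position of its letter in the type string,
# and the letter that is encoded as 1 (the other letter becomes 0).
DIMENSIONS = {"IE": (0, "I"), "NS": (1, "N"), "TF": (2, "T"), "JP": (3, "J")}

for name, (position, positive_letter) in DIMENSIONS.items():
    df[name] = (df["type"].str[position] == positive_letter).astype(int)

print(df[["type", "IE", "NS", "TF", "JP"]])
```

With this encoding, each of the four columns becomes the target of its own binary classifier.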

The TF and JP data are nearly balanced, while IE and NS are unbalanced. The heat map shows a positive correlation between IE-JP and NS-JP. All other pairs of distinct dimensions are negatively correlated.

Preprocessing

For better feature extraction, some preprocessing is performed on the textual data in column ‘posts’.

  1. To lower case: The textual data is converted into lowercase.
  2. Removal of URLs/links: Web URLs do not give us any direct textual information about a person’s personality, so they are not useful for personality classification. These links are removed using a regular expression for URLs.
  3. Removal of special characters and numbers: Special characters such as ‘.’, ‘,’, and ‘|||’ are mostly noise. Numbers also rarely give helpful information about someone’s personality. Both are removed using a regular expression.
  4. Removal of extra spaces: Extra whitespace carries no information, so it is removed using regular expressions.
  5. Removal of stopwords: In English, stopwords include words such as ‘for’, ‘them’, and ‘you’. These words are essential for a sentence to make sense, but they contribute little to feature extraction and model training. The stopword list is taken from the nltk library in Python.
  6. Removal of MBTI personality names: MBTI personality names such as ‘INFJ’ and ‘ISTP’ that people use in their posts leak the label and can wrongly influence the results. Consequently, they are also removed.
  7. Lemmatization: Different forms of the same word should be treated as a single feature. A lemmatizer is used to map inflected forms to a common base form (for example, ‘gone’, ‘going’, and ‘went’ all map to ‘go’).
Text before and after preprocessing
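As a rough illustration of steps 1 to 6, here is a minimal sketch that uses only Python’s re module. The regular expressions, the short stopword set, and the function name clean_posts are assumptions made for this example; the project takes its stopword list from nltk, and lemmatization (step 7) is left out here because it needs nltk’s downloaded corpora.

```python
import re

# Small stand-in stopword set (assumption); the project uses nltk's English list.
STOPWORDS = {"for", "them", "you", "the", "a", "is", "to", "and", "i", "my"}

MBTI_TYPES = {
    "infj", "infp", "intj", "intp", "isfj", "isfp", "istj", "istp",
    "enfj", "enfp", "entj", "entp", "esfj", "esfp", "estj", "estp",
}

# URLs stop at whitespace or '|' so that the '|||' post separator
# does not cause the first word of the next post to be swallowed.
URL_PATTERN = re.compile(r"https?://[^\s|]+|www\.[^\s|]+")
NON_LETTER_PATTERN = re.compile(r"[^a-z\s]")
WHITESPACE_PATTERN = re.compile(r"\s+")


def clean_posts(text: str) -> str:
    text = text.lower()                                # step 1: lowercase
    text = URL_PATTERN.sub(" ", text)                  # step 2: drop URLs
    text = NON_LETTER_PATTERN.sub(" ", text)           # step 3: drop symbols and digits
    text = WHITESPACE_PATTERN.sub(" ", text).strip()   # step 4: collapse spaces
    tokens = [
        token for token in text.split()
        if token not in STOPWORDS                      # step 5: stopwords
        and token not in MBTI_TYPES                    # step 6: MBTI type names
    ]
    # Step 7 (lemmatization) would be applied to `tokens` here.
    return " ".join(tokens)


raw = "I think INFJ is right for you! https://example.com/abc123|||Next post, 42 times..."
print(clean_posts(raw))  # think right next post times
```

Note that this simple version only removes exact type names, so a plural such as ‘infjs’ would survive; a stricter pattern may be needed on the real data.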

Feature Extraction

Because the dataset is unbalanced for a few personality types, some words appear far more often than others while carrying little information. TF-IDF and a count vectorizer are used to convert the text into features, giving a more focused view of the text. First, the data is vectorized using CountVectorizer, which converts each user’s posts into a matrix of token counts. Then TF-IDF normalization scales the counts from the count vectorizer into floating-point values. TF-IDF measures how relevant a word is to a document within a collection of documents, and so indicates the importance of each word in the data. After vectorizing, the dataset had 1500 features for each user.
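The two-stage vectorization can be sketched with scikit-learn as follows. The three example strings are made up, and capping the vocabulary with max_features=1500 is an assumption about how the 1500 features were obtained; the remaining settings are library defaults.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical cleaned text, one string per user.
cleaned_posts = [
    "love quiet evening reading book",
    "love party meeting new people",
    "plan schedule finish project early",
]

# Stage 1: token counts. max_features keeps only the most frequent terms
# (assumed here to be how the vocabulary was limited to 1500).
count_vectorizer = CountVectorizer(max_features=1500)
token_counts = count_vectorizer.fit_transform(cleaned_posts)

# Stage 2: rescale counts to TF-IDF weights (L2-normalized rows by default).
tfidf_transformer = TfidfTransformer()
features = tfidf_transformer.fit_transform(token_counts)

print(features.shape)  # (number of users, vocabulary size)
print(sorted(count_vectorizer.vocabulary_)[:5])
```

In this toy example, ‘love’ appears in two of the three documents, so it receives a lower weight than a word such as ‘quiet’ that appears in only one.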

The word cloud depicts the frequency of words for the INFP type: each word’s size is proportional to how often it appears in the posts. The bar graph shows the importance of different words for the INFP type using their TF-IDF values. We plotted the same TF-IDF graph and word cloud for every personality type.

Methodology

After preprocessing and feature extraction, the resulting dataset is split into training and testing sets in an 80:20 ratio. Since the data is unbalanced for the IE and NS dimensions, we used Stratified K-Fold cross-validation with GridSearchCV to get more reliable results. The six models used for MBTI personality classification in our project are Logistic Regression, Naive Bayes, Random Forest, K-Nearest Neighbor (KNN), SGD Classifier, and Support Vector Machines (SVM). We used the sklearn, NumPy, and pandas libraries to build all these models. A separate binary classifier is trained for each of the four dimensions.
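The split and the stratified grid search can be sketched as below. Synthetic data from make_classification stands in for the TF-IDF features of one dimension, and the parameter grid, the random seeds, and the use of stratify in the split are assumptions for illustration rather than the settings used in the project.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split

# Synthetic, imbalanced stand-in for the features and labels of one dimension.
X, y = make_classification(
    n_samples=400, n_features=20, weights=[0.75, 0.25], random_state=0
)

# 80:20 split; stratify keeps the class ratio the same in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Each fold preserves the class ratio of the training set.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.1, 1.0, 10.0]},  # hypothetical grid
    cv=cv,
    scoring="accuracy",
)
search.fit(X_train, y_train)

print(search.best_params_)
print(search.score(X_test, y_test))  # accuracy on the held-out 20%
```

The same pattern applies to the other models by swapping the estimator and its parameter grid.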

Logistic Regression: The logistic regression model in sklearn applies regularization by default and performs a multivariable logistic regression on the given dataset. It achieves greater than 80% accuracy in IE, NS, and TF classification and 71% in JP classification.

Naive Bayes: The Naive Bayes classifier assumes that the features are independent of each other given the class. We used Gaussian Naive Bayes, a simple and effective variant that models each feature using only its per-class mean and standard deviation. It gives an accuracy of 75% in NS and TF classification and greater than 64% in IE and JP classification.
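To make the "mean and standard deviation" idea concrete, here is a minimal sketch with sklearn’s GaussianNB on a tiny made-up feature matrix. One practical point when using it with TF-IDF output: GaussianNB expects a dense array, so a sparse feature matrix has to be converted first (for example with .toarray()).

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Tiny hypothetical feature matrix: two features, three samples per class.
X = np.array([
    [0.9, 0.1], [0.8, 0.2], [0.7, 0.0],   # class 1
    [0.1, 0.9], [0.2, 0.8], [0.0, 0.7],   # class 0
])
y = np.array([1, 1, 1, 0, 0, 0])

model = GaussianNB()
model.fit(X, y)

# theta_ holds the per-class mean of every feature (rows follow model.classes_).
print(model.classes_)
print(model.theta_)

# New points are assigned to the class whose Gaussians explain them best.
print(model.predict(np.array([[0.85, 0.05], [0.05, 0.75]])))
```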

Random Forest: The random forest classifier improves on a single decision tree and controls its overfitting by fitting many decision trees on sub-samples of the data and then averaging their predictions. To avoid overfitting and increase accuracy, we performed hyperparameter tuning on max_depth and min_samples_split. To find the best value for max_depth, we analyzed accuracy vs. depth graphs for the random forests. The final models for IE, NS, and TF classification have an accuracy greater than 70%. The model for JP classification gives an accuracy of 60%.

KNN Classifier: The KNN algorithm classifies a data point according to the labels of its k nearest neighbors, where nearness is measured by distance in feature space. We plotted a separate K vs. accuracy graph for each dimension to determine the best value of K. The model has an accuracy above 60% for TF and JP, and above 78% for IE and NS.
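A K vs. accuracy comparison of this kind can be sketched as follows. The data is synthetic, and the candidate K values and the use of 5-fold cross-validated accuracy are assumptions for the example; the values collected in mean_accuracy are what would be plotted.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the features and labels of one dimension.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

k_values = [1, 3, 5, 7, 9, 11]  # hypothetical candidates
mean_accuracy = {}
for k in k_values:
    scores = cross_val_score(
        KNeighborsClassifier(n_neighbors=k), X, y, cv=5, scoring="accuracy"
    )
    mean_accuracy[k] = scores.mean()

# Pick the K with the highest mean cross-validated accuracy.
best_k = max(mean_accuracy, key=mean_accuracy.get)
print(mean_accuracy)
print(best_k)
```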

SGD Classifier: Stochastic Gradient Descent selects a few random samples from the dataset for each iteration instead of using the whole dataset. We set the loss parameter to ‘log’ (named ‘log_loss’ in newer scikit-learn releases), which gives us a probabilistic classifier equivalent to logistic regression. We then applied GridSearchCV with five-fold cross-validation. The SGD model provides an accuracy above 80% for IE, NS, and TF, and 70% for JP.
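A minimal sketch of this setup is shown below, on synthetic data. The grid over alpha is an assumption, not the grid used in the project. The version check is there because scikit-learn renamed the logistic loss from 'log' to 'log_loss' in release 1.1 and later removed the old name.

```python
import sklearn
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

# Pick the name of the logistic loss that the installed version understands.
major, minor = (int(part) for part in sklearn.__version__.split(".")[:2])
LOG_LOSS_NAME = "log_loss" if (major, minor) >= (1, 1) else "log"

# Synthetic stand-in for the features and labels of one dimension.
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

search = GridSearchCV(
    SGDClassifier(loss=LOG_LOSS_NAME, max_iter=1000, random_state=0),
    param_grid={"alpha": [1e-4, 1e-3, 1e-2]},  # hypothetical grid
    cv=5,                                      # five-fold cross-validation
)
search.fit(X, y)

# With the logistic loss, the fitted model can output class probabilities.
probabilities = search.predict_proba(X[:3])
print(search.best_params_)
print(probabilities)
```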

Support Vector Classification: SVC is a supervised algorithm that uses the kernel trick to transform the data and then finds the optimal decision boundary in the transformed space. We set the probability parameter to True, which enables probability estimates (5-fold cross-validation is also used here). The SVC model achieves an accuracy above 80% for IE, NS, and TF, and 72% for JP.

Results

  • Gaussian Naive Bayes gives a low accuracy on the testing dataset (about 60%) across the dimensions. It also performs poorly in terms of precision and its ROC curve.
  • SVC and SGD perform similarly to Logistic Regression. The precision and recall values of these algorithms are also good.
  • Random Forest and KNN have good but relatively lower performance (accuracy around 75%). The ROC curve for KNN shows poor performance.
  • We conclude that the Logistic Regression model performs the best for personality classification based on the Myers Briggs personality model. The ROC curve for logistic regression supports this conclusion.
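The metrics referred to above (accuracy, precision, recall, and the area under the ROC curve) can be computed with sklearn.metrics. The labels and scores below are made up purely to show the calls; they are not results from our models.

```python
from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    roc_auc_score,
)

# Hypothetical true labels and predicted probabilities of class 1.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_score = [0.9, 0.2, 0.6, 0.4, 0.3, 0.7, 0.8, 0.1]

# Hard predictions at a 0.5 threshold.
y_pred = [int(score >= 0.5) for score in y_score]

accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
auc = roc_auc_score(y_true, y_score)  # uses the scores, not the hard labels

print(accuracy, precision, recall, auc)  # 0.75 0.75 0.75 0.875
```

Accuracy alone can be misleading on the unbalanced IE and NS dimensions, which is why precision, recall, and the ROC curve are reported alongside it.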

Conclusion

Our models predicted MBTI personality from social media posts with reasonable accuracy using all six supervised machine learning algorithms. The most accurate results were obtained using logistic regression. We can get more precise results by training the models on a larger and more accurate dataset. This system can assist in the development of better recommendation systems. Governments can use it to find outliers and understand the personalities of targeted individuals. Companies can also use MBTI personality results to gain a better understanding of their employees’ behavior, including their strengths and weaknesses, as well as how they perceive, process, and interpret data.

This project was completed as part of our CSE343: Machine Learning course at IIIT Delhi, taught by Dr. Jainendra Shukla.

References

[1] “6839354.pdf.” Accessed: Jan. 08, 2022. [Online]. Available: https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1184/reports/6839354.pdf

[2] B. Antonio, “Data Science Final Project: Myers-Briggs Prediction,” Medium, May 05, 2018. https://medium.com/@bian0628/data-science-final-project-myers-briggs-prediction-ecfa203cef8 (accessed Jan. 08, 2022).

[3] N. H. Z. Abidin et al., “Improving Intelligent Personality Prediction using Myers-Briggs Type Indicator and Random Forest Classifier,” IJACSA, vol. 11, no. 11, 2020, doi: 10.14569/IJACSA.2020.0111125.

[4] P. Novikov, L. Mararitsa, and V. Nozdrachev, “Inferred vs traditional personality assessment: are we predicting the same thing?,” arXiv:2103.09632 [cs], Mar. 2021, Accessed: Jan. 08, 2022. [Online]. Available: http://arxiv.org/abs/2103.09632

[5] S. Patel, M. Nimje, A. Shetty, and S. Kulkarni, “Personality Analysis using Social Media,” International Journal of Engineering Research, vol. 9, no. 3, p. 4, 2021.

Blog Authors

Vidhi Sharma, B.Tech. CSAM, IIITD

Udit Bhati, B.Tech. CSAM, IIITD

Dolly Sidar, B.Tech. CSD, IIITD
