Using ML in Personality Analysis

Serjan Kaur
The Innostation Publication
14 min read · Jul 4, 2022

You have probably taken a personality test at least once in your life, unless you’ve been living under a rock. Over the last few months, I was applying to jobs for the summer. I would fill out the necessary applications, answer the generic questions, and attach my resume. Some employers would then ask me to complete an online assessment the very next day, with checkboxes for marking whether I strongly agreed or disagreed with a predetermined statement.

On one such test, I was supposed to check the box next to the statement, “I find it difficult to approach individuals.”

  • Strongly Agree
  • Agree
  • Neither Agree nor Disagree
  • Disagree
  • Strongly Disagree

The truth is that I actually do have a hard time approaching people. In theory, then, I should have chosen “Agree” with that statement, but when filling out a job application’s assessment, I marked “Strongly Disagree” without batting an eye! It is challenging to accurately estimate a person’s personality, and candidates can sometimes falsify results, as I did.

To this end, we developed a personality trait estimator that takes unstructured written text as input and outputs personality estimations along the dimensions of the Myers-Briggs Type Indicator. Essentially, it attempts to infer your personality from the things you have written.

The growing number of people using social media has resulted in a significant increase in the amount of information available online. The contents that these users post on social media can often provide valuable insights into their personalities (for example, in terms of predicting job satisfaction, specific preferences, and the success of professional and romantic relationships) without the hassle of taking a formal personality test.

Through personality prediction, we break down digital input into components and map them to a personality model. One well-known model, the Big Five personality traits, has often been adopted in the literature as the de facto standard for personality evaluation due to its simplicity and proven competence. Over the last decade, natural language processing (NLP) technologies have become an established and appealing go-to method for psychological study.

But First, Let’s Go Over the Basics

What is NLP?

Natural Language Processing is a branch of artificial intelligence that focuses on the overlap between human language and computers. It enables computers to converse with humans in their native tongue and to scale other language-related tasks. NLP allows computers to read text, hear speech, interpret it, gauge sentiment, and identify which parts are significant.

How it works:

  • Tokenization: the process of splitting text into smaller units called tokens. Tokens can be words, numbers, or symbols.
  • Part-of-speech tagging: POS tagging categorizes the words in a text according to whether each is a noun, adjective, verb, and so on.
  • Stemming and lemmatization: the process of removing prefixes and suffixes to extract the base form of a word. Lemmatization takes the context into account and converts the word to its meaningful base form, whereas stemming simply chops off the last few characters, which can produce misspelled stems that are not valid words.
  • Stop word removal: as the name suggests, this eliminates words that appear in a large number of documents in the corpus. Articles and pronouns, for example, are commonly classified as stop words. (All four steps are sketched in the example below.)
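To make these four steps concrete, here is a minimal sketch using NLTK (the sample sentence is invented; the download calls fetch the required resources on first run):

# a minimal NLTK sketch of the four preprocessing steps above
import nltk
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import stopwords

# one-time downloads of the required resources
for pkg in ('punkt', 'averaged_perceptron_tagger', 'wordnet', 'stopwords'):
    nltk.download(pkg)

text = "The runners were running quickly through the crowded streets"

tokens = word_tokenize(text.lower())                # tokenization
print(pos_tag(tokens))                              # part-of-speech tagging
print([PorterStemmer().stem(t) for t in tokens])    # stemming: 'quickly' -> 'quickli'
print([WordNetLemmatizer().lemmatize(t, pos='v') for t in tokens])  # lemmatization: 'running' -> 'run'

stops = set(stopwords.words('english'))
print([t for t in tokens if t not in stops])        # stop word removal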

What is Personality Analysis?

Personality analysis is the process of analyzing and evaluating an individual’s essential attributes, such as dependability, determination, confidence, tenderness, and so on. By accumulating relevant information, it builds an accurate assessment of a person’s personality. Personality analysis is used not only to identify psychological disorders in individuals, but also to screen potential job candidates.

Use Cases:

In mental health, particular illnesses are linked to certain personality traits. In forensics, knowing personality features can help narrow the suspect pool. In human resources management, personality qualities affect one’s fitness for particular occupations. Personality analysis is also used for a variety of other purposes, such as observing personality changes, evaluating hypotheses, determining how beneficial therapy might be, and so on. It is frequently used to determine competence and to conduct risk assessments.

Why do we need an alternative?

Because the questions are straightforward, it is quite easy to game a traditional personality test and obtain the personality type of our choice. Individuals who want to take a personality test also face two major issues: the expense and the length. The majority of personality tests comprise 50-70 questions, which can be tedious for the user.

What is the Myers-Briggs Type Indicator?

Have you ever wondered what those cryptic-sounding initials meant when someone said they were an INTJ or an ESTP? These individuals are referring to their Myers-Briggs Type Indicator (MBTI) personality type.

People are classified into one of 16 personality types based on their responses to the inventory’s questions. The MBTI’s purpose is to help people better understand and explore their own personalities, including their likes, dislikes, strengths, weaknesses, potential job choices, and compatibility with others.

Finally, What Are We Doing Here?

The user must describe their interpretation of an abstract visual in around 50-60 words. This concept is founded on the notion that everyone thinks and expresses themselves differently: the project is based on the idea that different personality types use different terminology, which can be mapped back to personality attributes. At the conclusion of the test, a detailed analysis is provided, along with the proportion of each trait. This could be extremely advantageous for both psychological and industrial uses.

Step 1: Importing the Data & Libraries

The Dataset:

Importing a dataset is fundamental when working on a machine learning model. A dataset is a set of data, usually presented in tabular format, where each column denotes a distinct variable. The dataset we’re using has two variables: type, the user’s MBTI personality type, and posts, the text of their forum posts (with individual posts separated by ‘|||’).

Our Dataset: For the purposes of this project, labeled data was collected from different people’s postings on a personality test forum, along with their personality types. A total of 8000 data points were used in this investigation, covering all 16 MBTI personality types.

Here is the dataset I used!

# loading the dataset into a pandas DataFrame
import pandas as pd

df = pd.read_csv('mbti_1.csv')

The read_csv() function in Pandas imports a CSV file into a DataFrame.
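A quick sanity check after loading never hurts. (The type column name follows the standard Kaggle MBTI dataset; treat it as an assumption if your copy differs.)

# quick look at the raw data
print(df.shape)                   # (rows, columns)
print(df['type'].value_counts())  # users per MBTI type; the classes are imbalanced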

The Libraries:

NumPy: Numerical Python is a Python library that provides multidimensional array and matrix data structures. I used it to create and manipulate data arrays. It basically did all the math stuff we don’t really want to do ourselves.

Pandas: another open-source library, which we use for data analysis.

matplotlib.pyplot: Pyplot is the set of functions included in the Matplotlib visualization toolkit. Its functions let you change a figure’s elements: constructing a figure and a plotting area, drawing plot lines, adding plot labels, and so on.

re: The functions in this module let you determine whether a given text matches a given regular expression (RE).

NLTK: The Natural Language Toolkit is a Python platform for creating statistical natural language processing (NLP) applications. It includes text processing libraries for tokenization, parsing, classification, stemming, tagging, and semantic reasoning.

Step 2: Pre-processing the Data

Why do it?

Because it was derived from a large social media corpus, the dataset contains a lot of slang and terms that cannot serve as significant semantic features for defining one’s personality. The data also contained various punctuation marks and a large number of emoticons, all of which had to be cleaned out. During the data cleaning rounds, redundant punctuation, emoticons, and stop words such as “a,” “the,” and others were eliminated from the dataset.

Let me break down the code now :)

  • Convert to lowercase: reducing all characters to lowercase helps remove unhelpful variation, or noise, from the data. Removing noise comes in handy when you want to run text analysis on pieces of data like comments.
df = pd.read_csv('mbti_1.csv')

# convert to lowercase
df['posts'] = [i.lower() for i in df['posts']]
df.head()
  • Splitting the dataset: the split() method splits a string into a list. The function reads the string and separates it whenever it detects the previously defined separator.
df['Distinct Posts'] = [i.split('|||') for i in df['posts']]
  • Removing columns: the drop() method removes a row or column from a DataFrame. Specifying the column axis (axis=1) along with a label eliminates the named column, as the one-liner below shows.
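For instance, this is exactly what the cleaning code further below does to discard the original posts column once the cleaned text is in place:

# axis=1 targets a column; axis=0 would target a row
df.drop('posts', inplace=True, axis=1)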

Cleaning the dataset

Without cleaning, the dataset is frequently a jumble of words that the machine cannot comprehend. We’ll go over the procedures involved in cleaning data in a typical machine learning text pipeline.

import re

# remove hyperlinks and the '|' separator characters
df['Posts'] = df['posts'].apply(lambda x: re.sub(r'https?:[?:A-Za-z0-9//_?.=/-]+', '', x.replace('|', '')))

# remove words containing digits
df['Posts'] = df['Posts'].apply(lambda x: re.sub(r'\w*\d\w*', '', x))
  • We’re using re.sub() to replace particular patterns in our dataset; the regular expression library (re) helps us search for matching text in the data.
  • Hyperlinks (matched by https?:[?:A-Za-z0-9//_?.=/-]+) are replaced with empty strings, and the ‘|’ separator characters are removed.
# removing special symbols
df['Posts'] = df['Posts'].apply(lambda x: re.sub(r'[0-9,."\'*.?/\()@#!$%&^+]', '', x))

# removing repetitive multiple-letter words and too long or too short words
df['Posts'] = df['Posts'].apply(lambda x: re.sub(r'([a-z])\1{2,}[\s|\w]*', '', x))
df['Posts'] = df['Posts'].apply(lambda x: re.sub(r'(\b\w{0,3})?\b', '', x))
df['Posts'] = df['Posts'].apply(lambda x: re.sub(r'(\b\w{30,1000})?\b', '', x))
  • Special symbols (digits and punctuation such as , . " ' * ? / ( ) @ # ! $ % & ^ +) are replaced with empty strings.
  • All repetitive multi-letter terms, as well as words that are too long or too short, are removed.
# df['Posts'] = df['Posts'].apply(lambda x: re.sub(r'\s[\s+]', '', x))
df.drop('posts', inplace=True, axis=1)
df.to_csv('mbti_cleaned.csv')

Finally, the drop() method removes the original ‘posts’ column, and to_csv() writes the cleaned DataFrame to disk.

We now have a clean dataset containing only the words that are relevant to studying personality traits. The cleaned text is categorized by personality type, giving a final dataset of personality classes and the raw text entered by these personalities. The final dataset is saved as “mbti_cleaned”.

Step 3: TF-IDF Vectorizer

Textual data must still be converted to numerical data before it can be used. This process is known as feature extraction or, more simply, vectorization, and it is a prerequisite for language-aware analysis. Numerically representing documents improves the ability to perform meaningful analytics. In our case, a particular amount of weight must be applied to the words in order to identify personality based on them.

We use the TF-IDF vectorizer because it weighs how often a term is used by each personality type against how many of the personality types use that term at all.

What is a TF-IDF vectorizer, exactly?

It’s a text vectorizer that converts text into usable vectors using Term Frequency (TF) and Inverse Document Frequency (IDF). The TF is the number of times a given term appears in a document, and it reflects how important that term is within that document. The document frequency is the number of documents containing the term, and its inverse (the IDF) down-weights terms that appear in many documents. The product of TF and IDF is therefore highest for terms that are frequent in one document but rare across the corpus.
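A toy illustration with scikit-learn’s TfidfVectorizer (the three “posts” here are invented): words shared across documents earn a lower IDF weight than words unique to one document.

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "introverts enjoy quiet evenings",
    "extroverts enjoy loud parties",
    "introverts avoid loud parties",
]

vec = TfidfVectorizer()
X = vec.fit_transform(corpus)

# shared words ('enjoy', 'introverts', 'loud', 'parties') get a lower IDF
# than words unique to one document ('quiet', 'avoid', 'evenings', 'extroverts')
for word, idf in zip(vec.get_feature_names_out(), vec.idf_):
    print(f"{word}: {idf:.2f}")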

Why TF-IDF vectorizer?

Because it considers the complete set of documents rather than just a single sentence, TF-IDF works well when dealing with informal text data. As a result, the error introduced by a specific personality type’s excessive use of a certain term is dampened, and the only way a word is given higher priority is if it is used by one personality type and not by the others.

Finally, all of the raw text is converted to numerical features, creating an entirely numerical dataset. This data was then used to train our machine learning and deep learning models.

How?

Create variables for the CountVectorizer and TfidfVectorizer respectively, and store the vectorizer objects there. After that, the vectorizers are fitted (.fit) to the dataset, which means all the data in the variable is passed through the vectorizer so it can learn the vocabulary and weights. Then, using .transform, all of the values are converted to their corresponding features.

# converting the textual data to numerical features
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
vectorizer.fit(x)            # learn the vocabulary and IDF weights
x = vectorizer.transform(x)  # map each document to its TF-IDF vector
print(x)

Step 4: Naive Bayesian classifier

What is a Naive Bayesian classifier, exactly?

It’s a classification method based on Bayes’ Theorem and the assumption of predictor independence. A Naive Bayes classifier, in simple terms, posits that the existence of one feature in a class is unrelated to the presence of any other feature.

What are we doing here?

To classify the data in our study, we employed a simple Naive Bayesian classifier. In order to fit the bias of the data, the training parameter alpha was set at 0.32. Because of the large bias and the multiclass classification, the model only achieved 32% accuracy on the validation set.

How does it work?

  1. Create a frequency table from the data set.
  2. Find the probabilities and create a Likelihood table.
  3. Calculate the posterior probability for each class using the Naive Bayesian equation. The outcome of prediction is the class with the highest posterior probability.
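As a toy walkthrough of those three steps (the word counts and priors here are invented for illustration):

# Step 1: frequency table -- counts of a word per class (invented numbers)
counts = {'INTJ': {'party': 2, 'quiet': 8},
          'ESFP': {'party': 9, 'quiet': 1}}
priors = {'INTJ': 0.5, 'ESFP': 0.5}

# Step 2: likelihoods P(word | class) from the table
def likelihood(word, cls):
    return counts[cls][word] / sum(counts[cls].values())

# Step 3: posteriors P(class | word) via Bayes' theorem; the prediction
# is the class with the highest posterior
scores = {cls: likelihood('party', cls) * priors[cls] for cls in counts}
total = sum(scores.values())
for cls, s in scores.items():
    print(cls, round(s / total, 2))   # ESFP ~0.82, INTJ ~0.18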

Our code:

  • Perform a train-test split on the dataset
  • Set the variable model to a Naive Bayes model (MultinomialNB)
  • Fit the training data to the model
  • Finally, predict the test results (a fuller sketch of these steps follows the snippet)
from sklearn.naive_bayes import MultinomialNB

model = MultinomialNB(alpha=0.0013)
model.fit(x_train_tfidf, y_train)
y_pred = model.predict(cv.transform(x_test))

-> The accuracy was 36%
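For reference, here is a self-contained sketch of the pipeline those bullets describe. The variable names mirror the snippets above, while the 80/20 split ratio and the type label column are my assumptions rather than something stated in the article:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# split the raw text and labels (assumed 80/20)
x_train, x_test, y_train, y_test = train_test_split(
    df['Posts'], df['type'], test_size=0.2, random_state=42)

# fit the vectorizer on the training text only, then transform both splits
cv = TfidfVectorizer()
x_train_tfidf = cv.fit_transform(x_train)

model = MultinomialNB(alpha=0.0013)
model.fit(x_train_tfidf, y_train)

y_pred = model.predict(cv.transform(x_test))
print(accuracy_score(y_test, y_pred))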

Step 5: Support Vector Machine

So, What are SVMs?

The basic principle behind SVM is that it constructs a line or hyperplane that divides data into classes. SVM is an algorithm that takes data as input and, if possible, generates a line that separates the classes. The SVM technique then finds the points from each class that are closest to the line. These points are called support vectors.

The distance between the line and the support vectors is then computed; this distance is called the margin. Our goal is to find the division between the different personality types, and the ideal hyperplane is the one for which the margin is the greatest.

What are we doing here?

To increase the fitting towards the data bias, the regularisation parameter C was set to 0.16, a strategy similar to the one used in the research paper I followed. On the validation set, the SVM model had a 60% accuracy rate.

The code:

  • In X_train_tfidf we have our points, and in y_train the classes to which they belong.
  • Now we’ll train our SVM model on this dataset.
  • Finally, we predict the test set’s classes.
  • Tuning hyper-parameters
from sklearn.svm import LinearSVC

clf = LinearSVC(C=0.16).fit(X_train_tfidf, y_train)  # C tuned to 0.16, as described above
y_pred = clf.predict(c_v.transform(x_test))

When you design a classifier, you pass parameters as arguments. The main parameter for an SVM is the following:

C: It manages the trade-off between having a smooth decision boundary and classifying training points correctly. A large C value means the model tries harder to classify every training point correctly, at the cost of a more complex boundary.
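A quick way to see that trade-off on synthetic data (the dataset here is generated purely for illustration):

from sklearn.svm import LinearSVC
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# training accuracy tends to rise as C grows, since the model works
# harder to classify every training point correctly
for C in (0.01, 0.16, 10.0):
    clf = LinearSVC(C=C).fit(X, y)
    print(C, clf.score(X, y))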

Step 6: Convolutional Neural Network

Finally, we used a Convolutional Neural Network, a deep learning approach commonly applied to images.

So, What is Deep Learning?

Deep learning is a type of machine learning that builds more complicated hierarchical models to replicate how humans learn new knowledge. Deep learning models are known as neural networks and are inspired by the structure of the human brain.

These neural networks are made up of interconnected nodes, or neurons, that learn to recognize patterns in a manner loosely analogous to the human brain and nervous system.

And, What are CNNs?

A convolutional neural network (CNN/ConvNet) is a type of deep neural network used to evaluate visual imagery in deep learning. The system can take an input image, assign relevance (learnable weights and biases) to various aspects/objects in the image, and distinguish between them.

When compared to other classification methods, the amount of pre-processing required by a ConvNet is significantly less.

What are we doing here?

In this version of a CNN, word vectors are subjected to 1D convolution. An embedding layer transforms each word into a one-dimensional vector, and these vectors serve as the inputs to the CNN layer. Unlike with images, the input is one-dimensional. The 1D CNN takes the input vectors and applies learnable weights and biases through feature extraction and max pooling, in order to improve the separation between words.

Due to the large number of classes to predict, accuracy did not improve much. To fix this, the deep learning model was combined with a strategy known as multilabel classification.

Multilabel Classification:

To achieve higher accuracy, four binary classifiers were employed rather than making predictions over 16 classes for the 16 personality types. As a result, the output layer had 4 neurons with sigmoid activation functions instead of 16 neurons with a soft-max activation. Word embeddings were constructed, passed through a 1D convolution, and provided as input to the rest of the neural network, which also included a max pooling layer, a dense layer, and the sigmoid layer producing the four binary classifiers’ multilabel results. (A sketch of this architecture follows.)
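Here is a minimal Keras sketch of that architecture. The vocabulary size, sequence length, embedding dimension, and filter counts are all assumptions; only the overall shape (embedding, 1D convolution, max pooling, a dense layer, and four sigmoid outputs) comes from the description above.

from tensorflow.keras import layers, models

VOCAB_SIZE = 20000  # assumed vocabulary size
SEQ_LEN = 500       # assumed (padded) post length in tokens
EMBED_DIM = 128     # assumed embedding dimension

model = models.Sequential([
    layers.Input(shape=(SEQ_LEN,)),
    layers.Embedding(VOCAB_SIZE, EMBED_DIM),              # word -> 1D vector
    layers.Conv1D(64, kernel_size=5, activation='relu'),  # 1D convolution
    layers.GlobalMaxPooling1D(),                          # max pooling
    layers.Dense(64, activation='relu'),                  # dense layer
    # four sigmoid neurons: one binary classifier per MBTI axis (I/E, N/S, T/F, J/P)
    layers.Dense(4, activation='sigmoid'),
])

# binary cross-entropy treats the four outputs as independent labels
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.summary()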

With thorough analysis and fine-tuning of the number of hidden layers, the activation functions at the inner levels and the output, and the max pooling, it is possible to attain even greater accuracy with this multilabel classification.

Using TF-IDF vectorization and meticulously adjusting the hyperparameters resulted in substantially higher accuracy than other models that paired an n-gram model with a CNN. As a result of this training, the convolutional neural network model is used in the final version of the testing system.

That’s it! You see, it was pretty simple!

Conclusion & Takeaways

The project uses visuals rather than questions to categorize the user’s personality from as few as 60 words. The benefit of such a process for personality classification is that it can identify personality features quickly and easily. It fixes the problem of having to answer at least 50-60 questions to get a personality test result that can be readily manipulated. The project’s output is a precise, time-saving way to take a personality test with artificial intelligence.

  • Copying and pasting the instructions won’t help unless you comprehend EVERY line of code you write. Take notes, because you’ll forget the details afterwards.
  • Genuinely understand what you’re doing: build the project (like anything else) for the learning opportunity and to hone your talents, not just to complete it.
  • Enjoy breaks! If you’ve been stuck on the same problem for a while, it’s usually time to take a break and work on something else. When you return to the debugging process, you’ll view things differently.

Aaaand we’re done!!!!!

S/O to Tejas Pradhan for writing the research paper I followed and helping me when I got stuck!

I really hope you found this article interesting. If that’s the case, you can find more of my fascinating stuff here:

Hey! Thanks for reading till the end :) I am a 16-year-old AI enthusiast and an innovator at The Knowledge Society (TKS).

My LinkedIn: https://www.linkedin.com/in/serjan-kaur-

My Twitter: https://twitter.com/serjannnnnn

Subscribe to my monthly newsletters: https://serjan.substack.com/
