25 Tweets to Know You: A New Model to Predict Personality with Social Media

Published in SyncedReview, May 30, 2017

(All figures in this review are cited from this paper: Arnoux P H, Xu A, Boyette N, et al. 25 Tweets to Know You: A New Model to Predict Personality with Social Media[J]. ICWSM 2017)

Abstract

To serve personalized ads, tech giants such as Google and Facebook try to infer their users’ personality from their posts on social media, so predicting personality from written text matters to social networking applications. However, existing approaches require too much input data to be used realistically. In this paper, the authors develop a model that predicts personality with a reduced data requirement. The model outperforms state-of-the-art techniques while requiring 8 times less data.

Introduction

More and more social applications are taking users’ personality into consideration in order to provide a more adaptive and personalized service and user experience. Personality is usually measured by the Big-5 dimensions: Openness, Conscientiousness, Extraversion, Agreeableness, Neuroticism (OCEAN) [1, 2, 3].

With the rapid growth of the internet, a tremendous amount of data for predicting personality is available on social media. However, robust machine learning models are known to need a large amount of input data. Previous works have shown that a personality prediction model can be built from an average of 200 Facebook posts [2] or even 100,000 words [3]. In contrast, Twitter users have only 22 posts on average [4]. Hence, these models cannot handle the majority of users, who have few posts on social media.

This paper presents a personality prediction model that works from a small amount of text, and compares this novel method against previous ones (a first in this field). It also introduces a new method in which Word Embedding provides the features for personality modeling and Gaussian Processes serve as the learning algorithm. This method outperforms previous works in this field.

Word Embedding with Gaussian Processes

The authors propose a method combining Word Embedding with Gaussian Processes, which first extracts the words from tweets and averages the Word Embedding representation into a single vector, and then takes the vectors as an input for training and testing.

The Word Embedding features

Word Embedding is a technique that represents each word as a low-dimensional vector learned from a large amount of unstructured text. It captures syntactic and semantic relationships between words by placing similar words closer together in the vector space.

In this paper, the authors choose the Twitter 200 dimensional GloVe model [5] for extracting the Word Embedding features.
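The averaging step described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the `TOY_EMBEDDINGS` table is a made-up 3-dimensional stand-in for the 200-dimensional Twitter GloVe vectors, and skipping out-of-vocabulary words is an assumption.

```python
from typing import Dict, List

# Toy 3-dimensional vectors standing in for the 200-dimensional Twitter GloVe
# vectors. The words and values are made up for illustration only.
TOY_EMBEDDINGS: Dict[str, List[float]] = {
    "happy": [1.0, 0.0, 2.0],
    "day": [3.0, 2.0, 0.0],
    "rain": [-1.0, 1.0, 1.0],
}


def average_embedding(tweets: List[str], embeddings: Dict[str, List[float]]) -> List[float]:
    """Average the vectors of all known words across one user's tweets."""
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    count = 0
    for tweet in tweets:
        for word in tweet.split():
            vector = embeddings.get(word)
            if vector is None:
                continue  # assumption: out-of-vocabulary words are skipped
            total = [t + v for t, v in zip(total, vector)]
            count += 1
    if count == 0:
        return total  # no known words: fall back to the zero vector
    return [t / count for t in total]


# "today" is not in the toy table, so four word vectors are averaged.
user_vector = average_embedding(["happy day", "rain day today"], TOY_EMBEDDINGS)
print(user_vector)  # [1.5, 1.25, 0.75]
```

Whatever the number of tweets, each user ends up with one fixed-length vector, which is what lets the same model handle users with 25 tweets and users with 200.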

The Gaussian Processes model

This paper also brings a non-linear model to personality prediction: Gaussian Processes (GP) [6]. GP performs well in regression because it explicitly quantifies noise and weights the usefulness of each feature through its kernel function. Combining GP with Word Embedding has proven effective in short text classification [7] and in non-linear modeling from text features [8].

In this paper, the authors use a 200 dimensional vector from the aforementioned Word Embedding features as an input, and train a GP model for each of the Big-5 dimensions.
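To make "one GP model per trait" concrete, here is a minimal sketch of GP regression written with the standard library only. It is not the authors' implementation: the RBF kernel, the `noise_variance` and `length_scale` values, the 2-dimensional inputs (standing in for the 200-dimensional vectors) and all trait scores are assumptions chosen for illustration, and only the posterior mean is computed.

```python
import math
from typing import Dict, List

Vector = List[float]
TRAITS = ["openness", "conscientiousness", "extraversion", "agreeableness", "neuroticism"]


def rbf_kernel(a: Vector, b: Vector, length_scale: float) -> float:
    """Squared-exponential kernel (one common choice, assumed here)."""
    sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-sq_dist / (2.0 * length_scale ** 2))


def solve(matrix: List[Vector], rhs: Vector) -> Vector:
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (aug[row][n] - tail) / aug[row][row]
    return x


class GPRegressor:
    """GP regression on mean-centered targets, posterior mean only."""

    def __init__(self, noise_variance: float = 0.1, length_scale: float = 1.0):
        self.noise_variance = noise_variance
        self.length_scale = length_scale

    def fit(self, X: List[Vector], y: Vector) -> "GPRegressor":
        self.X = [list(row) for row in X]
        self.y_mean = sum(y) / len(y)
        n = len(self.X)
        # Kernel matrix plus noise on the diagonal: K + sigma^2 * I
        K = [[rbf_kernel(self.X[i], self.X[j], self.length_scale)
              + (self.noise_variance if i == j else 0.0)
              for j in range(n)] for i in range(n)]
        self.alpha = solve(K, [v - self.y_mean for v in y])
        return self

    def predict(self, x: Vector) -> float:
        k_star = [rbf_kernel(x, xi, self.length_scale) for xi in self.X]
        return self.y_mean + sum(k * a for k, a in zip(k_star, self.alpha))


# Made-up user vectors and trait scores, for illustration only.
X_train = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scores: Dict[str, Vector] = {
    trait: [1.0 + i, 2.0 + i, 3.0 + i, 4.0 + i] for i, trait in enumerate(TRAITS)
}
# One independent regressor per Big-5 dimension.
models = {trait: GPRegressor().fit(X_train, scores[trait]) for trait in TRAITS}
predictions = {trait: models[trait].predict([0.5, 0.5]) for trait in TRAITS}
```

The noise term on the diagonal is where GP quantifies noise explicitly: a larger `noise_variance` pulls predictions toward the mean score instead of fitting each training user exactly.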

Experiment Design

The ground-truth data is obtained from over 1.3K participants, and the performance is compared with previous methods.

Ground Truth Collection

Participants’ self-reported personality ratings are collected in the same way as previous work [9]. Participants voluntarily agreed to share their tweets, and answered a personality web survey that the authors advertised via Twitter ads. Participants’ tweets were analyzed and graded into the Big-5 personality traits.

There are 1323 participants with at least 200 non-retweet tweets. The age distribution is: under 18 (23%), 18 to 24 (47%), 25 to 34 (14%), 35 to 54 (12%), above 54 (3%). 52% of the participants are female.

The authors then pre-processed the tweets by removing URLs, hashtags, numbers, and punctuation, and by converting all text to lowercase.
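A pre-processing step of this kind might look like the sketch below. The regular expressions are assumptions: the review does not say, for instance, whether a hashtag is dropped entirely or only loses its `#`, so this version drops the whole token.

```python
import re
import string

URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
HASHTAG_PATTERN = re.compile(r"#\w+")  # assumption: drop the whole hashtag
NUMBER_PATTERN = re.compile(r"\d+")
# string.punctuation covers ASCII punctuation only; curly quotes would remain.
PUNCTUATION_TABLE = str.maketrans("", "", string.punctuation)


def preprocess_tweet(text: str) -> str:
    """Remove URLs, hashtags, numbers and punctuation, then lowercase."""
    text = URL_PATTERN.sub(" ", text)  # URLs first, before their punctuation is stripped
    text = HASHTAG_PATTERN.sub(" ", text)
    text = NUMBER_PATTERN.sub(" ", text)
    text = text.translate(PUNCTUATION_TABLE)
    return " ".join(text.lower().split())  # lowercase and collapse whitespace


cleaned = preprocess_tweet("Loving the NEW phone!!! 100% worth it #gadgets https://example.com/x1")
print(cleaned)  # loving the new phone worth it
```

The order matters: URLs are removed before punctuation so that a link is dropped as a whole rather than being left behind as stray word fragments.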

Methods for Comparison

The proposed method is compared with two state-of-the-art methods:

Linguistic Inquiry and Word Count (LIWC) with Ridge Regression (RR). The method was proposed by Yarkoni [3] using LIWC [10] to extract features and RR as a learning algorithm.
3-Gram with Ridge Regression. The authors used 3-Gram and RR to implement this method.

The proposed Word Embedding with Gaussian Process (GP) method integrates GloVe features with Gaussian Process regression as the learning algorithm. Both RR and GP are regularized to reduce overfitting.

The data is split into Testing and Training sets using 10-fold cross-validation, and the Training set is split again into Training and Validation subgroups using a 75%-25% rule. The performance is evaluated by the Pearson correlation between the predicted and actual personality scores on the Testing data.
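The evaluation protocol can be sketched as follows. How the folds are drawn and the `seed` parameter are assumptions; the review only states 10 folds, a 75%-25% split of the training portion, and Pearson correlation on the test portion.

```python
import math
import random
from typing import List, Sequence, Tuple


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation; assumes neither sequence is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def cross_validation_splits(
    n_users: int,
    n_folds: int = 10,
    validation_fraction: float = 0.25,
    seed: int = 0,
) -> List[Tuple[List[int], List[int], List[int]]]:
    """Return (train, validation, test) index lists, one triple per fold."""
    indices = list(range(n_users))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    splits = []
    for i, test in enumerate(folds):
        # Everything outside the test fold is split 75% / 25%.
        rest = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        n_val = int(len(rest) * validation_fraction)
        splits.append((rest[n_val:], rest[:n_val], test))
    return splits


splits = cross_validation_splits(n_users=1323)
train, validation, test = splits[0]
print(len(train), len(validation), len(test))

# Perfectly linear predictions give a correlation of 1.0.
print(pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # 1.0
```

Every user lands in exactly one test fold, so each of the 1323 participants is scored once by a model that never saw their data during training.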

Comparison Settings

The performance of the three methods is compared in the following settings:

Full Setting: the methods are trained and tested over the entire corpus of texts.

Sampling Setting: a down-sampling on the tweets of the testing users is used, and the number of tweets used is varied to simulate users with various numbers of tweets.

Real-life Setting: The models are trained on a large population of users with a large number of tweets and tested on a small set of real life users with small numbers of tweets. The aim is to further investigate the performance of the methods in real life.
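The down-sampling used in the Sampling Setting can be illustrated with a short sketch. Uniform random sampling and the `seed` parameter are assumptions (the review does not say how tweets are chosen), and the users and tweets below are placeholders.

```python
import random
from typing import Dict, List


def downsample_tweets(
    user_tweets: Dict[str, List[str]], n_tweets: int, seed: int = 0
) -> Dict[str, List[str]]:
    """Keep at most n_tweets randomly chosen tweets per test user."""
    rng = random.Random(seed)
    sampled = {}
    for user, tweets in user_tweets.items():
        k = min(n_tweets, len(tweets))  # users with fewer tweets keep them all
        sampled[user] = rng.sample(tweets, k)
    return sampled


# Placeholder data: one prolific user and one with only three tweets.
user_tweets = {
    "user_a": [f"tweet {i}" for i in range(200)],
    "user_b": ["only", "three", "tweets"],
}
sampled = downsample_tweets(user_tweets, n_tweets=25)
print({user: len(tweets) for user, tweets in sampled.items()})  # {'user_a': 25, 'user_b': 3}
```

Repeating this for several values of `n_tweets` on the test users only, while leaving the training data untouched, simulates how a trained model behaves on users with few posts.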

Results

Full Setting

Table 1: Model correlation comparison for the Big-5 traits. The reported correlations are significant (p < 0.01).

In addition to the three methods introduced previously, three other combinations of features and models are also tested: GloVe with RR, LIWC with GP and 3-Gram with GP.

The proposed method (GloVe GP) achieves an average correlation of 0.33 over the Big-5 traits, 33% better than the other methods. What’s more, the GloVe features and GP contribute equally to the performance of the method. GP, however, does not perform well in combination with Bag-of-Words-like features such as 3-Gram.

Sampling Setting

Figure 1: Prediction accuracy of the Big-5 traits according to the number of tweets. Reported correlations are significant (p < 0.01).

Real-life Setting

Figure 2: Mean Absolute Error averaged over the Big-5 traits.

This figure compares the mean absolute error, averaged over the Big-5 traits, for the set of 55 real-life users. The average absolute error of the proposed method is 25% smaller than the state-of-the-art, and 11% smaller than the original method.
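For readers unfamiliar with the metric, the sketch below shows how a mean absolute error averaged over traits, and a relative reduction such as "25% smaller", can be computed. All scores and error values are made up for illustration and are not the paper's numbers.

```python
from typing import Dict, List


def mean_absolute_error(predicted: List[float], actual: List[float]) -> float:
    """Average of |prediction - truth| over users."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def average_mae(predicted: Dict[str, List[float]], actual: Dict[str, List[float]]) -> float:
    """Mean absolute error per trait, then averaged over the traits."""
    per_trait = [mean_absolute_error(predicted[t], actual[t]) for t in actual]
    return sum(per_trait) / len(per_trait)


def relative_reduction(baseline_error: float, new_error: float) -> float:
    """Fraction by which new_error is smaller than baseline_error."""
    return (baseline_error - new_error) / baseline_error


# Hypothetical scores for two users on two traits.
actual = {"openness": [3.0, 4.0], "extraversion": [2.0, 5.0]}
predicted = {"openness": [3.5, 3.5], "extraversion": [2.5, 4.5]}
print(average_mae(predicted, actual))  # 0.5

# Hypothetical errors: going from 0.20 to 0.15 is a reduction of about 25%.
print(round(relative_reduction(0.20, 0.15), 2))  # 0.25
```

Unlike the correlations reported in the other settings, lower is better here, so a "25% smaller" error is an improvement.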

All the results show that the proposed method outperforms previous methods in a real life context, even when given a small amount of data.

Conclusions

This paper presented a novel method combining GloVe features with Gaussian Processes, which outperforms previous methods in predicting users’ Big-5 personality traits from their social media texts in a real-life context.

Although this method improves personality modeling, there is still room for improvement. For example, the model still needs to be trained on a large number of tweets. Future studies could train the model on a smaller number of tweets.

References

1. McCrae, R. R., and John, O. P. 1992. An introduction to the five-factor model and its applications. Journal of personality 60(2):175–215.

2. Schwartz, H. A., et al. 2013. Personality, gender, and age in the language of social media: The open-vocabulary approach. PloS one 8(9):e73791.

3. Yarkoni, T. 2010. Personality in 100,000 words: A large-scale analysis of personality and word use among bloggers. Journal of research in personality 44(3):363–373.

4. Burger, J. D., et al. 2011. Discriminating gender on Twitter. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. 1301–1309: Association for Computational Linguistics.

5. Pennington, J., Socher, R., and Manning, C. D. 2014. GloVe: Global Vectors for Word Representation. In EMNLP. 1532–1543.

6. Rasmussen, C. E. 2006. Gaussian processes for machine learning.

7. Ma, C., et al. 2015. Distributional Representations of Words for Short Text Classification. In Proceedings of NAACL-HLT. 33–38.

8. Yoshikawa, Y., Iwata, T., and Sawada, H. 2015. Non-Linear Regression for Bag-of-Words Data via Gaussian Process Latent Variable Set Model. In AAAI. 3129–3135.

9. Schwartz, H. A., et al. 2013. Personality, gender, and age in the language of social media: The open-vocabulary approach. PloS one 8(9):e73791.

10. Pennebaker, J. W., et al. 2015. The development and psychometric properties of LIWC2015. UT Faculty/Researcher Works.

Author: Yuanchao Li | Localized by Synced Global Team: Junpei Zhong