Inferring and analysing socioeconomic demographics of social media users
This post summarises three recent research papers that propose statistical Natural Language Processing frameworks for inferring socioeconomic attributes from social media (Twitter) user profiles. The attributes we have focused on are a user’s (a) occupational class, (b) income, and (c) socioeconomic status.
Driven by a quite mature social science hypothesis
Studies in sociology have deduced that social status influences facets of language (see Bernstein, 1960 or Labov, 1966). Different socioeconomic backgrounds may result in distinctive topics of discussion or even specific dialects. Taking this notion a step further, we hypothesise that language in social media may also be indicative of a user’s socioeconomic profile. For example, when it comes to posting on Twitter, we expect that middle-aged users with senior managerial roles will be (on average) more formal and less open than younger users, who are less established professionally and earn a lower income. If this is true, then we should be able to capture that relationship and derive a statistical map from a user’s text and online behaviour to a socioeconomic profile.
Using traditional nation-wide statistical schemes for formulating our inference tasks
To standardise the explored inference tasks we have used occupational groups, salary bands and socioeconomic status mappings proposed by the Office for National Statistics (ONS) in the United Kingdom. At the centre of all tasks resides the Standard Occupational Classification (SOC) taxonomy. The SOC taxonomy is a hierarchical structure that starts with 9 major occupation classes, denoted by a single digit (1 to 9), and then breaks down to 25 sub-major groups (denoted by 2 digits), 90 minor groups (denoted by 3 digits), and finally 369 unit groups (denoted by 4 digits). Occupations in upper classes require higher levels of education (i.e. a university degree or a Ph.D.), whereas occupations in the lower classes require a more elementary skill set. A snapshot of the SOC is provided below.
For the occupational class inference task, we create a labelled 9-class user data set by first mapping (manually) a user profile to a unit 4-digit job group, and then following up the SOC hierarchy to tag it with the corresponding major 1-digit group. So, for each user in our data set we obtain a major occupational class label (from 1 to 9).
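The step of following the hierarchy upwards can be sketched in a few lines. This is only an illustration and it assumes that each parent group's code is a prefix of its children's codes, which is what the digit scheme described above suggests; the user ids and unit codes in `annotations` are hypothetical, not taken from the data sets.

```python
def soc_ancestors(unit_code: str) -> dict:
    """Split a 4-digit SOC unit group code into its parent group codes.

    Assumes the parent code is always a prefix of the child code.
    """
    if len(unit_code) != 4 or not unit_code.isdigit():
        raise ValueError(f"expected a 4-digit SOC unit code, got {unit_code!r}")
    return {
        "major": unit_code[:1],      # 1 digit
        "sub_major": unit_code[:2],  # 2 digits
        "minor": unit_code[:3],      # 3 digits
        "unit": unit_code,           # 4 digits
    }


def major_class_label(unit_code: str) -> int:
    """Return the 1-digit major group (1 to 9) used as the class label."""
    return int(soc_ancestors(unit_code)["major"])


# Hypothetical manual annotations: user id -> 4-digit unit group code.
annotations = {"user_a": "2136", "user_b": "5414", "user_c": "9233"}
labels = {user: major_class_label(code) for user, code in annotations.items()}
print(labels)
```

The same helper also yields the 3-digit minor group, which is the level used for the income mapping described next.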
For the income inference task, we use ONS’ Annual Survey of Hours and Earnings to map a minor (3-digit) occupational group to the mean yearly income for the year 2013 in British pounds (GBP; £). Below you will find a corresponding snapshot for this mapping.
Finally, for the socioeconomic status inference task, we use a mapping from a unit (4-digit) group to a simplified socioeconomic class: upper, middle or lower. This mapping is encoded in another ONS tool, called the National Statistics Socio-Economic Classification (NS-SEC).
Note that the main laborious (manual) part in the process of creating labels for our data is the initial tagging of a user profile with a 4-digit job group. However, in order to create a large-scale data set for model training and experimentation one could also crowdsource this step, reducing human labour per capita to a minimum.
Social media user attributes
A social media user leaves a number of trails: textual information in posts, platform behaviour, perceived impact, profile information and so on. In our models, we have tried to incorporate all these features. We have also investigated the contribution of more advanced, but at the same time more approximate, user characteristics; we refer to them as perceived psycho-demographics, because they are themselves inferred from the social media user data. These include gender, age, political orientation, relationship status, religion, education, as well as sentiment and emotions expressed in the textual outputs.
In all our experimental setups, the topics of discussion consistently provided the best statistical traction. To form clusters of keywords (topics) based on Twitter content we compiled and utilised a held-out corpus containing millions of tweets. We first computed a word-by-word similarity matrix using a tweet as our context. The applied similarity metric was the Normalised Pointwise Mutual Information (NPMI) introduced by Bouma (2009). Then, as we were interested in obtaining hard topic clusters, we applied spectral clustering on the word-by-word similarity matrix. In other words, we performed Singular Value Decomposition (SVD) on the graph Laplacian of the NPMI word-by-word matrix (which is a graph after all).
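To make the similarity step concrete, here is a minimal sketch of NPMI with a tweet as the context window. It uses the standard definition, NPMI(x, y) = ln(p(x, y) / (p(x) p(y))) / (-ln p(x, y)), where the probabilities are estimated as the fraction of tweets containing a word or a word pair. The tokenisation (lowercasing and whitespace splitting) and the toy tweets are assumptions made for illustration, not the preprocessing used in the papers.

```python
import math
from collections import Counter
from itertools import combinations


def npmi_scores(tweets):
    """Return {(w1, w2): NPMI} for word pairs that co-occur in some tweet."""
    n = len(tweets)
    word_counts = Counter()
    pair_counts = Counter()
    for tweet in tweets:
        # Each tweet is one context; count a word at most once per tweet.
        words = sorted(set(tweet.lower().split()))
        word_counts.update(words)
        pair_counts.update(combinations(words, 2))

    scores = {}
    for (w1, w2), count in pair_counts.items():
        p_xy = count / n
        if p_xy == 1.0:
            # Both words appear in every tweet: NPMI is undefined (0 / 0).
            continue
        p_x = word_counts[w1] / n
        p_y = word_counts[w2] / n
        pmi = math.log(p_xy / (p_x * p_y))
        scores[(w1, w2)] = pmi / -math.log(p_xy)
    return scores


# Toy corpus (hypothetical tweets).
tweets = [
    "budget vote today",
    "vote on the budget",
    "lol sooo tired",
    "sooo tired today",
]
scores = npmi_scores(tweets)
print(scores[("budget", "vote")], scores[("today", "vote")])
```

Pairs that never co-occur are simply absent from the returned dictionary; by convention their NPMI is -1. Words that always appear together score 1, and words that co-occur as often as chance predicts score 0.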
Recently, there has been a growing interest in neural language models, where the words are projected into a lower dimensional dense vector space via a hidden layer (see Mikolov et al., 2013a and Mikolov et al., 2013b). We therefore also used the skip-gram (word2vec) model with negative sampling to learn word embeddings on a held-out Twitter reference corpus (an amazing implementation can be found in the gensim library). This time we replaced the NPMI metric with the cosine similarity between all pairs of word embeddings. As in the previous approach, we then applied spectral clustering on the derived word-by-word neural-cosine similarity matrix.
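The clustering step can be sketched with NumPy alone. The code below is an illustrative take on spectral clustering, not the exact procedure from the papers: it clips negative cosine similarities to zero to obtain a valid affinity matrix (an assumption), uses the symmetric normalised Laplacian, takes an eigendecomposition (which plays the role of the SVD mentioned earlier, since the Laplacian is symmetric), and finishes with a basic k-means using a deterministic farthest-point initialisation. The toy `embeddings` are hypothetical 3-dimensional vectors rather than real word2vec output.

```python
import numpy as np


def cosine_similarity_matrix(embeddings):
    """Pairwise cosine similarity between the rows of `embeddings`."""
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    return unit @ unit.T


def spectral_clusters(similarity, k, n_iter=20):
    """Hard-cluster items given a symmetric similarity matrix."""
    # Assumption: negative similarities are treated as "no edge".
    affinity = np.clip(similarity, 0.0, None)
    np.fill_diagonal(affinity, 0.0)
    degree = affinity.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(degree, 1e-12))
    # Symmetric normalised Laplacian: L = I - D^{-1/2} A D^{-1/2}.
    laplacian = np.eye(len(affinity)) - d_inv_sqrt[:, None] * affinity * d_inv_sqrt[None, :]
    # eigh returns eigenvalues in ascending order; keep the k smallest.
    _, eigvecs = np.linalg.eigh(laplacian)
    features = eigvecs[:, :k]
    features = features / np.maximum(
        np.linalg.norm(features, axis=1, keepdims=True), 1e-12
    )
    # Farthest-point initialisation followed by plain k-means.
    centres = [features[0]]
    for _ in range(1, k):
        dist_to_nearest = np.min(
            [((features - c) ** 2).sum(axis=1) for c in centres], axis=0
        )
        centres.append(features[dist_to_nearest.argmax()])
    centres = np.array(centres)
    for _ in range(n_iter):
        dists = ((features[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centres[j] = features[labels == j].mean(axis=0)
    return labels


# Hypothetical embeddings: rows 0-2 point one way, rows 3-5 another.
embeddings = np.array([
    [1.0, 0.1, 0.0], [0.9, 0.2, 0.0], [1.0, 0.0, 0.1],
    [0.0, 0.1, 1.0], [0.1, 0.0, 0.9], [0.0, 0.2, 1.0],
])
similarity = cosine_similarity_matrix(embeddings)
labels = spectral_clusters(similarity, k=2)
print(labels)
```

In practice the rows of `embeddings` would be the learned word vectors, and each resulting cluster would be read as one topic.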
In line with the revival of neural methods, the neural clusters improved the inference performance further. To give you an idea of the extracted topics, a snapshot with the most relevant ones in predicting a user’s income is given below (the last column holds a parameter, known as the length-scale, that is inversely proportional to a topic’s relevance in a prediction; the smaller, the more relevant). Talking about politics or Non-Governmental Organisations (NGOs), or even using swear words, are among the most income-predictive discussion themes.
Nonlinear learning using Gaussian Processes
Gaussian Processes (GPs) provide a powerful, adaptive, nonparametric and nonlinear (some may also add Bayesian, but this is not always true) modelling framework. A Gaussian Process is defined by a mean function (on the input space) and a covariance function (or kernel) on pairs of inputs. Together, these two functions model the target variables, any finite number of which have a joint multivariate Gaussian distribution (Rasmussen and Williams, 2006). Throughout our experiments, we have seen that GPs were better able to capture the multimodal feature space we have been operating on, while also providing a significant level of interpretability (e.g. by looking at the length-scale parameters in a covariance function) that other strong (nonlinear) learners do not. On top of this, GPs are very straightforward to try (MATLAB library, Python library) and, most importantly, are modular enough to host creative ideas. For example, valid GP kernels can be added and multiplied to create new kernels that are themselves valid GP kernels.
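To illustrate how length-scales carry interpretability, here is a minimal NumPy sketch of GP regression with a squared-exponential kernel that has one length-scale per input dimension. The kernel form, the fixed hyperparameter values and the synthetic data are all assumptions made for illustration; in a real model the length-scales would be learned from the data rather than set by hand.

```python
import numpy as np


def se_ard_kernel(xa, xb, length_scales, variance=1.0):
    """Squared-exponential kernel with one length-scale per input dimension."""
    diff = (xa[:, None, :] - xb[None, :, :]) / length_scales
    return variance * np.exp(-0.5 * (diff ** 2).sum(axis=2))


def gp_posterior_mean(x_train, y_train, x_test, kernel, noise=1e-2):
    """Posterior mean of a zero-mean GP with Gaussian observation noise."""
    k_train = kernel(x_train, x_train) + noise * np.eye(len(x_train))
    k_cross = kernel(x_test, x_train)
    return k_cross @ np.linalg.solve(k_train, y_train)


def kernel(a, b):
    # Small length-scale on dimension 0 (relevant), very large length-scale
    # on dimension 1 (effectively ignored by the model).
    return se_ard_kernel(a, b, length_scales=np.array([0.7, 1e4]))


# Synthetic data: the target depends on the first input dimension only.
rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=(40, 2))
y = np.sin(2 * x[:, 0])

# Two test points that differ only in the irrelevant dimension.
x_new = np.array([[0.5, -1.5], [0.5, 1.5]])
pred = gp_posterior_mean(x, y, x_new, kernel)
print(pred)
```

Because the second dimension has a very large length-scale, moving along it barely changes the prediction. This is the sense in which a small length-scale marks a feature as relevant.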
Across all tasks we have obtained promising performance figures. The nonlinear GP models were consistently performing better than broadly applied solvers, such as Support Vector Machines (SVM) using the Radial Basis Function kernel and regularised Logistic Regression.
Briefly, in inferring the occupational class of a user (9-way classification), we reached a 52.7% accuracy using the GP-based model on 200 neural topics (see below). Note that (a) the SVM provided a 1% lower performance as well as a model that is hard to interpret, and (b) for this task non-textual user attributes did not have significant predictive power.
For the income inference, I only show the figures for the best performing GP model across the various feature sets as well as for a combination of all features in a linear ensemble (see below). The best Mean Absolute Error (MAE) is obtained when all features are combined (GBP 9,535), but it does not differ much from the one obtained when discussion topics were used alone (GBP 9,621).
Finally, in the 3-way socioeconomic status classification task (see above), we obtained an accuracy of 75.09%. When we converted this task into a binary classification by merging the users with a lower and middle socioeconomic status, the classification accuracy increased to 82.05%. For the record, the GP classifier used in the socioeconomic status inference combined all user feature categories by first defining a covariance function for each one of them, and then taking the sum of these covariance functions, i.e. optimising a feature combination inside the GP model.
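The idea of summing one covariance function per feature category can be sketched as follows. This is an illustration under stated assumptions: the category names, their dimensionalities, the isotropic squared-exponential kernels and the fixed length-scales are all hypothetical, and the sketch only builds the combined covariance matrix rather than training a GP classifier.

```python
import numpy as np


def se_kernel(xa, xb, length_scale):
    """Isotropic squared-exponential kernel."""
    sq_dist = ((xa[:, None, :] - xb[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-0.5 * sq_dist / length_scale ** 2)


def combined_kernel(features_a, features_b, length_scales):
    """Sum of one kernel per feature category (a sum of kernels is a kernel)."""
    total = 0.0
    for name, length_scale in length_scales.items():
        total = total + se_kernel(features_a[name], features_b[name], length_scale)
    return total


# Hypothetical feature categories for 30 users.
rng = np.random.default_rng(1)
n_users = 30
users = {
    "topics": rng.random((n_users, 5)),
    "behaviour": rng.random((n_users, 3)),
    "profile": rng.random((n_users, 2)),
}
# Fixed here for illustration; a trained model would optimise these.
length_scales = {"topics": 0.5, "behaviour": 1.0, "profile": 2.0}

K = combined_kernel(users, users, length_scales)
print(K.shape, np.linalg.eigvalsh(K).min())
```

Since each category keeps its own hyperparameters, fitting them jointly lets the model weigh the categories against each other inside a single GP.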
A qualitative analysis provides interesting insights
Discussion topics have been the strongest predictor of the investigated user demographics. Thus, it is no surprise that users with adjacent occupational classes have similar topic distributions. This is confirmed by measuring and visualising the Jensen-Shannon Divergence (the smaller the more similar) between the topic distributions of all class pairs (see the heat-map on the left). Expected user class clusters emerge; I have circled some of them.
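For reference, the Jensen-Shannon Divergence between two topic distributions can be computed as in the sketch below. It uses base-2 logarithms, so the value lies between 0 (identical distributions) and 1 (no overlap). The three toy distributions are hypothetical and only show the expected pattern of adjacent classes being closer than distant ones.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon_divergence(p, q):
    """Symmetric divergence based on the mixture m = (p + q) / 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical topic distributions over three topics.
class_1 = [0.50, 0.30, 0.20]
class_2 = [0.45, 0.35, 0.20]
class_9 = [0.10, 0.20, 0.70]

print(jensen_shannon_divergence(class_1, class_2))
print(jensen_shannon_divergence(class_1, class_9))
```

Computing this value for every pair of classes gives the symmetric matrix that a heat-map visualises.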
Following up on topics, and focusing on those with the highest predictive relevance in the user occupation classification task, we visualise a topic’s Cumulative Distribution Function (CDF) across the users of the 9 occupational classes. A CDF indicates the fraction of users whose tweets contain at most a certain proportion of a topic. Visually, a topic is more dominant in an occupational class if the CDF line leans towards the bottom-right corner of the plot, since fewer users in that class fall below any given topic proportion. In the figures on the left, you see that the topic of Higher Education is more prevalent in SOC classes 1 and 2, but is also discriminative of classes 3 and 4 from the rest. This is expected because the vast majority of jobs in these classes require a university degree or are actually jobs in higher education.
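A small sketch of an empirical CDF may help with reading such plots. It follows the standard definition, i.e. the fraction of users whose topic proportion is at or below a given value; the per-user topic proportions for the two classes are hypothetical.

```python
from bisect import bisect_right


def empirical_cdf(values):
    """Return F, where F(x) is the fraction of values that are <= x."""
    ordered = sorted(values)
    n = len(ordered)

    def cdf(x):
        return bisect_right(ordered, x) / n

    return cdf


# Hypothetical per-user proportions of one topic in two occupational classes.
class_2_users = [0.02, 0.05, 0.08, 0.10]
class_8_users = [0.00, 0.01, 0.01, 0.03]

cdf_class_2 = empirical_cdf(class_2_users)
cdf_class_8 = empirical_cdf(class_8_users)
print(cdf_class_2(0.03), cdf_class_8(0.03))
```

At the same topic proportion the first class has the lower CDF value, so its curve sits further towards the bottom-right, which is the visual signature of a topic being more dominant in that class.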
By examining the topic of Arts, we see that it clearly separates class 5 from all other classes. Class 5 is indeed the class that enlists artistic professions. Hence, this observation additionally provides a good proof-of-concept.
Finally, the topic of Elongated Words (a.k.a. Twitter slang) is more prevalent in the lower occupational classes.
Moving to income modelling, we looked at the relationship between various inferred user demographics and the corresponding perceived income (see the figure below). This was somewhat required in order to validate that our data, especially the inferred demographic attributes, were capturing reasonable trends (even when representing the population of Twitter users). Indeed, our data confirmed (our world is mostly unfair) that income (a) increases with age, (b) is higher for higher levels of education, and (c) is lower for females and African Americans on average.
We then looked at the relationship between the most relevant topics in the inference of user income and income itself (see the figure below). Apart from the better performing nonlinear trend based on a GP model, we also plotted a linear one (obtained via regularised logistic regression) to showcase examples where a linear model is less flexible in capturing this relationship. We generally observe that users talk more about Politics, NGOs, and Corporate themes as their income gets higher. On the other hand, the opposite relationship is present for the use of swear words.
We also performed the same analysis for sentiment and emotions vs. user income (see the figure below). Our analysis unveils that neutral sentiment increases with income, while both positive and negative sentiment decrease, i.e. lower income users are probably more subjective. In addition, the emotions of anger and fear are more present in users with higher income, while sadness, surprise and disgust are more associated with lower income.
The automatic inference of user demographics is useful. Why?
I can infer that some of you have second thoughts:
Is this type of modelling useful? Does it violate user privacy?
The mainstream answer applies here as well: it depends on how a research development is going to be utilised.
The good side of things includes that these methods can (a) provide dynamic, timely and low-cost demographic information complementing the traditional time-consuming and expensive approaches, (b) support large-scale (computational) social science findings, and (c) enhance numerous tasks that focus on particular stratifications of the population, such as health surveillance or social services. Of course, there are also a number of commercial downstream applications that can stem from this, but this is not my main research driver.
Finally, evil applications may arise, but (a) such tendencies will not be stopped by just blocking this line of research, and (b) it is really up to our societies to safeguard user rights on those occasions.
Download the data sets
The (aggregated) data sets used in all the above research efforts have been made publicly available. Below you can find direct links to each one of them. Please refer to the “Data” sections of the corresponding research papers to read their complete description.
- Occupational class task data set (ACL, 2015): 5,191 Twitter users (10,796,836 tweets) and corresponding features
- Income task data set (PLOS ONE, 2015): Same as the previous data set, but with additional user features (e.g. additional perceived user demographics)
- Socioeconomic status task data set (ECIR, 2016): A different data set of 1,342 Twitter users (2,082,651 tweets) and corresponding features
- D. Preotiuc-Pietro, V. Lampos and N. Aletras. An analysis of the user occupational class through Twitter content. ACL, 2015. [ data ]
- D. Preotiuc-Pietro, S. Volkova, V. Lampos, Y. Bachrach and N. Aletras. Studying User Income through Language, Behaviour and Affect in Social Media. PLOS ONE, 2015. [ data ]
- V. Lampos, N. Aletras, J. K. Geyti, B. Zou and I. J. Cox. Inferring the Socioeconomic Status of Social Media Users Based on Behaviour and Language. ECIR, 2016. [ data ]