Fun with large-scale tweet analysis

Enryu
12 min read · May 2, 2023

Background

This experiment started with a discussion in a group of friends about political polarization, in particular, whether the US is more polarized than other countries. I decided that rather than indulging in baseless speculation, something like this should be possible to measure using real data. As for the choice of the data source: if we’re interested in political opinions, the obvious option is Twitter.

This article describes the process of collecting and analyzing a rather large dataset of tweets & users. The dataset is also published here (tweets, users), so that people can run their own experiments on it if they want.

Data collection

Surprisingly, there are no readily available datasets for this experiment. There is a catalog of publicly available datasets, but all of them have one or more of the following problems:

  • Heavy bias (e.g. only English tweets about Covid);
  • Tiny size (datasets with just 1 million tweets definitely won’t be useful for the analysis we have in mind);
  • Ancient (datasets from 2010 might not reflect the reality in 2023);
  • Not enough information: if we want to estimate political polarization, we need users associated with the tweets, and have many tweets per user.

So it seems the existing datasets don’t cut it, and we need to scrape Twitter on our own. The important property we want is many tweets per user, so it makes sense to start by sampling the users, and then sampling the tweets for each user.

Sampling users

How to collect a representative sample of users? There are many ways, the most obvious one is to rely on some existing samples. In particular, CommonCrawl can be considered a representative importance sample of all URLs on the internet (with caveats and biases, but better than most alternatives).

So we go with CommonCrawl, take the latest snapshot, extract Twitter URLs from it, and collect the set of corresponding users. Unfortunately, this yields only around 80,000 distinct users. Can we get more? Luckily, CommonCrawl provides a list of historical snapshots, so we can take a bunch of snapshots from the last 2 years and take the union of the sets of Twitter users in them. Of course, the snapshots overlap a lot, so taking more yields diminishing returns.
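For reference, below is a rough sketch of how usernames can be pulled out of a snapshot via CommonCrawl’s public CDX index API. The snapshot id, the page limit and the username regex are my own illustrative assumptions; the actual extraction could just as well go through the raw WAT/WARC files instead.

```python
# Hedged sketch: extract Twitter usernames from one CommonCrawl snapshot via the
# public CDX index API. Snapshot id, page limit and regex are illustrative.
import json
import re
import requests

CDX = "https://index.commoncrawl.org/CC-MAIN-2023-14-index"  # example snapshot
USER_RE = re.compile(
    r"^https?://(?:www\.|mobile\.)?twitter\.com/([A-Za-z0-9_]{1,15})(?:[/?].*)?$")
RESERVED = {"i", "intent", "search", "home", "hashtag", "share"}

def twitter_users_in_snapshot(max_pages=5):
    users = set()
    for page in range(max_pages):  # a real run iterates over all index pages
        resp = requests.get(CDX, params={
            "url": "twitter.com/*", "output": "json", "page": page})
        if resp.status_code != 200:
            break
        for line in resp.text.splitlines():
            record = json.loads(line)
            m = USER_RE.match(record.get("url", ""))
            if m and m.group(1).lower() not in RESERVED:
                users.add(m.group(1).lower())
    return users
```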

After doing this, we end up with a set of around 200k users, and we need to scrape tweets for them.

Scraping tweets

A priori, I expected this step to be painful: Twitter is annoyingly protective of its data, showing a login wall even to normal users.

Reality turned out to be much nicer: apparently, there exists the SNScrape Python library, which handles scraping tweets very well. With it, scraping the tweets is quite straightforward: we just parallelize it to use multiple IP addresses, and limit ourselves to the latest 1000 tweets per user in case some users have too many tweets. Scraping fails for some users (for various reasons, e.g. the account is deleted), but that’s OK.
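A minimal sketch of the per-user scraping step, assuming snscrape’s Python API (snscrape.modules.twitter); error handling and the multi-IP parallelization are simplified away:

```python
# Fetch up to the latest 1000 tweets for one user with snscrape.
import itertools
import snscrape.modules.twitter as sntwitter

MAX_TWEETS_PER_USER = 1000

def scrape_user(username):
    """Return up to MAX_TWEETS_PER_USER most recent tweets for one user."""
    try:
        scraper = sntwitter.TwitterUserScraper(username)
        return list(itertools.islice(scraper.get_items(), MAX_TWEETS_PER_USER))
    except Exception:
        # Some accounts fail (deleted, suspended, protected) -- just skip them.
        return []
```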

As a result, we get a dataset of around 150K users and 90M tweets. We publish it here (tweets, users), in case someone else wants to run more experiments with it.

First look at the data

Looking at the number of tweets per user, we see that 50% of them are at the scraping cap (1000 tweets), meaning that our sample is biased towards more active users, which is not surprising:

Distribution of the number of tweets (X axis) per user

Inspecting the data, some users seem to be commercial accounts of organizations, but the majority are real humans (not bots). Around 55% have something specified in the field “location”, which is going to be useful for our country-based slicing. Most tweets are in English, but other popular languages are decently represented.

After inferring user country (more about it later), we can also look at the distributions per country:

Heatmap of the number of users per country
Top-20 countries by user count

Overall, no obvious anomalies stand out, so we can go to a more detailed analysis and try to draw some conclusions.

Analysis

Country assignments

Before doing any analysis, we should assign countries to users based on the specified location. This seems to be an easy task, but it turned out to be perhaps the most painful and imprecise part of the whole analysis.

Let’s look at some of the most popular location strings in the dataset:

They look sane, but we already see one of the upcoming issues: not all of them are in English. Moreover, less popular locations are less sane.

Nevertheless, let’s try to work with what we have. There is a library in Python called geograpy which is supposed to extract geo locations from texts, and it kind of works, but apparently only for English. There are many suggestions on the internet on what to do for other languages, e.g. using spaCy, but it is all on a language-by-language basis, and I couldn’t find a solution which just works for any text in any language, which is weird.

So we have to be creative. Let’s first detect the language of the location string. There is a library called langdetect, but it performs very poorly on the location strings (really poorly, e.g. assigning many obviously English locations to random languages). The language identification model from fastText works much better, although still not perfectly. Still, we go forward with it.

Now we can identify the language of the location string, and extract locations from English strings. Can we just translate the location? Libraries like googletrans are out of the question (since we need to translate lots of strings, and this would yield a lot of requests to an external service). Can we use some local models? Among the top models on the Huggingface hub tagged with translation, there are two clusters:

  • Lots of “Helsinki-NLP” models, each model is for a specific language pair (so it doesn’t fit our use-case);
  • Variations of Google’s general-purpose T5. I tried it, and it didn’t work well for the translation use-case (even the bigger and “flan” versions).

After trying a bunch of models, I stumbled upon MBart50, which worked OK-ish for our use-case (although still not optimal). We can use it for translation, and the final flow is the following (a rough sketch in code follows the list):

  • Detect the language with the fastText model;
  • If it is not English, translate to English with MBart;
  • Extract countries using geograpy, and take the first one.
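Here is a minimal sketch of that flow, assuming fastText’s lid.176.bin language-ID model, the facebook/mbart-large-50-many-to-many-mmt checkpoint and geograpy; the fastText-to-MBart language-code mapping below is a tiny illustrative subset, not the full table:

```python
# Country assignment sketch: fastText language ID -> MBart50 translation -> geograpy.
import fasttext
import geograpy
from transformers import MBart50TokenizerFast, MBartForConditionalGeneration

lid = fasttext.load_model("lid.176.bin")
MBART = "facebook/mbart-large-50-many-to-many-mmt"
mbart = MBartForConditionalGeneration.from_pretrained(MBART)
mbart_tok = MBart50TokenizerFast.from_pretrained(MBART)

# Illustrative (incomplete) mapping from fastText ISO codes to MBart50 codes.
FT_TO_MBART = {"de": "de_DE", "fr": "fr_XX", "es": "es_XX", "nl": "nl_XX", "tr": "tr_TR"}

def location_to_country(location: str):
    # 1. Detect the language of the raw location string.
    labels, _ = lid.predict(location.replace("\n", " "))
    lang = labels[0].replace("__label__", "")

    # 2. If not English (and we know the code), translate to English with MBart50.
    text = location
    if lang != "en" and lang in FT_TO_MBART:
        mbart_tok.src_lang = FT_TO_MBART[lang]
        inputs = mbart_tok(location, return_tensors="pt")
        out = mbart.generate(**inputs,
                             forced_bos_token_id=mbart_tok.lang_code_to_id["en_XX"])
        text = mbart_tok.batch_decode(out, skip_special_tokens=True)[0]

    # 3. Extract countries from the (now English) string and keep the first one.
    places = geograpy.get_place_context(text=text)
    return places.countries[0] if places.countries else None
```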

Results of this process are not perfect, with each step introducing a bunch of failures (e.g. Geograpy assigning obviously non-US cities to the US, because apparently there are cities with the same names in the middle of nowhere in the US; one large failure pattern: many of the people from the Netherlands ended up assigned to the US).

Nevertheless, country assignments look sane on average, and the added noise shouldn’t hurt the following analysis too much.

Toy analysis: sentiment

Before going into something complicated like political polarization, let’s try something simple: tweet sentiment. There are plenty of very good models available for this; we’ll use this one. We score every tweet to get its sentiment, then compute an average sentiment for each user. We’ll use it later in the other analysis, but for now, let’s aggregate it to the country level (averaging with smoothing, to prevent countries with very few users from showing up as outliers):
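For scoring, something along these lines works (the exact sentiment model is linked above rather than named; the multilingual cardiffnlp/twitter-xlm-roberta-base-sentiment checkpoint is used as a stand-in here, and the smoothing constant is an arbitrary choice):

```python
# Sketch: per-tweet sentiment scores and smoothed per-country averages.
from collections import defaultdict
from transformers import pipeline

clf = pipeline("sentiment-analysis",
               model="cardiffnlp/twitter-xlm-roberta-base-sentiment")
# Map the model's labels onto a [0, 1] score (label names assumed).
LABEL_SCORE = {"negative": 0.0, "neutral": 0.5, "positive": 1.0}

def tweet_sentiment(texts):
    return [LABEL_SCORE[r["label"].lower()] for r in clf(texts, truncation=True)]

def smoothed_country_means(user_score, user_country, alpha=20.0):
    """Average user sentiment per country, shrunk towards the global mean so
    countries with very few users don't show up as outliers."""
    global_mean = sum(user_score.values()) / len(user_score)
    sums, counts = defaultdict(float), defaultdict(int)
    for user, score in user_score.items():
        country = user_country.get(user)
        if country is None:
            continue
        sums[country] += score
        counts[country] += 1
    return {c: (sums[c] + alpha * global_mean) / (counts[c] + alpha) for c in sums}
```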

Sentiment by country (darker values mean more positive sentiment)

Some patterns are expected, some are very surprising (e.g. the most positive countries are Japan and Saudi Arabia (???)).

Political affiliation: naive attempt

Before doing something more principled, let’s try something very naive (and biased!): for each tweet, we ask a language model if it is about politics, and if it is — whether it is “conservative” or “liberal”. We use zero-shot classification with fine-tuned DeBERTa for these purposes, which works reasonably well in all languages.
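A sketch of how the zero-shot step can look; the exact fine-tuned DeBERTa checkpoint isn’t named in the text, so the multilingual MoritzLaurer/mDeBERTa-v3-base-mnli-xnli model serves as a stand-in:

```python
# Zero-shot classification: "is it about politics?" and the naive lean score.
from transformers import pipeline

zsc = pipeline("zero-shot-classification",
               model="MoritzLaurer/mDeBERTa-v3-base-mnli-xnli")

def politics_score(tweet: str) -> float:
    res = zsc(tweet, candidate_labels=["politics", "not politics"])
    return res["scores"][res["labels"].index("politics")]

def liberal_score(tweet: str) -> float:
    res = zsc(tweet, candidate_labels=["liberal", "conservative"])
    return res["scores"][res["labels"].index("liberal")]
```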

For the former (understanding if the tweet is about politics) it performs decently well. Of course it is not very principled (perhaps the model just memorized all political topics, and if there is some niche political topic in some small country which is dissimilar to everything else, it’ll miss it), but let’s go with it for now.

For the latter (“conservative” vs “liberal”), it is hilariously bad. E.g. the following statement is classified as “conservative” (and funnily enough, ChatGPT agrees):

Individual freedom and personal responsibility are the foundation of our society.

It is quite different from what Wikipedia thinks of liberalism. Probing more, it seems that according to the model, “conservative” roughly matches the positions of the Republican party in the US, and “liberal” those of the Democratic party. This is quite bad and US-centric, but let’s go with it for the naive attempt. Let’s compute country averages, similarly to the sentiment case:

“Liberal” vs “conservative” by country (darker is more “liberal”)

We see that Canada (and Kenya??) is more aligned with the US Democratic party than the US itself, while Turkey, Poland and Russia are aligned with the US Republican party.

But what about polarization? We can look at the distribution of user-level averages (0 means fully “conservative”, 1 fully “liberal”), and the US is indeed slightly more spread out towards the extremes compared to the global background, although most users are in the middle:

Distribution of the user-level “liberal”/”conservative” score (left: USA, right: everything)

We cannot draw many conclusions based on this, in particular, because this “liberal”/”conservative” score is very bad. Moreover, trying to summarize individual political opinions as a single score is very naive to begin with. Can we do better?

Building political clusters

We can use the “is the tweet about politics” score from the previous section, but can we do something better about political affiliation? The idea is the following: let’s take all tweets about politics, compute their embeddings, and cluster them. Each cluster will represent a political opinion, and each user then will have a set of political opinions. We can analyze these sets to quantify polarization.

First, let’s compute the embeddings of all of our tweets. We use sentence-transformers’ MiniLM model for this; its embeddings aren’t very nuanced, but good enough for our purposes.
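A sketch of the embedding step; the text only says “MiniLM”, so the all-MiniLM-L6-v2 checkpoint (which produces the 384-dimensional vectors mentioned below) is an assumption:

```python
# Embed tweets with a sentence-transformers MiniLM model (384-dim vectors).
from sentence_transformers import SentenceTransformer

tweet_texts = ["just voted!", "new coffee place downtown is great"]  # placeholder tweets

encoder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = encoder.encode(tweet_texts, batch_size=256,
                            show_progress_bar=True)  # shape: (n_tweets, 384)
```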

Once we get all of the embeddings, we cluster them. Apparently clustering 90 million 384-dimensional vectors is a challenging task, so we subsample (uniformly) to 20 million, and use FAISS’ implementation of KMeans (k=20000) to cluster it. We then use the zero-shot “is politics” scores to identify clusters talking about politics (ending up with 927 political clusters in total).
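A sketch of the clustering step with FAISS’ KMeans, assuming embeddings is the (subsampled) float32 matrix of MiniLM vectors from the previous snippet:

```python
# Cluster tweet embeddings with FAISS KMeans and assign every tweet to a centroid.
import numpy as np
import faiss

d, k = 384, 20000
xb = np.ascontiguousarray(embeddings, dtype="float32")

kmeans = faiss.Kmeans(d, k, niter=20, verbose=True, seed=42)
kmeans.train(xb)

# Assign tweets (in the full dataset, not just the training subsample) to centroids.
_, cluster_ids = kmeans.index.search(xb, 1)
cluster_ids = cluster_ids.ravel()
```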

Some examples of random resulting clusters (10 tweets from each):

Some examples of political clusters:

As a side effect, we can also assign new tweets/statements to existing clusters (by embedding them and looking for the nearest centroid), and discover interesting corners of twitter, but this is beyond the scope of this post.

Measuring polarization

Now that we have the clusters, we can assign each tweet to a cluster, and then build a vector for each user: V[user, cluster] = how many tweets does the user have from this cluster. We discard non-political clusters, since we want to measure political polarization.
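One way to build this user-by-cluster matrix as a sparse matrix; tweet_user_ids, cluster_ids and the set of political cluster ids are assumed to come from the previous steps:

```python
# Build the sparse user x political-cluster count matrix V.
import numpy as np
from scipy.sparse import csr_matrix

def user_cluster_matrix(tweet_user_ids, cluster_ids, political_clusters):
    users = sorted(set(tweet_user_ids))
    clusters = sorted(political_clusters)
    user_idx = {u: i for i, u in enumerate(users)}
    cluster_idx = {c: j for j, c in enumerate(clusters)}

    rows, cols = [], []
    for u, c in zip(tweet_user_ids, cluster_ids):
        if c in cluster_idx:  # keep only political clusters
            rows.append(user_idx[u])
            cols.append(cluster_idx[c])
    data = np.ones(len(rows), dtype=np.float32)
    V = csr_matrix((data, (rows, cols)), shape=(len(users), len(clusters)))
    return V, users, clusters
```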

The next step is to define a metric on this vector space, and this step is surprisingly non-trivial. Metrics like L2 are very bad for our use-case: most of the tweets are from the US, so most of the political clusters are from the US as well. If we compare users from a country where the language and political topics are different from the US, they will all have zeros in the dimensions corresponding to the US clusters, making the distances between users in that country much smaller than in the case of the US.

Jaccard similarity solves this issue, but another (smaller) problem remains: it is sensitive to subsampling. Imagine we compare a user with a 10% sample of the same user. For our use-case, we want to say that these are quite similar, but both L2 and Jaccard would say they’re quite far apart. Refining the idea of the L2-to-Jaccard transition, we can consider cosine similarity: like Jaccard, it doesn’t care about padding vectors with redundant zeros, but it is also less sensitive to set size imbalance. So it is going to be our metric of choice.

Let’s compute t-SNE on a sample of users with this metric, and eyeball the result: interactive link. We can do the same for a subset of users, e.g. USA: interactive link. Color is sentiment; one can notice that for the global plot, users are clustered by countries, as expected. It is also easy to find some failure patterns of country assignment (e.g. Netherlands cluster on the US plot).
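The visualization itself is standard; below is a sketch with scikit-learn’s t-SNE on a user sample (the sample size, perplexity and the choice of sklearn over another t-SNE implementation are assumptions), with V being the sparse matrix from the earlier snippet:

```python
# 2-D t-SNE of a sample of users, using cosine distance on the cluster-count vectors.
from sklearn.manifold import TSNE

sample = V[:20000].toarray()  # t-SNE only needs (and only handles) a modest sample
xy = TSNE(n_components=2, metric="cosine", init="random",
          perplexity=30, random_state=0).fit_transform(sample)
```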

Overall, it looks like there is significant spread in the political opinions both worldwide and in the US, so the pessimistic picture of all people participating in the “ultra-right” vs “ultra-left” battle seems to be unfounded.

Before computing the final polarization, let’s look at the data a bit more:

  • 63% of users have at least 10 political tweets (we’ll consider only these users for the purposes of polarization computation);
  • The same number for US users is 66%. Apparently Twitter is very political, as expected;
  • Pearson correlation between “liberal” score from one of the previous sections and sentiment is 0.17 in the US and 0.1 globally. Apparently “liberal” people are slightly more positive than “conservative” (with all the aforementioned caveats with respect to the meaning of these scores).

Finally, let’s compute polarization. What would it mean for a country to be polarized? In the context of the US, people usually mean that there are 2 clusters, and each person is close to one cluster or the other, with nobody in between or far away from both. Note that the constant “2” can be changed, giving the “Polarization@K” metric.

Let’s formalize this intuition: given the constant K, we compute K centers using KMeans clustering, then for each user we compute the cosine similarity to the nearest centroid (cosine, for the reasons discussed above), and use the average of these values as our metric, discarding the 5% lowest/highest values to reduce the contribution of outliers. The metric ranges from 0 (very spread out) to 1 (all users are precisely at the centers of the clusters).

Note that this definition is very data-hungry: to get statistically significant results, we need a lot of users per country (especially for higher K), so we’d need a much bigger dataset to compute polarization for small countries.
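A sketch of this Polarization@K definition: KMeans on L2-normalized user vectors, then a trimmed mean of each user’s cosine similarity to the nearest centroid. V_country is the dense user-by-political-cluster matrix for one country, and the exact clustering and trimming details here are my own filling-in:

```python
# Polarization@K: trimmed mean of cosine similarity to the nearest of K centers.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

def polarization_at_k(V_country, k, trim=0.05, seed=0):
    X = normalize(V_country, norm="l2")            # unit vectors -> cosine geometry
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
    centroids = normalize(km.cluster_centers_, norm="l2")
    sims = np.sort((X @ centroids.T).max(axis=1))  # similarity to nearest center
    lo, hi = int(trim * len(sims)), int((1 - trim) * len(sims))
    return float(sims[lo:hi].mean())               # 1 = users at centers, 0 = spread out
```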

OK, let’s finally compute the metric. Below are the heatmaps for K=1, 2, 10 (darker colors = higher polarization):

Polarization@1
Polarization@2
Polarization@10

So it seems that the US is one of the least polarized countries in the world! There is however a caveat: Pearson correlation of the metric and log-number of users in the dataset is strongly negative (-0.38). I cannot see why the metric itself would be biased, but there are many steps in the process which could add some biases (e.g. if the country inference assigns foreign clusters to large countries, it’ll reduce their polarization). As a quick hack, we can adjust Polarization@2 to de-correlate it with the number of users, although it is shaky and not principled at all:
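One way to implement this quick hack is to regress the per-country Polarization@2 on the log number of users and keep the (recentered) residuals, so the adjusted metric has roughly zero correlation with dataset size; the linear form of the correction is an assumption:

```python
# Size adjustment: remove the linear dependence of polarization on log(#users).
import numpy as np

def size_adjusted(polarization, num_users):
    p = np.asarray(polarization, dtype=float)
    x = np.log(np.asarray(num_users, dtype=float))
    slope, intercept = np.polyfit(x, p, 1)         # fit p ~ slope * log(n) + intercept
    return p - (slope * x + intercept) + p.mean()  # residuals, recentered
```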

Size-adjusted Polarization@2

This at least kind of matches human intuition, but the number-of-users adjustment is fishy. Nevertheless, even here the US is far from being the most polarized country in the world.

Bonus: more fun with the dataset

More can be done with the dataset, as it is the only (to my knowledge) open tweet dataset with many tweets per user. I’ll list a few directions without publishing the underlying models, as they are out of scope for this post:

  • Cloning Twitter users with LLMs. I tried fine-tuning LLaMA-7B with LoRA, and the results are quite nice: it is doable quickly on commodity hardware, and it replicates users’ styles very accurately;
  • Exploring “dark corners” of Twitter: typing random statements and finding closest clusters of tweets is a lot of fun, I found parts of Twitter I’ve never thought existed;
  • Extracting the social graph. While the data doesn’t have followers, a sparse subgraph can be extracted by looking at mentions in tweets. So far I haven’t thought of anything interesting to do with it afterwards.
  • … certainly there are more things to be done/experiments to run.

Links and references
