Data Science Tutorial

What Makes a Politician Popular on Social Media?

K-Means Clustering on Twitter Politicians

Doug Rizio

Published in

Nerd For Tech

16 min read · May 5, 2021

When social media platforms like Facebook and Twitter were first created a decade and a half ago, few could have predicted how influential they would ultimately become in shaping the political landscape of the modern world. [1]

Today, however, just about every politician or public figure recognizes the importance of an online presence. Barack Obama is sometimes known as the first social media President, having leveraged the burgeoning power of Facebook to expand his reach beyond the scope of traditional media, quickly cultivating his image from a relatively anonymous senator to a national icon of hope and change in 2008. [2]

Eight years later, Donald Trump took Twitter by storm, staging an unprecedented campaign that owed much of its momentum to an unrelenting stream of tweets that made his name one of the most mentioned words on the internet, and transformed his lot in life from a real estate magnate and reality TV star to the most powerful man on the planet. [3]

Now, no one can refute the value of digital PR. But how exactly does one use social media to their best advantage, particularly in the realm of politics? Is it all about the followers? Or is it more about making friends with strangers? Perhaps one must be linked to many social media pages or lists. Maybe it’s as simple as liking lots of posts made by other people, or making as many posts of your own as you can.

What if we could measure exactly what makes a politician popular online? If we can cluster the most popular US politicians into groups to analyze what social media statistics determine their successes in the virtual world, then we might be able to help the next generation of political hopefuls come up with their own strategies for becoming civil servants.

K-MEANS CLUSTERING

If a cluster is defined as a selection of data points grouped together because of certain similarities, then k-means clustering can be defined as a method of partitioning a dataset into a fixed number of clusters, k. Clustering data points can help us to point out trends in data that we might not be able to detect otherwise, and visualizing how they are grouped together can give us a clearer understanding of the set as a whole. By default, k-means clustering relies on Euclidean distance to determine similarity: the straight-line distance between two points in the feature space. The positions of the clusters in this space are determined by their centroids, which are the mean positions of the points assigned to each cluster. One caveat is that k has to be chosen in advance. While we can arbitrarily decide on any number of centroids or clusters to measure our dataset by, some numbers fit the data better than others. [4]

In order to identify the best number k of clusters, we can use a popular heuristic known as the elbow method, which runs k-means clustering on the dataset for a range of k values, calculates the sum of squared errors (SSE) for each value, and plots the SSE against k as a line. While we do want a relatively small SSE, SSE also decreases towards 0 as we increase k, and it reaches 0 once every distinct point has its own cluster. Therefore, the best compromise is somewhere in the middle, where the line bends from a steep drop to a gentle slope. That point is called the elbow of the graph. [5]
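To make the SSE idea concrete, here is a minimal sketch that computes it by hand with NumPy. The four points and the function name `sum_of_squared_errors` are made up for illustration and are not part of the project code. The cluster assignments are fixed by hand rather than found by k-means, so the only thing on display is how the error shrinks as k grows.

```python
import numpy as np

def sum_of_squared_errors(points, labels, centroids):
    """Sum of squared Euclidean distances from each point to its assigned centroid."""
    diffs = points - centroids[labels]
    return float(np.sum(diffs ** 2))

# Toy data (invented): four 2-D points forming two tight pairs.
points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])

# k = 1: a single centroid at the overall mean of the data.
labels_k1 = np.zeros(4, dtype=int)
centroids_k1 = points.mean(axis=0, keepdims=True)
sse_k1 = sum_of_squared_errors(points, labels_k1, centroids_k1)

# k = 2: one centroid in the middle of each pair.
labels_k2 = np.array([0, 0, 1, 1])
centroids_k2 = np.array([[0.0, 1.0], [10.0, 1.0]])
sse_k2 = sum_of_squared_errors(points, labels_k2, centroids_k2)

# k = 4: every point is its own centroid, so the error vanishes.
labels_k4 = np.arange(4)
sse_k4 = sum_of_squared_errors(points, labels_k4, points)

print(sse_k1, sse_k2, sse_k4)  # 104.0 4.0 0.0
```

Most of the improvement comes from the step from one cluster to two. Going all the way to four clusters removes the little error that is left but tells us nothing about the data, which is why we look for the bend and not the minimum.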

SET UP AND DATA SOURCES

For this project, we will be using an existing set of around 1,800 politicians and their Twitter usernames to scrape the social media platform for data on the various numerical values associated with their accounts — such as their total followers, friends, lists, favourites, and statuses. Then, we will run the elbow method on the data, identify k, and visualize k clusters on the graph for analysis.

One page on Kaggle offers exactly what we need in the form of a CSV. [6]

After we download it, we can open up a Jupyter notebook and import the Python libraries that we will use for our project. Here are the import statements and a brief description of each.

Importing Python Libraries

With that, we can read the .CSV into Pandas, check its information, and display its head. An initial glimpse reveals many duplicate accounts for the same people, empty fields, and columns that we want to filter out — such as the date of the account’s creation, the account ID, the politician’s birthplace, their birthday, and their username on Instagram. One piece of information we do want to hold on to, however, is the age of the person, which is the only numerical value that we can easily plug into our clusters, and might produce interesting insights if we compare it to other data on a graph.
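As a rough sketch of this cleanup step, the snippet below reads a tiny stand-in CSV and applies the same kind of renaming, column dropping, and de-duplication. The column names and rows are invented for illustration, so they will need to be swapped for the names in the real Kaggle file.

```python
import io
import pandas as pd

# Stand-in for the downloaded file; these column names and rows are invented.
raw_csv = io.StringIO(
    "name,twitter_username,account_id,birthplace,age,political_party\n"
    "Jane Doe,janedoe,111,Ohio,54,Democratic Party\n"
    "Jane Doe,janedoe,111,Ohio,54,Democratic Party\n"
    "John Roe,,222,Texas,61,Republican Party\n"
    "Ann Poe,annpoe,333,Utah,47,Republican Party\n"
)
df = pd.read_csv(raw_csv)

df = (
    df.rename(columns={"twitter_username": "screen_name",
                       "political_party": "party"})
      .drop(columns=["account_id", "birthplace"])   # fields we do not need
      .dropna(subset=["screen_name"])               # no username, nothing to look up
      .drop_duplicates(subset="screen_name")        # one row per account
      .reset_index(drop=True)
)

df.info()
print(df.head())
```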

Rename and Drop Columns

Dataframe Information (left) and Dataframe Head (right) after changes

Set Up Tweepy, the Twitter API

Now that we have our nice, clean list of nearly 1,800 users to work with, it’s time to use Tweepy to connect to our Twitter developer account. Note: my API keys in this code are redacted to keep them private. Get your own!

Searching for Users (First Try)

Here, our first search tries to loop through each user screen name in the dataset, find the actual user in Twitter, extract the desired fields from their account, and append the data into a list for later use. The code starts to work for a minute or so, but we quickly run into a problem — one of the users was suspended! We can’t know for sure why, but whatever the case, we have to work around it.

Rather than deleting the particular user from the data and waiting for the problem to repeat itself indefinitely, we can wrap each lookup in a try/except block that catches the error, raises an alert, and lets the loop continue with the remaining users.
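The pattern might look something like the sketch below. To keep it runnable without credentials, the lookup is passed in as a function and replaced here by an offline stand-in, so `fetch_profiles`, `fake_get_user`, and the sample screen names are all hypothetical. The five attribute names are the ones Twitter's v1.1 user object exposes for the statistics discussed in this article.

```python
from types import SimpleNamespace

# Numeric account fields used in this article, as named on a Twitter v1.1 user object.
FIELDS = ["followers_count", "friends_count", "listed_count",
          "favourites_count", "statuses_count"]

def fetch_profiles(screen_names, get_user, lookup_errors=(Exception,)):
    """Look up each screen name; report and skip any account that raises an error."""
    rows, skipped = [], []
    for name in screen_names:
        try:
            user = get_user(name)
        except lookup_errors as err:
            print(f"Skipping {name}: {err}")  # the alert
            skipped.append(name)
            continue
        rows.append({"screen_name": name,
                     **{field: getattr(user, field) for field in FIELDS}})
    return rows, skipped

# Offline stand-in for the real API call (hypothetical data).
def fake_get_user(name):
    if name == "suspended_user":
        raise RuntimeError("User has been suspended.")
    return SimpleNamespace(followers_count=10, friends_count=5, listed_count=1,
                           favourites_count=3, statuses_count=7)

rows, skipped = fetch_profiles(["alice", "suspended_user", "bob"], fake_get_user)
print(len(rows), skipped)  # 2 ['suspended_user']
```

With a real Tweepy client, `get_user` could be something like `lambda name: api.get_user(screen_name=name)`, and `lookup_errors` could be narrowed to the library's own error class (`tweepy.TweepError` in the 3.x releases, `tweepy.TweepyException` in 4.x). The resulting list of dictionaries can then be handed straight to `pd.DataFrame(rows)`.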

Searching for Users (Second Try)

Now the program works without a hitch, and it takes about an hour to completely download the information. Once it’s done downloading, we can create a completely new dataframe with its values.

Create New Dataframe

Merging the Dataframes

Our next step is to merge this new dataframe featuring all of its freshly extracted Twitter statistics with the old dataframe containing the politicians’ personal information. However, because we weren’t able to access those aforementioned suspended accounts, the two sets have a different number of rows, and won’t align properly if we try to merge them as they are. As a result, we have to eliminate any person in one dataframe that doesn’t exist in the other, and then reset both of their indexes so that they match. Only then can we merge the two dataframes together into a single set of workable data.
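Here is one way the alignment could be sketched in pandas, using two small made-up dataframes in place of the real ones. The names `people`, `stats`, and the `screen_name` key are assumptions for illustration. The last line shows a one-step alternative, joining on the shared key, which gives the same result here.

```python
import pandas as pd

# Made-up stand-ins: personal details for three people, Twitter stats for two.
people = pd.DataFrame({
    "screen_name": ["alice", "suspended_user", "bob"],
    "age": [54, 61, 47],
})
stats = pd.DataFrame({
    "screen_name": ["alice", "bob"],
    "followers_count": [1200, 35000],
})

# Keep only the people present in both frames, then realign the row indexes.
people = people[people["screen_name"].isin(stats["screen_name"])].reset_index(drop=True)
stats = stats[stats["screen_name"].isin(people["screen_name"])].reset_index(drop=True)

# With matching rows, the statistics columns can be attached side by side.
merged = pd.concat([people, stats.drop(columns="screen_name")], axis=1)

# Alternative: let pandas match the rows on the shared key.
merged_on_key = people.merge(stats, on="screen_name", how="inner")

print(merged)
```

The side-by-side version relies on both frames listing people in the same order. Joining on the key does not, which makes it the safer choice if the order is ever in doubt.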

BASIC VISUALIZATIONS

Now, while political party isn’t a numerical value that we plan on clustering in this project, it might be interesting to perform some basic visualizations on it in order to understand the nature of our dataset. Let’s make sure that every politician here is from a major American party.

Modifying Dataset for Political Parties

As it turns out, only about 1,350 politicians on the list identified as Republicans, Democrats, Libertarians, members of the Green party, or “independents.” Despite the dataset being advertised as one for US politicians, several of the people listed were politicians who identified with the parties of other countries close to the United States, such as Canada or Israel. Now that we’ve isolated only the major political groups in America, though, we should be in the clear.
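A filter along these lines would do the job. The party labels and sample rows are assumptions made for the sketch; the spellings in the real dataset may differ.

```python
import pandas as pd

# Hypothetical party labels; the real dataset's spellings may differ.
MAJOR_PARTIES = ["Republican Party", "Democratic Party", "Libertarian Party",
                 "Green Party", "Independent"]

# Made-up sample rows.
df = pd.DataFrame({
    "name": ["A", "B", "C", "D", "E"],
    "party": ["Democratic Party", "Republican Party", "Liberal Party of Canada",
              "Likud", "Independent"],
})

# Keep only rows whose party is on the list, then tidy the index.
us_only = df[df["party"].isin(MAJOR_PARTIES)].reset_index(drop=True)
print(us_only["party"].value_counts())
```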

At this point we can plot our basic demographic data like age, sex, and party.

Although it’s barely detectable, the faint blip at the edge of the dataset would imply that one of our users is almost 120 years old. Are we sure that’s true? Let’s check it out by sorting the dataframe by age.

Upon closer inspection we realize that Tom Allon is either the most youthful 119-year-old in the world, or our so-called centenarian’s age is just not right. Because we want to include age in our clusters, and because this person’s apparent age is such a dramatic outlier (and almost definitely inaccurate), we should just remove him from the dataset entirely. Sayonara, Tom.
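Sorting and dropping the suspicious row could be sketched as follows. The sample rows and the `MAX_PLAUSIBLE_AGE` cutoff are assumptions for illustration; removing the row by name or index would work just as well.

```python
import pandas as pd

# Made-up sample rows, one of them with an implausible age.
df = pd.DataFrame({"name": ["A", "B", "C", "D"], "age": [52, 119, 67, 38]})

# Oldest first, so any suspicious value floats to the top.
oldest_first = df.sort_values("age", ascending=False)
print(oldest_first.head())

# Assumed cutoff for a believable age; adjust as needed.
MAX_PLAUSIBLE_AGE = 110
df = df[df["age"] <= MAX_PLAUSIBLE_AGE].reset_index(drop=True)
```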

We can also plot the politicians by age and sex. As we can see, most politicians are in their 50s and 60s, the males outnumber the females over 2 to 1, and there are almost no transgender politicians on our list — validating the claim that older men truly do dominate the political world.

Graphing Politicians by Age and Sex (line plot with density)

Graphing Politicians by Age and Sex (bar plot with count)

In fact, there are only 4 transgender politicians out of over 1,350.

Here we can show the distribution of sexes within the political parties. Republicans and Democrats both dwarf the minor parties, and while Republican men are the largest individual group, women are far more present in the Democratic party, and Democrats outnumber Republicans as a whole.

CLUSTERING THE DATA

Now that we’ve satisfied our curiosity, we can run the elbow method on the numerical data for the first time to determine the optimal number of clusters k to group the set by. The values 4, 6, 7, 8, 9, and 10 refer to the column index numbers of the six features: age, followers, friends, lists, favourites, and statuses.
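For readers who want to try the elbow method without the Twitter data, here is a minimal sketch using scikit-learn on synthetic data. The three artificial blobs stand in for the six numeric columns and are purely an assumption for illustration. The `inertia_` attribute is scikit-learn's name for the within-cluster sum of squared errors.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the six numeric columns: three well-separated blobs.
rng = np.random.default_rng(0)
centers = np.array([[0.0] * 6, [20.0] * 6, [40.0] * 6])
X = np.vstack([c + rng.normal(scale=1.0, size=(50, 6)) for c in centers])

k_values = list(range(1, 9))
sse = []
for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sse.append(model.inertia_)  # within-cluster sum of squared errors

for k, value in zip(k_values, sse):
    print(k, round(value, 1))
```

Plotting `sse` against `k_values` (for example with `plt.plot(k_values, sse, marker="o")`) gives the elbow curve. On this synthetic data the error collapses up to k = 3 and barely moves afterwards, which is the bend we are looking for.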

Defining the Elbow Method

Because the elbow bends right around 3, that number becomes our k. Let’s identify the clusters for each feature of the data, and locate the centroids of each cluster.

Set K Clusters and Identify Centroids

Although it’s a bit confusing at first glance, there are 3 total centroids here, separated by brackets, each with 6 coordinates, one for each numerical column we put into the system. Let’s visualize the clusters on their own first, and then visualize the centroids inside the clusters.
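The shape of that output is easier to see on a small example. The sketch below fits k-means with k set to 3 on synthetic six-column data (again an assumption standing in for the real dataframe) and inspects the labels and centroids that scikit-learn returns.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in: 3 groups of 40 rows, each row with 6 numeric features.
rng = np.random.default_rng(1)
X = np.vstack([c + rng.normal(size=(40, 6)) for c in (0.0, 20.0, 40.0)])

model = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = model.fit_predict(X)        # one cluster label per row
centroids = model.cluster_centers_   # one row per cluster, one column per feature

print(centroids.shape)  # (3, 6)
print(np.bincount(labels))
```

Each row of `centroids` is one bracketed group in the printed output. To draw a two-feature plot such as followers VS age, we would scatter the two matching columns of the data colored by `labels`, then overlay the same two columns of `centroids`.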

Clustering Followers VS Age

Here we can ask the question, does a Twitter politician’s number of followers correlate with their age? Let’s see the results.

While most of the data points are spread out fairly far and wide from 20 to 100, the graph is a bit more concentrated in the 40 to 80 age range, with tiny peaks at 50 and 70. Based on this visualization, age does not stand out as a strong determinant of a politician’s popularity. The vast majority of politicians are also grouped together in a cluster defined by anywhere from 0 to 10 million followers, with the next largest cluster consisting of only half a dozen users with 10 to 30 million. The most remarkable feature of this graph, however, is the third and final cluster, which consists entirely of a single point that rises far above the rest, such a dramatic outlier that it forms its own cluster. Who could this mysterious individual be, at nearly 60 years old?

Barack Obama — The Man with the Most Followers

Barack Obama, the 44th President of the United States, and the man with 130 million Twitter followers. This number is more than four times that of the next most-followed politician, former First Lady and presidential candidate Hillary Clinton, who boasts just over 30 million herself. The difference between Obama’s follower count and the others is so great that it distorts the graph and makes the results difficult to analyze.

Thanks, Obama.

Let’s remove him from the dataset, simply for the sake of eliminating outliers, and perform our elbow-method and k-means clustering for a second time.

Clustering Followers VS Age with Centroids

While the results are not exactly the same, they are very similar, so we can continue clustering this slightly updated set of data with k set to 3 again — and now we can include our centroids.

Now, the graph is expanded, the singular cluster is removed, and the second cluster is split into two, with a new set of users featuring the top number of Twitter followers all numbering at around 30 million: Hillary Clinton, President Joe Biden, and former presidential candidate / rap star Kanye West. In the next largest cluster are those with 10 to 20 million followers: former First Lady Michelle Obama, Vice President Kamala Harris, former President Bill Clinton, congresswoman Alexandria Ocasio-Cortez, and daughter of the former President, Ivanka Trump. Lastly, in the third cluster is everyone else.

Under most circumstances, a person having even 1 million followers would put them far above the rest of the regular population — but when your competition is the most prominent politicians from the most powerful country in the world, the clustering algorithms put zero and a million into the same category: the lowest tier of popularity.

Now, how do total followers correlate with total friends on Twitter?

Clustering Followers VS Friends with Centroids

Because we are still graphing one of the axes based on followers, the vertical spread of clusters in this plot is similar to the last one — only the horizontal alignment of points has shifted to reflect each user’s number of friends.

And, while the most popular politicians have very few friends on social media, many of the people with fewer followers actually have several thousand friends of their own. It looks like the popular politicians are just too cool for everyone else, or maybe those with fewer followers try making more friends in order to gain notoriety. However, one person stands out from the rest with nearly 90,000 friends on the platform. Who is this extraordinarily gregarious individual?

Cory Booker — the Man with the Most Friends

Former Mayor of Newark, US Senator from New Jersey, and former presidential candidate Cory Booker. With almost 5 million followers to call his own, Mr. Booker sits near the top of the third cluster, just behind Speaker of the House Nancy Pelosi (7.0 million) and former presidential candidate and Senator from Massachusetts Elizabeth Warren (5.7 million).

Unfortunately, though, many friends does not a president make.

Sorry, Cory.

We can also cluster followers VS favourites (or how many times the user liked another post), which shows somewhat similar results to followers VS friends: most users aren’t likely to engage with other people in general, but those who do are almost always very unpopular by the standards of other politicians.

Clustering Followers VS Favourites with Centroids

The person who stands out the most is the lonely red dot to the right of the red cluster, a relatively popular politician who is also prone to liking lots of posts made by other people. What is the identity of this individual?

Alexandria Ocasio-Cortez, the Woman with Many Favourites (but not the most)

US Representative from New York, Alexandria Ocasio-Cortez, a member of the Democratic Party and the bold 31-year-old woman who drew national recognition after unseating her rival, a 10-term incumbent, in the Democratic primary ahead of the 2018 midterm elections. While AOC is not the single most popular politician on this list, she is in the top 10, and is definitely one of the youngest. She has also favourited 30,000 posts, showing that this millennial is a much more active social media user than the rest of her cluster.

Clustering Followers VS Statuses with Centroids

The visualization for followers VS statuses (or posts) shown above is much like the previous ones — just because a person makes a lot of posts doesn’t mean they are popular. But that doesn’t mean that certain people won’t try.

Raymond Buckley — the Man with the Most Statuses

New Hampshire Democratic Party Chair Raymond Buckley is certainly trying his best to post the most, with over 200,000 statuses to date. While one should never give up on their dreams of becoming a social media celebrity politician, it might be helpful to consider other strategies besides just posting copious quantities of content.

Lastly, the visualization for followers VS lists shown below reveals a positive correlation between the two values: those with the most followers are also the ones who are on the most lists. Twitter lists are curated groups of different accounts that anyone can create and add to, and a person’s presence on many lists suggests that their life or work is interesting or relevant to a variety of subjects.

Clustering Followers VS Lists with Centroids

Intriguingly, there is one special user on the graph whose small number of followers is at odds with their large number of lists:

Al Gore, the Man with Many Lists (but not the most)

Al Gore, the 45th Vice President of the United States, a former presidential candidate in the 2000 election, and a world-renowned environmentalist. I couldn’t ascertain the exact reason why Al Gore showed up on so many of these lists (nor could I access the lists themselves), but I would imagine that it has something to do with climate change, the subject that he is most famous for. However, his ratio of lists to followers is a total anomaly, and is difficult to explain without more information.

LIMITATIONS

One of the first limitations I observed while performing k-means clustering was how easily clusters are influenced by outliers. The prime example in this case is Barack Obama, whose follower count was so high that it eclipsed all others and resulted in him having his own cluster. In situations like these, it’s hard to analyze the other clusters, because the differences between them are dwarfed by comparison. He had to be removed from the dataset entirely in order to properly measure the others in detail.

Another limitation here is that I wasn’t able to cluster non-numerical information such as the person’s sex or political party. It would be interesting to learn whether either of those two factors is related to the person’s popularity online, or to how they behave on social media.

It would also be interesting if this dataset had included former President Donald Trump, whose activity on Twitter was so unique that it almost certainly would have been another outlier on the graph had he not been banned. Before he was permanently removed from the platform on January 8th, 2021, the man boasted nearly 90 million followers and 60,000 tweets. [7]

One last detail worth mentioning is that Twitter (and big tech in general) tends to lean left on the political spectrum, at least when it comes to social policies. [8] And while its CEO Jack Dorsey has claimed that his own personal bias doesn’t impact company policy, the company’s actual Terms of Service reflect mostly liberal sensibilities, and the relatively recent wave of user bans in the wake of the Trump presidency has arguably impacted Republicans, conservatives, and the right wing more than the other side of the political spectrum. [9] As a result, many of the users who align with these ideological groups have abandoned Twitter and sought refuge on other platforms with fewer restrictions, such as Parler or Gab. [10] In light of this information, could the greater popularity of Democratic politicians on Twitter be a partial product of this social media exodus? Or are Democratic politicians just more tech-friendly? What kinds of trends would we find if we analyzed those other social media websites? Do the behaviors and resulting statistics of political figures change when they move to other platforms? These are interesting questions that we simply can’t answer without more data.

CONCLUSIONS

One obvious word of advice we can give a person after doing this analysis is that you don’t have to be young to be a popular politician on social media.

And, while certain exceptional people such as Barack Obama, Donald Trump, AOC (and perhaps even Kanye West, though only in his music career and not his run for political office) were known for leveraging strong social media presences to bolster their individual successes, they weren’t the prolific posters that one might think they would be.

In fact, the paradox of the most popular politicians tending to post the least, and interacting with fewer posts made by other people, also implies that popularity on social media is not simply a product of how much activity someone has online, but rather, what they say in their posts, and what they do in the real world that is worth posting about to begin with. It could also mean that they are simply too busy saying and doing important things in the real world to constantly update their statuses on Twitter.

On the other hand, some of the least popular politicians in the dataset appear intent on garnering attention through vast amounts of activity — a strategy that is, unfortunately, unlikely to work, at least based on this data.

So, what is the moral of this story? Perhaps becoming a popular politician on social media is not about your number of friends or the quantity of your posts, but the value of your ideas and the quality of your character.

(Or, some other more nebulous statistic that I have yet to identify.)