Hashtag Virality: What drives viral content?

Alvaro Aguado, Pratik Parija

--

When we use a social media platform like Twitter, we often feel the urge to reply to random users who posted something that seems outrageous. We think that by carefully explaining (t̶h̶e̶ ̶c̶o̶r̶r̶e̶c̶t̶, I mean…) our point of view, this person will surely change their mind. And just like this, step by step, we will fix the world... Okay, who am I kidding. We all know that in reality we end up getting into a fight online, frustrated and thinking about how doomed the world is. We always say this is the last time, but we get triggered again by the next tweet we see. This gave us the idea of exploring hashtag virality in a different way. Is it the topic of the hashtag, or the agents that popularize that hashtag, that makes it go viral?

What if we could take ‘influencers’ and ‘topics’ and see which attribute explains most of the variance in the conversation? And if their importance changes by topic, is this a sign of how popular the hashtag will be?

Introduction

Virality has been widely studied. For example, Wang and Liu (2016), in “Hashtags and information virality in networked social movement: Examining hashtag co-occurrence patterns”, show that co-occurrence graphs (hashtags that appear in the same tweet) can be used to find which hashtags are more likely to become viral. In our work we are more interested in the type of language used with one hashtag versus the others.

Similarly, Xiong et al. (2019), in “Hashtag activism and message frames among social movement organizations: Semantic network analysis and thematic analysis of Twitter during the #MeToo movement”, used the co-occurrence of words to identify the network structure of the messages. Their work also notes that bottom-up processes such as unbranded hashtags or events are more likely to be ‘viralized’ depending on how informative the content is. They identify shortness and informativeness as attributes that help these kinds of hashtags go viral, while top-down hashtags must be actively promoted by the interested groups.

Our Hypothesis

Twitter is widely known for having popular hashtags or trending topics that tend to be controversial, silly or just related to a specific event. These events could be driven by the topic itself being very engaging (or polarizing) to audiences, or by the reach of a specific author or celebrity who generates conversation thanks to the influential position they hold in society.

We believe that more viral hashtags are likely to be driven by influential authors or celebrities rather than by the topic itself. If this is true, it would imply that viral content has more to do with the agents involved in the topic than with the topic itself.

In this analysis we explore a set of hashtags that were at different stages of virality over the past year on the social media platform “Twitter”. We picked hashtags from three categories. Unbranded hashtags are topics related to movements that do not have corporate benefit as a goal. In this category we have the well-known Black Lives Matter movement, which gained significant popularity mainly in the US from May to June, largely driven by the death of George Floyd. The other topic is the Me Too movement, which protests against sexual abuse and harassment perpetrated by men in positions of power. This movement gained significant popularity by the end of 2017, but it is still relevant on social media as an ongoing movement to defend women.

On the other hand we have branded hashtags, which seek benefit for corporations or raise money for a non-profit organization. In this category we included the famous Coca-Cola marketing campaign “Share a Coke”; we expect the agents that drive sharing of this campaign to be different. Additionally we have the Ice Bucket Challenge, a widely known campaign for awareness of ALS whose goal was to raise donations for the ALS Association.

Finally we have another category called events, which are hashtags that started due to specific circumstances. Here we include COVID-19, which gained significant popularity due to the worldwide impact of the pandemic.

Unbranded: #BlackLivesMatter; #MeToo
Branded: #ShareACoke; #IceBucketChallenge
Events: #Covid

Getting the data

We collected our data using the software-as-a-service tool from Meltwater Inc. This tool allows us to run unlimited queries against Twitter, and also to access many other sources such as news sites, blogs, Reddit, e-commerce reviews and, with limitations, Instagram and Facebook. Our data was collected using boolean logic to extract the different campaigns. We found some limitations when acquiring the data. The first is that Twitter data is only available for the past year, so our collection window is October 1st, 2019 to October 1st, 2020.

We used boolean logic to request the data, which returned results in JSON format (GET method). Here we found our second limitation: we can only obtain the first 20,000 entries for any given time frame. We tried to work around this by processing the data on a daily time frame, but we got errors for hashtags with a large number of cases (such as #COVID or #BlackLivesMatter). So we decided to extract 40,000 tweets per quarterly period, consisting of the 20,000 most shared and the 20,000 least shared tweets. Next, we collected all the data from news sites from 2013 until today for each of the hashtags. News queries also limit the number of tuples that can be requested at a time, but they can go back more than 10 years. Below is a breakout of the sampled data with the individual posts by type.
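As a rough illustration of this sampling step, the sketch below keeps the 20,000 most shared and 20,000 least shared tweets per quarter. It assumes a pandas DataFrame with hypothetical ‘date’ and ‘retweets’ columns, not the actual Meltwater export format.

```python
import pandas as pd

def sample_quarter(df: pd.DataFrame, n: int = 20_000) -> pd.DataFrame:
    """Keep the n most shared and n least shared tweets of one quarter."""
    top = df.nlargest(n, "retweets")       # hypothetical share-count column
    bottom = df.nsmallest(n, "retweets")
    return pd.concat([top, bottom]).drop_duplicates()

def sample_hashtag(tweets: pd.DataFrame) -> pd.DataFrame:
    """Apply the top/bottom sampling within each quarter of the collection window."""
    tweets = tweets.assign(quarter=tweets["date"].dt.to_period("Q"))
    return (
        tweets.groupby("quarter", group_keys=False)
        .apply(sample_quarter)
        .reset_index(drop=True)
    )
```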

Figure 1. The breakout shows the different sources of data we collected. For #MeToo we were able to pull a higher number of tweets. This is not necessarily a problem since we analyze each hashtag separately.
Figure 2. Histogram of the distribution of post lengths for the different campaigns. In red are the posts from news, which in some cases are truncated to the first 100 characters. In blue are the tweets from Twitter. Tweets with more than 280 characters are quoted tweets, which include both the new tweet and the original tweet.

When converted to tabular form, our data contains 885,293 rows and 40 columns, out of which we are using 20 columns for the data exploration.

Figure 3. Representation of the Virality of the different hashtags during the 2019/2020 period. Y-axis in LOG-10 scale

Additionally, we obtained the total tweet counts by day for all the hashtags we worked with. These trended counts are the target variable we use to measure virality.
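A minimal sketch of how these daily counts can be built, assuming a tweets DataFrame with hypothetical ‘date’ (datetime) and ‘hashtag’ columns:

```python
import pandas as pd

def daily_counts(tweets: pd.DataFrame) -> pd.DataFrame:
    """One row per (hashtag, day) with the number of tweets posted that day."""
    return (
        tweets.set_index("date")
        .groupby("hashtag")
        .resample("D")          # calendar-day bins
        .size()
        .rename("tweet_count")
        .reset_index()
    )
```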

Processing the data

We processed the data using R and Python. R was very useful for pre-processing the data and handling special characters. Python was used for the reporting and the modelling phase.

Topic Analysis

First we wanted to analyze whether, within each hashtag, there are different types of conversations that could spark or trigger more virality than others. A lot of work went into pre-processing: not only cleaning the different words and special characters, but also lemmatizing and adding word collocations to our analysis. We wanted to keep emojis, as well as references to specific websites, in order to capture content that could be a sign of virality.
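The sketch below illustrates (rather than reproduces) this pre-processing under our assumptions, using spaCy for lemmatization and gensim Phrases for collocations; tweets_text, the emoji character range and the thresholds are placeholders.

```python
import re
import spacy
from gensim.models.phrases import Phrases, Phraser

nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])
EMOJI = re.compile(r"[\U0001F300-\U0001FAFF]")   # rough emoji range, an assumption

def tokenize(text: str) -> list[str]:
    """Lowercase, lemmatize, drop stop words/punctuation, keep emojis and URLs."""
    tokens = []
    for tok in nlp(text.lower()):
        if tok.like_url or EMOJI.match(tok.text):
            tokens.append(tok.text)              # keep links and emojis as-is
        elif tok.is_alpha and not tok.is_stop:
            tokens.append(tok.lemma_)            # lemmatized word
    return tokens

# Learn frequent collocations (e.g. "black_lives") over the whole corpus.
corpus = [tokenize(t) for t in tweets_text]      # tweets_text: list of raw posts (assumed)
bigrams = Phraser(Phrases(corpus, min_count=10, threshold=15))
corpus = [bigrams[doc] for doc in corpus]
```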

Next, we organized the tweets and news in a way that aggregates them by looking not only at terms (as LDA would) but also at the position of those terms within sentences. This is important because it provides a notion of distance between terms that could mean the same thing, and even aggregates retweets or news items that feed from the same article. That is why we used word embeddings with word2vec (CBOW). This approach helped arrange the data in a more accurate and flexible way. If we look at a lower-dimensional manifold of these representations, we can observe how the topics distribute.
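A sketch of this embedding step, reusing the tokenized corpus from the previous snippet; the hyperparameters here are illustrative, not the ones we tuned.

```python
import numpy as np
from gensim.models import Word2Vec

w2v = Word2Vec(sentences=corpus, vector_size=100, window=5,
               min_count=5, sg=0, workers=4)     # sg=0 selects CBOW

def doc_vector(tokens: list[str]) -> np.ndarray:
    """Represent a tweet/news item as the average of its known word vectors."""
    vecs = [w2v.wv[t] for t in tokens if t in w2v.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.vector_size)

doc_vectors = np.vstack([doc_vector(doc) for doc in corpus])
```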

Figure 4. Distribution of unigrams for #ShareACoke, #Covid, #BlackLivesMatter and #MeToo. The last three have more unigrams in common with each other than with #ShareACoke. These unigram count distributions follow the famous Zipf’s law even after removing the most common words.
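A quick way to check the Zipf-like shape mentioned in Figure 4 is to fit the slope of log frequency against log rank (close to -1 under Zipf’s law); this reuses the tokenized corpus above.

```python
from collections import Counter
import numpy as np

counts = Counter(tok for doc in corpus for tok in doc)
freqs = np.array(sorted(counts.values(), reverse=True), dtype=float)
ranks = np.arange(1, len(freqs) + 1)
slope, intercept = np.polyfit(np.log(ranks), np.log(freqs), 1)
print(f"Zipf slope ~ {slope:.2f}")   # roughly -1 for a Zipf distribution
```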

In the top-left chart of Figure 3 we can see all the hashtags together. Not surprisingly, #COVID, #MeToo and #BlackLivesMatter are more closely related than the branded campaigns #ShareACoke and #IceBucketChallenge. When we checked the similarities between #COVID, #BlackLivesMatter and #MeToo, we saw that the main words they shared were mentions of political figures. It is interesting that #COVID and #BlackLivesMatter, hashtags that were very significant this year, had more broadly distributed topics than #IceBucketChallenge, #ShareACoke or #MeToo (which showed lower sphericity under Bartlett/KMO tests), an indication that the latter topics are more concentrated.

Another thing that caught our attention in Figure 3 is the distribution of topics for news versus tweets. For #COVID, for example, news and tweets were equally distributed across the different topics, with the exception of a series of Twitter posts that were not covered in the news (the circle in the bottom left; these posts were related to the elections). On the other hand, #BlackLivesMatter had a broad distribution of news topics that were not represented in the Twitter posts (no overlap on the left-hand side). These news items covered a range of topics that were not as mainstream on Twitter, such as racial justice for diverse groups, including Latinos, which was not discussed as much.

This variable is incorporated into the model by counting the virality of the topics within each hashtag. This will help determine whether topics drive the overall virality. To do this, we clustered the word embeddings and counted the number of posts in each cluster. We created these clusters with K-means over the cosine distance between the vectors that each text represents. The best average silhouette was the metric used to select the optimal number of clusters, as seen below.

Figure 5. Silhouette analysis for K-means clustering on each of the hashtags
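A minimal sketch of this clustering step, assuming the doc_vectors matrix from the word2vec snippet: we L2-normalize the vectors so that Euclidean K-means approximates clustering in cosine space, and pick k by the best average silhouette.

```python
from sklearn.preprocessing import normalize
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = normalize(doc_vectors)            # unit-length rows ~ cosine geometry

best_k, best_score, best_labels = None, -1.0, None
for k in range(2, 15):                # candidate numbers of clusters (assumed range)
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    score = silhouette_score(X, labels, metric="cosine")
    if score > best_score:
        best_k, best_score, best_labels = k, score, labels
```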

Agents Analysis

We have seen that many influencers and celebrities are present in many of the hashtags and that they appear with very high frequency. The hypothesis is that certain authors matter more than others in generating virality. We want to incorporate the impact of these agents to see if they could be driving most of the virality.

To do this we extracted author handles using regular expressions (see the sketch after the list below). This was challenging since in some cases agents are mentioned without the ‘@’ handle. We differentiate authors into three categories that are not mutually exclusive: Original Authors, Mentioned Influencers and Retweet Authors.

  • Original Authors: the original authors of tweets or of content in news. They are normally different from the other two categories because they may use the intended hashtag significantly more than the others. In branded hashtags these are normally the platform owners, or people promoting the hashtag. For example, ABC News is one of the top authors for the #COVID hashtag.
Table 1. Sample of top 3 Original Authors by hashtag and number of posts
  • Mentioned Influencers: celebrities or Twitter influencers who have an impactful presence in terms of followers or their position in society, for example when people mention RealDonaldTrump or Joe Biden in their tweets. Let’s take a look at the most common ones in this category.
Table 2. Sample of top 3 most Mentioned Influencers by hashtag
  • Retweet Authors: Twitter users who do not necessarily generate the original content, but who have a significant number of followers and as a result can amplify the messages they find most relevant.
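A simplified sketch of the handle extraction; the regular expressions below are illustrative and, as noted above, miss agents mentioned without the ‘@’ sign.

```python
import re

MENTION_RE = re.compile(r"@(\w{1,15})")        # Twitter handles are at most 15 chars
RETWEET_RE = re.compile(r"^RT\s+@(\w{1,15})")  # classic "RT @user:" prefix

def extract_agents(text: str) -> dict:
    """Return the mentioned handles and, if present, the retweeted author."""
    rt = RETWEET_RE.match(text)
    return {
        "mentioned": MENTION_RE.findall(text),
        "retweeted_author": rt.group(1) if rt else None,
    }

extract_agents("RT @ABC: #COVID cases rise again, says @WHO")
# -> {'mentioned': ['ABC', 'WHO'], 'retweeted_author': 'ABC'}
```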

With this information we want to evaluate the impact of different agents at different points in time and see if their posts correlate with spikes in the conversation. For example, we have the posts from Donald Trump and can overlay his tweets in time with the spikes in mentions for a given hashtag. We only included the top 3 agents for each category in our model.

Figure 6. Example of Twitter counts and sample of a few tweets of RealDonaldTrump

Finally, as part of our analysis we also looked at the relationships these agents have with one another in the mention graph. We found that the majority of mentions are concentrated in a few top influencers, and that the vast majority of users normally tag the same people over and over. Below we show a sample of tweets and the users they mentioned; the size of the bubble indicates the number of mentions the user received.

Figure 7. Sample of Twitter mentions across hashtags. The size of a node indicates the number of mentions. Nodes in blue are in the top 3 mentioned authors within each hashtag; the rest are in red.
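A sketch of how such a mention graph can be assembled with networkx, where “mentions received” is the weighted in-degree of a node; the example rows are hypothetical.

```python
import networkx as nx

def mention_graph(rows):
    """rows: iterable of (author, [mentioned handles]) pairs."""
    G = nx.DiGraph()
    for author, mentioned in rows:
        for handle in mentioned:
            if G.has_edge(author, handle):
                G[author][handle]["weight"] += 1
            else:
                G.add_edge(author, handle, weight=1)
    return G

G = mention_graph([("user1", ["ABC"]), ("user2", ["ABC", "WHO"])])
sizes = dict(G.in_degree(weight="weight"))      # mentions received per account
top_mentioned = sorted(sizes, key=sizes.get, reverse=True)[:3]
```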

Experiment

Schematic of the model used to predict virality. We created an individual model for each hashtag.
Figure 8. Train-test breakout for the model. In orange is the out-of-sample data; in blue, the in-sample data.

To capture which variables are more important, we need a method that can replicate the changes in tweet counts over the selected time window. We decided to simplify the approach and create a tabular model that finds linear relationships between the tweet counts and the different variables. Our target is the tweet count for the following date, while the independent variables are the topics and the agents. A limitation is that the reaction time between a post and changes in the overall volume is very fast, so we could incorrectly infer causality from posts that happen afterwards. That is why the target variable is shifted one day, in order to evaluate the impact on the day after.
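The one-day shift is a single pandas operation; this sketch assumes a per-hashtag frame ‘daily’ with one row per date, the topic/agent features, and a ‘tweet_count’ column.

```python
daily = daily.sort_values("date")
daily["target_next_day"] = daily["tweet_count"].shift(-1)   # tomorrow's count
daily = daily.dropna(subset=["target_next_day"])            # last day has no next-day value
```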

Word2vec topics were included as the percentage of tweets of each topic on a given date, multiplied by the reach of the tweets and news from that date. For agents, we sum the number of tweets from the top agents in each category, weighted by the total reach of those agents. Finally, our model was tested with a time series k-fold validation approach to see how it performs at different stages. We used Lasso paths to understand how the different variables develop in importance as we increase the shrinkage penalty. Note that we fit a model per hashtag, but we aggregated the results for simplicity below.
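A sketch of the validation and shrinkage-path step under our assumptions (X is the feature matrix of topic and agent variables, y the one-day-shifted counts; the alpha value is illustrative): time-series folds keep the chronological order, and lasso_path traces how each coefficient shrinks to zero as the penalty grows.

```python
from sklearn.model_selection import TimeSeriesSplit
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Lasso, lasso_path

scaler = StandardScaler()
for train_idx, test_idx in TimeSeriesSplit(n_splits=5).split(X):
    X_tr = scaler.fit_transform(X[train_idx])        # fit scaling on the past only
    X_te = scaler.transform(X[test_idx])
    model = Lasso(alpha=0.1).fit(X_tr, y[train_idx])
    print("fold R^2:", model.score(X_te, y[test_idx]))

# Coefficient paths over a grid of penalties (the "Lasso paths" figure)
alphas, coefs, _ = lasso_path(scaler.fit_transform(X), y)
```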

Results

Model: OLS | Dep. Variable: Hashtag counts | R-squared: 0.775 | Adj. R-squared: 0.605

Our model performed well in-sample and out-of-sample across the different hashtags, but it did not perform as well in the first burst of social chatter for some of the hashtags. When we look at the drivers, we observe that agents were far more significant than the topics themselves. This does not mean, however, that topics were unimportant: some topics were extremely useful in explaining the variance of the model. We could argue that, in order to explain virality on Twitter, the agents matter most, followed by the topic itself. This was derived from the Lasso paths we computed from the model.

Lasso paths for the final model. Lines in the outermost part (those that shrink to zero last) tend to be more important than the ones that become zero more quickly as the penalty increases.

Our model has limitations in the sense that we are not able to explain ~10%-20% of the variance in the tweet count changes. This shows that the different hashtags influence which characteristics and drivers are most meaningful. A broader analysis of more hashtags would be necessary to understand the validity of our model in other circumstances.

Another limitation is that we only included hashtags that were successful in terms of virality on social media. It would be useful to sample unsuccessful hashtags within the same categories to understand the differences between the two.

Finally, we have to call out the fact that 2020 was certainly a different year for social media. For one, a lot of people spent more time on social media. Also, their behaviors might have been impacted by the COVID-19 pandemic and the stress it generated. Future studies should consider a larger time frame to control for these kinds of mediating variables.

Ethical considerations

This model tries to shed light on the drivers that make people post and engage with content on social media. If the model is accurate, it might be uncovering personality traits that people are not aware of, and as a result it could be used to manipulate or influence the actions of users.

Whenever we build algorithms that could manipulate the way people behave, we should make sure that the learnings are openly available to users. This way users can be aware of the circumstances that could trigger these behaviors.

Also, as we worked on this project, we noticed that extremely controversial content tends to be posted precisely to provoke a reaction from the user. By reacting and confirming this bias, we reward content creators for creating more of those posts.

Conclusions

We embarked on this journey with specific questions and hypotheses around virality. Our results suggest that the answer to those hypotheses is a “yes, but”. Yes, influencers might be more important than the topic, but the combination of both variables makes a more accurate model for explaining virality.

We learned that social media is continuously evolving. The way users interact with the platform is neither static nor homogeneous. Different people do different things and use social media for different purposes. Generalizing in this context is hard and requires a lot of attention to detail. We need to be aware of how we use social media and of how to make it better for everyone. After all, the old social media meme might be true:

“Do not feed the trolls!”

Appendix

Responsibilities in the project by author for the Final Report

Alvaro Aguado

  • Data Collection: collecting the data from the Meltwater platform and incorporating the initial pre-processing, including handling the UTF-16LE encoding
  • Text pre-processing: cleaning and pre-processing text to be model ready
  • NLP Topic Analysis: creating word2vec embeddings, visualization for TSNE manifold and clustering method.
  • Regression and Shrinkage Path Model: Performing time series k-fold and results

Pratik Parija

  • Introduction/Hypothesis: establishing the initial hypotheses and the different tests
  • Agent analysis / Graph Analysis: Developing the Graph analysis of the different agents
  • Ethical Considerations/Conclusions: Developing all the considerations and conclusions for the report
  • Combining the data sets for the final model
