Analysis of Hashtags in Social Media

What hashtags are most prevalent in social media posts and what makes them important?

Colleen Wang
INST414: Data Science Techniques
6 min read · Apr 29, 2024

Stakeholders/Decisions informed

Web-based networks encompass communication, relationships, and interconnected online structures. Studying these networks and the patterns in them can yield useful findings for online users and stakeholders, whether they are trying to grow an online following or to improve the efficiency of social media algorithms. By exploring web-based network data, we can identify major online influences and the characteristics that make them influential. A key question this data can answer is which hashtags are most prevalent in social media posts and what factors make them important.

The answer to this question is relevant to social media influencers and companies that aim to optimize influence, viewership, and interactions. These stakeholders care about the question because analyzing web-based networks, such as finding the most influential nodes and hashtags, can help them strengthen their online presence and target promotional campaigns more effectively. The decisions the answer could inform include a company’s algorithm strategy, its marketing strategy, and its user policies.

Data:

To answer the proposed question, we need a data set containing information about user interactions across web-based networks. The data set should include metrics about the most popular accounts and posts as well as the hashtags present in online content. The number of likes, shares, and interactions would be useful for answering the question, as would sentiment data. These fields are relevant because they can reveal patterns in online user interactions that are valuable to stakeholders, and measuring their importance can inform decisions about the algorithms and policies of web-based networks. I collected a subset of this data from Kaggle, a free resource for open data sets. The fields contained in this data set are:

  • Text
  • Sentiment
  • Timestamp
  • User
  • Platform
  • Hashtags
  • Retweets
  • Likes

Data Analysis/Figures:

Figure 1: Top 10 Most Common Hashtags in Data set

Figure 1 shows the top 10 most common hashtags in the social media posts from the data set, along with their frequencies. I split the hashtag groups into individual nodes and counted the frequency of each node. We can see that #serenity has 15 instances, #excitement has 13 instances, #gratitude has 13 instances, and so on. As the analysis below explains, this visualization does not necessarily show the most important hashtags, just the most common ones.
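The split-and-count step can be sketched with Python's standard library. This is a minimal illustration rather than the code from the repository: the `hashtag_groups` sample and the `count_hashtags` helper are hypothetical, and it assumes each post's hashtags arrive as one whitespace-separated string.

```python
from collections import Counter

# Hypothetical sample of the Hashtags column: one raw hashtag group per post.
# These rows are made up for illustration and are not from the real data set.
hashtag_groups = [
    "#Serenity #Nature",
    "#Excitement #Travel",
    "#serenity #Meditation",
    "#Gratitude #Family",
]


def count_hashtags(groups):
    """Split each hashtag group into individual hashtags and count them."""
    counts = Counter()
    for group in groups:
        # Lowercase so "#Serenity" and "#serenity" count as the same hashtag.
        counts.update(tag.lower() for tag in group.split())
    return counts


counts = count_hashtags(hashtag_groups)
top_10 = counts.most_common(10)  # list of (hashtag, frequency) pairs
print(top_10)
```

`Counter.most_common(10)` returns the pairs sorted from most to least frequent, which is the shape a bar chart like Figure 1 needs.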

Nodes/Edges/Importance:

Each hashtag in this data set represents a node in the graph. Each node is a subject or theme seen in the social media posts from the dataset. Every post has two hashtags: one for the mood expressed in the post and another for a term that describes the topic. For the purpose of this analysis, each hashtag is considered a unique node. A relationship between two nodes is represented by an edge, which is based on the co-occurrence of hashtags in the same social media post. If two hashtags frequently appear together in posts, an edge connects them.

A node’s relevance can be quantified with measures such as degree centrality and betweenness centrality, which show how influential a node is within the network. A node’s degree centrality is determined by the number of direct edges it has. A node’s betweenness centrality is the extent to which it serves as a link between other nodes, measured by how often it lies on the shortest paths between them. The significance of each node can be established by analyzing the co-occurrence of its hashtag with others. Hashtags with higher degrees are more important because they highlight recurring topics, subjects, and patterns within the dataset.
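One way to turn posts into co-occurrence edges is to count every unordered pair of hashtags that share a post. The sketch below uses only the standard library; the `posts` sample and the `cooccurrence_edges` helper are hypothetical, and whether the original analysis applied a minimum count before drawing an edge is not stated, so no threshold is applied here.

```python
from collections import Counter
from itertools import combinations

# Hypothetical posts, each already split into its individual hashtags.
posts = [
    ["#serenity", "#nature"],
    ["#excitement", "#travel"],
    ["#serenity", "#meditation"],
    ["#nature", "#serenity"],
]


def cooccurrence_edges(posts):
    """Count how many posts each unordered pair of hashtags shares."""
    edges = Counter()
    for tags in posts:
        # set() ignores a hashtag repeated within one post; sorted() makes
        # (a, b) and (b, a) map to the same undirected edge key.
        for a, b in combinations(sorted(set(tags)), 2):
            edges[(a, b)] += 1
    return edges


edges = cooccurrence_edges(posts)
for (a, b), weight in edges.items():
    print(a, b, weight)
```

The count attached to each pair can serve as an edge weight, so a "frequently appear together" rule could be expressed as a cutoff on that weight.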
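Both measures are available in NetworkX. The following is a small sketch on a made-up graph: the hashtag names echo ones mentioned in this post, but the edges are invented for illustration and do not reproduce the real network.

```python
import networkx as nx

# Illustrative edges only; these co-occurrences are assumptions.
G = nx.Graph()
G.add_edges_from([
    ("#contentment", "#family"),
    ("#contentment", "#nature"),
    ("#contentment", "#excitement"),
    ("#excitement", "#travel"),
])

# Raw degree: the number of direct edges each node has.
degree = dict(G.degree())

# Degree centrality: degree divided by (number of nodes - 1).
degree_centrality = nx.degree_centrality(G)

# Betweenness centrality: share of shortest paths between other
# node pairs that pass through the node (normalized by default).
betweenness = nx.betweenness_centrality(G)

for node in G.nodes():
    print(node, degree[node], degree_centrality[node], betweenness[node])
```

In this toy graph `#contentment` has the most edges and also sits on the most shortest paths, while a leaf such as `#travel` has a betweenness of zero because no shortest path between two other nodes runs through it.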

Figure 2: Top 150 nodes
Figure 3: Top 3 Nodes with Highest Degrees

From Figures 2 and 3, we can see that the nodes with the highest degrees are #contentment with a degree of 8, #excitement with a degree of 6, and #hopeful with a degree of 6. These nodes have the greatest number of connections, meaning they have the highest degree centrality: they co-occur with the largest number of other hashtags in social media posts. The co-occurrence of these hashtags can give social media influencers and companies insight into the proposed question.

Answer to Question:

The hashtags that are most present in online social media posts can be seen from Figure 1. We can use the above explanation of the importance of nodes and their respective edges to draw conclusions for the question. Identifying and counting the frequency of each hashtag gives a foundational understanding of the most popular topics and hashtags within the dataset. However, even though the frequent hashtags can tell us the most common topics online, they might not necessarily represent the most influential or important hashtags. This initial analysis serves as a foundation for further exploration into the characteristics that contribute to the importance of nodes.

To explore what makes certain hashtags important, Figure 2 uses NetworkX to create a graph of the top 150 nodes and edges. This visualization allows us to show relationships and connections between hashtags based on their co-occurrence within social media posts. By visualizing hashtag networks and analyzing metrics like degree centrality and betweenness centrality, we can identify the most influential nodes within the network. Understanding the most common hashtags and the characteristics that make them important can help stakeholders such as social media influencers and companies to maximize the efficiency of online engagement, marketing campaigns, algorithms, and policies. This exploratory data analysis reveals insights that can be used by stakeholders to effectively enhance their online presence.
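Restricting a graph to its top nodes can be sketched as follows. The post does not say how the top 150 were ranked, so ranking by degree is an assumption here, and `top_n_subgraph` is a hypothetical helper. The edges are again invented for illustration.

```python
import networkx as nx


def top_n_subgraph(G, n):
    """Return the subgraph induced by the n highest-degree nodes.

    Ranking by degree is an assumption; frequency would also be plausible.
    """
    ranked = sorted(G.degree(), key=lambda pair: pair[1], reverse=True)
    keep = [node for node, _ in ranked[:n]]
    # copy() gives an independent graph instead of a read-only view.
    return G.subgraph(keep).copy()


# Illustrative edges only; not the real co-occurrence network.
G = nx.Graph()
G.add_edges_from([
    ("#contentment", "#excitement"),
    ("#contentment", "#hopeful"),
    ("#contentment", "#travel"),
    ("#contentment", "#nature"),
    ("#excitement", "#hopeful"),
    ("#excitement", "#travel"),
    ("#hopeful", "#family"),
])

small = top_n_subgraph(G, 3)
print(sorted(small.nodes()), small.number_of_edges())
```

With the real data, `n` would be 150, and the resulting graph could be passed to `nx.draw` (which requires Matplotlib) to produce a plot along the lines of Figure 2.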

Data Cleaning/Bugs:

To clean this data set, I first standardized the hashtags column by converting it to lowercase to avoid case sensitivity issues. I then tokenized each hashtag group so that each hashtag could be analyzed as its own node. I also dropped the empty values from the hashtag column to prevent errors and keep the analysis accurate. One bug others might encounter is empty values in other columns: I dropped empty values only from the columns I used, so using other columns that contain empty values could cause issues. Another possible bug is the configuration of the graph in NetworkX due to the size of the dataset. Creating a graph with every node from the dataset might cause processing issues because of the large number of nodes and edges to render. To fix this, I focused on the top 150 nodes, which also keeps the visualization uncluttered and readable.
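These cleaning steps might look like the following in pandas. This is a sketch under assumptions: the sample rows are made up, `clean_hashtags` is a hypothetical helper, and the column names `Text` and `Hashtags` are taken from the field list above. The order of the steps is a design choice here and may differ from the original code.

```python
import pandas as pd

# Made-up rows using two of the fields listed above.
df = pd.DataFrame({
    "Text": ["Calm morning", "Big trip ahead", "No tags here"],
    "Hashtags": [" #Serenity #Nature ", "#Excitement #Travel", None],
})


def clean_hashtags(df):
    """Drop posts without hashtags, lowercase them, and split into tokens."""
    # Drop rows only where the Hashtags column is empty.
    out = df.dropna(subset=["Hashtags"]).copy()
    # Lowercase, then split on whitespace into a list of hashtags per post.
    out["Hashtags"] = out["Hashtags"].str.lower().str.split()
    return out


cleaned = clean_hashtags(df)
print(cleaned["Hashtags"].tolist())
```

Passing `subset=["Hashtags"]` to `dropna` keeps rows that are missing values in other columns, which matches the caveat that empty values elsewhere were left in place.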

Limitations/Bias:

One limitation of this dataset is its specificity: it does not include all social media platforms. This limits the groups of users, the themes and topics they post about, and the hashtags they use, which could introduce bias by skewing the data towards certain demographics or topics. To mitigate this, I focused on hashtag data because hashtags are used in much the same way across platforms, unlike likes and retweets. The analysis itself is limited in scope because it focuses primarily on the co-occurrence of hashtags and overlooks contextual factors, such as sentiment, that contribute to hashtag importance. Incorporating other columns from the dataset, such as sentiment and platform, could give more contextual insight into the importance of each hashtag. To address biases, the dataset could include demographic information to provide a more comprehensive perspective on trends in web-based networks. However, this also introduces the possibility of bias during analysis based on that information.

Here is a link to my GitHub repository that contains the code I have developed for this assignment: https://github.com/cwangg/INST414-Modules/tree/main/module-2
