Unraveling Influence on Twitter

James van Doorn
INST414: Data Science Techniques
Oct 2, 2023

Non-Obvious Insight

On platforms like Twitter, users follow and interact with others based on the perceived importance of their content and their influence within the community. The non-obvious insight I aim to extract from this analysis is which Twitter users are the most influential. That answer can help newcomers decide who to follow for valuable content. It can also guide aspiring influencers by pointing them to accounts whose practices are worth studying.

Data Source

The dataset originates from SNAP’s “ego-Twitter” dataset. It’s a goldmine of Twitter connections, where nodes represent individual Twitter users, and edges signify the “follow” relationships between them. This dataset provides a snapshot of Twitter’s social graph, capturing the intricate web of connections that define the platform’s social fabric.

Defining Nodes and Edges

In this network, each node represents a unique Twitter user, identified by their user IDs. Edges, on the other hand, denote the act of one user following another. If User A follows User B, there’s an edge from A to B. To construct this graph, I used NetworkX, a Python library tailored for network analysis.
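As a minimal sketch of the edge direction described above, the snippet below builds a tiny directed graph with NetworkX. The user labels and follow pairs are made up for illustration and do not come from the dataset.

```python
import networkx as nx

# Hypothetical toy data: (follower, followed) pairs, not taken from the dataset.
follows = [("A", "B"), ("C", "B"), ("B", "A")]

G = nx.DiGraph()
G.add_edges_from(follows)  # each pair becomes an edge follower -> followed

# Incoming edges are followers; outgoing edges are accounts the user follows.
followers_of_b = sorted(G.predecessors("B"))
followed_by_b = sorted(G.successors("B"))

print("B is followed by:", followers_of_b)
print("B follows:", followed_by_b)
```

Reading the graph this way, a user's follower count is the number of edges pointing at their node.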

Importance in the Twitterverse

What does it mean for a Twitter user to be “important”? In this context, importance is determined by the number of followers. The more followers a user has, the more influential they tend to be. Think of it as a measure of their reach and impact in the Twitter community.

Identifying Important Nodes

Finding the most influential Twitter users means finding the nodes with the most followers. I’ll calculate the degree centrality of each user and treat the top three nodes as the most important Twitter users. One caveat: because edges point from follower to followed, a user’s follower count is specifically their in-degree. NetworkX’s degree_centrality counts incoming and outgoing edges together, while in_degree_centrality isolates followers.
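Since an edge runs from follower to followed, follower count corresponds to a node's in-degree. The sketch below computes in-degree centrality by hand on made-up edges, using the same normalization NetworkX applies (in-degree divided by n - 1, where n is the number of nodes). The edge list is hypothetical.

```python
from collections import Counter

# Hypothetical (follower, followed) pairs, not taken from the dataset.
edges = [(1, 2), (3, 2), (4, 2), (2, 1)]

nodes = {user for edge in edges for user in edge}
follower_counts = Counter(followed for _, followed in edges)

# In-degree centrality: share of the other n - 1 users who follow this user.
n = len(nodes)
in_degree_centrality = {node: follower_counts[node] / (n - 1) for node in nodes}

# Rank users by centrality, highest first, and keep the top three.
top = sorted(in_degree_centrality.items(), key=lambda kv: kv[1], reverse=True)[:3]
print(top)
```

Here user 2 is followed by all three other users, so its centrality is 1.0, the maximum possible value.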

Software Used to Facilitate

To kick off the analysis, I import the ego-Twitter dataset and create an empty directed graph using NetworkX. Then, I read through the “twitter_combined.txt” file, where each line represents a follower-followed relationship, and add edges accordingly. Next, I calculate the degree centrality for each user, indicating their importance based on the number of followers. The top three users with the highest degree centrality represent the most influential nodes.
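The steps above might look roughly like the following. This is a sketch, not the exact code from my repository: the helper names load_follow_graph and top_users are hypothetical, the sample file stands in for “twitter_combined.txt”, and it ranks by nx.in_degree_centrality so that only followers are counted (nx.degree_centrality would also count the accounts a user follows).

```python
import os
import tempfile

import networkx as nx


def load_follow_graph(path):
    """Build a directed graph from lines of 'follower followed' ID pairs."""
    graph = nx.DiGraph()
    with open(path) as handle:
        for line in handle:
            parts = line.split()
            if len(parts) != 2:
                continue  # skip blank or malformed lines
            follower, followed = map(int, parts)
            graph.add_edge(follower, followed)
    return graph


def top_users(graph, k=3):
    """Return the k (user, score) pairs with the highest in-degree centrality."""
    scores = nx.in_degree_centrality(graph)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]


# Stand-in sample with made-up IDs; swap in the path to twitter_combined.txt.
sample = "1 2\n3 2\n4 2\n2 1\n3 1\n"
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write(sample)
    sample_path = tmp.name

G = load_follow_graph(sample_path)
os.remove(sample_path)

ranking = top_users(G)
for user, score in ranking:
    print(user, round(score, 3))
```

On the real file, the same two calls would be load_follow_graph("twitter_combined.txt") followed by top_users(G).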

Cleaning and Bugs

Regarding data cleaning and bugs, I started by validating the source and authenticity of the ego-Twitter dataset from SNAP. Ensuring the data’s integrity is essential before proceeding with any analysis. Next, I checked the data format to understand its structure. In this dataset, I found that the edges were represented as pairs of Twitter user IDs in a simple text file format. I then looked for missing values within the dataset. Fortunately, in this case, I did not see any missing user IDs or connection information. Duplicate entries, if present, could also skew centrality calculations. I used NetworkX’s built-in functions to detect and remove duplicate edges from the dataset. Lastly, ensuring that the data consistently adheres to the specified format is crucial. Inconsistent data could lead to misinterpretations. I verified that each line in the “twitter_combined.txt” file followed the expected format (two integers representing user IDs).
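The format and duplicate checks can be sketched in plain Python. This is an illustration under assumptions, not the code I ran: the function name check_edge_lines, the regular expression, and the sample lines are all hypothetical.

```python
import re

# Assumed line format: two integer user IDs separated by whitespace.
LINE_PATTERN = re.compile(r"^\d+\s+\d+$")


def check_edge_lines(lines):
    """Return (malformed line numbers, duplicate count, set of unique edges)."""
    seen = set()
    malformed = []
    duplicates = 0
    for number, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not LINE_PATTERN.match(line):
            malformed.append(number)
            continue
        edge = tuple(map(int, line.split()))
        if edge in seen:
            duplicates += 1
        seen.add(edge)
    return malformed, duplicates, seen


# Made-up lines: one repeated edge and two lines that break the format.
sample_lines = ["1 2\n", "3 2\n", "1 2\n", "oops\n", "4\n"]
malformed, duplicates, unique_edges = check_edge_lines(sample_lines)
print("malformed lines:", malformed)
print("duplicate edges:", duplicates)
```

Note that nx.DiGraph stores at most one edge per ordered pair of nodes, so repeated lines collapse into a single edge when the graph is built. A check like this is mainly useful for reporting how many duplicates the raw file contains.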

[Figure: Most important nodes in the network (based on degree centrality)]

Limitations and Bias

  1. I assumed that the ego-Twitter dataset accurately represents the Twitter follower-followed relationships. Any inaccuracies or missing data in the original dataset could lead to biased results.
  2. I was not able to convert the user IDs to Twitter usernames using Twitter’s API, so the top-ranked users are identified only by numeric ID. This limits how interpretable and useful the findings in this post are.
  3. While anonymized user IDs were used, it’s important to note that Twitter user IDs may still contain identifiable information. Ensuring the privacy and ethical handling of user data is crucial.
  4. The dataset represents a snapshot of Twitter connections at a specific point in time. Users’ follower counts and influence levels can change over time, so my analysis provides a static view.
  5. The analysis focused solely on degree centrality as a measure of influence. Real-world influence on Twitter is more complex and could involve factors like retweets, likes, and engagement, which were not considered here.
  6. The dataset combines a sample of ego networks rather than the full Twitter graph, and this analysis is specific to that context. The results may not be directly applicable to other social networks or to Twitter as a whole.
  7. I did not perform extensive validation of user IDs due to privacy constraints. This could potentially lead to inaccuracies in my analysis if the dataset contains invalid or outdated user IDs.

In conclusion, while this analysis provides insights into the influential users within the ego-Twitter dataset, it comes with several limitations and considerations. The data’s source, integrity, privacy, and representativeness all play crucial roles in the validity of my findings. Further research and additional data sources would be needed for a more comprehensive understanding of influence on Twitter.

Check out my GitHub repository: jvand0/inst414_work: A repository containing work for INST414 (Data Science Techniques) (github.com)
