A Network Analysis of International Women’s Day in Latin America
By: Denise Neuman and Amritangshu Mukherjee
Introduction
March 8th is International Women’s Day. On this day we acknowledge the fight of women all over the world to be given the same rights and opportunities as men, and we recognize both what we have achieved by raising our voices and the long way we have yet to go. In Latin America, however, this day has also become the date of an annual march called “Marcha 8M”, held across different cities, which gives women a platform to protest against the sexist violence that persists in these countries.
During this day (as well as the few days before and after it), many Tweets with hashtags related to the topic are posted. Some of them are published in support of the movement and others to express opposition to it. We decided to scrape one week’s worth of Tweets a few days after March 8th in order to get data on the movement, observe the network, detect communities within it and analyze the sentiment of the posts. For this, we used Python as the programming language and Gephi as the visualization tool.
Scraping Data And Creating The Network
To get the data required for our analysis, we used Python’s “tweepy” module to scrape tweets with hashtags like “#8demarzo”, “#marcha8m”, “#nosonformas”, “#diainternacionaldelamujer”, “#sevaacaer” and “#niunamas”. The information obtained through Twitter’s API with the code presented below was stored in a Pandas DataFrame, depicted in Figure 1, that included the tweet ID, the username of the author, the text of the tweet, creation date, location, hashtags, etc. The Username and Text have been blurred to protect the privacy of the Twitter users.
import pandas as pd
import numpy as np
import tweepy
# NOTE: `api` is assumed to be an authenticated tweepy.API instance created beforehand
# Initialize dataframe
df = pd.DataFrame(columns=['ID', 'Username', 'Text', 'Retweet Count', 'Favorite Count', 'Created At', 'Location', 'Hashtag'])
# List of hashtags to search for
hashtag_list = ['#marcha8m', '#8demarzo', '#nosonformas', '#niunamas', '#diainternacionaldelamujer', '#sevaacaer']
# Total tweets counter
total_tweets = 0
tweets_wanted = 5000
# Iterate through each hashtag and paginate through tweets
for hashtag in hashtag_list:
    hashtag_tweets = tweepy.Cursor(api.search_tweets, q=hashtag, tweet_mode='extended').pages()
    hashtag_counter = 0
    print('Start {}'.format(hashtag))
    for page in hashtag_tweets:
        for tweet in page:
            # Compare against the column values; `in df['ID']` alone would test the index
            if tweet.id not in df['ID'].values and total_tweets < tweets_wanted:
                df.loc[len(df.index)] = [tweet.id, tweet.user.screen_name, tweet.full_text, tweet.retweet_count, tweet.favorite_count, tweet.created_at, tweet.user.location, hashtag]
                total_tweets += 1
                hashtag_counter += 1
            # Stop reading this page once the target is reached
            if total_tweets == tweets_wanted:
                break
        # Stop paginating once the target is reached
        if total_tweets == tweets_wanted:
            break
    print('Number of tweets for {} is {}'.format(hashtag, hashtag_counter))
print('Total number of tweets: {}'.format(total_tweets))
# Save dataframe to CSV
df.to_csv('tweets.csv', index=False)
In total, we got over 20,000 tweets, but as many posts had more than one hashtag, after filtering duplicates we ended up with 9,193 instances.
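The duplicate filtering mentioned above can be sketched as follows. This is a minimal illustration on made-up rows, assuming that duplicates are identified by the tweet ID and that the first collected copy is kept; the exact rule we applied is not shown in this article.

```python
import pandas as pd

# Toy rows (invented): tweet 2 was collected twice because it carries two hashtags.
raw = pd.DataFrame({
    "ID": [1, 2, 2, 3],
    "Username": ["a", "b", "b", "c"],
    "Hashtag": ["#8demarzo", "#8demarzo", "#niunamas", "#niunamas"],
})

# Assumption: a duplicate is a repeated tweet ID; keep the first copy only.
deduped = raw.drop_duplicates(subset="ID", keep="first").reset_index(drop=True)

print(len(raw), "rows before,", len(deduped), "rows after")
```

Note that keeping only the first copy also keeps only the first hashtag under which the tweet was found.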
For the creation of the network, three elements of each tweet were retrieved:
- Source User: user who posted the tweet or who re-tweeted it.
- Target User: 1) If the tweet was originally posted by the Source User, then the Target User would be the same account. 2) If the instance is a re-tweet, then the Target User would be the Source User of the original tweet. 3) If the instance is a mention of another user in a tweet, then the Target User would be the mentioned account
- Content Type: 1) Tweet (original content of the Source User). 2) RT (re-tweet). 3) Mention (original content of the Source User with a mention of another user)
The Source User was directly obtained from the scrape command in the form of the “Username”. For the Target User and Content Type, however, the function “findall” from the “re” module was used. By finding “words” beginning with the @ symbol, we were able to find mentions of accounts within the text, and by looking for posts that started with the letters “RT”, we were able to identify re-tweets (and differentiate between mentions in original posts and mentions inside re-tweets). An extract of the code is shown below and a sample of the resulting table is pictured in Figure 2; again, usernames are blurred for privacy.
import re
import pandas as pd
original_df = df.copy()
network_df = pd.DataFrame(columns=['Source', 'Target', 'Content_Type'])
for i in range(len(original_df)):
    text = original_df.iloc[i]['Text']
    author = '@' + original_df.iloc[i]['Username'] # Get author
    # Get other usernames in the tweet; strip the ':' that follows the handle in "RT @user: ..."
    inText_usernameList = [u.rstrip(':') for u in re.findall('@[a-zA-Z0-9_:]+', text)]
    if text[:2] == 'RT' and inText_usernameList: # If it is a retweet (guard avoids an IndexError when no handle is found)
        network_df.loc[len(network_df.index)] = [author, inText_usernameList[0], 'RT']
        # Any further handles are mentions inside the retweeted text
        for mention in inText_usernameList[1:]:
            network_df.loc[len(network_df.index)] = [author, mention, 'Mention']
    elif inText_usernameList: # Original tweet that mentions other users
        for mention in inText_usernameList:
            network_df.loc[len(network_df.index)] = [author, mention, 'Mention']
    else: # No other username in this tweet: the author is also the target
        network_df.loc[len(network_df.index)] = [author, author, 'Tweet']
network_df.to_csv('network.csv', index=False)
network_df
Once we had a DataFrame with all the connections, the network was created with the from_pandas_edgelist function of Python’s “Networkx” module. With the network in place, we could move on to the next step: understanding it.
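As a rough sketch of this step, the snippet below builds a directed graph from a small, invented edge list with the same three columns as our table. The handles are placeholders, and the choice of a directed graph mirrors the export code used later in this article.

```python
import networkx as nx
import pandas as pd

# Invented edge list with the same columns as the network table.
edges = pd.DataFrame({
    "Source": ["@ana", "@ana", "@luz", "@eva"],
    "Target": ["@luz", "@eva", "@eva", "@eva"],
    "Content_Type": ["RT", "Mention", "Mention", "Tweet"],
})

# Each row becomes a directed edge Source -> Target carrying its content type.
G = nx.from_pandas_edgelist(
    edges,
    source="Source",
    target="Target",
    edge_attr="Content_Type",
    create_using=nx.DiGraph(),
)

print(G.number_of_nodes(), "nodes,", G.number_of_edges(), "edges")
print("self-loops:", nx.number_of_selfloops(G))
```

Rows of type “Tweet” point from a user to the same user, so they appear in the graph as self-loops. In a DiGraph, repeated Source/Target pairs collapse into a single edge that keeps the attributes of the last row.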
In order to analyze the structure of the network and how the nodes of the network are connected with one another, we decided to look into nodes that may be driving the movement in the social media channel. We analyzed the sentiment of the tweets, looked at three important centrality metrics, detected communities with cliques and k-cores and visualized the network as a whole. In the following sections we will discuss our findings.
Centrality Metrics Analysis
The module “Networkx” offers built-in functions to compute Degree, Betweenness and Closeness centrality metrics. We used these functions to find important users. For this publication we have changed the usernames of the network in order to preserve privacy.
- Degree: Number of connections that a user has. The top three users with the highest degree centrality metrics are “userA”, “userB” and “userC”.
- Betweenness: How well a node connects different parts of the network. A user with high betweenness is able to transmit new information to diverse parts of the network. The top four users with the highest betweenness centrality metrics are “userA”, “userD”, “userB” and “userC”.
- Closeness: How close a user is, on average, to every other user in the network (it is based on the shortest-path distances between nodes). A user with high closeness is able to transmit information more efficiently throughout the network. The top four users with the highest closeness centrality metrics are “userA”, “userD”, “userE” and “userF”.
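The three metrics above can be computed along these lines. This is a minimal sketch on a toy directed graph; the node names and the helper top_n are made up for illustration and are not taken from our notebook.

```python
import networkx as nx

# Toy directed graph: b, c and d all point at a, and a points at e.
G = nx.DiGraph()
G.add_edges_from([("b", "a"), ("c", "a"), ("d", "a"), ("a", "e")])

degree = nx.degree_centrality(G)            # (in-degree + out-degree) / (n - 1)
betweenness = nx.betweenness_centrality(G)  # share of shortest paths passing through a node
closeness = nx.closeness_centrality(G)      # for directed graphs, based on incoming distances


def top_n(scores, n=3):
    """Hypothetical helper: node names sorted by score, highest first."""
    return sorted(scores, key=scores.get, reverse=True)[:n]


print("degree:", top_n(degree))
print("betweenness:", top_n(betweenness))
print("closeness:", top_n(closeness))
```

On a directed graph, Networkx’s closeness uses distances towards a node by default, so a user who is reached by many others scores high; calling it on G.reverse() gives the outward version.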
Community Detection
To detect communities, we look for groups of nodes that are connected to each other. We considered two types of communities: cliques and cores.
- Clique: A subset of the network where all the users are connected to each other, that is, the subset is fully connected. In this network, the largest cliques found were subnetworks of five users each (5-clique). As observed in Figure 3 (left side), there are four cliques of size 5. Three of the users within these four cliques belong to all of them, while the other four users are part of only two cliques each. For our purposes, we will focus on the three well-connected users and, as they are different from the accounts mentioned in the previous section, we will call them “userG”, “userH” and “userI”.
- Core: A subset of the network where each user has at least k connections. Cores tend to be larger than cliques because they do not need to be fully connected. In this case, the largest cores have 11 users each (11-core) and a total of 28 users are part of these cores. Figure 3 (right side) shows the subnetwork of the 28 users belonging to the 11-cores. One thing that caught our attention was that “userD” was part of many of these cores.
In the following graphs, the usernames of the nodes have been omitted to preserve privacy.
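For readers who want to reproduce this kind of search, here is a small sketch using Networkx’s find_cliques, core_number and k_core on a toy graph. It assumes the directed network is first converted to an undirected one and that self-loops are dropped, since the core functions do not accept self-loops; whether our own notebook prepared the graph exactly this way is not shown here.

```python
import networkx as nx

# Toy directed graph: a 4-clique {a, b, c, d}, a tail d -> e -> f, and one self-loop.
D = nx.DiGraph()
D.add_edges_from([
    ("a", "b"), ("a", "c"), ("a", "d"),
    ("b", "c"), ("b", "d"), ("c", "d"),
    ("d", "e"), ("e", "f"),
    ("a", "a"),
])

# Assumption: ignore edge direction and remove self-loops before the search.
U = D.to_undirected()
U.remove_edges_from(list(nx.selfloop_edges(U)))

# Maximal cliques, then keep only the largest ones.
cliques = list(nx.find_cliques(U))
largest = max(len(c) for c in cliques)
largest_cliques = [sorted(c) for c in cliques if len(c) == largest]

# Core number of each node, and the subgraph with the highest k.
core_numbers = nx.core_number(U)
k_max = max(core_numbers.values())
main_core = nx.k_core(U, k=k_max)

print("largest cliques:", largest_cliques)
print("k_max:", k_max, "main core:", sorted(main_core.nodes()))
```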
Sentiment Analysis
The next step was to separate positive tweets from negative ones. This was done by translating the tweets to English (using the GOOGLETRANSLATE function of Google Sheets) and loading the resulting CSV back into Python. Afterwards, the SentimentIntensityAnalyzer class of the “vaderSentiment” module was used to get the sentiment of the English-translated tweets. The code below shows how this was performed.
#generate sentiment using Translated column
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
analyzer = SentimentIntensityAnalyzer()
tweets_df['Sentiment'] = tweets_df['Translated'].apply(lambda x: analyzer.polarity_scores(x)['compound'])
#make the sentiment 'Positive' or 'Negative'
tweets_df['Sentiment'] = tweets_df['Sentiment'].apply(lambda x: 'Positive' if x > 0 else 'Negative')
After adding the sentiment to each tweet, the code above Figure 2 was re-run, adding now a fourth column for “Sentiment”. Figure 4 shows the resulting table. Once more, usernames are blurred for privacy.
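One detail of the labeling rule above is worth noting: any tweet whose compound score is not strictly positive, including a perfectly neutral score of 0, is labeled “Negative”. A possible alternative, shown here only as a sketch and not as the rule we applied, is a three-way split using the ±0.05 cut-offs commonly suggested for VADER compound scores. The function name label_sentiment and its threshold parameter are hypothetical.

```python
def label_sentiment(compound, threshold=0.05):
    """Hypothetical three-way labeling of a VADER compound score in [-1, 1]."""
    if compound >= threshold:
        return "Positive"
    if compound <= -threshold:
        return "Negative"
    return "Neutral"


# Example compound scores (invented values, not taken from our data).
for score in (0.62, 0.0, -0.41):
    print(score, label_sentiment(score))
```

With this variant, the edge coloring in Gephi would need a third color for neutral tweets.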
Overall Network Visualization And Analysis
Network Visualization with Gephi
For this part of the analysis, we used Gephi, a software specialized in network visualization. The data is added to said program using the “Networkx” module. Again, the network is created using the DataFrame in Figure 4 and then it is passed to the write_gexf function which generates the Gephi file. The command is shown below.
import networkx as nx
# create the graph
temp = network_df
G = nx.from_pandas_edgelist(temp, 'Source', 'Target', edge_attr=True, create_using=nx.DiGraph())
# Export the graph to a GEXF file
nx.write_gexf(G, "graphn.gexf")
Once in Gephi, we were able to modify the layout of the network, and we found that the “Fruchterman Reingold” mode worked best for our purposes. We also asked the program to color the nodes based on the community to which each user belongs and to color the edges based on the sentiment of the tweet. The resulting network can be observed in Figure 5.
In the figure above, we can see many communities within the network, the most significant ones being the pink, light blue and light green ones. The largest node observed (at the bottom right) is the user with the highest out-degree, which we believe to be a bot. We will call it “userJ”.
In addition, after analyzing the nodes that appear to be the centroids of the communities, we found that “userA” is in the middle of the light green cluster, “userC” is in the middle of the light blue community, and “userB” is in the center of the pink community. Interestingly enough, the central users of the two positive communities (the red and black nodes) do not appear among the significant users found in the Centrality Metrics section.
Network Analysis
In order to analyze the network in a cleaner and more meaningful way, we decided to re-plot it excluding all nodes with two or fewer connections. For this, the following code was used.
import networkx as nx
# create the graph
temp = network_df
G = nx.from_pandas_edgelist(temp, 'Source', 'Target', edge_attr=True, create_using=nx.DiGraph())
# list to store nodes to be removed
remove_nodes = []
# loop through the nodes in the graph
for node in G.nodes():
    # check if the degree (in + out) of the node is 2 or less
    if G.degree[node] <= 2:
        remove_nodes.append(node)
# remove the nodes from the graph
G.remove_nodes_from(remove_nodes)
# Export the graph to a GEXF file
nx.write_gexf(G, "graphn.gexf")
The network was shaped with the “Fruchterman Reingold” mode. The resulting network can be observed in Figure 6.
We can distinguish nine different communities (gray, brown, black, purple, pink, green, light blue, turquoise and orange) and we can also see the green (positive) and red (negative) connections between the nodes. In addition, the size of each node indicates its relative out-degree (larger nodes are tweeting more and mentioning more users in their tweets, while smaller nodes do not tweet as much). It is important to note that the colors of these communities (in Figure 6) do not correspond to the colors of the communities in Figure 5.
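The out-degree used for the node sizes can also be computed in Python and stored on the nodes before exporting. The following is a minimal sketch on a toy graph; the attribute name out_degree is our own choice for illustration, and in our workflow the sizing itself was configured inside Gephi.

```python
import networkx as nx

# Toy directed graph with placeholder handles.
G = nx.DiGraph()
G.add_edges_from([("@ana", "@luz"), ("@ana", "@eva"), ("@luz", "@eva")])

# Out-degree = number of distinct users each account retweets or mentions.
out_deg = dict(G.out_degree())

# Store it as a node attribute so it travels with the graph when exported.
nx.set_node_attributes(G, out_deg, name="out_degree")

for node, data in G.nodes(data=True):
    print(node, data["out_degree"])
```

A graph prepared this way can be passed to write_gexf as before, and the stored attribute can then be picked as a ranking attribute for node size.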
Network Analysis
Now we dive deeper into each of the communities, looking for insights on which nodes are the most important ones based on who is tweeting more and who is being mentioned more in the tweets, etc.
Pink Community: After analyzing some of the nodes with the highest degree in this community, we found several political figures and agencies (such as the President of Mexico, the Tabasco Government and the Guanajuato Education Ministry), so we decided to name it “Government Community”. Figure 7 shows a close-up of this cluster. The President of Mexico node is marked in dark blue. As observed, it has a high in-degree of both sentiments.
Light Blue Community: Interestingly, this positive sentiment community also was found to be formed by political figures, however, we realized they were specific to Mexico City (CDMX). As observed in Figure 8, there are three central nodes which correspond to the city’s Governor (marked in dark blue), the city’s Social Well-being Ministry and the city’s Mental Health and Addiction Prevention Agency. For this reason, we named it “CDMX Government Community”.
Orange Community: “Heated discussions on women’s issues”. We observed a significant exchange of tweets with predominantly negative sentiment. The visualization showcases a distinct pattern, with a dense outer ring representing highly active users and a central cluster primarily consisting of women. It seems that these conversations revolve around the challenges and suffering faced by women in Mexico, highlighting the importance of addressing these issues. Notably, many tweets mention and tag “userD”, the law enforcement agency of Mexico City, indicating a call for its attention and involvement in addressing these concerns. The high engagement during this event underscores the need for continued dialogue and action to promote gender equality and women’s rights.
The Grey Discord: Unmasking Negative Influences and Suspected Bots. Moving on to the grey cluster, our analysis reveals a group of users characterized by a remarkably high out-degree, which suggests they actively engage in conversations and target a wide range of users. Interestingly, some users within this cluster, such as “userJ”, are suspected to be bots, raising concerns about the authenticity and intent behind their activity. A significant portion of the exchange in this cluster is negative and directed towards women, further exacerbating the challenges faced by the female community. Upon closer examination, it becomes apparent that the core nodes within the grey cluster predominantly consist of men. This observation highlights the need to address the potential spread of harmful narratives and disinformation, especially when it comes to sensitive topics such as gender equality and women’s rights.
In addition to the grey cluster discussed above, our analysis uncovers several other grey clusters in the Twitter network that also exhibit a substantial presence of negative sentiment. Unlike the initial grey cluster, these clusters display a more diverse composition, and their core nodes with high out-degrees are not exclusively men. This observation suggests that negative sentiment and targeted exchanges are not limited to a specific gender, but rather can emerge from various sources and demographics.
Where are the important users with high centrality metrics?
As observed, none of the important nodes mentioned in these communities is “userA”, “userB” or “userC”. We found this intriguing, so we decided to look for these accounts specifically within the network. Their nodes can be observed in Figure 11.
Looking only at the network was limiting our ability to understand why these users had high centrality metrics, so we decided to look up their accounts directly to find more useful insights. It was found that “userB” is actually a bot with high degree and betweenness centrality, connecting the green community with the pink “Government Community”. However, “userA” and “userC” were found to be just regular users who posted a tweet that went viral. Figures 12 and 13 show the number of re-tweets each of these posts has.
Conclusion
In summary, the analysis of the Twitter network surrounding the International Women’s Day in Latin America provided valuable insights into the movement’s dynamics and the key players driving the conversation. By leveraging Python, Networkx, and Gephi, the study identified important nodes based on their centrality metrics, detected communities within the network, and analyzed the sentiment of the tweets. The findings showed that political figures had a significant presence in the positive sentiment communities, indicating the importance of the government’s role in promoting gender equality and women’s rights. At the same time, the study highlighted the presence of harmful narratives and disinformation, as well as the need to address negative sentiment and targeted exchanges.
Key insights from the study include:
- Political figures and agencies played a significant role in the positive sentiment communities, indicating the importance of government involvement in promoting gender equality and women’s rights.
- Negative sentiment and targeted exchanges were present in the network, highlighting the need for continued dialogue and action to address the challenges faced by women in Latin America.
- The presence of bots and suspected negative influencers in the network raised concerns about the authenticity and intent behind their activity, emphasizing the need to address disinformation and harmful narratives.
- The study revealed the significant role of key players in driving the conversation, including users with high centrality metrics and viral tweets, underscoring the importance of understanding the drivers of social media movements.
Overall, the study contributes to a better understanding of the social media landscape surrounding the International Women’s Day movement in Latin America and highlights the ongoing fight for gender equality and women’s rights in the region. The findings underscore the importance of continued efforts to address the challenges and promote positive change, both in the online space and in the wider society.