What Twitter Friends Can Tell Us About the CDC — A Social Media Network Analysis

Published in Nerd For Tech · 21 min read · Mar 10, 2021

Introduction

Social media platforms have become some of the most popular technologies of the decade. Over 72% of Americans use some type of social media as of 2019, with a significant number of people visiting the platforms every single day, and large numbers of users spending several hours on them daily. [1] Between YouTube, Facebook, Instagram, Reddit, and Twitter, the uses of social media are many and varied: from uploading videos, posting selfies, and sharing memes to staying in contact with friends and family or making connections with online communities.

However, the social media sphere is also becoming an important portal for activism, politics and news. More Americans are relying on these platforms to keep up with current events than ever before, and one of the websites with the most news-focused users is Twitter. With 71% of US users using the site to get their news in 2018, one of Twitter’s primary purposes now is to help the public stay informed. [2] And after twelve months of presidential election developments, political scandals, and pandemic outbreaks, the year 2020 gave the world a lot of information to digest.

Perhaps the most newsworthy subject of the past year has been the spread of the Coronavirus, and the users of Twitter have been on top of that too. Millions of people tweeted about COVID-19 throughout its evolution, numerous health experts (such as Dr. Anthony Fauci) tweeted to the public on how to respond to the disease, and many medical institutions (like the Centers for Disease Control and Prevention) tweeted regular reports on the latest viral statistics. Containing a constant stream of tweets that mention the Coronavirus, Twitter holds a treasure trove of data, and the site has even been used to track the spread of COVID-19 by scientific researchers. [3]

Yet despite Twitter’s use as an information hub, it has also become a hotspot for misinformation, particularly about the Coronavirus. Health and medicine are polarizing political topics, COVID-19 conspiracy theories abound, and distrust in institutions such as the CDC or experts like Fauci is at an all-time high. [4]

The CDC in particular is a target of such suspicion. Many public figures, including former presidents and politicians, have undermined its once-credible reputation, distorting its statements and doling out harsh criticisms against it. On the other hand, skepticism towards the agency is not solely the fault of third parties. The CDC itself is partly to blame for its tenuous track record of credibility during the Coronavirus pandemic, as it generated noticeable miscommunications and errors in reporting. [5] Some have felt that the CDC has also succumbed to political influences that sabotage its ability to effectively deal with this crisis, and satirical news sites have mocked the institution for being an untrustworthy authority on the topic of COVID-19. [6]

The role of social media in health and medicine is new and hard to navigate. Both information and misinformation spread like wildfire through an untamed digital wilderness. As a consequence, officials are held accountable for their statements like never before, now that each tweet on Twitter exists in a near-permanent record for everyone in the world to view. Institutions like the CDC have to be more informed and armed with a larger arsenal of resources than ever before in order to prove their credibility to the public.

For the CDC, one of the resources in its larger arsenal includes a wide array of social media pages outside of its primary profile. In addition to the CDC’s main username, “@CDCgov,” the institution is also associated with dozens of secondary Twitter accounts — all of them representing a unique area of interest and dedicated to providing specialized content for specific audiences (often without a mention of Coronavirus). Some of these profiles are meant for particular regions, such as “@CDCRwanda,” whereas others are meant for certain medical concerns, like “@CDCHeart_Stroke.” [7]

While millions of people follow the CDC’s primary profile on Twitter, these peripheral profiles tend to have far fewer followers, often numbering in the mere thousands. And while @CDCgov itself follows only a few users, known on Twitter as friends, many of these friends are its own peripheral profiles. At first glance, this looks like an insular network with a singular set of interests and a limited number of links to users from other backgrounds.

But what about the other friends on the CDC’s list? Aside from the users directly related to the agency, what additional accounts does the CDC think are important enough to follow? What users do those accounts follow? What kinds of connections exist between the users at the periphery of this social media network?

More importantly, are we able to assess the character of the CDC based on the network of friends that it has created for itself?

Can we detect any potential political leanings, affiliations, or affinities towards certain groups or ideas based on the users that it is either directly or indirectly linked to? Can we see any clearly defined communities emerging from this network? What insights can we gain by understanding the CDC’s web of relationships on social media? Perhaps we can learn whether the CDC truly is a trustworthy and unbiased source of information based on its links to other social media users.

Network Analysis

In order to explore these questions, we have to conduct a network analysis — a method of extracting useful information from the complex relationships between interconnected elements, such as individuals in a social structure, by breaking down and visualizing data. Networks are often represented in the form of graphs, with points or nodes on the graph acting as individual elements, and lines or edges between the nodes acting as the relationships that the different elements have with each other.

One type of relationship structure is known as an egocentric network, which consists of a single central node or ego that connects to all other nodes or alters in the network. In this case, the central node or ego will be the CDC’s main Twitter account, the nodes or alters immediately surrounding the ego will be all of the Twitter accounts that the CDC follows, and the nodes surrounding the alters will be a collection of Twitter accounts that the alters follow. This network will also be visualized as an undirected graph, where an edge between two nodes records a follow between the accounts without regard to its direction, as opposed to a directed graph, where each edge flows in one direction only, from the follower to the account being followed.
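To make the distinction concrete, here is a minimal sketch using NetworkX and a handful of made-up account names (none of them are real data from this project). It loads the same follow relationships into a directed and an undirected graph and compares the edge counts.

```python
import networkx as nx

# Hypothetical follow relationships as (follower, followed) pairs
follows = [
    ("ego", "alter_a"),
    ("ego", "alter_b"),
    ("alter_a", "ego"),       # alter_a happens to follow the ego back
    ("alter_a", "outer_1"),
    ("alter_b", "outer_2"),
]

directed = nx.DiGraph(follows)    # remembers who follows whom
undirected = nx.Graph(follows)    # collapses each pair into a single link

# ego -> alter_a and alter_a -> ego count as two directed edges
# but only one undirected edge
n_directed_edges = directed.number_of_edges()
n_undirected_edges = undirected.number_of_edges()
print(n_directed_edges, n_undirected_edges)
```

In the undirected version, a reciprocated follow and a one-way follow look identical, which is the trade-off accepted for the rest of this analysis.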

APIs, Libraries, and Programs

Performing a proper social media network analysis requires a certain set of tools:

A Twitter Developer Account
Jupyter Notebook with Python
Tweepy, a Python Library for the Twitter API
Pandas, a Python library for organizing and analyzing data
NetworkX, a Python library for creating and manipulating networks
Gephi, a desktop program for visualizing networks and graphs

Connecting to Twitter

First we must apply for a developer account with Twitter. [8] Once it has been approved by the Twitter staff, we can look at our application settings to see what our keys and access tokens are. Then we can open up Jupyter Notebook and install Tweepy using pip.

pip install tweepy

From here we import the library into our notebook and create some new identifiers to represent our keys and tokens (which are X’d out here to preserve the anonymity of my own IDs).

import tweepy

api_key = "XXX"
api_secret = "XXX"
bearer_token = "XXX"

After that we authorize Tweepy to use our Twitter credentials and create an API object. Because Twitter has certain rate limits in place on the number of tweets or accounts that a developer can retrieve in a given amount of time, we want our API to pause its work whenever the rate limit is reached, and then resume when the next window of time opens. [9]

auth = tweepy.AppAuthHandler(api_key, api_secret)
api = tweepy.API(auth, wait_on_rate_limit=True)

When answering an API request that requires iteration through large amounts of information, Twitter feeds back its data in a series of discrete pages, which helps to return more results than can be retrieved in a single response. This method is known as pagination. [10]

As a result, some API requests require entering in a page parameter to loop through the paginated data. Fortunately for us, the Tweepy library features a Cursor object to help us iterate through these pages behind the scenes, eliminating the need for us to provide a page parameter, simplifying our code, and making our lives a lot easier. [11]
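The general idea behind a cursor can be sketched in plain Python. This is not Tweepy's actual implementation; fetch_page and iterate_items are hypothetical helpers, and the fake data stands in for a paginated API response. It only illustrates how an iterator can hide the page-by-page requests from the caller.

```python
# Hypothetical stand-in for a paginated endpoint: it returns one page of
# results plus the cursor for the next page (None when nothing is left).
FAKE_FRIENDS = [f"user_{i}" for i in range(7)]
PAGE_SIZE = 3

def fetch_page(cursor=0):
    page = FAKE_FRIENDS[cursor:cursor + PAGE_SIZE]
    next_cursor = cursor + PAGE_SIZE
    if next_cursor >= len(FAKE_FRIENDS):
        next_cursor = None
    return page, next_cursor

def iterate_items(limit=None):
    """Yield items one at a time, requesting new pages only as needed."""
    cursor = 0
    yielded = 0
    while cursor is not None:
        page, cursor = fetch_page(cursor)
        for item in page:
            if limit is not None and yielded >= limit:
                return
            yield item
            yielded += 1

first_five = list(iterate_items(limit=5))   # stops partway through page two
everything = list(iterate_items())          # walks all three pages
print(first_five)
print(len(everything))
```

The caller only ever sees a flat stream of items, which is the same convenience that the Cursor object provides in the loops that follow.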

We can see that the CDC is following 269 different Twitter accounts, so let’s input that number.

for friend in tweepy.Cursor(api.friends, id="CDCgov").items(269):
    print("CDCgov:", friend.screen_name, "|", friend.name)

The program prints all 269 friends from the CDC’s account. Here are the first 10.

CDCgov: JAMA_current | JAMA
CDCgov: WHCOVIDResponse | White House COVID-19 Response Team
CDCgov: YouTube | YouTube
CDCgov: NOSORH | Natl Organization of State Offices of Rural Health
CDCgov: ruralhealthinfo | RHIhub
CDCgov: RealTimeCOVID19 | COVID-19 Real-Time Learning Network (RTLN)
CDCgov: CDC_Firstline | CDC's Project Firstline
CDCgov: FBI | FBI
CDCgov: CDC_DRH | CDC Division of Reproductive Health
CDCgov: NIDAnews | NIDAnews

Now let’s try 3 friends of friends for each of the first 3 friends of the CDC.

for friend in tweepy.Cursor(api.friends, id="CDCgov").items(3):
    print("CDCgov:", friend.screen_name, "|", friend.name)
    for friend_of_friend in tweepy.Cursor(api.friends, id=friend.screen_name).items(3):
        print("\t\t", friend_of_friend.screen_name, "|", friend_of_friend.name)

CDCgov: JAMA_current | JAMA
        DrNancyM_CDC | Dr. Nancy Messonnier
        Neil_R_Powe | Neil R. Powe
        DeniseScholtens | Denise Scholtens
CDCgov: WHCOVIDResponse | White House COVID-19 Response Team
        CDCDirector | Rochelle Walensky, MD, MPH
        VP | Vice President Kamala Harris
        POTUS | President Biden
CDCgov: YouTube | YouTube
        uni_mugi_hachi | 仲良し保護猫うにむぎはちむー😸🇯🇵
        ahoy_zoe | ⚡️ Zoe Clapp ⚡️
        itsjojosiwa | JoJo Siwa!🌈❤️🎀

What if we want to know the exact number of friends that each of the CDC’s friends has?

for friend in tweepy.Cursor(api.friends, id="CDCgov").items():
    print("Friends of", friend.screen_name, "|", friend.name, ":", friend.friends_count)

JAMA_current | JAMA : 863
WHCOVIDResponse | White House COVID-19 Response Team : 4
YouTube | YouTube : 1201
NOSORH | Natl Organization of State Offices of Rural Health : 375
ruralhealthinfo | RHIhub : 355
RealTimeCOVID19 | COVID-19 Real-Time Learning Network (RTLN) : 30
CDC_Firstline | CDC's Project Firstline : 21
FBI | FBI : 2126
CDC_DRH | CDC Division of Reproductive Health : 131
NIDAnews | NIDAnews : 264

The numbers vary quite a bit. To look at the highest and lowest friend counts of all accounts that are friends with the CDC, we can create a list, append the rest of the friends to it, add a few of the list values to a Pandas dataframe, and sort the dataframe by friend count.

import pandas as pd

# Collect every friend of the CDC as a Tweepy user object
numFriends = []
for friend in tweepy.Cursor(api.friends, id="CDCgov").items():
    numFriends.append(friend)

# Keep a few fields per friend, then sort by friend count
friendFrame = pd.DataFrame()
friendFrame['screen_name'] = [friend.screen_name for friend in numFriends]
friendFrame['name'] = [friend.name for friend in numFriends]
friendFrame['friend_count'] = [friend.friends_count for friend in numFriends]
friendFrame = friendFrame.sort_values(by=['friend_count'])
friendFrame

Now we know what kind of range we’re working with: the Twitter accounts for RNsightsSchoolNurses and the White House COVID-19 Response Team have the lowest number of friends at 3 and 4, whereas the US Department of the Interior and the American Cancer Society are friends with 127,768 and 175,736 other users, respectively. With the Pandas sum function we can also get the total number of friends for every friend in the list.

friendFrame.sum(axis=0)

friend_count    671333

That is a lot of Twitter accounts. Based on the API’s rate limit, it would take over 180 hours to retrieve the name of every single friend of friend in this network. That amounts to almost 8 straight days of automated requests, which is just not a feasible frame of time to work with. If we want to analyze this network, we have to make some compromises in the interest of time. Since the CDC has 269 of its own friends, let’s set a strict cap of 269 other friends to fetch from each friend of the CDC. If each of those 269 users has 269 friends of their own, then our network may have as many as roughly 72,000 nodes (269 × 269 = 72,361 friends of friends), and could take as long as 24 hours to extract. These are large numbers and timeframes, but they’re just within the limits of workability.
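The size of the capped network can be worked out explicitly. The short sketch below assumes the worst case, in which every account is distinct and every friend of the CDC has at least 269 friends of its own; the real network will be smaller wherever friend lists overlap or fall short of the cap.

```python
CAP = 269  # friends fetched per account, the cap chosen above

cdc_friends = CAP
friends_of_friends_max = CAP * CAP

# Upper bounds if no account ever appears on more than one list
max_nodes = 1 + cdc_friends + friends_of_friends_max  # CDCgov + friends + their friends
max_edges = cdc_friends + friends_of_friends_max      # one edge per follow retrieved

print(max_nodes, max_edges)
```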

Our next step is to import NetworkX, call the CDC’s friends and friends of friends from Twitter, add all of the friends and relationships as nodes and edges to a graph — and then wait.

import networkx as nx

graph = nx.Graph()
for friend in tweepy.Cursor(api.friends, id="CDCgov").items(269):
    graph.add_node(friend.screen_name)
    graph.add_edge("CDCgov", friend.screen_name)
    print("Working:", friend.screen_name)
    for friend_of_friend in tweepy.Cursor(api.friends, id=friend.screen_name).items(269):
        graph.add_node(friend_of_friend.screen_name)
        graph.add_edge(friend.screen_name, friend_of_friend.screen_name)
        print("\t\t", friend_of_friend.screen_name)

After waiting for an entire day and watching the program print out usernames, the graph is created, and we can write it as a file for Gephi.

nx.write_gexf(graph, "CDCgraph.gexf")
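Since the download takes so long, it may be worth confirming that the export round-trips before moving on to Gephi. The sketch below does this on a tiny placeholder graph written to a temporary directory (the node names and the file name toy.gexf are made up); the same check could be run against the real file.

```python
import os
import tempfile
import networkx as nx

# Small stand-in graph; the account names are placeholders
toy = nx.Graph()
toy.add_edge("ego", "alter_a")
toy.add_edge("alter_a", "outer_1")

path = os.path.join(tempfile.mkdtemp(), "toy.gexf")
nx.write_gexf(toy, path)

# Reading the file back should give the same node and edge counts
loaded = nx.read_gexf(path)
print(loaded.number_of_nodes(), loaded.number_of_edges())
```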

Statistics and Terms

When we first open our file in Gephi, we learn that we have 35,262 nodes and 60,702 edges — an enormous network of Twitter connections. And while the nodes and edges do exist on the graph, they are not organized in any way, with the whole cluster showing up as nothing more than a dense black square. In order to organize this network into a more meaningful form, we have to run some tests to obtain certain statistics about it.

Number of nodes and edges upon importing to Gephi (left), unorganized graph in Gephi (right)

The degree of a node is the number of edges that it has to other nodes in the network, whereas the average degree of the graph is a measurement of the average number of edges per node. Because the edges of this graph do not have weights, the average weighted degree and the average degree are both the same, with an average of 3.443 edges per node.
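The average degree can be checked by hand from the node and edge counts that Gephi reported. In an undirected graph every edge touches two nodes, so the average degree is twice the edge count divided by the node count.

```python
nodes = 35262   # node count reported by Gephi
edges = 60702   # edge count reported by Gephi

# Each undirected edge adds one to the degree of both of its endpoints
average_degree = 2 * edges / nodes
print(round(average_degree, 3))
```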

The network diameter is the shortest distance between the two most distant nodes in the network. Because this graph only includes friends of friends, there is a maximum of 4 steps between any pair of nodes — with the CDC at the center, a friend of the CDC on either side, and a friend of that friend at each of the furthest extremes.
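A toy version of this structure shows where the 4 comes from. The sketch below uses NetworkX with made-up node names: one ego, two alters, and one friend of each alter.

```python
import networkx as nx

# Miniature egocentric network with placeholder names
toy = nx.Graph()
toy.add_edges_from([
    ("ego", "alter_a"), ("ego", "alter_b"),
    ("alter_a", "outer_1"), ("alter_b", "outer_2"),
])

diameter = nx.diameter(toy)  # the longest of all shortest paths
longest_route = nx.shortest_path(toy, "outer_1", "outer_2")
print(diameter, longest_route)
```

The longest route runs from one friend of a friend, through its alter, the ego, and the other alter, to the friend of a friend on the opposite side.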

Closeness Centrality Distribution (left), Eigenvector Centrality Distribution (right)

The closeness centrality distribution shows the measure of centrality for each node in the network, or how far on average the node is from all other nodes. Nodes with a high closeness have the shortest distances to the other nodes.

The eigenvector centrality distribution shows the measure of the influence or importance that each node has in the network, which is a function of both the node’s own degree, and the degrees of the nodes that it is connected to. The nodes with the greatest deal of influence in the network are the ones that are connected to the most other nodes that themselves have a great deal of influence.
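Both measures can also be computed outside of Gephi. As a rough sketch, the NetworkX functions closeness_centrality and eigenvector_centrality are applied here to a miniature egocentric network with made-up node names; Gephi's own figures for the full graph may be normalized differently.

```python
import networkx as nx

# Miniature egocentric network with placeholder names
toy = nx.Graph()
toy.add_edges_from([
    ("ego", "alter_a"), ("ego", "alter_b"),
    ("alter_a", "outer_1"), ("alter_b", "outer_2"),
])

closeness = nx.closeness_centrality(toy)
eigenvector = nx.eigenvector_centrality(toy, max_iter=1000)

# The node with the highest score under each measure
most_central = max(closeness, key=closeness.get)
most_influential = max(eigenvector, key=eigenvector.get)
print(most_central, most_influential)
```

In a network this small the ego wins on both counts, since it sits the fewest steps from everyone and is attached to the best-connected neighbors.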

Modularity shows how divided the network is into different communities or clusters based on the connections between various groups of nodes. Networks with high modularity feature nodes that are densely connected to the nodes inside their own module, but sparsely connected to nodes outside of their module. With a resolution of 1, the modularity of this network is 0.652, with 48 distinct communities. However, only a few of the communities are sizable, with the largest one encompassing 10.99% (about 3,800) of the nodes and dozens of the smallest ones holding only 0.66% to 0.75% (in the mid 200s). This makes sense because some of the CDC’s 269 friends have up to 269 friends of their own but sparse relationships to the other friends of the CDC, resulting in many relatively isolated modules with a limited number of nodes.
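The idea can be illustrated on a tiny graph with NetworkX. Note that this sketch swaps in a different algorithm: Gephi's modularity tool uses the Louvain method, whereas greedy_modularity_communities performs greedy modularity maximization, so the two would not necessarily agree on a large graph. The example graph, two five-node cliques joined by a single edge, is made up for illustration.

```python
import networkx as nx
from networkx.algorithms import community

# Two densely connected groups of 5 nodes joined by a single bridge edge
toy = nx.barbell_graph(5, 0)

# Greedy modularity maximization (not the Louvain method that Gephi uses)
communities = community.greedy_modularity_communities(toy)
score = community.modularity(toy, communities)

sizes = sorted(len(c) for c in communities)
print(sizes, round(score, 3))
```

Each clique ends up as its own community because nearly all of its edges stay inside the group, which is the same pattern that produces the isolated modules around individual friends of the CDC.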

We can also open our data table to inspect the statistics of each node. Interestingly enough, while CDCgov is our focal node, and the one with the highest eigenvector centrality (or the node with the most influence), it does not have the highest degree (or edges to other nodes), and is not even listed in the top 10. And although we only allowed the API to collect a maximum of 269 friends from each account, the highest degree node is CDCDirector with a score of 338. This is because CDCDirector is not only linked to 269 friends of its own, it is also featured as a friend on the lists of many other accounts. As a result, the connections from those other accounts are counted towards its own, and its degree is higher than anyone else’s.

Nodes ordered by Eigenvector Centrality (left) VS nodes ordered by Degree (right)
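The way a degree can exceed the cap is easier to see at small scale. In the sketch below, the friend lists are invented and the cap is set to 3 instead of 269; the account called director keeps its own 3 friends and also appears on three other lists.

```python
import networkx as nx

CAP = 3  # toy cap on friends fetched per account (the article uses 269)

# Hypothetical friend lists, each already limited to CAP entries
friend_lists = {
    "ego": ["director", "agency", "office"],
    "director": ["doctor_1", "doctor_2", "doctor_3"],
    "agency": ["director", "ego", "partner"],
    "office": ["director", "agency", "ego"],
}

graph = nx.Graph()
for account, friends in friend_lists.items():
    for friend in friends[:CAP]:
        graph.add_edge(account, friend)

# In an undirected graph, being on someone else's list also adds to degree
degrees = dict(graph.degree())
top_account = max(degrees, key=degrees.get)
print(top_account, degrees[top_account], degrees["ego"])
```

Because the graph is undirected, incoming and outgoing follows are pooled into one count, so an account that many others follow can outrank the ego by degree.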

Now that we’ve run some statistics on our network, we can start organizing it into a clearer composition. By using the OpenOrd layout adjustment, we can scatter the nodes and edges apart from each other into a tightly wound but evenly distributed web of overlapping connections. Then, we can apply a set of colors for the different modules to make the nodes and edges stand out. Because there are so many different communities, it will be difficult to interpret all of them, so we must make a decision to focus on only the most significant ones. Six bold colors are applied to the 6 largest communities, 6 desaturated colors are applied to the next 6 communities, and the rest of the smaller communities are set to a neutral gray tone. Then, we can run ForceAtlas 2 to expand each module so that the individual nodes are more visible.

Network after OpenOrd (left) VS network after module colorization (right)

While the network is far from intelligible at this point, we can still see some interesting patterns emerge. Most modules are of a similar size and shape, each one comprised of a single friend of the CDC and up to 269 friends of its own. Many of the modules are also distant and distinct from each other, indicating isolated communities with few connections to the rest of the network. However, as an egocentric network, all nodes and communities radiate outwards from a single focal node in the center, and the modules that are closest to the center start to merge together. Module #3 in red is the largest, densest and most central community containing 10.99% of the network. Module #23 in blue is the next largest at 8.64%, with a significant portion of its nodes blending into Module #3, and several smaller communities standing apart from the rest in the periphery. Green module #16, orange module #21, cyan module #41, and pink module #15 are all smaller and more scattered but still relatively dense and central. Every module after that diminishes in size, density and centrality.

Identifying Groups

Now, in order to understand the identity of each module, we must apply labels to all of the individual nodes. We can also adjust the sizes of each node to enlarge the most influential elements of the network, while shrinking the nodes that are not as important. However, we run into some problems when working with labels…

Closeup on network after enlarging influential nodes

The network is absolutely enormous, the nodes are extremely dense, and the labels are stacked on top of each other. The only way for us to make sense of this is to use the label adjust feature, which pushes each node away from the others to make its label more visible. However, label adjust runs very slowly, and we aren’t interested in examining all 35,000 nodes in the network. In order to speed up the process, we can isolate each community, gray out the rest of the network, eliminate all labels except for the ones we’ve selected, and then run the function only after the graph is simplified.

It’s not perfect, but now we can see the names of the most influential accounts in Module #3, the largest and most central module in the network. We can also inspect the data table of Module #3 for a more comprehensive analysis. In this community of roughly 3,800, the most prominent interconnected nodes seem to be profiles with usernames related directly to the CDC — such as our central node CDCgov, as well as CDC_eHealth, CDCDiabetes, CDCemergency, CDCPCD, CDC_NCBD, CDCChronic, CDCInjury, CDCEnvironment, CDCespanol, and several others. Additionally, the most influential users in this group without a direct relationship to the CDC include those with names pertaining to the FDA (the Food and Drug Administration), the HHS (Department of Health and Human Services), the NIH (National Institutes of Health), and other health-related US government agencies or non-profit organizations. None of this is surprising considering the CDC’s own status as an American health institution. However, there are also hundreds if not thousands of nodes in this module that are not as easily identifiable or even visible. And while it is easy to zoom into the graph to get a closer look, ascertaining the nature of each node in the module would be a tedious and time-consuming task.

Intersecting with Module #3 is the second largest group of Module #23, with just over 3,000 nodes. In order to organize this section we repeat the same steps as last time. While the previous community sat at the center of the graph with a large number of nodes clustered around its core, this module is a bit more scattered. Many of the most prominent nodes in close proximity to the network’s focal point are related to the CDC, just like the nodes in the last cluster. However, instead of addressing specific health issues, these CDC accounts seem to focus more on certain people or regions of the globe: CDCDirector, DrKhabbazCDC, DrMartinCDC, CDCGlobal, CDCGlobalJobs, CDCTravel, CDCMalawi, CDCHaiti, CDCSouthAfrica, CDCKenya, CDCRwanda, CDCNamibia, etc. Also featured at the edges of this area are USAID, StateDept, and TravelGov, giving us further indication that this group’s users are concentrated on national concerns. Based on their names alone, the peripheral nodes of Interior, American_Heart, and DeptVetAffairs also share similar interests to the rest of the group, but their distance from the center implies fewer edges linking them directly to the CDC.

The next cluster labeled Module #16 is smaller and less central than the other two at just below 2,500 nodes, but some of its users still overlap with the communities in the middle. However, this module looks less focused on health-related accounts and more focused on accounts relating to the US government, with the most influential nodes featuring usernames such as USAGov, USAGovEspanol, USDOL, DHSgov, Readygov, Digital_Gov, the FBI, the EPA, fema, femaregion2, and FEMA_fenton. Unexpectedly, the only major node in this module not in close proximity to the main group is nycHealthy, with many of its own individual friends featuring some mention of New York City.

The roughly 1,750-node Module #21 is interesting because it is comprised almost entirely of influential users with “NIOSH” in the name and the friends of those users: NIOSH, nioshbreathe, NIOSH_TWH, NIOSHMining, NIOSHConstruct, NIOSHNoise, NIOSHFACE, NIOSH_NPPTL, NIOSHFishing, NIOSH_MVSafety, NIOSHOilandGas, etc. A quick web search reveals that NIOSH stands for the National Institute for Occupational Safety and Health, a part of the CDC. The NIOSHespanol subgroup is related to this one but more of an outlier at the periphery with fewer links to the center of the module (probably due to language or regional differences between individual users), whereas USDA (United States Department of Agriculture) is at the very edge of the graph, sharing many similar friends with the other influential US government nodes of Module #16.

A second group with about 1,750 nodes is Module #41, comprised of even more influential nodes and users related directly to the CDC — DrNancyM_CDC, CDC_TB, CDC_DASH, cdchep, CDCPIN, CDCSTD, DrDeanCDC, etc. Many other nodes in this group are also health-related, such as InjectionSafety, lapublichealth, and the peripheral subgroup under AmericanCancer.

The final module worth mentioning is Module #15, with around 1,500 nodes, whose most influential nodes once again revolve around the topic of health, particularly HIV, with HIVGov, talkHIV, and HIVinfo_NIH being some of the most prominent next to DrMerminCDC, nationshealth, ANACnurses, and the peripheral subgroup under HamCoHealth.

After that, the next 6 modules begin to decrease in size, density, centrality, and influence. While some of these modules maintain some proximity to each other, their connections are loose and they become less relevant to the rest of the graph with every drop in their total node percentages. By the time we reach the 13th largest module, it only contains 1.96% of the whole network, and any group after that is not worth mentioning.

The only extra modules that generate some intrigue are the two that are so far away from the rest of the network that they stand out in their own right — one subgroup based on Todobebe, a Spanish-language Twitter account about fertility and infant health, and another subgroup based around HomeDepotFound, the Home Depot Foundation.

Distant modules under Todobebe (left) and HomeDepotFound (right)

Limitations

While the project ended up being unexpectedly large in scale, it was also severely limited by it. First, the rate limits of the API prevented a full retrieval of all friends and friends of friends of CDCgov within a reasonable amount of time, which would have resulted in over 650,000 individual users if time constraints were not a concern. On top of that, Tweepy extracts Twitter friends and followers in the order in which they were added, so the list of actual friends we ultimately obtain is arbitrary. As a result of these factors, this analysis was incomplete from the start. However, even 35,000 users is such an enormous number of nodes that it makes the graph nearly indecipherable. There is simply no way to inspect that many Twitter accounts without relying on other data analysis software. We may be able to point out interesting nodes with recognizable usernames and high levels of centrality or influence, and we may be able to identify some of these users’ interests based on the names of other users in their modules, but without extracting more detailed information from their Twitter accounts there is no way to perform a real assessment on them, or what it means for them to be friends with the CDC.

Conclusion

Unfortunately, the only major insights we can gain from this analysis are predictable ones: that the most central users in this network are other branches of the CDC, that the majority of other well-connected users in the CDC’s social media network are related to government agencies and health organizations, and that the few users who aren’t as relevant to these subjects are also less connected to other users throughout the rest of the network. In hindsight, thinking that we can make a judgement on the character of the CDC based on its social media network sounds like a flawed concept to begin with. Without extensive research into each individual user and the ability to quantify their personal characteristics into something measurable, the results of this project are inconclusive.