GSOC 2018 : Visualizing Media Data With Network Analysis

Tahsin Mayeesha
Learning Machine Learning
May 17, 2018

Google Summer of Code’s community bonding period has just ended and the coding period has started. I thought I should describe my project with the Berkman Klein Center for Internet & Society at Harvard University in more detail.

My project is about creating research-quality network visualizations for Media Cloud, an open source platform for media analysis. Media Cloud tracks thousands of media sources; we can query a topic to see how different media sources cover it. We can also use the Topic Mapper tool to investigate a topic in depth and find the most influential media sources covering it.

Network analysis of media sources lets us identify influential outlets, see how outlets with different partisan leanings share or cover a topic, and understand our news landscape better. For example, here is a visualization made for the research paper “Partisanship, Propaganda, and Disinformation: Online Media and the 2016 U.S. Presidential Election”.

My mentor, Hal Roberts, works at the intersection of quantitative media analysis, online privacy, and the networked public sphere. The current network visualizations are built with Gephi, a drag-and-drop interface for creating network visualizations and performing network analysis. The .gexf graph files are loaded into Gephi to make the visualizations.

The trouble is that Media Cloud has a very large number of topics, and it tracks roughly 10,000+ media sources. For many topics we can output a network graph where the nodes are media sources and the edges indicate whether one source linked to another in some piece of media content (a blog post, an article, etc.). Automating the visualization process with code lets researchers quickly understand how a particular topic is being covered by the current media.
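To make this concrete, here is a minimal sketch of such a topic graph in networkx. The domain names are made-up placeholders, and in practice the graph would be loaded from a downloaded file with `nx.read_gexf(...)` rather than built by hand:

```python
import networkx as nx

# Tiny stand-in for a Media Cloud topic graph. In practice the graph
# would come from a downloaded .gexf file via nx.read_gexf("topic.gexf").
G = nx.DiGraph()

# Nodes are media sources; a directed edge A -> B means source A linked
# to source B in some piece of media content (blog post, article, etc.).
G.add_edge("nytimes.com", "washingtonpost.com")
G.add_edge("breitbart.com", "foxnews.com")
G.add_edge("nytimes.com", "foxnews.com")

print(G.number_of_nodes(), G.number_of_edges())
```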

Graph visualization is a hard problem: the number of edges tends to grow very fast with the number of nodes, and positioning the nodes so that the substructures within a graph become visible is particularly difficult. Gephi provides many physics-based force-directed layouts such as ForceAtlas, ForceAtlas2, Fruchterman-Reingold, and Yifan Hu, along with simpler transformations like rotation, expansion, contraction, and randomization. The layouts position the nodes; after that we can adjust node sizes, node labels, edge opacity, edge width, and other parameters to get a comprehensible visualization that does not look like a ball of yarn.
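Of these, networkx does ship one force-directed layout: `spring_layout` implements the Fruchterman-Reingold algorithm. A small sketch of what a layout actually produces, using a built-in benchmark graph as a stand-in for a media graph (the parameter values here are illustrative, not tuned):

```python
import networkx as nx

# A small, well-known benchmark graph standing in for a media source graph.
G = nx.karate_club_graph()

# spring_layout is networkx's Fruchterman-Reingold implementation.
# k nudges the ideal distance between nodes; seed makes runs reproducible.
pos = nx.spring_layout(G, k=0.5, iterations=100, seed=42)

# A layout is just a dict mapping each node to an (x, y) coordinate;
# everything visual (size, color, labels) is decided separately.
print(len(pos), pos[0])
```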

I’m working with networkx to automate the process in Python. The work divides into two parts: first I have to write code for creating the visualizations, and then I have to integrate it with the Media Cloud website so that researchers get the visualizations when they download the graph files for media coverage.

Unfortunately, networkx does not currently provide layouts like ForceAtlas2 off the shelf, so I’ll be using some other libraries. networkx’s default drawing settings are also quite poor: good enough for analysis, but bad for visualization. To demonstrate, here are some visualizations from my own experiments.

If we simply use the default settings, without adjusting node colors, node sizes, label colors, or label sizes, this is what we get from drawing a media source graph as-is.
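The "ball of yarn" above takes one line to produce. A sketch with a random graph standing in for the real media graph (the graph, size, and file name are all illustrative):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt
import networkx as nx

# A dense random graph standing in for a real media source graph.
G = nx.gnm_random_graph(300, 1200, seed=1)

# One call, all defaults: uniform node size and color, no tuned labels,
# a freshly computed spring layout -- the classic hairball.
nx.draw(G)
plt.savefig("default_draw.png", dpi=150)
plt.close()
```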

After extracting the subgraph of the top 100 nodes by degree centrality and customizing the node colors, edge color, and label size, legibility improves considerably.
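A hedged sketch of that pipeline (the graph is a synthetic stand-in, and the colors, sizes, and scaling factor are illustrative choices, not the project's actual settings):

```python
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import networkx as nx

# Synthetic scale-free graph standing in for the full media graph.
G = nx.barabasi_albert_graph(500, 3, seed=7)

# Keep only the 100 most central nodes by degree centrality.
centrality = nx.degree_centrality(G)
top = sorted(centrality, key=centrality.get, reverse=True)[:100]
H = G.subgraph(top)

pos = nx.spring_layout(H, seed=7)
# Scale node size by centrality so influential sources stand out.
sizes = [3000 * centrality[n] for n in H]
nx.draw_networkx_nodes(H, pos, node_size=sizes, node_color="steelblue", alpha=0.8)
nx.draw_networkx_edges(H, pos, edge_color="lightgray", alpha=0.5)
nx.draw_networkx_labels(H, pos, font_size=6)
plt.axis("off")
plt.savefig("top100.png", dpi=150, bbox_inches="tight")
plt.close()
```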

However, even to reach that point we have to write a lot of code for things we take for granted in tools like Gephi: specifying the label text, node color, and node size for each node, and experimenting with different edge colors and edge widths. To reach something publishable I’ll have to implement algorithms that avoid node overlap and label overlap, and add features such as scaling label size in proportion to some attribute, rotation, and so on. During the community bonding period I ran some good experiments that produced the visualization above and identified the features needed for research-worthy visualizations.
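One of those features, scaling label size by an attribute, can be sketched simply. The linear mapping and the font-size range below are one hypothetical choice, not the project's implementation:

```python
import networkx as nx

G = nx.karate_club_graph()
centrality = nx.degree_centrality(G)

# Map centrality linearly onto a font-size range so that labels of
# highly connected nodes render larger than those of peripheral ones.
lo, hi = min(centrality.values()), max(centrality.values())

def font_size(node, smallest=4, largest=14):
    scale = (centrality[node] - lo) / (hi - lo)
    return smallest + scale * (largest - smallest)

# Hub labels come out near the top of the range, leaf labels near the bottom.
print(font_size(33), font_size(11))
```

The same idea works for any node attribute, such as inlink counts or story counts in a media graph, by swapping the dictionary that drives the scaling.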

Since the coding period has just begun, I will be working on the graph visualization algorithms from now on. The code for creating the above visualizations is here.

To be honest, I think I was right to start coding before the coding period began. Otherwise I would only be getting started on the problem now, instead of having a solid understanding of where I need to go. I’ve been reading up on different layout algorithms and general graph visualization concepts, so I’ll be working on fixing the issues mentioned above from now on.
