Climate 2017 Topic Networkk

GSOC 2018 : Network Visualization Of MediaCloud Topic Network

Tahsin Mayeesha
Learning Machine Learning

--

Google Summer Of Code’s second coding period is going on right now and I’m still working on my project on network visualization. For more context about the project please refer to the first blog post here :

Successes

The biggest challenge for me up to the first evaluation was to get the network layout to work and work on combining all the scattered code written before for submitting in the first evaluation. Also, the code must work for varying topic networks of different sizes. Again, in this network media sources are nodes and the edge between them indicates references to each other.

To be honest, the problem looks easy but hard to do because the graph sizes and the graph structure can vary a lot. For example , here’s the graph for ebola network, one of the results of my experiment. The node size depends on “inlink_count”. A media link is when one media source links any number of times to another media source. For example, if there was only one Chicago Tribune story linked to and it was linked to by 3 New York times stories and 2 Boston Globe stories, then the story links count would be 2 (because only two unqiue media sources linked to the Chicago Tribune story).

The colormap is based on partisan_retweet, a metric indicating the political preferences of the reader base of the media source on that topic. The colormap is roughly like this : deep blue for left, light blue for center-left, green for center, light pink for center right, dark red for right and gray for the missing values.

We can see there’s many visibly large nodes aka media sources which covered this topic extensively. We can also see nearly all the media sources with high ebola coverage has been center-left and left and the conservative sources haven’t exactly covered this topic. However in a topic like “deep state” we have better political clustering.

Since the network structures are varying a lot along with the visual aspects, like often some media network has stray nodes, coming up with correct visualization with correct layout gets really hard.

The layout configuration we’ve now overall works according to my mentor Hal Roberts, but we’re still working on making the visualization work by scaling(resizing) the graph up and down to a proper scale at this moment. After this issue is done I can work on making the network interactive.

To be honest, making interactive visualizations is likely to take less hassle than coming up with a mathematically good layout since I’ve already studied bokeh and holoviews library stuff.

The code for creating the visualizations is the following script. The dependencies can be found in the github repository : https://github.com/Tahsin-Mayeesha/Mediaviz .

However, I’ve used a wrapper function around the code and another script for creating multiple network visualizations in one go. I’ll reflect more on the challenges faced and how I handled it in the next section.

Challenges

I’ll elaborate on how a network visualization is “drawn” in a graphical library first to make sense of the later part. The main container for all graphical objects in a matplotlib graph is a figure. Other visualization libraries also follow general same philosophy, except with different names. On top of the figure we attach different graphical objects.

For example for drawing a bar graph we can add a bar object on the figure. We also add axis object on the figure with appropriate x and y limits. A visualization can be 3D too but we’re limiting our discussion of the visualization on 2D is important here since we’re not drawing 3D visualization and there’s no z axis for the nodes. The total visualization is like a bunch of sheets stacked on top of each other, each indicating some part of the final visualization.

A layout algorithm like force atlas 2 or fruchterman reingold outputs graph positions (x,y) values for each node. The key idea for a force based layout is that we assign forces between the nodes pairwise which attracts and repels each other and use physics based simulations to reveal the correct graph structure.

For drawing the nodes, we mean that we draw circles or other shapes in the figure in those position values for each node. For drawing the edges we basically draw arrows from one edge to another. For drawing the labels we generally add text annotations on top of the figure.

Since I’m creating the visualizations with networkx’s matplotlib visualization tools mostly, I initially used all the networkx functions like draw_networkx_nodes, draw_networkx_edges, draw_networkx_labels and thought things are turning out well. However, when I tried to manage node overlap and set up a correct scale I found out the node sizes I’m setting are not directly corresponding to the radius of the circles.

The given code in draw_networkx_nodes in networkx actually usage the matplotlib’s scatter plot underneath. Scatterplot is used for drawing multiple geometric objects (patches) on top of a figure of varying size and colors(which is exactly how a network visualization is like with varying size nodes and colors), except the issue is that the sizes are in different units, so I can’t directly extract the radius of the nodes.

To check if there’s a node overlap between two nodes, I’m using the mathematical formula of searching if the center to center distance between the nodes is larger than the radius. If I can’t find the radius, there’s no way to find the node overlap. Sure, I can check border to border distance, but for that I’ll have to try all the distances of the points of two nodes pairwise which is probably going to be an issue. To check more on this issue see here : https://stackoverflow.com/questions/14827650/pyplot-scatter-plot-marker-size

In the end I decided that instead of drawing the nodes with scatter plot I’ll draw the nodes with matplotlib patches.

The patches are in matplotlib’s Artist API and basic building blocks for making visualizations. Patches refer to geometric shapes that we attach on the figure as I mentioned before. We use Circles to draw the nodes, which is all basically preparation for actually setting the scales correctly. All custom util functions like setting node size, node color, filtering subgraph, getting positions for the subgraph, setting labels, rotation layout and this custom function for drawing is in the utils script that I heavily use in the visualization script.

So far those things have gone well. However I had many setbacks too that I’ll now elaborate on. I’ve basically kept the code quite dense and modular for my own comfort. I’ve also taken some design decisions like keeping the visualization and the layout part separated because that way it’s easy to configure them as needed.

Setbacks

My mentor and I missed a weekly meeting before the first evaluation because of time mismatch. After that I misunderstood how to set the scaling algorithm and wrote a wrong algorithm. In the wrong algorithm I thought I have to optimize for the scaling ratio parameter in the force atlas 2 algorithm which stands for the attraction repulsion ratio between the nodes.

Key issue here is that I’ve to ensure that the largest nodes are as close to each other as possible without overlapping with each other. For that I assumed I’d be using binary search to track which scaling_ratio value will work best with both conditions, the largest nodes must be within a certain distance from each other and the largest nodes must not be overlapping either.

However, here the scaling meant basic geometric scale referring to how I should size the graph. In this case the node sizes stay constant and we extract the position layout once with the layout algorithm and after that we make the graph expand and contract with different scale values to find a good balance.

Now I’m working on modifying the scaling code to work as mentioned above. Here’s the implementation for the first approach that failed . Most of the code still works though, like the code for searching if there’s any overlapping code between the largest k nodes or the code for finding the distances will be the same, but the extraction of the correct scale will be different.

Also it turns out that even if I vary the scaling_ratio(ratio between attraction and repulsion forces in the layout algorithm), there’s not much impact on the actual distance between the largest nodes. On top of that, if I set the repulsion force too high aka scaling_ratio is set to a large number, the nodes can get very far from each other quickly. All in all, now I understand the importance of communicating more about the approaches.

The fa2l library I’m using can be quite expensive in terms of run time, so I’m experimenting with different layout libraries and fixing the scale issue at this moment. Now I’ve to try my best to pass the second evaluation .

--

--

Tahsin Mayeesha
Learning Machine Learning

Deep Learning Engineer. New grad, CSE.GSOC 19 participant@Tensorflow. Previous GSOC 18 @ Berkman Klein Center of Internet and Society. Kaggler,fast. ai internat