Graph Network Visualization with pyvis and Keyword Extraction

Using free New York Times data to demonstrate graph network visualizations with the pyvis package in Python

Stephan Hausberg
8 min read · Dec 9, 2022

I really do love visualizations of graph networks and their connection to open data. In this article I want to show the whole path: getting some exciting data from the New York Times API and presenting this data with a graph network visualization.

Usually there are several ways of presenting and storing data. In some cases it can be useful to store data as a graph network rather than in relational or geographical data schemes. In theory, this concept consists of nodes/vertices and edges, where you define which pairs of nodes are connected by an edge (the so-called adjacency). You can also add information such as weights of nodes and edges, labels, colors and whatever else you can think of. On top of this theory sit a lot of famous algorithms and heuristics that are easily explained but not easy to handle: the travelling salesman problem, matching optimization or the Chinese postman problem, just to name a few, mostly NP-hard.

Besides geographical entities and, for example, routing algorithms on top of them, there are several other examples. One can think of connections between people and companies, keywords and products in sentiment analysis in an NLP context, or connections of local trains to stations. Here, I take an approach that visualizes New York Times metadata to show how articles connect to the keywords in those articles. In particular, I will follow my train of thought while developing these insights. I am using the pyvis Python package, see https://pyvis.readthedocs.io/en/latest.

Throughout this article we use the following packages: pyvis, networkx, yake and pynytimes. We start with a brief example of how to get pyvis and networkx up and running.

First steps with pyvis

First of all we should install pyvis and networkx, i.e. !pip install pyvis and !pip install networkx. With the next code block we create a fully connected graph of five nodes and the corresponding ten edges. The snippet combines the two packages: first there is a pyvis Network object g, followed by a complete graph with 5 nodes created by the networkx package. This has the advantage that networkx directly creates the fully connected graph, so you do not have to type in all the nodes and edges yourself. The two objects are connected via the from_nx method. Calling show with a corresponding .html file name saves this first graph and displays it directly if you are working in a Jupyter notebook, for example.
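A minimal sketch of that first block, assuming a Jupyter environment (the file name is chosen for illustration):

```python
import networkx as nx
from pyvis.network import Network

# pyvis Network object; notebook=True renders inline in Jupyter
g = Network(notebook=True)

# complete graph with 5 nodes (and thus 10 edges) from networkx
nx_graph = nx.complete_graph(5)

# hand the networkx graph over to pyvis
g.from_nx(nx_graph)

# save the visualization as HTML and display it
g.show("complete_graph.html")
```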

Image of five nodes all being interconnected — by author

You can drag and drop the nodes, and you will see that there is some sort of physics implemented in this visualization, which makes it fun for the viewer to play around with. There are tons of options to customize this kind of graph; a very interesting example can be found in the documentation mentioned above. The next step is to gather data from the New York Times API so we can focus on real data.

Using NYT API to get data

To gather free data from the New York Times Developer API you have to sign up with an email address and get an API token. Just replace the xxx in the code below, and afterwards you are good to go to receive data. A very good description of the details can be found in the package documentation at https://github.com/michadenheijer/pynytimes. The following key list and topic list were also taken from that documentation, see the code below.
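A sketch of the setup with pynytimes; the topic list below is my reconstruction from the sections the Top Stories endpoint offers, so it may differ from the author's exact list:

```python
from pynytimes import NYTAPI

# replace "xxx" with your personal API key from developer.nytimes.com
nyt = NYTAPI("xxx", parse_dates=True)

# assumed topic list: sections available from the Top Stories endpoint
topic_list = [
    "arts", "automobiles", "books", "business", "fashion", "food",
    "health", "home", "insider", "magazine", "movies", "nyregion",
    "obituaries", "opinion", "politics", "realestate", "science",
    "sports", "sundayreview", "technology", "theater", "travel",
    "upshot", "us", "world",
]
```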

In the following we cycle through the topic list, download the top stories from the NYT and add them to list_raw. The result is a list of dictionaries with information about section, title, abstract, geo_facet and various other fields. For my sample I downloaded this on 12/03/2022.
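A minimal sketch of that download loop, reusing the nyt client and topic_list from above:

```python
# download the current top stories for every topic and collect them
list_raw = []
for topic in topic_list:
    top_stories = nyt.top_stories(section=topic)
    list_raw.extend(top_stories)
```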

Next we create a list of 34 named colors to give every topic category a different color. A closer look at the data reveals a mismatch between the categories we wanted to extract and what actually stands in the field named section. Therefore we build a corrected topic list from the section field, which is then zipped with the corresponding colors. This is a point where the code is not as clean as it could be: the number of topics depends heavily on the data you retrieved, so the lengths of the two lists have to be handled carefully.
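One hedged way to build that corrected mapping; matplotlib's CSS4_COLORS stands in here for the author's list of named colors:

```python
import matplotlib.colors as mcolors

# corrected topic list: the sections that actually occur in the data
topic_list_corrected = sorted({article["section"] for article in list_raw})

# zip each section with a named color; the slice keeps the lengths equal
named_colors = list(mcolors.CSS4_COLORS)[: len(topic_list_corrected)]
color_map = dict(zip(topic_list_corrected, named_colors))
```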

Keyword extraction

At this stage we have gathered data from the NYT API: articles with topic, title and abstract. In sum, a whole lot of data, but rather unstructured, pure text. So, to truly capture what is in these articles, we want to extract keywords from the title and abstract features. A suitable article can be found at https://towardsdatascience.com/keyword-extraction-process-in-python-with-natural-language-processing-nlp-d769a9069d5c. In our case I decided to call functions from the yake project, an n-gram-based keyword extraction method. One can play around with the parameters below to find a suitable extraction; I admit that I simply copied the parameters from the tutorial above to keep it simple. In the first step we set up an extractor object.
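The extractor setup might look like this; the parameter values are assumptions in line with common yake tutorials:

```python
import yake

# assumed parameters: English, up to 3-gram keywords, ten keywords per text
kw_extractor = yake.KeywordExtractor(
    lan="en",
    n=3,           # maximum n-gram size
    dedupLim=0.9,  # deduplication threshold
    top=10,        # number of keywords to return
    features=None,
)
```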

In the next code block we cycle through the list of raw data and extract the keywords for every article. Along the way we save just the key features for the next step in a dictionary called dict_slim.
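A sketch of that loop; the exact fields kept in dict_slim are an assumption based on what the later steps need:

```python
# extract keywords per article and keep a slim record for the graph steps
dict_slim = {}
for i, article in enumerate(list_raw):
    text = f"{article['title']} {article['abstract']}"
    keywords = kw_extractor.extract_keywords(text)  # list of (keyword, score)
    dict_slim[i] = {
        "title": article["title"],
        "section": article["section"],
        "keywords": [kw for kw, score in keywords],
    }
```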

First approach — article nodes — keyword edges

As a first approach we set up a network net and take every article as a node. There are two options for adding nodes to the network: either pass lists to add_nodes, or cycle through the dictionary and call add_node for each entry. We use the latter here and specify title, color and group as given in the dictionary. In my case I cycled through 833 articles, so we have 833 nodes.
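A minimal sketch of that loop, reusing dict_slim and color_map from above:

```python
from pyvis.network import Network

net = Network(notebook=True)

# one node per article, labeled by title and colored/grouped by section
for key, article in dict_slim.items():
    net.add_node(
        key,
        title=article["title"],
        color=color_map[article["section"]],
        group=article["section"],
    )
```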

In addition, we cycle through the articles twice, i.e. a nested loop over pairs of articles. As a loop-internal variable we define weight as the size of the set of keywords the two articles have in common. If the articles share at least one keyword and the second key satisfies the condition of being strictly greater than the first, this weight is added to the dictionary.
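A sketch of that nested loop under the assumptions above:

```python
# weight = number of keywords two articles share
dict_edges_weights = {}
for i in dict_slim:
    for j in dict_slim:
        # only look at each pair once (j > i), which also skips self-loops
        if j > i:
            common = set(dict_slim[i]["keywords"]) & set(dict_slim[j]["keywords"])
            if common:
                dict_edges_weights[(i, j)] = len(common)
```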

By incorporating this strict condition we only look at each connection once and also leave all self-loops aside. For instance, if every node were connected to every other node, there would be more than 340K connections (833 * 832 / 2). Here we only find about 25K connections, which is still a lot and hints at where one could be stricter when generating keywords for these articles.

After setting up the dictionary dict_edges_weights, we add these edges to the graph only if the weight is strictly greater than zero. Next we save this graph to a corresponding file. Opening this file in the browser takes a whole lot of time, and the result is not very informative either, see the picture below. But it does look like computer art.
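The corresponding sketch, with an illustrative file name:

```python
# add an edge for every pair with a strictly positive weight
for (i, j), weight in dict_edges_weights.items():
    if weight > 0:
        net.add_edge(i, j, value=weight)

# save and display the graph
net.show("articles_network.html")
```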

Creative visualization, but not very informative for the viewer — pic by author

Second approach — keyword nodes — article edges

As a second approach we tweak the structure a little to gain more insight into which keywords are in here and how they connect to other keywords. I start with a local network where every node is a keyword of the first article, saved in a sorted list called node_list. Afterwards I loop through this list again and connect every node with every other, since the shared article is the connection. This gives a fully connected graph with n*(n-1)/2 edges for n nodes. A picture can be found below.
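A sketch of this local network, reusing dict_slim from above:

```python
from itertools import combinations

from pyvis.network import Network

local_net = Network(notebook=True)

# nodes: the keywords of the first article, kept in a sorted list
node_list = sorted(dict_slim[0]["keywords"])
for kw in node_list:
    local_net.add_node(kw)

# connect every pair of keywords: the shared article is the connection,
# yielding n*(n-1)/2 edges for n nodes
for a, b in combinations(node_list, 2):
    local_net.add_edge(a, b)

local_net.show("first_article.html")
```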

Fully connected graph — pic by author

In the picture above you can easily see what this article might be about: a review of a movie and a reference to Steven Spielberg's E.T.

First of all, connecting all nodes does not seem to make the visualization any clearer. But let us move on and add another article. Now the aim is to find a node/keyword that appears in both articles. We tweak the code again a little.

After setting up the network and dictionary elements, we loop through the first two articles to collect all possible nodes and also record the identifiers of the articles in which each node appeared. This helps us identify the nodes where two or more articles meet.
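One way this collection step might look:

```python
# collect the keyword nodes of the first two articles and record which
# articles each keyword appeared in
dict_nodes = {}
for idx in [0, 1]:
    for kw in dict_slim[idx]["keywords"]:
        dict_nodes.setdefault(kw, []).append(idx)

# keywords listed under both articles are where the two graphs meet
shared_nodes = [kw for kw, ids in dict_nodes.items() if len(ids) > 1]
```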

The second and third code blocks are almost the same as the structure above.

Two fully connected graphs being interconnected by one node ‘review’ — pic by author

The above image again shows two fully connected graphs representing two articles. The difference to the previous approach is that they are connected by the node/keyword review, which can be found in both articles.

This invites us to go on and play a little with the structure. Is it necessary to show all edges all the time? What kind of coloring is useful? How many articles should we process at once to keep the loading time manageable in the end?

After initiating the network I shrink the number of articles to 100 at first. The fun part is now to reduce the nodes adequately. Here, the dictionary reduce_dict was set up to keep only nodes where more than one edge ends. This leads to graphs where only those nodes are displayed which can be found in at least two articles. This is, for sure, another parameter to play around with; it can separate keywords that occur often from keywords which are not so relevant.
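A sketch of that reduction, assuming the dict_nodes bookkeeping from the two-article example is extended to the first 100 articles:

```python
# restrict to the first 100 articles and record per keyword where it occurs
dict_nodes = {}
for idx in list(dict_slim)[:100]:
    for kw in dict_slim[idx]["keywords"]:
        dict_nodes.setdefault(kw, []).append(idx)

# reduce_dict keeps only keywords that appear in at least two articles,
# i.e. nodes where more than one edge ends
reduce_dict = {kw: ids for kw, ids in dict_nodes.items() if len(ids) > 1}
```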

After reducing the number of nodes we add them to the network, color them by topic, then connect them via appropriate edges. One important aspect here is how to use the physics feature. So far we simply used the defaults. That visualization has the disadvantage that in large graphs the nodes stay crowded in a dense area while repelling each other all the time. This consumes computing power in your browser and does not look smart at all. Here I found a physics engine that keeps the nodes quite stable and clusters them directly without further configuration, which makes it quite handy.
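The author does not name the engine; forceAtlas2Based is one pyvis option that behaves roughly as described, so the following sketch uses it as an assumption:

```python
from itertools import combinations

from pyvis.network import Network

net = Network(notebook=True)

# add the reduced keyword nodes, colored by the section of one of their articles
for kw, ids in reduce_dict.items():
    section = dict_slim[ids[0]]["section"]
    net.add_node(kw, color=color_map[section], group=section)

# connect two keywords whenever they share at least one article
for a, b in combinations(reduce_dict, 2):
    if set(reduce_dict[a]) & set(reduce_dict[b]):
        net.add_edge(a, b)

# switch away from the default physics; forceAtlas2Based settles large
# graphs into fairly stable clusters
net.force_atlas_2based()
net.show("keywords_network.html")
```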

Keywords are connected by articles — pic by author
Details from the graph above, keywords 'company' and 'work' separate a bubble from the rest — pic by author

The pictures above show the results for 100 articles, with the physics engine clustering them in a reasonably adequate way. I have zoomed in to show that the keywords company and work somehow separate two bubbles. In the bubble on the left there are words like negotiation, pay, benefits and employees, which sounds reasonable to me at first glance.

Wrap up and take-away

All in all, this short article shows how to create, manipulate and deploy graph visualizations with pyvis, using New York Times API data and YAKE keyword extraction as an example.

Happy to discuss the findings and other approaches to consider. Find me on LinkedIn to connect: https://www.linkedin.com/in/dr-stephan-hausberg-679750118
