Explore network relationships between 63,000 COVID-19 research articles with NLP, Dash, and Dash Cytoscape

Visualizing and navigating COVID-19 related academic papers and their relationships with interactive, dynamic, real-time generated network diagrams.

Published in Plotly, May 18, 2020 (7 min read)

Visualising COVID-19 related papers with Dash-Cytoscape: check out the app and its source code!

A staggering amount of text data is generated every day. Half a million new tweets are posted every minute; millions of blogs are written every day; and millions of new patents and scientific papers are published every year.

More data points also mean far more potential relationships between those points, since the number of possible pairs grows roughly with the square of the number of points. Much as no person is an island, a tweet, a text message, a patent or an academic paper is connected to others by followers, recipients, subject matter, citations or references.

Comprehensively extracting such information has long been arduous and extremely time-consuming, not to mention costly. Whether it is a researcher reviewing publications in their field or a team of lawyers poring over every email, notebook and Post-It note, such a process often takes months, if not years.

Spending that kind of time and resources is inadvisable, if not impractical, for most. Luckily, modern computational analysis tools such as natural language processing (NLP) can help by greatly speeding up text analysis and reducing error rates.

But these tools cannot (yet) replace all of the domain expertise, contextual understanding, and integrative thinking that reside in human analysts. So, NLP tools or outputs are best deployed as complementary tools for domain experts.

Visualizations can be crucial to such integrative deployments. A good visualization can help its users to explore, manipulate and understand the dataset, as well as the outputs from NLP analysis.

In this article, we show you Dash Cytoscape, which lets you visualise and explore datasets and relationships within them using Plotly’s Dash. We will be using a demo app that leverages Dash Cytoscape to visualise thousands of academic papers, grouped by topics generated using Latent Dirichlet Allocation (LDA) techniques and connected by citations.

The dataset that we have used is a subset of the CORD-19 dataset: a result of multiple research groups’ collaborative response to the COVID-19 pandemic. While the analysis was carried out for demonstration purposes only, we hope that you will find our visualisations useful for your own work.

Check out the app and its source code on GitHub, then read on to find out how it was made!

LDA analysis

We applied a topic modelling technique called Latent Dirichlet Allocation, or LDA, analysis to our dataset of around 4700 academic papers. LDA analysis is an unsupervised machine learning technique that extracts ‘latent’ topics from the dataset of texts, as well as describing each text as a mixture of these topics.
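
As a rough illustration of the idea, the sketch below fits a two-topic LDA model to four toy sentences using scikit-learn. The library choice, the toy corpus and the parameter values are all assumptions made for this example; the article does not state which LDA implementation the app uses.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (illustrative only, not taken from CORD-19)
docs = [
    "virus infection spreads through respiratory droplets",
    "respiratory virus infection symptoms include fever",
    "vaccine trial shows strong antibody response",
    "antibody response measured after vaccine dose",
]

# LDA works on word counts, so build a document-term matrix first
counts = CountVectorizer(stop_words="english").fit_transform(docs)

# Fit LDA with an assumed number of topics; each row of doc_topics
# is one document's mixture over those topics
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)

print(doc_topics.shape)        # one row per document, one column per topic
print(doc_topics.sum(axis=1))  # each row sums to (approximately) 1
```

In a real analysis the number of topics is a modelling choice that usually needs some experimentation.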

This technique allows each document to be classified, for example by its ‘primary’ topic, and even to be represented spatially in two dimensions using a dimensionality reduction technique (e.g. UMAP or t-SNE), as in the scatter plot below.

An example of topic modelling visualisation with t-SNE
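
Classifying a paper by its ‘primary’ topic comes down to picking the topic with the largest weight in that paper's mixture. A minimal sketch, where the helper name `primary_topic` and the mixture values are made up for illustration:

```python
def primary_topic(topic_mix):
    """Return the index of the largest weight in a topic mixture."""
    return max(range(len(topic_mix)), key=lambda k: topic_mix[k])

# Hypothetical topic mixtures for three papers over three topics
mixtures = [
    [0.70, 0.20, 0.10],
    [0.05, 0.15, 0.80],
    [0.30, 0.45, 0.25],
]
primary = [primary_topic(m) for m in mixtures]
print(primary)  # [0, 2, 1]
```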

This graph already provides some information, specifically the size of each topic group and the groups’ proximity to one another, although such plots are easy to misread. One could even vary node sizes to indicate which nodes might be more significant than others. Still, it tells us very little about relationships between individual data points (nodes), or indeed, groups of nodes. For that, we need a network diagram, which we can create with Dash Cytoscape.

Dash Cytoscape

Dash Cytoscape is built to help visualize and explore relationships.

What kind of relationships can be represented as network diagrams? The answer is any of them. The diagrams might represent a citation network as we’ve visualized here, but they could just as easily represent a network of computers or a social network, online or in real life.

The creative example below was built with the underlying JavaScript library, Cytoscape.js, and represents relationships between wines and cheeses as a network diagram.

Additionally, Dash Cytoscape includes a number of options to represent node connections (edges) and layouts, as shown in the example below. Here, the same set of nodes is moved around based on a number of pre-set layout options to help the user explore relationships in the dataset.

Some of the things you can do with Dash-Cytoscape

The concentric layout, for example, makes it easier to visualise the significance of one node in the overall context, whereas the cose layout produces a force-directed layout determined by the physics model.
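
Switching between layouts amounts to changing the `name` entry of the dictionary passed to the component's `layout` property. The sketch below wraps that in a small helper; `make_layout` is a hypothetical name, and the set lists the layout names that ship with Dash Cytoscape without loading extra layouts.

```python
# Layout names available in Dash Cytoscape without loading extra layouts
BUILT_IN_LAYOUTS = {
    "preset", "random", "grid", "circle",
    "concentric", "breadthfirst", "cose",
}

def make_layout(name, **options):
    """Build a dict suitable for the `layout` property of cyto.Cytoscape."""
    if name not in BUILT_IN_LAYOUTS:
        raise ValueError(f"unknown layout: {name}")
    return {"name": name, **options}

concentric = make_layout("concentric")
force_directed = make_layout("cose", animate=False)
print(force_directed)  # {'name': 'cose', 'animate': False}
```

Returning one of these dictionaries from a callback that targets the `layout` property is one way to let users rearrange the same nodes on demand.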

In the CORD-19 web app, each node (i.e. paper) is placed based on its topic composition as determined by the LDA analysis. Edges represent citation relationships and are colored according to the primary topic group of the cited paper, and node sizes are calculated so that more frequently cited papers appear larger.

The basic syntax to add a Dash Cytoscape output is simple; see the example below:

import dash_cytoscape as cyto
...
    cyto.Cytoscape(
        id='core_19_cytoscape',
        layout={'name': 'preset'},
        elements=elm_list,
        stylesheet=def_stylesheet,
    )
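
The `stylesheet` argument takes a list of selector/style dictionaries. The sketch below shows what a stylesheet like `def_stylesheet` might contain; the actual contents are not shown in this article, so treat the values as assumptions. It maps each node's `node_size` data field to its drawn size, and reads edge colors from a hypothetical per-edge `color` field.

```python
# Illustrative stylesheet; the app's real def_stylesheet may differ
def_stylesheet = [
    {
        "selector": "node",
        "style": {
            "width": "data(node_size)",   # size taken from each node's data
            "height": "data(node_size)",
            "label": "data(label)",
        },
    },
    {
        "selector": "edge",
        "style": {
            "line-color": "data(color)",  # hypothetical per-edge data field
            "width": 1,
        },
    },
]
```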

The element list is made up of nodes and edges. Here is an example of a node list:

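import numpy as np
...
# One element per paper; in_df and tsne_to_cyto are defined elsewhere in the app
node_list = [
    {
        'data': {
            'id': str(i),
            'label': str(i),
            # Square-root scaling keeps heavily cited papers from dwarfing the rest
            'node_size': int(np.sqrt(1 + row['n_cites']) * 10),
        },
        # Coordinates used by the 'preset' layout
        'position': {'x': tsne_to_cyto(row['x']), 'y': tsne_to_cyto(row['y'])},
    } for i, row in in_df.iterrows()]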
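
Edges follow the same pattern, with `source` and `target` fields in `data` that refer to node ids. A minimal sketch, assuming the citations are available as (citing, cited) pairs and that each paper's primary topic is known; `build_edges`, `TOPIC_COLORS` and the `color` field are hypothetical names introduced for this example:

```python
TOPIC_COLORS = ["#1f77b4", "#ff7f0e", "#2ca02c"]  # assumed palette, one per topic

def build_edges(citations, primary_topics):
    """citations: (citing_id, cited_id) pairs; primary_topics: id -> topic index."""
    return [
        {
            "data": {
                "id": f"{src}-{dst}",
                "source": str(src),
                "target": str(dst),
                # Color each edge by the primary topic of the cited paper
                "color": TOPIC_COLORS[primary_topics[dst]],
            }
        }
        for src, dst in citations
    ]

edges = build_edges([(0, 1), (2, 1)], {0: 0, 1: 2, 2: 1})
print(edges[0]["data"])
# {'id': '0-1', 'source': '0', 'target': '1', 'color': '#2ca02c'}
```

The full element list passed to the component would then be the node list followed by the edge list.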

When rendered, the resulting network diagram may look like this:

An example Dash-Cytoscape network diagram

Using Dash Cytoscape, a complex network graph can be generated in a few lines of code.

There’s still work to be done, however. It is quite a busy diagram!

So, how should we begin to reduce the represented data? This is where the interactivity of Dash begins to shine. A simple, intuitive, responsive set of filters such as those shown on the right can be easily implemented in Dash with inputs and callbacks.

Filtering Cytoscape Maps (check out the app here and its source code on GitHub)

In the animation above, the network diagram has already been filtered by particular journals, where publications from outside of these journals are not shown. Additionally, you can see the pulldown menu being used live to further filter nodes by minimum citations and to show/hide the edges indicating citation connections.
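
The filtering logic behind such controls can be plain Python that runs inside a callback. The sketch below assumes each node's `data` carries an `n_cites` count (a hypothetical field for this example) and drops any edge whose endpoints were filtered out, since Cytoscape cannot draw an edge to a node that is not in the element list.

```python
def filter_elements(nodes, edges, min_cites=0, show_edges=True):
    """Keep nodes with at least `min_cites` citations and, optionally,
    the edges whose two endpoints both survived the filter."""
    kept_nodes = [n for n in nodes if n["data"]["n_cites"] >= min_cites]
    if not show_edges:
        return kept_nodes
    kept_ids = {n["data"]["id"] for n in kept_nodes}
    kept_edges = [
        e for e in edges
        if e["data"]["source"] in kept_ids and e["data"]["target"] in kept_ids
    ]
    return kept_nodes + kept_edges

nodes = [
    {"data": {"id": "0", "n_cites": 5}},
    {"data": {"id": "1", "n_cites": 0}},
    {"data": {"id": "2", "n_cites": 3}},
]
edges = [
    {"data": {"source": "0", "target": "1"}},
    {"data": {"source": "0", "target": "2"}},
]
print(len(filter_elements(nodes, edges, min_cites=1)))  # 3: two nodes, one edge
```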

Node (or indeed, edge) selections are also sources of further interactivity, in this case revealing paper details. Or, as you may have noticed in our animation toward the beginning of the article, nodes can be configured to perform other actions, such as to change color, shape, or highlight connected lines.

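The syntax for a selection-based callback is simple:

from dash.dependencies import Input, Output
...
# Runs whenever the set of selected nodes changes
@app.callback(Output('node-data', 'children'),
              [Input('core_19_cytoscape', 'selectedNodeData')])
def display_nodedata(datalist):
    ...
    return contents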
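
The `selectedNodeData` property delivers a list with the `data` dictionary of each selected node, and it is empty or `None` when nothing is selected. A sketch of how a callback body might turn that into display text follows; `format_selected` and the `title` field are hypothetical, and the app's real callback may build richer content.

```python
def format_selected(datalist):
    """Turn the `selectedNodeData` list into display strings."""
    if not datalist:  # None before any selection, [] after deselecting
        return ["Select a node to see paper details."]
    return [f"Paper {d['id']}: {d.get('title', 'untitled')}" for d in datalist]

print(format_selected(None))
# ['Select a node to see paper details.']
print(format_selected([{"id": "7", "title": "An example paper"}]))
# ['Paper 7: An example paper']
```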

Interactivity is customisable at a granular level as well. For instance, certain nodes can be filtered to change their property from selectable: True to selectable: False to prevent selection.
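
In the element dictionaries, `selectable` sits at the top level, alongside `data` and `position`. The sketch below marks nodes under a citation threshold as not selectable; `lock_selection` and the `n_cites` field are hypothetical names, and the function is meant to be applied to node elements only.

```python
def lock_selection(node_elements, min_cites):
    """Return copies of the node elements in which nodes with fewer
    than `min_cites` citations cannot be selected."""
    return [
        {**el, "selectable": el["data"].get("n_cites", 0) >= min_cites}
        for el in node_elements
    ]

nodes = [
    {"data": {"id": "0", "n_cites": 12}},
    {"data": {"id": "1", "n_cites": 1}},
]
locked = lock_selection(nodes, min_cites=5)
print([el["selectable"] for el in locked])  # [True, False]
```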

That brings us full circle to remind you that with Dash, these changes are made at the back end with Python, which significantly increases a diagram’s power and flexibility. In this example, the network diagrams are in fact being dynamically generated on-the-fly as the user submits various parameters, rather than being pulled from a set of pre-calculated figures.

For applications such as machine learning, the user is able to seamlessly integrate visualisation as a part of the investigative and iterative process, rather than having it be decoupled from the analysis.

If you would like to test the effects of a parameter, you can simply use Dash as the front end to regenerate the model on demand and investigate the results in one step. There’s no need to go back and re-run the model separately from the front end.

You may have also noticed a slider on the right side of the animation above. It is configured to modify the t-SNE parameter ‘perplexity’. In Dash, this is achieved by setting up a slider element and listening for a change to it in a callback function, before passing it on as a parameter to perform updated calculations.

dbc.FormGroup([
        dcc.Slider(id='tsne_perp', ...)]),
...
@app.callback(
    Output('core_19_cytoscape', 'elements'),
    [..., Input('tsne_perp', 'value')]
)
def filter_nodes(..., tsne_perp):
    node_list = update_node_data(..., tsne_perp)
    return node_list

In other words, any calculation that can be performed in Python can be performed with Dash as its interface.

More complex functionality follows the same pattern as the simple example above. Imagine the function filter_nodes() running a regression analysis, a prediction, or a text summarisation: the user would be able to dynamically generate a new output on demand.

Together, a Dash app with Dash Cytoscape can represent complex network relationships in an interactive, responsive, and customisable manner for any subject matter where relationships between entities are important.

We are excited to see what you build with these tools and look forward to seeing the amazing creations from our community of incredible, creative Dash users. If you would like to learn more about Dash and its capabilities, check out our weekly live demo!

Written by JP Hwang