Bibliographic Network Visualization for Academic Literature Reviews

Published in MBF-data-science, Oct 14, 2018

The lit review is an obligatory part of the research process. You read a lot of prior literature to understand what’s been done, and how your work fits into it. Lit reviews are about competence (I understand what’s going on), exhaustion (I haven’t missed anything), and pattern recognition (these are key features of the field).

I am a visual learner. I like laying things out and seeing their relationships: clustering things by similarity, drawing lines to indicate relationships, maybe using color to capture phenomena at a glance. This is a common visual shorthand in film and TV for “someone is working really hard on a complex problem” and sometimes “someone has gone completely off the deep end.” Well, mostly the latter. I did a set of these string theory boards for my comp exams, but writing everything down on notecards was very labor intensive, and on a practical note, I simply don’t have the wall space to keep more than one of these corkboards up at a time.

There must be an easier way, and since I’ve been doing all this work on mapping interdisciplinarity using bibliographic networks, I decided to automate the process of making these networks, and use visualization techniques to help guide my lit review. In short, behold! Clicking through lets you explore the network to your heart’s content.

Citation network for Wagner et al, 2011. “Approaches to understanding and measuring interdisciplinary scientific research (IDR): A review of the literature.” *Journal of Informetrics*

Formally, this is a graph of the papers citing Wagner et al (2011), “Approaches to understanding and measuring interdisciplinary scientific research (IDR): A review of the literature,” Journal of Informetrics, 5(1), 14–26, https://doi.org/10.1016/j.joi.2010.06.004, as of the time of writing, together with all of their references. Each node is colored by how many papers in the corpus cite it. Clicking on a small red node, which indicates a paper cited by more than 10 papers in this sample, I find that it is Porter et al (2007), “Measuring researcher interdisciplinarity”. It definitely seems like a paper I should download and check out.
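To make the graph definition concrete, here is a minimal sketch in Python. The toy corpus and its identifiers (paper_A, ref_X, and so on) are made up for illustration; only the structure matters: nodes are corpus papers plus everything they cite, and a node's indegree is the number of corpus papers citing it.

```python
from collections import Counter

# Hypothetical toy corpus: each key is a paper in the corpus, each value
# is the list of works it cites. All identifiers are placeholders.
corpus = {
    "paper_A": ["wagner_2011", "porter_2007", "ref_X"],
    "paper_B": ["wagner_2011", "porter_2007"],
    "paper_C": ["wagner_2011", "paper_A"],
}

# Nodes: every corpus paper plus everything any corpus paper cites.
nodes = set(corpus) | {ref for refs in corpus.values() for ref in refs}

# Edges: one directed edge from the citing paper to each cited work.
# set(refs) guards against a reference appearing twice in one list.
edges = [(citing, cited) for citing, refs in corpus.items() for cited in set(refs)]

# Indegree: how many corpus papers cite each node.
indegree = Counter(cited for _, cited in edges)

# List the most-cited nodes first.
for node in sorted(nodes, key=lambda n: (-indegree[n], n)):
    print(f"{node}: cited by {indegree[node]} corpus paper(s)")
```

The nodes at the top of that listing are the ones that would show up in the hottest colors on the map.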

I’m sure that you have your own research questions, so let me walk you through the process of making a network like this on your own.

Step 1) Identify a Corpus
The first and most important step is to figure out which corpus of papers you’re going to analyze. In my case, Wagner et al (2011) is a great review article, and I can’t imagine a paper about understanding interdisciplinary research from a bibliographic perspective not citing it. Everything that cites Wagner et al (2011) should give me a sense of where the field is. Review articles that are more than 5 years old are a good place to start, but it’s also reasonable to look at what a great research group is publishing, or at a sufficiently narrow topic search.

My corpus contains 154 papers. Very small corpora, say below 10 papers, are not going to be very interesting. Very large corpora become noisy and difficult to visualize, and more seriously, because the layout algorithm scales as O(n³), I suspect larger visualizations simply won’t run in a reasonable amount of time. How big is too big is a question which I don’t have the technical background to answer, but for now I’d avoid corpora with more than 1000 papers.
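For a rough feel of what cubic scaling means, the sketch below compares corpus sizes under the assumption that cost grows exactly as n³. It uses corpus size as a stand-in for node count, which understates the real graph (every cited reference is also a node), so treat the ratios as illustrative only.

```python
def relative_layout_cost(n_nodes: int, baseline_nodes: int) -> float:
    """Cost ratio under an assumed cubic scaling law, cost ~ n**3."""
    return (n_nodes / baseline_nodes) ** 3


baseline = 154  # size of the corpus described above
for n in (154, 500, 1000):
    ratio = relative_layout_cost(n, baseline)
    print(f"{n} papers: about {ratio:.0f}x the baseline cost")
```

Under that assumption, going from 154 papers to 1000 multiplies the layout work by roughly 270, which is why a size cap is a sensible precaution.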

Step 2) Assemble a Corpus
Point your browser to Web of Science (you may need to be on campus to create an account) and perform a search to identify your corpus. I searched for Wagner et al’s DOI, and then clicked the “Times Cited” link to get a list of the citing papers. I set the page size to 50, and then downloaded the record collection to a directory on my computer using the dropdown menu at the top of the page. Choose “Save to Other File Formats” and “Full Record and Cited References” in the Record Content. Repeat for each page of results until you have your entire corpus downloaded to the same directory. Don’t worry if you have duplicates; those are handled automatically later.
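Before moving on, it can be reassuring to check that the downloads add up to the number of results Web of Science reported. The sketch below is not part of my script. It assumes the records were saved in the plain-text format, where each field starts with a two-letter tag and the UT field holds a record's unique accession number, and that the files end in .txt. The function names are made up for this example.

```python
from pathlib import Path
from typing import List, Tuple


def read_ut_ids(path: Path) -> List[str]:
    """Return the UT (accession number) of every record in one export file."""
    ids = []
    # utf-8-sig silently drops a byte-order mark if the file starts with one.
    with path.open(encoding="utf-8-sig", errors="replace") as handle:
        for line in handle:
            if line.startswith("UT "):
                ids.append(line[3:].strip())
    return ids


def summarize_downloads(directory: str, pattern: str = "*.txt") -> Tuple[int, int]:
    """Count total and unique records across all export files in a directory."""
    all_ids = []
    for path in sorted(Path(directory).glob(pattern)):
        all_ids.extend(read_ut_ids(path))
    return len(all_ids), len(set(all_ids))


if __name__ == "__main__":
    total, unique = summarize_downloads(".")
    print(f"{total} records downloaded, {unique} unique")
```

If the unique count matches the size of your search results, the corpus is complete. A total larger than the unique count just means some pages were downloaded twice.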

The Web of Science download page looks like this.

For now, this tool only works on Web of Science records. The next step requires clean data, and Web of Science is the cleanest data around. Web of Science doesn’t index everything, particularly in the humanities and other book heavy fields. It is possible to manually add non-WoS items after Step 3, but we’re trying to automate away manually adding data.

Step 3) Process Your Data
You’re going to need a working copy of Python 3.6 with the metaknowledge and pandas libraries. This is a good place to mention that if you’re doing anything with bibliometrics, you need metaknowledge. Their team is wonderful. The citation for metaknowledge is: Reid McIlroy-Young, John McLevey, and Jillian Anderson. 2015. metaknowledge: open source software for social networks, bibliometrics, and sociology of knowledge research. URL: http://www.networkslab.org/metaknowledge.

I’m going to assume that you’re where I was two years ago, and know nothing about Python. If you do know something about Python, skip this paragraph. To download Python, go to https://www.python.org/downloads/ and pick the appropriate installer for your OS and processor. On Windows, open a command prompt (type “command” in the search bar) and run “pip install pandas” and “pip install metaknowledge”. If things are not working for some reason, read Automate the Boring Stuff with Python, and poke around on the internet.
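A quick way to confirm the installs worked is to ask Python whether it can find both libraries. This is a small sketch using only the standard library. The helper name missing_packages is made up for this example.

```python
import importlib.util


def missing_packages(names):
    """Return the names from `names` that Python cannot find for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(["pandas", "metaknowledge"])
if missing:
    print("Still need to install:", ", ".join(missing))
else:
    print("pandas and metaknowledge are both installed.")
```

If a package is reported missing even though pip said it installed, pip may belong to a different Python installation. Running “python -m pip install pandas” ties the install to the Python you are actually using.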

Okay, now that you have Python and the necessary libraries installed, download my script from Google Drive. Put the script in the directory with your WoS records. Run it. Enter a name for your project and hit enter.

You now have an Excel file containing a list of nodes, with each node a paper either in your corpus or cited by your corpus, and a list of edges indicating where one paper cites another. You can poke around in this Excel file, see what’s been cited a lot (the indegree column), read all the abstracts, etc. It’s possible to manually add nodes and edges in this spreadsheet. Each ID must be unique, edges should go from one ID to another, and Label is what will be displayed next to each node.
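If you do add nodes and edges by hand, those three rules are easy to break. Here is a minimal sketch of a checker for them. It works on plain lists of rows rather than the spreadsheet itself, and the edge column names “From” and “To” are an assumption for illustration; use whatever headers your file actually has.

```python
from collections import Counter


def check_network(nodes, edges):
    """Return a list of problems found in hand-edited node and edge rows.

    nodes: list of dicts with at least "ID" and "Label" keys
    edges: list of dicts with "From" and "To" keys holding node IDs
           (these two column names are assumed, not taken from the script)
    """
    problems = []
    id_counts = Counter(node["ID"] for node in nodes)
    # Rule 1: each ID must be unique.
    for node_id, count in id_counts.items():
        if count > 1:
            problems.append(f"duplicate ID: {node_id}")
    # Rule 2: each node needs a Label to display.
    for node in nodes:
        if not node.get("Label"):
            problems.append(f"missing Label for ID: {node['ID']}")
    # Rule 3: each edge must run between two existing IDs.
    for edge in edges:
        for end in ("From", "To"):
            if edge[end] not in id_counts:
                problems.append(f"edge points at unknown ID: {edge[end]}")
    return problems


# Hypothetical rows, including two deliberate mistakes.
nodes = [
    {"ID": "n1", "Label": "Paper one"},
    {"ID": "n2", "Label": "Paper two"},
    {"ID": "n2", "Label": "A hand-added book"},
]
edges = [{"From": "n1", "To": "n2"}, {"From": "n1", "To": "n9"}]

for problem in check_network(nodes, edges):
    print(problem)
```

To run it on the real file, pandas.read_excel with sheet_name=None returns one DataFrame per sheet, and DataFrame.to_dict("records") turns each into a list of row dicts like the ones above.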

I’ve updated the script to also create a co-citation network. The co-citation network only shows your corpus and papers that have been cited more than once, so it’s less “noisy” and renders faster.
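For readers new to the term: two works are co-cited when they appear together in the same paper's reference list. The sketch below illustrates the idea on made-up reference lists, along with the “cited more than once” filter. It is an illustration of the concept under those assumptions, not the code in the script, which may build its network differently.

```python
from collections import Counter
from itertools import combinations


def cocitation_counts(reference_lists):
    """Count how often each pair of works appears in the same reference list."""
    pairs = Counter()
    for refs in reference_lists:
        # set() drops repeats; sorting gives each pair one canonical order.
        for a, b in combinations(sorted(set(refs)), 2):
            pairs[(a, b)] += 1
    return pairs


def cited_more_than_once(reference_lists):
    """Return the works that appear in at least two reference lists."""
    counts = Counter(ref for refs in reference_lists for ref in set(refs))
    return {ref for ref, n in counts.items() if n > 1}


# Hypothetical reference lists, one per corpus paper.
reference_lists = [
    ["wagner_2011", "porter_2007", "ref_X"],
    ["wagner_2011", "porter_2007"],
    ["wagner_2011", "ref_Y"],
]

keep = cited_more_than_once(reference_lists)
for (a, b), n in sorted(cocitation_counts(reference_lists).items()):
    if a in keep and b in keep:
        print(f"{a} and {b} are co-cited {n} time(s)")
```

Dropping everything cited only once is what thins out the network: in a typical corpus, most references are cited by a single paper.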

Step 4) Visualize Your Data
Go to www.Kumu.io and create an account. Create a project, and then create a map. I like the stakeholder template. Click the green plus at the bottom of the page and import data from .xlsx. Select the .xlsx file you just made. Save the import, and wait a minute while everything loads. Marvelous! If you click on the three dots on the left, a detailed info pane slides out, where we have the node title, a link to search Google for the paper, and the abstract if it is in the dataset.

The default is not particularly informative, so hit the settings tab on the left side of the window, switch to the advanced editor, and paste the code below. This makes papers that were in your corpus bigger in size, and colors each paper by the number of papers in your corpus that cite it, allowing you to instantly see the fundamental papers in your lit review. Feel free to customize the colors, thresholds, and other elements.

@controls {
  bottom-right {
    label {
      value: "String Theory by Michael Burnam-Fink";
    }
  }
}

@settings {
  template: stakeholder;
  element-text-align: center;
  font-size: 14;
  element-size: 10;
  font-color: #000000;
}

element["element type"="Source"] {
  size: 60;
}

element["indegree">"1"] {
  color: #d3fadb;
}

element["indegree">"2"] {
  color: #f2dd3e;
}

element["indegree">"5"] {
  color: #f17c3e;
}

element["indegree">"10"] {
  color: #8e0000;
}
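
If you want to predict which color a paper will get before importing, the rules above can be mirrored in a few lines of Python. This sketch assumes that when several rules match, the last one wins, as in CSS, so it checks the highest threshold first. The function name color_for is made up for this example.

```python
# Thresholds and colors copied from the style rules above, highest first.
COLOR_RULES = [
    (10, "#8e0000"),
    (5, "#f17c3e"),
    (2, "#f2dd3e"),
    (1, "#d3fadb"),
]


def color_for(indegree):
    """Return the hex color for a given indegree, or None for the default."""
    for threshold, color in COLOR_RULES:
        # The rules use a strict "greater than" comparison.
        if indegree > threshold:
            return color
    return None  # no rule matched, so the template's default color applies


for value in (1, 2, 3, 6, 11):
    print(value, color_for(value))
```

Note that the comparisons are strict, so a paper needs at least 2 citations to get any color and at least 11 to turn dark red.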

Step 5) Going Deeper

First, I’d like to say that I’ve used a lot of network visualization GUIs, and Kumu.io is head and shoulders above the rest in terms of both usability and aesthetics. Thanks to the Kumu.io team, and thanks to Annie Hale for telling me about them. There are lots of ways to view these networks, but one of the first is looking at just a few papers, instead of a giant rainbow.

To generate this view, I added tags to the four blue nodes, and then used Kumu’s showcase feature to look at this part of the network in detail.

There are lots of possibilities for working with this data. Papers could be tagged with key concepts, or a concept could be added as a new element using the green plus. You could track your progress through the lit review, tagging papers as Downloaded, Read, and Annotated. Kumu.io also allows you to share your networks with teams. I’m sure I’ve only scratched the surface of what’s possible with this kind of Bibliographic Network Visualization!

I hope you enjoyed this guide, and if you have any thoughts or questions about these techniques, drop me a line.

Written by Michael Burnam-Fink