How to identify thought leaders and visualize their influence

Tyler Burns
Coinmonks
Aug 9, 2018

Image provided by igorr at 123rf.com

When you need to learn a new field quickly, it is important to determine who the main intellectual authorities are. Knowing this helps you find the seminal papers in the field, as well as reviews authored by established people. This is a follow-up to my last article, which looked at trends in the scientific literature over time.

As in the previous article, I use the RISmed package from CRAN to pull information about scientific papers from a given time frame. Rather than looking at publication rate, I look at how authorship is distributed. In my analysis, I stay blind to who is first and who is last author: I simply report all authors on a given paper, regardless of their position. I also don’t look at which journal it was published in. The only criterion is that the paper is in PubMed (so it’s not a preprint).

Quantifying Authorship

I use the search term “mass cytometry” because that is the field I am familiar with, so I can confirm and interpret the results accordingly. Let’s have a look at the number of papers per author for all papers matching that term between 2010 and 2018.

The X axis is authors (First, Last) and the Y axis is the total number of papers published during that time frame. You can see that Garry Nolan is on the most papers, as expected given that he pioneered the field. Following not far behind is Sean Bendall, who, along with Erin Simonds, published the paper that popularized mass cytometry for immunology. Also of note on this list is Bernd Bodenmiller: he has pioneered Imaging Mass Cytometry, so a share of his papers returned by the search term “mass cytometry” belong to that sub-field.
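As a rough illustration of the counting behind a bar graph like this, here is a minimal Python sketch. It is not the RISmed-based R code used for the article; the `papers` list and the placeholder author names are toy data invented for the example, and `papers_per_author` is a hypothetical helper name.

```python
from collections import Counter

# Toy data (an assumption for illustration): each paper is the list of its authors.
papers = [
    ["Author A", "Author B", "Author C"],
    ["Author A", "Author B"],
    ["Author C", "Author D"],
    ["Author A", "Author D", "Author E"],
]

def papers_per_author(papers):
    """Count how many papers each author appears on, ignoring author position."""
    counts = Counter()
    for authors in papers:
        # set() guards against a name listed twice on the same paper
        counts.update(set(authors))
    return counts

counts = papers_per_author(papers)

# Most prolific authors first, matching the left-to-right order of the bar graph
for author, n in counts.most_common():
    print(f"{author}: {n}")
```

Author position and journal are never consulted, which mirrors the "blind to first and last author" choice described earlier.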

What do you do with an infographic like this? If you need to learn about mass cytometry, you identify the intellectual leaders (those at the left of this bar graph) and then find reviews that include them as authors. But the domain “mass cytometry” has many sub-domains, each presumably with its own thought leader. How do we identify these?

Mapping Authorship Per Paper

Let’s dig a little deeper. Each paper is a collection of authors, and with the right tools, we can get a feel for the “reach” of a given author, defined by the number of other authors one publishes with. One highly visual way to get at this is to turn each paper into a high-dimensional object and apply a dimension-reduction visualization, as one would with mass cytometry or single-cell RNA sequencing data. This high-dimensional object looks like this:

Each row is a paper and each column is an author. Row 6 and row 10 were papers authored by both Garry Nolan and Sean Bendall, as shown by the “1” in their respective columns.
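A matrix of this shape can be built in a few lines. The sketch below is a Python illustration under the same toy-data assumption (placeholder author names, hypothetical function name `authorship_matrix`), not the code used to produce the table above.

```python
# Toy data (an assumption for illustration): each paper is the list of its authors.
papers = [
    ["Author A", "Author B", "Author C"],
    ["Author A", "Author B"],
    ["Author C", "Author D"],
]

def authorship_matrix(papers):
    """Return (authors, matrix): one row per paper, one 0/1 column per author."""
    # Fix a column order by sorting the set of all author names
    authors = sorted({name for paper in papers for name in paper})
    column = {name: j for j, name in enumerate(authors)}
    matrix = []
    for paper in papers:
        row = [0] * len(authors)
        for name in paper:
            row[column[name]] = 1  # this author is on this paper
        matrix.append(row)
    return authors, matrix

authors, matrix = authorship_matrix(papers)
print(authors)
for row in matrix:
    print(row)
```

Two papers that share authors have a "1" in the same columns, which is what lets a dimension-reduction method place them near each other.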

Now if you do the right dimension reduction, you can plot each paper and see how they group in terms of authorship, effectively creating an intuitive map of all mass cytometry papers published to date.

For the data scientists and others interested: I reduced the sparsity by looking only at the 100 most prolific authors, given that the rest of the data was one or two papers per author. This cleanup step was key to producing good visualizations. I used logistic PCA (good for sparse binary data) to reduce the dataset to 30 dimensions, and I ran t-SNE on those 30 dimensions. Of note, the results were similar to, though perhaps a bit cleaner than, those from running t-SNE directly on all 100 dimensions, and they led to the same conclusions.
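The cleanup step can be sketched as follows. This Python example covers only the "keep the most prolific authors" filter; the logistic PCA and t-SNE steps are not shown. The function name `keep_top_authors`, the alphabetical tie-breaking, and the choice to drop papers left with no remaining authors are assumptions made for the illustration, and the data is a toy set.

```python
from collections import Counter

# Toy data (an assumption for illustration): each paper is the list of its authors.
papers = [
    ["Author A", "Author B", "Author C"],
    ["Author A", "Author B"],
    ["Author C", "Author D"],
    ["Author A", "Author D", "Author E"],
]

def keep_top_authors(papers, n):
    """Keep only the n most prolific authors; drop papers with none of them."""
    counts = Counter()
    for authors in papers:
        counts.update(set(authors))
    # Rank by paper count (descending); break ties alphabetically so the result is stable
    ranked = sorted(counts, key=lambda name: (-counts[name], name))
    top = ranked[:n]
    top_set = set(top)
    filtered = []
    for authors in papers:
        kept = [name for name in authors if name in top_set]
        if kept:  # assumption: a paper with no top-n author carries no signal
            filtered.append(kept)
    return top, filtered

top, filtered = keep_top_authors(papers, n=2)
print(top)
print(filtered)
```

With real data, `n=100` would correspond to the cutoff described above, and the filtered papers would then be turned into the binary matrix before any dimension reduction.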

For everyone else, let’s look at these maps!

In the map above, each dot is a paper. Papers that have shared authors are grouped near to each other. Specifically, each dot is a series of zeros and ones corresponding to which of the set of all authors is on the paper. Now let’s color by some of the authors above and interpret accordingly.

Stopping here, you can see that papers that contain Garry Nolan mainly fall into a particular region of the map, with another small island on the other side. Sean Bendall, the second most prolific mass cytometry author, overlaps strongly with Garry Nolan. This likely comes both from him being a pioneer of the method in the Nolan Lab, and from him starting his own lab at Stanford and remaining collaborative.

Bernd Bodenmiller was also a postdoc in the Nolan Lab when mass cytometry was emerging. You can see three of his papers overlap with the Nolan/Bendall region, but then there is another very distinct island on the map that excludes Garry Nolan and Sean Bendall. After the Nolan Lab, Bernd Bodenmiller started his own laboratory at the University of Zurich and began pioneering what is called Imaging Mass Cytometry. Given that this activity is on another continent and separate from Garry Nolan’s and Sean Bendall’s respective directions, it makes sense that a separate island has formed on this map.

Here, I show two authors, Mark Davis and Dana Pe’er, whose authorship appears more decentralized than that of the first three authors I’ve shown. I attribute this to their expertise. Mark Davis is a central authority on T and B cell immunology, and Dana Pe’er is a central authority on computational systems biology. Thus, multiple groups with technical expertise in mass cytometry collaborate with them.

Mapping Author Relations

One more way to view these per-paper author relationships is with a heatmap. Simply take the binary data and compute a Pearson correlation matrix from it. For binary data, the Pearson correlation is called the phi coefficient, and it measures the relation between two columns of the dataset. If two authors appear only on papers together, the value for that pair is 1. If two authors are never on a paper together, the value is negative, and close to 0 when each author is on only a small fraction of all papers. Below is a correlation heatmap (pheatmap R package) of the top 40 most prolific authors.
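To make the coefficient concrete, here is a minimal Python sketch that computes phi for two binary author columns from the 2x2 table of co-occurrence counts. The columns are toy data and the function name `phi` is chosen for the example; the heatmap itself was made in R.

```python
import math

def phi(x, y):
    """Phi coefficient (Pearson correlation) of two equal-length 0/1 columns."""
    n11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)  # both on the paper
    n10 = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)  # only the first author
    n01 = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)  # only the second author
    n00 = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)  # neither
    denom = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    if denom == 0:
        return float("nan")  # undefined when a column is all 0s or all 1s
    return (n11 * n00 - n10 * n01) / denom

# Toy columns over six papers (an assumption for illustration)
author_a = [1, 1, 0, 0, 0, 0]
author_b = [1, 1, 0, 0, 0, 0]  # on exactly the same papers as author_a
author_c = [0, 0, 1, 0, 0, 0]  # never on a paper with author_a

print(phi(author_a, author_b))
print(phi(author_a, author_c))
```

Running it on the toy columns is a quick way to see how the coefficient behaves at the two extremes before reading the full heatmap.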

You can see that there are modules of authors who publish together (e.g. Garry Nolan and Sean Bendall, as shown before), and then there are other authors who span multiple modules (e.g. Mark Davis and Dana Pe’er, as shown before). Adeeb Rahman doesn’t pair up with anyone else on this list despite being very prolific, because his co-authors are not part of this “top 40 most prolific authors” list. This places him in the same category as Mark Davis and Dana Pe’er.

Discussion

Taken together, the two visual techniques reveal two categories of authorship in the mass cytometry field, related in turn to the types of thought leaders that exist. The first is authors who tend to publish with the same people, as expected from any given lab or from mutually collaborative labs (Garry Nolan, Sean Bendall, Bernd Bodenmiller). The second is authors who tend to publish with diverse sets of people (Mark Davis, Dana Pe’er, Adeeb Rahman). The former category can be queried further for particular specializations. The latter category suggests that these individuals have a more nuanced (and critical) role in shaping the field that goes beyond the number of papers they’re on.

If you’re looking for the key thought leaders of a given field, you need to look for both categories. It is not enough to sort by how prolific a particular author is. These two categories can be further quantified to, for example, find the most exclusive authorship clique and what they do, or find the most de-centralized author with a co-authorship diversity score.
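One possible way to put a number on the second category is sketched below in Python. The score is a hypothetical definition made up for this illustration, not something computed in the article: for each author it reports the number of distinct co-authors (the “reach” described earlier) and the ratio of distinct co-authors to total co-author slots. A ratio near 1 means a new set of collaborators on each paper; a lower ratio means the same people again and again. The data is a toy set.

```python
from collections import defaultdict

# Toy data (an assumption for illustration): each paper is the list of its authors.
papers = [
    ["Author A", "Author B", "Author C"],
    ["Author A", "Author B", "Author C"],
    ["Author D", "Author E"],
    ["Author D", "Author F"],
    ["Author D", "Author G"],
]

def coauthor_diversity(papers):
    """Map each author to (distinct co-authors, distinct / total co-author slots)."""
    slots = defaultdict(int)     # co-author appearances, counting repeats
    distinct = defaultdict(set)  # unique co-author names
    for authors in papers:
        unique = set(authors)
        for name in unique:
            others = unique - {name}
            slots[name] += len(others)
            distinct[name] |= others
    return {
        name: (len(distinct[name]),
               len(distinct[name]) / slots[name] if slots[name] else 0.0)
        for name in distinct
    }

scores = coauthor_diversity(papers)
for name, (reach, ratio) in sorted(scores.items()):
    print(f"{name}: reach={reach}, diversity={ratio:.2f}")
```

In this toy set, "Author A" always publishes with the same two people while "Author D" has a different co-author on every paper, which is the kind of contrast the two categories describe.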

In later posts, I’ll explore adding other features to this dataset, like MeSH terms and location information in order to determine who is the thought leader of what, and how that changes over time.

www.tylerjburns.com. I sit at the intersection of biology, data science, and management.