Mapping Research Groups
I’d like to take a moment to write about my ongoing research using bibliometrics to map and understand research groups. One of the advantages of studying scientists is that they produce a lot of very structured material about their work. Every scientific publication has a title, a list of authors, citations to other papers, and metadata. I’m using records stored by the ISI Web of Science, because their system is comprehensive and has very clean data, and because it plays nicely with the Metaknowledge library for Python.
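For readers curious what those records look like under the hood, here is a minimal, illustrative sketch of the WoS field-tagged plain-text export (two-letter tags, indented continuation lines, an ER line closing each record). The real export has more quirks, and the metaknowledge library handles them far more robustly; the sample record below is made up.

```python
def parse_wos_records(text):
    """Split a WoS plain-text export into records of {tag: [values]}."""
    records, current, tag = [], {}, None
    for line in text.splitlines():
        if line.startswith("ER"):            # "ER" closes a record
            records.append(current)
            current, tag = {}, None
        elif line[:2].strip():               # a new two-letter field tag
            tag = line[:2]
            current.setdefault(tag, []).append(line[3:].strip())
        elif tag and line.strip():           # indented continuation line
            current[tag].append(line.strip())
    return records

# A toy two-field record (fabricated, not a real paper).
sample = """PT J
AU Smith, J
   Jones, A
TI A sample paper on clean energy
ER
"""
recs = parse_wos_records(sample)
print(recs[0]["AU"])  # ['Smith, J', 'Jones, A']
```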
For the privacy of my subjects, I’ve anonymized this group, but I can tell you that they are a multidisciplinary clean energy group at a major American university. I got a list of group members from their public web site, and then scraped Web of Science for their publications. After some preliminary data cleaning, I produced this network.
Serviceable, but not particularly aesthetic. And in this case, aesthetics matter. Visual representation of data provides many ways to deceive. Picturing Personhood by Joe Dumit has a wonderful chapter about how heavy use of red in PET scans turns a more or less arbitrary representation of levels of neural activity into alarming illusions of mental illness. In this case, the problems are twofold. First, there's simply too much information: too many nodes and connections. My computer struggles to render it all. Second, the nodes are scaled by diameter, which distorts how we see them. Area is a much better representation of importance, as the image below demonstrates.
Using an area scaling lets us show a wider range of values, and matches how the mind assesses the size of circles.
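The scaling rule itself is one line: make the radius proportional to the square root of the value, so that area tracks the value itself. A sketch (the maximum radius of 30 is an arbitrary choice):

```python
import math

def node_radius(value, max_value, max_radius=30.0):
    """Scale a node's radius so that its AREA, not its diameter,
    is proportional to the underlying value (e.g. publication count)."""
    return max_radius * math.sqrt(value / max_value)

# A scholar with 4x the publications gets a circle with 4x the area,
# which means only 2x the radius.
print(node_radius(100, 100))  # 30.0
print(node_radius(25, 100))   # 15.0
```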
As for clearing away excess information, we could look at just the core of the research group, the people officially listed on the roster, but this may pare us down too far.
I prefer to work with what I call a mutual plot, which includes the core group, anyone who has worked with two or more of the core group, and anyone who has authored more than five papers. You can follow along at this link (UPDATE: More groups have been added, and colors have changed. This is Group C). One-off coauthorship relations are not particularly important to the research questions that I’m interested in.
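The mutual-plot filtering rule is simple to state in code. A minimal sketch, with toy author names and paper counts:

```python
def mutual_plot_nodes(coauthors, paper_counts, core):
    """Filter a coauthorship network down to a 'mutual plot': the core
    roster, anyone who has coauthored with two or more core members,
    and anyone with more than five papers.

    coauthors:    dict mapping author -> set of coauthors
    paper_counts: dict mapping author -> number of papers
    core:         set of roster members
    """
    keep = set(core)
    for author, partners in coauthors.items():
        if len(partners & core) >= 2 or paper_counts.get(author, 0) > 5:
            keep.add(author)
    return keep

# Toy example: A and B are core; X worked with both, Y with just one,
# and Z with one core member but has many papers.
coauthors = {"X": {"A", "B"}, "Y": {"A"}, "Z": {"B"}}
counts = {"X": 2, "Y": 1, "Z": 9}
print(sorted(mutual_plot_nodes(coauthors, counts, {"A", "B"})))
# ['A', 'B', 'X', 'Z']
```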
About those research questions. First, I can tell at a glance that this is a well-integrated network, with most of the nodes connected in a single large component. The map is colored by discipline, and the predominant orange-red shading indicates that this group focuses on the physical sciences. A close examination shows that is indeed the case, with primary disciplines in chemistry, applied physics, and materials science. The pink nodes are in business, and the blue nodes in biotechnology and microbiology. ANON 29 and ANON 30 are the most productive scholars, with 222 and 354 publications respectively, though several of the professors in the central group are close to or over 100 publications. This group is studded with stars. The nodes are laid out using a force-directed layout, in which the links are treated as springs and a physics model runs until the network reaches equilibrium. Network layout is a hard problem, but Kumu.io's implementation produces solid results (igraph is pretty good, networkx less so). The effects of the layout can be seen by mousing over ANON 2 and ANON 29, who share a wall of mutual collaborators between them, and ANON 24 and ANON 27, whose more similar collaboration networks compose a ring. Overlaps are common, and so the network cannot be taken in at a glance.
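The "single large component" observation is easy to verify programmatically with a breadth-first search. A minimal sketch, with placeholder node names:

```python
from collections import deque

def largest_component(adjacency):
    """Return the largest connected component of an undirected network,
    given as a dict mapping node -> set of neighbours."""
    seen, best = set(), set()
    for start in adjacency:
        if start in seen:
            continue
        component, queue = {start}, deque([start])
        while queue:                      # breadth-first flood fill
            node = queue.popleft()
            for nbr in adjacency[node]:
                if nbr not in component:
                    component.add(nbr)
                    queue.append(nbr)
        seen |= component
        best = max(best, component, key=len)
    return best

# Toy network: one three-node chain, one two-node island.
adj = {"A": {"B"}, "B": {"A", "C"}, "C": {"B"}, "D": {"E"}, "E": {"D"}}
print(sorted(largest_component(adj)))  # ['A', 'B', 'C']
```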
Since one of the focuses of the grant is graduate student training, we should focus on the grad students, indicated with white bullseyes. ANON 2 has authored papers with 5 grad students, as have ANON 20 and ANON 27, but a closer examination reveals that each has mostly authored with their own grad students.
ANON 7, ANON 23, ANON 33, ANON 34, and ANON 37 have coauthored papers with multiple professors on the grant. They might be students who have been around longer and so have had time to work with more professors, or they might be more ambitious than is typical. It’d be interesting to check, if I were doing this properly and not as an illustrative example. Of the islands not connected to the main component, three consist of grad students. I believe these people conducted research as undergrads, and have not yet published papers in grad school as part of this institution’s network. Several of the students haven’t published at all, and so aren’t on the map. Finally, ANON 42, ANON 57, and ANON 97 are not members of the group, but have published with three or more group professors. They might be worth investigating.
One of the most interesting measures in network analysis is betweenness centrality, which measures the proportion of shortest paths a node sits on. Nodes with high betweenness centrality are considered brokers or bridges, and can leverage their position in the network. I’ve calculated betweenness centrality for just the core network (high-publication authors serve as the sole point of contact between their personal networks and the rest of the group, and would skew the overall statistics). ANON 2 has the highest betweenness centrality, indicating his role as a bridge between the rest of the group and ANON 20’s applied physics section.
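For those curious how the measure is computed, here is a pure-Python sketch of Brandes' algorithm for betweenness centrality on an unweighted, undirected network. In practice I'd use a library implementation such as networkx's betweenness_centrality; this version just shows the idea.

```python
from collections import deque

def betweenness(adjacency):
    """Brandes' algorithm: adjacency is a dict of node -> set of neighbours."""
    bc = {v: 0.0 for v in adjacency}
    for s in adjacency:
        stack, preds = [], {v: [] for v in adjacency}
        sigma = {v: 0 for v in adjacency}   # counts of shortest paths from s
        dist = {v: -1 for v in adjacency}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:                        # BFS from s
            v = queue.popleft()
            stack.append(v)
            for w in adjacency[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:  # w lies on a shortest path via v
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adjacency}
        while stack:                        # accumulate dependencies backwards
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    for v in bc:                            # undirected: each path counted twice
        bc[v] /= 2.0
    return bc

# Toy path A-B-C: B sits on the only shortest path between A and C.
adj = {"A": {"B"}, "B": {"A", "C"}, "C": {"B"}}
print(betweenness(adj)["B"])  # 1.0
```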
Another stated goal of this group is interdisciplinary integration. One measure the field has settled on is the Rao-Stirling diversity index, as calculated by Cassi et al (2014 & 2017), which serves as a proxy for knowledge integration. To put it simply (and gloss over some details), every article is assigned to one or more Web of Science categories, and cites many other papers which are also in Web of Science categories. A good measure of diversity requires a distance measure between Web of Science categories, which is provided by Rafols et al 2009. Run the math, and you can pick out the more interdisciplinary and integrated scholars in a heatmap. I’ve used the same technique to indicate more unusual collaborations.
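The index itself is a short computation once the category proportions and distances are in hand. A sketch with made-up numbers; note that conventions differ (some authors sum over unordered pairs only, which halves the value):

```python
def rao_stirling(proportions, distance):
    """Rao-Stirling diversity: sum over pairs of categories of
    p_i * p_j * d_ij, where p_i is the share of citations falling in
    category i and d_ij is the distance between categories
    (0 = identical, 1 = maximally distant)."""
    total = 0.0
    cats = list(proportions)
    for i in cats:
        for j in cats:
            if i != j:
                total += proportions[i] * proportions[j] * distance[(i, j)]
    return total

# Toy example: citations split evenly across two fairly distant categories.
p = {"Chemistry": 0.5, "Microbiology": 0.5}
d = {("Chemistry", "Microbiology"): 0.8,
     ("Microbiology", "Chemistry"): 0.8}
print(rao_stirling(p, d))  # 0.4
```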
ANON 1 pops out immediately. While most of their work is in optics, they have ties across the sciences. There’s an arc of lighter colors across the mid-top, in nanotechnology and biomaterials. The heart of the group has relatively low Rao-Stirling indices of 0.2–0.25. They focus mostly on chemistry, and that research focus is reflected in the map. A look at the grad students shows that they are in this middle range as well. ANON 6, ANON 8, and ANON 10 stand out above that range. This map, though, is one area where we must be careful not to deceive ourselves. Rao-Stirling scores depend on the sampled publications, and these values are relative to the group maximum of 0.735, held by ANON 244 and ANON 255, who were coauthors on a single very interdisciplinary paper. More work on more groups is needed to see what typical scores are. One of the classic studies in the field, Porter et al 2007, suggests a Rao-Stirling score above 0.46 is indicative of truly integrative scholarship.
We can also look at gender in science. I found a Python library that guesses gender based on first names and ran my data through it. First, by the numbers, this group is mostly male (135 male or mostly male, 41 female or mostly female, 22 ambiguous names, and 58 unknown). The PIs are mostly male, 7 men to 3 women with two unknowns, a mix reflected in the grad students, with 14 men to 3 women. This group should not lean on gender diversity as a strong point. And of course, this is the crudest possible measure of gender, based solely on automatic guesses from names. Gender can be hand-coded, and probably should be if it’s a primary focus of a study.
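The tallying step is trivial; the hard part is the name-to-gender lookup. Here's a sketch using a hypothetical three-name lookup table in place of a real library (the gender-guesser package, for instance, returns labels like 'mostly_male' and 'andy' for ambiguous names):

```python
from collections import Counter

# Hypothetical toy lookup standing in for a real name->gender library.
NAME_GENDER = {"james": "male", "maria": "female", "alex": "andy"}

def tally_genders(first_names):
    """Count gender guesses across a list of first names."""
    return Counter(NAME_GENDER.get(name.lower(), "unknown")
                   for name in first_names)

tally = tally_genders(["James", "Maria", "Alex", "Priya"])
print(tally["male"], tally["andy"], tally["unknown"])  # 1 1 1
```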
Finally, the cocitation map lets us see what they’re actually working on. This map is simplified to just papers that have been cited five or more times, and I’ve removed the markers distinguishing papers the group has published from papers that they cite. Checking one isolated group up top (locations may shift), we find a bunch of papers about bacteria and biofuels. A cluster down at the bottom is about semiconductors and solar cells, another about nanomaterials. There’s a second solar cell group. We can see the several distinct research areas, as well as the papers in common, like one key review article on graphene.
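Constructing a cocitation network is mechanical: two papers are linked whenever they appear together in a reference list, and the link weight is the number of times that happens. A sketch with toy paper IDs:

```python
from itertools import combinations
from collections import Counter

def cocitation_counts(reference_lists):
    """Count how often each pair of papers is cited together.
    Two papers are cocited when they appear in the same reference list."""
    pairs = Counter()
    for refs in reference_lists:
        # sorted() gives each pair a canonical order, e.g. ("P1", "P2")
        for a, b in combinations(sorted(set(refs)), 2):
            pairs[(a, b)] += 1
    return pairs

# Three citing papers; P1 and P2 appear together in two reference lists.
refs = [["P1", "P2", "P3"], ["P1", "P2"], ["P3"]]
counts = cocitation_counts(refs)
print(counts[("P1", "P2")])  # 2
```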
I’ve added two new networks, which reveal more information about the scientists in question, their work, and their ideas. A colleague is interested in the development of community resilience as a concept, and I ran these networks for them. This is a timeline, so you have to click the years on the bottom to bring up data. Scholars are indicated in green, papers included in the sample in shades of blue, and citations in red, with lines indicating a pattern of scholar → paper → citation. Clicking through year by year, we can see that the years 2008–2011 are important in the creation of a “resilience canon”, with the key papers of Cutter et al, 2008, A place-based model for understanding community resilience to natural disasters and Magis, K, 2010, Community Resilience: An Indicator of Social Sustainability. Going forward, the hundreds of papers in the community tend to cite one of these two keystone papers, or their citations, branching into broader areas of resilience in the face of disaster, and resilience as local knowledge and community integrity.
The last view is the matchmaker, also using the community resilience dataset. Based on a pattern of common citations, I can guess that two scholars have something in common, and that they might want to work with each other. Existing coauthorships are indicated with dashed lines. The complete matchmaker network is spaghetti, which makes sense because these are people working on a common topic, but clicking on one node and using the focus tool (cross-hairs, left side) lets you see the overlaps for a single scholar of interest.
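One plausible way to score that "common citations" signal is the Jaccard overlap of two scholars' citation sets. This is an illustrative sketch with toy paper IDs, not necessarily the measure the matchmaker view actually uses:

```python
def citation_overlap(cites_a, cites_b):
    """Jaccard similarity of two scholars' citation sets: the size of
    the intersection over the size of the union, ranging 0 to 1."""
    cites_a, cites_b = set(cites_a), set(cites_b)
    if not (cites_a | cites_b):
        return 0.0
    return len(cites_a & cites_b) / len(cites_a | cites_b)

# Two scholars sharing two of four distinct cited papers.
print(citation_overlap({"P1", "P2", "P3"}, {"P2", "P3", "P4"}))  # 0.5
```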
It’s taken me a couple of years to get to this point, but my code base is now solid enough that I can do groups quickly. This analysis took me an afternoon. I’m working on a journal paper based on this work, and if you have any ideas about how this can be applied to your project, I’m happy to collaborate.