Seeing the Impact of Interdisciplinary Science Grants

Michael Burnam-Fink · MBF-data-science · Oct 14, 2018

I’d like to sum up the past two weeks of Data Science in Public posts (on my personal website) to discuss what I did and what we learned. I’m interested in interdisciplinary science because so much of science is done by interdisciplinary teams. Very few of the problems we care about (fighting cancer, clean energy systems, ecological resilience, human well-being) have relevant discoveries that can be made by the idealized lone specialist who knows one type of thing very deeply. The National Science Foundation recognizes this, and has been aiding in the transformation of science, in part through the IGERT/NRT grant series. These awards run about $3,000,000 over five years, with most of the funding going toward graduate student stipends, and are meant to encourage innovative models of STEM graduate education on high-priority interdisciplinary topics. More than 400 of these awards have been made since 1997, for something like $1.2 billion in total.

I think it’s important to know if this program is working. The 2018 NRT funding solicitation states: “The program is dedicated to effective training of STEM graduate students in high priority interdisciplinary research areas, through the use of a comprehensive traineeship model that is innovative, evidence-based, and aligned with changing workforce and research needs.” In an ideal world, we’d be able to embed trained ethnographers in every one of these grants to track the development of students into scientists. That’s not feasible, and we can’t go back to the earliest grants without a time machine. But one of the nice things about scientists is that they publish what they discover, those publications are indexed by databases like the Thomson Reuters ISI Web of Knowledge, and we can use the richly structured information in publications to say something about the grants that produced them.

In one sentence, can we use data science to show that the IGERT/NRT solicitation produced more interdisciplinary scholarship?

You can see the source of the data and how it was transformed at each step in the flowchart below. We go from all the IGERTs, to a sample of five, to a list of 202 scientists, a corpus of over 3400 publications, and then a dataframe of feature vectors amenable to data science techniques. An anonymized version of the final dataset (5-scientometrics101-cites.csv) along with some code is available on my GitHub.

Project data and workflow
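If you want to poke at the data yourself, loading the anonymized CSV is the natural first step. Here is a minimal sketch; the file name comes from the repository, but inspect the columns yourself rather than trusting any schema I might guess at:

```python
import pandas as pd

# Load the anonymized feature vectors produced at the end of the pipeline.
df = pd.read_csv('5-scientometrics101-cites.csv')

print(df.shape)                # papers x features
print(df.columns.tolist())     # inspect the actual schema
print(df.head())
```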

I describe how to evaluate interdisciplinarity in DSiP 2, but I have a better metaphor I’d like to share. Imagine all of scientific knowledge as the night sky, with each star a different fact known to science. New scientific knowledge is like adding a new star to the heavens. Scientists don’t just throw their ideas up in the air: they construct truth by tying each new claim into what is already known. Scientific writing is a lot like finding constellations. Constellations aren’t “real” features of the universe (the individual stars may be separated by thousands of light-years), but they are real features of human experience. The connections that we draw between facts are just as important to understanding as the facts themselves.

Constellations

Building on that analogy, we’d expect different kinds of scientists, say a chemist and a botanist, to make different kinds of “constellations” when they’re writing their papers. And we’d expect a monodisciplinary paper to look different from an interdisciplinary paper. This concept is operationalized in the Rao-Stirling Diversity Index (SDI), which measures how the citations of a paper are distributed across Web of Science subject categories. It is 0 for work entirely within one discipline and approaches 1 as work becomes increasingly interdisciplinary.
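For concreteness, here is a toy sketch of the index itself (my own illustration, not the project’s production code): the Rao-Stirling diversity of a paper is the sum over pairs of subject categories of p_i * p_j * d_ij, where p_i is the share of the paper’s references in category i and d_ij is the distance between categories i and j.

```python
import numpy as np

def rao_stirling(proportions, distances):
    """Rao-Stirling diversity: sum over category pairs of p_i * p_j * d_ij.

    proportions : share of a paper's cited references in each Web of
                  Science subject category (sums to 1).
    distances   : square matrix of pairwise category distances, e.g.
                  1 minus the Rafols/Porter/Leydesdorff similarities,
                  with zeros on the diagonal.
    """
    p = np.asarray(proportions, dtype=float)
    d = np.asarray(distances, dtype=float)
    return float(p @ d @ p)   # diagonal terms vanish because d_ii = 0

# Toy paper: half its references in one category, the rest split between
# a closely related category and a distant one.
p = np.array([0.5, 0.3, 0.2])
d = np.array([[0.0, 0.2, 0.9],
              [0.2, 0.0, 0.8],
              [0.9, 0.8, 0.0]])
print(rao_stirling(p, d))   # 0.336
```

A paper citing only one category scores 0; spreading references across mutually distant categories pushes the score toward 1.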

The tedious part is going from a list of grants, to a roster of scientists, to a corpus of publications. Then comes the fun part. Using the Metaknowledge package by Reid McIlroy-Young, John McLevey, and Jillian Anderson (an absolutely fantastic tool for bibliographic work), a directory mapping journal abbreviations to Web of Science subject categories provided by Thomson Reuters customer service, and a similarity matrix between all Web of Science subject categories by Ismael Rafols, Alan Porter, and Loet Leydesdorff, we’re able to process Web of Science records into a table that looks like this.

Data table
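The actual processing lives in the project code, but building that table looks roughly like the sketch below. It assumes three helper objects I’m not reproducing here: abbrev_to_category (the journal-abbreviation-to-subject-category directory), category_index (a mapping from category name to matrix position), and distance_matrix (distances derived from the Rafols/Porter/Leydesdorff similarity matrix). The parsing of the cited-reference field is deliberately crude and may not match exactly what metaknowledge exposes.

```python
import metaknowledge as mk
import numpy as np
from collections import Counter

RC = mk.RecordCollection('savedrecs.txt')   # a Web of Science plain-text export

rows = []
for R in RC:
    counts = Counter()
    for ref in (R.get('CR') or []):
        # Cited references look like "AUTHOR, YEAR, JOURNAL ABBREV, ...".
        parts = [part.strip() for part in str(ref).split(',')]
        if len(parts) >= 3:
            category = abbrev_to_category.get(parts[2])
            if category is not None:
                counts[category] += 1
    total = sum(counts.values())
    if total == 0:
        continue                             # no classifiable references
    p = np.zeros(len(category_index))
    for category, n in counts.items():
        p[category_index[category]] = n / total
    rows.append({'year': R.get('PY'),
                 'n_refs': total,
                 'sdi': rao_stirling(p, distance_matrix)})
```

Here rao_stirling is the toy function sketched above; each row then gets joined to the scientist roster so every paper carries a research-group label.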

The first thing to check is how the Stirling Diversity Index changes over time. A boxplot shows that for each of the five groups, the Stirling Diversity Index increased.

SDI pre- and post- grant award
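If you are reproducing the figure, a seaborn boxplot gets you most of the way there. The column names below ('group', 'post_award', 'sdi') are my placeholders, not necessarily what is in the repository:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# One row per paper; 'post_award' flags papers published after the grant start.
ax = sns.boxplot(data=df, x='group', y='sdi', hue='post_award')
ax.set_xlabel('IGERT group')
ax.set_ylabel('Stirling Diversity Index')
plt.tight_layout()
plt.show()
```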

A more detailed look at the SDI, with mean and interquartile range plotted, shows the patterns for each group year by year, with the start of each IGERT grant marked by a vertical line. The steep increase of Group B five years after the start of the grant, and the way that SDI stopped declining and actually increased when Group D got their grant are particularly interesting.

SDI by year. Grant award marked with vertical line.

I ran t-tests on the change in each of the groups, and the results are clear. The increase is statistically significant, and averages 0.38 standard deviations. If we imagine that SDI is analogous to IQ, where each standard deviation is 15 points, the increase in scores is equivalent to a gain of 5.7 points. That’s pretty good, given how hard it is to make major structural changes in science.

Group : p-value : increase (std. dev.)
Group A: 0.00833 : +0.33257
Group B: 0.00000 : +0.72993
Group C: 0.00879 : +0.15155
Group D: 0.00250 : +0.31565
Group E: 0.00030 : +0.38804
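I won’t reproduce the exact test here, but the shape of the analysis is straightforward: for each group, compare the SDI of papers published before the award with those published after. A sketch using Welch’s two-sample t-test (an assumption on my part, as is the effect-size calculation), with the same placeholder column names as above:

```python
from scipy import stats

for group, papers in df.groupby('group'):
    pre = papers.loc[~papers['post_award'], 'sdi']
    post = papers.loc[papers['post_award'], 'sdi']
    t, pval = stats.ttest_ind(post, pre, equal_var=False)   # Welch's t-test
    effect = (post.mean() - pre.mean()) / papers['sdi'].std()
    print(f'{group}: p = {pval:.5f}, increase = {effect:+.5f} std. dev.')
```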

The IGERT/NRT program has met its primary stated policy objective. Going forward, I think it’d be interesting to look at the program as a whole, as opposed to just a sample. And we need to look in further detail at the groups that met these goals exceptionally well, in order to understand what they did and how those practices can be translated to other research groups.

Classifying Misfiled Papers by Machine Learning

Everything above is descriptive statistics and data visualization, which are important, but something that I’ve done before. What’s new to me is machine learning. Machine learning differs from traditional statistics in that, while in theory I could have calculated everything above by hand (with a lot of tedium), machine learning is about computer programs that optimize the parameters of an algorithm according to a performance measure. There’s not a firm line between machine learning and statistics, and indeed most machine learning curricula start with linear regression, drawing best fit lines through points.

I had some trouble coming up with a good machine learning task for this dataset. The major types of machine learning operations are regression, predicting a continuous value from a set of inputs; classification, assigning an input to one or more categories; and clustering, finding patterns in data that has no prior structure. I toyed with coming up with a regression model to predict which scientific papers would have the greatest impact, but discarded it. Important ideas are important because of their content, not because of their citations.

A flaw in my dataset presented the solution. For some reason, 309 of the papers in my sample aren’t associated with any research group. All five of my IGERTs work in different areas, so maybe I can use patterns of citations to match my misfiled papers with the right research group. I ran three common classification algorithms, Logistic Regression, a Support Vector Machine, and a Decision Tree, and found that I could associate 70% of papers in the test split with the correct group. The Support Vector Machine had slightly higher accuracy, but performance was similar across the board. Looking at the confusion matrices for each algorithm, we can see that most of the errors involve sorting papers from Groups B and E into Group C. This makes sense, since Group C is the most prolific group, and all three groups work on nanotechnology; there is overlap in papers about basic topics like the properties of carbon nanotubes.

Confusion Matrices for three classification algorithms
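The modeling itself is a few lines of scikit-learn. The sketch below uses my own placeholder names (X for the citation-pattern features, y for the group labels) and guessed hyperparameters, so treat it as the shape of the analysis rather than the exact code:

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

models = {
    'Logistic Regression': make_pipeline(StandardScaler(),
                                         LogisticRegression(max_iter=1000)),
    'SVM': make_pipeline(StandardScaler(), SVC(kernel='rbf')),
    'Decision Tree': DecisionTreeClassifier(max_depth=5, random_state=42),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(name, accuracy_score(y_test, y_pred))
    print(confusion_matrix(y_test, y_pred))
```

Assigning the misfiled papers is then just a call to model.predict() on their feature vectors.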

Looking at the workings of the decision tree can be useful, because it mimics a common human way to sort things: pick an attribute where there is a clear difference between Category A and the rest of the data, split, and repeat until each little clump is all of one kind. To read the decision tree, start at the top, check the condition, go left if it’s True, and go right if it’s False. Geography shows up immediately as a way to distinguish Group D, medical sciences as a way to separate Group E, and materials science tends to lead to Group C.

Decision Tree
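If you want the splits as text rather than a picture, scikit-learn can print the fitted rules directly. This assumes X is a DataFrame whose columns are the subject-category features, and it reuses the models dictionary from the sketch above:

```python
from sklearn.tree import export_text

# Print the top few levels of the fitted tree as nested if/else rules.
tree = models['Decision Tree']
print(export_text(tree, feature_names=list(X.columns), max_depth=3))
```

Each printed line is one "is the share of citations in this category above or below a threshold" split, which is what the figure shows graphically.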

Knowing that, I can assign the missing papers to their most likely groups with 70% accuracy. As validation, I looked at the location field for the misfiled papers, and in the overwhelming majority of cases, at least one of the authors is at the university that my classifiers assigned the misfiled paper to.

Thanks to my instructor, Nathan Grossman, who asked questions that led to a significantly more interesting project; to my cohort in the Metis Live Online class, for listening to me explain versions of this work four times; and to Sebastian Raschka, for his great book Python Machine Learning.
