New data reveals the hidden impact of open source in science
Understanding software used by scientists by mining the biomedical literature
At the Chan Zuckerberg Initiative (CZI), we believe open source tools are critical to accelerating scientific discovery. In an effort to improve our understanding of the impact of software (and scientific open source in particular) in biomedical science, we’re releasing the CZ Software Mentions Dataset — a dataset entirely composed of software mentions mined from the scientific literature. The dataset, one of the largest available to date, gives researchers access to 67 million software mentions extracted from two corpora: 3.8 million papers from the open access biomedical literature collected by PubMed Central, and 16 million full-text papers made available to CZI by publishers.
Computational tools and open source software have become an essential part of the toolkit of every scientist across a vast range of disciplines. Some of the most important scientific breakthroughs of the last decade, such as the solution for the protein structure prediction problem, were made possible because of the availability of rich and comprehensive data sources and powerful software tools for data representation and analysis, numerical computation, and modeling.
Scholarly papers typically receive recognition through citations, which help their authors access new funding streams and growth opportunities. Quantifying the impact of open source software on science, by contrast, has continued to be a challenge. Software is generally not formally cited in scientific publications. At best, the software that scientists use in a study is mentioned in the methods section of a paper, or it may be identified through the dependencies of research code deposited by the authors. As a result, its impact is often hard to demonstrate.
Over the last half-century, the biomedical science community has built sophisticated ways of measuring research impact through citation-based metrics and citation graphs. But when it comes to impact indicators for other types of outputs — such as datasets, methods, and code — which provide the foundation of much of today’s scientific work, there’s little data that creators and maintainers can use to measure their impact on science. Even more importantly, there aren’t many impact indicators that funders or institutions can turn to in order to evaluate potential investments in research software. The lack of broadly used citation-based indicators for software also means that it’s very hard to measure:
- which software tools are most frequently used across scientific disciplines
- whether emerging new tools are replacing legacy ones
- what the prevalent programming language is in any discipline and its subfields
As a part of our contribution to addressing these issues, today we’re releasing one of the largest datasets available to date of software mentions:
- Preprint: A large dataset of software mentions in the biomedical literature, arXiv: 2209.00693 (CC-BY)
- Data: CZ Software Mentions Dataset, available on Dryad (CC0)
- Code: All the code used for extraction, disambiguation, and linking, as well as instructions on how to reproduce the results and some starter code. Additionally, an archival copy of the code is available on Zenodo (MIT license).
We aim to make the data and code for this project broadly available for reuse in order to enable others to build upon, vet, and extend our results.
Creating this dataset involved a multi-step process led by CZI’s Research Science team:
- Extraction: The first step is to extract text mentions of software from the full-text corpus. We used a state-of-the-art NER (Named Entity Recognition) model obtained by fine-tuning SciBERT on the Softcite dataset.
- Disambiguation: Software can be mentioned through its full name (Statistical Package for Social Sciences) or its acronym (SPSS). There can be synonyms (sklearn and scikit-learn; Image J and ImageJ; GraphPad Prism, GraphPad, and Prism). Moreover, typos can be introduced by the authors (scikits-learn) or by parsing the XML of the papers. In this step, we mapped different textual mentions to the same software entity using clustering techniques such as DBSCAN (Density-Based Spatial Clustering of Applications with Noise).
- Linking: Unlike scholarly paper references that typically link to their target via a DOI, software mentions extracted from the full text do not come with metadata such as source code URL, description, and/or license information, which ultimately impacts their discoverability. In this step, we used string matches to link software entities identified in the previous step to five canonical software repositories that contain these valuable metadata: PyPI, CRAN, Bioconductor, SciCrunch, and GitHub.
- Curation: We hand-checked the 10,000 most frequently occurring mentions to remove entries incorrectly classified as software mentions, for example non-computational methods, names of operating systems, and names of initiatives related to but distinct from software.
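To make the disambiguation and linking steps more concrete, here is a minimal sketch in Python. It is not the released CZI code. It assumes a simple character-level similarity (from the standard library's `difflib`) as the distance between mentions, a small hand-written DBSCAN, and a hypothetical one-entry repository index; the function names, the `eps` and `min_samples` values, and the sample mentions are all illustrative choices.

```python
from difflib import SequenceMatcher


def normalize(mention):
    """Lowercase a mention and drop spaces, hyphens, and underscores."""
    return "".join(ch for ch in mention.lower() if ch not in " -_")


def distance(a, b):
    """Return 1 minus the similarity ratio of the normalized strings."""
    return 1.0 - SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def dbscan(items, eps, min_samples):
    """Plain DBSCAN over strings; returns one label per item (-1 is noise)."""
    labels = [None] * len(items)
    cluster = -1

    def neighbors(i):
        # Every item within eps of item i, including i itself.
        return [j for j in range(len(items))
                if distance(items[i], items[j]) <= eps]

    for i in range(len(items)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = -1  # noise for now; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: joins, is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_samples:
                queue.extend(reachable)  # core point: keep expanding
    return labels


def link(canonical_name, index):
    """Look up a canonical name in a repository index by normalized string."""
    return index.get(normalize(canonical_name))


# Illustrative mentions, taken from the examples in the text.
mentions = ["scikit-learn", "scikit learn", "scikits-learn",
            "ImageJ", "Image J", "SPSS"]
labels = dbscan(mentions, eps=0.2, min_samples=2)

# Hypothetical index mapping normalized names to repository URLs.
repository_index = {"scikitlearn": "https://pypi.org/project/scikit-learn/"}

for label, mention in zip(labels, mentions):
    print(label, mention, link(mention, repository_index))
```

A purely string-based distance like this one groups spelling variants and typos, but it would not connect an acronym such as SPSS to its full name, or sklearn to scikit-learn. Those cases need additional signals beyond what this sketch uses.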
In the diagram below, we use an example (scikit-learn) to illustrate our data model. In addition to connecting the software entity with its canonical name, name variations, and links to software repositories, we also provide a link to the paper for each mention, the narrow context (the sentence from which it was extracted), and the wide context (the name of the section where it appeared). The context is meant to help researchers understand what software has been mentioned and why it was mentioned in the paper. We believe this is a step toward creating a more comprehensive “software citation graph” that will allow us to answer the questions outlined above.
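One way to picture this data model is as two small record types. The sketch below is an assumption about shape only: the class and field names are invented for illustration and are not the column names used in the released files, and the paper identifier and sentence are placeholders.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Mention:
    paper_id: str        # link back to the paper containing the mention
    text: str            # the mention as it appeared in the paper
    narrow_context: str  # the sentence the mention was extracted from
    wide_context: str    # the name of the section where it appeared


@dataclass
class SoftwareEntity:
    canonical_name: str
    name_variations: List[str] = field(default_factory=list)
    repository_links: Dict[str, str] = field(default_factory=dict)
    mentions: List[Mention] = field(default_factory=list)


# Placeholder record for the scikit-learn example (values are made up).
entity = SoftwareEntity(
    canonical_name="scikit-learn",
    name_variations=["sklearn", "scikit learn", "scikits-learn"],
    repository_links={"PyPI": "https://pypi.org/project/scikit-learn/"},
    mentions=[
        Mention(
            paper_id="PLACEHOLDER-PAPER-ID",
            text="sklearn",
            narrow_context="Models were trained with sklearn.",
            wide_context="Methods",
        )
    ],
)

print(entity.canonical_name, len(entity.mentions))
```

Keeping the contexts on each mention, rather than on the entity, preserves the information about why a given paper used the software.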
At CZI, we are already using the Software Mentions Dataset to get insights into the software landscape in science. For example, CZI supports the development of computational tools for single-cell biology. We were able to leverage our dataset to understand how the usage of such tools evolves over time:
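A trend of this kind can be computed by counting, for each tool and year, the number of distinct papers that mention it. The following is a minimal sketch under stated assumptions: the record layout `(paper_id, year, software_name)` is hypothetical, the tool names are examples chosen for illustration, and the records are toy values rather than figures from the dataset.

```python
from collections import Counter, defaultdict


def papers_per_year(records, tools):
    """Count distinct papers per (tool, year).

    records: iterable of (paper_id, year, software_name) tuples.
    tools: set of canonical software names to track.
    """
    seen = set()
    counts = defaultdict(Counter)
    for paper_id, year, software in records:
        if software not in tools:
            continue
        key = (paper_id, software)
        if key in seen:
            continue  # count each paper once per tool, however often it is mentioned
        seen.add(key)
        counts[software][year] += 1
    return counts


# Toy records for illustration only; not taken from the dataset.
records = [
    ("paper-1", 2019, "Seurat"),
    ("paper-1", 2019, "Seurat"),  # repeated mention in the same paper
    ("paper-2", 2020, "Seurat"),
    ("paper-2", 2020, "Scanpy"),
    ("paper-3", 2020, "Scanpy"),
    ("paper-3", 2020, "ImageJ"),  # not in the tracked set
]

counts = papers_per_year(records, tools={"Seurat", "Scanpy"})
for tool in sorted(counts):
    print(tool, sorted(counts[tool].items()))
```

Counting papers rather than raw mentions avoids overweighting a paper that names the same tool many times.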
We spoke with a number of researchers and practitioners in the field about the potential impact and reuse of this dataset.
James Howison of the University of Texas at Austin shared: “I am very excited to see CZI’s detailed, careful, expansive, and transparently published work, particularly because they built on the efforts of 30 students from the University of Texas at Austin and Huston-Tillotson University who manually annotated the SoftCite dataset. We’re hopeful that the CZ Software Mentions Dataset will serve as a resource for those making scientific contributions through software to demonstrate their impact, as well as study how software affects science.”
Daniel Mietchen, research scholar at the Ronin Institute and co-founder of Scholia, said: “Software is an integral part of many research workflows, yet navigating the research ecosystem in a software-centric way is difficult. This new dataset and its methodology can be integrated with other annotations of the literature to develop novel insights: identify software that is essential for a given experimental paradigm, help research funders evaluate how to contribute to the tools their funded research depends on, guide newcomers to a research area, or get an overview of how programming languages and software libraries are used within or across research fields.”
Melissa Harrison, Group Team Leader of Literature Services at EMBL-EBI, said: “At Europe PMC we’re working on ways to surface software citations from the literature in a similar way to our mining efforts for data citations and this work is an excellent way for us to start thinking about this. We look forward to collaborating with CZI to investigate our full corpus using this code, which includes 33,000 full-text preprints.”
Read the preprint to learn more about this work, and please reach out with questions or feedback if you’re reusing this dataset.
This work has been informed by, and partly builds on, previous efforts, including SoftwareKG, a knowledge graph that contains information about software mentions from more than 50K scientific articles from the social sciences; SoMeSci, a curated collection of 3.7K software mentions in a collection of 1.3K PubMed Central articles; Softcite, a dataset of manual annotations of 5K academic PDFs in biomedicine and economics; and a release of 318K software mentions based on the CORD-19 dataset.
We thank James L. Howison (University of Texas), Daniel Mietchen (Ronin Institute), and other reviewers for providing helpful comments during the preparation of the preprint. We also thank our team of biocurators: Michaela Torkar, Alison Jee, Celina Liu, Parasvi Patel, and Ronald Wu for their work curating the final dataset. We would like to acknowledge the previous efforts of Frank Krüger from the University of Rostock to construct a knowledge graph of software mentions.
Ana-Maria Istrate, Senior Research Scientist, Chan Zuckerberg Initiative
Ana-Maria develops machine learning solutions to support teams across CZI Science. Her research lies at the intersection of Natural Language Processing, Knowledge Graphs, and machine learning applications to the scientific domain. She has worked on recommendations, ranking, and text mining algorithms for biomedical journal articles. Ana-Maria is passionate about using machine learning to accelerate science. Ana-Maria graduated from Stanford University with a Bachelor’s Degree in Applied Math and a Master’s Degree in Computer Science.
Boris Veytsman, Research Scientist, Chan Zuckerberg Initiative
Boris is a theoretical physicist by training. He worked in many areas across science and technology — from polymers to liquid crystals to air traffic safety to communications to space exploration to biophysics to evolutionary ecology. He currently researches the science of science, open software, and the ways researchers approach their studies.
Donghui Li, Senior Technical Program Manager, Chan Zuckerberg Initiative
Donghui works at the intersection of science and technology. He is passionate about using technology to help scientists make better use of data for research discovery. He has extensive experience in scientific data management and is the co-founder of Phoenix Bioinformatics, a Bay Area nonprofit aimed at sustaining the scientific data infrastructure.
Dario Taraborelli, Science Program Officer, Chan Zuckerberg Initiative
Dario is a social computing researcher and an open knowledge advocate. On CZI’s Open Science team, his goal is to build programs and technology to support open, reproducible, and accessible research. Prior to joining CZI, he served as the Director, Head of Research at the Wikimedia Foundation, the non-profit that operates Wikipedia and its sister projects.
Ivana Williams, Science Program Manager, Chan Zuckerberg Initiative
Ivana Williams is a Science Program Manager on the Single-Cell Biology team at CZI, responsible for computational biology strategy. With an extensive background and experience in mathematics, statistics, data science, and machine learning, she is passionate about building bridges between machine learning, data science, and CZI’s Single-Cell Biology communities. Ivana’s previous research focused on natural language processing and implementing state-of-the-art machine learning and data science solutions to accelerate scientific discovery and unlock insights from scientific publications.