Published in CZI Technology

New data reveals the hidden impact of open source in science

Understanding software used by scientists by mining the biomedical literature

  • which software tools are most frequently used across scientific disciplines
  • whether emerging new tools are replacing legacy ones
  • what the prevalent programming language is in any discipline and its subfields
Scanpy, a popular Python framework for analyzing and visualizing single-cell data, is one of many scientific open source projects supported by CZI. We identified over 300 unique mentions of Scanpy in the biomedical literature we mined.
  • Preprint: A large dataset of software mentions in the biomedical literature, arXiv: 2209.00693 (CC-BY)
  • Data: CZ Software Mentions Dataset, available on Dryad (CC0)
  • Code: All the code used for extraction, disambiguation, and linking, along with instructions for reproducing the results and some starter code. An archival copy of the code is also available on Zenodo (MIT license).
  1. Extraction: The first step is to extract text mentions of software from the full-text corpus. We used a state-of-the-art NER (Named Entity Recognition) model obtained by fine-tuning SciBERT on the Softcite dataset.
  2. Disambiguation: Software can be mentioned by its full name (Statistical Package for Social Sciences) or by its acronym (SPSS). A single tool can also go by several synonyms (sklearn and scikit-learn; Image J and ImageJ; GraphPad Prism, GraphPad, and Prism). Moreover, typos can be introduced by the authors (scikits-learn) or by parsing the XML of the papers. In this step, we mapped different textual mentions to the same software entity using clustering techniques such as DBSCAN (Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise).
  3. Linking: Unlike scholarly paper references, which typically link to their target via a DOI, software mentions extracted from the full text do not come with metadata such as a source code URL, description, or license information, and this limits their discoverability. In this step, we used string matches to link the software entities identified in the previous step to five canonical software repositories that contain these valuable metadata: PyPI, CRAN, Bioconductor, SciCrunch, and GitHub.
  4. Curation: We hand-checked the 10,000 most frequently occurring mentions to remove inaccurate entries, such as non-computational methods, names of operating systems, and names of initiatives related to but distinct from software, that had been incorrectly classified as software mentions.
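To make steps 2 and 3 more concrete, here is a minimal sketch in plain Python. It is not the code from the released pipeline. The normalization rule, the string distance (built on the standard library's difflib), the eps and min_samples values, and the one-entry package_index are all assumptions chosen for illustration, and the small dbscan function is a textbook version of the algorithm rather than the authors' implementation.

```python
from collections import Counter
from difflib import SequenceMatcher


def normalize(mention):
    """Lowercase and keep only letters and digits (an illustrative rule)."""
    return "".join(ch for ch in mention.lower() if ch.isalnum())


def distance(a, b):
    """1 minus the similarity ratio of the normalized strings, in [0, 1]."""
    # Sort the pair so the result does not depend on argument order.
    first, second = sorted((normalize(a), normalize(b)))
    return 1.0 - SequenceMatcher(None, first, second).ratio()


def dbscan(items, eps, min_samples):
    """Textbook DBSCAN over `distance`; returns one label per item (-1 = noise)."""
    n = len(items)
    neighbors = [
        [j for j in range(n) if distance(items[i], items[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_samples:
            labels[i] = -1  # provisionally noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: joins the cluster, not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_samples:
                queue.extend(neighbors[j])  # core point: keep growing the cluster
    return labels


def link(cluster_mentions, index):
    """Pick the most common normalized form and look it up by exact string match."""
    canonical = Counter(normalize(m) for m in cluster_mentions).most_common(1)[0][0]
    return canonical, index.get(canonical)


# Mention variants taken from the examples in step 2.
mentions = [
    "scikit-learn", "sklearn", "scikits-learn", "Scikit learn",
    "ImageJ", "Image J", "imagej", "SPSS",
]
labels = dbscan(mentions, eps=0.3, min_samples=2)  # assumed parameter values

clusters = {}
for mention, label in zip(mentions, labels):
    clusters.setdefault(label, []).append(mention)

# Hypothetical one-entry index standing in for a real repository listing.
package_index = {"scikitlearn": "PyPI"}
for label, members in clusters.items():
    if label != -1:
        print(members, "->", link(members, package_index))
```

With these settings, a mention that has no close neighbors (SPSS in this toy list) is labeled as noise. A real pipeline would more likely keep such mentions as single-member entities than discard them.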
Penetration of single-cell computational tools in different subfields of clinical medicine (the fraction of papers using these techniques among all papers in the subfield) based on the Software Mentions Dataset.
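The penetration metric in the caption above is a simple ratio, and a short sketch shows how it could be computed from per-paper software mentions. The record layout, the subfield names, and the toy records below are made up for illustration and do not reflect the schema or contents of the released dataset.

```python
from collections import defaultdict


def penetration(papers, tools):
    """Fraction of papers per subfield that mention at least one tool in `tools`.

    `papers` is an iterable of (paper_id, subfield, set_of_software_names).
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for _paper_id, subfield, software in papers:
        totals[subfield] += 1
        if software & tools:  # the paper mentions at least one tool of interest
            hits[subfield] += 1
    return {subfield: hits[subfield] / totals[subfield] for subfield in totals}


# Toy records (hypothetical, not taken from the dataset).
papers = [
    ("p1", "oncology", {"Scanpy", "ImageJ"}),
    ("p2", "oncology", {"SPSS"}),
    ("p3", "immunology", {"Scanpy"}),
    ("p4", "cardiology", {"GraphPad Prism"}),
]
single_cell_tools = {"Scanpy"}  # assumed tool list for the example

result = penetration(papers, single_cell_tools)
print(result)
```

Each paper counts at most once toward its subfield, no matter how many of the listed tools it mentions, which matches the caption's definition of a fraction of papers rather than a count of mentions.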

Read the preprint to learn more about this work, and please reach out with questions or feedback if you’re reusing this dataset.




Chan Zuckerberg Initiative Science

Supporting the science and technology that will make it possible to cure, prevent, or manage all diseases by the end of the century.