Estimating biomedical data
Evaluating the impact of a scientific study is a difficult and controversial task. Recognition of the value of a biomedical study is widely measured by traditional bibliographic metrics such as the number of citations of the paper or the impact factor of the journal.
However, the critical success criteria for a research study would more appropriately lie in the biological data it actually produces: both the quality of those data and how well the datasets can be reused to validate (or reject!) hypotheses and support new research projects.
Although biological data can be deposited in dedicated repositories such as the GEO database, ImmPort, ENA, etc., most data are primarily disseminated within the text, figures and tables of articles. How, then, can we find and measure the biomedical data scattered across scientific publications?
To address this issue, Gabriel Rosenfeld and Dawei Lin developed a novel text-mining strategy that identifies articles producing biological data. They published their method, “Estimating the scale of biomedical data generation using text mining”, this month on bioRxiv.
Text mining analysis of biomedical research articles
Using the Global Vectors for Word Representation (GloVe) algorithm, they identified term usage signatures for 5 types of biomedical data: flow cytometry, immunoassays, genomic microarray, microscopy, and high-throughput sequencing.
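To give a feel for what a term usage signature could look like, here is a minimal sketch, assuming word vectors are already available. It ranks terms by cosine similarity to a seed term, which is a common way to explore GloVe-style embeddings but is not necessarily the procedure the authors used. The vectors, the vocabulary and the function name `term_signature` are all hypothetical and invented for illustration.

```python
import math

# Hypothetical 3-dimensional embeddings, invented for illustration only.
# Real GloVe vectors are learned from corpus co-occurrence counts and
# usually have tens to hundreds of dimensions.
TOY_VECTORS = {
    "cytometry": [0.9, 0.1, 0.0],
    "facs": [0.8, 0.2, 0.1],
    "gating": [0.7, 0.3, 0.0],
    "microarray": [0.1, 0.9, 0.1],
    "hybridization": [0.2, 0.8, 0.2],
    "patient": [0.3, 0.3, 0.9],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def term_signature(seed, vectors, k=2):
    """Return the k terms whose vectors are closest to the seed term."""
    scored = [
        (term, cosine(vectors[seed], vec))
        for term, vec in vectors.items()
        if term != seed
    ]
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [term for term, _ in scored[:k]]


signature = term_signature("cytometry", TOY_VECTORS, k=2)
print(signature)
```

With these toy vectors, the terms closest to "cytometry" are the two other flow-cytometry-like words, while the microarray and clinical terms score much lower.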
They then analyzed the free text of 129,918 PLOS articles published between 2013 and 2016. Nearly half of them (59,543) generated one or more of the 5 data types tested, producing 81,407 datasets in total.
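As a quick sanity check on these figures, the short calculation below derives the share of data-producing articles and the average number of datasets per such article. The three input numbers come from the study as quoted above; the variable names are just illustrative.

```python
# Figures quoted from the study.
articles_analyzed = 129_918
articles_with_data = 59_543
datasets_found = 81_407

# Share of analyzed articles that generated at least one of the 5 data types.
share_with_data = articles_with_data / articles_analyzed

# Average number of datasets among the data-producing articles.
datasets_per_article = datasets_found / articles_with_data

print(f"{share_with_data:.1%} of articles produced data")
print(f"{datasets_per_article:.2f} datasets per data-producing article")
```

This works out to roughly 45.8% of articles and about 1.37 datasets per data-producing article, so a good number of papers generate more than one of the 5 data types.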
This text-mining method was tested on manually annotated articles, and provided a valuable balance of precision and recall. The obvious, and exciting, next step is to apply this approach to evaluate the amount and types of data generated within the entire PubMed repository of articles.
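For readers less familiar with these two measures, here is a minimal sketch of how precision and recall can be computed against a manually annotated set. The article identifiers and the function name `precision_recall` are hypothetical, and the numbers are not results from the study.

```python
def precision_recall(predicted, annotated):
    """Compare predicted article IDs with manually annotated ones.

    precision: fraction of predicted articles that are truly data-producing.
    recall: fraction of truly data-producing articles that were predicted.
    """
    true_positives = len(predicted & annotated)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(annotated) if annotated else 0.0
    return precision, recall


# Hypothetical example: articles the text-mining step flagged as producing
# flow cytometry data versus those a human annotator flagged.
predicted = {"a1", "a2", "a3", "a4"}
annotated = {"a2", "a3", "a4", "a5", "a6"}

precision, recall = precision_recall(predicted, annotated)
print(precision, recall)
```

In this toy case, 3 of the 4 flagged articles are correct (precision 0.75), and 3 of the 5 annotated articles were found (recall 0.6). Tuning a classifier usually trades one measure against the other, hence the emphasis on balance.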
A step beyond data dissemination
Measuring the exponentially growing amount and diversity of datasets is a key way to gauge the quality of a biomedical study. However, in today’s era of bioinformatics, fully exploiting the data requires going a step beyond the publication and dissemination of datasets and tools, towards improving reproducibility and transparency (data provenance, collection, transformation, computational analysis methods, etc.).
Open-access and community-driven projects, such as the online bioinformatics tools platform OMICtools, provide access not only to a large number of repositories to locate valuable datasets, but also to the best software tools for re-analyzing and exploiting the full potential of these datasets.
In a virtuous circle of discovery, previously generated datasets could be repurposed for new data production, interactive visualization, machine learning and artificial intelligence enhancement, allowing us to answer new biomedical questions.