Do Scientists Reuse Open Data?
Open science can mean many different things to different people. The way I see it, open-science practices and policies have two fundamental, intertwined goals: to increase the credibility and the efficiency of science. To reach these goals, open-science advocates and practitioners promote the release of science products and means of production (such as manuscripts, data, code, processes, and tools) under open access.
As Titus Brown pointed out in another post of this series, the open-science community — at least in the biomedical sciences — has made extraordinary advances in the last few years. These include the widespread adoption of notebook technologies, the release of preprints, and the diffusion of open-source tools for bioinformatics, to mention just a few.
Similarly, today several groups promote and adopt norms and practices for data sharing and curation in biomedical research. As we know, depositing curated research data in open repositories contributes to making data analyses more reproducible, and, as a result, science more trustworthy. Without data, there is no evidence; without evidence, there is no science; without science, there are no facts. There is no doubt that ensuring reproducibility is itself a sufficient argument for data sharing.
There is one aspect of data-sharing practices that remains widely misunderstood, which is how scientists reuse open data to produce novel knowledge (not to validate others’ analyses). This is the thing: expectations for how much, how, and by whom biomedical data will be re-purposed once released under open access are often misplaced, to say the least.
Over the last four years, I have spent most of my working hours interviewing scientists about their data reuse practices. I visited more than ten labs across seven cities, and I asked scientists to show me how, when, and why they reuse research data hosted on open databases and repositories. I talked with both senior and junior faculty members, graduate students and technicians, working in many subspecialties. They included human geneticists, computational biologists, developmental and evolutionary biologists, surgeons, and even clinicians. What a fun job I have, I know.
This is what I found
Let me get this straight: scientists are certainly reusing open data to produce novel knowledge. Practices of data reuse, however, vary between groups, workflows, and types of data. Overall, science data are reused in many ways and at different speeds and rates.
What do scientists reuse open data for?
- Scientists commonly reuse open data for control, comparison, calibration, or (more rarely) to conduct meta-analyses and train or test algorithms.
- Setting aside a few notable exceptions, scientists rarely reuse open datasets to ask novel research questions (i.e., for knowledge discovery). At least in biomedicine, researchers seem to still prefer to use data that they personally collect to conduct novel analyses.
Which kinds of open data are being reused?
- Typically, among all the datasets hosted in an open repository, a few selected datasets become very popular over time. Researchers tend to consult these well-known datasets almost daily. However, the majority of the datasets tend to be reused only occasionally. Data reuse practices seem to mirror citation patterns for scientific publications.
- Data curation is necessary but not sufficient for reuse. Releasing curated, high-quality data does not necessarily enable reuse. Scientists reuse data that they find useful and instrumental to their own research agenda and workflows.
- Data generated by a single lab in a peripheral field are the hardest to reuse: the epistemic costs of learning about the data and the science behind it are often too high — specialized knowledge takes time to be internalized and cannot be easily formalized in metadata and ontologies.
- In contrast, the most widely reused datasets are those intentionally generated for a specific use, with a specific research audience in mind (e.g., TCGA cancer data).
How do scientists evaluate open data for reuse?
- As researchers, we easily trust open data that have been reused before, over and over again, by our colleagues. Newly released open data with no record of reuse will need time to conquer scientists’ hearts. The adoption curve could take months or even years.
- Plus, researchers simply tend to reuse open data collected by people and institutions that they like. Reputation, trust, and pre-existing networks impact reuse as much as curation practices do.
What is truly needed to enable reuse of open data?
- Collaborations are the holy grail of reuse. For all the reasons mentioned above, the most successful cases of reuse originated from multi-lab collaborations that involved both data creators and new users. Often these collaborations resulted in co-authorship of one or multiple articles.
What to expect from data reuse practices
Briefly, these are my recommendations for those of you who are engaging in open data/data sharing efforts for the purpose of reuse:
- First, no matter what data you collect, keep in mind that reuse is only one reason for data sharing. Data should be released for transparency as much as for reuse.
- Give up on the idea that all the data you are collecting, curating, and releasing will be widely reused. Some will, some will not, and some will but in unexpected ways.
- If you truly want to maximize reuse, first assess potential for reuse, then start data collection. Open datasets can be reused in many ways, by different sets of users. What can your data be reused for and by whom?
- Hire or consult with data curators who understand the curation needs of your potential users, and (equally important) their science workflows, agendas, and interests.
- Do not try to curate the data “for the entire world.” First, focus on the needs of your immediate users.
- Facilitate the formation of a community of practice around your data. Once you have identified potential users, bring them together by promoting community norms, encouraging collaboration, and adopting ad hoc curation practices. But remember that communities of practice are not built out of the blue. Potential users should share a pre-existing interest in a kind of data, or in a specific method, sub-discipline, process, etc.
- Once you have identified which datasets might be reused for which goals, you can assign different levels of curation and access, accordingly.
- Encourage collaboration (and co-authorship) between data creators, data curators, and data re-users.
To sum up, open data support the credibility and efficiency of science by promoting transparent research practices and, potentially, wide reuse of data. However, science drives reuse of open data, not the other way around. Open data can be shared, curated, and reused in many ways. In order to maximize reuse, the “trick of the trade” is to analyze and define the “science needs” of your potential re-users.
It is not a given what kinds of data people might want or need. You can only estimate the potential for data reuse, and, to do that, you need a thorough analysis of not only where the demand for data is today, but also of where it is heading.
About: Irene Pasquetto is a postdoctoral fellow at the Shorenstein Center on Media, Politics, and Public Policy, at the Harvard Kennedy School.