Data Marketplace for Scientists: A New Hope
Blockchain can serve as a solution to the problem of data hoarding in science
Decentralized ledger technology promises to make data open and immutable. When explaining what “blockchain” is to colleagues outside the tech industry, we sometimes say it is like putting a ledger up in the sky: everybody can see the content, and nobody can tamper with it unilaterally.
Paradoxically, by putting data out in the open via this promising new technology, we can also preserve data privacy. This is because we can encrypt the data before putting them on an open blockchain. By giving the keys only to the people we choose, and by recording the history of key usage (on the blockchain too, no less), we may eventually have the best of both worlds: data that are at once open and transparent, and yet go only where they are supposed to go.
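To make the idea of a tamper-evident access history concrete, here is a minimal sketch in Python. It is not how any particular blockchain works: it runs in a single process with no network or consensus, and the names (AccessLog, record, verify) are our own hypothetical choices. It only shows the core trick of chaining each log entry to the hash of the previous one, so that quietly rewriting an old entry breaks the chain.

```python
import hashlib
import json


def entry_hash(body: dict) -> str:
    """Hash an entry's contents in a deterministic (sorted-key) form."""
    payload = json.dumps(body, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


class AccessLog:
    """Hypothetical append-only log of who used a key to read which dataset."""

    GENESIS = "0" * 64  # placeholder "previous hash" for the first entry

    def __init__(self):
        self.entries = []

    def record(self, user: str, dataset: str, timestamp: str) -> None:
        # Each new entry commits to the hash of the entry before it.
        prev = self.entries[-1]["hash"] if self.entries else self.GENESIS
        body = {
            "user": user,
            "dataset": dataset,
            "timestamp": timestamp,
            "prev_hash": prev,
        }
        self.entries.append({**body, "hash": entry_hash(body)})

    def verify(self) -> bool:
        # Recompute every hash and check that each link points to its predecessor.
        prev = self.GENESIS
        for entry in self.entries:
            body = {k: v for k, v in entry.items() if k != "hash"}
            if entry["prev_hash"] != prev or entry_hash(body) != entry["hash"]:
                return False
            prev = entry["hash"]
        return True


log = AccessLog()
log.record("dr_lee", "mood_survey_2018", "2018-06-01T09:00:00Z")
log.record("dr_kim", "mood_survey_2018", "2018-06-02T14:30:00Z")
intact = log.verify()            # the untouched log checks out

log.entries[0]["user"] = "someone_else"
tampered_ok = log.verify()       # editing history breaks the chain
```

On a real decentralized ledger, many independent parties hold copies of such a chain, which is what makes unilateral tampering impractical; the sketch above captures only the hashing part.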
In the past months, numerous projects have emerged to tackle the issue of health and medical records. The selling point has mostly been framed from the patient’s perspective: how does one share one’s medical records with just the right people (doctors treating one for a particular disease) and no one else (not potential employers, for example)? How do we keep all the data in one common space, so that we don’t need to worry about moving them around (which inevitably increases the risk of leakage), and yet let multiple people access the data securely, and only when we want them to?
So far, the idea has been promising. But for people like ourselves, who live relatively boring lives, privacy is a relatively minor issue, at least as far as our own medical records are concerned. As scientists, though, we have other reasons to be excited.
We approach the problem from the perspective of (socio-)biomedical scientists. Our jobs are to collect and analyze data, to understand how diseases work, in order to figure out ways to treat them. To follow the example of health records, data are precious ‘goods’ to us. We write proposals to compete for grants to allow ourselves to obtain these goods.
In the past couple of years, we have seen many large-scale projects from governments and major research universities tackling mental health problems such as depression in exactly this way. Numerous phone apps have been written to facilitate the collection of such data.
Promising as they are, these projects face several obstacles for which decentralized ledger technology may provide solutions.
To start, building quality smartphone apps often costs more than academics are prepared for, especially because we don’t necessarily have the connections to shop around wisely. There are usable open-source platforms (e.g., an Android-based framework), but apps that are not professionally built tend to get uninstalled eventually. People get annoyed when these apps drain their phone batteries. The lack of an appealing interface doesn’t help either. Over time, people get bored and tired of using them.
Overall, getting good data is difficult. People may not want to sacrifice too much of their privacy. More importantly, providing data is often work: Are you willing to report on your mood and thoughts in detail by the hour? How frequently are you willing to take a blood test?
Traditionally, we pay subjects some small amount of money to compensate for their time. But increasingly, there is a recognition that we need to pay people more substantively for the trouble they take in providing data of quality. And yet, scientific budgets are ever limited.
This is perhaps why scientists are increasingly emphasizing the need for, and the benefits of, data sharing.
Data are non-rival goods.
Once a piece of data is there, your having it doesn’t directly and necessarily diminish the value of my having it. This is unlike a cake, which cannot be eaten twice. For data, it seems that there is every reason to share; it could only make everybody happier.
Unfortunately, in reality, this is only partly true. Companies hoard data instead of sharing them, despite the big gains sharing would bring to society as a whole. The phenomenon is not limited to industry; it occurs in science as well. After all, science involves competition too. Having taken the trouble to collect the data, researchers, especially junior ones, understandably want to be the first to access them, to generate insights and publish papers before others can. While funding agencies and journals often mandate sharing, compliance can sometimes be less than prompt and enthusiastic.
Blockchain can turn wasteful competition between large-scale science projects into synergy.
If you talk to some economists, they may say this is exactly when a firm can step in to help. Imagine a company collecting data on many scientists’ behalf. By pooling together the resources of numerous major projects from different funding agencies and universities, it can build a single best platform for data collection. It can collect a lot more data, to be shared among the different researchers.
Of course, in some instances, it does already happen to some limited extent (such as in the use of the Amazon Mechanical Turk system by behavioral psychologists). But some scientists are understandably skeptical of private companies making a profit off of their academic pursuits. Overall, it just doesn’t happen often enough.
With decentralized ledger technology, we can ensure the transparency that science needs. Researchers should not be beholden to a company. Instead of having many independent projects all claiming to be doing big data, we can be doing really big data, via platforms that are set up with minimal need for centralized governance. For human health data, we can pay subjects generously, under contract, to attract them and to ensure compliance and quality, treating data provision as prized labor. Different researchers may want their data in slightly different ways, and some may want to share only after a certain number of papers have been published. With smart contracts, even intricate propositions can be entertained, such as: “within the first year, people can only access the data for the purpose of replication and review; after that, people can publish on topics not including keywords such as XYZ; and then after 2 years, all restrictions are lifted.” (We are not saying we encourage restricted sharing, but to our minds, restricted sharing is better than no sharing.)
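The staged sharing rule quoted above can be written down as a small piece of logic. The sketch below is in Python rather than an actual smart-contract language, and everything in it is an illustrative assumption on our part: the function name is_allowed, the purpose labels, the keyword sets, and the approximation of one year as 365 days. It is meant only to show that such a rule is simple enough to be checked mechanically.

```python
from datetime import date, timedelta

ONE_YEAR = timedelta(days=365)    # assumption: a year is approximated as 365 days
TWO_YEARS = timedelta(days=730)


def is_allowed(release: date, today: date, purpose: str,
               topic_keywords: set, restricted_keywords: set) -> bool:
    """Decide whether a data request is permitted under the staged rule.

    Year 1:      only replication and review are allowed.
    Year 2:      publication is also allowed, unless the topic touches
                 a restricted keyword.
    After that:  all restrictions are lifted.
    """
    age = today - release
    if age < ONE_YEAR:
        return purpose in {"replication", "review"}
    if age < TWO_YEARS:
        if purpose == "publication":
            # Block only topics that overlap with the restricted keywords.
            return not (topic_keywords & restricted_keywords)
        return True
    return True


release_day = date(2020, 1, 1)
restricted = {"xyz"}

early_publish = is_allowed(release_day, date(2020, 6, 1), "publication", {"sleep"}, restricted)
early_review = is_allowed(release_day, date(2020, 6, 1), "review", set(), restricted)
mid_blocked = is_allowed(release_day, date(2021, 6, 1), "publication", {"xyz"}, restricted)
mid_open = is_allowed(release_day, date(2021, 6, 1), "publication", {"sleep"}, restricted)
late_any = is_allowed(release_day, date(2022, 6, 1), "publication", {"xyz"}, restricted)
```

In a deployed system, a rule like this would live in a contract on the ledger, so that every party can inspect the rule and every decision made under it is recorded.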
Our colleagues at Harmony are working on a new high-performance blockchain protocol that can scale to serve the types of data marketplaces we envision. With a scalable protocol that preserves decentralization and security (privacy), we can unleash the next wave of health innovation, which seems so promising and yet out of reach with today’s systems.