Toward A Sustainable Data Ecosystem

Published in DataSeries, Jul 6, 2020

There is a lot of talk these days about data: who owns it, what it’s used for, and how to regulate its uses. But there is too little basic discussion about what data fundamentally is. We know that it’s all around us, that everyone wants it, that it’s extremely valuable, and that we have very little control over it. Below, I will attempt to explain what data is at the most basic level: how it exists, what sort of thing it is, and how we have come to treat it. This will challenge your assumptions about data, about how it is related to you, and about what you can and cannot do to guard what you consider “your” data. Ultimately, we will see that there is great hope, promise, and wealth to be derived from treating personal data differently than we do now, regarding it more like a precious natural resource, and creating means for us all to grow enriched through its safer storage, sharing, and use. The model proposed is one that will lead, hopefully, toward an ecosystem where data sustainability benefits us all.

I have argued for some time that the ability of others to claim ownership over data is wrong for a number of reasons. For instance and most notably, in Who Owns You? I argued that the patenting of unmodified genes was contrary to the patent laws (as the Supreme Court subsequently agreed and held). While many took the position that patenting unmodified genes was unethical because it usurped our “rights” to that data, my claim was broader and based not primarily on ethics but on ontology: data cannot, in any rational sense, be possessed or owned at all. Neither can expressions, for that matter — a distinction recognized in the law but poorly understood in the general discourse. The failure of people to comprehend this basic principle and the reasons for it, both ontological and practical, has allowed a lot of the current talk about ethics and personal data to veer off into some tenuous territory. We need to take a close look at what data is and how it exists (its ontology) and reexamine whether ownership is the proper framework at all for its treatment in the law and culture. First, let’s review what property is, because only property can be owned in the first place.

What is Property?

In the law, property is often described as a bundle of rights. In legal parlance, property interests held by people map to various types of things by describing who can do what with what sorts of things under which circumstances. All of this is best understood by way of examples.

Mostly, the things we consider “ownable” are land and objects. There are lots of things that clearly are NOT ownable, however, and the ways in which we can own land and objects are sometimes restricted. Land and objects (sometimes called moveables) have physical properties unlike other sorts of things sometimes afforded property rights. They are generally held to the exclusion of others. They have physical boundaries. They are susceptible to possession. A piece of land can be fenced in and trespassers kept out; a hammer or a car can be held or contained in some way such that others cannot possess it. The exclusive nature of land and moveables (to differing degrees, of course) is what made the legal rules and cultural norms regarding the rights (and duties) that can exist among states, people, and property meaningful and acceptable.

If I possess a thing or a piece of land, then either the market or violence serves as the available means of changing possession. Legal and cultural norms regarding property revolve around facts and acts of possession, around providing certainty and peace for the transfer of possession, and around accounting for legal “title” to property.

The non-exclusive nature of basically everything that isn’t land or moveables makes possession of those things nearly impossible (with one exception: facts known only to you) and makes schemes that attempt to provide people with rights to them suspect. The whole of intellectual property law is one such scheme. It attempts to create rights to things that cannot ordinarily be possessed and calls those rights a type of “property right.” Property rights in land and moveables came to be codified in laws, but no similar rights to data or expressions existed until a few hundred years ago, when states began trying to encourage immigration by creative, entrepreneurial people by offering monopoly rights over those ephemeral things. Since then, “intellectual property” has spread throughout the world as a new sort of right over expressions, confusingly and inaccurately dubbed “property” in the process, and without the indicia or features of historical property.

Unlike land and moveables, neither data nor expressions can be possessed (save by keeping them secret), and never with the sort of exclusivity that makes property law an extension of the need to maintain peace and security of possession. Laws that protect tangible property are necessary because they help create peace and prevent violence. No such danger exists when someone copies an expression. A copy of an expression (like a book, movie, song, invention, or statue) does not deprive the author or creator of their original expression. Copying commits no violence, requires no force, and does not diminish what the author held or created. Civilized norms may require us to attribute the authorship of the original, and have encouraged mechanisms to reward authors and creators, but no peace is threatened by any copy. As Thomas Jefferson famously wrote:

“He who receives an idea from me, receives instruction himself without lessening mine; as he who lights his taper at mine, receives light without darkening me.”

Ultimately, we have chosen to create a set of state-sponsored rights, monopolies over the expressions of data, as incentives to creators. But these rights are not truly property rights in any way. They neither grant nor recognize possession at all. They are just mechanisms to punish, for some limited period of time, those who try to express what the “original expresser” expressed, and only if that expression exists in some “fixed medium.” These artificial monopolies are meant to prevent others from expressing themselves if you can convince a court or regulatory body that another’s expression is sufficiently similar to the one for which you got your monopoly, or to something you have already expressed. Yet even the intellectual property laws do not protect underlying data. They protect only expressions that are original, and data is not granted that status. Data is not even an expression.

What is Data?

Because Intellectual Property (IP) laws protect only original expressions that come to be preserved in some medium (inventions, books, movies, songs, etc.), they do nothing to protect “data.” Data are facts; datasets record things about the world. Data about me are simply facts about me. Many of these facts are quite public, and anyone can discover them with little effort. My eye color, whether I wear glasses, my height, the nation I reside in, etc., are all data about me, none of which is mine. I don’t own those facts any more than I do the ones that no one yet knows. In their barest form, they are not original expressions and thus not even amenable to copyright or patent. The composition of a particular photo is an expression inasmuch as the choices the photographer makes in framing, exposure, white balance, etc., compose an original expression about facts in the world, but the things that a photo captures are data about the world, and are not covered by copyright.

A particular photo of Niagara Falls is an original expression, but Niagara Falls cannot be. A particular portrait of a person is an original expression but the person’s face itself is not. Your birth date is a fact about you, not an expression of something original. The same is true of your life history, the clothes you were seen wearing yesterday, your weight and blood pressure, etc. These are data, and no legal regime grants any monopoly over that data, as opposed, say, to the copyright an author might get for describing you in a magazine profile or biography.

Data is free by material necessity. It reflects facts that are observed about the world, and once known it cannot be contained save by one means: keeping a secret. There are some nascent legal rules to protect datasets, and various jurisdictions provide protections against misuses of data, primarily relating to concerns about personal privacy. But ultimately, every time we try to bottle up data we impede values that we also hold dear: the search for truth, the progress of science, and freedom to inquire. Inquiry and expression should not be stifled, and there is no reason to think that restricting data is the only way to preserve other values like privacy.

The conflict most widely debated today is between two things. On one side is the nature of data, both in itself and as treated in our institutions, as something which flows freely and ought to do so for the sake of our increasing knowledge and understanding. On the other is our desire to keep some data contained, primarily that which we deem personal or “ours.” Data is neither a commodity nor an asset under typical conceptions of those terms, because it is immaterial and purely informational, representing qualities of the observed universe and not quantities taking up space. Regardless, data is valuable, mostly in its flow among those who value it, and there are numerous ways to achieve value by facilitating the analysis and exchange of data.

No better example exists for the proposition that data should be free than the Covid-19 pandemic. The rapid dissemination and free use of information and data in scientific discovery are essential to improving all outcomes, and failures to be transparent about data have already proven to be significant setbacks for that progress. Nor does a better example exist for the proposition that personal data poses risks to individuals when shared without restriction. Contact tracing through apps for Covid-19 is but one example of how data that is useful for saving lives may also reveal our personal movements and associations in ways that defy our expectations of privacy.

Data is Not “Yours”

No one can own your data, not even you. The closest you can come to owning your data is keeping it to yourself. Secrets are “yours” as long as no one else discovers them. Beyond secret-keeping, there is little in the way of legal, ethical, or technical recourse against data being spread once it is known. Moreover, its general usefulness for both science and technology suggests we should continue to encourage its use as long as harms can be avoided. Primarily, the harms that concern most of us center around our expectations of privacy. So how do we encourage people to share data for the sake of science while respecting their expectations of privacy? We must develop mechanisms that protect secrets and yet allow sharing of relevant data.

The problem is that data that is useful for science, especially medical data like genomic and demographic data, can reveal things about individuals and can be reverse-engineered to identify them. Aggregating “de-identified” datasets has been the traditional way to try to preserve privacy while sharing (or selling) useful data for science. But studies have shown that re-identification remains possible. HIPAA and other laws have been created to attempt to preserve some rights of privacy while maintaining mechanisms for data dissemination that promote scientific discovery, but whole classes of research and parties are exempted, and there is little understanding of either the net benefits to individuals or the duties owed under such laws. Regimes like the GDPR (Europe’s data protection law), enacted to give individuals more privacy over their personal data, may have chilling effects on science that have been noted and criticized, and they complicate and inhibit international commerce.
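
To make the re-identification risk concrete, here is a minimal Python sketch of a linkage attack. Every record, name, and field name in it is invented for illustration; the only assumption is that the “de-identified” dataset still carries a few ordinary attributes (postal code, birth year, sex) that also appear in some public list.

```python
# Toy linkage attack: every record below is invented for illustration.

# A "de-identified" research dataset: names removed, quasi-identifiers kept.
deidentified = [
    {"zip": "14201", "birth_year": 1969, "sex": "M", "diagnosis": "hypertension"},
    {"zip": "14201", "birth_year": 1985, "sex": "F", "diagnosis": "asthma"},
    {"zip": "14222", "birth_year": 1969, "sex": "M", "diagnosis": "diabetes"},
]

# A separate public list (for example a voter roll) that does carry names.
public = [
    {"name": "Alex Example", "zip": "14201", "birth_year": 1969, "sex": "M"},
    {"name": "Blake Sample", "zip": "14222", "birth_year": 1990, "sex": "F"},
]

QUASI_IDENTIFIERS = ("zip", "birth_year", "sex")


def key(record):
    """Tuple of quasi-identifier values used to link the two datasets."""
    return tuple(record[field] for field in QUASI_IDENTIFIERS)


def reidentify(deidentified_rows, public_rows):
    """Return (name, diagnosis) pairs where the quasi-identifiers match uniquely."""
    by_key = {}
    for row in deidentified_rows:
        by_key.setdefault(key(row), []).append(row)
    matches = []
    for person in public_rows:
        candidates = by_key.get(key(person), [])
        if len(candidates) == 1:  # a unique match singles out one person
            matches.append((person["name"], candidates[0]["diagnosis"]))
    return matches


print(reidentify(deidentified, public))
```

No name was ever stored in the research dataset, yet a unique combination of ordinary facts is enough to attach one.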

Data is not property. It cannot be owned, and it SHOULD not be considered ownable. It lacks the exclusive qualities of property, and it is more useful and brings tremendous benefits when its flow is easy and transparent. It is not yours, unless you succeed in keeping it secret. What we need is to create better technologies for secret-keeping, and for sharing useful data without revealing secrets, and not more attempts to regulate such an ephemeral realm of intangible objects. Tools that permit us to maintain sovereignty over our data, our identities, and ourselves are what we should be building in order to help ensure the values of privacy and discovery are maintained and respected.

Your Data Future

If data cannot be owned, then what can we do to encourage its use, its dissemination, and to better adhere to our reasonable expectations of privacy? We can treat our personal data as though it is composed of precious secrets and leave the duty of protecting that data up to each of us alone, as sovereign individuals. We do not do that now. Instead, we entrust our secrets to numerous others, platforms and professionals whose good graces alone allow us the illusion of privacy over that data. This has proven to be a grave mistake as prominent leaks and hacks of data from those trusted custodians have shown.

What we should do is keep our secrets in our private vaults and never let anyone have direct access to those vaults at all. We would prefer, for instance, to allow scientists to access only as much data as is necessary for their studies, without them even knowing the data comes from our vaults, and preferably paying for that access. We want them to be able to query data in our vaults without having access to all the data there, while still getting true and trustworthy answers to their valid scientific queries. They should only be able to do this with our knowledge and our consent and, even better, by paying us as individual custodians of our data. There should be a permanent audit trail for these queries, and a way to know who accessed our data and when, so we can track down misuses and pursue them legally in case of civil or criminal harms.
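
One way to picture such a vault is the Python sketch below. It is only an illustration under stated assumptions: the `DataVault` class, its method names, the two supported question types, and the sample values are all hypothetical, and payment is left out entirely. The vault answers narrow yes/no questions for requesters the owner has approved, never returns the stored values, and records every attempt in a hash-chained log so that later tampering can be detected.

```python
import hashlib
import json
import time


class DataVault:
    """Hypothetical personal vault: raw records never leave, only answers do."""

    def __init__(self, records):
        self._records = dict(records)   # private data stays inside the object
        self._approved = set()          # requesters the owner has consented to
        self.audit_log = []             # append-only, hash-chained entries

    def grant(self, requester):
        """Owner records consent for one requester."""
        self._approved.add(requester)

    @staticmethod
    def _digest(body):
        return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()

    def _log(self, requester, question, allowed):
        prev = self.audit_log[-1]["hash"] if self.audit_log else "0" * 64
        entry = {"requester": requester, "question": question,
                 "allowed": allowed, "when": time.time(), "prev": prev}
        entry["hash"] = self._digest(entry)   # hash covers everything above
        self.audit_log.append(entry)

    def ask(self, requester, field, predicate_name, threshold):
        """Answer a yes/no question about one field without returning its value."""
        allowed = requester in self._approved
        self._log(requester, f"{field} {predicate_name} {threshold}", allowed)
        if not allowed:
            raise PermissionError("owner has not consented to this requester")
        value = self._records[field]
        if predicate_name == "at_least":
            return value >= threshold
        if predicate_name == "below":
            return value < threshold
        raise ValueError("unsupported question")

    def verify_log(self):
        """Recompute the chain; editing an entry, or deleting one from the middle, breaks it."""
        prev = "0" * 64
        for entry in self.audit_log:
            body = {k: v for k, v in entry.items() if k != "hash"}
            if entry["prev"] != prev or entry["hash"] != self._digest(body):
                return False
            prev = entry["hash"]
        return True


# Invented example values.
vault = DataVault({"age": 47, "systolic_bp": 128})
vault.grant("lab-a")
print(vault.ask("lab-a", "age", "at_least", 40))
print(vault.verify_log())
```

A real system would also need to stop a requester from narrowing in on a value through many repeated questions, which this sketch does not attempt.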

Our secret data should never “move” as wholes or parts, but those who run queries against the data we choose to make searchable should at least be able to verify that the data they are accessing is true and trustworthy. What we need to build is an infrastructure of these private vaults, with private keys, payment options, and analytical tools, along with a sense of individual responsibility for the personal data that describes us fully, and over which we cannot exert any form of ownership other than better secret-keeping through a grid or network built to enable it. An ecosystem of data sustainability will treat data as a natural resource, safe at its point of origin, valued in its efficient and sparing extraction and flow, used with precision and without waste, and paid for according to its value.

Keeping Better Secrets Through Technology

Encryption is a centuries-old technique for protecting communications, and since the advent of the computer age it has been used to help protect sensitive data in motion. A message that is properly encrypted by the sender and decrypted by the intended receiver is much less likely to reveal its secrets to anyone who intercepts it along the way. These sorts of encrypted transfers generally require some trust between the parties, and prior agreement as to the keys used to encrypt and decrypt communications. Public key cryptography was created to help solve some problems of symmetric key encryption, and to make it possible to encrypt data in transit without the weakness introduced by the need for a “secure channel” over which to exchange keys. With a public key cryptographic system, once you know the public key of your intended recipient, you can send them an encrypted file without having to transmit a secret key, and thus without risking that someone intercepts that key and uses it to decrypt the data.
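
The idea that a key can be published without weakening the secret is easier to see with numbers. The Python sketch below uses textbook RSA with deliberately tiny primes. It is a toy under stated assumptions, not how production systems encrypt files: real deployments use vetted libraries, keys thousands of bits long, and padding schemes that are omitted here.

```python
# Textbook RSA with tiny primes: insecure, for illustration only.
p, q = 61, 53
n = p * q                      # modulus, part of both keys
phi = (p - 1) * (q - 1)        # used only while generating the keys
e = 17                         # public exponent, chosen coprime with phi
d = pow(e, -1, phi)            # private exponent (modular inverse, Python 3.8+)

public_key = (e, n)            # can be shared openly with anyone
private_key = (d, n)           # never leaves the recipient


def encrypt(message: int, key) -> int:
    """Anyone holding the public key can do this."""
    exponent, modulus = key
    return pow(message, exponent, modulus)


def decrypt(ciphertext: int, key) -> int:
    """Only the holder of the private key can undo it."""
    exponent, modulus = key
    return pow(ciphertext, exponent, modulus)


secret = 65                    # the message must be a number smaller than n
ciphertext = encrypt(secret, public_key)
print(ciphertext != secret, decrypt(ciphertext, private_key) == secret)
```

The sender never needs the private exponent `d`, so nothing that must stay secret ever travels between the parties.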

Public key cryptography still has weaknesses, most notably the “man-in-the-middle” attack, where someone intercepts a public key in transit and substitutes their own. Ultimately, any time you transport data from one place to another, you risk its discovery. Where that data is personal and one wishes it to remain private, approaches that involve federated analytics mean that whole datasets need not move. With such an infrastructure, protected data can sit safely in place, firewalled and safeguarded, while a third party can still analyze it, seeing only the results of a query and not the whole dataset.
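
A federated query can be sketched in a few lines of Python. The `Site` class, the `federated_mean` function, and the sample readings below are hypothetical and chosen only for illustration. Each data holder computes a small summary locally, and the analyst combines the summaries without ever seeing an individual row.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Site:
    """One data holder; `_values` stays local and is never returned."""
    _values: List[float]

    def local_summary(self, lower: float, upper: float) -> Tuple[int, float]:
        """Return only (count, sum) of the values inside [lower, upper]."""
        selected = [v for v in self._values if lower <= v <= upper]
        return len(selected), sum(selected)


def federated_mean(sites: List[Site], lower: float, upper: float) -> float:
    """Combine per-site summaries into a global mean without pooling rows."""
    total_count = 0
    total_sum = 0.0
    for site in sites:
        count, subtotal = site.local_summary(lower, upper)
        total_count += count
        total_sum += subtotal
    if total_count == 0:
        raise ValueError("no matching records at any site")
    return total_sum / total_count


# Invented readings held at three separate sites.
sites = [Site([120.0, 135.0]), Site([110.0]), Site([140.0, 125.0, 130.0])]
print(federated_mean(sites, 100.0, 200.0))
```

Summaries over very small groups can still give individuals away (a count of one with its sum is the value itself), so a practical design would likely add minimum group sizes or statistical noise on top of this.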

Because data cannot be owned, it must be kept secret, and yet we must be able to do important science with it (and earn according to that value), so technical means for better secret-keeping must be employed. We must build better, more secure personal vaults for our data, and some means for others to probe that data without revealing more than is needed for some sort of query, and with less risk of revealing one’s identity in the process.

One promising mechanism for achieving this personal data vault while still encouraging and facilitating the use of data for research is the “zero-knowledge proof.” This cryptographic process allows one party to convince another, even one they do not know or trust, that a statement about some hidden information is true without revealing the information itself. A version of it is used in the cryptocurrency Zcash. Properly employed, it could help to allow queries on hidden data without leaking so much of that data in the process that it could be used to re-identify individuals. We need scalable solutions that let us be the guardians of our own personal data and yet participate in a vibrant data economy that can benefit us all through scientific inquiry, if we are to realize data’s true potential.
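
For a feel of how proving without revealing can work, the Python sketch below walks through one round of the Schnorr identification protocol, a classic interactive proof that the prover knows a secret number `x` behind a public value `y`. This is an assumption-laden toy: it is a much simpler construction than the zk-SNARKs Zcash relies on, the group parameters are far too small for real use, and the function names are illustrative only.

```python
import secrets

# Toy group: P = 2*Q + 1 with Q prime, and G generates the subgroup of order Q.
# These numbers are far too small for real use.
P, Q, G = 23, 11, 2


def public_value(x: int) -> int:
    """Value published by the prover; x itself stays secret."""
    return pow(G, x, P)


def commit():
    """Prover picks a random nonce r and sends t = G^r mod P."""
    r = secrets.randbelow(Q)
    return r, pow(G, r, P)


def respond(r: int, challenge: int, x: int) -> int:
    """Prover answers the verifier's challenge using the secret x."""
    return (r + challenge * x) % Q


def verify(y: int, t: int, challenge: int, s: int) -> bool:
    """Verifier checks G^s == t * y^challenge (mod P) without ever seeing x."""
    return pow(G, s, P) == (t * pow(y, challenge, P)) % P


x = 7                              # the prover's secret
y = public_value(x)                # known to everyone
r, t = commit()                    # step 1: prover commits
c = 1 + secrets.randbelow(Q - 1)   # step 2: verifier's random challenge in [1, Q)
s = respond(r, c, x)               # step 3: prover responds
print(verify(y, t, c, s))          # step 4: verifier accepts or rejects
```

Because the nonce `r` is fresh and random each round, the response `s` on its own tells the verifier nothing useful about `x`, while a prover who does not know `x` cannot produce a response that passes the check for this challenge.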

Reach out and Feel the Data

Consider Yoda describing the Force:

“Life creates it, makes it grow. Its energy surrounds us and binds us. Luminous beings are we, not this crude matter.”

Data is a bit like this. It is everywhere, and we are constantly finding new ways to extract it from what we observe in the world we inhabit. It has no form of its own, yet it is present in all of our experience. Collecting data and describing it in new and useful ways is an increasingly valuable endeavor, both scientifically and monetarily. When we are able to interpret data well, it informs us about nature and gives us a better ability to achieve the aims of understanding, prediction, and control of our environment. It is meaningless to claim some sort of “ownership” of it because, once expressed, it becomes mere statements of fact, and no one can monopolize the facts. Data is nonexclusive by nature, and achieves value only through sharing and comparing it against more data.

Data is a bit like water or even air. Although the world has water in abundance, clean, useful water is increasingly valued, and those who find better ways to extract it and move it to those who need it are considered valuable and often devise creative ways to be paid for their efforts. Water is big business, and billions are spent every year on finding it more effectively, mining it more cleanly, purifying it more sustainably, and transporting it cleanly, quickly, and with less waste. “Owning” (possessing) water is transient and fleeting, and water is more valuable for being so. Water passes through us and re-enters the stream of commerce at some point, through literal streams and a cycle that includes evaporation, oceans, and hurricanes.

Thousands of types of entities participate in the water economy, the mesh of networks that move water around and process it in various ways, adding value. The same is true for data. Like water, data achieves its value by its steady (and increasingly rapid) flow through networks in ways that achieve individual ends efficiently. Greater value for all, moreover, can be achieved when infrastructures for natural resources are built to be sustainable. Although data is abundant and extracted through observation of the world we inhabit, it is sometimes valued in ways that make us want to keep some of it secret, and not to distribute it for free. Someone with their own well may wish to share access to their water and profit from it without revealing where that well is or providing free access to it for all. The same is true for our personal data. There are fortunes to be made and scientific achievements of benefit to all to be realized through devising the parts of data networks that facilitate finding, sharing, and payment for useful personal data while allowing us to keep our secrets better.

We are in a nascent stage where new forms of data utilities will emerge, providing options for the security and ethical monetization of our data while adhering to values that recognize and respect our expectations of privacy. Those who do best at creating better wells, pipes, and spigots to speed up and secure the flow of data, and who allow us to keep our secrets better while helping science and profiting from it both communally and personally, will be the founders of a new data age. DARPA (the government research agency) gave us the open “TCP/IP” protocols that powered the first version of the internet and still underlie our interconnected world decades later. Tim Berners-Lee at CERN ushered in the World Wide Web with HTTP, the open protocol that powers it. In the same way, web 3.0’s architects are busily and ingeniously trying to help us maintain our personal privacy while helping to make data flow safely, securely, and profitably, like water does into our homes, washing over our world, encouraging sustainable growth and new life.

Dr. Koepsell is an author, professor, lawyer and entrepreneur. He co-founded EncrypGen, Inc. and is General Counsel and Chief Ethics and Compliance Officer for ConsenSys Health (http://davidkoepsell.com). He currently resides in Mexico City, Mexico with his spouse, two children, and Jack Russell terrier.