A clock melted by a nuclear explosion at the end of World War II. Photo Credit: James 2005

Data Law’s Radioactive Decay

Law’s definition of privacy is at odds with digitization, science, and the modern economy. We’re going to need a new approach.

At the center of nearly all of the world’s data and privacy regulation is one term: “Personally Identifying Information”* (PII for short). PII is the test most laws and regulators use to classify sensitive data. Once data is identified as PII, it becomes subject to privacy policies, industry regulation, and a wide range of legal protections, many of which make sharing, moving, or publishing that data illegal. Although there are competing definitions of PII, nearly all of them boil down to “something that can be used to identify or locate an individual in a context**, by itself or when paired with other data.”

The problem is that it’s getting a lot easier to identify individuals, and it takes less and less new data to do it. Data typically starts as PII because it’s generated by individuals in a context. In modern digital systems, we’re not only creating data about individuals, we’re also creating metadata about the context in which it was collected. The process of making a data set compliant with regulations is called anonymization, and it involves stripping out data that could be used to re-identify individuals. Anonymization is like building any security system: it’s not hard to prevent re-identification if you know which context you’re trying to keep hidden and what data the other party has. It’s almost impossible, though, to completely prevent all re-identification, especially with an unknowable and constantly growing amount of data in the world. As a result, PII, and the legal approaches to data protection and privacy that depend on it, are losing meaning. That raises significant questions about whether law is the right system for classifying and regulating data. No matter the answer, without a new approach, the laws designed to protect people in digital spaces could do just the opposite.
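To make “when paired with other data” concrete, here is a minimal sketch of how re-identification by linkage can work. Everything in it is an assumption made up for illustration: the records, the names, the choice of zip code, birth year, and gender as linking fields, and the `reidentify` helper. It is not drawn from any real data set or any particular law’s definition.

```python
from collections import defaultdict

# Hypothetical "anonymized" release: names stripped, other fields kept.
anonymized_purchases = [
    {"zip": "20001", "birth_year": 1984, "gender": "F", "purchase": "prenatal vitamins"},
    {"zip": "20001", "birth_year": 1984, "gender": "M", "purchase": "running shoes"},
    {"zip": "20002", "birth_year": 1990, "gender": "F", "purchase": "insulin test strips"},
]

# Hypothetical auxiliary data the receiving party already holds.
public_directory = [
    {"name": "Alice Example", "zip": "20001", "birth_year": 1984, "gender": "F"},
    {"name": "Bob Example", "zip": "20001", "birth_year": 1984, "gender": "M"},
    {"name": "Carol Example", "zip": "20002", "birth_year": 1990, "gender": "F"},
    {"name": "Dana Example", "zip": "20002", "birth_year": 1990, "gender": "F"},
]

# Assumed linking fields; real data sets may expose very different ones.
LINKING_FIELDS = ("zip", "birth_year", "gender")


def reidentify(anonymized, auxiliary, keys=LINKING_FIELDS):
    """Return (name, purchase) pairs where the linking fields match exactly one person."""
    # Index the auxiliary data by its combination of linking fields.
    index = defaultdict(list)
    for person in auxiliary:
        index[tuple(person[k] for k in keys)].append(person["name"])

    matches = []
    for record in anonymized:
        candidates = index.get(tuple(record[k] for k in keys), [])
        # Only a unique candidate counts as a re-identification here.
        if len(candidates) == 1:
            matches.append((candidates[0], record["purchase"]))
    return matches


matches = reidentify(anonymized_purchases, public_directory)
for name, purchase in matches:
    print(f"{name} -> {purchase}")
```

In this toy data, two of the three “anonymized” records link to exactly one person, while the third stays ambiguous only because two people happen to share the same fields. Whoever released the data has no way of knowing what auxiliary data the other party holds, which is the difficulty described above.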

Digitization has meant an exponential increase in the production of data, the evolution of data science, and the flow of capital into applying one to the other. The public and private sectors are investing billions in creating, analyzing, and hosting an unprecedented amount of data. The race to monetize all that new data has led an equally unprecedented number of organizations to invest in data science research: essentially, developing granular insights on consumer behavior by identifying consumers in a context. For some of the world’s largest companies, the core business is selling the ability to identify users in context (advertising segmentation). The definition of PII, and with it the legal approaches to data protection and privacy, puts the law on the wrong side of the modern world.

The problem is that, right now, the law is the only system giving users any say at all in how their data is used.

Data protection and privacy law enforcement focuses on the sharing and movement of PII (intentionally, passively, or via security breaches). That requires the law to have good definitions not only of PII, but also of the “sharing” and “movement” of digital assets. Assume, for a moment, that it’s possible to anonymize a data set so that it’s not PII when it’s shared. Even then, there’s nothing to prevent the receiving party from re-identifying the individuals in it. In fact, most data sharing is done with the explicit intention of identifying individual behavior in context. That means, at best, anonymization only meaningfully protects the people who share data, not the people represented in it. And once data is shared, all bets are off.

Even if the laws were perfect (and no one argues they are), implementing them via legal institutions will be expensive, politically complex, and likely ineffective. Most legal institutions simply aren’t designed for a high volume of high-complexity, low-value disputes, which are exactly what digital systems and data breaches create. Even when causes of action are aggregated by bulk data breaches or class action lawsuits, courts have a limited ability to assess damages or provide redress.

If implemented, data regulation will overwhelm already over-stretched legal and regulatory institutions — particularly those in emerging markets or without dedicated, proportional, independent financial support. At best, that will result in subjective and uneven enforcement, reinforcing existing power disparities. At worst, it will damage the evolution of a wide range of industries and destabilize fragile political relationships. Neither over-regulation nor overwhelmed, unpredictable regulators are good for anyone. As recently noted by re-identification researcher Yves-Alexandre de Montjoye, in the New York Times, “the message is that we need to rethink and reformulate the way that we think about data privacy.”

The first reformulation is that we need to stop defining privacy as an absolute, and instead think of it as a contextual decision. The second is that we should use the law to give structure and effect to those decisions, but not to manage or implement them directly. The third, and most complicated, is the need to reframe the power balance between data sources (usually individuals) and the organizations (usually private companies) that give that data value. In other words, we’ll need to find ways to embed standards into the relationships between private companies and individuals, which can be enforced by public sector institutions.

If it sounds like a tall order, it is, but there’s already a lot of momentum. In 2009, Elinor Ostrom won a Nobel Prize for Economics for research detailing successful governance models of the commons. The Internet Society, just yesterday, released a framework for multi-stakeholder governance, focused on the Internet, but applicable to data. Doc Searls’ Project VRM has built a space for the community that’s building user-controlled data ecosystems, an approach gaining traction in India. Bruce Schneier is trying to reverse the tide of big data, “collect it all” hype by arguing that data is a toxic asset, emphasizing the costs and liability of mass collection and storage. In the middle, Jonathan Zittrain, building on Jack Balkin’s work, advocates for attaching fiduciary duty to types of data to motivate companies to be better stewards of user rights. And I’m developing Digital Civic Trusts with Keith Porcaro: a legal vehicle that creates data stewards, with fiduciary duties to users and commercial relationships with companies, based on intellectual property ownership. Each of these ideas is an approach to diversifying the governance and oversight of the decisions that determine equity in digital systems.

Each of these approaches is part of a growing understanding that our current approach to governing digital spaces is unstable and ineffective. Rather than focus on designing governance systems based on legal institutions, we should be focusing on building standards and enforcement frameworks for digital equity. At the very least, we need a new way to classify data; the PII test simply has too much working against it and too much riding on it. And like anything that’s unstable at its core, it is becoming radioactive and dangerous. Data law is nearing the end of its half-life. It’s time to build a more stable future.

*- The term appears in several variations that combine “Personal” or “Personally” with “Identifiable” or “Identifying.” The variation doesn’t alter the core definition.

**- Identifying an individual in a context doesn’t mean being able to identify a person by name; it means being able to track a single person’s behavior through multiple actions in the same system. Being able to link all the shopping a person does in a day, even if you never know that person’s name, for example, counts as personally identifying.