How we killed our own project because of privacy concerns

We built a data collective for drivers. Our prototype let us test abstract ideas for a product; it also ran up against our company values.

Reid Williams
Computable Blog
Mar 4, 2020


In the summer of 2019, I was struggling with a problem: how does an early startup test, learn from, and reduce the risk of product ideas when an enabling part of its end product is built with a powerful but slow-to-deploy new technology? This was the state of Computable as we were putting together the decentralized LEGO bricks that are Ethereum smart contracts.

There are a variety of ways to approach challenges like this — to understand the pitfalls and the potential of the whole while the whole is still in pieces. Ethnographic qualitative research and building mock products of varying complexity and resolution are a couple of examples.

However, our challenge went deeper because we were building a platform that was two steps removed from end users. We were building a decentralized system that let people create and participate in any number of data cooperatives. In what we envisioned for this future platform, each cooperative would have a unique purpose defined by its members, whether that would be curating internet cat pictures or providing data about how a city’s infrastructure gets used. The platform’s value is in the individual data cooperatives it hosts, and to prove its viability we needed to prototype a cooperative that we thought could really take off.

So we built a working prototype and ran a pilot, and learned volumes about a particular data cooperative that could be hosted on our platform in the near future. But more importantly, we learned what we did and did not want to create as a collective of humans running a company.

A data cooperative of drivers

Our vision for the pilot: create a mobile app that lets users record their location while driving and contribute to a large dataset that tells the story of how transportation in a city works. (We expected rideshare drivers, like those for Lyft and Uber, would be a key group who could gather this data while driving around a city.) From the start, we wanted to create app features and legal agreements that implement the same core beliefs that motivate our smart contract platform: that the users who create data own that data, maintain control over that data, and are fairly compensated for that data.

Screens from our prototype app showing how it gives the user clear control over when data is collected, and the option to easily delete data.

And this is exactly what we did. We created an iPhone app that let users choose to record their location and add it to a dataset that we would license to third parties. We paid users for their data, and included an in-app feature that let them delete their data, should they decide they no longer want it to be part of a licensed data set. Our legal agreement with users backed up their ultimate ownership of the data and their right to delete the data at any time from within the app.
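
To make that control concrete, here is a minimal sketch of the user-facing contract, assuming a hypothetical CooperativeDataset store rather than our actual app or backend: every point a user records stays attributable to them internally, precisely so that one delete action can remove all of it from the licensable dataset.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class CooperativeDataset:
    """Hypothetical store: points stay attributable to their contributor
    internally so that a single delete call can remove all of them."""
    points: Dict[str, List[Tuple[float, float, float]]] = field(default_factory=dict)

    def record(self, user_id: str, timestamp: float, lat: float, lon: float) -> None:
        # Only called while the user has recording switched on in the app.
        self.points.setdefault(user_id, []).append((timestamp, lat, lon))

    def delete_all(self, user_id: str) -> int:
        # Honor the in-app delete: remove every point the user contributed
        # and report how many were removed.
        return len(self.points.pop(user_id, []))

ds = CooperativeDataset()
ds.record("driver-42", 1565000000.0, 37.77, -122.42)
print(ds.delete_all("driver-42"))  # 1 -- nothing of theirs remains licensable
```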

We tried to eliminate as much as we could of the information asymmetry that is standard practice at companies that, as Shoshana Zuboff has put it, practice surveillance capitalism, while still preserving a business model that, at the end of the day, does sell data. We wanted to see what informed consent and choice looked like. As a user, yes, you are selling your data, but we’re striving to give you as much transparency, ownership, and control over that data in the process as we can.

Defining fair use of data

Since the origin of our modern conception of “data,” there has been case after case of datasets giving insight into our world or helping to solve very real problems. Yet the integration of the internet into every part of our lives has unearthed a dark side of data, due in part to the low cost and vast scale at which collection can happen. Those same economies of scale are also important for data’s positive impacts. Is it possible to use technology to bias the coin towards positive outcomes?

Our pilot app collected location data, but we intended for it to be used to understand the urban environment. (For example: Where is traffic heavy? Which parts of a city are served by Uber and Lyft drivers? How far are restaurants from the customers DoorDash delivers their food to?) Internally, we sketched a rough criterion for discerning fair ways for our collectives to use data: the data, though generated by individuals, was meant to be about the environment, not about the individual.

But almost as soon as we created the criterion, we realized how difficult it would be to apply. Certainly, using an individual’s location in real time to personally target ads to their phone was not what we had in mind. Traffic data aggregated by city block and time of day passed our criterion because it removed all traces of an individual’s path. But what about data showing the origin and destination of rideshare trips? This kind of data is incredibly useful precisely because it contains detailed information about trips. For example, are people riding all the way from home to work in a rideshare car, or are they using a multimodal commute that involves a segment on public transit?

A note on de-identified data

We had thought about de-identifying individual trip data, a process in which an individual’s sequence of locations is tied together only through a cryptic identifier that is not associated with any other identifying information (like the person’s email address or phone number). The problem is that de-identified location data is not necessarily anonymous. Indeed, the most useful data any company or project might collect is difficult to anonymize through de-identification, because the data is often so high-dimensional that it already uniquely identifies the individual. Such de-identified data is not anonymous; it’s merely unconnected to other individual data attributes like a person’s name or phone number. Even a random sampling of a handful of location data points from my last 24 hours is likely to include where I live and where I work, and this pair (work address, home address) uniquely identifies most of us. It’s not that hard to join pieces of data from several sources and re-identify an individual, complete with their location over days, months, or even years.
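
To make the re-identification risk concrete, here is a minimal sketch (hypothetical trace data and helper names, not code from our pilot) of how a de-identified trace alone can surface a likely home and work pair:

```python
from collections import Counter
from datetime import datetime

# Hypothetical de-identified trace: (pseudonymous_id, timestamp, lat, lon)
# rows with no name, email address, or phone number attached.
trace = [
    ("u-7f3a", datetime(2019, 8, 5, 2, 30), 37.7689, -122.4330),   # overnight
    ("u-7f3a", datetime(2019, 8, 5, 10, 15), 37.7936, -122.3967),  # weekday midday
    ("u-7f3a", datetime(2019, 8, 6, 3, 10), 37.7689, -122.4330),   # overnight
    ("u-7f3a", datetime(2019, 8, 6, 11, 0), 37.7936, -122.3967),   # weekday midday
]

def round_point(lat: float, lon: float, places: int = 3) -> tuple:
    """Snap a coordinate to a coarse grid cell (~100 m at 3 decimal places)."""
    return (round(lat, places), round(lon, places))

def likely_home_and_work(rows):
    """Guess home (most common overnight cell) and work (most common
    weekday-daytime cell) from nothing but a de-identified trace."""
    nights, days = Counter(), Counter()
    for _, ts, lat, lon in rows:
        cell = round_point(lat, lon)
        if ts.hour < 6 or ts.hour >= 22:
            nights[cell] += 1
        elif ts.weekday() < 5 and 9 <= ts.hour < 18:
            days[cell] += 1
    home = nights.most_common(1)[0][0] if nights else None
    work = days.most_common(1)[0][0] if days else None
    return home, work

print(likely_home_and_work(trace))
# The (home, work) pair is often unique to one person, so the "anonymous"
# identifier u-7f3a can be re-linked to an individual via outside data.
```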

The data seller’s dilemma

The fundamental problem is that for any given data set, it’s hard to know what information will be pulled out, and what that information will be used for. This open-ended potential in data is what makes it so useful, but also what makes solving privacy challenges through transparency and user control so difficult.

At Computable, if we license a dataset to a third party, what’s to stop that third party from using the data in a novel way that violates our expectation, and our users’ expectation, of the data’s purpose? Informed consent must then cover not just what data is collected, but how it will be used. This is why California’s new privacy law, the CCPA, stresses that consumers must be informed of both. The data that’s collected does not tell the whole story of how it might be used, not merely because privacy policies often obfuscate some of the data’s uses, but because we are continually inventing unforeseen ways to pull useful information out of raw data. Companies that collect data for their own internal use should have no problem describing how they use it, but for a company that sells data (or information derived from data), this is a problem. As of today, we aren’t able to efficiently track how data is used once it leaves our hands.

For a time, we handled this challenge by planning to aggregate individuals’ data in a conservative way that we believed would robustly preserve individual privacy. An individual’s series of timestamped locations was aggregated with others’ to create time- and location-based density maps, without any identifier that could connect one data point to another. We also removed data points that we suspected or knew were someone’s home or work address.
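
A rough sketch of that aggregation step might look like the following; the row format, the 3-decimal rounding grid, and the sensitive_cells set are assumptions for illustration, not our production pipeline:

```python
from collections import defaultdict

def density_map(rows, sensitive_cells, places=3):
    """Aggregate raw (hour_of_day, lat, lon) points, already stripped of any
    per-user identifier, into (grid cell, hour) counts. Nothing in the output
    links one point to another, and flagged home/work cells are dropped."""
    counts = defaultdict(int)
    for hour, lat, lon in rows:
        cell = (round(lat, places), round(lon, places))
        if cell in sensitive_cells:
            continue  # drop suspected home or work locations entirely
        counts[(cell, hour)] += 1
    return dict(counts)

rows = [
    (8, 37.7749, -122.4194),
    (8, 37.7751, -122.4193),
    (17, 37.7689, -122.4330),
]
sensitive = {(37.769, -122.433)}  # assumed set of flagged cells
print(density_map(rows, sensitive))
# e.g. {((37.775, -122.419), 8): 2}
```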

Map of San Francisco with data points from our data set.

In a few months’ time, we created a dataset of nearly 50 million data points from a small set of pilot users, capturing the driving trends of Uber, Lyft, and food delivery drivers, commuters, and other frequent drivers spanning most of the Bay Area. We talked to customers about licensing our dataset. But in the end we killed the project. We had only scratched the surface and were already seeing uncomfortable trade-offs around user privacy.

Gaining clarity

Following the pilot, we’ve continued to work on data cooperatives, because not all kinds of data fall into the gray area that location data occupies, where how the data is represented and how it’s used determine whether it violates a user’s trust. In the end, we learned a lot not just about the unknown user and product variables for data collectives, but also about the business, ethical, and legal variables around privacy, which pushed too far against our company values. Ultimately, we gained clarity on what we don’t want to do as a business.

Moving forward we plan to continue building and testing concepts that match our company values and protect privacy for everyone involved — stay tuned.
