Some problems with the “data as labor” argument

There seems to be growing interest in the idea that user data collected and monetized by tech companies is a form of unpaid labor, and that data producers should be compensated for the inputs they provide in tasks like machine learning. I think there are several problems that proponents of this view need to consider.

Jaron Lanier, for example, argues that the commanding heights of the digital economy are held by a small number of firms that exploit network effects and offer free services in order to harvest data from large numbers of users. This creates inefficiencies, since these “siren servers,” in Lanier’s expression, do not use prices to elicit the most valuable user information, nor can users express their preferences for quality services using their dollars. It also leads to widening inequality, as the labor share of income within these firms is often quite low, and the desire for free services puts the squeeze on print journalism, publishing, and other industries that have historically supported a middle-class economy.

For instance, the authors of the above linked paper “Should We Treat Data as Labor?” note that data produced through the “siren server” model is generated through consumptive rather than productive activity, which biases data availability and quality. Because most people work in firms, data created in productive work is protected by trade secrecy law. Unlike firms, most individuals are willing to barter their data with tech companies in exchange for a free online experience. This prevents data buyers from rewarding quality, and data producers from specializing.

So there are basically two arguments here: one about efficiency, and another about equity. There are objections to each of these arguments, however, that I haven’t seen data-as-labor advocates directly address. I’ll start with the one concerning equity, which I’ll call the double-counting objection. Basically, if you already conceive of existing data production and usage as a mutually beneficial barter exchange, it seems harder to say that data producers are unfairly exploited because they aren’t also paid. At the very least, the data-as-labor view has to explain what’s so bad about all these free services users currently enjoy.

It’s telling, I think, that the examples defenders of the data-as-labor view cite, like Amazon’s Mechanical Turk, don’t suffer from this objection. MTurk users, as I understand it, simply take surveys for money, just like college students on university campuses. There’s no additional barter relationship, and no double-counting.

The second objection concerns efficiency, and I sadly don’t have a snappy name for it. But the worry here is that paying people for data might change the data they provide, in particular because the value of some data comes from the fact that it is essentially a by-product rather than intentionally produced.

In social science, Goodhart’s Law says that as soon as a measure becomes a target, it ceases to be a valuable measure. In the case of siren servers and users, user data is often valuable precisely because it is not intentionally produced and is instead a by-product of some other activity on the server’s platform. Paying users for certain kinds of data might change user behavior in unexpected ways.

For example, the athlete social network Strava lets you record your runs, swims, and bike rides using a GPS device and share the route with your network. In addition to Strava’s “freemium” model, Strava Metro sells de-identified route data to urban planning commissions and other interested users (their publicly available user heatmap also inadvertently revealed some secret military facilities). Suppose, however, that Strava began paying users as suppliers of this valuable data. Depending on how the compensation system worked, bike commuters, for example, might change their routes to earn more as “data laborers” from Strava. In the aggregate, this could make the data less valuable to, e.g., cities looking to find the best places to build bike infrastructure.

I don’t think these objections are necessarily decisive, but I do think they’re serious and I’d like to see data-as-labor advocates address them. In both cases the problem is unintended consequences: in the double-counting case, too, paying users might change the online experience they currently get for free. The problem is also that human labor’s value (as Marx emphasized!) involves instrumentally rational action, while data’s value may lie in its being a by-product of some other activity, and therefore not a fit object of instrumentally rational action.