Data Conduct for Human Generated Data: Towards transparency and standards in data sharing through API access

Human generated data (HGD) is data generated from the human experience of artefacts that can create digital (bitstring) data. For example, on the Internet, interacting on social media generates posts; using Google Calendar creates events. With the advent of hyperconnectivity, physical objects can also create HGD, from merely opening a door that has a sensor, or wearing a Fitbit to track sleep and activities, to using an Internet-connected coffee machine. HGD is often generated, collected and stored within software applications and forms an asset of the firm that owns the application and the technology. Depending on the way it is stored and accessed, it may or may not constitute personal data or personally identifiable data as the law defines it, but it would still be considered human generated data. Human generated data differs from machine generated data in that its creation comes from a human activity rather than a machine activity. That said, human generated data can be transformed into machine generated data through algorithms, e.g. a sentiment analysis algorithm could analyse a person’s tweets and create new data on the person’s sentiment at that time.

Mechanics of Personal Data Exchange on the Internet

The earliest form of HGD was derived from supermarket transactions, surveys and polls. These gave rise to the early data brokers, which trade in firms’ consumer data obtained from various sources; individuals were generally not active agents in these transactions. Such transactions have had a market since the advent of ICT, one in which market research companies thrived by giving firms insights based on their analysis.

The development of the Internet and the proliferation of e-commerce have resulted in an explosion of HGD supply and with it, public concern about privacy. HGD is now gathered from visits to websites, then used to analyse browsing and shopping behaviours. With cookies, data can be collected across all website visits and individuals can be easily tracked as they leave behind a data trail. With clickstream and identifying information, websites can profile visitors to a high level of accuracy.

Often, firms sell HGD to data brokers or transfer it to another application. Releasing the data in this manner usually means ensuring that the data cannot be linked to a person, e.g. by scrubbing personal identity information from the rest of the data and replacing it with a generic ID, a practice more common in parts of the world with strong data protection and privacy laws. However, when it is purchased or consolidated by a data broker, the anonymised data is often “re-identified” so as to draw insights that help brands plan their consumer strategies. Re-identification is the practice of matching de-identified data with other datasets in order to discover the individual to whom the data belongs. It is through this process that firms have been able to target their offerings to the consumer segment most likely to purchase their products.
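
To make the matching step concrete, here is a minimal sketch of re-identification in Python. It is an illustration only: the records, the field names and the choice of postcode and birth year as matching fields are all assumptions made up for this example, and real matching is typically fuzzier than the exact join shown here.

```python
# Hypothetical data: every name, field and value below is invented for illustration.
deidentified = [
    {"id": "u-001", "postcode": "CB2", "birth_year": 1980, "basket": "coffee"},
    {"id": "u-002", "postcode": "CV4", "birth_year": 1975, "basket": "tea"},
    {"id": "u-003", "postcode": "CV4", "birth_year": 1990, "basket": "milk"},
]
auxiliary = [
    {"name": "Alice Example", "postcode": "CB2", "birth_year": 1980},
    {"name": "Bob Example", "postcode": "CV4", "birth_year": 1975},
    {"name": "Carol Example", "postcode": "CV4", "birth_year": 1990},
    {"name": "Dan Example", "postcode": "CV4", "birth_year": 1990},
]

# Fields that survive de-identification and also appear in the other dataset.
MATCH_FIELDS = ("postcode", "birth_year")


def reidentify(deidentified, auxiliary, keys=MATCH_FIELDS):
    """Map generic IDs to names where the matching fields point to exactly one person."""
    index = {}
    for person in auxiliary:
        index.setdefault(tuple(person[k] for k in keys), []).append(person["name"])

    matches = {}
    for record in deidentified:
        candidates = index.get(tuple(record[k] for k in keys), [])
        if len(candidates) == 1:  # ambiguous or missing matches are left alone
            matches[record["id"]] = candidates[0]
    return matches


matches = reidentify(deidentified, auxiliary)
```

In this made-up data, two of the three generic IDs resolve to a single person, while the third stays ambiguous because two people share the same postcode and birth year.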

Physical things with sensors, as they become Internet-Connected-Objects (ICO), are starting to generate petabytes of HGD, exploding the supply of HGD as an information good. This is especially so because HGD held in the walled siloes of applications would generate higher payoffs when combined with other datasets. Whether legal or not, HGD now accounts for 36% of data-brokering activities globally (Transparency Market Research 2017). When HGD becomes liberated by the firm through connectivity or re-selling, a secondary market emerges due to its potential benefit as an asset, particularly to advertisers and manufacturers, since it can be used to generate consumer insights.

A key aspect of what has changed since the HGD sharing of the 80s and 90s is speed. The value of HGD is in the understanding of a person’s context, and that understanding is often time limited. In other words, the context and insight that HGD yields has a half-life, and could expire. For example, the value of the location HGD of a person being “in Cambridge” would depend on whether you wish to recommend him a cafe around the corner, in which case it would expire the moment he leaves, or whether you wish to know he has been to Cambridge, in which case it won’t expire. HGD often has a unique property of entanglement: the mixing of content information and meta information. It is therefore both data and a signal.
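
One way to picture the half-life idea is the short Python sketch below. The exponential decay model, the function name and the numbers are assumptions chosen for illustration; the argument above only requires that some uses of HGD lose value over time while others do not.

```python
def signal_value(initial_value, age_seconds, half_life_seconds=None):
    """Value of an HGD signal after `age_seconds`, under an assumed exponential decay.

    `half_life_seconds=None` models a use that never expires,
    such as knowing that a person has been to Cambridge.
    """
    if age_seconds < 0:
        raise ValueError("age_seconds must be non-negative")
    if half_life_seconds is None:
        return initial_value
    return initial_value * 0.5 ** (age_seconds / half_life_seconds)


# "In Cambridge" used to recommend a cafe nearby: assumed 10-minute half-life.
cafe_value_after_10_min = signal_value(1.0, age_seconds=600, half_life_seconds=600)

# "Has been to Cambridge" used as a historical fact: no expiry.
history_value_after_1_year = signal_value(1.0, age_seconds=365 * 24 * 3600)
```

The same piece of location data therefore carries two different values depending on the use it is put to, which is the point of the Cambridge example.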

In the past, HGD shared through data brokers generated insights at a slower speed, which meant that real-time signals could not be easily generated. Today, HGD is shared at breakneck speed, and with trackers and clickstreams, signals are now easily created from HGD and commoditised quickly to find buyers in real time. Its most prevalent use is within the online advertising domain.

While it is easy to fall into a sense of indignation and campaign to shut down all data sharing practices, research on data sharing in the economics of privacy has found that disclosing HGD brings enormous benefits to individuals, such as immediate monetary compensation (e.g. discounts), intangible benefits (personalisation and customisation of information content) and price reduction as an effect of more targeted advertising and marketing. Of course, such sharing also brings about costs and negative externalities, for example privacy costs and subjective and objective privacy harms. Hence it is not wise to have sweeping privacy regulation that results in firms not being able to obtain HGD, as this will lead to opportunity costs and inefficiencies, not to mention giving the competitive advantage to firms sitting outside of such regulatory environments.

Keeping the identity of HGD with consent: API Access to Human Generated Data in real-time and on demand

The advancement of Internet technology has resulted in a new channel for transferring human generated data without the need for de-identification or anonymisation, but instead with the user’s consent. This is done through a set of clearly defined standards of communication between software components (whatever programming language they are written in), called Application Programming Interfaces (APIs). APIs are now one of the most common ways technology companies share data with one another. An example would be the sharing of Spotify (music streaming) data with Sonos (speaker), resulting in individuals being able to play their own Spotify playlists on Sonos speakers. Sharing HGD in this manner results in contracts between firms for data usage, with the user’s consent. From an economics perspective, it evolves HGD from being a resource within ICT systems to becoming an information/digital good provided by a source firm as an input factor to the destination firm. The mutual sharing of human generated data between applications with the consent of the individual achieves Pareto-efficient outcomes, since individuals benefit from re-using data that is locked up within other applications, and all firms benefit from more data at low marginal costs. This liberation of HGD from firms is increasingly becoming commonplace: the HGD is now actual personal data, because identity is not stripped off, and it becomes a real-time, on-demand and dynamically updated information good and a relevant signal.
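
The consent-based flow described above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not any firm’s actual API: the class `SourceApp`, its methods, and the names `music_app` and `speaker_firm` are all hypothetical, and a real integration would use an authorisation protocol over the network rather than in-memory calls.

```python
from dataclasses import dataclass, field


@dataclass
class SourceApp:
    """Hypothetical source firm that holds identified HGD for its users."""
    data: dict = field(default_factory=dict)    # user -> {scope: value}
    consents: set = field(default_factory=set)  # (user, destination, scope)

    def grant(self, user, destination, scope):
        """Record the user's consent for one destination and one kind of data."""
        self.consents.add((user, destination, scope))

    def revoke(self, user, destination, scope):
        """Withdraw a consent; later requests are refused."""
        self.consents.discard((user, destination, scope))

    def api_get(self, user, destination, scope):
        """Release identified data only if a matching consent is on record."""
        if (user, destination, scope) not in self.consents:
            raise PermissionError(f"{user} has not consented to share {scope} with {destination}")
        return self.data[user][scope]


# A music app shares a user's playlists with a speaker firm, on demand.
music_app = SourceApp(data={"alice": {"playlists": ["Morning mix"]}})
music_app.grant("alice", "speaker_firm", "playlists")
shared = music_app.api_get("alice", "speaker_firm", "playlists")
```

Because the data is read at request time rather than exported in bulk, the destination always sees the current playlists, and a revoked consent takes effect on the next request.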

While clickstreams and the traditional “anonymised data sharing” practices are still continuing, API access provides a more transparent and legitimate way of sharing data that includes the individual in the data sharing contract. The individual is still not the first party in the contract, but it is at least an improvement (to have direct first party HGD, give individuals a HAT).

It is important to note that while a thriving demand exists for HGD, a vast amount of HGD is not shared by firms. Employee data, students’ exam results and interactions on many smartphone apps are some of the data that have stayed within firms and have not been shared with third parties. However, with advancement in technologies and increasing API access, there is a growing sentiment that HGD liberated from these walled siloes could create greater innovation and opportunities, and may sway the alternative “anonymised” HGD market into greater legitimacy. On the other hand, there is also increasing fear that liberating personal data would generate externalities that are socially inefficient, compromising privacy without any mechanism for internalisation.

API access makes it possible to infer context, interests, preferences and priorities at every moment of a person’s daily activity, if the data can be made available through consent. API access is expanding, and the more HGD is collected and generated, the more likely it is to be used.

Firms sharing HGD amongst themselves mitigate the risk of falling afoul of privacy regulation by ensuring that consent is clearly and meaningfully given. This is not always easily achieved, and practices are diverse, ranging from dubious methods of dark design (e.g. making it hard not to give consent) to obviously useful sharing, such as sharing a playlist with speakers.

However, there is no way of knowing what happens after the data is shared and where the data that is shared is kept and used.

A data conduct standard for sharing through API access

Standards provide people and firms with a basis for mutual understanding and can enable greater transparency and trust. We propose that a simple standard for API access can be easily achieved by merely stating the number of nodes through which the HGD shared via API access moves. HGD could be stored only at the point of collection, e.g. within the application sitting in a device such as a phone or a TV, in which case there is 1 node. It can also be stored both on the device and at the application’s cloud service, in which case the number of nodes increases to 2. The HGD can also be on the device, in the cloud and in an internal content delivery system that achieves cross-application efficiencies within the same firm, in which case the number of nodes can increase to 3 or more. The standard does not purport to recommend what the number of nodes should be, merely to state how many there are. The purpose of creating such a standard is to provide full transparency to the source of the API access and to the user consenting to the access.
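
As a rough sketch of what such a statement could look like in machine-readable form, consider the Python below. The class name, field names and wording of the statement are assumptions made for illustration; the proposal itself only requires that the number of nodes be stated.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataConductDeclaration:
    """Hypothetical declaration of where HGD shared through an API is held."""
    api_name: str
    nodes: tuple  # names of the places the HGD is stored or moves through

    def __post_init__(self):
        if not self.nodes:
            raise ValueError("HGD must be held on at least one node")

    @property
    def node_count(self):
        return len(self.nodes)

    def statement(self):
        """Plain-language statement shown to the source firm and the consenting user."""
        return (f"{self.api_name}: shared HGD moves through "
                f"{self.node_count} node(s): {', '.join(self.nodes)}")


# The three cases described above, with made-up application names.
device_only = DataConductDeclaration("tv-app", ("device",))
device_and_cloud = DataConductDeclaration("phone-app", ("device", "cloud service"))
cross_application = DataConductDeclaration(
    "suite-app", ("device", "cloud service", "internal content delivery system")
)
```

The declaration makes no judgement about whether 1, 2 or 3 nodes is acceptable; it only makes the count visible so that users and certifiers can compare like with like.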

We believe that creating a simple standard would enable a market of certifiers and assurers to emerge, and would build on the trust needed for data sharing. By doing so, it would also speed up the introduction of innovative products, provide greater interoperability between different HGD sources and foster greater voluntary cooperation between compatible firms in the digital economy.

The HAT ecosystem, led by the HAT Community Foundation and HATLAB, is beginning its work on data conduct and standards, and inviting partners to come together to create a more trusted digital economy for data sharing. If you are interested in this, please contact me or Jonathan, our community manager, at Irene.ng@hatcommunity.org or Jonathan.holtby@community.org.