How to handle personal data in an analytics solution

Published in AT Internet, Feb 18, 2021

Since the GDPR came into force on 25 May 2018, it has become impossible to avoid the notion of personal data. “Personal data” is discussed more and more every day, yet it is often conflated with information that directly identifies a person, such as an email address, a phone number or a social security number. This article defines precisely what personal data is under the General Data Protection Regulation (GDPR), explains how it is used in an analytics solution, and then assesses the risk of an infringing use of such data.

What is personal data?

“Personal data” is defined in article 4.1 of the GDPR as follows: “any information relating to an identified or identifiable natural person (‘data subject’); an identifiable natural person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, an online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person.”

According to European regulations, the following are therefore considered as personal data:

Information related to IP addresses, which Internet Service Providers (ISPs) can link to individual persons;
Information linked to cookie or mobile identifiers attached to a user device;
Information related to all types of identifiers, even pseudonymised, such as the username of a user logged into a service, which the publisher can link to a registered individual;
All the online behavioural characteristics of an individual.

What’s the difference between pseudonymisation and anonymisation?

On its website, the CNIL distinguishes between pseudonymisation and anonymisation:

Pseudonymisation “means the processing of personal data in such a manner that the personal data can no longer be attributed to a specific data subject without the use of additional information. It consists of replacing the directly identifying data (surname, given name, etc.) in a data set with indirectly identifying data (alias, serial number, etc.). Pseudonymisation thus makes it possible to process individuals’ data without permitting their direct identification. In practice, however, it is often possible to determine their identity by using third-party data: these data therefore must still be considered to be of a personal nature.” (See also article 4.5 of the GDPR).
Anonymisation “is a type of processing that consists in using a set of techniques that irreversibly render impossible the identification in practice of the individual person by any means whatsoever.”
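The contrast between the two definitions can be sketched in a few lines of Python. This is a minimal illustration, not a compliance recipe: the record layout, the `pseudonymise` and `aggregate_page_views` helpers and the choice of aggregation as the anonymising step are all assumptions made for the example. Whether aggregated output is truly anonymous depends on the data (very small counts can still single someone out).

```python
import secrets
from collections import Counter


def pseudonymise(records, key_field="email"):
    """Replace a directly identifying field with a random alias.

    Returns the pseudonymised records and the lookup table. Anyone
    holding the lookup table can reverse the process, so the output
    must still be treated as personal data.
    """
    lookup = {}  # real value -> alias: the "additional information"
    output = []
    for record in records:
        alias = lookup.setdefault(record[key_field], secrets.token_hex(8))
        cleaned = {k: v for k, v in record.items() if k != key_field}
        cleaned["alias"] = alias
        output.append(cleaned)
    return output, lookup


def aggregate_page_views(records):
    """Keep only per-page totals: no per-person row or identifier survives."""
    return dict(Counter(record["page"] for record in records))


# Hypothetical sample data for illustration only.
records = [
    {"email": "a@example.com", "page": "/home"},
    {"email": "a@example.com", "page": "/pricing"},
    {"email": "b@example.com", "page": "/home"},
]
pseudonymised, lookup = pseudonymise(records)
totals = aggregate_page_views(records)
```

In the pseudonymised output, the two events from the same person still share an alias and can be linked back through `lookup`. The aggregated totals keep no row per person at all.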

Please note: the definition of “personal data” is not the same everywhere in the world. For instance, North America has the notion of Personally Identifiable Information (PII), which appears to differ from the notion of personal data as defined in the GDPR.

How is personal data used in Analytics?

In any digital analytics tool, data, or events, collected via a tag are linked to a visitor identifier that allows for the creation of the segments or cohorts necessary for marketing analyses.

A distinction should be drawn in general between two types of identifiers linked to an individual that make it possible to cross-reference events or data: the visitor identifier and the user identifier.

What is a visitor identifier?

Visitor identifiers come from trackers. Any read or write operation performed on a user terminal device is considered a “tracker.”

In the UK, the ICO refers to “Cookies and similar technologies”; in Germany, the DSK refers to “Pixel, Fingerprinting-Methods, IP-Addresses, Cookie-IDs, Advertising-IDs or Unique-User-IDs”; and in France, the CNIL refers to “Cookies, Fingerprinting, Pixels or other identifiers”.

How is a visitor identifier created?

Cookie or mobile identifiers, tracking pixels, fingerprinting (corresponding to the combination of the IP address and the User Agent), or any other method that collects data on a visitor, feed a cryptographic hash function that creates a pseudonymised value common to all events originating from the same visitor using the same terminal device. These values make it possible to perform analyses based on segments or cohorts, as described above.

Note: this visitor ID is generated automatically by the analytics solution to allow basic calculations, such as visits or sessions.
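As a rough sketch of the mechanism described above, the function below derives a visitor identifier from an IP address and a User Agent. The article only says “a cryptographic hash function”; using HMAC-SHA-256 with a secret key is an assumption made here, as are the `visitor_id` function name and the input values. A real analytics solution may combine different inputs in a different way.

```python
import hashlib
import hmac


def visitor_id(ip_address: str, user_agent: str, secret_key: bytes) -> str:
    """Derive a stable pseudonymised identifier for one device.

    The same inputs always give the same output, so every event from
    the same device can be linked. That linkability is exactly why the
    result is still personal data under the GDPR.
    """
    material = f"{ip_address}|{user_agent}".encode("utf-8")
    return hmac.new(secret_key, material, hashlib.sha256).hexdigest()


# Hypothetical values for illustration only.
key = b"demo-key-not-for-production"
first = visitor_id("203.0.113.7", "Mozilla/5.0", key)
second = visitor_id("203.0.113.7", "Mozilla/5.0", key)
other = visitor_id("203.0.113.8", "Mozilla/5.0", key)
```

Here `first` and `second` are identical, while `other` differs because the IP address changed.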

What is a user identifier?

The identifier for a user, for example one logged into a platform, is usually generated by a CRM tool. This identifier is specific to the platform publisher. It may be transmitted to the analytics solution at the discretion of this publisher, which generally acts as data controller.

The purpose of this identifier is generally to allow the services provided by the platform to be personalised, in particular by following the activity of a user who is logged in on several devices, whereas the visitor identifier is unique to each device.

The publisher must ensure that the processing carried out in connection with this identifier is lawful, in particular by requesting and obtaining consent.

These two identifiers are stored in a database in order to permit the cross-referencing of events and information transmitted via the tag.

That way, analyses can be performed in the form of segments or cohorts.
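To make the cross-referencing concrete, here is a small sketch that groups the same events first by visitor identifier and then by user identifier. The event fields (`visitor_id`, `user_id`, `name`) and the `group_events` helper are hypothetical and do not reflect the schema of any particular analytics product.

```python
from collections import defaultdict


def group_events(events, key):
    """Group event names by the given identifier, skipping events without it."""
    groups = defaultdict(list)
    for event in events:
        identifier = event.get(key)
        if identifier is not None:
            groups[identifier].append(event["name"])
    return dict(groups)


# Hypothetical events: one logged-in user on two devices, one anonymous visitor.
events = [
    {"visitor_id": "v-phone", "user_id": "u-42", "name": "page_view"},
    {"visitor_id": "v-laptop", "user_id": "u-42", "name": "purchase"},
    {"visitor_id": "v-tablet", "user_id": None, "name": "page_view"},
]

by_visitor = group_events(events, "visitor_id")  # one group per device
by_user = group_events(events, "user_id")        # one group per logged-in user
```

Grouping by `visitor_id` yields three separate devices. Grouping by `user_id` links the phone and laptop events to the same person, which is what makes cross-device analysis possible.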

What is identifying and non-identifying data?

Since all events are linked at minimum to one visitor identifier, we can say that all audience measurement data is personal data by default.

It is also important to point out that depending on the number and type of properties and the type of information stored in the database, cross-referencing data may make it possible to re-identify an individual with relative ease.

For example:

So-called “contextual” information, such as traffic sources, content visited (pages, videos, products, etc.), visit duration, or session duration, is not particularly identifying unless cross-referenced with other information.
Information relating to geolocation, time frame, age range or gender is more identifying, even without cross-referencing. The combination of gender, postal code of residence and date of birth can permit identification in more than 60% of cases on average, and in 80% of cases for persons over 70 years of age. With two geolocation points, such as a place of residence and a place of work, there is a 50% probability of identifying a person, and that rises to 90% with four points.
Finally, if directly identifying information (such as an email address or phone number) is transmitted through the tag, it goes without saying that a person can be directly re-identified.
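One way to gauge how identifying a set of properties is in a given data set is to measure how many records become unique once those properties are combined. The sketch below does this on a toy data set. The `unique_share` helper, the field names and the sample records are assumptions for illustration, and the resulting figures say nothing about real populations.

```python
from collections import Counter


def unique_share(records, fields):
    """Fraction of records whose combination of `fields` appears exactly once."""
    if not records:
        return 0.0
    combos = Counter(tuple(r[f] for f in fields) for r in records)
    singles = sum(
        1 for r in records if combos[tuple(r[f] for f in fields)] == 1
    )
    return singles / len(records)


# Hypothetical records for illustration only.
records = [
    {"gender": "F", "postcode": "33000", "birth_year": 1980},
    {"gender": "F", "postcode": "33000", "birth_year": 1975},
    {"gender": "M", "postcode": "33000", "birth_year": 1980},
    {"gender": "M", "postcode": "75001", "birth_year": 1980},
]

gender_only = unique_share(records, ["gender"])
gender_postcode = unique_share(records, ["gender", "postcode"])
all_three = unique_share(records, ["gender", "postcode", "birth_year"])
```

In this toy set, no record is unique on gender alone, half are unique on gender plus postcode, and all are unique once the birth year is added. Each extra property narrows the crowd a person can hide in.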

Please note: data considered sensitive according to article 9 of the GDPR, such as data concerning racial or ethnic origin, political opinions, religious or philosophical convictions, or even trade union membership, a priori have no place in an analytics solution. If the platform publisher, as the data controller, wishes to use this type of data, it will have to take all the necessary precautions required by the GDPR, and in particular must conduct a Data Protection Impact Assessment (DPIA), also known as a Privacy Impact Assessment (PIA).

What about analytics solutions that are supposedly 100% anonymous?

Some analytics vendors claim that their product is 100% anonymous and therefore not subject to GDPR and consent.

Indeed, as described above, a complete anonymisation mechanism would take the data outside the scope of certain provisions of the GDPR. Nevertheless, a platform wishing to implement such a solution must, in its preliminary review:

Ensure that the analyses provided are relevant to its business teams;
Ensure that no trackers will be used to provide data to this solution.

As seen above, when a tracker permits the use of a given identifier, even a pseudonymised one, to cross-reference two events, that identifier constitutes personal data according to the GDPR. The necessary measures must therefore be taken to ensure compliance.

Risks and obligations of audience measurement tools in respect of the GDPR

The GDPR requires audience measurement data to be treated by default as personal data, and failure to do so exposes publishers to several types of sanctions:

Corrective measures by a supervisory authority (art. 58): deletion of data flows, deletion of data, prohibition of processing, etc.
Lawsuits by individuals and/or civil society organisations (chapter 8): legal action, representation of individuals, brand shaming, etc.
Penalties (art. 83): up to €20 million or 4% of total worldwide annual turnover, whichever is higher, for the most serious breaches.
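For the most serious breaches, article 83(5) sets the ceiling at the higher of the two amounts. The arithmetic can be sketched as follows. The `max_fine_ceiling` function name and the turnover figures are illustrative, and the result is only the upper bound: the actual fine is set case by case by the supervisory authority.

```python
def max_fine_ceiling(annual_turnover_eur: int) -> int:
    """Upper bound for the most serious breaches under art. 83(5) GDPR.

    The ceiling is the higher of EUR 20 million and 4% of total
    worldwide annual turnover. Integer arithmetic avoids rounding noise.
    """
    return max(20_000_000, annual_turnover_eur * 4 // 100)


# Hypothetical turnover figures for illustration only.
small_company = max_fine_ceiling(100_000_000)    # 4% is EUR 4M, so the EUR 20M floor applies
large_company = max_fine_ceiling(1_000_000_000)  # 4% is EUR 40M, which exceeds EUR 20M
```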

To learn more about the risks, please check out the following webinars:

In English: Data Privacy: the Strategy to Get it Right and Survive
In German: DSGVO, Tracking & Privacy-Shield Urteile: Wie können Sie Ihre Analytics rechtskonform betreiben? [GDPR, tracking and Privacy Shield judgments: how can you manage your analytics in full compliance?]
In French: Analytics & Données Personnelles nouveaux enjeux, nouveaux risques et solutions [Analytics & personal data: new challenges, new risks and solutions]

Are PII-Based Analytics Solutions GDPR Compliant?

As previously indicated, the notion of Personally Identifiable Information (PII) does not appear sufficient to meet the GDPR’s definition, and analytics solutions based on PII may increase the risks related to data capital, brand image or sanctions listed in the previous point.

In section I, “Personal Data”, of its 12 May 2020 publication “Notes on the use of Google Analytics in the non-public domain”, the DSK, the conference of German data protection supervisory authorities, states: “In Google Analytics help, Google asserts that usage data does not constitute ‘personally identifiable information.’ Not only does this point of view contradict the definition of personal data given in article 4.1 of the GDPR, it is also misleading…”

It is also important to consider compliance with this definition of personal data alongside respect for the fundamental rights of individuals, specifically, in the audience measurement context, the rights of access (art. 15) and erasure (art. 17). It must be possible to honour these rights using the pseudonymised data derived from trackers.
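In practice, this means the pseudonymised identifier has to work as a lookup key for both rights. The sketch below shows the idea with an in-memory store. The `EventStore` class and its methods are hypothetical and stand in for whatever database an analytics solution actually uses.

```python
class EventStore:
    """In-memory stand-in for the analytics database (hypothetical)."""

    def __init__(self):
        self._events = []  # each event is a dict with a "visitor_id" key

    def record(self, visitor_id, name):
        self._events.append({"visitor_id": visitor_id, "name": name})

    def access(self, visitor_id):
        """Right of access (art. 15): return copies of this visitor's events."""
        return [dict(e) for e in self._events if e["visitor_id"] == visitor_id]

    def erase(self, visitor_id):
        """Right of erasure (art. 17): delete this visitor's events.

        Returns the number of events removed.
        """
        before = len(self._events)
        self._events = [e for e in self._events if e["visitor_id"] != visitor_id]
        return before - len(self._events)


# Hypothetical usage for illustration only.
store = EventStore()
store.record("v1", "page_view")
store.record("v2", "page_view")
store.record("v1", "purchase")

exported = store.access("v1")  # everything held about visitor v1
removed = store.erase("v1")    # number of events deleted
```

After the erasure, a new access request for `v1` returns nothing, while the events of `v2` are untouched.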

In France, the CNIL specifies, in paragraph 52 (art. 5) of its Ruling no. 2020–091 of 17 September 2020 relating to the specific exemption for audience measurement trackers, and adopting guidelines for the application of article 82 of the Law of 6 January 1978 as amended on read and write operations at a user’s terminal device: “the Commission emphasises that audience measurement processing does constitute personal data processing, and is therefore subject to all the relevant provisions of the GDPR.”

On 10 February 2021, the Council of the European Union published a press release stating its position on the new ePrivacy regulation.

Article 8.1.d of the latest draft also states that an exemption from prior consent may be possible for “purpose limited audience measurement carried out by the provider of the service requested by the end-user”, in accordance with the key article 28 of the GDPR, which states that a data processor must provide the data controller with all the guarantees necessary to comply with the GDPR, especially regarding general transparency, assistance with compliance, and respect for data subjects’ rights.

Therefore, the position recently published by the CNIL in France (see above) should soon apply across the whole European Union, once the ePrivacy regulation has been adopted and has come into force following the transition period.