Please Trust Us — This is All the Data We’ve Collected About You

People can now access all data that companies have collected about them. As per new regulations, platforms promptly* provide their customers copies of this data upon request. But are you getting all that you’re entitled to? Or are you getting short-changed?

Published in Prifina · 6 min read · Aug 24, 2020

Both GDPR and CCPA, new regulations that address data rights and data privacy, require companies to provide copies of data they’ve collected about their customers in a “consumer-friendly” format. Yet many of these reported data sets seem incomplete; why might that be the case?

(* ranges from processing the request in a matter of minutes to 30–45 days, the maximum processing time allowed under law.)

We’ve explored data archives from 40+ different data platforms, mainly those with the highest consumer engagement and activity (Amazon, Facebook, Google, 23andMe, Ancestry, etc.). You can get copies of your data from these companies with a few clicks and some patience. Oftentimes, you’ll receive millions of lines of raw data, and sometimes archives of 50 GB or more.

While that is impressive, a nagging concern remains: the report you receive may not contain all the information that the business has on you. Are you getting swindled? Let’s explore this in a bit more detail with some practical examples.

What Kinds of Personal Data Do Companies Collect About You?

© Daniel Ali and Paulius Jurcys, Prifina

There are four layers of data that companies collect about you:

input data — data you add (pictures, comments, posts, etc.);
metadata — data that gets created while uploading or creating input data;
generated data — data that is created based on your usage of the service;
derivative data — data that is productized based on activities by the service provider.
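To make the four layers a bit more concrete, here is a minimal Python sketch that tags a handful of data items with the layer they would most plausibly belong to. The item names and their assignments are hypothetical illustrations of ours, not fields taken from any real export.

```python
from enum import Enum


class DataLayer(Enum):
    """The four layers of personal data described above."""
    INPUT = "input data"
    METADATA = "metadata"
    GENERATED = "generated data"
    DERIVATIVE = "derivative data"


# Hypothetical examples; real exports use their own names and structures.
EXAMPLES = {
    "photo you uploaded": DataLayer.INPUT,
    "upload timestamp": DataLayer.METADATA,
    "watch history": DataLayer.GENERATED,
    "ad interest segment": DataLayer.DERIVATIVE,
}


def group_by_layer(items):
    """Return a dict mapping each layer to the list of item names tagged with it."""
    grouped = {layer: [] for layer in DataLayer}
    for name, layer in items.items():
        grouped[layer].append(name)
    return grouped


if __name__ == "__main__":
    for layer, names in group_by_layer(EXAMPLES).items():
        print(f"{layer.value}: {names}")
```

Sorting the contents of an export this way makes it easier to spot which layers are well represented and which are thin or absent.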

As an example, let’s look at the data export functions of Facebook and Google, both of which return your data within hours rather than weeks. For Facebook, the functionality looks as follows (see Your Data on FB Here). I’d encourage you to actually look at your own personal data here, to get a sense of what you will receive.

Now let’s look at Google’s Takeout functionality. You can access this functionality yourself here.

If you browse the information you find on these two pages — which, for the record, are great examples of user-friendly portals — you can see the vast amount of data you will receive.

After downloading this data, you will see that activity-related data is incredibly deep, while the data regarding your own personal information is narrow and limited. The latter is mainly the first layer described above, the input data you’ve added yourself, and in some cases even that data is stripped (e.g., the photos you export from Google are missing their metadata, possibly because the service removes it when they’re first uploaded).
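If you want to see that imbalance for yourself, a rough way is to add up how many bytes each top-level folder of the downloaded archive contains. The sketch below assumes the export arrives as a single .zip file; the function name summarize_export is our own, and folder names will differ from platform to platform.

```python
import zipfile
from collections import Counter


def summarize_export(archive):
    """Return (top-level folder, total uncompressed bytes) pairs, largest first.

    `archive` is a path to a .zip export or a file-like object containing one.
    Files stored at the root of the archive are listed under their own name.
    """
    sizes = Counter()
    with zipfile.ZipFile(archive) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue  # directory entries carry no data of their own
            top_level = info.filename.split("/", 1)[0]
            sizes[top_level] += info.file_size
    return sizes.most_common()
```

Calling summarize_export on your own archive gives a quick ranking of where the bulk of the data sits, which in our experience is the activity logs rather than the profile information.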

What Data Should You Actually Receive?

Under new regulations, like GDPR and CCPA, you have the right to obtain all the data we talked about in the previous section; it’s considered your property. You can explore more of what personal data ownership means in this Medium post.

According to CCPA, data platforms have to provide:

categories of personal information;
categories of sources from which the personal information is collected;
the business or commercial purpose for collecting or selling personal information;
categories of third parties with whom the business shares personal information;
the specific pieces of personal information it has collected about the consumer.

According to GDPR, data platforms have to provide:

the purposes behind processing data;
categories of personal data concerned;
the recipients or categories of recipient to whom personal data has been or will be disclosed, in particular recipients in third countries or international organizations;
where possible, the period for which personal data will be stored, or, if not possible, the criteria used to determine such a period;
any available information as to the source of the data, in cases where personal data is not collected directly from the data subject;
the existence of automated decision-making, including profiling, along with meaningful information about the logic involved and the significance and envisaged consequences of such processing for the data subject.
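One practical way to use these two lists is as a checklist against whatever a platform actually sends you. The sketch below is only an illustration: the short labels are our own paraphrases of the items above, not legal wording, and deciding whether an export really covers an item still takes a human read.

```python
# Short labels paraphrasing the items listed above (not legal text).
CCPA_ITEMS = [
    "categories of personal information",
    "categories of sources",
    "purpose of collection or sale",
    "categories of third parties",
    "specific pieces of personal information",
]

GDPR_ITEMS = [
    "purposes of processing",
    "categories of personal data",
    "recipients of the data",
    "storage period",
    "source of the data",
    "automated decision-making",
]


def missing_items(required, received):
    """Return the required items that do not appear in `received`, in list order."""
    return [item for item in required if item not in received]


if __name__ == "__main__":
    # Hypothetical result of reading through one export by hand.
    found = {"categories of personal data", "purposes of processing"}
    print(missing_items(GDPR_ITEMS, found))
```

Keeping a list like this per platform makes it easy to compare exports and to see which items tend to go missing.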

So, What’s Missing?

There are two clear categories that are often missing in data exports:

data shared with third parties, and
productized data used in the platforms.

But what if they just don’t have that data? Might that be the reason behind its absence? That would be highly unlikely. Data sharing and productized data are common business practices, and the likelihood of businesses not participating in either of these practices is extremely low.

Data Shared With Third Parties

There may be several reasons why this data is often not provided, even though companies are required to provide it under law. Data sharing is a sensitive and contentious issue for many businesses; showcasing this information may cause more uproar and problems than simply withholding it. I’ve explored cookie data and third-party sharing in the post “Hundreds of Companies Are Having a Party with my Cookie Data and I Wasn’t Invited” in Towards Data Science; inquiries into where all the cookie data goes are frequently met with resistance and reluctance.

Publicly disclosing who your data is shared with may get you a ticket to Capitol Hill. This comes up far more often in conversations than you might realize. For many companies, it’s a real concern.

Productized Data Used in Platforms

Businesses have invested heavily in their own proprietary processes to create different segments, profiles, and products based on data they’ve collected about you. Their goal is to better serve you, better target you, and, sometimes, give third parties the option to create more value for you (although this is often hit or miss). These productizations of data are often seen as a company’s own proprietary assets or trade secrets, and the extent to which they are disclosed to the consumer varies quite a lot.

We’ve explored these data productizations previously, such as in this post on Why Spotify Erroneously Thinks I’m Right Wing. Would data that is used to make decisions about a consumer’s service be provided to them or not? The legal framework may be clear but the practical status seems to vary in today’s market.

Why Does This Matter?

Personal data is becoming the richest data class. However, it is still often siloed away in different platforms. By making this data layer more accessible and easier to use for consumers and developers, we can open a new data market that includes all stakeholders. To make this market a reality, we need individuals to have as much of their own data as possible.

To get the most value out of personal data, we have to ensure that personal data requests are complete, comprehensive, and provided in the intended manner. This will foster the creation of new types of data applications that can be offered directly to the end users, without developers having to be employed by the NSA to get access to data.

What deficiencies have you found in your personal data exports? I’d love to compare notes between different platforms and discuss why our data seems incomplete.

Connect With Us and Stay in Touch

Prifina allows you, as an individual, to bring your data from different devices and services into one place under your control. Then, you can take that data and power different applications that give you daily value, such as insights or recommendations, without sharing it with anyone.

You can follow us on Twitter, Medium, LinkedIn, and Facebook or listen to our podcast. Join our Facebook group, “Liberty. Equality. Data.”, where we share notes about Prifina’s progress. You can also explore our GitHub channel.