I asked 4 of the biggest websites for my data.
Whether it’s liking an Instagram post or checking your favorite news app for the latest Covid information, we’re continuously and unavoidably generating data. As we browse and binge, our behavior is tracked, stored and interpreted, eventually being sold back to us in the form of scarily accurate targeted advertising. If the product is free, you are the product. And what makes up “you” online? You are your data.
The scary part of this is that you probably already know and accept this. How many times have you joked about Alexa secretly recording you, or talked about a product in a private chat only for it to be advertised to you the next day? We know we are being mined and squeezed for more and more data, and we explicitly agree to this by accepting terms and conditions.
But realistically, how much data do these companies actually have on us? I decided to find out.
The Chosen 4
I don’t know about you, but my email is registered with a lot, and I mean a lot, of online companies. Google Password Manager suggests that it’s nearly 300, and this doesn’t include things I’ve signed up for via my phone number, with throwaway emails, or through third-party integrations (such as registering with an existing Facebook account). My digital footprint is huge, so I wanted to narrow it down to just 4 companies that I believed would have the most data about me. Here’s the list:
- Google: A pretty obvious choice, as my entire digital life seems to revolve around different Google products: Gmail, Photos, Drive, Android, etc.
- YouTube: Similar to Google, but distinct enough for me to think the data it holds about me is worth looking into
- Facebook: I don’t use Facebook for anything more than Messenger nowadays, but I have had my account for years and used to engage with content regularly, so there will probably be a lot of old data
- Reddit: Reddit is my main and most active form of social media. I don’t think they have as much data on me as Google or Facebook, but it’s a wildcard
Most things I do on the internet probably revolve around these 4 websites in some way or another, so let's see just how much and what type of data I have given over to them.
How to ask for your own data
Thanks to a handy thing in the UK called the Data Protection Act, all UK citizens are entitled to submit a Subject Access Request (SAR) to any individual or organization that stores personal data about them. The recipient of the SAR has about 30 days (one calendar month) to comply or face a penalty. I envisioned the process of asking for my personal data from a big company like Google to be very daunting and time-consuming, but as it happened I received my personal data from all 4 websites within 24 hours.
As much as I disagree with certain UK laws and politics, the Data Protection Act is a very empowering tool, allowing any individual to gain insight into their personal data: how much of it is held, where it is shared, and who it may have been shared with. In an “Us vs Them” view of People vs Corporations, this definitely feels like a solid victory for the people.
To help submit my SARs I used a website called Rightly. It’s as easy as selecting the companies you want to contact and supplying some details, and then Rightly takes the first step in contacting these companies. And don’t worry, the irony of signing up to a service that helps people reclaim their personal data isn’t lost on me.
After Rightly had opened the conversation, most companies replied with a link for requesting my data; Google has an entire service dedicated to this, aptly called “Takeout”.
Within 24 hours and with a small amount of link following and email checking, I now had access to all my data.
I really had no idea what file sizes to expect. Given that Facebook and Google have millions, if not billions, of users, I thought it was fairly unlikely that each user would have several gigabytes of data, so I thought maybe ~2GB each was a fair assumption.
This was not a fair assumption.
I had vastly underestimated how big my footprint was. Here are the total data file sizes for each of the 4 companies:
- Google: 8.8GB
- YouTube: 300MB
- Facebook: 13.2GB (!!!)
- Reddit: 2.1MB
These vastly different sizes seem a bit confusing at first but actually do make a lot of sense. I only use Facebook for Messenger, and the Messenger data included a lot of media files, which made the export a lot bigger than I expected. The Google export was similar, as it also included a lot of media.
Reddit was equally surprising, just in the opposite direction. As I don’t make media posts on Reddit, I expected it to be smaller than Google or Facebook, but at just over 2MB it’s roughly 4,000x smaller than the Google export and over 6,000x smaller than the Facebook one.
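As a rough check of that comparison, here is a minimal Python sketch that works out how many times larger each export is than the Reddit one. It uses only the four sizes listed above and assumes decimal units (1GB = 1000MB); the dictionary and function names are made up for illustration.

```python
# Export sizes from the list above, converted to megabytes.
# Assumption: decimal units, i.e. 1 GB = 1000 MB.
EXPORT_SIZES_MB = {
    "Google": 8_800,
    "YouTube": 300,
    "Facebook": 13_200,
    "Reddit": 2.1,
}


def times_larger_than(baseline: str, sizes: dict) -> dict:
    """Return how many times larger each export is than the baseline export."""
    base = sizes[baseline]
    return {name: size / base for name, size in sizes.items() if name != baseline}


if __name__ == "__main__":
    for name, ratio in times_larger_than("Reddit", EXPORT_SIZES_MB).items():
        print(f"{name}: about {ratio:,.0f}x the Reddit export")
```

With these figures, the Google export comes out at a little over 4,000 times the size of the Reddit one, and the Facebook export at a little over 6,000 times.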
Digging into the Data
Now that I had this mountain of data, I started to break it down and note any interesting findings. Let’s have a look at what the data entails for each of these:
As I mentioned before, I use so many Google services that I can’t cover them all in depth, but the majority of the file size came from Google Photos. All pictures I take with my Android phone are automatically backed up to Google Photos (I can’t remember whether I ever opted into this), so I wasn’t surprised to see photos going back to 2014. However, bear in mind that when you ‘delete’ a photo in Google Photos, it doesn’t actually get deleted right away. Instead, it lands in your Trash bin, which is cleared after 60 days (if you don’t empty it manually). I’m sure a lot of people see this as a helpful feature from Google, but personally, it smacks of Google’s unwillingness to let go of your data, holding onto it for 2 months after you first tried to delete it.
The Google Pay folder is also a goldmine of data, supplying an HTML page for you to browse all your recorded GPay transactions. And in case that didn’t feel like enough tracking already, most of these transactions are accompanied by a map link so you can see where they took place.
As you may expect, there are piles of data for every website you’ve ever visited and every advert you’ve ever clicked, plus endless activity logs for the use of any Google product. But let’s get onto the creepier bits.
Google Location History. This was a really eyebrow-raising one. I knew that Google was monitoring my location via GPS, but I didn’t quite know to what extent. Opening a file called LocationHistory.json greeted me with about 100MB of data dedicated to my location. That’s about 50x the total Reddit data size. Every single data capture records a time, coordinates, and the activity Google thinks you were doing at that time, e.g. car, train or walking. Slightly more worryingly, I found that for the majority of my GPS records the most confident activity type was “STILL”, which means Google was checking my location when I hadn’t even moved. For some reason, the idea of being spied on while not even moving seems worse than sending data when I’ve actually changed location.
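If you want to see for yourself how often each activity type appears, a short Python sketch along these lines can tally them. The file layout here is an assumption based on older Takeout exports (a top-level "locations" list whose entries may carry an "activity" list of guesses, each with a "type" and a "confidence"); your export may be structured differently, and the sample data below is made up.

```python
import json
from collections import Counter


def count_top_activities(history: dict) -> Counter:
    """Tally the highest-confidence activity type for every activity reading."""
    counts = Counter()
    for location in history.get("locations", []):
        # Not every location record carries activity guesses.
        for reading in location.get("activity", []):
            candidates = reading.get("activity", [])
            if candidates:
                best = max(candidates, key=lambda c: c.get("confidence", 0))
                counts[best.get("type", "UNKNOWN")] += 1
    return counts


# Tiny made-up sample in the assumed layout; this is not real data.
SAMPLE = json.loads("""
{"locations": [
  {"timestampMs": "1600000000000", "latitudeE7": 515000000, "longitudeE7": -1000000,
   "activity": [{"timestampMs": "1600000000500",
                 "activity": [{"type": "STILL", "confidence": 90},
                              {"type": "IN_VEHICLE", "confidence": 10}]}]},
  {"timestampMs": "1600000600000", "latitudeE7": 515100000, "longitudeE7": -1100000,
   "activity": [{"timestampMs": "1600000600500",
                 "activity": [{"type": "WALKING", "confidence": 75},
                              {"type": "STILL", "confidence": 25}]}]},
  {"timestampMs": "1600001200000", "latitudeE7": 515100000, "longitudeE7": -1100000}
]}
""")

if __name__ == "__main__":
    print(count_top_activities(SAMPLE).most_common())
```

To run it against a real export, load the file with `json.load(open("LocationHistory.json"))` and pass the result to `count_top_activities` instead of the sample.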
The final creepy folder I found was called Voice and Audio, which contained about 800 recordings of me talking to Google Assistant (on my phone or via a Google Home device). I suppose this is relatively harmless, but I have to ask: why does Google need this data? Is it to train an AI to talk like me? Well, no, probably not, but it’s creepy all the same. If you have ever spoken to your Google Assistant, that voice clip of you is now Google’s property and they intend to keep it indefinitely. I listened to a few recordings and thankfully, in all of them, I seem to be aware that I was talking to Google; they are not just randomly recorded conversations. However, some people have reported that their personal data records contain whole conversations which they did not consent to having recorded.
For me, Facebook was the one I was most scared about, probably because I only use it for messaging now but was very active in the past.
Straight away I saw a location history folder. Thankfully I don’t think it’s quite as oppressive as Google's location history polling, but every time you log into Facebook, your location and network data is captured and saved for their use.
Facebook also had all the data I’d come to expect, such as logs of every search, like given or received, page visited and video watched.
By far the most worrying part of the Facebook data was the “Messages” folder. In it was a complete historical view of every single conversation I’ve ever had on Facebook. If you’re like me and have been talking to people on Messenger since the early 2010s, you’re now looking at an alarming amount of media data. Think 4000 photos per conversation alarming. There doesn’t appear to be an expiration date for these photos, so just remember that when you go to send a photo on Messenger, Facebook will own a copy of this photo forever. Reading the message logs was quite nostalgic, but I realized that with some old friends I had over 1 million messages, all of which Facebook had access to. It’s surprising, then, that with that amount of information about me, Facebook still managed to guess my interests so incredibly wrong.
I had to extract the YouTube data from the Google data bundle, as YouTube is owned by Google. After looking at the Google data I expected the YouTube stuff to be equally heinous, but I was pleasantly surprised: no location data, no long historical data. In all honesty, it’s relatively barebones, consisting of simple lists of liked videos, subscribed channels, and comments made. Comparing YouTube to Facebook feels like night and day, but remember that YouTube isn’t really standalone; the reason it doesn’t need to collect as much data as Facebook is that Google does it all for them.
The equivalent of an oasis in this desert of data. At only ~2MB, Reddit keeps by far the least data on me, even though it likely makes up the biggest portion of my browsing time. There’s nothing too hair-raising in the data Reddit keeps: the vast majority is similar to YouTube, in that there are just lists and lists of upvoted comments, gilded posts, private messages, etc.
The most interesting thing I found was in the “comments.csv” file, which is a list of all the comments I’ve made on Reddit content. On Reddit you are able to delete comments so they are no longer public-facing and not shown in your profile. However, these comments are still kept by Reddit and visible in your overall exported comment data. I guess that’s not as much of a surprise as Google keeping voice recordings, but again it’s a reminder that there is no such thing as deleting on the internet.
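A quick way to get a feel for a file like this is to list its columns and count its rows, then compare that count with the number of comments still visible on your profile. The Python sketch below makes no assumptions about which columns Reddit includes; the function name and the sample columns ("id", "body") are hypothetical and only there for illustration.

```python
import csv
import io


def summarise_csv(handle):
    """Return (column names, number of data rows) for an open CSV file."""
    reader = csv.DictReader(handle)
    # csv handles quoted fields, so commas or line breaks inside a comment
    # body do not inflate the row count.
    row_count = sum(1 for _ in reader)
    return list(reader.fieldnames or []), row_count


# Made-up sample with hypothetical column names; not the real export layout.
SAMPLE = io.StringIO('id,body\nabc,"first comment"\ndef,"second, with a comma"\n')

if __name__ == "__main__":
    columns, rows = summarise_csv(SAMPLE)
    print(f"columns: {columns}, rows: {rows}")
```

For a real export, open the file with `open("comments.csv", newline="", encoding="utf-8")` and pass the handle to `summarise_csv`.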
I’ll admit it: when I came up with this idea of sending SARs, I was more interested in writing this article than actually concerned about my personal data. But after seeing the extent of my digital footprint, it’s the other way around.
But let’s not pretend it’s all doom and gloom. I’m not naive enough to think that there shouldn’t be any of my personal data stored anywhere. In fact, you could go so far as to say that there are situations where the storing and sharing of your personal data is a good thing, the obvious example being more personalized content and advertisements. I’d be annoyed if I kept receiving adverts for hair care products if I were bald, or for lawnmowers if I lived in a high-rise flat. Similarly, our YouTube home pages are tailored to our personal preferences, allowing us to consume more relevant content and YouTube to gain more ad revenue. Win-win.
What concerns me more than the sheer amount of data these companies have about me is the types of data. I’m much happier for my preference for watching kitten videos to be recorded than I am for my location to be. Compound that with years of old personal photos and voice clips… and it’s not hard to see why people see big tech companies as a personal security devil.
I think, as with most things, it’s about finding a balance. As customers of these products, we are not completely powerless. It’s my responsibility to determine whether the data these companies keep on me is acceptable to me or not (assuming that these companies are following the law). I’m not talking about reading the entire Terms and Conditions (I don’t believe anyone ever has), but it’s important to be informed about your own data. I know several people who have quit Facebook citing personal data privacy as the main issue (the irony is that a lot of them still use Instagram and Twitter religiously).
I highly encourage you to submit your own Subject Access Requests and take a minute to think about your personal data and digital footprint. It has been a surprisingly impactful journey of discovery, and frankly it’s just interesting to be able to snoop on yourself.
Oh, and don’t worry about the Covid vaccine giving you a microchip - you probably carry around the world’s most effective pocket spy daily.