Facebook Knows What You Eat: Discover The Entire Data Facebook Collects About You, Step By Step.

Avi Lumelsky
Feb 3 · 7 min read

I bet most Facebook users are not aware of what Facebook really knows about them.
What if I told you that YOU can visualize it in just 5 minutes?

This is the story of how I explored https://facebook.com/dyi programmatically.

I’m gonna show you how to do it yourself, and we will explore my (censored) Facebook data together.

A (pretty censored) version of the data I’m about to show you. Open-Source code at the end of the post.

Some spooky ads I ran into the other day were related to something I had, most certainly, 100%, been talking about in person near my phone (not news to anyone, but boy, they are insolent!). That got me thinking about what Facebook REALLY knows about me.

I am grateful for all the good Facebook has brought so far, and I appreciate the company. They have done some amazing things. This post’s intention is to open your eyes to how the advertisement world works, with a real-world example.

At the end of the day, it works like any other free software product: if you don’t pay for it, your data is the key component of their business.

I found some interesting “Facebook-related data” about me in Facebook’s archive and I believe you would like it.

I have no idea how they got it, but we can all guess together …
it has something to do with $$$$$, whether they bought it or collected it themselves.

From my dashboard — which is about to be reviewed

Giant (and also small) companies must comply with the latest privacy laws if they wish to scale worldwide or enter the European or Californian market.

Facebook, one of the biggest players in the advertisement oligopoly (next to Google), lets its users download their data for free at any time.

So I created a new archive. After a few hours, I received a download link by email.

The “Download Your Information” section on Facebook. It may take a few hours for the results to be ready.

The HTML files contained an offline version of https://www.facebook.com/your_information.
It is really nice, but after a few minutes of jumping between pages, I realized there must be a better (more efficient) way.

So I looked at the JSON files, which contained a lot of information that the HTML versions were missing.

The HTML version is skinnier. That obviously makes it harder for people, even technical ones like me, to really understand the data and see the whole picture.

You can choose an HTML version, or a very detailed JSON output archive.
I asked for both. The really interesting data is hidden deep inside the JSON version.
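A quick way to see where the bulk of the JSON data sits is to list the archive’s `.json` members by size. The sketch below is not taken from the post’s repository; it uses only Python’s standard `zipfile` module, and the function name `list_json_members` is made up for illustration.

```python
import zipfile
from typing import BinaryIO, List, Tuple, Union


def list_json_members(archive: Union[str, BinaryIO]) -> List[Tuple[str, int]]:
    """Return (member name, uncompressed size in bytes) for each .json file, largest first.

    `archive` is a path to the downloaded ZIP or an open binary file object.
    """
    with zipfile.ZipFile(archive) as zf:
        members = [
            (info.filename, info.file_size)
            for info in zf.infolist()
            if info.filename.lower().endswith(".json")
        ]
    # Largest files first: they usually hold the most events.
    return sorted(members, key=lambda item: item[1], reverse=True)
```

Calling `list_json_members("facebook-archive.zip")` (a placeholder file name) and printing the first few entries is enough to decide which files are worth indexing first.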

Facebook Archive JSON version is more verbose than HTML version

So I downloaded their JSON format and focused on the data that I could not see completely in the HTML web interface (and there is lots of it).
There is clearly more to be shown: although the HTML archive is an offline copy of the data, the web interface hides some of the events.

I wanted to see the entire picture in front of my eyes.
I felt like I could understand it better using graphs, numbers, and lines, rather than through their website/HTML archive.

It would be cool to see it on one screen.
I needed to stream that data and index it somewhere…
No way I’m gonna look through everything, file by file, or page by page.

Indexing The Data

For the geeks among us, let me elaborate.

I chose to go with a Docker stack composed of the ELK components (ElasticSearch + LogStash + Kibana). I implemented a custom LogStash file stream that preprocesses the interesting JSON files from the downloaded ZIP archive and sends them to ElasticSearch.

Then I split the records that contained multiple elements (events, clicks, likes, notifications, marketplace item clicks, etc.) so I could query real Facebook objects easily (as if they were exposed through a GraphQL API), and created ElasticSearch indexes with the fields I found interesting.
The ElasticSearch service in the stack indexes every event in the archive that Facebook’s servers ever recorded about me.
Later on, we are going to visualize the data and research it using Kibana.
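In the post this split-and-index step is done by LogStash. As a rough Python stand-in, the sketch below flattens records with a nested list into one document per event and renders them in the newline-delimited format that Elasticsearch’s `_bulk` endpoint accepts. The field names (`name`, `events`), the `event_` prefix, and the index name `facebook-events` are assumptions for illustration, not the author’s actual mapping.

```python
import json
from typing import Any, Dict, Iterable, Iterator, List


def split_events(
    records: List[Dict[str, Any]], list_key: str = "events"
) -> Iterator[Dict[str, Any]]:
    """Yield one flat document per nested event, copying the parent's scalar fields."""
    for record in records:
        parent = {
            key: value
            for key, value in record.items()
            if key != list_key and not isinstance(value, (list, dict))
        }
        for event in record.get(list_key, []):
            doc = dict(parent)
            # Prefix event fields so they cannot collide with parent fields.
            doc.update({f"event_{key}": value for key, value in event.items()})
            yield doc


def to_bulk_body(docs: Iterable[Dict[str, Any]], index: str) -> str:
    """Render documents as an NDJSON bulk body: one action line, then one source line."""
    lines: List[str] = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(doc))
    # The bulk format requires a trailing newline after the last line.
    return "\n".join(lines) + "\n" if lines else ""
```

The resulting string could be sent to a local cluster’s `_bulk` endpoint with the `Content-Type: application/x-ndjson` header. Keeping one event per document is what makes per-event filters and aggregations in Kibana straightforward.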

The Compose file (which can be translated to Kubernetes services using Kompose) is available at the end of the post.

I ran docker-compose up and everything started running.
After some samples went through ElasticSearch, I created the index patterns manually with one click.

Finally, using Kibana, I started to work with these pieces of data and understood what I was facing.

Finding The Right Place To Start

There is an insane amount of data about me.
I am not talking about the obvious data Facebook collects from their services… A lot of the data came from other places.

I had to focus. There was too much.
Let’s take a look at the Ads And Business dashboard I created, which covers only some of Facebook’s relationships with more than 300 companies.

So, my question is: what do they buy and sell about me?
After years of using their products (Facebook, Instagram, WhatsApp) for free, they must know me pretty well.

The Results

Facebook receives much of its information from 3rd party apps and services. I didn’t even need to actually have a Facebook, an Instagram, or a WhatsApp account connected to any of these apps in order to be identified.

I had to make the data visible somehow:


They know about each and every digital event that took place on 300+ websites that I use, or have used, without logging in with Facebook or linking them to it in any way.

They know how I pay for each and every service, what apps I’ve installed or changed, and what websites I visited, thanks to Facebook Pixel technology and other “great, privacy-respecting” tools.

Another great question came up — who inside Facebook can really see this data?
I bet you have also used, at least once, one of the following:

  • Spotify
  • Airbnb
  • Zoom
  • Food Delivery Services (Like Wolt)
  • Domain Registrars
  • Air and Travel Agencies
  • Shopping websites
  • Many Landing Pages
  • Social/Music/Video Streaming platforms

Facebook had my browsing history for a variety of these websites. You can stop worrying about occasionally deleting yours.
Do you believe it? Well, here are some examples from my personal dashboard.

Facebook knows what I eat, whenever I order delivery.

Food Delivery services Logs — 3rd party cookies everywhere — Facebook logged my cart events. I wasn’t logged in.

They know about my personal user flows on 300+ websites, from login to cart and checkout.

A treemap of some Off-Facebook Activity events; every square is a website. This let me see which services sent Facebook the most events about me. A bigger square means a less privacy-respecting service.
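The ranking behind such a treemap boils down to counting events per website. A minimal sketch, assuming the off-Facebook activity has already been loaded into a list of per-site records, each with a `name` and a list of `events` (these field names are assumptions, not confirmed by the post):

```python
from collections import Counter
from typing import Any, Dict, List, Tuple


def events_per_site(activity: List[Dict[str, Any]]) -> List[Tuple[str, int]]:
    """Count events per site name and return (site, count) pairs, most events first."""
    counts: Counter = Counter()
    for site in activity:
        # Records with the same name are summed; a missing name falls back to "unknown".
        counts[site.get("name", "unknown")] += len(site.get("events", []))
    return counts.most_common()
```

The sites at the top of the list are the ones that would show up as the largest squares.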

They know which sites I view, even when looking at previews.

They know where, how, and when I fly and travel: my ticket and accommodation bookings were logged.

Checkout and Payment detailed events — Each block is a unique transaction. Each Color is a different type of event.

They (obviously) know my music taste. From Spotify, not from page likes.

Spotify Sent Facebook many Events.

And then there are the “CUSTOM” data events, some nasty data they would rather not name, which got me asking: why? What the **** is CUSTOM?

“CUSTOM”: this piece of data is private, and Facebook does not share it. Almost 1/5 of the events in the archive are of type CUSTOM, and I cannot know what they mean.
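To check a figure like this against your own archive, compute the share of each event type. The sketch below assumes the same hypothetical shape as before: a list of per-site records whose `events` each carry a `type` field.

```python
from collections import Counter
from typing import Any, Dict, List


def event_type_shares(activity: List[Dict[str, Any]]) -> Dict[str, float]:
    """Return the fraction of all events that belongs to each event type."""
    counts = Counter(
        event.get("type", "UNKNOWN")
        for site in activity
        for event in site.get("events", [])
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    # most_common() keeps the most frequent types first in the resulting dict.
    return {event_type: n / total for event_type, n in counts.most_common()}
```

Looking up `"CUSTOM"` in the result gives the share of unnamed events in your own data.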

I almost forgot the most precious jewel: “advertisers_who_uploaded_a_contact_list_with_your_information.json”

If that wasn’t enough, I can also find “profile_information.json” among the files.
This file, profile_information, included all the top facts about me, with some timestamp values of 0 stored in tuples, as if it were laid out for data science projects, experiments, and queries.

It’s like a due diligence report on you, provided by Facebook: relationships, interests, skills, etc. Think of it as your basic profile info, plus a lot of historical and internal data that Facebook does not use or support in its user interfaces. These fields are used by computers, for computers, to customize your ads (or to sell them to whoever pays the most).

I have much more to cover, but I have started an open-source skeleton for you curious people (who read the article all the way through) so you can download your data and visualize it immediately on your computer.

When you find more interesting files in your archive that I didn’t mention, it is really easy to add new data to the pipeline, and I am sure you will discover more. There will be a part 2 to this blog post!

Avi Lumelsky — I’m a Software Engineer / Security Researcher.
I practice mostly in the fields of Privacy, Deep Learning, and Cyber Security.

I am a co-founder of DATU.
DATU lets you control your online data without providing any sensitive information.

We continuously scan the personal information you choose to find security and privacy flaws.

Within 1 minute you will see your public data that anyone with basic computer knowledge can find about you.

You will know which parts of your identity were compromised, leaked, or used by others.

I will cover the progress in future posts.
I hope you enjoyed this article.
You can also read my previous article about Google Phishing POC.

The code is available at https://github.com/avilum/facebook-archive-analyzer.

DATU is currently working on a version of this project that will run in your browser, offline.
All you will have to do to see your personal data
is select the local ZIP archive that you got from Facebook by email.
It will explore your entire data within your browser automatically!

You are more than welcome to follow me on GitHub.
Feel free to comment, leave questions or feedback,
and of course — Sharing is Caring.

Digital Diplomacy

Tech, digital, and innovation, at the intersection with policy, government, and social good.