Facebook and your Personal Data

Daniel Coloma
Mar 22, 2018

There is currently an intense public debate about the news that personal information obtained from Facebook has been commercialized and used to micro-segment Facebook campaigns and, potentially, to influence elections. This has been possible because of two different things:

  • The mechanisms that Facebook offers to developers to create apps that integrate with the Facebook ecosystem (APIs and users)
  • The mechanisms that Facebook offers to advertisers to display ads to users.

Most people are focusing on the first one: how a developer didn’t honor the agreed Terms and Conditions and how he commercialized information that should have been used only for research purposes. However, very few discussions are under way about how Facebook lets advertisers create campaigns based on sensitive or personal information.

Facebook has been very vocal about what they are doing to stop developers from extracting personal information from their users and to prevent fake news from being spread, but they are not talking about what I consider to be the origin of all these problems: the Facebook advertisement system is designed in a way that incentivizes advertisers to use people’s personal or sensitive information to create audiences.

I think it’s essential to stress that this case is just the tip of the iceberg, one of many situations where personal data is being used and abused, and that the only way for Facebook to stop this is to radically change the way in which they let advertisers create custom audiences and target users.

How does Facebook use your personal data?

To understand how Facebook uses people’s personal data, it’s key to look at Facebook’s main (and nearly only) revenue source: online advertisement. In fact, one could argue about whether Facebook is a social network or an advertisement network: 98% of Facebook’s revenue comes from online advertisement, and that percentage has not changed over the last fiscal years:

Facebook revenue per quarter (advertising vs. other revenue). Source: Facebook.

This huge amount of revenue comes from the individual contribution every Facebook user makes. Every Facebook user increases Facebook’s revenue mainly through two different but related actions:

  • Impressions: When a Facebook ad is shown to her.
  • Clicks: When she performs an action on an ad (click, like, app download…).

The revenue for Facebook (and hence the cost for an advertiser) of an impression is significantly smaller than that of an action. For instance, in 2016 it was reported that the cost of a click (cost-per-click, CPC) could be $1.11 for a US user, whereas the cost of an impression could be around $0.0085. Indeed, the cost of a single impression is usually so low that Facebook quotes the cost per thousand impressions instead: CPM (cost-per-mille).

There are many studies of CPM and CPC values and how they change over time. A significant one was carried out by University Carlos III of Madrid, whose researchers presented some of the work at a Data Transparency Lab conference. The university has developed a tool that lets users estimate, in real time, how much money Facebook makes from their online activity. I encourage you to download and test it to get a grasp of how you are contributing to Facebook’s revenue: https://fdvt.org/

Given this, Facebook has two main driving forces to increase its revenue:

  • Increase the number of Facebook Active Users
  • Increase the advertisement revenue per active user

Obviously, the best way to achieve the latter is to increase the number of clicks on ads: an extra click is (revenue-wise) more effective, on average, than 100 impressions.
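To make the comparison concrete, here is a minimal sketch of the arithmetic using only the two 2016 figures quoted above. The function and constant names are my own, chosen for illustration, and the numbers are a single reported data point, not a general price list.

```python
# Figures quoted in this article for a US user in 2016 (one data point only).
COST_PER_CLICK_USD = 1.11
COST_PER_IMPRESSION_USD = 0.0085


def cpm(cost_per_impression: float) -> float:
    """Cost per thousand impressions (cost-per-mille)."""
    return cost_per_impression * 1000


def impressions_per_click(cost_per_click: float, cost_per_impression: float) -> float:
    """Number of impressions that bring in the same revenue as one click."""
    return cost_per_click / cost_per_impression


if __name__ == "__main__":
    print(f"CPM: ${cpm(COST_PER_IMPRESSION_USD):.2f}")
    ratio = impressions_per_click(COST_PER_CLICK_USD, COST_PER_IMPRESSION_USD)
    print(f"One click is worth about {ratio:.0f} impressions")
```

With those figures the CPM comes out at $8.50, and a single click is worth roughly 130 impressions, which is consistent with the rule of thumb that one click beats 100 impressions.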

There seems to be a general understanding that, in order to get more clicks on ads, the best approach is to show users the ads that are deemed most relevant to them. If this is true, the more targeted the ads shown to users, the higher the click-to-impression ratio and the more revenue Facebook will make (and hence, the more money advertisers will pay).

To let advertisers offer targeted ads, Facebook provides a tool named Facebook Audience Manager. This tool lets advertisers decide which audiences are going to be shown the ads. If the audience is defined very precisely and is aligned with the advertisement campaign, it’s extremely likely that the click ratio will increase. This is what is called micro-targeting.

Advertisers have two main approaches to define the target audiences for their ads:

  1. Attribute-based Audiences: They are built by specifying a set of attributes that the users within the audience should have, for instance location, age, sex, interests, etc.
  2. Personal Information based Audiences (called Custom Audiences by Facebook): They are built using Personally Identifiable Information (PII) that can be linked to an individual. Examples of PII that can be used are e-mail addresses, phone numbers, Facebook IDs, etc. With this option, the ad is shown only to the users who match the specific individuals listed in the audience.

It’s key to stress that although their goal is the same (creating a potential advertisement audience), they work quite differently. When attributes are used (option 1), the advertiser can specify general aspects of the target audience, but there is no way to ensure that the ad is displayed to a specific user. Facebook is selling the advertiser a window (i.e. real estate on their website) to show information (ads) to some users who meet some criteria.

The following picture shows an example of a potential audience: users who have an ethnic affinity with African Americans, who have recently become parents and who are interested in homosexuality. As can be seen, the type of attributes that Facebook opens to advertisers can be a bit controversial. Furthermore, these are just examples, and there are additional attributes related to money (such as annual income), beliefs, health status (e.g. pregnancy), etc.

Example of Segmentation using Attributes (facebook.com)

On the other hand, when Personally Identifiable Information is used to create an audience, the situation is very different. Facebook lets advertisers specify the personal data linked to the users in the target audience. For instance, the advertiser could use a list of phone numbers, e-mail addresses, Facebook IDs… of the users he wants to include in the audience. In other words, this tool lets an advertiser guarantee that the ad is going to be shown to a specific user (if they are on Facebook). On top of that, Facebook also offers advertisers the possibility to combine multiple audiences in different ways to create supersets or subsets of existing audiences. The following picture shows a screenshot of the tool used to create these audiences, in which the different types of PII are shown.

Custom Audience based on Personally Identifiable Information (facebook.com)
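The difference between the two approaches can be sketched in a few lines of code. This is only a toy model under my own assumptions: the User record, the sample users and the two functions are hypothetical and are not Facebook's API or data model. It simply shows that an attribute-based audience is a filter over criteria, whereas a PII-based audience is a match against a list of individuals, and that audiences can then be combined as sets.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    """Hypothetical user record, for illustration only."""
    user_id: str
    email: str
    age: int
    country: str
    interests: frozenset


USERS = [
    User("u1", "ana@example.com", 34, "US", frozenset({"hiking", "parenting"})),
    User("u2", "bob@example.com", 52, "US", frozenset({"hunting"})),
    User("u3", "eve@example.com", 29, "ES", frozenset({"parenting"})),
]


def attribute_audience(users, *, country, min_age, max_age, interest):
    """Option 1: whoever meets the criteria; no individual is named."""
    return {
        u.user_id
        for u in users
        if u.country == country
        and min_age <= u.age <= max_age
        and interest in u.interests
    }


def pii_audience(users, uploaded_emails):
    """Option 2: exactly the users whose e-mail is in the advertiser's list."""
    wanted = {email.strip().lower() for email in uploaded_emails}
    return {u.user_id for u in users if u.email.lower() in wanted}


young_us_parents = attribute_audience(
    USERS, country="US", min_age=25, max_age=40, interest="parenting"
)
uploaded_list = pii_audience(
    USERS, ["BOB@example.com", "eve@example.com", "nobody@example.com"]
)

# Audiences behave like sets, so supersets and subsets are simple to build.
superset = young_us_parents | uploaded_list
subset = young_us_parents & uploaded_list
```

Note that the uploaded address with no matching account is silently ignored: the advertiser can only reach people who are actually on the platform.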

How does Facebook obtain your personal information?

Although Facebook users explicitly provide many data points about themselves to Facebook, the level of segmentation and audience creation offered to advertisers clearly goes beyond those data points. For instance, how can Facebook know the income level of a Facebook user? The answer is that Facebook combines the data explicitly provided with other data:

  • Facebook purchases data from third parties. It’s estimated that Facebook purchases, on average, around 600 additional data points from external companies in some geographies such as the United States. Those companies are usually called “Data Brokers”, and the best known ones include Acxiom and Experian. Facebook is pretty transparent about which companies those are, but pretty opaque about which specific data points are purchased and how they are used. These companies aggregate tons of public records pulled mostly from government databases or other sources of easily available public information.
  • Facebook collects data directly from third-party websites. When a user visits a website that includes any piece of code from Facebook (a login option, a like button, a forum…), Facebook is aware of the visit to that site and of some additional context such as the referrer page. Recent studies concluded that around 35% of websites already include Facebook trackers, which means that Facebook knows not only what users do on facebook.com but also what they do on about 35% of the websites they visit outside of facebook.com. This is not limited to Facebook users; it also affects people who might not have a Facebook account. The Belgian Data Protection Authority has recently fined Facebook because of this activity.

Lastly, all this information together can be used to infer derived data. E.g. if a user’s home is located in one ZIP code and the user’s current location is in a ZIP code in another state, it could be inferred that the user is currently traveling. In fact, Facebook lets advertisers create an audience of users who are currently traveling, who travel every week, etc.
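As a rough illustration of how such a derived attribute could be computed, here is a minimal sketch of the ZIP code example. I do not know how Facebook actually implements this inference; the tiny lookup table and the function name are assumptions made purely for the example.

```python
from typing import Optional

# Tiny illustrative lookup; a real system would need a complete ZIP database.
ZIP_TO_STATE = {
    "10001": "NY",  # Manhattan
    "11201": "NY",  # Brooklyn
    "94105": "CA",  # San Francisco
}


def is_probably_traveling(home_zip: str, current_zip: str) -> Optional[bool]:
    """Guess whether a user is traveling by comparing home and current state.

    Returns None when either ZIP code is unknown, because no inference
    can be made in that case.
    """
    home_state = ZIP_TO_STATE.get(home_zip)
    current_state = ZIP_TO_STATE.get(current_zip)
    if home_state is None or current_state is None:
        return None
    return home_state != current_state
```

The point is that neither data point says "traveling" on its own. The label only appears when two independently collected facts are combined, which is why derived data is so hard for a user to anticipate.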

How do companies such as Cambridge Analytica use Facebook capabilities?

The use of those mechanisms to create micro-targeted campaigns on Facebook has been well known for years, and many companies have been pretty vocal about the results they have achieved in influencing people, especially in elections. For instance, TMG (The Messina Group) has publicly disclosed how they helped Barack Obama in the elections he won and how they helped the Popular Party in the latest general elections in Spain.

They achieve this by creating very specific audiences of swing voters and adapting the messages conveyed to the specific characteristics of those audiences. This is a mechanism that is encouraged by Facebook (regardless of ethical or fairness considerations) and is totally legal if the use of the personal data complies with local regulations, especially those related to sensitive categories such as ethnicity, health status, sexuality, etc.

However, although the system could be perfectly legal, it is flawed, mostly because of two factors:

  • Lack of control over the content shown as “ads”. As mentioned before, what Facebook sells to advertisers is a piece of real estate on a website in which the advertiser can display the ad for a product. That space could be used to show clear, objective, true information, or to show fake news or any other similar type of content. The combination of a micro-targeted audience with the possibility of showing it very focused content, generated explicitly to influence a particular type of user, poses a significant risk of manipulation.
  • Offering the capability to create audiences based on Personally Identifiable Information has made advertisers do whatever it takes to get as much PII from users as possible. The possibility of influencing a specific user once their PII has been retrieved has led companies to explore any potential way (internal or external) to harvest that PII.

What exactly happened with Cambridge Analytica?

The level of influence and manipulation this company managed to achieve is pretty unclear. However, they have been actively communicating that, using Facebook Audiences, they were able to adapt each message to the version that would have the most influence on a given audience. An example that was recently explained by Tom Dobber from the University of Amsterdam is gun rights: extroverts might respond well to a pro-gun ad that talks about hunting as a family tradition and an adventure, but neurotic people might prefer a message emphasizing that the Second Amendment will protect us.

I am a bit skeptical about the real influence that Cambridge Analytica had on Donald Trump’s election in 2016. I think that influencing consumer decisions is way easier than influencing voters, but in any case, the important thing is not the outcome of their work but how a flawed system such as Facebook’s advertisement system is exploited without taking into account user privacy and essential rights.

Let’s start with a key question: how can Cambridge Analytica identify which swing voters are neurotic, which ones are extroverts, etc.? They could have retrieved that data in many ways (e.g. using a Data Broker, as Facebook does), but the recent scandal arose because they used another Facebook application to get that information.

In particular, the application was developed by Global Science Research (GSR) and was released as a personality evaluation survey for Facebook users. The application asked some questions and retrieved not only the answers to those questions but also the user’s Facebook information and a significant amount of information about all the friends of the user who downloaded the app. Although the application was used by just 270,000 users, it collected information from 50 million Facebook users. This was possible because the application was using an API (called the “friends API”) that allowed this information to be collected. Facebook removed this API during 2014 because of the increasing privacy concerns it raised for Facebook users.

Facebook has claimed that the use of these APIs was legitimate because:

  • It was aligned with the terms Facebook users accepted when they registered for a Facebook account.
  • It honored the privacy settings of the users (although it was really complicated for an average end user to understand what was going on).

Furthermore, Facebook claims that there has not been any privacy breach, and that the only thing that has happened is that the application author did not meet the specified conditions: the application was released as a purely academic tool, and the terms and conditions specified that the collected data was only going to be used for academic purposes. If the information was distributed for commercial purposes to companies such as Cambridge Analytica, breaking those terms, the problem is not a privacy problem but a problem of breaking a contract with end users.

In summary, a company that runs micro-targeted campaigns on Facebook to influence elections has bought data from a third party that had retrieved it from Facebook (using a privacy-sensitive API that is no longer available) under terms that allowed only academic, not commercial, use.

The relationship between Cambridge Analytica and GSR is a bit murky, as it seems that GSR created the survey application at Cambridge Analytica’s request (i.e. GSR could have just been an instrument to let Facebook think this data was used only for academic reasons). Furthermore, the fact that the GSR founder, Aleksandr Kogan, has Russian origins (and worked part time for Saint Petersburg University) has also helped to make this story more attractive to the media and more impactful for end users.

Only Facebook and Cambridge Analytica know if the data collected by GSR and commercialized to Cambridge Analytica was used and how it was used. So far, Facebook has just confirmed that they asked

If I were Cambridge Analytica, and I had the capability to understand, for every user, how she could be influenced, this is how I would do it:

  • Categorize the 50 million users into multiple “buckets” based on aspects that could make them susceptible to influence.
  • Select those “buckets” that include the users most likely to be influenced and that could be most important to me (e.g. swing voters in the case of an election).
  • Subdivide those buckets into groups as small as possible (ideally 20 people, as that is the minimum size of a PII-based Facebook audience). These groups should be selected based on the aspects that could be used to influence them, geographical aspects, etc.
  • Create PII-based audiences for those groups. This is trivial, as the data from the 50 million people should include the Facebook ID, which can be used as PII for audience creation.
  • Design and create super-targeted campaigns for those small audiences based on the insights retrieved from the data points.
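The bucketing and subdividing steps above are simple enough to sketch in code. To be clear, this is my own guess at how the list could be implemented, not a description of what Cambridge Analytica actually did. The profile fields ("trait", "swing_voter", "facebook_id") and every function name are hypothetical; the only figure taken from the text is the minimum audience size of 20.

```python
from collections import defaultdict

MIN_AUDIENCE_SIZE = 20  # minimum PII-based audience size quoted above


def bucket_by_trait(profiles):
    """Steps 1 and 2: group Facebook IDs by (trait, swing_voter)."""
    buckets = defaultdict(list)
    for profile in profiles:
        key = (profile["trait"], profile["swing_voter"])
        buckets[key].append(profile["facebook_id"])
    return buckets


def split_into_groups(ids, size=MIN_AUDIENCE_SIZE):
    """Step 3: cut a bucket into the smallest groups that are still allowed.

    A leftover chunk below the minimum is merged into the previous group,
    and a bucket that never reaches the minimum yields no groups at all.
    """
    groups = [ids[i:i + size] for i in range(0, len(ids), size)]
    if len(groups) > 1 and len(groups[-1]) < size:
        tail = groups.pop()
        groups[-1].extend(tail)
    return [group for group in groups if len(group) >= size]


def build_audiences(profiles):
    """Step 4: one list of Facebook IDs per small group of swing voters."""
    audiences = {}
    for (trait, swing_voter), ids in bucket_by_trait(profiles).items():
        if not swing_voter:
            continue  # only the buckets considered worth influencing
        for index, group in enumerate(split_into_groups(ids)):
            audiences[(trait, index)] = group
    return audiences
```

Each resulting list of IDs is what would be uploaded as a PII-based audience. The last step, designing a message per group, is the only part that requires any real effort.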

And now, what?

Is what has happened really so awful? I think so, but not because of the aspects that people and the media are stressing. People are focusing on the GSR story: how a company retrieved the data of a lot of Facebook users without them being aware, how Facebook offered a potentially privacy-sensitive API to developers, how developers broke the terms of those APIs and how Facebook did little (or nothing) to enforce those terms.

However, I think that Facebook is right when they stress that this hasn’t been a data breach but a breach of trust. They did what they were supposed to be doing: they opened that API to developers, Facebook users were informed in the privacy policy (although hardly anyone might have read it) about this potential usage, and hundreds (if not thousands) of apps were using it for years.

This reminds me of the movie Casablanca, when the policeman says: “I’m shocked! Shocked to find that gambling is going on here” (while a croupier hands the policeman a pile of money).

What are the aspects that I think very few people are paying attention to and that should be the focus of our analysis?

  • The lack of transparency and traceability about how the data is used once it leaves Facebook’s data centers. An “academic investigator” has amalgamated data points from millions of Facebook users, and although there are clear indications that the data could have been commercialized and used back on the Facebook platform to create ad audiences, there is still no information from Facebook about their plans to trace whether that was ever the case and to inform the affected users and the public.
  • The Facebook advertisement platform is designed in a way that incentivizes the use of people’s personal information as well as sensitive information about them. Facebook’s ad model is based on creating audiences as targeted as possible in order to create campaigns as tailored as possible. Companies have become extremely aggressive when it comes to collecting personal data in the hope that it can be used to run better and more effective campaigns. As the system is designed to reward (in terms of higher click rates) the creation of campaigns based on sensitive information or PII, companies will try to get that information no matter what.

Many authors have been talking about the potential risks of such a system. For instance, some researchers have demonstrated that the campaign manager could be used to re-identify specific users and to obtain more information about a user from a single piece of PII. Facebook reacted by fixing this before the results of the investigation were made public, but in my opinion these are just patches for a system that is flawed by design.

In particular, I think Facebook should radically change its advertisement platform in the following ways:

  • Get rid of audiences based on Personally Identifiable Information entirely. This is the best way to stop the arms race that companies have entered to get every bit of data, either directly or by purchasing it from data brokers.
  • Remove any data point related to potentially sensitive categories: sexuality, ethnic affinity, health… Using these data points is either illegal or, at the very least, arguably illegal, at least in the EU. It’s not acceptable that this information is used by Facebook, especially as it could be used to discriminate against people. There is an interesting scientific publication entitled “Facebook Use of Sensitive Data for Advertising in Europe” that discusses how Facebook is breaking European regulation by doing so. For instance, the following table, extracted from that research paper, includes some of the sensitive information used by Facebook to categorize users and generate audiences.
Percentage of FB users (FFB) per EU country that have been assigned each of the 20 very sensitive ad preferences listed in the table. The last row reports the aggregated FFB value for all 20 ad preferences per EU country. The last column reports the aggregated FFB value across all 28 EU countries. (Source: “Facebook Use of Sensitive Data for Advertising in Europe”)
  • Give users back control over the ads they see. So far, Facebook has provided some limited (and somewhat hidden) capabilities to let users understand why they are seeing some ads and which interests are being used to place them in a given audience. However, I think that Facebook should follow a different and more customer-centric approach: Facebook users should explicitly indicate their interests, and those should be the ones used by Facebook to create audiences.

This might be a long list of big changes, but I think now is the right moment to start making them. Now that Facebook’s share price is going down and there are so many rumors about big fines from multiple regulators, I think it’s about time for Facebook to stop thinking about advertisement as a business that keeps growing continuously in terms of revenue and to start thinking about it as a system that grows in terms of customer control and transparency. I think that, in the long term, it’s going to be a more sustainable model for both Facebook and their users.