Facebook algorithm and impact on media: French election experiment #1

Macron & Le Pen vs Facebook Algorithm, analysis with https://facebook.tracking.exposed

Summary

During the last stage of the French Presidential Election 2017, we monitored Facebook through four ad-hoc profiles. We collected what they saw on their timelines in order to understand how the Facebook algorithm can influence their perception of reality by privileging some content and disregarding other content.

Although our users’ behavior gave Facebook the possibility to inform them impartially, we found that the social network tends to exclude some content from their Newsfeeds.
We are collecting evidence and refining our methodology to understand whether we can speak of algorithmic censorship, as it is one of the concerns we address with https://facebook.tracking.exposed

In particular, this research exposes the falsehood of statements like “if you don’t like Facebook, don’t use it”. Society will be heavily influenced even if you can afford the luxury of staying outside the advertising platform also known as social media.

This research is not conclusive. We are releasing our datasets to let other data scientists validate our theories, and we are hereby announcing a public call to join a collaborative analysis of the upcoming UK 2017 elections.

Introduction to the platform

“facebook tracking exposed” is the project, #fbtrex the nickname and https://facebook.tracking.exposed is the actual website

Our mission is to help researchers assess how current filtering mechanisms work and how personalization algorithms should be modified in order to minimize the dangerous social effects for which they are indirectly responsible and to maximize the values, both individual and social, that algorithms should incorporate.

The data used in this analysis

Facebook Tracking Exposed is a browser extension that copies the public posts Facebook shows to the user who installs it. Only the Newsfeed is considered. The extension provides us (and to some extent the public, after anonymization and minimization) with:

1, Facebook user’s ID;
2, The date and time for each time the user refreshes the Newsfeed;
3, The date and time of the impression of every post on user’s Newsfeed;
4, The position of every post in user’s Newsfeed;
5, The HTML section of every Public post in user’s Newsfeed.
See more at https://facebook.tracking.exposed/privacy-statement
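
As an illustration only, the five points above could be modeled as one record per impression. The class and field names below are hypothetical and are not taken from the fbtrex code base; the values in the example are made up.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class Impression:
    """One post shown to one user during one Newsfeed refresh.

    Field names are hypothetical; they only mirror points 1-5 of the list.
    """
    user_id: str                 # point 1: Facebook user's ID
    refresh_time: datetime       # point 2: when the Newsfeed was refreshed
    impression_time: datetime    # point 3: when the post was displayed
    position: int                # point 4: 1 means top of the Newsfeed
    html: Optional[str] = None   # point 5: present only for Public posts


# Made-up values, used only to show the shape of a record.
example = Impression(
    user_id="user-a",
    refresh_time=datetime(2017, 5, 5, 17, 0, 0),
    impression_time=datetime(2017, 5, 5, 17, 0, 12),
    position=1,
    html="<div>placeholder, not real Facebook markup</div>",
)
```

Keeping `html` optional reflects the fact that only Public posts are copied; a non-public post would still count as an impression with a position, but without content.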

Point 5 is the juicy part that we are going to use for our analysis. At the beginning it is just a raw piece of HTML; a chain of parsers then extracts metadata (source, publication time, text, type of media) to be used in analyses like this one. As the parsers improve, investigations like this one will benefit too (for example, by getting the number of likes a post had collected at a certain time).

The methodology

The first step of an algorithm audit is to make the behavior of your subjects uniform. We created new profiles of supposedly French persons, assumed to be citizens, in the days before the election.

All of our users were following 13 mainstream French pages. We want to use the publications of these media as a reference. In this way we are able to see whether some posts appear in a less appealing location (lower in the timeline, regardless of chronological order). Below are the points we kept in mind to make the test uniform:

  1. Users “like” (and therefore follow, so the publications appear in their timeline) the same 13 sources.
  2. Users have no friends.
  3. Users do not like or comment on the posts in their feed.
  4. Users scroll in a browser for the same amount of time, in order to get more or less the same number of posts. To do this, we used a dedicated browser for every user, with the extension installed, and an auto-scroll script (3 minutes of scrolling, 1 page every 6 seconds and, after an hour, refresh and restart).
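
The timing in point 4 can be worked out with a few lines of arithmetic. This is a sketch of the schedule only, not the auto-scroll script itself (which runs in the browser); the constant and function names are hypothetical, and we assume the hour is counted from the start of each scrolling session.

```python
SCROLL_DURATION_S = 3 * 60   # scroll for 3 minutes
SCROLL_INTERVAL_S = 6        # one page every 6 seconds
REFRESH_PERIOD_S = 60 * 60   # refresh and restart after an hour


def scrolls_per_session(duration_s=SCROLL_DURATION_S, interval_s=SCROLL_INTERVAL_S):
    """Number of page scrolls performed in one scrolling session."""
    return duration_s // interval_s


def idle_seconds_per_cycle(period_s=REFRESH_PERIOD_S, duration_s=SCROLL_DURATION_S):
    """Seconds the browser waits between the end of scrolling and the next refresh."""
    return period_s - duration_s


def timelines_collected(users, hours):
    """One refresh per user per hour means one timeline per user per hour."""
    return users * hours


print(scrolls_per_session())      # pages scrolled per timeline
print(idle_seconds_per_cycle())   # waiting time before the next refresh
print(timelines_collected(4, 5))  # four users over five hours
```

With these numbers every timeline is built from 30 page scrolls, and four browsers left alone for five hours would produce 20 timelines.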

Up to this point, the users have the same behavior and the news appears equally.

Our analysis has been driven by this research question:

With just a small set of differences in the users’ profiles, how different can the reality they perceive on Facebook be?

After the common initialization, every user has ~30 likes. They can be given to pages, communities or random events, and they were made at the beginning of the initialization phase of the experiment. Below, for example, is the content liked by the user named Almìr:

The characterization of the four users will be detailed once we are able to analyze the content of the posts (entity extraction and sentiment analysis). In this preliminary release, we have not looked into it yet.

The Data collected

Every time a user refreshes their Facebook page, fbtrex records the new timeline in our database.
Every post the user sees is counted as an impression, and every impression gets its own order. The first is the one appearing on top.
If the post shown is public (without any audience restriction), we get the HTML in order to extract the metadata:

This is how the original looks 19 days later (at the time this article was written):

The original post can be found at the permaLink https://www.facebook.com/lepoint.fr/posts/10154413789830703

The CSV file has one entry per public impression. It can be found in our repository, and a sample can be seen here.
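
A minimal sketch of reading such a file with Python's standard `csv` module. The `permaLink` column is mentioned in the known bugs at the end of this article; the other column names, the helper names and the second sample row are assumptions made for illustration and may not match the released file.

```python
import csv
import io
from urllib.parse import urlparse

# Tiny in-memory stand-in for the real CSV. The first permaLink is the Le Point
# post quoted above; user IDs and the second row are made up.
SAMPLE_CSV = """userId,postId,permaLink
user-a,10154413789830703,https://www.facebook.com/lepoint.fr/posts/10154413789830703
user-b,1,https://www.facebook.com/example.page/posts/1
"""


def load_impressions(handle):
    """Return one dict per public impression from an open CSV file."""
    return list(csv.DictReader(handle))


def page_from_permalink(permalink):
    """Return the first path segment of a permaLink, i.e. the page name."""
    return urlparse(permalink).path.strip("/").split("/")[0]


rows = load_impressions(io.StringIO(SAMPLE_CSV))
for row in rows:
    print(row["userId"], row["postId"], page_from_permalink(row["permaLink"]))
```

To read the real file, replace `io.StringIO(SAMPLE_CSV)` with `open(path, newline="", encoding="utf-8")` and adjust the column names to the actual header.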

Reducing the analysis window

In order to judge whether Facebook has displayed a post to one user and hidden it from another, you also have to reduce the differences between their activity and general exposure to Facebook. Given the lack of coordination between testers, the data at the beginning looks inhomogeneous. Analyzing the data, a good time window is the afternoon of the 5th of May, when the users contributed a nearly equal number of posts.
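
Restricting the data to a time window can be sketched as a simple filter. The exact bounds of "the afternoon" are an assumption here (15:00 to 20:00), as are the `impressionTime` column name and its ISO format; the sample rows are made up.

```python
from datetime import datetime

# Assumed bounds for the afternoon of the 5th of May 2017.
WINDOW_START = datetime(2017, 5, 5, 15, 0)
WINDOW_END = datetime(2017, 5, 5, 20, 0)


def in_window(rows, start=WINDOW_START, end=WINDOW_END):
    """Keep only the impressions whose time falls in [start, end)."""
    return [
        row for row in rows
        if start <= datetime.fromisoformat(row["impressionTime"]) < end
    ]


# Made-up impressions, only to exercise the filter.
rows = [
    {"userId": "a", "postId": "1", "impressionTime": "2017-05-04T09:10:00"},
    {"userId": "a", "postId": "2", "impressionTime": "2017-05-05T17:05:00"},
    {"userId": "b", "postId": "2", "impressionTime": "2017-05-05T19:59:59"},
    {"userId": "b", "postId": "3", "impressionTime": "2017-05-05T20:00:00"},
]
kept = in_window(rows)
print(len(kept), "of", len(rows), "impressions fall inside the window")
```

The end bound is exclusive, so an impression recorded exactly at 20:00 is left out.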

Picture 1 — https://public.tableau.com/views/Isolatedpostforanalysis/Sheet1?:embed=y&:display_count=yes

Every color above represents a user. The more unique posts a user got (measured by the count of postId), the taller the column. The auto-scroll made every user refresh once per hour, so there is only one timeline per hour. If the browsers are left to operate alone, we would have 20 timelines collected across the afternoon.
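
The column height described above is a count of distinct postId values per user. A minimal sketch, assuming each impression is a dict with hypothetical `userId` and `postId` keys (the rows are made up):

```python
from collections import defaultdict


def unique_posts_per_user(rows):
    """Map each user to the number of distinct posts they were shown."""
    seen = defaultdict(set)
    for row in rows:
        seen[row["userId"]].add(row["postId"])
    return {user: len(posts) for user, posts in seen.items()}


# Made-up impressions: "orange" sees post p1 twice, which counts once.
rows = [
    {"userId": "orange", "postId": "p1"},
    {"userId": "orange", "postId": "p1"},
    {"userId": "orange", "postId": "p2"},
    {"userId": "blue", "postId": "p1"},
]
heights = unique_posts_per_user(rows)
print(heights)
```

Counting distinct posts rather than impressions matters because the same post can reappear in several hourly timelines.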

The Orange and Blue users have more posts. Wonder why? It is because of the screen size! The window height has an impact on the auto-scroll tool, and therefore on the posts that are displayed, sent to us and analyzed here. Below is the same data, broken down by hourly submissions:

Picture 2 — This visualization shows the waves of auto-scroll across the hours; 17–20 was constant for the 4 users https://public.tableau.com/views/Isolatedpostforanalysishourly/isolateddata?:embed=y&:display_count=yes

The content under analysis is the sum of the 13 shared media plus the publications made by the pages each user liked. We proceed by keeping only the posts shared by the news media.

Media exposure

We might expect all the users to receive the same number of posts from the sources they are following; if not, how much does it differ? Taking into consideration 12 of the sources they have in common:

Highlights from the picture above:

  1. Every media source is identified by a color. We have four columns because four different users get a personalized experience of the source. They almost always differ in size, so there are posts that have been displayed to one user and not to another.
  2. We don’t know how many unique posts the media have published; we can only see how many the four users saw. Courrier International and Mediapart might have produced far fewer than Le Figaro, or our users didn’t see all of their posts. (This is a limit of our research but not an absolute one: you can collect through other channels all the publications made by every media, and check whether something was missing for the 4 users of the test.)

Information diet: what did the users see?

Picture 4 — https://public.tableau.com/views/usersdifferenciesbeforesplitthem/mediaperuser?:embed=y&:display_count=yes

We noticed the uniformity of the posts received by the two users on the left. They represent a homogeneous set of data, so we looked deeper into which media were favored for one user compared to the other. The users are: Antoine Blanc (identified as 1000169) and Olivier Dumont (1000167)

Confronting Olivier and Antoine’s feeds

In our analysis we decided not to go deeper into our profiles’ political personalities, since it would require a semantic analysis of posts, pages and likes. Even if it might seem pointless to analyze the feeds without knowing the specific political affiliation of the two profiles, our goal here is to show that there are differences between them. One further consideration has to be made: something in Olivier’s identity probably triggered the algorithm to push two regional newspapers (Le Parisien, Paris Match). This could be the effect of some more geolocalized pages in his set of “likes”. Aside from this, the two political inclinations appear intuitively clear. The semantic analyses that will follow will better explain the position of our test profiles.

Which media feed Olivier more than Antoine?

France24, Courrier International, Libération, Paris Match, Le Parisien. Over the five-hour window, he saw 12 more posts from these sources.

Picture 5 — Olivier Information diet — This is a selection of https://public.tableau.com/views/smallusersmediaexposuredifferencies/mediaperuser?:embed=y&:display_count=yes

Which media feed Antoine more than Olivier?

Les Echos, Le Monde, Le Point, L’Obs, Le Figaro. Over the five-hour window, he saw 11 more posts from these sources.

Picture 6 — Antoine Information diet — This is a selection of https://public.tableau.com/views/smallusersmediaexposuredifferencies/mediaperuser?:embed=y&:display_count=yes
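
The comparison between two feeds boils down to counting distinct posts per source for each user and subtracting. The sketch below uses made-up rows with neutral names, so it does not reproduce the figures of this article; the `userId`, `postId` and `source` keys are assumptions.

```python
from collections import Counter


def posts_per_source(rows, user_id):
    """Count the distinct posts one user saw from each source."""
    unique = {
        (row["source"], row["postId"])
        for row in rows
        if row["userId"] == user_id
    }
    return Counter(source for source, _post in unique)


def exposure_gap(rows, user_a, user_b):
    """Per source: posts seen by user_a minus posts seen by user_b."""
    a = posts_per_source(rows, user_a)
    b = posts_per_source(rows, user_b)
    return {source: a[source] - b[source] for source in sorted(set(a) | set(b))}


# Made-up impressions; the repeated row shows that re-impressions count once.
rows = [
    {"userId": "user_a", "source": "source_x", "postId": "p1"},
    {"userId": "user_a", "source": "source_x", "postId": "p2"},
    {"userId": "user_a", "source": "source_x", "postId": "p2"},
    {"userId": "user_b", "source": "source_x", "postId": "p1"},
    {"userId": "user_a", "source": "source_y", "postId": "p3"},
    {"userId": "user_b", "source": "source_y", "postId": "p3"},
    {"userId": "user_b", "source": "source_z", "postId": "p4"},
]
gap = exposure_gap(rows, "user_a", "user_b")
print(gap)
```

A positive value means the source fed the first user more, a negative value the second, and zero corresponds to a source treated in the same way for both.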

Uninfluenced media

In this specific window of analysis, Le Monde Politique and Mediapart were treated in the same way.

Analysis by single post

Picture 7 — a snippet extracted from: https://public.tableau.com/views/smalluserspostperhours/mediaperuser?:embed=y&:display_count=yes

In the linked visualization, a square means the presence of a post in the timeline at a certain hour. We can notice that some articles are completely excluded for a user. Note: if you select the postId and compose a URL as https://facebook.com/$postId, you’ll be forwarded to the actual post. In fbtrex you can see all the impressions in which the post was displayed with: https://facebook.tracking.exposed/realitymeter/$postId
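
The two URL patterns above can be composed with a pair of small helpers. The function names are hypothetical; the patterns are the ones given in the text, and the postId is the one used in the known bugs section.

```python
def post_url(post_id):
    """URL that forwards to the original post on Facebook."""
    return f"https://facebook.com/{post_id}"


def realitymeter_url(post_id):
    """fbtrex page listing the impressions in which the post was displayed."""
    return f"https://facebook.tracking.exposed/realitymeter/{post_id}"


print(post_url("10154723755726936"))
print(realitymeter_url("10154723755726936"))
```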

Now that the visualization was making sense, we included the two initially excluded users. This shows even more clearly how posts are selectively discriminated and excluded:

Picture 8 —https://public.tableau.com/views/alluserspostperhours/mediaperuser?:embed=y&:display_count=yes
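
The same question, which posts never reached which users, can also be asked of the data directly. A minimal sketch, assuming impressions are dicts with hypothetical `userId` and `postId` keys; the rows are made up and the users are named a, b, c, d for illustration.

```python
from collections import defaultdict


def missing_viewers(rows, users):
    """For every post, list the users who never saw it (posts seen by all are omitted)."""
    seen_by = defaultdict(set)
    for row in rows:
        seen_by[row["postId"]].add(row["userId"])
    everyone = set(users)
    return {
        post: sorted(everyone - viewers)
        for post, viewers in seen_by.items()
        if everyone - viewers
    }


# Made-up impressions: p1 reaches all four users, p2 only two of them.
rows = [
    {"userId": "a", "postId": "p1"},
    {"userId": "b", "postId": "p1"},
    {"userId": "c", "postId": "p1"},
    {"userId": "d", "postId": "p1"},
    {"userId": "a", "postId": "p2"},
    {"userId": "c", "postId": "p2"},
]
excluded = missing_viewers(rows, ["a", "b", "c", "d"])
print(excluded)
```

A post can only appear in the result if at least one tester saw it, which is the same limit noted earlier: posts that none of the four users received are invisible to this kind of check.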

Publication by media, Hourly frequency, per users

Taking into account the 5th of May, for all four users (identified here as a, b, c, d): how many posts did they see per media source?

Takeaways

  1. Algorithms can exert a subtle kind of censorship. The content you are supposed to see is simply removed from your attention. If you are an expert on a subject you might just look for the content and find it; if you are not, you will never become one.
  2. Analyzing the Facebook algorithm is a complex task, but it is not necessary. If we prove the unreasonable promotion/exclusion of certain content, we can still push for one of our techno-political goals: algorithm diversity. Give users the power to decide their own algorithm: customize, improve, measure, share, re-mix. Algorithms are social policies.
  3. The data collected by fbtrex is intended to empower users with a less immersive experience and show them their information diet. Because you can’t delegate to an algorithm how your comprehension of reality is shaped, you have to see which information feeds you. You have to decide willingly to change that. (Sadly we lack resources, a UX designer and a frontend developer.)

Conclusion of piece #1 + Open Data

It is no exaggeration to say that this is just the beginning. The data we have might allow us to understand a bit more about how Facebook works and how our society is caged in bubbles. We, the small team of #fbtrex, have some other analyses in the queue, but only a global and diverse community will be able to critique the algorithms from multiple points of view. You can download the CSV we used in this analysis.

Elections are not the only thing that deserves to be monitored. The daily information diet of your community might reveal interesting insights: how a topic develops in different bubbles or, simply, how the algorithm shapes the perception of our realities.

Known bugs

  1. The parser implemented to extract the source name from the HTML gives different identifications to “France24” and “France24 shared a picture”, but in the CSV you can extract the source from the permaLink.
  2. One of the users has friends (they belong to a different bubble than France, even to a different continent), and they have been marked on Facebook as “unfollow” so as not to pollute the collection.
  3. Looking at pictures 7 and 8 might raise the doubt that the posts not displayed were published in the previous days, and that the reason they are not showing up in the viz is just a bias of our window reduction. It is not: below is my quick and dirty test to see how many users saw the postId 10154723755726936 in the whole dataset, and they are three out of four:
$ grep 10154723755726936 fbtrex-French2017.csv | sed -es/.*feed…//g | sed -es/\”.*// | sort | uniq -c 
 9 100014305273231
 7 100016786692466
 28 100016788883580
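
The same check can be sketched in Python by counting impressions per user for one postId. The rows below are rebuilt from the counts printed above and are not the real CSV rows; the `userId` and `postId` keys, the function name and the extra row about another post are assumptions.

```python
from collections import Counter

POST_ID = "10154723755726936"


def viewers_of(rows, post_id):
    """Count the impressions of one post per user, like `sort | uniq -c`."""
    return Counter(row["userId"] for row in rows if row["postId"] == post_id)


# Rows reconstructed from the output above (9, 7 and 28 impressions).
observed = {"100014305273231": 9, "100016786692466": 7, "100016788883580": 28}
rows = [
    {"userId": user, "postId": POST_ID}
    for user, count in observed.items()
    for _ in range(count)
]
# One made-up impression of a different post, which must be ignored.
rows.append({"userId": "100014305273231", "postId": "some-other-post"})

counts = viewers_of(rows, POST_ID)
for user, count in sorted(counts.items()):
    print(count, user)
print(len(counts), "users saw the post")
```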

Credits

Claudio Agosti, Raffaele Angius.