The Graph API: Key Points in the Facebook and Cambridge Analytica Debacle

That ain’t workin’, that’s the way you do it — data for nothin’ through the Graph API for free

Jonathan Albright
Tow Center

--

Due to the number of questions, I’ve decided to outline the key points, privacy implications, and technical features related to Facebook’s Graph API. The Graph API is the underlying issue in the Cambridge Analytica data-sharing voter “micro-targeting” debacle. (This post is an expanded version of the Twitter thread linked below)

https://twitter.com/d1gi/status/976109055642042368

The problematic collection of Facebook users’ personal info — and the ability to obtain unusually rich info about users’ friends — is due to the design and functionality of Facebook’s Graph API. Importantly, the vast majority of problems that have arisen as a result of this integration were meant to be “features, not bugs,” as many have rightly pointed out.

People = Objects

Facebook’s Graph API is a developer, or app-level, interface that followed an earlier “REST” version of Facebook’s API. The introduction of the Graph API was heralded by Facebook as a revolutionary way to understand and access people’s social lives. Or, “We are Building A Web Where the Default Is [Sharing].”

Marketers, businesses, researchers, and law enforcement were given industrial-scale access to personal information and advanced search functionality covering Facebook users’ activities, connections, and emotional states — far beyond what users simply “posted” and talked about on the platform and its apps. As of 2017, Instagram offers similar Graph API functionality.

Facebook’s Graph API was a revolution in large-scale data provision. It converted people and their likes, connections, locations, updates, networks, histories, and extended social networks into — quite literally — “objects.” It made the company’s offerings and the data its users generated more economically viable.
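To make the “objects” point concrete, here is a minimal sketch of what fetching a person from the Graph API looks like. The endpoint shape follows Graph API convention; the access token and the fields shown are illustrative assumptions, not taken from any real account.

```python
# Minimal sketch: in the Graph API, a person is just another node,
# addressed by ID and returned as a plain JSON object.
import requests

ACCESS_TOKEN = "EAAB..."  # hypothetical OAuth token granted by a user

resp = requests.get(
    "https://graph.facebook.com/me",
    params={"access_token": ACCESS_TOKEN},
)
user = resp.json()

# A person, reduced to a dictionary of fields, e.g.:
# {"id": "100004...", "name": "...", "hometown": {...}, "likes": {...}}
print(user.get("id"), user.get("name"))
```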

There are even “front end” Graph API interfaces still around, such as StalkScan, which is meant to show users the dangers of background (meta)data availability and teach about possible misuse.

v1.0

Version 1.0 of the Graph API launched on April 21, 2010. It was deprecated in April 2014 and closed completely to legacy apps (i.e., existing apps that were using the API before April 2014) on April 30, 2015.

That’s five full years — a lot of time for apps and quizzes to extract vast quantities of users’ personal data, along with information from across their entire social networks. It was the beginning of “All Your Face Are Belong To Us.”

Facebook saw problems with the amount of personal information available through the first implementation of its Graph API. But it didn’t want to cut off marketing channels and business partners from the huge revenue stream. So, on April 30, 2014, the company announced at f8 that v1.0 would be wound down in favor of a much more restrictive v2.0 API.

Zuckerberg and Facebook’s April 2014 news release framed the company’s change of heart — about an industrial-scale PII access feature that exposed users’ sensitive personal info and that of their extended friend and family networks — as a move towards “putting people first.”

https://newsroom.fb.com/news/2014/04/f8-2014-stability-for-developers-and-more-control-for-people-in-apps/

That exact same day, in bullet points underneath, the company announced its biggest tracking and ad-targeting initiative to date: the Facebook Audience Network. In simple terms, this extended the company’s data-profiling and ad-targeting juggernaut from its own apps and services to the rest of the internet.

Facebook for Business: Say hello to Facebook’s Audience Network — a new way for advertisers to extend their campaigns beyond Facebook and into other mobile apps.

Main v1.0 Problem: Extended Permissions

What made v1.0 of Facebook’s Graph API highly problematic was its extended permissions. Apps could request a huge range of info about users’ friends without much friction and without communicating the reason(s) consent was being requested.

Once authorized with a single prompt, a v1.0 app could potentially remain in the background collecting and processing people’s data — and that of their entire friend network — for years. Additionally, v1.0 apps could request users’ private messages (i.e., their Facebook DM inbox) via the “read_mailbox” permission. (A sketch after the permissions list below illustrates the pattern.)

Users’ FRIENDS info available: About me, actions, activities, b-day, check-ins, education, events, games, groups, hometown, interests, likes, location, notes, online status, tags, photos, questions, relationships, religion/politics, status, subscriptions, website, work history

Symeonidis, Tsormpatzoudi & Preneel (2017): https://eprint.iacr.org/2015/456.pdf
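As a rough illustration of the mechanics described above — a single consent prompt gating both a user’s own data and their friends’ — here is a sketch of a v1.0-era flow. The permission names in the scope (friends_birthday, friends_likes, read_mailbox, and so on) are documented v1.0 permissions; the app ID, redirect URL, and token are hypothetical.

```python
# Sketch of a v1.0-era "quiz app" flow. Permission names are documented
# v1.0 permissions; the app ID, redirect URL, and token are hypothetical.
import requests

APP_ID = "123456789"  # hypothetical app ID
REDIRECT_URI = "https://example-quiz.com/callback"  # hypothetical

# Step 1: a single consent prompt. The scope quietly bundles the user's
# own data, their FRIENDS' data, and even the private message inbox.
scope = ",".join([
    "friends_about_me", "friends_birthday", "friends_likes",
    "friends_location", "friends_religion_politics", "friends_photos",
    "read_mailbox",
])
login_url = (
    "https://www.facebook.com/dialog/oauth"
    f"?client_id={APP_ID}&redirect_uri={REDIRECT_URI}&scope={scope}"
)

# Step 2: with the resulting token, enumerate the user's friends and read
# their profile fields; those friends never saw any consent dialog.
token = "EAAB..."  # token obtained after the single prompt above
friends = requests.get(
    "https://graph.facebook.com/v1.0/me/friends",
    params={
        "access_token": token,
        "fields": "id,name,birthday,location,religion,likes",
    },
).json()
```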

Also, developers could operate multiple v1.0 apps. Since the v1.0 API returned users’ real Facebook user_IDs, app developers and their partners could instantly recombine mass quantities of personally identifying info collected across dozens of different Facebook apps and quizzes.
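To show why real user_IDs mattered, here is a toy sketch of that cross-app recombination. All records below are fabricated; the point is only that a stable, global ID makes the join trivial.

```python
# Toy sketch of cross-app recombination. All records are fabricated; the
# point is that a stable, global user_ID makes the join trivial.
quiz_app_records = {
    "100004123": {"birthday": "04/21/1985", "likes": ["Page A", "Page B"]},
}
horoscope_app_records = {
    "100004123": {"location": "Austin, TX", "religion_politics": "..."},
}

# Merge per-app records keyed on the shared Facebook user_ID.
profiles = {}
for records in (quiz_app_records, horoscope_app_records):
    for uid, fields in records.items():
        profiles.setdefault(uid, {}).update(fields)

print(profiles["100004123"])  # birthday + likes + location + politics
```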

Symeonidis, Tsormpatzoudi, and Preneel’s excellent paper (2017; see the permissions list above) argues that while the company’s v1.0 API allowed third-party apps to obtain consent from the originating users, no consent was obtained from the rest of the “data subjects” — people’s friends on Facebook, whose data was often the main reason the collection and processing happened in the first place.

This means that the term “breach,” while not accurate from a systems-level computer security standpoint, is arguably correct in a legal sense: the “data subjects” never gave informed consent before the mass collection, processing, and re-sharing of their personal information. (See the users’ FRIENDS info list above.)

My Recommendation

UK MP Damian Collins, the UK’s Information Commissioner’s Office (ICO), the United States Senate and Congressional committees, and other important regulatory actors such as the Federal Trade Commission should require Facebook to immediately share with the public the “quizzes” used by Cambridge Analytica — with all the questions, any relevant versions, and the entire list of v1.0 Graph API “friend permissions” used to effectively steal personal information from up to 50 million people.

There’s no reason to withhold this material, as it carries no privacy implications. We should also know whether users’ private Facebook messages were collected at any point.

While clearly more problematic than useful overall, Facebook’s Graph API v1.0 tools were amazing for teaching. I used several apps — including Bernie Hogan’s (@blurky) app from the Oxford Internet Institute — in my own classes between 2013 and 2015. My students in journalism, communication, public relations, and media analytics could extract their own Facebook networks to understand the nature and quality of links, friend clustering, and communities, and to learn about anthropological concepts such as the Dunbar number firsthand.
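For a sense of what those classroom exercises looked like, here is a rough sketch using networkx on an exported friend-to-friend edge list (the kind of file the v1.0-era apps could produce). The edge list here is a placeholder, not real data.

```python
# Rough sketch of the classroom exercise: clustering, communities, and
# the Dunbar number on an exported friend network. The edge list is a
# placeholder; networkx is assumed to be installed.
import networkx as nx

edges = [("alice", "bob"), ("bob", "carol"), ("alice", "carol"),
         ("carol", "dave"), ("dave", "erin"), ("erin", "carol")]

G = nx.Graph(edges)

# Friend clustering: how tightly knit each person's friends are.
print("avg clustering:", nx.average_clustering(G))

# Communities: the friend clusters students could see in their own data.
for community in nx.community.greedy_modularity_communities(G):
    print(sorted(community))

# Dunbar number: compare network size to ~150 stable relationships.
print(G.number_of_nodes(), "nodes vs. Dunbar's ~150")
```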

Publicity Settings versus Privacy Settings

Facebook’s interface has been built around the false pretense of giving users control over what is shared. But the focus is on “posting,” or outward sharing — what we actively CHOOSE to share. In reality, Facebook users have almost no ability to control what is passively shared ABOUT THEM — meaning the information and metadata others can extract.

In this sense, Facebook’s “privacy settings” are a grand illusion. Control over post-sharing — the people we share TO — should really be called “Publicity Settings.” Likewise, control over passive sharing — the information people can take FROM us — should be called “Privacy Settings.”

Politics has become the most important kind of tech reporting.

Recent political news, especially stories about social platforms, foreign manipulation, and the misuse of data and public data tools, demands a technical baseline. While journalists don’t need deep expertise, they should understand the essential mechanisms — such as the Graph API’s problematic history and its direct relevance to the Cambridge Analytica story.

Reporters should be able to convey in broad terms how users’ personal data was obtained through the Graph API v1.0, and explain to readers why their data was open to mass collection. This will help provoke questions and scenarios about how it was used and where it might have ended up. This technological literacy is a prerequisite to informed reporting on what Cambridge Analytica, Facebook, and other political propaganda and media-manipulation events mean for society, elections, and the future of technology and democracy.

This is a complex topic. Even for experts, it’s easy to get things wrong. But for journalists, mistakes lead to getting called out by defensive execs, PR teams, and politicians — and to otherwise important stories getting branded as “fake news.” It’s easy to overstate certain pieces of the puzzle, or to lose sight of the bigger picture in side arguments about whether Cambridge Analytica’s tools and methods were effective or not.

None of this is productive at this stage. I suggest we have much bigger issues to confront, like revisiting current business models and rethinking data privacy rules. I hope this outline of key issues (and of the questions that might arise from more accurate and technically informed reporting on this topic) is helpful!
