Analyzing Medium Posts to Understand the Impact of the Cambridge Analytica Scandal

Has there been a focus shift towards privacy since the scandal?

Published in Analytics Vidhya · 6 min read · Sep 2, 2018

When the Cambridge Analytica-Facebook scandal emerged, articles about the misuse of user data by technology companies were ubiquitous. The privacy issues it raised made me want to understand how the scandal affected people's views of Facebook and how their perception of the company has changed since.

I dug into the archives of Medium and collected all the posts that were tagged “Facebook”. You can find the code on GitHub to scrape data for a particular tag on Medium here and details on how to scrape data from Medium in Scraping Medium Posts using Scrapy.

Setting the Stage

This analysis considers 27,317 posts published in English between Jan 1st 2016 and Jul 31st 2018 that had a “Facebook” tag and were not responses. In the overall distribution of posts (see below), you can observe a huge spike in the number of posts between March and April 2018, which coincides with the time the scandal broke.

The next step was to understand whether the increase in the number of posts fell on specific days and, if so, how those days mapped to the timeline of the scandal. Extracting posts tagged “Facebook” published between March 1st 2018 and May 10th 2018 showed that the increase in the number of posts correlates with the timeline of the Facebook-Cambridge Analytica scandal (see below). The code for this analysis can be found here.

Distribution of Posts Tagged Facebook on Medium (red line is the median number of posts)
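
The daily counts and the median line in a chart like this can be computed along the following lines. This is a minimal sketch rather than the code used for the analysis: the function names and the sample dates are hypothetical, and it assumes each post has already been reduced to its publication date.

```python
from collections import Counter
from datetime import date
from statistics import median


def daily_post_counts(published_dates):
    """Count how many posts were published on each calendar day."""
    return Counter(published_dates)


def days_above_median(counts):
    """Return (median daily count, sorted days whose count exceeds it)."""
    med = median(counts.values())
    spikes = sorted(day for day, n in counts.items() if n > med)
    return med, spikes


# Hypothetical sample: publication dates of six posts.
sample_dates = [
    date(2018, 3, 16),
    date(2018, 3, 17),
    date(2018, 3, 17),
    date(2018, 3, 17),
    date(2018, 3, 18),
    date(2018, 3, 18),
]

counts = daily_post_counts(sample_dates)
med, spikes = days_above_median(counts)
```

Note that days with no posts never appear in the counter, so the median here is taken over days with at least one post; whether the original chart treats empty days the same way is not stated.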

The Rise of Privacy Debate

The scandal led to a serious debate on privacy and the misuse of user data. I wanted to check whether it was a wake-up call about privacy issues on social media platforms like Facebook, which requires understanding what each post is about.

On Medium, one way to do this is to use the Tags associated with posts. Tags are topic identifiers on Medium. They allow users to retrieve all stories of a particular topic.

What the data said was in line with my expectation: pre-scandal, four of the top 10 tags associated with Facebook were related to advertising and marketing, one of Facebook's primary sources of revenue. Between Jan 2016 and March 16th 2018, only 481 posts carried the tag “Privacy”, whereas between March 17th and July 31st 2018 there were 1,005 such posts.

Top 10 Tags Associated with Facebook, pre and post the scandal

Distribution of Posts Tagged Privacy or Data on Medium
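
The pre- and post-scandal tag comparison described above can be sketched roughly as follows. The split date of March 17th 2018 comes from the article; the function name, the sample rows, and the choice to exclude the “Facebook” tag itself are assumptions made for illustration.

```python
from collections import Counter
from datetime import date

SCANDAL_DATE = date(2018, 3, 17)  # split point used in the article


def top_tags_by_period(posts, n=10, exclude=("Facebook",)):
    """Return the n most common tags before, and from, SCANDAL_DATE."""
    pre, post = Counter(), Counter()
    for published, tags in posts:
        bucket = pre if published < SCANDAL_DATE else post
        # Skip excluded tags (an assumption: every post already has them).
        bucket.update(t for t in tags if t not in exclude)
    return pre.most_common(n), post.most_common(n)


# Hypothetical sample rows: (publication date, list of tags).
sample_posts = [
    (date(2017, 5, 1), ["Facebook", "Marketing", "Advertising"]),
    (date(2017, 9, 9), ["Facebook", "Marketing"]),
    (date(2018, 3, 20), ["Facebook", "Privacy", "Cambridge Analytica"]),
    (date(2018, 4, 11), ["Facebook", "Privacy"]),
]

pre_top, post_top = top_tags_by_period(sample_posts, n=2)
```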

From the images above, it looks like the scandal was a wake-up call: people appear to have been unaware of, or negligent about, how their data is used and its potential impact. The number of posts per day related to Privacy is higher than it was before the scandal.

Privacy Themes

We saw above that the number of posts tagged Privacy or Data has increased after the scandal, but that alone says nothing about the central theme or context of each post. To get at this, we extracted the content of each post using Scrapy by passing it the post's URL. Medium does not directly provide the link to a post, so we concatenated “https://medium.com/s/story/” and the post's unique slug to get the URL.
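
The URL construction described above can be sketched as follows. The prefix is the one quoted in the text; the function name and the sample slug are hypothetical.

```python
BASE_URL = "https://medium.com/s/story/"  # prefix quoted in the article


def build_post_url(unique_slug):
    """Join the story prefix and a post's unique slug into a full URL."""
    # Strip stray whitespace and leading slashes so the join stays clean.
    return BASE_URL + unique_slug.strip().lstrip("/")


# Hypothetical slug, for illustration only.
url = build_post_url("example-post-1a2b3c4d5e6f")
```

Each URL built this way could then be handed to a Scrapy request whose callback extracts the post body.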

To understand the themes of posts that have “Privacy” or “Data” tags, we calculated the TF-IDF matrix and clustered the documents using K-Means based on cosine distance. TF-IDF indicates how important a word is to a document within a collection of documents. Term Frequency (tf) counts the occurrences of a word in a document. Inverse Document Frequency (idf) measures how rare a word is across the entire corpus: a word that occurs in every document is neither rare nor significant, so it receives a low weight. Using this technique, we identified 12 clusters.
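
To make the weighting concrete, here is a small standard-library sketch of TF-IDF and cosine distance. It is not the code used for the analysis: it assumes the plain `log(N / df)` form of idf (libraries such as scikit-learn default to a smoothed variant), and the function names and sample documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one L2-normalised {term: weight} dict per tokenised document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Plain idf = log(N / df); a term found in every document gets 0.
        weights = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        if norm:
            weights = {t: w / norm for t, w in weights.items()}
        vectors.append(weights)
    return vectors


def cosine_distance(a, b):
    """1 - cosine similarity of two unit-length sparse vectors."""
    return 1.0 - sum(w * b.get(t, 0.0) for t, w in a.items())


# Hypothetical, already tokenised documents.
docs = [
    "facebook privacy data breach".split(),
    "facebook privacy gdpr europe".split(),
    "facebook whatsapp encryption".split(),
]
vecs = tfidf_vectors(docs)
```

Here “facebook” appears in every document, so its weight is zero and it contributes nothing to the distance between posts. Because the vectors are normalised to unit length, squared Euclidean distance equals twice the cosine distance, which is why a standard K-Means run on normalised TF-IDF vectors behaves like clustering by cosine distance.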

Having identified the clusters, we needed to understand the theme of each one, so for each cluster we took the top 10 tags, excluding “Facebook”, “Privacy” and “Data”. A few cluster themes stood out. Cluster 9, which had the majority of the posts, was about Social Media, Technology and Data Science and their influence on Politics. Cluster 0 was about the effect of the scandal and how Facebook's advertising model was affecting privacy. The introduction of GDPR in the European Union was the central theme of Cluster 2. Cluster 7 had posts related to the data breach and the scandal, and Cluster 3 had posts related to Zuckerberg’s testimony. We also identified a cluster that had “Eid100” as a tag; this tag was about gaining working knowledge of digital technologies. Cluster 6 was about Privacy in newer technologies like Augmented Reality and Artificial Intelligence. Cluster 8 was about Privacy issues in WhatsApp and its encryption. Every cluster except Cluster 6 had “Cambridge Analytica” among its top 10 tags. This is further evidence that we as users were negligent about Privacy until the scandal.

Understanding User Engagement across Themes

We looked at the different themes that emerged from posts related to Privacy or Data, and we also saw an increase in the number of such posts after the scandal. It is now time to look at how user engagement (claps, recommends) has changed across these themes: what are users more interested in reading, and how does that differ before and after the scandal?

Cluster 0, which was about how Facebook's advertising model was affecting privacy, has been losing user engagement. The introduction of GDPR (Cluster 2) has high recommends and claps post-scandal. Cluster 8, which was about WhatsApp and its encryption, saw an increase in the number of claps and recommends post-scandal, even though 77% of its posts were published before the scandal. Clusters 3 and 7 were predominantly post-scandal. Overall, user interest in Privacy spiked after the scandal.
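
A comparison like this amounts to grouping posts by cluster and by period and then summarising an engagement metric. The sketch below uses the mean number of claps, which is an assumption: the article does not say whether totals, means or medians were compared. The function name and the sample rows are hypothetical.

```python
from collections import defaultdict
from datetime import date
from statistics import mean

SCANDAL_DATE = date(2018, 3, 17)  # split point used in the article


def mean_claps_by_cluster(posts):
    """Average claps per (cluster, period); period is 'pre' or 'post'."""
    groups = defaultdict(list)
    for cluster, published, claps in posts:
        period = "pre" if published < SCANDAL_DATE else "post"
        groups[(cluster, period)].append(claps)
    return {key: mean(values) for key, values in groups.items()}


# Hypothetical sample rows: (cluster id, publication date, claps).
sample = [
    (8, date(2017, 6, 1), 10),
    (8, date(2017, 8, 1), 20),
    (8, date(2018, 4, 2), 90),
    (2, date(2018, 5, 25), 40),
]

result = mean_claps_by_cluster(sample)
```

A cluster with no posts in a period simply has no entry for that period, which matches clusters that are predominantly post-scandal.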

End Notes

With Social Media and Technology becoming ubiquitous, data and privacy issues are going to be on the rise. The onus of protecting against the misuse of our data lies not only with the technology companies but also with the users. As Data Scientists or Analysts, we bear even more of the responsibility for preventing data misuse. Understanding the impact and moral implications of an analysis, and asking the right questions before undertaking it, can help prevent data misuse. The code for this entire analysis can be found here. As always, any constructive criticism or ideas on how to improve this analysis further are appreciated.