How does Facebook handle the 4+ petabytes of data generated per day? The Cambridge Analytica - Facebook data scandal.

Ankush Sinha Roy
8 min read · Sep 15, 2020

Before moving on to Facebook, let’s look at a few points on why data is indeed considered the new gold!

  • Google gets over 3.5 billion searches daily.
    Google remains the highest shareholder of the search engine market, with 87.35% of the global search engine market share as of January 2020. Big Data stats for 2020 show that this translates into 1.2 trillion searches yearly, and more than 40,000 search queries per second.
  • WhatsApp users exchange up to 65 billion messages daily.
    5 million businesses are actively using the WhatsApp Business app to connect with their customers, and there are over 1 billion WhatsApp groups worldwide.
  • Internet users generate about 2.5 quintillion bytes of data each day.
    With the estimated amount of data we should have by 2020 (40 zettabytes), we have to ask ourselves what our part is in creating all of it. So, how much data is generated every day? About 2.5 quintillion bytes. That number seems rather high, but expressed in zettabytes it is only 0.0025 zettabytes per day, or roughly 0.9 zettabytes per year. Set against the 40 zettabytes expected by 2020, we are generating data at a steady pace.
  • By 2020, every person will generate 1.7 megabytes in just a second.
  • In 2019, there are 2.3 billion active Facebook users, and they generate a lot of data.

How big is this data generated by Facebook?

Facebook generates 4 petabytes of data per day, which is 4 million gigabytes. All that data is stored in what is known as the Hive, which contains about 300 petabytes of data. This enormous amount of content generation is without a doubt connected to the fact that Facebook users spend more time on the site than on any other social network, putting in about an hour a day.
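A quick sanity check on these figures, assuming decimal SI units (1 petabyte = 1,000,000 gigabytes; engineers sometimes use binary units instead):

```python
# Back-of-the-envelope check of the figures above, in decimal SI units.
PB_IN_GB = 1_000_000

daily_pb = 4    # Facebook's stated daily intake
hive_pb = 300   # approximate size of the Hive warehouse

daily_gb = daily_pb * PB_IN_GB      # 4 PB/day expressed in gigabytes
days_to_fill_hive = hive_pb / daily_pb

print(daily_gb)            # 4,000,000 GB per day
print(days_to_fill_hive)   # 75 days of intake would match Hive's size
```

In other words, at the stated rate, Hive holds only about two and a half months' worth of raw daily intake, which hints at how aggressively such data must be compressed, summarized, and expired.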

Facebook Big data challenges

Big data stores are the workhorses for data analysis at Facebook. They grow by millions of events (inserts) per second and process tens of petabytes and hundreds of thousands of queries per day. The three data stores used most heavily are:

1. ODS (Operational Data Store) stores 2 billion time series of counters. It is used most commonly in alerts and dashboards and for trouble-shooting system metrics with 1–5 minutes of time lag. There are about 40,000 queries per second.

2. Scuba is Facebook’s fast slice-and-dice data store. It stores thousands of tables in about 100 terabytes in memory. It ingests millions of new rows per second and deletes just as many. Throughput peaks around 100 queries per second, scanning 100 billion rows per second, with most response times under 1 second.

3. Hive is Facebook’s data warehouse, with 300 petabytes of data in 800,000 tables. Facebook generates 4 new petabytes of data and runs 600,000 queries and 1 million map-reduce jobs per day. Presto, HiveQL, Hadoop, and Giraph are the common query engines over Hive.
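To make the idea of a "slice-and-dice" store like Scuba more concrete, here is a minimal pure-Python sketch: group events by an arbitrary dimension and aggregate a metric per group. The table, dimensions, and rows below are all invented for illustration; Scuba's real interface is not public in this form.

```python
from collections import defaultdict

# Hypothetical event rows, each with a few dimensions and a metric,
# loosely imitating the kind of in-memory table Scuba holds.
events = [
    {"region": "EU", "endpoint": "/feed",  "latency_ms": 120},
    {"region": "US", "endpoint": "/feed",  "latency_ms": 95},
    {"region": "EU", "endpoint": "/photo", "latency_ms": 210},
    {"region": "US", "endpoint": "/feed",  "latency_ms": 101},
]

def slice_and_dice(rows, group_by, metric):
    """Group rows by one dimension and report (count, average) of a metric."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_by]].append(row[metric])
    return {k: (len(v), sum(v) / len(v)) for k, v in groups.items()}

print(slice_and_dice(events, "region", "latency_ms"))
# {'EU': (2, 165.0), 'US': (2, 98.0)}
```

The point of a system like Scuba is that this kind of regrouping can be done interactively, on any dimension, over billions of rows held in memory.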

ODS, Scuba, and Hive share an important characteristic: none is a traditional relational database. They process data for analysis, not to serve users, so they do not need ACID guarantees for data storage or retrieval. Instead, challenges arise from high data insertion rates and massive data quantities.

Big data analytics

Big data analytics is the often complex process of examining big data to uncover information — such as hidden patterns, correlations, market trends and customer preferences — that can help organizations make informed business decisions.

On a broad scale, data analytics technologies and techniques provide a means to analyze data sets and draw new information from them, which can help organizations make informed business decisions. Business intelligence (BI) queries answer basic questions about business operations and performance.

Why is it so important in any business?

Big data analytics through specialized systems and software can lead to positive business-related outcomes:

  • New revenue opportunities
  • More effective marketing
  • Better customer service
  • Improved operational efficiency
  • Competitive advantages over rivals

Big data analytics applications allow data analysts, data scientists, predictive modelers, statisticians and other analytics professionals to analyze growing volumes of structured transaction data, plus other forms of data that are often left untapped by conventional BI and analytics programs.

How big data analytics works

In some cases, Hadoop clusters and NoSQL systems are used primarily as landing pads and staging areas for data before it is loaded into a data warehouse or analytical database for analysis, usually in a summarized form that is more conducive to relational structures.

More frequently, however, big data analytics users are adopting the concept of a Hadoop data lake that serves as the primary repository for incoming streams of raw data. In such architectures, data can be analyzed directly in a Hadoop cluster or run through a processing engine like Spark. As in data warehousing, sound data management is a crucial first step in the big data analytics process. Data stored in HDFS must be organized, configured and partitioned properly to get good performance out of both extract, transform and load (ETL) integration jobs and analytical queries.
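To make the partitioning point concrete, here is a small sketch of the Hive-style `dt=YYYY-MM-DD` directory layout commonly used on HDFS, written against the local filesystem purely for illustration (the event rows and field names are invented):

```python
import os
import tempfile
from collections import defaultdict

# Hypothetical raw events, each carrying the date we partition on.
events = [
    {"dt": "2020-09-14", "user": "a", "action": "like"},
    {"dt": "2020-09-15", "user": "b", "action": "post"},
    {"dt": "2020-09-15", "user": "c", "action": "share"},
]

def write_partitioned(rows, base_dir):
    """Write rows into Hive-style dt=<date> partition directories, so a
    query filtered on dt only has to read the matching directories."""
    by_date = defaultdict(list)
    for row in rows:
        by_date[row["dt"]].append(row)
    for dt, part_rows in by_date.items():
        part_dir = os.path.join(base_dir, f"dt={dt}")
        os.makedirs(part_dir, exist_ok=True)
        with open(os.path.join(part_dir, "part-00000.csv"), "w") as f:
            for row in part_rows:
                f.write(f"{row['user']},{row['action']}\n")

base = tempfile.mkdtemp()
write_partitioned(events, base)
print(sorted(os.listdir(base)))  # ['dt=2020-09-14', 'dt=2020-09-15']
```

Partition pruning is why this layout matters: a query over a single day touches one directory instead of scanning the entire data set.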

Once the data is ready, it can be analyzed with the software commonly used for advanced analytics processes. That includes tools for:

  • data mining, which sifts through data sets in search of patterns and relationships;
  • predictive analytics, which builds models to forecast customer behavior and other future developments;
  • machine learning, which taps algorithms to analyze large data sets; and
  • deep learning, a more advanced offshoot of machine learning.
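As a toy illustration of the predictive-analytics idea, here is an ordinary least-squares line fit in pure Python; the data points are invented, and a real pipeline would use a library such as scikit-learn or Spark MLlib rather than hand-rolled formulas:

```python
# Fit y = a*x + b by ordinary least squares, then forecast the next point.
# The metric values below are made up for illustration.
xs = [1, 2, 3, 4, 5]                  # month index
ys = [10.0, 12.0, 14.0, 16.0, 18.0]   # observed metric per month

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
# Slope: covariance of (x, y) divided by variance of x.
a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
    / sum((x - mean_x) ** 2 for x in xs)
b = mean_y - a * mean_x

forecast = a * 6 + b   # predict month 6
print(a, b, forecast)  # 2.0 8.0 20.0
```

The "model" here is just a line, but the workflow is the same as in large-scale predictive analytics: fit parameters to historical data, then extrapolate to unseen inputs.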

Text mining and statistical analysis software can also play a role in the big data analytics process, as can mainstream business intelligence software and data visualization tools. For both ETL and analytics applications, queries can be written as MapReduce jobs or in programming languages such as R, Python, and Scala, as well as in SQL, the standard language for relational databases, which is supported on Hadoop via SQL-on-Hadoop technologies.
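The MapReduce model mentioned above can be sketched in a few lines of pure Python: a map step emits key-value pairs, a shuffle groups them by key, and a reduce step aggregates each group. This is the canonical word-count example; a real job would run distributed on Hadoop or Spark rather than in one process.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit (word, 1) for every word in an input line."""
    return [(word, 1) for word in line.split()]

def reduce_phase(key, values):
    """Reduce: sum the counts emitted for one word."""
    return key, sum(values)

lines = ["big data big analytics", "data big"]

# Shuffle: group all mapped pairs by key, as the framework would.
groups = defaultdict(list)
for line in lines:
    for key, value in map_phase(line):
        groups[key].append(value)

counts = dict(reduce_phase(k, vs) for k, vs in groups.items())
print(counts)  # {'big': 3, 'data': 2, 'analytics': 1}
```

Because map and reduce only see one record or one key at a time, the framework is free to run millions of such tasks in parallel across a cluster, which is what makes the model scale.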

Now comes the interesting topic!

What exactly is Cambridge Analytica?

Cambridge Analytica Ltd (CA) was a British political consulting firm that combined misappropriation of digital assets, data mining, data brokerage, and data analysis with strategic communication during electoral processes. It was started in 2013 as an offshoot of the SCL Group and was led by Alexander Nix. The company was partly owned by the family of Robert Mercer, an American hedge-fund manager who supports many politically conservative causes. CA closed operations in 2018 in the course of the Facebook–Cambridge Analytica data scandal, although related firms still exist; after legal proceedings including bankruptcy, members of the SCL Group have continued operations under the legal entity Emerdata Limited.

CEO Alexander Nix said CA was involved in 44 US political races in 2014. In 2015, CA performed data analysis services for Ted Cruz’s presidential campaign. In 2016, CA worked for Donald Trump’s presidential campaign as well as for Leave.EU (one of the organisations campaigning in the United Kingdom’s referendum on European Union membership). CA’s role in those campaigns has been controversial and is the subject of ongoing criminal investigations in both countries. Political scientists question CA’s claims about the effectiveness of its methods of targeting voters.

The infamous Facebook–Cambridge Analytica data scandal.

A recent survey by Harris Poll revealed that users found Facebook to be the hardest social media site to abstain from using. Although Zuckerberg is a programmer, he also majored in psychology, which may account for his insight into developing a site that is so addictive. The number of people joining the site is constantly increasing, with about 400 users signing up for Facebook each minute.

Of course, in that one minute, a lot more happens on Facebook than just 400 people joining. Every 60 seconds, 510,000 comments are posted, 293,000 statuses are updated, 4 million posts are liked, and 136,000 photos are uploaded. But that’s not enough for Facebook. Looking for ways to increase interactions with posts and advertisements, Facebook released the “reactions” option in 2016. More than 300 billion reactions were used in the year following the feature’s release. Of course, not all interactions, reactions, or even accounts on Facebook are legitimate. In fact, Facebook deleted 583 million fake accounts in the first three months of 2018.

Even with the removal of millions of accounts, the steady increase in the number of Facebook users meant that more and more posts were contending for users’ attention. Facebook admitted that at any given time, more than 1,500 stories compete for a space in a user’s newsfeed, of which only about 300 are chosen to appear. How these stories are selected changed in 2018, when Facebook announced adjustments to its algorithm that prioritize posts from users’ friends and family members. Though many Facebook users rejoiced at the change, it was bad news for companies banking on organic reach: they now need to rely more on paid advertisements to end up in a user’s newsfeed. Zuckerberg admitted that after the algorithm overhaul, people were spending less time on the site: down 50 million minutes per day, which is roughly 1–2 minutes per user.
Not only that, for the first time since its inception, Facebook experienced a decrease in its US and Canada user base in Q4 of 2017. It went from 185 million in Q3 to 184 million in Q4. Although a million users might seem like a drop in the bucket, it’s important to note that each monthly active user from the US and Canada brings in an average of $27 in revenue.

Now, what if this huge amount of data gets leaked?

The Facebook–Cambridge Analytica data breach, which came to light in early 2018, was a leak in which millions of Facebook users’ personal data was harvested without consent by Cambridge Analytica, predominantly for use in political advertising. It is the largest known leak in Facebook’s history.

The personal data of up to 87 million Facebook users was acquired via the 270,000 Facebook users who used a Facebook app called “This Is Your Digital Life.” When those users gave the third-party app permission to acquire their data back in 2015, the app also gained access to information about each user’s friend network; this resulted in the collection of data on about 87 million users, the majority of whom had not explicitly given Cambridge Analytica permission to access their data. The app developer breached Facebook’s terms of service by giving the data to Cambridge Analytica.
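The amplification from 270,000 consenting users to roughly 87 million affected people comes from that friend-network permission. A toy sketch of the mechanism, with an invented graph and names purely for illustration:

```python
# Toy friend graph: each user maps to the set of their friends.
# Granting an app access to one user also exposes that user's friends.
friends = {
    "alice": {"bob", "carol", "dave"},
    "bob":   {"alice", "erin"},
    "carol": {"alice", "frank"},
}

def exposed_users(consenting, graph):
    """Everyone whose data the app can reach: the consenting users
    plus every friend of a consenting user (one hop out)."""
    reached = set(consenting)
    for user in consenting:
        reached |= graph.get(user, set())
    return reached

print(sorted(exposed_users({"alice"}, friends)))
# ['alice', 'bob', 'carol', 'dave'] -- one consent exposes four people
```

With the average Facebook user having hundreds of friends, a one-hop permission like this multiplies each consent by a factor in the hundreds, which is how 270,000 installs could reach tens of millions of profiles.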

In the wake of the scandal over the monetization of Facebook users’ personal data, one assessment found that only 41% of Facebook users trust the company. On 26 March, the US Federal Trade Commission announced that it was “conducting an open investigation of Facebook Inc’s privacy practices following the disclosure that 50 million users’ data got into the hands of political consultancy Cambridge Analytica.” In March 2019, Facebook acknowledged it had concerns about “improper data-gathering practices” by CA months before December 2015, the previously reported date at which it was first alerted.

According to various sources, companies like Cambridge Analytica help political parties build influence and manipulate public opinion and election results in countries such as the UK, the USA, India, Kenya, Mexico and many more.

So, be careful before trusting what you see on the internet.