Cracking the Facebook Data Scientist Interview — Part 2

Deepen Panchal
10 min read · Mar 17, 2020

This is the second part of my attempt to provide a condensed list of resources to help candidates prep for interviews. If you haven’t already, please refer to Part 1 of this series.

The Preparation (Continued)

Quantitative Analysis: this interview is designed to evaluate quantitative reasoning and applied statistics.

  1. Quantitative Reasoning: testing knowledge of relevant mathematical/probabilistic/statistical concepts and how they relate to Facebook’s products.
  2. Applied Statistics: problems drawn from real-world data or estimation.

Expect questions on:

  • Estimation and logical reasoning in the context of a real-world product
  • Descriptive statistics (mean/expected value, median, mode, percentiles)
  • Common discrete and continuous distributions
  • Real-world data — what does it typically look like? How is it distributed?
  • Law of Large Numbers, Central Limit Theorem, Linear Regression
  • Conditional probabilities, including Bayes’ Theorem
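The Law of Large Numbers and the Central Limit Theorem are easiest to internalize with a quick simulation. Here is a minimal Python sketch of my own (not from any interview) using die rolls:

```python
import random
import statistics

random.seed(42)

# Law of Large Numbers: the sample mean of many die rolls
# converges to the true mean of 3.5.
rolls = [random.randint(1, 6) for _ in range(100_000)]
print(statistics.mean(rolls))  # very close to 3.5

# Central Limit Theorem: means of many independent samples are
# approximately normal, whatever the underlying distribution.
sample_means = [
    statistics.mean(random.randint(1, 6) for _ in range(50))
    for _ in range(2_000)
]
# The spread of the sample means shrinks like sigma / sqrt(n):
# for a die, sigma ≈ 1.708, so sigma / sqrt(50) ≈ 0.24.
print(statistics.stdev(sample_means))
```

Plotting a histogram of `sample_means` makes the bell shape obvious even though a single die roll is uniform.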

Understanding the questions:

Here’s a set of Glassdoor resources to give you a quick overview of the types of questions asked:

Prep Guide

Once you have a good grasp of what type of questions are asked, here’s my suggestion on how you can go about your preparation. Use brilliant.org as your bible for all your Math, Stats and Probability. It is one of the best resources on the web: it doesn’t just help you solve questions, it makes you think about the concepts and forces you to understand them through real-world problems. However, it’s easy to get lost in their ocean of material, so I would confine yourself to what is required for your preparation.

Statistics

Start here: https://brilliant.org/courses/statistics/

Solve the following subsections:

  • Statistics Introduction
  • Building Blocks — you can skip the Simpson’s Paradox part
  • Lying with Statistics — you can skip regression paradox and more graphs
  • Variance and Normal Curves
  • Experiments

Probability

Start here: https://brilliant.org/courses/probability-fundamentals/

Solve the following subsections:

  • Introduction — this will surely force you to think; the course was created with our common fallacies in mind
  • Starting With Probability — you can skip Games and Gambling if you’d like
  • Roll the Dice — you can skip Endgame Strategy if you’d like

Then move to this course: https://brilliant.org/courses/probability/

Solve everything here except “Advanced Techniques”. You can skip Probability Applications if you think you did well in the prior sections. The one exception from the “Advanced Techniques” section: Geometric Probability is basic and a must-know.

Distributions

Start here: https://brilliant.org/courses/probability_ii/

Solve the following subsections:

  • Introduction
  • Discrete Random Variables
  • Expected Values
  • Variance
  • Discrete Distributions
  • Continuous Random Variables — you can skip Joint Distributions if you’d like
  • Continuous Distributions — you can skip all except the Normal distribution

Once you are done with the above, ask yourself the following questions.

Can I clearly associate real world situations to discrete vs continuous variables? Do I know what determines the classification?

Do I know the basic assumptions of Binomial, Normal distributions? Can I associate real world problems to them?

Poisson might be a little challenging to understand through Brilliant alone, so here is a helpful resource: https://towardsdatascience.com/the-poisson-distribution-and-poisson-process-explained-4e2cb17d459

Some Khan Academy resources on distributions are super helpful here as well.

Do I understand these limiting relationships?

  • Binomial → Poisson (n large, p small, with λ = np held fixed)
  • Binomial → Normal (n large, p not too close to 0 or 1)
  • Binomial → Poisson → Normal (a Poisson with large λ is itself approximately normal)
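You can sanity-check these approximations numerically. A small Python sketch of my own (the function names and parameter choices are mine, not from any interview):

```python
from math import comb, exp, factorial, sqrt, erf

def binom_pmf(k, n, p):
    # P(X = k) for X ~ Binomial(n, p)
    return comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k, lam):
    # P(X = k) for X ~ Poisson(lam)
    return lam**k * exp(-lam) / factorial(k)

def normal_cdf(x, mu, sigma):
    # P(X <= x) for X ~ Normal(mu, sigma^2)
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

# Binomial -> Poisson: n large, p small, lambda = n*p fixed.
n, p = 1000, 0.005           # lambda = 5
print(binom_pmf(3, n, p))    # approximately equal to poisson_pmf(3, 5)
print(poisson_pmf(3, 5))

# Binomial -> Normal: n large, p moderate; X is approx N(np, np(1-p)).
n, p = 1000, 0.4
mu, sigma = n * p, sqrt(n * p * (1 - p))
# P(X <= 410) via the normal approximation with continuity correction:
print(normal_cdf(410.5, mu, sigma))
```

The two PMF values in the first comparison agree to about three decimal places, which is exactly the intuition the interviewers are probing for.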

Discuss some very simple real-world data points with a friend. Example: “How would you model the distribution of the number of naans people in New York eat in a given time period?” (I am not a big naan fan though ;)

Once comfortable with distributions from the above sources, here’s a summary of the probability mass functions (PMFs) and moments of discrete random variables (DRVs). These are not meant to be memorized, but to be understood.

  • Independence of two DRVs mirrors the independence property of two events, i.e.
    P(X=x and Y=y) = P(X=x) · P(Y=y)
    OR
    P(X=x|Y=y) = P(X=x)
    OR
    P(Y=y|X=x) = P(Y=y)
  • The cumulative distribution function (CDF) is represented by
    Fx(x) = P(X ≤ x)
  • Expected Values:
    Generically, E[X] = ∑ x · Px(x)
    Linearity of expectation: E[X+Y] = E[X] + E[Y]
  • Variance:
    Generically, var(X) = E[(X - E[X])²] = E[X²] - (E[X])²
  • For DRVs, var(X) = ∑ (x - E[X])² · Px(x)
    where Px(x) is the probability that the DRV X takes the value x.
    If you’re looking to derive the variance of a given dataset of n equally likely observations, the formula reduces to
    var(x) = (1/n) · ∑ (x - x̄)²
  • Covariance:
    cov(X,Y) = E[(X - E[X]) · (Y - E[Y])]
    which expands to cov(X,Y) = E[XY] - E[X] · E[Y]
  • Uniform Distribution:
    Px(x) = 1/n, where x = 1, 2, 3, …, n
  • Bernoulli Distribution (events that have only success/failure states):
    Px(x) = p, if success
    Px(x) = 1-p, if failure
    E[X] = p
    var(X) = p - p² = p(1-p)
    (This follows from E[X²] - (E[X])²,
    where E[X²] = p · 1² + (1-p) · 0² = p — since E[X²] = ∑ x² · Px(x))
  • Binomial Distribution
    Assumptions:
    1. The outcome of each trial is always 0 or 1
    2. The trials are independent
    3. The number of trials n is fixed
    4. The probability p of success on each trial is constant
    Px(k) = (nCk) · pᵏ · (1-p)^(n-k)
    where nCk is the binomial coefficient, read as “n choose k”, and represents the number of combinations of k items that can be chosen from n.
    E[X] = np
    var(X) = np(1-p)
  • Geometric Distribution:
    Px(k) = (1-p)ᵏ · p, where k is the number of failures before the 1st success.
    E[X] = (1-p)/p
    A handy formula you’ll use for the SUM of an infinite geometric series: a/(1-r), where a is the first term of the series and each term is “r” times the previous term (valid for |r| < 1).
  • Poisson Distribution
    Assumptions:
    1. An event occurs or does not; there are no partial events
    2. The avg. number of events per time period is known
    3. The probability of an event is constant at any point in the time period, e.g. Px(start-of-day) = Px(end-of-day)
    4. Each event is independent
    5. Any number of events can occur during the time period
    Therefore, P(X=k) = λᵏ · e^(-λ) / k!, for k = 0, 1, 2, 3, …
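None of these formulas need to be memorized if you can verify them from first principles. A quick Python check of a few of them (my own sketch, using only the definitions above):

```python
from math import comb, exp, factorial

# Bernoulli(p): var(X) = E[X^2] - (E[X])^2 = p - p^2 = p(1-p)
p = 0.3
e_x = p * 1 + (1 - p) * 0
e_x2 = p * 1**2 + (1 - p) * 0**2
assert abs((e_x2 - e_x**2) - p * (1 - p)) < 1e-12

# Binomial(n, p): mean and variance computed from the PMF
# should match n*p and n*p*(1-p).
n, p = 10, 0.3
pmf = [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]
mean = sum(k * pk for k, pk in enumerate(pmf))
var = sum((k - mean)**2 * pk for k, pk in enumerate(pmf))
print(round(mean, 6), round(var, 6))   # n*p = 3.0, n*p*(1-p) = 2.1

# Geometric (k failures before first success): E[X] = (1-p)/p
p = 0.25
mean_geom = sum(k * (1 - p)**k * p for k in range(10_000))
print(round(mean_geom, 6))             # (1-p)/p = 3.0

# Poisson(lam): the PMF sums to 1 over k = 0, 1, 2, ...
lam = 5
total = sum(lam**k * exp(-lam) / factorial(k) for k in range(100))
print(round(total, 6))                 # 1.0
```

Being able to re-derive a mean or variance like this in thirty seconds is far more robust under interview pressure than recall.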

SQL Coding

A lot of Facebook SQL questions are focused around self joins; thinking about the problems Facebook solves, that makes a lot of sense. Here is a small family of questions that covers a good portion of SQL. Note — these questions are publicly available on the web and are not from any specific employer’s interviews; I am only consolidating them for simplicity.

Question 1

Given the two tables, write a SQL query that creates a cumulative distribution of the number of comments per user. Assume bin bucket class intervals of one.

-- Basically, the question wants us to do a cumulative sum using a self-join
WITH test AS (
    SELECT total_enrolled,
           COUNT(*) AS freq
    FROM (SELECT a.id,
                 COUNT(*) AS total_enrolled
          FROM camps_sessions_weeks ac
          INNER JOIN camps_sessions asps
              ON ac.camp_session_fk = asps.id
          JOIN camps_activities a
              ON asps.camp_activity_fk = a.id
          GROUP BY 1) freq_hist
    GROUP BY 1
    ORDER BY total_enrolled
)
SELECT a.total_enrolled,
       SUM(b.freq)
FROM test a
LEFT JOIN test b
    ON b.total_enrolled <= a.total_enrolled
GROUP BY 1
ORDER BY 1
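To see the cumulative-sum-by-self-join pattern in isolation, here is a toy reproduction in Python’s sqlite3 (the table contents are my own made-up histogram, not data from the question):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE freq_hist (total_enrolled INTEGER, freq INTEGER);
INSERT INTO freq_hist VALUES (1, 10), (2, 20), (3, 5);
""")

# Self-join: for each bucket, sum the frequencies of all buckets <= it,
# which turns a plain histogram into a cumulative distribution.
rows = conn.execute("""
SELECT a.total_enrolled, SUM(b.freq) AS cum_freq
FROM freq_hist a
JOIN freq_hist b ON b.total_enrolled <= a.total_enrolled
GROUP BY a.total_enrolled
ORDER BY a.total_enrolled
""").fetchall()
print(rows)  # [(1, 10), (2, 30), (3, 35)]
```

On databases that support window functions, `SUM(freq) OVER (ORDER BY total_enrolled)` achieves the same result without the self-join, but the join version is what this style of question is usually probing for.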

Question 2

An attendance log for every student in a school district

attendance_events : date | student_id | attendance

A summary table with demographics for each student in the district

all_students : student_id | school_id | grade_level | date_of_birth | hometown

What percent of students attended school on their birthday?

-- Assuming that every student has attended class at least once
WITH cte AS (
    SELECT COUNT(DISTINCT student_id) AS cnt
    FROM all_students
)
SELECT COUNT(DISTINCT a.student_id) / cte.cnt
FROM attendance_events a
JOIN all_students b
    ON a.student_id = b.student_id
    AND a.date = b.date_of_birth
JOIN cte
    ON 1 = 1
WHERE a.attendance IS NOT NULL

Without the CTE:

SELECT s.student_id,
       a.date,
       COUNT(a.student_id) OVER (PARTITION BY s.student_id, a.date) AS count_of_birthdays_attended
FROM all_students s
LEFT JOIN attendance_events a
    ON s.student_id = a.student_id
    AND a.date = s.date_of_birth

Which grade level had the largest drop in attendance between yesterday and today?

-- Assume each student has attendance marked once per day
-- Assume attendance for today has been completed
WITH cte AS (
    SELECT s.grade_level,
           a.date,
           COUNT(a.student_id) AS number_of_attended_students
    FROM all_students s
    JOIN attendance_events a
        ON s.student_id = a.student_id
    WHERE a.date IN (current_date, current_date - 1)
    GROUP BY 1, 2
)
SELECT a.grade_level,
       a.date,
       a.number_of_attended_students,
       b.date,
       b.number_of_attended_students,
       a.number_of_attended_students - b.number_of_attended_students AS difference
FROM cte a
JOIN cte b
    ON a.grade_level = b.grade_level
    AND a.date != b.date

Question 3

SQL Cases: https://leetcode.com/problems/human-traffic-of-stadium/

Covers the case where the last 2 rows are edge cases that need to be handled.

SELECT t.* FROM stadium t
LEFT JOIN stadium p1 ON t.id - 1 = p1.id
LEFT JOIN stadium p2 ON t.id - 2 = p2.id
LEFT JOIN stadium n1 ON t.id + 1 = n1.id
LEFT JOIN stadium n2 ON t.id + 2 = n2.id
WHERE (t.people >= 100 AND p1.people >= 100 AND p2.people >= 100)
OR (t.people >= 100 AND n1.people >= 100 AND n2.people >= 100)
OR (t.people >= 100 AND n1.people >= 100 AND p1.people >= 100)
ORDER BY id;

Question 4

Given an event-level table of interactions between pairs of users (note that there are no duplicates within one day for one pair of users), for each possible number of “people interacted with”, find the count of users in that group on a given day (i.e. 10 people interacted with only one person, 20 with 2, etc.).
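No answer is included for this one, so here is one hedged sketch of an approach, demonstrated on a toy table in Python’s sqlite3 (the schema and all names are my assumptions, not from the question):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE interactions (day TEXT, user_a INTEGER, user_b INTEGER);
INSERT INTO interactions VALUES
  ('2020-03-17', 1, 2),
  ('2020-03-17', 1, 3),
  ('2020-03-17', 2, 3),
  ('2020-03-17', 4, 5);
""")

# Step 1: union both directions so each row is (user, counterpart).
# Step 2: count distinct counterparts per user for the day.
# Step 3: count users per "people interacted with" value.
rows = conn.execute("""
WITH directed AS (
    SELECT day, user_a AS user_id, user_b AS other FROM interactions
    UNION ALL
    SELECT day, user_b, user_a FROM interactions
),
per_user AS (
    SELECT user_id, COUNT(DISTINCT other) AS n_people
    FROM directed
    WHERE day = '2020-03-17'
    GROUP BY user_id
)
SELECT n_people, COUNT(*) AS n_users
FROM per_user
GROUP BY n_people
ORDER BY n_people
""").fetchall()
print(rows)  # [(1, 2), (2, 3)]
```

In the toy data, users 4 and 5 each interacted with one person, while users 1, 2 and 3 each interacted with two, so the output reads “2 users interacted with 1 person, 3 users with 2”.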

Question 5

We have a table called ad_accounts(account_id, date, status). Status can be active/closed/fraud.

A) What percent of active accounts are fraud?

SELECT COUNT(DISTINCT b.account_id) AS fraud_accounts,
       COUNT(DISTINCT a.account_id) AS active_accounts,
       COUNT(DISTINCT b.account_id) / COUNT(DISTINCT a.account_id)::float AS fraud_share
FROM sandbox.ad_accounts a
LEFT JOIN sandbox.ad_accounts b
    ON a.account_id = b.account_id
    AND b.status = 'fraud'
WHERE a.status = 'active'

B) How many accounts became fraud today for the first time?

-- Anti-join: keep today's fraud rows that have no earlier fraud row
SELECT COUNT(DISTINCT a.account_id)
FROM sandbox.ad_accounts a
LEFT JOIN sandbox.ad_accounts b
    ON a.account_id = b.account_id
    AND a.status = b.status
    AND a.date > b.date
WHERE a.status = 'fraud'
    AND a.date = current_date
    AND b.status IS NULL

C) What would be the financial impact of letting fraud accounts become active (how would you approach this question)?

If you are not a regular SQL user, I would suggest picking up the easy and medium problems on LeetCode and also solving the SQL problems on Mode.

Summary

Keep solving SQL, Stats and Probability questions! Keep discussing Product Interpretation problems with yourself and friends! If you get bored, jump to a different sub-section and come back later. Understand the concepts very thoroughly; it is THE most important thing. It will take practice and patience, but it is not a difficult interview.

My 2 cents

Be yourself on the day of the interview. If you over-prepare, your mind will get blocked by “this is how I am going to structure my product question”, “this is how I will solve questions on self-joins”, “these are my formulas for the expected values of each distribution”. Don’t get bogged down in these. DO NOT study until the last minute: plan to cool down the day before the interview and just do some basic reading. If you have understood the fundamentals well enough, you will recall how to solve a problem! Good luck!


Deepen Panchal

Sr. Data Scientist @ FAANG. Enhancing customer experience with AI