Cracking the Facebook Data Scientist Interview — Part 2

Deepen Panchal
10 min read · Mar 17, 2020

This is the second part of my attempt to provide a condensed list of resources to help candidates prep for interviews. If you haven’t already, please refer to Part 1 of this series.

The Preparation (Continued)

Quantitative Analysis: this interview is designed to evaluate quantitative reasoning and applied statistics.

  1. Quantitative Reasoning: testing knowledge of relevant mathematical/probabilistic/statistical concepts and how they relate to Facebook’s products.
  2. Applied Statistics: problems drawn from real-world data or estimation.

Expect questions on:

  • Estimation and logical reasoning in the context of a real-world product
  • Descriptive statistics (mean/expected value, median, mode, percentiles)
  • Common discrete and continuous distributions
  • Real-world data — what does it typically look like? How is it distributed?
  • Law of Large Numbers, Central Limit Theorem, Linear Regression
  • Conditional probabilities, including Bayes’ Theorem
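The Law of Large Numbers and the Central Limit Theorem are easiest to internalize with a quick simulation. Here is a minimal Python sketch of my own (not from any interview) using die rolls:

```python
import random
import statistics

random.seed(42)

# Law of Large Numbers: the sample mean of many die rolls
# converges to the true mean of 3.5.
rolls = [random.randint(1, 6) for _ in range(100_000)]
print(statistics.mean(rolls))  # very close to 3.5

# Central Limit Theorem: means of many independent samples are
# approximately normal, whatever the underlying distribution.
sample_means = [
    statistics.mean(random.randint(1, 6) for _ in range(50))
    for _ in range(2_000)
]
# The spread of the sample means shrinks like sigma / sqrt(n):
# for a die, sigma ≈ 1.708, so sigma / sqrt(50) ≈ 0.24.
print(statistics.stdev(sample_means))
```

Plotting a histogram of `sample_means` makes the bell shape obvious even though a single die roll is uniform.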

Understanding the questions:

Here’s a set of Glassdoor resources to give you a quick overview of the types of questions asked:

Prep Guide

Once you have a good grasp of what type of questions are asked, here’s my suggestion on how you can go about your preparation. Use brilliant.org as your bible for all your Math, Stats and Probability. It is one of the best resources on the web: it doesn’t just help you solve questions, it makes you think about the concepts and forces you to understand them through real-world problems. However, it’s easy to get lost in their ocean of material, so I would confine yourself to what is required for your preparation.

Statistics

Start here: https://brilliant.org/courses/statistics/

Solve the following subsections:

  • Statistics Introduction
  • Building Blocks — you can skip the Simpson’s Paradox part
  • Lying with Statistics — you can skip regression paradox and more graphs
  • Variance and Normal Curves
  • Experiments

Probability

Start here: https://brilliant.org/courses/probability-fundamentals/

Solve the following subsections:

  • Introduction — this will surely force you to think; the course was created with our common fallacies in mind
  • Starting With Probability — you can skip Games and Gambling if you’d like
  • Roll the Dice — you can skip Endgame Strategy if you’d like

Then move to this course: https://brilliant.org/courses/probability/

Solve everything here except “Advanced Techniques”. You can skip Probability Applications if you think you did well in the prior sections. The one exception from the “Advanced Techniques” section: Geometric Probability is basic and a must-know.

Distributions

Start here: https://brilliant.org/courses/probability_ii/

Solve the following subsections:

  • Introduction
  • Discrete Random Variables
  • Expected Values
  • Variance
  • Discrete Distributions
  • Continuous Random Variables — you can skip Joint Distributions if you’d like
  • Continuous Distributions — you can skip all except the Normal distribution

Once you are done with the above, ask yourself the following questions.

Can I clearly associate real world situations to discrete vs continuous variables? Do I know what determines the classification?

Do I know the basic assumptions of Binomial, Normal distributions? Can I associate real world problems to them?

Poisson might be a little challenging to understand through Brilliant alone, so here is a helpful resource: https://towardsdatascience.com/the-poisson-distribution-and-poisson-process-explained-4e2cb17d459

Some Khan Academy resources on distributions are super helpful here as well.

Do I understand these limiting relationships?

  • Binomial → Poisson (n large, p small, with λ = np held fixed)
  • Binomial → Normal (n large, p not too close to 0 or 1)
  • Binomial → Poisson → Normal (a Poisson with large λ is itself approximately normal)
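You can sanity-check these approximations numerically. A small Python sketch of my own (the function names and parameter choices are mine, not from any interview):

```python
from math import comb, exp, factorial, sqrt, erf

def binom_pmf(k, n, p):
    # P(X = k) for X ~ Binomial(n, p)
    return comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k, lam):
    # P(X = k) for X ~ Poisson(lam)
    return lam**k * exp(-lam) / factorial(k)

def normal_cdf(x, mu, sigma):
    # P(X <= x) for X ~ Normal(mu, sigma^2)
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

# Binomial -> Poisson: n large, p small, lambda = n*p fixed.
n, p = 1000, 0.005           # lambda = 5
print(binom_pmf(3, n, p))    # approximately equal to poisson_pmf(3, 5)
print(poisson_pmf(3, 5))

# Binomial -> Normal: n large, p moderate; X is approx N(np, np(1-p)).
n, p = 1000, 0.4
mu, sigma = n * p, sqrt(n * p * (1 - p))
# P(X <= 410) via the normal approximation with continuity correction:
print(normal_cdf(410.5, mu, sigma))
```

The two PMF values in the first comparison agree to about three decimal places, which is exactly the intuition the interviewers are probing for.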

Discuss some very simple real-world data points with a friend. Example: “How would you model the distribution of the number of naans people in New York eat in a given time period?” (I am not a big naan fan though ;)

Once comfortable with distributions from the above sources, here’s a summary of the probability mass functions (PMFs) and moments of discrete random variables (DRVs). These are not meant to be memorized, but to be understood.

  • Independence of two DRVs mirrors the independence property of two events, i.e.
    P(X=x and Y=y) = P(X=x) · P(Y=y)
    OR
    P(X=x|Y=y) = P(X=x)
    OR
    P(Y=y|X=x) = P(Y=y)
  • The cumulative distribution function (CDF) is represented by
    Fx(x) = P(X ≤ x)
  • Expected Values:
    Generically, E[X] = ∑ x · Px(x)
    Linearity of expectation: E[X+Y] = E[X] + E[Y]
  • Variance:
    Generically, var(X) = E[(X - E[X])²] = E[X²] - (E[X])²
  • For DRVs, var(X) = ∑ (x - E[X])² · Px(x)
    where Px(x) is the probability that the DRV X takes the value x.
    If you’re looking to derive the variance of a given dataset of n equally likely observations, the formula reduces to
    var(x) = (1/n) · ∑ (x - x̄)²
  • Covariance:
    cov(X,Y) = E[(X - E[X]) · (Y - E[Y])]
    which expands to cov(X,Y) = E[XY] - E[X] · E[Y]
  • Uniform Distribution:
    Px(x) = 1/n, where x = 1, 2, 3, …, n
  • Bernoulli Distribution (events that have only success/failure states):
    Px(x) = p, if success
    Px(x) = 1-p, if failure
    E[X] = p
    var(X) = p - p² = p(1-p)
    (This follows from E[X²] - (E[X])²,
    where E[X²] = p · 1² + (1-p) · 0² = p — since E[X²] = ∑ x² · Px(x))
  • Binomial Distribution
    Assumptions:
    1. The outcome of each trial is always 0 or 1
    2. The trials are independent
    3. The number of trials n is fixed
    4. The probability p of success on each trial is constant
    Px(k) = (nCk) · pᵏ · (1-p)^(n-k)
    where nCk is the binomial coefficient, read as “n choose k”, and represents the number of combinations of k items that can be chosen from n.
    E[X] = np
    var(X) = np(1-p)
  • Geometric Distribution:
    Px(k) = (1-p)ᵏ · p, where k is the number of failures before the 1st success.
    E[X] = (1-p)/p
    A handy formula you’ll use for the SUM of an infinite geometric series: a/(1-r), where a is the first term of the series and each term is “r” times the previous term (valid for |r| < 1).
  • Poisson Distribution
    Assumptions:
    1. An event occurs or does not; there are no partial events
    2. The avg. number of events per time period is known
    3. The probability of an event is constant at any point in the time period, e.g. Px(start-of-day) = Px(end-of-day)
    4. Each event is independent
    5. Any number of events can occur during the time period
    Therefore, P(X=k) = λᵏ · e^(-λ) / k!, for k = 0, 1, 2, 3, …
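None of these formulas need to be memorized if you can verify them from first principles. A quick Python check of a few of them (my own sketch, using only the definitions above):

```python
from math import comb, exp, factorial

# Bernoulli(p): var(X) = E[X^2] - (E[X])^2 = p - p^2 = p(1-p)
p = 0.3
e_x = p * 1 + (1 - p) * 0
e_x2 = p * 1**2 + (1 - p) * 0**2
assert abs((e_x2 - e_x**2) - p * (1 - p)) < 1e-12

# Binomial(n, p): mean and variance computed from the PMF
# should match n*p and n*p*(1-p).
n, p = 10, 0.3
pmf = [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]
mean = sum(k * pk for k, pk in enumerate(pmf))
var = sum((k - mean)**2 * pk for k, pk in enumerate(pmf))
print(round(mean, 6), round(var, 6))   # n*p = 3.0, n*p*(1-p) = 2.1

# Geometric (k failures before first success): E[X] = (1-p)/p
p = 0.25
mean_geom = sum(k * (1 - p)**k * p for k in range(10_000))
print(round(mean_geom, 6))             # (1-p)/p = 3.0

# Poisson(lam): the PMF sums to 1 over k = 0, 1, 2, ...
lam = 5
total = sum(lam**k * exp(-lam) / factorial(k) for k in range(100))
print(round(total, 6))                 # 1.0
```

Being able to re-derive a mean or variance like this in thirty seconds is far more robust under interview pressure than recall.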

SQL Coding

A lot of Facebook SQL questions are focused around self joins; thinking about the problems Facebook solves, that makes a lot of sense. Here is a small family of questions that covers a good portion of SQL. Note — these questions are publicly available on the web and are not from any specific employer’s interviews; I am only consolidating them for simplicity.

Question 1

Given the two tables, write a SQL query that creates a cumulative distribution of the number of comments per user. Assume bin bucket class intervals of one.

-- Basically, the question wants us to do a cumulative sum using a self-join
WITH test AS (
    SELECT total_enrolled,
           COUNT(*) AS freq
    FROM (SELECT a.id,
                 COUNT(*) AS total_enrolled
          FROM camps_sessions_weeks ac
          INNER JOIN camps_sessions asps
              ON ac.camp_session_fk = asps.id
          JOIN camps_activities a
              ON asps.camp_activity_fk = a.id
          GROUP BY 1) freq_hist
    GROUP BY 1
    ORDER BY total_enrolled
)
SELECT a.total_enrolled,
       SUM(b.freq)
FROM test a
LEFT JOIN test b
    ON b.total_enrolled <= a.total_enrolled
GROUP BY 1
ORDER BY 1
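To see the cumulative-sum-by-self-join pattern in isolation, here is a toy reproduction in Python’s sqlite3 (the table contents are my own made-up histogram, not data from the question):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE freq_hist (total_enrolled INTEGER, freq INTEGER);
INSERT INTO freq_hist VALUES (1, 10), (2, 20), (3, 5);
""")

# Self-join: for each bucket, sum the frequencies of all buckets <= it,
# which turns a plain histogram into a cumulative distribution.
rows = conn.execute("""
SELECT a.total_enrolled, SUM(b.freq) AS cum_freq
FROM freq_hist a
JOIN freq_hist b ON b.total_enrolled <= a.total_enrolled
GROUP BY a.total_enrolled
ORDER BY a.total_enrolled
""").fetchall()
print(rows)  # [(1, 10), (2, 30), (3, 35)]
```

On databases that support window functions, `SUM(freq) OVER (ORDER BY total_enrolled)` achieves the same result without the self-join, but the join version is what this style of question is usually probing for.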

Question 2

An attendance log for every student in a school district

attendance_events : date | student_id | attendance

A summary table with demographics for each student in the district

all_students : student_id | school_id | grade_level | date_of_birth | hometown

What percent of students attended school on their birthday?

-- Assuming that every student has attended class at least once
WITH cte AS (
    SELECT COUNT(DISTINCT student_id) AS cnt
    FROM all_students
)
SELECT COUNT(DISTINCT a.student_id) / cte.cnt
FROM attendance_events a
JOIN all_students b
    ON a.student_id = b.student_id
    AND a.date = b.date_of_birth
JOIN cte
    ON 1 = 1
WHERE a.attendance IS NOT NULL

Without the CTE:

SELECT s.student_id,
       a.date,
       COUNT(a.student_id) OVER (PARTITION BY s.student_id, a.date) AS count_of_birthdays_attended
FROM all_students s
LEFT JOIN attendance_events a
    ON s.student_id = a.student_id
    AND a.date = s.date_of_birth

Which grade level had the largest drop in attendance between yesterday and today?

-- Assume each student has attendance marked once per day
-- Assume attendance for today has been completed
WITH cte AS (
    SELECT s.grade_level,
           a.date,
           COUNT(a.student_id) AS number_of_attended_students
    FROM all_students s
    JOIN attendance_events a
        ON s.student_id = a.student_id
    WHERE a.date IN (current_date, current_date - 1)
    GROUP BY 1, 2
)
SELECT a.grade_level,
       a.date,
       a.number_of_attended_students,
       b.date,
       b.number_of_attended_students,
       a.number_of_attended_students - b.number_of_attended_students AS difference
FROM cte a
JOIN cte b
    ON a.grade_level = b.grade_level
    AND a.date != b.date

Question 3

SQL Cases: https://leetcode.com/problems/human-traffic-of-stadium/

Covers the case where the last 2 rows are edge cases that need to be handled.

SELECT t.* FROM stadium t
LEFT JOIN stadium p1 ON t.id - 1 = p1.id
LEFT JOIN stadium p2 ON t.id - 2 = p2.id
LEFT JOIN stadium n1 ON t.id + 1 = n1.id
LEFT JOIN stadium n2 ON t.id + 2 = n2.id
WHERE (t.people >= 100 AND p1.people >= 100 AND p2.people >= 100)
OR (t.people >= 100 AND n1.people >= 100 AND n2.people >= 100)
OR (t.people >= 100 AND n1.people >= 100 AND p1.people >= 100)
ORDER BY id;

Question 4

Given an event-level table of interactions between pairs of users (note that there are no duplicates within one day for one pair of users), for each possible number of “people interacted with”, find the count of users in that group on a given day (i.e. 10 people interacted with only one person, 20 with 2, etc.).
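No answer is included for this one, so here is one hedged sketch of an approach, demonstrated on a toy table in Python’s sqlite3 (the schema and all names are my assumptions, not from the question):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE interactions (day TEXT, user_a INTEGER, user_b INTEGER);
INSERT INTO interactions VALUES
  ('2020-03-17', 1, 2),
  ('2020-03-17', 1, 3),
  ('2020-03-17', 2, 3),
  ('2020-03-17', 4, 5);
""")

# Step 1: union both directions so each row is (user, counterpart).
# Step 2: count distinct counterparts per user for the day.
# Step 3: count users per "people interacted with" value.
rows = conn.execute("""
WITH directed AS (
    SELECT day, user_a AS user_id, user_b AS other FROM interactions
    UNION ALL
    SELECT day, user_b, user_a FROM interactions
),
per_user AS (
    SELECT user_id, COUNT(DISTINCT other) AS n_people
    FROM directed
    WHERE day = '2020-03-17'
    GROUP BY user_id
)
SELECT n_people, COUNT(*) AS n_users
FROM per_user
GROUP BY n_people
ORDER BY n_people
""").fetchall()
print(rows)  # [(1, 2), (2, 3)]
```

In the toy data, users 4 and 5 each interacted with one person, while users 1, 2 and 3 each interacted with two, so the output reads “2 users interacted with 1 person, 3 users with 2”.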

Question 5

We have a table called ad_accounts(account_id, date, status). Status can be active/closed/fraud.

A) What percent of active accounts are fraud?

SELECT COUNT(DISTINCT b.account_id) AS fraud_accounts,
       COUNT(DISTINCT a.account_id) AS active_accounts,
       COUNT(DISTINCT b.account_id) / COUNT(DISTINCT a.account_id)::float AS fraud_share
FROM sandbox.ad_accounts a
LEFT JOIN sandbox.ad_accounts b
    ON a.account_id = b.account_id
    AND b.status = 'fraud'
WHERE a.status = 'active'

B) How many accounts became fraud today for the first time?

-- Anti-join: keep today's fraud rows that have no earlier fraud row
SELECT COUNT(DISTINCT a.account_id)
FROM sandbox.ad_accounts a
LEFT JOIN sandbox.ad_accounts b
    ON a.account_id = b.account_id
    AND a.status = b.status
    AND a.date > b.date
WHERE a.status = 'fraud'
    AND a.date = current_date
    AND b.status IS NULL

C) What would be the financial impact of letting fraud accounts become active (how would you approach this question)?

If you are not a regular SQL user, I would suggest picking up the easy and medium problems on LeetCode and also solving the SQL problems on Mode.

Summary

Keep solving SQL, Stats and Probability questions! Keep discussing Product Interpretation problems with yourself and friends! If you get bored, jump to a different sub-section and come back later. Understand the concepts very thoroughly; it is THE most important thing. It will take practice and patience, but it is not a difficult interview.

My 2 cents

Be yourself on the day of the interview. If you over-prepare, your mind will get blocked by “this is how I am going to structure my product question”, “this is how I will solve questions on self-joins”, “these are my formulas for the expected values of each distribution”. Don’t get bogged down in these. DO NOT study until the last minute: plan to cool down the day before the interview and just do some basic reading. If you have understood the fundamentals well enough, you will recall how to solve a problem! Good luck!


Deepen Panchal

Sr. Data Scientist @ FAANG. Enhancing customer experience with AI