30 days of Data Science and Machine Learning Interview Questions
I have started this article to keep track of all my LinkedIn posts on 30 days of data science and machine learning (DSML) interview questions. My goal is to cover real interview questions asked by top companies like #google, #amazon, #meta and #microsoft. I have also shared posts on how these companies define their data science roles. Those posts are a helpful read to understand how these companies differ in job titles, responsibilities, and desired skills.
Data Science jobs in Google and Amazon
Data Science jobs in Meta and Microsoft
I will be continuously editing this article, adding the questions as I post them. I have taken feedback from my connections and readers and consolidated all the questions and answers into one article.
Question 1: What is the bias-variance trade-off? Explain it to a mix of technical and non-technical audiences.
Answer: This question can come up in many DS interviews. Sometimes we just explain the conceptual definition and fail to bring examples into the discussion. My goal is not to provide in-depth answers, but I will try to cover the main highlights that we should not miss.
- Error = Bias² + Variance + irreducible error. Bias and variance are the reducible errors, but there is always a trade-off between them.
- Bias comes from overly simple assumptions made by the model, for example assuming a linear relationship when the decision boundary is non-linear. Example: you have an image classifier and you use a simple, high-bias model; the features it learns are too simple to separate cats vs. jaguars vs. lions, because most of the common features are alike and simple features cannot distinguish between them.
- For a high-bias model, both training and test errors are high. That means high bias leads to an underfit model. An example of a high-bias model is linear regression.
- Variance arises when the model is too complicated: it learns the training data too well, including the noise, and therefore fails to make good predictions on test data. Continuing the example: if your model uses complex features like fur color, it may predict everything in your training data correctly, but it can fail on test data with a new or unseen fur color. Is fur color a good feature, or was it just noise?
- For a high-variance model, training error is very low but test error is high. High variance leads to an overfit model. An example of a high-variance model is a deep decision tree.
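To make the trade-off concrete, here is a minimal sketch (assuming scikit-learn and NumPy are installed, with made-up data) that fits an underfit and an overfit polynomial model on the same noisy, non-linear data and prints the training vs. test errors:

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.3, size=200)   # non-linear truth + noise
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# degree 1 -> high bias (underfits), degree 15 -> higher variance (fits training noise)
for degree in [1, 15]:
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    print(degree,
          round(mean_squared_error(y_tr, model.predict(X_tr)), 3),
          round(mean_squared_error(y_te, model.predict(X_te)), 3))

For the high-bias model both printed errors stay high, while for the high-degree model the training error drops below the test error, illustrating the two regimes.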
Question 2: Given a sample table of emails sent vs. received by users, calculate the response_rate, defined as emails sent / emails received, and list all the users that fall under the 25th percentile based on response_rate.
Below is the schema of table ‘gmail_sample_data’:
from_user: string - user id the email is sent from
to_user: string — user id of the email sent to
day: int — the day when email was sent
I have populated some sample data in my bit.io account. If you are not aware of bit.io, it lets you set up a Postgres DB in a few seconds without the hassle of installing Postgres and its dependencies. The free-tier account comes with 1 DB, 1 CPU, and 1 GB memory, which is enough to practice queries with small sample data.
When I approach a SQL question, I first come up with a logical plan before jumping into the query. It is important that you share your thought process with the interviewer. Here is my logical plan:
1. Get the count of emails sent group by user id
2. Get the count of emails received group by user id
3. To get the response rate, we have to bring both 1 & 2 onto the same row, because SQL operates row by row. We can do a join of 1 & 2
4. Calculate the response_rate
5. Use the ntile window function with 4 buckets (quartile bucketing)
6. Final result will be all values that fall under the 1st quartile
Once you have the logical plan, writing the query becomes easier and more efficient. I always prefer CTEs; you can also write the same logic with sub-queries. Below is the final query:
WITH sent_emails AS (
    SELECT from_user, COUNT(*) AS emails_sent
    FROM gmail_sample_data g1
    GROUP BY 1),
received_emails AS (
    SELECT to_user, COUNT(*) AS emails_received
    FROM gmail_sample_data g2
    GROUP BY 1),
summarized_view AS (
    SELECT from_user AS user_id,
           emails_sent * 1.0 / emails_received AS response_rate
    FROM sent_emails
    LEFT JOIN received_emails ON sent_emails.from_user = received_emails.to_user),
final_view AS (
    SELECT user_id, response_rate,
           NTILE(4) OVER (ORDER BY response_rate) AS quartile
    FROM summarized_view)
SELECT user_id, response_rate
FROM final_view
WHERE quartile = 1
Link to my bit.io where you can run the above query: https://bit.io/nbudhath/trial
Credit: I came across a similar question on StrataScratch, but this one is a different variation. I have used the same dataset to practice my query. StrataScratch is a great resource for practicing SQL. I have worked with Nathanael on sharing a few of my SQL interview questions as well.
Question 3: What are the assumptions of linear regression ?
This is a popular question in DS interviews. The interviewer is looking not only for explanations but also for examples. You may come across different versions online, but the ones you must include are the following four:
1. Linear relationship: There has to be a linear relationship between the dependent variable (also referred to as the output or target variable) and the given independent variables (also referred to as input variables or features). For example: we are predicting the price of a house, and let's say the given input variables are square_footage, num_rooms, location, year_built, and age. There could be more features, but for simplicity let's stick to these for now. The linear relationship assumption says we should be able to model the regression function as:
Price = B0 + B1*square_footage + B2*num_rooms + B3*location + B4*year_built + B5*age
(*note: there will be more coefficients for categorical variables like location, which will be handled by creating dummy variables)
2. Little to no multicollinearity: This assumption is tied to the first one, which says the input variables are independent, meaning there is little to no correlation among them. In the ideal case, only the target variable is correlated with the input features, not the features with each other. In our example, we will see a high correlation between year_built and age, because obviously the higher the age, the lower the year_built. We should use only one of the two variables; age is a good choice.
3. Homoscedasticity: The residuals or errors (true price - predicted price) should have constant variance. An easy way to remember this: your residual plot (residuals vs. predicted values) should show evenly distributed points around the zero line (the mean error). Heavily skewed features or outliers in the input features result in heteroscedasticity. Removing outliers or transforming variables using log, square root, inverse, or Box-Cox can help. In our example, if some neighborhood has a few houses that are priced very high or very low, that may violate this assumption.
4. Normality of residuals: Once we calculate the residuals for all data points, we can check a histogram or a Q-Q plot for normality of the residuals. If our model is good, it should capture all the trends (highs and lows) in the observed data. We can also interpret this as: the errors of our model should be consistent across the full range of observed data. If we have found a good fit, the residuals should follow a normal distribution.
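As a quick, hedged sketch of how you might check a couple of these assumptions in practice (the data below is made up to mirror the house-price example; statsmodels is assumed to be installed):

import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.stats.stattools import jarque_bera

rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "square_footage": rng.uniform(800, 4000, n),
    "num_rooms": rng.integers(1, 7, n),
    "age": rng.uniform(0, 60, n),
})
df["price"] = (50_000 + 150 * df["square_footage"] + 10_000 * df["num_rooms"]
               - 800 * df["age"] + rng.normal(0, 20_000, n))

X = sm.add_constant(df[["square_footage", "num_rooms", "age"]])
model = sm.OLS(df["price"], X).fit()

# Multicollinearity check: a VIF well above 5-10 flags a problematic feature
for i in range(1, X.shape[1]):          # skip the constant column
    print(X.columns[i], round(variance_inflation_factor(X.values, i), 2))

# Normality of residuals: Jarque-Bera test (a large p-value is consistent with normality)
jb_stat, jb_pvalue, skew, kurtosis = jarque_bera(model.resid)
print("Jarque-Bera p-value:", round(jb_pvalue, 3))

Plotting model.resid against model.fittedvalues gives the residual plot mentioned above for the homoscedasticity check.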
Question 4: What is regression to the mean (RTM)? How does correlation play a role in RTM?
Regression to the mean is the statistical phenomenon where rare or extreme events (at both the lower and the upper end) are more likely to be followed by typical events, and over time extreme observations regress towards the mean or average of the distribution.
One good example is in sports, when a rookie player performs extremely well or extremely poorly in their first season. In both cases, there is a high chance that they will regress towards the mean in upcoming seasons. Success in sports is attributed to Talent + Luck. Suppose talent is almost comparable between two players; then luck made the difference for one to perform well vs. the other. In that case, their performance will slowly converge towards the mean relative to other players.
Imperfect correlation leads to RTM. When the association between the target variable and an input feature is weak or moderate, there is some explanatory power between the two, but it is not perfect; that means there are other factors we are not yet aware of, and the weaker the correlation, the stronger the regression to the mean.
When you walk into a doctor's office for a simple check-up and your blood pressure in the first measurement is extremely high or low, they will take another measurement to rule out any systematic errors before concluding anything. This is the most important part: when we find extreme samples in our data, before we jump to any conclusions, we should take a few more measurements or run a few more experiments.
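Here is a minimal simulation of regression to the mean under the Talent + Luck framing above (NumPy assumed, numbers made up): players who were extreme in season 1 look much more average in season 2.

import numpy as np

rng = np.random.default_rng(42)
talent = rng.normal(0, 1, 10_000)
season1 = talent + rng.normal(0, 1, 10_000)   # performance = talent + luck
season2 = talent + rng.normal(0, 1, 10_000)   # same talent, fresh luck

top = season1 > np.quantile(season1, 0.95)    # extreme performers in season 1
print("season 1 mean of the top group:", round(season1[top].mean(), 2))
print("season 2 mean of the same group:", round(season2[top].mean(), 2))

The second number sits roughly halfway back towards the overall mean, because only the talent component persists while the luck component does not.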
Question 5: Your friend proposed a game of dice: you are given two fair six-sided dice and asked to roll. If the sum of the values on the dice equals seven, you win $21. You must pay $5 every time you roll the dice. Do you play this game?
This question and its solution can be found in many online resources, so I don't think it will be repeated verbatim, but similar questions with slight variations can always come up.
Solution: When we roll two dice, the total number of possible outcomes = 6*6 = 36
Now, what are the events of interest? We need to find the winning combinations, i.e., pairs that add up to 7: (1,6), (2,5), (3,4), (4,3), (5,2), (6,1) = 6 outcomes
Probability of winning , P(Win) = 6/36 = 1/6
Probability of losing, P(Loss) = 1- 1/6 = 5/6
Earning from win (E_Win) = $21 - $5 (to play) = $16
Earning from loss (E_Loss) = -$5 (negative)
Now we are simply calculating the expected earnings as:
= P(Win) * E_Win + P(Loss) * E_Loss
= 1/6*$16 + 5/6 * (-$5)
= -$1.5 (negative)
On average you will be losing money, so you should not play the game.
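A quick sanity check of the expected value, both exactly and with a small simulation (plain Python, using the numbers defined above):

import random

p_win = 6 / 36
expected = p_win * 16 + (1 - p_win) * (-5)
print(expected)   # -1.5

# simulate 100,000 rounds of the game
rolls = [random.randint(1, 6) + random.randint(1, 6) for _ in range(100_000)]
payoffs = [(21 - 5) if s == 7 else -5 for s in rolls]
print(round(sum(payoffs) / len(payoffs), 2))   # close to -1.5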
Question 6: How do you make an unfair coin a fair one ?
Let's say we have an unfair coin with probability of heads P(H) = 0.4 and probability of tails P(T) = 1 - 0.4 = 0.6. In fact, you can pick any split besides 0.5, making the coin unfair towards either heads or tails.
Now how can we get a fair deal from this unfair coin ? Let’s say we do two tosses. Possible results are: HH, HT, TH, TT.
P(HH) = P(H) * P(H) = 0.4 * 0.4 = 0.16
P(HT) = P(H) * P(T) = 0.4 * 0.6 = 0.24
P(TH) = P(T) * P(H) = 0.6 * 0.4 = 0.24
P(TT) = P(T) * P(T) = 0.6 * 0.6 = 0.36
If you look at the above results, HT and TH have the same probability of occurrence, 0.24. So instead of doing just one toss for heads or tails, we do two tosses and only keep the outcomes HT and TH, discarding HH and TT and tossing again. If we call HT 'heads' and TH 'tails', the unfair coin now behaves like a fair coin.
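This two-toss procedure is often called the von Neumann trick. A minimal simulation with P(H) = 0.4 (plain Python) shows the resulting "coin" coming out roughly 50/50:

import random

def biased_toss(p_head=0.4):
    return "H" if random.random() < p_head else "T"

def fair_flip():
    while True:
        a, b = biased_toss(), biased_toss()
        if a != b:                  # discard HH and TT, toss again
            return "H" if (a, b) == ("H", "T") else "T"

flips = [fair_flip() for _ in range(100_000)]
print(flips.count("H") / len(flips))   # close to 0.5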
Question 7: Explain geometric vs. negative binomial distribution ?
Most of the common distributions we come across are Bernoulli, binomial, uniform, normal, Poisson, and exponential. The geometric and negative binomial distributions are not as widely discussed, but they are very important to understand.
Both the geometric and negative binomial are related to the binomial distribution in that they involve independent trials with two possible outcomes. However, the random variable highlights a different aspect of the experiment. In the binomial (and Bernoulli), the random variable counts the number of successes, but in the geometric and negative binomial, the random variable is the number of trials we conduct to get a specific number of successes (one or many).
Consider an example when we toss a fair coin 3 times. Possible results are {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}. Let X be the random variable representing the number of heads. The probability of getting exactly 2 heads, is P(X=2) = 3/8 = 0.375
Now, let's look at it from the geometric distribution, where the random variable is the number of trials needed to get the first success (let's say getting a head is the success). Let x be the trial on which the first head occurs.
Then, P(X=x) = P(1st success on the xth trial)
= P(failures on the first (x-1) trials and a success on the xth trial)
= (1 - p)^(x-1) * p (where p is the probability of success and x = 1, 2, 3, ...)
Example: find the probability of getting the first head on your third toss.
P(X=3) = (1 - 0.5)^(3-1) * 0.5 = 0.5² * 0.5 = 0.125
The negative binomial generalizes the geometric distribution by considering any number of successes instead of one.
P(X=x) = P(kth success on the xth trial)
= C(x-1, k-1) * (1-p)^(x-k) * p^k, where x = k, k+1, k+2, ...
Example: What is the chance of seeing the 3rd head on the 5th toss?
Here p= 0.5, x= 5, and k= 3. If we use the above combination formula:
P(X=5) = [4!/(2!2!)] [1-(1/2)]² [1/2]³ = 6/32 = 0.1875
Some interesting facts:
1. When the number of successes is one (k=1 in the above equation), the negative binomial becomes the geometric distribution.
2. The negative binomial is also called the Pascal distribution.
3. In the geometric and negative binomial distributions, the random variable is the number of trials; in the binomial and Bernoulli, it is the number of successes. When the number of trials = 1, the binomial becomes the Bernoulli.
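The two worked examples can be verified with scipy.stats (assuming SciPy is installed; note how SciPy parameterizes the negative binomial by the number of failures):

from scipy.stats import geom, nbinom

# Geometric: first head on the 3rd toss of a fair coin
print(geom.pmf(3, 0.5))        # 0.125

# Negative binomial: 3rd head exactly on the 5th toss.
# SciPy counts failures before the kth success, so 5 tosses with 3 heads = 2 failures.
print(nbinom.pmf(2, 3, 0.5))   # 0.1875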
Question 8: We have a 'user_accounts' table with account and user information, along with the date on which the account was used by the user. Calculate the monthly growth rate of users for each account.
Below is the schema for table 'user_accounts':
date: string in 'MM/DD/YY' format
account_id: string
user_id: string
Output should have account_id, year, month, growth_rate ordered in increasing order of months.
Sharing my bit.io link where I added some sample data and the query:
Always remember these 3 steps to tackle the SQL question:
1. Ask Questions and Clear the doubts about data and assumptions
2. Build a Logical plan- communicate the plan back to interviewer
3. Write the query
First of all, it is important to ask questions and clear any doubts or assumptions beforehand. A few questions clarified with the interviewer:
1. What is the relationship between an account and a user?
Response: Account to user is one-to-many relationship. One account can have many users, but one user is associated with only one account at a time.
2. How to handle null values while calculating previous month data ?
Response: If the previous month is missing, simply disregard that month for analysis.
Logical plan:
1. Get the monthly active users (mau) for each account
2. Use the lag window function to get last month's MAU for each account. The lag function always has a null value for the first row, since there is no previous month. A null value in our growth-rate calculation throws an error, and even if we replace it with 0 we run into a divide-by-zero error. Therefore it is important to clarify edge cases like this beforehand.
3. Use (current - last) / last to get the monthly growth rate
Actual Query:
with mau as
(select account_id,
date_part('year', TO_DATE(date,'MM/DD/YY')) as year,
date_part('month', TO_DATE(date,'MM/DD/YY')) as month,
count(distinct user_id) as current_month
from user_accounts
group by 1,2,3
order by account_id, year, month
),
monthly_change as(
select *,
LAG(current_month, 1)
OVER (PARTITION BY account_id order by account_id, year, month) as last_month
from mau)
select
account_id, year, month,
Round((current_month - last_month)*100.0 / last_month, 2) as growth_rate
from monthly_change
where last_month is not null
Question 9: Explain the difference between Random Forest (RF) and Gradient Boosting (GB). Which one is better?
Sometimes we fail to understand the core concepts. Interviewers will pick up many clues while you are explaining whether you just studied for the interview question or actually know the concepts well. I would include the following key points in the discussion.
1. Both RF and GB are based on ensemble learning. RF is based on bagging (bootstrap aggregating), a technique where random samples are drawn with replacement to train each tree, and the results are later combined: averaging for regression and majority vote for classification. GB uses boosting, where many weak learners are combined in series to build a strong learner.
2. When it comes to overfitting, RF handles it better because of bootstrap random sampling: each tree never sees the full training set. In GB, each tree uses all of the training data, and at each iteration the new tree is fitted on the residuals to learn from past mistakes. So GB is more prone to overfitting than RF.
3. Another advantage of RF is that the trees are independent, since each tree trains on a different random sample, so training can be parallelized. In GB the trees are built sequentially, so we cannot parallelize the tree-building itself.
For the second question, I would not take a side; both can be equally effective depending on the problem space. It is better to evaluate both against your own business metrics and make a decision accordingly.
Fun fact: most of the popular Kaggle models are gradient boosting variants like XGBoost, LightGBM, CatBoost, and AdaBoost. If you are competing on Kaggle, try GBM first :)
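As a hedged sketch (toy data, near-default settings; in practice you would tune both and compare on your own metric), here is how you might put the two side by side with scikit-learn:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

rf = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=0)  # independent trees, parallel training
gb = GradientBoostingClassifier(n_estimators=200, random_state=0)         # trees built sequentially on residuals

print("RF accuracy:", round(cross_val_score(rf, X, y, cv=5).mean(), 3))
print("GB accuracy:", round(cross_val_score(gb, X, y, cv=5).mean(), 3))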
Question 10: Given a string s containing just the characters ‘(‘, ‘)’, ‘{‘, ‘}’, ‘[‘ and ‘]’, determine if the input string is valid, meeting following conditions:
1. Open brackets must be closed by the same type of brackets.
2. Open brackets must be closed in the correct order.
Let me clarify one important thing about the coding round for data science interviews. One of the most confusing parts of the data science interview process is the DSA (Data Structures and Algorithms) round. Some companies only cover the data analytics part, testing data wrangling skills in pandas or SQL, while others do a DSA round.
Since the data science interview process is not yet as streamlined as software engineering's, the coding round can be a little vague. In my experience, whenever there is a DSA round, it usually involves easy to medium level LeetCode questions. Unless you are interviewing for ML engineer roles, you should be fine with arrays, strings, stacks and queues, and some questions on linked lists and binary trees. You can usually skip dynamic programming, graphs, and tries for data science rounds, but always confirm with the recruiters.
I have attached the screenshot and the link to my Colab notebook with the solution to this problem.
Link to Colab: https://colab.research.google.com/drive/1gOlmZf2lwjuNzMq896Q-BlS6l7UPFFWT#scrollTo=hQiXzcXZlsYP
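For readers who cannot open the notebook, here is a minimal sketch of the standard stack-based approach (not necessarily the exact code in the Colab):

def is_valid(s: str) -> bool:
    # map each closing bracket to its matching opening bracket
    pairs = {")": "(", "}": "{", "]": "["}
    stack = []
    for ch in s:
        if ch in pairs:                              # closing bracket
            if not stack or stack.pop() != pairs[ch]:
                return False
        else:                                        # opening bracket
            stack.append(ch)
    return not stack                                 # valid only if nothing is left open

print(is_valid("()[]{}"))   # True
print(is_valid("(]"))       # False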
Question 11: If there is a 20 percent probability that you will see a falling star in any 30-minute interval, what is the probability that you see at least one falling star in the period of an hour (60 minutes)?
One trick for answering questions that ask for the probability of something happening at least once is to flip the question:
1. calculate the probability of the event NOT happening
2. probability of it happening at least once = 1 - P(not happening)
The probability of seeing a falling star in 30 mins = 0.2
The probability of NOT seeing a falling star in 30 minutes = 1 - 0.2 = 0.8
Since not seeing a star in the first 30 minutes is independent of not seeing a star in the second 30 minutes, the probability of NOT seeing a falling star in 60 minutes = 0.8^2 = 0.64
Now we are almost there: the probability of seeing at least one falling star in 60 minutes = 1 - 0.64 = 0.36
Question 12: What is the loss function in logistic regression called? Can you explain it, and is it possible to use it for multi-class classification?
The loss function in logistic regression is called log loss or binary cross entropy. In logistic regression, the model computes a score from the features and applies the sigmoid function to turn it into a predicted probability. That means for each sample we have a target label (1 or 0) and a predicted probability, and binary cross entropy is the cost function used to measure and minimize the loss of those predictions.
Let's take a simple example with two classes, Green vs. Red; one of them becomes our class of interest. Let's pick Green. Usually, the class of interest is also called the positive class or label 1. Consider the following three samples:
sample A (right classification)
predicted probability = 0.36
predicted label = 0
actual label = 0

sample B (misclassified)
predicted probability = 0.60
predicted label = 1
actual label = 0

sample C (right classification)
predicted probability = 0.70
predicted label = 1
actual label = 1
The formula for binary cross entropy is attached as a screenshot for readability; in plain text it is: BCE = -(1/N) * Σ [ yi * log(pi) + (1 - yi) * log(1 - pi) ], where yi is the actual label and pi is the predicted probability that sample i belongs to class 1 (the positive class). Let's calculate the loss for these test cases:
For sample A:
Binary cross entropy = -0 * log(0.36) - 1 * log(0.64) ≈ 0.45
For sample B:
Binary cross entropy = -0 * log(0.60) - 1 * log(0.40) ≈ 0.92
For sample C:
Binary cross entropy = -1 * log(0.70) - 0 * log(0.30) ≈ 0.36
As we can see, the misclassified example has a higher loss; and where the predicted probability for the right class is higher, the loss is lower.
Probability values are always between 0 and 1, so their log is always negative; the negative sign in front makes the loss positive. In this example, I calculated the loss for each sample individually, but as the formula shows, we calculate it over all N samples and try to minimize the average loss.
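The three sample losses above can be double-checked in a few lines of NumPy (natural log, matching the calculations above):

import numpy as np

def bce(y, p):
    # -[ y*log(p) + (1-y)*log(1-p) ], with p = predicted probability of label 1
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

print(round(bce(0, 0.36), 2))   # sample A, about 0.45
print(round(bce(0, 0.60), 2))   # sample B, about 0.92
print(round(bce(1, 0.70), 2))   # sample C, about 0.36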
To answer the second part of the question: yes, we can use multi-class (categorical) cross entropy with softmax instead of sigmoid, or we can use the one-vs-all technique with the same method discussed above.
Question 13: Given a trips and a users table, write a SQL query to find the cancellation rate of requests with unbanned users for each day between "2013-10-01" and "2013-10-03".
Schema for 'trips' table:
id: unique id for the table
client_id : this is client identification number
driver_id: this is driver identification number
status: this is the status of the request (completed, cancelled_by_client, or cancelled_by_driver)
request_at: requested date
Schema for 'users' table:
user_id: primary key for this table (it can be either client or driver)
banned: boolean field if user is banned or not (Yes/No)
role: user role ( either client or driver)
I have added sample data and tables in my bit.io database, so you can run and test the query. If you have a LeetCode account, you can also try it there.
Logical Query Plan:
1. cancellation rate = number of cancelled requests / total requests
2. filtering criteria: both clients and drivers should be unbanned, and date range filter for 10/1 to 10/3
3. the users table has to be joined twice since it is shared by both roles: once to get the driver's banned status, and again to get the client's banned status
4. finally we just apply group by on the joined table for the cancellation rate per day
The exact query is attached as a screenshot from my bit.io editor.
Link to my DB with query: https://bit.io/nirmalbudhathoki/randomDB
Leetcode link: https://leetcode.com/problems/trips-and-users/
You can also find the explanation for results in leetcode.
Question 14: 50% of candidates who receive a first interview receive a second interview; 90% of your friends who got a second interview felt good about their first interview; and 70% of your friends who did not get a second interview also felt good about their first interview. If you feel that you did well in your first interview, what is the probability you will receive a second interview?
This is a typical Bayes theorem question. Sometimes the first percentage is not given, and the interviewer wants to see whether you ask for it.
Let's denote 'pass' as being invited to the second interview, and 'fail' as not hearing back. We need to find the probability of passing given that you are feeling good.
P(pass ∣ good) = ?
P(pass) = 0.5
P(fail) = 1- 0.5 = 0.5
P(good ∣ pass) = 0.90
P(good | fail) = 0.70
Let's use Bayes theorem:
P(pass | good) = P(good | pass) * P(pass) / P(good)
P(good) = P(good | pass) * P(pass) + P(good | fail) * P(fail)
= 0.90 * 0.50 + 0.70 * 0.50
= 0.80
Now, P(pass | good) = 0.90 * 0.50 / 0.80 = 0.56
So even though 90% of your friends who got called for a second round felt good about their first interview, there is actually only a 56% chance of making it to the second round given that you feel good about yours.
Question 15: SQL question asked in DoorDash data scientist interview: Given the below table schema:
order_datetime: date read as string
restaurant_id: int
order_total: double
Write a SQL query that returns the top 1% revenue-generating restaurants for the month of May 2020. Total revenue is given by the sum of all orders for that month. Make sure to use evenly distributed buckets while getting the top 1%.
This question is testing your ability to use the ntile window function. Since it asks for the top 1%, the idea is to use ntile over total revenue with 100 buckets and return the top bucket.
Always share the logical plan first:
1. calculate the total revenue for each restaurant using sum of order_total
2. Apply given filters for the date to get data for only May 2020
3. Use ntile window function with n=100 over total revenue sorted descending to get the top 1%
4. group by restaurant id
5. select the ntile = 1 to get the top 1% from above results
I have added some sample data, and the query can be run in my bit.io account. A screenshot of the query is attached.
Query:
WITH orders as (
SELECT restaurant_id, sum(order_total) as total_revenue,
ntile(100) OVER (ORDER BY sum(order_total) desc)
FROM doordash_delivery
WHERE date_part('month',TO_DATE(order_datetime,'MM/DD/YY')) = 05 and
date_part('year', TO_DATE(order_datetime,'MM/DD/YY')) = 2020
GROUP by restaurant_id)
SELECT restaurant_id, total_revenue FROM orders
WHERE ntile =1
ORDER BY total_revenue desc
Link to my bit io account: https://bit.io/nirmalbudhathoki/randomDB
Question 16: This question is popularly known as the ants challenge. There are three ants at the three corners of a triangle. Assuming they move at the same speed and start at the same time, what is the probability that they do not collide?
This question is fairly simple. It is designed to test your problem-solving ability more than your knowledge of probability. Once you understand and break it down, the problem itself is not hard at all, but sometimes we may overthink it, especially in an interview setting.
Solution: Given that the ants start at the same time and move at the same speed, there are only two choices of direction for each ant at its corner of the triangle: it can go either clockwise (C) or anti-clockwise (A). So we have 3 ants with 2 possible moves each, which leads to the following combinations:
1. AAA
2. AAC
3. ACA
4. CAA
5. ACC
6. CAC
7. CCA
8. CCC
This is just like flipping a coin three times and listing the combinations. You don't have to list those choices; I am doing it just for explanation. You can directly calculate the count as 2³ = 8.
Now, the question asks for the probability of not colliding, which means all three have to move in the same direction. That gives us only two favorable options: all moving anti-clockwise (AAA) or all moving clockwise (CCC).
Probability (not Colliding) = 2/8 = 0.25
Question 17: This is one of the DS interview questions asked to test your understanding of data wrangling. How do you explain the group-by clause to someone who does not know any SQL?
Questions like these are asked to assess your ability to explain concepts. As a data scientist you will be working with a mix of stakeholders, both technical and non-technical. It is your responsibility to explain your findings from data exploration, or the insights and results from your model. This is actually a very simple question, yet I have sometimes seen folks struggle to explain it properly. Remember, you are not being asked to explain it from a SQL perspective.
I always suggest creating an example. This is how I would approach this question. Imagine that we are tasked with measuring the average height of all students in a school based on various attributes. If we are simply asked to calculate the average height of all students, then we measure the heights of all students and take an average.
What if we are asked to measure the average height by each grade? Now the concept of a group or category is introduced. Here grade is one categorical attribute: we group the students by grade and then measure the average height for each grade. What if we are further asked to measure the average height by both grade and gender? Now we have two grouping attributes, grade and gender, so we have to form subgroups of boys and girls within each grade. This is exactly the concept of the group-by clause in SQL, where we calculate aggregates (average height in our example) categorized by various attributes (grade and gender in our example).
Question 18: In regression, does having more independent variables always help build a better performing model? The goodness-of-fit metric (R2) always shows a higher score as we keep adding independent variables. How do you interpret that metric, and is it a good metric to use?
This is one of the most popular questions in data science interviews. It might show up in different variations, but the idea is to test your understanding of how R2 works and why it is not always a good metric for evaluating your model.
Answer: No, having more independent variables does not always give better performance, because some of the variables can add more noise than signal to the model. R2 can be misleading since it always stays the same or increases as we add more independent variables; it never penalizes adding redundant or noisy variables.
We can use adjusted R2 instead, which is a better goodness-of-fit metric than R2. Adjusted R2 penalizes adding unnecessary variables. The formula for both R2 and adjusted R2 is attached in the image; in plain text, Adjusted R2 = 1 - [(1 - R2)(n - 1) / (n - p - 1)], where n is the number of samples and p is the number of independent variables. As we can see, adjusted R2 uses p as a factor to account for the number of independent variables used.
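A minimal sketch of the difference (made-up data, scikit-learn assumed): adding a pure-noise feature can only push R2 up, while adjusted R2 corrects for the extra parameter.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

def adjusted_r2(r2, n, p):
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=(n, 1))
noise_feature = rng.normal(size=(n, 1))            # unrelated to y
y = 3 * x.ravel() + rng.normal(size=n)

for X in (x, np.hstack([x, noise_feature])):
    r2 = r2_score(y, LinearRegression().fit(X, y).predict(X))
    print(X.shape[1], "feature(s) -> R2:", round(r2, 4),
          " adjusted R2:", round(adjusted_r2(r2, n, X.shape[1]), 4))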
Question 19: There is one biased coin (with tails on both sides) in a jar filled with 10 coins. You randomly pick one coin. You are not allowed to check both sides; you can only toss it. You toss it five times and get all tails. What is the probability that you pulled the biased coin?
Initially this question does not sound like Bayes theorem, but once you break it down, this is a Bayes problem.
Let the event of picking a fair coin be F, the event of picking the biased coin be B, and the event of tossing 5 tails in a row be T, which in this case has already happened.
Then the question is asking us to find probability of unfair coin when we got 5 tails in a row, or P(B | T) = ?
Using Bayes theorem,
P(B|T) = P( T| B) * P (B) / P(T)
P(T|B) = 1 because we know that biased coin has both sides as tails
P(B) = 1/10 = 0.1 because there is only one biased coin in the jar
P(T) = P( T| B) * P (B) + P( T| F) * P (F)
P(F) = 9/10 = 0.9 because there are 9 fair coins in the jar
If the picked coin was fair, then probability of getting tail = head = 1/2 =0.5
P( T| F) = Probability of getting 5 tails from a fair coin = 0.5⁵ = 0.031
Now filling in all the values:
P(T) = P( T| B) * P (B) + P( T| F) * P (F)
= 1*0.1 + 0.031*0.9
= 0.1279
Then,
P(B|T) = P( T| B) * P (B) / P(T)
= 1 * 0.1 / 0.1279
= 0.78
Question 20: What are the various data imputation techniques you are aware of? And as a follow-up, depending on whether you mention it or not: how does the MICE technique work?
Asking about missing values and how to impute them is one of the most common questions in DS interviews. Most of the time we talk about techniques like dropping nulls, imputing with the mean or median, forward or backward fill, or model-based imputation such as regression; however, the least discussed technique is MICE (Multivariate Imputation by Chained Equations).
Let's take a simple example to understand the steps in MICE. Imagine data with the features age, income, and credit score, and a target variable indicating whether a loan is approved or denied. In the real world there will be more features, but we are simplifying to understand the concept.
Below are the steps to perform MICE imputation. Before we begin, we can drop the target column (loan in this example).
1. First we impute missing values for all features using one of the other applicable techniques mentioned above (mostly mean or median)
2. Remove the imputed values from one feature, say age, and make age the new target variable. The rows where age is present become training data, and the rows missing age become our test data. We run a regression using the other features (income and credit score, with their imputed values) to predict age.
3. Do the same for the other features. This completes one iteration.
4. Compare the dataframes from consecutive iterations; the goal is for the absolute difference of all imputed values to get as close to zero as possible. We can set an acceptable threshold, and if the difference is higher than that, repeat steps 2-3.
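scikit-learn's IterativeImputer follows this chained-equations idea; here is a hedged sketch on a tiny made-up version of the example above (the experimental import is required by the library):

import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

df = pd.DataFrame({
    "age":          [25,    np.nan, 47,     51,     np.nan, 33],
    "income":       [40000, 52000,  np.nan, 88000,  61000,  np.nan],
    "credit_score": [610,   700,    720,    np.nan, 680,    650],
})

# Each feature with missing values is modeled from the other features,
# iterating until the imputed values stabilize (max_iter / tol control the stopping rule).
imputer = IterativeImputer(max_iter=10, random_state=0)
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(imputed.round(1))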
Question 21: What is self-supervised learning? How does it differ from semi-supervised learning?
When we come across the various types of machine learning, we mostly see supervised, unsupervised, semi-supervised, and reinforcement learning. One of the less discussed types is self-supervised learning. As the name implies, it is a technique where the model is supervised by the data itself.
Having labels for supervised learning is desirable, but it's not always feasible, since labeling is expensive, resource intensive, and not always realistic. On the other hand, grouping or clustering is not always the kind of task we want the model to learn, so fully unsupervised learning is also not always ideal.
This is where the concept of self-supervised learning comes in. When we have big data, we want the model to learn from the data and inherently create labels from it. For example, we as human beings have built a lot of generalized knowledge or common sense from the things we have seen or perceived. When babies see lots and lots of pictures of cats without having seen a cat in real life, and then they actually see one, they can relate back to what they learned and label it as a cat. This is the core idea 💡: we want ML models to supervise themselves so they can make predictions on unseen data, based on the data they have seen.
To answer the follow-up question: semi-supervised learning is when we actually have some ground-truth labels, based on which we create labels for the rest of our training data; then we can use any supervised model on it. In self-supervised learning we do not have any human-provided labels at all; the labels are derived from the data itself.
Question 22: This was one of the questions asked for FB/Meta's Data Science role. Calculate each user's average session time. A session is defined as the time difference between a page_load and a page_exit. For simplicity, assume a user has only 1 session per day, and if there are multiple of the same events on that day, consider only the latest page_load and the earliest page_exit. Output the user_id and their average session time in seconds.
Table schema:
user_id: int → unique id for the user
timestamp: datetime → timestamp of the event
action: varchar → event type (whether page_load, page_exit, ….. etc.)
Logical Query Plan:
1. Since we are taking one session per user per day, we can define two CTEs- session start and session end
2. For session start- we consider max time for latest page load time
3. For session end- we consider min time for earliest page exit time
4. Join session start and session end tables on user id and date
5. Calculate the AVG on joined results
I have added the data to my bit.io account; the table is called user_activity. All the SQL questions that I have been sharing can be practiced on the sample data added there. Here is the link:
https://bit.io/nirmalbudhathoki/randomDB
Here is the query:
WITH session_start AS (
SELECT
user_id,
DATE(timestamp) AS date,
MAX(timestamp) AS last_page_load
FROM user_activity
WHERE action = 'page_load'
GROUP BY user_id, DATE(timestamp)),
session_end AS (
SELECT
user_id,
DATE(timestamp) AS date,
MIN(timestamp) AS first_page_exit
FROM user_activity
WHERE action = 'page_exit'
GROUP BY user_id, DATE(timestamp))
SELECT
ss.user_id,
ROUND(AVG(EXTRACT (epoch from (se.first_page_exit - ss.last_page_load))),2) AS avg_session_time_sec
FROM session_start ss
JOIN session_end se
ON ss.user_id = se.user_id AND ss.date = se.date
GROUP BY ss.user_id
Question 23: Are you aware of categorical encoding techniques like label encoding and one-hot encoding? Make sure to explain the pros and cons of each method.
This is a very popular question in data science interviews, and I have seen lots of repetition with slight variations, but the core idea is explaining label encoding vs. one-hot encoding.
Label encoding: each category is assigned a numerical value starting at 0. The unique categories are sorted first. For example, if color is the feature and we have red, green, and blue, label encoding assigns 0 to blue, 1 to green, and 2 to red. Some of the pros of this method are:
1. Simple to understand
2. Dimensionality is not increased: we keep one column per feature rather than one per category
3. It does not create a sparse matrix of features
The major drawback of this method: since numbers are assigned from an ordered list, the model might treat a higher value as higher importance. In our example, assigning 2 to red does not mean red is a more valuable color.
One-hot encoding: this is a technique of creating dummy variables where each category becomes its own feature, and each row gets exactly one bit set high (1) with the rest low (0). In the above example, blue, green, and red become three features. For a row where the value is blue, the features will look like below:
blue green red → 1 0 0
The advantage of one-hot encoding is that all features take only the values 1 or 0, so there is no confusion of a higher number implying higher importance. However, the drawbacks are:
1. it increases the number of variables and the sparsity of the data
2. there is something called the dummy variable trap, where the value of one dummy variable can be predicted from the others, which creates a multicollinearity problem (usually handled by dropping one dummy column)
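A minimal sketch of both encodings with pandas and scikit-learn (color values as in the example above):

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# Label encoding: one column, categories mapped in sorted order (blue=0, green=1, red=2)
df["color_label"] = LabelEncoder().fit_transform(df["color"])

# One-hot encoding: one column per category; drop_first=True avoids the dummy variable trap
one_hot = pd.get_dummies(df["color"], prefix="color", drop_first=True)

print(df)
print(one_hot)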
Question 24 : Can you use Bayes theorem to test two hypotheses ?
The answer is yes. Bayes theorem is used to calculate conditional probability given some information, or in other words, one event has already happened, and we are trying to calculate the probability of another event based on that.
Given two events A and B, the Bayes theorem states:
P(A | B) = P( B | A) * P(A) / P(B)
where:
P(A|B): the probability of A given B. From a hypothesis-testing perspective, we can read it as the probability of hypothesis A given the evidence B. This is also called the posterior, since it is what we are trying to calculate.
P(B|A): the probability of B given A, also known as the likelihood of observing the evidence given that the hypothesis is true.
P(A): the probability of A without any condition, also known as the prior, because it is our belief prior to observing B.
P(B): the probability of B without any condition, also known as the marginal (the probability of the evidence).
Using Bayes theorem, we can calculate the posterior for both hypotheses against the same evidence, so we calculate:
P(hypothesis_1 | B) and P(hypothesis_2 | B), and choose the hypothesis with the higher posterior.
In fact, Naive Bayes uses a similar method: it assigns the class with the higher posterior given the same features (evidence).
Question 25: What are various types of bias in statistics ?
This is one of the common data science questions that is company agnostic. When they are testing machine learning breadth, these types of questions are pretty common. The interviewer is looking for your understanding of the concepts and how you explain them. Adding examples adds a lot of value in these types of questions.
There are various types and subtypes, but I would like to highlight the following four:
1. Selection bias: This occurs when the data selected for sampling is not truly representative of the population. For example: I collected data about COVID patients from one hospital because it was easier to access, but the scope of my research is all patients across the entire United States.
2. Self-selection bias: This occurs when we let the subjects of interest decide for themselves whether to participate in the research. For example, if we give patients the option to opt in or out of a study on the efficacy of a COVID vaccine, people who don't believe in vaccines will never choose to take the survey, and the sample may be incomplete.
3. Observer bias: This occurs when the observers recording the results of the experiment are driven by their preference for desired or expected outcomes. For example, medical researchers might be inclined to prove the vaccine is working if they are involved in both inventing and testing it.
4. Survivorship bias: This occurs when the samples chosen are only the survivors, and crucial attributes from the non-survivors never make it into the analysis. For example, if we consider only the companies that survived the recession when analyzing financial performance in the tech industry, then we are introducing bias towards the survivors and neglecting the others.
One popular example of survivorship bias was the analysis of war planes to decide where to add armor, based on the spots that were hit by bullets. In fact, the spots without holes turned out to be more important to armor, because the planes that were shot down were likely hit in those weak spots; the returning planes with bullet holes already showed that a plane can survive hits in those areas.
Other types of bias include automation bias, measurement bias, recall bias, implicit bias, etc.
The important thing when answering general discussion questions like this is to keep it simple, but always share examples.
Question 26: How do you split train-validation-test samples for time series data? Can we apply the regular cross-validation technique?
This is another common data science interview question that I have seen across different companies. It can be a little tricky, since they are testing your understanding of how time series data should be treated while training a model.
When dealing with time series data, a random split or our regular cross-validation technique is a terrible idea, since it leads to a data leakage problem. In time series prediction or forecasting, we must make sure the test or validation set is always the succeeding set in time, and the training data is always the preceding set. If we fail to do so, information from the validation or test set leaks into the training set, and that leads to an overfit model.
Unlike regular k-fold cross validation, for time series the training set grows on a rolling basis. For example, if we have six months of data and each month is represented as a subset 1, 2, 3, 4, 5, 6, then the 5-fold cross-validation split will look like:
Fold 1 : training [1], validation [2]
Fold 2 : training [1 2], validation [3]
Fold 3 : training [1 2 3], validation [4]
Fold 4 : training [1 2 3 4], validation [5]
Fold 5 : training [1 2 3 4 5], validation [6]
As we can see, each iteration gives us more training samples to train the model, and the indexing has to be done by timestamp. At the end, the average score is taken just like in regular cross validation, or we can even consider taking a weighted average based on training data size. If we do a weighted average, the assumption is that folds trained on more samples give more reliable scores, so they should carry more weight.
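scikit-learn's TimeSeriesSplit implements exactly this rolling split; here is a minimal sketch where the six "months" are just six positions already sorted by time:

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

months = np.array([1, 2, 3, 4, 5, 6])          # data already ordered by timestamp
tscv = TimeSeriesSplit(n_splits=5)

for fold, (train_idx, val_idx) in enumerate(tscv.split(months), start=1):
    print(f"Fold {fold}: training {months[train_idx]}, validation {months[val_idx]}")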
Question 27: What is Central Limit Theorem (CLT) ? How does it help in Machine Learning ?
This is another common question that comes up when you are being tested on stats and ML breadth.
Here is the textbook definition: for a population with mean μ and standard deviation σ, if we take sufficiently large random samples (n >= 30), then the distribution of the sample means will be approximately normal, regardless of the population distribution.
Wait a minute ✋ the definition sounds a little confusing; there are too many 'distribution' words in there. However, the core meaning is simple: regardless of how the population is distributed, if we take enough sufficiently large random samples, the means of those samples will follow an approximately normal distribution.
Explaining the definition part of the CLT is not bad. However, the interviewer is also looking for your understanding of its application to ML. In fact, the follow-up question carries more weight here.
Machine learning models try to learn patterns from training data and generalize them to make predictions on test data. The training data consists of deterministic information (mostly static: if this happens, then that will happen) and random (uncertain) information. If everything were deterministic, we would not need ML; we could simply use a rule-based system.
Let y be the target variable and X (x1, x2, x3, …., xn) be input features. Now, y can be expressed as:
y = f(X) + Σ error
= deterministic function (X) + Σ error
According to the CLT, when we have enough training samples, y should follow a normal distribution, and since y is the combination of deterministic results from a function (our model) and a sum of random errors, the errors should also follow a normal distribution. If the errors are normally distributed, we can apply linear algorithms to the dataset with good results. If not, maybe the data is incomplete or the model is incorrect, and we have to explore non-linear algorithms.
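A quick simulation of the CLT itself (NumPy and SciPy assumed): sample means drawn from a heavily skewed exponential population still end up looking approximately normal.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
population = rng.exponential(scale=2.0, size=100_000)   # very non-normal population

sample_means = [rng.choice(population, size=50).mean() for _ in range(5_000)]

# skewness close to 0 means roughly symmetric / normal-looking
print("population skew:", round(stats.skew(population), 2))
print("sample-mean skew:", round(stats.skew(sample_means), 2))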
PS: I googled CLT for images, and it came up as Cross Laminated Timber. So I created this pic of normally distributed timber logs pile :)
Question 28: What is the power of a test in hypothesis testing? How is it measured, and why is it important?
This was one of the questions asked by Meta in their DS interview. As I mentioned before, Meta's data scientist jobs lean more towards analytics than core ML. Data scientists are usually attached to a product team and work on various statistical modeling tasks, so questions on hypothesis testing are very common with them.
The power of a test is the probability of rejecting the null hypothesis (H0) when the null is indeed false. So it is actually a measurement of making the right decision, unlike alpha (which measures false positives, the wrong decision against the null). That is why it is referred to as the power of the test: the higher, the better.
Power is calculated as 1 - β, where β is the probability that we fail to reject the null when we should have rejected it; in other words, it is the probability of making a Type II error (false negative). Therefore 1 - β, also referred to as the probability of a true positive, gives us what we need to measure the power of the test.
In the context of binary classification, the power of a test corresponds to sensitivity, the true positive rate.
Usually it is recommended to calculate power ahead of the experiment (a power analysis), which helps us determine whether the sample size we are considering for the experiment is large enough.
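As a hedged illustration of such a pre-experiment power analysis (assuming a two-sample t-test, with an illustrative effect size and the usual alpha; statsmodels assumed):

from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
# Solve for the sample size per group that gives 80% power at alpha = 0.05
n_per_group = analysis.solve_power(effect_size=0.2, alpha=0.05, power=0.8)
print(round(n_per_group))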
Sharing a link for further study with examples. In fact for many stats topics, I always refer to Penn State’s Stats class, which is free :)
Link: https://online.stat.psu.edu/stat415/lesson/25/25.1
Question 29: This is a SQL question with two parts, asked in one of the Amazon interviews. Given a user_activity table:
- What is the number of daily active users?
- List the users who spent at least one hour in a day.
Schema:
user_id — unique ID of the user
timestamp- the timestamp of the activity
action- activity performed by the user
Here is the logical plan that I recommend discussing with the interviewer before you start coding.
Number one rule of coding: NEVER jump into coding right away.
Logical Query Plan:
1. The first question is pretty straightforward: just count the distinct user_id values grouped by day
2. For the second question, we follow below steps:
2a. Create a CTE (or subquery) to get the difference between the max and min timestamps per user per day, which gives the time spent by the user
2b. Apply a filter of at least 1 hour to get the required list
Here is the link to my bit io, where I have the sample data and query for you to try it: https://bit.io/nirmalbudhathoki/randomDB
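The exact SQL lives in the bit.io link above; as a hedged illustration of the same plan, here is a pandas equivalent on a small hypothetical user_activity dataframe (column names follow the schema above):

import pandas as pd

user_activity = pd.DataFrame({
    "user_id":   [1, 1, 2, 2, 2],
    "timestamp": pd.to_datetime(["2023-01-01 09:00", "2023-01-01 09:20",
                                 "2023-01-01 10:00", "2023-01-01 11:30",
                                 "2023-01-02 08:00"]),
    "action":    ["login", "click", "login", "logout", "login"],
})
user_activity["day"] = user_activity["timestamp"].dt.date

# 1. Daily active users
dau = user_activity.groupby("day")["user_id"].nunique()

# 2. Users who spent at least one hour in a day (max - min timestamp per user per day)
spent = user_activity.groupby(["user_id", "day"])["timestamp"].agg(["min", "max"])
spent["duration"] = spent["max"] - spent["min"]
one_hour_users = spent[spent["duration"] >= pd.Timedelta(hours=1)].reset_index()["user_id"].unique()

print(dau)
print(one_hour_users)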
Question 30: What is regularization? If you are aware of L1 vs. L2, how do they differ, and which one do you prefer to use?
This is one of the very common questions that I have seen across many companies. I would say the likelihood of this question being asked in the machine learning section is fairly high.
If I have to give a one-liner definition: regularization is nothing but penalizing a model to avoid overfitting. Try to include the points below in the discussion:
⭕️ There are two types: L1 (also called Lasso) and L2 (also called Ridge). As mentioned, regularization adds a penalty to the model; L1 and L2 differ in how that penalty is added to the cost function. As you can see in the attached image, L1 adds the "absolute value of magnitude" of the coefficients as a penalty term to the loss function, while L2 adds the "squared magnitude" of the coefficients as the penalty term.
⭕️ Both regularizations have a hyper-parameter λ. The larger the λ, the stronger the penalty. Likewise, if λ is zero, regularization is deactivated. Using values that are too large may lead to an underfit model, so we use cross validation to pick the optimal value for λ.
⭕️ We can also combine and use both, which is called elastic net regularization.
⭕️ L1 can be more robust to outliers since we are dealing with absolute values; L2, however, gives a differentiable function, which makes optimization more computationally convenient and leads to a balanced shrinkage of the weights.
⭕️ L1 can also be used for feature selection, as the absolute-value penalty can drive the coefficients of some unnecessary features to exactly zero.
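A minimal sketch of the feature-selection effect (made-up data where only the first two features matter; scikit-learn's alpha plays the role of λ):

import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(size=200)   # only the first 2 features matter

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=0.1).fit(X, y)

print("L1 coefficients:", np.round(lasso.coef_, 3))   # noise coefficients typically exactly 0
print("L2 coefficients:", np.round(ridge.coef_, 3))   # small but non-zero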
Lastly, thank you everyone for following me on this journey. I hope these questions help you prepare for your data science interviews. Looking forward to doing more similar series in the future 😊