Interview Prep
SQL Interview Question for Data Analysts: Acceptance Rate
Working through SQL interview questions step by step.
Introduction
As a part of my role at a software company, I design SQL generator functions to automate the writing of SQL, so that users don’t actually have to know it. The code I write is all open source and publicly available on Github, and is the driving force behind the SQL Generator 5000 — a free website for generating SQL. Edit: Rasgo has transitioned to AI for SQL generation, so the above link is broken and the resource no longer exists.
As a part of my testing, I like to take real-world examples from the internet, and solve them using Rasgo. This helps me identify areas for improvement as well as validate that everything is working as expected.
I believe that sharing these examples will help folks that are currently learning SQL. I hope these individuals will benefit from seeing the way I approach the problem. I will always provide the raw SQL so that even readers who don’t care about the approach can skip to the end to get the solution.
Acceptance Rate By Date
Today’s interview question comes from stratascratch.com, question ID 10285:
What is the overall friend acceptance rate by date? Your output should have the rate of acceptances by the date the request was sent. Order by the earliest date to latest.Assume that each friend request starts by a user sending (i.e., user_id_sender) a friend request to another user (i.e., user_id_receiver) that’s logged in the table with action = ‘sent’. If the request is accepted, the table logs action = ‘accepted’. If the request is not accepted, no record of action = ‘accepted’ is logged.Table: fb_friend_requests
The Data
A unique row is identified by the pair of USER_ID.
This is confirmed by looking at the data, where we see an “accepted” friend request looks like:
+----------------+------------------+------------+----------+
| user_id_sender | user_id_receiver | date | action |
+----------------+------------------+------------+----------+
| ad4943sdz | 948ksx123d | 2020-01-04 | sent |
| ad4943sdz | 948ksx123d | 2020-01-06 | accepted |
+----------------+------------------+------------+----------+
An unaccepted friend request looks like:
+----------------+------------------+------------+--------+
| user_id_sender | user_id_receiver | date | action |
+----------------+------------------+------------+--------+
| ffdfff4234234 | lpjzjdi4949 | 2020-01-06 | sent |
+----------------+------------------+------------+--------+
Step-by-Step Solution
Separating Sent and Accepted
We clearly need to think of sents
and acceptances
as two different “things,” so I’ll start by breaking those out.
I call this is a filter
transformation, but in SQL terms this is just using a WHERE
clause.
SELECT *
FROM FB_FRIEND_REQUESTS
WHERE (ACTION = 'sent')SELECT *
FROM FB_FRIEND_REQUESTS
WHERE (ACTION = 'accepted')
To Join or Aggregate?
At the end of the day, we know that we will need aggregate by date, then divide SUM(ACCEPT) / SUM(SENT)
in order to create acceptance_percentage.
We also know that we need to join these two concepts together. The critical question is — which do we do first?
The key is remembering that acceptance
happens on a different day. A single invitation actually has 2 different dates: a sent date, and an accepted date. Though we only start with a single date column, there are virtually two different date concepts in the data.
So, if we aggregate by date before we join, the sent
and the acceptance
would be on different days. For example, on a particular day we might have 4 invitations accepted, and not a single invitation sent, which would give us an undefined percentage (4/0)**.
Therefore, we must join the data together so we can attach the sent
date to the acceptance
itself, and then aggregate based on sent date.
Incorrect:**
Correct:
**NOTE: See the final section for a discussion on this topic
Join
Use an OUTER JOIN so you keep dates that had something Sent, but not Accepted.
SELECT t1.USER_ID_SENDER
, t1.USER_ID_RECEIVER
, t1.ACTION, t1.DATE
, t2.USER_ID_SENDER AS ACCEPTED_USER_ID_SENDER
, t2.USER_ID_RECEIVER AS ACCEPTED_USER_ID_RECEIVER
, t2.ACTION AS ACCEPTED_ACTION
, t2.DATE AS ACCEPTED_DATE
FROM {{ sent_subquery }} AS t1
LEFT JOIN {{ accepted_subquery }} AS t2
ON t1.USER_ID_RECEIVER = t2.USER_ID_RECEIVER
AND t1.USER_ID_SENDER = t2.USER_ID_SENDER
Aggregate
Now we can aggregate based on the Sent Date. We are using a little inside-SQL knowledge here, because we are relying on the fact that counting the acceptance_ID
will be NULL because of the outer join, and therefore COUNT
won’t pick it up.
SELECT DATE --this is sent_date
, COUNT(ACCEPTED_USER_ID_RECEIVER) AS ACCEPT_COUNT
, COUNT(USER_ID_RECEIVER) AS SENT_COUNT
FROM {{ join_subquery_results }}
GROUP BY DATE
Our result looks like this:
+------------+--------------+------------+
| DATE | ACCEPT_COUNT | SENT_COUNT |
+------------+--------------+------------+
| 2020-01-04 | 3 | 4 |
| 2020-01-06 | 2 | 3 |
+------------+--------------+------------+
Final Step — Percentage
The final step is easy: we simply divide the 2 columns from the previous result. I call this is the math
transform.
SELECT *
, ACCEPT_COUNT / SENT_COUNT AS accepted_percentage
FROM {{ aggregate_result }}
Putting it all together
Now that we’ve stitched together all the pieces, let’s toggle to Code View to get the SQL and submit our answer.
Success!
Here is the answer in raw SQL code:
WITH CTE_SENT AS
(SELECT *
FROM fb_friend_requests
WHERE (ACTION = 'sent') ),
CTE_ACCEPT AS
(SELECT *
FROM fb_friend_requests
WHERE (ACTION = 'accepted') ),
CTE_JOINED AS
(SELECT t1.USER_ID_SENDER,
t1.USER_ID_RECEIVER,
t1.ACTION,
t1.DATE,
t2.USER_ID_SENDER AS ACCEPTED_USER_ID_SENDER,
t2.USER_ID_RECEIVER AS ACCEPTED_USER_ID_RECEIVER,
t2.ACTION AS ACCEPTED_ACTION,
t2.DATE AS ACCEPTED_DATE
FROM CTE_SENT AS t1
LEFT JOIN CTE_ACCEPT AS t2 ON t1.USER_ID_RECEIVER = t2.USER_ID_RECEIVER
AND t1.USER_ID_SENDER = t2.USER_ID_SENDER),
CTE_AGGREGATED AS
(SELECT DATE, COUNT(ACCEPTED_USER_ID_RECEIVER) * 1.00 AS ACCEPT_COUNT,
COUNT(USER_ID_RECEIVER) * 1.00 AS SENT_COUNT
FROM CTE_JOINED
GROUP BY DATE)
SELECT *,
ACCEPT_COUNT / SENT_COUNT AS accepted_percentage
FROM CTE_AGGREGATED
Note: I used an online SQL formatter to make it more readable, and used another SQL trick to make the counts show up as decimals.
Final Thoughts
I want to make some final comments about the decision to join before aggregating. This is a common decision point, and there is no right or wrong way to do it. The only reason we knew there was a “right” way, was because the interview question explicitly mentioned it. In the real world, it is totally valid for your metric definition to aggregate, and then join-by date.
The downside is that you would have a ratio that could exceed 100% or become undefined, due to the fact that SUM(SENT) / SUM(ACCEPT)
would be aggregating different invitations. Yet, the upside is that this metric would not be a “lagging” metric. In many scenarios, companies will make this tradeoff.
Conversely, the metric in the interview question will be a lagging metric, because we immediately count SUM(SENT)
, which will never change as time goes on, but SUM(ACCEPT)
will start off as nearly 0, and then every day we would have to refresh our query as the number goes up and up over time as people accept the invitations.
Conclusion
I hope you enjoyed this series. I plan on doing a few more real-world interview questions to see if folks find it useful. My goal was to show you that these SQL problems are solvable by breaking the problem down into smaller bite-sized chunks. If you have suggestions, questions, or comments, you can find me hanging out in Locally Optimistic and DataTalks.Club.
Go from SELECT * to interview-worthy project. Get our free 5-page guide.