Data Science Interview Guide Part I: Take Home Assignment

Nanyang
9 min readAug 9, 2020
Photo by fauxels from Pexels

What is a data science interview? How to tackle it well?

I believe that they can be a lot of data science enthusiasts’ concerns.

I would like to share my experience in the NTU LevelUp programme organised by the NTU Career & Attachment Office.This programme pairs up students with mentors from various industries, aiming to give undergraduates like me a flavour of what is the life like to be working in the industry of our interest, providing us an outlook on the future work life and preparing us to acquire the relevant and essential skill sets for the industry.

I am very grateful to be under the mentorship of @Haiyang Sun who is our alumni and is now working as a data scientist, as I am quite interested in the data science field and would like to discover more and also want to know the roadmap to become a data scientist. Together with @Voon hao, another mentee under the same mentor, we have gone through a series of sessions with our mentor, including data science assignment, product and technical interview preparations.

In this article, I would like to share about our learning in the data science take home assignment. I will share some code written by us to give you some idea of how we tackle the problem.

There are two tasks in the assignment.

Task 1: Conversion Rate Prediction

The goal of this challenge is to build a model that predicts conversion rate and, based on the model, come up with ideas to improve revenue.

Dataset

Description of the data set

Approach

In order to build a prediction model, we need to explore the raw data first and then clean them and feed them to the model. The pipeline is as followed:

  1. Exploratory Data Analysis
  2. Model Selection
  3. Model Building
  4. Model evaluation

Exploratory Data Analysis

Correlation

### correlation analysis
corr=conversion_data.corr()
sns.heatmap(corr)

Conversion rate analysis by features

Country

### conversion analysis based on country
country_ana=conversion_data.groupby([‘country’,’converted’])[‘country’].count().unstack()
country_ana.columns=[‘0’,’1']
country_ana[‘%’] = country_ana[‘1’]/country_ana.sum(axis=1)*100
ax = country_ana[‘%’].plot.bar()
ax.set_title(‘Conversion Rate by country’)
ax.set_ylabel(‘Conversion Rate (%)’)

We can see that the conversion rate for China market is not high, which means more efforts shall be spent on other region liker Germany and UK.

Age

ax =sns.distplot(converted[‘age’],hist=False,color=’blue’,kde_kws={‘shade’:True},norm_hist=True,label=’converted’)
ax = sns.distplot(not_converted[‘age’],hist=False,color=’red’,kde_kws={‘shade’:True},norm_hist=True,label=’not converted’)
ax.set_title(‘age distribution ‘)

From the graph, the age distribution for converted cases are relatively younger.

ax = sns.scatterplot(x=age_ana.index,y=age_ana.converted)
ax.set_ylabel(‘# of Conversion’)
ax.set_title(‘Conversion cases by age’)
Presence of outliers

Not hard to find out that there are actually some outliers above age 100, it is highly unlikely to have that data in the real life. Therefore, there is a need to remove them for more accurate analysis.

### remove the anormality
age_ana_cleaned = age_ana.loc[age_ana.index<100]
age_ana_cleaned.reset_index(inplace=True)
## plot scatter graph to bin the age
bins =[15,19,25,29,35,39,45,49,55,59,65]
names = [‘15–19’,’20–25',’26–29',’30–35',’36–39',’40–45',’46–49',’50–55',’56–59',’60–61']
age_ana_cleaned[‘age_bin’]=pd.cut(age_ana_cleaned[‘age’],bins,labels=names).astype(object)
age_ana_cleaned_bin = age_ana_cleaned.groupby([‘age_bin’])[‘converted’].sum()

I cut the ages into bin for easier visualisation.

Source

source_ana=conversion_data.groupby([‘source’])[‘converted’].mean()
source_ana.plot.bar(title=’Percentage of conversion by source’)

Generally, the conversion rate does not vary drastically among the source. It is clear that Direct source has the lowest conversion rate, while Ads source the highest.Then what are the relationship between the age group and the source among the converted cases.

Bivariate analysis for age and source distribution

total_pages_visited

sns.distplot(converted[‘total_pages_visited’],hist=False,kde_kws={‘shade’:True},norm_hist=True,label=’converted’)
sns.distplot(not_converted[‘total_pages_visited’],hist=False,kde_kws={‘shade’:True},norm_hist=True,label=’not converted’)

Among the converted cases, the total_pages_visited shows a normal distribution with mean about 15.

Prepare data for model training

One-hot the categorical features and select the features for model training.

catdata=pd.get_dummies(cleaned_conv[[‘age_bin’,’country’,’source’]])
alldata = catdata.join(cleaned_conv[[‘new_user’,’total_pages_visited’,’converted’]])
x = alldata.drop(columns=’converted’)
y = alldata[‘converted’]

Model Selection

I have selected a several models for the training, select the best one for further improvement.

# for ML:
from sklearn.model_selection import train_test_split, KFold, cross_validate, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
## model selection
base_models = [(“DT_model”, DecisionTreeClassifier(random_state=42)),
(“RF_model”, RandomForestClassifier(random_state=42,n_jobs=-1)),
(“LR_model”, LogisticRegression(random_state=42,n_jobs=-1))]
kfolds = 4 # 4 = 75% train, 25% validation
split = KFold(n_splits=kfolds, shuffle=True, random_state=42)
for name, model in base_models:# get cross validation score for each model:
cv_results = cross_val_score(model,
x, y,
cv=split,
scoring=”accuracy”,
n_jobs=-1)
# output:
min_score = round(min(cv_results), 4)
max_score = round(max(cv_results), 4)
mean_score = round(np.mean(cv_results), 4)
std_dev = round(np.std(cv_results), 4)
print(f”{name} cross validation accuarcy score: {mean_score} +/- {std_dev} (std) min: {min_score}, max: {max_score}”)

Output:

DT_model cross validation accuarcy score: 0.9853 +/- 0.0003 (std) min: 0.985, max: 0.9857
RF_model cross validation accuarcy score: 0.9854 +/- 0.0003 (std) min: 0.9851, max: 0.9858
LR_model cross validation accuarcy score: 0.9861 +/- 0.0004 (std) min: 0.9856, max: 0.9866

From the preliminary run, it is found that Logistic Regression shows the best performance out of the selected models.

Fine-tuning parameters

Random Grid Search CV is used to find the optimal parameters for the model.

## choose logistic regression
from sklearn.model_selection import GridSearchCV
clf = LogisticRegression()
param_grid = {‘C’ : np.logspace(-4, 4, 20),
}
gs = GridSearchCV(clf, param_grid = param_grid, cv = split, verbose=True, n_jobs=-1)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 10)
gs.fit(x_train,y_train)

Evaluation

y_pred_acc = gs.predict(x_test)print(‘Accuracy Score : ‘ + str(accuracy_score(y_test,y_pred_acc)))
print(‘F1 Score : ‘ + str(f1_score(y_test,y_pred_acc)))
##Accuracy Score : 0.9858475648323846
##F1 Score : 0.7597315436241611

Feature Importance

lr = LogisticRegression(random_state=42,n_jobs=-1)
lr.fit(x_train,y_train)
plt.figure(figsize=(10,5))
plt.bar(x.columns,lr.coef_[0])
plt.xticks(rotation=45)
plt.title(‘Feature Importance’)

From the coefficients of the features , we can see that the age(15–29) and total_pages_visited have strong influence on the conversion rate prediction.

Learning points from task 1

  • Since there are two target outcomes, it is better to analyse the results in terms of different outcomes, to discover the potential relationship of the features respect to the outcomes.
  • It is essential to know what are the determining features that contribute the most to the outcome prediction, which is to know the feature importance. This helps us to analyse the features better.
  • There are possibly some outliers in the data, for example, the age feature, it is not likely for a 100-year-old to have the cognitive abilities to use internet service. We sometimes have to use our common sense to filter out some wrong information for a sound prediction model.

Task 2 Funnel Analysis

Further analyse conversion at each stage of the funnel, in order to identify areas of improvements.

  • A full picture of funnel conversion rate for both desktop and mobile
  • Some insights on what the product team should focus on in order to improve​ conversion rate as well as anything you might discover that could help improve conversion rate

Dataset

  1. User_table:
  • User_id
  • Date: date when the user firstly landed on the site
  • Device: user device [mobile/desktop]
  • Sex

2. Visited_page_tables: separate tables containing user_id that have visited the page. Pages are home_page(stage0), search_page(stage1), payment_page(stage2), page_confirmation(stage3) in sequential order.

Approach

This is a task that focuses more on data analytics, EDA is key and logical reasoning and deduction is required to tackle this task well.

Data Engineering

There are multiple data sources, it is essential to merge them into one master dataset.

## read the data and merge them
home_page = pd.read_csv(‘./dataset/home_page_table.csv’)
payment_con = pd.read_csv(‘./dataset/payment_confirmation_table.csv’)
payment_page = pd.read_csv(‘./dataset/payment_page_table.csv’)
search_page=pd.read_csv(‘./dataset/search_page_table.csv’)
user = pd.read_csv(‘./dataset/user_table.csv’)
user[‘home_page’]=1
user[‘search_page’]=0
user[‘payment_page’]=0
user[‘payment_confirmation’]=0
user.loc[user[‘user_id’].isin(search_page.user_id),’search_page’]=1
user.loc[user[‘user_id’].isin(payment_page.user_id),’payment_page’]=1
user.loc[user[‘user_id’].isin(payment_con.user_id),’payment_confirmation’]=1
### function to calculate conversion_ratedef conversion_rate(df):
search_page_rate = df[‘search_page’].sum()/df[‘user_page’].sum()
payment_page_rate = df[‘payment_page’].sum()/(df[‘payment_page’].sum()+ df[‘search_page’].sum())
payment_confirmation_rate = df[‘payment_confirmation’].sum()/(df[‘payment_page’].sum()+ df[‘payment_confirmation’].sum())
return search_page_rate,payment_page_rate,payment_confirmation_rate
  • Split the data for both mobile and desktop analysis
mobile = user[user[‘device’]==’Mobile’]
desktop = user[user[‘device’]==’Desktop’]
### conversion rate analysis on both device
mobile_conv = conversion_rate(mobile)
desktop_conv = conversion_rate(desktop)
source = pd.DataFrame({‘mobile’:mobile_conv,’desktop’:desktop_conv},index=[‘search_page’,’payment_page’,’payment_confirmation’])
ax =source.plot.bar()
plt.xticks(rotation=45)

It is clear that the overall conversion rate for mobile is higher than that for desktop, with search_page conversion rate being the same for both cases.

  • Gender analysis on each stage
## 
mobile_group_stage_sex = mobile.groupby([‘stage_reached’,’sex’])[‘user_id’].count().unstack()
desktop_group_stage_sex = desktop.groupby([‘stage_reached’,’sex’])[‘user_id’].count().unstack()figure, ax = plt.subplots(nrows=1, ncols=2)
desktop_group_stage_sex.plot.bar(ax=ax[0],figsize=(10,5))
mobile_group_stage_sex.plot.bar(ax=ax[1])
ax[0].set_title(‘Count of users by stages on desktop’)
ax[1].set_title(‘Count of users by stages on mobile’)

Gender seems to have little impact on the conversion rate. As the count of users of different genders is relatively the same at each stage.

Since this is a funnel analysis, it is natural to use funnel visualisation for the purpose. Here is the code by

.

from plotly import graph_objects as gostages = [“Home”, “Search”, “Payment”, “Payment Confirmation”]for i in [‘device’, ‘sex’, ‘month’]:

fig = go.Figure()
df_plot=df.groupby(i).count()[[‘home’, ‘search’, ‘payment’,’payment_confirmation’]].transpose()
for j in df_plot.columns:
fig.add_trace(go.Funnel(
name = str(j),
y = stages,
x = list(df_plot[j]),
textinfo = “value+percent previous”))
fig.update_layout(template=’seaborn’, title=”Funnel for different {}” .format(i))
fig.show()

Temporal Analysis

user[‘date’]=pd.to_datetime(user[‘date’])
byday_overall = user.groupby([‘date’])[‘user_page’,’search_page’,’payment_page’,’payment_confirmation’].sum()
bydaty_overall_rate = pd.DataFrame(byday_overall.apply(conversion_rate,axis=1))
bydaty_overall_rate.columns=[‘res’]
### split the number in the list to separate columns
bydaty_overall_rate_div = pd.DataFrame([pd.Series(x) for x in bydaty_overall_rate[‘res’]],index=bydaty_overall_rate.index)
bydaty_overall_rate_div.columns =[‘search_page’,’payment_page’,’payment_confirmation’]
bydaty_overall_rate_div.plot.line(figsize=(10,8))
Overall conversion (including both devices)

Overall we can see the rate at each stage keeps fluctuating within a range, but there is a sudden drop at 2015–03–02 for search_page stage conversion from 0.6 to 0.4 within a day, which is abnormal. Another unusual trend from that day onwards is that the conversion rate of payment_confirmation > payment_page.

Hence, there is a need to divide into devices to see which device platform accounts for those abnormalities.

Clearly, conversion rate of the mobile side explains that plummet. On the other hand, desktop accounts for the unusual high payment_confirmation with respect to payment_pages. Then, this is the time we shall ask ourselves, what possibly can cause those to happen, I have come out some reasons for that:

  1. There might be some bugs in the mobile tracking code, rendering the number to be inaccurate.
  2. There can be the issue of site errors and any site changes to the desktop pages

Overall Analysis and Suggestion

  • Mobile platform has higher conversion rate than desktop, suggesting that the it might be more user-friendly compared to desktop design. One advice is to improve the user interface design in the desktop.
  • Investigate what happened on the date of 2015–03–02 to find out the discrepancies in the conversion trend.

Challenges faced in the task

1. There is a time feature in the data. At first, I did not make use of them, thinking that it is of little use in the task and considering the little time left in the assignment. In fact, it is a big mistake. In a data science interview assignment, every feature has its own significance, the decision of whether to ignore them should be backed up by a valid and strong reasoning. Temporal data analysis is a key part of the analysis given a time feature.

Nevertheless, it is important to identify potential abnormality in the data trend, and even propose a possible reasoning to account for the discrepancies, this can really show that you have critical and logical thinking skills to analyse and solve the problems.

2. Should carry out a full comparison, finding all potential bivariate relationship among the given features.

Ending Note

Although I have done a few self-initiated projects, this project really gets me started on the data science journey for the first time in a lot more comprehensive way from data visualisation, analysis to model prediction.

Always start from and link back to the task objective, as it will allow you to be on the right track at all time.

Keep practising it and doing more projects, you will definitely ace it in the end!

Data Science Interview Guide Part II

If you have any advice, comment or ideas, feel free to share it below or you can even reach out to me via Linkedln.

--

--

Nanyang

Data Science enthusiast | Undergraduate NTU | Information Engineering & Media