Data Scientist Coding Exercise
Comparing take-home coding challenge problems from two industries. They vary in scope and difficulty. Suggested solutions are provided.
Are you an aspiring data scientist? Are you currently applying for data scientist positions? Do you have a data scientist interview coming up? Are you worried about the take-home coding exercise?
If you have any of the above questions in mind, then you are in the right place. This article will help answer some of the questions you might have about the data scientist coding exercise.
After going through a couple of data scientist interview processes, I would like to share my experience with the coding exercise with aspiring data scientists. Hopefully, it will help them be better prepared for this important phase of the interview process.
The Take-Home Challenge Problem (Coding Exercise)
So, you’ve successfully gone through the initial screening phase of the interview process. It is now time for the most important step in the interview process, namely, the take-home coding challenge. This is generally a data science problem, e.g. a machine learning model, linear regression, a classification problem, or time series analysis. Generally, the interview team will provide you with project directions and the dataset. If you are fortunate, they may provide a small dataset that is clean and stored in a comma-separated values (CSV) file. That way you don’t have to worry about mining the data and transforming it into a form suitable for analysis. In the couple of interviews I’ve had, I worked with two datasets: one had 160 observations (rows), while the other had 50,000 observations. The take-home coding exercise differs from company to company, as described below.
Sample 1: Coding Exercise for the Data Scientist Position (Take Home)
This coding exercise should be performed in Python (which is the programming language used by the team). You are free to use the internet and any other libraries. Please save your work in a Jupyter notebook and email it to us for review.
Data file: cruise_ship_info.csv (this file will be emailed to you)
Objective: Build a regressor that recommends the “crew” size for potential ship buyers. Please do the following steps (hint: use numpy, scipy, pandas, sklearn and matplotlib)
1. Read the file and display columns.
2. Calculate basic statistics of the data (count, mean, std, etc.), examine the data, and state your observations.
3. Select columns that will probably be important to predict “crew” size.
4. If you removed columns, explain why you removed them.
5. Use one-hot encoding for categorical features.
6. Create training and testing sets (use 60% of the data for training and the remainder for testing).
7. Build a machine learning model to predict the ‘crew’ size.
8. Calculate the Pearson correlation coefficient for the training and testing data sets.
9. Describe hyper-parameters in your model and how you would change them to improve the performance of the model.
10. What is regularization? What is the regularization parameter in your model?
Plot regularization parameter value vs Pearson correlation for the test and training sets, and see whether your model has a bias problem or variance problem.
Comments and Remarks: This is an example of a very straightforward problem. The dataset is clean and small (160 rows and 9 columns), and the instructions are very clear. All that is needed is to follow the instructions and write your code. Notice also that the instructions clearly specify that Python be used as the programming language for model building. The time allowed for completing this coding assignment was 3 days. Only the final Jupyter notebook has to be submitted; no formal project report is required.
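To make the steps in Sample 1 concrete, here is a minimal sketch of how the workflow might look. It is not the suggested solution: since cruise_ship_info.csv is not reproduced here, the code generates a synthetic stand-in, and the column names (cruise_line, tonnage, passengers, cabins) are assumptions. Ridge regression is used only as one example of a model with a regularization parameter (alpha), and the Pearson correlations are printed rather than plotted.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Synthetic stand-in for cruise_ship_info.csv. The column names and the
# relationship between them are assumptions made for illustration only.
rng = np.random.default_rng(0)
n_rows = 160
df = pd.DataFrame({
    "cruise_line": rng.choice(["Line_A", "Line_B", "Line_C"], size=n_rows),
    "tonnage": rng.uniform(10, 200, size=n_rows),
    "passengers": rng.uniform(5, 50, size=n_rows),
    "cabins": rng.uniform(2, 25, size=n_rows),
})
df["crew"] = (0.03 * df["tonnage"] + 0.1 * df["passengers"]
              + 0.2 * df["cabins"] + rng.normal(0, 0.5, size=n_rows))

# Steps 1-2: display the columns and basic statistics.
print(df.columns.tolist())
print(df.describe())

# Step 5: one-hot encode the categorical feature.
X = pd.get_dummies(df.drop(columns="crew"), columns=["cruise_line"], dtype=float)
y = df["crew"]

# Step 6: 60% of the data for training, the remainder for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.6, random_state=0)


def pearson_r(actual, predicted):
    """Pearson correlation between actual and predicted values."""
    return float(np.corrcoef(actual, predicted)[0, 1])


# Steps 7-10: fit a ridge regressor for several values of the
# regularization parameter alpha and record train/test correlations.
alphas = [0.01, 0.1, 1.0, 10.0, 100.0]
scores = {}
for alpha in alphas:
    model = Ridge(alpha=alpha).fit(X_train, y_train)
    scores[alpha] = (pearson_r(y_train, model.predict(X_train)),
                     pearson_r(y_test, model.predict(X_test)))
    print(f"alpha={alpha:<6} train r={scores[alpha][0]:.3f} "
          f"test r={scores[alpha][1]:.3f}")
```

A large gap between the training and testing correlations would point to a variance problem, while low values on both would point to a bias problem.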
Sample 2: Data Science Take Home Challenge
In this problem, you will forecast the outcome of a portfolio of loans. Each loan is scheduled to be repaid over 3 years and is structured as follows:
- First, the borrower receives the funds. This event is called origination.
- The borrower then makes regular repayments, until one of the following happens:
(i) The borrower stops making payments, typically due to financial hardship, before the end of the 3-year term. This event is called charge-off, and the loan is then said to have charged off.
(ii) The borrower continues making repayments until 3 years after the origination date. At this point, the debt has been fully repaid.
In the attached CSV, each row corresponds to a loan, and the columns are defined as follows:
- The column with header days since origination indicates the number of days that elapsed between origination and the date when the data was collected.
- For loans that charged off before the data was collected, the column with header days from origination to charge-off indicates the number of days that elapsed between origination and charge-off. For all other loans, this column is blank.
Objective: We would like you to estimate what fraction of these loans will have charged off by the time all of their 3-year terms are finished. Please include a rigorous explanation of how you arrived at your answer, and include any code you used. You may make simplifying assumptions, but please state such assumptions explicitly. Feel free to present your answer in whatever format you prefer; in particular, PDF and Jupyter Notebook are both fine. Also, we expect that this project will not take more than 3–6 hours of your time.
Comments and Remarks: The dataset here is complex (50,000 rows and 2 columns, with many missing values), and the problem is not very straightforward. You have to examine the dataset critically and then decide what model to use. This problem was to be solved within a week. The instructions also specify that a formal project report and an R script or Jupyter notebook file be submitted.
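One possible way to approach Sample 2 is to treat it as a survival analysis problem: loans that have not charged off yet are censored observations, because we only know that they survived up to the day the data was collected. The sketch below implements a basic Kaplan-Meier estimate under that framing. It is not the suggested solution; the function name and the toy data are hypothetical, and it assumes that all loans share the same charge-off behaviour and that a loan's age is unrelated to its risk.

```python
TERM_DAYS = 3 * 365  # 3-year loan term


def chargeoff_fraction(days_since_origination, days_to_chargeoff,
                       horizon=TERM_DAYS):
    """Kaplan-Meier estimate of the fraction charged off by `horizon`.

    days_to_chargeoff[i] is None when loan i has not charged off
    (the blank cells in the CSV).
    """
    observations = []  # (duration in days, charged_off flag)
    for age, chargeoff in zip(days_since_origination, days_to_chargeoff):
        if chargeoff is not None:
            observations.append((chargeoff, True))
        else:
            # Censored: still performing (or fully repaid) when observed.
            observations.append((min(age, horizon), False))
    observations.sort(key=lambda obs: obs[0])

    at_risk = len(observations)
    survival = 1.0
    i = 0
    while i < len(observations):
        t = observations[i][0]
        if t > horizon:
            break
        events = leaving = 0
        # Group all loans whose duration equals t.
        while i < len(observations) and observations[i][0] == t:
            events += observations[i][1]
            leaving += 1
            i += 1
        if events:
            survival *= 1.0 - events / at_risk
        at_risk -= leaving
    return 1.0 - survival


# Toy portfolio of five loans (made-up numbers).
ages = [100, 500, 300, 1095, 1095]
chargeoffs = [None, 200, None, 400, None]
estimate = chargeoff_fraction(ages, chargeoffs)
print(estimate)  # 0.625
```

Note that this estimate only covers the period for which loans have actually been observed. If no loan in the dataset is close to 3 years old, an extra assumption (for example, a parametric form for the charge-off rate) is needed to extrapolate to the end of the term, and that assumption should be stated explicitly.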
Suggested Solution to Take Home Challenge Exercises
For datasets, and suggested solutions, please see the following links:
Note: The solutions presented above are recommended solutions only. Keep in mind that the solution to a data science or machine learning project is not unique. I challenge you to solve these problems yourself before reviewing the sample solutions.
The take-home coding exercise provides an excellent opportunity for you to showcase your ability to work on a data science project. You need to demonstrate exceptional abilities here. For example, if you are asked to build a multiple regression model, make sure you can demonstrate a full understanding of the following advanced concepts:
(i) Feature standardization
(ii) Hyperparameter tuning
(iv) Techniques of dimensionality reduction and feature selection such as PCA (principal component analysis) and Lasso regression
(v) Generalization error
(vi) Uncertainty quantification
(vii) Use of advanced data science tools such as scikit-learn’s pipeline tool for model building
(viii) Interpretation of your model in terms of real-life applications
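Several of the concepts listed above can be combined in a few lines of scikit-learn. The following is a minimal sketch, not a definitive recipe: it uses synthetic data from make_regression, and the choice of steps (standardization, PCA, Lasso), the parameter grid, and the 5-fold cross-validation are assumptions made for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data standing in for a real dataset.
X, y = make_regression(n_samples=200, n_features=8, n_informative=4,
                       noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Feature standardization, dimensionality reduction, and a regularized
# linear model chained into a single estimator.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA()),
    ("lasso", Lasso(max_iter=10_000)),
])

# Hyperparameter tuning with 5-fold cross-validation on the training set.
param_grid = {
    "pca__n_components": [2, 4, 8],
    "lasso__alpha": [0.01, 0.1, 1.0, 10.0],
}
search = GridSearchCV(pipe, param_grid, cv=5, scoring="r2")
search.fit(X_train, y_train)

# Spread of the cross-validated score as a rough measure of uncertainty.
cv_mean = search.cv_results_["mean_test_score"][search.best_index_]
cv_std = search.cv_results_["std_test_score"][search.best_index_]

# Held-out test score as an estimate of generalization performance.
test_r2 = search.score(X_test, y_test)

print("best parameters:", search.best_params_)
print(f"cross-validated R^2: {cv_mean:.3f} +/- {cv_std:.3f}")
print(f"test R^2: {test_r2:.3f}")
```

Because the scaler and PCA sit inside the pipeline, they are refit on each training fold, which keeps information from the validation folds out of the preprocessing.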
If there are certain aspects of the problem that you don’t understand, feel free to reach out to the data science interview team. They may provide some hints or clues.
In summary, we’ve discussed two sample take-home coding exercises from two different industries. The coding exercise varies in scope and complexity, depending on the company you are applying to. Use it as an opportunity to demonstrate a strong understanding of data science and machine learning concepts, and reach out to the data science interview team if any aspect of the problem is unclear.
For questions and inquiries, please email me: firstname.lastname@example.org