
Data Scientist Coding Exercise

Comparing take-home coding challenge problems from two industries. They vary in scope and difficulty. Suggested solutions are provided.

Benjamin Obi Tayo Ph.D.
Mar 19 · 6 min read

Are you an aspiring data scientist? Are you currently applying for data scientist positions? Do you have a data scientist interview coming up? Are you worried about the take-home coding exercise?

If you answered yes to any of these questions, you are in the right place. This article will help answer some of the questions you might have about the data scientist coding exercise.

Having gone through a couple of data scientist interview processes, I would like to share what I learned about the coding exercise with aspiring data scientists. Hopefully, my experience will help them prepare better for this important phase of the interview process.

The Take-Home Challenge Problem (Coding Exercise)

Sample 1: Coding Exercise for the Data Scientist Position (Take Home)

This coding exercise should be performed in Python (which is the programming language used by the team). You are free to use the internet and any other libraries. Please save your work in a Jupyter notebook and email it to us for review.

Data file: cruise_ship_info.csv (this file will be emailed to you)

Objective: Build a regressor that recommends the “crew” size for potential ship buyers. Please do the following steps (hint: use numpy, scipy, pandas, sklearn and matplotlib)

1. Read the file and display columns.

2. Calculate basic statistics of the data (count, mean, std, etc.), examine the data, and state your observations.

3. Select columns that will probably be important for predicting “crew” size.

4. If you removed columns, explain why you removed them.

5. Use one-hot encoding for categorical features.

6. Create training and testing sets (use 60% of the data for training and the remainder for testing).

7. Build a machine learning model to predict the “crew” size.

8. Calculate the Pearson correlation coefficient for the training and testing data sets.

9. Describe hyper-parameters in your model and how you would change them to improve the performance of the model.

10. What is regularization? What is the regularization parameter in your model?

Plot regularization parameter value vs Pearson correlation for the test and training sets, and see whether your model has a bias problem or variance problem.
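Steps 1 to 8 above might be sketched roughly as follows. This is a minimal illustration, not the recommended solution: because cruise_ship_info.csv is not reproduced here, the code builds a small synthetic table, and every column name other than “crew” (cruise_line, tonnage, cabins) is a hypothetical stand-in. A plain linear regression is assumed as the model.

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data file. With the actual file you would
# use: df = pd.read_csv("cruise_ship_info.csv")
# All column names except "crew" are hypothetical.
rng = np.random.default_rng(0)
n = 160
tonnage = rng.uniform(5, 200, n)
cabins = tonnage / 8 + rng.normal(0, 1, n)
df = pd.DataFrame({
    "cruise_line": rng.choice(["A", "B", "C"], n),
    "tonnage": tonnage,
    "cabins": cabins,
    "crew": 0.05 * tonnage + 0.4 * cabins + rng.normal(0, 0.5, n),
})

# Steps 1 and 2: display the columns and basic statistics.
print(df.columns.tolist())
print(df.describe())

# Steps 3 to 5: choose the features and one-hot encode the categorical one.
X = pd.get_dummies(
    df.drop(columns="crew"), columns=["cruise_line"], drop_first=True, dtype=float
)
y = df["crew"]

# Step 6: 60% of the rows for training, the remainder for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.6, random_state=0
)

# Step 7: fit a simple regressor.
model = LinearRegression().fit(X_train, y_train)

# Step 8: Pearson correlation between actual and predicted crew size.
r_train, _ = pearsonr(y_train, model.predict(X_train))
r_test, _ = pearsonr(y_test, model.predict(X_test))
print(f"Pearson r (train): {r_train:.3f}, Pearson r (test): {r_test:.3f}")
```

A large gap between the training and testing correlations would suggest overfitting; similar values suggest the model generalizes reasonably well on this split.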

Comments and Remarks: This is an example of a very straightforward problem. The dataset is clean and small (160 rows and 9 columns), and the instructions are very clear. So all that is needed is to follow the instructions and write your code. Notice also that the instructions clearly specify that Python be used as the programming language for model building. The time allowed for completing this coding assignment was 3 days. Only the final Jupyter notebook has to be submitted; no formal project report is required.
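For the final part of Sample 1 (regularization parameter versus Pearson correlation), one possible approach is to sweep the regularization parameter and record the correlation on both sets. The sketch below assumes ridge regression, where the regularization parameter is called alpha in scikit-learn, and uses synthetic data in place of the real file. The data, the alpha grid, and the variable names are illustrative assumptions.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (an assumption; replace with the real features).
rng = np.random.default_rng(1)
X = rng.normal(size=(160, 6))
true_coef = np.array([3.0, 1.5, 0.0, 0.0, 2.0, 0.0])
y = X @ true_coef + rng.normal(scale=1.0, size=160)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.6, random_state=1
)

# Sweep the regularization strength over several orders of magnitude.
alphas = np.logspace(-3, 4, 8)
r_train, r_test = [], []
for alpha in alphas:
    # Standardize the features so the penalty treats them on the same scale.
    model = make_pipeline(StandardScaler(), Ridge(alpha=alpha))
    model.fit(X_train, y_train)
    r_tr, _ = pearsonr(y_train, model.predict(X_train))
    r_te, _ = pearsonr(y_test, model.predict(X_test))
    r_train.append(r_tr)
    r_test.append(r_te)

for alpha, r_tr, r_te in zip(alphas, r_train, r_test):
    print(f"alpha={alpha:10.3f}  r_train={r_tr:.3f}  r_test={r_te:.3f}")
```

The two lists can then be plotted against alphas on a logarithmic axis, for example with matplotlib's plt.semilogx. If both curves are low, the model likely has a bias problem; if the training curve is high and the test curve is clearly lower, it likely has a variance problem. Note that the Pearson correlation is unchanged when predictions are rescaled, so it can be less sensitive to shrinkage than an error metric such as mean squared error.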

Sample 2: Data Science Take Home Challenge

In this problem, you will forecast the outcome of a portfolio of loans. Each loan is scheduled to be repaid over 3 years and is structured as follows:

  • First, the borrower receives the funds. This event is called origination.
  • The borrower then makes regular repayments, until one of the following happens:

(i) The borrower stops making payments, typically due to financial hardship, before the end of the 3-year term. This event is called charge-off, and the loan is then said to have charged off.

(ii) The borrower continues making repayments until 3 years after the origination date. At this point, the debt has been fully repaid.

In the attached CSV, each row corresponds to a loan, and the columns are defined as follows:

  • The column with header “days since origination” indicates the number of days that elapsed between origination and the date when the data was collected.
  • For loans that charged off before the data was collected, the column with header “days from origination to charge-off” indicates the number of days that elapsed between origination and charge-off. For all other loans, this column is blank.

Objective: We would like you to estimate what fraction of these loans will have charged off by the time all of their 3-year terms are finished. Please include a rigorous explanation of how you arrived at your answer, and include any code you used. You may make simplifying assumptions, but please state such assumptions explicitly. Feel free to present your answer in whatever format you prefer; in particular, PDF and Jupyter Notebook are both fine. Also, we expect that this project will not take more than 3–6 hours of your time.

Comments and Remarks: The dataset here is complex (50,000 rows and 2 columns, with many missing values), and the problem is not very straightforward. You have to examine the dataset critically and then decide which model to use. This problem was to be solved within a week. The instructions also specify that a formal project report and an R script or Jupyter notebook file be submitted.
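One standard way to treat this kind of data is as a survival analysis problem: loans that have not charged off yet are censored observations, not loans that will never charge off. The sketch below uses a Kaplan-Meier estimator written with NumPy. It is an illustration of that technique under stated assumptions and not necessarily the recommended solution: the data is synthetic, charge-off behavior is assumed to be independent of the origination date, and the function and variable names are hypothetical.

```python
import numpy as np

def kaplan_meier(durations, observed):
    """Return event times and the Kaplan-Meier survival estimate at each one.

    durations: days to charge-off if observed, else days since origination.
    observed:  True where a charge-off was recorded (event), False if censored.
    """
    durations = np.asarray(durations, dtype=float)
    observed = np.asarray(observed, dtype=bool)
    event_times = np.unique(durations[observed])
    survival = np.empty(event_times.size)
    s = 1.0
    for i, t in enumerate(event_times):
        at_risk = np.sum(durations >= t)               # loans still followed at t
        events = np.sum(observed & (durations == t))   # charge-offs at t
        s *= 1.0 - events / at_risk
        survival[i] = s
    return event_times, survival

# Synthetic portfolio (an assumption standing in for the attached CSV).
rng = np.random.default_rng(42)
n_loans = 10_000
term_days = 3 * 365
true_fraction = 0.20  # fraction that would charge off within the full term
rate = -np.log(1.0 - true_fraction) / term_days

days_since_origination = rng.uniform(0, term_days, n_loans)
latent_charge_off_day = rng.exponential(1.0 / rate, n_loans)

# A charge-off is only visible if it happened before the data was collected;
# in the real file this corresponds to a non-blank charge-off column.
observed = latent_charge_off_day <= days_since_origination
durations = np.where(observed, latent_charge_off_day, days_since_origination)

times, survival = kaplan_meier(durations, observed)
estimated_fraction = 1.0 - survival[-1]
naive_fraction = observed.mean()

print(f"naive fraction charged off so far: {naive_fraction:.3f}")
print(f"Kaplan-Meier estimate by end of term: {estimated_fraction:.3f}")
```

The naive fraction only counts charge-offs seen so far, so it understates the final figure for a portfolio in which many loans are still young. If no loan in the real data is close to 3 years old, the survival curve stops short of the full term and some extrapolation assumption would have to be stated explicitly.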

Suggested Solution to Take Home Challenge Exercises

Sample 1 recommended solution

Sample 2 recommended solution

Note: The solutions presented above are recommended solutions only. Keep in mind that the solution to a data science or machine learning project is not unique. I challenge you to solve these problems yourself before reviewing the sample solutions.

Final Remarks

(i) Feature standardization

(ii) Hyperparameter tuning

(iii) Cross-validation

(iv) Techniques of dimensionality reduction such as PCA (principal component analysis) and Lasso regression

(v) Generalization error

(vi) Uncertainty quantification

(vii) Demonstrate the ability to use advanced data science techniques such as scikit-learn’s pipeline tool for model building

(viii) Be able to interpret your model in terms of real-life applications
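Several of the items above (feature standardization, hyperparameter tuning, cross-validation, Lasso regression, and scikit-learn's pipeline tool) can be combined in a few lines. The following is a minimal sketch on synthetic data; the data, the alpha grid, and the step names "scale" and "model" are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with features on very different scales (an assumption).
rng = np.random.default_rng(7)
X = rng.normal(size=(200, 5)) * np.array([1.0, 10.0, 100.0, 1.0, 1.0])
y = 2.0 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(scale=0.5, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=7
)

# (i) and (vii): standardization and the model live in one pipeline, so the
# scaler is fit only on the training folds during cross-validation.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", Lasso(max_iter=10_000)),
])

# (ii) and (iii): tune the regularization strength with 5-fold cross-validation.
alphas = np.logspace(-3, 1, 9)
search = GridSearchCV(pipe, {"model__alpha": alphas}, cv=5, scoring="r2")
search.fit(X_train, y_train)

# (v) and (vi): cross-validated score with its spread, plus a held-out score.
best = search.best_index_
cv_mean = search.cv_results_["mean_test_score"][best]
cv_std = search.cv_results_["std_test_score"][best]
test_r2 = search.score(X_test, y_test)

print("best alpha:", search.best_params_["model__alpha"])
print(f"cross-validated R^2: {cv_mean:.3f} +/- {cv_std:.3f}")
print(f"held-out R^2: {test_r2:.3f}")
# (iv): Lasso shrinks coefficients of unhelpful features toward zero.
print("coefficients:", search.best_estimator_.named_steps["model"].coef_)
```

Reporting the cross-validated mean together with its standard deviation gives a rough sense of how much the score depends on the particular split, which is often more convincing than a single number.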

If there are certain aspects of the problem that you don’t understand, feel free to reach out to the data science interview team. They may provide some hints or clues.


Additional Data Science/Machine Learning Resources

Essential Maths Skills for Machine Learning

3 Best Data Science MOOC Specializations

5 Best Degrees for Getting into Data Science

5 reasons why you should begin your data science journey in 2020

Theoretical Foundations of Data Science — Should I Care or Simply Focus on Hands-on Skills?

Machine Learning Project Planning

How to Organize Your Data Science Project

Productivity Tools for Large-scale Data Science Projects

A Data Science Portfolio is More Valuable than a Resume

Feature Selection and Dimensionality Reduction Using Covariance Matrix Plot

Data Science 101 — A Short Course on Medium Platform with R and Python Code Included

For questions and inquiries, please email me:

Written by Benjamin Obi Tayo Ph.D.

Physicist, Data Science Educator, Writer. Interests: Data Science, Machine Learning, AI, Python & R, Predictive Analytics, Materials Sciences, Biophysics

Towards AI

Towards AI is the world’s fastest-growing AI community for learning, programming, building and implementing AI.
