Exceptional Resources for Data Science Interview Preparation. Part 2: Classic Machine Learning

Artem Ryblov
10 min read · Mar 29, 2024

Hi! My name is Artem. I work as a Data Scientist at MegaFon, on OneFactor, a platform for secure data monetization. We build credit scoring, lead generation, and anti-fraud models on telecom data, and we also do geoanalytics.

In the previous article, I shared materials for preparing for one of the stages that many find the most daunting: Live Coding.

Let’s recall the sections that make up the interview process for a Data Scientist position:

(Image: a typical interview process through the eyes of DALL-E 3)

In this article, we will look at materials that can be used to prepare for the section on classic machine learning.

Remarks

Most of the resources in this article are free, but there are a few paid ones. I recommend buying those only if you are sure you cannot, or do not want to, spend your own time searching out the information yourself.

I have highlighted my favorite materials ⭐.

Table of contents

  1. Classic Machine Learning
  2. Resources
    - Books
    - Courses
    - Sites
    - Cheatsheets
    - Other
  3. Let’s sum it up
  4. What’s next?

A section on classic machine learning appears in every Data Science interview in one form or another, because it tests the basic ML knowledge without which there is no point in running a section on specialized ML/DL (RecSys, NLP, Time Series, CV, RL, ASR/TTS, etc.).

In this section, expect questions on the following topics:

  • Machine Learning Types
    Supervised learning, unsupervised learning, semi-supervised learning, reinforcement learning, etc.
  • Models
    Linear models, tree-based models, distance-based (metric) methods, ensembles, boosting, neural networks.
  • Data
    Types of data, types of variables, data representation, data quality, etc.
  • Training and Quality Assessment of Models
    Quality criteria: business metrics, online metrics, offline metrics, loss functions. Validation: leave-p-out cross-validation, k-fold cross-validation, the holdout method, etc. Overfitting vs. underfitting, the bias-variance trade-off, hyperparameter search, parameters vs. hyperparameters, feature selection, regularization, class imbalance, etc. (see the sketch right after this list).
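
To make the validation topics concrete, here is a minimal sketch (my own illustration, not taken from any of the resources below) of k-fold cross-validation of a regularized model with scikit-learn. The dataset and parameter values are arbitrary:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# L2-regularized logistic regression; smaller C means a stronger penalty
model = make_pipeline(StandardScaler(), LogisticRegression(C=1.0, max_iter=1000))

# 5-fold cross-validation: each fold serves as the validation set exactly once
scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
print(f"ROC AUC per fold: {scores.round(3)}, mean: {scores.mean():.3f}")
```

The mean score estimates generalization quality, and the spread across folds hints at the variance side of the bias-variance trade-off.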

Resources

Now that we have figured out which topics need to be studied, it’s time to move on to the resources that will help us study them.

Books

An Introduction to Statistical Learning by Gareth James, Daniela Witten, Trevor Hastie, Rob Tibshirani + The Elements of Statistical Learning by Trevor Hastie, Robert Tibshirani, Jerome Friedman

The first book (An Introduction to Statistical Learning) is suitable for beginners, while the second (The Elements of Statistical Learning) is better suited to advanced readers.

Machine Learning Simplified: A gentle introduction to supervised learning by Andrew Wolf

The underlying goal of “Machine Learning Simplified” is to develop a strong intuition for ML in the reader. It uses simple, intuitive examples to explain complex concepts, algorithms, and methods, and it demystifies the mathematics behind machine learning.

After reading this book, you will understand everything within the scope of supervised ML, and you will be able not only to follow the nitty-gritty mathematical details behind the scenes, but also to explain to anyone how things work at a high level.

The Kaggle Book

The book contains a detailed treatment of competition analysis, code examples, end-to-end pipelines, and all the ideas, suggestions, best practices, tips, and recommendations that Luca Massaron and Konrad Banachewicz have collected over their combined 22+ years of competing on Kaggle.

Interpreting Machine Learning Models With SHAP: A Guide With Python Examples And Theory On Shapley Values

This book takes readers on a comprehensive journey from foundational concepts to practical applications of SHAP. It offers clear explanations, step-by-step instructions, and real-world case studies, giving both beginners and experienced practitioners the knowledge and tools to leverage Shapley values effectively for model interpretability and explainability.
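
To give a feel for the subject, here is a hedged sketch of typical SHAP usage (my illustration, not an excerpt from the book): explaining a tree ensemble with Shapley values on a toy dataset.

```python
import shap
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor

X, y = load_diabetes(return_X_y=True, as_frame=True)
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# TreeExplainer computes Shapley values efficiently for tree-based models
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

# Summary plot: global feature importance built from per-prediction attributions
shap.summary_plot(shap_values, X)
```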

Interpretable Machine Learning. A Guide for Making Black Box Models Explainable by Christoph Molnar

This book is about how to make machine learning models interpretable.

After exploring the concepts of interpretability, you will learn about simple, interpretable models such as decision trees, decision rules, and linear regression. The focus of the book is on model-agnostic methods for interpreting black-box models, such as feature importance and accumulated local effects, and on explaining individual predictions with Shapley values and LIME. In addition, the book presents methods specific to deep neural networks.

All interpretation methods are explained in depth and discussed critically. How do they work under the hood? What are their strengths and weaknesses? How can their outputs be interpreted? This book will enable you to select and correctly apply the interpretation method that is most suitable for your machine learning project. Reading the book is recommended for machine learning practitioners, data scientists, statisticians, and anyone else interested in making machine learning models interpretable.

Feature Engineering and Selection: A Practical Approach for Predictive Models by Max Kuhn and Kjell Johnson

The process of developing predictive models includes many steps. Most materials (books, blog posts, etc.) focus on algorithms, while other important aspects of the modeling process are ignored.
This book describes methods for engineering features and finding the best feature subset to improve model quality. Various datasets are used to illustrate the methods, along with R code to reproduce the results.
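
The book’s code is in R, but the same ideas carry over directly; here is a minimal Python sketch of mine showing cross-validated recursive feature elimination with scikit-learn:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# RFECV drops the weakest feature at each step and cross-validates every subset
selector = RFECV(LogisticRegression(max_iter=5000), step=1, cv=5)
selector.fit(X, y)
print("optimal number of features:", selector.n_features_)
```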

Clean Machine Learning Code by Moussa Taifi (paid)

The book contains recipes for writing clean code for training and inference of ML models:

  • Basics of Clean Machine Learning Code
  • Name Optimization
  • Function Optimization
  • Style
  • Clean Machine Learning Classes
  • Software Architecture in Machine Learning
  • Machine Learning Through Testing

Courses

Open Machine Learning Course by Yury Kashnitsky

mlcourse.ai is an open machine learning course by OpenDataScience (ods.ai), led by Yury Kashnitsky (yorko). Holding both a Ph.D. in applied math and a Kaggle Competitions Master tier, Yury designed the course to strike a balance between theory and practice: lectures come with mathematical formulae, and there is plenty of practice in the form of assignments and Kaggle InClass competitions. The course currently runs in self-paced mode.

Stanford CS229: Machine Learning by Andrew Ng

An iconic course on the basics of machine learning.
Topics covered in the course:

  • Supervised learning (generative/discriminative models, parametric/non-parametric models, neural networks, support vector machines);
  • Unsupervised learning (clustering, dimensionality reduction, kernel methods);
  • Learning theory (bias/variance trade-off, practical advice);
  • Reinforcement learning and adaptive control.

Kaggle Learn

Kaggle provides free access to its short but interesting courses, ranging from introductory ML to feature engineering and model explainability.

Google Machine Learning Courses

Foundational courses (the basic concepts of machine learning):

  • Introduction to Machine Learning
  • Machine Learning Crash Course
  • Problem Framing
  • Data Preparation and Feature Engineering
  • Testing and Debugging

Advanced courses (tools and methods for solving various problems):

  • Decision Forests
  • Recommendation Systems
  • Clustering
  • Generative Adversarial Networks
  • Image Classification
  • Fairness in the Perspective API

Guides (step-by-step instructions for solving problems using best practices):

  • Rules of ML
  • People + AI Guidebook
  • Text classification
  • Good data analysis
  • Deep Learning Tuning Playbook

Sites

Machine Learning for Everyone. In simple words. With real-world examples. Yes, again

A great introduction for those who want to finally understand machine learning: plain language, no formulas or theorems, and examples of real problems and their solutions.
Suitable for those who are just starting to get to grips with machine learning.

Kaggle Competitions

There are several types of competitions on Kaggle:

  • Getting Started
    Getting Started competitions are the easiest, most approachable competitions on Kaggle. These are semi-permanent competitions that are meant to be used by new users just getting their foot in the door in the field of machine learning. They offer no prizes or points. Because of their long-running nature, Getting Started competitions are perhaps the most heavily tutorialized problems in machine learning — just what a newcomer needs to get started!
    - Digit Recognizer
    - Titanic: Machine Learning from Disaster — Predict survival on the Titanic
    - Housing Prices: Advanced Regression Techniques
  • Playground Competitions
    Playground competitions are a “for fun” type of Kaggle competition that is one step above Getting Started in difficulty. These are competitions which often provide relatively simple machine learning tasks, and are similarly targeted at newcomers or Kagglers interested in practicing a new type of problem in a lower-stakes setting. Prizes range from kudos to small cash prizes. Some examples of Playground competitions are:
    - Dogs versus Cats — Create an algorithm to distinguish dogs from cats
    - Leaf Classification — Can you see the random forest for the leaves?
    - New York City Taxi Trip Duration — Share code and data to improve ride time predictions
  • Research
    Research competitions are another common type of competition on Kaggle. Research competitions feature problems which are more experimental than featured competition problems. For example, some past research competitions have included:
    - Google Landmark Retrieval Challenge — Given an image, can you find all the same landmarks in a dataset?
    - Right Whale Recognition — Identify endangered right whales in aerial photographs
    - Large Scale Hierarchical Text Classification — Classify Wikipedia documents into one of ~300,000 categories
  • Featured
    Featured competitions are the types of competitions that Kaggle is probably best known for. These are full-scale machine learning challenges which pose difficult, generally commercially-purposed prediction problems. For example, past featured competitions have included:
    - Allstate Claim Prediction Challenge — Use customers’ shopping history to predict which insurance policy they purchase
    - Jigsaw Toxic Comment Classification Challenge — Predict the existence and type of toxic comments on Wikipedia
    - Zillow Prize — Build a machine learning algorithm that can challenge Zestimates, the Zillow real estate price estimation algorithm

Machine Learning Mastery

Jason Brownlee’s site contains:

  • Guides (step-by-step guides divided into several levels):
    - Foundations
    How to start learning machine learning, probability, statistical methods, linear algebra, optimization, calculus.
    - Beginner
    Python, understanding ML algorithms, introduction to sklearn, time series prediction, data preparation.
    - Intermediate
    Boosting, class imbalance, deep learning, ensembles.
    - Advanced
    LSTM, NLP, CV, GANs, attention and transformers.
  • Tutorials (blog posts on the site on various topics)
  • E-books (paid; extended materials from the site, combined into books by topic)

I admit I haven’t studied these resources in full, but in my experience, when search results offer a choice between this site and Medium, Analytics Vidhya, and the like, it is better to come here.

The Illustrated Machine Learning

(Illustration from the site: metrics for distance-based algorithms)

The idea of the site is to make the complex world of machine learning more accessible through clear, concise illustrations. The goal is to provide students, professionals, and anyone preparing for a technical interview with a visual aid to better understand the core concepts of machine learning. Whether you’re new to the field or a seasoned professional looking to brush up on your knowledge, these illustrations will be a valuable resource on your journey to understanding machine learning.

MLU-EXPLAIN

Visual explanations of basic machine learning concepts.
Machine Learning University (MLU) is Amazon’s educational initiative designed to explore the theory and practice of machine learning.
MLU-Explain exists to teach important machine learning concepts through visual essays in a fun, informative, and accessible way.

Cheatsheets

Supervised Learning

  • Loss functions, gradient descent, likelihood
  • Linear models, Support Vector Machine (SVM), generative learning approach
  • Trees and ensembles, K-Nearest Neighbors (KNN), learning theory
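
As a companion to the first bullet above (loss functions and gradient descent), here is a toy sketch of mine, for illustration only: logistic regression trained with full-batch gradient descent on the log loss.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)  # toy, linearly separable target

w, b, lr = np.zeros(2), 0.0, 0.1
for _ in range(500):
    p = 1 / (1 + np.exp(-(X @ w + b)))  # sigmoid predictions
    w -= lr * X.T @ (p - y) / len(y)    # gradient of mean log loss w.r.t. w
    b -= lr * np.mean(p - y)            # ... and w.r.t. the bias

p = np.clip(1 / (1 + np.exp(-(X @ w + b))), 1e-9, 1 - 1e-9)
loss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
print("final log loss:", round(loss, 4))
```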

Unsupervised Learning

  • EM algorithm (Expectation Maximization), k-means, hierarchical clustering
  • Metrics for assessing the quality of clustering
  • Principal Component Analysis (PCA), Independent Component Analysis (ICA)
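
Two of these topics pair naturally in practice; a quick illustrative sketch of mine (not from the cheatsheets): PCA to project the data, then k-means on the projection.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)

X_2d = PCA(n_components=2).fit_transform(X)  # project onto 2 principal components
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_2d)
print(labels[:10])
```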

Tips and Tricks

  • Confusion matrix, accuracy, precision, recall, F1-score, ROC
  • R², Mallows’s Cp, Akaike information criterion (AIC), Bayesian information criterion (BIC)
  • Cross-validation, regularization, bias/variance trade-off, error analysis
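
For reference, the classification metrics listed above are one-liners in scikit-learn; a small sketch of mine with made-up predictions:

```python
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]
y_score = [0.1, 0.6, 0.8, 0.9, 0.4, 0.2, 0.7, 0.3]  # predicted probabilities

print(confusion_matrix(y_true, y_pred))       # rows: true class, columns: predicted
print(classification_report(y_true, y_pred))  # precision, recall, F1 per class
print("ROC AUC:", roc_auc_score(y_true, y_score))
```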

Other

StatQuest with Josh Starmer

A channel where Joshua Starmer explains various ML algorithms and concepts in simple language, supported by visualizations and examples. The videos work both as a first introduction to a topic and as a refresher.
You can use the channel’s search function to find the video you need.

Model Evaluation, Model Selection, and Algorithm Selection in Machine Learning by Sebastian Raschka

The proper use of model evaluation and algorithm selection methods is vital in academic machine learning research as well as in many real-world business problems. This article reviews the different methods that can be used for each of these sub-tasks and discusses the main advantages and disadvantages of each, with references to theoretical and empirical research. In addition, recommendations are made to encourage best yet feasible practices in research and applications of machine learning.

How to avoid machine learning pitfalls: a guide for academic researchers by Michael A. Lones

This article describes common pitfalls encountered when using machine learning and what you can do to avoid them. It covers five stages of the machine learning model development process: what to do before building a model, how to reliably build models, how to robustly evaluate models, how to fairly compare models, and how to report results.

A new perspective on Shapley values: Part I: Intro to Shapley and SHAP + Part II: The Naïve Shapley method

These two blog posts will help you learn SHAP/Shapley values for model interpretation.

Let’s Sum It Up

If you haven’t read it yet, I recommend the “Learning How to Learn” and “Let’s Sum It Up” blocks from the first article, since everything said there also applies to preparing for the machine learning section.

What’s next?

In the next article we will analyze materials for preparing for the section on specialized machine/deep learning.

You can find the latest resources for this series of articles in the Data Science Resources repository, which will be maintained and updated. You can also subscribe to my Telegram channel Data Science Weekly, where I share interesting and useful materials every week.

If you know of any cool resources that I didn’t include in this list, please write about them in the comments.
