Machine Learning Cheat Sheet for Data Scientist Interviews: Linear Regression Frequently Asked Questions, Part 1

Kira Shi
4 min read · Feb 3, 2022


Here is a summary of ML questions frequently asked in data scientist interviews. I have kept the cheat sheet as concise as possible; all of the content is key knowledge worth memorizing. I hope it can serve as a preparation guide before your DS interview!

Loss function

Definition: a loss function measures the penalty between the model's predicted outcome and the actual outcome.

Common types of loss function

Linear regression: square loss (sum of squared errors)

Sum of squared errors: L = Σᵢ (yᵢ − ŷᵢ)²

Classification (e.g., logistic regression, decision trees): log loss / cross-entropy loss

Cross-entropy (log) loss: L = −Σᵢ [ yᵢ log(p̂ᵢ) + (1 − yᵢ) log(1 − p̂ᵢ) ]

Soft margin SVM: hinge loss

Hinge loss: L = Σᵢ max(0, 1 − yᵢ·f(xᵢ)), with labels yᵢ ∈ {−1, +1}
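A minimal NumPy sketch of these three losses; the toy labels and predictions below are made up purely for illustration:

```python
import numpy as np

y_true = np.array([1, 0, 1, 1])          # binary labels
y_pred = np.array([0.9, 0.2, 0.6, 0.4])  # model outputs (probabilities / scores)

# Squared loss (linear regression): sum of squared errors
squared_loss = np.sum((y_true - y_pred) ** 2)

# Log loss / cross-entropy (classification with probabilistic outputs)
eps = 1e-12                              # avoid log(0)
p = np.clip(y_pred, eps, 1 - eps)
log_loss = -np.sum(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

# Hinge loss (soft-margin SVM): labels in {-1, +1}, predictions are raw margin scores
y_pm = 2 * y_true - 1                    # map {0, 1} -> {-1, +1}
margin_scores = 2 * y_pred - 1           # pretend these are raw decision values
hinge_loss = np.sum(np.maximum(0, 1 - y_pm * margin_scores))

print(squared_loss, log_loss, hinge_loss)
```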

Gradient descent

  • Optimizes the objective function (i.e., minimizes the loss function).
  • Finds a local optimum; it is guaranteed to find the global optimum only when the objective function is convex.

Steps:

1. Start with an initial guess.

2. Reduce the objective function a little at each step by moving in the negative direction of the gradient (the vector of partial derivatives).

3. Stop when the parameters converge to a local optimum.

Learning rate: controls how big a step we take in gradient descent.

  • If the learning rate is too small: convergence is slow
  • If the learning rate is too big: we overshoot the optimum and may fail to converge
  • Choosing an adaptive learning rate: take smaller steps when approaching a local minimum
  • Update all parameters simultaneously at each step (illustrated in the sketch after the note below)

Note: Linear regression’s cost function is always convex, so it has a single global minimum.
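A minimal batch gradient descent sketch for linear regression; the toy data, learning rate, and stopping rule are illustrative assumptions, not prescriptions:

```python
import numpy as np

# Toy data: y ≈ 3x + 2 plus noise (assumed for this example)
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(100, 1))
y = 3 * X[:, 0] + 2 + rng.normal(0, 0.1, size=100)

Xb = np.hstack([np.ones((X.shape[0], 1)), X])  # add intercept column
theta = np.zeros(Xb.shape[1])                  # step 1: initial guess
lr = 0.1                                       # learning rate

for _ in range(1000):
    grad = 2 / len(y) * Xb.T @ (Xb @ theta - y)   # gradient of the mean squared error
    new_theta = theta - lr * grad                 # step 2: move against the gradient
    if np.linalg.norm(new_theta - theta) < 1e-8:  # step 3: stop once converged
        theta = new_theta
        break
    theta = new_theta                             # all parameters updated simultaneously

print(theta)  # should end up close to [2, 3]
```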

Gradient Descent, Stochastic Gradient Descent and Mini-batch Gradient Descent

Gradient descent: uses all the data points in each iteration to compute the gradient and update the parameters.

  • especially suitable for convex functions

Stochastic gradient descent: uses one randomly chosen data point per iteration

  • converges faster in practice because each update is cheap
  • reduces the probability of getting stuck in a single local optimum

Mini-batch GD: uses a small batch of data points per iteration

  • more efficient than SGD since we can use a vectorized implementation (see the sketch below)
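A minimal mini-batch gradient descent sketch for the same linear regression setting; the batch size, learning rate, and epoch count are arbitrary choices for illustration. Setting batch_size = 1 gives SGD, and batch_size = len(y) gives full-batch gradient descent:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 1))
y = 3 * X[:, 0] + 2 + rng.normal(0, 0.1, size=200)
Xb = np.hstack([np.ones((X.shape[0], 1)), X])

theta = np.zeros(Xb.shape[1])
lr, batch_size, n_epochs = 0.05, 16, 50

for _ in range(n_epochs):
    idx = rng.permutation(len(y))                 # shuffle once per epoch
    for start in range(0, len(y), batch_size):
        batch = idx[start:start + batch_size]
        Xbatch, ybatch = Xb[batch], y[batch]
        grad = 2 / len(batch) * Xbatch.T @ (Xbatch @ theta - ybatch)
        theta -= lr * grad                        # vectorized update per mini-batch

print(theta)  # roughly [2, 3]
```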

Feature scaling for gradient descent

  • Feature scaling accelerates the convergence of gradient descent (see the sketch below).
  • Geometric explanation: the gradient descent direction is perpendicular to the contour lines of the cost function. Without feature scaling, the contour lines are elongated ellipses, so the negative gradient does not necessarily point toward the optimum, causing zig-zagging and slower convergence.
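A minimal standardization (z-score scaling) sketch; the features and values below are made up for illustration:

```python
import numpy as np

# Two features on very different scales, e.g. house size (sq ft) and number of rooms
X = np.array([[2100.0, 3], [1600.0, 2], [2400.0, 4], [1400.0, 2]])

# Standardize each column: subtract the mean, divide by the standard deviation
mean = X.mean(axis=0)
std = X.std(axis=0)
X_scaled = (X - mean) / std

print(X_scaled.mean(axis=0))  # ~0 for each feature
print(X_scaled.std(axis=0))   # ~1 for each feature
```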

Normal equation

Closed-form solution for linear regression (matrix form): θ = (XᵀX)⁻¹Xᵀy

Gradient Descent VS Normal Equation

  • Gradient descent works well even when the number of features n is very large.
  • The normal equation needs to compute (XᵀX)⁻¹, whose time complexity is O(n³) in the number of features n. There is also a problem when the matrix is non-invertible: the system does not have a unique solution, so use gradient descent (or a pseudo-inverse) instead (see the sketch below).
  • Causes of non-invertibility: collinearity among features, or the number of features being larger than the number of data points (use regularization or remove redundant features).
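A minimal NumPy sketch of the normal equation on the same kind of toy data as above; np.linalg.pinv covers the non-invertible case via the pseudo-inverse:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(100, 1))
y = 3 * X[:, 0] + 2 + rng.normal(0, 0.1, size=100)
Xb = np.hstack([np.ones((X.shape[0], 1)), X])   # add intercept column

# Closed-form solution: theta = (X^T X)^{-1} X^T y
theta = np.linalg.inv(Xb.T @ Xb) @ Xb.T @ y

# Safer alternative when X^T X may be singular (collinearity, more features than points)
theta_pinv = np.linalg.pinv(Xb) @ y

print(theta, theta_pinv)  # both close to [2, 3]
```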

Logistic Regression

A generalized linear model whose target variable is binary.

f(x) = P(y = 1 | x) = 1 / (1 + e^(−θᵀx)) is the estimated probability of y = 1 given x.

Sigmoid function

The sigmoid function σ(z) = 1 / (1 + e⁻ᶻ) crosses 0.5 at z = 0 and then flattens out, staying between 0 and 1.
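A tiny sketch of the sigmoid, just to make the shape concrete:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

print(sigmoid(np.array([-10, -1, 0, 1, 10])))
# -> roughly [0.00005, 0.27, 0.5, 0.73, 0.99995]: 0.5 at z = 0, flattening toward 0 and 1
```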

Coefficient interpretation

The expected change in the log odds of the outcome when a given variable increases by one unit, holding all other variables constant.
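A short sketch of reading a fitted coefficient as a log-odds change and an odds ratio with scikit-learn; the synthetic data and the assumed true coefficient are purely illustrative:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
x = rng.normal(size=(500, 1))
# Assumed true log-odds for this toy example: 0.8 * x - 0.2
p = 1 / (1 + np.exp(-(0.8 * x[:, 0] - 0.2)))
y = rng.binomial(1, p)

model = LogisticRegression().fit(x, y)
beta = model.coef_[0][0]
print(beta)           # change in log odds per one-unit increase in x
print(np.exp(beta))   # multiplicative change in the odds (odds ratio)
```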

Loss function

If we use the traditional square loss with the sigmoid hypothesis, the cost function becomes non-convex, so gradient descent may not find the global optimum.

Loss function of binary logistic regression (a convex function): J(θ) = −(1/m) Σᵢ [ yᵢ log f(xᵢ) + (1 − yᵢ) log(1 − f(xᵢ)) ]

  • It can be derived using Maximum Likelihood Estimation (MLE).
  • There is no closed-form solution, so we use gradient descent to find the optimum (see the sketch below).
  • Advanced optimization alternatives: the BFGS algorithm (and L-BFGS), often faster than traditional gradient descent.
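A minimal gradient descent sketch for binary logistic regression; the data, true parameters, learning rate, and iteration count are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
true_theta = np.array([0.5, 1.5, -2.0])                 # assumed intercept + two weights
Xb = np.hstack([np.ones((X.shape[0], 1)), X])
y = rng.binomial(1, 1 / (1 + np.exp(-Xb @ true_theta)))  # simulate binary labels

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

theta = np.zeros(Xb.shape[1])
lr = 0.1
for _ in range(2000):
    p = sigmoid(Xb @ theta)
    grad = Xb.T @ (p - y) / len(y)   # gradient of the average log loss
    theta -= lr * grad

print(theta)  # roughly recovers [0.5, 1.5, -2.0]
```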

Multi-class classification problem

Use One-vs-All (one-vs-rest) classification: build one binary classifier per class and predict the class whose classifier outputs the highest probability.
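A short one-vs-rest sketch using scikit-learn; the Iris dataset here is just a convenient stand-in for any multi-class problem:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X, y = load_iris(return_X_y=True)             # 3 classes
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
print(len(clf.estimators_))                   # 3 binary models, one per class
print(clf.predict(X[:5]))                     # predicted class = highest-scoring binary model
```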
