Machine Learning Cheat Sheet for Data Scientist Interview : Linear Regression Frequently Asked Questions Part 1
Here is a summary of ML questions frequently asked during data scientist interviews. I have kept the cheat sheet as concise as possible; all of the content is key knowledge that should be committed to memory. I hope it serves as a preparation guide before your DS interview!
Loss function
Definition: a loss function measures the penalty between a model's predicted outcome and the actual outcome.
Common types of loss function
Linear regression: square loss (sum of squared errors)
Classification (e.g. logistic regression, decision tree): log loss / cross-entropy loss
Soft margin SVM: hinge loss
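The three losses above can be sketched in a few lines of NumPy (a minimal illustration; the function names are my own):

```python
import numpy as np

def square_loss(y_true, y_pred):
    """Sum of squared errors, as used in linear regression."""
    return np.sum((y_true - y_pred) ** 2)

def log_loss(y_true, p_pred):
    """Cross-entropy loss for binary classification (y in {0, 1})."""
    return -np.sum(y_true * np.log(p_pred) + (1 - y_true) * np.log(1 - p_pred))

def hinge_loss(y_true, scores):
    """Hinge loss for soft-margin SVM (y in {-1, +1})."""
    return np.sum(np.maximum(0.0, 1.0 - y_true * scores))
```

Note that each loss matches its model's output: square loss compares real-valued predictions, log loss compares predicted probabilities, and hinge loss compares raw margin scores.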
Gradient descent
- Optimizes the objective function (i.e., minimizes the loss function).
- Finds a local optimum; it is only guaranteed to find the global optimum when the objective function is convex.
Steps:
1. Start with an initial guess.
2. Reduce the objective function a little at each step by moving in the negative direction of the gradient (the partial derivatives).
3. Stop when it converges to a local optimum.
Learning rate: controls how big a step we take in gradient descent.
- If learning rate is too small: converge slowly
- If learning rate is too big: overshoot the optimum and fail to converge
- Choosing adaptive learning rate: take smaller steps when approaching a local minimum
- Simultaneously update all the parameters
Note: Linear regression’s cost function is always convex, so it has a single global minimum.
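The steps above can be sketched for linear regression’s squared loss (a minimal NumPy illustration; the learning rate, iteration count, and toy data are arbitrary choices of mine):

```python
import numpy as np

def gradient_descent(X, y, lr=0.1, n_iters=1000):
    """Minimize the squared-loss cost (1/2n)*||X @ theta - y||^2.
    All parameters are updated simultaneously at each step."""
    n, d = X.shape
    theta = np.zeros(d)                      # step 1: initial guess
    for _ in range(n_iters):
        grad = X.T @ (X @ theta - y) / n     # gradient of the cost
        theta -= lr * grad                   # step 2: move against the gradient
    return theta                             # step 3: stop after convergence

# Fit y = 1 + 2x on a tiny dataset (intercept handled via a column of ones).
X = np.c_[np.ones(5), np.arange(5.0)]
y = 2 * np.arange(5.0) + 1
theta = gradient_descent(X, y)
```

Because this cost is convex, the recovered parameters match the true intercept and slope.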
Gradient Descent, Stochastic Gradient Descent and Mini-batch Gradient Descent
Gradient descent: uses all the data points in each iteration to compute the gradient and update the parameters.
- especially suitable for convex function
Stochastic gradient descent: use one random data point in one iteration
- converges faster per update
- its noise reduces the probability of getting stuck in a local optimum
Mini-batch GD: use a mini-batch size of data points in one iteration
- more efficient than SGD, since the batch gradient can be computed with a vectorized implementation
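The mini-batch loop can be sketched as follows (illustrative only; the batch size, learning rate, and toy data are my own choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def minibatch_gd(X, y, batch_size=2, lr=0.3, n_epochs=200):
    """Mini-batch gradient descent for linear regression's squared loss."""
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(n_epochs):
        idx = rng.permutation(n)                 # reshuffle every epoch
        for start in range(0, n, batch_size):
            b = idx[start:start + batch_size]
            # vectorized gradient over the batch: cheaper than full GD,
            # less noisy than one-point SGD
            grad = X[b].T @ (X[b] @ theta - y[b]) / len(b)
            theta -= lr * grad
    return theta

x = np.array([-1.5, -0.5, 0.5, 1.5])             # already centered and scaled
X = np.c_[np.ones(4), x]
y = 2 * x + 1
theta = minibatch_gd(X, y)
```

With batch_size=1 this reduces to SGD, and with batch_size=n it reduces to full-batch gradient descent.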
Feature scaling for gradient descent
- Feature scaling accelerates the convergence of gradient descent.
- Geometric explanation: the direction of gradient descent is perpendicular to the contour lines. Without feature scaling, the contour lines form elongated ovals, so the negative gradient does not necessarily point toward the optimum, leading to slower convergence.
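A common scaling choice is standardization (z-score scaling); a minimal sketch, with toy data of my own:

```python
import numpy as np

def standardize(X):
    """Z-score scaling: zero mean and unit variance per feature.
    Makes the cost contours closer to circular, so gradient steps
    point more directly at the minimum."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - mu) / sigma, mu, sigma

# Features on very different scales (e.g. rooms vs. square feet).
X = np.array([[1.0, 2000.0],
              [2.0, 3000.0],
              [3.0, 4000.0]])
X_scaled, mu, sigma = standardize(X)
```

Remember to store `mu` and `sigma` from the training set and reuse them to scale any new data before prediction.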
Normal equation
Closed-form solution for linear regression (matrix form): θ = (XᵀX)⁻¹Xᵀy
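In code, the normal equation is a one-liner (a sketch with toy data of my own; `np.linalg.solve` is preferred over explicitly inverting XᵀX for numerical stability):

```python
import numpy as np

def normal_equation(X, y):
    """Solve (X^T X) theta = X^T y, i.e. theta = (X^T X)^(-1) X^T y."""
    return np.linalg.solve(X.T @ X, X.T @ y)

X = np.c_[np.ones(5), np.arange(5.0)]
y = 2 * np.arange(5.0) + 1            # exactly y = 1 + 2x
theta = normal_equation(X, y)
```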
Gradient Descent VS Normal Equation
- Gradient descent works well when sample size n is very large.
- Normal equation needs to compute (XᵀX)⁻¹; inverting this matrix is cubic in the number of features, so it is slow when there are many features. There is also a problem when XᵀX is non-invertible: the normal equation then has no unique solution, so use gradient descent instead.
- Causes of non-invertibility: collinearity between features, or the number of features being larger than the number of data points (address this with regularization).
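A quick illustration of how perfect collinearity makes XᵀX singular (toy data of my own):

```python
import numpy as np

# The third column is exactly 2x the second (perfect collinearity),
# so X^T X is rank-deficient and the normal equation has no unique solution.
X = np.array([[1.0, 2.0,  4.0],
              [1.0, 3.0,  6.0],
              [1.0, 5.0, 10.0]])
gram = X.T @ X
rank = np.linalg.matrix_rank(gram)   # 2, not 3
det = np.linalg.det(gram)            # (numerically) zero
```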
Logistic Regression
A generalized linear model whose target variable is binary.
Sigmoid function: σ(z) = 1 / (1 + e⁻ᶻ), which maps any real input to (0, 1).
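In code, the sigmoid is a one-liner:

```python
import numpy as np

def sigmoid(z):
    """sigma(z) = 1 / (1 + exp(-z)); maps any real input to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))
```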
Coefficient interpretation
The expected change in the log odds of the outcome when a given variable increases by one unit, while all other variables remain the same.
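For example, with a hypothetical fitted coefficient β = 0.7, a one-unit increase in that variable multiplies the odds by e^β:

```python
import numpy as np

beta = 0.7                   # hypothetical fitted coefficient
odds_ratio = np.exp(beta)    # odds multiplier per one-unit increase
# e^0.7 is roughly 2, i.e. the odds approximately double
```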
Loss function
If we use the traditional square loss, the cost function becomes non-convex because of the sigmoid function, so gradient descent may not find the global optimum.
Loss function of binary logistic regression (a convex function): J(θ) = −(1/n) Σ [ y log hθ(x) + (1 − y) log(1 − hθ(x)) ], where hθ(x) = σ(θᵀx)
- It can be derived using Maximum Likelihood Estimation(MLE).
- There is no closed form solution. We need to use gradient descent to find the optimum.
- Advanced optimization: quasi-Newton methods such as BFGS (or L-BFGS), often faster than plain gradient descent.
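A minimal gradient-descent fit of the cross-entropy loss (toy data and hyperparameters are my own; in practice a library optimizer would be used):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_gd(X, y, lr=0.5, n_iters=2000):
    """Fit logistic regression by gradient descent on the cross-entropy
    loss; unlike linear regression, there is no closed-form solution."""
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(n_iters):
        p = sigmoid(X @ theta)
        theta -= lr * X.T @ (p - y) / n   # gradient of the cross-entropy loss
    return theta

# Tiny dataset: label 1 exactly when x > 0.
x = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])
X = np.c_[np.ones(6), x]
y = (x > 0).astype(float)
theta = logistic_gd(X, y)
preds = (sigmoid(X @ theta) > 0.5).astype(float)
```

Note that the gradient has the same form as linear regression's, X.T @ (prediction − y) / n, with the sigmoid applied to the linear score.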
Multi-class classification problem
Use one-vs.-all classification: build one binary classifier per class, then predict the class whose classifier outputs the highest probability.
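One-vs.-all can be sketched on top of a binary logistic fit (illustrative only; the dataset and hyperparameters are made up):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_binary(X, y, lr=0.5, n_iters=4000):
    """One binary logistic model, trained by gradient descent."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        theta -= lr * X.T @ (sigmoid(X @ theta) - y) / len(y)
    return theta

def one_vs_all(X, y, n_classes):
    """Train one classifier per class: class k vs. everything else."""
    return np.array([fit_binary(X, (y == k).astype(float))
                     for k in range(n_classes)])

def predict(thetas, X):
    # pick the class whose classifier is most confident
    return np.argmax(sigmoid(X @ thetas.T), axis=1)

# Three classes laid out along one feature.
x = np.array([-2.0, -1.8, 0.0, 0.2, 2.0, 2.2])
X = np.c_[np.ones(6), x]
y = np.array([0, 0, 1, 1, 2, 2])
thetas = one_vs_all(X, y, 3)
```

For K classes this trains K separate binary models; an alternative (not shown) is softmax regression, which fits all classes jointly.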