Preparing for a Machine Learning interview? Here is a complete guide to interview questions on Linear Regression.

Writuparna Banerjee
Published in Analytics Vidhya
12 min read · Aug 8, 2020

Hey! Are you ready for the interview? Still not confident? Don’t worry, go through these commonly asked interview questions on Linear Regression. It is common practice to test data science aspirants on the most frequently used machine learning algorithms: linear regression, logistic regression, decision trees, random forest, etc. Data scientists are expected to possess in-depth knowledge of these algorithms. Linear regression is the basis of many other ML algorithms, so a mistake in answering these questions during an interview might be the end of the interview.

Keeping young data science aspirants like you in mind, I have covered the important concepts asked in interviews to strengthen your knowledge of at least one of these conventional algorithms.

So, let’s get started with Linear Regression!

Interviewer: What is linear regression?

Your answer: Linear regression is a method of finding the straight line that best fits the given data, i.e. finding the best linear relationship between the independent and dependent variables.
In technical terms, linear regression is a machine learning algorithm that finds the best linear-fit relationship between the independent and dependent variables for the given data. This is most commonly done by minimizing the sum of squared residuals (the least squares method).
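As a quick illustration, here is a minimal sketch of fitting such a line with scikit-learn. The data here are made up purely for demonstration:

```python
# A minimal sketch: fit a straight line y = m*x + b to toy data.
import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative data: y roughly follows 3*x + 5 plus some noise
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(100, 1))          # independent variable
y = 3 * X.ravel() + 5 + rng.normal(0, 2, 100)  # dependent variable

model = LinearRegression()   # ordinary least squares under the hood
model.fit(X, y)

print("slope (m):", model.coef_[0])
print("intercept (b):", model.intercept_)
```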

Interviewer: What are the assumptions made in linear regression model?

Your answer: The important assumptions in linear regression analysis are:

  1. There should be a linear and additive relationship between dependent (response) variable and independent (predictor) variable(s). A linear relationship suggests that a change in response Y due to one unit change in X is constant, regardless of the value of X. An additive relationship suggests that the effect of X on Y is independent of other variables.
  2. There should be no correlation between the residual (error) terms.
  3. The independent variables should not be correlated.
  4. The error terms must have constant variance. This phenomenon is known as homoskedasticity.
  5. The error terms must be normally distributed.

Interviewer: What if these assumptions get violated ?

Your answer: To understand the consequences of violating these assumptions, let us look at each of them in turn.

Linear and Additive: If we fit a linear model to a non-linear and non-additive data set, the regression algorithm would fail to capture the trend mathematically, thus resulting in an inefficient model. Also, this will result in erroneous predictions on an unseen data set.

Autocorrelation: Autocorrelation occurs when the residuals are not independent of each other, in other words, when the value of y(x+1) is not independent of the value of y(x). The presence of correlation in the error terms drastically reduces the model’s accuracy. This usually occurs in time series models, where the next instant depends on the previous instant. If the error terms are correlated, the estimated standard errors tend to underestimate the true standard error. If this happens, the confidence intervals and prediction intervals become narrower than they should be.

Confidence interval is a range of values so defined that there is a specified probability that the value of a parameter lies within it.

Prediction interval is the range that likely contains the value of the dependent variable for a single new observation when specific values of the independent variables are given. Narrower prediction interval means that the predicted value of a future observation with the same settings would lie in a narrower range.

Multicollinearity: This phenomenon exists when the independent variables are found to be moderately or highly correlated. In a model with correlated variables, it becomes a tough task to figure out the true relationship of predictors with response variable. In other words, it becomes difficult to find out which variable is actually contributing to predict the response variable.

Moreover, in the presence of correlated predictors, the standard errors tend to increase. And with large standard errors, the confidence intervals become wider, leading to less precise estimates of the slope parameters.
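A common way to quantify multicollinearity is the variance inflation factor (VIF). Here is a hedged sketch using statsmodels; the predictor data are made up so that two columns are nearly identical:

```python
# Sketch: flag correlated predictors with the Variance Inflation Factor (VIF).
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Illustrative predictors: x2 is almost a copy of x1, x3 is independent noise.
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
X = pd.DataFrame({
    "x1": x1,
    "x2": x1 + rng.normal(scale=0.1, size=500),
    "x3": rng.normal(size=500),
})

X_const = sm.add_constant(X)  # VIF is usually computed with an intercept included
for i, col in enumerate(X_const.columns):
    if col != "const":
        print(col, variance_inflation_factor(X_const.values, i))

# Rule of thumb: a VIF above roughly 5-10 signals problematic multicollinearity
# (x1 and x2 show very high VIFs here, x3 stays close to 1).
```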

Heteroskedasticity: The presence of non-constant variance in the error terms is called heteroskedasticity. Generally, non-constant variance arises in the presence of outliers. These values get too much weight and thereby disproportionately influence the model’s performance. When this phenomenon occurs, the confidence intervals tend to be unrealistically wide or narrow.

Normal Distribution of error terms: If the error terms are non-normally distributed, confidence intervals may become too wide or too narrow. The presence of a non-normal distribution suggests that there are a few unusual data points which must be studied closely to build a better model.
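These violations can be screened with standard residual diagnostics. Below is a hedged sketch using statsmodels and SciPy; the data set is generated only for illustration:

```python
# Sketch of residual diagnostics on a fitted statsmodels OLS result.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson
from statsmodels.stats.diagnostic import het_breuschpagan
from scipy import stats

# Illustrative fit; in practice X and y come from your own data set.
rng = np.random.default_rng(0)
X = sm.add_constant(rng.normal(size=(200, 2)))
y = X @ np.array([1.0, 2.0, -1.5]) + rng.normal(size=200)
results = sm.OLS(y, X).fit()

resid = results.resid
print("Durbin-Watson (close to 2 means no autocorrelation):", durbin_watson(resid))
print("Breusch-Pagan p-value (tests heteroskedasticity):",
      het_breuschpagan(resid, results.model.exog)[1])
print("Shapiro-Wilk p-value (tests normality of residuals):",
      stats.shapiro(resid)[1])
```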

Interviewer: How to find the best fit line in a linear regression model?

Your answer: To find the best-fit line for our model, we have to minimize the total distance of the line from all the points, i.e. we have to find the line that is closest to all the points. In statistics, this vertical distance is called the residual.

The residual is equal to the difference between the observed value and the predicted value. For data points above the line, the residual is positive, and for data points below the line, the residual is negative. So, if we were to simply sum all the residuals, the negative errors would cancel out part of the positive errors and the resulting value would be smaller than the actual total distance. To eliminate the negative signs, we square each residual and then take the sum.

This quantity is known as the Sum of Squared Residuals (also called the Sum of Squared Errors, SSE), and the method is known as the Least Squares Method, because we need to find the values of m and b of the linear regression line for which the SSE is minimum.
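For a single feature, the least-squares m and b have a well-known closed form. A small sketch on made-up numbers:

```python
# Sketch: compute the least-squares slope m and intercept b, then the SSE.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Closed-form least-squares estimates
m = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b = y.mean() - m * x.mean()

residuals = y - (m * x + b)
sse = np.sum(residuals ** 2)   # Sum of Squared Residuals (SSE)

print(f"m = {m:.3f}, b = {b:.3f}, SSE = {sse:.3f}")
```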

Interviewer: Why do we square the error instead of using modulus?

Your answer: It’s true that one could choose to use the absolute error instead of the squared error. In fact, the absolute error is often closer to what we want when making predictions from our model. However, squaring penalizes the predictions that contribute the largest errors more heavily. Moreover, looking a little deeper, the squared error is everywhere differentiable, while the absolute error is not (its derivative is undefined at 0). This makes the squared error more amenable to the techniques of mathematical optimization: to optimize the squared error, we can just set its derivative equal to 0 and solve, whereas optimizing the absolute error often requires more complex techniques. In practice, we usually report the Root Mean Squared Error (RMSE) so that the error has the same unit as the dependent variable.
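A toy comparison (purely illustrative numbers) shows how squaring inflates the penalty on the single largest residual:

```python
# Sketch: squared error penalizes one large residual far more than absolute error.
import numpy as np

residuals = np.array([1.0, -1.0, 0.5, 10.0])   # one large error

print("Sum of absolute errors:", np.abs(residuals).sum())   # 12.5
print("Sum of squared errors: ", (residuals ** 2).sum())    # 102.25
```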

Interviewer: What are techniques adopted to find the slope and the intercept of the linear regression line which best fits the model?

Your answer: There are mainly two methods:

  1. Ordinary Least Squares(Statistics domain)
  2. Gradient Descent(Calculus family)

Interviewer: Explain Ordinary Least Squares Regression in brief.

Your answer: Ordinary least squares (OLS) regression is a statistical method of analysis that estimates the relationship between one or more independent variables and a dependent variable. The method estimates this relationship by minimizing the sum of the squares of the differences between the observed and predicted values of the dependent variable, with the relationship modelled as a straight line. OLS regression is typically introduced in the bivariate model, that is, a model in which there is only one independent variable (X) predicting a dependent variable (Y). However, the logic of OLS regression extends to multivariate models in which there are two or more independent variables.
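In matrix form, OLS has the closed-form solution beta = (XᵀX)⁻¹ Xᵀy. A hedged sketch with NumPy on toy data:

```python
# Sketch: solve OLS with the normal equations, beta = (X^T X)^(-1) X^T y.
import numpy as np

rng = np.random.default_rng(1)
n = 100
X = np.column_stack([np.ones(n), rng.uniform(0, 10, n)])  # intercept column + one feature
true_beta = np.array([5.0, 3.0])
y = X @ true_beta + rng.normal(0, 2, n)

# np.linalg.solve on the normal equations is more stable than an explicit inverse
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print("estimated [intercept, slope]:", beta_hat)
```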

Interviewer: What are the limitations of OLS?

Your answer: OLS becomes computationally expensive on large data sets, because the closed-form solution involves matrix operations over the entire data whose cost grows quickly with the number of observations and features. It performs well with small data; for larger data, Gradient Descent is preferred.

Interviewer: Can you briefly explain gradient descent?

Your answer: Gradient descent is an optimization algorithm that’s used when training a machine learning model. It’s based on a convex function and tweaks its parameters iteratively to minimize a given function to its local minimum.

We can think of a gradient as the slope of a function. The higher the gradient, the steeper the slope and the faster a model can learn. But if the slope is zero, the model stops learning. In mathematical terms, a gradient is a partial derivative with respect to its inputs.

Let’s imagine a blindfolded man who wants to climb to the top of a hill with as few steps as possible. He might start climbing the hill by taking really big steps in the steepest direction, which he can do as long as he is not close to the top. As he comes closer to the top, his steps will get smaller and smaller to avoid overshooting it. This process can be described mathematically using the gradient.

Imagine a top-down view of our hill in which red arrows mark the steps of our climber. Think of a gradient in this context as a vector that contains both the direction of the steepest step the blindfolded man can take and how long that step should be.

Note that the gradient from X0 to X1 is much longer than the one from X3 to X4. This is because the steepness of the hill, which determines the length of the vector, is greater near the bottom and smaller near the top. This matches the hill example: the hill gets less steep the higher it is climbed, so a reduced gradient goes along with a reduced slope and a reduced step size for the hill climber.

Instead of climbing up a hill, think of gradient descent as hiking down to the bottom of a valley. This is a better analogy because it is a minimization algorithm that minimizes a given function.

Let’s imagine we have a machine learning problem and want to train our algorithm with gradient descent to minimize our cost function J(w, b) and reach its local minimum by tweaking its parameters (w and b). We can picture the parameters (w and b) on the horizontal axes and the cost function J(w, b) on the vertical axis.

We know we want to find the values of w and b that correspond to the minimum of the cost function. To start finding the right values, we initialize w and b with some random numbers. Gradient descent then starts at that point (somewhere high up on the cost surface) and takes one step after another in the steepest downhill direction until it reaches the point where the cost function is as small as possible.
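Here is a minimal sketch of gradient descent for simple linear regression, minimizing the mean squared error cost J(w, b). The data and hyperparameters are illustrative only:

```python
# Sketch: gradient descent on J(w, b) = mean squared error for y ≈ w*x + b.
import numpy as np

rng = np.random.default_rng(7)
x = rng.uniform(0, 10, 200)
y = 3 * x + 5 + rng.normal(0, 2, 200)

def gradient_descent(x, y, lr=0.01, epochs=1000):
    w, b = 0.0, 0.0                                # simple initialization
    n = len(x)
    for _ in range(epochs):
        y_pred = w * x + b
        dw = (-2 / n) * np.sum(x * (y - y_pred))   # partial derivative of J w.r.t. w
        db = (-2 / n) * np.sum(y - y_pred)         # partial derivative of J w.r.t. b
        w -= lr * dw                               # step in the steepest-descent direction
        b -= lr * db
    return w, b

w, b = gradient_descent(x, y)
print(f"w ≈ {w:.2f}, b ≈ {b:.2f}")
```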

Interviewer: Explain the significance of learning rate.

Your answer: The learning rate determines how big the steps are that gradient descent takes in the direction of the local minimum, i.e. how fast or slow we move towards the optimal weights.

For gradient descent to reach the local minimum, we must set the learning rate to an appropriate value, which is neither too low nor too high. This is important because if the steps it takes are too big, it may never reach the localimum, bouncing back and forth across the valley of the convex cost function. If we set the learning rate to a very small value, gradient descent will eventually reach the local minimum, but it may take a very long time.
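Reusing the gradient_descent sketch and toy data from the previous answer, the effect of the learning rate can be seen directly (the exact values below are only illustrative):

```python
# Sketch: too small a learning rate converges slowly; too large a one can diverge.
for lr in (0.0001, 0.01, 0.5):
    w, b = gradient_descent(x, y, lr=lr, epochs=1000)
    print(f"lr={lr}: w={w:.2f}, b={b:.2f}")
# With lr=0.5 the updates overshoot and w, b blow up (inf/nan);
# with lr=0.0001 the fit is still far from the least-squares solution after 1,000 steps.
```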

Interviewer: How to evaluate regression models?

Your answer: There are five metrics used to evaluate regression models:

  1. Mean Absolute Error(MAE)
  2. Mean Squared Error(MSE)
  3. Root Mean Squared Error(RMSE)
  4. R-Squared(Coefficient of Determination)
  5. Adjusted R-Squared
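A quick sketch of the first four metrics with scikit-learn, on a pair of illustrative arrays:

```python
# Sketch: common regression metrics on a pair of true/predicted arrays.
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 7.5, 10.0])
y_pred = np.array([2.5, 5.5, 7.0, 11.0])

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)                     # RMSE is just the square root of MSE
r2 = r2_score(y_true, y_pred)

print(f"MAE={mae:.3f}, MSE={mse:.3f}, RMSE={rmse:.3f}, R²={r2:.3f}")
```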

Interviewer: Which evaluation technique should you prefer to use for data having a lot of outliers in it?

Your answer: Mean Absolute Error (MAE) is preferable for data with many outliers because MAE is robust to outliers, whereas MSE and RMSE are very susceptible to them: squaring the residuals gives the outliers a disproportionately large penalty.

Interviewer: What’s the intuition behind R-Squared?

Your answer: We use linear regression to predict y given some value of x. But suppose that we had to predict a y value without a corresponding x value.

Without using regression on the x variable, our most reasonable estimate would be to simply predict the average of the y values.

However, this horizontal line at the mean will not fit the data very well. One way to measure the fit of the line is to calculate the sum of the squared residuals; this gives us an overall sense of how much prediction error a given model has.

Now, if we predict the same data with regression, we will see that the least-squares regression line fits the data much better.

We will find that using least-squares regression, the sum of the squared residuals has been considerably reduced.

So using least-squares regression eliminated a considerable amount of prediction error. R-squared tells us what percent of the prediction error in the y variable is eliminated when we use least-squares regression on the x variable.

As a result, R² is also called the coefficient of determination. Many formal definitions say that R² tells us what percent of the variability in the y variable is accounted for by the regression on the x variable. The value of R² varies from 0 to 1.
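This intuition translates directly into code: compare the squared error around the mean line with the squared error around the regression line. The numbers below are made up for illustration:

```python
# Sketch: R² as the fraction of squared error eliminated by the regression line.
import numpy as np

y     = np.array([2.0, 4.0, 5.0, 4.0, 6.0])   # observed values
y_hat = np.array([2.4, 3.6, 4.8, 4.9, 5.8])   # illustrative regression predictions

ss_mean = np.sum((y - y.mean()) ** 2)   # squared error of the "predict the mean" model
ss_fit  = np.sum((y - y_hat) ** 2)      # squared error of the regression line

r_squared = 1 - ss_fit / ss_mean
print(r_squared)
```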

Interviewer: Can R² be negative?

Your answer: Yes, R² can be negative. The formula of R² is given by:

R² = (Var(mean) - Var(R)) / Var(mean)

where,

Var(mean) = variance of the residuals around the mean line

Var(R) = variance of the residuals around the regression line

As variance is proportional to the sum of squared errors, R² can also be written as:

R² = 1 - (SSR / SSM)

where SSR is the sum of squared errors of the regression line and SSM is the sum of squared errors of the mean line. If the sum of squared errors of the regression line (SSR) is greater than that of the mean line (SSM), R² will be negative, i.e. the regression line fits the data worse than a simple horizontal line at the mean.
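For example, a model that fits the data worse than simply predicting the mean yields a negative R². A small sketch with hypothetical numbers, using scikit-learn’s r2_score:

```python
# Sketch: R² goes negative when the model's squared error exceeds that of the mean line.
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_bad  = np.array([4.0, 3.0, 2.0, 1.0])   # a model that predicts the trend backwards

print(r2_score(y_true, y_bad))   # negative: worse than always predicting the mean (2.5)
```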

Interviewer: What are the flaws in R-squared?

Your answer: There are two major flaws:

Problem 1: R² increases with every predictor added to a model. Since R² always increases and never decreases as terms are added, a model can appear to fit better simply because it has more terms. This can be completely misleading.

Problem 2: Similarly, if our model has too many terms and too many high-order polynomials we can run into the problem of over-fitting the data. When we over-fit data, a misleadingly high R² value can lead to misleading predictions.

Interviewer: What is adjusted R²?

Your answer: Adjusted R-squared is used to determine how reliable the correlation between the independent variables and the dependent variable is. When we add a variable that is genuinely correlated with the dependent variable, the adjusted R-squared increases, whereas adding a variable with no correlation with the dependent variable decreases it.

The formula is:

Adjusted R² = 1 - [(1 - R²)(n - 1) / (n - k - 1)]

where:

  • n is the number of points in our data sample.
  • k is the number of independent regressors, i.e. the number of input columns.

Adjusted R² will always be less than or equal to R².
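A small sketch of the calculation (the R², n and k values below are purely illustrative):

```python
# Sketch: adjusted R² from R², the sample size n and the number of predictors k.
def adjusted_r2(r2: float, n: int, k: int) -> float:
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Illustrative values: same R², more predictors -> lower adjusted R².
print(adjusted_r2(0.90, n=100, k=2))   # ≈ 0.898
print(adjusted_r2(0.90, n=100, k=20))  # ≈ 0.875
```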

Conclusion: These are a few basic questions that can be asked on Linear Regression. Please note that I have explained some questions in detail to clear your concepts and deepen your understanding. While answering the interviewer, be specific; answer only what you have been asked. Be confident and answer smartly. Good luck!

If you find this helpful, don’t forget to hit the 👏 icon. It will help other young aspirants like you to see the story. Thank you!😊

References:

  1. https://builtin.com/data-science/gradient-descent
  2. https://www.khanacademy.org/math/ap-statistics/bivariate-data-ap/assessing-fit-least-squares-regression/a/r-squared-intuition
