What is Statistical Learning? — ISLR Series

Taraqur Rahman
The Biased Outliers
Dec 30, 2020

We are reading Introduction to Statistical Learning (ISLR) book-club style here at Biased Outliers. The purpose of this reading is to build our foundation in statistics so that we can move forward as a community to understand more complex machine learning topics. Every week we meet online on Sunday evenings to discuss what we read that week and tackle both the conceptual and applied exercises. Another way we reinforce what we learned is by blogging about each chapter.

Under the Hood

Using data, we want to understand how one variable affects another. For example, if there are clouds in the sky, then it will rain. The observation (or predictor) of cloudy skies affects our prediction of whether it is going to rain. If there are no clouds, then most likely it will not rain. This sounds intuitive, but it only became intuitive because we have experienced this scenario many times. In other words, we have been collecting these observations our whole lives. Now, when we observe cloudy skies, we expect rain; when the sky is clear, we expect no rain. We have collected enough data in our lifetimes to bet that it will not rain if it is not cloudy.

In statistics, the observations (cloudy skies) go by many names: inputs, predictors, features, or independent variables. The outcome (it is going to rain) also has a few names: response, dependent, or target variable. Given data, we want to either predict what will happen next or understand how the predictors affect the response. Mathematically it looks like this:

The standard model: Y = f(X) + ε. The response, Y, is equal to a function applied to the predictors, X, plus some error term. Our job is to estimate f(X).

In the equation, Y represents the response variable, X represents the predictor variables, and ε represents the irreducible error. The equation says that given some predictors X, we can apply a function to them and add ε to get the response Y.
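We can make this setup concrete with a small simulation. The "true" function f and the noise distribution below are made up for illustration; in practice f is exactly the thing we do not know:

```python
import random

random.seed(0)

def f(x):
    # The "true" (normally unknown) function relating X to Y.
    return 2 * x + 1

# Simulate observations: Y = f(X) + epsilon, where epsilon is random
# noise the model can never account for (the irreducible error).
data = []
for _ in range(5):
    x = random.uniform(0, 10)
    epsilon = random.gauss(0, 1)  # irreducible error
    y = f(x) + epsilon
    data.append((x, y))

for x, y in data:
    print(f"x={x:.2f}  f(x)={f(x):.2f}  observed y={y:.2f}")
```

Even if we recovered f exactly, the observed y would still differ from f(x) by ε — which is why no model predicts accurately 100% of the time.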

Reducible and Irreducible Error

The irreducible error, ε, is (wait for it…) an error that we cannot reduce. It represents things we cannot account for, such as predictors missing from our data collection. Realistically, we cannot create a model that predicts accurately 100% of the time. Just because it is cloudy doesn't mean it will always rain.

However, we are in control of f(X). Our job is to use statistical methods to estimate f(X). The fact that we are estimating means there will be some error. The error caused by estimating f(X) is called reducible error, which is (yup, you got it) the error we can reduce. We want an estimate with small reducible error, because that means the function we chose is close to the actual function. If the error is high, our estimate is far off.

Our job here is to apply statistical learning methods to data in order to estimate a function f-hat such that y-hat = f-hat(X) for any observation (X, Y).

The expectation value of the actual model and the predicted model. The reducible error is the error we CAN reduce. On the other hand, we do not have control of the irreducible error.
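For reference, the decomposition described above (as given in ISLR, treating X and the estimate f-hat as fixed) can be written as:

```latex
E\big[(Y - \hat{Y})^2\big]
  = \underbrace{\big[f(X) - \hat{f}(X)\big]^2}_{\text{reducible}}
  + \underbrace{\mathrm{Var}(\varepsilon)}_{\text{irreducible}}
```

The first term shrinks as our estimate improves; the second is a hard floor on accuracy no matter which method we use.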

The ^ on top of a variable indicates a predicted value. To reduce error, we first need to measure it, and how we measure it depends on the problem at hand. If the problem is regression (predicting a numerical value), the most common metric is mean squared error (MSE). If the problem is classification (predicting which group an observation belongs to), the common metric is the error rate.

The red dots are the actual data. The black curve is the estimated model. The black lines connecting the data to the curve represent the error caused by estimation. The MSE is the average of the squared lengths of these black lines. src: ISLR

MSE informs us how much error is associated with a regression model. The lower the MSE (low error), the better the model fits the data.

Equation for mean squared error (MSE): take the difference between each actual value and predicted value, square it, then sum the squared differences and divide by the number of values.

That means to get the MSE, we subtract the predicted y value from the actual y value, square the differences, and then average them (sum them up and divide by the number of observations). We subtract first because the difference tells us numerically how far the model was from the actual value. We square it to get a positive value, and we take the mean to see how far off we were on average. If we were off by a lot, the MSE will be big. If we were close to the actual values, the MSE will be small.
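The steps above translate directly into code. A minimal sketch (the toy values are made up):

```python
def mse(y_actual, y_predicted):
    """Mean squared error: the average of the squared differences
    between actual and predicted values."""
    n = len(y_actual)
    return sum((y - y_hat) ** 2
               for y, y_hat in zip(y_actual, y_predicted)) / n

# Predictions close to the actual values give a small MSE...
print(mse([3.0, 5.0, 7.0], [2.5, 5.5, 7.0]))
# ...while predictions far from the actual values give a large MSE.
print(mse([3.0, 5.0, 7.0], [0.0, 10.0, 1.0]))
```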

In a classification setting, where we predict which group an observation belongs to (e.g., whether an image shows a lion or an airplane), the most common metric is the error rate. The error rate tells us the fraction of observations for which the model predicted the wrong class.

Error Rate function. Averages the number of incorrect class predictions.

In a classification problem, we are no longer predicting a number; we are predicting a class, e.g., whether an image shows a dog. If our prediction comes out as cat, then y-hat (our prediction: cat) is NOT equal to y (actual class: dog), so the indicator I in the equation outputs a 1. The error rate counts every incorrect prediction and then divides by the number of predictions, giving the fraction of times the model was wrong.
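This is even simpler in code than MSE — count the mismatches and divide. A sketch with made-up labels:

```python
def error_rate(y_actual, y_predicted):
    """Fraction of predictions that do not match the actual class."""
    n = len(y_actual)
    # I(y != y_hat) contributes 1 for each wrong prediction, 0 otherwise.
    return sum(1 for y, y_hat in zip(y_actual, y_predicted)
               if y != y_hat) / n

actual    = ["dog", "cat", "dog", "dog"]
predicted = ["dog", "cat", "cat", "dog"]
print(error_rate(actual, predicted))  # 1 wrong out of 4 -> 0.25
```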

Bias-Variance Tradeoff

Another thing to keep in mind is the bias-variance tradeoff. In a nutshell, bias is the error introduced by approximating a complicated relationship with a simpler model, and variance is how much our estimate would change if we fit it on different training data. The ideal estimated function has low bias (it predicts well, with low error) AND low variance (it gives similar results on new data).
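ISLR makes this tradeoff precise: the expected test error at a point x₀ splits into three pieces,

```latex
E\Big[\big(y_0 - \hat{f}(x_0)\big)^2\Big]
  = \mathrm{Var}\big(\hat{f}(x_0)\big)
  + \big[\mathrm{Bias}\big(\hat{f}(x_0)\big)\big]^2
  + \mathrm{Var}(\varepsilon)
```

Since all three terms are non-negative, test error can never fall below Var(ε) — and pushing bias down (a more flexible model) typically pushes variance up, hence the tradeoff.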

Prediction vs. Inference

One thing to note when applying statistical methods is the end goal. If the end goal is prediction (predicting a value), then we can use a non-parametric method. This means we do not make any assumptions about the form of the function and let the data tell us what it should look like. This flexibility gives us many more functions to choose from. Another way of thinking about this is throwing all the predictors into a black box that outputs a response. We do not know what is happening in the black box, but we do not care, as long as we get an accurate outcome.

On the other hand, if the goal is inference, understanding the relationship between the predictors and the response (how Y changes as a function of X), then we have to take a different approach than for prediction problems. We cannot throw everything into a black box, because in an inference problem we care about what is happening inside it. In this scenario, we might use a parametric method. In a parametric method, we assume a form (or shape) for the function and fit the data to that form. When we assume a form, we create parameters that the data has to fit (hence "parametric" methods). This approach is inflexible because we limit the function to the form we assume. However, the benefit of an inflexible model is that it is easier to interpret, which is what we want when solving an inference problem. We can answer questions such as "which predictors are associated with the response?" and "what is the relationship between the response Y and the predictors X?"
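The contrast can be sketched on toy data (the data and function names here are made up for illustration). The parametric method assumes the form y = b0 + b1·x, so it only has to learn two interpretable parameters; the non-parametric method assumes nothing and just averages the k nearest observations:

```python
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 4.0, 6.2, 7.9, 10.1]  # roughly y = 2x

def fit_line(xs, ys):
    """Parametric: closed-form least-squares fit of y = b0 + b1*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
          / sum((x - x_bar) ** 2 for x in xs))
    b0 = y_bar - b1 * x_bar
    return b0, b1

def knn_predict(xs, ys, x_new, k=2):
    """Non-parametric: average the y values of the k nearest x values."""
    nearest = sorted(zip(xs, ys), key=lambda p: abs(p[0] - x_new))[:k]
    return sum(y for _, y in nearest) / k

b0, b1 = fit_line(xs, ys)
print(f"parametric:     y-hat = {b0:.2f} + {b1:.2f} * x")
print(f"non-parametric: f-hat(2.5) = {knn_predict(xs, ys, 2.5):.2f}")
```

The fitted line immediately answers inference questions ("Y goes up by about b1 for each unit of X"), while the k-NN prediction gives a number with no such summary — the prediction/inference tradeoff in miniature.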

One might ask: is there a model that outperforms every other model? No. There is no golden model. There is no free lunch, meaning no one method dominates all others over all possible data sets. We pick a function based on the data, and every dataset is different. A model fit on images of apples will not help us recognize oranges.

And that is basically what is under the hood of machine learning. Using statistical methods, we want to estimate a function that best fits the data so that we can either predict or infer what is happening.

Collaborators: Michael Mellinger

If you are interested in joining our textbook club, feel free to drop a message. We can send you a Slack invite.
