Part I. Build and Evaluate a Predictive Model with scikit-learn: A Walkthrough for Beginners

Ani Karenovna K
9 min read · Mar 28, 2019


For those new to data science, phrases like predictive modeling and model evaluation may seem daunting. Depending on the model, these terms can describe simpler or more complex things. In this post, I will attempt to dispel the perceived complexity of these terms with a guided walkthrough of building a Simple Linear Regression (SLR) model and an appetizer on evaluating it with scikit-learn. My goal is to leave you with both a conceptual and a practical understanding of what predictive modeling is and how to do it. The variable names used here may seem confusing, so I highly advise referring to the images in this post, which are designed to help you visualize the processes described and what all the variables are.

So What is A Prediction Model?

A prediction model is a function which takes a feature or a set of features X as input and outputs some approximation of feature y which we wish to predict (y is often referred to as the target feature). This output is an approximation because no model is perfect and therefore our function can only predict y within some error of what the true y is. Let’s refer to our predicted approximations of y as y_pred and the true values of y as y_true.

Evaluating a Prediction Model: the What, the Why and the How

In order to evaluate how well a prediction model is performing we need a way of measuring how far or close its predictions of y are from true y. Needless to say, this means that we must have both y_pred and y_true. Why do we care to evaluate our model? Well, because we want to understand whether we can rely on it to make accurate predictions about some future y using new data that it has not yet seen.

For example, in an SLR setting, you can evaluate how good the predictions are by looking at the mean squared error (MSE). This metric takes all the errors our model produced (the differences between y_true and y_pred), squares them and then takes the mean of those squared errors. In other words, MSE measures how far y_pred deviates from y_true on average, in squared units of y. There are a variety of metrics we can use and MSE is just one of them.
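To make the definition concrete, here is a minimal sketch that computes the MSE by hand and compares it to scikit-learn's mean_squared_error. The four temperature values are invented purely for illustration and are not from the dataset used in this post.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical temperatures (deg F), invented purely for illustration.
y_true = np.array([70.0, 75.0, 80.0, 85.0])
y_pred = np.array([72.0, 74.0, 83.0, 85.0])

errors = y_true - y_pred              # one error per prediction
squared_errors = errors ** 2          # squaring removes the sign
mse_by_hand = squared_errors.mean()   # mean of the squared errors

# scikit-learn's helper should agree with the manual calculation.
mse_sklearn = mean_squared_error(y_true, y_pred)

print(mse_by_hand, mse_sklearn)
```

Squaring means that large errors count for much more than small ones, which is worth keeping in mind when reading an MSE value.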

Another metric, for example, is the coefficient of determination, also known as the R_squared value. The R_squared value measures the amount of variability in the data that is explained by our model as compared to a null model which would simply always predict the mean of y. The R_squared value is typically between 0 and 1, although it can be negative when the model predicts worse than the null model, which can happen on data the model was not trained on. An R_squared value of 0.35, for example, would mean that 35% of the variability in our data is explained by our model. The closer the R_squared value is to 1, the better the model.
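As a rough illustration of the comparison with the null model, the sketch below computes R_squared by hand as 1 minus the ratio of the model's squared error to the null model's squared error, and checks it against scikit-learn's r2_score. The numbers are invented for illustration only.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical values, invented purely for illustration.
y_true = np.array([70.0, 75.0, 80.0, 85.0])
y_pred = np.array([72.0, 74.0, 83.0, 85.0])

# Squared error of our predictions vs. squared error of the null model,
# which always predicts the mean of y.
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_by_hand = 1 - ss_res / ss_tot

# The null model itself scores exactly 0.
null_pred = np.full_like(y_true, y_true.mean())
r2_null = r2_score(y_true, null_pred)

# Predictions that are worse than the null model give a negative score.
bad_pred = np.array([85.0, 80.0, 75.0, 70.0])
r2_bad = r2_score(y_true, bad_pred)

print(r2_by_hand, r2_score(y_true, y_pred), r2_null, r2_bad)
```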

Improving the performance of an SLR model means aiming to minimize the MSE or to bring the R_squared value as close to 1 as possible.

To build and evaluate a model, we need to have historical data with which to ‘inform’ our model so it can later make ‘educated’ predictions. By historical data I mean data that includes X and y from the past which we currently have. We feed a subset of this historical data to the model, let it train on it and learn from it. This subset is referred to as training data. We then test our model to see how well it’s performing by computing the MSE or the R_squared value on training data. But the point of modeling here is not to get the optimal MSE or a great R_squared value for our training data. We already know what we want our model to predict because our historical data, and therefore our training data, comes with the true y! Instead…

…we want our model to work really well on ‘future’ data which it has never seen before. This means that we wish to build a model that can be generalizable to new, unseen data.

Generalizability is an important concept for the predictive power of a model and is achieved by balancing out the bias and variance tradeoff which I will discuss in my next post. For now, let’s just focus on building the model and getting the necessary scores we need to evaluate it.

Diving Right In…

1. Define X and y

Let’s say we have data on the rate of cricket chirps (X) and the average outside temperature in Fahrenheit (y). This is our historical data. Keeping in mind the generalizability concept, our goal is to build a model which will predict the outside temperature in the future assuming we will have new data on the rate of cricket chirps — that is, assuming we will have only some new X, but not y at some point in the future.

We define our historical X and y in pandas as follows:

X = data[['chirps per second']]
y = data['temperature (F)']

The double brackets around ‘chirps per second’ are there because we will later be ingesting our X and y into scikit-learn. Scikit-learn can take any number of predictor features X and it expects a 2-D array for X which is why we need the double brackets.
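The difference between single and double brackets is easy to check for yourself. In the sketch below, the small DataFrame is a hypothetical stand-in for data with invented values; only the column names are taken from this post.

```python
import pandas as pd

# Hypothetical stand-in for `data`; the values are invented.
data = pd.DataFrame({
    "chirps per second": [14.2, 15.5, 16.0, 17.1, 19.8],
    "temperature (F)": [71.0, 75.2, 79.6, 80.6, 93.3],
})

X = data[["chirps per second"]]  # double brackets: a DataFrame (2-D)
y = data["temperature (F)"]      # single brackets: a Series (1-D)

print(type(X).__name__, X.shape)
print(type(y).__name__, y.shape)
```

The 2-D shape of X has one row per observation and one column per feature, so adding more predictor features later only means adding more column names inside the inner brackets.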

2. Train-Test Split the Data

Next we need to partition our historical data into a train set and a test set. This is because we want to train our model, make it smarter, and then we want to test it later to see how well it’s performing.

Let’s define some terminology:

Train Set — a subset of our entire historical dataset which is used to train and build the model.

Test Set — a subset of the entire historical dataset which is used after we have trained our model for the purpose of making predictions and evaluating model performance.

Scikit-learn’s train_test_split function will randomly split our data into train and test subsets.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=0.26)

It’s important to set the random_state parameter to a fixed value. This will ensure that you will get the same train and test sets every time you do a train-test split on the same dataset.

The test_size parameter here is set to 0.26. This means that I want my test data to be about 26% of the entire dataset, while the remaining portion will become the train set.
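Here is a small sketch of what these two parameters do, using 50 made-up observations rather than the data from this post. It splits the same data twice with the same random_state and checks that the test sets match, that the test set is roughly 26% of the data, and that no row ends up in both subsets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 50 hypothetical observations; the values only serve as row labels here.
X = np.arange(50).reshape(-1, 1)
y = np.arange(50)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=42, test_size=0.26
)
# Same data, same random_state: we expect the same split again.
X_train_again, X_test_again, _, _ = train_test_split(
    X, y, random_state=42, test_size=0.26
)

same_split = np.array_equal(X_test, X_test_again)
test_fraction = len(X_test) / len(X)
overlap = set(X_train.ravel()) & set(X_test.ravel())

print(same_split, test_fraction, len(overlap))
```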

The picture below shows what the train and test split accomplishes using a random subset of the actual data I use for this example.

Now we’re ready to train and build our model.

3. Instantiate the Model

The first step towards actually building our SLR model is to instantiate it. This creates a blank model object which is much like an empty template ready to be filled with our training data.

from sklearn.linear_model import LinearRegression
slr_model = LinearRegression()

4. Fit the Model on Train Set

Next, we feed our train data into the slr_model template and let the model ‘fit itself’ to the training data. This is the step where the model is ‘learning’. What this means is that the template slr_model learns the intricacies of our data and it adjusts itself to it so that when the model sees data like this in the future, it will recognize the most salient patterns in the data and will make its predictions accordingly in a more ‘informed’ way.

Fitting the model requires inclusion of both X and y from our training set.

Thanks to scikit-learn, this process is a simple one-liner:

slr_model.fit(X_train, y_train)
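To get a feel for what ‘learning’ means here, consider a toy case where the answer is known in advance. The sketch below fits a separate, hypothetical model (demo_model, not the slr_model from this post) on noise-free points that lie exactly on the line y = 3x + 2, and checks that fitting recovers that slope and intercept.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data lying exactly on y = 3x + 2 (invented for illustration).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
X_demo = x.reshape(-1, 1)   # scikit-learn expects a 2-D X
y_demo = 3.0 * x + 2.0

demo_model = LinearRegression()
demo_model.fit(X_demo, y_demo)

# coef_ is an array with one entry per feature; we have a single feature.
slope = demo_model.coef_[0]
intercept = demo_model.intercept_

print(slope, intercept)
```

With real data the points do not fall exactly on a line, so the fitted slope and intercept describe the line that fits the training data best in the least-squares sense.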

The picture above illustrates the high-level process of what happens when you fit a model. The equation represents the regression line which fits our data and it has unique slope and intercept values. Now let’s look up the slope and the y-intercept for our model and then view a scatterplot with the actual regression line.

slope = slr_model.coef_
y_intercept = slr_model.intercept_
--------------------------------------------------------------------
slope = 2.90165464
y_intercept = 31.359529272367354

Great! We have a working model! Let’s see how good a fit it is to our data.

5. Make Predictions on Train Set

In simple terms, the prediction step is equivalent to plugging in X values to our equation from above and calculating the corresponding y values.
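As a worked example of this ‘plugging in’, the sketch below applies the slope and intercept reported above to two chirp rates. The chirp rates and the helper name predict_temperature are hypothetical and only serve the illustration.

```python
# Slope and intercept as reported above in this walkthrough.
SLOPE = 2.90165464
Y_INTERCEPT = 31.359529272367354


def predict_temperature(chirps_per_second):
    """Plug a chirp rate into the fitted line: y = slope * x + intercept."""
    return SLOPE * chirps_per_second + Y_INTERCEPT


# Hypothetical new chirp rates, not taken from the dataset.
new_chirp_rates = [15.0, 18.0]
for rate in new_chirp_rates:
    print(rate, round(predict_temperature(rate), 2))
```

For 15 chirps per second this gives roughly 74.9 degrees Fahrenheit. Calling the model's predict method performs this same calculation for every row of X at once.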

The following code produces the predictions y_pred_train on the train set:

y_pred_train = slr_model.predict(X_train)

6. Evaluate the Fit on Train Set

Remember that we have our true y from the train set, y_train. We can compare it to y_pred_train by making use of the R_squared metric and the MSE metric.

from sklearn.metrics import r2_score, mean_squared_error
r2_train = r2_score(y_train, y_pred_train)
mse_train = mean_squared_error(y_train, y_pred_train)
--------------------------------------------------------------------
[Output]:
r2_train = 0.6599654904754202
mse_train = 12.744526804119651

The training R_squared value is 0.66. This means that the model we just built explains about 66% of the variability in our data.

The training MSE is 12.74. Because the MSE is the mean of the squared errors, its units are squared degrees Fahrenheit, so to get an error in the same units as our y (deg. Fahrenheit) we need to take the square root of the MSE. The square root of 12.74 is about 3.57, which means our predictions typically deviate from the true values of y by about 3.57 degrees Fahrenheit.
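The square root step is a one-liner. This sketch uses the training MSE reported above; the variable name rmse_train is my own and is often read as ‘root mean squared error’.

```python
import math

# Training MSE as reported above (units: squared degrees Fahrenheit).
mse_train = 12.744526804119651

# Taking the square root brings the error back to degrees Fahrenheit.
rmse_train = math.sqrt(mse_train)

print(round(rmse_train, 2))  # 3.57
```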

It’s critical to understand the purpose of the train set and the test set. We train and build the model using the train set. We can evaluate the model on the train set to see how good the fit is. But we use the test set to gain insight into how generalizable the model is to future data. We only make predictions on X_test and never feed y_test to the model.

We always want to make sure our model is agnostic to y_test because y_test represents unseen data. We use y_test only afterwards, to score the predictions.

7. Make Predictions on Test Set

Just like we saw before with predicting on train data, we can do the same on test data:

y_pred_test = slr_model.predict(X_test)

8. Evaluate Performance on Test Set

And our test scores are…

r2_test = r2_score(y_test, y_pred_test)
mse_test = mean_squared_error(y_test, y_pred_test)
--------------------------------------------------------------------
[Output]:
r2_test = 0.7728120770684193
mse_test = 16.095126545492537

The testing scores can be interpreted in the same way as we did the training scores above.

We now have both our training scores and the testing scores which means we can compare the two. By comparing the performance of the model on train set to that of the test set, we can infer a lot about how good our model really is.

This is where I’ll stop as this leads to the broader discussion in my next post of how to interpret the relationship between train and test scores, how to do cross validation, and how to deal with the bias and variance tradeoff to achieve the optimal model. Here is a link to the article:

Let’s Review

To recap everything we’ve done so far, we built an SLR model, made predictions with it and obtained training and testing scores. Thankfully, with scikit-learn’s help, the steps outlined here are not specific to SLR and can be applied to other models. Here are the general steps of building and evaluating a predictive model:

  1. Define X and y
  2. Train-test-split your data
  3. Instantiate the model
  4. Fit the model using X_train and y_train
  5. Make predictions on train set using X_train
  6. Evaluate the model fit on train set (obtain training score)
  7. Make predictions on test set using X_test
  8. Evaluate the model on test set (obtain testing score)
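The eight steps above can be sketched end to end as follows. The data here is synthetic and generated on the spot as a stand-in (it is not the dataset used in this post), so the scores it prints will differ from the ones reported above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (NOT this post's dataset): a noisy line.
rng = np.random.default_rng(0)
chirps = rng.uniform(14.0, 20.0, size=100)
temperature = 3.0 * chirps + 30.0 + rng.normal(0.0, 3.0, size=100)
data = pd.DataFrame({"chirps per second": chirps,
                     "temperature (F)": temperature})

# 1. Define X and y
X = data[["chirps per second"]]
y = data["temperature (F)"]

# 2. Train-test-split your data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=42, test_size=0.26
)

# 3. Instantiate the model
model = LinearRegression()

# 4. Fit the model using X_train and y_train
model.fit(X_train, y_train)

# 5. Make predictions on train set using X_train
y_pred_train = model.predict(X_train)

# 6. Evaluate the model fit on train set
r2_train = r2_score(y_train, y_pred_train)
mse_train = mean_squared_error(y_train, y_pred_train)

# 7. Make predictions on test set using X_test
y_pred_test = model.predict(X_test)

# 8. Evaluate the model on test set (y_test is used only for scoring)
r2_test = r2_score(y_test, y_pred_test)
mse_test = mean_squared_error(y_test, y_pred_test)

print("train:", r2_train, mse_train)
print("test: ", r2_test, mse_test)
```

To try a different model, only step 3 should need to change, as long as the replacement is a scikit-learn regressor that offers the same fit and predict methods.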

Data Source. The data used in the example above was augmented via simulation.


Ani Karenovna K

Data Scientist | Graduate Student in Applied Maths — Demystifying the Math in ML Algorithms https://www.linkedin.com/in/ani-k-karenovna/