Problem with “In-Sample” accuracy

Siddhraj Maramwar
Published in Analytics Vidhya · 4 min read · Jul 15, 2020

Whenever we build a machine learning model, we usually measure the accuracy of the built model (though not always).

Many people make a huge mistake when measuring predictive accuracy: they make predictions with their training data and compare those predictions to the target values in that same training data. We can avoid this using train_test_split.

There are many metrics for summarizing model quality, but in this article we will focus on Mean Absolute Error (also called MAE).

The prediction error for each house is:

error = actual − predicted
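MAE simply averages the absolute values of these errors. As a quick illustration (the prices below are made up, not from the dataset):

```python
# MAE by hand: the average of the absolute errors |actual - predicted|.
actual = [39.0, 120.0, 95.0]       # hypothetical actual prices
predicted = [40.0, 118.0, 95.0]    # hypothetical model predictions

errors = [abs(a - p) for a, p in zip(actual, predicted)]
mae = sum(errors) / len(errors)
print(mae)  # 1.0 -> on average, predictions are off by about 1
```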

To calculate the metric (MAE), we first need to build the model.

Selecting Data for Modeling

We will use a dataset of house prices in Bengaluru City.

import pandas as pd
file_path = "C:/Users/XYZ/XYZ/Bengaluru_House_Data.csv"
home_data = pd.read_csv(file_path)
home_data = home_data.dropna(axis=0)
home_data.shape
OUTPUT
(7340, 9)

Perform all the required data cleaning and pre-processing (dropping missing values with dropna, label-encoding the categorical columns).
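As a minimal sketch of what label encoding does (the category strings below are illustrative, not an exact listing of the dataset's area_type column): each category is mapped to an integer code, assigned in alphabetical order, which is how sklearn's LabelEncoder behaves.

```python
# Minimal label-encoding sketch: map each category string to an integer code,
# with codes assigned in sorted (alphabetical) order.
area_types = ["Super built-up Area", "Plot Area", "Built-up Area", "Plot Area"]

codes = {cat: i for i, cat in enumerate(sorted(set(area_types)))}
encoded = [codes[cat] for cat in area_types]
print(encoded)  # [2, 1, 0, 1]
```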

After that select the prediction target.

Selecting The Prediction Target

This can be done with dot notation.

y = home_data.price

Choose the features

Columns that are inputted into our model (and later used to make predictions) are called “features.” In our case, these columns are used to determine the home price.

blr_features = ['area_type', 'location', 'size', 'total_sqft', 'bath', 'balcony']
X = home_data[blr_features]

Building our Model

You will use the scikit-learn (sklearn) library to create your models.

from sklearn.tree import DecisionTreeRegressor

# Define model. Specify a number for random_state to ensure same results each run
blr_model = DecisionTreeRegressor(random_state=1)

# Fit model (training the model)
blr_model.fit(X, y)

Prediction

Usually we make predictions for the prices of houses we have never seen before. But here we'll make predictions for the first few rows of the training data, just to see how the predict function works.

print("Making predictions for the first 5 houses:")
print(X.head())
print("The predictions are")
print(blr_model.predict(X.head()))
OUTPUT
Making predictions for the first 5 houses:
    area_type  location  size  total_sqft  bath  balcony
0           3       208     3        1056   2.0      1.0
1           2       147     8        2600   5.0      3.0
3           3       384     5        1521   3.0      1.0
5           3       622     3        1170   2.0      1.0
11          2       622     8        2785   5.0      3.0
The predictions are
[ 39.07  120.    94.9   55.   295.  ]

Model Validation

You will have to evaluate almost every model you ever build. In most (though not all) applications, the relevant measure of model quality is predictive accuracy.

We will measure our built model accuracy using Mean Absolute Error (also called MAE).

from sklearn.metrics import mean_absolute_error

predicted_home_prices = blr_model.predict(X)
mean_absolute_error(y, predicted_home_prices)
OUTPUT
1.6330122417833453
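An MAE of about 1.63 looks suspiciously good. A small synthetic sketch (random features, pure-noise targets, so there is nothing real to learn) shows why: an unconstrained decision tree can memorize its training set, driving the in-sample error all the way to zero.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.rand(100, 2)        # random features
y = rng.rand(100) * 100     # targets are pure noise: no real pattern exists

model = DecisionTreeRegressor(random_state=1)
model.fit(X, y)

# Predicting on the training data itself: the tree has memorized every row.
in_sample_mae = mean_absolute_error(y, model.predict(X))
print(in_sample_mae)  # 0.0
```

A perfect in-sample score on noise is the clearest possible warning that this number says nothing about performance on new data.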

The Conflict

The measure we just computed can be called an “in-sample” score. We used a single “sample” of houses for both building the model and evaluating it. Here’s why this is bad.

“In-sample”: the model has already seen the data (during training), and we predict the target variable for that same training data.

“Out-of-sample”: the model sees the data for the first time (data other than what was used for training / new data).

In the sample of data we used to build the model, assume that all homes with red doors were very expensive. The model’s job is to find patterns that predict home prices, so it will pick up this pattern and always predict high prices for homes with red doors.

Since this pattern was derived from the training data, the model will appear accurate in the training data.

But if this pattern doesn’t hold when the model sees new data, the model would be very inaccurate when used in practice.

So the obvious and practical thing to do is to measure accuracy on new data that wasn’t used to build the model.

The best way to do this is to exclude some data while building the model, and then use that excluded data (new data) to measure accuracy.

Solving the Conflict

The train_test_split function from the scikit-learn library can be used to break the data into two pieces: training data and validation data.

Here is the code to solve the Conflict:

from sklearn.model_selection import train_test_split

# split data into training and validation data, for both features and target
# The split is based on a random number generator. Supplying a numeric value to
# the random_state argument guarantees we get the same split every time we
# run this script.
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)

# Define model
blr_model = DecisionTreeRegressor(random_state=1)

# Fit model
blr_model.fit(train_X, train_y)

# get predicted prices on validation data
val_predictions = blr_model.predict(val_X)
print(mean_absolute_error(val_y, val_predictions))
OUTPUT
25.692170517947087

This is the difference between a model that is almost exactly right, and one that is unusable for most practical purposes.

There are many ways to improve this model, such as experimenting to find better features or different model types.
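For example (a sketch on synthetic data, since the Bengaluru CSV isn’t bundled here), a random forest averages many trees and typically beats a single decision tree on held-out data:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.rand(500, 3)
y = 100 * X[:, 0] + 20 * X[:, 1] + rng.randn(500) * 5  # noisy linear target

train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)

tree = DecisionTreeRegressor(random_state=1).fit(train_X, train_y)
forest = RandomForestRegressor(random_state=1).fit(train_X, train_y)

# Compare validation MAE: the forest's averaging reduces the single
# tree's variance, so its out-of-sample error is usually lower.
tree_mae = mean_absolute_error(val_y, tree.predict(val_X))
forest_mae = mean_absolute_error(val_y, forest.predict(val_X))
print(f"single tree MAE: {tree_mae:.2f}, random forest MAE: {forest_mae:.2f}")
```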
