Problem with “In-Sample” accuracy
Whenever we build a machine learning model, we usually (though not always) want to measure its accuracy.
Many people make a huge mistake when measuring predictive accuracy: they make predictions with their training data and compare those predictions to the target values in the same training data. We can avoid this using train_test_split.
There are many metrics for summarizing model quality, but in this article we will look at Mean Absolute Error (also called MAE).
The prediction error for a single house is:
error = actual − predicted
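MAE is simply the average of the absolute values of these errors. A tiny sketch with made-up prices (not from the dataset) shows the arithmetic:

```python
# Toy example: compute MAE by hand for three made-up house prices
actual = [100.0, 150.0, 200.0]
predicted = [110.0, 140.0, 205.0]

# error = actual − predicted for each house
errors = [a - p for a, p in zip(actual, predicted)]  # [-10.0, 10.0, -5.0]

# MAE is the mean of the absolute errors
mae = sum(abs(e) for e in errors) / len(errors)
print(mae)  # prints 8.333333333333334
```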
To calculate this metric (MAE), we first need to build a model.
Selecting data for Modeling
We will use a dataset of house prices in Bengaluru city. Click here to get the data.
import pandas as pd
file_path = "C:/Users/XYZ/XYZ/Bengaluru_House_Data.csv"
home_data = pd.read_csv(file_path)
home_data = home_data.dropna(axis=0)
home_data.shape
OUTPUT
(7340, 9)
Perform all the required data cleaning and pre-processing (dropna, label encoding).
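The article doesn't show the cleaning steps themselves, so here is a minimal sketch of the label-encoding part, assuming text columns like area_type and location (the mini-frame below is hypothetical, standing in for the real dataset):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical mini-frame standing in for the real Bengaluru data
df = pd.DataFrame({
    "area_type": ["Super built-up Area", "Plot Area", "Plot Area"],
    "location": ["Whitefield", "Indira Nagar", "Whitefield"],
})

# Encode each text column into integer codes, since the model needs numbers
for col in ["area_type", "location"]:
    df[col] = LabelEncoder().fit_transform(df[col])

print(df)
```

LabelEncoder assigns codes in sorted order of the unique values, which is why the encoded columns in the output further below contain plain integers.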
After that select the prediction target.
Selecting The Prediction Target
This can be done with dot notation.
y = home_data.price
Choose the features
Columns that are inputted into our model (and later used to make predictions) are called “features.” In our case, these columns are used to determine the home price.
blr_features = ['area_type', 'location', 'size', 'total_sqft', 'bath', 'balcony']
X = home_data[blr_features]
Building our Model
You will use the scikit-learn (sklearn) library to create your models.
from sklearn.tree import DecisionTreeRegressor
# Define model. Specify a number for random_state to ensure same results each run
blr_model = DecisionTreeRegressor(random_state=1)
# Fit model (training the model)
blr_model.fit(X, y)
Prediction
Usually we make predictions for the prices of houses we have never seen before. But here we'll make predictions for the first few rows of the training data to see how the predict function works.
print("Making predictions for the first 5 houses:")
print(X.head())
print("The predictions are")
print(blr_model.predict(X.head()))
OUTPUT
Making predictions for the first 5 houses:
area_type location size total_sqft bath balcony
0 3 208 3 1056 2.0 1.0
1 2 147 8 2600 5.0 3.0
3 3 384 5 1521 3.0 1.0
5 3 622 3 1170 2.0 1.0
11 2 622 8 2785 5.0 3.0
The predictions are
[ 39.07 120. 94.9 55. 295. ]
Model Validation
You will have to evaluate almost every model you ever build. In most (though not all) applications, the relevant measure of model quality is predictive accuracy.
We will measure our built model accuracy using Mean Absolute Error (also called MAE).
from sklearn.metrics import mean_absolute_error
predicted_home_prices = blr_model.predict(X)
mean_absolute_error(y, predicted_home_prices)
OUTPUT
1.6330122417833453
Conflict
The measure we just computed can be called an “in-sample” score. We used a single “sample” of houses for both building the model and evaluating it. Here’s why this is bad.
“in-sample”: the model predicts the target for data it has already seen during training (part of the training data).
“out-of-sample”: the model predicts for data it sees for the first time (new data that was not used for training).
In the sample of data we used to build the model, assume that all homes with red doors were very expensive. The model’s job is to find patterns that predict home prices, so it will pick up this pattern, and it will always predict high prices for homes with red doors.
Since this pattern was derived from the training data, the model will appear accurate in the training data.
But if this pattern doesn’t hold when the model sees new data, the model would be very inaccurate when used in practice.
So in practice we should measure accuracy on new data that wasn’t used to build the model.
The most straightforward way to do this is to exclude some data while building the model, and then use that held-out data (new data) for measuring accuracy.
Solving the Conflict
The train_test_split function from the scikit-learn library can be used to break the data into two pieces: training data and validation data.
Here is the code to solve the Conflict:
from sklearn.model_selection import train_test_split
# split data into training and validation data, for both features and target
# The split is based on a random number generator. Supplying a numeric value to
# the random_state argument guarantees we get the same split every time we
# run this script.
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)
# Define model
blr_model = DecisionTreeRegressor()
# Fit model
blr_model.fit(train_X, train_y)
# get predicted prices on validation data
val_predictions = blr_model.predict(val_X)
print(mean_absolute_error(val_y, val_predictions))
OUTPUT
25.692170517947087
This is the difference between a model that looks almost exactly right on the training data and one that is unusable for most practical purposes.
There are many ways to improve this model, such as experimenting to find better features or different model types.
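As one example of trying a different model type, we could compare the single decision tree against a random forest on the same validation split. The sketch below uses synthetic stand-in data (the real X and y come from the dataset above), so the exact MAE values are illustrative only:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (in the article, X and y come from the house dataset)
rng = np.random.RandomState(0)
X = rng.rand(500, 4)
y = X @ np.array([100.0, 50.0, 25.0, 10.0]) + rng.normal(0, 5, 500)

train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)

# Fit each candidate model and report its out-of-sample MAE
results = {}
for name, model in [
    ("decision tree", DecisionTreeRegressor(random_state=1)),
    ("random forest", RandomForestRegressor(random_state=1)),
]:
    model.fit(train_X, train_y)
    results[name] = mean_absolute_error(val_y, model.predict(val_X))
    print(f"{name}: validation MAE = {results[name]:.2f}")
```

Because we always score on the held-out validation data, the comparison between models is fair: whichever model has the lower validation MAE generalizes better.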
Get access to the repository below.
FEEL FREE TO MAKE CHANGES.