STEP BY STEP GUIDE

Training your First Machine Learning Model with Python’s sklearn

This article will guide you through all the steps required for Machine Learning Model Training, from data preprocessing to model evaluation!

Nisarg Kapkar
Analytics Vidhya


Photo by Markus Winkler from Pexels

Machine Learning is the practice of teaching a computer to make predictions on new, unseen data using the data it has seen in the past. It involves building a model from training data and then using that model to make predictions on other, unseen data.
Some applications of machine learning: recommendation systems (for example, recommending new movies to a user based on movies they have watched and liked), stock market trend prediction, virtual personal assistants, etc.

Machine learning is generally split into three categories: Supervised Learning, Unsupervised Learning, and Reinforcement Learning.
This tutorial will focus on training a machine learning model using Supervised Learning.
In supervised learning, we train the computer on data containing both input (features) and output (target), and the goal is to learn a function that maps input to an output.
You can read more about machine learning approaches here.

For this tutorial, we will use Python’s sklearn library (sklearn is a Machine Learning library that contains implementations of various Machine Learning algorithms) and Kaggle’s Housing Price Prediction Dataset.

Assuming you know the basics of Machine Learning and Supervised Learning, let’s start building our Machine Learning model!

Let’s first create a new notebook on Kaggle.
Log in to your Kaggle account and go to the Notebook section of the Housing Price Competition.
Click on the ‘New Notebook’ option. It will redirect you to the Notebook settings page.
Keep everything as default and click on Create.

You should now have a new notebook similar to the one shown below.
(Delete the given default code block)

Image showing default Kaggle notebook

I have made a Notebook containing all the relevant steps and code.
You can access the Notebook on Kaggle.

Step 1- Import necessary libraries/functions

Let’s import all the libraries and functions we will need.

#Import necessary libraries/functions
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

Explanations and links to the documentation for each function are provided in the subsequent steps.

Step 2- Load the data

You should see an arrow symbol in the top-right corner; clicking on it will open a new panel that shows details about the input data.

Click on ‘home-data-for-ml-course’ and then click on ‘train.csv’. Clicking on ‘train.csv’ will open a new panel showing the path and the data stored in ‘train.csv’.
Store the ‘train.csv’ data in the ‘X’ dataset.
Similarly, store the ‘test.csv’ data in the ‘X_test’ dataset.

#Load the data
#read train.csv and store it in 'X'
X=pd.read_csv('../input/home-data-for-ml-course/train.csv',index_col='Id')
#read test.csv and store it in 'X_test'
X_test=pd.read_csv('../input/home-data-for-ml-course/test.csv',index_col='Id')

As the name suggests, read_csv is used to read a comma-separated values (CSV) file into a DataFrame. You can read more about the function here.

Step 3- Examine the data

Before proceeding to further steps, let’s examine our data.

#Examine the data
print(X.shape)
print(X.columns)
print(X.head(n=5))
  • X.shape returns the dimensions (number of rows, number of columns) of the DataFrame
  • X.columns returns the column labels of the DataFrame
  • X.head(n) returns the first n rows of the DataFrame

Similarly, you can examine data in the ‘X_test’ dataset.
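The same calls work on the test DataFrame:

#Examine the test data
print(X_test.shape)
print(X_test.columns)
print(X_test.head(n=5))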

Below is an image showing the first 5 rows of the ‘X’ Dataset (X.head(n=5))

Image showing first 5 rows of the ‘X’ dataset

Some observations:

  • ‘X’ dataset contains a column named ‘SalePrice’, but this column is not present in the ‘X_test’ dataset. This is because ‘SalePrice’ is our target variable (more on target variable in step 4) and we will predict the values of ‘SalePrice’ for ‘X_test’.
  • There are two types of columns, columns containing numbers (numerical columns) and columns containing non-numerical values (categorical columns).
  • A categorical column only takes a fixed number of values.
  • Some cells have the value ‘NaN’; these are cells with missing values. (More on missing data in Step 6)

Step 4- Separate the Target variable

Some key definitions:

  • Features: Features are the independent columns/variables that the model uses to make predictions.
  • Target: The target is the dependent column/variable whose values the model learns to predict.

In our dataset, SalePrice is the target and all the remaining columns are the features.

Generally, real-world datasets have a lot of missing values. (More on missing data in Step 6)
There is a chance that our target value is itself missing for some rows in the dataset. For cases like this, we drop the rows with missing target values from the dataset.

#drop rows with missing target values from 'X'
X.dropna(axis=0,subset=['SalePrice'],inplace=True)

dropna() is a function available in Pandas. It is used to drop rows/columns with NaN (or Null) values.
You can read more about the function and its parameters here.
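As a quick illustration, here is a small toy DataFrame (hypothetical data, not from the competition) where dropna removes only the rows missing a value in a chosen column:

#Toy example (hypothetical data, not from the competition)
import pandas as pd
import numpy as np
toy=pd.DataFrame({'price':[100,np.nan,250],'rooms':[3,2,np.nan]})
#drop rows where 'price' is missing; rows missing only 'rooms' are kept
toy.dropna(axis=0,subset=['price'],inplace=True)
print(toy)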

After dropping, we will separate our target from other features.
We will store our target in ‘y’ and then drop the target column from the dataset.

#Store target in 'y' and drop the target column from 'X'
y=X.SalePrice
X.drop(['SalePrice'],axis=1,inplace=True)
print(y)

drop() is a function available in Pandas. It is used to drop rows and columns.
You can read more about the function and its parameters here.
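Similarly, a minimal toy sketch of drop removing a column by label (again hypothetical data):

#Toy example (hypothetical data, not from the competition)
import pandas as pd
toy=pd.DataFrame({'price':[100,200,250],'rooms':[3,2,4]})
#axis=1 drops columns; axis=0 would drop rows
toy=toy.drop(['rooms'],axis=1)
print(toy)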

Step 5- Extract categorical and numerical columns

We saw that there are two types of columns present in the dataset: Numerical columns and Categorical columns.

Let’s first see the cardinality (number of unique values in a column) of our categorical columns.

#print categorical column labels with cardinality
for i in X.columns:
    if X[i].dtype=="object":
        print(i,X[i].nunique(),sep=' ')

The above code prints categorical column labels with the corresponding cardinality. (nunique() is used to count distinct values)

We will divide the column labels into 3 parts: categorical_columns, columns_to_drop, numerical_columns

#Divide columns in 3 parts: categorical_columns, numerical_columns and columns_to_drop
categorical_columns=[]
numerical_columns=[]
columns_to_drop=[]
for i in X.columns:
    if X[i].nunique()<15 and X[i].dtype=="object":
        categorical_columns.append(i)
    elif X[i].nunique()>=15 and X[i].dtype=="object":
        columns_to_drop.append(i)
    elif X[i].dtype in ["int64","float64"]:
        numerical_columns.append(i)

print(categorical_columns)
print(columns_to_drop)
print(numerical_columns)
  • categorical_columns is a list containing all column labels with non-numerical values and cardinality less than 15.
  • numerical_columns is a list containing all column labels with numerical values.
  • columns_to_drop is a list containing all column labels with non-numerical values and cardinality greater than or equal to 15.
#drop 'columns_to_drop' from 'X' and 'X_test'
X=X.drop(columns_to_drop,axis=1)
X_test=X_test.drop(columns_to_drop,axis=1)

The above code will drop all the categorical columns with cardinality greater than/equal to 15 from both ‘X’ and ‘X_test’ dataset.

The explanation for why we select categorical columns with cardinality less than 15 and drop the other categorical columns is given in Step 7.

Step 6- Impute missing data

Real-world datasets might contain many missing values (NaN, Null). There are many reasons why data could be missing.
For example, a house without a garage won’t have ‘GarageQual’ (garage quality) data (since there is no garage in the house).

Scikit-learn (sklearn) will throw an error if we try training a model on data containing missing values. So, we need to impute (fill in/replace) the missing values before training the model.

Before imputing, let’s check how many cells contain a missing value!

#optional
#print column labels with number of missing cells in that corresponding column
#for X dataset
missing_columns=X.isnull().sum()
print("X dataset")
print(missing_columns[missing_columns>0])
print()
#for X_test
missing_columns_test=X_test.isnull().sum()
print("For X_test set")
print(missing_columns_test[missing_columns_test>0])

The code prints column labels with the number of missing cells in that corresponding column.

Image showing missing column label with count (before imputing)

We will first impute numerical columns

#impute numerical_columns
numerical_imputer=SimpleImputer()
#for X
for i in numerical_columns:
    current_column=np.array(X[i]).reshape(-1,1)
    updated_column=numerical_imputer.fit_transform(current_column)
    X[i]=updated_column
#for X_test
for i in numerical_columns:
    current_column=np.array(X_test[i]).reshape(-1,1)
    updated_column=numerical_imputer.fit_transform(current_column)
    X_test[i]=updated_column

For imputing, we have used the inbuilt Imputer provided in sklearn (SimpleImputer())

This imputer will replace all the missing values in a particular column with the mean of that column’s values.
More details on SimpleImputer here.

We iterate over all numerical columns (for both the ‘X’ and ‘X_test’ datasets) and convert each column to a NumPy array (current_column).
Our imputing method (numerical_imputer.fit_transform) expects a 2D array, so we use reshape(-1,1) to convert the 1D array into a 2D array.
We store the imputed column in ‘updated_column’ and finally replace the corresponding column in the dataset with the obtained ‘updated_column’.
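As a side note, SimpleImputer can also be applied to several columns at once. The sketch below is an optional alternative to the loops above (it is not part of the original notebook): it fits the imputer on ‘X’ and reuses the column means learned from ‘X’ when filling ‘X_test’.

#Optional alternative sketch: impute all numerical columns at once
#(fit on 'X' so that 'X_test' is filled with the means learned from 'X')
from sklearn.impute import SimpleImputer
numerical_imputer=SimpleImputer(strategy="mean")
X[numerical_columns]=numerical_imputer.fit_transform(X[numerical_columns])
X_test[numerical_columns]=numerical_imputer.transform(X_test[numerical_columns])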

Similarly, we will impute categorical columns

#impute categorical_columns
categorical_imputer=SimpleImputer(strategy="most_frequent")
#for X
for i in categorical_columns:
    current_column=np.array(X[i]).reshape(-1,1)
    updated_column=categorical_imputer.fit_transform(current_column)
    X[i]=updated_column
#for X_test
for i in categorical_columns:
    current_column=np.array(X_test[i]).reshape(-1,1)
    updated_column=categorical_imputer.fit_transform(current_column)
    X_test[i]=updated_column

Here, we replace the missing values in a particular column with the most frequent value of that column.

Now, let’s recheck the number of missing cells!

#optional
#print column labels with number of missing cells in that corresponding column
#for X dataset
missing_columns=X.isnull().sum()
print("X dataset")
print(missing_columns[missing_columns>0])
print()
#for X_test
missing_columns_test=X_test.isnull().sum()
print("For X_test set")
print(missing_columns_test[missing_columns_test>0])
#after imputation, there would be no columns with missing data

You should see output similar to one shown in the image below.

Image showing missing column label with count (after imputing)

Step 7- Encode Categorical columns

A categorical column takes only a fixed number of values. For example, a column ‘pet’ would contain only a fixed set of values, like Cat, Dog, Bird, etc.

We need to preprocess (encode) these types of columns before using them in a Machine Learning model, or else the model will throw an error.

We will use the inbuilt OneHotEncoder provided in sklearn.

OneHotEncoder creates a new boolean column for every unique value in the column to be encoded.

Consider a column named ‘pets’ which contains 5 unique values [‘Cat’,’Dog’,’Bird’,’Fish’,’Rabbit’].
After applying one-hot encoding, we will get 5 new columns.

Image showing OneHot encoding example table
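Here is a minimal sketch of that example, assuming the same (older) sklearn API used in the rest of this notebook (newer versions use sparse_output and get_feature_names_out instead); the ‘pets’ data is hypothetical:

#Toy one-hot encoding example (hypothetical 'pets' data)
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
pets=pd.DataFrame({'pets':['Cat','Dog','Bird','Fish','Rabbit','Dog','Cat']})
encoder=OneHotEncoder(handle_unknown='ignore',sparse=False)
encoded=pd.DataFrame(encoder.fit_transform(pets[['pets']]))
encoded.columns=encoder.get_feature_names(['pets'])
print(encoded)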

One disadvantage of one-hot encoding is that it can greatly increase the size of the dataset.
For the above example (7 rows), we get 35 - 7 = 28 new cells (35 new boolean cells minus the 7 original cells we will drop).
Now consider a dataset of 1000 rows with 50 categorical columns, where the cardinality of each categorical column is 30.
For each categorical column, we will have 1000*30 - 1000 = 29000 new cells.
For all 50 categorical columns, that is 29000*50 = 1,450,000 new cells.

This is the reason why we dropped categorical columns with cardinality greater than or equal to 15 in Step 5.

#Encode categorical columns
#STEPS:
#get one-hot encoded columns for X and X_test (using fit_transform/transform)
#give names to one-hot encoded columns (using get_feature_names)
#drop categorical columns from X and X_test (using drop)
#oh encoding removes index, add back index (using .index)
#add one-hot encoded columns to X and X_test (using pd.concat)
ohencoder=OneHotEncoder(handle_unknown='ignore',sparse=False)
#for X
ohe_X=pd.DataFrame(ohencoder.fit_transform(X[categorical_columns]))
ohe_X.columns=ohencoder.get_feature_names(categorical_columns)
X.drop(categorical_columns,axis=1,inplace=True)
ohe_X.index=X.index
X=pd.concat([X,ohe_X],axis=1)
#for X_test
ohe_X_test=pd.DataFrame(ohencoder.transform(X_test[categorical_columns]))
ohe_X_test.columns=ohencoder.get_feature_names(categorical_columns)
X_test.drop(categorical_columns,axis=1,inplace=True)
ohe_X_test.index=X_test.index
X_test=pd.concat([X_test,ohe_X_test],axis=1)

OneHotEncoder is an inbuilt function available in sklearn.

Sometimes the validation or test set might contain categorical values that are not present in the training data; by default, OneHotEncoder will raise an error when such values are encountered.
To avoid this error, we set the handle_unknown parameter to ‘ignore’. Now, if an unknown category is encountered, the one-hot encoded columns for that value will all be zeros.
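A quick illustration of this behaviour with toy data (again assuming the older sklearn API with the sparse parameter):

#Toy example: an unseen category is encoded as all zeros with handle_unknown='ignore'
import numpy as np
from sklearn.preprocessing import OneHotEncoder
encoder=OneHotEncoder(handle_unknown='ignore',sparse=False)
encoder.fit(np.array(['Cat','Dog','Bird']).reshape(-1,1))
print(encoder.transform(np.array([['Dog'],['Fish']])))
#'Fish' was never seen during fit, so its row is all zeros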

More details on OneHotEncoder here.

‘ohe_X’ and ‘ohe_X_test’ contain our one-hot encoded columns for the ‘X’ and ‘X_test’ datasets, respectively.

By default, the one-hot encoded columns are given numeric names like 0, 1, 2, … To get column names based on the feature values, we use the get_feature_names function (renamed to get_feature_names_out in newer sklearn versions).
One-hot encoding also discards the DataFrame index (the encoder returns a plain array), so we add the original index back.

Next, we drop the original categorical columns from the ‘X’ and ‘X_test’ datasets.
Then we concatenate (using pandas’ inbuilt concat function) the ‘X’ and ‘X_test’ datasets with their one-hot encoded columns (‘ohe_X’ and ‘ohe_X_test’).

NOTE:
We have used fit_transform for ‘oh_X’ and transform for ‘oh_X_test’. You can read about the difference between fit_transform and transform here.
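In short, fit_transform learns something from the data (the categories for OneHotEncoder, the column means for SimpleImputer) and then applies it, while transform only applies what was already learned. A tiny sketch with SimpleImputer and toy arrays:

#fit_transform learns statistics from the training data; transform reuses them
import numpy as np
from sklearn.impute import SimpleImputer
imputer=SimpleImputer(strategy="mean")
train=np.array([[1.0],[3.0],[np.nan]])
test=np.array([[np.nan]])
print(imputer.fit_transform(train))  #NaN replaced by the train mean (2.0)
print(imputer.transform(test))       #also 2.0, reusing the mean learned from train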

Step 8- Split Dataset into train and validation set

Three types of datasets are usually used in different stages of model training: Training set, Validation set, and Test set.

You can read more about these types of datasets here.

We already have our test set (X_test).

We will split the ‘X’ dataset into 2 parts: Training set and Validation Set.

#Split Dataset in train and validation set
X_train,X_valid,y_train,y_valid=train_test_split(X,y,train_size=0.8,test_size=0.2)

(X_train, y_train) are the training set features and target, respectively. (X_valid, y_valid) are the validation set features and target, respectively.

We have used a 0.8 proportion of the dataset for the training set and the remaining 0.2 for the validation set.

More details on the train_test_split function here.
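Note that train_test_split shuffles the data randomly, so every run gives a slightly different split (and a slightly different MAE later on). If you want reproducible results, you can pass the optional random_state parameter, for example:

#Optional: fix the random seed so the split (and your score) is reproducible
X_train,X_valid,y_train,y_valid=train_test_split(X,y,train_size=0.8,test_size=0.2,random_state=0)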

Step 9- Define the model

We will use Random Forests for our Machine Learning model.

A Random Forest constructs multiple decision trees and merges their outputs to produce more accurate and stable predictions.

I recommend reading and understanding decision trees and random forests properly before proceeding further.

We will use the inbuilt RandomForestRegressor function available in sklearn.

#Define the model
model=RandomForestRegressor()
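All hyperparameters are left at their defaults here. If you want explicit control, RandomForestRegressor accepts parameters such as n_estimators (the number of trees) and random_state (for reproducibility); the values below are only illustrative:

#Optional: define the model with explicit hyperparameters
model=RandomForestRegressor(n_estimators=100,random_state=0)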

Step 10- Fit the model

Fitting the model basically means training the model on training data.

We will use the .fit method provided in sklearn to fit our model on training data.

#Fit the model
model.fit(X_train,y_train)

Step 11- Predict and evaluate on the validation set

Let’s test our trained model on the validation set.

#Predict and Evaluate on validation set
preds_valid=model.predict(X_valid)
score_valid=mean_absolute_error(y_valid,preds_valid)
print("MAE: ",score_valid)

.predict() method (available in sklearn) will make predictions on validation data. These predictions are stored in ‘preds_valid’.

We have used mean_absolute_error (from sklearn) to estimate how far our predicted values (preds_valid) are from actual values (y_valid).

We define the error as the difference between actual and predicted value. (error = actual-predicted)

MAE (mean_absolute_error) takes the absolute value of error for every row and then takes the average of these absolute errors.

In the above code, ‘score_valid’ stores our MAE score.
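To make the definition concrete, the same score can be computed by hand with NumPy (it should match mean_absolute_error up to floating-point rounding):

#MAE computed manually: average of the absolute errors
manual_mae=np.mean(np.abs(y_valid-preds_valid))
print("MAE (manual): ",manual_mae)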

Step 12- Generate test prediction

#Generate Test Prediction
preds_test=model.predict(X_test)
submission=pd.DataFrame({'Id':X_test.index,'SalePrice':preds_test})
submission.to_csv('submission.csv',index=False)

Our predictions on the test set (X_test) are stored in the ‘preds_test’.

We then make a DataFrame (with an Id column and a SalePrice column containing our predicted prices) and save it as a .csv file.
This .csv file will be used for the Kaggle contest evaluation.

To submit your predictions in the Kaggle Contest:

  • Click on Save Version (blue button, top right corner) and select the ‘Save & Run All’ option and then click on Save.
  • After saving, click on the number next to the Save Button.
  • A new window will open up, and you should have the Version History panel on the right side. Click on the ellipsis (…) button next to Version 1.
  • Click on the ‘Submit to Competition’ option.

You can view your submission and see your Score (MAE for test set) and position on the leaderboard.

BONUS SECTION

Congratulations on training your first Machine Learning Model!
Want to decrease your MAE and improve your rank on the leaderboard? Here are a few things you can try!

  • Test using different models (such as XGBoost, LightGBM, etc.) and see which model provides better accuracy.
  • Feature Engineering (extracting additional features from existing data) and Feature Selection (selecting the best subset of features to use for model training)
  • Try using various imputation methods for missing data. SimpleImputer can impute missing data with mean, median, mode, and constant. You can also try imputing using KNNImputer.
  • Hyperparameter tuning (refer to the documentation of each model to learn more about its hyperparameters). Sklearn has an inbuilt class (GridSearchCV) that you can use to search for optimal hyperparameters; see the sketch after this list. (You can also use a simple loop to search for optimal hyperparameters.)
  • Ensemble learning (ensembling is a technique of combining multiple machine learning algorithms to get better predictions). You can read more about sklearn’s inbuilt modules for ensemble methods here.
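As a starting point for hyperparameter tuning, here is a rough sketch using GridSearchCV on the Random Forest from this tutorial; the parameter grid and settings below are only illustrative, not recommended values:

#Sketch: grid search over a few illustrative Random Forest hyperparameters
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor
param_grid={'n_estimators':[100,300],'max_depth':[None,10,20]}
grid_search=GridSearchCV(RandomForestRegressor(random_state=0),param_grid,scoring='neg_mean_absolute_error',cv=3)
grid_search.fit(X_train,y_train)
print(grid_search.best_params_)
print(-grid_search.best_score_)  #cross-validated MAE of the best parameter combination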
