Data Preprocessing, Training, and Evaluating a Machine Learning Model in Python
This blog shows how to import data, preprocess it, train a model, and evaluate the results using Python. I will be training a linear regression model.
Topics:
Importing data from CSV, Scaling data, Encoding categorical values, Splitting data, Training Model, Evaluating the performance
Used Libraries:
- NumPy
- pandas
- scikit-learn (sklearn)
So, let’s begin, folks!
Dataset:
Here, we will train a machine learning model to predict profit from R&D spend, administration spend, marketing spend, and the state.
Importing the Dataset and Libraries
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Importing the dataset
dataset = pd.read_csv('50_Startups.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 4].values
We import the dataset from a CSV file using pd.read_csv('50_Startups.csv').
dataset.iloc[:, :-1].values holds the independent variables. The dataset has 5 columns, and ":-1" selects every column except the last one, profit, which is our target column.
dataset.iloc[:, 4].values holds the dependent (target) variable: the column at index 4, which is profit.
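To make the slicing concrete, here is a minimal sketch on a tiny stand-in table. The column names and all the numbers are made up for illustration; only the column order (four inputs, then profit) follows the description above.

```python
import pandas as pd

# Hypothetical three-row stand-in for 50_Startups.csv (names and values are made up)
toy = pd.DataFrame({
    "R&D Spend": [1000.0, 2000.0, 3000.0],
    "Administration": [500.0, 600.0, 700.0],
    "Marketing Spend": [300.0, 400.0, 500.0],
    "State": ["New York", "California", "Florida"],
    "Profit": [1500.0, 2500.0, 3500.0],
})

X_toy = toy.iloc[:, :-1].values  # every column except the last one
y_toy = toy.iloc[:, 4].values    # only the column at index 4 (profit)

print(X_toy.shape)  # (3, 4)
print(y_toy.shape)  # (3,)
```

Because the state column holds strings, X_toy comes back as an object array, which is why the encoding step is needed before training.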
Encoding Categorical Values
Most machine learning models, including linear regression, cannot process non-numerical data directly. Before training the model, we need to convert non-numerical values to a numeric representation. Here the state column is non-numeric, so we need to convert it.
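# Encoding categorical data
# (OneHotEncoder's categorical_features argument was removed in scikit-learn 0.22,
# so ColumnTransformer is used here to one-hot encode only column 3, the state)
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer([('state', OneHotEncoder(), [3])], remainder='passthrough', sparse_threshold=0)
X = ct.fit_transform(X)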
Now the categorical values are represented in numerical form. Here we use one-hot encoding, which replaces the state column with one 0/1 column per state.
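If you want to see what one-hot encoding produces, the following small sketch encodes a handful of made-up state values on their own. The state names are illustrative and not read from the dataset.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Made-up state values, one per row (not read from the CSV)
states = np.array([["New York"], ["California"], ["Florida"], ["California"]])

encoder = OneHotEncoder()
encoded = encoder.fit_transform(states).toarray()  # sparse matrix -> dense array

print(encoder.categories_[0])  # categories, sorted alphabetically
print(encoded)                 # one row per input, one column per category
```

Each row has a single 1 in the column of its state and 0 everywhere else, so no artificial ordering between states is introduced.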
Splitting into Training and Test Sets
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
It is good practice in ML to divide the data into at least 2 splits: a training set and a test set. We use the training set to train the model and the test set to evaluate the performance of the trained model.
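As a quick check of what test_size = 0.2 does, here is a sketch on placeholder arrays. The 50 rows and 2 features are an assumption chosen for illustration, and the values are just consecutive integers.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 50 rows, 2 features (values are arbitrary)
X_demo = np.arange(100).reshape(50, 2)
y_demo = np.arange(50)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

print(X_tr.shape, X_te.shape)  # (40, 2) (10, 2)
```

Fixing random_state makes the shuffle reproducible, so the same rows land in the test set on every run.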
Feature Scaling
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
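# Optional: scaling y is not required for linear regression, and leaving it unscaled
# keeps predictions and error metrics in the original profit units. If you do scale it,
# StandardScaler needs a 2-D array, and predictions must be inverse-transformed afterwards:
# sc_y = StandardScaler()
# y_train = sc_y.fit_transform(y_train.reshape(-1, 1)).ravel()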
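It is usually better to put all features on a comparable scale. This helps many algorithms, especially gradient-based and distance-based ones, train faster and more reliably, because no feature dominates just because its numbers are larger.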
There are 2 popular ways of scaling your data
- Normalization: rescales the values to the range 0 to 1
- Standardization: rescales the values so that they have mean = 0 and standard deviation = 1
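The difference between the two is easiest to see on a tiny example. This is a minimal sketch with made-up values, using scikit-learn's MinMaxScaler for normalization and StandardScaler for standardization.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# One made-up feature with four values
values = np.array([[10.0], [20.0], [30.0], [40.0]])

normalized = MinMaxScaler().fit_transform(values)      # (x - min) / (max - min)
standardized = StandardScaler().fit_transform(values)  # (x - mean) / std

print(normalized.ravel())                        # smallest value maps to 0, largest to 1
print(standardized.mean(), standardized.std())   # approximately 0 and 1
```

Note that the scaler in this blog is fitted on the training set only and then reused on the test set, so no information from the test data leaks into training.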
Training the Model
# Training a multiple linear regression model on the training set
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
Here, LinearRegression from the sklearn.linear_model module is trained on the training data by calling fit.
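To get a feel for what fit learns, here is a small sketch on synthetic data. The data is generated from a known rule, y = 2*x1 + 3*x2 + 5, which is an assumption made purely for illustration, so we can check that the model recovers it.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic inputs; the target follows y = 2*x1 + 3*x2 + 5 exactly (illustrative rule)
X_small = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
y_small = 2.0 * X_small[:, 0] + 3.0 * X_small[:, 1] + 5.0

model = LinearRegression().fit(X_small, y_small)

print(model.coef_)       # one learned weight per feature, close to [2. 3.]
print(model.intercept_)  # learned constant term, close to 5.0
```

On the startup data, regressor.coef_ and regressor.intercept_ can be inspected the same way, keeping in mind that the weights there refer to the scaled features.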
Predicting Using the Trained Model
# Predicting the Test set results
y_pred = regressor.predict(X_test)
Using the trained model, we can predict profit (y, the dependent variable) from the X_test data.
Evaluating Model’s Performance
To evaluate the model’s performance, we use different metrics, which can be imported with from sklearn import metrics.
# Importing the metrics module
from sklearn import metrics
# Printing MAE
print(metrics.mean_absolute_error(y_test, y_pred))
# Printing MSE
print(metrics.mean_squared_error(y_test, y_pred))
# Printing RMSE
print(np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
To evaluate a regression model, we can use MAE, MSE, RMSE, R-squared, and adjusted R-squared.
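R-squared is available in scikit-learn as r2_score, but adjusted R-squared is not built in. The sketch below computes both on made-up numbers; adjusted_r2 is a hypothetical helper written for this example, using the usual formula 1 - (1 - R^2) * (n - 1) / (n - p - 1), where n is the number of samples and p the number of features.

```python
import numpy as np
from sklearn.metrics import r2_score

def adjusted_r2(y_true, y_hat, n_features):
    """Hypothetical helper: 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    n = len(y_true)
    r2 = r2_score(y_true, y_hat)
    return 1.0 - (1.0 - r2) * (n - 1) / (n - n_features - 1)

# Made-up true values and predictions, only to exercise the formulas
y_true_demo = np.array([3.0, 5.0, 7.0, 9.0, 11.0, 13.0])
y_hat_demo = np.array([2.5, 5.5, 7.0, 9.5, 10.5, 13.0])

r2_demo = r2_score(y_true_demo, y_hat_demo)
adj_demo = adjusted_r2(y_true_demo, y_hat_demo, n_features=2)

print(r2_demo)   # share of variance explained by the predictions
print(adj_demo)  # same, penalized for the number of features
```

For the model in this blog, the equivalent calls would be r2_score(y_test, y_pred) and adjusted_r2(y_test, y_pred, X_test.shape[1]).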
Classification uses different metrics. I wrote another blog on evaluating classification models; you can find it at the link below.