Data Preprocessing, Training, and Evaluating a Machine Learning Model in Python

Sep 30, 2020

This blog post shows how to import data, preprocess it, train a model, and evaluate the result using Python. I will be training a linear regression model.


Steps covered: importing data from CSV, encoding categorical values, splitting data, scaling data, training the model, and evaluating its performance.

Libraries used:

  1. NumPy
  2. pandas
  3. scikit-learn

So, let’s begin, folks!


The dataset records how much a number of companies spend in different areas, along with their location and their profit.

Here, we will train a machine learning model to predict profit from R&D spend, administration spend, marketing spend, and the state.

Importing the Dataset and Libraries

# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset
dataset = pd.read_csv('50_Startups.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 4].values

We import the dataset from a CSV file using pd.read_csv('50_Startups.csv').
dataset.iloc[:, :-1].values holds the independent variables. The dataset has 5 columns, and ":-1" selects every column except the last one, profit, which is our target column.

dataset.iloc[:, 4].values holds the dependent (target) column, which is the profit column at index 4.
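To make the slicing concrete, here is a minimal sketch on a tiny stand-in DataFrame. The column names and all values are assumptions made up for illustration; only the column order (four inputs followed by profit) follows the description above.

```python
import pandas as pd

# Hypothetical stand-in for 50_Startups.csv; names and values are made up.
demo = pd.DataFrame({
    "R&D Spend": [1000.0, 2000.0, 3000.0],
    "Administration": [500.0, 600.0, 700.0],
    "Marketing Spend": [300.0, 400.0, 500.0],
    "State": ["New York", "California", "Florida"],
    "Profit": [150.0, 250.0, 350.0],
})

X_demo = demo.iloc[:, :-1].values  # every row, every column except the last
y_demo = demo.iloc[:, 4].values    # every row, only the column at index 4

print(X_demo.shape)  # (3, 4)
print(y_demo.shape)  # (3,)
```

Because profit is the last column, dataset.iloc[:, -1] would select the same target without hard-coding the index 4.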

Encoding Categorical Values

Machine learning models cannot process non-numerical data directly. Before training the model, we need to convert non-numerical values to a numeric representation. Here the state column is non-numeric, so we need to encode it.

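# Encoding categorical data
# Note: OneHotEncoder(categorical_features=...) was removed in scikit-learn 0.22;
# ColumnTransformer applies the encoder to column 3 and passes the rest through.
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer([('state', OneHotEncoder(), [3])],
                       remainder='passthrough', sparse_threshold=0)
# The one-hot columns come first, followed by the untouched numeric columns
X = np.array(ct.fit_transform(X), dtype=float)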

Now the categorical values are represented numerically. Here we use one-hot encoding, which replaces the state column with one binary column per state.
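As a small illustration of what one-hot encoding produces, the sketch below encodes a single column of state names. The state names are assumptions chosen for the example, not values taken from the dataset.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical state values; OneHotEncoder expects a 2D array (rows x columns).
states = np.array([["New York"], ["California"], ["Florida"], ["California"]])

encoder = OneHotEncoder()
encoded = encoder.fit_transform(states).toarray()  # sparse matrix -> dense array

print(encoder.categories_[0])  # categories, sorted alphabetically
print(encoded)                 # one row per sample, one column per category
```

Each row contains exactly one 1, in the column that corresponds to that row's state.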

Splitting into Training and Test Sets

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

It is good practice in ML to divide the data into at least two splits: a training set and a test set. We use the training set to train the model and the test set to evaluate the performance of the trained model.
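The effect of test_size = 0.2 can be checked on synthetic data. The sketch below assumes 50 rows (suggested by the file name, but an assumption all the same) and made-up values.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 50 rows, 2 features; row i is [2*i, 2*i + 1] and its target is i.
X_demo = np.arange(100).reshape(50, 2)
y_demo = np.arange(50)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

print(X_tr.shape, X_te.shape)  # (40, 2) (10, 2)
print(y_tr.shape, y_te.shape)  # (40,) (10,)
```

The rows are shuffled before splitting, but each feature row stays paired with its own target value. Fixing random_state makes the split reproducible.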

Feature Scaling

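# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
sc_y = StandardScaler()
# StandardScaler expects a 2D array, so reshape the 1D target before scaling
y_train = sc_y.fit_transform(y_train.reshape(-1, 1)).ravel()
# Scale y_test with the same scaler so it is comparable with the model's predictions
y_test = sc_y.transform(y_test.reshape(-1, 1)).ravel()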

Scaling puts all features on a comparable range, so that columns with large values do not dominate columns with small ones; it can also help gradient-based training converge faster.
There are 2 popular ways of scaling your data:

  1. Normalization: rescales the values to the range 0 to 1
  2. Standardization: rescales the values so that they have mean = 0 and standard deviation = 1
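The two approaches can be compared side by side. This is a minimal sketch on made-up numbers, using scikit-learn's MinMaxScaler for normalization and StandardScaler for standardization.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Made-up single feature with four samples (scalers expect a 2D array).
values = np.array([[10.0], [20.0], [30.0], [40.0]])

# Normalization: x' = (x - min) / (max - min)
normalized = MinMaxScaler().fit_transform(values)

# Standardization: z = (x - mean) / standard deviation
standardized = StandardScaler().fit_transform(values)

print(normalized.ravel())                        # smallest value -> 0, largest -> 1
print(standardized.mean(), standardized.std())   # approximately 0 and 1
```

Standardized values are not confined to a fixed interval; only their mean and standard deviation are fixed.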

Training the Model

# Fitting Multiple Linear Regression to the Training set
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)

Here, LinearRegression from the sklearn.linear_model module is fit to the training data.
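To see what fitting actually learns, the sketch below trains on synthetic data generated from a known linear rule and then reads back the fitted coefficients. The data and the names X_demo, y_demo and demo_regressor are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data generated from a known rule: y = 2.0*x1 + 0.5*x2 + 10
X_demo = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 3.0], [5.0, 8.0]])
y_demo = 2.0 * X_demo[:, 0] + 0.5 * X_demo[:, 1] + 10.0

demo_regressor = LinearRegression()
demo_regressor.fit(X_demo, y_demo)

print(demo_regressor.coef_)       # one learned weight per feature
print(demo_regressor.intercept_)  # learned constant term
```

Since the synthetic target is exactly linear and noise-free, the fitted weights and intercept match the generating rule up to floating-point error. On real data they are the least-squares estimates.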

Predicting using Trained Model

# Predicting the Test set results
y_pred = regressor.predict(X_test)

Using the trained model, we can predict the profit (y, the dependent variable) from the X_test data.
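Because the target was standardized with sc_y before training, the predictions come out in standardized units rather than in the original profit units. A minimal sketch of converting them back with inverse_transform, using synthetic data and hypothetical names (sc_y_demo, model):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic, exactly linear data (made up for illustration).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(20, 2))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + 100.0

# Standardize the target, as in the feature scaling step.
sc_y_demo = StandardScaler()
y_scaled = sc_y_demo.fit_transform(y_demo.reshape(-1, 1)).ravel()

model = LinearRegression().fit(X_demo, y_scaled)
pred_scaled = model.predict(X_demo)  # predictions in standardized units

# Map the predictions back to the original units of the target.
pred_original = sc_y_demo.inverse_transform(pred_scaled.reshape(-1, 1)).ravel()
print(pred_original[:3])
print(y_demo[:3])
```

In the walkthrough itself, the same idea would be sc_y.inverse_transform(y_pred.reshape(-1, 1)).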

Evaluating Model’s Performance

To evaluate the model’s performance, we use different metrics, which can be imported with from sklearn import metrics.

#Importing metrics library
from sklearn import metrics
#Printing MAE
print(metrics.mean_absolute_error(y_test, y_pred))
#Print MSE
print(metrics.mean_squared_error(y_test, y_pred))
#Print RMSE
print(np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

To evaluate a regression model, we can use the MAE, MSE, RMSE, R-squared, and adjusted R-squared metrics.
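R-squared is available as metrics.r2_score. scikit-learn has no built-in adjusted R-squared, so the sketch below computes it from the usual formula, 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of samples and p the number of features. The numbers and the feature count are made up for illustration.

```python
import numpy as np
from sklearn import metrics

# Made-up true values and predictions.
y_true = np.array([3.0, 5.0, 7.0, 9.0, 11.0, 13.0])
y_hat = np.array([2.5, 5.5, 7.0, 9.5, 10.5, 13.0])

n_samples = len(y_true)
n_features = 2  # hypothetical number of input features used by the model

r2 = metrics.r2_score(y_true, y_hat)
# Adjusted R-squared penalizes R-squared for the number of features used.
adjusted_r2 = 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

print(r2)
print(adjusted_r2)
```

With the variables from this post, the equivalent calls would use y_test, y_pred, and X_test.shape[1] as the feature count.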

