Data Preprocessing, Training, and Evaluating a Machine Learning Model in Python

Sayed Ahmed
Published in School of ML
Sep 30, 2020

This blog shows how to import data, do some data preprocessing, train a model, and evaluate the result using Python. I will be training a linear regression model.

Topics:

Importing data from CSV, Scaling data, Encoding categorical values, Splitting data, Training Model, Evaluating the performance

Used Libraries:

  1. NumPy
  2. Pandas
  3. scikit-learn (sklearn)

So, let’s begin, folks!

Dataset:

The dataset records how much some companies spend in different areas, where they are located, and their profit.

Here, we will train a machine learning model to predict profit from R&D spend, administration spend, marketing spend, and the state.

Importing Dataset and libraries

# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset
dataset = pd.read_csv('50_Startups.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 4].values

We import the dataset from the CSV file using pd.read_csv('50_Startups.csv').
dataset.iloc[:, :-1].values holds the independent variables. The file has 5 columns, and the slice ":-1" selects every column except the last one, profit, which is our target column.

dataset.iloc[:, 4].values holds the dependent (target) column, profit, which sits at index 4.
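To see what the two slices return, here is a minimal sketch on a made-up three-row table. The column names and values are assumptions for illustration only, not the contents of the real file.

```python
import pandas as pd

# Hypothetical three-row stand-in for 50_Startups.csv (names and values are made up)
toy = pd.DataFrame({
    "R&D Spend": [100.0, 200.0, 300.0],
    "Administration": [50.0, 60.0, 70.0],
    "Marketing Spend": [10.0, 20.0, 30.0],
    "State": ["New York", "California", "Florida"],
    "Profit": [500.0, 600.0, 700.0],
})

X_toy = toy.iloc[:, :-1].values  # every column except the last
y_toy = toy.iloc[:, 4].values    # only the column at index 4 (the last one here)

print(X_toy.shape)  # (3, 4)
print(y_toy.shape)  # (3,)
```

The feature matrix keeps 4 of the 5 columns, and the target is a one-dimensional array with one value per row.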

Encoding Categorical Values

Most machine learning models cannot process non-numerical data directly, so before training we need to convert non-numerical values to a numeric representation. Here the state column is non-numeric, so we need to encode it as numbers.

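# Encoding categorical data
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
# One-hot encode the state column (index 3) and pass the numeric columns through.
# Newer scikit-learn versions no longer accept OneHotEncoder(categorical_features=[3]).
ct = ColumnTransformer([('state', OneHotEncoder(), [3])],
                       remainder='passthrough', sparse_threshold=0)
# The one-hot columns come first in the result, followed by the numeric columns
X = np.array(ct.fit_transform(X), dtype=float)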

Now the categorical values are represented in numerical format. Here we use one-hot encoding, which replaces the state column with one 0/1 column per state.
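As a rough illustration of what one-hot encoding produces, the sketch below encodes a single made-up column of state names. The state values are assumptions chosen for the example.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical state values, one per row
states = np.array([["New York"], ["California"], ["Florida"], ["California"]])

encoder = OneHotEncoder()
encoded = encoder.fit_transform(states).toarray()  # sparse result converted to dense

# The encoder sorts the categories: California, Florida, New York
print(encoder.categories_[0])
print(encoded)
# Row 0 (New York)   -> 0, 0, 1
# Row 1 (California) -> 1, 0, 0
# Row 2 (Florida)    -> 0, 1, 0
# Row 3 (California) -> 1, 0, 0
```

Each row contains exactly one 1, in the column that belongs to its state.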

Splitting into Training and Test Sets

# Splitting the dataset into the Training set and Test set
# train_test_split lives in sklearn.model_selection (sklearn.cross_validation was removed)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

It is good practice in ML to divide the data into at least two splits: a training set and a test set. We use the training set to train the model and the test set to evaluate the performance of the trained model on data it has not seen.
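To check what an 80/20 split looks like, here is a small sketch on made-up data. The 50 rows are an assumption based on the file name, and the arrays are synthetic.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 50 rows, 2 features; row i is [2*i, 2*i + 1] and its target is i
X_demo = np.arange(100).reshape(50, 2)
y_demo = np.arange(50)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

print(X_tr.shape, X_te.shape)  # (40, 2) (10, 2)
print(y_tr.shape, y_te.shape)  # (40,) (10,)
```

The rows are shuffled before splitting, but each feature row stays paired with its own target value.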

Feature Scaling

# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
# Fit the scaler on the training set only, then reuse it on the test set
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
# The target y is left unscaled so predictions and error metrics stay in profit units

It is better to bring the features onto a comparable scale. Scaling keeps features with large magnitudes from dominating the others and helps many models, especially gradient-based ones, train faster.
There are 2 popular ways of scaling your data:

  1. Normalization: rescales the values to the range 0 to 1
  2. Standardization: rescales the values so that they have mean = 0 and standard deviation = 1
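The two approaches can be compared side by side on a made-up feature column. This is only a sketch; MinMaxScaler is used here for normalization and is not part of the original walkthrough.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Made-up feature column with five values
values = np.array([[10.0], [20.0], [30.0], [40.0], [50.0]])

normalized = MinMaxScaler().fit_transform(values)      # (x - min) / (max - min)
standardized = StandardScaler().fit_transform(values)  # (x - mean) / std

print(normalized.ravel())     # values 0, 0.25, 0.5, 0.75, 1
print(standardized.ravel())   # centered on 0
print(standardized.mean(), standardized.std())  # approximately 0 and 1
```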

Training the Model

# Fitting Multiple Linear Regression to the Training set
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)

Here, using LinearRegression from the sklearn.linear_model module, we train the model on the training data.
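After fitting, the learned weights are available on the model. The sketch below uses synthetic, noise-free data generated from a known formula (an assumption for the example), so we can check that the fit recovers it.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data generated from y = 3*x1 + 2*x2 + 5, with no noise
rng = np.random.default_rng(0)
X_syn = rng.random((20, 2))
y_syn = 3 * X_syn[:, 0] + 2 * X_syn[:, 1] + 5

model = LinearRegression().fit(X_syn, y_syn)

print(model.coef_)       # approximately [3. 2.], one weight per feature
print(model.intercept_)  # approximately 5.0
```

On real data the coefficients will not match a neat formula, but coef_ and intercept_ are read the same way.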

Predicting using Trained Model

# Predicting the Test set results
y_pred = regressor.predict(X_test)

Using the trained model, we can predict the profit (y, the dependent variable) from the X_test data.
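One detail worth knowing is that predict expects a 2D array, with one row per sample and one column per feature, even for a single sample. A minimal sketch on made-up data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data that follows y = 2x exactly
X_small = np.array([[1.0], [2.0], [3.0], [4.0]])
y_small = np.array([2.0, 4.0, 6.0, 8.0])
model = LinearRegression().fit(X_small, y_small)

# A single new sample still needs the shape (1, n_features)
new_pred = model.predict(np.array([[5.0]]))
print(new_pred)  # approximately [10.]
```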

Evaluating Model’s Performance

To evaluate the model’s performance we use different metrics, which can be imported with "from sklearn import metrics".

# Importing the metrics module
from sklearn import metrics
# Printing MAE (mean absolute error)
print(metrics.mean_absolute_error(y_test, y_pred))
# Printing MSE (mean squared error)
print(metrics.mean_squared_error(y_test, y_pred))
# Printing RMSE (root mean squared error)
print(np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

For evaluating a regression model we can use the MAE, MSE, RMSE, R-squared, and adjusted R-squared metrics.
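R-squared is available as r2_score in sklearn.metrics. Adjusted R-squared has no dedicated function there as far as I know, so the sketch below computes it by hand from the standard formula. The values and the feature count p are made up for illustration.

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0, 11.0])  # made-up actual values
y_hat = np.array([2.5, 5.5, 7.0, 8.0, 11.5])   # made-up predictions

n = len(y_true)  # number of samples
p = 1            # assumed number of features used by the model

r2 = r2_score(y_true, y_hat)
# Adjusted R-squared penalizes R-squared for the number of features
adjusted_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(r2)           # approximately 0.956
print(adjusted_r2)  # approximately 0.942
```

Adjusted R-squared is always at most R-squared, and the gap grows as more features are added relative to the number of samples.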

For classification we use different metrics. I wrote another blog on evaluating classification models; you can find it in the link below.
