Machine Learning Project Workflow

Vineet Maheshwari
Dec 19, 2018 · 5 min read

Workflow can mean different things to different people, but in the case of ML it refers to the series of steps an ML project goes through. Completing the project successfully and on time means passing through each and every stage of that workflow.

We will follow the general Machine Learning workflow steps:

  1. Gathering the data.
  2. Preparation of Data.
  3. EDA (Exploratory Data Analysis).
  4. Feature Engineering and Selection.
  5. Choosing the best model.
  6. Training our model.
  7. Evaluating the model.
  8. Performing hyperparameter tuning on the model.
  9. Interpreting the model results.

Now there is a question: how do we start?

Problem Definition:

The very first step, before we go deep into the coding and the workflow, is to get a basic understanding of our problem: what the requirements are and what the possible solutions might be.

Here we will be working on a predefined data set known as the iris data set. This is a supervised learning problem, since we have access to both the features and the target, and we need to train a model that can map between the two. Our model must be accurate and interpretable.

GATHERING DATA

We now move on to our first step, i.e. gathering data. The quality of our model depends on the quantity and quality of the data collected, which makes this the most important step.

DATA PREPARATION

Here we load the data and prepare it for use in a machine learning model. Ideal data is perfectly cleaned and formatted, and data cleaning is a necessary part of most data science problems. Data pre-processing is one part of data preparation. First we load the data as a Pandas DataFrame:

Pandas is the Python library we use to load our data, and Matplotlib is used to plot graphs of the results.

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# Load the iris dataset directly from a CSV file hosted on GitHub
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/iris.csv"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = pd.read_csv(url, names=names)

# Shape of the dataset: this prints (150, 5), i.e. 150 rows and 5 columns
print(dataset.shape)

# First 5 rows of the dataset
print(dataset.head(5))
source: towardsdatascience.com

Now we move on to the next step, i.e. EDA. It is an open-ended process in which we compute statistics and draw figures to find trends or relationships in the data.

So basically, EDA helps us learn more about our data and what we can get out of it. These findings can tell us which features to use in our model, and they may even improve the way we select features.
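
As a minimal sketch of what EDA can look like here (reusing the dataset DataFrame loaded above, with summary statistics, class counts and histograms chosen purely as examples):

# A minimal EDA sketch, reusing the `dataset` DataFrame loaded earlier
print(dataset.describe())               # summary statistics for each numeric feature
print(dataset.groupby('class').size())  # number of samples per class
dataset.hist()                          # histogram of each numeric column
plt.show()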

Feature Engineering and Selection: This is where the time invested in a machine learning problem pays off. It is the process of taking raw data and creating or choosing the most relevant features. Feature selection lets us remove features the model does not need, which helps us build a better and more interpretable model.

A machine learning model learns from the data we provide, so the data must contain every relevant piece of information for the output predicted by the model to be accurate.

Using a one-hot encoder is one of the common steps of feature engineering. This process helps us represent categorical variables numerically so they can be included in our data.
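
The iris data set does not strictly need this (its only categorical column is the target), but as an illustrative sketch, pandas can one-hot encode a categorical column like this:

# Illustrative only: one-hot encode the 'class' column with pandas
encoded = pd.get_dummies(dataset, columns=['class'])
print(encoded.head())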

from sklearn import model_selection

# Split-out validation dataset
array = dataset.values
X = array[:, 0:4]   # the four measurement columns are the features
Y = array[:, 4]     # the 'class' column is the target
validation_size = 0.20
seed = 7
X_train, X_validation, Y_train, Y_validation = model_selection.train_test_split(
    X, Y, test_size=validation_size, random_state=seed)

The above code splits the data set into two parts: 80% of it will be used to train our model, and the other 20% will be held back as the validation data set.

[Image: distribution of the three iris classes; source: pybloggers.com]

The image above is here for better understanding: it tells us that there are 3 different classes in the data set, namely Setosa, Versicolor and Virginica, and that there are 150 values in total, 50 of each.

A graph like this can be produced with the help of the Python library Matplotlib.
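
For example, one way to draw such a class-count plot (a sketch, assuming the dataset DataFrame loaded above) is:

# Sketch: bar chart of how many samples belong to each class
dataset['class'].value_counts().plot(kind='bar')
plt.title('Samples per class')
plt.ylabel('count')
plt.show()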

Choosing the model: Moving on to the next step, we have to choose the best-suited model.

Models are compared on the basis of the accuracy scores they generate. We can move forward with the KNN model, which generally gives good results on this data set.

from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Make predictions on the validation dataset
knn = KNeighborsClassifier()
knn.fit(X_train, Y_train)
predictions = knn.predict(X_validation)
print(accuracy_score(Y_validation, predictions))
# Output generated is 0.9
# This means that the accuracy of the model is 90%.

One way to choose the best model is to train every candidate model and keep the one that shows the best results (obviously a time-consuming process, but quite interesting once we get familiar with it). This step also includes fitting our data to each model and then testing it to predict and obtain an accuracy score; a sketch of such a comparison is shown below.
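
As a sketch (the particular set of candidate models below is an assumption, not fixed by the article), we can compare a few classifiers with cross-validation on the training split:

# Sketch: compare several candidate models with 10-fold cross-validation
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

models = [('LR', LogisticRegression(max_iter=200)),
          ('KNN', KNeighborsClassifier()),
          ('CART', DecisionTreeClassifier()),
          ('SVM', SVC())]
for name, model in models:
    kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)
    scores = model_selection.cross_val_score(model, X_train, Y_train,
                                             cv=kfold, scoring='accuracy')
    print('%s: %.3f (%.3f)' % (name, scores.mean(), scores.std()))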

Parameter Tuning: Once the evaluation is over, we can look for better results by tuning the parameters. There are several parameters whose values, when changed, will visibly change the results and, most importantly, the accuracy score. These parameters are known as hyperparameters, and their useful values depend entirely on the model we are working with. Since there are many options to consider at this phase of the project, we need to choose the best combination. These values affect both the accuracy of the trained model and the duration of training.
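
A sketch of hyperparameter tuning for KNN using scikit-learn's GridSearchCV (the parameter grid below is just an illustrative assumption):

# Sketch: grid search over KNN hyperparameters (grid values are illustrative)
from sklearn.model_selection import GridSearchCV

param_grid = {'n_neighbors': [3, 5, 7, 9, 11],
              'weights': ['uniform', 'distance']}
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring='accuracy')
grid.fit(X_train, Y_train)
print(grid.best_params_, grid.best_score_)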

Interpretation of results: Now it is all up to us what we want to interpret from the outcomes. How do we want to use the trained model? Which values do we want to predict?

These are the questions, and only we can answer them for ourselves.
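
As a starting point (a sketch using the predictions computed above), scikit-learn's built-in reports can show where the model goes wrong:

# Sketch: a closer look at the validation results
from sklearn.metrics import classification_report, confusion_matrix

print(confusion_matrix(Y_validation, predictions))
print(classification_report(Y_validation, predictions))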

We learnt about the workflow of Machine Learning and went deeper into its various steps for a better understanding. We developed a model and used a very basic data set, the Iris data set.

In a similar way, this workflow can be applied to different data sets and made to work the way we want it to.

SO HAPPY CODING !!
