Machine Learning Workflow — Part 1

Dharmaraj
5 min read · Jan 19, 2023


In this blog, we will discuss the workflow of a machine learning project: all the steps required to build a proper machine learning project from scratch. The series has two parts. The first part covers everything up to model preparation, and the second part covers model deployment and monitoring.

In this article, you will learn:

  1. Understand Business Needs
  2. Data Collection
  3. Data Preprocessing
  4. EDA
  5. Feature Selection
  6. Model Building
  7. Hyperparameter Tuning
  8. Model Validation and Evaluation

1. Understand Business Needs

Understanding the business needs will help you scope the technical solution, identify the data sources to collect, decide how to evaluate model performance, and more. Once you have a clear view of the problem, you can choose a model that fits it.

2. Data Collection

A dataset can be collected from different sources such as files, databases, and sensors, and can even be purchased from organizations. Datasets also differ in where they live: some are stored in the cloud, others in Excel files. The most common formats are CSV and XLSX.

3. Data Preprocessing

Data preprocessing is the process of preparing raw data and making it suitable for a machine learning model. Most real-world datasets are incomplete, inconsistent, and noisy. The major steps in data preprocessing are:

  • Data Cleaning
  • Split Numerical and Categorical Features
  • Encoding the Categorical Features

Data Cleaning

Datatype issues: Check that each feature's data type is in the right format.

Duplicate Data: Remove duplicate records from the dataset.

Inconsistent Data: Data inconsistency leads to a number of problems, including loss of information and incorrect results.

Handling Missing Values: Missing data refers to values that are not stored or not present for some variables/features in the dataset. We can handle such features with the following techniques:

  1. Try to infer the missing values from other features/variables.
  2. Impute with the mean, median, or mode.
  3. Predict the missing values with ML algorithms.
  4. If none of the above methods fits your dataset, remove those rows.
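
The imputation options above can be sketched with scikit-learn's SimpleImputer (assuming pandas and scikit-learn are installed; the column names and values are invented for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy dataset with missing values (hypothetical columns, for illustration only)
df = pd.DataFrame({
    "age": [25.0, np.nan, 35.0, 45.0],
    "city": ["Chennai", "Mumbai", np.nan, "Mumbai"],
})

# Numeric feature: fill gaps with the median
df["age"] = SimpleImputer(strategy="median").fit_transform(df[["age"]]).ravel()

# Categorical feature: fill gaps with the most frequent category (mode)
df["city"] = SimpleImputer(strategy="most_frequent").fit_transform(df[["city"]]).ravel()

print(df.isna().sum().sum())  # 0, no missing values remain
```

The median is often preferred over the mean for skewed numeric features; which strategy to use depends on your dataset.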

Noisy Data: Noise in a dataset increases model complexity and learning time, which degrades the performance of learning algorithms. Common smoothing techniques are:

  1. Binning Method
  2. Regression Method
  3. Clustering Method
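
A rough sketch of the binning method, assuming pandas is available (the bin edges and labels are made up for illustration):

```python
import pandas as pd

# Noisy numeric feature smoothed into coarse buckets via binning
ages = pd.Series([5, 17, 23, 35, 48, 61, 72])
bins = pd.cut(ages, bins=[0, 18, 40, 65, 100],
              labels=["child", "young", "middle", "senior"])

print(bins.tolist())
# ['child', 'child', 'young', 'young', 'middle', 'middle', 'senior']
```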

Split Numerical and Categorical Features

We should separate our features into two groups: numeric and categorical. For numeric variables, you can impute missing values with the mean, median, or mode, replace invalid values, remove outliers, study correlations, create bins with the binning technique, and apply standardization or normalization. For categorical variables, you can impute missing values with a new category or the most frequent category, and then apply encodings.
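
In pandas, this split can be done by dtype; a minimal sketch with invented columns:

```python
import pandas as pd

df = pd.DataFrame({
    "age": [25, 32, 47],
    "salary": [50000.0, 64000.0, 120000.0],
    "city": ["Chennai", "Mumbai", "Delhi"],
    "gender": ["M", "F", "M"],
})

# Numeric columns in one frame, everything else (object/category) in another
numeric_df = df.select_dtypes(include="number")
categorical_df = df.select_dtypes(exclude="number")

print(list(numeric_df.columns))      # ['age', 'salary']
print(list(categorical_df.columns))  # ['city', 'gender']
```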

Encoding the Categorical Data

Encoding categorical data is a process of converting categorical data into integer format.

Nominal Encoding:

  1. One-hot encoding
  2. One-hot encoding for multiple categories
  3. Mean encoding

Ordinal Encoding:

  1. Label encoding
  2. Target-guided ordinal encoding
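
A small sketch of one-hot encoding for a nominal feature and label encoding for an ordinal feature, assuming pandas (the values are invented):

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "green"],
                   "size": ["S", "M", "L", "M"]})

# Nominal feature: one-hot encoding (no order implied among categories)
one_hot = pd.get_dummies(df["color"], prefix="color")

# Ordinal feature: label encoding with an explicit order
size_order = {"S": 0, "M": 1, "L": 2}
df["size_encoded"] = df["size"].map(size_order)

print(one_hot.columns.tolist())     # ['color_blue', 'color_green', 'color_red']
print(df["size_encoded"].tolist())  # [0, 1, 2, 1]
```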

4. EDA (Exploratory Data Analysis)

Univariate Analysis

Univariate analysis is a type of data visualization where we visualize only a single variable at a time, using histograms, pie charts, bar charts, etc. Its main purpose is to describe the data.

Bivariate Analysis/Multivariate Analysis

Bivariate analysis measures the relationship (for example, the correlation) between two variables; multivariate analysis is used when we analyze more than two variables. The main purpose of bivariate analysis is to explain the data, while multivariate analysis studies the relationships among several variables.
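
A minimal bivariate example: the Pearson correlation between two invented variables, using pandas:

```python
import pandas as pd

df = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5],
    "exam_score":    [52, 58, 65, 70, 78],
})

# Bivariate analysis: correlation between two variables (Pearson by default)
corr = df["hours_studied"].corr(df["exam_score"])
print(round(corr, 3))  # close to 1: a strong positive relationship
```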

Pivots

A Pivot Table is an interactive way to quickly summarize large amounts of data. You can use a Pivot Table to analyze numerical data in detail and answer unanticipated questions about your data.
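
In pandas, a pivot table takes one call; a sketch with an invented sales table:

```python
import pandas as pd

sales = pd.DataFrame({
    "region":  ["North", "North", "South", "South"],
    "product": ["A", "B", "A", "B"],
    "revenue": [100, 200, 150, 250],
})

# Summarize total revenue by region (rows) and product (columns)
pivot = sales.pivot_table(values="revenue", index="region",
                          columns="product", aggfunc="sum")
print(pivot)
```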

Insights, Reports, Visual Graphs

In this section, summarize everything done in the previous steps and present it as graphs and reports with your findings.

Handling Outliers

An outlier is a data point that deviates significantly from the rest of the data. Outliers can be caused by measurement or execution errors. We can detect them with boxplots, the Z-score, or the IQR method, and then either:

  1. Remove the outlier data
  2. Rescale the outlier data
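
The IQR method mentioned above can be sketched as follows (the data is invented; 1.5 is the conventional multiplier):

```python
import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 11, 95])  # 95 looks like an outlier

# IQR method: flag points beyond 1.5 * IQR from the quartiles
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = s[(s < lower) | (s > upper)]
print(outliers.tolist())  # [95]
```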

Train Test Split

The train-test split technique estimates the performance of a machine learning algorithm by evaluating it on data that was not used to train the model.
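
With scikit-learn this is a one-liner; a sketch on dummy data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # 10 samples, 2 features (dummy data)
y = np.arange(10)

# Hold out 20% of the rows for evaluation; fix the seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
```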

Feature Transformation

Feature transformation applies a mathematical formula to a particular column (feature) and transforms its values in a way that is useful for further analysis. It is a technique that can boost model performance.

  1. Logarithmic transformation (for right-skewed data; positive values only)
  2. Square transformation (for left-skewed data)
  3. Square root transformation (positive values including 0; for right-skewed data; weaker than the log transformation)
  4. Reciprocal transformation (non-zero values only)
  5. Box-Cox transformation (for strongly skewed data; positive values only)
  6. Yeo-Johnson transformation (for strongly skewed data; also handles zero and negative values)
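
A quick sketch of two of these transformations on invented right-skewed data, assuming NumPy and SciPy are available:

```python
import numpy as np
from scipy import stats

right_skewed = np.array([1.0, 2.0, 3.0, 50.0, 200.0])  # invented skewed data

# log1p = log(1 + x): safe for zeros, compresses large values
log_t = np.log1p(right_skewed)

# Box-Cox estimates the best power transform (strictly positive values only)
boxcox_t, lam = stats.boxcox(right_skewed)

# Both transforms reduce the right skew
print(stats.skew(log_t) < stats.skew(right_skewed))  # True
```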

Feature Scaling

Standardization:

Scale features based on the standard normal distribution: mean = 0 and standard deviation = 1.

  • Standard Scaler

Normalization:

Scale down your features to the range 0 to 1.

  • Min Max Scaling
  • Mean Normalization
  • Max Absolute Scaling
  • Robust Scaling
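
Both families are available in scikit-learn; a minimal comparison on a toy column:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])  # toy single feature

# Standardization: zero mean, unit standard deviation
X_std = StandardScaler().fit_transform(X)

# Normalization (min-max): rescale into the [0, 1] range
X_norm = MinMaxScaler().fit_transform(X)

print(round(float(X_std.mean()), 10), round(float(X_std.std()), 10))  # 0.0 1.0
print(float(X_norm.min()), float(X_norm.max()))                       # 0.0 1.0
```

Standardization suits algorithms that assume roughly Gaussian inputs; min-max scaling suits algorithms that are sensitive to absolute ranges.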

5. Feature Selection

Feature Selection Methods

Filter Methods

  1. Correlation
  2. Chi-Square Test (categorical input, categorical target)
  3. Information Gain (categorical input, categorical target)
  4. Fisher’s Score
  5. Missing Values

Embedded Methods

  1. Regularization L1, L2
  2. Random Forest Importance

Wrapper Methods

  1. Forward Feature Selection
  2. Backward Feature Selection
  3. Exhaustive Feature Selection
  4. Recursive Feature Elimination

How to Choose a Feature Selection Method?

  1. Numerical Input, Numerical Output: Pearson’s correlation coefficient (for linear regression feature selection) or Spearman’s rank coefficient (for nonlinear).
  2. Numerical Input, Categorical Output: ANOVA correlation coefficient (for linear) or Kendall’s rank coefficient (nonlinear).
  3. Categorical Input, Numerical Output: ANOVA correlation coefficient (for linear) or Kendall’s rank coefficient (nonlinear).
  4. Categorical Input, Categorical Output: Chi-Squared test (contingency tables) or Mutual Information.
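
As an example of a filter method, scikit-learn's SelectKBest can score features with the chi-squared test (the Iris dataset is used purely for illustration; its features are non-negative, as chi-squared requires):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)  # 4 numeric features, categorical target

# Filter method: keep the 2 features with the highest chi-squared scores
selector = SelectKBest(score_func=chi2, k=2)
X_selected = selector.fit_transform(X, y)

print(X.shape, X_selected.shape)  # (150, 4) (150, 2)
```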

6. Model Building

  • Choose ML Algorithms

7. Hyperparameter Tuning

  1. Manual Search
  2. Grid Search
  3. Randomized Search
  4. Halving Grid Search
  5. Halving Randomized Search
  6. HyperOpt-Sklearn
  7. Bayes Search
  8. Successive Halving
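
Grid search is the most common starting point; a sketch with scikit-learn's GridSearchCV (the model and parameter grid are arbitrary choices for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Try every combination in the grid, keep the best cross-validated score
param_grid = {"n_neighbors": [3, 5, 7], "weights": ["uniform", "distance"]}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)  # e.g. {'n_neighbors': 5, 'weights': 'uniform'}
```

Randomized search scales better when the grid is large; the halving variants cut cost by discarding poor candidates early.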

8. Model Validation and Evaluation

  1. Train Data Validation
  2. Test Data Validation
  3. Evaluation Metrics

Metrics for Regression

  • MAE (Mean Absolute Error)
  • MSE (Mean Squared Error)
  • RMSE (Root Mean Squared Error)
  • RMSLE (Root Mean Squared Logarithmic Error)
  • MAPE (Mean Absolute Percentage Error)
  • WMAPE (Weighted Mean Absolute Percentage Error)
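
Computing the first three by hand on invented numbers, with scikit-learn:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([3.0, 5.0, 2.5, 7.0])  # invented targets
y_pred = np.array([2.5, 5.0, 4.0, 8.0])  # invented predictions

mae = mean_absolute_error(y_true, y_pred)  # mean of |error|
mse = mean_squared_error(y_true, y_pred)   # mean of squared error
rmse = np.sqrt(mse)                        # back on the target's scale

print(mae, mse, round(rmse, 4))  # 0.75 0.875 0.9354
```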

Metrics for Classification

  • Accuracy
  • Precision and Recall
  • F1-Score
  • Confusion Matrix
  • TPR (True Positive Rate)
  • TNR (True Negative Rate)
  • FPR (False Positive Rate)
  • FNR (False Negative Rate)
  • Receiver Operating Characteristic Curve (ROC)
  • Area Under Curve (AUC)
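
A few of these classification metrics computed on invented labels:

```python
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]  # invented ground truth
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]  # invented predictions

print(accuracy_score(y_true, y_pred))    # 0.75
print(f1_score(y_true, y_pred))          # 0.75
print(confusion_matrix(y_true, y_pred))  # rows: actual, columns: predicted
```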

Metrics for Clustering

  • Silhouette Score
  • Rand Index
  • Dunn’s Index
  • Adjusted Rand Index
  • Mutual Information
  • Calinski-Harabasz Index
  • Davies-Bouldin Index

Conclusion

So far, we have discussed the machine learning workflow, which gives an idea of how ML projects are handled. You have seen several techniques at each step; which ones to choose depends on your dataset and your problem, and I will cover how to choose among them in upcoming blogs. Part 2 will discuss how model deployment happens and how to track model performance; an implementation of the above steps will be uploaded soon.


Dharmaraj

I have worked on projects that involved Machine Learning, Deep Learning, Computer Vision, and AWS. https://www.linkedin.com/in/dharmaraj-d-1b707898/