Machine Learning Workflow — Part 1

Dharmaraj
5 min read · Jan 19, 2023


In this blog, we will discuss the workflow of a machine learning project: all the steps required to build a proper machine learning project from scratch. The series has two parts. The first part covers everything up to model preparation, and the second part covers model deployment and monitoring.

In this article, you will learn:

  1. Understand Business Needs
  2. Data Collection
  3. Data Preprocessing
  4. EDA
  5. Feature Selection
  6. Model Building
  7. Hyperparameter Tuning
  8. Model Validation and Evaluation

1. Understand Business Needs

Understanding the business needs will help you scope the technical solution, identify the data sources to collect, decide how to evaluate model performance, and more. Once you have a clear view of the problem, you can choose a model that fits it.

2. Data Collection

A dataset can be collected from different sources such as files, databases, and sensors, and can even be purchased from organizations. Datasets also differ in where they live: some are stored in the cloud, others in Excel files. The most common formats are CSV and XLSX.

3. Data Preprocessing

Data preprocessing is the process of preparing raw data and making it suitable for a machine learning model. Most real-world datasets are incomplete, inconsistent, and noisy. The major steps in data preprocessing are:

  • Data Cleaning
  • Split Numerical and Categorical Features
  • Encoding the Categorical Features

Data Cleaning

Datatype issues: Check that each feature's data type is in the right format.

Duplicate Data: Remove duplicate records from the dataset.

Inconsistent Data: Data inconsistency leads to a number of problems, including loss of information and incorrect results.

Handling Missing Values: Missing data refers to values that are not stored or not present for some variables/features in the dataset. We can handle such features with the following techniques:

  1. Try to infer the missing values from other features/variables.
  2. Impute with the mean, median, or mode.
  3. Predict the missing values with ML algorithms.
  4. If none of the above methods fits your dataset, remove those rows.
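
The imputation options above can be sketched with scikit-learn's SimpleImputer (assuming pandas and scikit-learn are installed; the column names and values are invented for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy dataset with missing values (hypothetical columns, for illustration only)
df = pd.DataFrame({
    "age": [25.0, np.nan, 35.0, 45.0],
    "city": ["Chennai", "Mumbai", np.nan, "Mumbai"],
})

# Numeric feature: fill gaps with the median
df["age"] = SimpleImputer(strategy="median").fit_transform(df[["age"]]).ravel()

# Categorical feature: fill gaps with the most frequent category (mode)
df["city"] = SimpleImputer(strategy="most_frequent").fit_transform(df[["city"]]).ravel()

print(df.isna().sum().sum())  # 0, no missing values remain
```

The median is often preferred over the mean for skewed numeric features; which strategy to use depends on your dataset.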

Noisy Data: Noise in a dataset increases model complexity and learning time, which degrades the performance of learning algorithms. Common smoothing techniques are:

  1. Binning Method
  2. Regression Method
  3. Clustering Method
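
A rough sketch of the binning method, assuming pandas is available (the bin edges and labels are made up for illustration):

```python
import pandas as pd

# Noisy numeric feature smoothed into coarse buckets via binning
ages = pd.Series([5, 17, 23, 35, 48, 61, 72])
bins = pd.cut(ages, bins=[0, 18, 40, 65, 100],
              labels=["child", "young", "middle", "senior"])

print(bins.tolist())
# ['child', 'child', 'young', 'young', 'middle', 'middle', 'senior']
```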

Split Numerical and Categorical Features

We should separate our features into two groups: numeric and categorical. For numeric variables, you can impute missing values with the mean, median, or mode, replace invalid values, remove outliers, study correlations, create bins with the binning technique, and apply standardization or normalization. For categorical variables, you can impute missing values with a new category or the most frequent category, and then apply encodings.
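
In pandas, this split can be done by dtype; a minimal sketch with invented columns:

```python
import pandas as pd

df = pd.DataFrame({
    "age": [25, 32, 47],
    "salary": [50000.0, 64000.0, 120000.0],
    "city": ["Chennai", "Mumbai", "Delhi"],
    "gender": ["M", "F", "M"],
})

# Numeric columns in one frame, everything else (object/category) in another
numeric_df = df.select_dtypes(include="number")
categorical_df = df.select_dtypes(exclude="number")

print(list(numeric_df.columns))      # ['age', 'salary']
print(list(categorical_df.columns))  # ['city', 'gender']
```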

Encoding the Categorical Data

Encoding categorical data is a process of converting categorical data into integer format.

Nominal Encoding:

  1. One-hot encoding
  2. One-hot encoding for multiple categories
  3. Mean encoding

Ordinal Encoding:

  1. Label encoding
  2. Target-guided ordinal encoding
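
A small sketch of one-hot encoding for a nominal feature and label encoding for an ordinal feature, assuming pandas (the values are invented):

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "green"],
                   "size": ["S", "M", "L", "M"]})

# Nominal feature: one-hot encoding (no order implied among categories)
one_hot = pd.get_dummies(df["color"], prefix="color")

# Ordinal feature: label encoding with an explicit order
size_order = {"S": 0, "M": 1, "L": 2}
df["size_encoded"] = df["size"].map(size_order)

print(one_hot.columns.tolist())     # ['color_blue', 'color_green', 'color_red']
print(df["size_encoded"].tolist())  # [0, 1, 2, 1]
```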

4. EDA (Exploratory Data Analysis)

Univariate Analysis

Univariate analysis is a type of data visualization where we visualize only a single variable at a time, using histograms, pie charts, bar charts, etc. Its main purpose is to describe the data.

Bivariate Analysis/Multivariate Analysis

Bivariate analysis measures the relationship (for example, the correlation) between two variables; multivariate analysis is used when we analyze more than two variables. The main purpose of bivariate analysis is to explain the data, while multivariate analysis studies the relationships among several variables.
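
A minimal bivariate example: the Pearson correlation between two invented variables, using pandas:

```python
import pandas as pd

df = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5],
    "exam_score":    [52, 58, 65, 70, 78],
})

# Bivariate analysis: correlation between two variables (Pearson by default)
corr = df["hours_studied"].corr(df["exam_score"])
print(round(corr, 3))  # close to 1: a strong positive relationship
```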

Pivots

A Pivot Table is an interactive way to quickly summarize large amounts of data. You can use a Pivot Table to analyze numerical data in detail and answer unanticipated questions about your data.
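
In pandas, a pivot table takes one call; a sketch with an invented sales table:

```python
import pandas as pd

sales = pd.DataFrame({
    "region":  ["North", "North", "South", "South"],
    "product": ["A", "B", "A", "B"],
    "revenue": [100, 200, 150, 250],
})

# Summarize total revenue by region (rows) and product (columns)
pivot = sales.pivot_table(values="revenue", index="region",
                          columns="product", aggfunc="sum")
print(pivot)
```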

Insights, Reports, Visual Graphs

In this section, summarize everything done in the previous steps and present it as graphs and reports with your findings.

Handling Outliers

An outlier is a data point that deviates significantly from the rest of the data. Outliers can be caused by measurement or execution errors. We can detect them with boxplots, the Z-score, or the IQR method, and then either:

  1. Remove the outlier data
  2. Rescale the outlier data
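
The IQR method mentioned above can be sketched as follows (the data is invented; 1.5 is the conventional multiplier):

```python
import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 11, 95])  # 95 looks like an outlier

# IQR method: flag points beyond 1.5 * IQR from the quartiles
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = s[(s < lower) | (s > upper)]
print(outliers.tolist())  # [95]
```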

Train Test Split

The train-test split technique estimates the performance of a machine learning algorithm by evaluating it on data that was not used to train the model.
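
With scikit-learn this is a one-liner; a sketch on dummy data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # 10 samples, 2 features (dummy data)
y = np.arange(10)

# Hold out 20% of the rows for evaluation; fix the seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
```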

Feature Transformation

Feature transformation applies a mathematical formula to a particular column (feature) and transforms its values in a way that is useful for further analysis. It is a technique that can boost model performance.

  1. Logarithmic transformation (for right-skewed data; positive values only)
  2. Square transformation (for left-skewed data)
  3. Square root transformation (positive values including 0; for right-skewed data; weaker than the log transformation)
  4. Reciprocal transformation (non-zero values only)
  5. Box-Cox transformation (for strongly skewed data; positive values only)
  6. Yeo-Johnson transformation (for strongly skewed data; also handles zero and negative values)
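
A quick sketch of two of these transformations on invented right-skewed data, assuming NumPy and SciPy are available:

```python
import numpy as np
from scipy import stats

right_skewed = np.array([1.0, 2.0, 3.0, 50.0, 200.0])  # invented skewed data

# log1p = log(1 + x): safe for zeros, compresses large values
log_t = np.log1p(right_skewed)

# Box-Cox estimates the best power transform (strictly positive values only)
boxcox_t, lam = stats.boxcox(right_skewed)

# Both transforms reduce the right skew
print(stats.skew(log_t) < stats.skew(right_skewed))  # True
```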

Feature Scaling

Standardization:

Scale features based on the standard normal distribution: mean = 0 and standard deviation = 1.

  • Standard Scaler

Normalization:

Scale down your features to the range 0 to 1.

  • Min Max Scaling
  • Mean Normalization
  • Max Absolute Scaling
  • Robust Scaling
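
Both families are available in scikit-learn; a minimal comparison on a toy column:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])  # toy single feature

# Standardization: zero mean, unit standard deviation
X_std = StandardScaler().fit_transform(X)

# Normalization (min-max): rescale into the [0, 1] range
X_norm = MinMaxScaler().fit_transform(X)

print(round(float(X_std.mean()), 10), round(float(X_std.std()), 10))  # 0.0 1.0
print(float(X_norm.min()), float(X_norm.max()))                       # 0.0 1.0
```

Standardization suits algorithms that assume roughly Gaussian inputs; min-max scaling suits algorithms that are sensitive to absolute ranges.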

5. Feature Selection

Feature Selection Methods

Filter Methods

  1. Correlation
  2. Chi-Square Test (categorical input, categorical target)
  3. Information Gain (categorical input, categorical target)
  4. Fisher’s Score
  5. Missing Values

Embedded Methods

  1. Regularization L1, L2
  2. Random Forest Importance

Wrapper Methods

  1. Forward Feature Selection
  2. Backward Feature Selection
  3. Exhaustive Feature Selection
  4. Recursive Feature Elimination

How to Choose a Feature Selection Method?

  1. Numerical Input, Numerical Output: Pearson’s correlation coefficient (for linear regression feature selection) or Spearman’s rank coefficient (for nonlinear).
  2. Numerical Input, Categorical Output: ANOVA correlation coefficient (for linear) or Kendall’s rank coefficient (nonlinear).
  3. Categorical Input, Numerical Output: ANOVA correlation coefficient (for linear) or Kendall’s rank coefficient (nonlinear).
  4. Categorical Input, Categorical Output: Chi-Squared test (contingency tables) or Mutual Information.
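
As an example of a filter method, scikit-learn's SelectKBest can score features with the chi-squared test (the Iris dataset is used purely for illustration; its features are non-negative, as chi-squared requires):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)  # 4 numeric features, categorical target

# Filter method: keep the 2 features with the highest chi-squared scores
selector = SelectKBest(score_func=chi2, k=2)
X_selected = selector.fit_transform(X, y)

print(X.shape, X_selected.shape)  # (150, 4) (150, 2)
```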

6. Model Building

  • Choose ML Algorithms

7. Hyperparameter Tuning

  1. Manual Search
  2. Grid Search
  3. Randomized Search
  4. Halving Grid Search
  5. Halving Randomized Search
  6. HyperOpt-Sklearn
  7. Bayes Search
  8. Successive Halving
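
Grid search is the most common starting point; a sketch with scikit-learn's GridSearchCV (the model and parameter grid are arbitrary choices for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Try every combination in the grid, keep the best cross-validated score
param_grid = {"n_neighbors": [3, 5, 7], "weights": ["uniform", "distance"]}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)  # e.g. {'n_neighbors': 5, 'weights': 'uniform'}
```

Randomized search scales better when the grid is large; the halving variants cut cost by discarding poor candidates early.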

8. Model Validation and Evaluation

  1. Train Data Validation
  2. Test Data Validation
  3. Evaluation Metrics

Metrics for Regression

  • MAE (Mean Absolute Error)
  • MSE (Mean Squared Error)
  • RMSE (Root Mean Squared Error)
  • RMSLE (Root Mean Squared Logarithmic Error)
  • MAPE (Mean Absolute Percentage Error)
  • WMAPE (Weighted Mean Absolute Percentage Error)
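
Computing the first three by hand on invented numbers, with scikit-learn:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([3.0, 5.0, 2.5, 7.0])  # invented targets
y_pred = np.array([2.5, 5.0, 4.0, 8.0])  # invented predictions

mae = mean_absolute_error(y_true, y_pred)  # mean of |error|
mse = mean_squared_error(y_true, y_pred)   # mean of squared error
rmse = np.sqrt(mse)                        # back on the target's scale

print(mae, mse, round(rmse, 4))  # 0.75 0.875 0.9354
```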

Metrics for Classification

  • Accuracy
  • Precision and Recall
  • F1-Score
  • Confusion Matrix
  • TPR (True Positive Rate)
  • TNR (True Negative Rate)
  • FPR (False Positive Rate)
  • FNR (False Negative Rate)
  • Receiver Operating Characteristic Curve (ROC)
  • Area Under Curve (AUC)
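
A few of these classification metrics computed on invented labels:

```python
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]  # invented ground truth
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]  # invented predictions

print(accuracy_score(y_true, y_pred))    # 0.75
print(f1_score(y_true, y_pred))          # 0.75
print(confusion_matrix(y_true, y_pred))  # rows: actual, columns: predicted
```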

Metrics for Clustering

  • Silhouette Score
  • Rand Index
  • Dunn’s Index
  • Adjusted Rand Index
  • Mutual Information
  • Calinski-Harabasz Index
  • Davies-Bouldin Index

Conclusion

So far, we have discussed the machine learning workflow, which gives an idea of how ML projects are handled. You have seen several techniques at each step; which ones to choose depends on your dataset and your problem, and I will cover how to choose among them in upcoming blogs. Part 2 will discuss how model deployment happens and how to track model performance; an implementation of the above steps will be uploaded soon.


Dharmaraj

I have worked on projects that involved Machine Learning, Deep Learning, Computer Vision, and AWS. https://www.linkedin.com/in/dharmaraj-d-1b707898/