Machine Learning Process — Overview

Shanthababu Pandian
Published in Analytics Vidhya
5 min read · Aug 2, 2020

Let's keep it simple. This post walks through the basic steps: what happens to raw data, and how it is prepared for the Data Science and Machine Learning process before it is exposed to a machine learning algorithm.

A process is always a set of steps performed one after another: you complete the current step to your satisfaction and then move on to the next. The ML process is no exception. Let's have a quick, short discussion of it.

The diagram below should help you understand the steps in a simple way.

Diagram 1. Machine Learning Process

There are 5 steps:
1. Collection of Data from various data sources
2. Data cleaning and Feature Engineering
3. Model building and Selection of ML Algorithm
4. Model Evaluation
5. Model Deployment

The diagram below should help you understand the workflow of the ML process.

Diagram 2. Machine Learning Process workflow
1. Collection of Data from various data sources: Data collection is a key process in the ML space. Based on the business problem, we have to go and gather the data from different sources, which may be internal or external depending on the customer's projection in the market. Some data really does come from social media and public opinion, for example a survey of a product's success or failure based on feedback, and we can use this opportunity for next-level marketing or for recommendations. In other scenarios the data is purely internal, for example a survey conducted among the employee population. ML is always closely related to statistics and probability (I will explain this in my upcoming blogs).
Diagram 3. Data from various data source
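
As a rough illustration of combining internal and external sources, here is a minimal sketch using only the Python standard library. The two CSV strings, the column names, and the `load` helper are hypothetical stand-ins; a real project would more likely read files, databases, or APIs.

```python
import csv
import io

# Hypothetical stand-ins for two sources (an employee survey and customer feedback)
internal_csv = "respondent,score\nemp_01,4\nemp_02,5\n"
external_csv = "respondent,score\ncust_17,2\ncust_42,3\n"

def load(text, source):
    """Parse CSV text and tag each row with the source it came from."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        rows.append({
            "respondent": row["respondent"],
            "score": int(row["score"]),  # CSV values arrive as strings
            "source": source,
        })
    return rows

# One combined data set, with the origin of every record preserved
dataset = load(internal_csv, "internal") + load(external_csv, "external")
for record in dataset:
    print(record)
```

Keeping a `source` field on every record makes it possible to trace a problem back to where the data came from during the cleaning step.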

2. Data cleaning and Feature Engineering: Data cleansing is a key process in the ML space. Before we start looking for a suitable algorithm for the given data set, we have to understand the data set and carry out clean-up activities to purify it and get optimized results. Otherwise we won't get real insight from the data, and the data won't talk much (Yes… data will talk to you if you are willing to listen 😊, or if you torture it ☹).

Diagram 4. Will Data Talk

The data cleaning process gives you a better understanding of the features (also called entities or attributes) and the relationships between them.

During this step we must extract the essential variables and leave behind or remove the non-essential ones, which is both challenging and critical.

Handling missing values and human errors in the given data set improves data quality considerably and, with it, the modelling performance.

Diagram 5. Missing Data
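
One common way to handle missing values is to fill them with the median of the observed values. The sketch below is only an illustration in plain Python: the `ages` list and the `impute_median` helper are hypothetical, and median imputation is just one of several possible strategies (libraries such as pandas offer similar functionality).

```python
from statistics import median

# Hypothetical column of ages where missing entries are recorded as None
ages = [25, None, 31, 40, None, 28]

def impute_median(values):
    """Replace None entries with the median of the values that are present."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]

cleaned = impute_median(ages)
print(cleaned)
```

The median is often preferred over the mean here because it is less sensitive to extreme values.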

In some cases, we have to identify outliers in the given dataset, because they would lead us to unrealistic results. We then either truncate the data above a threshold or transform the data using a log transformation. There are many methods in the ML space to handle these situations.

Diagram 6. outlier in data set
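
The two outlier treatments mentioned above can be sketched in a few lines of plain Python. This assumes that "truncate" means capping values at the threshold (dropping those rows is another reading), and the `incomes` data, the threshold of 100,000, and the helper names are all hypothetical.

```python
import math

# Hypothetical incomes with one extreme value
incomes = [32_000, 45_000, 51_000, 38_000, 2_500_000]

def cap_at(values, threshold):
    """Truncate (cap) every value that lies above the threshold."""
    return [min(v, threshold) for v in values]

def log_transform(values):
    """Compress large values with log(1 + x); assumes non-negative inputs."""
    return [math.log1p(v) for v in values]

capped = cap_at(incomes, 100_000)
logged = log_transform(incomes)

print(capped)
print([round(v, 2) for v in logged])
```

Capping changes only the extreme values, while the log transformation rescales every value but keeps their order.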

Converting categorical columns into numerical ones is a key approach in most ML model development, since most algorithms need numerical features. Numbers are always important!
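
One widely used way to turn a categorical column into numbers is one-hot encoding, where each category becomes its own 0/1 column. The sketch below is a plain-Python illustration; the `colors` column and the `one_hot` helper are hypothetical, and other encodings (such as label encoding) may suit a given model better.

```python
# Hypothetical categorical column
colors = ["red", "green", "blue", "green"]

def one_hot(values):
    """Return the sorted category names and one 0/1 row per input value."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

categories, encoded = one_hot(colors)
print(categories)
for row in encoded:
    print(row)
```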

Feature engineering is an expensive and time-consuming process, but a necessary one to make your data neat and tidy. Most of the time feature engineering is a manual process, but it can be automated, so that model development is expedited and the effort saved can be channelled into other value-adding activities for the project and the customer.

3. Model building and Selection of ML Algorithm: This is a major step, and we are almost at the point of closing the process, so we have to be very careful. Model selection is the process of choosing one of the candidate models as the final model that addresses the given business problem. It depends on the factors below:

a. Input data type
b. Output data type

On top of these, a few other elements affect the choice of model, such as accuracy, complexity, scalability, and interpretability. During this step we must train the model on the TRAIN data to understand the outcomes, test the same set of algorithms on the TEST data, and compare the TRAIN and TEST outcomes (Actuals vs. Predicted).

TRAIN and TEST

Diagram 7. Train and Test Architecture
Diagram 8. Train and Test Split
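
To make the TRAIN and TEST comparison concrete, here is a minimal sketch in plain Python. It assumes an 80/20 split and a simple least-squares line as the model; the data, the helper names (`split_train_test`, `fit_line`, `mean_abs_error`), the seed, and the split ratio are all hypothetical choices for illustration.

```python
import random

def split_train_test(rows, test_ratio=0.2, seed=42):
    """Shuffle a copy of the rows and split it into train and test parts."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]

def fit_line(points):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def mean_abs_error(points, slope, intercept):
    """Average absolute gap between actual and predicted values."""
    return sum(abs(y - (slope * x + intercept)) for x, y in points) / len(points)

# Hypothetical data: y is roughly 2x + 1 with a small deterministic wobble
data = [(x, 2 * x + 1 + (0.5 if x % 2 else -0.5)) for x in range(20)]

train, test = split_train_test(data)
slope, intercept = fit_line(train)            # learn only from TRAIN
train_error = mean_abs_error(train, slope, intercept)
test_error = mean_abs_error(test, slope, intercept)  # check on unseen TEST
print(f"train MAE={train_error:.3f}, test MAE={test_error:.3f}")
```

If the test error is much larger than the train error, the model has probably memorized the training data rather than learned a general pattern.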

(I will get back to you all with a detailed study of TRAIN and TEST, since it is an interesting topic that deserves a fuller explanation.)

4. Model Evaluation: Model evaluation is the process of validating your model with standardized mathematical formulae or methodologies, much like the way we proved that A=B means B=A, or worked through an equation such as 2x+5y=20, in school or college mathematics. Model evaluation is used to assess how well the model fits the data, to compare different models so that the correct one is selected, and to check how accurate the predicted values are expected to be. For example, a Linear Regression model can be evaluated with Root Mean Square Error (RMSE) or R-squared (R²), and a Logistic Regression model with a Confusion Matrix.

Diagram 9. Model Evaluation
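
The metrics named above can be computed by hand, which helps show what they measure. The following is a sketch in plain Python with made-up actual and predicted values; the function names are hypothetical, and the confusion matrix here assumes binary labels 0 and 1.

```python
import math

def rmse(actual, predicted):
    """Root Mean Square Error: square root of the average squared error."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def r_squared(actual, predicted):
    """R-squared: share of the variance in actual explained by the predictions."""
    mean_a = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_a) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

def confusion_counts(actual, predicted):
    """Return (tn, fp, fn, tp) for binary labels 0 and 1."""
    tn = fp = fn = tp = 0
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            tp += 1
        elif a == 0 and p == 0:
            tn += 1
        elif a == 0 and p == 1:
            fp += 1
        else:
            fn += 1
    return tn, fp, fn, tp

# Regression example (hypothetical values)
y_actual = [3.0, 5.0, 7.0, 9.0]
y_predicted = [2.0, 5.0, 8.0, 9.0]
print("RMSE:", round(rmse(y_actual, y_predicted), 4))
print("R2:", round(r_squared(y_actual, y_predicted), 4))

# Classification example (hypothetical labels)
labels_actual = [1, 0, 1, 1, 0, 0]
labels_predicted = [1, 0, 0, 1, 1, 0]
tn, fp, fn, tp = confusion_counts(labels_actual, labels_predicted)
print("TN, FP, FN, TP:", tn, fp, fn, tp)
```

A lower RMSE and an R² closer to 1 indicate a better regression fit, while the confusion matrix shows which kinds of classification mistakes the model makes.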

5. Model Deployment: Machine Learning models are deployed into a production environment like any other application. These days this usually happens in the cloud, and data can be fed from internal or external sources once all prerequisites have been completed successfully. Real-time data keeps feeding into the model, and the model's performance is monitored by generating reports and visualizations with tools, so that customers can understand business performance, where improvements are needed, and which key business decisions they can go ahead with.
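
As a very small sketch of the deploy-and-monitor idea, the example below serializes a model's parameters, loads them back as a serving process might, and flags the model when its error on new labelled data grows too large. Everything here is a hypothetical simplification: the linear model, the JSON artifact, the `needs_attention` helper, and the threshold of 1.0 are assumptions, and real deployments typically rely on dedicated serving and monitoring tools.

```python
import json

# Hypothetical trained parameters for y = slope * x + intercept
model = {"slope": 2.0, "intercept": 1.0}

# "Deploy": serialize the model so that a separate service can load it
artifact = json.dumps(model)
served = json.loads(artifact)

def predict(params, x):
    """Apply the linear model to a single input."""
    return params["slope"] * x + params["intercept"]

def needs_attention(params, incoming, threshold=1.0):
    """Flag the model when mean absolute error on new labelled data exceeds the threshold."""
    errors = [abs(y - predict(params, x)) for x, y in incoming]
    return sum(errors) / len(errors) > threshold

# Hypothetical batches of incoming (input, actual outcome) pairs
steady = [(1, 3.1), (2, 4.8), (3, 7.2)]
drifted = [(1, 6.0), (2, 9.5), (3, 12.0)]

print("steady batch needs attention:", needs_attention(served, steady))
print("drifted batch needs attention:", needs_attention(served, drifted))
```

A check like this could feed the reports and visualizations that show customers where the model, or the business, needs improvement.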

I will get back with more topics on Data Science and ML, with sample code and easy ways to understand the concepts.

WRITTEN BY

Shanthababu Pandian, Data & Analytics Technical Delivery Manager: Data Scientist : Machine Learning Practitioner : Azure Data Engg https://www.linkedin.com/in/shanthababu-pandian-b2a9259/
