Decoding the Machine Learning Development Lifecycle (MLDLC)

Paresh Patil · 7 min read · Jun 18, 2023

Table of contents:

· What is MLDLC?
· 1. Framing the Problem
· 2. Gathering Data
· 3. Data Pre-processing
· 4. Exploratory Data Analysis
· 5. Feature Engineering and Selection
· 6. Model Training, Evaluation & Selection
· 7. Model Testing
· 8. Model Deployment

What is MLDLC?

MLDLC is a set of guidelines that you need to follow while creating a machine-learning-based software product. That product could be a recommender system, a loan-prediction model for a bank, or anything similar.

1. Framing the Problem

Before you build anything, you need to settle a few decisions up front.

You are not doing a school or college project; you are working for a company, and that company is serving its clients or customers.

You cannot start, realize midway that your initial assumptions were wrong, and say, "Let's start again." Restarting costs the company money, so it is your responsibility to frame the problem correctly the first time.

In this stage, you decide what the problem is and how to solve it. Who is your customer? How much will it cost? How many people will be on the team? What will the end product look like? Which machine learning model will you use? Where will it be deployed? What framework will be used? What will be the source of data?

Only after all of these things are properly figured out can we proceed to the next step.

2. Gathering Data

Data is essential. While working on a machine learning project, you need data to build the model; without data, it is simply not possible.

For college projects, data is easily available from different websites, most commonly Kaggle, but for companies it is a different scenario. There are several ways for companies to collect data.

i. APIs: Hit an API using Python code and fetch the data in JSON format (see the sketch after this list).

ii. Web Scraping: Sometimes data is not available in a ready-to-download form; it lives on some website, so we need to extract it from there. For example, Trivago uses this method to collect hotel price data from many websites.

iii. Data Warehouse: Data is also stored in operational databases, but that data cannot be used directly because it is live, running data. So data from the database is copied into a data warehouse and analyzed from there.

iv. Clusters: Sometimes the data is so large (big data) that it is stored across clusters and processed with tools like Spark, so the data is fetched through these clusters.
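As a minimal sketch of the API approach from item i, here is a hedged Python example; the URL and parameters are placeholders for whatever real API you are working with:

```python
import requests

# Hypothetical endpoint; swap in the real API you are using
URL = "https://api.example.com/movies"

response = requests.get(URL, params={"page": 1}, timeout=10)
response.raise_for_status()     # fail loudly on HTTP errors

records = response.json()       # parse the JSON payload into Python objects
print(f"Fetched {len(records)} records")
```

In practice you would loop over pages, respect the API's rate limits, and persist the results to disk or a database.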

3. Data Pre-processing

When you take data from external sources, it is bound to be unclean or dirty. You cannot use that data directly.

You cannot pass this data directly to a machine learning model, because the results will not be good. The data can have structural issues: it can have missing values, and it can contain outliers and noise.

So here you need data preprocessing. It involves removing duplicates, handling missing values, removing outliers, and scaling the values (standardization).

The core idea behind data preprocessing is to bring the data into a format that your machine learning model can easily consume.
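A minimal pandas sketch of these preprocessing steps, using toy data in place of a real dataset:

```python
import pandas as pd

# Toy data standing in for a real, messy dataset
df = pd.DataFrame({"price": [100, 102, 98, 101, None, 102, 99, 5000]})

df = df.drop_duplicates()  # remove duplicate rows
df = df.dropna()           # drop rows with missing values (imputing them is another option)

# Remove outliers lying outside 1.5 * IQR around the quartiles
q1, q3 = df["price"].quantile([0.25, 0.75])
iqr = q3 - q1
df = df[df["price"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)].copy()

# Standardize: zero mean, unit variance
df["price"] = (df["price"] - df["price"].mean()) / df["price"].std()
print(df)
```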

4. Exploratory Data Analysis

In this stage, you analyze data, which means you try to study the relationship between input and output variables.

The whole idea is that you are building ML-based software, and before building it you need to know what is in your data. If you don't know this, you cannot build the model properly.

At this stage, you need to perform many experiments with the data to extract the hidden relationships in it. This stage gives insights by visualizing the data: univariate analysis, bivariate analysis, multivariate analysis, outlier detection, and handling imbalanced datasets.

The whole idea behind this stage is to get a concrete picture of the data. The more time we spend on EDA, the more we learn about the data, which helps in decision-making when implementing models.
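A minimal EDA sketch with pandas and seaborn; the file housing.csv and the columns sqft and price are assumptions standing in for your own data:

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.read_csv("housing.csv")  # placeholder dataset

print(df.describe())             # summary statistics per column
print(df.isna().sum())           # missing values per column

sns.histplot(df["price"])        # univariate: distribution of one variable
plt.show()

sns.scatterplot(x=df["sqft"], y=df["price"])  # bivariate: feature vs. target
plt.show()

# Multivariate: correlation between all numeric columns
sns.heatmap(df.select_dtypes("number").corr(), annot=True)
plt.show()
```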

5. Feature Engineering and Selection

Features are the input columns. Features are important because the output depends on the inputs (features).

The idea behind feature engineering is that you sometimes create new columns from existing ones, or make intelligent changes to existing columns, to make analysis easier.

Assume you want to predict house prices and have input columns such as number of rooms, number of bathrooms, locality, and so on.

In this scenario, you remove the number of rooms and the number of bathrooms and replace them with a single column, "Sqft," which represents the same information.

So what is the benefit? You have one column instead of two. This is called feature engineering.
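A minimal pandas sketch of the idea; the column names and the combined feature are illustrative, standing in for the "Sqft" example above:

```python
import pandas as pd

df = pd.DataFrame({
    "rooms": [3, 2, 4],
    "bathrooms": [2, 1, 3],
    "locality": ["A", "B", "A"],
})

# Replace two related columns with a single engineered feature
df["total_rooms"] = df["rooms"] + df["bathrooms"]
df = df.drop(columns=["rooms", "bathrooms"])
print(df)
```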

Feature selection:

Sometimes you have many features, say 100 or 200, but you cannot proceed with all of them, for two reasons:

  1. Not every feature is helpful for building the model; not every input affects the output. During feature selection you remove the features that do not impact the output.
  2. With more columns, the model takes more time to train, so by removing irrelevant columns you save time.

Both feature engineering and feature selection are crucial.
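As a minimal sketch of automated feature selection, here is a hedged example using scikit-learn's SelectKBest; the synthetic dataset and the choice of k are placeholders:

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic data: 100 features, only 5 of which actually drive the target
X, y = make_regression(n_samples=200, n_features=100, n_informative=5, random_state=0)

# Keep the 10 features with the strongest univariate relationship to the target
selector = SelectKBest(score_func=f_regression, k=10)
X_selected = selector.fit_transform(X, y)

print(X.shape, "->", X_selected.shape)  # (200, 100) -> (200, 10)
```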

6. Model Training, Evaluation & Selection

Once you're sure about your data, you are ready to train the model. You try different machine learning algorithms, training each one on your data.

In general, no one trains just one model. To be honest, no single algorithm is good for every type of data, and you never know in advance which algorithm will turn out to be good for your particular data.

For example, Naive Bayes tends to perform well on text data, but some other algorithm might perform even better on your data. So we train different models with various algorithms and then evaluate them with different metrics, such as accuracy or mean squared error.

The best model is selected, and its parameters are tuned to further improve performance.
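A minimal scikit-learn sketch of this train, evaluate, and select loop, using a built-in dataset as a stand-in for your own:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = load_breast_cancer(return_X_y=True)

models = {
    "logistic_regression": LogisticRegression(max_iter=5000),
    "naive_bayes": GaussianNB(),
    "random_forest": RandomForestClassifier(random_state=0),
}

# Score every candidate with 5-fold cross-validated accuracy, then pick the best
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(f"{name}: {scores.mean():.3f}")
```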

7. Model Testing

Once our machine learning model has been trained on a given dataset, we test the model. In this step, we check the accuracy of our model by providing a test dataset.

Testing the model determines the percentage accuracy of the model as per the requirements of the project or problem.
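A minimal sketch of testing on a held-out set, again with a built-in dataset as a stand-in:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# Hold out 20% of the data that the model never sees during training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Accuracy on the unseen test set approximates real-world performance
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```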

8. Model Deployment

In this step, we deploy the model to a real-world system. If the model prepared above produces accurate results at an acceptable speed, as per our requirements, we deploy it to the real system. But before deploying the project, we check whether its performance on the available data meets those requirements.

For deployment, we can use Heroku, Amazon Web Services, Google Cloud Platform, etc. Once deployed, our model is online and serves user requests.
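A minimal deployment sketch using Flask; the file model.pkl, the /predict route, and the feature layout are assumptions, not a prescribed setup:

```python
import pickle

from flask import Flask, jsonify, request

app = Flask(__name__)

# Load the model trained earlier (assumed to be pickled as model.pkl)
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    # Expect a JSON body like {"features": [3, 2, 1200]}
    features = request.get_json()["features"]
    prediction = model.predict([features])
    return jsonify({"prediction": prediction.tolist()})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8000)
```

The same app can then be pushed to Heroku, AWS, or GCP as-is.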

For an individual who is working on a personal project or college project, these are the complete steps.

The next two steps are used by companies.

A. Testing Apps or Software: In this step, the company rolls out alpha/beta versions of the deployed model to a particular group of users or clients to check whether the model works as required. Feedback is collected from these users and acted upon. If the model works correctly, it is rolled out to everyone.

B. Optimize: In this stage, companies use servers to take backups of models and data, handle load balancing (serving requests when many users hit the system at once), and guard against model rot (frequently retraining models as data evolves over time). This step is generally automated.

Thank you for joining me on this exploration of MLDLC!

I hope this blog has provided you with a solid foundation in understanding the MLDLC and its significance. Stay tuned for upcoming articles where we will take a closer look at each stage, equipping you with a comprehensive toolkit for your data analysis and modeling endeavors.

Get ready to dive even deeper into the world of machine learning!

