Machine learning development life cycle

Akash Wagh
7 min read · Jul 16, 2022


Introduction

What is Machine Learning?
Machine learning (ML) is a branch of artificial intelligence (AI) that lets systems learn from data to predict outcomes more accurately. Machine learning algorithms use historical data as input to predict new output values.
The machine learning life cycle is a cyclical process that data scientists follow. It defines each step that a company or organization must take in order to leverage machine learning and artificial intelligence to achieve practical business value.
In this article, we will explore the different stages of the machine learning development life cycle. So let's get started.

Stage 1: Framing the machine learning problem:

  • Problem framing is how you analyze the problem from a machine learning or deep learning perspective.
  • This is an important step in a machine learning project: it decides what we are going to do in later stages and whether the project provides any business value.
  • It also identifies the metrics that need to be tracked to monitor performance and error.
  • Example: in the Quora question similarity problem, the main task is to identify whether two questions have the same meaning or not.
  • Example: the Amazon, Netflix, or Flipkart recommendation engines that recommend products to the customer.

Stage 2: Gathering data:

  • 1. CSV: Enterprises receive data in CSV format for some use cases. This is the easiest way to get data.
  • 2. Web scraping: Web scraping is the process of automatically collecting structured web data, also known as web data extraction. It is used for price monitoring, price prediction, lead generation, and stock prediction, mostly by people and businesses who want to leverage the huge amount of web-published data to make smarter decisions. Most developers use the BeautifulSoup library for web scraping.
  • 3. API: Some websites or apps expose their own API. This is especially true for software or websites that distribute data, since an API is the best way to transfer data from one piece of software to another. For example, Wikipedia has an API because it is intended to provide information to anyone interested. Developers can extract data from such an endpoint and either use it directly for their task or store it in a file.

Source : https://www.cleveroad.com/blog/what-is-an-api/

  • 4. Database: You can get data directly from a database, e.g., Cassandra or PostgreSQL.
  • 5. Data warehouse: A data warehouse is a central repository of information that can be analyzed to make more informed decisions. Data typically flows into the data warehouse on a regular basis from transactional systems, relational databases, and other sources. Example: Amazon Redshift.
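A minimal sketch of pulling data from three of these sources with pandas, requests, and BeautifulSoup (the file path, URLs, and CSS selector below are hypothetical):

import pandas as pd
import requests
from bs4 import BeautifulSoup

# 1. CSV: read a local file into a DataFrame (path is hypothetical)
df = pd.read_csv("data/sales.csv")

# 2. Web scraping: parse HTML with BeautifulSoup (URL and selector are hypothetical)
html = requests.get("https://example.com/prices", timeout=10).text
soup = BeautifulSoup(html, "html.parser")
prices = [tag.get_text() for tag in soup.select(".price")]

# 3. API: fetch JSON from a REST endpoint (URL is hypothetical)
response = requests.get("https://api.example.com/products", timeout=10)
response.raise_for_status()                     # fail loudly on HTTP errors
api_df = pd.DataFrame(response.json())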

Stage 3: Data preprocessing:

Source : https://www.techtarget.com/searchdatamanagement/definition/data-preprocessing

Remove duplicates: Suppose our dataframe is df

df.duplicated().sum()              # returns the number of duplicate rows
df.drop_duplicates(inplace=True)   # drops the duplicate rows

Remove outliers:

Plot a box plot to find outliers:

import pandas as pd
import numpy as np

# 10 rows of random data in 5 columns
df = pd.DataFrame(np.random.rand(10, 5), columns=['A', 'B', 'C', 'D', 'E'])
df.plot.box(grid=True)   # points beyond the whiskers are potential outliers

Other methods of outlier detection:

  • Scatter plot
  • z-score
  • IQR (interquartile range)
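A rough sketch of the z-score and IQR approaches (the thresholds of 3 and 1.5 below are common conventions, not fixed rules):

import numpy as np
import pandas as pd

df = pd.DataFrame({"A": np.random.randn(100)})

# z-score: keep rows within 3 standard deviations of the mean
z = (df["A"] - df["A"].mean()) / df["A"].std()
df_z = df[z.abs() <= 3]

# IQR: keep rows within 1.5 * IQR of the first and third quartiles
q1, q3 = df["A"].quantile([0.25, 0.75])
iqr = q3 - q1
df_iqr = df[df["A"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]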

Scaling:

from sklearn.preprocessing import MinMaxScaler, StandardScaler

scaler = StandardScaler()
scaled = scaler.fit_transform(data)   # learn mean/std from the data, then scale it

Stage 4: Exploratory data analysis:

Source : https://devopedia.org/exploratory-data-analysis

import numpy as np                # linear algebra
import pandas as pd               # data processing, CSV file I/O (e.g. pd.read_csv)

# Plotting libraries
import matplotlib.pyplot as plt
import seaborn as sns

# Loading data
train_df = pd.read_csv("../input/titanic/train.csv")   # enter the path of your data
test_df = pd.read_csv("../input/titanic/test.csv")
train_df.head()

sns.set_style('darkgrid')
sns.countplot(data=train_df, x='Survived')   # count of passengers per Survived class
plt.show()
  • EDA is primarily used to gain deeper insights from data, such as the mean, median, and standard deviation of the dataset, so we can identify any outliers in the data.
  • By plotting the correlation matrix we can find the input features on which the output feature depends most (see the sketch after this list).
  • EDA was originally developed by the American mathematician John Tukey in the 1970s and is still a widely used method in the data discovery process today.
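Continuing with the train_df loaded above, a correlation heatmap of the numeric Titanic columns could be drawn like this (the numeric_only argument assumes a reasonably recent pandas):

import seaborn as sns
import matplotlib.pyplot as plt

corr = train_df.corr(numeric_only=True)          # correlation matrix of numeric features
sns.heatmap(corr, annot=True, cmap="coolwarm")   # annotate each cell with its value
plt.title("Feature correlations")
plt.show()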

Stage 5: Feature engineering and selection:

Source : https://www.datasciencecentral.com/feature-engineering-at-a-glance/
  • Feature engineering refers to the process of selecting and transforming variables/features in a dataset when building a predictive model using machine learning. Features therefore need to be extracted from the collected raw dataset before training a machine learning algorithm; otherwise, it will be difficult to get the right insights from the data.

Feature engineering has two goals:

  • Preparing an input dataset compatible with the machine learning algorithm's requirements.
  • Improving the performance of machine learning models.
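As a small illustrative sketch on the Titanic data from earlier (the FamilySize feature and the one-hot encoding of Sex are common textbook examples, not steps prescribed by this article):

import pandas as pd   # train_df as loaded in the EDA stage

# derive a new feature from existing columns
train_df["FamilySize"] = train_df["SibSp"] + train_df["Parch"] + 1

# transform a categorical feature into numeric dummy columns
train_df = pd.get_dummies(train_df, columns=["Sex"], drop_first=True)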

Stage 6: Model training, evaluation, and selection:

  • Model selection and evaluation are very important operations in a machine learning workflow. This is the part of the workflow where the model is analyzed: we examine the most insightful statistics about its performance to determine what actions to take to improve it.
  • Model training is a major step in machine learning: it produces a working model that can be validated, tested, and deployed.
  • The performance of the model during training ultimately determines how well it works when it is included in the end-user application. Both the quality of the training data and the choice of algorithm are central to the model training phase. In most cases, the data is divided into two sets, one for training and one for validation and testing.
  • The choice of algorithm is largely determined by the final use case. However, you should always consider additional factors such as model complexity, performance, interpretability, compute requirements, and speed. Balancing these various requirements can make algorithm selection a tedious and complex process.
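A minimal sketch of the split-train-evaluate loop with scikit-learn; X and y are assumed to be the preprocessed features and labels, and logistic regression is just one illustrative choice of algorithm:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# X = input features, y = target labels (assumed to exist from earlier stages)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# evaluate on the held-out set
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))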

Stage 7: Model deployment:

Source : https://www.youtube.com/watch?v=xEG8u62TB2Q

Ways to deploy a machine learning model:

1. FastAPI + Uvicorn

$ pip install fastapi uvicorn
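A minimal FastAPI app that serves a trained model might look like this sketch (the model.pkl file and the two-feature input are hypothetical):

# app.py
import pickle
from fastapi import FastAPI

app = FastAPI()

# load a previously trained and pickled model (file name is hypothetical)
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

@app.get("/predict")
def predict(feature_1: float, feature_2: float):
    # the model is assumed to expect a single two-feature row
    prediction = model.predict([[feature_1, feature_2]])
    return {"prediction": prediction.tolist()}

You can then serve it locally with $ uvicorn app:app --reload.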

2. Heroku

$ sudo snap install --classic heroku

You can install Heroku on Ubuntu directly from the terminal using the above command.

Stage 8: Testing:

Source : https://realpython.com/python-testing/

  • Unit testing: testing the smallest testable unit of an application. It is done during the coding phase by the developers.
  • Regression test: check whether your model breaks, and test for previously encountered bugs.
  • Integration test: check whether the different components of your machine learning pipeline work with each other.
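As a sketch, a unit test for a small preprocessing helper (both the helper and the test below are hypothetical) could be written with pytest:

# test_preprocessing.py (run with: pytest)
import pandas as pd

def drop_null_rows(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical pipeline helper: remove rows containing nulls."""
    return df.dropna()

def test_drop_null_rows_removes_all_nulls():
    df = pd.DataFrame({"A": [1.0, None, 3.0]})
    cleaned = drop_null_rows(df)
    assert cleaned.isnull().sum().sum() == 0   # no nulls remain
    assert len(cleaned) == 2                   # only the null row was dropped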

Stage 9: Optimization:

Optimize the model if its accuracy is not good enough, for example by tuning its hyperparameters.
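One common way to optimize is hyperparameter tuning; here is a sketch with scikit-learn's GridSearchCV (the random forest and the parameter grid are illustrative, and X_train/y_train come from the training stage):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# search a small, illustrative hyperparameter grid with 5-fold cross-validation
param_grid = {"n_estimators": [100, 200], "max_depth": [3, 5, None]}
search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)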

Conclusion

In this article, we learned about the machine learning development life cycle. The key takeaways from the article are:

  • Problem framing is how you analyze the problem from a machine learning or deep learning perspective. It identifies the metrics that need to be tracked to monitor performance and error in a machine learning project.
  • Web scraping is the process of automatically collecting structured web data; it is used for price monitoring, lead generation, and stock prediction. Using APIs, web scraping, CSV files, and databases, we can collect the input data required for machine learning.
  • EDA (exploratory data analysis) gives us deeper insight into the data. Plotting a correlation matrix lets us find the input features on which the output feature depends most.
  • Feature engineering is the process of selecting and transforming variables/features in a dataset when building a predictive model using machine learning.
  • Training is a major step in making a working model that can be validated, tested, and deployed. The performance of the model during training ultimately determines how well it works when it is included in the end-user application. Both the quality of the training data and the choice of algorithm are central to the model training phase.
  • In data preprocessing, we eliminate outliers, which can be identified using box plots, scatter plots, and the z-score approach. We also remove null values, duplicate values, and any other undesirable data from the dataset.
  • We can use Heroku, APIs, and cloud services like AWS, Azure, and GCP to deploy machine learning models.
  • Unit testing is usually done during the development phase. The other two types of testing we saw for machine learning systems are regression testing, which checks whether your model breaks and tests for already known flaws, and integration testing, which ensures that every component of the pipeline functions as it should.

So, this was all about the machine learning development life cycle. I hope you liked the article.
