The Machine Learning Process

Published in

Developer Students Club, VJTI

6 min read · Dec 10, 2021

[Image: quote by the German-American computer scientist Sebastian Thrun]

Machine learning is now everywhere. From the world’s top research labs to startups designing new solutions, it sits at the heart of the current technological revolution. Humans excel at creativity, learning, and inference, while machines excel at computation and memory. Machine learning combines the two, which opens up many possibilities for improving human lives.

Here are a few everyday examples where ML is used extensively:

Image Recognition
Traffic Alerts
Video Surveillance
Product Recommendation (Video recommendation on YouTube)
Online Support using Chatbots
Google Translate
Stock Market Signals

Let’s discuss the general pathway for using Machine Learning and Data Science in useful real-world applications. This overview is very broad, and in reality there is a lot of overlap between the stages presented here. We will also try to distinguish the roles involved in the process, such as Data Engineer, Data Analyst, Data Scientist, and Machine Learning Researcher.

It all starts with the real world, where we encounter one of two situations: either we have a problem that we need to solve, or we have a question that we need to answer. In general terms, a problem looks like “How do we fix or change x out in the real world?”, while a question looks like “How does a change in x affect y?”.

As an example, take rainfall prediction. One problem statement could be to “use ML algorithms to predict whether it will rain tomorrow or not” (a binary classification task). Another possible problem statement is: “Using ML algorithms, estimate the amount of rainfall tomorrow” (a regression task).
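The two problem statements above lead to different prediction targets built from the same raw measurements. Here is a minimal sketch of that difference. The field names, the sample values, and the 1.0 mm threshold are all assumptions made up for illustration.

```python
# Hypothetical daily records; field names and values are made up for illustration.
records = [
    {"date": "2021-12-01", "rain_tomorrow_mm": 0.0},
    {"date": "2021-12-02", "rain_tomorrow_mm": 4.2},
    {"date": "2021-12-03", "rain_tomorrow_mm": 0.3},
]

RAIN_THRESHOLD_MM = 1.0  # assumed cut-off for calling a day "rainy"


def classification_target(record):
    """Binary label: 1 if it rains tomorrow (at or above the threshold), else 0."""
    return 1 if record["rain_tomorrow_mm"] >= RAIN_THRESHOLD_MM else 0


def regression_target(record):
    """Continuous target: the amount of rainfall tomorrow, in millimetres."""
    return record["rain_tomorrow_mm"]


class_labels = [classification_target(r) for r in records]
amounts = [regression_target(r) for r in records]
```

The first target suits a classifier ("will it rain?"), the second a regressor ("how much?").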

To answer these questions, we need to either create a data product or perform data analysis. The two clearly overlap, but it helps to separate them in our minds in order to fully understand the ML pathway. A data product can be something like a mobile app, service, website, or dashboard. Data analysis sits a step below that: reports, visualizations, communications, and so on. Any sophisticated data product probably has data analysis within it, which is where the overlap comes from. The main idea is how we get from the real world to a data product or data analysis, and then use it to create change in the real world. So let’s proceed…

Data Collection & Preprocessing

First and foremost, we need actual raw data from the real world. This can come from a wide variety of sources such as physical sensors, surveys, simulations, experiments, data usage, etc.

Then we need to process and store the data. We can store it in CSV files, Excel spreadsheets, cloud storage, SQL databases, etc. Depending on the data format, the source, and the size, some options are going to be better than others. In general, this procedure of gathering raw data and processing it falls under Data Engineering.
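To make the storage options a little more concrete, here is a small sketch that writes the same readings to CSV and to a SQL database using only Python’s standard library. The table name, column names, and values are assumptions for illustration, and both targets live in memory so the snippet runs anywhere.

```python
import csv
import io
import sqlite3

# Hypothetical sensor readings; column names and values are illustrative only.
raw_rows = [
    {"date": "2021-12-01", "humidity": 82.0, "rainfall_mm": 0.0},
    {"date": "2021-12-02", "humidity": 91.5, "rainfall_mm": 4.2},
]
fieldnames = ["date", "humidity", "rainfall_mm"]

# Option 1: CSV. We write to an in-memory buffer here; a real pipeline
# would open a file on disk or in cloud storage instead.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fieldnames)
writer.writeheader()
writer.writerows(raw_rows)
csv_text = buffer.getvalue()

# Option 2: SQL database. An in-memory SQLite database stands in for a real one.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE weather (date TEXT PRIMARY KEY, humidity REAL, rainfall_mm REAL)"
)
conn.executemany(
    "INSERT INTO weather (date, humidity, rainfall_mm) "
    "VALUES (:date, :humidity, :rainfall_mm)",
    raw_rows,
)
conn.commit()

# A database lets us filter at query time instead of loading everything.
rainy_days = conn.execute(
    "SELECT date FROM weather WHERE rainfall_mm > 0 ORDER BY date"
).fetchall()
```

CSV is easy to share and inspect, while a database tends to pay off once the data grows or needs to be queried selectively.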

Analyzing The Data

[Figure: how the data reaches the EDA stage]

After we have collected and stored the data, we need to clean and organize it. Many data analysis methods and ML models require data to be in a specific format. Cleaning and organizing the data means dealing with potentially missing values and restructuring the data so that features and labels are laid out correctly for the machine learning model.
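As a rough sketch of what "organizing features and labels" can look like, the snippet below splits records into a feature matrix X and a label list y. The column names and values are hypothetical, chosen to match the rainfall example.

```python
# Hypothetical cleaned records; column names and values are illustrative only.
records = [
    {"humidity": 82.0, "pressure": 1012.0, "rain_tomorrow": "No"},
    {"humidity": 91.5, "pressure": 1004.5, "rain_tomorrow": "Yes"},
    {"humidity": 67.0, "pressure": 1018.2, "rain_tomorrow": "No"},
]

FEATURE_COLUMNS = ["humidity", "pressure"]  # model inputs
LABEL_COLUMN = "rain_tomorrow"              # what we want to predict


def to_features_and_labels(rows):
    """Return (X, y): one numeric feature list per row and one 0/1 label per row."""
    X = [[row[col] for col in FEATURE_COLUMNS] for row in rows]
    y = [1 if row[LABEL_COLUMN] == "Yes" else 0 for row in rows]
    return X, y


X, y = to_features_and_labels(records)
```

Most ML libraries expect roughly this shape: one row of numbers per example, and a separate sequence holding the answers.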

Once the data is cleaned and organized, the next step is to perform exploratory data analysis (EDA). This can mean statistical analysis to answer a question, or visualizations to explore populations. In general, this is called Data Analysis, and the roles associated with it are Data Scientist and Data Analyst.

Top 10 data analysis tools used in 2021: “Top 10 Data Analytics Tools in 2021 | Data Analytics Tools | Edureka” (www.edureka.co)

Example (continuing the rainfall prediction example):

We would need to check whether there is any missing data.
If some data is missing, we can deal with it, generally by either deleting the entire row that contains the missing value or filling in a substitute value (such as 0 or the column’s mean).
Then we draw graphs (charts) to check the relationships between the various features.
This stage can include further steps, such as dropping one feature from each pair of highly correlated features.
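The steps above can be sketched in plain Python as follows. This is only an illustration under assumptions: the column names, the values, and the 0.9 correlation threshold are made up, and in practice a library such as pandas would usually do this work.

```python
from statistics import mean

# Hypothetical weather rows; None marks a missing value.
rows = [
    {"humidity": 80.0, "dew_point": 20.0, "pressure": 1010.0},
    {"humidity": None, "dew_point": 22.0, "pressure": 1005.0},
    {"humidity": 90.0, "dew_point": 25.0, "pressure": 1012.0},
    {"humidity": 70.0, "dew_point": 15.0, "pressure": 1008.0},
]


def drop_incomplete(rows):
    """Strategy 1: delete every row that contains a missing value."""
    return [r for r in rows if all(v is not None for v in r.values())]


def impute_mean(rows, column):
    """Strategy 2: replace missing values in `column` with the column mean."""
    fill = mean(r[column] for r in rows if r[column] is not None)
    return [{**r, column: fill if r[column] is None else r[column]} for r in rows]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def highly_correlated_pairs(rows, columns, threshold=0.9):
    """List column pairs whose absolute correlation exceeds the threshold."""
    pairs = []
    for i, a in enumerate(columns):
        for b in columns[i + 1:]:
            r = pearson([row[a] for row in rows], [row[b] for row in rows])
            if abs(r) > threshold:
                pairs.append((a, b))
    return pairs


dropped = drop_incomplete(rows)
imputed = impute_mean(rows, "humidity")
redundant = highly_correlated_pairs(imputed, ["humidity", "dew_point", "pressure"])
```

Dropping rows is simple but throws data away, while imputing keeps every row at the cost of inventing a value. For each pair reported in `redundant`, we would typically keep only one of the two features.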

After this step, we have to stop and think. If we were just trying to answer a question, this may be all we need to do. In general, exploratory data analysis on the data we collected is probably enough to answer a question: we can produce visualizations or run statistical tests and communicate the results in order to make decisions that affect the real world.

But what if we need to continue and, instead of only performing data analysis, we actually want to create a data product? In this case, once the data analysis is done, we move on to the Machine Learning section.

The ML Part

We now take our cleaned data and, having explored it to understand its trends, perform either Supervised Learning (to predict a future outcome) or Unsupervised Learning (to discover hidden patterns and gain insights into the data). The difference is that supervised learning predicts outcomes based on historical labeled data, whereas unsupervised learning infers patterns from unlabeled input data.
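Here is a toy contrast between the two, written from scratch. The humidity values and labels are made up, and the two techniques shown (a 1-nearest-neighbour classifier and a two-cluster k-means) are simply small stand-ins we picked for illustration. The key point is that only the supervised function ever sees the labels.

```python
# Made-up humidity readings (percent) and whether it rained the next day.
humidity = [35.0, 40.0, 45.0, 85.0, 90.0, 95.0]
rained = [0, 0, 0, 1, 1, 1]  # labels: used by the supervised model only


def nearest_neighbour_predict(train_x, train_y, query):
    """Supervised: copy the label of the closest labeled training example."""
    idx = min(range(len(train_x)), key=lambda i: abs(train_x[i] - query))
    return train_y[idx]


def assign(values, centres):
    """Put each value into the group of its nearest centre."""
    groups = [[], []]
    for v in values:
        nearest = 0 if abs(v - centres[0]) <= abs(v - centres[1]) else 1
        groups[nearest].append(v)
    return groups


def two_means(values, iterations=10):
    """Unsupervised: split unlabeled values into two clusters (k-means, k=2)."""
    centres = [min(values), max(values)]  # simple deterministic starting point
    for _ in range(iterations):
        groups = assign(values, centres)
        # Move each centre to the mean of its group (keep it if the group is empty).
        centres = [sum(g) / len(g) if g else c for g, c in zip(groups, centres)]
    return centres, assign(values, centres)


prediction = nearest_neighbour_predict(humidity, rained, 88.0)
centres, clusters = two_means(humidity)
```

The clustering finds a "low humidity" and a "high humidity" group on its own, but it cannot tell us which group means rain. That interpretation needs labels or a human.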

Example: For our rainfall prediction model (taking the binary classification approach), we can experiment with different classification models, including LogisticRegression, XGBClassifier, and CatBoostClassifier, or even design some neural networks!
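To show what such an experiment involves, here is a from-scratch logistic regression trained with gradient descent and checked on held-out data. It is a teaching sketch, not the library classes named above: the features, the values, the learning rate, and the number of epochs are all assumptions, and in a real project we would reach for a library implementation.

```python
import math

# Made-up, pre-scaled features: [humidity fraction, cloud-cover fraction].
# Label 1 means it rained the next day.
train_X = [[0.30, 0.10], [0.40, 0.20], [0.35, 0.30],
           [0.85, 0.80], [0.90, 0.70], [0.95, 0.90]]
train_y = [0, 0, 0, 1, 1, 1]
test_X = [[0.25, 0.15], [0.88, 0.85]]  # held out: never used for training
test_y = [0, 1]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, bias, x):
    """Estimated probability of rain for one feature vector."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def fit(X, y, learning_rate=0.5, epochs=2000):
    """Batch gradient descent on the average log loss."""
    weights = [0.0] * len(X[0])
    bias = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, target in zip(X, y):
            error = predict_proba(weights, bias, x) - target
            for j, xj in enumerate(x):
                grad_w[j] += error * xj
            grad_b += error
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias


weights, bias = fit(train_X, train_y)
predictions = [1 if predict_proba(weights, bias, x) >= 0.5 else 0 for x in test_X]
accuracy = sum(p == t for p, t in zip(predictions, test_y)) / len(test_y)
```

Comparing several models then comes down to training each one on the same training data and comparing a metric such as `accuracy` on the same held-out data.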

In general, we can describe the steps of exploring data and creating machine learning models as the Machine Learning part of this pathway. Someone who does both typically has the role of Data Scientist. Someone who is heavily focused on machine learning models and optimizing them can be labeled a Machine Learning Engineer or Machine Learning Researcher.

Once we have created successful machine learning models, we build a data product such as a service, application, or dashboard, or even a combination of these. This data product can then predict future outcomes or surface insights in the data that we could not see without machine learning, and we use that to create some change in the real world.

Wrapping Up

This is our general Machine Learning Process. We collect and store the data, clean and organize it, and perform exploratory data analysis. Then, if that doesn’t answer our question yet, we create a machine learning model and a data product that goes back to the real world. And the cycle starts over!

Collaboration with TANISH SAWANT