Machine Learning Guide for Everyone: Workflow of Machine Learning Model

Vaishnavi Ajmera
Published in VLearn Together
6 min read · Jun 22, 2020

How does something work? What are the different stages of developing something?

These are some of the crucial questions we need to answer before we start exploring and learning about any subject.

In this article, we will be learning about Machine Learning, by discussing the basic methodology of building a Machine Learning model from scratch.

We can define the Machine Learning Model Workflow in the following stages-

1. Defining the Problem Statement

2. Gathering Data

3. Preprocessing of Data

4. Finding out the suitable model

5. Choosing the Measure of Success

6. Training the model

7. Testing & Evaluation of the Model

Before going into the details of the stages, let us first answer a basic question: what is a Machine Learning Model?

A Machine Learning Model is a program that has been trained on data, usually by an engineer or a data scientist, to recognise certain kinds of patterns. We train the model on a set of data using an appropriate algorithm, which it uses to reason over and learn from that data.

So, if you train your model to recognise cats, it can go on recognising cats in new images without being trained again.

Now, let’s discuss the different stages of the Machine Learning Model Workflow.

1. Defining the Problem Statement

The first and most crucial step in any project is defining the problem statement. We may use the best algorithm available for the model, but if we don’t know what problem we are solving, the results are meaningless.

The following questions must be answered or considered while defining the problem-

  • What is the main problem?
  • What are we trying to predict?
  • What are the assumptions about the problem?
  • What are the benefits of solving the problem?
  • What is the input data? Is it available or not?
  • What is the expected outcome?
  • How is the problem going to be solved?
  • What is the status of the target feature?
  • How is the model going to be measured?

It is important to keep in mind that a machine learning model predicts on the basis of the patterns present in its training data. We are therefore assuming that the future will behave like the past, which is not always true.

2. Gathering Data

Gathering Data is the first real step in the development of a machine learning model. This step plays a major role, because the quality and quantity of the data determine how well the model can perform.

For collecting data we should know the answers to these questions-

  • Where is the data?
  • How much data is needed?
  • How can data be collected?
  • What are the different sources of data?

There are various techniques for data collection, and the right one depends on the answers to the above questions.

3. Preprocessing of Data

Before training the model, we have to preprocess the data into a form that is suitable to be fed into a machine learning model.

Data Preprocessing is the process of cleaning and transforming the raw data to extract the desired features. Raw data is collected from the real world, often from several different sources, so it is rarely in a state that can be analysed directly. Therefore, we have to convert it into a usable form. Good preprocessing is needed to get good results from the model.

Real-world data can suffer from different kinds of problems, such as missing values, noise and inconsistencies.

Data Preprocessing includes the following steps-

  • Dealing with missing data
  • Handling categorical data
  • Feature engineering
  • Selection of meaningful features
  • Outlier detection
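Three of the steps above can be sketched in plain Python. This is only an illustration under stated assumptions: the records, the column names (`age`, `city`, `income`) and the helper functions are hypothetical, mean imputation and the 1.5 × IQR rule are just one common choice among many, and a real project would normally use a data library for this work.

```python
from statistics import mean, quantiles

# Hypothetical raw records; None marks a missing value.
rows = [
    {"age": 25, "city": "Pune", "income": 30000},
    {"age": None, "city": "Delhi", "income": 32000},
    {"age": 31, "city": "Pune", "income": 31000},
    {"age": 28, "city": "Mumbai", "income": 500000},
    {"age": 35, "city": "Delhi", "income": 29000},
]


def impute_mean(rows, key):
    """Dealing with missing data: replace None with the mean of the known values."""
    fill = mean(r[key] for r in rows if r[key] is not None)
    return [{**r, key: fill if r[key] is None else r[key]} for r in rows]


def one_hot(rows, key):
    """Handling categorical data: replace one text column with 0/1 columns."""
    categories = sorted({r[key] for r in rows})
    encoded = []
    for r in rows:
        new_row = {k: v for k, v in r.items() if k != key}
        for category in categories:
            new_row[f"{key}_{category}"] = 1 if r[key] == category else 0
        encoded.append(new_row)
    return encoded


def iqr_outliers(values, k=1.5):
    """Outlier detection: flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v < low or v > high for v in values]


cleaned = impute_mean(rows, "age")
encoded = one_hot(cleaned, "city")
outlier_flags = iqr_outliers([r["income"] for r in rows])
```

Here the missing age is filled with the mean of the four known ages, the `city` column becomes three 0/1 columns, and only the 500000 income is flagged as an outlier.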

4. Finding out the suitable model

Our main goal is to train a suitable model which can perform in the best possible way using the preprocessed data. So, we have to decide which type of machine learning, and which algorithm, suits our problem statement. We should choose from the various supervised and unsupervised learning algorithms available.

In order to do this, we should have a basic knowledge of different algorithms. To get a quick overview of different machine learning algorithms you can refer to the article- Machine Learning Guide for Everyone: Introduction.

5. Choosing the Measure of Success

A saying often attributed to the management thinker Peter Drucker goes- “If you can’t measure it, you can’t improve it.”

It means that if we want to improve or control something, it is essential to define the measure of success- Accuracy? Precision? Customer Satisfaction?

In machine learning, there are different measures available for different algorithms.

1. Mean Absolute Error- Mean Absolute Error is the measure which is used mainly in regression models. It is defined as the average of the absolute differences between the actual values and the predicted values.

2. Mean Squared Error- Mean Squared Error is similar to Mean Absolute Error and is also used in regression algorithms. The only difference is that it averages the squares of the differences between the actual values and the predicted values, so large errors are penalised more heavily.

3. Accuracy- Accuracy is the measure which is mainly used in classification models. It is defined as the number of correct predictions divided by the total number of samples. It is a good measure when there is a roughly equal number of samples from each class; on imbalanced data it can be misleading.

4. Logarithmic Loss- Logarithmic Loss measures the performance of a classification model (including a multiclass one) whose output is in the form of probabilities. It penalises wrong classifications, and the penalty grows the more confident the wrong prediction is.

5. Confusion Matrix- A Confusion Matrix is a table that summarises the performance of a classification model by comparing the predicted outputs with the actual outputs. For a binary classification model, there are 4 important terms that can be located in a confusion matrix.

  • True Positive- The case in which predicted output is YES & the actual output is also YES.
  • True Negative- The case in which predicted output is NO & the actual output is also NO.
  • False Positive- The case in which predicted output is YES & the actual output is NO.
  • False Negative- The case in which predicted output is NO & the actual output is YES.

6. F1 Score- F1 Score is defined as the harmonic mean of the precision and recall of the model. It helps us to find the balance between the two, and tells us how precise and robust our model is.

  • Precision- It is the ratio of correct positive results to the number of total positive results predicted by the classifier.
  • Recall- It is the ratio of correct positive results to the number of all positive samples.
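The measures described above can be written out in a few lines of plain Python. This is a minimal sketch rather than a reference implementation: the function names and the sample values are made up for illustration, labels are assumed to be 1 for YES and 0 for NO, and the logarithmic loss is shown only for the binary case.

```python
import math


def mean_absolute_error(actual, predicted):
    """Average of the absolute differences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mean_squared_error(actual, predicted):
    """Average of the squared differences."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def accuracy(actual, predicted):
    """Correct predictions divided by the total number of samples."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)


def log_loss(actual, prob_yes, eps=1e-15):
    """Binary logarithmic loss; prob_yes[i] is the predicted probability of label 1."""
    total = 0.0
    for a, p in zip(actual, prob_yes):
        p = min(max(p, eps), 1 - eps)  # clip so that log(0) never occurs
        total -= a * math.log(p) + (1 - a) * math.log(1 - p)
    return total / len(actual)


def confusion_counts(actual, predicted):
    """Return (true positives, true negatives, false positives, false negatives)."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == 1 and p == 1 for a, p in pairs)
    tn = sum(a == 0 and p == 0 for a, p in pairs)
    fp = sum(a == 0 and p == 1 for a, p in pairs)
    fn = sum(a == 1 and p == 0 for a, p in pairs)
    return tp, tn, fp, fn


def precision_recall_f1(actual, predicted):
    """Precision, recall and their harmonic mean (the F1 score)."""
    tp, _, fp, fn = confusion_counts(actual, predicted)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Made-up example values.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
tp, tn, fp, fn = confusion_counts(y_true, y_pred)
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```

For these sample labels there are 3 true positives, 3 true negatives, 1 false positive and 1 false negative, so accuracy, precision, recall and F1 all come out to 0.75.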

6. Training the model

Training the model is one of the most important steps in Machine Learning Model Development. For training a model we initially split the data into three different parts- Training data, Validation data and Testing data.

We use the training dataset to train the model, the validation dataset to tune its parameters, and then test the performance of the model on the unseen test dataset. An important point to note is that only the training and validation datasets are used while the model is being trained. We can say that training is the learning phase of the model.
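A three-way split can be sketched as follows. The function name, the 60/20/20 proportions and the fixed seed are assumptions chosen for illustration; the right proportions depend on how much data is available.

```python
import random


def train_val_test_split(samples, val_fraction=0.2, test_fraction=0.2, seed=0):
    """Shuffle the samples and cut them into training, validation and test parts."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_test = round(len(shuffled) * test_fraction)
    n_val = round(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


# 100 dummy samples split into 60 training, 20 validation and 20 test samples.
train, val, test = train_val_test_split(range(100))
```

Shuffling before cutting matters because collected data is often ordered (by date or by class, for example), and an unshuffled split would give the three parts different distributions.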

The learning process includes the following steps-

1. We take a randomly initialized matrix which contains weights and biases.

2. Then we use this matrix to predict the output of the input data.

3. After that, we measure the error in our predictions using the measure we chose earlier.

4. Now, we adjust the values of the weights and biases on the basis of the error so that the model predicts better.

Steps 2–4 are repeated until the error stops decreasing, ideally at a value that is as small as possible. With each iteration of the learning process, the initially random matrix moves closer to the ideal one and the predictions become more accurate.

There are various methods available to update the weights and biases. The most common approach is gradient descent. Gradient descent is an optimization algorithm which repeatedly takes steps proportional to the negative of the gradient (the first-order derivative) of the error function.
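The four learning steps and the gradient descent update can be put together in a small sketch. It makes several assumptions that are not part of the article: the model is a straight line `y = w * x + b` with a single weight and bias, the error measure is Mean Squared Error, and the learning rate, tolerance and toy data are arbitrary illustrative values.

```python
import random


def train_linear_model(xs, ys, learning_rate=0.05, max_epochs=5000, tol=1e-12, seed=0):
    """Fit y = w * x + b by gradient descent on the Mean Squared Error."""
    rng = random.Random(seed)
    # Step 1: start from a randomly initialised weight and bias.
    w, b = rng.uniform(-1, 1), rng.uniform(-1, 1)
    n = len(xs)
    previous_loss = float("inf")
    loss = previous_loss
    for _ in range(max_epochs):
        # Step 2: predict the output for every input.
        predictions = [w * x + b for x in xs]
        # Step 3: measure the error with the chosen measure (MSE here).
        errors = [p - y for p, y in zip(predictions, ys)]
        loss = sum(e * e for e in errors) / n
        if abs(previous_loss - loss) < tol:
            break  # the error has stopped changing, so stop training
        previous_loss = loss
        # Step 4: move w and b a small step against the gradient of the loss.
        grad_w = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2 * sum(errors) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b, loss


# Toy data generated from y = 2x + 1, so training should recover w ≈ 2 and b ≈ 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2 * x + 1 for x in xs]
w, b, final_loss = train_linear_model(xs, ys)
```

The learning rate controls the step size: too large and the updates overshoot and diverge, too small and training needs many more iterations to settle.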

7. Testing and Evaluation of the Model

After the model is trained, it is used to predict the output of the testing dataset, i.e. the unseen data. Then we measure how well the model performs on this test using the measure of success we defined earlier. The result is the test performance of the model, for example its test accuracy.

Model Evaluation is an integral part of the workflow. It helps us to find out how well our chosen model will work in the future. To further improve the model after the evaluation we might tune the model’s hyperparameters or we can find a good model by training different models for the problem and comparing them.
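Hyperparameter tuning followed by a final evaluation can be sketched like this. The example uses a tiny one-dimensional k-nearest-neighbours classifier, which is not discussed in this article and is chosen only because it is short; the data points, the candidate values of `k` and the function names are all made up for illustration.

```python
def knn_predict(train, x, k):
    """Predict 0 or 1 by majority vote among the k training points nearest to x."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    votes_for_one = sum(label for _, label in nearest)
    return 1 if 2 * votes_for_one > k else 0


def knn_accuracy(train, data, k):
    """Fraction of samples in data that the classifier labels correctly."""
    correct = sum(knn_predict(train, x, k) == label for x, label in data)
    return correct / len(data)


# Made-up (feature, label) pairs; the point (4.0, 1) is a deliberately noisy label.
train = [(1.0, 0), (2.0, 0), (3.0, 0), (4.0, 1), (4.6, 0), (6.0, 1), (7.0, 1), (8.0, 1)]
validation = [(3.7, 0), (4.2, 0), (6.4, 1), (7.6, 1)]
test = [(2.4, 0), (3.6, 0), (6.6, 1), (7.4, 1)]

# Tune the hyperparameter k on the validation data only.
candidates = (1, 3, 5)
validation_scores = {k: knn_accuracy(train, validation, k) for k in candidates}
best_k = max(candidates, key=lambda k: validation_scores[k])  # ties go to the smaller k

# Touch the test data once, at the very end, with the chosen k.
test_accuracy = knn_accuracy(train, test, best_k)
```

With `k = 1` the classifier copies the noisy training label and gets only half of the validation samples right, while `k = 3` smooths the noise out. The test data plays no part in choosing `k`, which is what keeps the final test score an honest estimate.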

Conclusion

In this article, we have covered a very important concept of machine learning by going through the workflow one small step at a time. Knowing the different steps of model development is necessary for getting good results. In the upcoming articles, we will dive deep into the different machine learning algorithms.

Stay Tuned! Happy Learning!
