Machine Learning 101: The 7 Steps of a Machine Learning Process

Dhruv Kapoor
Published in Analytics Vidhya · 5 min read · Apr 14, 2020

Data is everywhere. If you simply stop and look around, you’ll find tremendous amounts of data. The real challenge lies in determining what the data is trying to tell us and finding a way to extract the intricacies and patterns which are hidden to the naked eye.

In this blog post, we will go through the 7 Steps of a Machine Learning process as explained by Yufeng G in this wonderful video, part of Google Cloud’s AI Adventures series on YouTube.

If you’ve just begun your Machine Learning journey, then I encourage you to check out my previous article before moving further.

7 Steps of Machine Learning

To understand these steps more clearly let us assume that we have to build a machine learning model and teach it to differentiate between apples and oranges.

Gathering Data

The initial step of our process involves extracting data from our sources in order to create a dataset. Our aim is to create a model which can either be used to make predictions (Classification/Regression) or to draw out important information from our dataset (Clustering). While extracting data, we must ensure that the features we choose represent our data accurately. In other words, if we showed a layman the features of our dataset, they should be able to recognize what these features are describing. Moreover, the quality and quantity of our data must be high, as both will affect our model’s performance later on.

For the sake of simplicity, we shall take only 2 features into account, namely the colour of each fruit and the texture of its outer surface, i.e. whether it is rough or smooth. We can also make use of online repositories such as Kaggle, UCI, etc. which contain pre-processed datasets.
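To make this concrete, here is a minimal sketch (using pandas) of what such a two-feature dataset might look like. The column names and values are made up purely for illustration; they do not come from any real repository.

```python
import pandas as pd

# A tiny, made-up dataset: each row is one fruit described by two features,
# its colour and the texture of its outer surface, plus the label we want to predict.
fruits = pd.DataFrame({
    "colour":  ["red", "green", "red", "red", "green", "red",
                "orange", "orange", "orange", "orange", "orange", "orange"],
    "texture": ["smooth"] * 6 + ["rough"] * 6,
    "label":   ["apple"] * 6 + ["orange"] * 6,
})

print(fruits.head())
```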

Data Preparation

This is by far the most important step of our process and requires us to ensure that our data is formatted correctly, so that it can be easily interpreted and understood by our model. It is important to clean our data by scaling features, correcting errors, filling in missing values, normalizing, etc. We may also perform Exploratory Data Analysis (EDA) on our dataset to understand its features more clearly. This allows us to determine which features are most relevant to our target variable, or which are related to one another. Our dataset is eventually split into a training set and a test set (and sometimes also a validation set). Essentially, this step organizes our data in such a way that our machine learning model can understand it properly.
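As a rough sketch of what this preparation could look like in code, the snippet below one-hot encodes the categorical features of the toy `fruits` DataFrame from the previous sketch and splits it into training and test sets with scikit-learn. The exact preprocessing you need will depend on your own data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Assume `fruits` is the toy DataFrame from the previous sketch.
X_raw = fruits[["colour", "texture"]]
y = fruits["label"]

# Most models expect numbers, not strings, so one-hot encode the categorical features.
X = pd.get_dummies(X_raw)

# Hold out a portion of the data so we can later evaluate on fruits the model has never seen.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
```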

Choosing a Model

In this step we must choose a model to train for the task at hand. Selecting the correct model for our process is very important, as there are models which are specifically suited to images, sequential data, text data, numerical data, etc. Our task of distinguishing between apples and oranges is a binary classification problem, i.e. only two output classes are possible.
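For a small, tabular, binary-classification task like ours, a simple classifier is a reasonable starting point. The two candidates below (a decision tree and a logistic regression from scikit-learn) are illustrative choices on my part, not the only options.

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

# Two reasonable candidates for a small tabular binary-classification problem:
# a decision tree handles one-hot encoded categorical features naturally,
# while logistic regression is a simple, well-understood linear baseline.
candidate_models = {
    "decision_tree": DecisionTreeClassifier(max_depth=3, random_state=42),
    "logistic_regression": LogisticRegression(max_iter=1000),
}
```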

Image via MathWorks

Training

The goal of training is to feed our model more and more data to improve its ability to accurately predict our target variable. This step makes up the bulk of most machine learning processes, as it often takes a large amount of time for a model to discover the patterns and characteristics hidden in large, detailed and complex datasets. In our case, the model is a fairly simple one, as it only has to learn the difference between apples and oranges.
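In scikit-learn terms, training boils down to a single `fit` call on the prepared training data. The snippet below assumes the `X_train` and `y_train` from the earlier data-preparation sketch and picks the decision tree as the chosen model.

```python
from sklearn.tree import DecisionTreeClassifier

# Assume X_train and y_train come from the earlier data-preparation sketch.
model = DecisionTreeClassifier(max_depth=3, random_state=42)

# The model learns the patterns that separate apples from oranges in the training data.
model.fit(X_train, y_train)
```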

Evaluation

Once we’ve trained our model, we must determine whether it makes the right predictions or not. As mentioned earlier, we had split our dataset into a training set and test set. Now we use our test set, which contains unseen data points, to evaluate our model. The metrics on which we evaluate our model should be clearly defined. If we are to use several different models, then this pre-defined metric will allow us to choose one model over another.
A good rule of thumb when splitting your dataset is to use an 80/20 or 70/30 split between the training set and test set, respectively.

Since our problem is a binary classification problem, we may choose to evaluate our model on the basis of its accuracy, i.e. the proportion of predictions it gets right.

Image Source: https://medium.com/usf-msds/choosing-the-right-metric-for-evaluating-machine-learning-models-part-2-86d5649a5428
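Continuing the toy example, a minimal evaluation might compute accuracy on the held-out test set and inspect the confusion matrix to see where mistakes happen. This assumes the `model`, `X_test` and `y_test` from the earlier sketches.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Assume `model`, X_test and y_test come from the earlier sketches.
y_pred = model.predict(X_test)

# Accuracy: the fraction of test-set fruits the model labels correctly.
print("Accuracy:", accuracy_score(y_test, y_pred))

# The confusion matrix shows which apples were mistaken for oranges and vice versa.
print(confusion_matrix(y_test, y_pred))
```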

Parameter Tuning

Once we have evaluated our model, we may observe that it can be improved by adjusting the settings that control how it is trained. These settings, which are chosen by us rather than learned from the data, are referred to as “hyperparameters”, and the ones available to tune depend on the algorithm being implemented. Determining the correct combination of values, and tinkering with them to understand how they affect our model, is somewhat of an art form in itself. It is often a tedious and intimidating task, and one which depends highly on factors such as the quality of your dataset, the training procedure, the required output, etc.

In fact, what constitutes a good model is often an open-ended question and could vary from person to person, even for the same dataset. Thus, it is necessary to define the criteria under which your model may be considered acceptable.
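One common, systematic way to explore hyperparameter combinations is a cross-validated grid search. The sketch below tunes two decision-tree hyperparameters on the toy training data from the earlier sketches; the grid itself is an arbitrary illustration.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Assume X_train and y_train come from the earlier sketches.
# The grid below is an illustrative choice of hyperparameters to explore.
param_grid = {
    "max_depth": [2, 3, 5],
    "min_samples_leaf": [1, 2],
}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    cv=3,                 # a small number of folds because the toy dataset is tiny
    scoring="accuracy",   # the pre-defined metric we agreed to evaluate on
)
search.fit(X_train, y_train)

print("Best hyperparameters:", search.best_params_)
print("Best cross-validated accuracy:", search.best_score_)
```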

Prediction

Now that we have trained our machine learning model and found the optimal combination of hyperparameters, it is time to dive into the final step. Here, we seek to deploy our model in the real world and determine whether or not it is a viable option. It is important to remember that the results of our models are often used in a much larger decision-making process. Thus, we must determine the utility of the results we have obtained and subsequently decide how to improve our system by analyzing these results.
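To close the loop on the toy example, here is how a prediction on a brand-new fruit might look, reusing the `model` and the one-hot encoded training columns from the earlier sketches. The new fruit’s feature values are, of course, made up.

```python
import pandas as pd

# A hypothetical new fruit: orange-coloured with a rough outer surface.
new_fruit = pd.DataFrame({"colour": ["orange"], "texture": ["rough"]})

# One-hot encode it with the same columns the model was trained on,
# filling in zeros for any dummy columns the single new row does not produce.
new_fruit_encoded = pd.get_dummies(new_fruit).reindex(columns=X_train.columns, fill_value=0)

print(model.predict(new_fruit_encoded))  # e.g. ['orange']
```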


Summary

The massive implementation and integration of machine learning in our everyday lives allows us to create a generalized framework which can be utilized in many scenarios. The basic principles of our framework are as follows:

  • Gathering Data
  • Data Preparation
  • Choosing a Model
  • Training
  • Evaluation
  • Parameter Tuning
  • Prediction

Thanks for sticking by and stay tuned for more!
