Building The Perfect Machine Learning Project: Steps to follow

A breakdown of the standard Machine Learning project lifecycle, including the tasks and sub-tasks within each step, with practical Python examples.

Einstein EBEREONWU
8 min read · Jun 10, 2022

When building a Machine Learning project, a series of steps is followed to refine the end-to-end procedure of training a model, otherwise known as the Machine Learning pipeline. In this article, I will be highlighting and shedding light on each of them.

They’re as follows:

  1. Data Gathering
  2. Feature Engineering (FE)
  3. Exploratory Data Analysis (EDA)
  4. Feature Selection (FS)
  5. Cross-Validation
  6. Model Training and Evaluation
  7. Hyper-parameter Tuning
  8. Deploying, Monitoring and Maintaining the Model/System
  9. Re-training and Re-deploying

Data Gathering

What is data gathering?

I will first start with a basic definition of data, which, as we all know, consists of raw facts and figures that have implicit meaning(s) and can be used for analytical and/or other purposes. Data gathering, on the other hand, is the process of collecting and compiling data. No other step can proceed without data, as without it there is nothing for machines (or even humans) to learn from. There are several methods of data gathering, such as web scraping, surveys, API calls, and downloads of publicly available data, amongst others. Before data gathering, the following questions should be considered: how much data is needed, what type of data is needed, how the data will be collected, what collecting it will cost, how it will be stored safely, etc.
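As a minimal sketch of the last of those methods, here is how data downloaded from a public source could be parsed into usable records. The records in the string are made up for illustration; in a real project the text would come from an API call or a data-set download.

```python
import csv
import io

# In practice this string would come from an API call or a public
# data-set download (e.g. via urllib.request). These records are
# made-up values for illustration only.
downloaded_csv = """name,age,gender
Ada,25,Female
Tunde,31,Male
Chioma,28,Female
"""

# Parse the raw text into a list of dictionaries, one per row.
rows = list(csv.DictReader(io.StringIO(downloaded_csv)))
print(len(rows))        # number of records gathered → 3
print(rows[0]["name"])  # first record's name field → Ada
```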

Feature Engineering (FE)

Almost all the time, the data gathering process does not leave us with “clean data”, i.e. data that is without faults and ready for analysis. These faults could be missing values, duplicate records, outliers, wrong spellings (in the case of text data), wrong data formats (e.g. integers stored as strings), etc. The data therefore has to be worked on to ensure it is ready for analysis. The following could be done: replacing missing values with the mean or median, deriving new columns from combinations of or mathematical operations on two or more existing columns, feature scaling, normalization, and feature encoding (replacing string data with integer or float maps, as most ML models do not accept strings). I will be discussing more on feature encoding in another article.

Example of Feature Engineering (Filling missing values with mean)
Example of Feature Engineering (feature encoding)

Notice how the NaN/missing values from the first image are now replaced with the mean of their columns, and how the values of the “gender” column in the second image changed from Male and Female to 1’s and 0’s. These are simple examples of what feature engineering is. View the full code on my GitHub repo. NB: Both data sets are randomly generated.
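The two operations shown in the images can be sketched with pandas as below. The column names and values are made up to mirror the article's randomly generated data sets:

```python
import pandas as pd
import numpy as np

# Small made-up data set mirroring the article's examples.
df = pd.DataFrame({
    "forehead_width_cm": [11.8, np.nan, 13.5, np.nan],
    "gender": ["Male", "Female", "Female", "Male"],
})

# Filling missing values: replace NaN with the column mean.
df["forehead_width_cm"] = df["forehead_width_cm"].fillna(
    df["forehead_width_cm"].mean()
)

# Feature encoding: map the string labels to integers.
df["gender"] = df["gender"].map({"Male": 1, "Female": 0})

print(df)
```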

Exploratory Data Analysis (EDA)

As the name implies, exploratory data analysis has to do with thoroughly examining data in order to understand things such as which features/attributes are determining factors for the dependent variable (what the model is to predict), how variables are related to each other (correlation), and so on. However, you should note that correlation doesn’t necessarily mean causation: the fact that two things correlate doesn’t mean that one causes the other. Ways of performing EDA include, but are not limited to, plotting scatter plots, histograms, correlation tables (Spearman’s, Pearson’s), crosstabs, etc.

Exploratory Data Analysis (Spearman’s correlation table)
Exploratory Data Analysis (Scatter plot)

In the first diagram, the table tells us the degree of positive/negative correlation between attributes. In the second diagram, the plot suggests that males generally have wider foreheads than females. See the full code on GitHub.
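A correlation table like the one in the first diagram can be produced in one line with pandas. The data below is made up to echo the gender/forehead example (1 = Male, 0 = Female):

```python
import pandas as pd

# Made-up encoded data, mirroring the article's example.
df = pd.DataFrame({
    "gender": [1, 1, 0, 0, 1, 0],
    "forehead_width_cm": [14.1, 13.8, 11.9, 12.2, 13.5, 12.0],
    "long_hair": [1, 0, 1, 1, 0, 1],
})

# Spearman's rank correlation between every pair of columns.
corr = df.corr(method="spearman")
print(corr.round(2))

# A crosstab of gender against long_hair.
print(pd.crosstab(df["gender"], df["long_hair"]))
```

Here `gender` and `forehead_width_cm` correlate strongly, while `long_hair` barely correlates with `gender`, which is exactly the kind of insight EDA is meant to surface.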

Feature Selection (FS)

After the Exploratory Data Analysis (EDA) is complete, you will have a good understanding of which features/attributes are most relevant for making predictions. You can then go ahead and drop irrelevant columns; this is known as dimensionality reduction. It helps reduce noise in the data and improves the overall accuracy of the model. To see an example of feature selection, scroll up to the image Example of Feature Engineering (feature encoding). You will notice that the ‘long_hair’ column from the first table is no longer present in the second table. That is because, during the course of EDA, the data set suggested that both males and females have long or short hair and that hair length wasn’t gender-specific. Such a column is irrelevant and will not help the model’s accuracy, so it can and should be dropped.
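Dropping such a column is a one-liner in pandas; the data here is made up to match the hair-length example:

```python
import pandas as pd

df = pd.DataFrame({
    "gender": [1, 0, 1, 0],
    "forehead_width_cm": [13.9, 12.1, 13.4, 12.3],
    "long_hair": [1, 1, 0, 1],  # EDA suggested this is not gender-specific
})

# Feature selection: drop the column EDA flagged as irrelevant.
df = df.drop(columns=["long_hair"])
print(list(df.columns))  # → ['gender', 'forehead_width_cm']
```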

Cross-Validation

There are several algorithms that data can be fitted to for training, but in order to get the best possible result, we go for the “best algorithm for the job”. How do you know which algorithm that is? This is where cross-validation comes in. Cross-validation is the process of training a model on one subset of the data and then testing it on another subset (one that was not used for training). This process is carried out for several different algorithms, and the average prediction accuracy score of each is calculated. The algorithm with the highest average score is then selected for the subsequent steps. Types of cross-validation include, but are not limited to, k-fold cross-validation, stratified k-fold cross-validation, and holdout cross-validation. I will be discussing more about this concept in another article.
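The comparison described above can be sketched with scikit-learn's `cross_val_score`. The synthetic data and the two candidate algorithms are placeholders; a real project would use its own processed data set and shortlist:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a real, processed data set.
X, y = make_classification(n_samples=200, n_features=5, random_state=42)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=42),
}

# 5-fold cross-validation: each model is trained on four folds and
# tested on the held-out fold, five times over; we compare the means.
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(name, round(scores.mean(), 3))
```

The algorithm with the highest mean score would then move on to the training step.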

Model Training and Evaluation

Model Training

At this stage, the best algorithm for the job has been selected and it is time to feed it the processed data to learn from. Model training without any hyper-parameters usually looks like this:

Basic example of Model Training (Extreme Gradient Boosting Regressor algorithm)
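A sketch in the same spirit as the image, using scikit-learn's `GradientBoostingRegressor` as a stand-in for XGBoost's `XGBRegressor` (the API shape is the same `fit`/`score` pattern; the synthetic data is a placeholder):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in data for illustration.
X, y = make_regression(n_samples=200, n_features=4, noise=0.1,
                       random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Train with default hyper-parameters, then score on held-out data.
model = GradientBoostingRegressor(random_state=42)
model.fit(X_train, y_train)
print(round(model.score(X_test, y_test), 3))  # R² on the test split
```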

Evaluation

During evaluation, there are several techniques that can be employed to assess a model’s performance: for regression tasks, calculating the R² (R-squared) score and root mean squared error; for classification tasks, log loss, area-under-curve (AUC) plots, and confusion matrices showing the numbers of true positives, true negatives, false positives, and false negatives. After a confusion matrix is calculated, plotting it as a heatmap makes it more appealing to the eye and easier to understand.

Evaluation (Classification/clustering tasks…. Confusion matrix heat map, Log loss plot and Area under curve plot)

From the heat map, you can see the number of times the model’s predictions were right (numbers on a green background) and the number of times they were wrong (numbers on white). NB: These predictions are made by calling model.predict on the independent set and validated against the actual values from the dependent set. In the log loss and AUC plots, the lines follow almost exactly the same path, meaning that the model was able to generalize properly on both training/familiar data and new/test data; hence there is no overfitting and the model is good.
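The classification metrics mentioned above can be computed with scikit-learn. The label and probability arrays below are made-up values standing in for `model.predict` / `model.predict_proba` output:

```python
from sklearn.metrics import confusion_matrix, roc_auc_score

# Made-up actual labels and model predictions for illustration.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 0, 1, 1]

# Rows are actual classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
cm = confusion_matrix(y_true, y_pred)
print(cm)

# AUC needs predicted probabilities rather than hard labels.
y_prob = [0.9, 0.2, 0.8, 0.4, 0.1, 0.3, 0.7, 0.6]
auc = roc_auc_score(y_true, y_prob)
print(round(auc, 3))
```

To get the heatmap shown in the image, the matrix `cm` could then be passed to a plotting call such as seaborn's `heatmap`.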

Hyper-parameter Tuning

What are hyper-parameters?

These are algorithm configurations that can be adjusted in order to boost model performance. Various algorithms have various hyper-parameters, which can be found in their documentation. Before adjusting any hyper-parameter, it is important that you know its effects and whether it actually needs changing. There are manual ways of adjusting hyper-parameters, but I personally prefer to save time by using Grid Search CV or Randomized Search CV. These two cross-validation search techniques train models on the combinations of hyper-parameters provided. I will be discussing these two techniques, and which of the two you should use based on model complexity and your computing resources, another time.

Hyper-parameter tuning (GridSearch CV)

In the image above, you see a dictionary of hyper-parameters. GridSearchCV trains several models with all possible combinations of the hyper-parameters provided, and in the end you can run a very short line of code that returns the best combination found by the grid searcher. You can then go ahead and re-train the model with those hyper-parameters.
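A sketch of that workflow, using a decision tree and a small made-up grid (the parameter names come from scikit-learn's `DecisionTreeClassifier` documentation; the data is synthetic):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=42)

# Dictionary of hyper-parameter values to try — 3 × 2 = 6 combinations.
param_grid = {
    "max_depth": [2, 4, 6],
    "min_samples_split": [2, 10],
}

# Each combination is cross-validated with 5 folds.
search = GridSearchCV(DecisionTreeClassifier(random_state=42),
                      param_grid, cv=5)
search.fit(X, y)

# The "very short line of code" that returns the best combination:
print(search.best_params_)
```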

Deploying, Monitoring and Maintaining the Model/System

Deploying

Model deployment is the integration of a machine learning model into an existing environment for the purpose of receiving input and providing output (predictions). This step involves putting the model out there so it can do what it is meant to do. At the moment, I deploy my ML models to the web simply by building a web app: front-end with HTML & CSS, back-end with Flask, hosted on free platforms such as Heroku. There are several other methods of deployment, which I will also be discussing as soon as I have a good grasp of them.
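A minimal sketch of such a Flask back-end is below. The route name, the feature name, and the stand-in prediction rule are all hypothetical; a real app would load a previously trained model (e.g. unpickle it with `pickle.load`) and call its `predict` method instead:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/predict", methods=["POST"])
def predict():
    # The front-end posts JSON such as {"forehead_width_cm": 13.2}.
    features = request.get_json()
    # Stand-in for model.predict(...), so this sketch runs without
    # a saved model file; the threshold is a made-up rule.
    prediction = 1 if features.get("forehead_width_cm", 0) > 13 else 0
    return jsonify({"prediction": prediction})

# app.run()  # uncomment to serve locally; use a WSGI server in production
```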

Monitoring and Maintenance

In order to make sure a model is still performing as it should, it needs to be constantly monitored and maintained. This process helps detect wrong predictions, model degradation, data drift, concept drift, etc. A simple way of monitoring and maintaining a model is to check its accuracy on real-world data over time. If the accuracy remains above acceptable requirement standards, the model is still good for a while; if it doesn’t meet the acceptable standard, the next step comes up.
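That accuracy-against-threshold check can be sketched in plain Python. The threshold and the monitored records are made-up values:

```python
# Acceptance threshold is a made-up requirement for illustration.
ACCEPTABLE_ACCURACY = 0.80

def needs_retraining(y_true, y_pred, threshold=ACCEPTABLE_ACCURACY):
    """Return True when accuracy on fresh labelled data falls below
    the acceptable standard, i.e. when the next step (re-training)
    should kick in."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    accuracy = correct / len(y_true)
    return accuracy < threshold

# A batch of monitored real-world outcomes vs. the model's predictions.
actual    = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
predicted = [1, 0, 1, 0, 0, 1, 1, 0, 1, 1]

print(needs_retraining(actual, predicted))  # 8/10 = 0.80, still acceptable → False
```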

Re-training and Re-deploying

Re-training and re-deploying is basically a redo of the previous 8 steps. As you can see, the process of training a machine learning model is cyclical. The single purpose of re-training a model is to increase/maintain its accuracy over time.

At this point, I have come to the end of this article. I hope you were able to pick up a few new tricks and also understand the steps, tasks and sub-tasks involved in building a Machine Learning pipeline. Feel free to share, applaud and leave a comment; I’ll be sure to respond as soon as I can.

In my next article, I will be discussing web scraping for data gathering with Python (BeautifulSoup) and Excel.


Thanks and see you again next time.


Einstein EBEREONWU

Google Certified Data Analyst 📊 • Data Scientist 💻 • ML/AI 🤖 • Pythonist.