The Stages of a Machine Learning Project

From the Business Problem to the Machine Learning Solution That Addresses the Business Needs

Kizito Nyuytiymbiy
The Startup
10 min read · Aug 4, 2020


Summary of the stages of a machine learning project and some activities involved
An infographic depicting the flow of a machine learning project (image by author)

Many businesses and organizations are turning to machine learning for solutions to challenging business goals and problems. Providing machine learning solutions to meet these needs requires following a systematic process from problem to solution. The stages of a machine learning project constitute the machine learning pipeline: a systematic progression of a machine learning task from data to intelligence.

During our training as ML engineers, a lot of focus is invested in learning about algorithms, techniques, and machine learning tools, but far less attention is given to taking an industry or business problem all the way from problem statement to usable solution.

In this article, I present the machine learning pipeline, a comprehensive approach to solving real-world problems using machine learning. I will start with the observable or explainable problem as companies and businesses are likely to present it to an engineer, and walk you through the stages a project goes through until it ends as a usable solution available to end users.

You will see, at a high level, the stages involved in building, for instance, the Netflix movie recommendation engine that runs in the background of the platform and personalizes your experience, showing you the movies you are likely to be interested in.

Solving any business problem follows these fundamental stages, so it is necessary for all practitioners to understand and leverage them. If you sharpen your thinking about machine learning projects in light of this article, I believe you will be more effective and structured when doing ML projects. You will also understand how to relate to industry stakeholders who may not understand the whole ML buzz but are genuinely seeking relevant solutions to good problems.

I know from experience the frustration inherent in not following this paradigm: during my first internship as a data analyst for one of Africa’s leading data center colocation service providers, Africa Data Centers, I did not realize my approach was suboptimal, because I did not know better at the time. I can only imagine how much time and frustration this understanding would have saved me, and how considerably it would have improved my performance and output.

The stages of a machine learning project are summarized by the figure below.

The machine learning pipeline, business problem to solution (image by author)

The business problem or research problem

Start with the business problem.

In many cases, organizations present this as a goal: what they want to achieve. Very often there is a story to it, and that story is important: this is how we have been doing things, this is how the system behaves, and this is what we would like to achieve. In my case, it was something like:

“We would like to use the historical data we have on our energy consumption to determine our options for energy-efficiency optimization and cost saving.”

Simply put, the business problem was: our energy cost is very high; what can we do to reduce our energy expenses?

Framing the machine learning problem

From the business problem, you frame the machine learning problem. This is where domain knowledge and expertise come in. It is not trivial at all, because to get the right solution you must start with the right problem and the right questions.

“Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question” – John Tukey, The Future of Data Analysis

Organizations that use machine learning seriously often consult domain experts to help them ask the right questions. But you will not always be able to bring in an expert; you may be the expert or consultant who was brought in to figure things out. In that case, research is the only way to go: What is the industry doing to solve the same or similar problems? Are there parallels in other industries? How has the problem been solved elsewhere?

Your goal is not to waste time answering the wrong question. Translating a business problem into a machine learning problem is so important that it determines the fate of your entire project. Working through this step, you start to see what kind of data will be necessary to answer the machine learning question. From your research and understanding, you should already know what relevant features to expect in your dataset.

Data collection and/or integration

  • Is the data relevant to the problem?
  • Is the data enough to train a good model?

This step involves putting together already existing data or collecting the necessary data. If data already exists, you must determine whether it is relevant to the machine learning problem and, by extension, the business problem. This is especially important if the organization is not a typical machine learning organization that determined relevance before collecting the data it now has. It is not uncommon (I have been there) to find that an organization has collected data that is not relevant to the problem it wants to solve.

A good rule of thumb is to ask: “What data would a human expert need to solve this problem if the task were left to them?” If a human expert cannot use the available data to make correct predictions, it is almost certain that a machine cannot either. Better still, an expert can tell you how they would solve the problem and what data they would need to answer the question you are trying to answer with machine learning.

The quality of the model or analysis is entirely dependent on the quality of the data. Just as one cannot make fine wine from low-quality grapes, one cannot build a good model from poor-quality data.

It might also be possible to derive more valuable features from the original data through feature engineering, so think critically about whether relevant features are simply hidden in the dataset. It is also recommended to advise the organization or business on what data it should collect to better serve its goals; that data can then be used to train new and better models or to tune old ones.

The final consideration is the size (number of examples) of the dataset. While there is no definite answer to how much data is enough, most algorithms tend to perform better when trained on more data.

A common rule of thumb is to have at least 10 times as many examples as there are features in the dataset.

If this is not the case, more data should be collected. Many options are available for getting more data, including crowdsourcing on platforms like Amazon Mechanical Turk, other external sources, or internal data collection within the organization. For some problems, it may be possible and appropriate to generate more data from existing examples (data augmentation). This is best determined by a machine learning engineer.
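As a rough sanity check, here is a minimal sketch of the 10-times rule of thumb with pandas; the file name is a hypothetical placeholder.

```python
import pandas as pd

df = pd.read_csv("energy_consumption.csv")  # hypothetical dataset
n_examples, n_features = df.shape  # note: shape counts the target column too

if n_examples < 10 * n_features:
    print(f"Only {n_examples} examples for {n_features} columns; "
          "consider collecting more data.")
```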

Data preparation/pre-processing

At this stage, you explore the data critically and prepare or transform it so that it is ready for training. Look out for things like missing data, duplicate examples and features, feature value ranges, data types, feature units, and so on. Use simple tools to quickly examine the data and gather as much general information as possible. After gathering this information, some of the following actions may be required (a code sketch follows the list):

  • Deal with missing data (NaN, NA, “”, ?, None) and outliers: standardize all missing-value markers to np.nan. Common options for handling missing data and outliers include dropping the examples with missing values or applying imputation techniques (mean, mode/most frequent, median).
  • Deal with duplicate features and/or examples: duplicate features introduce linear dependence in the dataset, and duplicate examples can give a false impression that the data is sufficient when the number of unique examples is actually too small to train a good model.
  • Feature scaling, normalization, standardization: ensure that your features are in the same or comparable ranges, typically 0 to 1. This helps your model train faster and more stably, especially with optimization algorithms like gradient descent.
  • Balance the class sizes for categorical data: ensure that the number of training examples across the different target categories is comparable. If the task naturally involves skewed patterns where one class always dominates, balancing may not be appropriate; this is common in anomaly detection tasks like rare-disease prediction (e.g. cancer) and fraud detection. For skewed datasets that cannot reasonably be balanced, choose an appropriate training method and evaluation metric.
  • Harmonize inconsistent units: inconsistent units can easily escape notice, so ensure that all measurements of the same physical quantity use the same unit. To emphasize the point: NASA lost its $125-million Mars Climate Orbiter because of inconsistent units.
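The sketch below shows what some of these steps might look like with pandas and scikit-learn. It is a minimal illustration under stated assumptions, not a recipe; the file and column names (kwh, cost) are hypothetical placeholders.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.read_csv("energy_consumption.csv")  # hypothetical dataset

# Standardize the various missing-value markers to np.nan.
df = df.replace(["", "?", "NA", "None"], np.nan)

# Impute missing numeric values with the column median.
df = df.fillna(df.median(numeric_only=True))

# Drop duplicate examples (rows) and duplicate features (columns).
df = df.drop_duplicates()
df = df.loc[:, ~df.T.duplicated()]

# Scale selected features to the 0-to-1 range for faster, more stable training.
df[["kwh", "cost"]] = MinMaxScaler().fit_transform(df[["kwh", "cost"]])
```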

Data visualization and exploratory analysis

Data visualization is one of the most effective means of exploratory analysis. Using plots like histograms and scatter plots, you can easily spot outliers, trends, clusters, or categories in your dataset. However, visualizations tend to be most useful for low-dimensional data (1D, 2D, 3D), as higher dimensions cannot be plotted directly. For high-dimensional data, you can select specific features (or pairs of features) to visualize.
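For example, a minimal sketch with pandas and matplotlib, again with hypothetical file and column names:

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("energy_consumption.csv")  # hypothetical dataset

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
df["kwh"].hist(ax=ax1, bins=30)                     # distribution: outliers, skew
ax1.set_title("Distribution of energy consumption")
df.plot.scatter(x="temperature", y="kwh", ax=ax2)   # relationship: trends, clusters
ax2.set_title("Consumption vs. temperature")
plt.tight_layout()
plt.show()
```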

Feature selection and feature engineering

Which features are relevant to make correct predictions?

The goal is to select features that have the least correlation with one another but the maximum correlation with the target.

Feature engineering involves transforming the original features in the dataset into new, potentially more useful features. As mentioned above, always think about what hidden features might be derived from the original data. Arguably, feature engineering is one of the most critical and time-consuming activities in the ML pipeline.
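A minimal sketch of correlation-guided feature selection, assuming a numeric dataset with a hypothetical target column kwh:

```python
import pandas as pd

df = pd.read_csv("energy_consumption.csv")  # hypothetical dataset
corr = df.corr(numeric_only=True)

# Rank features by the strength of their correlation with the target.
target_corr = corr["kwh"].drop("kwh").abs().sort_values(ascending=False)
print(target_corr)

# Flag highly inter-correlated feature pairs as candidates for removal.
features = corr.drop(index="kwh", columns="kwh").abs()
redundant = [(a, b) for a in features.index for b in features.columns
             if a < b and features.loc[a, b] > 0.9]
print(redundant)
```

Note that correlation only captures linear relationships, so treat this as a first pass rather than a final verdict on which features matter.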

With all the above steps performed, you now have a sizable dataset with features that are relevant to the ML task, and you can proceed (with some confidence) to train a model.

Model training

The first step before training is to randomly split your dataset into a training set, a cross-validation (or development) set, and a test set.

Randomization helps eliminate bias in your models and is achieved by shuffling the data before splitting. It is extremely important, especially when dealing with sequential data that follows some chronological order, because it prevents the model from learning the ordering of the data rather than the underlying patterns.

There is no definitive rule for optimal splits, but the main intuition is to have as much training data as possible, a smaller but sufficient dev set for tuning hyperparameters during training, and enough test data to assess the model’s ability to generalize. Some typical and commonly used splits include:

  • 60% train / 20% dev / 20% test
  • 80% train / 10% dev / 10% test
  • 98% train / 1% dev / 1% test (for very large datasets)

Next, set aside the test set for later testing and proceed with the training set to train your model. It is good practice to quickly try out several candidate algorithms and pick the one with the best generalization performance on the cross-validation (dev) set for further tuning, or pick a set of algorithms to form an ensemble. Use the dev set for model hyperparameter tuning.
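Here is a minimal sketch of a randomized 60/20/20 split and a quick comparison of two candidate models with scikit-learn, using synthetic data as a stand-in for a real dataset:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; replace with your own features and target.
X, y = make_regression(n_samples=1000, n_features=8, noise=10, random_state=0)

# 60/20/20 split: first carve off 40%, then halve it into dev and test.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, shuffle=True, random_state=42)
X_dev, X_test, y_dev, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=42)

# Quickly compare candidate algorithms on the dev set (score is R-squared).
for model in (LinearRegression(), RandomForestRegressor(random_state=42)):
    model.fit(X_train, y_train)
    print(type(model).__name__, round(model.score(X_dev, y_dev), 3))
```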

Dataset splits and usage (image by author)

Model evaluation

Is the model useful (does it meet the minimum required performance)?

Is the model computationally efficient?

Once you have optimized your model’s performance on the dev set as much as possible, you can assess how well it performs on the unseen data you set aside in the test set. The performance observed on the test data gives you a glimpse of what to expect in the production environment. Use single-value evaluation metrics to quantify performance:

  • Accuracy: suitable for classification tasks
  • Precision/recall: suitable for skewed (imbalanced) classification tasks
  • R-squared (R²): suitable for regression tasks

It is hard to strike a good balance between precision and recall, so they are often combined into a single-value evaluation metric, the F1 score (their harmonic mean).
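A minimal sketch of computing these single-value metrics with scikit-learn, using toy labels and predictions as placeholders for your test set:

```python
from sklearn.metrics import (accuracy_score, f1_score,
                             precision_score, recall_score)

y_test = [0, 1, 1, 0, 1, 0, 0, 1]  # toy ground-truth labels
y_pred = [0, 1, 0, 0, 1, 1, 0, 1]  # toy model predictions

print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall   :", recall_score(y_test, y_pred))
print("F1       :", f1_score(y_test, y_pred))  # harmonic mean of P and R
```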

If the minimum required performance is obtained, then you have a useful model that is ready for deployment.

“All models are wrong, but some are useful.” – George Box

Model deployment, integration, and monitoring

“The deployment of machine learning models is the process for making your models available in production environments, where they can provide predictions to other software systems. It is only once models are deployed to production that they start adding value.” – Christopher Samiullah

Deployment is crucial, and probably the ML engineer’s nightmare, as it is more of a software engineering discipline. Nevertheless, ML engineers are largely expected to be able to deploy and integrate their models with existing software systems to serve end users. I have little to say about deployment itself, but it is worth noting how ML deployments fundamentally differ from explicitly programmed software.

Models in production environments suffer performance decay over time, so monitoring your model’s performance in production is standard practice. Performance decay is partly inevitable because the data distribution in the production environment drifts away from the distribution present in the training set. If you notice a significant difference in the production data distribution, you need to retrain your model.
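One simple way to flag such drift is a two-sample statistical test on each feature. A minimal sketch with SciPy’s Kolmogorov–Smirnov test; the data and the alert threshold here are hypothetical:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
train_feature = rng.normal(loc=0.0, scale=1.0, size=1000)  # training distribution
prod_feature = rng.normal(loc=0.5, scale=1.0, size=1000)   # drifted production data

stat, p_value = ks_2samp(train_feature, prod_feature)
if p_value < 0.01:  # hypothetical alert threshold
    print(f"Possible drift (KS={stat:.3f}, p={p_value:.4f}); consider retraining.")
```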

A model in production is continuously monitored, retrained, and redeployed (image by author)

Over the lifetime of any deployed ML model, the monitor-retrain-update cycle is a routine process. Continuous logging of system performance information and alerts for performance drift make this monitoring efficient.

Conclusion

In this article, I summarized the stages of a machine learning project from understanding the problem to a usable solution.

An ML solution is a system with a machine learning engine running in the background.

Summary of activities:

  • Understand the business problems and needs
  • Frame the ML problem
  • Understand the data needs and acquire the data
  • Clean and preprocess the data
  • Select relevant features
  • Perform feature engineering
  • Train a model
  • Tune hyperparameters to optimize the performance of the model (accuracy and speed).
  • Test the model
  • Deploy the model
  • Monitor and update the model/system (a continuous process)

Resources

  1. Getting started with AWS machine learning course on Coursera
  2. Machine Learning course by Andrew Ng on Coursera
  3. How to deploy machine learning models
  4. 6 stages to get success in machine learning projects
