Machine Learning for Non-Specialists — Part 1


Introduction

Put simply, Machine Learning is the process of using a piece of code (an algorithm, also called a Model) to train on existing data and produce another piece of code (a Trained Model) that can make predictions on new incoming data. Confusing? Don’t worry! This blog post aims to demystify the concept of Machine Learning for non-specialists, i.e. you do not need a Ph.D. to understand and implement Machine Learning.

We have helped many clients migrate their data from on-premises infrastructure to the cloud, either as Infrastructure as a Service (IaaS) or Platform as a Service (PaaS), by building a data warehouse to meet all their analytical needs. This gives our clients the ability to find insights from what has already happened. There is now a big drive to use data in a more strategic way, by trying to find out what might happen in the future. This is where Machine Learning can really shine.

Machine Learning (ML) is a subset of Artificial Intelligence (AI). The two most widely used ML model categories are supervised and unsupervised algorithms. A supervised model is used when you want to predict a target value. The dataset used for training the model contains columns and rows. The column that contains the target value is called the “Label” or dependent variable, and all other columns in the dataset are called “Features” or independent variables. You train the model on a dataset that already has the target values. The model learns from it and finally produces a trained model that can predict the target value on new incoming data. It is like preparing for an exam with a practice Q&A set that contains both the questions and the answers to train on, and then taking the real exam.

An unsupervised model, on the other hand, is used when you do not explicitly have a target value to predict but want to understand the underlying structure of the data so you can make more informed decisions.
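To make the distinction concrete, here is a minimal sketch in Python using scikit-learn (the tiny dataset and its values are made up purely for illustration): a supervised regressor learns from features and labels, while an unsupervised clustering model only looks at the features.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.cluster import KMeans

# Toy dataset: two feature columns (independent variables)
# and one label column (the dependent variable / target).
X = np.array([[1, 10], [2, 12], [3, 9], [4, 15], [5, 14]])
y = np.array([15, 22, 25, 38, 41])

# Supervised: train on features AND labels, then predict the label for new data.
regressor = LinearRegression().fit(X, y)
print(regressor.predict([[6, 13]]))

# Unsupervised: only the features are used; the model groups similar rows together.
clusters = KMeans(n_clusters=2, n_init=10).fit_predict(X)
print(clusters)
```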

For the sake of simplicity, we will mainly focus on supervised learning in this blog. Here’s a simplified eight-step process.

Step 1: Value Proposition

The end goal is not to produce a trained model that predicts well on new data. A well-trained model is only an enabler to achieve the end goal, which is to create value for the end consumer. Hence knowing that value should be our first job. The predictions only become valuable when they are used to make informed decisions or automate the process of making good decisions that best serve the organisation’s objectives.

Step 2: Data Acquisition, Profiling and Cleansing

A good understanding of your value proposition will enable you to identify what datasets are required to produce an effective trained model that can help you realise the target value proposition. Once the data is acquired, the next step is to find out whether the data is “fit for purpose” for training a model. Some of the common data quality issues that need addressing are missing values, datatype mismatches, columns with very different ranges of values, and values that are much lower or higher than the majority of the dataset (also referred to as Outliers).
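As a small illustration of what this can look like in practice, the pandas sketch below profiles a hypothetical CSV file (the file name and column names are assumptions) and applies a few common fixes: correcting a datatype, filling missing values and flagging outliers.

```python
import pandas as pd

# Load the raw data (file name and columns are hypothetical).
df = pd.read_csv("customers.csv")

# Profile: datatypes, missing values and basic statistics.
print(df.dtypes)
print(df.isna().sum())
print(df.describe())

# Fix a datatype mismatch: a numeric column that was read in as text.
df["income"] = pd.to_numeric(df["income"], errors="coerce")

# Fill missing values with the column median.
df["income"] = df["income"].fillna(df["income"].median())

# Flag outliers: values more than 3 standard deviations from the mean.
z_scores = (df["income"] - df["income"].mean()) / df["income"].std()
outliers = df[z_scores.abs() > 3]
print(f"{len(outliers)} potential outliers found")
```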

Step 3: Feature Engineering

Too many columns can sometimes lead to overfitting, i.e. a trained model that shows very high accuracy on the testing dataset but sub-optimal performance on new unseen data. This is sometimes referred to as “the curse of dimensionality”. Feature Engineering is the process of selecting only those columns that have the most predictive power, or creating a new, smaller set of more meaningful columns (referred to in this context as features) from the existing set of columns. Using a reduced set of meaningful columns also minimises training times and saves costs.
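One common way to do this (a sketch of just one approach, not the only one) is to score each column against the label and keep only the strongest ones, for example with scikit-learn’s SelectKBest. The synthetic dataset below is an assumption used purely to keep the example self-contained.

```python
from sklearn.datasets import make_regression  # synthetic data for illustration
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic dataset: 20 columns, of which only a few are truly informative.
X, y = make_regression(n_samples=200, n_features=20, n_informative=5, random_state=0)

# Keep the 5 columns with the strongest statistical relationship to the label.
selector = SelectKBest(score_func=f_regression, k=5)
X_reduced = selector.fit_transform(X, y)

print(X.shape, "->", X_reduced.shape)               # (200, 20) -> (200, 5)
print("Selected column indices:", selector.get_support(indices=True))
```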

Step 4: Data split for Training and Testing

Once the data has been cleansed and you have selected the most predictive set of columns, it is time to prepare the data for training a model. The data is split into a training and a testing dataset, usually in the ratio of 70:30, using techniques that ensure the two sets contain completely disjoint sets of observations. There are other techniques as well, where the data is split into many subsets (folds) and the model is trained and validated several times, each time holding out a different fold for evaluation. This technique is called Cross-validation.
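A minimal sketch of both ideas with scikit-learn, assuming a feature matrix X and label vector y such as those from the earlier sketches:

```python
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression

# 70:30 split into disjoint training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Cross-validation: split the training data into 5 folds and train/validate
# 5 times, each time holding out a different fold for evaluation.
model = LinearRegression()
scores = cross_val_score(model, X_train, y_train, cv=5)
print("Fold scores:", scores, "mean:", scores.mean())
```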

Step 5: Choosing Model(s) for Training and Hyperparameter Tuning

There are many types of algorithms (models) available. The choice of model for training is primarily driven by the training time and the resources available. The following questions can help you narrow down the best model for a given task: Is the model good for linear or non-linear relationships? What is its training speed? What is its resource consumption? What is its accuracy? Is it more suitable for datasets with a large number of columns? How many Hyperparameters are used?

Once the model is selected, you need to train it on the training dataset prepared in the previous steps. This can be done on your local machine or on a CPU/GPU based compute cluster, depending on the complexity of the model. A lot can be done with simple models that assume a linear relationship, such as a Linear Regression model, which can be trained even in Excel!

Each model has two sets of parameters. The first set is internal parameters, whose values cannot be changed by you but are learned by the model from the data during training. The other set is called Hyperparameters, whose values are set by you and affect the duration of the training and the accuracy of the predictions. Hyperparameters are usually set manually based on heuristics or domain knowledge, but this process can also be automated by providing a range of values and then training the model using either all combinations (a grid search) or a random selection (a random search) from the cartesian product of these values.
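A hedged sketch of automated hyperparameter tuning with scikit-learn’s GridSearchCV, which tries every combination in the grid using cross-validation; the model and the parameter ranges below are illustrative choices, not recommendations.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Hyperparameters are set by you; the model's internal parameters are learned from data.
param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [3, 5, None],
}

# Try every combination (the cartesian product) using 5-fold cross-validation.
search = GridSearchCV(RandomForestRegressor(random_state=0), param_grid, cv=5)
search.fit(X_train, y_train)

print("Best hyperparameters:", search.best_params_)
best_model = search.best_estimator_
```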

Step 6: Model Evaluation and Selection

Once you have the trained model, it is time to test its accuracy on the testing dataset before deploying it to production. Model scoring is the process of running a trained model on a testing dataset to predict the labels, and evaluation is the process of comparing the predicted labels with the actual labels of the testing dataset. The choice of evaluation metrics depends on the type of algorithm used to train the model. Finally, you select the best model for deployment to production.
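For a regression model, scoring and evaluation might look like the sketch below (a classification model would instead use metrics such as accuracy, precision and recall); it assumes the best_model and test set from the previous steps.

```python
from sklearn.metrics import mean_absolute_error, r2_score

# Scoring: run the trained model on the testing dataset to predict the labels.
y_pred = best_model.predict(X_test)

# Evaluation: compare the predicted labels with the actual labels.
print("Mean absolute error:", mean_absolute_error(y_test, y_pred))
print("R squared:", r2_score(y_test, y_pred))
```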

Step 7: Model Deployment and Consumption

Once the model has been trained on the training dataset and tested on the testing dataset, it is time to deploy this trained model and then consume it to create predictions on new data. Trained models can be deployed to a local machine or exposed as a web service that can be called through a REST API. Finally, the deployed model can be consumed from web applications, programming languages, or even applications such as Excel and Power BI.
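As a simple hedged illustration, a trained scikit-learn model can be saved to a file and reloaded wherever it is consumed locally, while consuming a deployed web service is sketched with a placeholder URL; the endpoint, authorisation header and payload format are assumptions and depend entirely on the platform you deploy to.

```python
import joblib
import requests

# Persist the trained model to disk and reload it later for local consumption.
joblib.dump(best_model, "trained_model.pkl")
loaded_model = joblib.load("trained_model.pkl")
print(loaded_model.predict(X_test[:1]))

# Consuming a model deployed behind a REST API (hypothetical endpoint and schema).
response = requests.post(
    "https://example.com/score",              # placeholder URL
    json={"data": X_test[:1].tolist()},
    headers={"Authorization": "Bearer <your-api-key>"},
)
print(response.json())
```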

Step 8: Model Interpretability and Data Drift

Model Interpretability, also referred to as Model Explainability, is the process of understanding how each input feature contributes to the outcome of the model. This transparency is important to understand what’s going on and to trust the output. Sometimes Model Interpretability is considered earlier, in Step 5, as a criterion in choosing the right model for training.
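One model-agnostic way to get a first view of feature contributions is permutation importance: shuffle one feature at a time and measure how much the model’s score drops. A minimal sketch with scikit-learn, reusing best_model and the test set from the earlier steps:

```python
from sklearn.inspection import permutation_importance

# Shuffle each feature in turn and measure how much the test score degrades;
# features whose shuffling hurts the score most contribute most to the outcome.
result = permutation_importance(best_model, X_test, y_test, n_repeats=10, random_state=0)

for idx in result.importances_mean.argsort()[::-1]:
    print(f"Feature {idx}: importance {result.importances_mean[idx]:.3f}")
```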

Drift in model input data leads to model performance degradation. Hence, monitoring data drift is important to proactively identify model performance issues and re-train the model on the old plus newly accumulated data to restore accuracy.
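A very simple drift check (a sketch, not a production monitoring solution) is to compare the distribution of each incoming feature with the distribution the model was trained on, for example with a Kolmogorov–Smirnov test from SciPy; the significance threshold below is an illustrative assumption, and the test set merely stands in for newly arrived data.

```python
from scipy.stats import ks_2samp

def detect_drift(train_column, new_column, threshold=0.05):
    """Flag drift when the two samples are unlikely to come from the same distribution."""
    statistic, p_value = ks_2samp(train_column, new_column)
    return p_value < threshold

# Compare each training feature with the same feature in newly arrived data
# (illustrated here with the test set).
for i in range(X_train.shape[1]):
    if detect_drift(X_train[:, i], X_test[:, i]):
        print(f"Feature {i}: drift detected - consider re-training the model")
```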

Machine Learning Canvas

You can plan the process of Machine Learning using a canvas as shown below. One can think of the eight-step Machine Learning process as the “What” space and the Machine Learning Canvas as the “How” space. The “How” space is all about how you go about it. Part 2 of this blog will explain this process and the specialist terms in detail. Watch this space!

Summary

At its most fundamental level, the key to successfully building Machine Learning models is to understand the value proposition. Next, whether it is for training, validating, testing or inferring, good quality data is the lifeline of Machine Learning models. You need to carefully choose the right algorithm to train your model based on the problem you are trying to solve. Model Explainability is important to identify any bias, and in some cases it is also useful for compliance and regulatory purposes. Machine Learning is an iterative process, and your trained models do not remain fit forever; they need regular re-training.

Thanks for reading. If you enjoyed learning more about Machine Learning or are already well versed on the subject, consider looking into our open roles, such as this Microsoft Azure Data and Power Platform Consultant position.

Man S Chawla https://www.linkedin.com/in/msinghuk/
Capgemini Microsoft Blog

A Senior Solutions Architect with Capgemini, certified in Microsoft Azure Architect Design and Designing an Azure Data Solution, with over 15 years of experience.