Introduction to Machine Learning Pipeline on AzureML Studio

Jun 30, 2021

Image from google

Introduction

Traditional programming tackles a problem with data and a hand-written set of hard-coded rules. That approach breaks down for problems with large amounts of data, where we cannot always come up with the right rules. This is where machine learning comes in: it uses the data to find the rules or patterns itself.

Let’s take an example. A function that takes two inputs and returns their product is a small, well-defined problem, so traditional programming is the best-suited approach. For a problem like recognizing a handwritten number in an image, where you are not sure which characteristics distinguish one number from another, machine learning is the approach needed.
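To make the first case concrete, here is a minimal sketch of the hand-written rule. The function name `multiply` is only an illustration and does not come from any library.

```python
def multiply(a: float, b: float) -> float:
    """Traditional programming: the rule (multiplication) is written by hand."""
    return a * b


# The programmer supplied the rule; no data was needed to discover it.
print(multiply(3, 4))  # prints 12
```

For handwritten digits there is no comparable one-line rule to write down, which is why the rule has to be learned from examples instead.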

In this article I am going to explain the basics of machine learning followed by a problem solved using Azure designer.

Data science process

Image from devopedia

Step 1: Problem definition

This is the first step of the process. Defining the problem gives direction to what needs to be solved; without it, even a complex algorithm won’t help.

Step 2: Data Collection

It is the procedure of collecting the most relevant sample data using standard techniques such as questionnaires, surveys, documents, records and web scraping, while also considering issues like privacy and data quality.

Step 3: Exploratory Data Analysis

It is the process of finding patterns in data. It is much like detective work: the data is a mystery and you need to find insights in it.

Using plotting tools and techniques, you can draw conclusions such as the relationship between input and output features, the type of distribution an input feature follows, whether the data needs any transformation, whether any feature has missing values, and much more.
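Two of these checks, missing values and the relationship between an input and the output, can be sketched in a few lines of plain Python. The rows below are made-up toy values, not the taxi dataset, and the helper names `count_missing` and `pearson` are hypothetical.

```python
import math

# Hypothetical toy rows: (trip distance, fare); None marks a missing value.
rows = [
    (1.0, 6.5),
    (2.5, 11.0),
    (None, 9.0),
    (4.0, 16.5),
    (7.5, 28.0),
    (3.0, None),
]


def count_missing(rows, column):
    """Count rows where the given column index holds None."""
    return sum(1 for row in rows if row[column] is None)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


missing_distance = count_missing(rows, 0)
missing_fare = count_missing(rows, 1)

# Keep only complete rows before measuring the input/output relationship.
complete = [row for row in rows if None not in row]
distances = [d for d, _ in complete]
fares = [f for _, f in complete]
correlation = pearson(distances, fares)

print("missing distance values:", missing_distance)
print("missing fare values:", missing_fare)
print("distance/fare correlation:", round(correlation, 3))
```

A correlation close to 1 suggests a roughly linear relationship, which is one hint that a linear model may be a reasonable starting point.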

Step 4: Model Building

In this step, you split the input data into three different datasets as follows:

1. Train dataset: This data is used to train the model.
2. Validate dataset: This data is used in hyperparameter tuning to find the best hyperparameters. Tuning on a separate validation set also avoids data leakage from the test set.
3. Test dataset: This data is used to evaluate the model performance.

Alert!

Parametric vs Non-Parametric model

As the name implies, parametric models are those with a fixed number of parameters. They assume a form for the function that maps input to output, and training finds the parameters of that function. Here the parameters are also known as coefficients. You will learn more about this in the coming section on Linear Regression.

Non-parametric models are those which do not make any assumption about the mapping function between input and output data.

The model is then built using a hyperparameter tuning technique, in which the train dataset is used to fit the model and the validate dataset is used to evaluate each candidate.

Then, using the best hyperparameters, fit the model on the train data and evaluate its performance on the test data.
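The split-then-tune procedure can be sketched as follows. This is only an illustration under stated assumptions: the data is synthetic, the 60/20/20 split ratio is arbitrary, and a simple k-nearest-neighbours regressor (with k as the hyperparameter) stands in for whatever model you are tuning. All function names are hypothetical.

```python
import random


def split_dataset(rows, train_frac=0.6, validate_frac=0.2, seed=0):
    """Shuffle rows and cut them into train, validate and test lists."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_validate = int(len(shuffled) * validate_frac)
    train = shuffled[:n_train]
    validate = shuffled[n_train:n_train + n_validate]
    test = shuffled[n_train + n_validate:]
    return train, validate, test


def knn_predict(train, x, k):
    """Predict y for x as the mean y of the k nearest training points."""
    nearest = sorted(train, key=lambda row: abs(row[0] - x))[:k]
    return sum(y for _, y in nearest) / len(nearest)


def mean_squared_error(train, rows, k):
    """Mean squared error of the k-NN model on the given rows."""
    errors = [(knn_predict(train, x, k) - y) ** 2 for x, y in rows]
    return sum(errors) / len(errors)


# Synthetic data: y is roughly 3x + 2 with Gaussian noise.
rng = random.Random(42)
data = [(x, 3 * x + 2 + rng.gauss(0, 1)) for x in [i / 10 for i in range(100)]]

train, validate, test = split_dataset(data)

# Hyperparameter tuning: fit on train, compare candidates on validate.
candidates = [1, 3, 5, 9, 15]
best_k = min(candidates, key=lambda k: mean_squared_error(train, validate, k))

# Final check on the held-out test set, which played no part in choosing k.
test_mse = mean_squared_error(train, test, best_k)
print("best k:", best_k, "test MSE:", round(test_mse, 3))
```

Because the test rows are never used while choosing k, the final number is an honest estimate of how the model behaves on unseen data.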

Use of Azure Designer

Azure Designer is a cloud service (a web-based service that offers data storage, computing, etc.) used to create and deploy models through a drag-and-drop UI.

Many problems in machine learning, such as image processing and text processing, require a large amount of computation power and data storage. For these, there are readily available cloud services equipped with machine learning models, libraries and development environments. Azure is one of them, and you will get to know it to some extent by the end of this article.

Dataset

The dataset used in this article is Newyork-Taxi-Sample-data, which can be downloaded by clicking the link.

Number of features/ columns:

Independent features: 13
Dependent feature: 1

Number of data points/ rows: 11735

For details on the data, columns, descriptions and data types, please check out this link:

NYC Taxi & Limousine Commission — yellow taxi trip records

Breaking down the problem:

1. Business Objective

Using the independent features, the task is to predict the dependent feature, which is totalAmount.

2. Framing the Machine Learning Problem

If the predicted value is numerical and continuous, the problem is Regression. E.g. predicting height, the price of a house or click-through rate.

If the predicted value is one label out of N possible labels, the problem is Classification. E.g. spam detection, sentiment analysis, cancer detection etc.

Both of the above are supervised ML problems, where both the features and the labels are fed to the model as input during training. Other ML problems fall under the unsupervised category, which is not the focus of this article and so is not explained here.

Which ML problem is it?

In the dataset, the predicted value, i.e. totalAmount, is numerical and continuous, so it is a Regression problem.

3. Evaluation Metric

RMSE: It stands for root mean squared error.

Image from James Moody

To understand it better, let’s break it down into the following steps:

Step 1: find the predicted value, y_pred.

Step 2: find the squared error, i.e. the square of the difference between the predicted value and the true value.

Squared_error = (y_pred - y_true)²

Step 3: repeat Steps 1 and 2 for all n data points and sum the squared errors.

Step 4: divide the value obtained from Step 3 by n, where n is the total number of data points.

Step 5: take the square root of the value obtained from Step 4.
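The steps above can be written as a short Python function. This is a minimal sketch using only the standard library; the sample values are made up for illustration.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    # Step 2: squared error for every data point.
    squared_errors = [(p - t) ** 2 for p, t in zip(y_pred, y_true)]
    # Step 3: sum over all n data points.
    total = sum(squared_errors)
    # Step 4: divide by n to get the mean squared error.
    mean = total / len(y_true)
    # Step 5: square root of the mean.
    return math.sqrt(mean)


y_true = [10.0, 12.0, 15.0, 20.0]
y_pred = [11.0, 12.0, 13.0, 24.0]

# Squared errors are 1, 0, 4 and 16, so the result is sqrt(21 / 4).
print(rmse(y_true, y_pred))
```

Because the errors are squared before averaging, one large miss (the last point here) dominates the score, and the final square root brings the result back to the units of the target.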

4. Model: Linear regression

It is an approach for describing the relationship between the dependent and independent variables by fitting a line in two dimensions (one independent feature plus the dependent one), a plane in three dimensions and a hyperplane in d dimensions.

Image from towardsai.net

In the above diagram, X is an independent or input feature using which y, which is a dependent feature, needs to be predicted.

The dots are different data points and the line represents the relationship between X and y.

The general equation of a line is:

y = mx + c, where m is the slope and c is the intercept.

Alert!

Model vs Algorithm

An algorithm is a procedure for finding the coefficients from data; it belongs to the learning process. The model is what the algorithm produces once it has learned from the data.

E.g. Linear Regression is an algorithm, but the coefficients it learns from the data, such as the slope and intercept, together with the resulting equation, are the model.

When training a linear regression, the algorithm tries to find the line that fits the data best by learning its coefficients.
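For a single input feature, the best-fitting line can be found in closed form with ordinary least squares. The sketch below assumes that setting and uses made-up points; `fit_line` is a hypothetical helper, and this is not necessarily the exact procedure Azure uses internally.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (m, c) for y = m*x + c."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    c = mean_y - m * mean_x
    return m, c


xs = [1.0, 2.0, 3.0, 4.0]
ys = [5.0, 7.0, 9.0, 11.0]  # these points lie exactly on y = 2x + 3

m, c = fit_line(xs, ys)
print("slope:", m, "intercept:", c)
```

Here `fit_line` plays the role of the algorithm, while the returned pair (m, c) is the model.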

There is a lot more to know about linear regression, such as:

Mathematical formulation
Cost function
Optimization
Cases like featurization, effect of outliers etc.

which are beyond the scope of this article.

For more information, you can follow these links:

https://en.wikipedia.org/wiki/Linear_regression

https://machinelearningmastery.com/linear-regression-for-machine-learning/

Train a linear regression using Azure Machine Learning

Open Azure Machine Learning studio in a web browser.

Adding Data-set

Create train pipeline

Select Datasets from the Home > Designer > Authoring menu.
Drag and drop the relevant data-set onto the canvas.
Similarly, drag and drop the Split Data, Train Model, Score Model and Evaluate Model modules onto the canvas and connect them.
After the above step, the canvas will look like the diagram below:

The arrows represent the flow of data.
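To see what flows along those arrows, the same sequence of stages (split, train, score, evaluate) can be mimicked in plain Python. This is not the Azure SDK or what the designer runs internally; it is a sketch on synthetic data with one input feature, and every function name is hypothetical.

```python
import math
import random


def split_data(rows, fraction=0.7, seed=0):
    """Split stage: shuffle and cut rows into two parts."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * fraction)
    return shuffled[:cut], shuffled[cut:]


def train_model(rows):
    """Train stage: least-squares line for a single feature."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in rows)
             / sum((x - mean_x) ** 2 for x, _ in rows))
    return slope, mean_y - slope * mean_x


def score_model(model, rows):
    """Score stage: attach a prediction to every row."""
    slope, intercept = model
    return [(x, y, slope * x + intercept) for x, y in rows]


def evaluate_model(scored):
    """Evaluate stage: RMSE over the scored rows."""
    return math.sqrt(sum((pred - y) ** 2 for _, y, pred in scored) / len(scored))


# Synthetic stand-in for the taxi data: amount grows with distance plus noise.
rng = random.Random(1)
dataset = [(d, 2.5 * d + 3 + rng.gauss(0, 0.5)) for d in [i / 4 for i in range(1, 81)]]

# Each output feeds the next stage, like the arrows on the canvas.
train_rows, test_rows = split_data(dataset)
model = train_model(train_rows)
scored = score_model(model, test_rows)
error = evaluate_model(scored)
print("RMSE on held-out rows:", round(error, 3))
```

Each function consumes the output of the previous one, which is the dependency the arrows on the canvas express.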

Click the Submit button to start training.

Training takes some time. Once it starts, the running logs open automatically so you can follow progress.

Evaluate model and visualize

Once training completes, click the Evaluate Model module on the canvas to view the evaluation results.

Conclusion

So far, we have covered a 10,000-foot view of what machine learning is, the data science process and linear regression, along with the basics of Azure ML.

To conclude, machine learning is a continuous process of data gathering, exploratory data analysis, model training, evaluation and deployment, and cloud tools like Azure designer make it easier to carry out.