Beginner’s guide to Azure Machine Learning Studio using custom dataset

Published in Analytics Vidhya · 10 min read · Jun 25, 2020

Before we talk about anything else, how about we begin with a friendly example? When you receive an email, the provider automatically places it in either the inbox or the spam folder. Most of the time, emails land in the right folder, but sometimes even the mails we wanted to see in our inbox are marked as spam 😕. So who does this job for us? 🤔

Machine Learning is the magician in the background here!

Machine learning makes a task easier by learning from a set of data. How often a mail is correctly placed in the inbox, and how often it is not, depends on how accurate the machine learning model is.

In simple terms, machine learning uses algorithms that learn from many examples of similar data in order to perform a specific task in a particular domain.

When I began to learn Machine Learning, I found this course by Google that helped me understand the concepts better. You can practice Machine Learning anywhere as long as you have a machine with good computational capacity. Once you have a model working well, what next? How do you put it into action? You need to deploy it somewhere, right? There are many options to choose from: you can rent space on the cloud, create your own environment and deploy there, or pick an existing service provider such as Amazon Web Services, Microsoft Azure or Google Cloud.

Out of all the above options, I found Microsoft Azure the easiest to begin with because of the user-friendly interface offered by Azure Machine Learning Studio.

Before we begin, you need a Microsoft account. Then follow this link, https://studio.azureml.net, and click on the Sign in button. At the time of writing, Azure Machine Learning Studio offers 10 GB of free storage, which is enough to begin with. It also has paid subscriptions based on API and storage usage.

Once you log in, you should be able to see something similar to this:

On the left panel, you will find the ML Studio tools. All options except Settings are only useful once you have set up one or more experiments. Here, I shall only cover experiments; in the next write-up, I shall talk about using machine learning models as web services.

An experiment is a collection of different modules forming a workflow.

If you have some experience in machine learning, you will know that training a model generally involves these steps:

Data Acquiring
Data Preprocessing
Splitting the data
Model selection
Feeding the model with data
Model evaluation

The same pattern is followed by experiments in the ML Studio as well.

Data Acquiring

Whenever someone wants to create their own machine learning service, they usually use their own data. Here too, we shall use a custom dataset. ML Studio gives you the option to use data from a blob in Azure Storage, from an Azure SQL Database, from a URL, or from a dataset stored in the Studio itself, among others. We shall look at storing the dataset in the Studio.

We shall use the SMS Spam Collection Data Set available here. It is a collection of SMS messages classified as ham or spam. Download the dataset and unzip the contents. Now, let’s take it to Azure. In the ML Studio, click on Datasets, then on New > FROM LOCAL FILE, and select the SMSSpamCollection file. Some custom datasets include column names, while some don’t. Ours doesn’t, so we shall select the (.nh.tsv) format, where ‘nh’ stands for no header. If your dataset has column names, select the file format without ‘nh’.

Now click Experiments in the left column, click the New button and select Blank Experiment. A window with an empty experiment should appear:

The left panel has all the items that you can choose from while creating an experiment. All you have to do is drag and drop them into the center, where the experiment is visualized. The right side has options for modifying the parameters of whichever item you select.

Now, let’s import our data. In the left column, click on Saved datasets > My Datasets > SMSSpamCollection and drag it to the center of the experiment like this:

Let’s have a look at what our data looks like. Right click on SMSSpamCollection > dataset > Visualize.

A window should appear like this:

The first column represents the ‘labels’ and the next column represents the messages that will be used as ‘features’.

Data Preprocessing

The data has no column names, so placeholder headers such as ‘Col1’ and ‘Col2’ appear instead. Let’s rename these headers. In the left column, search for ‘edit metadata’, drag Edit Metadata and place it under SMSSpamCollection. As you can see, each item has ‘ends’ that can be connected to the ‘ends’ of other items. Connect the lower end of SMSSpamCollection to the upper end of Edit Metadata.

Now click on Edit Metadata. On the right side, a panel should appear. Click on Launch column selector. A window should appear asking which columns you want to select. Select the columns you want to rename; here they are Col1 and Col2. Now under New column names, type the following:

Label,Text

Here, the column names should be separated by commas (,) and follow the order in which the columns were selected in the Launch column selector menu. Right click on Edit Metadata and click Run selected. A process should begin, and once it finishes, you should see a tick mark beside the item’s name. Each time a process finishes running, a tick mark appears beside the item name. We now have two columns named Label and Text.
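If you would like to see what this renaming step amounts to outside the Studio, here is a minimal Python sketch. It assumes the file is plain tab-separated text with the label first and the message second, as described above. The two sample rows are made up for illustration, and load_headerless_tsv is a hypothetical helper, not something ML Studio provides.

```python
import csv
import io

# Two made-up rows in the same layout as the SMSSpamCollection file:
# a label, a tab, then the message text, with no header row.
SAMPLE_TSV = "ham\tSee you at five\nspam\tWIN a FREE prize now\n"

# Names are applied in the same order as the columns (Col1, Col2).
COLUMN_NAMES = ["Label", "Text"]


def load_headerless_tsv(handle, column_names):
    """Read tab-separated rows and attach the given column names to each row."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [dict(zip(column_names, row)) for row in reader]


rows = load_headerless_tsv(io.StringIO(SAMPLE_TSV), COLUMN_NAMES)
print(rows[0])
```

To try it on the real file, you could pass an open file handle instead of the io.StringIO object.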

Now, we need to tell ML Studio that the ‘Text’ column is the feature column in the data. Let’s do it. Add another Edit Metadata, place it under the previous Edit Metadata and connect the two like this:

Click on the new Edit Metadata, select Launch column selector and select the Text column. Now on the right panel, set Data Type to ‘String’ and Fields to ‘Features’. Right click on the second Edit Metadata item and click Run selected.

Machine learning models only understand numbers, and we have textual data. We need a numerical representation of our data so that the model can learn from it. But before that, let’s do some preprocessing on it.

Search for Preprocess Text in the left column and add it below Edit Metadata and connect them.

Now click on Preprocess Text and, in the right panel, click on Launch column selector and select the Text column. Leave the rest of the fields as they are. Right click on Preprocess Text and click Run selected. Now let’s see what the preprocessed data looks like. Right click on Preprocess Text > Results dataset > Visualize. A window should appear.

ML Studio SMS Spam Collection preprocessed data

A column named Preprocessed Text has been added to the data.
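To get a feel for what text preprocessing does, here is a rough Python sketch of a few typical cleaning steps: lowercasing, dropping URLs, digits and punctuation, and removing stop words. It is only an approximation for illustration, not the actual implementation behind the Preprocess Text item, and the stop-word list is a tiny hypothetical one.

```python
import re

# A tiny, hypothetical stop-word list for illustration only.
STOP_WORDS = {"a", "an", "the", "to", "is", "you", "at"}


def preprocess(text):
    """Lowercase, drop URLs, digits and punctuation, then remove stop words."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # remove URLs
    text = re.sub(r"[^a-z\s]", " ", text)      # keep only letters and whitespace
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return " ".join(tokens)


cleaned = preprocess("WIN a FREE prize now!!! Visit http://example.com 2day")
print(cleaned)
```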

Let’s convert this textual data to some numerical form. Search for Feature Hashing and place it under Preprocess Text. Connect these two items.

Now, click on Feature Hashing and, in the right panel, click on Launch column selector and select the Preprocessed Text column. I shall not go into the details of Feature Hashing, but in simple terms, it gives us numerical features representing the text in our feature column. Leave the other parameters as they are. Right click on Feature Hashing and click Run selected. Once the process has finished, right click on Feature Hashing > Transformed dataset > Visualize.

The feature hashing process created 1024 feature columns representing our text data.
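For the curious, the idea behind feature hashing (often called the hashing trick) can be sketched in a few lines of Python. Each token is hashed to one of a fixed number of buckets, and the bucket counts become the feature vector. This sketch assumes a 10-bit hash, since 2 to the power of 10 gives 1024 buckets, and uses MD5 from the standard library purely as a stand-in; the hash function and tokenization that ML Studio uses internally may well differ.

```python
import hashlib

NUM_BITS = 10               # assumption: 2**10 = 1024 buckets
NUM_BUCKETS = 2 ** NUM_BITS


def bucket_for(token):
    """Map a token to a stable bucket index in the range [0, NUM_BUCKETS)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_BUCKETS


def hash_features(text):
    """Return a fixed-length count vector for a whitespace-tokenized text."""
    vector = [0] * NUM_BUCKETS
    for token in text.split():
        vector[bucket_for(token)] += 1
    return vector


vec = hash_features("win free prize now win")
print(len(vec), sum(vec))
```

Notice that the vector length stays the same no matter how many distinct words appear in the data, which is what makes this approach convenient for text.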

Splitting the data

In the left column, search for Split Data and drag it below the Feature Hashing item. Connect these two items like this:

Click on Split Data and on the right side, under ‘Fraction of rows’, type ‘0.7’. This sends 70% of the data to the first output for training and the remaining 30% to the second output for evaluation. Leave the rest of the parameters unchanged.
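Here is a small Python sketch of the same idea: shuffle the rows, then cut them at 70%. The function name split_rows and the seed value are my own choices for illustration; the fixed seed simply makes the split repeatable.

```python
import random


def split_rows(rows, fraction=0.7, seed=42):
    """Shuffle a copy of rows and split it into (first_part, second_part)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    cut = round(len(shuffled) * fraction)
    return shuffled[:cut], shuffled[cut:]


train, test = split_rows(range(10))
print(len(train), len(test))
```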

Model Selection and Feeding the model with data

We have our training data ready. Now, we need to tell the experiment that we want to train our model. But before we do so, let’s select the best features out of the 1024 generated features so that our model is simpler. Search for Filter Based Feature Selection, drag it and place it under Split Data, connecting both the items.

Click on Filter Based Feature Selection and, in the right panel, you’ll be asked to select the target. Click on Launch column selector and select the Label column. In the field Number of desired features, enter 100. This means that of all the 1024 columns, we want the 100 best features. There are various Feature scoring methods, which you can read about in detail here. Now right click on Filter Based Feature Selection and click Run selected. Once the process has finished, you can view the selected features by right clicking Filter Based Feature Selection > Filtered dataset > Visualize.
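To make the idea of filter based selection concrete, here is a rough Python sketch that scores every feature column against the label and keeps the top k. It assumes Pearson correlation as the scoring method, which is only one of several possible choices, and the column names and values are invented for the example.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric lists (0.0 if constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def top_k_features(columns, labels, k):
    """Return the names of the k columns with the largest absolute correlation."""
    scores = {name: abs(pearson(values, labels)) for name, values in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical feature columns for four messages; 1 = spam, 0 = ham.
labels = [1, 1, 0, 0]
columns = {
    "f1": [2, 1, 0, 0],  # rises with the label
    "f2": [1, 0, 1, 0],  # unrelated to the label
    "f3": [0, 0, 1, 1],  # exact opposite of the label
}
selected = top_k_features(columns, labels, k=2)
print(selected)
```

A feature that moves exactly opposite to the label is just as informative as one that moves with it, which is why the sketch ranks by absolute value.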

From the search bar, search for Train Model and place it under Filter Based Feature Selection. Now connect the end of Filter Based Feature Selection with the upper second end of Train Model. To train the model, you need to specify the target/label. To do this, in the right panel, click on Launch column selector and select the Label column. We have told the experiment that we want to train a model. But wait! Which algorithm? We shall use Two Class Support Vector Machine. Search for it in the left column and place it near Train Model. Connect the end of Two Class Support Vector Machine with the upper first end of Train Model.
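If you are wondering what training a support vector machine roughly looks like, below is a minimal Python sketch of a linear SVM trained by sub-gradient descent on the regularized hinge loss. This is a generic textbook-style version under my own assumptions (learning rate, regularization strength, number of epochs, and a made-up two-feature toy dataset); it is not the algorithm that the Studio item runs internally.

```python
def decision(weights, bias, x):
    """Raw SVM score: positive leans spam, negative leans ham."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def train_linear_svm(samples, labels, epochs=200, lr=0.1, reg=0.01):
    """Fit weights and bias by sub-gradient descent on the hinge loss.

    labels must be +1 (spam) or -1 (ham).
    """
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            margin = y * decision(weights, bias, x)
            # L2 regularization shrinks the weights a little on every step.
            weights = [w * (1 - lr * reg) for w in weights]
            if margin < 1:
                # The point is inside the margin: move the boundary away from it.
                weights = [w + lr * y * xi for w, xi in zip(weights, x)]
                bias += lr * y
    return weights, bias


# Toy data (made up): feature 0 counts "free", feature 1 counts "meeting".
samples = [[2, 0], [1, 0], [0, 1], [0, 2]]
labels = [1, 1, -1, -1]

weights, bias = train_linear_svm(samples, labels)
predictions = [1 if decision(weights, bias, x) > 0 else -1 for x in samples]
print(predictions)
```

The same division of labour shows up in the experiment: the algorithm item defines how to learn, while Train Model supplies the data and the label column.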

Model evaluation

We told the model to train on the training data. Now let’s specify what data it will be evaluated on: the other part of the split data. But before we do that, remember that Feature Hashing created 1024 columns alongside the ‘Label’, ‘Text’ and ‘Preprocessed Text’ columns. We do not need the ‘Text’ and ‘Preprocessed Text’ columns anymore since they are in textual form. Let’s get rid of them first. Search for Select Columns in Dataset in the left column, drag it and place it under Split Data. Connect the lower second end of Split Data to the upper end of Select Columns in Dataset.

Click on Select Columns in Dataset and, in the right panel, click on Launch column selector. Go to the ALL COLUMNS tab, where you will see a drop-down. Select Exclude and, in the text box beside it, enter ‘Text’ and ‘Preprocessed Text’, one at a time. This tells the column selector to include all the columns except the ones mentioned here.

Now that we have this filtered data ready, we need to tell the model to predict on this test data. To do this, search for Score Model and place it between Train Model and Select Columns in Dataset. Connect the upper first end of Score Model to the lower end of Train Model and the upper second end of Score Model to the lower end of Select Columns in Dataset.
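In plain Python, excluding columns is a one-liner over each row. The sketch below uses a made-up row, and the names hash_1 and hash_2 are hypothetical stand-ins for the hashed feature columns.

```python
def exclude_columns(rows, excluded):
    """Return copies of the row dicts without the excluded column names."""
    excluded = set(excluded)
    return [{k: v for k, v in row.items() if k not in excluded} for row in rows]


# A made-up row: the label, both text columns and two hashed features.
rows = [{
    "Label": "spam",
    "Text": "WIN now!",
    "Preprocessed Text": "win now",
    "hash_1": 1,
    "hash_2": 0,
}]
filtered = exclude_columns(rows, ["Text", "Preprocessed Text"])
print(filtered)
```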

Right click on Score Model and click Run selected. Once the process finishes, the experiment will have added two columns to the dataset: Scored Labels and Scored Probabilities.

You can view these columns by right clicking on Score Model > Scored dataset > Visualize. But before you do so, you will need to add a Select Columns in Dataset item and exclude all the 1024 feature columns, so that the scored labels and their probabilities are easy to find. You can skip this if you do not wish to view the scored dataset.
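As a rough illustration of how a label and a probability can come out of one classifier score, here is a small Python sketch. It assumes the raw score is squashed with a logistic (sigmoid) function and cut at 0.5. That is a common convention, but how ML Studio actually calibrates its probabilities is not something this article covers, so treat this purely as an assumption.

```python
import math


def score(decision_value, threshold=0.5):
    """Turn a raw classifier score into (scored_label, scored_probability)."""
    probability = 1.0 / (1.0 + math.exp(-decision_value))  # logistic squashing
    label = "spam" if probability >= threshold else "ham"
    return label, probability


for value in (-2.0, 0.0, 2.0):
    print(value, score(value))
```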

Once we have our scores ready, we need some metrics to determine how accurate our model is, right? Let’s do so. Search for Evaluate Model and place it below Score Model. Connect both the items.

When all the items in the experiment have been placed and connected, they should look like this:

Click on the big Run button and let the processes finish. This might take a while. Once the experiment has run completely, right click on Evaluate Model > Evaluation results > Visualize.

A window should appear where you will see the ROC graph. Scroll down a bit and you should see something like this:

We get more than 95% accuracy, which is pretty good.
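If you want to compute such metrics yourself, accuracy, precision, recall and F1 can all be derived from four counts. Here is a minimal Python sketch; the labels in the example are invented and have nothing to do with the results above.

```python
def evaluate(actual, predicted, positive="spam"):
    """Compute accuracy, precision, recall and F1 for a two-class problem."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)  # spam caught
    fp = sum(a != positive and p == positive for a, p in pairs)  # ham flagged
    fn = sum(a == positive and p != positive for a, p in pairs)  # spam missed
    tn = sum(a != positive and p != positive for a, p in pairs)  # ham passed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Invented labels for five messages.
actual = ["spam", "spam", "ham", "ham", "ham"]
predicted = ["spam", "ham", "ham", "ham", "spam"]
metrics = evaluate(actual, predicted)
print(metrics)
```

Since spam datasets usually contain far more ham than spam, it is worth looking at precision and recall as well as accuracy.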

That’s it! You just trained a model on Azure Machine Learning Studio. We shall look at how to use the model as a web service in my next blog.

Thank you for reading!😄

Written by Het Pandya