Code or No Code? Build a model on Jupyter Notebook and Microsoft Azure Studio

Andrew Kinyua
Published in Analytics Vidhya · 10 min read · Oct 18, 2020

The New Data Dawn ~ Photo Location: Nairobi, Kenya. Photo Credit: Andrew Kinyua

Welcome! It's been a while since I last posted on the platform. This time, however, I shall focus solely on the specialized field of Data Science rather than my ideological views. The field has grown over the years, coupled with great innovations. As a beginner, learner, or contributor in the Data Science field, do you really need to learn to code to build a model?

In this post, we shall review the steps to build a predictive model on two different platforms: the Jupyter Notebook and Microsoft Azure Machine Learning Studio. The main aim of the post is to understand the steps to build a model and how we can achieve them on each platform.

NB: I am not affiliated in any way with either of the platforms above.

Let us begin!

Generally, there are several steps to build a model as discussed below:

  1. Data Collection — Reviewing and understanding the problem, then selecting the appropriate dataset.
  2. Data Preparation — Data cleaning, data labeling, data visualization, data normalization, and data splitting.
  3. Model Selection — Choosing the right algorithm to solve the problem, often based on previous research on how the algorithm performed on similar problems.
  4. Training — Training the model on the selected dataset to improve the final predicted output.
  5. Evaluation — Exposing the model to unseen data it was not trained on. This is the best indicator of how the model would perform in a real-world scenario.
  6. Parameter Tuning — Once we have evaluated the model, we improve its performance by fine-tuning specific parameters. The step is similar to tuning your radio frequency to get a clearer signal for the channel you listen to.
  7. Prediction — The final step, where the generated model is used to predict/solve the problem.

In this post, we shall walk through these steps on the 'normal' Jupyter Notebook and on the MS Azure platform.

The chosen dataset for the post is the classic Iris dataset; however, in this case we shall use the Iris Two Class dataset found among the MS Azure sample datasets. It has two classes, 100 rows, and 4 features.

Data Collection / Dataset Load

Microsoft Azure Platform

To retrieve the Iris Two Class dataset, click on Saved Datasets in Microsoft Azure Machine Learning Studio, select Samples, then scroll down to the dataset as depicted below (the dataset is the last item in the second image). Select and drag it to your workspace:

Iris Two Class Dataset Load

We can now visualize the dataset details such as the number of classes in the dataset on MS Azure by:

  1. Right-click on the dataset.
  2. Select dataset.
  3. Select Visualize.

The visualization page (shown below) provides statistics for each column of the dataset, such as the unique values, mean, max, min, missing values, and standard deviation. To review the statistics of any column, select that specific column. Isn't it easy?

Dataset Description / Visualization

Let’s review how the same can be achieved using Jupyter Notebook.

Jupyter Notebook

The same process can be achieved on Jupyter Notebook by loading the dataset CSV file as depicted below (the CSV file should be in the same directory as the Jupyter Notebook):

Iris Two Class Dataset Load
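For readers following along in code, here is a minimal sketch of the load step with Pandas; the filename iris_two_class.csv is hypothetical, so use whatever name your CSV file carries:

```python
import pandas as pd

# Load the Iris Two Class dataset from the working directory.
# The filename below is illustrative -- adjust it to your own file.
df = pd.read_csv("iris_two_class.csv")
print(df.head())
```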

We can then preview a description of the dataset with the aid of some inbuilt Pandas DataFrame functions.

Dataset Description / Visualization
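A minimal sketch of the description step, assuming the DataFrame from the load step is named df:

```python
# Per-column statistics: count, mean, std, min, max, and quartiles.
print(df.describe())

# Column data types and non-null counts.
df.info()
```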

We can also check for class imbalance by counting the number of rows in each class. In this case, we have an equal number of rows for each class:

Number of classes in the dataset
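A one-line sketch of the class count, assuming the target column is named Class (adjust to your dataset's column name):

```python
# Number of rows per class; equal counts mean the dataset is balanced.
print(df["Class"].value_counts())
```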

Further, we can check whether any null values exist in any of the columns of the dataset. In this case, we iterate through each column, returning True if there is a null value and False otherwise:

Check for Null values
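A minimal sketch of the null check; Pandas effectively does the per-column iteration for us:

```python
# True for every column containing at least one null value, else False.
print(df.isnull().any())
```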

From this, we know that the dataset does not have any null values and is balanced. This information is important for the next step, i.e., data preparation.

Data Preparation

In this step, since our dataset is 'clean', we shall focus on normalization and data splitting. From the description of the dataset, the value range of each column (the difference between its maximum and minimum values) is small, so the normalization step here mainly serves to show how it can be achieved on each platform.

Microsoft Azure Platform

To add the normalization step in your Experiment, one needs to:

  1. In the left corner of the MS Azure Studio workspace, click on the Data Transformation drop-down.
  2. Click on the Scale and Reduce drop-down.
  3. Select Normalize Data and drag it to your workspace.
Data Normalization

NB: You have to connect the dataset already in your workspace to the Normalize Data module. Join the arrows to make the link.

The platform offers different transformation/normalization methods that one can use for their dataset as depicted in the picture below:

Normalization methods

One can select their choice of normalization method from ZScore / MinMax / Logistic / LogNormal / Tanh. We shall use the MinMax transformation method for this post.

At this point, your workspace should look as below:

Data Normalization Workspace

To normalize the dataset, you can click on Run. Easy?

The dataset values after MinMax Normalization:

Normalized dataset

Split Data

Now that we have our dataset normalized, we can split the data into train and test datasets. To achieve this, we load the Split Data module:

  1. Click on the Data Transformation drop-down.
  2. From the listed choices, click on the Sample and Split drop-down.
  3. Click on Split Data and drag it to the workspace.
Split Data

To split the data, you must connect the normalized data output arrow as the input of Split Data. Note that Split Data has two output arrows, since the train and test data are produced separately, as represented below:

Split Data

To select how to split your data, click on the Split Data module in your workspace, and different options appear in the right window:

Split Data

There are different options when splitting the data; for the splitting mode you can choose:

  1. Split Rows: Use this option if you just want to divide the data into two parts. You can specify the percentage of data to put in each split. By default, the data is divided 50/50.
  2. Regular expression split: Choose this option when you want to divide your dataset by testing a single column for a value.
  3. Relative expression split: Use this option whenever you want to apply a condition to a number column. The number can be a date/time field, a column that contains age or dollar amounts, or even a percentage.

This is as per the MS Azure documentation, which can be found here: https://docs.microsoft.com/en-us/azure/machine-learning/algorithm-module-reference/split-data

In this case, the data was split by rows, with 70% selected for train data and 30% for test data.

Jupyter Notebook

Dataset normalization can be undertaken with sklearn's inbuilt scaling functions. In this case, we shall show how to transform via MinMax and ZScore (referred to as StandardScaler in sklearn) respectively.

We first load the data for processing as a NumPy array, fit it with the MinMaxScaler object we created (scaler), and finally return the transformed data as a NumPy array (new_data). We print out the MinMaxScaler output to compare with MS Azure; the values are the same. We shall use the MinMax normalization for the training.

MinMax Normalization Step
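A minimal sketch of the MinMax step, assuming the feature columns are everything except the Class target:

```python
from sklearn.preprocessing import MinMaxScaler

# Extract the four feature columns as a NumPy array.
X = df.drop(columns=["Class"]).to_numpy()

# Fit the scaler and rescale every feature into the [0, 1] range.
scaler = MinMaxScaler()
new_data = scaler.fit_transform(X)
print(new_data[:5])
```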

For the Z-Score, we undertake the same steps as for the MinMaxScaler; however, we change the MinMaxScaler function to StandardScaler.

ZScore / StandardScaler Normalization Step
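The same sketch with StandardScaler swapped in, reusing X from above:

```python
from sklearn.preprocessing import StandardScaler

# Z-score: centre each feature to zero mean and scale to unit variance.
z_scaler = StandardScaler()
z_data = z_scaler.fit_transform(X)
print(z_data[:5])
```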

Our dataset is now normalized.

For a detailed explanation of MinMax / ZScore, check out the links below by Serafeim Loukas.

Split Data

Since our data is normalized, we can now split it into train and test sets. We utilize the train_test_split module available in sklearn, as represented below:

Split Data
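A minimal sketch of the split, using the normalized data from above; random_state is an illustrative addition that pins the shuffle for reproducibility:

```python
from sklearn.model_selection import train_test_split

# 70/30 train/test split by rows, mirroring the MS Azure setup.
y = df["Class"].to_numpy()
X_train, X_test, y_train, y_test = train_test_split(
    new_data, y, test_size=0.3, random_state=42
)
```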

The train_test_split module in sklearn splits the dataset by rows. The test size is set to 30% and the train size to 70%.

Our dataset is now normalized and already split into train and test. Fast and flexible?

Model Selection

MS Azure

For this dataset, the model selected on the MS Azure platform was the Two-Class Neural Network (a binary classifier). You can select it as shown below:

Model Selection

We can then preview the different parameters available for the chosen model:

Model Selection

As can be seen above, there are different parameters for the model that can be altered to influence its performance. The Two-Class Neural Network has only one hidden layer, and one can only alter the number of hidden neurons in that layer. Flexible?

Jupyter Notebook

On Jupyter Notebook, the selected model was the Multilayer Perceptron, leveraging the sklearn library.

Model Selection
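A minimal sketch of the model definition; the parameter values shown are illustrative, not necessarily the ones used in the screenshot:

```python
from sklearn.neural_network import MLPClassifier

# One hidden layer, echoing the Azure Two-Class Neural Network.
model = MLPClassifier(
    hidden_layer_sizes=(100,),  # number of hidden neurons
    activation="relu",          # no equivalent setting found on Azure
    solver="adam",              # no equivalent setting found on Azure
    max_iter=500,
    random_state=42,
)
```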

The parameters set for the model differ from the ones set on MS Azure. Notably, the activation and solver parameters can be tweaked on the Notebook; I did not find similar parameters in MS Azure, so I am not sure of their default values there.

Training

MS Azure

To train the model with the set parameters, we add the Train Model module as shown below:

Model Train

We have to select the 'Class' / target column: launch the column selector and choose the Class / target column of your dataset, as shown below:

Target variable selector

Jupyter Notebook

On Jupyter Notebook, we fit the model on the train dataset as below:

Fit Model
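A minimal sketch of the fit step, using the training split from earlier:

```python
# Train the MLP on the normalized training data.
model.fit(X_train, y_train)
```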

Evaluation

The trained model is then evaluated on the test dataset to assess its performance.

MS Azure

To test the model, we load the Score Model module by dragging it into the workspace:

Score Model

To retrieve the output of the model, right-click on Score Model -> Scored dataset, and select Visualize:

The output should be:

Score Model Output

The Score Model output does not show the performance of the model, hence we have to include the Evaluate Model module to retrieve the performance using different metrics.

We can add it by:

  1. Click on the Machine Learning drop-down on the left.
  2. Click on the Evaluate drop-down.
  3. Drag and drop Evaluate Model to your workspace.
Evaluate Model

Now we can evaluate the performance of the model on the test data clearly:

Model Evaluation

We can see that the accuracy of the model is 0.40 on the test data and the AUC, with a threshold of 0.5, is 0.500.

Jupyter Notebook

We can undertake the evaluation step similarly with a few lines of code:

Model Evaluation
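A minimal sketch of the evaluation, assuming the Class labels are encoded as 0 and 1:

```python
from sklearn.metrics import accuracy_score, roc_auc_score

# Accuracy on the held-out test split.
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))

# AUC uses the predicted probability of the positive class (label 1).
y_proba = model.predict_proba(X_test)[:, 1]
print("AUC:", roc_auc_score(y_test, y_proba))
```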

From this, we can see that the accuracy of the model is 0.43 and the AUC is 0.23.

Disclaimer: The evaluation of the model here does not include cross-validation which would be a better representation of the overall performance of the generated model on both platforms. It is highly recommended to implement cross-validation for a better evaluation of your model.

To improve the performance of the model, one can modify several parameters such as:

  1. The number of hidden neurons in the hidden layer.
  2. The learning rate of the created neural network.
  3. The maximum number of iterations.

Further, one can also undertake a GridSearchCV to determine the optimal parameters for the model.
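A minimal GridSearchCV sketch; the grid values are illustrative, and the search cross-validates each parameter combination on the training data:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

# Illustrative grid over the parameters listed above.
param_grid = {
    "hidden_layer_sizes": [(10,), (50,), (100,)],
    "learning_rate_init": [0.001, 0.01],
    "max_iter": [200, 500],
}
search = GridSearchCV(MLPClassifier(random_state=42), param_grid, cv=5)
search.fit(X_train, y_train)
print(search.best_params_)
```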

You can find the Jupyter Notebook Code here: https://github.com/Ankin01/Build-a-model-using-Jupyter-Notebook

You can find the complete diagram of MS Azure steps here:

https://github.com/Ankin01/Build-a-model-using-Microsoft-Azure

Now that you have seen the two approaches:

What is your preference? Code or No Code? Give your feedback & connect:

LinkedIn: Andrew Kinyua

Tweet @KinyuaAndrew

Happy Data Wrangling!
