Data Science šŸ‘Øā€šŸ’»: Introduction to Orange Tool Part-2

Manthan Bhikadiya šŸ’” · Published in Geek Culture · 6 min read · Aug 28, 2021

Welcome to the Data Science Blog Series. Do check out my previous blog in the series here.

The best way to predict the future is to create it.

~ Abraham Lincoln

This blog is all about how to split data into training and testing sets using the Orange tool. We will also learn more about the Test & Score widget and explore the cross-validation method in Orange.

Prerequisite:

Introduction to Orange Tool Part-1

Train Test Split:

Now if youā€™re from an ML DL background then you might know why this is a more important step. For those who donā€™t know we split our data into two parts train data and test data. We train our model on train data and then we will test our model on test data. Test Data remain unseen during training so thatā€™s why we can actually get an idea that how our model performs on unseen data.

For the train test split, I used the workflow below.

Train Test Split Workflow

Here, as usual, I load the iris.tab dataset (which ships with Orange) into the File widget.

After that, I pass the whole dataset into the Data Sampler widget, where we will partition the dataset into train and test data.

Data Sampler Configuration

As you can see, I split the data in an 80:20 ratio, i.e. 80% train data and 20% test data. At the bottom, you can see that 120 of the 150 data points are used for training and 30 for testing.

After splitting the data, I connect the Data Sampler to the Test & Score widget with two links: one for the train data and another for the test data.

Data Sample -> Data ( Train Data )

Remaining Data -> Test Data ( Test Data )
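
If you prefer scripting to widgets, here is a minimal sketch of the same split outside Orange, using scikit-learn (my choice for the code examples in this post, not something the Orange GUI itself uses). The 80:20 ratio mirrors the Data Sampler configuration above.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load the iris dataset (150 data points, 4 features, 3 classes)
X, y = load_iris(return_X_y=True)

# 80:20 split, mirroring the Data Sampler settings above
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(len(X_train), len(X_test))  # 120 30
```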

Now, for model creation, I used the Random Forest, SVM (Support Vector Machine), and kNN (k-Nearest Neighbors) widgets. Each of these widgets provides a machine learning algorithm. Connect all of them to the Test & Score widget.

The Test & Score widget needs two inputs:

(1) Data ( Train and/or Test )

(2) A machine learning algorithm ( learner )

When we use a train/test split, we always evaluate the model on the test data, so we have to select that option in the Test & Score widget.

Test & Score Widget Properties

As you can see, ā€œTest on test dataā€ is selected on the left side, i.e. the results you see on the right side come from testing on the test data.

We get very good results with all the algorithms ( approx. 98% CA, a.k.a. classification accuracy ).
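
Continuing the sketch above (and reusing X_train, X_test, y_train, and y_test from it), this is roughly what Test & Score does when ā€œTest on test dataā€ is selected: fit each learner on the train data and report classification accuracy on the held-out test data. The hyperparameters below are scikit-learn defaults, not Orange's widget settings.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# The same three learners as in the workflow above
models = {
    "Random Forest": RandomForestClassifier(random_state=42),
    "SVM": SVC(),
    "kNN": KNeighborsClassifier(n_neighbors=5),
}

for name, model in models.items():
    model.fit(X_train, y_train)        # train on the 120 training points
    ca = model.score(X_test, y_test)   # classification accuracy (CA) on the 30 test points
    print(f"{name}: CA = {ca:.3f}")
```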

What is the effect of splitting the data on the classification result/model?

With Splitting vs Without Splitting

As you can see, accuracy is a little higher with splitting, but that’s not always the case. Here we have a very clean and small dataset ( 150 data points ), but when you have many more data points and you don’t split your data, your model might overfit without you noticing. So it’s always good to split the data into train and test sets, so that we can see how our model performs on an unseen dataset ( the test data ).
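
One way to see the point of the split in code (a sketch, reusing the arrays from the snippets above): compare accuracy on the training data with accuracy on the held-out test data. A large gap between the two is the classic symptom of overfitting, and without a split you would never see it.

```python
from sklearn.tree import DecisionTreeClassifier

# A deliberately flexible model: an unpruned decision tree can memorize the training set
tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

train_acc = tree.score(X_train, y_train)  # accuracy on data the model has already seen
test_acc = tree.score(X_test, y_test)     # accuracy on unseen data
print(f"train: {train_acc:.3f}, test: {test_acc:.3f}")
# If train accuracy is far above test accuracy, the model is overfitting.
```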

Cross-Validation:

Cross-validation is a resampling procedure used to evaluate machine learning models on a limited data sample.

We can do cross-validation using the Test & Score widget. Note that, unlike the train/test split, cross-validation uses the whole dataset, not just the train data or the test data.

For cross-validation, I used the following workflow:

Cross-Validation Workflow

Now youā€™re familiar with this workflow itā€™s a very simple workflow we will directly focus Test & Score widget.

Test & Score with Cross-Validation

As you can see, I set Number of folds = 10, i.e. the data is split into 10 folds, each fold takes a turn as the test set while the models are trained on the remaining 9, and we then get the averaged result. Cross-validation is a very powerful technique for evaluating a model. We can also use it to find out whether our model is overfitting.
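
The scripting equivalent of this configuration (again scikit-learn rather than the Test & Score widget, reusing X, y, and the models dictionary from earlier) is a one-liner per learner:

```python
from sklearn.model_selection import cross_val_score

# 10-fold cross-validation over the whole dataset (all 150 points)
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=10)  # one accuracy score per fold
    print(f"{name}: mean CA = {scores.mean():.3f} (+/- {scores.std():.3f})")
```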

After that, the Test & Score widget is connected to the Confusion Matrix widget, in which we can see the results; from the confusion matrix, we can then select data points and view them in the Data Table widget.

Confusion Matrix ( with nothing selected )
Misclassified data

As you can see, I first select Misclassified in the Confusion Matrix and then view those rows in the Data Table widget. This is how we can explore our results using the Confusion Matrix and Data Table widgets.
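
A small sketch of the same inspection in code. I use cross-validated predictions here (an assumption on my part, so that every data point gets an out-of-fold prediction, as in the widget), then build the confusion matrix and pull out the misclassified rows:

```python
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier

# Each point is predicted by a model that never saw it during training
y_pred = cross_val_predict(KNeighborsClassifier(), X, y, cv=10)

print(confusion_matrix(y, y_pred))  # rows: true class, columns: predicted class

# The equivalent of selecting "Misclassified" in the Confusion Matrix widget
misclassified = X[y != y_pred]
print(f"{len(misclassified)} misclassified data points")
```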

What is the effect of cross-validation on model output/accuracy?

Without Cross-Validation vs With Cross-Validation

As you can see, accuracy decreases a little with cross-validation, but it is still very good performance. Without cross-validation we test our model once; with cross-validation we test it K times ( K = number of folds ), each time on a different subset of the dataset. That’s why, after getting good accuracy on the test data, you should always confirm that accuracy by performing cross-validation.

While using cross-validation, you can also compare models on metrics like accuracy, precision, recall, F1-score, AUC, etc.
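
For completeness, a hedged sketch of such a multi-metric comparison in scikit-learn (the metric names below are scikit-learn scorer strings, not the column labels Orange shows):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

scoring = ["accuracy", "precision_macro", "recall_macro", "f1_macro"]

cv_results = cross_validate(
    RandomForestClassifier(random_state=42), X, y, cv=10, scoring=scoring
)
for metric in scoring:
    print(f"{metric}: {cv_results['test_' + metric].mean():.3f}")
# AUC needs probability estimates; with a probabilistic model you could add "roc_auc_ovr".
```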

For more details on cross-validation, read this article here.

Configured Files:

Conclusion:

I hope you can now work in the Orange tool by yourself. I tried to cover as many things as I could, and now you can explore more on your own.

Do check out more features of the Orange tool here.

Keep Exploringā€¦!!šŸ‘

LinkedIn:

Github:

Thanks for reading! If you enjoyed this article, please hit the clap šŸ‘ button as many times as you can ( max 50 times šŸ˜‚ ). It would mean a lot and encourage me to keep sharing my knowledge. If you like my content, follow me on Medium; I will try to post as many blogs as I can.
