Data Science šŸ‘Øā€šŸ’»: Introduction to Orange Tool Part-2

Manthan Bhikadiya šŸ’” · Published in Geek Culture · 6 min read · Aug 28, 2021

Welcome to the Data Science Blog Series. Do check out my previous blog in the series here.

The best way to predict the future is to create it.

~ Abraham Lincoln

This blog is all about how to split data into training and testing sets using the Orange tool. We will also learn more about the Test & Score widget and explore the cross-validation method in Orange.

Prerequisite:

Introduction to Orange Tool Part-1

Train Test Split:

Now if youā€™re from an ML DL background then you might know why this is a more important step. For those who donā€™t know we split our data into two parts train data and test data. We train our model on train data and then we will test our model on test data. Test Data remain unseen during training so thatā€™s why we can actually get an idea that how our model performs on unseen data.

For the train test split, I used the workflow below.

Train Test Split Workflow

Here, as usual, I load the iris.tab dataset (which ships with Orange) into the File widget.

After that, I pass the whole dataset into the Data Sampler widget, where we will partition the dataset into train and test data.

Data Sampler Configuration

As you can see, I split the data in an 80:20 ratio, i.e. 80% train data and 20% test data. At the bottom, you can see that 120 of the 150 data points are used for training and 30 for testing.

After splitting the data, I connect the Data Sampler to the Test & Score widget with two links: one for the train data and another for the test data.

Data Sample -> Data ( Train Data )

Remaining Data -> Test Data ( Test Data )
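
If you prefer scripting to widgets, here is a minimal sketch of the same split outside Orange, using scikit-learn (my choice for the code examples in this post, not something the Orange GUI itself uses). The 80:20 ratio mirrors the Data Sampler configuration above.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load the iris dataset (150 data points, 4 features, 3 classes)
X, y = load_iris(return_X_y=True)

# 80:20 split, mirroring the Data Sampler settings above
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(len(X_train), len(X_test))  # 120 30
```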

Now, for model creation, I used the Random Forest, SVM (Support Vector Machine), and kNN (k-Nearest Neighbors) widgets. Each of these widgets provides a machine learning algorithm. Connect all of them to the Test & Score widget.

The Test & Score widget needs two inputs:

(1) Data ( Train and/or Test )

(2) A machine learning algorithm ( learner )

When we use a train/test split, we always evaluate the model on the test data, so we have to select that option in the Test & Score widget.

Test & Score Widget Properties

As you can see, ā€œTest on test dataā€ is selected on the left side, i.e. the results you see on the right side come from testing on the test data.

We get very good results with all the algorithms ( approx. 98% CA, a.k.a. classification accuracy ).
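
Continuing the sketch above (and reusing X_train, X_test, y_train, and y_test from it), this is roughly what Test & Score does when ā€œTest on test dataā€ is selected: fit each learner on the train data and report classification accuracy on the held-out test data. The hyperparameters below are scikit-learn defaults, not Orange's widget settings.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# The same three learners as in the workflow above
models = {
    "Random Forest": RandomForestClassifier(random_state=42),
    "SVM": SVC(),
    "kNN": KNeighborsClassifier(n_neighbors=5),
}

for name, model in models.items():
    model.fit(X_train, y_train)        # train on the 120 training points
    ca = model.score(X_test, y_test)   # classification accuracy (CA) on the 30 test points
    print(f"{name}: CA = {ca:.3f}")
```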

What is the effect of splitting the data on the classification result/model?

With Splitting vs Without Splitting

As you can see, accuracy is a little higher with splitting, but that’s not always the case. Here we have a very clean and small dataset ( 150 data points ), but when you have many more data points and you don’t split your data, your model might overfit without you noticing. So it’s always good to split the data into train and test sets, so that we can see how our model performs on an unseen dataset ( the test data ).
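
One way to see the point of the split in code (a sketch, reusing the arrays from the snippets above): compare accuracy on the training data with accuracy on the held-out test data. A large gap between the two is the classic symptom of overfitting, and without a split you would never see it.

```python
from sklearn.tree import DecisionTreeClassifier

# A deliberately flexible model: an unpruned decision tree can memorize the training set
tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

train_acc = tree.score(X_train, y_train)  # accuracy on data the model has already seen
test_acc = tree.score(X_test, y_test)     # accuracy on unseen data
print(f"train: {train_acc:.3f}, test: {test_acc:.3f}")
# If train accuracy is far above test accuracy, the model is overfitting.
```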

Cross-Validation:

Cross-validation is a resampling procedure used to evaluate machine learning models on a limited data sample.

We can do cross-validation using the Test & Score widget. Note that, unlike the train/test split, cross-validation uses the whole dataset, not just the train data or the test data.

For cross-validation, I used the following workflow:

Cross-Validation Workflow

Now youā€™re familiar with this workflow itā€™s a very simple workflow we will directly focus Test & Score widget.

Test & Score with Cross-Validation

As you can see, I set Number of folds = 10, i.e. the data is split into 10 folds, each fold takes a turn as the test set while the models are trained on the remaining 9, and we then get the averaged result. Cross-validation is a very powerful technique for evaluating a model. We can also use it to find out whether our model is overfitting.
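
The scripting equivalent of this configuration (again scikit-learn rather than the Test & Score widget, reusing X, y, and the models dictionary from earlier) is a one-liner per learner:

```python
from sklearn.model_selection import cross_val_score

# 10-fold cross-validation over the whole dataset (all 150 points)
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=10)  # one accuracy score per fold
    print(f"{name}: mean CA = {scores.mean():.3f} (+/- {scores.std():.3f})")
```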

After that, the Test & Score widget is connected to the Confusion Matrix widget, in which we can see the results; from the confusion matrix, we can then select data points and view them in the Data Table widget.

Confusion Matrix ( with nothing selected )
Misclassified data

As you can see, I first select Misclassified in the Confusion Matrix and then view those rows in the Data Table widget. This is how we can explore our results using the Confusion Matrix and Data Table widgets.
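
A small sketch of the same inspection in code. I use cross-validated predictions here (an assumption on my part, so that every data point gets an out-of-fold prediction, as in the widget), then build the confusion matrix and pull out the misclassified rows:

```python
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier

# Each point is predicted by a model that never saw it during training
y_pred = cross_val_predict(KNeighborsClassifier(), X, y, cv=10)

print(confusion_matrix(y, y_pred))  # rows: true class, columns: predicted class

# The equivalent of selecting "Misclassified" in the Confusion Matrix widget
misclassified = X[y != y_pred]
print(f"{len(misclassified)} misclassified data points")
```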

What is the effect of cross-validation on model output/accuracy?

Without Cross-Validation vs With Cross-Validation

As you can see, accuracy decreases a little with cross-validation, but it is still very good performance. Without cross-validation we test our model once; with cross-validation we test it K times ( K = number of folds ), each time on a different subset of the dataset. That’s why, after getting good accuracy on the test data, you should always confirm that accuracy by performing cross-validation.

While using cross-validation, you can also compare models on metrics like accuracy, precision, recall, F1-score, AUC, etc.
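
For completeness, a hedged sketch of such a multi-metric comparison in scikit-learn (the metric names below are scikit-learn scorer strings, not the column labels Orange shows):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

scoring = ["accuracy", "precision_macro", "recall_macro", "f1_macro"]

cv_results = cross_validate(
    RandomForestClassifier(random_state=42), X, y, cv=10, scoring=scoring
)
for metric in scoring:
    print(f"{metric}: {cv_results['test_' + metric].mean():.3f}")
# AUC needs probability estimates; with a probabilistic model you could add "roc_auc_ovr".
```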

For more details on cross-validation, read this article here.

Configured Files:

Conclusion:

I hope you can now work in the Orange tool by yourself. I tried to cover as many things as I could, and now you can explore more on your own.

Do check out more features of the Orange tool here.

Keep Exploringā€¦!!šŸ‘

LinkedIn:

Github:

Thanks for reading! If you enjoyed this article, please hit the clap šŸ‘ button as many times as you can ( max 50 times šŸ˜‚ ). It would mean a lot and encourage me to keep sharing my knowledge. If you like my content, follow me on Medium; I will try to post as many blogs as I can.
