Data Science: Visual Programming with Orange Tool

Yash Alpeshbhai Patel
6 min read · Sep 9, 2021


Fig.1 Visual Programming (credit)

This blog is all about visual programming using the Orange tool. Orange is an excellent data mining tool for both novice and experienced data scientists. Thanks to its graphical user interface, users can focus on data analysis rather than laborious coding, which makes building complex data analytics pipelines straightforward.

In this post on visual programming we will answer the following questions:

  • How do we split our data into training and testing data in Orange?
  • What effect does data splitting have on the classification result/model?
  • How can we use cross-validation effectively in Orange, and what impact does it have on the model output/accuracy?

Along the way we will get a better understanding of the Test & Score widget and use the Orange tool to investigate the cross-validation method.

Workflow Creation

For visual programming I have used the workflow below (Fig.2).

Fig.2 Workflow

Prerequisite:

Getting Started With Orange Tool

Let’s begin with a basic workflow. We’ll use the File widget to load data, such as the well-known Iris data set.
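The same step can also be reproduced outside the GUI with Orange's Python scripting library. This is an illustrative sketch of the File widget's job, assuming the orange3 package is installed:

```python
import Orange

# Load the built-in Iris dataset, the scripted counterpart of
# pointing the File widget at the bundled iris data
data = Orange.data.Table("iris")

print(len(data))    # 150 instances
print(data.domain)  # four numeric attributes plus the class variable
```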

How to Split Our Data into Training and Testing Data

For this purpose we use the Data Sampler widget. We will split the data into two parts: 80% for training and 20% for testing. The first 80% is sent onwards to build a model, and the remaining 20% is kept for testing.

Fig.3 Passing Data to Data Sampler

Data Sampler

It selects a subset of data instances from an input dataset.

Inputs

  • Data: input dataset

Outputs

  • Data Sample: sampled data instances (used for training)
  • Remaining Data: out-of-sample data (used for testing)

So we pass the whole dataset into the Data Sampler widget, where we partition it into training data (80%) and test data (20%) (Fig.4).

Fig.4 Data Sampler Widget
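Outside the GUI, a comparable 80/20 split can be sketched with the scripting library. The NumPy shuffling below is my own illustration of what the widget achieves, not a description of its internals:

```python
import numpy as np
import Orange

data = Orange.data.Table("iris")

# Shuffle the row indices, then take the first 80% as the "Data Sample"
# and the remaining 20% as the "Remaining Data"
rng = np.random.default_rng(42)   # fixed seed for a reproducible split
indices = rng.permutation(len(data))
split = int(0.8 * len(data))

train = data[indices[:split]]     # 120 instances for training
test = data[indices[split:]]      # 30 instances for testing

print(len(train), len(test))
```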

What effect does data splitting have on the classification result/model?

To see the effect, we first have to create a workflow that tests learning algorithms (SVM, KNN and many more) on the data and scores them. For this we will use the Test and Score widget, which receives data from the Data Sampler and then trains, tests and scores the learner algorithms.

Test and Score

Tests learning algorithms on data.

Inputs

  • Data: input dataset
  • Test Data: separate data for testing
  • Learner: learning algorithm(s)

Outputs

  • Evaluation Results: results of testing classification algorithms

Fig.5 Data Sample -> Test and Score

After splitting the data we connect the Data Sampler to the Test & Score widget with two links: one for the training data and another for the test data. Clicking on the link opens the Edit Links dialog, where we set the links as shown below (Fig.6).

Fig.6 Link edit

Data Sample (80%) -> Data (train data)

Remaining Data (20%) -> Test Data

I used the Naive Bayes, Random Forest, Neural Network and KNN (K Nearest Neighbours) widgets to create the models. Each of these widgets wraps a machine learning method. You can connect all of them to the Test & Score widget as learners, as shown in Fig.7.

As seen in the Test and Score section above, the Test and Score widget needs two things in order to test and score:

(1) Data (train & test)

(2) One or more machine learning algorithms

Fig.7 Model Creation

After sending the models to Test & Score along with the train and test samples, we can observe their performance in the table inside the Test & Score widget. Before reading the evaluation results, however, we have to make the widget evaluate on the test samples by selecting the Test on test data option in the left panel, as shown in Fig.8. Other evaluation options are available as well, such as cross validation, leave one out, and others; when we have a separate test set, we evaluate the model on the test data.

Fig.8 Test and Score on Test Data
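Below is a minimal scripted sketch of the same evaluation, using the call convention of recent Orange 3 releases (TestOnTestData, CA). The Neural Network widget is left out for brevity, and the exact keyword names are an assumption on my part:

```python
import numpy as np
import Orange

# Recreate the 80/20 split from the Data Sampler step
data = Orange.data.Table("iris")
rng = np.random.default_rng(42)
indices = rng.permutation(len(data))
split = int(0.8 * len(data))
train, test = data[indices[:split]], data[indices[split:]]

# The learners behind the Naive Bayes, Random Forest and KNN widgets
learners = [
    Orange.classification.NaiveBayesLearner(),
    Orange.classification.RandomForestLearner(),
    Orange.classification.KNNLearner(),
]

# Train on the 80% sample and evaluate on the held-out 20%,
# mirroring the "Test on test data" option of Test & Score
results = Orange.evaluation.TestOnTestData()(data=train, test_data=test, learners=learners)

# One classification accuracy per learner, in the order given above
print(Orange.evaluation.CA(results))
```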

Now, why do we need to separate train and test data? The main reason is evaluation. Overfitting is a common problem when training a model: it happens when a model performs exceptionally well on the data we used to train it, but fails to generalise to new, previously unseen data points.

The test data acts as new, previously unseen data points, so evaluating the model on it tells us the model's actual accuracy. When the model is evaluated on the training data instead, it reports higher accuracy than on the test data, because it was already trained on the very same examples used for evaluation. Such a score does not reflect how the model generalises to real-world data; it merely shows how well it fits the training set.

Fig.9 Test on train Data vs Test on test Data

So the effect of splitting the data shows up directly in the CA (classification accuracy) of the model. Here we can see that the CA for Test on train data (left side) is higher, but we know that this is not the actual accuracy; what we really want is a model that generalises to unseen test data.

Cross Validation

Cross-validation is a statistical method for estimating machine learning model performance (or accuracy). It is used to prevent overfitting in a prediction model, especially when the amount of data available is restricted. In cross-validation, a set number of folds (or partitions) of the data are created, the analysis is conducted on each fold, and the total error estimate is averaged.

Splitting the data into train and test sets, as we did above, is itself a type of validation called holdout validation. One way to improve on the holdout method is K-fold cross validation. This strategy makes our model's score less dependent on how we happened to choose the train and test sets: the data set is subdivided into k subsets, and the holdout approach is repeated k times, each time holding out a different subset for testing.

For a deeper understanding of cross validation, refer to this blog.

How to efficiently use cross-validation in Orange?

In the same workflow we can switch the Test and Score widget to cross validation by selecting the Cross validation option in the left panel of the widget, as shown in Fig.10. We can also change the number of folds K.

Fig.10 Cross validation option in the Test and Score widget

As we can see, we have used K=10 for cross validation, so Number of folds is set to 10. The data set is subdivided into 10 subsets, and each subset in turn is held out for testing while the model is trained on the remaining nine.
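The same 10-fold cross-validation can be sketched in the scripting library as below; again this assumes the call style of recent Orange 3 releases:

```python
import Orange

data = Orange.data.Table("iris")
learners = [
    Orange.classification.NaiveBayesLearner(),
    Orange.classification.RandomForestLearner(),
    Orange.classification.KNNLearner(),
]

# 10-fold cross-validation over the whole dataset, mirroring the
# "Cross validation" option with Number of folds = 10 in Test & Score
cv = Orange.evaluation.CrossValidation(k=10)
results = cv(data, learners)

print(Orange.evaluation.CA(results))  # one accuracy per learner, averaged over the folds
```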

What is the effect of it on model output/accuracy?

Fig.11 Cross-validation vs Holdout validation

Cross-validation is a method of evaluating a machine learning model's ability to predict fresh data. It can also be used to detect issues such as overfitting or selection bias, and it gives an indication of how the model will generalise to a different dataset. Instead of a single holdout split, the evaluation is performed K times, which gives a better estimate of the model's actual accuracy. So although the cross-validation accuracy is lower, it is the more realistic and generalisable estimate.

We can analyse the same using a confusion matrix.

Fig.12 Confusion matrix
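As a rough scripted counterpart of the Confusion Matrix widget, the predictions stored in the cross-validation results from the previous sketch can be tabulated, for example with scikit-learn. Treating results.actual and results.predicted as the arrays of true and predicted classes is my assumption here:

```python
from sklearn.metrics import confusion_matrix

# `results` comes from the cross-validation sketch above;
# results.predicted holds one row of predictions per learner,
# so index 0 corresponds to the first learner in the list
print(confusion_matrix(results.actual, results.predicted[0]))
```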

Conclusion

In this post we learned visual programming with the Orange tool. We created a workflow and explored several of its features using Orange widgets.

Thank you for reading!
