A No Code Machine Learning Classification Model Using Weka

John R. Ballesteros
5 min read · Nov 14, 2023


Whether you are just starting out in Machine Learning or want to prototype models quickly, Weka offers a no-code, easy-to-use platform for common AI algorithms. Weka dates back to the data mining era, but it has been continuously updated to cover Machine Learning and Deep Learning topics. This story presents an example of a classification problem using Weka.

Start by downloading Weka. One of my previous posts shows how to do that and gives an overview of this nice tool, created a long time ago by the University of Waikato in New Zealand:

https://medium.com/@jrballesteros/weka-how-to-learn-machine-learning-for-non-experts-70e9767b08b2

Open Windows Explorer and navigate to “C:\Program Files\Weka-3-9-6\data”. Right-click the diabetes.arff file and choose Open with > Notepad. The file's metadata is displayed, including the data dictionary, that is, the name and number of fields, along with the mean, the standard deviation, and the data type of every attribute in the dataset. The records themselves can be read as well. Of course, the target variable, which in this case is the Class attribute, is also visible and appears as the last field by default. See Figure 1.

Figure 1. Metadata of the diabetes dataset.
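For readers who later want to look at the same header outside Notepad, here is an optional, minimal Python sketch of how an ARFF file is laid out. It is not part of the no-code workflow. The SAMPLE text is a hand-written, shortened excerpt in the style of diabetes.arff (an assumption for illustration), and parse_arff is a hypothetical helper that ignores edge cases such as quoted names with spaces.

```python
# Minimal ARFF header reader using only the standard library.
# SAMPLE is an illustrative, shortened excerpt, not the real file.
SAMPLE = """\
% comment lines start with a percent sign
@relation pima_diabetes
@attribute 'preg' numeric
@attribute 'plas' numeric
@attribute 'class' { tested_negative, tested_positive}
@data
6,148,tested_positive
1,85,tested_negative
"""


def parse_arff(text):
    """Return (relation name, [(attribute, type)], data rows) from ARFF text."""
    relation, attributes, rows = None, [], []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and comments
        lower = line.lower()
        if in_data:
            rows.append([value.strip() for value in line.split(",")])
        elif lower.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lower.startswith("@attribute"):
            _, name, kind = line.split(None, 2)
            attributes.append((name.strip("'\""), kind.strip()))
        elif lower.startswith("@data"):
            in_data = True  # everything after this marker is a record
    return relation, attributes, rows


relation, attributes, rows = parse_arff(SAMPLE)
print(relation)
for name, kind in attributes:
    print(f"{name}: {kind}")
print(len(rows), "records; last field of the first record:", rows[0][-1])
```

To try it on the real file, read its contents into a string and pass that string to parse_arff instead of SAMPLE.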

Close the file and open Weka. Click Explorer in the right panel, then click Open file and navigate to “C:\Program Files\Weka-3-9-6\data\diabetes.arff”. The file is now open within Weka. See Figure 2.

Figure 2. The diabetes.arff file opened in Weka. ARFF is Weka's native format, which makes it easy to store metadata alongside the data.

A Classification Problem

One way to understand a classification problem is to ask whether a set of variables (v1, v2, v3, …, vn) can determine a class, that is, whether there exists a function that maps the variables to one of the classes, f(v1, v2, v3, …, vn) → {c1, c2, …, cm}, in such a way that the error of assigning an element to a class is minimal.

A formal definition of a binary classification problem is the following. Let (X1, Y1), …, (Xn, Yn) be n independent random copies of (X, Y) ∈ 𝒳 × {0, 1}, and denote by P_{X,Y} the joint distribution of (X, Y). The so-called feature X lives in some abstract space 𝒳 (think ℝ^d), and Y ∈ {0, 1} is called the label. For example, X can be a collection of gene expression levels measured on a patient, and Y indicates whether this person suffers from obesity. The goal of binary classification is to build a rule that predicts Y given X using only the data at hand. Such a rule is a function h : 𝒳 → {0, 1} called a classifier. Some classifiers are better than others, and we favor those with a low classification error R(h) = P(h(X) ≠ Y), the probability that the prediction differs from the true label.
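The classification error can be estimated from data as the fraction of examples a classifier gets wrong. The sketch below illustrates this with a made-up one-feature dataset and a hypothetical threshold rule h. None of the numbers come from the diabetes data.

```python
def empirical_error(h, xs, ys):
    """Fraction of examples where the prediction h(x) differs from the label y."""
    mistakes = sum(1 for x, y in zip(xs, ys) if h(x) != y)
    return mistakes / len(xs)


# Hypothetical toy data: one numeric feature and a 0/1 label.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [0, 0, 1, 0, 1, 1]


def h(x):
    """A simple threshold classifier: predict 1 when x is above 3.5."""
    return 1 if x > 3.5 else 0


error = empirical_error(h, xs, ys)
print(f"Empirical error: {error:.3f}")  # two of the six examples are misclassified
```

A lower value means the rule agrees with the labels more often. Accuracy, the number Weka reports as "Correctly Classified Instances", is simply one minus this error.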

Exploratory Analysis

Many questions arise when dealing with the data of a specific field like medicine, finance, and so on. In this case, some examples are:

  • Which aspects (variables) influence diabetes the most? In other words, if only a few variables are available, which are the most important for predicting this disease?
  • Which variables are correlated with each other?
  • From a statistical point of view, what are the distributions of the variables, their data types, their availability (no null values), and their means and standard deviations?
  • Are the classes imbalanced?

These questions can be answered through an exploratory data analysis (EDA). Figure 3 shows some details of the diabetes dataset: there are 9 attributes, including the target variable, and 768 examples.

Figure 3. Number of attributes and records in the diabetes dataset.

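Since almost twice as many examples belong to the negative class (500 records) as to the positive class (268 records), the dataset is considered imbalanced: only about 35% of the records are positive.

Figure 4. The class attribute shows that the dataset is imbalanced, with approximately 35% positive examples.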
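The 35% figure follows directly from the counts reported above (768 records, 500 of them negative). A short arithmetic check in Python, using only those two numbers:

```python
total = 768      # records in the dataset
negative = 500   # records in the negative class
positive = total - negative

positive_share = positive / total
negative_to_positive = negative / positive

print(f"Positive records: {positive}")
print(f"Positive share: {positive_share:.1%}")
print(f"Negative-to-positive ratio: {negative_to_positive:.2f}")
```

A ratio close to 2 is what "almost twice as many" refers to.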

Figure 5 shows the distribution of the Plasma variable, from which two conclusions can be drawn. First, the distribution of the variable and the positive class are correlated; they are very similar. Second, it is almost impossible for a person to have diabetes when their plasma value is under 80.

Figure 5. Distribution of the Plasma attribute vs. the positive class. If the Plasma value is above 99.5, it is increasingly likely that the patient has the disease.

A Classification Algorithm

Let’s start with a Decision Tree. Click on the Classify tab, click Choose, and pick the trees/J48 algorithm. Make sure to select Cross-validation with Folds set to 10, so that every record is used for testing exactly once. Then click the Start button. Figure 6 shows the results.

Figure 6. A Decision Tree Classifier. Accuracy of 73.8%
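To make the idea behind 10-fold cross-validation more concrete, here is a rough Python sketch of how records can be dealt into ten folds while keeping the class proportions similar in each one. It illustrates the general technique and is not Weka's own implementation. The function name stratified_folds and the seed are assumptions, and the labels are rebuilt from the class counts reported earlier.

```python
import random


def stratified_folds(labels, k=10, seed=1):
    """Deal record indices into k folds, class by class, in round-robin order."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)

    folds = [[] for _ in range(k)]
    position = 0
    for indices in by_class.values():
        rng.shuffle(indices)  # randomize the order within each class
        for index in indices:
            folds[position % k].append(index)
            position += 1
    return folds


# Labels rebuilt from the class counts: 500 negative and 268 positive records.
labels = ["negative"] * 500 + ["positive"] * 268
folds = stratified_folds(labels, k=10)

for number, fold in enumerate(folds, start=1):
    positives = sum(1 for index in fold if labels[index] == "positive")
    print(f"Fold {number}: {len(fold)} records, {positives} positive")
```

In each of the ten rounds, one fold is held out for testing and the model is trained on the other nine, so every record is tested exactly once.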

Logistic Regression is an alternative classifier. Click the Choose button, select functions/Logistic, and hit the Start button. The results are shown in Figure 7.

Figure 7. Logistic Regression Classifier. Accuracy of 77.2%

However, since the dataset is imbalanced, the F-Measure is a better metric than accuracy for evaluating the results. In this case, Logistic Regression is the better of the two methods, with a value of 0.834.
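The F-Measure is the harmonic mean of precision and recall. The sketch below computes it from confusion-matrix counts. The counts are hypothetical and chosen only to show the formula; they are not the values from Figure 6 or Figure 7.

```python
def f_measure(true_positives, false_positives, false_negatives):
    """Harmonic mean of precision and recall for one class."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts for one class (NOT taken from the Weka output).
tp, fp, fn = 90, 30, 10

precision = tp / (tp + fp)
recall = tp / (tp + fn)
score = f_measure(tp, fp, fn)

print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")
print(f"F-Measure: {score:.3f}")
```

Because the harmonic mean is pulled toward the smaller of the two values, a classifier cannot reach a high F-Measure by doing well on precision alone or on recall alone.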

Feature Importance

Now go to the Preprocess tab, choose filters/supervised/attribute/AttributeSelection, and click Apply. See Figure 8 for the feature importance of the initial variables.

Figure 8. Feature importance. Plasma is the most important variable of the dataset.

However, it is possible that the first two features, “plas” and “mass”, are not the best combination, so a different search can be tried, such as SubsetEval with Greedy Stepwise. Figure 9 shows the best combination of variables.

Figure 9. The best combination of features for diabetes prediction.
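For intuition, a greedy stepwise search can be sketched as follows: start with no features, add the single feature that most improves a subset score, and stop when no addition helps. This is a simplified illustration of forward selection, not Weka's code. The feature names f1 to f4 and the scoring function toy_score are made up for the example.

```python
def greedy_forward_selection(features, score):
    """Add one feature at a time while the subset score keeps improving."""
    selected = []
    best_score = score(selected)
    while True:
        best_candidate = None
        for feature in features:
            if feature in selected:
                continue
            candidate_score = score(selected + [feature])
            if candidate_score > best_score:
                best_score, best_candidate = candidate_score, feature
        if best_candidate is None:
            return selected, best_score  # no remaining feature improves the score
        selected.append(best_candidate)


# Hypothetical usefulness of each feature, plus a small cost per feature used.
GAIN = {"f1": 0.30, "f2": 0.10, "f3": 0.08, "f4": 0.01}


def toy_score(subset):
    """Made-up merit: total gain of the subset minus 0.02 per selected feature."""
    return sum(GAIN[name] for name in subset) - 0.02 * len(subset)


subset, merit = greedy_forward_selection(list(GAIN), toy_score)
print("Selected features:", subset)
print(f"Merit: {merit:.2f}")
```

In a real evaluator the score of a subset depends on how the features interact, which is why the best pair is not always made of the two individually best features.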

Conclusion

Weka makes it easy to explore the data, see the distributions, create feature importance and develop classification models.

Plasma, mass, and age are, in that order, the three most important variables when predicting diabetes.

Support me

Enjoying my work? Show your support with Buy me a coffee, a simple way for you to encourage me and others to write. If you feel like it, just click the next link and I will enjoy a cup of coffee!


John R. Ballesteros

Ph.D Informatics, Assoc. Professor of the UN, Med. Colombia. GenAi, Consultant & Researcher AI & GIS, Serial Entrepreneur Navione Drone Services Co, Gisco Maps