Weka: How to learn Machine Learning for Non-Experts

John R. Ballesteros
Oct 22, 2023


A couple of weeks ago I was invited to teach a short Machine Learning (ML) course to high school students. I soon realized it was quite a challenge. Why? Because it raises big, difficult questions: How do you actually use ML on your own problems? How do you best prepare your data for machine learning? Which algorithm should you use? How do you choose one model over another? How do you apply machine learning without a single mathematical equation or line of programming code? While studying those questions, I found Weka. So, what is Weka?

This story presents a general tour of the Weka Workbench for ML.

Weka

Weka is open-source, stand-alone software that provides tools for data preprocessing, implementation of Machine Learning algorithms, and visualization, so that people can develop machine learning techniques and apply them to real-world data problems. These tools are summarized in the following diagram. Weka is an initiative of the University of Waikato in New Zealand.

Figure 1. What is Weka

First of all, download and install Weka on your workstation (Windows, Mac OS X, or Linux): https://waikato.github.io/weka-wiki/downloading_weka/

In Weka, you can use two kinds of files: the common CSV, or .ARFF, a modified CSV format that includes additional information about the type of each attribute (column).
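
For illustration, here is what a minimal .ARFF file looks like (a tiny, made-up weather dataset; the header declares each attribute's type before the data rows):

```
@relation weather

@attribute temperature numeric
@attribute humidity numeric
@attribute play {yes, no}

@data
85,85,no
70,96,yes
68,80,yes
```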

Exploring data

  1. Start Weka (click on the bird icon), this will start the Weka GUI Chooser.
  2. Click the “Explorer” button, this will open the Weka Explorer interface.
  3. Click the “Open file…” button and navigate to the data/ directory in your Weka installation and load the diabetes.arff dataset.
  4. Click on different attributes in the “Attributes” list and review the details in the “Selected attribute” pane.
  5. Click the “Visualize All” button to review all attribute distributions.
  6. Click the “Visualize” tab and review the scatter plot matrix for all attributes. Get comfortable reviewing the details for different attributes in the “Preprocess” tab and tuning the scatter plot matrix in the “Visualize” tab.

Rescaling data

Raw data is often not suitable for modeling. Often you can improve the performance of your machine learning models by rescaling attributes.

  1. Using the same dataset, click the “Choose” button in the “Filter” pane and select unsupervised.attribute.Normalize. Click the “Apply” button.
  2. Review the details for each attribute in the “Selected attribute” pane and note the change to the scale.
  3. Explore using other data filters such as the Standardize filter.
  4. Explore configuring filters by clicking on the name of the loaded filter and changing its parameters.
  5. Test out saving modified datasets for later use by clicking the “Save…” button on the “Preprocess” tab.
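
Under the hood, the Normalize filter rescales each numeric attribute to the range [0, 1], while the Standardize filter shifts it to zero mean and unit variance. Here is a minimal sketch of both transformations in plain Python (illustrative only, not Weka's actual implementation):

```python
from statistics import mean, pstdev

def normalize(values):
    """Min-max rescaling to [0, 1], the idea behind Weka's Normalize filter."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Zero mean, unit variance, the idea behind Weka's Standardize filter."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

glucose = [85, 183, 89, 137, 116]   # a toy sample of one attribute
print(normalize(glucose))
print(standardize(glucose))
```

Whichever rescaling you choose, apply the same transformation to any new data before making predictions with the trained model.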

Feature selection

Not all of the attributes in a dataset are relevant to the attribute you want to predict. Feature selection is the process of identifying the attributes that are most relevant to the output variable.

  1. Click the “Select attributes” tab.
  2. Click the “Choose” button in the “Attribute Evaluator” pane and select the “CorrelationAttributeEval”.
  3. You will be presented with a dialog asking you to change to the “Ranker” search method, needed when using this feature selection method. Click the “Yes” button. Click the “Start” button to run the feature selection method.
  4. Review the output in the “Attribute selection output” pane and note the correlation scores for each attribute; larger numbers indicate more relevant features.
  5. Explore other feature selection methods such as the use of information gain (entropy).
  6. Explore selecting features to remove from your dataset using the “Preprocess” tab and the “Remove” button.
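
The idea behind CorrelationAttributeEval is to score each attribute by the strength of its Pearson correlation with the class. A rough sketch of that ranking in plain Python, on made-up toy data (not Weka's code):

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy)

# Toy dataset: two candidate attributes and a 0/1 class label.
glucose = [85, 183, 89, 137, 116, 78]
age     = [31, 32, 21, 40, 30, 26]
label   = [0, 1, 0, 1, 1, 0]

# Rank attributes by |correlation| with the class, highest first.
scores = {"glucose": abs(pearson(glucose, label)),
          "age": abs(pearson(age, label))}
for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.3f}")
```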

Machine Learning Algorithms in Weka

Weka provides a large number of algorithms. Let’s take a closer look.

  1. Open the Weka GUI Chooser and then the Weka Explorer.
  2. Load the data/diabetes.arff dataset.
  3. Click the “Classify” tab.
  4. Click the “Choose” button and note the different groupings for algorithms.
  5. Click the name of the selected algorithm to configure it.
  6. Click the “More” button on the configuration window to learn more about the implementation.
  7. Click the “Capabilities” button on the configuration window to learn more about how it can be used.
  8. Note the “Open” and “Save” buttons on the window where different configurations can be saved and loaded.
  9. Hover on a configuration parameter and note the tooltip help.
  10. Click the “Start” button to run an algorithm.
  11. Browse the algorithms available. Note that some algorithms are unavailable given whether your dataset is a classification (predict a category) or regression (predict a real value) type problem.
  12. Explore and learn more about the various algorithms available in Weka.
  13. Get confidence choosing and configuring algorithms.

Estimate Model Performance

  1. Load the data/diabetes.arff dataset. Click the “Classify” tab. The “Test options” pane lists the various different techniques that you can use to evaluate the performance of an algorithm.

The gold standard is 10-fold “Cross Validation”. This is selected by default. For a small dataset, the number of folds can be adjusted from 10 to 5 or even 3. If your dataset is very large and you want to evaluate algorithms quickly, you can use the “Percentage split” option. By default, this option will train on 66% of your dataset and use the remaining 34% to evaluate the performance of your model. Alternately, if you have a separate file containing a validation dataset, you can evaluate your model on that by selecting the “Supplied test set” option. Your model will be trained on the entire training dataset and evaluated on the separate dataset. Finally, you can evaluate the performance of your model on the whole training dataset. This is useful if you are more interested in a descriptive than a predictive model.

2. Click the “Start” button to run a given algorithm with your chosen test option.

3. Experiment with different Test options. Further refine the test options in the configuration provided by clicking the “More options…” button.
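
The two most common test options can be sketched outside Weka: a 66/34 percentage split, and the index partitions that k-fold cross-validation evaluates a model on. A plain Python illustration (not Weka internals):

```python
import random

def percentage_split(data, train_pct=0.66, seed=1):
    """Shuffle, then split into train/test, like Weka's 'Percentage split'."""
    rows = data[:]
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * train_pct)
    return rows[:cut], rows[cut:]

def kfold_indices(n, k=10):
    """Yield (train, test) index pairs for k-fold cross-validation."""
    idx = list(range(n))
    for i in range(k):
        test = idx[i::k]                       # every k-th index as the test fold
        train = [j for j in idx if j not in set(test)]
        yield train, test

data = list(range(100))                        # 100 placeholder instances
train, test = percentage_split(data)
print(len(train), len(test))
```

In cross-validation every instance is used for testing exactly once, which is why it gives a more reliable performance estimate than a single split, at the cost of training k models.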

Classification Algorithms

The following are some of the most commonly used classification algorithms available in Weka:

Logistic Regression (functions.Logistic).
Naive Bayes (bayes.NaiveBayes).
k-Nearest Neighbors (lazy.IBk).
Classification and Regression Trees (trees.REPTree).
Support Vector Machines (functions.SMO).
Experiment with each of these top algorithms.

  1. Load the data/diabetes.arff dataset. Click the “Classify” tab. Click the “Choose” button.
  2. Try them out on different classification datasets, such as those with two classes and those with more.

Regression Algorithms

Many of the classification algorithms can be used for regression. Regression is the prediction of a real-valued outcome (like a dollar amount), unlike classification, which predicts a category (like “dog” or “cat”). A list of the top 5 algorithms used for regression is:

Linear Regression (functions.LinearRegression).
Support Vector Regression (functions.SMOReg).
k-Nearest Neighbors (lazy.IBk).
Classification and Regression Trees (trees.REPTree).
Artificial Neural Network (functions.MultilayerPerceptron).

  1. Open the Weka GUI Chooser and then the Weka Explorer. Load the data/housing.arff dataset.
  2. Click the “Classify” tab. Click the “Choose” button.
  3. Experiment with each of these top algorithms. Try them out on different regression datasets.
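
At its simplest, linear regression fits coefficients that minimize squared error. For a single input attribute the closed-form solution is short enough to sketch in plain Python (made-up housing-style numbers, purely illustrative):

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    mx, my = mean(xs), mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

# Toy data: number of rooms vs price (made-up values).
rooms = [4, 5, 6, 7, 8]
price = [12.0, 16.1, 19.9, 24.2, 28.0]
slope, intercept = fit_line(rooms, price)
print(f"price = {slope:.2f} * rooms + {intercept:.2f}")
```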

Ensemble Algorithms

Sometimes it is not possible to get good results with individual algorithms; ensemble methods let you choose and experiment with different combinations of sub-models. Combining techniques that work in very different ways and produce different predictions often results in better performance. Weka provides a large suite of ensemble machine learning algorithms, and this may be Weka’s second big advantage over other platforms. Top models included are:

Bagging (meta.Bagging).
Random Forest (trees.RandomForest).
AdaBoost (meta.AdaBoostM1).
Voting (meta.Vote).
Stacking (meta.Stacking).

  1. Open the Weka GUI Chooser and then the Weka Explorer.
  2. Load the data/diabetes.arff dataset. Click the “Classify” tab. Click the “Choose” button.
  3. Try them out on different classification and regression datasets.
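
The core idea behind Bagging can be sketched in a few lines: train each sub-model on a bootstrap sample of the data and combine their predictions by majority vote. In this plain Python illustration a trivial majority-class "learner" stands in for a real algorithm:

```python
import random
from collections import Counter

def bootstrap(data, rng):
    """Sample len(data) rows with replacement."""
    return [rng.choice(data) for _ in data]

def train_majority_model(sample):
    """Trivial stand-in learner: always predicts the sample's majority class."""
    majority = Counter(label for _, label in sample).most_common(1)[0][0]
    return lambda x: majority

def bagging_predict(models, x):
    """Majority vote over the ensemble's predictions."""
    votes = Counter(m(x) for m in models)
    return votes.most_common(1)[0][0]

rng = random.Random(7)
data = [(i, "yes" if i % 3 else "no") for i in range(30)]   # toy labeled data
models = [train_majority_model(bootstrap(data, rng)) for _ in range(11)]
print(bagging_predict(models, 42))
```

Because each sub-model sees a different bootstrap sample, their errors tend to differ, and voting averages those errors out.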

Algorithms Performance Comparison

Weka provides a tool, specifically designed for comparing algorithms, called the Weka Experiment Environment. It allows users to design and execute controlled experiments with machine learning algorithms and then analyze the results.

  1. Open the “Weka Chooser GUI”. Click the “Experimenter” button to open the “Weka Experiment Environment”. Click the “New” button.
  2. Click the “Add new…” button in the “Datasets” pane and select “data/diabetes.arff”. Click the “Add new…” button in the “Algorithms” pane and add “ZeroR” and “IBk”.
  3. Click the “Run” tab and click the “Start” button. Click the “Analyse” tab and click the “Experiment” button and then the “Perform test” button.

This compared the ZeroR algorithm to the IBk algorithm with default configuration on the diabetes dataset. The results show that IBk has a higher classification accuracy than ZeroR and that this difference is statistically significant (the little “v” character next to the result).

4. Expand the experiment and add more algorithms and rerun the experiment. Change the “Test base” on the “Analyse” tab to change which set of results is taken as the reference for comparison to the other results.

Model Fine-Tuning

To get the most out of a machine learning algorithm, the parameters of the method should be tuned to your problem. Since it is impossible to know the best settings beforehand, you should experiment with many different parameter values.

The Weka Experiment Environment allows users to compare the results of different algorithm parameters and whether the differences are statistically significant.

  1. Open the “Weka Chooser GUI”. Click the “Experimenter” button to open the “Weka Experiment Environment”. Click the “New” button.
  2. Click the “Add new…” button in the “Datasets” pane and select “data/diabetes.arff”.
  3. Click the “Add new…” button in the “Algorithms” pane and add 3 copies of the “IBk” algorithm.
  4. Click each IBk algorithm in the list and click the “Edit selected…” button and change “KNN” to 1, 3, 5 for each of the 3 different algorithms.
  5. Click the “Run” tab and click the “Start” button.
  6. Click the “Analyse” tab and click the “Experiment” button and then the “Perform test” button.

As can be seen in this example, the results for larger KNN values are better than the default of 1, and the difference is significant. Explore changing other configuration properties of KNN.
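
The same kind of comparison can be sketched outside Weka: a tiny one-dimensional k-nearest-neighbors classifier evaluated with leave-one-out accuracy for several k values (toy data, purely illustrative of how changing k changes the result):

```python
from collections import Counter

def knn_predict(train, x, k):
    """Majority label among the k nearest training points (1-D distance)."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def loo_accuracy(data, k):
    """Leave-one-out evaluation: predict each point from all the others."""
    hits = sum(knn_predict(data[:i] + data[i + 1:], x, k) == y
               for i, (x, y) in enumerate(data))
    return hits / len(data)

# Toy 1-D dataset with one noisy point on each side of the class boundary.
data = [(1, "no"), (2, "no"), (3, "no"), (4, "yes"), (5, "no"),
        (6, "yes"), (7, "yes"), (8, "yes"), (9, "yes"), (10, "yes")]
for k in (1, 3, 5):
    print(k, loo_accuracy(data, k))
```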

Saving a Model

Once a top performing model has been found for a specific problem, it is a good practice to save it for later use.

  1. Open the Weka GUI Chooser and then the Weka Explorer. Load the data/diabetes.arff dataset. Click the “Classify” tab.
  2. Change the “Test options” to “Use training set” and click the “Start” button.
  3. Right click on the results in the “Result list” and click “Save model” and enter a filename like “diabetes-final”. This trained a final model on the entire training dataset and saved the resulting model to a file.

This model can be loaded back into Weka and used to make predictions on new data.

4. Right-click on the “Result list” click “Load model” and select your model file (“diabetes-final.model”).

5. Change the “Test options” to “Supplied test set” and choose data/diabetes.arff (this could be a new file for which you do not have predictions).

6. Click “More options” in the “Test options” and change “Output predictions” to “Plain Text”.

7. Right click on the loaded model and choose “Re-evaluate model on current test set”. The new predictions will now be listed in the “Classifier output” pane.

Experiment saving different models and making predictions for entirely new datasets.
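
Weka stores models as serialized Java objects in .model files; the save/load/predict cycle above is the same pattern you would follow with any serialization mechanism. A sketch of that cycle using Python's pickle on a trivial stand-in model (illustrative only, not Weka's file format):

```python
import os
import pickle
import tempfile

class MajorityModel:
    """Trivial stand-in 'model': remembers the majority class seen in training."""
    def __init__(self, labels):
        self.majority = max(set(labels), key=labels.count)

    def predict(self, instance):
        return self.majority

model = MajorityModel(["no", "yes", "yes", "no", "yes"])

# Save the trained model to a file ...
path = os.path.join(tempfile.mkdtemp(), "diabetes-final.model")
with open(path, "wb") as f:
    pickle.dump(model, f)

# ... then load it back later and make predictions on new data.
with open(path, "rb") as f:
    loaded = pickle.load(f)
print(loaded.predict({"glucose": 148}))
```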

Conclusions

This story presented a general description of the capabilities of the Weka workbench; walking through a specific example in detail will be the objective of another story.

References

The full documentation and more resources can be found here: https://waikato.github.io/weka-wiki/

Support me

Enjoying my work? Show your support with Buy Me a Coffee, a simple way for you to encourage me and others to write. If you feel like it, just click the link and I will enjoy a cup of coffee!


John R. Ballesteros

Ph.D. in Informatics, Assoc. Professor at the UN, Medellín, Colombia. GenAI Consultant & Researcher in AI & GIS, Serial Entrepreneur: Navione Drone Services Co, Gisco Maps