Tutorial: Document Classification using WEKA
Introduction
This tutorial is an extension of “Tutorial Exercises for the Weka Explorer”, Section 17.5 of I. Witten et al., Data Mining (3rd edition, 2011) [1], going deeper into document classification using WEKA.
Upon completion of this tutorial you will learn the following
1. How to approach a document classification problem using WEKA
2. Which options are available in WEKA to prepare your dataset for machine learning classification algorithms
3. Which algorithms work best for this problem
4. How to evaluate the results
5. Some tips
Dataset
We will be using the ReutersCorn dataset, which is already part of the WEKA examples.
Assumptions
It is assumed that you have WEKA installed on Linux and that you have basic familiarity with the tool and with machine learning.
Document Classification Process
Document classification is also a data mining problem, so fortunately we can make use of the CRISP-DM (Cross Industry Standard Process for Data Mining) process [2], which according to Wikipedia is “a data mining process model that describes commonly used approaches that data mining experts use to tackle problems”. The process is illustrated in the diagram below.

Illustration 1: CRISP-DM Process
We will be covering all of the steps above except “Deployment”, which is not relevant to this tutorial.
Finally, it is worth noting that this process is iterative, so in real life you will need to go back and repeat steps based on the results of your evaluation and deployment.
Business Understanding
What we are doing in this tutorial is trying to find a model that can classify documents into “corn” (news mainly about corn) and “non-corn” (news not mainly about corn) with the best possible accuracy.
Data Understanding
Let’s start the real work. In the following steps we will inspect the dataset to decide what is needed to prepare the data for classification.
Manual Dataset Inspection
1. Go to “/usr/share/doc/weka/examples/” on Linux or similar path on Windows
2. You will find the following files
ReutersCorn-train.arff.gz
ReutersCorn-test.arff.gz
The first one is the training dataset and the second will be used to test the model
4. Decompress both files using the following commands
gunzip ReutersCorn-train.arff.gz
gunzip ReutersCorn-test.arff.gz
4. Open the training file and inspect it

Illustration 2: ARFF file
As you can see, the file is in WEKA’s ARFF format, which is simple: a header section describes the attributes and the class label. In our case there is one attribute, “Text”, of type String, plus the class attribute “class-att”, which is binary (0 or 1): zero for non-corn and one for corn.
The rest of the file contains the data, one instance per line, in the following format
Text, class label
Text, class label
Inspection results
As you can see in the illustration below, there are many issues in the text file, such as
1. Mixed CAPITAL/small letters
2. Special characters
There are also redundancies: synonyms for the same word (corn and maize) and variations of it (corning and cornfeed)

Illustration 3: Data cleaning examples
We will be fixing all of these issues in the next stage. Note that you need to do the same for the test file (ReutersCorn-test.arff).
Inspection using WEKA
1. Open WEKA, Click on Explorer
2. In the Preprocess tab, click Open file and select the training file
3. From the Attributes section, click on “class-att”

Illustration 4: WEKA input file initial inspection
The illustration above shows the number of instances in the file, the number of attributes per instance, and the distribution of corn and non-corn instances: 45 (2.9%) corn and 1509 (97.1%) non-corn news items.
4. You can also view and edit the file from within WEKA; click Edit as shown below

Illustration 5: WEKA Editor
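The same inspection can also be done from WEKA’s Java API if you prefer scripting to the Explorer. Below is a minimal sketch, assuming the decompressed training file sits at the path used earlier and that the class attribute (“class-att”) is the last attribute in the file:

import java.util.Arrays;
import weka.core.AttributeStats;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class InspectCorn {
    public static void main(String[] args) throws Exception {
        // Load the decompressed training file (path from the manual inspection step above)
        Instances train = new DataSource("/usr/share/doc/weka/examples/ReutersCorn-train.arff").getDataSet();
        // Assumption: the class attribute "class-att" is the last attribute
        train.setClassIndex(train.numAttributes() - 1);

        System.out.println("Instances:  " + train.numInstances());
        System.out.println("Attributes: " + train.numAttributes());

        // Per-class instance counts for the nominal class attribute
        AttributeStats stats = train.attributeStats(train.classIndex());
        System.out.println("Class counts: " + Arrays.toString(stats.nominalCounts));
    }
}

The class counts printed at the end should match the 45 corn / 1509 non-corn split described above.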
Data Preparation
In this stage we will prepare the data for classification using the information acquired during the previous phase. We will do the following
1. Cleaning
2. Text Tokenization & Transformation
3. Attribute selection
Cleaning & Tokenization
Both cleaning and tokenization can be done using the StringToWordVector (STWV) filter in WEKA
What is tokenization? To make a document classifiable with machine learning we need to perform feature extraction, that is, convert the raw text into a set of features that the ML algorithm can then use to discriminate between corn and non-corn.
In our case this is done by STWV, which treats each word in the document as a feature (hence “String To Word”) and uses its number of occurrences in each instance as the feature value.
1. Click on the Choose button below Filter
2. Choose weka->filters->unsupervised->attribute->StringToWordVector

Illustration 6: WEKA Filter
3. Click on the text “StringToWordVector” to open its options

Illustration 7: StringToWordVector Settings
4. Also click on WordTokenizer and fill in the options as specified in the following table

Illustration 8: WordTokenizer
Setting: IDFTransform/TFTransform
Value: True
Details: Instead of calculating the plain word frequency, use the TFIDF weighting [3]
Setting: LowerCaseTokens
Value: True
Details: Convert all words to lowercase
Setting: OutputWordCounts
Value: True
Details: Instead of using word occurrence (0 or 1) as the feature value, use the word frequency or TFIDF weight
Setting: Tokenizer
Value: .,;:’”()?!/ -_><&#
Details: Split words on these delimiter characters
Setting: WordsToKeep
Value: Any appropriate number
Details: How many words to keep after tokenization; this limits the number of attributes you will have
5. TIP: click Save to store the settings that you have just selected, since we will need them in later stages
6. Click OK
7. Click Apply in the Filter section — far right

Illustration 9: Attributes list after feature extraction
As you can see in the illustration above, STWV produced more than 500 attributes (features), many of which are irrelevant. So how do we know which ones are significant?
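Before we look at that, note that the GUI settings above can also be reproduced in code. The following is a rough sketch of the same StringToWordVector configuration; the WordsToKeep value of 1000 is just an example, and straight quotes stand in for the curly ones in the delimiter list:

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.core.tokenizers.WordTokenizer;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class VectorizeCorn {
    public static void main(String[] args) throws Exception {
        Instances train = new DataSource("ReutersCorn-train.arff").getDataSet();
        train.setClassIndex(train.numAttributes() - 1);   // class attribute assumed last

        StringToWordVector stwv = new StringToWordVector();
        stwv.setIDFTransform(true);       // IDFTransform = True
        stwv.setTFTransform(true);        // TFTransform = True
        stwv.setLowerCaseTokens(true);    // LowerCaseTokens = True
        stwv.setOutputWordCounts(true);   // OutputWordCounts = True
        stwv.setWordsToKeep(1000);        // WordsToKeep = any appropriate number (1000 here)

        WordTokenizer tok = new WordTokenizer();
        // Delimiters from the settings table above (straight quotes used in place of curly ones)
        tok.setDelimiters(".,;:'\"()?!/ -_><&#");
        stwv.setTokenizer(tok);

        stwv.setInputFormat(train);
        Instances vecTrain = Filter.useFilter(train, stwv);
        System.out.println("Attributes after feature extraction: " + vecTrain.numAttributes());
    }
}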
Attribute selection
There are many ways to spot the best attributes. I will list them here for your information, but for the sake of this tutorial the attributes will be left as they are
1. From the Preprocess tab you can click on each attribute and see whether it makes a good split for the data

Illustration 10: Attribute split
2. You can use the Associate tab (the Apriori algorithm) to discover relations between important attributes (you will need the FilteredAssociator with a Discretize filter to be able to use it on this dataset)

Illustration 11: Apriori

3. One of the best options is to use the “Select attributes” tab: choose InfoGainAttributeEval together with the Ranker search method, and you will get a ranked list of the most significant attributes (see the code sketch after this list)

Illustration 12: Select Attributes
4. You can also use the Visualize tab to plot the attributes and try to find key splitting attributes

Illustration 13: Visualization
5. Finally, you can run any tree algorithm and visualize the tree structure to see the main attributes used for splitting (after running J48, go to the last item in the result list, right-click, and choose Visualize tree)

Illustration 14: Visualized Decision Tree
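For reference, options 3 and 5 above can also be scripted against the vectorized data produced by the StringToWordVector sketch earlier. This is only a sketch of how that could look:

import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.InfoGainAttributeEval;
import weka.attributeSelection.Ranker;
import weka.classifiers.trees.J48;
import weka.core.Instances;

public class RankCornAttributes {
    // vecTrain is the vectorized training set from the StringToWordVector sketch above
    public static void rank(Instances vecTrain) throws Exception {
        AttributeSelection selector = new AttributeSelection();
        selector.setEvaluator(new InfoGainAttributeEval());   // InfoGainAttributeEval evaluator
        selector.setSearch(new Ranker());                     // Ranker search method
        selector.SelectAttributes(vecTrain);
        // Prints the attributes ranked by information gain
        System.out.println(selector.toResultsString());
    }

    // Option 5 from the list above: build a J48 tree and print its text form
    public static void treeView(Instances vecTrain) throws Exception {
        J48 tree = new J48();
        tree.buildClassifier(vecTrain);
        System.out.println(tree);   // the main splitting attributes appear near the top of the tree
    }
}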
Now that we have our input dataset preprocessed, it is time to start the classification process
Modeling
Note: before continuing, go back to the Preprocess tab and open the training file again (or click Undo). We need to keep the file in its initial format, since we will use the FilteredClassifier to apply the transformation automatically to both the training and test files.
First we need a Baseline to compare performance against
1. Go to the Classify tab, choose FilteredClassifier (under meta), then choose ZeroR (under rules) as the classifier and StringToWordVector as the filter; don’t forget to load the same settings that we saved earlier

Illustration 15: Filtered Classifier
2. Under Test options, choose “Supplied test set” and select ReutersCorn-test.arff

Illustration 16: Adding the test set
3. Click Start
Our current baseline shows 96% accuracy but a false positive rate of 1, which is very bad: every corn document ends up classified as non-corn. What ZeroR actually does is always predict the majority class, which in our case is the non-corn class; this is why we get such a result.

Illustration 17: ZeroR Results
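The same baseline run can be scripted as well. A minimal sketch, assuming the two ARFF files are in the working directory, the class attribute is last, and stwv is configured with the saved settings from the earlier sketch:

import weka.classifiers.Evaluation;
import weka.classifiers.meta.FilteredClassifier;
import weka.classifiers.rules.ZeroR;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class CornBaseline {
    public static void main(String[] args) throws Exception {
        Instances train = new DataSource("ReutersCorn-train.arff").getDataSet();
        Instances test  = new DataSource("ReutersCorn-test.arff").getDataSet();
        train.setClassIndex(train.numAttributes() - 1);
        test.setClassIndex(test.numAttributes() - 1);

        StringToWordVector stwv = new StringToWordVector();
        // configure stwv exactly as in the earlier sketch (IDF/TF transform, lowercase, word counts, tokenizer, ...)

        FilteredClassifier fc = new FilteredClassifier();
        fc.setFilter(stwv);               // the transformation is applied to train and test alike
        fc.setClassifier(new ZeroR());    // majority-class baseline
        fc.buildClassifier(train);

        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(fc, test);
        System.out.println(eval.toSummaryString());
        System.out.println(eval.toClassDetailsString());   // per-class TP/FP rates
    }
}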
Classification
Now we will do the classification. We will use the JRip algorithm for this tutorial; you can run your own tests on other algorithms, especially those in the Rules and Trees categories, since they take attribute correlations into account and produce a human-readable model that you can communicate.
1. Click on Filtered Classifier
2. Choose the JRip algorithm (under rules)

Illustration 18: JRIP Classifier
3. Click OK, then Start the algorithm
Evaluation
The results show higher accuracy than the baseline. Also note that we got a TP (true positive) rate of 1 for corn, which means all corn documents have been classified as corn.

Illustration 19: Final Results
The Model
The good thing about this category of algorithms is the human-readable model. Note the JRip rules above, which indicate that a document whose TFIDF value for the word “corn” is greater than 2.62 will probably be about corn.
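In code, the same rules can be obtained by swapping JRip into the baseline sketch and printing the trained classifier. A sketch reusing the objects from that earlier block:

import weka.classifiers.Evaluation;
import weka.classifiers.meta.FilteredClassifier;
import weka.classifiers.rules.JRip;
import weka.core.Instances;

public class CornJRip {
    // train and test are the Instances loaded in the baseline sketch above;
    // fc is the FilteredClassifier already configured with the saved STWV settings
    public static void run(FilteredClassifier fc, Instances train, Instances test) throws Exception {
        fc.setClassifier(new JRip());     // swap ZeroR for JRip
        fc.buildClassifier(train);

        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(fc, test);
        System.out.println(fc);           // the learned, human-readable JRip rules
        System.out.println(eval.toClassDetailsString());
    }
}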
Unbalanced dataset and 99% accuracy
Accuracy of 99+% should not be taken at face value: the dataset contains 97% of one class and only 3% of the other, which is also why ZeroR already achieved such high accuracy.
To handle this issue, we need to focus on the TP rate and ROC area instead of overall accuracy; we can also down-sample the dataset to re-balance it.
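From code, the per-class figures can be read from the Evaluation object used above, and a supervised instance filter can handle the down-sampling idea; the 1.0 spread (an at most 1:1 class ratio) is just an example value:

import weka.classifiers.Evaluation;
import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.supervised.instance.SpreadSubsample;

public class ImbalanceHelpers {
    // eval is the Evaluation object from the JRip sketch; cornIndex is the index of the corn class
    public static void reportCorn(Evaluation eval, int cornIndex) {
        System.out.println("Corn TP rate:  " + eval.truePositiveRate(cornIndex));
        System.out.println("Corn ROC area: " + eval.areaUnderROC(cornIndex));
    }

    // Down-sample the majority class so the class ratio becomes at most 1:1
    public static Instances rebalance(Instances train) throws Exception {
        SpreadSubsample spread = new SpreadSubsample();
        spread.setDistributionSpread(1.0);   // maximum class ratio of 1:1 (example value)
        spread.setInputFormat(train);
        return Filter.useFilter(train, spread);
    }
}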
References
1. I. H. Witten, E. Frank, and M. A. Hall. Data Mining: Practical Machine Learning Tools and Techniques (3rd edition). Elsevier, 2011. Ch. 17, Tutorial Exercises for the Weka Explorer: https://moodle.umons.ac.be/pluginfile.php/43703/mod_resource/content/2/WekaTutorial.pdf
2. Cross Industry Standard Process for Data Mining (Wikipedia): http://en.wikipedia.org/wiki/Cross_Industry_Standard_Process_for_Data_Mining
3. TF-IDF (Wikipedia): http://en.wikipedia.org/wiki/Tf%E2%80%93idf