NLP using WEKA

Published in Analytics Vidhya, Oct 3, 2020

Weka

Weka (Waikato Environment for Knowledge Analysis) is open-source software developed at the University of Waikato and used for data mining tasks. Weka is flexible and straightforward to use, and it is portable across platforms. It provides a range of algorithms that can be applied to any dataset you choose. The Weka GUI offers the following applications (Figure 1):

Explorer: Interface to perform the data mining tasks over raw data or data collected by scraping
Experimenter: Allows users to execute different experimental variations on datasets
KnowledgeFlow: Explorer with drag and drop functionality. Supports incremental learning from previous results
Workbench: Combines all GUI interfaces into one
Simple CLI: Command-line interface for executing Weka commands

Figure 1: Weka GUI

The following are the techniques available within the software for data processing tasks:

Methods

Association
Attribute Selection
Classifiers
Clusters
Preprocessing filters

Algorithms

KNN Classification
Multi-objective Evolutionary Algorithm
C4.5 Decision Tree
Learning Vector Quantization
Particle Swarm Optimization

Natural Language Processing (NLP)

As people increasingly deploy machines for their day-to-day work, it is important that users are kept informed about the results and about the progress of ongoing processes. NLP helps machines learn to communicate, and it helps us understand the messages a machine is trying to convey.

The following are the popular areas in which NLP is already being used:

Articles (news, blogs, etc.)
Predicting your searches
Chatbots
Email sorting
Video/site recommendations
Text-to-speech

There are many more examples. For the hands-on session, I have implemented sentiment analysis, whose results can be used to analyse users’ behaviour and habits.

Hands-on WEKA

Sentiment analysis is an important application of machine learning. It helps a machine recognise the emotion or sentiment a person expresses, and respond in kind. Here I have implemented a simple sentiment analysis on a dataset of Twitter text.

We’ll be using KnowledgeFlow for the following analysis.

Step 1: Get the dataset
I got the data from here. The dataset has four fields: “tweet_id”, “sentiment”, “author” and “content”. I loaded it with CSVLoader from DataSource. The data has around 40,000 instances (rows). Processing a dataset of this size is time-consuming, and my machine could not handle that much data, so I removed 50% of the instances using Filter.Unsupervised.Instances.RemovePercentage. Note that RemovePercentage on its own drops a fixed block of rows rather than a random sample, so if the file is ordered in any way, shuffle it first (for example with the Randomize filter). The reduced data needs to be stored in another CSV file, so one more component is needed to save it. The knowledge flow now has three components: DataSource.CSVLoader, Filter.Unsupervised.Instances.RemovePercentage and DataSink.CSVSaver.

Right-click the CSVLoader icon and select configure. You’ll be able to edit the CSVLoader properties and add a file.
Once the data is set, you can pass the dataset to the next component, i.e. the filter.
Right-click the CSVLoader and select dataset. An arrow/line will appear; join it to the filter.

Tip
You can configure and adjust the properties of every component in the same way.
The CSVSaver component gets the dataset input from the filter.

We are done with importing the dataset. After running the knowledge flow, the window looks like Figure 2.
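The load, prune and save pipeline of Step 1 can also be sketched outside Weka. The snippet below is only an illustration in plain Python, not what the Weka components do internally: it shuffles the rows first so that the removed half is a random sample. The function names, the seed and the four sample rows are made up for the example; only the column names come from the dataset described above.

```python
import csv
import io
import random


def prune_rows(rows, percentage, seed=42):
    """Return a copy of rows with roughly `percentage` percent removed at random."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # shuffle so the removed block is random
    cutoff = round(len(shuffled) * percentage / 100)
    return shuffled[cutoff:]


def prune_csv(src, dst, percentage=50.0, seed=42):
    """Read CSV text from src, drop a random share of rows, write the rest to dst."""
    reader = csv.DictReader(src)
    kept = prune_rows(list(reader), percentage, seed)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    writer.writerows(kept)
    return len(kept)


# Tiny in-memory stand-in for the tweet file (made-up rows, same four columns).
sample = io.StringIO(
    "tweet_id,sentiment,author,content\n"
    "1,happiness,user_a,what a lovely morning\n"
    "2,sadness,user_b,missing my friends today\n"
    "3,worry,user_c,not sure about the exam\n"
    "4,happiness,user_d,best concert ever\n"
)
out = io.StringIO()
kept_count = prune_csv(sample, out, percentage=50.0)
```

With real files, `src` and `dst` would be file objects opened with `newline=""`, as the `csv` module recommends.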

Figure 2: Step 1 completion

Step 2: Data Preprocessing

Sentiment analysis requires a lot of preprocessing, including:

Word parsing and tokenization
Stop-word removal
Lemmatization or stemming
Feature extraction

These steps make the later stages of processing more efficient. They not only reduce the size of the dataset but also remove redundant data that could bias the algorithm.

I have chosen the StringToWordVector filter from filters.unsupervised.attribute. Its properties can be edited by right-clicking the filter (Figure 3).
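To make the idea behind a string-to-word-vector step concrete, here is a minimal plain-Python sketch of tokenization, stop-word removal and word-count feature extraction. It is an illustration under simplifying assumptions, not Weka's implementation: the stop-word list is deliberately tiny, stemming is left out, and the two example tweets are invented.

```python
import re
from collections import Counter

# A deliberately tiny stop-word list for illustration; real lists are much longer.
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "i", "my", "so"}


def tokenize(text):
    """Lowercase the text and split it into word tokens (letters and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())


def preprocess(text):
    """Tokenize and drop stop words."""
    return [tok for tok in tokenize(text) if tok not in STOP_WORDS]


def build_vocabulary(documents):
    """Map every distinct token to a column index, in sorted order."""
    vocab = sorted({tok for doc in documents for tok in preprocess(doc)})
    return {tok: idx for idx, tok in enumerate(vocab)}


def to_word_vector(text, vocabulary):
    """Turn one text into a list of word counts aligned with the vocabulary."""
    counts = Counter(preprocess(text))
    return [counts.get(tok, 0) for tok in vocabulary]


tweets = ["I love the sunny weather", "The weather is so gloomy, so sad"]
vocabulary = build_vocabulary(tweets)
vectors = [to_word_vector(t, vocabulary) for t in tweets]
```

Each tweet becomes one row of counts, with one column per word in the vocabulary, which is the kind of numeric representation a classifier can work with.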

Figure 3: StringToWordVector properties

Step 3: Classification

For classification, we first need to define the class attribute, i.e. the field to be predicted, using Filter.Unsupervised.attribute.ClassAssigner. This component is connected to the output of step 2, i.e. to the preprocessed data.

Next, we need training and test data. Evaluation.CrossValidationFoldMaker splits the data into cross-validation folds and gives us two datasets per fold, a training set and a test set. Both can be connected directly to the classifier.
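The idea of splitting data into cross-validation folds can be sketched as follows. This is a simplified, unstratified version written for illustration; the function name and seed are assumptions, and Weka's fold maker may assign instances to folds differently.

```python
import random


def cross_validation_folds(instances, num_folds=10, seed=1):
    """Yield (train, test) pairs; each instance lands in exactly one test set."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)
    for fold in range(num_folds):
        # Every num_folds-th instance, starting at `fold`, forms the test set.
        test = shuffled[fold::num_folds]
        train = [x for i, x in enumerate(shuffled) if i % num_folds != fold]
        yield train, test


folds = list(cross_validation_folds(range(20), num_folds=10))
```

With 20 instances and 10 folds, each fold tests on 2 instances and trains on the remaining 18.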

Several classifiers are commonly used for sentiment analysis, including:

NaiveBayes
Random Forest
Logistic Regression
XGBoost

And a few more. I have used the NaiveBayes classifier, Classifier.bayes.NaiveBayes.
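For readers who want to see what a Naive Bayes text classifier does, here is a small sketch. It is a simplified multinomial variant with add-one smoothing, written from scratch for illustration. It is not Weka's NaiveBayes implementation, and the class name, training documents and labels are all made up.

```python
import math
from collections import Counter, defaultdict


class SimpleNaiveBayes:
    """Multinomial Naive Bayes over token lists, with add-one smoothing."""

    def fit(self, documents, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for tokens, label in zip(documents, labels):
            self.word_counts[label].update(tokens)
        self.vocabulary = {t for c in self.word_counts.values() for t in c}
        return self

    def predict(self, tokens):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            # Start from the log prior, then add one log likelihood per token.
            score = math.log(doc_count / total_docs)
            denom = sum(self.word_counts[label].values()) + len(self.vocabulary)
            for tok in tokens:
                if tok in self.vocabulary:  # ignore words never seen in training
                    score += math.log((self.word_counts[label][tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


train_docs = [
    ["love", "sunny", "day"],
    ["great", "happy", "day"],
    ["sad", "gloomy", "weather"],
    ["hate", "sad", "news"],
]
train_labels = ["positive", "positive", "negative", "negative"]
model = SimpleNaiveBayes().fit(train_docs, train_labels)
prediction = model.predict(["happy", "sunny"])
```

Working in log space avoids numeric underflow when many small probabilities are multiplied together.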

Because we are using 10-fold cross-validation, the classifier is trained and tested once per fold, so the results arrive in batches and we get 10 outputs. To evaluate the classifier’s performance, we add another component, Evaluation.ClassifierPerformanceEvaluator. After step 3, the knowledge flow diagram looks like Figure 4.
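One way to picture the evaluation step is as pooling the per-fold predictions into a single confusion matrix and computing accuracy from it. The sketch below does this in plain Python. The per-fold labels are hypothetical values chosen for the example, not results from this experiment, and the helper names are assumptions.

```python
from collections import Counter


def confusion_counts(actual, predicted):
    """Count (actual, predicted) label pairs."""
    return Counter(zip(actual, predicted))


def accuracy(confusion):
    """Share of instances whose predicted label equals the actual label."""
    total = sum(confusion.values())
    correct = sum(n for (a, p), n in confusion.items() if a == p)
    return correct / total if total else 0.0


# Hypothetical per-fold outputs: (actual labels, predicted labels).
fold_outputs = [
    (["pos", "neg", "pos"], ["pos", "neg", "neg"]),
    (["neg", "neg", "pos"], ["neg", "pos", "pos"]),
]

# Pool the counts from every fold before computing the overall accuracy.
overall = Counter()
for actual, predicted in fold_outputs:
    overall += confusion_counts(actual, predicted)

overall_accuracy = accuracy(overall)
```

Pooling the counts first, rather than averaging per-fold accuracies, weights every instance equally even when folds differ in size.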

Figure 4: Step 3 completion

Step 4: Results
The processing is done. For visualization, we have Visualization.TextViewer. There are several ways to view the final result, but plain text is the most suitable for this task.

The results are shown in Figure 6.
The final Knowledge flow diagram appears in Figure 5.

Figure 5: Knowledge Flow diagram for sentiment analysis

Figure 6: Results after classification

Conclusion

Weka is an amazing tool for machine learning. It has some limitations in computation and memory utilization, which can be overcome with a bit of tweaking. I have been coding and implementing these algorithms by hand, and Weka made the process really easy for me. I’ll be using Weka beyond academic projects and tasks.