Text classifier in Java with WEKA
The topic of determining the direction of transactions in financial services has been studied extensively, yet new approaches and tools keep solving this problem at a new level. This article describes a text classifier for transactions built with the WEKA framework in Java.

Over the past few decades, financial services have been developing rapidly. Modern finance includes banking, lending, taxation, investment management, transfers, and payments. All these components have migrated almost entirely into virtual space. Online offices and mobile clients give users constant access to services. These resources are convenient and practical, offer personal and corporate levels of service, and provide a continuously operating customer support system. They let users make various types of money transactions. Electronic services are constantly being developed, automated, and made ‘smarter.’
One of the most fundamental technical objectives in electronic finance is determining the direction of the flow of funds (expenditure/income). How can we determine whether a bank transaction is incoming or outgoing at a given moment?
Tasks like these are easy to automate with the help of the New Zealand tool WEKA. It provides approaches for data analysis and machine learning, including the processing of financial texts.
The weka is a flightless bird of the rail family, endemic to New Zealand. The authors of the free software, written in Java at the University of Waikato (New Zealand) and distributed under the GNU GPL license, chose this bird as the project’s mascot and taught it a lot.

Despite its strange and funny name, WEKA is a tool that is capable of solving complex financial tasks.
Let’s take a look at the problem of classifying a money transaction by keywords. The objective is very similar to a spam classifier: if a binary spam classifier is properly modified, it turns into a debit/credit classifier.
A binary classifier can be built on the Naive Bayes algorithm, which applies Bayes’ theorem under the strict (“naive”) assumption that the features are independent. Conveniently, WEKA provides a powerful implementation of the Bayes classifier.
Create a class named DebitCreditWekaClassifier.java that prepares the dataset and trains the model of the classifier.
public class DebitCreditWekaClassifier {...}
Declare protected class variables, i.e. fields that are not accessible from the outside world: a FilteredClassifier object and a Logger. FilteredClassifier wraps a classifier together with a filter (a structure of weights or coefficients) through which the data pass. The structure of the filter and its coefficients are derived solely from the training dataset; test data pass through the filter without changing its coefficients and, in most cases, its structure. If uneven weights or classification errors occur after training, the instances or attributes of the data are re-processed, typically using an alternative sample drawn from the original training dataset. In this way, all filter weights are specified (reconfigured) before they are passed to the final classifier.
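As a sketch of these declarations (the field names classifier, LOGGER, wekaAttributes, trainData, and MODEL follow the names used elsewhere in the article; the exact visibility and types are assumptions):

```java
import java.util.ArrayList;
import java.util.logging.Logger;

import weka.classifiers.meta.FilteredClassifier;
import weka.core.Attribute;
import weka.core.Instances;

public class DebitCreditWekaClassifier {

    // Classifier that passes data through a filter before training/prediction.
    protected FilteredClassifier classifier;

    // Standard JDK logger for training and evaluation messages.
    protected static final Logger LOGGER =
            Logger.getLogger(DebitCreditWekaClassifier.class.getName());

    // Attribute definitions shared by the training and test datasets.
    protected ArrayList<Attribute> wekaAttributes;

    // Training instances loaded from trainData.txt.
    protected Instances trainData;

    // Path where the trained model is persisted.
    protected static final String MODEL = "model/debit_credit_model.dat";
}
```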
The training data file trainData.txt contains a set of instances and their labels.
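The article does not show the file layout; as an assumption, each line might hold a label and a phrase separated by a delimiter, for example:

```
debit;payment for utilities
credit;incoming salary transfer
```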
The attributes are stored in wekaAttributes. Once created, attributes cannot be changed. The following attribute types are possible: numeric, string, date, relational, and nominal; the last is a fixed set of nominal values. For more information, see the WEKA documentation on the Attribute class.
Let’s move on to the description of the basic class methods. First of all, consider the DebitCreditWekaClassifier class constructor.
The constructor creates an untrained instance of the classifier in classifier. To choose the classification algorithm, we set the multinomial Naive Bayes classifier, NaiveBayesMultinomial() (see the WEKA documentation for details). The model attributes are defined in wekaAttributes. The first attribute, attributeText, will hold the labelled text for training; the model works with text data. The class labels for the training data go into the classAttribute container; in our case there are two labels: debit and credit. This completes the constructor.
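A constructor matching this description might be sketched as follows (the attribute names "text" and "label" are assumptions; FilteredClassifier, NaiveBayesMultinomial, and Attribute are real WEKA classes):

```java
// Assumed imports: java.util.*, weka.classifiers.meta.FilteredClassifier,
// weka.classifiers.bayes.NaiveBayesMultinomial, weka.core.Attribute
public DebitCreditWekaClassifier() {
    // Untrained classifier wrapping the multinomial Naive Bayes algorithm.
    classifier = new FilteredClassifier();
    classifier.setClassifier(new NaiveBayesMultinomial());

    // String attribute that will hold the raw transaction text.
    Attribute attributeText = new Attribute("text", (List<String>) null);

    // Nominal class attribute with the two labels: debit and credit.
    List<String> classValues = Arrays.asList("debit", "credit");
    Attribute classAttribute = new Attribute("label", classValues);

    wekaAttributes = new ArrayList<>();
    wekaAttributes.add(attributeText);
    wekaAttributes.add(classAttribute);
}
```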
Now we have to convert the data to the required ARFF (Attribute-Relation File Format) format: load the text data and save it as ARFF.
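One possible way to build and save the ARFF dataset (the method name, the "label;text" line layout, and the initial capacity are assumptions; DenseInstance and ArffSaver are real WEKA classes):

```java
// Assumed imports: java.io.*, weka.core.DenseInstance, weka.core.Instance,
// weka.core.Instances, weka.core.converters.ArffSaver
public void loadDataset(String txtPath, String arffPath) throws Exception {
    trainData = new Instances("transactions", wekaAttributes, 10);
    trainData.setClassIndex(1); // the label attribute is the class

    try (BufferedReader reader = new BufferedReader(new FileReader(txtPath))) {
        String line;
        while ((line = reader.readLine()) != null) {
            // Assumed line layout: "label;text"
            String[] parts = line.split(";", 2);
            Instance row = new DenseInstance(2);
            row.setValue(wekaAttributes.get(0), parts[1]); // transaction text
            row.setValue(wekaAttributes.get(1), parts[0]); // debit or credit
            trainData.add(row);
        }
    }

    // Persist the dataset in ARFF format.
    ArffSaver saver = new ArffSaver();
    saver.setInstances(trainData);
    saver.setFile(new File(arffPath));
    saver.writeBatch();
}
```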
Next, we create a StringToWordVector filter that converts the dictionary into feature vectors. A feature vector is an n-dimensional vector of numerical parameters representing a particular object.
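To illustrate the idea in plain Java (this is not WEKA code, just a toy sketch of a bag-of-words vector built by lowercasing and splitting on non-alphanumeric characters):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class BagOfWordsDemo {
    // Count word occurrences after lowercasing and splitting on
    // non-alphanumeric characters.
    static Map<String, Integer> toVector(String text) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String token : text.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(toVector("Payment received, payment pending"));
        // {payment=2, received=1, pending=1}
    }
}
```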
To get a set of vectors from the text data, the training sample has to be split into phrases. To do this, add a tokenizer that divides the data into phrases, or N-grams; an N-gram is, typically, a run of words. For this experiment the phrase size is set to a single word: tokenizer.setNGramMinSize(1). As the separator we specify any non-alphanumeric character: tokenizer.setDelimiters("\\W"). Apply the tokenizer to the filter with filter.setTokenizer(tokenizer). Next comes a standard step included in the filter’s capabilities, conversion to lowercase: filter.setLowerCaseTokens(true). Finally, attach the configured filter to the classifier: classifier.setFilter(filter).
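Collecting these calls in one place (the variable declarations are assumptions; setNGramMaxSize(1) is an added assumption so the tokenizer really emits single words only, since WEKA's default maximum N-gram size is larger):

```java
// Assumed imports: weka.core.tokenizers.NGramTokenizer,
// weka.filters.unsupervised.attribute.StringToWordVector
NGramTokenizer tokenizer = new NGramTokenizer();
tokenizer.setNGramMinSize(1);
tokenizer.setNGramMaxSize(1); // assumption: cap phrases at one word as well
tokenizer.setDelimiters("\\W"); // any non-alphanumeric character separates tokens

StringToWordVector filter = new StringToWordVector();
filter.setTokenizer(tokenizer);
filter.setLowerCaseTokens(true); // normalize case before counting words

classifier.setFilter(filter); // the classifier applies the filter to all data
```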
Now let’s move on to training the classifier on the prepared data.
Training should complete without errors, with progress written to the log. After that, the model is created and saved as MODEL = "model/debit_credit_model.dat".
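Training and saving might look like this (the method name is an assumption; SerializationHelper is WEKA's standard way to persist models):

```java
// Assumed import: weka.core.SerializationHelper
public void trainAndSave() throws Exception {
    // Fit the filtered Naive Bayes model on the prepared instances.
    classifier.buildClassifier(trainData);

    // Serialize the trained model to model/debit_credit_model.dat.
    SerializationHelper.write(MODEL, classifier);
}
```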
Run the test to analyze the obtained model and write the result into the logger.
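The evaluate() call logged below can be sketched with WEKA's Evaluation class (here evaluating on the loaded dataset, since the article does not show the test-file layout):

```java
// Assumed import: weka.classifiers.Evaluation
public String evaluate() throws Exception {
    // Collect accuracy, kappa, and error statistics for the model.
    Evaluation eval = new Evaluation(trainData);
    eval.evaluateModel(classifier, trainData);

    // Returns a human-readable summary of the results.
    return eval.toSummaryString();
}
```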
LOGGER.info("Evaluation Result: \n"+wt.evaluate());
....
Analysis result:

Correctly Classified Instances          8              100      %
Incorrectly Classified Instances        0                0      %
Kappa statistic                         1
Mean absolute error                     0.2334
Root mean squared error                 0.2456
Relative absolute error                46.6838 %
Root relative squared error            49.1217 %
Total Number of Instances               8
Due to the small test sample, the relative and absolute classification errors are quite high, so the test dataset needs to be enlarged. Still, it is worth noting that all test examples were classified without errors. So we have a minimum viable product (MVP) of a keyword-based transaction classifier, which can become the core of any financial service that needs to determine the direction of transactions.
The full code is available in the project repository: https://github.com/AlexTitovWork/testWekaClassify.
Best regards, Alex!