Text classification based on Apache OpenNLP
Apache OpenNLP makes it possible to solve complex tasks of preparing and classifying text using machine learning methods.

Problem statement
Over the past few years, machine learning has opened up new opportunities for the financial industry and the smart economy. For example, the task of automatically classifying products is highly relevant in marketing and finance.
Defining categories in order to generate product catalogs automatically is a popular task. Another is identifying the objects of buyers' interest and compiling the main consumer basket. These and other areas require automated data processing.
The Apache OpenNLP library provides a wide range of tools for automating such tasks, including implementations of classifiers. One of the most powerful approaches to classification is the maximum entropy criterion. We will not dive into the mathematics and probability theory behind it; you can read about it here (link). Let's just check how it works. First of all, let's import the Java libraries for our task.
The use of libraries and their import
package nlp.textclass;

import java.io.*;
import java.util.*;

import opennlp.tools.postag.*;
import opennlp.tools.tokenize.*;
import opennlp.tools.util.*;
For further work, we need a text processing library and data input/output tools. For these purposes, we take the OpenNLP product from the Apache Software Foundation, plus the standard java.io and java.util packages for organizing input/output streams and collections. Let's start preparing the data.
Training data preparation
Before proceeding directly to generation of the automatic classification model, we need to obtain the data and prepare it properly. After preparation, the data and labels look like the listing below: in each entry, the first word is the data and the second is its label.
String[] product = {"cars_cargo", "ipad_electronic", "game_entertainment", "iphone_electronic", "stereo_electronic", "mouse_electronic", "keyboard_electronic", "tablet_electronic", "technic_electronic", "electronic_electronic", "analise_medical", "drugs_medical", "gas_oil", "shave_goods", "trousers_clothes", "bottle_alcohol", "bycicle_bike", "bike_bike", "stuff_clothes", "vine_alcohol", "water_goods", "jeance_clothes", "food_food"};
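These entries can be converted into the one-sample-per-line format that a document categorizer expects (label first, then the text). The sketch below is an illustration only: the `toTrainingLines` helper and the splitting on the last underscore are assumptions based on the `word_label` naming shown above, not code from the article.

```java
import java.util.*;

public class TrainingDataPrep {
    // Turn an entry like "cars_cargo" into the line "cargo cars":
    // the label first, then the sample text, one sample per line.
    static List<String> toTrainingLines(String[] products) {
        List<String> lines = new ArrayList<>();
        for (String entry : products) {
            int sep = entry.lastIndexOf('_');
            String word = entry.substring(0, sep);    // the data
            String label = entry.substring(sep + 1);  // the label
            lines.add(label + " " + word);
        }
        return lines;
    }

    public static void main(String[] args) {
        String[] product = {"cars_cargo", "ipad_electronic", "food_food"};
        // In the article's workflow these lines would be written to gen_data.txt.
        for (String line : toTrainingLines(product)) {
            System.out.println(line);
        }
    }
}
```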
We write these entries to the file gen_data.txt. The training set may include about 80% of all available data. If we want to solve a real problem, we need much more data: at least 5,000, and preferably 50,000, labeled elements.
Once we have normalized, structured data, we can begin building the model. The model is generated from the set of training data.
Model generation
After training, we obtain a maximum entropy (maxent) model, which we can then evaluate on a variety of test data; the test set is typically the remaining 20% of the data. But before testing the model, we will add handlers for the input texts: a sentence detector and a tokenizer.
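The 80/20 train/test split mentioned above can be sketched in plain Java. The shuffle-and-slice approach and the `split` helper here are illustrative assumptions, not the article's code:

```java
import java.util.*;

public class DataSplit {
    // Shuffle the labeled samples and split them into 80% training / 20% test.
    static List<List<String>> split(List<String> samples, long seed) {
        List<String> copy = new ArrayList<>(samples);
        Collections.shuffle(copy, new Random(seed)); // fixed seed for reproducibility
        int cut = (int) (copy.size() * 0.8);
        List<List<String>> parts = new ArrayList<>();
        parts.add(new ArrayList<>(copy.subList(0, cut)));           // training data
        parts.add(new ArrayList<>(copy.subList(cut, copy.size()))); // test data
        return parts;
    }

    public static void main(String[] args) {
        List<String> samples = Arrays.asList(
                "cargo cars", "electronic ipad", "entertainment game",
                "electronic iphone", "medical drugs");
        List<List<String>> parts = split(samples, 42L);
        System.out.println("train=" + parts.get(0).size() + " test=" + parts.get(1).size());
    }
}
```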
The detector of sentences and tokenizer
To classify products, individual words must be extracted from the input text. In the example, we split the text into sentences and then into words. After that, we build our part-of-speech model (POSModel).
POSModel model = POSTaggerME.train("en", sampleStream, TrainingParameters.defaultParams(), posModel.getFactory());
Sequence[] sequences = posTagger.topKSequences(message);
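The `message` passed to `topKSequences` is an array of word tokens. Splitting a sentence into such tokens can be sketched with a plain whitespace split; OpenNLP's own `WhitespaceTokenizer` does essentially this, and the standalone `tokenize` helper below is just an illustration of that step:

```java
public class SimpleTokenize {
    // Split a sentence into word tokens on whitespace, mimicking what a
    // whitespace tokenizer does before the tokens reach the POS tagger.
    static String[] tokenize(String sentence) {
        return sentence.trim().split("\\s+");
    }

    public static void main(String[] args) {
        String[] tokens = tokenize("boots and dress for sale");
        for (String t : tokens) {
            System.out.println(t);
        }
    }
}
```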
Training result looks like this.
POS model started
Indexing events using cutoff of 5
Computing event counts… done. 374 events
Indexing… done.
Sorting and merging events… done. Reduced 374 events to 366.
Done indexing.
Incorporating indexed data for training…
done.
Number of Event Tokens: 366
Number of Outcomes: 47
Number of Predicates: 99
…done.
Computing model parameters …
Performing 100 iterations.
1: … loglikelihood=-1439.955203039556 0.016042780748663103
2: … loglikelihood=-1153.0290547162213 0.27807486631016043
3: … loglikelihood=-1027.8934671969257 0.2914438502673797
...99: … loglikelihood=-328.84956863946474 0.7593582887700535
100: … loglikelihood=-327.86155121782036 0.7593582887700535
Model generated…
Process finished with exit code 0
We can load the model from a file if it already exists.
InputStream inPosStream = getClass().getClassLoader().getResourceAsStream("en-pos.dat");
POSModel posModel = new POSModel(inPosStream);
inPosStream.close();
POSTaggerME posTagger = new POSTaggerME(posModel);
Well done. Now let's move on to testing (this part is marked in the code). Each word will then be classified and assigned a label (tag), for example:
-> OpenNLP: tag detector
boots it is clothes
dress it is clothes
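The output above can be reproduced from a word-to-category lookup. In this sketch a hard-coded map stands in for the trained maxent model, and the `describe` helper is a hypothetical name, purely for illustration:

```java
import java.util.*;

public class TagPrinter {
    // Hard-coded word -> category map standing in for the trained maxent model.
    static final Map<String, String> CATEGORIES = new HashMap<>();
    static {
        CATEGORIES.put("boots", "clothes");
        CATEGORIES.put("dress", "clothes");
    }

    // Format one classified word the way the article's output shows it.
    static String describe(String word) {
        return word + " it is " + CATEGORIES.getOrDefault(word, "unknown");
    }

    public static void main(String[] args) {
        for (String w : new String[]{"boots", "dress"}) {
            System.out.println(describe(w));
        }
    }
}
```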
We load the model and then run it on the text, which was split into words; the model detects each word and classifies it. We used the OpenNLP library and created the NLPClassifier class (detailed code on GitHub).
public class NLPClassifier {…}
In the main function we test the sentence detector, the tokenizer (which works as a splitter), and the POSTagger, which works as the classifier. You can find the code on GitHub.
The full code and tested binary models for the POS tagger are in the repository.
Best wishes, Alex !