Published in Analytics Vidhya

Text classification based on Apache OpenNLP

The power of the Apache OpenNLP library lets you solve complex text-preparation and classification tasks using machine learning methods.


Problem statement

Over the past few years, machine learning has opened up new opportunities in finance and the digital economy. For example, automatically classifying products is highly relevant in the marketing and financial industries.
Defining categories in order to generate product catalogs automatically is a popular task. Another is identifying the items buyers are interested in and assembling a typical consumer basket. These and other areas require automated data processing.

The Apache OpenNLP library provides a wide range of tools for automating such tasks, including classifier implementations. One of the most powerful approaches to classification is the maximum-entropy model. We will not dive into the underlying mathematics and probability theory here; let’s just check how it works. First of all, let’s import the Java libraries for our task.

The use of libraries and their import

package nlp.textclass;

import java.io.*;
import java.util.*;

import opennlp.tools.postag.*;
import opennlp.tools.tokenize.*;
import opennlp.tools.util.*;

For further work, we need a text-processing library and data input/output tools. For text processing we take the OpenNLP product from the Apache Foundation, plus the standard java.io and java.util packages for organizing input/output streams and collections. Let’s start preparing the data.

Training data preparation

Before proceeding directly to generating the automatic classification model, we need to obtain the data and prepare it properly. After preparation, the data and labels look like this.

Here the first part of each entry is the data (a word), and the part after the underscore is its label.

String[] product = {"cars_cargo", "ipad_electronic", "game_entertainment", "iphone_electronic", "stereo_electronic", "mouse_electronic", "keyboard_electronic", "tablet_electronic", "technic_electronic", "electronic_electronic", "analise_medical", "drugs_medical", "gas_oil", "shave_goods", "trousers_clothes", "bottle_alcohol", "bycicle_bike", "bike_bike", "stuff_clothes", "vine_alcohol", "water_goods", "jeance_clothes", "food_food"};

From this we generate the file gen_data.txt. The training set may include about 80% of all available data. If we want to solve a real problem, we need more data!
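The exact layout of gen_data.txt is not shown in the article; here is a minimal sketch of generating it, assuming each line holds the label followed by the word (the GenData class name and the "label word" line format are assumptions):

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.*;

public class GenData {
    // Converts "word_label" entries into "label word" training lines.
    static List<String> toTrainingLines(String[] product) {
        List<String> lines = new ArrayList<>();
        for (String entry : product) {
            int sep = entry.lastIndexOf('_');       // the label follows the last underscore
            String word = entry.substring(0, sep);
            String label = entry.substring(sep + 1);
            lines.add(label + " " + word);
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        String[] product = {"cars_cargo", "ipad_electronic", "food_food"};
        List<String> lines = toTrainingLines(product);
        Files.write(Paths.get("gen_data.txt"), lines);  // one "label word" pair per line
        System.out.println(lines);
    }
}
```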


For a realistic task, you need more than 5,000, or better 50,000, labeled examples.

Once we have normalized, structured data, we can begin building the model. The model is generated from the training data set.
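The 80/20 split mentioned above can be obtained by shuffling the examples and cutting the list; a minimal sketch (the SplitData class, the fixed seed, and the sample data are illustrative, not from the article):

```java
import java.util.*;

public class SplitData {
    // Shuffles the examples deterministically and splits them ~80% train / ~20% test.
    static List<List<String>> trainTestSplit(List<String> examples, long seed) {
        List<String> shuffled = new ArrayList<>(examples);
        Collections.shuffle(shuffled, new Random(seed));
        int cut = (int) (shuffled.size() * 0.8);
        List<String> train = shuffled.subList(0, cut);
        List<String> test = shuffled.subList(cut, shuffled.size());
        return Arrays.asList(train, test);
    }

    public static void main(String[] args) {
        List<String> examples = Arrays.asList(
            "cargo cars", "electronic ipad", "entertainment game",
            "electronic iphone", "medical drugs", "oil gas",
            "clothes trousers", "alcohol bottle", "bike bike", "food food");
        List<List<String>> split = trainTestSplit(examples, 42L);
        System.out.println("train: " + split.get(0).size());  // 8 examples
        System.out.println("test:  " + split.get(1).size());  // 2 examples
    }
}
```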

Model generation

After training we obtain the maxent model, which we can evaluate on a variety of test data; the test set can be about 20% of all data. But before testing the model, we will add handlers for the input texts: a sentence detector and a tokenizer.

The sentence detector and tokenizer

To classify products, individual words must be extracted from the input text. In the example, we split the text into sentences and then into words. After that, we build our part-of-speech model (POSModel).
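OpenNLP ships trained SentenceDetectorME and TokenizerME components for these two steps; as a rough illustration of what they produce, here is a naive regex-based stand-in (the NaiveSplit class is illustrative; the real OpenNLP models handle abbreviations and punctuation far better):

```java
import java.util.*;

public class NaiveSplit {
    // Naive sentence splitter: breaks on '.', '!' or '?' followed by whitespace.
    static String[] sentences(String text) {
        return text.trim().split("(?<=[.!?])\\s+");
    }

    // Naive tokenizer: strips sentence punctuation, then breaks on whitespace.
    static String[] tokens(String sentence) {
        return sentence.replaceAll("[.!?,]", "").trim().split("\\s+");
    }

    public static void main(String[] args) {
        String text = "I bought an iphone. The dress fits well!";
        for (String s : sentences(text)) {
            System.out.println(Arrays.toString(tokens(s)));
        }
    }
}
```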

POSModel posModel = POSTaggerME.train("en", sampleStream, TrainingParameters.defaultParams(), new POSTaggerFactory());
POSTaggerME posTagger = new POSTaggerME(posModel);
Sequence[] sequences = posTagger.topKSequences(message); // message: the tokenized sentence as String[]

The training log looks like this.

POS model started
Indexing events using cutoff of 5
Computing event counts… done. 374 events
Indexing… done.
Sorting and merging events… done. Reduced 374 events to 366.
Done indexing.
Incorporating indexed data for training…
done.
Number of Event Tokens: 366
Number of Outcomes: 47
Number of Predicates: 99
…done.
Computing model parameters …
Performing 100 iterations.
1: … loglikelihood=-1439.955203039556 0.016042780748663103
2: … loglikelihood=-1153.0290547162213 0.27807486631016043
3: … loglikelihood=-1027.8934671969257 0.2914438502673797
...
99: … loglikelihood=-328.84956863946474 0.7593582887700535
100: … loglikelihood=-327.86155121782036 0.7593582887700535
Model generated…
Process finished with exit code 0

We can load the model from a file if it already exists.

InputStream inPosStream = getClass().getClassLoader().getResourceAsStream("en-pos.dat");
POSModel posModel = new POSModel(inPosStream);
inPosStream.close();
POSTaggerME posTagger = new POSTaggerME(posModel);

Well done, now let’s move on to testing, as noted in the code. Each word will then be classified and labeled (tagged), for example:

-> OpenNLP: tag detector
boots it is clothes
dress it is clothes
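Given tagged output like the above, evaluating the model on the held-out 20% reduces to counting matching labels; a minimal sketch (the Accuracy class and the sample predictions are hypothetical):

```java
import java.util.*;

public class Accuracy {
    // Fraction of predictions that match the expected labels.
    static double accuracy(List<String> predicted, List<String> expected) {
        int hits = 0;
        for (int i = 0; i < predicted.size(); i++) {
            if (predicted.get(i).equals(expected.get(i))) hits++;
        }
        return (double) hits / predicted.size();
    }

    public static void main(String[] args) {
        List<String> predicted = Arrays.asList("clothes", "clothes", "electronic", "food");
        List<String> expected  = Arrays.asList("clothes", "clothes", "alcohol", "food");
        System.out.println(accuracy(predicted, expected));  // 0.75
    }
}
```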

We load the model, then run it on the text, which has been split into words. It detects words and classifies them using Java code. We used the OpenNLP library and created the NLPClassifier class (detailed code on GitHub).

public class NLPClassifier {…}

Let’s test the sentence detector, the tokenizer that works as a splitter, and the POSTagger that works as the classifier in the main function. You can find the code on GitHub.

The full code and tested binary models for the POS tagger are in the repository.


Best wishes, Alex!


Analytics Vidhya is a community of Analytics and Data Science professionals. We are building the next-gen data science ecosystem https://www.analyticsvidhya.com

Alexey Titov

R&D engineer in machine learning and data analysis. Java, C/C++, Python, M and CUDA. HPC, processors architectures and parallel systems.
