Named Entity Recognition in Java using Open NLP


Introduction

You may be wondering how to extract information or keywords from text using just Java. When we think of natural language processing, we usually reach for mature Python libraries such as NLTK or spaCy, so these tasks are most often implemented in Python. But when it comes to integration with existing systems that may be written in Java, you can end up writing REST wrappers over the Python code so that the systems can communicate with it. This approach has several disadvantages:

  1. You have to build an authentication and authorization mechanism and make sure the REST API follows your security protocols.
  2. You need a separate environment for hosting and deploying this application.
  3. The number of points of failure increases.

I faced a similar issue in one of my projects. I had to extract maintenance information for data centers, such as the ID, date of maintenance, and provider, from maintenance emails and create tickets accordingly. The automated ticket creation was written in Mulesoft, which is a Spring-based Java integration tool.

I could have implemented a regex-based approach, writing patterns to extract these entities, but there were too many different patterns to cover and it was not a feasible solution. I also knew this information was already present on existing tickets which had been processed manually. So I thought of building a NER model that could extract these entities, trained on the dataset of tickets containing this information.

I looked at many libraries that offer custom NER model training, one of which was spaCy, a very popular, production-grade library. But since the ticket-creation integration was already in Java, building the model in Python and exposing it through a REST wrapper would have run into the disadvantages listed above.

That is how I came across Open NLP, a Java-based library by Apache. It has a pretty capable NER model, and it can easily be used from Mulesoft ESB as well by making use of a Java component.

Let's dive into how we can create a custom NER model in Open NLP.

Step 1: Adding the Library

The following dependencies can be added to the pom.xml of your Maven project, or you can download the jars and add them to the build path instead.

<dependency>
    <groupId>org.apache.opennlp</groupId>
    <artifactId>opennlp-tools</artifactId>
    <version>1.9.2</version>
</dependency>
<dependency>
    <groupId>org.apache.opennlp</groupId>
    <artifactId>opennlp-uima</artifactId>
    <version>1.9.2</version>
</dependency>

Step 2: Preparing the Training Dataset

The dataset fed to the Open NLP model should be present in a text file. This training data should contain the text with the entities or keywords that need to be extracted, and these entities should be annotated with start and end tags. The start tag marks the beginning of an entity and follows the syntax <START:entity_type.entity_name>, where entity_type is optional and can remain common across all the different kinds of entities being annotated, while entity_name is mandatory and is used to label the entity. The <END> tag marks the end of the entity.

For example, my training dataset consisted of the bodies of different emails with the maintenance dates annotated with start and end tags. Each email body was put into the text file on a separate line.

Dear Network User, Please be advised that the network will be unavailable from 01:00am to 05:30am on <START:maint.mdate> November 12th, 2014 <END> . This period of downtime will be scheduled for necessary updates to be applied to the network servers. We apologise for the inconvenience that this may cause. Kindly inform the IT Service Desk (at ext. 1234) of any concerns that you may have about the planned outage. Kind regards, abc name Network Administrator.
Due to system maintenance, Certain account related features on Net Banking would not be available till <START:maint.mdate> Monday 6th September 17:00 hrs <END> . Credit Card Enquiry, Demat, and Debit Card details would continue to be available. We regret the inconvenience caused

A few points to note while preparing the training dataset:

  1. The model works correctly only when there is a space before and after the annotation tags.
  2. The training dataset should contain a minimum of 7–10 records per kind of text for the model to learn the pattern and predict properly.
  3. Each record of the training dataset should be on a separate line (i.e. separated by a \n newline).
  4. Standard data preprocessing techniques can also be applied, such as removing punctuation, stopwords, and non-ASCII characters (see the sketch after this list).
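
As a rough illustration of point 4, below is a minimal, hypothetical cleanup utility (the class name and regular expressions are my own, not part of Open NLP). It strips non-ASCII characters and collapses whitespace so that each record stays on a single line; apply it to the raw email text before annotating, so the <START> and <END> tags are not affected. Punctuation or stopword removal could be added in the same way.

import java.util.regex.Pattern;

public class TrainingTextCleaner {

    // hypothetical helper: strips non-ASCII characters and collapses whitespace
    private static final Pattern NON_ASCII = Pattern.compile("[^\\x00-\\x7F]");
    private static final Pattern MULTI_SPACE = Pattern.compile("\\s+");

    public static String clean(String rawText) {
        String text = NON_ASCII.matcher(rawText).replaceAll(" "); // drop non-ASCII characters
        text = text.replace("\r", " ").replace("\n", " ");        // keep one record per line
        text = MULTI_SPACE.matcher(text).replaceAll(" ");         // collapse repeated whitespace
        return text.trim();
    }

    public static void main(String[] args) {
        System.out.println(clean("Scheduled  maintenance\non November 12th, 2014 for caf\u00e9 network"));
    }
}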

Step 3: Build the Model

Import the following classes (the tokenizer classes are included here because they are used in the testing code further below):

import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import opennlp.tools.namefind.BioCodec;
import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.NameSample;
import opennlp.tools.namefind.NameSampleDataStream;
import opennlp.tools.namefind.TokenNameFinder;
import opennlp.tools.namefind.TokenNameFinderFactory;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import opennlp.tools.util.InputStreamFactory;
import opennlp.tools.util.MarkableFileInputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.Span;
import opennlp.tools.util.TrainingParameters;

Read the annotated training dataset text file (AnnotatedSentences.txt) prepared in Step 2:

InputStreamFactory in = null;
try {
    in = new MarkableFileInputStreamFactory(new File("AnnotatedSentences.txt"));
} catch (FileNotFoundException e2) {
    e2.printStackTrace();
}
ObjectStream<NameSample> sampleStream = null;
try {
    sampleStream = new NameSampleDataStream(
        new PlainTextByLineStream(in, StandardCharsets.UTF_8));
} catch (IOException e1) {
    e1.printStackTrace();
}

We can change training parameters such as the number of iterations, the cutoff (the minimum number of times a feature must be seen to be included), and the training algorithm (for example Maxent or Naive Bayes).

// setting the parameters for training
TrainingParameters params = new TrainingParameters();
params.put(TrainingParameters.ITERATIONS_PARAM, 70);
params.put(TrainingParameters.CUTOFF_PARAM, 1);
params.put(TrainingParameters.ALGORITHM_PARAM, "MAXENT");

We train the model by passing in the language, the training data stream, the training parameters, and a factory holding the codec. After training, we save the model to the file system under a name that will be used to load it for extraction later.

// training the model using the TokenNameFinderModel class
TokenNameFinderModel nameFinderModel = null;
try {
    nameFinderModel = NameFinderME.train("en", null, sampleStream, params,
        TokenNameFinderFactory.create(null, null, Collections.emptyMap(), new BioCodec()));
} catch (IOException e) {
    e.printStackTrace();
}
// saving the model to the "ner-custom-model.bin" file
try {
    File output = new File("ner-custom-model.bin");
    FileOutputStream outputStream = new FileOutputStream(output);
    nameFinderModel.serialize(outputStream);
    outputStream.close();
} catch (FileNotFoundException e) {
    e.printStackTrace();
} catch (IOException e) {
    e.printStackTrace();
}

Download the tokenizer model (en-token.bin) from the official Open NLP website; it is used to tokenize the sentences, because the NER model expects the text as tokens. We then load the custom model created above and feed it the tokenized test sentence.

// Tokenize the test sentence
String sentence = "<put in the sample sentence that you want to test here>";
InputStream inputStreamTokenizer = new FileInputStream("en-token.bin");
TokenizerModel tokenModel = new TokenizerModel(inputStreamTokenizer);
TokenizerME tokenizer = new TokenizerME(tokenModel);
String[] tokens = tokenizer.tokenize(sentence);
// Load the model created above
InputStream inputStream = new FileInputStream("ner-custom-model.bin");
TokenNameFinderModel model = null;
try {
    model = new TokenNameFinderModel(inputStream);
} catch (IOException e) {
    e.printStackTrace();
}
NameFinderME nameFinder = new NameFinderME(model);
Span[] nameSpans = nameFinder.find(tokens);
// testing the model and printing the entities it found in the input sentence
for (Span name : nameSpans) {
    String entity = "";
    System.out.println(name);
    for (int i = name.getStart(); i < name.getEnd(); i++) {
        entity += tokens[i] + " ";
    }
    System.out.println(name.getType() + " : " + entity + "\t [probability=" + name.getProb() + "]");
}
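
If you need to call this from an integration layer such as a Mulesoft Java component, it helps to wrap model loading and entity extraction in a small reusable class. The sketch below is my own illustration (the class and method names are hypothetical, and the model file names are the ones assumed in the steps above); it uses only the Open NLP calls shown earlier. You would construct the extractor once and call extract(...) for each incoming email body.

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import opennlp.tools.util.Span;

// Hypothetical wrapper for reuse from other Java code (e.g. a Mulesoft Java component).
public class MaintenanceEntityExtractor {

    private final TokenizerME tokenizer;
    private final NameFinderME nameFinder;

    public MaintenanceEntityExtractor(String tokenizerModelPath, String nerModelPath) throws IOException {
        try (InputStream tokenizerIn = new FileInputStream(tokenizerModelPath);
             InputStream nerIn = new FileInputStream(nerModelPath)) {
            this.tokenizer = new TokenizerME(new TokenizerModel(tokenizerIn));
            this.nameFinder = new NameFinderME(new TokenNameFinderModel(nerIn));
        }
    }

    // Returns each detected entity as "type : text".
    public List<String> extract(String sentence) {
        String[] tokens = tokenizer.tokenize(sentence);
        Span[] spans = nameFinder.find(tokens);
        List<String> entities = new ArrayList<>();
        for (Span span : spans) {
            String text = String.join(" ",
                Arrays.copyOfRange(tokens, span.getStart(), span.getEnd()));
            entities.add(span.getType() + " : " + text);
        }
        nameFinder.clearAdaptiveData(); // reset adaptive data between documents
        return entities;
    }

    public static void main(String[] args) throws IOException {
        MaintenanceEntityExtractor extractor =
            new MaintenanceEntityExtractor("en-token.bin", "ner-custom-model.bin");
        System.out.println(extractor.extract("Maintenance is scheduled for November 12th, 2014."));
    }
}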

Why not Stanford Core NLP?

You may wonder why I didn't choose Stanford Core NLP, which is also a widely popular Java library for NLP. The reason is that Stanford Core NLP required a much larger training dataset to train a custom NER model, unlike Open NLP, which could learn the patterns from very few training records. Maybe in my next blog I will explain how to train a custom NER model with Stanford Core NLP and compare it with Open NLP.

I hope you liked my post on Open NLP. If so, please share it with your friends and colleagues who are interested, and give the post a lot of claps, which will encourage me to write more. If you have any doubts or need help getting this code working, feel free to drop your thoughts in the comments section below.

Until next post, thank you !!!
