Augmenting Data in Pipelines with NLP

Using Stanford CoreNLP in Apache Nifi

Drew Kerrigan
iss-lab

--

Stanford CoreNLP Processor running in Apache Nifi

The Stanford CoreNLP toolkit provides a nice Java interface to a wide range of natural language processing functionality. The goal of this article is to briefly introduce what is possible when you combine CoreNLP with Apache Nifi.

Use-Case: Entity Extraction

Entity extraction, also known as Named Entity Recognition (NER), is a valuable tool in data processing workflows. It can be used in any situation where you need to find people, companies, locations, and other named terms in text data.
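To illustrate the shape of NER input and output (this is a toy stand-in, not CoreNLP itself), here is a sketch that tags mentions from a tiny hand-written gazetteer; real NER systems use trained statistical models rather than lookup tables:

```python
# Toy "entity extractor" that mimics the shape of NER output: (mention, type)
# pairs. This gazetteer is a hand-written stand-in for illustration only.
GAZETTEER = {
    "Amarillo": "LOCATION",
    "Texas": "LOCATION",
    "Drew Kerrigan": "PERSON",
}

def extract_entities(text):
    """Return (mention, type) pairs for every gazetteer term found in text."""
    found = []
    for mention, etype in GAZETTEER.items():
        if mention in text:
            found.append((mention, etype))
    return found

print(extract_entities("Drew Kerrigan wrote about Amarillo, Texas."))
```

A real model also handles mentions it has never seen before, which is exactly why a toolkit like CoreNLP is worth wiring into the pipeline.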

Take a look at the following news article JSON snippet:

{
  "title": "City of Amarillo recognizes Texas arbor day",
  "content": "AMARILLO, Texas (KAMR/KCIT) — A small forest of trees will have a new home on Friday (Nov. 1) as the City of Amarillo Parks & Recreation Department (PARD) celebrates Texas Arbor Day by planting 20 trees at Sleepy Hollow Elementary School Park, 3435 Reeder Dr."
}

Our goal is to extract the locations and organizations from this data. We (the ISS Technology Innovation Lab) have created a custom Apache Nifi processor that uses the Stanford CoreNLP toolkit to perform entity extraction on FlowFiles. You can find the repository on GitHub: https://github.com/iss-lab/nifi-stanfordcorenlp-processor.

Setting up Nifi

Installation instructions can be found in the Nifi documentation; this article focuses on running the stack locally on macOS.

brew install nifi
export NIFI_HOME=/usr/local/Cellar/nifi/1.9.2/libexec
mkdir -p ${NIFI_HOME}/nars/lib1/
cd ${NIFI_HOME}/nars/lib1/
curl -L -O https://github.com/iss-lab/nifi-stanfordcorenlp-processor/releases/download/v1.2/nifi-stanfordcorenlp-nar-1.2.nar

Those commands install Nifi and download our Stanford CoreNLP processor NAR (nifi-stanfordcorenlp-nar-1.2.nar) into a directory where Nifi can find it. Before we start Nifi, there's a little more configuration to do. Because CoreNLP tends to be memory intensive when loading models and processing text, we need to increase the Java heap settings. Make the following change to ${NIFI_HOME}/conf/bootstrap.conf to raise the max heap size from the default 512MB to 4GB:

# JVM memory settings
java.arg.2=-Xms512m
java.arg.3=-Xmx4g

Additionally, we need to tell Nifi where to find the custom NAR file. This can be done by adding the following line to ${NIFI_HOME}/conf/nifi.properties:

nifi.nar.library.directory.lib1=./nars/lib1

With all of that done, we can start Nifi as normal:

nifi start

After a few minutes, you should be able to access the web UI at http://localhost:8080/nifi.

Creating the Flow

To build our data processing flow, we first need to create some inputs and outputs in Nifi. Nifi is extremely flexible as far as data sources and sinks go, but we’re going to keep it simple here by using the built-in GetFile and PutFile processors. Let’s create some folders that we can use with them:

mkdir -p ${NIFI_HOME}/input
mkdir -p ${NIFI_HOME}/output

Next, drag the Processor icon into the flow area and find the GetFile processor. Once it is added, right-click it, select Configure, and open the Properties tab. There, set the Input Directory property to your ${NIFI_HOME}/input path.

The process is very similar for the PutFile processor, just specify the Directory property as ${NIFI_HOME}/output.

The last processor we need is the StanfordCoreNLPProcessor that was previously downloaded. It has two properties to configure: Entity Types and JSON Path.

Entity Types takes the lowercase NER tags we are interested in: organization,location in this case. JSON Path identifies the piece of the incoming JSON document to analyze: $.['title','content'] for our data. JSON Path can also be left blank, in which case the processor treats the entire document as plain text.
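To make the JSON Path setting concrete, here is a minimal Python stand-in for the field-selection step (a hypothetical helper for illustration, not the processor's actual code; the real processor uses a full JSONPath implementation, while this handles only the simple top-level $.['a','b'] case):

```python
import json

def select_fields(document_json, fields):
    """Join the named top-level fields into the text that will be analyzed."""
    doc = json.loads(document_json)
    return " ".join(str(doc[f]) for f in fields if f in doc)

# Fields not named in the path (e.g. "url") are ignored for analysis.
doc = json.dumps({
    "title": "City of Amarillo recognizes Texas arbor day",
    "content": "AMARILLO, Texas ...",
    "url": "ignored by the path",
})
print(select_fields(doc, ["title", "content"]))
```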

Once the processors are configured, just click and drag arrows from one component to the next to connect them in order: GetFile -> StanfordCoreNLPProcessor -> PutFile.

The PutFile processor may need one additional piece of configuration: on the first configuration tab, set its success and failure relationships to auto-terminate. Then finish up the flow by starting each component with the green play button.

The Fun Part

Let’s test it out. Create a file named ${NIFI_HOME}/input/1.json with the contents of our article from before:

{
  "title": "City of Amarillo recognizes Texas arbor day",
  "content": "AMARILLO, Texas (KAMR/KCIT) — A small forest of trees will have a new home on Friday (Nov. 1) as the City of Amarillo Parks & Recreation Department (PARD) celebrates Texas Arbor Day by planting 20 trees at Sleepy Hollow Elementary School Park, 3435 Reeder Dr."
}

If everything worked as planned, you should see a new file in ${NIFI_HOME}/output/1.json after a few minutes (subsequent files will process faster):

{
  "title": "City of Amarillo recognizes Texas arbor day",
  "content": "AMARILLO, Texas (KAMR/KCIT) - A small forest of trees will have a new home on Friday (Nov. 1) as the City of Amarillo Parks & Recreation Department (PARD) celebrates Texas Arbor Day by planting 20 trees at Sleepy Hollow Elementary School Park, 3435 Reeder Dr.",
  "organization": [
    "City of Amarillo Parks & Recreation Department"
  ],
  "location": [
    "Amarillo",
    "Texas",
    "AMARILLO",
    "Texas",
    "Texas",
    "Sleepy Hollow",
    "Elementary School Park"
  ]
}

Perfect! We still have the original document, but it is now augmented with two new JSON properties, organization and location, populated with categorized data from our article. This is a simple project demonstrating basic functionality, but there are many more interesting things that can be done with these tools.
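The augmentation step itself can be sketched in a few lines of Python (a hypothetical re-implementation for illustration; in the real flow the processor performs this merge, and the entity lists below are copied from the example output above):

```python
import json

def augment(document_json, entities_by_type):
    """Add one array per entity type to the original JSON document."""
    doc = json.loads(document_json)
    for etype, mentions in entities_by_type.items():
        doc[etype] = mentions
    return json.dumps(doc)

# Entity lists hard-coded from the example output; in the real flow
# they are produced by the CoreNLP processor.
original = '{"title": "City of Amarillo recognizes Texas arbor day"}'
augmented = augment(original, {
    "organization": ["City of Amarillo Parks & Recreation Department"],
    "location": ["Amarillo", "Texas"],
})
print(augmented)
```

Because the original fields are left untouched, downstream consumers that do not care about entities keep working unchanged.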

Summary

We hope to add more capabilities from the Stanford CoreNLP toolkit to our custom Nifi processor in the future. There are also several StanfordCoreNLPProcessor configuration options that were not discussed in this article. For more information and examples, check out the GitHub repository: https://github.com/iss-lab/nifi-stanfordcorenlp-processor.
