Augmenting Data in Pipelines with NLP
Using Stanford CoreNLP in Apache Nifi
The Stanford CoreNLP toolkit provides a Java interface to a wide range of natural language processing functionality. This article briefly introduces what is possible when you combine CoreNLP with Apache Nifi.
Use-Case: Entity Extraction
Entity Extraction or Named Entity Recognition (NER) is a valuable tool in data processing workflows. It can be used in a variety of situations where you need to find people, companies, locations and other terms in some text data.
Take a look at the following news article JSON snippet:
{
"title": "City of Amarillo recognizes Texas arbor day",
"content": "AMARILLO, Texas (KAMR/KCIT) — A small forest of trees will have a new home on Friday (Nov. 1) as the City of Amarillo Parks & Recreation Department (PARD) celebrates Texas Arbor Day by planting 20 trees at Sleepy Hollow Elementary School Park, 3435 Reeder Dr."
}
Our goal is to extract the locations and organizations from the data. We (ISS Technology Innovation Lab) have created an Apache Nifi custom processor which utilizes the Stanford CoreNLP toolkit to perform entity extraction on FlowFiles. You can find the repository on GitHub here: https://github.com/iss-lab/nifi-stanfordcorenlp-processor.
Setting up Nifi
Installation instructions can be found in the Nifi documentation. This article focuses on running the stack locally on macOS, using Homebrew:
brew install nifi
export NIFI_HOME=/usr/local/Cellar/nifi/1.9.2/libexec
mkdir -p ${NIFI_HOME}/nars/lib1/
cd ${NIFI_HOME}/nars/lib1/
curl -L -O https://github.com/iss-lab/nifi-stanfordcorenlp-processor/releases/download/v1.2/nifi-stanfordcorenlp-nar-1.2.nar
Those commands install Nifi and download our Stanford CoreNLP processor file, nifi-stanfordcorenlp-nar-1.2.nar, into a directory where Nifi can find it. Before we start Nifi, there’s a little more configuration to do. Because CoreNLP tends to be memory intensive when loading models and processing text, we need to increase the Java heap settings. Make the following changes to ${NIFI_HOME}/conf/bootstrap.conf to increase the max heap size to 4GB from the default 512MB:
# JVM memory settings
java.arg.2=-Xms512m
java.arg.3=-Xmx4g
Additionally, we need to tell Nifi where to find the custom nar file. This can be done by modifying ${NIFI_HOME}/conf/nifi.properties:
nifi.nar.library.directory.lib1=./nars/lib1
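Before starting Nifi, it can be worth confirming that both edits actually landed in the files. The following is a minimal sketch in Python, not part of Nifi or of our processor: the helper names (parse_properties, missing_settings, check_conf) are hypothetical, and the parser only understands plain key=value lines with # comments, which is a simplification of the full properties format.

```python
from pathlib import Path


def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def missing_settings(text, expected):
    """Return the expected entries whose values are absent or different."""
    actual = parse_properties(text)
    return {key: value for key, value in expected.items() if actual.get(key) != value}


# Values taken from this article; adjust them if your setup differs.
EXPECTED_BOOTSTRAP = {"java.arg.3": "-Xmx4g"}
EXPECTED_NIFI = {"nifi.nar.library.directory.lib1": "./nars/lib1"}


def check_conf(nifi_home):
    """Report, per config file, which expected settings still need attention."""
    conf = Path(nifi_home) / "conf"
    return {
        "bootstrap.conf": missing_settings(
            (conf / "bootstrap.conf").read_text(), EXPECTED_BOOTSTRAP
        ),
        "nifi.properties": missing_settings(
            (conf / "nifi.properties").read_text(), EXPECTED_NIFI
        ),
    }
```

Calling check_conf with the value of NIFI_HOME returns an empty dictionary for each file when the settings match, and otherwise lists the entries that still need to be changed.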
With all of that done, we can start Nifi as normal:
nifi start
After a few minutes, you should be able to access the web UI at http://localhost:8080/nifi.
Creating the Flow
To build our data processing flow, we first need to create some inputs and outputs in Nifi. Nifi is extremely flexible as far as data sources and sinks go, but we’re going to keep it simple here by using the built-in GetFile and PutFile processors. Let’s create some folders that we can use with them:
mkdir -p ${NIFI_HOME}/input
mkdir -p ${NIFI_HOME}/output
Next, click and drag the Processor icon into the flow area, and find the GetFile processor. Once it is added, right-click it, select Configure, and then select the Properties tab. Here you’ll need to specify the Input Directory as ${NIFI_HOME}/input.
The process is very similar for the PutFile processor: just specify the Directory property as ${NIFI_HOME}/output.
The last processor we need is the StanfordCoreNLPProcessor that was previously downloaded. Its properties should be configured as follows:
Entity Types are the lowercase NER tags that we are interested in: organization,location in this case. JSON Path is the piece of the incoming JSON document that we are interested in analyzing: $.['title','content'] for our data. This can also be left blank, in which case the processor will treat the entire document as plain text.
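To make the JSON Path setting more concrete, here is a rough sketch in plain Python of what selecting the title and content fields amounts to. It is only an illustration, not the processor's actual implementation: select_fields is a hypothetical helper, the content string is shortened for the example, and how the processor combines the selected values internally is an assumption.

```python
import json


def select_fields(document, fields):
    """Collect the string values of the named top-level fields, in order."""
    return [document[name] for name in fields if isinstance(document.get(name), str)]


# Shortened stand-in for the article; the real content field is much longer.
article = json.loads(
    '{"title": "City of Amarillo recognizes Texas arbor day",'
    ' "content": "AMARILLO, Texas (KAMR/KCIT)"}'
)

# Roughly what $.['title','content'] picks out of the document.
selected = select_fields(article, ["title", "content"])

# Assumption: the selected values are analyzed as one piece of text.
text_to_analyze = "\n".join(selected)
print(text_to_analyze)
```

Fields that are missing or that do not hold text are simply skipped in this sketch, so a document without a content field would still yield its title.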
Once the processors are configured, just click and drag arrows from one component to the next to connect them. The final flow should run from GetFile through StanfordCoreNLPProcessor to PutFile.
Finish up the flow by starting each component with the green play button. The PutFile processor may need one additional configuration: the success and failure outputs need to be auto-terminated on the first configuration tab.
The Fun Part
Let’s test it out. Create a file named ${NIFI_HOME}/input/1.json with the contents of our article from before:
{
"title": "City of Amarillo recognizes Texas arbor day",
"content": "AMARILLO, Texas (KAMR/KCIT) — A small forest of trees will have a new home on Friday (Nov. 1) as the City of Amarillo Parks & Recreation Department (PARD) celebrates Texas Arbor Day by planting 20 trees at Sleepy Hollow Elementary School Park, 3435 Reeder Dr."
}
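If you would rather feed documents to the flow from a script than create them by hand, something like the following sketch could work. It is an illustration under a few assumptions: drop_flowfile is a hypothetical helper, and the write-then-rename approach relies on GetFile skipping hidden (dot-prefixed) files, which we believe is its default behavior. Also keep in mind that GetFile, with its default settings, removes each file from the input directory once it has been picked up.

```python
import json
import os
import tempfile
from pathlib import Path


def drop_flowfile(input_dir, name, document):
    """Write document as JSON to input_dir/name using a temp file and a rename."""
    input_dir = Path(input_dir)
    input_dir.mkdir(parents=True, exist_ok=True)

    # Write to a hidden temp file first so a half-written file is never picked up
    # (assumes the watcher ignores dot-prefixed files).
    fd, tmp_path = tempfile.mkstemp(dir=input_dir, prefix=".", suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(document, handle, ensure_ascii=False, indent=2)

    # Renaming within the same directory makes the finished file appear in one step.
    target = input_dir / name
    os.replace(tmp_path, target)
    return target
```

For the example in this article, the call would look like drop_flowfile(os.path.join(os.environ["NIFI_HOME"], "input"), "1.json", article), where article is a dictionary holding the title and content fields.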
If everything worked as planned, you should see a new file at ${NIFI_HOME}/output/1.json after a few minutes (subsequent files will process faster):
{
"title": "City of Amarillo recognizes Texas arbor day",
"content": "AMARILLO, Texas (KAMR/KCIT) - A small forest of trees will have a new home on Friday (Nov. 1) as the City of Amarillo Parks & Recreation Department (PARD) celebrates Texas Arbor Day by planting 20 trees at Sleepy Hollow Elementary School Park, 3435 Reeder Dr.",
"organization": [
"City of Amarillo Parks & Recreation Department"
],
"location": [
"Amarillo",
"Texas",
"AMARILLO",
"Texas",
"Texas",
"Sleepy Hollow",
"Elementary School Park"
]
}
Perfect! We have the original document, but now it is augmented with two new JSON properties, organization and location, which are populated with categorized data from our article. This is just a simple project demonstrating the basic functionality, but there are many more interesting things that can be done with these tools.
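One thing you may notice in the output is that the location list contains repeats, since each mention in the text is reported separately ("Texas" three times, plus both "Amarillo" and "AMARILLO"). If a downstream step only needs the distinct values, a small post-processing function is enough. The sketch below is one possible approach in Python and is not a feature of the processor; dedupe_entities is a hypothetical helper that compares values case-insensitively and keeps the first spelling it sees.

```python
def dedupe_entities(values):
    """Drop repeats, comparing case-insensitively and keeping first-seen spelling."""
    seen = set()
    unique = []
    for value in values:
        key = value.casefold()
        if key not in seen:
            seen.add(key)
            unique.append(value)
    return unique


# The location list from the output file above.
locations = [
    "Amarillo",
    "Texas",
    "AMARILLO",
    "Texas",
    "Texas",
    "Sleepy Hollow",
    "Elementary School Park",
]

print(dedupe_entities(locations))
# ['Amarillo', 'Texas', 'Sleepy Hollow', 'Elementary School Park']
```

Whether to fold case this way is a judgment call: it merges "AMARILLO" into "Amarillo", which suits this article, but it could hide meaningful distinctions in other data.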
Summary
We hope to add more capabilities from the Stanford CoreNLP toolkit to our custom Nifi processor in the future. There are also several configuration options available for our StanfordCoreNLPProcessor that were not discussed in the article. For more information and examples, check out the GitHub repository here: https://github.com/iss-lab/nifi-stanfordcorenlp-processor.