macOS Setup Guide for Using Stanford CoreNLP

Shan Dou
6 min read · Apr 21, 2019

Goal

Stanford CoreNLP is an annotation-based NLP processing pipeline (Manning et al., 2014). In the context of deep-learning-based text summarization, CoreNLP has been used by Fernandes et al. (2018) to provide structural annotations. As I tinker with the accompanying code, having CoreNLP up and running becomes crucial. In particular, I hope to be able to run the corenlp.sh script inside the CoreNLP git repo.

At first glance, the instructions on CoreNLP’s official website looked overwhelming, and I was mildly concerned that making it work on my Mac would be a painful process. It turned out to be an alright experience — as soon as I realized that both steps shown on the webpage are necessary to make the script work.

My Initial Misunderstanding

When I first went through the instructions, I thought the two sections — Getting a copy and Steps to setup from the official release — had an “either…or” relationship. Since I was most interested in making the git-repo script work, cloning the repo and following its associated steps were the only things I did. Upon running the test example, however, I got the following error:

java -mx5g -cp "./*" edu.stanford.nlp.pipeline.StanfordCoreNLP -file input.txt
Error: Could not find or load main class edu.stanford.nlp.pipeline.StanfordCoreNLP
Caused by: java.lang.ClassNotFoundException: edu.stanford.nlp.pipeline.StanfordCoreNLP

It was only after some more googling that it finally dawned on me: I still needed the CoreNLP framework itself, whereas the repo mostly contains the tools and utilities for using the framework. Below is a detailed guide:

Setup Steps

[STEP 1] Download CoreNLP and set up paths

  1. Download the CoreNLP framework

Approach 1: Downloading via web interface

Figure 1: Download corenlp zip file

Approach 2: CLI download via wget or curl

Download with

wget http://nlp.stanford.edu/software/stanford-corenlp-full-2018-10-05.zip

or

curl -O http://nlp.stanford.edu/software/stanford-corenlp-full-2018-10-05.zip

2. Unzip the downloaded archive (mine lives under /Users/shandou/Software/), then put the following lines into your .bashrc or .zshrc to set up the paths

# Settings for Stanford CoreNLP
export CORENLP_ROOT="/Users/shandou/Software/stanford-corenlp-full-2018-10-05"
export CLASSPATH="$CORENLP_ROOT/javanlp-core.jar"
export CLASSPATH="$CLASSPATH:$CORENLP_ROOT/stanford-corenlp-models-current.jar"
for file in `find $CORENLP_ROOT -name "*.jar"`
do
    export CLASSPATH="$CLASSPATH:`realpath $file`"
done
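If you are curious what the loop does, here is a self-contained sketch that runs the same pattern over a throwaway directory of empty jar files (the directory and file names are made up for illustration). One caveat worth flagging: some macOS versions ship without realpath, so if the loop above fails, either `brew install coreutils` or drop the realpath call, since find already emits usable paths here.

```shell
# Demonstrate the CLASSPATH-building pattern on a throwaway directory
# of empty jar files (names are made up for illustration)
demo_root=$(mktemp -d)
touch "$demo_root/a.jar" "$demo_root/b.jar"

# Use a separate variable so the real CLASSPATH is left alone
DEMO_CLASSPATH=""
for file in `find $demo_root -name "*.jar"`
do
    DEMO_CLASSPATH="$DEMO_CLASSPATH:$file"
done

echo "$DEMO_CLASSPATH"   # something like :/tmp/tmp.XXXX/a.jar:/tmp/tmp.XXXX/b.jar
```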

For testing, type echo $CLASSPATH in your command line terminal. You should see output similar to this:

/Users/shandou/Software/stanford-corenlp-full-2018-10-05/javanlp-core.jar:/Users/shandou/Software/stanford-corenlp-full-2018-10-05/stanford-corenlp-models-current.jar:/Users/shandou/Software/stanford-corenlp-full-2018-10-05/javax.json-api-1.0-sources.jar:/Users/shandou/Software/stanford-corenlp-full-2018-10-05/jaxb-api-2.4.0-b180830.0359-sources.jar:/Users/shandou/Software/stanford-corenlp-full-2018-10-05/stanford-corenlp-3.9.2-models.jar:/Users/shandou/Software/stanford-corenlp-full-2018-10-05/javax.activation-api-1.2.0-sources.jar:/Users/shandou/Software/stanford-corenlp-full-2018-10-05/ejml-0.23.jar:/Users/shandou/Software/stanford-corenlp-full-2018-10-05/javax.activation-api-1.2.0.jar:/Users/shandou/Software/stanford-corenlp-full-2018-10-05/slf4j-api.jar:/Users/shandou/Software/stanford-corenlp-full-2018-10-05/protobuf.jar:/Users/shandou/Software/stanford-corenlp-full-2018-10-05/joda-time.jar:/Users/shandou/Software/stanford-corenlp-full-2018-10-05/joda-time-2.9-sources.jar:/Users/shandou/Software/stanford-corenlp-full-2018-10-05/jaxb-impl-2.4.0-b180830.0438.jar:/Users/shandou/Software/stanford-corenlp-full-2018-10-05/xom-1.2.10-src.jar:/Users/shandou/Software/stanford-corenlp-full-2018-10-05/stanford-corenlp-3.9.2.jar:/Users/shandou/Software/stanford-corenlp-full-2018-10-05/xom.jar:/Users/shandou/Software/stanford-corenlp-full-2018-10-05/stanford-corenlp-3.9.2-javadoc.jar:/Users/shandou/Software/stanford-corenlp-full-2018-10-05/stanford-corenlp-3.9.2-sources.jar:/Users/shandou/Software/stanford-corenlp-full-2018-10-05/javax.json.jar:/Users/shandou/Software/stanford-corenlp-full-2018-10-05/jaxb-api-2.4.0-b180830.0359.jar:/Users/shandou/Software/stanford-corenlp-full-2018-10-05/jollyday.jar:/Users/shandou/Software/stanford-corenlp-full-2018-10-05/slf4j-simple.jar:/Users/shandou/Software/stanford-corenlp-full-2018-10-05/jaxb-core-2.3.0.1-sources.jar:/Users/shandou/Software/stanford-corenlp-full-2018-10-05/jaxb-core-2.3.0.1.jar:/Users/shandou/Software/stanford-corenlp-full-2018-10-05/jollyday-0.4.9-sources.jar:/Users/shandou/Software/stanford-corenlp-full-2018-10-05/jaxb-impl-2.4.0-b180830.0438-sources.jar

3. Test whether CoreNLP itself works by following the examples from the official setup guide:

# 1. Make a dummy input text file
echo "the quick brown fox jumped over the lazy dog" > input.txt
# 2. Test it out
java -mx3g edu.stanford.nlp.pipeline.StanfordCoreNLP -outputFormat json -file input.txt

The processing takes a while to complete, and you should see stdout similar to what is shown below:

[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Searching for resource: StanfordCoreNLP.properties ... found.
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator tokenize
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ssplit
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator pos
[main] INFO edu.stanford.nlp.tagger.maxent.MaxentTagger - Loading POS tagger from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... done [0.8 sec].
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator lemma
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ner
[main] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier - Loading classifier from edu/stanford/nlp/models/ner/english.all.3class.distsim.crf.ser.gz ... done [1.5 sec].
[main] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier - Loading classifier from edu/stanford/nlp/models/ner/english.muc.7class.distsim.crf.ser.gz ... done [0.7 sec].
[main] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier - Loading classifier from edu/stanford/nlp/models/ner/english.conll.4class.distsim.crf.ser.gz ... done [0.8 sec].
[main] INFO edu.stanford.nlp.time.JollyDayHolidays - Initializing JollyDayHoliday for SUTime from classpath edu/stanford/nlp/models/sutime/jollyday/Holidays_sutime.xml as sutime.binder.1.
[main] INFO edu.stanford.nlp.time.TimeExpressionExtractorImpl - Using following SUTime rules: edu/stanford/nlp/models/sutime/defs.sutime.txt,edu/stanford/nlp/models/sutime/english.sutime.txt,edu/stanford/nlp/models/sutime/english.holidays.sutime.txt
[main] INFO edu.stanford.nlp.pipeline.TokensRegexNERAnnotator - ner.fine.regexner: Read 580704 unique entries out of 581863 from edu/stanford/nlp/models/kbp/english/gazetteers/regexner_caseless.tab, 0 TokensRegex patterns.
[main] INFO edu.stanford.nlp.pipeline.TokensRegexNERAnnotator - ner.fine.regexner: Read 4869 unique entries out of 4869 from edu/stanford/nlp/models/kbp/english/gazetteers/regexner_cased.tab, 0 TokensRegex patterns.
[main] INFO edu.stanford.nlp.pipeline.TokensRegexNERAnnotator - ner.fine.regexner: Read 585573 unique entries from 2 files
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator depparse
[main] INFO edu.stanford.nlp.parser.nndep.DependencyParser - Loading depparse model: edu/stanford/nlp/models/parser/nndep/english_UD.gz ...
[main] INFO edu.stanford.nlp.parser.nndep.Classifier - PreComputed 99996, Elapsed Time: 10.184 (s)
[main] INFO edu.stanford.nlp.parser.nndep.DependencyParser - Initializing dependency parser ... done [11.4 sec].
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator coref
[main] INFO edu.stanford.nlp.coref.statistical.SimpleLinearClassifier - Loading coref model edu/stanford/nlp/models/coref/statistical/ranking_model.ser.gz ... done [2.4 sec].
[main] INFO edu.stanford.nlp.pipeline.CorefMentionAnnotator - Using mention detector type: dependency
Processing file /Users/shandou/Software/input.txt ... writing to /Users/shandou/Software/input.txt.json
Annotating file /Users/shandou/Software/input.txt ... done [0.6 sec].
Annotation pipeline timing information:
TokenizerAnnotator: 0.1 sec.
WordsToSentencesAnnotator: 0.0 sec.
POSTaggerAnnotator: 0.0 sec.
MorphaAnnotator: 0.1 sec.
NERCombinerAnnotator: 0.2 sec.
DependencyParseAnnotator: 0.1 sec.
CorefAnnotator: 0.0 sec.
TOTAL: 0.6 sec. for 9 tokens at 15.9 tokens/sec.
Pipeline setup: 52.6 sec.
Total time for StanfordCoreNLP pipeline: 53.3 sec.

Great! Now you should find an output file named input.txt.json that looks like this:

Figure 2: Content of the annotation output
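If you prefer to skim the output from the terminal rather than an editor, Python’s built-in json.tool module pretty-prints it (assuming input.txt.json sits in your current directory):

```shell
# Pretty-print the first lines of the annotation output
python3 -m json.tool input.txt.json | head -n 20
```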

[STEP 2] Set up the CoreNLP repo for the script `corenlp.sh`

Now we are ready to move on to the steps needed for making corenlp.sh work.

  1. Clone the repo: Go to the GitHub repo and clone the package:
git clone https://github.com/stanfordnlp/CoreNLP.git

Go to the directory where you cloned the repo (for me, /Users/shandou/Software/CoreNLP) and enter the subfolder doc/corenlp/

You should see the script corenlp.sh, as shown here:

Figure 3: Location of the `corenlp.sh` script

2. Set up Apache Ant

If you don’t already have ant, install it via brew install ant

Then run ant jar in your terminal

3. Download the latest models jar with

wget http://nlp.stanford.edu/software/stanford-corenlp-models-current.jar

or

curl -O http://nlp.stanford.edu/software/stanford-corenlp-models-current.jar

4. Modify corenlp.sh before testing

This is quite a gotcha. Because we have already set up the path configuration for frequent use, the copying flags in the original script actually yield errors. We must apply the following changes before using the script:

Figure 4: necessary changes in corenlp.sh
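I cannot reproduce the exact diff here, but if all you need is a launcher that reuses the CLASSPATH exported in [STEP 1] instead of the script’s own jar handling, a minimal stand-in along these lines works (the corenlp_min.sh name is mine, and this is a sketch rather than the official script):

```shell
# Write a minimal stand-in launcher that relies on the CLASSPATH exported
# in [STEP 1]; `corenlp_min.sh` is a made-up name, not the official script
cat > corenlp_min.sh <<'EOF'
#!/bin/sh
java -mx5g -cp "$CLASSPATH" edu.stanford.nlp.pipeline.StanfordCoreNLP "$@"
EOF
chmod +x corenlp_min.sh
```

With the environment from [STEP 1] loaded, `./corenlp_min.sh -file input.txt` should then behave like the direct java invocation used earlier.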

Now give it a try following the comments in the script via

./corenlp.sh -file input.txt

DONE!

Additional Resources Regarding CoreNLP

  1. On the CoreNLP GUI: I have yet to play with CoreNLP’s graphical user interface. For more information about the interface, please refer to a nice blog article on cloudacademy.com
  2. On details of the annotations: Please refer to the CoreNLP doc page for a full list of annotations

Epilogue

Takeaway from running CoreNLP annotation on a DigitalOcean Droplet: it is RAM intensive (≥ 16GB RAM recommended)

CoreNLP annotation turns out to be surprisingly RAM intensive. When testing on my 16GB MacBook, both the simple tests and my actual annotation tasks ran through without incident (though it takes about 17 seconds to annotate each news article). Given how time-consuming the task is, I set up a DigitalOcean Droplet to run a bigger annotation job. At first I used a basic 8GB configuration and kept getting “not enough memory” errors even for a simple one-sentence annotation. I initially suspected some cross-OS oddity, until it finally dawned on me (after running across a user forum thread discussing RAM issues) that I simply had not provisioned enough RAM for CoreNLP.

After resizing the Droplet to 16GB RAM, the annotation task has been running smoothly. If you run into similar issues on your own Linux server, check the output of free -hm and make sure you have more than 8GB to spare (though I don’t know the exact RAM requirements).
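A quick pre-flight check along these lines can save a failed run on a Linux box (it reads /proc/meminfo, so it is Linux-only, and the 8GB floor is just the rough threshold from my experience above):

```shell
# Rough pre-flight RAM check before a large annotation job
# (Linux-only: reads /proc/meminfo; the 8GB floor is from my Droplet experience)
avail_kb=$(awk '/MemAvailable/ {print $2}' /proc/meminfo)
if [ "$avail_kb" -lt $((8 * 1024 * 1024)) ]; then
    echo "Fewer than 8GB available; CoreNLP may hit out-of-memory errors"
else
    echo "Memory looks OK (${avail_kb} kB available)"
fi
```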
