Coding Synergy: Bridging CoreNLP in Java with Python for NLP

Bagiyalakshmi · Published in featurepreneur · 3 min read · Jul 9, 2023

In this article, we will set up and use the Stanford CoreNLP Server with Python.

  1. Download Stanford CoreNLP

2. Install Java 8

3. Run the Stanford CoreNLP server.

  • Go to the directory of the unzipped Stanford CoreNLP and execute the command below:
java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -annotators "tokenize,ssplit,pos,lemma,parse,sentiment" -port 9000 -timeout 30000
  • This command starts the server on port 9000 with a timeout of 30 seconds.
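Once the server is running, you can sanity-check it from Python before writing a full client. Here is a minimal sketch, assuming the server is listening on localhost:9000 as started above:

import requests
import json

# Ask the server to tokenize, sentence-split and POS-tag a test sentence.
props = {'annotators': 'tokenize,ssplit,pos', 'outputFormat': 'json'}
resp = requests.post(
    'http://localhost:9000/',
    params={'properties': json.dumps(props)},
    data='The server is alive.'.encode('utf-8'),
)
# A JSON document with one sentence and its tokens should come back.
print(resp.json()['sentences'][0]['tokens'])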

4. Accessing Stanford CoreNLP Server using Python

  • Installing Dependencies:

To access the CoreNLP Server from Python, we need a client library: ‘stanfordcorenlp’, which provides a Python interface for communicating with the server, or ‘stanza’, Stanford’s newer library that runs its own neural pipeline directly in Python. Use one of the following commands to install the library.

You can use either one,

pip install stanfordcorenlp
pip install stanza
  • Now that we have set up the CoreNLP Server and installed the necessary dependencies, we can access it using Python. Here are two example code snippets to get started.
  1. This is an example code snippet using the ‘stanfordcorenlp’ library
from stanfordcorenlp import StanfordCoreNLP
from collections import defaultdict
import json

class StanfordNLP:
    def __init__(self, host='http://localhost', port=9000):
        # Connect to the running CoreNLP server (30-second timeout).
        self.nlp = StanfordCoreNLP(host, port=port, timeout=30000)
        self.props = {
            'annotators': 'tokenize,ssplit,pos,lemma,ner,parse,depparse,dcoref,relation',
            'pipelineLanguage': 'en',
            'outputFormat': 'json'
        }

    def word_tokenize(self, sentence):
        return self.nlp.word_tokenize(sentence)

    def pos(self, sentence):
        return self.nlp.pos_tag(sentence)

    def ner(self, sentence):
        return self.nlp.ner(sentence)

    def parse(self, sentence):
        return self.nlp.parse(sentence)

    def dependency_parse(self, sentence):
        return self.nlp.dependency_parse(sentence)

    def annotate(self, sentence):
        # The server returns a JSON string; parse it into a dict.
        return json.loads(self.nlp.annotate(sentence, properties=self.props))

    @staticmethod
    def tokens_to_dict(_tokens):
        # Index the server's token dicts by their 1-based token index.
        tokens = defaultdict(dict)
        for token in _tokens:
            tokens[int(token['index'])] = {
                'word': token['word'],
                'lemma': token['lemma'],
                'pos': token['pos'],
                'ner': token['ner']
            }
        return tokens

def startpy():
    sNLP = StanfordNLP()
    text = 'Meet Dr. Shaun Murphy, who is autistic and a doctor.'
    print("Annotate:", sNLP.annotate(text))
    print("POS:", sNLP.pos(text))
    print("Tokens:", sNLP.word_tokenize(text))
    print("NER:", sNLP.ner(text))
    print("Parse:", sNLP.parse(text))
    print("Dep Parse:", sNLP.dependency_parse(text))

if __name__ == '__main__':
    startpy()
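One practical note on the ‘stanfordcorenlp’ wrapper: each StanfordCoreNLP instance holds a connection to the server, and it is good practice to release it when you are done. A small sketch of how startpy() could end, assuming the class above:

sNLP = StanfordNLP()
# ... use sNLP as shown above ...
# Release the connection so the backend server can free its resources.
sNLP.nlp.close()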

2. This is an example code snippet using the ‘stanza’ library (unlike the wrapper above, stanza runs its own neural pipeline locally instead of calling the CoreNLP server)

import stanza
from collections import defaultdict

class StanzaNLP:
    def __init__(self):
        # Build a local stanza pipeline; 'constituency' provides phrase-structure trees.
        self.nlp = stanza.Pipeline(lang='en',
                                   processors='tokenize,pos,lemma,ner,depparse,constituency')

    def word_tokenize(self, sentence):
        doc = self.nlp(sentence)
        return [token.text for sent in doc.sentences for token in sent.tokens]

    def pos(self, sentence):
        doc = self.nlp(sentence)
        return [word.xpos for sent in doc.sentences for word in sent.words]

    def ner(self, sentence):
        doc = self.nlp(sentence)
        # Return each entity span together with its label.
        return [(ent.text, ent.type) for sent in doc.sentences for ent in sent.ents]

    def parse(self, sentence):
        doc = self.nlp(sentence)
        # Constituency trees come from the 'constituency' processor.
        return [sent.constituency for sent in doc.sentences]

    def dependency_parse(self, sentence):
        doc = self.nlp(sentence)
        return [
            [(word.head, word.deprel) for word in sent.words]
            for sent in doc.sentences
        ]

    @staticmethod
    def tokens_to_dict(_tokens):
        # Expects word dicts such as those produced by sentence.to_dict().
        tokens = defaultdict(dict)
        for token in _tokens:
            tokens[int(token['id'])] = {
                'word': token['text'],
                'lemma': token['lemma'],
                'pos': token['xpos'],
                'ner': token['ner']
            }
        return tokens

def startpy():
    sNLP = StanzaNLP()
    text = 'Meet Dr. Shaun Murphy, who is autistic and a doctor.'
    print("Word Tokenize:", sNLP.word_tokenize(text))
    print("POS:", sNLP.pos(text))
    print("NER:", sNLP.ner(text))
    print("Dependency Parse:", sNLP.dependency_parse(text))

if __name__ == '__main__':
    startpy()
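Before the stanza pipeline can run, the English models have to be downloaded once; stanza.Pipeline will fail if they are missing. This is stanza's documented one-time setup call:

import stanza

# Downloads the English models to ~/stanza_resources (one-time step).
stanza.download('en')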

There is an extensive list of annotators; here we will look at a few of them.

  • tokenize

Tokenization is the process of turning text into tokens. For example, the sentence “Claire is a good singer.” would be tokenized as “Claire”, “is”, “a”, “good”, “singer”, “.”.
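To reproduce exactly this in code, a minimal sketch using a tokenize-only stanza pipeline:

import stanza

nlp = stanza.Pipeline(lang='en', processors='tokenize')
doc = nlp('Claire is a good singer.')
# Prints: ['Claire', 'is', 'a', 'good', 'singer', '.']
print([token.text for sent in doc.sentences for token in sent.tokens])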

  • pos

Part-of-speech tagging assigns part-of-speech labels to tokens, such as whether they are verbs or nouns. Every token in a sentence receives a tag.
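For example, a short stanza sketch that pairs each word with its Penn Treebank tag (word.xpos):

import stanza

nlp = stanza.Pipeline(lang='en', processors='tokenize,pos')
doc = nlp('Claire is a good singer.')
# Pairs such as ('Claire', 'NNP') and ('is', 'VBZ') are expected here.
print([(word.text, word.xpos) for sent in doc.sentences for word in sent.words])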

  • ner

Named entity recognition identifies named entities (person and organization names, locations, etc.) in text.
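A minimal stanza sketch that lists each recognized entity span with its label:

import stanza

nlp = stanza.Pipeline(lang='en', processors='tokenize,ner')
doc = nlp('Meet Dr. Shaun Murphy, who is autistic and a doctor.')
# Entity spans with labels, e.g. ('Shaun Murphy', 'PERSON').
print([(ent.text, ent.type) for ent in doc.ents])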

  • parse

Parsing refers to the process of analyzing the syntactic structure of sentences to determine the relationships between words and their roles in a sentence. It involves building a parse tree that represents the hierarchical structure of the sentence based on grammar rules and syntactic dependencies.
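In stanza, phrase-structure trees come from the 'constituency' processor (available in recent versions; it also needs 'pos' in the pipeline). A minimal sketch:

import stanza

nlp = stanza.Pipeline(lang='en', processors='tokenize,pos,constituency')
doc = nlp('Claire is a good singer.')
# Prints a tree like (ROOT (S (NP (NNP Claire)) (VP ...) (. .)))
for sent in doc.sentences:
    print(sent.constituency)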

  • depparse

Dependency parsing analyzes the grammatical structure of a sentence by linking each word to its head word and labeling the relation between them (for example, nsubj for a nominal subject). Unlike constituency parsing, its output is a set of head-dependent pairs rather than a phrase-structure tree.
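A minimal stanza sketch that links each word to the index of its head word and the relation label (head 0 marks the root of the sentence):

import stanza

nlp = stanza.Pipeline(lang='en', processors='tokenize,pos,lemma,depparse')
doc = nlp('Claire is a good singer.')
# Triples like ('Claire', 5, 'nsubj'): 'Claire' depends on word 5 ('singer').
for sent in doc.sentences:
    print([(word.text, word.head, word.deprel) for word in sent.words])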

In this article, we have explored how to access the Stanford CoreNLP Server using Python. By leveraging the ‘stanfordcorenlp’ and ‘stanza’ libraries, we can conveniently perform a wide range of natural language processing tasks. This integration of CoreNLP’s powerful analysis capabilities with the flexibility of Python opens up a world of possibilities for NLP applications.

Let’s keep exploring and digging deeper!!

Happy learning and coding !!!
