AI in Practice: Identifying Parts of Speech in Python

Brian Ray and Alice Zheng at Puget Sound Python

In short: computers can, most of the time, correctly identify the part of speech of each word in a given sentence, and Python can help.

Part of NLP (Natural Language Processing) is part-of-speech tagging. I'm talking about nouns, verbs, adverbs, adjectives, pronouns… and all that stuff you learned in grade school (I hope). I gave a talk on this topic[1] back in December 2015 at the Puget Sound Python user group at Redfin in Seattle, following Alice Zheng's talk on Feature Engineering. I am revisiting that talk with a blog post today because there are still a lot of interesting tools relating to parts of speech in Python.

[1] Yes, I did update it to run on Python 3 here and added the Google API. You are welcome!

  1. NLTK (Proxy for others, good starting point)
  2. pyStatParser (python yay!, little slow, but fun)
  3. Stanford (popular) and btw, online! => http://nlp.stanford.edu:8080/parser/
  4. TextBlob (python yay! NLTK simplification)
  5. CLiPS Pattern (python yay!)
  6. MaltParser (java 1.8)
  7. spaCy (python yay!)
  8. **NEW** Google Cloud Natural Language (API callable by Python)

Our sent variable here is set to:

“Each of us is full of s**t in our own special way. We are all sh**ty little snowflakes dancing in the universe.” ― Lewis Black, Me of Little Faith
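
In Python that is just a string assignment; the sent variable below is what every example in this post parses.

# the example sentence used throughout this post
sent = ("Each of us is full of s**t in our own special way. "
        "We are all sh**ty little snowflakes dancing in the universe.")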

NLTK

When we are talking about *learning* NLP, NLTK is the book, the start, and, ultimately, the glue-on-glue. Please note that many of the examples here use NLTK to wrap fully implemented POS (part-of-speech) taggers.
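
Since several of the tools below are reached through NLTK anyway, here is a minimal sketch of NLTK's own tagger on its own, assuming the punkt and averaged_perceptron_tagger data packages have been downloaded:

import nltk

# one-time data downloads:
# nltk.download('punkt')
# nltk.download('averaged_perceptron_tagger')

tokens = nltk.word_tokenize(sent)  # split the sentence into word tokens
tagged = nltk.pos_tag(tokens)      # list of (word, Penn Treebank tag) pairs
print(tagged[:5])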

pyStatParser

from stat_parser import Parser

parser = Parser()
tree = parser.parse(sent)  # returns an nltk.Tree instance
tree
[output from stat_parser]

pyStatParser hasn't been maintained much, nor has it needed to be. I am asking for a Python 3 pull request to be merged here. I like it because it uses grammars (Probabilistic Context-Free Grammars) built from QuestionBank and the Penn Treebank. It wraps NLTK and returns an NLTK tree.

Stanford POS

import os
from nltk.tag import StanfordPOSTagger

os.environ['CLASSPATH'] = "stanford-pos"        # directory containing the Stanford POS jars
os.environ['STANFORD_MODELS'] = "stanford-pos"  # directory containing the .tagger model files
st = StanfordPOSTagger('english-bidirectional-distsim.tagger')
st.tag(sent.split())

Stanford is probably the most widely used POS tagger. It is proven. I also spent a lot of time messing with some pretty large JAR (Java) files to get it to work. It does not feel modern at all, but if you need something proven, this is the way to go.

TextBlob

from textblob import TextBlob

blob = TextBlob(sent)
blob.parse()  # full parse: tokens, POS tags, chunks
blob.tags     # list of (word, POS tag) tuples
[TextBlob output]

As far as 'Pythonic' goes, I find TextBlob probably the most Pythonic of the bunch. Its simplicity reminds me of the requests module, which is a compliment.

MaltParser

import os
import nltk

mp = nltk.parse.malt.MaltParser(os.getcwd(), model_filename="engmalt.linear-1.7.mco")
res = mp.parse_one(sent.split())  # returns a DependencyGraph
res.tree()
[output tree from MaltParser]
list(res.triples())
[(('stuff', 'NN'), 'nn', ('Each', 'NN')),
(('stuff', 'NN'), 'nn', ('of', 'NN')),
(('stuff', 'NN'), 'nn', ('us', 'NNS')),
(('stuff', 'NN'), 'nn', ('is', 'NNS')),
(('stuff', 'NN'), 'nn', ('full', 'NN')),
(('stuff', 'NN'), 'nn', ('of', 'NN')),
(('stuff', 'NN'), 'prep', ('in', 'IN')),
(('in', 'IN'), 'pobj', ('way', 'NN')),
(('way', 'NN'), 'nn', ('our', 'NN')),
(('way', 'NN'), 'nn', ('own', 'NN')),
(('way', 'NN'), 'nn', ('special', 'NN'))]

MaltParser is a dependency parser. I don't fully understand the output, but each triple appears to be a (head, relation, dependent) entry from the dependency graph.
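
For what it is worth, here is a minimal sketch (assuming the res DependencyGraph from the code above) that prints each triple in a more readable head/relation/dependent form:

# assumes `res` is the DependencyGraph returned by mp.parse_one() above
for (head, head_tag), relation, (dep, dep_tag) in res.triples():
    print("{}/{} --{}--> {}/{}".format(head, head_tag, relation, dep, dep_tag))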

Pattern

from pattern.en import parse, pprint

s = parse(sent,
          tokenize=True,   # tokenize the input
          tags=True,       # find part-of-speech tags
          chunks=True,     # find chunk tags
          relations=True,  # find relations between chunks
          lemmata=True,    # find word lemmata
          light=False)
pprint(s)
[output from Pattern]

Pattern was the shocker. It could use a better name, and it is also struggling to keep up with Python 3. Nonetheless, Pattern has some very cool features beyond just POS. For instance, it comes with a search() method that finds POS matches for a rule in a parse tree. For example, search('VB*', tree) matches any verb tag via the wildcard. Very useful for feature engineering tasks, as sketched below.
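
A minimal sketch of that idea, using Pattern's parsetree() and search() helpers (assuming Pattern is installed and importable):

from pattern.en import parsetree
from pattern.search import search

tree = parsetree(sent)             # chunked, POS-tagged parse tree
for match in search('VB*', tree):  # 'VB*' covers VB, VBD, VBG, VBN, VBP, VBZ
    print(match)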

spaCy

from spacy.en import English  # spaCy 1.x; newer releases use spacy.load("en_core_web_sm")

parser = English()
parsedData = parser(sent)
for i, token in enumerate(parsedData):
    print("original:", token.orth, token.orth_)
    print("lowercased:", token.lower, token.lower_)
    print("lemma:", token.lemma, token.lemma_)
    print("shape:", token.shape, token.shape_)
    print("prefix:", token.prefix, token.prefix_)
    print("suffix:", token.suffix, token.suffix_)
    print("log probability:", token.prob)
    print("Brown cluster id:", token.cluster)
    print("-" * 40)
    if i > 2:
        break
[partial output from spaCy]

Check out this Interactive Example!

spaCy can be faster than the NLTK-wrapped models or Stanford. It has a lot of richness to its functionality. It is not always as Pythonic, but it is certainly the route to look at for a large-scale project where you already know what you want to do.
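
The example above prints token attributes but not the tags themselves. Here is a minimal sketch of pulling the POS tags out of spaCy, assuming a newer spaCy release with the en_core_web_sm model installed:

import spacy

nlp = spacy.load("en_core_web_sm")  # small English pipeline
doc = nlp(sent)
for token in doc:
    # pos_ is the coarse universal POS tag, tag_ is the fine-grained Penn Treebank tag
    print(token.text, token.pos_, token.tag_)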

Google Cloud Natural Language API

import os
import httplib2
from googleapiclient import discovery
from oauth2client.client import GoogleCredentials

os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = os.path.expanduser("~/my_key_from_google.json")
DISCOVERY_URL = ('https://{api}.googleapis.com/'
                 '$discovery/rest?version={apiVersion}')
credentials = GoogleCredentials.get_application_default().create_scoped(
    ['https://www.googleapis.com/auth/cloud-platform'])
http = httplib2.Http()
credentials.authorize(http)
service = discovery.build('language', 'v1beta1',
                          http=http, discoveryServiceUrl=DISCOVERY_URL)
service_request = service.documents().analyzeSyntax(
    body={
        'document': {
            'type': 'PLAIN_TEXT',
            'content': sent
        }
    })
response = service_request.execute()
for token in response['tokens']:
    print("{} -> {}".format(token['text']['content'], token['partOfSpeech']['tag']))
[Google output]
response['tokens'][0]['partOfSpeech']

Lastly, I am adding a call to the Google API for NLP. It was very easy to get up and running. The downside is that beyond 5k calls it starts to cost. Likewise, beyond the exact functionality published in the API, you can't get into the fine details, as it is a black box. Nonetheless, it was very easy to use, surprisingly fast even though it requires an API call over the internet, and it adds some interesting context around the POS that can become essential if you need more context-aware NLP.

Conclusion

It is no secret that Python is a great tool for NLP and has been for a while. I remain quite amazed at how far some of these toolkits have come. On the flip side, NLP still has the flavor of a niche market with some of these tools. I encourage authors to update to Python 3. The ease of installation of some of these is another issue. TextBlob and spaCy remain true to the Python ease of use. Meanwhile, cloud AI services from Google, Amazon, and Microsoft, to name a few, are investing heavily and closing in on making some of this functionality more accessible.

One thing that is missing in all of these is a better, fully integrated toolset to help write POS rules for some pretty custom feature engineering. I took a break from POSH syntax, as I had mentioned before. I would love to pick this up again and perhaps have it complement some of these other tools.

What remains forever true: language is hard. We need a simple language like Python to solve some of these hard problems. Likewise, many of these tools are abstractions glued together, and Python works well as a glue language. I will be happy to hear in the comments what experience other practitioners have in this area. What tools did I miss? Where is this going? How can we better implement context-aware language parsing?

Thank you!