NLP in a Hurry!

Pier Lim
Temasek Root Access
7 min read · Nov 8, 2019

Throughout my work, there are many cases where we have to show initial results or a proof-of-concept quickly. NLP prototyping libraries enable us to adopt a “show, don’t tell” style of innovation, in line with the Root Access team’s culture of scalable innovation here at Temasek.

Fast iterations of proof-of-concept prototypes allow us to work together with colleagues across the firm and arrive at better solutions through iterative, timely feedback. They also let us start product development on the right footing, reducing inefficiencies.

As a result, I’ve collected a number of Python libraries for Natural Language Processing that I think are invaluable for getting a working prototype done in a jiffy. Kudos to Tan Li Ling, my NLP lecturer from NUS-ISS, who introduced me to a few of these libraries.

I intend for this to be a living blog post, so that I can document new finds that help me get work done quickly.

Semantic Similarity

Sentence Transformers: https://github.com/UKPLab/sentence-transformers

Sentence Transformers is a library that leverages the well-known BERT model to produce sentence embeddings for semantic matching. In my case, it was very useful for showing people what is possible with sentence-level semantic similarity and how it differs from keyword search.

For example, the following two sentences:

What gifts am I allowed to receive from my vendor?
What presents can I get from my supplier?

essentially mean the same thing; hence, they are semantically similar. Sentence Transformers allows us to detect this easily.

Here is an example of semantic search (adapted from the Sentence Transformers examples).

First, we load Sentence Transformers and define the corpus across which we will search.

from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer('bert-base-nli-mean-tokens')

# Corpus with example sentences
corpus = ['A man is eating a food.',
          'A man is eating a piece of bread.',
          'The girl is carrying a baby.',
          'A man is riding a horse.',
          'A woman is playing violin.',
          'Two men pushed carts through the woods.',
          'A man is riding a white horse on an enclosed ground.',
          'A monkey is playing drums.',
          'A cheetah is running behind its prey.']
corpus_embeddings = embedder.encode(corpus)

Then we compute the cosine distances between each query’s embedding and the BERT sentence embeddings across the corpus.

import scipy.spatial

queries = ['A man is eating pasta.', 'Someone in a gorilla costume is playing a set of drums.', 'A cheetah chases prey across a field.']
query_embeddings = embedder.encode(queries)

for query, query_embedding in zip(queries, query_embeddings):
    # One cosine distance per corpus sentence; smaller = more similar
    distances = scipy.spatial.distance.cdist([query_embedding], corpus_embeddings, "cosine")[0]
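To turn those distances into a ranked result list, sort them and print the closest corpus sentences. Here is a minimal sketch (top_k is my own choice of cutoff, not part of the library):

import numpy as np

top_k = 3
for query, query_embedding in zip(queries, query_embeddings):
    distances = scipy.spatial.distance.cdist([query_embedding], corpus_embeddings, "cosine")[0]
    print("\nQuery:", query)
    # The smallest cosine distances are the closest semantic matches
    for idx in np.argsort(distances)[:top_k]:
        print("  %s (distance: %.4f)" % (corpus[idx], distances[idx]))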

Sentence Transformers itself is specifically tuned for common downstream tasks like sentence similarity, so it works better out-of-the-box than, say, Hugging Face’s pre-trained transformers.

If the results you get out-of-the-box are not sufficient, you can fine-tune the models via the instructions on the website here. Using this library, you can actually build a semantic search prototype in quite a short time; check it out!

Rule-based Text Sentiment for Social Media

vaderSentiment: https://github.com/cjhutto/vaderSentiment

vaderSentiment is a library that lets us do text sentiment mining even when we don’t have a whole lot of data. Nowadays the natural inclination would be to turn to deep learning to get the best sentiment fidelity, but the problem with that approach is that (1) you need a lot of data and (2) you need time to tune your models. Although vaderSentiment is rule-based, it gives decent results, and I used it in a quick prototype for one of the departments in my company. What is great about vaderSentiment is that it takes into account things present in social media, such as emoticons, and it increases sentiment intensity for emphasis such as “!!!!”. It also handles common quirks of natural language, such as the word “but”. Suppose someone says:

You are very pretty, but you .....

You don’t even hear the words “You are very pretty” anymore; our brain naturally focuses on the words that come after the “but”. vaderSentiment takes all this into account, giving more weight to the words that follow “but”.
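You can see this directly by scoring a contrastive sentence (the example sentence here is my own):

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
# The clause after "but" is weighted more heavily, so the compound
# score should come out negative despite the positive opening clause.
print(analyzer.polarity_scores("The food was great, but the service was awful."))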

We can get results fast with vaderSentiment through a few simple lines of code.

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
# note: depending on how you installed (e.g., using source code download versus pip install), you may need to import like this:
# from vaderSentiment import SentimentIntensityAnalyzer

# --- examples -------
sentences = ["VADER is smart, handsome, and funny.",  # positive sentence example
             "VADER is smart, handsome, and funny!",  # punctuation emphasis handled correctly (sentiment intensity adjusted)
             "VADER is very smart, handsome, and funny.",  # booster words handled correctly (sentiment intensity adjusted)
             "VADER is VERY SMART, handsome, and FUNNY.",  # emphasis for ALLCAPS handled
             "VADER is VERY SMART, handsome, and FUNNY!!!",  # combination of signals - VADER appropriately adjusts intensity
             "VADER is VERY SMART, uber handsome, and FRIGGIN FUNNY!!!",  # booster words & punctuation make this close to ceiling for score
             "VADER is not smart, handsome, nor funny.",  # negation sentence example
             "The book was good.",  # positive sentence
             "At least it isn't a horrible book.",  # negated negative sentence with contraction
             "The book was only kind of good.",  # qualified positive sentence is handled correctly (intensity adjusted)
             "The plot was good, but the characters are uncompelling and the dialog is not great.",  # mixed negation sentence
             "Today SUX!",  # negative slang with capitalization emphasis
             "Today only kinda sux! But I'll get by, lol",  # mixed sentiment example with slang and contrastive conjunction "but"
             "Make sure you :) or :D today!",  # emoticons handled
             "Catch utf-8 emoji such as 💘 and 💋 and 😁",  # emojis handled
             "Not bad at all"  # capitalized negation
             ]

analyzer = SentimentIntensityAnalyzer()
for sentence in sentences:
    vs = analyzer.polarity_scores(sentence)
    print("{:-<65} {}".format(sentence, str(vs)))

The following is an example of the scores you get with vaderSentiment.

VADER is not smart, handsome, nor funny.------------------------- {'pos': 0.0, 'compound': -0.7424, 'neu': 0.354, 'neg': 0.646}

‘compound’ is the score that you want, as it gives the sentiment of the entire sentence as a whole: it sums the valence scores of each element in the sentence and normalizes the result to between -1 and 1.

‘pos’, ‘neu’ and ‘neg’ give the proportions of the text that fall into positive, neutral and negative categories. These sum to 1, or very close to it.

So the above sentence is pretty negative, based on the -0.7424 score that we got. Pretty spot on!
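If you need a coarse label rather than a raw score, the VADER documentation suggests thresholding the compound score at ±0.05. A minimal sketch:

def classify(compound, threshold=0.05):
    # Thresholds suggested in the VADER documentation
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

print(classify(-0.7424))  # -> negative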

It is all open-source, so you can delve into the code should you want to, for example, incorporate Singlish into the sentiment mining.
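One low-effort way to do that (my own reading of the source, not an officially documented API) is to update the analyzer’s lexicon dictionary in place with your own valence scores:

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
# Hypothetical Singlish entries with hand-picked valence scores:
# "shiok" roughly means "great"; "sian" means bored or fed up
analyzer.lexicon.update({"shiok": 3.0, "sian": -1.9})
print(analyzer.polarity_scores("The laksa here is shiok!"))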

Named Entity Recognition

spaCy: https://spacy.io/usage/linguistic-features

Yes, I know spaCy is very popular, but I found this particular Named Entity Recognition example especially straightforward for convincing folks that NLP is able to recognise entities and relationships.

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)
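With the small English model, this prints output like the following (taken from the spaCy documentation):

Apple 0 5 ORG
U.K. 27 31 GPE
$1 billion 44 54 MONEY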

A nice thing about spaCy’s NER model is that you can fine-tune it as well. This blog post gives a good guide: https://towardsdatascience.com/custom-named-entity-recognition-using-spacy-7140ebbb3718

There are several other visualizations that spaCy can generate that look pretty nice, like its POS and dependency-parse visualization.
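A minimal sketch using spaCy’s built-in displaCy visualizer, which serves the rendering on a local web page:

import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
# style="dep" renders the dependency parse with POS tags;
# use style="ent" for highlighted named entities instead
displacy.serve(doc, style="dep")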

Production-Ready BERT Models

BERT-As-A-Service: https://github.com/hanxiao/bert-as-service

I guess most of you reading this post already know of this library, written by Han Xiao, a lead at Tencent AI Lab. I shall not go too much into it as it is already really popular; you can read more on the website.

What BERT-As-A-Service does is wrap the BERT code and serve it using ZeroMQ, allowing one to serve BERT embeddings with just a few lines of code. It is fast (optimized), scalable and reliable, and we are using it in one of our internal enterprise projects.

It runs as two components: a client and a server. You typically start the server with a model of your choice.

bert-serving-start -model_dir /tmp/english_L-12_H-768_A-12/ -num_worker=4

And then run the client to do the BERT encoding, like in the following example code.

from bert_serving.client import BertClient
bc = BertClient()
bc.encode(['First do it', 'then do it right', 'then do it better'])
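For a BERT-Base model like the one above, this returns a numpy ndarray with one 768-dimensional vector per input sentence (a [3, 768] array here), ready to feed into downstream similarity or classification code.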

By the way, Han Xiao also recently came up with GNES (Generic Neural Elastic Search, https://github.com/gnes-ai/gnes), which looks really impressive, but I haven’t had the time to play with it properly.

So that’s a not-so-comprehensive list of libraries I’ve found useful while doing NLP work. Let me know if you’ve found others; I’ll add them to the list if they prove useful!

Machine Learning Lead at Temasek, Digital Technology