Sentiment Analysis with SparkNLP — It Couldn’t Be Easier

A mini sentiment analysis project using the SparkNLP library

Elif Pekcokguler
Analytics Vidhya
May 17, 2020


https://www.youtube.com/watch?v=SPmxsRDSmTc

Natural Language Processing is an exciting technology: breakthroughs arrive almost daily, and there is no limit to the ways we express ourselves. Sentiment analysis makes things more complicated still. Say we’re analysing customer feedback: a customer can give both positive and negative feedback in one sentence, can be sarcastic (!) :) and can misspell words, among other things. So what do we do?

Before I start, I’d like to mention that at the end of this story you’ll find some of the articles I came across while learning about NLP. These are just baby steps, and I am just a beginner in this field…

For sentiment analysis, if the data is labelled you’re lucky: you can use Bag of Words or embeddings to represent the text numerically and train a classifier to run predictions on your test data.

If the data is not labelled, you’re lucky again, because good pre-trained embeddings are available and can get you quite good results with clustering.

However, you might have heard of SparkNLP if you have done a little searching on Spark and NLP. We have the pyspark.ml library, but I have to say SparkNLP makes it easier than ever.

All you have to do is install SparkNLP (it is published on PyPI as spark-nlp) and start to play with it.

Dataset:

Please feel free to choose any other dataset; the steps are explained as generically as possible.

Environment: AWS EMR 5.29.0 with Spark 2.4.4, TensorFlow 1.14.0

(You can find requirements in https://github.com/JohnSnowLabs/spark-nlp/blob/master/README.md)

0. Import libraries and the pretrained sentiment analyzer model:

There are two ways to import the pretrained pipeline:

a. Offline:

Download the analyze_sentimentdl_use_twitter zip from: https://github.com/JohnSnowLabs/spark-nlp-models

Unzip it and upload it to S3.

When you unzip it, I urge you to check the folders and see which stages are in the pipeline. :)
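Loading the unzipped folder could look roughly like the sketch below. The S3 path in the usage comment is hypothetical, and the helper names are mine. I am assuming a SparkSession that already has the SparkNLP jar on its classpath; PipelineModel.load is the standard pyspark.ml loader for a saved pipeline.

```python
def load_offline_pipeline(path):
    """Load the unzipped pretrained pipeline from storage (for example S3)."""
    # Imported lazily so this file can be imported on a machine without Spark.
    import sparknlp  # noqa: F401  (makes the SparkNLP Python annotator classes importable)
    from pyspark.ml import PipelineModel

    return PipelineModel.load(path)


def stage_names(pipeline_model):
    """Return the class name of every stage in the pipeline, in order."""
    return [type(stage).__name__ for stage in pipeline_model.stages]


# Usage (hypothetical bucket and prefix, needs an active SparkSession):
# pipeline = load_offline_pipeline("s3://my-bucket/models/analyze_sentimentdl_use_twitter")
# print(stage_names(pipeline))
```

Printing the stage names is a quick way to do the "check what is in the pipeline" step without opening the folders by hand.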

b. Online:

I added the SparkNLP jar manually. If you choose to do it this way too, please make sure you use the following settings while creating the SparkSession. Without them I got errors such as “Table not initialized” (setting the serializer solves this) and Java errors due to memory, and the Universal Sentence Encoder (one of the stages in this pipeline) caused errors when the cluster mode was “yarn”.
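Here is a minimal sketch of such a session plus the online download. The memory and buffer values, the jar path and the local[*] master are my assumptions for illustration, not the exact settings from my cluster, so please tune them for yours. The Kryo serializer line is the one that addresses “Table not initialized”.

```python
# Assumed values: adjust memory, buffer size and jar path to your environment.
SPARK_SETTINGS = {
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    "spark.kryoserializer.buffer.max": "1000M",
    "spark.driver.memory": "16g",
    "spark.jars": "/path/to/spark-nlp.jar",  # hypothetical location of the manually added jar
}


def create_session(app_name="sentiment-analysis", master="local[*]", settings=None):
    """Build a SparkSession with the settings above (master is an assumption)."""
    from pyspark.sql import SparkSession  # lazy import, Spark is only needed here

    builder = SparkSession.builder.appName(app_name).master(master)
    for key, value in (settings or SPARK_SETTINGS).items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


def load_online_pipeline(name="analyze_sentimentdl_use_twitter", lang="en"):
    """Download the pretrained pipeline by name (needs internet access)."""
    from sparknlp.pretrained import PretrainedPipeline

    return PretrainedPipeline(name, lang=lang)


# Usage:
# spark = create_session()
# pipeline = load_online_pipeline()
```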

1. Read the dataset into a Spark DataFrame and preprocess it if you wish:

a. Remove non-ASCII characters

b. Remove punctuation (punctuation might be useful in some cases!)

c. Remove texts shorter than a specific length

d. Correct spelling (you can use SparkNLP’s ‘check_spelling_dl’ for context-preserving spell correction)
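Steps a to c can be sketched in a few lines. The column name "text", the threshold of 10 characters and the function names are assumptions for illustration; spelling correction (step d) is left out here.

```python
import re
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_text(text):
    """Drop non-ASCII characters and punctuation, then collapse whitespace."""
    if text is None:
        return None
    ascii_only = text.encode("ascii", errors="ignore").decode("ascii")
    no_punct = ascii_only.translate(_PUNCT_TABLE)
    return re.sub(r"\s+", " ", no_punct).strip()


def preprocess(df, text_col="text", min_chars=10):
    """Apply clean_text to a DataFrame column and drop texts that are too short."""
    from pyspark.sql import functions as F  # lazy import, Spark is only needed here
    from pyspark.sql.types import StringType

    clean_udf = F.udf(clean_text, StringType())
    cleaned = df.withColumn(text_col, clean_udf(F.col(text_col)))
    return cleaned.filter(F.length(F.col(text_col)) >= min_chars)


# Usage (hypothetical file name):
# df = spark.read.csv("headlines.csv", header=True)
# df = preprocess(df, text_col="text", min_chars=10)
```

A plain Python function wrapped in a UDF is slower than built-in column functions, but it keeps the cleaning logic easy to test on its own.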

2. Just run the pretrained pipeline on your dataset:
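Running it is a single transform call. In this sketch I assume the pipeline reads its input from a column named "text", so any other column is renamed first; run_pipeline is just a helper name I made up.

```python
def run_pipeline(pipeline, df, text_col="text"):
    """Run a pretrained pipeline on df, renaming text_col to "text" if needed."""
    # Assumption: the first stage of the pipeline expects a column called "text".
    if text_col != "text":
        df = df.withColumnRenamed(text_col, "text")
    return pipeline.transform(df)


# Usage:
# result = run_pipeline(pipeline, df, text_col="headline")
```

Both a loaded PipelineModel and a downloaded PretrainedPipeline expose transform, so the same helper works for the offline and the online route.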

3. Print the schema and extract the result from the sentiment output column:
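A sketch of the extraction follows. The output column name "sentiment" is an assumption, so please confirm it in the printed schema first. Each SparkNLP output column is an array of annotation structs, and the predicted label sits in the result field of each struct.

```python
def extract_sentiment(result_df, sentiment_col="sentiment", text_col="text"):
    """Select the text and the first predicted label as a flat string column."""
    from pyspark.sql import functions as F  # lazy import, Spark is only needed here

    # "<col>.result" is an array of labels; take the first one.
    label = F.col(f"{sentiment_col}.result").getItem(0).alias("predicted")
    return result_df.select(text_col, label)


def first_label(annotations):
    """Driver-side equivalent for collected rows: first annotation's result, or None."""
    if not annotations:
        return None
    return annotations[0]["result"]


# Usage:
# result.printSchema()
# predictions = extract_sentiment(result)
# predictions.show(5, truncate=False)
```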

Well, it seems this pipeline did not work that well for my dataset: positive headlines are mostly predicted as positive, but negative and neutral headlines are both mostly predicted as negative.

You can use a different dataset (e.g. customer reviews, for more “natural” language :) ). You can use sklearn metrics to evaluate the results with a confusion matrix, and iterate by adding more cleansing, stopword removal, etc. to improve the output.
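To show what the evaluation looks like, here is a tiny dependency-free sketch that counts (true label, predicted label) pairs. The labels below are made-up toy values, not my results. With scikit-learn installed, sklearn.metrics.confusion_matrix does the same job and adds more metrics.

```python
from collections import Counter


def confusion_counts(y_true, y_pred):
    """Count how often each (true label, predicted label) pair occurs."""
    return Counter(zip(y_true, y_pred))


def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Toy labels for illustration only.
y_true = ["positive", "negative", "neutral", "positive"]
y_pred = ["positive", "negative", "negative", "negative"]

counts = confusion_counts(y_true, y_pred)
for (true_label, predicted), n in sorted(counts.items()):
    print(f"true={true_label:<8} predicted={predicted:<8} count={n}")
print("accuracy:", accuracy(y_true, y_pred))
```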

You might choose to build your own pipeline instead of using a prebuilt one. SparkNLP has a great number of models and annotators for you to try. You might use DocumentAssembler, Tokenizer, Normalizer, Lemmatizer and Word Embeddings (maybe plus Sentence Embeddings) to build a numerical representation of the text and train any classifier for your purpose. (If the dataset is unlabelled, you can apply K-means clustering and find the optimum K using the elbow method; please see the links at the end.)
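A custom pipeline along those lines could be wired up as in the sketch below. The column names, the default pretrained lemmatizer and word embeddings, and the average pooling are my assumptions; the small wiring check only verifies that every stage reads columns produced earlier.

```python
# (annotator, input columns, output column) for each stage, in order.
COLUMN_FLOW = [
    ("DocumentAssembler", ["text"], "document"),
    ("Tokenizer", ["document"], "token"),
    ("Normalizer", ["token"], "normalized"),
    ("LemmatizerModel", ["normalized"], "lemma"),
    ("WordEmbeddingsModel", ["document", "lemma"], "embeddings"),
    ("SentenceEmbeddings", ["document", "embeddings"], "sentence_embeddings"),
]


def wiring_is_valid(flow, source_cols=("text",)):
    """Check that each stage only reads columns that already exist."""
    available = set(source_cols)
    for _, inputs, output in flow:
        if not set(inputs) <= available:
            return False
        available.add(output)
    return True


def build_pipeline():
    """Assemble the stages listed in COLUMN_FLOW into a Spark ML Pipeline."""
    from pyspark.ml import Pipeline  # lazy imports, Spark is only needed here
    from sparknlp.base import DocumentAssembler
    from sparknlp.annotator import (
        LemmatizerModel, Normalizer, SentenceEmbeddings, Tokenizer, WordEmbeddingsModel,
    )

    document = DocumentAssembler().setInputCol("text").setOutputCol("document")
    token = Tokenizer().setInputCols(["document"]).setOutputCol("token")
    normal = Normalizer().setInputCols(["token"]).setOutputCol("normalized")
    # pretrained() with no arguments downloads the library's default English model.
    lemma = LemmatizerModel.pretrained().setInputCols(["normalized"]).setOutputCol("lemma")
    words = (WordEmbeddingsModel.pretrained()
             .setInputCols(["document", "lemma"]).setOutputCol("embeddings"))
    sentence = (SentenceEmbeddings()
                .setInputCols(["document", "embeddings"])
                .setOutputCol("sentence_embeddings")
                .setPoolingStrategy("AVERAGE"))
    return Pipeline(stages=[document, token, normal, lemma, words, sentence])


# Usage:
# model = build_pipeline().fit(df)
# features = model.transform(df)
```

The sentence_embeddings column can then be turned into feature vectors for whichever classifier or clustering method you choose.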

This story is short, but neither NLP nor sentiment analysis is as easy as it makes them look. I have to say this work covers maybe just the “a” of the ABC of NLP, with deep respect to the people who build their own embeddings and train their own deep learning networks… But again, I hope this story builds up some passion for NLP :)

Thank you for your time!

Some articles about text preprocessing, sentiment analysis and SparkNLP:
