Sentiment Analysis with SparkNLP — It Couldn’t Be Easier

A mini sentiment analysis project using the Spark NLP library

Published in Analytics Vidhya · 5 min read · May 17, 2020

https://www.youtube.com/watch?v=SPmxsRDSmTc

Natural Language Processing is an exciting field: breakthroughs arrive almost daily, and there is no limit to the ways we express ourselves. Sentiment analysis makes things even more complicated. If we are analysing customer feedback, a customer can give both positive and negative feedback in one sentence, can be sarcastic (!) :) and can misspell words, among other things. So what do we do?

Before I start, I’d like to mention that at the end of this story you’ll find some of the articles that I came across while learning about NLP. These are just baby steps and I am just a beginner in this field…

For sentiment analysis, if the data is labelled you’re lucky: you can use Bag of Words or embeddings to represent the text numerically and train a classifier to run predictions on your test data.
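To make the labelled case concrete, here is a minimal sketch that uses only the Python standard library. It counts words per class (a bag of words) and scores new text with a multinomial Naive Bayes rule. Naive Bayes is my choice for illustration, not something this story depends on, and the toy sentences and the function names `tokenize`, `train_naive_bayes` and `predict` are invented for the example.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace; a deliberately simple tokenizer."""
    return text.lower().split()


def train_naive_bayes(texts, labels):
    """Build per-class bag-of-words counts from labelled texts."""
    word_counts = defaultdict(Counter)  # label -> word frequencies
    class_counts = Counter(labels)      # label -> number of documents
    vocabulary = set()
    for text, label in zip(texts, labels):
        tokens = tokenize(text)
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return word_counts, class_counts, vocabulary


def predict(text, model):
    """Return the label with the highest log-probability (Laplace smoothing)."""
    word_counts, class_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label, doc_count in class_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(doc_count / total_docs)  # class prior
        for token in tokenize(text):
            if token not in vocabulary:
                continue  # ignore words never seen in training
            count = word_counts[label][token]
            score += math.log((count + 1) / (total_words + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical toy data, invented for illustration only.
train_texts = [
    "great product fast delivery",
    "really great service",
    "terrible quality broken item",
    "slow delivery terrible support",
]
train_labels = ["positive", "positive", "negative", "negative"]

model = train_naive_bayes(train_texts, train_labels)
prediction = predict("great service", model)
print(prediction)
```

On real data you would swap the word counts for TF-IDF or embeddings and the classifier for whatever your library offers; the shape of the workflow stays the same.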

If the data is not labelled, you’re lucky again, because good pre-trained embeddings are available to help you get quite good results using clustering.

However, if you do a little searching on Spark and NLP, you might have heard about Spark NLP. We have the pyspark.ml library, but I have to say that Spark NLP makes it easier than ever.

All you have to do is install Spark NLP:

Installation

Let’s create a new Conda environment to manage all the dependencies there. You can use Python Virtual Environment if…

nlp.johnsnowlabs.com

and start to play with it.

Dataset:

Please feel free to choose any other dataset; the steps are explained as generically as possible.

Title and Headline Sentiment Prediction

Predicting Sentiment Score for a Post’s Title and Headline.

www.kaggle.com

Environment: AWS EMR 5.29.0 with Spark 2.4.4, TensorFlow 1.14.0

(You can find requirements in https://github.com/JohnSnowLabs/spark-nlp/blob/master/README.md)

0. Import libraries and the pretrained sentiment analyzer model:

import sparknlp
from sparknlp.pretrained import PretrainedPipeline

There are two ways to load the pretrained pipeline:

a. Offline:

Download the analyze_sentimentdl_use_twitter zip from: https://github.com/JohnSnowLabs/spark-nlp-models

Unzip it and upload it to S3.

When you unzip it, I urge you to check the folders and see which stages are in the pipeline. :)

pipeline = PretrainedPipeline.from_disk(model_s3_location)

b. Online:

pipeline = PretrainedPipeline("analyze_sentimentdl_use_twitter", lang="en")

I added the Spark NLP jar manually. If you choose to do the same, please make sure you use the following settings when creating the SparkSession. Without them I got errors such as “Table not initialized” (setting the serializer solves this) and Java errors due to memory, and the Universal Sentence Encoder (one of the stages in this pipeline) caused errors when the cluster mode was “yarn”.

from pyspark.sql import SparkSession

# jar_location is the path to the Spark NLP jar you added manually
spark = SparkSession.builder.master("local[*]") \
    .config("spark.driver.memory", "12g") \
    .config("spark.kryoserializer.buffer.max", "2000M") \
    .config("spark.jars", jar_location) \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .config("spark.sql.broadcastTimeout", "360000") \
    .getOrCreate()

spark.sparkContext.addPyFile(jar_location)

1. Read the dataset into a Spark DataFrame and preprocess it if you wish:

a. Remove non-ASCII characters

b. Remove punctuation (punctuation might be useful in some cases!)

c. Remove texts shorter than a specific length

d. Correct spelling (you can use the Spark NLP ‘check_spelling_dl’ pipeline for context-preserving spell correction)
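Steps a to c can be sketched as one plain Python function. This is only an illustration under my own assumptions: the name `clean_text` and the `MIN_LENGTH` threshold are hypothetical, and spelling correction (step d) is left out.

```python
import re
import string

MIN_LENGTH = 10  # hypothetical threshold; tune it for your data


def clean_text(text, min_length=MIN_LENGTH):
    """Apply steps a-c; return None when the cleaned text is too short."""
    # a. drop every character outside the ASCII range
    text = text.encode("ascii", errors="ignore").decode("ascii")
    # b. strip punctuation
    text = text.translate(str.maketrans("", "", string.punctuation))
    # collapse the whitespace left behind by the removals
    text = re.sub(r"\s+", " ", text).strip()
    # c. discard texts shorter than the threshold
    return text if len(text) >= min_length else None


# "\u2013" is an en dash, a typical non-ASCII character in headlines
cleaned = clean_text("Stocks rally \u2013 investors cheer!!!")
too_short = clean_text("Wow!")
print(cleaned, too_short)
```

In Spark, one option would be to wrap such a function with `pyspark.sql.functions.udf` and filter out the rows where it returns null, although built-in column functions are usually faster than a Python UDF.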

2. Just run the pretrained pipeline on your dataset:

# rename the text column as 'text', pipeline expects 'text' inputCol
df_result = \
    pipeline.transform(df.withColumnRenamed("Headline", "text"))

3. Print the schema and extract the result from the sentiment output column:

# print the schema to inspect the pipeline's output columns
df_result.printSchema()

# extract results from the "sentiment" column
df_result \
    .selectExpr("text", "explode(sentiment) sentiments", "SentimentHeadline") \
    .selectExpr("text", "sentiments.result result", "SentimentHeadline") \
    .createOrReplaceTempView("result_tbl_")

# sentiment labels in training data are float, so we map them to
# categorical classes
spark.sql("""
    SELECT
        text,
        CASE WHEN SentimentHeadline > 0 THEN 'positive'
             WHEN SentimentHeadline < 0 THEN 'negative'
             ELSE 'neutral'
        END AS label,
        result
    FROM
        result_tbl_""").createOrReplaceTempView("result_tbl")

# below is optional, visualize predictions
# calculate normalized percentage of results per label
df_counts = spark.sql("""
    WITH counts_tbl AS
    (SELECT COUNT(*) AS label_count, label FROM result_tbl GROUP BY label)
    SELECT
        joined_.result,
        joined_.label,
        100*COUNT(*)/joined_.l_count AS normalized_percentage
    FROM
       (SELECT counts_tbl.label_count l_count, result_tbl.*
        FROM result_tbl
        JOIN counts_tbl
        ON counts_tbl.label = result_tbl.label) joined_
    GROUP BY joined_.result, joined_.label, joined_.l_count""")

# visualize results per label
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# convert the aggregated counts (df_counts) to pandas for plotting
p_Result = df_counts.toPandas()
p = sns.barplot(x="label", y="normalized_percentage", hue="result", data=p_Result)
plt.title('Pretrained(Twitter) Sentiment DL Model Results')
plt.savefig('analyze_sentiment.png')
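If the SQL is hard to follow, the same two ideas (mapping the float score to a class, then computing what percentage of each class received each prediction) can be walked through in plain Python. This is a sketch on invented rows, not the actual dataset; `to_label`, `normalized_percentages` and `sample_rows` are hypothetical names.

```python
from collections import Counter


def to_label(score):
    """Mirror the CASE WHEN expression: the sign of the score picks the class."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def normalized_percentages(rows):
    """rows: iterable of (score, predicted_result) pairs.

    Returns {(label, result): percentage of that label's rows with that result}.
    """
    pairs = [(to_label(score), result) for score, result in rows]
    label_counts = Counter(label for label, _ in pairs)  # like counts_tbl
    pair_counts = Counter(pairs)                         # like the GROUP BY
    return {
        (label, result): 100 * count / label_counts[label]
        for (label, result), count in pair_counts.items()
    }


# Hypothetical rows, invented for illustration: (SentimentHeadline, result)
sample_rows = [
    (0.4, "positive"),
    (0.1, "positive"),
    (0.2, "negative"),
    (0.3, "positive"),
    (-0.5, "negative"),
    (-0.2, "negative"),
    (0.0, "negative"),
    (0.0, "neutral"),
]
percentages = normalized_percentages(sample_rows)
for key in sorted(percentages):
    print(key, percentages[key])
```

Normalizing per label matters because the classes are rarely the same size; raw counts would make the largest class dominate the bar chart.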

Well, it seems this pipeline did not work all that well for my dataset, but it is clear that headlines labelled positive are mostly predicted positive, while headlines labelled negative or neutral are mostly predicted negative.

You can use a different dataset (e.g. customer reviews, for more “natural” language :) ). You can use sklearn metrics to evaluate the results with a confusion matrix, and iterate by adding more cleansing, stopword removal, etc. to improve the output.
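For a feel of what that evaluation produces, here is a dependency-free sketch of a confusion matrix and accuracy. The label lists are invented, and the helper names are my own; `sklearn.metrics.confusion_matrix` builds the same kind of table (rows are true classes, columns are predicted classes).

```python
from collections import Counter


def confusion_matrix(actual, predicted, classes):
    """Rows are true classes, columns are predicted classes."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in classes] for a in classes]


def accuracy(actual, predicted):
    """Fraction of predictions that match the true label."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)


# Hypothetical labels, invented for illustration only.
classes = ["negative", "neutral", "positive"]
actual = ["positive", "positive", "negative", "neutral", "negative", "positive"]
predicted = ["positive", "negative", "negative", "negative", "negative", "positive"]

matrix = confusion_matrix(actual, predicted, classes)
acc = accuracy(actual, predicted)
for name, row in zip(classes, matrix):
    print(name, row)
print(round(acc, 3))
```

Reading along a row shows where a class's examples ended up, which is often more informative than a single accuracy number when one class is rarely predicted.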

You might choose to build your own pipeline instead of using a prebuilt one. Spark NLP has a great number of models and annotators for you to try. You might use DocumentAssembler, Tokenizer, Normalizer, Lemmatizer and Word Embeddings (maybe plus Sentence Embeddings) to build a numerical representation of the text and train any classifier for your purpose. (If the dataset is unlabelled, you can apply K-means clustering and find the optimum K using the elbow method; please see the links at the end.)

This story is short, but neither NLP nor sentiment analysis is as easy as this story suggests. I need to say that this work covers maybe just the “a” of the abc of NLP, with deep respect to people who build their own embeddings and train their own deep learning networks… But again, I hope this story builds up some passion for NLP :)

Thank you for your time!

Some articles about text preprocessing, sentiment analysis and Spark NLP:

All you need to know about text preprocessing for NLP and Machine Learning — KDnuggets

By Kavita Ganesan, Data Scientist. Based on some recent conversations, I realized that text preprocessing is a severely…

www.kdnuggets.com

What the heck is Word Embedding

Looking at text data through the lens of Neural Nets

towardsdatascience.com

How to train custom Word Embeddings using GPU on AWS

Language is important. Humans use words to communicate, and they carry meaning. Can we train a machine to also learn…

towardsdatascience.com

Unsupervised Sentiment Analysis

How to extract sentiment from the data without any labels

towardsdatascience.com

1. Sentiment Analysis: TF-IDF

Explore and run machine learning code with Kaggle Notebooks | Using data from Bag of Words Meets Bags of Popcorn :)

www.kaggle.com

Bag of Words Meets Bags of Popcorn

Use Google’s Word2Vec for movie reviews

www.kaggle.com

Text Classification in Spark NLP with Bert and Universal Sentence Encoders

Training a SOTA multi-class text classifier with Bert and Universal Sentence Encoders in Spark NLP with just a few…

towardsdatascience.com

Written by Elif Pekcokguler