Simple Sentiment Analysis in Python: NYSK Dataset

Shraddha Anala
Published in Analytics Vidhya · 3 min read · May 11, 2020

This is an implementation of sentiment analysis using NLP techniques on the NYSK Dataset.

I’m expanding with more posts on ML concepts + tutorials over at my blog!


This article/tutorial has been brought to you by yours truly as part of her regular, random dataset challenge to improve, practice and expand her data science skills. She will also stop referring to herself in the third person after this sentence.

About the dataset:

The NYSK dataset, available on the UCI Machine Learning Repository, is a collection of news reports and articles concerning allegations of sexual assault against the former IMF Director, Dominique Strauss-Kahn.

Disclaimer: This code is only meant to implement sentiment analysis, not anything else. Besides, that’s the point of my random dataset challenge: I don’t know beforehand which dataset is going to pop up. And yes, this is a super professional disclaimer.

The data itself is an XML file that you’ll have to parse into a Python object, such as a Pandas DataFrame, suitable for analysis.

The root node of the XML document contains one child node per news item, with fields like the document ID, source, URL, title and summary, all of which will have to be extracted before any modelling takes place.

Therefore, I am using the xml.etree.ElementTree module to parse the file and pull out these fields.

The useful information is buried in child nodes such as docid, source, url, etc. I search for those nodes and store the text they contain in 6 variables.

The variables are then used to build each row of the DataFrame, called dataset in my example, in a single iteration.
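A minimal sketch of that parsing step looks like the following. The field names follow the NYSK schema; the small embedded sample is a stand-in for the real file, which you would load with ET.parse("nysk.xml") after downloading it from the repository.

```python
import xml.etree.ElementTree as ET
import pandas as pd

# Two stand-in documents; in practice you would call
# ET.parse("nysk.xml").getroot() on the downloaded dataset instead.
sample = """<root>
  <document>
    <docid>1</docid><source>Example News</source>
    <url>http://example.com/1</url><title>First headline</title>
    <summary>Short summary of the first article.</summary>
  </document>
  <document>
    <docid>2</docid><source>Example Wire</source>
    <url>http://example.com/2</url><title>Second headline</title>
    <summary>Short summary of the second article.</summary>
  </document>
</root>"""

root = ET.fromstring(sample)

rows = []
for doc in root:  # each <document> node is one news item
    # findtext returns the text content of the named child node
    rows.append({field: doc.findtext(field)
                 for field in ("docid", "source", "url", "title", "summary")})

dataset = pd.DataFrame(rows)
print(dataset.shape)  # (2, 5)
```

Each pass through the loop builds one row as a dictionary, and pd.DataFrame assembles them all at the end.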

Now the next step will usually involve cleaning up the textual data using some NLP techniques such as tokenization, lemmatization, tidying up special characters and punctuation, and removing stop words.

But using a fancy Python module called the Valence Aware Dictionary and sEntiment Reasoner, or VADER for short, I can eschew the data-preprocessing steps and building the word matrix later on, in favour of a simple implementation.

Moreover, this method gives me quantifiable sentiment analysis metrics, in terms of the positive, negative, neutral and compound scores calculated by the module.

Looking at the polarity scores for each summary, we can easily infer the tone of the article and whether any bias is present. A high neutral score implies that the article was written stating the facts and developments in the case, adopting a neutral stance with respect to either party.

An article skewed either positively or negatively probably suggests that it perhaps includes the personal opinions of the author and is more subjective.

Box Plot of the 3 Sentiment Categories

Here are some plots of the sentiment labels, to see how the distributions of the three sentiment categories, viz. Negative, Neutral and Positive, relate to the compound (i.e., normalized) score.

Violin Plot of the 3 Sentiment Labels

And here’s the code to plot these beautiful plots. You can play around with the colour maps and type of plots for more insights.
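A sketch of how such box and violin plots could be produced with seaborn follows; the scores DataFrame below is a hypothetical stand-in for the VADER output columns you would have at this point.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs headless
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical stand-in for the per-article VADER score columns.
scores = pd.DataFrame({
    "neg": [0.05, 0.20, 0.10],
    "neu": [0.80, 0.60, 0.75],
    "pos": [0.15, 0.20, 0.15],
    "compound": [0.40, -0.30, 0.10],
})

# Melt to long format: one row per (category, score) pair, which is
# the shape seaborn's categorical plots expect.
long = scores.melt(id_vars="compound", value_vars=["neg", "neu", "pos"],
                   var_name="category", value_name="score")

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
sns.boxplot(data=long, x="category", y="score", ax=axes[0])
sns.violinplot(data=long, x="category", y="score", ax=axes[1])
axes[0].set_title("Box plot of sentiment scores")
axes[1].set_title("Violin plot of sentiment scores")
fig.tight_layout()
fig.savefig("sentiment_plots.png")
```

Swapping the plot type or passing a palette argument to either call is an easy way to experiment with the look of the charts.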

If you thought this was it, well, for you it technically is. But while playing around with the above code, I came up with a new dataset that you will find very useful for sentiment analysis. You see, I added another column called ‘Sentiment’, which lists, duh, the sentiment of each article.
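One way to derive such a label is to bucket the compound score using the conventional VADER cut-offs of ±0.05 (the thresholds and the tiny DataFrame below are illustrative assumptions, not necessarily the exact rule used here):

```python
import pandas as pd

# Hypothetical compound scores standing in for the real VADER output.
dataset = pd.DataFrame({"compound": [0.62, -0.41, 0.01]})

def label(compound):
    """Map a compound score to a label using the usual +/-0.05 cut-offs."""
    if compound >= 0.05:
        return "Positive"
    if compound <= -0.05:
        return "Negative"
    return "Neutral"

dataset["Sentiment"] = dataset["compound"].apply(label)
print(dataset["Sentiment"].tolist())  # ['Positive', 'Negative', 'Neutral']
```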

You can use the above dataset to implement your own NLP processing model and further improve your skills.

Now, this is finally it. Hope you enjoyed reading my article/tutorial. Please leave any suggestions or comments for improvements, further clarification, etc. below.

Thank you very much for reading and I’ll see you soon.
