Navigating News: A Gentle Introduction to Sentiment Analysis. (Pt.1)

Ian Shin
Published in Analytics Vidhya · 6 min read · Jul 1, 2020

Objective

Use sentiment analysis to “score” news articles pertaining to a certain stock, collect those scores, and build a model to predict future stock movements.

Introduction

This article, and the others to come, will detail my journey in learning and implementing Sentiment Analysis. With this in mind, these articles will be informative, but they will also show my trials and errors. I feel this aspect of learning, understanding, and implementing a new field of study is often overlooked. In my opinion, explaining the hurdles and trials makes a topic less intimidating and easier to absorb. Hopefully you enjoy this article and gain some insight! The entirety of my progression will be broken up into three articles…

Part 1: General knowledge of Sentiment Analysis, and VADER

Part 2: TF-IDF

Part 3: Modelling and regression analysis.

What is Sentiment Analysis and what are its applications?

Sentiment Analysis (SA from here on) interprets and classifies text. SA falls under the umbrella of Natural Language Processing (NLP), a field of study broadly concerned with analyzing human language and deriving meaning from it.

SA is a powerful tool that allows machines to quantify the underlying sentiment of text. There are typically two approaches to Sentiment Analysis.

  1. Supervised Approach: Labeled training data is used to develop a model that classifies and labels “future” text.
  2. Unsupervised Approach: (Also known as Lexicon-Based) This approach does not require training data. Instead, it relies on the polarity of each word; in the case of multiple words, their collective polarity is considered. (We will explore this approach further in this article.)

Furthermore, Lexicon-Based SA is generally carried out with one of two methods…

  1. Dictionary-Based Methods: A lexicon dictionary is referenced to look up the polarity of each word (see the toy sketch after this list).
  2. Corpus-Based Methods: A corpus is used, and the polarities of other words are inferred from syntactic patterns and context.
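
To make the dictionary-based idea concrete, here is a toy sketch in Python. (The lexicon and its polarity values are made up purely for illustration!)

    # A made-up lexicon mapping words to polarity scores
    lexicon = {'great': 0.8, 'good': 0.5, 'bad': -0.5, 'terrible': -0.8}

    def score(text):
        # Collective polarity: sum the polarity of every word found in the lexicon
        return sum(lexicon.get(word, 0.0) for word in text.lower().split())

    print(score('The movie was great but the ending was bad'))  # ~0.3, mildly positive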

So… how is SA practical? The way we consume media has changed drastically. Media has become so intertwined with everyday life that many do not consider how expansive and powerful it truly is. Every political tweet, international tension, stock movement, or trade agreement can be accessed almost instantly, by everyone, on a global scale… The ability to transform these subjective texts into objective data serves as a powerful medium for almost all fields of study. Professionally, SA is applied to product research, measuring public opinion, and interpreting customer experience. We even see financial institutions using SA to interpret Fed minutes, or law firms using it to analyze statements.

NLTK and VADER

Natural Language Toolkit (NLTK) is one of the leading libraries for Natural Language Processing. It is open source, and offers numerous NLP tools, such as tokenization, stemming, etc… (It is important to note that there are other libraries, such as TextBlob, which offer similar features. I would suggest exploring those as well!)

VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon- and rule-based sentiment analysis tool. VADER is interesting because it considers both the polarity and the intensity of text. This means conventional use of punctuation (ex. “!”), word shapes (ex. using all caps for a word), and degree modifiers (ex. “very”, “kind of”) are considered when evaluating the sentiment of text. It can also evaluate emojis (UTF-8 encoded), slang words, and hip text language (ex. lol).

It is clear that VADER was tailored to handle sentiments conveyed on social media! Let’s run some sample code using VADER.

Fig 1: Sentiment Analysis via VADER on the statement “I LOVE food!”
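
For reference, a minimal sketch of this VADER call using NLTK’s bundled implementation:

    import nltk
    nltk.download('vader_lexicon')  # one-time download of the VADER lexicon

    from nltk.sentiment.vader import SentimentIntensityAnalyzer

    sia = SentimentIntensityAnalyzer()
    print(sia.polarity_scores('I LOVE food!'))
    # Prints a dict with 'neg', 'neu', 'pos' ratios and a 'compound' score;
    # the all-caps 'LOVE' and the '!' push the compound score strongly positive.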

Let’s take a look at our output. The first three columns represent “Negative”, “Neutral”, and “Positive” (respectively). Their values represent the ratios of the text that fall in each category, so they should sum to roughly 1 (roughly, because of floating-point rounding). It is important to note that standardizing thresholds for each class is typical and useful.

The Compound value is computed by summing the sentiment scores of each word in the lexicon, adjusting for the rules above, and then normalizing to the range [-1, 1], where -1 represents the most negative and 1 represents the most positive.
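
For the curious: the reference vaderSentiment implementation normalizes the raw summed score with x / sqrt(x² + α), where α = 15. A minimal sketch:

    import math

    def normalize(score, alpha=15):
        # Maps the raw summed valence into the range [-1, 1]
        return score / math.sqrt(score * score + alpha)

    print(normalize(3.2))  # a fairly positive raw score -> ~0.64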

Let’s try out Sentiment Analysis with some good ole’ web-scraping.

* I will be running all code on Google Colaboratory, a Python notebook environment from Google Research, tailor-made for machine learning and data analysis. It offers free computing resources like GPUs/TPUs (with some limitations), which gives everyone the ability to leverage specialty hardware for free. I would highly recommend it!

Web-Scraping with BS4 and Selenium

I will be web-scraping news titles related to Apple (AAPL) from The Motley Fool website.

For the worried data miners: web-scraping is, to a certain degree, legal. In 2019, the decision in hiQ Labs, Inc. v. LinkedIn Corp. held that automated scraping of publicly available data is legal! However, scraping sites that require authentication is not (ex. anything that requires a “log in” to access the data). With this in mind, scrape carefully and consciously!

These are the required packages for Web-Scraping!

If you receive a “ModuleNotFoundError”, simply write “!pip install selenium” into a notebook cell and run it!

The second section of the code is a method to circumvent a version mismatch (between Google Chrome and Chromium Browser) when using Selenium. We are simply updating every package to avoid this mismatch.
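
For reference, a Colab setup along these lines looked something like the sketch below at the time of writing. (The package names and the Selenium 3-era Chrome call are assumptions; the details may have drifted since.)

    # Colab cell: install Selenium plus a matching Chromium/chromedriver pair
    !pip install selenium
    !apt-get update
    !apt install -y chromium-chromedriver
    !cp /usr/lib/chromium-browser/chromedriver /usr/bin

    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument('--headless')               # Colab has no display
    options.add_argument('--no-sandbox')
    options.add_argument('--disable-dev-shm-usage')
    driver = webdriver.Chrome('chromedriver', options=options)  # Selenium 3 API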

Selenium is perfect for handling dynamic pages with interactive content. It is most often used for testing web applications; however, it is also used for mimicking/automating human behavior (such as button clicks).

The Motley Fool web page for stock news. We can use Selenium to emulate a button click.

By combining Selenium with BeautifulSoup, we can scrape virtually any website!

Let’s take a look at the web-scraping code!

The code for web-scraping
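
In outline, the scraping logic looks something like the sketch below, continuing from the driver set up earlier. (The URL and CSS selectors here are hypothetical placeholders; The Motley Fool’s markup will almost certainly differ!)

    import time
    from bs4 import BeautifulSoup

    driver.get('https://www.fool.com/quote/nasdaq/apple/aapl/')  # placeholder URL

    # Emulate clicking the 'load more' button until ~50 articles are rendered
    for _ in range(5):
        try:
            driver.find_element_by_css_selector('button.load-more').click()
            time.sleep(2)  # give the dynamic content time to render
        except Exception:
            break

    soup = BeautifulSoup(driver.page_source, 'html.parser')

    titles, bylines = [], []
    for article in soup.find_all('article'):
        headline = article.find('h4')
        byline = article.find('div', class_='author-and-date')  # hypothetical class
        if headline and byline:
            titles.append(headline.get_text(strip=True))
            bylines.append(byline.get_text(strip=True))

    print(len(titles))  # aiming for ~50 headlines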

Notice that we scrape two elements per article: the title, and a byline containing the author and date. This gives us the titles, authors, and dates of numerous (50, to be precise) news articles. Let’s look at the output!

The dates will be critical if you want to build a time series analysis. The authors are a less crucial value, and can be removed with slice notation, as sketched below.
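
For example, assuming each byline string looks like “Author Name | Jul 1, 2020” (a hypothetical format), slicing off everything up to the separator keeps only the date:

    # Hypothetical byline format: 'Author Name | Jul 1, 2020';
    # slice off everything up to (and including) the separator
    dates = [b[b.index('|') + 1:].strip() for b in bylines]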

Let’s run our sentiment analysis on each of our article titles.

Packages required for our Sentiment analysis
VADER Code
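
In essence, the VADER step boils down to a sketch like this (assuming the titles list from the scraping step):

    import pandas as pd
    from nltk.sentiment.vader import SentimentIntensityAnalyzer

    sia = SentimentIntensityAnalyzer()

    # Score every scraped headline and tabulate the four VADER fields
    df = pd.DataFrame([sia.polarity_scores(t) for t in titles])
    df.insert(0, 'title', titles)
    print(df.head())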

Fairly simple! Let’s take a look at the output.

Output from VADER

It worked! However, notice that many articles have a “compound score” of 0. This indicates that the article title read as purely neutral, and no underlying sentiment could be determined.

VADER is a fantastic tool; however, when it comes to analyzing news articles it falls short. My line of thinking, to utilize a Lexicon-Based approach, was sound. In my next article, I will instead use a popular statistical method known as Term Frequency-Inverse Document Frequency (TF-IDF), in the hope of producing more fruitful outputs!

Definitions

Lexicon: “A lexicon is a collection of information about the words of a language about the lexical categories to which they belong. A lexicon is usually structured as a collection of lexical entries, like (“pig” N V ADJ). “pig” is familiar as a N, but also occurs as a verb (“Jane pigged herself on pizza”) and an adjective, in the phrase “pig iron”, for example.”

source: http://www.cse.unsw.edu.au/~billw/nlpdict.html#firstC

Corpus: This can simply be thought of as a large collection of text. (Corpus is Latin for “body”.)

Tokenization: A step in the pre-processing stage of NLP which “breaks down” a string of words into fragments (formally known as tokens). For example, the sentence “Tokenization is critical for SA” would be converted into… [“Tokenization”, “is”, “critical”, “for”, “SA”]

Stemming: The process of reducing a word to its “root” (known as a stem; the closely related process of lemmatization reduces a word to its lemma). For example, the word “playing” would be transformed into “play”.
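
A quick sketch of both steps with NLTK:

    import nltk
    nltk.download('punkt')  # tokenizer models

    from nltk.tokenize import word_tokenize
    from nltk.stem import PorterStemmer

    print(word_tokenize('Tokenization is critical for SA'))
    # ['Tokenization', 'is', 'critical', 'for', 'SA']

    stemmer = PorterStemmer()
    print(stemmer.stem('playing'))  # 'play'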
