An Introduction To Sentiment Analysis

Jacopo Mattei
Published in Pills of BSDSA
Apr 15, 2024

With the current rise in popularity of Machine Learning, the number of its uses is growing rapidly.

One very intriguing field of Machine Learning with widespread use is Sentiment Analysis.

Sentiment Analysis is the process of analyzing and classifying texts based on what emotional tone they have: positive, negative, or neutral.

It is applied in several areas. Product reviews are the main one, but it is also used for the stock market and for political campaigns: predictions of election results can be based on the sentiment of voters towards the candidates, and the future performance of a firm’s stock is strongly connected to its public perception.

How does it work?

Sentiment Analysis utilizes Natural Language Processing and Machine Learning to classify the polarity of the emotion expressed in a text.

There are two main approaches to Sentiment Analysis:

  • Rule-based approach
  • Machine-learning approach

RULE-BASED APPROACH

This approach is the more traditional one and involves the use of NLP techniques like lexicons, tokenization and lemmatization.

To start, lists of words that express specific emotions, named lexicons, are created.

The text is then pre-processed to reach a state that the machine can understand.

This process involves lemmatization, which is the practice of reducing a word to its lemma (its dictionary form). For example, “is”, “are” and “being” all share the same lemma, “be”.

Tokenization is the procedure of separating larger texts into smaller parts, down to single words.

Another important process is the removal of stopwords, which are common words (such as “the” or “of”) that carry little to no significant value for the meaning of the text.
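The three pre-processing steps can be sketched in a few lines of plain Python. This is only a minimal illustration: the stopword set and the lemma table below are tiny, hypothetical stand-ins for the much larger resources a real library would provide.

```python
import re

# Hypothetical, deliberately tiny resources for illustration only.
# "be" is listed as a stopword so that lemmatized forms of "to be" are dropped.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "it", "this", "be"}
LEMMAS = {"is": "be", "are": "be", "being": "be", "was": "be", "loved": "love"}

def tokenize(text):
    """Split a text into lowercase word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def lemmatize(tokens):
    """Replace each token with its lemma when the toy table knows it."""
    return [LEMMAS.get(tok, tok) for tok in tokens]

def remove_stopwords(tokens):
    """Drop tokens that appear in the stopword set."""
    return [tok for tok in tokens if tok not in STOPWORDS]

def preprocess(text):
    """Tokenize, lemmatize, then remove stopwords."""
    return remove_stopwords(lemmatize(tokenize(text)))

print(preprocess("The actors are great and I loved the ending"))
# ['actors', 'great', 'i', 'love', 'ending']
```

Lemmatization runs before stopword removal here so that “is” and “are” are first mapped to “be” and then filtered out together.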

Every word in the text is then checked against the lexicons and the overall polarity is evaluated, usually on a scale either from -100 to 100 or from 0% to 100%, with -100 and 0% being the most negative and 100 and 100% the most positive.
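As a rough sketch of the scoring step, the snippet below averages the polarity of the lexicon words found in a text and rescales the result to the -100 to 100 range. The lexicon and the averaging rule are assumptions made for this example; real rule-based tools use far richer lexicons and scoring rules.

```python
import re

# Hypothetical mini-lexicon: +1 for positive words, -1 for negative ones.
LEXICON = {"good": 1, "great": 1, "love": 1, "bad": -1, "terrible": -1, "hate": -1}

def polarity(text):
    """Return a score in [-100, 100]; 0 when no lexicon word is found."""
    tokens = re.findall(r"[a-z']+", text.lower())
    hits = [LEXICON[tok] for tok in tokens if tok in LEXICON]
    if not hits:
        return 0.0
    return 100.0 * sum(hits) / len(hits)

print(polarity("I love it"))            # 100.0
print(polarity("I hate it"))            # -100.0
print(polarity("This is not good"))     # 100.0, the negation is ignored
```

The last call shows the kind of mistake a purely word-by-word approach makes when a negation is involved.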

The main weakness of the rule-based approach is that it doesn’t consider the sentence as a whole: complex negations, metaphors and peculiar idioms could be misinterpreted by the machine.

MACHINE LEARNING APPROACH

Before a text can be evaluated it needs to be pre-processed, much like in the case of the rule-based approach.

The techniques of lemmatization, tokenization and stopword removal are used, with the addition of vectorization, which is the transformation of text into vectors of numbers.

The most common vectorization methods are the bag of words and the bag of n-grams, which count how many times a word or a sequence of n consecutive words, respectively, appears in a text.
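Both counting schemes can be illustrated with the standard library alone. The function names below are made up for this sketch; with n=1 the function produces a bag of words, with n=2 a bag of bigrams.

```python
from collections import Counter

def bag_of_ngrams(tokens, n=1):
    """Count every run of n consecutive tokens; n=1 gives a bag of words."""
    grams = [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return Counter(grams)

def to_vector(counts, vocabulary):
    """Turn counts into a fixed-length list that follows the vocabulary order."""
    return [counts.get(term, 0) for term in vocabulary]

tokens = "good food and good service".split()
unigrams = bag_of_ngrams(tokens, n=1)
bigrams = bag_of_ngrams(tokens, n=2)

vocabulary = sorted(unigrams)             # ['and', 'food', 'good', 'service']
print(to_vector(unigrams, vocabulary))    # [1, 1, 2, 1]
print(bigrams["good service"])            # 1
```

The vocabulary fixes the position of each word in the vector, so every text is mapped to a list of the same length regardless of how many words it contains.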

The algorithm is fed training data to learn to associate future input data with a correct level of polarity.

The new text is then finally given to the algorithm to be evaluated.

Some of the most common classification models use Naive Bayes, Logistic Regression, Linear Regression and Deep Learning.

NAIVE BAYES

Based on Bayes’ Theorem, this type of classification calculates the probability of each label for a particular text and then assigns the label with the highest probability.

It is called Naive because it operates on the assumption that each word is independent of the other words in the text, which is quite unrealistic.

Bayes’ Theorem: P(A | B) = P(B | A) · P(A) / P(B)
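A small from-scratch sketch may help make the idea concrete. It assumes a multinomial model with add-one (Laplace) smoothing, which is one common choice rather than the only one, and the four training sentences are invented for the example.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(documents):
    """documents: list of (tokens, label) pairs. Returns the learned counts."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for tokens, label in documents:
        label_counts[label] += 1
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return label_counts, word_counts, vocabulary

def predict(tokens, model):
    """Pick the label with the highest log-posterior."""
    label_counts, word_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, doc_count in label_counts.items():
        score = math.log(doc_count / total_docs)          # log prior
        total_words = sum(word_counts[label].values())
        for tok in tokens:
            # "Naive" step: every word contributes independently.
            # Add-one smoothing keeps unseen words from zeroing the product.
            count = word_counts[label][tok] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Invented toy training data.
training = [
    ("great product love it".split(), "positive"),
    ("really good quality".split(), "positive"),
    ("terrible product hate it".split(), "negative"),
    ("bad quality broke quickly".split(), "negative"),
]
model = train_naive_bayes(training)
print(predict("love the quality".split(), model))   # positive
print(predict("hate it".split(), model))            # negative
```

Working with sums of logarithms instead of products of probabilities avoids numerical underflow on longer texts.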

LOGISTIC REGRESSION

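This classification algorithm uses the sigmoid function to turn a weighted sum of the independent variables into a value between 0 and 1, which can be read as the probability of the positive class.

The outcome of the sigmoid function is then mapped to a binary value by a threshold (typically 0.5): values above it are labelled 1 (positive) and values below it 0 (negative).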

Sigmoid function: σ(x) = 1 / (1 + e^(−x))
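The prediction step of a logistic regression classifier can be sketched as below. The weights and the bias are hypothetical values chosen by hand for the example; in practice they would be learned from labelled training data.

```python
import math

def sigmoid(z):
    """Squash any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical per-word weights, standing in for learned parameters.
WEIGHTS = {"great": 2.0, "love": 1.5, "bad": -2.0, "broken": -1.8}
BIAS = 0.0

def predict_proba(tokens):
    """Probability that the text is positive."""
    z = BIAS + sum(WEIGHTS.get(tok, 0.0) for tok in tokens)
    return sigmoid(z)

def classify(tokens, threshold=0.5):
    """Map the probability to 1 (positive) or 0 (negative)."""
    return 1 if predict_proba(tokens) >= threshold else 0

print(sigmoid(0))                      # 0.5
print(classify(["great", "love"]))     # 1
print(classify(["bad", "broken"]))     # 0
```

A text with no known words gets a weighted sum of zero, so its probability is exactly 0.5 and the choice of threshold decides its label.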

LINEAR REGRESSION

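The objective of a linear model is to find a line or plane (a hyperplane, in more than two dimensions) that separates texts by the sentiment they express.

A well-known example of such a linear model is the Support Vector Machine (SVM), which, strictly speaking, is a classifier rather than a regression model.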

This model plots data as points in a multi-dimensional space, and then divides them into two groups through either a straight boundary or, when a kernel is used, a non-linear curve.

This boundary is found by maximizing its distance to the closest data point of each group, a distance known as the margin.

The two groups obtained can then be assigned to the positive and negative polarity of the written work.
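To make the idea of a separating line more tangible, here is a minimal sketch of a linear SVM trained with sub-gradient descent on the hinge loss. This training procedure, the hyperparameters and the two-dimensional toy features are all assumptions made for the example; libraries typically rely on more sophisticated solvers.

```python
def train_linear_svm(points, labels, epochs=200, lr=0.1, reg=0.01):
    """Sub-gradient descent on the regularized hinge loss; labels are +1 or -1."""
    w = [0.0] * len(points[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(points, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:
                # Point is misclassified or inside the margin: move the boundary.
                w = [wi - lr * (reg * wi - y * xi) for wi, xi in zip(w, x)]
                b += lr * y
            else:
                # Point is safely classified: only shrink w, which widens the margin.
                w = [wi - lr * reg * wi for wi in w]
    return w, b

def classify(x, w, b):
    """Return +1 (positive) or -1 (negative) depending on the side of the boundary."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1

# Hypothetical features: (number of positive words, number of negative words).
points = [(3, 0), (2, 1), (4, 1), (0, 3), (1, 2), (1, 4)]
labels = [1, 1, 1, -1, -1, -1]
w, b = train_linear_svm(points, labels)

print(classify((5, 0), w, b))   # 1
print(classify((0, 5), w, b))   # -1
```

The condition `margin < 1` is what distinguishes this from a plain perceptron: points that are correctly classified but too close to the boundary still push it away.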

DEEP LEARNING

In this case, a neural network performs multiple layers of processing.

Deep learning includes a vast set of algorithms that loosely imitate human learning by building increasingly abstract representations of the input. It has the benefit of being able to capture the context of a sentence and, to some extent, even the mood of the author of the text.

Another advantage of Deep Learning is that the neural network adjusts its own parameters during training to reduce its errors, while other approaches more often require human intervention, such as rewriting rules or redesigning features, to correct mistakes.

Despite the notable recent improvements in the field of Sentiment Analysis, there are still some challenges that it struggles with:

  • Subjectivity: most models struggle to tell apart the meanings a word can take depending on the context and the intent of the author. For example, if a product review says “This product is small”, a model may classify the comment as negative, while it is actually positive if the author wanted a small product and is pointing out a desirable feature.
  • Context: a text written in response to a direct question loses its meaning when separated from it. A response that lists some aspects of the product is positive if the question asked for its useful features, but negative if the question asked for its flaws.
  • Irony and sarcasm: humor, irony and sarcasm can be very challenging for ML models to identify.

The problem with sarcasm is the use of positive words to express negative opinions. It is often dependent on the context in which a review is found and what is expected from the product.

For example, a text that a model may interpret as positive while being strongly sarcastic is this portion of a letter sent by Arthur Hicks to LIAT Caribbean Airlines:

“Most other airlines I have travelled on would simply wish to take me from point A to B in rather a hurry. I was intrigued that we were allowed to stop at not a lowly one or two but a magnificent six airports yesterday”. (Grenoble)

To understand the sarcasm implied in this text, additional knowledge is required, specifically knowledge of what is usually expected of airline companies, which is fast transport.

  • Idioms: an idiom is a language-specific sequence of words whose meaning belongs to the expression as a whole and cannot be derived from its individual words.

Since many sentiment analysis models analyze words one by one, the same sequence can take on a completely different connotation.

For example, “Doesn’t float my boat” is an idiom that implies a dislike for the object, but that message may not be captured by sentiment analysis.

  • Negation: if negation is not accounted for when training the model, the negation of a positive opinion may still be classified as positive and, vice versa, the negation of a negative opinion as negative.

Double negation, where two negations cancel each other out, needs to be handled during training as well.
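For a rule-based scorer, a very simple way to account for negation is to flip the polarity of the sentiment word that follows a negator, and to flip it back if a second negator appears. The sketch below uses hypothetical word lists and is only meant to illustrate the idea.

```python
import re

# Hypothetical word lists for illustration only.
LEXICON = {"good": 1, "great": 1, "bad": -1, "terrible": -1}
NEGATORS = {"not", "no", "never"}

def score_with_negation(text):
    """Sum lexicon scores, flipping the sign once for every preceding negator."""
    tokens = re.findall(r"[a-z]+", text.lower())
    total, flip = 0, 1
    for tok in tokens:
        if tok in NEGATORS:
            flip = -flip              # a second negator cancels the first
        elif tok in LEXICON:
            total += flip * LEXICON[tok]
            flip = 1                  # the negation is used up by this word
    return total

print(score_with_negation("The food was good"))            # 1
print(score_with_negation("The food was not good"))        # -1
print(score_with_negation("The food was never not good"))  # 1
```

This rule is deliberately crude: a negator keeps its effect until the next sentiment word, however far away that word is, so long sentences can still be misread.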

  • Human error: relying on human beings to label texts as positive or negative may introduce biases into the training data, whether deliberate or accidental, for instance because annotators find it hard to interpret texts written by different authors.

This issue can be mitigated by measuring inter-annotator agreement, that is, how similarly two different annotators label the polarity of the same texts.
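One widely used agreement measure is Cohen’s kappa, which compares the observed agreement between two annotators with the agreement expected by chance. The choice of kappa is an assumption made for this sketch, as are the example labels.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Agreement between two annotators, corrected for chance agreement."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty set")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability that both pick the same label independently.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one and the same label throughout
    return (observed - expected) / (1 - expected)

# Invented labels from two annotators for the same six texts.
annotator_1 = ["pos", "pos", "neg", "neg", "pos", "neg"]
annotator_2 = ["pos", "neg", "neg", "neg", "pos", "pos"]
print(round(cohens_kappa(annotator_1, annotator_2), 3))   # 0.333
```

A kappa of 1 means perfect agreement and a kappa of 0 means agreement no better than chance, so low values suggest that the labelling guidelines need to be clarified before the data is used for training.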

Sentiment analysis is an ever-expanding field that now has a variety of applications in product reviews, brand monitoring, HR and market research.

To get started on Sentiment Analysis, many Python libraries can be useful:

  • NLTK, a popular library with an integrated sentiment analyzer (VADER);
  • PyTorch, a very popular and easy-to-use ML library;
  • spaCy, which allows people to build their own SA classifier.

Work Cited

Grenoble, Ryan. “LIAT, Caribbean Airline, Receives Hilarious Complaint Letter From Passenger Arthur Hicks.” HuffPost, 1 July 2013, https://www.huffpost.com/entry/liat-complaint-letter-arthur-hicks-branson_n_3529385.
