Top 5 Techniques for Sentiment Analysis in Natural Language Processing

Syed Huma Shah
Published in ILLUMINATION
12 min read, Dec 30, 2023

Understanding different approaches to sentiment analysis

Sentiment analysis is a common problem in natural language processing. Many techniques can be used for it, and in this article we will explore a few of the most popular and effective ones. Before that, let’s understand what exactly sentiment analysis is.

Sentiment analysis, also known as opinion mining, is a natural language processing technique that is used to analyze the sentiment or emotional tone of a piece of text. It is the task of extracting subjective information from textual data and is commonly used to identify the overall sentiment of social media posts, product reviews, news articles, and other types of written or spoken language.

In general, sentiment analysis involves using machine learning algorithms to classify text as positive, negative, or neutral in sentiment. This can be done by training a model on a large dataset of annotated text, where each piece of text has been labeled as positive, negative, or neutral by a human annotator. Once the model has been trained, it can be used to classify new pieces of text in the same way.

Sentiment analysis can be used for a variety of applications, such as identifying customer sentiment towards a product or brand, tracking the sentiment of social media conversations about a particular topic, or analyzing the sentiment of news articles about a particular event or issue. It can also be used to identify trends and patterns in sentiment over time, which can be useful for businesses and organizations seeking to understand how their products or services are perceived by the public.

In this article, we will cover five methods of sentiment analysis. For each one, we give an overview of the approach, illustrative examples, implementation code, and a critical examination of its advantages and disadvantages.

The code in this article can be found on GitHub.

1. Lexicon-Based Approaches

One simple yet effective approach to sentiment analysis is to use a pre-defined list of words and their associated sentiment, known as a “lexicon.” Each word in the lexicon is typically associated with a specific sentiment, such as positive or negative, and the overall sentiment of a text is determined by counting the positive and negative words it contains and comparing the two counts.

For example, a lexicon-based approach to sentiment analysis might involve creating a list of positive words like “love,” “happy,” and “exciting,” and a list of negative words like “hate,” “sad,” and “boring.” Then, to determine the sentiment of a piece of text, the algorithm would count the number of positive and negative words it contains and compare them. If there are more positive words than negative words, the text would be classified as having a positive sentiment. If there are more negative words than positive words, it would be classified as having a negative sentiment. If the number of positive and negative words is the same, the text would be classified as having a neutral sentiment.
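The counting procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not a production classifier: the two word lists are the six example words from the paragraph, and the function name `classify_by_counts` and the punctuation-stripping step are assumptions made for the sketch.

```python
# Tiny hand-picked word lists taken from the example above (illustration only).
POSITIVE_WORDS = {"love", "happy", "exciting"}
NEGATIVE_WORDS = {"hate", "sad", "boring"}


def classify_by_counts(text):
    """Classify text by comparing counts of positive and negative words."""
    # Split on whitespace, strip surrounding punctuation, and lowercase.
    words = [w.strip(".,!?;:\"'()").lower() for w in text.split()]
    positives = sum(w in POSITIVE_WORDS for w in words)
    negatives = sum(w in NEGATIVE_WORDS for w in words)
    if positives > negatives:
        return "positive"
    if negatives > positives:
        return "negative"
    return "neutral"  # equal counts, including zero matches


print(classify_by_counts("I love this exciting but sad movie!"))  # positive (2 vs 1)
print(classify_by_counts("What a boring, sad ending."))           # negative (0 vs 2)
print(classify_by_counts("I love the cast but hate the plot."))   # neutral (1 vs 1)
```

Note that a text with no lexicon words at all also comes out as neutral, which is one reason lexicon coverage matters.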

One advantage of lexicon-based approaches is that they are relatively simple and easy to implement. However, they can be limited in their accuracy because they do not take into account the context in which words are used, and they may not be able to accurately classify words that have multiple meanings or that are used in unconventional ways.

To perform sentiment analysis using a lexicon, we first tokenize the input text into individual words. We then look up each word in the lexicon and assign it a sentiment score. The overall sentiment of the text can be calculated by summing the sentiment scores of all the words, or by taking the average.
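To make the tokenize, look up, and aggregate steps concrete, here is a small sketch that uses scored words instead of plain word lists. The lexicon `TOY_LEXICON` and its values are invented for illustration (they are not taken from any published lexicon), and averaging over matched words only, rather than over all tokens, is a design choice of this sketch.

```python
import re

# Hypothetical mini-lexicon on a -5..5 scale; the values are made up.
TOY_LEXICON = {"amazing": 4, "good": 3, "bad": -3, "terrible": -4, "boring": -2}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def score_text(text, lexicon=TOY_LEXICON):
    """Return (sum of scores, average score over matched words)."""
    scores = [lexicon[t] for t in tokenize(text) if t in lexicon]
    total = sum(scores)
    average = total / len(scores) if scores else 0.0
    return total, average


total, average = score_text(
    "The plot was good, the acting amazing, but the ending was terrible."
)
print(total, average)  # 3 1.0  (scores 3, 4 and -4)
```

The sum grows with text length, so the average is usually the better choice when comparing texts of different sizes.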

One popular lexicon for English is the AFINN lexicon, which contains a list of 2,477 English words and phrases and their associated sentiment scores. The scores range from -5 (very negative) to +5 (very positive).

Here is an example of how to use the AFINN lexicon to perform sentiment analysis in Python, using the afinn package (pip install afinn):

from afinn import Afinn

# Load the AFINN lexicon (English by default)
afinn = Afinn()

input_text = 'This movie was amazing!'

# score() tokenizes the text and sums the scores of the words it finds
afinn_score = afinn.score(input_text)

Pros:

  • Simple and easy to implement
  • Can be relatively accurate for identifying basic sentiment (positive, negative, or neutral)
  • Can be used to identify the overall sentiment of a piece of text quickly and efficiently

Cons:

  • Limited in accuracy because they do not take into account the context in which words are used
  • May not be able to accurately classify words that have multiple meanings or that are used in unconventional ways
  • Can be biased if the lexicon is not balanced or if it does not contain a diverse range of words
  • May not be able to identify more subtle or nuanced sentiments, such as sarcasm or irony

2. Machine Learning-Based Approaches

Another approach to sentiment analysis is to use machine learning techniques to automatically learn the sentiment of text data. This is a more complex and time-consuming approach, but it can often lead to more accurate results, especially for large datasets.

Machine learning-based approaches use algorithms trained on labeled data to classify text as positive, negative, or neutral.

To perform sentiment analysis using machine learning, we first need to prepare a labeled training dataset. This dataset should consist of text data that has been manually labeled as positive, negative, or neutral.

Next, we can use this training dataset to train a machine learning model to classify the sentiment of new, unseen text data. Many different types of machine learning models can be used for this task, such as Naive Bayes, logistic regression, support vector machines (SVMs), and deep learning models.

For example, Naive Bayes is a probabilistic algorithm that makes classifications based on the probability of a given input belonging to each class. In the case of sentiment analysis, the algorithm would calculate the probability of a given input (such as a tweet or a product review) belonging to the class of positive, negative, or neutral sentiment. The input would be classified based on the class with the highest probability.
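As a rough illustration of "pick the class with the highest probability," here is a from-scratch multinomial Naive Bayes sketch in plain Python. The class name `NaiveBayesSentiment`, the whitespace tokenization, the add-one (Laplace) smoothing, and the six training sentences are all assumptions made for the example; in practice a library implementation and a much larger dataset would be used.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesSentiment:
    """Minimal multinomial Naive Bayes with add-alpha smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.class_counts = Counter()            # documents per class
        self.word_counts = defaultdict(Counter)  # word frequencies per class
        self.vocab = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            self.class_counts[label] += 1
            for word in text.lower().split():
                self.word_counts[label][word] += 1
                self.vocab.add(word)
        return self

    def log_scores(self, text):
        """Return log P(class) + sum of log P(word | class) for each class."""
        total_docs = sum(self.class_counts.values())
        scores = {}
        for label, doc_count in self.class_counts.items():
            log_prob = math.log(doc_count / total_docs)
            denom = sum(self.word_counts[label].values()) + self.alpha * len(self.vocab)
            for word in text.lower().split():
                if word in self.vocab:  # words never seen in training are skipped
                    count = self.word_counts[label][word]
                    log_prob += math.log((count + self.alpha) / denom)
            scores[label] = log_prob
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Made-up training data, far too small for real use.
texts = [
    "great movie loved it", "wonderful acting great plot",
    "terrible movie hated it", "awful plot boring acting",
    "the movie is two hours long", "it opens on friday",
]
labels = ["positive", "positive", "negative", "negative", "neutral", "neutral"]

model = NaiveBayesSentiment().fit(texts, labels)
print(model.predict("great acting"))  # positive
print(model.predict("boring plot"))   # negative
```

Working in log space avoids numerical underflow when many small probabilities are multiplied together.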

An example of a more complex machine learning algorithm for sentiment analysis is the Recurrent Neural Network (RNN) or its variant, the Long Short-Term Memory (LSTM) network.

RNNs and LSTMs are neural networks that are designed to process sequential data, such as text. They work by processing the input text one word at a time and using the context of the previous words to make a prediction about the sentiment of the text. LSTMs are a variant of RNNs that are designed to handle long-term dependencies in the data, which makes them particularly well-suited for sentiment analysis.
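The "one word at a time, carrying context forward" idea can be sketched with the forward pass of a plain RNN cell. Everything here is an assumption made for illustration: the five-word vocabulary, the layer sizes, and the random, untrained weights, so the output probability is meaningless until the network is trained. Biases and the LSTM gating mechanism are left out to keep the sketch short.

```python
import math
import random

random.seed(0)  # fixed seed so the untrained weights are reproducible

VOCAB = {"the": 0, "movie": 1, "was": 2, "not": 3, "good": 4}
EMBED_SIZE = 3
HIDDEN_SIZE = 4


def rand_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


embeddings = rand_matrix(len(VOCAB), EMBED_SIZE)  # one vector per word
W_xh = rand_matrix(HIDDEN_SIZE, EMBED_SIZE)       # input -> hidden weights
W_hh = rand_matrix(HIDDEN_SIZE, HIDDEN_SIZE)      # hidden -> hidden weights
w_out = [random.uniform(-0.5, 0.5) for _ in range(HIDDEN_SIZE)]


def rnn_step(x, h_prev):
    """One recurrent step: h_t = tanh(W_xh x_t + W_hh h_{t-1})."""
    return [
        math.tanh(
            sum(W_xh[i][j] * x[j] for j in range(EMBED_SIZE))
            + sum(W_hh[i][k] * h_prev[k] for k in range(HIDDEN_SIZE))
        )
        for i in range(HIDDEN_SIZE)
    ]


def rnn_forward(words):
    """Read the words in order and return (final hidden state, P(positive))."""
    h = [0.0] * HIDDEN_SIZE
    for word in words:
        h = rnn_step(embeddings[VOCAB[word]], h)  # h summarizes all words so far
    logit = sum(w * v for w, v in zip(w_out, h))
    return h, 1.0 / (1.0 + math.exp(-logit))


hidden, prob_positive = rnn_forward(["the", "movie", "was", "not", "good"])
print(hidden, prob_positive)
```

Because each hidden state depends on the previous one, the same words in a different order produce a different final state, which is what lets a trained network distinguish "not good" from "good, not bad."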

RNNs and LSTMs are complex algorithms that require a lot of computational resources to train and can be difficult to interpret. However, given enough training data they can achieve high accuracy on sentiment analysis tasks, and they cope better than word-counting methods with idiomatic expressions and negations, and to some extent with sarcasm.

Machine learning-based approaches are able to learn from large amounts of data and can classify text as positive, negative, or neutral with good accuracy. They can also learn to handle idiomatic expressions, negations, and some cases of sarcasm, which are often difficult for traditional rule-based approaches. However, machine learning-based approaches may require more computational resources and labeled data than rule-based approaches, and the internal workings of the models can be difficult to interpret and understand.

Here is an example of how to perform sentiment analysis using a logistic regression model in Python:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
# Load the training dataset
df = pd.read_csv('training_data.csv')
# Split the dataset into features (X) and labels (y)
X = df['text']
y = df['sentiment']
# Use CountVectorizer to convert the text data into a numerical feature matrix
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(X)
# Train a logistic regression model
model = LogisticRegression()
model.fit(X, y)
# Predict the sentiment of new, unseen text
input_text = 'This movie was amazing!'
input_features = vectorizer.transform([input_text])
prediction = model.predict(input_features)[0]

Pros:

  1. High accuracy: Machine learning algorithms are able to learn from large amounts of data and can accurately classify text as positive, negative, or neutral.
  2. Handling complex data: Machine learning algorithms can handle complex data such as idiomatic expressions, sarcasm, and negations, which are often difficult for traditional rule-based approaches to handle.
  3. Handling new data: Machine learning algorithms are able to generalize to new data and can continue to improve over time with new training data.

Cons:

  1. High cost: Training machine learning models requires large amounts of labeled data and computational resources, which can be expensive.
  2. Lack of interpretability: The internal workings of machine learning models can be difficult to interpret, making it hard to understand why a model is making certain predictions.
  3. Bias in data: Machine learning algorithms can learn and perpetuate biases present in the training data, which can lead to inaccurate or unfair predictions.

3. Rule-Based Approaches

Rule-based approaches to sentiment analysis involve defining a set of rules or heuristics to identify the sentiment of text data. You might define a rule that says any text containing the word “love” is positive, while any text containing the word “hate” is negative.

For example, a rule-based approach might use a list of positive and negative words and phrases, and then count how many of each appear in a text to determine the overall sentiment. If the number of positive words is greater than the number of negative words, the text would be classified as positive; otherwise, it would be classified as negative.

As another example, a rule-based approach could use grammatical and orthographic cues, such as negation words, punctuation, and capitalization, to classify the text as positive, negative, or neutral.

Rule-based approaches are relatively simple to implement and can be easily customized for specific use cases by defining rules that are specific to that domain. They are also easy to interpret, which is beneficial for understanding how the model is making predictions. However, rule-based approaches are limited to the specific rules that are defined, and may not be able to handle complex data or new cases that are not covered by the rules. It is difficult to anticipate and account for all the different ways that people express sentiment in natural language using rules alone, so these approaches may not be as accurate as machine learning-based ones.

CODE:

def get_sentiment(text):
    if 'love' in text.lower():
        return 'positive'
    elif 'hate' in text.lower():
        return 'negative'
    else:
        return 'neutral'

input_text = 'I love this movie!'
sentiment = get_sentiment(input_text)
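The grammatical cues mentioned earlier (negation words, punctuation, and capitalization) can be layered on top of a keyword rule. The sketch below is one possible set of rules, not a standard: the word lists, the 1.5 and 1.25 multipliers, and the function name `rule_based_sentiment` are all arbitrary choices made for illustration.

```python
import re

POSITIVE = {"love", "great", "good"}
NEGATIVE = {"hate", "awful", "bad"}
NEGATIONS = {"not", "never", "no"}


def rule_based_sentiment(text):
    """Return (label, score) using keyword, negation, caps and '!' rules."""
    tokens = re.findall(r"[A-Za-z']+", text)
    score = 0.0
    for i, token in enumerate(tokens):
        word = token.lower()
        if word in POSITIVE:
            value = 1.0
        elif word in NEGATIVE:
            value = -1.0
        else:
            continue
        # Rule 1: a negation word directly before the term flips its polarity.
        if i > 0 and tokens[i - 1].lower() in NEGATIONS:
            value = -value
        # Rule 2: a term written in all capitals counts as emphasized.
        if token.isupper() and len(token) > 1:
            value *= 1.5
        score += value
    # Rule 3: an exclamation mark amplifies whatever sentiment was found.
    if "!" in text:
        score *= 1.25
    if score > 0:
        return "positive", score
    if score < 0:
        return "negative", score
    return "neutral", score


print(rule_based_sentiment("I love this movie!"))  # ('positive', 1.25)
print(rule_based_sentiment("This is not good."))   # ('negative', -1.0)
print(rule_based_sentiment("I HATE waiting"))      # ('negative', -1.5)
```

Each rule is easy to read and adjust, but the negation rule only looks one word back, so "not very good" would still be scored as positive. Gaps like this are typical of hand-written rules.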

Pros:

  1. Easy to interpret: Rule-based approaches use a set of pre-defined rules to classify text, making it easy to understand how the model is making predictions.
  2. Fast: Rule-based approaches are typically faster than machine learning-based approaches because they don’t require training on large amounts of data.
  3. Handling specific use cases: Rule-based approaches can be tailored to specific use cases by defining rules that are specific to that domain.

Cons:

  1. Limited flexibility: Rule-based approaches are limited to the specific rules that are defined, and may not be able to handle complex data or new cases that are not covered by the rules.
  2. Limited accuracy: Rule-based approaches may not be as accurate as machine learning-based approaches, especially when dealing with idiomatic expressions, sarcasm, and negations.
  3. Difficult to maintain: Rule-based approaches can be difficult to maintain as the rules may need to be updated or changed as new data and use cases arise.

4. Hybrid Approaches

Hybrid approaches to sentiment analysis are methods that combine multiple techniques to determine the sentiment expressed in a text.

For example, a hybrid approach might use a rule-based approach to identify sentiment-bearing words and phrases in a text, and then use a machine learning-based approach to classify the text based on the identified words and phrases.

As another example, you might use a lexicon-based approach to identify the overall sentiment of a text, and then fall back on a machine learning-based approach for texts whose words the lexicon does not cover.

The idea behind hybrid approaches is to combine the strengths of different techniques to improve the accuracy and robustness of the sentiment analysis.

Hybrid approaches can also be used to handle different types of texts, like short texts, long texts, and social media texts, where different techniques might work better.

In general, hybrid approaches can be more accurate than traditional approaches because they can combine multiple techniques to capture different aspects of sentiment in a text. However, they can also be more complex to implement and maintain.

CODE:

from afinn import Afinn  # pip install afinn
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Load the training dataset
df = pd.read_csv('training_data.csv')
# Split the dataset into features (X) and labels (y)
X = df['text']
y = df['sentiment']
# Use CountVectorizer to convert the text data into a numerical feature matrix
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(X)
# Train a logistic regression model
model = LogisticRegression()
model.fit(X, y)

input_text = 'This movie was amazing!'
# Calculate the overall sentiment score using the AFINN lexicon
afinn_score = Afinn().score(input_text)
# If the AFINN score is 0, use the machine learning model to classify the sentiment
if afinn_score == 0:
    input_features = vectorizer.transform([input_text])
    prediction = model.predict(input_features)[0]
# Otherwise, use the sign of the AFINN score as the overall sentiment
elif afinn_score > 0:
    prediction = 'positive'
else:
    prediction = 'negative'
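The other hybrid design described earlier, where a rule-based stage picks out sentiment-bearing words and a learned model makes the final decision, can be sketched as follows. To keep the example self-contained it uses a small logistic regression trained by stochastic gradient descent in plain Python. The word lists, the six training sentences, the learning rate, and all function names are assumptions for illustration only.

```python
import math

POSITIVE = {"love", "great", "good", "amazing"}
NEGATIVE = {"hate", "awful", "bad", "boring"}


def lexicon_features(text):
    """Rule-based stage: count the sentiment-bearing words in the text."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    return [sum(w in POSITIVE for w in words), sum(w in NEGATIVE for w in words)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(features, labels, lr=0.5, epochs=200):
    """Learned stage: fit logistic regression weights with plain SGD."""
    weights, bias = [0.0] * len(features[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            error = sigmoid(sum(w * v for w, v in zip(weights, x)) + bias) - y
            weights = [w - lr * error * v for w, v in zip(weights, x)]
            bias -= lr * error
    return weights, bias


def predict(text, weights, bias):
    z = sum(w * v for w, v in zip(weights, lexicon_features(text))) + bias
    return "positive" if sigmoid(z) >= 0.5 else "negative"


# Made-up training data: 1 = positive, 0 = negative.
train_texts = [
    "I love it, great film", "Amazing and good", "Good cast",
    "I hate it, awful film", "Boring and bad", "Bad script",
]
train_labels = [1, 1, 1, 0, 0, 0]

weights, bias = train_logistic([lexicon_features(t) for t in train_texts], train_labels)
print(predict("I love this great movie", weights, bias))
print(predict("I hate this awful movie", weights, bias))
```

Here the model learns how much weight each kind of lexicon hit should carry instead of treating every hit as plus or minus one. In a fuller system, these lexicon features would typically be combined with bag-of-words or embedding features.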

Pros:

  1. High accuracy: Hybrid approaches combine the strengths of both rule-based and machine learning-based approaches, resulting in high accuracy.
  2. Handling complex data: Hybrid approaches can handle complex data such as idiomatic expressions, sarcasm, and negations by combining the flexibility of machine learning with the interpretability of rule-based approaches.
  3. Handling new data: Hybrid approaches can learn and adapt to new data and use cases by incorporating machine learning models that can continue to improve over time.

Cons:

  1. High cost: Hybrid approaches may require more computational resources and labeled data than rule-based approaches.
  2. Complexity: Hybrid approaches can be complex and difficult to implement and maintain, requiring expertise in both rule-based and machine learning-based approaches.
  3. Lack of interpretability: Despite being more interpretable than pure machine learning, hybrid approaches can still be difficult to understand and interpret, especially when involving complex models.

5. Context-Dependent Approaches

Finally, it’s important to note that the sentiment of a word or phrase often depends on the context in which it is used. Context-dependent approaches to sentiment analysis take this surrounding context into account when determining the sentiment expressed in a text. For example, the word “good” is clearly positive when used to describe a person (“She’s a good friend”), but it carries little sentiment of its own when it acts as an intensifier (“It took a good three hours”), and it can appear in a negative statement (“It’s good weather for a storm” or “This is not good”).

To account for this context dependence, some sentiment analysis approaches use techniques like part-of-speech tagging or dependency parsing to identify the role that each word plays in the sentence. This can help the model to better understand the intended sentiment of the text.

CODE:

import spacy

# Requires the model: python -m spacy download en_core_web_md
nlp = spacy.load('en_core_web_md')

# Small illustrative word lists (lemmas); use a full lexicon in practice
positive_words = {'good', 'great', 'love', 'enjoy'}
negative_words = {'bad', 'terrible', 'hate', 'bore'}

input_text = 'I loved the acting, but the plot was terrible.'

# Tokenize the input text and perform part-of-speech tagging
doc = nlp(input_text)
# Iterate over the tokens and score only adjectives and verbs, using their lemmas
sentiment_score = 0
for token in doc:
    if token.pos_ in ['ADJ', 'VERB']:
        if token.lemma_ in positive_words:
            sentiment_score += 1
        elif token.lemma_ in negative_words:
            sentiment_score -= 1
# Calculate the overall sentiment of the text
if sentiment_score > 0:
    sentiment = 'positive'
elif sentiment_score < 0:
    sentiment = 'negative'
else:
    sentiment = 'neutral'
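Dependency parsing, the other technique mentioned earlier, makes it possible to check whether a sentiment word falls under a negation. So that the sketch runs without a parser, the parses below are written by hand in the style of a dependency parser's output. The `Token` tuple, the dependency labels, and the word lists are assumptions for illustration, and a real parser may label these sentences differently.

```python
from collections import namedtuple

# head is the index of the token that governs this one (the root points to itself).
Token = namedtuple("Token", "text lemma pos dep head")

# Hand-written parse of "The food was not good" (an assumption, not parser output).
NEGATED = [
    Token("The", "the", "DET", "det", 1),
    Token("food", "food", "NOUN", "nsubj", 2),
    Token("was", "be", "AUX", "ROOT", 2),
    Token("not", "not", "PART", "neg", 2),
    Token("good", "good", "ADJ", "acomp", 2),
]

# Hand-written parse of "The food was good".
PLAIN = [
    Token("The", "the", "DET", "det", 1),
    Token("food", "food", "NOUN", "nsubj", 2),
    Token("was", "be", "AUX", "ROOT", 2),
    Token("good", "good", "ADJ", "acomp", 2),
]

POSITIVE_LEMMAS = {"good", "great", "love"}
NEGATIVE_LEMMAS = {"bad", "awful", "hate"}


def is_negated(index, tokens):
    """True if the token, or the token governing it, has a 'neg' dependent."""
    targets = {index, tokens[index].head}
    return any(t.dep == "neg" and t.head in targets for t in tokens)


def contextual_score(tokens):
    """Score adjectives and verbs, flipping polarity under negation."""
    score = 0
    for i, tok in enumerate(tokens):
        if tok.pos not in ("ADJ", "VERB"):
            continue
        polarity = (tok.lemma in POSITIVE_LEMMAS) - (tok.lemma in NEGATIVE_LEMMAS)
        if polarity and is_negated(i, tokens):
            polarity = -polarity
        score += polarity
    return score


print(contextual_score(PLAIN))    # 1
print(contextual_score(NEGATED))  # -1
```

Unlike a fixed "previous word" rule, this check follows the sentence structure, so the negation is found even when other words sit between "not" and the sentiment word.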

Pros:

  1. Improved accuracy: Context-dependent approaches take into account the context in which a text is written, allowing for a more accurate assessment of sentiment.
  2. Handling idiomatic expressions, sarcasm, and negations: By considering the context in which a text is written, context-dependent approaches can handle idiomatic expressions, sarcasm, and negations that can be difficult to interpret with traditional approaches.
  3. Handling new data: Context-dependent approaches can adapt to new data and use cases by considering the context in which a text is written.

Cons:

  1. High cost: Context-dependent approaches may require more computational resources and labeled data than traditional approaches.
  2. Complexity: Context-dependent approaches can be complex to implement and maintain, requiring expertise in natural language processing and machine learning.
  3. Lack of interpretability: Despite being more interpretable than pure machine learning, context-dependent approaches can still be difficult to understand and interpret, especially when involving complex models.
  4. Dependency on external factors: Context-dependent approaches might not work well when the context is not clear or not available.

Thank you for reading this article! If you enjoyed the content and would like to stay in the loop on future explorations into technology, AI, and beyond, please follow me on LinkedIn.

On my LinkedIn profile, I regularly delve into topics lying at the intersection of AI, technology, data science, personal development, and philosophy.

I’d love to connect and continue the conversation with you there.

Syed Huma Shah
Senior Machine Learning Engineer | Applying AI to solve real-world problems