Sentiment Analysis Using Natural Language Processing (NLP)
Extracting insight from text using NLP
Sentiment analysis, also known as opinion mining, is a natural language processing (NLP) technique used to identify and extract the sentiments or opinions expressed in text data. The primary objective of sentiment analysis is to determine whether the sentiment conveyed by a text is positive, negative, or neutral.
Agenda
- How sentiment analysis works
- Everyday use cases for sentiment analysis
- Challenges in sentiment analysis
- Description of Natural Language Processing (NLP) techniques
- Developing a sentiment analysis machine learning model
- Conclusion
How sentiment analysis works
Sentiment analysis is a process that involves analyzing textual data, such as social media posts, product reviews, customer feedback, or news articles, to classify the sentiment expressed in the text. The sentiment falls into three categories: positive (a favorable opinion or satisfaction), negative (dissatisfaction, criticism, or unfavorable views), and neutral (no particular sentiment, or an unclear one).
Before analyzing the text, some preprocessing steps usually need to be performed. These include tokenization, breaking the text into smaller units like words or phrases, removing stop words such as common words like “and,” “the,” and so on, and stemming or lemmatization, which involves reducing words to their base or root form. At a minimum, the data must be cleaned to ensure the tokens are usable and trustworthy.
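As a rough sketch, here is what those steps might look like in plain Python plus NLTK's PorterStemmer (illustrative only; the model we build later leans on scikit-learn's built-in tokenization and stop-word handling instead, and the stop-word list below is a tiny hand-picked one):
import re
from nltk.stem import PorterStemmer  # assumes NLTK is installed

text = "The packaging was damaged, but the products inside were working perfectly."

# Tokenization: split the lowercased text into word-level tokens
tokens = re.findall(r'[a-z]+', text.lower())

# Stop-word removal: drop common words (a tiny, hand-picked stop list for illustration)
stop_words = {'the', 'was', 'but', 'were', 'and', 'a', 'is'}
tokens = [t for t in tokens if t not in stop_words]

# Stemming: reduce words to their root form (e.g. 'packaging' -> 'packag')
stemmer = PorterStemmer()
print([stemmer.stem(t) for t in tokens])
>>>
['packag', 'damag', 'product', 'insid', 'work', 'perfectli']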
The strategies also vary in complexity. In order of increasing complexity:
- Lexicon-Based Methods: Using dictionaries or lists of terms and their associated sentiment scores to determine overall sentiment. Consider a list of terms closely associated with positive sentiment within a domain; mapping those terms onto a body of text yields a final classification (a minimal sketch follows this list).
- Machine Learning and Deep Learning: One approach to classify sentiments is to use supervised learning algorithms or neural networks. These methods rely on pre-labeled data to accurately categorize different emotions or opinions.
- Hybrid Approaches: Combining multiple methods to improve accuracy, like machine learning models and lexicon-based analysis.
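To make the lexicon-based idea concrete, here is a minimal sketch built around a tiny hand-made lexicon; the terms and scores are purely illustrative, not a real sentiment resource:
# Tiny, hand-made lexicon mapping terms to sentiment scores (illustrative only)
lexicon = {'awesome': 2, 'great': 1, 'good': 1, 'poor': -1, 'terrible': -2, 'broken': -2}

def lexicon_sentiment(text):
    # Sum the scores of known terms and map the total to a label
    score = sum(lexicon.get(word, 0) for word in text.lower().split())
    if score > 0:
        return 'Positive'
    if score < 0:
        return 'Negative'
    return 'Neutral'

print(lexicon_sentiment('The battery life is great but the case feels poor'))  # score 0 -> 'Neutral'
print(lexicon_sentiment('Awesome gadget, works great'))                        # score 3 -> 'Positive'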
Sentiment analysis has multiple applications, including understanding customer opinions, analyzing public sentiment, identifying trends, assessing financial news, and analyzing feedback.
Challenges in sentiment analysis
It can be challenging for computers to understand human language completely. They struggle with interpreting sarcasm, idiomatic expressions, and implied sentiments. Despite these challenges, sentiment analysis is continually progressing with more advanced algorithms and models that can better capture the complexities of human sentiment in written text.
Description of Natural Language Processing (NLP) techniques
Natural Language Processing (NLP) models are a branch of artificial intelligence that enables computers to understand, interpret, and generate human language. These models are designed to handle the complexities of natural language, allowing machines to perform tasks like language translation, sentiment analysis, summarization, question answering, and more. NLP models have evolved significantly in recent years due to advancements in deep learning and access to large datasets. They continue to improve in their ability to understand context, nuances, and subtleties in human language, making them invaluable across numerous industries and applications.
There are various types of NLP models, each with its approach and complexity, including rule-based, machine learning, deep learning, and language models.
Developing a sentiment analysis machine learning model
Our objective is to train a machine learning model to predict the sentiment of reviews. Our corpus will consist of Amazon product reviews.
Data
The data used for this task is the Amazon reviews dataset, which consists of reviews from Amazon customers downloaded from Xiang Zhang's Google Drive directory [1]. The dataset spans 18 years and includes ~35 million reviews up to March 2013. Reviews include product and user information, ratings, and a plaintext review. For more information, please refer to the paper Hidden Factors and Hidden Topics: Understanding Rating Dimensions with Review Text [2].
The Amazon reviews dataset is constructed by treating reviews with scores of 1 and 2 as negative and reviews with scores of 4 and 5 as positive; samples with a score of 3 are ignored. In the dataset, class 1 is negative and class 2 is positive. Each class has 1,800,000 training samples and 200,000 testing samples.
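The downloaded files already carry the class 1/2 labels, but if you were constructing this kind of polarity dataset yourself from raw star ratings, the mapping might look like this sketch (the ratings and review snippets below are made up):
import pandas as pd

# Hypothetical raw reviews with 1-5 star ratings (illustrative data)
raw = pd.DataFrame({
    'stars': [1, 2, 3, 4, 5],
    'review': ['awful', 'not great', 'okay', 'pretty good', 'excellent']
})

# Scores 1-2 -> class 1 (negative), 4-5 -> class 2 (positive), 3 is dropped
labeled = raw[raw.stars != 3].copy()
labeled['sentiment'] = (labeled.stars >= 4).map({False: 1, True: 2})
print(labeled[['sentiment', 'review']])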
Preprocessing
To prepare our data for model training, we need to convert our text data into features that our model will use to train and cast future predictions. We’ll use two preprocessing steps:
- Count vectorizing text
- Tf-idf weighting
Count vectorization is a technique in NLP that converts text documents into a matrix of token counts. Tokens can be words, characters, or n-grams. Each token represents a column in the matrix, and the resulting vector for each document has counts for each token.
Here’s an example of how we transform the text into features for our model. The corpus of words represents the collection of text in raw form we collected to train our model[3].
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document',
    'This document is the second document',
    'and this is the third one',
    'is this the first document'
]
vec = CountVectorizer().fit(corpus)
vec.get_feature_names()  # use get_feature_names_out() on scikit-learn >= 1.2
>>>
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
vec.transform(corpus).toarray()
>>>
array([[0, 1, 1, 1, 0, 0, 1, 0, 1],
       [0, 2, 0, 1, 0, 1, 1, 0, 1],
       [1, 0, 0, 1, 1, 0, 1, 1, 1],
       [0, 1, 1, 1, 0, 0, 1, 0, 1]])
The purpose of using tf-idf instead of simply counting the frequency of a token in a document is to reduce the influence of tokens that appear very frequently in a given collection of documents. These tokens are less informative than those appearing in only a small fraction of the corpus. Scaling down the impact of these frequently occurring tokens helps improve text-based machine-learning models’ accuracy.
Here’s an example of our corpus transformed using the tf-idf preprocessor[3].
from sklearn.feature_extraction.text import TfidfTransformer

vectorized = vec.transform(corpus).toarray()
tfid = TfidfTransformer().fit(vectorized)
tfid.transform(vectorized).toarray()
>>>
array([[0.00, 0.46, 0.58, 0.38, 0.00, 0.00, 0.38, 0.00, 0.38],
       [0.00, 0.68, 0.00, 0.28, 0.00, 0.53, 0.28, 0.00, 0.28],
       [0.51, 0.00, 0.00, 0.26, 0.51, 0.00, 0.26, 0.51, 0.26],
       [0.00, 0.46, 0.58, 0.38, 0.00, 0.00, 0.38, 0.00, 0.38]])
Classification algorithm
For this project, we will use the logistic regression algorithm to discriminate between positive and negative reviews. Logistic regression is a statistical method for binary classification: it predicts the probability of a categorical outcome with two possible values. To learn more about logistic regression, read my other article here.
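As a bare-bones illustration of the idea (not the scikit-learn implementation we use below), logistic regression takes a weighted sum of the input features and passes it through the logistic (sigmoid) function to produce a probability between 0 and 1. The weights and feature values here are made up:
import numpy as np

def sigmoid(z):
    # Logistic function: maps any real-valued score to a probability in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical learned weights for three tf-idf features, plus a bias term
weights = np.array([2.1, -1.8, 0.4])
bias = -0.2

x = np.array([0.58, 0.0, 0.38])   # tf-idf vector for one review (made-up values)
probability_positive = sigmoid(x @ weights + bias)
print(probability_positive)        # ~0.76 -> classified as 'Positive'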
Constructing our model pipeline
from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
import pandas as pd
import re
data = pd.read_csv(
'/PATH-TO-DATA/train.csv',
names=['sentiment', 'title', 'review']
)
# Access the corpus and target variables
X = data.review
y = data.sentiment.replace({1:'Negative', 2:'Positive'})
# train test splitting
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)
# Strip everything except lowercase letters and spaces (numbers, punctuation, etc.)
preprocessor = lambda text: re.sub(r'[^a-z ]', '', text.lower())
# construct the pipeline with the procedural steps to
# process the data and cast predictions
pipe = Pipeline([
('vec', CountVectorizer(stop_words='english', min_df=1000, preprocessor=preprocessor)),
('tfid', TfidfTransformer()),
('lr', SGDClassifier(loss='log_loss'))  # logistic regression; on scikit-learn < 1.1 use loss='log'
])
# fit the model to the data
model = pipe.fit(X_train, y_train)
Evaluation
To evaluate our model's ability to predict sentiment on the test dataset, we first generate predictions with the trained model on 'X_test' and store them in the 'y_test_pred' variable. We then create a classification report and review the results. The report shows that our model has an 84% accuracy rate and performs equally well on positive and negative sentiments.
# predict sentiment on the test data frame
y_test_pred = model.predict(X_test)
# create the classification report
report = classification_report(y_test, y_test_pred)
print(report)
>>>
              precision    recall  f1-score   support

    Negative       0.84      0.84      0.84    360052
    Positive       0.84      0.84      0.84    359948

    accuracy                           0.84    720000
   macro avg       0.84      0.84      0.84    720000
weighted avg       0.84      0.84      0.84    720000
That model seems acceptable based on the performance metrics. However, we can probe it further with a few specific cases. We'll create a data frame with three test cases: one positive, one negative, and one neutral. Then we'll cast predictions and compare the results against what we expect.
test = {
'This gadget is awesome':'Positive',
'This gadget is terrible':'Negative',
'This gadget':'Neutral'
}
predictions = [[text, expected, model.predict([text])[0]] for text, expected in test.items()]
pd.DataFrame(
predictions,
columns=['Test case', 'Expected', 'Prediction']
)
Why was the third test case predicted 'Positive'? Recall that the model was only trained to predict 'Positive' and 'Negative' sentiments; it has no 'Neutral' class for cases like this. But can we see how positive or negative the test case was? Yes: we can inspect the predicted probability from our model to determine whether the prediction leaned more positive or negative.
Here are the probabilities projected on a horizontal bar chart for each of our test cases. Notice that the positive and negative test cases have a high and low predicted probability of the 'Positive' class, respectively, while the neutral test case sits in the middle of the probability range. We can therefore use the probabilities to define a tolerance interval for classifying neutral sentiments.
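For reference, here is a sketch of how those probabilities could be computed and charted, assuming the fitted pipeline from above plus matplotlib; the 0.35-0.65 neutral band is an arbitrary choice for illustration, not a tuned threshold:
import matplotlib.pyplot as plt

test_cases = ['This gadget is awesome', 'This gadget is terrible', 'This gadget']

# Probability of the 'Positive' class for each test case
# (available because the SGDClassifier uses a logistic loss)
positive_idx = list(model.classes_).index('Positive')
probs = model.predict_proba(test_cases)[:, positive_idx]

# Horizontal bar chart of the predicted probabilities
plt.barh(test_cases, probs)
plt.axvline(0.5, linestyle='--')
plt.xlabel("Predicted probability of 'Positive'")
plt.tight_layout()
plt.show()

# One way to carve out a 'Neutral' band (thresholds are arbitrary here)
labels = ['Neutral' if 0.35 <= p <= 0.65 else ('Positive' if p > 0.65 else 'Negative') for p in probs]
print(list(zip(test_cases, labels)))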
The model could improve a bit, and I may write a tuning guide as a follow-up.
Conclusion
Sentiment analysis is a technique used in NLP to identify sentiments in text data. NLP models enable computers to understand, interpret, and generate human language, making them invaluable across numerous industries and applications. Advancements in AI and access to large datasets have significantly improved NLP models’ ability to understand human language context, nuances, and subtleties.
References
[1] Xiang Zhang’s Google Drive directory, https://drive.google.com/drive/folders/0Bz8a_Dbh9Qhbfll6bVpmNUtUcFdjYmF2SEpmZUZUcVNiMUw1TWN6RDV3a0JHT3kxLVhVR2M?resourcekey=0-TLwzfR2O-D2aPitmn5o9VQ
[2] J. McAuley and J. Leskovec. Hidden Factors and Hidden Topics: Understanding Rating Dimensions with Review Text, https://cs.stanford.edu/people/jure/pubs/reviews-recsys13.pdf
[3] Feature extraction from text, scikit-learn documentation, https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction