Fact vs Opinion Classification
Motivation :
We are living in the era of technology: every second, petabytes of data are generated. Everyone with access to the internet is both a data generator and a data consumer. As a result, it has become very difficult to manually analyze everything that is published on the internet, and many organizations take advantage of this to push highly opinionated and biased information to the public.
We can solve this problem without much manual work with the help of Natural Language Processing and Machine Learning: these tools let us automatically analyze any type of data in a very short time!
Problem Statement :
Since it is very difficult to manually classify text as fact or opinion, in this project we propose a model that automatically categorizes a given text as a fact or an opinion.
Note : Since machine learning models require labelled data, we chose the movies domain, where the required data is easy to find. The same methodology can be applied to any other domain.
Data set description :
Assumption : the main plot of a movie is treated as Fact, and reviews of the movie are treated as Opinions. Part of the data we used is freely available on Kaggle and can be downloaded from here. We also hand-annotated most of the reviews to make the data more balanced.
The data set contains 94,379 text samples, of which nearly 50,000 are Opinions (reviews) and the remaining 44,379 are Facts (movie plots).
Data set visualization :
Sample rows in the data set :
A word cloud is plotted for each class to better understand its word distribution.
As the data for both classes comes from the same domain, there is some overlap of words between the classes. This makes the task more interesting: the machine learning model has to find distinguishing features at a lower level than surface words.
Distribution of Lengths across facts and opinions :
- We can see that some samples contain more than 500 words.
- We need to determine an optimal length per sample that keeps only the most important features.
- This optimal length has to be found by trial and error.
Preprocessing of Data samples :
- Tokenisation and stemming
- Stop-word removal
- Case conversion
- Removal of non-alphanumeric words
- Removal of words shorter than 3 characters
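As an illustration, the steps above can be sketched in a few lines of Python. This is a minimal version: the stop-word list is a tiny stand-in for a full list (e.g. NLTK's), and stemming is omitted for brevity.

```python
import re

# Tiny illustrative stop-word list; the real pipeline used a full list (e.g. NLTK's)
# and also applied stemming, which is omitted here for brevity.
STOP_WORDS = {"the", "a", "an", "is", "was", "and", "of", "in", "it", "this"}

def preprocess(text):
    text = text.lower()                      # case conversion
    tokens = re.findall(r"[a-z0-9]+", text)  # tokenisation, drops non-alphanumeric characters
    return [t for t in tokens if t not in STOP_WORDS and len(t) >= 3]

print(preprocess("The plot of this movie was thrilling!"))  # → ['plot', 'movie', 'thrilling']
```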
Data after Preprocessing :
We made a 60:10:30 split for training, validation, and testing.
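The split can be sketched with two calls to scikit-learn's train_test_split: first hold out 30% for testing, then take 1/7 of the remainder (i.e. 10% of the whole) for validation.

```python
from sklearn.model_selection import train_test_split

# Illustrative data; the real project used the 94,379 labelled samples.
texts = [f"sample {i}" for i in range(100)]
labels = [i % 2 for i in range(100)]

# 30% test first, then 1/7 of the remaining 70% (= 10% overall) for validation.
X_rest, X_test, y_rest, y_test = train_test_split(texts, labels, test_size=0.30, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=1/7, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # → 60 10 30
```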
Word Embeddings :
The next step after preprocessing is converting the data into a machine-understandable form, i.e. numbers. For this we need to map every word in the data to some numeric representation. This can be done in many ways, including :
- Bag of Words (BOW)
- Term Frequency-Inverse Document Frequency (TF-IDF)
- Index (rank method)
The first two are very popular in classical machine learning approaches, while the last one, index embedding, is generally used in deep learning approaches.
In our project we used the first two embeddings with all the machine learning models, and the last one with the LSTM model.
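The index (rank) method used for the LSTM simply maps each word to its frequency rank. A minimal sketch follows; reserving index 0 for padding and 1 for out-of-vocabulary words is a common convention and an assumption here, not necessarily the exact scheme used in the project.

```python
from collections import Counter

corpus = [["movie", "plot", "movie"], ["great", "movie"]]

# Rank words by frequency; 0 is reserved for padding, 1 for out-of-vocabulary words.
counts = Counter(w for doc in corpus for w in doc)
index = {w: i + 2 for i, (w, _) in enumerate(counts.most_common())}

def encode(doc, maxlen=4):
    ids = [index.get(w, 1) for w in doc]
    return ids[:maxlen] + [0] * (maxlen - len(ids))  # truncate or pad to a fixed length

print(encode(["movie", "unknown"]))  # → [2, 1, 0, 0]
```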
Bag of Words (BOW) embedding :
For this we can use the CountVectorizer class from the scikit-learn library.
Because of processing constraints we kept only the top 1000 features. Basically, it builds a vocabulary of unique words and counts the occurrences of each unique word in the training set.
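A sketch of this with scikit-learn's CountVectorizer (a tiny toy corpus for illustration; the project capped the vocabulary at 1000 features):

```python
from sklearn.feature_extraction.text import CountVectorizer

train_texts = ["the hero saves the city",
               "a boring and predictable movie",
               "the movie was great"]

vectorizer = CountVectorizer(max_features=1000)  # keep the top 1000 words by frequency
X = vectorizer.fit_transform(train_texts)        # sparse document-term count matrix

print(sorted(vectorizer.vocabulary_))  # the learned vocabulary
print(X.toarray()[0])                  # counts for the first sentence
```

Note that the default tokenizer drops single-character tokens such as "a".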
The top features in this embedding are :
Term Frequency-Inverse Document Frequency (TF-IDF) Embedding :
We used the TfidfVectorizer from the scikit-learn library.
The top features in TF-IDF embedding are :
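For reference, a minimal TfidfVectorizer sketch on the same toy corpus; words that appear in fewer documents receive a higher IDF weight.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_texts = ["the hero saves the city",
               "a boring and predictable movie",
               "the movie was great"]

tfidf = TfidfVectorizer(max_features=1000)
X = tfidf.fit_transform(train_texts)  # rows are L2-normalised TF-IDF vectors

# Feature indices are assigned alphabetically, so this pairs each word with its IDF.
print(dict(zip(sorted(tfidf.vocabulary_), tfidf.idf_.round(2))))
```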
For each word embedding we have used following machine learning algorithms:
- Multinomial Naive Bayes
- K Nearest Neighbors
- Decision trees
- Support Vector Machines
K Nearest Neighbor (KNN) :
The KNN algorithm finds the k nearest neighbours of a test vector, takes a majority vote over their class labels, and assigns the winning label to the test vector.
Taking k as the hyper-parameter, we used GridSearchCV to determine its optimal value for both the bag-of-words and TF-IDF embeddings.
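The grid search can be sketched as follows; a synthetic data set stands in for our BOW/TF-IDF matrices, and the candidate k values are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the real BOW/TF-IDF feature matrices.
X, y = make_classification(n_samples=200, n_features=20, random_state=42)

grid = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": [2, 3, 5, 7, 9]}, cv=5)
grid.fit(X, y)

print(grid.best_params_, round(grid.best_score_, 3))
```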
Inference : KNN suffers from the curse of dimensionality, i.e. as the number of features increases, its predictive power decreases. This can be seen in the plots above: most metrics came out lower with KNN.
The best k value based on the plots above is 2.
Multinomial Naive Bayes :
Naïve Bayes learns the class priors and per-class word likelihoods from the given training data. When a test sentence is given, it calculates the probability P(C|Test) of the sentence belonging to each class, and the class that maximizes P(C|Test) becomes the predicted label.
The smoothing factor (alpha) is the hyper-parameter in Naïve Bayes. We used GridSearchCV to determine its value.
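A sketch of tuning alpha with GridSearchCV; random non-negative counts stand in for the BOW features, and the alpha grid is illustrative.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB

rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(200, 30))  # non-negative counts, like BOW features
y = rng.integers(0, 2, size=200)

grid = GridSearchCV(MultinomialNB(), {"alpha": [0.01, 0.1, 1.0, 10.0]}, cv=5)
grid.fit(X, y)
print(grid.best_params_["alpha"])
```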
Inference : As we can observe from the plots above, as the smoothing factor (alpha) increases, every metric value decreases.
The best alpha value in both cases turned out to be 0.01. Applying multinomial Naive Bayes to the test set with alpha = 0.01, we achieved good results.
Decision Trees :
A decision tree is a tree-based classifier which is highly interpretable: each internal node represents a feature on which a decision is made. We used the Gini impurity criterion (a splitting measure closely related to entropy) to decide which features become internal nodes.
The depth of the tree is the hyper-parameter in decision trees. We found the best depth using GridSearchCV on the validation data, with accuracy as the metric.
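A sketch of the depth search; the synthetic data and candidate depths are illustrative stand-ins.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# criterion="gini" is scikit-learn's default; max_depth is tuned on accuracy.
grid = GridSearchCV(DecisionTreeClassifier(criterion="gini", random_state=0),
                    {"max_depth": [3, 5, 10, 20]}, cv=5, scoring="accuracy")
grid.fit(X, y)
print(grid.best_params_["max_depth"])
```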
Inference : Even though the decision tree is a high-variance algorithm, it worked exceptionally well in our case.
The best max-depth value based on the validation data turned out to be 5.
Support Vector Machines :
SVM finds the hyperplane that best separates the data: the maximum-margin hyperplane between the two classes.
We used GridSearchCV to determine optimal values for the most common SVM hyper-parameters: the kernel and C (the regularization parameter).
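The kernel/C search can be sketched as follows (synthetic data and an illustrative grid):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=20, random_state=1)

# Search over both kernels and a small range of regularization strengths.
param_grid = {"kernel": ["linear", "rbf"], "C": [0.1, 1, 10]}
grid = GridSearchCV(SVC(), param_grid, cv=5)
grid.fit(X, y)
print(grid.best_params_)
```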
Inference : SVM is one of the most powerful classifiers in machine learning because of its kernel trick, and this proved true on our data as well: it achieved by far the best results among the classical machine learning algorithms.
With both embeddings, the 'linear' kernel outperformed the 'rbf' kernel.
Long Short-Term Memory (LSTM) :
Since our data set is reasonably large, we wanted to see how deep learning performs on it, so we tried the best-known recurrent model: the LSTM.
The LSTM addresses the problem of long-term dependencies caused by vanishing gradients by using different types of gates. Because of this cell structure, LSTM models have achieved very good results on most NLP tasks.
We achieved our best results with the LSTM; it surpassed the SVM results after just 2 epochs!
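The architecture can be sketched in Keras; the vocabulary size, sequence length, and layer sizes here are illustrative assumptions, not the tuned values from the project.

```python
from tensorflow.keras import Input, Sequential
from tensorflow.keras.layers import Dense, Embedding, LSTM

VOCAB_SIZE, MAXLEN = 10000, 200  # assumed values; tune to your vocabulary and length analysis

model = Sequential([
    Input(shape=(MAXLEN,)),          # index-encoded word sequences
    Embedding(VOCAB_SIZE, 64),       # learn dense word vectors
    LSTM(64),                        # gated recurrent layer
    Dense(1, activation="sigmoid"),  # fact (0) vs opinion (1)
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=2)
```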
Results on Test data :
Using the hyper-parameters obtained from GridSearchCV for the different models, we recorded how each performs on the test data.
Deployment :
As we achieved the best results with the LSTM model, we went one step further and deployed it using the Flask library.
Enter the required text in the given box and press the Predict button.
The model first preprocesses the given text and then classifies it as either fact or opinion.
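A minimal sketch of such a Flask endpoint; the classify function here is a placeholder toy rule standing in for the trained LSTM pipeline, and the route name is an assumption.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def classify(text):
    # Placeholder for the real pipeline (preprocessing + LSTM prediction).
    return "opinion" if "great" in text.lower() else "fact"

@app.route("/predict", methods=["POST"])
def predict():
    text = request.form.get("text", "")
    return jsonify({"label": classify(text)})

# In deployment: app.run(host="0.0.0.0", port=5000)
```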
Conclusion :
Even though we worked with a movies data set, the same methodology can be applied to any labelled news data set to detect bias in news articles (future work).
Acknowledgements :
We did this project under the guidance of Professor Dr. Tanmoy Chakraborthy in the Information Retrieval course at IIIT Delhi. Thanks to the professor and the teaching assistants who motivated us to take up the project and guided us to its successful completion.
I would also like to thank my team members Murali Krishna and Kasarla Mani Kumar Reddy, who contributed equally to the project.
Contributions :
1. Sarath Chandra Reddy Kikkuru (MT19037) : annotation of around 350 data samples, preprocessing of the data, and implementation of Naive Bayes and K Nearest Neighbours.
2. Murali Krishna Kasarla (MT19132) : annotation of more than 500 data samples and implementation of Decision Trees and Support Vector Machines.
3. Mani Kumar Reddy Kasarla (MT19065) : implementation of the deep learning model (LSTM) and deployment of the best model with Flask.
References :
- Most of the earlier research on opinion classification was done by Wiebe and colleagues (Wiebe et al., 1999), who proposed methods for discriminating subjective and objective features.
- Hatzivassiloglou and McKeown proposed an unsupervised model for learning positively and negatively oriented adjectives with accuracy over 90%.
- A similar study was conducted by Ahmet Aker et al. in the paper titled "Beyond opinion classification: extracting facts and opinions from health forums".
