Fact vs Opinion Classification


Motivation :

Problem Statement :

Data set description :

Data set visualization :

First five rows in the dataset
Word clouds for both classes

Distribution of Lengths across facts and opinions :

  • We can see that some data samples contain more than 500 words.
  • We need to determine an optimal length per sample that retains only the most important features.
  • This optimal length has to be found by trial and error.
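One common way to pick that cutoff by trial and error is to look at a high percentile of the word-count distribution, so a few very long samples do not dictate the padded length. A minimal sketch, using hypothetical word counts in place of the real dataset:

```python
import numpy as np

# Hypothetical word counts per sample; in the real dataset these come from
# splitting each text on whitespace and counting the tokens.
lengths = np.array([12, 45, 30, 520, 88, 14, 230, 67, 41, 600])

# Heuristic: cap sequence length at the 90th percentile so that outliers
# (samples with > 500 words) are truncated rather than padded to.
optimal_len = int(np.percentile(lengths, 90))
print(optimal_len)
```

The percentile itself (90 here) is the knob one would tune by trial and error against validation metrics.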

Preprocessing of Data samples :

  • Tokenisation and stemming
  • Stop-word removal
  • Case conversion
  • Removal of non-alphanumeric words
  • Removal of words shorter than three characters
First five rows after preprocessing and shuffling
Train/validation/test split
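The preprocessing steps above, plus the shuffle and split, can be sketched roughly as follows. The stop-word list and the suffix-stripping "stemmer" here are simplified stand-ins (the post presumably used a full stop-word list and a Porter stemmer, e.g. from NLTK):

```python
import re
import random

# Small illustrative stop-word list; a real pipeline would use a full one.
STOP_WORDS = {"the", "is", "a", "an", "of", "are", "to", "and", "in", "that"}

def crude_stem(word):
    """Very rough suffix stripping as a stand-in for a Porter stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    # Case conversion, then keep only alphanumeric tokens.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    # Stop-word removal, drop words shorter than 3 characters, stem the rest.
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS and len(t) >= 3]

print(preprocess("The Earth is orbiting the Sun!"))  # → ['earth', 'orbit', 'sun']

# Shuffle, then split 80/10/10 into train / validation / test.
samples = [("facts are verifiable", "fact"),
           ("pizza is the best food", "opinion")] * 50
random.seed(0)
random.shuffle(samples)
n = len(samples)
train = samples[: int(0.8 * n)]
val = samples[int(0.8 * n): int(0.9 * n)]
test = samples[int(0.9 * n):]
```

The 80/10/10 ratio is an assumption for illustration; any held-out split works the same way.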

Word Embeddings :

  • Bag of Words (BOW)
  • Term Frequency-Inverse Document Frequency (TF-IDF)
  • Index (rank method)
BOW embeddings
Top features of the BOW embedding
TF-IDF word embeddings
Top TF-IDF features
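A minimal sketch of how the BOW and TF-IDF embeddings can be built with scikit-learn (assuming that library; the toy corpus stands in for the preprocessed fact/opinion texts):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus standing in for the preprocessed dataset.
corpus = [
    "water boils at 100 degrees",
    "i think tea tastes better than coffee",
    "the earth orbits the sun",
]

# Bag of Words: raw term counts per document.
bow = CountVectorizer()
X_bow = bow.fit_transform(corpus)

# TF-IDF: counts reweighted so terms common to all documents score lower.
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(corpus)

print(X_bow.shape, X_tfidf.shape)  # one row per document, one column per term
```

Both vectorizers share the same default tokenisation, so the two matrices have the same shape and differ only in weighting.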
Models used for classification :

  • Multinomial Naive Bayes
  • K Nearest Neighbors
  • Decision trees
  • Support Vector Machines

K Nearest Neighbor (KNN) :

KNN: hyperparameter k vs. metrics on both embeddings
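The sweep behind a plot like that can be sketched as follows, assuming scikit-learn and using a synthetic matrix in place of the real BOW/TF-IDF features:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the BOW / TF-IDF feature matrices.
X, y = make_classification(n_samples=300, n_features=20, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# Sweep the number of neighbours k and score each model on validation data.
scores = {}
for k in (1, 3, 5, 7, 9):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    scores[k] = accuracy_score(y_val, knn.predict(X_val))

best_k = max(scores, key=scores.get)
```

Plotting `scores` against k for each embedding reproduces the figure; the best k is simply the argmax on validation accuracy.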

Multinomial Naive Bayes :

Naive Bayes: hyperparameter alpha vs. metrics on both embeddings
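The alpha swept here is the Laplace/Lidstone smoothing strength of Multinomial Naive Bayes. A sketch of the sweep, assuming scikit-learn and random count features in place of the real BOW matrix:

```python
import numpy as np
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

rng = np.random.default_rng(0)
# Non-negative count features, as a BOW vectorizer would produce.
X = rng.poisson(2.0, size=(300, 50))
y = rng.integers(0, 2, size=300)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# Sweep the smoothing strength alpha and record validation F1.
scores = {}
for alpha in (0.01, 0.1, 1.0, 10.0):
    clf = MultinomialNB(alpha=alpha).fit(X_tr, y_tr)
    scores[alpha] = f1_score(y_val, clf.predict(X_val))
```

Larger alpha pulls the per-class word probabilities toward uniform, which typically trades variance for bias.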

Decision Trees :

Max depth vs. different metrics on both embeddings
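The max-depth sweep can be sketched the same way (scikit-learn assumed, synthetic features standing in for the embeddings):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=20, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# Sweep max_depth; None lets the tree grow until all leaves are pure.
scores = {}
for depth in (2, 4, 8, 16, None):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    scores[depth] = accuracy_score(y_val, tree.predict(X_val))
```

Shallow trees underfit and unbounded trees tend to overfit sparse text features, which is why the validation curve in such plots usually peaks at an intermediate depth.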

Support Vector Machines :

Accuracy vs. C for different kernels
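A sketch of the grid over kernels and the regularisation strength C, again assuming scikit-learn with synthetic stand-in features:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=20, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# Grid over kernel type and the soft-margin penalty C.
acc = {}
for kernel in ("linear", "rbf"):
    for C in (0.1, 1.0, 10.0):
        svm = SVC(kernel=kernel, C=C).fit(X_tr, y_tr)
        acc[(kernel, C)] = accuracy_score(y_val, svm.predict(X_val))
```

Small C enforces a wider margin (more regularisation); large C fits the training data more closely, so accuracy-vs-C curves differ per kernel.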

Long Short-Term Memory (LSTM) :

LSTM plots
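To make the gating mechanics concrete, here is a single LSTM time step written out in NumPy. This is an illustrative forward pass with toy random weights, not the trained model from the post:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, U, b):
    """One LSTM time step; the four gates are slices of one stacked projection."""
    z = W @ x + U @ h + b        # shape (4 * hidden,)
    H = h.size
    i = sigmoid(z[:H])           # input gate: how much new information enters
    f = sigmoid(z[H:2 * H])      # forget gate: how much old cell state survives
    o = sigmoid(z[2 * H:3 * H])  # output gate: how much cell state is exposed
    g = np.tanh(z[3 * H:])       # candidate cell state
    c_new = f * c + i * g
    h_new = o * np.tanh(c_new)
    return h_new, c_new

rng = np.random.default_rng(0)
D, H = 8, 4                      # toy embedding and hidden sizes
W = rng.normal(size=(4 * H, D))
U = rng.normal(size=(4 * H, H))
b = np.zeros(4 * H)

# Run the cell over a short sequence of random "word embeddings".
h, c = np.zeros(H), np.zeros(H)
for t in range(5):
    x = rng.normal(size=D)
    h, c = lstm_step(x, h, c, W, U, b)
```

In the actual classifier, the final hidden state `h` would feed a dense sigmoid layer that outputs the fact/opinion probability.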

Results on Test data :

models and metrics

Deployment :

Deployment
Prediction using the LSTM model
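A deployment like this is commonly a small Flask endpoint that preprocesses incoming text and returns the model's label. A minimal sketch, assuming Flask; `predict_label` is a hypothetical placeholder for the real preprocess-pad-LSTM pipeline:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def predict_label(text):
    """Placeholder for the trained pipeline (preprocess -> pad -> LSTM).
    Hypothetical keyword rule so this sketch is self-contained."""
    opinion_cues = ("think", "best", "should")
    return "opinion" if any(w in text.lower() for w in opinion_cues) else "fact"

@app.route("/predict", methods=["POST"])
def predict():
    # Expect a JSON body like {"text": "..."} and return the predicted class.
    text = request.get_json().get("text", "")
    return jsonify({"label": predict_label(text)})
```

In production the placeholder would load the saved LSTM once at startup and reuse it across requests.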

Conclusion :

Acknowledgements :

Contributions :

  1. Sarath Chandra Reddy Kikkuru (MT19037) : Annotation of around 350 data samples, preprocessing of the data, and implementation of Naive Bayes and K-Nearest Neighbours.

References :

  1. Most of the earlier research on opinion classification was done by Wiebe and colleagues (Wiebe et al., 1999), who proposed methods for discriminating subjective and objective features.
  2. Hatzivassiloglou and McKeown proposed an unsupervised model for learning positively and negatively oriented adjectives with accuracy over 90%.
  3. A similar study was conducted by Ahmet Aker et al. in the paper titled “Beyond opinion classification: extracting facts and opinions from health forums”.


A graduate student at IIIT Delhi