Sentiment analysis on Twitter posts part 1.

Hi! I’m Przemysław from Profil Software, a custom software development company. This is my second article about AI. The first one can be found here:

This time I will try to present some hands-on examples of how to deal with simple NLP tasks like sentiment analysis. The solutions used in this article could be easily reused in other classification tasks. We will traverse the rough trails of AI starting with data cleaning and preprocessing, then move on to model definition and visualising the results. The solution was prepared using Google Colab and will be shared at the end. So are you ready? ;)

Requirements

When using Google Colab, no external packages are required to run the solution; libraries like scikit-learn are already included. When moving to another environment, you can list the installed dependencies and their versions by running the !pip freeze command in a notebook cell.

Dataset

In this article I will be using a Twitter dataset from a Kaggle competition (link). The dataset consists of 1.6M tweets written in English and extracted using the Twitter API. They are grouped into 2 classes (named targets):

  • negative (target 0 in csv)
  • positive (target 4 in csv)

The dataset also contains other columns, like the date of the tweet or the user that posted it. For the purpose of this article, we will be using the text and target info only. The distribution of the data over the 2 classes can also be useful for further processing. The part of the code responsible for loading the dataset and counting the target distribution is presented below:

Loading dataset and showing classes distribution
Classes distribution
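
A minimal sketch of this loading step could look as follows. The column names, the latin-1 encoding and the four sample rows are my assumptions for illustration (only the target and text columns are described above), and a tiny in-memory CSV stands in for the real file so the cell runs anywhere:

```python
import io

import pandas as pd

# Assumed layout of the headerless CSV; only "target" and "text" are used later.
COLUMNS = ["target", "id", "date", "flag", "user", "text"]

# Made-up rows standing in for the real 1.6M-row file.
SAMPLE_CSV = (
    '0,1,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,user_a,"@friend I feel sad today, ugh"\n'
    '0,2,Mon Apr 06 22:20:00 PDT 2009,NO_QUERY,user_b,Missed the bus again\n'
    '4,3,Mon Apr 06 22:21:10 PDT 2009,NO_QUERY,user_c,Loving the sunshine! http://example.com\n'
    '4,4,Mon Apr 06 22:22:30 PDT 2009,NO_QUERY,user_d,Great coffee this morning\n'
)


def load_dataset(source, encoding="latin-1"):
    """Read the CSV and keep only the target and text columns."""
    df = pd.read_csv(source, names=COLUMNS, header=None, encoding=encoding)
    return df[["target", "text"]]


df = load_dataset(io.BytesIO(SAMPLE_CSV.encode("latin-1")))

# Number of tweets per class, ordered by target value.
class_counts = df["target"].value_counts().sort_index()
print(class_counts)
```

For the real dataset, the path to the downloaded CSV file would be passed to load_dataset instead of the in-memory buffer.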

Sample rows of the raw dataset are displayed below:

Raw dataset

At first glance we can see that the data is a bit dirty and can be cleaned by removing noise such as links and mentions. A few pandas utilities were enough for this simple preprocessing. You can see that part below:

Data clean-up
  • remove all urls and user mentions, hashtags,
  • accept only letters and digits
  • remove extra spaces
  • parse everything to lowercase
  • rename target class 4 -> 1
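
The steps above can be sketched with pandas string methods roughly like this. The regular expressions are my assumptions (for example, I drop the whole hashtag token, not just the # sign), and clean_tweets is a hypothetical helper name:

```python
import pandas as pd


def clean_tweets(df):
    """Return a cleaned copy of a frame with "text" and "target" columns."""
    out = df.copy()
    text = out["text"].str.lower()
    # Remove URLs, @mentions and #hashtags (the whole token).
    text = text.str.replace(r"https?://\S+|www\.\S+|[@#]\w+", " ", regex=True)
    # Accept only letters, digits and whitespace.
    text = text.str.replace(r"[^a-z0-9\s]", "", regex=True)
    # Collapse extra whitespace and trim both ends.
    text = text.str.replace(r"\s+", " ", regex=True).str.strip()
    out["text"] = text
    # Rename target class 4 -> 1 so the labels are 0 and 1.
    out["target"] = out["target"].replace({4: 1})
    return out


raw = pd.DataFrame(
    {
        "target": [0, 4],
        "text": [
            "@bob I can't sleep!!  http://t.co/abc #sad",
            "Loving   the NEW phone :) www.example.com",
        ],
    }
)
cleaned = clean_tweets(raw)
print(cleaned)
```

Removing punctuation without inserting a space turns "can't" into "cant", which matches the kind of tokens that show up later among the learned features.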

Sample output of preprocessed data (can be compared with previous image):

Parsed data

Data split

To be able to train our AI model we need to first split the data into train and test sets. The code below shows how to do it:

Data split

I used the random_state option to make the experiments reproducible, while the stratify option keeps the distribution of classes similar in both sets.
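
Here is a small sketch of such a split on a toy frame. The RANDOM_STATE and TEST_SIZE values are hypothetical; the article does not state which values were actually used:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

RANDOM_STATE = 42  # hypothetical seed, any fixed integer gives repeatable splits
TEST_SIZE = 0.2    # hypothetical share of rows kept for testing

# Toy data: 20 rows, perfectly balanced between the two classes.
toy = pd.DataFrame(
    {"text": [f"tweet {i}" for i in range(20)], "target": [0, 1] * 10}
)

X_train, X_test, y_train, y_test = train_test_split(
    toy["text"],
    toy["target"],
    test_size=TEST_SIZE,
    random_state=RANDOM_STATE,  # same seed -> same split on every run
    stratify=toy["target"],     # keep the class ratio in both sets
)

print(y_train.value_counts().to_dict(), y_test.value_counts().to_dict())
```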

Model

The first model that will be checked is a so-called Bag of words model. The bag-of-words model is a simplifying representation used in NLP. In this model, a text (such as a sentence or a document) is represented as the bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity. Its visualisation is presented below:

BoW visualization

I will implement it using CountVectorizer from sklearn which converts a collection of text documents to a matrix of token counts. This approach will then enable us to use preprocessed vectors as an input for the LogisticRegression model.

Model preparation

As you can see, with just a few lines of code we can prepare a fully working model. In the code above, NGRAM_RANGE describes how many consecutive words are combined into a single feature. MAX_ITER lets us stop the algorithm after a certain number of iterations; for such big datasets it is sometimes sensible to cap it.
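
A hedged sketch of such a model is shown below. Chaining both steps with make_pipeline is my own choice for brevity, the MAX_ITER value is hypothetical, and NGRAM_RANGE = (1, 2) assumes single words and two-word pairs; the training sentences are made up:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

NGRAM_RANGE = (1, 2)  # assumed: unigrams and bigrams
MAX_ITER = 1000       # hypothetical cap on solver iterations


def build_model():
    """Bag-of-words features followed by a logistic regression classifier."""
    return make_pipeline(
        CountVectorizer(ngram_range=NGRAM_RANGE),
        LogisticRegression(max_iter=MAX_ITER),
    )


# Made-up training data: 1 = positive, 0 = negative.
train_texts = [
    "i am happy", "so happy today", "great day",
    "i am sad", "so sad today", "awful day",
]
train_labels = [1, 1, 1, 0, 0, 0]

model = build_model()
model.fit(train_texts, train_labels)
predictions = model.predict(["happy happy", "sad"])
print(predictions)
```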

Results

And voilà, that was it, we did it. Now let’s see what results we receive.

For sklearn models, we can use visualisation functions that help with ad hoc prototyping.

Results visualization
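  • classification_report: takes the true labels and the predictions and prepares a printable report
  • plot_confusion_matrix: plots a heat-map of the classification results; for binary classification it is a 2×2 grid of squares (this function was removed in scikit-learn 1.2, where ConfusionMatrixDisplay.from_estimator replaces it)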
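
Both helpers can be tried on a handful of made-up labels, as in the sketch below. To keep it runnable without a display, it computes the matrix with confusion_matrix instead of plotting it; the class names passed as target_names are my assumption:

```python
from sklearn.metrics import classification_report, confusion_matrix

# Made-up true labels and predictions (0 = negative, 1 = positive).
y_true = [0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 0]

# Precision, recall and F1 per class, plus overall accuracy.
report = classification_report(
    y_true, y_pred, target_names=["negative", "positive"]
)
print(report)

# Rows are true classes, columns are predicted classes.
matrix = confusion_matrix(y_true, y_pred)
print(matrix)

# In a notebook, ConfusionMatrixDisplay.from_predictions(y_true, y_pred)
# draws the same matrix as a heat-map.
```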

Outputs of both functions are presented below:

classification_report (training)
classification_report (testing)
Output of plot_confusion_matrix (top: train, bottom: test)

As we can see, the model reaches above 80% accuracy on unseen samples, which is a great result given how little code was needed to achieve it.

Feature importance

The logistic regression model lets us pair its features (in our case single words or two-word pairs) with the coefficients calculated during fitting to find the features that most influence the choice of a specific label. Below is a short snippet:

Feature importance
Positive features:
['not sad', 'no problem', 'doesnt hurt', 'not bad', 'no problems', 'no prob', 'not problem', 'never too', 'no probs', 'cant miss']

Negative features:
['clean me', 'not happy', 'sad', 'passed away', 'rip', 'not looking', 'funeral', 'headache', 'disappointing', 'upsetting']

Code

Here is the complete solution (https://github.com/profilsoftware/sentiment-detection-article):

Next steps

In the next article I will try to implement a model doing the same task but constructed with neural networks. Can’t wait for it and I hope that you cannot wait for it either! ;)
