Sentiment Analysis using the ‘magical’ Sklearn Pipeline

Skillcate AI
4 min read · Aug 10, 2022


Scikit-learn pipeline, aka Sklearn Pipeline, is a low-code magical way of building ML models.

Our model classifies product reviews into positive / negative

As part of this project, we build and then deploy a sentiment analysis model using the magical Sklearn Pipeline. So essentially, with this project we cover the complete ML journey: problem formulation, model training, deployment and then testing in a live environment.

Sklearn Pipeline is a major productivity tool that collapses all preprocessing and modelling steps into a single line of code.

Watch the video tutorial instead

We have made a video tutorial of this project. If you are more of a video person, go ahead and watch it on YouTube.

All project-related files are kept on GitHub. A detailed readme on how to run the project on your machine is provided on the GitHub repository page. The Skillcate project toolkit is also available on Google Drive.

On this note, let’s get started :)

Plan of Action

  1. Setting the environment: We use the Amazon Alexa Reviews dataset (~3,150 reviews), which contains verified_reviews, rating, date, variation & feedback. In the current scope, only verified_reviews & rating are of interest to us; the rest we will discard
  2. Data preprocessing: Then, we generate sentiment labels (positive/negative) by marking reviews with rating > 3 as positive and the remaining ones as negative
  3. Data transformation: Then, we transform the review text into a numeric representation using TF-IDF (term frequency-inverse document frequency), a popular feature representation technique
  4. Training the Sklearn Pipeline: After that, we build a Sklearn Pipeline, within which we chain our TF-IDF tokenizer and our classifier (a Support Vector Classifier), and fit the model. Later, we check model performance (we get >90% accuracy)
  5. Having fun with the model: Lastly, we deploy our model using Flask to perform live queries with real Alexa reviews
A sneak-peek into our dataset

Setting the environment

In this first part, we load the essential libraries and functions that will be used later on.
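
Here is a minimal sketch of what that import cell might look like; the exact set of libraries in the actual notebook may differ slightly (LinearSVC is assumed here for the Support Vector Classifier used later):

```python
# Core data handling
import numpy as np
import pandas as pd

# Scikit-learn: feature extraction, classifier, pipeline, splitting and metrics
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix

# Serialisation of the trained pipeline
import pickle
```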

Our Amazon Alexa Reviews dataset can be downloaded from this source: link. The dataset has ~3,150 product reviews along with their 1-5 star ratings. If you are using Google Colab & Drive, you may use the following code as-is; otherwise, make appropriate changes.

Next, we create a dataframe called dataset, where we keep only the two columns of our interest: verified_reviews and rating. We also rename them with cleaner-looking header text.
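
A hedged sketch of this step; the file name amazon_alexa.tsv and the new column name review are assumptions, so adapt them to your copy of the dataset:

```python
# Load the Amazon Alexa reviews file (path and file name are placeholders)
df = pd.read_csv('amazon_alexa.tsv', sep='\t')

# Keep only the two columns of interest and give them cleaner header names
dataset = df[['verified_reviews', 'rating']].copy()
dataset.columns = ['review', 'rating']
dataset.head()
```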

Data preprocessing

As the next step, we define a compute_sentiments function to generate binary labels from the rating (which is on a scale of 1 to 5): for all reviews with a rating above 3, we assign sentiment = 1 (i.e., positive), and for reviews with a rating of 3 or below, we assign sentiment = 0 (i.e., negative).

Then we apply this function to every row and store the resulting labels in a new sentiment column.
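
A minimal sketch of this labelling step, following the rule above (the column names match the illustrative ones used earlier):

```python
def compute_sentiments(rating):
    """Map a 1-5 star rating to a binary sentiment label."""
    return 1 if rating > 3 else 0   # 1 = positive, 0 = negative

# Apply the mapping to every review and store the labels in a new column
dataset['sentiment'] = dataset['rating'].apply(compute_sentiments)
```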

Data transformation (with TF-IDF)

In this part, we use the popular TF-IDF vectorizer to transform our review text data into a numeric representation.

TF-IDF stands for Term Frequency-Inverse Document Frequency. It measures how relevant a word is to a document within a corpus: the relevance increases proportionally to the number of times the word appears in that document, but is offset by how frequently the word appears across the whole corpus (our dataset).
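
Assuming scikit-learn's default (smoothed) TF-IDF settings, the weight it assigns is roughly:

tf-idf(t, d) = tf(t, d) × idf(t), where idf(t) = ln((1 + n) / (1 + df(t))) + 1

Here tf(t, d) is the count of term t in review d, n is the total number of reviews, and df(t) is the number of reviews containing t; the resulting vectors are then L2-normalised.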

Talking about the code, first up we define our x and y. Then, we import CustomTokenizerExample from the tokenizer_input script (shared later in this section). With the text_data_cleaning method from this script, we get cleaned reviews. We also need a way to convert these cleaned reviews to numeric form; for this, we use the popular TfidfVectorizer.
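
A sketch of how these pieces could be wired together; the class and method names come from the article, while variable and column names (x, y, review, sentiment) are illustrative:

```python
from tokenizer_input import CustomTokenizerExample

# Features (raw review text) and labels (binary sentiment)
x = dataset['review']
y = dataset['sentiment']

# TF-IDF vectorizer that cleans each review with our custom tokenizer
ct = CustomTokenizerExample()
tfidf = TfidfVectorizer(tokenizer=ct.text_data_cleaning)
```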

tokenizer_input is a separate Python script we have written for data cleaning. Here, we first import the popular NLP library spaCy and load its en_core_web_sm package, which breaks sentences down into their token components. We also load spaCy's stop words and the standard punctuation marks.

Within the CustomTokenizerExample class, we then define the method text_data_cleaning, which performs a series of cleaning operations on the input reviews: lemmatisation, conversion to lower case, and filtering out stop words & punctuation marks. Finally, the method returns a list of cleaned tokens.
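
A hedged sketch of what tokenizer_input.py could look like, assuming spaCy's en_core_web_sm model and Python's built-in punctuation list; the actual script lives in the GitHub repo:

```python
# tokenizer_input.py -- text-cleaning helper used by the TF-IDF vectorizer
import string

import spacy
from spacy.lang.en.stop_words import STOP_WORDS

nlp = spacy.load('en_core_web_sm')   # spaCy's small English pipeline
stopwords = STOP_WORDS               # spaCy's built-in stop-word list
punct = string.punctuation           # standard punctuation characters


class CustomTokenizerExample:
    def text_data_cleaning(self, sentence):
        """Lemmatise, lower-case and drop stop words / punctuation."""
        doc = nlp(sentence)
        # Lemmatise and lower-case every token
        tokens = [token.lemma_.lower().strip() for token in doc]
        # Keep only tokens that are neither stop words nor punctuation
        return [t for t in tokens if t not in stopwords and t not in punct]
```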

Training sklearn pipeline

Here, we prepare for model training by splitting the data into an 80:20 train/test split. Then we define our Pipeline. As explained previously, a Sklearn Pipeline is basically a series of transformers glued to a final estimator. In our case, we have a single transformer, the tfidf vectorizer defined previously, and our classifier, defined here.

Let’s fit our Sentiment Analysis Pipeline on the training data. :)
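
A minimal sketch of the split, the pipeline definition and the fit; LinearSVC stands in for the Support Vector Classifier here, and the step names tfidf / clf are illustrative:

```python
# 80:20 train/test split
X_train, X_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42)

# A single transformer (the TF-IDF vectorizer) glued to a final estimator
pipeline = Pipeline([
    ('tfidf', tfidf),
    ('clf', LinearSVC()),
])

# Fit the whole pipeline on the raw training reviews
pipeline.fit(X_train, y_train)
```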

Once training is complete, we use the pipeline's predict method to generate labels for the unseen test data. We also plot a confusion matrix. As you may observe, we get over 90% classification accuracy.
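
A sketch of this evaluation step (the exact numbers will depend on the split):

```python
# Predict labels for the held-out reviews
y_pred = pipeline.predict(X_test)

print('Accuracy:', accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
```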

Then we serialise our Sentiment Analysis Pipeline as a pickle file for later use. Using the next script, you may perform live queries on the trained pipeline within your notebook.
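
A sketch of the serialisation and a live query; the pickle file name is a placeholder:

```python
# Serialise the fitted pipeline for later use (e.g. in the Flask app)
with open('sentiment_pipeline.pkl', 'wb') as f:
    pickle.dump(pipeline, f)

# Live query: load the pipeline back and score a fresh review
with open('sentiment_pipeline.pkl', 'rb') as f:
    model = pickle.load(f)

review = ["Alexa answers every question and the sound quality is great"]
print('Positive' if model.predict(review)[0] == 1 else 'Negative')
```

Note that the pickled pipeline references CustomTokenizerExample, so tokenizer_input must be importable wherever you load the pickle.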

Model Deployment

To run this project on your machine, visit our GitHub project repository page. There is a step-by-step guide on the repository homepage.

Again, if you are more of a video person, you may check out the video we have made on deploying this Sentiment Analysis Pipeline using Flask.
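
For orientation, here is a hedged, minimal sketch of what such a Flask front-end could look like; the route, template name and file names are assumptions, and the actual app is in the GitHub repo:

```python
# app.py -- minimal Flask front-end around the pickled pipeline (illustrative)
import pickle

from flask import Flask, render_template, request

app = Flask(__name__)

# tokenizer_input must be importable here too, since the pickle references it
with open('sentiment_pipeline.pkl', 'rb') as f:
    model = pickle.load(f)


@app.route('/', methods=['GET', 'POST'])
def predict():
    sentiment = None
    if request.method == 'POST':
        review = request.form.get('review', '')
        label = model.predict([review])[0]
        sentiment = 'Positive' if label == 1 else 'Negative'
    # index.html is assumed to show a text box and the predicted sentiment
    return render_template('index.html', sentiment=sentiment)


if __name__ == '__main__':
    app.run(debug=True)
```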

Brief about Skillcate

At Skillcate, we are on a mission to bring you application-based machine learning education. We launch new machine learning projects every week, so make sure to subscribe to our YouTube channel and hit that bell icon to get notified when our new ML projects go live.

Shall be back soon with a new ML project. Until then, happy learning 🤗!!

