What do people think about your Instagram? A simple application for sentiment analysis of Instagram posts.

In this post, we will cover how to build a simple machine learning application for sentiment analysis. Our focus here is to classify the comments of a specific Instagram post.

I assume you have a basic knowledge of programming in Python and of the libraries Flask, scikit-learn, and NLTK. If you want to jump straight to the code, take a look at my GitHub here.

1-Download the data

Choosing the dataset is a critical step. The words that appear in Instagram posts must be well represented in the dataset we use to train our model. Unfortunately, I was unable to find a dataset of Instagram posts, so I decided to use one based on Twitter posts, since both are social networks.

Download the data at Sentiment140 Details or Google Drive and put it inside the /data folder.

Dataset details

  • target: the polarity of the tweet (0 = negative, 2 = neutral, 4 = positive)
  • ids: the id of the tweet (2087)
  • date: the date of the tweet (Sat May 16 23:58:44 UTC 2009)
  • flag: the query (lyx). If there is no query, then this value is NO_QUERY.
  • user: the user that tweeted (robotickilldozr)
  • text: the text of the tweet (Lyx is cool)

2-Pre-processing/load the data

Our dataset is based on Twitter posts, which means we will see a lot of emojis, user nicknames, and mentions. The function preprocess_sentence applies basic steps to remove HTML blocks, user tags, and HTTP links, and to convert emojis to their corresponding names.
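The cleaning steps described above can be sketched as follows. This is a minimal illustration and not the code from the repository: the regular expressions are assumptions, and the emoji step uses the standard library's unicodedata names as a stand-in for a dedicated emoji library.

```python
import html
import re
import unicodedata

# Assumed patterns for this sketch; the real project may use different ones.
TAG_RE = re.compile(r"<[^>]+>")                  # HTML blocks such as <b> or <br/>
URL_RE = re.compile(r"https?://\S+|www\.\S+")    # HTTP links
MENTION_RE = re.compile(r"@\w+")                 # user tags such as @someone


def emoji_to_name(char):
    """Replace a symbol character with its Unicode name, e.g. :grinning_face:."""
    if unicodedata.category(char) == "So":       # "Symbol, other" covers most emojis
        name = unicodedata.name(char, "")
        if name:
            return " :" + name.lower().replace(" ", "_") + ": "
    return char


def preprocess_sentence(text):
    """Clean one raw post: strip HTML, links, and mentions, then name the emojis."""
    text = html.unescape(text)                   # turn &amp; and friends into characters
    text = TAG_RE.sub(" ", text)
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = "".join(emoji_to_name(char) for char in text)
    return re.sub(r"\s+", " ", text).strip().lower()
```

Keeping the emoji as a named token, rather than deleting it, lets the classifier treat it as an ordinary word that may carry sentiment.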

After finalizing our preprocess function, we can load the data. For this, I developed the function load_data, which opens the data file, assigns the names of the corresponding columns, and reassigns the labels. The function load_train_test is an auxiliary function that loads the data and serializes it.
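A rough sketch of the loading step is shown below. It relies on assumptions that the post does not confirm: the file is a header-less CSV read as latin-1, the columns follow the order listed in the dataset details, label 4 is remapped to 1, and neutral rows are dropped because the task is binary. The helper parse_rows is a hypothetical name introduced only for this example.

```python
import csv

# Column order taken from the dataset details above.
COLUMNS = ["target", "ids", "date", "flag", "user", "text"]

# Assumed remapping for a binary task: 0 stays negative, 4 becomes 1 (positive).
LABEL_MAP = {"0": 0, "4": 1}


def parse_rows(lines):
    """Turn an iterable of CSV lines into (texts, labels), skipping neutral rows."""
    texts, labels = [], []
    for row in csv.DictReader(lines, fieldnames=COLUMNS):
        label = LABEL_MAP.get(row["target"])
        if label is None:            # e.g. target "2" (neutral) is not used here
            continue
        texts.append(row["text"])
        labels.append(label)
    return texts, labels


def load_data(path, encoding="latin-1"):
    """Open the CSV file at `path` and return its texts and remapped labels."""
    with open(path, newline="", encoding=encoding) as handle:
        return parse_rows(handle)
```

Splitting the parsing from the file handling keeps the remapping logic easy to check on a few hand-written rows before running it on the full file.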

3-Training the model

We want to classify Instagram comments as negative or positive; in other words, this is a binary classification task. It is possible to choose complex, modern models such as an LSTM with embeddings, but for this tutorial I opted for a Naive Bayes classifier with TFIDF features, because it is simple and doesn’t need a powerful GPU to train.

In information retrieval, tf–idf or TFIDF, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. (Wikipedia)
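To make the definition concrete, here is a small hand-rolled version of the textbook formula, tf multiplied by log(N / df). It is only an illustration: scikit-learn's implementation uses a smoothed and normalized variant, so its numbers will differ. The tiny corpus is made up for the example.

```python
import math


def term_frequency(term, document):
    """Share of the document's tokens that are `term`."""
    if not document:
        return 0.0
    return document.count(term) / len(document)


def inverse_document_frequency(term, corpus):
    """log(N / df): rarer terms across the corpus get a larger weight."""
    containing = sum(1 for document in corpus if term in document)
    if containing == 0:
        return 0.0
    return math.log(len(corpus) / containing)


def tf_idf(term, document, corpus):
    """Importance of `term` in `document`, relative to the whole corpus."""
    return term_frequency(term, document) * inverse_document_frequency(term, corpus)


# Toy corpus of already tokenized documents (made up for this example).
corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran", "far"],
]
```

In this corpus "the" appears in every document, so its weight is zero, while "dog" appears in only one document and therefore scores higher than "cat".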

The function fit_naivebayes computes the TFIDF features and fits the model using partial_fit, which trains the classifier on chunks of data. I chose partial_fit instead of the fit function to avoid memory problems.
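The chunked training idea might look like the sketch below. The signature, the chunk size, and the choice of MultinomialNB are assumptions, since the post only says "Naive Bayes"; the vectorizer is fitted once on all texts and only the transform and training steps run per chunk.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB


def fit_naivebayes(texts, labels, chunk_size=10000):
    """Fit TFIDF on all texts, then train Naive Bayes one chunk at a time."""
    vectorizer = TfidfVectorizer()
    vectorizer.fit(texts)                      # learn vocabulary and idf weights once

    classifier = MultinomialNB()               # assumed variant for this sketch
    classes = sorted(set(labels))              # partial_fit needs every class up front

    for start in range(0, len(texts), chunk_size):
        stop = start + chunk_size
        features = vectorizer.transform(texts[start:stop])
        classifier.partial_fit(features, labels[start:stop], classes=classes)

    return vectorizer, classifier
```

A quick usage example on a toy dataset: `vectorizer, classifier = fit_naivebayes(texts, labels, chunk_size=2)` followed by `classifier.predict(vectorizer.transform(["love happy"]))`. The same vectorizer must be reused at prediction time so that the feature columns match.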

4-Running the server

The app.py file contains all the functions needed to build the API and run the server, based on the Flask framework. If you are not familiar with Flask, I suggest you take a look at this tutorial. The function predict_sentiment_post uses the instaloader library to download the comments of a single post. Afterward, we apply the same preprocessing used on our training data and return to the user a result with the information about negative and positive comments. We store this information in a dictionary called content and pass it to our web page index.html, which displays it.
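The flow from downloaded comments to the content dictionary could be sketched as below. Everything here is hypothetical: the helper name build_content, the dictionary keys, and the idea of passing the preprocessing and classification steps in as functions are choices made for this example, and the download through instaloader and the Flask route are left out.

```python
from collections import Counter

# Assumed label encoding: 0 = negative, 1 = positive.
LABEL_NAMES = {0: "negative", 1: "positive"}


def build_content(comments, preprocess, classify):
    """Clean the raw comments, classify them, and summarize the result.

    comments   -- iterable of raw comment strings
    preprocess -- function mapping one raw string to a cleaned string
    classify   -- function mapping a list of cleaned strings to a list of 0/1 labels
    """
    comments = list(comments)
    cleaned = [preprocess(comment) for comment in comments]
    predictions = classify(cleaned) if cleaned else []
    sentiments = [LABEL_NAMES[int(label)] for label in predictions]
    counts = Counter(sentiments)
    return {
        "comments": [
            {"text": text, "sentiment": sentiment}
            for text, sentiment in zip(comments, sentiments)
        ],
        "positive": counts["positive"],
        "negative": counts["negative"],
        "total": len(comments),
    }
```

In a Flask view, a dictionary like this could then be handed to the template with `render_template("index.html", content=content)`.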

5-Direct requests

Another way to use our API is to send a POST request directly. For this purpose, we can use the requests library. First, you need to put the necessary information into a dictionary. In this case, the variables are short_code, the code of an Instagram post, and max_comments, the maximum number of comments to analyze. The server returns a dictionary that contains each comment with an associated label indicating either negative or positive sentiment.
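A direct request might be put together as in the sketch below. The URL, the JSON encoding of the body, and the example short code are assumptions, so check app.py for the real route and format. The standard library's urllib is used here only to keep the example dependency-free.

```python
import json
import urllib.request

# Hypothetical endpoint; replace it with the route defined in app.py.
API_URL = "http://localhost:5000/predict"


def build_request(short_code, max_comments, url=API_URL):
    """Prepare a POST request carrying the two parameters as a JSON body."""
    payload = {"short_code": short_code, "max_comments": max_comments}
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_sentiment(short_code, max_comments, url=API_URL):
    """Send the request and decode the server's JSON answer into a dictionary."""
    request = build_request(short_code, max_comments, url)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, `fetch_sentiment("ABC123", 50)` would send the request, where "ABC123" is a placeholder short code. The same payload dictionary can be sent with the requests library through `requests.post(url, json=payload)`.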


In this post, you saw a little bit of NLP and sentiment analysis for social media. Many improvements are possible, such as switching to a deep learning model or simply tuning the hyperparameters of the Naive Bayes model. Besides that, the TFIDF features could be improved by using n-grams. Finally, if you want to do a real deployment, take a look at this tutorial.