Basic OCR and Sentiment analysis web app

Optical character recognition and sentiment analysis using PyTesseract and machine learning.

Published in Analytics Vidhya

5 min read, Jul 26, 2020

This basic application takes an input image from the local machine, extracts the text from it, and predicts whether that text is positive, negative, or neutral. It has two parts:

Extraction of text from the image (OCR) using PyTesseract.
Sentiment prediction on the text using NLP (nltk NaiveBayesClassifier).

We will build an Angular-based (Ionic) web app for the user interface and a Flask server that runs PyTesseract and the sentiment model trained with nltk.

Steps to be followed:

1) Initialize the web app and set up the basic user interface
2) Install and implement PyTesseract
3) Install and use nltk for sentiment analysis

We will go through each step in turn.

Step 1: Initialize the web app and set up the basic user interface

We can initialize the Ionic web application using:

ionic start OCR-sentiment blank

Next we set up a basic UI for the app, plus the code that calls the Flask server's REST API through Angular's HttpClient module.

The choose button selects the image file, which is then sent to the Flask server.

The image file is held in this.fileObj, appended to a form data object, and sent to the Flask server in a POST request made with HttpClient.

The request goes to the /getText route defined in the Flask server, which responds with the extracted text and the predicted sentiment.

Step 2: Installing and implementing PyTesseract

This step covers a short installation and the OCR part of the server code.

A) Installing PyTesseract

pip install pytesseract

PyTesseract is only a Python wrapper, so the Tesseract OCR engine itself must also be installed. You can download the installer HERE.

B) Implementing the OCR (extraction of text from the image)

In this part we extract the text from the image.

The function takes an image file as input and uses PyTesseract's image_to_string function to extract the text from it.
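A minimal sketch of such an OCR function is shown here. The name ocr_extraction comes from later in this tutorial, while clean_text is a hypothetical helper added for illustration and is not part of PyTesseract. The third-party imports sit inside the function so the snippet still loads on a machine where Pillow or PyTesseract is missing.

```python
def clean_text(raw):
    """Collapse runs of whitespace (hypothetical helper, optional).

    OCR output often ends with newline or form-feed characters;
    note that this also joins multi-line text into a single line.
    """
    return " ".join(raw.split())


def ocr_extraction(image_file):
    """Return the text found in an image (path or file-like object)."""
    # Imported lazily: both packages are third-party dependencies.
    from PIL import Image
    import pytesseract

    image = Image.open(image_file)
    # image_to_string runs the Tesseract engine on the opened image.
    return clean_text(pytesseract.image_to_string(image))
```

If Tesseract is installed outside the system PATH, pytesseract.pytesseract.tesseract_cmd may need to point to the executable before the function is called.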

OCR accuracy can be improved with preprocessing, for example thresholding the input image with OpenCV, but this tutorial does not cover that.

We use the Pillow library to open the image file for processing. Later in the tutorial this function is included in the main server code.

Step 3: Installing nltk and training the NaiveBayesClassifier for sentiment analysis

This step covers installing nltk and training the machine learning model.

1) Installing nltk (Natural Language Toolkit)

pip install nltk
python -m nltk.downloader all

The first command installs the library itself. The second downloads all the nltk datasets, including the movie_reviews corpus used in this tutorial.

2) Training of the machine learning model

We now train the NaiveBayesClassifier on the movie_reviews corpus that ships with nltk, and then serialize the trained model with the pickle library.

First we separate the positive and negative classes in the movie_reviews corpus and build positive and negative feature sets from them.
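One way this separation could look is sketched below. It assumes a simple bag-of-words feature (every word maps to True), which is a common choice with this classifier but may differ from the original code. The function names are illustrative, and the corpus import is kept inside the loader so the snippet still loads where nltk is not installed.

```python
def word_features(words):
    """Bag-of-words features: each word present maps to True (assumed scheme)."""
    return {word: True for word in words}


def load_labeled_features():
    """Return (positive, negative) lists of (features, label) pairs."""
    # Needs nltk and the downloaded movie_reviews corpus.
    from nltk.corpus import movie_reviews

    pos_feats = [
        (word_features(movie_reviews.words(fileids=[fid])), "pos")
        for fid in movie_reviews.fileids("pos")
    ]
    neg_feats = [
        (word_features(movie_reviews.words(fileids=[fid])), "neg")
        for fid in movie_reviews.fileids("neg")
    ]
    return pos_feats, neg_feats
```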

Next we split the data into training and test sets using a ratio of 0.8, so 80% of each class is used for training and the remaining 20% for testing.
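The 80/20 split can be sketched as follows. The function name and the choice to cut each class separately, so both sets keep the same class balance, are assumptions made for illustration.

```python
def split_train_test(pos_feats, neg_feats, ratio=0.8):
    """Split each class at `ratio` and combine into train and test lists."""
    pos_cut = int(len(pos_feats) * ratio)
    neg_cut = int(len(neg_feats) * ratio)
    train_set = pos_feats[:pos_cut] + neg_feats[:neg_cut]
    test_set = pos_feats[pos_cut:] + neg_feats[neg_cut:]
    return train_set, test_set
```

With 10 positive and 5 negative examples, this puts 8 + 4 examples in the training set and 2 + 1 in the test set.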

In the final step we train the model and dump it (serialize the model object) using pickle.

This creates a serialized binary file that stores our classifier object. Later we load it to predict the sentiment of the extracted text.
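A short sketch of training and serializing the model follows. The file name sentiment_classifier.pickle and both function names are placeholders chosen for this example, not names taken from the original code.

```python
import pickle


def train_classifier(train_set):
    """Train a Naive Bayes classifier on (features, label) pairs."""
    # Imported lazily so the snippet loads where nltk is not installed.
    from nltk.classify import NaiveBayesClassifier

    return NaiveBayesClassifier.train(train_set)


def save_model(model, path="sentiment_classifier.pickle"):
    """Serialize the trained model to a binary file."""
    with open(path, "wb") as handle:
        pickle.dump(model, handle)
```

Before saving, nltk.classify.util.accuracy(classifier, test_set) can be used to check the model against the held-out 20%.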

Final Step: Making a Flask route that calls the OCR function and predicts the sentiment

In this section we integrate all of our Python and machine learning code into a Flask server.

We define the /getText route in Python.

This is the REST endpoint that the front-end client called earlier to get the extracted text and the predicted sentiment.

The route uses the ocr_extraction function defined earlier, which we import into the server code to get the text.

We then pass the extracted text to the predictSentiment function, which is defined below.
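The route described above might be wired up roughly like this. The sketch uses an app factory that receives the OCR and sentiment functions as arguments, so it stays self-contained. The form field name "file" and the JSON keys "text" and "sentiment" are assumptions and must match whatever the front end actually sends and reads.

```python
def create_app(ocr, predict):
    """Build the Flask app; `ocr` and `predict` are injected callables."""
    # Imported lazily so the snippet loads where Flask is not installed.
    from flask import Flask, jsonify, request

    app = Flask(__name__)

    @app.route("/getText", methods=["POST"])
    def get_text():
        uploaded = request.files.get("file")  # field name is an assumption
        if uploaded is None:
            return jsonify(error="no image uploaded"), 400
        text = ocr(uploaded.stream)  # extract text from the uploaded image
        return jsonify(text=text, sentiment=predict(text))

    return app
```

In the real server the two arguments would be the ocr_extraction and predictSentiment functions, for example create_app(ocr_extraction, predictSentiment).run().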

The predictSentiment function loads the nltk NaiveBayesClassifier trained above and predicts whether the text is positive, negative, or neutral.
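A possible shape for this prediction step is sketched below. A classifier trained only on positive and negative reviews never outputs a neutral label by itself, so this sketch assumes that "neutral" is returned when the classifier's confidence falls below a cutoff. The cutoff value, the file name, the whitespace tokenization, and the snake_case function names are all assumptions.

```python
import pickle


def load_classifier(path="sentiment_classifier.pickle"):
    """Load the pickled classifier (nltk must be importable to unpickle it)."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


def predict_sentiment(text, classifier, min_confidence=0.6):
    """Return 'positive', 'negative', or 'neutral' for the given text."""
    # Same bag-of-words shape assumed at training time.
    features = {word: True for word in text.lower().split()}
    if not features:
        return "neutral"  # nothing to classify
    dist = classifier.prob_classify(features)
    label = dist.max()
    if dist.prob(label) < min_confidence:
        return "neutral"  # assumed rule: low confidence means neutral
    return "positive" if label == "pos" else "negative"
```

Passing the classifier in as an argument means it is loaded once at server start instead of on every request.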

Finally, the route returns the extracted text and the predicted sentiment using jsonify.

Demonstration of the application

First we supply an image containing a positive sentence.

We pass this image through our front end, click the get sentiment button, and see the output.

The extracted text matches the text in the image, and the predicted sentiment is also correct, since the word "best" conveys positive sentiment.

We can now try some negative sentences, which gives a rough sense of how well the classifier behaves.

Here too the extracted text is correct, and the predicted sentiment is also correct, since the word "bad" conveys negative sentiment.

You can get the UI code HERE and the flask server code HERE.

Conclusion

In this tutorial we learned how to extract text from an image using PyTesseract and how to run sentiment analysis on it with the nltk package in Python. We also covered the basics of a Flask server and how to define routes that a front-end client can call.