Building a Chrome Extension which analyzes your emails live!
This summer I was fortunate to get the chance to intern at Kubric (kubric.io) for 10 weeks. It was an amazing learning experience, and I enjoyed working with the very helpful team!
My project (called ‘BrainMail’) was to create a Google Chrome extension which analyzes emails in Gmail as they are being typed and returns an analysis of various features (politeness, tone, etc.) using Natural Language Processing.
‘BrainMail’ would not have been possible without the guidance and support I received from the Kubric team at every step. A big thank you to them for making this project possible!

Here is the link to the finished product: https://chrome.google.com/webstore/detail/brainmail/fdgfojifagignflnohmodckpllcopehb
Below is the documentation for my project:
Problem Statement:
Create a Google Chrome extension which analyzes emails in Gmail as they are being typed, and returns an analysis of various features using Natural Language Processing.
Tools Used:
1. Gmail.js library
2. GmailJS Node Boilerplate
3. Flask (back-end web server)
4. scikit-learn SGDClassifier
5. NLTK SentimentIntensityAnalyzer
6. Chart.js
Solution Architecture:
Chrome Extension
The Chrome extension was initially created using the GmailJS Node Boilerplate. We then extended it to read emails as they are being typed in the Gmail compose body, using the Gmail.js library. This is done by accessing the compose reference of the email inside an on-compose event handler. The extension makes two AJAX calls:
1. Inside the on-compose event handler, we set a window interval to get access to the compose body of the email. Every 2 seconds, the compose body data is pulled and sent via a POST request to the Flask service, which performs the analysis and returns a JSON object of scores as a response. We then update the UI every two seconds based on the returned JSON object.
2. A button has been added to the compose window. On clicking it, a GET request is made to the Flask service, which returns our popup.html template. The template is displayed in a sidebar on the right side of the Gmail compose window.
Flask Service
We used Flask for our back end. Two server endpoints are used (a minimal sketch of both follows this list).
1. One endpoint handles the POST request from the Chrome extension: it receives the body of the email, strips out all the HTML tags, runs the NLP and other analysis, and returns the scores as a JSON object.
2. The other endpoint handles the GET request and returns the popup.html file from the /templates/ folder, which serves as the template for the UI of our project.
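Below is a minimal sketch of what these two endpoints might look like. The route names (‘/analyze’ and ‘/popup’), the use of BeautifulSoup to strip HTML tags, and the placeholder analyze_text() helper are illustrative assumptions; only the overall shape (a POST endpoint returning JSON scores and a GET endpoint serving popup.html from /templates/) comes from the description above.

```python
# Minimal sketch of the Flask service. Route names, BeautifulSoup usage,
# and analyze_text() are assumptions for illustration.
from flask import Flask, jsonify, render_template, request
from bs4 import BeautifulSoup

app = Flask(__name__)

def analyze_text(text):
    # Placeholder: in the real service this runs the models described below
    # and returns politeness, tone, positivity, and complexity scores.
    return {"politeness": 0, "tone": {}, "positivity": 0, "complexity": 0}

@app.route("/analyze", methods=["POST"])
def analyze():
    # Strip the HTML tags out of the compose body sent by the extension.
    html_body = request.get_data(as_text=True)
    text = BeautifulSoup(html_body, "html.parser").get_text(separator=" ")
    return jsonify(analyze_text(text))

@app.route("/popup")
def popup():
    # Returns /templates/popup.html, which serves as the sidebar UI template.
    return render_template("popup.html")

if __name__ == "__main__":
    app.run(port=5000)
```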
UI
The analysis of the emails appears in a sidebar on the right side of the compose window.
For the UI, we have decided to have the following features:
1. Overall Score
2. Complexity
3. Length
4. Politeness
5. Tone
In addition, we have a suggestions button which, when clicked, highlights rude sentences, complex words, and long sentences. We use Chart.js to display doughnut charts for politeness and tone, and the Google font ‘Roboto’ throughout the UI. The overall score is a composite of 4 factors: complexity, politeness, tone, and positivity (positivity is not displayed on its own). All 5 categories are updated each time a new JSON object is retrieved.
Natural Language Processing
Natural Language Processing is used to calculate the scores for politeness, tone, and positivity.
Politeness
Corpus: the Stanford politeness corpus, which contains comments from Stack Exchange and Wikipedia labeled with politeness scores from 5 annotators on a scale of 1–25. Through trial and error, I found that the neutral sentences in the corpus lowered the accuracy of the model, so I filtered them out and added some rude/polite sentences manually.
1. We used spaCy’s predefined ‘textcat’ CNN model and trained it on our data with three classes (Polite, Neutral, Rude); we also tried labeling the scores into two classes (Polite and Rude). However, after many attempts at training with different ranges of scores attributed to the labels, we did not get satisfactory accuracy.
2. We then used a TF-IDF vectorizer with 1–N grams as features, labeling the bottom quartile of scores as Rude and the top quartile as Polite. The model chosen was SGDClassifier, which works well with sparse data; TF-IDF vectors are generally sparse with a large number of dimensions. After changing the loss function to ‘modified_huber’, we got good results when integrated with the Chrome extension. We decided not to use the Stack Exchange data due to its excessive noise. We also trained only on the top 25% and bottom 25% of politeness scores: the middle 50% was not reliable training data, as there were many disagreements between the 5 ratings due to human biases. After doing this we reverted to a two-class model and eliminated “Neutral” (a sketch of this pipeline is shown below).
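A rough sketch of this pipeline is shown here, assuming the corpus has already been loaded into parallel lists texts and scores (one mean annotator score per comment); the exact n-gram range and the variable names are illustrative rather than the project’s actual values.

```python
# Sketch: quartile labeling + TF-IDF + SGDClassifier (modified_huber loss).
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline

def build_politeness_model(texts, scores):
    scores = np.asarray(scores, dtype=float)
    lo, hi = np.percentile(scores, [25, 75])
    # Keep only the bottom quartile ("Rude") and the top quartile ("Polite");
    # the noisy middle 50% of scores is dropped.
    mask = (scores <= lo) | (scores >= hi)
    X = [t for t, keep in zip(texts, mask) if keep]
    y = np.where(scores[mask] >= hi, "Polite", "Rude")
    model = make_pipeline(
        TfidfVectorizer(ngram_range=(1, 3)),   # 1-N grams as features
        SGDClassifier(loss="modified_huber"),  # enables predict_proba
    )
    model.fit(X, y)
    return model
```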
Tone
Corpus: We used https://github.com/huseinzol05/NLP-Dataset, a GitHub repository of NLP datasets, which had a dataset for 6 common emotions: joy, love, surprise, fear, sadness, and anger. It contains around 400,000 sentences labeled with one of these emotions. Due to memory constraints, and the lack of accuracy improvement beyond a point, we used 15,000 labeled sentences per emotion, for a total of 90,000 labeled sentences.
1. We tried spaCy’s textcat model, specifying different relationships between the classes (mutually exclusive or not), and trained it on the above corpus. The number of misclassifications was rather large and, due to the activation function at the last layer, the misclassified scores were boosted much higher than the others. Results were not satisfactory.
2. We then converted the large dataset into vectors with a TF-IDF vectorizer and again used SGDClassifier, which gave the best results among all the linear classifiers we tried. spaCy’s built-in vectors were also tried as features to reduce dimensionality, but they did not give results as good as the TF-IDF vectorizer (a sketch of the tone model follows below).
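The tone model follows the same pattern as the politeness one. The sketch below assumes the corpus is available as parallel lists sentences and emotions (labels such as ‘joy’, ‘love’, ‘surprise’, ‘fear’, ‘sadness’, ‘anger’); with the ‘modified_huber’ loss, predict_proba returns one normalized probability per emotion, which is what the overall-score calculation later relies on.

```python
# Sketch of the six-emotion tone classifier; hyperparameters are illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline

def build_tone_model(sentences, emotions):
    model = make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2)),
        SGDClassifier(loss="modified_huber"),
    )
    model.fit(sentences, emotions)
    return model

def tone_scores(model, text):
    # One probability per emotion; together they sum to 1.
    probs = model.predict_proba([text])[0]
    return dict(zip(model.classes_, probs))
```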
Positivity
1. First, we used NLTK’s labeled movie reviews corpus as the dataset. We classified the data into ‘positive’ and ‘negative’ using a Naive Bayes classifier. To extract features from the training data, we used the bag-of-words technique with a count vectorizer (whether or not each word is present). We achieved an accuracy of around 70% on the validation data.
2. We then used a Twitter dataset of positive and negative tweets. We extracted features with a TF-IDF vectorizer over 1–2 grams and used logistic regression to classify the data, achieving around 80% accuracy on validation. However, we were not getting satisfactory results when we tested the model after integrating it with the Chrome extension.
3. We then decided to use NLTK’s SentimentIntensityAnalyzer class, which uses a lexicon of positive and negative emotion words with different weights and calculates positivity and negativity scores based on the occurrences of these words. This classifier gave good results with the Chrome extension and accurately classified the data into positive, negative, or neutral (a short sketch follows below).
I ended up not displaying this score but still using it in the overall score computation.
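A small sketch of this final approach is shown below; the 0–25 rescaling of the ‘compound’ field anticipates the overall-score section further down.

```python
# Sketch: lexicon-based positivity via NLTK's SentimentIntensityAnalyzer (VADER).
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)
analyzer = SentimentIntensityAnalyzer()

def positivity_score(text):
    # The 'compound' score lies in [-1, 1]; rescale it to [0, 25].
    compound = analyzer.polarity_scores(text)["compound"]
    return (compound + 1) / 2 * 25
```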
Complexity
Complexity was implemented using the Flesch readability metric; this does not involve NLP. The Flesch readability metric calculates the complexity of a given text using a formula that takes into account word count, syllable count, and sentence count. We implemented these three counting functions and, using the Flesch formula, computed a readability score from 0 to 100, with 100 being the simplest. We then subtract this score from 100 to obtain a complexity score from 0 to 100, with 100 being the most complex.
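A sketch of this calculation follows; the naive counting helpers below merely stand in for the three functions mentioned above, while the constants are those of the standard Flesch reading-ease formula.

```python
# Sketch of the complexity score via the Flesch reading-ease formula.
import re

def count_words(text):
    return len(re.findall(r"[A-Za-z']+", text))

def count_sentences(text):
    return max(1, len(re.findall(r"[.!?]+", text)))

def count_syllables(text):
    # Very rough heuristic: count groups of vowels in each word.
    return sum(max(1, len(re.findall(r"[aeiouy]+", w.lower())))
               for w in re.findall(r"[A-Za-z']+", text))

def complexity_score(text):
    words, sentences, syllables = count_words(text), count_sentences(text), count_syllables(text)
    if words == 0:
        return 0.0
    # Flesch reading ease: higher means simpler text.
    reading_ease = 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)
    # Invert and clamp so that 100 is the most complex.
    return max(0.0, min(100.0, 100 - reading_ease))
```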
Subjectivity
I had a subjectivity score in an earlier iteration of the extension but later removed it to simplify the results displayed. The dataset used for subjectivity was the NLTK subjectivity corpus.
Overall Score
The overall score is a composite of the following 4 scores: positivity, politeness, tone, and complexity. The positivity score returned by NLTK’s SentimentIntensityAnalyzer (the ‘compound’ field) lies between -1 and 1, and we scale it to a score between 0 and 25. For politeness, we already have a score between 0 and 100 and scale it to 0–25. For tone, the six tone scores sum to 1; we add up the three positive tones (joy, love, and surprise) and multiply by 25. Finally, for complexity, any complexity score between 0 and 50 earns the full 25 points for the complexity portion of the overall score, since up to a certain point complexity in an email does not hurt the message; a complexity of 50 roughly corresponds to language understood by 12th-grade students. For complexity scores between 50 and 100, we linearly scale the contribution between 25 and 0, so a complexity of 50 contributes 25 and a complexity of 100 contributes 0. We then add up the four parts to obtain the composite overall score (a worked sketch follows).
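Putting the pieces together, a worked sketch of the composite might look like this; the inputs are assumed to come from the models described in the earlier sections.

```python
# Sketch of the composite overall score (0-100), built from four 0-25 parts.
def overall_score(compound, politeness, tones, complexity):
    positivity_part = (compound + 1) / 2 * 25           # [-1, 1]  -> [0, 25]
    politeness_part = politeness / 100 * 25              # [0, 100] -> [0, 25]
    # The six tone probabilities sum to 1; reward the three positive ones.
    tone_part = (tones["joy"] + tones["love"] + tones["surprise"]) * 25
    if complexity <= 50:
        complexity_part = 25                              # simple enough: full marks
    else:
        complexity_part = (100 - complexity) / 50 * 25    # 50 -> 25, 100 -> 0
    return positivity_part + politeness_part + tone_part + complexity_part
```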
Detailed Analysis
We give suggestions for three categories: rudeness, complexity, and length. First, we iterate through each sentence and check whether it is too long, i.e. longer than 25 words, and highlight it in blue. We then check the rudeness of each sentence; if the rudeness score is greater than 0.6 (out of 1), the sentence is highlighted in red as rude. Finally, we go through each word and check whether it is a complex word, which we define as a word with 3 or more syllables; these words are highlighted in yellow. Because of this order, the highlighting of a complex word remains visible even if its sentence is rude or too long, and rudeness takes priority over length. If changes are made to the email and the suggestions button is clicked again, the suggestions update to reflect the edited email (a sketch of these rules follows).
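A sketch of these rules on the analysis side is shown below; the regex sentence splitter and the rough syllable counter are stand-ins, the politeness model is the two-class pipeline from the Politeness section, and the actual highlighting is applied by the extension in the compose window.

```python
# Sketch of the suggestion rules: rude (red) > too long (blue) per sentence,
# plus complex words (yellow) highlighted on top.
import re

def count_syllables(word):
    # Rough heuristic: count groups of vowels.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def suggestions(text, politeness_model):
    results = []
    rude_idx = list(politeness_model.classes_).index("Rude")
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        if not sentence:
            continue
        rudeness = politeness_model.predict_proba([sentence])[0][rude_idx]
        if rudeness > 0.6:                    # rudeness takes priority over length
            results.append(("rude", sentence, "red"))
        elif len(sentence.split()) > 25:      # longer than 25 words
            results.append(("too long", sentence, "blue"))
    for word in re.findall(r"[A-Za-z']+", text):
        if count_syllables(word) >= 3:        # complex word: 3 or more syllables
            results.append(("complex word", word, "yellow"))
    return results
```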
Future Improvements
As of now, the overall score is calculated using manually chosen heuristics. The metric could be improved by tracking whether or not a user has replied to an email. This could be done in a few different ways. First, an invisible 1 x 1 pixel image could be placed at the bottom of the email; this would let us track whether the email has been opened, similar to Chrome extensions such as MailTrack. Another option is to redirect links the user has put in their email, which would let us detect whether a link in the email has been clicked. Using either of these approaches, we would have labeled data: the scores of the various features along with whether the email converted (true or false). We could then refine our overall-score calculator by training on this conversion data, determining which of the 4 features used to calculate the overall score matter most and modifying the overall score formula accordingly.
Links:
http://flask.pocoo.org/docs/1.0/installation/
https://www.chartjs.org/docs/latest/
http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html
https://github.com/josteink/gmailjs-node-boilerplate
https://github.com/KartikTalwar/gmail.js/tree/master
https://en.wikipedia.org/wiki/Flesch%E2%80%93Kincaid_readability_tests
https://stackoverflow.com/questions/9154027/java-writing-a-syllable-counter-based-on-specifications
https://www.computerhope.com/htmcolor.htm

