Building A Text Analysis Web App With Medium Claps Prediction

Zhijing Eu
Published in Analytics Vidhya · 18 min read · Sep 26, 2020

Data, data, everywhere but not a drop of information in sight!

(…with apologies to Coleridge)

Unlike numerical data, which you can often boil down into a few summary statistics to distill some insight, text data is a bit harder to process quickly.

In a bid to see if Data Science / ML techniques could help me improve my writing, I built a simple Python Flask based web-app that quickly analyzes web pages for some fast insights based on high-level text analysis metrics.

The app returns statistical measures like word counts, sentence counts, etc., but also NLP features like Sentiment Analysis scores and Readability Scores. Using pre-trained models some others have built, I also incorporated code that predicts the author’s personality type. The app also has a “clap predictor” functionality that estimates the number of expected claps via two approaches — a simple linear regression prediction model (warning: the accuracy is pretty poor) and a classifier that uses document embedding vectorization (Doc2Vec).

I’ve posted the code online, so feel free to leave me some feedback or clone my repo and improve it (e.g. build your own custom datasets and retrain the models, etc.).

App Demo Website: http://34.126.106.75:5000

Update Sep 2022: It’s been ~2 yrs since I wrote this and I’ve decided to turn off the App Demo Website as the low traffic doesn’t justify the monthly cost of hosting it on Google Cloud — if I get enough comments below, I might reconsider ;)

Web App Located At : http://34.126.113.131:5000

Why Did You Make This ?

I’ve been writing on Data Science related topics on Medium.com for a couple of months now and struggled with how best to improve my articles and boost my clap counts/views. So I began to wonder if I could use ML or Data Science techniques to help me write better.

There are already AI-powered writing assistants like Grammarly out there. However, I thought I’d start with a more modest ambition of making a simple web app that could process text and return some key metrics.

This article is a walk-through of the steps I took, along with some of my main takeaways. I’ve put in some code snippets below, but the full script is available in this repo:

Outline

1.Extracting & Cleaning The Raw Data

2.Processing HTML/Text Data To Extract Key Metrics

3.Building Prediction Models

3.1.Personality Analysis Prediction — Myers Briggs Type Indicator

3.2.Personality Analysis Prediction — Big5 Traits

3.3.Clap Count Estimator

3.3.1 Exploratory Data Analysis

3.3.2 “Simple” Linear Regression Estimator

3.3.3 Document Embedding Based Classifier

4.Converting The Code Into A Simple Flask App

5.Conclusion

1.Extracting & Cleaning The Data

Source : Alamy

This was fairly straightforward, as I used Python’s urllib and BeautifulSoup libraries for most of the web-scraping:

import random
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen

class ArticleExtract():

    def __init__(self, url, text_input=False):
        self.url = url
        self.text_input = text_input  # To allow for raw text input too

    # get_html and cleaned_text are set as properties as they get re-used by other functions
    @property
    def get_html(self):
        user_agent_list = [
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.1.1 Safari/605.1.15',
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0',
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36',
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:77.0) Gecko/20100101 Firefox/77.0',
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36',
        ]
        if self.text_input == False:
            # Pick a random user agent and set the request headers
            user_agent = random.choice(user_agent_list)
            headers = {'User-Agent': user_agent}
            req = Request(self.url, headers=headers)
            self._get_html = urlopen(req).read()
        if self.text_input == True:
            # Treat the "url" argument as raw text input
            self._get_html = self.url
        return self._get_html
    ...

I started off by making a class that takes text or a URL as input and has a whole bunch of methods to process the data and store the outputs. (However, this class object eventually got bloated, so I had to refactor it into a set of functions in the Flask app — more on that later.)

The raw HTML has all sorts of HTML tags in it. There are a number of ways to remove these, but I used Beautiful Soup, which has a useful get_text function and a parser that identifies and strips out the tags.

Image Credit : John Tenniel — The Scene In Alice In Wonderland Where The Mock Turtle Tells Alice About His Beautiful Soup….
class ArticleExtract():
    ...
    @property
    def cleaned_text(self):
        # Use " " as the get_text separator so words separated only by tags don't get glued together
        cleaned_text = BeautifulSoup(self.get_html, "html.parser").get_text(" ").replace("\r", " ").replace("\t", " ").replace("\n", " ").replace(u'\xa0', u' ')
        return cleaned_text
    ...

2.Processing The Data

The choice of which text analysis metrics to generate was guided by my attempt to replicate the functionality of these two commercial sites:

https://readable.com

https://seoscout.com/tools/keyword-analyzer

However I also included some web page metadata like published dates, image and embedded video counts, etc.

The heavy lifting was done by the NLTK library, which was the main tool I used for the basic tokenisation step that feeds many of the subsequent functions.

Tokenisation is the splitting up of the text into smaller chunks (like sentences or words) that feed many of the other text metrics.

class ArticleExtract():
    ...
    @property
    def tokens_alpha(self):
        raw = BeautifulSoup(self.get_html, 'html.parser').get_text(strip=True)
        words = nltk.word_tokenize(raw)
        # Keep only alphabetic tokens (or use word.isalnum() to keep alphanumerics too)
        self._tokens_alpha = [word for word in words if word.isalpha()]
        return self._tokens_alpha
    ...

An honorable mention also goes to the TextBlob library which I used for Sentiment Analysis.

Sentiment Analysis uses keyword analysis to determine polarity (how negative or positive a statement is) and subjectivity (opinion-like statements, e.g. “I think…”, “You should…”, as opposed to neutral fact-based statements). The polarity score ranges from -100% for very negative statements to +100% for very positive statements. The subjectivity score ranges from 0% (very objective) to 100% (very subjective).

class ArticleExtract():
    ...
    def sentiment(self):
        blob = TextBlob(self.cleaned_text)
        split_text = blob.sentences

        df = pd.DataFrame((''.join(split_text[i]) for i in range(len(split_text))),
                          columns=['Sentences'])

        df[["TextBlob_Polarity", "TextBlob_Subjectivity"]] = pd.DataFrame(
            (split_text[i].sentiment for i in range(len(split_text))))

        # Remove all short sentences
        df = df[df['Sentences'].map(len) > 15]

        # Avoid counting any sentences with Polarity 0 or Subjectivity 0
        TextBlob_Overall_Polarity = df[df["TextBlob_Polarity"] != 0]['TextBlob_Polarity'].median()
        TextBlob_Overall_Subjectivity = df[df["TextBlob_Subjectivity"] != 0]['TextBlob_Subjectivity'].median()

        return TextBlob_Overall_Polarity, TextBlob_Overall_Subjectivity
    ...

For Readability analysis, there is a dizzying array of different measures that revolve around either comparing the text to existing lists of “hard” words or counting syllable/sentence/word lengths.

For simplicity, I eventually settled on just a single metric called the Flesch Reading Ease Score, which I estimated using a library called Textstat.

It is calculated using a formula that considers Average Sentence Length and Average Number Of Syllables Per Word, where the higher the score, the easier the text is to read (e.g. > 90 = Very Easy, < 30 = Very Confusing).

class ArticleExtract():
    ...
    def FS_ReadingEaseScore(self):
        FS_GradeScore = textstat.flesch_reading_ease(self.cleaned_text)
        return FS_GradeScore
    ...
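For reference, the underlying Flesch Reading Ease formula is simple enough to compute by hand if you already have the word, sentence and syllable counts — a quick sketch of my own below (Textstat handles the syllable counting for you):

def flesch_reading_ease(total_words, total_sentences, total_syllables):
    # Flesch Reading Ease = 206.835 - 1.015*(words per sentence) - 84.6*(syllables per word)
    avg_sentence_length = total_words / total_sentences
    avg_syllables_per_word = total_syllables / total_words
    return 206.835 - 1.015 * avg_sentence_length - 84.6 * avg_syllables_per_word

# e.g. a 500-word text with 25 sentences and 700 syllables
print(round(flesch_reading_ease(500, 25, 700), 1))  # 68.1 - i.e. "plain English" territory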

One of the problems I was having in these first two steps was the accuracy of the figures I was getting. I eventually made a function to view the sentence-by-sentence level results, where I discovered that my raw data was not properly “laundered”.

Lesson 1 : Do not underestimate the importance of properly cleaning up HTML text. For example, I initially used BeautifulSoup.get_text() to strip the HTML tags instead of .get_text(“ ”). As a result, I was getting nonsense like this:-

BeautifulSoup('<span>this is a</span>cat').text
Output : u'this is acat'

This meant that I was unintentionally joining words together, and everything downstream that relied on the “cleaned up text” was WRONG — e.g. it drove up the sentence lengths, made the readability scores worse and confused the sentiment analysis. Other problem areas are sentences with characters like “—” or “-”, or tables/lists which, depending on how they were set in HTML, may end up as gibberish and balloon the sentence counts. Passing a separator to get_text avoids the word-joining problem:
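BeautifulSoup('<span>this is a</span>cat').get_text(" ")
Output : 'this is a cat'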

3.Building Prediction Models

Personality Analysis

Source : 123RF Stock Photos

Although strictly speaking personality profiling isn’t part of text analysis per se, I thought it would be a fun addition to the app. I initially wanted to develop my own model, but this proved to be a bit more work than I expected, so instead I adapted work that a few authors had posted online.

3.1 Myers Briggs Type Indicator

The Myers–Briggs Type Indicator (MBTI) tries to explain how people perceive the world and make decisions and assigns people to four categories: Introversion or Extraversion, Sensing or iNtuition, Thinking or Feeling, Judging or Perceiving. One letter from each category is taken to produce a four-letter test result, like “INFJ” or “ENFP”.

I reverse engineered the code in the article below and loaded their pre-trained model (a Logistic Regression model) into my app.
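A rough sketch of how this works in the app (the file names below are placeholders — the actual pickled vectorizer and model come from the linked project):

import pickle

# Placeholder file names - the real artefacts are the linked project's pre-trained vectorizer/model
with open("mbti_vectorizer.pkl", "rb") as f:
    vectorizer = pickle.load(f)
with open("mbti_model.pkl", "rb") as f:
    mbti_model = pickle.load(f)

def predict_mbti(cleaned_text):
    # Vectorize the article text the same way the model was trained, then predict the 4-letter type
    features = vectorizer.transform([cleaned_text])
    return mbti_model.predict(features)[0]

print(predict_mbti("I built a simple Flask app to analyze my Medium articles..."))  # e.g. 'INTJ'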

The article itself is worth a read, as their full project tested the model with web-scraped LinkedIn profiles from a few major consultancy firms to check if there were patterns in who worked there.

3.2 Big 5 Personality Analysis

Similarly, another popular Personality profiling approach is the Big 5 Traits model (also sometimes known as the O.C.E.A.N model) which is a grouping of five main personality traits:

  • openness to experience (inventive/curious vs. consistent/cautious)
  • conscientiousness (efficient/organized vs. extravagant/careless)
  • extraversion (outgoing/energetic vs. solitary/reserved)
  • agreeableness (friendly/compassionate vs. challenging/callous)
  • neuroticism (sensitive/nervous vs. resilient/confident)

I used a pre-trained model (a combination of a Random Forest Regressor and a Random Forest Classifier) from this project repository:

In my app I only used their pre-trained models for prediction. However the scope of the full project was impressive, as it integrates Python, Django and a Node.js app that scrapes your FB data, stores it on a database instance (it can be your own private one) and displays all the results on a custom web-app that can compare all your friends’ personality profiles against your own.

So for this section my key take-away was:-

Lesson 2 : Where Appropriate, Leverage The Work Of Others (assuming it’s public and you give due credit to the original authors ;-) I’ve covered this theme in a previous article but it bears repeating — if you can define what you need and are conscious of the potential trade-offs between price, performance, level of support and documentation, there is likely already a service or open source project out there with a solution at hand.

3.3. Clap Predictions

For clap prediction, I could have taken a similar “copy & adapt” approach, as there are quite a number of good articles on this topic [1], [2], [3] and [4]. However the challenge was intriguing enough that I wanted to try it out for myself.

Source : Dreamstime Stock Photos

3.3.1 Exploratory Data Analysis

To follow along use the ExploratoryDataAnalysis.ipynb file in the repo.

I used the script built in step 2 to harvest about 200 articles from the front pages of Towards Data Science and Analytics Vidhya. The article URLs and clap counts were harvested using Parsehub (a nifty tool I covered in an earlier article). (I know I could have done this natively in Python, but after a few hours of messing about with Beautiful Soup and then the Selenium library I gave up — pagination is hard, ‘kay?)

I know 200 is tiny for a dataset, but I wanted to start with a small ‘curated’ dataset so I could get a better feel for how the models would behave. Unfortunately, I found out later that most of the front page featured articles are recently published (and as a result the clap counts are usually only in the triple digits because the articles haven’t had time to “mature”), so I actually ended up with a somewhat unbalanced dataset. I also snuck in a few “hand-picked” late 2018/2019 articles to get a bit of variety in age and clap count. (Note however that this over-representation of newer articles will come back and bite me in the @ss, as you will see in a bit.)

Most of the articles were in the low clap count category with only a few (mostly the older ones) with high clap count.

More than 3/4 of the 200 Articles had less than 5,000 claps…

Given this “power-law” type behaviour in the number of claps, I created a Log Clap Count metric, as I thought it would be a better measure. I then ran a simple Pearson’s correlation across the various metrics to see if there was a pattern or any strong correlations with the Log Clap Count (the check itself is a one-liner — see the snippet after the list below). The results were inconclusive, as the only metrics that showed strong correlation were :-

  • Age (From Published Date) — This does make sense, as older articles tend to have more views, but it is an incomplete picture: obviously there must be other factors at play, otherwise all old articles would have high claps
Correlation : 0.534
  • Sentence Count and Word Count — This was counter-intuitive, as it seems to indicate that longer articles with more sentences get more claps?

I’ll spare you the other charts, but what was surprising to me was that metrics like Polarity and Subjectivity, the Readability Scores and most of the Personality scores all had correlations of less than 0.20 in absolute terms.
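The correlation scan itself is nothing fancy — roughly the snippet below, reusing the Dataset.xlsx file and claps column from the repo (the other column names may differ slightly from mine):

import numpy as np
import pandas as pd

df = pd.read_excel('Dataset.xlsx')
df['log_claps'] = np.log(df['claps'])

# Pearson correlation of every numeric metric against Log Clap Count, strongest first
correlations = df.corr()['log_claps'].drop('log_claps')
print(correlations.reindex(correlations.abs().sort_values(ascending=False).index))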

To help me view the data differently and better visualize some of the outliers that may be skewing the correlations, I also made side-by-side box-plots for Hi-Med-Lo clap counts (chosen somewhat arbitrarily as H : > 5k claps (37 articles), M : 0.5k–5k claps (58 articles), L : < 0.5k claps (105 articles)) against the total 200 articles.
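Something along these lines, reusing the DataFrame from the snippet above (the binning matches the H/M/L thresholds described, and FS_GradeScore is just one example metric):

import seaborn as sns
import matplotlib.pyplot as plt

# Bin the articles into the Lo/Med/Hi clap classes described above
df['clap_class'] = pd.cut(df['claps'], bins=[0, 500, 5000, df['claps'].max()],
                          labels=['L', 'M', 'H'])

# Side-by-side box-plots of a metric per clap class
sns.boxplot(x='clap_class', y='FS_GradeScore', data=df)
plt.show()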

Using this box-plot view, there does seem to be a weak relationship where Hi Clap articles tend to have a slightly higher FS Grade Score. This makes sense, as a lower score means the text is harder to read. (Either that, or my data or data processing was bad.)

While there was no discernible pattern between the Hi/Mid/Lo Clap articles by predicted MBTI personality type, it was interesting to see that the logistic regression model used for the MBTI classifier seems to think most of the 200 articles were written by #-#-T-J type authors.

INTJ-s make up ~40% of all articles

There is more detail in the IPYNB notebook on the Github repo if you are interested but nothing of note.

At this point, if I were doing this seriously, I would probably have stopped and gone back to re-examine the accuracy of the data processing steps and/or expand the size of the dataset. However, since the “clap prediction” was ultimately only an add-on feature to my main goal of building a simple text analysis app, I just went ahead with what I had.

3.3.2 Clap Prediction Via A Linear Regression Model

I used a very plain vanilla approach that in hindsight probably does not capture the actual relationships very well: I ran a number of linear regression approaches from the SciKit-learn toolkit against a hand-crafted set of features, where I more or less just dropped the entire set of MBTI and Big5 personality metrics (as most of these had low correlations anyway).

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Lasso, LassoCV
from sklearn.metrics import mean_squared_error

df = pd.read_excel('Dataset.xlsx')
df['log_claps'] = np.log(df.claps)

# Regression variables were "hand-picked" to exclude non-numerical columns
# and both the MBTI and Big5 OCEAN characteristics
column_for_regression = ["Age_19Sep20", "sentence_count", "title_word_count",
                         "average_word_count_per_sentence", "text_word_count",
                         "vocab_count_excl_commonwords", "imgs_per_1000words",
                         "FS_GradeScore", "vids_per_1000words", "polarity", "subjectivity"]

X = df.loc[:, df.columns.intersection(column_for_regression)]
y = df['log_claps']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Use cross-validation to pick the Lasso regularisation strength
lasso_alphas = 10**np.linspace(-5, 0, 1000)
lasso_cv = LassoCV(alphas=lasso_alphas, cv=5)
lasso_cv.fit(X_train, y_train)

lasso = Lasso(alpha=lasso_cv.alpha_)
lasso.fit(X_train, y_train)
predicted_claps_lasso = lasso.predict(X_test)
lasso_mse = mean_squared_error(y_test, predicted_claps_lasso)
r_sq = lasso.score(X_test, y_test)

print("Lasso Regression")
print('coefficient of determination:', round(r_sq, 3))
print('intercept:', lasso.intercept_)
print('slope:', lasso.coef_)
print('Mean Sq Error in Log Claps', lasso_mse)
print('Mean Sq Error in Claps', np.exp(lasso_mse))

Lasso Regression seemed to do better than the Ridge Regression and the basic un-regularised Linear Regression.

So how well does it perform ? As you may expect — Not so great… :(

The straight-line is if Predicted = Actual (Note the scales are a bit misleading) so the further the dot is from the straight line, the worse the error

The errors are “blown up” because the regression is against Log Claps. The model seems to work okay for articles that actually had fewer than 2,000 claps, but anything beyond that seems to fall over.

Therefore my advice would be to apply this clap prediction only to new-ish articles (6–9 months from publish date) — which coincidentally matches up with the majority of the training dataset anyway. (What were the odds?! *sarcasm*)

3.3.3 Clap Prediction Via A “Classification” Model

In this other approach, I ignored ALL the text metrics and focused only on the content of the (cleaned up) text. To follow along use the Training_A_Doc2Vec_Model.ipynb file in the repo.

There is a technique often used in Natural Language Processing called Word Embedding, where you “vectorize” a word. My explanation is probably an oversimplification, but it works by getting an algorithm to guess a word given the surrounding words (i.e. context) and encoding that info into a vector of multiple dimensions.

In doing so, you can numerically represent any word analyzed within the same corpus as an equal-length vector and then find the “distance” between that word and any others. Therefore you can do some pretty cool things like take the vector for King, subtract Man and add Woman, and come up with a vector that closely matches the vector for Queen.

There is an equivalent process for DOCUMENT level embedding called Doc2Vec that extends this idea to translate entire documents into vectors.

So my general approach here was to train a Doc2Vec model using the text from the 200 articles and then find the “AVERAGE” vector for each band of articles from High to Low clap count. The implicit assumption is that there is something in the semantic content that can differentiate hi — med — lo clap articles.

Using the average representative vectors for these classes, I can then compare every single article against these categories and “predict” which category it belongs to via a distance measure (i.e. how similar the article is to the average class vector).

The approach was as per below, using the gensim library:

import pandas as pd
import numpy as np
from scipy import spatial
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.tokenize import word_tokenize

data_source = pd.read_excel('Dataset.xlsx')
data_source.drop(data_source.columns.difference(['ID', 'title', 'popularity_level', 'raw_text']),
                 axis=1, inplace=True)
data = data_source["raw_text"]

# Each article becomes a TaggedDocument of lower-cased word tokens, tagged with its row index
tagged_data = [TaggedDocument(words=word_tokenize(_d.lower()), tags=[str(i)])
               for i, _d in enumerate(data)]

# Code adapted from https://medium.com/@mishra.thedeepak/doc2vec-simple-implementation-example-df2afbbfbad5
max_epochs = 100
vec_size = 300
alpha = 0.025

Doc2VecModel = Doc2Vec(vector_size=vec_size,
                       alpha=alpha,
                       min_alpha=0.00025,
                       min_count=1,
                       dm=1)

Doc2VecModel.build_vocab(tagged_data)

for epoch in range(max_epochs):
    print('iteration {0}'.format(epoch))
    Doc2VecModel.train(tagged_data,
                       total_examples=Doc2VecModel.corpus_count,
                       epochs=Doc2VecModel.epochs)
    # decrease the learning rate
    Doc2VecModel.alpha -= 0.0002
    # fix the learning rate, no decay
    Doc2VecModel.min_alpha = Doc2VecModel.alpha

Doc2VecModel.save("Doc2Vec.model")

Using the similarity function within Doc2Vec that finds the most similar articles within the model, it produced a pretty accurate match — I am not sure if there is a distinctive writing style that it picked up, but when I fed it one of my articles, it picked out ALL the ones written by me out of the other 199.

Unlike the H-M-L split in the previous linear regression approach, I made a finer class split from VH to VL: VH, >10,000 claps (21 articles); H, 5,000–10,000 claps (15 articles); M, 1,000–5,000 claps (29 articles); L, 100–1,000 claps (82 articles); VL, <100 claps (53 articles).
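The “average class vector” comparison described earlier then boils down to something like this (a sketch, assuming the popularity_level column holds the VH–VL labels and reusing the trained model from above):

labels = data_source["popularity_level"]

# One vector per article from the trained model (use Doc2VecModel.dv in gensim 4.x)
doc_vectors = np.array([Doc2VecModel.docvecs[str(i)] for i in range(len(tagged_data))])

# Average ("representative") vector for each clap-count class: VH, H, M, L, VL
class_vectors = {cls: doc_vectors[(labels == cls).values].mean(axis=0)
                 for cls in labels.unique()}

def predict_class(article_vector):
    # Pick the class whose average vector is closest by cosine similarity
    similarities = {cls: 1 - spatial.distance.cosine(article_vector, vec)
                    for cls, vec in class_vectors.items()}
    return max(similarities, key=similarities.get)

predictions = [predict_class(v) for v in doc_vectors]
print("Accuracy:", np.mean([p == a for p, a in zip(predictions, labels)]))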

Since the document vectors are 300 dimensions long, you can’t visualize them as-is, so after reducing the data with TSNE (a dimensionality reduction approach) I plotted all the articles, colour-coded by clap count class.
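The reduction and plot can be done with scikit-learn and matplotlib along these lines (again a sketch, reusing doc_vectors and labels from the previous snippet):

from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
import seaborn as sns

# Squash the 300-dimensional document vectors down to 2 dimensions for plotting
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
doc_vectors_2d = tsne.fit_transform(doc_vectors)

sns.scatterplot(x=doc_vectors_2d[:, 0], y=doc_vectors_2d[:, 1], hue=labels)
plt.title("TSNE Of The 200 Article Document Vectors")
plt.show()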

TSNE Of The 200 Article Document Vectors

Based on the TSNE plot, there does appear to be a pattern where the low clap articles sit further away from the “centre” (ref. the light blue and purple dots). It may be a stretch to say so, but the green and dark blue dots do appear to be closer to the centre. Again, the problem is that the classes are imbalanced (there are more low clap count articles than high clap count articles).

In any event, the proof of the pudding is in the eating — so how well does the classifier work? Surprisingly… pretty well, at about 80% accuracy!

Example of the results — The Prediction is based on which category the article has most similarity to…
160/200 Articles were accurately predicted !

Lesson 3 : The ML techniques (model selection, hyperparameter tuning, etc.) are only one part of a successful model application — ultimately what is more CRITICAL is the quality of the training data. This covers everything from completeness (is it representative of what you are trying to predict/understand — noting that I had class imbalance issues with article ages and clap counts), to consistency/correctness (in my case, I suspect the pre-processing steps may still be buggy), to context — i.e. understanding how the ML model will be applied, which gives some guidance on what an acceptable error level is.

4.Converting The Code Into A Simple Flask App

I initially expected that translating my code into a Flask app would be relatively simple, since I had made a conscious decision not to include the usual web app features (e.g. no view counter, no log-in management, no feedback/comments section). I soon discovered that it was not :(

My fundamental problem was the way my initial code was written. I had created these massive classes that were chock-full of methods and stored a lot of data. I did this for ease of web-scraping and building the reference dataset — i.e. the way the ArticleExtract class was set up, I just fed it a URL and it did all the rest.

This worked because I wasn’t optimizing for speed during the web scraping and was still in the exploratory stage of the data analysis, where, after instantiating a class object with a particular URL, I would selectively use the other methods to call out specific metrics or plot charts or whatever. (I.e. I was not running ALL the methods at once, and my web-scraping (which I left running overnight) was only for high-level page metrics — not the detailed sentence-by-sentence analysis or clap prediction.)

However, a cut-and-paste of all this code into a Flask app made things horrendously slow. Flask works by creating views for each page that respond to the HTTP requests sent by the individual pages — essentially making each view a separate function. Since by default Flask does not allow you to share data across views [1], it also meant that for each page I had to create an entirely new ‘local’ instance of the ArticleExtract object for the same URL and RE-RUN everything each time.

Er…Did You Want That Analysis By Today ? Source : 123RF Stock Photos

Flask does have a session feature that can store data in “cookies”, but this is limited to about 4 kB, and most of the documentation recommends using a database solution to “persist” data across different pages.

I eventually unbundled the ArticleExtract object into separate functions and created a “container” class to store the key data commonly shared across the different page views, to speed up processing time.
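In very rough strokes, the resulting pattern looks something like this (a minimal sketch with a made-up route and simplified metrics, not the app’s actual code):

from flask import Flask, request, render_template_string
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
import textstat

app = Flask(__name__)

# Simple in-memory "container" keyed by URL so repeat page views don't re-run the whole analysis
analysis_cache = {}

def analyse_article(url):
    html = urlopen(Request(url, headers={'User-Agent': 'Mozilla/5.0'})).read()
    text = BeautifulSoup(html, "html.parser").get_text(" ")
    return {"word_count": len(text.split()),
            "readability": textstat.flesch_reading_ease(text)}

@app.route("/analyze", methods=["POST"])
def analyze():
    url = request.form["url"]
    if url not in analysis_cache:  # only compute once per URL across views
        analysis_cache[url] = analyse_article(url)
    metrics = analysis_cache[url]
    return render_template_string(
        "Words: {{ m.word_count }} | Flesch score: {{ m.readability }}", m=metrics)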

Lesson 4: When developing code that needs to be deployed — PLAN AHEAD and consider the performance of the code from a processing time and memory efficiency perspective. To make troubleshooting easier, it is also good practice to “modularize” code into manageable chunks rather than writing a monolithic, tightly coupled block of code (i.e. the more dependencies/references are baked into the code flow, the higher the chance of the whole thing failing when any edits are made).

5.Conclusion

I hope this was useful to anyone else looking to build a similar app or anyone who’s getting started with text analysis tasks in Python.

To summarize, my main takeaways as a Data Science enthusiast and n00b developer were:

  • When performing any text wrangling — do not underestimate the importance of properly laundering the raw HTML data!
  • Where appropriate, leverage the work of others
  • The ML techniques are only one part of a successful model application — ultimately what is CRITICAL is the quality of the training data
  • PLAN AHEAD — consider how much memory and processing time the code will need to run

I am still working on deploying the app online so that may be the subject of a future post but for now, you are welcome to clone my repo and try it out for yourselves.
