Natural Language Processing to Evaluate English Newspapers for Vocabulary and Facts vs Opinions (Sentiment Analysis) using Python

Vedant Jain
Published in Analytics Vidhya · 6 min read · May 19, 2020

Text parsing (docx2txt), Tokenization, Stemming, Exploratory Data Analysis, Rule-based Sentiment Analysis, Image Extraction.

Natural Language Processing; image credits: MediaUpdate.co.za

With some spare time at hand, I was wondering which newspaper I should pick up for reading. My focus was to improve my vocabulary and my stock of facts. This forms the primary problem statement for the article.

This article takes a basic approach to Natural Language Processing to evaluate some renowned English newspapers across India, all published on the 10th of May, 2020.

NLP, as the name suggests, is the field that deals with mature, evolved human languages such as English, Hindi, or Chinese.

The points I considered to evaluate the newspapers' content:

— Vocabulary, or the number of unique words per paragraph.
— Count of images per page, assuming more graphics and images increase engagement.
— Facts vs opinions (sentiment analysis with TextBlob).
— Exploratory data analysis: visualising the intermediate results and further cleaning and preprocessing the text.
— Stats/numeric figures provided in general.

NLP involves a step-by-step approach of repeated text filtering and formatting before any further data science techniques can be applied to make predictions.

Part A: Data Collection

Step I: Accumulating e-versions (PDF) of the newspapers from this site and converting them to Word (docx) format for these 4 newspapers: The Hindu, Times of India, Indian Express, Hindustan Times.

Step II: Loading newspaper text and images using the docx2txt Python library.
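A minimal sketch of this loading step, assuming the docx file names and per-newspaper image folders shown below (these names are illustrative, not the article's actual paths):

import os
import docx2txt

# Illustrative file names; the image folders mirror the ones used later on
newspapers = ['The Hindu', 'Times Of India', 'Indian Express', 'Hindustan Times']
docx_files = {'TH': 'TheHindu.docx', 'TOI': 'TimesOfIndia.docx',
              'IE': 'IndianExpress.docx', 'HT': 'HindustanTimes.docx'}

raw_texts = []
for abbrev, docx_file in docx_files.items():
    img_dir = os.path.join('NLP_ExtractImages', abbrev)
    os.makedirs(img_dir, exist_ok=True)
    # docx2txt.process() returns the document text and, given a directory,
    # also writes every embedded image into it
    raw_texts.append(docx2txt.process(docx_file, img_dir))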

Step III: Using regular expressions to get all digits and decimals in the newspaper text; the more numbers, the more statistics the paper provides.
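A sketch of this extraction, assuming the raw texts from the previous step; the result is also pickled here so that stats.pkl can be loaded back in Part E:

import re
import pickle

# Grab digits and decimals, e.g. '2020', '3.1', '10,000', for each newspaper
stats = [re.findall(r'\d+[\.,]?\d*', text) for text in raw_texts]

# Saved for the analysis step (Part E loads this file as stats.pkl)
with open('stats.pkl', 'wb') as f:
    pickle.dump(stats, f)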

Part B: Data Cleaning

Step I: Converting all words to lowercase.

Step II: Keeping only alphabetic characters and removing punctuation, commas, and other characters such as brackets and quotes.

Step III: Stemming, i.e. reducing each word to its root by stripping prefixes and suffixes, e.g. Running → Run; Bringing → Bring.

Step IV: Removing stopwords, i.e. words that do not add value to the context of the article, e.g. 'the', 'a', 'in'.
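Putting the four steps together, a minimal sketch of a cleaning function; the choice of NLTK's PorterStemmer and English stopword list is an assumption, and any equivalent stemmer or stopword set would do:

import re
import nltk
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords

nltk.download('stopwords', quiet=True)
stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))

def clean_text(text):
    text = text.lower()                        # Step I: lowercase
    text = re.sub(r'[^a-z\s]', ' ', text)      # Step II: keep alphabets only
    words = [stemmer.stem(w) for w in text.split()
             if w not in stop_words]           # Steps III & IV: stem, drop stopwords
    return ' '.join(words)

corpus = [clean_text(text) for text in raw_texts]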

Here we conclude our first round of data cleaning and save the data in a pickle.

Since the above steps can be processor-heavy, we save the outcome as a variable in a pickle file (corpus.pkl), which we can pull back later.
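For example, a short save/load round trip:

import pickle

# Persist the cleaned corpus so the heavy steps above need not be re-run
with open('corpus.pkl', 'wb') as f:
    pickle.dump(corpus, f)

# Later, pull it back with:
with open('corpus.pkl', 'rb') as f:
    corpus = pickle.load(f)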

Part C: Creating the Document Term Matrix

From the filtered text, we create a Document Term Matrix (DTM): a matrix that keeps count of the occurrences of each unique word in each of our texts.

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Build the Document Term Matrix from the cleaned corpus
cv = CountVectorizer(stop_words='english', ngram_range=(1, 1))
docTermMatrix = cv.fit_transform(corpus).toarray()
# Note: newer scikit-learn versions use get_feature_names_out() instead
data_dtm = pd.DataFrame(docTermMatrix, columns=cv.get_feature_names())
data_dtm.index = pd.Index(newspapers)
data_dtm = data_dtm.transpose()

The transpose of the DTM looks like this for us:

Document Term Matrix

Part D: Exploratory Data Analysis

Step I: Here we see that there are some junk tokens, such as single-character words and repeated-letter strings like 'eeeee'. So we add a step to the earlier cleanup to remove such words as well.

Removing Single Character Words
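One possible cleanup, a sketch applied to the cleaned corpus; the exact filter (dropping single characters and single-letter repeats) is an assumption:

def remove_junk_words(text):
    # Drop one-character tokens and tokens made of a single repeated letter, e.g. 'eeeee'
    words = [w for w in text.split() if len(w) > 1 and len(set(w)) > 1]
    return ' '.join(words)

corpus = [remove_junk_words(text) for text in corpus]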

Step II: With such a matrix at hand, it's easy to pull up the top 30 (most frequent) words for each newspaper, collect them together, and check which of these top words are common across all newspapers.

Since these words are so common, they don't add value to our DTM. We filter them out by adding them to the existing stopwords.

# Checking out the top 30 words for each newspaper
top_dict = {}
for c in data_dtm.columns:
    top = data_dtm[c].sort_values(ascending=False).head(30)
    top_dict[c] = list(zip(top.index, top.values))

# Checking the top words collectively and seeing which occur across all newspapers
words = []
for newspaper in data_dtm.columns:
    top = [word for (word, count) in top_dict[newspaper]]
    for t in top:
        words.append(t)

from collections import Counter
Counter(words).most_common()

Step III: Again, with the new set of stopwords handy, we update our Document Term Matrix.
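A sketch of that update; the rule used here for picking the extra stopwords (words that made the top 30 of every newspaper) is an assumption:

from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import CountVectorizer

# Words that appear in every newspaper's top-30 list become extra stopwords
add_stop_words = [word for word, count in Counter(words).most_common()
                  if count == len(data_dtm.columns)]
stop_words = text.ENGLISH_STOP_WORDS.union(add_stop_words)

# Rebuild the Document Term Matrix with the extended stopword list
cv = CountVectorizer(stop_words=list(stop_words), ngram_range=(1, 1))
docTermMatrix = cv.fit_transform(corpus).toarray()
data_dtm = pd.DataFrame(docTermMatrix, columns=cv.get_feature_names())
data_dtm.index = pd.Index(newspapers)
data_dtm = data_dtm.transpose()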

Step IV: We visualize our DTM as a word cloud to see the most frequent words.

WordCloud
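A sketch of how such clouds could be generated straight from the DTM columns; the wordcloud package and the subplot layout are assumptions:

from wordcloud import WordCloud
import matplotlib.pyplot as plt

plt.figure(figsize=(16, 6))
for i, newspaper in enumerate(data_dtm.columns):
    # Feed the non-zero word counts of this newspaper's DTM column as frequencies
    freqs = data_dtm[newspaper][data_dtm[newspaper] > 0].to_dict()
    wc = WordCloud(background_color='white', max_words=100).generate_from_frequencies(freqs)
    plt.subplot(1, 4, i + 1)
    plt.imshow(wc, interpolation='bilinear')
    plt.axis('off')
    plt.title(newspaper)
plt.show()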

Step V: The images we parsed and saved under the '\NLP_ExtractImages' directory include some very small ones as well, such as line and dot images. On inspection we find that, in the current scenario, images over 5 KB are generally the significant ones.

Part E: Analysis

Step I: Since the number of pages per newspaper varies, we get a count of pages for all newspapers.

Total count of unique words across all pages

Also, we check the count of unique words per page for every newspaper to create a plot:

Unique Words Count per page
# Getting unique words / vocabulary per newspaper
unique_list = []
for newspaper in data_dtm.columns:
    uniques = data_dtm[newspaper].to_numpy().nonzero()[0].size
    unique_list.append(uniques)
unique_words = pd.DataFrame(list(zip(newspapers, unique_list)),
                            columns=['newspaper', 'unique_word'])
# unique_words = unique_words.sort_values('unique_word', ascending=False)

# Page counts, checked manually
NoOfPages = [['The Hindu', 22], ['Times Of India', 18],
             ['Indian Express', 18], ['Hindustan Times', 16]]
NoOfPages = pd.DataFrame(NoOfPages, columns=['Newspaper', 'PageCount'])
NoOfPages = NoOfPages.transpose()

# Unique words per page
WPP = []
for i in range(len(newspapers)):
    WPP.append(int(unique_words.unique_word[i] / NoOfPages[i].PageCount))
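One way to plot the per-page counts computed above (the bar-chart styling is an assumption); X here simply holds the bar positions reused by the later plots:

import numpy as np
import matplotlib.pyplot as plt

X = np.arange(len(newspapers))          # bar positions, reused in the plots below
plt.bar(X, WPP, align='center', alpha=0.5)
plt.xticks(X, newspapers)
plt.ylabel('Unique words per page')
plt.title('Unique Words Count per page')
plt.show()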

Step II: Plotting the numeric figures used in the texts, since we know facts and figures keep the audience engaged in any format of media or presentation.

import pickle
import matplotlib.pyplot as plt

# stats.pkl holds the digits/decimals extracted with regex in Part A
file = open('stats.pkl', 'rb')
stats = pickle.load(file)
file.close()

statsLen = [len(li) for li in stats]
barlist = plt.barh(X, statsLen, align='center', alpha=0.5)   # X: bar positions
barlist[0].set_color('0.4')
barlist[1].set_color('r')
barlist[2].set_color('b')
barlist[3].set_color('g')
plt.yticks(X, newspapers)
plt.xlabel('Numeric Figures used')
plt.title('Numeric Figures used')
plt.show()
Numeric Figures used per newspaper

Step III: Sentiment Analysis using TextBlob

TextBlob uses a rule-based approach. Keep in mind that the English words in its lexicon have been manually tagged for polarity and subjectivity by the linguist Tom De Smedt. Since a word can carry different meanings in different contexts, the same word can appear multiple times in the lexicon (as in the image below) with different polarity and subjectivity scores.

Courtesy: Alice Zhao (adashofdata) @ YouTube
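A quick illustration of the two scores TextBlob returns; the example sentence is arbitrary:

from textblob import TextBlob

# polarity ranges from -1 (negative) to +1 (positive);
# subjectivity ranges from 0 (factual) to 1 (opinionated)
blob = TextBlob("The results announced today were absolutely wonderful")
print(blob.sentiment.polarity, blob.sentiment.subjectivity)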

We calculate the subjectivity for every newspaper as :

import numpy as np
import matplotlib.pyplot as plt
from textblob import TextBlob

sentiment = []
for i in np.arange(4):
    sentiment.append(TextBlob(corpus[i]).subjectivity)

plt.scatter(X, sentiment, linewidths=5)
plt.xticks(X, newspapers)
plt.ylabel("← Facts ---------- Opinions →")
plt.title("Subjectivity Graph")
plt.show()
Subjectivity Plot

Step IV: Counting the number of significant-sized images (above 5 KB, as discussed in the EDA) used by every newspaper:

import os
import matplotlib.pyplot as plt

# BasePath is assumed to point to the '\NLP_ExtractImages' directory created earlier
paths = [BasePath + "\\TH\\", BasePath + "\\TOI\\",
         BasePath + "\\IE\\", BasePath + "\\HT\\"]

imagesCount = []
for path in paths:
    counter = 0
    for entry in os.scandir(path):
        size = entry.stat().st_size
        if size > 5000:          # keep only images above ~5 KB
            counter += 1
    imagesCount.append(counter)

barlist = plt.bar(X, imagesCount, align='center', alpha=0.5)
barlist[0].set_color('0.4')
barlist[1].set_color('r')
barlist[2].set_color('b')
barlist[3].set_color('g')
plt.xticks(X, newspapers)
plt.ylabel('No of Significant Images')
plt.title('No of Significant Images')
plt.show()
No. of Significant Images used

Head over to my GitHub for the complete, compiled code.

Conclusion

My initial intention was to pick a read that would help improve my vocabulary and facts, and the above exercise gives me a good heads-up on the one I am going to go with. #StayHomeStaySafe

P.S. The article is purely of academic interest and not a comparison of the respective print media brands.
