Analyzing Nehru with NLP — Part 2
Exploring the themes of ‘The Discovery of India’ with NLP
Analyzing Nehru with NLP: Part 1 went quite well, and it got a few people interested in NLP too, so that’s a plus. A couple of things I learnt from Part 1:
- No matter what data you obtain, it is always best to present it in a simple, lucid manner.
- Line graphs and bar charts come in handy when you least expect them to. All the diagrams in this article were plotted with the matplotlib library.
While Part 1 assessed the structural flow and verbosity of subjects, this part will analyze the topics covered by looking at the frequency distribution of words in the book.
Frequency Analysis of words
The words an author uses most frequently often point to the things that matter most to him. They also throw light on the topics most relevant to each chapter.
For the analysis, some light cleanup is in order :
- Word tokenization in NLTK sometimes splits a hyphenated word into two separate words, so we joined these back into single words to make the analysis easier.
- Punctuation marks appeared throughout the text and had to be removed. The analysis also did not depend on capitalization, so all words were converted to lowercase.
- All words were represented in ASCII, and we removed the most common stop words (i, this, am, an, as, would, etc.) from the list of words to be analyzed so as to focus on just the keywords.
- We use numpy functions to quickly calculate the frequency of each word and to extract the top N words.
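The cleanup steps above can be sketched roughly as follows. This is a minimal illustration, not the code used for the article: the stop-word list is a tiny hypothetical sample, hyphenated words are joined by simply dropping the hyphen (an assumption), a regular expression stands in for NLTK's tokenizer, and `collections.Counter` from the standard library is swapped in for the numpy-based counting. The function names `clean_tokens` and `top_n` are made up for this sketch.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (hypothetical; the article's actual list is not shown).
STOP_WORDS = {"i", "this", "am", "an", "as", "would", "the", "of", "and", "is", "a", "in", "to"}


def clean_tokens(text, stop_words=STOP_WORDS):
    """Lowercase, join hyphenated words, strip punctuation and drop stop words."""
    # Keep only ASCII characters and ignore capitalization.
    text = text.encode("ascii", "ignore").decode("ascii").lower()
    # Join words broken by a hyphen: "well-known" -> "wellknown" (assumed convention).
    text = re.sub(r"(?<=\w)-(?=\w)", "", text)
    # Runs of letters become tokens; punctuation and digits are discarded.
    words = re.findall(r"[a-z]+", text)
    return [w for w in words if w not in stop_words]


def top_n(tokens, n=40):
    """Return the n most frequent tokens as (word, count) pairs."""
    return Counter(tokens).most_common(n)


sample = "India is the land of the Indian people. India, well-known India!"
tokens = clean_tokens(sample)
print(top_n(tokens, 3))
```

Running `top_n` on the tokens of the whole book with `n=40` would give the kind of list discussed next.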
Taking the top 40 words from the entire book makes Nehru’s theme and the topics under discussion quite obvious.
Having been written at a time when the quest for independence was at its zenith, the book is intensely patriotic and throws light on India and its rich past. One can see the key themes of the book bubbling up:
- India (2112 times) and Indian (794)
Chapter wise Analysis
Once we analyze the top 10 words in each chapter, we see the focus with which each chapter was written and the underlying themes.
- In Chapters 1 and 10, references to life (63) are a common theme.
- Chapter 2 is about Kamala (25) and women (13).
- Chapter 7 throws light on India (323) in the light of British rule (245).
- Chapter 8 highlights India (185) and the Indian National Congress (145). The government (91) and the British (80) policies are discussed as well.
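For the chapter-wise view, the same counting can be applied to each chapter separately. A small sketch, assuming the cleaned tokens are already grouped by chapter in a dictionary (the name `top_words_by_chapter` and the toy input are hypothetical, and `collections.Counter` again stands in for the numpy-based counting):

```python
from collections import Counter


def top_words_by_chapter(chapter_tokens, n=10):
    """Map each chapter name to its n most frequent (word, count) pairs."""
    return {
        name: Counter(tokens).most_common(n)
        for name, tokens in chapter_tokens.items()
    }


# Toy input for illustration only, not taken from the book.
toy_chapters = {
    "Chapter A": ["life", "india", "life", "past"],
    "Chapter B": ["india", "british", "india", "rule", "british", "india"],
}
for chapter, words in top_words_by_chapter(toy_chapters, n=2).items():
    print(chapter, words)
```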
Dispersion plots are helpful for showing the relative positions of words in the book. Each occurrence of a word is marked by a vertical tick at the position where it appears. Here we show a dispersion plot of the top words identified in the section above.
Items to note :
- Here the red lines mark the boundaries between the various chapters in the book (Preface + Chapters 1 to 10).
- Stop words have been removed for ease of analysis.
- Other signifies all the words that do not fall into the Top Words.
- The words India, Indian and country are distributed fairly uniformly throughout the entire book.
- However, words like British, Congress and war are concentrated towards the final chapters, where the last phases of the freedom struggle are discussed.
Code to plot a dispersion map directly using matplotlib’s Scatter Plots
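A dispersion plot like the one described above could be drawn along these lines. This is a sketch under stated assumptions rather than the article's original code: `dispersion_points` and `plot_dispersion` are hypothetical names, `tokens` is assumed to be the cleaned word list for the whole book, and `chapter_starts` is assumed to hold the token offset at which each chapter begins. Every word outside the top words is placed on a single "Other" row.

```python
def dispersion_points(tokens, top_words, other_label="Other"):
    """Return (xs, ys, labels): one point per token, x = word offset, y = row index."""
    labels = list(top_words) + [other_label]
    row = {word: i for i, word in enumerate(top_words)}
    other_row = len(top_words)
    xs, ys = [], []
    for offset, tok in enumerate(tokens):
        xs.append(offset)
        ys.append(row.get(tok, other_row))  # non-top words fall on the "Other" row
    return xs, ys, labels


def plot_dispersion(tokens, top_words, chapter_starts=()):
    """Draw the dispersion plot with matplotlib's scatter and return the figure."""
    # Imported here so that dispersion_points stays usable without matplotlib.
    import matplotlib.pyplot as plt

    xs, ys, labels = dispersion_points(tokens, top_words)
    fig, ax = plt.subplots(figsize=(12, 4))
    ax.scatter(xs, ys, marker="|", s=200)  # one vertical tick per occurrence
    for start in chapter_starts:
        ax.axvline(start, color="red", linewidth=0.8)  # chapter boundary
    ax.set_yticks(range(len(labels)))
    ax.set_yticklabels(labels)
    ax.set_xlabel("Word offset")
    ax.set_title("Lexical dispersion plot")
    return fig
```

Calling `plot_dispersion(tokens, ["india", "indian", "british"], chapter_starts=[0, 5000])` followed by `fig.savefig("dispersion.png")` would write the plot to a file; the word list and offsets here are placeholders.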
- It might be useful to understand the context in which this book was written and the mindset of Nehru during this period. [See Wikipedia for a brief overview]
- The book was the basis for Doordarshan’s Bharat Ek Khoj. In a time of extreme right-wing politics, it might serve as an eye-opener to many. There is no greater danger than misplaced patriotism. India through the eyes of a true liberal freedom fighter may dispel many a myth that one may hold today.