Many people have heard of it, but few are lucky enough to get the chance to read it, and even fewer know it in detail. According to Wikipedia, the global source of truth, it is the “longest poem ever written” in human history. It has around 1.8 million words in total, roughly 10 times the length of the Iliad and the Odyssey combined.
So, what is in it? We can say everything. Love stories, stories about war, stories about revenge. Philosophy and military tactics. Policies on tax, rules of espionage, how to keep watch over enemies, and how to find the most able people and make them ministers. The theory of karma, which is part of the Gita, which is itself part of the Mahabharata. Yog (not to be confused with the physical yoga people practise) is explained in the Mahabharata, and so are life, death and life after death. A tale where God does not solve your problems for you but guides you to the solution and enables you to face them. A story where God is not a forefather but a friend who is always there for you.
A story where even God accepts, with due respect, a curse given by a mother. A story where the storyteller himself not only narrates the story but is also an integral part of it. A tale that stays the same while its meaning keeps changing for every reader. A story that spirals into smaller stories, which in turn contain further stories. An infinite source of knowledge, wisdom and fun.
I can go on and on; but let’s come back to F#.
Natural Language Processing
Understanding human language, words and sentiments is always exciting, especially with your favourite programming language. Processing old epics is always fun: it not only tells you about history and mythology but also takes you back to your heritage and culture. This part comes from my Mentorship program (more details at the end); the article was born from homework I got recently.
I don’t know how complex this is; that depends entirely on the reader. But the results are indeed very good.
Let the Fun Begin
The first thing is to find a source in English (as it is easier to compare with the data sets). Project Gutenberg is a good place to find license-free text. If you would like to code along with this article, get your favourite book from the site, or you can always download it from my project.
That is the whole Mahabharata in four text files, so I did some manual labour to separate the books. You can find them here. There are 18 books in total (sub-books are not separated).
Let’s start with simple file IO.
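A minimal sketch of this step, assuming each book is saved as a plain text file. The separator list, the `tokenize` and `readTerms` names and the example path are illustrative choices, not taken from the original project.

```fsharp
open System
open System.IO

// Characters treated as word boundaries (an illustrative choice, not a full tokenizer).
let separators =
    [| ' '; '\t'; '\r'; '\n'; '.'; ','; ';'; ':'; '!'; '?'; '"'; '\''; '('; ')'; '-' |]

// Lowercase the text, split it into terms and keep only purely alphabetic ones.
let tokenize (text: string) : string[] =
    text.ToLowerInvariant().Split(separators, StringSplitOptions.RemoveEmptyEntries)
    |> Array.filter (String.forall Char.IsLetter)

// Read one book from disk and return its terms.
let readTerms (path: string) : string[] =
    File.ReadAllText path |> tokenize

// Usage (hypothetical path): let terms = readTerms "books/01.txt"
```

Lowercasing everything up front keeps "Great" and "great" as one term, which matters once we start counting.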
That’s quite a lot of terms for a small piece of code. Now we are one step away from becoming data scientists.
Let’s find the unique terms and their frequencies.
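As a rough sketch, assuming the terms are already in an array, counting and de-duplicating can be done with standard `Array` and `Set` functions. The function names and the tiny sample are made up for illustration.

```fsharp
// Count how often each term occurs, most frequent first.
let termFrequency (terms: string[]) : (string * int)[] =
    terms
    |> Array.countBy id
    |> Array.sortByDescending snd

// Unique terms are simply the distinct terms of the book.
let uniqueTerms (terms: string[]) : Set<string> =
    Set.ofArray terms

// Toy input standing in for a real book.
let sample = [| "fire"; "great"; "king"; "great"; "great"; "fire" |]
let frequencies = termFrequency sample
let unique = uniqueTerms sample

// Ratio of unique terms to all terms.
let uniquePerTerm = float unique.Count / float sample.Length
```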
See, another line and we are done. We are now official data scientists.
Go ahead and try it on the other books too. If you are feeling lazy, you can check out the results here or see below.
For unique terms:
For unique terms per term:
What next? Let’s do sentiment analysis of all these books and compare them with each other.
Sentiment analysis is done to find the tone of a given text; here, the texts are books. Basically, using it we can find out whether a book is more joyful to read or tilted towards sadness, and how many surprise elements it has. It is also useful for understanding a conversation: whether it leans towards the positive end or the negative end.
It will make more sense once we compare.
Like everything else in F#, here too we start with a type and start putting things in it. Let’s call it the Book type. Because why not.
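One possible shape for such a type, as a sketch only: the field names and the choice of emotions below are assumptions based on what the rest of the article mentions (terms, unique terms and a sentiment index), not the original definition.

```fsharp
// Emotion counts for a word or a whole book. The fields are illustrative;
// use whatever emotions your data set actually contains.
type Sentiment =
    { Joy: int
      Sadness: int
      Surprise: int }

// A book as the rest of the pipeline sees it.
type Book =
    { Name: string
      Terms: string[]
      UniqueTerms: Set<string>
      SentimentIndex: Sentiment }

// Building a value by hand, just to show the shape.
let example =
    { Name = "Book 1"
      Terms = [| "great"; "king"; "great" |]
      UniqueTerms = set [ "great"; "king" ]
      SentimentIndex = { Joy = 0; Sadness = 0; Surprise = 0 } }
```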
Now, keep this aside for a second. We need more details, or rather data, to find sentiments: data with which we can compare our book terms. So we will be using two data sets. For emotions we will be using this, and for positive / negative ratings we will be using this.
Here is what they look like.
And words with ratings:
The traditional way to pull data out of a CSV file is for -> for -> for loops. But we are in F# land, so we will be using the CSV type provider. Let's pull the data out of the CSV and shape it into types.
Here is the thing about this data. The Emotion column also specifies the emotion; not always, but the values are there. So we need them too. That is the reason for the extra calculation we are doing: for every emotion found we add 1, otherwise 0.
As you can see, I am not comparing against a string but against a concrete F# term. That is because the data has “anticip” for “anticipation”. If in future we add another data set to this collection and it has “anticipation”, that would add an extra case for the same result. So it is better to encapsulate these away, and a clean way to do that is Active Patterns.
Here is the missing piece of code.
As we have more than 7 cases, and a single complete Active Pattern is limited to 7, we will use partial Active Patterns and join them in a match statement.
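A sketch of the idea, assuming the raw emotion arrives as a string. Only four emotions are shown to keep it short, and apart from the “anticip” / “anticipation” pair the spellings are guesses; the pattern and function names are illustrative.

```fsharp
type Emotion =
    | Anger
    | Anticipation
    | Joy
    | Sadness

let normalise (raw: string) = raw.Trim().ToLowerInvariant()

// One partial active pattern per emotion; each hides the raw spellings.
let (|IsAnger|_|) (raw: string) =
    if normalise raw = "anger" then Some () else None

let (|IsAnticipation|_|) (raw: string) =
    match normalise raw with
    | "anticip" | "anticipation" -> Some ()
    | _ -> None

let (|IsJoy|_|) (raw: string) =
    if normalise raw = "joy" then Some () else None

let (|IsSadness|_|) (raw: string) =
    if normalise raw = "sadness" then Some () else None

// Join the partial patterns in one match expression.
let parseEmotion (raw: string) : Emotion option =
    match raw with
    | IsAnger -> Some Anger
    | IsAnticipation -> Some Anticipation
    | IsJoy -> Some Joy
    | IsSadness -> Some Sadness
    | _ -> None

// 1 when the raw value names the expected emotion, otherwise 0.
let score (expected: Emotion) (raw: string) : int =
    if parseEmotion raw = Some expected then 1 else 0
```

If a new data set spells an emotion differently, only the matching pattern changes; `parseEmotion` and everything built on it stay the same.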
The same can be done for words with positive and negative ratings.
Here is the code for that.
Great. Now the stage is set to convert books made of terms into books made of numbers. Let’s do it for one book, and then we will loop over our array.
A single function and it’s done. That’s it. What are we doing in it? Creating our Book type.
The terms and unique terms have already been extracted.
The slightly complicated-looking part is where we find the Sentiment Index for the book. The first step is to clean up the word set, which is why we use the unique terms. Not all terms are present in the data set we have, so we need only the common terms. Again, no loops and conditions: these are two sets and we need their common terms, so we just intersect them. One line, without performance overhead. Then look up the sentiment details for each word, and to get the value for the whole book, fold over them and sum. Done.
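The intersect-then-fold step can be sketched like this. The `Sentiment` record, the toy `lexicon` map and the function names are assumptions made for illustration; a real lexicon would come from the CSV data.

```fsharp
type Sentiment = { Joy: int; Sadness: int; Surprise: int }

let zero = { Joy = 0; Sadness = 0; Surprise = 0 }

// Add two sentiment records field by field.
let add (a: Sentiment) (b: Sentiment) : Sentiment =
    { Joy = a.Joy + b.Joy
      Sadness = a.Sadness + b.Sadness
      Surprise = a.Surprise + b.Surprise }

// Toy lexicon: word -> emotion flags (made-up values).
let lexicon : Map<string, Sentiment> =
    Map.ofList
        [ "victory", { zero with Joy = 1; Surprise = 1 }
          "grief", { zero with Sadness = 1 }
          "feast", { zero with Joy = 1 } ]

// Keep only the terms the lexicon knows, then fold their sentiments into one sum.
let sentimentIndex (lexicon: Map<string, Sentiment>) (uniqueTerms: Set<string>) : Sentiment =
    let known = lexicon |> Map.toSeq |> Seq.map fst |> Set.ofSeq
    Set.intersect uniqueTerms known
    |> Set.fold (fun total term -> add total lexicon.[term]) zero

let index = sentimentIndex lexicon (set [ "victory"; "grief"; "arjuna" ])
```

Because the intersection only keeps terms that exist in the lexicon, the lookup inside the fold never hits a missing key.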
So, we now have all the data in memory; in our case, in FSI / the REPL.
As official data scientists, we are required to see things in graph format. So first let’s convert things to JSON and write them to disk so we can use them.
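One way to do this, as a sketch: it uses `System.Text.Json` from the .NET base library, which is my assumption and not necessarily the serializer used in the original project. The `BookSentiment` record and the function names are hypothetical.

```fsharp
open System.IO
open System.Text.Json

// A flat record per book is easy for most graph libraries to consume.
type BookSentiment = { Name: string; Joy: int; Sadness: int }

// Serialize the per-book numbers to a JSON array.
let toJson (books: BookSentiment[]) : string =
    JsonSerializer.Serialize(books)

// Write the JSON to disk so a graph library can pick it up.
let writeJson (path: string) (books: BookSentiment[]) : unit =
    File.WriteAllText(path, toJson books)

// Usage (hypothetical path and numbers):
// writeJson "sentiments.json" [| { Name = "Book 1"; Joy = 3; Sadness = 1 } |]
```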
Now, once the JSON is ready, we can easily show it as a graph using any graph library.
The same can be done for the positive and negative word set. (DIY for you.)
Reading the Graphs & Experience
One thing needs to be understood here. Data sets are created by humans, code is written by humans, and code is executed by a dumb computer. So there is, and always will be, a little bit of manual tweaking. Language is a matter of perception: one needs to understand the culture and history the words come from. Graphs should be read in that context only. Let me give an example.
Check out the word frequency graphs here. Pick any graph. Here I am taking three graphs to compare.
“Great” always comes first, and “Fire” always comes in the last five. In normal Western literature the word “Great” is used as an adjective for a person or thing. But in the Indian or Mahabharata context it is used to address someone. Like
Hey, Great warrior Arjuna. A poetic way of saying things. It looks good, but it also makes the word totally useless for understanding the phrase. This issue can be solved with Inverse Document Frequency, but again that is extra effort. The same goes for the word "Fire". It has a negative value in a normal context, but in this specific context it is not that negative. The Fire God and fire itself (yagna) are positive. It is very much contextual.
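For the curious, Inverse Document Frequency can be sketched in a few lines, assuming each book is represented by its set of unique terms. It uses the plain log(N / df) form; the function name and the toy books are illustrative.

```fsharp
// documents: each book as its set of unique terms.
// Returns log(N / df), where df is the number of books containing the term.
let inverseDocumentFrequency (documents: Set<string>[]) (term: string) : float =
    let containing =
        documents |> Array.filter (Set.contains term) |> Array.length
    if containing = 0 then 0.0
    else log (float documents.Length / float containing)

// Toy corpus of three "books".
let books =
    [| set [ "great"; "fire" ]
       set [ "great"; "king" ]
       set [ "great" ] |]

let idfGreat = inverseDocumentFrequency books "great"
let idfFire = inverseDocumentFrequency books "fire"
```

A term that shows up in every book gets a weight of zero, so a form of address like "Great" would stop dominating the frequency graphs once counts are multiplied by this weight.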
But then again, generating graphs is a much more data-sciencey way to explain things.
The next step would be a more detailed analysis of this epic: comparing the numbers with the original context of the books, trying to push the numbers as close as possible to the words, and probably extracting a good NLP library from it.
All books are divided into sub-books telling different stories. Wiki links are added in the table list. If you are interested, check them out.
Download your favourite books and have fun with graphs.
Special thanks to the people without whom this project might not exist.
Mentorship program: in simple words, it is a program where a mentor and a mentee are paired with each other, and they have an hour a week to teach / learn a topic specific to F#. To highly technical people this may look a little old school, but nothing else can create the effect that flesh and blood can. No books, no recorded videos. 45–60 minutes can cover more than one can cover in a month or two using dead material. It is always good to have someone alive in front of you whom you can ask questions. Again, that is all my preference and what I like about this program.
Andrea, for the last season of the Mentorship program.
My lovely fiancée for not only allowing me but also encouraging me to give extra time to this.
Saheb, for being my phone-a-friend for any kind of machine learning and data science query.
Note to my Mentor
Words can’t explain how grateful I am to have Evelina Gabasova as my mentor, or maybe guru would make much more sense in the current context. After a long time I can have a kid’s curiosity and innocence and ask her anything and everything. And she is always there with an answer and an ever-smiling face, always going the extra mile to beat the time zone differences.
In the Mahabharata, Krishna narrates the Gita to Arjuna in the middle of the battlefield, empowering him with eternal knowledge. Just like that,
Hey, Great Evelina, I am no Arjuna, but you have always been my Krishne, guiding my way in the flood of data. It is always good to have you around, as a mentor and as a friend. Please be there always.
I would like to close with a few of my favourite pictures depicting war moments of the Mahabharata.
Krishna narrating the Gita to Arjuna
Krishna driving Arjuna on the battlefield
Krishna, Arjuna and Bhisma: three great warriors, but helpless in front of time (situation)
Originally published at kunjan.in.