I started with the classics (Robert Heinlein, Isaac Asimov) and became fascinated as I dove into the worlds they created, worlds that were supposed to be our future (some of them are our present, some aren’t, but that is a topic for another post).
A lot of people enjoy reading books because of their emotional component — a good fiction book takes you on a crazy ride on a roller coaster, making you happy or sad, empathetic or indifferent.
While searching for ideas to practice my data science skills, I was thinking about a way of visualizing a book with just a single image. After iterating over several ideas, I thought of creating an emotional intensity poster of a book.
Although there’s not much data science involved in this idea, it uses basic Natural Language Processing concepts such as tokenization and sentiment analysis. The idea is pretty simple: assign each word in the book a hue based on the word’s sentiment, and a specific tint of that hue based on the word’s valence (its intensity).
One of the challenges was to find a list of words that are labeled as positive or negative and that have an intensity assigned to them.
Luckily, there is such a list — the AFINN:
AFINN is a list of English words rated for valence with an integer between minus five (negative) and plus five (positive). The words have been manually labeled by Finn Årup Nielsen in 2009–2011.
That is exactly what is needed! Now, we need a book to try this on…
For the purpose of demonstration, let’s take the first book of “The Lord of the Rings” by J.R.R. Tolkien. This fantasy series has been on my reading list for some time, and it is popular enough (a lot of people have at least watched the movies, if not read the books) that we can interpret the results.
Fun fact: According to Wikipedia, The Lord of the Rings is the best-selling novel ever written, with over 150 million copies sold.
Let’s follow the process step by step:
1. Split the text into separate words and remove any punctuation.
Speaking in data science terms, this process is called tokenization. Depending on the language you choose, you can achieve this in different ways. In this demo we will use a rough tokenization, meaning we won’t expand contractions such as you’re into you are or can’t into cannot. Also, we won’t take the context of words into account. In English, the same word can have different meanings in different contexts (e.g. cool can indicate temperature or awesomeness). We will just remove basic punctuation marks: , . ! ? : ; “ ‘ ( ) [ ] -. This may be inaccurate in some cases, but for our task it is good enough.
2. Assign each token a color of a specific tint.
We’ll use green for positive words and red for negative words. The stronger the valence of the word in the AFINN list, the more intense the color. Neutral words, which are not found in the AFINN list, are colored white. Now, representing each word by a 4x4 pixel square, we have the following picture:
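A minimal sketch of this rough tokenization in JavaScript. The function name `tokenize`, the lowercasing step, and the choice to delete apostrophes while turning the other marks into spaces are assumptions made for illustration, not necessarily what the tool does.

```javascript
// Apostrophes and single quotes are deleted so that "can't" stays one token ("cant").
const APOSTROPHES = /['‘’]/g;
// The remaining basic punctuation marks are replaced by spaces.
const PUNCTUATION = /[,.!?:;"“”()\[\]-]/g;

function tokenize(text) {
  return text
    .toLowerCase()                 // assumption: word lists are usually lowercase
    .replace(APOSTROPHES, '')
    .replace(PUNCTUATION, ' ')
    .split(/\s+/)
    .filter((token) => token.length > 0); // drop empty strings left by leading/trailing spaces
}

console.log(tokenize('“Don’t go!” he said (quietly).'));
// [ 'dont', 'go', 'he', 'said', 'quietly' ]
```

Deleting apostrophes keeps contractions as a single token, which matches the decision not to expand them.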
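As a rough sketch of the color assignment in this step: the HSL formula, the names `colorFor` and `colorizeTokens`, and the tiny sample lexicon are all assumptions for illustration. Only the scores for “no” and “like” are quoted in this article; the others are placeholders, and real scores should be loaded from the AFINN file.

```javascript
// Tiny stand-in for the AFINN list (hypothetical sample, not the real file).
const SAMPLE_LEXICON = new Map([
  ['no', -1],
  ['like', 2],
  ['superb', 5], // placeholder score
  ['hated', -3], // placeholder score
]);

// White for neutral words, green (hue 120) for positive, red (hue 0) for negative.
// A larger absolute score gives a lower lightness, i.e. a more intense tint.
function colorFor(score) {
  if (!score) return 'hsl(0, 0%, 100%)'; // undefined or 0 means neutral
  const hue = score > 0 ? 120 : 0;
  const strength = Math.min(Math.abs(score), 5);
  const lightness = 90 - strength * 12; // 78% for strength 1 down to 30% for strength 5
  return `hsl(${hue}, 100%, ${lightness}%)`;
}

function colorizeTokens(tokens, lexicon) {
  return tokens.map((word) => ({ word, color: colorFor(lexicon.get(word)) }));
}

console.log(colorizeTokens(['the', 'superb', 'no'], SAMPLE_LEXICON));
```

Each resulting object can then be drawn as one square, for example as an SVG `rect` filled with `color`.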
3. Let’s remove the neutral words and leave only the colored ones.
This way we can better see the distribution of the positive and negative words, and we can make the squares bigger. Let’s also add a popup showing the corresponding word for each square on mouseover (Medium does not allow inserting user-generated HTML, so below is a simple image without mouseover functionality, but at the end of this article there is a link to a page where you can inspect each square individually):
That’s it! Now we can see the emotional landscape of the book. It’s hard to say which color prevails though. But we can see now that the book starts on a positive note — the first three rows are mostly green, with few intense red squares. Also there’s one bold green square in the beginning that stands for “superb.” Approximately the second eighth of the book is mostly red, intense red. There’s even a sequence of four consecutive “hated” words followed by a “leave” word, then a “hated”, a “loved” and a “hated” again. The middle of the book is balanced and the end is mostly red — because of the battle that ends the first book.
4. Visualize the amount of each type of word and its intensity.
The Emotional Canvas (above) shows each word where it occurs within the text. But it is too detailed and hard to read. Let’s use a donut chart to visualize how many words of each emotional intensity are used in the text:
From this diagram we can see that the numbers of positive and negative words in the book are almost even, confirming the roller coaster comparison at the beginning of the article. There are 3896 emotionally charged words. Of these, 2012 are positive and 1884 are negative. Moreover, this chart shows the number of words of each score (the second level of the donut chart). Thus, there are very few words labeled with +4 and +5, 450 words labeled with +3, and so on.
It is interesting to see whether the number of words matches the total intensity of these words. To find out, let’s create another diagram, a horizontal bar chart, that displays the total intensity and the total number of words.
The green areas are almost the same size, meaning that the number of positive words is proportional to their total intensity. It might seem that this should always be true, but it is not, as you will see in some cases.
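The numbers behind such a donut chart can be sketched as two small counting functions. The names `countByScore` and `countByPolarity` and the sample scores are assumptions for illustration; the input is one AFINN score per word, with 0 standing for neutral words.

```javascript
// Second level of the donut: how many words carry each score (-5..-1, +1..+5).
function countByScore(scores) {
  const counts = new Map();
  for (const score of scores) {
    if (score === 0) continue; // neutral words are not part of the chart
    counts.set(score, (counts.get(score) || 0) + 1);
  }
  return counts;
}

// First level of the donut: positive versus negative words.
function countByPolarity(scores) {
  let positive = 0;
  let negative = 0;
  for (const score of scores) {
    if (score > 0) positive += 1;
    else if (score < 0) negative += 1;
  }
  return { positive, negative, total: positive + negative };
}

const sampleScores = [3, -2, 3, 1, -1, -2, 0]; // made-up scores
console.log(countByScore(sampleScores));
console.log(countByPolarity(sampleScores));
```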
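A small sketch makes the difference between word count and total intensity concrete. The function name `summarize` and the example scores are hypothetical; total intensity is taken here to be the sum of absolute scores, which is an assumption about how the bar chart is computed.

```javascript
// Counts words and sums absolute scores separately for each polarity.
function summarize(scores) {
  const summary = {
    positive: { count: 0, totalIntensity: 0 },
    negative: { count: 0, totalIntensity: 0 },
  };
  for (const score of scores) {
    if (score === 0) continue; // neutral words contribute nothing
    const side = score > 0 ? summary.positive : summary.negative;
    side.count += 1;
    side.totalIntensity += Math.abs(score);
  }
  return summary;
}

// One very strong positive word against five mild negative words.
console.log(summarize([5, -1, -1, -1, -1, -1]));
// positive: count 1, totalIntensity 5; negative: count 5, totalIntensity 5
```

In this made-up example both sides have the same total intensity while the negative side has five times as many words, so the two bars would not be proportional.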
So, we analyzed the text on a very detailed level (emotional canvas) and a general level (donut and bar charts). Is something missing?
5. Visualize the frequency of positive and negative words.
Let’s analyze another dimension of the data: the frequency of positively and negatively charged words. Let’s use a histogram to visualize the top 15 most frequent words:
Here they are! These charts show the words sorted by their frequency (top label), along with their emotional intensity (+ or - at the bottom of the bars).
Note: the histogram charts do not count the words “no” (-1) and “like” (+2), both of which are present in the AFINN list. “no” is pretty common and its frequency is much higher than that of the other words. “like,” depending on context, may mean something positive or may show similarity. Because it is also pretty common (especially as a neutral word), I removed it as well, hopefully keeping the two charts in balance, as the removal of “no” offsets the removal of “like.”
I hope this won’t affect anybody’s well being in any way.
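The counting behind these charts can be sketched roughly as follows. The function `topWords`, its parameters, and the sample lexicon are assumptions for illustration; apart from “no” and “like”, the scores shown are placeholders rather than values taken from the AFINN file.

```javascript
// Words left out of the frequency charts, as described in the note.
const EXCLUDED = new Set(['no', 'like']);

// polarity is +1 for the positive chart and -1 for the negative chart.
function topWords(tokens, lexicon, polarity, limit = 15) {
  const counts = new Map();
  for (const token of tokens) {
    const score = lexicon.get(token);
    if (score === undefined || EXCLUDED.has(token)) continue;
    if (Math.sign(score) !== polarity) continue;
    counts.set(token, (counts.get(token) || 0) + 1);
  }
  return [...counts.entries()]
    .map(([word, count]) => ({ word, count, score: lexicon.get(word) }))
    .sort((a, b) => b.count - a.count) // most frequent first
    .slice(0, limit);
}

const lexicon = new Map([
  ['no', -1],
  ['like', 2],
  ['loved', 3],  // placeholder score
  ['hated', -3], // placeholder score
]);
const tokens = ['loved', 'hated', 'hated', 'no', 'like', 'loved', 'loved', 'the'];
console.log(topWords(tokens, lexicon, 1));  // [ { word: 'loved', count: 3, score: 3 } ]
console.log(topWords(tokens, lexicon, -1)); // [ { word: 'hated', count: 2, score: -3 } ]
```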
This is pretty great, but there is something left… How does the plot evolve? We can infer it from the Emotional Canvas, but it is very hard and inaccurate.
6. Visualize the evolution in time of the text.
The problem with the Emotional Canvas is that it is represented in two dimensions, but the text is linear: we read it from start to end.
We could place all those squares in a single line, but this won’t be readable at all, as the line would be too long to fit on a screen. However, we can reduce the size of each square to a pixel and draw them according to their intensity on a line:
Well, that’s not much better… To make this chart more useful, let’s draw not the individual words but the averages of the preceding words. We’ll use the Simple Moving Average (SMA) for this task. Specifically, each point in time is represented by the average of the previous n points. The bigger the value of n, the smoother the graph, as more words are taken into account, evening out the resulting value. For our example, let’s take n to be 40. That is, each point of the area chart is computed as the average of the previous 40 words. Note also that these values are computed separately for positive and negative words:
At last we have it! The green and red triangles mark the highest positive/negative intensities where they happen in the text. From this chart we can see that everything starts pretty positive, but somewhere around 20% of the book there is a positive peak, followed by a big negative part, that also contains the most negative moment of the entire text (that is, the most negative intensity per 40 emotional words). Then the text goes more positive than negative, but at the end it turns more negative than positive.
You can also spot a flat beginning in both the positive and negative areas. This is because at the beginning there are not enough words to compute an average over 40 words, so the first thirty-nine words get the same average as the fortieth word (another way is to compute the first 39 averages from the words available, but we’ll leave this as is).
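The moving average with its flat beginning can be sketched like this. The function names are hypothetical, and the way the scores are split into a positive and a negative series (zero for words of the opposite sign) is an assumption about how “computed separately” works.

```javascript
// Simple Moving Average over the last n values, including the current one.
// The first n - 1 points copy the value of the n-th point (the flat beginning).
function simpleMovingAverage(values, n) {
  if (values.length === 0) return [];
  const size = Math.min(n, values.length); // shrink the window for very short inputs
  const result = new Array(values.length);
  let sum = 0;
  for (let i = 0; i < values.length; i++) {
    sum += values[i];
    if (i >= size) sum -= values[i - size]; // drop the value that left the window
    if (i >= size - 1) result[i] = sum / size;
  }
  for (let i = 0; i < size - 1; i++) result[i] = result[size - 1];
  return result;
}

// Assumed split: each series keeps its own words and counts the others as 0.
function emotionalSeries(scores, n) {
  const positive = scores.map((s) => (s > 0 ? s : 0));
  const negative = scores.map((s) => (s < 0 ? -s : 0));
  return {
    positive: simpleMovingAverage(positive, n),
    negative: simpleMovingAverage(negative, n),
  };
}

console.log(simpleMovingAverage([1, 2, 3, 4, 5], 3)); // [ 2, 2, 2, 3, 4 ]
```

For the chart in this article, `n` would be 40 and `scores` would hold the scores of the emotional words in reading order.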
That’s it. Now we have tools to analyze the text from many aspects: we can see the Emotional Canvas, distribution and frequency of emotional words and the evolution of the whole text.
However, there is one more step to take in order to have a somewhat useful thing (did you notice when this turned from a bunch of colored squares into a tool for text analysis? ’cause I didn’t).
7. Make everything work on a “local” scale.
Wouldn’t it be great to see how all these charts look for a specific part of the text? What words are present in the part of the text that contains the most negative moment? How many of them are there? What is their total intensity?
Well, if you have everything working on a global scale, it is quite easy to make it work for a specific portion that you, the user, can select. All you need is to put all those charts on a page and give the user the ability to select whatever part of text he or she likes:
And if you select another part of text, the charts adjust to that data.
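One way to picture this: every chart is a function of a list of scored words, so a selection only needs to slice that list before the same computations run again. The sketch below assumes the selection is expressed as fractions of the text; `selectRange`, `totalIntensity`, and the sample data are hypothetical.

```javascript
// Returns the scored words that fall inside the selected portion of the text.
// startFraction and endFraction are between 0 and 1 (e.g. 0.25 means 25% into the text).
function selectRange(scoredWords, startFraction, endFraction) {
  const start = Math.floor(scoredWords.length * startFraction);
  const end = Math.ceil(scoredWords.length * endFraction);
  return scoredWords.slice(start, end);
}

// Any of the global computations can be reused on the selection, for example:
function totalIntensity(scoredWords) {
  return scoredWords.reduce((sum, item) => sum + Math.abs(item.score), 0);
}

// Made-up scores for illustration only.
const scoredWords = [3, -2, 1, -3, -3, 2, -1, 4].map((score, index) => ({
  word: `word${index}`,
  score,
}));

const selection = selectRange(scoredWords, 0.25, 0.75);
console.log(selection.length, totalIntensity(selection)); // 4 9
```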
Oh, and as this tool is based on colors, it would be rude to leave out people who see colors differently than most people do. For this purpose, there is a “Switch color mode” button that switches the colors to a friendlier palette:
As you can see, different parts of text are selected in the previous two images and all other charts have adjusted.
As a final touch, you can press on the ☰ symbol and see the placement of the emotional words within the actual text:
And now you have enough tools to make some analysis of your favorite books.
Going beyond books
Initially, I started this as a book analysis tool, but its implications go far beyond books. Emotions constitute an essential part of interpersonal interactions, be it from person to person, from person to group or between groups. We are emotional beings and we convey these emotions through the words that we speak. Therefore, you can use Textury to analyze the emotional component of any body of text and how exactly the speaker uses emotions to make a point.
Below are just some ideas that you can analyze:
- Speeches of leaders in front of entire nations
- Commencement speeches
- Product launch speeches
- Political debates
- News articles
- Short stories
You can watch a short video that demonstrates how to use Textury, in which I analyze a recent political debate and a commencement speech, and give a brief overview of the iPhone launch in 2007:
The tool is written in ES6 and the only framework used is D3.js v4 for visualization of data.
In conclusion, I would like to say a few words of caution. As mentioned earlier, this tool does not take the context of words into account, so in some situations it is not exact. It analyzes individual words and fails to tell you that “great danger” is actually a more intense negative combination, rather than a positive “great” followed by a negative “danger.” This is one of the things to improve in Textury.
If you have any ideas or suggestions, feel free to leave them in the comments below.
Finally, if you would like to use Textury yourself, go to textury.heroesofprogramming.com and make your own discoveries of the texts you are interested in.