TextRank NPM Package: A Non-machine Learning Approach to Text Summarization

Camden Ko
Jul 21, 2017 · 8 min read

The night before Father’s day I scrambled around the city searching desperately for a gift. Department Store? Closed. Macy’s? Closed. Ace Hardware? Closed. 7/11? Open.

With my Arizona Iced Tea in hand, I sat behind my computer to brainstorm what I could possibly make for my father to express how much I appreciate his support and sacrifice throughout my life. By the end of my Iced Tea, I had settled on “something cool that might make his life a little bit easier.” So I went with the idea of an app that would record his daily conference calls and spit out a short summary of each call.

I hoped that the job would simply be connecting a bunch of Google APIs together, but as I scoured the internet for a simple-to-use text-summarizer NPM package, none seemed attractive. It seemed that most natural language processing (NLP) tools were developed in Python or C#. While it wouldn’t have been an issue to use child_process to run an instance of an already established NLP tool, my research indicated that these tools were extremely memory intensive and relied on machine learning for tasks like text summarization — there must be an easier way.

Luckily, I stumbled upon TextRank: Bringing Order into Texts, a paper that not only compared different summarization algorithms, but also proposed its own solution, one that performs better than or comparably to most established approaches without any need for machine learning. Perfect.


The 411 on TextRank’s algorithm is simple: graphs are a tried-and-true method for citation analysis, social networks, and the structure of the web, so why not apply those methods to language?

The TextRank algorithm turns text into a graph, then after running the algorithm on the graph several times, the pieces of text with the highest weight are considered to have been “recommended” the most by the other snippets of text…

Huh? Read on, it’s simpler than it sounds.

They explain their TextRank algorithm in two contexts: extracting the most impactful words from a sentence and extracting the most relevant sentences from a paragraph. I’ll just touch upon the latter case.

First, an undirected weighted graph is constructed with every node connected to every other node via a weighted edge. The nodes of the graph are sentences. The weight of each edge derives its value from similarity, a measure of the content overlap of the two sentences it connects. In this case, content overlap is the number of words that both sentences have in common, divided by the sum of the logs of the total number of words in each sentence (so longer sentences don’t completely dominate shorter ones).

Similarity formula

What’s the easiest way to find the overlap between two sentences? The sentences need to be tokenized into words. After creating a sentence class, I attached an object to it in the format {word: count}. Accessing an entry in an object is O(1), making it the premier choice when performing tens of thousands of lookups.
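A minimal sketch of such a sentence class might look like this (the class and property names are my own, not necessarily those used in the package):

```javascript
// Hypothetical sketch of a sentence class with an O(1) word-count lookup.
class Sentence {
  constructor(text, index) {
    this.text = text;      // original sentence, kept for readable output later
    this.index = index;    // original position in the document
    this.wordCounts = {};  // {word: count} map for O(1) access
    for (const word of text.toLowerCase().match(/[a-z']+/g) || []) {
      this.wordCounts[word] = (this.wordCounts[word] || 0) + 1;
    }
  }
}

const s = new Sentence("The cat sat on the mat.", 0);
// s.wordCounts is { the: 2, cat: 1, sat: 1, on: 1, mat: 1 }
```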

The similarity is then calculated between every two nodes (sentences) on the graph, filling in all of the edge weights. Yay!
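Under the paper’s definition, the pairwise similarity computation might be sketched like so (the function name is mine; here overlap counts each shared word once, using the {word: count} objects described above):

```javascript
// Similarity(Si, Sj) = |shared words| / (log(|Si|) + log(|Sj|)),
// following the formula in the TextRank paper. `a` and `b` are
// plain {word: count} objects.
function similarity(a, b) {
  const wordsA = Object.keys(a);
  const wordsB = Object.keys(b);
  const overlap = wordsA.filter((w) => w in b).length;
  const denom = Math.log(wordsA.length) + Math.log(wordsB.length);
  return denom === 0 ? 0 : overlap / denom; // guard single-word sentences
}
```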

Now that the edge weights have been calculated, it’s time to work with the nodes. The ranking formula can’t run without initial weights already on the nodes. Luckily, random numbers serve just fine as starting values. Any range of numbers could work; however, the paper recommends 0–10.

What the paragraph looks like now

At this point the graph is made up of sentences with random weights, connected to one another by edges whose weights are based on their similarity — time to plug in TextRank’s graph-based ranking formula.

Graph based ranking

This formula is quite a doozy so let’s walk through it together by breaking it into chunks.


The weight (W) of the sentence we wish to rerank (Vi) starts with one minus the dampening factor (d = .85), i.e. .15.

Wait, what the heck is a dampening factor?

The dampening factor is a value used by Google’s engineers in their PageRank system to ensure that the weights of the nodes eventually settle down to stable values. The dampening factor can be anything from 0 to 1; however, .85 is the commonly accepted “one-size-fits-all” value.

For two connected sentences (i, j), the weight of the edge between them (wij) is divided by the sum of the weights (wjk) of every edge between sentence j and each sentence k it connects to. This value is then multiplied by the weight of sentence j.

The previous process is then repeated for every sentence (Vj) that is connected to the sentence we are reranking (Vi). Those values are summed and multiplied by the dampening factor (.85).

That value is then added to the first value we calculated (.15) and the new weight of the sentence has been calculated!
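Put together, reranking a single node might look like the sketch below, assuming a fully connected graph stored as a 2-D array of edge weights (the variable names are mine):

```javascript
// One application of the weighted TextRank formula to sentence i:
//   W(Vi) = (1 - d) + d * Σj [ wij / Σk wjk ] * W(Vj)
const d = 0.85; // dampening factor

function rerank(i, weights, edges) {
  let sum = 0;
  for (let j = 0; j < weights.length; j++) {
    if (j === i) continue;
    // Σk wjk: total edge weight between sentence j and its neighbors
    let outSum = 0;
    for (let k = 0; k < weights.length; k++) {
      if (k !== j) outSum += edges[j][k];
    }
    if (outSum > 0) sum += (edges[j][i] / outSum) * weights[j];
  }
  return (1 - d) + d * sum;
}
```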

But aren’t In(Vi) and Out(Vj) just all of the other nodes?

Yes! In our specific example, every single sentence is interconnected. But what about the case of an essay? Perhaps we would want to connect every sentence within a paragraph, then connect the concluding sentence of one paragraph with the introductory sentence of the next. This would result in a much more complicated graph, and In(Vi) and Out(Vj) would no longer be every other node.

With the new rank of one sentence calculated, the calculation must be repeated for every sentence. The weight of each sentence is now significantly closer to its “true” weight, but to ensure that the ranks have been recalculated enough times, TextRank uses an “acceptable error rate” (.0001). The error is found by simply comparing each sentence’s weight before and after a full reranking pass. Once no weight changes by more than .0001, the weights are close enough to their “true” values to stop the algorithm.
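The whole convergence loop is short. Here is a self-contained sketch (names and structure are mine; the .0001 threshold and .85 dampening come from the text):

```javascript
// Iterate the TextRank update over every node until no weight changes
// by more than the acceptable error rate.
const THRESHOLD = 0.0001;
const DAMP = 0.85;

function textRank(edges) {
  const n = edges.length;
  let weights = new Array(n).fill(1); // any initial values converge
  let maxChange = Infinity;
  while (maxChange > THRESHOLD) {
    maxChange = 0;
    const next = weights.map((_, i) => {
      let sum = 0;
      for (let j = 0; j < n; j++) {
        if (j === i) continue;
        let outSum = 0;
        for (let k = 0; k < n; k++) if (k !== j) outSum += edges[j][k];
        if (outSum > 0) sum += (edges[j][i] / outSum) * weights[j];
      }
      return (1 - DAMP) + DAMP * sum;
    });
    for (let i = 0; i < n; i++) {
      maxChange = Math.max(maxChange, Math.abs(next[i] - weights[i]));
    }
    weights = next;
  }
  return weights;
}
```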

Now we simply look at the sentences with the highest rank: those are the most relevant sentences and therefore make the best summary!

We’re done right? Well… not quite.

Inside the sentence class are the tokenized words, but those are completely unreadable as output. So, as each sentence is constructed, the original sentence string is also stored on the class, and when the sentence class is sent to be output, the full original sentence is printed instead.

Output the first couple of lines of the newly generated summary and a key issue arises: the sentences are out of order. To work around this, each sentence’s original position in the body of text was added to the sentence class. Ordering the sentences then became a matter of putting the top sentences into an array and sorting them by original position.
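That last step might be sketched like this (a guess at the shape of the code; the sentence objects are assumed to carry `weight`, `index`, and the original `text`):

```javascript
// Pick the highest-ranked sentences, then restore document order.
function summarize(sentences, count) {
  return sentences
    .slice()                             // don't mutate the caller's array
    .sort((a, b) => b.weight - a.weight) // best-ranked first
    .slice(0, count)                     // keep the top `count`
    .sort((a, b) => a.index - b.index)   // back to original order
    .map((s) => s.text);
}
```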

Now we’re cooking! This process gives a decent result. After feeding my new text summarizer articles from the NY Times and some Jewish science fiction essays, I received results that were reasonable.


However, after reading through the summaries I noticed a trend: the sentences being output had a high concentration of common words that play no part in the actual content of the piece. After a quick search I found that these words are called “stop words.” So, I created a Set containing all of these stop words and only tokenized words that weren’t inside it. Implementing this showed a significant improvement in the sentences being returned, about a 33% improvement in results!
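The filter is a one-line check during tokenization (a sketch; the real stop-word list is far longer than this sample):

```javascript
// Skip stop words while tokenizing. A Set gives an O(1) membership check.
const STOP_WORDS = new Set(["the", "a", "an", "and", "of", "to", "in", "is"]);

function tokenize(text) {
  const counts = {};
  for (const word of text.toLowerCase().match(/[a-z']+/g) || []) {
    if (STOP_WORDS.has(word)) continue; // ignore stop words entirely
    counts[word] = (counts[word] || 0) + 1;
  }
  return counts;
}
```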

However, there was clearly more room for improvement, so I tinkered with how much effect duplicated words had. For example, if two sentences both contain the word “Trump” twice (very common in the NY Times), then both of those sentences received a very high rank, even if the rest of each sentence was fairly irrelevant to the piece. That’s because with two occurrences on each side, that totals four overlaps! So I added my own dampening factor for words that occur multiple times. After some testing, I found .7 to be the sweet spot. With this improvement, the text summarizer was performing on par with or better than most text summarizers I found online. Excellent!
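One plausible shape for that fix is a geometric discount, where each repeat of a word contributes .7 times as much as the occurrence before it. This is my own guess at the mechanism; the exact scheme isn’t spelled out here:

```javascript
// Damp repeated words: the n-th occurrence contributes 0.7^(n-1),
// so a raw count of 2 becomes 1 + 0.7 = 1.7 instead of 2.
const WORD_DAMP = 0.7;

function dampedCount(count) {
  let total = 0;
  for (let i = 0; i < count; i++) total += Math.pow(WORD_DAMP, i);
  return total;
}
```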


By the end of this process it was 5 AM and I was nowhere near anything that I could give my father as a gift. Luckily, Ace Hardware opened at 6 AM, so I was able to run in, buy some crayons, and draw a card. Then I purchased him some Sia coins and called it a night. ¯\_(ツ)_/¯

Dramatization of card

The next week, after having my code reviewed by my Fullstack Academy instructor, Nick, I uploaded my package to NPM, so it’s open to use in any JavaScript project. If you’re interested in using it, here’s a link. If you’re looking to get your hands dirty in the code, here’s a link to the GitHub repo; feel free to fork it and make your own optimizations.

MRW someone makes it this far in the blog post

Thanks for reading! I hope that this offered some insight into TextRank. Here are some links for things that have helped me along the way and further reading:

TextRank

How to post an NPM package

PageRank

Introduction to Neural Networks

Fullstack Academy


