SMMRY, the Algorithm behind Reddit’s TLDR Bot

Matthew Plaut
5 min read · Apr 30, 2019


As I was scrolling through Reddit the other week, as one tends to do to kill time, I clicked on the comments for an r/worldnews article and noticed a TLDR summary of the article at the top. It caught my attention because it was generated automatically by a Reddit “bot”. That’s insane. A TLDR seems so abstract and non-formulaic, and yet here an algorithm was producing one automatically. I struggled to wrap my mind around this.

So, I did some research into the algorithm and it led me to SMMRY, which is what Reddit uses for this bot. The SMMRY website, www.smmry.com, lays out its algorithm in seven steps:

1) Associate words with their grammatical counterparts. (e.g. “city” and “cities”)
2) Calculate the occurrence of each word in the text.
3) Assign each word with points depending on their popularity.
4) Detect which periods represent the end of a sentence. (e.g. “Mr.” does not).
5) Split up the text into individual sentences.
6) Rank sentences by the sum of their words’ points.
7) Return X of the most highly ranked sentences in chronological order.
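
To make the seven steps concrete, here is a minimal Python sketch of how they could fit together. This is my own guess at an implementation, not SMMRY’s actual code: the function names (stem, split_sentences, summarize), the abbreviation list, and the crude plural handling are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical abbreviation list; the real one used by SMMRY is not known.
ABBREVIATIONS = {"mr", "mrs", "ms", "dr", "prof", "st", "vs", "etc"}


def stem(word):
    """Step 1 (very crude): fold simple plurals together, e.g. cities -> city."""
    word = word.lower()
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word


def words_of(text):
    """Extract the stemmed words of a piece of text."""
    return [stem(w) for w in re.findall(r"[A-Za-z']+", text)]


def split_sentences(text):
    """Steps 4-5: split on . ! ? unless a lone period follows an abbreviation."""
    sentences, start = [], 0
    for match in re.finditer(r"[.!?]+(?=\s|$)", text):
        preceding = re.findall(r"[A-Za-z]+", text[start:match.start()])
        if match.group() == "." and preceding and preceding[-1].lower() in ABBREVIATIONS:
            continue  # e.g. the period in "Mr." does not end the sentence
        sentence = text[start:match.end()].strip()
        if sentence:
            sentences.append(sentence)
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


def summarize(text, count):
    """Return the `count` highest-scoring sentences in their original order."""
    points = Counter(words_of(text))  # steps 2-3: one point per occurrence
    sentences = split_sentences(text)
    # Step 6: a sentence's score is the sum of its words' points.
    scored = [(sum(points[w] for w in words_of(s)), i) for i, s in enumerate(sentences)]
    top = sorted(scored, key=lambda pair: (-pair[0], pair[1]))[:count]
    # Step 7: put the winners back in the order they appeared.
    return [sentences[i] for i in sorted(i for _, i in top)]
```

Calling summarize(article_text, 5) on a string would return five sentences. Note that because this sketch sums points rather than averaging them, longer sentences tend to score higher.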

So what does this all mean? Essentially, at its core, SMMRY ranks sentences by how many frequently used words they contain and returns the top-ranked sentences, with the number returned depending on how long a summary is requested. Awesome, that makes sense. But does it work?

Let’s test it out on a previous blog post by yours truly. I’m sure all you readers out there remember all my blogs in detail, but in case you don’t, this could be helpful. I decided to test the blog post titled Fun with %W because it is the one that could most use a TLDR. Here’s the five-sentence summary that I got:

If the user is prompted to enter in their name in a form, a validation of: validates :name, presence: true would check to see if the user left the name field blank, and if so, the form would re-render with an error message instructing the user that a name must be entered.

Let’s say Michael Scott wanted to give the Dunder Mifflin employees an online form to fill out about the best paper company in the country so that he could prove Dunder Mifflin was the best.

The form would have two fields: name, and best paper company.

If we want a “Name” validation to include something that is more than one word, we cannot use %w. We have to use a simple array instead, which is not the end of the world.

If Micheal wanted employees to enter in their first names only on the form and he did not want any prank names, %w would work.

SMMRY reduced my original blog post by 82 percent with this TLDR. And it does kind of make sense if you have a basic understanding of the topic as a whole. You could read the above after skimming the blog post and get a basic understanding of %W. Awesome! This is especially helpful if you hate reading!
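
For reference, a reduction figure like this can be computed with a couple of lines of Python. This is only a sketch: I am assuming the reduction is measured in words, and SMMRY may well count characters instead. The function name reduction_percent is made up for this example.

```python
def reduction_percent(original, summary):
    """Percentage by which the summary is shorter than the original, by word count."""
    original_len = len(original.split())
    summary_len = len(summary.split())
    if original_len == 0:
        raise ValueError("original text is empty")
    return round(100 * (1 - summary_len / original_len))
```

Under this definition, a 1,000-word post cut down to a 180-word summary is an 82 percent reduction.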

Below is an example of words that were ranked highly in the blog post:

I apparently used the word “name” a lot, so SMMRY ranked it very high. So, up to this point, SMMRY seems like a legitimate tool for creating TLDRs. But are there any problems?
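
A word ranking like this is easy to approximate with Python’s collections.Counter. The sketch below is not how SMMRY necessarily does it: top_words is a hypothetical helper, the plural folding is deliberately crude, and no common words such as “the” are filtered out (the seven steps do not say whether SMMRY filters them).

```python
import re
from collections import Counter


def top_words(text, n):
    """Return the n most frequent words, folding simple plurals into singulars."""
    words = re.findall(r"[a-z]+", text.lower())
    merged = [
        w[:-1] if len(w) > 3 and w.endswith("s") and not w.endswith("ss") else w
        for w in words
    ]
    return Counter(merged).most_common(n)


# Two sentences taken from the summary above.
sample = (
    "The form would have two fields: name, and best paper company. "
    "If Micheal wanted employees to enter in their first names only on the form "
    "and he did not want any prank names, %w would work."
)
print(top_words(sample, 1))  # [('name', 3)]
```

Even in just these two sentences, “name” (counting “names”) comes out on top with three occurrences.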

Nothing is Perfect

Unfortunately, there are issues with the algorithm, but that is to be expected with such an ambitious goal. Let’s look at the same blog post again. Maybe I want to go back and edit it because I forgot to mention how awesome Ruby on Rails is, so I add:

Rails rails rails rails rails rails rails rails rails rails rails rails rails rails rails rails rails rails rails rails.

Clearly this is not a coherent sentence, but I just can’t contain my excitement over Rails! So, I grabbed a five-sentence summary from SMMRY again and got:

It’s week five of my Flatiron School experience and I’m deep into the weeds with Rails.

If the user is prompted to enter in their name in a form, a validation of: validates :name, presence: true would check to see if the user left the name field blank, and if so, the form would re-render with an error message instructing the user that a name must be entered.

Rails rails rails rails rails rails rails rails rails rails rails rails rails rails rails rails rails rails rails rails.

The form would have two fields: name, and best paper company.

If Micheal wanted employees to enter in their first names only on the form and he did not want any prank names, %w would work.

Hmm, something seems off. What about a one-sentence TLDR of my blog post?

Rails rails rails rails rails rails rails rails rails rails rails rails rails rails rails rails rails rails rails rails.

Yikes, that doesn’t really give an accurate representation of my blog post. Maybe SMMRY is not so good after all!
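
The arithmetic behind this is easy to see if we assume, as the seven steps suggest, that each word earns one point per occurrence and a sentence’s score is the sum of its words’ points. The sketch below uses a hypothetical sentence_scores helper and a sentence of one word repeated 20 times (the count is just illustrative); it is not SMMRY’s real scoring code.

```python
import re
from collections import Counter


def sentence_scores(sentences):
    """Score each sentence as the sum of its words' document-wide counts."""
    tokenized = [re.findall(r"[a-z]+", s.lower()) for s in sentences]
    points = Counter(w for words in tokenized for w in words)
    return [sum(points[w] for w in words) for words in tokenized]


sentences = [
    "The form would have two fields: name, and best paper company.",
    "Rails " + "rails " * 18 + "rails.",  # one word repeated 20 times
]
print(sentence_scores(sentences))  # [11, 400]
```

Every repetition both raises the word’s points and adds another copy of that word to the sentence, so the score grows with the square of the repeat count. That would also explain why the opening sentence, which mentions Rails once, suddenly made it into the five-sentence summary.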

The Bottom Line

So, in order to “break” SMMRY, I had to insert a random, incoherent “sentence” made up of a single repeated word. This is cheating. No legitimate article will have a sentence like the one I inserted into my blog post, so this likely won’t ever be a problem as long as the TLDR is of an actual published article. Furthermore, journalists are likely better writers than I am, so they will presumably repeat their key terms more consistently, and the actually important words (unlike “name” from above) will come to the forefront. Excellent.

All in all, I would highly recommend SMMRY to anyone who would rather not read that much but still wants an accurate representation of the reading.
