Speed reading images in under 4000 characters of code

Reedr is a tool that helps you read pages of text more quickly. Here’s how it works: say you have a novel in your hands. Now take out your phone and go to reedr.me. Snap a picture of the text, specify your words per minute, and hit start. Reedr will flash the page at you, one word at a time, until you’re finished.

Reedr in action!

This is an explanation of how a seemingly non-trivial tool can actually be built from simple yet interesting concepts. In fact, the minified code behind it all fits in 4KB:

The entirety of minreedr, the minified version of Reedr.

YHack

First, the backstory: We’re Anthony, Liang, and Vahid, three bright-eyed MIT students who teamed up at YHack. Aside from trying Soylent for the first time and napping on air mattresses, we spent the weekend building Reedr.

Us at YHack!

We had so much fun making this! We decided to code everything from scratch: vanilla JavaScript that can detect, separate, and order words to be presented. Over the course of the weekend we encountered some really cool technical challenges and dealt with the basics of computer vision.

But first — why build a speed reader? Well, it turns out you are actually capable of reading much faster than you normally do, and one technique to achieve that is called Rapid Serial Visual Presentation (RSVP). The idea is that sequentially flashing words in the same place reduces eye movement — increasing efficiency and decreasing eye strain — and has been shown to actually increase reading speed by 33%. Right now there exist services such as Spritz and Spreeder that provide this functionality for computer text: stuff you can copy and paste into the system. We tried to fill the gap for physical pages of text.


Step 1: Getting the image

What’s the easiest way to bring a page of paper to our code? We opted to capture a photo through a phone’s camera and over the Internet, so we set out making a mobile-friendly webpage that accepted an image upload.

The image is represented on an HTML canvas as an array of RGBA values. We use these components for each pixel to first compute a grayscale version and then classify as black or white depending on whether the grayscale value exceeds a threshold. To normalize for exposure, the threshold is decided using the average grayscale value of the whole image. The end result is a binary, black-and-white grid of pixels. This simplifies any further processing but generally preserves what we care about: outlines of letters.

Step 2: Word segmentation

Our aim is not just to identify letters, but to also group them into words.

Line splitting (didn’t work)

Our first approach was based on the observation that separate lines of text have white space between them. Essentially, a row of mostly white pixels would be considered a line break. Once we have a line, we can rotate the process 90 degrees and repeat: within each line, a column (or rather, several columns) of white pixels separate two consecutive words.

We tested this idea out and it seemed to be working well at first. Unfortunately, once we started giving it slanted text and incomplete lines it broke down very quickly, and we realized we needed a method that was more forgiving about the text’s orientation.

Line splitting works in some cases, but fails in others.

Blurring

Instead of separating into lines, we link letters of the same word by blurring the image into blobs of connected black regions. We achieve this using a Gaussian filter: each pixel becomes an average of its nearby pixels (with closer pixels weighted higher) and then again converted to binary by a threshold. We found that intentionally biasing the process to blur horizontally more than vertically helped to avoid merging words from separate lines. This filter gave a pretty satisfactory set of word blobs.

Before and after blurring.

Flood fill

Now that we have an image with groups of black pixels, we’d like our program to keep track of words as distinct objects. This involves algorithmically associating connected groups of black pixels. Thankfully, there exists simple, reliable solution called flood fill for this purpose: choose an arbitrary black pixel and recursively add neighboring black pixels as much as possible, resulting in a word, and repeat until all words have been identified.

At this point, we have extracted a list of word blobs, with each word blob represented as a list of pixel coordinates.

Bounding boxes

In the process of extracting groups of words we have sadly blurred the image beyond legibility! But no reason to fret — we kept a copy of the original image, so all we need is to figure out which portions of the original image are words that should be displayed. To specify these portions, we find a bounding box of all the black pixels for each word. This is as easy as taking the minimum and maximum of each coordinate across all the pixels of a word.

Word segmentation process. From left to right: BW, blurred BW, original with bounding boxes.

Step 3: Ordering words

Great! Now we have all these boxes each cutting out a word of the text. Time to flash them to the reader and we’re done! Just kidding — we still don’t know what order to flash them. In reading text, order is determined by the positions of the words, so our next task is to use that information to sort the word boxes into reading order.

We made a few attempts at this. None of the methods alone worked, but put together they get the job done.

Assigning scores

Our first observation is that, generally, earlier words in the text are above and/or to the left of later words (assuming a left-to-write language). How can we take advantage of this to determine reading order? Perhaps we can give each word an “upper-lefted-ness score” and sort all the words by that score.

Level curves of the score function on a piece of text.

A simple way to score a word box is as follows: given a box whose center has x- and y-coordinates called x and y, its score is x+10y. Note that we weight the y-coordinate higher. This way, if a word far to the right of another is even a bit (one line) higher, it will be considered before the other.

Like a lot of things in life, the idea makes sense in theory, but it breaks down in practice. For example, consider text slanted slightly downward. The last word of one line might be even lower than the first word of the next line. By purely comparing these two boxes there is no way to tell which comes first.

Line splitting, take two

Ah, line splitting once again — this time with word boxes instead of pixels. If we could just separate out a set of boxes in the same line, we sort by x-coordinate and that’s it. So how do we create clusters of boxes in the same line?

Maybe we could use one of many known clustering algorithms, such as k-means clustering, on boxes’ y-coordinates.

Or maybe we could try a more homegrown approach: pick a random word box and every other word on the same line; repeat with the remaining words until finished. How do we tell if two words are on the same line? Here’s a simple potential rule: do the horizontal midlines of each word go through the other word? If yes, they’re on the same line; if no, they’re not. We call this the midline check.

Their horizontal midlines cut through each other, so we would say these words are on the same line.

Unfortunately, this again fails with downward-slanting lines. The last word of one line could be lower than the first word of the next line and thus inseparable into different lines.

In fact, we realize it’s simply impossible to compare two boxes given coordinates only: two boxes with given coordinates may be on the same line on one page of text but on different lines on another. No algorithm could classify both correctly.

Word chaining

Using that last insight, we try to incorporate more more information — the order of previously sorted words — in finding the next sorted word. We call this chaining.

Here’s a simple way of doing it: given the last sorted word box, choose from the remaining boxes the one whose left edge is closest to the last box’s right edge. This approach has a few holes:

  • How do we choose the first word?
  • What do we do at the end of a line?
  • What if the word below is closer than the word to the right (in text with small line spacing)?

Three wrongs make a right

Let’s summarize the strengths and weaknesses of each method:

  • It turns out the scoring method works great for finding the first word of the top line; anything below that is a little iffy.
  • While it’s hard to split entire lines, the midline check is fairly accurate for consecutive words on the same line.
  • Chaining is a good idea, but it lacks a mechanism to choose the first word and needs a more accurate way to decide when to jump lines between consecutive words.

Like puzzle pieces, these fit together to make for a pretty effective box-sorting algorithm:

1. Start with a set of unsorted boxes and a list of sorted boxes (empty at first).
2. Find the first word of the top line by scoring each unsorted box and choosing the optimal score. Add it to the list of sorted boxes. If no words exist at all, we’re done.
3. Take all remaining unsorted boxes that are on the same line as the last sorted box (as determined by the midline check). Of these, choose the leftmost box that is to the right of the last box, but not too far to the right . Add it to the list of sorted boxes. If no such box is found, go back to step 2.
4. Repeat step 3.

And that’s it! We flash the boxes in the sorted order at the desired rate.

The box sorting algorithm.

Step 4: Squeezing into 4000 bytes

We finally had a working Reedr, but we were at a hackathon and sponsor Datto was offering a special prize:

Awarded to the team with the best hack that has source code less than 4,000 bytes.

So we took the challenge: shrink our app of about 15 kilobytes into just 4.

There are obvious and not-so-obvious techniques for minifying a piece of code (which turns out to be a really intriguing problem!). For a start, there are minifier and compressor tools that replace your code with shorter syntax (using smaller variable names, replacing `Infinity` with 1/0, and so on). These minifiers employ some tricky tricks, but there are also minifications that they miss. Here are some we made:

  • Before anything else, style goes out the window (get it…). No CSS.
  • Of course, we did run our code through a JavaScript minifier. Actually, three of them. Each one catches some minification techniques that the others miss. For example, a piece of code that is originally 17292 characters comes out of Closure Compiler as 5179 characters. Running the result through jscompress.com reduces it further to 5135. Anyway, this becomes some sort of manual build process: copy code out of the editor and into a minifier; then copy the output, move to the next tab, and repeat.
  • Another obvious minification is to put all the JavaScript in the HTML file. This lets us drop the “src=…” attribute — precious characters we need to save.
  • It also turns out the HTML file can do without some basic tags, like <html> and <head>. We ended up with just three HTML tags, one after another: <canvas>, <input>, and <script>.
  • Minifiers are great at reducing variables like “firstNumber” to “a.” However, give them an array element like “numbers[0]” and the best they can do is “a[0]” — three extra characters. So, if a certain position of an array is retrieved many times, it’s advantageous to assign “firstNumber = numbers[0];” once and use “firstNumber” whenever it’s needed.
  • Similarly, minifiers don’t shrink names of built-in functions. But with functions as first-class variables in JavaScript, there’s no reason that can’t be done! We happened to use .push() and .forEach() on a lot of our arrays, so we renamed them:
var k = Array.prototype;
k.p = k.push;
k.q = k.filter;
k.r = k.forEach;
k.u = k.concat;
  • The minifier spits out an unintelligible blob of characters, but if you look closely you may be able to spot the keyword “var” here or there. You may realize that you have also spotted an opportunity to vanish four more characters (including the space). The keyword “var” isn’t actually always needed, so try deleting one and run the code. If it doesn’t break, you’re good! (Clearly we’re getting a bit desperate by now.)
  • Finally, constants just have to be close enough. Expressions can be simplified and numbers can be rounded to save space. Euler’s number e is 2.718, you say? Nah, it’s 3. There’s another four characters.

We spent the last two hours of the hackathon eliminating bytes, character by character, until about 10 minutes before the end we managed to dip under 4000. After a quick celebration, we submitted and ended up actually winning the Datto prize!

The minified version, dubbed “minreedr,” ended up with the same core functionality, but not much else.

A simple interface for minreedr: a canvas and a button.

That’s all

With some UI touches, Reedr entered its (albeit imperfect) working state and is hosted at reedr.me. Meanwhile minreedr is a single HTML file with one line of 3999 characters. To put that into perspective, this blog post has 13468 characters. It’s amazing how compact code can be!

This is minreedr. Every character counts.

The code is hosted on Github, Reedr here and minreedr here. Feel free to leave any questions in the comments!