Testing if machines can read
You may have recently read broad claims that systems designed by Alibaba and Microsoft surpassed human reading comprehension scores on Stanford’s SQuAD reading test. Those announcements were almost instantly panned by academics and practitioners, who astutely pointed out just how static and uniform the “test” actually was. The Stanford test relies on questions and answers based on a known dataset to determine comprehension. Q&A tests of reading comprehension, like the college placement exams U.S. high schoolers take, have been around forever, yet neither is that great at determining the intelligence of the participant. Before we test machines for cognition, shouldn’t we check whether they’re literate first? What if we had a new formula to test reading comprehension in machines that doesn’t require question answering? In this new test, any random text could be used and the system wouldn’t explicitly need to be trained to comprehend it.
As an entrepreneur and practitioner in the Natural Language Processing (NLP) space, I’ve spent much of the last six years thinking about whether machines are capable of reading comprehension at the basic level and how to approach testing it. I’d like to introduce some of my ideas, theories and ultimately a new formula I believe can be used to test basic machine reading comprehension. My examples will be purposely simplified, mostly as an exercise in communicating NLP concepts in a way that anyone unfamiliar with that particular area of Artificial Intelligence can understand.
Introducing the Three Rs.
I’ve taken a deliberately autodidactic approach to many of my NLP experiments. In fact, I’ve often referred to the kind of work I do as NLP hacking. I’m not interested in aligning with the consensus views from the world of academic Computational Linguistics to define and develop machine reading comprehension. There are major problems to be solved and it’s very likely that only the most creative solutions will prevail. That being said, my definition of comprehension can be described quite simply as how fast you read (Rate), how well you recognize the words being read (Recognition) and finally how well you can remember what was read (Recall). These are the three Rs.
Rate is the speed of reading. It is a function of the time it takes to recognize words and the total number of words being read. Imagine that you are given a paragraph to read in 120 seconds. If you cannot recognize each word in that time period, then you will not be capable of comprehending 100% of the paragraph. The rate of reading must be fast enough that a literate person could complete the entire paragraph in a reasonable amount of time. A rate that is either too slow or too fast has a direct effect on comprehension.
Recognition is by far the heaviest of the three factors in comprehension. When you first look at text, multiple complex things happen. Your brain first recognizes the language, then recognizes patterns like prose or rhyming words. It also identifies the complexity of the vocabulary and whether the words are real or gibberish. There’s also the logical structure of the sentences, called typology. Linguists might point out even more complexities, but this is not an academic article so let’s not go there. Instead we’ll establish that the bare minimum for recognition is recognizing the word, knowing which language the word is in, knowing what the word means well enough that you can define it or use it in a sentence, and finally knowing whether the word is spelled correctly.
Early childhood books like Dr. Seuss’ The Cat in the Hat contain simple vocabulary, well-formed yet creative typology and a small word count. Because of these factors, young readers, perhaps between the ages of 4 and 7, should be able to read that entire book with a recognition percentage of 80–100% given a relative reading rate of 50 words per minute.
Recall is the last factor of reading comprehension and the one that I think humbles humans the most, because it requires memory. Again using The Cat in the Hat, what would we expect to be the average person’s ability to memorize the entire book word for word and recite it a year later? Of course there will be some with exceptional photographic memories; however, for the vast majority of us the sad fact is that the more time passes, the more we forget what we read.
To achieve 100% recall, a person or machine must also have 100% recognition. While the ability to recognize words might change slightly over time, it will not fluctuate as wildly as recall will. We humans are lucky if we can remember a few passages from our favorite childhood books. Recalling a book word for word, page for page and chapter for chapter is nearly impossible for the average person.
Let’s dig in
I think by now you’re starting to see where I’m going with this. According to many cognitive scientists, reading is the least efficient and effective way for the human brain to learn and retain information. The average adult human can only recall 40–60% of what they read, depending on the size of the text. In this case we’re talking about books, not speeches, song lyrics or other short-form text committed to memory by repetitive practice. Our inefficiency in memorizing texts comes despite a respectable average reading rate of 200–300 words per minute for adults. By contrast, we speak at a rate of 150–200 words per minute, yet the spoken word, for whatever reason, can garner higher recall rates. Our brains, and subsequently our memories, are hard-wired for images, smells, touch and sounds but probably not written text.
“If you wish to forget anything on the spot, make a note that this thing is to be remembered.”
― Edgar Allan Poe
Where do machines come into all of this? Well, if we take the three Rs as a base-level test for reading comprehension, then machines already have humans beat in two of the three attributes. An average programmer can write a program to parse written text and persist it to a relational database with an interface to recall any word, sequence of words, sentence, paragraph or entire page. That stored information is not going to degrade like human memory. It’s total recall. Likewise, assuming base-level recognition (which I will go into in a bit), machines aided by NLP programming can read words at a rate exceeding 50x that of a human.
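To make the "total recall" idea concrete, here is a minimal sketch of parsing text into words, persisting them to a relational database and recalling any run of words by position. It uses Python's built-in sqlite3 module with an in-memory database; the table layout and the function names store_text and recall are assumptions made up for illustration, not anything from an existing system.

```python
import re
import sqlite3


def store_text(conn, text):
    """Split text into words and persist each one with its position."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS words (position INTEGER PRIMARY KEY, word TEXT)"
    )
    # Very simple tokenizer: runs of letters and apostrophes count as words.
    words = re.findall(r"[A-Za-z']+", text)
    conn.executemany(
        "INSERT INTO words (position, word) VALUES (?, ?)",
        list(enumerate(words)),
    )
    conn.commit()
    return len(words)


def recall(conn, start, count):
    """Return `count` words beginning at zero-based position `start`."""
    rows = conn.execute(
        "SELECT word FROM words WHERE position >= ? AND position < ? "
        "ORDER BY position",
        (start, start + count),
    )
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
total = store_text(conn, "The quick brown fox jumps over the lazy dog")
print(total, recall(conn, 1, 3))
```

Nothing stored this way fades with time: the same query returns the same words tomorrow or a year from now, which is the sense in which a machine's recall differs from ours.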
So to recap, we established that basic recognition is determining the language and the meaning of a word. Both of these tasks are very doable with current NLP libraries, thanks in no small part to the application of hidden Markov models (is a grammar or sentence likely?), word embeddings (turn words into numbers and do math) and Bayesian probability statistics (helps the Markov model make predictions) to language processing. But what if we simply said that knowing the definition of a word is sufficient? As an example, the difference in literacy between adult and child readers is very often vocabulary. An educated adult is said to know or recognize between 6,000 and 7,500 words. There are approximately 155,000 words in Webster’s dictionary and a machine can recognize them all.
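The "turn words into numbers and do math" idea can be shown with a toy example. The three-dimensional vectors below are invented by hand purely for illustration; real embeddings are learned from large text collections and have hundreds of dimensions. The math, cosine similarity, is the standard way to compare two such vectors.

```python
import math

# Hand-made toy vectors (an assumption for illustration, not learned embeddings).
TOY_VECTORS = {
    "fox": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "quick": [0.1, 0.2, 0.9],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


fox_dog = cosine_similarity(TOY_VECTORS["fox"], TOY_VECTORS["dog"])
fox_quick = cosine_similarity(TOY_VECTORS["fox"], TOY_VECTORS["quick"])
print(f"fox~dog: {fox_dog:.2f}  fox~quick: {fox_quick:.2f}")
```

With these made-up numbers the two animals land closer to each other than either does to the adjective, which is the kind of relationship learned embeddings are meant to capture.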
At this point some of you may be saying to yourself that the machines don’t really understand what they’re reading. You’d certainly be preaching to the choir if that is indeed your position. I myself believe that written text is more than the sum of its parts. That’s a conversation for another day. The combination of word definitions or meanings creates exponential concepts which are much harder for a machine to recognize. For today, at the most basic level words are their definitions. We will hold the machines to the same standard of basic comprehension that we hold children to.
Let’s do some math!
This is a formula that I wrote that shows the calculation of the rate of recognition. It’s like reading speed and recognition in one nice package. For every word w in a set W, starting with the first word all the way to the last we will apply a recognition process (lambda w) to the word and calculate the time t it takes to perform that process. The sum (Σ) of all of these process times will be divided by the total number of words in the set, giving us the output of lambda w for every word and the average time it took.
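Going only by the description above, one plausible way to write the formula in LaTeX is shown here. The symbol for the average time, the index i and the notation |W| for the number of words are assumptions chosen for this sketch; the original figure may use different notation.

```latex
\bar{t} \;=\; \frac{1}{|W|} \sum_{i=1}^{|W|} t\bigl(\lambda(w_i)\bigr),
\qquad W = (w_1, w_2, \dots, w_{|W|})
```

Here \lambda(w_i) is the recognition process applied to the i-th word, t(\cdot) is the time that process takes, and \bar{t} is the resulting average recognition time per word.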
I applied this formula to some code in Python to bring it to life. I’m not going to post any code snippets here, but you can check out a demo on my GitHub.
I used everyone’s favorite pangram as my input.
The quick brown fox jumps over the lazy dog
The formula with my code recognized “brown”, “lazy”, “jump”, “fox”, “dog” and “quick” as English words and defined each word using the Oxford dictionary API. It also recognized “the”, “over” and “the” as stop-words (common words like determiners and articles) and didn’t define those. Reading and recognizing all 9 words took ~3 seconds in total, for a reading rate of 0.414 seconds per word at a comprehension rate of 100%.
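For readers who want something concrete without leaving the page, here is a minimal sketch of the same idea. It is not the code from the demo: the stop-word set and the tiny DEFINITIONS dictionary are hypothetical stand-ins for a real dictionary service, and the suffix stripping is a crude substitute for proper lemmatization. Timings from a local lookup will be far smaller than those from a network API.

```python
import time

# Hypothetical stand-ins for a stop-word list and a dictionary service.
STOP_WORDS = {"the", "over"}
DEFINITIONS = {
    "quick": "moving fast",
    "brown": "a dark color between red and yellow",
    "fox": "a small wild canine",
    "jump": "to push oneself off the ground",
    "lazy": "unwilling to work",
    "dog": "a domesticated canine",
}


def recognize(word):
    """The recognition process (lambda w): classify a word and define it."""
    token = word.lower()
    if token in STOP_WORDS:
        return {"word": token, "kind": "stop-word", "definition": None}
    # Try the word as written, then with trailing "s" removed ("jumps" -> "jump").
    for candidate in (token, token.rstrip("s")):
        if candidate in DEFINITIONS:
            return {
                "word": candidate,
                "kind": "defined",
                "definition": DEFINITIONS[candidate],
            }
    return {"word": token, "kind": "unknown", "definition": None}


def read(text):
    """Return (results, average seconds per word, fraction recognized)."""
    words = text.split()
    if not words:
        return [], 0.0, 0.0
    results, total_time = [], 0.0
    for w in words:
        start = time.perf_counter()
        results.append(recognize(w))
        total_time += time.perf_counter() - start  # t for this word
    recognized = sum(r["kind"] != "unknown" for r in results)
    return results, total_time / len(words), recognized / len(words)


results, avg_seconds, rate = read("The quick brown fox jumps over the lazy dog")
print(f"{avg_seconds:.6f} s/word at {rate:.0%} recognition")
```

The average returned by read is the sum of the per-word times divided by the word count, as described for the formula; stop-words count as recognized even though they are not defined.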
The implications of programs like this are pretty clear. If optimized to run in parallel (splitting up the work among CPU cores), entire books could be read with basic comprehension in seconds. In a world where many of us must make hard choices about what we have time to read, I can think of many reading tasks where I wouldn’t mind a competent machine’s help. Humans will be better at qualitative comprehension for quite some time, but quantitative reading is increasingly becoming an issue for us.
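As a rough sketch of what splitting up the work could look like, the example below fans a per-word step out over a pool of workers using Python's standard concurrent.futures module. It uses threads rather than separate processes, which suits lookups that mostly wait on a network dictionary service; for CPU-heavy recognition spread across cores, ProcessPoolExecutor from the same module offers the same map interface. The lookup function is a hypothetical placeholder for a real recognition step.

```python
from concurrent.futures import ThreadPoolExecutor


def lookup(word):
    """Hypothetical placeholder for a per-word recognition step."""
    return word.lower().strip(".,;:!?\"'")


def read_parallel(text, workers=4):
    """Run lookup() on every word concurrently and keep the original order."""
    words = text.split()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in input order even though calls overlap in time.
        return list(pool.map(lookup, words))


print(read_parallel("The quick brown fox jumps over the lazy dog."))
```

Because each word is handled independently, the work divides cleanly; how much faster it gets in practice depends on where the time is actually spent.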
We’d all love more time to enjoy a good book, a great journalistic article, an informative news story or entertaining blog. But what about user manuals, a financial prospectus, dozens of technical journals, thousands of emails or purposely obtuse terms of service agreements? Maybe it’s time to start outsourcing the things we don’t want to read to the machines.