How to measure what language a nonspeaking person can understand: is eye-tracking enough?

Innovative methods and surprising findings from a new study of autistic youth with minimal speech.

Autism researchers face a difficult challenge: how to understand the minds of people without communicative language.

Even at age eight, some 30% of autistic people remain minimally verbal (producing no speech, or only single words) [2]. Some who can speak cannot consistently use language in a communicative way. That is, they may not be able to converse, ask questions, or describe their thoughts and feelings in ways neurotypicals can understand.

Developmental psychology offers several tools for studying people who cannot speak. Yet autism researchers have often excluded non-speaking people from studies. As a result, we know little about what language, and other concepts, non-speaking people can understand.

Unfortunately, some parents and teachers assume those who cannot speak comprehend nothing at all. It can be damaging to grow up being treated as incapable of understanding. A few, like Emma Zurcher-Long and Ido Kedar, learn to write, and describe the traumatizing treatment they endured.

In a previous post, I described a method psychologists can use to measure non-speaking people’s language comprehension: eye tracking.

I argued that this method is accessible because it demands no response from the participant: one need only passively view images. Thus, eye tracking can be used even with babies in their first few months of life. Removing the need for an active response matters because of several disabilities found among non-speaking autistic people. They may have difficulty choosing between multiple options, pointing [3], and even making volitional movements [4]. In fact, the less speech an autistic person has, the greater difficulty they may have making even simple hand and mouth movements [3].

But is eye tracking the only or best method?

A new study by Helen Tager-Flusberg’s team [1] suggests otherwise.

What the Researchers Did

The researchers tested 19 participants (4 female), ranging from 5 to 21 years old. Their caregivers reported they were “minimally verbal,” meaning that they were at least 5 years old and lacked fluent spoken language. Specifically, they:

  • Did not use spoken phrases, other than echolalia, spontaneously and communicatively every day.
  • Could say fewer than 30 non-echolalic words and phrases.

Tager-Flusberg’s team used several tasks varying in the demands they placed on participants.

  • MacArthur-Bates Communicative Development Inventory (MCDI), a parent questionnaire. Parents are given a checklist of words young children typically learn, and rate how consistently they believe their child says and understands them. (Tager-Flusberg’s team added additional words more age-appropriate for their participants).
  • Peabody Picture Vocabulary Test (PPVT), a standardized receptive vocabulary test often used to research both typical and atypical language development. Participants interact with an experimenter and must choose between four items, then respond either verbally or by pointing.
  • Looking while listening task [5] presented participants with a pair of pictures and a spoken word naming one of the pictures. Their gaze patterns were measured millisecond by millisecond using an eye tracker and time-locked to the speech signal. Children could demonstrate comprehension by looking at the labeled picture the majority of the time, or incomprehension by looking at the other picture.
  • Touchscreen task was created by the researchers to match the looking while listening task in every way except the response required. Participants saw and heard the same stimuli with the same timing, but had to respond by touching the correct picture.
Image pairs used in the eye tracking and touchscreen tasks. Target words are capitalized.

Crucially, the same words were used in the MCDI, the looking while listening task, and the touchscreen task.

The study aimed to compare these methods for measuring participants’ comprehension of the same words. Which measures would provide the most information and underestimate comprehension the least? It also tested the value of using multiple measures instead of just one: non-speaking autistic participants rarely remain engaged and cooperative for long, so the more comprehension measures used, the more study sessions they must sit through.

Lessons Learned: How not to underestimate non-speaking people’s comprehension

1. Standardized verbal tests don’t work for this population

About a third of participants (32%) could not “establish a basal.” That is, a starting point could not be established for them because they did not get any items right. Another participant could not be administered the test at all.

Another 28% of participants scored at floor (standard score of 20), which provides very little information, besides the obvious: these youth were profoundly language impaired.

By contrast, many participants “untestable” on the PPVT could perform the eye tracking and touchscreen tasks, comprehending over 40% of the words.

2. Measure and take into account nonverbal reasoning ability

Researchers measured participants’ nonverbal IQ using Raven’s Coloured Progressive Matrices (RCPM), a standardized test often used with disabled children and elderly adults. This test is less likely to underestimate autistic people’s IQ than the more common Wechsler tests [6].

Measuring nonverbal IQ helps researchers understand whether participants are developmentally delayed in areas other than language, and to what extent. A non-speaking person with high nonverbal IQ might be expected to show rapid learning and deep comprehension, if not required to use language, and if provided with sufficient accommodation. Under the same conditions, a person with lower nonverbal IQ would have lower performance.

Notably, every participant had higher standard nonverbal IQ than receptive vocabulary scores. That means all participants had higher nonverbal reasoning ability than their language abilities would predict.

Participants’ scores on nonverbal IQ (Raven’s CPM) vs. receptive vocabulary (PPVT-4), separated into a) younger group; b) older group. A ratio score is the individual’s “age equivalence” score divided by their actual age, multiplied by 100; a score of 100 is average.
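The ratio score described in the figure caption is simple to compute; a minimal sketch, with hypothetical ages in months:

```python
def ratio_score(age_equivalent: float, chronological_age: float) -> float:
    """Age-equivalence score divided by actual age, times 100.
    100 means performance exactly at age level; below 100 means delay."""
    return age_equivalent / chronological_age * 100

# Hypothetical example: a 120-month-old performing at a 72-month level.
print(ratio_score(72, 120))   # → 60.0, well below age level
```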

Indeed, while most participants also had delays in nonverbal IQ, one in six (17%) scored average or above average for their age. By contrast, even the highest-scoring participant on the PPVT was markedly delayed, with a standard score 1.5 standard deviations below the mean.

It would be a mistake to judge the learning and thinking capabilities of non-speaking youth based on their spoken language impairment.

3. Modify all tasks to make them more accessible

The looking while listening and touchscreen tasks used more words than any previous study: earlier studies have used fewer than 20 trials, while this one used 84.

Because passively viewing 84 trials would be too boring even for typically developing children, these were broken into three blocks of 28 trials. Importantly, blocks were arranged in order of difficulty. Words learned earliest in typical development (assumed easiest) were presented first, while words learned latest in typical development were presented last. Increasing the difficulty maintained consistent levels of challenge and interest throughout the study and allowed for a rough gauge of participants’ vocabulary developmental level. Finally, it ensured that dropping out early would not lead to underestimating participants’ comprehension: the trials they failed to complete would more likely be ones they’d get wrong anyway.
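The blocking scheme amounts to sorting words by typical age of acquisition and chunking them into blocks. A minimal sketch; the word list and acquisition ages are dummy placeholders, not the study's actual stimuli:

```python
BLOCK_SIZE = 28  # the study used three blocks of 28 trials

# Dummy (word, typical age of acquisition in months) pairs standing in
# for the 84 stimulus words; real norms would come from CDI-type data.
words = [(f"word{i}", age) for i, age in enumerate(range(84, 0, -1))]

words.sort(key=lambda pair: pair[1])  # earliest-acquired (easiest) first
blocks = [words[i:i + BLOCK_SIZE] for i in range(0, len(words), BLOCK_SIZE)]

print(len(blocks), [len(b) for b in blocks])  # → 3 [28, 28, 28]
```

Because blocks run easiest to hardest, a participant who quits after block one has still been tested on the words they were most likely to know.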

The researchers modified the eye tracking task in two different ways to make it easier to pay attention.

Computerized tasks typically use a fixation cross between trials to cue participants to look at the center of the screen; this can work well for some autistic participants [7]. This subtle cue was replaced by a black screen showing a cartoon character from Thomas the Tank Engine, a TV show popular among autistic children.

Long series of identically formatted trials can become boring even for typically developing children. Thus, 5–10 second full-screen cartoon movies were randomly interspersed between trials. These were intended to ensure participants kept looking at the screen.

The research team even modified a standardized test to minimize verbal instructions. Although all problems in the RCPM are nonverbal, instructions are normally presented verbally. The researchers modified the test materials to enable instruction by demonstration instead. The geometric figures were made magnetic, to be placed on a magnet board in the space indicated on the board. The experimenter demonstrated the process in several trials, and sometimes used “hand over hand” teaching with participants who appeared not to understand the demonstration. (The researchers appear unaware of the ethical problems with this teaching method).

This modification was especially impressive because experimenters (for good reason) rarely modify standardized tests.

4. Task Demands Matter

Correlations between measures were moderate to high, ranging from .50 to .80. The highest correlations were between the two measures requiring a choice, the touchscreen task and the PPVT.

Correlations between measures used in the study. Boxes (added by me) indicate the tasks discussed in this post. (The ADOS is a diagnostic measure for autism, and the Vineland measures different aspects of language entirely).

That means a substantial share of participants’ scores varied from task to task. With correlations of .50 to .80, the variance shared between two measures ranges from a quarter (.50² = .25) to about two-thirds (.80² = .64), leaving between a third and three-quarters of the variance specific to each task.

Because so much of a participant’s score reflects their response to the specific task, not just their comprehension, using a single task won’t give you the whole picture.
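The proportion of variance two measures share is conventionally estimated as the squared correlation; a quick sketch of what the study's reported correlations imply:

```python
# r**2 is the proportion of variance shared between two measures;
# the remainder is task-specific variance (plus measurement error).
for r in (0.50, 0.80):
    shared = r ** 2
    print(f"r = {r:.2f}: shared = {shared:.0%}, not shared = {1 - shared:.0%}")
# r = 0.50: shared = 25%, not shared = 75%
# r = 0.80: shared = 64%, not shared = 36%
```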

5. A passive measure isn’t necessarily accessible

One weakness of the looking while listening task is that the results can only be interpreted if participants are looking at one of the two pictures presented. If they are looking at a blank part of the screen, or not looking at the screen at all, we can’t know whether they understood the word, and the data must be thrown out. This rarely happens with typically developing toddlers, but often did with these participants.

Despite the built-in attention-getting modifications, participants often did not look at the screen. Across all participants, a third of trials (33%) were thrown out for this reason. Although individuals varied, 55% of the participants had 50–80% usable trials, while 22% attended to the screen on fewer than half!

Even when they looked at the screen, participants did not necessarily attend to either picture, making still more trials uninterpretable. Altogether, almost half the trials (48%) could not be analyzed.

The authors attributed the loss of data, and its variability across participants, to “heterogeneity in basic attentional processes in this population.” [italics theirs].

As it is, one could only assess comprehension for about half the words, meaning that half of participants’ time was wasted. This measure was so uninformative it would probably be considered unacceptable in an educational or clinical setting.
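The attrition arithmetic, sketched with the study's reported percentages (trial counts are rounded, not taken from the paper):

```python
TOTAL_TRIALS = 84

off_screen   = round(0.33 * TOTAL_TRIALS)  # ~28 trials: not looking at screen
unanalyzable = round(0.48 * TOTAL_TRIALS)  # ~40 trials: off-screen or off-picture
usable       = TOTAL_TRIALS - unanalyzable

print(off_screen, unanalyzable, usable)    # → 28 40 44
```

With only about 44 of 84 words assessable, roughly half the stimulus set yielded no comprehension data at all.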

One can only imagine how little data would remain without the modifications researchers used. Perhaps this group would have been deemed “untestable.”

6. A passive measure can be less accessible than an active one

One might assume — as I did — that a passive measure will always be more accessible than a measure requiring participants to respond. First, for logical reasons: both tasks require attention, while a direct response task places an additional demand. Second, because in developmental psychology studies, children demonstrate understanding of concepts like “object permanence” earlier through passive viewing tasks than through active measures.

The results of this study were an exception to the usual rule.

Surprisingly, most participants were more accurate on the touchscreen task than the eye tracking task, even though the touchscreen task required an additional demand for active choice.

In fact, 61% of participants were more accurate on the touchscreen task than the eye tracking one, while only 11% were more accurate on eye tracking.

Apparently, paying sustained attention is more difficult for this group than making volitional movements.

Researchers also had to throw out less data from the touchscreen task than the eye tracking task. Therefore, the touchscreen task allowed them to test comprehension of more words.

Counterintuitively, for this group the touchscreen task is a better measure than passive viewing. This news will come as a relief to researchers: touchscreen tasks require less complicated equipment and are easier to administer.

7. Parents don’t always underestimate their children’s comprehension

One might expect parents to underestimate their children’s comprehension. After all, autistic children often do not respond to others’ speech (or even their own names) in conventional ways, making it hard to determine whether they understand. Difficulties communicating through gesture, facial expression, and body language accompany their difficulty communicating through speech.

Furthermore, Tager-Flusberg’s team made the questionable decision to exclude echolalia (quotation) from language “said and understood.” Many language-delayed children, including autistic ones, communicate through repetition of their conversation partner’s previous utterance or of lines from favorite TV shows and other media. Thus, much of a child’s understood and communicatively used speech may be excluded from consideration.

Now look again at the accuracy for each participant on the parent survey (dots) as opposed to the touchscreen (stripes) and eye tracking (solid) tasks.

As the correlations of .5 with eye tracking and .6 with the touchscreen task suggest, parent ratings differ surprisingly little from direct measures. While 47% have lower parent survey scores than touchscreen ones, as I’d predict, almost as many (42%) have parent survey scores at least as high. Parents often underestimate their children’s word comprehension, but any specific parent cannot be assumed to do so.

8. Use multiple measures with the same content, but different task demands

The authors conclude:

“New technologies could provide more reliable assessment of language comprehension than the commonly used but more limited standardized tests. However, clear advantages of one method over another did not emerge from this study…Nevertheless…an important avenue for capturing the true potential for language comprehension of minimally verbal children who remain otherwise untestable is to find individualized approaches to testing, using several types of assessment, including methods based on eye-tracking or touch-screen responding.”

Their conclusion that “clear advantages of one method over another did not emerge” does not fit their results. Both the eye tracking and touchscreen tasks could assess more participants than the PPVT, and the touchscreen task provided more data than eye tracking for most participants.

However, using multiple methods did seem to help. Why?

Understanding the role of vocabulary vs. task demands

Other studies have used multiple measures, but these researchers innovated by using the same words for all tasks.

If all measurement methods use different words, then when someone performs better on one test than another, we can’t know whether the difference comes from the vocabulary or the task demands. That means we don’t know whether it reflects the participant’s knowledge (what words they understand) or their performance (the conditions under which they can demonstrate that knowledge).

Suppose Bob performs well on the eye tracking task and is at floor on the PPVT. Does he know the words in the eye tracking task but not on the PPVT? Or, can he show his comprehension on a computerized task through unconscious eye movements, but not in a social interaction through pointing? There would be no way to decide.

Converging evidence

Using multiple tasks with the same words provides converging evidence. If children show evidence of comprehending a word across multiple methods of measurement, we can be more confident that they understand the word. Likewise, if they demonstrate comprehension of a word on none, we can be more confident they do not understand.

Using multiple methods could prevent underestimating the comprehension of those in the middle. Suppose Jane got ten words right on both the eye tracking and touchscreen tasks, the word “pasta” in only the eye tracking task, and the word “bicycle” in only the touchscreen task. She has shown evidence of understanding 12 words (albeit inconsistently).
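Jane's tally amounts to a set union: credit any word demonstrated on at least one measure. The word lists below are hypothetical stand-ins for her actual responses:

```python
# Hypothetical words "Jane" got right on each task.
eye_tracking = {"dog", "ball", "cup", "shoe", "car",
                "book", "baby", "milk", "hat", "key", "pasta"}
touchscreen  = {"dog", "ball", "cup", "shoe", "car",
                "book", "baby", "milk", "hat", "key", "bicycle"}

both   = eye_tracking & touchscreen   # consistent across tasks: 10 words
either = eye_tracking | touchscreen   # evidence on at least one task: 12 words
print(len(both), len(either))         # → 10 12
```

Scoring only the intersection (10 words) would discard real, if inconsistent, evidence of comprehension; the union (12 words) is the less pessimistic estimate.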

Accessibility and individual differences

We have seen that individuals differ in which tasks they find most difficult, sometimes in ways that seem counterintuitive to a developmental psychologist. This population varies a lot in individual strengths and weaknesses. For example, one participant might have difficulty sustaining attention but no difficulty pointing; another might have the opposite pattern. Even with similar nonverbal IQ, different tasks might be accessible for different participants.

Thus, using more measurement methods ensures accessibility for more people.


So, when studying what language non-speaking autistic people understand, is eye tracking enough?

This study suggests two conclusions:

  1. Eye tracking is a good method, but not the most accessible for everyone.
  2. Eye tracking should be just one of many methods used.


[1] Daniela Plesa Skwerer, Samantha E. Jordan, Briana H. Brukilacchio, & Helen Tager-Flusberg (2015). Comparing methods for assessing receptive language skills in minimally verbal children and adolescents with autism spectrum disorders. Autism, 1362361315600146. Open access PDF.

[2] Ericka L. Wodka, Pamela Mathy, & Luther Kalb (2013). Predictors of phrase and fluent speech in children with autism and severe language delay. Pediatrics vol. 131 no. 4. Open access PDF.

[3] Morton Ann Gernsbacher, Eve A. Sauer, Heather M. Geye, Emily K. Schweigert, and H. Hill Goldsmith (2008). Infant and toddler oral- and manual-motor skills predict later speech fluency in autism. Journal of Child Psychology and Psychiatry vol. 49, no. 1, pp. 43–50. Open access PDF.

[4] Lorna Wing and Amitta Shah (2000). Catatonia in autism spectrum disorders. British Journal of Psychiatry, vol. 176 no. 4, pp. 357–362. Open access PDF.

[5] Anne Fernald, Renate Zangl, Ana Luz Portillo, and Virginia A. Marchman (2008). Looking while listening: Using eye movements to monitor spoken language comprehension by infants and young children. In: Irina A. Sekerina, Eva M. Fernandez, & Harald Clahsen (editors), Developmental Psycholinguistics: On-line methods in children’s language processing xviii, pp. 97–135. Open access PDF.

[6] Michelle Dawson, Isabelle Soulieres, Morton Ann Gernsbacher, & Laurent Mottron (2007). The level and nature of autistic intelligence. Psychological Science vol. 18, no. 8, pp. 657–662. Open access PDF.

[7] Nouchine Hadjikhani et al. (2004). Activation of the fusiform gyrus when individuals with autism spectrum disorder view faces. NeuroImage vol. 22, pp. 1141–1150. Open access PDF. I am referring to p. 1148, bottom of the left column.