The war had not yet ended in mid-1944, but Vannevar Bush was already anticipating the homecoming of the American troops. As the director of the Office of Scientific Research and Development, his attention had turned from driving the war effort forward to the more delicate task of reintegrating thousands of young men, many seriously wounded on their tours of duty, back into the rhythms of daily life.
Of particular concern were blinded veterans. Although braille and audio books were already in (albeit limited) use, a man whose sight was taken by shrapnel on the front lines would be generally barred from accessing the multitude of print material available to his sighted peers. The familiar comforts of newspapers, bookstores, and handwritten correspondence would fade into a mere memory.
Vannevar Bush recalled that, since the early years of the twentieth century, there had been attempts to build reading machines: devices that converted the written word into sounds. The machines weren’t readers themselves; they neither recognized words nor synthesized speech. They merely converted pictures into other sensory modalities, such as raised bumps (as in the device of Gustav Tauschek, below) or noises. There was every confidence that, with the reading machine executed correctly and enough training time allowed, a man or woman of average intelligence could learn a sound alphabet. By associating distinct noises with letters, they could begin to assemble words in their heads. Intuitively, it was no different from learning to use an optical alphabet, more commonly known as the written word. Brains that could read by sight could probably read by sound.
Of course, a reading machine had never worked before. But given the tremendous advancements in all areas of science during the war years, it seemed reasonable to return to the device once more. If the atom could be split, if the German “Enigma” cipher could be cracked, if radar could extend sight to the oceans and skies, was it any more miraculous to make a blind man read?
Haskins Laboratories, a small research center founded in Schenectady, New York, in 1935 on the independent wealth of biophysicist Caryl Haskins with his collaborator, the physicist Franklin Cooper, was selected to host the project. An early-career psychologist from Yale, Alvin Liberman, was recruited. Liberman would eventually become one of the titans of speech perception and production research. He could not possibly have anticipated, in his transit to the small, independent lab, that the trajectory of his decades-long career would be set into motion by the reading machine.
The researchers began by borrowing and repairing a used Optophone from the American Foundation for the Blind. The Optophone is a device that scans over a line of text and plays a note whenever black ink occludes a light source. When a few channels are used, the working Optophone resembles the comb of a music box sliding over a studded cylinder, playing different notes when different regions of the paper are inked. An example of a simulated Optophone (developed in MATLAB by the author), playing the notes G, E, D, C, and G (one octave lower) over a scanned sentence, is found below.
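For readers who want to experiment, the scanning principle described above can be sketched in a few lines of Python. This is not the author's MATLAB simulation, only a minimal illustration under stated assumptions: the page is a small grid of 0s (blank) and 1s (ink), the line height is divided evenly among five channels, and the channels are tuned to G4, E4, D4, C4, and G3 (the text names the notes but not the octave). The names `CHANNEL_HZ`, `active_channels`, and `optophone` are hypothetical.

```python
import math

# Assumed pitches (Hz), top of the line to bottom: G4, E4, D4, C4, G3.
# The octave is a guess; the article names only the notes.
CHANNEL_HZ = [392.00, 329.63, 293.66, 261.63, 196.00]


def active_channels(column, n_channels=len(CHANNEL_HZ)):
    """Return the channels that see ink (1) in one pixel column, top to bottom."""
    band = len(column) / n_channels  # rows covered by each channel
    active = set()
    for row, pixel in enumerate(column):
        if pixel:
            active.add(min(int(row / band), n_channels - 1))
    return sorted(active)


def optophone(image, column_seconds=0.05, sample_rate=8000):
    """Scan a binary image left to right; return audio samples in [-1, 1]."""
    n_rows, n_cols = len(image), len(image[0])
    samples_per_column = round(column_seconds * sample_rate)
    samples = []
    for col in range(n_cols):
        column = [image[row][col] for row in range(n_rows)]
        freqs = [CHANNEL_HZ[ch] for ch in active_channels(column)]
        for _ in range(samples_per_column):
            t = len(samples) / sample_rate
            value = sum(math.sin(2 * math.pi * f * t) for f in freqs)
            # Divide by the channel count so a full chord stays within [-1, 1].
            samples.append(value / len(CHANNEL_HZ))
    return samples


# A crude 5x3 letter "L": ink down the left edge and along the bottom.
LETTER_L = [
    [1, 0, 0],
    [1, 0, 0],
    [1, 0, 0],
    [1, 0, 0],
    [1, 1, 1],
]
audio = optophone(LETTER_L)
```

In this toy example the first column of the "L" sounds all five notes at once, and the remaining two columns sound only the lowest. The `audio` list could be written to a WAV file with Python's `wave` module to hear the result.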
Optimism for the reading machine was high at the outset. Alvin Liberman wrote that “…the perception of speech was thought to be no different from the perception of other sounds, except, as there was, in speech, a learned association between perceived sound and the name of the corresponding phoneme. Why not expect, then, that arbitrary but distinctive sounds would serve as well as speech, provided only that users had sufficient training?”
“Given that expectation, we were ill prepared for the disappointing performance of the nonspeech signals our early machines produced,” he concluded. The sound-alphabet approach did not yield easy success. Participants who interpreted sentences played through the Optophone could not do it quickly enough to be useful: most could not “read” more than ten words a minute, which is at least twenty times slower than most sighted people can read from paper. Performance did not improve after hours of training. Nor was the slowness a result of the playback speed; when the experimenters made the Optophone faster, the letters only blurred together, making them entirely indistinct and useless to the listener. Liberman called it “an imperspicuous buzz”.
Cooper, Liberman, and the rest of the team had tried a variety of sounds, hoping to find a more efficient signal. After much frustration, they began to look at human speech itself. Although the technology to synthesize appropriate speech from a picture of a word was out of reach, the Haskins group expected to gain an understanding of the principles of the speech signal that made it so easy to understand, even when delivered quickly. Whatever the organizing features of the speech signal were, they could be borrowed and applied to the output of the reading machine.
For the first time, speech could be visualized in exquisite detail thanks to a World War II invention, the spectrograph, which emerged from Bell Labs. The spectrograph represents the energy in a range of frequencies over time, a process similar to the workings of the human ear: within the cochlea, sound is separated into its constituent frequencies, converting the frequency profile of a sound into a spatial pattern of activation along the snail-shell shaped organ. The inventors of the spectrograph had hopes of teaching the deaf to use telephones by reading the images (spectrograms) produced by their device, but this had proved nearly as unworkable as the Haskins reading machine. “As a matter of fact I have not met one single speech researcher who has claimed he could read speech spectrograms fluently, and I am no exception myself,” wrote Gunnar Fant, an early pioneer of synthetic speech.
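The idea of "energy in a range of frequencies over time" can be made concrete with a short Python sketch. It is a digital analogue under stated assumptions, not a description of the Bell Labs hardware: the signal is cut into short overlapping frames, each frame is tapered with a Hann window, and a naive discrete Fourier transform gives the energy in each frequency bin. The function name `spectrogram`, the frame size, the hop, and the two-tone test signal are all illustrative choices; the slow transform is used only to avoid dependencies, and a real analysis would use an FFT library.

```python
import cmath
import math


def spectrogram(samples, frame_size=64, hop=32):
    """Return a list of frames; each frame lists the energy per frequency bin."""
    # A Hann window tapers each frame so its edges add fewer spurious frequencies.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / (frame_size - 1))
              for n in range(frame_size)]
    n_bins = frame_size // 2 + 1
    frames = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = [samples[start + n] * window[n] for n in range(frame_size)]
        energies = []
        for k in range(n_bins):
            # Naive discrete Fourier transform of one bin (slow, dependency-free).
            coeff = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_size)
                        for n in range(frame_size))
            energies.append(abs(coeff) ** 2)
        frames.append(energies)
    return frames


def tone(hz, n_samples, sample_rate):
    """Generate n_samples of a pure sine tone at the given frequency."""
    return [math.sin(2 * math.pi * hz * i / sample_rate) for i in range(n_samples)]


SAMPLE_RATE = 8000
# Test signal: a 500 Hz tone followed by a 2000 Hz tone, 0.1 s each.
signal = tone(500, 800, SAMPLE_RATE) + tone(2000, 800, SAMPLE_RATE)

spec = spectrogram(signal)
bin_hz = SAMPLE_RATE / 64  # width of each frequency bin at frame_size=64
# Frequency of the strongest bin in each frame, in time order.
peak_hz = [bin_hz * max(range(len(f)), key=f.__getitem__) for f in spec]
```

Plotting `spec` as an image, with time on one axis and frequency on the other, gives the familiar spectrogram picture. For this test signal the strongest bin moves from the lower tone to the higher one partway through `peak_hz`.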
Looking at spectrograms of the same phoneme (a unit of sound that distinguishes a word from another in a given language) revealed curious patterns that were at odds with the assumptions driving the reading machine. Whereas the sound alphabet had assumed that each phoneme in a word — the “d”, the “o”, and the “g” in “dog” — could be delivered separately, as its own unique segment, the spectrogram told a different story.
Consider the following classical example with spectrograms of the words “see” and “sue”:
Notice the “s” (boxed in orange) represented in the spectrograms. Before /i/, as in “see”, the consonant has a great deal of energy in the high frequencies. When the vowel that follows is /u/, as in “sue”, it is not only the vowel that changes: there is a noticeably different pattern of energy in the “s”. The acoustic signal for “s” is not a discrete segment, like a letter in a word, that can simply be slid into place beside any other sound. Just like nearly every other phoneme, its acoustics depend on its surroundings.
The phonemes blur into one another by the very nature of speech production. Even as the vocal tract is positioned to make one sound, it is prepared to produce the next. In the above example, there is less high frequency energy in the “s” in “sue” because the tongue and lip formation are already altered during production of the “s” in anticipation of the /u/. Early studies with electromyography, which measures the electrical activity of muscle tissue, suggested that the muscles of the vocal tract could function largely independently of one another, preparing for and executing gestures in parallel.
After two decades spent in a rabbit hole of speech perception studies probing these nuances and intricacies of speech, the Haskins group was forced to conclude that the acoustic representation of phonemes could not be easily split into discrete segments. The acoustic signature of a consonant informed the listener what vowel would follow, and vice versa. It was incontrovertibly clear that speech was a code that played by different rules than anything the Optophone had ever produced. Information bled in time, forward and backward. In this way, speech was more efficient than any rigidly partitioned sound alphabet; no amount of training, and no amount of optimizing, could overcome that glaring limitation of the reading machine. Of course, this raised new questions: if the acoustic signatures of consonants differed depending on what vowel came next, and if the acoustic signatures differed so much by speaker, how could any listener so easily and confidently identify phonemes in speech? Sixty years after these questions were articulated, conclusive answers are still hard to come by.
The dream of the reading machine would not be realized until the digital age, and even then, not through the use of a sound alphabet. But something more fundamental was realized when the aspirations for reading by non-speech sounds crumbled away. In retrospect, Liberman wrote, “Acoustic alphabets cannot become part of a coherent process; I suspect, therefore, that there is nothing interesting to be learned. But speech was always before us, proof that there is a better way.” The discovery that understanding speech is characteristically different than reading, and possibly different from perceiving other kinds of sound, upended assumptions about the brain’s mechanism for converting acoustic signals into a meaningful percept of language. The next decades would see previously unimagined theories of speech perception and the fruition of ideas as divisive as they were progressive — controversies that would define the world of speech research as it exists today.
For further reading and even more history, please see:
Liberman, Alvin M. “On finding that speech is special.” Handbook of Cognitive Neuroscience. Springer US, 1984. 169–197.
Shankweiler, Donald, and Carol A. Fowler. “Seeking a reading machine for the blind and discovering the speech code.” History of Psychology 18.1 (2015): 78.
Liberman, Alvin M., et al. “Perception of the speech code.” Psychological Review 74.6 (1967): 431.