Talking to a crow will be possible in 50 years

Akseli Ilmanen
9 min read · Aug 10, 2023

Yossi Yovel and Oded Rechavi just gave us "The Dr. Dolittle Challenge — can you build an AI that communicates with animals?" In their paper, they discuss the criteria for passing the test and (on Twitter) claim that it's impossible. I think a combination of corvids, neuroscience, multimodal AI, and Wittgensteinian language games can handle the challenge.

Yovel and Rechavi raise three obstacles. For obstacle 1, consider some high-dimensional bat vocalization data. Using unsupervised methods, we may extract informative features from this signal. But given the limits of our human Umwelt (our species-specific sensory perception and way of life), how would we know which parts of the signal are informative for the animal? And how do these features map onto animal contexts such as feeding, sleeping, or mating?
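
To make obstacle 1 concrete, here is a minimal sketch of such an unsupervised pipeline. Everything in it is an assumption for illustration: the file names, the sample rate, the choice of mel features, and the cluster count. The point is that it finds acoustic structure without ever telling us what, if anything, matters to the bat.

```python
# A minimal sketch of the unsupervised approach, assuming a hypothetical folder
# of bat call recordings. We embed each call's spectrogram and cluster the
# embeddings -- without knowing which features, if any, matter to the bat.
import numpy as np
import librosa
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def call_embedding(path, sr=250_000, n_mels=64):
    """Load one bat call and summarise its mel spectrogram as a fixed-length vector."""
    y, sr = librosa.load(path, sr=sr)          # bat calls are ultrasonic, hence the high sample rate
    spec = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels, fmax=sr // 2)
    log_spec = librosa.power_to_db(spec)
    # Mean and std over time: a crude, Umwelt-agnostic summary of each call
    return np.concatenate([log_spec.mean(axis=1), log_spec.std(axis=1)])

paths = [f"bat_calls/call_{i:04d}.wav" for i in range(1000)]   # hypothetical dataset
X = np.stack([call_embedding(p) for p in paths])

X_low = PCA(n_components=10).fit_transform(X)                  # compress to informative axes
labels = KMeans(n_clusters=8, n_init=10).fit_predict(X_low)
# 'labels' now groups calls by acoustic similarity -- but nothing here tells us
# which clusters (if any) correspond to feeding, sleeping or mating contexts.
```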

Obstacle 3 starts with the famous Wittgenstein quote, "If a lion could talk, we could not understand him". I quite like Yovel and Rechavi's version of this: "If cats do not talk about their feelings with each other or find puns funny, we will never be able to ask them how they feel or explain that 'ChatGPT' already means 'CatGPT' in French". Like obstacle 1, this gets at the problem of an "alien context". If a species' communication signals are evolutionarily derived from its unique Umwelt, how can we map one Umwelt onto another using those same communication signals as the medium?

Let me start with my response to obstacles 1 and 3, and address obstacle 2 later. The wording of an "alien" context nicely captures what's missing here: a shared environment on planet Earth. That shared environment gives us a much better starting place. Instead of insisting that two Umwelten are fundamentally incompatible, we can ask what similarities they share. If a lion and a blue whale could speak, I am confident I would understand the lion better. It lives on land, is closer to me in size, and (probably) has a more similar time perception. To get a feeling for this, accept my hypothesis that the Umwelt of a zebra finch is characterized by faster time perception than ours. When their song is slowed down 4x, it seems much more interpretable.
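
You can try this yourself. A minimal sketch, assuming a recording saved as zebra_finch_song.wav (a placeholder name) and the librosa and soundfile libraries:

```python
# Slow a zebra finch song 4x while keeping its pitch.
import librosa
import soundfile as sf

y, sr = librosa.load("zebra_finch_song.wav", sr=None)   # hypothetical recording
y_slow = librosa.effects.time_stretch(y, rate=0.25)     # rate < 1 stretches the song in time
sf.write("zebra_finch_song_4x_slower.wav", y_slow, sr)
```

(Simply slowing playback, as in most online demos, also lowers the pitch; time-stretching keeps the pitch and only stretches the tempo.)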

A shared environment also points at (🫵) a crucial aspect of language: referentiality. Referentiality is not unique to humans. Phenomenal work by Toshitaka N. Suzuki shows that when Japanese tits hear snake-specific alarm calls, they show enhanced visual attention to snake-like objects (Fig 1). It seems very hard to explain these findings without a mental image playing a functional role on the receiver's side.

Fig 1. — Taken from Suzuki (2021).

The same paper also discusses evidence for compositionality in Japanese tits, another key ingredient for communication with animals. Yet the title of this blog post is "Talking to a crow", not "Talking to a Japanese tit". Corvids are better candidates because they are more cognitively flexible, are natural statisticians, and show human-directed kleptoparasitism.

Kleptoparasitism is a feeding strategy in which one animal deliberately steals food from another. For example, ravens are known to preferentially follow foraging wolves or approach the gunshot sounds of human hunters. In both cases, they want to steal from prey killed by a heterospecific animal (wolves or us).

Not only do we (humans) provide an affordance for corvids in terms of finding prey, but we also play a referential role in their alarm calls. Work by Yuiko Suzuki and Ei-Ichi Izawa showed that one type of alarm ka-ka call of large-billed crows may represent unfamiliar human males (cluster C in Fig 2). Interestingly, when females or familiar males approached the crows, the crows did not respond with any ka-ka calls. This suggests they could differentiate between male and female humans and between familiar and unfamiliar individuals, and that they found this distinction ecologically relevant.

Fig 2. — Taken from Suzuki and Izawa (2023).

The key difference between the snake-specific alarm call in Japanese tits and cluster C in Figure 2 is how recent it is. Whilst a snake-specific alarm call is likely the outcome of a slow evolutionary process, it's plausible that cluster C is evolutionarily recent, or even reflects behaviour that was learned by an individual large-billed crow in Japan and passed along in crow culture. Depending on how quickly crows can develop referential calls to something new, maybe we can establish bidirectional communication with them. I will return to this point at the end.

Now as a reward for getting this far, watch this short video of a “crow [that] can talk like a human”.

Of course, from this (edited) video, I am not convinced that the crow understands what "I don't know" or "Walter" mean. But nor am I convinced that it, or the many other corvids on YouTube, imitates human words without any regard for the human context in which they were heard.

One of the criteria for the Dr. Dolittle Challenge is that we should communicate with animals using their own endogenous signals: in the crow's case, ka-ka calls and the like. Yovel and Rechavi have good reasons for this criterion; through associative learning, one can train a dog to associate hundreds of human vocal commands with actions.

We can now address obstacle 2 and the associative "Walter" calls together. Obstacle 2 is methodological. It gets at the ecological distinction between signals and cues: the former evolved to transfer information from a sender to a receiver, whilst the latter may carry information but did not evolve to do so. AI-based clustering methods might pick up on the statistical patterns in cues but miss the signals.

It's time for neuroscience. Consider recent work by Katharina Brecht, Stephanie Westendorff and Andreas Nieder. They trained crows to respond to the presentation of a specific cue (a "go-cue") with a vocalization and to refrain from vocalizing when another cue (a "catch cue") was presented. A correct vocalization within three seconds of the go-cue was counted as a "hit"; if the crows did not vocalize in response to the go-cue, the trial was counted as a "miss". In successful "hit" trials, neurons in the crows' nidopallium caudolaterale (NCL), a functional analogue of the mammalian prefrontal cortex, were highly active just before vocal onset (Fig. 3A). The authors' interpretation was that these neurons are involved in the cognitive control of "volitional" vocalizations.
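
To make the logic of Fig. 3A concrete, here is a minimal sketch of the kind of peri-event alignment such analyses rest on. The function and the tiny dataset are my own invention, not the authors' code or data: spike times from "hit" trials are aligned to vocal onset, and a ramp in the pre-onset bins would echo the pre-vocal NCL activity they report.

```python
# Peri-event time histogram (PETH) around vocal onset, with made-up data.
import numpy as np

def peth(spike_times_per_trial, event_times, window=(-1.0, 1.0), bin_size=0.05):
    """Average firing rate (Hz) around an event such as vocal onset."""
    bins = np.arange(window[0], window[1] + bin_size, bin_size)
    counts = np.zeros(len(bins) - 1)
    for spikes, event in zip(spike_times_per_trial, event_times):
        aligned = np.asarray(spikes) - event          # centre spikes on vocal onset
        counts += np.histogram(aligned, bins=bins)[0]
    return counts / (len(event_times) * bin_size), bins

# Hypothetical 'hit' trials: spike times (s) per trial and per-trial vocal onsets (s)
hit_spikes = [[2.31, 2.88, 2.95, 3.02], [1.10, 1.71, 1.79]]
hit_onsets = [3.0, 1.8]
rate, bins = peth(hit_spikes, hit_onsets)
# A rise in 'rate' in the bins just before 0 s is the signature of pre-vocal activity.
```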

Fig 3. — Adapted from Brecht, Westendorff and Nieder (2023).

The researchers also wanted to verify that the activity in the NCL was specific to volitional vocalizations and did not simply reflect the acoustic signature of vocalizing. They compared the volitional vocalizations with task-unrelated vocalizations and found that whilst the neuronal firing differed markedly, the calls themselves were acoustically very similar (Fig 3B). This suggests the NCL activity did not code for the acoustic features of the vocalizations.

This control provides great inspiration for our "Walter-Water experiment". Assume "Walter" and "water" sound acoustically similar to a crow (something we could test with an auditory discrimination task). Let's say we trained a crow to repeat the words "Walter" or "water" when we say them and subsequently fetch us either the toy doll Walter or a bottle of water. (Ironically, this task would not be too different from Wittgenstein's primitive language of the builder and his assistant in §2 and §8 of 'Philosophical Investigations'.) Now suppose that, at some point in the future, we are recording from all the right places in a crow's brain, and we find similar neural trajectories coding for the similar acoustic features of "Walter" and "water", but dissimilar neural trajectories reliably detected when the crow fetches the doll Walter or the bottle of water. At that point, we can neuronally dissociate the acoustic features of a signal from its semantic (referential) features. We might even have a mesoscale mechanistic story of how the acoustics-coding activity triggers the semantics-coding activity.
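
A sketch of how that dissociation could be quantified, using simulated stand-ins for the recordings (nothing below is real data; the numbers of trials, neurons and acoustic features are arbitrary). Chance-level decoding of "Walter vs. water" from the acoustics alongside high decoding from the neural population would be exactly the dissociation we are after.

```python
# Decode the semantic label (doll vs. bottle) from acoustics and from neural activity.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_trials, n_neurons, n_acoustic = 200, 120, 40
labels = rng.integers(0, 2, n_trials)            # 0 = Walter trial, 1 = water trial

# Simulated stand-ins: acoustic features nearly identical across labels,
# neural activity carrying a label-dependent ("semantic") component.
acoustic = rng.normal(size=(n_trials, n_acoustic))
neural = rng.normal(size=(n_trials, n_neurons)) + 0.8 * np.outer(labels, rng.normal(size=n_neurons))

acoustic_acc = cross_val_score(LogisticRegression(max_iter=1000), acoustic, labels, cv=5).mean()
neural_acc = cross_val_score(LogisticRegression(max_iter=1000), neural, labels, cv=5).mean()
print(f"decode Walter-vs-water from acoustics: {acoustic_acc:.2f}, from neural activity: {neural_acc:.2f}")
# ~0.5 from acoustics, well above 0.5 from neurons: the words sound alike,
# but the brain tells them apart.
```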

Now, with the help of continuous wireless neural recordings, we can put the trained crow back into a semi-naturalistic environment. One day, at timestamp X, we notice that the crow is at a birdbath, calling for its friends to join (as in the video above). It might be communicating anything from "come here" to "come here, here is water", or just "pay attention to me". How would we know? Well, what if at timestamp X we found neural activity very similar to the activity recorded when the crow fetches water for us? That would make the reading "come here, here is water" more plausible. And since we are recording both neural and audio data at timestamp X, we can work out which parts or features of the crow's vocalizations map onto the water-semantics neural activity.
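
One simple way to operationalise "very similar neural activity" is template matching against the water-fetch pattern learned in the lab. A minimal sketch with simulated data (the template, the recording, and the threshold are all invented for illustration):

```python
# Match the continuous birdbath recording against the trial-averaged "water-fetch" pattern.
import numpy as np

def semantic_match(continuous_activity, template, threshold=0.6):
    """Cosine similarity between each time bin's (z-scored) population vector and the template."""
    z = (continuous_activity - continuous_activity.mean(0)) / continuous_activity.std(0)
    t = (template - template.mean()) / template.std()
    scores = z @ t / (np.linalg.norm(z, axis=1) * np.linalg.norm(t) + 1e-9)
    return np.where(scores > threshold)[0]          # time bins resembling water-semantics

rng = np.random.default_rng(1)
template = rng.normal(size=100)                     # water-fetch pattern over 100 neurons
continuous_activity = rng.normal(size=(5000, 100))  # stand-in for the birdbath recording
continuous_activity[2500:2520] += template          # inject a "water-like" episode
print(semantic_match(continuous_activity, template)[:5])
# Bins flagged around timestamp X would support the reading "come here, here is water",
# and we can then ask which features of the simultaneous ka-ka audio co-vary with them.
```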

To differentiate between signals and cues (obstacle 2), one has to find the functional mappings between vocal communication and the behaviour it coordinates. The Walter-Water experiment offers a stereotyped version of how neural-behavioural experiments can investigate these mappings. Yes, we violate the Dr. Dolittle criteria by not relying solely on crows' endogenous signals. But if we identify matching water-semantics neural activity both in our constrained task and in the crows' naturalistic environment, the associative-learning objection loses its weight.

We still face obstacles 1 and 3: why should we assume that "water" is part of corvid referential communication? We shouldn't assume it, but it's definitely worth testing. More generally, we might develop behavioural tasks with a range of elements, including tools (stick, stone), diet (worms, nuts, fruit, carrion), conspecifics (nestlings, individual crows), and heterospecifics (individual humans, individual dogs).

I have chosen examples that can serve as compositional units of a longer vocalization. For example, a crow on separate occasions could make two vocalizations, signalling either "come here, worms here" or "come here, carrion here". Having dissociated neural activity coding for worm-semantics or carrion-semantics in the same syntactic context ("come here, …") would be invaluable. All of these examples are also physically instantiated, making them more easily and reliably identifiable by vision- or posture-based classifiers. Combining such multimodal AI (see Fig 4 and Rutz et al. 2023) with neural recordings, and constraining it with priors from our neural-behavioural experiments, we might even be able to classify social events and map social relations across individual crows — social network analysis on steroids.
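
As a toy sketch of what that downstream social mapping could look like, suppose the multimodal pipeline emits classified events between identified crows (the event types and individuals below are invented for illustration). Those events can be turned directly into a weighted social graph:

```python
# Build a weighted social network from hypothetical classified interaction events.
import networkx as nx

events = [
    ("Crow_A", "Crow_B", "food_share"),
    ("Crow_A", "Crow_B", "recruitment_call"),
    ("Crow_B", "Crow_C", "recruitment_call"),
    ("Crow_C", "Crow_A", "allopreening"),
]

G = nx.DiGraph()
for sender, receiver, event_type in events:
    if not G.has_edge(sender, receiver):
        G.add_edge(sender, receiver, weight=0, types=set())
    G[sender][receiver]["weight"] += 1
    G[sender][receiver]["types"].add(event_type)

# A first pass at "who matters" in the group
print(nx.degree_centrality(G))
```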

Fig 4. — Taken from Rutz et al. (2023).

In 50 years, we might be able to artificially generate the ka-ka equivalent of "come to human Josh (not Sara), worms and fruits there". That will only be the beginning. Just as slowing down zebra finch song by 4x made it seem more interpretable, we will figure out transformations that convert the rules and features salient to crows into structures more intelligible to us. Of course, the result will not map exactly onto human language, but with bidirectional curiosity and referential interaction in shared environments, we will communicate.

Wittgenstein's notion of a language game can even give us some intuition about what that communication will look like. Imagine a scenario where we want to verify whether we are using the ka-ka equivalent of "water" correctly. Naively, we had falsely assumed that the crows' calls for "water from a birdbath" and "water from a rain puddle" are the same. We would have to learn the rules for when each specific ka-ka call is appropriate. In Philosophical Investigations, Wittgenstein explores rule-following with the example of a pupil learning a series of natural numbers (§185). Having previously been shown examples of a "+1" series, the pupil is now instructed to carry on from 1000 with "+2". When he writes 1000, 1004, 1008, 1012, the teacher says, "What are you doing? You should have added two." The pupil might respond, "But I went on in the same way." Indeed, for the first three numbers he did go on "in the same way". Should it have been apparent to him that he should apply the "+2" rule to every number in the series? Wittgenstein's point is that inferring the rules (or 'correct' language use) relies on others pointing out mistakes. Inference + social feedback = language game.
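
One way to make this underdetermination concrete (a toy sketch, not Wittgenstein's exact setup): two rules can agree on every example a teacher has demonstrated and still diverge once the series moves past the demonstrated range.

```python
# Two rules consistent with the same finite set of examples, diverging beyond them.
def teacher_rule(n):
    return n + 2                            # "add two", everywhere

def pupil_rule(n):
    return n + 2 if n < 1000 else n + 4     # also fits every demonstrated example

demonstrated = [2, 10, 998]                 # examples shown below 1000
assert all(teacher_rule(n) == pupil_rule(n) for n in demonstrated)

print(teacher_rule(1000), teacher_rule(1002))   # 1002 1004
print(pupil_rule(1000), pupil_rule(1004))       # 1004 1008 -- "But I went on in the same way"
```

No finite set of examples pins down the rule; only feedback on new cases does.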

Talking to crows will require such bidirectional feedback mechanisms. After a century of operant conditioning, where animals are rewarded or punished based on task performance, I find the mental image of a crow pressing a “correct” or “wrong” buzzer based on our performance in a crow-language game quite amusing.


Akseli Ilmanen

Interested in brain evolution, corvids and more. Find my full blog and podcast (about Embodied Artificial Intelligence) ⬇️ https://linktr.ee/akseli_ilmanen