Can Speech Technologies Save Indigenous Languages from Dying?

Joshua Yang
7 min read · Nov 5, 2020

With the incredible rise of big data and machine learning in the past decade, speech technologies have made significant advances in enabling computers to understand human languages. But not all languages enjoy such an abundance of data. Applying data-oriented approaches to preserve small endangered languages can be unintentionally detrimental to these already vulnerable speech communities.

A respeaking activity with a Kunwinjku speaker in Northern Australia

In the past few months, the #BlackLivesMatter movement has prompted people to learn about the struggles of Black and Indigenous peoples around the world. There are, once again, calls to decolonise our collective knowledge and scientific practices. As it turns out, this decolonising idea, with its need to put Indigenous and Black voices at centre stage, is especially crucial for language technologies when it comes to revitalisation.

The world’s languages are disappearing

A vast majority of the world’s languages are dying at an alarming rate. It is estimated that 60–90% of languages will vanish by the end of this century [1]. In the late 18th century, there were more than 250 known Australian languages spoken across the continent. Yet today, only 13 Indigenous Australian languages are still being learnt by children.

Languages carry the weight of history. For example, many Australian Aboriginal stories tell of a time when rising sea levels flooded the former coastline of the continent, referring to postglacial events that occurred more than 7000 years ago [2]. A people’s history is passed down through its language. When the language disappears, the intellectual wealth and unique world views its people carry are washed away with it.

Ghil’ad Zuckermann, a linguist who works on language revitalisation in South Australia, put it this way in an interview:

“The loss of language is more severe than the loss of land. Language death in my view means loss of cultural autonomy, loss of spiritual and intellectual sovereignty, loss of soul, if I may use this term metaphorically.”

Language endangerment is more than a loss of language. It is an enormous loss of accumulated knowledge and it is also inextricably linked with the loss of identity.

How are we documenting endangered languages?

As most endangered languages exist only in spoken form, the field of language documentation emerged to preserve records of speech events and support language learning. It aims to create a “digital Noah’s Ark of language” [4] that can serve multiple revitalisation purposes.

However, manual transcription is laborious because oral languages usually lack a standardised writing system. It is estimated that a trained linguist needs 50 to 100 hours to transcribe 1 hour of speech recording, and often longer if the linguist lacks a solid understanding of the language.

This difficulty of transcription, combined with the urgency of language endangerment, makes computational speech recognition a necessary aid to transcription.

Speech recognition in language documentation

Speech recognition is a field that develops methods for computers to capture the words and phrases of spoken language. It is the technology underlying how Siri or Google Assistant understand users’ ridiculous requests.

With access to large amounts of language data, these high-resource speech recognition models use the power of machine learning to process speech. The usual approach first predicts phonemes (sound-representing units), then segments these phonemes into words.
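As a toy illustration of that second step, here is a minimal sketch of segmenting a predicted phoneme sequence into words with a pronunciation lexicon and dynamic programming. The lexicon entries and phoneme strings are invented for the example; real systems use probabilistic decoders over far larger lexicons, but the idea is the same.

```python
def segment_phonemes(phonemes, lexicon):
    """Find a segmentation of `phonemes` into words whose pronunciations
    appear in `lexicon`. Returns a word list, or None if no full
    segmentation exists."""
    n = len(phonemes)
    # best[i] holds a word sequence covering phonemes[:i], or None.
    best = [None] * (n + 1)
    best[0] = []
    for i in range(1, n + 1):
        for word, pron in lexicon.items():
            k = len(pron)
            if k <= i and best[i - k] is not None and tuple(phonemes[i - k:i]) == pron:
                best[i] = best[i - k] + [word]
                break
    return best[n]

# Toy lexicon mapping words to phoneme tuples (hypothetical entries).
lexicon = {
    "hello": ("HH", "AH", "L", "OW"),
    "world": ("W", "ER", "L", "D"),
}

print(segment_phonemes(["HH", "AH", "L", "OW", "W", "ER", "L", "D"], lexicon))
# prints ['hello', 'world']
```

Note how the output depends entirely on the lexicon: phonemes with no matching word yield no segmentation at all, which hints at why this approach struggles when resources are sparse.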

However, without enough data, speech recognition for endangered languages remains challenging. The task is framed as “zero resource” or “almost-zero resource” speech processing in computational linguistics.

Over the past two decades, we have become increasingly fixated on data rather than on understanding the processes behind collecting it. Following the success of high-resource speech technologies in the recent decade, the speech community tends to apply similar approaches to zero-resource speech processing. These usually rely on the phonemic approach described above, which requires hundreds of hours of transcribed speech. When the technology fails, the failure is framed as the fault of these already-vulnerable languages themselves. And that is not cool.

The “big data” narrative could be problematic

In this zero-resource scenario, it is the “land of unseen languages” that data-thirsty machine learning methods set out to conquer, as Professor Bird describes [4]. With the goal of “saving endangered languages”, these computational methods tend to reuse high-resource techniques and treat Indigenous knowledge as data. This thinking trivialises linguists’ input, the transcriptions, dictionaries, and phoneme inventories that often already exist, and it disenfranchises local knowledge authorities [5]. In many cases, Indigenous people have grown resentful of the idea that their languages are mere data ready for the taking.

Another issue is that static language data often depicts a low-resource language as a relic stuck in the past. As Roche [3] points out, one of the problems is that endangerment linguistics is essentialist: “in conceiving of languages as bounded and stable objects of study, it overlooks the dynamic, fluid, and fuzzy nature of language.” This criticism applies particularly to how speech processing methods tend to view their language data. Treating language documentation as a static data problem amounts to claiming authority over what counts as the standardised language. Without an interactive dynamic with living speakers, Bird [4] argues, treating Indigenous knowledge as mere data could actually be detrimental to language vitality and “reenact the causes of language endangerment”.

We can do better.

The quest of supporting language health

To truly hold language documentation accountable for the health of languages, we need community-based approaches. It is necessary to have a transcription pipeline that “work[s] with people to meet their goals of developing curriculum materials” [4]. Drawing on various field experiences working with Indigenous people, our Darwin-based language technology team aims to explore more collaborative transcription methods.

Largely based on Bird’s recommendations on Sparse Transcription [6], we examined a transcription pipeline that factors the needs of revitalisation into its methods and outputs. The pipeline aims to facilitate easy search access to speech, aid manual transcription, and support the creation of teaching materials. It includes the following methods.

(1) Partial Transcription: We dropped the unrealistic commitment to transcribe fully with limited language resources. Partial transcriptions are more than sufficient for the needs of language revitalisation, and this frees us from applying traditional data-thirsty speech recognition methods to small, isolated Indigenous languages.

(2) Word-Spotting: It is natural for linguists and speakers to transcribe at the word level, as words are meaningful units. We therefore propose word-spotting instead of phonemic recognition, which outputs sound-representing phonemes. Word-level outputs let speakers and linguists participate in the transcription process: they can create teaching materials and identify errors as they go.
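To make the idea concrete, here is a minimal sketch of one common family of word-spotting techniques: dynamic time warping (DTW) between a spoken query template and windows of an utterance. This is a generic illustration with synthetic one-dimensional “features” and an invented threshold, not our actual system.

```python
import numpy as np

def dtw_cost(query, frames):
    """Length-normalised DTW alignment cost between a query template
    and a window of speech frames (rows are feature vectors)."""
    Q, T = len(query), len(frames)
    D = np.full((Q + 1, T + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, Q + 1):
        for j in range(1, T + 1):
            d = np.linalg.norm(query[i - 1] - frames[j - 1])
            D[i, j] = d + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[Q, T] / (Q + T)

def spot_word(query, utterance, threshold):
    """Slide a query-sized window over the utterance and report the
    start frames where the DTW cost falls below the threshold."""
    hits = []
    w = len(query)
    for start in range(len(utterance) - w + 1):
        cost = dtw_cost(query, utterance[start:start + w])
        if cost < threshold:
            hits.append((start, cost))
    return hits

# Synthetic example: a 3-frame "word" template embedded in a 5-frame
# utterance (1-dimensional features, purely illustrative).
template = np.array([[0.0], [1.0], [2.0]])
utterance = np.array([[5.0], [0.0], [1.0], [2.0], [5.0]])
print(spot_word(template, utterance, threshold=0.1))  # hit at frame 1
```

Because DTW compares audio to audio, a handful of spoken examples of a word can serve as queries directly, without the transcribed corpora that phonemic recognition demands, which is what makes this family of methods attractive in low-resource settings.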

(3) Respeaking: We incorporated the respeaking task into our word-spotting method. Respeaking is a common linguistic field method that collects careful speech for analysis and transcription. We use respeaking to standardise spontaneous speech recordings and eliminate unwanted variation. Through collaboration with native speakers, respeaking maximises the knowledge we can preserve.

We used the Indigenous Australian language Kunwinjku and English as examples in our experiment, and showed that collaborative methods such as respeaking significantly improve word-spotting performance. In this pipeline, language resources are collected along the way during documentation. Not only is the resulting language resource dynamic, its form can also be defined more fluidly for the purposes of language revitalisation.

So, can speech technologies save indigenous languages from dying?

No, not really. Even if technology helps, it would always be the people themselves who save their own languages.

Neither linguists nor speech technologies can achieve such an ambitious goal on their own. But with the help of communities of Indigenous people, all together in solidarity, maybe they still stand a chance. As Indigenous languages are endangered because of histories of colonisation and forced marginalisation, there is much more locally-meaningful work to be done to put the communities back at centre stage.

Exploring decolonising practices in speech and language technology, as Professor Bird puts it, is “not only the ethical way forward, it is the most effective thing to do”.


[1] Romaine, Suzanne (2007). Preserving Endangered Languages. Language and Linguistics Compass, 1(1): 115–132. doi:10.1111/j.1749-818X.2007.00004.x

[2] Nunn, Patrick D. & Reid, Nicholas J. (2016). Aboriginal Memories of Inundation of the Australian Coast Dating from More than 7000 Years Ago. Australian Geographer, 47(1): 11–47. doi:10.1080/00049182.2015.1077539

[3] Roche, Gerald (2020). Abandoning Endangered Languages: Ethical Loneliness, Language Oppression, and Social Justice. American Anthropologist, 122: 164–169. doi:10.1111/aman.13372

[4] Bird, Steven (2020). Decolonising Speech and Language Technology. COLING 2020. (to appear)

[5] Rice, Keren (2009). Must There Be Two Solitudes? Language Activists and Linguists Working Together. University of Toronto.

[6] Bird, Steven (2020). Sparse Transcription: Rethinking Oral Language Processing. Computational Linguistics.



Joshua Yang

Recent Melbourne CS Master’s graduate with a focus on speech technology, HCI, NLP & Indigenous languages. Hails from Taiwan & Australia.