Creating Multilingual Digital Infrastructures

This essay shares reflections and learnings from research and mapping efforts to build multilingual digital infrastructures.

Sneha
Metadata Learning & Unlearning
Oct 3, 2022

Cover Image of the State of the Internet’s Languages Summary Report, 2022. Multiple tiles, showing maps, photos and illustrations from the report.
Screenshot of cover from “State of the Internet’s Languages” report. Licensed under CC NC-SA 4.0.

The growing discourse over the last several years on the need for a multilingual and diverse internet, particularly in the wake of the pandemic, has opened up broader questions about our engagement with digital infrastructures in general. As a substantive body of work in digital rights and policy, research and computing, among others, has illustrated, especially in the last decade, developing inclusive and accessible digital content and spaces depends on addressing several systemic barriers to access, language being a persistent one among them. Challenges in reading, writing and speaking in multiple languages on digital interfaces remain prevalent across the world, especially for marginalised and non-dominant communities. While there are several important and longstanding efforts to address these challenges, they remain limited by larger infrastructural challenges in access to and use of the internet and digital technologies.

The Internet reinforces global language disparities

Ethnologue, a global reference publication on languages, notes that as of 2022, 3,045 languages are endangered, which is 42.5% of all living languages. Among the 7,151 spoken languages that we know of today, just 23 account for more than half the world’s population. These numbers show how stark the asymmetries in the development and use of languages are, and how large the problem is. These disparities are reflected on the internet as well, where only a select number of languages are available on digital interfaces in multifunctional ways. And while this may present as a technological problem, the gaps actually predate the digital turn: they result from several forms of systemic social exclusion, many of which are located in colonial infrastructures of knowledge production. Today, with the emergence of new forms of colonisation of data, efforts to identify and address language inequities have become even more significant.

Wikipedia content and number of speakers for the 10 most widely spoken languages in the world. (Population estimate: Ethnologue 2019, which includes second-language speakers.) Screenshot taken August 12, 2022. Licensed under CC NC-SA 4.0.

Initiatives to document knowledge gaps

This blog post shares reflections and learnings from two initiatives that work on these issues and map efforts to build multilingual digital infrastructures. The first is a recently published report on the ‘State of the Internet’s Languages’ (STIL), led by Whose Knowledge?, in collaboration with the Oxford Internet Institute and the Centre for Internet and Society (CIS), along with over 100 people across the world. Through an exploration of data and stories on how people read, write and speak online in multiple languages, the report offers an overview of some of the key issues related to language inequity online. Building on the premise that ‘language is a proxy for knowledge’, the report reflects on how human knowledge, especially that produced in non-dominant and marginalised languages, remains underrepresented on the web, and documents several ongoing efforts to address these challenges. See a preview of the report here:

State of the Internet’s Languages Report Video, Licensed under CC NC-SA 4.0.

The second initiative is a set of short-term research projects on Wikimedia platforms and communities in India, undertaken by the Access to Knowledge programme at CIS. The research studies cover an array of topics, including systemic gaps like the gender bias and divide in Indian language Wikimedia projects, debates on open access and reuse across Wikimedia and Galleries, Libraries, Archives and Museums (GLAM) initiatives, and forms of multilingual pedagogy and content creation across diverse projects. Read a compilation of the projects completed between 2019 and 2021 here.

Compilation of research studies by the Access to Knowledge Program, Centre for Internet and Society. Shared on Wikimedia Commons under a CC BY-SA 3.0 license.

Language reflects larger power dynamics in society

Languages don’t exist in isolation; they grow with people and other languages. The STIL report talks of dominant and marginalised languages in multiple global and local contexts of power and privilege, and this is illustrated in the data narratives and stories in the report. Many of them speak of forms of interlanguage marginalisation, the relationship with colonial languages, and how this affects access to critical information and educational content, social and economic mobility, and community identities and memory. For instance, as noted in the report, “the stories elaborate on online experiences and challenges of Indigenous languages such as Chindali, Cree, Ojibway, Mapuzugun, Zapotec, and Arrernte, minority languages like Breton, Basque, Sardinian and Karelian, as well as regionally and globally dominant languages like Bengali, Sinhala and Indonesian (Bahasa Indonesia) and Arabic among others,” reflecting the evolving relationships among languages in changing socio-cultural and geographical contexts.

Wikipedia’s local-language prevalence. Are the most detailed representations of a country written in a local language (orange and beige), or a foreign language (blue)? (Language data: Unicode CLDR 2019). Screenshot taken August 12, 2022. Licensed under CC NC-SA 4.0.

Linguistic barriers also disproportionately affect marginalised and vulnerable groups, as they often open up space for harms such as misinformation, hate speech and gender-based violence. The growing body of work on the gender gap in terms of content about and participation by women across Wikimedia projects also illustrates that these gaps are tied to several factors such as access, infrastructure, and capacity-building. The limited availability of existing Indian-language resources on gender, sexuality and feminism in digital forms is an added impediment to addressing these gaps.

The role of technology

Learnings from these two projects also offer a multi-layered and intersectional perspective on infrastructure, both conceptually and politically, because the technologies we use speak a ‘different language’ from the one we use to communicate with each other. For instance:

  • the need for development of keyboards, fonts and software in various languages,
  • the number of languages that are easily accessible on your smartphone,
  • lack of accurate translations into and from Indigenous and regional languages or for conceptual terms related to gender/sexuality/feminism, and
  • accessibility of content and devices for persons with disabilities.

Efforts in preserving, sourcing, digitising, translating, sharing and (re)using content in multiple languages (especially on open knowledge platforms like Wikimedia projects) are beset by multiple challenges, including legal and cultural factors. As mentioned earlier, these are not just technological gaps, but historical knowledge gaps that affect marginalised communities disproportionately and reinforce existing power inequalities.

While many of these challenges with the development of digital infrastructures remain prevalent across the world, these technologies and platforms also have multiple affordances that may lend themselves effectively to addressing them. While the data narratives and maps in the STIL report offer an important macro perspective on the scale of these knowledge gaps, the stories present several embodied, experiential narratives of languages in the digital space, whether through speech, signs, emojis or text. The significance of orality and voice is emphasised across many narratives, which also question the primacy of the textual on the internet and digital interfaces. As illustrated in these stories, several communities across the world also use digital tools and platforms, including social media, in creative ways to circumvent the barriers created by lack of access and low resources. Efforts in multilingual content creation on Wikimedia projects such as Wikimedia Commons and Wikidata also illustrate the need to map such existing content across diverse formats, and to invest in the creation of inclusive and accessible structured and linked data. The multiplicity of forms and formats in content creation in diverse languages is therefore an important factor in rethinking diversity in access and use.

Multilingual Internets require an intersectional approach

These are a few key learnings from the two projects; many of the concerns highlighted by diverse communities across the world are also informed by larger contexts of ownership and regulation of digital infrastructures. As mentioned earlier, while the projects are focused on linguistic disparities, the challenges are indicative of longer historical knowledge gaps. Efforts in building multilingual internets therefore need to take an intersectional approach and, importantly, foreground the community-led initiatives that have long been working to address these gaps. The learnings from these projects, and indeed continued work in these areas, aim to inform research and practice across different spaces, including but not limited to language-related computing, archival practice, open educational resources and newer fields like digital humanities. Collaborations across these spaces would further support researchers, creative practitioners, academia and policymakers in their efforts to develop and foster open, decolonial and multilingual digital infrastructures.

Puthiya Purayil Sneha is a researcher with the Centre for Internet and Society (CIS), India. Her areas of interest and work include digital media and cultures, methodological concerns in arts and humanities practice and pedagogy, and access to knowledge.
