From Digital Divide to Language Divide: Language Inclusion for Asia’s Next Billion

Thinking through language divides in online platforms and what we can do to reduce them

Published in Words About Words, Dec 9, 2015

Originally published in The Good Life in Asia’s Digital 21st Century, edited by Digital Asia Hub and the Berkman Center for Internet and Society. Republished with permission.

Xu Bing’s “Book from the Ground Up”. Image CC BY-SA Hrag Vartanian

Access the Internet today in English, and one can find a plethora of content: news, cat videos, social media sites, and search engines all optimized for English speakers. The same could be said for Mandarin, Arabic, Korean, and Japanese speakers, all of whom enjoy a broad array of content and dedicated social networks in their respective languages, such as Sina Weibo, KakaoTalk, and Line. But accessing the web in Ilokano, a minority language in the Philippines, or Bihari, a minority language in India, leads to a very different experience; almost no major websites support these languages. Users can technically access the sites, but without the ability to understand and interact with the content, these websites remain practically unusable to the speakers of such minority languages.

While conversations around Asia’s digital divide have focused largely on mobile technology and digital literacy, long-standing language divides will continue to exacerbate issues of inequality and access for Asia’s “next billion” Internet users. This essay argues that work to close digital divides and improve digital literacy must be paired with efforts to close language divides in Asia. It then argues, from both an experience design and a technological standpoint, what must be done to make this latter work sustainable and successful: namely, activating human communities and shifting how we structure linguistic data on the web.

Digital Ghettos Bound by Language

Initiatives to close digital divides have taken at least two forms. The first focuses on improving access to communications technologies, including but not limited to mobile phone distribution, an approach made increasingly affordable by falling phone prices thanks to designs specific to emerging market needs. The informal market has opened pathways for low-cost phone distribution and repair, and small businesses have emerged to provide Internet cafes in both rural and urban parts of Asia.

Other initiatives concentrate on digital literacy, with a special focus on improving these skills for women and young people. While phones and computers are increasingly available to the next generation of technology users, the skills necessary to use these devices can remain limited. Access to phones correlates with formal education levels, often meaning that existing gender divides in more traditional societies exacerbate the gap (Davidson, 2011). This prevents women from picking up needed skills such as word processing, social media usage, and basic privacy and security practices. Organizations like TESDA in the Philippines, the National Digital Literacy Mission in India, and ICDL Thailand have made digital literacy training and certification the focus of their work.

It is easy to imagine that getting people online and communicating via digital technologies can help ease inequalities, and evidence has largely demonstrated this. Access to mobile phones and the Internet correlates with increased opportunities for building livelihoods (Sife et al., 2010) and accessing educational materials (Valk et al., 2010). However, a significant divide remains: language. Asia’s next billion Internet users speak hundreds, even thousands of languages, and most of the web’s most important content and conversations will be locked away from them, just as their content and conversations will be locked away from the world.

For example, Wikipedia, a thought leader in language access and frequently packaged with emerging market initiatives, supports hundreds of languages, but as translation depends on the Wikipedia community, some languages have more content than others. Articles available in the Konkani language of India number in the hundreds, compared to hundreds of thousands in Hindi and more than a million in Vietnamese (Wikipedia, 2015). Basic services like Google, Facebook, Line, WeChat, and WhatsApp support dozens of languages in their user interfaces on the high end, and only a small handful on the low end.

Without improved language and writing script support, new netizens run the risk of living in digital ghettos created by their native tongues. Any online actions they engage in or media they create will be largely invisible and unappreciated by those outside their cultural-linguistic spheres. This can have significant effects, for instance, on human rights advocacy, which can depend so heavily on using social media and email to raise awareness amongst international news sources. Writing on the role of language in fostering attention for human rights, journalist Sarah Kendzior described the hierarchies of language implicit on the web:

When an Uzbek activist decides to share her ideas on the Internet, she has to balance linguistic prerogatives that few in other countries have to consider. If she knows Russian, she has to decide whether writing in Russian — and potentially reaching an international audience as well as the 41 percent of Uzbeks who can read Russian — outweighs not being able to reach non-Russian speaking Uzbeks or seeming to value a foreign language over one’s native tongue. If she writes in Uzbek, she has to choose which alphabet — Cyrillic, to reach older generations and Uzbeks in neighboring former Soviet republics who only know the Cyrillic version? Or Latin, to reach the younger readers who comprise the bulk of Uzbekistan’s Internet users? (Kendzior, 2014)

New Internet users who don’t speak majority languages will likely be unable to participate in global Internet culture and conversations as both readers and contributors; as Mark Graham and Matthew Zook have noted, minority language speakers, especially those from the global south, will experience substantial information inequality online (Young, 2015). Indeed, people’s inability to speak English can significantly affect their very adoption and use of the Internet, even if they are aware of its existence (Pearce and Rice, 2014).

Wang Qingsong’s “Follow Me”. Image CC BY-SA Michael Davis-Burchat.

Closing Language Divides

Closing digital language divides within a region as linguistically diverse as Asia will require a multifactor effort in translating content around the web. I argue that this effort must focus on both developing networks of volunteer human translators, and designing and implementing changes in the web’s technological infrastructure.

Anyone who has used machine translation to translate between language pairs other than English and the Romance languages knows that the quality of machine translation engines is lacking, and improvements are slow to come by (indeed, even for the aforementioned language pairs). New language pairs require larger bodies of translation data before machine translation can be effective, and even then, we should expect plateaus similar to those we’ve seen with existing language pairs. In terms of accuracy, human translation offers significant benefits over machine translation, especially since bilingual individuals can understand the nuances of both languages in a way that machine translation cannot. However, it would be difficult for human translation under the current system, where translators are commissioned for piecemeal requests, to scale swiftly and sustainably enough to support the broad range and increasing amount of content requiring translation.

Rich, motivated translator communities composed of multilingual individuals can help address this issue. For years, fansubbing communities (groups of fans who translate and subtitle video content such as anime, Korean dramas, and American sitcoms) have shown the power and popularity of crowdsourced translation around popular media. People working in teams can translate a video from English to Chinese rapidly, disseminating that content through translators’ networks long before a commissioned translator can do their work. This informal translation work, widespread throughout many Asian countries, often reflects translators’ passions for connecting different language worlds and building their reputations in the community (Zhang, 2013).

More formal efforts have leveraged interest-based communities and others with great success. Sites such as Yeeyan in China, which enables fan translator communities to translate popular Western news outlets into Chinese, and Viki in Japan, which does the same for popular Asian and Western videos, demonstrate both the benefits and potential risks of a crowdsourced approach. These platforms rely on internal reputation systems and on peer and editorial review to ensure success while adopting a relatively open, crowdsourced approach at scale. Smaller networks can be equally effective: as co-founder of the Ai Weiwei English project, I have found that a small community (at most 10) can translate and contextualize foreign language content effectively for an English-speaking readership (the @aiwwenglish Twitter account now has 30,000 followers). Supplementing this work with machine translation support, translation memories, user-generated glossaries, and common dictionaries can help speed up the work of translators while maintaining accuracy.

Activating these communities will require a radical change in how web content is presented on the Internet. Sites like Hypothesis and Genius have demonstrated that the concept of annotating the web and its content broadly can generate both popular interest and interest from funders. We can imagine annotation extended to translation. Indeed, content on Wikipedia, whose design allows for both annotation and translation, benefits from multilinguals who facilitate content between languages and act as critical content bridges (Hale, 2014). Our strong belief at Meedan is that creating a translation layer for social content on the web can help establish translation as a commonplace action (Bice, 2015), such that any and all content can be translated, vetted, and approved with the same familiarity as leaving a comment, sharing, annotating, and/or explaining content. Just as importantly, accessing those translations must be as simple as Chrome’s current interaction for viewing a page through machine translation — in other words, part of the experience, rather than an extra, hidden feature.

Additionally, we must build and adapt technological infrastructures to allow for this work to occur at scale. The first change revolves around the very technology standards of the web, to allow for the wide diversity of languages online to be expressed. Mozilla users in Cambodia, for instance, are likely to see blank boxes rather than Khmer script (Valentine, 2015), due to limited font availability in both browsers and computers. Minority language speakers are even less likely to find a keyboard, whether digital or physical, that supports the particular needs of their writing scripts. And while typographers of languages that utilize Latin letters can access a wide variety of fonts to maximize and improve readability on different devices, the font choices for minority languages remain severely limited.
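To make the font problem above a little more concrete, here is a minimal sketch in Python of how software might notice which writing scripts a piece of text uses, a first step before deciding whether a fallback font is needed. The function name and the sample string are illustrative assumptions, and labeling a character by the first word of its Unicode name is a rough heuristic rather than a standard script-detection method.

```python
import unicodedata
from collections import Counter


def script_counts(text: str) -> Counter:
    """Count characters per rough script label.

    The label is the first word of each character's Unicode name
    (for example "LATIN" or "KHMER"). This is a heuristic: punctuation
    and symbols get labels that are not script names.
    """
    counts = Counter()
    for ch in text:
        name = unicodedata.name(ch, "")
        if name and not ch.isspace():
            counts[name.split()[0]] += 1
    return counts


# Sample Khmer text, written as escapes so it survives any font or editor.
KHMER_SAMPLE = "\u179f\u17bd\u179f\u17d2\u178f\u17b8"

counts = script_counts("Hello " + KHMER_SAMPLE)
print(counts["LATIN"], counts["KHMER"])
```

A page or app that finds a nonzero count for a script such as Khmer could then check whether a font covering that script is available, rather than silently rendering boxes.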

Creative Internet users have always found ways around these strictures. In the Arabic-speaking world, for instance, “Arabizi” has emerged as a form of online Arabic using a combination of Roman letters and Arabic numerals that resemble popular Arabic letters (Yaghan, 2008). We can expect similar strategies amongst emerging language speakers in Asia whose written scripts are supported by neither screens nor device input. These user-driven workarounds should not be seen as a replacement for true language support for the broad diversity of written scripts used in Asia. Rather, they are better understood as an indicator of user need.
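As a small illustration of how Arabizi works, the sketch below maps a few numerals to the Arabic letters they commonly stand in for. The mapping is an assumption based on widely seen usage; conventions vary by region and community, and the function name is hypothetical. Only the digits are converted, since Latin letters in Arabizi do not map one-to-one onto Arabic letters.

```python
# Commonly cited Arabizi numerals and the Arabic letters they resemble.
# Assumed mapping for illustration; real usage differs between communities.
DIGIT_TO_LETTER = {
    "2": "\u0621",  # hamza
    "3": "\u0639",  # ain
    "5": "\u062e",  # khah
    "6": "\u0637",  # tah
    "7": "\u062d",  # hah
    "9": "\u0635",  # sad
}

_TABLE = str.maketrans(DIGIT_TO_LETTER)


def replace_arabizi_digits(text: str) -> str:
    """Replace Arabizi numerals with the Arabic letters they stand for.

    Latin letters are left untouched, so the result is only a partial
    transliteration.
    """
    return text.translate(_TABLE)


print(replace_arabizi_digits("7abibi"))
print(replace_arabizi_digits("3arabi"))
```

Even this partial mapping shows why the workaround signals a need: writers are encoding sounds that the Latin keyboard cannot express, using whatever characters their devices offer.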

Similarly, technology platforms must accommodate expressions of language that do not involve the written word. While typographical support will be important, many of the next billion will not be literate in their native languages, and, indeed, many languages may not have a standardized writing system at all (Bird et al., 2014). While the web continues to be optimized for textual input and consumption, improving designs for audio can significantly broaden who is able to participate online and how. An effective content translation solution must include this, with robust support for both speech to text and text to speech, regardless of the source language.

In the world described above, the barriers we imagine the Ilokano and Bihari speakers encountering when they first log onto the web will start to crumble as more speakers come online and translate different parts of the web. They will be able to read the news, search for educational material, watch funny videos, and contribute to social media conversations. This is not to say that they will automatically be treated equally (power on the Internet depends as much on cultural and political context as it does on language), but they will have a path to closing the gaps of language and material access that do not exist for speakers of English and other majority languages. Thus, the translation work described above will be an important part of closing information and communication divides on the global web.

Image CC BY-SA Yuvi Panda. Languages present: English, Hindi, Tamil, Kannada, Greek, base64 and Morse Code.

References / Links / Resources

Author’s Note: I wasn’t able to include these numbered resources in the original paper so I’m publishing them here. The citations below, however, do appear in the original paper.

For those passionate about issues of language divides and designing technologies to reduce these inequities, I recommend the following resources for further research and inspiration. Many of them are Meedan advisors and friends:

The lectures of Meedan CEO Ed Bice at the American University in Cairo’s House of Translation laid out an important vision for the role of the translator in a networked society and provide much of the conceptual backing of the translation work we do at Meedan. His talk on “Social Media Translation and Technology” outlines how humans and machines can work effectively in service of translation: http://new.aucegypt.edu/news/videos/ed-bice-social-media-translation-and-technology.
Scholar and technologist Scott Hale explores human-computer interaction and issues of language divides on the internet. The intersection of this and his interests in collective action and politics have helped me think about the importance of online translation as one of equity and access, with an important role in movements. His work can be found at http://www.scotthale.net/blog/.
Scholar Katy Pearce has written frequently about divides on the internet and their relationship with power. Her work on language and access in Central Asia touches on the importance of understanding how non-English proficiency can significantly impact adoption of the internet: http://washington.academia.edu/KatyPearce.
Scholar and technologist Steven Bird writes frequently about low resource languages and has developed Aikuma, a technology for recording, sharing and translating languages. His work can be found at http://www.stevenbird.net/.
Scholar and technologist Kevin Scannell explores low resource languages on social media. My favorite project of his, Indigenous Tweets (http://indigenoustweets.com), identifies the first Tweet made in a minority language, as well as the total number of Tweets and users using that language on Twitter. His linguistic work can be found at http://borel.slu.edu/nlp.html.

Author’s Note: Here are the original citations:

Bice, Ed (2015, July 1). 21,000 Miles of Translating Social Media. On Translating Digital Stories for Out of Eden Walk, Paul Salopek’s Slow Journey Around the World. Words About Words (Meedan Medium Channel). Retrieved October 6, 2015, from https://medium.com/meedan-labs/21-000-miles-of-translating-social-media-9b8be45bc323.

Bird, Steven, Lauren Gawne, Katie Gelbart, and Isaac McAlister (2014, August 23–29). Collecting Bilingual Audio in Remote Indigenous Communities. Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, 1015–24. Retrieved October 6, 2015, from http://www.aclweb.org/anthology/C14-1096.pdf.

Davidson, Brett (2011, July 7). Are Mobile Phones Bridging the Digital Divide or Deepening It? Open Society Foundations. Retrieved October 6, 2015, from https://www.opensocietyfoundations.org/voices/are-mobile-phones-bridging-digital-divide-or-deepening-it.

Hale, Scott (2014, June 23–26). Multilinguals and Wikipedia Editing. In Proceedings of the 6th Annual ACM Web Science Conference, WebSci ’14, ACM.

Kendzior, Sarah (2014, October 1). Can Minor Languages Make Revolution? Uzbekistan and the codes of activism on the Internet. The Common Reader. Retrieved October 6, 2015, from http://commonreader.wustl.edu/c/languages-revolution-internet.

Pearce, Katy E. and Ronald E. Rice (2014). The Language Divide: The Persistence of English Proficiency as a Gateway to the Internet: The Cases of Armenia, Azerbaijan, and Georgia. International Journal of Communication 8, 2834–2859.

Sife, Alfred Said, Elizabeth Kiondo and Joyce G. Lyimo-Macha (2010). “Contribution of mobile phones to rural livelihoods and poverty reduction in Morogoro region, Tanzania.” EJISDC, 42(3), 1–15.

Valentine, B. [bennnyv]. (2015, September 17). in my @mozilla browser, I cannot see Khmer script. I can see Mandarin and Burmese, but Khmer shows up as boxes. cc @anxiaostudio [Tweet]. Retrieved from https://twitter.com/bennnyv/status/644495542693707776

Valk, John-Harmen, Ahmed T. Rashid, and Laurent Elder (2010). “Using Mobile Phones to Improve Educational Outcomes: An Analysis of Evidence from Asia.” The International Review of Research in Open and Distributed Learning, 11(1).

Wikipedia (2015). Wikipedia.org. Retrieved October 6, 2015, from www.wikipedia.org.

Yaghan, Mohammad Ali (2008). “Arabizi”: A Contemporary Style of Arabic Slang. DesignIssues, 24(2), 39–52.

Young, Holly (2015). The digital language divide. Guardian. Retrieved October 6, 2015, from http://labs.theguardian.com/digital-language-divide/.

Zhang, Xiaochun (2013, July/August). Fansubbing in China. Multilingual, 30–37.

Written by an xiao mina