At Wikibabel, our mission is to equalize information access by bringing the most valuable Wikipedia content to the world’s underserved languages.
Case study: Swahili
In Tanzania, 70% of the population is rural and has little access to traditional educational materials. There is mobile connectivity in many areas, through which Swahili speakers (some 50–100M humans) could reach a wealth of information, if only it were in a language they understood. One example of the discrepancy: the Swahili Wikipedia’s Math section contains 117 articles, compared to 31,444 in English. With advances in machine translation, we can close this information gap.
The problem of choice
Our approach leverages Google Translate to make English Wikipedia articles accessible to underserved communities. But with a limited number of Translate credits, which of the 5.5 million English articles do we translate?
We’d love to estimate demand for information that has insufficient supply. In an ideal world, we’d have the ability to determine which Swahili search engine queries returned no useful results. Lacking that, we used Wikimedia’s list of 10k vital articles as a starting point, and developed prioritization principles.
- Cultural universality. We want to be careful not to accidentally impose our culture and values. Therefore, categories that are highly variable among cultures, such as politics, may not be the best use of our limited $s. Categories such as science, math and technology are fairly independent of culture, so those are in.
- Economic Relevance
- Medical topics are genuinely scary without a human in the translation loop, for fear of misdiagnosis or bad treatment. Translating the medical procedures category (vaccine, surgery, etc.), medical fields of study, or molecular biological processes is still on the table. For specific disease pages with symptoms and diagnosis, though, we’ll fire off an initial translation but won’t publish until a human or two has looked it over.
- Not into spending money on the weapons section. Sue me.
- Use what exists. Swahili Wikipedia exists and has 38K pages. Many are one-liners, but we should find the ones whose length is comparable to the English version and incorporate them into our corpus.
- Page Rank. The links in Wikipedia are crucial both to a sticky experience and to a deeper understanding than isolated pages would give. We prioritize topics by the number of incoming links from our seed set, so the most traversed rabbit holes on Wikipedia are there to be discovered in Wikibabel too. We want to add knowledge that completes other knowledge.
- Inspiration! Our favourite category, enhancing motivation, wonder and creativity. It also conflicts with the first principle of cultural universality. Does what inspires my sciencey nerdy Western(ish) self have anything to do with what inspires someone in rural Tanzania? Maybe in narrow categories. Got a kick when I glanced at the translation log and saw Hubble Telescope and Voyager, and who wouldn’t? Though in general, which people and topics are inspiring varies personally and culturally, so please leave in the comments what inspires you!
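
For the curious, the link-counting idea from the Page Rank principle above can be sketched in a few lines of Python. This is a rough illustration rather than our actual pipeline: it counts incoming links from the seed set only (plain in-link counting, not the full PageRank algorithm), and the function name `rank_by_seed_inlinks`, the `seed_links` dictionary and the article titles in it are all made up for the example.

```python
from collections import Counter


def rank_by_seed_inlinks(seed_links, already_translated=frozenset()):
    """Rank candidate articles by how many seed articles link to them.

    seed_links: dict mapping each seed article title to the titles it links to
                (hypothetical input; in practice this would come from a link dump).
    already_translated: titles to skip, e.g. the seed set itself.
    """
    counts = Counter()
    for source, targets in seed_links.items():
        # Deduplicate so repeated links from one page count only once.
        for target in set(targets):
            if target != source and target not in already_translated:
                counts[target] += 1
    # Highest in-link count first; ties broken alphabetically for stable output.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy data, invented purely for illustration.
seed_links = {
    "Hubble Space Telescope": ["Galaxy", "NASA", "Telescope"],
    "Voyager 1": ["NASA", "Solar System", "Jupiter"],
    "Telescope": ["Galaxy", "Lens", "NASA"],
}

ranking = rank_by_seed_inlinks(seed_links, already_translated=set(seed_links))
for title, inlinks in ranking:
    print(f"{inlinks:2d}  {title}")
```

Articles at the top of the list are the ones most seed pages point to, so translating them first fills in the most dangling links for a reader following the rabbit hole.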
So, here’s what we’ve got translated so far (we’re rate limited though, so more to come!).
We’re excited to find out all the ways in which we’re wrong and what information is truly valuable and interesting to our users. And we’re measuring that!
Fundamentally, we want to make the information we use and love available to everyone. We can’t think of a better place to put our spare cycles. In the case of Swahili, Google Translate did the hard work of training amazing machine translation for a language with sparse training data; let’s use this to close some important gaps!
Got spare cycles? Catch us at firstname.lastname@example.org