Afaan Oromo Corpus

Technical data from a 2 million word dataset

Axüm Labs
Apr 8, 2014

After the overnight success of EthioType, a mobile keyboard for Amharic and Tigrinya input, we at Axüm Labs are turning our attention to other writing systems in the Horn of Africa.

A vital part of creating any text input prediction engine is a corpus: a large, structured set of texts used as the source of linguistic data. Massive Arabic, Swahili and Somali corpora have been freely available on the web for some time. However, digital texts in the Oromo language, Afaan Oromoo, were scarce.

The Axüm Group, with the help of a computer scientist and numerous ebook authors, has created one of the world’s largest Afaan Oromoo corpora to date.

Special thanks to Prof. Kevin Scannell, Mootummaa Kabaa, Mootii Tufaa, and Daani’eel Dibaabaa for their significant contributions.

Do you read Afaan Oromoo? Help us spell check by commenting on Google Drive

Our frequency analysis shows that the 1,000 most common words in Afaan Oromoo account for over 75% of all word occurrences in the corpus. This is attractive from a programming perspective, because a predictive algorithm could find a word by simply running down the list of most common words until it reaches one that matches what the user has typed so far.
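To make the idea concrete, here is a minimal Python sketch of a frequency list, a coverage figure, and a prediction that runs down the ranked list. It is an illustration only, not the engine we are building: the function names are hypothetical, the sample tokens are placeholders rather than real corpus data, and splitting on whitespace is a crude stand-in for proper tokenization.

```python
from collections import Counter


def build_frequency_list(text):
    """Return (word, count) pairs, most common first.

    Assumption: tokens are lowercase, whitespace-separated strings.
    A real corpus would need punctuation handling as well.
    """
    return Counter(text.lower().split()).most_common()


def coverage(freq_list, n):
    """Fraction of all word occurrences covered by the n most common words."""
    total = sum(count for _, count in freq_list)
    top = sum(count for _, count in freq_list[:n])
    return top / total if total else 0.0


def predict(freq_list, prefix, k=3):
    """Run down the ranked list and return the first k words starting with prefix."""
    matches = []
    for word, _ in freq_list:
        if word.startswith(prefix):
            matches.append(word)
            if len(matches) == k:
                break
    return matches


# Placeholder tokens, NOT real Afaan Oromoo data.
sample = "ba ba ba bo bo ka ka ka ka ma"
freq = build_frequency_list(sample)

print(coverage(freq, 2))   # share of occurrences covered by the top 2 words
print(predict(freq, "b"))  # most frequent words that start with "b"
```

Because the list is sorted by frequency, the first matches found are also the most likely candidates, so the loop can stop as soon as it has enough suggestions.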

With more advanced logic, it should be relatively easy to develop an engine with high predictive accuracy for Afaan Oromoo. We are also enlisting volunteers to manually check our wordlist for spelling errors.

The full corpus is publicly available and is constantly being updated, along with corpora for 1,872 languages, by Prof. Scannell through his Crúbadán Project. Email him for the raw data.

Are you a programmer with an interest in implementing technologies for Africa? Intern with us!

Axüm Labs

The R&D arm of the Axüm Group. Accelerating mobile technologies in the Horn of Africa.