WanaData
Hacks/Hackers Africa
Jun 26, 2024

May 2024 Meetup Blog

By Community Coordinators

Hacks/Hackers Africa recently hosted an educational talk by researcher Bonaventure Dossou.

Hacks/Hackers Africa recently hosted computer science researcher Bonaventure Dossou. Dossou was invited to give an educational talk on ML, AI systems and limited resources to Hacks/Hackers members interested in software development, artificial intelligence (AI), research and machine learning (ML). The session, held on 30 May, was attended by 20 Hacks/Hackers.

In his opening remarks, Dossou said that the lack of integration of African languages in AI systems motivated his research. As an example, he noted that the most popular virtual assistants, such as Siri and Alexa, do not support any African language. Dossou commended Google Translate for making the effort to include 15 African languages. However, ‘this is a big minority compared to the number of languages it supports’, he said.

Increasing access and discoverability of African languages

He then delved into his research on incorporating African languages into AI. To improve representation, he began by investigating where data on African languages can be sourced. He worked with other researchers and developers to collect data on African languages through crowdsourcing, adding to the data already available on Hugging Face. This yielded good results and gave Dossou and his colleagues sufficient data for language modelling and for building more Afrocentric language models.

As a result of their success with crowdsourcing data on African languages, the team was able to build a number of datasets:

  1. MasakhaPOS (part-of-speech dataset of 20 languages, ACL 2023)
  2. MasakhaNEWS (news dataset of 16 languages, ICLR 2023)
  3. AfriSpeech (200-hour Pan-African speech corpus for clinical and general domains, covering 120 African accents, TACL 2023)

This initiative revealed a new challenge: the need to increase the discoverability of African language resources. ‘It is one thing to have a lack of data because it doesn’t exist. It is another thing to have a lack of data because you don’t know where to find it,’ he said.

Driven by the need to improve access and discoverability, he worked with a colleague to develop Lanfrica, a repository of African language resources. He also built the first Fon-French dataset for neural machine translation, at a time when none existed. This led to more datasets being created, and attracted more funding and staff to help increase open access to, and discoverability of, African languages.

According to Dossou, these developments led to the belief that AI systems trained mainly on English could be taught to execute tasks in African languages. He called this process transfer learning on African languages.

Experimenting with transfer learning on African languages

To test the theory, Dossou considered how much data is needed in a transfer learning setting. This led him and his colleagues to the ‘A Few Thousand Translations Go a Long Way! Leveraging Pre-trained Models for African News Translation’ project. For this project, the team focused on ways to translate news from one language to another, combining two domains (REL+NEWS) to fine-tune on the aggregation of religious and news domain data.

What this revealed was that innovators do not need large amounts of data. What is needed is a good community of people to produce data for the languages of interest. Several baselines showed that a minimum of 1,000 to 2,500 sentences is needed to adapt a pre-trained model to a new and specific domain, such as news.
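To make the REL+NEWS idea concrete, here is a minimal Python sketch of how a fine-tuning set might be assembled by aggregating a larger religious-domain corpus with a small news-domain sample. It is only an illustration under stated assumptions: the function name `build_rel_news_set`, the placeholder sentence pairs and the corpus sizes are all hypothetical, and the actual fine-tuning of a pre-trained model is not shown.

```python
import random


def build_rel_news_set(religious_pairs, news_pairs, news_budget, seed=0):
    """Aggregate all religious-domain pairs with a small news-domain sample.

    Hypothetical helper for illustration; not taken from the project's code.
    """
    rng = random.Random(seed)
    # Keep only a small budget of news sentences (e.g. a few thousand).
    news_sample = rng.sample(news_pairs, min(news_budget, len(news_pairs)))
    # Tag each pair with its domain so the mix can be inspected later.
    combined = [(src, tgt, "REL") for src, tgt in religious_pairs]
    combined += [(src, tgt, "NEWS") for src, tgt in news_sample]
    # Shuffle so both domains are interleaved during fine-tuning.
    rng.shuffle(combined)
    return combined


# Placeholder (source, target) sentence pairs; real data would be translations.
religious = [(f"rel source {i}", f"rel target {i}") for i in range(10_000)]
news = [(f"news source {i}", f"news target {i}") for i in range(5_000)]

train_set = build_rel_news_set(religious, news, news_budget=2_500)
news_count = sum(1 for _, _, domain in train_set if domain == "NEWS")
print(f"total pairs: {len(train_set)}, news pairs: {news_count}")
```

The resulting list would then be passed to whichever fine-tuning pipeline is in use; the point of the sketch is only that the news portion can be small relative to the rest.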

Training

There are instances where innovators want to train an AI model from scratch while addressing data scarcity, increasing model robustness and ensuring generalisation. In such instances, Dossou advised applying ‘active learning’, which follows the steps below.

Step 1: Start with a small labelled set.

Step 2: Compare, or ‘query’, the small set against a new set that is not labelled. The goal is to speed up the learning process when no large labelled dataset is available for traditional supervised learning.

This is what Dossou applied with the AfroLM project. What this approach revealed was that:

  1. Active learning is very data-efficient and high performing
  2. It is possible to build AI models that are powerful yet data-centric and very efficient
  3. Multitask learning is very promising for African languages
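The two active learning steps described earlier can be illustrated with a minimal Python sketch. It is not the AfroLM implementation. It assumes a toy one-dimensional task, a simple threshold classifier, and a hypothetical `toy_oracle` function standing in for the people who would supply labels. All names (`fit_threshold`, `active_learning`, `toy_oracle`) are illustrative.

```python
def fit_threshold(labelled):
    """Fit a 1-D classifier: the midpoint between the two classes."""
    highest_neg = max(x for x, is_pos in labelled if not is_pos)
    lowest_pos = min(x for x, is_pos in labelled if is_pos)
    return (highest_neg + lowest_pos) / 2


def active_learning(seed, pool, oracle, rounds):
    """Pool-based active learning with uncertainty sampling."""
    labelled = list(seed)  # Step 1: start with a small labelled set
    pool = list(pool)      # examples that are not labelled yet
    for _ in range(rounds):
        if not pool:
            break
        threshold = fit_threshold(labelled)
        # Step 2: query the unlabelled example the model is least sure
        # about, i.e. the one closest to the current decision boundary.
        query = min(pool, key=lambda x: abs(x - threshold))
        pool.remove(query)
        labelled.append((query, oracle(query)))
    return fit_threshold(labelled), labelled


def toy_oracle(x):
    """Hypothetical annotator: the true boundary sits at 0.365."""
    return x >= 0.365


seed = [(0.0, False), (1.0, True)]       # one example per class
pool = [i / 100 for i in range(1, 100)]  # 99 unlabelled examples
threshold, labelled = active_learning(seed, pool, toy_oracle, rounds=8)
print(f"labels used: {len(labelled)}, learned threshold: {threshold:.3f}")
```

In this toy setting the learned threshold ends up close to the true boundary after labelling only a handful of the 99 pool examples, which is the kind of data efficiency active learning aims for.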

In closing his talk, Dossou shared that it is possible to build models that require little data and perform several tasks at the same time. He cited a number of challenges and limitations with the Fon language, which included:

  1. Very limited data available for downstream tasks
  2. High morphological complexity
  3. Not integrated into any existing major technical system
  4. Very few NLP applications are available

However, this did not deter him from crowdsourcing more data and creating open-source adaptable systems.

This was an insightful session in which the audience was encouraged to contribute to, or create, more resources for African languages. The aim is to preserve the languages, make them accessible and discoverable, and advocate for their inclusion in tech innovations. ‘We should engage in more community efforts. Lead more Afrocentric research projects. Build and scale AI techniques proper to African languages’, said Dossou. The audience expressed gratitude to Dossou for sharing his work and for his contribution to the preservation of African languages.

Join Hacks/Hackers today!

Are you interested in how scientists, technologists and journalists can use technology to communicate effectively? Then you are just the Hacks/Hacker we are looking for! Join over 1000 Hacks/Hackers Africa members to get access to ongoing mentorship, global visibility, and professional development.

You can also link up with us on Twitter and Facebook to stay up-to-date on our exciting chapter activities.
