Principal Components: Vukosi Marivate on Data Science and NLP in Africa

Vukosi Marivate, Ph.D., chair of data science at the University of Pretoria, talks about the advancement of data science practice and education in Africa, especially around low-resource languages and NLP.

Susan Currie Sivek, Ph.D.
CodeX
Nov 19, 2021


Photo by Syd Wachs on Unsplash

Vukosi Marivate, Ph.D., chair of data science at the University of Pretoria, brought his insights and experience to an episode of the Data Science Mixer podcast for a conversation about the status and growth of data science in Africa, as well as some fascinating collaborations around building natural language processing approaches for low-resource languages. He tells us about the Masakhane collaboration and the development of its open-source Python projects, and addresses how the lack of NLP efforts for low-resource languages is not just a data problem, but also a societal problem.

Here are three “principal components” of what Vukosi shared.

The history of language and culture has shaped the current state of NLP.

Our local languages were literally seen as being second rate.

I’m South African, so I come from a place where for a long period of time, our local languages were literally seen as being second rate. So there was not much development in the universities. … Other languages were chosen for the country, given our history of apartheid, to say, “Yes, we’re going to develop only these ones.”

So now, if you’re trying to play catch up, the amount of money that it takes to get back to that point becomes, “The money that we have currently as a country or the whole economy, do we spend it on developing these other, let’s say, nine other languages that are in the country, which are official, or do we spend it on other things in the country? Hey, we have poverty. We have all these other things.” So you can now see, that’s another issue that comes along in there.

An important part of data science education is learning to deal with real-world constraints.

They notice that what’s written on the paper and what actually they have is different.

If you’re a data scientist, this is sometimes a blind spot. We think, “Oh, I have this tooling. I have my training in statistics. I have my training in computer science. I’m going to solve this problem.” When you get into the real world, we have to deal with this.

I teach in the master’s program for data science at the University of Pretoria, and one class I teach is our data science capstone, which takes all of the courses that the students might have done and puts them into a use case. The use case is always with a partner who we believe might benefit from data science, and they also want to take this data science journey, but they don’t come from the computer science department. They come from somewhere else in the university.

For a lot of the students, what they notice is that the partner will write the description and say, “This is what I would like to be done. This is the data that I have.” But once they have their first few meetings with the partner, they notice that what’s written on the paper and what actually they have is different, and then the students will gripe. They’ll say, “My data is not as … I thought I was going to get this.” And this is exactly why we do this.

Linking national data to machine learning to policy making can be challenging.

How do you collaborate with people in government for them to understand these things?

We took some time looking at public education data to see if you could use interpretable machine learning to identify factors that lead to good performance. That was done with South African data from a high school perspective, and then also with Sierra Leone data with one of my students. That was very interesting because, again, you take some things for granted if you’re in different places. South Africa comparatively has a very good national statistics office, and some data is very good. It’s easy to identify, “Hey, I’m looking for population data. I’m looking for health data. I’m looking for education data. How do I connect these three?” But if you go to other countries, this becomes a little bit harder.

And then when you run the interpretable models, you can identify a factor: for example, a school that you know has a very good cafeteria tends to correlate very well with performance being high there. We’re not talking about causation at the moment, but then we’re trying to say, “This could be very useful for policy making.” Now, the thing is, how do you get to policy making? How do you then collaborate with people in that space, in government, for them to understand these things? But we’ve done a lot of work on that. In this case, it then leads to people changing the way they do decision-making.
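To make the idea of "factors that correlate with performance" concrete, here is a minimal Python sketch. It is not the interpretable models Vukosi describes; it is a much simpler stand-in that ranks factors by Pearson correlation with an outcome. The data is synthetic and invented for illustration, and the names `cafeteria_quality`, `distance_to_town`, and `rank_factors` are hypothetical.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def rank_factors(factors, outcome):
    """Return (name, correlation) pairs, strongest absolute correlation first."""
    scores = [(name, pearson(values, outcome)) for name, values in factors.items()]
    return sorted(scores, key=lambda pair: abs(pair[1]), reverse=True)


# Synthetic, invented data: these are NOT real school records.
rng = random.Random(0)
n_schools = 200
cafeteria_quality = [rng.uniform(0, 1) for _ in range(n_schools)]
distance_to_town = [rng.uniform(0, 1) for _ in range(n_schools)]
# By construction, performance depends on cafeteria_quality plus noise,
# and does not depend on distance_to_town at all.
performance = [0.8 * c + rng.gauss(0, 0.1) for c in cafeteria_quality]

ranking = rank_factors(
    {"cafeteria_quality": cafeteria_quality, "distance_to_town": distance_to_town},
    performance,
)
for name, r in ranking:
    print(f"{name}: r = {r:+.2f}")
```

As in the interview, a high score here signals correlation only. It says nothing about whether improving the factor would change the outcome.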

These interview responses have been lightly edited for length and clarity.

The podcast show notes and a full transcript are available on the Alteryx Community.


Susan Currie Sivek, Ph.D.
CodeX

Writer, storyteller, and data geek. Former journalism professor and researcher. Writer, knitter, hiker. she/her