Comparing the State of Data Science in Africa to Other Continents

Mabu Manaileng
Published in Afro AI
9 min read · May 12, 2021

About Glander Baloyi

Glander Baloyi is a graduate of the University of the Witwatersrand, where she studied for a Bachelor of Science in Animal, Plant and Environmental Science. Working in R on data from field research and experiments sparked her interest in Data Science (DS) and Machine Learning (ML).

Through a mentorship programme with Mabu Manaileng, she enrolled in a DS and ML career track with DataCamp immediately after finishing her degree. This gave her hands-on experience of writing code in Python and a smooth transition into the DS and ML space. The career track comprised 23 courses coupled with projects focused on real-world datasets.

Background

Data science is a fast-growing, in-demand career path that can positively transform any industry and deliver valuable impact to companies.

Among those interested in the field, there is a lot of curiosity about whether Africa is falling behind and what kind of future it can look forward to in this space.

This inspired the questions asked in this project, which is the first of many projects Glander will be taking on as she progresses on her journey in DS and ML.

The Problem

In this project, the aim is to gain exposure to real-world data and to test understanding of fundamental data science skills such as cleaning, manipulating and visualizing data (Exploratory Data Analysis) using Python-based tools. We’re also testing the ability to formulate relevant questions from the data, given the problem at hand.

The investigation focuses primarily on measuring how competitive Africa is in the data science field and whether Africa is under-represented compared with other continents. Some questions approach this comparison from the angle of gender and of experience writing code. From the visualizations, insights are drawn about data science across African countries and across continents.

The Data

This analysis uses the 2020 Kaggle Machine Learning & Data Science Survey data, listed below in the Resources section.

Side note:

Most of the questions in the survey offered several options to choose from. A respondent could select more than one answer, so the data frame has multiple columns for the same question. The data was converted to the format commonly known as a tidy dataset, championed by the renowned data guru Hadley Wickham.

This required some more advanced pandas, including the melt function. You can melt the multiple-answer questions into one column while keeping the single-answer questions as identifier columns in the melted data frame. A cross-tabulation function can then be used on a subset of the desired columns to answer each question. Helper functions were created for all of these steps, which came with some helpful Python tricks.
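The melt and cross-tabulation steps described above can be sketched roughly as follows. This is a minimal illustration on made-up data, not the notebook's actual code: the column names (`Q3`, `Q7_Part_*`) only mimic the survey layout, and `melt_question` is a hypothetical helper.

```python
import pandas as pd

# Toy stand-in for the survey: one single-answer column (Q3, country) and
# one multiple-answer question spread over several columns (Q7_Part_*).
# The column names are assumptions that mimic the survey layout.
raw = pd.DataFrame({
    "Q3": ["Nigeria", "Kenya", "Nigeria"],
    "Q7_Part_1": ["Python", "Python", None],
    "Q7_Part_2": ["R", None, None],
    "Q7_Part_3": [None, "SQL", "SQL"],
})


def melt_question(df, id_cols, prefix, value_name):
    """Hypothetical helper: stack every column starting with `prefix` into one column."""
    part_cols = [c for c in df.columns if c.startswith(prefix)]
    tidy = df.melt(id_vars=id_cols, value_vars=part_cols, value_name=value_name)
    # Options a respondent did not select are missing values; drop them,
    # along with the "variable" column that melt adds by default.
    tidy = tidy.dropna(subset=[value_name]).drop(columns="variable")
    return tidy.reset_index(drop=True)


tidy = melt_question(raw, id_cols=["Q3"], prefix="Q7_Part_", value_name="language")

# Cross-tabulate country against language to count selections.
counts = pd.crosstab(tidy["Q3"], tidy["language"])
print(counts)
```

Each row of `tidy` is one (respondent country, selected language) pair, so a respondent who picked two languages contributes two rows.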

EDA

But first, who participated in this survey?

Let’s build some background understanding of the survey participants based on key demographic indicators: age, country, gender, and highest level of formal education.

How is gender distributed across the age groups?

Most of the survey respondents are young, with 20% aged 25–29. A further 18.8% and 17.3% are aged 18–21 and 22–24 respectively, which may indicate that many young people are taking an interest in the data science field. For ages 45–70+, participation barely reaches 5%.

India, the USA, Brazil, Japan and Russia are the top 5 countries by participation. India is the highest of all with 29.2% of respondents, more than double the share of the USA, which follows with 11.2%. Among the African countries that participated, only Nigeria made it into the top 7, with 2.4% of respondents, while the others are below 1%.

Unfortunately, participation is predominantly male: approximately 78.8% of respondents are men, compared with only 19.4% who are women. This is not surprising, as many tech spaces suffer from low representation of women, but it is concerning. Individuals in the category Other are those classified under Prefer not to say, Prefer to self-describe and Nonbinary, and their participation is below 2%.
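As a rough sketch of how the small gender categories might be folded into Other before computing shares, consider the following. The data is invented and the exact label strings in the survey file are an assumption.

```python
import pandas as pd

# Toy gender column; the labels follow the categories named in the text,
# but the exact strings in the survey file are an assumption.
gender = pd.Series([
    "Man", "Man", "Woman", "Nonbinary", "Man",
    "Prefer not to say", "Woman", "Man", "Prefer to self-describe", "Man",
])

small_groups = ["Prefer not to say", "Prefer to self-describe", "Nonbinary"]

# Map each small group to "Other"; labels not in the mapping stay unchanged.
collapsed = gender.replace({label: "Other" for label in small_groups})

# Share of respondents per category, as percentages.
share = collapsed.value_counts(normalize=True).mul(100).round(1)
print(share)
```

Combining the categories keeps every respondent in the data while avoiding three very small groups on the chart.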

Unsurprisingly, most participants have some level of higher education. The largest group holds a Master’s degree (39.2%), followed closely by those with a Bachelor’s degree (34.8%).

Africa vs the World

In this section, we look at intra-Africa dynamics and then investigate how they compare with the rest of the continents.

The data science roles in Africa

In all the African countries, students spike compared with the other roles. This suggests that more and more African students are considering a career in data science. Are we heading towards a more data-driven Africa?

Despite Nigeria’s high participation seen earlier, it is interesting that Ghana takes the lead among respondents who are students. All the roles are below 20%. Zooming into the different roles, South Africa and Ghana have more Data Scientists, with Morocco, Nigeria and Tunisia almost reaching the same level of representation, and Ghana having the lowest proportion of data scientists. Database Engineers are very poorly represented in Africa: a few are present in Kenya, Morocco and Nigeria, and none were found in Egypt, Ghana, South Africa or Tunisia.

How data science roles compare between Africa and other continents

When comparing roles in Africa with those in other continents, we see that every role in Africa is below 10%, yet Africa always does better than Australia. Africa has more statisticians and very few product/project managers. Looking at unemployment, the unemployed are the smallest group in most continents compared with the other roles, except in Asia and Africa, where they are the second highest. Australia is an exception to this observation. Asia, Europe and North America are always in the top 3 for the different roles, with Asia always having the highest proportion.
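The survey records a country rather than a continent, so a comparison like the one above needs a country-to-continent lookup. A minimal sketch on invented data follows; the `CONTINENT` dictionary is hypothetical and incomplete, and the choice of denominator (all respondents) is an assumption rather than the method used in the analysis.

```python
import pandas as pd

# Hypothetical lookup; a real analysis would need every country in the survey.
CONTINENT = {
    "Nigeria": "Africa", "Kenya": "Africa",
    "India": "Asia", "Japan": "Asia",
    "Brazil": "South America",
}

df = pd.DataFrame({
    "country": ["Nigeria", "Kenya", "India", "India", "Japan", "Brazil"],
    "role": ["Student", "Data Scientist", "Student",
             "Data Scientist", "Student", "Student"],
})

# Countries missing from the lookup become NaN, which makes gaps easy to spot.
df["continent"] = df["country"].map(CONTINENT)

# Each cell is a percentage of all respondents (normalize="all"),
# one of several possible choices of denominator.
pct = pd.crosstab(df["role"], df["continent"], normalize="all").mul(100).round(1)
print(pct)
```

With this denominator, continents with many respondents will tend to show higher values for every role, which is worth keeping in mind when reading such a chart.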

Let’s look at programming languages

Programming languages are used differently in the different countries. Although Python is the most commonly used language across Africa and Julia the least, the most and least common languages, respectively, within individual African countries are as follows:

  • Julia and R for Egypt,
  • R and C for Kenya,
  • C and Python for Morocco,
  • Julia and C for Nigeria,
  • Swift and C/MATLAB for South Africa,
  • C and Bash/Other for Tunisia.
  • Ghana always has the lowest proportion for all the programming languages compared to other countries, with JavaScript, R and C as the most and least common languages respectively.

A similar pattern of usage is observed across the continents for the various programming languages. Python spikes in every continent, with SQL always coming second. Surprisingly, Python is used more in Africa than in Asia, Europe and the other continents. Julia and Swift have the lowest proportion of use across all the continents. The top three languages in the studied continents are Python, SQL and R, except in Asia, where Java replaces R and R is used less than in the other continents.

How Africa compares on data science tooling

Scikit-learn is the most commonly used tool in all the African countries, with TensorFlow coming second and Keras third. A similar pattern is observed across continents.

The top three machine learning algorithms in Africa are Linear and Logistic Regression, Decision Trees or Random Forest and Convolutional Neural Networks. In Kenya however, Bayesian Approaches are used more than Convolutional Neural Networks. Across continents, the same trend holds except that in some continents, Generative Adversarial Networks and Bayesian Approaches seem to overtake Convolutional Neural Networks.

Learning data science in Africa

Coursera is the most common platform on which people in Africa get their data science education. In South Africa, however, Udemy seems to be the most commonly used platform, while in Kenya Kaggle Learn Courses are at the same level as Cloud-certification programs. Fast.ai and Cloud-certification programs seem to be the least common platforms, barely reaching a 5% proportion, except in Tunisia, where the proportion of Cloud-certification programs is slightly above 5%. The top 5 platforms in Africa are Coursera, Udemy, Kaggle Learn Courses, DataCamp and Udacity.

Looking at the gender dynamics in Africa

Do different genders prefer different programming languages?

Python is the most used programming language in Africa. The top 5 languages are Python, SQL, R, Java and Javascript. The least common languages are Julia and Swift.

Lessons Learnt

  • Often there are more insights to be drawn from a visualization, and one question may lead to many more. Exploring them may not be a bad idea (it makes working with the data more fun), but staying focused on the main topic is more effective.
  • How you ask your question influences what goes on the x- and y-axes. For example, comparison questions such as the distribution of roles across African countries and across continents may seem to share a structure, with country or continent on the x-axis. However, switching the x- and y-axes around answers different questions. As a result, choosing a visualization that can answer more questions is always better.
  • Referring back to the original information about the dataset you are working on is important: it ensures your findings are consistent with the data provided and helps you avoid wrong interpretations or conclusions.
  • Building on the previous lesson, and depending on what you are working on, dropping rows and columns should be the last thing you consider, as some data may then go unaccounted for. For this dataset, I found that combining similar categories (which could’ve led to imbalanced data) into one category was a useful solution. Of course, there is no one solution for everything.
  • What you learnt from your source of knowledge is not all there is. Do not be afraid to use your search engine to learn new things.
  • Lastly, reading the documentation for the library or function you are about to use is always more effective than reading a random solution on the internet.
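The second lesson, about swapping axes, can be illustrated with a small sketch. This is one possible reading of that lesson on invented data: the same two columns answer different questions depending on which one is used as the grouping axis.

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["Ghana", "Ghana", "Ghana", "Kenya", "Kenya"],
    "role": ["Student", "Student", "Data Scientist",
             "Student", "Data Scientist"],
})

# Country as the grouping axis: within each country, what share holds each role?
by_country = pd.crosstab(df["country"], df["role"], normalize="index")

# Role as the grouping axis: within each role, what share comes from each country?
by_role = pd.crosstab(df["role"], df["country"], normalize="index")

print(by_country.round(2))
print(by_role.round(2))
```

In both tables each row sums to 1, so plotting one or the other changes which comparison the reader sees first.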

Call to Action

Glander Baloyi is available for real-world data science work. This analysis is a small part of the data science training she received through DataCamp, ranging from EDA with pandas to machine learning and deep learning.

Her core competencies include importing, cleaning, manipulating and visualizing data, and machine learning (ML) using Python.

You can reach her on:

Resources

  1. Hadley Wickham
  2. Tidy Data (had.co.nz)
  3. 2020 Kaggle Machine Learning & Data Science Survey | Kaggle
  4. Learn R, Python & Data Science Online | DataCamp
  5. Code for this analysis is publicly available on this Kaggle Notebook
