A Deep Dive into African Data Science on Kaggle

Published in Afro AI · 13 min read · Apr 29, 2021

Background and Introduction

It started with a random LinkedIn DM from Kusasa to Mabu asking for data science advice. That led to a long advice call and a surprise offer of mentorship, followed by periodic calls about progress (and frustrations) in applying knowledge gained from courses to projects. Now it has led to what should be the first of many awesome projects by Kusasa (watch this space).

Kusasa is a data science entrant with a background in Geographic Information Systems (GIS) administration and analysis. He has worked in i) the fleet monitoring and management space, focusing on administering the GIS suite of applications and on geospatial analyses, and ii) the wildlife monitoring and management space, focusing on providing technical support to partners using their suite of applications. He developed a passion for machine learning (ML) and data science, which led to the random LinkedIn DM. Mabu and Kusasa designed a practical programme for making the transition to data science via DataCamp. Using the handy Career Tracks, which strategically group modules towards a specific career, they identified Data Scientist with Python (comprising 23 courses) as one of the first tracks to tackle.

This article is all about Kusasa’s first data science project: his Exploratory Data Analysis (EDA) of the current state of data scientists as sampled by the Kaggle Survey. In this project, his intention was to take an African perspective when asking questions of this Kaggle Survey sample, and to use Python-based tools to answer these questions as he hones his data wrangling and EDA skills for the data science workflow. He found some of the resulting answers to be expected, some informative, others inspiring, and a handful shocking.

The Dataset

The dataset chosen for this analysis is the 2020 Kaggle Machine Learning & Data Science Survey listed in the resources section below. It’s messy survey data that requires some solid Python and pandas knowledge to clean up and make useful. It’s also an important dataset for insights into not only how African data scientists compare to those in trendsetter countries, but also the intra-Africa dynamics.

Data Munging and Feature Generation

The raw data comprised 355 columns; slicing the dataframe extracted the 176 columns of interest to Kusasa for this project.
The wide format of some of these 176 sliced columns was not ideal for his project. The pandas groupby() method, combined with an aggregation such as count(), was very handy for generating long-format variables.
To minimize the grey areas with job roles, Kusasa extracted the subset of rows that have “Data Scientist” as the person’s job title.
As he was interested in developing a comparison between the trendsetting countries and African countries, Kusasa extracted the subset of rows with people only from the trendsetting and African countries. Furthermore, he generated a new feature which states whether the person is from a trendsetting country or an African country.
Some values had undesirable strings in them, such as the word “years” in “3–4 years”. Kusasa replaced every occurrence of “years” with an empty string.
He dealt with null values on an analysis-by-analysis basis. For example, some data scientists did not state their compensation, so for the compensation-related analyses he dropped the rows with null values.
To gain a spatial appreciation of the distribution of data scientists, Kusasa needed to acquire a layer which has the geometric locations of his countries of interest, and then join this geometric layer to his engineered dataframe.
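The filtering and cleaning steps above can be sketched roughly as follows. This is a minimal illustration on a made-up toy dataframe, not Kusasa's actual code (which lives in the Colab listed under Resources). The column names (`country`, `job_title`, `coding_years`, `compensation`), the country sets and all the values are assumptions chosen for readability; the real survey uses its own question codes.

```python
import pandas as pd

# Toy stand-in for the survey responses. Column names and values are
# hypothetical, not the real survey schema.
raw = pd.DataFrame({
    "country": ["Nigeria", "India", "Kenya", "Germany",
                "United States of America", "Nigeria"],
    "job_title": ["Data Scientist", "Data Scientist", "Data Scientist",
                  "Data Scientist", "Data Scientist", "Student"],
    "coding_years": ["3-5 years", "5-10 years", "1-2 years",
                     "10-20 years", "20+ years", "< 1 years"],
    "compensation": ["10,000-14,999", "20,000-24,999", None,
                     "70,000-79,999", "150,000-199,999", None],
    "unused_question": ["a", "b", "c", "d", "e", "f"],
})

TRENDSETTERS = {"India", "United States of America"}
AFRICAN = {"Nigeria", "Kenya"}  # illustrative subset only

# 1. Slice out the columns of interest.
df = raw[["country", "job_title", "coding_years", "compensation"]]

# 2. Keep only respondents whose job title is "Data Scientist".
df = df[df["job_title"] == "Data Scientist"]

# 3. Keep only the countries of interest and add a region feature.
df = df[df["country"].isin(TRENDSETTERS | AFRICAN)].copy()
df["region"] = df["country"].apply(
    lambda c: "Trendsetter" if c in TRENDSETTERS else "Africa"
)

# 4. Remove the word "years" and any leftover whitespace.
df["coding_years"] = (
    df["coding_years"].str.replace("years", "", regex=False).str.strip()
)

# 5. Handle nulls per analysis: drop them only for compensation work.
comp_df = df.dropna(subset=["compensation"])

print(df)
print(len(comp_df), "rows usable for compensation analyses")
```

Keeping `comp_df` separate from `df` mirrors the analysis-by-analysis approach: respondents who skipped the compensation question still count in every other analysis.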

Exploratory Data Analysis

As the Kaggle Executive Summary points out, India and the USA dominate participation on Kaggle. This study will therefore use these 2 countries as a comparison subset for the African countries (we’ll refer to this group of 2 countries as the Trendsetter countries from here on).

The insights from this exploratory data analysis are intended for professionals based in Africa who intend to make a career change into the data science field.

Where are the African data scientists?

Nigeria clearly has the most Kaggle data scientists in Africa. Out of 54 countries in Africa, it is shocking to see that only 6 appear to have data science activity on Kaggle. This might also be an indicator of huge untapped job market potential in Africa.

What is the gender spread between African vs Trendsetting Countries?

Despite the trendsetter group being composed of only 2 countries (India and the USA), they represent over a third of Kaggle data scientists in the entire world. They also have around 6 times the number of Kaggle data scientists compared to the 6 African countries combined. Potentially, there is still big job market potential for data scientists in Africa.

Both locally and abroad, female data scientists are dramatically under-represented. Understanding the drivers of this under-representation would be beneficial not only for the data science field, but also for Africa’s fourth industrial revolution, as Africa has a high number of female-headed households.
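A gender split like this can be computed as a row-normalised cross-tabulation. The sketch below uses invented toy rows purely to show the mechanics; the column names `region` and `gender` are assumptions, and the resulting shares are not survey results.

```python
import pandas as pd

# Invented toy rows, for illustration only (not survey results).
people = pd.DataFrame({
    "region": ["Africa"] * 5 + ["Trendsetter"] * 5,
    "gender": ["Man", "Man", "Man", "Man", "Woman",
               "Man", "Man", "Man", "Woman", "Woman"],
})

# normalize="index" turns counts into shares within each region,
# so every row of the result sums to 1.
share = pd.crosstab(people["region"], people["gender"], normalize="index")
print(share)
```

Normalising within each region matters here because the trendsetter group is several times larger than the African group, so raw counts alone would hide how similar or different the proportions are.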

What are the demographics of data scientists in Africa?

In line with the global trends, data science in Africa is still dominated by males across almost all age groups and countries. Also, most Kaggle data scientists in Africa are young adults across the countries (between 25 and 35 years old).

What is the educational background of Africa’s data scientists?

A huge majority of data scientists in Africa have a tertiary qualification. The biggest group is those with a Bachelor’s degree, followed by those with a Master’s degree. Even though having a tertiary qualification seems to correlate positively with getting a data scientist job, there are still a number of data scientists who secured a job without one. Therefore not having a tertiary qualification is not an insurmountable hindrance to securing a data scientist job in Africa.

What is the coding experience of the Kaggle data scientists in Africa vs the Trendsetter countries?

Coding experience in Africa vs the Trendsetter countries

Coding experience in Africa vs the Trendsetter countries by college degree

Proportional coding experience by Kaggle data scientists in Africa vs the Trendsetter countries

Compared to the trendsetter countries, Kaggle data scientists in Africa tend to have far fewer years of coding experience. A handful of Africa’s data scientists don’t even have coding experience. The coding experience required to be hired as a data scientist seems lower than most of us may expect.

In the trendsetter countries, the dominant years-of-coding group is most influenced by the data scientists with master’s degrees, whereas in the African countries it is most influenced by the data scientists with lower-level degrees (bachelor’s degrees).
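Because the two groups differ so much in size, the proportional view is the fairer comparison. Here is a hedged sketch of how groupby() counts can be turned into per-region proportions; the toy rows, the column names and the bracket labels are all made up for illustration.

```python
import pandas as pd

# Invented toy rows; bracket labels are illustrative.
df = pd.DataFrame({
    "region": ["Africa", "Africa", "Africa", "Africa",
               "Trendsetter", "Trendsetter", "Trendsetter", "Trendsetter"],
    "coding_years": ["1-2", "1-2", "3-5", "< 1",
                     "5-10", "3-5", "5-10", "1-2"],
})

# Explicit bracket order, since alphabetical sorting would scramble it.
order = ["< 1", "1-2", "3-5", "5-10"]

# Count respondents per (region, bracket), then pivot brackets to columns.
counts = (
    df.groupby(["region", "coding_years"])
      .size()
      .unstack(fill_value=0)
      .reindex(columns=order, fill_value=0)
)

# Divide each row by its total to get within-region proportions.
table = counts.div(counts.sum(axis=1), axis=0)
print(table)
```

The `reindex` call also guarantees that a bracket with no respondents in either region still shows up as a zero column instead of silently disappearing from the plot.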

What about Machine Learning experience?

Machine learning experience of Kaggle data scientists in Africa vs the Trendsetter countries

Machine learning experience of Kaggle data scientists in Africa vs the Trendsetter countries by college degree

Proportional view of machine learning experience by Kaggle data scientists in Africa vs the Trendsetter countries

Interestingly, in terms of years of machine learning (ML) experience, Kaggle data scientists in Africa show a similar pattern to the trendsetter countries, where most data scientists have less than 2 years of ML experience. Surprisingly, both in Africa and in the trendsetter countries, there are some data scientists who do not even use machine learning in their jobs.

The data scientists using machine learning in the trendsetter countries tend to hold higher-level degrees (master’s and doctorates) compared to those in Africa (bachelor’s).

The data scientists who have no ML experience:

They are mostly young (in their late twenties), and are degree holders. Therefore they are likely to be entry level data scientists.
They tend to be working for small companies and in small teams. Therefore their limited man hours may be over-occupied by the lower hierarchy data science tasks.
They usually work on building and maintaining data infrastructure, and also using the data to do analyses that feed business insights.

Which programming languages are commonly used and which are commonly suggested for others to learn?

Top 3 used and suggested programming languages by Kaggle data scientists in Africa vs the Trendsetter countries

The exact same pattern happens globally: Python is the most used programming language in data science workflows, followed by SQL for storing and accessing structured data. R is still heavily used by some data scientists as an alternative to Python.

How does the trend compare amongst African countries?

Top 3 used and suggested programming languages by Kaggle data scientists in African countries

With regard to programming languages used and suggested, the pattern is the same across the African countries. Python is by far the most used programming language by the data scientists, followed by SQL and R. It’s worth mentioning the heavy use of C, C++ and Java in some African countries; these are most probably used in the data engineering and production phases of the data science workflow. Python, SQL, then R therefore seems like the order of priority that a budding data scientist should follow in their learning journey.
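A "top 3 languages" ranking is where the wide-to-long reshaping pays off. The sketch below assumes the multi-select question is spread over one column per option, holding the option's name when selected and a null otherwise. The column names (`lang_part_1` and so on) and the toy rows are hypothetical.

```python
import pandas as pd

# Invented toy rows: one column per selectable language (hypothetical names).
wide = pd.DataFrame({
    "country": ["Nigeria", "Kenya", "Nigeria", "Egypt"],
    "lang_part_1": ["Python", "Python", "Python", "Python"],
    "lang_part_2": ["SQL", None, "SQL", "SQL"],
    "lang_part_3": [None, "R", None, None],
})

# Melt to long format: one row per (respondent, selected language).
long = (
    wide.melt(id_vars="country", var_name="option_column",
              value_name="language")
        .dropna(subset=["language"])
)

# Overall ranking, plus a per-country breakdown.
top3 = long["language"].value_counts().head(3)
per_country = long.groupby("country")["language"].value_counts()

print(top3)
print(per_country)
```

Once the data is in long format, the same two lines answer both the Africa-wide question and the country-by-country one.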

Investigating the compensation dynamics of Kaggle data scientists

It’s not surprising that there is a huge number of data scientists that form part of the lowest compensation tier — given data science’s recent rise (in Africa) as an attractive job prospect for new employees (including graduates and professionals switching to the data science field).

At entry level, the pattern of compensation in Africa matches that of the Trendsetter countries. The massive difference starts showing at the mid and top tiers of the data science job market. On average, the Kaggle data scientists in Africa earn around $15k per annum, versus the Trendsetters’ $83k per annum (roughly 5.5 times as much). Africa’s biggest earners get around $124k per annum, versus the trendsetters’ more than $500k per annum (at least 4 times as much).
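A gap like this can be expressed either as a ratio ("x times as much") or as a percentage increase, and the two give quite different numbers. A quick check using the rounded figures quoted above (the $500k is a lower bound, so the top-earner values are minimums):

```python
# Rounded figures quoted in the text, in USD per annum.
africa_mean, trend_mean = 15_000, 83_000
africa_top, trend_top = 124_000, 500_000  # trend_top is a lower bound

def ratio(a, b):
    """How many times larger b is than a."""
    return b / a

def pct_higher(a, b):
    """Percentage by which b exceeds a."""
    return (b - a) / a * 100

print(f"mean: {ratio(africa_mean, trend_mean):.1f}x, "
      f"{pct_higher(africa_mean, trend_mean):.0f}% higher")
print(f"top:  {ratio(africa_top, trend_top):.1f}x, "
      f"{pct_higher(africa_top, trend_top):.0f}% higher")
```

So the average trendsetter salary is about 5.5 times the African average (about 453% higher), and the top trendsetter salaries are at least 4 times the African top (at least about 303% higher).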

Unsurprisingly, amongst the trendsetter countries, it is the data scientists in the USA that tend to earn the big bucks.

How does coding and machine learning experience relate to compensation for African data scientists?

Unsurprisingly, there appears to be a positive correlation between the number of years that a data scientist has been coding and their salary. Since there is a small number of data scientists who have been in the field for a long time, their low supply increases market competition to acquire their services.

Similarly, there appears to be a positive correlation between the number of years that a data scientist has been using machine learning and their salary. Compensation seems to peak at around $80k for data scientists with 5–10 years of ML experience. For those with more than 10 years of ML experience, compensation peaks at around $60k. The trend seems to say that having ML experience beyond 10 years doesn’t give you more money, which is interesting. Traditionally, academia has more experienced people with less compensation, whereas corporate tends to have less experienced people who are paid handsomely. One could speculate that these dynamics explain the trend.

A complicated look at how college degree together with coding and ML experience relate to compensation.

Having a tertiary qualification potentially increases the salary of the Kaggle data scientists in Africa. The highest earners have a Master’s degree. However, having a doctorate does not necessarily improve the earning ability of the data scientists. The same academia vs corporate speculation can be used here.

A look at the size of companies hiring Kaggle data scientists

Company and data science team sizes of Kaggle data scientists in Africa vs Trendsetter countries

The Kaggle data scientists in Africa tend to work for small companies and in small teams. This may mean that most data scientists in Africa need a broad skillset to build and operate the entire data science ecosystem (skills mostly associated with software engineers and data engineers), and are likely to spend less time developing ML models.

Number of data scientists by team size per given company size

Of particular note are the small companies (0–49 employees) that have dramatically large data science teams. These are likely start-ups which are heavily focused on selling data-related services, in line with the current high attractiveness of the data science field.

How does the company and data science team size affect the compensation?

Even though the data scientists in Africa tend to be hired more by small companies and into small data science teams, it is the data scientists in big companies and big data science teams who tend to earn dramatically more.

What are the African data scientists doing in their jobs?

The Kaggle data scientists in Africa spend most of their time on building and maintaining data infrastructure, and on using the data to do analyses that feed business insights. This is in line with the earlier finding that most of the data scientists work in small companies that also have small data science teams. Therefore aspiring data scientists should prepare themselves accordingly, and also see this as an opportunity to engage with the entire data science ecosystem. Furthermore, for those wanting to get their hands dirty on a variety of responsibilities, working for a smaller company may be a good option.

Exploring the plotting tools used by Kaggle data scientists in Africa vs the Trendsetter countries

Matplotlib and seaborn are the most used plotting libraries in Africa and the Trendsetters, closely followed by Plotly, in line with the high usage of Python over R. These 3 plotting libraries are likely the ones that aspiring data scientists should focus on.

How about ML frameworks?

Scikit-learn is still the leading framework for machine learning, followed by Keras and TensorFlow, which are used more for deep learning. These 3 machine learning frameworks are likely the ones that aspiring data scientists should focus on.

Focusing on ML algorithms

Relatively simple as they are, linear/logistic regressions are still the most used ML algorithms, closely followed by decision trees/random forests. Together with gradient boosting machines, these 3 families of ML algorithms are likely the ones that aspiring data scientists should focus on.

How are Kaggle data scientists sharing their work?

GitHub and Kaggle are the most used platforms where data scientists publicly share their work. Therefore these are the platforms that aspiring data scientists should use not only to find and learn from other data scientists’ work, but also to eventually start sharing their own work to showcase their experience.

How are they keeping their knowledge relevant?

Online learning platforms used by data scientists in Africa vs Trendsetter countries

Coursera and Udemy are the go-to platforms for online learning for data scientists in Africa and the trendsetter countries. Whereas the third most used platform in Africa is DataCamp, in the trendsetter countries it is University Courses which actually result in university degrees. This points to universities starting to play catch-up in accommodating the booming data science field.

How do they consume data science news?

For a fast growing field such as data science, staying up-to-date with the leading technologies, processes and thinking is paramount. The commonly used media by the data scientists are Blogs, Kaggle and YouTube.

Lessons Learnt

Online courses are a great starting point, but what is invaluable is wrestling with projects.
It’s very unlikely that you can escape data wrangling, so embrace it and appreciate it. The source data may need to be cleaned, filtered and/or transformed in some way before it is usable in your unique data analysis.
The EDA process is an invaluable step in any data science project.
Do not limit your project (and knowledge) to certain tools. Let the tools/libraries you use be driven by the problem you are trying to solve.
The data science workflow is not linear, but cyclic. As you explore your data, embrace the fact that you might have to do further data wrangling and even add other analyses to your project, as you prod your data and ask questions of it.
Maintain your inquisitiveness, and enjoy the journey.

Let’s chat

Besides testing and showcasing EDA skills, this project is also meant to spark some conversation around ML and Data Science in Africa. The comparison with trendsetting countries is aimed at identifying gaps that we can learn from and have productive conversations about.

So talk to us :)

Do you identify with some of the conclusions made?
Do you see yourself anywhere in the survey?
What have you learnt?
What more would you like to see?

You can follow Kusasalethu Sithole on LinkedIn and Mabu Manaileng here.

Resources

All the code for this analysis can be found on this Colab
DataCamp
DataCamp Career Tracks
2020 Kaggle Machine Learning & Data Science Survey