Analyzing GADS 2020 Phase-One

Published in DataSeries, Nov 19, 2020 (12 min read)

EDA on Google Africa Developer Scholarship phase-one 2020…

Google has a mission for Africans… Let’s hear what Andela, the principal organizing partner for this scholarship program, has to say…

In line with Google’s goal of training 100,000 Africans, we have, over the last 3 years, executed 5 programs in partnership with Google, Udacity, and Pluralsight, training a cumulative of 60,000+ learners across 54 Countries in Africa.

This is indeed a highly commendable program from a company that needs no introduction, and for me it gets personal. The first GADS program, some 3 years ago, was my first serious attempt at coding and building applications. I had tinkered with HTML in the past, but GADS was the program that took me from the sidelines to mid-stream.

Today I am a Data Specialist. And although I started learning Java and Android first at GADS, that learning formed the foundation on which I have switched and built my Data Science career.

Okay, so here we are at GADS-2020. I got an email inviting me to apply and choose one of the three tracks below. Training for qualified students is delivered on Pluralsight.

Mobile Web Specialist Track
Associate Cloud Engineer Track
Android Developer Track

Of course, as a Data Specialist, the Cloud track is the most applicable to my work, so I opted for it.

The Program Time-Line:

The program takes about 15 weeks, from early July to the Project-phase in October. There’s an additional 6-week certification stage extending till December.

Here’s my mandatory skill-IQ test result from Pluralsight at the end of learning phase-two. I’m required to score at least 180 on the skill-IQ test or achieve 20 learning hours on Pluralsight… I achieved more than both.

My Skill-IQ score of 220 / 300 as at end of learning phase-two

The analysis below was done in the first week of August 2020, just after the end of learning phase-one in July…

The Data:

Luckily for me, GADS has a leaderboard up on some website. The leaderboard has full details of all Learners as at end of learning phase-one.

Data Pre-processing:

To answer pressing questions, I needed to get my hands on the leaderboard data, so that I could process it for analysis.

I web-scraped the leaderboard website using the requests and BeautifulSoup libraries.
I cleaned the data, removing HTML links and web parts that are unnecessary for my analysis.
I read the data into a structured Pandas Dataframe.
I converted data types from object to int and checked for missing values and other data properties.
Finally, I made the `Rank` column the index of the Dataframe.
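
The cleaning steps above could look roughly like the sketch below. The leaderboard URL and page markup are not given in this article, so the scraping call itself is left out: `raw_rows` is a made-up stand-in for the cell text that requests and BeautifulSoup would return, and the column names simply mirror the features described in this article.

```python
import re

import pandas as pd

# Hypothetical stand-in for scraped table cells (not real leaderboard data).
raw_rows = [
    ['1', '<a href="/u/1">Learner A</a>', 'Nigeria', '35', '300', '120'],
    ['2', '<a href="/u/2">Learner B</a>', 'Kenya', '20', '250', '64'],
    ['3', '<a href="/u/3">Learner C</a>', 'Ghana', '9', '140', ''],
]
columns = ['Rank', 'Name', 'Country', 'Total-points', 'Skill-IQ', 'Learning-hours']

TAG = re.compile(r'<[^>]+>')


def strip_html(cell):
    """Remove leftover HTML tags and surrounding whitespace from a cell."""
    return TAG.sub('', cell).strip()


df = pd.DataFrame([[strip_html(c) for c in row] for row in raw_rows], columns=columns)

numeric_cols = ['Rank', 'Total-points', 'Skill-IQ', 'Learning-hours']
# errors='coerce' turns unparseable text (such as an empty cell) into NaN
df[numeric_cols] = df[numeric_cols].apply(pd.to_numeric, errors='coerce')

# Count missing values per column before using Rank as the index.
missing_per_column = df.isna().sum()
df = df.set_index('Rank')
```

A column that contains a missing value ends up as float rather than int, which is one reason to check for missing values before fixing the data types.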

The data has details for 13,571 Learners as at end of learning phase-one. This data is for all three tracks earlier mentioned. For each student/observation, the data has the following features…

Rank: The rank of the student
Name: The name of the student
Country: The country of the student
Total Points: Total points earned, a function of the Skill-IQ score and Learning-hours
Skill-IQ Score: Skill-IQ score of the learner
Learning-hours: Total learning-hours as at end of phase-one

The Analysis

As a Data Professional, curiosity is a huge part of my daily life. Therefore, at the end of learning phase-one, I said to myself…

What can I learn from the engagement of more than 13,000 African students across three learning tracks?

Some of the questions I set out to answer are…

What insights can we learn from the engagement of Learners as shown on the Leaderboard?
Is there a correlation between watching long hours of content and scoring high Skill-IQ scores?
How many countries are represented and which countries have the highest percentage of Qualified Learners as of 31st July?
Is it better to only go for skill-IQ or only go for Learning-hours or should Learners aim to meet both qualifying requirements?

Visual Overview:

Let’s look at the distribution of the numeric columns of the leaderboard. These are the `Skill-IQ`, `Learning-hours` and `Total-points`.

Histogram of Total-points, Skill-IQ and Learning-hours…

What does the GADS Histogram tell us?

Total-points ranges from 0 to 35.
Learning-hours is concentrated between 0 and 35, but we have about 2000 students learning between 35 and 60 hours. We have about 400 students learning between 60 and 90 hours. Finally, we have outliers of very few students between 100 and 258 learning hours.
The skill-IQ scores are spread fairly evenly across their range, with a peak in the 120 to 150 bin.
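
The counts per range read off the histogram can also be computed directly. This is a minimal sketch, assuming a `Learning-hours` column like the one described earlier; the values are invented for illustration, and the band edges loosely follow the ranges mentioned in the list above.

```python
import pandas as pd

# Invented learning-hours values, for illustration only.
hours = pd.Series([0, 4, 12, 20, 33, 35, 41, 58, 60, 75, 90, 130, 258],
                  name='Learning-hours')

# Band edges loosely based on the ranges discussed above.
edges = [0, 35, 60, 90, 258]
labels = ['0-35', '35-60', '60-90', '90-258']

# include_lowest=True keeps learners with exactly 0 hours in the first band
bands = pd.cut(hours, bins=edges, labels=labels, include_lowest=True)
counts = bands.value_counts().sort_index()
```

Each band includes its upper edge, so a learner with exactly 35 hours is counted in the first band.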

Descriptive Statistics for Qualified Learners data:

Next, I select the data for all qualified Learners: those with at least 10 Learning-hours, a skill-IQ score of at least 100, or both. Then I use the Pandas describe() function to get the descriptive statistics for these Learners as of 31st July.

From the descriptive statistics of the leaderboard data, we can see that:-

Total number of qualified students is 13,571 as at end of phase-one
The average Learning-hours for all qualified is 19 hours
The average skill-IQ score for all qualified is 120
The average total-points for all qualified is 6
The maximum total-points, skill-IQ and Learning-hours are 35, 300 and 258 respectively.
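
The selection and summary described above can be sketched as follows. The thresholds come from the text; the rows are invented, and the column names are assumed to match the features listed earlier.

```python
import pandas as pd

MIN_HOURS = 10      # qualifying threshold for Learning-hours (from the text)
MIN_SKILL_IQ = 100  # qualifying threshold for skill-IQ (from the text)

# Invented rows, for illustration only.
df = pd.DataFrame({
    'Total-points': [2, 6, 9, 1, 14],
    'Skill-IQ': [0, 120, 210, 40, 300],
    'Learning-hours': [12, 3, 25, 5, 48],
})

# A learner qualifies by meeting either threshold (or both).
qualified = df[(df['Learning-hours'] >= MIN_HOURS) | (df['Skill-IQ'] >= MIN_SKILL_IQ)]

# count, mean, std, min, quartiles and max for every numeric column
stats = qualified.describe()
```

The `count`, `mean` and `max` rows of `stats` correspond to the figures quoted in the list above.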

Let’s see the distribution for qualified Learners across these three metrics.

Histplot distribution of qualified Learners across the 3 metrics

Interpreting the Histogram:-

The Skill-IQ histogram is roughly uniform, apart from a single peak at the start, where a large number of Learners have skill-IQ scores between 0 and 15.
The Learning-hours histogram is also unimodal, peaking in the same 0 to 15 range, but it is heavily right-skewed, with high values that pull the tail out to the 258 mark.
The Total-points histogram is bimodal, with peaks around 5 and 9 points. It appears roughly symmetrical between 0 and 12 but, like Learning-hours, it contains outliers that skew it to the right, out to the 35 mark.

To get better insights about the shape of the leaderboard data for qualified Learners, let’s see a boxplot.

Boxplot distribution of qualified Learners across the 3 metrics

Interpreting The Box-Plot:

Generally, the Box-plots for Total-Points and Learning-Hours contain a lot of outliers, while that of Skill-IQ contains very few.

Skewness and Kurtosis:

Skewness tells us how asymmetric the distribution of the data is, while kurtosis tells us how heavy its tails are. The larger the kurtosis, the more outliers are present and the longer the tail of the distribution. Let’s define a simple function for measuring kurtosis and apply it to the data.

Function for calculating the kurtosis or rate of outliers of a distribution

The measure of outliers or kurtosis for the data is:

Kurtosis for Total-points = 4.82
Kurtosis for Skill-IQ = 2.32
Kurtosis for Learning-hours = 19.77
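
The function itself is not reproduced in this text, so here is a minimal sketch of one common definition: Pearson kurtosis, the fourth central moment divided by the squared variance, under which a normal distribution scores 3. Whether the figures above use this convention or the "excess" variant (which subtracts 3) is an assumption on my part, and the sample values are invented.

```python
def kurtosis(values):
    """Pearson kurtosis: fourth central moment divided by the squared variance."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # variance (population form)
    m4 = sum((v - mean) ** 4 for v in values) / n  # fourth central moment
    return m4 / m2 ** 2


# Invented samples: the second one swaps the largest value for an outlier.
evenly_spread = [1, 2, 3, 4, 5]
with_outlier = [1, 2, 3, 4, 50]

k_even = kurtosis(evenly_spread)
k_outlier = kurtosis(with_outlier)
```

A single extreme value is enough to raise the result noticeably, which matches the pattern in the figures above, where Learning-hours has by far the longest tail. The function divides by the variance, so it fails on a column where every value is identical.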

A. Box-Plot for Total-Points:

The minimum value is 0 and the maximum (the end of the upper whisker) is about 14.9. Beyond that we have outliers from 15 up to 35 points, which is what makes the histogram above right-skewed. The Box-plot also shows that 75% of Qualified Learners scored no more than 8 points, with 50% scoring no more than 6 total points.

B. Box Plot for Skill-IQ score:

The minimum score is also 0 and the maximum is 300. We also see that 75% of Qualified Learners do not score above 170 skill-IQ, and 50% do not score more than 130. The scores are spread fairly evenly across the range, but there is still a large concentration of low scores, which causes the peak at the start of the Skill-IQ histogram above.

C. Box Plot for Learning-Hours:

The minimum value is 0 and the maximum (the end of the upper whisker) is about 50, but we have outliers from around 51 to 258. We also see that 75% of Qualified Learners logged no more than 24 learning hours, with 50% of all Qualified Learners logging less than 16 hours in total.
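
The numbers read off a box plot can be reproduced in code. The sketch below assumes the common 1.5 × IQR rule for where the whiskers end, which is the default in matplotlib and seaborn; the article does not say which settings were used, the helper name is hypothetical, and the sample values are invented.

```python
import pandas as pd


def box_plot_summary(values):
    """Quartiles, whisker ends and outliers under the 1.5 * IQR rule."""
    s = pd.Series(values, dtype='float64')
    q1, median, q3 = s.quantile([0.25, 0.5, 0.75])
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # Whiskers stop at the most extreme points that are still inside the fences.
    inside = s[(s >= low_fence) & (s <= high_fence)]
    return {
        'q1': q1, 'median': median, 'q3': q3,
        'lower_whisker': inside.min(),
        'upper_whisker': inside.max(),
        'outliers': s[(s < low_fence) | (s > high_fence)].tolist(),
    }


# Invented learning-hours values with one extreme learner.
summary = box_plot_summary([0, 4, 8, 12, 16, 20, 24, 28, 258])
```

Here the "maximum" shown by the box plot is the upper whisker, not the largest value in the data: the learner with 258 hours is reported separately as an outlier.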

QUESTION:

Is there a correlation between watching long hours of content and scoring high Skill-IQ scores?

At this point, I’m very curious to find out if watching long hours of content leads to higher skill-IQ scores. First, let’s see a correlation matrix of Skill-IQ scores, Learning-hours and Total-points.

Correlation matrix for Skill-IQ, Learning-hours and Total-points

Corr matrix Inference:

The correlation matrix shows some very interesting insights

Total-Points and Skill-IQ-scores are strongly correlated (0.84). This means as one increases, the other is very likely to increase too.
Total-Points and Learning-hours have a moderate positive correlation (0.55). This means as one increases, the other generally tends to increase too.
Amazingly, Learning-Hours and Skill-IQ scores have virtually no correlation at all (0.012).
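
A pattern like this is less surprising than it looks: a total built from two unrelated inputs correlates with each of them, even though the inputs do not correlate with each other. The sketch below shows this with invented values and a made-up scoring rule; it is not the real GADS points formula, which this article does not state.

```python
import pandas as pd

# Invented values: hours and skill-IQ are deliberately unrelated to each other.
df = pd.DataFrame({
    'Learning-hours': [1, 2, 3, 4, 1, 2, 3, 4],
    'Skill-IQ': [100, 100, 100, 100, 200, 200, 200, 200],
})
# Hypothetical scoring rule, NOT the real GADS formula.
df['Total-points'] = df['Skill-IQ'] / 50 + df['Learning-hours']

corr = df.corr()  # Pearson correlation by default

hours_vs_iq = corr.loc['Learning-hours', 'Skill-IQ']
points_vs_hours = corr.loc['Total-points', 'Learning-hours']
points_vs_iq = corr.loc['Total-points', 'Skill-IQ']
```

In this toy data, Total-points correlates clearly with both inputs while the correlation between Learning-hours and Skill-IQ is zero.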

In statistics, we say correlation does not imply causation.

Yet it is a wonder that, in this data, watching long hours of course content appears to have nothing to do with scoring a high skill-IQ score, and vice-versa. Bear in mind that the correlation coefficient only measures linear association, so what the data shows is that there is virtually no linear relationship between the two.

Let’s visualize these relationships using a scatter plot.

Scatter plot distribution of qualified Learners across the 3 metrics

The scatter-plot clearly shows that:

As learning-hours increase, Total-points tend to also increase.
As skill-IQ scores increase, Total-points show a strong tendency to increase too.
But as Learning-hours increase, there is no visible trend that shows skill-IQ scores increasing and vice-versa.

We can also see a Probability Density Function (PDF) plot of these metrics…

A PDF is used to specify the probability of the random variable falling within a particular range of values, as opposed to taking on any one value

PDF plot distribution of qualified Learners across the 3 metrics
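
To make the "range of values" idea concrete, here is a small sketch that uses a histogram scaled to unit area as a simple stand-in for a smoothed density curve. The values and bin edges are invented, and the plot above may well have been drawn with a different estimator.

```python
import numpy as np

# Invented learning-hours values, for illustration only.
hours = np.array([2, 5, 8, 11, 14, 17, 22, 26, 31, 47], dtype=float)

# density=True rescales bar heights so the total area under the histogram is 1.
density, edges = np.histogram(hours, bins=[0, 10, 20, 30, 40, 50], density=True)
widths = np.diff(edges)

# Probability of a range = area under the density over that range.
p_10_to_30 = float(np.sum(density[1:3] * widths[1:3]))
total_area = float(np.sum(density * widths))
```

The height of the density at a single point is not a probability; only the area over a range is, and the area over the whole range is 1.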

QUESTION:

Is it better to only go for skill-IQ or learning-hours or should Learners aim to meet both qualifying requirements?

Remember that there were 13,571 total qualified Learners at the end of learning phase one. Some of these Learners qualified on only skill-IQ scores, some on Learning-hours and some on both conditions.

Let’s see a breakdown of how many students qualified under each of the conditions above:-

Pie chart showing the percentage split of qualified Learners

Pie-Chart shows the following:

43.6% of qualified learners qualified in both skill-IQ scores and Learning-hours. This is a figure of 5,917 Learners, represented by the blue slice above.
29.9% of qualified Learners qualified in only Learning-hours. This is a figure of 4,051 Learners, represented by the pink slice above.
26.5% of qualified Learners qualified in only skill-IQ scores. This is a figure of 3,603 Learners, represented by the green slice above.
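
The split above can be sketched by labelling each Learner with the condition(s) they meet. The thresholds are the phase-one minimums mentioned earlier; the function name and the rows are invented for illustration. The last lines re-derive the quoted percentages from the article's own counts.

```python
import pandas as pd

MIN_HOURS = 10      # qualifying threshold for Learning-hours (from the text)
MIN_SKILL_IQ = 100  # qualifying threshold for skill-IQ (from the text)


def qualifying_group(row):
    """Label a learner by which qualifying condition(s) they meet."""
    by_hours = row['Learning-hours'] >= MIN_HOURS
    by_iq = row['Skill-IQ'] >= MIN_SKILL_IQ
    if by_hours and by_iq:
        return 'both'
    if by_hours:
        return 'hours only'
    if by_iq:
        return 'skill-IQ only'
    return 'not qualified'


# Invented rows, for illustration only.
df = pd.DataFrame({
    'Skill-IQ': [150, 20, 210, 0, 130],
    'Learning-hours': [30, 14, 2, 3, 11],
})
df['group'] = df.apply(qualifying_group, axis=1)

# Cross-check of the percentages quoted above, using the article's own counts.
counts = {'both': 5917, 'hours only': 4051, 'skill-IQ only': 3603}
total = sum(counts.values())
shares = {name: round(100 * n / total, 1) for name, n in counts.items()}
```

The three counts add up to 13,571, and the rounded shares come out at 43.6%, 29.9% and 26.5%, matching the pie chart.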

So let’s inspect these three groups of Qualified Learners to find out the best group.

First, let’s compare the average scores of Learners in each group…

Bar chart of average scores based on qualifying mix of Learners

Comparing Average Inference:

We can clearly see that:-

Learners who meet both the minimum skill-IQ score and minimum Learning-Hours seem to be the most serious set of Learners. This is because:-

On average, they log 33% more Learning-hours than those who only qualify on Learning-hours and 600% more than those who only qualify on skill-IQ.
On average, they score 3% higher on skill-IQ than those who only qualify on skill-IQ scores and 475% higher than those who only qualify on Learning-hours.
On average, they earn 200% more Total-points than those who only qualify on Learning-hours and 50% more than those who only qualify on skill-IQ.

This shows that on average, Learners who meet at least the minimum requirements for both skill-IQ scores and Learning-hours, perform better than those who only meet either.
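
For readers who want to check how such "X% higher" figures are derived, here is a small sketch. The rows are invented so that the arithmetic is easy to follow; they are not the real group averages, and `pct_higher` is a hypothetical helper name.

```python
import pandas as pd


def pct_higher(a, b):
    """How much higher a is than b, as a percentage of b."""
    return (a - b) / b * 100


# Invented rows, for illustration only; 'group' is the qualifying mix.
df = pd.DataFrame({
    'group': ['both', 'both', 'hours only', 'hours only',
              'skill-IQ only', 'skill-IQ only'],
    'Learning-hours': [30, 26, 20, 22, 3, 5],
    'Skill-IQ': [150, 170, 20, 40, 140, 160],
})

# Average of each metric within each qualifying group.
means = df.groupby('group')[['Learning-hours', 'Skill-IQ']].mean()

hours_vs_hours_only = pct_higher(means.loc['both', 'Learning-hours'],
                                 means.loc['hours only', 'Learning-hours'])
hours_vs_iq_only = pct_higher(means.loc['both', 'Learning-hours'],
                              means.loc['skill-IQ only', 'Learning-hours'])
```

Note that "600% higher" means seven times as large, not six: an average of 28 hours against 4 hours gives (28 - 4) / 4 = 600%.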

Let’s compare the maximum performance of Learners within these three categories, to see which group of Learners seem to perform best…

Comparing Maximum Inference:

We can clearly see that:-

Learners who meet both the minimum skill-IQ score and minimum Learning-hours seem to be the most serious set of learners, again, this is because:-

Their maximum score on Learning-hours is 55% higher than Learners who only qualify on Learning-hours and 2767% higher than those who only qualify on skill-IQ
Their maximum score on skill-IQ is 203% higher than learners who only qualify on Learning-hours and 0.7% higher than those who only qualify on skill-IQ
Their maximum score on Total-points is 94% higher than learners who only qualify on Learning-hours and 192% higher than those who only qualify on skill-IQ

This shows that when it comes to maximum performance, Learners who meet at least the minimum requirements for both skill-IQ scores and Learning-hours, perform better than those who only meet either.

Once again, we have seen that the best group is the group of Learners who meet at least the minimum requirements for both skill-IQ scores and Learning-hours.

Although I’m not in any way privy to the list of Learners that will be chosen for phase two, I can beat my chest and state authoritatively, that…

Based on the Data, the 5,917 Learners who meet both minimum skill-IQ and Learning-hours will most likely be chosen first.

QUESTION:

How many countries are represented and which countries have the highest percentage of Qualified Learners as of 31st July?

Analyzing the data shows that the Qualified Learners as at 31st July represent 52 different African countries. See the 52 countries below:-

Let’s see the Top ten countries with the highest number of qualified Learners…

Bar chart showing the percentage of qualified Learners from the top-10 countries.

It turns out that Nigeria, Kenya and South Africa combined account for more than 70% of total qualified Learners, with Nigeria alone accounting for 45% of qualified Learners.
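
The country breakdown can be sketched with a couple of Pandas calls. The rows below are invented and only illustrate the mechanics; the real figures are the ones quoted above.

```python
import pandas as pd

# Invented rows, for illustration only.
countries = pd.Series(
    ['Nigeria', 'Nigeria', 'Nigeria', 'Kenya', 'Kenya',
     'South Africa', 'Ghana', 'Uganda'],
    name='Country',
)

# Number of distinct countries represented.
n_countries = countries.nunique()

# Share of learners per country, largest first, as percentages.
share = countries.value_counts(normalize=True).mul(100).round(1)
top_10 = share.head(10)
top_3_combined = share.head(3).sum()
```

`value_counts` sorts from largest to smallest, so `head(10)` gives the top ten countries directly.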

My Skill-IQ score of 291 / 300 at the end of the Project-phase…

Summary:

This has been an exciting exercise for me. First I got curious about the data, then I applied some techniques to web-scrape it and perform some Exploratory Data Analysis (EDA), using the data to answer some key pressing questions.

I hope that I have been able to provide better insights to Learners, as well as to the program organisers, Andela-Learning-Community (ALC), using the students’ data from the leaderboard.

Some key findings from the data include:

There are 13,571 qualified Learners as at 31st July
5,917 met the minimum requirements for both skill-IQ and Learning-hours, while 4,051 and 3,603 Learners, met only Learning-hours and skill-IQ scores respectively.
Total-points is correlated to both Learning-hours and skill-IQ scores.
There is virtually no correlation between Learning-hours and skill-IQ scores. But we must remember that the best group of Learners had an average of 28 learning-hours and 161 average skill-IQ scores.
The best group to belong to is the group of Learners that meet both minimum requirements to qualify.

For the complete analysis with code and charts, kindly visit my Github Repo.

Aftermath:

After sharing my analysis with the organizers of the Google Africa Developer Scholarship (GADS), I was invited to a Zoom meeting with Andela on the 6th of August 2020.

It turned out to be an insightful session, where I shared the impactful insights we can derive from continually analyzing the data in the course of this cohort and subsequent programs.

I hope to be a part of the Data Team in the future of GADS, as we help to train 100,000 Africans with cutting-edge skills of the future.

Cheers!

About Me:

I’m Lawrence, a Data Specialist at Tech Layer, passionate about fair and explainable AI and Data Science. I believe that sharing knowledge and experiences is the best way to learn. I hold the Data Science Professional and Advanced Data Science Professional certifications from IBM and the IBM Data Science Explainability badge. I’m also a recipient of the Artificial-Intelligence-Nanodegree at Udacity. I have conducted several projects using ML and DL libraries, and I love to code up my own functions as much as possible. Finally, I never stop learning and experimenting, and I have written several highly recommended articles.

Feel free to find me on:-

Github

Linkedin

Twitter

Written by Lawrence Alaso Krukrubo