Every year since 2011, Stack Overflow, one of the most famous Q&A sites for coders, has conducted a survey of its members. The survey covers a lot of information, such as salary, country, and the things you consider when changing jobs.
We could ask ourselves a lot of questions. But, as an aspiring data scientist from an underdeveloped country (Brazil), I couldn’t help but wonder about the differences between developed and underdeveloped countries when it comes to their respective job markets for coders.
With this in mind, the aim of this article is to analyze the data of the Stack Overflow 2020 survey (the most recent one available) to search for such differences (or similarities). In this article, I define “developed countries” as the ones that are rated as “very high” by the UN in its Human Development Index (HDI). This rating and its definitions are not exempt from discussion: one could question how to factor in countries that make the list but don’t have a significant high-tech economy, or point to the fact that HDI is considered a flawed index by some. This discussion, however, goes beyond the scope of this analysis. Here, we will take HDI at face value, focus on highlighting the issues I considered the most relevant for the coding field, and try to formulate explanations for them based on data.
Also, I’m only taking into consideration the data from the workers (i.e. those who earn salaries from their work as coders), as that is the group I want to know more about.
Before getting into the analysis, I should highlight two things. First, the number of respondents varies from country to country, with more coders living in developed countries than in underdeveloped ones. Second, using US dollars to standardize different currencies may put underdeveloped countries at a disadvantage, especially as we are not including variables to account for differences in living standards between the developed and underdeveloped worlds.
The main questions I want to answer are the following: are there differences between the job market for coders in developed and underdeveloped countries when it comes to salaries, work hours, education, years of coding experience and the coding languages used (as these seem to be the most relevant for describing the field)?
The notebook and the data this article is based on can be found here: https://github.com/leandrominer85/Stack_Overflow_survey
Handling the data
As with every data set, the first thing to do is to clean and transform the data. As can be seen below, the DataFrame has 64461 rows and 61 columns, but there are a lot of NaNs in the salary column: about 46% of the values.
So the first step is to clean this data and keep only the coders that work in the field. With this first cleaning alone, almost half of the entries are gone. Nevertheless, there are still 34756 entries.
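As a rough illustration of this first cleaning step, the sketch below measures the share of missing salaries and drops those rows. It runs on a toy DataFrame, not the real survey file, and the column names `Country` and `ConvertedComp` are assumptions about how the salary data is stored.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the survey data; the column names are assumptions.
survey = pd.DataFrame({
    "Country": ["Brazil", "Germany", "India", "Canada", "Brazil"],
    "ConvertedComp": [12000.0, np.nan, 9000.0, 80000.0, np.nan],
})

# The mean of a boolean mask is the fraction of True values,
# so this gives the share of missing salaries.
missing_share = survey["ConvertedComp"].isna().mean()

# Keep only the respondents who reported a salary.
workers = survey.dropna(subset=["ConvertedComp"]).reset_index(drop=True)

print(f"{missing_share:.0%} of salaries are missing; {len(workers)} rows remain")
```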
The next step is to separate the data into two DataFrames, one with the developed countries and the other with the underdeveloped ones. For this I created a list of the developed countries and then used a filter for both classes:
As a sanity check, let’s look at the shapes: the first DataFrame has 26022 rows and the second 8734. As we can see, the sum of the number of rows in both DataFrames is the same as in the original DataFrame without the empty salaries. Further cleaning will be done during the analysis, but I’ll use a function to clean the data on the languages the coders use:
What does the survey tell us?
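A minimal sketch of such a split, assuming a `Country` column and using only a three-country stand-in for the full "very high" HDI list, could look like this:

```python
import pandas as pd

# Illustrative subset only; the full list would hold every country
# that the UN rates as "very high" in the HDI.
developed_countries = ["Germany", "Canada", "Japan"]

# Toy data; the column names are assumptions.
workers = pd.DataFrame({
    "Country": ["Brazil", "Germany", "India", "Canada"],
    "ConvertedComp": [12000.0, 65000.0, 9000.0, 80000.0],
})

# One boolean mask, used directly and negated, yields both groups.
is_developed = workers["Country"].isin(developed_countries)
developed = workers[is_developed]
underdeveloped = workers[~is_developed]

# Sanity check: the two groups partition the original rows.
assert len(developed) + len(underdeveloped) == len(workers)
```

Building both groups from a single mask guarantees that no row ends up in both DataFrames or in neither.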
As a start, and to provoke our curiosity, let’s look first at the countries with the most coders in each group and the coding languages used by them:
The data shows that the ranking of countries by number of coders is similar to their ranking by total population (with slight variation). China is absent from this survey, seemingly because the country’s internet controls block Stack Overflow.
I created a function to extract the languages (as one coder can use more than one) and made a dictionary with the count of each language:
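One way to write such a function is sketched below. It assumes that each answer is a single string with languages separated by semicolons; the function name `count_languages` and the sample answers are hypothetical.

```python
from collections import Counter

def count_languages(answers, sep=";"):
    """Count how many respondents mention each language."""
    counts = Counter()
    for answer in answers:
        if not isinstance(answer, str):
            continue  # skip missing answers such as None or NaN
        counts.update(
            lang.strip() for lang in answer.split(sep) if lang.strip()
        )
    return dict(counts)

# Hypothetical answers, one string per respondent.
answers = ["Python;SQL", "JavaScript;Python", None, "Bash/Shell/PowerShell;SQL"]
language_counts = count_languages(answers)

# Sort from most to least used to build a ranking.
ranking = sorted(language_counts.items(), key=lambda item: item[1], reverse=True)
print(ranking)
```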
Finally, it’s noticeable that Bash/Shell/PowerShell is the 5th most used language in the developed group, but only appears in 9th position in underdeveloped countries. This ranking is interesting, but the differences between the relative distributions in developed and underdeveloped countries aren’t that big: the largest is the one found for Bash/Shell/PowerShell, and it’s only a 3.3% difference.
Now it’s time to look further into the data. First let’s look at the salaries:
The results show that salaries in developed countries are much higher than in underdeveloped countries. The average salary is very different: US$ 128243 for the developed countries and US$ 30796 for the underdeveloped ones. This is expected, as the currencies of underdeveloped countries are less valuable than those of the most developed ones (since the data is converted to US dollars, as mentioned earlier).
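A comparison of this kind can be sketched with `describe()`, which returns the mean, the standard deviation and the quartiles in one call. The numbers below are toy values, not survey results, and `ConvertedComp` is an assumed column name.

```python
import pandas as pd

# Toy salaries in US dollars; these are not survey results.
developed = pd.DataFrame({"ConvertedComp": [65000.0, 80000.0, 120000.0]})
underdeveloped = pd.DataFrame({"ConvertedComp": [12000.0, 9000.0, 30000.0]})

# describe() yields count, mean, std, min, quartiles and max per group.
summary = pd.DataFrame({
    "developed": developed["ConvertedComp"].describe(),
    "underdeveloped": underdeveloped["ConvertedComp"].describe(),
})

print(summary.loc[["mean", "std", "50%"]])
```

Looking at the median (the `50%` row) next to the mean is useful here, because a few very high salaries can pull the mean upward.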
At first glance the distribution is also very different, as the standard deviation shows. And this also seems to be the case in the boxplot (the outliers were removed for better visualization):
But if we focus on the distribution of salaries inside the groups (with a boxplot) we can see that it is very similar:
The education data has to be transformed before we can put it in a graph, mainly because the values are long strings.
Now we can do the distribution graphs. Additionally, I’ll build a DataFrame that shows the differences between the classes:
The main difference between developed and underdeveloped countries is found among coders with a Bachelor’s degree, who appear with a surplus of over 13% in underdeveloped countries. In developed countries, this is offset by a larger share of coders with higher degrees than is seen in underdeveloped ones.
Now, let’s look at coding experience. The data below covers years of coding experience, including experience earned while studying. I will divide this data into 5 categories. The idea is to roughly emulate the real-life market division of worker experience. The classes are:
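The two steps above, shortening the long strings and comparing the classes, can be sketched as follows. The sample strings only imitate the style of the survey answers, and cutting each label at the first parenthesis is an assumption about how to shorten them.

```python
import pandas as pd

def shorten_level(series):
    # Keep only the text before the first parenthesis,
    # e.g. "Bachelor's degree (B.A., ...)" -> "Bachelor's degree".
    return series.str.split("(").str[0].str.strip()

# Toy answers; not the real survey values.
developed_ed = pd.Series([
    "Master's degree (M.A., M.S., etc.)",
    "Bachelor's degree (B.A., B.S., etc.)",
    "Master's degree (M.A., M.S., etc.)",
    "Bachelor's degree (B.A., B.S., etc.)",
])
underdeveloped_ed = pd.Series([
    "Bachelor's degree (B.A., B.S., etc.)",
    "Bachelor's degree (B.A., B.S., etc.)",
    "Bachelor's degree (B.A., B.S., etc.)",
    "Master's degree (M.A., M.S., etc.)",
])

# Percentage of each education level inside each group.
shares = pd.DataFrame({
    "developed": shorten_level(developed_ed).value_counts(normalize=True),
    "underdeveloped": shorten_level(underdeveloped_ed).value_counts(normalize=True),
}).fillna(0) * 100

# Positive values mean a surplus in underdeveloped countries.
shares["difference"] = shares["underdeveloped"] - shares["developed"]
print(shares)
```

Comparing percentages instead of raw counts matters because the two groups have very different numbers of respondents.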
- 0 to 3
- 3 to 5
- 5 to 10
- 10 to 20
- above 20 years
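A binning like this can be sketched with `pd.cut`. The handling of text answers and the choice to include the lower edge of each bin (so that exactly 3 years falls in "3 to 5") are assumptions, as the classes above do not say which side of a boundary is included.

```python
import numpy as np
import pandas as pd

# Toy answers; some are text and one is missing.
years = pd.Series(["2", "4", "Less than 1 year", "8", "15", "More than 50 years", None])

# Map text answers to numbers (replacement values are an assumption),
# then drop the rows that are still missing.
years_num = pd.to_numeric(
    years.replace({"Less than 1 year": "0", "More than 50 years": "51"}),
    errors="coerce",
).dropna()

bins = [0, 3, 5, 10, 20, np.inf]
labels = ["0 to 3", "3 to 5", "5 to 10", "10 to 20", "above 20"]

# right=False makes each bin include its lower edge: [0, 3), [3, 5), ...
experience = pd.cut(years_num, bins=bins, labels=labels, right=False)

# Percentage of coders in each class, in class order.
shares = experience.value_counts(normalize=True).sort_index() * 100
print(shares)
```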
For both the years coding with and without education time, I’m dropping the NaN values, as they represent less than 1% of the data.
Are the trends the same in years coding when education coding experience is not considered?
As we can see, there is a large difference between the groups of countries. The underdeveloped ones have younger coders (in terms of years coding), with the maximum difference seen in the middle category (5–10 years, including education experience), where they show a 13% surplus in comparison with the developed ones. This trend shifts drastically in the two “older” categories, mainly in the last one, with a 16% difference in favor of developed countries.
There is a change when we subtract the educational coding experience. The main difference seen in this data is for the younger coders, with an advantage of 13% for underdeveloped countries. But the trend in favor of developed countries having older coders is also present here.
As expected, without education, the years of coding data shifts toward the younger classes. Education seems to have a great impact on coding experience: in underdeveloped countries, 21% of the coders have at most 5 years of experience when education is included. If we take out the education time, this group goes up to 60% (against 42% in the developed countries for the same category).
If we look at the average age of the respondents of each group, it is noticeable that the coders in underdeveloped countries are younger than those of the developed ones: 28.89 years old in underdeveloped, against 33.17 years in developed ones.
Finally, let’s look at overtime in working hours:
About the results and some thoughts
As I said at the beginning of this article, the aim here was to see if there are significant distinctions between the “developed world” and the other countries when it comes to their job markets for coders. The data of these two groups of countries are not similar and could lead to misinterpretations; nevertheless, it’s a valid comparison for the brief study conducted here.
The education levels and the overtime hours can be analyzed as a whole. Both suggest a disadvantage in underdeveloped countries: coders there work more and have lower education levels. Combining these results with the results from the analysis of the years coding (with or without education), we could formulate some possible answers (keeping in mind that answering those questions in full is beyond the scope of this article).
As the coding field is relatively new in the world and had an earlier start in more developed and rich countries, a possible answer for the differences we’ve seen could be that the field is still not consolidated in underdeveloped countries. This could be the reason that, in comparison to workers in developed countries, workers in underdeveloped ones work more, are less formally educated and have fewer years coding. This hypothesis is reinforced by the average coder’s age, which is higher in developed countries. But further research is necessary to analyze and fully understand this.