Yet Another Post about “How much Colombian software developers really make”

UPDATE: Code and data can be found in a GitHub repository.

TL;DR

Famous Colombian entrepreneur Alexander Torrenegra, through his company Bunny Inc, recently released the results of a survey about the salaries of Colombian software developers. The data, although interesting, is messy. To get extra juice from this data, I use Open Refine to clean it up and cluster similar answers. Based on the cleaned data, I do a bit of data exploration.

Intro

Famous Colombian entrepreneur Alexander Torrenegra recently released the results of a survey about the salaries of Colombian software developers. There have already been interesting analyses of (and controversy about) the results of the survey.

Both of the mentioned analyses tackle the data in an interesting way and provide insight. I like the one from David in particular.

I take a different approach and ask different questions. I also focus on only a couple of fields, for which I try to follow the data-driven journalism / data science process to explore the data, namely:

  1. Obtaining the data
  2. Looking at the data
  3. Cleaning the data
  4. Exploring
  5. Visualizing

And with it, I will try to answer some questions (based on the survey data):

  • Is there a correlation between the role and the salary of the person?
  • Does the job title affect the salary of the person?

Obtaining the data

The data can be downloaded from the original article. It was originally collected using Google Forms, so a CSV is available for download.

Looking at the data

Let’s start getting our hands dirty. I use pandas to take a look at the data. First, let’s look at the columns in the CSV.

['How many years of software development experience do you have?',
'In what technologies are you proficient?',
'What university degree did you obtain?',
'university',
'What percentage of your work time do you spend developing software as opposed to performing other tasks like managing other people or projects, making sales, etc.?',
'Please select all that apply:',
'Referring to the company or project that you work for most of the time:',
'If applicable, how many full-time employees does the company that you work for have?',
"If applicable, what's your job title at the company that you work at?",
'Excluding equity, how much money do you earn on a monthly basis?',
'Are you earning equity (stock, stock options, etc.)?']
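As a minimal sketch of this step, the columns can be listed with pandas as shown below. The file name in the comment is hypothetical, and the two-row CSV is a made-up stand-in for the real export, used only so that the snippet runs on its own.

```python
import io

import pandas as pd

# Made-up stand-in for the survey export. With the real file this would be
# something like pd.read_csv("survey.csv") (hypothetical file name).
sample_csv = io.StringIO(
    'university,"Excluding equity, how much money do you earn on a monthly basis?"\n'
    'universidad nacional de colombia,"USD$1,000 - USD$1,999"\n'
    "eafit,USD$1 - USD$999\n"
)

df = pd.read_csv(sample_csv)

# The column names are the survey questions themselves.
columns = list(df.columns)
print(columns)

# Peek at the first rows.
print(df.head())
```

Note that the quoted fields matter here: both the question text and the salary buckets contain commas.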

And let’s take a peek into the data:

Sample of the data found in the CSV file.

Some Particularities of this data set

From the sample above you can already notice some peculiarities. Some of these come from the design of the questions and, well, from the fact that the data comes from a survey. This data suffers from convenience sampling and non-response biases. In other words, as this data comes from a survey, it cannot be taken as a representative sample (i.e. the sample is not random).

Now for the data itself: the design of the fields is problematic. Take for example the salaries. The survey divided the salaries into the following buckets:

$0
USD$1 - USD$999
USD$1,000 - USD$1,999
USD$3,000 - USD$3,999
USD$6,000 - USD$6,999
USD$2,000 - USD$2,999
USD$5,000 - USD$5,999
USD$4,000 - USD$4,999
USD$8,000 - USD$8,999
USD$7,000 - USD$7,999
USD$9,000 - USD$9,999
> USD$10,000

As both David and Isabel mentioned, the salaries are reported in dollars but earned in Colombian pesos. In Colombia, a range such as USD$1 - USD$999 or USD$1,000 - USD$1,999 is too broad, and grouping salaries into such buckets makes it impossible to get a good view of the salary distribution. This is even more noticeable given that 90% of the survey participants earn below USD$3,000.
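To make statements like "X% earn below a given bucket" easy to check, the bucket labels first need an explicit order, since they are plain strings. The following is a minimal sketch of that idea; the helper name `cumulative_share` and the toy answers are my own illustration, not the survey data.

```python
import pandas as pd

# Bucket labels from the survey, listed from lowest to highest.
BUCKETS = [
    "$0",
    "USD$1 - USD$999",
    "USD$1,000 - USD$1,999",
    "USD$2,000 - USD$2,999",
    "USD$3,000 - USD$3,999",
    "USD$4,000 - USD$4,999",
    "USD$5,000 - USD$5,999",
    "USD$6,000 - USD$6,999",
    "USD$7,000 - USD$7,999",
    "USD$8,000 - USD$8,999",
    "USD$9,000 - USD$9,999",
    "> USD$10,000",
]


def cumulative_share(salaries: pd.Series) -> pd.Series:
    """Cumulative fraction of answers per bucket, lowest bucket first."""
    counts = salaries.value_counts().reindex(BUCKETS, fill_value=0)
    return (counts / counts.sum()).cumsum()


# Made-up answers, only to show the mechanics.
toy = pd.Series(
    [
        "USD$1 - USD$999",
        "USD$1,000 - USD$1,999",
        "USD$1 - USD$999",
        "USD$3,000 - USD$3,999",
    ]
)
shares = cumulative_share(toy)
print(shares)
```

Reading the value at the USD$2,000 - USD$2,999 row gives the share of answers below USD$3,000.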

To try to obtain a better middle point, Isabel assumes that the salary distribution of that 90% is normal. That can be a reasonable assumption, but salary data is rarely normal and is generally right-skewed. That means that the ‘Average Joe’ might be getting even less than what Isabel assumed. If that is the situation for salaries < USD$3,000, I would expect an even lower real mean.

Other fields are also problematic. Take for example the question on the university from which the participant obtained his/her degree. In the original survey it is a free-text field. This creates different spellings for the same answer, making any clustering difficult. Most of the other fields present similar issues.

Another peculiarity is that the data is “bilingual”. There are entries with names and descriptions in English and others in Spanish.

Cleaning the Data

Given the issues described in the previous section, answering some of the questions I posed at the beginning of the article required a bit more data juggling. Recently I found a tool called Open Refine (formerly Google Refine) that helps when dealing with messy data. Among its features are cleaning and clustering fields using string similarities. For example, using it on the university field I could cluster answers and went from 436 different answers to 156. This was not automatic, and it took me a while to obtain a result I considered good enough. In addition to this preprocessing, I also normalized strings and fixed some encoding problems.
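To give an idea of what clustering by string similarity means, here is a minimal Python sketch in the spirit of key-collision clustering with a fingerprint key, one of the methods Open Refine offers. It is an illustration under my own simplifications, not the exact procedure used for this article, and the sample answers are made up.

```python
import re
import unicodedata
from collections import defaultdict


def fingerprint(value: str) -> str:
    """Build a key that ignores accents, case, punctuation and word order."""
    # Fold accented characters to plain ASCII (e.g. "í" becomes "i").
    folded = (
        unicodedata.normalize("NFKD", value)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Lowercase and replace punctuation with spaces.
    cleaned = re.sub(r"[^\w\s]", " ", folded.lower())
    # Sort the unique tokens so that word order does not matter.
    return " ".join(sorted(set(cleaned.split())))


def cluster(values):
    """Group raw answers that share the same fingerprint."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return dict(groups)


# Made-up spellings of two universities.
answers = [
    "Universidad Nacional de Colombia",
    "universidad nacional de colombia ",
    "Universidad de Antioquía",
    "UNIVERSIDAD DE ANTIOQUIA.",
]
clusters = cluster(answers)
for key, members in clusters.items():
    print(key, "->", members)
```

A key like this only catches trivial variations. Abbreviations or campus names (for example "UNAL Bogotá") still need manual review, which is why the process was not automatic.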

An overview of the ‘If applicable, what university(ies) did you attend?’ field (Original data)
An overview of the ‘If applicable, what university(ies) did you attend?’ field (After clustering).

Similar preprocessing was applied to the ‘If applicable, what’s your job title at the company that you work at?’ field. I also expanded some fields with multiple answers so that they were transformed into more than one field. The modified data set can be found on GitHub.
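Expanding a multiple-answer field into one field per answer can be sketched with pandas as follows. The separator (a comma followed by a space) and the sample answers are assumptions for illustration; the actual export may use a different separator.

```python
import pandas as pd

# Made-up answers to a multiple-choice question such as the technologies one.
tech = pd.Series(["Java, Python", "Python", "JavaScript, Java"])

# One 0/1 column per distinct answer; ", " is the assumed separator.
flags = tech.str.get_dummies(sep=", ")
print(flags)
```

Each resulting column can then be summed or cross-tabulated like any other field.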

Visualizing Correlations

With the data a bit cleaner, we can start looking for some patterns.

Let’s plot the overall distribution of salaries so we can later compare it with the salaries per university.

Histogram of Salaries

Let’s start with the university where the person obtained a degree. How are salaries distributed based on this?

University vs Salary

Let’s take a closer look at the university field:

count                                  861
unique                                 156
top       universidad nacional de colombia
freq                                   109

So out of 1230 entries, only 861 reported a university. Also, the most common value is ‘universidad nacional de colombia’, that is, the National University of Colombia (UNAL). Let’s take a look at the most common universities.

universidad nacional de colombia       109
universidad de los andes                62
eafit                                   57
universidad del norte                   53
universidad del valle                   46
universidad distrital fjdc              43
pontificia universidad javeriana        29
universidad de antioquia                23
universidad industrial de santander     21
universidad del quindio                 16
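Summaries like the two above can be obtained with `describe()` and `value_counts()`. The sketch below uses a made-up series, so its numbers are not the survey's; missing answers are represented as `None`.

```python
import pandas as pd

# Made-up answers; None stands for a participant who reported no university.
universities = pd.Series(
    [
        "universidad nacional de colombia",
        "eafit",
        None,
        "universidad nacional de colombia",
        "universidad del norte",
    ]
)

# count, unique, top and freq for a text column (missing values are ignored).
summary = universities.describe()
print(summary)

# The ten most common values, most frequent first.
top = universities.value_counts().head(10)
print(top)
```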

I am a bit guilty of UNAL being the most common university, as I merged all the campuses (namely the Bogotá and Medellín campuses) into one cluster. Some people consider them different universities and some consider them the same. In my opinion they should be treated as one, but that is up for discussion (however, I don’t think it is that relevant now).

Let’s plot the histograms per university. Given the imbalance of the counts, I will plot the percentage of the total answers per university.

Histograms of salary and university (Y axis: percentage of total counts for that university)
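One way to compute the percentages behind such a plot is a cross-tabulation normalized per row, so that each university sums to 100% no matter how many answers it has. This is a sketch with made-up data and the placeholder names "A" and "B"; the column names are assumptions too.

```python
import pandas as pd

# Made-up answers for two placeholder universities.
toy = pd.DataFrame(
    {
        "university": ["A", "A", "A", "A", "B", "B"],
        "salary": [
            "USD$1 - USD$999",
            "USD$1,000 - USD$1,999",
            "USD$1,000 - USD$1,999",
            "USD$2,000 - USD$2,999",
            "USD$1 - USD$999",
            "USD$1,000 - USD$1,999",
        ],
    }
)

# Rows: universities, columns: salary buckets, values: % of that university.
share = pd.crosstab(toy["university"], toy["salary"], normalize="index") * 100
print(share)

# With matplotlib installed, share.T.plot(kind="bar") would draw the bars.
```

Normalizing by row is what makes a university with 109 answers comparable to one with 16.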

Although the distributions look similar, it would appear that graduates from “La Universidad de los Andes” tend to have better salaries than the rest: only a small percentage of its graduates are in the USD$1 - USD$999 bucket. It also looks like salaries of graduates from “La Universidad del Norte” are overall lower than those from the rest of the universities here. Remember that I am just showing correlation, not causation. There are many factors to take into consideration, for example the city where the person reporting the salary lives. For example, if I assume that most of the graduates from a university in Medellín remain in that area, I would expect their salaries to be lower, just because the cost of living there is lower than in Bogotá.

Job Title vs Salary

Let’s see if we can identify a particular pattern when it comes to the position of the person. Now, this data was interesting as it was bilingual. Some reported the position name in English, some in Spanish. Does that mean that the ones answering in English are hired by an international company, or have a better command of English? Does that somehow affect the salary? Let’s take a look at the data.

First, let’s take a look at the most commonly reported positions:

Desarrollador de Software           99
Software Developer                  91
Desarrollador de Software Senior    65
Software Engineer                   60
Analista de Desarrollo              52
Desarrollador Junior                28
Technical Lead                      23
Frontend Developer                  23
CTO                                 22
Director                            19

Remember that this is the cleaned data. However, I couldn’t clean it as much as I wanted: there are too many different spellings and variations of the same job title. Let’s see what we’ve got:

Histogram of salary buckets and job title (Y axis: percentage of total counts for that job title)

We can see that the distribution among the different buckets is similar for “Software Developer” and “Software Engineer”. A bad analyst would say that if your job title is in English, you have a better chance of being well paid ☺. It is interesting to see how easily “Analista de Desarrollo” can be identified as a Colombia-only position, as its salaries never go above the USD$2,000 - USD$2,999 bucket.

Recommendations for the Next Survey

I think the survey was a nice exercise and I thank the Bunny Inc. guys for creating it. However, in order to obtain more value from the data, I would propose some modifications:

  • Ditch Google Forms and create a custom form that is friendlier and less prone to manual entry errors. For example, load a list of Colombian universities or a list of common positions (Software Developer, CTO, Data Engineer, Backend Engineer).
  • Make the reported currency COP. The dollar fluctuates a lot (and much more so recently). Therefore I believe it is better to report COP instead of USD, especially as most of the people answering the survey presumably get paid in COP. Conversion can easily be done afterwards.
  • Ditch the buckets and let people enter an exact value. This will allow better statistics and buckets that make more sense for the Colombian software developer community.
  • Add additional fields, for example the industry or the sex of the person. The location of the person is also important, as the cost of living varies. That way we could also separate out those Colombian software developers working abroad.
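As a rough sketch of what exact values in COP would allow: buckets can be defined after the fact, and the conversion to USD is a single division. Every number below (salaries, bin edges and the exchange rate) is a made-up placeholder, not a recommendation.

```python
import pandas as pd

# Hypothetical exact monthly salaries in COP.
salaries_cop = pd.Series([1_200_000, 2_500_000, 3_800_000, 7_000_000])

# Hypothetical bin edges chosen after looking at the data; each bin
# includes its lower edge and excludes its upper edge (right=False).
edges = [0, 2_000_000, 4_000_000, 6_000_000, float("inf")]
labels = ["< 2M", "2M - 4M", "4M - 6M", ">= 6M"]
buckets = pd.cut(salaries_cop, bins=edges, labels=labels, right=False)
print(buckets.value_counts(sort=False))

# Placeholder exchange rate; a real analysis would use the rate at survey time.
COP_PER_USD = 3000.0
salaries_usd = salaries_cop / COP_PER_USD
print(salaries_usd)
```

With exact values, statistics such as the median can also be computed directly instead of being guessed from bucket midpoints.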

Conclusion

Playing with messy data is not fun, but tools like Open Refine make it way easier. The data obtained from this survey, although not perfect, allows us to get a glimpse of the distribution of salaries of Colombian software developers. However, I hope the next survey from the Bunny Inc. guys yields a cleaner data set so that more in-depth analysis can be made ☺.

The graphs and analysis were done using IPython/Jupyter (a project initiated by the Colombian software developer Fernando Perez), Pandas, Seaborn and Matplotlib. I will publish a link to the complete IPython notebook.