I think StackOverflow needs no introduction for anyone even remotely connected to the software industry. What some people might not know is that StackOverflow publishes a yearly survey illustrating various interesting trends in a software developer’s life. What is probably even less well known is that StackOverflow also releases each year's dataset publicly, so that curious minds can analyze it and draw their own conclusions.
For the last 2–3 years, I have enjoyed, and at times been intrigued by, the trends showcased in the StackOverflow survey reports. I recently watched Josh Bernhard analyze these datasets to answer his own questions and find his own trends. This inspired me to dig deeper into all the past years of the StackOverflow dataset and analyze various trends of the last decade.
Quite obviously, the first thing that jumped out at me was the money trend. So, I am restricting this article to analyzing and presenting some decade-long trends related to a software developer’s salary. I am excited to replicate and extend this blog to other relevant trends, like how remote work has evolved over the last decade, how the gender gap has fared over the last decade, etc. But, I chose to pick up the easiest one first. (Greedy Algorithm!)
Why are USA and Canada popular destinations for higher studies?
To answer this question, let’s find the top 3 regions with the highest average salaries for each year during the period 2011–2019.
Taking a look at the accumulated data, the United States seems to be the clear winner, featuring in the top 3 every year since 2011. Canada also seems to be a favourable destination, with 3 appearances in the 9-year span. What would probably come as a surprise to some is Australia. It features in the top 3 in 5 of the years since 2011 and hence is an excellent region for software developers seeking better pay.
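The ranking described above can be sketched in a few lines of Python. This is only a minimal illustration, not the code behind the article's figures: the rows are made-up numbers, and the tuple layout (year, country, annual salary in USD) and the function name `top_regions_by_year` are my own assumptions. The real survey files use different column names and salary formats from year to year, so they would need cleaning first.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical, made-up rows: (survey_year, country, annual_salary_usd).
responses = [
    (2018, "United States", 100000),
    (2018, "United States", 120000),
    (2018, "Canada", 70000),
    (2018, "Australia", 80000),
    (2018, "India", 20000),
    (2019, "United States", 130000),
    (2019, "Canada", 90000),
    (2019, "Australia", 85000),
    (2019, "Germany", 75000),
]


def top_regions_by_year(rows, n=3):
    """Return {year: [(country, mean_salary), ...]} with the n best-paid countries."""
    # Collect every reported salary per (year, country) pair.
    salaries = defaultdict(list)
    for year, country, salary in rows:
        salaries[(year, country)].append(salary)

    # Average each group, then bucket the averages by year.
    per_year = defaultdict(list)
    for (year, country), values in salaries.items():
        per_year[year].append((country, mean(values)))

    # Sort each year's countries by average salary, highest first, and keep n.
    return {
        year: sorted(entries, key=lambda entry: entry[1], reverse=True)[:n]
        for year, entries in sorted(per_year.items())
    }


top3 = top_regions_by_year(responses)
for year, ranking in top3.items():
    print(year, ranking)
```

Counting how often each country shows up across the yearly top-3 lists then gives the kind of "appearances in the 9-year span" tally used in this section.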
Taking a closer look at the average annual salary figures for the US, we can clearly observe a distinct rise in the salary for the software developers in the last 2 years. Thus, the US has become a happy hunting ground for the software industry and unsurprisingly, the country attracts some of the best minds each year.
Looking for the same pattern in Canada and Australia, we find a similar trend in all 3 countries. The last 2 years have indeed proven increasingly favourable for the finances of software developers. The comparison also shows the difference in average annual salary figures between the 3 countries.
Now, if you are a student or a developer looking for better prospects, this data suggests which countries are the most favourable and why. Of course, you also have to consider a laundry list of other factors like cost of living, housing prices, crime rate, health-care standards, blah, blah and blah.
Has the StackOverflow Survey grown in popularity over the years?
Of course, it has. Actually, this question is a bit of a pretext. My real goal is to show how much data we are dealing with here. Specifically, we are looking at the number of participants in the survey for each year. We can also see that the number of developers contributing to this survey has been increasing fairly steadily. I am pretty sure this fact can serve as the base for many observations and conclusions about other trends.
Let me guess how much you earn!
What good is so much data if you can’t apply Machine Learning to it! So, now let’s do that. However, our purpose won’t be as rudimentary as using a bunch of data to predict salary. You will see!
Using one of the easier machine learning algorithms, Linear Regression, I sought to build a model to predict the salaries of individuals. Specifically, I was seeking answers to 3 questions:
1. How accurately can I predict the salaries? Does this accuracy improve with growing data and increasing features over the years?
2. What factors consistently influence the salary of a software developer?
3. What factors are commonly believed to be salary influencers, but in reality, don’t contribute much?
Applying the ML model to the data from three years, 2011–13, we can build a decent predictive model for each of the years. Here, model accuracy is measured by the R2 (R-Squared) score, where an R2 score of 1 means the model's predictions match the actual salaries exactly.
Although the number of respondents, or simply put the size of the data, does play an important role in improving the accuracy of the model, the more prominent factor is the number of features used to build the model. 2013 had more respondents than the other years, but the number of features available in 2013 was also significantly higher than in the other years.
Another interesting factor in improving model accuracy is the quality of the features. For example, take an essential feature like Age. If fewer respondents answer that question, the quality of the feature drops and its impact on improving model accuracy also weakens.
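To make the R2 score concrete, here is a minimal sketch that fits a one-feature linear regression by ordinary least squares and scores it. The data points are made up, and the helper names `fit_line` and `r2_score` are my own choices for this illustration. The models in this article use many features, so treat this as a toy version of the idea rather than the actual pipeline.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for a single feature: returns (slope, intercept)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def r2_score(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    y_bar = mean(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - y_bar) ** 2 for y in y_true)
    return 1 - ss_res / ss_tot


# Made-up data: years of experience vs. salary in thousands.
years_exp = [1, 2, 3, 4, 5, 6]
salary_k = [42, 48, 55, 57, 66, 68]

slope, intercept = fit_line(years_exp, salary_k)
predicted = [slope * x + intercept for x in years_exp]

score = r2_score(salary_k, predicted)
# Always predicting the average salary is the reference point: it scores 0.
baseline = r2_score(salary_k, [mean(salary_k)] * len(salary_k))

print(f"R2 of the fitted line: {score:.3f}")
print(f"R2 of the mean-only baseline: {baseline:.3f}")
```

An R2 of 1 means every prediction hits the actual value, while 0 means the model does no better than always guessing the average salary. On data the model has not seen, R2 can even turn negative.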
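One quick way to gauge this kind of feature quality is to check what fraction of respondents actually answered each question. The sketch below assumes each respondent is a dictionary with `None` marking a skipped question. The rows and the helper name `response_rates` are hypothetical and only serve the illustration.

```python
# Hypothetical respondents; None marks a question that was skipped.
respondents = [
    {"country": "United States", "age": "25-29", "salary": 90000},
    {"country": "Canada", "age": None, "salary": 70000},
    {"country": "India", "age": None, "salary": None},
    {"country": "Australia", "age": "30-34", "salary": 85000},
]


def response_rates(rows):
    """Fraction of respondents who answered each question (0.0 to 1.0)."""
    columns = rows[0].keys()
    return {
        col: sum(row.get(col) is not None for row in rows) / len(rows)
        for col in columns
    }


rates = response_rates(respondents)
for column, rate in rates.items():
    print(f"{column}: {rate:.0%} answered")
```

A feature with a low response rate either shrinks the usable dataset (if incomplete rows are dropped) or has to be filled in with guesses, and both options limit how much it can help the model.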
Some things never change — Consistent Influencers of Salary
Once the model is trained, we are left with a set of features. These features can be used to deduce which factors influenced our model and which didn’t. The basis for calling a factor an influencer is that it improved the accuracy of our model significantly.
Some of the most prominent salary influencers were: Country, Age, Years of experience, Size of company and Role/Position of Employee. These factors were consistent influencers for models of all the 3 years. There were also some influencers that seemed surprising like: Money spent on purchasing personal tech gadgets.
There were some features whose influence was not consistent. Coding language can seem like an essential salary influencer. However, although its role was prominent in 2012, it was quite insignificant in 2013. Another such example was the kind of industry that the developer works in.
It’s not how you think — Performance of Widely-believed Salary Influencers
Following the logic defined earlier, the factors which either had negligible impact on the accuracy of the model or reduced it cannot be considered influencers of salary. Even if we leave these features out, the accuracy of prediction won’t decrease.
Job satisfaction is considered one of the most important parts of any employment. So, it seemed natural to assume that it would be an essential factor in determining salary. However, its impact on model accuracy was negligible.
Other features that fairly conclusively did not influence salary were StackOverflow usage/reputation, Desktop OS used, Role in Purchasing and any other descriptive answer.
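One way to put this logic into code is a drop-one-feature comparison: fit the model with all features, refit it with one feature left out, and see how far R2 falls. The sketch below is my own illustration under stated assumptions, not necessarily how the article's numbers were produced. The data is made up, the features are plain numbers (real survey answers would need encoding first), and `solve`, `fit_r2` and `r2_drop_per_feature` are hypothetical helper names.

```python
def solve(a, b):
    """Solve the linear system a @ x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_r2(rows, target, features):
    """Fit linear regression (with intercept) on `features`; return in-sample R2."""
    X = [[1.0] + [row[f] for f in features] for row in rows]
    y = [row[target] for row in rows]
    k = len(X[0])
    # Normal equations: (X^T X) beta = X^T y
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(X, y)) for i in range(k)]
    beta = solve(xtx, xty)
    pred = [sum(b * v for b, v in zip(beta, r)) for r in X]
    y_bar = sum(y) / len(y)
    ss_res = sum((t - p) ** 2 for t, p in zip(y, pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y)
    return 1 - ss_res / ss_tot


def r2_drop_per_feature(rows, target, features):
    """For each feature, how much R2 is lost when that feature is left out."""
    full = fit_r2(rows, target, features)
    return {
        f: full - fit_r2(rows, target, [g for g in features if g != f])
        for f in features
    }


# Made-up data: salary (in thousands) tracks experience, not satisfaction.
data = [
    {"experience": 1, "satisfaction": 3, "salary": 42},
    {"experience": 2, "satisfaction": 5, "salary": 48},
    {"experience": 3, "satisfaction": 2, "salary": 55},
    {"experience": 4, "satisfaction": 4, "salary": 57},
    {"experience": 5, "satisfaction": 1, "salary": 66},
    {"experience": 6, "satisfaction": 4, "salary": 68},
    {"experience": 7, "satisfaction": 2, "salary": 75},
    {"experience": 8, "satisfaction": 5, "salary": 79},
]

drops = r2_drop_per_feature(data, "salary", ["experience", "satisfaction"])
for feature, drop in drops.items():
    print(f"Leaving out {feature} lowers R2 by {drop:.3f}")
```

One caveat about this sketch: it scores the model on the same rows it was trained on, and there an extra feature can never lower R2. To see a feature actually reduce accuracy, the score has to be computed on held-out data the model was not trained on.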
So, what did I learn and unlearn?
Pfff! So, this was quite a learning experience. Some of the conclusions were fairly obvious, but it felt good to back up that understanding with data points. And some of the conclusions were really surprising. To be fair, there could be multiple reasons behind the surprising results, like fewer people responding to a certain question, or the form of the answers obtained for the question.
Based on these observations, it will be interesting to learn some of the other crucial aspects from this data like the impact of gender or the pay-offs of remote-work conditions. But, that story is for another day.