Get more $ for software developer jobs!

Aug 13, 2019

I’m taking a Data Science course on Udacity. The course follows a process called CRISP-DM, and we were guided through it in the five steps below.

Business Understanding
Data Understanding
Prepare Data
Data Modeling
Evaluate the Results

Business Understanding

Our business objective was to become gainfully employed as a Data Analyst. The "gainfully employed" part focused us on which attributes make someone successful in this occupation. The key measure of success was Salary and, to a lesser extent, education and other training.

Data Understanding

We used data from a Stack Overflow survey. I used a different data set than the one provided in the class: a more current one (2019) that may have slightly different attributes. The survey questions are provided in the github repository as a .pdf file. For data understanding, it is very important to see the survey exactly as the survey takers saw it.

Prepare Data

We were given an example of Stack Overflow survey results, which we imported into a Pandas data frame. I only imported the first 20000 rows to make my processing faster. We removed rows that had no salary data, because salary is our primary measure of gainful employment and rows without it cannot help answer our primary questions. For the remaining numeric attributes, we filled in missing values with the average. For categorical variables, we created dummy columns (using the pandas get_dummies function), which converts each categorical value into a true/false value (a number).
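The preparation steps above can be sketched roughly as follows. This is a minimal illustration and not the code from the repository: the function name prepare_data, the toy columns, and the use of ConvertedComp as the salary column are assumptions made for the example.

```python
import pandas as pd


def prepare_data(df, target):
    """Drop rows without a salary, fill numeric gaps with the average,
    and turn categorical columns into dummy columns."""
    # Rows with no salary cannot help predict salary, so remove them.
    df = df.dropna(subset=[target])
    y = df[target]
    X = df.drop(columns=[target]).copy()

    # Fill missing numeric values with the average of each column.
    num_cols = X.select_dtypes(include="number").columns
    X[num_cols] = X[num_cols].fillna(X[num_cols].mean())

    # Convert every categorical column into one column per category.
    X = pd.get_dummies(X)
    return X, y


# Hypothetical toy rows standing in for the survey. With the real file,
# something like pd.read_csv(path, nrows=20000) would load the first 20000 rows.
toy = pd.DataFrame({
    "ConvertedComp": [50000.0, None, 70000.0, 90000.0],  # assumed salary column
    "WorkWeekHrs": [40.0, 35.0, None, 50.0],
    "Country": ["India", "Brazil", "Poland", "India"],
})

X, y = prepare_data(toy, "ConvertedComp")
print(X)
```

The Brazil row disappears because it has no salary, and the missing WorkWeekHrs value is replaced by the average of the rows that remain.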

The code used is in a github repository at https://github.com/mgrierson/Project1

We created a training data subset and a test data subset from the sample of 20000.

Data Modeling

We used the lm_modeling code provided and generated a linear model. The attribute of country had the strongest coefficients.

Below is a graphic that shows the r-squared values by the number of features used for the model we trained. The test results are against a 30 percent test dataset.

The graphic indicates the percentage of the variation in compensation that our model explains. On the test data, our model explains only about 27% of the variation. The results below are from the graph above.

0.2718958935678364 test data set r2
0.3808497962745685 training data set r2
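To make the two r2 numbers concrete, here is a small sketch of a 70/30 split, a least-squares fit, and the r-squared calculation. It uses only numpy and synthetic data so that it runs on its own. The course code is not reproduced here, and in practice scikit-learn would usually handle the split and the fit. All names and numbers in the sketch are made up for illustration.

```python
import numpy as np


def r_squared(y_true, y_pred):
    """Share of the variation in y_true that y_pred explains."""
    ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation
    return 1.0 - ss_res / ss_tot


# Synthetic stand-in for the prepared survey features and salaries.
rng = np.random.default_rng(42)
n = 1000
X = rng.normal(size=(n, 3))
y = 50000 + X @ np.array([8000.0, -3000.0, 0.0]) + rng.normal(scale=20000, size=n)

# Hold out 30 percent of the rows as the test data set.
idx = rng.permutation(n)
n_test = int(0.3 * n)
test_idx, train_idx = idx[:n_test], idx[n_test:]

# Add an intercept column and fit a linear model on the training rows only.
A = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(A[train_idx], y[train_idx], rcond=None)

r2_train = r_squared(y[train_idx], A[train_idx] @ coef)
r2_test = r_squared(y[test_idx], A[test_idx] @ coef)
print("training r2:", r2_train)
print("test r2:", r2_test)
```

A test r2 that is noticeably lower than the training r2, as in the results above, suggests the model fits the training rows better than it generalizes to new ones.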

Evaluate the Results

I’ll use this data to address the questions below.

Why was country the strongest attribute for salary in our classroom example?
Is there any useful meaning to negative vs. positive coefficients?
What other attributes affect compensation?

With this data set and model, the countries with the highest cost of living (Australia, UK, and US) had the strongest negative coefficients, and the countries with the lowest cost of living (Brazil, Poland and India) had the strongest positive coefficients. This may explain why country is the factor with the strongest coefficients in the class examples.

A bar chart of the compensation in dollars shows a wide variation in values with many outliers.

I’ve taken the average compensation for each CurrencyDesc (currency description). The results are in the table below, which only includes currencies with more than 200 data points to limit the effect of outliers. The table shows a wide variation in average salary from one currency to another.

                            average   cnt
CurrencyDesc                             
Australian dollar     146974.970395   304
Brazilian real         33409.493243   296
Canadian dollar       138960.627184   515
European Euro          92091.576950  2872
Indian rupee           24969.921412   878
Polish zloty           34523.693141   277
Pound sterling        160039.771645   924
Russian ruble          23827.428571   238
United States dollar  234109.737758  3676
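A table like the one above can be built with a pandas groupby. The sketch below is an illustration on toy data, not the code from the repository. The function name average_by_currency and the ConvertedComp column name are assumptions. With the survey data, the threshold would be 200.

```python
import pandas as pd


def average_by_currency(df, min_count):
    """Average compensation and row count per currency,
    keeping only currencies with more than min_count rows."""
    summary = df.groupby("CurrencyDesc")["ConvertedComp"].agg(
        average="mean", cnt="count"
    )
    return summary[summary["cnt"] > min_count]


# Hypothetical toy rows; the real table uses the survey data.
toy = pd.DataFrame({
    "CurrencyDesc": ["United States dollar"] * 3 + ["Indian rupee"],
    "ConvertedComp": [100.0, 200.0, 300.0, 10.0],
})

result = average_by_currency(toy, min_count=2)
print(result)
```

Here the single Indian rupee row is filtered out because it does not meet the minimum count.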

The model also showed that the CurrencyDesc coefficient on a yearly basis was negative for countries with a low cost of living, and that the CurrencyDesc coefficient on a weekly basis was significantly positive for countries with a higher cost of living, with one exception: the Pound sterling (GBP currency symbol), which I don’t understand. This suggests there are other factors at play and that the coefficients are not a reliable indicator of anything meaningful here, except the direction in which Salary moves relative to movement in that currency.

                                               est_int         coefs 
165                     CurrencyDesc_Australian dollar -2.282377e+18   
157                                 CurrencySymbol_AUD  2.282377e+18   
166                        CurrencyDesc_Brazilian real  1.015331e+18   
158                                 CurrencySymbol_BRL -1.015331e+18   
171                        CurrencyDesc_Pound sterling  7.515512e+17   
161                                 CurrencySymbol_GBP -7.515512e+17   
163                                 CurrencySymbol_PLN -5.145374e+17   
170                          CurrencyDesc_Polish zloty  5.145374e+17   
164                                 CurrencySymbol_USD  3.628599e+17   
172                  CurrencyDesc_United States dollar -3.628599e+17   
167                       CurrencyDesc_Canadian dollar -2.285384e+17   
159                                 CurrencySymbol_CAD  2.285384e+17   
169                          CurrencyDesc_Indian rupee  2.183194e+17   
162                                 CurrencySymbol_INR -2.183194e+17   
168                         CurrencyDesc_European Euro  1.386951e+16   
160                                 CurrencySymbol_EUR -1.386951e+16   
173                                    CompFreq_Weekly  5.084837e+05   
23                               Country_United States  2.427301e+05   
15                                      Country_Canada  2.260991e+05   
174                                    CompFreq_Yearly -2.189984e+05   
13                                   Country_Australia  2.149730e+05   
14                                      Country_Brazil -1.334084e+05   
21                                       Country_Spain  1.033981e+05
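The CurrencyDesc and CurrencySymbol rows in the table above come in pairs of equal size and opposite sign. One possible explanation, offered as an assumption and not checked against the survey data, is that the two attributes describe the same currency, so their dummy columns are copies of each other. When that happens, a linear model cannot tell the two columns apart, and any pair of coefficients that cancel out gives the same predictions. The sketch below shows the effect on hypothetical toy rows.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows: both columns name the same currency in different words.
toy = pd.DataFrame({
    "CurrencySymbol": ["USD", "GBP", "USD", "INR", "GBP", "INR"],
    "CurrencyDesc": [
        "United States dollar", "Pound sterling", "United States dollar",
        "Indian rupee", "Pound sterling", "Indian rupee",
    ],
})

dummies = pd.get_dummies(toy, dtype=float)
features = dummies.to_numpy()
cols = list(dummies.columns)

# The USD symbol column and the US dollar description column are identical.
same_column = np.array_equal(
    dummies["CurrencySymbol_USD"].to_numpy(),
    dummies["CurrencyDesc_United States dollar"].to_numpy(),
)

# Six columns, but only three independent ones.
rank = np.linalg.matrix_rank(features)

# A huge positive and a huge negative coefficient on the duplicated pair
# give exactly the same predictions as all-zero coefficients.
beta_zero = np.zeros(len(cols))
beta_big = beta_zero.copy()
beta_big[cols.index("CurrencySymbol_USD")] = 1e18
beta_big[cols.index("CurrencyDesc_United States dollar")] = -1e18
same_predictions = np.allclose(features @ beta_zero, features @ beta_big)

print(same_column, rank, same_predictions)
```

If this is what happened in the model, dropping one of the two attributes before calling get_dummies would be a reasonable thing to try.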

Lastly, the education level of the developer had a significant positive coefficient relative to compensation. Education level is frequently the result of choices an individual can make, unlike an attribute such as East Asian ethnicity, which also appears in the table but which nobody chooses. Part-time employment had a significantly negative coefficient.

14                                      Country_Brazil -1.334084e+05   
21                                       Country_Spain  1.033981e+05   
11                       Employment_Employed part-time -9.554332e+04   
17                                     Country_Germany  8.812777e+04   
16                                      Country_France  7.502532e+04   
345                               Ethnicity_East Asian  6.891680e+04   
12   Employment_Independent contractor, freelancer,... -5.977058e+04   
28   EdLevel_Other doctoral degree (Ph.D, Ed.D., etc.)  5.632639e+04   
180  WorkChallenge_Distracting work environment;Mee...  4.736766e+04

See this link for more on get_dummies pandas function: https://towardsdatascience.com/the-dummys-guide-to-creating-dummy-variables-f21faddb1d40


Written by Michael Grierson