Get more $ for software developer jobs!

Michael Grierson
4 min read · Aug 13, 2019

I’m taking a Data Science course on Udacity. The course introduces a process called CRISP-DM (Cross-Industry Standard Process for Data Mining) and guides us through it using the five steps below.

  • Business Understanding
  • Data Understanding
  • Prepare Data
  • Data Modeling
  • Evaluate the Results

Business Understanding

Our business objective was to become gainfully employed as a Data Analyst. The "gainfully employed" part focused us on the attributes that make someone successful in this occupation. The key measure of success was salary, followed to a lesser extent by education and other training.

Data Understanding

We used data from a Stack Overflow survey. I used a different data set from the one provided in the class: a more current one (2019), which may have slightly different attributes. The survey questions are provided in the GitHub repository as a .pdf file. For data understanding, it is very important to see the survey exactly as the respondents saw it.

Prepare Data

We were given an example of Stack Overflow survey results to import into a pandas DataFrame. I imported only the first 20,000 rows to make processing faster. We removed rows that had no salary data, because salary is our primary measure for the business objective of determining gainful employment attributes, and rows without it cannot help answer our primary questions. For the remaining numeric attributes, we imputed the column average for missing values. Categorical variables were converted to dummy variables with the pandas get_dummies function, which turns each category into its own true/false (0/1) column.
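The preparation steps above can be sketched roughly as follows. This is a minimal illustration on a tiny made-up frame, not the code from the repository. The column names ConvertedComp, YearsCode, and Country are assumptions about the survey schema, and the file path in the comment is hypothetical.

```python
import numpy as np
import pandas as pd

# Tiny made-up sample standing in for the survey data (hypothetical values).
# With the real file this would be something like:
#   df = pd.read_csv("survey_results_public.csv", nrows=20000)  # hypothetical path
df = pd.DataFrame({
    "ConvertedComp": [60000.0, np.nan, 85000.0, 40000.0],
    "YearsCode": [5.0, 3.0, np.nan, 10.0],
    "Country": ["Brazil", "India", "Poland", "Brazil"],
})

# 1. Drop rows with no salary, since salary is the response variable.
df = df.dropna(subset=["ConvertedComp"])

# 2. Fill missing numeric values with the column mean.
num_cols = df.select_dtypes(include="number").columns
df[num_cols] = df[num_cols].fillna(df[num_cols].mean())

# 3. One-hot encode the non-numeric (categorical) columns.
cat_cols = list(df.select_dtypes(exclude="number").columns)
prepared = pd.get_dummies(df, columns=cat_cols)

print(prepared)
```

Each category becomes its own indicator column (for example Country_Brazil), which is why a handful of survey questions can expand into hundreds of model features.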

The code used is in a GitHub repository at https://github.com/mgrierson/Project1

We created a training data subset and a test data subset from the sample of 20000.

Data Modeling

We used the provided lm_modeling code to generate a model. The country attribute had the strongest coefficients.

Below is a graphic that shows the r-squared values by the number of features used for the model we trained. The test results are against a 30 percent test dataset.

R-squared is the share of the variation in compensation that our model explains. On the test data, our model explains only about 27% of that variation. The results below are read from the graph above.

0.2718958935678364 test data set r2
0.3808497962745685 training data set r2
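
For readers who want to produce numbers like the two above, here is a hedged sketch of a 70/30 split, a linear fit, and R-squared on both subsets. It uses scikit-learn's train_test_split, LinearRegression, and r2_score on synthetic data. It is not the course's lm_modeling helper, and the data, seeds, and variable names are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared features and salary (made-up data).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([3.0, -2.0, 0.5]) + rng.normal(scale=4.0, size=200)

# Hold out 30 percent of the rows for testing, as in the article.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42
)

model = LinearRegression().fit(X_train, y_train)

# R-squared: the share of the variance in y that the model explains.
train_r2 = r2_score(y_train, model.predict(X_train))
test_r2 = r2_score(y_test, model.predict(X_test))
print(f"train r2: {train_r2:.3f}, test r2: {test_r2:.3f}")
```

A training score noticeably above the test score, as with 0.38 versus 0.27 here, usually suggests that the model fits some noise in the training rows that does not carry over to new data.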

Evaluate the Results

I’ll use this data to address the questions below.

  1. Why was country the strongest attribute for salary in our classroom example?
  2. Is there any useful meaning to negative vs. positive coefficients?
  3. What other attributes affect compensation?

With this data set and model, the countries with the highest cost of living (Australia and the US, though not the UK) had the strongest negative CurrencyDesc coefficients, and the countries with the lowest cost of living (Brazil, Poland and India) had the strongest positive ones. This may explain why country is the strongest factor in the class examples.

A bar chart of the compensation in dollars shows a wide variation in values with many outliers.

I’ve taken the average compensation for each CurrencyDesc (currency description). The results, limited to currencies with more than 200 data points to remove outliers, are in the table below. The table shows a wide variation in average salary by currency, which is a rough proxy for country.

CurrencyDesc                average   cnt
Australian dollar     146974.970395   304
Brazilian real         33409.493243   296
Canadian dollar       138960.627184   515
European Euro          92091.576950  2872
Indian rupee           24969.921412   878
Polish zloty           34523.693141   277
Pound sterling        160039.771645   924
Russian ruble          23827.428571   238
United States dollar  234109.737758  3676
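
A table like the one above can be produced with a pandas groupby. The sketch below uses made-up rows and a threshold of 2 instead of 200 so that it runs on a tiny frame. The salary column name ConvertedComp is an assumption.

```python
import pandas as pd

# Made-up rows standing in for the survey responses.
df = pd.DataFrame({
    "CurrencyDesc": ["Indian rupee", "Indian rupee", "Indian rupee",
                     "Polish zloty", "Polish zloty", "Russian ruble"],
    "ConvertedComp": [20000.0, 25000.0, 30000.0, 30000.0, 40000.0, 24000.0],
})

min_count = 2  # the article keeps currencies with more than 200 responses

# Average salary and number of responses per currency.
summary = (
    df.groupby("CurrencyDesc")["ConvertedComp"]
      .agg(average="mean", cnt="count")
)

# Keep only currencies with enough responses for a stable average.
summary = summary[summary["cnt"] > min_count]
print(summary)
```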

The model also showed that the CurrencyDesc coefficient on a yearly basis was negative for countries with a low cost of living, and that the CurrencyDesc coefficient on a weekly basis was significantly positive for countries with a higher cost of living. The one exception is Pound sterling (GBP currency symbol), which I don’t understand. This suggests there are other factors at play, and that the coefficients are not a reliable indicator of anything meaningful here, except the direction in which salary moves relative to that currency.

     est_int                                    coefs
165  CurrencyDesc_Australian dollar     -2.282377e+18
157  CurrencySymbol_AUD                  2.282377e+18
166  CurrencyDesc_Brazilian real         1.015331e+18
158  CurrencySymbol_BRL                 -1.015331e+18
171  CurrencyDesc_Pound sterling         7.515512e+17
161  CurrencySymbol_GBP                 -7.515512e+17
163  CurrencySymbol_PLN                 -5.145374e+17
170  CurrencyDesc_Polish zloty           5.145374e+17
164  CurrencySymbol_USD                  3.628599e+17
172  CurrencyDesc_United States dollar  -3.628599e+17
167  CurrencyDesc_Canadian dollar       -2.285384e+17
159  CurrencySymbol_CAD                  2.285384e+17
169  CurrencyDesc_Indian rupee           2.183194e+17
162  CurrencySymbol_INR                 -2.183194e+17
168  CurrencyDesc_European Euro          1.386951e+16
160  CurrencySymbol_EUR                 -1.386951e+16
173  CompFreq_Weekly                     5.084837e+05
 23  Country_United States               2.427301e+05
 15  Country_Canada                      2.260991e+05
174  CompFreq_Yearly                    -2.189984e+05
 13  Country_Australia                   2.149730e+05
 14  Country_Brazil                     -1.334084e+05
 21  Country_Spain                       1.033981e+05
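
One possible reading of the mirrored pairs in the table above, offered as an assumption and not verified against the repository: CurrencyDesc and CurrencySymbol describe the same currency, so their dummy columns would be identical, and a least-squares fit cannot decide how to split a weight between two identical columns. The numpy sketch below, on made-up data, shows that a duplicated dummy column makes the design matrix rank deficient and that only the sum of the pair's coefficients affects the predictions.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50

# One made-up dummy column, e.g. "paid in USD" (hypothetical data).
is_usd = rng.integers(0, 2, size=n).astype(float)
y = 30000.0 + 50000.0 * is_usd + rng.normal(scale=1000.0, size=n)

# Design matrix: intercept, a CurrencyDesc-style dummy, and an identical
# CurrencySymbol-style dummy.
X = np.column_stack([np.ones(n), is_usd, is_usd])

# Three columns, but the last two carry the same information.
rank = np.linalg.matrix_rank(X)
print("rank:", rank, "columns:", X.shape[1])

# lstsq returns the minimum-norm solution among infinitely many.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Moving weight from one twin column to the other leaves every prediction
# unchanged, so large offsetting coefficients fit equally well.
shifted = beta + np.array([0.0, 1e6, -1e6])
same_fit = np.allclose(X @ beta, X @ shifted)
print("same predictions:", same_fit)
```

If that is what happened here, dropping one of the two currency columns before fitting should make the remaining currency coefficients easier to interpret.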

Lastly, the education level of the developer had a sizable positive coefficient relative to compensation, and education level is frequently the result of choices an individual can make, unlike ethnicity (one can’t choose to be East Asian, for example). Part-time employment had a strongly negative coefficient.

 14  Country_Brazil                                      -1.334084e+05
 21  Country_Spain                                        1.033981e+05
 11  Employment_Employed part-time                       -9.554332e+04
 17  Country_Germany                                      8.812777e+04
 16  Country_France                                       7.502532e+04
345  Ethnicity_East Asian                                 6.891680e+04
 12  Employment_Independent contractor, freelancer,...   -5.977058e+04
 28  EdLevel_Other doctoral degree (Ph.D, Ed.D., etc.)    5.632639e+04
180  WorkChallenge_Distracting work environment;Mee...    4.736766e+04

See this link for more on get_dummies pandas function: https://towardsdatascience.com/the-dummys-guide-to-creating-dummy-variables-f21faddb1d40
