
Do freshly minted college graduates get the salary package that they deserve?
Are you in HR at a company that is recruiting freshly minted graduates into its workforce? While conducting a recruitment drive on a college campus, do you ever wonder whether some candidates are stronger in knowledge or skills than others? If your company has defined a basic pay structure for these candidates at onboarding, do you ever think that not all candidates deserve the same salary structure?
I, for one, definitely had these questions and decided to dive into this study at once! My focus was to understand which factors (or, in more technical language, predictors) affect not just the employee’s salary package but also one another.
To conduct this study, I started with the first step any aspiring data scientist takes: Exploratory Data Analysis (EDA). For those coming across the term for the first time, EDA is a way to play around with data, twisting and turning it until it eventually coughs up some (hidden) information.
So, I picked up a dataset from Kaggle that contains data on freshers at an India-based company. It includes each employee’s school, college and graduation details. During recruitment, the company conducts several tests for each candidate: English, Technical Knowledge, Domain Knowledge, Aptitude and Soft Skills. Based on all these factors, the company assigns a salary (in INR).
You can access the dataset here.
As discussed earlier, the company conducted a few tests, namely:
- English Test
- Aptitude Test
- Domain Test
- Soft Skill Test
- Technical Test
It further gave the (yet to be recruited) employees the option to opt out of the technical test, i.e., it was not mandatory to sit for it. All the other tests were mandatory.
Now, some values were missing for the employees who opted out of this test. Had these values been missing at random, we would have imputed them based on certain criteria. But here there is a very valid reason why so many (namely 860) missing values exist. To avoid any confusion, we shall extract these 860 rows and put them in a separate data table. We shall return to them soon!
For the remaining 19,000-odd rows, we will now begin our exploration.
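The split described above can be sketched in pandas roughly as follows. This is only an illustration on toy data: the column names (`technical_score`, `salary`, and so on) and the values are assumptions, not the actual Kaggle schema.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data; column names and values are hypothetical.
df = pd.DataFrame({
    "employee_id": [1, 2, 3, 4, 5],
    "english_score": [510, 445, 620, 380, 555],
    "technical_score": [430.0, np.nan, 515.0, np.nan, 470.0],  # NaN = opted out
    "salary": [520000, 310000, 600000, 290000, 450000],
})

# Rows with no technical score belong to employees who opted out of the test.
opted_out_mask = df["technical_score"].isna()

opted_out = df[opted_out_mask].copy()   # set aside for a later comparison
took_test = df[~opted_out_mask].copy()  # used for the main exploration

print(len(opted_out), "opted out;", len(took_test), "took the test")
```

Keeping the two groups in separate tables avoids imputing scores for a test that was never taken.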
Employee Graduation
Let us look at the distribution of the degrees completed by the employees.
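A distribution like this can be computed with a one-liner in pandas. The sketch below uses a made-up `degree` column with toy values, so the percentages it prints are not those of the real dataset.

```python
import pandas as pd

# Hypothetical column name and toy values, for illustration only.
degrees = pd.Series(
    ["B.Tech/B.E.", "B.Tech/B.E.", "MCA", "B.Tech/B.E.", "M.Tech/M.E."],
    name="degree",
)

# Share of each degree as a percentage of all employees.
degree_share = degrees.value_counts(normalize=True).mul(100).round(1)
print(degree_share)
```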


We notice that almost 95% of the employees have completed a B.E. or B.Tech. Since we do not have much information about the company, it is probably safe to assume that it is a leading IT company, as such companies are usually interested in hiring engineering or computer application graduates. Now let us see which branch these employees chose during graduation.

In 2018, a study was conducted to find out which engineering branches have the highest recruitment rates*. CS, Mechanical and EXTC were among the top branches according to that study. We can see that almost 80% of the employees in this dataset also come from these top branches. The interesting point is that even employees from branches as varied as Mechatronics and Chemical Engineering have opted for an IT company.
Grouping the data
In the original dataset from Kaggle, we had multiple columns for the aptitude test score, technical test score, soft skill score, etc. Since many of them were on the same scale, it made sense to merge some columns to make our study easier.
Grouping of soft skills

The above 5 personality traits were merged into a single attribute representing overall soft skill.
For personality traits, scores usually lie between -10 and +10, with most people falling around zero. We can observe this in the graph for Soft Skill below.
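The article does not say exactly how the five traits were combined. One simple option is to take their row-wise mean, as in the sketch below. The averaging, the Big Five column names and the toy values are all assumptions.

```python
import pandas as pd

# Assumed trait columns (Big Five); the real column names may differ.
traits = ["conscientiousness", "agreeableness", "extraversion",
          "neuroticism", "openness"]

# Toy, made-up trait scores for three employees.
df = pd.DataFrame({
    "conscientiousness": [0.5, -1.2, 1.0],
    "agreeableness": [0.3, 0.1, -0.4],
    "extraversion": [-0.2, 0.8, 0.6],
    "neuroticism": [1.1, -0.5, -0.7],
    "openness": [0.3, -0.2, 0.5],
})

# Assumption: overall soft skill = simple average of the five trait scores.
df["soft_skill"] = df[traits].mean(axis=1)
print(df["soft_skill"])
```

An unweighted mean only makes sense when the columns share a scale, which is the reason given above for merging them.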



An interesting plot to notice is the Domain score of employees. The huge spread tells us that there is a big variation in technical domain knowledge among employees. Why do you think this exists? This may be the case when a student is good academically but has poor industry and domain knowledge, and thus fails to perform well in this test.
Next, let us see how gender-diverse this company is.
Gender Diversity


Although there is a vast difference between the numbers of male and female employees, it is interesting to note that the median salaries of both genders are almost equal.
There can be two reasons for this vast gender gap.
- Either the company was unable to collect equal amounts of data for both genders, in which case conclusions drawn from so few records become unreliable.
- Or the company is biased towards male workers. When might such a situation arise? Say, for example, the company has most of its clients offshore. The company may then prefer male employees to work late on weeknights to overlap with different time zones.
According to a recent study*, the gender pay gap in the Indian IT industry is around 19%. We don’t see such a large difference in our dataset, probably because the gender pay gap increases with work experience. In the initial years, the salaries of freshers are approximately equal.

From the above, we can understand the spread of salary between genders across different degrees. Notice that even though the number of outliers is high, the median salary across all six plots is more or less the same, i.e. around 5.2 LPA (lakh per annum).
Had this data contained an approximately equal number of female employees, we could have stated that there is not much of a median gender pay gap among fresh graduates.
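A comparison of medians like the one above is more trustworthy when the group sizes are shown next to it. The following sketch, on toy data with hypothetical column names, computes both the count and the median salary per degree and gender.

```python
import pandas as pd

# Toy data; column names and salaries are made up for illustration.
df = pd.DataFrame({
    "gender": ["M", "M", "M", "M", "F", "F", "M", "F"],
    "degree": ["B.Tech/B.E.", "B.Tech/B.E.", "MCA", "MCA",
               "B.Tech/B.E.", "MCA", "B.Tech/B.E.", "B.Tech/B.E."],
    "salary": [300000, 320000, 280000, 300000,
               310000, 290000, 340000, 330000],
})

# Count and median salary for every (degree, gender) combination.
summary = df.groupby(["degree", "gender"])["salary"].agg(
    count="count", median="median"
)
print(summary)
```

A small `count` for one group is a hint that its median should be read with caution.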
City Tier

Can you see any difference between the two plots? Probably not. You might want to look more closely, but the difference is extremely small.
Initially, one of our hypotheses was that employees from tier-2 cities might be offered lower salaries. This plot rejects our hypothesis, and how brutally so! So, at least for this dataset, it is safe to assume that the tier of the city where you completed your initial education has no effect on salary.
Correlation
To understand how strong the relationship between two variables is, let us study the correlation matrix. Each correlation coefficient ranges from -1 to +1. You may notice that the correlation of a variable with itself is 1. Any two variables with a correlation close to +1 are strongly positively correlated, and any two with a correlation close to -1 are strongly negatively correlated (one tends to fall as the other rises). A value close to 0 indicates little or no linear relationship.
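As a minimal sketch, a correlation matrix can be computed with pandas' `DataFrame.corr`, which uses the Pearson coefficient by default. The column names and scores below are toy values chosen for illustration, not taken from the dataset.

```python
import pandas as pd

# Toy scores; column names are hypothetical.
scores = pd.DataFrame({
    "english_score": [400, 450, 500, 550, 600],
    "aptitude_score": [410, 440, 520, 540, 610],
    "soft_skill": [0.2, -0.1, 0.0, 0.3, -0.2],
})

# Pairwise Pearson correlation between all numeric columns.
corr = scores.corr()
print(corr.round(2))
```

The resulting table is symmetric with ones on the diagonal, and it is the kind of table a heat map visualises.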

From the above heat map, we can notice that the English Score and Aptitude Score are highly correlated. Why do you think this happened? Maybe because every aptitude test consists of topics like Logical Reasoning, Quants and Verbal Ability. An employee who does well in the English test is therefore likely to perform well in the Aptitude test too.
We can also notice that Soft Skill test results are only weakly correlated with the other test scores. An employee who is good in Technical or Domain knowledge does not necessarily have the most appropriate soft skills, such as leadership, teamwork, extroversion, openness, etc.
One left aside…
Remember we kept aside the data on employees who opted out of the Technical test? Now let us compare the salaries of employees who sat for the technical test with those of employees who did not.

Here we can notice that:
- The average salary of employees who sat for the test is much higher than that of those who opted out of the technical test.
- There is also a huge gap between the maximum salaries of the two groups.
- Similar trends can be observed across the different city tiers.
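The comparison above boils down to a few summary statistics per group. Here is a hedged sketch on toy numbers: the two frames stand in for the employees who took the test and those who opted out, and the salary figures are invented.

```python
import pandas as pd

# Toy salaries (INR) for the two groups; all values are hypothetical.
took_test = pd.DataFrame({"salary": [520000, 600000, 450000, 900000]})
opted_out = pd.DataFrame({"salary": [310000, 290000, 350000]})

# Put the same summary statistics side by side for both groups.
stats = ["mean", "median", "max"]
comparison = pd.DataFrame({
    "took_test": took_test["salary"].agg(stats),
    "opted_out": opted_out["salary"].agg(stats),
})
print(comparison)
```

Including the median alongside the mean helps show whether a gap is driven by a few very high salaries.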
Conclusion:
We may conclude that salary does not depend strongly on 10th and 12th scores. Instead, it is more likely to depend on how well one performs in the Aptitude, Domain, Soft Skill and Technical tests, on which the recruitment procedure is based. Also, someone who performs exceedingly well should be given the salary that he or she deserves instead of a generic sum assigned in the books.
This gives us a totally new insight into the Indian education system! Maybe, instead of introducing board exams for more classes (read: TN govt. to introduce boards for std. 5 & std. 8), we need to start focusing on time-based aptitude skills, technical knowledge and domain knowledge after all!
Why EDA?
After reading this article, you can understand the importance of Exploratory Data Analysis in a data scientist’s role. EDA helped us unlock the insights hidden in the data, spin a fascinating story around it and draw some conclusions that can be helpful in further analysis of real-life business problems.
Future works:
- We have explored the relationships between certain predictors (e.g. 10th score, 12th score, salary, gender, aptitude tests). We can apply modelling techniques such as Linear Regression to this dataset for future predictive analysis.
- Economists and analysts are trying to identify any positive relationship between conscientiousness (one of the soft skills) and wages. They also observe that, contrary to previous findings, women and men have similar returns to personality traits*. Both of these, though they might not have a large business impact, can be really fascinating topics to dive into in a future study.
- Another skill-based study could be to identify an ideal work environment.
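As a rough illustration of the Linear Regression idea mentioned in the list above, the sketch below fits an ordinary least squares model with NumPy. The predictors (an aptitude score and a soft skill score), the salaries and the choice of plain least squares are all assumptions. A real study would more likely use a dedicated modelling library and a held-out test set.

```python
import numpy as np

# Toy data: hypothetical aptitude and soft skill scores, salary in INR.
X = np.array([
    [400, 0.3],
    [450, 0.5],
    [500, 0.4],
    [550, 0.8],
    [600, 0.7],
], dtype=float)
y = np.array([300000, 340000, 360000, 430000, 450000], dtype=float)

# Add an intercept column and solve the least squares problem.
X_design = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(X_design, y, rcond=None)

# Goodness of fit on the training data (R squared).
predicted = X_design @ coef
r_squared = 1 - ((y - predicted) ** 2).sum() / ((y - y.mean()) ** 2).sum()

print("intercept and coefficients:", coef)
print("R squared:", round(r_squared, 3))
```

An R squared measured on the same rows used for fitting is optimistic, so it says little about how well the model would predict salaries for new graduates.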

Citations:
Glassdoor — Fresher’s salary in India
Shiksha.com — http://shiksha.com/b-tech/articles/highest-paying-engineering-branches-know-top-recruiters-salaries-offered
5 personality traits — https://www.verywellmind.com/the-big-five-personality-dimensions-2795422
Gender pay gap — https://www.livemint.com/industry/human-resource/gender-pay-gap-still-high-women-in-india-earn-19-less-than-men-report-1551948081615.html
Introduction of more board exams — http://www.newindianexpress.com/states/tamil-nadu/2019/sep/13/tamil-nadu-introduces-board-exams-for-classes-5-and-8-2033171.html
Future works — https://www.researchgate.net/publication/330137392_Wage_premia_for_skills_the_complementarity_of_cognitive_and_non-cognitive_skills
Project Partner: Juilee Talele
