HR Analytics — Job Change of Data Scientists
A machine learning approach to predicting who will move to a new job, using Python!
A company active in Big Data and Data Science wants to hire data scientists from among the people who successfully pass courses conducted by the company. Many people sign up for this training. The company wants to know which of these candidates really want to work for the company after training and which are looking for new employment, because this helps reduce cost and time and improves the quality of training, course planning, and candidate categorization. Information related to demographics, education, and experience is available from the candidates' signup and enrollment.
- Predict the probability that a candidate will work for the company (a binary classification problem)
- Interpret the model(s) in a way that illustrates which features affect a candidate's decision
The Colab Notebooks for this real-world use case are available at my GitHub repository, or check here to learn how you can directly download data from Kaggle to your Google Drive and readily use it in Google Colab!
This Kaggle competition is designed to help understand the factors that lead a person to leave their current job, which is useful for HR research as well. Using model(s) built on the current credentials, demographics, and experience data, you need to predict the probability of a candidate looking for a new job versus staying with the company, and interpret the factors affecting the employee's decision.
There are a total of 19,158 observations (rows). The dataset contains the following 14 columns:
- enrollee_id: Unique ID for the candidate
- city: City code
- city_development_index: Development index of the city (scaled)
- gender: Gender of the candidate
- relevent_experience: Relevant experience of the candidate
- enrolled_university: Type of University course enrolled if any
- education_level: Education level of candidate
- major_discipline: Education major discipline of the candidate
- experience: Candidate total experience in years
- company_size: Number of employees in the current employer's company
- company_type: Type of current employer
- last_new_job: Difference in years between previous job and current job
- training_hours: Training hours completed
- target: 0 — Not looking for a job change, 1 — Looking for a job change
Note: In the train data there is one garbled entry in the column company_size, i.e. Oct-49 (a spreadsheet artifact, printed as 10/49 in pandas), so we need to convert it to np.nan (NaN), i.e., a NumPy null/missing entry.
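A minimal sketch of that cleanup step, using a toy frame in place of the real Kaggle train set (the column name company_size matches the dataset; the toy values are illustrative):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Kaggle train set
df = pd.DataFrame({"company_size": ["<10", "10/49", "50-99", "10000+"]})

# Replace the spreadsheet-garbled value with a proper missing entry
df["company_size"] = df["company_size"].replace("10/49", np.nan)
```

After this, the garbled category is treated like any other missing value and can be handled by the imputation step later on.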
Basic Exploratory Data Analysis
What is the total number of observations? — 19,158.
Are there any missing values in the data?
Note: 8 features have missing values.
- The above bar chart gives you an idea of how many non-missing values are available in each column. In our case, company_size and company_type contain the most missing values, followed by gender and major_discipline.
- Using the above matrix, you can very quickly spot the pattern of missingness in the dataset. In our case, the columns company_size and company_type have more or less the same pattern of missing values.
- The heatmap shows the correlation of missingness between every pair of columns. In our case, the correlation between company_size and company_type is 0.7, which means that if one of them is present, the other is very likely present as well.
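Plots like these are typically produced with the missingno library (msno.bar, msno.matrix, msno.heatmap). The same information can be computed directly in pandas, sketched here on a toy frame with deliberately correlated missingness (the toy values are assumptions, not the real data):

```python
import pandas as pd

# Toy frame: company_size and company_type go missing together
df = pd.DataFrame({
    "company_size": ["50-99", None, None, "10000+", None],
    "company_type": ["Pvt Ltd", None, None, "Public", "Pvt Ltd"],
    "gender": ["Male", "Female", None, "Male", "Male"],
})

# Bar-chart equivalent: count of available (non-null) values per column
available = df.notna().sum()

# Heatmap equivalent: correlation between missingness indicators
miss_corr = df.isna().astype(int).corr()
```

A high entry in miss_corr between two columns means they tend to be missing in the same rows, which is exactly the pattern the article observes for company_size and company_type.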
What is the distribution of the target?
The number of data scientists who want to change jobs is 4,777, and the number who do not is 14,381, so the data is imbalanced! 😥
Which are the top 10 cities?
What is the maximum index of city development?
- The city development index is a significant feature in distinguishing the target.
The gender-wise desire for a job change:
Does relevant experience have an effect?
Does the type of university of education matter?
- Around 73% of people have no university enrollment.
Does education level affect?
What is the effect of a major discipline?
Do years of experience have any effect on the desire for a job change?
- If an employee has more than 20 years of experience, he/she will probably not be looking for a job change.
What is the effect of company size on the desire for a job change?
Does company type matter?
Does the gap in years between the previous job and the current job have an effect?
Does more training reduce attrition? — Not at all, I guess!
Let us first remove the unnecessary columns: ‘enrollee_id’, as its values are unique identifiers, and ‘city’, as it is not very significant in this case.
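This step is a one-liner in pandas; here is a sketch on a toy frame (only three of the real 14 columns are shown):

```python
import pandas as pd

# Toy frame with the two columns we want to drop
df = pd.DataFrame({
    "enrollee_id": [1, 2],
    "city": ["city_103", "city_40"],
    "training_hours": [36, 47],
})

# Drop the unique identifier and the high-cardinality city code
df = df.drop(columns=["enrollee_id", "city"])
```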
Missing Value Imputation — MICE
MICE (Multiple Imputation by Chained Equations) is a multiple-imputation method and is generally better than a single-imputation method such as mean imputation. As seen above, there are 8 features with missing values; MICE is used to fill them in.
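A minimal sketch of MICE-style imputation using scikit-learn's IterativeImputer, which models each feature with missing values as a function of the others in round-robin fashion. The toy numeric matrix is an assumption; the real categorical columns would need to be label-encoded first:

```python
import numpy as np
# IterativeImputer is still experimental and must be enabled explicitly
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy numeric matrix with missing entries standing in for the encoded data
X = np.array([[1.0, 2.0], [np.nan, 4.0], [5.0, 6.0], [7.0, np.nan]])

imputer = IterativeImputer(max_iter=10, random_state=0)
X_imputed = imputer.fit_transform(X)
```

Each missing cell is estimated by regressing its column on the remaining columns, so the filled values respect the relationships between features rather than just a column mean.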
Once missing values are imputed, the data can be split into train and validation (test) parts and the model can be built on the training dataset. Before this, note that the data is highly imbalanced, so we first need to balance it. For this, the Synthetic Minority Oversampling Technique (SMOTE) is used.
After applying SMOTE to the entire dataset, the data is split into train and validation sets. A StandardScaler is fitted on the training dataset and the same transformation is applied to the validation dataset. StandardScaler removes the mean and scales each feature/variable to unit variance; this operation is performed feature-wise and independently. Note that StandardScaler can be influenced by outliers (if they exist in the dataset) since it estimates the empirical mean and standard deviation of each feature.
After splitting the data into train and validation, we get the following distribution of class labels, which shows the data is no longer imbalanced.
Model Building and Validation
The training dataset, with 20,133 observations, is used for model building, and the resulting model is validated on the validation dataset of 8,629 observations.
The following models are built and evaluated.
XGBoost and Light GBM achieve good accuracy scores of more than 90%. XGBoost is a scalable and accurate implementation of gradient boosting machines; it has proven to push the limits of computing power for boosted-tree algorithms, as it was built and developed for the sole purpose of model performance and computational speed.
Light Gradient Boosting Model
There is only a slight increase in accuracy and AUC score from applying Light GBM over XGBoost, but there is a significant difference in the execution time of the training procedure: Light GBM is almost 7 times faster than XGBoost, making it a much better approach when dealing with large datasets.
For more on performance metrics check — https://medium.com/nerd-for-tech/machine-learning-model-performance-metrics-84f94d39a92
Thanks for reading ❤
- Kaggle Competition — https://www.kaggle.com/arashnic/hr-analytics-job-change-of-data-scientists.
For any suggestions or queries, leave your comments below and follow for updates.
If you liked the article, please hit the 👏 icon to support it. This will help other Medium users find it. Share it, so that others can read it!
Happy Learning! 😊