Predicting Employee Attrition: Comparing Logistic Regression and Decision Tree Models

9 min read · Jul 7, 2023

In my previous blog, I discussed the concept of employee attrition and its impact on organizations. Today, we will focus on building a machine learning model to predict employee attrition.

Continuing from where we left off in the data exploration blog, our focus now shifts to the next step in the data science pipeline: building a machine learning model. I will guide you through the process of selecting the right model, preparing the data, and implementing the necessary steps to train and evaluate your model effectively.

Data Preparation:

The first step in building the model is preparing the data. In this step, we make sure that our data is formatted correctly for training the machine learning model.

To prepare our data for training the attrition prediction model, we follow these steps:

1. Import the Necessary Packages:

import pandas as pd
from sklearn.model_selection import train_test_split

2. Perform One-hot Encoding:

Since our dataset contains categorical variables, we need to convert them into numerical representations that the model can understand. We use the get_dummies() function from the pandas library to perform one-hot encoding: it creates one dummy (0/1) column for each category of every categorical feature and leaves numeric columns unchanged.

data_dummies = pd.get_dummies(data)
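
To see what one-hot encoding does to the columns, here is a minimal sketch on a tiny, made-up table. The column names and values are hypothetical and only mimic the shape of an HR dataset.

```python
import pandas as pd

# Hypothetical toy data; names and values are illustrative only.
toy = pd.DataFrame({
    "Age": [41, 29, 35],
    "Department": ["Sales", "Research", "Sales"],
    "Attrition": ["Yes", "No", "No"],
})

# Numeric columns pass through unchanged; every category of a text
# column becomes its own dummy column named <column>_<category>.
toy_dummies = pd.get_dummies(toy)
print(list(toy_dummies.columns))
```

This is why the target shows up as two columns, 'Attrition_Yes' and 'Attrition_No', after encoding.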

3. Select the Relevant Features:

From the DataFrame that contains the dummy variables, we select the features that we believe are important for predicting attrition, aiming to capture the most relevant information for our model. For now, I am selecting all the features except the target dummy variables (‘Attrition_Yes’, ‘Attrition_No’); the feature list can be revised later based on the model’s performance.

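X = data_dummies.drop(columns=['Attrition_Yes', 'Attrition_No'])
# Use a 1-D Series of 0/1 labels (1 = employee left); a one-column DataFrame triggers a DataConversionWarning in fit()
y = data_dummies['Attrition_Yes'].astype(int)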

X holds the input features and y is the target variable.

4. Split the Data into Training and Testing Sets:

It is crucial to evaluate our model’s performance on unseen data to assess its ability to generalize. We use the train_test_split() function to split the data into training and testing sets. The X_train, X_test, y_train, and y_test variables store the resulting split, with 75% of the data used for training and 25% for testing (the default when test_size is not specified). The random_state parameter ensures reproducibility of the split.

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
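
The split sizes can be checked on stand-in data. The sketch below uses a synthetic array rather than the attrition dataset, and it also shows the optional stratify argument, which is not used in this post but keeps the share of each class the same in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, 16 positives (hypothetical numbers).
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([1] * 16 + [0] * 84)

# With no test_size given, scikit-learn holds out 25% of the rows.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)
print(len(X_tr), len(X_te))

# stratify=y_demo preserves the class ratio in the train and test parts.
X_tr_s, X_te_s, y_tr_s, y_te_s = train_test_split(
    X_demo, y_demo, random_state=0, stratify=y_demo
)
print(y_tr_s.mean(), y_te_s.mean())
```

Stratifying can be worth considering when one class is much rarer than the other, since a purely random split may leave the test set with noticeably more or fewer positives than the training set.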

Now our data is ready and the next step will be choosing the right model.

Choosing the Right Model:

When it comes to selecting the right model for predicting attrition, it’s essential to consider several factors that can influence the model’s performance and accuracy.

Here are some key considerations to keep in mind:

Model Complexity
Data Availability and Size
Feature Importance
Performance Metrics
Algorithm Suitability
Time and Computational Constraints

By carefully considering these factors and assessing the trade-offs, you can select the most suitable classification model for predicting attrition in your organization. Remember, experimentation and iteration are key, and it’s often beneficial to try multiple models and compare their performance before finalizing the best approach.

To select the most appropriate model for predicting attrition for our specific dataset, we can consider trying both Logistic Regression and tree-based models and evaluate their performance. These two types of models have proven to be effective in classification tasks and are commonly used for predicting attrition. By comparing the results and considering the strengths and limitations of each model, we can make a more informed decision.

Logistic Regression is a straightforward and interpretable model that works well when the log-odds of the target variable are approximately a linear function of the input features. Its coefficients provide insights into the importance and direction of each feature’s effect on the attrition prediction. Logistic Regression is also computationally efficient and performs well with smaller datasets.

On the other hand, tree-based models, such as decision trees and random forests, are known for their ability to capture complex relationships and interactions between features. These models can handle non-linear relationships and automatically perform feature selection, making them suitable for datasets with a large number of features. Random forests, in particular, can reduce the risk of overfitting and provide robust predictions.

To determine the best model among these options, we can consider metrics such as accuracy, precision, recall, and F1-score to compare the models’ predictive abilities.

Logistic Regression

Logistic regression is a supervised classification algorithm that predicts whether something is true or false. It fits an S-shaped logistic function, called the sigmoid function, whose curve goes from 0 to 1 and gives the probability that an observation belongs to a particular class. It can be used to classify different types of data, and it is also used to assess which variables are useful in classifying the samples. It is most commonly used for binary classification.
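
The sigmoid function itself is short enough to write out. A minimal sketch in plain Python; the helper name sigmoid is just for illustration:

```python
import math

def sigmoid(z: float) -> float:
    """Map any real-valued score z to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

# Large negative scores map close to 0, large positive scores close to 1,
# and a score of 0 maps to exactly 0.5.
for z in (-4, 0, 4):
    print(z, round(sigmoid(z), 3))
```

In logistic regression, z is a weighted sum of the input features, and the output is read as the probability of the positive class (here, attrition).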

For building the model, first we need to import the necessary packages.

from sklearn.linear_model import LogisticRegression

Next, we create the model and fit it to our training data.

log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)

After fitting the model, we need to compute its evaluation metrics. Several are available, such as accuracy, precision, recall and F1-score.

print('Training model accuracy score: {:.3f}'.format(log_reg.score(X_train, y_train)))
print('Test model accuracy score: {:.3f}'.format(log_reg.score(X_test, y_test)))

The training accuracy of 0.845 and the test accuracy of 0.848 look good at first glance, and the small gap between them suggests the model is not overfitting. Accuracy alone can be misleading here, though: about 84% of the employees in the test set (310 of 368) did not leave, so a model that always predicts “no attrition” would score almost as well. It’s therefore important to consider additional evaluation metrics for a comprehensive understanding of the model’s performance.
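
A quick way to put an accuracy figure in context is to compare it with a baseline that always predicts the most frequent class. The sketch below builds labels with the same class counts as the test set in this post (310 employees who stayed, 58 who left) and scores scikit-learn's DummyClassifier on them. The feature array is a placeholder, since this baseline ignores the features.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Labels with the test-set class counts reported in this post.
y_demo = np.array([0] * 310 + [1] * 58)
X_demo = np.zeros((len(y_demo), 1))  # placeholder; the baseline ignores features

# Always predicts the majority class (0 = no attrition).
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_demo, y_demo)
print(round(baseline.score(X_demo, y_demo), 3))
```

Any model worth keeping should beat this baseline by a clear margin, which is why the per-class metrics that follow matter more than the headline accuracy.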

y_test_pred = log_reg.predict(X_test)
from sklearn.metrics import classification_report
print(classification_report(y_test, y_test_pred))

The classification report provides a summary of the model’s performance on the test data.

Precision: The precision for class 0 is 0.85, which means that out of all the predicted instances labeled as class 0, 85% were actually correct. The precision for class 1 is 0.75, indicating that only 75% of the predicted instances labeled as class 1 were correct.
Recall: The recall for class 0 is 1.00, indicating that the model correctly identified all instances of class 0. However, the recall for class 1 is only 0.05, suggesting that the model struggled to identify instances of class 1 correctly.
F1-score: The F1-score combines both precision and recall into a single metric. For class 0, the F1-score is 0.92, representing a balance between precision and recall. However, for class 1, the F1-score is only 0.10, indicating a poor balance between precision and recall.
Support: The support refers to the number of instances in each class. In this case, there are 310 instances of class 0 and 58 instances of class 1 in the test data.
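
These figures can be reproduced by hand from a confusion matrix. The counts below are an assumption: the confusion matrix is not shown in this post, but these values are consistent with the reported supports (310 and 58) and with the class 1 precision and recall.

```python
# Assumed confusion-matrix counts, chosen to match the reported metrics.
# tn/fp: employees who stayed, predicted as stayed/left.
# fn/tp: employees who left, predicted as stayed/left.
tn, fp, fn, tp = 309, 1, 55, 3

precision_1 = tp / (tp + fp)   # share of predicted leavers who really left
recall_1 = tp / (tp + fn)      # share of real leavers the model caught
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision_1:.2f} recall={recall_1:.2f} "
      f"f1={f1_1:.2f} accuracy={accuracy:.3f}")
```

Under these assumed counts, the model flags only 4 employees as leaving and is right about 3 of them, while 55 of the 58 actual leavers go undetected.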

Overall, the model achieved an accuracy of 0.85, meaning that it correctly classified 85% of the instances in the test data. However, the low recall and F1-score for class 1 suggest that the model struggles to identify instances of attrition accurately. This could indicate a class imbalance issue or the need for further model refinement to improve performance.
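
One common option for handling class imbalance in scikit-learn is the class_weight parameter. This is not what was done in this post; the sketch below only illustrates the idea on synthetic data with roughly 15% positives, and results on the real attrition data may differ.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with roughly 15% positives (not the attrition dataset).
X_demo, y_demo = make_classification(
    n_samples=1500, n_features=10, weights=[0.85, 0.15], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=0, stratify=y_demo
)

plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
# class_weight="balanced" weights each class inversely to its frequency,
# so mistakes on the rare class cost more during training.
weighted = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_tr, y_tr)

recall_plain = recall_score(y_te, plain.predict(X_te))
recall_weighted = recall_score(y_te, weighted.predict(X_te))
print("recall, default weights :", round(recall_plain, 3))
print("recall, balanced weights:", round(recall_weighted, 3))
```

Reweighting typically trades some precision for recall, so it is worth checking both metrics before adopting it.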

To improve the logistic regression model, we can tune its hyperparameters. After trying out different values, I settled on the following, as they gave good results: C=1000, which weakens the regularization (C is the inverse of the regularization strength), and max_iter=10000, which gives the solver more iterations to converge.

log_reg = LogisticRegression(C=1000, max_iter=10000)
log_reg.fit(X_train, y_train)

Again, we evaluate the model by checking its metrics.

y_test_pred = log_reg.predict(X_test)
from sklearn.metrics import classification_report
print(classification_report(y_test, y_test_pred))

Initially, with the default hyperparameters, the accuracy of the model was 0.85, but it exhibited low recall and F1-score for class 1, indicating difficulties in correctly identifying instances of attrition.

After setting two hyperparameters, C=1000 and max_iter=10000, the updated model’s metrics improved significantly. Its accuracy increased to 0.90, and the precision, recall, and F1-score improved for both classes. The weighted average F1-score increased to 0.89, further confirming the model’s enhanced performance.

These results suggest that tuning the hyperparameters C and max_iter had a positive impact on the model’s ability to predict employee attrition accurately.
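
One common way to search for such values without touching the test set is a cross-validated grid search. This is a minimal sketch on synthetic data; the candidate C values and the F1 scoring are illustrative choices, not the search that was actually run for this post.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (not the attrition dataset).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, weights=[0.85, 0.15], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=0, stratify=y_demo
)

# Illustrative candidate values for the inverse regularization strength.
param_grid = {"C": [0.01, 0.1, 1, 10, 100, 1000]}

search = GridSearchCV(
    LogisticRegression(max_iter=10000),
    param_grid,
    scoring="f1",  # F1-score of the positive class
    cv=5,          # 5-fold cross-validation on the training data only
)
search.fit(X_tr, y_tr)

print(search.best_params_)
print(round(search.score(X_te, y_te), 3))  # F1 on the held-out test set
```

Because the folds come from the training data, the test set is used only once, for the final score, which keeps that score an honest estimate.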

When choosing between precision and recall, consider the cost of each kind of error. Precision is important when minimizing false positives is crucial, such as in medical diagnoses, while recall is important when minimizing false negatives is more critical, such as in detecting fraud. In attrition prediction, a false positive means spending retention effort on an employee who would have stayed, and a false negative means missing an employee who leaves. Both matter, but depending on the specific goals and priorities of the organization, one might be more important than the other.
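
The trade-off can be made explicit by changing the probability threshold at which an employee is flagged. The sketch below uses synthetic data and two arbitrary thresholds (0.5, which is what predict() effectively uses for a binary logistic regression, and 0.3) purely for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the attrition dataset).
X_demo, y_demo = make_classification(
    n_samples=1500, n_features=10, weights=[0.85, 0.15], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=0, stratify=y_demo
)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba_leave = model.predict_proba(X_te)[:, 1]  # probability of class 1

recalls = {}
for threshold in (0.5, 0.3):
    # Flag an employee when the predicted probability reaches the threshold.
    y_pred = (proba_leave >= threshold).astype(int)
    recalls[threshold] = recall_score(y_te, y_pred)
    precision = precision_score(y_te, y_pred, zero_division=0)
    print(f"threshold={threshold}: precision={precision:.2f}, "
          f"recall={recalls[threshold]:.2f}")
```

Lowering the threshold can never reduce recall, because every employee flagged at 0.5 is still flagged at 0.3. The price is usually more false positives, and therefore lower precision.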

Decision Trees

Next, I will explore building a decision tree model for predicting employee attrition. Decision trees are a popular choice for classification tasks as they provide interpretable results and can handle both numerical and categorical features effectively.

To build the decision tree model, I will utilize the scikit-learn library, which provides a user-friendly implementation of decision trees. To keep the comparison fair, I will reuse the same training and test sets that were created earlier with the train_test_split function.

Then, I will create an instance of the DecisionTreeClassifier class and fit the model to the training data. This process involves recursively splitting the data based on feature thresholds to create a tree-like structure that predicts the target variable, in this case, attrition.

Once the model is trained, I will evaluate its performance on the test data. This includes calculating accuracy, precision, recall, and F1-score to assess how well the decision tree model predicts employee attrition.

By comparing the performance of the decision tree model with the logistic regression model, we can gain insights into which approach yields better results in predicting attrition for our dataset.

import sklearn.tree as tree

# max_depth=2 limits the tree to two levels of splits
dt = tree.DecisionTreeClassifier(max_depth=2)
dt.fit(X_train, y_train)

y_test_pred = dt.predict(X_test)

# Evaluate on the test data
from sklearn.metrics import classification_report
print(classification_report(y_test, y_test_pred))
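
Since interpretability is one of the main attractions of a decision tree, it can help to print the rules a fitted tree has learned. The sketch below uses scikit-learn's export_text on a small synthetic dataset; the feature names are hypothetical, and on the real data you would pass the column names of X instead.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data with hypothetical feature names.
X_demo, y_demo = make_classification(n_samples=300, n_features=4, random_state=0)
feature_names = ["f0", "f1", "f2", "f3"]

dt_demo = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_demo, y_demo)

# Print the learned splits as indented if/else rules.
print(export_text(dt_demo, feature_names=feature_names))
print("depth:", dt_demo.get_depth(), "leaves:", dt_demo.get_n_leaves())
```

With max_depth=2 the tree has at most four leaves, so every employee falls into one of at most four groups. That makes the model easy to read, but it also limits how finely it can separate leavers from stayers.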

Upon comparing the results of the decision tree model and the logistic regression model, we can observe some notable differences in their performance.

For the decision tree model, the precision for predicting attrition (class 1) is relatively low at 0.50, meaning that half of the employees it flags as leaving actually stay. The recall, which measures the proportion of actual attrition cases correctly identified, is also quite low at 0.09. Consequently, the F1-score, which combines precision and recall, is only 0.15 for class 1. On the other hand, the model performs well in predicting non-attrition cases (class 0) with a high precision of 0.85 and a recall of 0.98, resulting in an F1-score of 0.91 for class 0.

Overall, the logistic regression model demonstrates a better trade-off between precision and recall, yielding higher F1-scores for both classes compared to the decision tree model.

Conclusion

Considering these results, it appears that the decision tree model is not performing as well in predicting employee attrition. One possible reason for this could be that the data may not exhibit a complex relationship between the predictor variables and the target variable (attrition). Decision trees are known for their ability to capture complex interactions and non-linear relationships in the data. If the data in our case is relatively straightforward or does not contain intricate patterns, the decision tree model may offer no advantage. Another likely factor is that the tree was limited to max_depth=2 and, unlike the logistic regression model, was not tuned further, so the comparison is not entirely like for like.

In contrast, logistic regression models are particularly effective at capturing linear relationships between variables. They assume a linear relationship between the predictors and the log-odds of the target variable, which can be suitable for datasets where the relationship is more straightforward. Therefore, the logistic regression model might have performed better in our case, as it could effectively capture the linear aspects of the data and make accurate predictions regarding attrition.
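
The linear log-odds assumption is also what makes a logistic regression model readable. The sketch below, on synthetic data with hypothetical feature names, turns the fitted coefficients into odds ratios and checks that the model's raw score is simply a weighted sum of the inputs plus an intercept.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data with hypothetical feature names.
X_arr, y_demo = make_classification(n_samples=500, n_features=4, random_state=0)
X_demo = pd.DataFrame(X_arr, columns=["f0", "f1", "f2", "f3"])

model = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# Each coefficient is the change in log-odds for a one-unit increase
# in that feature, holding the others fixed.
coefs = pd.Series(model.coef_[0], index=X_demo.columns)
odds_ratios = np.exp(coefs)  # above 1 raises the odds of class 1, below 1 lowers them
print(odds_ratios.sort_values(ascending=False))

# The log-odds are a linear function of the inputs.
log_odds = X_demo.to_numpy() @ model.coef_[0] + model.intercept_[0]
print(np.allclose(log_odds, model.decision_function(X_demo)))
```

Coefficients are only comparable across features when the features are on similar scales, so scaling the inputs first is a reasonable step before reading them as importance.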

It is important to note that the performance of different models can vary depending on the nature and complexity of the dataset. Therefore, it is crucial to explore and compare multiple models to determine the most suitable approach for predicting employee attrition accurately.
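
Comparing several candidates can be done in a short loop. This is a minimal sketch on synthetic data, using 5-fold cross-validation and the F1-score of the positive class as the yardstick; the model settings mirror the ones used in this post, but the data and the choice of metric are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with roughly 15% positives (not the attrition dataset).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.85, 0.15], random_state=0
)

candidates = {
    "logistic_regression": LogisticRegression(C=1000, max_iter=10000),
    "decision_tree": DecisionTreeClassifier(max_depth=2, random_state=0),
}

results = {}
for name, model in candidates.items():
    # Mean F1-score of the positive class across 5 folds.
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="f1")
    results[name] = scores.mean()
    print(f"{name}: mean F1 = {results[name]:.3f}")
```

Averaging over folds makes the comparison less dependent on one particular train/test split than a single classification report.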

You can find the code here.