【Data Analysis】Prediction of Employee Turnover Rate

TEJ 台灣經濟新報
TEJ-API Financial Data Analysis
Jun 13, 2023

Construct a multi-factor polynomial linear regression model to assess an enterprise's future employee turnover rate.


Keywords: Polynomial Linear Regression, Backward Elimination, Turnover Rate

Highlights:

  • Difficulty:★★☆☆☆
  • Construct a Polynomial Linear Regression Model and then implement Backward Elimination to optimize the model’s accuracy.

Preface:

The employee turnover rate measures the flow and change of human resources over a given period, arising from employee resignations and onboarding. It is a crucial indicator for evaluating the stability of an organization and its teams. A lower rate indicates less personnel change; conversely, a higher rate implies latent problems in the organization, such as structural issues or workplace dissatisfaction, which may affect the company's operations and office atmosphere.
Monitoring the employee turnover rate can help a company understand and assess the effectiveness of its HR management strategy, take measures to maintain employee satisfaction, and ensure long-term stability and development. Moreover, predicting the turnover rate can help enterprises improve HR management, reduce costs, raise employees' willingness to stay, and improve organizational efficiency.
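Before building the model, it helps to pin down what we are predicting. A common definition (an assumption for illustration; TEJ's exact formula may differ) divides the number of resignations in the period by the average headcount:

# A minimal sketch of one common turnover-rate definition (assumed, not
# necessarily TEJ's exact formula): resignations / average headcount * 100
def turnover_rate(num_resign: int, avg_headcount: float) -> float:
    return num_resign / avg_headcount * 100

print(turnover_rate(12, 150))  # 8.0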

Programming environment and Module required

This article uses macOS as the operating system and Jupyter Notebook as the editor.

# Load required packages
import pandas as pd
import numpy as np
import tejapi
from collections import Counter
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
import statsmodels.regression.linear_model as sm
import matplotlib.pyplot as plt

# Log in to the TEJ API
api_key = 'YOUR_KEY'
tejapi.ApiConfig.api_key = api_key
tejapi.ApiConfig.ignoretz = True

Database

Data Import

Today's practice uses the columns below. Note that the column "violate_times" is not included in the original dataframe; it is created during preprocessing.

Label

  • turn_rate : Employee turnover rate

Features

  • num_resign : number of resignations
  • num_staff : number of staff
  • workage_avg : average years of service
  • age_avg : average age
  • apct : percentage of PhD degrees
  • bpct : percentage of master's degrees
  • cpct : percentage of bachelor's degrees
  • dpct : percentage of senior high school degrees
  • epct : percentage of junior high school degrees
  • emp_sum : total number of employees
  • emp_age : employee average age
  • emp_yr : employee average annual salary
  • rating : TWSE Corporate Governance Evaluation rating
  • violate_times : number of violations of the Labor Standards Act
# Employee turnover rate (TWN/ACSR01A)
gte, lte = '2021-01-01', '2021-12-31'
TR = tejapi.get('TWN/ACSR01A',
                paginate = True,
                mdate = {'gte':gte, 'lte':lte},
                )

# Composition of employee educational degrees (TWN/AXEMPA)
gte, lte = '2020-01-01', '2021-01-01'
ED = tejapi.get('TWN/AXEMPA',
                paginate = True,
                mdate = {'gte':gte, 'lte':lte},
                )

# Violations of the Labor Standards Act (TWN/ACSR20A)
gte, lte = '2020-01-01', '2021-01-01'
LAW = tejapi.get('TWN/ACSR20A',
                 paginate = True,
                 mdate = {'gte':gte, 'lte':lte},
                 )

# TWSE Corporate Governance Evaluation (TWN/ACSR19A)
gte, lte = '2021-01-01', '2021-12-31'
TWSE = tejapi.get('TWN/ACSR19A',
                  paginate = True,
                  mdate = {'gte':gte, 'lte':lte},
                  )

Preprocessing

Select columns

Select the columns we need from the dataframes "composition of employee educational degrees" (ED) and "enterprises violating the Labor Standards Act" (LAW), and count each company's number of violations. Next, in the "TWSE Corporate Governance Evaluation" dataframe (TWSE), convert the company id from integer to string, because we will merge the dataframes on this column.

ED = ED[["coid", "apct", "bpct", "cpct", "dpct", "epct", "fpct", "emp_sum"]]

# Count each company's violations and build a (coid, violate_times) dataframe
violation_counts = Counter(LAW["coid"])
LAW = pd.DataFrame(data = [violation_counts.keys(), violation_counts.values()]).T.rename(columns = {0: "coid", 1: "violate_times"})

TWSE = TWSE[["coid", "rating"]]
TWSE["coid"] = TWSE["coid"].astype(str)

Merge dataframes

df_main = TR
for i in [ED, TWSE, LAW]:
    df_main = pd.merge(df_main, i, on = "coid")
df_main

Data cleaning

Because NaN values exist in our dataframe and would cause model training to fail, we fill them with zero. It is worth mentioning that filling with zero is neither the only nor the best way to handle NaN; the appropriate method depends on the type of data, the expected output, and the procedure of each task. Since today's practice focuses primarily on building a model, simply replacing NaN with zero is sufficient.

Next, we use ordinal encoding to convert "rating" from a string to an integer.

dataset_regression = df_main[df_main["turn_rate"].notna()].drop(columns = ["source", "fpct"])
dataset_regression["rating"].unique()

# Ordinal encoding: map each rating grade to an integer
trans = {
    "A+": 7,
    "A": 6,
    "B": 5,
    "C": 4,
    "C-": 3,
    "D": 2,
    "D-": 1,
    "NA": 0,
}

dataset_regression["rating"] = dataset_regression["rating"].apply(lambda x: trans.get(x))
dataset_regression = dataset_regression.fillna(0)
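Equivalently, the mapping line can use pandas' Series.map, which also leaves unmapped ratings as NaN for fillna(0) to catch; a minimal alternative sketch:

# Same effect as apply(lambda x: trans.get(x))
dataset_regression["rating"] = dataset_regression["rating"].map(trans)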

Split label and features

X = dataset_regression.iloc[:, 3:].values  # features: every column from index 3 onward
y = dataset_regression.iloc[:, 2].values   # label: turn_rate

Construct the model

We split the data into training and test sets in an 8:2 ratio and set "random_state" so the model's result is reproducible.
After training, let's compare the predicted values with the actual values.

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)  # 80% train, 20% test
regressor = LinearRegression()  # Create a LinearRegression object named regressor
regressor.fit(x_train, y_train)  # Train the linear regression model

y_pred = regressor.predict(x_test)
print(y_pred.shape)  # y_pred is a 1-D vector

np.set_printoptions(precision = 2)  # Display two decimal places
# Reshape y_pred and y_test into (len, 1) 2-D arrays, then concatenate them side by side
P_vs_T = np.concatenate((y_pred.reshape(len(y_pred), 1), y_test.reshape(len(y_test), 1)), 1)
print(P_vs_T)
Actual values vs. predicted values
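As a side note, np.column_stack pairs the two vectors without the manual reshapes; an equivalent sketch:

# Same (n, 2) array of [predicted, actual] pairs
P_vs_T = np.column_stack((y_pred, y_test))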

Observing the results, we can see that some predicted values deviate drastically from the actual ones, and some are even negative. To see the model's performance clearly, we visualize the results with a bar chart, defining a prediction as invalid if the gap is greater than 1 percentage point.

# 1 = invalid prediction (gap > 1 percentage point), 0 = valid prediction
classification = [1 if abs(x - y) > 1 else 0 for x, y in P_vs_T]
counts = Counter(classification)
print(counts)
# Index the Counter explicitly so the bar heights always match the tick labels
plt.bar(x = [0, 1], height = [counts[1], counts[0]], tick_label = ["False", "True"])
plt.show()
Bar chart of the classification results

Fortunately, the result is not bad: the model's accuracy is about 88%.
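For transparency, that accuracy figure can be computed directly from the classification list built above (a minimal sketch):

# Share of valid predictions (gap <= 1 percentage point)
accuracy = classification.count(0) / len(classification)
print(f"accuracy: {accuracy:.0%}")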

# Print the model's intercept and coefficients (no need to refit)
print(regressor.intercept_)
print(regressor.coef_)
Intercept and coefficients

Backward Elimination

Backward elimination is a method of feature selection. Starting from the full model, it removes the feature that contributes least to the output, one at a time, until all remaining features reach a chosen significance level (typically, until every p-value is smaller than 0.05). This approach can prevent over-fitting, reduce the model's complexity, and improve both predictive power and interpretability. Backward elimination is widely used in polynomial linear regression to filter out the most useful features.

One might wonder why, when the p-value is smaller than 0.05, we can argue that the model reaches an acceptable level. The idea comes from hypothesis testing in statistics.

Hypothesis testing is a statistical method used for making statistical inferences about hypotheses regarding population parameters. In hypothesis testing, we propose two opposing hypotheses: the null hypothesis and the alternative hypothesis. The null hypothesis typically represents ineffectiveness, no difference, or no association, while the alternative hypothesis represents effectiveness, difference, or association.
By collecting sample data and conducting statistical analysis, we use a test statistic to evaluate whether the data support or reject the null hypothesis. By calculating the value of the test statistic and comparing it to a pre-set significance level, we can make statistical judgments about the results. If the value of the test statistic deviates significantly from what the null hypothesis predicts, we may reject the null hypothesis in favor of the alternative hypothesis. Otherwise, we fail to reject the null hypothesis.
The p-value represents the probability of obtaining the observed results or more extreme results, assuming the null hypothesis is true. We can reject the null hypothesis when the p-value is smaller than the pre-set significance level (typically 0.05). This is because a p-value smaller than 0.05 indicates that the observed results are very rare, considering the assumptions of the null hypothesis. Rejecting the null hypothesis implies that we have sufficient evidence to support the alternative hypothesis.
However, it's important to note that 0.05 is just a commonly used significance level. In some cases, depending on the nature of the study or the severity of the question, a more stringent significance level (e.g., 0.01) can be chosen. Additionally, other factors such as sample size, study design, and effect size should be considered when making statistical judgments and interpreting the results. Rejecting the null hypothesis is only part of statistical inference and requires careful interpretation and reporting of the results to avoid over-interpretation or misleading conclusions.
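To make this concrete, here is a toy example (synthetic data, assumed purely for illustration) that reads coefficient p-values from a statsmodels OLS fit; the relevant feature gets a tiny p-value, while the noise feature usually exceeds 0.05:

# Toy illustration of p-values (separate import to avoid the article's sm alias)
import numpy as np
import statsmodels.api as sm_api

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)            # feature that truly drives y
x2 = rng.normal(size=100)            # pure noise feature
y_toy = 2 * x1 + rng.normal(size=100)

X_toy = sm_api.add_constant(np.column_stack([x1, x2]))
fit = sm_api.OLS(y_toy, X_toy).fit()
print(fit.pvalues)  # x1's p-value is near 0; x2's is typically > 0.05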

We use a for loop to eliminate the features one at a time.

# Prepend a constant column of 1s for the intercept term b0 in
# y = b0 + b1*x1 + b2*x2 + ...
x_train = np.append(arr = np.ones((len(x_train), 1)).astype(int), values = x_train, axis = 1)
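For reference, statsmodels ships a helper that prepends the same constant column; an equivalent sketch (again using a separate import so it does not clash with the article's sm alias):

import statsmodels.api as sm_api

# Equivalent to the np.append call above: prepends a column of 1s
x_train = sm_api.add_constant(x_train)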

# Backward elimination: start with all column indices (0 is the prepended constant)
col = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]

R_square = []
for i in range(len(col)):
    x_opt = np.array(x_train[:, col], dtype=float)
    regressor_OLS = sm.OLS(endog=y_train, exog=x_opt).fit()
    R_square.append(regressor_OLS.rsquared)

    # Keep the summary and column set of the best model seen so far
    if regressor_OLS.rsquared == max(R_square):
        summary = regressor_OLS.summary()
        attribute = col.copy()

    # Drop the column with the largest p-value
    P = regressor_OLS.pvalues.tolist()
    print(col)
    col.pop(P.index(max(P)))
Gradually eliminating parameters

We will present a line graph showing the R-squared values after each parameter elimination. R-squared is a statistic used to assess the extent to which a regression model explains the variability of the dependent variable. It indicates the percentage of the dependent variable's variance that the independent variables in the model can explain. The R-squared value ranges from 0 to 1: a value closer to 1 indicates a better ability of the model to explain the variability of the dependent variable, while a value closer to 0 suggests weaker explanatory power. Specifically, an R-squared of 0 means that the model cannot explain the variability of the dependent variable at all, while an R-squared of 1 indicates that the model explains it completely.
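The statistic can also be computed straight from its definition, R-squared = 1 - SS_res / SS_tot; a minimal sketch:

# R-squared from its definition (y_true, y_hat are 1-D NumPy arrays)
def r_squared(y_true, y_hat):
    ss_res = np.sum((y_true - y_hat) ** 2)            # residual sum of squares
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)  # total sum of squares
    return 1 - ss_res / ss_tot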

plt.plot(R_square, 'ro--', linewidth=2, markersize=6)
# x-axis labels: number of remaining attributes, from 14 down to 1
plt.xticks(ticks = [i for i in range(0, 14)], labels = [i for i in range(1, 15)][::-1])

plt.title("change in R_square")
plt.xlabel('number of attributes')
plt.ylabel('R_square')

plt.show()
change in R_square

We select the model's parameters at the highest R-squared and print out their column names.

# Map x_train column indices back to feature names
# (index 0 is the prepended constant, so feature names start at index 1)
feature_cols = dataset_regression.iloc[:, 3:].columns
col_map = {0: "const", **dict(zip(range(1, len(feature_cols) + 1), feature_cols))}

print(attribute)
print(list(map(lambda x: col_map.get(x), attribute)))
Feature ids mapped to column names

Check the statistical summary of the model with the highest R-squared.
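Since the loop above stored the best model's summary in the summary variable, it can be displayed directly:

# Statistical summary of the model with the highest R-squared
print(summary)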

regressor_OLS summary


Through backward elimination, we improved the model's accuracy from an initial 46% to 77%. Analyzing the composition of the selected parameters, we found that employee turnover is primarily influenced by variables such as employee tenure, salary, and the proportion of employees with education below college level. However, these conclusions are based on statistical principles, and the interpretation of R-squared should take into account the specific circumstances and characteristics of the model. R-squared can help evaluate a model's goodness of fit and predictive ability and compare the strengths and weaknesses of different models, but it should be used with caution and combined with other evaluation metrics and professional judgment in order to fully understand the model's explanatory power and limitations.


You could give us encouragement by…
We will share financial database applications every week.
If you think today's article is good, you can click the applause icon once.
If you think it is awesome, you can hold the applause icon until it reaches 50 claps.
Any feedback is welcome; please feel free to leave a comment below.


TEJ (台灣經濟新報), founded in 1990, is Taiwan's leading financial information provider. It supplies the data needed for fundamental analysis of financial markets, along with solutions and consulting services in credit risk, regulatory technology, asset valuation, quantitative analysis, and ESG. As the field of finance grows increasingly diverse and complex, TEJ brings together top talent from industry and academia and is committed to developing new technologies such as machine learning, artificial intelligence (AI), and natural language processing (NLP) to continuously deliver innovative services.