Good model by default using XGBoost
No Data Scientist is the Same — part 3
This article is part of our series about how different types of data scientists build similar models differently. No two humans are the same, and therefore no two data scientists are the same. Moreover, the circumstances under which a data challenge needs to be handled change constantly. For these reasons, different approaches can and will be used to complete the task at hand. In our series we explore the four different approaches of our data scientists — Meta Oric, Aki Razzi, Andy Stand, and Eqaan Librium. They are presented with the task of building a model to predict whether employees of a company — STARDATAPEPS — will look for a new job or not. Based on their distinct profiles, discussed in the first blog, you can already imagine that their approaches will be quite different.
In this article we will discuss Meta Oric’s approach. Let me first remind you of who Meta Oric is:
Meta Oric: “It’s all about Meteoric speed”
Meta Oric is always very busy. She works on a lot of different projects and as a result she does not have much time to spend on each individual model that she has to build. This also applies when she is presented with the challenge to predict whether employees will look for a new job or not. She is very excited that her company is starting with HR analytics and does not want to miss out on this opportunity. But she is already struggling to finish her current projects. Therefore, she decides to get a piece of the action by trusting the power of XGBoost. In previous projects she has had great experiences with XGBoost, and with minimal data prep and by reusing an old notebook she can quickly produce a model.
Why is Meta so excited about XGBoost?
One of the most popular algorithms for both regression and classification tasks nowadays is XGBoost. But what makes XGBoost so special and why is Meta so excited about it?
The two main reasons why XGBoost is so popular are:
- Execution speed: The core XGBoost algorithm is parallelizable and can thereby use all of the processing power of your machine, or it can be run in parallel on multiple machines. This makes XGBoost fast and makes it possible to train the algorithm on very large sets of data.
- Model performance: Its outstanding model performance on structured or tabular datasets is the main reason why XGBoost is so popular. Many challenges and competitions (Kaggle and others) have been won by using XGBoost (see here for a list of winning machine learning solutions that used XGBoost).
There are a lot of good introductions to XGBoost available; a very nice introduction to the model can be found here. In short, eXtreme Gradient Boosting (XGBoost) is an ensemble learning method built from decision tree algorithms: it combines multiple base models (decision trees) to produce one optimal predictive model. There are different kinds of ensemble methods, among which bagging and boosting. XGBoost uses a gradient boosting framework, meaning the decision trees are grown sequentially, in an iterative learning process: a first tree is trained and evaluated, and the incorrectly predicted data points are given more weight in the next iteration. Then the next tree is trained and evaluated, the wrong predictions are again given more weight in the following iteration, and so on.
Boosting visualized (source):
By giving wrongly predicted data points more weight, these data points have a higher probability of appearing in the next trained tree. XGBoost uses a weighted quantile sketch algorithm to effectively find the splitting points in weighted data. This boosting ensemble method is the main difference between XGBoost and Random Forest. With Random Forest, new training datasets are formed by random sampling with replacement from the original dataset, where each observation has the same probability of appearing in a new dataset (also referred to as bagging). Thus, Random Forest and boosted trees are both tree-based ensemble algorithms, and the difference between them arises from how the models are trained.
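To make the boosting idea concrete, here is a minimal sketch of boosting for a regression task: each new shallow tree is fit on the residual errors left by the trees trained so far. This toy loop only illustrates the principle and is not XGBoost’s actual implementation (which adds regularization, second-order gradients and the weighted quantile sketch); the data and variable names are made up for the example.
# Toy illustration of boosting: each new shallow tree is fit on the
# residual errors of the ensemble built so far (not XGBoost's real code).
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(7)
X_demo = rng.uniform(-3, 3, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
prediction = np.zeros_like(y_demo)           # start from a constant prediction of 0

for _ in range(50):
    residuals = y_demo - prediction          # what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=2)
    tree.fit(X_demo, residuals)              # the next tree focuses on these errors
    prediction += learning_rate * tree.predict(X_demo)

print("Mean squared error after boosting:", np.mean((y_demo - prediction) ** 2))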
What makes Extreme Gradient Boosting ‘Extreme’?
XGBoost wasn’t the first algorithm to use a boosting strategy for predictions. Earlier on, Gradient Boosting models were popular, but they were surpassed when XGBoost was introduced. Why? The picture above shows the improvements that make XGBoost an advanced version of Gradient Boosting (GBM).
First of all, system improvements:
- Parallelization: Making use of all available CPU cores.
- Tree Pruning: A gradient boosting model stops splitting when it encounters a negative loss in a split, which makes it a greedier algorithm. XGBoost, however, makes splits up to the provided max_depth and then prunes the tree backwards, removing splits beyond which there is no positive gain.
- Cache awareness and out-of-core computing: XGBoost has been designed to make efficient use of hardware resources.
Second, algorithmic improvements:
- Regularization: A general principle of model building is that we would like to have a simple yet predictive model. The tradeoff between the two is also referred to as the bias-variance tradeoff in machine learning. As Tianqi Chen, one of the creators of XGBoost, emphasizes, a difference between GBM and XGBoost is that XGBoost controls the complexity of the model through both LASSO (L1) and Ridge (L2) regularization, which helps to avoid overfitting. XGBoost is therefore also known as a 'regularized boosting' technique.
- Handling of missing values/sparsity-aware: XGBoost contains a sparsity-aware split finding algorithm and handles different types of sparsity patterns in the data. Thereby it handles missing data (which most other algorithms can’t) as well as sparsity that can occur due to one-hot encoding, zero entries, etc.
- Built-in cross-validation: The algorithm has a built-in cross-validation method at each iteration, which removes the need to explicitly program this search and to specify the exact number of boosting iterations required in a single run. A short sketch of both the regularization parameters and this built-in cross-validation follows after this list.
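To illustrate how these two algorithmic features are exposed, here is a small sketch using XGBoost’s native API: the reg_alpha (L1) and reg_lambda (L2) parameters control the regularization, and xgb.cv runs cross-validation at every boosting round. The synthetic dataset and parameter values are just placeholders for the example; the employee data is only loaded later in this article.
# Sketch: L1/L2 regularization and built-in cross-validation in XGBoost's native API.
import xgboost as xgb
from sklearn.datasets import make_classification

X_demo, y_demo = make_classification(n_samples=1000, n_features=20, random_state=7)
dtrain = xgb.DMatrix(X_demo, label=y_demo)

params = {
    "objective": "binary:logistic",
    "eval_metric": "logloss",
    "max_depth": 6,
    "reg_alpha": 0.1,    # L1 (LASSO) penalty on the leaf weights
    "reg_lambda": 1.0,   # L2 (Ridge) penalty on the leaf weights
}

# xgb.cv evaluates the model with cross-validation at each boosting round,
# so early stopping can pick the number of boosting iterations for us:
cv_results = xgb.cv(
    params, dtrain,
    num_boost_round=200,
    nfold=5,
    stratified=True,
    early_stopping_rounds=10,
    seed=7,
)
print(cv_results.tail(1))    # logloss mean/std at the best iteration
In the scikit-learn wrapper used later in this article, the same penalties are available as the reg_alpha and reg_lambda arguments of XGBClassifier.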
Disadvantages of XGBoost:
There are a few reasons why some of our data scientists will consider other algorithms in our series of articles:
- The biggest disadvantage of XGBoost is that it is a black box by nature: it does not provide you with effect sizes; you need to program this part yourself (a minimal sketch of one way to start follows after this list).
- XGBoost performs less well on unstructured data.
- It has many hyperparameters and is therefore harder to tune.
- Although XGBoost is less sensitive to overfitting than GBM, it is still sensitive to overfitting if parameters aren’t tuned properly.
- Like every boosting ensemble method, XGBoost is sensitive to outliers, since every classifier tries to fix the errors of its predecessors.
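Feature importances are not the same thing as effect sizes, but they are a common starting point when you do have to program this part yourself. A minimal sketch, with made-up data and illustrative feature names, could look like this:
# Sketch: XGBoost does not report effect sizes out of the box; inspecting the
# feature importances of a fitted classifier is one common starting point.
import pandas as pd
import xgboost as xgb
from sklearn.datasets import make_classification

X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=7)
clf = xgb.XGBClassifier(eval_metric="logloss", importance_type="gain", seed=7)
clf.fit(X_demo, y_demo)

# Gain-based importance per feature (feature names are made up for this example):
importances = pd.Series(clf.feature_importances_,
                        index=[f"feature_{i}" for i in range(X_demo.shape[1])])
print(importances.sort_values(ascending=False).head())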
Using XGBoost to predict which Data Scientists are likely to change jobs
Meta Oric wants to get good predictions with minimal time spent, and for that reason she has a strong preference for XGBoost. First, Meta imports the necessary packages, among which the package that enables her to use XGBoost: ‘xgboost’. This package offers a wrapper class that enables Meta to use XGBoost in a scikit-learn framework.
pip install xgboost

# Importing packages and settings:
import warnings
warnings.filterwarnings(action= 'ignore')
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
import xgboost as xgb
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score
import joblib
Second, Meta loads the dataset. A bit of preparation on this data was done, as described here. The target variable is conveniently named target and indicates whether a data scientist in this historic dataset has left the company. All other columns in the dataset are the features that might help to predict which data scientists are likely to leave the company soon.
df_prep = pd.read_csv('https://bhciaaablob.blob.core.windows.net/featurenegineeringfiles/df_prepared.csv')
df = df_prep.drop(columns=['Unnamed: 0', 'city', 'experience', 'enrollee_id'])
df.head()
Output:
Because Meta Oric is very busy, she doesn’t have a lot of time to prepare the data. However, some data prep is required to be able to train the XGBoost model. She imputes the missing values, converts the categorical variables into dummies, and standardizes the numerical variables. Why does Meta impute the missing values herself, while we just discussed that XGBoost is able to do this by itself? Meta prefers to impute the missing values herself to have maximum control. Moreover, letting XGBoost handle the missing values by itself does not appear to lead to better results compared to standard imputation (source). Other strategies for dealing with missing values are discussed in more detail in this article of our series.
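As a side note, letting XGBoost handle the missing values itself would look roughly like the sketch below: NaNs can be passed straight to the classifier, and the sparsity-aware split finding learns a default direction for them at each split. The toy numeric dataset is only for illustration and is not part of Meta’s pipeline.
# Sketch: XGBoost accepts NaNs directly for numeric features,
# so no explicit imputation step is strictly required.
import numpy as np
import xgboost as xgb
from sklearn.datasets import make_classification

X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=7)
mask = np.random.RandomState(7).rand(*X_demo.shape) < 0.1
X_demo[mask] = np.nan            # introduce roughly 10% missing values

clf = xgb.XGBClassifier(eval_metric="logloss", seed=7)
clf.fit(X_demo, y_demo)          # NaNs are routed to a learned default branch per split
print(clf.predict(X_demo[:5]))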
To save time and to re-use a lot of code from previous projects, Meta makes use of pipelines to prep the data, conduct cross-validation, and train the XGBoost model. If you want to read more about scikit-learn pipelines, this is one of the many available sources.
In the code below Meta separates the target and the features, creates train and test sets, and creates the pipelines to prep the features:
# Define the target vector y:
y = df['target']

# Create a dataset without the target:
X = df.drop('target', axis=1)

# Split X and y into training and testing sets:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=1121218
)

# Create an object with the column labels of only the categorical features and one with only the numeric features:
categorical_features = X.select_dtypes(exclude="number").columns.tolist()
numeric_features = X.select_dtypes(include="number").columns.tolist()

# Create the categorical pipeline: for the categorical variables Meta imputes the missing values with a constant value and encodes them with one-hot encoding:
categorical_pipeline = Pipeline(
    steps=[
        ("impute", SimpleImputer(strategy='constant', fill_value='unknown')),
        ("one-hot", OneHotEncoder(handle_unknown="ignore", sparse=False))
    ]
)

# Create the numeric pipeline: for the numeric variables Meta imputes the missing values with the mean of the column and standardizes them, so that the features have a mean of 0 and a variance of 1:
numeric_pipeline = Pipeline(
    steps=[
        ("impute", SimpleImputer(strategy="mean")),
        ("scale", StandardScaler())
    ]
)

# Combine the two pipelines with a column transformer:
full_processor = ColumnTransformer(transformers=[
    ("numeric", numeric_pipeline, numeric_features),
    ("categorical", categorical_pipeline, categorical_features),
])
Now that everything is ready to prep the data, Meta quickly creates an XGBoost classifier with default parameters and evaluates its performance. Meta uses logloss as the evaluation metric, which is the default for classification, and sets a random seed to be able to reproduce her results. She initially evaluates the model performance only on the training set, with the use of stratified 5-fold cross-validation. Stratified cross-validation is a variation of k-fold cross-validation that returns stratified folds: the folds are made by preserving the percentage of samples for each class. Stratified cross-validation is recommended when the classes are imbalanced, and our target is imbalanced: 25% of the employees are looking for a new job. If you are not familiar with cross-validation, this is a good source to read more about it.
# Meta instantiates the XGBClassifier, since we are dealing with a classification task:
xgb_cl = xgb.XGBClassifier(eval_metric='logloss', seed=7)

# Create the XGBoost pipeline:
xgb_pipeline = Pipeline(steps=[
    ('preprocess', full_processor),
    ('model', xgb_cl)
])

# Evaluate the model with the use of stratified 5-fold cross-validation:
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=7)
scores = cross_val_score(xgb_pipeline, X_train, y_train, cv=cv, scoring='roc_auc')
print("roc_auc = %f (%f)" % (scores.mean(), scores.std()))
Output:
roc_auc = 0.791519 (0.004802)
Using only the default parameters, without any hyperparameter tuning, Meta’s XGBoost gets a ROC AUC score of 0.7915. As you can see below, XGBoost has quite a lot of hyperparameters that can be tweaked to create the optimal model for the provided dataset.
# The default hyperparameters of the XGBoost:
xgb_cl
Output:
XGBClassifier(base_score=None, booster=None, colsample_bylevel=None, colsample_bynode=None, colsample_bytree=None, enable_categorical=False, eval_metric='logloss', gamma=None, gpu_id=None, importance_type=None, interaction_constraints=None, learning_rate=None, max_delta_step=None, max_depth=None, min_child_weight=None, missing=nan, monotone_constraints=None, n_estimators=100, n_jobs=None, num_parallel_tree=None, predictor=None, random_state=None, reg_alpha=None, reg_lambda=None, scale_pos_weight=None, seed=7, subsample=None, tree_method=None, validate_parameters=None, verbosity=None)
Meta does not have time to tune and optimize the model, so she sticks with the default XGBoost parameters. To be able to compare Meta’s approach with the approaches of Aki Razzi, Andy Stand, and Eqaan Librium later on, we fit her model on the training set and save it. Just to be sure, we quickly test whether her model is saved correctly.
# Fit Meta's default XGBoost pipeline:
xgb_pipeline.fit(X_train, y_train)

# Saving Meta's final XGBoost pipeline:
joblib.dump(xgb_pipeline, 'pipe_meta.joblib')
Output:
[‘pipe_meta.joblib’]
Code:
# Testing if Meta's model is correctly saved:
# Load the model:
upload_pipe_meta = joblib.load('pipe_meta.joblib')

# Use it to make the same predictions:
print(upload_pipe_meta.predict(X_train))
Output:
[0 0 1 … 0 0 1]
I hope you enjoyed reading this article and getting to know Meta. Meta’s aim was to quickly produce an XGBoost model. In the next article, Aki Razzi will try to improve Meta’s default XGBoost by tuning the hyperparameters. Additionally, the performance of the XGBoost models of Meta and Aki will be evaluated on the test set.
This article is part of our No Data Scientist Is The Same series. The full series is written by Anya Tonne, Jurriaan Nagelkerke, Karin Gruijs-Vodde and Tom Blanke. The series is also available on theanalyticslab.nl.
An overview of all articles on Medium within the series:
- Introducing our data science rock stars
- Data to predict which employees are likely to leave
- Good model by default using XGBoost
- Hyperparameter tuning for hyperaccurate XGBoost model
- Beat dirty data
- The case of high cardinality kerfuffles
- Guide to manage missing data
- Visualise the business value of predictive models
- No data scientist is the same!
Do you want to do this yourself? Please feel free to download the Notebook on our gitlab page.