LightGBM, XGBoost and CatBoost — Kaggle — Santander Challenge

Achieved a score of 1.4714 with this Kernel in Kaggle

Rohan Paul
Nov 15, 2020 · 26 min read

Link To my Kaggle Notebook

(If you like the Kaggle Notebook, please consider upvoting it in Kaggle)

Getting the data and Kaggle Challenge Link

Gradient Boosted trees have become one of the most powerful algorithms for training on tabular data. Over the recent past, we’ve been fortunate to have many implementations of boosted trees, each with its own unique characteristics. In this notebook, I will implement LightGBM, XGBoost and CatBoost to tackle this Kaggle problem.

What is Boosting

To understand the absolute basics of the need for Boosting algorithms, let's ask a basic question: if a data point is predicted incorrectly by our first model, and then by the next (probably by all models), will combining the predictions provide better results? Boosting algorithms handle such questions.

So, Boosting is a sequential technique which works on the principle of an ensemble, where each subsequent model attempts to correct the errors of the previous model. The succeeding models are dependent on the previous model.

The basic principle behind the working of the boosting algorithm is to generate multiple weak learners and combine their predictions to form one strong rule. These weak rules are generated by applying base Machine Learning algorithms on different distributions of the data set. These algorithms generate weak rules for each iteration. After multiple iterations, the weak learners are combined to form a strong learner that will predict a more accurate outcome. Note that a weak learner is one that is slightly better than random guessing. For example, a decision tree whose predictions are slightly better than 50%.

Gradient Boosting works by sequentially adding predictors to an ensemble, each one correcting its predecessor. However, instead of tweaking the instance weights at every iteration like AdaBoost does, this method tries to fit the new predictor to the residual errors made by the previous predictor.

Here’s how the algorithm works:

Step 1: The base algorithm reads the data and assigns equal weight to each sample observation.

Step 2: False predictions made by the base learner are identified. In the next iteration, these incorrectly predicted samples are passed to the next base learner with a higher weight on them.

Step 3: Repeat step 2 until the algorithm can correctly classify the output.

Therefore, the main aim of Boosting is to focus more on misclassified predictions.
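To make the residual-fitting idea described above concrete, here is a minimal sketch of gradient boosting for squared loss on toy data. This is illustrative only (plain scikit-learn trees and made-up data), not the exact algorithm any of the libraries used later implement:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(42)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.1, size=200)

learning_rate = 0.1
prediction = np.full(len(y), y.mean())   # start from a constant model
trees = []
for _ in range(100):
    residuals = y - prediction           # errors of the current ensemble
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)  # weak learner fits the residuals
    prediction += learning_rate * tree.predict(X)                # shrunken correction step
    trees.append(tree)

print('Train RMSE:', np.sqrt(np.mean((y - prediction) ** 2)))

Each tree only has to fix what the ensemble so far gets wrong, and the learning rate shrinks each correction so no single tree dominates.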

[Image: two-panel illustration of boosting iterations, left: the first tree, right: a second tree correcting the first (source link in the original post)]

These techniques are used to build ensemble models in an iterative way. On the first iteration, the algorithm learns the first tree to reduce the training error, as shown in the left-hand image above. The right-hand image shows the second iteration, in which the algorithm learns one more tree to reduce the error made by the first tree. The algorithm repeats this procedure until it builds a decent-quality model.


The common approach for classification uses Logloss, while regression optimizes root mean square error. Ranking tasks commonly implement some variation of LambdaRank.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn import preprocessing, model_selection, metrics
from sklearn.model_selection import train_test_split
import lightgbm as lgb
import xgboost as xgb
from catboost import CatBoostRegressor

from IPython.display import display # Allows the use of display() for DataFrames

import warnings
warnings.filterwarnings('ignore')
train_df = pd.read_csv('../input/santander-value-prediction-challenge/train.csv')

test_df = pd.read_csv('../input/santander-value-prediction-challenge/test.csv')

train_df.head()
test_df.head()
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4459 entries, 0 to 4458
Columns: 4993 entries, ID to 9fc776466
dtypes: float64(1845), int64(3147), object(1)
memory usage: 169.9+ MB

Initial Observations looking at the above data

  • Column names do not mean anything, as they are all anonymized
  • The dataframe is full of zero values.
  • The dataset is a sparse tabular one (sparse data is discussed in detail below)

Target Variable:

First, let's do a scatter plot of the target variable to check for visible outliers.

print('Train rows and columns: ', train_df.shape)

# Note: the test file is huge (49,342 rows, ~1 GB), so it takes longer to load and run each time
print('Test rows and columns: ', test_df.shape)
Train rows and columns: (4459, 4993)
Test rows and columns: (100, 4992)

plt.figure(figsize=(8,6))
plt.scatter(range(train_df.shape[0]), np.sort(train_df['target'].values))
plt.xlabel('index', fontsize=12)
plt.ylabel('Target', fontsize=12)
plt.title('Distribution of Target', fontsize=14)
plt.show()

Checking for missing / null values in data

print("All Features in Train data with NaN Values =", str(train_df.columns[train_df.isnull().sum() != 0].size) )
# print("All Features in Test data with NaN Values =", str(test_df.columns[train_df.isnull().sum() != 0].size) )
All Features in Train data with NaN Values = 0

Remove constant columns from data

const_columns_to_remove = []
for col in train_df.columns:
    if col != 'ID' and col != 'target':
        if train_df[col].std() == 0:
            const_columns_to_remove.append(col)

# Now remove that array of const columns from the data
train_df.drop(const_columns_to_remove, axis=1, inplace=True)
test_df.drop(const_columns_to_remove, axis=1, inplace=True)

# Print to see the reduction of columns
print('train_df rows and columns after removing constant columns: ', train_df.shape)

print('Following `{}` Constant Column\n are removed'.format(len(const_columns_to_remove)))
print(const_columns_to_remove)
train_df rows and columns after removing constant columns: (4459, 4737)
Following `256` Constant Column
are removed
['d5308d8bc', 'c330f1a67', 'eeac16933', '7df8788e8', '5b91580ee', '6f29fbbc7', '46dafc868', 'ae41a98b6', 'f416800e9', '6d07828ca', '7ac332a1d', '70ee7950a', '833b35a7c', '2f9969eab', '8b1372217', '68322788b', '2288ac1a6', 'dc7f76962', '467044c26', '39ebfbfd9', '9a5ff8c23', 'f6fac27c8', '664e2800e', 'ae28689a2', 'd87dcac58', '4065efbb6', 'f944d9d43', 'c2c4491d5', 'a4346e2e2', '1af366d4f', 'cfff5b7c8', 'da215e99e', '5acd26139', '9be9c6cef', '1210d0271', '21b0a54cb', 'da35e792b', '754c502dd', '0b346adbd', '0f196b049', 'b603ed95d', '2a50e001c', '1e81432e7', '10350ea43', '3c7c7e24c', '7585fce2a', '64d036163', 'f25d9935c', 'd98484125', '95c85e227', '9a5273600', '746cdb817', '6377a6293', '7d944fb0c', '87eb21c50', '5ea313a8c', '0987a65a1', '2fb7c2443', 'f5dde409b', '1ae50d4c3', '2b21cd7d8', '0db8a9272', '804d8b55b', '76f135fa6', '7d7182143', 'f88e61ae6', '378ed28e0', 'ca4ba131e', '1352ddae5', '2b601ad67', '6e42ff7c7', '22196a84c', '0e410eb3d', '992e6d1d3', '90a742107', '08b9ec4ae', 'd95203ded', '58ad51def', '9f69ae59f', '863de8a31', 'be10df47c', 'f006d9618', 'a7e39d23d', '5ed0abe85', '6c578fe94', '7fa4fcee9', '5e0571f07', 'fd5659511', 'e06b9f40f', 'c506599c8', '99de8c2dc', 'b05f4b229', '5e0834175', 'eb1cc0d9c', 'b281a62b9', '00fcf67e4', 'e37b65992', '2308e2b29', 'c342e8709', '708471ebf', 'f614aac15', '15ecf7b68', '3bfe540f1', '7a0d98f3c', 'e642315a5', 'c16d456a7', '0c9b5bcfa', 'b778ab129', '2ace87cdd', '697a566f0', '97b1f84fc', '34eff114b', '5281333d7', 'c89f3ba7e', 'cd6d3c7e6', 'fc7c8f2e8', 'abbbf9f82', '24a233e8f', '8e26b560e', 'a28ac1049', '504502ce1', 'd9a8615f3', '4efd6d283', '34cc56e83', '93e98252a', '2b6cef19e', 'c7f70a49b', '0d29ab7eb', 'e4a0d39b7', 'a4d1a8409', 'bc694fc8f', '3a36fc3a2', '4ffba44d3', '9bfdec4bc', '66a866d2f', 'f941e9df7', 'e7af4dbf3', 'dc9a54a3e', '748168a04', 'bba8ce4bb', 'ff6f62aa4', 'b06fe66ba', 'ae87ebc42', 'f26589e57', '963bb53b1', 'a531a4bf0', '9fc79985d', '9350d55c1', 'de06e884c', 'fc10bdf18', 'e0907e883', 'c586d79a1', 'e15e1513d', 'a06067897', '643e42fcb', '217cd3838', '047ebc242', '9b6ce40cf', '3b2c972b3', '17a7bf25a', 'c9028d46b', '9e0473c91', '6b041d374', '783c50218', '19122191d', 'ce573744f', '1c4ea481e', 'fbd6e0a0b', '69831c049', 'b87e3036b', '54ba515ee', 'a09ba0b15', '90f77ec55', 'fb02ef0ea', '3b0cccd29', 'fe9ed417c', '589e8bd6f', '17b5a03fd', '80e16b49a', 'a3d5c2c2a', '1bd3a4e92', '611d81daa', '3d7780b1c', '113fd0206', '5e5894826', 'cb36204f9', 'bc4e3d600', 'c66e2deb0', 'c25851298', 'a7f6de992', '3f93a3272', 'c1b95c2ec', '6bda21fee', '4a64e56e7', '943743753', '20854f8bf', 'ac2e428a9', '5ee7de0be', '316423a21', '2e52b0c6a', '8bdf6bc7e', '8f523faf2', '4758340d5', '8411096ec', '9678b95b7', 'a185e35cc', 'fa980a778', 'c8d90f7d7', '080540c81', '32591c8b4', '5779da33c', 'bb425b41e', '01599af81', '1654ab770', 'd334a588e', 'b4353599c', '51b53eaec', '2cc0fbc52', '45ffef194', 'c15ac04ee', '5b055c8ea', 'd0466eb58', 'a80633823', 'a117a5409', '7ddac276f', '8c32df8b3', 'e5649663e', '6c16efbb8', '9118fd5ca', 'ca8d565f1', '16a5bb8d2', 'fd6347461', 'f5179fb9c', '97428b646', 'f684b0a96', 'e4b2caa9f', '2c2d9f267', '96eb14eaf', 'cb2cb460c', '86f843927', 'ecd16fc60', '801c6dc8e', 'f859a25b8', 'ae846f332', '2252c7403', 'fb9e07326', 'd196ca1fd', 'a8e562e8e', 'eb6bb7ce1', '5beff147e', '52b347cdc', '4600aadcf', '6fa0b9dab', '43d70cc4d', '408021ef8', 'e29d22b59']

Remove Duplicate Columns

I will be using the duplicated() function of pandas — here’s how it works:

Suppose the columns of the data frame are ['alpha','beta','alpha']

df.columns.duplicated() returns a boolean array: a True or a False for each column. If the value is False, the column name is unique up to that point; if it is True, the column name was seen earlier. For this example, the returned value would be [False, False, True].

Pandas allows indexing with boolean values, whereby it selects only the True positions. Since we want to keep the unduplicated columns, we need the above boolean array to be flipped (i.e. [True, True, False] = ~[False, False, True])

Finally, df.loc[:,[True,True,False]] selects only the non-duplicated columns using the aforementioned indexing capability.

Note: the above only checks column names, not column values.

train_df = train_df.loc[:,~train_df.columns.duplicated()]
print('Train rows and columns after removing duplicate columns: ', train_df.shape)
Train rows and columns after removing duplicate columns: (4459, 4737)

(The shape is unchanged, so this dataset had no duplicated column names.)

Handling Sparse data

What is Sparse data

As an example, let’s say that we are collecting data from a device which has 12 sensors. And you have collected data for 10 days.

The data you have collected is as follows:

[Image: a 10-day by 12-sensor table of readings, with most entries zero]

The above is an example of sparse data because most of the sensor outputs are zero. That means those sensors are functioning properly, but the actual reading is zero. Although this matrix has high-dimensional data (12 axes), it can be said to contain little information.

So basically, sparse data means that there are many gaps present in the data being recorded. For example, in the case of the sensor mentioned above, the sensor may send a signal only when the state changes, like when there is a movement of the door in a room. This data will be obtained intermittently because the door is not always moving. Hence, this is sparse data.

First let's have a look at our train_df data again to see how much sparse data there is. As we can see, there are plenty of '0' values.

train_df.head()
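Beyond eyeballing the head, we can quantify the sparsity directly. A quick check (a small sketch of my own; it assumes train_df still has its ID and target columns and counts exact zeros across the feature columns):

feature_part = train_df.drop(['ID', 'target'], axis=1)
zero_fraction = (feature_part == 0).sum().sum() / feature_part.size
print('Fraction of zero entries: {:.2%}'.format(zero_fraction))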

Check and handle total memory of data

The pandas get_dummies function converts categorical variables into indicator variables.

def print_memory_usage_of_df(df):
    bytes_per_mb = 0.000001
    memory_usage = round(df.memory_usage().sum() * bytes_per_mb, 3)
    print('Memory usage is ', str(memory_usage) + " MB")

print_memory_usage_of_df(train_df)
print(train_df.shape)
Memory usage is 168.978 MB
(4459, 4737)
dummy_encoded_train_df = pd.get_dummies(train_df)
dummy_encoded_train_df.shape
(4459, 9195)

print_memory_usage_of_df(dummy_encoded_train_df)
Memory usage is 188.825 MB

We see that the memory usage of the dummy_encoded_train_df data frame is larger compared to the original, because the number of columns in the data frame has increased.

So let's apply sparse=True and see if it reduces the memory usage to some extent.

The sparse parameter defaults to False. If True, the encoded columns are returned as SparseArray. By setting sparse=True we create a sparse data frame directly.

dummy_encoded_sparse_train_df = pd.get_dummies(train_df, sparse=True)
dummy_encoded_sparse_train_df.shape
(4459, 9195)

print_memory_usage_of_df(dummy_encoded_sparse_train_df)
Memory usage is 168.965 MB

But it looks like in this case the reduction in memory size was not huge. So let's try another alternative.

Pandas Sparse Structures

Pandas provides data structures for efficient storage of sparse data. In these structures, zero values (or any other specified value) are not actually stored in the array. Rather, you can view these objects as being “compressed”: any data matching a specific value (NaN / missing by default, though any value can be chosen, including 0) is omitted rather than stored.

Storing only the non-zero values and their positions is a common technique in storing sparse data sets.

This hugely reduces the memory usage of our data set and “compress” the data frame.
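As a minimal illustration of the idea (my own sketch, using the pd.SparseDtype API; the pd.SparseArray call in the function below is the older spelling of the same machinery):

import numpy as np
import pandas as pd

# A mostly-zero column stored densely vs. sparsely
dense = pd.Series(np.zeros(100_000, dtype='uint8'))
dense.iloc[[10, 500, 99_000]] = 1

sparse = dense.astype(pd.SparseDtype('uint8', fill_value=0))

print(dense.memory_usage(deep=True))    # every zero is stored
print(sparse.memory_usage(deep=True))   # only the 3 non-zero values and their positions are stored
print(sparse.sparse.density)            # fraction of stored values, here 3e-05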

In our example, we will convert the one-hot encoded columns into SparseArrays, which are 1-d arrays where only non-zero values are stored.

def convert_df_to_sparse_array(df, exclude_columns=[]):
    df = df.copy()
    exclude_columns = set(exclude_columns)

    for (column_name, column_data) in df.iteritems():
        if column_name in exclude_columns:
            continue
        df[column_name] = pd.SparseArray(column_data.values, dtype='uint8')

    return df

# Now convert our earlier dummy_encoded_train_df with above function and check memory_size

# train_data_post_conversion_to_sparse_array = convert_df_to_sparse_array(dummy_encoded_train_df)
# print('Sparse Array Train_DF rows and columns: ', train_data_post_conversion_to_sparse_array.shape)
# print_memory_usage_of_df(train_data_post_conversion_to_sparse_array)

# Commenting the above out - for running the Notebook faster during my development
# Because df.iteritems() will take a huge time to process the data - see warning below

If you uncomment and run the above, you will see that the memory usage is substantially reduced.

A warning on using df.iteritems()

df.iteritems() iterates over columns, not rows. Generally, iterating over dataframes is an anti-pattern and something we should avoid, unless you want to get used to a lot of waiting.

Sparse Data Removal (following a simpler, plain-vanilla technique)

For this notebook, I will go with the easier approach to handling sparse data, which is simply to drop columns that carry no information (fewer than two unique values) from the dataframe, as in the code below. I will do this for the sake of running this notebook faster for now.

def drop_sparse_from_train_test(train, test):
    column_list_to_drop_data_from = [x for x in train.columns if not x in ['ID', 'target']]
    for f in column_list_to_drop_data_from:
        if len(np.unique(train[f])) < 2:
            train.drop(f, axis=1, inplace=True)
            test.drop(f, axis=1, inplace=True)
    return train, test

train_df, test_df = drop_sparse_from_train_test(train_df, test_df)

Split data into Train and Test for Model Training

X_train = train_df.drop(['ID', 'target'], axis=1)

y_train = np.log1p(train_df['target'].values)
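# Note on the log1p above: this competition is scored on RMSLE, so training
# with an RMSE objective on log1p(target) matches the metric; predictions
# are mapped back with np.expm1 later.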

X_test_original = test_df.drop('ID', axis=1)

X_train_split, X_validation, y_train_split, y_validation = train_test_split(X_train, y_train, test_size=0.2, random_state=42)

LightGBM Model Training

Fundamentals of LightGBM Model

It is a gradient boosting framework that makes use of tree-based learning algorithms. It is considered to be a fast-processing algorithm.

While in other algorithms trees grow horizontally (level-wise), the LightGBM algorithm grows trees vertically (leaf-wise). LightGBM chooses the leaf with the maximum loss reduction to grow; when growing the same leaf, it can reduce more loss than a level-wise algorithm.

[Image: leaf-wise vs. level-wise tree growth]

Source of Image

LightGBM is prefixed as "Light" because of its high speed. LightGBM can handle large datasets and takes less memory to run.

Another reason why LightGBM is so popular is that it focuses on the accuracy of results. LGBM also supports GPU learning, and thus data scientists widely use LGBM for data science application development.

Leaf growth technique in LightGBM

LightGBM uses leaf-wise (best-first) tree growth. It chooses to grow the leaf that minimizes the loss, allowing the growth of an imbalanced tree. Because it grows leaf-wise rather than level-wise, over-fitting can happen when the data is small. In these cases, it is important to control the tree depth.

When to use LightGBM ?

LightGBM is not preferred for small datasets, as it can easily overfit small data due to its sensitivity. Hence, it is generally advised for data with more than 10,000 rows, though there is no fixed threshold that decides when to use LightGBM.

What are LightGBM Parameters?

While LightGBM has more than 100 parameters, all covered in the LightGBM documentation, let's check out the most important ones.

Control Parameters

Max_depth: It gives the depth of the tree and also controls the overfitting of the model. If you feel your model is overfitting, lower the max depth.

Min_data_in_leaf: The minimum number of records in a leaf; also used for controlling overfitting of the model.

Feature_fraction: It decides the fraction of features randomly chosen in every iteration for building trees. If it is 0.7, then 70% of the features will be used.

Bagging_fraction: The fraction of data that will be used in every iteration. Often used to increase training speed and avoid overfitting.

Early_stopping_round: If the metric on the validation data shows no improvement in the last early_stopping_round rounds, training stops. This cuts off unproductive iterations.

Lambda: It specifies regularization. Typical values range from 0 to 1.

Min_gain_to_split: The minimum gain required to make a split; used to control the number of splits in the tree.

Core Parameters

Task: It specifies the task to be performed on the data, either training on the data or prediction on the data.

Application: This parameter specifies whether to do regression or classification. LightGBM's default application is regression.

Binary: It is used for binary classification.

Multiclass: It is used for multiclass classification problems.

Regression: It is used for doing regression.

Boosting: It specifies the algorithm type (the default is gbdt, traditional gradient boosted decision trees).

rf : Used for Random Forest.

Goss: Gradient-based One Side Sampling.

Num_boost_round: The number of boosting iterations.

Learning_rate: The role of the learning rate is to scale the magnitude of the changes in the approximation contributed by each tree's output. It determines the contribution of each tree to the final outcome and controls how quickly the algorithm proceeds down the gradient descent (learns); typical values are between 0.001 and 0.3. Smaller values make the model robust to the specific characteristics of each individual tree, thus allowing it to generalize well. Smaller values also make it easier to stop prior to overfitting; however, they increase the risk of not reaching the optimum with a fixed number of trees and are more computationally demanding. This hyperparameter is also called shrinkage. Generally, the smaller this value, the more accurate the model can be, but it will also require more trees in the sequence.

Num_leaves: It gives the maximum number of leaves in a full tree; default value: 31.

Metric Parameter

It takes care of the loss while building the model. Some of them are stated below for classification as well as regression.

Mae: Mean absolute error.

Mse: Mean squared error.

Binary_logloss: Binary Classification loss.

Multi_logloss: Multi Classification loss.

def light_gbm_model_run(train_x, train_y, validation_x, validation_y, test_x):
    params = {
        "objective": "regression",
        "metric": "rmse",
        "num_leaves": 100,
        "learning_rate": 0.001,
        "bagging_fraction": 0.6,
        "feature_fraction": 0.6,
        "bagging_freq": 6,    # the correct parameter name is bagging_freq, not bagging_frequency
        "bagging_seed": 42,
        "verbosity": -1,
        "seed": 42
    }
    # Given it's a regression case, I am using RMSE as the metric.

    lg_train = lgb.Dataset(train_x, label=train_y)
    lg_validation = lgb.Dataset(validation_x, label=validation_y)
    evals_result_lgbm = {}

    model_light_gbm = lgb.train(params, lg_train, 5000,
                                valid_sets=[lg_train, lg_validation],
                                early_stopping_rounds=100,
                                verbose_eval=150,
                                evals_result=evals_result_lgbm)

    # Invert the log1p transform applied to the target
    pred_test_light_gbm = np.expm1(model_light_gbm.predict(test_x, num_iteration=model_light_gbm.best_iteration))

    return pred_test_light_gbm, model_light_gbm, evals_result_lgbm
# Training and output of LightGBM Model
predictions_test_y_light_gbm, model_lgbm, evals_result = light_gbm_model_run(X_train_split, y_train_split, X_validation, y_validation, X_test_original)
print('Output of LightGBM Model training..')
Training until validation scores don't improve for 100 rounds
[150] training's rmse: 1.66447 valid_1's rmse: 1.63996
[300] training's rmse: 1.5765 valid_1's rmse: 1.5927
[450] training's rmse: 1.49849 valid_1's rmse: 1.55466
[600] training's rmse: 1.42919 valid_1's rmse: 1.52339
[750] training's rmse: 1.36631 valid_1's rmse: 1.49837
[900] training's rmse: 1.30931 valid_1's rmse: 1.47791
[1050] training's rmse: 1.25734 valid_1's rmse: 1.46143
[1200] training's rmse: 1.20984 valid_1's rmse: 1.44818
[1350] training's rmse: 1.16678 valid_1's rmse: 1.43796
[1500] training's rmse: 1.12698 valid_1's rmse: 1.42969
[1650] training's rmse: 1.09049 valid_1's rmse: 1.42292
[1800] training's rmse: 1.05661 valid_1's rmse: 1.41849
[1950] training's rmse: 1.02528 valid_1's rmse: 1.41488
[2100] training's rmse: 0.995869 valid_1's rmse: 1.41222
[2250] training's rmse: 0.968211 valid_1's rmse: 1.40996
[2400] training's rmse: 0.941985 valid_1's rmse: 1.40807
[2550] training's rmse: 0.917269 valid_1's rmse: 1.40669
[2700] training's rmse: 0.893978 valid_1's rmse: 1.40569
[2850] training's rmse: 0.871822 valid_1's rmse: 1.40492
[3000] training's rmse: 0.850995 valid_1's rmse: 1.40427
[3150] training's rmse: 0.831253 valid_1's rmse: 1.40393
[3300] training's rmse: 0.812591 valid_1's rmse: 1.40376
Early stopping, best iteration is:
[3289] training's rmse: 0.813895 valid_1's rmse: 1.40374
Output of LightGBM Model training..

Hyper-Parameter Tuning in LightGBM

Parameter tuning is an important part of the workflow, usually done by data scientists to achieve good accuracy, fast results, and to deal with overfitting. Let us quickly see some of the parameter tuning you can do for better results.

num_leaves: This parameter is responsible for the complexity of the model. I normally start by trying values in the range [10, 100]. But if you have a solid heuristic to choose tree depth, you can always use it and set num_leaves to 2^(tree_depth) - 1.

The LightGBM documentation says in this respect: "This is the main parameter to control the complexity of the tree model. Theoretically, we can set num_leaves = 2^(max_depth) to obtain the same number of leaves as a depth-wise tree. However, this simple conversion is not good in practice. The reason is that a leaf-wise tree is typically much deeper than a depth-wise tree for a fixed number of leaves. Unconstrained depth can induce over-fitting. Thus, when trying to tune num_leaves, we should let it be smaller than 2^(max_depth). For example, when max_depth=7 the depth-wise tree can get good accuracy, but setting num_leaves to 127 may cause over-fitting, and setting it to 70 or 80 may get better accuracy than depth-wise."

Min_data_in_leaf: Assigning a bigger value to this parameter can result in underfitting of the model. Giving it a value of 100 or 1000 is sufficient for a large dataset.

Max_depth: Controls the depth of the individual trees. Typical values range from a depth of 3 to 8, but it is not uncommon to see a tree depth of 1. Smaller-depth trees are computationally efficient (but require more trees); however, higher-depth trees allow the algorithm to capture unique interactions but also increase the risk of over-fitting. Larger training data sets are more tolerant of deeper trees.

num_iterations: Specifies the number of boosting iterations (trees to build). The more trees you build, the more accurate your model can be, at the cost of longer training time and a higher chance of over-fitting. So typically, start with a lower number of trees to build a baseline, and increase it later when you want to squeeze the last % out of your model.

It is recommended to use a smaller learning_rate with a larger num_iterations. Also, we should use early_stopping_rounds if we go for higher num_iterations, to stop training when it is not learning anything useful.

early_stopping_rounds — "early stopping" refers to stopping the training process if the model's performance on a given validation set does not improve for several consecutive iterations. This parameter stops training if the validation metric has not improved within the last early_stopping_rounds iterations. It should be defined in a pair with num_iterations. If we set it too large, we increase the chance of over-fitting. The rule of thumb is to set it at 10% of your num_iterations.
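As a sketch of how such tuning might be automated (illustrative only; it assumes lg_train and lg_validation lgb.Dataset objects built as inside the training function above, and the grid values are arbitrary choices):

best = None
for num_leaves in [31, 70, 100]:
    for lr in [0.001, 0.004]:
        params = {"objective": "regression", "metric": "rmse",
                  "num_leaves": num_leaves, "learning_rate": lr, "verbosity": -1}
        booster = lgb.train(params, lg_train, 5000,
                            valid_sets=[lg_validation],
                            early_stopping_rounds=100,
                            verbose_eval=False)
        rmse = booster.best_score['valid_0']['rmse']
        if best is None or rmse < best[0]:
            best = (rmse, num_leaves, lr)
print('Best validation RMSE {:.5f} with num_leaves={}, learning_rate={}'.format(*best))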

So for my above implementation of LightGBM, these initial settings of two of the LightGBM parameters got me a score of 1.47953 (on the Kaggle public leaderboard):

"num_leaves" : 40,
"learning_rate" : 0.004,

And now if I only tune these parameters as below

"num_leaves" : 100,
"learning_rate" : 0.001,

I got my score very slightly improved, to 1.4714 (on the Kaggle public leaderboard).

I also tried the below one (keeping ‘num_leaves’ at 70 to avoid over-fitting)

"num_leaves" : 70,
"learning_rate" : 0.001,

With this, I got a score of 1.47234 (on the Kaggle public leaderboard).

Features Importance in LightGBM

gain_light_gbm = model_lgbm.feature_importance('gain')
feature_imp_light_gbm = pd.DataFrame({
    'feature': model_lgbm.feature_name(),
    'split': model_lgbm.feature_importance('split'),
    'gain': 100 * gain_light_gbm / gain_light_gbm.sum()
}).sort_values('gain', ascending=False)
print(feature_imp_light_gbm[:50])
feature split gain
4135 f190486d6 6623 6.950093
2378 58e2e02e6 6012 4.948486
3470 eeb9cd3aa 5474 4.035842
2617 9fd594eec 4157 3.262016
4025 15ace8c9f 5198 2.999663
8 20aa07010 3628 1.903282
3576 58232a6fb 3458 1.472419
834 6eef030c1 3798 1.292199
1459 b43a7cfd5 4241 1.255624
2690 fb0f5dbfe 4482 1.118876
3666 491b9ee45 2625 1.040685
1484 024c577b9 2931 1.027777
4348 1702b5bf0 2993 0.934598
4190 f74e8f13d 3636 0.931987
566 66ace2992 3053 0.899695
4513 c47340d97 3209 0.888001
3727 d6bb78916 3296 0.863508
2082 58e056e12 3651 0.859513
863 fc99f9426 2474 0.778488
3816 adb64ff71 2664 0.749633
4458 190db8488 3040 0.721582
4033 5c6487af1 2454 0.703383
3224 ced6a7e91 1666 0.667195
3796 ed8ff54b5 737 0.643995
2137 241f0f867 2467 0.641655
537 26fc93eb7 2559 0.620171
3872 2288333b4 1182 0.614189
3891 50e4f96cf 1147 0.594101
2619 fb387ea33 998 0.583353
828 6786ea46d 806 0.559286
1380 6cf7866c1 985 0.547168
4346 e176a204a 2342 0.544077
34 87ffda550 1353 0.531567
2214 1931ccfdd 2020 0.517504
853 bc70cbc26 973 0.505660
4321 c5a231d81 2253 0.502023
3010 703885424 1944 0.488684
3784 70feb1494 1879 0.486723
3474 324921c7b 2431 0.477780
213 186b87c05 345 0.476618
2934 91f701ba2 1983 0.439053
3988 45f6d00da 1365 0.421545
624 0c9462c08 1232 0.420413
4022 62e59a501 1955 0.418382
545 0572565c2 1652 0.395456
1750 5f341a818 877 0.389509
1850 5a1589f1a 1570 0.379379
1712 2ec5b290f 2078 0.377675
645 6619d81fc 1627 0.365498
1360 1db387535 1774 0.365378

XGBoost Model Training

Note on XGBoost

Below we will be using XGBoost, an advanced implementation of the gradient boosting method; the name literally means eXtreme Gradient Boosting. XGBoost dominates structured or tabular datasets on classification and regression predictive modeling problems. The XGBoost library implements the gradient boosting decision tree algorithm.

Different from the traditional gradient descent technique, gradient boosting performs the descent in function space: each additional model is fitted to the gradient of the loss of the current ensemble, which reduces the output error at each iteration.

In practice what we do in order to build the learner is to:

  • Start with single root (contains all the training examples)
  • Iterate over all features and values per feature, and evaluate each possible split loss reduction:
  • gain = loss(parent instances) - (loss(left branch) + loss(right branch))
  • The gain for the best split must be positive (and greater than the min_split_gain parameter); otherwise we must stop growing the branch.

Leaf growth

XGBoost splits up to the specified max_depth hyperparameter and then starts pruning the tree backwards, removing splits beyond which there is no positive gain. It uses this approach because a split with no loss reduction may sometimes be followed by a split with loss reduction. XGBoost can also perform leaf-wise tree growth (like LightGBM).

Normally it is impossible to enumerate all the possible tree structures q. A greedy algorithm that starts from a single leaf and iteratively adds branches to the tree is used instead. Assume that I_L and I_R are the instance sets of the left and right nodes after the split. Then the loss reduction after the split is given by:

L_split = 1/2 [ G_L² / (H_L + λ) + G_R² / (H_R + λ) − (G_L + G_R)² / (H_L + H_R + λ) ] − γ

where G_L = Σ_{i∈I_L} g_i and H_L = Σ_{i∈I_L} h_i are the sums of the first- and second-order gradients of the loss over the left node's instances (and similarly for the right), λ is the L2 regularization strength, and γ is the per-leaf complexity penalty.

Differences in LightGBM & XGBoost

LightGBM uses a novel technique of Gradient-based One-Side Sampling (GOSS) to filter out the data instances for finding a split value while XGBoost uses pre-sorted algorithm & Histogram-based algorithm for computing the best split. Here instances mean observations/samples.

Let's see how pre-sorted splitting works:

  • For each node, enumerate over all features
  • For each feature, sort the instances by feature value
  • Use a linear scan to decide the best split along that feature, based on information gain
  • Take the best split solution along all the features

In simple terms, the histogram-based algorithm splits all the data points for a feature into discrete bins and uses these bins to find the split value of the histogram. While it is more efficient in training speed than the pre-sorted algorithm, which enumerates all possible split points on the pre-sorted feature values, it is still behind GOSS in terms of speed.
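A toy sketch of the histogram idea (my own illustration, not library code): bucket a continuous feature into a fixed number of bins, so that only bin boundaries need to be considered as candidate splits:

import numpy as np

rng = np.random.RandomState(0)
feature = rng.normal(size=10_000)

n_bins = 255
# bin edges at (approximately) equal-frequency quantiles
edges = np.quantile(feature, np.linspace(0, 1, n_bins + 1)[1:-1])
binned = np.digitize(feature, edges)     # integer bin index per sample

print(len(np.unique(feature)), 'raw candidate split points')
print(len(np.unique(binned)) - 1, 'candidate split points after binning')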

XGBoost Model Parameters

I am explaining only those Parameters that I will be implementing below in my function. For an exhaustive explanation of all of them see here

objective [default=reg:linear]

This defines the loss function to be minimized. Mostly used values are:

  • binary:logistic –logistic regression for binary classification, returns predicted probability (not class)
  • multi:softmax –multiclass classification using the softmax objective, returns predicted class (not probabilities) you also need to set an additional num_class (number of classes) parameter defining the number of unique classes
  • multi:softprob –same as softmax, but returns predicted probability of each data point belonging to each class.

eval_metric [ default according to objective ]

The metric to be used for validation data. The default values are rmse for regression and error for classification. Typical values are:

  • rmse — root mean square error
  • mae — mean absolute error
  • logloss — negative log-likelihood
  • error — Binary classification error rate (0.5 threshold)
  • merror — Multiclass classification error rate
  • mlogloss — Multiclass logloss
  • auc: Area under the curve

eta [default=0.3]

  • Analogous to learning rate in GBM.
  • Makes the model more robust by shrinking the weights on each step.
  • Typical final values to be used: 0.01–0.2

colsample_bytree: We can take a random sample of the features (or columns) to use prior to creating each decision tree in the boosted model; that is, tuning column sub-sampling in XGBoost by tree. This is controlled by the colsample_bytree parameter. The default value is 1.0, meaning that all columns are used in each decision tree. A fraction (e.g. 0.6) means that fraction of columns will be subsampled. We can evaluate values for colsample_bytree between 0.1 and 1.0, incrementing by 0.1.

A Note on regularization in XGBoost

XGBoost adds built-in regularization to achieve accuracy gains beyond gradient boosting. Regularization is the process of adding information to reduce variance and prevent overfitting.

Although data may be regularized through hyperparameter fine-tuning, regularized algorithms may also be attempted. For example, Ridge and Lasso are regularized machine learning alternatives to LinearRegression.

XGBoost includes regularization as part of the learning objective, as contrasted with gradient boosting and random forests. The regularized parameters penalize complexity and smooth out the final weights to prevent overfitting. XGBoost is a regularized version of gradient boosting.

Mathematically, XGBoost’s learning objective may be defined as follows:

obj(θ) = l(θ) + Ω (θ)

Here, l(θ) is the loss function, which is the Mean Squared Error (MSE) for regression, or the log loss for classification, and Ω (θ) is the regularization function, a penalty term to prevent over-fitting. Including a regularization term as part of the objective function distinguishes XGBoost from most tree ensembles.

The learning objective for the t-th boosted tree can now be rewritten as follows:

obj^(t) = Σ_{i=1}^{n} l( y_i , ŷ_i^(t−1) + f_t(x_i) ) + Ω(f_t)

where f_t is the tree added at step t.

reg_alpha and reg_lambda: First note that the regularization term is defined as

Ω(f) = γT + (1/2) λ Σ_{j=1}^{T} w_j² + α Σ_{j=1}^{T} |w_j|

where T is the number of leaves and w_j are the leaf weights. The above is how the regularized objective function looks if you want to allow for the inclusion of an L1 and an L2 parameter in the same model.

reg_alpha and reg_lambda control the L1 and L2 regularization terms, which in this case limit how extreme the weights at the leaves can become. Higher values of alpha mean more L1 regularization. See the documentation here.

Since L1 regularization in GBDTs is applied to leaf scores rather than directly to features as in logistic regression, it actually serves to reduce the depth of trees. This in turn will tend to reduce the impact of less-predictive features. We might think of L1 regularization as more aggressive against less-predictive features than L2 regularization.

These two regularization terms have different effects on the weights; L2 regularization (controlled by the lambda term) encourages the weights to be small, whereas L1 regularization (controlled by the alpha term) encourages sparsity — so it encourages weights to go to 0. This is helpful in models such as logistic regression, where you want some feature selection, but in decision trees we’ve already selected our features, so zeroing their weights isn’t super helpful. For this reason, I found setting a high lambda value and a low (or 0) alpha value to be the most effective when regularizing.
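In params-dict form, the recommendation above would look something like the sketch below (the values are illustrative only; reg_lambda and reg_alpha are the documented aliases of XGBoost's native lambda and alpha parameters):

# Heavier L2 (lambda), little or no L1 (alpha), per the note above
params_regularized = {
    'objective': 'reg:squarederror',
    'eval_metric': 'rmse',
    'reg_lambda': 10.0,   # L2 penalty on leaf weights
    'reg_alpha': 0.0,     # L1 penalty on leaf weights
}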

From this Paper

You can find the mathematical underpinnings of the XGBoost model by Tianqi Chen et al. there. A couple of mathematical deviations of this model from the classic Friedman GBM are:

  • Regularized (penalized) parameters (and remember that parameters in the boosting are the function, trees, or linear models): L1 and L2 are available.
def xgb_model_run(train_x, train_y, validation_x, validation_y, test_x):
    params = {
        'objective': 'reg:squarederror',
        'eval_metric': 'rmse',
        'eta': 0.001,
        'max_depth': 10,
        'subsample': 0.6,
        'colsample_bytree': 0.6,
        'alpha': 0.001,
        'random_state': 42
    }

    training_data = xgb.DMatrix(train_x, train_y)
    validation_data = xgb.DMatrix(validation_x, validation_y)

    watchlist = [(training_data, 'train'), (validation_data, 'valid')]

    # Only 50 boosting rounds here, which keeps the notebook fast but stops
    # training long before early stopping can kick in
    model_xgb = xgb.train(params, training_data, 50, watchlist,
                          maximize=False, early_stopping_rounds=100, verbose_eval=100)

    data_test = xgb.DMatrix(test_x)
    predict_test_xgb = np.expm1(model_xgb.predict(data_test, ntree_limit=model_xgb.best_ntree_limit))

    return predict_test_xgb, model_xgb

Training XGB

predictions_test_y_xgb, model_xgb = xgb_model_run(X_train_split, y_train_split, X_validation, y_validation, X_test_original)
print('Completion of XGB Training!!')
[0] train-rmse:14.08765 valid-rmse:14.07678
Multiple eval metrics have been passed: 'valid-rmse' will be used for early stopping.

Will train until valid-rmse hasn't improved in 100 rounds.
[49] train-rmse:13.42470 valid-rmse:13.41331
Completion of XGB Training!!

Hyper-Parameter Tuning in XGBoost

As an example, for the above model, we could fine-tune five hyperparameters of our XGBoost function. The ranges of possible values that we could consider might be as below:

{"learning_rate"    : [0.05, 0.10, 0.15, 0.20, 0.25, 0.30 ] ,
"max_depth" : [ 3, 4, 5, 6, 8, 10, 12, 15],
"min_child_weight" : [ 1, 3, 5, 7 ],
"gamma" : [ 0.0, 0.1, 0.2 , 0.3, 0.4 ],
"colsample_bytree" : [ 0.3, 0.4, 0.5 , 0.7 ] }

CatBoost Model Training

CatBoost is another competitor to XGBoost, LightGBM and H2O. The name "CatBoost" comes from two words: "Category" and "Boosting".

The library works well with multiple categories of data, such as audio, text, and images, including historical data.

The CatBoost library can be used to solve both classification and regression challenges. For classification, you can use “CatBoostClassifier”, and for regression, “CatBoostRegressor”.

Yandex relies heavily on CatBoost for ranking, forecasting and recommendations. This model serves more than 70 million users each month.

“CatBoost is an algorithm for gradient boosting on decision trees. Developed by Yandex researchers and engineers, it is the successor of the MatrixNet algorithm that is widely used within the company for ranking tasks, forecasting and making recommendations. It is universal and can be applied across a wide range of areas and to a variety of problems.”

Overall some of the algorithmic enhancements that Catboost brought:

  1. For data with categorical features, the accuracy of CatBoost tends to be better compared to other algorithms.
  2. Better over-fitting handling: CatBoost uses an implementation of ordered boosting, an alternative to the classic boosting algorithm, which is especially significant on small datasets.
  3. GPU training: the versions of CatBoost available from pip install (pip install catboost) and conda install (conda install catboost) have GPU support out-of-the-box. You just need to specify that you want to train your model on GPU in the corresponding hyperparameter (shown below).

For GPU system requirements of CatBoost

The versions of CatBoost available from pip install and conda install have GPU support out-of-the-box. Devices with compute capability 3.0 and higher are supported in compiled packages. Training on GPU requires NVIDIA Driver of version 418.xx or higher. The Python version of CatBoost for CUDA of compute capability 2.0 can be built from source.

To check Compute Capability of your CUDA-GPU check this NVIDIA Official Link

Further for Training on GPU

The parameters that enable and customize training on GPU are set in the constructors of the classes CatBoost (fit), CatBoostClassifier (fit), and CatBoostRegressor (fit). task_type is the processing unit type to use for training; possible values are "CPU" or "GPU". An example below:

model = CatBoostClassifier(iterations=1000,
                           task_type="GPU",
                           devices='0:1')
model.fit(train_data,
          train_labels,
          verbose=False)

Categorical features handling in CatBoost Algorithm

The below is taken from this paper

Categorical features have a discrete set of values called categories which are not necessarily comparable with each other; thus, such features cannot be used in binary decision trees directly. A common practice for dealing with categorical features is converting them to numbers at the preprocessing time, i.e., each category for each example is substituted with one or several numerical values. The most widely used technique which is usually applied to low-cardinality categorical features is one-hot encoding: the original feature is removed and a new binary variable is added for each category [14]. One-hot encoding can be done during the preprocessing phase or during training, the latter can be implemented more efficiently in terms of training time and is implemented in CatBoost.

For further details on this, read CatBoost’s documentation.
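In code, declaring categorical columns is a one-liner. A minimal sketch with made-up data (cat_features tells CatBoost which columns to encode during training, so no manual one-hot step is needed):

from catboost import CatBoostClassifier, Pool
import pandas as pd

toy_df = pd.DataFrame({'color': ['red', 'blue', 'red', 'green', 'blue', 'red'],
                       'size': [1.0, 2.5, 3.0, 0.5, 1.5, 2.0],
                       'label': [1, 0, 1, 0, 0, 1]})

train_pool = Pool(toy_df[['color', 'size']],
                  label=toy_df['label'],
                  cat_features=['color'])    # encoded internally by CatBoost

toy_model = CatBoostClassifier(iterations=10, verbose=False).fit(train_pool)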

Leaf growth algorithm in CatBoost

CatBoost grows a balanced tree. In each level of such a tree, the feature-split pair that brings the lowest loss (according to a penalty function) is selected and used for all the level's nodes. It is possible to change this policy using the grow_policy parameter.

CatBoost Training Parameters

Let’s look at the common parameters in CatBoost:

loss_function (alias: objective) — The metric used for training, such as root mean squared error for regression or logloss for classification.

eval_metric — Metric used for detecting over-fitting.

iterations — The maximum number of trees to be built; defaults to 1000. Its aliases are num_boost_round, n_estimators, and num_trees. Some notes on the total number of trees: in bagging and random forests, the averaging of independently grown trees makes it very difficult to overfit with too many trees. However, GBMs function differently, as each tree is grown in sequence to fix up the previous tree's mistakes. For example, in regression, GBMs will chase residuals as long as we allow them to. Also, depending on the values of the other hyperparameters, GBMs often require many trees (sometimes many thousands). But since more trees can easily overfit, we must find the optimal number of trees that minimizes the loss function of interest with cross-validation.

learning_rate (alias: eta) — The learning rate determines how fast or slow the model will learn. The default usually varies between 0.01 and 0.03.

random_seed alias random_state — The random seed used for training.

l2_leaf_reg alias reg_lambda — Coefficient at the L2 regularization term of the cost function. The default is 3.0.

bootstrap_type — Determines the sampling method for the weights of the objects, e.g. Bayesian, Bernoulli, MVS, and Poisson.

depth — The depth of the tree.

grow_policy — Determines how the greedy search algorithm will be applied. It can be either SymmetricTree, Depthwise, or Lossguide.

SymmetricTree is the default. In SymmetricTree, the tree is built level-by-level until the depth is attained. In every step, leaves from the previous level are split with the same condition. When Depthwise is chosen, a tree is built step-by-step until the specified depth is achieved. On each step, all non-terminal leaves from the last tree level are split. The leaves are split using the condition that leads to the best loss improvement. In Lossguide, the tree is built leaf-by-leaf until the specified number of leaves is attained. On each step, the non-terminal leaf with the best loss improvement is split.
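For illustration, a leaf-wise (LightGBM-style) CatBoost configuration might look like the sketch below (support for the non-default policies depends on the CatBoost version and task type):

# Lossguide grows leaf-by-leaf up to max_leaves, like LightGBM
model_lossguide = CatBoostRegressor(grow_policy='Lossguide',
                                    max_leaves=64,
                                    depth=10,
                                    verbose=False)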

min_data_in_leaf alias min_child_samples — This is the minimum number of training samples in a leaf. This parameter is only used with the Lossguide and Depthwise growing policies.

max_leaves alias num_leaves — This parameter is used only with the Lossguide policy and determines the number of leaves in the tree.

ignored_features — Indicates the features that should be ignored in the training process.

nan_mode — The method for dealing with missing values. The options are Forbidden, Min, and Max. The default is Min. When Forbidden is used, the presence of missing values leads to errors. With Min, the missing values are taken as the minimum values for that feature. In Max, the missing values are treated as the maximum value for the feature.

leaf_estimation_method — The method used to calculate values in leaves. In classification, 10 Newton iterations are used. Regression problems using quantile or MAE loss use one Exact iteration. Multiclass classification uses one Newton iteration.

leaf_estimation_backtracking — The type of backtracking to be used during gradient descent. The default is AnyImprovement. AnyImprovement decreases the descent step, up to where the loss function value is smaller than it was in the last iteration. Armijo reduces the descent step until the Armijo condition is met.

boosting_type — The boosting scheme. It can be plain for the classic gradient boosting scheme, or ordered, which offers better quality on smaller datasets.

score_function — The score type used to select the next split during tree construction. Cosine is the default option. The other available options are L2, NewtonL2, and NewtonCosine.

early_stopping_rounds — When True, sets the over-fitting detector type to Iter and stops the training when the optimal metric is achieved.

classes_count — The number of classes for multi-classification problems.

task_type — Whether you are using a CPU or GPU. CPU is the default.

devices — The IDs of the GPU devices to be used for training.

cat_features — The array with the categorical columns.

text_features — Used to declare text columns in classification problems.

# Now CatBoost model training

model_catboost = CatBoostRegressor(iterations=500,
                                   learning_rate=0.01,
                                   depth=10,
                                   eval_metric='RMSE',
                                   random_seed=42,
                                   bagging_temperature=0.2,
                                   od_type='Iter',
                                   metric_period=50,
                                   od_wait=20)

model_catboost.fit(X_train_split, y_train_split,
                   eval_set=(X_validation, y_validation),
                   use_best_model=True,
                   verbose=50)

predictions_test_y_catboost = np.expm1(model_catboost.predict(X_test_original))

Creating Output file for Submission

submission_final = pd.read_csv('../input/santander-value-prediction-challenge/sample_submission.csv')

submission_lgb = pd.DataFrame()
submission_lgb['target'] = predictions_test_y_light_gbm


submission_xgb = pd.DataFrame()
submission_xgb['target'] = predictions_test_y_xgb

submission_catboost = pd.DataFrame()
submission_catboost['target'] = predictions_test_y_catboost

submission_final['target'] = (submission_lgb['target'] * 0.5 + submission_xgb['target'] * 0.3 + submission_catboost['target'] * 0.2)
submission_final.head()

submission_final.to_csv('submission_combined_lgb_xgb_catboost.csv', index=False)
