# LightGBM, XGBoost and CatBoost — Kaggle — Santander Challenge

## Achieved a score of 1.4714 with this Kernel in Kaggle


Getting the data and Kaggle Challenge Link

Gradient boosted trees have become one of the most powerful algorithms for training on tabular data. In recent years we have been fortunate to have many implementations of boosted trees, each with its own characteristics. In this notebook, I will implement LightGBM, XGBoost and CatBoost to tackle this Kaggle problem.

What is Boosting

To see why boosting is needed, start with a basic question: if a data point is predicted incorrectly by our first model, and then by the next one (and probably by every model), will combining their predictions give better results? Boosting algorithms are designed to handle exactly this situation.

So, Boosting is a sequential technique which works on the principle of an ensemble, where each subsequent model attempts to correct the errors of the previous model. The succeeding models are dependent on the previous model.

The basic principle of boosting is to generate multiple weak learners and combine their predictions into one strong rule. The weak rules are generated by applying a base machine learning algorithm to different distributions of the data set, one weak rule per iteration. After many iterations, the weak learners are combined into a strong learner that predicts more accurately. A weak learner is one that is only slightly better than random guessing, for example a decision tree whose accuracy on a balanced binary problem is slightly above 50%.

Gradient Boosting works by sequentially adding predictors to an ensemble, each one correcting its predecessor. However, instead of tweaking the instance weights at every iteration like AdaBoost does, this method tries to fit the new predictor to the residual errors made by the previous predictor.
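As a rough illustration of fitting each new predictor to the residuals of the previous ones, here is a minimal sketch with one-split trees (stumps) on a single feature. The helper names `fit_stump`, `predict_stump` and `gradient_boost`, the toy data and the squared-error loss are my own assumptions for the example; none of the libraries discussed here is implemented this way.

```python
import numpy as np

def fit_stump(x, residual):
    """Return (threshold, left_value, right_value) minimising squared error."""
    best = None
    for threshold in np.unique(x)[:-1]:          # skip the max so the right side is never empty
        left = residual[x <= threshold]
        right = residual[x > threshold]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, threshold, left.mean(), right.mean())
    return best[1:]

def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)

def gradient_boost(x, y, n_rounds=50, learning_rate=0.1):
    prediction = np.full_like(y, y.mean(), dtype=float)   # start from the mean
    mse_history = [np.mean((y - prediction) ** 2)]
    stumps = []
    for _ in range(n_rounds):
        residual = y - prediction                # negative gradient of squared error
        stump = fit_stump(x, residual)           # new learner is fitted to the residuals
        prediction += learning_rate * predict_stump(stump, x)
        stumps.append(stump)
        mse_history.append(np.mean((y - prediction) ** 2))
    return stumps, mse_history

# Toy data (hypothetical): a noisy sine curve
rng = np.random.default_rng(0)
x = rng.uniform(0, 6, size=200)
y = np.sin(x) + rng.normal(scale=0.1, size=200)

stumps, mse_history = gradient_boost(x, y)
print(f"training MSE before boosting: {mse_history[0]:.4f}, after: {mse_history[-1]:.4f}")
```

Each round shrinks the training error a little; the `learning_rate` factor is the shrinkage discussed later in the LightGBM parameter section.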

Here’s how the weight-based variant of boosting (the AdaBoost approach mentioned above) works:

Step 1: The base algorithm reads the data and assigns equal weight to each sample observation.

Step 2: False predictions made by the base learner are identified. In the next iteration, these samples are passed to the next base learner with a higher weight.

Step 3: Repeat step 2 until the ensemble classifies the training data well enough or a maximum number of learners is reached.

Therefore, the main aim of boosting is to focus more on misclassified predictions.
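The three steps above can be sketched in a few lines. This is a toy AdaBoost-style loop on a single feature with labels in {-1, +1}; the function names, the data and the choice of threshold stumps as weak learners are assumptions made for the illustration.

```python
import numpy as np

def best_weighted_stump(x, y, weights):
    """Pick the threshold and polarity with the lowest weighted error."""
    best = None
    for threshold in np.unique(x):
        for polarity in (1, -1):
            pred = np.where(x <= threshold, polarity, -polarity)
            error = weights[pred != y].sum()
            if best is None or error < best[0]:
                best = (error, threshold, polarity)
    return best

def adaboost(x, y, n_rounds=3):
    weights = np.full(len(y), 1.0 / len(y))      # Step 1: equal weights
    weight_history = [weights.copy()]
    ensemble = []
    for _ in range(n_rounds):                    # Step 3: repeat
        error, threshold, polarity = best_weighted_stump(x, y, weights)
        error = np.clip(error, 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - error) / error)   # vote of this weak learner
        pred = np.where(x <= threshold, polarity, -polarity)
        weights = weights * np.exp(-alpha * y * pred)   # Step 2: up-weight mistakes
        weights = weights / weights.sum()
        weight_history.append(weights.copy())
        ensemble.append((alpha, threshold, polarity))
    return ensemble, weight_history

def predict(ensemble, x):
    score = sum(alpha * np.where(x <= t, p, -p) for alpha, t, p in ensemble)
    return np.sign(score)

# Toy data (hypothetical): no single threshold separates the classes
x = np.arange(10)
y = np.array([1, 1, 1, -1, -1, -1, 1, 1, 1, -1])

ensemble, weight_history = adaboost(x, y, n_rounds=3)
print("weights after round 1:", np.round(weight_history[1], 3))
print("training accuracy:", (predict(ensemble, x) == y).mean())
```

After the first round, the samples the first stump got wrong carry more weight than the ones it got right, which is what pushes the next stump towards them.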


These techniques build ensemble models iteratively. On the first iteration, the algorithm learns a first tree to reduce the training error. On the second iteration, it learns one more tree to reduce the error made by the first tree. The algorithm repeats this procedure until it has built a model of decent quality.

The common approach for classification is to optimize log loss, while regression typically optimizes root mean squared error. Ranking tasks commonly implement some variation of LambdaRank.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn import preprocessing, model_selection, metrics
from sklearn.model_selection import train_test_split
import lightgbm as lgb
import xgboost as xgb
from catboost import CatBoostRegressor
from IPython.display import display  # Allows the use of display() for DataFrames
import warnings
warnings.filterwarnings('ignore')

train_df = pd.read_csv('../input/santander-value-prediction-challenge/train.csv')
test_df = pd.read_csv('../input/santander-value-prediction-challenge/test.csv')
train_df.head()
```

```python
test_df.head()
```

```python
train_df.info()
```

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4459 entries, 0 to 4458
Columns: 4993 entries, ID to 9fc776466
dtypes: float64(1845), int64(3147), object(1)
memory usage: 169.9+ MB
```

Initial Observations looking at the above data

• Column names carry no meaning, as they are all anonymized.
• The dataframe is full of zero values.
• The dataset is a sparse tabular one.

Target Variable:

First, a scatter plot of the sorted target values to check for visible outliers.

```python
print('Train rows and columns: ', train_df.shape)
# The test file is huge (49,342 rows, about 1 GB), so the line below takes longer to run each time
print('Test rows and columns: ', test_df.shape)
```

```
Train rows and columns:  (4459, 4993)
Test rows and columns:  (100, 4992)
```

```python
# Keeping below lines commented out during development
plt.figure(figsize=(8,6))
# train_df.shape is a (rows, columns) tuple, so range() needs the row count
plt.scatter(range(train_df.shape[0]), np.sort(train_df['target'].values))
plt.xlabel('index', fontsize=12)
plt.ylabel('Target', fontsize=12)
plt.title('Distribution of Target', fontsize=14)
plt.show()
```

## Checking for missing / null values in data

```python
print("All Features in Train data with NaN Values =", str(train_df.columns[train_df.isnull().sum() != 0].size))
# print("All Features in Test data with NaN Values =", str(test_df.columns[test_df.isnull().sum() != 0].size))
```

```
All Features in Train data with NaN Values = 0
```

## Remove constant columns from data

`const_columns_to_remove = []for col in train_df.columns:    if col != 'ID' and col != 'target':        if train_df[col].std() == 0:            const_columns_to_remove.append(col)# Now remove that array of const columns from the datatrain_df.drop(const_columns_to_remove, axis=1, inplace=True)test_df.drop(const_columns_to_remove, axis=1, inplace=True)# Print to see the reduction of columnsprint('train_df rows and columns after removing constant columns: ', train_df.shape)print('Following `{}` Constant Column\n are removed'.format(len(const_columns_to_remove)))print(const_columns_to_remove)train_df rows and columns after removing constant columns:  (4459, 4737)Following `256` Constant Column are removed['d5308d8bc', 'c330f1a67', 'eeac16933', '7df8788e8', '5b91580ee', '6f29fbbc7', '46dafc868', 'ae41a98b6', 'f416800e9', '6d07828ca', '7ac332a1d', '70ee7950a', '833b35a7c', '2f9969eab', '8b1372217', '68322788b', '2288ac1a6', 'dc7f76962', '467044c26', '39ebfbfd9', '9a5ff8c23', 'f6fac27c8', '664e2800e', 'ae28689a2', 'd87dcac58', '4065efbb6', 'f944d9d43', 'c2c4491d5', 'a4346e2e2', '1af366d4f', 'cfff5b7c8', 'da215e99e', '5acd26139', '9be9c6cef', '1210d0271', '21b0a54cb', 'da35e792b', '754c502dd', '0b346adbd', '0f196b049', 'b603ed95d', '2a50e001c', '1e81432e7', '10350ea43', '3c7c7e24c', '7585fce2a', '64d036163', 'f25d9935c', 'd98484125', '95c85e227', '9a5273600', '746cdb817', '6377a6293', '7d944fb0c', '87eb21c50', '5ea313a8c', '0987a65a1', '2fb7c2443', 'f5dde409b', '1ae50d4c3', '2b21cd7d8', '0db8a9272', '804d8b55b', '76f135fa6', '7d7182143', 'f88e61ae6', '378ed28e0', 'ca4ba131e', '1352ddae5', '2b601ad67', '6e42ff7c7', '22196a84c', '0e410eb3d', '992e6d1d3', '90a742107', '08b9ec4ae', 'd95203ded', '58ad51def', '9f69ae59f', '863de8a31', 'be10df47c', 'f006d9618', 'a7e39d23d', '5ed0abe85', '6c578fe94', '7fa4fcee9', '5e0571f07', 'fd5659511', 'e06b9f40f', 'c506599c8', '99de8c2dc', 'b05f4b229', '5e0834175', 'eb1cc0d9c', 'b281a62b9', '00fcf67e4', 'e37b65992', '2308e2b29', 'c342e8709', 
'708471ebf', 'f614aac15', '15ecf7b68', '3bfe540f1', '7a0d98f3c', 'e642315a5', 'c16d456a7', '0c9b5bcfa', 'b778ab129', '2ace87cdd', '697a566f0', '97b1f84fc', '34eff114b', '5281333d7', 'c89f3ba7e', 'cd6d3c7e6', 'fc7c8f2e8', 'abbbf9f82', '24a233e8f', '8e26b560e', 'a28ac1049', '504502ce1', 'd9a8615f3', '4efd6d283', '34cc56e83', '93e98252a', '2b6cef19e', 'c7f70a49b', '0d29ab7eb', 'e4a0d39b7', 'a4d1a8409', 'bc694fc8f', '3a36fc3a2', '4ffba44d3', '9bfdec4bc', '66a866d2f', 'f941e9df7', 'e7af4dbf3', 'dc9a54a3e', '748168a04', 'bba8ce4bb', 'ff6f62aa4', 'b06fe66ba', 'ae87ebc42', 'f26589e57', '963bb53b1', 'a531a4bf0', '9fc79985d', '9350d55c1', 'de06e884c', 'fc10bdf18', 'e0907e883', 'c586d79a1', 'e15e1513d', 'a06067897', '643e42fcb', '217cd3838', '047ebc242', '9b6ce40cf', '3b2c972b3', '17a7bf25a', 'c9028d46b', '9e0473c91', '6b041d374', '783c50218', '19122191d', 'ce573744f', '1c4ea481e', 'fbd6e0a0b', '69831c049', 'b87e3036b', '54ba515ee', 'a09ba0b15', '90f77ec55', 'fb02ef0ea', '3b0cccd29', 'fe9ed417c', '589e8bd6f', '17b5a03fd', '80e16b49a', 'a3d5c2c2a', '1bd3a4e92', '611d81daa', '3d7780b1c', '113fd0206', '5e5894826', 'cb36204f9', 'bc4e3d600', 'c66e2deb0', 'c25851298', 'a7f6de992', '3f93a3272', 'c1b95c2ec', '6bda21fee', '4a64e56e7', '943743753', '20854f8bf', 'ac2e428a9', '5ee7de0be', '316423a21', '2e52b0c6a', '8bdf6bc7e', '8f523faf2', '4758340d5', '8411096ec', '9678b95b7', 'a185e35cc', 'fa980a778', 'c8d90f7d7', '080540c81', '32591c8b4', '5779da33c', 'bb425b41e', '01599af81', '1654ab770', 'd334a588e', 'b4353599c', '51b53eaec', '2cc0fbc52', '45ffef194', 'c15ac04ee', '5b055c8ea', 'd0466eb58', 'a80633823', 'a117a5409', '7ddac276f', '8c32df8b3', 'e5649663e', '6c16efbb8', '9118fd5ca', 'ca8d565f1', '16a5bb8d2', 'fd6347461', 'f5179fb9c', '97428b646', 'f684b0a96', 'e4b2caa9f', '2c2d9f267', '96eb14eaf', 'cb2cb460c', '86f843927', 'ecd16fc60', '801c6dc8e', 'f859a25b8', 'ae846f332', '2252c7403', 'fb9e07326', 'd196ca1fd', 'a8e562e8e', 'eb6bb7ce1', '5beff147e', '52b347cdc', '4600aadcf', 
'6fa0b9dab', '43d70cc4d', '408021ef8', 'e29d22b59']`
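The loop above checks one column at a time. As an alternative sketch, the same set of columns can usually be found in one vectorised call by counting distinct values per column. The helper name `constant_columns` and the toy frame are assumptions for the example; note that it treats NaN as a value, which differs slightly from the `std() == 0` test.

```python
import pandas as pd

def constant_columns(df, ignore=("ID", "target")):
    """Names of columns that hold a single distinct value (NaN counts as a value)."""
    candidates = df.drop(columns=list(ignore), errors="ignore")
    n_unique = candidates.nunique(dropna=False)
    return n_unique[n_unique <= 1].index.tolist()

# Toy frame (hypothetical) with two constant feature columns
toy = pd.DataFrame({
    "ID": ["a", "b", "c"],
    "target": [1.0, 2.0, 3.0],
    "f1": [0, 0, 0],
    "f2": [0, 5, 0],
    "f3": [7.5, 7.5, 7.5],
})
print(constant_columns(toy))  # ['f1', 'f3']
```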

## Remove Duplicate Columns

I will be using the duplicated() function of pandas — here’s how it works:

Suppose the columns of the data frame are `['alpha','beta','alpha']`

`df.columns.duplicated()` returns a boolean array: a `True` or `False` for each column. If it is `False` then the column name is unique up to that point, if it is `True` then the column name is duplicated earlier. For example, using the given example, the returned value would be `[False,False,True]`.

`Pandas` allows one to index using boolean values whereby it selects only the `True` values. Since we want to keep the unduplicated columns, we need the above boolean array to be flipped (ie `[True, True, False] = ~[False,False,True]`)

Finally, `df.loc[:,[True,True,False]]` selects only the non-duplicated columns using the aforementioned indexing capability.

Note: the above only checks column names, not column values.

```python
train_df = train_df.loc[:, ~train_df.columns.duplicated()]
print('Train rows and columns after removing duplicate columns: ', train_df.shape)
```

```
Train rows and columns after removing duplicate columns:  (4459, 4737)
```
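Since the name-based check cannot catch two differently named columns with identical contents, here is a small sketch of a value-based check. It relies on `DataFrame.duplicated()` applied to the transposed frame; the helper name and toy data are assumptions, and transposing a frame as wide as this one can be slow and memory-hungry, so treat it as an illustration rather than a recommendation.

```python
import pandas as pd

def drop_duplicate_value_columns(df):
    """Keep only the first of any group of columns whose values are identical."""
    is_duplicate = df.T.duplicated()          # one flag per original column
    return df.loc[:, ~is_duplicate.values]

# Toy frame (hypothetical): 'gamma' repeats the values of 'alpha' under another name
toy_df = pd.DataFrame({"alpha": [1, 2, 3], "beta": [4, 5, 6], "gamma": [1, 2, 3]})
deduped = drop_duplicate_value_columns(toy_df)
print(list(deduped.columns))  # ['alpha', 'beta']
```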

## Handling Sparse data

What is Sparse data

As an example, let’s say that we are collecting data from a device which has 12 sensors. And you have collected data for 10 days.

The data you have collected is as follows:

The above is an example of sparse data because most of the sensor outputs are zero. This means the sensors are functioning properly but the actual reading is zero. Although this matrix is high-dimensional (12 axes), it contains relatively little information.

So basically, sparse data means that there are many gaps present in the data being recorded. For example, in the case of the sensor mentioned above, the sensor may send a signal only when the state changes, like when there is a movement of the door in a room. This data will be obtained intermittently because the door is not always moving. Hence, this is sparse data.

First, let's look at our train_df data again to see how sparse it is. As we can see, there are plenty of zeros.

`train_df.head()`

## Check and handle total memory of data

`get_dummies` pandas function converts categorical variables into indicator variables.

```python
def print_memory_usage_of_df(df):
    mb_per_byte = 0.000001  # conversion factor: 1 byte = 1e-6 MB
    memory_usage = round(df.memory_usage().sum() * mb_per_byte, 3)
    print('Memory usage is ', str(memory_usage) + " MB")

print_memory_usage_of_df(train_df)
print(train_df.shape)
```

```
Memory usage is  168.978 MB
(4459, 4737)
```

```python
dummy_encoded_train_df = pd.get_dummies(train_df)
dummy_encoded_train_df.shape
```

```
(4459, 9195)
```

```python
print_memory_usage_of_df(dummy_encoded_train_df)
```

```
Memory usage is  188.825 MB
```

We see that the memory usage of dummy_encoded_train_df is larger than that of the original data frame, because the number of columns has increased.

So let's apply `sparse=True` and see whether it reduces the memory usage.

The `sparse` parameter defaults to False. If True, the encoded columns are returned as SparseArray. By setting `sparse=True` we create a sparse data frame directly.

```python
dummy_encoded_sparse_train_df = pd.get_dummies(train_df, sparse=True)
dummy_encoded_sparse_train_df.shape
```

```
(4459, 9195)
```

```python
print_memory_usage_of_df(dummy_encoded_sparse_train_df)
```

```
Memory usage is  168.965 MB
```

In this case, however, the reduction in memory usage was small, so let's try an alternative.

## Pandas Sparse Structures

Pandas provides data structures for efficient storage of sparse data. You can view these objects as being “compressed”: any data matching a specific fill value (NaN / missing value by default, though any value can be chosen, including 0) is omitted rather than stored in the array.

Storing only the non-zero values and their positions is a common technique in storing sparse data sets.
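To make the idea of storing only the non-zero values and their positions concrete, here is a minimal coordinate-style sketch in NumPy. The 10 x 12 array loosely mirrors the sensor example (10 days, 12 sensors) and its readings are made up; pandas and SciPy use more refined layouts than these plain triplets.

```python
import numpy as np

def to_triplets(dense):
    """Keep only the non-zero entries as (row, column, value) arrays."""
    rows, cols = np.nonzero(dense)
    return rows, cols, dense[rows, cols], dense.shape

def from_triplets(rows, cols, values, shape):
    """Rebuild the dense array; everything not stored is zero."""
    dense = np.zeros(shape, dtype=values.dtype)
    dense[rows, cols] = values
    return dense

# Hypothetical readings: 10 days x 12 sensors, almost all zero
readings = np.zeros((10, 12))
readings[2, 5] = 3.2
readings[7, 0] = 1.1
readings[9, 11] = 4.8

rows, cols, values, shape = to_triplets(readings)
stored_bytes = rows.nbytes + cols.nbytes + values.nbytes
print(f"dense: {readings.nbytes} bytes, triplets: {stored_bytes} bytes")
```

The saving grows with the share of zeros; for a mostly non-zero matrix the triplets would be larger than the dense array.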

This greatly reduces the memory usage of our data set and effectively “compresses” the data frame.

In our example, we will convert the one-hot encoded columns into SparseArrays, which are 1-d arrays where only non-zero values are stored.

```python
def convert_df_to_sparse_array(df, exclude_columns=()):
    df = df.copy()
    exclude_columns = set(exclude_columns)
    # df.items() replaces df.iteritems(), which was removed in pandas 2.0
    for (column_name, column_data) in df.items():
        if column_name in exclude_columns:
            continue
        # pd.arrays.SparseArray is where current pandas exposes SparseArray
        df[column_name] = pd.arrays.SparseArray(column_data.values, dtype='uint8')
    return df

# Now convert our earlier dummy_encoded_train_df with above function and check memory_size
# train_data_post_conversion_to_sparse_array = convert_df_to_sparse_array(dummy_encoded_train_df)
# print('Sparse Array Train_DF rows and columns: ', train_data_post_conversion_to_sparse_array.shape)
# print_memory_usage_of_df(train_data_post_conversion_to_sparse_array)

# Commenting the above out - for running the Notebook faster during my development
# Because iterating over the columns will take a huge time to process the data - see warning below
```

We see that the memory usage is substantially reduced now.

## A warning on using df.iteritems()

`df.iteritems()` iterates over columns, not rows (in pandas 2.0 and later it is spelled `df.items()`). In general, iterating over a dataframe is an anti-pattern and something to avoid, unless you want to get used to a lot of waiting.
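One way to avoid the Python-level loop altogether is to let pandas convert every column in a single `astype` call with a `pd.SparseDtype`. The sketch below uses a small synthetic frame with about 5% non-zero entries (my own assumption, not the competition data) and `float64` values instead of the `uint8` used above.

```python
import numpy as np
import pandas as pd

# Synthetic, mostly-zero data (hypothetical): 1000 rows x 50 columns
rng = np.random.default_rng(0)
values = np.where(rng.random((1000, 50)) < 0.05, 1.0, 0.0)
dense_df = pd.DataFrame(values, columns=[f"col_{i}" for i in range(50)])

# Convert all columns at once; zeros become the implicit fill value
sparse_df = dense_df.astype(pd.SparseDtype("float64", fill_value=0.0))

dense_mb = dense_df.memory_usage().sum() / 1e6
sparse_mb = sparse_df.memory_usage().sum() / 1e6
density = sparse_df.sparse.density   # share of explicitly stored values
print(f"dense: {dense_mb:.3f} MB, sparse: {sparse_mb:.3f} MB, density: {density:.3f}")
```

How much memory this saves depends on the density and dtype of the data, so it is worth measuring on the real frame before committing to it.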

## For this notebook, I will go with the easier approach to handle sparse data — which is just to drop it from the dataframe

As in the code below; I do this so that the notebook runs faster for now.

```python
def drop_sparse_from_train_test(train, test):
    column_list_to_drop_data_from = [x for x in train.columns if x not in ['ID', 'target']]
    for f in column_list_to_drop_data_from:
        if len(np.unique(train[f])) < 2:
            train.drop(f, axis=1, inplace=True)
            test.drop(f, axis=1, inplace=True)
    return train, test

train_df, test_df = drop_sparse_from_train_test(train_df, test_df)
```

## Split data into Train and Test for Model Training

```python
X_train = train_df.drop(['ID', 'target'], axis=1)
y_train = np.log1p(train_df['target'].values)
X_test_original = test_df.drop('ID', axis=1)
X_train_split, X_validation, y_train_split, y_validation = train_test_split(
    X_train, y_train, test_size=0.2, random_state=42)
```
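The target is transformed with `np.log1p` before training, and the predictions are mapped back with `np.expm1` later on. A small sketch of why this pairing works: RMSE measured in log space equals the RMSLE (root mean squared logarithmic error) of the back-transformed predictions, which, as far as I know, is the metric this competition is scored on. The numbers below are made up for the illustration.

```python
import numpy as np

def rmse(a, b):
    return float(np.sqrt(np.mean((np.asarray(a) - np.asarray(b)) ** 2)))

def rmsle(y_true, y_pred):
    return rmse(np.log1p(y_true), np.log1p(y_pred))

# Hypothetical targets on the original scale and model outputs in log space
target = np.array([40_000.0, 600_000.0, 2_000_000.0, 10_000_000.0])
log_target = np.log1p(target)
log_prediction = log_target + np.array([0.1, -0.2, 0.05, 0.3])

prediction = np.expm1(log_prediction)        # back to the original scale

print("RMSE in log space:      ", rmse(log_target, log_prediction))
print("RMSLE on original scale:", rmsle(target, prediction))
```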

## Fundamentals of LightGBM Model

It is a gradient boosting model that makes use of tree based learning algorithms. It is considered to be a fast processing algorithm.

While trees in most other algorithms grow horizontally (level-wise), LightGBM grows its trees vertically (leaf-wise). It chooses the leaf with the largest loss reduction to grow next. For the same number of leaves, leaf-wise growth can reduce the loss more than level-wise growth.

Light GBM is prefixed as Light because of its high speed. Light GBM can handle the large size of data and takes lower memory to run.

Another reason why Light GBM is so popular is because it focuses on accuracy of results. LGBM also supports GPU learning and thus data scientists are widely using LGBM for data science application development.

Leaf growth technique in LightGBM

LightGBM uses leaf-wise (best-first) tree growth. It chooses to grow the leaf that minimizes the loss, allowing a growth of an imbalanced tree. Because it doesn’t grow level-wise, but leaf-wise, over-fitting can happen when data is small. In these cases, it is important to control the tree depth.

## When to use LightGBM ?

LightGBM is not preferred for small datasets, as it can easily overfit small data due to its sensitivity. Hence, it is generally advised for data with more than 10,000 rows, though there is no fixed threshold that decides when to use LightGBM.

## What are LightGBM Parameters?

While LightGBM has more than 100 parameters, all listed in its documentation, let’s look at the most important ones.

## Control Parameters

Max depth: It limits the depth of the tree and so helps control overfitting. If you feel your model is overfitting, lower the max depth.

Min_data_in_leaf: The minimum number of records a leaf may have; also used to control overfitting.

Feature_fraction: The fraction of features randomly selected in each iteration for building trees. If it is 0.7, then 70% of the features are used.

Bagging_fraction: The fraction of the data used in each iteration. Often used to speed up training and avoid overfitting.

Early_stopping_round: Training stops if the validation metric does not improve in the last early_stopping_round rounds. This cuts unnecessary iterations.

Lambda: The regularization strength (lambda_l1 and lambda_l2). Any non-negative value is accepted; typical values range from 0 to 1.

Min_gain_to_split: The minimum gain required to make a split; used to control the number of splits in the tree.

## Core Parameters

Task: The task to be performed on the data. It can either train on the data or predict on the data.

Application: This parameter specifies whether to do regression or classification. LightGBM default parameter for application is regression.

Binary: It is used for binary classification.

Multiclass: It is used for multiclass classification problems.

Regression: It is used for doing regression.

Boosting: It specifies the algorithm type.

rf : Used for Random Forest.

Num_boost_round: It tells about the boosting iterations.

Learning_rate: The learning rate controls the magnitude of the update that each tree’s output applies to the current approximation. It determines the contribution of each tree to the final outcome and controls how quickly the algorithm proceeds down the gradient descent (learns). Typical values are between 0.001–0.3. Smaller values make the model robust to the specific characteristics of each individual tree, thus allowing it to generalize well. Smaller values also make it easier to stop prior to overfitting; however, they increase the risk of not reaching the optimum with a fixed number of trees and are more computationally demanding. This hyperparameter is also called shrinkage. Generally, the smaller this value, the more accurate the model can be, but the more trees it will require in the sequence.

Num_leaves: The maximum number of leaves in one tree. Default value: 31.

## Metric Parameter

It takes care of the loss while building the model. Some of them are stated below for classification as well as regression.

Mae: Mean absolute error.

Mse: Mean squared error.

Binary_logloss: Binary Classification loss.

Multi_logloss: Multi Classification loss.

```python
def light_gbm_model_run(train_x, train_y, validation_x, validation_y, test_x):
    params = {
        "objective" : "regression",
        "metric" : "rmse",
        "num_leaves" : 100,
        "learning_rate" : 0.001,
        "bagging_fraction" : 0.6,
        "feature_fraction" : 0.6,
        # Kept as in the original run; LightGBM's documented name for this
        # parameter is "bagging_freq".
        "bagging_frequency" : 6,
        "bagging_seed" : 42,
        "verbosity" : -1,
        "seed": 42
    }
    # Given its a regression case, I am using the RMSE as the metric.
    lg_train = lgb.Dataset(train_x, label=train_y)
    lg_validation = lgb.Dataset(validation_x, label=validation_y)
    evals_result_lgbm = {}

    # early_stopping_rounds, verbose_eval and evals_result are keyword arguments of
    # lgb.train in LightGBM 3.x; LightGBM 4.x expects the equivalent callbacks instead.
    model_light_gbm = lgb.train(params, lg_train, 5000,
                      valid_sets=[lg_train, lg_validation],
                      early_stopping_rounds=100,
                      verbose_eval=150,
                      evals_result=evals_result_lgbm)

    # The model was trained on log1p(target), so map predictions back with expm1
    pred_test_light_gbm = np.expm1(model_light_gbm.predict(test_x, num_iteration=model_light_gbm.best_iteration))

    return pred_test_light_gbm, model_light_gbm, evals_result_lgbm

# Training and output of LightGBM Model
predictions_test_y_light_gbm, model_lgbm, evals_result = light_gbm_model_run(X_train_split, y_train_split, X_validation, y_validation, X_test_original)
print('Output of LightGBM Model training..')
```

```
Training until validation scores don't improve for 100 rounds
training's rmse: 1.66447	valid_1's rmse: 1.63996
training's rmse: 1.5765	valid_1's rmse: 1.5927
training's rmse: 1.49849	valid_1's rmse: 1.55466
training's rmse: 1.42919	valid_1's rmse: 1.52339
training's rmse: 1.36631	valid_1's rmse: 1.49837
training's rmse: 1.30931	valid_1's rmse: 1.47791
training's rmse: 1.25734	valid_1's rmse: 1.46143
training's rmse: 1.20984	valid_1's rmse: 1.44818
training's rmse: 1.16678	valid_1's rmse: 1.43796
training's rmse: 1.12698	valid_1's rmse: 1.42969
training's rmse: 1.09049	valid_1's rmse: 1.42292
training's rmse: 1.05661	valid_1's rmse: 1.41849
training's rmse: 1.02528	valid_1's rmse: 1.41488
training's rmse: 0.995869	valid_1's rmse: 1.41222
training's rmse: 0.968211	valid_1's rmse: 1.40996
training's rmse: 0.941985	valid_1's rmse: 1.40807
training's rmse: 0.917269	valid_1's rmse: 1.40669
training's rmse: 0.893978	valid_1's rmse: 1.40569
training's rmse: 0.871822	valid_1's rmse: 1.40492
training's rmse: 0.850995	valid_1's rmse: 1.40427
training's rmse: 0.831253	valid_1's rmse: 1.40393
training's rmse: 0.812591	valid_1's rmse: 1.40376
Early stopping, best iteration is:
training's rmse: 0.813895	valid_1's rmse: 1.40374
Output of LightGBM Model training..
```

## Hyper-Parameter Tuning in LightGBM

Parameter Tuning is an important part that is usually done by data scientists to achieve a good accuracy, fast result and to deal with overfitting. Let us see quickly some of the parameter tuning you can do for better results.

num_leaves: This parameter is responsible for the complexity of the model. I normally start by trying values in the range [10,100]. But if you have a solid heuristic for choosing tree depth, you can always use it and set num_leaves to 2^tree_depth - 1.

The LightGBM documentation says on this point: This is the main parameter to control the complexity of the tree model. Theoretically, we can set num_leaves = 2^(max_depth) to obtain the same number of leaves as depth-wise tree. However, this simple conversion is not good in practice. The reason is that a leaf-wise tree is typically much deeper than a depth-wise tree for a fixed number of leaves. Unconstrained depth can induce over-fitting. Thus, when trying to tune the num_leaves, we should let it be smaller than 2^(max_depth). For example, when the max_depth=7 the depth-wise tree can get good accuracy, but setting num_leaves to 127 may cause over-fitting, and setting it to 70 or 80 may get better accuracy than depth-wise.

Min_data_in_leaf: Assigning bigger value to this parameter can result in underfitting of the model. Giving it a value of 100 or 1000 is sufficient for a large dataset.

Max_depth: Controls the depth of the individual trees. Typical values range from a depth of 3–8 but it is not uncommon to see a tree depth of 1. Smaller depth trees are computationally efficient (but require more trees); however, higher depth trees allow the algorithm to capture unique interactions but also increase the risk of over-fitting. Larger training data sets are more tolerable to deeper trees.

num_iterations: The number of boosting iterations (trees to build). The more trees you build, the more accurate your model can be, at the cost of longer training time and a higher chance of over-fitting. So typically start with a lower number of trees to build a baseline and increase it later when you want to squeeze the last % out of your model.

It is recommended to use smaller `learning_rate` with larger `num_iterations`. Also, we should use `early_stopping_rounds` if we go for higher `num_iterations` to stop your training when it is not learning anything useful.

early_stopping_rounds — “early stopping” refers to stopping the training process if the model’s performance on a given validation set does not improve for several consecutive iterations. This parameter will stop training if the validation metric is not improving after the last early stopping round. It should be defined in pair with a number of iterations. If we set it too large we increase the chance of over-fitting. The rule of thumb is to have it at 10% of your `num_iterations`.
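The stopping rule described above can be written down in a few lines. This is only a sketch of the logic on a list of validation scores (lower is better); the function name and the scores are made up, and LightGBM's own implementation handles more cases than this.

```python
def best_iteration_with_early_stopping(validation_scores, patience):
    """Return (best_iteration, stopped_at), both 1-based; lower score is better."""
    best_score = float("inf")
    best_iteration = 0
    for iteration, score in enumerate(validation_scores, start=1):
        if score < best_score:
            best_score, best_iteration = score, iteration
        elif iteration - best_iteration >= patience:
            # No improvement for `patience` consecutive rounds: stop here
            return best_iteration, iteration
    return best_iteration, len(validation_scores)

# Hypothetical validation RMSE per boosting round
scores = [1.64, 1.59, 1.55, 1.52, 1.53, 1.54, 1.56, 1.51]

print(best_iteration_with_early_stopping(scores, patience=3))   # (4, 7)
print(best_iteration_with_early_stopping(scores, patience=10))  # (8, 8)
```

With a patience of 3 the loop stops at round 7 and never sees the better score at round 8, which is the trade-off behind choosing the number of early-stopping rounds.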

## For my LightGBM implementation above, the following initial values of two parameters got me a score of 1.47953 (on the Kaggle public leaderboard)

`"num_leaves" : 40,"learning_rate" : 0.004,`

And now if I only tune these parameters as below

`"num_leaves" : 100,"learning_rate" : 0.001,`

## My score improved very slightly to 1.4714 (on the Kaggle public leaderboard)

I also tried the below one (keeping ‘num_leaves’ at 70 to avoid over-fitting)

`"num_leaves" : 70,"learning_rate" : 0.001,`

With this — I got a score of 1.47234 (in Kaggle Public Board)

## Features Importance in LightGBM

```python
gain_light_gbm = model_lgbm.feature_importance('gain')
feature_imp_light_gbm = pd.DataFrame({
    'feature': model_lgbm.feature_name(),
    'split': model_lgbm.feature_importance('split'),
    'gain': 100 * gain_light_gbm / gain_light_gbm.sum()
}).sort_values('gain', ascending=False)
print(feature_imp_light_gbm[:50])
```

```
        feature  split      gain
4135  f190486d6   6623  6.950093
2378  58e2e02e6   6012  4.948486
3470  eeb9cd3aa   5474  4.035842
2617  9fd594eec   4157  3.262016
4025  15ace8c9f   5198  2.999663
8     20aa07010   3628  1.903282
3576  58232a6fb   3458  1.472419
834   6eef030c1   3798  1.292199
1459  b43a7cfd5   4241  1.255624
2690  fb0f5dbfe   4482  1.118876
3666  491b9ee45   2625  1.040685
1484  024c577b9   2931  1.027777
4348  1702b5bf0   2993  0.934598
4190  f74e8f13d   3636  0.931987
566   66ace2992   3053  0.899695
4513  c47340d97   3209  0.888001
3727  d6bb78916   3296  0.863508
2082  58e056e12   3651  0.859513
863   fc99f9426   2474  0.778488
3816  adb64ff71   2664  0.749633
4458  190db8488   3040  0.721582
4033  5c6487af1   2454  0.703383
3224  ced6a7e91   1666  0.667195
3796  ed8ff54b5    737  0.643995
2137  241f0f867   2467  0.641655
537   26fc93eb7   2559  0.620171
3872  2288333b4   1182  0.614189
3891  50e4f96cf   1147  0.594101
2619  fb387ea33    998  0.583353
828   6786ea46d    806  0.559286
1380  6cf7866c1    985  0.547168
4346  e176a204a   2342  0.544077
34    87ffda550   1353  0.531567
2214  1931ccfdd   2020  0.517504
853   bc70cbc26    973  0.505660
4321  c5a231d81   2253  0.502023
3010  703885424   1944  0.488684
3784  70feb1494   1879  0.486723
3474  324921c7b   2431  0.477780
213   186b87c05    345  0.476618
2934  91f701ba2   1983  0.439053
3988  45f6d00da   1365  0.421545
624   0c9462c08   1232  0.420413
4022  62e59a501   1955  0.418382
545   0572565c2   1652  0.395456
1750  5f341a818    877  0.389509
1850  5a1589f1a   1570  0.379379
1712  2ec5b290f   2078  0.377675
645   6619d81fc   1627  0.365498
1360  1db387535   1774  0.365378
```

## Note on XGBoost

Below we will be using XGBoost, an advanced implementation of the gradient boosting method; the name stands for eXtreme Gradient Boosting. XGBoost dominates structured or tabular datasets on classification and regression predictive modeling problems. The XGBoost library implements the gradient boosting decision tree algorithm.

Unlike plain gradient descent, which updates a fixed set of parameters, gradient boosting fits each additional model to the negative gradient of the loss with respect to the current predictions. Adding that model reduces the output error at each iteration.

In practice what we do in order to build the learner is to:

• Iterate over all features and values per feature, and evaluate the loss reduction of each possible split:
• gain = loss(parent instances) - (loss(left branch) + loss(right branch))
• The gain for the best split must be positive (and > min_split_gain parameter), otherwise we must stop growing the branch.
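The bullet points above can be sketched for a single feature as follows. I assume the loss is the sum of squared errors around the mean of each branch, which is one common choice for regression and not necessarily what any given library uses; `sse` and `best_split` are names made up for the example.

```python
import numpy as np

def sse(values):
    """Sum of squared errors around the mean (0 for an empty branch)."""
    return float(((values - values.mean()) ** 2).sum()) if len(values) else 0.0

def best_split(feature, target, min_split_gain=0.0):
    """Scan the sorted feature values and return (threshold, gain) of the best split.

    Returns (None, min_split_gain) when no split beats min_split_gain.
    """
    order = np.argsort(feature)
    feature, target = feature[order], target[order]
    parent_loss = sse(target)
    best_gain, best_threshold = min_split_gain, None
    for i in range(1, len(feature)):
        if feature[i] == feature[i - 1]:
            continue                              # no threshold between equal values
        gain = parent_loss - (sse(target[:i]) + sse(target[i:]))
        if gain > best_gain:
            best_gain = gain
            best_threshold = (feature[i - 1] + feature[i]) / 2
    return best_threshold, best_gain

# Toy data (hypothetical): the target jumps between feature values 3 and 10
feature = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
target = np.array([5.0, 5.0, 5.0, 20.0, 20.0, 20.0])

print(best_split(feature, target))  # threshold 6.5, gain 337.5
```

If the target is constant, no split has positive gain and the function returns `None` as the threshold, which corresponds to the "stop growing the branch" rule.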

Leaf growth

XGBoost splits up to the specified max_depth hyperparameter and then prunes the tree backwards, removing splits beyond which there is no positive gain. It uses this approach because a split with no loss reduction may sometimes be followed by a split with loss reduction. XGBoost can also perform leaf-wise tree growth (as LightGBM does).

Normally it is impossible to enumerate all the possible tree structures q. A greedy algorithm that starts from a single leaf and iteratively adds branches to the tree is used instead. Assume that I_L and I_R are the instance sets of left and right nodes after the split. Then the loss reduction after the split is given by,
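For reference, this is the split gain as I know it from the XGBoost paper. Here $g_i$ and $h_i$ denote the first and second derivatives of the loss for instance $i$, $I = I_L \cup I_R$ is the instance set of the node being split, and $\lambda$ and $\gamma$ are the L2 and per-leaf penalties; these symbols follow the paper's notation rather than anything defined earlier in this post.

$$
\mathcal{L}_{split} = \frac{1}{2}\left[
\frac{\left(\sum_{i \in I_L} g_i\right)^2}{\sum_{i \in I_L} h_i + \lambda}
+ \frac{\left(\sum_{i \in I_R} g_i\right)^2}{\sum_{i \in I_R} h_i + \lambda}
- \frac{\left(\sum_{i \in I} g_i\right)^2}{\sum_{i \in I} h_i + \lambda}
\right] - \gamma
$$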

## Differences in LightGBM & XGBoost

LightGBM uses a novel technique, Gradient-based One-Side Sampling (GOSS), to filter the data instances used for finding a split value, while XGBoost uses a pre-sorted algorithm and a histogram-based algorithm for computing the best split. Here, instances means observations/samples.

Let’s see how pre-sorting splitting works-

• For each node, enumerate over all features
• For each feature, sort the instances by feature value
• Use a linear scan to decide the best split along that feature, based on information gain
• Take the best split solution along all the features

In simple terms, the histogram-based algorithm groups all the data points for a feature into discrete bins and uses these bins to find the split value. While it is more efficient in training speed than the pre-sorted algorithm, which enumerates all possible split points on the pre-sorted feature values, it is still behind GOSS in terms of speed.
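To illustrate the binning idea, here is a minimal sketch that only evaluates bin edges as split candidates. It assumes equal-width bins, a squared-error gain and a feature with at least two distinct values; real libraries build their histograms from gradient statistics and choose bin boundaries more carefully, so read this as a simplified picture.

```python
import numpy as np

def histogram_split(feature, target, n_bins=16):
    """Return (threshold, gain); rows with feature < threshold go to the left branch."""
    edges = np.linspace(feature.min(), feature.max(), n_bins + 1)
    bin_index = np.digitize(feature, edges[1:-1])          # values in 0 .. n_bins - 1

    # One pass over the data builds per-bin counts and target sums
    count = np.bincount(bin_index, minlength=n_bins).astype(float)
    total = np.bincount(bin_index, weights=target, minlength=n_bins)

    # Candidate splits sit between consecutive bins, so only n_bins - 1 are checked
    left_count, left_sum = np.cumsum(count)[:-1], np.cumsum(total)[:-1]
    right_count, right_sum = count.sum() - left_count, total.sum() - left_sum
    valid = (left_count > 0) & (right_count > 0)

    # Squared-error gain: sum^2 / n on each side minus the same term for the parent
    gain = np.full(n_bins - 1, -np.inf)
    gain[valid] = (left_sum[valid] ** 2 / left_count[valid]
                   + right_sum[valid] ** 2 / right_count[valid]
                   - total.sum() ** 2 / count.sum())
    best = int(np.argmax(gain))
    return float(edges[best + 1]), float(gain[best])

# Toy data (hypothetical): the target jumps between feature values 3 and 10
feature = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
target = np.array([5.0, 5.0, 5.0, 20.0, 20.0, 20.0])

print(histogram_split(feature, target, n_bins=4))
```

The number of candidate splits depends on `n_bins` rather than on the number of distinct feature values, which is where the speed-up over the pre-sorted scan comes from.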

## XGBoost Model Parameters

I am explaining only those parameters that I will be using below in my function. For an exhaustive explanation of all of them, see the XGBoost parameter documentation.

objective [default=reg:squarederror; called reg:linear in older XGBoost releases]

This defines the loss function to be minimized. Mostly used values are:

• binary:logistic: logistic regression for binary classification, returns predicted probability (not class)
• multi:softmax: multiclass classification using the softmax objective, returns predicted class (not probabilities); you also need to set an additional num_class (number of classes) parameter defining the number of unique classes
• multi:softprob: same as softmax, but returns the predicted probability of each data point belonging to each class.

eval_metric [ default according to objective ]

The metric to be used for validation data. The default values are rmse for regression and error for classification. Typical values are:

• rmse — root mean square error
• mae — mean absolute error
• logloss — negative log-likelihood
• error — Binary classification error rate (0.5 threshold)
• merror — Multiclass classification error rate
• mlogloss — Multiclass logloss
• auc: Area under the curve

eta [default=0.3]

• Analogous to learning rate in GBM.
• Makes the model more robust by shrinking the weights on each step.
• Typical final values to be used: 0.01–0.2

colsample_bytree: We can draw a random sample of the features (columns) before creating each decision tree in the boosted model. This is column sub-sampling by tree, controlled by the colsample_bytree parameter. The default value is 1.0, meaning that all columns are used in each decision tree. A fraction (e.g. 0.6) means that this share of the columns is sampled for each tree. A simple search evaluates values for colsample_bytree between 0.1 and 1.0 in steps of 0.1.

## A Note on regularization in XGBoost

XGBoost adds built-in regularization to achieve accuracy gains beyond gradient boosting. Regularization is the process of adding information to reduce variance and prevent overfitting.

Although a model may be regularized through hyperparameter fine-tuning, regularized algorithms may also be attempted. For example, Ridge and Lasso are regularized machine learning alternatives to LinearRegression.

XGBoost includes regularization as part of the learning objective, as contrasted with gradient boosting and random forests. The regularized parameters penalize complexity and smooth out the final weights to prevent overfitting. XGBoost is a regularized version of gradient boosting.

Mathematically, XGBoost’s learning objective may be defined as follows:

$$\text{obj}(\theta) = l(\theta) + \Omega(\theta)$$

Here, l(θ) is the loss function, which is the Mean Squared Error (MSE) for regression, or the log loss for classification, and Ω (θ) is the regularization function, a penalty term to prevent over-fitting. Including a regularization term as part of the objective function distinguishes XGBoost from most tree ensembles.

The learning objective for the t-th boosted tree can now be rewritten as follows:
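In the notation of the XGBoost documentation, which is an assumption here since this article does not define these symbols, $\hat{y}_i^{(t-1)}$ is the prediction for example $i$ after $t-1$ trees and $f_t$ is the tree added at round $t$. Under that notation the objective at round $t$ takes the following form:

$$\text{obj}^{(t)} = \sum_{i=1}^{n} l\left(y_i,\ \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t)$$

It has the same two parts as the general objective: a loss term and a regularization term, now evaluated for the newly added tree.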

reg_alpha and reg_lambda : First note the loss function is defined as

So the above is what the regularized objective function looks like if you want to allow for the inclusion of both an L1 and an L2 parameter in the same model.
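Written out, the regularization term is commonly given in the form below. The symbols are assumptions borrowed from the XGBoost paper and documentation: $T$ is the number of leaves in the tree, $w_j$ is the weight of leaf $j$, $\gamma$ is the per-leaf penalty, $\lambda$ corresponds to `reg_lambda` and $\alpha$ to `reg_alpha`:

$$\Omega(f) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2 + \alpha \sum_{j=1}^{T} |w_j|$$

The squared term is the L2 penalty and the absolute-value term is the L1 penalty on the leaf weights.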

`reg_alpha` and `reg_lambda` control the L1 and L2 regularization terms, which in this case limit how extreme the weights at the leaves can become. Higher values of alpha mean more L1 regularization. See the documentation here.

Since L1 regularization in GBDTs is applied to leaf scores rather than directly to features as in logistic regression, it actually serves to reduce the depth of trees. This in turn will tend to reduce the impact of less-predictive features. We might think of L1 regularization as more aggressive against less-predictive features than L2 regularization.

These two regularization terms have different effects on the weights; L2 regularization (controlled by the lambda term) encourages the weights to be small, whereas L1 regularization (controlled by the alpha term) encourages sparsity — so it encourages weights to go to 0. This is helpful in models such as logistic regression, where you want some feature selection, but in decision trees we’ve already selected our features, so zeroing their weights isn’t super helpful. For this reason, I found setting a high lambda value and a low (or 0) alpha value to be the most effective when regularizing.
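The different effects of the two terms on a single leaf weight can be sketched numerically. The code below assumes the usual closed-form leaf weight for a second-order objective, where `grad_sum` and `hess_sum` are the sums of first and second derivatives of the loss over the examples in the leaf; the function names are hypothetical and this is not XGBoost's source code:

```python
def soft_threshold(value: float, alpha: float) -> float:
    """Shrink value toward zero by alpha; return 0.0 inside [-alpha, alpha]."""
    if value > alpha:
        return value - alpha
    if value < -alpha:
        return value + alpha
    return 0.0


def leaf_weight(grad_sum: float, hess_sum: float,
                reg_alpha: float = 0.0, reg_lambda: float = 1.0) -> float:
    """Leaf weight under L1 (reg_alpha) and L2 (reg_lambda) penalties."""
    shrunk = soft_threshold(grad_sum, reg_alpha)
    if shrunk == 0.0:
        return 0.0  # L1 has zeroed this leaf out entirely
    return -shrunk / (hess_sum + reg_lambda)


G, H = -4.0, 3.0
print(leaf_weight(G, H, reg_alpha=0.0, reg_lambda=0.0))  # unregularized weight
print(leaf_weight(G, H, reg_alpha=0.0, reg_lambda=5.0))  # L2: smaller, but non-zero
print(leaf_weight(G, H, reg_alpha=5.0, reg_lambda=0.0))  # L1: exactly zero
```

L2 divides the weight by a larger denominator, so it shrinks smoothly, while L1 subtracts a fixed amount from the gradient sum and can push a weak leaf all the way to zero.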

From this Paper

There you will find the mathematical underpinnings of the XGBoost model by Tianqi Chen et al. A couple of mathematical deviations of this model from Friedman’s classic GBM are:

• Regularized (penalized) parameters (remember that the parameters in boosting are the functions, trees, or linear models): L1 and L2 are available.

```python
import numpy as np
import xgboost as xgb


def xgb_model_run(train_x, train_y, validation_x, validation_y, test_x):
    params = {
        'objective': 'reg:squarederror',
        'eval_metric': 'rmse',
        'eta': 0.001,
        'max_depth': 10,
        'subsample': 0.6,
        'colsample_bytree': 0.6,
        'alpha': 0.001,
        'random_state': 42
    }

    training_data = xgb.DMatrix(train_x, train_y)
    validation_data = xgb.DMatrix(validation_x, validation_y)

    # The last entry of the watchlist is the one used for early stopping
    watchlist = [(training_data, 'train'), (validation_data, 'valid')]

    # Note: with 50 boosting rounds, a patience of 100 rounds can never trigger
    model_xgb = xgb.train(params, training_data, 50, watchlist, maximize=False,
                          early_stopping_rounds=100, verbose_eval=100)

    data_test = xgb.DMatrix(test_x)

    # Predict with the best iteration. iteration_range (XGBoost 1.4+) replaces
    # the older ntree_limit=model_xgb.best_ntree_limit argument.
    # np.expm1 assumes the target was log1p-transformed earlier in the notebook.
    predict_test_xgb = np.expm1(
        model_xgb.predict(data_test,
                          iteration_range=(0, model_xgb.best_iteration + 1)))

    return predict_test_xgb, model_xgb
```

## Training XGB

```python
predictions_test_y_xgb, model_xgb = xgb_model_run(X_train_split, y_train_split, X_validation, y_validation, X_test_original)
print('Completion of XGB Training!!')
```

```
	train-rmse:14.08765	valid-rmse:14.07678
Multiple eval metrics have been passed: 'valid-rmse' will be used for early stopping.
Will train until valid-rmse hasn't improved in 100 rounds.
	train-rmse:13.42470	valid-rmse:13.41331
Completion of XGB Training!!
```

## Hyper-Parameter Tuning in XGBoost

As an example, for the XGBoost model above we could fine-tune five hyperparameters. The ranges of values that we could consider are listed below:

```python
{
    "learning_rate"    : [0.05, 0.10, 0.15, 0.20, 0.25, 0.30],
    "max_depth"        : [3, 4, 5, 6, 8, 10, 12, 15],
    "min_child_weight" : [1, 3, 5, 7],
    "gamma"            : [0.0, 0.1, 0.2, 0.3, 0.4],
    "colsample_bytree" : [0.3, 0.4, 0.5, 0.7]
}
```
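To get a feel for the size of this search space, the sketch below counts the combinations and draws a few random candidates using only the standard library. The variable names and the choice of 20 candidates are illustrative assumptions; in practice each candidate would be passed to the training function and scored on the validation set:

```python
import random

param_grid = {
    "learning_rate": [0.05, 0.10, 0.15, 0.20, 0.25, 0.30],
    "max_depth": [3, 4, 5, 6, 8, 10, 12, 15],
    "min_child_weight": [1, 3, 5, 7],
    "gamma": [0.0, 0.1, 0.2, 0.3, 0.4],
    "colsample_bytree": [0.3, 0.4, 0.5, 0.7],
}

# An exhaustive grid search would have to train this many models.
n_combinations = 1
for candidate_values in param_grid.values():
    n_combinations *= len(candidate_values)
print(n_combinations)

# A randomized search samples a small number of candidates instead.
rng = random.Random(0)
candidates = [
    {name: rng.choice(options) for name, options in param_grid.items()}
    for _ in range(20)
]
print(candidates[0])
```

With several thousand combinations, a randomized search over a few dozen candidates is usually a more practical starting point than a full grid.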

## CatBoost Model Training

CatBoost is another competitor to XGBoost, LightGBM and H2O. The name “CatBoost” comes from two words, “Category” and “Boosting”.

The library works well with multiple categories of data, such as audio, text, and images, including historical data.

The CatBoost library can be used to solve both classification and regression problems. For classification, you can use `CatBoostClassifier`, and for regression, `CatBoostRegressor`.

Yandex is relying heavily on Catboost for ranking, forecasting and recommendations. This model is serving more than 70 million users each month.

“CatBoost is an algorithm for gradient boosting on decision trees. Developed by Yandex researchers and engineers, it is the successor of the MatrixNet algorithm that is widely used within the company for ranking tasks, forecasting and making recommendations. It is universal and can be applied across a wide range of areas and to a variety of problems.”

Overall some of the algorithmic enhancements that Catboost brought:

1. For data with categorical features, the accuracy of CatBoost tends to be better than that of other algorithms.
2. Better over-fitting handling: CatBoost implements ordered boosting, an alternative to the classic boosting algorithm, which is especially significant on small datasets.
3. GPU training: The versions of CatBoost available from pip (`pip install catboost`) and conda (`conda install catboost`) have GPU support out of the box. You just need to request GPU training through the corresponding hyperparameter (shown below).

Devices with compute capability 3.0 and higher are supported in compiled packages. Training on GPU requires NVIDIA driver version 418.xx or higher. The Python version of CatBoost for CUDA compute capability 2.0 can be built from source.

To check Compute Capability of your CUDA-GPU check this NVIDIA Official Link

Further for Training on GPU

The parameters that enable and customize training on GPU are set in the constructors of the following classes: CatBoost (fit), CatBoostClassifier (fit), CatBoostRegressor (fit). `task_type` is the processing unit type to use for training; possible values are "CPU" and "GPU". An example follows:

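```python
from catboost import CatBoostClassifier

# train_data and train_labels are assumed to be defined beforehand
model = CatBoostClassifier(iterations=1000,
                           task_type="GPU",
                           devices='0:1')
model.fit(train_data,
          train_labels,
          verbose=False)
```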

Categorical features handling in CatBoost Algorithm

The below is taken from this paper

Categorical features have a discrete set of values called categories which are not necessarily comparable with each other; thus, such features cannot be used in binary decision trees directly. A common practice for dealing with categorical features is converting them to numbers at preprocessing time, i.e., each category for each example is substituted with one or several numerical values. The most widely used technique, which is usually applied to low-cardinality categorical features, is one-hot encoding: the original feature is removed and a new binary variable is added for each category. One-hot encoding can be done during the preprocessing phase or during training; the latter can be implemented more efficiently in terms of training time and is implemented in CatBoost.
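As a quick illustration of the one-hot encoding described above, here is a preprocessing-time sketch in plain Python. The helper `one_hot` is hypothetical and is not how CatBoost performs the encoding internally during training:

```python
from typing import List, Tuple


def one_hot(column: List[str]) -> Tuple[List[str], List[List[int]]]:
    """Replace one categorical column by one binary column per category."""
    categories = sorted(set(column))
    rows = [[1 if value == category else 0 for category in categories]
            for value in column]
    return categories, rows


categories, encoded = one_hot(["red", "blue", "red"])
print(categories)
print(encoded)
```

Each row contains exactly one 1, so the number of new columns grows with the number of categories, which is why the technique is usually reserved for low-cardinality features.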

For further details on this, read CatBoost’s documentation

Leaf growth algorithm in CatBoost

CatBoost grows a balanced tree. At each level of such a tree, the feature-split pair that leads to the lowest loss (according to a penalty function) is selected and used for all of that level’s nodes. This policy can be changed using the `grow_policy` parameter.
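Because every node on a level shares the same split, a prediction from such a balanced (symmetric) tree reduces to building a bit pattern, one bit per level. The sketch below illustrates the idea in plain Python; the function name and the data layout (`splits`, `leaf_values`) are assumptions for illustration, not CatBoost's internal format:

```python
from typing import List, Tuple


def predict_symmetric_tree(features: List[float],
                           splits: List[Tuple[int, float]],
                           leaf_values: List[float]) -> float:
    """splits holds one (feature_index, threshold) pair per tree level."""
    leaf_index = 0
    for level, (feature_index, threshold) in enumerate(splits):
        # The same condition is applied to every node on this level.
        if features[feature_index] > threshold:
            leaf_index |= 1 << level
    return leaf_values[leaf_index]


splits = [(0, 0.5), (1, 2.0)]            # depth 2, so 2 ** 2 = 4 leaves
leaf_values = [10.0, 20.0, 30.0, 40.0]
print(predict_symmetric_tree([0.9, 3.0], splits, leaf_values))
```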

## CatBoost Training Parameters

Let’s look at the common parameters in CatBoost:

loss_function (alias objective): The metric used for training, for example root mean squared error for regression and logloss for classification.

eval_metric — Metric used for detecting over-fitting.

iterations: The maximum number of trees to be built; defaults to 1000. Its aliases are `num_boost_round`, `n_estimators`, and `num_trees`. Some notes on the total number of trees: in bagging and random forests, averaging independently grown trees makes it very difficult to overfit with too many trees. GBMs behave differently, because each tree is grown in sequence to fix the previous trees’ mistakes. For example, in regression, GBMs will chase residuals for as long as we allow them to. Depending on the values of the other hyperparameters, GBMs often require many trees (sometimes many thousands). Because too many trees can easily overfit, we must use cross-validation to find the number of trees that minimizes the loss function of interest.

learning_rate (alias eta): The learning rate, which determines how fast or slow the model will learn. The default usually falls between 0.01 and 0.03.

random_seed alias random_state — The random seed used for training.

l2_leaf_reg alias reg_lambda — Coefficient at the L2 regularization term of the cost function. The default is 3.0.

bootstrap_type: Determines the sampling method for the weights of the objects, e.g. Bayesian, Bernoulli, MVS, and Poisson.

depth: The depth of the tree.

grow_policy — Determines how the greedy search algorithm will be applied. It can be either SymmetricTree, Depthwise, or Lossguide.

SymmetricTree is the default. In SymmetricTree, the tree is built level by level until the specified depth is attained. At every step, all leaves from the last tree level are split with the same condition. When Depthwise is chosen, the tree is also built level by level until the specified depth is achieved, but on each step every non-terminal leaf from the last tree level is split using the condition that gives that leaf the best loss improvement. In Lossguide, the tree is built leaf by leaf until the specified number of leaves is attained. On each step, the non-terminal leaf with the best loss improvement is split.

min_data_in_leaf alias min_child_samples — This is the minimum number of training samples in a leaf. This parameter is only used with the Lossguide and Depthwise growing policies.

max_leaves alias num_leaves — This parameter is used only with the Lossguide policy and determines the number of leaves in the tree.

ignored_features — Indicates the features that should be ignored in the training process.

nan_mode — The method for dealing with missing values. The options are Forbidden, Min, and Max. The default is Min. When Forbidden is used, the presence of missing values leads to errors. With Min, the missing values are taken as the minimum values for that feature. In Max, the missing values are treated as the maximum value for the feature.
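The three modes can be mimicked with a small preprocessing sketch. This is only an approximation of the behavior described above: the helper `apply_nan_mode` is hypothetical, and using negative or positive infinity as the stand-in value is an assumption chosen so that missing entries sort below or above every observed value:

```python
import math
from typing import List


def apply_nan_mode(column: List[float], nan_mode: str = "Min") -> List[float]:
    """Replace NaN entries according to a CatBoost-like nan_mode."""
    has_missing = any(math.isnan(v) for v in column)
    if not has_missing:
        return list(column)
    if nan_mode == "Forbidden":
        raise ValueError("missing values are not allowed with nan_mode='Forbidden'")
    if nan_mode == "Min":
        fill = float("-inf")   # sorts below every observed value
    elif nan_mode == "Max":
        fill = float("inf")    # sorts above every observed value
    else:
        raise ValueError(f"unknown nan_mode: {nan_mode}")
    return [fill if math.isnan(v) else v for v in column]


feature = [1.0, float("nan"), 3.0]
print(apply_nan_mode(feature, "Min"))
print(apply_nan_mode(feature, "Max"))
```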

leaf_estimation_method: The method used to calculate values in leaves. By default, classification uses ten Newton iterations, regression with quantile or MAE loss uses one Exact iteration, and multi-classification uses one Newton iteration.

leaf_estimation_backtracking — The type of backtracking to be used during gradient descent. The default is `AnyImprovement`. `AnyImprovement` decreases the descent step, up to where the loss function value is smaller than it was in the last iteration. Armijo reduces the descent step until the Armijo condition is met.

boosting_type — The boosting scheme. It can be plain for the classic gradient boosting scheme, or ordered, which offers better quality on smaller datasets.

score_function — The score type used to select the next split during tree construction. `Cosine` is the default option. The other available options are `L2`, `NewtonL2`, and `NewtonCosine`.

early_stopping_rounds: An integer number of rounds. When set, the over-fitting detector type becomes `Iter` and training stops once the evaluation metric has not improved for that many iterations after its best value.
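The logic of an iteration-based over-fitting detector can be sketched as a simple patience counter. The code below is a plain-Python illustration with a hypothetical function name and made-up validation losses, not CatBoost's implementation:

```python
from typing import List, Tuple


def run_with_patience(validation_losses: List[float],
                      patience: int) -> Tuple[int, int]:
    """Return (best_iteration, last_iteration_run) for a patience-based stop."""
    best_loss = float("inf")
    best_iteration = -1
    for iteration, loss in enumerate(validation_losses):
        if loss < best_loss:
            best_loss, best_iteration = loss, iteration
        elif iteration - best_iteration >= patience:
            # No improvement for `patience` iterations: stop training here.
            return best_iteration, iteration
    return best_iteration, len(validation_losses) - 1


losses = [1.0, 0.8, 0.7, 0.72, 0.71, 0.73, 0.69]
print(run_with_patience(losses, patience=3))
print(run_with_patience(losses, patience=5))
```

With a patience of 3 the run stops before it ever sees the later, better loss of 0.69, which shows the trade-off: a short patience saves time but can stop too early.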

classes_count: The number of classes for multi-classification problems.

task_type: Whether you are using a CPU or GPU. CPU is the default.

devices: The IDs of the GPU devices to be used for training.

cat_features: The array with the categorical columns.

text_features: Used to declare text columns in classification problems.

```python
import numpy as np
from catboost import CatBoostRegressor

# Now Catboost model training
model_catboost = CatBoostRegressor(iterations=500,
                                   learning_rate=0.01,
                                   depth=10,
                                   eval_metric='RMSE',
                                   random_seed=42,
                                   bagging_temperature=0.2,
                                   od_type='Iter',
                                   metric_period=50,
                                   od_wait=20
                                   )

model_catboost.fit(X_train_split, y_train_split,
                   eval_set=(X_validation, y_validation),
                   use_best_model=True,
                   verbose=50
                   )

predictions_test_y_catboost = np.expm1(model_catboost.predict(X_test_original))
```

## Creating Output file for Submission

```python
import pandas as pd

submission_final = pd.read_csv('../input/santander-value-prediction-challenge/sample_submission.csv')

submission_lgb = pd.DataFrame()
submission_lgb['target'] = predictions_test_y_light_gbm

submission_xgb = pd.DataFrame()
submission_xgb['target'] = predictions_test_y_xgb

submission_catboost = pd.DataFrame()
submission_catboost['target'] = predictions_test_y_catboost

# Weighted blend of the three models (the weights sum to 1.0)
submission_final['target'] = (submission_lgb['target'] * 0.5
                              + submission_xgb['target'] * 0.3
                              + submission_catboost['target'] * 0.2)

submission_final.head()

submission_final.to_csv('submission_combined_lgb_xgb_catboost.csv', index=False)
```

## Analytics Vidhya

Written by

## Rohan Paul

#### DataScience | ML | 2x Kaggle Expert. Ex Fullstack Engineer and Ex International Financial Analyst. https://www.linkedin.com/in/rohan-paul-b27285129/ 