Run through LightGBM Fast Training Techniques

Summer Hu · Published in The Startup · Jan 14, 2021

A complete explanation of LightGBM - the fastest gradient boosting model

(Cover image: Puma Punku, Bolivia)

LightGBM is a Gradient Boosting Decision Tree (GBDT) model developed by Microsoft in 2016. Compared with other GBDT models, LightGBM is best known for its faster training and strong accuracy.

There is no fundamental structural difference between LightGBM and a general gradient boosting decision tree model, but the following techniques make LightGBM faster to train.

  1. Gradient-based One-Side Sampling(GOSS)
  2. Histogram Based Best Value Search in Tree Node Splitting
  3. Optimal Split for Categorical Features
  4. Exclusive Feature Bundling
  5. Leaf-wise Tree Growth Strategy
  6. Parallel optimization

1. Gradient-based One-Side Sampling(GOSS)

Classic tree-based gradient boosting (GBDT) training is an iterative process: each new tree is trained to fit the prediction errors of the previous trees on all training instances. (The prediction errors are the loss-function gradients evaluated on all training instances.)

So, by default, GBDT uses all training instances to train each tree in its ensemble.

To address this, LightGBM introduces GOSS, which trains each ensemble tree on only part of the training set. The intuition behind GOSS is:

  1. A training instance with a large gradient has a large current prediction error and should be a primary target for the next ensemble tree to fit.
  2. A training instance with a small gradient has a relatively small current prediction error, so the next ensemble tree does not need to worry about it much, and we can discard it with some probability.

Generally speaking, the main idea of GOSS is: before training the next ensemble tree, keep the training instances with relatively large gradients and discard some of the training instances with small gradients.

The picture below illustrates the GOSS algorithm.

All training instances are sorted by gradient magnitude; a represents the sampling percentage for large-gradient instances and b represents the sampling percentage for small-gradient instances. To keep the estimated split gain unbiased, the sampled small-gradient instances are up-weighted by the constant (1 - a)/b.

By using GOSS we reduce the size of the training set used for the next ensemble tree, which makes training the new tree faster.
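As a concrete illustration, below is a minimal NumPy sketch of GOSS-style sampling (not LightGBM's internal implementation). The parameters top_rate and other_rate play the roles of a and b above, and the sampled small-gradient instances are up-weighted by (1 - a)/b as described.

```python
import numpy as np

def goss_sample(gradients, top_rate=0.2, other_rate=0.1, seed=0):
    """GOSS-style sampling sketch.

    Keep the top `top_rate` fraction of instances by |gradient|,
    randomly sample `other_rate` of the remaining instances, and
    up-weight the sampled small-gradient instances by (1 - a) / b.
    """
    rng = np.random.default_rng(seed)
    n = len(gradients)

    # Indices sorted by |gradient|, largest first
    order = np.argsort(-np.abs(gradients))
    n_top = int(top_rate * n)
    n_other = int(other_rate * n)

    top_idx = order[:n_top]                 # always keep large-gradient instances
    rest_idx = order[n_top:]                # candidates for random sampling
    sampled_rest = rng.choice(rest_idx, size=n_other, replace=False)

    used_idx = np.concatenate([top_idx, sampled_rest])

    # Up-weight the sampled small-gradient instances to keep the
    # estimated split gain approximately unbiased
    weights = np.ones(len(used_idx))
    weights[n_top:] *= (1.0 - top_rate) / other_rate
    return used_idx, weights

# Example: sample from 1000 synthetic gradients
g = np.random.default_rng(1).normal(size=1000)
idx, w = goss_sample(g)
print(len(idx), w.min(), w.max())   # 300 instances kept, weights 1.0 and 8.0
```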

2. Histogram Based Tree Node Splitting

When searching for the best feature value to split a tree node, LightGBM uses a histogram of the feature values and tries only the histogram bin boundaries instead of all possible feature values, which reduces the time and computation needed to find the best splitting value. (LightGBM's splitting criterion is to reduce the gradient variance from the parent node to its children.)

For example, for the age feature below, the histogram discretizes the feature values into range bins, so we can try splitting conditions like Age⩽30, Age⩽40, ..., Age⩽100 instead of trying every possible age value like Age⩽31, Age⩽32, etc.

Replacing the original values with bins acts as extra regularization, and the number of bins controls its strength: the fewer the bins, the stronger the regularization and the higher the risk of underfitting.

Also, in the tree splitting scenario, the histogram of a given feature is additive:

Parent Node Histogram = Left Child Histogram + Right Child Histogram

So when computing the children's histograms, we only need to build the histogram of one child (choosing the smaller child); the other child's histogram is the parent histogram minus the computed one.
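To make the idea concrete, here is a toy NumPy sketch of histogram-based split search plus the subtraction trick. The gain formula is a simplified variance-reduction style score, not LightGBM's exact criterion, and the function names are illustrative.

```python
import numpy as np

def build_histogram(bin_idx, gradients, n_bins):
    """Accumulate the gradient sum and instance count per feature bin."""
    grad_hist = np.zeros(n_bins)
    count_hist = np.zeros(n_bins)
    np.add.at(grad_hist, bin_idx, gradients)
    np.add.at(count_hist, bin_idx, 1)
    return grad_hist, count_hist

def best_split_from_histogram(grad_hist, count_hist):
    """Scan bin boundaries and pick the split with the largest gain
    (simplified score: G_L^2/n_L + G_R^2/n_R - G^2/n)."""
    g_total, n_total = grad_hist.sum(), count_hist.sum()
    best_gain, best_bin = -np.inf, None
    g_left = n_left = 0.0
    for b in range(len(grad_hist) - 1):          # split after bin b
        g_left += grad_hist[b]
        n_left += count_hist[b]
        g_right, n_right = g_total - g_left, n_total - n_left
        if n_left == 0 or n_right == 0:
            continue
        gain = g_left**2 / n_left + g_right**2 / n_right - g_total**2 / n_total
        if gain > best_gain:
            best_gain, best_bin = gain, b
    return best_bin, best_gain

# Toy data: 1000 instances, age values discretized into 10 bins
rng = np.random.default_rng(0)
age_bins = rng.integers(0, 10, size=1000)
grads = rng.normal(size=1000)

parent_g, parent_n = build_histogram(age_bins, grads, n_bins=10)
print(best_split_from_histogram(parent_g, parent_n))

# Histogram subtraction: build only the (smaller) left child's histogram,
# then derive the right child's histogram from the parent's
left_mask = age_bins <= 3
left_g, left_n = build_histogram(age_bins[left_mask], grads[left_mask], n_bins=10)
right_g, right_n = parent_g - left_g, parent_n - left_n
```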

3. Optimal Split for Categorical Features

Normally, when dealing with categorical features in tree node splitting, a one-vs-rest splitting rule is used; for example, the splitting condition is "Weather = Sunny" vs "all other weather (Rainy, Cloudy, Snowy, etc.)". The problems with this one-vs-rest strategy are:

  1. It tends to produce an unbalanced allocation of data points between the children (for example, the left child receives far more data points than the right child), so the tree needs to grow very deep to achieve good accuracy.
  2. Because the tree must grow very deep, many node splits are required, so tree building becomes very inefficient.

Motivated by these problems, LightGBM uses a many-vs-many strategy, as follows.

For a given categorical feature

  1. For each category of the feature, calculate the mean label value Sum(y)/Count(y).
  2. Sort all categories by their mean values (as indicated in the picture below).
  3. Enumerate split values from the lowest mean value to the largest to find the best splitting value. The split value groups all categories into two parts (categories with mean value less than or greater than the splitting value), and that is the node splitting condition (see the sketch after this list).
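The following is a rough NumPy sketch of this sorted-category (many-vs-many) split search. It uses the Sum(y)/Count(y) ordering described above; the gain score is again a simplified placeholder rather than LightGBM's actual criterion, and all names are illustrative.

```python
import numpy as np

def best_categorical_split(categories, y):
    """Many-vs-many categorical split: sort categories by mean target value,
    then scan the sorted order as if it were a numeric feature."""
    cats = np.unique(categories)
    # 1. Mean target value Sum(y) / Count(y) per category
    means = {c: y[categories == c].mean() for c in cats}
    # 2. Sort categories by their mean value
    sorted_cats = sorted(cats, key=lambda c: means[c])

    best_gain, best_left_group = -np.inf, None
    total_sum, total_cnt = y.sum(), len(y)
    left_sum = left_cnt = 0.0
    # 3. Enumerate split points along the sorted category order
    for i in range(len(sorted_cats) - 1):
        in_cat = categories == sorted_cats[i]
        left_sum += y[in_cat].sum()
        left_cnt += in_cat.sum()
        right_sum, right_cnt = total_sum - left_sum, total_cnt - left_cnt
        gain = (left_sum**2 / left_cnt + right_sum**2 / right_cnt
                - total_sum**2 / total_cnt)
        if gain > best_gain:
            best_gain = gain
            best_left_group = set(sorted_cats[: i + 1])   # categories sent left
    return best_left_group, best_gain

weather = np.array(["Sunny", "Rainy", "Cloudy", "Snowy"] * 50)
y = np.random.default_rng(2).normal(size=200) + (weather == "Sunny")
print(best_categorical_split(weather, y))   # Sunny tends to end up alone on one side
```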

4. Exclusive Feature Bundling

EFB aims to reduce the number of features by merging mutually exclusive features, i.e. features that rarely take non-zero values at the same time.

LightGBM implements this with the following two algorithms:

  1. Identify mutually exclusive feature bundles from the training set.
  2. Merge the features in each bundle and assign a single value to the bundle.

Below is an EFB example showing the result of feature merging.

In the example, the maximum conflict count is K=2, and the original 5 features are reduced to 3 features by the EFB algorithms.

Source: 04–8: Ensemble Learning — LightGBM
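For illustration, here is a simplified NumPy sketch of the two EFB steps under a maximum conflict count K: greedy bundling by conflict counts, then merging each bundle by offsetting value ranges. It follows the spirit of the EFB algorithms, not LightGBM's actual implementation, and the helper names are illustrative.

```python
import numpy as np

def greedy_bundle(X, max_conflict=2):
    """Greedily add each feature to the first bundle where it is
    simultaneously non-zero with the bundle in at most `max_conflict` rows."""
    nonzero = X != 0
    bundles = []                      # each bundle is a list of feature indices
    for f in range(X.shape[1]):
        placed = False
        for bundle in bundles:
            # Conflicts = rows where feature f and the bundle are both non-zero
            conflicts = np.sum(nonzero[:, f] & np.any(nonzero[:, bundle], axis=1))
            if conflicts <= max_conflict:
                bundle.append(f)
                placed = True
                break
        if not placed:
            bundles.append([f])
    return bundles

def merge_bundle(X, bundle):
    """Merge the features of one bundle into a single column by adding an
    offset per feature, so the original value ranges stay distinguishable."""
    merged = np.zeros(X.shape[0])
    offset = 0.0
    for f in bundle:
        col = X[:, f]
        merged = np.where(col != 0, col + offset, merged)  # conflicts are overwritten
        offset += col.max()           # the next feature starts above this range
    return merged

X = np.array([[1, 0, 0, 2, 0],
              [0, 2, 0, 0, 1],
              [0, 0, 3, 1, 0],
              [2, 0, 0, 0, 2]], dtype=float)
bundles = greedy_bundle(X, max_conflict=2)
X_merged = np.column_stack([merge_bundle(X, b) for b in bundles])
print(bundles, X_merged.shape)        # fewer merged columns than original features
```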

5. Leaf-wise Tree Growth Strategy

LightGBM abandons the level-wise tree growth strategy used by most GBDT tools and instead uses a leaf-wise algorithm with a depth limit.

With the leaf-wise strategy, each time we look at all current leaves, find the leaf with the highest split gain, split it, and repeat.

In the tree growth process shown above, the green leaf node is the one with the highest split gain, so it is split first; the leaves are then re-evaluated to find the next green leaf node.

The benefit of leaf-wise growth is that every node split brings the highest possible gain to the tree, so it grows the tree more efficiently than level-wise growth. However, we need to add a tree-depth limit and some other constraints to avoid overfitting.
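In the lightgbm Python package, leaf-wise growth is controlled mainly through num_leaves, with max_depth and min_data_in_leaf as the overfitting guards mentioned above. A minimal usage sketch (the parameter values are only illustrative, not tuned recommendations):

```python
import numpy as np
import lightgbm as lgb

# Toy regression data, just for illustration
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = 2 * X[:, 0] + rng.normal(scale=0.1, size=1000)

train_set = lgb.Dataset(X, label=y)
params = {
    "objective": "regression",
    "num_leaves": 31,         # leaf-wise growth: cap the total number of leaves
    "max_depth": 6,           # depth limit to keep leaf-wise trees from overfitting
    "min_data_in_leaf": 20,   # each leaf must cover enough instances
    "max_bin": 255,           # histogram granularity (fewer bins = more regularization)
    "learning_rate": 0.1,
}
booster = lgb.train(params, train_set, num_boost_round=100)
```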

6. Parallel optimization

To handle very large datasets, LightGBM introduces distributed processing to compute feature histograms and the best splitting feature value in parallel.

LightGBM supports two parallel strategies: feature parallelism and data parallelism.

Feature Parallel Algorithm

The training data is split vertically (by column/feature) and allocated to different worker machines. Each worker computes local histograms and the local best split for its allocated features, and then the globally best split is chosen from all workers' outputs.

Source: Algorithm combing LightGBM

Data Parallel Algorithm

The training data is split horizontally (by row) and allocated to different worker machines. Each worker computes local histograms for all features based on its allocated training subset, and then the full feature histograms are merged from all workers' local histograms.

Source: Algorithm combing LightGBM

LightGBM also has a further optimization of the data parallel algorithm: each worker chooses its top-K best splitting features locally, and then the workers vote for the top feature(s) globally.

Once the top feature(s) are determined, we only need to merge the histograms of the top feature(s) from all workers' local histograms.
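In the lightgbm package these strategies are selected with the tree_learner parameter (serial, feature, data, or voting). A hedged configuration sketch; a real distributed run additionally needs machine-list and network settings, which are omitted here:

```python
# Parallel learning is chosen via tree_learner; "voting" enables the
# voting-based optimization of data parallelism described above.
params = {
    "objective": "binary",
    "tree_learner": "data",   # one of "serial", "feature", "data", "voting"
    "num_machines": 4,        # number of worker machines in the cluster
    # a real distributed run also needs machine list / port configuration
}
```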

Summary

All of the above innovative techniques aim to make LightGBM faster to train, and together they make LightGBM great at:

  1. Fast training efficiency
  2. Low memory usage
  3. Great accuracy
  4. Parallel learning
  5. Ability to process large-scale data

REFERENCE

  1. LightGBM: A Highly Efficient Gradient Boosting Decision Tree
  2. Microsoft/LightGBM
  3. Algorithm combing LightGBM
  4. LightGBM(lgb)详解 (detailed explanation of LightGBM, in Chinese)
  5. 04–8: Ensemble Learning — LightGBM
