xgboost’s New Fast Histogram (tree_method = hist)
Laurae: This post is about the new feature of xgboost: the histogram tree grow method. Currently, it provides error in R but works in Python (?). You can find the pull request #1940 here. I’ll get benchmarks on my customized Bosch data set when the R package works.
Summary from PR author
Add a brand new updater to support histogram-based algorithm, which buckets continuous features into discrete bins to speed up training. To use it, set tree_method=hist
to configuration.
Support multiple tree growing strategies. For now, two policies are supported:
grow_policy=depthwise
(default): favor splitting at nodes closest to the root, i.e. grow depth-wise.grow_policy=lossguide
: favor splitting at nodes with highest loss change
Improve single-threaded performance
- Unroll critical loops
- Introduce specialized code for dense data (i.e. no missing values)
Additional training parameters: max_leaves
, max_bin
, grow_policy
, verbose
Extra comments by myself
I put this here for more details (could be useful to know what to add exactly for R/Python function descriptions, along with the conditions on the new parameters).
The new/edited fields:
- Histogram method: must set
tree_method
tohist
(3) - the tree updater becomesgrow_fast_histmaker
- Verbose flag: I guess, for the command line? (it prints timing and some other details, see the end for details) — to ignore for R/Python packages?
- Maximum leaves: using
max_leaves
, with no limit (setting to0
is same as infinite, automatically limited bymax_depth
) - Histogram binning: using
max_bin
with a value between2
to256
(restricted to256
maximum for a 8-bit representation optimization) - Growing policy: using
grow_policy
, you have the choice betweendepthwise
(split using depth) andlossguide
(split using loss), see below for comparison between Depth / Loss:
inline static bool depth_wise(ExpandEntry lhs, ExpandEntry rhs) {
return lhs.depth > rhs.depth;
}
inline static bool loss_guide(ExpandEntry lhs, ExpandEntry rhs) {
return lhs.loss_chg < rhs.loss_chg;
}
Additional designs:
max_depth
can be unlimited whilemax_leaves
is limited only if not using a depth-wise growing policymax_leaves
can be unlimited whilemax_depth
is limited- There are performance optimizations depending on 0-indexed dense data, 1-indexed dense data, or sparse data
gbtree
does a batch of 8 instead of a batch of 1 to make predictions on single instances- Pre-initialize some variables
DMLC_DECLARE_FIELD(tree_method).set_default(0)
.add_enum("auto", 0)
.add_enum("approx", 1)
.add_enum("exact", 2)
.add_enum("hist", 3)
.describe("Choice of tree construction method.");... DMLC_DECLARE_FIELD(verbose)
.set_lower_bound(0)
.set_default(0)
.describe(
"Setting verbose flag with a positive value causes the updater "
"to print out *detailed* list of tasks and their runtime");... DMLC_DECLARE_FIELD(max_leaves).set_lower_bound(0).set_default(0).describe(
"Maximum number of leaves; 0 indicates no limit.");
DMLC_DECLARE_FIELD(max_bin).set_lower_bound(2).set_default(256).describe(
"if using histogram-based algorithm, maximum number of bins per feature");
DMLC_DECLARE_FIELD(grow_policy)
.set_default(kDepthWise)
.add_enum("depthwise", kDepthWise)
.add_enum("lossguide", kLossGuide)
.describe(
"Tree growing policy. 0: favor splitting at nodes closest to the node, "
"i.e. grow depth-wise. 1: favor splitting at nodes with highest loss "
"change. (cf. LightGBM)");
Logging details:
- Initialize data
- Setup nodes
- Create histogram
- Evaluate splits
- Apply splits
- Sum of all the time
LOG(INFO) << "\nInitData: "
<< std::fixed << std::setw(4) << std::setprecision(2) << init_data
<< " (" << std::fixed << std::setw(5) << std::setprecision(2)
<< init_data/total_time*100 << "%)\n"
<< "InitNewNode: "
<< std::fixed << std::setw(4) << std::setprecision(2) << init_new_node
<< " (" << std::fixed << std::setw(5) << std::setprecision(2)
<< init_new_node/total_time*100 << "%)\n"
<< "MakeHist: "
<< std::fixed << std::setw(4) << std::setprecision(2) << make_hist
<< " (" << std::fixed << std::setw(5) << std::setprecision(2)
<< make_hist/total_time*100 << "%)\n"
<< "EvaluateSplit: "
<< std::fixed << std::setw(4) << std::setprecision(2) << evaluate_split
<< " (" << std::fixed << std::setw(5) << std::setprecision(2)
<< evaluate_split/total_time*100 << "%)\n"
<< "ApplySplit: "
<< std::fixed << std::setw(4) << std::setprecision(2) << apply_split
<< " (" << std::fixed << std::setw(5) << std::setprecision(2)
<< apply_split/total_time*100 << "%)\n"
<< "========================================\n"
<< "Total: "
<< std::fixed << std::setw(4) << std::setprecision(2) << total_time;
Does it work in R?
It does not (yet).
> temp_model <- xgb.train(data = data,
+ params = list(nthread = 12,
+ max_depth = 5,
+ eta = 1.0,
+ subsample = 1.0,
+ colsample_bytree = 1.0,
+ tree_method = ‘hist’,
+ grow_policy = ‘lossguide’,
+ booster = “gbtree”,
+ verbose = 1),
+ objective = “binary:logistic”,
+ nrounds = 5,
+ verbose = FALSE,
+ callbacks = list(cb.evaluation.log()))
[13:48:15] Tree method is selected to be ‘hist’, which uses histogram aggregation for faster training. Using default sequence of updaters: grow_fast_histmaker,prune
Error in xgb.iter.update(bst$handle, dtrain, iteration — 1, obj) :
[13:48:15] amalgamation/../src/tree/tree_updater.cc:18: Unknown tree updater grow_fast_histmaker
Does it work in Python?
It currently passes the test (Travis?), so it is supposedly working. See this test:
import xgboost as xgb
import testing as tm
import numpy as np
import unittestrng = np.random.RandomState(1994)class TestFastHist(unittest.TestCase):
def test_fast_hist(self):
tm._skip_if_no_sklearn()
from sklearn.datasets import load_digits
from sklearn.cross_validation import train_test_splitdigits = load_digits(2)
X = digits[‘data’]
y = digits[‘target’]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
dtrain = xgb.DMatrix(X_train, y_train)
dtest = xgb.DMatrix(X_test, y_test)param = {‘objective’: ‘binary:logistic’,
‘tree_method’: ‘hist’,
‘grow_policy’: ‘depthwise’,
‘max_depth’: 3,
‘eval_metric’: ‘auc’}
res = {}
bst = xgb.train(param, dtrain, 10, [(dtrain, ‘train’), (dtest, ‘test’)],
evals_result=res)
assert self.non_decreasing(res[‘train’][‘auc’])
assert self.non_decreasing(res[‘test’][‘auc’])param2 = {‘objective’: ‘binary:logistic’,
‘tree_method’: ‘hist’,
‘grow_policy’: ‘lossguide’,
‘max_depth’: 0,
‘max_leaves’: 8,
‘eval_metric’: ‘auc’}
res = {}
bst = xgb.train(param2, dtrain, 10, [(dtrain, ‘train’), (dtest, ‘test’)],
evals_result=res)
assert self.non_decreasing(res[‘train’][‘auc’])
assert self.non_decreasing(res[‘test’][‘auc’])param3 = {‘objective’: ‘binary:logistic’,
‘tree_method’: ‘hist’,
‘grow_policy’: ‘lossguide’,
‘max_depth’: 0,
‘max_leaves’: 8,
‘max_bin’: 16,
‘eval_metric’: ‘auc’}
res = {}
bst = xgb.train(param3, dtrain, 10, [(dtrain, ‘train’), (dtest, ‘test’)],
evals_result=res)
assert self.non_decreasing(res[‘train’][‘auc’])
assert self.non_decreasing(res[‘test’][‘auc’])
def non_decreasing(self, L):
return all(x<=y for x, y in zip(L, L[1:]))