xgboost’s New Fast Histogram (tree_method = hist)

Published in

Data Science & Design

4 min readJan 7, 2017

Laurae: This post is about the new feature of xgboost: the histogram tree grow method. Currently, it provides error in R but works in Python (?). You can find the pull request #1940 here. I’ll get benchmarks on my customized Bosch data set when the R package works.

Summary from PR author

Add a brand new updater to support histogram-based algorithm, which buckets continuous features into discrete bins to speed up training. To use it, set tree_method=hist to configuration.

Support multiple tree growing strategies. For now, two policies are supported:

grow_policy=depthwise (default): favor splitting at nodes closest to the root, i.e. grow depth-wise.
grow_policy=lossguide: favor splitting at nodes with highest loss change

Improve single-threaded performance

Unroll critical loops
Introduce specialized code for dense data (i.e. no missing values)

Additional training parameters: max_leaves, max_bin, grow_policy, verbose

Extra comments by myself

I put this here for more details (could be useful to know what to add exactly for R/Python function descriptions, along with the conditions on the new parameters).

The new/edited fields:

Histogram method: must set tree_method to hist (3) - the tree updater becomes grow_fast_histmaker
Verbose flag: I guess, for the command line? (it prints timing and some other details, see the end for details) — to ignore for R/Python packages?
Maximum leaves: using max_leaves, with no limit (setting to 0 is same as infinite, automatically limited by max_depth)
Histogram binning: using max_bin with a value between 2 to 256 (restricted to 256 maximum for a 8-bit representation optimization)
Growing policy: using grow_policy, you have the choice between depthwise (split using depth) and lossguide (split using loss), see below for comparison between Depth / Loss:

inline static bool depth_wise(ExpandEntry lhs, ExpandEntry rhs) {
      return lhs.depth > rhs.depth;
    }
    inline static bool loss_guide(ExpandEntry lhs, ExpandEntry rhs) {
      return lhs.loss_chg < rhs.loss_chg;
    }

Additional designs:

max_depth can be unlimited while max_leaves is limited only if not using a depth-wise growing policy
max_leaves can be unlimited while max_depth is limited
There are performance optimizations depending on 0-indexed dense data, 1-indexed dense data, or sparse data
gbtree does a batch of 8 instead of a batch of 1 to make predictions on single instances
Pre-initialize some variables

DMLC_DECLARE_FIELD(tree_method).set_default(0)
        .add_enum("auto", 0)
        .add_enum("approx", 1)
        .add_enum("exact", 2)
        .add_enum("hist", 3)
        .describe("Choice of tree construction method.");...    DMLC_DECLARE_FIELD(verbose)
        .set_lower_bound(0)
        .set_default(0)
        .describe(
            "Setting verbose flag with a positive value causes the updater "
            "to print out *detailed* list of tasks and their runtime");...    DMLC_DECLARE_FIELD(max_leaves).set_lower_bound(0).set_default(0).describe(
        "Maximum number of leaves; 0 indicates no limit.");
    DMLC_DECLARE_FIELD(max_bin).set_lower_bound(2).set_default(256).describe(
        "if using histogram-based algorithm, maximum number of bins per feature");
    DMLC_DECLARE_FIELD(grow_policy)
        .set_default(kDepthWise)
        .add_enum("depthwise", kDepthWise)
        .add_enum("lossguide", kLossGuide)
        .describe(
            "Tree growing policy. 0: favor splitting at nodes closest to the node, "
            "i.e. grow depth-wise. 1: favor splitting at nodes with highest loss "
            "change. (cf. LightGBM)");

Logging details:

Initialize data
Setup nodes
Create histogram
Evaluate splits
Apply splits
Sum of all the time

LOG(INFO) << "\nInitData:          "
                  << std::fixed << std::setw(4) << std::setprecision(2) << init_data
                  << " (" << std::fixed << std::setw(5) << std::setprecision(2)
                  << init_data/total_time*100 << "%)\n"
                  << "InitNewNode:       "
                  << std::fixed << std::setw(4) << std::setprecision(2) << init_new_node
                  << " (" << std::fixed << std::setw(5) << std::setprecision(2)
                  << init_new_node/total_time*100 << "%)\n"
                  << "MakeHist:          "
                  << std::fixed << std::setw(4) << std::setprecision(2) << make_hist
                  << " (" << std::fixed << std::setw(5) << std::setprecision(2)
                  << make_hist/total_time*100 << "%)\n"
                  << "EvaluateSplit:     "
                  << std::fixed << std::setw(4) << std::setprecision(2) << evaluate_split
                  << " (" << std::fixed << std::setw(5) << std::setprecision(2)
                  << evaluate_split/total_time*100 << "%)\n"
                  << "ApplySplit:        "
                  << std::fixed << std::setw(4) << std::setprecision(2) << apply_split
                  << " (" << std::fixed << std::setw(5) << std::setprecision(2)
                  << apply_split/total_time*100 << "%)\n"
                  << "========================================\n"
                  << "Total:             "
                  << std::fixed << std::setw(4) << std::setprecision(2) << total_time;

Does it work in R?

It does not (yet).

> temp_model <- xgb.train(data = data,
+ params = list(nthread = 12,
+ max_depth = 5,
+ eta = 1.0,
+ subsample = 1.0,
+ colsample_bytree = 1.0,
+ tree_method = ‘hist’,
+ grow_policy = ‘lossguide’,
+ booster = “gbtree”,
+ verbose = 1),
+ objective = “binary:logistic”,
+ nrounds = 5,
+ verbose = FALSE,
+ callbacks = list(cb.evaluation.log()))
[13:48:15] Tree method is selected to be ‘hist’, which uses histogram aggregation for faster training. Using default sequence of updaters: grow_fast_histmaker,prune
Error in xgb.iter.update(bst$handle, dtrain, iteration — 1, obj) : 
 [13:48:15] amalgamation/../src/tree/tree_updater.cc:18: Unknown tree updater grow_fast_histmaker

Does it work in Python?

It currently passes the test (Travis?), so it is supposedly working. See this test:

import xgboost as xgb
import testing as tm
import numpy as np
import unittestrng = np.random.RandomState(1994)class TestFastHist(unittest.TestCase):
 def test_fast_hist(self):
 tm._skip_if_no_sklearn()
 from sklearn.datasets import load_digits
 from sklearn.cross_validation import train_test_splitdigits = load_digits(2)
 X = digits[‘data’]
 y = digits[‘target’]
 X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
 dtrain = xgb.DMatrix(X_train, y_train)
 dtest = xgb.DMatrix(X_test, y_test)param = {‘objective’: ‘binary:logistic’,
 ‘tree_method’: ‘hist’,
 ‘grow_policy’: ‘depthwise’,
 ‘max_depth’: 3,
 ‘eval_metric’: ‘auc’}
 res = {}
 bst = xgb.train(param, dtrain, 10, [(dtrain, ‘train’), (dtest, ‘test’)],
 evals_result=res)
 assert self.non_decreasing(res[‘train’][‘auc’])
 assert self.non_decreasing(res[‘test’][‘auc’])param2 = {‘objective’: ‘binary:logistic’,
 ‘tree_method’: ‘hist’,
 ‘grow_policy’: ‘lossguide’,
 ‘max_depth’: 0,
 ‘max_leaves’: 8,
 ‘eval_metric’: ‘auc’}
 res = {}
 bst = xgb.train(param2, dtrain, 10, [(dtrain, ‘train’), (dtest, ‘test’)],
 evals_result=res)
 assert self.non_decreasing(res[‘train’][‘auc’])
 assert self.non_decreasing(res[‘test’][‘auc’])param3 = {‘objective’: ‘binary:logistic’,
 ‘tree_method’: ‘hist’,
 ‘grow_policy’: ‘lossguide’,
 ‘max_depth’: 0,
 ‘max_leaves’: 8,
 ‘max_bin’: 16,
 ‘eval_metric’: ‘auc’}
 res = {}
 bst = xgb.train(param3, dtrain, 10, [(dtrain, ‘train’), (dtest, ‘test’)],
 evals_result=res)
 assert self.non_decreasing(res[‘train’][‘auc’])
 assert self.non_decreasing(res[‘test’][‘auc’])
 
 def non_decreasing(self, L):
 return all(x<=y for x, y in zip(L, L[1:]))

xgboost’s New Fast Histogram (tree_method = hist)

Summary from PR author

Extra comments by myself

Does it work in R?

Does it work in Python?

Written by Laurae