XGBoost

eXtreme Gradient Boosting (XGBoost)

Pedro Meira
Time to Work
Oct 15, 2019

XGBoost is an algorithm that has recently been dominating applied machine learning and Kaggle competitions for structured or tabular data.

XGBoost stands for eXtreme Gradient Boosting. The name, though, actually refers to the engineering goal of pushing the limit of computational resources for boosted tree algorithms (Tianqi Chen).

XGBoost is a software library that you can download and install on your machine, then access from a variety of interfaces. Specifically, XGBoost supports the following main interfaces:

  • Command Line Interface (CLI).
  • C++ (the language in which the library is written).
  • Python interface as well as a model in scikit-learn.
  • R interface as well as a model in the caret package.
  • Julia.
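
As a rough illustration of the Python entry points listed above, the sketch below trains on a tiny synthetic data set, once through the scikit-learn wrapper (XGBClassifier) and once through the native API (xgb.DMatrix and xgb.train). The helper names, the toy data, and the hyperparameter values are assumptions made for illustration; they are not taken from the original notebook.

```python
import random


def make_toy_data(n_rows=200, seed=0):
    """Build a small synthetic binary problem (not the census data)."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n_rows):
        a, b = rng.random(), rng.random()
        X.append([a, b])
        # Label is 1 when the two features sum to more than 1.
        y.append(1 if a + b > 1.0 else 0)
    return X, y


def fit_sklearn_style(X, y):
    """Train through the scikit-learn compatible wrapper."""
    # Imported inside the function so make_toy_data works without XGBoost.
    import numpy as np
    from xgboost import XGBClassifier

    # Hyperparameter values are illustrative, not tuned.
    model = XGBClassifier(n_estimators=50, max_depth=3, learning_rate=0.1)
    model.fit(np.asarray(X), np.asarray(y))
    return model


def fit_native(X, y):
    """Train through the native API, which works on a DMatrix."""
    import numpy as np
    import xgboost as xgb

    dtrain = xgb.DMatrix(np.asarray(X), label=np.asarray(y))
    params = {"objective": "binary:logistic", "max_depth": 3, "eta": 0.1}
    return xgb.train(params, dtrain, num_boost_round=50)
```

Calling fit_sklearn_style(*make_toy_data()) returns a fitted classifier with the usual predict method, while fit_native returns a Booster that predicts on a DMatrix.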

Why Use XGBoost?

The two reasons to use XGBoost are also the two goals of the project:

  1. Execution Speed.
  2. Model Performance.

0. Case Study

We are still using the 1994 census data set on U.S. income. It contains information on marital status, age, type of work, and more. The target column, high_income, is 0 for salaries of 50k a year or less and 1 for salaries above 50k a year.

The data is available from the UCI Machine Learning Repository on the University of California, Irvine website.
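
To make the target encoding concrete, here is a minimal sketch that parses rows laid out like the repository's adult.data file and derives high_income from the raw income label. The column names, the two sample rows, and the function names are assumptions for illustration; the article's own CSV may already contain a high_income column.

```python
import csv
import io

# Assumed column order for the raw file, which ships without a header row.
COLUMNS = [
    "age", "workclass", "fnlwgt", "education", "education_num",
    "marital_status", "occupation", "relationship", "race", "sex",
    "capital_gain", "capital_loss", "hours_per_week", "native_country",
    "income",
]

# Two illustrative rows in the file's comma-plus-space layout.
SAMPLE = """39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K
52, Self-emp-not-inc, 209642, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 45, United-States, >50K
"""


def encode_high_income(label):
    """Return 1 for more than 50k a year, 0 for 50k or less."""
    # Some copies of the data end the label with a period, so strip it.
    return 1 if label.strip().rstrip(".") == ">50K" else 0


def parse_rows(text):
    """Parse raw text into dicts and replace income with high_income."""
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    rows = []
    for raw in reader:
        if len(raw) != len(COLUMNS):
            continue  # skip blank or malformed lines
        row = dict(zip(COLUMNS, raw))
        row["high_income"] = encode_high_income(row.pop("income"))
        rows.append(row)
    return rows
```

With pandas, the same layout can be read with pd.read_csv using header=None, names=COLUMNS and skipinitialspace=True.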

1. Loading the Libraries and 2. Getting the Data

Imported DataFrame
Output df.info()

3. Clean, prepare and manipulate Data (Feature Engineering)

Feature Engineering
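
The original feature engineering code is not reproduced here. One common preparation step for tree models is turning string categories into integer codes, and the sketch below shows that step in plain Python. The function name and the choice of sorted, per-column codes are assumptions, not the author's exact method.

```python
def encode_categoricals(rows, categorical_columns):
    """Replace string categories with integer codes, one mapping per column."""
    mappings = {}
    for col in categorical_columns:
        # Sort the distinct values so the codes are reproducible across runs.
        categories = sorted({row[col] for row in rows})
        mappings[col] = {value: code for code, value in enumerate(categories)}

    # Build new dicts so the input rows are left unchanged.
    encoded = [
        {
            col: mappings[col][val] if col in mappings else val
            for col, val in row.items()
        }
        for row in rows
    ]
    return encoded, mappings
```

In pandas, pd.Categorical(df[col]).codes gives a comparable integer encoding for a single column.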

4. Modeling

X.head() — Features
y.head() — Labels

5. Using the GPU (CPU vs. GPU)

Some parameters:

Number of class labels: num_cls is 2, because the target is binary (0 or 1).

updater (also listed as updater_seq): a comma-separated string defining the sequence of tree updaters to run, providing a modular way to construct and modify the trees. By default it is set to “grow_colmaker,prune”, which means updater_colmaker runs first and updater_prune runs second.

Another setting takes a value between 2 and 256 (capped at 256 for an 8-bit representation optimization).
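
A minimal sketch of how these settings might be assembled for a CPU versus GPU comparison follows. It assumes XGBoost 2.0 or later, where the device parameter selects the hardware; older releases used tree_method="gpu_hist" for the GPU instead. The multi:softmax objective is an assumption chosen because the text names a class count (binary:logistic is the more common choice for two classes), and the function names are hypothetical.

```python
import time

# Default updater string quoted in the text above.
DEFAULT_UPDATERS = "grow_colmaker,prune"


def updater_sequence(updater):
    """Split the comma-separated updater string into its ordered steps."""
    return [name.strip() for name in updater.split(",") if name.strip()]


def build_params(use_gpu, num_cls=2):
    """Return a parameter dict for one side of the CPU vs. GPU comparison."""
    return {
        "objective": "multi:softmax",  # assumed objective, see lead-in
        "num_class": num_cls,          # 2 for the binary high_income target
        "tree_method": "hist",
        "device": "cuda" if use_gpu else "cpu",
    }


def time_training(params, X, y, num_round=100):
    """Train once and return the booster plus the elapsed wall-clock seconds."""
    # Imported here so the helpers above work without XGBoost installed.
    import xgboost as xgb

    dtrain = xgb.DMatrix(X, label=y)
    start = time.perf_counter()
    booster = xgb.train(params, dtrain, num_boost_round=num_round)
    return booster, time.perf_counter() - start
```

Running time_training once with build_params(False) and once with build_params(True) on the same data gives the two runtimes to compare; the GPU run needs a CUDA-enabled build of XGBoost.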

6. Visualization

We use matplotlib and Plotly to plot the runtimes:

Visualization
df.head()
df.info()
Plot using matplotlib
Plot using Plotly.
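
The original plots are not reproduced here. As a sketch of the matplotlib version, the code below turns two measured runtimes into a small table and draws them as a bar chart. The function names and output file name are hypothetical, and any numbers passed in should be your own measurements; none are supplied here.

```python
def runtime_rows(cpu_seconds, gpu_seconds):
    """Build one row per device, with speedup relative to the CPU run."""
    return [
        {"device": "CPU", "seconds": cpu_seconds, "speedup": 1.0},
        {
            "device": "GPU",
            "seconds": gpu_seconds,
            "speedup": cpu_seconds / gpu_seconds,
        },
    ]


def plot_runtimes(rows, path="runtimes.png"):
    """Draw a bar chart of training time per device and save it to a file."""
    # Imported here so runtime_rows works without matplotlib installed.
    import matplotlib
    matplotlib.use("Agg")  # render to a file, no display needed
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.bar([r["device"] for r in rows], [r["seconds"] for r in rows])
    ax.set_ylabel("Training time (s)")
    ax.set_title("XGBoost training time: CPU vs. GPU")
    fig.savefig(path)
    plt.close(fig)
    return path
```

Passing the two elapsed times from the CPU and GPU runs to runtime_rows, then the result to plot_runtimes, writes the chart to disk.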

7. Bibliography
