Practical Applications of XGBoost in Data Science

Harshita Aswani
3 min read · Aug 9, 2023


XGBoost (eXtreme Gradient Boosting) is a powerful machine learning algorithm known for its exceptional performance and scalability. It belongs to the gradient boosting family of algorithms and is widely used for various data problems, including classification, regression, and ranking. In this blog post, we will explore the practical applications of XGBoost and demonstrate its usage with code examples.

Classification with XGBoost

XGBoost is particularly effective for classification tasks, where the goal is to predict discrete class labels. Let’s consider an example of predicting whether a bank customer will churn based on various features. Here’s how you can train an XGBoost classifier:

import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load your dataset (X and y); load_dataset() is a placeholder
# for your own data-loading logic returning features X and labels y
X, y = load_dataset()

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create DMatrix for XGBoost
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)

# Define the parameters for XGBoost
params = {
    'objective': 'binary:logistic',
    'eval_metric': 'logloss',
    'max_depth': 3,
    'eta': 0.1,
    'subsample': 0.8,
    'colsample_bytree': 0.8
}

# Train the XGBoost model
num_rounds = 100
model = xgb.train(params, dtrain, num_rounds)

# Make predictions on the test set (predict returns probabilities)
y_pred = model.predict(dtest)

# Convert probabilities to class labels with a 0.5 threshold
predictions = [round(value) for value in y_pred]

# Evaluate the model
accuracy = accuracy_score(y_test, predictions)
print("Accuracy:", accuracy)

In this example, we start by loading the dataset and splitting it into training and testing sets using train_test_split from scikit-learn. We then create DMatrix objects, which are optimized data structures for XGBoost. Next, we define the parameters for XGBoost, including the objective function (binary logistic regression), evaluation metric (log loss), maximum depth, learning rate (eta), subsample ratio, and column subsampling ratio. We train the XGBoost model using xgb.train and make predictions on the test set. Finally, we evaluate the model's accuracy.
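In practice, you often won’t know the right number of boosting rounds in advance. As a minimal sketch, here is how early stopping could look, reusing the dtrain, dtest, and params objects from the example above (in a real project you would carve out a separate validation split rather than reusing the test set for early stopping):

# Monitor log loss on an evaluation set and stop when it plateaus
watchlist = [(dtrain, 'train'), (dtest, 'eval')]
model = xgb.train(
    params,
    dtrain,
    num_boost_round=1000,
    evals=watchlist,
    early_stopping_rounds=10  # stop if log loss hasn't improved in 10 rounds
)
print("Best iteration:", model.best_iteration)

This lets you set num_boost_round generously and let the validation metric decide when to stop, which usually gives better generalization than a hand-picked round count.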

Regression with XGBoost

XGBoost is also highly effective for regression problems, where the goal is to predict continuous numeric values. Let’s consider an example of predicting house prices based on various features. Here’s how you can train an XGBoost regression model:

import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Load your dataset (X and y); load_dataset() is again a placeholder
# for your own data-loading logic
X, y = load_dataset()

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create DMatrix for XGBoost
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)

# Define the parameters for XGBoost
params = {
    'objective': 'reg:squarederror',
    'eval_metric': 'rmse',
    'max_depth': 5,
    'eta': 0.1,
    'subsample': 0.8,
    'colsample_bytree': 0.8
}

# Train the XGBoost model
num_rounds = 100
model = xgb.train(params, dtrain, num_rounds)

# Make predictions on the test set
y_pred = model.predict(dtest)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)

In this example, we follow a process similar to the classification example. We load the dataset, split it into training and testing sets, create DMatrix objects, and define the XGBoost parameters. The key difference lies in the objective and evaluation metric: here, we use ‘reg:squarederror’ as the objective function for regression and ‘rmse’ (root mean squared error) as the evaluation metric. We train the XGBoost model, make predictions on the test set, and evaluate its performance using the mean squared error metric.
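Since training monitors RMSE while the final evaluation reports MSE, you may want both errors on the same scale. A one-line follow-up, using the y_test and y_pred values from above:

# RMSE is simply the square root of the MSE reported above,
# expressed in the same units as the target (e.g., house price)
rmse = mean_squared_error(y_test, y_pred) ** 0.5
print("Root Mean Squared Error:", rmse)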

XGBoost is a powerful machine learning algorithm that can be applied to a wide range of data problems, including classification and regression. In this blog post, we explored the practical applications of XGBoost and demonstrated its usage through code examples. By incorporating XGBoost into your projects, you can benefit from its exceptional performance, scalability, and feature importance analysis. Experiment with XGBoost on your own datasets and unlock its potential for accurate predictions and insights.
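As one illustration of the feature importance analysis mentioned above, here is a minimal sketch using the Booster’s get_score method on the trained model from either example (note that when the DMatrix is built from a plain NumPy array, features are named f0, f1, and so on):

# Total gain contributed by each feature across all splits
importance = model.get_score(importance_type='gain')
for feature, score in sorted(importance.items(), key=lambda kv: kv[1], reverse=True):
    print(feature, score)

# Or visualize it directly (requires matplotlib)
xgb.plot_importance(model)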

Connect with author: https://linktr.ee/harshita_aswani

