GPs vs. Linear Regression vs. XGBoost

Pros and Cons of Gaussian Processes, Linear Regression, and XGBoost. Implementing GPs, Linear Regression, and XGBoost in scikit-learn (sklearn)

Oliver Lövström
Internet of Technology
7 min read · May 13, 2024


In this article, we will compare Gaussian Processes (GPs), eXtreme Gradient Boosting (XGBoost), and linear regression and provide scikit-learn implementations of the algorithms. We will also list the pros and cons of the different models.

Photo by Kyle Head on Unsplash

Data

We begin by choosing a dataset, loading it into Python, and preparing it for analysis.

Problem Definition

In this article, I will attempt to predict earnings based on likes, comments, and read time for articles written on Medium.com.

Dataset

I will be working with a dataset containing the columns Title, Likes, Comments, Read Time, and Earnings. Here is a sample of the data:

Title,Likes,Comments,Read Time,Earnings
Title1,89,2,2,0.25
Title2,378,16,2,1.06
Title3,517,8,4,1.86
Title4,20,0,4,0.02
Title5,550,14,2,7.66
...
Title200,105,3,2,0.8

Loading Dataset

We begin by loading and splitting the dataset into training and test sets:

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

# Load data.
data = pd.read_csv("stats/response_stats.csv")

# Separate features and target. Keep the title column for later reference.
X = data[["Likes", "Comments", "Read Time", "Title"]].copy()
y = data["Earnings"]

# (Optional) Earnings cannot be negative, so the target is shifted by 1 and
# converted to a logarithmic scale; predictions are mapped back with exp(y) - 1.
y = np.log(y + 1)

# Split data into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.10, random_state=42)

# (Optional) Save titles.
test_titles = X_test["Title"]
X_train = X_train.drop(columns="Title")
X_test = X_test.drop(columns="Title")
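
As a side note, NumPy's log1p and expm1 functions perform the same +1/log transform and its inverse in a single call, which avoids repeating the shift by hand:

# Equivalent to y = np.log(y + 1) above and np.exp(y_pred) - 1 later on.
y_log = np.log1p(data["Earnings"])
y_back = np.expm1(y_log)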

Linear Regression

Linear regression models the relationship between variables by finding a line that minimizes the difference between the actual and predicted values. In the simplest, one-dimensional case, the line is described by the equation f(x) = b0 + b1*x. Our problem has several input variables, in which case the equation describes a hyperplane rather than a line: f(likes, comments, read_time) = b0 + b1*likes + b2*comments + b3*read_time.

Pros of Linear Regression

  • Simplicity: Linear regression is simple to understand and implement. The results are also easy to interpret since it’s possible to extract the values of b0, b1, …, bn.
  • Efficiency: Linear regression is computationally efficient.

Cons of Linear Regression

  • Linearity: Not all data is linear; linear regression should only be used when a roughly linear relationship in the data can be assumed.
  • Outliers: Linear regression is sensitive to outliers, which can affect the model's fit.

Implementation

In the implementation, we take the data from above, scale the features, and then initialize and train the linear regression model. Finally, we use the model to make predictions and report the results as mean squared error (MSE) and R2 score.

from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score

# Scale features.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Initialize and train the linear regression model.
model = LinearRegression()
model.fit(X_train_scaled, y_train)

# Predict on the test set.
y_pred = model.predict(X_test_scaled)
y_pred = np.exp(y_pred) - 1 # (Optional) Omit if you don't use logarithmic scaling.

# Evaluate the model.
mse = mean_squared_error(np.exp(y_test) - 1, y_pred)
r2 = r2_score(np.exp(y_test) - 1, y_pred)

print(f"MSE: {mse:.2f}")
print(f"R2 Score: {r2:.2f}")

# (Optional) Print the title, actual earnings, and predicted earnings.
for title, actual, predicted in zip(test_titles, np.exp(y_test) - 1, y_pred):
    print(f"Title: {title}, Actual Earnings: ${actual:.2f}, Predicted Earnings: ${predicted:.2f}")
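
Because the model is linear, the fitted intercept (b0) and coefficients (b1, b2, b3) can be inspected directly. A minimal sketch; note that the coefficients apply to the standardized features, so each one describes the effect of one standard deviation of that input:

# (Optional) Inspect the learned intercept and coefficients.
print(f"Intercept (b0): {model.intercept_:.3f}")
for name, coef in zip(X_train.columns, model.coef_):
    print(f"Coefficient for {name}: {coef:.3f}")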

Results

MSE: 2.30
R2 Score: 0.79
Actual Earnings: $0.80, Predicted Earnings: $1.13
Actual Earnings: $1.86, Predicted Earnings: $2.43
Actual Earnings: $4.55, Predicted Earnings: $4.28
Actual Earnings: $0.30, Predicted Earnings: $1.94
...
Actual Earnings: $2.18, Predicted Earnings: $2.19

eXtreme Gradient Boosting (XGBoost)

eXtreme Gradient Boosting (XGBoost) is a machine learning algorithm based on gradient boosting, which builds an ensemble of decision trees. Decision trees split data into branches based on feature values (hence the name); each node in a tree represents a decision, and each branch represents the outcome of that decision. In the simplest case, a tree can be read as a sequence of yes/no questions. Gradient boosting trains trees sequentially, with each new tree fitted to the errors of the current ensemble (a minimal sketch of this idea follows below).

Decision tree illustration: CollaborativeGeneticist, CC BY-SA 4.0 <https://creativecommons.org/licenses/by-sa/4.0>, via Wikimedia Commons
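
To make the boosting idea concrete, here is a minimal sketch of a single boosting step using plain scikit-learn decision trees on the data loaded earlier; this illustrates the principle, not XGBoost itself:

from sklearn.tree import DecisionTreeRegressor

# Fit a shallow tree, then fit a second tree to the residuals of the first,
# and sum the two predictions. XGBoost repeats this idea many times, adding
# regularization and a learning rate.
tree1 = DecisionTreeRegressor(max_depth=2, random_state=42).fit(X_train, y_train)
residuals = y_train - tree1.predict(X_train)
tree2 = DecisionTreeRegressor(max_depth=2, random_state=42).fit(X_train, residuals)
boosted_pred = tree1.predict(X_test) + tree2.predict(X_test)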

Pros of XGBoost

  • Complexity: Unlike linear regression, XGBoost can learn complex non-linear relationships.
  • Performance: XGBoost generally has good out-of-the-box performance and comes with built-in regularization to prevent overfitting.

Cons of XGBoost

  • Tuning: Tuning XGBoost is much more complex than tuning linear regression. The model has many hyperparameters that can be tuned to affect performance (a small tuning sketch follows this list).
  • Expensive: XGBoost is also much more computationally expensive than regular linear regression.
  • Interpretability: Unlike linear regression, it’s much more challenging to interpret the model’s predictions and the relationship between variables.
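
As a rough sketch of what hyperparameter tuning can look like with scikit-learn's GridSearchCV; the grid below is a hypothetical starting point, not values tuned for this dataset:

from sklearn.model_selection import GridSearchCV
from xgboost import XGBRegressor

# Hypothetical search grid; adjust to your data and compute budget.
param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [3, 6],
    "learning_rate": [0.05, 0.1],
}
search = GridSearchCV(XGBRegressor(objective="reg:squarederror"), param_grid, cv=5)
search.fit(X_train, y_train)
print(search.best_params_)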

Implementation

In the implementation, we initialize and train the model and then predict on the test set. Unlike with linear regression, we don't need to scale the features for XGBoost, since tree-based models are insensitive to feature scaling. Finally, we evaluate the model and print the performance.

from xgboost import XGBRegressor

# Initialize and train the XGBoost model.
model = XGBRegressor(objective='reg:squarederror', n_estimators=100)
model.fit(X_train, y_train)

# Predict on the test set.
y_pred = model.predict(X_test)
y_pred = np.exp(y_pred) - 1 # (Optional) Omit if you don't use logarithmic scaling.

# Evaluate the model.
mse = mean_squared_error(np.exp(y_test) - 1, y_pred)
r2 = r2_score(np.exp(y_test) - 1, y_pred)

print(f"MSE: {mse:.2f}")
print(f"R2 Score: {r2:.2f}")

# (Optional) Print the title, actual earnings, and predicted earnings.
for title, actual, predicted in zip(test_titles, np.exp(y_test) - 1, y_pred):
    print(f"Title: {title}, Actual Earnings: ${actual:.2f}, Predicted Earnings: ${predicted:.2f}")
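
Although XGBoost is harder to interpret than linear regression, the relative importance of each feature can still be inspected after training:

# (Optional) Relative feature importances from the trained model.
for name, importance in zip(X_train.columns, model.feature_importances_):
    print(f"Importance of {name}: {importance:.3f}")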

Result

MSE: 2.13
R2 Score: 0.80
Actual Earnings: $0.80, Predicted Earnings: $0.90
Actual Earnings: $1.86, Predicted Earnings: $3.78
Actual Earnings: $4.55, Predicted Earnings: $6.86
Actual Earnings: $0.30, Predicted Earnings: $2.13
...
Actual Earnings: $2.18, Predicted Earnings: $2.05

Gaussian Processes (GPs)

Gaussian processes (GPs) are a Bayesian approach to machine learning: rather than learning a single best-fit function, a GP places a distribution over functions. Unlike linear regression and XGBoost, GP predictions therefore come with a measure of uncertainty.
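
As a minimal illustration of this uncertainty, using toy one-dimensional data unrelated to the earnings problem, the predicted standard deviation is small near the training points and grows where there is no data:

from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Toy 1-D example: query one point near the training data and one far away.
X_toy = np.array([[1.0], [3.0], [5.0]])
y_toy = np.array([1.0, 2.0, 1.5])
gp_toy = GaussianProcessRegressor(kernel=RBF(length_scale=1.0)).fit(X_toy, y_toy)
mean, std = gp_toy.predict(np.array([[3.0], [10.0]]), return_std=True)
print(std)  # The std is much larger for the point far from the data.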

Pros of GPs

  • Uncertainty: GPs do not just produce point predictions; they also quantify how uncertain each prediction is.
  • Complexity: Like XGBoost and unlike linear regression, GPs can model complex non-linear problems.

Cons of GPs

  • Expensive: Similar to XGBoost, GPs are computationally costly.
  • Kernel: GPs add another hyperparameter in the form of the kernel, the covariance function. Selecting a kernel that models the problem well can be challenging (a small comparison sketch follows this list).
  • Interpretability: Like XGBoost, GPs are usually hard to interpret.
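
One way to guide the kernel choice is to compare candidates by the log marginal likelihood of the fitted GP. A minimal sketch, assuming the scaled features (X_train_scaled) and log-scale targets (y_train_log) defined in the implementation below:

from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, RationalQuadratic, WhiteKernel

# Fit a GP with each candidate kernel and compare the resulting log marginal likelihoods.
for candidate in [RBF() + WhiteKernel(), RationalQuadratic() + WhiteKernel()]:
    gp_candidate = GaussianProcessRegressor(kernel=candidate, normalize_y=True)
    gp_candidate.fit(X_train_scaled, y_train_log)
    print(candidate, gp_candidate.log_marginal_likelihood_value_)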

Implementation

In this implementation, we load the data again, since the target is handled slightly differently: the log transform is applied after the train/test split rather than before. After loading the data, we split it into training and test sets, scale the features, and convert the targets to a logarithmic scale. We then define the kernel, train the GP, make predictions on the test data, and evaluate the model's performance.

from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel as C, WhiteKernel
from sklearn.gaussian_process.kernels import RationalQuadratic

# Load data.
data = pd.read_csv("stats/response_stats.csv")

# Separate features and target. Keep the title column for later reference.
X = data[["Likes", "Comments", "Read Time", "Title"]].copy()
y = data["Earnings"]

# Split data into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.10, random_state=42)

# (Optional) Save titles from the test set.
test_titles = X_test["Title"]
X_train = X_train.drop(columns="Title")
X_test = X_test.drop(columns="Title")

# Scale features.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# (Optional) Convert targets to logarithmic scale.
y_train_log = np.log(y_train + 1)
y_test_log = np.log(y_test + 1)

# Define the kernel and train the Gaussian Process.
# The kernel and its hyperparameters below were tuned for this particular problem.
kernel = (
    C(1.0, (1e-2, 1e2))
    * RationalQuadratic(length_scale=39.28455176212197, alpha=8.362426847738403)
    + WhiteKernel(noise_level=0.33739616048352883)
)
gp = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=15, normalize_y=True)
gp.fit(X_train_scaled, y_train_log)

# Predict on the test set.
y_pred, std = gp.predict(X_test_scaled, return_std=True)
y_pred = np.exp(y_pred) - 1 # (Optional) Omit if you don't use logarithmic scaling.

# Evaluation
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"MSE: {mse:.3f}")
print(f"R2 Score: {r2:.3f}")

# (Optional) Print the title, actual and predicted earnings, standard deviation, and 95% confidence interval.
# Note: std is on the log scale; applying it directly to the back-transformed prediction
# is an approximation (an alternative is shown below).
for i in range(len(X_test)):
    title = test_titles.iloc[i]
    lower_bound = max(0, y_pred[i] - 1.96 * std[i])  # Earnings cannot be negative.
    upper_bound = y_pred[i] + 1.96 * std[i]
    print(f"Title: {title}, Predicted: {y_pred[i]:.3f}, Actual: {y_test.iloc[i]:.3f}, STD: {std[i]:.3f}, 95% CI: [{lower_bound:.3f}, {upper_bound:.3f}]")
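
Since the standard deviation is returned on the log scale, an alternative is to form the interval on the log scale first and then map the bounds back to dollars:

# (Optional) Alternative: build the 95% interval on the log scale, then back-transform.
y_pred_log, std = gp.predict(X_test_scaled, return_std=True)
lower = np.exp(y_pred_log - 1.96 * std) - 1
upper = np.exp(y_pred_log + 1.96 * std) - 1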

Result

MSE: 2.132
R2 Score: 0.802
Predicted: 0.897, Actual: 0.800, STD: 0.369, 95% CI: [0.174, 1.619]
Predicted: 3.779, Actual: 1.860, STD: 0.374, 95% CI: [3.046, 4.511]
Predicted: 6.863, Actual: 4.550, STD: 0.374, 95% CI: [6.130, 7.595]
Predicted: 2.129, Actual: 0.300, STD: 0.368, 95% CI: [1.408, 2.850]
...
Predicted: 2.049, Actual: 2.180, STD: 0.381, 95% CI: [1.303, 2.796]

Conclusion

Linear regression, XGBoost, and GPs are three machine learning algorithms with their own strengths and weaknesses. While linear regression is the simplest and most computationally efficient model, it fails to model non-linear relationships, which may be present in the data.

XGBoost and GPs can model non-linear relationships. However, they are more computationally expensive and harder to tune and interpret. XGBoost's out-of-the-box performance is usually good, while GPs rely heavily on the choice of kernel and other hyperparameters. GPs are advantageous when the uncertainty of the predictions is important.

Result Summary

These are the results from the different models:

# Linear Regression
MSE: 2.30
R2 Score: 0.79
Actual Earnings: $0.80, Predicted Earnings: $1.13
Actual Earnings: $1.86, Predicted Earnings: $2.43
Actual Earnings: $4.55, Predicted Earnings: $4.28
Actual Earnings: $0.30, Predicted Earnings: $1.94
...
Actual Earnings: $2.18, Predicted Earnings: $2.19

# XGBoost
MSE: 2.13
R2 Score: 0.80
Actual Earnings: $0.80, Predicted Earnings: $0.90
Actual Earnings: $1.86, Predicted Earnings: $3.78
Actual Earnings: $4.55, Predicted Earnings: $6.86
Actual Earnings: $0.30, Predicted Earnings: $2.13
...
Actual Earnings: $2.18, Predicted Earnings: $2.05

# GPs
MSE: 2.132
R2 Score: 0.802
Predicted: 0.897, Actual: 0.800, STD: 0.369, 95% CI: [0.174, 1.619]
Predicted: 3.779, Actual: 1.860, STD: 0.374, 95% CI: [3.046, 4.511]
Predicted: 6.863, Actual: 4.550, STD: 0.374, 95% CI: [6.130, 7.595]
Predicted: 2.129, Actual: 0.300, STD: 0.368, 95% CI: [1.408, 2.850]
...
Predicted: 2.049, Actual: 2.180, STD: 0.381, 95% CI: [1.303, 2.796]

For this problem, the mean squared error and R2 score are similar across all three models, with XGBoost and the GP performing slightly better than linear regression. The GP could be considered the best choice here since it also provides a standard deviation for each prediction. However, this depends entirely on the problem definition and how important uncertainty is in the modeling.

Further Reading

If you want to learn more about programming and, specifically, machine learning, see the following Coursera course:

Note: If you use my links to order, I’ll get a small kickback.
