A Brief Introduction: Cross Validated Random Forest

Patrick Sandoval Romero
LatinXinAI

--

In this post, I will introduce one of the most popular methods for modelling categorical data: the Random Forest Classifier. To understand how this machine learning technique works, I will first explain what decision trees are and how they predict the target variable. I will then show how to implement a Random Forest Classifier using the sklearn library in Python, so some Python background is assumed. You can find the Python code and all the plots used in this post in the following GitHub repo:

Medium/CVRF at main · 159Patrick159/Medium (github.com)

Decision Trees

Decision trees are a type of supervised machine learning technique used to model either continuous or categorical data. A tree is composed of nodes, each of which splits the data based on a specific question about one of the model's features. The ultimate goal of training is to minimize the impurity score of the terminal nodes. For binary classification, this impurity score ranges from 0 to 0.5 and measures how mixed the classes are within a node: a score of 0 means there is no impurity and the node contains a single class, while a score of 0.5 means the classes are equally represented in that node.
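To make the impurity score concrete, here is a tiny sketch of the Gini impurity of a binary node; the helper function below is purely illustrative and not part of the repo code:

def gini_impurity(class_counts):
    """Gini impurity of a node given the number of samples of each class in it."""
    total = sum(class_counts)
    proportions = [count / total for count in class_counts]
    return 1 - sum(p ** 2 for p in proportions)

print(gini_impurity([50, 0]))   # 0.0 -> pure node, single class
print(gini_impurity([25, 25]))  # 0.5 -> classes equally represented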
There are three types of nodes that constitute a decision tree: root node, decision nodes and leaf nodes.

The root node is the first node of the decision tree, and it contains the most predictive feature, that is, the feature whose split reduces the impurity score the most.

The decision nodes are the subsequent nodes that impose conditions on the available features, again with the goal of minimizing the impurity score.

Lastly, the leaf nodes are the terminal nodes of the decision tree. A node becomes a leaf when further splitting is no longer justified: either it contains too few samples to split again, or a split would leave a child node with fewer samples than the minimum required per leaf.

Sample schematic of a decision tree used to predict customer churn rates for a bank; only the first two layers of the tree are plotted. How to read this plot: the first line in every node is the feature and split point the model identified as most predictive, i.e. the question being asked at that split. If the answer to the question is “yes”, the sample moves to the left child node; if the answer is “no”, it moves to the right. The keyword “gini” refers to the Gini impurity score used to measure a node's impurity, “samples” is the total number of samples present in that node, “value” is the count of each class in that node, and “class” is the name of the majority class in that node.

Once the decision tree has been assembled from the training data set, it can start predicting the target variable: each sample is passed down the tree by answering the question on the first line of every node and moving to the corresponding child node. Once the sample reaches a leaf node, the tree predicts the majority class of that leaf.
Sklearn’s decision tree object has a couple of hyperparameters we can tune to improve the model's performance; however, we need to be careful about how we tune them to avoid overfitting. I will write a separate post about fine-tuning the hyperparameters of decision trees; this post is about tuning the hyperparameters of random forests through a grid search.
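As a minimal sketch (on synthetic data rather than the churn dataset from the schematic), a shallow decision tree can be fit and drawn with sklearn as follows:

import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, plot_tree

# Toy binary-classification data standing in for a real dataset
X, y = make_classification(n_samples=500, n_features=4, random_state=0)

# A shallow tree limited to two layers, like the schematic above
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X, y)

# Each node shows the split, gini, samples, value and class described earlier
plot_tree(tree, filled=True)
plt.show()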

Random Forests: The Theory

I find that the best way to think about random forests is as an ensemble of shallow decision trees. This is the same principle as using multiple weak learners, e.g. the walkers that explore the posterior distribution in the MCMC MLE routine from my previous post. Both techniques are examples of ensemble learning, which combines the predictions of multiple base models to improve the performance and accuracy of the overall model.

Random forests train their shallow decision trees through a method known as “bagging”, or bootstrap aggregation: each tree is trained on a random sample of the training data drawn with replacement. In addition, each tree only considers a random subset of the features at each split. Together, these sources of randomness help the model deal with noisy data and prevent overfitting, which is a common issue for individual decision trees.

Once the shallow trees have been trained, the random forest predicts the target variable by aggregating the predictions of the individual trees and choosing the class favored by the majority of them.
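A rough sketch of this voting idea on synthetic data is shown below. Note that sklearn's RandomForestClassifier actually averages the trees' class probabilities rather than taking a strict hard vote, but for a binary problem the two usually agree:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=6, random_state=0)

# Each tree is trained on a bootstrap sample of the rows and considers
# a random subset of the features at every split
forest = RandomForestClassifier(n_estimators=10, max_depth=3, random_state=0)
forest.fit(X, y)

# Aggregate the individual trees' predictions by majority vote
tree_votes = np.array([tree.predict(X[:5]) for tree in forest.estimators_])
majority_vote = (tree_votes.mean(axis=0) >= 0.5).astype(int)

print(majority_vote)          # manual aggregation
print(forest.predict(X[:5]))  # the forest's own aggregation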

Hyperparameters of Random Forests

Hyperparameters are parameters external to the model itself and are set before the learning process begins. Unlike regular model parameters (which are learned from data), hyperparameters cannot be directly estimated from the training set.

Hyperparameters are specific to the learning algorithm. In the case of a random forest, we will focus on the following 6 hyperparameters (a minimal instantiation sketch follows the list).
1) n_estimators: Specifies the number of trees in the forest.
2) max_depth: Specifies the maximum depth (number of layers) of the individual trees.
3) min_samples_split: Specifies the minimum number of samples a node must contain to justify a split.
4) min_samples_leaf: Specifies the minimum number of samples required in a leaf node.
5) max_features: Specifies the number of features considered when splitting a node.
6) criterion: The function used to measure the quality of a split. The default is the Gini impurity score.
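To make these concrete, this is roughly how they map onto the constructor of sklearn's RandomForestClassifier (the values here are illustrative, not the tuned ones):

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=100,      # number of trees in the forest
    max_depth=3,           # maximum depth of each individual tree
    min_samples_split=2,   # minimum samples a node needs to be split
    min_samples_leaf=1,    # minimum samples required in a leaf node
    max_features='sqrt',   # features considered when splitting a node
    criterion='gini',      # function measuring the quality of a split
    random_state=0,
)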

Cross Validated Hyperparameter Tuning

Ideally, we want to find the combination of hyperparameters that produces the best results based on some metric of interest. The metric usually depends on the goal and scope of the project, but a generally good metric for measuring a classifier's performance is the F1 score, the harmonic mean of precision and recall.
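Concretely, F1 = 2 · precision · recall / (precision + recall). A quick sanity check with made-up labels:

from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical true and predicted labels
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]

p, r = precision_score(y_true, y_pred), recall_score(y_true, y_pred)
print(2 * p * r / (p + r))       # harmonic mean of precision and recall
print(f1_score(y_true, y_pred))  # same value from sklearn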

In this post, I will go over how to cross-validate a model using GridSearchCV. This function essentially generates a grid of possible parameter combinations and evaluates the model using cross-validation (typically k-fold cross-validation).

K-fold is a specific variant of cross-validation that splits the dataset into k subsets (or folds). The model is trained and evaluated k times: on each iteration, one fold is held out as the validation set while the remaining k-1 folds are used for training. The performance metrics from the k folds are then averaged to estimate the model's generalization performance.
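A minimal sketch of 5-fold cross-validation on synthetic data, using sklearn's cross_val_score:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)

# Train and evaluate 5 times; each fold serves once as the validation set
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5, scoring='f1')
print(scores.mean())  # average F1 across folds estimates generalization performance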

The main drawback of using GridSearchCV for hyperparameter tuning is that it becomes computationally expensive for larger parameter grids. To handle this, we can either reduce the parameter grid or use parallel computing to cut down run times.

Random Forests: The Implementation

To implement a random forest classifier, I will be using functions and objects from the sklearn library, together with the Kaggle dataset of taxi fares in New York City. I will write a separate post about using the pandas library to clean, reduce and extract valuable features from this dataset. Our current goal is to predict whether a surge will be applied to any particular trip, and to examine which features are the best indicators of such a price rise.

First 5 rows of the taxi_fare dataset. The dataset is composed of 7 features, all of which are continuous numerical variables; the target variable has been one-hot-encoded, with 0 meaning no surge was applied and 1 meaning a surge was applied. The class balance for the target variable is 71.7% to 28.3%, making the surge-applied class the minority class.

Whether class balancing plays an insightful role in our model depends on the goals of the project and the nature of the data. Even though class balancing can remove bias towards the majority class, it alters the intrinsic class frequencies of the data, which may matter given the context of the project.

Once the features have been cleaned to remove collinearity, duplicates, missing values and outliers, we can start the training process. The cleaning process will be covered in a separate Medium post.

Reduced-features corner plot for the taxi_fare dataset, where blue indicates rides with no surge applied and yellow indicates rides with a surge applied. Diagonal panels show the marginalized distributions of the individual features, and off-diagonal panels show the scatter plots between pairs of features.

To start working with this cleaned dataset, import all the relevant functions and methods from the sklearn library.

# For data handling and plotting
import pandas as pd
import matplotlib.pyplot as plt
# For splitting the data and cross validating
from sklearn.model_selection import train_test_split, GridSearchCV
# For metric evaluation
from sklearn.metrics import accuracy_score, precision_score, recall_score,\
    f1_score, confusion_matrix, ConfusionMatrixDisplay
# For ensembling the random forest
from sklearn.ensemble import RandomForestClassifier
# For upsampling the minority class
from sklearn.utils import resample

As previously discussed, class balancing is a technique used to alter the frequency of a target class so that we reduce the bias towards the majority class. However, in this context, we do not know whether surges should be applied 50% of the time, so we will investigate whether a balanced training set helps the CVRF perform better under the F1 score. As with every research problem, we would like to offer a hypothesis for this experiment.

Hypothesis: If two cross-validated random forests (CVRFs) are trained on different datasets, one with the classes balanced and one with the original class imbalance, the CVRF trained with the original class ratio will become the champion model under the F1 score, because it does not assume there should be an equal number of surge and no-surge trips, thus reducing the number of false positive predictions.

We can balance the data as follows:

# There is a class imbalance, so we will upsample the minority class (surge_applied == 1)
surge = train_clean[train_clean['surge_applied']==1]
no_surge = train_clean[train_clean['surge_applied']==0]

# Upsample the minority class
minority_upsample = resample(surge, replace=True, n_samples=len(no_surge), random_state=2)

# Combine the majority class with the upsampled minority class
train_balanced = pd.concat([no_surge, minority_upsample]).reset_index(drop=True)

# Display new class balance
train_balanced['surge_applied'].value_counts(normalize=True)

The dataset contains 200,000+ rows, which is great; however, this amount of data can notably increase the time it takes to train our machine learning model. Therefore, we will randomly sample 100,000 rows from each of the balanced and imbalanced datasets to train and validate the models.

# Define sample size
N = 100000
# Randomly sample the datasets
sample_balance = train_balanced.sample(n=N,random_state=42)
sample_imbalance = train_clean.sample(n=N,random_state=42)

# Verify class balance is maintained on the sampled data
sample_balance['surge_applied'].value_counts(normalize=True)

Now we can split these datasets into features (X) and target variable (y), with a 75-25 train-to-test ratio, stratifying the samples so that the original class balances are preserved.

# Split data frame into labels (y) and features (X)
yb = sample_balance['surge_applied']
yi = sample_imbalance['surge_applied']

Xb = sample_balance.copy()
Xb = Xb.drop("surge_applied", axis = 1)

Xi = sample_imbalance.copy()
Xi = Xi.drop('surge_applied',axis=1)

# Split data into training and testing sets
Xb_train, Xb_test, yb_train, yb_test = train_test_split(Xb, yb, test_size=0.25, stratify=yb, random_state=42)
Xi_train, Xi_test, yi_train, yi_test = train_test_split(Xi, yi, test_size=0.25, stratify=yi, random_state=42)

With the training and testing sets ready we can instantiate the random forest classifier and the objects needed to cross validate the model using GridSearchCV.

# Instantiate the Random Forest Classifier
rfb = RandomForestClassifier(random_state=0)
rfi = RandomForestClassifier(random_state=0)

# Define hyperparameters for CV to search over
cv_params = {'max_depth': [2, 3],
             'min_samples_leaf': [1, 2, 3],
             'min_samples_split': [2, 3, 4],
             'max_features': [2, 3, 4],
             'n_estimators': [75, 100]
             }

# Define the scoring metrics
scoring = ['accuracy', 'precision', 'recall', 'f1']

# Instantiate the cross validated random forests
rfb_cv = GridSearchCV(rfb, cv_params, scoring=scoring, cv=5, refit='f1', n_jobs=4)
rfi_cv = GridSearchCV(rfi, cv_params, scoring=scoring, cv=5, refit='f1', n_jobs=4)

The dictionary defined in the code block above contains all the hyperparameters we will tune during the cross-validation process. Recall from the previous section that there are 6 hyperparameters for a random forest; in this case, I will not be tuning the 'criterion' hyperparameter because I am more interested in the structure of the forest than in the impurity measure used for splitting.

The list named 'scoring' contains all the metrics that will be used to measure the success of each hyperparameter combination. The metric of most importance, however, is specified in the 'refit' argument of GridSearchCV, which tells the function to refit the best estimator, i.e. the one with the highest average F1 score, on the entire training set.

The last argument I pass to GridSearchCV is 'n_jobs', which specifies the number of jobs to run in parallel. This helps reduce runtimes significantly.

# Fit imbalanced model
rfi_cv.fit(Xi_train,yi_train)

# Fit balanced model
rfb_cv.fit(Xb_train,yb_train)

Once we are done training and validating the models, we can look at the best estimators, their parameters and their scores.
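For instance, the best hyperparameter combination and its mean cross-validated F1 score can be read off the fitted GridSearchCV objects:

# Best hyperparameters and mean cross-validated F1 score for each model
print(rfb_cv.best_params_, rfb_cv.best_score_)
print(rfi_cv.best_params_, rfi_cv.best_score_)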

We notice that the hyperparameters of both RFs are very similar, with the exception that the balanced RF has 75 decision trees while the imbalanced RF has 100.

Now we can evaluate how the models perform on data they have never seen before by using the test sets we previously set aside.
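One way to produce these matrices with the functions imported earlier (a sketch; the repo code may differ slightly):

# Predict on the held-out test sets
yb_pred = rfb_cv.predict(Xb_test)
yi_pred = rfi_cv.predict(Xi_test)

# Confusion matrices for the balanced and imbalanced models
ConfusionMatrixDisplay(confusion_matrix(yb_test, yb_pred)).plot()
ConfusionMatrixDisplay(confusion_matrix(yi_test, yi_pred)).plot()
plt.show()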

Confusion matrices for the cross-validated random forests with balanced and imbalanced target classes. The data used to create these matrices was previously set aside, so neither model has seen it before.

From the confusion matrices we notice that the imbalanced CVRF predicted fewer false positives than the balanced CVRF, while the balanced model predicted just slightly fewer false negatives than the imbalanced one. This can be seen in the summary table below.
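The entries in that table come from the metric functions imported at the start; for example, for the balanced model (reusing yb_pred from the sketch above):

print('Accuracy :', accuracy_score(yb_test, yb_pred))
print('Precision:', precision_score(yb_test, yb_pred))
print('Recall   :', recall_score(yb_test, yb_pred))
print('F1       :', f1_score(yb_test, yb_pred))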

Scoring summary for both cross-validated random forest classifiers using the testing set that was previously separated.

We notice that the F1 score of the imbalanced CVRF is marginally better than that of the balanced model, which suggests that, in this particular case, there is no significant advantage to balancing the dataset.

Kaggle also provides an additional dataset with 78,000 rows that can be used exclusively for testing, so I tested how each model performs on this new data. The results can be seen in the summary table below.

Scoring summary for both cross-validated random forest classifiers using the test dataset provided by Kaggle.

In this case, we notice that the F1 score of the balanced model is marginally better than that of the imbalanced model. However, this is not strong evidence that balancing plays a crucial role in this specific project; we could conduct further statistical tests on each metric to see whether the differences are significant.

Lastly, we would like to explore which features are the best indicators of surge fees. For that, we can use the following methods:

# Let's look at the feature importances for the champion model
importance = rfb_cv.best_estimator_.feature_importances_
forest_importance = pd.Series(importance, index=rfb_cv.feature_names_in_)

fig, ax = plt.subplots()
forest_importance.plot.bar(ax=ax)
ax.set_title("Feature importances using MDI")
ax.set_ylabel("Mean decrease in impurity")
fig.tight_layout()

These results are consistent with what we observed in the feature corner plot, where the “misc_fees” feature showed the clearest pattern for surge fees, followed by the “total_fare” feature.

Conclusion

The proposed hypothesis was correct for the test set that was split off from the original dataset. However, when I evaluated the models on the test dataset provided by Kaggle, the balanced CVRF was marginally superior on all scoring metrics with the exception of precision. This means that not balancing the dataset led to more false negative predictions, which can be troublesome if we tell a customer there won't be a surge and a surge does end up being applied. We also noticed that balancing the dataset led to little or no increase in false positive predictions, which is less harmful: telling a customer a surge will be applied when it turns out not to be is a smaller inconvenience. So, given that our priority is to minimize false negatives, we would opt for the balanced CVRF as our champion model.


Do you identify as Latinx and are working in artificial intelligence or know someone who is Latinx and is working in artificial intelligence?

Don’t forget to hit the 👏 below to help support our community — it means a lot!
