Random Forest

Deniz Gunay
12 min read · Sep 11, 2023

Random Forest is an ensemble machine learning algorithm that combines multiple decision trees to create a more robust and accurate predictive model. It was introduced as an improvement over single decision trees to reduce overfitting and improve predictive performance.

You may wonder how Random Forest differs from a single decision tree (CART). The primary difference lies in the ensemble approach. Random Forest combines multiple decision trees, each trained on a random subset of data and features, to reduce overfitting and improve prediction accuracy. CART, on the other hand, builds a single decision tree on the entire dataset, which makes it more prone to overfitting.

Essentially, random forest is built on two basic concepts: bagging and random subspace:

  • Bagging (Bootstrap Aggregating): It is the process of creating multiple subsets of the training data through random sampling with replacement and using these subsets to train a collection of decision trees. Combining these trees’ predictions results in a more accurate model that is less prone to overfitting.
  • Random Subspace: It refers to the technique of randomly selecting a subset of features (variables or attributes) from the original feature set for each individual decision tree within the ensemble. By allowing each tree to focus on a different subset of features, the Random Forest can capture different patterns and relationships within the data, leading to more accurate and stable predictions, especially when dealing with high-dimensional datasets or datasets with many irrelevant features. Note that scikit-learn's implementation draws this random feature subset at every split of every tree (see max_features below) rather than once per tree.
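The two ideas above can be sketched in a few lines of plain Python. This is only an illustration: the helper names bootstrap_sample and random_feature_subset, the toy feature list, and the choice of 10 rows and 2 features per tree are all assumptions made for the example, and the feature subset is drawn once per tree as described in the bullet above.

```python
import random


def bootstrap_sample(n_rows, rng):
    """Draw n_rows row indices WITH replacement (one bootstrap sample)."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]


def random_feature_subset(feature_names, k, rng):
    """Pick k distinct features WITHOUT replacement (random subspace)."""
    return rng.sample(feature_names, k)


rng = random.Random(17)
features = ["Glucose", "BMI", "Age", "Insulin", "BloodPressure"]  # toy feature list
n_rows = 10

for tree_id in range(3):
    rows = bootstrap_sample(n_rows, rng)
    cols = random_feature_subset(features, 2, rng)
    # Rows never drawn for this tree are its "out-of-bag" rows.
    out_of_bag = sorted(set(range(n_rows)) - set(rows))
    print(f"tree {tree_id}: rows={rows}, features={cols}, out-of-bag={out_of_bag}")
```

Because rows are drawn with replacement, some rows repeat and others are left out, so every tree sees a slightly different dataset.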

What is the loss function in random forest?

The loss function in random forest is the same as in CART, and it is determined by the criterion parameter. For detailed information, you can read the section about the loss function in the CART article.
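As a quick reminder of what the criterion measures, here is a minimal sketch of two impurity measures that scikit-learn accepts for classification, "gini" and "entropy". The function names and the toy label lists are assumptions made for illustration; this is not the library's internal code.

```python
import math
from collections import Counter


def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits: -sum(p * log2(p)) over the classes present."""
    n = len(labels)
    return -sum((count / n) * math.log2(count / n) for count in Counter(labels).values())


mixed = [0, 0, 1, 1]   # two classes, evenly split: maximally impure for 2 classes
pure = [1, 1, 1, 1]    # a single class: no impurity

print("gini   :", gini(mixed), gini(pure))
print("entropy:", entropy(mixed), abs(entropy(pure)))
```

A split is considered good when it moves the child nodes from the "mixed" situation toward the "pure" one.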

How is the final prediction determined in random forest?

While each tree predicts an outcome, you may have asked yourself how the final prediction is determined. Let’s explain this!
In Random Forest, “voting” and “averaging” are two different methods used to combine the predictions made by the individual decision trees within the ensemble. The choice between voting and averaging depends on whether you’re working on a classification or regression problem.

  1. Voting (Classification Problems): In classification tasks, Random Forest uses a voting mechanism to determine the final prediction. Each decision tree in the ensemble provides its own class prediction, and the class that receives the majority of votes among the trees is selected as the final predicted class.
     - For example, if you have a Random Forest with 100 decision trees, and 70 of them predict Class A while 30 predict Class B for a particular data point, the ensemble’s final prediction will be Class A, as it received the majority of votes.
     - This voting mechanism helps improve the overall classification accuracy and makes the model less prone to making incorrect predictions due to individual tree variations or noise in the data.
  2. Averaging (Regression Problems): In regression tasks, Random Forest uses an averaging method to determine the final prediction. Each decision tree in the ensemble provides its own numerical prediction, and the final prediction is obtained by averaging these numerical predictions.
     - For instance, if you have a Random Forest with 100 decision trees, and each tree predicts a different numerical value for a specific data point, the ensemble’s final prediction is the average of all these numerical predictions.
     - Averaging helps smooth out the predictions and reduce the variance of the model, resulting in more stable and accurate regression predictions.

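The two combination rules described above can be sketched as follows. The helper names and the toy predictions are assumptions for illustration. Also keep in mind that scikit-learn's RandomForestClassifier combines trees by averaging their predicted class probabilities rather than by counting hard votes, so this sketch follows the description in the text, not the library internals.

```python
from collections import Counter
from statistics import mean


def majority_vote(tree_predictions):
    """Classification: return the class predicted by the most trees."""
    return Counter(tree_predictions).most_common(1)[0][0]


def average_prediction(tree_predictions):
    """Regression: return the mean of the trees' numeric predictions."""
    return mean(tree_predictions)


# 100 trees: 70 vote for class A, 30 vote for class B.
class_votes = ["A"] * 70 + ["B"] * 30
print(majority_vote(class_votes))               # A

# Four trees predicting a numeric target.
regression_outputs = [10.0, 12.0, 11.0, 13.0]
print(average_prediction(regression_outputs))   # 11.5
```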
Coding

Let’s do some coding using the diabetes dataset!

NOTE: Since we will build a classification model here, we import RandomForestClassifier from scikit-learn. For regression problems, import RandomForestRegressor instead.

################################################
# IMPORT
################################################

import warnings
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_validate, validation_curve
from sklearn.preprocessing import MinMaxScaler
from sklearn.impute import KNNImputer

pd.set_option('display.max_columns', None)
pd.set_option('display.width', 500)
warnings.simplefilter(action='ignore', category=Warning)

df = pd.read_csv("datasets/diabetes.csv")




####################################
# FUNCTIONS
####################################

def outlier_thresholds(dataframe, col, q1=.05, q3=.95, decimal=3):
    # Limits lie 1.5 * (q3 quantile - q1 quantile) beyond the two quantiles.
    quartile1 = dataframe[col].quantile(q1)
    quartile3 = dataframe[col].quantile(q3)
    iqr = quartile3 - quartile1
    low_limit = round(quartile1 - (iqr * 1.5), decimal)
    up_limit = round(quartile3 + (iqr * 1.5), decimal)
    return low_limit, up_limit



def replace_with_thresholds(dataframe, col_name, q1=.05, q3=.95, lower_limit=None, upper_limit=None):
    # Clip outliers in place; explicitly passed limits override the computed ones.
    low_limit, up_limit = outlier_thresholds(dataframe, col_name, q1, q3)
    if lower_limit is not None:
        dataframe.loc[(dataframe[col_name] < lower_limit), col_name] = lower_limit
    else:
        dataframe.loc[(dataframe[col_name] < low_limit), col_name] = low_limit

    if upper_limit is not None:
        dataframe.loc[(dataframe[col_name] > upper_limit), col_name] = upper_limit
    else:
        dataframe.loc[(dataframe[col_name] > up_limit), col_name] = up_limit



def plot_importance(model, features, num: int = 0, save=False):
    # Bar plot of the top `num` feature importances (all features if num <= 0).
    if num <= 0:
        num = features.shape[1]
    feature_imp = pd.DataFrame({'Value': model.feature_importances_, 'Feature': features.columns})
    plt.figure(figsize=(10, 10))
    sns.set(font_scale=1)
    sns.barplot(x="Value", y="Feature", data=feature_imp.sort_values(by="Value",
                                                                     ascending=False)[0:num])
    plt.title('Features')
    plt.tight_layout()
    if save:
        # Save before plt.show(); saving afterwards can produce an empty image.
        plt.savefig('importances.png')
    plt.show()



def val_curve_params(model, X, y, param_name, param_range, scoring="accuracy", cv=5):
    # Plot mean train/validation scores over the CV folds for each parameter value.
    train_score, test_score = validation_curve(
        model, X=X, y=y, param_name=param_name, param_range=param_range, scoring=scoring, cv=cv)

    mean_train_score = np.mean(train_score, axis=1)
    mean_test_score = np.mean(test_score, axis=1)

    plt.plot(param_range, mean_train_score,
             label="Training Score", color='b')

    plt.plot(param_range, mean_test_score,
             label="Validation Score", color='g')

    plt.title(f"Validation Curve for {type(model).__name__}")
    plt.xlabel(f"Number of {param_name}")
    plt.ylabel(f"{scoring}")
    plt.tight_layout()
    plt.legend(loc='best')
    plt.show(block=True)







###########################################
# DATA PREPROCESSING
###########################################

#Replacing outliers.
cols = [col for col in df.columns if col != "Outcome"]
for col in cols:
    replace_with_thresholds(df, col)



#Columns that cannot contain zero
problematic_cols = [col for col in df.columns if col not in ["Pregnancies",'DiabetesPedigreeFunction','Outcome']]


#Now replace these zeros with NaN
for col in problematic_cols:
    df[col] = df[col].replace(0, np.nan)


# Filling NaN values by using KNN Imputer
scaler=MinMaxScaler()
df=pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
imputer=KNNImputer(n_neighbors=5)
df=pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
df=pd.DataFrame(scaler.inverse_transform(df), columns=df.columns)







##################################################
# FEATURE ENGINEERING
##################################################



df.loc[(df["Age"] <= 18 ), "NEW_AGE"] = "young"
df.loc[(df["Age"] > 18 ) & (df["Age"] <= 24), "NEW_AGE"] = "adult"
df.loc[(df["Age"] > 24 ) & (df["Age"] <= 59), "NEW_AGE"] = "mid_adult"
df.loc[(df["Age"] > 59), "NEW_AGE"] = "senior"



df.loc[(df["BMI"] < 18.5) , "BMI_CAT"] ="underweight"
df.loc[(df["BMI"] >= 18.5) & (df["BMI"] < 24.9) , "BMI_CAT"] ="normal"
df.loc[(df["BMI"] >= 24.9) & (df["BMI"] < 29.9) , "BMI_CAT"]="overweight"
df.loc[(df["BMI"] >= 29.9) , "BMI_CAT"] ="obese"



df.loc[(df["Insulin"] < 15) , "INSULIN_CAT"] ="low"
df.loc[(df["Insulin"] >= 15) & (df["Insulin"] < 166) , "INSULIN_CAT"] ="normal"
df.loc[(df["Insulin"] >= 166) , "INSULIN_CAT"] ="high"


# One Hot Encoding
ohe_cols = [col for col in df.columns if 10 >= df[col].nunique() > 2]
df= pd.get_dummies(df,columns= ohe_cols, drop_first=True)


X = df.drop(["Outcome"], axis=1)
y = df["Outcome"]

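Next, we consider additional candidate features and keep only those that increase cross-validated accuracy. The feature_selecter function below does this greedily: in each round it tries every remaining candidate, adds the one that improves accuracy the most, and stops when no candidate helps. Let’s define it first: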

def feature_selecter(input_x, y, candidate_features_dict:dict, candidate_features_id:list, best_features:list, best_accuracy=0, verbose=True):
    if not candidate_features_id:
        return best_accuracy, best_features
    best_x = input_x
    best_feature = -1
    if best_accuracy == 0:
        # First call: compute the baseline accuracy without any candidate feature.
        rf_model = RandomForestClassifier(random_state=17).fit(input_x, y)

        cv_results = cross_validate(rf_model,
                                    input_x, y,
                                    cv=5,
                                    scoring="accuracy")

        best_accuracy = cv_results["test_score"].mean()

    if verbose:
        print(f"best accuracy(old) = {best_accuracy}")
        #print(candidate_features_id)

    for feature in candidate_features_id:
        X = input_x.copy(deep=True)

        # define your candidate feature here!
        if feature == 0:
            X[candidate_features_dict[feature]] = X["Insulin"]*X["Glucose"]

        elif feature == 1:
            X[candidate_features_dict[feature]] = X["Glucose"]/(X["Insulin"]+0.0001)

        elif feature == 2:
            X[candidate_features_dict[feature]] = X["Age"]*X["Pregnancies"]

        elif feature == 3:
            X[candidate_features_dict[feature]] = X["Age"]/(X["Pregnancies"]+0.0001)

        elif feature == 4:
            X[candidate_features_dict[feature]] = X["Age"]*X["Pregnancies"]*X["Glucose"]

        elif feature == 5:
            X[candidate_features_dict[feature]] = X["Glucose"]/(X["Age"]+0.0001)

        elif feature == 6:
            X[candidate_features_dict[feature]] = X["Insulin"]/(X["Age"]+0.0001)

        elif feature == 7:
            X[candidate_features_dict[feature]] = X["BMI"]*X["Pregnancies"]

        elif feature == 8:
            X[candidate_features_dict[feature]] = X["BMI"]*X["Age"]

        elif feature == 9:
            X[candidate_features_dict[feature]] = X["BMI"]*(X["Age"])*X["Pregnancies"]

        elif feature == 10:
            X[candidate_features_dict[feature]] = X["BMI"]*(X["Glucose"])

        elif feature == 11:
            X[candidate_features_dict[feature]] = X["DiabetesPedigreeFunction"]*(X["Insulin"])

        elif feature == 12:
            X[candidate_features_dict[feature]] = X["SkinThickness"]*(X["Insulin"])

        elif feature == 13:
            X[candidate_features_dict[feature]] = X["Pregnancies"]/(X["Age"]+0.0001)

        elif feature == 14:
            X[candidate_features_dict[feature]] = X["Glucose"]+X["Insulin"]+X["SkinThickness"]

        elif feature == 15:
            X[candidate_features_dict[feature]] = X["BloodPressure"]/(X["Glucose"]+0.0001)

        # Score the data with this single candidate added.
        rf_model = RandomForestClassifier(random_state=17).fit(X, y)

        cv_results = cross_validate(rf_model,
                                    X, y,
                                    cv=5,
                                    scoring="accuracy")

        accuracy = cv_results["test_score"].mean()
        if accuracy > best_accuracy:
            best_accuracy = accuracy
            best_feature = feature
            best_x = X

    # No candidate improved the accuracy: stop the search.
    if best_feature == -1:
        return best_accuracy, best_features

    best_features.append(best_feature)
    candidate_features_id.remove(best_feature)

    if verbose:
        print(f"best accuracy(new) = {best_accuracy}")
        print(f"added feature = {best_feature}", end = '\n\n')
        #print(best_features)

    # Recurse with the winning feature kept and the remaining candidates.
    return feature_selecter(best_x, y, candidate_features_dict, candidate_features_id, best_features, best_accuracy, verbose)
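The function above is a form of greedy forward selection. Stripped of the dataset-specific parts, the idea can be sketched as below. The names forward_select, toy_gain and toy_score are hypothetical, and the toy scoring function merely stands in for the 5-fold cross-validated accuracy used in the real function.

```python
def forward_select(candidates, score_fn, base_score=0.0):
    """Greedy forward selection: keep adding the candidate that improves the score most."""
    selected = []
    remaining = list(candidates)
    best_score = base_score
    while remaining:
        # Score each remaining candidate when added to the current selection.
        scored = [(score_fn(selected + [c]), c) for c in remaining]
        top_score, top_candidate = max(scored, key=lambda pair: pair[0])
        if top_score <= best_score:
            break  # no candidate improves the score, so stop
        best_score = top_score
        selected.append(top_candidate)
        remaining.remove(top_candidate)
    return best_score, selected


# Hypothetical per-feature gains; a real run would use cross-validated accuracy.
toy_gain = {"a": 0.02, "b": 0.01, "c": -0.01}


def toy_score(features):
    return 0.80 + sum(toy_gain[f] for f in features)


score, chosen = forward_select(["a", "b", "c"], toy_score, base_score=0.80)
print(chosen, round(score, 2))  # ['a', 'b'] 0.83
```

Like the recursive version, this search is greedy: it never revisits an earlier choice, so it can miss combinations of features that only help together.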

Then we can run the feature_selecter() function:

candidate_features = {0:"new_glucoseXinsulin",
1:"new_glucose/insulin",
2:"new_ageXpreg",
3:"new_age/preg",
4:"new_ageXpregXglucose",
5:"new_glucose/age",
6:"new_insulin/age",
7:"new_bmiXpreg",
8:"new_bmiXage",
9:"new_bmiXageXpreg",
10:"new_bmiXglucose",
11:"new_degreeXinsulin",
12:"new_skinXinsulin",
13:"new_preg/age",
14:"new_glucose+insulin+skin",
15:"new_blood/glucose"}

accuracy, new_features = feature_selecter(X,y,candidate_features, list(candidate_features.keys()), best_features=[])
'''
best accuracy(old) = 0.8086580086580086
best accuracy(new) = 0.812562600797895
added feature = 6

best accuracy(old) = 0.812562600797895
'''




for feature in new_features:
    print(candidate_features[feature])
'''
new_insulin/age
'''



#So, it seems when we add the insulin/age feature, we obtain 81% accuracy.
#Let's see it.
X["new_insulin/age"] = X["Insulin"]/(X["Age"]+0.0001)








###############################################
# MODEL BUILDING AND EVALUATION
###############################################


#Params before GridSearchCV
rf_model = RandomForestClassifier(random_state=17)
rf_model.get_params()
'''
{'bootstrap': True,
'ccp_alpha': 0.0,
'class_weight': None,
'criterion': 'gini',
'max_depth': None,
'max_features': 'sqrt',
'max_leaf_nodes': None,
'max_samples': None,
'min_impurity_decrease': 0.0,
'min_samples_leaf': 1,
'min_samples_split': 2,
'min_weight_fraction_leaf': 0.0,
'n_estimators': 100,
'n_jobs': None,
'oob_score': False,
'random_state': 17,
'verbose': 0,
'warm_start': False}
'''

#Accuracy before GridSearchCV
cv_results = cross_validate(rf_model, X, y, cv=5, scoring=["accuracy", "precision", "recall", "f1", "roc_auc"])
print(f"Accuracy : {cv_results['test_accuracy'].mean()}") # 0.8125
print(f"Precision : {cv_results['test_precision'].mean()}") # 0.7598
print(f"Recall : {cv_results['test_recall'].mean()}") # 0.6831
print(f"F1 Score : {cv_results['test_f1'].mean()}") # 0.7176
print(f"ROC AUC : {cv_results['test_roc_auc'].mean()}") # 0.8744

We should tune the hyperparameters to try to improve these metrics. But before that, let’s explain some important random forest parameters:

  1. max_depth: This hyperparameter determines the maximum depth of each decision tree in the Random Forest. It restricts the depth of the tree by limiting the number of nodes from the root to a leaf. A smaller value for max_depth makes the individual trees in the forest shallower and less complex, which can help prevent overfitting. On the other hand, a larger value allows the trees to grow deeper, potentially capturing more complex patterns in the data but increasing the risk of overfitting. The default value for the max_depth hyperparameter in decision tree-based models, including Random Forest, is set to "None".
    “None” means that decision trees are allowed to grow until they contain fewer than min_samples_split samples in each leaf node (another hyperparameter) or until all leaves are pure (all samples in a leaf node belong to the same class). While not setting a maximum depth can be beneficial in many cases, it can also lead to overfitting, especially if the dataset is noisy or if there are irrelevant features. Therefore, it’s essential to tune the max_depth hyperparameter during the model training process.
  2. max_features: max_features specifies the maximum number of features (variables or attributes) that should be considered for splitting at each node of a decision tree. It can be set as a fixed number, a fraction of the total number of features, or one of the predefined values like 'sqrt' (square root of the total features) or 'log2' (base-2 logarithm of the total features). By limiting the number of features, you can introduce randomness and diversity into the Random Forest, which helps prevent the model from relying too heavily on a single feature and improves generalization. The default value for RandomForestClassifier is "sqrt", as the get_params() output above shows. Older scikit-learn releases also accepted "auto", which for classifiers meant the same thing as "sqrt"; this alias was deprecated in version 1.1 and removed in version 1.3. Empirical studies and practical experience have shown that the square-root setting often yields good results for a wide range of datasets. It strikes a balance between exploring different aspects of the data and preventing the model from overfitting. While "sqrt" is a good default choice, you can customize the max_features hyperparameter to better suit your specific dataset and problem.
  3. n_estimators: This hyperparameter determines the number of decision trees that will be included in the Random Forest ensemble. A higher value of n_estimators typically leads to a more robust and accurate model. However, there may be diminishing returns beyond a certain point, as adding more trees can increase computational cost without significant improvements in performance. It's common to start with a reasonable number of trees and then use cross-validation to find the optimal value. By default, n_estimators is set to 100.
  4. min_samples_split: min_samples_split sets the minimum number of samples required to split an internal node in a decision tree. If the number of samples in a node is less than min_samples_split, further splitting is halted, and the node becomes a leaf. This hyperparameter can help control the depth and complexity of the trees. A smaller value may result in deeper trees with finer splits, while a larger value can lead to shallower trees with fewer splits. By default, min_samples_split is set to 2.
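To make the max_features options concrete, the sketch below computes how many features would be examined per split for each kind of setting. The helper resolve_max_features is hypothetical and only mirrors the rules described in the scikit-learn documentation (integer used as is, float treated as a fraction, "sqrt" and "log2" rounded down, None meaning all features); it is not the library's own code.

```python
import math


def resolve_max_features(max_features, n_features):
    """Number of features examined at each split for a given max_features setting."""
    if max_features is None:
        return n_features                                  # use every feature
    if max_features == "sqrt":
        return max(1, int(math.sqrt(n_features)))
    if max_features == "log2":
        return max(1, int(math.log2(n_features)))
    if isinstance(max_features, float):
        return max(1, int(max_features * n_features))      # fraction of the features
    return int(max_features)                               # fixed count


for setting in ["sqrt", "log2", 0.5, 5, None]:
    print(setting, "->", resolve_max_features(setting, n_features=16))
```

With 16 features, "sqrt" and "log2" both resolve to 4 features per split, 0.5 resolves to 8, and None uses all 16.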

Let’s apply GridSearchCV and find the best hyperparameters.

#Apply GridSearchCV
#We should also specify the default value of the parameters in the dictionary
#below, so that we do not get a worse result than the initial performance.
#Note: "auto" behaves like "sqrt" for classifiers and was removed in
#scikit-learn 1.3; replace it with "sqrt" on newer versions.
rf_params = {"max_depth": [5, 8, None],
             "max_features": [3, 5, 7, "auto"],
             "min_samples_split": [2, 5, 8, 15, 20],
             "n_estimators": [100, 200, 500]}


rf_best_grid = GridSearchCV(rf_model, rf_params, cv=5, n_jobs=-1, verbose=True).fit(X,y)

print(rf_best_grid.best_params_)
'''
Fitting 5 folds for each of 180 candidates, totalling 900 fits
{'max_depth': None, 'max_features': 'auto', 'min_samples_split': 2, 'n_estimators': 100}
'''
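The "180 candidates, totalling 900 fits" line in the log above follows directly from the size of the grid. As a small check, assuming the same rf_params dictionary and 5 folds:

```python
from itertools import product

rf_params = {"max_depth": [5, 8, None],
             "max_features": [3, 5, 7, "auto"],
             "min_samples_split": [2, 5, 8, 15, 20],
             "n_estimators": [100, 200, 500]}

# Every combination of one value per hyperparameter is one candidate model.
candidates = list(product(*rf_params.values()))
n_folds = 5

print(len(candidates))            # 180  (3 * 4 * 5 * 3)
print(len(candidates) * n_folds)  # 900  (each candidate is fitted once per fold)
```

This is why adding even one more value to a single list noticeably increases the running time of the search.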

So, the best hyperparameter values found are:
max_depth = None
max_features = 'auto'
min_samples_split = 2
n_estimators = 100
Remember, these match the default values ("auto" is equivalent to the default "sqrt" for a classifier). Therefore, hyperparameter tuning did not improve the performance. Still, let’s build a final model with these values:

#Build the final model with the hyperparameter values that we have found.
rf_final = RandomForestClassifier(**rf_best_grid.best_params_, random_state=17).fit(X, y)


#5-Fold CV results for the final model
cv_results = cross_validate(rf_final, X, y, cv=5, scoring=["accuracy", "precision", "recall", "f1", "roc_auc"])
print(f"Accuracy : {cv_results['test_accuracy'].mean()}") # 0.8125
print(f"Precision : {cv_results['test_precision'].mean()}") # 0.7598
print(f"Recall : {cv_results['test_recall'].mean()}") # 0.6831
print(f"F1 Score : {cv_results['test_f1'].mean()}") # 0.7176
print(f"ROC AUC : {cv_results['test_roc_auc'].mean()}") # 0.8744

As a result, our final performance metrics are:

  • Accuracy : 0.8125
  • Precision : 0.7598
  • Recall : 0.6831
  • F1 Score : 0.7176
  • ROC AUC : 0.8744

######################################
# FEATURE IMPORTANCE
######################################

plot_importance(rf_final, X)
[Figure: importance.png, bar plot of the feature importances]

Looking at the graph above, we observe that the “new_insulin/age” variable that we created carries substantial importance in the model.

#######################################
# VALIDATION CURVE
#######################################

val_curve_params(rf_final, X, y, "max_depth", range(1, 11), scoring="accuracy")
[Figure: validation_curve.png, training and validation accuracy versus max_depth]

Thanks for reading…
