Random Forest: From Model Building to Hyperparameter Tuning in Python
Today we will go over one of the bagging methods, Random Forest, and use it to predict an air pollutant in China. The model is called a “forest” because it is built from multiple decision trees, and “random” because each tree is trained on a random subset of rows (a bootstrap sample) and considers only a random subset of features (columns) at each node split.
I assume readers know what bagging and decision trees are; if not, I recommend reading my previous blogs for a brief overview: Decision Tree part I, Decision Tree part II, and Ensemble Learning methods.
Before we get started, I’ve noticed that a lot of people (myself included 🤣) are confused about what exactly “random forest” refers to. Does it mean that:
- each tree gets all features, then uses a subset of features at each node,
- each tree gets a subset of features, then uses all of them at each node, or
- each tree gets a subset of features and also uses a subset of those at each node?
After doing some research I found the answer, so hopefully no one will be confused from now on. Thanks to https://sebastianraschka.com/faq/docs/random-forest-feature-subsets.html we now know that:
random forest = each tree gets all features, then uses a subset of features at each node.
random subspace method = each tree gets a subset of features and also uses a subset of features at each node.
We are going to use the Random Forest Regressor implemented in Python to predict air quality, using a dataset provided by the Beijing Municipal Environmental Monitoring Center, which can be downloaded here → https://archive.ics.uci.edu/ml/datasets/Beijing+Multi-Site+Air-Quality+Data
This dataset contains hourly air-pollutant data from 12 nationally controlled air-quality monitoring sites in China, from March 1st, 2013 to February 28th, 2017, which we will use to predict one of the pollutants, “PM2.5”.
First, we will import our dependencies and combine the CSV files, since each city’s data is stored in a separate CSV file.
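A minimal sketch of that step (the folder name in the usage comment is a placeholder for wherever you extracted the per-city files):

```python
import glob

import pandas as pd


def load_city_files(pattern):
    """Read every per-city CSV matching `pattern` and stack them into one DataFrame."""
    files = sorted(glob.glob(pattern))
    frames = [pd.read_csv(f) for f in files]
    # ignore_index gives the combined frame one continuous row index
    return pd.concat(frames, ignore_index=True)


# e.g. df = load_city_files("PRSA_Data_20130301-20170228/*.csv")
```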
Our df looks like this:
with df.shape = (420768, 18).
As always, we are going to check whether it contains null values. Running
(df.isna().sum()/len(df))*100 gives the percentage of null values in each column, and if you want to see it per city, run
round((df.groupby(["station"]).count()[df.columns[5:-1]]/35064)*100, 2), which gives the percentage of non-null values per station. Note that we divide by 35064 because that is the number of rows each city contains.
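The two checks can be wrapped into one runnable helper. This sketch simplifies the second check by reporting per-station non-null percentages for every column rather than the df.columns[5:-1] slice, and makes 35064 a parameter so it also works on small examples:

```python
import pandas as pd


def null_report(df, rows_per_city=35064):
    """Return (% of null values per column overall, % of non-null rows per station)."""
    pct_null = (df.isna().sum() / len(df)) * 100
    # count() counts non-null values, so this is the non-null percentage per station
    pct_non_null = round((df.groupby("station").count() / rows_per_city) * 100, 2)
    return pct_null, pct_non_null
```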
Looking at the dataframe above, most cities seem to have less than 5% missing values in most columns.
For simplicity’s sake we will simply drop the null values, even though there are other methods of filling them in, for those of you who want to explore:
- Fill with the mean/median/mode value.
- Fill using a highly correlated column.
- Use the other features to predict the missing values in a column.
We will drop the null values, drop the “No” column (it is just an incrementing id), and check the datatype of each column.
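As a quick sketch of that cleanup step:

```python
import pandas as pd


def drop_nulls_and_id(df):
    """Drop rows containing any null value, then the incrementing 'No' id column."""
    return df.dropna().drop(columns=["No"])


# Afterwards, df.dtypes shows which columns still need encoding.
```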
All columns except “wd” and “station” are numerical, so we only need to convert those two. We will convert them using one-hot encoding, since neither categorical column has a rank order.
The best part of random forest is that it does not need much data preprocessing or feature selection; notice that we didn’t even scale the numerical values. Unlike regression models that rely on distances between features, each question a decision tree asks uses a single feature and an impurity-based threshold, so there is no need to scale.
We’ll apply one-hot encoding to the “wd” and “station” columns, then split the data into train and test sets. But before building the Random Forest, we must create a base model to check whether our ML model is actually doing better than a naive baseline (like random guessing yes/no in binary classification); if not, we should consider using a different ML model.
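A sketch of the encoding-and-splitting step (the test_size and random_state values are my assumptions, not necessarily the ones used in the original notebook):

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def encode_and_split(df, target="PM2.5", test_size=0.2, seed=42):
    """One-hot encode the two categorical columns, then split into train/test sets."""
    df = pd.get_dummies(df, columns=["wd", "station"])
    X = df.drop(columns=[target])
    y = df[target]
    return train_test_split(X, y, test_size=test_size, random_state=seed)
```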
For our base model, we’ll use the mean value of PM2.5 in the train set to predict every value in the test set.
So as long as our model obtains a Mean Absolute Error below 59.12 we should be fine (though testing other models is still recommended).
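The baseline can be sketched in a few lines:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error


def baseline_mae(y_train, y_test):
    """Score a model that always predicts the train-set mean of the target."""
    pred = np.full(len(y_test), np.mean(y_train))
    return mean_absolute_error(y_test, pred)
```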
Now it’s time to build the Random Forest.
One very important thing to note: by default, sklearn’s RandomForestRegressor lets each tree use all features at each node split, i.e.
max_features = n_features. So technically the default RandomForestRegressor is not a random forest, just ordinary bagging with multiple decision trees. To be a true “Random Forest” you must choose
max_features < n_features. According to https://en.wikipedia.org/wiki/Random_forest#From_bagging_to_random_forests,
max_features = sqrt(n_features) is recommended for classification and
max_features = n_features/3 is recommended for regression.
Now, let’s use a real “Random Forest” to train our model. You can see a significant reduction in computation time (thanks to using a subset of features at each node) with only a small loss of accuracy. In a production setting where speed is important, I would definitely use the Random Forest model. The reason we used
max_features = 6 is that there are 16 features, and 16/3 rounded up is 6. We can also see that the random forest’s test MAE is much lower than the base model’s, so it generalizes far better.
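A minimal sketch of that model (n_estimators=100 is sklearn’s default; the random_state value is an assumption):

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error


def fit_random_forest(X_train, y_train, max_features=6, n_estimators=100):
    """16 features / 3 rounded up -> max_features=6, making this a true random forest."""
    rf = RandomForestRegressor(n_estimators=n_estimators,
                               max_features=max_features,
                               random_state=42, n_jobs=-1)
    rf.fit(X_train, y_train)
    return rf


# test_MAE = mean_absolute_error(y_test, rf.predict(X_test))
```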
After validating the Random Forest, it is time to tune hyperparameters for maximum performance. We will use GridSearchCV from sklearn, which is very simple to understand: it tries every combination of the hyperparameters given in
param_grid and measures model performance for each combination using k-fold cross-validation. In our case there will be (3*3*6)*cv = 54*3 = 162 fits.
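A sketch of the search. The exact grid values here are my assumption, chosen only to match the 3×3×6 combination count and to include the winning parameters; the original grid may have differed:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Hypothetical 3*3*6 = 54-combination grid; with cv=3 that is 162 fits.
PARAM_GRID = {
    "max_depth": [50, 100, None],
    "max_features": [6, 9, 12],
    "n_estimators": [50, 100, 150, 200, 250, 300],
}


def tune(X_train, y_train, param_grid=PARAM_GRID, cv=3):
    """Exhaustively try each combination, scored by k-fold cross-validated MAE."""
    search = GridSearchCV(RandomForestRegressor(random_state=42, n_jobs=-1),
                          param_grid, cv=cv,
                          scoring="neg_mean_absolute_error")
    search.fit(X_train, y_train)
    return search.best_estimator_
```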
The best_estimator_ attribute returns the model with the parameters that gave the best performance, which in this case is max_depth=100, max_features=12, and n_estimators=300.
Here is MAE calculated by our best model.
train_MAE = 3.55
test_MAE = 9.69
test_MAE decreased by 5.4% compared to the Random Forest before hyperparameter tuning, which is pretty good. But keep in mind that the best Random Forest uses 300 decision trees (n_estimators) instead of 100, so it is more computationally heavy, even though keeping max_features below n_features reduces computation time to some extent.
Since we do not need to worry about training time here, let’s use the best Random Forest from above to measure feature importance.
The importance of each feature is calculated by averaging the amount of impurity reduction achieved by splits on that feature across all nodes and trees in the Random Forest. For example, if the PM10 feature was used in 10 nodes and reduced impurity by [1, 2, 3, …, 10] at those nodes, then its feature importance would be
(1+2+3+…+10)/10 = 5.5. Each feature importance is calculated this way after training, then scaled so that all importances sum to 1.
Since feature importances are calculated from how much each feature decreases impurity on average during training, if the model performs poorly on the test set, feature_importances_ becomes less trustworthy.
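Reading the importances off a fitted model can be sketched like this (the feature names correspond to the training columns):

```python
import numpy as np


def top_features(model, feature_names, k=5):
    """Rank features by impurity-based importance; sklearn scales them to sum to 1."""
    imp = model.feature_importances_
    order = np.argsort(imp)[::-1][:k]  # indices of the k largest importances
    return [(feature_names[i], float(imp[i])) for i in order]
```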
In conclusion, Random Forest is easy to implement in Python and comparatively easy to understand if you know what decision trees are. It is a very powerful method because of its interpretability, its automatic feature selection, and its ability to reduce variance by averaging multiple decision trees. However, one should always be careful about computation time, since an unpruned Random Forest can consume a lot of compute, especially when the data is not easily separable.
- Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, chapter 7
- max_features parameter in RandomForestRegressor
- Does random forest select a subset of features for every tree or node?
- Recommended max_features for random forest classification and regression
- Scikit-learn RandomForestRegressor documentation
- Decision trees in Random Forest vs. Bagging