Random Forest Algorithm With Python

Abhijeet Pujara
Analytics Vidhya


This article covers four parts:

  1. What is a Random Forest?
  2. Random Forest Algorithm Applications
  3. Advantages of Random Forest
  4. Random Forest with Python (with code)

In this article, we will explore the famous supervised machine learning algorithm “Random Forest” and walk through its code in Python.

What is a Random Forest?

Random forests are bagged decision tree models that consider only a random subset of the features at each split. Random forest differs from vanilla bagging in just one way: it uses a modified tree learning algorithm that inspects, at each split in the learning process, a random subset of the features. We do so to reduce the correlation between the trees. Suppose that we have one very strong predictor in the data set along with several other moderately strong predictors. In a collection of bagged trees, most or all of the decision trees will use the strong predictor for the first split, so all the bagged trees will look similar and their predictions will be highly correlated. Averaging highly correlated predictions reduces variance far less than averaging uncorrelated ones, so it does little to improve the accuracy of prediction. By taking a random subset of features at each split, Random Forests decorrelate the trees and enhance the model’s performance. The example below illustrates how the Random Forest algorithm works.
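To make the correlation argument concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not code from the original article: the helper names `candidate_features` and `ensemble_variance` are hypothetical, and the default of roughly sqrt(n_features) candidates per split is a common convention for classification rather than something the text specifies. The variance expression is the standard one for the mean of identically distributed predictions that share a common pairwise correlation.

```python
import math
import random


def candidate_features(n_features, rng, k=None):
    """Return the random subset of feature indices examined at one split.

    Assumption: k defaults to about sqrt(n_features), a common choice
    for classification; the article does not fix a value.
    """
    if k is None:
        k = max(1, round(math.sqrt(n_features)))
    return sorted(rng.sample(range(n_features), k))


def ensemble_variance(sigma2, rho, n_trees):
    """Variance of the mean of n_trees predictions.

    Each prediction has variance sigma2 and every pair has correlation rho:
        rho * sigma2 + (1 - rho) * sigma2 / n_trees
    """
    return rho * sigma2 + (1 - rho) * sigma2 / n_trees


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable

    # Each split draws its own candidates, so a dominant feature is not
    # available at every split of every tree.
    print(candidate_features(64, rng))
    print(candidate_features(64, rng))

    # Highly correlated trees: adding trees barely lowers the variance.
    print(ensemble_variance(1.0, 0.9, 100))  # close to 0.901
    # Weakly correlated trees: the variance falls much further.
    print(ensemble_variance(1.0, 0.1, 100))  # close to 0.109
```

The first term, `rho * sigma2`, does not shrink as trees are added, which is why lowering the correlation between trees matters more than simply growing a larger forest.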

Random Forest Algorithm Applications

In e-commerce, the Random Forest algorithm can be used to predict whether a customer will like the recommended products, based on the experience of similar customers.

In the stock market, the Random Forest algorithm can be used to identify a stock’s behavior and the expected loss or profit.

In medicine, the Random Forest algorithm can be used both to identify the correct combination of components and to identify diseases by analyzing the patient’s medical records.

Advantages of Random Forest

1. Random Forest is based on the bagging algorithm and uses the ensemble learning technique. It builds many trees, each on a random sample of the data, and combines the output of all the trees. In this way, it reduces the overfitting problem of individual decision trees, lowers the variance, and therefore improves accuracy.

2. Random Forest can be used to solve both classification and regression problems.

3. Random Forest works well with both categorical and continuous variables.

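4. Random Forest can handle missing values, although how depends on the implementation: some libraries handle them natively, while others expect missing values to be imputed first.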

5. No feature scaling required: Random Forest needs no feature scaling (standardization or normalization) because it uses a rule-based approach instead of distance calculations.

6. Handles non-linear relationships efficiently: Non-linearity does not hurt the performance of a Random Forest, unlike curve-based algorithms. So, if the relationship between the independent variables and the target is highly non-linear, Random Forest may outperform curve-based algorithms.

7. Random Forest is usually robust to outliers and can handle them automatically.

8. Random Forest is comparatively less impacted by noise.
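
The bagging idea in point 1 can be sketched in a few lines of plain Python. This is a minimal illustration, not the article's code: `bootstrap_sample`, `majority_vote` and `average_prediction` are hypothetical helper names, and the per-tree predictions in the usage lines are made-up placeholders rather than output from trained trees.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement: the training set for one tree."""
    return [rng.choice(rows) for _ in rows]


def majority_vote(predictions):
    """Combine class labels predicted by the trees (classification)."""
    return Counter(predictions).most_common(1)[0][0]


def average_prediction(predictions):
    """Combine numeric predictions from the trees (regression)."""
    return sum(predictions) / len(predictions)


rng = random.Random(42)  # fixed seed so the run is repeatable
rows = list(range(10))   # stand-in for ten training rows

# Each tree gets its own resampled copy of the data; typically some rows
# repeat and some are left out.
for tree_id in range(3):
    print(tree_id, bootstrap_sample(rows, rng))

# Placeholder per-tree outputs, not results from fitted trees.
print(majority_vote(["spam", "ham", "spam"]))  # spam
print(average_prediction([2.0, 4.0, 6.0]))     # 4.0
```

Voting covers the classification case and averaging covers the regression case mentioned in point 2; in both, the combined answer varies less than any single tree's answer.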

Random Forest With Python

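# Attach the class labels to the feature DataFrame
# (df and digits are assumed to be defined earlier in the notebook)
df['target'] = digits.target

# Accuracy of the fitted model on the held-out test set
model.score(X_test, y_test)

# Plot the confusion matrix cm as an annotated heatmap (Jupyter notebook)
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sn
sn.heatmap(cm, annot=True)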
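
The fragments above refer to names (`digits`, `df`, `model`, `X_test`, `y_test`, `cm`) whose definitions are not shown. The sketch below is one plausible way to fill in those steps with scikit-learn; it is an assumption, not the author's original code. In particular, the use of `load_digits`, the 80/20 split, `n_estimators=100` and the fixed `random_state` are illustrative choices.

```python
import pandas as pd
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Load the handwritten-digits data set (assumed source of `digits`).
digits = load_digits()
df = pd.DataFrame(digits.data)
df["target"] = digits.target

# Separate features from labels and hold out 20% for testing (assumed split).
X = df.drop(columns="target")
y = df["target"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Fit the forest; n_estimators is the number of trees (illustrative value).
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Mean accuracy on the held-out test set.
accuracy = model.score(X_test, y_test)
print(f"Test accuracy: {accuracy:.3f}")

# Rows are true digits, columns are predicted digits.
y_pred = model.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
print(cm)
```

The resulting `cm` can be passed to `sn.heatmap(cm, annot=True)` as in the fragment above; the off-diagonal cells show which digits the forest confuses with one another.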

Happy Learning !!!

Happy coding :)



