Visualizing the Decision Trees of a RandomForestRegressor in a Pipeline with Python
The purpose of this post is to help people who are in doubt, as I was, about how to visualize a model that lives inside a pipeline. If you ran into the error “This RandomForestRegressor instance is not fitted yet”, this post will probably help you.
Introduction
Random forest is a supervised machine learning technique commonly used for classification and regression. It works by building a set of decision trees, each grown on a random sample of your data, with the way the data is split governed by the parameters passed when the trees are created. Within a tree, each decision is called a node, and the act of making that decision is called a split. The main parameters are: tree depth, the minimum number of samples needed to make a decision, the minimum number of samples needed to create a leaf, the maximum number of features considered for a decision, and the number of trees in the forest.
The scikit-learn Pipeline has come to save our lives by simplifying and standardizing the way we preprocess the features used in our models. It applies a sequence of steps: first the transformers, which adjust the scale of the independent variables so that none of them weighs more on the model just because of its scale, and then the estimator, which is the model itself, a random forest regressor in our case.
In this case we will use as an example a time series of a company’s revenue for a specific product, and as independent variables we will use the GDP and the unemployment rate of Russia.
The dataset and the Jupyter Notebook (.ipynb) used are on my GitHub if you want to play with the code and run other tests with the dataset. Enough talk, let’s see the code.
First of all, let’s import the libraries used to prepare the pipeline.
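A minimal sketch of the imports this walkthrough relies on, assuming scikit-learn, pandas and NumPy are installed. The exact list in the notebook may differ:

```python
# Data handling
import numpy as np
import pandas as pd

# Preprocessing steps (the transformers of the pipeline)
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Train/test split, the pipeline itself and the estimator
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
```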
Loading and checking the dataset.
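If you don't have the dataset at hand, here is a small synthetic stand-in so the rest of the steps can be followed. The column names (`GDP`, `Unemployment_Rate`, `Revenue`), the number of rows and the values are all made up for illustration and are not the real data:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (assumption): NOT the real revenue dataset.
rng = np.random.default_rng(0)
n_rows = 40

df = pd.DataFrame({
    "GDP": rng.uniform(1.0e12, 2.0e12, size=n_rows),        # very large scale
    "Unemployment_Rate": rng.uniform(4.0, 8.0, size=n_rows),  # small scale
})
# Hypothetical target built from the two features plus noise.
df["Revenue"] = (
    5e-10 * df["GDP"]
    - 20.0 * df["Unemployment_Rate"]
    + rng.normal(0.0, 10.0, size=n_rows)
)

# Quick checks: first rows, column types and summary statistics.
print(df.head())
print(df.dtypes)
print(df.describe())
```

With your own file you would replace the `DataFrame` construction with something like `pd.read_csv(...)` pointing at your data.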
Notice that our independent variables live on very different scales: the GDP is roughly 10¹² times bigger than the Unemployment Rate. Left as is, the GDP would dominate any scale-sensitive step such as PCA purely because of its magnitude, so we’ll use the StandardScaler and the Principal Component Analysis to adjust our data.
We split our dataset into training and testing sets with the train_test_split function. The result is 4 datasets: train and test for the dependent variable, and train and test for the independent variables.
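To see the effect of the scaling, here is a small sketch on synthetic values (the numbers are an assumption, chosen only to mimic the difference in magnitude). After StandardScaler each column has mean 0 and standard deviation 1, and PCA then works on comparable columns:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic values (assumption) that mimic the two very different scales.
rng = np.random.default_rng(0)
gdp = rng.uniform(1.0e12, 2.0e12, size=40)
unemployment = rng.uniform(4.0, 8.0, size=40)
X = np.column_stack([gdp, unemployment])

print("raw means:", X.mean(axis=0))

# StandardScaler: subtract the column mean, divide by the column std.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
print("scaled means:", X_scaled.mean(axis=0))
print("scaled stds: ", X_scaled.std(axis=0))

# PCA on the scaled columns; with 2 components out of 2 features
# all of the variance is kept.
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
print("explained variance ratio:", pca.explained_variance_ratio_)
```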
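A sketch of the split on synthetic stand-in data. The column names, `test_size=0.25` and `random_state=42` are assumptions for illustration, not necessarily the values used in the notebook:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): NOT the real dataset.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "GDP": rng.uniform(1.0e12, 2.0e12, size=40),
    "Unemployment_Rate": rng.uniform(4.0, 8.0, size=40),
})
df["Revenue"] = 5e-10 * df["GDP"] - 20.0 * df["Unemployment_Rate"]

X = df[["GDP", "Unemployment_Rate"]]  # independent variables
y = df["Revenue"]                     # dependent variable

# Returns the 4 datasets in this order:
# X train, X test, y train, y test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
```

Keep in mind that train_test_split shuffles the rows by default. For a time series you may prefer `shuffle=False`, so the test set is the most recent period.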
Creating the RandomForestRegressor model
We need to define the parameters. Our random forest will have 3 decision trees (n_estimators), each with a maximum depth of 2 levels (max_depth), considering at most 2 features for each split (max_features) and requiring at least 3 samples to create a leaf (min_samples_leaf). After creating the model, let’s put it into a pipeline together with the transformers mentioned above.
If you arrived at this post searching for something like its title, you probably already know everything I said above and just want to visualize your trees. Now that everything is in place, let’s install the pydotplus and graphviz visualization packages. Run the install commands in your command prompt and press Enter. I’m using Anaconda to manage my packages; if you aren’t, just install them through pip or pip3. More details are in the references.
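Putting the parameters above into code, a minimal sketch could look like this. The step names (`"scaler"`, `"pca"`, `"model"`), `random_state=0` and the synthetic data are my assumptions; the four forest parameters are the ones described above:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in training data (assumption): NOT the real dataset.
rng = np.random.default_rng(0)
X_train = pd.DataFrame({
    "GDP": rng.uniform(1.0e12, 2.0e12, size=30),
    "Unemployment_Rate": rng.uniform(4.0, 8.0, size=30),
})
y_train = 5e-10 * X_train["GDP"] - 20.0 * X_train["Unemployment_Rate"]

# The random forest with the parameters described in the text.
model = RandomForestRegressor(
    n_estimators=3,       # 3 decision trees
    max_depth=2,          # at most 2 levels per tree
    max_features=2,       # at most 2 features per split
    min_samples_leaf=3,   # at least 3 samples in a leaf
    random_state=0,       # assumption: only for reproducibility
)

# Transformers first, estimator last. Step names are hypothetical.
pipeline = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=2)),
    ("model", model),
])

# The forest only has trees after the pipeline has been fitted.
pipeline.fit(X_train, y_train)

forest = pipeline.named_steps["model"]
print(len(forest.estimators_), "trees fitted")
```

The fitted forest is reached through the pipeline by its step name, which is what we need later to get at the individual trees.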
Visualize the Decision Tree
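With the packages successfully installed, let’s import the libraries and plot the first Decision Tree in the Random Forest Regressor. To visualize the other Decision Trees, just change the index in estimators_[X].
X[0] and X[1] are the two inputs the tree receives, derived from our independent variables, GDP and Unemployment Rate. Because the pipeline transforms the data before it reaches the forest, the split thresholds are on the transformed scale rather than in the original units, and if PCA is one of the steps each input is a principal component that mixes both variables. The mse is the error metric, Mean Squared Error, and value is the predicted value of our target variable, Revenue.
That’s all folks, I hope this helped you. If you have any questions or suggestions, let me know. 😄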
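Here is a sketch of the whole path from a fitted pipeline to a tree image. It assumes the step name `"model"`, synthetic stand-in data and a hypothetical helper `save_tree_png`. The DOT text comes from scikit-learn's `export_graphviz`, and the rendering uses `pydotplus`, which also needs the Graphviz binaries installed:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import export_graphviz

# Synthetic stand-in data (assumption): NOT the real dataset.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "GDP": rng.uniform(1.0e12, 2.0e12, size=40),
    "Unemployment_Rate": rng.uniform(4.0, 8.0, size=40),
})
y = 5e-10 * X["GDP"] - 20.0 * X["Unemployment_Rate"] + rng.normal(0.0, 10.0, size=40)

pipeline = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=2)),
    ("model", RandomForestRegressor(
        n_estimators=3, max_depth=2, max_features=2,
        min_samples_leaf=3, random_state=0,
    )),
])
pipeline.fit(X, y)  # fit first, otherwise there are no trees to plot

# Reach the fitted forest inside the pipeline, then pick one tree.
forest = pipeline.named_steps["model"]
first_tree = forest.estimators_[0]  # change the index for other trees

# Export the tree as Graphviz DOT text (out_file=None returns a string).
dot_data = export_graphviz(first_tree, out_file=None, filled=True, rounded=True)


def save_tree_png(dot_text, path):
    """Hypothetical helper: render DOT text to a PNG file with pydotplus."""
    import pydotplus  # imported here so the export above works without it

    graph = pydotplus.graph_from_dot_data(dot_text)
    graph.write_png(path)
```

Calling `save_tree_png(dot_data, "tree_0.png")` writes the image to disk. In a notebook you could instead display it inline with `IPython.display.Image(graph.create_png())`.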
References
https://medium.com/@rnbrown/creating-and-visualizing-decision-trees-with-python-f8e8fa394176
https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html
https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html
https://pypi.org/project/pydotplus/
https://pypi.org/project/graphviz/
https://medium.com/databootcamp/pipeline-da-scikit-learn-e71c79ec744b