vAlgo ML : The Random Forest Experiment
Hello Learners,
Continuing with my vAlgo experiment (original post here).
In my last post, I wrote about the alternatives to AdaBoost that were presented to me, and how I decided to try out the Random Forest Classifier.
This is going to be a long one, so let’s dive in.
Random Forest Classifier.
A brief definition:
Random Forest is an ensemble method that combines multiple decision trees. It’s less prone to overfitting than a single decision tree, and because its trees are built independently of each other, training can run in parallel, which often makes it faster to train than AdaBoost, whose learners are built one after another. Random Forests can handle large datasets and tend to be less sensitive to hyper-parameter tuning.
# The library
from sklearn.ensemble import RandomForestClassifier
# Create a Random Forest Classifier
clf = RandomForestClassifier(n_estimators=100, max_depth=None, min_samples_split=2, random_state=0)
# Train your model and proceed forward
I set up my classifier like this:
model = RandomForestClassifier(n_estimators=50, max_depth=None, min_samples_split=2, random_state=0)
I then set up a very long chart to document the accuracy score as I changed each parameter individually.
The four parameters I manipulated were:
n_estimators, max_depth, min_samples_split, and random_state.
Result Summary Chart
The experiment I did was to change one parameter at a time, manually, so that I can keep track of what is happening. The goal was to get an accuracy score above 3%. Happy to say, I was able to do that. After many iterations and changes, I was able to create the above chart of the best results.
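The one-parameter-at-a-time procedure can also be scripted. The sketch below is not the notebook used for this post: it uses a synthetic dataset from make_classification as a stand-in for the real data, and the candidate values are assumptions picked for illustration. Only the baseline configuration comes from the post.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: the real vAlgo dataset is not shown here).
X, y = make_classification(n_samples=200, n_features=8, n_informative=4,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Baseline configuration from the post.
baseline = {"n_estimators": 50, "max_depth": None,
            "min_samples_split": 2, "random_state": 0}

# Hypothetical candidate values; each parameter is varied on its own.
candidates = {
    "n_estimators": [25, 50, 100],
    "max_depth": [None, 5, 10],
    "min_samples_split": [2, 3, 4],
}

results = []
for name, values in candidates.items():
    for value in values:
        # Start from the baseline and override a single parameter.
        params = {**baseline, name: value}
        model = RandomForestClassifier(**params)
        model.fit(X_train, y_train)
        score = accuracy_score(y_test, model.predict(X_test))
        results.append((name, value, score))
        print(f"{name}={value}: {score:.2%}")
```

Because the split and random_state are fixed here, any change in the score comes from the one parameter that was changed.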
I’m still using the original dataset that I started with, but now instead of a “Decision Tree”, I’m using a “Random Forest Classifier”.
I noticed some interesting things, such as how certain configurations worked better than others. Full transparency: each of the higher results happened only once at this point. When I re-ran the cells with that same configuration, the results varied, yet the lower ones stayed consistent. We’ll find out why later on.
Let’s break that summary chart down:
n_estimators: 50
That gave me an accuracy of 9%.
max_depth: 5
That gave me an accuracy of 12%.
min_samples_split: 3
That gave me an accuracy of 9%.
random_state: None
That gave me an accuracy of 16%.
So, with that knowledge, I re-configured the parameters, combining each of the values above, and ran it again. You know, “with our powers combined”, Captain Planet style.
This was the result.
And that puts me back at where I started, at 3%.
Ok, that’s disheartening, but also interesting. I would have thought that if you puzzle together all the best parts, you would get the best result. I guess that’s not the case here.
Ok, let me reproduce the one that gave me the highest result.
Hmm, not “16%” as before, but still more than 3%.
I want to try something. I’m leaving it at this configuration, since I know that gave me the highest result, and I will run it 10 times to see how the results change.
1: 6%
2: 0%
3: 6%
4: 3%
5: 6%
6: 3%
7: 9%
8: 9%
9: 3%
10: 9%
And THAT is when I realized the flaw in my experiment. I did not take the epoch cycle into consideration. I was looking at how the results changed as I changed the parameters, but didn’t keep in mind that with each run I do, it is training through the data. Yes, there are minor modifications, but it is nevertheless training. So my current assumption is that the high results may have been partly due to the training it had gone through by the time I arrived at that point.
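One way to test this assumption is to pin down the two sources of randomness in each run: the train/test split and the forest itself. The sketch below uses a synthetic dataset from make_classification as a stand-in for the real data, and run_once is a hypothetical helper, not code from the original notebook. In scikit-learn, a newly created RandomForestClassifier is fitted from scratch (warm_start defaults to False), so if the seeded runs all agree while the unseeded runs move around, the spread is more likely coming from random_state=None and the unseeded split than from earlier training.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: not the dataset used in this post).
X, y = make_classification(n_samples=160, n_features=8, n_informative=4,
                           n_classes=3, random_state=1)


def run_once(split_seed, model_seed):
    """Split the data, build a brand-new forest, fit it, return test accuracy."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=split_seed)
    model = RandomForestClassifier(n_estimators=50, random_state=model_seed)
    model.fit(X_train, y_train)
    return accuracy_score(y_test, model.predict(X_test))


# Same seeds on every run: the split and the forest are identical each time.
seeded = [run_once(split_seed=0, model_seed=0) for _ in range(5)]

# No seeds: both the split and the forest change on every run.
unseeded = [run_once(split_seed=None, model_seed=None) for _ in range(5)]

print("seeded:  ", [f"{s:.0%}" for s in seeded])
print("unseeded:", [f"{s:.0%}" for s in unseeded])
```

If the seeded list shows the same score five times, nothing is being carried over from one run to the next.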
What now?
Well, time for another experiment.
This time adding an epoch cycle in it. I will have a consistent number of cycles, and change the parameters using the summary result chart above to see if I can get those same (or higher) results.
Let’s do it, these machines aren’t going to teach themselves (at least not mine, not yet).
The New Set Up with Epochs.
I will set up a ‘for loop’ for the Epoch cycles, and while I’m at it, I’ll set up variables for the parameters, to make it easier to change the values.
So after Step two (cleaning the data) I’m combining all the remaining steps in the Epoch loop.
Now the cell looks like this:
# Imports used by this cell (X and y come from the earlier data-cleaning steps).
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Random Forest Classifier parameter variables.
n_estimator = 100  # n_estimators
m_depth = 5        # max_depth
ms_split = 2       # min_samples_split
r_state = None     # random_state
epochs = 3         # The number of Epochs

for i in range(epochs):
    print("\n\nEpoch:" + str(i + 1) + "/" + str(epochs))
    # 3: Split the data.
    # Using the train_test_split function for an 80/20 split.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    # 4: Create the model. Assign it the classifier.
    model = RandomForestClassifier(n_estimators=n_estimator, max_depth=m_depth, min_samples_split=ms_split, random_state=r_state)
    # 5: Train the model.
    model.fit(X_train, y_train)
    # 6: Make a prediction.
    prediction = model.predict(X_test)
    # The accuracy: using the accuracy_score function.
    score = accuracy_score(y_test, prediction)
    # Display the score as a percentage.
    print(">>> Accuracy Score: {:.2%}".format(score))
Now this is the ONLY cell I need to re-run after I make any modifications.
It really simplified this part of the testing.
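Since the loop above draws a fresh 80/20 split and creates a new model on every pass, it is close to what scikit-learn's ShuffleSplit does when paired with cross_val_score. As an alternative sketch (a swapped-in technique, not the approach used in this post), the same cycle can be written as follows. The synthetic dataset is an assumption standing in for the real data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import ShuffleSplit, cross_val_score

# Synthetic stand-in data (assumption: not the dataset used in this post).
X, y = make_classification(n_samples=160, n_features=8, n_informative=4,
                           n_classes=3, random_state=2)

epochs = 3  # number of random 80/20 splits to evaluate

model = RandomForestClassifier(n_estimators=100, max_depth=5,
                               min_samples_split=2, random_state=None)

# Each split is drawn at random; the model is cloned and refitted per split.
cv = ShuffleSplit(n_splits=epochs, test_size=0.2, random_state=None)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")

for i, score in enumerate(scores, start=1):
    print(f"Epoch:{i}/{epochs} >>> Accuracy Score: {score:.2%}")
print(f"Mean: {scores.mean():.2%}  Std: {scores.std():.2%}")
```

The mean and standard deviation give a single summary of how much the score moves between splits.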
To make sure everything was working, I set Epoch to ‘3’ and ran it. This is the result I got:
Epoch:1/3
>>> Accuracy Score: 0.00%
Epoch:2/3
>>> Accuracy Score: 0.03%
Epoch:3/3
>>> Accuracy Score: 0.06%
Now that I know everything works, I will increase the number of epochs, change the parameters more easily, and see if I can get those higher results again.
The Epoch Test.
Now that I have a summary of the results that I liked, and I now have an Epoch cycle, I’m ready to re-run the experiment. Now my goal is to get to the higher numbers that I have seen, which have been 12% and 16% since I know that it can go that far.
I have the cell printing out each Epoch and its accuracy score, that way I can visually see, and take note of the progress and to give me an idea of when the best results appear.
Yes, I do realize that I can also set up a loss function and have it calculate all of this for me, and in the future I will add that as well, but for now, I’m trying to do things as manually as possible to get a better understanding of what is happening.
What the chart shows is when each configuration hit the desired percentage (12% or 16%), and at which Epoch. Looking at the numbers, the score seems to go up and down, so I wanted to distill it even further.
What I noted is how many high results each configuration gave overall, how many were 12%, how many were 16%, and which was the last Epoch that produced a high result.
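This tally can be automated once the per-epoch scores are collected in a list. The sketch below is a minimal illustration: summarize_high_results is a hypothetical helper, and example_scores is made-up data, not results from this experiment. Scores are rounded to whole percentages before being compared with the 12% and 16% levels.

```python
def summarize_high_results(scores, high_levels=(12, 16)):
    """Tally high results from a list of accuracy fractions (epoch 1 first).

    high_levels are whole percentages; each score is rounded to the
    nearest whole percent before it is compared.
    """
    counts = {level: 0 for level in high_levels}
    last_high_epoch = None
    for epoch, score in enumerate(scores, start=1):
        percent = round(score * 100)
        if percent in counts:
            counts[percent] += 1
            last_high_epoch = epoch
    return {
        "total_high": sum(counts.values()),
        "counts": counts,
        "last_high_epoch": last_high_epoch,
        "final_percent": round(scores[-1] * 100),
    }


# Hypothetical scores for illustration only.
example_scores = [0.03, 0.12, 0.06, 0.16, 0.09, 0.12, 0.03]
summary = summarize_high_results(example_scores)
print(summary)
```

Appending each epoch's score to a list inside the Epoch loop would give this function the input it needs.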
The two that grabbed my attention at this point were n_e (n_estimators) 45 and 100.
They both gave me the largest number of high results, but they each had a unique part to them as well.
n_e: 100.
It gave me 11 high results: 9 of them were 12%, and 2 of them were 16%.
What I did notice is that the 2nd time it hit 16% was at Epoch № 23. The last time it reached 12% was at Epoch № 69, after which it started declining again and ended at 3%.
n_e: 45.
It gave me 9 high results. It consistently stayed at 12% and never hit 16% like the others, BUT even with the scores fluctuating, the last Epoch ended at 12%. It’s the only one that did that; all the others ended the 100th Epoch between 3% and 9%.
Whoa! That was a lot, but honestly, I enjoyed messing around with this and learned a lot. At this point I think I have a baseline of two configurations that I like, and I want to experiment with different numbers of Epochs on those two to see if anything changes. But I’m excited to venture into one of the other methods as well and see what happens. I hope you found this helpful.
Thanks for reading.
Let’s code something cool.
Ash, The Machine Learner.