A few years back, I worked at a consulting firm where we focused on identifying system failures. We analyzed system log files, built a feature vector, and trained a random forest classifier. It worked well, and the client adopted it. One day, the model produced an incorrect prediction on my boss's machine. I ran the same case on mine, and the prediction was correct. After investigating, we found that we had used different random seeds during training. Aligning the seeds fixed the discrepancy, showing how crucial the random seed can be.
To understand this phenomenon better, we ran tests on several datasets. In this blog post, we present the results for the Pima diabetes dataset and the Indian Employee dataset. The Pima dataset contains 768 instances, while the Employee dataset has 4,653. Categorical features were one-hot encoded. The train-test split was fixed, with the training set comprising one-third of the dataset. We used scikit-learn's default random forest classifier without any parameter tuning and repeated the experiment with fifty different random seeds.
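A minimal sketch of this setup, assuming a CSV of the Pima data with an `Outcome` target column (the file and column names are illustrative; the Employee data would additionally need its categorical features one-hot encoded):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

df = pd.read_csv("pima_diabetes.csv")            # hypothetical file name
X, y = df.drop(columns=["Outcome"]), df["Outcome"]

# The split itself is fixed; only the classifier's seed changes between runs.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=1/3, random_state=0
)

scores = []
for seed in range(50):                            # fifty different seeds
    clf = RandomForestClassifier(random_state=seed)  # default hyperparameters
    clf.fit(X_train, y_train)
    scores.append(accuracy_score(y_test, clf.predict(X_test)))

print(f"accuracy range across seeds: {min(scores):.3f} - {max(scores):.3f}")
```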
In both datasets, the prevalence is approximately 35%, meaning around 35% of the data is positive. The random seed had a significant impact on the results. For instance:
For the Pima dataset:
For the Employee dataset:
The impact was more pronounced in the Pima dataset, but the phenomenon appeared in both. To dig deeper, we examined the prediction probabilities, making sure the variation came from the random seed itself rather than from other differences in the setup. For over sixty percent of the Pima test set and over eighty percent of the Employee test set, the maximum change in an instance's predicted probability across seeds exceeded fifty percentage points. In other words, different random seeds often produced different predictions for the same instance, which calls the certainty of any single prediction into question.
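A sketch of that probability check, reusing the fixed split and seed loop from the snippet above: it records the predicted probability of the positive class for every test instance under each seed and then measures the per-instance spread.

```python
import numpy as np

# Collect P(positive) for every test instance under each seed.
probas = []
for seed in range(50):
    clf = RandomForestClassifier(random_state=seed)
    clf.fit(X_train, y_train)
    probas.append(clf.predict_proba(X_test)[:, 1])
probas = np.vstack(probas)                        # shape: (n_seeds, n_test)

# Maximum change in predicted probability per instance across seeds.
max_change = probas.max(axis=0) - probas.min(axis=0)
share_over_half = (max_change > 0.5).mean()
print(f"share of test instances with a >50% probability swing: {share_over_half:.1%}")
```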
To understand why, note that random forests inject randomness through two mechanisms: bootstrap sampling (each tree is trained on a random sample of the data) and random feature selection (each split considers only a random subset of the features). Because there is no boosting step to correct earlier trees, the particular data and feature subsets determined by the seed have a strong influence on the output.
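For reference, these are the scikit-learn parameters that expose the two mechanisms (shown with their current default values); this is just a pointer to where the randomness enters, not a tuning suggestion:

```python
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(
    bootstrap=True,        # each tree trains on a bootstrap sample of the rows
    max_features="sqrt",   # each split considers a random subset of the features
    random_state=42,       # the seed that drives both mechanisms
)
```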
In conclusion, our exploration of the random seed's impact on the random forest classifier revealed a noteworthy effect on the results, raising questions about how reliable individual predictions really are. This phenomenon is not specific to Python; the original issue was encountered in R code. We encourage readers to check for this effect with their own models. Similar tests with LightGBM and XGBoost showed consistent predictions across different random seeds. Finally, many thanks to Ron Ritter, my team leader, for his pivotal role in addressing this issue.