Great article! Much of the code uses Spark (which is naturally distributed using YARN or some other resource manager), but XGBoost’s Python binding doesn’t appear to be distributed. Does this mean that the actual training happens on only one node?
Wow! Awesome work! This is a great case study in the lack of data integrity and validation. I personally relied on this dataset without putting any work into validating its completeness or representativeness. I hope this leads to a much-needed conversation about blindly relying on bad data.