Hi William. I like your tutorial, except for one part that I think is important. Your decision tree on the Census data is severely overfit: its ROC "curve" has only one non-trivial point (see cell 26). In my experience this is a clear sign of pathological overfitting. (In the sklearn implementation, the default value of min_samples_leaf is 1, which effectively lets the algorithm memorize individual examples.) I understand you're trying to illustrate random forests here, but the lesson shouldn't be that decision trees inherently overfit (they don't), or that you should switch to another model when you see overfitting (you shouldn't); it should be to establish a decent baseline first. With a little experimentation on single decision trees I get this:
           Baseline  Test  Train
Recall     1.0       0.94  0.95
Precision  0.81      0.89  0.9
ROC AUC    0.5       0.84  0.91

which is pretty close to your Random Forest model.
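To make the min_samples_leaf point concrete, here's a minimal sketch of the effect. Since I don't have your exact Census preprocessing, this uses a synthetic stand-in dataset; the parameter value 50 is just an illustrative choice, not the one behind my numbers above.

```python
# Sketch: the default min_samples_leaf=1 lets a tree memorize the training
# set, while a larger leaf size shrinks the train/test gap.
# Synthetic data stands in for the Census set here.
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

results = {}
for leaf in (1, 50):  # sklearn default vs. an illustrative regularized value
    tree = DecisionTreeClassifier(min_samples_leaf=leaf, random_state=0)
    tree.fit(X_tr, y_tr)
    auc_tr = roc_auc_score(y_tr, tree.predict_proba(X_tr)[:, 1])
    auc_te = roc_auc_score(y_te, tree.predict_proba(X_te)[:, 1])
    results[leaf] = (auc_tr, auc_te)
    print(f"min_samples_leaf={leaf}: train AUC={auc_tr:.2f}, test AUC={auc_te:.2f}")
```

With min_samples_leaf=1 the train AUC is essentially perfect (memorization), and raising it narrows the gap between train and test, which is exactly the pattern in the table above.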
Might I suggest, for a better pedagogical example, running RandomizedSearchCV on a single decision tree first, before moving to an ensemble.
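For reference, a sketch of what that search could look like. The parameter ranges here are illustrative guesses, not the ones I used for the numbers above, and synthetic data again stands in for the Census set.

```python
# Sketch: tune a single decision tree with RandomizedSearchCV before
# reaching for an ensemble. Ranges below are illustrative only.
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)

search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_distributions={
        "min_samples_leaf": randint(1, 100),
        "max_depth": randint(2, 20),
        "min_samples_split": randint(2, 50),
    },
    n_iter=25,
    scoring="roc_auc",
    cv=5,
    random_state=0,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

A tree tuned this way gives readers an honest single-model baseline, so the random forest's improvement (if any) is measured against something reasonable rather than against a pathologically overfit default.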
