Summary for Practical Tips from fast.ai Machine Learning Course — Part 2

Mei Leng
10 min read · Oct 28, 2018


This is my high-level summary of the fast.ai machine learning course taught by Jeremy Howard. The focus is on practical tricks and tips for machine learning, in particular for random forests and basic neural networks. It is assumed that you already know the basic theory. Special thanks to Hiromi Suenaga for her wonderful, detailed notes on every lesson. Most of this summary is based on her notes (all the figures are from them).

  • part 1 for general knowledge in machine learning and tools
  • part 2 for random forest
  • part 3 for neural network

key concepts of random forest:

  • ensemble and bagging
  • the important thing is to create uncorrelated trees rather than more accurate trees.

The effective machine learning model is accurate at finding the relationships in the training data and generalizes well to new data [55:53]. In bagging, that means that each of your individual estimators, you want them to be as predictive as possible but for the predictions of your individual trees to be as uncorrelated as possible.

— courtesy of Hiromi Suenaga’s note on lesson 2

  • random forests care about the relative ordering of values rather than their absolute magnitude, so feature scaling is not needed (see the sketch below).
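
A quick way to convince yourself of this (a minimal standalone sketch, not from the course): tree splits depend only on the ordering of feature values, so applying a monotonic transform such as log to the features leaves the fitted forest essentially unchanged.

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# synthetic data: the forest only cares about the ordering of feature values
rng = np.random.RandomState(0)
X = rng.uniform(1, 100, size=(1000, 3))
y = 2 * X[:, 0] + rng.normal(size=1000)

m1 = RandomForestRegressor(n_estimators=20, random_state=0).fit(X, y)
m2 = RandomForestRegressor(n_estimators=20, random_state=0).fit(np.log(X), y)  # monotonic rescaling

# identical split structure, identical predictions on these rows
print(np.allclose(m1.predict(X), m2.predict(np.log(X))))  # True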

fast.ai functions:

  • set_rf_samples() and reset_rf_samples() : set the size of the subset of data used to grow each tree, and reset to the sklearn default, respectively. Note that oob_score=True cannot be used when set_rf_samples() is set to a small number against a huge dataset.
  • parallel_trees() : takes a random forest model m and a function to call, calls that function on every tree in parallel, and returns a list of the results.
  • rf_feat_importance() : takes a model m and a dataframe df_trn, and returns a Pandas dataframe showing, in order of importance, how important each column was.

important hyper-parameters in random forest:

  • set_rf_samples(): picks a subset of rows for each tree. By default, sklearn instead samples with replacement from the whole dataset, so each fully grown tree still ends up with roughly as many leaf nodes as there are rows in the dataset.

Setting set_rf_samples to a smaller number leads to trees that overfit less but also predict less accurately.

By decreasing the set_rf_samples number, we are actually decreasing the power of the estimator and increasing the correlation — so is that going to result in a better or worse validation set result for you? It depends. This is the kind of compromise which you have to figure out when you do machine learning models.

Setting set_rf_samples to a smaller number also speeds up training.

  • min_samples_leaf=N : rather than predicting from a single data point per leaf, each leaf averages at least N points, so we expect each tree to generalize better, although each tree is slightly less powerful on its own. Good values to try are 1, 3, 5, 10, 25, 100, ... As you increase it, if the score is already getting worse by the time you reach 10, there is no point going further; if it is still improving at 100, you can keep going.
  • max_features=0.5 : randomly sample a subset of columns at each split. Setting this smaller makes each individual tree less accurate, but the trees become more varied. Good values to try are 1, 0.5, log2, or sqrt .

random forest interpretation:

— confidence based on tree variance

— feature importance

— dendrogram

— partial dependence

— tree interpreter

  • confidence based on tree variance:

It is useful to check the average confidence interval for each group in a categorical column, as well as for each row.

# get trees
set_rf_samples(50000)
m = RandomForestRegressor(n_estimators=40, min_samples_leaf=3,
                          max_features=0.5, n_jobs=-1, oob_score=True)
m.fit(X_train, y_train)

# compute mean and std across trees in the forest
def get_preds(t): return t.predict(X_valid)
%time preds = np.stack(parallel_trees(m, get_preds))
np.mean(preds[:,0]), np.std(preds[:,0])  # for one observation

CPU times: user 100 ms, sys: 180 ms, total: 280 ms
Wall time: 505 ms
(9.1960278072006023, 0.21225113407342761)

# the same computation without parallel_trees, for comparison
%time preds = np.stack([t.predict(X_valid) for t in m.estimators_])
np.mean(preds[:,0]), np.std(preds[:,0])  # for one observation

CPU times: user 1.38 s, sys: 20 ms, total: 1.4 s
Wall time: 1.4 s
(9.1960278072006023, 0.21225113407342761)

1). compute the prediction for each validation observation from every tree, giving preds with shape (num_of_trees, num_of_observations), then obtain the mean and standard deviation across trees.

x = X_valid.copy()
x['pred_std'] = np.std(preds, axis=0)
x['pred'] = np.mean(preds, axis=0)

2). for categorical columns, compute the mean and std using groupby .

flds = [col_name, y_label, 'pred', 'pred_std']
summ = x[flds].groupby(col_name, as_index=False).mean()

3). on average, a group with a larger prediction will also have a larger standard deviation, so it makes sense to sort by the ratio of pred_std to pred.

(summ.pred_std/summ.pred).sort_values(ascending=False)

4). the ratio of pred_std to pred tells us how confident the prediction is for each category of a specific column; generally, predictions are less reliable for smaller groups.

  • feature importance:

It tells you how important each column is relative to the others, i.e. how predictive each is of the dependent variable.

fi = rf_feat_importance(m, df_trn); fi[:10]

def plot_fi(fi):
    return fi.plot('cols', 'imp', 'barh', figsize=(12,7), legend=False)

plot_fi(fi[:30]);

Under the hood, feature importance is computed by taking the trained model, randomly shuffling the column in question, and computing predictions on the shuffled dataset; the drop relative to the original predictions is that column's importance value. The more important a column is, the larger the deterioration its shuffling causes.
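
A minimal sketch of this shuffling idea (permutation importance), assuming a fitted model m, a validation split X_valid / y_valid as a DataFrame and array, and R² as the score:

import numpy as np
from sklearn.metrics import r2_score

def permutation_importance(m, X_valid, y_valid):
    # baseline score on the untouched validation data
    baseline = r2_score(y_valid, m.predict(X_valid))
    imps = {}
    for col in X_valid.columns:
        X_shuf = X_valid.copy()
        # break the relationship between this column and the target
        X_shuf[col] = np.random.permutation(X_shuf[col].values)
        imps[col] = baseline - r2_score(y_valid, m.predict(X_shuf))
    # biggest score drop = most important column
    return sorted(imps.items(), key=lambda kv: kv[1], reverse=True)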

Feature importance is useful for identifying data leakage in particular columns (a column that is almost entirely predictive and overshadows all others) and collinearity (two or more columns that are highly correlated).

There is an iterative procedure for using feature importance to remove redundant columns (see the sketch after these steps):

1). remove columns whose feature importance is smaller than a threshold value, say 0.005.

2). retrain on the new dataset with fewer columns and check the validation score; if it gets worse, decrease the threshold a little until it no longer gets worse.

3). redo the feature importance computation; the new dataset with fewer columns should have less collinearity.
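
A sketch of steps 1 and 2, assuming fi comes from rf_feat_importance and reusing the split_vals / print_score helpers that appear elsewhere in this post:

# keep only columns whose importance is above the threshold
to_keep = fi[fi.imp > 0.005].cols
df_keep = df_trn[to_keep].copy()

# retrain on the reduced set of columns and check the validation score
X_train, X_valid = split_vals(df_keep, n_trn)
m = RandomForestRegressor(n_estimators=40, min_samples_leaf=3,
                          max_features=0.5, n_jobs=-1, oob_score=True)
m.fit(X_train, y_train)
print_score(m)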

  • dendrogram:

A type of hierarchical clustering that puts similar columns together. Similarity can be measured using a rank correlation such as spearmanr .

from scipy.cluster import hierarchy as hc

corr = np.round(scipy.stats.spearmanr(df_keep).correlation, 4)
corr_condensed = hc.distance.squareform(1-corr)
z = hc.linkage(corr_condensed, method='average')
fig = plt.figure(figsize=(16,10))
dendrogram = hc.dendrogram(z, labels=df_keep.columns,
                           orientation='left', leaf_font_size=16)
plt.show()

Removing redundant columns using the dendrogram: iteratively remove each of the similar columns and check whether the OOB score gets worse. Within a group of similar columns, it is safe to drop the ones whose removal barely changes the OOB score.

def get_oob(df):
    m = RandomForestRegressor(n_estimators=30, min_samples_leaf=5,
                              max_features=0.6, n_jobs=-1, oob_score=True)
    x, _ = split_vals(df, n_trn)
    m.fit(x, y_train)
    return m.oob_score_

for c in ('saleYear', 'saleElapsed', 'fiModelDesc', 'fiBaseModel',
          'Grouser_Tracks', 'Coupler_System'):
    print(c, get_oob(df_keep.drop(c, axis=1)))

saleYear 0.889037446375
saleElapsed 0.886210803445
fiModelDesc 0.888540591321
fiBaseModel 0.88893958239
Grouser_Tracks 0.890385236272
Coupler_System 0.889601052658

  • partial dependence:

It is used to learn more about how the important columns relate to the dependent variable, and it can also investigate the interaction between columns.

— for one column:

First, plot the relationship between the column and the dependent variable.

ggplot(df_all, aes(col_name, y_label)) + stat_smooth(se=True, method='loess')

Then, further study how variation in this column affects the prediction.

from pdpbox import pdp
from plotnine import *

def plot_pdp(feat, clusters=None, feat_name=None):
    feat_name = feat_name or feat
    p = pdp.pdp_isolate(m, x, feat)
    return pdp.pdp_plot(p, feat_name, plot_lines=True,
                        cluster=clusters is not None,
                        n_cluster_centers=clusters)

plot_pdp('YearMade')

Under the hood, the partial dependence for a target column is computed by taking the trained model and a set of values for that column, say [x0, x1, …, xn]. For each value xi, all values in the target column are replaced by xi and the predictions are recomputed; connecting the predictions against x gives a line in a 2D figure. Each observation generates such a line, and we can look at the average of all lines, or cluster them.
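
A minimal hand-rolled sketch of that procedure (what pdpbox does for us), assuming a fitted model m, a DataFrame X, and a grid of values for the feature:

import numpy as np

def partial_dependence(m, X, feat, grid):
    # one prediction line per grid value: set the whole column to that value and predict
    lines = np.zeros((len(grid), len(X)))
    for i, val in enumerate(grid):
        X_mod = X.copy()
        X_mod[feat] = val
        lines[i] = m.predict(X_mod)
    # average curve across observations, plus the per-observation lines (for clustering)
    return lines.mean(axis=1), lines

# e.g. pd_mean, pd_lines = partial_dependence(m, X_valid, 'YearMade', range(1960, 2011))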

— for two columns:

feats = [col_name_1, col_name_2]
p = pdp.pdp_interact(m, x, feats)
pdp.pdp_interact_plot(p, feats)

It reveals how these two columns together impact the dependent variable.

  • tree interpreter:

It is used to interpret how each feature impacts the final prediction for a single observation. Feature importance describes the complete random forest model; the tree interpreter gives per-feature contributions for a particular row.

So the tree interpreter tells us the contributions for a particular row based on the splits along each tree's path, and we could also obtain a form of feature importance by adding up the contributions calculated over every row in the dataset.

from treeinterpreter import treeinterpreter as ti

df_train, df_valid = split_vals(df_raw[df_keep.columns], n_trn)
row = X_valid.values[None,0]
prediction, bias, contributions = ti.predict(m, row)
prediction[0], bias[0]
(9.1909688098736275, 10.10606580677884)
idxs = np.argsort(contributions[0])
[o for o in zip(df_keep.columns[idxs], df_valid.iloc[0][idxs], contributions[0][idxs])]
[('ProductSize', 'Mini', -0.54680742853695008),
('age', 11, -0.12507089451852943),
('fiProductClassDesc',
'Hydraulic Excavator, Track - 3.0 to 4.0 Metric Tons',
-0.11143111128570773),
('fiModelDesc', 'KX1212', -0.065155113754146801),
('fiSecondaryDesc', nan, -0.055237427792181749),
('Enclosure', 'EROPS', -0.050467175593900217),
('fiModelDescriptor', nan, -0.042354676935508852),
('saleElapsed', 7912, -0.019642242073500914),
('saleDay', 16, -0.012812993479652724),
('Tire_Size', nan, -0.0029687660942271598),
('SalesID', 4364751, -0.0010443985823001434),
('saleDayofyear', 259, -0.00086540581130196688),
('Drive_System', nan, 0.0015385818526195915),
('Hydraulics', 'Standard', 0.0022411701338458821),
('state', 'Ohio', 0.0037587658190299409),
('ProductGroupDesc', 'Track Excavators', 0.0067688906745931197),
('ProductGroup', 'TEX', 0.014654732626326661),
('MachineID', 2300944, 0.015578052196894499),
('Hydraulics_Flow', nan, 0.028973749866174004),
('ModelID', 665, 0.038307429579276284),
('Coupler_System', nan, 0.052509808150765114),
('YearMade', 1999, 0.071829996446492878)]

Under the hood, the computation follows one path through each tree, tracking how the intermediate estimate of the dependent variable (e.g. the average log price) changes with each additional split criterion (column value). A waterfall plot can be generated from these per-split changes.

prediction: the same as the random forest prediction

bias: always the same value, namely the average of the dependent variable over the random sample each tree was trained on

contributions: the total contribution of each column, summed over every time that column appears in a split across the trees.

So if we sum up all the contributions together, and then add them to the bias, then that would give us the final prediction.
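
As a quick sanity check (using the variables from the treeinterpreter snippet above):

import numpy as np

# bias + sum of the per-feature contributions reconstructs the prediction for this row
print(np.allclose(prediction[0], bias[0] + contributions[0].sum()))  # True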

dealing with categorical columns:

  • use one-hot encoding for more efficient splits: [low, median, high] will be mapped to three boolean columns is_low, is_median, is_high.

— fast.ai provides the parameter max_n_cat=7 in proc_df() for this; columns with more levels are left as integer codes (see the sketch after the code below).

— feature importance may change after using one-hot encoding instead of integer codes, since a single level of a categorical column can now stand out as important.

— before one-hot encoding, it is good practice to encode with ordered integer codes first:

df.cat_col.cat.set_categories(['high', 'low', 'median'], ordered=True, inplace=True)
df.cat_col = df.cat_col.cat.codes
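
A sketch of the one-hot step itself; the proc_df call with max_n_cat follows the course notebook, and pd.get_dummies is the plain pandas equivalent for a single column:

# fast.ai: one-hot encode categorical columns with at most 7 levels, leave the rest as codes
df_trn2, y_trn, nas = proc_df(df_raw, y_label, max_n_cat=7)

# plain pandas equivalent for one column
# pd.get_dummies(df, columns=['cat_col'])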

dealing with the big difference between validation score and OOB score:

First, keep in mind that the OOB score should be a little worse, because each OOB prediction uses fewer trees.

Second, this extrapolation issue of random forests is closely tied to temporally ordered data: it stems from the fact that the validation set differs from the training set in terms of past versus future values.

So the problem boils down to figuring out which predictors have a strong temporal component and may therefore be irrelevant by the future time period, causing a difference between the training set and the validation set. In particular, we want to identify columns that are not random identifiers but are instead assigned consecutively as time goes on, and we do the following:

1). we build a new random forest to predict "whether a sample is in the validation set". If your variables were not time dependent, it should not be possible to tell whether a row is in the validation set, since the validation set was constructed purely by temporal order relative to the training set.

df_ext = df_keep.copy()
df_ext['is_valid'] = 1
df_ext.is_valid[:n_trn] = 0
x, y, nas = proc_df(df_ext, 'is_valid', nas)

This is a great trick on Kaggle, because they often won't tell you whether the test set is a random sample. You can put the test set and training set together, create a new column called is_test, and see if you can predict it. If you can, the test set is not a random sample, which means you have to figure out how to create a comparable validation set.

m = RandomForestClassifier(n_estimators=40, min_samples_leaf=3, max_features=0.5, n_jobs=-1, oob_score=True)
m.fit(x, y);
m.oob_score_
0.99998753505765037

If the oob_score_ is high, we do NOT have a random validation set, since the model can tell training and validation rows apart.

2). we check the feature importance of the columns for predicting is_valid. The time dependent features will have high importance, and their values in the training set will be very different from those in the validation set; that is precisely what lets the model separate training rows from validation rows. Conversely, if a feature has similar values in the training set and the validation set, it is very unlikely to help predict whether a sample is in the validation set.
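
A sketch of this check, reusing rf_feat_importance on the is_valid classifier from step 1 (the suspicious columns are simply whatever turns up at the top of the importance table):

# which columns most strongly separate training rows from validation rows?
fi = rf_feat_importance(m, x)
fi[:10]

# compare the distributions of the suspicious columns across the two periods
feats = list(fi.cols[:3])
X_train[feats].describe()
X_valid[feats].describe()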

3). we remove the most important of these features one by one, and we expect the validation score to go up when a time dependent feature is removed: other features can capture similar relationships without the time dependency, so removing it makes the model generalize better to the validation period.

def print_score(m):
    res = [rmse(m.predict(X_train), y_train),
           rmse(m.predict(X_valid), y_valid),
           m.score(X_train, y_train),
           m.score(X_valid, y_valid)]
    if hasattr(m, 'oob_score_'): res.append(m.oob_score_)
    print(res)

for f in feats:
    df_subs = df_keep.drop(f, axis=1)
    X_train, X_valid = split_vals(df_subs, n_trn)
    m = RandomForestRegressor(n_estimators=40, min_samples_leaf=3,
                              max_features=0.5, n_jobs=-1, oob_score=True)
    m.fit(X_train, y_train)
    print(f)
    print_score(m)

Getting rid of time dependent features should lead to a validation score that is better than the OOB score.

4). remove the time dependent features (those whose removal makes the validation score go up) and retrain the model on the remaining columns, as sketched below. If things work well, we should see an improvement in prediction performance.
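
A sketch of step 4; the specific columns dropped here are the ones that turned out to be time dependent in the lesson, and your own list may differ:

# drop the time dependent identifiers and retrain on the remaining columns
df_subs = df_keep.drop(['SalesID', 'MachineID', 'saleDayofyear'], axis=1)
X_train, X_valid = split_vals(df_subs, n_trn)
m = RandomForestRegressor(n_estimators=40, min_samples_leaf=3,
                          max_features=0.5, n_jobs=-1, oob_score=True)
m.fit(X_train, y_train)
print_score(m)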

more on the extrapolation issue:

The biggest downside of a random forest model is that it cannot predict anything it hasn't seen: it does not extrapolate beyond the range of values in the training data, for example when the dependent variable follows an underlying trend over time. Solutions to this issue are to use neural nets or conventional time series techniques. Another, not-so-good, option is a gradient boosting machine: a GBM still can't extrapolate to the future, but it can at least deal with time-dependent data a bit more conveniently.
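
A tiny standalone illustration of the extrapolation problem (not from the course): a random forest fit on an upward trend simply predicts something close to the largest value it saw during training.

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# train on a simple linear trend over x in [0, 100)
xs = np.arange(0, 100, 0.5).reshape(-1, 1)
ys = 2 * xs.ravel()

rf = RandomForestRegressor(n_estimators=40, random_state=0).fit(xs, ys)

# a point beyond the training range gets clipped near the last value seen, nowhere near 300
print(rf.predict([[150.0]]))  # roughly 199, i.e. about 2 * 99.5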
