What I Learned from fast.ai ML up to Lesson 5

Masaki Kozuki
5 min read · Oct 21, 2018


This post is about what I learned from the fastai Machine Learning course published this September.
Edit (2018/10/20): Since fastai course 1 v3 starts in a few days, I stopped at lesson 5 and did not watch lectures 6–12.
Note: In this post, I use fastai v0.7, not v1.0.

What is Random Forest?

Random Forest is one of the most famous machine learning algorithms because it is easy to use and applicable to both classification and regression problems, even when each data sample is composed of both categorical (e.g. ZIP code) and continuous (e.g. price) variables. A random forest is also fairly resistant to bad overfitting, and it can achieve good results with only a little feature engineering. Further, data samples do not need to be i.i.d., while most linear machine learning algorithms require this property. So, it is a good starting point for any machine learning project.

Use Random Forest To Understand Data More!

Random Forest consists of a bunch of trees. In scikit-learn, we can choose the number of trees by passing a value to the n_estimators argument; if we pass 1, the trained model is a single decision tree. Visualizing such a model tells you which features are more important/effective than the others.

Let me show you an example from the Lesson 1 notebook, where the goal is to predict the sale price of bulldozers (i.e. regression). Details are in the Kaggle Blue Book for Bulldozers competition. This dataset has both categorical and continuous variables, and each data sample has a timestamp.
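Roughly, the data preparation in the notebook looks like the sketch below (the CSV path is an assumption; add_datepart, train_cats, and proc_df come from fastai v0.7's structured module):

```python
import pandas as pd
from fastai.structured import add_datepart, train_cats, proc_df

# Path is an assumption; point it at the Kaggle Blue Book for Bulldozers data.
df_raw = pd.read_csv('data/bulldozers/Train.csv',
                     low_memory=False, parse_dates=['saledate'])

# Expand the timestamp into many columns (year, month, day of week, ...).
add_datepart(df_raw, 'saledate')

# Turn string columns into pandas categorical types.
train_cats(df_raw)

# Replace categories with integer codes, fill missing values,
# and split off the SalePrice target.
df, y, nas = proc_df(df_raw, 'SalePrice')
```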

Visualization of the decision tree from the lecture notebook.

In the above figure, every single node has 4 lines: 1) feature (column) name ≤ split threshold, 2) mean squared error (the loss value), 3) the number of samples in the node, and 4) the average of the predicted sale prices. Features that appear closer to the root of the tree are more important/effective than the others. Some features in this figure are categorical, yet their split thresholds are floating-point numbers. Why? Because categorical variables are translated into integers. Of course, categorical variables sometimes have a natural order, but reflecting that order in the integer mapping usually doesn't improve the score. Another thing I want to mention is that fastai provides a really useful draw_tree function, shown below:

draw_tree function
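Here is a minimal sketch of how it can be used (df and y are assumed to come from the preparation step above; the hyperparameters just keep the tree small enough to read):

```python
from sklearn.ensemble import RandomForestRegressor
from fastai.structured import draw_tree

# A "forest" of one shallow tree, so the drawing stays legible.
m = RandomForestRegressor(n_estimators=1, max_depth=3, bootstrap=False, n_jobs=-1)
m.fit(df, y)

# draw_tree renders a single decision tree with graphviz.
draw_tree(m.estimators_[0], df, precision=3)
```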

Random Forest in Supervised Learning

In the previous section, I showed that machine learning sometimes helps us learn more about a dataset. However, our original goal is to get a super cool model that can predict the values/labels of unknown data samples. As its name says, a random forest is composed of a bunch of decision trees, each one like the tree in the previous section, and it averages their predictions to make the output more accurate, with less variance. This averaging method is usually called bagging. As a side note, we can also use bagging over different kinds of models, e.g. a pair of an SVM and a Random Forest. In scikit-learn's random forest, n_estimators defines how many trees are used to build the forest, so a larger n_estimators means less variance and higher accuracy. You can check this effect by changing the n_estimators argument, but note that there is a limit beyond which you cannot improve any more. When your Random Forest shows a serious gap between the training set score and the corresponding validation/development set score, it is a good choice to set oob_score to true. OOB stands for "out-of-bag". What is the out-of-bag score? It is calculated, for each tree, on the samples that were not used to build that tree, so intuitively it is like a quasi-validation score. However, this score is usually a bit worse than the validation score.
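Here is a runnable sketch of the out-of-bag score (the dataset is synthetic, made up only so the snippet runs end to end):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic regression data, only so the sketch is self-contained.
X, y = make_regression(n_samples=2000, n_features=20, noise=10.0, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

# oob_score=True scores every sample using only the trees that did not see it.
m = RandomForestRegressor(n_estimators=40, oob_score=True, n_jobs=-1, random_state=0)
m.fit(X_train, y_train)

print('train R^2:', m.score(X_train, y_train))
print('valid R^2:', m.score(X_valid, y_valid))
print('OOB   R^2:', m.oob_score_)  # usually a bit below the validation score
```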

Frequently Tuned Hyper Parameters

Of course, there are some hyperparameters that are frequently tuned. I experimented with these parameters' effects in this notebook; a short scikit-learn sketch follows the list.

  1. n_estimators: The number of trees composing the forest.
  2. max_depth: The maximum depth (= height) of each tree. If not specified, nodes are expanded until the leaves are pure or until the min_samples_split / min_samples_leaf limits stop the splits.
  3. min_samples_leaf: The minimum number of samples each leaf node must keep. In other words, a split is only made if it leaves at least this many samples in every child, so it also controls how deep the trees grow.
  4. max_features: The number (or fraction) of features (columns) considered when looking for the best split. If not specified, the algorithm considers all the columns at each split. Restricting it makes every single tree less accurate on its own but more diverse from the others, which leads to a better forest.
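The sketch below shows how these knobs are passed to scikit-learn (the values are illustrative, not recommendations):

```python
from sklearn.ensemble import RandomForestRegressor

m = RandomForestRegressor(
    n_estimators=40,     # more trees: less variance, with diminishing returns
    max_depth=None,      # grow until the min_samples_* limits stop the splits
    min_samples_leaf=3,  # every leaf must keep at least 3 samples
    max_features=0.5,    # consider half of the columns at each split
    n_jobs=-1,           # build the trees in parallel
)
m.fit(df, y)  # df, y from the preparation sketch above
```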

Technique for Categorical Variables

Random Forest doesn't know the order of categorical variables unless we tell it. For example, if one category describes the size of a product (large, middle, small), we would map the labels to 2, 1, and 0 (or 1, 0, and -1), respectively. However, if the labels are shuffled, e.g. large->1, middle->2, and small->0, the model cannot exploit the order at all. This situation gets worse when the number of classes in a category is large. One way to attack this is one-hot encoding. It is easy to implement: add one column per class, where each column indicates whether the sample has the attribute represented by that column. So, if you apply one-hot encoding to large, middle, and small, you add a large column, a middle column, and a small column to the tabular dataset, and every single row sets 1 in exactly one of the 3 columns. This technique might not improve the score, but it usually changes the order of feature importance, so it can give you a new understanding of your dataset.
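A minimal sketch with pandas (the size column is made up for illustration; in fastai v0.7, proc_df's max_n_cat argument can one-hot encode small categories in a similar way):

```python
import pandas as pd

df = pd.DataFrame({'size':  ['large', 'small', 'middle', 'large'],
                   'price': [100, 20, 55, 95]})

# One new column per class; each row has exactly one 1 among the size_* columns.
df_onehot = pd.get_dummies(df, columns=['size'])
print(df_onehot.columns.tolist())
# ['price', 'size_large', 'size_middle', 'size_small']
```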

Feature Importance

One easy way to measure a feature's importance is to calculate the difference between the score on a validation dataset where that one column is randomly shuffled and the score on the original validation dataset. After applying this method to all the columns one by one, you get a list of gaps. Intuitively, if the gap is small, the shuffled column (feature) is not important.
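A sketch of this shuffle-one-column idea (often called permutation importance); m is assumed to be an already fitted random forest and X_valid a DataFrame, and the helper name is mine:

```python
import numpy as np

def shuffled_column_gaps(m, X_valid, y_valid):
    """Hypothetical helper: score drop after shuffling each column in turn."""
    base = m.score(X_valid, y_valid)
    gaps = {}
    for col in X_valid.columns:
        X_shuf = X_valid.copy()
        X_shuf[col] = np.random.permutation(X_shuf[col].values)
        gaps[col] = base - m.score(X_shuf, y_valid)  # small gap => unimportant
    return gaps
```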

Another way is partial dependence. It is obtained by replacing one column with a constant value, predicting, and repeating over a range of constants. By plotting the result, we can see whether something extraordinary happens as the feature changes.
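A rough sketch of the replace-with-a-constant idea (m and X_valid are the same assumptions as above, the helper name is mine, and 'YearMade' is one of the bulldozers columns; scikit-learn also ships its own partial dependence utilities):

```python
import numpy as np

def partial_dependence(m, X, col, values):
    """Hypothetical helper: mean prediction with `col` forced to each value."""
    means = []
    for v in values:
        X_mod = X.copy()
        X_mod[col] = v  # replace the entire column with one constant
        means.append(m.predict(X_mod).mean())
    return np.array(means)

# e.g. plot partial_dependence(m, X_valid, 'YearMade', range(1960, 2012))
```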

When you use temporal datasets…

Datasets containing timestamps are more difficult than other datasets to split into training, validation, and test sets. If the validation and/or test dataset includes samples older than the training samples, your model effectively predicts with strong prior knowledge about the validation/test period. So if your dataset is time-related, the split should follow chronological order. The final prediction on the test dataset is then made after finding good hyperparameters and retraining the model with them on the training + validation data.
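A minimal sketch of such a chronological split (df and y are assumed to come from proc_df above, with rows in chronological order, oldest first):

```python
# Hold out the most recent n_valid rows as the validation set.
n_valid = 12000  # illustrative size
X_train, X_valid = df[:-n_valid], df[-n_valid:]
y_train, y_valid = y[:-n_valid], y[-n_valid:]
```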
It is also worth trying to remove implicitly time-related variables from the model inputs. To detect whether a variable is time-related, train a model to predict whether each sample comes from the training or the validation dataset. Variables that make this prediction easy should be removed, and doing so can improve your models.
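A sketch of this train-vs-validation detector, sometimes called adversarial validation (X_train and X_valid come from the split sketch above, so all columns are already numeric):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Label each row by the split it came from, then try to tell the splits apart.
X = pd.concat([X_train, X_valid])
is_valid = np.concatenate([np.zeros(len(X_train)), np.ones(len(X_valid))])

clf = RandomForestClassifier(n_estimators=40, n_jobs=-1)
clf.fit(X, is_valid)

# Features the classifier relies on are implicitly time-related:
# candidates to drop from the real model's inputs.
top = sorted(zip(X.columns, clf.feature_importances_), key=lambda t: -t[1])[:5]
for col, imp in top:
    print(col, imp)
```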
