Summary for Practical Tips from fast.ai Machine Learning Course — Part 1

Mei Leng
6 min read · Oct 28, 2018


This is my high-level summary of the fast.ai machine learning course taught by Jeremy Howard. The focus is on practical tricks and tips for machine learning, in particular for random forests and basic neural networks. It is assumed that you already know the basic theory behind them. Special thanks to Hiromi Suenaga for her wonderful, detailed notes on every lesson. Most of the content in this summary is drawn from her notes (all the figures are from her notes).

  • part 1 for general knowledge in machine learning and tools
  • part 2 for random forest
  • part 3 for neural network

Jupyter Notebook Tricks

  • get information about a function: display , ?display (shows the docstring), and ??display (shows the source code)
  • execution time: %time , %%time , and %prun (see the sketch below)
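A minimal sketch of how these look in practice, assuming pandas is imported as pd and a placeholder file train.csv exists; each snippet goes in its own notebook cell:

import pandas as pd

# show the docstring of a function
?display
# show its full source code
??display

# time a single statement
%time df = pd.read_csv('train.csv')

# time a whole cell (%%time must be the first line of its own cell)
%%time
df = pd.read_csv('train.csv')

# profile a statement with cProfile to see where the time is spent
%prun pd.read_csv('train.csv')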

Python Tricks:

  • load dataset from online resources: get_data()
from fastai.io import get_data
import pickle, gzip, os

def load_mnist(filename):
    return pickle.load(gzip.open(filename, 'rb'), encoding='latin-1')

# create a local path for saving the file
PATH = 'data/mnist/'
os.makedirs(PATH, exist_ok=True)

# download and load the file
URL = 'http://deeplearning.net/data/mnist/'
FILENAME = 'mnist.pkl.gz'
get_data(URL+FILENAME, PATH+FILENAME)
((x, y), (x_valid, y_valid), _) = load_mnist(PATH+FILENAME)
  • loading and saving data: pickle or feather

— In terms of data format: pickle works for nearly every Python object, but it is probably not optimal for any particular one, while feather is specifically designed for saving and loading large pandas DataFrames quickly.

— In terms of language: pickle files can only be read from Python, while feather is a language-agnostic format that can also be read from other tools such as R (see the sketch below).
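A minimal sketch of both options, assuming df is a pandas DataFrame, the tmp/ directory exists, and pyarrow is installed for feather support:

import pickle
import pandas as pd

# pickle: works for (almost) any Python object
with open('tmp/df.pkl', 'wb') as f:
    pickle.dump(df, f)
with open('tmp/df.pkl', 'rb') as f:
    df = pickle.load(f)

# feather: fast columnar on-disk format for DataFrames
# (reset_index() may be needed first if the DataFrame has a custom index)
df.to_feather('tmp/df.feather')
df = pd.read_feather('tmp/df.feather')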

  • adding new axis to array: np.expand_dims() or array[None], array[None,:]
  • broadcasting: np.broadcast_to() ; it enables fast, vectorised computation on arrays of different shapes.

The smaller array (lower-rank tensor) is “broadcast” across the larger array so that they have compatible shapes. Comparison starts with the trailing dimensions and works its way forward; two dimensions are compatible when they are equal, or when one of them is 1 (a missing dimension is treated as 1).
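A minimal sketch of the rule in action:

import numpy as np

a = np.array([[1, 2, 3],
              [4, 5, 6]])        # shape (2, 3)
b = np.array([10, 20, 30])       # shape (3,)

# shapes (2, 3) and (3,) are compatible: b is broadcast across the rows of a
print(a + b)                     # [[11 22 33] [14 25 36]]

# broadcast_to makes the expansion explicit without copying any data
print(np.broadcast_to(b, (2, 3)))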

  • generator : iter() and next()
  • join pandas dataframes: merge with a left join:
def join_df(left, right, left_on, right_on=None, suffix='_y'):
    if right_on is None: right_on = left_on
    return left.merge(right, how='left', left_on=left_on,
                      right_on=right_on, suffixes=("", suffix))

# join table store with store_states:
store_pre = store.copy()
store = join_df(store, store_states, "Store")
# check for nulls coming from the right table
len(store[store.State.isnull()])
# check the number of rows before and after the join:
store_pre.shape[0], store.shape[0]

After a left join, always check whether there are rows from the right-hand side that are now null; if so, some rows were not matched. Also check that the number of rows has not changed before and after the join. If it has, the right-hand side table was not unique on the join key.

Data Preparation Tricks

read data into a pandas DataFrame:

types = {'id': 'int64',
         'item_nbr': 'int32',
         'store_nbr': 'int8',
         'unit_sales': 'float32',
         'onpromotion': 'object'}

%%time
df_all = pd.read_csv(f'{PATH}train.csv', parse_dates=['date'],
                     dtype=types, infer_datetime_format=True)

CPU times: user 1min 41s, sys: 5.08 s, total: 1min 46s
Wall time: 1min 48s
  • For fast loading:

For a large dataset, it is always beneficial to specify the dtype of each column in advance to reduce memory usage. To figure out the column data types, you can look at a few lines of the file first, either by reading just the first few rows with nrows=5 in pd.read_csv , or by taking a random sample with shuf using the -n option, which limits the output to the number of lines specified. You can also specify the output file:

shuf -n 5 -o sample_training.csv train.csv
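A minimal sketch of how to use that sample to decide on the dtypes, assuming the sample_training.csv produced by the shuf command above (a random sample is unlikely to contain the header row, hence header=None):

import pandas as pd

# read only the small random sample and inspect the inferred dtypes,
# then use them to build the `types` dict for the full read_csv call
sample = pd.read_csv('sample_training.csv', header=None)
print(sample.dtypes)
print(sample.head())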

For columns with dtype='object' : object is a general-purpose Python datatype which is slow and memory-heavy, so it is better to map the values to a more specific dtype before reading the whole dataset. For example, the column onpromotion holds boolean values but is identified as dtype object because of np.nan missing values, so it makes sense to preprocess it as follows:

df_all.onpromotion.fillna(False, inplace=True)
df_all.onpromotion = df_all.onpromotion.map({'False': False,
                                             'True': True})
df_all.onpromotion = df_all.onpromotion.astype(bool)
  • For a dataset with a date column, add parse_dates to enable standard datetime parsing. For non-standard datetime formats, there are two ways to handle the parsing.

a. pass parser function to date_parser

import datetime as dt
dt.datetime.strptime('30MAR1990', '%d%b%Y')
parser = lambda date: dt.datetime.strptime(date, '%d%b%Y')
pd.read_csv(file, parse_dates=['date'], date_parser=parser)

b. use pd.to_datetime after pd.read_csv()

df['date'] = pd.to_datetime(df['date'], format='%d%b%Y')

fast.ai functions:

  • proc_df()
%%time
trn, y, nas = proc_df(train, 'unit_sales', na_dict=nas)
val, y_val, nas = proc_df(valid, 'unit_sales', na_dict=nas)
  • add_datepart() : expand a date column into its components (year, month, day of week, etc.)
  • train_cats() and apply_cats() : convert string columns to pandas categoricals on the training set and apply the same categories to the validation/test sets (see the sketch below)
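A minimal sketch of how these functions fit together, assuming the fastai 0.7 structured module, a training DataFrame df_raw with a Date column and a SalePrice target, and a hypothetical test frame df_test:

from fastai.structured import add_datepart, train_cats, apply_cats, proc_df

# expand 'Date' into Year, Month, Week, Dayofweek, Elapsed, ... and drop the original column
add_datepart(df_raw, 'Date')

# turn string columns into pandas categoricals on the training data ...
train_cats(df_raw)
# ... and reuse exactly the same category codes on the test data
apply_cats(df_test, df_raw)

# replace categories with numeric codes, fill missing values, split off the target
df, y, nas = proc_df(df_raw, 'SalePrice')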

display the dataframe nicely:

def display_all(df):
    with pd.option_context("display.max_rows", 1000):
        with pd.option_context("display.max_columns", 1000):
            display(df)

display_all(df_raw.tail().transpose())

General Practice in Machine Learning

general procedure for structured datasets:

  1. preparing data:
  • add additional information, e.g. using add_datepart()
  • convert categorical columns to numeric, e.g. using train_cats() and apply_cats() , or using numericalize()
  • fix missing value, e.g. using fix_missing()

These steps, and more, can be customized through the fast.ai function proc_df()

2. obtain training, validation and test datasets:

df_trn, y_trn, nas = proc_df(df_raw, 'SalePrice', subset=30000,
                             na_dict=nas)
X_train, _ = split_vals(df_trn, 20000)
y_train, _ = split_vals(y_trn, 20000)
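The split_vals helper is not shown above; here is a sketch of it, following the one-liner used in the lesson notebooks (the data is assumed to be sorted so that the validation rows come last):

def split_vals(a, n):
    # first n rows go to training, the remaining rows to validation
    return a[:n].copy(), a[n:].copy()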

Creating a validation set is the most important thing you need to do when you are doing a machine learning project. What you need to do is to come up with a dataset where the score of your model on that dataset is going to be representative of how well your model is going to do in the real world.

— courtesy of Hiromi Suenaga’s note on lesson 2

  • set subset to start with a smaller, randomly sampled dataset so that the first-cut training can be sped up and kept interactive
  • training, validation and test datasets should be time-ordered for problems involving the prediction of future values
  • calibrate the validation set for Kaggle competitions:

Build N different models (say N=4) and plot each model’s Kaggle score on the x-axis against its local score on the validation set. If the validation set is good, the points should lie roughly on a straight line.
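A minimal sketch of that check, with purely hypothetical scores for four models:

import matplotlib.pyplot as plt

# hypothetical leaderboard and local validation scores for four models
kaggle_scores = [0.121, 0.112, 0.105, 0.101]
local_scores = [0.115, 0.108, 0.102, 0.098]

plt.scatter(kaggle_scores, local_scores)
plt.xlabel('Kaggle score')
plt.ylabel('local validation score')
plt.show()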

3. instantiate the machine learning algorithm, call fit() on the training set and predict() on the validation set (see the sketch below).
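A minimal sketch with scikit-learn’s RandomForestRegressor, the model used in the structured-data lessons; the hyper-parameter values are placeholders, and X_valid / y_valid are assumed to be held out from the split above:

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

m = RandomForestRegressor(n_estimators=40, min_samples_leaf=3, n_jobs=-1)
m.fit(X_train, y_train)

preds = m.predict(X_valid)
print(np.sqrt(mean_squared_error(y_valid, preds)),  # RMSE on the validation set
      m.score(X_valid, y_valid))                    # R^2 on the validation set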

4. understand the dataset using a small subset.

— figure out the relationships between columns using methods and plots like feature importance (see the sketch below) and partial dependence, and clean up highly correlated columns so that the key features stand out and training is sped up. In practice, you throw some columns away and see whether it makes a difference to the result, since removing redundant columns should not make it worse.

— think more about the most important columns: how do they relate to the dependent variable? what is their distribution? is there any noise in the column, and how can it be fixed? are they collinear with other columns?
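A minimal sketch of the feature-importance step, using scikit-learn’s feature_importances_ directly (the course wraps this in fastai’s rf_feat_importance helper); m is the fitted forest from step 3, and the 0.005 threshold is just an illustrative cut-off:

import pandas as pd

fi = pd.DataFrame({'col': X_train.columns, 'imp': m.feature_importances_})
fi = fi.sort_values('imp', ascending=False)
fi.plot(x='col', y='imp', kind='barh', figsize=(12, 8))

# drop columns that barely matter, retrain, and check the score did not get worse
to_keep = fi[fi.imp > 0.005].col
X_train_keep = X_train[to_keep].copy()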

5. tune hyper-parameters:

  • grid-search: use one training set and one validation set to search over a grid of parameters until the best combination of parameters is found.
  • cross-validation: partition the whole dataset in different ways and iterate over the partitions for training and validation (see the sketch after these steps). In particular:

1). Randomly shuffle the data.

2). Split it into N groups, say N=5.

3). We will call the first one the validation set, and the bottom four the training set.

4). We will train the model using the training set, and we will check against the validation set to get some evaluation results, say RMSE, R², etc.

5). We will repeat that N times, and we will take the average of RMSE, R², etc, and that is a cross-validation average accuracy.
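A minimal sketch using scikit-learn’s cross_val_score, assuming X and y hold the whole prepared dataset (the default scoring for a regressor is R²):

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

m = RandomForestRegressor(n_estimators=40, n_jobs=-1)
# 5-fold cross-validation: fit 5 models, each validated on a different fold
scores = cross_val_score(m, X, y, cv=5)
print(scores.mean(), scores.std())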

Benefit of using cross-validation:

— use all data

— ensemble the N models, each one trained on (N-1)/N of the whole data

Disadvantage of using cross-validation:

— slow for large dataset

— not suitable for datasets whose temporal order is critical due to the random split of validation sets

6. finally, retrain the model on the whole dataset and evaluate it on the test dataset.

probing the dataset:

It is always good to start with a subset of the full dataset and play around with different hyper-parameters and architectures; once everything works well, apply the training and further tuning to the full dataset. For example:

# get a subset
idxs = get_cv_idxs(n, val_pct=150000/n)
joined_samp = joined.iloc[idxs].set_index("Date")
samp_size = len(joined_samp)

# ... do stuff (training, tuning, ...) on joined_samp

# get the full dataset
samp_size = n
joined_samp = joined.set_index("Date")
