Techniques for avoiding overfitting
- Dropout: remove activations at random during training in order to regularize the model
- Data augmentation: modify model inputs during training in order to effectively increase data size
- Batch normalization: adjust the parameterization of a model in order to make the loss surface smoother.
Lecture announcement: platform.ai allows you to train models using images. You can use this as a tool to train models on unlabelled data.
Dataset: Rossmann Store Sales
- created a folder rossmann in /home/jupyter/.fastai/data
- put the rossmann.tgz inside that folder and ran
tar -xvf rossmann.tgz
join_df lets you join tables on specific fields. We'll do a left outer join of right on the left argument, using the given fields for each table. Pandas does joins using the merge method. The suffixes argument describes the naming convention for duplicate fields.
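A minimal sketch of such a helper, assuming it is a thin wrapper around pandas merge. The toy tables and the overlapping Open column are made up for illustration.

```python
import pandas as pd

def join_df(left, right, left_on, right_on=None, suffix="_y"):
    # Left outer join: keep every row of `left`, attach matching rows of `right`.
    if right_on is None:
        right_on = left_on
    return left.merge(right, how="left", left_on=left_on, right_on=right_on,
                      suffixes=("", suffix))

# Toy data (hypothetical): store 3 has no match in the right table.
sales = pd.DataFrame({"Store": [1, 2, 3], "Open": [1, 1, 0]})
stores = pd.DataFrame({"Store": [1, 2], "StoreType": ["a", "b"], "Open": [1, 1]})

joined = join_df(sales, stores, "Store")
print(joined)
```

The duplicated Open field from the right table comes back as Open_y, and the unmatched store gets missing values in the right-hand columns.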
- important for time series
- sometimes the only data you have is a single sequence of time points. In real life that's almost never the case: we have metadata, sequences of other things measured over different time periods, etc. In practice, state-of-the-art results don't use RNNs; they take the time column and add a bunch of metadata like WeekOfMonth, etc. This is what add_datepart() does for us
- So we can use add_datepart() to enrich the columns
- You can treat time series more as tabular data now
- Our goal is to predict the number of sales on a particular date given a store id
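To make the date enrichment idea concrete, here is a rough sketch using plain pandas. add_date_parts is a hypothetical stand-in, not the fastai add_datepart() implementation, and it only derives a handful of columns.

```python
import pandas as pd

def add_date_parts(df, col):
    # Hypothetical helper: expand one date column into several tabular features.
    out = df.copy()
    dates = pd.to_datetime(out[col])
    out["Year"] = dates.dt.year
    out["Month"] = dates.dt.month
    out["Day"] = dates.dt.day
    out["DayOfWeek"] = dates.dt.dayofweek      # Monday is 0
    out["IsMonthEnd"] = dates.dt.is_month_end
    return out

# Toy rows (sales values are made up).
df = pd.DataFrame({"Date": ["2015-07-31", "2015-08-03"], "Sales": [10, 20]})
enriched = add_date_parts(df, "Date")
print(enriched)
```

Each derived column can then be treated as an ordinary categorical or continuous variable in a tabular model.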
Preprocessors are run once on the training set, and the same transformations are then applied to both the training and test sets.
idx = np.random.permutation(range(n))[:2000]  # grab 2000 row ids at random
small_train_df = train_df.iloc[idx[:1000]]    # first 1000 ids
small_test_df = train_df.iloc[idx[1000:]]     # remaining 1000 ids
# grab 5 columns (2 continuous, 3 categorical) plus the dependent variable Sales
small_cont_vars = ['CompetitionDistance', 'Mean_Humidity']
small_cat_vars = ['Store', 'DayOfWeek', 'PromoInterval']
small_train_df = small_train_df[small_cat_vars + small_cont_vars + ['Sales']]
small_test_df = small_test_df[small_cat_vars + small_cont_vars + ['Sales']]
Observe the training and test data
First processing step: categorify
We can look at the specific categories by accessing them:
We can also apply FillMissing to identify missing fields in certain columns, and then fill those with the median.
The fact that something is missing is of itself an insight. So we want to keep that information but we still need the field to be some variable (in the case of CompetitionDistance, a continuous variable), so we can replace it with almost any number.
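The two steps can be sketched with plain pandas. This illustrates the idea rather than the fastai Categorify and FillMissing classes: the category list and the median are learned from the training frame only and then reused on the test frame. The toy values and the _na column name are assumptions.

```python
import pandas as pd

# Toy frames (values are made up).
train = pd.DataFrame({
    "PromoInterval": ["Jan,Apr", None, "Feb,May"],
    "CompetitionDistance": [100.0, None, 300.0],
})
test = pd.DataFrame({
    "PromoInterval": ["Feb,May", "Mar,Jun"],
    "CompetitionDistance": [None, 50.0],
})

# Categorify: learn the categories on train, reuse them on test.
categories = pd.Categorical(train["PromoInterval"]).categories
train_codes = pd.Categorical(train["PromoInterval"], categories=categories).codes
test_codes = pd.Categorical(test["PromoInterval"], categories=categories).codes
# Missing or unseen values get the code -1.

# FillMissing: keep a flag that the value was missing, then fill with the train median.
median = train["CompetitionDistance"].median()
for frame in (train, test):
    frame["CompetitionDistance_na"] = frame["CompetitionDistance"].isna()
    frame["CompetitionDistance"] = frame["CompetitionDistance"].fillna(median)

print(train_codes, test_codes)
print(test)
```

The extra boolean column keeps the "this was missing" signal, so the fill value itself matters much less.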
You don't have to call preprocessors manually. When you create any kind of ItemList in the data block API (e.g. TabularList), you can pass in a list of preprocessors, which you first define:
procs=[FillMissing, Categorify, Normalize]
and then pass that in with
data = (TabularList.from_df(df, path=path, cat_names=cat_vars, cont_names=cont_vars, procs=procs)
        .split_by_idx(valid_idx)  # valid_idx: indices of the validation rows, assumed to be defined earlier
        .label_from_df(cols=dep_var, label_cls=FloatList, log=True)
        .add_test(TabularList.from_df(test_df, path=path, cat_names=cat_vars, cont_names=cont_vars))
        .databunch())
cat_vars, the categorical variables, are not just strings; they also include things like day of week, month, etc.
cat_vars = ['Store', 'DayOfWeek', 'Year', 'Month', 'Day', 'StateHoliday', 'CompetitionMonthsOpen',
    'Promo2Weeks', 'StoreType', 'Assortment', 'PromoInterval', 'CompetitionOpenSinceYear', 'Promo2SinceYear',
    'State', 'Week', 'Events', 'Promo_fw', 'Promo_bw', 'StateHoliday_fw', 'StateHoliday_bw',
    'SchoolHoliday_fw', 'SchoolHoliday_bw']
cont_vars = ['CompetitionDistance', 'Max_TemperatureC', 'Mean_TemperatureC', 'Min_TemperatureC',
    'Max_Humidity', 'Mean_Humidity', 'Min_Humidity', 'Max_Wind_SpeedKm_h',
    'Mean_Wind_SpeedKm_h', 'CloudCover', 'trend', 'trend_DE',
    'AfterStateHoliday', 'BeforeStateHoliday', 'Promo', 'SchoolHoliday']
y_range is the range for the sigmoid, which we've seen before. The targets will be in log space, so we take the max of the Sales column (with a little headroom) and take the log of that as well, to use as the max of y_range.
So we pass log=True to have the data block take the log of y, which turns a root mean squared percentage error into (approximately) an RMSE.
max_log_y = np.log(np.max(train_df['Sales'])*1.2)
y_range = torch.tensor([0, max_log_y], device=defaults.device)
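As a rough illustration of what a y_range does to the final activation, the sketch below squashes raw outputs into [lo, hi] with a scaled sigmoid. This is a NumPy approximation of the idea, not the fastai code, and max_sales is a made-up number.

```python
import numpy as np

def scaled_sigmoid(x, lo, hi):
    # Map any real number into the open interval (lo, hi).
    return lo + (hi - lo) / (1.0 + np.exp(-x))

max_sales = 40000.0                    # hypothetical maximum of the Sales column
max_log_y = np.log(max_sales * 1.2)    # same 1.2 headroom factor as above

raw = np.array([-10.0, 0.0, 10.0])     # raw final-layer activations
preds_log = scaled_sigmoid(raw, 0.0, max_log_y)
preds_sales = np.exp(preds_log)        # undo the log to get back to sales units
print(preds_log, preds_sales)
```

Because a sigmoid only approaches its limits asymptotically, leaving some headroom above the true maximum makes the largest targets reachable.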
The intermediate weight matrix needs to go from 1000 input activations to 500 output activations, so it has 500,000 elements. With that many parameters the model will overfit. To make sure it doesn't, we use regularization rather than reducing the number of parameters. We use weight decay for this.
learn = tabular_learner(data, layers=[1000,500], ps=[0.001,0.01], emb_drop=0.04, y_range=y_range, metrics=exp_rmspe)
But we want to give it even more regularization, so we pass in ps and emb_drop, which will give us dropout.
At random, we throw away some percentage of the activations
We only have 2 types of elements in a neural network: parameters or activations. So we’re going to throw away some activations
For each minibatch, we throw away a different subset of activations. Specifically, we throw each one away with a probability p. A common value for p is 0.5.
This means that no single activation can memorize some part of the input, which is what happens when there's overfitting. With dropout, it is hard for an activation to memorize a particular input. Geoffrey Hinton has given an analogy for this.
Dropout works really well. We can use it in our models to get better generalization almost for free. Dropout also reduces the capacity of your model, and too much of it causes underfitting, so you have to tune it.
In pretty much every fast.ai learner, there's a parameter called ps, which is the dropout probability p for each layer. You can pass in a list like ps=[0.001,0.01], or you can pass in a single number and it'll create a list with that value everywhere.
learn = tabular_learner(data, layers=[1000,500], ps=[0.001,0.01], emb_drop=0.04, y_range=y_range, metrics=exp_rmspe)
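Training time is when back propagation happens. During training, dropout works as described above.
At test time, we don't apply dropout. The dropout paper suggests multiplying the weights at test time by p (where, in the paper, p is the probability that a unit is kept). Equivalently, the kept activations can be scaled up during training instead, which is what PyTorch does, so nothing needs to change at test time.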
This means we’re going to use a tiny bit of dropout on the first layer, slightly more on the second layer and a special dropout (0.04) in the embedding layer.
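A minimal NumPy sketch of dropout with train and test behaviour, assuming the "inverted" variant in which kept activations are rescaled by 1/(1-p) during training. The function name and toy activations are illustrative only.

```python
import numpy as np

def dropout(x, p, training, rng):
    # p is the probability of throwing an activation away.
    if not training or p == 0.0:
        return x                        # test time: activations pass through unchanged
    keep = rng.random(x.shape) >= p     # a fresh random mask on every call (every minibatch)
    return x * keep / (1.0 - p)         # rescale so the expected activation is unchanged

rng = np.random.default_rng(0)
acts = np.ones((1000, 100))

train_out = dropout(acts, 0.5, training=True, rng=rng)
test_out = dropout(acts, 0.5, training=False, rng=rng)
print(train_out.mean(), test_out.mean())
```

With p=0.5 roughly half the activations become zero and the rest are doubled, so the average stays close to the test-time value.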
If we inspect the model, each embedding matrix tells you the number of levels for its input (the first number).
You can match these with your list
cat_vars = ['Store', 'DayOfWeek', 'Year', 'Month', 'Day', 'StateHoliday', 'CompetitionMonthsOpen', 'Promo2Weeks', 'StoreType', 'Assortment', 'PromoInterval', 'CompetitionOpenSinceYear', 'Promo2SinceYear', 'State', 'Week', 'Events', 'Promo_fw', 'Promo_bw', 'StateHoliday_fw', 'StateHoliday_bw', 'SchoolHoliday_fw', 'SchoolHoliday_bw']
The first one is Store, so it's not surprising that it has 1,116 levels. The second number in the tuple is the size of the embedding. That's a number that you get to choose.
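For intuition, an embedding lookup can be sketched as indexing rows of a matrix, which gives the same result as multiplying a one-hot encoded input by that matrix. The embedding size of 8 and the random values below are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_levels, emb_size = 1116, 8                    # emb_size is a hypothetical choice
emb = rng.normal(size=(n_levels, emb_size))     # one row per level of the Store variable

store_codes = np.array([0, 5, 1115])            # integer codes for three rows of data
vectors = emb[store_codes]                      # lookup: one embedding vector per row

# Same result via one-hot encoding followed by a matrix multiplication.
one_hot = np.zeros((len(store_codes), n_levels))
one_hot[np.arange(len(store_codes)), store_codes] = 1.0
via_matmul = one_hot @ emb
print(vectors.shape)
```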
len(data.train_ds.cont_names) tells us the number of variables in the batch norm layer. It makes sense that it's 16, because we have 16 continuous variables:
cont_vars = ['CompetitionDistance', 'Max_TemperatureC', 'Mean_TemperatureC', 'Min_TemperatureC', 'Max_Humidity', 'Mean_Humidity', 'Min_Humidity', 'Max_Wind_SpeedKm_h', 'Mean_Wind_SpeedKm_h', 'CloudCover', 'trend', 'trend_DE', 'AfterStateHoliday', 'BeforeStateHoliday', 'Promo', 'SchoolHoliday']
The value of our predictions is some function of our various weights. There could be millions of them. And of course, the inputs to our layer are also taken in. This function is the function of our neural net.
Our loss, say MSE, is the mean of the squared differences between the actuals and the predictions.
Say we're trying to predict movie ratings between 1 and 5, and the activations at the end of our model are in [-1,1], which is way off from the [1,5] we want. We could come up with a new set of weights that increases the scale and the mean, but that's hard to do because the weights interact in very complex ways.
But what if we add two parameter vectors, multiplying the activations by g and adding b?
Now it's really easy. To increase the scale, g has a direct gradient, and to change the mean, b has a direct gradient too. That's what batchnorm does.
You definitely want to use it.
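A simplified sketch of the batchnorm forward pass in NumPy, assuming training-mode statistics computed over the batch. The values chosen for g and b are arbitrary; in a real model they are learned.

```python
import numpy as np

def batchnorm_forward(x, g, b, eps=1e-5):
    # Normalize each column over the batch, then apply the learnable scale g and shift b.
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return x_hat * g + b

rng = np.random.default_rng(0)
acts = rng.uniform(-1.0, 1.0, size=(64, 3))    # activations stuck in [-1, 1]

out = batchnorm_forward(acts, g=np.full(3, 1.2), b=np.full(3, 3.0))
print(out.mean(axis=0), out.std(axis=0))
```

The output mean is set by b and the output spread by g, regardless of what the earlier weights were doing.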
Another normalization technique in fast.ai is called weight norm.
More on the model
One interesting thing is this momentum. This is not momentum like in optimization, but momentum as in exponentially weighted moving average. We take an exponentially weighted moving average of the mean and standard deviation.
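The moving average can be sketched as below, assuming the PyTorch convention in which momentum is the weight given to the newest batch statistic. The batch means are made-up numbers.

```python
def update_running(running, batch_stat, momentum=0.1):
    # Exponentially weighted moving average: mostly the old value, a little of the new one.
    return (1.0 - momentum) * running + momentum * batch_stat

running_mean = 0.0
for batch_mean in [1.0, 1.0, 1.0]:      # hypothetical per-batch means
    running_mean = update_running(running_mean, batch_mean)
print(running_mean)
```

A small momentum means the running statistics change slowly from batch to batch; a large one makes them track the most recent batches.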
Coming back to computer vision and pets dataset.
We use get_transforms() as usual, but there are a lot of parameters we can control:
tfms = get_transforms(max_rotate=20, max_zoom=1.3, max_lighting=0.4, max_warp=0.4)
You can pick padding mode, reflection, etc.
If we use plot_multi, we get a 3x3 grid of plots, each containing the result of a call to _plot(), which will receive the plot coordinates and the axis.
These pictures all look pretty different, but we didn't have to do any extra labeling, so it's like free extra data.
One big area of research is figuring out how to do data augmentation with other kinds of data.
Train the model
We know the process of creating and running a CNN model:
learn = cnn_learner(data, models.resnet34, metrics=error_rate, bn_final=True)
Next, we run fit_one_cycle() for a number of epochs with a learning rate, then unfreeze and do it again, etc.
We want to make a heatmap from scratch:
This is a picture that shows what part of the image the CNN focused on when it was trying to decide what this picture is.
Instead of the matrix multiplications we’ve seen before this, we’re going to do a convolution.
A convolution is just a kind of matrix multiplication which has some interesting properties.
Each item in the 3x3 matrix (red square) is a pixel value from the picture. If you move the red box, the numbers will change.
This is the convolution kernel:
We take each little 3x3 part of this image, and we’re going to do an element-wise multiplication of each of the 9 pixels that we are mousing over with each of the 9 items in our kernel.
Once we multiply each pair together, we add them all up, and that is what's shown in the image on the right. The black border is there because at the edge the 3x3 kernel can't go any further, so the furthest you can go is to end up with a dot in the middle just off the corner.
This procedure where we take each 3x3 area, and element wise multiply them with a kernel, and add each of those up together to create one output is called a convolution.
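The procedure can be written out directly as a minimal, unoptimized sketch with no padding. The toy image and the edge-detecting kernel are assumptions chosen to make the output easy to check by hand.

```python
import numpy as np

def conv2d(image, kernel):
    # Slide the kernel over every position where it fits entirely inside the image.
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(patch * kernel)   # element-wise multiply, then add up
    return out

# Toy 5x5 image: dark on the left, bright on the right.
image = np.zeros((5, 5))
image[:, 2:] = 1.0

# A kernel that responds to left-to-right increases in brightness.
kernel = np.array([[-1.0, 0.0, 1.0],
                   [-1.0, 0.0, 1.0],
                   [-1.0, 0.0, 1.0]])

result = conv2d(image, kernel)
print(result)
```

The output is large where the window straddles the dark-to-bright edge and zero where the window covers a flat region. Without padding, a 5x5 input gives a 3x3 output.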
Another visualization by Matt Kleinsmith:
The 2x2 squares on the left are kernels, and 3x3 in the middle are pixels. On the right is the output
So the pink bit will be correspondingly multiplied by the pink bit, the green by the green, and so forth. And they all get added up together to create this top left in the output. In other words:
A convolution is just a matrix multiplication where 2 things happen:
- some of the entries are set to zero all the time
- all of the entries that have the same color always have the same weight
When you have elements with the same weight, that’s called weight tying.
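To check the matrix-multiplication view, the sketch below builds the weight matrix for a 2x2 kernel over a 3x3 image (the sizes in the visualization above) and compares it with the sliding-window result. The kernel and pixel values are arbitrary.

```python
import numpy as np

kernel = np.array([[1.0, 2.0],
                   [3.0, 4.0]])
image = np.arange(9, dtype=float).reshape(3, 3)

# One row per output pixel (2x2 = 4), one column per input pixel (3x3 = 9).
W = np.zeros((4, 9))
for oi in range(2):
    for oj in range(2):
        for ki in range(2):
            for kj in range(2):
                # The same four kernel values are reused in every row: tied weights.
                W[oi * 2 + oj, (oi + ki) * 3 + (oj + kj)] = kernel[ki, kj]

as_matmul = (W @ image.reshape(9)).reshape(2, 2)

# Direct sliding-window convolution for comparison.
direct = np.array([[np.sum(image[i:i + 2, j:j + 2] * kernel) for j in range(2)]
                   for i in range(2)])
print(W)
```

Each row of W has 5 entries fixed at zero and 4 entries that are copies of the kernel weights, which is exactly the two properties listed above.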
We have to think about padding because otherwise we might miss pixels. Padding involves putting additional numbers around the border of the image.
This means you’ll have the same output size as you started with. For simple convolutions we can use 0 padding but that’s not always the case.
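A quick sketch of two padding modes with NumPy's np.pad, using a toy 5x5 image. With one pixel of padding on each side, a 3x3 kernel fits in 5x5 positions, so the output keeps the input size.

```python
import numpy as np

image = np.arange(25, dtype=float).reshape(5, 5)

zero_padded = np.pad(image, 1, mode="constant", constant_values=0)   # border of zeros
reflect_padded = np.pad(image, 1, mode="reflect")                    # border mirrors the image

# Number of positions a 3x3 kernel can take along one side of the padded image.
out_size = zero_padded.shape[0] - 3 + 1
print(zero_padded.shape, out_size)
```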
We take it a step further.
We need to create a 3x3x3 kernel. Rather than doing an element-wise multiplication of 9 things, we’re going to do an element-wise multiplication of 27 things (3 by 3 by 3) and we’re still going to then add them up into a single number.
We can do that on the entire padded image input.
Our image was initially 5x5, so with padding we'll have an output that's also 5x5. But our input had 3 channels (red, green, blue) and our output has only one. One kernel only detects one kind of feature, and we still want to find things like a gradient or an area of constant white, so we need another kernel and another convolution over the input, which creates another 5x5 output. We then stack those outputs.
That’s going to result in another rank 3 tensor output.
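The multi-channel case can be sketched the same way, assuming zero padding and two randomly initialized 3x3x3 kernels (in a real network these would be learned).

```python
import numpy as np

def conv_multichannel(image, kernels):
    # image: (channels, height, width); kernels: (n_kernels, channels, 3, 3)
    _, h, w = image.shape
    padded = np.pad(image, ((0, 0), (1, 1), (1, 1)))   # zero-pad height and width only
    out = np.zeros((kernels.shape[0], h, w))
    for k, kernel in enumerate(kernels):
        for i in range(h):
            for j in range(w):
                # 27 element-wise products (3 channels x 3 x 3), summed into one number.
                out[k, i, j] = np.sum(padded[:, i:i + 3, j:j + 3] * kernel)
    return out

rng = np.random.default_rng(0)
image = rng.random((3, 5, 5))              # toy RGB image
kernels = rng.normal(size=(2, 3, 3, 3))    # two kernels, one output channel each

out = conv_multichannel(image, kernels)
print(out.shape)
```

Stacking one 5x5 output per kernel gives a rank 3 tensor again, which can be fed to the next convolution.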
If we look at a particular image:
data = get_data(352,16)
learn = cnn_learner(data, models.resnet34, metrics=error_rate, bn_final=True).load('352')
If we take a look at the output sizes of the layers, the input we asked for was 352 by 352 pixels (from get_data(352,16)) and, generally speaking, the very first convolution tends to have a stride of 2. So after the first layer, it's 176 by 176.
Then, as we go along, you'll see that from time to time we halve the grid size (e.g. go from 88 by 88 to 44 by 44, so that was a stride 2 conv), and when we do that we generally double the number of channels.
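The grid sizes follow from the usual output-size formula for a convolution. The kernel sizes and padding below (7x7 with padding 3 for the first layer, 3x3 with padding 1 later on) are assumptions about a typical ResNet-style setup, not values read from this model.

```python
def conv_out_size(n, kernel, stride, pad):
    # Number of positions the kernel can take along one side of the padded input.
    return (n + 2 * pad - kernel) // stride + 1

first = conv_out_size(352, kernel=7, stride=2, pad=3)    # first convolution, stride 2
later = conv_out_size(88, kernel=3, stride=2, pad=1)     # a later stride 2 convolution
same = conv_out_size(88, kernel=3, stride=1, pad=1)      # stride 1 keeps the grid size
print(first, later, same)
```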