Linear layers (or Dense layers) are to Deep Learning what rockets are to space ships: essential. Most, if not all, models I’ve seen to date use them in one capacity or another. But can the proverbial wheel be improved on? We’ll look at a simple method to put your Linear layers on “steroids” in this article.
What happens in a Linear layer is we take all of the data as a matrix (or tensor), multiply it by a matrix of trained weights, then add a vector of trained biases. The number of multiply-add operations grows quadratically with the layer’s width: double both the input and output features and you roughly quadruple the work. What would happen, though, if we split the data up into smaller, parallel parts and ran them through a narrower but deeper Linear layer? Let’s test that out on CIFAR10 and see.
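To make that cost concrete, here is a quick sketch (my own illustration, not from the article’s scripts) counting the parameters of a Linear layer as its width doubles:

```python
import torch.nn as nn

def linear_params(d_in: int, d_out: int) -> int:
    """Parameters in a Linear layer: d_in * d_out weights + d_out biases."""
    layer = nn.Linear(d_in, d_out)
    return sum(p.numel() for p in layer.parameters())

print(linear_params(100, 100))  # 10,100
print(linear_params(200, 200))  # 40,200 -- doubling the width ~quadruples the size
```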
We’ll modify the classification model from Pytorch’s tutorial here, so we can get up and running quickly: https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html#sphx-glr-beginner-blitz-cifar10-tutorial-py
A Github link to the scripts used is at the end of the article.
For Keras/Tensorflow users: if you can follow this Pytorch example, the same benefit can be achieved in Keras via the Dense layer, which likewise accepts inputs with an arbitrary number of dimensions.
First, we’ll get our images transformed and loaded:
Next, we will define our model:
So what we have are two Conv2d layers followed by two Linear layers. Pretty simple.
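A sketch of the control model is below. The exact filter counts (13 and 18) and the 120-wide fc1 are reconstructed from the parameter totals reported later in this article (they give exactly 62,186 parameters), so treat them as inferred rather than quoted:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 13, 5)      # 3 input channels -> 13 filters
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(13, 18, 5)     # 13 channels -> 18 filters
        self.fc1 = nn.Linear(18 * 5 * 5, 120) # 450 flattened conv features
        self.fc3 = nn.Linear(120, 10)         # 10 CIFAR10 classes

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 18 * 5 * 5)            # (batch_size, 450)
        x = F.relu(self.fc1(x))
        x = self.fc3(x)
        return x
```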
And finally, the train and test iteration.
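The loop follows the tutorial’s recipe (SGD with momentum, cross-entropy loss, 2 epochs); the function names here are my own, and the sketch works with any model and DataLoaders:

```python
import torch
import torch.nn as nn
import torch.optim as optim

def train(net, trainloader, epochs=2, lr=0.001, momentum=0.9):
    """Standard classification training loop in the tutorial's style."""
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(net.parameters(), lr=lr, momentum=momentum)
    for epoch in range(epochs):
        running_loss = 0.0
        for inputs, labels in trainloader:
            optimizer.zero_grad()
            loss = criterion(net(inputs), labels)
            loss.backward()
            optimizer.step()
            running_loss += loss.item()
        print(f'epoch {epoch + 1}: loss {running_loss / len(trainloader):.3f}')

def evaluate(net, testloader):
    """Top-1 accuracy (%) on the test set."""
    correct = total = 0
    with torch.no_grad():
        for images, labels in testloader:
            predicted = net(images).argmax(dim=1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()
    return 100 * correct / total
```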
Running this, you should get something like the following:
The performance indicators are 56% accuracy and 1.184 best loss after 2 epochs.
The model above took the output of the final Conv2D layer and flattened it into a 2-dimensional view of size (batch_size, in_features). For our “hack”, we will instead reshape the Conv2D output into a 3-dimensional tensor, run it through fc1, then change fc1’s output view back to a 2-dimensional tensor before it goes through fc3.
Now let’s make the view a 3-dimensional tensor with 10 splits. The changed lines in the model are marked with comments.
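A sketch of the modified model is below. As above, the conv filter counts are inferred, and making fc1 output `n3 // splits` features per split is my reconstruction of the change — it reproduces the article’s reported parameter counts exactly (8,618 with n3=120, and 61,178 with n3=3,720):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SplitNet(nn.Module):
    def __init__(self, splits=10, n3=120):
        super().__init__()
        self.splits = splits                                      # added
        self.conv1 = nn.Conv2d(3, 13, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(13, 18, 5)
        self.fc1 = nn.Linear(18 * 5 * 5 // splits, n3 // splits)  # changed
        self.fc3 = nn.Linear(n3, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, self.splits, 18 * 5 * 5 // self.splits)    # changed
        x = F.relu(self.fc1(x))           # Linear acts on the last dim only
        x = x.view(-1, self.fc3.in_features)                      # added
        x = self.fc3(x)
        return x
```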
If we exclude the comments, we only changed 2 lines of code and added 2 lines of code. Printing this out, we can see that this drastically reduces fc1’s size. We should have something like this now:
That brought the total parameters from 62,186 → 8,618. Let’s add some more parameters back in to get the size back to around 62k, so we can compare apples to apples.
We could put them into the Conv2D layers. But that would not show whether the split itself is causing the benefit, or whether larger Conv2D layers simply suit this model better. So we will add the extra parameters only to the linear layers. Changing n3 from 120 to 120x31=3,720 gives us a total parameter count of 61,178.
So it’s just a hair smaller than the control model we want to compare with: 62,186 → 61,178 parameters. Now, let’s see how this performs.
We can see it already has nearly the same accuracy as the final result of the original model after only the first epoch. By the second epoch, it outperformed the first model!
Best Loss: 1.184 → 1.090
Accuracy: 56% → 62%
So what is going on here?
Let’s try to visualize this, starting from the first layer.
- (conv1) Conv2D — The image of, say, a cat, enters in conv1, which runs it through 13 trained filters.
- (pool) MaxPool2D — Then we use a max pooling layer that takes the highest value pixel from each 2x2 area in all of those 13 filtered images, reducing them to 1/4th the size.
- (conv2) Conv2D — Then those 13 smaller images of our cat run through conv2 which has 18 trained filters.
- (pool) MaxPool2D — Another pooling layer. Each time we use a filter and pooling, our cat’s image size gets smaller. So the final pool passes 18 smaller 5x5 images, highlighting different features of our furry Tabby.
- x.view() — Those get flattened and passed on to our linear layer with 450 features in total (5x5x18) in the original model. But with our second model, those 450 data points are sandwiched into 10 feature sets, 45 wide each. (batch_size, splits, in_features)
- (fc1) Linear — Then each of those sets gets passed into fc1, which is now only 45 features wide but whose outputs across the 10 splits total 3,720 features. So the smaller fc1 layer is fed the Conv2D output in 10 smaller chunks, instead of all at once.
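The mechanism behind the last two steps is that nn.Linear only transforms the last dimension of its input, treating every leading dimension as batch. A small standalone demonstration (the sizes 45 and 12 are just illustrative):

```python
import torch
import torch.nn as nn

fc1 = nn.Linear(45, 12)         # 45 in-features per split

flat = torch.randn(4, 450)      # (batch, features), as in the original model
split = flat.view(4, 10, 45)    # (batch, splits, features_per_split)

out = fc1(split)                # Linear acts on the last dim: (4, 10, 12)

# Equivalent to applying fc1 to each of the 10 splits independently:
manual = torch.stack([fc1(split[:, i]) for i in range(10)], dim=1)
assert torch.allclose(out, manual, atol=1e-6)
```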
Similar to how a larger batch size can yield better results, this forces the smaller fc1 layer to generalize: the gradients from those 10 splits accumulate for each image and are effectively averaged together in the backward pass. This regularizing effect helps alleviate overfitting, which is what yields the higher accuracy.
In theory, this should work to improve linear layers in most use cases, assuming the total parameter size stays approximately equal. Feel free to try it out on your models and if it helps, I’d love to hear about it. Leave a comment to let me know! Thank you for reading.
Here is a repository with both the control and test Pytorch scripts if you’d like to play with them: