Best Machine Learning “Hack” of 2021

Jeremiah Johnson
May 5 · 4 min read

Linear layers (or Dense layers) are to Deep Learning what rockets are to space ships: essential. Most, if not all, models I’ve seen to date use them in one capacity or another. But can the proverbial wheel be improved on? We’ll look at a simple method to put your Linear layers on “steroids” in this article.

Photo by Damir Spanic on Unsplash

What happens in a Linear layer is we take all of the data as a matrix (or tensor) and use matrix multiplication with a matrix of trained weights, then add a vector of trained biases. The number of operations in that matrix multiplication grows quadratically with the number of features. What would happen, though, if we split the data up into smaller, parallel parts and ran it through a Linear layer with more depth? Let’s test that out on CIFAR10 and see.

We’ll modify the classification model from PyTorch’s tutorial here, so we can get up and running quickly:

A Github link to the scripts used is at the end of the article.

For Keras/TensorFlow users: if you can follow along with this PyTorch example, the same benefit can be achieved in Keras via the Dense layer, which also accepts inputs with an arbitrary number of dimensions.

First, we’ll get our images transformed and loaded:

Transforms and Dataloader

Next, we will define our model:

Control Model

So what we have are two Conv2d layers followed by two Linear layers. Pretty simple.
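The model gist isn’t included in this excerpt, but a sketch consistent with the sizes described in the article (13 and 18 filters, a 450-wide flatten, the fc1/fc3 names, and the 62,186-parameter total quoted later; the kernel size of 5 follows the PyTorch tutorial) is:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 13, 5)        # 13 trained filters
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(13, 18, 5)       # 18 trained filters
        self.fc1 = nn.Linear(18 * 5 * 5, 120)   # 450 -> 120
        self.fc3 = nn.Linear(120, 10)           # 120 -> 10 classes

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))    # (batch, 13, 14, 14)
        x = self.pool(F.relu(self.conv2(x)))    # (batch, 18, 5, 5)
        x = x.view(-1, 18 * 5 * 5)              # (batch, 450)
        x = F.relu(self.fc1(x))
        return self.fc3(x)
```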

And finally, the train and test iteration.

Train and Test
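The train/test gist is also not in this excerpt; a compact sketch in the tutorial’s style (the optimizer, learning rate, and loss function are the tutorial’s defaults, assumed here) is:

```python
import torch
import torch.nn as nn
import torch.optim as optim

def train_and_test(net, trainloader, testloader, epochs=2):
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)
    for epoch in range(epochs):
        running_loss = 0.0
        for inputs, labels in trainloader:
            optimizer.zero_grad()
            loss = criterion(net(inputs), labels)
            loss.backward()
            optimizer.step()
            running_loss += loss.item()
        # Evaluate accuracy on the test set after each epoch
        correct = total = 0
        with torch.no_grad():
            for inputs, labels in testloader:
                predicted = net(inputs).argmax(dim=1)
                total += labels.size(0)
                correct += (predicted == labels).sum().item()
        print(f"epoch {epoch + 1}: loss {running_loss / len(trainloader):.3f}, "
              f"accuracy {100 * correct / total:.0f}%")
```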

Running this, you should get something like the following:

Control Model Performance

The performance indicators are 56% accuracy and 1.184 best loss after 2 epochs.

The model above took the output of the final Conv2D layer and flattened it into a 2-dimensional view of size (batch_size, features). For our “hack”, we will reshape the Conv2D output into a 3-dimensional tensor instead, run that through fc1, then reshape fc1’s output back to a 2-dimensional tensor before it goes through fc3.

Now let’s make the view into a 3 dimensional tensor with 10 splits. The lines changed in the model are commented in.

Test Model
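The modified gist isn’t reproduced here either; a sketch consistent with the parameter totals quoted below (8,618 at n3 = 120, and 61,178 after the later adjustment to n3 = 3,720; the changed lines are commented) is:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SplitNet(nn.Module):
    def __init__(self, splits=10, n3=120):
        super().__init__()
        self.splits = splits
        self.n3 = n3
        self.conv1 = nn.Conv2d(3, 13, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(13, 18, 5)
        # changed: fc1 now sees 450 // 10 = 45 features per split
        self.fc1 = nn.Linear(18 * 5 * 5 // splits, n3 // splits)
        self.fc3 = nn.Linear(n3, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        # changed: 3-dimensional view, (batch, splits, 45) instead of (batch, 450)
        x = x.view(-1, self.splits, 18 * 5 * 5 // self.splits)
        x = F.relu(self.fc1(x))
        # added: back to a 2-dimensional view, (batch, n3), before fc3
        x = x.view(-1, self.n3)
        return self.fc3(x)
```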

If we exclude the comments, we only changed 2 lines of code and added 2 lines of code. Printing this out, we can see that this drastically reduces fc1’s size. We should have something like this now:

Test Model Size
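One way to reproduce a size printout like the one above is a small helper (not from the original scripts, just a convenience for checking along):

```python
import torch.nn as nn

def count_parameters(model: nn.Module) -> int:
    """Print each parameter tensor's shape and count, and return the total."""
    total = 0
    for name, p in model.named_parameters():
        print(f"{name:20s} {str(tuple(p.shape)):15s} {p.numel():>8,}")
        total += p.numel()
    print(f"{'total':20s} {'':15s} {total:>8,}")
    return total
```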

That brought the total parameters from 62,186 → 8,618. Let’s add some more parameters back in to get the size back to around 62k, so we can compare apples to apples.

We could put them into the Conv2D layers, but then we could not tell whether any benefit came from this change or simply from larger Conv2D layers being more ideal for this model. So we will add the extra parameters only to the linear layers. Changing n3 from 120 to 120×31 = 3,720 gives us a total parameter count of 61,178.

Adjusted Test Model Size

So it’s just a hair smaller than the control model we want to compare with: 62,186 → 61,178 parameters. Now, let’s see how this performs.

Test Model Performance

We can see it already has nearly the same accuracy as the final result of the original model after only the first epoch. By the second epoch, it outperformed the first model!

Best Loss: 1.184 → 1.090

Accuracy: 56% → 62%

So what is going on here?

Photo by Dex Ezekiel on Unsplash

Let’s try to visualize this, starting from the first layer.

  1. (conv1) Conv2D — The image of, say, a cat, enters conv1, which runs it through 13 trained filters.
  2. (pool) MaxPool2D — Then we use a max pooling layer that takes the highest value pixel from each 2x2 area in all of those 13 filtered images, reducing them to 1/4th the size.
  3. (conv2) Conv2D — Then those 13 smaller images of our cat run through conv2 which has 18 trained filters.
  4. (pool) MaxPool2D — Another pooling layer. Each time we use a filter and pooling, our cat’s image size gets smaller. So the final pool passes 18 smaller 5x5 images, highlighting different features of our furry Tabby.
  5. x.view() — Those get flattened and passed on to our linear layer with 450 features in total (5x5x18) in the original model. But with our second model, those 450 data points are sandwiched into 10 feature sets, 45 wide each. (batch_size, splits, in_features)
  6. (fc1) Linear — Then each of those sets gets passed into a 45-feature-wide fc1 layer, 3,720 weights deep. So the smaller fc1 layer is being fed the data of the Conv2D layer in 10 slices, instead of all at once.
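The key move in steps 5 and 6 — handing nn.Linear a 3-dimensional input so it applies the same weights across the split dimension — can be sketched in isolation (the shapes here match the adjusted test model):

```python
import torch
import torch.nn as nn

pooled = torch.randn(4, 18, 5, 5)    # final pool output: 4 images, 18 maps of 5x5
flat = pooled.view(-1, 450)          # original model: (batch_size, 450)
split = pooled.view(-1, 10, 45)      # test model: (batch_size, splits, in_features)

fc1 = nn.Linear(45, 372)             # adjusted test model: 45 -> 372 per split
out = fc1(split)                     # nn.Linear maps only the last dimension
print(out.shape)                     # torch.Size([4, 10, 372])
merged = out.view(-1, 3720)          # flattened back to 2-D before fc3
```

Note that nn.Linear accepts any number of leading dimensions and transforms only the last one, which is what makes this two-line change possible.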

Similar to how a larger batch size can yield better results, this forces the smaller fc1 layer to learn more generalized features: the gradients from those 10 slices are accumulated, and effectively averaged, for each image during the backward pass. This method therefore also helps to alleviate overfitting, which yields higher accuracy.

In theory, this should work to improve linear layers in most use cases, assuming the total parameter size stays approximately equal. Feel free to try it out on your models and if it helps, I’d love to hear about it. Leave a comment to let me know! Thank you for reading.

Here is a repository with both the control and test Pytorch scripts if you’d like to play with them:

Analytics Vidhya
