Current Convolutional Neural Networks are not translation equivariant!
--
Are Convolutional Neural Networks translation-equivariant? Yes… NO. I will take you on a journey through how and why they are not equivariant.
We will present the work in two posts:
- Current Convolutional Neural Networks are not translation equivariant!
- Full convolution experiments with details.
The content of this post is based on our recent paper, which will be presented at CVPR 2020.
Translation equivariance
It is best to start by explaining equivariance. Imagine you are hungry and you have a sandwich in your lunch box, which sits on the table. What do you do? You can take the lunch box from the table, open it, and then grab and eat your delicious sandwich. Or you can first open the box on the table, then pick it up and eat your sandwich. Either way, the result is the same: opening the lunch box is equivariant to translation. Similarly, you can first shift your input and then apply a convolution to it, or first apply the convolution and then shift the result. If the two outcomes are the same, the convolution operation is translation equivariant.
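To make this concrete, here is a minimal sketch in PyTorch that compares shifting-then-convolving with convolving-then-shifting. The image size, the 4-pixel circular shift, and the tolerance are our own illustrative choices, not the paper's code.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
conv = nn.Conv2d(1, 1, kernel_size=5, padding=2, bias=False)  # a 'same' convolution
shift = lambda t: torch.roll(t, shifts=(4, 4), dims=(2, 3))   # circular shift by 4 pixels

x = torch.randn(1, 1, 32, 32)
a = conv(shift(x))   # shift first, then convolve
b = shift(conv(x))   # convolve first, then shift

print(torch.allclose(a, b, atol=1e-5))   # expected: False, rows/cols near the boundary differ
print(torch.allclose(a[..., 8:-8, 8:-8], b[..., 8:-8, 8:-8], atol=1e-5))  # True in the interior
```

Away from the image boundary the two orders agree, which is exactly the equivariance property; the mismatch appears only where the boundary (zero padding) comes into play, and that is the theme of the rest of this post.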
Rook dilemma
Let’s look at rook images. Identical rook patches are placed in the upper-left corner (class-1) and the bottom-right corner (class-2) of a black template. We let a single-layer fully-convolutional network try to learn the location of the patch. The setup is a randomly initialized 5x5 filter with same convolution, followed by ReLU, global max pooling, and a softmax classifier. Because the convolution operation is translation equivariant (and with global max pooling the network is translation invariant), we would expect the network to be unable to distinguish class-1 from class-2. Let’s look at the training visualization: wait a minute...
Boundary effects break translation equivariance in CNNs.
The network finds a filter that easily distinguishes class-1 from class-2. But how? By exploiting the image boundary: the network learns a filter (epoch 25) that detects the upper-left corner of the rook patch. For class-2, the filter response stays inside the feature map; for class-1, however, the response falls outside of it. In other words, the response is cropped away at the boundary, which allows the CNN to learn a filter that activates only on certain parts of the image. Even with a global max pooling operator, the convolution layer can still exploit location. This breaks translation equivariance.
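For reference, the single-layer setup above could look roughly like this in PyTorch. This is a sketch under our own assumptions: the class name, the single-channel input, and the linear head on top of the pooled response are illustrative choices, not the authors' released code.

```python
import torch
import torch.nn as nn

class SingleConvClassifier(nn.Module):
    """One 5x5 'same' convolution, ReLU, global max pooling, 2-class softmax."""
    def __init__(self, num_classes=2):
        super().__init__()
        self.conv = nn.Conv2d(1, 1, kernel_size=5, padding=2, bias=False)  # 'same' convolution
        self.fc = nn.Linear(1, num_classes)   # softmax classifier over the pooled response

    def forward(self, x):                     # x: (N, 1, H, W) rook images
        x = torch.relu(self.conv(x))
        x = torch.amax(x, dim=(2, 3))         # global max pooling -> (N, 1)
        return self.fc(x)                     # logits; train with nn.CrossEntropyLoss
```

If convolution plus global max pooling were truly translation invariant, the pooled response would be identical for both classes and training could never exceed chance accuracy; the fact that it does is exactly the boundary effect described above.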
Convolution types
It turns out that the convolution type plays a major role. What are the effects of the convolution types? In the figure below, we show 3 convolution types: valid, same, and full convolution. This time, we use an even simpler setup: a 3x3 filter and images that have a single pixel value of 1 in the upper-left corner (class-1) or the bottom-right corner (class-2), with all other pixels set to zero. We apply the filter to these images with the different convolution types. Looking at the feature maps for class-1, valid and same convolution produce a completely black feature map, since the boundary prevents them from keeping the response inside the feature map. Full convolution, on the other hand, keeps the response inside the feature map. For class-2, all convolution types retain the response. Consequently, valid and same convolution are not fully translation equivariant, whereas full convolution is.
Full convolution preserves the translation equivariance.
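A quick numerical sanity check of the figure above, as our own sketch in PyTorch: full convolution is emulated with padding = kernel_size - 1, `F.conv2d` actually computes cross-correlation, and the single-tap 3x3 filter below is just an illustrative choice.

```python
import torch
import torch.nn.functional as F

k = torch.zeros(1, 1, 3, 3); k[..., 2, 2] = 1.0          # 3x3 filter with one non-zero tap

img1 = torch.zeros(1, 1, 8, 8); img1[..., 0, 0] = 1.0    # class-1: pixel in the upper-left corner
img2 = torch.zeros(1, 1, 8, 8); img2[..., -1, -1] = 1.0  # class-2: pixel in the bottom-right corner

for name, pad in [("valid", 0), ("same", 1), ("full", 2)]:
    r1 = F.conv2d(img1, k, padding=pad).max().item()
    r2 = F.conv2d(img2, k, padding=pad).max().item()
    print(f"{name:5s}  class-1 max: {r1:.0f}   class-2 max: {r2:.0f}")

# valid  class-1 max: 0   class-2 max: 1   <- response cropped away for class-1
# same   class-1 max: 0   class-2 max: 1   <- response cropped away for class-1
# full   class-1 max: 1   class-2 max: 1   <- full convolution keeps both responses
```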
Pretrained networks exploit the location
What about big networks: can they exploit spatial location? If so, how far from the image boundary can the absolute location be exploited? To answer this, we create the Quadrant Imagenet (QI) dataset. We take 3000 images from the Imagenet validation set, resize them to 56x56, and place them in the upper-left corner (class-1) or the bottom-right corner (class-2) of a black template (total 2k images for train/val/test). To evaluate the effect of the distance from the boundary, we create 7 versions by adding a black border of size ∈ {0, 16, 32, 64, 128, 256, 512} on all 4 sides of the QI images.
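A sketch of how such a QI sample could be constructed (our own helper, not the paper's code; the 112x112 two-quadrant template size is an assumption based on the 56x56 crops):

```python
import torch
import torch.nn.functional as F

def make_qi_sample(img56, cls, border=0):
    """img56: (3, 56, 56) image tensor; cls: 0 = upper-left (class-1), 1 = bottom-right (class-2)."""
    canvas = torch.zeros(3, 112, 112)          # black template with 2x2 quadrants (assumed size)
    if cls == 0:
        canvas[:, :56, :56] = img56            # class-1: upper-left quadrant
    else:
        canvas[:, -56:, -56:] = img56          # class-2: bottom-right quadrant
    # add one of the 7 border sizes in {0, 16, 32, 64, 128, 256, 512} on all 4 sides
    return F.pad(canvas, (border, border, border, border))
```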
Now we have the dataset. To answer the question, we evaluate 3 architectures (BagNet-33, ResNet-18, and DenseNet-121) under 3 different weight initializations: (i) trained completely from scratch, to see how well the architecture can do; (ii) randomly initialized with frozen convolution weights, to evaluate the architectural bias for location classification; (iii) ImageNet pre-trained with frozen convolution weights, to evaluate the location classification of a converged, realistic model used in a typical image classification setting. A rough sketch of these three settings is shown after the figure below.
Pretrained networks can exploit the absolute spatial location even far from the boundary.
As we can see in the figure above, each pre-trained architecture can exploit absolute spatial location even far from the boundary, depending on its receptive field size and model capacity.
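For completeness, the three weight-initialization settings could be set up roughly as follows. This is a sketch using torchvision's ResNet-18 as an example and assumes a recent torchvision with the `weights=` API; training details and the other architectures are omitted, and the setting names are ours.

```python
import torch.nn as nn
from torchvision.models import resnet18

def build_model(setting, num_classes=2):
    if setting == "scratch":                          # (i) trained completely from scratch
        model = resnet18(weights=None, num_classes=num_classes)
    elif setting == "random_frozen":                  # (ii) random init, frozen conv weights
        model = resnet18(weights=None, num_classes=num_classes)
        for name, p in model.named_parameters():
            if not name.startswith("fc."):
                p.requires_grad = False               # only the classifier head is trained
    elif setting == "pretrained_frozen":              # (iii) ImageNet pre-trained, frozen conv weights
        model = resnet18(weights="IMAGENET1K_V1")
        for p in model.parameters():
            p.requires_grad = False
        model.fc = nn.Linear(model.fc.in_features, num_classes)  # fresh trainable head
    else:
        raise ValueError(setting)
    return model
```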
Data efficiency
Improving equivariance and invariance by using full convolution provides data efficiency for small datasets.
Both the Imagenet classification and the patch matching experiments show that full convolution outperforms same convolution when the dataset is small. When the dataset is large, all convolution types perform similarly.
Summary
Boundary conditions break the translation equivariance property of CNNs, which allows CNNs to exploit absolute spatial location. Full convolution, on the other hand, preserves translation equivariance and provides data efficiency.
In the next post, we will show in more detail the benefits of using full convolution instead of same or valid convolution.
Full convolution provides
- robustness to location biases and image shifts,
- data efficiency on Imagenet classification and patch matching,
- better accuracy and less overfitting with small datasets (on action recognition).
You can find all the implementations of the paper on our Github page. For more information, please check our paper.