Self-Supervised Equivariant Attention Mechanism for Weakly Supervised Semantic Segmentation
Semantic segmentation is a fundamental computer vision task whose goal is to predict pixel-level class labels for an image. Thanks to the rapid progress of deep learning research in recent years, the performance of semantic segmentation models has improved considerably. However, compared with other tasks (such as classification and detection), semantic segmentation requires pixel-level annotations that are time-consuming and expensive to collect. Many researchers have therefore turned to weakly-supervised semantic segmentation (WSSS), which relies on cheaper supervision such as image-level classification labels, scribbles, and bounding boxes, aiming for segmentation performance comparable to fully-supervised methods. SEAM focuses on semantic segmentation supervised only by image-level classification labels.
Most advanced weakly-supervised semantic segmentation methods build on the Class Activation Map (CAM), an effective technique for locating objects using only image classification labels. However, CAM usually covers only the most discriminative parts of the object and is often incorrectly activated in background regions. These failures can be summarized as insufficient activation on the target and excessive activation on the background.
Moreover, when an image is augmented by an affine transformation, the generated CAMs are not consistent. The root cause of these phenomena is the supervision gap between fully-supervised and weakly-supervised semantic segmentation: there is an inherent gap between CAMs obtained from a classification network and the ground truth, because an essential contradiction remains between classification and segmentation.
SEAM applies consistency regularization to CAMs predicted from differently transformed images, providing self-supervision for network learning. To further improve the consistency of the network's predictions, SEAM introduces a pixel correlation module (PCM) that captures the contextual appearance information of each pixel and revises the original CAM with a learned affinity attention map, so that the original CAM of one branch can be regularized against the revised CAM of the other. SEAM is implemented as a Siamese network trained with an equivariant cross regularization (ECR) loss.
SEAM thus combines equivariant regularization (ER) with the pixel correlation module (PCM). Trained with the specially designed losses, the revised CAM not only remains consistent under affine transformations but also fits object contours well, matching the properties expected of a segmentation function rather than a classification one.
A segmentation function should be equivariant, whereas the classification task emphasizes invariance. Since the invariance of a classification network mainly arises from its pooling operations, there is no equivariance constraint during network learning, which makes implicit segmentation nearly impossible to achieve. Additional regularizers must therefore be integrated to narrow the supervision gap between fully- and weakly-supervised learning.
Self-attention is a widely adopted mechanism that can significantly improve a network's approximation capability. It revises the feature map by capturing contextual feature relevance, which also matches the idea behind most WSSS methods: using the similarity between pixels to refine the original activation map.
During data augmentation, various affine transformations are applied. Under full supervision, the ground truth is transformed in the same way, which implicitly imposes an equivariance constraint on the network so that its segmentation results remain consistent across scales.
Formally, this constraint can be written as F(A(I)) = A(F(I)), where I is the input image, F(·) represents the network, and A(·) represents any spatial affine transformation, such as rescaling, rotation, or flipping.
Under weak supervision, however, the only supervision signal is the classification label, which does not change when the original image undergoes an affine transformation. The implicit equivariance constraint is therefore lost, leading to the problem shown in Figure 1.
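The lost constraint can be illustrated with a minimal NumPy sketch. The pixel-wise "network" `F` and the flip transform `A` below are hypothetical stand-ins: any purely per-pixel function commutes with spatial transforms, i.e. F(A(I)) = A(F(I)), which is exactly the equivariance that weak supervision fails to enforce for a real classification backbone.

```python
import numpy as np

# Toy "segmentation network": a purely per-pixel scoring function.
# Any pixel-wise map commutes with spatial transforms, so it is equivariant.
def F(image):
    return 1.0 / (1.0 + np.exp(-image))  # per-pixel sigmoid score

def A(x):
    return np.flip(x, axis=1)  # affine transform: horizontal flip

rng = np.random.default_rng(0)
img = rng.normal(size=(4, 4))

# Equivariance constraint: F(A(I)) == A(F(I))
lhs = F(A(img))
rhs = A(F(img))
print(np.allclose(lhs, rhs))  # True for this pixel-wise F
```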
The authors adopt a Siamese network: two networks with exactly the same structure and shared weights, used to measure how similar two inputs are. The two inputs are fed into the two branches, which map them into a new representation space; a loss computed on the two representations then evaluates their similarity.
Therefore, to integrate the regularization into the original network, the network is extended to a Siamese structure with shared weights. One branch applies the transformation to the network's output, while the other branch warps the input image with the same transformation before the forward pass. The output activation maps of the two branches are then regularized against each other to ensure the consistency of the CAMs: the inputs are the original image and its affine-transformed version, the Siamese network maps both to new representations, and the loss is designed to make the distance between these two representations as small as possible.
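The two branches and the equivariant regularization term can be sketched as follows. This is a minimal NumPy illustration, not SEAM's implementation: `shared_net` is a toy per-pixel scoring function standing in for the real shared-weight CAM backbone, and the scalar weight `w` is illustrative.

```python
import numpy as np

def shared_net(image, w):
    """Stand-in for the shared-weight CAM network: a per-pixel
    linear scoring followed by ReLU (hypothetical)."""
    return np.maximum(image * w, 0.0)

def transform(x):
    return np.rot90(x, k=1, axes=(0, 1))  # affine transform A: 90-degree rotation

def er_loss(image, w):
    # Branch 1: transform the network output, A(F(I))
    out_orig = transform(shared_net(image, w))
    # Branch 2: transform the input first, F(A(I))
    out_aug = shared_net(transform(image), w)
    # Equivariant regularization: L1 distance between the two CAMs
    return np.abs(out_orig - out_aug).mean()

rng = np.random.default_rng(1)
img = rng.normal(size=(6, 6))
print(er_loss(img, w=0.5))  # 0.0 -- a pixel-wise net is perfectly equivariant
```

A real convolutional backbone would not be perfectly equivariant, so this loss stays nonzero in practice and pushes the network toward equivariance during training.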
Pixel Correlation Module (PCM)
Although equivariant regularization provides additional supervision for network learning, ideal equivariance is difficult to achieve with classic convolutional layers alone. Self-attention is an effective module for capturing contextual information and refining pixel-wise predictions.
To further refine the original CAM with contextual information, a pixel correlation module (PCM) is added at the end of the network to integrate the low-level features of each pixel.
The structure of PCM follows the core of the self-attention mechanism, modified and trained under the supervision of equivariant regularization. Cosine distance is used to evaluate the feature similarity between pixels: the affinity between the current pixel and every other pixel is computed as an inner product in the L2-normalized feature space, and a ReLU activation suppresses negative similarities.
The final CAM is the original CAM weighted and summed by these normalized similarities.
Compared with traditional self-attention,
- PCM removes the redundant skip connection so that the revised CAM keeps the same activation scale as the original CAM (the skip connection could reintroduce erroneous activations from the original CAM).
- In addition, since the other network branch provides pixel-level supervision for PCM that is less accurate than ground truth, the embedding functions φ and g are removed to reduce parameters and avoid overfitting to this inaccurate supervision.
- The affinity is activated by ReLU followed by L1 normalization rather than the usual softmax, which masks out irrelevant pixels and yields a smoother affinity attention map over the relevant region. In short, PCM revises the original CAM by learning contextual relationships.
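The affinity computation and CAM revision described above can be sketched in NumPy on flattened maps. Shapes and the `pcm_refine` name are illustrative; the real module operates on convolutional feature tensors inside the network.

```python
import numpy as np

def pcm_refine(cam, feats):
    """Sketch of the Pixel Correlation Module on flattened maps.

    cam:   (C, N) original CAM, C classes over N pixels
    feats: (D, N) low-level features per pixel
    """
    # Cosine affinity: inner product in the L2-normalized feature space
    f = feats / (np.linalg.norm(feats, axis=0, keepdims=True) + 1e-8)
    affinity = f.T @ f                      # (N, N) pixel-to-pixel similarity
    affinity = np.maximum(affinity, 0.0)    # ReLU suppresses negative correlations
    # L1 normalization: each pixel's affinity weights sum to 1
    affinity /= affinity.sum(axis=1, keepdims=True) + 1e-8
    # Revised CAM: weighted sum of the original CAM by normalized affinity
    return cam @ affinity.T

rng = np.random.default_rng(2)
cam = rng.random((3, 16))       # 3 classes over a flattened 4x4 map
feats = rng.normal(size=(8, 16))
refined = pcm_refine(cam, feats)
print(refined.shape)  # (3, 16)
```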
SEAM’s Loss design
SEAM's loss consists of three terms: the classification (cls) loss roughly locates the object, the ER loss narrows the gap between pixel-level and image-level supervision, and the ECR loss integrates PCM with the network so that predictions stay consistent under various affine transformations.
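A hedged NumPy sketch of how the three terms might combine is given below. The function and argument names are illustrative, not SEAM's actual implementation; details such as background-channel handling and the exact ECR pairing are simplified.

```python
import numpy as np

def seam_loss(cls_logits, labels, cam_o, cam_t, cam_o_rev, cam_t_rev):
    """Illustrative combination of SEAM's three loss terms.

    cls_logits / labels: image-level predictions and multi-hot labels
    cam_o / cam_t:       CAMs from the original and transformed branches
    cam_*_rev:           PCM-revised CAMs from each branch
    """
    # Multi-label classification loss (sigmoid BCE) to locate objects roughly
    p = 1.0 / (1.0 + np.exp(-cls_logits))
    l_cls = -(labels * np.log(p + 1e-8)
              + (1 - labels) * np.log(1 - p + 1e-8)).mean()
    # ER: the two branches' original CAMs should agree (L1 distance)
    l_er = np.abs(cam_o - cam_t).mean()
    # ECR: cross-regularize each branch's original CAM against the other
    # branch's PCM-revised CAM
    l_ecr = np.abs(cam_o - cam_t_rev).mean() + np.abs(cam_t - cam_o_rev).mean()
    return l_cls + l_er + l_ecr

rng = np.random.default_rng(3)
logits, labels = rng.normal(size=4), np.array([1.0, 0.0, 1.0, 0.0])
cams = [rng.random((4, 16)) for _ in range(4)]
print(round(float(seam_loss(logits, labels, *cams)), 4))
```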
Extensive experiments on the PASCAL VOC 2012 dataset demonstrate that the proposed method outperforms state-of-the-art methods using the same level of supervision.
1. Yude Wang, Jie Zhang, Meina Kan, Shiguang Shan, Xilin Chen. Self-supervised Equivariant Attention Mechanism for Weakly Supervised Semantic Segmentation. arXiv:2004.04581.