Image Segmentation: Semantic Segmentation (2)

李謦伊

Published in

謦伊的閱讀筆記

8 min read · Jun 15, 2021

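The previous article introduced the main tasks and applications of image segmentation, along with some representative semantic segmentation algorithms. This article continues from there.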

HRNet (CVPR 2019, TPAMI 2020)

🔖 Github: https://github.com/HRNet/HRNet-Semantic-Segmentation

HRNet has two distinguishing features:

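1. It maintains high resolution throughout: high-resolution and low-resolution branches are connected in parallel so that the high-resolution representation is preserved. Earlier semantic segmentation models relied on upsampling to recover high resolution, which tends to lose fine detail.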

Keeping each resolution in its own separate branch is not enough, however, which leads to the second improvement.

2. The high-resolution and low-resolution branches repeatedly exchange information with each other, so the network learns richer features.

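The figure below shows how the exchange works. On the left, upsampling applies bilinear interpolation to the low-resolution feature map and then a 1x1 convolution. On the right, downsampling uses a 3x3 convolution with stride 2.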
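The resolution bookkeeping of this exchange can be sketched in plain Python. This is only a shape calculation, not the HRNet implementation: the function names, the channel counts, and the padding of 1 for the 3x3 convolution are assumptions made for illustration.

```python
def conv_out_size(size, kernel, stride, padding):
    """Standard output-size formula for a convolution along one spatial axis."""
    return (size + 2 * padding - kernel) // stride + 1


def downsample_shape(shape, out_channels):
    """High -> low: one 3x3 convolution with stride 2 (padding=1 is assumed)."""
    _, height, width = shape
    return (out_channels,
            conv_out_size(height, kernel=3, stride=2, padding=1),
            conv_out_size(width, kernel=3, stride=2, padding=1))


def upsample_shape(shape, out_channels, factor=2):
    """Low -> high: interpolation enlarges H and W, then a 1x1 conv sets the channels."""
    _, height, width = shape
    return (out_channels, height * factor, width * factor)


high = (32, 64, 64)  # (channels, height, width) of a hypothetical high-resolution branch
low = (64, 32, 32)   # hypothetical low-resolution branch

# Each branch receives a tensor that matches its own shape,
# so the exchanged features can be fused with the branch's own features.
print(downsample_shape(high, out_channels=64))  # shape sent to the low branch
print(upsample_shape(low, out_channels=32))     # shape sent to the high branch
```

The point of the sketch is that a stride-2 convolution halves the spatial size while interpolation doubles it, so both directions of the exchange end up shape-compatible with the receiving branch.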

HRNet is also widely applicable. It comes with three representation heads, HRNetV1, HRNetV2, and HRNetV2p, used for human pose estimation, semantic segmentation, and object detection respectively.

OCRNet (Object-Contextual Representations, ECCV 2020)

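OCRNet uses the context of object regions to enhance pixel representations. It recasts the usual per-pixel classification problem of semantic segmentation (predicting each pixel's class) as classifying the object region that each pixel belongs to.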

The network architecture is shown in the figure below. The steps are:

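1. The backbone first predicts a coarse segmentation, the Soft Object Regions.
2. Pixel Representations (one per pixel) are combined with the Soft Object Regions to produce Object Region Representations, i.e. one representation per object region, aggregated from its pixels.
3. Pixel Representations and Object Region Representations are then used to compute the Pixel-Region Relation, i.e. the similarity between each pixel and each object region.
4. Object Region Representations are weighted by the Pixel-Region Relation to give the Object-Contextual Representations. This is the paper's key contribution and is used to enhance the pixel representations.
5. Finally, Object-Contextual Representations are concatenated with Pixel Representations to give the Augmented Representations, the enhanced features.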
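The data flow of these steps can be sketched on a tiny example in plain Python. This is a simplified illustration, not the OCRNet code: it assumes a plain dot product for the pixel-region similarity and leaves out the learned transformations the real model applies, and all function names are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(row) for row in zip(*m)]


def ocr_augment(pixel_feats, coarse_logits):
    """pixel_feats: N x C (one row per pixel); coarse_logits: K x N (one row per class)."""
    # Step 1: soft object regions, each class row normalised over the pixels.
    soft_regions = [softmax(row) for row in coarse_logits]      # K x N
    # Step 2: object region representations as weighted sums of pixel features.
    region_feats = matmul(soft_regions, pixel_feats)            # K x C
    # Step 3: pixel-region relation, normalised over the regions.
    similarity = matmul(pixel_feats, transpose(region_feats))   # N x K
    relation = [softmax(row) for row in similarity]             # N x K
    # Step 4: object-contextual representations.
    context = matmul(relation, region_feats)                    # N x C
    # Step 5: concatenate along the channel axis (list concatenation).
    return [pixel + ctx for pixel, ctx in zip(pixel_feats, context)]  # N x 2C


pixels = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # 3 pixels, 2 channels
logits = [[2.0, 0.0, 1.0], [0.0, 2.0, 1.0]]     # 2 classes, 3 pixels
augmented = ocr_augment(pixels, logits)
print(len(augmented), len(augmented[0]))        # 3 pixels, 4 channels each
```

Each augmented row keeps the original pixel feature in its first C entries and appends a context vector of the same width, which is why the channel count doubles.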

HRNet-OCR (Hierarchical Multi-Scale Attention, arXiv 2020)

🔖 Github: https://github.com/HRNet/HRNet-Semantic-Segmentation/tree/HRNet-OCR

HRNet-OCR refers to HRNet + OCR + self-attention. It proposes a hierarchical multi-scale attention mechanism that lets the network learn for itself how best to combine features from different resolutions.

It also proposes a hard-threshold-based auto-labelling strategy that uses unlabelled images to improve mIoU. It reaches 85.1% mIoU on Cityscapes test and 61.1% mIoU on Mapillary val, which was the state of the art for semantic segmentation at the time of writing.

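The network architecture is shown in the figure below. The upper right shows the training pipeline, in which the network learns the relative weighting between two adjacent scales. Because the attention mechanism is hierarchical, training is about four times more memory efficient, which speeds up training and allows larger crops to be used for better accuracy. The lower right shows inference, where predictions from multiple scales are fused hierarchically.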
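A minimal sketch of the hierarchical fusion at inference time follows. It makes several assumptions: every prediction and attention map has already been resized to a common resolution and flattened to a list of per-pixel values, fusion starts from the highest scale, and the attention predicted at each lower scale decides how much of that scale to keep. The function names are hypothetical and this is not the released code.

```python
def fuse_pair(pred_low, attn_low, pred_high):
    """Blend two adjacent scales per pixel: attn * low + (1 - attn) * high."""
    return [a * lo + (1.0 - a) * hi
            for lo, a, hi in zip(pred_low, attn_low, pred_high)]


def hierarchical_fuse(preds_by_scale, attn_by_scale):
    """Fold the scales together one pair at a time, from the highest scale down."""
    scales = sorted(preds_by_scale, reverse=True)
    fused = preds_by_scale[scales[0]]
    for scale in scales[1:]:
        # The attention map of the lower scale gates the pair (assumed convention).
        fused = fuse_pair(preds_by_scale[scale], attn_by_scale[scale], fused)
    return fused


preds = {0.5: [0.9, 0.1], 1.0: [0.6, 0.4], 2.0: [0.2, 0.8]}
attn = {0.5: [1.0, 0.0], 1.0: [0.0, 1.0]}  # no attention map is needed for the top scale
print(hierarchical_fuse(preds, attn))
```

Because the network only ever learns a weighting between two adjacent scales, the same pairwise rule can be chained over any number of scales at inference without retraining.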

SETR (Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers, CVPR 2021)

🔖 Github: https://github.com/gupta-abhay/setr-pytorch

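SETR's distinguishing feature is that it treats semantic segmentation as a sequence-to-sequence problem. The architecture is an encoder-decoder: the encoder is built from Transformer layers, and three different decoder designs are compared. It reaches 50.28% mIoU on ADE20K and 55.83% mIoU on Pascal Context, and took first place on the ADE20K test server leaderboard at the time.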

Encoder

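A Transformer takes a sequence of 1D vectors as input, so the H x W x 3 input image is first split into a grid of H/16 x W/16 patches, each of size 16 x 16 x 3. Each patch is flattened into a 1D vector and passed through a linear projection to get its patch embedding, a position embedding is added, and the result is fed into the Transformer layers. As the figure below shows, the encoder consists of 24 Transformer layers, each containing Multi-head Attention, Layer Norm, and an MLP.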
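The patch-splitting step can be illustrated with a small plain-Python sketch. It only covers the split and the flattening, not the linear projection or the position embedding, and the function name `patchify` and the nested-list image layout are assumptions made for the example.

```python
def patchify(image, patch):
    """Split an H x W x C image (nested lists) into flattened patches.

    Returns a list of (H / patch) * (W / patch) vectors, in row-major patch
    order, each of length patch * patch * C.
    """
    height, width = len(image), len(image[0])
    assert height % patch == 0 and width % patch == 0, "image must tile evenly"
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            vector = []
            for y in range(top, top + patch):
                for x in range(left, left + patch):
                    vector.extend(image[y][x])  # append the C channel values
            tokens.append(vector)
    return tokens


# A 32 x 32 RGB image split with 16 x 16 patches gives a 2 x 2 grid of patches.
image = [[[0.0, 0.0, 0.0] for _ in range(32)] for _ in range(32)]
tokens = patchify(image, patch=16)
print(len(tokens), len(tokens[0]))  # sequence length, vector length per patch
```

With 16 x 16 patches the sequence length is H/16 x W/16, so the Transformer works on a sequence 256 times shorter than the number of pixels.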

Decoder

Three different decoders are used: Naive upsampling (Naive), Progressive UPsampling (PUP), and Multi-Level feature Aggregation (MLA).

Naive

The transformer feature is first projected to the number of classes through a simple two-layer network (1 × 1 conv + batch norm + 1 × 1 conv). The output is then bilinearly upsampled to the input resolution.

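PUP

One-step upscaling may introduce noisy predictions, so PUP upsamples progressively instead. As the figure below shows, the transformer feature is first reshaped, and then convolution layers alternate with upsampling operations. To mitigate the adversarial effect as much as possible, each upsampling step is limited to 2x.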
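The progressive schedule can be made concrete with a short calculation. The sketch below assumes the feature map starts at 1/16 of the input size (from the 16 x 16 patches) and that the input size is divisible by 16. The function name `pup_schedule` is hypothetical.

```python
def pup_schedule(height, width, patch=16, step=2):
    """List the feature-map sizes visited when upsampling by `step` each time."""
    size = (height // patch, width // patch)
    sizes = [size]
    while size[0] < height:
        size = (size[0] * step, size[1] * step)
        sizes.append(size)
    return sizes


sizes = pup_schedule(512, 512)
print(sizes)           # from 1/16 of the input up to the full input size
print(len(sizes) - 1)  # number of 2x upsampling steps
```

Since 16 = 2^4, limiting each step to 2x means four upsampling steps are needed to get from H/16 x W/16 back to H x W.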

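MLA

Features are taken from M Transformer layers spaced evenly through the encoder, and each is reshaped into a 3D feature map. Each stream then goes through three layers (1 × 1 conv + 3 × 3 conv + 3 × 3 conv) followed by 4x bilinear upsampling. The streams are aggregated top-down and concatenated, a 3 × 3 conv fuses the features, and a final 4x bilinear upsampling restores the input resolution.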
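The bookkeeping for MLA can be sketched as follows. It assumes the M streams are read from layers spaced evenly through the encoder, which is how the SETR paper describes the selection, and that the feature map starts at 1/16 of the input size. Both function names are hypothetical.

```python
def mla_layer_indices(num_layers, num_streams):
    """1-based indices of the encoder layers feeding the streams (even spacing assumed)."""
    step = num_layers // num_streams
    return [step * (i + 1) for i in range(num_streams)]


def mla_stream_sizes(height, width, patch=16):
    """Spatial size at the start, after the per-stream 4x upsample, and after the final 4x."""
    start = (height // patch, width // patch)
    after_stream = (start[0] * 4, start[1] * 4)
    final = (after_stream[0] * 4, after_stream[1] * 4)
    return [start, after_stream, final]


print(mla_layer_indices(num_layers=24, num_streams=4))
print(mla_stream_sizes(512, 512))
```

The two 4x upsampling stages multiply to 16x, which matches the 1/16 resolution produced by the 16 x 16 patches.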

A collection of Semantic Segmentation papers