FusionCount
FUSIONCOUNT: EFFICIENT CROWD COUNTING VIA MULTISCALE FEATURE FUSION
Intro
Before
- Features extracted at earlier stages during encoding are under-utilized
- The multi-scale modules can only capture a limited range of receptive fields
- High computational cost
FusionCount
- exploits adaptive fusion of encoded features to obtain multi-scale features
- covers a larger range of receptive field sizes
- lower computational cost
- uses a channel-reduction block to extract contextual information
How?
The FusionCount architecture consists of three main parts: an encoder, feature fusion, and a decoder
Encoder:
- VGG-16 without the last max-pooling and the fully connected layers is used
- The first two feature maps (produced before the first max-pooling) are considered uninformative and therefore are not preserved
- Rest are preserved and fused
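The preserved maps can be enumerated from the standard VGG-16 layout. This is only a shape-level sketch (not the authors' code): it assumes the encoder keeps the outputs of every remaining conv and pooling layer, which matches the 15 maps (x1 ... x15) used in the fusion step below.

```python
# Truncated VGG-16 layout: 13 conv layers and the first 4 max-pools
# (last max-pool and FC layers removed, per the notes above).
VGG16_LAYERS = [
    ("conv", 64), ("conv", 64), ("pool", None),            # maps 1-2 discarded
    ("conv", 128), ("conv", 128), ("pool", None),
    ("conv", 256), ("conv", 256), ("conv", 256), ("pool", None),
    ("conv", 512), ("conv", 512), ("conv", 512), ("pool", None),
    ("conv", 512), ("conv", 512), ("conv", 512),           # last pool removed
]

def preserved_maps(h, w):
    """Return (channels, height, width) of every preserved feature map."""
    maps, ch, seen = [], 3, 0
    for kind, out_ch in VGG16_LAYERS:
        if kind == "conv":
            ch = out_ch
        else:                       # 2x2 max-pool halves the resolution
            h, w = h // 2, w // 2
        seen += 1
        if seen > 2:                # first two conv maps are not preserved
            maps.append((ch, h, w))
    return maps

maps = preserved_maps(256, 256)
print(len(maps))   # 15 preserved maps, grouped by resolution as 3 + 4 + 4 + 4
```

Grouping these 15 maps by spatial size reproduces the four groups fused below: 3 maps at 1/2 resolution, then 4 each at 1/4, 1/8, and 1/16.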
Feature Fusion:
- Each group of feature maps is fused by the fusion block (shown above)
- Feature maps within the same group have the same width and height
fusion_block(x1, x2, x3) = f1
fusion_block(x4, x5, x6, x7) = f2
fusion_block(x8, x9, x10, x11) = f3
fusion_block(x12, x13, x14, x15) = f4
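A minimal shape-level sketch of one such fusion. The paper's fusion block is more elaborate (it extracts contextual information at varied receptive fields); here the learned 1x1 convolutions are stood in for by random channel projections, and the combination step is a plain element-wise sum, so only the tensor shapes are faithful.

```python
import numpy as np

def channel_reduce(x, out_ch, rng):
    """Stand-in for a learned 1x1 conv: random projection over channels.
    x: (C, H, W) -> (out_ch, H, W)."""
    w = rng.standard_normal((out_ch, x.shape[0]))
    return np.einsum("oc,chw->ohw", w, x)

def fusion_block(maps, out_ch=64, rng=None):
    """Fuse same-resolution maps: reduce each to out_ch channels, then sum.
    (The real block also varies receptive fields; omitted here.)"""
    if rng is None:
        rng = np.random.default_rng(0)
    reduced = [channel_reduce(m, out_ch, rng) for m in maps]
    return sum(reduced)  # element-wise sum: one (out_ch, H, W) fused map

# Group 1 from the notes: pool output (64 ch) + two 128-ch conv maps,
# all at 1/2 resolution of a hypothetical 128x128 input.
rng = np.random.default_rng(0)
group1 = [rng.standard_normal((c, 64, 64)) for c in (64, 128, 128)]
f1 = fusion_block(group1)
print(f1.shape)   # (64, 64, 64)
```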
Decoder:
- Fused feature maps are once more fused in the reverse order
- Channel reduction and upsample
- The final fused feature to generate the estimation
fused(upsampled(channel_reduction(f4)), f3) = f3
fused(upsampled(channel_reduction(f3)), f2) = f2
fused(upsampled(channel_reduction(f2)), f1) = f1
1x1conv(f1) = result
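The three decoding steps above can be sketched at the shape level. This is an assumption-laden illustration, not the paper's implementation: channel reduction and the final 1x1 conv are random projections, upsampling is nearest-neighbour, and fusion is an element-wise sum; the fused maps f1..f4 are hypothetical 64-channel tensors at 1/2, 1/4, 1/8, and 1/16 resolution.

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour 2x upsampling of a (C, H, W) map."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def channel_reduce(x, out_ch, rng):
    """Stand-in for a learned 1x1 conv (random channel projection)."""
    w = rng.standard_normal((out_ch, x.shape[0]))
    return np.einsum("oc,chw->ohw", w, x)

def fuse(a, b):
    """Placeholder fusion: element-wise sum of two equal-shape maps."""
    return a + b

rng = np.random.default_rng(0)
f1, f2, f3, f4 = (rng.standard_normal((64, s, s)) for s in (64, 32, 16, 8))

# Decode in reverse order: reduce channels, upsample, fuse with next map.
f3 = fuse(upsample2x(channel_reduce(f4, 64, rng)), f3)
f2 = fuse(upsample2x(channel_reduce(f3, 64, rng)), f2)
f1 = fuse(upsample2x(channel_reduce(f2, 64, rng)), f1)

# Final 1x1 conv maps the fused features to a 1-channel density map;
# the crowd count estimate is the integral (sum) of that map.
density = np.einsum("oc,chw->ohw", rng.standard_normal((1, 64)), f1)
count = density.sum()
print(density.shape)   # (1, 64, 64)
```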
Why?
FusionCount's MAE and MSE outperform those of other models on the ShanghaiTech Part B dataset
Conclusion
FusionCount has not shown results strong enough to choose it over other models, but its approach of obtaining multi-scale features through adaptive fusion is interesting. Further improvements in extracting contextual information from an image, as the paper suggests, could help the model better handle scale changes and thereby improve its overall results.