FusionCount

FUSIONCOUNT: EFFICIENT CROWD COUNTING VIA MULTISCALE FEATURE FUSION

standfsk · 2 min read · Nov 20, 2023

Intro

Before

Earlier crowd-counting models suffer from three problems:

  • Features extracted at earlier encoding stages are under-utilized
  • Multi-scale modules can only capture a limited range of receptive fields
  • High computational cost

FusionCount

FusionCount addresses these by:

  • exploiting adaptive fusion of the encoded features to obtain multi-scale features
  • covering a larger range of receptive field sizes
  • lowering computational cost
  • using channel reduction blocks to extract contextual information

How?

The FusionCount architecture consists of three main parts: an encoder, feature fusion, and a decoder.

Encoder:

  • VGG-16 without its last max-pooling and fully connected layers is used as the backbone
  • The first two feature maps, produced before the first max-pooling, are considered uninformative and are not preserved
  • The remaining feature maps are preserved and fused
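As a sketch, the truncated backbone can be written in PyTorch by building the standard VGG-16 convolutional configuration without the final max-pooling, then collecting every post-ReLU feature map that comes after the first max-pooling (the tapping points are my reading of the description above, not code from the paper):

```python
import torch
import torch.nn as nn

# VGG-16 convolutional configuration; 'M' marks a max-pooling layer.
# The final 'M' and the fully connected classifier are omitted.
CFG = [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M',
       512, 512, 512, 'M', 512, 512, 512]

class VGGEncoder(nn.Module):
    """Truncated VGG-16 that returns all feature maps produced
    after the first max-pooling (the first two are skipped)."""
    def __init__(self):
        super().__init__()
        layers, in_ch = [], 3
        for v in CFG:
            if v == 'M':
                layers.append(nn.MaxPool2d(2, 2))
            else:
                layers += [nn.Conv2d(in_ch, v, 3, padding=1), nn.ReLU(inplace=True)]
                in_ch = v
        self.features = nn.Sequential(*layers)

    def forward(self, x):
        maps, seen_pool = [], False
        for layer in self.features:
            x = layer(x)
            if isinstance(layer, nn.MaxPool2d):
                seen_pool = True
            elif seen_pool and isinstance(layer, nn.ReLU):
                maps.append(x)  # preserve maps from the first pool onward
        return maps
```

With this split, an input passed through the encoder yields eleven preserved feature maps across four resolution stages.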

Feature Fusion:

  • Groups of consecutive feature maps are fused by the fusion block (shown above)
  • Feature maps within the same group share the same width and height
fusion_block(x1, x2, x3) = f1
fusion_block(x4, x5, x6, x7) = f2
fusion_block(x8, x9, x10, x11) = f3
fusion_block(x12, x13, x14, x15) = f4
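A minimal stand-in for the fusion block could look like the following. This is a simplification I am assuming for illustration: each map in a group is projected to a common channel count with a 1x1 convolution and the projections are summed; the paper's actual block is more elaborate, but the grouping and data flow are the same.

```python
import torch
import torch.nn as nn

class FusionBlock(nn.Module):
    """Simplified fusion block: project each map in a group to a common
    channel count with a 1x1 convolution, then sum the projections.
    All maps in a group share the same height and width."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.proj = nn.ModuleList(
            nn.Conv2d(c, out_channels, kernel_size=1) for c in in_channels)

    def forward(self, maps):
        # Element-wise sum of the channel-aligned projections.
        return sum(p(m) for p, m in zip(self.proj, maps))

# e.g. f1 = FusionBlock([128, 128, 128], 128)([x1, x2, x3])
```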

Decoder:

  • The fused feature maps are fused once more, in reverse order
  • Each step applies channel reduction and upsampling
  • The final fused feature map generates the density estimate
fused(upsampled(channel_reduction(f4)), f3) = f3
fused(upsampled(channel_reduction(f3)), f2) = f2
fused(upsampled(channel_reduction(f2)), f1) = f1
1x1conv(f1) = result
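The decoding steps above can be sketched as a single loop. Here a plain element-wise addition stands in for the fusion operation, and the channel counts are illustrative rather than the paper's:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Decoder(nn.Module):
    """Sketch of the decoder: reduce channels, upsample by 2, fuse with
    the next higher-resolution map (addition used as a stand-in for the
    fusion block), and finish with a 1x1 conv to a density map."""
    def __init__(self, channels):  # channels = [c1, c2, c3, c4], high -> low resolution
        super().__init__()
        self.reduce = nn.ModuleList(
            nn.Conv2d(channels[i], channels[i - 1], kernel_size=1)
            for i in range(1, len(channels)))
        self.head = nn.Conv2d(channels[0], 1, kernel_size=1)

    def forward(self, feats):  # feats = [f1, f2, f3, f4]
        x = feats[-1]
        for i in range(len(feats) - 2, -1, -1):
            x = self.reduce[i](x)                 # channel reduction
            x = F.interpolate(x, scale_factor=2)  # upsample to next resolution
            x = x + feats[i]                      # fuse in reverse order
        return self.head(x)                       # 1x1 conv -> estimation
```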

Why?

FusionCount’s MAE and MSE outperform those of other models on the ShanghaiTech Part B dataset.

Conclusion

FusionCount has not shown results strong enough to choose it over other models. However, its approach of obtaining multi-scale features through adaptive fusion is interesting. As the paper suggests, further improvements in extracting contextual information from an image could help the model better handle scale changes and thereby improve its overall results.
