Efficient Convolutional Network Learning — ScatterNet

Hanoch kremer
5 min read · Apr 6, 2020


When I led a DNN project for biometric ID with little data (~20K examples) from a poor-quality modality, I was looking for ways to incorporate prior knowledge into the CNN in order to ease the data-scale requirements for training. That is when I found the ScatterNet idea, which is a nice trick: it uses a fixed and known structure of a wavelet-based complex tree borrowed from the image-processing domain.

ScatterNets incorporate geometric knowledge of images to produce discriminative and invariant (translation and rotation) features, i.e. edge information, the same outcome the first layers of a CNN produce. So why not replace those first layer(s) with an equivalent, fixed structure and let the optimizer find the best weights for the CNN with its front end removed?

Although it looks promising, the weight savings from replacing the first layers with a fixed structure are not too dramatic, as one can figure out.

The main motivations for the idea of replacing the first convolutional, ReLU and pooling layers of the CNN with a two-layer parametric log-based Dual-Tree Complex Wavelet Transform (DTCWT), covered by a few papers, were:

  • Despite the success of CNNs, the design and optimal configuration of these networks is not well understood, which makes it difficult to develop them
  • It improves the training of the network, as the later layers can learn more complex patterns from the start of learning because the edge representations are already present
  • The network converges faster, as it has fewer filter weights to learn
  • My takeaway: a slight reduction in the amount of data necessary for training!

The proposed approach, termed AS-1 in the paper: (log-based) DTCWT + CNN = DTSCNN (ScatterNet CNN)

Let's go back to DSP basics:

Decimated filter bank structure: instead of creating many passbands for processing the signal in the frequency domain (a filter bank), processing in sub-bands is more efficient. Splitting into bands makes decimation viable, which reduces the sampling rate and hence eases the processing complexity.

g, h: the time-domain impulse responses of the LPF and HPF, respectively

The filtering and reconstruction pairs (left and right parts) create a unit transfer function, that is to say Y = X (the sub-band signals γ, λ keep the same duration)

Filter bank decomposition and synthesis/reconstruction pairs.
The frequency responses of the LPF/HPF sum to a constant, so the decomposition is perfect.
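
To make the Y = X claim concrete, here is a minimal sketch of a two-channel decimated filter bank in Python using PyWavelets; the Daubechies-2 pair stands in for the generic g/h filters (my choice for illustration, not from the paper):

```python
import numpy as np
import pywt

x = np.random.randn(256)           # input signal X

# Analysis: low-pass (g) and high-pass (h) filtering, each decimated by 2,
# so the two sub-bands together keep roughly the original sample count.
cA, cD = pywt.dwt(x, 'db2')        # approximation / detail coefficients

# Synthesis: upsample, filter with the reconstruction pair, and sum.
y = pywt.idwt(cA, cD, 'db2')

print(np.allclose(x, y))           # True: unit transfer function, Y = X
```

One decomposition level halves the sampling rate of each sub-band, which is exactly the complexity saving the decimated structure buys.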

Taking this idea further, with g and h being wavelet filters brings a few advantages for several image-processing tasks (too little space here to drill down):

  • Non-redundant orthonormal bases
  • Perfect reconstruction multiresolution decomposition
  • Attractive for object matching
  • Fast O(n) algorithms with short filters
  • Compact support (a localized waveform, and thus stable to deformation)

The Dual-Tree Complex Wavelet Transform (DTCWT) calculates the complex transform of a signal using two separate DWT decompositions, as demonstrated by the 1-D decimated filter bank above. If the filters used in one tree (tree a) are specifically designed differently from those in the other (tree b), it is possible for one DWT to produce the real coefficients and the other the imaginary ones, hence a complex response!
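
As a quick illustration, here is how the 2-D DTCWT looks with the open-source dtcwt Python package (my choice; the paper does not prescribe an implementation). The two trees are handled internally and the coefficients come out complex, with six oriented sub-bands per scale:

```python
import numpy as np
import dtcwt

image = np.random.randn(64, 64)            # stand-in for an input image x

transform = dtcwt.Transform2d()
pyramid = transform.forward(image, nlevels=2)

# One DWT tree contributes the real part, the other the imaginary part.
# Each scale j holds 6 oriented sub-bands (15, 45, 75, 105, 135, 165 deg).
for j, hp in enumerate(pyramid.highpasses, start=1):
    print(f"scale j={j}: shape={hp.shape}, dtype={hp.dtype}")
# scale j=1: shape=(32, 32, 6), dtype=complex128
# scale j=2: shape=(16, 16, 6), dtype=complex128
```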

We know that CNN-based classification is translation invariant (via pooling) and translation equivariant (via convolutions). However, the wavelet transform commutes with translations and is therefore not translation invariant! To be continued...

The invariant features are obtained at the first layer by filtering the input signal x with dual-tree complex wavelets ψ at different scales (j) and six pre-defined orientations (r), fixed at 15°, 45°, 75°, 105°, 135° and 165°. To build a translation-invariant representation, it is necessary to introduce a non-linearity; a point-wise L2 non-linearity (complex modulus) is applied to the real and imaginary parts of the filtered signal:

U[λ_{m=1}] = |x ★ ψ_λ| = √( (x ★ ψ_λ^a)² + (x ★ ψ_λ^b)² )

  • ★: the convolution operator
  • x: the input image to the CNN
  • ψ_λ^a, ψ_λ^b: the real and imaginary wavelet filters of trees a and b
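
A quick numerical sanity check of this point (my own sketch, not from the paper): shift an image by one pixel and compare how much the raw complex coefficients move versus their modulus. The modulus should be far more stable:

```python
import numpy as np
import dtcwt

rng = np.random.default_rng(0)
x = rng.standard_normal((64, 64))
x_shifted = np.roll(x, shift=1, axis=1)     # translate by one pixel

t = dtcwt.Transform2d()
h = t.forward(x, nlevels=2).highpasses[0]
h_s = t.forward(x_shifted, nlevels=2).highpasses[0]

# Relative change of the raw coefficients vs. their complex modulus
raw_change = np.linalg.norm(h - h_s) / np.linalg.norm(h)
mod_change = np.linalg.norm(np.abs(h) - np.abs(h_s)) / np.linalg.norm(np.abs(h))
print(raw_change, mod_change)               # the modulus changes much less
```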

The invariant information (U[λ_{m=1}]) obtained for each of the R, G and B channels of an image is combined into a single invariant feature by taking their L2 norm. A log transformation is applied with parameter k1 = 1.1 for scale j = 1. The representations at all the layers (m = 0: 3 maps, m = 1: 12, m = 2: 36) are concatenated to produce 51 × 2 (two resolutions) = 102 image representations that are given as input to the mid and back layers of the CNN.

The parametric log transformation layer is then applied to all the oriented representations extracted at the first scale j = 1, with a parameter k_{j=1}, to reduce the effect of outliers:

U1[λ_{m=1}] = log(U[λ_{m=1}] + k_{j=1})

The resulting representations are then further averaged (smoothed).
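
Putting the pieces together, here is a sketch of the whole hand-crafted front end under my reading of the paper (DTCWT, complex modulus, RGB L2 combination, parametric log at j = 1 with k1 = 1.1, local averaging). The function names are mine, and the second-order (m = 2) paths and the second image resolution are omitted for brevity:

```python
import numpy as np
import dtcwt

def avg_pool2(a):
    """2x2 local averaging, standing in for the paper's low-pass smoothing."""
    h, w = a.shape[0] // 2 * 2, a.shape[1] // 2 * 2
    return a[:h, :w].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def scatter_frontend(rgb, k1=1.1):
    """rgb: (H, W, 3) image -> list of invariant feature maps (m = 0, 1)."""
    t = dtcwt.Transform2d()
    # m = 0: the three input channels themselves (3 maps)
    feats = [avg_pool2(rgb[:, :, c]) for c in range(3)]
    per_channel = [t.forward(rgb[:, :, c], nlevels=2).highpasses
                   for c in range(3)]
    for j in range(2):                       # scales j = 1, 2
        # L2 norm over R, G, B combines the three channels into one
        mags = np.sqrt(sum(np.abs(pc[j]) ** 2 for pc in per_channel))
        if j == 0:                           # parametric log at j = 1 only
            mags = np.log(mags + k1)
        for r in range(6):                   # 6 orientations -> m = 1 maps
            feats.append(avg_pool2(mags[:, :, r]))
    return feats                             # 3 + 12 = 15 maps here

feats = scatter_frontend(np.random.rand(64, 64, 3))
print(len(feats))  # 15; with m = 2 and a second resolution this grows to 102
```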

Finally, the conclusions and my takeaways:

On CIFAR-10 and Caltech-101, with 14 self-made CNNs of increasing depth, plus VGG, NIN and WideResNet:

  • When doing transfer learning (ImageNet): DTSCNN outperformed all its CNN architecture counterparts by a "useful margin" when fine-tuning with only 1,000 examples (balanced over classes), while on larger datasets the gap decreases, ending on par. However, when freezing the first layers of VGG and NIN, as in DTSCNN, the NIN results are on par, while VGG outperforms!
  • DTSCNN learns at a faster rate but reaches the same target with only a minor speedup (a few minutes)
  • A complexity analysis in terms of weights and operations is missing
  • Datasets: CIFAR-10 & Caltech-101 are a good starting point (a further step with a substantial dataset like COCO would be a plus). For other modalities/domains, please try it and let me know
  • Great work, but an ablation study is missing, such as comparing fully trained WideResNet+DTCWT vs. WideResNet
  • 14 citations so far (Cambridge): probably low value for money at the moment
  • I’d give it a try on a medical dataset


Hanoch kremer

Applied researcher in computer vision, deep/machine learning and ASR, challenged by new and disruptive domains where a large space for innovation exists.