Deeplearning.ai: CNN week 1 — Convolutional Neural Network terminology

From edge filtering to convolutional filters

Published in datatype · 4 min read · Jan 31, 2018

Applications: e.g., neural style transfer.

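Vertical edge detection: filter a 6x6 grayscale image with a 3x3 filter/kernel (e.g., a Prewitt or Sobel filter) -> 4x4 image. The Roberts filter is another classic edge detector, but it uses 2x2 kernels. To build intuition, take an image with clear edges and visualize the image, the Prewitt filter, and the filtered output.
Vertical vs. horizontal edge detection: the horizontal filter is the transpose of the vertical one.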
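The vertical edge example above can be sketched in a few lines of NumPy. This is a minimal illustration, not the course's code: the helper name `cross_correlate2d` and the toy image (bright left half, dark right half) are assumptions chosen for the demo.

```python
import numpy as np

def cross_correlate2d(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and sum the elementwise products."""
    n_h, n_w = image.shape
    f_h, f_w = kernel.shape
    out = np.zeros((n_h - f_h + 1, n_w - f_w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + f_h, j:j + f_w] * kernel)
    return out

# Assumed toy input: 6x6 grayscale image, bright (10) on the left, dark (0) on the right.
image = np.array([[10, 10, 10, 0, 0, 0]] * 6, dtype=float)

# Prewitt filter for vertical edges; its transpose responds to horizontal edges.
prewitt_vertical = np.array([[1, 0, -1],
                             [1, 0, -1],
                             [1, 0, -1]], dtype=float)

vertical_edges = cross_correlate2d(image, prewitt_vertical)
horizontal_edges = cross_correlate2d(image, prewitt_vertical.T)

print(vertical_edges.shape)   # (4, 4)
print(vertical_edges[0])      # [ 0. 30. 30.  0.]
print(horizontal_edges[0])    # [0. 0. 0. 0.]
```

The vertical filter lights up in the two middle columns, where the bright-to-dark transition sits, while the horizontal filter finds nothing because this image has no horizontal edge.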

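Prewitt filter → Sobel filter: puts more weight on the center row (or column), which makes it a bit more robust.
Why not learn the filter automatically? Instead of a hard-coded filter, treat the filter entries as parameters learned from data. A data-dependent filter can also pick up edges at other orientations, not only vertical or horizontal ones.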

The output size depends on how many positions the [f x f] filter “fits” inside the [n x n] image, so the output is [(n-f+1), (n-f+1)]. “f” is usually odd.
The “edge” pixels are covered by fewer filter positions than the pixels near the center. Information at the border is underused, and the output shrinks quickly from layer to layer.
Solution: pad the border with “p” pixels to enlarge the input image -> (n+2p-f+1) x (n+2p-f+1)
“Valid” convolution: no padding. “Same” convolution: pad so that output size == input size => p = (f-1)/2 (for stride 1)
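To make the border effect concrete, the sketch below counts how many filter positions touch each pixel, with and without padding. The function name `coverage_counts` is hypothetical, and the 6x6 image with a 3x3 filter is just an assumed example size.

```python
import numpy as np

def coverage_counts(n, f, p=0):
    """For an n x n image padded by p, count how many f x f filter positions (stride 1) cover each pixel.

    Returns the counts for the original (unpadded) pixels and the output side length.
    """
    size = n + 2 * p
    out = size - f + 1
    counts = np.zeros((size, size), dtype=int)
    for i in range(out):
        for j in range(out):
            counts[i:i + f, j:j + f] += 1
    # Drop the padding ring so the counts line up with the original image.
    return counts[p:size - p, p:size - p], out

f = 3
valid_counts, valid_out = coverage_counts(6, f)                   # "valid": p = 0
same_counts, same_out = coverage_counts(6, f, p=(f - 1) // 2)     # "same": p = (f-1)/2

print(valid_out, valid_counts[0, 0], valid_counts[2, 2])   # 4 1 9
print(same_out, same_counts[0, 0], same_counts[2, 2])      # 6 4 9
```

Without padding a corner pixel is used once while a central pixel is used nine times. With p = 1 the corner is used four times and the output keeps the 6x6 input size.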

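Stride: instead of moving the filter 1 pixel at a time, jump 2 or more pixels per step => smaller output image.
Jumping “s” pixels => output [floor((n+2p-f)/s) + 1 ; floor((n+2p-f)/s) + 1]. The floor means that positions where the filter would hang over the border are skipped.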
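The output size formula with padding and stride is easy to check numerically. In this small sketch the function name `conv_output_size` is an assumption, as are the example sizes.

```python
def conv_output_size(n, f, p=0, s=1):
    """Side length of the output: floor((n + 2p - f) / s) + 1."""
    return (n + 2 * p - f) // s + 1

def count_positions(n, f, p=0, s=1):
    """Same quantity, obtained by enumerating the valid top-left filter positions."""
    return len(range(0, n + 2 * p - f + 1, s))

print(conv_output_size(6, 3))        # 4  ("valid", stride 1)
print(conv_output_size(6, 3, p=1))   # 6  ("same", stride 1)
print(conv_output_size(7, 3, s=2))   # 3  (stride 2)
```

The second helper is only there as a cross-check: counting the positions one by one agrees with the closed-form formula.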
The “convolution” operator in math/signal processing textbooks flips the filter first, then multiplies and sums. The “convolution” of deep learning skips the flip, so technically it is “cross-correlation”.
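The difference between the two operators can be shown directly. In this sketch the helper names `cross_correlate2d` and `convolve2d_textbook` are assumptions, and the textbook version is simply implemented as "flip the kernel along both axes, then cross-correlate".

```python
import numpy as np

def cross_correlate2d(image, kernel):
    """What deep learning calls convolution: slide, multiply, sum (no flip)."""
    f_h, f_w = kernel.shape
    out = np.zeros((image.shape[0] - f_h + 1, image.shape[1] - f_w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + f_h, j:j + f_w] * kernel)
    return out

def convolve2d_textbook(image, kernel):
    """Signal-processing convolution: flip the kernel on both axes, then slide, multiply, sum."""
    return cross_correlate2d(image, np.flip(kernel))

image = np.array([[10, 10, 10, 0, 0, 0]] * 6, dtype=float)
prewitt_vertical = np.array([[1, 0, -1]] * 3, dtype=float)

corr = cross_correlate2d(image, prewitt_vertical)
conv = convolve2d_textbook(image, prewitt_vertical)

print(corr[0])   # [ 0. 30. 30.  0.]
print(conv[0])   # same magnitudes with the sign flipped
```

For this antisymmetric filter the flip only changes the sign of the response. Since the filter is learned anyway, the network can absorb the flip into the weights, which is why skipping it is harmless.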

RGB images have 3 channels (Nc = 3), i.e., 3 stacked matrices, so the filter also needs 3 channels: an f x f x 3 volume with one f x f slice per image channel.
The mechanism: at each position, multiply each filter slice with the matching channel of the image patch, then add everything together to get the final number (1 number).

Multiple filters: apply several different filters and stack up the results along the channel axis. Each filter detects a different feature.
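Here is a minimal sketch of convolution over a volume followed by stacking the outputs of two filters. The function name `conv_volume`, the random 6x6x3 input, and the choice of a vertical and a horizontal edge filter copied into every channel are all assumptions for illustration.

```python
import numpy as np

def conv_volume(image, filt):
    """image: (n, n, n_c), filt: (f, f, n_c) -> (n-f+1, n-f+1); one number per position."""
    n_h, n_w, n_c = image.shape
    f_h, f_w, f_c = filt.shape
    assert n_c == f_c, "filter must have as many channels as the image"
    out = np.zeros((n_h - f_h + 1, n_w - f_w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Multiply all channels at once and add everything into a single number.
            out[i, j] = np.sum(image[i:i + f_h, j:j + f_w, :] * filt)
    return out

rng = np.random.default_rng(0)
image = rng.random((6, 6, 3))                      # assumed RGB input

vertical = np.array([[1, 0, -1]] * 3, dtype=float)
horizontal = vertical.T
filters = [np.stack([vertical] * 3, axis=-1),      # 3x3x3 vertical edge filter
           np.stack([horizontal] * 3, axis=-1)]    # 3x3x3 horizontal edge filter

# One 4x4 map per filter, stacked along a new channel axis.
output = np.stack([conv_volume(image, f) for f in filters], axis=-1)
print(output.shape)   # (4, 4, 2)
```

The number of output channels equals the number of filters, not the number of input channels.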

Input -> filtering -> output_tmp -> add bias -> ReLU -> final output
The parameters of a layer are the entries of its filter volumes, plus one bias per filter: (f x f x Nc + 1) x (number of filters). This does NOT depend on the input size, unlike the number of weights in a fully connected layer. Fewer parameters => less overfitting, easier to learn, more robust.
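A quick count shows how large the gap is. The function names are hypothetical, and the layer sizes (ten 3x3x3 filters, and a 32x32x3 input mapped to a 28x28x6 output) are assumed examples rather than values stated in these notes.

```python
def conv_layer_params(f, n_c_in, n_filters):
    """Each filter has f*f*n_c_in weights plus one bias."""
    return (f * f * n_c_in + 1) * n_filters

def dense_layer_params(n_in, n_out):
    """Fully connected: every output unit has one weight per input plus one bias."""
    return (n_in + 1) * n_out

# Ten 3x3 filters on an RGB input: the count does not depend on the image size.
print(conv_layer_params(3, 3, 10))                    # 280

# Assumed example: 32x32x3 input -> 28x28x6 output.
print(conv_layer_params(5, 3, 6))                     # 456
print(dense_layer_params(32 * 32 * 3, 28 * 28 * 6))   # 14455392
```

The same input-to-output mapping costs a few hundred parameters as a convolution and about 14 million as a fully connected layer.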

One convolution layer: input image (3 channels) -> filtering -> output of size (set by padding, stride, and filter size) x (number of filters).
Input -> several convolution layers -> flatten -> fully connected layer -> softmax -> outputs.
Types of layer in a convolutional network: convolution (conv), pooling (pool), fully connected (fc).
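Putting the pieces together, a single convolution layer (filtering, bias, ReLU, with padding and stride) might look like the sketch below. It is a deliberately slow, loop-based illustration. The function name `conv_layer_forward`, the weight layout `(f, f, n_c, n_filters)`, and the random inputs are assumptions, not the course's reference implementation.

```python
import numpy as np

def conv_layer_forward(x, weights, biases, stride=1, pad=0):
    """x: (n_h, n_w, n_c); weights: (f, f, n_c, n_filters); biases: (n_filters,).

    Returns ReLU(filtering + bias) with shape (out_h, out_w, n_filters).
    """
    f, _, n_c, n_filters = weights.shape
    assert x.shape[2] == n_c, "filter depth must match the input channels"
    # Zero-pad height and width only, never the channel axis.
    x_padded = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))
    out_h = (x_padded.shape[0] - f) // stride + 1
    out_w = (x_padded.shape[1] - f) // stride + 1
    z = np.zeros((out_h, out_w, n_filters))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            patch = x_padded[r:r + f, c:c + f, :]
            for k in range(n_filters):
                z[i, j, k] = np.sum(patch * weights[:, :, :, k]) + biases[k]
    return np.maximum(z, 0.0)   # ReLU

rng = np.random.default_rng(0)
x = rng.standard_normal((7, 7, 3))
weights = rng.standard_normal((3, 3, 3, 4))
biases = np.zeros(4)

print(conv_layer_forward(x, weights, biases, stride=2, pad=0).shape)   # (3, 3, 4)
print(conv_layer_forward(x, weights, biases, stride=1, pad=1).shape)   # (7, 7, 4)
```

Real frameworks vectorize this heavily, but the shapes follow the same rule: spatial size from padding, stride, and filter size, depth from the number of filters.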

Max pooling: split the image into regions -> pick the max value of each region.
Hyperparameters: stride (s) and filter size (f). They are fixed in advance, so pooling has nothing to learn.
The max operator preserves the key features: a strongly detected feature survives in the output, and a value that disappears from the output was not a “key” feature in its region.
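Max pooling is short enough to write out in full. The function name `max_pool2d` and the 4x4 example values are assumptions for illustration, using f = 2 and s = 2.

```python
import numpy as np

def max_pool2d(x, f=2, s=2):
    """Take the maximum of each f x f region, moving s pixels at a time. No learned parameters."""
    out_h = (x.shape[0] - f) // s + 1
    out_w = (x.shape[1] - f) // s + 1
    out = np.zeros((out_h, out_w), dtype=x.dtype)
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = x[i * s:i * s + f, j * s:j * s + f].max()
    return out

x = np.array([[1, 3, 2, 1],
              [2, 9, 1, 1],
              [1, 3, 2, 3],
              [5, 6, 1, 2]])

pooled = max_pool2d(x, f=2, s=2)
print(pooled)
# [[9 2]
#  [6 3]]
```

The strong activation 9 survives pooling, while the weaker values around it are dropped.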

Andrew Ng's convention: conv + pool = one layer, since the pooling step has no weights of its own.
LeNet-5: Input -> layer 1 (conv + pool) -> layer 2 (conv + pool) -> fully connected layer 3 -> fully connected layer 4 -> softmax -> output

Activation size decreases quickly with depth, while the number of parameters per layer increases (the conv layers still have far fewer parameters than the fully connected ones).
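The trend can be tabulated for a LeNet-5-style network. All sizes below are assumptions, since the notes do not list them: a 32x32x3 input, 5x5 conv filters (6, then 16), 2x2 max pooling with stride 2, fully connected layers of 120 and 84 units, and a 10-way softmax. The function name `summarize` is hypothetical.

```python
# (name, kind, hyperparameters); every size here is an assumption, see the text above.
spec = [
    ("CONV1", "conv", {"f": 5, "s": 1, "filters": 6}),
    ("POOL1", "pool", {"f": 2, "s": 2}),
    ("CONV2", "conv", {"f": 5, "s": 1, "filters": 16}),
    ("POOL2", "pool", {"f": 2, "s": 2}),
    ("FC3", "fc", {"units": 120}),
    ("FC4", "fc", {"units": 84}),
    ("SOFTMAX", "fc", {"units": 10}),
]

def summarize(input_shape, spec):
    """Return (name, activation size, parameter count) per layer, with no padding anywhere."""
    h, w, c = input_shape
    flat = None   # number of units once the volume has been flattened
    rows = []
    for name, kind, hp in spec:
        if kind == "conv":
            params = (hp["f"] * hp["f"] * c + 1) * hp["filters"]
            h = (h - hp["f"]) // hp["s"] + 1
            w = (w - hp["f"]) // hp["s"] + 1
            c = hp["filters"]
            size = h * w * c
        elif kind == "pool":
            params = 0   # pooling has only hyperparameters
            h = (h - hp["f"]) // hp["s"] + 1
            w = (w - hp["f"]) // hp["s"] + 1
            size = h * w * c
        else:
            n_in = flat if flat is not None else h * w * c
            params = (n_in + 1) * hp["units"]
            flat = hp["units"]
            size = flat
        rows.append((name, size, params))
    return rows

rows = summarize((32, 32, 3), spec)
for name, size, params in rows:
    print(f"{name:8s} activations={size:5d} params={params:6d}")
```

Under these assumptions the activation size falls from 4704 after CONV1 to 400 after POOL2, and the first fully connected layer alone holds 48120 parameters, far more than both conv layers combined (456 + 2416).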

Parameter sharing and sparsity of connections
Parameter sharing: the same filter volume is applied at every position of the input, so a data-dependent feature detector learned on one part of the image is reused on all other parts. A feature that is useful in one part is likely useful in other parts. Fewer parameters.
Sparsity of connections: each output element depends on only a small number of inputs, because it is the result of applying the filter to one patch of the image. Less overfitting, better translation invariance.
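Both properties can be checked numerically on a small example. The helper names, the 6x6 image, and the 3x3 kernel with entries 1 to 9 are assumptions chosen so that the arithmetic is exact.

```python
import numpy as np

def cross_correlate2d(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and sum the elementwise products."""
    f_h, f_w = kernel.shape
    out = np.zeros((image.shape[0] - f_h + 1, image.shape[1] - f_w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + f_h, j:j + f_w] * kernel)
    return out

image = np.arange(36, dtype=float).reshape(6, 6)
kernel = np.arange(1, 10, dtype=float).reshape(3, 3)   # all entries non-zero
baseline = cross_correlate2d(image, kernel)

def outputs_affected_by(row, col):
    """Change one input pixel and count how many of the 16 outputs change."""
    perturbed = image.copy()
    perturbed[row, col] += 1.0
    return int(np.sum(cross_correlate2d(perturbed, kernel) != baseline))

# Sparsity: one input pixel reaches at most f*f = 9 outputs, and a corner pixel only one.
print(outputs_affected_by(0, 0))   # 1
print(outputs_affected_by(2, 2))   # 9

# Sharing: 9 weights serve all 16 outputs; a dense layer would need one weight per (input, output) pair.
shared_weights = kernel.size
dense_weights = image.size * baseline.size
print(shared_weights, dense_weights)   # 9 576
```

Equivalently, each output reads only 9 of the 36 inputs, and every output reuses the same 9 weights.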
How to train? Define a “cost” function and minimize it with gradient descent.
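As a toy illustration of training, the sketch below learns a 3x3 filter by gradient descent so that its output matches that of a Prewitt vertical edge filter. Everything here is an assumption made for the demo: the mean squared error cost, the random 16x16 input, the learning rate, and the step count. A real network would learn its filters from labeled data through backpropagation over many layers.

```python
import numpy as np

def cross_correlate2d(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and sum the elementwise products."""
    f_h, f_w = kernel.shape
    out = np.zeros((image.shape[0] - f_h + 1, image.shape[1] - f_w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + f_h, j:j + f_w] * kernel)
    return out

rng = np.random.default_rng(0)
image = rng.standard_normal((16, 16))                    # assumed training input
target_filter = np.array([[1, 0, -1]] * 3, dtype=float)  # the filter we hope to recover
target = cross_correlate2d(image, target_filter)         # desired 14x14 output

weights = np.zeros((3, 3))   # learned filter, starting from zeros
learning_rate = 0.1          # assumed value

for step in range(300):
    error = cross_correlate2d(image, weights) - target
    # cost = mean(error ** 2); its gradient w.r.t. weights[a, b] is
    # 2 * mean(error[i, j] * image[i + a, j + b]), i.e. the image cross-correlated with the error map.
    grad = 2.0 * cross_correlate2d(image, error) / error.size
    weights -= learning_rate * grad

final_cost = np.mean((cross_correlate2d(image, weights) - target) ** 2)
print(np.round(weights, 3))
print(final_cost)
```

In this setup the learned weights end up very close to the Prewitt filter, which is the idea behind replacing hand-coded filters with data-learned ones.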