Different Layers In Caffe

Fully Connected Layer/Inner Product

  • Input: n * c_i * h_i * w_i
  • Output: n * c_o * 1 * 1
layer {
  name: "fc8"
  type: "InnerProduct"
  # learning rate and decay multipliers for the weights
  param { lr_mult: 1 decay_mult: 1 }
  # learning rate and decay multipliers for the biases
  param { lr_mult: 2 decay_mult: 0 }
  inner_product_param {
    num_output: 1000
    weight_filler {
      type: "gaussian"
      std: 0.01
    }
    bias_filler {
      type: "constant"
      value: 0
    }
  }
  bottom: "fc7"
  top: "fc8"
}

Scale Layer

The purpose of this layer is to divide the input values by their standard deviation so that the variance is approximately 1. This keeps activations from saturating. If you reduce the learning rate to e.g. 0.0001, you should notice that the network is able to learn even without the Scale layer.
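
As a sketch of how such a layer is typically declared (the bottom/top blob names are assumptions, and it is commonly paired with a BatchNorm layer), a Scale layer with a learned bias looks like:

layer {
  name: "scale1"
  type: "Scale"
  bottom: "bn1"
  top: "bn1"
  # learn a per-channel multiplier and bias
  scale_param {
    bias_term: true
  }
}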

Element-wise Layer

It can perform an element-wise SUM/MAX/PROD operation. You can also set per-input coefficients (for the SUM operation).

e.g.

layer {
  name: "innder_2"
  type: "Eltwise"
  bottom: "pred_2_inner"
  bottom: "pred_scale_conv_2"
  top: "score-final-2"
  eltwise_param {
    operation: PROD
  }
}


layer {
  name: "inner_final"
  type: "Eltwise"
  bottom: "score-final-1"
  bottom: "score-final-2"
  top: "pred_final"
  eltwise_param {
    operation: SUM
    coeff: -1
    coeff: 1  # computes -a + b
  }
}

Convolutional Layer

When the kernel size is 1, the pad is 0, there are M input channels, and there is 1 output channel, the result is a weighted sum over the M input channels (see the sketch after the list below).

  • Input: n * c_i * h_i * w_i
  • Output: n * c_o * h_o * w_o, where h_o = (h_i + 2 * pad_h - kernel_h) / stride_h + 1 and w_o likewise.
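
A minimal sketch of such a 1x1 convolution (the blob names and filler settings are assumptions):

layer {
  name: "conv_1x1"
  type: "Convolution"
  bottom: "feat"
  top: "conv_1x1"
  convolution_param {
    num_output: 1    # a single output channel
    kernel_size: 1   # 1x1 kernel: weighted sum over the M input channels
    pad: 0
    stride: 1
    weight_filler { type: "gaussian" std: 0.01 }
  }
}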

loss = -nan

When this happens, the initialization of the convolutional layer might be wrong; try leaving it at 0. You can also decrease the learning rate until the NaN loss no longer appears.
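
One way to test this, as a sketch only (the layer and blob names below are assumptions), is to pin the suspect layer's weights to a constant-zero initialization; separately, base_lr can be lowered in solver.prototxt.

layer {
  name: "conv_debug"
  type: "Convolution"
  bottom: "pool5"
  top: "conv_debug"
  convolution_param {
    num_output: 21
    kernel_size: 3
    pad: 1
    # constant-zero fillers to rule out a bad random initialization
    weight_filler { type: "constant" value: 0 }
    bias_filler { type: "constant" value: 0 }
  }
}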

Deconvolution Layer

  • Input: n * c_i * h_i * w_i
  • Output: n * c_o * h_o * w_o, where h_o = (h_i - 1) * stride_h - 2 * pad_h + kernel_h, and w_o likewise.

Example:

layer {
  name: "score4"
  type: "Deconvolution"
  bottom: "score-fused"
  top: "score4"
  param {
    lr_mult: 0  #1
    decay_mult: 1
  }
  convolution_param {
    num_output: 21
    bias_term: false
    kernel_size: 4
    stride: 2
  }
}

The parameters can be chosen as follows. dilation can be used to realize a dilated convolution; the traditional deconvolution layer can be seen as the special case where this parameter is set to 1. An example of a dilated layer follows the parameter listing below.

message ConvolutionParameter {
  optional uint32 num_output = 1; // The number of outputs for the layer
  optional bool bias_term = 2 [default = true]; // whether to have bias terms

  // Pad, kernel size, and stride are all given as a single value for equal
  // dimensions in all spatial dimensions, or once per spatial dimension.
  repeated uint32 pad = 3; // The padding size; defaults to 0
  repeated uint32 kernel_size = 4; // The kernel size
  repeated uint32 stride = 6; // The stride; defaults to 1
  // Factor used to dilate the kernel, (implicitly) zero-filling the resulting
  // holes. (Kernel dilation is sometimes referred to by its use in the
  // algorithme à trous from Holschneider et al. 1987.)
  repeated uint32 dilation = 18; // The dilation; defaults to 1

  // For 2D convolution only, the *_h and *_w versions may also be used to
  // specify both spatial dimensions.
  optional uint32 pad_h = 9 [default = 0]; // The padding height (2D only)
  optional uint32 pad_w = 10 [default = 0]; // The padding width (2D only)
  optional uint32 kernel_h = 11; // The kernel height (2D only)
  optional uint32 kernel_w = 12; // The kernel width (2D only)
  optional uint32 stride_h = 13; // The stride height (2D only)
  optional uint32 stride_w = 14; // The stride width (2D only)

  optional uint32 group = 5 [default = 1]; // The group size for group conv

  optional FillerParameter weight_filler = 7; // The filler for the weight
  optional FillerParameter bias_filler = 8; // The filler for the bias
  enum Engine {
    DEFAULT = 0;
    CAFFE = 1;
    CUDNN = 2;
  }
  optional Engine engine = 15 [default = DEFAULT];

  // The axis to interpret as "channels" when performing convolution.
  // Preceding dimensions are treated as independent inputs;
  // succeeding dimensions are treated as "spatial".
  // With (N, C, H, W) inputs, and axis == 1 (the default), we perform
  // N independent 2D convolutions, sliding C-channel (or (C/g)-channels, for
  // groups g>1) filters across the spatial axes (H, W) of the input.
  // With (N, C, D, H, W) inputs, and axis == 1, we perform
  // N independent 3D convolutions, sliding (C/g)-channels
  // filters across the spatial axes (D, H, W) of the input.
  optional int32 axis = 16 [default = 1];

  // Whether to force use of the general ND convolution, even if a specific
  // implementation for blobs of the appropriate number of spatial dimensions
  // is available. (Currently, there is only a 2D-specific convolution
  // implementation; for input blobs with num_axes != 2, this option is
  // ignored and the ND implementation will be used.)
  optional bool force_nd_im2col = 17 [default = false];
}
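
As a sketch (the blob names and sizes are assumptions), a dilated 3x3 convolution that keeps the spatial size unchanged could look like:

layer {
  name: "conv_dilated"
  type: "Convolution"
  bottom: "conv5"
  top: "conv_dilated"
  convolution_param {
    num_output: 256
    kernel_size: 3
    # dilation 2 inserts holes in the kernel, enlarging the receptive field
    dilation: 2
    pad: 2   # keeps h and w unchanged for a 3x3 kernel with dilation 2, stride 1
  }
}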

Data Layer

If you use transform_param and want to do data augmentation for semantic segmentation, which may have multiple data input layers plus a label layer, there is no way to do it automatically in the original Caffe, because the scaling and mirroring are applied randomly (and therefore independently per layer). The options are: 1) write a separate data-processing layer in Python, or 2) add another layer to the Caffe code; an example is https://github.com/kevinlin311tw/caffe-augmentation. A sketch of the Python option follows.
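
A sketch of option 1), declaring a Python data layer in the prototxt. The module and layer names here are hypothetical; they would have to be implemented in a Python file on the PYTHONPATH that applies the same random scale/mirror to image and label.

layer {
  name: "data"
  type: "Python"
  top: "data"
  top: "label"
  python_param {
    # hypothetical module/class implementing joint scale/mirror augmentation
    module: "augment_seg_layer"
    layer: "AugmentSegDataLayer"
    param_str: "{'mirror': True, 'scale': [0.5, 1.5]}"
  }
}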

About Resuming from the Snapshot State

Training resumes from the .solverstate file together with solver.prototxt. However, we can still change the neural-network parameters in the prototxt file before resuming.
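
As a sketch, resuming from the command line (the file names are assumptions):

./build/tools/caffe train --solver=solver.prototxt --snapshot=train_iter_10000.solverstate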