YOLOv2 Configuration File Explained!

Tanmay Thaker · Nerd For Tech · Mar 29, 2022 · 9 min read

The YOLOv2 config file can be found in darknet/cfg/yolov2.cfg. The link to the configuration file is given here: https://github.com/pjreddie/darknet/blob/master/cfg/yolov2.cfg

Now let us understand this config file step by step. First, let us dive deep into the [net] layer:

[net] layer
# Testing
#batch=1
#subdivisions=1
# Training
batch=64
subdivisions=8
width=608
height=608
channels=3
momentum=0.9
decay=0.0005
angle=0
saturation = 1.5
exposure = 1.5
hue=.1
learning_rate=0.001
burn_in=1000
max_batches = 500200
policy=steps
steps=400000,450000
scales=.1,.1
  1. Batch
    Batch means batch size: the number of training examples used in one iteration. When we load the dataset (images, in our case) into memory, we have two options:
    a) load the whole dataset into memory at once, or
    b) load a sample of the data into memory.
    If the dataset is huge, the first option makes training very slow, because holding everything in memory at once is very inefficient. Another reason for using batches is that without them the network would have to store the error values for all images before a single update; since weights and biases would only be updated after a pass over the whole dataset, training would be quite slow. It is therefore better to split the data into small batches.
  2. Subdivisions
    The batch is subdivided into this many “blocks”, i.e. the number of mini-batches the batch is split into. Example: batch=64 loads 64 images for one iteration; subdivisions=8 splits the batch into 8 mini-batches, so 64/8 = 8 images per mini-batch are sent to the GPU at a time. This is repeated 8 times until the batch is completed, and a new iteration starts with 64 new images. The images of a mini-batch run in parallel on the GPU; if your GPU has enough memory, you can reduce subdivisions to load more images at once. Importantly, batch must be divisible by subdivisions, because the code works on mini-batches of batch/subdivisions images, as you can see in parser.c:
    net->batch /= subdivs;
    Also, the number of images in every step is defined in detector.c:
    int imgs = net.batch * net.subdivisions * ngpus;
  3. Width, Height and Channels
    The next three parameters are width, height, and channels.
    width=608 is the network width: every image is resized to the network size during training and detection.
    height=608 is the network height: every image is resized to the network size during training and detection.
    channels=3 is the number of network channels: every image is converted to this number of channels during training and detection.
  4. Momentum
    Momentum is an accumulation of movement: it controls how much the history of previous updates affects the current change of weights (see the SGD sketch after this list). It extends the gradient descent optimization algorithm, allowing the search to build inertia in a direction in the search space, overcome the oscillations of noisy gradients, and coast across flat spots of the search space. By default, the value of momentum is 0.9.
  5. Decay
    Real-world data is complex, and complex problems need complex solutions. Having fewer parameters is only one way of preventing our model from getting overly complex, and it is a very limiting strategy: more parameters mean more interactions between the parts of our neural network, and more interactions mean more non-linearity, which is what helps us solve complex problems. We just don't want these interactions to get out of hand, so instead of limiting parameters we penalize complexity. One way to do that is to add the squares of all the parameters to the loss function. On its own this could make the loss so huge that the best model would set all parameters to 0, so we multiply the sum of squares by a small number. This number is called the weight decay wd.
    Our loss function now looks as follows:
    Loss = MSE(y_hat, y) + wd * sum(w²)
    When we update weights using gradient descent we do the following:
    w(t) = w(t-1) - lr * dLoss/dw
    Since our loss function now has 2 terms, the derivative of the 2nd term w.r.t. w is:
    d(wd * w²)/dw = 2 * wd * w (just as d(x²)/dx = 2x)
    So from now on we subtract not only lr * gradient from the weights but also 2 * wd * w. Because we subtract a constant times the weight from the original weight, this is called weight decay (see the SGD sketch after this list).
  6. Angle
    It randomly rotates images during training (classification only). By default, the angle is 0 degrees.
  7. Saturation
    It randomly changes the saturation of images during training. By default, the value of saturation is 1.5.
  8. Exposure
    It randomly changes exposure (brightness) during training. Its value is 1.5 by default.
  9. Hue
    It randomly changes the hue (color) during training. By default, its value is 0.1.
  10. Learning Rate
    The learning rate is a hyper-parameter that controls how much the model changes in response to the estimated error each time the model weights are updated. It may be the most important hyper-parameter when configuring your neural network, because it controls how quickly the model adapts to the problem. Smaller learning rates require more training epochs, since each update makes smaller changes to the weights, whereas larger learning rates produce rapid changes and require fewer epochs. A learning rate that is too large can cause the model to converge too quickly to a sub-optimal solution, while one that is too small can cause the process to get stuck. The challenge of training deep neural networks is therefore to select the learning rate carefully.
  11. Burn In
    For the first 1000 iterations the learning rate is gradually ramped up (“burned in”):
    current_learning_rate = learning_rate * pow(iterations / burn_in, power)
                          = 0.001 * pow(iterations / 1000, 4)
    where power=4 by default (see the schedule sketch after this list).
  12. Max Batches
    max_batches=500200 means the training will be processed for this number of iterations (batches).
  13. Policy
    The policy defines how the learning rate changes during training. By default, the policy is constant; other values are sgdr, steps, step, sig, exp, poly, and random. For example:
    policy=random: the current learning rate is changed as learning_rate * pow(rand_uniform(0,1), power)
    policy=poly: the learning rate is learning_rate * pow(1 - current_iteration / max_batches, power), with power=4
    policy=sgdr: sgdr_cycle=1000 is the initial number of iterations in the cosine cycle, and sgdr_mult=2 is the multiplier for the cosine cycle
  14. Steps
    Steps are checkpoints (iteration numbers) at which the scales will be applied. With policy=steps, steps=400000,450000 gives the iterations at which the learning rate is multiplied by the corresponding scales factor.
  15. Scales
    Scales are the coefficients by which learning_rate is multiplied at those checkpoints. For example, with policy=steps, steps=8000,9000,12000 and scales=.1,.1,.1, at iteration 10000 (past the first two steps):
    current_learning_rate = learning_rate * scales[0] * scales[1] = 0.001 * 0.1 * 0.1 = 0.00001
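
To make momentum (item 4) and decay (item 5) concrete, here is a minimal sketch of one SGD step with momentum and weight decay. It is an illustration under assumed names, not darknet's actual update code:

/* One SGD step with momentum and weight decay (illustrative sketch).
 * v accumulates the "movement" that momentum carries between steps. */
void sgd_update(float *w, float *v, const float *grad, int n,
                float lr, float momentum, float decay)
{
    for (int i = 0; i < n; ++i) {
        float g = grad[i] + 2 * decay * w[i]; /* dLoss/dw + d(wd*w^2)/dw */
        v[i] = momentum * v[i] - lr * g;      /* history damps the update */
        w[i] += v[i];                         /* weights "decay" toward 0 */
    }
}

And here is how burn_in (item 11), steps (item 14) and scales (item 15) combine into one learning-rate schedule, following the formulas quoted above. Darknet implements this in get_current_rate() in network.c; this simplified sketch handles only burn-in and policy=steps:

#include <math.h>

/* Learning rate at a given iteration, using the values from yolov2.cfg. */
float current_lr(int iteration)
{
    const float learning_rate = 0.001f;
    const int   burn_in = 1000, power = 4;
    const int   steps[]  = {400000, 450000};
    const float scales[] = {0.1f, 0.1f};

    if (iteration < burn_in)                     /* ramp-up phase */
        return learning_rate * powf((float)iteration / burn_in, power);

    float lr = learning_rate;
    for (int i = 0; i < 2; ++i)                  /* 0.001 -> 1e-4 -> 1e-5 */
        if (iteration >= steps[i]) lr *= scales[i];
    return lr;
}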

Now let us talk about the convolutional layers. As we know, YOLOv2 uses the Darknet-19 model, which has 19 convolutional layers and 5 max-pooling layers.

[convolutional] layer
batch_normalize=1
filters=64
size=1
stride=1
pad=1
activation=leaky
  1. Batch Normalize
    Batch normalization is a technique for training very deep neural networks that standardizes the inputs to a layer for each mini-batch. This stabilizes the learning process and dramatically reduces the number of training epochs required to train deep networks. batch_normalize=1 means the layer will use batch normalization and 0 means it will not; it is 0 by default.
  2. Filters
    filters is the number of convolutional kernels in the layer. Each filter detects spatial patterns, such as edges, by responding to changes in the intensity values of the image. By default, filters is 1.
  3. Size
    Size is the kernel size, e.g. size=3 for a 3x3 kernel.
  4. Stride
    It is an offset step of the kernel filter. The default value of stride is 1.
  5. Pad
    pad means padding: pixels added around the input before the kernel is applied, so that the kernel can also be centred on border pixels; the added pixels have value zero (zero-padding). In darknet, pad=1 sets the padding to size/2, and the value of pad is 0 by default (see the sketch after this list).
  6. Activation
    Activation is the activation function of the layer. One of the most commonly used activation functions is Leaky ReLU. Darknet also supports various others: linear (the default), loggy, relu, elu, selu, relie, plse, hardtan, lhtan, ramp, leaky, tanh, and stair.
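
To see how size, stride and pad determine a layer's output resolution, here is a small sketch (an illustration, not darknet's code; in darknet's parser, pad=1 translates into a padding of size/2):

/* Output resolution of a convolutional layer along one dimension.
 * pad_flag mirrors the cfg's pad option: pad=1 means the actual
 * padding used is size/2. */
int conv_out_dim(int in, int size, int stride, int pad_flag)
{
    int padding = pad_flag ? size / 2 : 0;
    return (in + 2 * padding - size) / stride + 1;
}

For example, conv_out_dim(608, 3, 1, 1) == 608: a 3x3 stride-1 padded convolution preserves the 608x608 resolution, and the 1x1 layer shown above (size=1, pad=1, so padding = 1/2 = 0) does too.
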
Now let us see the max-pooling layer:
[maxpool]
size=2
stride=2

Max Pooling layer:

Pooling layers are used to reduce the dimensions of the feature maps. This reduces the number of parameters to learn and the amount of computation performed in the network. The pooling layer summarises the features present in a region of the feature map generated by a convolution layer. Max pooling calculates the maximum value over patches of a feature map and uses it to create a downsampled (pooled) feature map; it is usually applied after a convolutional layer. In the maxpool layer, size is the max-pooling kernel size and stride is the offset step of the kernel.
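
As a concrete illustration (a sketch, not darknet's maxpool implementation, which also records the argmax indexes for the backward pass), 2x2 stride-2 max pooling over one channel looks like this:

#include <float.h>

/* 2x2 stride-2 max pooling over a single W x H channel: each output
 * pixel keeps only the strongest response in its 2x2 input patch,
 * halving both width and height. */
void maxpool2x2(const float *in, float *out, int w, int h)
{
    for (int y = 0; y < h / 2; ++y) {
        for (int x = 0; x < w / 2; ++x) {
            float m = -FLT_MAX;
            for (int dy = 0; dy < 2; ++dy)
                for (int dx = 0; dx < 2; ++dx) {
                    float v = in[(2 * y + dy) * w + (2 * x + dx)];
                    if (v > m) m = v;
                }
            out[y * (w / 2) + x] = m;
        }
    }
}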

Route layer
[route] is nothing but a concatenation layer. The route layer is like a route sign: it points to the layers whose outputs we want to concatenate. layers = -1,-4 means that the outputs of two layers, with relative indexes -1 and -4, will be concatenated. Output: W x H x (C_layer_1 + C_layer_2).
If an index is < 0, it is a relative layer number (-1 means the previous layer).
If an index is >= 0, it is an absolute layer number.
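
A small sketch of how those indexes could be resolved (an illustrative helper, not darknet's parsing code):

/* Resolve a [route] index: negative values are relative to the route
 * layer's own position, non-negative values are absolute. */
int resolve_route_index(int index, int route_layer_pos)
{
    return index < 0 ? route_layer_pos + index : index;
}

For layers=-1,-4 on a route layer at position 27, this selects layers 26 and 23, and the output channel count is the sum of their channel counts.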

Reorg layer

[reorg]
stride=2
The reorganization layer improves the performance of the YOLOv2 object detection network by facilitating feature concatenation from different layers. It reorganizes the dimensions of a lower-layer feature map so that it can be concatenated with a higher-layer feature map: the reorg layer reshapes the output tensor so that its H and W match those of the other output tensor, and the two tensors can then be concatenated.

stride=2 means that width and height will be decreased by 2 times, and the number of channels will be increased by 2x2 = 4 times, so the total number of elements will still be the same:
width_old*height_old*channels_old = width_new*height_new*channels_new
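
The shape change can be sketched as a space-to-depth operation (a simplified illustration; darknet's reorg_cpu() walks the elements in a different order, but the shape arithmetic is the same):

/* Space-to-depth with stride 2: W x H x C becomes (W/2) x (H/2) x 4C,
 * keeping the total element count unchanged. */
void reorg_stride2(const float *in, float *out, int w, int h, int c)
{
    int ow = w / 2, oh = h / 2;
    for (int k = 0; k < c; ++k)
        for (int y = 0; y < h; ++y)
            for (int x = 0; x < w; ++x) {
                /* each 2x2 input patch is spread over 4 output channels */
                int oc = k * 4 + (y % 2) * 2 + (x % 2);
                out[(oc * oh + y / 2) * ow + x / 2] =
                    in[(k * h + y) * w + x];
            }
}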

Now let us see the region layer:

[region]
anchors = 0.57273, 0.677385, 1.87446, 2.06253, 3.33843, 5.47434, 7.88282, 3.52778, 9.77052, 9.16828
bias_match=1
classes=80
coords=4
num=5
softmax=1
jitter=.3
rescore=1
object_scale=5
noobject_scale=1
class_scale=1
coord_scale=1
absolute=1
thresh = .6
random=1

1. Anchors
YOLO works well when each object is associated with one grid cell. But in the case of overlap, where one grid cell actually contains the centre points of two different objects, we can use anchor boxes to allow one grid cell to detect multiple objects. By defining anchor boxes, we create a longer grid-cell vector and can associate multiple classes with each grid cell. Anchor boxes have a defined aspect ratio, and they detect objects that nicely fit into a box with that ratio (see the decoding sketch after this list). In the config above, the anchors are listed as 5 width/height pairs in grid-cell units.

2. Bias Match

bias_match is used only for training. If bias_match=1, the detected object will have the same size as one of the anchors; if bias_match=0, the size of the anchor will be refined by the neural network.

3. Classes
Classes are the number of classes that we have in our dataset.

4. Num
Num is the total number of anchors. In our case num=5 (the anchors line lists 5 width/height pairs).

5. Softmax
YOLO applies a softmax function to convert scores into probabilities that sum up to one.

6. Jitter
It randomly crops and resizes images, changing the aspect ratio from x(1 - 2*jitter) to x(1 + 2*jitter). The larger the value of jitter, the more invariant the neural network becomes to changes in the size and aspect ratio of objects.

7. Rescore
It determines which form of the loss (delta) function will be used: with rescore=1, the objectness target is the predicted box's IoU with the ground truth rather than a constant.

8. Object Scale
object_scale is the weight of the loss (delta) term for cells that contain an object.

9. noobject_scale
It is the weight of the loss term for background, i.e. cells that contain no object.

10. Class Scale
It is used as a scale in delta_region_class().

11. Coord Scale
It is also used as a scale, in delta_region_box().

12. Threshold
During training, thresh is an IoU threshold: predicted boxes whose IoU with a ground-truth box exceeds it are not penalized on their objectness score. At detection time, YOLO additionally uses Non-Maximal Suppression (NMS) to keep only the best bounding box; the first step in NMS is to remove all predicted bounding boxes whose detection probability is below a given threshold.

13. Random
If random=1, the network is randomly resized every 10 iterations by a factor between 1/1.4 and 1.4.
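
To tie anchors, coords and num together, here is a sketch of how one predicted box is decoded, following the YOLOv2 paper. It illustrates the idea behind darknet's get_region_box(), not its exact code:

#include <math.h>

typedef struct { float x, y, w, h; } box;

static float sigmoidf(float v) { return 1.0f / (1.0f + expf(-v)); }

/* Decode the 4 raw coords (tx, ty, tw, th) of one prediction.
 * (cx, cy) is the grid cell and (pw, ph) the anchor pair from the
 * cfg, all in grid-cell units; multiply by the 32-pixel stride of
 * the network to get pixel coordinates. */
box decode_box(float tx, float ty, float tw, float th,
               int cx, int cy, float pw, float ph)
{
    box b;
    b.x = sigmoidf(tx) + cx;   /* centre stays inside its own cell */
    b.y = sigmoidf(ty) + cy;
    b.w = pw * expf(tw);       /* anchor is scaled, never negative */
    b.h = ph * expf(th);
    return b;
}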

References:
https://github.com/AlexeyAB/darknet/wiki/CFG-Parameters-in-the-different-layers#cfg-parameters-in-the-different-layers
https://towardsdatascience.com/this-thing-called-weight-decay-a7cd4bcfccab
