YOLOv3 in TensorFlow

Dec 29, 2018

What is YOLO?

‘You Only Look Once’ (YOLO) is an object detection algorithm.

So what’s great about object detection?

Compared with recognition algorithms, a detection algorithm predicts not only class labels but also the locations of objects. So it does not just classify the image into a category; it can also detect multiple objects within an image.

This algorithm also doesn’t depend on multiple neural networks. It applies a single neural network to the full image. This network divides the image into regions and predicts bounding boxes and probabilities for each region. These bounding boxes are weighted by the predicted probabilities.

Here are a few sample output images:

Images with Object Detection by YOLO algorithm with bounding boxes.

Why this Blog?

The original YOLO algorithm is implemented in Darknet, an open source neural network framework written in C and CUDA. We will implement this algorithm in TensorFlow with Python 3; source code here.

Dependencies

To build YOLO in TensorFlow we will require:

1. TensorFlow (GPU version preferred for deep learning)
2. NumPy (for numeric computation)
3. Pillow/PIL (for image processing)
4. IPython (for displaying images in Jupyter Notebook)
5. glob (for finding the pathnames of all the files)

Anaconda is recommended, as it bundles most machine learning and deep learning libraries, is easy to use, and includes an interactive Python notebook (Jupyter Notebook).

For the programming part, visit the GitHub repository here.

The Jupyter Notebook with the code can be found here, and the PDF explanation of it here.

Model Tuning and Hyperparameters

Batch Normalization

Batch normalization normalizes the features produced by one layer before they are fed to the next: the activations are shifted and scaled to a common range. For example, if one feature lies in the range 0 to 1 and another in the range 1 to 1000, normalizing them speeds up learning, because the network no longer treats the feature with the larger range as more important. It also lets each layer learn a little more independently of the other layers. Almost every convolutional layer in YOLO is followed by batch normalization. It helps the model train faster and reduces the variance between units (and the total variance as well).
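As a rough illustration of the idea, here is a minimal pure-Python sketch that normalizes one feature across a batch. The function name `batch_norm`, the sample values and the default `eps` are assumptions made for this example, not code from the repository; gamma and beta stand for the learned scale and shift. At inference time a real layer uses the stored moving mean and variance instead of the statistics of the current batch.

```python
import math

def batch_norm(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature across a batch, then scale by gamma and shift by beta."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in values]

# Two made-up features with very different ranges, as in the example above.
small_feature = [0.1, 0.4, 0.5, 0.9]          # roughly 0 to 1
large_feature = [100.0, 400.0, 500.0, 900.0]  # roughly 1 to 1000

norm_small = batch_norm(small_feature)
norm_large = batch_norm(large_feature)

# After normalization both features live on the same scale.
print(norm_small)
print(norm_large)
```

Both printed lists should be nearly identical, which is the point: the network no longer sees a 1000x difference in magnitude between the two features.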

Leaky ReLU

ReLU (Rectified Linear Unit) is an activation function used in neural networks, and Leaky ReLU is a variant of it. Suppose that, for whatever reason, the output of a ReLU is consistently 0 (for example, because the ReLU has a large negative bias). Then the gradient through it is also consistently 0. The error signal backpropagated from later layers gets multiplied by this 0, so no error signal ever passes to earlier layers: the ReLU has died. Leaky ReLU avoids this by keeping a small non-zero slope for negative inputs, so some gradient always flows.
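The difference is easy to see on single numbers. The sketch below is only an illustration: the function names are made up for this example, and the slope `alpha=0.1` is an assumed value for the negative side.

```python
def relu(x):
    return max(0.0, x)

def relu_grad(x):
    # gradient of ReLU: exactly 0 for negative inputs, so nothing flows back
    return 1.0 if x > 0 else 0.0

def leaky_relu(x, alpha=0.1):
    # alpha is the slope for negative inputs (assumed value for this sketch)
    return x if x > 0 else alpha * x

def leaky_relu_grad(x, alpha=0.1):
    # gradient of Leaky ReLU: small but non-zero for negative inputs
    return 1.0 if x > 0 else alpha

x = -3.0
print(relu(x), relu_grad(x))              # output and gradient are both zero
print(leaky_relu(x), leaky_relu_grad(x))  # small negative output, gradient of alpha
```

For a negative input the ReLU gradient is zero, while the Leaky ReLU gradient stays at `alpha`, so the unit can still recover during training.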

Anchors

Anchors are bounding box priors that were calculated on the COCO dataset using k-means clustering. We predict the width and height of each box as offsets from the cluster centroids. The center coordinates of the box, relative to the location of filter application, are predicted using a sigmoid function.
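A minimal sketch of how one raw prediction could be decoded, following our reading of the formulation in the YOLO v3 paper: the sigmoid keeps the center inside its grid cell, and the exponential scales the anchor. The function `decode_box`, its parameter names and the sample anchor size are assumptions for illustration, not code from the repository.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def decode_box(tx, ty, tw, th, cell_x, cell_y, anchor_w, anchor_h, stride):
    """Turn raw outputs into (center_x, center_y, width, height) in input pixels."""
    # center: offset inside the grid cell (0..1) plus the cell index, scaled to pixels
    center_x = (sigmoid(tx) + cell_x) * stride
    center_y = (sigmoid(ty) + cell_y) * stride
    # size: the anchor (prior) scaled by the exponential of the raw output
    width = anchor_w * math.exp(tw)
    height = anchor_h * math.exp(th)
    return center_x, center_y, width, height

# All-zero raw outputs give a box centered in cell (6, 6) with exactly the anchor size.
box = decode_box(0.0, 0.0, 0.0, 0.0, cell_x=6, cell_y=6,
                 anchor_w=116, anchor_h=90, stride=32)
print(box)  # (208.0, 208.0, 116.0, 90.0)
```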

Implementation of Darknet-53 layers

In the YOLO v3 paper, the authors present a new, deeper feature extractor architecture called Darknet-53. As its name suggests, it contains 53 convolutional layers, each followed by a batch normalization layer and a Leaky ReLU activation function. Downsampling is done by convolution layers with stride=2.
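To get a feel for what stride-2 downsampling does to the feature map, here is a small sketch. It assumes a 416 x 416 input and five stride-2 layers with "same"-style padding; both numbers and the function name are assumptions for this example.

```python
def downsampled_sizes(input_size, num_stride2_layers):
    """Spatial size of the feature map after each stride-2 convolution."""
    sizes = [input_size]
    for _ in range(num_stride2_layers):
        # with 'same'-style padding a stride-2 convolution yields ceil(size / 2)
        sizes.append((sizes[-1] + 1) // 2)
    return sizes

sizes = downsampled_sizes(416, 5)
print(sizes)  # [416, 208, 104, 52, 26, 13]
```

Under these assumptions the last three sizes (52, 26 and 13) are the grid resolutions of the three detection scales.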

Converting pre-trained COCO weights

We have defined the detector’s architecture. To use it, we have to either train it on our own dataset or use pretrained weights. Weights pretrained on the COCO dataset are available for public use. We can download them using this link:

https://pjreddie.com/media/files/yolov3.weights

The structure of this binary file is as follows:

The first 3 int32 values are header information: major version number, minor version number and subversion number, followed by an int64 value: the number of images seen by the network during training. After them come 62 001 757 float32 values, which are the weights of each conv and batch norm layer. It is important to remember that the conv kernels are saved in a different axis order from the one TensorFlow uses (described here as row-major versus column-major): the file stores them as (num_filters, input_channels, height, width), while TensorFlow expects (height, width, input_channels, num_filters).
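The header layout can be sketched with the standard `struct` module. This example does not touch the real weights file: it builds a tiny fake file in memory with made-up values, and it assumes little-endian byte order. The function name `read_header` is hypothetical.

```python
import io
import struct

def read_header(fp):
    """Read the header described above: 3 x int32, then 1 x int64 (20 bytes)."""
    major, minor, revision = struct.unpack("<3i", fp.read(12))
    (images_seen,) = struct.unpack("<q", fp.read(8))
    return major, minor, revision, images_seen

# A fake in-memory file with made-up values: header first, then three float32 weights.
fake_file = io.BytesIO(
    struct.pack("<3i", 0, 2, 0)
    + struct.pack("<q", 1000)
    + struct.pack("<3f", 0.5, -1.0, 2.0)
)

header = read_header(fake_file)
weights = struct.unpack("<3f", fake_file.read())
print(header)   # (0, 2, 0, 1000)
print(weights)  # (0.5, -1.0, 2.0)
```

The header occupies 12 + 8 = 20 bytes, the same as 5 int32 values, which is why a loader can skip it by reading 5 int32 values.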

So, how should we read the weights from this file?

We start from the first conv layer. Most of the convolution layers are immediately followed by a batch normalization layer. In this case, we first need to read 4 * num_filters weights of the batch norm layer (beta, gamma, moving mean and moving variance, in that order), then kernel_size[0] * kernel_size[1] * num_filters * input_channels weights of the conv layer.

In the opposite case, when a conv layer is not followed by a batch norm layer, instead of reading batch norm params we need to read num_filters bias weights.
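The two cases boil down to simple arithmetic. The helper below is a sketch with a made-up name, and the two layer shapes are illustrative examples rather than values read from the model.

```python
def values_to_read(kernel_size, input_channels, num_filters, has_batch_norm):
    """Number of float32 values stored in the file for one conv layer."""
    conv_weights = kernel_size[0] * kernel_size[1] * num_filters * input_channels
    if has_batch_norm:
        # beta, gamma, moving mean, moving variance: one value per filter each
        extra = 4 * num_filters
    else:
        # plain conv layer: one bias per filter
        extra = num_filters
    return extra + conv_weights

# Illustrative shapes: a 3x3 conv from 3 to 32 channels with batch norm,
# and a 1x1 conv from 1024 to 255 channels with biases only.
with_bn = values_to_read((3, 3), 3, 32, has_batch_norm=True)
without_bn = values_to_read((1, 1), 1024, 255, has_batch_norm=False)
print(with_bn)     # 4 * 32 + 3 * 3 * 32 * 3 = 992
print(without_bn)  # 255 + 1 * 1 * 255 * 1024 = 261375
```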

Let’s start writing the load_weights function. It takes 2 arguments: a list of variables in our graph and the name of the binary file.

We start by opening the file, skipping the first 5 int32 values and reading everything else as a list:

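import numpy as np
import tensorflow as tf

def load_weights(var_list, weights_file):
    with open(weights_file, "rb") as fp:
        # skip the header: 3 int32 values + 1 int64 value = 5 int32 values
        _ = np.fromfile(fp, dtype=np.int32, count=5)
        weights = np.fromfile(fp, dtype=np.float32)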

Then we will use two pointers: the first to iterate over the list of variables var_list and the second to iterate over the list of loaded values weights. We need to check the type of the layer following the one currently processed and read the appropriate number of values. In the code, i iterates over var_list and ptr iterates over weights. We will return a list of tf.assign ops. I check the type of the layer simply by comparing its name. (I agree that it is a little ugly, but I don’t know any better way of doing it. This approach seems to work for me.)

ptr = 0
i = 0
assign_ops = []
while i < len(var_list) - 1:
    var1 = var_list[i]
    var2 = var_list[i + 1]
    # do something only if we process conv layer
    if 'Conv' in var1.name.split('/')[-2]:
        # check type of next layer
        if 'BatchNorm' in var2.name.split('/')[-2]:
            # load batch norm params
            gamma, beta, mean, var = var_list[i + 1:i + 5]
            batch_norm_vars = [beta, gamma, mean, var]
            for var in batch_norm_vars:
                shape = var.shape.as_list()
                num_params = np.prod(shape)
                var_weights = weights[ptr:ptr + num_params].reshape(shape)
                ptr += num_params
                assign_ops.append(tf.assign(var, var_weights, validate_shape=True))
            # we move the pointer by 4, because we loaded 4 variables
            i += 4
        elif 'Conv' in var2.name.split('/')[-2]:
            # load biases
            bias = var2
            bias_shape = bias.shape.as_list()
            bias_params = np.prod(bias_shape)
            bias_weights = weights[ptr:ptr + bias_params].reshape(bias_shape)
            ptr += bias_params
            assign_ops.append(tf.assign(bias, bias_weights, validate_shape=True))
            # we loaded 1 variable (the bias)
            i += 1
        # we can load weights of conv layer
        shape = var1.shape.as_list()
        num_params = np.prod(shape)
        var_weights = weights[ptr:ptr + num_params].reshape((shape[3], shape[2], shape[0], shape[1]))
        # remember to transpose to column-major
        var_weights = np.transpose(var_weights, (2, 3, 1, 0))
        ptr += num_params
        assign_ops.append(tf.assign(var1, var_weights, validate_shape=True))
        i += 1
return assign_ops
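The reshape and transpose applied to the conv kernel can be checked in isolation with NumPy. This is only a sketch on a hypothetical small kernel: the shape and the use of np.arange as a stand-in for the file contents are assumptions.

```python
import numpy as np

# Hypothetical small kernel: height=3, width=3, input_channels=2, num_filters=4.
tf_shape = (3, 3, 2, 4)
num_params = int(np.prod(tf_shape))

# Stand-in for a slice of the flat weights array read from the file.
flat = np.arange(num_params, dtype=np.float32)

# The file order is (num_filters, input_channels, height, width)...
darknet_kernel = flat.reshape((tf_shape[3], tf_shape[2], tf_shape[0], tf_shape[1]))
# ...and the transpose moves the axes to (height, width, input_channels, num_filters).
tf_kernel = np.transpose(darknet_kernel, (2, 3, 1, 0))

print(darknet_kernel.shape)  # (4, 2, 3, 3)
print(tf_kernel.shape)       # (3, 3, 2, 4)

# The same weight is reachable through both layouts.
f, c, y, x = 1, 0, 2, 1
same_weight = bool(darknet_kernel[f, c, y, x] == tf_kernel[y, x, c, f])
print(same_weight)
```

Reshaping the flat slice straight into the TensorFlow shape would run without error but scramble the kernel, which is why the two-step reshape and transpose is needed.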

And that’s it! Now we can restore the weights of the model by executing lines of code similar to these:

with tf.variable_scope('model'):
    model = yolo_v3(inputs, 80)

model_vars = tf.global_variables(scope='model')
assign_ops = load_weights(model_vars, 'yolov3.weights')

sess = tf.Session()
sess.run(assign_ops)

For future use, it will probably be much easier to export the weights using tf.train.Saver and load them from a checkpoint.

Implementation of post-processing algorithms

Our model returns a tensor of shape:

batch_size x 10647 x (num_classes + 5 bounding box attrs)

The number 10647 is equal to the sum 507 + 2028 + 8112, which are the numbers of possible objects detected on each scale. The 5 values describing bounding box attributes stand for center_x, center_y, width, height and confidence. In most cases, it is easier to work on coordinates of two points: top left and bottom right. Let’s convert the output of the detector to this format.
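The numbers above can be reproduced with a few lines of arithmetic. The sketch assumes a 416 x 416 input, strides of 32, 16 and 8 for the three scales, and 3 boxes per grid cell; the function name is made up for this example.

```python
def num_detections(input_size=416, strides=(32, 16, 8), boxes_per_cell=3):
    """Candidate boxes per scale and in total (all defaults are assumed values)."""
    per_scale = [(input_size // s) ** 2 * boxes_per_cell for s in strides]
    return per_scale, sum(per_scale)

per_scale, total = num_detections()
print(per_scale)  # [507, 2028, 8112]
print(total)      # 10647

num_classes = 80  # COCO
attrs_per_box = num_classes + 5
print(attrs_per_box)  # 85
```

The conversion to corner coordinates only touches the first four of those 85 values per box.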

The function which does it is pretty straightforward:

def detections_boxes(detections):
    center_x, center_y, width, height, attrs = tf.split(detections, [1, 1, 1, 1, -1], axis=-1)
    w2 = width / 2
    h2 = height / 2
    x0 = center_x - w2
    y0 = center_y - h2
    x1 = center_x + w2
    y1 = center_y + h2

    boxes = tf.concat([x0, y0, x1, y1], axis=-1)
    detections = tf.concat([boxes, attrs], axis=-1)
    return detections
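The same conversion on a single box, written in plain Python so the numbers can be checked by hand. The function name `center_to_corners` is an assumption for this sketch.

```python
def center_to_corners(center_x, center_y, width, height):
    """Convert (center, size) to (top-left, bottom-right) corner coordinates."""
    half_w = width / 2
    half_h = height / 2
    return (center_x - half_w, center_y - half_h,
            center_x + half_w, center_y + half_h)

corners = center_to_corners(100, 100, 50, 20)
print(corners)  # (75.0, 90.0, 125.0, 110.0)
```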

Our detector usually detects the same object multiple times (with slightly different centers and sizes). In most cases we don’t want to keep all of these detections, which differ only by a small number of pixels. The standard solution to this problem is non-max suppression. A good description of this method is available here.

First we need a function to compute IoU (Intersection over Union) of two bounding boxes:

def _iou(box1, box2):
    b1_x0, b1_y0, b1_x1, b1_y1 = box1
    b2_x0, b2_y0, b2_x1, b2_y1 = box2

    # corners of the intersection rectangle
    int_x0 = max(b1_x0, b2_x0)
    int_y0 = max(b1_y0, b2_y0)
    int_x1 = min(b1_x1, b2_x1)
    int_y1 = min(b1_y1, b2_y1)

    # clamp at 0 so that boxes which do not overlap get zero intersection area
    int_area = max(int_x1 - int_x0, 0) * max(int_y1 - int_y0, 0)

    b1_area = (b1_x1 - b1_x0) * (b1_y1 - b1_y0)
    b2_area = (b2_x1 - b2_x0) * (b2_y1 - b2_y0)

    # the small epsilon avoids division by zero
    iou = int_area / (b1_area + b2_area - int_area + 1e-05)
    return iou

Now we can write the non_max_suppression function. We use the NumPy library for fast vector operations.

def non_max_suppression(predictions_with_boxes, confidence_threshold, iou_threshold=0.4):
    """
    Applies Non-max suppression to prediction boxes.

    :param predictions_with_boxes: 3D numpy array, first 4 values in 3rd dimension are bbox attrs, 5th is confidence
    :param confidence_threshold: the threshold for deciding if prediction is valid
    :param iou_threshold: the threshold for deciding if two boxes overlap
    :return: dict: class -> [(box, score)]
    """

It takes 3 arguments: outputs from our YOLO v3 detector, confidence threshold and IoU threshold. The body of this function is as follows:

conf_mask = np.expand_dims((predictions_with_boxes[:, :, 4] > confidence_threshold), -1)
predictions = predictions_with_boxes * conf_mask

result = {}
for i, image_pred in enumerate(predictions):
    # keep only the rows that passed the confidence threshold
    # (rows that failed were zeroed out by conf_mask above)
    image_pred = image_pred[image_pred[:, 4] > 0]

    bbox_attrs = image_pred[:, :5]
    classes = image_pred[:, 5:]
    classes = np.argmax(classes, axis=-1)

    unique_classes = list(set(classes.reshape(-1)))

    for cls in unique_classes:
        cls_mask = classes == cls
        cls_boxes = bbox_attrs[np.nonzero(cls_mask)]
        # sort by confidence, highest first
        cls_boxes = cls_boxes[cls_boxes[:, -1].argsort()[::-1]]
        cls_scores = cls_boxes[:, -1]
        cls_boxes = cls_boxes[:, :-1]

        while len(cls_boxes) > 0:
            box = cls_boxes[0]
            score = cls_scores[0]
            if cls not in result:
                result[cls] = []
            result[cls].append((box, score))
            # drop the box we just kept, together with its score
            cls_boxes = cls_boxes[1:]
            cls_scores = cls_scores[1:]
            ious = np.array([_iou(box, x) for x in cls_boxes])
            iou_mask = ious < iou_threshold
            cls_boxes = cls_boxes[np.nonzero(iou_mask)]
            cls_scores = cls_scores[np.nonzero(iou_mask)]

return result
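To see the greedy procedure on a toy input, here is a self-contained pure-Python sketch for a single class. It is a simplified stand-in, not the function above: `iou`, `greedy_nms` and the three sample boxes are made up for illustration.

```python
def iou(box1, box2):
    """IoU of two boxes given as (x0, y0, x1, y1)."""
    inter_w = max(min(box1[2], box2[2]) - max(box1[0], box2[0]), 0)
    inter_h = max(min(box1[3], box2[3]) - max(box1[1], box2[1]), 0)
    inter = inter_w * inter_h
    area1 = (box1[2] - box1[0]) * (box1[3] - box1[1])
    area2 = (box2[2] - box2[0]) * (box2[3] - box2[1])
    return inter / (area1 + area2 - inter + 1e-05)

def greedy_nms(detections, iou_threshold=0.4):
    """detections: list of (box, score) pairs that all belong to one class."""
    remaining = sorted(detections, key=lambda d: d[1], reverse=True)
    kept = []
    while remaining:
        # keep the most confident box, then drop everything that overlaps it too much
        best = remaining.pop(0)
        kept.append(best)
        remaining = [d for d in remaining if iou(best[0], d[0]) < iou_threshold]
    return kept

detections = [
    ((100, 100, 200, 200), 0.9),
    ((105, 105, 205, 205), 0.8),  # near-duplicate of the first box
    ((300, 300, 400, 400), 0.7),  # a separate object
]
kept = greedy_nms(detections)
print(kept)
```

The near-duplicate overlaps the best box with an IoU of about 0.82, so it is suppressed, while the third box does not overlap it at all and survives.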

In the tutorial repo we can find the code and an .ipynb file (Jupyter Notebook) for detections.

Thanks for reading. Please let me know if you liked it by clapping and/or sharing it! :)