Count people in webcam using pre-trained YOLOv3

Learn how to use instance segmentation (YOLOv3) to count the number of people using its pre-trained weights with TensorFlow and OpenCV in python.

Vardan Agarwal

Published in

Analytics Vidhya

10 min readMay 6, 2020

Requirements
Introduction

Instance Segmentation vs Semantic Segmentation
YOLOv3 vs Faster RCNN vs SSD
YOLOv3
Anchor boxes
Non-maximal suppression

Code

Requirements

For this project, we require Tensorflow, OpenCV, and wget-python (to download YOLOv3 weights. You can manually download them as well.)

Using pip:

pip install tensorflow-gpu # this is the gpu version
pip install tensorflow # if you don't have gpu like me 😥
pip install opencv-python
pip install wget

If you are using anaconda then using conda:

conda install -c anaconda tensorflow-gpu # this is the gpu version
conda install -c conda-forge tensorflow # if you don't have gpu version
conda install -c conda-forge opencv
conda install -c conda-forge python-wget

Introduction

Here I will discuss the basic terminologies related to YOLOv3 and instance segmentation in brief and provide additional reading resources. If you know about them and want to skip them feel free to move to the next section.

Instance Segmentation VS Semantic Segmentation

In semantic segmentation, the images are segmented based on various labels like a person, dog, cat, etc but there is no way to distinguish between two person objects. This drawback is solved in instance segmentation where along with distinguishing between different labels we are also able to distinguish between multiple objects of that label.

Semantic segmentation (left) vs Instance segmentation (right) if segmented by the number of sides.

YOLOv3 vs Faster RCNN vs SSD

Which one to choose and why?

Want the best accuracy? Faster RCNN

Want the fastest speed? YOLOv3

Want a tradeoff between the two? SSD

I was trying to make a real-time implementation through the webcam on CPU (LOL) so I chose YOLOv3. I tried tiny YOLO as well but its predictions were not accurate so I dropped it. You can choose whichever suits you best.

Additional reading for in-depth understandings of these models

Faster RCNN — https://towardsdatascience.com/faster-r-cnn-for-object-detection-a-technical-summary-474c5b857b46

SSD — https://towardsdatascience.com/understanding-ssd-multibox-real-time-object-detection-in-deep-learning-495ef744fab

Comparison between them — https://medium.com/@jonathan_hui/object-detection-speed-and-accuracy-comparison-faster-r-cnn-r-fcn-ssd-and-yolo-5425656ae359

YOLOv3

YOLOv3 pre-trained model can be used to classify 80 objects and is super fast and nearly as accurate as SSD. It has 53 convolutional layers with each of them followed by a batch normalization layer and a leaky RELU activation. To downsample, instead of using pooling they have used a stride of 2 in convolutional layers. The input format for it is that the images should be in RGB (so if using OpenCV remember to convert), having an input type of float32, and can have dimensions 320x320 or 416x416 or 608x608.

Additional reading: https://towardsdatascience.com/yolo-v3-object-detection-53fb7d3bfe6b

Anchor boxes

Anchor boxes help the model to specialize better. Take an example of a standing person and a car. A person would require a tall box while the car will require a fat box. How will our model know this? This is achieved through different size anchor boxes. This means that all objects will have more than one bounding box. To decide which bounding box is kept non-maximal suppression is used. Anchor boxes also help when different objects have their centers at the same place in predicting both the objects.

Additional reading: https://medium.com/@andersasac/anchor-boxes-the-key-to-quality-object-detection-ddf9d612d4f9

Non-Maximal Suppression

Non-maximal suppression or NMS uses IOU to work. Intersection over Union (IOU) as the name suggests is the ration between intersection and union of two boxes. The box with the highest probability of detection is selected. Then all the boxes which have a high IOU with that box are removed.

Additional reading: https://medium.com/@sarangzambare/object-detection-using-non-max-supression-over-yolov2-382a90212b51

Code

The first task is to download the pre-trained weights which can be done by using wget.

import wgeturl = 'https://pjreddie.com/media/files/yolov3.weights'
yolov3 = wget.download(url, out='yolov3.weights')

Running this code in your working directory will save the weights there.

Now before showing and explaining the code, I would like to give credit to this Github repository of YOLO implementation in Tensorflow 2 as I have mostly copied from it.

Make the necessary imports

import tensorflow as tf
import numpy as np
import cv2
from tensorflow.keras import Model
from tensorflow.keras.layers import (
    Add,
    Concatenate,
    Conv2D,
    Input,
    Lambda,
    LeakyReLU,
    UpSampling2D,
    ZeroPadding2D,
    BatchNormalization
)
from tensorflow.keras.regularizers import l2

Load the darknet weights and assign those weights to the layers of the model. Create a function to define the convolutional layers and whether to apply batch normalization on it or not.

def load_darknet_weights(model, weights_file):
    '''
    Helper function used to load darknet weights.
    
    :param model: Object of the Yolo v3 model
    :param weights_file: Path to the file with Yolo V3 weights
    '''
    
    #Open the weights file
    wf = open(weights_file, 'rb')
    major, minor, revision, seen, _ = np.fromfile(wf, dtype=np.int32, count=5)#Define names of the Yolo layers (just for a reference)    
    layers = ['yolo_darknet',
            'yolo_conv_0',
            'yolo_output_0',
            'yolo_conv_1',
            'yolo_output_1',
            'yolo_conv_2',
            'yolo_output_2']for layer_name in layers:
        sub_model = model.get_layer(layer_name)
        for i, layer in enumerate(sub_model.layers):
          
            
            if not layer.name.startswith('conv2d'):
                continue
                
            #Handles the special, custom Batch normalization layer
            batch_norm = None
            if i + 1 < len(sub_model.layers) and \
                    sub_model.layers[i + 1].name.startswith('batch_norm'):
                batch_norm = sub_model.layers[i + 1]filters = layer.filters
            size = layer.kernel_size[0]
            in_dim = layer.input_shape[-1]if batch_norm is None:
                conv_bias = np.fromfile(wf, dtype=np.float32, count=filters)
            else:
                # darknet [beta, gamma, mean, variance]
                bn_weights = np.fromfile(
                    wf, dtype=np.float32, count=4 * filters)
                # tf [gamma, beta, mean, variance]
                bn_weights = bn_weights.reshape((4, filters))[[1, 0, 2, 3]]# darknet shape (out_dim, in_dim, height, width)
            conv_shape = (filters, in_dim, size, size)
            conv_weights = np.fromfile(
                wf, dtype=np.float32, count=np.product(conv_shape))
            # tf shape (height, width, in_dim, out_dim)
            conv_weights = conv_weights.reshape(
                conv_shape).transpose([2, 3, 1, 0])if batch_norm is None:
                layer.set_weights([conv_weights, conv_bias])
            else:
                layer.set_weights([conv_weights])
                batch_norm.set_weights(bn_weights)assert len(wf.read()) == 0, 'failed to read all data'
    wf.close()def DarknetConv(x, filters, kernel_size, strides=1, batch_norm=True):
    '''
    Call this function to define a single Darknet convolutional layer
    
    :param x: inputs
    :param filters: number of filters in the convolutional layer
    :param kernel_size: Size of kernel in the Conv layer
    :param strides: Conv layer strides
    :param batch_norm: Whether or not to use the custom batch norm layer.
    '''
    #Image padding
    if strides == 1:
        padding = 'same'
    else:
        x = ZeroPadding2D(((1, 0), (1, 0)))(x)  # top left half-padding
        padding = 'valid'
        
    #Defining the Conv layer
    x = Conv2D(filters=filters, kernel_size=kernel_size,
               strides=strides, padding=padding,
               use_bias=not batch_norm, kernel_regularizer=l2(0.0005))(x)
    
    if batch_norm:
        x = BatchNormalization()(x)
        x = LeakyReLU(alpha=0.1)(x)
    return x

Create functions to define DarkNet Residual layer and DarkNet Block which will you the convolutional layers created above followed by a function to use them and create the whole DarkNet model.

def DarknetResidual(x, filters):
    '''
    Call this function to define a single DarkNet Residual layer
    
    :param x: inputs
    :param filters: number of filters in each Conv layer.
    '''
    prev = x
    x = DarknetConv(x, filters // 2, 1)
    x = DarknetConv(x, filters, 3)
    x = Add()([prev, x])
    return x
  
  
def DarknetBlock(x, filters, blocks):
    '''
    Call this function to define a single DarkNet Block (made of multiple Residual layers)
    
    :param x: inputs
    :param filters: number of filters in each Residual layer
    :param blocks: number of Residual layers in the block
    '''
    x = DarknetConv(x, filters, 3, strides=2)
    for _ in range(blocks):
        x = DarknetResidual(x, filters)
    return xdef Darknet(name=None):
    '''
    The main function that creates the whole DarkNet.
    '''
    x = inputs = Input([None, None, 3])
    x = DarknetConv(x, 32, 3)
    x = DarknetBlock(x, 64, 1)
    x = DarknetBlock(x, 128, 2)  # skip connection
    x = x_36 = DarknetBlock(x, 256, 8)  # skip connection
    x = x_61 = DarknetBlock(x, 512, 8)
    x = DarknetBlock(x, 1024, 4)
    return tf.keras.Model(inputs, (x_36, x_61, x), name=name)

Now we need to create helper functions for our YOLOv3 model to define YOLOv3 convolutional layers, the output of our YOLO model, drawing the outputs, creating bounding boxes from predictions, and a function for non-maximal suppression. We also need to define our anchor boxes and finally create a function which combines all of it to produce our model.

def draw_outputs(img, outputs, class_names):
    '''
    Helper, util, function that draws predictons on the image.
    
    :param img: Loaded image
    :param outputs: YoloV3 predictions
    :param class_names: list of all class names found in the dataset
    '''
    boxes, objectness, classes, nums = outputs
    boxes, objectness, classes, nums = boxes[0], objectness[0], classes[0], nums[0]
    wh = np.flip(img.shape[0:2])
    for i in range(nums):
        x1y1 = tuple((np.array(boxes[i][0:2]) * wh).astype(np.int32))
        x2y2 = tuple((np.array(boxes[i][2:4]) * wh).astype(np.int32))
        img = cv2.rectangle(img, x1y1, x2y2, (255, 0, 0), 2)
        img = cv2.putText(img, '{} {:.4f}'.format(
            class_names[int(classes[i])], objectness[i]),
            x1y1, cv2.FONT_HERSHEY_COMPLEX_SMALL, 1, (0, 0, 255), 2)
    return imgyolo_anchors = np.array([(10, 13), (16, 30), (33, 23), (30, 61), (62, 45),
                         (59, 119), (116, 90), (156, 198), (373, 326)],
                        np.float32) / 416yolo_anchor_masks = np.array([[6, 7, 8], [3, 4, 5], [0, 1, 2]])
def YoloConv(filters, name=None):
    '''
    Call this function to define the Yolo Conv layer.
    
    :param flters: number of filters for the conv layer
    :param name: name of the layer
    '''
    def yolo_conv(x_in):
        if isinstance(x_in, tuple):
            inputs = Input(x_in[0].shape[1:]), Input(x_in[1].shape[1:])
            x, x_skip = inputs# concat with skip connection
            x = DarknetConv(x, filters, 1)
            x = UpSampling2D(2)(x)
            x = Concatenate()([x, x_skip])
        else:
            x = inputs = Input(x_in.shape[1:])x = DarknetConv(x, filters, 1)
        x = DarknetConv(x, filters * 2, 3)
        x = DarknetConv(x, filters, 1)
        x = DarknetConv(x, filters * 2, 3)
        x = DarknetConv(x, filters, 1)
        return Model(inputs, x, name=name)(x_in)
    return yolo_convdef YoloOutput(filters, anchors, classes, name=None):
    '''
    This function defines outputs for the Yolo V3. (Creates output projections)
     
    :param filters: number of filters for the conv layer
    :param anchors: anchors
    :param classes: list of classes in a dataset
    :param name: name of the layer
    '''
    def yolo_output(x_in):
        x = inputs = Input(x_in.shape[1:])
        x = DarknetConv(x, filters * 2, 3)
        x = DarknetConv(x, anchors * (classes + 5), 1, batch_norm=False)
        x = Lambda(lambda x: tf.reshape(x, (-1, tf.shape(x)[1], tf.shape(x)[2],
                                            anchors, classes + 5)))(x)
        return tf.keras.Model(inputs, x, name=name)(x_in)
    return yolo_outputdef yolo_boxes(pred, anchors, classes):
    '''
    Call this function to get bounding boxes from network predictions
    
    :param pred: Yolo predictions
    :param anchors: anchors
    :param classes: List of classes from the dataset
    '''
    
    # pred: (batch_size, grid, grid, anchors, (x, y, w, h, obj, ...classes))
    grid_size = tf.shape(pred)[1]
    #Extract box coortinates from prediction vectors
    box_xy, box_wh, objectness, class_probs = tf.split(
        pred, (2, 2, 1, classes), axis=-1)#Normalize coortinates
    box_xy = tf.sigmoid(box_xy)
    objectness = tf.sigmoid(objectness)
    class_probs = tf.sigmoid(class_probs)
    pred_box = tf.concat((box_xy, box_wh), axis=-1)  # original xywh for loss# !!! grid[x][y] == (y, x)
    grid = tf.meshgrid(tf.range(grid_size), tf.range(grid_size))
    grid = tf.expand_dims(tf.stack(grid, axis=-1), axis=2)  # [gx, gy, 1, 2]box_xy = (box_xy + tf.cast(grid, tf.float32)) / \
        tf.cast(grid_size, tf.float32)
    box_wh = tf.exp(box_wh) * anchorsbox_x1y1 = box_xy - box_wh / 2
    box_x2y2 = box_xy + box_wh / 2
    bbox = tf.concat([box_x1y1, box_x2y2], axis=-1)return bbox, objectness, class_probs, pred_boxdef yolo_nms(outputs, anchors, masks, classes):
    # boxes, conf, type
    b, c, t = [], [], []for o in outputs:
        b.append(tf.reshape(o[0], (tf.shape(o[0])[0], -1, tf.shape(o[0])[-1])))
        c.append(tf.reshape(o[1], (tf.shape(o[1])[0], -1, tf.shape(o[1])[-1])))
        t.append(tf.reshape(o[2], (tf.shape(o[2])[0], -1, tf.shape(o[2])[-1])))bbox = tf.concat(b, axis=1)
    confidence = tf.concat(c, axis=1)
    class_probs = tf.concat(t, axis=1)scores = confidence * class_probs
    boxes, scores, classes, valid_detections = tf.image.combined_non_max_suppression(
        boxes=tf.reshape(bbox, (tf.shape(bbox)[0], -1, 1, 4)),
        scores=tf.reshape(
        scores, (tf.shape(scores)[0], -1, tf.shape(scores)[-1])),
        max_output_size_per_class=100,
        max_total_size=100,
        iou_threshold=0.5,
        score_threshold=0.6
    )return boxes, scores, classes, valid_detectionsdef YoloV3(size=None, channels=3, anchors=yolo_anchors,
           masks=yolo_anchor_masks, classes=80):
  
    x = inputs = Input([size, size, channels], name='input')x_36, x_61, x = Darknet(name='yolo_darknet')(x)x = YoloConv(512, name='yolo_conv_0')(x)
    output_0 = YoloOutput(512, len(masks[0]), classes, name='yolo_output_0')(x)x = YoloConv(256, name='yolo_conv_1')((x, x_61))
    output_1 = YoloOutput(256, len(masks[1]), classes, name='yolo_output_1')(x)x = YoloConv(128, name='yolo_conv_2')((x, x_36))
    output_2 = YoloOutput(128, len(masks[2]), classes, name='yolo_output_2')(x)boxes_0 = Lambda(lambda x: yolo_boxes(x, anchors[masks[0]], classes),
                     name='yolo_boxes_0')(output_0)
    boxes_1 = Lambda(lambda x: yolo_boxes(x, anchors[masks[1]], classes),
                     name='yolo_boxes_1')(output_1)
    boxes_2 = Lambda(lambda x: yolo_boxes(x, anchors[masks[2]], classes),
                     name='yolo_boxes_2')(output_2)outputs = Lambda(lambda x: yolo_nms(x, anchors, masks, classes),
                     name='yolo_nms')((boxes_0[:3], boxes_1[:3], boxes_2[:3]))return Model(inputs, outputs, name='yolov3')

Now with all the functions defined its time to create a YOLOv3 model which can be done using:

yolo = YoloV3()
load_darknet_weights(yolo, 'yolov3.weights')

Start the webcam and make each frame ready for prediction. As we are using OpenCV we need to convert our images to RGB, resize them to 320x320 or 416x416 or 608x608, convert their datatype to float32, expand them to four dimensions and divide them by 255.

cap = cv2.VideoCapture(0)
while(True):
    ret, image = cap.read()
    if ret==True:
        img = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
        img = cv2.resize(img, (320, 320))
        img = img.astype(np.float32)
        img = np.expand_dims(img, 0)
        img = img / 255

Use yolo to get the predictions on the image. Load the class names files containing all the object names for which the model was trained. You can find it here. You can draw outputs to check if the model is working or not.

boxes, scores, classes, nums = yolo(img)
class_names = [c.strip() for c in open("classes.txt").readlines()]
image = draw_outputs(image, (boxes, scores, classes, nums), class_names)
cv2.imshow('Prediction', image)
if cv2.waitKey(1) & 0xFF == ord('q'):
    break

Now to count persons or anything present in the classes.txt we need to know its index in it. The index of person is 0 so we need to check if the class predicted is zero then we increment a counter.

count=0
for i in range(nums[0]):
    if int(classes[0][i] == 0):
        count +=1print('Number of people:', count)

You can find the complete code on my Github repo.