Real-Time Object Detection with OpenVINO™

Adrian Boguszewski
Published in OpenVINO-toolkit
Jun 30, 2022

The ability to detect objects in real time gives businesses the opportunity to innovate and move faster than ever before. It’s spotting errors on the manufacturing floor. It’s detecting issues in medical images and scans. It’s analyzing customer traffic, dwell time, and product placement in retail stores. It can even be used to detect and respond to incidents in a public space. But not everyone has the expertise or knowledge to perform object detection properly.

To help AI developers create these types of experiences, Intel® has provided a series of tutorials that empower them to apply these types of AI models to applications.

In this guide, I will focus on implementing live object detection using OpenVINO™. AI developers can run the workload in real-time on a computer with a webcam or upload a video that can run the inference workload.

Step 1: Download the model

First, we must download the model, which can be done with the code below. To do so, we will use omz_downloader, a command-line tool from the openvino-dev package. For this demo, we will use the SSDLite MobileNetV2 model from Open Model Zoo. If you would like to use a different model, simply change the model name.

# directory where the model will be downloaded
base_model_dir = "model"
# model name as named in Open Model Zoo
model_name = "ssdlite_mobilenet_v2"

download_command = f"omz_downloader " \
                   f"--name {model_name} " \
                   f"--output_dir {base_model_dir} " \
                   f"--cache_dir {base_model_dir}"
! $download_command
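
If you are not sure which name to use, the same omz_downloader tool can also list every model available in Open Model Zoo. The --print_all flag below is part of the tool's standard options, though its exact output may vary between openvino-dev releases:

# list the names of all models available in Open Model Zoo
! omz_downloader --print_all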

Step 2: Convert the model

Since we are using a model from the public directory of Open Model Zoo, we will need to convert it to OpenVINO format (an Intermediate Representation, or IR, file) using omz_converter, the Model Converter command-line tool from the openvino-dev package. Developers can specify the precision to tell the tool exactly which IR files they need. In this case, we use half-precision, or FP16. If you do not specify a precision, the Model Converter will create IR files for every available precision.

Different models may have different levels of precision available. This processing step takes approximately two minutes per level of precision.

import os

precision = "FP16"
# output path for the converted model
converted_model_path = f"model/public/{model_name}/{precision}/{model_name}.xml"

if not os.path.exists(converted_model_path):
    convert_command = f"omz_converter " \
                      f"--name {model_name} " \
                      f"--download_dir {base_model_dir} " \
                      f"--precisions {precision}"
    ! $convert_command
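
If you are unsure which precisions a particular model ships with, the openvino-dev package also provides omz_info_dumper, which prints a model's metadata, including its available precisions, as JSON. A minimal sketch, assuming the tool is on your path:

# print the model's metadata, including the list of available precisions
! omz_info_dumper --name $model_name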

Step 3: Load the model

Next, we load the model.

The following code initializes OpenVINO Runtime, reads the network and its corresponding weights, and compiles the model for the user’s device of choice. The network topology is read from the .xml file and the weights from the .bin file. If the device is set to “AUTO,” the engine will choose which device to target based on workload characteristics.

from openvino.runtime import Core

# initialize OpenVINO Runtime
ie_core = Core()
# read the network and corresponding weights from the .xml and .bin files
model = ie_core.read_model(model=converted_model_path)
# compile the model for the CPU (you can choose manually CPU, GPU, MYRIAD etc.)
# or let the engine choose the best available device (AUTO)
compiled_model = ie_core.compile_model(model=model, device_name="CPU")
# get input and output nodes
input_layer = compiled_model.input(0)
output_layer = compiled_model.output(0)
# get the input size
height, width = list(input_layer.shape)[1:3]
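
The snippet above targets the CPU explicitly. If you prefer to let the engine choose the best available device, as mentioned earlier, the same call can simply be pointed at "AUTO":

# let OpenVINO choose the best available device based on workload characteristics
compiled_model = ie_core.compile_model(model=model, device_name="AUTO")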

Step 4: Process the results

Now, it’s time to process the results.

First, we define the classes of objects the application can detect. In this case, we will differentiate between a wide range of common objects such as people, airplanes, buses, doors, and couches (see the code below). We also define the color each object is outlined with when detected. If multiple objects are detected at once, the application will differentiate them and surround each with the appropriate class color (Figure 1).

import cv2
import numpy as np

# https://tech.amikelive.com/node-718/what-object-categories-labels-are-in-coco-dataset/
classes = [
    "background", "person", "bicycle", "car", "motorcycle",
    "airplane", "bus", "train", "truck", "boat", "traffic light",
    "fire hydrant", "street sign", "stop sign", "parking meter",
    "bench", "bird", "cat", "dog", "horse", "sheep", "cow",
    "elephant", "bear", "zebra", "giraffe", "hat", "backpack",
    "umbrella", "shoe", "eye glasses", "handbag", "tie", "suitcase",
    "frisbee", "skis", "snowboard", "sports ball", "kite",
    "baseball bat", "baseball glove", "skateboard", "surfboard",
    "tennis racket", "bottle", "plate", "wine glass", "cup", "fork",
    "knife", "spoon", "bowl", "banana", "apple", "sandwich",
    "orange", "broccoli", "carrot", "hot dog", "pizza", "donut",
    "cake", "chair", "couch", "potted plant", "bed", "mirror",
    "dining table", "window", "desk", "toilet", "door", "tv",
    "laptop", "mouse", "remote", "keyboard", "cell phone",
    "microwave", "oven", "toaster", "sink", "refrigerator",
    "blender", "book", "clock", "vase", "scissors", "teddy bear",
    "hair drier", "toothbrush", "hair brush"
]

# colors for the above classes (Rainbow Color Map)
colors = cv2.applyColorMap(
    src=np.arange(0, 255, 255 / len(classes), dtype=np.float32).astype(np.uint8),
    colormap=cv2.COLORMAP_RAINBOW
).squeeze()


def process_results(frame, results, thresh=0.6):
    # size of the original frame
    h, w = frame.shape[:2]
    # results is a tensor [1, 1, 100, 7]
    results = results.squeeze()
    boxes = []
    labels = []
    scores = []
    for _, label, score, xmin, ymin, xmax, ymax in results:
        # create a box with pixel coordinates from the box with normalized coordinates [0,1]
        boxes.append(tuple(map(int, (xmin * w, ymin * h, xmax * w, ymax * h))))
        labels.append(int(label))
        scores.append(float(score))

    # apply non-maximum suppression to get rid of many overlapping entities
    # see https://paperswithcode.com/method/non-maximum-suppression
    # this algorithm returns indices of objects to keep
    indices = cv2.dnn.NMSBoxes(bboxes=boxes, scores=scores, score_threshold=thresh, nms_threshold=0.6)

    # if there are no boxes
    if len(indices) == 0:
        return []

    # filter detected objects
    return [(labels[idx], scores[idx], boxes[idx]) for idx in indices.flatten()]


def draw_boxes(frame, boxes):
    for label, score, box in boxes:
        # choose the color for the label
        color = tuple(map(int, colors[label]))
        # draw the box
        cv2.rectangle(img=frame, pt1=box[:2], pt2=box[2:], color=color, thickness=3)
        # draw the label name inside the box
        cv2.putText(img=frame, text=f"{classes[label]} {score:.2f}", org=(box[0] + 10, box[1] + 30),
                    fontFace=cv2.FONT_HERSHEY_COMPLEX, fontScale=frame.shape[1] / 1000,
                    color=color, thickness=1, lineType=cv2.LINE_AA)
    return frame
Figure 1. The real-time object detection model detects a keyboard, cell phone, and person, which are labeled and outlined in their corresponding class and color. (Source: GitHub)

Step 5: Object detection

Finally, the main processing function handles the actual object detection. The code snippet is a bit too large to include here in full, but the logic is straightforward. The OpenVINO sample creates a video player instance with a target frame rate (30 fps by default) and resizes the input to 1280x720 to boost overall performance. The program draws a box around each object it detects, with the box color depending on the type of object, and displays the detection confidence alongside the object’s name.

There is also the non-maximum suppression step, which selects a single bounding box out of a group of overlapping candidates, keeping the one most likely to match the actual target. In the image above, the AI model correctly determines that the person and the keyboard are two different objects, despite the fact that they partly overlap.
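
For readers who want a feel for the shape of that loop without opening the notebook, here is a heavily simplified sketch. It assumes OpenCV can open your webcam as source 0 and reuses the objects defined above (compiled_model, output_layer, process_results, draw_boxes); the notebook’s full implementation additionally wraps this in a helper that handles the video player and target frame rate described above.

# a minimal sketch of the main loop (not the notebook's full implementation)
# use a video file path instead of 0 to run on a recorded video
cap = cv2.VideoCapture(0)

while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        break

    # resize the frame to the model's input size and add a batch dimension
    input_img = cv2.resize(src=frame, dsize=(width, height))
    input_img = input_img[np.newaxis, ...]

    # run inference and post-process the raw detections
    results = compiled_model([input_img])[output_layer]
    boxes = process_results(frame=frame, results=results)

    # draw the boxes with class names and confidences, then show the frame
    frame = draw_boxes(frame=frame, boxes=boxes)
    cv2.imshow("Object Detection", frame)

    # press ESC to stop
    if cv2.waitKey(1) == 27:
        break

cap.release()
cv2.destroyAllWindows()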

I hope that this blog post and accompanying code sample have shown you that object detection is more approachable than you thought.
