Detecting Objects in Video or Camera Images using ImageAI

Rahul Kapoor
Published in Analytics Vidhya
May 22, 2021

Object detection in a video means locating the objects that appear in it, assigning each one to a class known to our deep learning model, and drawing a bounding box around it.

Put simply, the input is a video or image containing multiple objects, and the output is the same video or image with a bounding box (of a certain height and width) around each object, plus the class name and the probability that the object belongs to that class.
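As an illustration of that output, each detected object can be thought of as a small record holding a class name, a probability, and the corners of its box. The sketch below uses a hypothetical Detection class with made-up values. It is not ImageAI's own data structure, only a way to picture what the model produces for one frame.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    """Hypothetical record for one detected object (illustration only)."""
    name: str           # class label, e.g. "person"
    probability: float  # confidence as a percentage, 0 to 100
    box: tuple          # (x1, y1, x2, y2) pixel corners of the bounding box

    @property
    def width(self) -> int:
        return self.box[2] - self.box[0]

    @property
    def height(self) -> int:
        return self.box[3] - self.box[1]


# Made-up example values for a single frame
detections = [
    Detection("person", 97.5, (40, 60, 120, 300)),
    Detection("car", 88.0, (200, 150, 480, 320)),
]

for d in detections:
    print(f"{d.name}: {d.probability:.1f}% box {d.width}x{d.height} px")
```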

Here we will use a pre-trained YOLO (You Only Look Once) model, which was trained on a large dataset covering 80 classes of objects, over a long time and on high-powered hardware. This article will not go deep into the YOLO architecture; it focuses on using the ImageAI library to run object detection on a video and get the results.

To look further into the inner workings of YOLO you can always refer to this great article: Link to Article

Building these models from scratch takes a solid understanding of the mathematics and architecture behind them, along with thousands of lines of code. Instead, we can take advantage of transfer learning and reuse the trained weights of someone who was kind enough to train a model on very expensive hardware and make it public.

This is where ImageAI comes in: it makes it possible for anyone with basic knowledge of Python to build applications and systems that detect objects in videos using only a few lines of code.

Before we start using ImageAI, we have to install a few dependencies. I am using a Jupyter Notebook to run the commands below:

1. TensorFlow

!pip3 install tensorflow==2.4.0

2. Other Dependencies

!pip install keras==2.4.3 numpy==1.19.3 pillow==7.0.0 scipy==1.4.1 h5py==2.10.0 matplotlib==3.3.2 opencv-python keras-resnet==0.2.0

3. ImageAI

!pip install imageai --upgrade
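After the installs finish, it can be worth confirming which versions actually ended up in the environment, since pinned versions sometimes get replaced by a later install. The following is a minimal sketch using only the Python standard library; installed_versions is a hypothetical helper name, and the package list is just a subset of the ones installed above.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names as given to pip in the commands above
PACKAGES = ["tensorflow", "keras", "numpy", "opencv-python", "imageai"]


def installed_versions(packages):
    """Return {package: version string, or None if it is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions(PACKAGES).items():
    print(f"{name}: {ver or 'NOT INSTALLED'}")
```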

Now that all the required tools are installed, we need a pre-trained YOLOv3 model. You can get one from here: Link to model
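A missing or partially downloaded model file is an easy mistake to make at this step. The sketch below is a plain-Python sanity check, assuming the model was saved as yolo.h5 in the current working directory (the file name the detection code in this article loads). The check_model_file helper is hypothetical and not part of ImageAI.

```python
import os


def check_model_file(model_path):
    """Return the file size in MB, or raise if the file is missing or empty."""
    if not os.path.isfile(model_path):
        raise FileNotFoundError(f"Model file not found: {model_path}")
    size_bytes = os.path.getsize(model_path)
    if size_bytes == 0:
        raise ValueError(f"Model file is empty: {model_path}")
    return size_bytes / (1024 * 1024)


# Assumed location: "yolo.h5" next to the notebook
model_path = os.path.join(os.getcwd(), "yolo.h5")
try:
    print(f"Found model, {check_model_file(model_path):.1f} MB")
except (FileNotFoundError, ValueError) as err:
    print(err)
```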

This model is already trained to detect the 80 classes of objects listed below:

person, bicycle, car, motorcycle, airplane,
bus, train, truck, boat, traffic light, fire hydrant, stop_sign,
parking meter, bench, bird, cat, dog, horse, sheep, cow, elephant, bear, zebra,
giraffe, backpack, umbrella, handbag, tie, suitcase, frisbee, skis, snowboard,
sports ball, kite, baseball bat, baseball glove, skateboard, surfboard, tennis racket,
bottle, wine glass, cup, fork, knife, spoon, bowl, banana, apple, sandwich, orange,
broccoli, carrot, hot dog, pizza, donot, cake, chair, couch, potted plant, bed,
dining table, toilet, tv, laptop, mouse, remote, keyboard, cell phone, microwave,
oven, toaster, sink, refrigerator, book, clock, vase, scissors, teddy bear, hair dryer,
toothbrush.
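Before picking a test video, it can help to check whether the objects you care about are in this list. The sketch below is a plain-Python check, not an ImageAI call. split_supported is a hypothetical helper, the names are copied exactly as they are spelled in the list above, and "drone" is only an example of a class that is not in the list.

```python
# Class names copied from the list above, spelled exactly as listed
SUPPORTED_CLASSES = (
    "person", "bicycle", "car", "motorcycle", "airplane",
    "bus", "train", "truck", "boat", "traffic light", "fire hydrant",
    "stop_sign", "parking meter", "bench", "bird", "cat", "dog", "horse",
    "sheep", "cow", "elephant", "bear", "zebra", "giraffe", "backpack",
    "umbrella", "handbag", "tie", "suitcase", "frisbee", "skis",
    "snowboard", "sports ball", "kite", "baseball bat", "baseball glove",
    "skateboard", "surfboard", "tennis racket", "bottle", "wine glass",
    "cup", "fork", "knife", "spoon", "bowl", "banana", "apple",
    "sandwich", "orange", "broccoli", "carrot", "hot dog", "pizza",
    "donot", "cake", "chair", "couch", "potted plant", "bed",
    "dining table", "toilet", "tv", "laptop", "mouse", "remote",
    "keyboard", "cell phone", "microwave", "oven", "toaster", "sink",
    "refrigerator", "book", "clock", "vase", "scissors", "teddy bear",
    "hair dryer", "toothbrush",
)


def split_supported(requested):
    """Split requested class names into (supported, unsupported) lists."""
    supported = [n for n in requested if n in SUPPORTED_CLASSES]
    unsupported = [n for n in requested if n not in SUPPORTED_CLASSES]
    return supported, unsupported


ok, missing = split_supported(["person", "car", "drone"])
print("supported:", ok)
print("unsupported:", missing)
```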

You can download a video from YouTube that contains some of these objects, or capture one of your own.

I have used a street video showing a bunch of people and cars. Below is a screen capture from that video.

Image captured from my street video on which Object Detection will happen

Run the code below and it will start processing your video frame by frame.

If you have a good-enough NVIDIA GPU, you can use the tensorflow-gpu dependency, and detection should finish in less than a minute, depending on the size of your video. Otherwise, it may take a few minutes.
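To find out which case applies to you, you can ask TensorFlow whether it sees a GPU. This is a minimal sketch, assuming TensorFlow 2.x, where tf.config.list_physical_devices is available. available_gpus is a hypothetical helper name, and it returns an empty list if TensorFlow is not installed at all.

```python
def available_gpus():
    """Return the names of GPUs visible to TensorFlow (empty list if none)."""
    try:
        import tensorflow as tf
    except ImportError:
        # TensorFlow is not installed in this environment
        return []
    return [device.name for device in tf.config.list_physical_devices("GPU")]


gpus = available_gpus()
if gpus:
    print("TensorFlow can see:", gpus)
else:
    print("No GPU visible to TensorFlow; detection will run on the CPU.")
```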

Detection proceeds frame by frame, and the saved output video is updated after each frame is processed.

from imageai.Detection import VideoObjectDetection
import os

# Directory that holds the model file and the input video
path = os.getcwd()

# Set up the detector with the pre-trained YOLOv3 weights
det = VideoObjectDetection()
det.setModelTypeAsYOLOv3()
det.setModelPath(os.path.join(path, "yolo.h5"))
det.loadModel()

# Run detection on every frame and save the annotated video
video_path = det.detectObjectsFromVideo(
    input_file_path=os.path.join(path, "your_video_file.mp4"),
    output_file_path=os.path.join(path, "your_video_file_detected"),
    display_percentage_probability=False,
    frames_per_second=20,
    log_progress=True,
)
print(video_path)

Output:

Image captured from my processed video with bounding boxes and names around classes detected

The output shows the detected objects with bounding boxes and class names around them. For simplicity, I set display_percentage_probability = False, so the probabilities are not drawn on the video.
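If you want numbers as well as the annotated video, ImageAI's documentation describes an optional per_frame_function argument to detectObjectsFromVideo. The sketch below assumes that callback receives the frame number, a list of detections, and a dictionary of per-class counts; check the documentation for your ImageAI version before relying on that shape. on_frame and totals are hypothetical names, and the two calls at the end use made-up counts in place of a real video.

```python
from collections import Counter

# Running total of detections per class across all processed frames
totals = Counter()


def on_frame(frame_number, output_array, output_count):
    """Assumed callback shape: frame index, detections, per-class counts."""
    totals.update(output_count)
    # Print a short progress line every 20 frames
    if frame_number % 20 == 0:
        print(f"frame {frame_number}: {dict(output_count)}")


# Assumed usage with the detector from the code above (not run here):
# det.detectObjectsFromVideo(..., per_frame_function=on_frame)

# Made-up stand-in data to show what the callback accumulates
on_frame(1, [], {"person": 3, "car": 2})
on_frame(2, [], {"person": 4, "car": 2})
print(totals.most_common())
```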

If you learned something from this article, give it a clap. You can reach me in the comments section if you have any questions or suggestions.
