Day 102(DL) — Implementing an object Tracker for German Shepherd — part1
Now that we’ve gathered the theoretical knowledge on object detection, tracking and evaluation criteria, it’s time to apply those learnings by implementing a simple object tracker that follows a German Shepherd. This can be treated as a small mini-project, split into multiple steps. We’ll go through each step in detail so that the learnings can be extrapolated to any custom video (vehicle tracking, monitoring animal behaviour, etc.). The main focus of this article is setting up the data processing pipeline.
Requirement: To track German Shepherd
Entire Process flow:
- Data Collection
- Data Processing
- Object Detection Model building (YOLOv5)
- Object tracking
Step 1 (Data Collection): We need a bunch of videos for training, validation and testing. Let’s navigate to https://www.pexels.com/search/videos/german%20shepherd/ (free to use and download) and assemble the videos. If you’d like to try with your own videos, the same steps can be followed. We need variation across the videos in terms of lighting and other environmental factors. At the same time, the dog should look reasonably similar across the videos.
If we train the model only on puppies, it will detect puppies but not adult dogs. For our scenario, let’s download some videos of adult dogs (10 videos, say; we can expand if needed). For the processing steps in this post, we’ll use one particular video: https://www.pexels.com/video/a-dog-running-in-the-snow-covered-grass-4009906/.
Step 2 (Data Processing): The prerequisite for any object tracker is an object detection model (YOLO, SSD, Faster R-CNN, Detectron, EfficientDet). The inputs to an object detection model are images and labels (in the form of bounding boxes). Right now, though, we only have videos and no images. A video is essentially a sequence of image frames displayed at a fixed rate (frames per second), so we can recover individual images by extracting frames from the video.
- Converting video into frames: We’ll use a small Python script to perform the conversion. OpenCV can create image frames out of the video. Since each video results in ’n’ frames, the best practice is to create a separate folder per video where its frames get stored.
import os
import cv2

os.getcwd()
# 'C:\\Users\\91965\\Desktop\\mini_project'

# creating a folder called 'video1' under the main folder
directory = 'video1'
path = 'C:\\Users\\91965\\Desktop\\mini_project'
new_path = os.path.join(path, directory)
os.mkdir(new_path)
- After executing the above code, we can see that a new folder called ‘video1’ has been created. The next logical step is to create the image frames and place them in the newly created folder. First, we need to read the video file present in the parent directory.
# providing the name of the downloaded file
list1 = ['production ID_4009906.mp4']
count = 1
for i in list1:
    # reading the video from the parent directory
    vidcap = cv2.VideoCapture(i)
    count1 = 0
    success = True
- After capturing the video, let’s move into the subfolder ‘video1’ to store the image frames. The change-directory command (os.chdir) allows switching folders. Every frame of the video is extracted and stored in PNG format. For the video considered, we got around 208 frames. Note: for further processing, we don’t need all the frames, since two consecutive frames are almost identical. To build a more generalised model, we can skip some frames and keep only a small subset.
directory = 'video' + str(count)
parent_dir = 'C:\\Users\\91965\\Desktop\\mini_project'
path = os.path.join(parent_dir, directory)
# switching into the subfolder where the frames will be stored
os.chdir(path)
while success:
    ret, image = vidcap.read()
    if ret == False:
        # no more frames left to read
        success = False
        break
    cv2.imwrite(directory + "_frame%d.PNG" % count1, image)
    count1 += 1
- Bounding box creation using the CVAT tool: All the images are now ready for annotation. Let’s follow the annotation process using CVAT that we discussed earlier. Here we have only one class, i.e. the German Shepherd. We can load the extracted frames (a few can actually be skipped) into CVAT to create the bounding boxes.
We can download the bounding box coordinates in the YOLO format: one text file per image, with one line per box. The first value corresponds to the label (as we have only one class, it’s ‘0’). The subsequent values correspond to the x-centre, y-centre, width and height of the box (all normalised by the image dimensions). We can repeat the same process for all the videos and their respective frames.
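To sanity-check an annotation, the normalised YOLO values can be converted back into pixel corner coordinates. A small helper (the name `yolo_to_pixels` and the sample numbers are illustrative, not from an actual label file):

```python
def yolo_to_pixels(label_line, img_w, img_h):
    """Convert one YOLO label line ('class xc yc w h', normalised)
    to (class_id, x_min, y_min, x_max, y_max) in pixels."""
    cls, xc, yc, w, h = label_line.split()
    xc, w = float(xc) * img_w, float(w) * img_w
    yc, h = float(yc) * img_h, float(h) * img_h
    return int(cls), int(xc - w / 2), int(yc - h / 2), int(xc + w / 2), int(yc + h / 2)

# a box centred in a 640x480 frame, half the frame's width and height
print(yolo_to_pixels("0 0.5 0.5 0.5 0.5", 640, 480))  # (0, 160, 120, 480, 360)
```

Drawing these pixel boxes back onto a frame (e.g. with cv2.rectangle) is a quick way to catch mislabelled images before training.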
The code used in this article can be found in the GitHub repository.