Day 102 (DL) — Implementing an Object Tracker for a German Shepherd — Part 1

Nandhini N
Apr 23 · 4 min read

Now that we’ve gathered the theoretical knowledge on object detection, tracking, and the evaluation criteria, it’s time to apply those learnings by implementing a simple object tracker to track a German Shepherd. This can be treated as a small mini-project split into multiple steps. We’ll go through each step in detail so that the learnings can be extrapolated to any custom video (vehicle tracking, monitoring animal behaviour, etc.). The main focus of this article is setting up the data-processing pipeline.

Requirement: To track a German Shepherd

Entire Process flow:

  • Data Collection
  • Data Processing
  • Object detection model building (YOLOv5)
  • Object tracking
  • Deployment

Step 1 (Data Collection): We need a bunch of videos for training, validation, and testing. Let’s navigate to (free to use and download) and assemble the videos. If you’d like to try with your own videos, the same steps can be followed. We need variation across the videos in lighting and other environmental factors, while the dog should look similar across them.

If we train the model only on puppies, it will detect puppies but not adult dogs. For our scenario, let’s download some videos of dogs (10 videos, maybe; we can expand if needed). For the processing steps in this post, we’ll use one particular video.

Step 2 (Data Processing): The prerequisite for any object tracker is an object detection model (YOLO, SSD, Faster R-CNN, Detectron, EfficientDet). The inputs to an object detection model are images and labels (in the form of bounding boxes). Right now, however, we only have videos, not images. A video is essentially a sequence of image frames shown in quick succession; the more frames per second, the smoother the motion appears.
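As a rough illustration of the frames-per-second idea (the numbers below are made up for the example, not measured from the actual clip), the number of frames an extraction script will produce is simply the clip duration times its frame rate:

```python
# Hypothetical values for illustration; the real numbers depend on the clip.
fps = 25           # frames per second of the video
duration_s = 8.32  # clip length in seconds

# Total frames a frame-extraction loop would produce for this clip
frame_count = round(fps * duration_s)
print(frame_count)  # 208 for these example values
```

In OpenCV, the real values can be queried from a `cv2.VideoCapture` object instead of being hard-coded.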

  • Converting video into frames: We’ll use a small Python script to perform the conversion. OpenCV can create image frames out of the video. Since each video results in ’n’ frames, the best practice is to create a separate folder per video where its frames get stored.
import os

# creating a folder called 'video1' under the main project folder
directory = 'video1'
path = 'C:\\Users\\91965\\Desktop\\mini_project'
new_path = os.path.join(path, directory)
os.makedirs(new_path, exist_ok=True)
  • After executing the above code, we should see a new folder called ‘video1’ under the main folder. The next logical step is to create the image frames and place them in the newly created folder. First, we need to read the video file present in the parent directory.
import cv2

# providing the name of the downloaded file
list1 = ['production ID_4009906.mp4']
count = 1

for i in list1:
    # reading the video from the parent directory
    vidcap = cv2.VideoCapture(i)

    count1 = 0
    success = True
  • After capturing the video, let’s switch into the subfolder ‘video1’ to store the image frames. The change-directory command (os.chdir) allows switching folders. Every frame of the video is extracted and stored in PNG format; for the video considered, we got around 208 frames. Note: we don’t need all the frames for further processing, because two consecutive frames are almost identical, with only a meagre difference. To create a more generalised model, we can skip some frames and keep only a small count.
    # switch into the subfolder created for this video
    directory = 'video' + str(count)
    parent_dir = 'C:\\Users\\91965\\Desktop\\mini_project'
    path = os.path.join(parent_dir, directory)
    os.chdir(path)

    while success:
        # read the next frame; success becomes False once the video ends
        success, image = vidcap.read()
        if not success:
            break
        cv2.imwrite(directory + "_frame%d.PNG" % count1, image)
        count1 += 1
    count += 1
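The note above suggests skipping near-duplicate frames. A minimal sketch of that logic (the step size of 5 is an arbitrary choice for illustration, not from the article): keep only every Nth frame by checking the frame counter before writing.

```python
def should_keep(frame_idx, step=5):
    """Return True for every `step`-th frame (0, step, 2*step, ...)."""
    return frame_idx % step == 0

# Inside the extraction loop, write only the kept frames, e.g.:
#     if should_keep(count1):
#         cv2.imwrite(directory + "_frame%d.PNG" % count1, image)

# With step=5, the ~208 extracted frames reduce to 42 kept frames.
kept = sum(should_keep(i) for i in range(208))
print(kept)  # 42
```

A larger step gives fewer, more varied frames; a smaller one keeps more data at the cost of redundancy.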
  • Bounding box creation using the CVAT tool: All the images are now ready for annotation. Let’s follow the annotation process using CVAT that we discussed earlier. Here we have only one class, i.e. German Shepherd. We can load the extracted frames (a few can actually be skipped) into CVAT to create the bounding boxes.

We can download the bounding box coordinates in the YOLO format. The output looks something like the following:

The first value corresponds to the label (as we have only one class, it’s ‘0’). The subsequent values are the x-centre, y-centre, width, and height (all normalised to the image dimensions). We can repeat the same process for all the videos and their respective frames.
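To make the normalised YOLO values concrete, here is a small sketch (the label row and the 640x480 image size are made-up examples) that converts a YOLO-format row back into pixel coordinates:

```python
def yolo_to_pixels(line, img_w, img_h):
    """Convert a YOLO label row 'class x_center y_center width height'
    (normalised to [0, 1]) into (class_id, x_min, y_min, x_max, y_max) in pixels."""
    cls, xc, yc, w, h = line.split()
    xc, w = float(xc) * img_w, float(w) * img_w
    yc, h = float(yc) * img_h, float(h) * img_h
    return (int(cls),
            round(xc - w / 2), round(yc - h / 2),
            round(xc + w / 2), round(yc + h / 2))

# Hypothetical label row for a 640x480 frame
print(yolo_to_pixels("0 0.5 0.5 0.2 0.3", 640, 480))
# (0, 256, 168, 384, 312)
```

Going the other way (pixels to normalised YOLO values) is the same arithmetic in reverse, which is what CVAT does when exporting.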

The code used in the article can be found in the GitHub repository.


Nerd For Tech

NFT is an Educational Media House. Our mission is to bring the invaluable knowledge and experiences of experts from all over the world to the novice. To know more about us, visit Don’t forget to check out Ask-NFT, a mentorship ecosystem we’ve started


