Day 104 (DL): Implementing an Object Tracker for German Shepherd, Part 2

Nandhini N
Apr 25 · 3 min read

Citation: All the videos/images used in this post are taken from

In the Part 1 blog, we saw how to set up the data pipeline, i.e., converting the downloaded videos to frames and annotating the images (bounding boxes) using CVAT. Now we are all set to implement the object detection model.

Step 1: Gathered 13 videos. One point to take care of here: when we download the videos, there should be some similarity between the train, validation, and test set videos. Since an object detection model can only identify new data based on prior knowledge, choosing a completely different kind of video (different angles, backgrounds, and sizes) for the test set may not be effective.

Step 2: Converted all the videos to frames, resizing each frame in the process. Since these are high-quality videos with large width and height, using the original resolution would consume a lot of time, both while loading the images into CVAT and while setting up the folder structure for training. To keep things simple, I reduced the width and height to one-third of the original.
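
The extract-and-downscale step can be sketched with OpenCV as below. The video path, output directory, and the stride of 30 frames are illustrative assumptions, not values from the post; only the one-third downscaling matches the text.

```python
import os


def extract_frames(video_path, out_dir, shrink=3, stride=30):
    """Save every `stride`-th frame of `video_path`, shrunk by `shrink`x.

    Requires opencv-python; cv2 is imported inside the function so the
    size helper below stays usable without it.
    """
    import cv2  # assumption: opencv-python is installed

    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    idx = saved = 0
    while True:
        ok, frame = cap.read()
        if not ok:  # end of video
            break
        if idx % stride == 0:
            h, w = frame.shape[:2]
            small = cv2.resize(frame, (w // shrink, h // shrink))
            cv2.imwrite(os.path.join(out_dir, f"frame_{saved:04d}.jpg"), small)
            saved += 1
        idx += 1
    cap.release()
    return saved


def scaled_size(width, height, shrink=3):
    # Width and height reduced to one-third of the original, as in the post.
    return width // shrink, height // shrink


print(scaled_size(1920, 1080))  # a 1920x1080 frame becomes 640x360
```

So a full-HD frame shrinks to 640x360, which loads into CVAT far faster than the original.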

Step 3: Selected around 15 frames from each video, with a sufficient interval gap so that the selected frames carry some individual identity rather than being near-duplicates. Uploaded each image into CVAT for annotation.
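
Picking roughly 15 evenly spaced frames per video can be done in pure Python. The count of 15 comes from the post; the frame names and helper below are illustrative.

```python
def pick_spread(items, k=15):
    """Return up to k items spread evenly across the list, so that
    neighbouring picks are separated by a roughly constant gap."""
    if len(items) <= k:
        return list(items)
    step = (len(items) - 1) / (k - 1)
    return [items[round(i * step)] for i in range(k)]


# Example: 300 extracted frames, keep 15 with an even interval between them.
frames = [f"frame_{i:04d}.jpg" for i in range(300)]
picked = pick_spread(frames, 15)
print(len(picked), picked[0], picked[-1])
```

Evenly spaced sampling covers the whole clip, so the annotated set sees the dog in many poses instead of 15 almost-identical consecutive frames.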

Step 4: Since we are employing YOLOv5 for the object detection model, the annotated images are split into three groups: train, valid, and test. We can refer to the link for setting up the YOLOv5 folder structure. The labels are downloaded from CVAT in the YOLO format.
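
One way to build that split is sketched below, assuming each image has a YOLO-format .txt label with the same filename stem. The 70/20/10 split ratio and all paths are assumptions for illustration, not values stated in the post.

```python
import random
import shutil
import tempfile
from pathlib import Path


def split_dataset(img_dir, label_dir, out_root, seed=0):
    """Copy image/label pairs into the layout YOLOv5 expects:
    out_root/{train,valid,test}/{images,labels}/... (70/20/10 split)."""
    images = sorted(Path(img_dir).glob("*.jpg"))
    random.Random(seed).shuffle(images)  # fixed seed for reproducibility
    n = len(images)
    cuts = {"train": images[: int(n * 0.7)],
            "valid": images[int(n * 0.7): int(n * 0.9)],
            "test":  images[int(n * 0.9):]}
    for split, imgs in cuts.items():
        for sub in ("images", "labels"):
            (Path(out_root) / split / sub).mkdir(parents=True, exist_ok=True)
        for img in imgs:
            shutil.copy(img, Path(out_root) / split / "images" / img.name)
            label = Path(label_dir) / (img.stem + ".txt")
            if label.exists():  # labels are optional for the test split
                shutil.copy(label, Path(out_root) / split / "labels" / label.name)
    return {k: len(v) for k, v in cuts.items()}


# Demo on a throwaway directory with 10 fake image/label pairs.
tmp = Path(tempfile.mkdtemp())
img_dir = tmp / "imgs"
lbl_dir = tmp / "lbls"
img_dir.mkdir()
lbl_dir.mkdir()
for i in range(10):
    (img_dir / f"f{i:02d}.jpg").write_bytes(b"")
    (lbl_dir / f"f{i:02d}.txt").write_text("0 0.5 0.5 0.2 0.2\n")

counts = split_dataset(img_dir, lbl_dir, tmp / "dataset")
print(counts)  # {'train': 7, 'valid': 2, 'test': 1}
```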

Step 5: Let’s start the model building by referring to the previous blog. The only update needed here is the number of classes given in the data.yaml file. Since our objective is to identify German Shepherds, we have only one target label, and the data.yaml file can be updated accordingly.
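
For a single-class setup, the data.yaml might look like the fragment below. The paths are placeholders; the nc/names entries are the change the post describes.

```yaml
train: ../train/images   # placeholder paths; adjust to your folder layout
val: ../valid/images

nc: 1                       # number of classes: German Shepherd only
names: ['German-Shepherd']  # the single target label
```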

Step 6: Once we train our model, the next step is to run detection on the test video/images. Predicting on videos is similar to images; we just need to place the video under test → images (the labels folder is basically not required for testing, so dropping it will not harm the process).
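
With the yolov5 repository checked out, the train and detect steps are typically invoked as below. The image size, batch size, epoch count, weights file, and video name are illustrative values, not the post’s exact settings.

```shell
# Train on the custom single-class dataset (example hyperparameters)
python train.py --img 640 --batch 16 --epochs 100 \
    --data data.yaml --weights yolov5s.pt

# Run detection on a test video; results land under runs/detect/exp
python detect.py --weights runs/train/exp/weights/best.pt \
    --source ../test/images/german_shepherd.mp4
```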

Step 7: When the prediction is completed, we can find the annotated video/image under runs → detect → exp.

Since the dataset used for training is only a small set of around 100 images, the accuracy level is moderate (with some glitches). As we expand the training samples, the predictive power of the model should improve.

We have now built a simple object detector that can identify German Shepherds. The same logic can be applied to other images/videos; the one key point is that the training images should be changed according to the need.
