Tracking Cows with Mask R-CNN and SORT

Deep learning is hot. There are lots of projects on the cutting edge of deep learning appearing every month, lots of research papers on deep learning coming out every week, and lots of very interesting models for all possible applications being developed and trained every day.

With this neverending stream of constant advances, it is common that when you are just about to start solving some computer vision (or natural language processing, or any other) problem, you naturally begin by googling possible solutions, and there is always a bunch of open repositories with ready-made models that promise to solve all your problems. If you are lucky enough, you will even find some pre-trained weights for these neural network models, and even maybe a handy API for them. Basically, for most problems you can usually find a model, or two, or a dozen, and all of them seem to work fine.

So if that is the case, what exactly are we doing here? Are deep learning experts just running existing ready-made models (when they are not doing state of the art research where these models actually come from)? Why the big bucks? Well, actually applying even ready-to-use products is not always easy. In this post, we will see a specific example of why it is hard, how to detect the problem, and what to do when you noticed it. We have written this post together with our St. Petersburg researcher Anastasia Gaydashenko, whom we have already presented in a previous post on segmentation; she has also prepared the models that we use in this post.

And we will be talking about cows.

Problem description

We begin with the problem. It is quite easy to formulate: we would like to learn to track objects from flying drones. We have already talked about very similar problems: object detection, segmentation, pose estimation, and so on. Tracking is basically object detection but for videos rather than still images. Performance is also very important because you probably want tracking to be done in real time: if you spend more time to process the video than to record it you cut off most possible applications that require raising alarms or round-the-clock tracking. And today, we will consider tracking with a slightly unusual but very interesting example.

We at Neuromation believe that artificial intelligence is the future of agriculture. We have written about it extensively, and we hope to bring the latest advances of computer vision and deep learning in general to agricultural applications. We are already applying computer vision computer vision models to help grow healthy pigs, by tracking them in the Piglet’s Big Brother project. So as the testing grounds for the models, we chose this beautiful video, Herding Cattle with a Drone by Trey Lyda for La Escalera Ranch:

We believe that adding high-quality real-time tracking from drones that herd cows opens up even more opportunities: maybe some cows didn’t pay attention to the drone and were left behind, maybe some of them got sick and can’t move as fast or at all… the first step to fixing these problems would be to detect them. And it appears that there are plenty of already developed solutions for tracking that should work for this problem. Let’s see how they do…

Simple Online and Realtime Tracking

The most popular and one of the simplest algorithms for tracking is SORT (Simple Online and Realtime Tracking). It can track multiple objects in real time but the algorithm merely associates already detected objects across different frames based on the coordinates of detection results, like this:

The idea is to use some off-the-shelf model for object detection (we already did a survey of those here) and then plug the results into the SORT algorithm that matches detected objects across frames.

This approach obviously yields a multi-purpose algorithm: SORT doesn’t need to know which type of object we track. It doesn’t even need to learn anything: to perform the associations SORT uses mathematical heuristics such as maximizing the IOU (intersection-over-union) metrics between bounding boxes in neighboring frames. Each box is labeled with a number (object id), and if there is no relevant box in the next frame, the algorithm assumes that the object has left the frame.

The quality of such an approach naturally depends a lot on the quality of the underlying object detection. The whole point of the original SORT paper was to show that object detection algorithms have advanced so much that you don’t have to do anything too fancy about tracking and can achieve state-of-the-art results with straightforward heuristics. Since then, improvements have appeared, in particular the next generation of the SORT algorithm, Deep SORT (deep learning is really fast: SORT came out in 2016, and Deep SORT already in 2017). It was designed especially to reduce the number of switchings between identities, ensuring that the tracking is more stable.

First results

To use SORT for tracking, we need to plug in some model for the detection step. In our case, it could be any object detection model pretrained to recognize cows. We used this open repository that includes a SORT implementation based on YOLO (actually, YOLOv2) detection model; it also has an implementation of Deep SORT.

Since YOLO is pretrained on the standard COCO dataset that has “cow” as one of its classes, we can simply launch the detection and tracking. The results are quite poor:

Note that we haven’t made any bad decisions along the way. Frankly, we haven’t really made any decisions at all: we are using a standard pretrained implementation of SORT with a standard YOLO model for object detection that usually works quite well. But the video clearly shows that the results are poor because of the first step, detection. In almost all frames the model does not detect any cows, only sometimes finding a couple of them. So we need to go deeper…

You Only Look Once

To understand the issue and decide how to deal with it, let’s take a closer look at the YOLO architecture.

The pipeline itself is pretty straightforward: unlike many popular detection models which perform detection on many region proposals (RoIs, region of interest), YOLO passes the image through the neural network only once (this is where the title comes from: You Only Look Once) and returns bounding boxes and class probabilities for predictions. Like this:

To do that, YOLO breaks up the image into a grid, and for each cell in the grid considers a number of possible bounding boxes; neural networks are used to estimate the confidence that each of those boxes contains an object and find class probabilities for this object:

The network architecture is pretty simple too; it contains 24 convolutional layers followed by two fully connected layers, reminiscent of AlexNet and even earlier convolutional architectures:

Since the original image is divided into cells, detection happens if the center of an object falls into a cell. But since each grid cell only predicts two boxes, the model struggles with small objects that appear in groups, such as a flock of birds… or a herd of cows (or is it a kine? a flink? it’s all pure rubbish, of course). It is even explicitly pointed out by the authors in the section on the limitations of their approach.

Okay, so by now we have tried a straightforward approach that seemed very plausible but utterly failed. Time to pivot.

Pivoting to a different model

As we have seen, even if you can find open repositories that seem tailor-made for your specific problem, the models you find, even if they are perfectly good models in general, may not be the best option for your particular problem.

To get the best performance (or some reasonable performance, at least), you usually have to try several different approaches. So as the next step, we changed the model to Mask R-CNN that we have talked about in detail in one of our previous posts. Due to a totally different approach to detection, it should be able to recognize cows better, and it really did:

The basic network that you can download from the repositories was also trained on the COCO dataset.

But to achieve even better results, we decided to get rid of all extra classes and train the model only on classes responsible for cows and sheep. We left sheep in because, first, we wanted to reproduce the results on sheep as well, and second, they look pretty similar from afar, so a similar but different class could be useful for the detection.

There is a pretty easy way to upload new training data for the model in the Mask R-CNN repository that we used. So we retrained the network to detect only these two classes. After that, all we needed to do was to embed the new detection model into the tracking algorithm. And here we go, the results are now much better:

We can again compare all three detection versions on a sample frame from the video.

YOLO did not detect anything:

Vanilla Mask R-CNN did much better but it’s still not something you can call a good result:

And our version of Mask R-CNN is better yet:

All the code for our experiments can be found here, in the Neuromation fork of the “Tracking with darkflow” repository.

As you can see, even almost without any new code, by fiddling with existing repositories you can often go from a completely unworkable model to a reasonably good one. Still, even in the last picture one can notice a few missing detections that really should be there, and the tracking based on this detector is also far from perfect yet. Our simple illustration ends here, but the real work of artificial intelligence experts only begins: now we have to push the model from “reasonable” to “truly useful”. And that’s a completely different flink of cows altogether…

Sergey Nikolenko
Chief Research Officer, Neuromation

Anastasia Gaydashenko
Junior Researcher, Neuromation